Stdout was a bad, bad boy

There's two statements you hear bandied about with Python 3:

Python uses Unicode for all strings!
Python uses UTF-8 as the default character encoding!

The first is true enough, at least internally. Python uses Unicode strings, coded in UCS-2 or UCS-4 (probably UCS-4 for you).

I discovered there's a gotcha with the second statement.

So here's what happened. I was working on a database app for a friend. He has a database with thousands of companies in there, and the companies are based in Thailand mostly and some are international. The database entries were compiled in Thailand, and guess what? Some of the entries used Thai script! Oh weird, funny squiggly characters everywhere! But nothing Unicode doesn't have a codepoint for, and nothing UTF-8 can't encode.

So after some epic coding, the database app is all working nicely and displaying records on the console. And then in a moment of madness (well actually because he wanted to access the database over the web) I got the thing working as a CGI web app, and the whole thing choked. And after a closer look it seemed to be choking on the Thai characters. Oh dear. I just knew those squiggly characters were going to be trouble.

So, a quick check of my content header:

header = '''
Content-type: text/html; charset=utf-8


<html>
<head><title>Query Results</title></head>
<body>
<table border='1'>
<thead>
<tr><th>ID</th><th>Company Name</th><th>Location</th><th>Categories</th><th>Services</th></tr>
</thead>
<tbody>
'''

Well, I definitely have a UTF-8 in there.

After an overdose of caffeine I realized that while Python was happily emitting a byte stream encoded in UTF-8, stdout itself was being opened in ascii mode by Apache.

There was only one thing for it. I was going to have to grab stdout by the scruff of the neck and tell it in no uncertain terms it was going to grok UTF-8 it had better like it!

I worked up a little output function to help out:

# Support Unicode characters in web stream
# Odd one - utf-8 does not appear to be selected for non-tty type output
# in some cases !!!
utf8stdout = open(1, 'w', encoding='utf-8', closefd=False)

def write_out (s):
    print (s, end='\r\n', file=utf8stdout)

So, now stdout gets opened as UTF-8 whether it likes it or not! Now, UCS-4 integers would already get encoded to UTF-8 on their way out from Python land to the big wide world (by default), but now stdout was configured explicitly to be compatible with them.

It all started working, stdout stopped choking. I breathed a sigh of relief. Those squiggly characters became cute and cuddly rather than the mark of the Devil.

All was good in the world, until I ran into another problem with stdout, but that's another story for another time.

Things to remember

Python 3 uses Unicode for strings (internally) in UCS-4.
Unicode strings needed to be encoded and decoded as they exit and enter Python.
stdout might not be opened with UTF-8 encoding (you might need to do it explicitly). For example, it is when working in a correctly configured terminal on Mac OS X, but may not be when opened by Apache.
You can explicitly specify the encoding when you open a file (file can be stdin, stdout, stderr).

Useful Notes

sys.getdefaultencoding() Return the name of the current default string encoding used by the Unicode implementation.

There's a difference between Python running as an interpreter, as a standalone application, as a CGI application on a web server.

sys.maxunicode An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).

Changed in version 3.3: Before PEP 393, sys.maxunicode used to be either 0xFFFF or 0x10FFFF, depending on the configuration option that specified whether Unicode characters were stored as UCS-2 or UCS-4.

stdin, stdout, stderr:

These streams are regular text files like those returned by the open() function. Their parameters are chosen as follows:

The character encoding is platform-dependent. Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page. Under other platforms, the locale encoding is used (see locale.getpreferredencoding()).

Under all platforms though, you can override this value by setting the PYTHONIOENCODING environment variable before starting Python.

>>> locale.getpreferredencoding()
'UTF-8'
>>> 

>>> import os
>>> os.isatty(1)
True
>>> 

>>> sys.stdout.isatty()
True
>>>

Here's a little bit of debugging code I put at the start of my app to figure out what was going on:

DEBUG_MODE = True

if DEBUG_MODE:
    cgitb.enable()

if DEBUG_MODE:
    if sys.stdout.encoding != 'UTF-8':
        error_str = '''
        <html>
        <body><h1>ERROR</h1></body>
        <h2>Encoding of stdout was not UTF-8 it was {}</h2>
        </html>
        '''.format(sys.stdout.encoding)
        print (error_str)
        exit()

Welcome to Tony's Notebook

Stdout was a bad, bad boy

Things to remember

Useful Notes