Welcome to Tony's Notebook

Python 3 and Unicode

And so having stuffed our brains with character sets and encodings we get down to the nitty gritty of how to use all this in Python 3.

Before you go any further you might find it useful to fire up a Python 3 interpreter by typing python3 in your terminal and pressing enter.

NOTE: You need to make sure your browser is capable of displaying Unicode characters to read this page correctly! 🙈

How to enter Unicode characters in Python

Here's how to declare a Unicode string in Python:

>>> s1 = "😀"

So how did we get that character in there? Well, you can use the Character Viewer on macOS or the equivalent tool on your system. You can also enter the character in a couple of other ways. The first is to use the character's long Unicode name with \N{...}. In the interpreter you could do:

>>> s = "\N{SMILING FACE WITH OPEN MOUTH AND COLD SWEAT}"
>>> t = "\N{PILE OF POO}"
>>> s
'😅'
>>> t
'💩'

The other way is to use Python's \u notation, which takes exactly four hex digits and so covers code points up to U+FFFF:

>>> x = "\u0394"
>>> print(x)
Δ

There's also \U, which takes exactly eight hex digits and can express any code point, including those above U+FFFF:

>>> y = "\U0001F49C"
>>> print(y)
💜
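If you want to check that these notations really name the same character, or find the long name for a character you already have, the standard-library unicodedata module can help. A small sketch (the variable names are only for illustration):

```python
import unicodedata

# Three spellings of the same one-character string.
by_name = "\N{GREEK CAPITAL LETTER DELTA}"
by_escape = "\u0394"
literal = "Δ"
print(by_name == by_escape == literal)   # True

# unicodedata.name() gives the long name to use inside \N{...}.
heart_name = unicodedata.name("\U0001F49C")
print(heart_name)                        # PURPLE HEART

# unicodedata.lookup() goes the other way, from name to character.
poo = unicodedata.lookup("PILE OF POO")
print(poo == "\U0001F4A9")               # True
```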

Unicode strings by default

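You'll notice Unicode strings are now the default in Python 3. A str is a sequence of Unicode code points, and it can hold any code point up to U+10FFFF. Older builds stored strings as fixed-width UCS-2 or UCS-4; since Python 3.3, CPython chooses 1, 2 or 4 bytes per character for each string depending on its widest character, so the UCS-4 layout is only used when a string needs it. You can confirm the range that is available:

>>> import sys
>>> print(sys.maxunicode)

If it prints out 1114111 (that is, 0x10FFFF) the full Unicode range is available, which is always the case from Python 3.3 onwards.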
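To see that a string is counted in code points rather than bytes, you could try something like the following sketch (the sample string is arbitrary):

```python
import sys

s = "a\u0394\U0001F49C"   # 'a', Greek capital delta, purple heart

# len() counts code points, not bytes.
print(len(s))             # 3

# ord() returns each character's code point as an integer.
code_points = [hex(ord(ch)) for ch in s]
print(code_points)        # ['0x61', '0x394', '0x1f49c']

# The largest code point a str can hold is U+10FFFF.
print(sys.maxunicode == 0x10FFFF)   # True on Python 3.3 and later
```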

Now what if you want to print Unicode characters out? You can just use print:

>>> print(s1)

This prints the smiling face character as expected. So far so good!

Encoding strings

Now, what if we want to write these strings out from Python's cozy world of code points to the harsh reality of file systems and networks?

First, check your default encoding:

>>> import sys
>>> print(sys.getdefaultencoding())

This will return 'utf-8', which is the encoding str.encode() and bytes.decode() use when you don't name one. In Python 3 this value is fixed, so if you don't get 'utf-8' your world is about to fall apart. UTF-8 makes sense as much of the real world now uses UTF-8.

Just to be sure (belt and braces) you can also check the encodings for standard files:

>>> import sys
>>> print(sys.stdin.encoding)
>>> print(sys.stdout.encoding)
>>> print(sys.stderr.encoding)

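On a terminal set up for UTF-8, which is the norm on macOS and Linux, these will all return 'utf-8'. The values come from your environment, so other setups may report something different.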
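There is one more default worth checking, because it is the one open() falls back to for text files. The sketch below prints the relevant values side by side; locale.getpreferredencoding(False) reports the locale's choice, and the results depend on your platform and settings, so no expected output is shown for it.

```python
import locale
import sys

# Used by str.encode() and bytes.decode() when no encoding is given.
default_codec = sys.getdefaultencoding()
print(default_codec)

# What open() uses in text mode when no encoding is given.
file_default = locale.getpreferredencoding(False)
print(file_default)

# Used to encode and decode file names.
print(sys.getfilesystemencoding())
```

If the second value is not a UTF-8 variant, passing encoding='utf-8' to open() explicitly avoids surprises.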

The thing to remember is that when you write out from Python to the real world, you need to encode your code points as bytes. With str.encode() that will be done using UTF-8 by default. When you read a byte stream back in with bytes.decode() it will likewise be assumed to be UTF-8, and if it's not you'll get an error or a mess.
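As a small illustration of that round trip, here is a sketch that calls str.encode() and bytes.decode() with no arguments (the sample text is arbitrary):

```python
text = "\u0394 and \U0001F49C"   # Greek delta, " and ", purple heart

# Encoding turns code points into bytes; UTF-8 is the default here.
data = text.encode()
print(data)                      # b'\xce\x94 and \xf0\x9f\x92\x9c'

# 7 code points become 11 bytes: 2 for delta, 5 ASCII, 4 for the heart.
print(len(text), len(data))      # 7 11

# Decoding with the same encoding gets the original string back.
round_trip = data.decode()
print(round_trip == text)        # True
```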

Files

When you are reading files you'll need to know the encoding that the contents of the file uses. For example, let's say you have a crusty old Windows file from a floppy disk using ISO-8859-1. To read that file correctly you would do this:

# The with block closes the file even if read() raises.
with open(filename, 'r', encoding='iso8859-1') as fin:
    text = fin.read()

Notice how the encoding argument specifies the inbound encoding.

Now let's say you wanted to write that out as UTF-8. You'd do this:

# Opening with 'w' replaces whatever the file contained before.
with open(filename, 'w', encoding='utf-8') as fout:
    fout.write(text)

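You can also leave the encoding out:

with open(filename, 'w') as fout:
    fout.write(text)

This only produces UTF-8 if your platform's preferred encoding is UTF-8, which is usual on macOS and Linux but historically not on Windows. Unless Python's UTF-8 mode is enabled, open() takes its default from the locale rather than from sys.getdefaultencoding(), so naming the encoding explicitly is the safer habit.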
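Putting the read and the write together, a conversion might look like the sketch below. To be runnable on its own it first creates a tiny ISO-8859-1 file in a temporary directory; the file names legacy.txt and converted.txt are made up for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "legacy.txt")      # hypothetical name
    dst = os.path.join(workdir, "converted.txt")   # hypothetical name

    # Sample input: "cafe" with an accented e stored as the single byte 0xE9.
    with open(src, "wb") as f:
        f.write(b"caf\xe9\n")

    # Read with the source encoding, write with the target encoding.
    # newline="" leaves line endings untouched on every platform.
    with open(src, "r", encoding="iso8859-1", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(fin.read())

    # Look at the raw bytes that ended up on disk.
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)   # b'caf\xc3\xa9\n'
```

Writing to a second file rather than over the original means a failed conversion does not cost you the source data.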

It's worth pointing out that it would not make much sense to read a UTF-8 file full of Unicode characters and then attempt to write it out as, say, ASCII. You'd get an error anyway, something along the lines of the following:

>>> z
'This is a 💩'
>>> z.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f4a9' in position 10: ordinal not in range(128)

Note from the error message that ASCII can only encode code points 0 to 127, which is what range(128) refers to.
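If you really do have to produce ASCII, the errors argument of str.encode() lets you choose what happens to characters that don't fit instead of raising. A short sketch of the standard handlers:

```python
z = "This is a \U0001F4A9"

# Substitute a question mark for each unencodable character.
replaced = z.encode("ascii", errors="replace")
print(replaced)    # b'This is a ?'

# Silently drop unencodable characters.
ignored = z.encode("ascii", errors="ignore")
print(ignored)     # b'This is a '

# Keep the information as a Python-style escape sequence.
escaped = z.encode("ascii", errors="backslashreplace")
print(escaped)     # b'This is a \\U0001f4a9'

# Keep the information as an XML/HTML numeric character reference.
xml_ref = z.encode("ascii", errors="xmlcharrefreplace")
print(xml_ref)     # b'This is a &#128169;'
```

The first two lose data for good, so they are best kept for cases where that is acceptable.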

Byte streams

If you are dealing with byte streams, you may need to encode or decode these as appropriate. For example, if we had a byte stream we knew was encoded with UTF-8, we could decode it as follows:

bs = b'This is a \xf0\x9f\x92\xa9'
z = bs.decode('utf-8')
print(z)

We can of course go the other way:

s = "This is a 💩"
bs = s.encode('utf-8') # bs is now an encoded byte stream

The snippet above takes a Python Unicode string and encodes it as a UTF-8 byte stream.

So, what happens when the wrong encoding is used? Take a look at the following Python interpreter session:

>>> x = b'\xf0\x9f\x98\x85'
>>> x
b'\xf0\x9f\x98\x85'
>>> print(x)
b'\xf0\x9f\x98\x85'
>>> print(x.decode('utf-8'))
😅
>>>
>>> print(x.decode('utf-16'))
鿰薘        # Not what we wanted!
>>>

You can see that given some byte stream, if we guessed its encoding as UTF-8 we'd get the correct character printed out. If, however, we assumed it was UTF-16, the same four bytes would be read as two 16-bit code units and we'd get two unrelated characters. Much confusion!
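The sketch below pokes at the same four bytes a little further. It uses 'utf-16-le' rather than plain 'utf-16' so the byte order is explicit, and ascii() so the output does not depend on what your terminal can display.

```python
data = "\U0001F605".encode("utf-8")   # b'\xf0\x9f\x98\x85'

# Read as little-endian UTF-16, the four bytes form two 16-bit units.
as_utf16 = [hex(ord(ch)) for ch in data.decode("utf-16-le")]
print(as_utf16)                   # ['0x9ff0', '0x8598']

# ISO-8859-1 maps every byte to a character, so decoding "succeeds"
# but yields four wrong characters instead of one emoji.
wrong = data.decode("iso8859-1")
print(len(wrong), ascii(wrong))   # 4 '\xf0\x9f\x98\x85'

# The damage can be undone if you know which wrong codec was used.
repaired = wrong.encode("iso8859-1").decode("utf-8")
print(repaired == "\U0001F605")   # True

# A stricter codec fails loudly instead of guessing.
try:
    data.decode("ascii")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)                     # ordinal not in range(128)
```

A loud failure is usually the better outcome, since a silent wrong decode can travel a long way before anyone notices.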

Summary

The takeaway here is that Unicode is much easier in Python 3. Unicode is the default for strings, a string is a sequence of code points covering the full Unicode range, and the default encoding for str.encode() and bytes.decode() is UTF-8. That is all very sensible. For files, open() follows your platform's preferred encoding unless you name one, so it pays to be explicit. Python provides tools for entering Unicode strings directly, and for encoding and decoding when working with byte streams directly, or when reading and writing files.

So that concludes this short series on character sets and encodings. I wish you much 💛 in your dealings with Unicode!

References