What are Character Encodings?

My purpose in this brief series of articles on character sets and encodings is to lay the groundwork for an upcoming series of articles.

In this article I follow on from our look at character sets and dive into the deep waters of character encoding. I won't look at code examples in this piece, because in the next article in the series I will be discussing Python 3 and Unicode.

We have already seen examples where the code point and the encoding are the same. In ASCII, for example, the SPC character is code point 32 and is also encoded as 32. The same idea applies to the other single-byte systems we saw, such as ANSI, Windows CP-1252, and ISO-8859-1. This works great for these rather limited character sets, but it won't work well for Unicode due to Unicode's vast number of code points.

UCS-4

As we saw, Unicode currently has around 137,994 characters. That's more than can be stored in a byte, or even two bytes. The obvious option is to go for four bytes. We can just store a Unicode code point (representing a character) as a 32-bit number. For example, 'A' could be stored as the following 4 bytes:

00 00 00 41

The problem is that our bandwidth gets clobbered if we send data out over the network like this. We are using 4 bytes for something that could be encoded in one byte using ASCII.

The precomposed character ñ would be stored as:

00 00 00 F1

ñ is officially known by Unicode as "LATIN SMALL LETTER N WITH TILDE". All code points have this type of textual description.

This is outside ASCII's range, but it still fits in a single byte - yet in UCS-4 we are using 4 bytes.

The emoticon 😅, aka "SMILING FACE WITH OPEN MOUTH AND COLD SWEAT", would be stored as:

00 01 F6 05

This works well and is simple. You have a fixed-width encoding (4 bytes per code point).
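
If you want to check these byte sequences for yourself, Python's built-in 'utf-32-be' codec produces exactly this big-endian, four-bytes-per-code-point layout. A quick sketch (the separator argument to bytes.hex assumes Python 3.8 or later):

>>> 'A'.encode('utf-32-be').hex(' ')
'00 00 00 41'
>>> 'ñ'.encode('utf-32-be').hex(' ')
'00 00 00 f1'
>>> '😅'.encode('utf-32-be').hex(' ')
'00 01 f6 05'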

As you will see in my next article, UCS-4 is what Python uses internally (at least on most modern systems). Most external systems don't use this encoding though - it's just too wasteful of bandwidth for content that is comprised of characters at the lower end of the Unicode table. The wastage matters less if the majority of the characters you encode are at the high end of the Unicode table.

UCS-2

This is also a fixed-width encoding scheme, this time with two bytes per code point. You can see what the problem is going to be - if you want to encode that cold sweat fella 😅 you have a problem, as its code point value (01 F6 05) needs three bytes. Yes, UCS-2 is only capable of encoding what's known as the Basic Multilingual Plane, or BMP. Oh crap - time for the 💩 emoticon.

So we can dismiss UCS-2 as obsolete? Not so fast there bucko! There's one reason why UCS-2 may be useful to know about. Some legacy systems used UCS-2, and some current systems have their roots in it. You should also be aware of some of the shortcomings of UCS-2.

UCS-2 made a lot of sense because you had the idea of one two-byte "character" per code point. Developers were used to that idea because each ASCII or extended ASCII code point corresponded to one one-byte character. Things seemed fluffy and cute because you only had to change your mindset from "one byte per character" to "two bytes per character". If only things were going to be that simple...

So developers set to work with the idea that characters were now two bytes instead of one. Applications for the early 32-bit versions of Windows (Windows NT), for example, often used UCS-2. Java's char type was likewise designed to be 16-bit.

There were a couple of things that then derailed this happy state of affairs...

The first was that there are code points in the BMP that are not a character per se, but more like a character part - an accent, for example. These combining sequences, where one visible character is made from two (or more) code points, could lead to issues in string handling code. If UCS-2 handling code naively expected "one character per code point" nirvana, it was going to break at some point.
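
To make this concrete, here is a small Python sketch showing the same visible character built two ways - as one precomposed code point, and as a base letter plus a combining mark:

>>> import unicodedata
>>> s1 = '\u00f1'        # precomposed LATIN SMALL LETTER N WITH TILDE
>>> s2 = 'n\u0303'       # 'n' followed by COMBINING TILDE
>>> print(s1, s2)
ñ ñ
>>> len(s1), len(s2)     # same character on screen, different code point counts
(1, 2)
>>> unicodedata.normalize('NFC', s2) == s1    # NFC normalization composes them
True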

Then the Unicode Consortium came back and said - sorry chaps, 16-bits ain't going to be enough after all. Yeah, that got you...

The solution to this now painful state of affairs was the introduction of UTF-16 and the phasing out of UCS-2.

Side note: UCS is Universal Character Set and is, in fact, a character set! UCS-2 and UCS-4 are both encodings, although these are fixed-width encodings that do not involve transformation. Confusingly, Unicode is sometimes also known as the Universal Coded Character Set, or UCS.

UTF-16

When it was realized that UCS-2 was not going to work out, UTF-16 was invented, around 1996. UTF stands for Unicode Transformation Format. UTF-16 is a variable-width encoding scheme: unlike UCS-2 (always two bytes) and UCS-4 (always four bytes), it does not have a fixed width. UTF-16 does have the advantage that it can encode all valid Unicode code points.

UTF-16 is itself something of a hack. It uses a single two-byte code unit for code points in the BMP, and is in fact identical to UCS-2 there (which was one of the design aims); each code unit is numerically equal to the code point. So far, so good. However, to reach code points beyond the BMP it uses an additional two bytes, forming what's known as a surrogate pair. There's a tricky algorithm that is then used to access the higher planes of Unicode (sometimes jokingly referred to as the astral planes).
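
That tricky algorithm is short enough to sketch in Python. This is the standard UTF-16 surrogate arithmetic applied to our cold sweat friend (U+1F605):

>>> cp = 0x1F605                       # SMILING FACE WITH OPEN MOUTH AND COLD SWEAT
>>> v = cp - 0x10000                   # subtract the BMP offset, leaving a 20-bit value
>>> high = 0xD800 + (v >> 10)          # top 10 bits -> high (lead) surrogate
>>> low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits -> low (trail) surrogate
>>> hex(high), hex(low)
('0xd83d', '0xde05')
>>> '😅'.encode('utf-16-be').hex(' ')  # Python's codec agrees
'd8 3d de 05'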

These surrogate pair shenanigans are the reason why, despite the 32-bit goodness of UCS-4, there's an upper limit to Unicode code points of 0x10FFFF. A surrogate pair only carries 20 bits of payload, so UTF-16 simply has no way to represent anything above that number.

NOTE: Because most of the commonly used characters are in the BMP, the surrogate pair handling code in a lot of applications is not always fully tested, and is often a source of bugs and security issues.

UTF-16 is used internally by systems such as Java, JavaScript and Windows. This is not ideal, as much of the rest of the world now uses UTF-8, which means conversions are needed wherever the two meet.
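
One visible consequence: languages that expose UTF-16 code units (JavaScript's string length, Java's String.length()) count an astral plane character as two. You can reproduce the same arithmetic from Python:

>>> len('😅')                            # Python counts code points
1
>>> len('😅'.encode('utf-16-le')) // 2   # UTF-16 code units (what JavaScript's .length counts)
2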

Old versions of Python (the 2.x series, from around 2000) used UTF-16 internally for dealing with Unicode. Python now uses UCS-4 internally (at least on most platforms; on some it might use UCS-2). Most web pages tend not to use UTF-16, and use UTF-8 instead.

UTF-8

I will go out on a limb here and say UTF-8 is the most important encoding to know today.

UTF-8 is a variable-width, multi-byte scheme for encoding Unicode. The idea is that you encode a code point with however many bytes that code point needs. For example, you can encode all of the ASCII code points (the first part of the Unicode table) using one byte of UTF-8. ASCII only requires 7 bits, and the most significant bit of that byte is reserved by UTF-8 to indicate whether more bytes follow for this particular character.

Other code points require more bytes to encode in UTF-8. For example, "SMILING FACE WITH OPEN MOUTH AND COLD SWEAT" is encoded as f0 9f 98 85 - a total of 4 bytes. Four bytes is also the maximum: a 4-byte UTF-8 sequence carries 21 bits of payload, which is enough because Unicode itself stops at U+10FFFF (the end of Plane 16), a limit chosen to stay compatible with what UTF-16's surrogate scheme can represent.
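
You can watch the width vary per character in Python (again, bytes.hex with a separator needs Python 3.8 or later):

>>> for ch in ('A', 'ñ', '😅'):
...     b = ch.encode('utf-8')
...     print(repr(ch), '->', b.hex(' '), '-', len(b), 'bytes')
...
'A' -> 41 - 1 bytes
'ñ' -> c3 b1 - 2 bytes
'😅' -> f0 9f 98 85 - 4 bytes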

Generally UTF-8 is a sensible default. You can't go too far wrong if you think Unicode character set plus UTF-8 encoding. I tend to use UTF-8 as my default encoding in everything, as do many tools and systems out there. UTF-8 is the default encoding for Python when interfacing with the outside world (remember, UCS-4 is only used internally by Python).

The only reason you might need to dip into the murky waters of other encodings is where you frankly don't have a choice - you really need to read that old DOS text file into your Python program, for example. Other legacy encodings will also need decoding to work in modern Python 3 programs. (More on that in my upcoming article on Python and Unicode.)
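
As a taste of what that looks like, here is a sketch decoding a byte from code page 437 (the old IBM PC / DOS character set). The byte value is just one I have picked for illustration:

>>> data = b'\x82'            # 'é' in code page 437
>>> data.decode('cp437')
'é'
>>> data.decode('utf-8')      # the same byte is not valid UTF-8
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte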

When things don't look right

When a document displays characters that don't appear to make sense - for example you have question marks in odd places, blocks where there should be a legible character, and so on - it's probably an encoding issue. You probably have your browser (or editor, terminal, mail client etc.) set to one encoding, and the document you are trying to read or display is in another encoding. By way of example take a look at this:

>>> x = b'\xf0\x9f\x98\x85'
>>> print(x)
b'\xf0\x9f\x98\x85'
>>> print(x.decode('utf-8'))
😅
>>> 
>>> print(x.decode('utf-16'))
鿰薘        # Not what we wanted!
>>> 

You can see here we start with a byte stream, which you know from earlier is the UTF-8 encoding of the "SMILING FACE WITH OPEN MOUTH AND COLD SWEAT" code point. If we decode that byte stream with UTF-8 we get the correct character. If however our browser, or terminal, was set to decode information as UTF-16 we would get 鿰薘, which is not right!

Remember that whenever you are dealing with text or other series of "characters" from the real world, it does not make sense to process that data assuming it's all ASCII. You need to know what the inbound data encoding is, and when data is outbound you need to make sure you are encoding it in a way that can be sensibly processed by the receiver.
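
In Python terms, that boundary discipline is just explicit encode and decode calls:

>>> text = 'naïve café'
>>> wire = text.encode('utf-8')    # outbound: str -> bytes
>>> wire
b'na\xc3\xafve caf\xc3\xa9'
>>> wire.decode('utf-8')           # inbound: bytes -> str, using the matching encoding
'naïve café'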

Configuring encodings

HTML

In HTML you would normally set the content type to match how your HTML is actually encoded. If you saved the HTML file as UTF-8, then set the content type to UTF-8:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
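
If you are writing HTML5, there is also a shorter equivalent form that does the same job:

<meta charset="utf-8">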

Email

You will probably set your encoding in your email client for outbound email, although the default is likely to be UTF-8. For UTF-8, the mail header would contain something like:

Content-Type: text/plain; charset="UTF-8"

The email client will check the content type of inbound mail to make sure the correct decoding is used.
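
If you build mail programmatically, Python's standard email package produces the same header. A minimal sketch (the EmailMessage API is Python 3.6+):

>>> from email.message import EmailMessage
>>> msg = EmailMessage()
>>> msg.set_content('Olé!')        # non-ASCII body; charset defaults to UTF-8
>>> print(msg['Content-Type'])
text/plain; charset="utf-8"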

Terminals

In your terminal app you should be able to set the preferred encoding. I have this set to UTF-8.

In Terminal (the macOS terminal app) I select a profile, go to Advanced, and then set the text encoding to "Unicode (UTF-8)".

This will allow your terminal to display the full range of Unicode code points (as long as your system has appropriate fonts to render the glyphs).
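
A quick way to see what encoding Python believes your terminal is using:

>>> import sys
>>> sys.stdout.encoding    # the exact value depends on your locale/terminal settings
'utf-8'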

Editor

You should set your preferred encoding in your editor. Usually it will be UTF-8 by default. In VS Code, for example, the current encoding is shown at the bottom right of the status bar.

Emacs displays the current encoding in the mode line. You can reload the file with whatever encoding you want.

Summary

In this article I have gone through the very basics of character encodings. This should provide the groundwork for future articles that involve knowing a bit about character sets and encodings. In the next article in this series I will take a look at the specifics of Python and Unicode.

NOTE: This is a complex area and I'm sure I've made a few mistakes somewhere along the way. You can always contact me to let me know...