Welcome to Tony's Notebook

What are Character Sets?

This is the first in a short series on character sets and character encodings. I'm rewriting this as a precursor to some other pieces I've got coming up. As the planned pieces need a basic understanding of character sets and encodings, I thought I should cover those first. This first article looks at character sets.

Back in the mists of time...

Back in the mists of time there was ASCII, and ASCII was all that mattered. It was originally designed around ideas from telegraph codes.

If you are blanking on "telegraph code", it means a code used to send information over a telegraph system - yes, those things they have in Westerns, where some poor unfortunate sits in a cubicle at the railway station and receives the message that the gunslingers are coming into town and everyone had better hide. It was convenient to run telegraph lines alongside the railway lines, and so the telegraph office was often situated in the town's railway station. High Noon and all that...

For a while the main telegraph code was Morse code, in which characters are encoded as a series of dots and dashes. There were other telegraph codes besides Morse code - there was even a telegraph code for Chinese characters!

Anyway, with the very early computers of the 1960s it was realized that Morse code wasn't going to cut it and so ASCII was developed. But what exactly is ASCII?

At its most basic, ASCII is a table of characters. The table starts at 0 and goes up to 127, giving 128 entries. For example, the space character (SPC) is at table position 32, A is at 65, and z is at 122.

The so-called "printable characters" are in the range 32 to 126 decimal. There are also a bunch of "non-printable" control characters - things like BEL (Bell) at position 7, BS (Backspace) at position 8, and DEL at position 127.
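
You can poke at the ASCII table directly from Python, which I'll use for little sketches throughout this series. Python's built-in ord() and chr() map characters to and from their code points, which for ASCII-range characters match the ASCII table positions:

# Map characters to their ASCII table positions and back.
print(ord(' '))   # 32 (SPC)
print(ord('A'))   # 65
print(ord('z'))   # 122
print(chr(65))    # 'A'
print(chr(7))     # the BEL control character (may beep, or print nothing visible)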

The thing about the printable characters in the ASCII character set was that they were all English characters. The rest of the world didn't exist - at least as far as ASCII was concerned. England also kind of didn't exist - there is no £ in ASCII. American Standard Code for Information Interchange - the clue is in the name. $ is there of course, at position 36. There are a total of 128 characters in ASCII (0 to 127 decimal), so only 7 bits are required to encode each character.

ASCII is also slightly strange in that it is both a character set and an encoding. A character set lays out a table that maps numbers (more formally, code points) to characters. An encoding says how each character is actually stored or transmitted. In the case of ASCII, the position in the table is also how the character is encoded. For example, position (code point) 37 decimal in the table (character set) is the % character, but % is also stored and transmitted (encoded) as 37. In other words, the code point and the stored representation of the character are both 37, or 0100101 in binary. I will go into more detail on encodings in the next article.
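
Here is a minimal Python sketch of that "table position equals stored byte" property, using the standard str.encode():

# In ASCII the code point and the encoded byte are the same number.
ch = '%'
print(ord(ch))                 # 37 -- the code point (table position)
print(ch.encode('ascii'))      # b'%' -- a single byte...
print(ch.encode('ascii')[0])   # ...whose value is also 37
print(format(37, '07b'))       # '0100101' -- the same value in 7 bits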

ANSI

Things soon got more complicated, especially with the advent of the IBM PC (and DOS) and a growing number of computer users for whom English was not their first language. There were people who used, shock horror, accented characters, and even squiggly characters!

While ASCII was a concrete standard at this point, it only covered the first 7 bits of a byte. This meant the most significant bit of a byte was available to extend the table of characters (character set) from 128 entries (0 to 127) to 256 (0 to 255). In other words, the range 128 to 255 could be used for some of those accented and squiggly characters, plus a few other oddities like box-drawing characters. The box-drawing characters were used in DOS-era applications (before Windows) to create dialogs - you could build dialogs with different frame types and buttons, and even simulate shadows.

The table entries from 128 to 255 became something of a "wild west" for characters. While 0 to 127 was the ASCII standard, 128 to 255 was a free-for-all, with PC OEMs adopting different entries according to their markets. ANSI was an attempt to bring some order to the chaos of the 128 to 255 zone. The key idea was that of code pages, where you could switch out the 128 to 255 area with different sets of characters, depending on market. For example, if you were targeting the Russian market you could switch in a code page that supported Russian characters. Each code page had a number: the Greek code page was Microsoft OEM DOS CP 737, and Hebrew was Microsoft DOS CP 862. There were also IBM code pages for Japanese, Korean, and a very limited set of Chinese characters - burgeoning markets for the IBM PC and its DOS-based clones (DOS being the main OS on PCs at the time).

MS-DOS even had a command for selecting the code page. For example, chcp 850 would select the CP-850 code page for all devices in the PC that supported it.
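
You can still see the free-for-all directly from Python, which ships codecs for many of these legacy code pages. This little sketch decodes the same high byte under a few different code pages - I've picked the byte 0xE1 arbitrarily; the exact characters printed simply depend on each code page's table:

# One byte value, several meanings: it all depends on the code page.
b = bytes([0xE1])
for codepage in ('cp437', 'cp737', 'cp850', 'cp862'):
    print(codepage, b.decode(codepage))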

ANSI, while collecting together these assorted character sets, was still a one-byte-per-code-point system.

And then Windows came along

Windows 1.0 rocked up on the scene in 1985, and brought a whole new bunch of code pages with it. One of the most well known (at least here in the West) was Windows CP-1252, which was still an 8-bit character set. CP-1252 went on to become one of the most popular 8-bit character sets in the world (and it still is). Windows CP-1252 is sometimes referred to as Windows Latin 1.

ISO-8859-1

Another popular character set still found in the wild is ISO-8859-1 and family. This is still a single-byte character set, with a single-byte encoding. It was also the default character set for documents (typically web pages) delivered via HTTP with a text/ MIME type; in HTML5, content labelled as ISO-8859-1 is now treated as Windows-1252. ISO-8859-1 is also known as Latin-1. There are other character sets in the family, sequentially numbered up to ISO-8859-16. Bizarrely, ISO-8859-15 is also known as Latin-9 and sometimes Latin-0! Confused? The main thing to remember is that information coded in ISO-8859-1 is out there and needs to be handled from time to time, and that for web standards it has effectively been superseded by Windows-1252.
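
The practical difference between the two shows up in the 0x80 to 0x9F range, which ISO-8859-1 reserves for control characters but Windows-1252 fills with printable characters (the Euro sign among them). A quick Python sketch:

# 0x80 is the Euro sign in Windows-1252, but a control character in ISO-8859-1.
b = bytes([0x80])
print(b.decode('cp1252'))         # '€'
print(repr(b.decode('latin-1')))  # '\x80' -- an invisible C1 control character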

Unicode - Beyond the Byte

The main problem with the systems so far is that they only allow for a character set of 256 characters (one byte). The limited number of characters was tackled with code pages, switching in the required character set to extend ASCII, as we already saw. It's still limited though. Take Chinese, for example: the official list of simplified Chinese characters alone contains 2,235 characters, and tens of thousands of Chinese characters exist overall. Oops - hello code page mayhem!

Obviously, things had to go beyond the limitations of ASCII, ANSI and the byte.

Enter Unicode...

Unicode is a standard that creates a character set table for (at least) every character on the planet. The code space is huge: code points run from 0 to 0x10FFFF, giving 1,114,112 possible code points. A 32-bit number can hold any of those comfortably - in fact 32 bits could index a table of up to 4,294,967,296 code points (0 to 4,294,967,295 decimal). Which is - a lot of characters!

The Unicode character set even has room for invented scripts: Klingon and Elvish (Tengwar) have both been proposed, although neither has been accepted into the standard. Unicode also includes all sorts of dingbats, emoji and whatnots, including the infamous Pile of Poo emoji (U+1F4A9). Cute little chap. The most recent version of Unicode, 12.1, includes 137,929 characters. There's plenty of room in those 32 bits we were talking about.

16 bits can store 65,536 code points (0 to 65,535 decimal) - not enough to represent every code point, but it covers most of the commonly used ones (an area of the Unicode table known as the Basic Multilingual Plane).
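
In Python 3, strings are sequences of Unicode code points, so you can poke at this directly. U+1F4A9 is the Pile of Poo emoji mentioned above, and it lies beyond the 16-bit range:

# Code points beyond 16 bits exist and are perfectly ordinary characters.
print(ord('é'))        # 233 -- fits in one byte
print(ord('€'))        # 8364 -- needs 16 bits
poo = chr(0x1F4A9)     # Pile of Poo -- needs more than 16 bits
print(poo, hex(ord(poo)))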

NOTE: Unicode is kept in sync with the equivalent ISO standard, ISO/IEC 10646.

Unicode complexity

Unicode is an extremely complex character set, with planes and blocks and standardized subsets. For example, the first 128 entries (0 to 0x7F) of the Unicode character set form the standardized subset known as Basic Latin, and correspond exactly to ASCII. The range 0x80 to 0xFF is known as the Latin-1 Supplement. I'm not going to cover these additional complexities in this piece, but you should at least be aware of them.
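
That correspondence with ASCII is easy to check in Python - this one-liner just confirms that the first 128 Unicode code points encode to the same single bytes as ASCII:

# The Basic Latin block (U+0000 to U+007F) is ASCII, code point for code point.
print(all(chr(i).encode('ascii')[0] == i for i in range(128)))  # True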

Unicode also has what you can think of as ready-made characters, so-called precomposed characters. For example, there is a single code point, U+00E9, for e-acute (é, as used in French). Precomposed characters are provided in Unicode mainly for backwards compatibility with older character sets.

Unicode also supports so-called decomposed characters. Here you combine code points: for example, an e (U+0065) followed by a combining acute accent (U+0301) together form an e-acute.

This provides immense flexibility and efficiency, especially for dealing with complex writing systems. Rather than needing a large number of precomposed characters (and associated font glyphs), you can build characters from simpler pieces. For example, you could create all the accented French 'e' characters from a single 'e' code point combined with the accent code points (grave, acute, circumflex, etc.) as required, and do the same for 'a'. This drastically reduces the number of code points and other resources required.
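
Python's standard unicodedata module can convert between the two forms, which makes for a nice demonstration. NFC is the precomposed "normal form", NFD the decomposed one:

import unicodedata

# Precomposed: one code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
precomposed = '\u00e9'
# Decomposed: 'e' (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
decomposed = unicodedata.normalize('NFD', precomposed)
print(len(precomposed), len(decomposed))              # 1 2
print([hex(ord(c)) for c in decomposed])              # ['0x65', '0x301']
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True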

For this reason, and because "characters" might be things like emoji or combining accents, Unicode is careful not to talk about code points mapping to "characters" in the everyday sense. Formally, each code point is associated with an abstract character, which might not be a character in the way we usually think of it.

Even though formal Unicode parlance associates a code point with an abstract character, I do sometimes simply use the term character. Mostly this is when I have a specific renderable character in mind, such as 'A'.

How are Unicode code points represented in the real world?

If we had a Unicode string such as "HELLO", it might look like this as a series of Unicode code points:

U+0048 U+0045 U+004C U+004C U+004F

Note that each "character" here is specified using the Unicode U+ code point notation.

What would this look like in memory or on disk? A guess might be:

00480045004C004C004F

But what if we stored the bytes the other way round (little-endian)? What if we used four bytes per character? Also, in our representation above there's a total of 10 bytes (two bytes per character), while "HELLO" in ASCII is five bytes - so there's some wastage of memory/space/bandwidth there.

We are now into the topic of how strings of characters are actually stored in memory and on disk, or transmitted over the Internet, and there are various encodings that can be used.

The sample representation above is the UCS-2 encoding, which specifies two bytes per code point, stored so that the number matches the code point. I will get into the weeds on character encodings in the next article.
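
You can reproduce the guess above in Python. Python has no 'ucs-2' codec as such, but for characters in the 16-bit range UTF-16 is byte-for-byte identical to UCS-2, so the big-endian 'utf-16-be' codec (no byte order mark) stands in for it here:

s = 'HELLO'
# The code points, in U+XXXX notation.
print(' '.join(f'U+{ord(c):04X}' for c in s))  # U+0048 U+0045 U+004C U+004C U+004F
# Two bytes per code point, most significant byte first.
print(s.encode('utf-16-be').hex().upper())     # 00480045004C004C004F
# Other encodings give other byte counts: ASCII is 5 bytes, UTF-32 is 20.
print(len(s.encode('ascii')), len(s.encode('utf-32-be')))  # 5 20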

Summary

So at this stage we hopefully have some idea of what a character set is, and that there are various character sets out there, Unicode being the most important one in use today. Many of the older character sets have been assimilated into Unicode as subsets - for example, the first 128 entries in Unicode are the same as ASCII.

In the next article I will look at character encodings - that is, how character strings are actually stored and transmitted in the real world, as opposed to being abstract code points in a character set.