An introduction to the UTF encodings
The UTF encodings are a collection of formats used to represent Unicode characters. I will describe their design, their purpose, and their pros and cons. But first we have to go a little further back in time, to before UTF.
ASCII dates back to 1960, and is the oldest text encoding still in common use. It is extremely simple, consisting only of single bytes with values between 0 and 127, each value mapping to a certain character. For example (in hexadecimal), 0x69 is 'i', 0x6E is 'n', and 0x6F is 'o'. These numbers are entirely arbitrary; the important thing is that there is a known mapping. These are then encoded like so:
"onion" = 01101111 01101110 01101001 01101111 01101110
          [ 0x6F ]  [ 0x6E ]  [ 0x69 ]  [ 0x6F ]  [ 0x6E ]
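We can check this layout directly. A minimal sketch in Python 3, using the standard `ascii` codec:

```python
# ASCII is one byte per character, and the byte values
# are exactly the numbers in the mapping.
encoded = "onion".encode("ascii")

assert encoded == bytes([0x6F, 0x6E, 0x69, 0x6F, 0x6E])
assert encoded.decode("ascii") == "onion"
```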
The problem with this is that only 128 values can be represented in this way, as in ASCII the leftmost bit is never used. What if we wanted to write the Norwegian (dialectal) word ‘ærå’, meaning ‘the honour’? The letters ‘æ’ and ‘å’ aren’t in ASCII, because there isn’t space.
ISO 8859 / Latin-1 to Latin-16
The first solution to this problem was to simply use the leftmost bit of each byte to encode another 128 values. Now the mapping went from 0 to 255. Let’s be honest, there are far more than 256 characters in all the world’s languages — so people used the same encoding format (again, single bytes) to represent a number of different mappings, called code pages. The most common of these, Latin-1, would encode ‘ærå’ as follows:
"ærå" = 11100110 01110010 11100101
        [ 0xE6 ]  [ 0x72 ]  [ 0xE5 ]
Unless we already know the specific mapping being used, the numbers are completely useless. Depending on the mapping used, the above string could represent ‘ærå’ or ‘ćrĺ’ or ‘ĉrċ’ or ‘цrх’ or ‘نrم’ or ‘ζrε’ or ‘זrו’ or ‘ๆrๅ’ or ‘ęrå’ or ‘ærć’. As one can imagine, this caused a lot of problems. If you’ve ever seen a web browser use the wrong symbols anywhere, ISO 8859 was almost certainly involved.
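The ambiguity is easy to demonstrate: the same three bytes decode to different strings under different code pages. A Python sketch (the codec names are Python's spellings of the ISO 8859 parts):

```python
# One byte sequence, three readings, depending on the code page.
data = bytes([0xE6, 0x72, 0xE5])

assert data.decode("iso8859-1") == "ærå"   # Latin-1 (Western European)
assert data.decode("iso8859-2") == "ćrĺ"   # Latin-2 (Central European)
assert data.decode("iso8859-7") == "ζrε"   # Greek
```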
Unicode / UCS-2
Learning from the mistakes of ISO 8859, we realised a single mapping should be used to represent all characters, rather than using code pages. So we created a single mapping that would become known as Unicode. At the time the mapping contained 65,536 characters, which could be represented in two bytes, so a simple two-byte encoding called UCS-2 was developed. (ASCII or Latin-1 could be thought of as ‘UCS-1’, although there isn’t really such a thing.) In UCS-2 a sequence of codes can be represented in one of two ways:
"아침글" = 11000101 01000100 11001110 01101000 10101110 00000000 - big-endian
"아침글" = 01000100 11000101 01101000 11001110 00000000 10101110 - little-endian
           [ 0xC544 ]        [ 0xCE68 ]        [ 0xAE00 ]
The representation depends on which byte order is used: whether the first byte in any given two-byte number is the most significant (big-endian) or the least significant (little-endian). We can identify the byte order by prepending a byte order mark (BOM), a number that is represented differently depending on the byte order and so can be identified: 0xFEFF. (0xFFFE is not a valid character, which is what makes the mark unambiguous.)
"ærå" = 11111110 11111111 00000000 11100110 00000000 01110010 00000000 11100101 - big-endian
"ærå" = 11111111 11111110 11100110 00000000 01110010 00000000 11100101 00000000 - little-endian
        [ 0xFEFF ]        [ 0x00E6 ]        [ 0x0072 ]        [ 0x00E5 ]
For simplicity we will assume everything is big-endian from now on.
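Python's standard codecs expose both byte orders, which makes the layout above easy to verify. A sketch, assuming Python 3:

```python
# Explicit byte orders, no BOM:
text = "ærå"
assert text.encode("utf-16-be") == bytes([0x00, 0xE6, 0x00, 0x72, 0x00, 0xE5])
assert text.encode("utf-16-le") == bytes([0xE6, 0x00, 0x72, 0x00, 0xE5, 0x00])

# The generic "utf-16" codec prepends a BOM so decoders
# can work out which order was used.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (b"\xfe\xff", b"\xff\xfe")
```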
UCS-4 / UTF-32
The Unicode Consortium later decided they needed even more space for characters, and enlarged the mapping from 65,536 codes to 1,114,112. This meant UCS-2 could now only represent a fraction of the necessary values. As a result, UCS-4 appeared, later also called UTF-32. Again, the format was very simple: a series of four byte numbers.
00000000 00000000 11111110 11111111 00000000 00000001 11110100 10101001
[ 0x0000FEFF ]                      [ 0x0001F4A9 ]
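The same fixed four-byte layout can be checked with Python's standard codecs (a sketch, assuming Python 3):

```python
# UTF-32 big-endian: every code point is exactly four bytes.
pile_of_poo = "\U0001F4A9"   # U+1F4A9

assert pile_of_poo.encode("utf-32-be") == bytes([0x00, 0x01, 0xF4, 0xA9])

# The generic "utf-32" codec writes a four-byte BOM first,
# so one character costs eight bytes in total.
assert len(pile_of_poo.encode("utf-32")) == 8
```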
The trouble is, just as UCS-2 was incompatible with ASCII, UCS-4 is incompatible with UCS-2, and UCS-2 had already been adopted by a lot of large systems, like Windows and Java. So people wanted an alternative to UCS-4 that was backwards-compatible with UCS-2.
UTF-16
UTF-16 is a modified UCS-2 with a ‘trapdoor’ mechanism built in to allow for larger numbers, using surrogate pairs. A surrogate pair is a pair of numbers, each two bytes long, that together represent a single larger number. To produce a surrogate pair for a number between 0x10000 (the limit of UCS-2) and 0x10FFFF (the limit of Unicode), we subtract 0x10000 and then ‘pack’ the resulting 20 bits into a mask:
aabbbbbbbbccdddddddd = 110110aa bbbbbbbb 110111cc dddddddd
For example, 0x1F4A9 becomes 0xF4A9, and is packed like so:
00001111010010101001 = 11011000 00111101 11011100 10101001
     [ 0xF4A9 ]        [ 0xD83D ]        [ 0xDCA9 ]
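The packing step can be written out as a small function. A Python sketch (the function name is mine, not a standard API):

```python
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000..0x10FFFF into a UTF-16 surrogate pair."""
    v = code_point - 0x10000          # now a 20-bit number
    high = 0xD800 | (v >> 10)         # 110110aa bbbbbbbb: top 10 bits
    low = 0xDC00 | (v & 0x3FF)        # 110111cc dddddddd: bottom 10 bits
    return high, low

assert to_surrogate_pair(0x1F4A9) == (0xD83D, 0xDCA9)
```

The two masks, 0xD800 and 0xDC00, are exactly the `110110` and `110111` prefixes from the diagram above with zeros in the payload bits.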
A lot of programmers believe UTF-16 to be a fixed-length format, in that two bytes always equal one character (like UCS-2 was), but that is false: for fixed-length you need UTF-32. A lot of programmers also think that UTF-16 must be superior to other formats because it’s used by the likes of Windows and Java, but that’s only for compatibility with UCS-2.
UTF-8
UTF-32 is prohibitively large for normal text, being four times the size of ASCII characters, which make up the bulk of average English text. UTF-16, too, is twice as large as ASCII, and has lost the benefit of UCS-2 being fixed-length. However, an alternative encoding, UTF-8, works in a similar way to UTF-16’s ‘trapdoor’ mechanism, but is also compatible with ASCII and is, for English text, no larger on average. Numbers are packed into a mask depending on their size:
aaaaaaa               = 0aaaaaaa
aaaaabbbbbb           = 110aaaaa 10bbbbbb
aaaabbbbbbcccccc      = 1110aaaa 10bbbbbb 10cccccc
aaabbbbbbccccccdddddd = 11110aaa 10bbbbbb 10cccccc 10dddddd
For example, 0x1F4A9 is packed like so:
000011111010010101001 = 11110000 10011111 10010010 10101001
     [ 0x1F4A9 ]        [ 0xF0 ]  [ 0x9F ]  [ 0x92 ]  [ 0xA9 ]
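The four masks translate directly into code. A Python sketch of the packing rules (the function name is mine; Python's built-in `str.encode("utf-8")` does the same job):

```python
def utf8_encode(cp):
    """Pack a single code point into UTF-8 bytes, per the masks above."""
    if cp < 0x80:                                  # 0aaaaaaa
        return bytes([cp])
    if cp < 0x800:                                 # 110aaaaa 10bbbbbb
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                               # 1110aaaa 10bbbbbb 10cccccc
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,                 # 11110aaa 10bbbbbb 10cccccc 10dddddd
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

assert utf8_encode(0x1F4A9) == b"\xf0\x9f\x92\xa9"
assert utf8_encode(0x1F4A9) == "\U0001F4A9".encode("utf-8")
```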
‘ærå’ is therefore encoded:
"ærå" = 11000011 10100110 01110010 11000011 10100101
        [ 0xC3 ]  [ 0xA6 ]  [ 0x72 ]  [ 0xC3 ]  [ 0xA5 ]
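A quick check of both properties in Python: the non-ASCII letters take two bytes each, while plain ASCII passes through unchanged:

```python
encoded = "ærå".encode("utf-8")
assert encoded == bytes([0xC3, 0xA6, 0x72, 0xC3, 0xA5])

# ASCII text is byte-for-byte identical in UTF-8.
assert "onion".encode("utf-8") == "onion".encode("ascii")
```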
There are also no byte order concerns. All things considered, UTF-8 is definitely the superior string encoding. You should use it everywhere.