UTF-Sets

A while ago I came up with a simple new data structure I call a UTF-Set. This data structure isn't exactly revolutionary, but it is pretty neat, and it can reveal some interesting characteristics of the UTF-8 encoding. The example implementation I've written is a mere 52 lines of standard C.

Let's say we have a stream of UTF-8 text, and we want to print a list of all distinct Unicode codepoints ('runes') in the stream. (For simplicity I'm going to use the full UCS character space, up to U+7FFFFFFF, which UTF-8 was originally designed for.) For example, if we were to read in the 5-rune, 18-byte emoji 👩‍👩‍👧 then we would print out its 3 distinct runes: U+200D, U+1F467, U+1F469. To do this we need a set to keep track of the runes we've seen. A bitmask is a simple and compact implementation for a set, with one bit per rune, but this bitmask would take up 256 MiB and would generally be mostly empty, since most runes cluster towards the lower end of the space (the Basic Multilingual Plane).
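To make the comparison concrete, here is roughly what that naive bitmask set would look like (a sketch of my own; the names are made up for illustration):

    #include <stdint.h>
    #include <stdlib.h>
    #include <uchar.h>

    static uint64_t *mask; // 2^31 bits = 2^28 bytes = 256 MiB, mostly zero

    void maskinit(void)      { mask = calloc(UINT64_C(1) << 25, 8); }
    void maskadd(char32_t r) { mask[r / 64] |= UINT64_C(1) << (r % 64); }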

This is similar to the scenario in which UTF-8 thrives. UTF-8 is efficient when runes tend to be smaller numbers, but if we were to write out every rune in sequence then it would take up about 11.94 GiB, whereas UTF-32 would use only 8 GiB. UTF-Sets use a similar tactic, with a worst case of 260 MiB, only marginally larger than the 256 MiB bitmask, and in general they will use much less space. Compared to generic sets, UTF-Sets benefit from the neat ability to use less space while storing common (smaller) codepoints, as well as from an extremely simple implementation.
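As a sanity check of those figures, here's a quick back-of-the-envelope calculation (a throwaway sketch of my own, using the codepoint ranges of the original six-byte UTF-8 design):

    #include <stdio.h>

    int main(void)
    {
        // First codepoint requiring each UTF-8 sequence length, 1 to 6
        // bytes, with U+80000000 marking the end of the 6-byte range.
        const long long first[] = {0, 0x80, 0x800, 0x10000, 0x200000,
                                   0x4000000, 0x80000000LL};
        long long bytes = 0;

        for (int len = 1; len <= 6; len++)
            bytes += (first[len] - first[len - 1]) * len;
        // Prints 11.94 GiB; UTF-32 would be a flat 2^31 * 4 bytes = 8 GiB.
        printf("%.2f GiB\n", bytes / (double)(1 << 30));
        return 0;
    }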

Structure

A UTF-Set is organised into a tree of blocks, each of 512 bytes (the same size as a Unix block), split into 64 words of 64 bits each. The number 64 will come up a lot; the reason is that, if we ignore ASCII for the moment, the top two bits of a UTF-8 byte are unused except for self-synchronisation: a leading byte starts with 11, and a continuation byte with 10, but if we assume that the UTF-8 stream is valid then we may as well only look at the bottom six bits. 2⁶ = 64. In a way, UTF-8 is a 6-bit character coding with a 2-bit synchronisation & compression header.

(Incidentally, every non-ASCII UTF-8 codepoint of length n can be uniquely identified by a Base64 string of length n, e.g. 😸 = wfY4.)
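To demonstrate that aside, here's a toy sketch (printb64 is a made-up helper of mine, not part of the implementation) that maps each byte's low 6 bits through the Base64 alphabet:

    #include <stdio.h>

    // Map each byte of a (non-ASCII) UTF-8 sequence to a Base64 character
    // via its low 6 bits; 😸 (0xF0 0x9F 0x98 0xB8) comes out as "wfY4".
    void printb64(const char *s)
    {
        static const char b64[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                  "abcdefghijklmnopqrstuvwxyz0123456789+/";
        while (*s)
            putchar(b64[(unsigned char)*s++ % 64]);
    }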

 1  typedef struct block UTFSet;

The most distinctive feature of a UTF-Set is its root block, which corresponds to the leading byte of a UTF-8 codepoint, and mimics its characteristic bit pattern. Counting the leading ones (i.e. set bits) in a leading byte's 6-bit 'payload' gives one less than the number of continuation bytes to follow, which, as we'll see, is exactly the number of blocks we need to descend through. I implemented this count as a small lookup table. (Note that the last two entries would not make legal leading bytes.)

 2  const char clo[64] = {
 3      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 00xxxx
 4      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 01xxxx
 5      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 10xxxx
 6      2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, // 11xxxx
 7  };
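If you'd rather compute the count than look it up, the same result can be had from a leading-zeros intrinsic (a sketch of mine assuming GCC/Clang's __builtin_clz and a 32-bit int; the 52 lines themselves stick with the table):

    #include <stdint.h>

    // Count the leading ones of a 6-bit payload: shift it to the top of a
    // 32-bit word, complement, and count leading zeros.  Agrees with clo[c]
    // for every payload 0-63, and never passes 0 to the intrinsic.
    unsigned int clo2(unsigned char c)
    {
        return __builtin_clz(~((uint32_t)c << 26));
    }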

In a UTF-Set, the number of leading set bits indicates the height of the tree at that word position in the root block. That is, how many branching nodes (blocks) we must descend through before we reach the boolean leaf values (which indicate whether the corresponding rune is in the set). There is a clear parallel here between the height of the tree and the length of the UTF-8 sequence.

The first 32 words of the root block, having 0 leading set bits, could then each be a pointer to 64 boolean leaf values; the following continuation byte would identify the corresponding leaf. However, as we're working with booleans, we can instead embed a 64-bit bitmask directly in the root block, occupying the space that would otherwise store the 64-bit pointer.
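For example, é is U+00E9, encoded in UTF-8 as 0xC3 0xA9: the leading payload 0xC3 % 64 = 3 has no leading set bits, so root word 3 is itself a bitmask, and é occupies bit 0xA9 % 64 = 41 of that word. A quick check of the arithmetic:

    #include <assert.h>

    // Root word index times 64, plus the continuation byte's low 6 bits,
    // reassembles the rune: 3 * 64 + 41 == 0xE9.
    static_assert((0xC3 % 64) * 64 + (0xA9 % 64) == 0xE9, "é round-trips");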

The remaining 32 words of the root block each then point to an additional 'branching' block. All blocks except for the root block are homogeneous: they have the same type of value stored in all their words. If the leading byte had 1 leading set bit then all of the sub-block's words are bitmasks; if it had 2 then they are all pointers to sub-sub-blocks, all of whose words are bitmasks; and so on.

 8  struct block {
 9      union child {
10          struct block *ptr;
11          uint64_t bits;
12      } blk[64];
13  };

(The challenge of writing a generic UTF-Map in C++, with a template specialisation for the boolean case, is left as an exercise for the reader.)

Insertion

Now let's see how this tree structure works when you add a UTF-8 sequence to the set. As arguments, we will take the UTF-Set and a UTF-8 string; we will parse the first UTF-8 codepoint from the string, add the codepoint to the set, and return a pointer to the next character in the string. For simplicity, I will assume that the UTF-8 sequence is valid.

14  char *addutf(UTFSet *set, const char *s)
15  {

The first case we'll look at is a UTF-8 leading byte. In this case, we start at the word corresponding to the leading byte's 6-bit 'payload', count the payload's leading set bits to find how many levels of the tree to traverse, and then descend one level for each continuation byte, save the last. By the end of this we have a pointer to the bitmask containing the boolean leaf value appropriate for the final continuation byte, which we have not yet read.

16      union child *tp;
17      unsigned char c = *s++;

18      if (c >= 0300) {
19          tp = &set->blk[c % 64];
20          for (unsigned int n = clo[c % 64]; c = *s++, n > 0; n--) {
21              if (!tp->ptr)
22                  tp->ptr = calloc(1, sizeof(struct block));
23              tp = &tp->ptr->blk[c % 64];
24          }
25      }

(Note that we write UTF-8 byte literals in octal, as is natural given their 6-bit true nature.)

Now let's look at an ASCII byte. As I mentioned before, UTF-8 effectively has a 2-bit synchronisation & compression header; the topmost bit is for compression. What I mean by this is that any ASCII byte 0xxxxxxx can be expanded to a 2-byte 'overlong' UTF-8 sequence 1100000x 10xxxxxx. Doing that to a UTF-8 stream would be wasteful, but in a UTF-Set it costs nothing: the two leading payloads 00000x have no leading set bits, so those codepoints are stored directly in the first 128 bits of the root block, with no blocks to allocate. This means we can 'decompress' an ASCII byte into a 2-byte UTF-8 sequence: we read its 7th bit to determine what the leading byte would be, and then leave the byte to have its lower 6 bits read as if it were a continuation byte.

26      else if (c < 0200)
27          tp = &set->blk[c / 64];

If we're given a string starting with a continuation byte then something has gone wrong, as we should be at the start of a valid UTF-8 codepoint, so we bail out.

28      else
29          return NULL;

We can now set the appropriate bit in the bitmask, as specified by the final continuation byte in the UTF-8 sequence (or by the lower 6 bits in the case of an ASCII byte), and finally return the next byte in the string.

30      tp->bits |= UINT64_C(1) << (c % 64);

31      return (char *)s;
32  }
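As an aside, a membership test (my own sketch, not part of the original 52 lines) follows exactly the same path as addutf, but never allocates:

    // Test whether the first codepoint of s is in the set, reusing the
    // UTFSet, union child, and clo definitions above.
    int hasutf(const UTFSet *set, const char *s)
    {
        const union child *tp;
        unsigned char c = *s++;

        if (c >= 0300) {
            tp = &set->blk[c % 64];
            for (unsigned int n = clo[c % 64]; c = *s++, n > 0; n--) {
                if (!tp->ptr)
                    return 0; // this branch was never allocated
                tp = &tp->ptr->blk[c % 64];
            }
        } else if (c < 0200)
            tp = &set->blk[c / 64];
        else
            return 0; // continuation byte: not the start of a codepoint
        return (int)(tp->bits >> (c % 64) & 1);
    }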

Iteration

Now that we've put some runes into our set, we need a way to iterate over them. We might like to print them, for example. For this we'll take as arguments the set and a function to be called on each rune in the set.

33  void foreach(const UTFSet *set, void (*f)(char32_t))
34  {

We'll iterate over each word in the root block, taking note of the number of continuation bytes needed (n), and taking those bits of the leading byte that do not form its leading set bits to be the initial leading fragment of the rune (r), before calling another function foreach1 that acts as a recursive workhorse for the function.

35      for (unsigned char c = 0; c < 64; c++) {
36          unsigned int n = clo[c];
37          char32_t r = c % (64 >> n);

38          foreach1(set->blk[c], f, n, r);
39      }
40  }
41  void foreach1(union child t, void (*f)(char32_t), unsigned int n, char32_t r)
42  {

What we do while recursing depends on whether there are any more continuation bytes. If n is 0 then we're at the leaf value bitmask, and can iterate over its bits. We could do this more cleverly to avoid looping over unset bits (there's a sketch of that after the function), but I've gone for the simplest approach here. For each set bit, we append its 6-bit index to the rune fragment r to form the whole rune, and call the function f on it.

43      if (n == 0) {
44          for (unsigned char c = 0; c < 64; c++)
45              if ((t.bits & (UINT64_C(1) << c)))
46                  f((r * 64) + c);
47      }

Finally, if there are more continuation bytes still to come, then we descend into each of the block's sub-blocks in turn, decrementing n to keep track of the height of the tree, and concatenating another rune fragment onto the end of r.

48      else if (t.ptr) {
49          for (unsigned char c = 0; c < 64; c++)
50              foreach1(t.ptr->blk[c], f, n - 1, (r * 64) + c);
51      }
52  }
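As for the cleverer bit iteration I mentioned, here's one way it could look (a sketch assuming GCC/Clang's __builtin_ctzll; it would replace the 64-iteration loop in the n == 0 case):

    // Visit only the set bits: b &= b - 1 clears the lowest set bit, and a
    // count-trailing-zeros intrinsic finds its index.
    void leafbits(uint64_t bits, void (*f)(char32_t), char32_t r)
    {
        for (uint64_t b = bits; b != 0; b &= b - 1)
            f((r * 64) + __builtin_ctzll(b));
    }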

And that's it! We can now add UTF-8 codepoints to our set, and then iterate over them.

Usage

We can now use this data structure to easily write the program I described at the beginning of this article: we read some UTF-8 strings and then print out the codepoints encountered.

    #include <inttypes.h>
    #include <stdio.h>

    void prune(char32_t); // forward declaration

    void prunes(const char *s)
    {
        UTFSet set = {0}; // empty set

        while (*s)
            s = addutf(&set, s);
        foreach(&set, &prune);
    }
    void prune(char32_t r)
    {
        printf("U+%04" PRIX32 "\n", (uint32_t)r);
    }
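A minimal driver (my own, not from the original repository) reproduces the example from the introduction:

    int main(void)
    {
        prunes("👩‍👩‍👧"); // prints U+200D, U+1F467 and U+1F469
        return 0;
    }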

The code for this (from last year) can be found on GitHub.