Variable size integer encoding

It is a known fact of nature that most integers are small. When transferring data over the network, or to/from disk, or pretty much anywhere the IO is slow, keeping data compact is a plus. Obviously, compression is an option, but it is fairly CPU heavy. Taking advantage of the fact that, in practice, most integer values are small is an option to effectively reduce the amount of data for a modest cost in term of CPU.

LEB128

LEB128 is a variable length encoding used in the file format DWARF, used for debug data and exception unwinding. Variation of LEB128 are used in many software, including git, protocol buffer and many others. The technique involve splitting the integer in groups of 7bits. Each bytes in the encoding contains 7 bits of the integer, plus one bit indicating if there is a next group of 7 bits coming.

Range Encoding
0 .. 2^7 – 1 0xxxxxxx
2^7 .. 2^14 – 1 1xxxxxxx 0xxxxxxx
2^14 .. 2^21 – 1 1xxxxxxx 1xxxxxxx 0xxxxxxx
2^21 .. 2^28 – 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
2^28 .. 2^35 – 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx

UTF-8

UTF-8 was made with a few constraints in mind. First, it had to be compatible with ASCII, which means all value bellow 128 must be encoded as this. In addition, none of the remaining bytes in the multibyte encoding must be in the 0 – 127 range, so legacy software wouldn’t confuse them for ASCII characters. As text processing is often performance critical, it needed to be decodable quickly. Finally, decoding would happen forward, but also backward.

UTF-8 ‘s first byte is made of a variable numbers of 1, followed by a 0, and then actual bits from the data. The number of ones indicate the number of bytes this value uses. For one byte values, the number of ones would be 0, and, as a result, integer in the 0 – 127 range can be encoded as themselves (they all start with a 0 bit). This means that, contrary to the LEB style of encoding, the number of bytes is known up-front and this can be leveraged to increase ILP when forward decoding.

Following bytes are started by 10, and then 6 buts of payload. This is necessary so that none of theses bytes ends up in the 0 – 127 range, but also to help backward decoding. Because 1 bytes integers start with 0 and 2 bytes ones starts with 110, a bytes starting with 10 is a sure sign that it is within a multi-byte encoding.

Range Encoding
0 .. 2^7 – 1 0xxxxxxx
2^7 .. 2^11 – 1 110xxxxx 10xxxxxx
2^11 .. 2^16 – 1 1110xxxx 10xxxxxx 10xxxxxx
2^16 .. 2^21 – 1 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
2^21 .. 2^26 – 1 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The UTF-8 encoding is not as efficient in term of space as the previously mentioned encoding, however, it is remarkable that it can achieve such a level of compatibility with raw ASCII, while remaining fairly compact and efficient.

Bitcoin’s var_int

The bitcoin protocol uses a form of variable size encoding for integer, mostly for expressing array like structures. The first byte of the integer is either 0xFD for 3 bytes encoding, 0xFE for 5 bytes encoding, 0xFF for 8 bytes encoding and is considered to be a 1 bytes integer for anything less than 0xFD .

this scheme is quite primitive but is very easy to understand. It encode integer on 1 bytes for values up to 252, but pessimize everything past that point because it require a 1 byte prefix.

Range Encoding
0 .. 252 (1 byte integer)
253 .. 2^16 – 1 0xFD (2 byte integer)
2^16 .. 2^32 – 1 0xFE (4 byte integer)
2^32 .. 2^64 – 1 0xFF (8 byte integer)

Dlugosz’s Variable-Length Integer Encoding

This encoding is used in the ZIP format. It strive to encode small integers efficiently, but also try to have efficient encoding for larger yet common integer formats, such as 128 bits UUID, arbitrarily large integers and has extension points for future encoding. The format uses a header as does UTF-8 , but the header is somewhat more complex and subsequent bytes do not need to start with 10.

Follow that link to learn more about Dlugosz’ Variable-Length Integer Encoding.

Putting it all together

All these formats strike different trade offs. However, most do have some inherent inefficiencies. For instance, most of these formats offer redundant ways to encode small numbers as large number encoding can also be used to store small numbers. The range that large numbers can encode could by offsetting it by the largest value the previous range can encode. This comes at the cost of one addition per encoding/decoding, which is relatively cheap. While the range that one would win isn’t that great, it is sometime a problem to not have a canonical representation of numbers, and, in these case, a range check needs to be performed anyway, and that’s no less expensive than the addition, but do not extend the range.

Negative integers are represented as 2’s complement in computers. This effectively makes negative number looks like very large number and defeat most of these encoding schemes. In that case, when the number is negative, one will flip all the bits of the number, putting it back in the small number range. Then a bit is stolen somewhere to indicate if the bits of the final result needs to be flipped or not. If this bit is the chosen bit is the least significant one, encoding and decoding can be sone as such:

1
2
3
4
5
6
7
8
9
10
11
12
13
uint encode(int x) {
    auto base = x * 2;
    return (x >= 0)
        ? base
        : ~base;
}

int decode(uint x) {
    auto base = x / 2;
    return (x & 0x01)
        ? base
        : ~base;
}

Rolling our own variable length encoding format

In this exercise, we will create an encoding format for unsigned numbers. It matter to us that there is a canonical way to encode numbers, and we want the encoding and decoding process to be fast, so we’d like to know the number of bytes we are dealing with up-front. We don’t care much about backward decoding or accommodating other special number formats such as Dlugosz’s does.

We will chose to encode out first byte just as UTF-8 does, but remove the constraint that subsequent bytes needs to start with 10. This will allow to use an extra 2 bits per byte and will reduce the work needed to be done to clean these 2 bits when decoding. Because we want a canonical encoding, we will offset the ranges encoded.

Because we do not care about number greater than 2^64 in our use case, we can simply chose that if the first byte is full of ones, then the following 8 bytes are the integer we are encoding.

Range Encoding
0 .. 0x7F 0xxxxxxx
0x80 .. 0x407F 10xxxxxx xxxxxxxx
0x4080 .. 0x20407F 110xxxxx xxxxxxxx xxxxxxxx
0x204080 .. 0x1020407F 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
0x10204080 .. 0x081020407F 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

Because we are offsetting the ranges, there is no need to do a range check to verify that the encoding is canonical, except for the 9 bytes representation.

Comparing our new format to others

On this graph, Bitcoin’s format is plotted in blue. As we can see, the only range of value where it outperform its competitors is the [128 .. 252] interval. It under performs compared to alternatives quite badly.

LEB128 is plotted in red, while our new format is plotted in green. As we see, the range offsetting doesn’t change much. The line are almost on top of one another. However, this format is interesting compared to LEB128 as it is bounded to 9 bytes encoding, when LEB128 require more for large 64 bits numbers, and we can know the size up-front.

Leave a Reply

Your email address will not be published. Required fields are marked *