Merkle tree are awesome. They allow to hash an ordered set of elements and provide a way, known as a Merkle proof, to prove that a given element is at some position in the set without having to provide the whole set. While this is a very useful tool, the fact that it is ordered can be somewhat troublesome. Inserting or removing an element in the middle of the set will require significant work and, more generally, 2 sets containing the same elements, but in a different order will yield 2 radically different trees.

Today, I’d like to introduce a new tool to the software engineer toolbox, which, as far as I know – and that probably tells more about my searching-foo than anything else, hasn’t been proposed before. The goal is to hash an **unordered** set in such a way that:

- The end result is the same, no matter in what way the tree is created
- Insertion and deletion are ln(n)
- It is possible to produce a proof that an element is contained in the set without producing the set

## Merkle tree

To produce a Merkle tree, one hashes every items to be contained in the tree. Each of these hash is going to be a leaf of the tree. Then these hash are combined 2 by 2 and rehashed to produce a list of parents nodes. The process is repeated recursively until only element remains: the root of the tree.

As a picture if often much clearer than words, so here an example:

This tree as for root **a2bf**. This hash is a unique identifier of the ordered set. It is then possible to prove that an element is at a certain position in the set by producing the element, its position plus a set of hashs to reconstruct the path from this specific item to the root of the tree. Anybody can then combine the hashs recursively to end up at the root, proving that this items is indeed in the set at the given position.

The values required to produce a Merkle proof for **Item1** are marked in yellow in this picture. To verify the proof, one hashes **Item1** and find **6eec**, then combine it with **bf52** to find **e481** and finally with **3a5f** to find **a2bf**, the root the tree. It is in fact possible to not reveal the element itself, but only its hash if desired.

While this is very useful I intend to convince you that if one is willing to accept the set to be unordered, one can get something even better: a Merklix tree.

## Merklix tree

The trick is to use a key, for instance – but not necessarily – and element’s hash, to determine its position in the set. Contrary to a Merkle tree, that means that key are going to be sparse. A radix like structure can be used to store the elements, by making each node a sub tree of all elements having a common prefix in their key. Essentially, this is a mix between a radix and a Merkle tree: a Merklix tree. For instance, our example tree would look like (considering 3 => 0011, 6 => 0110, b => 1011, e => 1110) this:

In this example, **bf1b** is the subtree of all elements having a hash starting by **0** and **2fd8** the subtree of all the elements having a hash starting by **1**.

Because of its new structure, the tree may not be balanced, however, we expect the imbalance to be acceptable as we are using a cryptographically secure hash function and producing a lot of elements with the same prefix would be prohibitively expensive to do at scale.

To demonstrate insertion, let’s add a new **Item4** in our tree, with **ff79** as hash.

Now, **d9fe** is the subtree of all elements with prefix **11**, **4d61** the subtree of all elements with prefix **1** and **7b03** the root of the tree. The insertion only required to recompute ln(n) hashs. A Merklix proof can still be produced by providing a path in the tree + a set of hash to verify the path up to the root.

It should be obvious by now, but deletion is also ln(n), and the next illustration show the tree after deleting **Item0**.

Note that it doesn’t matter if **Item0** was deleted before **Item4** was added, the end tree would look exactly the same.

## What is this tree useful for ?

So we abandoned the ordered structure of the Merkle tree, what do we got in return ? As it turns out, quite a lot.

First, a same set of elements will always yield the same tree no matter what are the operation done to get there. This is very important for system where consistency is hard to maintain, which is common in distributed system. The nodes of the system will eventually converge toward the same set and the same root, no matter how they get there.

We can insert, delete or mutate an element from the tree in ln(n) time. A Merkle tree require o(n) to do so. This tree is more efficient to work with.

In the example above, we used the elements own hash as a key in the tree. But as it turns out, we can use any hash we wish as a key. Therefore, Merkelix tree can be used as an unordered map as well as an unordered set.

Because the position of an element in the tree is known, it is possible to produce a proof of absence, producing the path in the tree that would lead to element if it was in, but do not. Merkle tree on the other hand, can only prove that element is or isn’t at a given position, not in the whole set.

Observant readers will have noticed that subtree are, in effect, Merklix tree themselves containing a subset of the tree. As a result, the workload can be sharded by prefix. The root from each shard can be combined to rebuild the tree root in o(k), k being the number of shards. Therefore, the workload of a node would be o(n/k + k) or o(n/k) if n >> k^2 .

As a result, the Merlix tree can be pruned, as long as all nodes do not prune the same part of the tree. Infrequently accessed part of tree can for instance be committed on disk, or in a DB, or simply be delegated to another shard.

Finally, it is possible to use the proof of presence or absence to insert, remove or modify an element in the tree, even the part that where pruned ! The proof contains all the hashes that are required along the path. While this may seems counter-intuitive, several write operation can be issued from a proof using the same root. While, after the first write, the root will have changed, all the hashes required for the second will either be in the proof or in the set of hashes computed during the first operation. This means that a node can compute and store a set of m operations is a set of size n in o(m*ln(n)) space and time, this even while having pruned the whole tree !

The knowledge of the content of the tree is only required to produce the Merklix proof, but all subsequent operations can be performed by any actor **without any knowledge of the content of the tree** except its root.

In the next article, I will discuss how Merklix tree can be used to improve blockchain based technologies such as Bitcoin to scale better, notably by using them to checkpoint the UTXO set.

Brilliant. I look forward to your next article.

Pingback: Using Merklix tree to checkpoint an UTXO set | Deadalnix's den

It seems that absence proofs require something else:

leaf proofs; i.e. the ability to prove that some node is a leaf.

One simple way to achieve this, is to tag internal nodes not only with the common prefix as you do above, but with the subtree size as well, measured as number of leaves in it.

A leaf proof then shows that its parent has subtree size either 2, or 1 greater than the sibling.

If the length of the key is known, you know you have a leaf when the prefix length reaches the key length.

Pingback: Using merklix tree to shard block validation | Deadalnix's den