%/DmqNODlO97dAlm7a7jrEQlnF5XhH7dQv1+xjwCyJ0c=.sha256

@aljoscha6 years ago %/DmqNODlO97dAlm7a7jrEQlnF5XhH7dQv1+xjwCyJ0c=.sha256

The Problem

Ssb has that property, so why should there be a problem? The problem is one of complexity. If I gave you the k-th and l-th message from the same feed, and k has a lower sequence number than l, how could you verify that message l has indeed been created after message k? You'd have to traverse the full hash chain from l back to k. In the worst case, k would be zero. Then, you'd need to traverse the full feed. So to verify message l, you need to perform l many checks - and to do so, you need to know the data of l other messages. Put into more mathematical terms: the length of a validity certificate is linear in the sequence number of the message.

What if you only cared about a subset of the messages of a feed, but still wanted the full security guarantees? With current ssb, your only chance is to download the full feed, even though you don't care about much of it. This is the problem that merkle trees can solve. But first a few words on a related concept: Out-of-order (ooo) messages.

Ssb provides ooo messages, where a server can ask its peers to deliver it a message of a certain hash. That way, you can retrieve messages without downloading their full feed. That's fine, but it forfeits the guarantee about creation order. Instead, you trust that whoever gave you that hash has already verified the validity of that message. While that's fine for relatively inconsequential, one-off retrievals such as fetching a post in a thread you are interested in, it is not a solid foundation for regularly fetching a subset of a feed. And in general, it is somewhat dangerous that there are essentially two types of messages floating around: ones where you know for sure that creation order guarantees hold, and ones where you only trust somebody that they hold. Application developers need to cleanly separate those if things are to stay secure.

More generally, this split between verifiable and unverifiable messages makes the whole system more complex to reason about. This may sound vague and abstract, but this additional complexity has surfaced multiple times already, imposing metadata overhead on all messages.

I don't want to argue that ooo messages are a bad feature per se. But if there was an alternative that managed to preserve security guarantees, it should surely be considered. And merkle trees provide this alternative.

Merkle Trees

(This section is rather technical and assumes some familiarity with basic computer science themes regarding data structures and complexity.)

Merkle trees (see the wikipedia article) are collections that can give a certificate (basically a proof) that they contain a certain item. Crucially, the size of such a certificate is only logarithmic in the number of items, as is the verification time. Proof of membership is not exactly what ssb needs, but it is close enough to be adapted to our needs. There are few other points that need some tweaking. For one, merkle trees assume a bounded size and a well-known root hash, but ssb feeds have neither. Furthermore, we want to avoid adding additional nodes to our message graph, so we'd rather store non-leaf nodes implicitly.

Let's start with a basic scenario to get a feel for the setting: The following graphic shows an ssb feed of eight messages and a corresponding merkle tree.

m0 to m7 are the messages, h0 to h7 are hashes of those messages, the dotted edges between the first two layers are ssb's regular backlinks. h<x><y> is the hash of the concatenation of h<x> and h<y>. You can imagine a time-axis going from left to right. Instead of the regular symmetric tree layout, I've moved the hash nodes to the first point in time at which they can possibly be computed. An non-dotted edge from node v to node u means that u can not have been computed if v didn't already exist (else, the hash function would be broken).

In the regular merkle tree setting, certificates work as follows: Assuming I already knew the hash of the item I'm interested in (say h6) and the root hash (h01234567), how can someone prove to me that h6 is in the tree? They only need to give me h7, h45, and h0123, and then I can check whether hash(h0123 | hash(h45 | hash(h6 | h7))) == h01234567. If it is, then h6 did indeed contribute to the creation of my well-known root value h01234567. What we care about here is that this also implies that h6 (and thus m6) was created before h7. And if I later asked for a certificate for h3, the correct certificate also implies that m3 was indeed created before m6 (by checking that the h0123 value is the same in both certificates (which happens implicitly since otherwise the final root hash would differ)).

Transfering this to ssb, the first problem is that we don't have a well-known root hash. But that is fine, we just send the root as part of the certificate instead. m6 is signed, so we already know that it is either a member of the correct feed, or the feed owner has been forking the feed. We can't detect forks with only a single message, since for a fork you always need at least two, contradictory ones. Rather then collection membership, we only care about being able to verify that any message of seqnum < 6 that we receive at a later point in time must have already existed when we received m6. And that's what the merkle tree gives us. Well, except for a few problems we still need to deal with.

Assume we received m6 and the associated certificate c6 described above (h7, h45, h0123, and h01234567), and now we ask for m1. The certificate c1 we get consists of h0 and h01. The server we are requesting from might not even know about any messages that were published after m1. But now we can't check whether c1 contradicts c6. c6 contains h0123, but does not tell us how h01 must relate to it. Fortunately, there's an easy fix: Each certificate must contain all the diamon-shaped nodes (h0, h01, h0123, and so on) that existed up to that point in time. Each diamond node is the root of the complete binary merkle trees over the first 2^i messages for some i. Each certificate is created by "working downwards" until one of these roots is reached, so by including them, we ensure that all certificates can be checked against each other.