Complexity
The scheme encodes a binary tree in the feed. The number of inner nodes in a binary tree is linear in the number of leaves, so we end up with only linear space overhead in the feed (compare with e.g. this approach, which has quasilinear overhead). The metadata is not evenly distributed across nodes, but it is bounded by log_2(seqnum) hashes per entry.
More concretely, this adds an average of two hashes to each metadata entry: one for the bridge (except at powers of two), and on average one for the regular merkle tree.
Certificates are paths up to the root of the tree and then back down to the first message, so they have length at most 2 * log_2(seqnum). Verification time complexity has the same bound.
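To make these bounds concrete, here is a minimal sketch (the function names are mine, not part of the scheme):

```typescript
// Worst-case bounds from the analysis above; names are illustrative.
// A certificate walks up to the root and back down to message 1,
// so it contains at most 2 * log_2(seqnum) nodes.
function maxCertificateLength(seqnum: number): number {
  return 2 * Math.ceil(Math.log2(seqnum));
}

// Per-entry metadata is bounded by log_2(seqnum) hashes.
function maxMetadataHashes(seqnum: number): number {
  return Math.ceil(Math.log2(seqnum));
}

console.log(maxCertificateLength(1_000_000)); // 40
console.log(maxMetadataHashes(1_000_000));    // 20
```

So even for a feed of a million messages, a certificate never exceeds 40 hashes.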
Optimizations
A few rather trivial ones:
- no need to store `hash(msg_k)` in the feed, you can compute that directly from `msg_k`
- no need to store `hash(hash(msg_<k-1>) | hash(msg_k))` in the feed, you can compute that directly from `msg_k` as well (`msg_k` already contains `hash(msg_<k-1>)`, it's the backlink); see the sketch after this list
- we can interpolate between feed size overhead and certificate size by dropping the highest non-leaf layers of the tree, sending a bounded number of leaf nodes as part of the certificate instead. The actual certificate can then be computed from those leaf nodes. This is not really an optimization, it just allows us to keep feeds more compact at the cost of making certificates less performant. Note that this only adjusts constants, it never improves the linear feed size overhead or breaks the logarithmic certificate size/performance.
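A minimal sketch of the first two points, assuming a simplified message layout and sha256 (the real ssb encoding and hash construction differ, this only shows that both nodes are recomputable):

```typescript
import { createHash } from "node:crypto";

// Illustrative message shape: the backlink field carries hash(msg_<k-1>).
interface Message {
  backlink: Buffer; // hash(msg_<k-1>), already part of every message
  content: Buffer;
}

function sha256(data: Buffer): Buffer {
  return createHash("sha256").update(data).digest();
}

// Leaf node hash(msg_k): recomputed from the message itself, never stored.
function leafHash(msg: Message): Buffer {
  return sha256(Buffer.concat([msg.backlink, msg.content]));
}

// Inner node hash(hash(msg_<k-1>) | hash(msg_k)): also recomputable,
// since msg_k already carries hash(msg_<k-1>) as its backlink.
function pairHash(msg: Message): Buffer {
  return sha256(Buffer.concat([msg.backlink, leafHash(msg)]));
}
```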
Another nifty optimization can be done when sending certificates around. Two certificates for messages from the same feed will always share some nodes (if they didn't, they couldn't prove the guarantees we are interested in), so it is not always necessary to send full certificates. Certificates share more data the closer the two seqnums are. When requesting a message of seqnum `k`, the requesting peer can also attach the next smaller and the next larger seqnum from the same feed for which it already has certificates. The responding peer can then easily compute the minimum amount of data it needs to send. <unproven intuition>I think when requesting a full feed in random order that way, the total amount of certificate overhead is only linear in the feed size (rather than quasilinear without that optimization).</unproven intuition>
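A hedged sketch of the request and response shapes this implies (all type and field names are invented for illustration):

```typescript
// A request for message k also names the nearest seqnums for which the
// requester already holds certificates, so the responder can strip the
// tree nodes those certificates already cover.
interface CertifiedRequest {
  feedId: string;
  seqnum: number;      // the message being requested
  knownBelow?: number; // next smaller seqnum with a stored certificate
  knownAbove?: number; // next larger seqnum with a stored certificate
}

// The responder sends only the certificate nodes not already implied by
// the certificates for knownBelow/knownAbove.
interface CertifiedResponse {
  message: Uint8Array;
  certificateNodes: Uint8Array[]; // the delta, not the full certificate
}
```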
Conclusions
We can efficiently allow random-access verification of messages. The current approach for random-access messages (ooo) simply gives up and forfeits some of ssb's security guarantees. Merkle-ooo can fix that.
Another application would be initial sync (CC @andrestaltz): you could first request the newest `n` messages of a feed and hand them to the client. Afterwards, you can request the remainder of the feed, doing traditional verification via backlinks.
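A rough sketch of that sync order, with all rpc and verification functions as stand-in parameters (none of these names exist in ssb):

```typescript
type Msg = { seqnum: number; bytes: Uint8Array };

// Phase 1: the newest n messages, usable immediately thanks to their
// merkle certificates. Phase 2: backfill the rest, verified via backlinks.
async function initialSync(
  feedId: string,
  n: number,
  rpc: {
    latest(feedId: string, n: number): Promise<Msg[]>; // newest n, certified
    range(feedId: string, from: number, to: number): Promise<Msg[]>;
  },
  verifyCertificate: (m: Msg) => void,      // logarithmic, random access
  verifyBacklinkChain: (ms: Msg[]) => void, // traditional linear check
  handToClient: (m: Msg) => void,
): Promise<void> {
  const latest = await rpc.latest(feedId, n);
  for (const m of latest) {
    verifyCertificate(m);
    handToClient(m); // the client can display these right away
  }
  const oldestFetched = Math.min(...latest.map((m) => m.seqnum));
  const rest = await rpc.range(feedId, 1, oldestFetched - 1);
  verifyBacklinkChain(rest);
}
```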
More generally though, this could provide the basis for partial feed subscriptions. Plugins could expose their own replication RPCs that efficiently determine what kind of messages to send, and the merkle-tree validation would do the rest. The core does not need to care about the criteria by which the subset of the feed is selected. Thanks to the optimization where it is not necessary to send full certificates, this would allow linear-time verification of partially subscribed feeds, just like regular ones. And that's the really, really, really powerful feature enabled by merkle trees.
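To illustrate that division of labour, a sketch of what such a plugin interface could look like (invented for illustration, not an existing ssb API):

```typescript
// A plugin only decides *which* messages to replicate; the core fetches
// each selected message with its (delta) certificate and verifies it,
// independent of the selection criteria. All names are invented.
interface PartialReplicationPlugin {
  name: string;
  // e.g. "all messages of type 'contact'", expressed as seqnums to fetch
  select(feedId: string): AsyncIterable<number>;
}

interface Core {
  replicatePartial(
    plugin: PartialReplicationPlugin,
    feedId: string,
  ): Promise<void>;
}
```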
The inability to do partial subscriptions is one of the two weakest points of ssb in my opinion (the other being the non-hierarchical nature of identities). Merkle trees can fix that.