Reachability in Databases
Forking this post about a database that allows efficient querying of the causal order of ssb messages to not derail the thread.
@mix:
SQUEEE!!!
Indeed.
TLDR: Building a database that supports querying the causal order is not trivial, but possible.
Context: The discussion about removing mandatory timestamps from ssb messages has led us to the realization that we'd need the database layer to support efficient queries about the causal order of messages, i.e. whether some message is cryptographically guaranteed to be older than some other message. The causal order is determined by cypherlinks: If msg1 refers to the hash of msg2, it must be newer. We can treat the messages as the vertices of a graph, and cypherlinks as directed edges. The transitive closure of the edge relation (aka the reachability relation) is the causal order we are interested in. When asking whether some msg x is newer than another msg y, we are asking "Is there a directed path from x to y?".
This is a well-studied problem in database theory and implementation. As a theoretical introduction I'd recommend this summary, but it lacks any newer results (published 1990). For an overview of the current state of the art, I recommend the related-work section of this paper. But here is a summary of the basics, for your convenience.
All solutions to reachability query have to balance out three factors: Time complexity of query processing, space complexity of database indices, and time complexity of creating those indices. The two trivial solutions are:
- not storing any indices, perform a depth-first search on the graph to answer a query. Query time is then linear in the number of vertices. This is not acceptable for ssb.
- store the full reachability relation in indices. Query time is then logarithmic in the number of messages (or constant in the average case by using hash tables). This however requires storage space quadratic in the number of messages. This might be sufficient to test out how ssb without timestamps could work, but it is not acceptable in the long run.
The paper I linked to does a good job of roughly sketching the known approaches to interpolate between these two extremes, so I won't go into all of these. The gist is: They are good enough, and they get implemented in real databases. But there are a few ssb-specific properties we could use to obtain even better solutions:
We'd want efficient incremental updating of the indices, we can't recompute all indices from scratch whenever a message arrives. Luckily, we rarely lose edges or vertices from the message graph (only in the case of blocking). And we never add arbitrary edges, we only need to handle adding a single vertex and its outgoing edges (i.e. receiving a new message). These properties should allow us to find more efficient solutions than the general case. N.B: I did not find any literature on incrementally maintaining the reachability indices in general, so I did not find any on append-only databases in particular either.
We know that some subset of messages are in a total order, namely the messages from the same feed. No message is in more than one of these total orders. This allows a compact representation of reachability information that only takes up O(n log n)
space: sequence numbers. Incrementally updating those takes constant time as well. These can be used to derive new algorithm from those for arbitrary dags. I also like how this implies that a sophisticated ssb database will always store sequence numbers explicitly, even if they are not part of the metadata.
Some further observations: The message graphs ssb produces will be fairly sparse for the most part. That's good, the heuristics in the literature work well on sparse graphs. Theoretically, we even have graphs of bounded degree, because there's a message size limit. So in theory, there might be some funky fpt algorithm. In practice, you can fit too many cypherlinks into a single message for that to actually matter.