@aljoscha %WpD965FnDSGXxBoWous/kepl1JcG9VqhlKnpINmUydk=.sha256
Re: %uJX/grT0v

...continuation of previous post

O(log(len_a)) Indexing Schemes

The previous schemes saved a lot of index space, but query time complexity got worse than logarithmic. In this section, we'll propose two indexing schemes that can answer queries in logarithmic time, while still taking up much less space than the "full" indexing scheme.

"sqrt-2" worked by storing shadow differences to previous vertices in regular distances. Choosing any distance other than sqrt(len_a) gives worse results: For smaller distances, more iterations are needed to hop between cache vertices, for larger distances, it takes more iterations to find the first cache vertex. Even for "sqrt-1", larger distances are a problem, they incur larger cache sizes. Intuitively speaking, there's redundancy among the caches of successive vertices, and the distance hops serve as barriers against that redundancy. Increase the distances, and much more redundant data is stored.

To design a scheme with logarithmic lookup time, it is thus not sufficient to simply store shadow differences to a_(i / 2), as that would bloat indices. But still, logarithmic lookup time requires skipping over many vertices during the query. To solve this, we choose a scheme where only a few vertices allow jumping far back during queries, and all other vertices jump towards those. It can be thought of as the traversal of a binary tree from an arbitrary node towards its smallest leaf, which has a worst-case time of twice the height of the tree.

The log Indexing Schemes

"log-1": Each vertex a_i stores shdiff(reach(a_i), ind(a_(traverse(i)))), where traverse(i) is defined as follows:

  • if i is a power of two, traverse(i) = i / 2 (i.e. a bitshift to the right)
  • else, traverse(i) is a bitshift to the left: Shift the binary representation of i one bit to the left (inserting a 0 as the new least significant bit), dropping the most significant bit.

Index creation and queries proceed analogously to "sqrt-1". Unlike "sqrt-1", the "log-1" scheme does not require any reindexing when appending a vertex to the feed. Analogously to how "sqrt-2" is derived from "sqrt-1", you can derive "log-2" from "log-1" by using every log_2(len_a)-th vertex as a cache vertex.
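
To make the traversal concrete, here is a small sketch (with names of my own choosing, not code from any actual ssb implementation) of traverse and of the chain of cache vertices a query starting at a_i walks along. The chain length stays within roughly 2 * log_2(i), matching the tree-height intuition above.

```rust
/// A sketch of the "log-1" traversal, assuming vertices are numbered from 1.
fn traverse(i: u64) -> u64 {
    assert!(i >= 2);
    if i.is_power_of_two() {
        // Powers of two jump far back: shift right, i.e. halve the index.
        i >> 1
    } else {
        // Everything else drops its most significant bit and shifts left,
        // reaching the next smaller power of two within a few hops.
        let msb = 1u64 << (63 - i.leading_zeros());
        (i ^ msb) << 1
    }
}

/// The cache vertices a query starting at a_i would visit, ending at a_1.
fn chain(mut i: u64) -> Vec<u64> {
    let mut visited = vec![i];
    while i > 1 {
        i = traverse(i);
        visited.push(i);
    }
    visited
}

fn main() {
    // Chain lengths stay around 2 * log_2(i), e.g.:
    //   7   -> [7, 6, 4, 2, 1]
    //   100 -> [100, 72, 16, 8, 4, 2, 1]
    for &i in &[7u64, 100, 1_000_000] {
        println!("{}: {:?}", i, chain(i));
    }
}
```

For "log-2", the same kind of chain would only be walked among the designated cache vertices: with len_a around a million, every ~20th vertex would be a cache vertex, so a query first walks at most about 20 ordinary vertices before entering the chain (this is my reading of the analogy to "sqrt-2").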

There's really not a lot to say about this scheme that hasn't been said about the "sqrt" schemes, except that it is surprisingly elegant and powerful.

Reality Checks

Everything written so far assumes a very simple model: Append-only dags where new edges are only introduced as outgoing edges of a new vertex, and no queries other than binary reachability queries. The real setting of an ssb database is more complicated.

Blocking and Selective Deletion

Blocking a feed means removing all the feed's messages from the message graph. Selective message deletion (enabled by #offchain-content) allows removing arbitrary messages from the message graph. Both of these affect indices. It is simple to remove edges from the graph, but the problem is that their reachability information stays cached in other vertices' indices. Queries about new vertices iteratively consult those same caches, so the stale information would linger in the system forever.

There are two trivial solutions to this: First, you can simply ignore the problem. As long as those indices are only used for causal ordering, the additional edges don't hurt - they only tighten the partial order. The other trivial approach is to rebuild all indices in the background whenever a vertex/feed is deleted. This approach is expensive, however, making vertex deletion at least O(n).

A smarter solution would start by deleting the vertex and its edges, then check locally which caches are affected and update them as needed. The computationally expensive part is finding out whether removing an edge really removes a vertex from a cache, or whether that vertex can still be reached along different paths. Also, a single vertex deletion can affect many caches.

If blocks/deletions are expected to be rare events and to primarily happen on messages with relatively few outgoing links (such as human-readable texts), it might be ok to keep the indices simple and have deletion be somewhat expensive (especially since it can happen in a background thread). Otherwise, it can make sense to increase the space usage of caches by storing, for each reachable message, how many paths allow reaching it. Or you could use bloom filters to store the set of messages that refer to each message. There are probably more sophisticated approaches for leveraging ssb's properties for simpler vertex deletion, but this document is already getting long enough.
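
As an illustration of the path-counting idea, here is a minimal sketch under my own assumptions, with placeholder names and types; the bloom filter variant would replace the exact counts with an approximate set of referring messages.

```rust
use std::collections::HashMap;

/// Placeholder message identifier, standing in for an actual ssb message id.
type MsgId = u64;

/// Alongside each cached reachable vertex, keep the number of known paths
/// through which it is reachable, so deleting one path doesn't wrongly
/// evict a vertex that is still reachable along another.
#[derive(Default)]
struct ReachCache {
    paths: HashMap<MsgId, u32>,
}

impl ReachCache {
    fn add_path(&mut self, target: MsgId) {
        *self.paths.entry(target).or_insert(0) += 1;
    }

    /// Called when an edge/path towards `target` is deleted. Returns true
    /// if `target` is no longer reachable and was dropped from the cache.
    fn remove_path(&mut self, target: MsgId) -> bool {
        let count = match self.paths.get(&target) {
            Some(&n) => n,
            None => return false,
        };
        if count > 1 {
            self.paths.insert(target, count - 1);
            false
        } else {
            self.paths.remove(&target);
            true
        }
    }

    fn is_reachable(&self, target: MsgId) -> bool {
        self.paths.contains_key(&target)
    }
}

fn main() {
    let mut cache = ReachCache::default();
    cache.add_path(42);
    cache.add_path(42); // reachable along two paths
    assert!(!cache.remove_path(42)); // one path deleted, still reachable
    assert!(cache.remove_path(42)); // last path gone, dropped from the cache
    assert!(!cache.is_reachable(42));
}
```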

Out Of Order Messages

Out of order messages are messages for which the full feed is not known. These would need to be handled differently than normal feeds.

Messages can contain cypherlinks to other messages that the local database may not know about. Once such a message becomes known, it may provide more reachability information, complicating incremental index maintenance.

Topological Sorting

The database should be able to efficiently output a topological sorting of the vertices based on causal order, which the proposed indices do not support out of the box. An additional consideration is whether to store backlinks, which may make it easier to efficiently compute topological sortings (or reverse topological orders) on the fly. Incidentally, this sort of additional information would also speed up vertex deletion.
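
For reference, here is a baseline sketch of what computing such an ordering involves, using Kahn's algorithm over the link graph (all names are illustrative; the point of storing backlinks would be to maintain or speed up such an order incrementally instead of recomputing it from scratch).

```rust
use std::collections::{HashMap, VecDeque};

/// Placeholder message identifier, standing in for an actual ssb message id.
type MsgId = u64;

/// `outgoing` maps each message to the messages it links to (its cypherlinks).
/// The result lists every message before the messages it links to, i.e.
/// newest first; reverse it for the causal, oldest-first order.
fn topological_order(outgoing: &HashMap<MsgId, Vec<MsgId>>) -> Vec<MsgId> {
    // Count, for every message, how many other messages link to it. With
    // backlinks stored in the database, this is already available.
    let mut in_degree: HashMap<MsgId, usize> = HashMap::new();
    for (&v, targets) in outgoing {
        in_degree.entry(v).or_insert(0);
        for &t in targets {
            *in_degree.entry(t).or_insert(0) += 1;
        }
    }

    // Start with the messages nothing links to, then peel layer by layer.
    let mut queue: VecDeque<MsgId> = in_degree
        .iter()
        .filter(|(_, &d)| d == 0)
        .map(|(&v, _)| v)
        .collect();
    let mut order = Vec::new();

    while let Some(v) = queue.pop_front() {
        order.push(v);
        if let Some(targets) = outgoing.get(&v) {
            for &t in targets {
                let d = in_degree.get_mut(&t).unwrap();
                *d -= 1;
                if *d == 0 {
                    queue.push_back(t);
                }
            }
        }
    }
    order
}

fn main() {
    // Message 3 links to 2 and 1; message 2 links to 1.
    let mut links = HashMap::new();
    links.insert(3, vec![2, 1]);
    links.insert(2, vec![1]);
    links.insert(1, vec![]);
    println!("{:?}", topological_order(&links)); // [3, 2, 1]
}
```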

Custom Indices

In addition to reachability data, these indexing schemes could be used to store custom indices, such as reachability via only certain message types, or only via certain entries of a message. If the database exposes the ability to create such custom indices, it becomes more important to correctly and efficiently handle vertex deletion.

Another possibility for such custom indices is k-reachability (reachability where the shortest path connecting two vertices may have length at most k).

Conclusion

As the previous subsections show, the indexing schemes presented here are not sophisticated enough for a real-world ssb implementation. But they demonstrate how feed structure can be leveraged for efficient indexing schemes, and can serve as the basis for more complex ones.
