@aljoscha %2g/bxlJjCjnehn+K9b9cLH4Z9iay6iPm8S0owMgo4YE=.sha256
Re: %uJX/grT0v

...continuation of previous post

O(sqrt(len_a)) Indexing Schemes

Suppose a is a feed of length len_a. In the previous scheme, the vertices store the full (shadow) set of vertices reachable from them. The central idea behind the new scheme is to only cache a subset of this information, and restore the full reachability information via an iterative graph traversal. For this, we need another definition:

  • the shadow difference shdiff(V, W) between two vertex sets V and W is the set containing all vertices v in V for which there is no vertex w in W such that w shadows v
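To make the definition concrete, here is a minimal sketch in JavaScript. It assumes vertices are plain `{feed, seqnum}` objects, and it reconstructs the `shadows` predicate from the earlier posts as "same feed, sequence number at least as large" — that reconstruction is my assumption, not something stated in this post.

```javascript
// Assumed shadowing predicate (reconstructed, may differ in edge cases):
// w shadows v iff both are on the same feed and w's seqnum >= v's seqnum.
function shadows(w, v) {
  return w.feed === v.feed && w.seqnum >= v.seqnum;
}

// shdiff(V, W): all vertices v in V for which no vertex w in W shadows v.
function shdiff(V, W) {
  return V.filter(v => !W.some(w => shadows(w, v)));
}
```

For example, with a_3, a_5 on feed a and b_2 on feed b, shdiff([a_3, b_2], [a_5]) drops a_3 (it is shadowed by a_5) and keeps b_2.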

The sqrt-1 Indexing Scheme

The new indexing scheme: Each vertex a_i stores shdiff(reach(a_i), ind(a_(i - sqrt(len_a)))). Just like in the previous scheme, the indices can be computed by iterating in reverse topological order. Instead of computing the full reach(a_i), it suffices to use the union of the index sets of all out-neighbors of a_i as the first argument to shdiff. This also works for incrementally updating the index as new vertices arrive.
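A sketch of how a single stored index could be computed under this scheme (helper names are hypothetical; the `shadows` predicate is reconstructed from the earlier posts as "same feed, seqnum at least as large"): take the union of the out-neighbors' index sets, reduce it to its shadow set (keep only the newest vertex per feed), and store only the difference to the anchor's index.

```javascript
// Reconstructed shadowing predicate (assumption, see earlier posts).
function shadows(w, v) {
  return w.feed === v.feed && w.seqnum >= v.seqnum;
}

// shdiff(V, W): all vertices in V that no vertex in W shadows.
function shdiff(V, W) {
  return V.filter(v => !W.some(w => shadows(w, v)));
}

// Shadow set of a vertex set: keep only the newest vertex per feed.
function shadowSet(V) {
  const newest = new Map();
  for (const v of V) {
    const cur = newest.get(v.feed);
    if (cur === undefined || v.seqnum > cur.seqnum) newest.set(v.feed, v);
  }
  return [...newest.values()];
}

// Index stored at a_i: shadow set of the union of the out-neighbors' index
// sets, minus everything already shadowed by the anchor's index.
function indexFor(outIndexSets, anchorIndex) {
  return shdiff(shadowSet(outIndexSets.flat()), anchorIndex);
}
```

For instance, if the out-neighbors' indices contain a_4, b_2 and b_3, and the anchor's index already contains a_4 and b_2, only b_3 needs to be stored.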

Only storing the shadow differences can greatly reduce space usage. Answering whether b_j < a_i is done via the following algorithm:

// Return value of `false` means that b_j is not older than a_i, but not necessarily
// that b_j is newer than a_i.
// Assumes shadows(v, S) is true iff v shadows some vertex in the set S, and
// shadows(S, v) is true iff some vertex in S shadows v.
function older(b_j, a_i) {
  let current = a_i;

  while (current != null) {
    if (shadows(b_j, current.ind)) {
      return false;
    } else if (shadows(current.ind, b_j)) {
      return true;
    } else {
      // Walk back to the vertex relative to which the current index stores
      // its shadow difference.
      current = get_message(current.feed, current.seqnum - sqrt(current.feed.length)); // null if second argument negative
    }
  }

  return false;
}

In the worst case, this requires len_a / sqrt(len_a) = sqrt(len_a) iterations.

Whenever the feed length reaches a new square number, all indices need to be recomputed. This yields amortized O(sqrt(n)) incremental index update costs (every sqrt(n) messages, all n messages need to be scanned). Where this is undesirable, the scheme can instead use the current sequence number to determine how much data to cache in a vertex: vertex a_i would store shdiff(reach(a_i), ind(a_(i - floor(sqrt(i))))), and the query algorithm needs to be updated accordingly.
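A small illustration of this variant (assuming sequence numbers start at 1, which is a guess on my part): the anchor of a_i is determined only by i, never by the current feed length, so existing indices stay valid as the feed grows.

```javascript
// Sequence number of the anchor vertex for a_i in the floor(sqrt(i)) variant;
// returns null when there is no earlier anchor.
// Assumption: sequence numbers start at 1.
function anchorSeqnum(i) {
  const j = i - Math.floor(Math.sqrt(i));
  return j >= 1 ? j : null;
}
```

For example, a_100 stores its difference relative to a_90, and a_10 relative to a_7.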

The sqrt-2 Indexing Scheme

Now we introduce another concept for significantly reducing index size, while keeping roughly the same average-case query time at the cost of a worse worst-case query time. The core idea is that not every vertex needs to store index information; instead, it suffices if only some deliberately placed "cache vertices" store indices.

The "sqrt-1" scheme is modified to only store data in every sqrt(len_a)-th vertex. Intuitively, to answer a query, you first traverse the feed until you hit a cache vertex, and then you traverse only cache vertices. Finding the first cache vertex takes at most sqrt(len_a) iterations, so the overall time complexity stays the same. The total index size, however, is divided by sqrt(len_a), which makes an actual difference: O(len_a) turns into O(len_a / sqrt(len_a)) = O(sqrt(len_a)).
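One possible placement rule, sketched below — the post does not fix a concrete rule, so both the function name and the choice "seqnums divisible by the step size" are my assumptions: store an index exactly at every floor(sqrt(len_a))-th vertex.

```javascript
// Hypothetical placement rule for cache vertices: a vertex stores an index
// iff its sequence number is a multiple of floor(sqrt(feedLength)).
function isCacheVertex(seqnum, feedLength) {
  const step = Math.floor(Math.sqrt(feedLength));
  return seqnum % step === 0;
}
```

With len_a = 100 this places indices at seqnums 10, 20, ..., 100, i.e. at sqrt(len_a) = 10 cache vertices.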

The previous paragraph omits an important fact that worsens the query time: in addition to finding the first cache vertex, we also need to consider everything reachable from the starting vertex up until the first cache vertex. This can include outgoing edges to vertices on other feeds, which might not be cache vertices either. To answer a query, you need to recursively compute temporary index sets, i.e. shadow sets of the union of all out-neighbors' index sets (which may again be computed on the fly). In the worst case, this can cascade across many feeds and messages. It is even possible that reachability information for multiple vertices from the same feed needs to be computed.
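The recursive computation of temporary index sets could look roughly like this. This is a sketch, not the definitive algorithm: the vertex layout `{feed, seqnum, out, ind}` (with `ind` null on non-cache vertices) is my assumption, as is including the vertex itself in its own temporary index.

```javascript
// Shadow set of a vertex set: keep only the newest vertex per feed.
function shadowSet(V) {
  const newest = new Map();
  for (const v of V) {
    const cur = newest.get(v.feed);
    if (cur === undefined || v.seqnum > cur.seqnum) newest.set(v.feed, v);
  }
  return [...newest.values()];
}

// Temporary index of a vertex: its stored index if it is a cache vertex,
// otherwise the shadow set of itself plus the (recursively computed)
// temporary indices of all its out-neighbors.
function effectiveIndex(vertex) {
  if (vertex.ind !== null) return vertex.ind; // cache vertex
  const union = [{ feed: vertex.feed, seqnum: vertex.seqnum }];
  for (const n of vertex.out) union.push(...effectiveIndex(n));
  return shadowSet(union);
}
```

Each recursive call on a non-cache vertex is exactly the cascading cost described above; it stops only once every explored path has reached a cache vertex.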

Scattering cache vertices like this is thus only reasonable if we can assume that message graphs which result in computationally expensive queries are unlikely. And they are: they only arise if a large swath of new messages from different feeds reference each other, and those new messages also happen to not include any cache vertices. So high query times are unlikely to occur by accident, and exploiting them for denial-of-service attacks would require a large number of malicious feeds, which could be mitigated by simply blocking them.

There is also a variety of mechanisms for improving these queries: when choosing between different out-neighbors of a vertex, implementations should prefer vertices with a low distance to the next cache vertex, to increase the probability that the query can be answered directly from that cache (and its preceding caches). Additionally, implementations can explore the graph via breadth-first search (rather than the depth-first search the obvious recursive implementation would yield) to minimize the probability that the same feed needs to be visited multiple times. To protect against malicious input, it might make sense to include some randomization. And finally, if answering a query required traversing many vertices, the implementation can simply add the newly computed shadow sets to appropriate vertices, speeding up subsequent queries in the same area of the graph. This last point increases the complexity, but also makes the whole approach much more viable.
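The breadth-first variant could be sketched as follows, assuming a vertex layout `{feed, seqnum, out, ind}` with `ind` null on non-cache vertices (my naming, not fixed by the post). The visited set is what keeps any vertex from being expanded twice, which is the point of preferring breadth-first exploration here.

```javascript
// Shadow set of a vertex set: keep only the newest vertex per feed.
function shadowSet(V) {
  const newest = new Map();
  for (const v of V) {
    const cur = newest.get(v.feed);
    if (cur === undefined || v.seqnum > cur.seqnum) newest.set(v.feed, v);
  }
  return [...newest.values()];
}

// Breadth-first on-the-fly index: expand non-cache vertices level by level,
// stop at cache vertices and take over their stored index entries.
function temporaryIndexBfs(start) {
  const visited = new Set();
  const collected = [];
  const queue = [start];
  while (queue.length > 0) {
    const v = queue.shift();
    const key = v.feed + "/" + v.seqnum;
    if (visited.has(key)) continue; // never expand the same vertex twice
    visited.add(key);
    collected.push({ feed: v.feed, seqnum: v.seqnum });
    if (v.ind !== null) {
      collected.push(...v.ind); // cache vertex: use stored entries, stop here
    } else {
      queue.push(...v.out); // non-cache vertex: keep exploring
    }
  }
  return shadowSet(collected);
}
```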

Due to these considerations, any indexing scheme presented in this document can be modified to only store indices every so often. Since this is an engineering concern rather than a real design question, we will ignore it from now on.

continued in next post...
