@aljoscha %w5aLKFIuTttR8o6QB8SNOb/iqhiWDbAzmwaJIZQBjSA=.sha256
Re: %QJEpN8LN1

...continuation of the previous post

Deduplication

A database might choose to store content completely separately from metadata, and look up content by hash. This results in a very nice property: if multiple messages have the same content, that content is only stored once. Currently, an application developer has to use blobs to achieve that effect, which means giving up on the nice logical data model of messages, as well as the ability for plugins to inspect the content (e.g. to automatically create database indices for cypherlinks). With offchain content, devs get the best of both worlds.
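
To make the lookup-by-hash idea concrete, here is a minimal sketch of such a content-addressed store (typescript, using node's crypto module). The ContentStore class, its method names, and the hash encoding are made up for illustration; they are not taken from any ssb spec or implementation.

```ts
import { createHash } from "crypto";

// Hypothetical content-addressed store: content is keyed by its hash, so the
// same content referenced from many messages is stored exactly once.
class ContentStore {
  private contents = new Map<string, Buffer>();

  // Store content and return the key that a metadata entry would reference.
  put(content: Buffer): string {
    const key =
      createHash("sha256").update(content).digest("base64") + ".sha256";
    if (!this.contents.has(key)) {
      this.contents.set(key, content); // deduplicated: a second put is a no-op
    }
    return key;
  }

  get(key: string): Buffer | undefined {
    return this.contents.get(key);
  }
}
```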

Conceptual Clarity

This is a fuzzy/subjective one. Basically, the sigchain currently stores both metadata and data. Blobs are also data, but not part of the sigchain. With offchain content, the sigchain would consist purely of metadata, and all actual data would be stored off-chain. I personally think it is more elegant that way.
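
To illustrate what "purely metadata" could mean concretely, here is a rough sketch of a metadata-only sigchain entry. The field names are invented for this sketch and are not meant as a concrete format proposal.

```ts
// Invented field names, just to illustrate the split: everything below is
// metadata and lives on the sigchain, while the content itself is stored
// off-chain and is only reachable through contentHash.
interface OffchainMetadata {
  author: string;           // feed id
  sequence: number;         // position within the feed
  previous: string | null;  // hash of the previous metadata entry
  timestamp: number;
  contentHash: string;      // key into the off-chain content store
  contentSize: number;
  signature: string;        // signs the metadata; content is covered via its hash
}
```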

Taken to the extreme, you could argue that in such a setting, having both blobs and messages is redundant. If messages have no size limit and deduplicate automatically, can't we drop blobs completely? (For the sake of the argument, let's also assume that message content is not forced to be a map. The details of this don't belong in this post, but it can (and imo should) be done. Also for the sake of the argument, consider that #hsdt allows byte strings, so a message's content could simply be the binary content of an arbitrary file.) There are still some differences (messages have a mandatory type, blobs can be conjured into existence without a corresponding metadata entry in the sigchain), but it is interesting how much these two would converge. This won't really influence the decision on offchain content, but we should keep this in mind for later.

Cons

As I am clearly a proponent of offchain content, this section may be lacking. I didn't consciously omit any arguments to be more persuasive, but I may well have missed some.

More Data Overall

The total amount of data stored/transmitted/hashed is increased by one hash per message.

More Hashing

One more hash per message needs to be computed during verification, and the total number of bytes that need to be hashed increases by the size of each content's hash.
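
For concreteness, a sketch of where that extra computation happens, again with invented names and assuming sha256 content hashes:

```ts
import { createHash } from "crypto";

// Sketch of the additional verification step: besides checking the signature
// over the metadata (not shown), a peer now also hashes the received content
// and compares it against the hash recorded in the metadata entry.
function verifyContent(
  meta: { contentHash: string },
  content: Buffer
): boolean {
  const actual =
    createHash("sha256").update(content).digest("base64") + ".sha256";
  return actual === meta.contentHash; // the one extra hash computation per message
}
```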

Yet Another Change

In particular, this would warrant new replication rpcs. There's a simplistic #blob-content approach for implementing this, where we'd use actual ssb blobs to store message content. The old replication methods could be used as usual, and clients would then request the content blobs whenever they want to display them. But that requires more roundtrips, as well as flooding the network with blob requests, undoing the efficiency gains that messages get from ebt replication.

So #blob-content just doesn't cut it. Instead, we'd need to implement replication rpcs that explicitly deal with the new situation. The simplest ones would be "Only send metadata" and "Send each message's metadata followed by its content" (well, and in theory there's also "Only send content", but that one is useless in practice). Another reasonable one would be "Send all metadata, and send content in a separate stream with independent backpressure". And then there's the potential for more complex ones, as sketched in the "Faster Initial Sync" section of this post.
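
To give a feel for what such rpcs could look like, here is a hedged sketch of a single feed request; the names are made up for this post and don't correspond to any existing or proposed ssb rpc.

```ts
// Purely illustrative names, not an actual rpc definition: the point is that
// metadata and content delivery are requested independently.
type ContentDelivery =
  | "none"             // only send metadata
  | "inline"           // send each message's metadata followed by its content
  | "separate-stream"; // send content in its own stream with independent backpressure

interface FeedRequest {
  feed: string;    // which feed to replicate
  from: number;    // first sequence number to send
  content: ContentDelivery;
}
```

Requesting metadata and content independently like this is also what makes the combinations discussed below fall out naturally.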

You can look at that and remark that clearly offchain-content makes replication much more complicated. But in some sense, it breaks down replication into two orthogonal components, metadata and content. The current system complects them, and the resulting replication is rather crude: all or nothing. With metadata and content as orthogonal vectors, the rpcs can combine these two to address arbitrary points in the spanned vector space (it isn't really a vector space in the mathematical sense, but it gets surprisingly close on an intuitive level).

Concluding Remarks

Selective deletion alone makes offchain-content worth it in my opinion. We can't ignore the laws that mandate deletion of certain pieces of data, and staying compliant without having to delete and block whole feeds is a huge win. Compared to that, deduplication, the message size limit, conceptual clarity and the possibility of feed compression are mostly nice extras. More efficient initial sync however is another big deal, especially with mmmmm's struggle to get a reasonable mobile experience.

The overall increase in data and hashing would most likely be offset by the use of clever replication rpcs. And as usual, I remain blissfully ignorant of the fact that these changes need to be implemented. Sbot is a working ssb implementation, so from its perspective, these changes mean reimplementing a bunch of stuff that already works sufficiently well. But from the perspective of future protocol implementors, it really doesn't make a big difference, especially if all legacy messages can be handled by a single, easy-to-implement rpc. And I care more about the potentially large number of future implementors than about the current javascript implementation. As one of those potential future implementors, I may be biased though... =)
