@aljoscha %QJEpN8LN1t3BrIkUQ3WoOMWRsMArbVUZCpTeBYcuqfw=.sha256

Moving message content out of the sigchain

(aka #offchain-content aka #blob-content)

The building blocks of the scuttleverse are the users' sigchains. Each time you post a message, it is appended to your identity's sigchain, a sort of personal blockchain. This is accomplished by running the whole message - both metadata and content - through a hash function, and using those hashes as the cypherlinks that build the sigchain. This post talks about a possible change to that scheme: What if, instead of running the full message content through the hash function, you used only a hash of the content to compute the message's hash? This suggestion has been brought up independently at multiple points in the past. In this post, I'll summarize the pros and cons of this alternate approach.

A tiny aside on terminology: The #blob-content hashtag has been proposed to refer to this concept. But the actual implementation will very likely not involve blobs at all, so I'm using #offchain-content instead, which describes what the proposal is about rather than referring to implementation details.


If ssb only wanted to do efficient data replication, it would not need a sigchain; any linked list would do. What the sigchain gives us is data integrity. Given any message in a feed, ssb can (and does) verify its authenticity by traversing the list of previous messages, checking that all cypherlinks are indeed correct. These checks necessarily need to access all the data that was used to compute those hashes. In particular, this currently means the content of all messages: all content has to be stored, else message verification cannot work. But if the cypherlink of a message only depended on its metadata and a hash of its content, then you'd only have to store the hash of the content to preserve verifiability of the whole sigchain. From this observation, we can derive the arguments that support offchain content.
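
To make the difference concrete, here is a minimal TypeScript sketch. The hash function, field names and serialization are illustrative stand-ins, not the actual ssb wire format:

    import { createHash } from "crypto";

    // Illustrative stand-in for ssb's hashing; real ssb hashes a
    // specific json encoding and formats ids differently.
    const hash = (bytes: string): string =>
      createHash("sha256").update(bytes).digest("base64") + ".sha256";

    interface Metadata {
      previous: string | null; // cypherlink to the previous message
      author: string;
      sequence: number;
      timestamp: number;
    }

    // Current scheme: the full content goes into the message hash, so
    // verification requires storing every content forever.
    function messageIdOnChain(meta: Metadata, content: string): string {
      return hash(JSON.stringify(meta) + content);
    }

    // Proposed scheme: only hash(content) goes into the message hash,
    // so storing the content hash alone preserves verifiability.
    function messageIdOffChain(meta: Metadata, contentHash: string): string {
      return hash(JSON.stringify(meta) + contentHash);
    }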

Pros

Selective Deletion

Sometimes, you may want to delete a message, maybe because it offends you, or because its content is illegal in your nation's jurisdiction. But under the current system, you can not delete message content, so you'd have to delete the whole feed. This is not always appropriate. With offchain content, you'd get that ability. You could give a list of banned hashes to your ssb implementation, and whenever it receives a message whose content hashes to one of those banned values, the content does not get saved. You could even get fancy and store socially curated banlists on ssb itself.
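
A rough sketch of how that filter could look at ingestion time (reusing hash() and Metadata from the sketch above; the actual storage call is elided):

    // Contents whose hash is banned are dropped; the metadata, which
    // carries the content hash, is kept so the sigchain still verifies.
    const banned = new Set<string>();

    function ingest(meta: Metadata & { contentHash: string }, content?: string) {
      if (content !== undefined && banned.has(meta.contentHash)) {
        content = undefined; // never persist banned content
      }
      // persist meta unconditionally, and content only when present
    }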

Users can be required by law to delete specific content. Right now, that would mean blocking the whole feed, so these laws would wipe out full ssb identities, even if they only posted a single message that violates them. Offchain content fixes that.

Faster Initial Sync

When downloading data from the scuttleverse for the first time, or after a longer offline period, you need to wait a long time for all downloads to finish. In many cases, you might want to look at the newest messages first. But ssb has to verify their integrity before displaying them, and that's a good thing - let's not optimize performance by forfeiting security guarantees. But to verify the new messages, we first have to download all previous messages, and in particular all previous messages' content. With offchain content, the total amount of data needed for verification would shrink significantly. When syncing, the ssb server could first request all the metadata (including content hashes), and then the content in reverse chronological order. This could drastically reduce the time needed for apps to become functional when performing a large sync.

This proposed replication scheme is not appropriate in all cases, but there's nothing forcing us to always use it. And we could do clever rpcs like "Sync in chronological order, unless there are more than n messages; in that case, send metadata first and then the contents in reverse chronological order.". Nothing forces us to implement all of these at once, but there is vast potential for speeding things up.

Another useful potential replication rpc here: "Sync in chronological order, but skip all message contents larger than k bytes, I'll get those later on demand, if I really need them."
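
These options could be encoded as a single hypothetical request type (nothing like this exists in sbot today; the names are made up):

    type SyncRequest =
      | { mode: "chronological" }                        // today's behaviour
      | { mode: "metadata-first" }                       // then contents, newest first
      | { mode: "skip-large"; maxContentBytes: number }; // big contents on demand

    // "Chronological, unless more than n messages are missing":
    function chooseRequest(missing: number, n: number): SyncRequest {
      return missing <= n ? { mode: "chronological" } : { mode: "metadata-first" };
    }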

Lifting the Message Size Limit

There is currently a hard(-ish, unicode and ssb are weird) limit on the maximum size of ssb messages. That's the reason why I will have to split this post into multiple parts. This makes sense: a reasonably-sized sigchain is essential for ssb to work. But if we only stored a hash of the content, then the content size would not affect the sigchain size any more. For most users of the current ssb applications, the practical impact will be writing posts without having to worry about hitting a size cap. For developers in general, it simplifies the API for sending messages, removing the need to handle the case of messages that are too large.

Whether it still makes sense to enforce a new limit, albeit a much higher one than today's, will need more careful discussion. Blobs also currently have a size cap, even though they don't bloat the sigchains. I think #bpmux will make it possible to completely lift the limit (or rather to set it to 2^64 - 1), which in practice amounts to the same as having no message size limit at all, since the operating system would terminate any application that tried to deal with values of that size anyways.

Feed Compression, Garbage Collection, and Forgetting Content

Some feeds may get very large. But you probably won't be constantly accessing all the data from years ago. With the change, you could locally delete old, rarely accessed data, and simply download it from the network again if you do happen to need it.

Now, there are risks involved with this - if everybody did this, data might get lost. So doing this on all devices might not be a good idea. But just having the option of doing this on devices with limited hard drive space expands the range of settings where ssb can be useful. Looked at another way, these very risks are a feature in the case of illegal content.

And maybe, you might want some content of yours to disappear from the network. Ssb is a distributed setting, so you can't force anyone to delete it. But you can ask nicely. And since ssb works by surrounding yourself with trusted "friends", there's a good chance they will respect your wish and delete the message content and add its hash to their local list of banned hashes.

continued in the next post...

@aljoscha %w5aLKFIuTttR8o6QB8SNOb/iqhiWDbAzmwaJIZQBjSA=.sha256

...continuation of the previous post

Deduplication

A database might choose to store content completely separated from metadata, and look up content by hash. This results in a very nice property: if multiple messages have the same content, it is only stored once. Currently, an application developer has to use blobs to achieve that effect, which means giving up on the nice logical data model of messages, as well as the ability for plugins to inspect the content (e.g. to automatically create database indices for cypherlinks). With offchain content, devs get the best of both worlds.
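
A sketch of such a content-addressed store (again reusing the hash() stand-in from the first sketch):

    // Content is keyed by its hash, so identical contents across any
    // number of messages occupy a single entry.
    class ContentStore {
      private byHash = new Map<string, string>();

      put(content: string): string {
        const h = hash(content);
        if (!this.byHash.has(h)) this.byHash.set(h, content);
        return h; // this is what the message's metadata references
      }

      get(h: string): string | undefined {
        return this.byHash.get(h);
      }
    }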

Conceptual Clarity

This is a fuzzy/subjective one. Basically, the sigchain currently stores both metadata and data. Blobs are also data, but not part of the sigchain. With offchain content, the sigchain would consist purely of metadata, and all actual data would be stored off-chain. I personally think it is more elegant that way.

Taken to the extreme, you could argue that in such a setting, having both blobs and messages is redundant. If messages have no size limit and deduplicate automatically, can't we drop blobs completely? (For the sake of the argument, let's also assume that message content is not forced to be a map. The details of this don't belong in this post, but it can (and imo should) be done. Also for the sake of the argument, consider that #hsdt allows byte strings, so a message content could simply be the binary content of an arbitrary file.) There are still some differences (messages have a mandatory type, blobs can be conjured into existence without a corresponding metadata entry in the sigchain), but it is interesting how much these two would converge. This won't really influence the decision on offchain content, but we should keep it in mind for later.

Cons

As I am clearly a proponent of offchain content, this section may be lacking. I didn't consciously omit any arguments to be more persuasive, but I may have missed a bunch of them.

More Data Overall

The total amount of data stored/transmitted/hashed is increased by one hash per message.

More Hashing

One more hash per message needs to be computed during verification, and the total number of bytes that need to be hashed increases by the length of one content hash per message.

Yet Another Change

In particular, this would warrant new replication rpcs. There's a simplistic #blob-content approach to implementing this, where we'd use actual ssb blobs to store message content. The old replication methods could be used as usual, and clients would then request the content blobs when they want to display them. But that requires more roundtrips, as well as flooding the network with blob requests, undoing the efficiency gains that messages get from ebt replication.

So #blob-content just doesn't cut it. Instead, we'd need to implement replication rpcs that explicitly deal with the new situation. The simplest ones would be "Only send metadata" and "Send each message's metadata followed by its content" (well, and in theory there's also "Only send content", but that one is useless in practice). Another reasonable one would be "Send all metadata, and send content in a separate stream with independent backpressure". And then there's the potential for more complex ones, as sketched in the "Faster Initial Sync" section of this post.

You can look at that and remark that clearly offchain-content makes replication much more complicated. But in some sense, it breaks replication down into two orthogonal components, metadata and content. The current system complects them, and the resulting replication is rather crude - all or nothing. With metadata and content as orthogonal vectors, the rpcs can combine the two to address arbitrary points in the spanned vector space (it isn't really a vector space in the mathematical sense, but it gets surprisingly close on an intuitive level).
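
To illustrate the intuition, the two axes could literally be two independent fields of a single hypothetical request type:

    interface ReplicationRequest {
      metadata: "none" | "all";
      content: "none" | "inline" | "separate-stream" | { maxBytes: number };
    }
    // { metadata: "all", content: "none" }   - metadata only
    // { metadata: "all", content: "inline" } - roughly today's replication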

Concluding Remarks

Selective deletion alone makes offchain-content worth it in my opinion. We can't ignore the laws that mandate deletion of certain pieces of data, and staying compliant without having to delete and block whole feeds is a huge win. Compared to that, deduplication, the message size limit, conceptual clarity and the possibility of feed compression are mostly nice extras. More efficient initial sync however is another big deal, especially with mmmmm's struggle to get a reasonable mobile experience.

The overall increase in data and hashing would most likely be offset by the use of clever replication rpcs. And as usual, I remain blissfully ignorant of the fact that these changes need to be implemented. Sbot is a working ssb implementation, so from its perspective, these changes mean reimplementing a bunch of stuff that already works sufficiently well. But from the perspective of future protocol implementors it really doesn't make a big difference, especially if all legacy messages can be handled by a single, easy-to-implement rpc. And I care more about the potentially large number of future implementors than about the current javascript implementation. As one of those potential future implementors, I may be biased though... =)

User has not chosen to be hosted publicly
@cel %Q4VVsun2nI4TbyO4JNyBIeH5W8pum4jBkjj4ic3WXaM=.sha256

I wonder how #blob-content would affect indexing. Same for e.g. streaming message metadata in order and then message content in reverse chronological order. How much do the current indexes assume/require that messages in a feed arrive in order? Currently messages received through ssb-ooo are not indexed, except for retrieval by id. But it is useful for applications to be able to find messages using other indexes.

I think we could implement deleting messages with current messages/feeds. Senders and requesters of messages could keep track of banned messages and the message previous to each one. When fetching a feed, if you find that your replication state says the latest message you have for that feed is one of the ones previous to a banned message, you change it to the id of the banned message and increment your sequence number for that feed's replication state. Then proceed replicating, and you will have validated the feed except for the banned message, trusting the external info about which message the banned message followed. I think the sequence number helps here because it means you can verify that the banned message took up exactly one value in the sequence space (and not a negative value, which would allow cycles). When sending a feed to a peer, if you are about to send a banned message, you omit it, maybe sending a note referring to a message where someone asserted that the message was banned, and then resume sending the feed. Or, to save bandwidth when replicating to peers that don't support this protocol, just stop sending messages for the feed when the banned message is reached. The peer should then notice that the latest message they got is one that precedes a banned message, update their state to skip the banned message, and resume replicating by requesting the feed from the next sequence number.
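
A sketch of that skip rule (all names invented; the surrounding replication loop is elided):

    // Replication state per feed; bannedPrev maps the id of the message
    // preceding a banned one to the banned message's id and sequence.
    interface FeedState { latestId: string; sequence: number }
    const bannedPrev = new Map<string, { bannedId: string; bannedSeq: number }>();

    function advancePastBanned(state: FeedState): FeedState {
      const ban = bannedPrev.get(state.latestId);
      if (ban && ban.bannedSeq === state.sequence + 1) {
        // trust the external info that the banned message followed
        // latestId and consumed exactly one sequence number
        return { latestId: ban.bannedId, sequence: ban.bannedSeq };
      }
      return state;
    }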

About hashing message content without metadata: wouldn't it mean a message id would not point to a unique message? If the id is the hash of the content, it might be that many feeds published the same content, at different times. How would a UI show this?

User has not chosen to be hosted publicly
@Dominic %FEQnvdFPqDPDonvyyt15Pa7EI9ULQ8u9czikYsz7YQw=.sha256

there are some indexes where the order of messages is significant. for example, whether a block comes after or before a follow determines whether you are following or blocking someone!

  1. this needs order-free indexes. If something like this was enabled, it would mean that messages may come in different orders, and we'd need to use indexes that behave consistently. That is a pretty good idea anyway, since private groups will introduce that sort of thing too - you might be added to a group later, and so would have to go back and reindex, which might be expensive, or you could design indexes in a way that it doesn't matter so much.

  2. at the time we introduce a new format, we could make the signing algorithm be sign(metadata(message) + hash(message.content), sk). Then you could still verify a message knowing the metadata and the hash of the content, even if you did not know the content... hmm, the hash would then have to be hash(metadata + hash(content) + signature). Hashing the message in two stages like this would be easy, but fully integrating separated content would be hard. However, if we had a separate content hash, it would enable separated content to exist, even if we didn't implement that soon. (we'd need to have 1, order-free indexes, squared away first anyway)

There are lots of other complications this would introduce, that would need to be discussed, but we don't need to actually commit to implementing it, or finishing that discussion, to have 2.
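
As a sketch of the two-stage construction from point 2, using node's ed25519 signing and the hash() stand-in from the first post (encodings are illustrative):

    import { generateKeyPairSync, sign } from "crypto";

    const { privateKey } = generateKeyPairSync("ed25519");

    function sealMessage(metadata: string, content: string) {
      const contentHash = hash(content);
      const signature = sign(null, Buffer.from(metadata + contentHash), privateKey)
        .toString("base64");
      // the message id commits to metadata, content hash and signature -
      // all verifiable without the content itself
      const messageId = hash(metadata + contentHash + signature);
      return { contentHash, signature, messageId };
    }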

@aljoscha %El2y48D2qLtqg8hf9G70SiQuHB69K30tJ+RS3xd3NsQ=.sha256

Ugh, I forgot to mention another relevant counter argument, I really should stop writing these things in one go just before going to sleep...

offchain-content adds another failure mode: it is possible for the server to be aware that there is some message at a certain point in the chain, but to not have its content. That means a breaking change to the ssb-client API: the content entry would suddenly be allowed to be undefined. Clients can currently rely on it always being an object (with at the very least a type key). This is a big one, forcing all application developers to adapt.
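
Expressed as types, the break looks roughly like this (field names illustrative):

    // before: clients can rely on content always being an object
    interface MsgValueOld {
      content: { type: string; [k: string]: unknown };
    }

    // after: only the content hash is guaranteed; the content itself
    // may be absent, and every client has to handle that case
    interface MsgValueNew {
      contentHash: string;
      content?: { type: string; [k: string]: unknown };
    }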

This also means that replication rpcs need to be able to account for this situation, but that's just an implementation detail to keep in mind, not really an issue.


@moid

Messages aren't that large on average, for the most part folks aren't writing lengthy multi-part posts.

In the larger picture, ssb is an arbitrary database, not just a social network. I can imagine lots of use-cases where all of these - a negligible size limit, deduplication, and indexability - are extremely helpful, e.g. storing the dependency graph of a decentralized package manager, or messages that represent directories in a file system.

Constantly having to worry whether your messages hit an arbitrary limit can be a showstopper for certain kinds of applications, especially those that interact more with machines than with humans (who are able to split up their output on their own). There'd probably come a point where different applications develop different, inefficient, ad-hoc ways to get around the message limit.

but I can't see lots of regular messages that would be duplicated.

Again, that's just if you consider social networks.

[...] you'll pay a lot for it in terms of greatly increasing the number of calls to fetch content.

As I stressed in the original post, this would not be implemented via blobs, but rather as specialized rpcs. There would be no additional roundtrip cost whatsoever.


@cel

I think we could implement deleting messages with current messages/feeds

That is an interesting proposal, but it does rely on fully trusting your replication peers, something that both Dominic and I are reluctant to do.

About hashing message content without metadata: wouldn't it mean a message id would not point to a unique message?

Nope, I guess I was unclear on that one. Offchain-content would compute the hash of a message as hash(concat(all_the_currently_hashed_stuff_except_the_content_field, hash(content))).


@frankiebee

Why not just write a client that implements many different protocols rather than change ssb to fit a large use case.

I honestly didn't see this proposal as expanding the scope of ssb. It doesn't add new functionality to its API, it just improves upon things that are already there. And I won't stop trying to improve ssb details just because they are already kinda working.

in any distributed system selective deletion sounds like a lie.

I am well aware of that. But this is not about deleting data on other machines, it is about deleting data from your own machine, while still being able to verify messages from the author.


@Dominic

there are some indexes where the order of messages is significant

True, but the order this depends on is the order of the sigchain, not some arbitrary replication order. If the current implementation relies on those two matching, then this will mean additional work for sbot, but it isn't a conceptual problem in general.

There are lots of other complications this would introduce, that would need to be discussed,

Ack, this post was only an introduction to the concept itself, not about actual implementation and roll-out.


If you posted a response but I didn't @mention you back, then you are outside of my replication range and I didn't get the response.

@cel %Gaz+3sKPuVRWH7wTQcBlTZZMXXi7VQQgr4JL7W54OXI=.sha256

@Aljoscha

I think we could implement deleting messages with current messages/feeds

That is an interesting proposal, but it does rely on fully trusting your replication peers, something that both Dominic and I are reluctant to do.

I don't think it should require fully trusting replication peers. Replication peers would be able to deny sending you messages from someone's feed - but they can do that already. Decisions would be local: whether to skip a sequence number when requesting a feed if you expect a banned message at that sequence number (vs. keep trying to replicate it, waiting for a peer who doesn't consider it banned to send it to you), whether to drop a received message if it is a banned one, and whether to omit sending a message that is banned. The interesting part is which messages to consider banned - which would be common to any such implementation of message banning, I think. I would imagine people could publish a message identifying a message to ban, along with the reason, the sequence number, and the previous message id. Then you could consider a message to be banned based on some configurable heuristic including the number of people who declared it banned, your follow-hops to them, etc.
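
One possible shape for such a heuristic (thresholds and inputs entirely made up):

    interface BanDeclaration { author: string; hops: number } // follow-hops to declarer

    function isBanned(declarations: BanDeclaration[],
                      minDeclarers = 2, maxHops = 2): boolean {
      // count only declarations from people close enough in the follow graph
      const close = declarations.filter(d => d.hops <= maxHops);
      return close.length >= minDeclarers;
    }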

Thanks for clarifying the bit about message id hashing. Being able to calculate the message id from the hash of the content instead of the content itself seems like a good idea to me for the next version, to help enable separate replication.

@aljoscha %P4Hb3N6IMpKrwkGa9efmQcuvyXuuHmX4AxXzQGY+IHI=.sha256

@cel

[...] trusting the external info about what message the banned message followed.

This is the part where a malicious peer could trick you (if I understood your scheme correctly). Say our sigchain is m3 -> m2 -> m1 -> nil, and m2 is banned. When replicating with a malicious peer, it could give you m3, tell you that m2 is banned, and then give you some m1' in place of m1. Since you don't receive the banned m2, you take the peer's word that m2's backlink pointed to m1', which is not true.

Due to signatures, you could only lie about your own feed this way, but that is still enough to violate the guarantees ssb currently gives.

The interesting part is which messages to consider banned - which would be common to any such implementation for message banning, I think.

Yes, that's a super interesting question. But it should remain in user-space, not leak into the protocol. The protocol does not need to know about this, just like ssb does not inherently need any sort of follow graph. We can't possibly find the one and only correct and optimal approach that works for all use-cases in all settings, so the protocol should instead remain completely agnostic and just provide the building blocks for custom solutions. In all likelihood, a de-facto standard plugin will emerge, just like ssb-friends. But nobody is locked into using it.

@Dominic %v4PDc1MrUCfjIjIyC35TH4idK1wtQ8ZfLPVcdBjFFIE=.sha256

@aljoscha another possibility, although it is a more complicated change than separately hashing the content, would be a merkle tree hash of the entire chain, instead of just linking to the previous message. Then it would be possible to prove that any two messages are in the same chain (that the later one indirectly signs the former) without actually having all the intermediate messages. (dat supports this, but I felt its interactions with indexes were too complicated - I'm not against introducing a new log format that supports this, though)
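
A toy example of the proof side - this is just the generic merkle audit-path check, not dat's actual format, and it reuses the hash() stand-in from the first post:

    // Recompute the root from a message id and its audit path; if it
    // matches, the message is committed to by that root, with no
    // intermediate messages needed.
    function verifyInclusion(
      leaf: string,
      path: Array<{ sibling: string; side: "left" | "right" }>,
      root: string,
    ): boolean {
      let acc = hash(leaf);
      for (const step of path) {
        acc = step.side === "left"
          ? hash(step.sibling + acc)
          : hash(acc + step.sibling);
      }
      return acc === root;
    }

The audit path grows logarithmically with the chain length, which is what makes such proofs cheap.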

@aljoscha %T/OsKkgosORjseDBjtIOibXLPi8oegoNkZ7FLKl9v60=.sha256

@Dominic I didn't dare go down that road on my own, but I'm open to exploring it. Maybe we'd want to add some sort of type tag to feeds, similar to those I'd like to see on messages? Then we could have legacy feeds, offchain-content feeds, and possibly later partially verifiable or even partially subscribable ones.

@aljoscha %vUgqP0gVONU0L9RZSmLt25ckqXlofa/4bXNEELUBi4I=.sha256

Addendum to the idea of adding "some sort of type tag to feeds": Feeds are not really manifest entities in the sense messages are, so you can't really attach metadata to them. They only exist as the target of cypherlinks. But: we have sigils (or sigil ids in a binary encoding) on cypherlinks that describe the kind of target; those could serve as some kind of distinguishing feature should we need one.

@Dominic %GDNTWCfnsT8pu2rfVU/CpRwidu4NJIwUopeIB8QfPqw=.sha256

@aljoscha we could have really any number of types of feeds. I think I'd keep '@' to id a feed, and just append something else to the other end. (the '.ed25519' suffix names the signature algorithm the feed uses, but it doesn't actually do anything yet.) We just put some sort of signifier there so that it would be possible to distinguish it from other variations we might use in the future. In a binary format, that would be replaced with a varint (and I guess we'd allocate ranges to people experimenting here) - probably, just use multiformats.
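
For reference, the multiformats-style unsigned varint is just LEB128: 7 bits per byte, with the high bit as a continuation flag. A minimal sketch:

    function encodeVarint(n: number): Uint8Array {
      const out: number[] = [];
      do {
        let byte = n & 0x7f;  // low 7 bits
        n >>>= 7;
        if (n !== 0) byte |= 0x80; // continuation bit
        out.push(byte);
      } while (n !== 0);
      return Uint8Array.from(out);
    }
    // encodeVarint(0x12) -> [0x12]; encodeVarint(300) -> [0xac, 0x02]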

In particular, I'm really interested in bls group signatures ... that will also require some additional validation rules if you want to be able to add or remove directors (because the key will change)

Supporting partial replication ranges for a feed will break some things (indexes where the latest message of a type sets the value), but would be very useful for things where having the latest record is more important than the entire history - for example, the weather forecast.

But, all these things are extras that would be cool to have, but that we can work on later.

@aljoscha %QIzhknNB67FFH5w9Wjv+d2dKVN/60HPLJkKggMn7PqM=.sha256

@Dominic

But, all these things are extras that would be cool to have, but that we can work on later.

Fully agreed, all we need to do for now is to keep things extensible yet compact.

@aljoscha %i5c3ZSLEfE7Nco3Nkkxfnts3BYAT2c9hVUUblR72jwg=.sha256

I think I'd keep '@' to id a feed, and just append something else to the other end .

Ack on that, analogously to message cypherlinks.


Another pretty valuable replication option enabled by offchain-content: Only fetching the content of messages with a certain type. That way, my non-coder friends would not have to download all the data that e.g. a package manager app could add to my feed. They might still choose to do so anyways, but having the option to opt-out in resource-constrained settings is nice.

An advanced mechanism you could build on top of this involves all applications registering which types of messages they deal with at the server, and the local server then only fetching those. Or if storage overhead is not a problem, it could still fetch all others, but only after the directly useful message contents have been fetched, improving perceived performance.
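
A sketch of that registration idea (everything here is hypothetical, and it assumes the message type would be visible in the metadata rather than buried in the content):

    const wantedTypes = new Set<string>();

    // apps declare the message types they handle
    function registerApp(types: string[]): void {
      for (const t of types) wantedTypes.add(t);
    }

    // the server fetches wanted contents eagerly, everything else later
    function contentPriority(meta: { type: string }): "now" | "later" {
      return wantedTypes.has(meta.type) ? "now" : "later";
    }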
