@aljoscha %l7PFjPvDHdmqskz9UNjEfjE0J6ktjKi5KMdrF+vZHPo=.sha256
Re: %FI3kBXdFD

Countering the arguments against canonical encodings

As I read it, @Dominic's main arguments against enforcing a canonical encoding are complexity and the fact that the alternative of "serialize once, verify from the original encoding" works. I consider the complexity argument void, as demonstrated by hsdt: with a simple encoding, canonicity does not require much work.
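As a rough illustration, here is a minimal sketch in TypeScript of what such an encoder can look like. The rules below (sorted keys, JSON escaping and number formatting) are my own stand-in, not the actual hsdt format; the point is just how little machinery determinism takes.

```typescript
// A minimal sketch of a canonical encoder. The concrete rules here are
// illustrative stand-ins, NOT the actual hsdt encoding.

type Value = null | boolean | number | string | Value[] | { [key: string]: Value };

function canonicalEncode(value: Value): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalEncode).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    // Emit object entries in lexicographic key order, so that logically
    // equal maps always serialize to the same bytes.
    return (
      "{" +
      Object.keys(value)
        .sort()
        .map((k) => JSON.stringify(k) + ":" + canonicalEncode(value[k]))
        .join(",") +
      "}"
    );
  }
  // null, booleans, numbers and strings each get a single, fixed JSON form.
  return JSON.stringify(value);
}

// Two logically equal objects written in different key orders encode identically.
console.log(canonicalEncode({ b: 1, a: "x" }) === canonicalEncode({ a: "x", b: 1 })); // true
```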

The "serialize once, verify encodings" scheme would not work for something like ipfs. For the ipfs, the data deduplication requirements mandate the use of a canonic encoding. This does not hold for ssb messages. By assumption, each message contains a unique combination of feed id and backlink, so no two ssb messages should ever result in the same hash. Does that mean that we never hash the same logical value twice, and thus all my rambling is irrelevant? Yes, and no, with strict emphasis on the "no" part.

Due to the message uniqueness assumption, the "serialize once, verify encodings" approach does work for ssb. It's not pretty, and our code could not be reused by any other project that lacks this property, but it would work. It does, however, place some serious constraints on all ssb implementations.

Hashing is not the only point in the lifecycle of an ssb message at which it gets encoded. Encoding is also necessary for transmitting messages to other peers and for storing them persistently. Each of these three use cases calls for different properties from the encoding. We might want to save bandwidth by compressing the data as much as possible when sending it over the network, or we might store larger but quick-to-access data in the database. Different implementations might want to make different choices here - we cannot predict their needs. These encoding choices always involve trade-offs.

The encoding used for hashing, however, is fairly arbitrary. It should be somewhat performant, but that's it. Yet with "serialize once, verify encodings", we force every database to store the possibly suboptimal original encoding, whose only purpose was to compute a completely arbitrary hash. Why does it have to be stored? If Alice sends a message to Bob, Bob needs to verify it. Now suppose Carol is not directly connected to Alice but has subscribed to her feed. Carol gets the message from Bob, and she still needs to verify it. Without a canonical encoding, she can only do so against the original message encoding. So Bob (and everyone else) needs to persist this encoding and send it over the network as well.
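To make the failure mode concrete, here is a hedged sketch in TypeScript (using Node's crypto module; the message content is made up) of what happens when Bob re-serializes the logical value instead of keeping Alice's original bytes:

```typescript
import { createHash } from "crypto";

// Hypothetical illustration: Bob can verify Alice's message against the
// original bytes, but once he only keeps the parsed value and re-serializes
// it later, even a whitespace difference changes the hash.

const originalBytes = Buffer.from('{"author":"@alice","text":"hi"}');
const claimedHash = createHash("sha256").update(originalBytes).digest("hex");

// Later: Bob re-encodes the message from its logical value instead of the
// original bytes, e.g. with pretty-printing enabled.
const logicalValue = JSON.parse(originalBytes.toString("utf8"));
const reEncoded = Buffer.from(JSON.stringify(logicalValue, null, 2));

const recomputed = createHash("sha256").update(reEncoded).digest("hex");
console.log(recomputed === claimedHash); // false: verification fails without the original bytes
```

Without a canonical encoding, the exact original bytes are the only thing anyone downstream can verify against, which is why they have to be stored and forwarded.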

Bob could have two databases, one holding the original encodings to send to peers during replication, and one offering fast access for client applications. But running two databases with the same content wastes space, and have fun guaranteeing consistency between the two. As a result, "serialize once, verify encodings" forces every foreseeable ssb database backend into a specific storage format, or into dealing with inefficient space usage and synchronization problems. Similarly, it constrains how peers can sensibly exchange messages: more space-efficient encodings become useless, since the original encoding needs to be stored anyway.

Now here's the thing: a canonical encoding used for hashes solves these problems. It decouples the bytes to be hashed from the bytes to be persisted or transmitted. With a canonical hash encoding, you can decode your highly specialized database encoding into a logical value, compute its hash, and be guaranteed that the hash is correct. If we find more efficient formats for data exchange between peers, we can use them, decode received values, compute the canonical encoding, hash it, and verify against the claimed hash of the message.
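A sketch of that verification flow, with a sorted-key-JSON encoder standing in for a real canonical format like hsdt (the message value is made up):

```typescript
import { createHash } from "crypto";

// Sketch of the decoupling a canonical hash encoding buys us. The encoder
// below is a stand-in for a real canonical format such as hsdt.

function canonicalEncode(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalEncode).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalEncode(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}

// Verification only ever sees the logical value: it does not matter whether
// that value was decoded from a compact wire format, a query-friendly
// database row, or the legacy json encoding.
function verify(logicalValue: unknown, claimedHash: string): boolean {
  const recomputed = createHash("sha256")
    .update(Buffer.from(canonicalEncode(logicalValue)))
    .digest("hex");
  return recomputed === claimedHash;
}

const message = { author: "@alice", sequence: 1, text: "hi" };
const claimed = createHash("sha256")
  .update(Buffer.from(canonicalEncode(message)))
  .digest("hex");
console.log(verify(message, claimed)); // true, no original bytes required
```

The database and wire encodings never appear in `verify` at all; only the logical value and the canonical encoder do, which is exactly the decoupling being argued for.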

Of course we could just pretend we have found an encoding that will be sufficient for all future rpc and database usage, hardwire it into the protocol by using it for hash computation, and have future devs curse us. But I remember a thread on ssb where future-proofing protocols (e.g. via multiformats) was generally considered a good thing. A canonical encoding does exactly this. And it costs us very little: hsdt would be both canonical and more efficient (in both space usage and CPU cycles) than the current json-based implementation. And a new format can be introduced in a backwards-compatible way, thanks to ssb multihashes.
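For instance (a speculative sketch; the `.sha256h` tag and both encoders are made up for illustration), a verifier could dispatch on the format tag that ssb hashes already carry:

```typescript
// ssb message ids already end in a format tag (e.g. ".sha256"). A verifier
// can pick the encoding to recompute based on that tag, so a new canonical
// encoding can coexist with the legacy json-based one.

type Encoder = (logicalValue: unknown) => Buffer;

const encoders: Record<string, Encoder> = {
  // Legacy messages: hash the json-based encoding (stand-in).
  ".sha256": (v) => Buffer.from(JSON.stringify(v)),
  // Hypothetical new messages: hash the canonical encoding (stand-in for
  // real hsdt bytes; the ".sha256h" tag is invented for this sketch).
  ".sha256h": (v) => Buffer.from(JSON.stringify(v)),
};

function encoderFor(messageId: string): Encoder {
  const tag = messageId.slice(messageId.lastIndexOf("."));
  const encoder = encoders[tag];
  if (!encoder) {
    throw new Error("unknown hash format: " + tag);
  }
  return encoder;
}
```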
