@aljoscha %QTFcrr914HI8y8bTfMWRFczNdQ/ncDM7yZ5Ttd/nn0s=.sha256

Compact Encoding of Legacy Metadata

Currently, ssb encodes message metadata as json. While that works (ok, actually it doesn't...), it is rather wasteful. Metadata is structured data, there's no need to use a self-describing data format. A sensible encoding also shouldn't need to base64 encode signatures, hashes etc. So here's a draft for better metadata encoding.

Note that this is only about encoding legacy metadata for backwards-compatibility. This post is not about changes to the metadata itself, those are completely orthogonal and will happen at a later point in time. The only hard requirement for the new encoding is that you can always compute the (exactly one) corresponding json encoding. Put formally, there must be an injective function from the newly encoded values to the json-encoded ones. This property is crucial, since otherwise messages can't be verified. I won't specify how to reconstruct the original encoding in this post, but it should be straightforward to see that it is possible.

Legacy Metadata

The metadata currently attached to each message is specified here in the protocol guide; this post assumes that you have read that section. There are a few things I want to point out, though, as they might not be obvious to everybody:

  • Metadata order is fixed up to a single variation: author and sequence can appear in either order. So the new encoding will need a bit to indicate how to restore the corresponding json.
  • We are not forced to store data in the same order as the json did it.
  • hash carries no information at all, there's nothing to encode there
  • a timestamp can be any 64 bit floating point number except Infinity, -Infinity, -0 and NaN
  • each of those floats has exactly one valid ssb-json encoding, so sending the byte pattern is sufficient to reconstruct the json metadata
  • we can encode cypherlinks, keys, signatures etc. in any way we want, we are not tied to the <sigil><base64data><suffix> format
    • in particular, sigils are unnecessary because we already know the type of data
      • the previous data can also be null
  • the cryptographic primitive for the signature is already implied by the author key type
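
To make the timestamp points concrete, here is a minimal sketch of sending a timestamp as its raw byte pattern. The function names are hypothetical, and the big-endian byte order is an assumption of this sketch rather than something fixed by the list above.

```python
import math
import struct


def encode_timestamp(ts: float) -> bytes:
    """Pack a legacy timestamp into its 8-byte IEEE 754 pattern.

    Byte order (big-endian) is an assumption of this sketch.
    """
    if math.isnan(ts) or math.isinf(ts):
        raise ValueError("NaN and infinities are not valid timestamps")
    if ts == 0 and math.copysign(1.0, ts) < 0:
        raise ValueError("negative zero is not a valid timestamp")
    return struct.pack(">d", ts)


def decode_timestamp(raw: bytes) -> float:
    """Unpack 8 bytes and reject the byte patterns that are not allowed."""
    (ts,) = struct.unpack(">d", raw)
    encode_timestamp(ts)  # reuses the validity checks above
    return ts
```

Since every valid float has exactly one ssb-json rendering, a decoder can hand the resulting float to whatever number serialization the legacy json encoding uses.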

The Encoding

A few more considerations:

  • it should be simple to remove the signature
    • signature should be either the first or last piece of data
  • by just storing the data sequentially, you need to parse everything up to the desired piece of data to access it. That's not a problem for metadata, but it means that message content should be at the end of the data.
    • if the total length of all data is known (which will be the case with both muxrpc and bpmux), then content length can be computed without parsing if it is at the end of the data
  • it should be simple to add feed id and sequence number to a message without them, since replication rpcs don't really want to send those over the network as they can be computed locally

The last thing to consider is how to encode feed authors (i.e. public keys), signatures (where the cryptographic primitive is already known due to the feed author metadata) and the hash to the previous message (not a full cypherlink since it is already known that it links to a message).

I discussed key encoding here, we can use that format. I like the idea @kas raised of using hex encoding for human-readable representations, but that's of no concern for the metadata encoding.

Signatures only need to specify the length of the signature, not the cryptographic primitive (since that's already implied by the feed's public key). It might be possible to come up with a very clever scheme where the feed key already encodes the signature length. I don't think doing so is a good idea. Instead, the signature should simply be a varint length followed by that many bytes of signature data.
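
A minimal sketch of such a length-prefixed signature follows. The post does not say which varint flavour is meant, so unsigned LEB128 (7 bits per byte, least significant group first) is an assumption here, and all function names are hypothetical.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (assumed flavour)."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset just past the varint)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_signature(sig: bytes) -> bytes:
    """Varint length followed by that many bytes of signature data."""
    return encode_varint(len(sig)) + sig


def decode_signature(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (signature bytes, offset just past the signature)."""
    length, offset = decode_varint(buf, offset)
    end = offset + length
    if end > len(buf):
        raise ValueError("truncated signature")
    return buf[offset:end], end
```

For a 64 byte signature the prefix is the single byte 0x40, so the overhead compared to a fixed-length scheme is one byte.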

Placing the signature before the feed id is a tiny bit inconvenient since you don't know the crypto primitive of the signature before parsing the author, but the other ordering concerns outweigh this.

What remains is the encoding of hashes. This can be done nearly the same way as public key encodings. I wrote this up in a different post to keep things somewhat organized.

The Encoding (For Real This Time)

The raw encoding is a simple concatenation of the following bytes <signature><author><seqnum><timestamp><previous><data_length><data>, where:

  • <signature> is a varint foo followed by foo >> 1 bytes. The least significant byte of the varint specifies the order of author and sequence when computing the json signing format.
  • <author> is a yamf-pubkey
  • <seqnum> is a canonical varint
  • <timestamp> is an IEEE 754 64 bit float (1 sign bit, 11 exponent bits, 52 mantissa bits, in that order; negative zero, infinities and NaNs are invalid)
  • <previous> is a yamf-hash
  • <data_length> is a varint containing
  • <data> is whatever data format we come up with (my current proposal is this)
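
The concatenation above can be sketched as follows. This is an illustration under several assumptions, not a reference implementation: varints are taken to be unsigned LEB128, the lowest bit of the signature varint is taken to be the order flag with the remaining bits (foo >> 1) giving the signature length, a set flag is arbitrarily taken to mean "sequence before author", <data_length> is taken to be the byte length of <data>, and the yamf-pubkey and yamf-hash values are passed in as already-encoded opaque bytes. All function names are hypothetical.

```python
import struct
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 (assumed varint flavour)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset just past the varint)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_legacy_message(signature: bytes, swapped: bool, author: bytes,
                          seqnum: int, timestamp: float, previous: bytes,
                          data: bytes) -> bytes:
    """Concatenate the fields in the order given in the post."""
    # Lowest bit: order flag (meaning of 1 is an assumption); rest: length.
    foo = (len(signature) << 1) | (1 if swapped else 0)
    return b"".join([
        encode_varint(foo), signature,
        author,                        # pre-encoded yamf-pubkey
        encode_varint(seqnum),
        struct.pack(">d", timestamp),  # sign, exponent, mantissa order
        previous,                      # pre-encoded yamf-hash
        encode_varint(len(data)), data,
    ])


def split_signature(encoded: bytes) -> Tuple[bytes, bool, bytes]:
    """Return (signature, order flag, everything after the signature)."""
    foo, offset = decode_varint(encoded)
    end = offset + (foo >> 1)
    if end > len(encoded):
        raise ValueError("truncated signature")
    return encoded[offset:end], bool(foo & 1), encoded[end:]
```

Because the signature comes first, removing it only requires reading one varint and skipping that many bytes; none of the other fields need to be parsed.
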
@aljoscha %T9u8UstTZtNuHhoe5zxvgpWnFso72EVHSMiwo9CYRHs=.sha256

The least significant byte of the varint specifies the order of author and sequence when computing the json signing format.

Should have been "The least significant bit [...]".

@aljoscha %xVqT/69yELAEPyDcekvh/+L1cat/j2ejXvE13LIW5Vg=.sha256

As per this exchange, <author> also needs an additional varint to indicate the feed type.

There was also this idea for an additional varint in cypherlinks to messages, but I'm not as convinced anymore. Messages should be as self-describing as possible, including their encoding. I'd rather keep an encoding indicator in the message metadata (talking about the metadata improvements now, not legacy metadata), and distinguish between legacy messages and self-describing new messages via the hash indicator ("hash tag"?). So unless anyone complains or I change my mind on my own, that's the road I'll explore further.

@aljoscha %HFF/8SmwfkQaHuphWxiXALvRBHjZlDiXChgM+9htWbU=.sha256

The more I'm thinking about the issue of attaching a tag to feed ids, the less I'm convinced it is a good idea.

Conceptually, this information belongs to the feed itself. By putting it into the references, a bunch of problems arise:

  • this information is duplicated in every single reference rather than having a single source of truth
    • inefficient
    • not robust, inconsistencies will crop up
  • public keys do not uniquely identify feeds anymore, there can be two feeds with the same pubkey but different tag
  • anyone can claim that a feed has a certain type, without needing the private key to prove it
  • malicious actors can cause you to interpret data in a different way than it was intended, by specifying a non-truthful type
    • this screams "attack vector"
    • and how could you even know which type is correct?
      • trying everything and seeing what works is not a good strategy, also polyglots can exist
  • a single tag is fairly restrictive, what if we needed feed types which were parameterized over e.g. a number, or some keys, or whatever
    • all the previous issues are vastly amplified in this scenario

These problems arise because feeds - unlike messages - are not really entities that exist and can be linked to. So to solve this, we'd need a place to store information about a feed. And there's a fairly obvious one (ignoring backwards-compatibility for now): The feed's first message. It wouldn't actually have to be a real message, but the first entry of the feed's sigchain would point to a hash of some data rather than being a "null pointer". Or maybe we don't even need any indirection there, the first sigchain entry can simply contain all the feed-type data in place of the backlink.

To get this backwards-compatible: Legacy messages can't have any feed type other than the implicit default one, and the non-legacy metadata design can accommodate this.

CC @Dominic @dinosaur

@aljoscha %kW58mRD2Y7xc8DFDB9BtfeKvfYBNMdapCoz0alcXtM0=.sha256

Using the first message of the feed doesn't work with ooo. Instead, we can put the feed type into the metadata of every message. That is still more efficient than putting it into links (every message is only published once, but can be linked to much more than once), and the feed type is specified by the owner of the feed's private key. And just like feed id and seqnum, non-ooo replication can omit sending the feed type over the wire with every message.
