Compact Encoding of Legacy Metadata
Currently, ssb encodes message metadata as json. While that works (ok, actually it doesn't...), it is rather wasteful. Metadata is structured data, so there's no need to use a self-describing data format. A sensible encoding also shouldn't need to base64-encode signatures, hashes etc. So here's a draft for a better metadata encoding.
Note that this is only about encoding legacy metadata for backwards-compatibility. This post is not about changes to the metadata itself; those are completely orthogonal and will happen at a later point in time. The only hard requirement for the new encoding is that you can always compute the (exactly one) corresponding json encoding. Put formally, there must be an injective function from the newly encoded values to the json-encoded ones. This property is crucial, since otherwise messages can't be verified. I won't specify how to reconstruct the original encoding in this post, but it should be straightforward to see that it is possible.
Legacy Metadata
The metadata currently attached to each message is specified here in the protocol guide; this post assumes that you have read that section. But there are a few things I want to point out, as they might not be obvious to everybody:
- Metadata order is fixed up to only one variation: `author` and `sequence` order can be swapped. So the new encoding will need a bit to indicate how to restore the corresponding json.
- We are not forced to store data in the same order as the json did it.
- `hash` carries no information at all, there's nothing to encode there
- a timestamp can be any 64 bit floating point number except `Infinity`, `-Infinity`, `-0` and `NaN`
  - each of those floats has exactly one valid ssb-json encoding, so sending the byte pattern is sufficient to reconstruct the json metadata
- we can encode cypherlinks, keys, signatures etc. in any way we want, we are not tied to the `<sigil><base64data><suffix>` format
  - in particular, sigils are unnecessary because we already know the type of data
- the `previous` data can also be `null`
- the cryptographic primitive for the signature is already implied by the `author` key type
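To make the timestamp point concrete, here is a minimal sketch of sending the raw byte pattern while rejecting the four excluded values. The function names are made up for illustration, and big-endian byte order is an assumption on my part (it matches "sign bit first", but nothing above fixes it).

```python
import math
import struct


def encode_timestamp(ts: float) -> bytes:
    """Pack a legacy timestamp as 8 bytes (big-endian is an assumption)."""
    if math.isnan(ts) or math.isinf(ts):
        raise ValueError("NaN and infinities are not valid timestamps")
    # -0.0 == 0.0 in Python, so the sign has to be inspected explicitly.
    if ts == 0 and math.copysign(1.0, ts) < 0:
        raise ValueError("negative zero is not a valid timestamp")
    return struct.pack(">d", ts)


def decode_timestamp(raw: bytes) -> float:
    """Unpack 8 bytes and reject byte patterns for the excluded values."""
    (ts,) = struct.unpack(">d", raw)
    encode_timestamp(ts)  # re-runs the validity checks, raises on bad input
    return ts
```

Turning the float back into its one valid ssb-json string is a separate step that this sketch does not attempt.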
The Encoding
A few more considerations:
- it should be simple to remove the signature
  - so the signature should be either the first or last piece of data
- by just storing the data sequentially, you need to parse everything up to the desired piece of data to access it. That's not a problem for metadata, but it means that message content should be at the end of the data.
  - if the total length of all data is known (which will be the case with both muxrpc and bpmux), then content length can be computed without parsing if it is at the end of the data
- it should be simple to add feed id and sequence number to a message without them, since replication rpcs don't really want to send those over the network as they can be computed locally
The last thing to consider is how to encode feed authors (i.e. public keys), signatures (where the cryptographic primitive is already known due to the feed author metadata) and the hash to the previous message (not a full cypherlink since it is already known that it links to a message).
I discussed key encoding here, we can use that format. I like the idea @kas raised of using hex encoding for human-readable representations, but that's of no concern for the metadata encoding.
Signatures only need to specify the length of the signature, not the cryptographic primitive (since that's already implied by the feed's public key). It might be possible to come up with a very clever scheme where the feed key already encodes the signature length. I don't think doing so is a good idea. Instead, the signature should simply be a varint length followed by that many bytes of signature data.
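A rough sketch of such a length-prefixed signature, and of how cheaply it can be split off again. No varint format is fixed at this point, so the code assumes unsigned LEB128 (7 bits per byte, low bits first) purely for illustration; all function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128. This format is an assumption, not part of the draft."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position just past the varint)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_signature(sig: bytes) -> bytes:
    """Varint length followed by that many bytes of signature data."""
    return encode_varint(len(sig)) + sig


def split_signature(buf: bytes) -> tuple[bytes, bytes]:
    """Split a buffer into (signature, everything after the signature)."""
    length, pos = decode_varint(buf)
    if pos + length > len(buf):
        raise ValueError("truncated signature")
    return buf[pos:pos + length], buf[pos + length:]
```

Because the length is read before anything else, removing the signature needs no knowledge of the cryptographic primitive.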
Placing the signature before the feed id is a tiny bit inconvenient since you don't know the crypto primitive of the signature before parsing the author, but the other ordering concerns outweigh this.
What remains is the encoding of hashes. This can be done nearly the same way as public key encodings. I wrote this up in a different post to keep things somewhat organized.
The Encoding (For Real This Time)
The raw encoding is a simple concatenation of the following bytes: `<signature><author><seqnum><timestamp><previous><data_length><data>`, where:
- `<signature>` is a varint `foo` followed by `foo >> 1` bytes. The least significant bit of the varint specifies the order of `author` and `sequence` when computing the json signing format.
- `<author>` is a yamf-pubkey
- `<seqnum>` is a canonical varint
- `<timestamp>` is an IEEE_754 64 bit float (1 sign bit, 11 exponent bits, 52 mantissa bits, in that order; negative zeros, infinities and NaNs are invalid)
- `<previous>` is a yamf-hash
- `<data_length>` is a varint containing the length of `<data>`
- `<data>` is whatever data format we come up with (my current proposal is this)
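Put together, the concatenation could look roughly like the sketch below. Several things here are assumptions rather than part of the draft: the varint is taken to be unsigned LEB128, the signature prefix is read as "lowest bit is the order flag, the remaining bits are the signature length", a set flag is arbitrarily taken to mean the swapped order, the float is written big-endian, and `<data_length>` counts bytes. `author` and `previous` are passed in as already-encoded yamf-pubkey and yamf-hash bytes, since those formats are defined elsewhere. All names are hypothetical.

```python
import struct


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 (an assumed varint format); output is always minimal."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_message(signature: bytes, author: bytes, seqnum: int,
                   timestamp: float, previous: bytes, data: bytes,
                   swapped: bool = False) -> bytes:
    """Concatenate the fields in the order given above.

    `swapped` records whether the json had author and sequence in the
    non-default order; which bit value means which order is an assumption.
    Timestamp validity checks are omitted to keep the sketch short.
    """
    foo = (len(signature) << 1) | (1 if swapped else 0)
    return b"".join([
        encode_varint(foo), signature,   # <signature>
        author,                          # <author>, pre-encoded yamf-pubkey
        encode_varint(seqnum),           # <seqnum>
        struct.pack(">d", timestamp),    # <timestamp>
        previous,                        # <previous>, pre-encoded yamf-hash
        encode_varint(len(data)), data,  # <data_length><data>
    ])
```

A decoder would read `foo` first, take `foo & 1` as the order flag and `foo >> 1` as the number of signature bytes, then walk the remaining fields in order.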