Fun With Cypherlink Encodings for HSDT
Hsdt will treat cypherlinks as special data types. So we get to encode them any way we want. Wheee, endless possibilities and lots of room for bad decisions!
The current (stringly typed) cypherlinks used by ssb consist of three parts:
- the sigil, indicating what sort of entity it links to, currently one of feed, message or blob
- the primitive, indicating the cryptographic primitive that computed the cypherlink
- the content, the raw data that addresses the entity
The sigil dictates which primitives are allowed: it does not make sense to put the "sha256" hash function primitive on a feed id (which is just a public key), and conversely a message link with primitive "ed25519" would be meaningless.
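For reference, this is roughly how the three current sigils look in the string encoding (with placeholders instead of real base64 data):

```
feed:    @<base64 of the public key>.ed25519
message: %<base64 of the message hash>.sha256
blob:    &<base64 of the blob hash>.sha256
```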
Any sort of cypherlink encoding needs to be future-proof, in that it supports
- new sigil types
- new hash functions
  - with arbitrary (but fixed) digest sizes
- new keypair primitives
  - with arbitrary (but fixed) public key sizes
Hsdt is based on cbor, so how can cbor be extended to accommodate cypherlinks? Cypherlinks are simple (as opposed to compound) values, and there's a whole major type (7) for those. Of those, the additional types 0 to 19 and 28 to 30 are unassigned, and hsdt does not use additional types 23, 24, 31, and possibly 25. Even if we leave open additional type 25 (cbor assigned it for 16 bit floats), that leaves us with 26 values to use for efficient encodings. Allocating multiple additional types allows us to very efficiently represent currently known cryptographic primitives (and even some unknown ones).
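To make that arithmetic explicit, here is a small Rust sketch (the function name is mine, not part of any spec) of which additional types of major type 7 would be free, assuming additional type 25 stays reserved for 16 bit floats:

```rust
/// Returns true for the additional types of cbor major type 7 that are
/// free for cypherlink encodings if additional type 25 (16 bit floats)
/// is left alone.
fn free_for_cypherlinks(additional_type: u8) -> bool {
    match additional_type {
        0..=19 => true,  // unassigned simple values
        23 | 24 => true, // "undefined" and the one-byte simple value extension, unused by hsdt
        28..=30 => true, // unassigned
        31 => true,      // the "break" stop code, unused by hsdt
        _ => false,      // false, true, null and the float types keep their cbor meaning
    }
}

fn main() {
    let free = (0u8..32).filter(|&t| free_for_cypherlinks(t)).count();
    assert_eq!(free, 26); // the 26 values mentioned above
}
```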
Side-note on hsdt extensibility: It might seem reckless to use up all of these potential extension points for hsdt. But there is an interesting interplay between protocol extensions and canonicity requirements. Hsdt has to reject any non-canonical content. But suppose it also supported future extensions by requiring implementations to ignore certain pieces of data. An implementation could thus accept some message that contains an unknown data extension, but once it learns about that extension, it would suddenly detect that the included data was non-canonical. Should it now retroactively ignore the message? This problem basically means that there can only be very limited extensibility in hsdt, of the kind used below. But fundamental changes will require a (conceptually) new data format with its own hash primitive. So hsdt itself can happily use up a bunch of cbor's extension points, as they could not be built upon after the initial specification anyway.
Back to the actual encoding: The first thing we need is a fully future-proof encoding of arbitrary cypherlinks. My simple suggestion for this: a (canonical) varuint for the sigil id (these are small numbers, not ascii sigils), followed by a varuint for the cryptographic primitive's id, followed by a varuint for the length of the content, followed by the actual content. Basically a multiformats hash preceded by a sigil, and with our own table of hash ids.
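A minimal sketch of what this general encoding could look like, using a plain LEB128-style varuint as a stand-in for whatever canonical varuint the spec would pin down, and ignoring the cbor head byte that would introduce the value:

```rust
/// Append `n` as an unsigned LEB128 varuint (a stand-in for the
/// canonical varuint the spec would actually choose).
fn push_varuint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// General cypherlink encoding: sigil id, primitive id, content length,
/// then the raw content. The numeric ids themselves are hypothetical.
fn encode_general_cypherlink(sigil: u64, primitive: u64, content: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    push_varuint(sigil, &mut out);
    push_varuint(primitive, &mut out);
    push_varuint(content.len() as u64, &mut out);
    out.extend_from_slice(content);
    out
}
```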
With the general encoding done, there are still additional type bits left to concisely express cypherlinks. For example the tag byte 0b111_00000 (major type 7, additional type 0) could be a shorthand for {sigil: blob, primitive: sha256}. Since the primitive has a well-known digest size (32 bytes), there's no need to encode it; the tag byte can be followed directly by the content. Encoded this way, the cypherlink takes up 33 bytes in total, whereas the general encoding takes up 1 + 1 + 1 + 2 + 32 = 37 bytes and requires a bunch of varuint encoding/decoding.
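As a sketch, with the purely hypothetical assignment of additional type 0 to blob/sha256 from the example above:

```rust
/// Compact encoding of a {sigil: blob, primitive: sha256} cypherlink:
/// one tag byte (major type 7, additional type 0), then the raw digest.
/// The digest length is implied by the slot, so nothing else is encoded.
fn encode_blob_sha256_compact(digest: &[u8; 32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(33);
    out.push(0b111_00000); // hypothetical slot assignment from the example
    out.extend_from_slice(digest);
    out // 33 bytes total, vs. 37 for the general encoding
}
```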
These compact encodings can also be pre-allocated for not-yet-used primitives: we could reserve an additional type for a hash function with a 64 byte digest, without having to know which one exactly it is going to be.
My suggestion for these efficient encoding slots is as follows:
- {sigil: feed, primitive: ed25519}
- {sigil: feed, primitive: unknown_but_32_byte_public_key}
- {sigil: feed, primitive: unknown_but_64_byte_public_key}
- {sigil: feed, primitive: another_unknown_but_64_byte_public_key}
- {sigil: feed, primitive: unknown_but_128_byte_public_key}
- {sigil: message, primitive: json_sha256}
- {sigil: message, primitive: transitional_semicanonical_sha256}
- {sigil: message, primitive: hsdt_sha256}
- {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_sha256}
- {sigil: message, primitive: transitional_semicanonical_unknown_128_byte_digest}
- {sigil: message, primitive: hsdt_unknown_128_byte_digest}
- {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_unknown_128_byte_digest}
- {sigil: message, primitive: hsdt_another_unknown_128_byte_digest}
- {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_another_unknown_128_byte_digest}
- {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_unknown_256_byte_digest}
- all of the above sigil: message slots again for sigil: blob
Total: 25 efficient slots, plus the general encoding, take up 26 of cbor's available additional types.
Due to the canonicity requirement, implementations must use efficient slots whenever possible. We'd define representations for all of these in the general encoding as well though, because future formats might want to use different efficient slots and would thus refer to these legacy primitives via the general representation. By adding all of these to the general representation even though hsdt won't use them, we can keep the general representation consistent across data formats.
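In code, the rule could look roughly like this; the slot table and the numeric ids are hypothetical and only contain the blob/sha256 example from above:

```rust
// Hypothetical numeric ids, just for illustration.
const SIGIL_BLOB: u64 = 2;
const PRIMITIVE_SHA256: u64 = 0;

/// Returns the tag byte of the efficient slot for a (sigil, primitive)
/// pair, if one exists. An encoder must use that slot whenever this
/// returns Some, and a canonical decoder must reject a general-encoded
/// cypherlink for such a pair.
fn efficient_slot(sigil: u64, primitive: u64) -> Option<u8> {
    match (sigil, primitive) {
        (SIGIL_BLOB, PRIMITIVE_SHA256) => Some(0b111_00000),
        // ...the remaining 24 efficient slots would be listed here...
        _ => None,
    }
}
```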
The main remaining question for me: Are the efficient representations even worth it, or should we simply stick with a single, general representation? The savings are a few bytes per cypherlink as well as faster encoding/decoding; the cost is slightly larger code size (well, and someone has to write that code). I think that at the potential scale we could reach, those saved bytes will add up to matter (and I fully expect hsdt to be used for data other than limited-size messages, so even single values could potentially contain a lot of hashes).
I'll post a companion piece to this on cypherlink (and signature) encoding for data with a well-known schema (metadata, some rpcs). That one will be a lot shorter, since the same basic considerations apply in both cases.