@aljoscha %a5c/BWCeraIb6ET40XQtPWnmObOmhX3wVLwr8ZgixSc=.sha256

Fun With Cypherlink Encodings for HSDT

Hsdt will treat cypherlinks as special data types. So we get to encode them any way we want. Wheee, endless possibilities and lots of room for bad decisions!

The current (stringly typed) cypherlinks used by ssb consist of three parts:

  • the sigil, indicating what sort of entity it links to, currently one of feed, message or blob
  • the primitive, indicating the cryptographic primitive that computed the cypherlink
  • the content, the raw data that addresses the entity

The sigil dictates which primitives are allowed: it does not make sense to have the "sha256" hash function primitive on a feed id (which is just a public key), and vice versa, a message link of primitive "ed25519" would be meaningless.
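
For reference, here is a minimal sketch of how one of today's stringly typed links decomposes into those three parts. It's TypeScript using Node's Buffer for base64; the type and function names are made up for illustration:

```typescript
// The three parts of a current ssb cypherlink, e.g. "@<base64>.ed25519".
type Cypherlink = {
  sigil: "feed" | "message" | "blob"; // from the leading '@', '%', or '&'
  primitive: string;                  // the suffix, e.g. "ed25519" or "sha256"
  content: Uint8Array;                // the raw key or digest bytes
};

const SIGILS: Record<string, Cypherlink["sigil"]> = {
  "@": "feed",
  "%": "message",
  "&": "blob",
};

function parseStringlyCypherlink(link: string): Cypherlink {
  const sigil = SIGILS[link[0]];
  if (sigil === undefined) throw new Error("unknown sigil");
  const dot = link.lastIndexOf(".");
  if (dot === -1) throw new Error("missing primitive suffix");
  return {
    sigil,
    primitive: link.slice(dot + 1),
    // the base64 payload sits between the sigil and the suffix
    content: Uint8Array.from(Buffer.from(link.slice(1, dot), "base64")),
  };
}
```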

Any sort of cypherlink encoding needs to be future-proof, in that it supports

  • new sigil types
  • new hash functions
  • with arbitrary (but fixed) digest sizes
  • new keypair primitives
  • with arbitrary (but fixed) public key sizes

Hsdt is based on cbor, so how can cbor be extended to accommodate cypherlinks? Cypherlinks are simple (as opposed to compound) values, and there's a whole major type (7) for those. Of those, the additional types 0 to 19 and 28 to 30 are unassigned, and hsdt does not use additional types 23, 24, 31, and possibly 25. Even if we leave open additional type 25 (cbor assigned it for 16 bit floats), that leaves us with 26 values to use for efficient encodings. Allocating multiple additional types allows us to represent the currently known cryptographic primitives (and even some unknown ones) very efficiently.
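
To make the arithmetic concrete, here's a quick sketch enumerating those candidate values; a cbor head byte packs the major type into the top three bits and the additional type into the low five (the set below just follows the count above, it is not any official assignment):

```typescript
// Additional types of major type 7 that hsdt could claim:
// 0..19 and 28..30 are unassigned by cbor, and 23, 24, 31 are unused
// by hsdt (25, the 16 bit float, is left open here).
const CANDIDATE_ADDITIONAL_TYPES: number[] = [
  ...Array.from({ length: 20 }, (_, i) => i), // 0..19
  23, 24, 28, 29, 30, 31,
];

// One-byte cbor head: major type 7 in the top three bits,
// additional type in the low five, i.e. 0b111_xxxxx.
const candidateHeadBytes = CANDIDATE_ADDITIONAL_TYPES.map((t) => (7 << 5) | t);

console.log(candidateHeadBytes.length); // 26
```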

Side-note on hsdt extensibility: It might seem reckless to use up all of these potential extension points for hsdt. But there is an interesting interplay between protocol extensions and canonicity requirements. Hsdt has to reject any non-canonical content. But suppose it also supported future extensions by requiring implementations to ignore certain pieces of data. An implementation could thus accept some message that contains an unknown data extension, but after the implementation learns about that extension, it suddenly detects that the included data was non-canonical. Should it now retroactively ignore the message? This problem basically means that there can only be very limited extensibility in hsdt, as will be used below. But fundamental changes will require a (conceptually) new data format with its own hash primitive. So hsdt itself can happily use up a bunch of cbor's extension points, as they could not be built upon after the initial specification anyways.

Back to the actual encoding: The first thing we need is a fully future-proof encoding of arbitrary cypherlinks. My simple suggestion for this: a (canonical) varuint for the sigil id (these are small numbers, not ascii sigils), followed by a varuint for the cryptographic primitive's id, followed by a varuint for the length of the content, followed by the actual content. Basically a multiformats multihash preceded by a sigil, and with our own table of hash ids.
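
A sketch of that general encoding, assuming LEB128-style varuints (the actual varint flavor and all ids would of course be fixed by the spec; the ones here are placeholders):

```typescript
// Encode a non-negative integer as a LEB128-style varuint:
// seven payload bits per byte, high bit set on all but the last byte.
function encodeVaruint(n: number): number[] {
  const out: number[] = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80;
    out.push(byte);
  } while (n !== 0);
  return out;
}

// <sigil-varuint><primitive-varuint><length-varuint><content>
function encodeGeneralCypherlink(
  sigilId: number,     // e.g. 0 = feed, 1 = message, 2 = blob (placeholder table)
  primitiveId: number, // id from our own table of primitives (placeholder)
  content: Uint8Array,
): Uint8Array {
  return Uint8Array.from([
    ...encodeVaruint(sigilId),
    ...encodeVaruint(primitiveId),
    ...encodeVaruint(content.length),
    ...content,
  ]);
}
```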

With the general encoding done, there are still additional type values left to concisely express cypherlinks. For example the tag byte 0b111_00000 (major type 7, minor type 0) could be a shorthand for {sigil: blob, primitive: sha256}. Since the primitive has a well-known digest size (32 bytes), there's no need to encode it; the tag byte can be followed directly by the content. Encoded this way, the cypherlink takes up 33 bytes in total, whereas the general encoding takes up 1 + 1 + 1 + 2 + 32 = 37 bytes and requires a bunch of varint encoding/decoding.
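
A sketch of such a shorthand, with the slot value taken from the example above (the assignment itself is hypothetical):

```typescript
// Major type 7, minor type 0, as in the example above.
const BLOB_SHA256_TAG = 0b111_00000;

// One tag byte followed directly by the raw digest: 33 bytes total.
function encodeCompactBlobSha256(digest: Uint8Array): Uint8Array {
  if (digest.length !== 32) throw new Error("sha256 digests are 32 bytes");
  const out = new Uint8Array(33);
  out[0] = BLOB_SHA256_TAG;
  out.set(digest, 1);
  return out;
}
```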

These compact encodings can also be pre-allocated for not-yet-used primitives: we could reserve an additional type for a hash function with a 64 byte digest, without having to know which one exactly it is going to be.

My suggestions for these efficient encoding slots are as follows:

  • {sigil: feed, primitive: ed25519}
  • {sigil: feed, primitive: unknown_but_32_byte_public_key}
  • {sigil: feed, primitive: unknown_but_64_byte_public_key}
  • {sigil: feed, primitive: another_unknown_but_64_byte_public_key}
  • {sigil: feed, primitive: unknown_but_128_byte_public_key}
  • {sigil: message, primitive: json_sha256}
  • {sigil: message, primitive: transitional_semicanonical_sha256}
  • {sigil: message, primitive: hsdt_sha256}
  • {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_sha256}
  • {sigil: message, primitive: transitional_semicanonical_unknown_128_byte_digest}
  • {sigil: message, primitive: hsdt_unknown_128_byte_digest}
  • {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_unknown_128_byte_digest}
  • {sigil: message, primitive: hsdt_another_unknown_128_byte_digest}
  • {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_another_unknown_128_byte_digest}
  • {sigil: message, primitive: hsdt_without_timestamp_sequence_number_feed_id_unknown_256_byte_digest}
  • all of the above sigil: message slots again for sigil: blob

Total: 25 efficient slots, plus the general encoding, takes up 26 of CBOR's primitive values.
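
In an implementation, these slots would boil down to a lookup table from cbor additional type to (sigil, primitive, content length). A sketch, with entirely hypothetical value assignments (only the slot list itself is from the suggestion above):

```typescript
type Slot = { sigil: string; primitive: string; contentLength: number };

// Maps a cbor additional type (of major type 7) to its efficient slot.
const EFFICIENT_SLOTS: Partial<Record<number, Slot>> = {
  0: { sigil: "feed", primitive: "ed25519", contentLength: 32 },
  1: { sigil: "feed", primitive: "unknown_but_32_byte_public_key", contentLength: 32 },
  5: { sigil: "message", primitive: "json_sha256", contentLength: 32 },
  7: { sigil: "message", primitive: "hsdt_sha256", contentLength: 32 },
  // ...the remaining slots elided
};
```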

Due to the canonicity requirement, implementations must use the efficient slots whenever possible. We'd define representations for all of these in the general encoding as well though, because future formats might want to use different efficient slots and would thus refer to these legacy primitives via the general representation. By adding all of these to the general representation even though hsdt won't use them, we keep the general representation consistent across data formats.
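
In code, that canonicity rule might look like the following sketch: when decoding the general representation, reject anything that has an efficient slot (the map contents are again placeholder assignments):

```typescript
// (sigil, primitive) pairs with an efficient slot, keyed as "sigil/primitive".
const EFFICIENT_SLOT_FOR = new Map<string, number>([
  ["blob/sha256", 0b111_00000], // placeholder slot values
  ["feed/ed25519", 0b111_00001],
]);

// The general encoding is only canonical for pairs without an efficient slot.
function assertCanonicalGeneralEncoding(sigil: string, primitive: string): void {
  const slot = EFFICIENT_SLOT_FOR.get(`${sigil}/${primitive}`);
  if (slot !== undefined) {
    throw new Error(
      `non-canonical: ${sigil}/${primitive} must use efficient slot ${slot}`,
    );
  }
}
```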

The main remaining question for me: Are the efficient representations even worth it, or should we simply stick with a single, general representation? The savings are a few bytes per cypherlink as well as faster encoding/decoding; the cost is slightly larger code size (well, and someone has to write that code). I think that at the potential scale we could reach, those saved bytes will add up enough to matter (and I fully expect hsdt to be used for data other than limited-size messages, so even single values could potentially contain a lot of hashes).

I'll post a companion piece to this on cypherlink (and signature) encoding for data with a well-known schema (metadata, some rpcs). That one will be a lot shorter, as the same basic considerations apply in both cases.

@aljoscha %R7+kN37P8H4kGoTek1DjzD2Nq2OK7x4tGwxChwyVkTs=.sha256
  • all of the above sigil: message slots again for sigil: blob

Huh, that didn't really make a lot of sense, did it? Blob primitives are determined only by the hash function, not by any sort of encoding. So there are actually some more efficient slots that can be allocated. Oh well, the general idea stays the same.

@mikey %iHLgIYxkuhQ9Eabbaq0KEHcJ7L0mDhjIGFX6DUFDwQY=.sha256

@aljoscha this looks great!

that leaves us with 26 values to use for efficient encodings...

Total: 25 efficient slots, plus the general encoding, takes up 26 of CBOR's primitive values.

i'm not sure about using all the primitive values in the first possible spec. at the same time, there's no way to know what the future has in store for us and our wee protocol. on the one hand, maybe we'll never want another primitive value, maybe this is it. on the other hand, maybe not too long after we implement this, we want something better, or realize something terrible, and wish we had another available primitive value.

maybe we want to implement the general, less efficient case first (it is also the easier one to implement), then once we feel solid about which cases are worth the performance work, we make those cases more efficient. oh no but then what about hashing.

@aljoscha %1kJ2VFXzxkoeDfxYeyN6yBfHpS8+u5Si59C21gnM+J8=.sha256

Side-note on hsdt extensibility: It might seem reckless to use up all of these potential extension points for hsdt. But there is an interesting interplay between protocol extensions and canonicity requirements. Hsdt has to reject any non-canonical content. But suppose it also supported future extensions by requiring implementations to ignore certain pieces of data. An implementation could thus accept some message that contains an unknown data extension, but after the implementation learns about that extension, it suddenly detects that the included data was non-canonical. Should it now retroactively ignore the message? This problem basically means that there can only be very limited extensibility in hsdt, as will be used below. But fundamental changes will require a (conceptually) new data format with its own hash primitive. So hsdt itself can happily use up a bunch of cbor's extension points, as they could not be built upon after the initial specification anyways.

@dinosaur Was that paragraph unclear, or did you miss it? Canonicity implies that we can't do non-breaking changes, that's why I allocated all of the primitives. Any future format that might add e.g. 16 bit or 128 bit floats would need a different hash suffix than hsdt.

In general, future extensibility requires ignoring parts of the data, and I don't think that can be reconciled with canonicity. And even if it could, what would the semantics of this be? Should a message signal to the client "by the way, you are missing some information you can't understand, either because you are outdated, or because it is complete garbage, who knows?"? I (tentatively) think disallowing unknown content is the correct choice here. And as a consequence, we get to allocate all those tasty, efficient values.

@mikey %h5/GN4X1B83fdrvLmlmhfqinp+9rLXIl832CTT4bWfk=.sha256

@Aljoscha okay sorry, i probably skimmed over that paragraph and missed the meaning inside. so to loop back what i understand: once we create a data encoding protocol, we can't extend it because that would break hashes. if we want to extend it, we need to create a new data encoding protocol with a new identifier.

@aljoscha %IVg2rLd8F/5cGPwGgZ34AqWC3fpODmsK5sGh7YMtzeI=.sha256

@dinosaur

to loop back what i understand: once we create a data encoding protocol, we can't extend it because that would break hashes. if we want to extend it, we need to create a new data encoding protocol with a new identifier.

Yup, I fear that's about it. A protocol extension would not really "break hashes", but implementations that are unaware of the extension have no way of verifying canonicity, and thus can't confidently accept those messages. At least I did not find a satisfactory way around this.


Here are some other fun optimizations I toyed with along the way that did not make it into the proposed format.

One cute optimization is to interpret 0 < length_value < 8 in the generic encoding as indicating a length of 2 ^ (4 + length_value). This allows representing the lengths 32, 64, 128, 256, 512, 1024 and 2048 with a single byte, and that's probably all we'd ever need. The problem with this is potential new kinds of sigils where lengths between 0 and 8 actually make sense. I can't come up with any scenario that could possibly lead to such a situation, but I really don't want to be the guy who prevented it just in order to save one or two bytes on potential future hash encodings...
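
A sketch of that reinterpretation on the decoding side (the cutoff and the power-of-two rule are exactly as described above):

```typescript
// Length values 1..7 are reinterpreted as powers of two; all others are literal.
function decodeLengthValue(lengthValue: number): number {
  if (lengthValue > 0 && lengthValue < 8) {
    return 2 ** (4 + lengthValue); // 32, 64, 128, 256, 512, 1024, 2048
  }
  return lengthValue;
}
```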

The other optimization is to actually disallow new sigils, in which case we could use bitshifted varints to store both sigil and primitive. This would consistently save one byte in the general encoding, as well as open up the previously mentioned optimization. But disallowing any new sigils seems rather harsh; I can actually come up with use cases (or at least non-ssb protocol extensions) that might want their own cypherlink types. There's a compromise of giving e.g. exactly four bits to sigils, so that the protocol could add a couple more if needed. There's a really high probability of that being sufficient for a few decades/centuries. But again, I don't think the savings are worth sacrificing arbitrary extensibility.
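
A sketch of that packing, using the four-bit compromise (the bit width is the only parameter; nothing here is spec'd):

```typescript
const SIGIL_BITS = 4; // the compromise: at most 16 sigils, ever

// Pack sigil and primitive ids into one integer, which is then varuint-encoded.
function packSigilAndPrimitive(sigilId: number, primitiveId: number): number {
  if (sigilId >= 1 << SIGIL_BITS) throw new Error("sigil id out of range");
  return (primitiveId << SIGIL_BITS) | sigilId;
}

function unpackSigilAndPrimitive(packed: number): [number, number] {
  return [packed & ((1 << SIGIL_BITS) - 1), packed >>> SIGIL_BITS];
}
```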

@aljoscha %kFNhfPsphpeqoChT7BIuUFURbOzIRuJmcYzuOWpubGo=.sha256

On the topic of alternative ways to go: There's also the option of clearly separating encoding and hash function indicators in cypherlinks. The current proposal (well, it's not really a proposal yet, more an opinionated exploration of the design space) mashes these two together: a single id signifies both the hash function and the encoding that was used to turn the logical value into hashable bytes. That's how we can introduce new formats in the current, string-based cypherlink encoding, too. But hsdt could instead decouple those completely, tracking separate ids for encodings and hash functions. And now that I think about it, nothing stops the json-string encoding from appending two suffixes, one for the hash and one for the encoding (with an empty encoding suffix indicating the default json hashing encoding).

Separating those would mean adding another varint to the generic hsdt encoding, so we'd get <sigil-varint><encoding-varint><hash-varint><length><content>. The cost of this would be a single additional byte, but we'd reinforce the conceptual distinction. And having to manually assign combined ids to all possible (encoding, hash) pairs would be really annoying, as would be the resulting ad-hoc disambiguation code in implementations. In some sense, the same arguments for giving sigils their own varint also apply here. The more I think about this, the more I'm leaning toward separating encoding and hash.
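
The decoupled layout, sketched with the same placeholder varuints as before:

```typescript
// LEB128-style varuint, as in the earlier sketch.
function encodeVaruint(n: number): number[] {
  const out: number[] = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80;
    out.push(byte);
  } while (n !== 0);
  return out;
}

// <sigil-varint><encoding-varint><hash-varint><length><content>
function encodeDecoupledCypherlink(
  sigilId: number,    // what kind of entity is linked
  encodingId: number, // how the logical value became hashable bytes
  hashId: number,     // which hash function ran over those bytes
  content: Uint8Array,
): Uint8Array {
  return Uint8Array.from([
    ...encodeVaruint(sigilId),
    ...encodeVaruint(encodingId),
    ...encodeVaruint(hashId),
    ...encodeVaruint(content.length),
    ...content,
  ]);
}
```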

CC @Dominic since this could be done with the json encoding as well.

@aljoscha %+gs04DnW5XyfckqZH4rFSffL38MUytg9IUNm+mTyID0=.sha256

It is interesting how ipfs did not face this choice, as they settled on canonical cbor at the very beginning of the design. This locks them into a single encoding, but being locked into cbor isn't that bad; the "c" in "cbor" is no joke. More interestingly though, it prevents any future extensions of their logical data model. Which again is probably fine, especially since they have arbitrary-value keys for maps, not just strings, but who knows - at some point they might regret it. It certainly limits the possibility of "subclassing" the protocol for more specialized settings.
