Ramblings On Variable Length Natural Numbers
The binary ssb protocols will indicate a bunch of types by prefixing them with a tag. Such a tag is nothing but a natural number. We want these tags to take up little space, so we need to choose an appropriate encoding. My drafts so far simply used the ipfs varuint, but there are a few problems with that one.
There is of course no single best encoding, it depends on the expected usage. So I'll summarize some thoughts (many formed by reading through the multiformats varuint issues) and key questions, and propose a format I think would be appropriate for ssb.
I'll start with the most controversial point: Should the encoding allow arbitrarily large integers, or should there be a cutoff point (such as 2^64 - 1)?
The premise of arbitrary extensibility can only be upheld with arbitrary-length integers. Else, there would come a point where we would not be able to e.g. add any more hash functions to ssb. That's why you'd default to an encoding that supports integers of arbitrary size.
But there are a bunch of problems with that. Computers have inherently limited resources, so truly arbitrary extensibility cannot be reached in any case. More practically relevant: computers are really good at dealing with 64 bit integers, but anything larger incurs performance overhead. A truly variable-length integer type is much less efficient than native fixed-width number types.
There are also concerns regarding malicious input: if the encoding only supports a finite set of numbers, then there is a maximum length for its encodings. An arbitrary-size format would admit valid encodings of arbitrary length, a condition implementations would have to deal with. For that reason, the ipfs varuint format disallows any integers larger than 2^64 - 1. This also allows implementations to use efficient, native 64 bit arithmetic.
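To illustrate how such a length cap protects a decoder, here is a minimal sketch (in Rust) of decoding an ipfs-style varuint: LEB128, i.e. seven payload bits per byte with the high bit set on all but the last byte. The function name, the cap constant, and the error handling are mine, not part of any spec.

```rust
/// Decodes a LEB128-style varuint, rejecting encodings longer than 9 bytes
/// (the maximum length discussed in this post). Returns the decoded number
/// and how many input bytes it consumed, or None on invalid input.
fn decode_varuint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate() {
        if i >= 9 {
            return None; // encoding too long, reject it outright
        }
        value |= ((byte & 0x7f) as u64) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1)); // continuation bit unset: done
        }
    }
    None // input ended in the middle of an encoding
}
```

Without the cap, an attacker could feed the decoder an endless stream of continuation bytes; with it, the work per varint is strictly bounded.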
But at that point, why not restrict the domain to natural numbers smaller than 2^64 in the first place? The main argument is that the ipfs format can be extended to larger integers once it becomes necessary. I don't buy into that argument; in my opinion that would be a breaking change. Part of the "API"/"contract" of the current encoding is that encodings have a maximum length of 9 bytes. Lifting the size limit is a breaking change to that guarantee. Old implementations would reject newly-valid values.
Setting a maximum value can also make the encoding itself more efficient: with only finitely many possible lengths, for example, a single prefix byte has enough room to signal the length of the whole encoding.
So in my opinion, there are only two valid choices:
- commit to arbitrary length and deal with the consequences
- pick a maximum value and stick with it
Ssb will use varints to tag the variants of union types (e.g. specifying what kind of cryptographic primitive is encoded), and to encode lengths. For both of these, a range between 0 and 2^64 - 1 is more than sufficient. So unless anyone objects, I will design the binary protocols with a variable-length encoding of unsigned 64 bit integers.
(pings to people who might want to object: @Dominic, @dinosaur, @cel, @arj, @cryptix, @keks, @moid, @mix, this list is not exhaustive)
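To make the two uses mentioned above concrete, here is a hypothetical sketch of how a varint tag plus a varint length might frame a value on the wire. The tag numbers, the `frame` function, and the layout are all invented for illustration; nothing here is taken from an actual ssb spec.

```rust
// LEB128 encoding of a u64, mirroring the decoder sketched earlier.
fn encode_varuint(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte); // last byte: continuation bit unset
            return out;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

// Hypothetical wire layout, invented purely for illustration:
//   varint tag | varint length | payload bytes
// e.g. tag 0 could mean "sha256 hash", tag 1 "ed25519 public key".
fn frame(tag: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = encode_varuint(tag);
    out.extend_from_slice(&encode_varuint(payload.len() as u64));
    out.extend_from_slice(payload);
    out
}
```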
The next point is the ability to detect the length of the encoding by looking only at the first (few) bytes. The current ipfs varuint does not allow this; you need to parse the whole thing. If the first byte indicates the length, the remaining bytes can simply consist of the binary representation of the number, so parsing becomes very fast.
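As a sketch of why that is fast, assume a hypothetical scheme (not the actual proposal) where the first byte directly states how many big-endian payload bytes follow. The decoder then knows the full length after a single byte read:

```rust
// Hypothetical scheme for illustration only: the first byte holds the
// count (0..=8) of big-endian payload bytes that follow it.
fn decode_length_prefixed(input: &[u8]) -> Option<(u64, usize)> {
    let len = *input.first()? as usize;
    if len > 8 || input.len() < 1 + len {
        return None; // invalid length byte, or not enough input
    }
    // Right-align the payload in an 8-byte buffer, then read it as one u64.
    let mut buf = [0u8; 8];
    buf[8 - len..].copy_from_slice(&input[1..1 + len]);
    Some((u64::from_be_bytes(buf), 1 + len))
}
```

No per-byte branching on continuation bits, and the decoder can validate or skip a value without parsing its payload at all.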
Finally, we want canonicity: Each encodable number has exactly one possible encoding, not more.
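Plain LEB128, for example, is not canonical: padding a number with extra continuation bytes yields distinct byte strings that decode to the same value, which the `decode_varuint` sketch above happily accepts. A canonical format must rule out all but one of them (typically the shortest):

```rust
// Two distinct byte strings that both decode to 1 under plain LEB128.
// The second wastes a byte on an unnecessary continuation; a canonical
// decoder would have to reject it.
assert_eq!(decode_varuint(&[0x01]), Some((1, 1)));
assert_eq!(decode_varuint(&[0x81, 0x00]), Some((1, 2)));
```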
Some stuff that these encodings can do, but ssb probably shouldn't support:
- signed integers
  - neither lengths nor tags need to be negative
- endianness indicator
  - can speed up closed systems that use native endianness throughout, but ssb is inherently open
  - so we'll use network byte order (aka big endian) all around (see the snippet after this list)
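For illustration, network byte order simply puts the most significant byte first, independent of the host CPU:

```rust
// Big endian (network byte order): most significant byte first,
// regardless of the host's native endianness.
assert_eq!(0x0102_0304u32.to_be_bytes(), [0x01, 0x02, 0x03, 0x04]);
```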
With all of that out of the way, here is a proposal for a variable-length encoding of 64 bit unsigned integers, suitable for use in ssb, heavily inspired by this one:
continued in next post...