@aljoscha %GrYD8mj3wV5s1bqknoYa+rsoeusSuT3sjPBaktTL7ps=.sha256

Ramblings On Variable Length Natural Numbers

The binary ssb protocols will indicate a bunch of types by prefixing them with a tag. Such a tag is nothing but a natural number. We want these tags to take up little space, so we need to choose an appropriate encoding. My drafts so far have simply used the ipfs varuint, but there are a few problems with that one.

There is of course no single best encoding; it depends on the expected usage. So I'll summarize some thoughts (many formed by reading through the multiformats varuint issues) and key questions, and propose a format I think would be appropriate for ssb.

I'll start with the most controversial point: Should the encoding allow arbitrarily large integers, or should there be a cutoff point (such as 2^64 - 1)?

The premise of arbitrary extensibility can only be upheld with arbitrary-length integers. Otherwise, there would come a point where we could not, e.g., add any more hash functions to ssb. That's why you'd default to an encoding that supports integers of arbitrary size.

But there are a bunch of problems with that. Computers have inherently limited resources, so truly arbitrary extensibility cannot be reached in any case. More practically relevant: computers are really good at dealing with 64 bit integers, but anything larger incurs performance overhead. A truly variable-length integer type is much less efficient than native fixed-width number types.

There are also concerns regarding malicious input: If the encoding only supports a finite set of numbers, then there is a maximum length for the encodings. An arbitrary-size format admits valid encodings of arbitrary length, which implementations would have to deal with. For that reason, the ipfs varuint format disallows any integers larger than 2^64 - 1. This also allows implementations to use efficient, native 64 bit arithmetic.

But at that point, why not restrict the domain to natural numbers smaller than 2^64 in the first place? The main argument is that the ipfs format can be extended to larger integers once that becomes necessary. I don't buy that argument; in my opinion, that would be a breaking change. Part of the "API"/"contract" of the current encoding is that encodings have a maximum length of 9 bytes. Lifting the size limit breaks that guarantee: old implementations would reject newly-valid values.

Setting a maximum value can also make the encoding itself more efficient: with a bounded domain there are only finitely many possible encoding lengths, so the first byte can enumerate them directly instead of spending a continuation bit in every byte.

So in my opinion, there are only two valid choices:

  • commit to arbitrary length and deal with the consequences
  • pick a maximum value and stick with it

Ssb will use varints to tag the variants of union types (e.g. specifying what kind of cryptographic primitive is encoded), and to encode lengths. For both of these, a range between 0 and 2^64 - 1 is more than sufficient. So unless anyone objects, I will design the binary protocols with a variable-length encoding of unsigned 64 bit integers.
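
As an illustration of the first use case, here is a hypothetical sketch; the variant set and the tag values are invented for illustration and are not part of any ssb spec.

```rust
// Hypothetical union of cryptographic primitives. A varint encoding of
// the tag would be written in front of the payload bytes; for
// variable-length payloads, a varint length would follow the tag.
enum Primitive {
    Ed25519PublicKey([u8; 32]),
    Sha256Hash([u8; 32]),
}

// The natural number that tags each variant (invented values).
fn tag(p: &Primitive) -> u64 {
    match p {
        Primitive::Ed25519PublicKey(_) => 0,
        Primitive::Sha256Hash(_) => 1,
    }
}
```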

(pings to people who might want to object: @Dominic, @dinosaur, @cel, @arj, @cryptix, @keks, @moid, @mix, this list is not exhaustive)

The next point is the ability to detect the length of the encoding by looking only at the first (few) bytes. The current ipfs varuint does not allow this; you need to parse the whole thing. If the first byte indicates the length, the remaining bytes can simply consist of the binary representation of the number, and parsing becomes very fast.
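
To see the difference, here is a sketch of decoding a continuation-bit varint in the style of the ipfs varuint (LEB128): the decoder has to inspect every byte before it knows where the encoding ends. The function name is mine, and overflow checks are elided for brevity.

```rust
// Sketch of a continuation-bit varint decoder (LEB128 style). The low
// seven bits of each byte carry the value; the high bit says whether
// another byte follows. The total length is only known once a byte
// with the high bit cleared is found.
fn decode_leb128(input: &[u8]) -> Option<(u64, usize)> {
    let mut n: u64 = 0;
    for (i, &b) in input.iter().enumerate().take(10) {
        n |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some((n, i + 1)); // value and number of bytes consumed
        }
    }
    None // unterminated (or overly long) encoding
}
```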

Finally, we want canonicity: each encodable number has exactly one possible encoding, no more. (Continuation-bit schemes admit padded encodings unless decoders explicitly reject them: both 0x01 and 0x81 0x00 decode to 1.)

Some stuff that these encodings can do, but ssb probably shouldn't support:

  • signed integers
    • neither lengths nor tags need to be negative
  • endianness indicator
    • can speed up closed systems that use native endianness throughout, but ssb is inherently open
    • so we'll use network byte order (aka big endian) all around

With all of that out of the way, here is a proposal for a variable-length encoding of 64 bit unsigned integers, suitable for use in ssb, heavily inspired by this one:

continued in next post...

@aljoscha %QlFuGsQKY2nSE5IXPk2c/9lRuTpqX6hpBHi0KJ+XI98=.sha256

...continued from previous post

VarU64

The VarU64 encoding for unsigned 64 bit integers works as follows:

first byte                 value
0b0000_0000 - 0b1111_0111  The numeric value of the byte
0b1111_1000                The byte is followed by an 8 bit unsigned big-endian integer
0b1111_1001                The byte is followed by a 16 bit unsigned big-endian integer
0b1111_1010                The byte is followed by a 24 bit unsigned big-endian integer
0b1111_1011                The byte is followed by a 32 bit unsigned big-endian integer
0b1111_1100                The byte is followed by a 40 bit unsigned big-endian integer
0b1111_1101                The byte is followed by a 48 bit unsigned big-endian integer
0b1111_1110                The byte is followed by a 56 bit unsigned big-endian integer
0b1111_1111                The byte is followed by a 64 bit unsigned big-endian integer

Each integer may only be encoded using the smallest possible number of bytes. When decoding, violations of this constraint must be reported as errors.
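
To make this concrete, here is a minimal sketch of an encoder and decoder for the table above in Rust. The function and error names are my own, not part of any spec.

```rust
/// Errors the decoder must report, per the canonicity rule above.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnexpectedEnd,
    NonCanonical,
}

/// Append the VarU64 encoding of `n` to `out`.
pub fn encode(n: u64, out: &mut Vec<u8>) {
    if n < 248 {
        // Small values are stored directly in the first byte.
        out.push(n as u8);
    } else {
        // Number of bytes in the minimal big-endian representation of n.
        let len = (64 - n.leading_zeros() as usize + 7) / 8;
        // First byte 248 + (len - 1) means "len bytes follow".
        out.push(247 + len as u8);
        out.extend_from_slice(&n.to_be_bytes()[8 - len..]);
    }
}

/// Decode a VarU64 from the start of `input`, returning the value and
/// the remaining input.
pub fn decode(input: &[u8]) -> Result<(u64, &[u8]), DecodeError> {
    let (&first, rest) = input.split_first().ok_or(DecodeError::UnexpectedEnd)?;
    if first < 248 {
        return Ok((first as u64, rest));
    }
    let len = (first - 247) as usize; // 1..=8 bytes follow
    if rest.len() < len {
        return Err(DecodeError::UnexpectedEnd);
    }
    let mut n: u64 = 0;
    for &b in &rest[..len] {
        n = (n << 8) | b as u64;
    }
    // Enforce canonicity: reject values that would fit a shorter encoding.
    let min = if len == 1 { 248 } else { 1u64 << (8 * (len - 1)) };
    if n < min {
        return Err(DecodeError::NonCanonical);
    }
    Ok((n, &rest[len..]))
}
```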


This format (compared to the current ipfs varuint):

  • restricts the domain to 64 bit unsigned integers
  • indicates the length of the value in the first byte
  • admits exactly one valid encoding per number
  • can be parsed very efficiently
  • optimizes for small values (can store 248 different values in a single byte, compared to 128 for the ipfs varuint)
  • pays for these advantages by leaving quite a few byte strings unused (the encodings that do not use the smallest possible number of bytes; see the examples after this list)
    • if it ever becomes absolutely necessary to extend this format to handle integers of larger (or even arbitrary) size, these unused values can enable such an extension
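
To illustrate the small-value optimization and the unused byte strings, here is a short test against the sketch above (reusing its hypothetical encode/decode names):

```rust
#[test]
fn varu64_examples() {
    let mut buf = Vec::new();
    encode(247, &mut buf);
    assert_eq!(buf, [0xf7]); // largest value that fits in a single byte
    buf.clear();
    encode(248, &mut buf);
    assert_eq!(buf, [0xf8, 0xf8]); // length byte 248, then one data byte
    buf.clear();
    encode(256, &mut buf);
    assert_eq!(buf, [0xf9, 0x01, 0x00]); // two data bytes, big-endian
    // [0xf8, 0x05] is one of the unused byte strings: 5 must be [0x05].
    assert_eq!(decode(&[0xf8, 0x05]), Err(DecodeError::NonCanonical));
}
```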

I'll revisit the #yamf formats as necessary once I need them in the rust ssb implementation.
