A few more remarks to make hsdt more sensible:
- directly dumping the bytes of floats can be trouble because of different machine representations. A serious encoding should use an IEEE 754 exchange format.
- using a full byte as the tag rather than the minimum makes encoded messages larger than they theoretically need to be, but dealing with the compact data is probably slower. I would personally err on the side of simplicity rather than theoretical optimality.
- to make some of the encoding specifics less arbitrary, I see two obvious routes to take:
- simple: use big-endian, fixed size ints for lengths rather than varints, and indicate collection sizes as the number of elements
- efficient: use varints for lengths, and indicate collection sizes as the total size in bytes
If this was only used for calculating hashes, I'd go with the simple version. But it will often be sensible to compute the hash and then send the data on the network, in the same encoding. In this case, I tentatively lean towards the efficient solution. Note that calling it "efficient" is somewhat simplistic: Encoding and decoding varints is more expensive than fixed size ints, but that should be negligible. More important is the unit of lengths. Decoding becomes simpler if collections are prefixed by their size in bytes, but this makes encoding more complicated. You can't really do streaming encoding anymore ,since you need to encode the full collection content to know the size. But I think this wouldn't be too expensive, also you can compute the sizes without actually encoding. Add some memoization for nested size precomputation, and things should stay efficient. Non-streamability is not that much of a problem because we have a size limit on messages. Speaking of which, a new encoding is an opportunity to fix the intermingling of size limit and unicode character counts.
With these above changes incorporated, I would consider the format good enough. It's still only a drop-in replacement for json. If there is serious consideration of adopting a new format, I'd like us to take the time and consider possible improvements to the logical data structures themselves. Some ideas (not saying all of these are good ones or would be compatible with or sensible for ssb): integers, rationals, byte arrays, NaN, infinities, sets.
As for compressed floats, my hunch is that it's simply not worth it. But this looks like a good starting point for exploration rather than rolling our own.
And then there's the actual engineering aspect, i.e. benchmarking different encoding details like float representations, how to store data type tags, etc.
I feel like I should add a disclaimer that I have no idea what I'm doing here. I'm just looking for clean ways to achieve nice mathematical properties, but I'm neither experienced in protocol design/implementation, nor do I know a lot about the performance properties and considerations of networks and data bases. It is entirely possible (and rather likely) that common sense is misleading me, or that I'm missing points that would be obvious to someone more qualified/experienced.
Anyways, I feel like I can't postpone reading the full IEEE 754 standard any longer. Maybe tomorrow I'll then try to find a sensible way of dealing with floats, and then I could do a js implementation of the simple
version of hsdf, to further prove the concept. But for now, I really need to get to sleep. And I kinda wanted to implement a lisp rather than dealing with ssb internals again... Oh well =D