TLDR: Everything gets better if we treat objects as unordered key/value pairs rather than ordered key/value pairs.
For those wondering about the v8 implementation details I am talking about: https://github.com/ssbc/ssb-feed/issues/11
I feel like I am arguing a different point than @Dominic. Performance improvements are nice, but I primarily see a new encoding as a chance to fix some protocol internals. While this might not be the feedback you expected (and it might not be part of the rabbit hole you explored during the self-nerd-snipe), I still urge you to consider it seriously.
The point of an in-place-read system is that you don't actually need to encode or decode the data once you've created it. You create it, store it to a file, read it from that file and send it over the wire; the peers receive and validate it, store it, and read it. But no one encodes or decodes it any more. If we have in-place reads, we don't strictly need bijectiveness; what bijectiveness gives us is an easier upgrade path from the current system, which is a different thing, but still something very useful.
But at some point, the data needs to be translated into the higher-level data structure that is exposed by the api. A getHistoryStream function won't return raw bytes, it will return some sort of (language-specific) values. My questions target this part of the process.
The encoding function should be a function from these values to byte buffers, and it needs to have the following properties:
- actually be a (total) function (each value has exactly one encoding)
- be injective (different values produce different encodings)
- have an inverse function (aka decoding)
- the inverse function is injective (different encodings decode to different values)
The current system does not have all of these properties, so doing the same stuff, just more efficiently, won't make things any better. Maybe faster, but still fundamentally broken.
Breakdown of the above requirements:
Total Encoding Function:
If the function is not total (i.e. there exist unencodable values), there are values that cannot be hashed. This requirement is (maybe even too) obvious. The json solution fulfills it.
Each value having at most one encoding is necessary so that content-addressing via hashes works. Whether the json encoding fulfills this one depends on your definition of the set of values we want to encode. In json, a different ordering of the key/value pairs in an object results in a different encoding. In consequence, the order of the entries of objects matters. This has been a major source of frustration for all protocol implementers, as has the dependence on explicitly unspecified v8 behaviour.
With a new encoding, we get a chance to fix it, by treating maps (aka objects) as (unordered) sets of key/value pairs, rather than (ordered) lists of key/value pairs. This is what most people would intuitively expect, and as far as I know, that is also how all current rpc methods treat objects.
If there were one thing I could change about the protocol, this might be it. Technically, the current json encoding is a proper function, but only because its domain is a superset of what would actually be sensible.
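One common way to get an order-insensitive encoding is to recursively sort object keys before serializing, so insertion order stops mattering. A minimal sketch (`canonicalEncode` is a made-up name, not an existing ssb function):

```javascript
// Sketch: serialize objects as unordered maps by sorting their
// keys recursively. Two objects with the same entries always
// produce the same bytes, regardless of insertion order.
function canonicalEncode(value) {
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalEncode).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value).sort();
    return '{' + keys.map(
      (k) => JSON.stringify(k) + ':' + canonicalEncode(value[k])
    ).join(',') + '}';
  }
  return JSON.stringify(value); // strings, numbers, booleans, null
}

// Insertion order no longer matters:
canonicalEncode({ b: 1, a: 2 }) === canonicalEncode({ a: 2, b: 1 }); // true
```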
Injective Encoding
This one is pretty clear. Failing to be injective not only messes up content-addressing, it also means there can be no decoding function at all. Json fulfills this.
Inverse Function
The inverse function does the decoding, so its purpose is quite obvious as well. It only needs to be defined over valid encodings, not arbitrary byte sequences (but since a peer might send arbitrary bytes, there needs to be error handling). For each encoding, there must be exactly one decoded value. For js (where objects are ordered), this would require some sort of canonical decoding order, probably established implicitly by decoding in sequence.
Being a proper function is especially important in the ssb context, because otherwise the (probabilistic) guarantee that values with the same hash are the same value would be lost.
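The error handling could look roughly like this (a sketch only; `decode` and its result shape are assumptions, not an existing api): the decoder is defined over valid encodings and rejects everything else, rather than producing a value for arbitrary peer-supplied bytes.

```javascript
// Sketch of a decoder that is only defined over valid encodings.
// Arbitrary bytes from a peer yield an error result, not a value.
function decode(buf) {
  let value;
  try {
    value = JSON.parse(buf.toString('utf8'));
  } catch (err) {
    return { ok: false, error: 'not a valid encoding' };
  }
  return { ok: true, value };
}

decode(Buffer.from('{"type":"post"}')); // { ok: true, value: { type: 'post' } }
decode(Buffer.from('not json'));        // { ok: false, error: 'not a valid encoding' }
```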
Injective Decoding Function
Json fulfills this, but only because the order of key/value pairs of in-memory objects matters. If the js api represented values as actual maps rather than js objects, then the json encoding would not have an injective decoding function.
Why does this property matter? Not fulfilling it violates the guarantee that things with a different hash refer to different values.
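A small hypothetical snippet to demonstrate the violation: two distinct encodings decode to objects that are equal as maps, differing only in key order, so a map-based api would see one value with two encodings.

```javascript
// Two distinct JSON encodings that only differ in key order.
const x = JSON.parse('{"a":1,"b":2}');
const y = JSON.parse('{"b":2,"a":1}');

// Viewed as unordered maps, x and y are the same value:
const sameEntries =
  Object.keys(x).sort().join() === Object.keys(y).sort().join() &&
  Object.keys(x).every((k) => x[k] === y[k]);
console.log(sameEntries); // true

// But js preserves insertion order, so re-encoding still tells them apart:
console.log(JSON.stringify(x) === JSON.stringify(y)); // false
```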
So technically, the json encoding works. But practically, replicating the order-dependent maps is a pain in any language that is not javascript. And actually, it would also be a pain in js if the implementation did not rely on details of v8. Maybe #somebodyshould make a PR to v8 that changes the internals, thus breaking the reference ssb implementation.