TLDR:
@Piet:
Maybe this is ok?
@Aljoscha: NO!
Since this rather long response starts with some definitions, you can skip straight to the core argument in the next post.
This is a more principled argument for a canonical data format, now that I'm more rested. Sleep is a wonderful thing, it lets you detect that abbreviating Hypothetical SSB Data Format as HSDT might include an error. But messages are immutable, so it's hsdt now. Maybe it's an acronym for Hermie's Shiny Data Treasure?
Anyhow, let's talk about data with a canonical encoding, since that's all I seem to be doing these days. By "data", I mean the logical set of values an ssb message could have. The logical values are not the same as actual json encodings, and they are also not the same as in-memory representations.
For example, the json strings 1, 1.0, 1.0E0, 1.0e0, 10.0E-1, 0.01E2, 1.0000000000000001, etc. all decode to the same 64 bit floating point number, as specified by the IEEE 754 floating point standard. The fact that some of them also describe the same mathematical integer, rational, real, etc. is completely irrelevant. We only care about the logical data model, and in our case, that's IEEE 754 64 bit floats. I'll call encoded strings like these equivalent, but not identical. They all decode to identical logical values though.
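To make the float example concrete, here is a small sketch in JavaScript. It assumes an engine whose JSON.parse rounds decimal literals to the nearest IEEE 754 double, which is what ECMAScript requires. The variable names are mine, and the list only picks a few of the encodings mentioned above.

```javascript
// A few equivalent but not identical encodings of the same 64 bit float.
const encodings = ["1", "1.0", "1.0E0", "1.0e0", "10.0E-1", "1.0000000000000001"];

// Decode each encoding string into its logical value.
const decoded = encodings.map((text) => JSON.parse(text));

// Every decoded value is the very same double, namely 1.
const allIdentical = decoded.every((value) => Object.is(value, 1));
console.log(allIdentical); // true

// Reencoding collapses all of them into a single string.
const reencoded = decoded.map((value) => JSON.stringify(value));
console.log(new Set(reencoded).size); // 1
```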
With objects, this difference extends into memory. We tend to think of objects as maps from strings to arbitrary logical values. For maps, there is no specified order of the entries, we only need to be able to insert, remove and query keys. So {"a": 1, "b": 2} and {"b": 2, "a": 1} are equivalent but not identical encodings. If you expect their decoded values to be logically identical though, you are wrong. JS preserves the order of object entries, so these two parse to different objects. Most critically, when decoding and then reencoding them via JSON.stringify, we get two different encoding strings again. I won't use this post to argue for true maps rather than ordered sequences of pairs as the logical data model for ssb, I did that elsewhere.
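A quick way to see the object ordering issue for yourself, as a sketch with made-up variable names. It relies only on JSON.parse and JSON.stringify keeping the insertion order of ordinary string keys.

```javascript
// Two encodings that describe the same map, with the entries in a different order.
const first = JSON.parse('{"a": 1, "b": 2}');
const second = JSON.parse('{"b": 2, "a": 1}');

// The parsed objects remember the order in which the keys appeared.
console.log(Object.keys(first));  // [ 'a', 'b' ]
console.log(Object.keys(second)); // [ 'b', 'a' ]

// So a decode and reencode round trip does not converge on one string.
const firstAgain = JSON.stringify(first);   // '{"a":1,"b":2}'
const secondAgain = JSON.stringify(second); // '{"b":2,"a":1}'
console.log(firstAgain === secondAgain); // false
```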
Just for completeness: Strings also have problems with escape sequences, but I just want to establish some base terminology, not discuss all the problems with json - such a discussion would not fit into the message limit.
A canonical encoding is one where there is exactly one valid encoding string for every logical value. For floats, that could be done by disallowing all but one decimal representation. For true maps, that means fixing the order in which the key-value pairs must be listed, for example by sorting the keys alphanumerically.
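As a rough sketch of what such an encoder could look like: canonicalEncode below is a hypothetical helper, not part of ssb or of any library. It assumes the input came out of JSON.parse (so no undefined, functions, or cycles), sorts keys by UTF-16 code units via the default Array.prototype.sort, and leans on JSON.stringify for numbers and strings. Whether those are the right choices for a real format is exactly the kind of thing that would need to be specified.

```javascript
// Hypothetical canonical encoder: one output string per logical value.
function canonicalEncode(value) {
  if (Array.isArray(value)) {
    // Arrays are ordered sequences, so their order is part of the value.
    return "[" + value.map(canonicalEncode).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    // Maps have no order, so we fix one by sorting the keys.
    const keys = Object.keys(value).sort();
    const entries = keys.map(
      (key) => JSON.stringify(key) + ":" + canonicalEncode(value[key])
    );
    return "{" + entries.join(",") + "}";
  }
  // Numbers, strings, booleans and null: JSON.stringify emits exactly one form.
  return JSON.stringify(value);
}

const x = canonicalEncode(JSON.parse('{"b": 2, "a": {"d": 1.0E0, "c": [10.0E-1]}}'));
const y = canonicalEncode(JSON.parse('{"a": {"c": [1], "d": 1}, "b": 2}'));
console.log(x === y); // true
```

Both inputs end up as {"a":{"c":[1],"d":1},"b":2}, no matter how the original string ordered its keys or spelled its numbers.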
With this established, why do I care so much about canonical encodings? Because ssb needs to compute hashes. Hash functions are true mathematical functions by definition, so given the same input, they produce the same output. Intuitively, we would thus expect any single logical value to always result in the same hash across implementations, not to mention inside a single execution of a single program on a single machine. But you can't compute the sha256 hash of a logical data structure, you have to encode it first. And if there are multiple equivalent but not identical encodings of a value, different implementations could produce different hashes for the same logical value. A valid implementation is even allowed to randomly choose a valid encoding each time. In all subjectivity, I consider this an undesirable property.
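Here is the hashing problem as a sketch, using Node's built-in crypto module. sha256Hex and encodeFlatSorted are names I made up for illustration, and encodeFlatSorted only handles flat objects with ordinary string keys. This is not how ssb actually computes message hashes, it only shows the effect.

```javascript
const { createHash } = require("crypto");

// Hash an encoding string, not a logical value: sha256 only sees bytes.
function sha256Hex(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Two equivalent but not identical encodings of the same map.
const encodingOne = '{"a":1,"b":2}';
const encodingTwo = '{"b":2,"a":1}';
console.log(sha256Hex(encodingOne) === sha256Hex(encodingTwo)); // false

// Hypothetical helper: reencode a flat object with its keys sorted.
function encodeFlatSorted(obj) {
  const sorted = {};
  for (const key of Object.keys(obj).sort()) {
    sorted[key] = obj[key];
  }
  return JSON.stringify(sorted);
}

// After agreeing on one encoding per value, the hashes agree as well.
const hashOne = sha256Hex(encodeFlatSorted(JSON.parse(encodingOne)));
const hashTwo = sha256Hex(encodeFlatSorted(JSON.parse(encodingTwo)));
console.log(hashOne === hashTwo); // true
```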
Continued in the next post...
cc @Dominic, @Piet, @arj, @cryptix, @keks, @cel