@Dominic %FI3kBXdFD6j+uISXluhirJm70QZpJIYzc65OX1XPJ5c=.sha256

argument against canonical formats for signed data

fork from @aljoscha's proposal about a canonical data format, SDN

I think the SDN proposal just goes to show how difficult it is to design a canonical format, and the benefit of that is just the ability to parse to an intermediate data structure, then serialize again, and have the hash still be the same. Given that the data structures different languages and runtimes prefer are not necessarily similar enough to preserve all the features required for canonicity, this can be awkward...

as it was with the JSON implementation, which did support json == JSON.stringify(JSON.parse(json)) canonicity in V8 (but not in the JSON spec)

If it doesn't require a lot of code, it requires a lot of specification...

On the other hand, a serialize once design is simpler overall (the signer is the only one to serialize a structure; other readers parse, but keep a handle on the raw bytes, and just write those instead of reserializing) and doesn't introduce requirements to parse it into any particular type of structure.

Implementations can choose their own optimizations, but that is difficult with a canonical format. And also, a non-canonical format can be designed for other desirable features, such as read performance, in-place access, or ease of implementation.

I think these are in practice more desirable than canonicity.

@aljoscha %ZiLzt1ZBsSzltPioH2AoorHGV96bWjPCl8Afhr621O8=.sha256

@Dominic I lack the energy for a thorough response to this, but here are a few immediate reactions. I apologize in advance if they turn out too raw or direct - I hope this does not come across as too negative. I really do appreciate you taking the time to pick up this conversation.

fork from @aljoscha's proposal [...]

Just to clarify: This was not intended as a proposal for ssb, this was simply me going down a fun rabbit hole about choosing a syntax for a lisp. This started by thinking about highly dynamic languages with content-addressable module loading, see #ssb-annah for the origins of this (I remember mentioning a "lisp with an event loop" somewhere in that channel, that's what I was fleshing out yesterday). But I'll take the bait and consider the potential usage for ssb in this response.

Had I intended this as an ssb proposal, I would have suggested a binary format instead, together with a human-readable, non-canonic display format.

I think the SDN proposal just goes to show how difficult it is to design a canonical format, [...]

I disagree with this being difficult; it took less than 24 hours to come up with that - all the difficult stuff has already been solved elsewhere. The equality relation and the total order take some space to write down, because they are inductively defined over all data types, and there's a bunch of data types (more than ssb would need, btw). But none of the cases are really complicated - the sole exception being floats. And canonical floats become easy in a binary format - IEEE 754 defines binary exchange formats. Disallow all but one NaN representation, done.

[...] and the benefit of that is just the ability to parse to an intermediate data structure, then serialize again, and have the hash still be the same.

There's a more important part: Any two programs which happen to generate the same data will serialize it to something with the same hash. That's one of the first properties I'd want a content-addressed system to have.

Given that the data structures different languages and runtimes prefer are not necessarily similar enough to preserve all the features required for canonicity, this can be awkward...

ssb enforces data structures, no matter whether its data format has a canonical representation or not. It currently enforces a data structure where maps are ordered sequences of pairs, which is awkward.

as it was with the JSON implementation, which did support json == JSON.stringify(JSON.parse(json)) canonicity in V8 (but not in the JSON spec)

I'm not sure if I parsed this sentence correctly, but the property you state does not hold in V8, and everything would be horrible if it did hold.
JSON.stringify(JSON.parse("2.00000000000000000000000000000001")) == "2", and that's a good thing. Else, node would be forced to always remember the corresponding source string for any json it parses, which defeats some of the point of parsing.
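
For illustration, a quick check in node (nothing ssb-specific, just stock JSON):

  // Round-tripping json through parse/stringify normalizes the text,
  // so the output string need not equal the input string.
  const input = '{ "n": 2.00000000000000000000000000000001, "b": 2, "a": 1 }'
  const output = JSON.stringify(JSON.parse(input))
  console.log(output)           // {"n":2,"b":2,"a":1}
  console.log(output === input) // false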

If it doesn't require a lot of code, it requires a lot of specification...

Code in total, or code one needs to implement oneself?

On the other hand, a serialize once design is simpler overall (the signer is the only one to serialize a structure; other readers parse, but keep a handle on the raw bytes, and just write those instead of reserializing) and doesn't introduce requirements to parse it into any particular type of structure.

This completely gives up on the idea that equivalent data structures should produce the same hash. And even stronger, it gives up on the idea that identical data structures should produce the same hash. To me, this is completely absurd.

Implementations can choose their own optimizations, but that is difficult with a canonical format. And also, a non-canonical format can be designed for other desirable features, such as read performance, in-place access, or ease of implementation.

Does a canonical format inherently conflict with performance and in-place access?

As for the ease of implementation: I don't think the complexity is too big (assuming non-textual floats). There just isn't a format with a readily available implementation (although XML comes close). In js, we have easy access to highly optimized json parsers. These optimizations might make them more complex than a straightforward implementation of something like sdn would be.

@aljoscha %euTgqawxHIGxR6rGWWvsn7nNVKQ4jojF0ljMoGXvhBM=.sha256

Ok, sleep is overrated anyways. Here's a 20 minute design of a hypothetical data format ssb could use. I just wrote this down without spending much time thinking about it, so please don't interpret this as a serious proposal. This simply tries to serve as a demonstration that nothing about a data format with nice properties is inherently complex. From the perspective of a computer, this format is both simpler and easier than json.

The concrete encoding is fairly arbitrary, there surely are more efficient alternatives that could be substituted without violating any important properties. Also, this is not at all how I'd design this from scratch, instead it incorporates the json/js/ssb quirks I remember off the top of my head.

Hypothetical SSB Data Format (HSDT)

Logical Data Types

An hsdt value is one of the following:

  • null
  • a boolean (true or false)
  • a utf8 encoded string (may include null bytes) // TODO replace utf8 by whatever encoding js actually uses, if that is how ssb stores strings
  • an IEEE 754 double precision floating point number, excluding the infinities and NaNs
  • an ordered sequence of values, called an array
  • an unordered mapping from strings to values, called an object. An object may not contain the same key multiple times (strictly speaking, this may not be compatible with ssb, because json doesn't enforce this)

Human-Readable Encoding:

Encode arbitrarily (whitespace, floats, object order, etc.) as json. Never use the human-readable encoding programmatically, other than to display it to the user.

Encoding

Non-canonic binary encoding of an hsdt value (a rough js sketch follows after the list):

  • if it is null, encode as an unsigned 8-bit integer 0
  • if it is true, encode as an unsigned 8-bit integer 1
  • if it is false, encode as an unsigned 8-bit integer 2
  • if it is a float, encode as an unsigned 8-bit integer 3, followed by the 64 raw bits of the float
    • when parsing, error if the raw bits signify an infinity or NaN
  • if it is a string, encode as an unsigned 8-bit integer 4, followed by the length as a varint, followed by that many bytes
    • when parsing, error if not valid <insert the encoding used by ssb here>
  • if it is an array, encode as an unsigned 8-bit integer 5, followed by a varint indicating the number of elements (alternate version: a varint indicating the number of following bytes that correspond to this array), followed by the encodings of the contained values in order
  • if it is an object, encode as an unsigned 8-bit integer 6, followed by a varint indicating the number of entries (alternate version: a varint indicating the number of following bytes that correspond to this object), followed by the encoding of the key of one of the entries (optimisation: skip the tag designating it as a string), then the encoding of the corresponding value. Repeat until all entries have been encoded.
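
Purely to show how little code the above needs, here's a rough js sketch of the non-canonic encoding (hypothetical helper names; LEB128-style unsigned varints and big-endian float bytes are assumptions the spec above doesn't pin down):

  // Rough sketch of the non-canonic hsdt encoding described above.
  // Counts collections by number of elements/entries, not by byte length.
  function encodeVarint (n) { // unsigned LEB128-style varint (assumption)
    const bytes = []
    do {
      let b = n % 128
      n = Math.floor(n / 128)
      if (n > 0) b += 128 // set continuation bit
      bytes.push(b)
    } while (n > 0)
    return Buffer.from(bytes)
  }

  function encodeHsdt (value) {
    if (value === null) return Buffer.from([0])
    if (value === true) return Buffer.from([1])
    if (value === false) return Buffer.from([2])
    if (typeof value === 'number') {
      if (!isFinite(value)) throw new Error('no infinities or NaNs')
      const buf = Buffer.alloc(9)
      buf[0] = 3
      buf.writeDoubleBE(value, 1) // the 64 raw bits of the float (big-endian here)
      return buf
    }
    if (typeof value === 'string') {
      const utf8 = Buffer.from(value, 'utf8')
      return Buffer.concat([Buffer.from([4]), encodeVarint(utf8.length), utf8])
    }
    if (Array.isArray(value)) {
      return Buffer.concat([Buffer.from([5]), encodeVarint(value.length),
                            ...value.map(v => encodeHsdt(v))])
    }
    // an object: entries in whatever order js gives us (non-canonic)
    const keys = Object.keys(value)
    const entries = keys.map(k => Buffer.concat([encodeHsdt(k), encodeHsdt(value[k])]))
    return Buffer.concat([Buffer.from([6]), encodeVarint(keys.length), ...entries])
  }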

Canonic Encoding

The only source of nondeterminism in the non-canonic encoding is the order of entries in a map. For the canonic encoding, sort the entries lexicographically based on their key.
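
Sticking with the sketch above, the object branch of encodeHsdt is the only thing that changes (sorting with js's default string sort, i.e. by UTF-16 code unit - a real spec would have to pin the collation down):

  // Canonic variant: identical to encodeHsdt above, except that object
  // entries are emitted in lexicographically sorted key order.
  const keys = Object.keys(value).sort()
  const entries = keys.map(k =>
    Buffer.concat([encodeHsdt(k), encodeHsdt(value[k])]))
  return Buffer.concat([Buffer.from([6]), encodeVarint(keys.length), ...entries])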

@aljoscha %H3/+bNmXKaOlSPSRlvF7FUIPKs+lwpyKXTE0A4dpCvc=.sha256

I lied, there's another source of nondeterminism. For the canonical encoding, all varints must be the shortest possible varint to express the number (i.e. zero must be encoded as 0000_0000, something like 1000_0000_0000_0000 is not allowed).
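
To make that concrete with the (assumed) LEB128-style varints from the sketch in the previous post - the encoder there already emits minimal encodings, a canonic decoder just has to reject padded ones:

  encodeVarint(0)   // <Buffer 00>    i.e. 0000_0000
  encodeVarint(127) // <Buffer 7f>
  encodeVarint(128) // <Buffer 80 01>
  // a decoder for the canonic encoding must reject padded forms such as
  // 0x80 0x00 (1000_0000 0000_0000), which also denotes zero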

User has not chosen to be hosted publicly
@aljoscha %zxVj+dRhYbo/lJ2k+frMN0Mbg5LV8I2v1nAvyFvG/4I=.sha256

Wow @Aljoscha thank you so much for this work you're doing here.

It's really important engineering work that no-one else is tackling.

This isn't engineering, it's a polemic proof of concept. hsdt as sketched above is not a good format!

Maybe we can avoid having to have the one true encoding by using multiformat

Ssb does not need one true encoding, all cypherlinks carry a (free form) hash suffix, which could simply indicate the encoding that was used to compute the hash. And ssb server implementations can talk to each other via morse-code encoded messages if they like - the database doesn't care. No need to switch to multiformats because of this. What we do need is one true logical data model.

@Dominic %EcT/R+1IlqbZ4L4cmkGXAyHQP59C1XbJj4B+IreZCr8=.sha256

@Aljoscha sorry, it seems I did misinterpret your intentions, I jumped to the conclusion you were intending it as an ssb format. (glad I made a separate thread though!)

I'd be pretty happy with the format you describe (with length delimited, not element delimited, collections)

One thing I have been wondering about is a varint-based float. The most common number is probably a small integer, and encoding that as a base ten ascii string is more efficient for small numbers! (but obviously not for large numbers) But if you had a pair of signed varints (integer, exponent) you could encode everything efficiently. You'd need a length on the number field though, unless you used the bottom couple of bits to encode the type: if the length was greater than the length of the first varint, include the exponent, otherwise it's an integer. I feel this is a bit ugly, though. Maybe just using a 64-bit float is good enough?
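
Roughly something like this, reusing the varint helper from the hsdt sketch above plus zigzag for the signs (just a sketch of the idea, leaving out the length/type-bit trick, and only valid while the mantissa stays in the safe integer range):

  const zigzag = n => n < 0 ? -2 * n - 1 : 2 * n

  // encode a number as (decimal mantissa, decimal exponent), i.e.
  // x === mantissa * 10 ** exponent
  function encodeDecimalFloat (x) {
    const [m, e] = x.toExponential().split('e') // e.g. "1.5e-3"
    const digits = m.replace('.', '').replace('-', '')
    const mantissa = (x < 0 ? -1 : 1) * parseInt(digits, 10)
    const exponent = parseInt(e, 10) - (digits.length - 1)
    return Buffer.concat([encodeVarint(zigzag(mantissa)),
                          encodeVarint(zigzag(exponent))])
  }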

Also, this is an interesting idea: https://en.wikipedia.org/wiki/Unum_(number_format)

User has not chosen to be hosted publicly
@aljoscha %ruNFSjPd/7SbhIhwU2Vce8beucrCmCzlTVZN9DacmLQ=.sha256

A few more remarks to make hsdt more sensible:

  • directly dumping the bytes of floats can be trouble because of different machine representations. A serious encoding should use an IEEE 754 exchange format.
  • using a full byte as the tag rather than the minimum makes encoded messages larger than they theoretically need to be, but dealing with the compact data is probably slower. I would personally err on the side of simplicity rather than theoretical optimality.
  • to make some of the encoding specifics less arbitrary, I see two obvious routes to take:
    • simple: use big-endian, fixed size ints for lengths rather than varints, and indicate collection sizes as the number of elements
    • efficient: use varints for lengths, and indicate collection sizes as the total size in bytes

If this was only used for calculating hashes, I'd go with the simple version. But it will often be sensible to compute the hash and then send the data on the network, in the same encoding. In this case, I tentatively lean towards the efficient solution. Note that calling it "efficient" is somewhat simplistic: Encoding and decoding varints is more expensive than fixed size ints, but that should be negligible. More important is the unit of lengths. Decoding becomes simpler if collections are prefixed by their size in bytes, but this makes encoding more complicated. You can't really do streaming encoding anymore, since you need to encode the full collection content to know the size. But I think this wouldn't be too expensive, and you can compute the sizes without actually encoding. Add some memoization for nested size precomputation, and things should stay efficient. Non-streamability is not that much of a problem because we have a size limit on messages. Speaking of which, a new encoding is an opportunity to fix the intermingling of size limit and unicode character counts.
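
To make the size precomputation concrete, a rough sketch for the "efficient" variant (byte-length prefixes, varints; no memoization, hypothetical names):

  // size of a LEB128-style varint encoding of n
  const varintLength = n => {
    let len = 1
    while (n > 127) { n = Math.floor(n / 128); len++ }
    return len
  }

  // encoded size of an hsdt value, computed without encoding anything
  function encodedSize (value) {
    if (value === null || typeof value === 'boolean') return 1
    if (typeof value === 'number') return 1 + 8 // tag + 64 bit float
    if (typeof value === 'string') {
      const len = Buffer.byteLength(value, 'utf8')
      return 1 + varintLength(len) + len
    }
    let inner
    if (Array.isArray(value)) {
      inner = value.reduce((sum, v) => sum + encodedSize(v), 0)
    } else {
      inner = Object.keys(value).reduce(
        (sum, k) => sum + encodedSize(k) + encodedSize(value[k]), 0)
    }
    return 1 + varintLength(inner) + inner
  }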

With these above changes incorporated, I would consider the format good enough. It's still only a drop-in replacement for json. If there is serious consideration of adopting a new format, I'd like us to take the time and consider possible improvements to the logical data structures themselves. Some ideas (not saying all of these are good ones or would be compatible with or sensible for ssb): integers, rationals, byte arrays, NaN, infinities, sets.

As for compressed floats, my hunch is that it's simply not worth it. But this looks like a good starting point for exploration rather than rolling our own.

And then there's the actual engineering aspect, i.e. benchmarking different encoding details like float representations, how to store data type tags, etc.


I feel like I should add a disclaimer that I have no idea what I'm doing here. I'm just looking for clean ways to achieve nice mathematical properties, but I'm neither experienced in protocol design/implementation, nor do I know a lot about the performance properties and considerations of networks and data bases. It is entirely possible (and rather likely) that common sense is misleading me, or that I'm missing points that would be obvious to someone more qualified/experienced.

Anyways, I feel like I can't postpone reading the full IEEE 754 standard any longer. Maybe tomorrow I'll then try to find a sensible way of dealing with floats, and then I could do a js implementation of the simple version of hsdt, to further prove the concept. But for now, I really need to get to sleep. And I kinda wanted to implement a lisp rather than dealing with ssb internals again... Oh well =D

User has not chosen to be hosted publicly
@aljoscha %sGSwIni903eIrSjuwh3Bjm3pBxDoRieBojCChIMzkLw=.sha256

Argument against the argument against canonical formats for signed data

TLDR:

@Piet:

Maybe this is ok?

@Aljoscha: NO!


Since this rather long response starts with some definitions, you can skip straight to the core argument in the next post.

This is a more principled argument for a canonical data format, now that I'm more rested. Sleep is a wonderful thing, it lets you detect that abbreviating Hypothetical SSB Data Format as HSDT might include an error. But messages are immutable, so it's hsdt now. Maybe it's an acronym for Hermie's Shiny Data Treasure?

Anyhow, let's talk about data with a canonical encoding, since that's all I seem to be doing these days. By "data", I mean the logical set of values an ssb message could have. The logical values are not the same as actual json encodings, and they are also not the same as in-memory representations.

For example, the json strings 1, 1.0, 1.0E0, 1.0e0, 10.0E-1, 0.1E1, 1.0000000000000001, etc. all decode to the same 64 bit floating point number, as specified by the IEEE 754 floating point standard. The fact that some of them also describe the same mathematical integer, rational, real, etc. is completely irrelevant. We only care about the logical data model, and in our case, that's IEEE 754 64 bit floats. I'll call encoded strings like these equivalent, but not identical. They all decode to identical logical values though.
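
For illustration, in node:

  const encodings = ['1', '1.0', '1.0E0', '1.0e0', '10.0E-1', '0.1E1',
                     '1.0000000000000001']
  const values = encodings.map(s => JSON.parse(s))
  console.log(values)                     // [ 1, 1, 1, 1, 1, 1, 1 ]
  console.log(values.every(v => v === 1)) // true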

With objects, this difference extends into memory. We tend to think of objects as maps from strings to arbitrary logical values. For maps, there is no specified order of the entries, we only need to be able to insert, remove and query keys. So {"a": 1, "b": 2} and {"b": 2, "a": 1} are equivalent but not identical encodings. If you expect their decoded values to be logically identical though, you are wrong. Js preserves the order of object entries, so these two parse to different objects. Most critically, when decoding and then reencoding them via JSON.stringify, we get non-equivalent encoding strings again. I won't use this post to argue for true maps rather than ordered sequences of pairs as the logical data model for ssb, I did that elsewhere.
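
Concretely:

  const a = JSON.parse('{"a": 1, "b": 2}')
  const b = JSON.parse('{"b": 2, "a": 1}')
  // equivalent encodings, but re-encoding exposes the entry order:
  console.log(JSON.stringify(a)) // {"a":1,"b":2}
  console.log(JSON.stringify(b)) // {"b":2,"a":1}
  console.log(JSON.stringify(a) === JSON.stringify(b)) // false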

Just for completeness: Strings also have problems with escape sequences, but I just want to establish some base terminology, not discuss all the problems with json - such a discussion would not fit into the message limit.

A canonical encoding is one where there is exactly one valid encoding string for every logical value. For floats, that could be done by disallowing all but one decimal representation. For true maps, that means fixing the order in which the key-value pairs must be listed, for example by sorting the keys alphanumerically.

With this established, why do I care so much about canonical encodings? Because ssb needs to compute hashes. Hash functions are true mathematical functions by definition, so given the same input, they produce the same output. Intuitively, we would thus expect any single logical value to always result in the same hash across implementations, not to mention inside a single execution of a single program on a single machine. But you can't compute the sha256 hash of a logical data structure, you have to encode it first. And if there are multiple equivalent but not identical encodings of a value, different implementations could produce different hashes for the same logical value. A valid implementation is even allowed to randomly choose a valid encoding each time. In all subjectivity, I consider this an undesirable property.
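
For illustration (node's crypto module, hashing two equivalent but non-identical json encodings of the same logical value):

  const crypto = require('crypto')
  const sha256 = s => crypto.createHash('sha256').update(s).digest('base64')
  console.log(sha256('{"a":1,"b":2}') === sha256('{"b":2,"a":1}')) // false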

Continued in the next post...

cc @Dominic, @Piet, @arj, @cryptix, @keks, @cel

@aljoscha %l7PFjPvDHdmqskz9UNjEfjE0J6ktjKi5KMdrF+vZHPo=.sha256

Countering the arguments against canonical encodings

As I read it, @Dominic's main arguments against enforcing a canonical encoding are complexity and the fact that the alternative of "serialize once, verify from the original encoding" works. I consider the complexity argument void, as demonstrated by hsdt. With a simple encoding, canonicity does not require a lot of work.

The "serialize once, verify encodings" scheme would not work for something like ipfs. For the ipfs, the data deduplication requirements mandate the use of a canonic encoding. This does not hold for ssb messages. By assumption, each message contains a unique combination of feed id and backlink, so no two ssb messages should ever result in the same hash. Does that mean that we never hash the same logical value twice, and thus all my rambling is irrelevant? Yes, and no, with strict emphasis on the "no" part.

Due to the message uniqueness assumption, the "serialize once, verify encodings" approach does work for ssb. It's not pretty, our code could not be reused by any other project without this property, but it would work. It places some serious constraints on all ssb implementations though.

Hashing is not the only time in the lifecycle of an ssb message when it gets encoded. Encoding is also necessary for transmitting messages to other peers, and for persistently storing messages. In each of these three encoding use cases, there are different properties we want the encoding to have. We might want to save bandwidth by compressing the data as much as possible when sending over the network, or we might store larger, but quick-to-access data in the database. Different implementations might want to make different choices here - we can not predict their needs. These encoding choices always involve tradeoffs.

The encoding for the hashing, however, is fairly arbitrary. It should be somewhat performant, but that's it. But with "serialize once, verify encodings", we force every database to store the possibly suboptimal original encoding, whose only purpose was to compute a completely arbitrary hash. Why does it have to be stored? If Alice sends a message to Bob, Bob needs to verify it. Now suppose Carol is not directly connected to Alice, but she subscribed to her feed. Carol gets the message from Bob. But Carol still needs to verify it. And in order to do so, without canonical encodings, she needs the original message encoding. So Bob (and everyone else) needs to persist this encoding, and needs to send it over the network as well.

Bob could have two databases, one for original encodings to send during data replication to peers, and one for fast access by client applications. Running two different databases with the same content wastes space. And have fun guaranteeing consistency between these two. So as a result, "serialize once, verify encodings" forces every foreseeable ssb database backend ever into a specific storage format, or to deal with inefficient space usage and synchronization problems. Similarly, it constrains how peers can sensibly exchange messages. More space-efficient encodings become useless, since the original encoding needs to be stored anyways.

Now here's the thing: a canonical encoding used for hashes solves these problems. It decouples the bytes to be hashed from the bytes to be persisted or transmitted. With a canonic hash encoding, you can read from your highly specialized database encoding into a logical value, compute its hash, and it is guaranteed to be correct. If we find more efficient formats for data exchange between peers, we can use them, decode received values, compute the canonical encoding, hash it, and verify against the claimed hash of the message.
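
In js-flavoured pseudocode, the verification flow would just be (all helper names hypothetical):

  function verify (receivedBytes, claimedHash) {
    const value = decodeWireFormat(receivedBytes) // whatever encoding the peer used
    const canonicBytes = encodeCanonic(value)     // e.g. canonic hsdt
    return sha256(canonicBytes) === claimedHash
  }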

Of course we can just pretend we found an encoding that will be sufficient for all future rpc and database usage, hardwire it into the protocol by using it for hash computation, and have future devs curse us. But I remember a thread on ssb where futureproofing protocols (e.g. via multiformats) was generally considered a good thing. A canonic encoding does exactly this. And it costs us very little - hsdt would be both canonic, and more efficient (in both space usage and CPU cycles) than the current json-based implementation. And a new format can be introduced in a backwards-compatible way, due to ssb multihashes.

@aljoscha %f6sO0SWfJDqAgWNKEa9kCh4PC/RCtx6ByPaAyScukbc=.sha256

Errata to the section on nonequivalent objects: Js actually leaves the iteration order unspecified. But that basically makes things worse, not better.

@Dominic %A/FTHW8A5cEBS0T1BLMe2K3hJs9/V4wda/KBgZWDtXg=.sha256

@Aljoscha I'll let you have a canonical format, if it's possible to do the keep-the-serialized-data form.

my goals:

  • optimize for database reads/queries (because this is something the user can constantly feel)
  • nice to have: receive messages and write them directly to database, though if we come up with something significantly better later converting is okay.
  • a transitional format that can represent the current data - yes, the broken json format.
  • a better format that is not order dependent, etc., fully canonical.

I think we have different but overlapping goals, but I think we can find something that works for everyone.

@aljoscha %3r6sjTBz6pcqF98lWRqavdyD+PqRzmcE+tlCdYKZucs=.sha256

@cft I only received your posts just now, but we independently settled (mostly) on using a strict subset of CBOR. I'll interpret that as a good sign.

I was unaware of COSE, it is nice to see an "official" standard for this, although it seems like not much of it applies to ssb.

User has not chosen to be hosted publicly
@aljoscha %DbNCQ5aBfR+p+5qGPcHY5NUcgic7Frem4DKCPUn33+g=.sha256

@cft Disclaimer: I don't really know what i'm talking about when it comes to crypto.

  • ssb does transport encryption at a different layer of the stack, not directly on the (cbor) message data
  • signing also needs to account for metadata, which will not be in cbor
  • encrypted messages completely anonymize the recipients, and I don't think COSE supports that. But these are probably the use case most likely to benefit from COSE.
  • in general, mechanisms for upgrading crypto primitives are handled via multiformats

I will definitely do a thorough read-through of the COSE spec when starting to flesh out metadata formats. But for the reasons listed above, it does not look like we can "just use COSE" directly.
