@Dominic %YH6PDaakwNu0D83EBHBLuwSlVhr/5aezVoE8p0IaNf0=.sha256

encoding design goals

In the interests of clarifying the ongoing discussion about the encoding and signing formats, I'd like to list the design goals and name them, so that any proposal can easily state which goals it meets.

space-efficient

the encoding should be compact when sent or stored.

parse-efficient

in the interests of database queries, the format should make it fast to read a few fields from each of thousands of encoded values.

canonical

The same data should get exactly the same hash. It should not be possible to encode the same data in different ways.
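For illustration only, here is a rough sketch (not any of the actual proposals) of what canonicity could mean for a JSON-like encoding: sort object keys before serializing, so logically equal objects always produce the same bytes. Floats and string escapes would need additional rules; this only handles key order.

```typescript
// Sketch only: one way to make a JSON-like encoding canonical is to sort
// object keys before serializing, so logically equal values always produce
// identical bytes, and therefore identical hashes.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return "{" + keys
      .map((k) => JSON.stringify(k) + ":" + canonicalJson((value as Record<string, unknown>)[k]))
      .join(",") + "}";
  }
  return JSON.stringify(value);
}

// Both spellings of the same logical object encode to the same bytes:
console.log(canonicalJson({ a: 1, b: 2 }) === canonicalJson({ b: 2, a: 1 })); // true
```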

backward-compatible

It should be possible to represent the current JSON-based data in the proposed format, such that signatures in the current format can be verified by, for example, applying a simple transformation to get JSON and then calculating the hash and signature from that.

With respect to key order, the goal of backward-compatible is not compatible with canonical, but with respect to floats it may be (will have to investigate V8's behavior in this regard).
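As a sketch of what this goal means in practice (the `decodeNewFormat` and `toLegacyJson` helpers are hypothetical, and the hash construction is simplified): an existing hash can still be checked by transforming the new encoding back into the JSON text that was originally signed, and hashing that text.

```typescript
import { createHash } from "crypto";

// Sketch under stated assumptions: `decodeNewFormat` and `toLegacyJson` are
// hypothetical helpers, not real APIs.
function verifiesAgainstLegacyHash(
  newEncodingBytes: Buffer,
  expectedHash: string,
  decodeNewFormat: (bytes: Buffer) => unknown, // assumed decoder for the new format
  toLegacyJson: (value: unknown) => string     // assumed transform back to the signed JSON text
): boolean {
  const jsonText = toLegacyJson(decodeNewFormat(newEncodingBytes));
  const hash =
    "%" + createHash("sha256").update(jsonText, "utf8").digest("base64") + ".sha256";
  return hash === expectedHash;
}
```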

well-specified

Ideally, we would use an official standard (such as CBOR) or (better) a subset. A possible compromise is to adapt such a standard to facilitate other goals.

easy to implement

If there are any alterations to a specified standard, they ought to be easy to implement, and checking them should be simple. This goal should synergize with well-specified.

current proposals

the currently suggested proposals and which goals they satisfy.

null proposal

the JSON format currently in use.

backwards-compatible (by definition) and easy-to-implement (but only in javascript).
It is not easy-to-implement in any other environment, nor does it meet the other goals.
It is mildly parse-efficient in the sense that there are worse formats.

better binary encoding (transitional)

thread code

Targets backwards-compatible (via a transformation to json) and parse-efficient. It is pretty space-efficient, although that is not a primary goal, and is expected to be equally easy-to-implement in any environment.

The specification needs work. Something closer to a published standard but meeting the same goals would be better.

mvhsdt (canonical)

thread. A subset of CBOR with an additional space-efficiency optimization for arrays.

Targets canonical, well-specified and space-efficient. This is not backwards-compatible, so implementers would need to choose between also supporting the null protocol or not supporting the current messages.
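To make the CBOR angle concrete, here is a toy sketch (not the hsdt spec, and only handling tiny ints, short strings and small maps) of how a canonical CBOR subset could encode a small map: definite lengths only, sorted keys, shortest integer encodings.

```typescript
// Toy sketch of canonical CBOR encoding rules; not the hsdt specification.
function encodeCanonicalCbor(value: unknown): number[] {
  if (typeof value === "number" && Number.isInteger(value) && value >= 0 && value < 24) {
    return [value]; // major type 0 (unsigned int), immediate value
  }
  if (typeof value === "string" && value.length < 24) {
    const bytes = Array.from(Buffer.from(value, "utf8"));
    return [0x60 + bytes.length, ...bytes]; // major type 3 (text string), immediate length
  }
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    const keys = Object.keys(value).sort(); // canonical: keys in sorted order
    const body = keys.flatMap((k) => [
      ...encodeCanonicalCbor(k),
      ...encodeCanonicalCbor((value as Record<string, unknown>)[k]),
    ]);
    return [0xa0 + keys.length, ...body]; // major type 5 (map), immediate pair count
  }
  throw new Error("unsupported value in this sketch");
}

// {"a": 1} always encodes to [0xa1, 0x61, 0x61, 0x01], regardless of insertion order.
```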

your proposal here?

would it be possible to have an encoding that starts as a backward-compatible design and then switches (by sorting keys, etc.) to become a canonical format?

what goals are important to people? for me the biggest priorities are parse-efficient and backward-compatible, but I would be happy to design for a transition towards canonical. in this post:

I would prefer to have something backwards compatible and take the network to the next level as a whole. Leaving the community behind would be a failure.

@Aljoscha, you seem to be focused on canonical - how do you feel about a design that can transition from backward-compatible to canonical? (As I see it, this would require an internal representation that preserves key order, but only for old messages.)

@cryPhone📱 %Koilz8/Bsqn1nu0Ws2lrYcDlOrf0eug/Gdi3o4gOqCU=.sha256

the quote is from me, not aljoscha.

User has not chosen to be hosted publicly
@aljoscha %UEfWC40HZ2Gw0VuNCyrppK37lfsSM+u+MdTnroxEiy0=.sha256

@Dominic I'm all for backwards-compatible extension and against a reboot/fork. I think hsdt can be introduced in a backwards-compatible manner.

Just to check whether I correctly understood your formulation of the backwards-compatibility criterion (it took me a few attempts to parse this):

The first part of your formulation of the backwards-compatibility requirement ("It should be possible to represent the current JSON based data in the proposed format, [...]") does not talk about the logical data models, but about concrete representations, correct? Because the logical model is satisfied by hsdt: the logical data model of hsdt is a superset of the js values that form the logical model of json. This distinction is important: e.g. json does not support Infinity, yet the logical model contains infinities, since e.g. 123E456 gets parsed to the float representing positive infinity.
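For anyone who wants to see that quirk directly (just an illustration in node, not part of any proposal):

```typescript
// json has no literal for Infinity, yet parsing a large exponent yields one,
// so the logical model already contains infinities even though they cannot
// be round-tripped back to json.
const x = JSON.parse("123E456"); // the float overflows to Infinity
console.log(x === Infinity);     // true
console.log(JSON.stringify(x));  // "null" - Infinity cannot be re-encoded as json
```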

But for concrete encodings, it's true that canonicity and this criterion conflict (for object order, as well as for floats and for escape sequences in strings). But do we really need such a strong criterion for introducing the new encoding without forking the network?

I simply assumed implementations would keep sending messages in the encoding that is used by their hash. So all old messages would be sent as json, but newly generated messages signed by hashing the new encoding would be sent in this new encoding. There'd be a transition period where sbot would update to include code for dealing with the new format, while still using json. And once sufficient time has passed and most sbots are able to handle the new format, we start using the new encoding.
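A hypothetical sketch of that rule (the `format` tag and the names are made up for illustration): each message keeps the encoding its hash was computed over, and replication never re-encodes.

```typescript
// Assumption for illustration: messages carry a format tag saying which
// encoding their hash was computed over.
type StoredMessage =
  | { format: "json"; bytes: Buffer }  // legacy: hash computed over the json text
  | { format: "hsdt"; bytes: Buffer }; // new: hash computed over the binary encoding

// Replication never re-encodes: the bytes on the wire are exactly the bytes
// that were hashed and signed.
function bytesToReplicate(msg: StoredMessage): Buffer {
  return msg.bytes;
}
```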

We don't get the efficiency benefits for old messages, but this is playing the long game. And if you want rpcs that can transmit old messages more efficiently, that's completely orthogonal. You could implement rpcs that transmit old messages as non-canonical cbor; the standard even contains a section on how to convert between json and cbor. But this does not affect the encoding used for hash generation at all.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %Qi6B9Eobql5XOX0EcRaFOmu1h6P2rKTkAVe1IsQfIkw=.sha256

What happens to historical messages, of which thousands exist? Did I misunderstand something?

Their relative amount on the network will converge towards zero. Imagine if the www had made design decisions that optimized for the tiny set of websites of 1994. The disproportionate impact of those sites, compared to the web of today, would be ridiculous. Oh wait, they did actually make those irrational decisions. It's one of the reasons why pretty much any format on the www is a mess.

The old messages have worked so far; they will not suddenly break on us. They are not well-designed, but they do get the job done. Their hashes are already set in stone, and there is no way to change that. All implementations will have to be able to verify old messages, but not to produce them.

If efficiency demands it, there can be a separate set of rpcs (or some additional rpc parameters) that allow transmitting them in some cbor subset with a bijection to json. Receivers can then transform the cbor to json and verify. But this efficient json replacement won't be used in any hash calculations anywhere.

An alternate choice is to transmit the old json in the form that was used to compute the hash. That's even more inefficient than the current design, but it makes it extremely simple for newer ssb implementations to handle legacy messages: just run sha256 on the bytes, done (actually there's some complication because the signature is inserted inside the json, not appended, but that can be worked around, since all signatures have the same size). Since there will be virtually no old messages, the lower bandwidth efficiency (and it's not even that much worse than the current system) does not outweigh the fact that no one needs to emulate the current verification algorithm. This is my preferred approach, but I expect @Dominic to weigh the short-term performance penalty higher than the simplicity gains, and thus to oppose it. :green_heart: Dominic, no hard feelings.
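A hedged sketch of what "just run sha256 on the bytes" could look like; the offsets for slicing out the fixed-size signature field are assumptions for illustration, not the real ssb layout or verification code.

```typescript
import { createHash } from "crypto";

// Sketch: the message id is simply the hash of the json bytes as transmitted.
function legacyMessageId(jsonBytes: Buffer): string {
  return "%" + createHash("sha256").update(jsonBytes).digest("base64") + ".sha256";
}

// To check the signature itself, the signature field (of known, fixed length)
// would be cut out of the received bytes rather than re-serializing the json.
// The start/length parameters are assumptions about the layout.
function signedPortion(jsonBytes: Buffer, sigFieldStart: number, sigFieldLength: number): Buffer {
  return Buffer.concat([
    jsonBytes.slice(0, sigFieldStart),
    jsonBytes.slice(sigFieldStart + sigFieldLength),
  ]);
}
```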

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@Dominic %hHjcqytnr1uBwKZ5gMdOqkRJ8UIFXArjW8WnZvktRA4=.sha256

@cryptix sorry, that quote is indeed from you. I had intended to attribute it to you, and then have the following bit address @aljoscha.

@aljoscha yes, these two concerns are orthogonal - and that is a good thing! they are not opposing concerns! that means it's possible to have a design that addresses them both!

Any change like this will require a significant amount of groundwork, and quite a bit of work to deploy it. Any plan of development needs to be rewarding. I can definitely see the case for long-term improvements, but to actually be viable, I believe we need short-term improvements as well. I personally feel strongly motivated to make the database more performant. This is something I can do that will improve scuttlebutt across the board, and make it more appealing to anyone who wants to build on top of scuttlebutt. If it's possible to address this in your design, I can absolutely say it will have my support and enthusiasm.

Because it's not just about designing something better - that's easy. It's not enough to envision utopia; you also need a plan for how you are gonna get from A to B. This is a key part of ssb philosophy (the upgradeable principle) and what separates us from projects like Xanadu.

I don't wanna have an ipv6 situation - something that is obviously better, but for some reason everyone is still using the old crappy one. The history of technology is littered with examples where the objectively better technology didn't win (betamax vs vhs is a favorite example). It's essential to design for how the idea is distributed, not just what it is.

So: I wanna make the old crappy one better, in a way that people will upgrade to just for short-term reasons, but in doing so, have all the pieces in place to switch to the new, better format.

I propose a rollout plan as follows: implement a transitional format that has semi-canonical json semantics (and thus clarifies the current format) and optimizes performance, but converts to json before hashing and signing. Then we switch that to fully canonical mode: we just hash and sign the binary data and skip the conversion, while still gaining the performance improvement.
And, also importantly, we fully clarify the old format, which we still expect implementers to support.

Ideally, "clarification of the current format" could mean something like an easy to include c library that handled it fully so that implementers don't need to write that code again.

@aljoscha %sQTPrTr9+G9/9tB+YpC8hHcA86g7N4XrDqMZOd/fmuw=.sha256

yes, these two concerns are orthogonal - and that is a good thing! they are not opposing concerns! that means it's possible to have a design that addresses them both!

I think this quote perfectly demonstrates why we seem to disagree on so many things although we have similar goals. To you, the orthogonality means that we can find a single solution to both. To me, it means that we should avoid a single solution to both. Each of these problems can be tackled individually, without the need for any compromises. A single solution needs to satisfy the constraints of both problems, which will either increase complexity or reduce quality.

I do not see the rollout plan as an ordered sequence of steps. To me, these are two fully independent changes that could happen in any order.

@Dominic %J1gpPxJ+DYP0uYwYlnPQ5XdhF8XBgsr5eH2CLkbZQRY=.sha256

@aljoscha okay, but let's just explore whether this is possible. we can compare designs that meet my goals directly, meet your goals directly, or meet both our goals. Maybe there is a design that works for both of us? If so, that is the "politically correct" choice.

@Dominic %jxEH4dy298BPMEM4oWm/bAUo7xJITE6uZINVpLrBqs4=.sha256

@cft btw: a good p2p protocol needs to be robust to partial failures - we must assume that there are always some connections that fail, and some peers that cannot talk to some other peers, so we can treat upgrades as partial failures. Some peers can eagerly upgrade, some peers can upgrade with compatibility, and some are stuck on the legacy mode - but the compatibility peers provide a bridge between old and new formats. At least, this makes the transport protocol easy to upgrade (we have already done this with EBT); message encoding formats are still hard, but it's still an improvement over the web. However, I think that getting things to upgrade is much easier nowadays. I remember installing my first web browser (netscape) from a CD our ISP gave us - that makes shipping updates quite difficult! But I haven't owned a computer with a CD drive in years!

User has not chosen to be hosted publicly
@aljoscha %Oa+reViWN5VtHd+JPb/CzJ7BM9Z//Hw/IErYlJ8CTeM=.sha256

Just jotting down some immediate thoughts/questions on this post, without yet having fully thought through things:

  • how much does the ssb vs ip analogy hold? Packets holding ip addresses are ephemeral, so at some point the last ip4 packet will disappear. Ssb needs to store all old messages, so we don't have that luxury.
  • each cypherlink to a message signifies some sort of trust, it implicitly establishes that the author considers the linked-to message valid. Does this trust vanish in an upgrade process that turns old messages into new messages?
  • how does this relate to changes in the message format?
    • breaking changes (such as making maps unordered (yeah, this is a breaking change, we are kinda cheating by assuming that no applications rely on map ordering (let's keep being pragmatic please, we don't want to be stuck with ordered maps forever)))
    • non-breaking changes such as adding new data types (e.g. integers)
  • how does this relate to updates of cryptographic primitives? Looks like somewhere in here is a mechanism to preserve (at least some of the) trustworthiness of messages whose cryptographic primitives have been broken.

Would be delighted if anyone has more fully formed thoughts on these points or could link to some literature.

CC @cft (This is the last @mention spam for today. Probably. Unless it is not. It's really unfortunate your feed got delayed and we could not have any of these discussions closer to real-time.)
