@aljoscha %euTgqawxHIGxR6rGWWvsn7nNVKQ4jojF0ljMoGXvhBM=.sha256
Re: %FI3kBXdFD

Ok, sleep is overrated anyways. Here's a 20-minute design of a hypothetical data format ssb could use. I just wrote this down without spending much time thinking about it, so please don't interpret this as a serious proposal. It simply serves as a demonstration that nothing about a data format with nice properties is inherently complex. From the perspective of a computer, this format is both simpler and easier than json.

The concrete encoding is fairly arbitrary; there surely are more efficient alternatives that could be substituted without violating any important properties. Also, this is not at all how I'd design this from scratch. Instead, it incorporates the json/js/ssb quirks I remember off the top of my head.

Hypothetical SSB Data Format (HSDT)

Logical Data Types

An hsdt value is one of the following:

  • null
  • a boolean (true or false)
  • a utf8 encoded string (may include null bytes) // TODO replace utf8 by whatever encoding js actually uses, if that is how ssb stores strings
  • an IEEE 754 double precision floating point number, excluding the infinities and NaNs
  • an ordered sequence of values, called an array
  • an unordered mapping from strings to values, called an object. An object may not contain the same key multiple times (strictly speaking, this may not be compatible with ssb, because json doesn't enforce this)
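
To make the data model above concrete, here is a rough Python sketch of a check for whether a value fits it. The function name is_hsdt_value and the mapping to Python types (None, bool, float, str, list, dict) are my assumptions for illustration, not part of the design.

```python
import math


def is_hsdt_value(value):
    """Return True if `value` fits the logical data types listed above."""
    if value is None or isinstance(value, bool):
        return True
    if isinstance(value, float):
        # Infinities and NaNs are excluded from the data model.
        return math.isfinite(value)
    if isinstance(value, str):
        # Null characters are fine; lone surrogates cannot be encoded as utf8.
        try:
            value.encode("utf-8")
        except UnicodeEncodeError:
            return False
        return True
    if isinstance(value, list):
        return all(is_hsdt_value(item) for item in value)
    if isinstance(value, dict):
        # A Python dict cannot hold the same key twice, so only the
        # key and value types need checking here.
        return all(
            isinstance(key, str) and is_hsdt_value(key) and is_hsdt_value(item)
            for key, item in value.items()
        )
    # Python ints, bytes, tuples etc. are not part of the model.
    return False
```

Note that in this sketch a Python int such as 1 is rejected, since the only number type in the model is the double; callers would pass 1.0 instead.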

Human-Readable Encoding

Encode as json, with arbitrary choices for whitespace, float formatting, object order, etc. Never use the human-readable encoding programmatically, other than to display it to the user.
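
As a small illustration, the human-readable encoding could be produced with Python's standard json module. The function name to_human_readable is made up for this sketch, and the indentation is an arbitrary choice, which is exactly the point: the output is only meant for display.

```python
import json


def to_human_readable(value):
    """Render a value as json for display purposes only."""
    # allow_nan=False makes json.dumps raise ValueError for NaN and the
    # infinities, which are not valid values in this format anyway.
    return json.dumps(value, indent=2, ensure_ascii=False, allow_nan=False)


print(to_human_readable({"type": "post", "tags": ["a", "b"], "score": 1.5}))
```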

Encoding

Non-canonic binary encoding of an hsdt value:

  • if it is null, encode as an unsigned 8-bit integer 0
  • if it is true, encode as an unsigned 8-bit integer 1
  • if it is false, encode as an unsigned 8-bit integer 2
  • if it is a float, encode as an unsigned 8-bit integer 3, followed by the 64 raw bits of the float
    • when parsing, error if the raw bits signify an infinity or NaN
  • if it is a string, encode as an unsigned 8-bit integer 4, followed by the length as a varint, followed by that many bytes
    • when parsing, error if not valid <insert the encoding used by ssb here>
  • if it is an array, encode as an unsigned 8-bit integer 5, followed by a varint indicating the number of elements (alternate version: a varint indicating the number of following bytes that correspond to this array), followed by the encodings of the contained values in order
  • if it is an object, encode as an unsigned 8-bit integer 6, followed by a varint indicating the number of entries (alternate version: a varint indicating the number of following bytes that correspond to this object), followed by the encoding of the key of one of the entries (optimisation: skip the tag designating it as a string), then the encoding of the corresponding value. Repeat until all entries have been encoded.
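
The rules above can be sketched in Python roughly as follows. Several details are my assumptions, since the design leaves them open: varints are unsigned LEB128, the 64 float bits are written big-endian, strings are utf8, arrays and objects use the element-count version, and keys keep their string tag (the optimisation is not applied). All function names are made up, and the decoder is not hardened against truncated input.

```python
import math
import struct


def encode_varint(n):
    """Unsigned LEB128 (assumed): 7 bits per byte, high bit means 'more'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode(value):
    if value is None:
        return b"\x00"
    if value is True:
        return b"\x01"
    if value is False:
        return b"\x02"
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("infinities and NaNs are not hsdt values")
        return b"\x03" + struct.pack(">d", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"\x04" + encode_varint(len(raw)) + raw
    if isinstance(value, list):
        return b"\x05" + encode_varint(len(value)) + b"".join(map(encode, value))
    if isinstance(value, dict):
        out = b"\x06" + encode_varint(len(value))
        for key, item in value.items():  # entry order is arbitrary here
            if not isinstance(key, str):
                raise TypeError("object keys must be strings")
            out += encode(key) + encode(item)
        return out
    raise TypeError(f"not an hsdt value: {value!r}")


def decode(buf, pos=0):
    """Parse one value starting at `pos`; return (value, next_position)."""
    tag = buf[pos]
    pos += 1
    if tag in (0, 1, 2):
        return (None, True, False)[tag], pos
    if tag == 3:
        (number,) = struct.unpack_from(">d", buf, pos)
        if not math.isfinite(number):
            raise ValueError("raw bits signify an infinity or NaN")
        return number, pos + 8
    if tag == 4:
        length, pos = decode_varint(buf, pos)
        raw = bytes(buf[pos:pos + length])
        if len(raw) != length:
            raise ValueError("truncated string")
        return raw.decode("utf-8"), pos + length  # raises on invalid utf8
    if tag == 5:
        count, pos = decode_varint(buf, pos)
        items = []
        for _ in range(count):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    if tag == 6:
        count, pos = decode_varint(buf, pos)
        obj = {}
        for _ in range(count):
            key, pos = decode(buf, pos)
            if not isinstance(key, str) or key in obj:
                raise ValueError("object keys must be distinct strings")
            obj[key], pos = decode(buf, pos)
        return obj, pos
    raise ValueError(f"unknown tag {tag}")
```

A round trip would look like decode(encode({"a": [1.5, None]})), which yields the original value together with the position just past its last byte.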

Canonic Encoding

The only source of nondeterminism in the non-canonic encoding is the order of entries in an object. For the canonic encoding, sort the entries lexicographically based on their key.
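
A self-contained Python sketch of the canonic encoding might look like this. As assumptions of mine: keys are compared by their utf8 bytes (the text only says "lexicographically"), varints are unsigned LEB128 and floats are big-endian. With LEB128 specifically, a parser would also have to reject padded varints for the encoding to be unique, since that varint format allows more than one byte sequence per number. The names varint and encode_canonic are made up.

```python
import math
import struct


def varint(n):
    """Minimal-length unsigned LEB128 (assumed varint format)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_canonic(value):
    if value is None:
        return b"\x00"
    if value is True:
        return b"\x01"
    if value is False:
        return b"\x02"
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("infinities and NaNs are not hsdt values")
        return b"\x03" + struct.pack(">d", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"\x04" + varint(len(raw)) + raw
    if isinstance(value, list):
        return b"\x05" + varint(len(value)) + b"".join(map(encode_canonic, value))
    if isinstance(value, dict):
        if not all(isinstance(key, str) for key in value):
            raise TypeError("object keys must be strings")
        # The only difference from the non-canonic encoding: a fixed entry
        # order, here bytewise by utf8-encoded key (an assumption).
        entries = sorted(value.items(), key=lambda kv: kv[0].encode("utf-8"))
        body = b"".join(encode_canonic(k) + encode_canonic(v) for k, v in entries)
        return b"\x06" + varint(len(value)) + body
    raise TypeError(f"not an hsdt value: {value!r}")


# Two objects that differ only in insertion order encode identically.
first = {"b": 1.0, "a": None}
second = {"a": None, "b": 1.0}
assert encode_canonic(first) == encode_canonic(second)
```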
