You are reading content from Scuttlebutt
@aljoscha %lvG7sVymWqomjgoyvqkrYJ7nSn8hiPQI4cLqA7A/b9o=.sha256

I accidentally designed a human-readable data format today. It tries to deal with the issues of canonical data representation necessary for content-addressing, and with the associated problems with floating point numbers. Also, it an serve as the syntax of a #lisp. So in case anyone is interested, here's the spec.

SDN

simple data notation

The billionth human-readable data format. Derived from and very similiar to edn, but more minimalistic. Also attempts to specify how to precisely deal with floating point numbers.

General

SDN must be encoded as utf8.

The whitespace characters are \n (ASCII character 10) and (ASCII character 32).
Whitespace is ignored other than to separate elements.

An sdn file consists of exactly one element.

Any line starting with ; outside of a string literal is considered whitespace.

Elements

nil

nil is an element, it is the single value of the unit type. It represents absence of information.

Booleans

true and false are elements, they are the two values of the boolean type.

Strings

"A string, \\ \" \t \n \u11B3  "

(This description is mostly stolen from TOML)

Strings are surrounded by quotation marks. Any Unicode character may
be used except those that must be escaped: quotation mark, backslash, and the
control characters (U+0000 to U+001F, U+007F).

Escape sequences:

\t         - tab             (U+0009)
\n         - linefeed        (U+000A)
\"         - quote           (U+0022)
\\         - backslash       (U+005C)
\uXXXX     - unicode         (U+XXXX)
\UXXXXXXXX - unicode         (U+XXXXXXXX)

Any Unicode character may be escaped with the \uXXXX or \UXXXXXXXX forms.
The escape codes must be valid Unicode scalar values.

Integers

An integer consists of one or more digits 0 - 9, optionally prefixed by a -. No integer other than 0 itself may begin with a 0.

-0 is not a valid integer.

If an integer is suffixed by a N, it is an arbitrary precision integer. Else, integers outside the range of a 64 bit two's complement (smaller than -9223372036854775808 or larger than 9223372036854775807) are invalid.

Floats

A float consists of an integer (without an N suffix), followed by a dot ., followed by one or more digits 0 - 9. It may be prefixed by a -. It may be suffixed by an exponent. An exponent is an E, optionally followed by a -, followed by one or more digits 0 - 9.

NaN, Infinity, -Infinity are valid floats. -0.0 and 0.0 designate two different floats. There is only a single NaN (no signaling, no sign bit, no payload, etc).

Floats are IEEE 754 64 bit floats. Any floats that can not be represented exactly must be rounded nearest, tied to even.

Rationals

A rational consists of one or more digits 0 - 9, followed by a /, followed by one or more digits 0 - 9. It is an arbitrary precision rational number. The denominator of a rational is not important, i.e. 2/6 may be internally stored as 1/3.

Symbols

Symbols represent identifiers, they consist of alphanumeric characters, or # : / . * + ! - _ ? $ % & = < >. A symbol may not start with a numeric character, and nil, true, false, NaN, Infinity, and -Infinity are not valid symbols.

When some characters can be parsed as either a number or a symbol, they must be parsed as a number.

Lists

(x 14)

An ordered sequence of elements, represented as (, optional whitespace, any number of elements, optional whitespace and ).

Set

#{
  a b c
}

An unordered collection of elements, represented as {, optional whitespace, any number of pairs of elements, optional whitespace, and }. Each element may appear at most once.

Map

{
  a b
  c d
}

An unordered collection of key-value pairs, represented as {, optional whitespace, any number of pairs of elements, optional whitespace, and }. Each element may appear at most once.

Equality

To enforce the uniqueness of set entries and map keys, there has to be an equality relation.

nil is equal to nil, true is equal to true, false is equal to false.

Two integers are equal if they consist of the same characters. 64 bit integers and arbitrary precision integers are never equal.

Two symbols are equal if they consists of the same characters.

Two strings are equal if they result in the same data after resolving escape sequences.

Two floats are equal if they result in the same IEEE 754 64 bit float after rounding. In particular, NaN is equal to NaN.

Two rationals are equal if they both describe the same rational number, e.g. 2/6 equals 1/3.

Two lists are equal if they have the same length and all corresponding pairs of elements are equal.

Two sets are equal if they contain the same set of entries, independent of the order.

Two maps are equal if they contain the same set of pairs, independent of the order.

All other pairs of elements (in particular elements of different types) are unequal.

Continued in the next post.

@aljoscha %BWJdgqSV4sAPx4hlGYRmxyoMK2hd4iQFr4NwkpgXppU=.sha256

Canonical encoding

There can be multiple encodings of some value that decode to that same value. In some settings (e.g. when working with content-addressable storage), this ambiguity is harmful. This section describes a canonical way of encoding data to prevent these problems.

The sources of non-unique encodings are:

  • leading and trailing whitespace of an sdn file (this includes comments)
  • escape sequences in strings
  • floats with exponents
  • floats with rounding errors
  • rationals with different denominators for the same number
  • whitespace in lists, sets and maps (this includes comments)
  • order of entries in sets and maps

The canonic sdn representation of some data is obtained as followed:

Omit all leading and trailing whitespace. Omit all whitespace that is not needed for separating elements in collections. Separate these elements by exactly one space character (ASCII 32).

In strings, do not use any escape sequences other than for quotation marks ", the backslash \ and the control characters (U+0000 to U+001F, U+007F). Escape " as \", \ as \\, and control characters as \uXXXX.

Floats must be encoded such that there is a 0 left of the decimal point, and the exponent must always be included. For example, 10.0 becomes .01E2, -2.0 becomes -0.2E1, 3.0E0 becomes 0.3E1, 0.4 becomes 0.4E0. Additionally, if deserializing a float would involve rounding, serialize it as the float it would round to.

For rationals, use the denominator with the smallest possible absolute value.

For sets and maps, we define a total order on canonical elements. Sets must list their elements sorted from smallest to largest, and maps must sort from smallest to largest key. The order is defined as the reflexive (according to equality as defined above), transitive closure of the following relation <:

  • nil < false
  • false < true
  • true < any 64 bit int
  • any 64 bit int < any arbitrary precision int
  • any arbitrary precision int < any float
  • any float < any rational
  • any rational < any string
  • any string < any symbol
  • any symbol < any list
  • any list < any set
  • any set < any map
  • for 64 bit ints a and b: a < b if a is smaller than b
  • for arbitrary precision ints a and b: a < b if a is smaller than b
  • for floats a and b: a < b if totalOrder(a, b) is true, as defined in the IEEE 754 standard. Which, by the way, sits behind a paywall, but searching the web for IEEE 754 pdf or something similiar helps.
  • for rationals a and b: a < b if a is smaller than b
  • for strings a and b: a < b if the utf8 encoding of a is lexicographically smaller than that of b
  • for symbols a and b: a < b if the utf8 encoding of a is lexicographically smaller than that of b
  • for lists a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the first entry in a is smaller than the first entry in b: a < b
    • else, a < b if (a without its first entry) < (b without its first entry)
  • for sets a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the smallest entry in a is smaller than the smallest entry in b: a < b
    • else, a < b if (a without its smallest entry) < (b without its smallest entry)
  • for maps a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the smallest key in a is smaller than the smallest key in b: a < b
    • if the smallest key in a has a value smaller than the value associated with the smallest key in b: a < b
    • else, a < b if (a without its smallest key) < (b without its smallest key)
User has chosen not to be hosted publicly
@aljoscha %egSI90yb8njJAfRCKTiK6YTuIrhQgPTl0INjyRbUDNs=.sha256

@alanz Good catch! Changed to

A rational consists of an optional -, followed by one or more digits 0 - 9, followed by a /, followed by one or more digits 0 - 9.


I also realized the canonical representation of floats is ambiguous, it would allow both 0.1E0 and 0.01E1. (Hopefully) fixed by changing to

Floats must be encoded such that there is a 0 left of the decimal point, a nonzero digit after the decimal point, and the exponent must always be included.

Equivalently, it could mandate to always use the lowest possible exponent (still with a single 0 left of the decimal point), but the other formulations seems easier to understand to me.

I just found a canonical float encoding in the XSD spec, which might be better than rolling my own.

Also, simply printing and reading floats is apparently rather complicated.

I fear I'll have to read the full IEEE 754 standard and at least those papers, if I want to make sdn a real thing.

@Christian Bundy %P06BAdmR4oCJxijS5VhYcmo0KKJXc8mcW3QUUdiWL5k=.sha256

@Aljoscha

Interesting! I have a question about symbols:

Symbols represent identifiers, they consist of alphanumeric characters, or # : / . * + ! - _ ? $ % & = < >.

Could this be a typo for "non-alphanumeric"? I'm also curious whether this is specifically non-alphanumeric non-whitespace printable ASCII or all non-alphanumeric characters. Hope OI'm reading this right, thanks for sharing!

@aljoscha %bF/4zJewgbDJF0K0TRhXaDSgZ8bx23hBJu2rc47B6Ls=.sha256

@christianbundy Alphanumeric is correct. These are identifiers as used in a typical programming language, i.e. names of variables. Disallowing alphanumerics would result in some rather interesting code =D Anyways, (hopefully) clarified as

Symbols represent identifiers, valid characters are alphanumeric characters and any of # : / . * + ! - _ ? $ % & = < >.

The list of non-alphanumeric characters is a true subset of printable ASCII characters, chosen to allow writing common operators. In lisps, there is no distinction between operators and function, you'd invoke (+ 2 3) just like any function. The non-alphanumerics are chosen to allow common names such as standard arithmetic operators.

Constraints on these were:

  • ASCII only (allows implementation without caring about encoding details (if the implementation is allowed to assume valid utf8-encoded files))
  • no conflicts with the syntax for collections
  • leave out enough characters to allow syntax extensions (a lisp probably wants to use a superset of this data format, e.g. for quoting and pseudoquoting)
  • avoid confusion (no [ ] , ', few people would intuitively read [1,2,3] or 'a' as identifiers

The list can be extended without breaking backwards-compatibility, so new characters could be added - although I don't expect a practical need for this.

I was unsure about including both . and :. Perhaps I should remove one of these, so that a superset of this format could use them for scope resolution (i.e. foo.bar as in js or foo::bar as in rust).


A (somewhat) easy solution for the canonical form of floats could be to use the criteria defined in the papers on float printing:

  • correctness (rounds to the correct floating point number)
  • optimality (uses the fewest number of characters)
  • some tiebreaker in case multiple correct and optimal results exists. This could be as simple as lexicographic ordering, but I'll look whether these algorithms have some consistent, clean (if maybe implicit) tiebreaker.

Rationals need some updates as well. What is the canonical representation of x/0 for any positive integer x? What about -x/0? How are equality and the total order defined on these? Should they even be allowed? I was also sloppy with the syntax, it currently allows arbitrary trailing zeros.

@aljoscha %VeoeBI3halXd4s0WGPZe4F3tchyE+KRiZv86F4yX/cQ=.sha256

Published this on github.

I decided to disallow rationals with a denominator of 0, since this would force readers to have a data structure to represent this. In all the use cases i can imagine, simply using one of NaN, Infinity and -Infinity should suffice anyways.

Added a disclaimer about the conflict between human-friendly formats and canonical forms:

The canonical encoding somewhat defeats the point of a human-readable format, since it normalizes away all whitespace. For this reason, it might be better to use a more efficient (and in the case of floats less painful) binary encoding. For this to work, there has to be a bijection between valid canonical sdn encodings and valid canonical binary encodings.

And I settled on a canonical format for floats:

Floats must be encoded such that the resulting string:

  • rounds to the correct float
  • has a 0 left of the decimal point
  • has a nonzero digit after the decimal point
  • includes the exponent
  • is a shortest possible string satisfying these criteria
  • if there are multiple shortest strings satisfying these criteria, chose the one with the smaller exponent
  • if there are multiple shortest strings satisfying these criteria with the same exponent, chose the smallest number among them

This may look complicated, but there are algorithms for this, which are tuned for performance and are used in real projects (e.g. there is (or at least was) code in V8 that can compute this canonical form).


This has now reached a point where I can step out of the rabbit hole and start implementing it. I won't implement the canonical formatting until I need it. I also have a binary representation sketched out, but that will have to wait until I want to actually use it as well.

@aljoscha %jv8E22imE76+/3qX//Lp1Vn/HsBmvwAsa4cNQbCjQ6I=.sha256

Since I have now had an unhealthy overdose of float formatting considerations: @keks and @cryptix, how do you serialize floats when generating ssb messages in go? Do you use the same defaults as V8 as for when to use exponential notation, whether to use 'e' or 'E' etc?

I wonder whether these are specified as part of ecma script, or whether these are again implementation details that crept into the ssb protocol =/

User has not chosen to be hosted publicly
@cryptix %aDed3Gm2u14+kfnaOAcjyDUZOwVx87AVzb6KKnCZrZ4=.sha256

@Aljoscha for the parse and pretty print I so far used strconv.FormatFloat (godoc) with f. To be honest: I just fiddled around to get the tests passing. I'm realtivly sure I didn't have to consolidate different cases and took the passing of the tests over some large tests feeds as enough, which might have been negleant.

ps: I just saw that the golang json decoder also has its own Number type which exposes the parsed data as a string literal, which might be handy if this becomes a problem?

@aljoscha %4jNCUWoP0chc5Q/dxxSs+19zCHq8+6RmfYNEUH45ox4=.sha256

@cryptix In case you missed it, in this thread we unraveled the spec that ssb happens to use. So if you ever need a definite reference, you can look there. Have fun with the ecmascript specification, it is a joy to read.... =/

Join Scuttlebutt now