You are reading content from Scuttlebutt
@aljoscha %BWJdgqSV4sAPx4hlGYRmxyoMK2hd4iQFr4NwkpgXppU=.sha256
Re: %lvG7sVymW

Canonical encoding

There can be multiple encodings of some value that decode to that same value. In some settings (e.g. when working with content-addressable storage), this ambiguity is harmful. This section describes a canonical way of encoding data to prevent these problems.

The sources of non-unique encodings are:

  • leading and trailing whitespace of an sdn file (this includes comments)
  • escape sequences in strings
  • floats with exponents
  • floats with rounding errors
  • rationals with different denominators for the same number
  • whitespace in lists, sets and maps (this includes comments)
  • order of entries in sets and maps

The canonic sdn representation of some data is obtained as followed:

Omit all leading and trailing whitespace. Omit all whitespace that is not needed for separating elements in collections. Separate these elements by exactly one space character (ASCII 32).

In strings, do not use any escape sequences other than for quotation marks ", the backslash \ and the control characters (U+0000 to U+001F, U+007F). Escape " as \", \ as \\, and control characters as \uXXXX.

Floats must be encoded such that there is a 0 left of the decimal point, and the exponent must always be included. For example, 10.0 becomes .01E2, -2.0 becomes -0.2E1, 3.0E0 becomes 0.3E1, 0.4 becomes 0.4E0. Additionally, if deserializing a float would involve rounding, serialize it as the float it would round to.

For rationals, use the denominator with the smallest possible absolute value.

For sets and maps, we define a total order on canonical elements. Sets must list their elements sorted from smallest to largest, and maps must sort from smallest to largest key. The order is defined as the reflexive (according to equality as defined above), transitive closure of the following relation <:

  • nil < false
  • false < true
  • true < any 64 bit int
  • any 64 bit int < any arbitrary precision int
  • any arbitrary precision int < any float
  • any float < any rational
  • any rational < any string
  • any string < any symbol
  • any symbol < any list
  • any list < any set
  • any set < any map
  • for 64 bit ints a and b: a < b if a is smaller than b
  • for arbitrary precision ints a and b: a < b if a is smaller than b
  • for floats a and b: a < b if totalOrder(a, b) is true, as defined in the IEEE 754 standard. Which, by the way, sits behind a paywall, but searching the web for IEEE 754 pdf or something similiar helps.
  • for rationals a and b: a < b if a is smaller than b
  • for strings a and b: a < b if the utf8 encoding of a is lexicographically smaller than that of b
  • for symbols a and b: a < b if the utf8 encoding of a is lexicographically smaller than that of b
  • for lists a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the first entry in a is smaller than the first entry in b: a < b
    • else, a < b if (a without its first entry) < (b without its first entry)
  • for sets a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the smallest entry in a is smaller than the smallest entry in b: a < b
    • else, a < b if (a without its smallest entry) < (b without its smallest entry)
  • for maps a and b whose entries are already canonical:
    • if a is empty and b is not: a < b
    • if the smallest key in a is smaller than the smallest key in b: a < b
    • if the smallest key in a has a value smaller than the value associated with the smallest key in b: a < b
    • else, a < b if (a without its smallest key) < (b without its smallest key)
Join Scuttlebutt now