You are reading content from Scuttlebutt
@aljoscha %BTnweXwHe134te24iGMOM/n+QK6V429bLDVcUxnjI3E=.sha256

Here's the hsdt data format definition I'll begin implementing now. CC @Dominic

Quick summary

The format is obtained from the current, json-based one, by adding the following values:

  • signed and unsigned integers (8, 16, 32, 64 bit)
  • two float types (32 and 64 bit), supporting negative zero, both infinities, and a single NaN value
  • byte strings
  • multifeeds (e.g. @foo.ed25519)
  • multihashes (e.g. &bar.sha256)
  • sets
  • maps with arbitrary keys

Actual Spec

Abstract Data Model

The hsdt data model is based upon cbor. Some features were omitted to ensure canonicity, others were added for ssb-specific use cases.

Null

null is an hsdt value that carries no information.

Booleans

true and false are hsdt values, called booleans.

Integers

Hsdt allows signed and unsigned fixed-width integers of sizes 8, 16, 32 and 64 bits. They are referred to as i8, i16, i32 and i64 (signed) and u8, u16, u32 and u64 (unsigned).

Floating Point Numbers

Hsdt allows the 32 bit and 64 bit floating point numbers from the IEEE 754 standard. They are referred to as f32 and f64.

The only difference from IEEE 754: there is just a single value representing NaN per float type, not multiple ones.

Multifeeds

A multifeed is an hsdt value.

Multihash

A multihash is an hsdt value.

Byte Strings

An ordered sequence of bytes of length between 0 and 2^64 - 1 (inclusive) is an hsdt value, called a byte string. Such a byte string may include null bytes.

UTF-8 Strings

An ordered sequence of bytes which form valid utf8 of length between 0 and 2^64 - 1 (inclusive) is an hsdt value, called a utf-8 string. Such a utf-8 string may include null bytes.

Arrays

Let n be a natural number less than 2^64, and let v_0, ..., v_n be hsdt values.

The ordered sequence [v_0, ..., v_n] is an hsdt value, called an array.

Sets

Let n be a natural number less than 2^64, and let v_0, ..., v_n be hsdt values.

The (unordered) set #{v_0, ..., v_n} is an hsdt value, called a set.

Maps

Let n be a natural number less than 2^64, let k_0, ..., k_n be pairwise distinct hsdt values, and let v_0, ..., v_n be arbitrary hsdt values.

The (unordered) set of pairs (k_i, v_i) for all 0 <= i <= n is an hsdt value, called a map. The pairs are called entries, their first components are called keys, and their second components are called values.

@aljoscha %2OU+8wxtzrfZpbapZhuiHZyrdBMMkr2KAP51+Ngbpoc=.sha256

And an encoding:

Default Encoding

This encoding of hsdt values is based on a canonical subset of cbor, extended to express the non-cbor values (sets, multifeeds, multihashes). It is used for both message signing and the exchange of messages between processes.

The default encoding is defined as follows:

Encoding Null

This is identical to cbor.
null is encoded as the byte 0b111_10110 (0xf6).

Encoding Booleans

This is identical to cbor.
true is encoded as the byte 0b111_10101 (0xf5).
false is encoded as the byte 0b111_10100 (0xf4).

Encoding Integers

Integers are encoded with a tag byte indicating size and signedness, followed by the big-endian representation of the integer, using two's complement for signed integers. Note that this differs from how cbor handles integers.

The tag byte is taken from the following table:

Integer Type  Tag (Binary)  Tag (Hex)
u8            0b000_11000   0x18
u16           0b000_11001   0x19
u32           0b000_11010   0x1a
u64           0b000_11011   0x1b
i8            0b001_11000   0x38
i16           0b001_11001   0x39
i32           0b001_11010   0x3a
i64           0b001_11011   0x3b

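As a concrete illustration of the table above, here is a minimal Python sketch of the tag-byte-plus-big-endian scheme (the `encode_int` helper and its tag/width tables are hypothetical, not part of any hsdt implementation):

```python
# Hypothetical helper illustrating the integer encoding described above:
# one tag byte from the table, then the big-endian fixed-width value,
# using two's complement for the signed types.
TAGS = {"u8": 0x18, "u16": 0x19, "u32": 0x1a, "u64": 0x1b,
        "i8": 0x38, "i16": 0x39, "i32": 0x3a, "i64": 0x3b}
WIDTHS = {"u8": 1, "u16": 2, "u32": 4, "u64": 8,
          "i8": 1, "i16": 2, "i32": 4, "i64": 8}

def encode_int(value, ty):
    signed = ty.startswith("i")
    body = value.to_bytes(WIDTHS[ty], "big", signed=signed)
    return bytes([TAGS[ty]]) + body

print(encode_int(5, "i32").hex())  # 3a00000005 — a positive value with a signed tag
print(encode_int(-1, "i8").hex())  # 38ff — two's complement
```

Note that, unlike cbor, a positive value can carry a signed tag here; this point comes up again later in the thread.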
Encoding Floating Point Numbers

This is identical to cbor.

A 32 bit floating point number is encoded as the byte 0b111_11010 (0xfa), followed by the four bytes of the number (sign, exponent, fraction in that order).

A 64 bit floating point number is encoded as the byte 0b111_11011 (0xfb), followed by the eight bytes of the number (sign, exponent, fraction in that order).

Encoding Multifeeds

A multifeed is encoded as the byte 0b111_11100 (0xfc), followed by the compact encoding of the multifeed.

Encoding Multihashes

A multihash is encoded as the byte 0b111_11101 (0xfd), followed by the compact encoding of the multihash.

Encoding Byte Strings

Byte strings are encoded as in cbor, with the following restrictions:

  • no indefinite length byte strings (additional type 31 is not allowed)
  • byte strings must use the shortest possible encoding of their length
Encoding Utf-8 Strings

Utf-8 strings are encoded as in cbor (called "text strings" there), with the following restrictions:

  • no indefinite length utf-8 strings (additional type 31 is not allowed)
  • utf-8 strings must use the shortest possible encoding of their length
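The shortest-possible-length rule is the same one cbor's canonical form uses: additional types 0-23 encode the length inline, and 24-27 select a 1/2/4/8-byte big-endian length. A rough Python sketch (the `encode_header` helper is hypothetical):

```python
def encode_header(major, n):
    # Major type in the top 3 bits; then the smallest
    # additional-info form that can hold the length n.
    if n < 24:
        return bytes([major << 5 | n])
    for ai, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * width):
            return bytes([major << 5 | ai]) + n.to_bytes(width, "big")
    raise ValueError("length exceeds 2^64 - 1")

print(encode_header(2, 5).hex())    # 45 — length fits inline in the additional type
print(encode_header(2, 300).hex())  # 59012c — 2-byte length, the shortest that fits
```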
Encoding Arrays

Arrays are encoded as in cbor, with the following restrictions:

  • no indefinite length arrays (additional type 31 is not allowed)
  • arrays must use the shortest possible encoding of their length
Encoding Sets

Sets are encoded just like arrays, except the second-most-significant bit of the first byte is 1 rather than 0 (e.g. the empty set is 0b110_00000 or 0xc0).

The entries of the set must be sorted lexicographically by their encoding. When decoding, encountering an entry that is not lexicographically greater than its predecessor must be treated as an error. Note that this also disallows duplicate entries.
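A decoder-side sketch of this rule in Python (the `check_strictly_sorted` helper is hypothetical; the same check applies to map keys):

```python
# Sketch of the validation rule above: entries must be strictly increasing
# in the lexicographic order of their encodings. Strictness also rules out
# duplicates, since equal entries have equal encodings.
def check_strictly_sorted(encoded_entries):
    for prev, cur in zip(encoded_entries, encoded_entries[1:]):
        if cur <= prev:
            raise ValueError("entry not lexicographically greater than predecessor")

check_strictly_sorted([b"\x18\x2a", b"\xf4", b"\xf5"])  # ok: strictly increasing
# check_strictly_sorted([b"\xf4", b"\xf4"]) would raise: duplicates disallowed
```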

Encoding Maps

Maps are encoded as in cbor, with the following restrictions:

  • no indefinite length maps (additional type 31 is not allowed)
  • maps must use the shortest possible encoding of their length

The keys of the map must be sorted lexicographically by their encoding. When decoding, encountering a key that is not lexicographically greater than its predecessor must be treated as an error. Note that this also disallows duplicate keys.

Non-Canonic Encoding

The non-canonic hsdt encoding lifts the restriction that set entries and map keys must be sorted lexicographically. Note that:

  • duplicate set entries or map keys are still disallowed
  • no other restrictions are lifted (in particular, length indicators for strings/collections must still be as small as possible)
@aljoscha %UFX7zsitzPTfmbbbaHkOHXOwnk62XZ7t3b/gd1NMZKI=.sha256

Errata for the encoding: 32 bit floating point NaN is always encoded as 0xfa7fc00000, 64 bit floating point NaN is always encoded as 0xfb7ff8000000000000.
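These bit patterns are the standard quiet-NaN representations; a quick check with Python's stdlib `struct` module:

```python
import struct

# The canonical NaN payloads from the errata, decoded as IEEE 754 floats.
# 0xfa / 0xfb are the f32 / f64 tag bytes; the remaining bytes are the payload.
f32_nan = struct.unpack(">f", bytes.fromhex("7fc00000"))[0]
f64_nan = struct.unpack(">d", bytes.fromhex("7ff8000000000000"))[0]

# NaN is the only float value that is unequal to itself.
print(f32_nan != f32_nan, f64_nan != f64_nan)  # True True
```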

@Dominic %zO5nEo8xY4UlxxAPr3BmqBKe/g9J+B2wFHKMA1JSMLc=.sha256

@aljoscha quick questions:

  • what are the advantages of your way of doing integers over CBOR's?
  • All I really care about is having in-place access to fields. Such that pointer = seekPath(record, key) returns a pointer to where the value associated with key is inside the record. I think this is the key to having a high performance database
  • what does hsdt stand for again?
@aljoscha %LBCn1/fpa6HQBo9CgQ9R8EUxdTfcqpvKVe8mhDPcwvE=.sha256

@Dominic

what are the advantages of your way of doing integers over CBOR's?

CBOR does not represent negative integers as two's complement, but rather in a way such that any value of major type 1 is negative. This makes it impossible to send a positive number as a signed integer type. In hsdt, it should be possible to send e.g. the integer 5 while declaring its type as a signed 32 bit integer. That would be encoded as the tag with major type 1 and additional type 26, followed by the bytes 0x00000005.

Put another way: Cbor does not care about preserving type information beyond "it's an integer", using the different additional types only to minimize the encoding size. Hsdt uses them to preserve the exact type instead.

Not having to convert unsigned integers into the machine representation (endianness aside) is also a plus.

All I really care about is having in-place access to fields. Such that pointer = seekPath(record, key) returns a pointer to where the value associated with key is inside the record. I think this is the key to having a high performance database

Yup, that should be mostly possible. You can't do binary search on sets/maps though, since you don't know where new values begin without scanning through things linearly. But at least you can skip across primitives (null, bool, integers, floats, both strings, multifeeds, multihashes) after looking at the first byte (or first few bytes for variable-length primitives).

So really efficient access still needs some sort of indices, but a naive linear scan should be pretty fast (and as a corollary, generating indices should be pretty fast).
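The "skip primitives after looking at the first byte" idea can be sketched like this (hypothetical helper; the tag values follow the encodings earlier in this thread, and variable-length primitives like strings, multifeeds, and multihashes would additionally read a length header or compact-encoding prefix):

```python
# Bytes consumed by each fixed-width primitive, keyed by its tag byte.
FIXED_SKIP = {
    0xf4: 1, 0xf5: 1, 0xf6: 1,           # false, true, null
    0x18: 2, 0x19: 3, 0x1a: 5, 0x1b: 9,  # u8..u64 (tag + payload)
    0x38: 2, 0x39: 3, 0x3a: 5, 0x3b: 9,  # i8..i64
    0xfa: 5, 0xfb: 9,                    # f32, f64
}

def skip_primitive(buf, offset):
    # Returns the offset just past the fixed-width primitive at `offset`.
    return offset + FIXED_SKIP[buf[offset]]

print(skip_primitive(bytes.fromhex("3a00000005f4"), 0))  # 5 — lands on the bool tag
```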

what does hsdt stand for again?

Hypothetical SSB Data Format, with the "t" being a two-in-the-morning typo. But I decided to roll with it, celebrating the magic of immutable append-only logs. A possible backronym is Hermie's Shiny Data Treasure.

I'm not attached to that name at all, feel free to suggest a better name/backronym.

@Dominic %ATVbjXMZecYRG84majjbp/fl9T+eEUUcl5Uud/pVNcA=.sha256

@aljoscha but do you have any objections to length delimited collections? this also makes an in-place reader simpler because you can tell the byte length of anything just by looking at the tag. I'm convinced we need this unless you can show me that the perf difference is not significant.

@aljoscha %422qMvK2NktCbyAq58TlL/65YmHG5BRE3qufcZPaRlQ=.sha256

@Dominic

I'm convinced we need this unless you can show me that the perf difference is not significant.

I'm convinced we don't need this unless you can show me that the perf difference is significant.

<3


but do you have any objections to length delimited collections?

Why yes, as usual I have objections. A couple of them, so I'll try to keep each single one concise (actually I'm just lazy and want to go to sleep).

For the remainder of this post, "size" refers to the number of entries in a collection, whereas "length" refers to the number of bytes the encoding takes up.

  • length-delimited does not enable binary search, it just allows skipping over collections without iterating their content
    • worst-case access time complexity thus stays linear, it only speeds up scanning through nested data structures
  • all hsdt values will be contained in messages, so this isn't relevant for scanning through the full flume log
  • size-delimited is much simpler/easier to produce (length-delimited needs memoization to keep linear time complexity, and that memoization is annoying to get right and either requires preallocating additional memory for caching the length (which is wasteful if the data is never actually serialized), or expensive dynamic memory allocation (with some sort of tree datastructure just to keep track of the lengths, with bad cache locality))
  • size-delimited is simpler/easier to parse
    • imagine you are a parser and you encounter an array
      • part of me wants to just stop here, that sentence is already beautiful as it is
    • you read the size, and you allocate that many slots of memory for new values
    • but if you got the length instead, what would you do? You can't just allocate that many bytes; twenty u32s require different allocations than two sets of 50 entries each.
  • the format is intended to be used for signing and transport, it seems off to compromise on simplicity to support a use-case that is needed for neither
  • the database does not need to store data in the transport format, it should be very cheap to convert into a length-based one before writing to disk, especially if you compare the conversion time to the I/O time or the signature verification cpu time
    • even then I'm not convinced that pure length-delimiting is the correct choice, since parsing rather than scanning still needs to happen (in particular, once you scanned to a value, you'll likely want to parse it if it's a non-primitive)
      • I guess you could actually use the encoding as an immutable in-memory representation as well, even for collections, but that seems both like a hyper-optimization and pretty fringe/limited
    • so a format suitable for flume might want to have both length and size indicators
      • which really is fine for the db implementation, but would be needlessly complicated for the signing/transport format
  • not all db backends will be flume-based, people might just put the data into some sort of nosql database. Those people would pay the complexity cost of a length-delimited format without any gains.

For those reasons, I'm strongly in favor of keeping the signing/transport format simple and - if need be - storing the data in some different format inside the db (which may or may not be flume). Having implementation details of flume leak into the signing format feels wrong.

@Dominic %ZaSxWaNfpUOZE+BWg6k14oR1Lzy1pN9Vb+7UHK0hV38=.sha256

I guess you could actually use the encoding as an immutable in-memory representation as well, even for collections, but that seems both like a hyper-optimization and pretty fringe/limited

yes this is exactly what I'm already doing! and have found empirically that yes, it does improve performance significantly.

size-delimited is simpler/easier to parse

I want to avoid parsing it!

Okay, a compromise: If you were hypothetically gonna specify a length delimited version, as I'll need to actually make this to test whether it is faster or not, what would it look like?

@aljoscha %4Bb9HC8i3AUoiIJBTpQVFiY2ykxz9j6jkHPpGFO81m8=.sha256

I want to avoid parsing it!

But for a lot of use cases you can't completely do that, right (honest question)? Say a flumeview reduce thing scanned to a certain entry inside a nested map. Now you still need to parse that value so that the flumeview can access it (for example if it's a map and you want logarithmic lookup time, there's no way around parsing since you can't do binary search on the encoding).

I think I get what you want to do, and I see a lot of the benefits, but at the end of the day it is still a detail of a single implementation choice among many. It's hard to justify that complicating the signing process (and that's what this is fundamentally about - db encodings can change, even the transport encodings can change, but it's the signing encoding that we are going to be stuck with).

If you were hypothetically gonna specify a length delimited version, as I'll need to actually make this to test whether it is faster or not, what would it look like?

Whatever you want it to look like. The simplest approach would specify lengths instead of sizes at the beginning of arrays, sets, objects, but using exactly the same encoding otherwise. You'd just set and interpret those values differently than the size-based version.

@Dominic %VIiEefcE+s8SfaMv43f1kqZH01VxYYq6KItLAzx24Wc=.sha256

@aljoscha parsing leaf values (numbers, strings, etc.) is fine though; for strings you could just use a char pointer into the object, anyway. I mean, don't parse collections.

I did a little test, taking the c program I mentioned here and making it also allocate enough memory to copy the raw record into (parsing it would take more, because you'd unpack the values, etc), run time (of the whole program) went from 0.510s to 0.823s. Just adding malloc and memcpy...

@aljoscha %Ggoxx4JEmhCMS5t+JIBVXm69GyQtmoXTgFnrw1k6mVM=.sha256

Two more thoughts on this:

  • rather than sorting purely lexicographically, it would be possible to sort signed integers and floats by numeric value instead. I don't see any clear advantages in that, just want to mention it for completeness' sake.
  • whatever collection representation we settle on, clmr should use the same
@Dominic %NYLZ2CxDBNvVS0yK1p8nRpgotqtNUVHiVCmhiyrWzQk=.sha256

@aljoscha

Whatever you want it to look like. The simplest approach would specify lengths instead of sizes at the beginning of arrays, sets, objects, but using exactly the same encoding otherwise.

I asked you how you would do it. can you be more specific? I want to be able to point to an officially sanctioned spec for in-place parsable collections. I wasn't asking you to put it in the signing spec: I was asking, can I have a in-place version of this?

In-place reads optimizes for the database, and the database is the bottleneck.
The database has to process hundreds of thousands of messages as fast as possible.

Of course, you can still parse an in-place format.
But if you have to process many records in bulk, you don't have to.

@aljoscha %Wb3VNLJymJ9+RmJ1XtDBSUPbp8hIlvhzJEPSuoSIASE=.sha256

@Dominic

Can you tell me more about how what you want relates to capnproto? Reinventing this properly would be a bunch of work, since there's stuff like padding and alignment issues. Capnproto has many features we wouldn't need, but maybe a somewhat simple schema definition for hsdt could be sufficient for your use case.

@Dominic %pRyIKGQktECVZqZEOYwBoAmhmCw9XlGysGlDmxiibNo=.sha256

@aljoscha unfortunately capnproto doesn't work because the ssb data is schemaless (except for the top level of each message). Indexes need to index user data to be useful.

@aljoscha %YkS3+abC6RwZATZE7AfUTnWh0wBR2/BjV13Z0XPgM2Q=.sha256

For the record, I changed my mind on integer representations. There's no real information in the size of the integer type, a program can represent it in memory however it wants. Using a u32 won't keep anyone from publishing a message with a u64 anyways. So if anyone (including future-me) wants to build on this, I'd recommend treating all integers as the same type rather than distinguishing by fixed-size representations.

(related stuff to think through: Integers vs naturals, integers vs floats, integers vs rationals, integers of a maximum size vs integers of arbitrary size)
