%kZTlq9I8hMKB50sDyjIyQ5Ea5A+4E8WJJoCRSiB1dNw=.sha256
I am once again delving into the murky world of binary encoding formats. You might have heard of my previous work in this bipf as it's now used in manyverse's db2.
bipf
was intended to be a binary JSON substitute (like msgpack, cbor) except unlike msgpack and cbor, it is designed for in-place reads. That is, the ability to pluck out particular field values without reconstituting the entire data structure. Which is much slower because now the memory allocator and garbage collector is involved.
But this time I want that - but also more - an important design goal is specifically that it's easy to implement in other languages, such as C.
Also I wanted the in-place reads, but stronger, by using schemas and fixed positions.
fixed positions: in bipf lots of things were encoded using varints, but that means you always need loops to scan through the bytes to find the start and end of the int. It does save some bytes, but if you use fixed integers, a particular int is always the same place in the format so you can just read a i32 at a fixed memory location and that's it. It's just a single step supported in the hardware.
So integers, floats, booleans, are always the same size and are always written at the front of an encoded object. Then you have values like strings or arrays or embedded objects that might be an arbitrary size. In bipf these are length delimited, so that you can check the length, and in the next step jump past the whole value*. Having a length delimited field would mean you cannot have any fixed place fields after that. You always have to read the length of preceding field... so you have just a single variable size field and it has to be the end. Or... there is another way - you use pointers. There is a section with direct values, which are always a fixed size. Then there is a data section, with variable sized values. To encode a variable sized value, you write the length delimited value into the data section, then write a pointer to it in the fixed section. The pointer is always the same size so it doesn't offset any other values after it.
The final trick is to use relative pointers. The value of an absolute pointer is the memory address of the pointed value. But if it moves in memory, such as by sending it to another device, it will be wrong. But instead, we use the relative offset. The value of the relative pointer is the memory address of the pointed value, minus the memory address of the pointer. This means if a block of memory including both relative pointer and pointed value are copied around (sent to another device) then they still make sense. The cost of this operation is just a read and an addition!
Using fixed size relative pointers means that arrays indexes are at predictable locations so you do things like binary search on them.
This means that reading any value at a known path, even if deep into the object can be done with pregenerated code without loops.
The only downside is that you need schemas.
* this is the main trick that distinguishes bipf's design from cbor/msgpack and enables fast simple in-place reads, a cbor/msgpack parser doesn't know where an object ends unless it scans past every value, recursively