You are reading content from Scuttlebutt
@Dominic %kZTlq9I8hMKB50sDyjIyQ5Ea5A+4E8WJJoCRSiB1dNw=.sha256

I am once again delving into the murky world of binary encoding formats. You might have heard of my previous work in this bipf as it's now used in manyverse's db2.

bipf was intended to be a binary JSON substitute (like msgpack, cbor) except unlike msgpack and cbor, it is designed for in-place reads. That is, the ability to pluck out particular field values without reconstituting the entire data structure. Which is much slower because now the memory allocator and garbage collector is involved.

But this time I want that - but also more - an important design goal is specifically that it's easy to implement in other languages, such as C.

Also I wanted the in-place reads, but stronger, by using schemas and fixed positions.

fixed positions: in bipf lots of things were encoded using varints, but that means you always need loops to scan through the bytes to find the start and end of the int. It does save some bytes, but if you use fixed integers, a particular int is always the same place in the format so you can just read a i32 at a fixed memory location and that's it. It's just a single step supported in the hardware.

So integers, floats, booleans, are always the same size and are always written at the front of an encoded object. Then you have values like strings or arrays or embedded objects that might be an arbitrary size. In bipf these are length delimited, so that you can check the length, and in the next step jump past the whole value*. Having a length delimited field would mean you cannot have any fixed place fields after that. You always have to read the length of preceding field... so you have just a single variable size field and it has to be the end. Or... there is another way - you use pointers. There is a section with direct values, which are always a fixed size. Then there is a data section, with variable sized values. To encode a variable sized value, you write the length delimited value into the data section, then write a pointer to it in the fixed section. The pointer is always the same size so it doesn't offset any other values after it.

The final trick is to use relative pointers. The value of an absolute pointer is the memory address of the pointed value. But if it moves in memory, such as by sending it to another device, it will be wrong. But instead, we use the relative offset. The value of the relative pointer is the memory address of the pointed value, minus the memory address of the pointer. This means if a block of memory including both relative pointer and pointed value are copied around (sent to another device) then they still make sense. The cost of this operation is just a read and an addition!

Using fixed size relative pointers means that arrays indexes are at predictable locations so you do things like binary search on them.

This means that reading any value at a known path, even if deep into the object can be done with pregenerated code without loops.

The only downside is that you need schemas.

* this is the main trick that distinguishes bipf's design from cbor/msgpack and enables fast simple in-place reads, a cbor/msgpack parser doesn't know where an object ends unless it scans past every value, recursively


@mix.exe %N+cwsEK61i41Kff+AzJbrjHvzGrqb1KvrpdqbGo8ujA=.sha256
Voted I am once again delving into the murky world of binary encoding formats. Yo
@mix.exe %2DnFrLwdCx8bPm5F/uI7CvwJt74VFOg5OP4f0EuaFac=.sha256

Sounds like a cool thing to explore. Would you try implementing it in e.g. C first to make sure it works well there before implementing it other languages?

@Dominic %LS/Aq6dX07sLcnpZ81TOq7Bwn5h7ibGppRuahvhDWIQ=.sha256

well I have explored C enough to know basically what it will be like but I'm doing the prototype in js because writing tests etc is much easier. C will be next. (C because it generates wasm with the least ceremony)

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@Dominic %aZOKMK3FW/bKGpBuKvyqTzrGrVau2uAo7w9GUIvcI2E=.sha256

Oh yes! I just started doing zig last week, it's like C but a million times better

@mikey %+nlX6ljzeSFlIpBlEzvqW5e5LPp+OvPrUDqB8aZZ6zY=.sha256

using schemas

@Dominic using schemas !? đŸ¤¯

@mikey %uhEdRsh/gVBK/A4x2DkXW3nZyT0uRmMI/6VRkvcpyWM=.sha256

ℹī¸ "better binary encoding": %PDWcZEf...

@andrestaltz %KWV86qSF8ausnsd4JBuR5f3w1FmlykrK9fDdm9tTP/0=.sha256

I saw Dominic's new repo for a schema-driven binary format. Then @arj and I briefly spoke about schemas, but turns out they don't help SSB much because you get messages from peers who don't necessarily follow your schema, and we still need to take those messages into account.

@Dominic %LDGlyndJ18bkdGWs4g4iCi9iMmy3zF6Nze6tlX3TuOI=.sha256

@dinoworm 🐛 there is a time and place for schemas. And although they may be more hassle in someways because you know what you are getting up front a lot of redundency can be left out, and thus especially for the case of parseless in-place reads, process them much faster.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
Join Scuttlebutt now