@aljoscha %043+a3lLbXxJM0L7QDJ3rAKTb6sWXuAkRuhS/wOT+hQ=.sha256

Compact Legacy Message Representation (CLMR)

I wrote up a spec for a compact representation of legacy messages. It uses cbor for the content data, and an efficient binary encoding for the metadata. Also the first byte is never the same as that of a json encoding, which might come in handy later (and I intend to also keep that property for the metadata redesign).
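
As a sketch of how that first-byte property can be used (the enum and function here are made up for illustration; the only assumption is that a json-encoded message always starts with `{`, a byte the spec guarantees clmr never uses):

/// Distinguish a legacy json message from a clmr one by its first byte.
enum Encoding {
    Json,
    Clmr,
}

fn detect_encoding(raw: &[u8]) -> Option<Encoding> {
    match raw.first()? {
        b'{' => Some(Encoding::Json), // both json encodings start with '{'
        _ => Some(Encoding::Clmr),    // clmr never uses that first byte
    }
}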

This is semantically identical to the currently used messages, but it takes up less space and is simpler to parse. Intended uses:

  • replication rpcs can exchange this form of the data (CC @Dominic)
  • database can store this instead of json (CC @arj, @keks)

The format will not be used to compute signatures, so it is not tied heavily into the core of the protocol. That's why I simply went ahead and wrote up the full thing. None of it is set in stone (yet) either, so feel free to criticize.

There's a rust implementation for the serialization of this.

Size of my (somewhat recent) feed in bytes, for different encodings:

  • signing encoding (json with whitespace): 2160151
  • json without whitespace: 1951148
  • clmr: 1620807

Considering how much of my feed consists of json strings that clmr can't compress further, that's not too bad. More importantly, it is much simpler to both produce and deserialize than json. And the same binary varints, identifiers, etc. will also be used in the new metadata format. Ideally, the json can be fully deprecated, except for signature computation and UIs.
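
To give a feel for the primitives involved, here is a generic LEB128-style unsigned varint encoder in rust. This is only an illustration of the general shape; the clmr spec defines its own varint encoding, which need not be identical to this one.

/// Illustrative LEB128-style unsigned varint encoder: seven bits of payload
/// per byte, with the high bit set on every byte except the last.
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

For example, encoding 300 this way yields the two bytes 0xAC 0x02.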

Coming soon: rust deserialization, testing of the rust implementation, test data set.
Coming soon-ish: js bindings.

A small detail, the handling of private boxes, is still under active discussion. For the rust implementation, I simply encoded the private box format specifier with a single zero byte, which is what my suggested approach to the general multibox encoding would do.
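
Concretely, that placeholder boils down to something like the sketch below; whether the ciphertext needs an additional length prefix depends on the surrounding framing, so treat this as an illustration only.

/// Sketch of the current placeholder: a single zero byte as the private-box
/// format specifier, followed by the box ciphertext. The general multibox
/// encoding is still under discussion, so none of this is final.
fn write_private_box(ciphertext: &[u8], out: &mut Vec<u8>) {
    out.push(0); // format specifier of the only currently known box format
    out.extend_from_slice(ciphertext);
}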

@Anders %7jgRciJekt7WTC43lML2gVOlYCTjcuQJmZPwBpqqScI=.sha256

Would be interesting to see performance numbers for serialization and deserialization of this compared to json.

@aljoscha %9E98fBA31M63TynnWK8674nX7N4Dx9lK2lvz/LGqlTc=.sha256

@arj Benchmark script, the times below are for (de)serializing my complete feed. For deserializing, the output is immediately thrown away; there isn't even any memory allocation (otherwise, clmr would outperform the json by an even larger margin).

Quick summary (full cargo bench output at the end of this post):

  • parse full feed from compact json: 38.134 ms
  • parse full feed from clmr: 5.1947 ms
  • serialize full feed as compact json: 1.2881 ms
  • serialize full feed as clmr: 396.65 us
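
The numbers come from criterion. A minimal sketch of what such a bench target can look like; `load_feed_clmr` and `Message::from_clmr` are hypothetical stand-ins for the real crate API, not its actual names.

// benches/feed.rs -- sketch of a criterion bench target.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn deserialize_clmr(c: &mut Criterion) {
    // One raw clmr encoding per message of the feed (hypothetical loader).
    let raw_messages: Vec<Vec<u8>> = load_feed_clmr();
    c.bench_function("deserialize each message/deserialize clmr", |b| {
        b.iter(|| {
            for raw in &raw_messages {
                // The result is dropped immediately, matching the setup above.
                black_box(Message::from_clmr(black_box(raw)).unwrap());
            }
        })
    });
}

criterion_group!(benches, deserialize_clmr);
criterion_main!(benches);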

No guarantees that the clmr implementation is bug-free, but at least serializing and then deserializing the feed as clmr results in the exact same feed, so that's a good indicator that the benchmark does not rely on buggy behavior.
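
That roundtrip check amounts to something like the following sketch; `load_feed_messages`, `Message::to_clmr` and `Message::from_clmr` are again hypothetical names.

// Sketch of the roundtrip sanity check over the whole feed.
#[test]
fn clmr_roundtrip_is_identity() {
    for msg in load_feed_messages() {
        let bytes = msg.to_clmr();
        let parsed = Message::from_clmr(&bytes).unwrap();
        // Serializing and then deserializing must give back the same message.
        assert_eq!(msg, parsed);
    }
}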

That was more drastic than I expected. I think the biggest gain is not having to base64 encode/decode everything. The json implementation decodes all base64 into actual byte buffers. Since that needs to happen at some point anyways, and it is most efficient to just do it once during deserialization, I think it is fair to include this in the benchmark.

There are also a few more advantages not captured in this benchmark:

  • smaller encoding sizes lead to less i/o and memory allocation overall
  • can pass around pointers into the raw message (in particular to the raw signature and author) rather than having to copy memory (see the sketch below)
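
For that second point, the rust side can hand out views that borrow from the raw buffer instead of owning copies. A minimal sketch; the field names and layout here are illustrative, not the actual clmr layout:

/// A zero-copy view into a raw clmr message: author, signature and content
/// are slices borrowed from the input buffer rather than owned copies.
struct MessageView<'a> {
    author: &'a [u8; 32],    // raw ed25519 public key bytes
    signature: &'a [u8; 64], // raw signature bytes
    content: &'a [u8],       // cbor-encoded content, still unparsed
}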

CC @Piet, @dinosaur, @Dominic

cargo bench

Benchmarking deserialize each message/deserialize signing json: Collecting 100 samples in estimated 194.08 s (5050 iterations)
deserialize each message/deserialize signing json
                        time:   [38.377 ms 38.386 ms 38.398 ms]
                        change: [-0.9875% -0.9166% -0.8528%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking deserialize each message/deserialize compact json: Collecting 100 samples in estimated 193.00 s (5050 iterations)
deserialize each message/deserialize compact json
                        time:   [38.128 ms 38.134 ms 38.140 ms]
                        change: [-1.0764% -1.0070% -0.9280%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

deserialize each message/deserialize clmr
                        time:   [5.1803 ms 5.1947 ms 5.2089 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

serialize each message/serialize signing json
                        time:   [1.2791 ms 1.2829 ms 1.2872 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

serialize each message/serialize compact json
                        time:   [1.2847 ms 1.2881 ms 1.2921 ms]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

serialize each message/serialize clmr
                        time:   [395.12 us 396.65 us 398.28 us]
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  8 (8.00%) high severe
@aljoscha %txYPVfV2eaBlP7Cn2yw1X45CCRttUWtLJXa0OAvv+iQ=.sha256

Correction: For deserializing, the output is immediately thrown away.

@Anders %PUDLWlZbLzgEgWbm/NEjH9OcQQLL15XBLwU7m6tolio=.sha256

That is impressive, especially considering that the file size is not that much different.

What is the story with rust interop? Do you plan js bindings? In that case I guess one would have to generate/compile a lib per arch. What is the story with wasm? Is it easy to generate? It would also be interesting to see the performance of a wasm version, though that last point is mostly out of curiosity.

@aljoscha %3EHfeOvZXxKqr7ozUESw95wHV8222KrIW+Qr1yghUnI=.sha256

As for js bindings, @Piet is the person to ask.

User has not chosen to be hosted publicly
@Dominic %RxvoY7N428ePVBHwDf4Io+p5OvvQ5b1uG9/nJBlBJwU=.sha256

@piet I'd also be even more interested in a wasm version than bindings (although a comparison would be very worthwhile)

@mikey %q/4RqLd8OQ/iriUkyhpgsOenraGwhOS0HCi36GGmfU4=.sha256

@dominic: the plan is to have both native bindings (with cross-compiled prebuilds for every arch) and wasm (as a fallback). the hypothesis is that our hottest code paths (message verification, encryption, hashing, etc) deserve native bindings for eking out performance, while wasm will be just as portable (in JavaScript land) yet faster than plain old JavaScript.

@Dominic %bVVz9mb/m8jL6LZoCa+sf+t3Q17ji6VJt7Nimpsbk5c=.sha256

@dinosaur I'd like to actually measure the difference in performance here: crossing the gap between js and native is not as cheap as it could be, so for lots of small things it might not be faster. So if you do full parsing here and create all the js objects, I wouldn't be too surprised if native bindings to this aren't significantly faster than JSON.

@piet If it's better than twice the performance of JSON (by this benchmark) I owe you a box of beer. ;)

User has not chosen to be hosted publicly
@Dominic %LdsS0OmGvbq1A1/Qji6lMKWfaDbGjp2jYKn5qGWc1XE=.sha256

@piet sorry just realized this was most of a year ago ); if I recall correctly I wrote scripts to copy the flumelog into the format of flumelog-aligned-offset

I added some documentation so you should be able to figure out how to run them. hmm, did we ever get node bindings for this?

User has not chosen to be hosted publicly
@Dominic %+oyDDYIFkzFFEFjaSlTDksAXkky4zGYMTjxDpc774WU=.sha256

@piet dang, I had hoped that I would be buying you the box of beer and enjoying a twice as fast codec );

This is why I'm advocating the approach of bipf (though that module is just a proof of concept): the trick is to not create new javascript objects at all, which lets you filter without ever materializing them. Mostly, you read something from the database, look at a couple of values, pass it to another place, and probably reserialize it again to the network. But with an in-place format, you can skip both parsing and serializing. You only parse the properties you look at!
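
To illustrate the general idea (in rust rather than the bipf module itself, and with a made-up length-prefixed layout, not bipf's actual encoding): scan the raw buffer for the one field you need, return a borrowed slice, and never build objects for the rest.

/// Assume a flat sequence of (key-length, key-bytes, value-length, value-bytes)
/// records with one-byte lengths. Return a borrowed slice for the requested
/// key, touching nothing else and allocating nothing.
fn seek_field<'a>(mut buf: &'a [u8], wanted: &[u8]) -> Option<&'a [u8]> {
    while !buf.is_empty() {
        let key_len = *buf.first()? as usize;
        let key = buf.get(1..1 + key_len)?;
        let val_len = *buf.get(1 + key_len)? as usize;
        let val_start = 2 + key_len;
        let value = buf.get(val_start..val_start + val_len)?;
        if key == wanted {
            return Some(value);
        }
        buf = &buf[val_start + val_len..];
    }
    None
}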
