@Anders %Pnas64BJ60qbHF34BvsHiXYvMRj7CyKq8nsOlDCsWqA=.sha256

SSB Butt2 feed format

First of all about the name, all the creative energy went into designing the format so I went with the default thing of just calling it butt2 :) Happy to entertain suggestions.

[image: MTG card art, artist: Rien]

I have been working on a new format for some time now and have a working prototype to make sure that it performs as expected. The reasons for a new format are many, but what really made this something we can actually roll out to existing clients was the work we did in the NGI Pointer grant on meta feeds.

Instead of copy-pasting the whole spec I'll just highlight some of the key points:

Butt2 is a new binary feed format for SSB. The goal is to be simple, easy to integrate into existing implementations, backwards compatible with existing applications and, lastly, performant.

A benchmark of a prototype shows that the time it takes to validate a message and convert it for storage in a database is cut in half for single-message validation, while bulk validation, with a signature for every 25 messages, takes roughly 1/7 the time of the existing format. To put these numbers into perspective: on an Intel i7-10510U the existing format takes 3 minutes and 20 seconds to validate and convert 1 million messages, while butt2 takes 28 seconds.

I have a pretty good idea of what is needed for this to work in ssb-db2, but would love to hear from other implementations.

User has not chosen to be hosted publicly
@Anders %OuCPmTXh8y2N9CK/bK7PO6AwuNKIN75D2hFjeR/RS5Y=.sha256

@moid each message signs itself just like the current SSB feed format. What is different here is that you can optionally include signatures that span a range. So let's say you get the first 100 messages from a feed and the author chose to also include signatures for every 25 messages. Then you can check the signature on the message with sequence 25 that covers seq 1-24, and then the signature on message 25 itself. Similarly for sequence 50 you check sequence 26-49 and then the direct signature on message 50. That way you end up checking 8 signatures instead of 100. This is only for the signature part; you still do all the normal checks on the other messages, including checking backlinks etc.
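
A rough sketch of that flow (the helpers here, `verifyBatchSignature`, `verifySignature` and `verifyChain`, are made-up names just to show the shape, not the actual ssb-butt2 API):

```js
// Made-up sketch of the range-signature check described above.
const BATCH = 25

function verifyFeedSlice (msgs) { // msgs[0] has sequence 1
  for (let seq = BATCH; seq <= msgs.length; seq += BATCH) {
    const anchor = msgs[seq - 1]
    // one range signature on the anchor covers the 24 messages before it...
    if (!verifyBatchSignature(anchor, msgs.slice(seq - BATCH, seq - 1))) return false
    // ...and the anchor's own direct signature covers the anchor itself
    if (!verifySignature(anchor)) return false
  }
  // backlinks, hashes, sequence numbers etc. are still checked on every message
  return verifyChain(msgs)
}
```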

We have this kind of debounced validation in db2 already so we should be able to leverage this for the above.

Some numbers from the prototype:

checking 1 feed of 7797 messages

db query: 934.799ms
validate classic js: 1.497s
validate classic rust (chunk size 100): 643.784ms
validate classic rust (chunk size all): 367.473ms

extract data: 57.453ms
single message validation: 713.28ms

The next two are validation in bulk:
verify simple & hash: 61.372ms
verify signature: 61.289ms

encode butt2 to bipf for db: 78.488ms
encode classic to bipf for db: 211.728ms

I'm currently tidying up the prototype and splitting out the code into a proper module that will hopefully be public within the next week.

@zoo [planetary] %NM+iyQuVF1PvuuzYsgrv/9qJiZ6IUT/v7nq1EdXCVx8=.sha256
Voted "SSB Butt2 feed format"
User has not chosen to be hosted publicly
@Anders %NRsPDa49gFrWkue/K9F0xbT1l5HcWJi2C0HF7nGaZEc=.sha256

@moid the code is up now at https://github.com/arj03/ssb-butt2. Still some things missing, but getting there.

@Mix Android there are no lipmaa links 👺

And signing the msg keys for batch validation should be fine

I tested the size reduction and it appears that butt2 is roughly 80% the size of classic over the wire.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@zoo [planetary] %GkSfdjDMXVr3R30uCFUX/x8fMZUSjFB8U2QbjwKVJi8=.sha256

ooooh

@Dominic %Xf2JIAYPmJJ/0SRVoVQn3NwlC636pQbXmJW1HjLIKQ4=.sha256

I'm looking at the code - does it store the top level values as bipf arrays? https://github.com/arj03/ssb-butt2/blob/master/index.js#L8-L11

Layered? Because the content is encoded as a bipf object, but then it looks like it's encoded as a bipf buffer inside another array?

Reading the code I spotted several things that feel "nit-picky" to mention (and 7 times faster is great anyway), but I'll ask anyway. The main question I have is about how the batch signing works.

It also seems to me that there are micro-optimizations that could add up. For example, https://github.com/arj03/ssb-butt2/blob/master/index.js#L288 uses Object.keys to get an array of the signature keys, when for (var key in signatures) might be faster?

Instead of layering, what about just implementing a non-recursive decode that would decode only the top layer and return buffers for the values? It seems it might achieve the same result, while still keeping the whole thing as a single valid bipf object.

The important question: how does batch signing work?

https://github.com/arj03/ssb-butt2/blob/master/index.js#L288

does it keep in memory an array of all signatures? else why is it accessing a signature from an array based on the sequence?
https://github.com/arj03/ssb-butt2/blob/master/index.js#L256

It sounds like you use specially formatted batch signature messages?

I think you could take a much simpler approach: since hashes preserve the uniqueness of the hashed data, signing the hash is as good as signing the data, and signing the hash of the hash (or the hash of data containing the hash) is also good enough. Therefore, every ssb message signs not just the message content but the whole chain of preceding messages. If you changed one bit in the first message, the hashes of all subsequent messages would change.

So... instead of having a special batch signature, just sign each message like normal and, when verifying, queue up N messages, then randomly verify one. Then you can write the messages up to the verified one.
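
A sketch of that queue-and-spot-check idea (all helper names here are hypothetical):

```js
// Hypothetical sketch of the suggestion above: verify the hash chain for
// the whole queue, then spot-check one signature at random. Because each
// message's signed hash chains back to all of its predecessors, that one
// signature transitively covers everything before it.
function flushQueue (queue) {
  if (!verifyHashChain(queue)) throw new Error('broken hash chain')
  const i = Math.floor(Math.random() * queue.length)
  if (!verifySignature(queue[i])) throw new Error('bad signature')
  writeMessages(queue.slice(0, i + 1)) // safe up to the verified message
  return queue.slice(i + 1) // keep the rest queued for the next round
}
```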

@Anders %KR8f9T3qMYMbhMFiuZ2wKfyNQqp287qCnqsZ/0MjgBU=.sha256

butt2butt

@Anders %XeStFwUGR3n32QyhlE8bj9XyvavbUCpNJHaENgkXx+o=.sha256

An update on this.

The feed format now has an official name: buttwoo :)

For bulk validation I decided to go with the solution Dominic mentioned by just validating the signature of the last message.

It supports subfeeds.

There is a PR adding support for it to DB2 and EBT. Both are quite far along and just need some box2 work before they can be merged.

I optimized BFE and URI2 recently and that gave a good boost.

Validate + add 15k messages to the database:

buttwoo large: 2.3s
buttwoo small: 2.15s
classic large: 3.5s
classic small: 2.1s

@aljoscha %0XQVRvvpbPl1nBBVPDgVrAB4zWmVTnh8CfZqlpWvJC8=.sha256

@arj

For bulk validation I decided to go with the solution Dominic mentioned by just validating the signature of the last message.

Just because it hasn't been mentioned yet in this thread: what happens if I start appending to another person's log? That gives me a valid hash chain in which every message is correctly signed by its author, it's just that authorship changes halfway through the log.

Do you perform checks for detecting this case? Which behavior do you recommend/mandate for implementations when this case occurs? Is there even a slight possibility of clients receiving entries from a log like this and doing all sorts of undesirable things because one of the core contracts they expect a log to uphold has been broken? For example, displaying messages from the old log under the name of the author of the new suffix, or having a sudden switch of authorship in their timeline view.

You can still easily (and quickly, compared to signature validation) verify that all messages have the same author, but that should be specified somewhere.
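
For instance, assuming decoded messages expose their raw BFE author bytes (a hypothetical shape, not the actual API), the check is a one-liner:

```js
// Hypothetical decoded-message shape: `author` holds the raw BFE bytes.
const sameAuthor = msgs.every(m => m.author.equals(msgs[0].author))
```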

Nitpicking on the spec:

Status: In review

By whom? For how long? Under which criteria? Don't tease the poor reader like that.

The verification section is only local to each message. It should at the very least link to the specification for log verification (valid hash chain, single-author, etc). Which exists, right? =P

@aljoscha %o+3XFaGO6iSgnW2EEO946rZB5FaF2S1HNmro/lcVnR0=.sha256

More spec remarks, putting on a deeply black hat, so be warned:

author: ssb-bfe-encoded buttwoo feed ID, an ed25519 public key

The bipf spec heavily implies that its compound values have to contain further bipf data ("with schemaless json-like semantics", "OBJECT : 5 (101) // sequence of alternating bipf encoded key and value"), but it is ambiguous about arrays ("ARRAY : 4 (100) // sequence of any other value"). Assuming arrays must also contain bipf values, you cannot put an ssb-bfe-encoded feed ID into your top-level bipf array. So do I have to wrap these in a bipf buffer (of statically known length, no less)? At the very least, this needs clarification in both specifications.

parent: ssb-bfe-encoded buttwoo message ID used for subfeeds. For the top feed this must be BFE nil.

See above.
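
To make the ambiguity concrete, here is what the wrapped reading would look like with the bipf npm module (the feed ID bytes are just placeholders):

```js
const bipf = require('bipf')

// Placeholder for the 2-byte BFE tag plus a 32-byte ed25519 public key.
const bfeAuthor = Buffer.alloc(34)
const metadata = [bfeAuthor /* , parent, sequence, ... */]

const buf = Buffer.alloc(bipf.encodingLength(metadata))
bipf.encode(metadata, buf, 0)

// Round-tripping returns the raw bytes, but on the wire each element now
// carries bipf's own BUFFER tag + length prefix around the BFE bytes.
console.log(bipf.decode(buf, 0)) // => [ <Buffer 00 00 00 ...> ]
```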

Both author and parent use a redundant encoding: why is the author ssb-bfe encoded, if you already know it has to be a buttwoo feed ID? Is an optimizing implementation allowed to just ignore the useless bytes and only look at the key instead? Or do I still have to check that they contain the only valid (i.e., completely pointless) byte pattern? This opens up room for mistakes, or just peers that sow chaos, for no benefit.

sequence: number for this message in the feed

Which data type is this? How is it encoded? "number" is not a bipf type.

timestamp: integer representing the UNIX epoch timestamp of message creation

This is a bipf int (32 bit), I assume? Considering that author/parent are not bipf values either, this should be specified. Any reason not to go for 64 bit? 32 bits run out in 16 years; your crypto primitives will hopefully last longer.

tag: a byte with extensible tag information (the value 0x00 means a standard message, 0x01 means subfeed, 0x02 means end-of-feed). One can use other tags to mean something else. This could be used to carry for example files as content.

Please specify what an implementation must do with unknown tags.

contentLength: the length of the bipf-encoded content in bytes

hash: concatenation of 0x00 with the blake3 hash of the bipf-encoded content bytes

This contradicts that content is not necessarily bipf-encoded.

If the spec mandates a 0x00 byte at the start of the hash, then an implementation has to reject everything that does not have that byte. Any change that allows other starting bytes would be a breaking change, i.e., a whole new format. So as the spec is currently written, this byte is completely redundant and should be removed from the spec.

The content is a free form field. When unencrypted, it SHOULD be a bipf-encoded object.

Please specify what an implementation must do when it is not a bipf-encoded object.

A buttwoo message consists of a bipf-encoded array of 3 fields:

Metadata must be an bipf encoded array of 8 elements:

The only unknown about the length of a message is the length of the metadata. Since the metadata starts with its length, storing the message as a bipf array is redundant; you could simply concatenate metadata, signature and content instead. An efficient implementation would ignore the first byte of the message encoding, but it still has to verify that the length is correct. This byte is redundant and forces redundant computations (branching, even) on all implementations.

Similarly, the length of the metadata array is only influenced by whether the parent is null or not. Should you really encode this by prefixing the metadata with two different, fairly large, arbitrary-looking integers? Especially since the parent encoding contains the information about whether it is null again.

Overall, the information whether the parent is null is stored in three different places, in three different ways, and while only one place needs to be checked to function correctly, the consistency of all three locations has to be verified. This violates DRY in a pretty horrible manner.


Also, just out of curiosity: did you consider VarU64 for bipf, and if so, why did you choose LEB128 over it? Adoption/implementation availability, or design reasons? CC @Dominic (?)

@Anders %voBbCGifUkKz2dwpD7DBDtQB24cNT5JnmuBaNBH0lrY=.sha256

@Aljoscha thanks for taking a look :)

Just because it hasn't been mentioned yet in this thread: what happens if I start appending to another person's log

I didn't mention that explicitly, but from the code linked you can see that all other things are checked except the signature of the other messages.

You can still easily (and quickly, compared to signature validation) verify that all messages have the same author

Do you mean checking the author field? Or that the signature was created by the said author without checking that the signature is actually correct?

Btw. this optimization is not documented in the spec. There is also the possibility of randomly selecting 1 message in each round, verifying that in full, and then verifying the messages after it using the same method until you reach the end. It's probably best that each client doesn't follow the same rules. In any case these are just optimizations; you can still verify each message in full (which is what the benchmark does, btw.).

Status: In review
By whom? For how long? Under which criteria? Don't tease the poor reader like that.

Like a good stew until it's finished :P Currently I don't expect this to change much so the spec will probably be finalized once we merge that DB2 PR.

The verification section is only local to each message. It should at the very least link to the specification for log verification (valid hash chain, single-author, etc). Which exists, right? =P

It links to bamboo so? :-)

@Dominic %MvYC3YcHqwc+aA/yGSAePk8ftFWoXeqEmKczcs6mCZU=.sha256

@Aljoscha I did not consider varu64, the reason being that it did not exist yet ;) First commit, bipf: 8th June 2018; first commit, varu64: 20 Jan 2019.

It's a bit late to say this, but I had at the time only intended bipf as a proof of concept. At the time, in particular, I wanted a compact float encoding. As it is, bipf just uses a 64 bit float, which is what timestamps should be, for compat with legacy (js) implementations.

A tiny bikeshed on varu64: instead of encoding whether it takes 1, 2, 3, 4, 5, 6, 7 or 8 bytes to encode the number, might as well just have 1, 2, 4, 8; then you can use the standard int reading ops, with no need to handle 3, 5, 6, 7. A format like this saves data by storing small numbers more compactly: a number stored in 1 byte takes only 1/4 the space of 4 bytes, but a 5 byte number only makes a 5/8 saving compared to 8 bytes. I do like that you can read the size of the record from the first byte.
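
A toy version of that 1/2/4/8 idea (not an existing format; a real one would fold small values into the tag byte rather than spend a whole byte on the width):

```js
// Toy decoder: the first byte says which standard fixed-width read to use.
// Returns [value, bytesConsumed].
function readUint (buf, start) {
  switch (buf.readUInt8(start)) {
    case 1: return [buf.readUInt8(start + 1), 2]
    case 2: return [buf.readUInt16LE(start + 1), 3]
    case 4: return [buf.readUInt32LE(start + 1), 5]
    case 8: return [buf.readBigUInt64LE(start + 1), 9]
    default: throw new Error('invalid width tag')
  }
}
```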

I think I already said this, but I think it would be a good idea if butt2 had lipmaa links. Even if full support for partial replication isn't implemented at first, it could be... but without them, you'd have to create a new feed format again.
