@aljoscha %EwwjtvHK7i1MFXnazWTjivGEhdAymQd0xR+BU82XpdM=.sha256

@Dominic Can we simplify the ".box" suffix for encrypted message content?

Currently, encrypted message content is a json string, beginning with some canonic base64, followed by .box, followed by - well - anything (the implementation uses the regex .*). Here are a few reasons to change this:

  • fun with unicode: The suffix could include quotes, and even escape sequences for arbitrary unicode code points. Which are very annoying. Do we allow unpaired surrogate code points (hello again, my dear friend wtf-8), resulting in non-unicode suffixes? Are matching surrogates parsed as two unicode code points or as a single unicode scalar value? (How) do we enforce canonicity of escapes? Do we want all future programs that display this format to humans to have to deal with emoji rendering?
  • compatibility with efficient binary format: The next thing I'll work on is a compact, binary format for legacy messages, preserving full backwards compatibility. For all other multiformats, I'll simply assign an integer to each cryptographic primitive. With the current multiboxes, that's impossible, those would need to store the arbitrary suffix rather than a compactly encoded integer.
  • indeterminate length: A parser just has to go on decoding, which is really unhelpful considering the input might be malicious. There's an upper bound on the length due to the message size limit, but having that leak into the parser construction for the box multiformat is pretty weird, especially since it might also be used in other contexts.

Proposed alternative: Limit the suffixes to hex encodings (lowercase) of natural numbers smaller than 2^64, without any leading zeroes. As a regex: /\.box(?:[1-9a-f][0-9a-f]*)?/.

This addresses all of the above issues, and there's a bijection between 64-bit integers and suffixes. Implementations can thus use an integer internally and reconstruct the (legacy) encoding. The integer 0 incidentally maps to the suffix .box - what a happy coincidence! And it is simpler than decimal encoding, although that's something I could also live with.
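As a sketch of that bijection in js (BigInt-based; the function names are made up for illustration, not taken from any ssb module):

```js
// Sketch of the proposed bijection between uint64 ids and suffixes.
// Names are illustrative only.

// id (BigInt in [0, 2^64 - 1]) -> suffix string; 0n maps to "" (bare ".box")
function idToSuffix(id) {
  if (id < 0n || id > 0xffffffffffffffffn) throw new Error("not a uint64");
  return id === 0n ? "" : id.toString(16); // lowercase hex, no leading zeros
}

// suffix string -> BigInt id, rejecting non-canonical spellings
function suffixToId(suffix) {
  if (suffix === "") return 0n;
  if (!/^[1-9a-f][0-9a-f]*$/.test(suffix)) throw new Error("bad suffix");
  const id = BigInt("0x" + suffix);
  if (id > 0xffffffffffffffffn) throw new Error("not a uint64");
  return id;
}

idToSuffix(0n);             // ""                 -> ".box"
idToSuffix(2n ** 64n - 1n); // "ffffffffffffffff" -> ".boxffffffffffffffff"
```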

Cons: No more boxymoron. More seriously: Are there any messages with suffixes other than .box in the wild? If not, I really want to restrict the suffixes to hex numbers. Counterarguments will only be taken seriously if they propose a consistent set of rules for dealing with escaped surrogate code points ;-)

CC @dinosaur, @cryptix, @keks, @cel

@aljoscha %WgT/NxQU7Kv/rLY3XA9Wo5pJUc64AIwWqDv3eIbRihE=.sha256

Corrected regex: /\.box(?:[1-9a-f][0-9a-f]{0,15})?/

@mikey %6NvkFa4kZegQwDLiiP5Fq20RiqgaAmpHLd9RmrMmrTM=.sha256

mostly unrelated, but reminded me to say: this recent ssb-ref pull request does the same boxymoron pattern but for shs: https://github.com/ssbc/ssb-ref/pull/22/files#diff-168726dbe96b3ce427e7fedce31bb0bcR168

User has not chosen to be hosted publicly
@aljoscha %0gQsJT/rVv8DbIT3FIbrRM4ao2EfH1xO57LfFfc+iCM=.sha256

@keks With the other multiformats, a server implementation can just ignore any messages with unknown formats, since it can't verify them anyway. But the boxes are irrelevant for verification, so I'd expect servers to just store the messages even if they can't decrypt them. That's what js sbot currently does (ok, I only checked that ssb-validate accepts them; I don't know what happens when sbot wants to attempt decryption and realizes it doesn't know the format - @Christian Bundy, do you happen to know this, or where to look it up?). This makes it impossible to use a pure table lookup: you'd still need to deal with values unknown to the table, storing the actual byte pattern.

If human-readable rendering is a concern, the rendering function can always do a lookup and use the actual name of the encryption algorithm.

@Christian Bundy %dkche9C3uV0DedX57yHm1eW3ekyyP51xkHkv1BNn4QI=.sha256

@aljoscha

My understanding is that the secure-scuttlebutt module handles this in minimal.js, specifically the unbox() function. The gist is that if msg.value.content is a string then it attempts decryption with every unboxer we have (currently just main_unboxer, which uses ssb-keys).

The relevant bit of ssb-keys is util.toBuffer(), which seems to just chop off the sigil and the suffix (. and everything after it) and turn it into base64. I don't think it even checks the format, it just attempts decryption and returns an object or nothing at all.
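Roughly, that path boils down to something like this (a sketch based on the description above, not the actual secure-scuttlebutt/ssb-keys code):

```js
// Rough sketch of the unboxing path described above.
function toBuffer(content) {
  // chop off the suffix: "." and everything after it
  const dot = content.indexOf(".");
  const b64 = dot === -1 ? content : content.slice(0, dot);
  return Buffer.from(b64, "base64");
}

function unbox(content, unboxers) {
  const ciphertext = toBuffer(content);
  for (const tryUnbox of unboxers) {
    // each unboxer attempts decryption, returning an object or nothing
    const plaintext = tryUnbox(ciphertext);
    if (plaintext != null) return plaintext;
  }
  // no unboxer succeeded: the message just stays stored in encrypted form
}
```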

Hope that helps!

User has not chosen to be hosted publicly
@Christian Bundy %SgbYzNMlYuXte2nU5quIUFDploFHShfgakwnLbBOmcg=.sha256

Also, since we're talking about lookup tables, is this somewhere that we could/should use multiformats? The lookup tables may already exist.

@aljoscha %AVaHj+GPmB+JfpnupJSa4P/TUQX8Iw3qEkCJvJbcJaU=.sha256

Thanks @Christian Bundy, so sbot indeed keeps messages whose suffix it doesn't know.

@keks Upon receiving .boxffffffffffffffff without knowing that format, the db (as well as the in-memory representation) can just store it as "a bunch of encrypted bytes, using the format 2^64 - 1" rather than "a bunch of encrypted bytes, using the format identified by the utf8 string 'ffffffffffffffff'". This is really just about efficient representation of unknown formats.

I think this would be by far the simplest solution, but if you people disagree, I'm also fine with e.g. a bounded-length alphanumeric suffix. Just as long as it disallows escape sequences, is restricted to the basic multilingual plane (excluding " and control characters to keep json compatibility), and puts a maximum on the length of suffixes, I'm happy.
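A sketch of such a dispatch (the table contents and names are hypothetical; only id 0, private-box, exists today):

```js
// Sketch: dispatching on numeric multibox ids, keeping unknown formats.
const privateBoxDecrypt = (ciphertext, keys) => null; // hypothetical stub

const DECRYPTORS = new Map([[0n, privateBoxDecrypt]]);

function handleBox(id, ciphertext, keys) {
  const decrypt = DECRYPTORS.get(id);
  if (decrypt === undefined) {
    // unknown primitive: store eight bytes of id plus the raw bytes,
    // instead of an arbitrary utf8 suffix
    return { id, ciphertext, encrypted: true };
  }
  return decrypt(ciphertext, keys);
}
```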

@Christian Bundy

Also, since we're talking about lookup tables, is this somewhere that we could/should use multiformats? The lookup tables may already exist.

Binary encodings for the ssb multiformats are coming up next, you can get a sneak preview here. But there is no (preexisting) multiformats table for the kind of decryption algorithms ssb private messages need.

User has not chosen to be hosted publicly
@aljoscha %aFGI0o7r3IktOhfMKAyePUPyODIY20w/GorpN7m9wZA=.sha256

@keks I agree that format-per-usecase is not good from a privacy perspective. That's something to address in the full metadata redesign. But this issue is only about legacy messages. Before the metadata redesign, I'll spec out a more efficient encoding for legacy messages as well, but there won't be any changes in semantics. And for the legacy stuff, that simply means one suffix per algorithm.

As for the 32 vs 64 bit question: I've written up some notes on finite integers as opposed to arbitrary-length integers here. I just chose 64 bit fairly arbitrarily, and I'd want to consistently use the same varint format across the protocol stack (here is what I intend to use for multikeys and multihashes, multiboxes would be the same but without any efficient length-encoding shenanigans).

@Dominic %Ue7TB0SbZSmbS4WwgTM69kUmZOAzaoJyS1lYxJijW6U=.sha256

hmm, so what you are saying here, @aljoscha, is that you want to use integer identifiers for formats/types instead of strings.

The real question behind that is: if we have numbers for protocols, how are those allocated?
I kinda intuitively feel that string names that people choose might be less likely to collide... But I'm imagining you laughing at this right now. okay okay.

Okay so - if we have numbers, do we have to have some sort of government that allocates them? because I'd kinda like to not have that. Is there something else we can do here?

@aljoscha %fiWF8ohZ19/ShUFsCEAdDcxgIWVabFPvu/qmoR5NYDY=.sha256

@Dominic

do we have to have some sort of government that allocates them?

Everyone already needs to agree on the data and metadata format. This is just an extension of it. People already agree on the strings the current json-encoding of e.g. multihash uses. A server must reject messages that use an unknown suffix in the metadata, because it can't verify them. So we already have this centralized namespace. It's just more efficient to allocate integers rather than strings.

Multiboxes are conceptually different from multi{hash/key}, in that unknown formats are accepted and saved rather than flat-out rejected. When #hsdt comes along and adds cypherlinks to the free-form data format, then multi{hash/key} will also need to store unknown formats. But multi{hash/key}s in the metadata will still be rejected by a server that doesn't know the crypto primitive.

I'm arguing that we should store these future-proof multiformats using integers rather than strings, as that is far more efficient. Well, and I want to keep emojis and surrogate points out of our multiformats handling.

The ability to express 2^64 different primitives is more than enough already. It's also enough space that people can use their own custom box formats, or their own cypherlinks inside content data, without a high probability of collisions. So there's no more "government" than there already is: the consensus of correctly implementing the protocol.

User has not chosen to be hosted publicly
@Christian Bundy %TbrP/VvkVk/q4C38vuktI2mKl26rsX36aSW4O012iew=.sha256

@aljoscha

But there is no (preexisting) multiformats table for the kind of decryption algorithms ssb private messages need.

Could you unpack this a bit? I totally understand that the specific algorithms we use aren't in the multicodec repo, but it seems like it would be simple to just add them to the repo rather than come up with our own table. Is there a reason why our codecs couldn't be described in a multiformat table?

@aljoscha %EPCZEy4DMqrh763t8gx1iyQFKX2R0JCP/YF0RPMLzi8=.sha256

@keks

Actually I was just thinking that possibly, at some point in the future someone bridges networks and has ciphertext in a foreign format[...]. So maybe it's not just for legacy reasons.

Yeah, the non-legacy format should be designed to deal with this kind of thing. I don't want the future system to be more rigid than the legacy stuff. But the legacy system is both too rigid (in that it can only do an identifier per format) and too open-ended (in that the identifiers are way too complicated). The overcomplication is the part we can easily deal with. Everything else, we can tackle with the clean slate of the non-legacy format.

As for the hashing, I just don't see how this provides any advantage over just allowing any 64 bit id without caring about how it was created.

@Christian Bundy

Is there a reason why our codecs couldn't be described in a multiformat table?

Is there a reason why they should be? =P

Ssb will end up using tables to look up multiformats identifiers, but we will be better served using our own formats. I have rambled a bit on shortcomings of the ipfs varints here; we can save the length indicator in some cases, and I'm not a fan of their policy of trying to cover all the formats at once. I guess I can go into more detail if you want, but that won't happen today.

@Christian Bundy %/Trks/ctxM74XRFqo64Af9+T9vmL7qwJavsaFTyQwKc=.sha256

@aljoscha

Is there a reason why they should be? =P

I think cooperating on standards is useful -- it spreads the work between collaborating peers, and lends credibility to the shared implementation. Personally, I try to cooperate whenever possible. Competition is overrated. 🙃

I like the idea of absolute optimal solutions, but I'm more inclined to cooperate on a good enough solution rather than rewrite our own. If their solution has critical problems that make it a non-starter, then I'd fully support inventing here, but I wanted to ask about alternatives. I'd love to read more of your thoughts/opinions if/when you have time.

User has not chosen to be hosted publicly
@Dominic %Dw5gKcPhF+k2t2MFpsHnVB2pEf4czoo86sAXuG9MyXE=.sha256

@aljoscha it sounds like the problems with ipfs's varint multiformats are with the way varints are encoded, not with the table - if we take these two separately, we could still use the same integer -> format map, with a more canonical way to encode the integers. I think you are saying that the varints ipfs uses have problems, while @christianbundy wants to use the same mapping.

@keks if we reserved two bytes we'd capture everything currently in ipfs's multiformats table

I like @keks's suggestion. The difference here is whether the set of sub-protocols and formats is open or closed - let's say that the hash id of a format is actually derived from the hash of its spec document, such that you can request that spec for protocols you don't understand.

Alternative version of that: instead of attaching a truncated spec hash to each link, include a hash of the table of formats you support, so you can use a different mapping. Hmm, actually this might be too complicated - what do you save when someone uses a different mapping to you?

So I think the simpler way is an official table, but also allow hash based ids, so that experimenting with a new type can't be blocked by a committee. Okay, that's just a variation on @keks's suggestion. Let's say it's a varint for the mapping, but there is a mapping for an extended type, which can be a truncated hash of the spec.

@aljoscha %Nh6FLWt9OwxOkJXBXFIJCADuNtgknVLu3yBW4zN9rkI=.sha256

TLDR: The data is exchanged by computers, so make stuff machine-friendly. Also: "Why?" is a more important question than "Why not?"

I'll be honest: I'm unable to relate to how you could think those are good ideas. All of what you write is certainly possible, but just why would you want to do this?

This post has cost me a lot of energy, and it might come across as aggressive at times. I apologize for this, but I'd greatly appreciate some sort of acks (disagreement counts as an ack) from @keks and @Dominic on the main points. And also, please point out any points I did not sufficiently address.

What are we going for? A simple, and consequently efficient representation of multiformats in general and multiboxes in particular. Those formats are primarily designed for computers to handle and exchange. The current representation is unnecessarily complicated - especially from the point of view of a computer.

A multibox consists of some sort of identifier for the encryption algorithm (and currently ssb only uses private-box), and then some bytes to apply the algorithm to. That's all there is to it. Storing the bytes is simple, all we need is a way of specifying the algorithm.

We need to distinguish between different algorithms, and we need to support a lot of them. The simplest way of doing that is to use natural numbers. Add in an arbitrary, but sufficiently large maximum number (because arbitrarily large formats in an adversarial setting are not a good idea), and you get 64 bit unsigned integers. I think that those are the objectively simplest solution that satisfies all requirements. Everything that is more complicated needs some very good reasons.


So what do we gain from applying some sort of hashing scheme? Frankly, I just can't come up with an answer to that question. Maybe I just don't get it. But all I can do is give some arguments why I think none of the points you raised actually apply.

As for @Dominic's concern of centralized allocation, I argued here that there's no difference between strings and integers.

@keks

We can still assign numerical values to these in a centralized table somewhere for more efficient binary transfer, but when it comes to string representations I think we should give it a name. e.g. .box-pfs.

In my opinion you are looking at this backwards. This is for machine-consumption first. We are not assigning a numeric identifier to a stringly-typed concept. We are assigning a human-readable string to a numeric identifier. And we should absolutely do that, something like patchbay's or manyverse's raw message view should definitely display a string rather than a number. But this can (and imo should) stay fully independent from the actual encoding.

@keks

What do you do when someone sends a .boxffffffffffffffff and you don't have that format? You already have a lookup table fmtID -> decryptFunc, and you can have another publicly accessible lookup table fmtString -> fmtID. I don't see why that wouldn't be possible.

It's possible. But why add another lookup when we can avoid it?

@keks

Currently it's pretty easy to make sense of most ssb message json. I don't understand why we would give that up for the benefit of not needing one more lookup table.

This is a pretty fundamental point. And again, this is because we are feeding human-friendly data to machines. The formats should be optimized for machines first. What we'd get from that is 25% smaller messages and eight times faster parsing. And if a human wants to look at this stuff, we can still convert it to a readable representation. There's just no need to do everything in a human-readable format.

Ideally, I want to see the json transport format fully deprecated. The only use of json should be for hash and signature computation of legacy messages (and that's only because we can't change it anymore). If humans want to look at a raw message, we can present it however we want. The hexadecimal ascii encoding of box identifiers is only for hash computation. To humans, we can still display ".box-pfs". Incidentally, I have no idea what pfs could possibly mean, so as a human I'd still need to look it up, just like I'd have to look up .box123.

@keks

If we try to make the format as general as possible and later realize we missed something, the only thing we can do is to make a backwards-incompatible change to the format - but that is just upgrades, not selecting between formats with different use cases.

It's the other way around actually. If box-2 is a breaking change over box-1, then box-2 has the same relation to box-1 as refrigerator-54 has to box-1: None at all. From a machine's perspective, version numbers only make sense for backwards-compatible upgrades. All breaking changes are a completely new thing instead. Semver got this wrong: The major version should actually be part of the name, not of the version number.

continued in next post...

@aljoscha %J+yeB8xUkWnnuVey6IcH6/O63rYP9OFINXeQm8Rl2Po=.sha256

@keks

[...] the collision probability with 1000 64bit hashes is pretty darn low (<10^(-13)).

This is exactly the same as directly choosing a random number between 0 and 2^64-1. And pretty much the same as choosing a number between 2^32 and 2^64-1, if we want to reserve a few trillion ids. Well, or between 255 and 2^64-1 (not that it makes much of a difference).

@keks

As for the hashing, I just don't see how this provides any advantage over just allowing any 64 bit id without caring about how it was created.

That way the database can compute the index number, even though it doesn't know the format string.

If you directly encode the index instead, you can also compute the index number: by applying the identity function. That's even simpler than hashing.

@Dominic

The difference here is whether the set of sub-protocols and formats is open or closed - let's say that the hash id of a format is actually derived from the hash of its spec document, such that you can request that spec for protocols you don't understand.

How realistic is it that this will get implemented? How realistic is it that such a lookup will be simpler than a lookup by numeric id? And say we could look up the corresponding spec for an identifier. What then? If the machine can't act on it alone, then we didn't gain anything.

@keks if we reserved two bytes we'd capture everything currently in ipfs's multiformats table

None of which is a multibox format. So why?

And if this is not about only multiboxes: Why should all formats share the same table? A key would need to reject any non-key formats anyways. Also, the clmr proposal uses compact type-length-value encodings that can skip encoding the length for certain identifiers. This isn't set in stone of course, but why should we drop that property?

So I think the simpler way is an official table, but also allow hash based ids, so that experimenting with a new type can't be blocked by a committee.

The format does not need to care about this! It receives an unknown id, it stores the message, done. No need to care about whether that identifier was created by incrementing a counter, by choosing a random number, or by hashing some specific sequence of bytes.

So I think the simpler way is an official table, but also allow hash based ids, so that experimenting with a new type can't be blocked by a committee.

This doesn't make sense. How is a committee able to block the numeric identifier 937931567 but unable to block the format whose hash is 937931567?


A general note: Hashes distribute evenly over all 2^64 possible values, making varints useless for compression. Assigning small numeric identifiers allows us to keep the encodings smaller via varints.
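To illustrate with a LEB128-style varint (seven payload bits per byte, the style the multiformats varint uses; the helper name is made up):

```js
// Byte length of an unsigned LEB128-style varint: small allocated ids
// stay short, evenly-distributed 64-bit hashes almost never do.
function varintLen(n) { // n: non-negative BigInt
  let len = 1;
  while (n > 0x7fn) { n >>= 7n; len++; }
  return len;
}

varintLen(0n);                  // 1 byte  (e.g. private-box)
varintLen(937931567n);          // 5 bytes
varintLen(0xdeadbeefcafebaben); // 10 bytes: a typical 64-bit hash value
```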

And a final remark: This is such a tiny detail, it's just not worth adding any complexity. The gains are so ridiculously small, we should just go with the simplest possible solution.

@Dominic %C+eXXrHJaPHnB5cYMj2SYF9Qziu4c8qYtGPwb2Iuu5I=.sha256

@aljoscha you always treat these things as a purely technical problem and don't write about (or acknowledge what we write about) social implications. Well, we think it's worth thinking through whether an allocation system is centralized or not. We are not just trying to build an efficient system, we are also trying to build a system that embodies healthy social relations.

We might think through it and decide that you are right, and that straight-up numbers are fine, or that we can leave a door open to hash identifiers later, etc. But we need to explore the possibilities first.

I think that those are the objectively simplest solution that satisfies all requirements.

objectivity.jpg

For example: allocating codes with a strictly incremental index is more centralized than allocating ranges - because whoever you've allocated the range to can then choose what to do within the range.

It doesn't necessarily matter if any two ssb instances aren't fully compatible with each other.
This protocol is really already a suite of protocols; if they have some overlap, they can communicate.

@Dominic %LcyvIn7RTJ2m33SyqK6EL+hYQeP4AJBH4hHnrYCbHUA=.sha256

dang jpg
objectivity.jpg

User has not chosen to be hosted publicly
@aljoscha %0IOEn7lUXZrU0Dw4CHOJp9wY8M44cNuicr0cjP1xwHw=.sha256

@Dominic, @keks, thank you both for your responses, they clarified a lot for me.


On one end of the design space we have integer identifiers, on the other end bounded-length strings. While I personally prefer integers, both of these are fine, both of these are conceptually simple. The compromises in-between these extremes are where stuff gets complicated.

With numeric identifiers, we can allocate some for the core protocol and leave the remaining ids for experimentation and extensions. But we can't dictate whether the free ids are to be assigned as hashes or not. Whatever hashing scheme you design, you still can't stop me from using a random id. That's the whole point of keeping these open for independent extensions. So in my opinion, an identifier-based system should not care about where those identifiers came from. If you want convention-based lookups, you can implement those on a higher level. But the protocol itself doesn't need to know.

The multicodec approach is to just preallocate identifiers for all the things. Not only is this fundamentally impossible to do, it also goes against the whole idea of allowing independent extension.

These are the main reasons why I feel like those compromises are not worth it. If that means that numeric identifiers are out of the question and we should use strings after all, that's ok. It's the unsound (or at the very least complicated) interpolation between numbers and strings that I am opposed to.


My main assumptions regarding the question of numeric ids vs strings:

  • when exchanging data between machines, use a format optimized for machines
  • when inspecting the data exchanged between machines, you can transform the data into a human-readable form, so there is little gain in making the exchange format itself human-readable
  • an identifier space of size 2^64 is large enough to prevent centralization

The main argument for numeric identifiers is that they are more efficient. They are also slightly simpler than strings, but as long as the strings have bounded length and are restricted to a sensible unicode subset, that should be fine.

But I'm failing to buy the arguments in favor of strings. I see no inherent decentralizing property in unicode strings of bounded length over bit-strings of bounded length (aka unsigned integers). Human-readability can be achieved through transformation into json. The only hard advantage I can acknowledge is that of human-readability of unknown formats.

Receiving "box-pfs" is easier for humans to handle than "box-26f3a". But what do we do with this information? If we want to treat the box as an opaque object of unknown content, all we need to do (and in fact can do) is to check equality. That's something number allow us to do as well. And if you want something more suitable for your puny human brain, then the human-friendly format can convert those into chernoff faces or whatnot (the modern equivalent being base-emoji...). Realistically though, I doubt there will be many situations where a large number of unknown, hard-for-humans-to-distinguish box suffixes will become a problem, so hexadecimal encoding should suffice.

If you want to go beyond simple comparisons, then you need to look up the details of the suffix anyways, so the descriptiveness of the name doesn't make a real difference.


I hope this helped clarify my view of the design space.

User has not chosen to be hosted publicly
@aljoscha %v26bTEuM/SzsYFtN+shm4yObp5pQ1OgWTRBYGWClATk=.sha256

@keks

A bunch of responses, in no particular order:

Regarding the identity function: I didn't mean literally taking the bytes of the string, but I see how I failed to communicate this - "identity" was not the right word. What I actually meant is having an (essentially arbitrary) bijection between uint64s and some set of json strings, and only allowing those strings as suffixes. A reasonable way of doing that is using the hexadecimal encoding of the uint64 without leading zeros.

Actually, I suggest reserving all ids that start with a 0 bit

Larger numbers get larger varint encodings, so that makes the non-preassigned ones second-class (in fact, all of them would need 9 bytes to encode). So what about reserving all ids that end with a 0 bit (aka even numbers)? And also, do we really need to reserve 2^63 identifiers?

For hashing, I think I would use the rightmost 53 bits though, so the hex is the same, except in 50% of the cases the first digit. Otherwise we would use the first 64 bits except the first bit (weird) or the first 63 bits - but prepending a 1 shifts all the bits and so the hex will look completely different (also weird).

I don't understand this part. Wouldn't you just take the string, hash it to an identifier, then encode the identifier in whatever format you need? Also I don't get where the 53 comes from. And the "hex will look completely different" from what?

I think just shifting in a 1 bit into the digest would work fine.

User has not chosen to be hosted publicly
@aljoscha %it3GqVJOYLCEwmyp5hMDSvElyERRgs6KYPI5yHt4xDo=.sha256

@keks I still feel like we are talking past each other.

We need to clearly separate the logical data values from the actual encodings. The logical value of a multibox consists of an identifier for the primitive, and a slice of bytes (the cyphertext). To me, this discussion is primarily about settling on the type of the primitive-identifier.

Assuming the identifier type is uint64, we can then look at possible encodings, and there are a bunch of those. For the machine-exchange format, I would have defaulted to using a VarU64 (n.b. not the same as a multiformats varint) for the identifier, followed by the cyphertext (varu64 for the length, followed by that many raw bytes).
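A sketch of that layout (a generic LEB128-style varint stands in for the actual VarU64 rules here, just to show the shape):

```js
// Sketch of the proposed binary multibox layout: <id> <length> <ciphertext>.
function encodeVarint(n) { // n: non-negative BigInt
  const bytes = [];
  while (n > 0x7fn) { bytes.push(Number(n & 0x7fn) | 0x80); n >>= 7n; }
  bytes.push(Number(n));
  return Buffer.from(bytes);
}

function encodeMultibox(id, ciphertext) { // id: BigInt, ciphertext: Buffer
  return Buffer.concat([
    encodeVarint(id),                        // algorithm identifier
    encodeVarint(BigInt(ciphertext.length)), // cyphertext length
    ciphertext,                              // the raw encrypted bytes
  ]);
}
```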

The next encoding we'd need is one that is compatible with the current json-based signing format that is used to compute the signature of each message. This is done by base64 encoding the cyphertext, appending .box and then appending a suffix identifying the primitive. The suffix for the "private box" primitive is the empty string. Whatever encoding scheme we come up with needs to respect this.

The signing encoding doesn't really need to be all that human-friendly, which is why I'd just go with simply hex-encoding (without any leading zeros) the varu64 identifier to obtain the suffix. This conveniently maps the identifier zero to the empty string.

Where does the hashing scheme fit in here? Honestly, I have no idea. If you want to display a multibox to a human, take its logical value and do whatever you want - the ssb protocol shouldn't need to care. You can suggest a default approach, but I'm free to completely ignore it anyways.


The question of which identifiers to reserve is pretty orthogonal to the encoding questions. But considering that a varint encoding is likely, it might make more sense to reserve based on the least-significant bit than based on the most significant bit, since the former assigns the same number of efficiently encodable identifiers to the reserved and the non-reserved space.
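In code, that reservation check is just parity (a sketch of the least-significant-bit split):

```js
// Even ids belong to the official table, odd ids stay free for
// independent extensions - both halves get equally short varints.
const isOfficial = (id) => (id & 1n) === 0n;

isOfficial(0n); // true  - private-box
isOfficial(1n); // false - free for experiments, and still varint-short
```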

User has not chosen to be hosted publicly
@aljoscha %rypZnZMuhVA+AtxU2ACqRR+w0qkWIZ6SGEg2zQKQjx0=.sha256

I'm using "uint64" or "unsigned 64 bit integer" as a shorthand for "natural number between 0 and 2^64 - 1 inclusive".

@keks

May I suggest: [...]

What are you suggesting this for (this question is genuine, not rhetorical)? It isn't the legacy signing encoding (we should just use a simple bijection (e.g. base32 as you suggested)). It isn't the transport encoding either (that one will use a compact binary representation based on some variable-length integer encoding). But those are the only encodings in the spec.

Are you designing some sort of human-producable input format?

User has not chosen to be hosted publicly
@aljoscha %Ozz67pSRX+7DeQ7CWmfZyKvznMXflSYW7mDjWsKllRU=.sha256

@keks

I don't know what you mean. I thought we are designing the spec right here right now?

Ok, let me rephrase: The scuttlebutt protocol needs encodings for exactly two purposes: Message signing, and message exchange. Nothing else is strictly necessary.

but being able to copy-paste (and edit by hand) human-parseable multibox representations is important to me.

This would be a feature of your application (e.g. go-sbot), but not part of the protocol. I can write my own ssb implementation and have it talk to go-sbot, while completely ignoring your human-friendly representation.

I'm fine putting in some sort of "recommended human-friendly encoding" into the spec, and I like your suggestions for this purpose, but it would be purely that - an optional recommendation.

The only option for truly making this "first class" I can see at this point is mandating that all servers must be able to decode messages received in this format. But the protocol only describes data exchange between machines, and those should use the binary encoding instead. So this would feel very forced. And I'd strongly prefer keeping the minimum stuff you need to do to call your program an ssb implementation as small as possible. It violates the principle of minimality.


This discussion is now also getting very close to the minimal feature set of an ssb server. If we want to specify a general interface for user-server interaction, then something like this starts making more sense.

I think of the ssb protocol(s) as three layers:

  • message format and signing encoding
  • communication between servers for replication
  • communication between server and other processes

The topmost layer is the most important one. Two applications that agree on this can work with data from the scuttleverse, even if they communicate by carrier pigeon (i.e. ignore the lower layers). You could have a scuttleverse that uses a completely different replication layer, and that is fine.

Everything that implements the topmost and middle layer is part of the main scuttleverse (modulo shs appkey). We don't care how exactly @vendan authored or read data from their database, but still they were part of the scuttleverse.

The third layer is a domain where we can only give recommendations, but can't enforce anything. All the stuff we can enforce lives in the first two layers (and even then "enforcing" is not the correct term, we are just not communicating (directly) with those who implement their own second layer, but they can still be happily building their own islands that can bridge into the main verse). The human-readable format should in my opinion live in the third layer, not above. It makes a lot of sense to define one, to foster interoperability, but it just won't be as "first class" as the signing format (topmost layer) or the transport encoding (middle layer).


When I started this discussion, I wanted to define details of layer one ("Is it ok to restrict the multibox identifier space to 2^64 values?" and "Can we find a sensible way to encode those 2^64 identifiers in the signing format?"). I feel like you (and also @Dominic) are fine with an identifier space of size 2^64. I also feel like we agree that a simple bijection between those identifiers and short strings is the way to go for the signing encoding.

So at this point, I am sufficiently unblocked, and I think we don't need to finalize a decision of the human-friendly format just yet.

A short summary of the outcome (or at least the current state) from my perspective:

  • a multibox consists of a cyphertext and a natural number between 0 and 2^64-1 inclusive that identifies the en/de-cryption algorithm
  • private-box has identifier 0
  • to allow people to use their own, non-official formats, we promise to never use an odd identifier for the main protocol(s)
    • if there is a strict advantage in using the most-significant bit instead, please tell me - I'm not aware of one
    • using the least-significant bit however allows people who are choosing their identifiers non-randomly to use shorter ones
  • the legacy signing encoding can use the hexadecimal encoding (lower-case) of the identifier
    • with the metadata redesign we'll switch to a binary signing encoding and deprecate the legacy one, so in practice there won't be (m)any messages for which the larger string-length compared to base32/58/64 is an issue, but everyone still needs to support it, and base16 is simpler
  • we will at some point define human-friendly representations for all the data formats. For multiboxes, the scheme proposed here looks reasonable.

If you have good reasons for partitioning the identifier space differently, or for not using hexadecimal for the identifier encoding, I'm open to switching - those are just my (slightly) preferred choices.

User has not chosen to be hosted publicly
@aljoscha %WTlLNna8T6Wk6/Ta44Jb6XP6m1SnmrANWSu9yDca6FU=.sha256

So we can have a proper spec that says "the string representation of ids looks like this:" - even though we don't use it in the most basic use case.

Ack, let's do this.

I think the format here should be as consistent as possible with the format we plan to introduce later. So if we already have a pretty good idea about what our string representation will look like, let's just use that format. Otherwise we'll have "that one weird string representation that is basically the same as the one we usually use, but for some reason it's hex and not base32".

Fair enough. So let's bikeshed this some more =)

In general, do we want to interpret the number as eight big-endian bytes and then encode those, or do we write out the number with the smallest number of digits? The former makes more sense for stuff like base32/58/64 encoding, the latter is more appropriate for a decimal or hexadecimal representation (e.g. seventeen would become a1 in hex rather than base-whatever-encoding eight bytes).

As for concrete encodings, I see the following options:

  • base10: arguably the easiest one for humans, but the least compact option
  • base16: by far the simplest one (unlike all the other formats, there are neither padding nor canonicity issues)
  • base64: what all the other legacy stuff already uses
    • padding or no padding?
  • url-safe base64: compact and, well, url-safe
    • but all the other legacy stuff is already regular base64
  • base58: even human-friendlier than base64 while still staying pretty compact, but probably the most complicated one
  • base32: @keks said so =P
    • any particular reason you favor this, or was it just a placeholder suggestion?

Personally I think writing the number in base 10 or base 16 is easier for humans to deal with than encoding eight bytes in some weird scheme.

User has not chosen to be hosted publicly
@aljoscha %2Jg/TdQ+3kA4Z6BmhhU6JEnQu7W3G5t4bhc4yTdi9OQ=.sha256

@keks

Right, should have been 0x11.


When we write numbers in natural text, we use a variable-length encoding, e.g. 17 instead of 00000017 (imagine that the latter was exactly eight bytes of information; I'm too lazy to calculate how many leading zeros we'd need and to deal with padding issues). If we choose to flat-out encode all eight bytes, that is what formats like base64 were designed to do. Basically we turn the number into an array [0, 0, 0, 0, 0, 0, 0, 0x11], then base64 encode the whole thing, yielding AAAAAAAAABE= (we can choose to drop the padding "=").

Alternatively we can do the natural-text, variable-length encoding by dropping all leading zero digits (A in the case of base64). So 17 just becomes BE. My point was that BE (radix 64) is not as human-friendly as 17 (radix 10) or 11 (radix 16).
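For concreteness, both styles re-run in Node (just reproducing the example above):

```js
// The two encoding styles for seventeen, as worked through above.
const id = 17n;

// fixed-width: eight big-endian bytes, then base64
const buf = Buffer.alloc(8);
buf.writeBigUInt64BE(id);
buf.toString("base64"); // "AAAAAAAAABE=" (drop "=" and leading "A"s -> "BE")

// variable-length: minimal digits in the chosen radix
id.toString(16); // "11"
id.toString(10); // "17"
```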

I need to stop now, university stuff... Hope this clarified my intended meaning a bit.

@mikey %s0Ogt7IeCnWzuymw+8N/jaULlkjwZGB/hdehFk+c2xk=.sha256

:+1: base32 as the human-readable base: %CgGEJ0V...

i also appreciate what @keks is saying in general: humans should not be an after-thought of the protocol, i don't think the success of Scuttlebutt will ever be from how we enable machines to interface with each other, it will come from how we enable humans to use machines to interface with each other, more human accessibility is worth a less efficient machine protocol.

User has not chosen to be hosted publicly
@aljoscha %VxBjPlpHZ+f1++WiUd1Vgq1Mt2TJD6CQQIGG6gWOq/0=.sha256

Ok, so here's what I'd put into the spec:

  • a multibox consists of some cyphertext and a natural number between 0 and 2^64-1 (inclusive) identifying the crypto (private box has identifier zero)
  • clmr encodes a multibox as a VarU64 of the id, followed by the cyphertext length as a VarU64, followed by the cyphertext
  • for signing json, private box uses the suffix .box, other ones use .box-<decimal-without-leading-zeros>

Is that alright? @keks, @Dominic, @dinosaur

The js ssb-validate module should change the regex for boxes accordingly.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %7oH4yI4R7sZLmHPfNp64WcVeBROlb9v4jDGh2BCQtYM=.sha256

@keks Sorry, I misread the part on base10. base32 it is then.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %dzwyuIIKFvu4iEhznAfklTz/IwgZc0ddzs06B+Zd0cw=.sha256

@keks Why do you feel like these should not use the same format?

If this should not be the same as the human-friendly format, I'd prefer straight up .box<base32 without leading zeroes>, which removes the special case for private box.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %SlA2ALoa+dCTzQd/0yWrgiRel4kBUd4UKy0L9BpKzPI=.sha256

@keks I'm not a fan of Crockford base32. Crockford requires decoding multiple different characters to the same value, e.g. 1, I, i, L, l all decode to 1. While that might be nice for correcting human errors, it's not appropriate for creating canonical forms.

I'd like to just go with rfc4648 base32, uppercase and without padding.

(well actually I'd prefer base32hex, but that one sacrifices even more human-friendliness)

User has not chosen to be hosted publicly
@aljoscha %H1j3n0MxM+3nRsOaFJ6GzrOE0urD11FPKih8t+B836g=.sha256

@keks The base32 suffix influences the signature. If we allowed different encodings, it wouldn't be sufficient to store the id; you'd also need to store the concrete encoding. So we need to disallow anything but the canonic encoding. It wouldn't really be general crockford-base32, but a canonical subset of it.

I'm fine with canonic crockford, but not arbitrary crockford.
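A sketch of such a canonical encoder (exactly one spelling per id, no leading zeros; a canonical decoder would accept only these strings):

```js
// Sketch: canonical Crockford base32 for uint64 ids, no leading zeros.
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // no I, L, O, U

function idToCrockford(id) { // id: BigInt in [0, 2^64 - 1]
  if (id === 0n) return ""; // id 0 = private-box = bare ".box"
  let s = "";
  while (id > 0n) { s = ALPHABET[Number(id & 31n)] + s; id >>= 5n; }
  return s;
}

idToCrockford(17n);            // "H"
idToCrockford(2n ** 64n - 1n); // "FZZZZZZZZZZZZ" (13 digits)
```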

User has not chosen to be hosted publicly
@aljoscha %ZUML5bNr9IT/D8BhNVjknS9dMwoUmg8wiCUamn1DLd8=.sha256

@keks @Dominic

Is this spec ok for both of you?

@aljoscha %ckLpdPvlzwlgg+sz8dgP/vnBycbCzVmzbEpI4ZT5yR0=.sha256

Here's a regex for ssb-validate:
\.box(?:(?:[1-9A-F][0-9A-HJKMNP-TV-Z]{12})|(?:[1-9A-HJKMNP-TV-Z][0-9A-HJKMNP-TV-Z]{0,11}))?$
You can probably come up with a better one. Stuff that should match:

.boxF0123456789AB
.boxAH
.box10
.box1
.boxZZZ

Stuff that must not match:

.boxG0123456789AB
.box0
.box01
.boxU
.boxa
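A quick sanity check of the regex against those examples (sketch):

```js
// Checking the proposed regex against the examples above.
const re = /\.box(?:(?:[1-9A-F][0-9A-HJKMNP-TV-Z]{12})|(?:[1-9A-HJKMNP-TV-Z][0-9A-HJKMNP-TV-Z]{0,11}))?$/;

const good = [".boxF0123456789AB", ".boxAH", ".box10", ".box1", ".boxZZZ"];
const bad = [".boxG0123456789AB", ".box0", ".box01", ".boxU", ".boxa"];

for (const s of good) console.assert(re.test(s), `should match: ${s}`);
for (const s of bad) console.assert(!re.test(s), `must not match: ${s}`);
```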
@aljoscha %vryqRDRvC85iyTt8Jg6673i4iFWim6P/vph8F6CvV54=.sha256

Ping @Dominic, can we get this into ssb-validate (see post above for a regex that rejects invalid box suffixes)? If js ssb doesn't enforce this, having it in some written spec doesn't really matter at this point.

CC @keks
