TLDR: The data is exchanged by computers, so make stuff machine-friendly. Also: "Why?" is a more important question than "Why not?"
I'll be honest: I'm unable to relate to how you could think those are good ideas. All of what you write is certainly possible, but just why would you want to do this?
This post has cost me a lot of energy, and it might come across as aggressive at times. I apologize for this, but I'd greatly appreciate some sort of ack (disagreement counts as an ack) from @keks and @Dominic on the main points. Also, please point out any points I did not sufficiently address.
What are we going for? A simple, and consequently efficient representation of multiformats in general and multiboxes in particular. Those formats are primarily designed for computers to handle and exchange. The current representation is unnecessarily complicated - especially from the point of view of a computer.
A multibox consists of some sort of identifier for the encryption algorithm (and currently ssb only uses private-box), and then some bytes to apply the algorithm to. That's all there is to it. Storing the bytes is simple, all we need is a way of specifying the algorithm.
We need to distinguish between different algorithms, and we need to support a lot of them. The simplest way of doing that is to use natural numbers. Add in an arbitrary, but sufficiently large maximum number (because arbitrarily large formats in an adversarial setting are not a good idea), and you get 64 bit unsigned integers. I think that those are objectively the simplest solution that satisfies all requirements. Everything that is more complicated needs some very good reasons.
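To make that concrete, here is a rough sketch of what "a number plus some bytes" could look like. The byte layout (a fixed-width big-endian integer in front of the payload) and the function names are my assumptions for illustration only; nothing in this thread has settled on a layout, and a varint would work just as well.

```python
import struct
from typing import Tuple

# Largest value an unsigned 64 bit integer can hold.
MAX_FORMAT_ID = 2**64 - 1


def encode_multibox(format_id: int, ciphertext: bytes) -> bytes:
    """Prefix the ciphertext with the format id as 8 big-endian bytes.

    Assumption: fixed-width big-endian layout, chosen only for illustration.
    """
    if not 0 <= format_id <= MAX_FORMAT_ID:
        raise ValueError("format id does not fit into 64 bits")
    return struct.pack(">Q", format_id) + ciphertext


def decode_multibox(data: bytes) -> Tuple[int, bytes]:
    """Split an encoded multibox into (format id, ciphertext)."""
    if len(data) < 8:
        raise ValueError("too short to contain a 64 bit format id")
    (format_id,) = struct.unpack(">Q", data[:8])
    return format_id, data[8:]
```

Decoding is one bounds check and one fixed-size read; there is nothing to parse or look up before the machine knows which algorithm it is dealing with.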
So what do we gain from applying some sort of hashing scheme? Frankly, I just can't come up with an answer to that question. Maybe I just don't get it. But all I can do is give some arguments why I think none of the points you raised actually apply.
As for @Dominic's concern of centralized allocation, I argued here that there's no difference between strings and integers.
We can still assign numerical values to these in a centralized table somewhere for more efficient binary transfer, but when it comes to string representations I think we should give it a name. e.g. .box-pfs.
In my opinion you are looking at this backwards. This is for machine-consumption first. We are not assigning a numeric identifier to a stringly-typed concept. We are assigning a human-readable string to a numeric identifier. And we should absolutely do that, something like patchbay's or manyverse's raw message view should definitely display a string rather than a number. But this can (and imo should) stay fully independent from the actual encoding.
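As a sketch of what I mean by keeping the two independent: the display name can live in a table that only the user interface consults. The table contents and the numeric assignments below are made up for illustration; no such allocation exists.

```python
# Hypothetical assignments, purely for illustration.
FORMAT_NAMES = {
    0: "box",
    1: "box-pfs",
}


def display_name(format_id: int) -> str:
    """Return a human-readable label for a numeric format id.

    Only the UI calls this; the wire encoding never sees the string.
    Unknown ids fall back to a generic numbered label.
    """
    name = FORMAT_NAMES.get(format_id)
    if name is None:
        return ".box" + str(format_id)
    return "." + name
```

A raw message view would call `display_name` when rendering, and a client that has never heard of a format can still show something sensible.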
What do you do when someone sends a .boxfffffffffffffff and you don't have that format? You already have a lookup table fmtID -> decryptFunc, and you can have another publicly accessible lookup table fmtString -> fmtID. I don't see why that wouldn't be possible.
It's possible. But why add another lookup when we can avoid it?
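Roughly, with numeric ids the whole dispatch is the one table you need anyway. In the sketch below the "decrypt" function is a stand-in that just reverses the bytes (it is not real cryptography), and the id assignment is hypothetical.

```python
from typing import Callable, Dict, Optional


def decrypt_placeholder(payload: bytes) -> bytes:
    # Stand-in for a real decryption routine; it only reverses the bytes.
    return payload[::-1]


# fmtID -> decryptFunc. The id 0 is a hypothetical assignment.
DECRYPTORS: Dict[int, Callable[[bytes], bytes]] = {
    0: decrypt_placeholder,
}


def try_decrypt(format_id: int, payload: bytes) -> Optional[bytes]:
    """Look up the decryptor by numeric id; None means unknown format."""
    decrypt = DECRYPTORS.get(format_id)
    if decrypt is None:
        return None
    return decrypt(payload)
```

An unknown id such as `0xfffffffffffffff` simply misses the table, which is exactly what happens with an unknown string, minus the extra fmtString -> fmtID step.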
Currently it's pretty easy to make sense of most ssb message json. I don't understand why we would give that up for the benefit of not needing one more lookup table.
This is a pretty fundamental point. And again, this is because we are feeding human-friendly data to machines. The formats should be optimized for machines first. What we'd get from that is 25% smaller messages and eight times faster parsing. And if a human wants to look at this stuff, we can still convert it to a readable representation. There's just no need to do everything in a human-readable format.
Ideally, I want to see the json transport format fully deprecated. The only use of json should be for hash and signature computation of legacy messages (and that's only because we can't change it anymore). If humans want to look at a raw message, we can present it however we want. The hexadecimal ascii encoding of box identifiers is only for hash computation. To humans, we can still display ".box-pfs". Incidentally, I have no idea what pfs could possibly mean, so I as a human still need to look it up, just like I'd have to look up .box123.
If we try to make the format as general as possible and later realize we missed something, the only thing we can do is to make a backwards-incompatible change to the format - but that is just upgrades, not selecting between formats with different use cases.
It's the other way around actually. If box-2 is a breaking change over box-1, then box-2 has the same relation to box-1 as refrigerator-54 has to box-1: None at all. From a machine's perspective, version numbers only make sense for backwards-compatible upgrades. All breaking changes are a completely new thing instead. Semver got this wrong: The major version should actually be part of the name, not of the version number.
continued in next post...