@aljoscha %iy6Pekirbl7hHHHuGsYWyfY9j0qqoV4EbFiApL6q3cQ=.sha256

Simpler Message Metadata

Warning: This is a lot of text, containing a lot of information, as well as strong opinions. Dive in at your own risk.

My proposals around #hsdt are focused on a json replacement for schemaless free-form data, i.e. the content field of ssb messages. But that is not the only place where ssb uses json. Another place where json is used for schemaless data is muxrpc. I won't go into muxrpc here, but it is worth keeping in mind that most of the current discussion around data formats applies to muxrpc as well. In this post, I'm going to discuss the metadata of messages.

CC @Dominic @arj @cel @cryptix @Piet

tldr: message metadata can be radically simplified and shrunk, roughly by an order of magnitude

Introducing hsdt will require a new hash suffix. That gives us complete freedom on how to lay out message metadata. And there is some room for improvement over the current situation. @Dominic has just posted some information relevant to these considerations.

A quick quote from the protocol guide for convenience:

  • previous: Message ID of the latest message posted in the feed. If this is the very first message then use null. See below for how to compute a message's ID.
  • author: Public key of the feed that the message will be posted in.
  • sequence: 1 for the first message in a feed, 2 for the second and so on.
  • timestamp: Time the message was created. Number of milliseconds since 1 January 1970 00:00 UTC.
  • hash: The fixed string sha256, which is the hash function used to compute the message ID.
  • content: Free-form JSON. It is up to applications to interpret what it means. It's polite to specify a type field so that applications can easily filter out message types they don't understand.

There's also a "signature" field, containing an ssb cypherlink which depends on the encoding of all the fields listed above.

All messages have this same set of metadata, and we can utilize that. In practice, some oversights have crept in, but hsdt can ignore all of them. In this post, I'll discuss the idealized setting that applies to hsdt. I'll do a later post on how this can be adapted to the semi-canonical json replacement.

So how should the metadata be encoded? We can simply concatenate the actual metadata values in a predetermined order, no need for any keys. The encoding becomes <total size of message in bytes?><feed id?><backlink to previous message><sequence number?><hash?><timestamp?><content><signature>, where ? means that we could get rid of the entry. Now on to the details. There are a lot of them!
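
To make that concrete, here is a rough TypeScript sketch of the "concatenate in a fixed order, no keys" idea. All names and field widths are my own illustration, not a spec, and the entries marked with ? above are simply treated as present here:

```ts
// Sketch only: the proposed layout is a plain concatenation of values in a
// fixed order. Field names and widths here are invented for illustration.
interface Metadata {
  backlink: Buffer;   // binary cypherlink to the previous message
  sequence: bigint;   // candidate for removal (see below)
  timestamp: number;  // candidate for removal (see below)
}

function encode(meta: Metadata, content: Buffer, signature: Buffer): Buffer {
  const seq = Buffer.alloc(8);
  seq.writeBigUInt64BE(meta.sequence);
  const ts = Buffer.alloc(8);
  ts.writeDoubleBE(meta.timestamp);
  // No keys, no separators: the decoder knows the order, and each entry is
  // either fixed-width or self-describing in its length.
  return Buffer.concat([meta.backlink, seq, ts, content, signature]);
}
```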

Message Size In Bytes

This is an interesting one with a few possible ways to go.

The first take on this: Messages have variable length, so to decode them, we need to tell the decoder the length. Thus we need to prefix it. This could be done either with a fixed-width integer (depending on what message size limit we end up settling upon, 2 bytes might be enough), or with a varint.
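
For reference, a length prefix as an unsigned LEB128-style varint could look like this (just a sketch, not part of any proposal):

```ts
// Minimal unsigned LEB128/protobuf-style varint, as one way to prefix a
// message with its length without committing to a fixed width.
function encodeVarint(n: number): Buffer {
  const bytes: number[] = [];
  do {
    let b = n & 0x7f;         // lowest 7 bits
    n = Math.floor(n / 128);
    if (n > 0) b |= 0x80;     // continuation bit
    bytes.push(b);
  } while (n > 0);
  return Buffer.from(bytes);
}

// e.g. a 300-byte message gets a 2-byte prefix: <0xac, 0x02>
```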

A second view: The message is already transported somehow, and the transport framing already knows the length, so there's no need to include it again; simply reuse the length as specified by the transport layer. This places some restrictions on the transport protocol though: it can't do something like <sum of lengths of three messages><msg1><msg2><msg3>. The current transports don't do this (afaik), but I dislike the idea of tight coupling with transport protocol details.

And a third perspective: All metadata has to be self-describing in its length, so there's no strict need to prefix the length. A parser can just read through the metadata one piece at a time, allocating memory on the fly, and automatically knowing when to stop. An implementation can choose to do memory-related optimizations such as bulk allocation by taking transport framing data into consideration, but there's no tight logical coupling.

I currently favor the third perspective, but I'm not settled yet. Including a size makes parsing easier and adds to the robustness of the whole thing. If the message size were included, it should be the first entry, so that databases could simply drop it if it wasn't needed. I'd argue against putting it into the hash computation though (it is easy to ignore the first part of the encoding), since it does not include any actual information. It's just meta-meta-data that can be reconstructed from the logical model of the metadata.

continued in next post...

@aljoscha %wS07I+2kNlzrM92xRKifaUgK7omIdgWxVt+rlXxmhwc=.sha256

The backlinks are the heart of the personal blockchain of each ssb user (aka the feed). Together with the message signatures, they deliver the cryptographic guarantees of ssb. So surely there's nothing to change about these, right? Wrong.

First of all, cypherlinks could be changed to a binary encoding. A byte tagging the type (currently done by an ASCII sigil), followed by a byte indicating the hash function (currently suffixed as a free-form string). In case of healthy paranoia, only use the first four of these 16 total tag bits for the sigil; that leaves 12 bits for the hash function, giving us 16 possible sigils and 4096 possible hashes. If ssb ever uses all of those, it should be burned to the ground anyways. The hash function tag implicitly specifies the length of the hash, so these two tagging bytes can be directly followed by the hash value (in binary, not in base64).
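
As a sketch of how the two tag bytes could be packed (the concrete sigil and hash-function numbers are made up here):

```ts
// Sketch: pack a 4-bit sigil and a 12-bit hash-function id into two tag
// bytes, followed by the raw hash. The concrete numbering is invented.
const SIGIL_FEED = 0x1;    // '@' today
const SIGIL_MESSAGE = 0x2; // '%' today
const HASH_SHA256 = 0x001; // 32-byte digest

function encodeCypherlink(sigil: number, hashFn: number, digest: Buffer): Buffer {
  const tag = Buffer.alloc(2);
  // high 4 bits: sigil, low 12 bits: hash function
  tag.writeUInt16BE(((sigil & 0xf) << 12) | (hashFn & 0xfff));
  return Buffer.concat([tag, digest]);
}

// A sha256 message link is then 2 + 32 = 34 bytes, versus the 52 characters
// of sigil + base64 + ".sha256" suffix today.
```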

The binary encoding would be simpler to parse than the current encoding (no base64 fiddling, no indeterminate length (putting the suffix at the end makes parsing unnecessarily complicated)), and more efficient (can directly dump the binary hash data, more compact representation of it, much shorter metadata). But this was more of a digression; the actual encoding of cypherlinks is orthogonal to the metadata encoding. It's just easier to roll out both at the same time.

But there's another thing we could improve, and that's related to the feed id metadata. I argue that we can completely drop it, by making a tiny change to the backlink: The first backlink of a feed should not be null, but the cypherlink to the feed id. That way, given any message, you can theoretically determine the feed id by following the linked list. To get the abstract model of ssb working, it is not required to store feed ids with each message. Especially since ssb assumes that all previous messages of a feed are always known (otherwise full verification is impossible), there can be no "broken" cypherlinks. Don't worry, I'll discuss out-of-order messages ("ooo") in a few paragraphs.

Traversing a cypherlinked list of messages to find the feed id is of course a bad idea in practice. A real implementation would store it, for example by keeping all feeds in a map from ids to sigchains. Or by storing the feed id together with each message. That's how current implementations do it, because the current signature scheme mandates it. But by moving this out of the required metadata, implementations get to choose how to deal with this. The conceptual model becomes simpler, and concrete implementations gain the freedom to improve.
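
A minimal sketch of that kind of bookkeeping, using today's textual ids for readability (entirely an implementation detail, all names are made up):

```ts
// Sketch: keep the feed id per message in the database instead of in the
// wire format, by memoizing it on ingest.
const authorOf = new Map<string, string>(); // message key -> feed id

function recordMessage(key: string, backlink: string) {
  // Under the proposal, the first message's backlink *is* the feed id.
  if (backlink.startsWith("@")) {
    authorOf.set(key, backlink);
  } else {
    const inherited = authorOf.get(backlink);
    if (inherited === undefined) throw new Error("previous message unknown");
    authorOf.set(key, inherited);
  }
}

// Looking up the author of any stored message is then an O(1) map access.
```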

Ssb does have a scenario where we can't traverse the full linked list, and that's ooo. A database can ask its peers for specific messages, without requiring the full feed of the author. Instead of verifying it yourself, you trust the identities that linked to the requested message that it is indeed valid. Does this mean that the feed id just has to be in the metadata of each message after all? No, because it can be separately transmitted as part of the ooo response. Again, this moves the complexity out of the rigid conceptual model into the flexible implementation details. The rpcs are less flexible than the db implementations, because different peers need to agree on the same rpc calls. But they are still much more flexible than the hash-enforced representation of every single message on ssb up to this point.

Sequence Numbers

Everything I said about the feed id also applies to the sequence number, except that it is even simpler to replace (no need to even change the first backlink). Let's get rid of sequence numbers!

Hash

The hash metadata must match the hash suffix that is already included in the signature. The information is redundant -> drop it. It is already optional anyways.
Just for completeness: Since this is conceptually part of the signature, the current default representation of hash followed by content followed by signature is suboptimal for parsing reasons. Also for completeness: Just like cypherlinks could be encoded in binary, so could signatures, so that the hash information becomes just a single byte or maybe two bytes of data.

Timestamp

There's already been discussion on these, both recent and some time ago. They are still mandatory, but the requirement for them to increase monotonically has been lifted. So now you can generate random timestamps, without the protocol punishing you. And if it can be done, then it will be done. There are enough people who randomize their git commit times for anonymization already.

Still including mandatory timestamps just gives client devs the illusion that they could rely on them, when they really shouldn't. Client devs have been burned by this in the past, and it will happen again, unless timestamps are completely removed.

In a distributed system, it is impossible to get true global timestamps, both logically and through limitations of physics. And why would ssb even want global timestamps, if it is about subjectivity? What we do want is a total order on messages, which is why timestamps even use floats rather than integers. But we already have this total order: it is given by the sigchain. And unlike timestamps that can be made up, this order is cryptographically verified.

Justifying timestamps because they are needed for database queries is dangerous. They might be made up, so the queries might not make sense. And not every message type needs this, so why enforce it on everyone? If you want timestamps for your message, add a timestamp to its content, and use a flume view for database indices. Also, there will be other implementations that might not want to provide timestamp-based queries (because they are completely useless in a distributed setting anyways), so the core ssb model should not force them upon devs.

If your ssb-related code relies on timestamp queries, it is already broken.

Content

Yeah, let's not remove that one. I do have some ideas for a better encoding than json though...

Signature

Let's not remove these either. But a binary representation would be sensible, following the same arguments and design as the binary cypherlinks. If cypherlinks use the encoding of 4 sigil bits and 12 hash function bits, then signatures would also need to have 12 hash function bits, thus requiring two bytes as a tag and leaving 4 bits unused. That's totally ok, just mentioning it for completeness.

continued in next post...

@aljoscha %zdZgOCh5yDq60+PAr0zbSbYc8GFfY+7cbfG6eKIMy18=.sha256

Summary

I propose the following metadata format: <binary backlink><hsdt content><binary signature>. It is slightly more efficient and simpler than the current one.

On the surface, these proposals involve some radical changes to the very core of ssb. But keep in mind that this only changes the "implementation details" of the protocol, not its "api". Some of these ideas shift complexity to other parts of the ssb stack, e.g. by giving the database the responsibility to track feed ids and sequence numbers. Still, I think this is more than just moving complexity around. Keeping the core as simple as possible is more valuable than making e.g. the database easier. The database is an actual implementation detail, whereas the message format is not. In true implementation details, iterative improvement is much more realistic than in the core protocol. The more complexity is moved into these details, the more empowered we are to tackle the challenges.

And aside from this lofty, abstract bullshit, there are real performance gains. Just ballpark the current size of the metadata and compare it to the size of a single binary-encoded backlink and a binary-encoded signature. That might be a decimal order of magnitude. There are a lot of tiny messages (e.g. votes aka "likes", contact, etc.) where the metadata far outweighs the content. So with these changes, I get my conceptual purity, and @Dominic gets the performance. Everybody wins. Well, except the people who need to adapt the database and the rpc protocols. But that might include me, as I really want to see these changes incorporated.

I'm deeply excited that a new hash-suffix gives us the opportunity to introduce all these changes in a backwards-compatible way.


A quick addendum regarding ssb's rpc protocol: Just like messages, muxrpc has both metadata of a fixed scheme, and free-form content. The metadata could be optimized as discussed in this post, the actual data could use hsdt. I'd prefer to postpone changes to muxrpc right now and focus on messages. I already have sketches for a format #bpmux that improves over packet-stream (more space-efficient, easier to encode/decode, built-in backpressure, more consistent, clearer separation between multiplexing and rpc layer, no type information, optional datagram based implementation with truly independent backpressure of substreams). Since packet-stream and muxrpc are somewhat intermingled (e.g. backpressure was added as part of muxrpc, not packet-stream), it makes sense to tackle both of these at once. But I won't open this discussion right now, I'm already spamming the network enough with #hsdt. And the current discussion is already taking too much of my time anyways.

User has chosen not to be hosted publicly
User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %lQTKP0cSpy6kDDRJ08c64qDvzrrd8zso96vkPHVkjRU=.sha256

@moid Just as before, by transmitting the sequence number. Server implementations would still keep track of sequence numbers and use them as usual, they just would not be included in every single message. Of course, an implementation is free to add the sequence number to every database entry. But it would not need to be transmitted as part of the message, and it would not be involved in the hash computation.

@Anders %d6DTEU92CiNVw8nf7OcThnbRFDTdgb6SSUcZmEkeIuQ=.sha256

I've enjoyed reading these posts. Thanks for putting them together.

Let's say we wanted to do something like the above, how would you roll it out?

@Dominic %wTjDyP6RnllwKgxy0azWu+qHqUvZea9Qo75PiXlCrn8=.sha256

notes:

type is not polite, it's mandatory. I just made a PR for this

cypherlink is a cryptographic token that refers to another object. hashes, public keys, bitcoin addresses all count as cypherlinks, but message cyphertexts and signatures do not.

Definitely agree on binary formats for cryptographic data (links, signatures, cyphertext)


I think removing the sequence number forces the hand of implementers a bit too hard.


Also, timestamps: The application layer loves timestamps. Everyone is requesting indexes for that. As protocol designers, the application developers are our users, we need to keep them happy if we want people to use our protocols. If you remove the timestamps they'll just add them to the content, anyway, so I don't think you'll win this argument. If someone starts spraying randomly timed messages it will tend to behave weirdly in applications, and that will have social ramifications.


Anyway, while these might be good ideas, I think it's going too far to introduce these all at once. Just rolling out the binary format will be hard enough.

@aljoscha %C86nXLzeNO1oSYyIiupm5hSejDByJK7Y9OTKpYcPXlQ=.sha256

type is not polite, it's mandatory. I just made a PR for this

So conceptually, type is part of the metadata, it is just hidden inside the content objects. In case of a metadata redesign, I'd want to move it to the actual metadata. That's clearer and also saves a few bytes (because there's no need for a string key "type").

I think removing the sequence number forces the hand of implementers a bit too hard.

Doesn't it free the hands of implementers? Right now, there is an enforced way to deal with them; the change would allow them to handle them however they want. Maybe handling message-embedded sequence numbers is done implicitly when using a schemaless database, but that's not the only choice. Since metadata is actually structured, it would make sense to store it in a relational database. And in that case, implementers need to deal with it anyways.

Having to roll your own sequence number memoization can't be that complicated. Saving a tiny amount of work for a subset of the implementors (of which there won't be that many) can't be worth complicating the protocol. In some sense, explicitly transmitted timestamps sacrifice user experience (in the form of performance) for developer convenience. Which is not an approach to software engineering I support.

Fyi, to me personally keeping the core as minimal as possible is a stronger motivation than the performance. But I can't really support the desire for minimalism with concrete arguments, just a general sense that a minimal, simple core leads to better results in general. Performance improvements are just a symptom of that.

Also, timestamps: The application layer loves timestamps. Everyone is requesting indexes for that. As protocol designers, the application developers are our users, we need to keep them happy if we want people to use our protocols.

Distributed timestamps are broken, that's a fact. And the current design encourages devs to use them anyways. As protocol designers, we have a responsibility to keep devs from doing broken things. It's like telling your kids that playing with firecrackers is unsafe, and then putting them unsupervised into a room full of firecrackers and lighters. Ok, that's a ridiculous metaphor, but you get my point?

If you remove the timestamps they'll just add them to the content, anyway, so I don't think you'll win this argument.

That's what I'd consider winning the argument (except I don't think of this whole thing as something to "win"). Timestamps are a nice, cosmetic thing to display, humans like them. Go ahead and add them to your messages. Go through the trouble and create flumeviews if you like. And then deal with the problems when things inevitably break. But you can't go and blame the protocol designers.

If someone starts spraying randomly timed messages it will tend to be behave weirdly in applications, and that will have social rammifications.

Yeah, that's a real interesting thing about ssb. Some people would start blocking you if you randomized them, others might understand the privacy concerns and actively seek out clients/apps that use timestamps sensibly (i.e. as nice-to-have, completely unreliable pieces of information that may or may not be true). If timestamps are not mandatory, then these things sort themselves out through social dynamics, and each group can have its preferred solution. Making them mandatory removes this agency from the users; everyone just has to pay the cost for them.

Timestamp randomization is fascinating, because it may look like malicious action to some (I actually toyed with the idea of doing it, just to make client devs deal with it, which is basically altruistically motivated malicious behavior), whereas it's an anonymity-preserving necessity for others.

Who knows what would happen if all the timestamp-opponents (and I'm definitely not the only one) coordinated to simultaneously start randomizing. Would that divide scuttleville, or would the popular clients adapt? I will not try to start this experiment, but I would definitely be in the randomization camp.

Anyway, while these might be good ideas, I think it's going too far introduce these all at once. Just rolling out the binary format will be hard enough.

I fear that the introduction of the binary format might be the best shot we get at doing these changes. Adding a new multihash is not something we should do lightly, but these changes require a new multihash. Hsdt requires a new multihash as well. I'd rather not rush out the binary format and then never get the momentum for the improved metadata encoding, but instead take some time and clean up the whole hashing-related business, which involves both message data and meta data.

@Dominic %PR0VVV3dD0Q99s4KUxD5CZPwm6d66V/SzvLGOR1Qg2A=.sha256

So conceptually, type is part of the metadata, it is just hidden inside the content objects. In case of a metadata redesign, I'd want to move it to the actual metadata.

yes, but I don't want to have something like that outside the cyphertext on encrypted messages.

In some sense, explicitly transmitted timestamps sacrifice user experience (in the form of performance) for developer convenience

Apart from bandwidth, I doubt this would make a measurable difference to performance, once it's combined into the entire application. And how much % smaller does it make the bandwidth?


The sorts of changes you are proposing would take a lot more groundwork to roll out than just a new encoding format. perfect is the enemy of good. good is still better than bad. If we can deploy an improved protocol, we can deploy another one. Given there are a lot of aspects of ssb I could be working on improving, and I can't work on all of them, I'd rather focus on the most rewarding things and balance that with how much work it will be. This is too much for now.

@aljoscha %CyL8VuY12NAk3fqJ2jMpuegRUlvMOLvnszczYS0KaBw=.sha256

@Dominic

This is too much for now.

Ok, that's a clear statement and I can work with that. Here is a list of facets of an hsdt rollout, packed into individually implementable chunks. We should be able to pick any subset of these and bundle it into a single rollout. I do not include any semi-canonical stuff, because that can be added at any point, independent from our choices here (and I'll be happy to contribute to the design).

  • introduce a new hash suffix for messages encoded in canonical minimum viable hsdt (note: not the final spec!)
  • optionally remove byte strings from hsdt, if supporting them is too much work in the database
  • optionally remove Infinity, -Infinity, NaN and/or -0 from hsdt, if supporting them is too much work in the database
  • extend the logical data model of mvhsdt with any subset of the following data types (some of these might imply database changes)
    • various fixed-size integers
    • specialized type for cypherlinks
    • sets
    • maps with arbitrary object keys (requires ssb-client to represent them as Maps, not objects, this is a breaking change)
  • specify a fixed order of metadata for hsdt messages, encode them efficiently (this includes a binary encoding of cypherlinks in metadata)
  • drop the hash metadata
  • drop the timestamp metadata
  • drop the author and sequence number metadata
  • prefix each message content with a string that specifies the type (thus it can be encrypted, yet does not need to be encoded as an ordinary map entry)

If we can deploy an improved protocol, we can deploy another one.

I'll take you at your word.

@aljoscha %CaCuBNYvDYasBowf5ZjTsCT8lW1ycMwfFsaqrTF+SzA=.sha256

Just to clarify things, not as a serious argument to include this in the first rollout (and honestly, I can totally accept if this never makes it into ssb):

In some sense, explicitly transmitted timestamps sacrifice user experience (in the form of performance) for developer convenience

Apart from bandwidth, I doubt this would make a measurable difference to performance, once it's combined into the entire application. And how much % smaller does it make the bandwidth?

"timestamps" should have been "sequence numbers" in that quote, stupid brain. I agree that the performance argument is somewhat ridiculous for sequence numbers (in variable length encoding, two bytes will suffice for the vast majority of feeds), but sequence numbers would be removed together with feed ids, and those do make up a relevant percentage of the metadata. But even then, it is not the performance, but the conceptual minimality I'm drawn towards.

@Dominic %GN1uEDcWiGqbXBOwMzJ0zxNzybSrT3DoXN8+p1PJhzg=.sha256

An idea occurred to me for a compromise: include the extra metadata in the signature but not the encoding. Then, when requesting messages, indicate whether extra metadata can be dropped. If supported by the implementations, the extra fields can be dropped from the transmission, but whether they are dropped from the storage is up to the implementation. todo: calculate how much difference this would make with current data.

@aljoscha %gHwVDcT2Soc5SnsnBx/4hn6idf7iZVJOALJr5LXLyV8=.sha256

Honestly, that seems like a lot of unnecessary complexity, I'd rather go with either always or never including it.

What does an implementation have to do if we don't include sequence number and feed id in both the hash-encoding and the transport encoding? The simplest scheme that comes to my mind (and that incidentally would require few changes to sbot) is to look up the previous message in the database. The db stores the feed id and sequence number of the previous message, so we get them. Add the feed id to the metadata, increment the sequence number and add it to the metadata, then write the message with this additional information to the database.

You probably want to add in some caching for performance, but effectively all this does is a reduction.
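
In code, that reduction could look roughly like this (purely illustrative; `db` and the map stand in for whatever storage and caching an implementation actually uses):

```ts
// Sketch of annotating incoming messages with a locally computed sequence
// number instead of trusting one from the wire.
interface StoredMessage { key: string; author: string; seq: number; raw: Buffer }

const latestSeq = new Map<string, number>(); // feed id -> highest seq seen

function ingest(db: Map<string, StoredMessage>, key: string, author: string, raw: Buffer) {
  const prev = latestSeq.get(author) ?? 0;
  const seq = prev + 1; // the "reduction": count, don't transmit
  latestSeq.set(author, seq);
  db.set(key, { key, author, seq, raw });
}
```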

Whoever is crazy enough to reimplement ssb, they will:

  • implement a handshake and a streaming cypher
  • implement a multiplexing system and its codec
  • decode all the multiformats
  • set up a database
  • write a gossip scheduler
  • recognize and produce a bunch of rpcs

I don't think adding a reduce operation to that list will make a relevant difference in development effort.

@aljoscha %QVmprjsKV8vT6tCpCeZ18luAtSxVJU5gwe2MOHyfc9s=.sha256

Tiny update on the metadata: I'm currently leaning towards prefixing the metadata with a varint id. It would initially be used to distinguish between encrypted messages and unencrypted messages. At some later point, it could indicate messages where the metadata does not contain timestamp/sequence-number/feed-id (otherwise we'd have to conflate that with cypherlink encoding indicators, even though this is not really an issue of content encoding), or encrypted messages that use an alternative encryption scheme (here the id gives us future-proofness), and who knows what other types of messages could one day become necessary.
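
A sketch of what reading such a tag could look like (the tag values and names are invented):

```ts
// Sketch: a leading varint tag that says what kind of message follows.
// The concrete values are invented for illustration.
enum MessageKind {
  Plaintext = 0,
  PrivateBox = 1,
  // future values: alternative encryption schemes, metadata variants, ...
}

function readTag(buf: Buffer): { kind: MessageKind; rest: Buffer } {
  // single-byte case of a varint; multi-byte tags would work the same way
  if (buf[0] & 0x80) throw new Error("multi-byte tags not handled in this sketch");
  const kind = (buf[0] & 0x7f) as MessageKind;
  return { kind, rest: buf.subarray(1) };
}
```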

@Dominic %8L/kdTz2HcTrkN+C07YKRTvwhcrq1YFef/+xSUJ7Kr4=.sha256

Oh I should add a bit more here: I'm far from convinced that removing sequence ids is a good idea.
Counting them yourself, sure it's possible - but then you have to receive every message in order to know that it is correct. If out-of-order messages also send the sequence, it's now an untrusted number... thinking through all the implications of that hurts my head. Also, with signed sequence numbers you can immediately know the total order of any two feed messages, but without, you'd have to traverse the previous links until you get one or the other (since timestamps are non-monotonic now) ... that's possibly thousands of messages.

The sequence number is essential for replication - so, just include it in the signed data!

I just think there are far more useful things we could be working on.

@aljoscha %yCfMJcM19PEN2vLdMxo5cDGFOwhqZlhLiybwBV7wdPU=.sha256

Also, with signed sequence numbers you can immediately know the total order of any two feed messages, but without, you'd have to traverse the previous links until you get one or the other (since timestamps are non-monotonic now) ... that's possibly thousands of messages.

Unless you save them in the database, which makes it an O(1) lookup. I'm not advocating for implementations to stop using computed sequence numbers internally, that would be inefficient madness. Sbot could simply annotate each received message with the sequence number, and store them together in the db. Basically, what I'm saying here: memoization completely solves this, and is easy to implement.

If out-of-order messages also send the sequence, it's now an untrusted number... thinking through all the implications of that hurts my head.

Now, this is a counterargument I can take seriously, I'm looking forward to the head-hurting. OOO right now means putting trust about the validity of the ooo-message into the feed from which you took the message's hash. Since that currently covers the sequence number, you can use it. If the hash did not cover the sequence number, you'd have to trust the peer that answered the ooo request. So this is something where not hashing and signing sequence numbers is objectively worse.

What does the sequence number of an ooo-message give us? On its own, not much, just a lower bound on the size of the message's feed, which is pretty useless. It becomes more interesting when multiple ooo messages from the same author have been received. We can tell causal order between those messages based on sequence numbers. I do not and can not know whether having that information might become necessary at some point in the future, or if not necessary then at least useful. But for a future-proof protocol, that just screams "Stop optimizing and leave this in!".

Feed ids are also interesting, the peer who answered the ooo request could make up any identity as the message's author. That doesn't sound good...

Thank you @Dominic for raising that point, this did change my mind. I may be stubborn, but not immune to good arguments. We just differ in what kind of arguments we consider "good" :green_heart:

I am very tempted to put seqnum and feed id at the end of the metadata, so that non-ooo replication can easily choose not to send them over the wire, as they are still redundant in that setting. I haven't made up my mind whether I'd want to make that mandatory (memoization really is trivial compared to all the other hoops to go through in order to write an ssb server). But it is worth keeping in mind: saving these bytes (the feed id in particular) would make a relevant impact.

@aljoscha %dn7ucXaeyjozlAE2Ma5gsPd2mpi8an4hmgGGk83QaqE=.sha256

Getting out two more thoughts on metadata: One on the type (as in "string", "map", etc.) of message content, and one on the actual implementation of optional timestamps. Both are somewhat connected by their usage of a varint tag for metadata.


As alluded to here, conceptually there really shouldn't be anything forcing the content of a message to be a map. Nothing makes maps inherently more special than other json/hsdt values. But with the current ssb design, there are two aspects that restrict the type of the content field:

  1. private messages always use a string ending in ".box".
  2. Each message must have a type, which is currently stored in the content - that's the thing that forces content to be a map for unencrypted messages. It also forces the encrypted string of a private message to decode into such a map.

The first point is a somewhat hacky implementation with some unintended consequences; a metadata varint tag can clean that one up. If the tag indicates that the message is encrypted, it does not have to reuse the same metadata structure as unencrypted messages: it can simply include the cyphertext (or its hash if we implement offchain-content) as a binary string.

The more interesting case is the type field. Since each message has a type, it is conceptually part of the metadata. But the type should be encrypted for private messages. But those only encrypt content, not metadata, thus the type was moved into the content. Which - again - is more of a hacky shortcut than a clean solution. But again, tagged private messages allow us to change that. Encrypted private messages can encrypt the type and yet keep it in the metadata. And with that ability, regular messages don't need to put it into the content; instead it can be moved into the regular metadata (note to future me: argue to make type a binary string and to increase the size limit to be able to contain 512 bit hash digests).

And with those changes, everything is disentangled, and the content is not arbitrarily restrained to be a map.


The other point is timestamps, in particular their optionality and their privacy. This part of the post makes the assumption that optional timestamps are allowed.

As I already wrote above, private messages do not encrypt metadata. And even though ssb explicitly hides the number and identities of the recipients of a private message, it happily attaches plaintext timestamps. That's kinda weird. By the same mechanism that can encrypt the message type, I propose to also encrypt the timestamp. A non-recipient of such a message could not tell its timestamp. Currently, this is a problem, because ssb assumes that every message has a well-known timestamp. But if we are in a setting where timestamps are optional and private messages encrypt them, non-recipients can treat all private messages as timestamp-less and no further changes are required to any code paths.

Now for the concrete representation of messages without timestamps: I'd simply use the message tag varint again. Messages without timestamps then simply don't include one, saving 8 bytes. Even if an actual API like ssb-client might want to signal absence of timestamps via a garbage value, there's no need to send any garbage values over the wire.
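
One possible shape for that, continuing the invented tag idea from above (a flag bit signalling timestamp presence, purely illustrative):

```ts
// Sketch: signal the presence of a timestamp with one bit of the tag
// varint, so absent timestamps cost zero bytes on the wire. Invented layout.
const FLAG_HAS_TIMESTAMP = 0x2;

function decodeTimestamp(tag: number, buf: Buffer, offset: number) {
  if ((tag & FLAG_HAS_TIMESTAMP) === 0) {
    return { timestamp: undefined, offset }; // nothing was sent
  }
  return { timestamp: buf.readDoubleBE(offset), offset: offset + 8 };
}
```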

Two more quick, unrefined thoughts on these timestamps:

  • should an encrypted message hide whether it is timestamped or not? If that is signalled by the tag byte, then that information is public. I don't really see how that would hurt though.
  • users should be able to set whether they want to attach timestamps to their messages or not. But in some cases, devs might want to have full control over timestamps, so they'd move them inside the message content. If they do so, is it a good idea to allow them to signal that no metadata timestamp should be included? Saves space, and more importantly will prevent a lot of ambiguity and weird user interface behavior if the metadata timestamp and the internal timestamp differ.

@mikey %mpHclG/lY+BKvl0wnI0c+cHlJZXsi0SjY8RleEgbu1o=.sha256

similar to bringing the type field into the top-level metadata, would it make sense to bring the "tangle" fields (currently root and branch) into the top-level metadata? said another way, are "tangles" a fundamental concept that applies to every message? as our current design means it's up to application authors to enforce the desired semantics, which hasn't always happened, and doesn't include message types beyond post.

@Dominic %F3vllZjj1vrA6NOxxv6LWik/C67Jhty275Kj/5Ofh6Q=.sha256

I think a maximum of fields should be inside the encrypted portion, and that portion should be as uniform as possible. Having two separate encrypted portions would leak metadata, in that the length of the type field could imply what sort of value it was. Same goes for not having branch/root outside the content.

We can't validate what format the encrypted portion of a message is, because we might not be able to decrypt it, so really, none of these restrictions are actually enforced with private messages.


To avoid YAGNI it helps a lot to describe a plausible application that is enabled by a feature, not just propose a feature in abstract. If you describe some feature in abstract, I'm probably just gonna ask you to describe the application, so I'll appreciate when someone does this because it will save me time. This goes for everyone, and any kind of software.

@mikey %+nTQ9Gjg56IoGN+xiFRddEmZKwWPUlKN3KuKsRsw1cY=.sha256

@dominic: i'm not sure which comment yours was in reply to. with regards to moving the root and branch fields to the metadata, i think the plausible application in question is the current Scuttlebutt stack as developed. as in, would our current use cases be better served if we made this change? i'm not anticipating future problems, i'm focused on the current problems where our message semantics are inconsistent or incomplete because developers fall into the pit of despair.

@aljoscha %LnYxFFRtIlgNMPN8SjIY1nibQpf9jgsPeLG7Nf98jpc=.sha256

@Dominic

I think a maximum of fields should be inside the encrypted portion, and that portion should be as uniform as possible. having two separate encrypted portions would leak metadata, in that the length of the type field could imply what sort value it was.

Valid point, but not a show-stopper. The encrypted message can simply be created by concatenating the length-prefixed type and the content, then encrypting the whole thing. Same for the timestamp.
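
Concretely, something like this sketch (where `box` is just a stand-in for whatever secret-box primitive actually gets used):

```ts
// Sketch: put the type and the content into a single plaintext, so only one
// ciphertext exists and the type's length is not leaked separately.
function encryptPrivate(
  type: Buffer,
  content: Buffer,
  box: (plaintext: Buffer) => Buffer, // placeholder for the real encryption
): Buffer {
  const typeLen = Buffer.alloc(2);
  typeLen.writeUInt16BE(type.length); // length-prefix the type
  return box(Buffer.concat([typeLen, type, content]));
}
```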

To be honest, I don't have a concrete use case for non-map content. But allowing it strictly expands the set of possible messages, without any real drawback. And viewed from another angle: As we change the metadata anyways, we have full, unconstrained freedom to choose any format we like. How could we possibly justify emulating a hacky and entangled legacy solution, if there is a cleaner one?

If you insist on a use case: The closest thing to a blob that is still verified to have been uploaded by a certain id at a certain point in the causal order, is a message whose content is solely a binary string.

@dinosaur

I'm not so sure about protocol awareness of tangles. Not every message really needs these, but everyone who does need them has all the tools to build them. Also, the desire for a tight partial order is opinionated, not in itself better or worse. And due to arbitrary network partitions (the most common case being an offline peer writing data), branches/tangles can never give us any quality guarantees anyways.

The thing I found most interesting about tangles is the conflict resolution rules. But those differ from use case to use case, so that won't benefit from being part of the core.

@Anders %694uh+U/lx/LX9gb8MAzTvbsv7iF0L3cj6M71nL/AAc=.sha256

If we are going that way, could we consider top-root or something similar, so that one can easily extract a whole tree instead of doing nested queries. Think nested comments.

@Dominic %LqNgWFBetS4q3tMtD/FNCGf1wUC0fBTIq1xK6uxBuC8=.sha256

@aljoscha I think it would be more elegant to just drop the enforcement of the type field, if there was really a need for messages with absolutely minimal overhead.

@dinosaur but if you want to make current applications easier to implement, wouldn't not changing anything be easier? Otherwise you have to deal with two ways of doing it, the worse way and the better way. It seems to me that a far better response would be to just create a library that makes it easy to handle a particular nuance (such as the branch field). Helper libraries also don't require consensus from protocol implementors!

@aljoscha %LDtyPsdzt0QMFXL9fPRhBNzMF8rqGmLaUc196cjluu4=.sha256

@Dominic

@aljoscha I think it would be more elegant to just drop the enforcement of the type field, if there was really a need for messages with absolutely minimal overhead.

But why? What is more elegant about forcing somebody who wants to send a single value to wrap it in a completely useless map? Why not do the fully general solution that also moves the type where it belongs - into the metadata?

Additionally, optional "type" would be a breaking change.

@Dominic %195n5UHHG961upVVXkDNAPMuG/We7ESmAbxff+aggX4=.sha256

What is more elegant about forcing somebody who wants to send a single value to wrap it in a completely useless map?

sorry, I didn't mean that, I meant: just make content a binary object, with no type field.
The type field doesn't really mean much anyway. the rest of the message can still be anything.

@aljoscha %1WWeVo4Ji3VujjzGPa4VeBAyxlXtJTPOp/qhIFhbsr4=.sha256

@Dominic Ack. But we still want to keep type in the metadata, right? It may not mean much, but everything fundamentally relies on interpreting it.

Would you be open to changing the "type" information itself to be an arbitrary byte string rather than a utf8 string? That's trouble for the js API, but it allows using hash digests as types without any annoying encoding. The js API would have to use a buffer for the type, so it couldn't be a regular key on the content object anymore. That's a breaking change too, but it would reinforce that type is metadata, not content. Also, are there any reasonable arguments against moving the size to 512 raw bytes, to allow larger hash digests as types? Or maybe even a few bytes more, so that such a scheme could use multihashes?


cc #ssb-clients

I don't think we will get around a breaking js api change anyways, so we might as well try to put as many of those changes as we can into a single major update (I'm open to splitting those up depending on how we end up rolling out the underlying changes):

  • move type out of the content, make it a buffer in the metadata, with some size cap (say 512 + 8 = 520 bytes)
  • allow the timestamp metadata to be missing (if we do optional timestamps)
  • allow the content itself to be missing (#offchain-content)
  • We'll have to use some sort of non-enumerable properties to introduce the new hsdt types, otherwise they would be indistinguishable from regular objects (see the sketch below). Not necessarily breaking, but clients need to be aware of those to not accidentally treat them like normal objects.
  • change to Object.getOwnPropertyNames (or similar) for hash computation (I'm unsure how doable this one is, also not sure if this really is breaking to the client api, but it does mean that some things might get ignored that previously affected the message's hash)

I'm probably forgetting some more breaking js-api changes.
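
For the non-enumerable property idea from the list above, a quick illustration (the property name "_hsdtType" is made up for this example):

```ts
// Illustration: mark a value with a non-enumerable tag so it can be told
// apart from a plain object without showing up in normal enumeration.
const cypherlink = { digest: Buffer.alloc(32) };
Object.defineProperty(cypherlink, "_hsdtType", {
  value: "cypherlink",
  enumerable: false, // invisible to for...in, Object.keys, JSON.stringify
});

console.log(Object.keys(cypherlink));                // ["digest"]
console.log(Object.getOwnPropertyNames(cypherlink)); // ["digest", "_hsdtType"]
```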

@mikey %fKbGBW8xKB410ZVWtnIihtA0FnY9YU+Zq+0ze6uw/u0=.sha256

hi @Aljoscha: regarding message types, with our current approach where message content is a map with a type field, this means the content is self-describing, as in the content includes information about the type. compare to blobs, which are not self-describing: we embed information in the reference links about the media type, size, etc, which means if you have a blob you don't always know the type (although as with file types on linux we can usually compute the type from magic numbers in the content). in keeping with your desire to remove the map for message content, maybe message content is a tuple with (type, content), which could be represented similar to multiformats. not sure if this is a worthwhile approach but came up in my mind so thought i'd share.

@aljoscha %lyLRaTtjLfmI14jYP04UHauw1zDxLAEznfj8paXt3SI=.sha256

@dinosaur I'm not quite sure how this is any different from moving the type into the metadata. A message cypherlink currently points to a tuple of (feed id, seqnum, ..., content) (pretend I filled in all the metadata and the order magically matches). By moving the type into the metadata, we get a tuple (feed id, seqnum, ..., type, content), which is exactly the same as (feed id, seqnum, ..., (type, content)).

User has not chosen to be hosted publicly
@Dominic %+gAwfnw6wP/xq9j0/fC9nWeBxFZxxQSDH1x1s/SpFvY=.sha256

@cft I must disagree. Well, you could include feed,seq as a hint, but you should always include the message hash. The message hash is much more immutable than a signed message. It's possible the private key you are responding to is stolen or leaked, so it's possible that in the future a new message with that sequence exists. Or even, it seems more likely that the asymmetric signature algorithm gets compromised (quantum computers), but hashes should still work.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@aljoscha %QdGyujXkkJuIYSJT100htOZLV5dKnTIRvaaWUJ9ywW8=.sha256

@cft I can see the point in including the seqnum, but I'm still not sold on including the author. The author is already implicit in the signature. When you request a message by hash, the peer answering the request can just tell you the author, and you verify it via the signature. Of course they could lie to you and give you a bogus author, but since this is immediately detected by the failed signature verification, that's just as if they never answered your request in the first place.
