@aljoscha %QOTRH1C4E0XrpawMrw3u4m6SwGC6GpZ3w9NyP7gJa0I=.sha256

Protocol Changes - A Todo List

For general transparency, added resilience against falling coconuts, and admittedly as a defensive reaction to this exchange, I typed up my current set of scribbled notes on what still needs doing to upgrade the ssb protocol. It's very rough, but I'd rather start working on the entries themselves than spend more time packaging them. There are a few open questions in there, a few more things I have yet to properly write about, and a bunch of preferred outcomes of mine.

Metadata

Timestamps

  • keep timestamps as they are, make them optional, or drop them?
  • timestamps as floats or ints?
  • how to encode optional timestamps in the binary metadata of hsdt messages
    • omit, or special value? Depends on other metadata considerations (see the sketch below)
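
A hedged sketch of those two options, in typescript for concreteness; the tag bytes, the sentinel value, and the big-endian float layout are all assumptions for illustration, not anything from a spec.

```ts
// Option A: explicit presence tag, timestamp bytes only written when set.
function encodeTimestampTagged(ts: number | null): Uint8Array {
  if (ts === null) return Uint8Array.of(0x00);       // 0x00 = "no timestamp"
  const out = new Uint8Array(9);
  out[0] = 0x01;                                     // 0x01 = "timestamp follows"
  new DataView(out.buffer).setFloat64(1, ts, false); // 8 bytes, big-endian
  return out;
}

// Option B: fixed-width field with a reserved sentinel meaning "unset".
const TIMESTAMP_UNSET = 0; // arbitrary but fixed sentinel, any agreed value works
function encodeTimestampSentinel(ts: number | null): Uint8Array {
  const out = new Uint8Array(8);
  new DataView(out.buffer).setFloat64(0, ts ?? TIMESTAMP_UNSET, false);
  return out;
}
```

Option A costs a tag byte but keeps every float value usable; option B stays fixed-width, which may matter for the other metadata considerations.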

Other Metadata

  • more in-depth writeup on possible paths forward (dedicated post)
    • figure out cost/benefit ratios for the things we could roll out in one go, suggest a reasonable package, probably keeping feed id and sequence number
  • figure out how to get type into the metadata, but still encryptable
    • suggest increasing size of type field so that it can hold 512 bit hash digests once 256 bit hashes become broken
  • use the changes to set a new message size limit, or at least base its calculation on the actual number of bytes
    • including or excluding metadata?
  • content as "blob" (dedicated post; see the metadata sketch after this list)
    • metadata could include hash of data, not the data itself
    • personal blockchain then consists purely of metadata
    • allows deleting and blocking specific messages without blocking a whole feed (GDPR!)
    • replication rpcs can concatenate metadata and actual data, no need for additional blob roundtrips
    • how does this work with the type field?
    • how does this interact with encrypted messages?
    • drawbacks:
      • additional hash verification
      • takes some work to implement
        • changes rpcs
        • changes db
        • db can maybe "cheat" and store things as usual, converting as needed?
        • additional failure mode in the client api: can have a message's metadata but not its content
  • work out binary encodings for cypherlinks, hashes and signatures
    • could use ipfs multihashes, but should use a different type table
    • could roll our own where the hash type implies the length, and only future hash types must encode length explicitly (see the decoding sketch after this list)
      • can preallocate ids for as-of-yet undetermined hash functions of a specific digest size
      • not a lot of work
  • once all this is decided upon, find a binary encoding
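
To make the "content as blob" idea above a bit more concrete, here is a hedged sketch of a metadata-only log entry; every field name is an assumption for illustration, not a proposed format.

```ts
// Hypothetical shape of an off-chain-content entry: the signed chain carries
// only a hash of the content, and the content bytes travel (and can be
// deleted) separately, like a blob.
interface OffchainEntry {
  author: string;          // feed id
  sequence: number;        // position in the author's log
  previous: string | null; // hash of the previous entry's metadata
  contentHash: string;     // hash of the content "blob", not the content itself
  contentLength: number;   // lets peers budget a transfer without the content
  signature: string;       // covers the metadata, and thereby pins the content hash
}

// A replication rpc could ship metadata and content together in one response,
// while a client api has to surface the "metadata present, content deleted" case.
type ReplicatedEntry = { meta: OffchainEntry; content?: Uint8Array };
```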
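
And a hedged sketch of the "hash type implies the length" idea; the type table below is invented for illustration and is neither the multihash table nor any agreed ssb one.

```ts
// The leading byte names the primitive; for known types the digest length is
// implied rather than encoded on the wire, with ids preallocated for future
// digest sizes.
const IMPLIED_LENGTH: Record<number, number> = {
  0x01: 32, // e.g. a 256-bit message hash (placeholder assignment)
  0x02: 32, // e.g. an ed25519 feed key    (placeholder assignment)
  0x03: 64, // preallocated for a future 512-bit digest
};

function decodeCypherlink(bytes: Uint8Array): { type: number; digest: Uint8Array } {
  if (bytes.length === 0) throw new Error("empty cypherlink");
  const type = bytes[0];
  const len: number | undefined = IMPLIED_LENGTH[type];
  if (len === undefined) {
    // Types outside the table would have to carry an explicit length instead.
    throw new Error(`unknown cypherlink type 0x${type.toString(16)}`);
  }
  if (bytes.length < 1 + len) throw new Error("truncated cypherlink");
  return { type, digest: bytes.subarray(1, 1 + len) };
}
```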

Data

HSDT

  • add integers (yes)? add cypherlinks (yes maybe)? add sets (no maybe)? and arbitrary-key maps (no maybe for now)? (dedicated post)
  • how can new data types be introduced to the ssb-client js api? (dedicated post)
    • simply break things?
    • preferred: use non-enumerable properties, maybe toJSON if needed - this is solvable (see the sketch after this list)
    • migrate stringly-typed cypherlinks to hsdt cypherlinks?
  • find out to what extent the new data types require changes to sbot's database
    • Infinity, -Infinity, -0 and NaN are probably fine?
    • what about integers?
    • what about cypherlinks?
  • find out whether some db data migration is needed when rolling out hsdt
  • fix the last errors in the testing-area implementation
  • update the implementation with all additional datatypes
  • prepare a body of afl-generated test cases for other implementations to test against
  • figure out the conflict between keeping the protocol extensible by ignoring certain data vs ensuring canonicity
  • encode collection sizes in bytes or items? (dedicated post)
    • leaning towards items, esp. since bytes don't really make decoding easier. When decoding an array, you'd like to know how many items to allocate, not their total size.
      • considerations: async parsing, malicious size indication, highly nested input, when does encoding and decoding happen, compactness of data, ease of implementation, efficiency of implementation (benchmarks, ask arj or dominic for json data sets and convert json to cbor?), transport format framing data, does the binary metadata include the content size in bytes, streaming encoding and decoding
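
A hedged sketch of the non-enumerable-property idea; the `rootLink` name and the decoded shape are made up for illustration, the mechanism itself is just standard javascript.

```ts
// Attach the typed, decoded cypherlink to the message content as a
// non-enumerable property, so JSON.stringify, Object.keys and for...in still
// see only the legacy stringly-typed field that existing apps expect.
const content: Record<string, unknown> = {
  type: "post",
  root: "%<some message id>.sha256", // legacy string cypherlink
};

Object.defineProperty(content, "rootLink", {
  value: { algo: "sha256", digest: new Uint8Array(32) }, // hypothetical decoded form
  enumerable: false,
  writable: false,
});

console.log(Object.keys(content));    // ["type", "root"], unchanged for old code
console.log(JSON.stringify(content)); // {"type":"post","root":"%<some message id>.sha256"}
```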

Semicanonical Json Replacement

  • all of the above again, with fewer decisions but more ugly workarounds
    • not looking forward to this, but open to helping dominic with the design

RPCs

  • read up on the replication rpcs
  • how should hsdt messages be added?
    • should clients be able to only request non-legacy msgs?
  • figure out which rpc updates semicanonical json requires
  • in general, try to make rpcs take encodings as args and work across all of them
  • propose an encoding-agnostic rpc for asking for messages in the format that gets hashed, not the wire format (see the sketch after this list)
    • new server impls only need to implement this and can then replicate the whole set of messages
    • as efficiency demands it, they can update to direct handling
    • nb: with the current hsdt proposal, hash encoding and transport encoding are identical
    • nb: for legacy json messages, hash encoding and transport encoding differ, and obtaining the hash encoding is painful
    • nb: future ssb versions might want to add more efficient transport encodings (there's always the option of compressing the data), so this rpc should work with future formats as well
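
A hedged sketch of how such an rpc could look from the caller's side; the method name, argument names, and types are assumptions, not an existing muxrpc interface.

```ts
// The caller names a feed and a starting sequence number and receives, per
// message, the exact bytes the hash/signature covers plus a tag naming the
// encoding those bytes are in. Future encodings only extend the tag union.
type SigningEncoding = "legacy-json" | "hsdt"; // future formats slot in here

interface HashEncodedMessage {
  encoding: SigningEncoding;
  bytes: Uint8Array; // exactly the bytes the message id and signature cover
}

interface ReplicationApi {
  createHashEncodedStream(args: {
    id: string;     // feed id to replicate
    seq: number;    // first sequence number wanted
    live?: boolean; // keep the stream open for new messages
  }): AsyncIterable<HashEncodedMessage>;
}
```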

Other

  • figure out a whole rollout plan, what needs to be done, and coordinate things so it gets done, do as much as needed myself (while trying to steer clear of the database layer...)
  • figure out how to present a summary of everything for dominic after burning man
    • and write it (dedicated post)
  • write my dream spec for ssb, ignoring backwards compatibility (suggested by @Piet) (dedicated post)
  • write down spec for leylines, similar to ssb, except: (dedicated post)
    • feeds are trees rather than lists, allowing partial subscription and replication
    • untrusted gossiping mode with temporary ids over a fully random overlay
  • write down the full bpmux spec (thinking is already done) (dedicated post)
  • work on personal projects, not just ssb
  • draw more birds
@cryPhone📱 %HidNHo7F810IrmnBFC6twX5/WI+w7sqRHXwc/qsfPGo=.sha256

Great list! Would love to work towards this when and where I can.

I’m sure I missed it among all the other good parts in this: sane identifier alphabet? Like urlsafe base64 or base58 or whatever the cool kids use these days.

@aljoscha %H2DZWnHRyC1IYs/WSsLyOG5aXOdUUQnapcgIZmWRd7Q=.sha256

@cryptixInTheCloud @cryptix "identifier"?

@aljoscha %q+aGwdKVqrMXIVo5EOQlEvU/1M93EVUI+Jo+8v/ILuA=.sha256

New entry: digest this post, learn about the rpcs in question, figure out how this affects everything else.

@aljoscha %XjFVZ8xJgT+L+OwWOWml5mBECjKFQwpfpp1J60ks0g0=.sha256

Current Blockers

To continue moving forward with this effort, there are a few things that need to be resolved, and I can't do that alone.

  1. Can I go ahead and add various fixed-size integers, 32 bit floats, and cypherlinks to hsdt? @Dominic
  2. Can we make timestamps optional, and can the devs of the major clients estimate how much of their code would break? @mix, @Matt McKegg
  3. Can we drop feed id and sequence number from the metadata? I can work with any of "yes", "no" and "negotiate as part of the rpc", just need a decision. @Dominic

With the first one resolved, I can write down the final hsdt spec and start implementing (currently leaning towards doing it in rust rather than C).

With the second and third one resolved, we can figure out a specific metadata format. That will open up a tiny digression about binary representation of cypherlinks, and a larger one about signing just a hash of the non-meta data instead of the full data itself. But both of these can wait until they've become unblocked.

@mix %4VCRtntIopJy4IWwAqLa0AgJGlDZGLb95sQSljfHIp0=.sha256

@Aljoscha I think for (2) we need to organise a meeting about this because it's too hard to coordinate (for me) loosely in a thread.

@cryptix @matt @dominic (and others building clients?) can we do a call?

@Dominic %f00PrGMe66vEB10Yb54q2KDOM5usonuHsVIIY8dZjKE=.sha256

  1. yes. 2. & 3. no, not yet please.

There is enough to do in rolling out a new encoding, but the changes in 2 and 3 will require a lot of changes in both back end implementations and clients. I'd rather get good at rolling out changes than just roll out one massive change with everything in it.

@aljoscha %NmC5u6cvQMvYMSmTkyX/k701TPjhi7ppKu2IGfjGmiM=.sha256

@Dominic Not even optional timestamps, which don't change the backend at all? We can wait until the client devs have tried out how much that would affect them. Pretty please? Well, ok... :broken_heart:

@Dominic %9cmtmRKSORrBj8CHFumcneCl4uJecOSrxmGXsYdl5xk=.sha256

@aljoscha as I've expressed elsewhere, the idea of implementing optional timestamps doesn't actually excite me. It doesn't lead to new capabilities, or make ssb compelling for new use cases. Of all the things we could be working on, I think there are much more interesting priorities. If this was a thing that more people (especially application developers) were requesting, that might be different.

@aljoscha %EEx6vwTiyJNCsrNimMah+MgOFCATrKNb25T/C1DoZRQ=.sha256

Imagine each sentence in this post started with "In my opinion".

Well no, it is not exciting. It is a mundane, boring, technical detail, and it makes things harder for application developers, so they certainly won't line up lobbying for it (except for @cel maybe? :smiley: And @Matt McKegg if I get to be very hopeful?). But ultimately it improves the quality of the protocol.

Allow me another weak attempt at an analogy: Advocating a healthy and balanced diet to your children is neither fun nor flashy. And your children will fight you along the way, because chocolate is so much better in the short term. They will be angry, and they will project that anger at you. But still you keep your stance, because you know that long-term it is inevitably better to not base your diet upon chocolate.

This is how I feel about timestamps. They are super convenient, but also completely unreliable. You can't build a strong foundation upon timestamps, yet the protocol actively encourages doing exactly that. And people with privacy concerns are forced out of the community through social mechanisms. This is not healthy. In my opinion, the protocol devs have a responsibility to do the boring work and make the unpopular but correct choice.

I have yet to hear a single counter-argument that is not "convenience, fun, chocolate".

@andrestaltz %9eFliOYHe+GGaPcuLgYX+NlEnrx0xlyyARIrgqtMiMA=.sha256

I would have put it more softly, but I agree with Aljoscha on timestamps. We need to handle the corner cases on the protocol level, and it's part of the work that helps build a stronger foundation. I believe Dominic is more interested in providing foundation for new kinds of features, and we need those too, but we need the current basics to function better as well.

@Dominic %KFuCEevuM1Vx9HylUsINAkgpt7xsYhCtUyD804rP6DQ=.sha256

@aljoscha okay, well if you can convince the app developers to not use timestamps (which will mean giving them something else) then we can remove timestamps... I'm not against this in principle, I just don't think this is a priority.

To extend your metaphor - yes, I don't wanna tell my kids "no chocolate" (it's not like I'm shovelling it down their throats). I don't mind if they eat chocolate sometimes, as long as they are also playing outdoors, and learning new things. I'm not gonna turn around and say "sorry this is a chocolate free house"... because these aren't actually my children! They can do what they want and I can't make them upgrade sbot if they don't feel like it! This is the whole point of ssb: even as the protocol designer, I can't force things on anyone. If I want people to adopt my changes, I've gotta make them feel good.

So, you are not wrong that timestamps are not something you can fully rely on (and even less so, now that they are not required to be monotonic)... but as a relative priority, there are other things I want the app devs to do more (such as implementing support for multiple identities, and enabling any message type to be private, which are both now available in sbot but not in any clients).

So, short version: you need to convince the app devs about this, not me.

Btw, you know you could just set your timestamps to zero, that doesn't require a protocol change. That wouldn't be as annoying as random timestamps, because at least they are obviously unset. App devs could just use receive time instead, as patchwork does now for timestamps in the future.

@mmckegg %1c7uRu4f4uuzCOAWI6bUTXKJA+xWbNUs0npR0yob1+4=.sha256

I'm not a fan of timestamps for all of the reasons you describe here @Aljoscha.

I would be really keen for some kind of causal public feed ordering. I think this was discussed recently. If there was a way to query the indexes in a causal order, and only use timestamps if available as a display hint, I think things would be much more solid.

Threads make this pretty easy since you can acknowledge other posts as branches, but something I've been thinking about is tagging root posts with similar branches. Maybe the hashes of a few other feeds' latest messages as a reference point. I haven't had a chance to catch up on these discussions, so apologies if I'm repeating something that's already been talked about. And if there was no available reference for a given post, you'd fall back to sync time compared to other posts, and update the order later if more info was discovered.
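
A hedged sketch of the content shape that idea seems to imply; the `anchors` field and its name are invented for illustration.

```ts
// A root post additionally names the latest message ids of a few other feeds
// as causal reference points; readers order by these links first and fall
// back to sync/receive time only when no link path connects two posts.
interface AnchoredRootPost {
  type: "post";
  text: string;
  anchors?: string[]; // latest known message ids of a handful of followed feeds
}
```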

Anyway, you have my support in this matter @Aljoscha.

@aljoscha %JsPQu85HMOkm/m04Fn3b3h93yj88r5YT0DxkNzsAkK4=.sha256

Btw, you know you could just set your timestamps to zero, that doesn't require a protocol change. That wouldn't be as annoying as random timestamps, because at least they are obviously unset.

That is exactly what I mean by optional timestamps: Use some garbage value that does not take up any additional space, but that can (or depending on the client API must) be detected by application devs. Whether that garbage is a magic number or signaled explicitly is fully up to the encoding.
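
A hedged example of what "detected by application devs" could look like, assuming (purely for illustration) that the client API surfaces an unset timestamp as null.

```ts
// With one protocol-level convention, "no claimed timestamp" is detectable by
// machines, and a client can fall back to something honest such as the local
// receive time.
interface MsgValue {
  timestamp: number | null; // null = author claimed no timestamp (assumed convention)
}

function displayTime(value: MsgValue, receivedAt: number): number {
  return value.timestamp ?? receivedAt;
}
```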

because at least they are obviously unset

"Obviously" to humans, not computers. And depending on who you ask, the "obvious" choice is null, undefined, NaN, Infinity, -Infinity, 0, -1, None, or one of a billion other things. All I'm asking for is an arbitrary but binding choice on the protocol level, so that we get this in an interoperable and machine-friendly way.

I'm not gonna turn around and say "sorry this is a chocolate free house"... because these arn't actually my children! They can do what they want and I can't make them upgrade sbot if they don't feel like it! This is the whole point of ssb, even as the protocol designer, I can't force things on anyone.

I agree with this in general, but disagree with specific parts, based on the distinction between metadata and message content. I don't presume any authority for a chocolate ban either - devs can (and in many cases should) include timestamps in message content. That makes them encryptable, keeps them out of replication or ssb-client APIs, and if things blow up, it is their own responsibility. Think of it as responsibility sandboxing if you like.

The "content" part of messages is where we give devs all possible freedom. But everything else is not part of the API. It should be possible to build upon ssb without knowing any of the metadata implementation details, all that matters are the abstract guarantees we give. And the abstract guarantee of reliable timestamps is bogus and should have never been given in the first place.

Ssb encapsulates all the messy details of an eventually consistent, decentralized database. Encapsulation is always a matter of trust: the application devs trust the protocol devs to do a good job with those internals, so they don't need to deal with them themselves. No need for the encapsulated details to "feel good".

@Matt McKegg thanks for voicing your support (same to @andrestaltz). Getting the db to support efficient causal order queries will be some work, but it is possible. Setting the complexity aside, we could quickly hack such queries together by choosing one of the trivial options: either performing a breadth-first search on the cypherlink-graph on each query (slow), or storing the full transitive closure of the cypherlink-graph's edge relation (extremely fast, but takes up lots of space). Either is probably still fine at our current scale, and there are available solutions to get this efficient in the long term.
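
For the first option, a hedged sketch of the breadth-first walk; `getLinkedIds` is a hypothetical lookup standing in for whatever index the db exposes for "message ids this message cypherlinks to".

```ts
// Walk the cypherlink graph from a starting message and collect its causal
// ancestors. Each message is visited once; the start message comes first and
// every other entry appears after at least one message that links to it.
async function causalAncestors(
  startId: string,
  getLinkedIds: (id: string) => Promise<string[]>
): Promise<string[]> {
  const seen = new Set<string>([startId]);
  const queue: string[] = [startId];
  const order: string[] = [];
  while (queue.length > 0) {
    const id = queue.shift()!;
    order.push(id);
    for (const linked of await getLinkedIds(id)) {
      if (!seen.has(linked)) {
        seen.add(linked);
        queue.push(linked);
      }
    }
  }
  return order;
}
```

The transitive-closure option trades this per-query walk for a precomputed edge table, which is why it is fast but space-hungry.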
