No idea whom you might be talking about...
I reported attendance already? I just opened manyverse to report attendance right now! I guess someone in the community finally cracked time travel, then. This space keeps delivering astonishing technical innovations...
Anyone want to learn about a new data structure? I co-authored something about a new data structure.
It has been some years since I worked on #bamboo, and by now I'd make a lot of different choices. Today I finally got around to writing them down. Reed is about as bamboo 2.0 as things will get.
There will be another chance next Thursday at the same time, looking into Meadowcap.
The next session will be on Thursday the 15th of February. Same time, same place: 15:00 UTC (aka 🍉️ melon hour) on the earthstar discord: https://discord.gg/WbmxKMvk?event=1204031524404461660
Reading https://willowprotocol.org/specs/grouping-entries/index.html#grouping_entries
Willow Specification Reading group
@gwil and I invite everyone to the first Willow specification reading group, on Thursday the 8th of February, at 15:00 UTC (aka 🍉️ melon hour).
We meet on the earthstar discord server: https://discord.gg/WbmxKMvk?event=1204031524404461660
We'll be reading the data model specification together, and you can ask any question you might have along the way.
Announcing the Willow Protocol(s)
Hey everyone,
@gwil and I are excited to finally broadcast a project we have been working on for the past ten months: Willow, a family of protocols for building peer-to-peer systems. Willow sits in a similar problem space as Scuttlebutt, but with some fundamental design choices that favor mutability and multi-writer support over the immutability and single-writer focus of ssb.
A whole lot of more precise info is on the website: https://willowprotocol.org/
We have a stable data model, an almost-stable, capability-based system for access control, and a replication protocol that gwil is currently in the process of implementing. Soon, all of these will power the next version of Earthstar.
This project is our exploration of various ideas that originally germinated in the scuttleverse; in particular, the rejection of global singletons and the focus on local communities have shaped the design quite a bit.
Anyways, we are proud of it, excited, and hope that others here might share our excitement!
cc @elavoie
Congrats =)
@gwil is essentially working full-time on implementing (and documenting) willow. He has funding for converting the current earthstar codebase (typescript) to willow, which involves, well, implementing willow. We are waiting to hear back on a funding application for doing rusty willow things as well.
For me, it is primarily a research project, but one that I would really like to see used in the world. Also I'm slowly funneling some students toward the codebase.
We are not quite in the place for the big announcement on scuttlebutt™ yet, but we are slowly getting there. Ideally even before scuttlebutt has emptied completely =D
@andrestaltz Have you seen Rault, Pierre-Antoine, Claudia-Lavinia Ignat, and Olivier Perrin. "Distributed access control for collaborative applications using CRDTs." Proceedings of the 9th Workshop on Principles and Practice of Consistency for Distributed Data. 2022.?
That's a CRDT (and hence, lattice) for access control, including the problem of mutual removal iirc.
Think the following does justice to Todd and the brave wizard hackers?
Concurrently to the design of certificate transparency, Todd [48] proposed an identical approach under the moniker of merkle mountain ranges (MMRs). This work laid the foundation for the open timestamps standard [37], and is used in the anonymously designed MimbleWimble [20] [17] and its light client design FlyClient [9].
@misc{todd2012,
  title = {Merkle Mountain Ranges},
  author = {Peter Todd},
  howpublished = {\url{https://github.com/opentimestamps/opentimestamps-server/blob/4f5a3c6ae56be766cc6d83e31fb5341f78ecad7c/doc/merkle-mountain-range.md}},
  note = {Accessed: 2023-09-20}
}
@inproceedings{bunz2020flyclient,
  title = {Flyclient: Super-light clients for cryptocurrencies},
  author = {B{\"u}nz, Benedikt and Kiffer, Lucianna and Luu, Loi and Zamani, Mahdi},
  booktitle = {2020 IEEE Symposium on Security and Privacy (SP)},
  pages = {928--946},
  year = {2020},
  organization = {IEEE},
  note = {\url{https://ieeexplore.ieee.org/document/9152680}}
}
@misc{opentimestamps,
  title = {Open Timestamps},
  author = {Open Timestamps},
  howpublished = {\url{https://opentimestamps.org/}},
  note = {Accessed: 2023-09-20}
}
@misc{mimblewimble,
  title = {MimbleWimble},
  author = {Tom Elvis Jedusor},
  howpublished = {\url{https://github.com/mimblewimble/docs/wiki/A-Brief-History-of-MimbleWimble-White-Paper}},
  note = {Accessed: 2023-09-20}
}
@inproceedings{fuchsbauer2019aggregate,
  title = {Aggregate cash systems: A cryptographic investigation of mimblewimble},
  author = {Fuchsbauer, Georg and Orr{\`u}, Michele and Seurin, Yannick},
  booktitle = {Advances in Cryptology--EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19--23, 2019, Proceedings, Part I 38},
  pages = {657--689},
  year = {2019},
  organization = {Springer},
  note = {\url{https://eprint.iacr.org/2018/1039.pdf}}
}
@7suh Thank you for the links, I hadn't really looked beyond the Todd one before.
As far as I can tell, MMRs are identical to the certificate transparency scheme (RFC 9162, formerly RFC 6962), and the FlyClient paper seems to agree:
The idea had been proposed before in the context of certificate transparency logs [38] to show that any particular version of an append-only log is a superset of any previous version.
So my second paper is not doing the same as MMRs; in fact it outperforms MMRs. Or are you saying I should present MMRs as the progenitor and CT logs as the reinvention? This page states that google started developing CT in 2011; they had running logs and an RFC in 2013. Given the whole complexity of the RFC (MMDs etc.), chances are they had the merkle tree construction a lot earlier. So I'd be uncomfortable presenting one as definitely preceding the other.
But I'm definitely going to cite MMRs, probably via the FlyClient paper.
Reading these also pointed me to vector commitments, which will make it into the related work section of the first paper.
Here's my in-order feedback from a single read-through. Disclaimer: I'm in the process of rereading Walkaway right now, so both the paper contents and my remarks feel very "default" (in the Walkaway sense) to me, and hence rather negatively connoted.
- "to redecentralize major Internet services
[,]
often using" more comma - section 2.1 contributors, last sentence: "boundary by reserving updates to other contributors [that meet the contribution requirements of those updates]."
- "Producers are similar to regular knowledge producers, such as software developers, bloggers, writers, video producers, etc." Unclear what "regular" is, only the next sentence makes clear what is being contrasted here.
- "because the dominant cost in a peer-to-peer system is developer’s time" citation needed =P
- "For the special case of software updates from developers of the system, the same contribution gives both the ability to push a contributor own’s updates and pull the latest developer updates." This was not enough for me to understand why this is a special case that requires specific treatment.
- 2.3, time-based subscription: reading the name, my mind immediately jumped to the problem of any global notion of time requiring consensus finding and introducing centralizing (or at least global-singletonalizing) system components. And sure enough, suddenly and nonchalantly a blockchain pops up for the first time (excluding the abstract), as if it was no big deal. All of this makes sense, but I had hoped for things to be more explicit. To me, springing a blockchain on a system design is kind of a big deal.
- "Contributors have an incentive to use a peer-to-peer system because it can potentially provide more affordable replication services than cloud-based alternatives (see Section 4), therefore they can obtain updates from their favorite content providers for less than on other alternatives." This immediately makes me question the underlying assumption that centralized pricing is driven by infrastructure cost and not by profit maximization. Looking forward to section 4 now, but I'm also bit annoyed you are confronting me with this assumption without any justification.
- regarding absence of free redistribution: I expected something stronger there; given how you talked about incentives for not doing this in the abstract, I gave you some trust in advance and expected some moderately game-theoretic validation later on (not that I would have enjoyed it, but I did expect it). Your leading argument in the section (building non-conformant clients takes technical skill) I found too weak; all you need is a single tenured hippie for everything to break. And this reads like you're closing your eyes to reality, where free sharing of copyrighted content runs rampant. I expect this section to draw most of the reviewers' criticism, so it might be a good decision to present only your strongest arguments. Also, you should not base your analysis on the assumption that staying within the designed system is the only option for developers of deviant clients to be compensated. Reality shows that a significant portion of people happily uses adware to access copyrighted content for free, for example. Or a deviant developer (collective) might reach critical wikipedia mass and be sustainable off donations. You might argue that puts them in a shitty situation, but unfortunately their opinion is more important than yours in this case.
- "However, we assume they have limited resources to do so and will be eventually be exposed and blocked," by whom and why?
- "We assume loyal users do not share private keys, therefore signatures uniquely identify them. We do not assume anything about non-loyal users." So a single tenured hippie can pay for updates and then freely share the secret key with anyone to completely circumvent your system?
- "will be used to forever prevent the offending users from ever participating in replication with loyal users again." There seems to be an underlying assumption that creating new keypairs is not cheap, which you should make explicit and justify in order for this argument to hold any weight (if this reasoning is given in your citations, say so).
- "If users break the sequentiality of their updates, they are eventually automatically blacklisted." does user == producer here, and breaking sequentially == forking?
- "Updates are encoded as Git commits [5] and signed by the producer under a self-certifying branch reference [10]." This is a completely jarring switch from high-level design principles discussion to a very specific implementation detail.
- "secret handshake protocol [20]
[,]
otherwise" more comma. Also might want to consider citing the general concept of authenticating handshakes rather than two arbitrarily selected instances. Contrasting ssh with this weird homebrewn solution would make me hesitate for a moment as a review - "a replica needs devId, the identifier of the account from which developers of the system are retributed." What is the relation between users and devIds? one-to-one?
- section 4: does not resolve to which degree cloud providers charge based on these costs and to which degree they charge because they can make a bigger profit. If 95% of the cloud provider cost was due to a profit margin, then there is no reason to assume that your approach would be significantly less expensive for consumers, because the producers could still aim for the same profit margins.
- related work: I was missing discussion of in-app incentives that are tied to a blockchain, aren't there a bunch of startups that will fix the world this way? Also, nfts? Basically, a lot of web3 enthusiast stuff must be out there. How about work like anylog https://www.cidrdb.org/cidr2020/papers/p9-abadi-cidr20.pdf ?
When I started reading this, I was wondering when the concept of capabilities would pop up, especially given that you talk about "access control" in the abstract (though not thereafter). But by the end, I didn't terribly miss them either.
Found this by searching for "Norwegian" of all things, then looked at about 4 years ago when this happened.
For what it's worth, I agree that thinking about notes as A,A#,B,C,C#,D,D#,E,F,F#,G,G# is indeed a bad system. But imo the correct response is movable do solfege, not 12-tone based notation.
Western classical music has a significant bias toward seven of the twelve possible notes. The notation is an organically grown, highly simplified variation on Huffman coding that provides a reasonable amount of compression with negligible decoding latency.
This is not primarily about written notation being more compact (though that's a nice bonus, all those 12-tone notations are comparatively unwieldy), but about reducing the bandwidth required to read meaningfully. Once you've internalized the decompression step, you will be more efficient than if you had to decode a 12-tone notation. Citation or experimental evidence needed of course, but that's the underlying idea. It's not like past humans were too stupid to try out 12-tone-based notation. It just happens to be less efficient (citation needed again, admittedly, I'm at work and can't actually research this right now).
Current notation is certainly not optimal in this respect, I'd be happy to read up on proposals that address the inefficiencies of writing in melodic minor. Unfortunately, I have yet to find a proposal that is even aware of issues of relative orientation and compression.
you don't seem to have addressed the basic argument that I started with, which is: why name only 7 notes in our 12-tone equal temperament note system?
Compression.
I'm not claiming it's perfect. I'm pointing out it's a hybrid between absolute and relative (to the tonal center) instructions on what to do.
What triggers me is when people (frequently computer scientists) completely fail to see the value in those relative instructions (or their existence), design a system that only conveys absolute instructions, and then claim it to be superior.
Look at an article like https://musicnotation.org/tutorials/intervals/ They just assume that current notation is dumb, without devoting a single word to considering why it "distorts" intervals the way it does.
Or they have lovely passages like
In other words, the visual representation is not proportional to the sound it is representing. What one sees does not correspond to what one plays or hears.
No! Of course a diminished seventh sounds completely different from a major sixth, even on a piano. Because neither of them occurs in a vacuum.
Or https://musicnotation.org/tutorials/reading-playing-music-intervals/ Playing/improvising/singing by interval between individual notes is how they claim making music works. But that's just a stepping stone to thinking relative to a tonal center. For actual mastery, thinking in intervals is a bad habit that needs to be overcome. Traditional music notation is not "badly designed", it is tailored to (a particular kind of) mastery.
That whole website is a shining example of never trying to falsify one's hypotheses. They claim that interval notation is full of "ambiguities and inconsistencies". Do they ever stop to question whether that assumption might be false? No! What they should do is let 100 classically trained musicians transcribe Bach by ear, then analyze how many inconsistencies there are. I'm happy to wager that not a single classically trained pianist, conductor or composer would put down a major sixth where Bach wrote a diminished seventh.
Gah, sorry for ranting. But this particular topic happens to trigger me and I've independently encountered this "improvement" one time too often this week.
I think there's too much in Western music theory that doesn't make a lot of sense, and this bothers me, since deep down, music is just math on sound waves.
Western music notation is designed for efficient reading of western classical music. It is neither designed for easy writing, nor for easy reading of arbitrary music.
Correct usage of enharmonic equivalents is crucial for efficient sight-reading. c# e# g# I can parse, memorize, and play in an instant; c# f g# is nonsensical garbage I have to treat as three individual notes rather than a meaningful unit. If the piece is written in C# major, i.e., has 7 sharps, I don't even parse those three particular notes, I simply note I am to play the tonic and move on. It's the same pattern of dots in any key, merely translated on the y-axis.
Moving the pattern to certain positions will give not major chords but minor chords. And hey, those happen to be the correct choice while the piece stays in that key. The pattern just says "play the default chord". If I need to do something exceptional, say, if I'm in the key of C major, to play A major rather than a minor, the notation explicitly signals that something exceptional is going on, by adding a sharp.
This also explains why you'd ever find an e# minor chord: going from g# major to e# minor looks perfectly natural in western notation (because it is the relative minor, a super common concept), whereas going from g# major to f minor is nonsensical, and, written in the key of g# major (or in anything from which you would sensibly reach g# major), looks completely jarring.
Western notation happens to correctly convey absolute pitches, but those are fairly unimportant to most listeners. They have shifted over time! Significantly more important for the effects that music has is how one sound relates to the next. And that's where western notation really shines, in conveying relations between notes.
And it pays off for the trained practitioner. Whereas reading music written by composers with perfect pitch who don't care about music theory is an effing pain.
Stuff like C# being the same as Db, but contextually having to use one of these over the other, just bothers me. Why would you have to call me only Andre on Tuesdays but call me Staltz on Wednesdays? It's the same person!!
Imagine someone asks you how to get to the train station. Why would you have to give them different directions depending on where they ask you? It's the same place!! You should just give them GPS coordinates, that is so much simpler and will definitely help them.
Thanks for pointing it out, I'll adjust the colors!
@Mix Android For the first paper, I'd probably need three passes. I'd skim first to see whether I care, in the next pass I'd make sure to grasp all definitions, and in the third pass I'd follow along the proofs. Fully digesting the complete paper from scratch would probably take me days - while I'm pretty fast at the skim-and-discard-if-boring phase, I'm not the quickest at internalizing technical details.
To be honest, I'm quite glad I don't need to read the first paper =P. Intuitive concepts one could explain face-to-face in 5min packaged in a formal written presentation are always annoying to chew through. But I have to play by those rules for now...
@7suh Finished the proper write-ups, see %8ZxNBht...
I've finished two papers on append-only logs, to be submitted to Usenix security 2024.
https://arxiv.org/abs/2308.13836
The first argues that "append-only log" is a misleading name, proposes "prefix authentication" as an alternative, and gives a detailed summary and recontextualization of prior work.
https://arxiv.org/abs/2308.15058
The second proposes a new way of using backlinks for verification that outperforms hypercore, bamboo, and certificate transparency logs.
@mix.desktop I think you were the one to come up with the vertebra and spine terminology some years ago. Let's see if those can make it into academia =)
I must have posted these things so often I don't really want to repeat them, but:
If you want a performant varint, consider VarU64 (the end of the readme points to discussions of why varint designs like LEB are pretty bad).
If you want a full non-json format, you might also enjoy a look at the valuable values.
Yes!
@andrestaltz The Internet has you covered: https://htime.io/
Now works on smaller displays as well.
@cinnamon's seasonal clock website went down, so @gwil and I rehosted a shiny new version at https://seasonalclock.org/.
@gwil doubled down on the idea of having an emoji for every hour. And after a few reckless renamings, we arrived there!
The clock now uses a light theme if that's your operating-system-level preference.
And finally, we now have a few parameters that can be set via the URL.
Latitude, longitude and offset to UTC parameters let you create links that always display the clock for a particular location on earth:
- Berlin in summer (latitude 52.31 degrees, longitude 13.24 degrees, UTC+2):
https://seasonalclock.org/?lat=52.31&lon=13.24&offset=2
- Wellington (latitude -41.17 degrees, longitude 174.46 degrees, UTC+12):
https://seasonalclock.org/?lat=-41.17&lon=174.46&offset=12
Also, you can highlight certain hours based on their name or utc offset: https://seasonalclock.org/?hl=6&hl=rainbow
The repository is hosted at https://github.com/sgwilym/seasonal-hours-clock/tree/main.
I've been meaning to publish on this for a while, [...].
To specify that a bit: if you take the base-2 linking scheme construction but only store data in the leaves, you get a more efficient result than any prior published work. That construction is imo what the "optimally efficient accountable time-stamping" paper should have chosen, but apparently its authors didn't see it.
That construction is also highly related to a perfect deterministic skip list in which the links between layers go into the "wrong" direction compared to a classic skip list: take the links of the classic construction and change them to always go to the highest possible layer, and you again obtain the base-2 linking scheme with data stored in leaves only.
Yup, the schemes are highly related. The way I think about it is that MMRs, certificate transparency logs and hypercore are essentially doing the same thing in slight variations: entry n has a couple of disjoint trees of descending size in its past (take the binary representation of n; each one-digit at position k corresponds to a tree of size 2^k). The mechanism by which you connect those trees is almost an implementation detail. The optimally efficient accountable time-stamping paper has the earliest occurrence of that pattern I know about.
Bamboo uses the scheme from New Linking Schemes for Digital Time-Stamping, whereas the analogy you described works even better for the scheme from Time-Stamping with Binary Linking Schemes (which is exactly the log-2 construction you mentioned). I've been meaning to publish on this for a while, but writing takes so much time (as does teaching)...
Working link to the essay: https://habla.news/a/naddr1qqxnzd3cxyerxd3h8qerwwfcqy88wumn8ghj7mn0wvhxcmmv9uq32amnwvaz7tmjv4kxz7fwv3sk6atn9e5k7tcpramhxue69uhkummnw3ez6un9d3shjtnwda4k7arpwfhjucm0d5hsygyzxs0cs2mw40xjhfl3a7g24ktpeur54u2mnm6y5z0e6250h7lx5gpsgqqqw4rsf67qa5
when validating backlinks, you never know if you've gotten to "the bottom" because tomorrow you may discover a new bottom;
This is a non-argument argument: why do we need a system that has "the bottom" rather than "the bottoms"? Does a singular bottom simplify things because there is a singular root? Any algorithm over a tangle has to be able to handle incomparable elements already, hence, every tangle algorithm is already able to handle multiple sinks.
you can't calculate the depth of a msg, and depths are useful for sliced replication and for validation.
The exact same calculation as before works (this is a specific instance of my claim on the generality of tangle algorithms above). For any entry without predecessors, the depth is 0. For any other entry, the depth is the maximum depth of its predecessors plus one. That definition gracefully handles multiple sinks.
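For concreteness, a minimal sketch of that definition, over a made-up predecessor map rather than any actual tangle data types:

```rust
use std::collections::HashMap;

type Id = u64;
// Hypothetical representation: each entry id maps to the ids of its predecessors.
type Preds = HashMap<Id, Vec<Id>>;

// Depth as defined above: 0 for entries without predecessors,
// otherwise 1 + the maximum depth among the predecessors.
// Assumes the graph described by `preds` is acyclic.
fn depth(entry: Id, preds: &Preds, memo: &mut HashMap<Id, u64>) -> u64 {
    if let Some(&d) = memo.get(&entry) {
        return d;
    }
    let mut d = 0;
    if let Some(ps) = preds.get(&entry) {
        for p in ps {
            d = d.max(depth(*p, preds, memo) + 1);
        }
    }
    memo.insert(entry, d);
    d
}
```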
I'm sorry if this seems pointlessly nitpicky, but so far I still don't see how a single root provides any benefit beyond yielding an identifier. And I still maintain that identifiers can just as well be generated from a "tangle-creation-message" that doesn't contain any content. In some sense that still gives you the singular point of reference. But it doesn't attach any content to it.
The single-root design entangles tangle identification and tangle content, and to me that is more complex than keeping these two concepts independent. If your design philosophy is different, I'll happily begrudgingly stop asking for clarification, because at that point it becomes a discussion about how to identify tangles, not about what is a tangle. But right now it feels to me like you are attributing value to some algorithmic properties of the single-root approach which I simply do not see. And it leads to a more complex definition of a tangle than the simpler "a dag where each node contains the same identifier (the tangle identifier) as an extra piece of data". And whereas I don't really care about what these identifiers look like and where they come from, I do care about getting the central definition on which the whole system builds to be as simple as possible.
I would flip it around: I'm wondering whether the generality of multi-root DAGs is worth giving up the simplicity (of implementation, of reasoning when debugging, and of performance) of single-root DAGs.
Does a single root simplify anything beyond yielding an identifier for the tangle?
One of the things I never got about tangles is the need for a distinguished root. There seems to be no reason why a tangle has to have a single root. I can think of many cases where a post beginning a thread starts by mentioning two others.
100% agreed. A feed is identified by a pubkey, not a single first message. It can in fact have multiple messages of sequence number 1 (of course it is forked then, but still it is possible). I'd much prefer if tangles had a randomly chosen identifier, say the hash of some create_tangle message in the log. Such a message would not have any content and would not itself be part of the tangle. It would only be a message in a log in order to force its hash to be random.
Any definitions and algorithms need to deal with incomparable posts anyways, so there's no real benefit to having a single starting point.
Doesn't a TLA proof fall exactly into your third category of problematic papers? =P
The current specification using a single counter and state testing on even-oddness was jointly rediscovered in collaboration with Christian F. Tschudin and Ramon Locher during a CRDT Seminar at University of Basel during the Spring Semester of 2023. It was a small disappointment, after a literature review, to realize we were 4 years too late to claim originality.
I know the feeling. That mechanism must be rediscovered by sooo many people =D
Skipped over the proofs, but it's a really nice, easy-to-follow write-up. The introduction jumped fairly abruptly into the subject matter, but that's probably more an issue of personal taste.
What mix is talking about here is essentially what I meant by "If you enforce that all messages in a tangle by a single author must be causally related[...]" in the github issue I believe ("single author" == "single device"). The linking scheme allows you to check exactly that efficiently.
Are there any plans to make this official enough that I could get my university to cover expenses? In the previous iteration, we had some scheduled talks/presentations for that purpose, and a small website (I'd love to show my supervisor something more than some manyverse screenshots).
I turned my master's thesis on set reconciliation into a more readable form: https://arxiv.org/abs/2212.13567
@Mix Android recently posted something on prolly trees, some of this work can serve as a replacement for prolly trees with some nicer properties (see the last paragraph of section 2.1, and sections 5.3 and 6). CC @arj @andrestaltz
I will be in London from the 16th to the 21st of December. Anyone want to hang out? CC @SoapDog (Macbook Air M1) @sandreae.Android
Also, anywhere I could crash the night from the 16th to the 17th?
I can make both =)
Edit: for some reason I assumed your initial vec contained Emoji by value rather than by reference. So please mentally add one layer of & to all emoji-related types in the prior post.
Do you need your initial Vec of Emoji to keep existing after obtaining your derived collection? If not, you can use Vec::into_iter, which takes ownership of the original Vec (i.e., you cannot use it anymore afterwards), but which yields Emoji rather than &Emoji like the iterator obtained via Vec::iter. Writing for emo in &my_vec {...} desugars to using the by-reference iterator (emo has type &Emoji), whereas for emo in my_vec {...} (or, equivalently, for emo in my_vec.into_iter() {...}) iterates by value (emo has type Emoji).

If you need to keep the original collection, try mapping Emoji::clone over the iterator obtained from Vec::iter.
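A small sketch of both options, using a stand-in Emoji type (I don't know what your actual type looks like, so this is just illustrative):

```rust
#[derive(Clone, Debug)]
struct Emoji(String);

fn main() {
    // Option 1: consume the original Vec, iterating by value (Emoji).
    let my_vec = vec![Emoji("🍉".into()), Emoji("🌅".into())];
    let owned: Vec<Emoji> = my_vec.into_iter().collect();
    // `my_vec` cannot be used anymore at this point.

    // Option 2: keep the original Vec, iterating by reference (&Emoji)
    // and cloning each element.
    let kept = vec![Emoji("🍉".into()), Emoji("🌅".into())];
    let cloned: Vec<Emoji> = kept.iter().map(Emoji::clone).collect();

    println!("{:?} {:?} {:?}", owned, cloned, kept); // `kept` is still usable.
}
```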
My first intuition is to use a Bloom filter [...] but something better may be possible.
"Set reconciliation" might be the (annoyingly non-obvious) name of what you're looking for. See the related work section of this (chapter 5, page 50) for a literature overview that is fairly complete to the best of my knowledge (disclaimer: I wrote this).
An interesting property of pull is that it's possible to ask a peer to look for a post they don't have yet.
Is this really inherent to pull/push? You can also have a push-based system in which the pushing is suspended until missing data becomes available; this effectively happens every time when live-streaming data between peers in ssb. In a system where data is linearly ordered but it is possible for individual data to be missing, you can of course also resume pushing with newer data rather than suspending upon missing data, and eventually push that data should it surface.
Another viewpoint: pure push ("you may send arbitrary amounts of data to me") is not practical, as no one can guarantee arbitrarily large amounts of resources to process those arbitrary amounts of data. Any sensible push system takes the form of "you may send arbitrary parts of this finite collection of data to me". Squint a bit and this is just pull with some additional implementation details: you request a single datum (the finite collection) and the means of receiving it can be split up over time.
I'm pretty much just rambling now, but the "2, 20, 20000?" question of yours resonated with me. I think I'm looking for a clarification why "in case some got deleted before reaching me" seems to be problematic for you. (I'm not trying to say it isn't - I spent a good time trying to conceptually make efficient replication of ssb messages for which only the metadata is required but the content might be missing work gracefully even when arbitrary pieces of content are unavailable, and eventually gave up. See the #offchain-content tag if you have way too much time and want to dig into some discussions that happened on ssb around that topic. Unfortunately I don't think there are posts about the replication difficulties.)
this conceptually being a cache rather than a log
Do you still have sequence numbers in your model? If you replace sequence numbers by arbitrarily chosen ids, and allow overriding values in the resulting cache, you pretty much get the data model behind earthstar: each peer holds an updatable (i.e., mutable) mapping from (author, id) pairs to data.
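Roughly this shape, with made-up names (my paraphrase of that model, not earthstar's actual API):

```rust
use std::collections::HashMap;

type Author = String;
type DocId = String;

// A peer's state: a mutable mapping from (author, id) pairs to data.
#[derive(Default)]
struct Store {
    docs: HashMap<(Author, DocId), Vec<u8>>,
}

impl Store {
    // Writing the same (author, id) pair again overrides the previous value.
    fn set(&mut self, author: Author, id: DocId, data: Vec<u8>) {
        self.docs.insert((author, id), data);
    }

    fn get(&self, author: &Author, id: &DocId) -> Option<&Vec<u8>> {
        self.docs.get(&(author.clone(), id.clone()))
    }
}
```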
@Jeremy List @nyx The precise definition of how to encode things for signing is here.
@nyx You might like https://github.com/AljoschaMeyer/valuable-value (which was born out of my frustration with existing data formats and informed directly by having reverse-engineered the scuttlebutt encoding in order to write that specification above).
"you can only read this if we both are interested". Hmmm, cryptographic challenge...
This boils down to computing the logical AND of two bits held by different parties without either of them having to reveal their bit. Incidentally, a flatmate of mine has put some time into this, and referred me to Yao, Andrew C. "Protocols for secure computations." 23rd annual symposium on foundations of computer science (sfcs 1982). IEEE, 1982. That reference tackles a more general problem, but can be a good starting point.
For this specific problem, there exists a write-up in German by some researchers I do not know, but also a sheet of handwritten notes in English one room away from my desk. So if anyone wants to seriously delve into this, I could make introductions.
My intuition tells me spanning trees are involved.
The nice thing about ebt though is that the spanning trees are a purely global phenomenon; an individual node isn't aware of what those trees look like beyond its neighbors.
I cannot speak about the network protocol details in ssb's flavor of ebt, but the conceptual idea (see the paper @moid mentioned) is quite simple: consider a network of nodes, where every node is only connected to a small number of other nodes; different connections may incur different delays when sending a message. One of the nodes produces a sequence of messages, the goal is to broadcast this sequence of messages to all nodes.
We want to optimize three metrics: total number of bits transmitted, latency until the message has reached every node if no connection failures occur, and the impact of connection failures. There are three basic strategies:
- eager flooding (push): When receiving a message for the first time, send it to every neighbor. This has minimal latency, cannot fail if the network stays connected, but has to transfer many bits.
- lazy flooding (pull): When receiving a message for the first time, notify every neighbor. They can then request the message from you. This is more efficient in terms of bits transferred, as every message is transmitted only once per node. The resistance to connection failures is the same as that of eager flooding. The latency is however three times as high as the optimum.
- push along a minimum-weight spanning tree: Globally compute a minimum-weight spanning tree over the network (the weight of a connection is its latency), perform eager push along this spanning tree only. Results in minimum latency and minimum number of bits transmitted, but vulnerable to connection failures on the tree.
Each of the three basic strategies is optimal with respect to two of the metrics but pretty bad with respect to the third. EBT is a more complex strategy that strikes a better balance: Perform eager push along a minimum spanning tree, but also perform lazy flooding on all the other connections. If you receive a lazy notification faster than the eager push of the corresponding message (up to a grace period to avoid triggering this case too many times), the spanning tree is apparently not optimal anymore. Notify the node that gave you the fast lazy notification to eagerly push to you in the future (also ask it for the message in question, in case your old eager supplier will take too long to transmit the message to you), and tell your previous eager neighbor to switch to lazy mode.
With this, latency is still optimal, and only a small number of non-optimal bits are transferred (we assume lazy notifications to be almost negligible in their size), yet the system is highly resistant to connection failures: when an eager connection breaks, the receiving node will receive a lazy notification at some point over a different connection and thus change the tree to not contain the broken connection anymore.
So we get a very efficient, self-healing system, but the local operations of every node are fairly simple.
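A very rough sketch of the per-neighbor bookkeeping this boils down to (names are mine; this ignores the grace period and all the actual protocol message formats):

```rust
use std::collections::HashMap;

enum Mode {
    Eager,
    Lazy,
}

struct Node {
    // neighbor id -> do we currently exchange messages eagerly or lazily with them?
    neighbors: HashMap<u64, Mode>,
}

impl Node {
    // Forward a new message to eager neighbors, only announce its id to lazy ones.
    fn on_new_message(&self, msg_id: u64) {
        for (&neighbor, mode) in &self.neighbors {
            match mode {
                Mode::Eager => println!("push message {} to {}", msg_id, neighbor),
                Mode::Lazy => println!("announce message {} to {}", msg_id, neighbor),
            }
        }
    }

    // A lazy announcement arrived before the eager push of the same message:
    // the spanning tree is apparently suboptimal, so rewire it locally.
    fn on_early_announcement(&mut self, from: u64, old_eager_supplier: u64) {
        self.neighbors.insert(from, Mode::Eager); // ask `from` to push eagerly from now on
        self.neighbors.insert(old_eager_supplier, Mode::Lazy); // demote the old supplier
        // ...and separately request the missing message from `from`.
    }
}
```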
@Mix Android
No prior experience with this feature or scale, but Graphviz can embed images in the graph nodes it lays out. Write a script that converts your graph into a corresponding DOT (the input language for graphviz) file and places the images in the file system, then invoke graphviz.
Graphviz should handle that scale, and is quite flexible (raster graphics, svg, pdf all work, several layout engines to choose from).
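As a sketch of such a script (the node ids and image paths are made up; you'd generate them from your actual graph data):

```rust
// Emit a DOT file in which every node is rendered as an image.
fn main() {
    let nodes = [("a", "images/a.png"), ("b", "images/b.png")];
    let edges = [("a", "b")];

    let mut dot = String::from("digraph g {\n  node [shape=none, label=\"\"];\n");
    for (id, img) in nodes {
        dot.push_str(&format!("  {} [image=\"{}\"];\n", id, img));
    }
    for (from, to) in edges {
        dot.push_str(&format!("  {} -> {};\n", from, to));
    }
    dot.push_str("}\n");

    // Write this to e.g. graph.dot, then run: dot -Tsvg graph.dot -o graph.svg
    print!("{}", dot);
}
```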
There are some (minor, imo) warts.
Would you mind sharing what you dislike about it (author speaking here)? Bamboo is stable and thus I cannot change it, but here are what I would consider its warts:
- the tag byte probably shouldn't exist
- crypto primitives should not be hardcoded into the spec (insisting on multiformat hashes in particular was probably a bad idea)
- an entry should also include the accumulated size of all entries starting at the lipmaa link target
Other stuff:
- I really wish I had used the terminology of skip-link and predecessor-link rather than lipmaalink and backlink
- more flexible (monotonically increasing, accumulated) metadata beyond entry size
Or a minor update to support a new hash
Is it really minor though? If it mandates implementations to treat certain messages different than before the update, then all old implementations that have not implemented the update yet suddenly do not conform to the specification anymore. That's a breaking, i.e., major change as I see it; new behavior is not a strict superset of old behavior, but old behavior has to change.
At the very least, you have to very carefully specify how to handle as-of-yet unsupported hashes, and make sure that not knowing some hash format still leaves an implementation conformant to the specification. Following that argument, an implementation that does not know about any hash format should also be conformant (albeit useless). Otherwise, adding hashes clearly is a breaking change.
More spec remarks, putting on a deeply black hat, so be warned:
author: ssb-bfe-encoded buttwoo feed ID, an ed25519 public key
The bipf spec heavily implies that its compound values have to contain further bipf data ("with schemaless json-like semantics", "OBJECT : 5 (101) // sequence of alternating bipf encoded key and value"), but it is ambiguous about arrays ("ARRAY : 4 (100) // sequence of any other value"). Assuming arrays must also contain bipf values, you cannot put an ssb-bfe-encoded feed ID into your top-level bipf array. So do I have to wrap these in a bipf buffer (of statically known length, nonetheless)? At the very least, this needs clarification in both specifications.
parent: ssb-bfe-encoded buttwoo message ID used for subfeeds. For the top feed this must be BFE nil.
See above.
Both author and parent use a redundant encoding: why is the author ssb-bfe encoded, if you already know it has to be a buttwoo feed ID? Is an optimizing implementation allowed to just ignore the useless bytes and just look at the key instead? Or do I still have to check that they contain the only valid - i.e., completely pointless - byte pattern? This opens up room for mistakes or just peers that sow chaos, for no benefit.
sequence: number for this message in the feed
Which data type is this? How is it encoded? "number" is not a bipf type.
timestamp: integer representing the UNIX epoch timestamp of message creation
This is a bipf int (32 bit) I assume? Considering that author/parent are not bipf values either, this should be specified. Any reason not to go for 64 bits? 32-bit timestamps run out in 16 years; your crypto primitives hopefully last longer.
tag: a byte with extensible tag information (the value 0x00 means a standard message, 0x01 means subfeed, 0x02 means end-of-feed). One can use other tags to mean something else. This could be used to carry for example files as content.
Please specify what an implementation must do with unknown tags.
contentLength: the length of the bipf-encoded content in bytes
hash: concatenation of 0x00 with the blake3 hash of the bipf-encoded content bytes
This contradicts that content is not necessarily bipf-encoded.
If the spec mandates a 0x00 byte at the start of the hash, then an implementation has to reject everything that does not have that byte. Any change that allows other starting bytes would be a breaking change, i.e., a whole new format. So as the spec is currently written, this byte is completely redundant and should be removed from the spec.
The content is a free form field. When unencrypted, it SHOULD be a bipf-encoded object.
Please specify what an implementation must do when it is not a bipf-encoded object.
A buttwoo message consists of a bipf-encoded array of 3 fields:
Metadata must be an bipf encoded array of 8 elements:
The only unknown about the length of a message is the length of the metadata. Since the metadata starts with its length, storing the message as a bipf array is redundant, you could simply concatenate metadata, signature and content instead. An efficient implementation would ignore the first byte of the message encoding - but it has to verify that the length is correct. This byte is redundant and forces redundant computations (branching even) on all implementations.
Similarly, the length of the metadata array is only influenced by whether the parent is null or not. Should you really encode this by prefixing the metadata with two different, fairly large, arbitrary-looking integers? Especially since the parent encoding contains the information about whether it is null again.
Overall, the information whether the parent is null is stored in three different places, in three different ways, and while only one place needs to be checked to function correctly, the consistency of all three locations has to be verified. This violates DRY in a pretty horrible manner.
Also, just out of curiosity: did you consider VarU64 for bipf, and if so, why did you choose LEB128 over it? Adoption/implementation availability, or design reasons? CC @Dominic (?)
For bulk validation I decided to go with the solution Dominic mentioned by just validating the signature of the last message.
Just because it hasn't been mentioned yet in this thread: what happens if I start appending to another person's log? That gives me a valid hash chain in which every message is correctly signed by its author, it's just that authorship changes halfway through the log.
Do you perform checks for detecting this case? Which behavior do you recommend/mandate for implementations when this case occurs? Is there even a slight possibility of clients receiving entries from a log like this and doing all sorts of undesirable things because one of the core contracts they expect a log to uphold has been broken, for example, displaying messages from the old log under the name of the author of the new suffix, or having a sudden switch of authorship in their timeline view?
You can still easily (and quickly, compared to signature validation) verify that all messages have the same author, but that should be specified somewhere.
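The check itself is trivial; something along these lines, with a hypothetical Msg type:

```rust
// Hypothetical message type; only the author field matters here.
struct Msg {
    author: [u8; 32],
    // ...
}

// Cheap check (no signature verification): do all messages share one author?
fn single_author(log: &[Msg]) -> bool {
    match log.first() {
        None => true,
        Some(first) => log.iter().all(|m| m.author == first.author),
    }
}
```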
Nitpicking on the spec:
Status: In review
By whom? For how long? Under which criteria? Don't tease the poor reader like that.
The verification section is only local to each message. It should at the very least link to the specification for log verification (valid hash chain, single-author, etc). Which exists, right? =P
I was looking forward to seeing the daily-ish marker drawing of yours in my feed, but there are none. What am I doing wrong, @sonar?
The #p2panda folks gave me the opportunity to flesh out some ideas beyond bamboo, here is a write-up: https://aljoscha-meyer.de/magma
This investigates what sort of system you get when you consider every entry of the log to describe a state change, and then always publish both the state change compared to the predecessor entry, and the state change compared to the skip link predecessor, in a single entry. It generalizes bamboo and append-only logs, as those just implement a very particular kind of state change, namely, appending to a log.
The write-up is incomplete (it just stops at some point) and will stay that way, but there is still a lot of interesting stuff that goes beyond bamboo in there. I will eventually try to publish this in a scientific paper, but it will be a while until I get to that.
I wrote a standalone overview on binary linking schemes, the concept that powers #bamboo, a while ago: https://aljoscha-meyer.de/linkingschemes (requires javascript). I don't think I've shared it on here yet.
The writing is fairly math-y and austere, but it also contains some nice interactive visualizations, such as this one of certificate pools (the data that peers have to store in order to guarantee that partial replication is verifiable).
Hello again everyone!
It's been a while, but it looks like I can interact with the scuttleverse again. For the last two years, I have been dealing with chronic wrist pain, forcing me to use speech recognition (Dragon, on Windows). And the scuttlebutt client story on Windows, especially when not able to install nodejs (whose installation messed with the python dependency of the speech recognition engine), was rather bleak. I have also been struggling with some mental health issues, which didn't make reaching out to fellow butts any easier.
But not everything is doom and gloom. I finally started seeing a therapist, I'm actively trying to improve the wrist situation, and apparently manyverse on desktop works, is stable, and does not disturb the speech recognition engine. I've also wrapped up my Master's degree, only to stay in academia: I'm doing a PhD now in the Open Distributed Systems group at the Technical University Berlin. My research there will probably be a continuation of all the replication stuff I've been writing about for the last years.
If I started @-mentioning people I'm excited to interact with again, I would only forget some. Because there's so many!
Looking forward to the future,
Aljoscha
PLEASE IGNORE UNLESS YOU’RE INTERESTED but is it completely ridiculous to switch between groups of 2 and 3 to try to get the average group size approach E?
I am still too bad at math to give an explanation, but I did implement this a while ago, just to find out. Approaching e is not an improvement; consistently doing groups of three yields smaller certificate pools and shorter shortest paths.
Oh no, you foiled my insidious plan of lurking for a while without being noticed!
It turned out that manyverse desktop simply works without impacting the speech recognition setup =). Thank you, @andrestaltz .
{ "type": "about", "about": "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519", "image": "&V7ixGKpJaJJcwtlDNj4HtbprXYxH9fLAcCcvCobyhdo=.sha256" }
Every currency needs a neutral element of addition. #noney
@andrestaltz It turns out that more structured plaintext formats like code are a lot more amenable to speech recognition than free-form text, especially if you use some sort of macros or snippets. Only navigating a code base can be fairly annoying, depending on how much you want to also reduce your mouse usage.
@André Yes, I'm using Dragon, together with the Caster framework. I haven't extensively customized it, but I am quite happy with the quality of Dragon's output. I'm certainly less productive than before, but it beats constant wrist pain, to the degree that I am even accepting working on a Windows machine.
@Mix Android I didn't, I resorted to a virtual machine running arch and then the patchbay AppImage.
By routing the output of a Windows-only speech recognition engine to a virtual machine running arch Linux and Patchbay, I can now literally say:
Hello again, world.
I had another attempt at defining a replication protocol for bamboo logs: https://github.com/AljoschaMeyer/bamboo-point2point
This might not only be relevant for those who have played around with bamboo (@piet, @hoodownr, @adz, @cafca, @sophie), but also for its general thoughts on specifying which parts of a log to replicate, CC @André, @christianbundy, @Netscape Navigator™
This uses some terminology changes that have not yet made it into the bamboo specification:
- A link to a predecessor entry is now called the predecessor link.
- A lipmaalink is now called the skip link.
- A link that is either a predecessor link or a skip link is called a backlink.
- A log is now identified by its log id, which consists of its public key and its log number.
time travellers (people who try to attach nodes causally back in time in the DAG)
@cblgh, to define DAGs free of time travelling (i.e. tangles) in a declarative way, any of the following equivalent defs do:
- DAGs whose transitive reduct equals the original DAG itself
- DAGs where an edge from u to v implies there is no other path from u to v (see the sketch after this list)
- DAGs without induced paths
- the class of Hasse diagrams
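Here is a small sketch of the second characterization (my own helper names; it assumes the graph is given as a successor map and is already known to be acyclic):

```rust
use std::collections::{HashMap, HashSet};

// Is `to` reachable from `from` without using the direct edge from -> to?
fn reachable_without_edge(edges: &HashMap<u32, Vec<u32>>, from: u32, to: u32) -> bool {
    let mut stack = vec![from];
    let mut seen = HashSet::new();
    while let Some(n) = stack.pop() {
        for &succ in edges.get(&n).into_iter().flatten() {
            if n == from && succ == to {
                continue; // skip the direct edge itself
            }
            if succ == to {
                return true;
            }
            if seen.insert(succ) {
                stack.push(succ);
            }
        }
    }
    false
}

// A DAG is free of time travelling iff no edge u -> v has an alternative path from u to v.
fn is_tangle(edges: &HashMap<u32, Vec<u32>>) -> bool {
    edges
        .iter()
        .all(|(&u, succs)| succs.iter().all(|&v| !reachable_without_edge(edges, u, v)))
}
```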
I finally did a writeup on the set replication algorithm I talked about at the Basel event: https://github.com/AljoschaMeyer/set-reconciliation
To get used to working with speech recognition, I wrote a larger blog-like post on sequence (streaming) abstractions: https://github.com/AljoschaMeyer/sequence-abstractions/blob/master/README.md
On the plus side, I can now write again, so you can also expect the occasional link to a dense wall of text.
To rest my wrists, I have switched to a voice-controlled computer setup. Unfortunately, this also meant switching to Windows. I didn't manage to get any ssb client to work, so I will have to restrict my activity to occasionally checking mentions and private messages on the linux machine.
Would you consider tweaking your script that produces the "Thanks for this post ; I have just sent you G1 libre money." replies to write private messages rather than public posts? To me, these automated comments feel spammy to the point where I'm considering blocking regardless of your other posts, especially since some clients bump threads to the top of the feed view based on new replies, so your likes now carry a disproportionate weight on what people see of the scuttleverse.
Guess who is also searching for a flat in Berlin at the start of April (or May), is fluent in German, and sometimes pretends to be a musician? I didn't plan on having to look for a new place, but that's life...
Since JSON is not deterministic, we need to serialize the message exactly how we found it.
ssb defines a deterministic way of serializing the json, see https://spec.scuttlebutt.nz/feed/datamodel.html#signing-encoding and https://spec.scuttlebutt.nz/feed/messages.html#json-encoding
@notplants Sorry for the super short response (and ignoring the numerous parts of your post that I agree with), but this all I have capacity for right now:
- hell yes for poems dealing with technical topics!
perhaps the way in which spoken words disappear helps make clear the difference between sign and signified
the sign (the speech) immediately disappears
the signified, what is remembered by whoever heard it, is stored for longer, but in a form that is private
and needs to be retranslated into the context of the current moment whenever(if) it is spoken again
perhaps its good to think about what is a communication transport layer (air) and what is intended as an archive
technically, your poem isn't etched into any chain, since you posted it as a blob. Nodes can delete that file without impacting their ability to replicate your feed.
idea no.3 relies on the notion of a global concept of time shared by all nodes. In theory, this is impossible, and practical approximations of global time rely heavily on connectivity [citation needed]. Global time is dangerous business.
@moid Npm is weird, how did it allow me to publish circular dependencies in the first place? Anyways, published shs1-crypto 1.0.2, which fixes this maybe?
@Alex This thread might interest you. CC @Emmi
First mechanism that pops into my mind for usernames: display the one that is lexicographically smaller. To the id that published the concurrent names, display a really annoying popup that tells them to choose one of them. That one is then published again, with backlinks to the previous, concurrent ones. As soon as other nodes receive the new one, they'll use it since it is unambiguously newer.
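The arbitrary-but-deterministic choice itself is just something like this sketch:

```rust
// Every client makes the same choice between two concurrently published names.
fn display_name<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a <= b { a } else { b } // the lexicographically smaller one wins
}
```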
Disclaimer: no, I don't seriously advocate really annoying popups. But the general pattern of "arbitrary choice for others, conflict resolution prompt for the author" seems useful in many cases (whenever the arbitrary choice doesn't do harm). Could also somehow visualize to other clients that an arbitrary choice to resolve concurrency issues has taken place, e.g. give the name a different background color, append an icon, or whatever. This touches again on the topic that an offline first world will need (graphic) design conventions for conveying concurrency.
You seem to imply that there is consensus finding among groups of coders, but is that really what happens? Most of the time, at least from my perspective, it is simply a single dev (or rarely two or three) doing their thing. Sometimes that work is picked up by others. Most of the time, it isn't.
There is barely any consensus finding of larger groups at all. Consequently, there have been almost no advances which required wider consensus over the last months/years.
For non-basel-attendants: this is about a hypothetical world in which instead of ordered logs peers would merely insert items into ever-growing sets. These sets would be replicated via ~~dark magic~~ the mechanism the first question pertains to and which I will explain on here soon, probably in video form.
What sort of hash function would one use to guarantee ordered items in the binary search tree?
The hash function has nothing to do with the ordering; the trees I drew were simplified. Just sort stuff by its natural order in the tree, and compute the corresponding fingerprints (hashes in the leaves, sums (or XORs) of hashes in the inner nodes). This btw is again simplified; usually BSTs also store data in inner nodes - you'd then store the value, the hash of the value, and the XOR of left-subtree-hash, own-value-hash and right-subtree-hash.
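A sketch of that node layout (the hash type and the way you obtain value hashes are placeholders, not from any particular implementation):

```rust
type Hash = [u8; 32];

struct Node {
    value: u64,
    value_hash: Hash,  // hash of the value itself
    fingerprint: Hash, // XOR of left fingerprint, value_hash, right fingerprint
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn xor(a: Hash, b: Hash) -> Hash {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = a[i] ^ b[i];
    }
    out
}

// Recompute a node's fingerprint from its children; empty subtrees contribute
// all zeroes, the neutral element of XOR.
fn fingerprint(value_hash: Hash, left: &Option<Box<Node>>, right: &Option<Box<Node>>) -> Hash {
    let l = left.as_ref().map(|n| n.fingerprint).unwrap_or([0u8; 32]);
    let r = right.as_ref().map(|n| n.fingerprint).unwrap_or([0u8; 32]);
    xor(xor(l, value_hash), r)
}
```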
If one still needs partial-order for some messages (like username changes) how does one guarantee that there is no fork again?
You probably meant total order rather than partial orders. The best approach is to not require a total order for the application layer in the first place. Otherwise, mutual exclusion for writes becomes necessary - enforcing that would be an application concern. Crucially, if application-layer total order is violated, the raw data can still be replicated. Different apps using the same set would be unaffected.
Btw @cafca:
@aljoscha: Finite logs are an approach. They have just a limited length, like 256 messages, and everything before that is pruned
This describes a ringbuffer, but I really meant limited length: it would be flat-out impossible to append the 257th message.
BTW: Today we had snow - @Aljoscha had wished for having snow while he is in Basel and now it happened!
Guess who slept through it and woke to what looks like yet another day in April?
Did you see the ssb paper yet? There's a comparison with NDN in there, and some more comments in the zine version.
@mix I have no idea. I was merely pointing out that there are analogies to type systems, not making any recommendations to use particular things for ssb. In fact, static typing for ssb sounds like an antipattern, for the reasons you mention.
But, while I'm at it, a random comment on json schema for ssb messages: as far as I know (if I'm wrong, please ignore), the schemas only say "there's a cypherlink here" rather than "there's a cypherlink to a msg of the following schema here: <inline schema or name of a preexisting schema>". It's a bit like programming C but using void pointers exclusively, or programming java but declaring all variables as having the type "Object".
Reached the end of my typing capability for today (wrist pain), but hope this is still useful =)
Do we have a settled hashtag for this event yet?
Guess we do now =) #p2p-web-basel
Is there anything I should know in order to access the location with a mobility scooter?
The elevators in the building (spiegelgasse 1) are fairly small, but there should be a ramp and a larger elevator in a side entrance in spiegelgasse 5, leading directly to the seminar rooms where sessions will be held. @SoapDog and @zelf, could you give @farewellutopia your phone numbers so that they can reach us when they are on site? We can scout the side entrance in advance and make sure everything works out.
lactose intolerant
The food will be vegan by default, but not necessarily without lactose. But we'll figure something out.
In the spirit of asking you completely random questions, do you know if there are syntax highlighter themes that are usable for folks with common types of color blindness?
I don’t offhand. I would imagine since color blind people do see some colors, it could be a matter of just using the right ones?
I can also imagine using different fonts and font styles (italics, bold, underlined, etc) to reduce the reliance on color recognition.
⚡ Lightning Talks in Basel ⚡
Hey Basel meetup attendants, the event begins in two days, which among other things means that there will be lightning talks in two days. Most of you will have some interesting stuff going on, so if you feel like spending five minutes getting like-minded people excited about something that excites you, now is the perfect time to think about how to best do that. That's all.
- vegetarian
- no, I'm already here: %NHtbZUZ...
- violet, I don't remember why
@cblgh Just pick whichever station is nearest to where you need to get to.
Indeed =D
We can’t make promises for the self-organized sessions, but if no one objects, we’d like to record the scheduled sessions (tutorial on Friday and “invited talks” on Saturday).
Check out this post, that should contain everything Markus will need to know =)
@moid, @André True, I didn't consider this. That's kinda bad...
For a while I’ve really wanted to implement a thing where each message contains links to the latest messages from a few of the feeds you’re aware of.
Do consider the privacy implications though, especially with respect to non-public follows and blocks, as well as feeds that are being requested not through the follow graph but through some other means. Someone who chooses not to attach truthful timestamps to their messages could suddenly still be pinned to their timezone by looking at the truthful timestamps that are linked to/from. Another interesting question: Should you only respect those dedicated "global causality links" when determining what to link to, or should you consider any and all links? What about links in encrypted messages?
Re size and choice of messages: First, obtain a pool of potential messages to link to. The ideal (correct and minimal) pool can be obtained by:
- figure out the set of all messages from all feeds that you have in your local db
- throw away those that are already reachable from your new message by transitively following all its cypherlinks (including the backlink(s))
- from the remaining, throw away those that can be reached from another of the remaining messages
These steps can be done efficiently, assuming you already keep a topological sorting of everything. Also note that in step one you can take the newest message of each feed only, without changing the end result. Still, if this is too complicated/expensive, you could simply use the newest message of each feed you know of, regardless of when you last linked to that feed. That might mean including superfluous information, but it doesn't hurt that much.
As for size concerns and the choice from this pool: randomness is your friend. The resulting ordering won't be perfect (i.e. total) anyways, so no harm in reducing implementation complexity by sacrificing a bit of quality. Just set a maximum number of such backlinks per message, and if the pool size exceeds it, pick random messages from the pool until you hit the maximum. It would be more fancy to determine which subset of the pool would have the highest quality (provide backlink paths to the largest number of messages), but probably unnecessary.
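A tiny sketch of that capping step; MsgId and the function name are made up, the pool is assumed to have been computed as described above, and the rand crate does the sampling:

```rust
use rand::seq::SliceRandom;

// Pick at most `max_links` messages from the pool, uniformly at random.
fn choose_causality_links<MsgId: Clone>(pool: &[MsgId], max_links: usize) -> Vec<MsgId> {
    if pool.len() <= max_links {
        return pool.to_vec();
    }
    let mut rng = rand::thread_rng();
    pool.choose_multiple(&mut rng, max_links).cloned().collect()
}
```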
my guess (ha!) is that it would only be a few links per message.
Technically speaking, it is bounded by (i.e. at most) the number of feeds that publish concurrently.
Another interesting thought here: It might make sense to periodically publish dedicated messages rather than piggybacking on regular content messages, to keep the causal order tight even if you don't publish often. How often do you do this though? The extreme case (publish whenever you receive a new message) leads to a cascade of mostly useless ordering messages =)
Finally, the non-randomized parts of the above somewhat rely on everyone having access to the same set of messages (i.e. fully replicated feeds without private messages). Different views of the network mean that the links you publish to obtain the tightest causal order for yourself might not be very tight for someone else, because they might be missing the target of a cypherlink. Another argument for simply picking random feeds instead.
@moid (emphasis mine)
A distributed DB is what’s needed for that.
Not for the single-device scenario, or am I missing something? You merely need to persist the new message and the current seqnum in an atomic transaction (either both succeed or both fail, but no partial failures). Pretty sure that this is the only correct solution that exists (again, single-device only, I'm leaving multi-device frontier consensus to magic, since it is impossible to achieve in an offline-first way, i.e. with partition tolerance and availability). An ssb server needs to do this atomically somewhere, so it might as well put it into the secret file while it is doing the atomic magic. Conceptually it doesn't become more complex. The actual implementation of cross-platform atomic filesystem writes in a transaction that also includes flume writes is a different story though... still, exposing only half of the relevant data (the secret key) over the file system is kinda weird.
@Aljoscha proposed something at 36c3 where you basically sort the hashes in a forked scenario to get a linear feed again
Perhaps I didn't put enough emphasis on the part afterwards where I argued that that would be a really bad idea =(
@andrestaltz, if there's a call, I'd like to join in. This is something that I've spent a lot of time thinking about, but I haven't written down most of it yet. In Basel, I'll also be talking about this topic quite a bit I think. spoiler: I believe the only way of circumventing this problem is to not build protocols on append-only logs (instead moving similar problems to the application layer).
some older thoughts of mine on fork recovery: %EwjVUh1...
Your example here correctly points out that possession of the secret key is insufficient for safely appending. What is needed is the secret and the newest seqnum. If you magically managed to update these across devices, then multi-device setups would be safe as well. In that sense, there's nothing really new about the example you gave, it merely emphasizes the problem differently. For the single-device setup, there's a straightforward (conceptually at least) solution: when publishing a message, atomically update the newest seqnum in the secret file as well. Then, the file is truly sufficient to enable safe appends. Right now, the "frontier" information is kept elsewhere, and its importance is understated everywhere.
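For illustration only, a conceptual sketch of that single-device idea with a made-up file layout - this is not how any existing ssb server persists its secret:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Keep the frontier next to the secret: write both into one file and replace
// the old file atomically via rename. A production version would also fsync
// the file and its directory, and tie this into the same transaction as the
// database append.
fn persist_secret_and_frontier(path: &Path, secret: &[u8], newest_seq: u64) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut contents = Vec::with_capacity(secret.len() + 8);
    contents.extend_from_slice(secret);
    contents.extend_from_slice(&newest_seq.to_be_bytes());
    fs::write(&tmp, &contents)?;
    // On typical platforms, renaming over an existing file is atomic, so
    // readers see either the old or the new (secret, seqnum) pair.
    fs::rename(&tmp, path)
}
```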
@SoapDog So you can make it? That's really nice to hear/read =)
BSL is correct, as are "Euroairport" or "Basel/Mulhouse/Freiburg" - that's all the same airport. It's important to take the Swiss exit, else you'll land in France and getting to Basel could become quite annoying.
There wasn't enough interest ahead of time for us to coordinate the youth hostel accommodation - people have been finding cheaper spots on their own anyways. Easiest way of staying with other ssb folks is probably to just ask around for others who haven't settled on accommodations yet.
That's a shame, @jiangplus. Thanks for the heads up. All the best to you and yours
@mix Only thing that jumps out to me is that since context determines whether something encodes a feed or a message, there could be two separate "codes" tables, one for things of type "key" and one for things of type "feed". Feels slightly cleaner to me, but is probably completely irrelevant in practice.
@pkill9
Yes, there are "out of order messages" (usually abbreviated as "ooo"), the code in the reference js implementation is here. Few clients use it though.
If you are into technical details, you might want to see also this critique of the current ooo mechanism, and look at hypercore or bamboo that don't share that problem.
seems simple and a single byte allows us probably more message and feed-types than we will need before the box2 spec itself needs composting?!
Yes, probably. In that sense, the "type" of my tlv encodings that allows 2^64-1 variants is indeed overkill.
shaThinggyCode.writeUInt8(1) // the "magic code" for sha256 type keys
Any particular reason for 1 rather than 0?
@theblacksquid
No full C implementation, #sbotc is the closest thing. There is a go server ( #go-ssb ) primarily maintained by @cryptix and @keks, and the rust #sunrise-choir stack is somewhat close to a working server as well.
@mix Off the top of my head:
- since it is clear from context, you don't necessarily need to encode that something is a feed id or message id
- if you want to be able to extend this to future ssb changes, e.g. new kinds of messages or feeds, you might still want to add an indicator
- serialize the sha256-thingy before the raw bytes of the hash, not afterwards - when parsing you want to know what to expect as early as possible rather than retroactively
- decide whether you want to store the sha256-thingy and similar things as text or as a small numeric identifier
- if identifier, the simplest option is probably a single byte or another fixed-width integer (specify endianness!), or some varint
- store hash/key as raw bytes rather than baseXY
- either prefix them with their length, or in case of numeric type identifiers you can uniquely define a length implied by each identifier
see also https://github.com/AljoschaMeyer/stlv and https://github.com/AljoschaMeyer/ctlv (more fancy than stlv, probably overkill)
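To illustrate the shape of such an encoding - the type codes and the 32-byte lengths are just assumptions for this sketch, not a proposal for actual ssb values:

```rust
// One type byte up front (so the parser knows early what to expect), the
// length implied by the type, then the raw hash/key bytes.
const TYPE_FEED_ED25519: u8 = 0;
const TYPE_MSG_SHA256: u8 = 1;

fn encode_id(type_code: u8, raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + raw.len());
    out.push(type_code);
    out.extend_from_slice(raw);
    out
}

fn decode_id(input: &[u8]) -> Option<(u8, &[u8])> {
    let (&type_code, rest) = input.split_first()?;
    let len = match type_code {
        TYPE_FEED_ED25519 | TYPE_MSG_SHA256 => 32, // both carry 32 raw bytes
        _ => return None, // unknown type; a length prefix would let you skip it instead
    };
    if rest.len() < len {
        return None;
    }
    Some((type_code, &rest[..len]))
}
```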
CC @keks, @elavoie, @adz, @cblgh, @Daan, @piet, @SoapDog, @cft, @zelf, @Cafca, @Ace, @graham, @arj, @cryptix, @happy, @cmarxer, @Dima, @jiangplus, @hoodownr, @af, @kisumenos, @emile, @smyds
SSB Event in Basel - The Official Infothread
In this post, you will find all the information you might need regarding the Basel SSB event in one place. Also check out the event website or the handy flyer.
From Friday the 21.02.2020 to Sunday the 23.02.2020, we will gather with ssb folks and Swiss academics and students in Basel for three days of discussion on scuttlebutt and its underlying principles. While there will be some structured program blocks, the main focus lies on participant-driven sessions. Quoting the announcement blob that we will use to invite external academics:
Secure Scuttlebutt (SSB) is an application-level secure, persisted publish/subscribe system that has gained popularity in the decentralized Web movement. The aim of this workshop is to bring together researchers, software builders and members of the SSB community to study the properties and potential of Secure Scuttlebutt. The workshop is open to other decentralized approaches and networking technologies, e.g. DAT, Holochain, IPFS or SOLID, as problems and technologies often have considerable similarities. Also, an SSB tutorial precedes the workshop, helping people to get familiar with the Secure Scuttlebutt technology and value system in a short time. A travel and accommodation grant has been put in place to help facilitate attendance from people without institutional funding.
I want to participate - what do I need to know?
The venue for the event is Spiegelgasse 1 in Basel. Arrival starts at 13:00, followed by an SSB tutorial primarily aimed at the external guests. The main welcoming session will begin at 16:00. The event ends on Sunday at ca 15:00.
We ask for an entry fee of 60 Swiss francs (or 56€) to cover expenses for meals (40 francs for 2x breakfast, 3x lunch, 2x dinner) and to reroute some money to folks who otherwise would not be able to attend (see below). You can pay on site. Any excess money will flow into the european ssb open collective.
I'd like to attend but Switzerland is too expensive
If your financial situation would prevent you from attending, please reach out to @aljoscha, @cft or @zelf on ssb, or via email to christian.tschudin@unibas.ch. We can offer the following support:
- waiving the attendance fee
- covering travelling expenses (as much as our budget allows)
- sleeping at one of our places so you don't need to pay for accommodation
Please do not hesitate to contact us.
Additionally, if you have an academic affiliation and can give a short presentation on some ssb-related topic, we can access university funds to pay for your travel expenses.
I'm excited and want to do stuff and things
That's not a question, but still we have you covered. Since most of the program is done in a participant-driven unconference style, you can already think about sessions you'd like to host, discussions you'd like to initiate, etc. And if you feel like it, why not prepare a lightning talk?
Also, we'd love to see people take responsibility for some part of the event: perhaps you'd like to coordinate the preparation of a meal, host the lightning talks, know a way of gently raising people's energy on Sunday morning, or see another opportunity that is currently missing in the (still preliminary) schedule? Then please reach out to us and we'll make sure you can have your fun =). And finally, any suggestions for vegan recipes are greatly appreciated.
@cafca, @adz: you two are the first to respond to the hostel plan, so it looks like there isn't enough interest for a bulk reservation. I've heard rumors of a signal group and plans for an ssb-airbnb though.
More Basel Event Information, Yay!
Another short info dump before we are ready to do the "official", polished thread with all the info in one place:
- A preliminary website with a preliminary schedule is up: http://p2p-basel.org/
- There will be a fee of 40 Swiss francs (ca 37€) to cover all meals (2x breakfast, 3x lunch, 2x dinner)
- We will also raise a 20 franc fee for the conference itself. The main goal here is to have money to reroute to those who could otherwise not attend the event. Anyone who needs financial support to be able to attend can contact us*, and we will try to help with travel costs and accommodation (also they wouldn't have to pay the conference fee of course).
* either contact me, @cft or @zelf here on scuttlebutt, or via email to christian.tschudin@unibas.ch
CC @keks, @elavoie, @adz, @cblgh, @Daan, @piet, @SoapDog, @cft, @zelf, @Cafca, @Ace, @graham, @arj, @cryptix, @happy, @cmarxer, @Dima, @jiangplus, @hoodownr, @af, @kisumenos, @emile, @smyds
Moving Ahead With the SSB Event in Basel
Hey all,
we are making progress with the planning for the Basel SSB event that is happening from Friday the 21st of February to Sunday the 23rd of February. @zelf has joined us in the main organization effort, so now we are finally making some progress =). Here are some useful pieces of information:
- Arrival time will be around 13:00 on Friday, with an introduction to SSB for external guests (from academia etc) beginning at 14:00. The "official" welcome and intention-setting will begin at 16:00, so that would be a good time for even the seasoned butts to have arrived.
- The scheduled part of the event will end on Sunday at 15:00.
- For accommodations, we could collectively book some 4-bed rooms in the Basel youth hostel (walking distance to the event venue) for around 42 Swiss francs per night - slightly cheaper than their regular prices.
- We will likely collect a small fee to cover the costs for the communal meals. Also, we hope to find a location where we can prepare some of those together.
- After this weekend, we should be able to share a flyer for the event, a preliminary schedule, and possibly a website. And perhaps even an actual name for the event.
There'll be more updates after the weekend. For now we'd like to ask: who of you would be interested in sleeping at the youth hostel? The main alternative would be seeking accommodation on your own. We'll also organize a few sleeping slots at private places (my place, @cft's, possibly some friends of @zelf's). So if the cost of accommodation would be prohibitive to you, please reach out to us.
@keks, @elavoie, @adz, @cblgh, @Daan, @piet, @SoapDog, @cft, @zelf, @Cafca, @Ace, @graham, @arj, @cryptix, @happy, @cmarxer, @Dima, @jiangplus, @hoodownr, @af, @kisumenos, @emile
@gwil Just send me a private message with some time slots that could work for you =).
The "magical mechanism" is whichever replication/routing layer gets implemented.
It’s like if every day I wrote about my day in a new text file, and added the filename of yesterday’s text entry at the beginning. I’m keeping a diary but there’s no diary really – it’s a concept in my head, just individual text files referring to each other. Am I understanding this right?
Yes, that's exactly it. And then there is also a magical mechanism that automatically pushes copies of newly created pages to whomever is interested in the diary which they are part of (that's the really interesting part: the diary itself is not a piece of data, but it is still something that does exist and can be referred to. It is a "piece" of codata).
I'm limited in my typing capabilities right now (wrist trouble), but I'm always happy to hop on a call and give more context to help with understanding stuff.
@hoodownr No container format yet, and I'm not particularly opinionated on that stuff anyways. Just ship whichever bytes you want, bamboo doesn't care.
Confusingly enough, when I'm writing on bamboo, I use the word "log" to refer to what @cel has defined as a "feed". Bamboo is quite minimalist in that sense: there are logs made up of entries, and that's it. Every entry contains exactly one signature and one or two hashes of other entries.
In a sense, logs don't really exist, there are just a bunch of individual pieces of data. Some of these happen to form syntactically correct entries. Some collections of these entries also fulfill certain validity criteria (all signatures by the same author, backlinks and lipmaalinks forming a long chain, sequence numbers denoting the position in those chains). These collections are then called "logs". But perceiving them as logs is an external attribution or perhaps recognition of structure. There is no piece of data that can claim to "be a log".
Thus, it is also impossible to actively categorize or collect arbitrary entries in "logs". The log structure is uniquely determined by the entries themselves. And since entries are immutable values ("data"), there's no changing that later. Whoever publishes entries has full control over the kind of structure they are grouped as, and thus over the structure that can be exploited by the protocol for efficient replication.
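A stripped-down sketch of that "a log is a property of a collection of entries" view - it leaves out lipmaalinks, payload sizes, end-of-log markers and signature checking, all of which real bamboo entries have, and the hashing layout here is made up:

```rust
use sha2::{Digest, Sha256};

struct Entry {
    author: [u8; 32],
    seqnum: u64,
    backlink: Option<[u8; 32]>, // hash of the previous entry, None for seqnum 1
    payload_hash: [u8; 32],
}

fn hash_entry(e: &Entry) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(e.author);
    h.update(e.seqnum.to_be_bytes());
    if let Some(b) = e.backlink {
        h.update(b);
    }
    h.update(e.payload_hash);
    h.finalize().into()
}

// "Being a log" is not a piece of data, it is a property that a collection of
// entries (sorted by seqnum) either has or does not have.
fn is_log(entries: &[Entry]) -> bool {
    let start_ok = entries
        .first()
        .map_or(true, |e| e.seqnum == 1 && e.backlink.is_none());
    start_ok
        && entries.windows(2).all(|w| {
            w[1].author == w[0].author
                && w[1].seqnum == w[0].seqnum + 1
                && w[1].backlink == Some(hash_entry(&w[0]))
        })
}
```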
Does the existing legacy JS implementation conform to this?
Yes. The process for getting that data set was the following:
1. implement stuff in rust to the best of my knowledge
2. let a fuzzer run over that rust implementation, producing a varied set of data
3. feed that data to the node reference implementation; if the node implementation does something different than the rust implementation, adjust the rust implementation and go back to 2.
4. profit
@moid In case you haven't seen this data set (since it is technically generated from rust, not js): https://github.com/sunrise-choir/legacy-value-testdata
That's all the test data you will ever need. Also, ignore the protocol guide and follow the spec instead =)
See here for more comments.
@af Accommodations will be self-organized. We might be able to organize a few couch or floor slots.
It is decided, the event will take place from Friday the 21st of February to Sunday the 23rd of February.
Can you share the link here so we won't be surprised by hundreds of unexpected attendants? =)
@Alex The main problem with cutting off the suffix rather than invalidating the whole thing is that the malicious actor can repeatedly fork again, but at an earlier point. If they want to delete the full thing, they can always do that by producing two (or more) first messages. But they can cause even more havoc by repeatedly altering (rolling back) the state of the world, converging towards the world where the feed doesn't exist.
and discarding the two (or more?) messages that caused the fork,
Give me a precise definition of what to discard and what to keep (and also the "new" and "it" in "prevent any new messages from being added to it"), and I'll try to tell you how it is broken =P.
@kisumenos I'll mention you in the next update, which will happen soon-ish. Hopefully definitely.
@Ace Jup, exactly.
@elavoie and @Offray
The (pretty much deprecated by now) bamboo point-to-point spec includes some careful work for detecting, proving and propagating information on forks, but it is scattered throughout that document. I'm happy to chat more about that, but preferably using voice to rest my wrists. I consider that topic to be pretty much solved (for me), I merely haven't clearly communicated that stuff yet.
The hard part is to decide on what to actually do when a fork is detected. Currently I'm leaning towards invalidating the whole feed, but there doesn't seem to be an objectively right solution.
@Ace The one in the message object is when the message was published (or rather, the time that the author claims the message was published). The outer timestamp is the time when your ssb-server received the message. It is an implementation detail of the js implementation that has nothing to do with the general ssb protocol. The js server allows sorting messages by receive time.
@hoodownr p2panda is why bamboo is named as it is =)
I'm going to talk about append-only logs at day two, 14:00 as a self-organized session: https://talks.komona.org/36c3/talk/D9FGXK/ (
Perhaps some of you would like to listen? Interesting for those who want to know more about how ssb works, and for those who want to know the biggest inherent problems.
streaming abstractions might happen in a (even) more informal setting some time on day four.
Looks like the part @emile linked to is the correct one, link to a German site for the part (sorry but I don't know anything about this stuff): https://www.teilehaber.de/itm/lenker-radaufhaengung-hinterachse-rechts-jp-group-1150201480-src1824609.html
Also, they'd want to buy the thing.
A shot in the dark: can anyone who attends #36c3 (or happens to be in Leipzig on 27.12.2019) bring a specific spare part for a Volkswagen T3, namely the swing arm for the right rear wheel? Asking for a friend of a friend, apparently a wedding (gift? surprise? thing?) is at stake. If anyone could help, I can relay the friend's contact information =)
I made an error here:
Bender, Michael A., Jeremy T. Fineman, and Seth Gilbert. “A new approach to incremental topological ordering.” Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2009.: assigns labels (“weights”) to all nodes, an order on the labels is consistent with a topsort on the nodes (this approach is unsuitable for ssb, appending messages to the longest log consistently hits the worst-case label update cost)
In the sorting we actually care about, the direction of the edges is from older to newer, not from newer to older. So it is wrong that in ssb one would continuously hit the worst-case label updates on append operations. The approach would actually work fine for ssb.
(CC @christianbundy)
@Rabble There are no merkle trees involved actually =P
PLEASE IGNORE UNLESS YOU’RE INTERESTED but is it completely ridiculous to switch between groups of 2 and 3 to try to get the average group size approach E?
I HAVE NO IDEA I'VE BEEN WONDERING THE SAME FOR MONTHS NOW but I'm bad at math so I'm unable to compute whether these are indeed more efficient than the uniform group size of three. I'm also too bad at math to figure out whether the paper already covers that case or not. They claim to have found the optimal graph and then give a proof (that is too complicated for me to check, or at least too combinatoric...) of optimality. But who knows, maybe they made an error. On the other hand, there is no particular reason why radix economy should even have something to do with the path length in those graphs.
Also, the paper deals with maximum path length (i.e. the worst case), we might actually be happier with optimal average path length (as long as there is still a reasonable bound on the worst case).
Would this even be a good idea?
Nah, probably not... =D
Have a drawing of the base-2 graph:
As opposed to the base-3 one:
It turns out that three nestings are indeed more efficient than two: https://pdfs.semanticscholar.org/76cc/ae87b47d7f11a4c2ae76510dde205a635cd0.pdf
Any number of nestings works, so there's a whole family of graphs. I usually explain them using the 2-based one, since it's the simplest one. But in most situations, a real implementation should go for the more efficient one.
I’ve spent too much time around computers and usually default to binary.
In case you didn't know already, you might be surprised that three has better radix economy than two, and thus in principle base-3 computers can be more efficient than base-2 ones. See also ternary computers.
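The back-of-the-envelope version of that claim (standard material, nothing specific to ssb or the paper above): representing N distinct values in base b takes about log_b(N) digits of b states each, so the cost is

```latex
E(b, N) \approx b \cdot \log_b N = \frac{b}{\ln b}\,\ln N,
\qquad
\frac{3}{\ln 3} \approx 2.73 \;<\; \frac{2}{\ln 2} \approx 2.89
```

The continuous minimum of b / ln(b) sits at b = e ≈ 2.718, so among integer bases, 3 comes out slightly ahead of 2.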
{ "type": "tamaki:publication", "img": "&PfZXiJzgVEcvQ1jSzt+vgpfv6fIprLcGrWcxmTxd5ZA=.sha256", "title": "Ghostly Bellpepper", "description": "Why did I draw a green bellpepper in the first place? I don't even like green bellpepper. Well, the scanner aparently doesn't like them either. At least one can still see that two of these green pencils are actually way too blue for a bellpepper... And on the plus side, the scanner hid my bad technique =)", "caption": "Pencil drawing of a bellpepper, but scanning didn't work well, so it looks like the brightness had been turned up." }
Bart Jacob’s book on coalgebra
I've started reading and working through some exercises, then I got distracted by other shiny things. Same with this category theory book. I'll get there eventually. In any case, that first chapter of the coalgebra book alone made so many pieces in my mind suddenly click, it was amazing. I guess I never thanked you for that recommendation, so: thank you!
If you can explain what you’re looking for I might know some other references.
Nothing specific, was merely glancing at the relation between streams and comonads. The most helpful text (to me at least) I found was this.
I can sort of see the partial replication piece but I’m not sure how this will enable soft deletes.
Soft deletion is made possible by signing the hash of the payload rather than the payload itself. I can then locally delete the payload while keeping its hash around for verification purposes. Gabby Grove does the same. I've begun to grow skeptical of this approach after realizing how my attempt at a point-to-point replication protocol got drastically more complex because of missing payloads. I'm currently thinking more about settings where soft deletes are either impossible or restricted to messages that one only needs for partial verification rather than for their content.
Do you plan to combine this with Gabby Grove?
See %4qRWSVD... (also bamboo is both more efficient and more expressive than Gabby Grove)
Looking over Lipmaa's thesis I'm kind of wondering what the significance of base 3 is in your formulas. I can see there's a result referred to on optimal anti-monotonic graphs, but that terminology is not common in the graph theory literature I'm aware of. These are DAGs that have some invariant properties under edge deletions?
They define these "anti-monotone graphs for binary linking schemes" as directed graphs on the natural numbers where:
- each number has an edge to its predecessor, and edges only point to smaller numbers ("linking scheme")
- each number has at most two edges ("binary", duh)
- for w <= x < y <= z, if there is an edge from z to x, then there cannot be an edge from y to w (note that y is strictly greater than x). Equivalently: the function from nodes to the edge target is antimonotone. Intuitively: imagine the naturals on a 2d coordinate system, it must be possible to draw the edges without any crossings and without drawing below the x axis.
Motivation: corresponds to verifiable append-only logs (and equivalently secure relative time-stamping). Predecessors ensure reachability between any pair of nodes. If edges preserve causality (as they do when they are implemented as secure hashes), reachability is the happened-before relation. A log is not forked iff the happened-before relation is a total order. From antimonotonicity it follows that any path between two nodes includes the shortest path between them. Thus it is possible to perform verification greedily along the lipmaalinks - even though you are "skipping" over some backlinks, you never miss vital information (this is handwavy, but if you want non-handwavy, you can read the papers =P).
Of all the binary antimonotone linking schemes, we are interested in those where for all nodes x, y there is a path of length O(log(y - x)) (for y >= x w.l.o.g.). Lipmaa et al give an infinite family of such graphs, one for each natural number. The family member corresponding to the number three is the one with the shortest average path length.
http://www.cs.ioc.ee/~tarmo/papers/cefp05.pdf
Informally, a Freyd category is a symmetric premonoidal category together and an inclusion from a base category. A symmetric premonoidal category is the same as a symmetric monoidal category except that the tensor need not be bifunctorial, only functorial in each of its two arguments separately.
Why, thank you for that informal introduction. I can't even tell whether that first sentence is grammatically correct =/
I think that paper might contain information I'd like to know, but I don't think I'll get it from there...
Feature request: if the name starts with Patch (case-sensitive obviously), the components of the version number must be read in reverse.
Also a suggestion for @mix: chances are that @ChristianBundy will forget to specify the precise format of the version (how many components? suffixes for alpha/beta/arbitrary-text? arbitrary markers between numbers? maximum value for numeric components? etc.), because nobody specifies that part. Rather than choosing your own, slightly incompatible version format, you could try to find something that is technically compatible but crashes his implementation. Stuff like version numbers that round to infinity, veeery long version numbers, etc. Bonus points if you manage to crash a plugin that takes down the whole sbot =)
The point I made here still stands though: to help with the current situation of post thread rendering, adding information on how to interpret root and forks (or whatever these things are called) might be more helpful than the useragent thing. Also, I might be uncomfortable with revealing my useragent but would be ok with merely revealing how threads should be interpreted. In that vein, please make the version component optional so that paranoid anti-fingerprinters can skip it.
For what it's worth, I prefer the object over the string.
Best someone can do here @Thomas Renkert is if they access your secret, they can read all your private messages, and they can impersonate you.
They can also deliberately fork the feed at sequence number 1, thus making all of the old content invalid. Ssb doesn't implement this invalidation yet, but it probably should. So if someone gets hold of a secret key, they can effectively delete the feed's content. Current ssb merely fails to propagate that information properly. But at the very least, the data might not propagate to anyone who doesn't have the "old" content yet (note that "old" technically isn't an accurate term, since from the system's point of view, neither of the two branches of the feed is preferable over the other).
(Sorry for the technical language @Thomas Renkert, but this is indeed a purely technical artifact of ssb's protocol - if we could solve this problem on the technical layer while keeping all decentralization properties, we would. Unfortunately, the only known solution is proof-of-work-like consensus à la insert-your-favorite-cryptocurrency-here).
@John
Random comments on Feed.hs and implementing the signing format, ranging from possibly helpful to quite likely annoying:
- The most accurate reference documentation you will find is here for message metadata and here for message content.
- If you are fine with creating bindings to native code, use this rust implementation rather than running your own. It does exactly what you need and is well-tested.
- If you do run your own, use this set of test data to catch errors in your implementation. The readme lists the major pitfalls an implementation can run into.
- Use an existing library for formatting floats in signed json. Chances are you will need to fork something and tweak some hardcoded parameters. You probably don't want to do this in Haskell, and I expect the existing formatting options in Haskell to be bindings to native code. If running your own bindings is an option, then you can use this for ssb-compatible float formatting. Creating bindings for that crate should be simpler than for the whole verification machinery, since it has virtually no dependencies.
- Json escape sequences can be tricky (careful with surrogate pairs), but the test data set will catch that stuff for you. Rolling your own is annoying but can be done in a few hours or less.
- Serialization order of object entries is weird, but not too bad once you know about it. The hard part is accepting that this is really part of the protocol. Again, the test data has you covered.
- Aside from the float formatting, the whole thing isn't actually that bad if you mindlessly follow the documentation linked above. Sure, it's more complex than needed, but it can be done in a reasonable amount of time.
- Make sure your in-memory representation of things matches the ssb data model. That might mean not being able to use off-the-shelf json implementations (not that they'd print floats the way you need anyways). Representing sequence numbers as Nats is probably fine (though remember that trying to create a message with seqno > 2^53 will result in an unverifiable message), using Time for timestamps seems like a bad idea.
- Remember that ssb floats exclude NaN, Inf, -Inf and -0 (see the small sketch below).
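The sketch mentioned in that last bullet, just spelling out the float constraint (the function name is made up):

```rust
// Rejects exactly the values excluded by the ssb legacy data model as
// described above: NaN, Inf, -Inf and negative zero.
fn is_valid_ssb_float(f: f64) -> bool {
    f.is_finite() && !(f == 0.0 && f.is_sign_negative())
}

// is_valid_ssb_float(1.5)      == true
// is_valid_ssb_float(f64::NAN) == false
// is_valid_ssb_float(-0.0)     == false
```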
@corlock This wouldn't be an official congress talk, it would just be me borrowing the place where ssb people hang out to dump some thoughts.
@Rabble Yup, this would be pure software engineering. I can simply do two sessions, I guess. If I end up talking about append-only logs, it will also involve a part on the importance of how such protocols deal with forks. That could be of interest to dat/hypercore people as well.
I have a feeling extracting this part of it into a library in a more sensible language would be pretty good.
Fixed @dominic =P
Huh, good that you are asking. While you could traverse the path of hashes, you couldn't verify the signature of y if it is missing its content. Perhaps including two signatures could help there? It's too late here to properly think through that though, and it certainly removes some elegance. Perhaps this simply doesn't work out at all.
@elavoie
To verify that two messages are consistent with each other, one must trace a path from the newer one to the older one, checking that hashes match up. For ssb, there is exactly one such path between any pair of messages; in general antimonotone schemes, there is exactly one such shortest path. Suppose you only cared about two messages, x and z, and you want to verify that they are consistent with each other. Suppose further that the shortest path between them is z -> y -> x. If you already had message x and asked a peer for message z, then they would also have to deliver message y to you so that you could trust that what they claim to be z is indeed consistent with your x.
In regular ssb, this means that if the other peer has locally deleted the content of message y and only kept the metadata, then they couldn't satisfy your request. If z however contained the regular hash of y as well as the content-less hash of y, then they could provide the metadata of y without the content, allowing you to trace the path from z to x, even though they didn't store the content of y. If later you did care about the content of y and fetched it, then you would of course verify that the regular hash of y claimed by z also matched.
An Alternate Offchain-Content Mechanism
Just a small idea that just came to me: instead of implementing #offchain-content by making the message body a hash of the actual payload, it would also be possible to have two versions of each link in the sigchain, one that hashes the metadata and the payload, and one that skips over the payload. This mechanism avoids the indirection of the other approach and thus feels slightly more elegant. Not sure whether this is practically relevant at all, but it is fun how after months of thinking about something you can still come up with new approaches that feel completely obvious in retrospect.
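A rough sketch of what such dual links could look like - field names, layout and the use of sha256 are all invented for illustration, this is not a spec:

```rust
use sha2::{Digest, Sha256};

struct DualLinkEntry {
    seqnum: u64,
    payload: Option<Vec<u8>>, // None if the payload was deleted locally
    payload_hash: [u8; 32],
    backlink_full: Option<[u8; 32]>,     // hash over metadata + payload of the previous entry
    backlink_metadata: Option<[u8; 32]>, // hash over metadata only of the previous entry
}

// The content-less hash: covers only the metadata, so it can be recomputed
// (and served to peers) even after the payload has been deleted.
fn metadata_hash(e: &DualLinkEntry) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(e.seqnum.to_be_bytes());
    h.update(e.payload_hash);
    if let Some(b) = e.backlink_full { h.update(b); }
    if let Some(b) = e.backlink_metadata { h.update(b); }
    h.finalize().into()
}

// The regular hash: additionally covers the payload, so it can only be
// computed while the payload is still around.
fn full_hash(e: &DualLinkEntry) -> Option<[u8; 32]> {
    let payload = e.payload.as_ref()?;
    let mut h = Sha256::new();
    h.update(metadata_hash(e));
    h.update(payload);
    Some(h.finalize().into())
}
```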
I guess I could also do something on append-only logs and the graphs powering #bamboo. That would be less interesting for me since I've already shared that stuff before, but it would fit better into the dweb space.
Mini-Talk on Stream Abstractions at 36c3
Since it would be nice to have some sessions in the ssb space at #36c3: Would enough of you be interested in listening to me rambling about API design for working with sequences of data (iterators, streams, sinks, observables, etc) for ca 45 minutes? These are some of the most fundamental abstractions in programming, and e.g. the pull-streams powering the js ssb implementation demonstrate what a big impact these designs can have. Whereas most such APIs are fairly ad hoc, I'd like to present a rather systematic, language-agnostic approach that I've been mulling over for quite some time now. Would come with fun tidbits about theoretical computer science.
@lxoliva Patchwork scans for recipients only in the first message of a private thread. So you can send a dummy message to your friend, and then answer that message with as many profile links as you wish.
not being able to delete delete requests and contact messages otherwise Index rebuilds for the friend graph will get very messy and time consuming.
Other things will also get very messy and time consuming, that's unavoidable when adding deletions. Anything that maintains state outside of logs (e.g. db indexes) needs to either adopt more complex algorithms or accept the fact that deletions can trigger a full or partial rebuild. Special-casing some messages does not change that and seems rather short-sighted. What happens if contact messages get replaced by something else? Will the new thing also be added to that set of forbidden types? Who maintains this set, and how? How do servers deal with changes to this forbidden set? They might interact with other peers that don't know about the changes yet, or peers that know about some future changes that aren't known locally. Lots of new and fun sources of complexity, for a rather questionable gain.
I hope that you all are very aware of the massive increase in complexity that deletions bring to the whole stack (beyond the mere protocol).
Cory: Are we trying to build libs that support async and sync? If we do, should the apis match up?
Sean: Let's follow the rust ecosystem as the example. If that's what all the other libs do, we should do it too. But it remains to be seen. There are lingering ergonomic issues.
Dhole: async closures, wakers, are hard at the moment. I'm still learning futures and the library is very young.
Sean: I feel the same way. The real trouble I've had with futures is when things get complex like muxrpc where you have multiple streams on top of another stream. I've tried to write packetstream differently to how aljoscha did it, but it still feels quite clumsy. I've got a fantasy that one day async will be great but for now we should keep exploring in sync.
Alj: On async stuff. Sean already mentioned the code I wrote ages ago. I think it is possible to write an abstraction over it. I was learning the APIs as I went, so probably the code I wrote is not perfect. One non-obvious problem that blocked me: it was not possible to schedule something that wasn't 'static. You have one underlying async thing you can write to. And then you have virtual streams on top of it. You want to be able to send something into a virtual stream and have it flush automatically, but that was hard to do. The new futures API is way better, so some of these problems might have vanished.
Cory: If we build an app that uses async, does async force you to have everything in one executor?
Alj: It doesn't have to be; one thread could run the library code on an executor, and another thread could run (sync) app logic and then communicate to the eventloop thread in whichever way it likes
Cory: Are we looking to build an ecosystem where people are building on top of flume via flume views,
Mikey: I do think flume is a good idea. It's just not good in js. I think the append-only log + views is a good abstraction. We've implemented the existing js offset log. It's the source of truth. And the views answer queries. From the developer's point of view, patchql is where we're heading. People seem excited about it and it seems extensible. For the layer below it (flume views)
Piet: I agree with mikey; also just build ssb-db, which is flume-like. Offset log + sqlite view. answers questions needed for basic replication. you could and should run patchql alongside that for higher level queries. SRC probably won't build a general-enough stack to handle every possible eventuality
Cory: Unclear whether flumedb-rs is mature enough to use; can I be building flume views based on that?
(too busy listening to type good notes)
Cory: What is the goal wrt multi apps on one computer? How does single write work?
Sean: SRC has talked about this. Matt has done thinking about this. Patchwork + patchbay historically didn't play together well. I'd like it if people could come into the rust stack at any level and build the things they need. Patchql is very specific, aimed towards a "classic" scuttlebutt app, it uses the current ssb message encodings.
Mikey: To answer the 'is there 1 ssb server and how do we manage that?' question: it's a looooong running conversation. Our current thinking is that every app has its own ssb-server. The other part to mention is that we haven't gotten to the ssb-server layer of the stack. We don't have a cohesive developer-facing api like js does with flume plugins. Or configurations. We are getting closer because we do have working lower level modules.
Sean: My motivation medium term is to get the minimum viable peer working that can sync back and forth with existing ssb. With patchql someone could build an app on top of it and that will force the lower layers to mature. We're still learning about the apis we designed. Maybe after that we can focus on making the lower layers usable eg doing lower level network stack stuff. I'd love it to be fairly general purpose. Cory, if you're keen to play with your chat app, that'd be fucking awesome.
Aljoscha: Conceptually, we've been exploring flume ideas in basel as part of my internship for @cft. We're looking at things rather abstractly. Cory asked "we have this log that is a single source of truth... and then what?" What we think is a good way forward is that the "views" are able to interact with each other: "views" can ask other views for information to build up more data. Current js flume design flaws: they build indexes whenever new data comes in, they are sync, and no other views can proceed while one is still indexing. We want to build the views to be lazy, so they only build when they're queried. And they can be passed batches of information to index.
Sean: I'd like to have an interface that is higher level than flume. But if you need more control then you can use flume.
(Lots of chatter about how to modularize the rust ssb stack. Perhaps for now just use patchql and don't implement things like ssb-friends that are a subset of patchql's functionality)
Cory:
Mikey: WRT message encoding. I think lots of us are keen to change the encoding. And the message types. Especially with tangles etc, we're not tied to the existing message types, they're not set in stone. We will want a way to configure to be able to handle legacy + new message types. But we still want to get the stack to the point it can talk to the existing stack. I don't think anyone is convinced that the current stack is good, all we have to show for it is a bunch of burned out contributors.
Alj: A useful level of abstraction might be an append-only log of arbitrary bytes, so that the message metadata encoding can be swapped out. The question is what is an append-only log? Does it support partial replication? Is content on-chain or off chain?
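A strawman of what such an "append-only log of arbitrary bytes" interface could look like - my own sketch, nothing the group agreed on, and all names are made up:

```rust
/// An append-only log of opaque payloads, so the metadata encoding can be
/// swapped out in the layer above.
trait AppendOnlyLog {
    type Error;

    /// Appends an opaque payload, returning its sequence number (1-based).
    fn append(&mut self, payload: &[u8]) -> Result<u64, Self::Error>;

    /// Retrieves the payload at `seq`, or `None` if it was never written
    /// or has been (soft-)deleted locally.
    fn get(&self, seq: u64) -> Result<Option<Vec<u8>>, Self::Error>;

    /// The sequence number of the newest entry, or `None` if the log is empty.
    fn latest(&self) -> Result<Option<u64>, Self::Error>;
}
```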
Alj, re state of bamboo: exploring grow-only sets rather than append-only logs; could allow multiple devices to use same id. bamboo is mostly finished; confident that partial replication is good; might want to change it so that it carries arbitrary bytes, and one extension could have offchain content, other extensions might not.
Actions:
- Sean will post a thread which is a general rust coordination thread. I'll summarise the current state of things. I'll probably play more of a facilitator role.
- Sean: Schedule another call in 2 weeks.
- Dhole & Adria0: will try to get an async handshake + boxstream, and will start exploring an async muxrpc API.
Call notes
rusty scuttlebutt collab
Attendees
- piet
- sean
- aljoscha
- dhole
- adria0
- cory
- mikey
- moid
Intros
- Sean
- has been working on src. But more recently been busy with fixing up a new house. Keen to get back into rust because construction work isn't coding. Wants to know what little bits he can work on in between
- Aljoscha
- ~2 years ago did the crypto + shs. Was able to send a whoami to a server. Saw the recent post about whoami and was intrigued.
- More recently did work for SRC with the message encoding
- not actively coding on ssb in rust currently
- Dhole
- was looking for a project in rust and found out about ssb. I started working on shs and boxstream. Adria did some rpc + got whoami going.
- Adria
- working with dhole the last year.
- Piet
- in the Netherlands for a few months now, originally from NZ
- trying to wrap up ongoing ssb rust work in favor of a "real" job
- "replication and stuff"
- an Android app (TBA to the public soon-ish)
Cory
- unmutes himself via left-shift
- doing recurse center in NY, learning rust and crypto
- personal project: use secret handshake and ssb broadcast for ephemeral chat (encrypted via boxstream), interacting with ssb database for name-resolution etc
- working with scott on a non-async fork of the networking stuff
Mikey
- live in nz
- do rusty stuff
- part of the Sunrise Choir, but more of the co-ordinator, not coding. Really just the big bad boss
- been around scuttlebutt for ~5years
- keen to see ssb become fun and useful, not just for hipsters
- summer is coming and I just bought a van and I'm gonna convert it to a camper
- do web dev professionally.
- art~hack tonight.
moid
- have been following SRC/rust stuff for a while; just listening in mostly
Future stuff
Sean: Dhole & Adria, what are your plans going forward?
- Dhole: We're using async-std, want to do everything async. Am still using some unstable features (pin) but hoping it will stabilise.
Cory: Is your code sync?
Dhole: It's sync but we're trying to make it async.
Sean: For the last year I've been waiting for async, but we're still running into pain with the async stuff. Piet hit pain with async borrow checking. I've enthusiastically encouraged scott to have a go porting existing code to sync. I'm cautiously optimistic that async will get nicer and nicer.
(The above is not really accurate, it's been a while since I thought through this stuff. Basically I haven't been able to generalize append-only ropes to tangles ("append-only braids"?), except when the underlying monoid operation is commutative.)
To confirm, my intention was not to nominate @cblgh, I was merely encouraging people to buy him some mate. As a compromise, perhaps cblgh could receive whatever a club mate at #36c3 would cost.
Re partial replication of tangled identities: I've also hit a wall there. You can do it somewhat trivially if merge conflicts are impossible (i.e. ordering doesn't matter) by just doing partial replication of all the involved feeds (e.g. via lipmaalinks or via hypercore's mechanism). But I couldn't figure out the general tangle case either. Then again, tackling partial replication of arbitrary DAGs (that's what it would amount to) before even knowing how to do full replication of arbitrary DAGs efficiently is very ambitious anyways.
I had already resigned myself to that grinding having been for nothing
Potentially the next meetup could be in connection to 36c3 or at the Basel meetup? @Josh-alja
Sure, the Basel meetup seems like a nice opportunity.
But I see the issue, even though I’m not sure it’s so bad because we only introduce cycles with 1 vertex.
Agreed, loops should be ok.
I haven't really followed the whole private groups discussions, so I can't really comment on multi-group messages.
CC @piet %G3QkiZf...
It would be so lovely to have you at #36c3 =)
For practical reasons it would be less of a hassle for me if you dropped me from that list, leaving more for the other people. If you insist on expressing some support for my previous work, you can buy a club mate for @Alex at #36c3 on my behalf =)
=P
More seriously though, if I understood those posts correctly, you are doing something that is isomorphic to having root and previous be optional. Whether you encode that as someHash and null, someHash and specialSelfMarker, or Some(someHash) and None (in an imaginary world with static typing) doesn't really matter. That self reference sounds more like a different way of looking at an optional entry than an entirely new thing.
Now if you started to allow this %self reference in other places as well, then we might run into problems (i.e. directed cycles in the graph). But surely that wasn't the intention, was it?
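Spelled out, the claimed isomorphism is just this (SELF_MARKER stands in for %self and the function names are made up; it only works because no real hash can ever literally equal the marker):

```rust
const SELF_MARKER: &str = "%self";

// Sentinel encoding -> optional encoding.
fn to_option(field: &str) -> Option<&str> {
    if field == SELF_MARKER { None } else { Some(field) }
}

// Optional encoding -> sentinel encoding.
fn from_option(field: Option<&str>) -> &str {
    field.unwrap_or(SELF_MARKER)
}
```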
@queks This does not terrify me.
Our append-only log paper is here [...].
Huh, I read over that paper already =).
Their logs are different from ours in structure: if I understood it correctly, they maintain grow-only sets/dictionaries that evolve over time. So whereas ssb has one piece of data at each position in the log, they have a set/dictionary at each point in the log that can be obtained from the previous one with an insertion operation. These associative data structures are indeed a lot more tricky than what ssb needs to do. I'd still argue that "our" usage of the term "append-only log" is more appropriate.
It is decided, the event will take place from Friday the 21st of February to Sunday the 23rd of February. Technically, skipping the Friday got one more vote, but 14 * 3 * 24h > 15 * 2 * 24h, which is to say that we really want to have three days of scuttleness =)
@keks, @elavoie, @adz, @cblgh, @Daan, @piet, @SoapDog, @cft, @zelf, @Cafca, @Ace, @graham, @arj, @cryptix, @happy, @cmarxer, @Dima, @jiangplus, @hoodownr
@af, please do come by for Saturday and Sunday as well, we can save the design and lispy bits for then =)
We'll keep y'all posted with more information and updates as we figure things out ourselves
I may hate timestamps, but this is still something I'll probably use at some point 😗
@hoodownr Not sure if serious (but most definitely hoping so).
Bump =)
Results will be accumulated on Friday (-ish). You can of course still attend without having participated in this poll, this is only for settling on the date.
CC @cft, @piet, @adz, @keks, @cryptix, @zelf, @Alex, @Powersource, @elavoie, @graham, @arj, @happy, @cmarxer, @Dima, @andrestaltz, @Daan, @SoapDog, @Cafca, @Sophie, @smyds, @jiangplus, this list is not exhaustive <3
Please like this post iff you would attend the event if it was held from Friday the 21st of February to Sunday the 23rd of February.
Please like this post iff you would attend the event if it was held from Thursday the 20th of February to Saturday the 22nd of February.
Please like this post iff you would attend the event if it was held from Monday the 24th of February to Tuesday the 25th of February.
Please like this post iff you would attend the event if it was held from Saturday the 22nd of February to Sunday the 23rd of February.
Please like this post iff you would attend the event if it was held from Thursday the 20th of February to Friday the 21st of February.
Butt-Scuttle Basiliensis
Settling on a Date
As you may have already seen, there will be a gathering of scuttlebutts in Basel, Switzerland in February 2020. Now we want to determine the precise date.
We'd like to place the event somewhere from the 20th of February (a Thursday) to the 25th (a Tuesday). If we do two days, we could either fit them in the weekend, or put them adjacent to it, leading to a free weekend with lots of butts in the same place. Or alternatively we could use the weekend + one additional day for the main event.
So here are the options:
- 20th - 21st (Thursday - Friday, two days)
- 22nd - 23rd (Saturday - Sunday, two days)
- 24th - 25th (Monday - Tuesday, two days)
- 21st - 23rd (Friday - Sunday, three days)
- 22nd - 24th (Saturday - Monday, three days)
There are five posts below, one for each of these dates. Please like all the dates on which you would attend the event. Please don't like those posts if you don't plan on attending. If you feel like it, you can also drop a comment in this thread regarding your preference on two vs three days. Or comment on anything else that comes to your mind as well =)
Crucial hole in the list of related work: Van Renesse, Robbert, et al. "Efficient reconciliation and flow control for anti-entropy protocols." Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware. ACM, 2008. beat Dominic to the name, they define a gossip protocol called "Scuttlebutt". Funnily enough, they also deal with dissemination of (key, seqnum) pairs, but they throw away all but the newest value per key.
CC @Dominic
@moid That link doesn't work either, but I could read the article by accidentally switching browsers (firefox crashed, gave me an "oops I crashed" dialogue and then loaded the page in chromium... I don't get it either). I think they either didn't like disabled cookies or disabled referer headers (you'd be surprised how many sites break once you disable those...).
I don't know whether I should be saddened more by the fact that apparently one of my user agents doesn't deserve to render that article while another does, or that they did transfer the data and then remove it from the GUI, making it so that I can't (easily) access the data that (briefly) resided on my own machine.
Re Merkle trees … the atomspace is a graph database. It's a “well known” theorem of #mathematics that every graph can be decomposed into a collection of trees.
@Linas Could you expand or point to some more information about how you do the decomposition? Does it efficiently handle arbitrary updates of the underlying graph?
Aaron Sorkin: An Open Letter to Mark Zuckerberg
Log in or create a free New York Times account to continue reading in private mode.
Someone somewhere seems to assign a different meaning to the word "open" than I do. It doesn't help that it initially loads the text and then later removes it via js.
@moid, mind sharing an illegal copy of that open letter?
Implication? Riffles imply information about the riverbed, but the riverbed doesn't imply existence of riffles.
If you want to go beyond the pure yes/no style of logical implication, this looks like a homomorphism (or just morphisms) if you squint a lot. A homomorphism from A to B indicates that the structure of A can also be found in B (and thus A influences B), but the other direction doesn't necessarily hold.
Perhaps you'd have more luck in #philosophy rather than #mathematics.
That web of documents piece is something I've been thinking about as well. One option for a moderately simple yet useful application to test-drive #bamboo is what I like to call the markdown web (mwd). Essentially the www but without js, css, replacing html with markdown, and http with bamboo. In a sense a modernized, decentralized gopher. That idea has been floating around my brain for years now... hopefully bamboo will get there at some point.
On the other hand, I think that a well-designed capability system controlling side-effects in a "general-purpose web-like thing" would be the better solution. But it'd be much more complex of course.
Note to self/TODO: Add an option for advancing the cursor past multiple payloads at once that don't pass the min/max size filters in a payload-only synchronization.
@cft Thanks for the feedback =)
Wow, quite a complex thing
No, it really shouldn't be, that's why I am so unhappy with the presentation =/. The system is actually rather simple: you set up ranges and then transmit log data from oldest to newest. But then there are tons of details, and it becomes hard to see the forest for all those trees. Having to distinguish between metadata and payloads, allowing to skip over some but not all things, correctly handling concurrent actions, these things add up. It's not that any of these are complex to handle, there's usually one obvious solution. But spelling it all out makes it look like a lot is going on. And to be honest, I just wanted to get this thing out there because it has taken so much time to spell out all the details...
Can you add some high-level introduction and definitions?
Those are supposed to be defined in the bamboo spec. Which I'm currently rewriting, so I can't link to those definitions just yet. An actionable takeaway here for me is that I'll add a glossary at some point.
Also, it seems that you have “standing proposals” - maybe we could and should call them “subscriptions” then?
Yes, I'm unhappy with the terminology as well. I did a last-minute change away from more standard terminology ("query", "request", "subscription"), because I want to convey that these things are fundamentally bidirectional. Once two endpoints have established that they both care about some feed starting at a certain point, then any of them can push data to the other once it becomes available. That's why I chose "synchronizing" rather than "serving" or "replicating". Perhaps "equalizing" would be another choice. Or simply "sharing"? "Merging" or "joining" in the sense of lattice operations (the whole thing can be seen as computing the join over the two data sets of the endpoints)? "Jointly advancing the knowledge frontier" seems a bit heavy-handed. But I agree that calling this point-of-synchronization-thing a "proposal" even after it had been confirmed is confusing.
Bamboo Point-2-Point
A protocol for how two nodes can keep each other's bamboo logs updated: https://github.com/AljoschaMeyer/bamboo-point2point
I'm not really happy with the presentation, but the protocol itself feels fairly good. I've thought through a bunch of interesting optimizations but ended up kicking pretty much all of them out of the protocol in favor of simplicity.
Major missing features/capabilities:
- cannot persist synchronization information across sessions, I think this should be done through a separate protocol
- doesn't support query results that miss some information, only full (prefixes of) ranges can be transmitted (e.g. if a peer asks for messages 17 to 100000 but I only have 18 to 100000, I can't send anything)
Both of these would be really neat to have, but would dramatically increase the complexity of the whole thing (sorry @cft). It should be possible to do persistent synchronization state through a companion protocol though.
And then there’s also the program that decides which transitions to take.
That was actually a very automaton-centric observation. What would this look like with e.g. lambda calculus or a cellular automaton where there is less of a clear distinction between program and state? What about process calculi (starts daydreaming about each separate process writing to its own log and replicating them asynchronously...)?
we also have to have a bath
Oh, the beauty of immutable messages =D
And thanks for the kind words
@【Netscape Navigator™】 The README is kind of a mess right now, it pretty much assumes the reader already knows most of the stuff. I'm currently working on a better writeup.
To hopefully answer your question (feel free to ask for clarification): We need to be able to traverse a (short) path between any pair of messages, to ensure that they are both "on the same fork". That's really all we (and also SSB) can guarantee: not that there are no forks, but rather that all data that is currently available locally is consistent with each other. And this consistency between any pair of messages is shown by tracing a path of valid hashes from one to the other.
There are two parts to this: efficiency and correctness. Lipmaalinks provide the efficiency, they guarantee short paths (length logarithmic in the difference of the sequence numbers). But for correctness, we need the backlinks. In order to have a path between any pair of messages in general, we also have to have a bath from each message to its predecessor in particular. The backlinks must always be there, the only point of variation in an antimonotone binary linking scheme (See the papers in the README) is the choice of the lipmaalinks.
@cft Trying to loop back what I got out of this, since that was fairly abstract (not that I'm going to be much more concrete I'm afraid):
We can view computation as a state transition system. A particular execution can be observed as a sequence of traces (labelling the transition steps rather than the states themselves). By persisting and sharing these traces as append-only logs, we can build really dumb interpreters of these logs that can provide rich functionality. Is this roughly what you tried to communicate?
A few immediate thoughts:
- In a way, this would be to ethereum what current ssb is to bitcoin: sidestepping global consensus by instead observing a multitude of individual viewpoints.
- In different situations it can be beneficial to log either the states or the state transitions. And then there's also the program that decides which transitions to take. Sometimes the best choice might be to log that program instead. This actually starts to smell like compression of computations?
- How does this relate to Futamura Projections?
- see also #ssb-annah (CC @joeyh )
It's time for some more #bamboo breakage =/
- moving the tag byte back to the very beginning of the encoding (this should really be the first thing, since it could indicate e.g. the size of the public key)
- swapping the position of the backlink and the lipmaalink in the encoding (the data behind the tag is sorted by how often it can be omitted in an optimizing replication protocol, and in a protocol that only transmits metadata that can be verified, it is never possible to encounter a situation where the backlink can be omitted but the lipmaalink can not)
- removed the length indicator that comes prior to the signature in the encoding (this was a remnant from earlier versions where the signature scheme wasn't fixed to ed25519, now it is completely redundant since all signatures have a length of 64 bytes)
More importantly, I think I also want to fixate the hash format rather than using a multihash. The signature scheme is already fixed, so having the other crypto primitive being variable is weird. A log of a broken hash scheme is unverifiable anyways, so appending new entries with a new, secure hash function doesn't really make sense. And finally, having to support multiple hash values for the same link target leads to awkward situations in verification and replication. There's the option of using a fixed hash function for the links and a multihash for the payload, but the benefits don't seem to be worth the added complexity. Also note that the tag byte at the start can still be used to define a log format that supports a new hash function while staying backwards-compatible with bamboo (similar to the signature scheme).
I'll have to sleep on the non-multihash decision, but I'm already pretty certain I'll go with it (compare also how hypercore doesn't use multiformats). This is btw unrelated to whether a more high-level protocol such as the content format of ssb ("cypherlinks") should use multihashes.
In contrast, previous logs either have linear-sized proofs or need extra trust assumptions.
Will you be there to grill them about all the existing schemes and what kind of trust assumptions they managed to eliminate from those?
Thank you all for the enthusiastic responses <3
Looks like most of you prefer February, so that's when the event is going to happen. I'm going to pester cft about the scheduling constraints in February, and we'll try to quickly settle on a date so that you all can start planning.
Since tangles are basically just DAGs, it could be a good idea to look into the terminology used in maths. Turns out the mathematicians also have a bunch of competing words though. But I'd like to suggest "source" instead of "root". A source node is a node with no incoming edges, a term used for arbitrary graphs. The term "root" is usually reserved for trees, which tangles definitely are not. In particular there's been discussion about tangles possibly having "multiple roots", which would be an oxymoron really. Same with "first".
CC #ssb-research #crabmeet #basel #switzerland
Also pinging a few people in Europe directly who might be interested (apologies to all the people I will inevitably forget): @piet @adz @keks @cryptix @zelf @Alex @Powersource @elavoie @graham @arj @happy @cmarxer @Dima
Everybody please ping around freely =)
SSB Event in Basel
Hello denizens of the scuttleverse,
we've been wanting to host an SSB event in Basel (Switzerland), and we are now entering the planning stage. This post is for sharing our intentions, gauging interest and honing in on a date.
What?
Two days of sharing knowledge, stories, ideas, hopes, doubts, emotions, and visions for all things SSB and append-only-loggy, in a participant-driven unconference/open space format.
Where?
The gathering will be at Basel University, where @cft heads the computer networks group and I am a lowly happy intern.
Who?
Anyone who is interested, whether from academia or not. We should be able to cover travelling expenses for people from an academic background or those who have recently finished their studies and could talk about their thesis. We also expect a number of local students to attend and get a glimpse into the world of SSB.
When?
This is the big question of this post. Two days seem like a sweet spot that is both worth travelling for and also reasonable to accommodate. As for the exact date, we are looking for something in either December or February. So I'd like to invite everyone who is interested in coming to share here (or via a private message) which of these would be better for you, or perhaps even specific date ranges that would or wouldn't work.
I will then try to fold these up to do a more specific post in the near future™.
Questions and Comments?
Feel free to post them in this thread, or drop me a private message. The overall plans are still very fluid, which is to say we don't precisely know what awaits us yet. But it will certainly be interesting and fun. Any suggestions and ideas for the event and format themselves are highly appreciated.
TLDR: Introducing substreams with independent backpressure introduces inefficiencies that need to be considered.
While investigating replication protocols for bamboo, I've come to the conclusion that I want to limit the number of back-pressured substreams. Since you are only allowed to hand out sending credit for amounts of data that you can definitely handle (i.e. giving out speculative credit violates correctness), resource management can become pretty inefficient once you have independent substreams. If you have a budget of k bytes of memory for handling incoming data, and you have n substreams, then each of them can get at most k / n credit. If these substreams are created over time, you might want to give an earlier one more credit, but if it then doesn't consume it, you can't get it back for the other streams. So if you are insistent on maintaining correctness while also establishing a hard bound on resource usage, you have to opt for fairly small maximal credit per substream (or allocate lots and lots of resources). All of this is to say: naively creating a substream for each replicated feed is a bad idea (not implying that anyone suggested it, just stating the kinda obvious). Credit-per-substream also creates a bit more state to manage, but that's mostly negligible I think.
It is completely possible to fairly interleave independent data over a single stream with a total amount of credit for that whole stream. The only thing we lose compared to truly independent substreams with backpressure is the ability of the consumer to prioritize logical data streams. With true substreams, you can "starve" one of them out of credit, so the other one gets to use all the bandwidth. Giving the consumer the ability to prioritize which data to receive at what speed is really the main reason for introducing an independent substream. And as an important corollary: when this ability is not needed, it might be better to not go down the route of fully independent multiplexing.
Do SSB servers need to prioritize replication of different feeds, e.g. saying that a friend's data gets extra bandwidth compared to a foafoaf? I guess you could do that, but it doesn't sound important enough to justify the decreased efficiency imo. Do we want the server to allow client applications to prioritize bandwidth allocation dynamically? I can't imagine any applications actually bothering with that stuff. So interleaving replication of different feeds' messages over a single actually-backpressured stream seems completely fine to me.
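Just to make "fairly interleave over a single credited stream" concrete, here's a rough TypeScript sketch (all names are invented, framing and how credit gets signalled are ignored):
// Round-robin interleaving of several logical feeds over one credited stream:
// one credit budget for the whole connection, fairness by taking turns,
// no per-substream credit bookkeeping.
type FeedId = string;

const outgoing = new Map<FeedId, Uint8Array[]>(); // per-feed queues of pending messages
let creditAvailable = 0;                          // credit granted by the peer, in bytes

function grantCredit(bytes: number): void {
  creditAvailable += bytes;
}

// Returns the next batch to put on the wire, taking at most one message per feed per pass.
function nextBatch(): Uint8Array[] {
  const batch: Uint8Array[] = [];
  let progress = true;
  while (progress) {
    progress = false;
    for (const queue of outgoing.values()) {
      const next = queue[0];
      if (next !== undefined && next.length <= creditAvailable) {
        queue.shift();
        creditAvailable -= next.length;
        batch.push(next);
        progress = true;
      }
    }
  }
  return batch;
}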
Large blocks are a different story, you wouldn't want to pause all replication while downloading a 10GB blob. And you probably also don't want a 16MB blob to wait for a 10GB blob either. The naive solution is to create a substream for each blob. More sophisticated: create substreams for ranges of blob sizes (e.g. 0 to 2^8 bytes, 2^8 to 2^16 bytes, and so on), then send blobs sequentially in fifo order across the stream for blobs of the appropriate size. That way, a blob will only be blocked by other blobs of a similar size. The exact ranges are a matter of tuning (Do the exponents increase in steps of 8, or perhaps 4, or maybe even multiplicatively? Can we lump everything below 2^16 together? Should it be possible to dynamically adjust the bands?). And then there's the question of whether it makes sense for consumers to be able to prioritize between these "channels" (i.e. whether they need separate backpressure), or whether they should merely be fairly interleaved with a common backpressure scheme controlling the overall blob transfer rate. If you go for the latter, then you've eliminated all true multiplexing from the point-to-point replication, greatly simplifying the overall protocol.
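To make the band idea concrete, a tiny TypeScript sketch (the exponent step and the exact boundaries are arbitrary example choices, not a proposal):
// Assign a blob to a band based on the base-2 magnitude of its size.
// The exponent step (here 8, i.e. roughly 0..2^8, 2^8..2^16, ...) is a tuning knob.
const EXPONENT_STEP = 8;

function bandForSize(sizeInBytes: number): number {
  if (sizeInBytes <= 1) return 0;
  const exponent = Math.ceil(Math.log2(sizeInBytes));
  return Math.floor(exponent / EXPONENT_STEP);
}

// Each band keeps its own fifo queue; a blob only waits behind similarly sized blobs.
const bands: Map<number, Uint8Array[]> = new Map();

function enqueueBlob(blob: Uint8Array): void {
  const band = bandForSize(blob.length);
  if (!bands.has(band)) bands.set(band, []);
  bands.get(band)!.push(blob);
}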
In terms of implementation complexity, the main cost is that the interleaving then lives in an ssb-replication-specific part of the protocol, whereas true multiplexing could be handled by a lower-layer abstraction. But it is probably possible to encapsulate the pseudo-multiplexing based on fair interleaving into its own library. In terms of resource consumption, the pseudo-multiplexing has the potential of increasing the throughput, because greater amounts of credit can be handed out without "sitting idle" on a substream that has some credit but doesn't have any data to send, and without having to be able to handle many substreams suddenly using up all their credit in a simultaneous burst. The drawback is that the consumer has less control over the order and prioritization in which it receives things.
The initial bamboo-point-2-point replication protocol might end up running over a single logical stream; true multiplexing would only be involved in running replication concurrently to the peer sampling service (the part that gossips to maintain an overlay network) and perhaps some other completely independent subprotocols.
Reading up on the weakness of tcp checksums, I'm starting to dream about a major ISP deliberately introducing errors that escape the error correction mechanism, causing bugs in all the protocols that assume that the probability of tcp transmitting corrupt data is vanishingly low (it isn't!). Like one of the "evil" malloc replacements that deliberately make common incorrect usages of malloc/free crash although they happen to work "most of the time" with normal memory allocators.
Boxstream (currently used by ssb) guarantees data integrity by crashing in case of the network corrupting some data in the tcp stream. Can you point me to stream encryption protocols that can backtrack and try again instead of fully aborting?
Another option would be to introduce a wrapper above tcp but below the encryption that does nothing but provide a stronger data integrity check, and that can backtrack (or selectively refetch segments) rather than immediately aborting the connection. It could use either a CRC, a non-cryptographic hash function (see e.g. here for a brief comment on using xxhash instead of a CRC), or possibly even a cryptographically secure hash to deal with actively malicious networks (probably not that useful; if the network tries to DOS you, it can probably achieve that goal anyways). That actually starts to sound like a really useful and reusable (e.g. both for boxstream or something like InterMAC) abstraction. Does this exist already?
CC @dominic, @keks, @cmarxer, perhaps you can point me to prior art?
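To illustrate the kind of wrapper I mean, a rough TypeScript sketch (the frame layout and the truncated sha256 checksum are placeholders for whatever check would actually be negotiated, not a concrete proposal):
import { createHash } from "node:crypto";

// One frame: a length prefix, the payload, and a short checksum over the payload.
const CHECKSUM_BYTES = 8;

function checksum(payload: Uint8Array): Uint8Array {
  return createHash("sha256").update(payload).digest().subarray(0, CHECKSUM_BYTES);
}

function encodeFrame(payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(4 + payload.length + CHECKSUM_BYTES);
  new DataView(frame.buffer).setUint32(0, payload.length);
  frame.set(payload, 4);
  frame.set(checksum(payload), 4 + payload.length);
  return frame;
}

// Returns the payload, or null to signal "please resend this frame"
// rather than tearing down the whole connection.
function decodeFrame(frame: Uint8Array): Uint8Array | null {
  const length = new DataView(frame.buffer, frame.byteOffset).getUint32(0);
  const payload = frame.subarray(4, 4 + length);
  const received = frame.subarray(4 + length, 4 + length + CHECKSUM_BYTES);
  const expected = checksum(payload);
  return expected.every((byte, i) => byte === received[i]) ? payload : null;
}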
How about blind mode for viewers?
@hoodownr Nah, my stuff (#pavo-lang) doesn't really fit this. Perhaps something like morte would fit your description? Then again, you probably want to go for imperative programming in this setting. Aarg, I don't have time to think about this =/
@cryptix By coincidence (really) I have just posted this, which can serve as a foundation for a muxrpc replacement that should run over tcp.
If you are willing to look into non-tcp protocols, QUIC might actually be a more robust choice than SCTP (since QUIC is actually engineered to work despite the internet trying hard to kill everything that is neither tcp nor udp).
Here is a spec for multiplexing byte streams over a single connection: https://github.com/AljoschaMeyer/bymux
This is essentially an alternative to #packet-stream (which is used by muxrpc, which is currently used for ssb replication), and it deprecates #bpmux, my prior attempt at this. This time I'm fairly confident that I got the semantics right (backpressure, heartbeats, stream creation, stream closing semantics and their interactions are tricky). And this time I'm actually happy with the API. CC @dominic since we talked about this in Hamburg.
This is part of developing a replication protocol for #bamboo (CC @piet, @hoodownr), the replication protocol will need to multiplex at least the peer sampling service and the replication, and possibly a few more things.
@Rabble I don't quite follow. This would be about spawning new linked lists that happened to share the first few items with another list. But it doesn't help with merging distinct logs.
@piet, @hoodownr Have some breaking changes: https://github.com/AljoschaMeyer/bamboo/commit/9dfd3b29383d9d24f907716b896a68d91ed9c791
- added the 64 bit id that allows maintaining multiple distinct logs with the same keypair
- changed around the order of the different data pieces in the signing encoding
- bravely resisted the temptation to add in a mechanism for branching off from an existing log by continuing it under a different 64 bit id
- I'll probably end up adding that in at some point...
@ChristianBundy %SABuw7m... is the first message, according to %3zZSJ6s...
#inktober 3: bait - mouse drawing edition (thank you @Aadil for the link to Noodle)
I don’t know how much time I’ll have, I haven’t drawn in months, and I might not even have a way of deriving pixels from paper except for a webcam…
The webcam quality is so bad, I won't even bother with uploading things. But I'm drawing along, and I'll hopefully upload the drawings at some point (sometime in December at the very latest).
Version control is actually one of the primary use cases driving my interest in this space. But I doubt either mix or I will be able to "recommend an authorship spec" anytime soon.
Would there be a need for a separate class which can push updates but not add additional collaborators?
For what it's worth I'm not convinced that this is necessary, at least for public artifacts. Write access alone is enough for a malicious actor to wreak havoc, and them adding more malicious actors doesn't really make things worse, assuming that these additions can be undone efficiently, e.g. if removing the actor from a certain point in their feed also undoes their changes to the authorship set. This is a different story for private artifacts, where granting people read access and granting them write access are definitely separate capabilities. In the public case, I personally prefer the all-or-nothing approach. This is one of those cases where we can deliberately shape our technology in a way that fosters building trust among people, rather than copying the default corporate approach.
Actually, maybe just use the immediate dominator directly without that special case for zero/one predecessor. I had a hunch it would make a definition based on orders more elegant, but it also introduces a special case there. Defining the regular idom of a DAG in terms of orders is pretty elegant however: consider the partial order obtained from a single-source DAG by taking the reflexive, transitive closure of the edge relation. The idom of a node n is the unique largest node that is comparable to all nodes that are <= n and that is not n itself.
@mix The point of divergence of a node n is very close to a well-known concept from graph theory: the immediate dominator of n, so there's no need to worry about surprises hiding here. Quoting the definition from that link because I'm lazy:
In computer science, in control flow graphs, a node d dominates a node n if every path from the entry node to n must go through d.
By definition, every node dominates itself. A node d strictly dominates a node n if d dominates n and d does not equal n.
The immediate dominator or idom of a node n is the unique node that strictly dominates n but does not strictly dominate any other node that strictly dominates n. Every node, except the entry node, has an immediate dominator.
Armed with that knowledge, we can define the point of divergence (PoD): The PoD of a node with zero or one incoming edges is the node itself, the PoD of a node with strictly more than one incoming edge is the immediate dominator of that node.
It is unique (because idoms are unique), it always exists as long as we have a single source (aka root), and on a DAG it can be computed in time linear in the number of nodes that are "part of the (generalized) diamond shape" by doing a backwards breadth-first search until one node has been reached on all branches of that search (idom on arbitrary DAGs takes slightly more than linear time).
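In case code says it better: a simple TypeScript sketch of the definitional approach for single-source DAGs (deliberately not the fast algorithm, and the graph representation is made up for the example):
type NodeId = string;
// predecessors.get(n) lists the nodes with an edge into n; the DAG has a single source.
type Dag = Map<NodeId, NodeId[]>;

// Dominator sets via the textbook dataflow equation, processing nodes in topological
// order (on a DAG a single pass suffices):
//   dom(source) = {source},  dom(n) = {n} ∪ ⋂ dom(p) over predecessors p of n
function dominators(predecessors: Dag, topoOrder: NodeId[]): Map<NodeId, Set<NodeId>> {
  const dom = new Map<NodeId, Set<NodeId>>();
  for (const n of topoOrder) {
    let acc: Set<NodeId> | null = null;
    for (const p of predecessors.get(n) ?? []) {
      const dp = dom.get(p)!;
      acc = acc === null ? new Set(dp) : new Set([...acc].filter((x) => dp.has(x)));
    }
    acc = acc ?? new Set<NodeId>();
    acc.add(n);
    dom.set(n, acc);
  }
  return dom;
}

// The strict dominators of n form a chain, so the immediate dominator is the strict
// dominator with the largest dominator set.
function idom(n: NodeId, dom: Map<NodeId, Set<NodeId>>): NodeId | null {
  const strict = [...dom.get(n)!].filter((d) => d !== n);
  if (strict.length === 0) return null; // n is the source
  return strict.reduce((a, b) => (dom.get(a)!.size >= dom.get(b)!.size ? a : b));
}

// Point of divergence as defined above.
function pointOfDivergence(n: NodeId, predecessors: Dag, dom: Map<NodeId, Set<NodeId>>): NodeId {
  const preds = predecessors.get(n) ?? [];
  return preds.length <= 1 ? n : idom(n, dom)!;
}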
@mix Yup, straight "set" rather than "append" is the easiest way. A bit more complex but probably worth it (for the same reasons that make us use "append" over "set" in the regular case as well): we already talked about how a merge message contains the operation that takes us from the value at the point of divergence to the merge result. In addition, the merge message could include the link to that point of divergence (or you could also compute it dynamically). Then in the algorithm, the magical part becomes let state_at_this_point = concat(processed.get(s.point_of_divergence()), s.monoid_value());.
Two quick additions:
- adding authors would not require forking, you would use a grow-only set for authors (grow-only sets form a commutative monoid)
- this feels like the #walkaway approach of authorship management =)
@mix @matt I had a quirky idea about approaching the problem of authors of a tangle concurrently removing each other. It involves a small detour at the start, and I'm not sure whether it is even a good idea, but at least it is interesting.
Github has a concept of ownership for repositories. Beyond CRUD actions on the master branch of a repo, there are two particular actions I want to look at: branching and forking. Branching essentially creates an alternate reality from the master branch. Master ignores everything that goes on in the branch, and the branch ignores everything that goes on on master. But they share a common history, and they share the same owners/authors. At some point, a branch can be merged back in.
Forking is essentially the same, except that it changes the set of authors who have write access to the new graph of git objects. So we can view branching as a special case of a fork, namely those forks where the new set of authors equals the old set of authors. Or inversely, you could call forks a generalization of branches that allows changing the authors. I'm going to talk about the generalized case, but I'll call it "branching" moving forward, since "forking" is already ssb terminology on the feed level.
We could design tangles such that they were aware of this concept of branching. I've used the analogy of a "pull request" here, but in a branch-aware tangle model, this wouldn't actually be a pull request. The proper way to get changes into a tangle as a non-author would be to branch the tangle, giving yourself authorship, do the changes, then send a "pull request" to the original authors who then merge that branch into the tangle proper. What exactly would this look like? I have no idea, this is conceptual only.
My first instinct was to declare this out of scope for #ahau, and that's probably still a good idea. But here is why I'm posting about it anyways: we can use branching to non-destructively remove authors. Simply create a branch that does not include the author to be removed in the set of authors for the branch, and continue work on that branch rather than the old one. Essentially this would mean that the new branch became the "Master" branch, and there'd probably need to be a way to automatically signal that Master has been migrated (or more precisely: that the feed that did the branching considers Master to have been migrated).
Because this is non-destructive, the removed author could just as well continue on the old Master, or perhaps they'd do their own branch that removes the author(s) they'd been clashing with. In this situation (and in general), each client can choose for themselves which branch of the tangle they consider to be Master, and their apps would use Master to render the UI. In an amicable authorship removal, all authors (including the removed one) could publish what they consider to be the new Master branch, and clients would automatically switch to it unless instructed otherwise.
Ooh, there's a bug =D
The statement processed.insert(current, state_at_this_point); used to be right behind current = pending.get_next();, and by moving it to the end I made it so that the condition set_difference(s.predecessors(), processed).is_empty() is never fulfilled.
A flag (or predefined multidimensional clarification scheme) has inherent semantics, client names are purely syntactic. Suppose you read a message with truth/useragent: Volcano, that doesn't help you with rendering the thread at all, you still have to guess (just like the current situation).
Not saying that including the client is useless, just pointing out that it tackles a different problem (or at least a different solution space).
The above needs to be changed to end up with a set of final states: a (node, value) pair is put into the set of final states if it has no incoming edges.
@mix
Here's my first instinct for such an algorithm in pseudocode. I'm only sketching the easy case of auto merges, but I hope it is helpful nonetheless.
var pending: Queue<Node> = Queue::singleton(the_root_of_the_tangle); // Stores which nodes need to be processed next, lazily filled as the algorithm proceeds through the tangle.
var current: Node; // as the algorithm proceeds, each node in the graph is `current` exactly once, moving from *pending* to *current* and then into *processed*. Starts out uninitialized. Maintains an invariant: all predecessors of *current* are keys in *processed*.
var processed: Map<Node, Monoid>; // The nodes which have already been processed, mapping them to the accumulated monoid value at that point. When the algorithm terminates, the keyset is the full tangle. For efficiency it might be better to not store *all* the intermediate monoid values, but for the sketch I'm leaving it like this (also this makes it easier to handle real merges)
while !pending.is_empty() {
current = pending.get_next();
// mark all successors whose other predecessors have already been processed as pending
for s in current.successors() {
if set_difference(s.predecessors(), processed).is_empty() {
pending.insert(s);
}
}
if can_not_automerge(current.predecessors()) {
// magically wave your hands
} else {
let state_at_this_point = concat_all(current.predecessors(), current.monoid_value());
processed.insert(current, state_at_this_point);
}
}
let final_state = processed.get(current);
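For reference, here is the same sketch as (hopefully) runnable TypeScript, with the fix mentioned above applied (the state for current is recorded before its successors are examined). The node shape and the string-concatenation stand-in monoid are just for illustration, and real merge handling is still hand-waved:
type NodeId = string;

interface TangleNode {
  id: NodeId;
  predecessors: NodeId[]; // earlier nodes this node links to
  successors: NodeId[];   // later nodes that link to this node
  monoidValue: string;    // stand-in monoid: strings under concatenation
}

const identity = "";
const concat = (a: string, b: string): string => a + b;

function reduceTangle(nodes: Map<NodeId, TangleNode>, rootId: NodeId): Map<NodeId, string> {
  const pending: NodeId[] = [rootId];          // nodes ready to be processed
  const processed = new Map<NodeId, string>(); // node -> accumulated value at that node

  while (pending.length > 0) {
    const currentId = pending.shift()!;
    const current = nodes.get(currentId)!;

    // Accumulated state at this node: the states at its predecessors followed by
    // its own value (real merges and can_not_automerge are still hand-waved away).
    const fromPredecessors = current.predecessors
      .map((p) => processed.get(p)!)
      .reduce(concat, identity);
    processed.set(currentId, concat(fromPredecessors, current.monoidValue));

    // A successor becomes pending exactly when its last unprocessed predecessor
    // (which is current) has been processed.
    for (const succId of current.successors) {
      const succ = nodes.get(succId)!;
      if (succ.predecessors.every((p) => processed.has(p))) {
        pending.push(succId);
      }
    }
  }
  return processed; // the heads (nodes without successors) hold the final states
}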
because conversation tangles are suuuuper sloppy atm.
While we still suffer under the yoke of sloppiness, I have a (hopefully small) feature request: clients could include an entry "flipped": bool in their message schema to indicate whether root and branch are used in the patchbay or the patchwork way. Then at some point (reached whenever you fine authors of #ssb-clients find the time), clients could use this to render both kinds in the intended way.
CC @mix, @ChristianBundy, @Siegfried, @cel
out before a lengthy flame war about which mode gets to set the flag to true and which one sets it to false
@mix As another approximation, you could also use "A peer must publish messages such that sequence numbers of the prev messages that refer to the same author's feed are never decreasing". This is not 100% accurate (Suppose author A posts message X, B responds to it with message Y. Now author C could post message Z with a prev of B, followed by message ß with a prev of A.), but it is simple to test for and enforce (for each author keep track of the "frontier" of knowledge that author has about all other authors. This is quadratic in the number of authors, but completely fine for a small-ish number of authors).
The longest-path loopholes seem fairly easy to reach if I understood the rule correctly:
  B C D E
A
  F G
// time progresses from left to right, the top and bottom row don't know about each other
In this graph, suppose we have received all messages except G, and now we append our message H, whose prev are E and F. Next, we receive G, and now we want to append I. Then the longest path rule says nothing about including G, since that's still not a longest path.
Another loophole: Assume we know all messages and then publish H, whose prev are E and G. Now we would still be allowed to append I with a prev of H and F. Or for that matter, it would be totally fine for D to include both C and B in its prev.
@bobhaugen Yeah, what I was referring to was a dynamic topsort algorithm where each sequence of node/edge insertions/deletions that results in the same graph must result in the same topsort, even though there might be multiple valid ones. A batch algorithm is fine by just picking an arbitrary linear extension of the partial order, but such a dynamic algorithm needs to always result in a particular linear extension. Essentially using a consistent tiebreaker for each pair of messages where neither is reachable from the other, e.g. "choose the one with the lexicographically lesser hash".
@ChristianBundy I remembered that there was a more detailed version of the first paper I linked to: Haeupler, Bernhard, et al. "Incremental cycle detection, topological ordering, and strong component maintenance." ACM Transactions on Algorithms (TALG) 8.1 (2012): 3. (you can skip the stuff on strong component maintenance, the first parts cover the same material as the shorter paper, but in greater detail and with clearer examples).
As for your suggestion: The critical part here is: when does this happen, and which data do you store in-between? I'm assuming you want to maintain a list of all messages, sorted by claimed (and where necessary fixed/adjusted) timestamp. When a new message comes in, you compute the fixed timestamp if necessary, then insert it in that order. Is that correct?
That approach is basically that of the paper, so I won't go into too much detail. But the main deep sea monster is the following: Assume you have two messages A and B, and A has an outgoing link to a message you don't have yet. You arbitrarily (or based on inaccurate timestamps) sort A < B. Now you get message C, which links to B, and it turns out that A linked to C. So now you have to shift A past B in your sorting. And there might be a bunch of messages in between that need to be shifted as well. And where exactly do you insert C (and if this seems too easy, imagine that multiple messages had dangling links to C, and C had multiple outgoing edges)?
Tentatively sorting by claimed timestamp would however greatly reduce the number of these "dramatic" shifts (with accurate timestamps, you wouldn't have sorted A < B in the first place). So assuming that clocks are properly synchronized, everyone is willing to publicly attach accurate timestamps to their every message, and no malicious actors are present, you should indeed hit the worst-case behavior remarkably rarely. That assumption about malicious actors is always interesting in the context of ssb: you can argue malicious actors away because you'd block them, but you'll probably only do that once they "got" you (or enough of your friends) by tricking you/them into doing a lengthy computation.
Efficiently maintaining the order is an interesting (and fortunately basically solved) problem, that's what the "dynamic ordered list" data structure is for. See Dietz, Paul, and Daniel Sleator. "Two algorithms for maintaining order in a list." Proceedings of the nineteenth annual ACM symposium on Theory of computing. ACM, 1987. and Bender, Michael A., et al. "Two simplified algorithms for maintaining order in a list." European Symposium on Algorithms. Springer, Berlin, Heidelberg, 2002. if you are curious.
The latter paper is also notable for an easter egg after the references:
A Final Thought
Dietz and Sleator is quite influential
With its tags and its proofs by potential
But to teach it in class
Is a pain in the --
So our new result is preferential.
And @cft's slides: tschudin-icn2019-20190925.pdf
@Christian Bundy Do note that recomputing the topsort from scratch whenever we get a new message doesn't scale well (any append takes time linear in the size of the whole database). So you'll probably want a dynamic algorithm for anything serious. Dynamic topsort is an open problem, we don't yet know what efficiency can be achieved. The state of the art, as far as I'm aware:
- Haeupler, Bernhard, et al. "Faster algorithms for incremental topological ordering." International Colloquium on Automata, Languages, and Programming. Springer, Berlin, Heidelberg, 2008.: incrementally updates an ordering
- https://pdfs.semanticscholar.org/289b/1275aa40f5aee869ee19988ec1b7d5f96890.pdf: maintains labels assigned to all nodes such that an order on the labels is consistent with a topsort on the nodes (this approach is unsuitable for ssb, appending messages to the longest log consistently hits the worst-case label update cost)
Both of these can still take O(n) for a single append in the worst case, but such cases would be very rare for ssb. The biggest problem in that area is that while these algorithms can guarantee nice amortized bounds in a purely incremental setting (no deletions), there is no better bound than "rerun batch topsort on every change" once deletions are allowed. The complexity bounds given in the literature are usually for incrementally building up a complete graph, producing a valid topsort at every intermediate step.
SSB message graphs have some neat properties that could possibly be exploited:
- partitioned into totally ordered disjoint sets (aka feeds)
- edges don't appear out of nowhere, and they appear in batches (all outgoing cypherlinks from a message)
- deletions are rare (and usually restricted to removing whole feeds), so a purely incremental algorithm could be feasible
I've burned multiple days on trying to leverage these specialized aspects for an ssb-specific topsort algorithm, but with no success. If I were to implement this today, I'd basically implement the algorithm from that first paper (with some obvious optimizations with respect to seqnums compactly representing the backlinks of messages within a feed).
Dynamic topsort comes with a related, interesting problem: Assume two servers that run the same dynamic topsort algorithm and receive the same set of messages, but receive those messages in different orders. Can we find an efficient topsort algorithm that guarantees that the two servers will arrive at the exact same ordering of messages? I couldn't find any trace of this problem in the literature, and I couldn't make any progress myself beyond "this is so much harder than it looks =(".
[...] but one of the nice things about it is that if it’s part of the tangle you can clearly see what was after an authorship edit on the linear parts of the tangle. It looks like you’re trying to copy that property by copying the heads of different tangles into adjacent tangles.
Come to think of it, all an update to the authorship tangle needs to do is include the newest seqnum it is aware of for each author (in addition to the actual addition and removal of authors, of course). That allows us to reconstruct enough of the temporal (i.e. causal) relation between the thread tangle(s) and the authorship tangle.
More abstractly, we can define the reduction algorithm for a tangle as taking as input a tangle and a set of authorship ranges (as discussed previously in this thread). The algorithm wouldn't care about how the authorship ranges were obtained. They might have been computed from an identity tangle, or from some other messages, or they might have been a hardcoded object (nice for testing purposes). This approach would mean that no information can flow from the thread tangle to the source of authorship information, since it isn't known (and might even change over time). That constraint actually simplifies things, the tangle can then be very close to how current post threads work.
It also feels to me like you’re still enabling a scenario where you can change the history of valid authors however you want with the authorship tangle.
Independent of whether this is a good idea or not (I can see where your concerns are coming from), my hunch is that a good, general solution will naturally support this. You can always put further restrictions on these operations. But I don't think that relinquishing the ability to do these operations opens up new design space.
or you’re just rephrasing?
I was rephrasing %x3GMY1I...
Do you mean: conflict at top level means conflict at some sub-level(s)
No, I mean more complex message graphs, e.g.
  A
  |
  B
 / \
C   D
 \ / \
  F   G
  |   |
  H   |
   \ /
    I
Does I have to declare what exactly it is merging, or can this information be derived? If it can be derived, are there situations where it cannot be derived? Also keep in mind that you can have more than two concurrent updates.
If I’m right we can also write a really nice upgrade to the merge strategy which uses composite-monoids and that will feel great.
As long as you manage to get the hand-coded ones exactly right (as if they were created through a composition library), then yes.
I have no idea =)
The current design with the explicit first implies a unique root/source. If it was left implicit, then it would certainly be possible to have multiple sources (nodes without incoming edges). Implicit sources would also allow sharing of nodes across multiple tangles. But I think authorship management and generally talking about or attaching attributes to a tangle would become more complicated. I think for the current explorative work (that needs to amount to a real system for mix), staying focused on the simpler case of a unique, explicit root is a good call.
there are multiple merges (across diff properties with different merge strategies)
Technically speaking, that's not true: there is only one merge (on the elements of the monoid), but the monoid happens to be composed out of smaller monoids in a principled way, so we want to compose merges as well. This view helps us with the question "[...] whether it needs to summarise the diff for all properties over the fork, or just the ones that would conflict.". In the spirit of consistency, this should imo be handled analogously to the omission of identity elements. If it is fine to omit identity elements as entries of an object of submonoids, then it should also be fine to omit the merge resolution for non-conflicting entries of submonoids.
Gut says be sensible and just do the former.
I disagree with your gut, and in light of recent events, that might even be a good sign...
Btw the final scheme needs to be able to deal with nested conflicts as well, the "concatenated mess of fields" can be nontrivial and requires a precise definition.
e.g. you traverse down D branch first, you don’t know that D was invalid
Fair point. My remark that authorship changing messages should be processed first doesn't really solve this, since they might be hidden further down a branch (imagine C was "another update!" and E was "mix is not author"). So there'd still be backtracking involved.
This sounds like the data structure is working against us, so instead of trying to find out a complicated algorithm, we might better look for alternate data structures.
What happens if we take authorship information out of the thread tangle, and move it into a dedicated tangle that consists of nothing but authorship changes? The root of the thread tangle would point to the authorship tangle that controls it. An authorship tangle itself can either point to another authorship tangle that controls it, or it might not point to one, in which case it would be self-governing.
To link these two together, messages in the thread tangle would also point to the heads of the corresponding authorship tangle. I'm not sure whether the messages in the authorship tangle should also point to the heads of the thread, probably though. But that makes it tricky to have one authorship tangle manage multiple threads. Not sure whether you'd want to support that or not. In any case, seqnums should definitely act as implicit indicators of causality (between two messages of the same author, act as if there was a link from the newer to the older message).
This separation doesn't help us with stuff like concurrent mutual authorship removal, but it allows us to traverse the thread in one pass for reduction (handwavy again, but hopefully the intuition is clear).
Handling conflicting authorship changes would effectively be a matter of performing merges on the authorship tangle, which seems quite elegant.
In another thread @past-mix wrote
One feed cannot participate on multiple heads of a hydra
But that is essentially what you are advocating for here, so you'll need to draw some sort of distinction (and possibly put it into the message content?).
and electron apps can’t even have sex.
people assumed protozoa couldn’t have sex, but statistics on their genome diversity suggests that they are definitely fucking… we just don’t know how. And they’re fucking without apparent genders…
Patchbay decided to show me this as the first entry of /posts, without the context of it being about actual protozoa rather than the co-op. That was a weird way to start the day...
@mix Not quite sure what to do with that example, since it is commutative. So F is superfluous, and if you wanted to publish it anyways, it would be add 0. The cumulative one would imo be A + B + C + D = 9 = A + B + D + C.
That above paragraph covers the auto-mergeable case, which you probably didn't care about. So let's pretend that addition didn't commute (or we simply didn't define an auto-merge policy).
should F say:
a) “add 6” (the decision of the person resolving the conflict)
b) “remove 5” (assume both are applied in a predictable order - e.g. by hash, alphabetically - then the result is adjusted by the merger)
I choose c): "none of the above" =P
F itself wouldn't "say" an accumulated result, instead it would contain (or somehow express) how to get from B to the desired state. So if the manual merge wants to ignore C and use D only, then F would be "add 2". Aside: In the special case where the mechanism for merges is to simply drop everything but one message, it might be more elegant for F to say "use D" instead of "add 2".
In the diagram with the accumulated values, I'd then go for "add 4", that is A + B + F (which is also A + B + D, since that's the semantics that I assumed the merge should have).
b) “remove 5” (assume both are applied in a predictable order - e.g. by hash, alphabetically - then the result is adjusted by the merger)
This is problematic, because "removing" might not always be defined. "Remove 5" is equivalent to "add the inverse of 5", with the inverse of x being the element x' such that x + x' = e (writing e for the identity element). Monoids don't mandate the existence of inverses; a monoid with inverses is a group. But even in group land, it seems unnecessarily restrictive to express the merge result as the difference from the concatenation of all merged branches in a predictable order: why not allow arbitrary merges? In our example, F might be "add 27", and that could be just fine. Less contrived: if you had concurrent text edits, you might want to write a new paragraph that incorporates both perspectives rather than fusing them based on syntactic structure.
Forgot the second option regarding author availability: Designing a message type that allows you to "subscribe" to a tangle; under the hood, servers would convert this into automatically requesting the feeds of the authors. This is fairly unrealistic in the current situation (and perhaps for ssb in general; always keep in mind that my perspective involves non-ssb log formats), but it is an option nonetheless.
I was thinking of that array kinda like a string of elements in the same way “shampoo” is a string with “duplicate entries” of “o”, but “shampoo” is still a unique element of the set of strings
In the interpretation I was going for, "shampoo" and "shamoop" are equated though. But "shampoo" and "shampo" are not, the count of characters matters. Basically: two multisets are equal if their sequential representations are anagrams.
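Or in throwaway TypeScript:
function counts(s: string): Map<string, number> {
  const m = new Map<string, number>();
  for (const ch of s) m.set(ch, (m.get(ch) ?? 0) + 1);
  return m;
}

// Two multisets are equal iff every element occurs equally often in both.
function sameMultiset(a: string, b: string): boolean {
  const ca = counts(a);
  const cb = counts(b);
  if (ca.size !== cb.size) return false;
  return [...ca].every(([ch, n]) => cb.get(ch) === n);
}

sameMultiset("shampoo", "shamoop"); // true, anagrams
sameMultiset("shampoo", "shampo");  // false, the count of "o" differs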
re %UMvtgA4...
Opinions: as long as robin respects the tangle authors, if prev includes C (i.e. robin has received C), then it must not include D. If E replied to D without being aware of C, then it is fine that everyone who has C will ignore both D and E. If you wanted to find a "better" solution, it would have to depend on the semantics of the messages, which would be too complicated (and what are the "objective" semantics anyways in a system where everyone can interpret payloads as they see fit). Sometimes this might be annoying, but keep in mind that authorship removals are probably rare events, and if they need to occur, there's likely worse nontechnical trouble than a few transitively ignored messages.
From this perspective, it follows that F is not necessary. In fact it doesn't even make sense: E was not part of the tangle (because D wasn't), and robin is now aware of that. It's a shame that one of their own messages got dropped, but that's life in distributed systems. You wouldn't publish a "merge" with a message completely unrelated to the tangle - a category in which both C and D fall under my interpretation. Another take: F is mostly robin clarifying their own view of the system. We don't want everyone to spam us with information about their replication progress.
A feed cannot publish messages attaching earlier in the thread than it’s already posted
Yes! Else you could create cycles, and tangles are by definition acyclic. Use seqnums to enforce happened-before among messages of the same thread (more efficient than traversing the feed's backlinks, and works with ooo).
One feed cannot participate on multiple heads of a hydra
I agree with this in scenarios where there's no contentious authorship (or authorship removal in general). I'm not so sure about situations that involve competing authorship forks: if two authors remove each other concurrently, I want to be able to participate in both resulting tangles. I'll need to think this through some more.
Perhaps this means that you must follow the other authors in a authorship-permissioned thread (?)
That's one option, others that come to mind (not evaluating here, just bringing them up):
- ooo is fine, but it is the client's responsibility to only update a tangle once they have transitively retrieved (via ooo) all prev messages up until the root.
So answering my earlier question (“can we just look at author state in context of the message being posted, or should we have a final state that we use to look back to check validity”)
I didn't 100% understand that post, but here's a sketch how I'd reduce state (under the rather strict semantics I've proposed above):
- start with all messages that claim to be heads
- traverse the prevs until the root is reached
- throw away everything that doesn't reach the root
- now, beginning from the root, traverse the prev links backwards again to build up the reduced state (insert hand-waving about merge conflicts here)
  - handle messages that change authorship first
    - corollary of this: a message should be able to either change authorship rights or update the tangle's monoid, but not both
  - don't traverse links to messages that were posted by non-authors
@mix
"Weights" as in the counts of how often (possibly negative) something appears in the multiset. The multiset could have two identical entries granting authorship, e.g. if two people concurrently grant access to the same person at the same starting point. If we wanted to convert it into a true set (no duplicates), the concatenation of these two entries would be that same entry again, but annotated with weight two. Then we'd know that it would take two or more remove actions (or a single one of weight at least 2) to remove authorship.
it should not be possible to have two copies (double-spending, etc.)
I'm looking at this through a more scuttlebuttly view: Anyone can issue arbitrary assertions to their own feed over which they have total control. Once we start talking about data structures that span multiple feeds, it becomes a matter of subjective interpretation and trust. Say A and B fork off their own PoIs. Which one of them is "real"? None of them is. SSB deliberately does not enforce consensus here. I can choose which fork to accept for myself. I can even accept both. Taken to an extreme, this results in a world where the readers resolve merge conflicts, not the authors. Or another framing: Readers always have the option to override merge conflict resolution, but at the risk of landing in their own subjective bubble that has diverged from what most other people consider "reality".
I like the idea of having one field too, but can’t yet imagine how to do it.
The monoid could look like
[ // order is irrelevant, but duplicate entries are supported (a multiset). This could be turned into a true set by adding weights to all elements
{ action: "add", target: "@mix", seq: 42 },
{ action: "add", target: "@alj", seq: 17 },
{ action: "remove", target: "@mix", set: 99 },
]
The reduced state (the domain of the monoid action) would be the (weighted) ranges, which could in turn be simplified to simply a mapping from ids to seqnums at which the id is a valid author, since that's probably all you care about most of the time.
If I understood your post correctly, then the monoid I'm suggesting is fairly close to your option 1, whereas the reduced state is close to option 2 (which makes sense, we are accumulating across the full history into a useful, expressive summary data structure).
I think a pattern I’m seeing is that ideally idempotency is not a function of the transformations, but rather the resolution - this means transformations will clash less … because idempotency in transformations breaks commutativeness in a lot of cases.
Neat, I like this view. It perfectly captures and generalizes the idea of maintaining the weighted set (or equivalently allowing duplicates) and then collapsing the precise weights into {negative/zero/positive} in the action.
because idempotency in transformations breaks commutativeness in a lot of cases
Yeah, stupid idempotency. If only people (including myself) didn't expect it so often =/
I’m feeling a little bit tense about how deep this space is and how much it feels like it would be prudent to not be hasty… but I am compelled to need to start somewhere and soonish. Perhaps I’ll give myself another day or two next week to write code around some ideas.
Given the care you have put into this, I'd be surprised if it went badly. You won't find the perfect system, it probably doesn't exist. But as long as you manage to avoid inconsistencies, it should be fine. Whatever you end up implementing isn't set in stone, other people will hopefully keep experimenting with different solutions.
If you are afraid of not reaching a design of the necessary quality (I personally don't think you'll have that problem), did you consider some more restrictive kaitiaki management options? Examples that come to mind are fixed sets of kaitiaki, grow-only (can not remove kaitiaki status), exactly one kaitiaki (which could change over time, and/or which might be a feed that is controlled through some out-of-band consensus mechanism, e.g. a bot that creates messages based on loomio poll results). That last point is exemplary of a probably large family of options where you gain simplicity by inviting some centralization back into the system.
And then there's always the option of running consensus algorithms or locking algorithms within ssb logs (this is a whole different can of worms rabbit hole, but I expect you can find plug-and-play algorithms in the literature) to enforce a total order to prevent merge conflicts and coordinate changes to the set of kaitiaki.
@mix Fyi I'm coming around to only storing the starting point of the authorship state, especially after noticing how similar this is to friends/access graph maintenance (but I still think having two entries is worse than condensing it into one). But I still feel like I haven't reached a good understanding of the space yet.
No-one likes big hydras
No one liked the Lernaean hydra either. Didn't stop it from hanging around and eating people. There will be big hydras no matter what.
(which admittedly does not mean that we shouldn't try to avoid creating them, I'm just stressing that we have to be able to deal with them)
I don’t want the added complexity of being able to time travel and add and remove people as authors far in the past.
That results in a kinda weird dynamic: suppose I'm not an author, but I post a bunch of useful edits anyways. The authors then make me an author for exactly the (closed, finite) range of those edits. This is basically a pull request, so far so good. Now suppose I made two different "pull requests". Since I made them from the same feed, one of them is the first and one of them is the second one. A regular author "merges" the second one. Now they want to merge the first one. Oops, doesn't work =/
Remember that we are dealing with immutable time travel, which is a lot less problematic than mutable time travel. And merging in data from early points in time can be done efficiently if we have a tree structure caching the state at each internal node (it'll probably be a good while though until you reach the scale where this becomes necessary).
Is this a terrible idea @aljoscha?
I don't know =/ Perhaps it might be worthwhile to express each add/remove action as a single, integrally weighted range (that may or may not be open-ended). But then, to be able to represent my above example (i.e. to make concatenation closed), you'd need to be able to have sets of disjoint ranges. And suddenly, we are back at my original proposal. I guess I just don't find it that complicated. I mean there's some inherent complexity in the problem space, but the ranges seem to add very little fluff.
Your authorsUpdatedAt field also needs to be able to handle disjoint ranges, so you'll end up with at least the same amount of complexity, just spread across different entries and with some duplication that needs to stay in sync.
The other extreme would be to drop the whole monoid framework and just live in a world of changes being applied to a distinct state object.
ooo, that elegant solution is commutative !
What a conspicuously happy coincidence =P
how do we prevent mix from pretending he wasn’t online when he said “aljoscha is not an author” and says “yeah my view of previous, and of aljoschas last edit was waaaaay back then”?
We can't on a purely technical level, because mix might have actually been offline for that long. Global time does not exist.
does this update to a authorUpdatedAt monotonically increase all sequences?
Do we really want to restrict this to monotonic cases? Is there a clear technical gain? In principle it seems fine to retroactively add stuff.
if this edit removes an author is it trying sneaky things like trying to remove a valid edit from upstream that’s already in the tangle?
It's a feature, not a bug. If we trust the kaitiaki to do their job properly, why not give them this power? And we do trust them, since we already give a malicious actor the ability to revoke everyone else's rights.
I have a hunch that having two different entries for `authors` and `authorsUpdatedAt` is not the best approach, because there could be inconsistencies between the two fields (informally: what happens if one of them talks about an author that is not present in the other? Formally: we have to make sure that the concat operation is closed). Conceptually we track for each author ranges of seqnums with an associated count of how many times they have been given access for that range (as usual, access can be negative). So

add alj from seqnum 20
add alj from seqnum 30
remove alj from seqnum 25

would result in authorship counts `alj: 0..19: 0, 20..24: 1, 25..29: 0, 30..: 1`. These ranges would need to be represented as a js object, probably omitting any ranges for which there is a count of exactly zero. Was this comprehensible?
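A minimal js sketch of one way to encode this (storing the seqnums at which a count changes, rather than the ranges themselves, is my own simplification, purely illustrative):

```js
// Each add/remove becomes a delta at a seqnum:
//   add alj from 20    -> { 20: 1 }
//   add alj from 30    -> { 30: 1 }
//   remove alj from 25 -> { 25: -1 }

// concat adds deltas pointwise: associative, commutative, closed, and {} is
// the identity element.
function concat(a, b) {
  const out = { ...a };
  for (const [seq, delta] of Object.entries(b)) {
    const sum = (out[seq] || 0) + delta;
    if (sum === 0) delete out[seq]; // omit entries that cancel out
    else out[seq] = sum;
  }
  return out;
}

// Recover the ranges-with-counts view by walking the breakpoints in order.
function toRanges(deltas) {
  let count = 0;
  return Object.keys(deltas)
    .map(Number)
    .sort((x, y) => x - y)
    .map((from) => ({ from, count: (count += deltas[from]) }));
}

const alj = [{ 20: 1 }, { 30: 1 }, { 25: -1 }].reduce(concat, {});
toRanges(alj);
// => [ { from: 20, count: 1 }, { from: 25, count: 0 }, { from: 30, count: 1 } ]
// i.e. alj: 0..19: 0, 20..24: 1, 25..29: 0, 30..: 1
```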
- kaitiaki concurrently removing each other
- perhaps need an over-ride rule where if there are no authors, one of the last ones can magically still edit…
One way (definitely not the only valid one) of thinking about the resolution of two concurrent operations is to serialize them, i.e. define an order in which they are applied. So if you get messages m1 and m2 concurrently, you might either perform m1 followed by m2, or m2 followed by m1. Interestingly, neither of these two options can lead to an empty set of kaitiaki. So merge conflict resolution through serialization seems to be an attractive option in this case (whereas e.g. in concurrent editing there are situations involving deletion where none of the possible serializations make sense).
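To make the serialization argument concrete, here is a tiny js sketch. It assumes that an operation is ignored if its author is no longer a kaitiaki at the time it is applied; that assumption is exactly what makes the claim work, and it matches the "appends by non-kaitiaki are ignored" idea discussed in this thread.

```js
// Apply a sequence of add/remove operations in the given order, skipping
// operations whose author has already lost kaitiaki status.
function applySerialized(initial, ops) {
  const kaitiaki = new Set(initial);
  for (const op of ops) {
    if (!kaitiaki.has(op.author)) continue; // unauthorized, ignored
    if (op.type === 'remove') kaitiaki.delete(op.target);
    if (op.type === 'add') kaitiaki.add(op.target);
  }
  return kaitiaki;
}

// Two kaitiaki removing each other concurrently:
const m1 = { author: 'A', type: 'remove', target: 'B' };
const m2 = { author: 'B', type: 'remove', target: 'A' };
applySerialized(['A', 'B'], [m1, m2]); // Set { 'A' }
applySerialized(['A', 'B'], [m2, m1]); // Set { 'B' }
// Neither serialization yields an empty set of kaitiaki.
```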
adding seq number [...] is hard for a group identity (see Identity Tangle)… maybe figure that out later
The brute force solution is a "vector clock" containing all identities that are part of the group identity, a vector of all seqnums (or perhaps a map from public keys to seqnum). The seqnums of an individual feed can be regarded as a special case of this.
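In js such a clock could simply be a map from public keys to seqnums (keys shortened for illustration); merging two of them by pointwise maximum is the usual move for vector clocks, though that part is my addition, not something spelled out above.

```js
// "Authorship state as of" marker for a group identity: one seqnum per
// member feed. A single feed is the special case of a one-entry map.
const groupClock = {
  '@alice...ed25519': 12,
  '@bob...ed25519': 45,
};

// Pointwise maximum of two clocks (an assumed merge rule, for illustration).
function mergeClocks(a, b) {
  const out = { ...a };
  for (const [key, seq] of Object.entries(b)) {
    out[key] = Math.max(out[key] || 0, seq);
  }
  return out;
}
```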
@SoapDog (Macbook) Well, I woke up this morning and I was in Basel. Which made a lot of sense, considering I went to sleep there yesterday. I will go to sleep in Basel in just a few hours again. There's really not that much to it. Except for a short window of time around #36c3 (colloquially referred to as Christmas + new year's), I will repeat this pattern of continuously sleeping in the same city.
See also %uTxlPbr...
Thanks for the ping @moid. I've seen this some years ago, it is good to know that they are still alive and kicking. My main qualm is that there are many, many opinions baked into the whole system. While there is innovation across many axes, it is all part of a tightly coupled system. But still, they are combining a bunch of fun, non-mainstream approaches into a coherent, not-purely-academic "product", so that's always nice.
And since we are on ssb: Unison has the problem that tiny changes result in completely distinct new code objects (due to the new hash). Imagine you had a feed per program/function/insert-your-favorite-unit-of-code-here instead, then you could keep a persistent sense of identity while changing stuff by incrementing sequence numbers. And you could even do semver-like greedy fetching of new "versions" of code (caveat: all the usual problems of semver as opposed to minimal version resolution still apply).
@mix This is looking good, I don't see any big problems on first reading. A few comments that came to mind:
`first` is technically not necessary. I can see how it is useful, but you'll need to clearly define what happens if `first` is invalid (points to a message whose `first` entry is not itself, or points to a message that can not be reached by transitively following `prev` links; there are probably other invalid configurations as well).
If you want to go all in on the monoid approach, a reduced document would look like `{name: { action: 'set', value: 'Schlossgarten Oldenburg', ... }, ...}`: the reduced state would be the operation that transforms the identity PoI into the desired one. Or yet another viewpoint: the smallest representation of the function that is equal to the composition of all the functions encoded by the prior messages.
The details of this would need to match with whatever message schema you want to use. I personally prefer option A because the data structure works out well (you can iterate over all changes but you can also efficiently check whether a specific field is affected).
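A small js sketch of that viewpoint (only a 'set' action is modelled; the shape mirrors the example above but is otherwise made up):

```js
// The reduced document is itself an operation; reducing a log is a fold.
const identity = {}; // the operation that changes nothing

// Compose two operations: apply `a`, then `b`. For plain per-field 'set'
// actions the later set wins, so compose is associative with `identity` as
// the neutral element -- a monoid.
function compose(a, b) {
  return { ...a, ...b };
}

const messages = [
  { name: { action: 'set', value: 'Schlossgarten' } },
  { name: { action: 'set', value: 'Schlossgarten Oldenburg' } },
];
messages.reduce(compose, identity);
// => { name: { action: 'set', value: 'Schlossgarten Oldenburg' } }
```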
When merging heads, I like the idea of using a regular message (with multiple `prev`s) if the operations on the heads commute, and doing special merge messages if there is an actual conflict.
Remember that forks can have more than just two heads, and also that merges might only merge a subset of all dangling heads.
UI questions around merge conflicts are always tricky. If you show the latest nonconflicting state, then users may experience rollbacks if they later receive new messages that create a merge conflict "in the past".
I don't get the difference between the hydra case and the dangling hydra case.
What happens if two kaitiaki remove each other concurrently?
Appends by non-kaitiaki will be ignored, right? What happens if they later become a kaitiaki, do their old "unauthorized" changes suddenly get integrated? When adding a kaitiaki, you might want to also specify the seqnum starting from which their posts become authorized.
Same for removing kaitiaki, in the majority of cases you probably don't want to retroactively erase all their updates.
The concerns in the prior two paragraphs become much more pronounced in a setting with more fine-grained update operations (imagine reverting a bunch of typo fixes when revoking someone's access).
Requiring acceptance from a potential kaitiaki doesn't seem to change a lot on the technical side - if they don't want it, they won't publish edits to the PoI in the first place. So this could be done as a distinct ("cosmetic") aspect that doesn't affect the core algorithms.
Looks like I live in Basel now.
@Linas Thank you for this, having the actual terms to search for is super helpful. And the isomorphism to dependency graphs is neat. I find it curious that the CRDT stuff I've read so far has never mentioned this area at all (not even something like "partial commutativity").
From (very briefly) skimming some TOCs, it seems like there's a bunch of the "obvious theory stuff" (we have a thing, now let us categorize languages and build logics with it) - I'm curious how much of that will turn out to be useful for us. But it'll be a while until I'll take the time to read through these things...
More notes:
- an idempotent semigroup is called a band
- it is interesting to look at what happens if we keep the commutativity requirement but drop idempotence: an example would be an integer counter that can be added to and subtracted from. The order we would care about (the usual order on integers) would not be derived from the algebraic laws as with a lattice, but that doesn't really matter.
- operation based CRDTs are simply states that are updated through commutative operations... no need for lattices there
- requiring the operation to not only be commutative but also a semigroup/monoid is not really needed to get conflict-free replication, but monoids are cool (among other reasons because ropes are cool)
- the access-graph can be made commutative by representing it as a multiset that supports negative multiplicities (note that this shreds idempotence). Access would be given to authors whose multiplicity is strictly greater than 0. Whether this would match user expectations is a different question... But it might be a viable alternative to a merge conflict (especially since user interfaces could clearly indicate multiplicities other than zero and one). A small sketch follows after this list.
- destructive updates can simply be defined with respect to some arbitrary order on the set of values, trying to generalize lattices hasn't really led anywhere
- the interplay of associativity, idempotence and commutativity is fun though, trying to add "destructive" operations tends to clash with idempotence, which I hadn't expected just from the formal definition (intuitively (for me), idempotence tends to "swallow" updates, commutativity can be used to change how much is swallowed, associativity makes sure that this is disallowed)
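Here is the signed-multiset idea from the list above as a small js sketch (the field names are mine):

```js
// Access as a multiset with possibly negative multiplicities. Merging is
// pointwise addition: commutative and associative, but not idempotent.
function merge(a, b) {
  const out = { ...a };
  for (const [author, n] of Object.entries(b)) {
    out[author] = (out[author] || 0) + n;
  }
  return out;
}

const hasAccess = (multiset, author) => (multiset[author] || 0) > 0;

// Two peers concurrently grant alice access, one grant is later revoked:
const state = [{ alice: 1 }, { alice: 1 }, { alice: -1 }].reduce(merge, {});
hasAccess(state, 'alice'); // true -- multiplicity is 1, which a UI could surface
```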
I don't know how much time I'll have, I haven't drawn in months, and I might not even have a way of deriving pixels from paper except for a webcam...
So yes, count me in
Expanding a bit on the idea of categorizing destructive updates: this is starting to feel like a generalization of semilattices, with non-destructive updates being functions that are monotonically increasing with respect to the underlying order. A semilattice is a semigroup (a monoid that doesn't need to have an identity element) where the concatenation operation is commutative and idempotent (in addition to being associative). If we do have an identity element (i.e. start out from a monoid rather than a semigroup), we even get a bounded semilattice.
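For reference, the "underlying order" is the one every semilattice induces (standard order theory, nothing specific to this thread):

```latex
% order induced by a (join-)semilattice (S, \vee):
a \le b \;:\Longleftrightarrow\; a \vee b = b
% Associativity, commutativity and idempotence of \vee make \le a partial
% order, and a \vee b is the least upper bound (join) of a and b.
% A "non-destructive" update would then presumably be an inflationary map f,
% i.e. x \le f(x) for all x.
```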
The access graph forms an idempotent monoid, but it is not commutative. If it was, we'd have a semilattice (and thus a crdt). So the interesting task now is to think about "how far away" from commutativity we are. Which again boils down to the local decision of whether two particular elements commute or not.
The interesting question now is this: which effect does the relaxation of the algebraic definition (not requiring commutativity) have on the order-theoretic interpretation? Intuitively I'd expect a relaxation of the requirement that the join of two elements always exists. If two elements commute, do they have a join? What about the other direction? Under which conditions can operations result in "travelling down the order" while still commuting?
My (admittedly very vague) drive behind this is to define data structures that can incorporate knowledge of non-destructive updates (stuff that doesn't move down in the order?). This could make some things more efficient: often we don't need to know the exact state, we only need to know that we are "above" a certain element in some partially ordered state space. For example we'd like to know that we are in a state strictly greater than "A has no access to B" (that's when we get to share B's data with A), but we don't want to fetch the whole history of whom else B has given access to - we only want to know that B has not revoked A's access (an operation that "lowers" the position in the partial order).
Another theme here could be the separation of the data domain into multiple, independent areas (some of which might form true semilattices) - this would allow us to ignore updates to areas we aren't interested in. Basically the reverse of composing monoids.
Gah, this is stupidly vague... I wish we had more mathematicians here, all this generalization-of-lattices-stuff is probably well-known already. Anyways, I'll stop rambling now.
@mikey Oh, thanks. Apparently I can't read =/
instinctively looks for a close issue button
Forking @mix from %vie81xI...
I’ve also moved to put all the tangle related data in its own area - I want to do this in the future. Start another thread if you’ve got feels
Just a question: should it be possible for a single message to be part of multiple tangles (I don't know what to think of that yet, but it is a possibility to entertain)? If so, the container format should support it: `{tangles:[{...}, {...}]}`.
(my gut feelings on a deliberate container format in general are very positive)
(please don't quote this out of context, I despise docker and friends with all my heart)
@mix
Nice, those messages are pretty close to (my) home and would definitely work. Just a few nitpicks:
In the context of this thread, I'm deliberately restricting the scope to a feed describing access to itself only. This means that we don't need any tangle stuff at all, we only consider the set of (totally ordered) messages from a single identity. All you'd really need is the "meta-seqnum" ("this is the n-th update to my access control graph"). And for ssb you'd probably want the cypherlink to the previous (within the access-rope) and the lipmaalink target as well.
You got the monoid structure (the `operations` entry in your mockup) right. But as detailed in the second post of this thread, I'd also add the tip of the log that you give access to in there. Also, there is some redundancy in that there is an array (set really) of operations and within each operation a set of feeds. I'd probably just have a set of single-feed operations.
{
  operations: [
    {
      type: 'add',
      feed: '@ye+QM09iPcDJD6YvQYjoQc7sLF/IFhmNbEqgdzQo3lQ=.ed25519',
      at: 'hash-of-your-latest-access-control-change',
      at_seqnum: 42
    }
  ]
}
There are a few more details that would need fleshing out:
- a lot of redundancy in that data: e.g. in a bamboo-based setting `feed` and `at_seqnum` would suffice, in ssb technically `at` alone would be enough
- might make sense to have separate links to both the previous update and the previous destructive update (an update that revokes access control); this would mean that transitive access only gets invalidated on delete operations rather than on any operation. But that's a different question. It can be generalized though: We have a monoid where we know in advance that some subset of elements always commutes (the non-destructive adds) and a disjoint subset might lead to merge conflicts (deletes). I'll have to let that thought stew a bit, but it feels like we could find a useful, general pattern and corresponding approach/technique/recipe here.
Another solution: Don't equate `{}` and `{counter: 0}` at all: `concat({foo: 42}, {}) == {foo: 42}` and `concat({foo: 42}, {counter: 0}) == {foo: 42, counter: 0}`. Leave it to the consumers to treat these two results as equivalent (equivalently: to apply the homomorphism from above). This way we get a monoid on the full set of json values. It's kinda weird to have to propagate stuff like `counter: 0`, but it works fine.
it’s not a monoid, it’s something else
If you want to get all technical about it, we have a set of values (the json values) with an associative operation on it, but we only care about the equivalence classes of a relation on the set that puts things in relation if they are equal up to omission of neutral elements in maps. So e.g. `{}` and `{counter: 0}` would be in relation (assuming `0` is the neutral element of the counter). We can now lift the append function to operate on these equivalence classes, and that forms a monoid. Alternatively, fix a representative member of each equivalence class and define the resulting section (the function from each json value to the representative of its equivalence class). The codomain of the section (i.e. the set of all representatives) again forms a monoid, and we don't even need to lift the original append function. The important point that allows us to work with the arbitrary values rather than with the representatives is that the section itself is a homomorphism from json values to the monoid we actually care about, with respect to append and identity elements.
So you are right, that sketch is technically not dealing with a monoid, but it deals with something that induces a monoid and is homomorphic to it - which is good enough. For this to actually work, the consumers of the data must also respect that homomorphism, which is a fancy way of saying that e.g. `{}` and `{some_counter: 0}` must be treated equivalently by consumers. A library could help with this by normalizing the data it emits by always applying the section before outputting any data.
An alternate approach that circumvents the homomorphism (but has barely any practical gains) is to reject data that is not in the normal form (one of the chosen representatives). There are two sensible choices of representatives: either always omit entries in maps whose values are the neutral element, or require the presence of all keys. The latter is problematic when trying to evolve the schema, and it also produces a lot of redundant data.
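A small js sketch of the normalizing section (which fields have which neutral element is application knowledge; the lookup table below is made up):

```js
// Neutral element per field, e.g. 0 for a counter.
const neutral = { counter: 0 };

// The "section": map a value to the representative of its equivalence class
// by omitting entries that hold their field's neutral element.
function normalize(value) {
  const out = {};
  for (const [key, v] of Object.entries(value)) {
    if (key in neutral && v === neutral[key]) continue; // drop neutral entries
    out[key] = v;
  }
  return out;
}

normalize({ foo: 42, counter: 0 }); // => { foo: 42 }
// For a per-field concat that treats a missing key as the neutral element,
// normalize(concat(a, b)) and normalize(concat(normalize(a), normalize(b)))
// coincide -- that is the homomorphism property consumers need to respect.
```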
@Melvin Zhang
Huh, good point. Do you have any literature recommendations for an introduction to this stuff?
I've also been idly wondering whether it makes a difference if a program for a problem can be expressed through primitive recursion or whether it needs Turing completeness. But I have no idea where to even start thinking about this more thoroughly...
@mix Here's a sketch for composition of monoids in js: https://gist.github.com/AljoschaMeyer/bb0f47f5d58e69c1d2b0dd0c4bb84d38
Woot @Melvin Zhang, I've been asking myself that exact question ("Is there a simple program that halts iff the Riemann Hypothesis holds?") yesterday (in the context of automated theorem proving over total functions). Rather than burning brain cycles, I moved on to the Collatz Conjecture instead to make my point, since the algorithm there is trivial (just compute the sequence until it reaches 1). But this is so cool!
Future travelers can look at the patch theory references provided here for a closely related topic.
I went on a literature dive today.
There's some early work (1980s and onward) on merging plaintext files in general and programming languages (context-free syntax, context-sensitive syntax, arbitrary semantic criteria, graph-based representations) in particular; a nice survey is Mens, T., 2002. A state-of-the-art survey on software merging. IEEE Transactions on Software Engineering, 28(5), pp.449-462. It highlights a few important distinctions, e.g. distinguishing between state-based merging and operation-based merging (ssb logs naturally lend themselves to encode sequences of operations).
More interestingly for the ssb world (ok, at least more interesting to me...) is patch theory, developed in the context of the darcs distributed version control system. Patch theory is about the general composition and application of patches (changes, deltas) over some data structure, and dealing with merges. Beyond the darcs wiki and some pijul documentation, there are papers that lay out formalizations of patch theory based on different branches of pure maths:
- algebraic: Jacobson, J., 2009. A formalization of darcs patch theory using inverse semigroups
- homotopy type theory: Angiuli, Carlo, et al. "Homotopical patch theory." ACM SIGPLAN Notices. Vol. 49. No. 9. ACM, 2014.
- category theory: Mimram, S. and Di Giusto, C., 2013. A categorical theory of patches. Electronic notes in theoretical computer science, 298, pp.283-307.
To be honest, a lot of that material goes over my head - the one based on semigroups was the most accessible to me, but that might have been due to being at least slightly more comfortable in that area of math than in type theory or category theory. Still, all of these suggest a nice mapping to append-only logs (and quite possibly append-only ropes). CC @erick, you might enjoy the academic overlap here. A patch-theory-based vcs on ssb could totally happen at some point.
Bonus links:
- a json crdt: https://arxiv.org/pdf/1608.03960.pdf (basis for automerge)
- algebraic file synchronization: https://www.cis.upenn.edu/~bcpierce/courses/dd/papers/ramsey-sync.pdf (just cool)
Summary: Patch theory covers a lot of what you are looking for, but the available material is pretty abstract and dense. But on the plus side, there are always trivial solutions (manual merging or arbitrary tiebreakers), so getting to something that works won't be too hard.
I did not really find algorithmic takes on this (efficient conflict detection, caching stuff, etc.), that seems to be buried in the implementation of darcs, pijul, and friends.
I’m still struggling to define / name / describe rules for the parts of a “document” which behave differently like this.
Commutativity seems to be a major one: Like counts, attendance etc aggregate changes in a way such that the order in which the changes arrived doesn't matter. It actually isn't even full commutativity: Attending and then revoking attendance is not commutative, but only happens as part of the same, totally ordered log, so we always know the correct order in which to handle the changes.
CRDTs (or automatic merges in general) are often about defining things in a way that they become commutative, even though this often might not map well to the use case. Example: You could trivially make title updates commutative by always preferring the title given by the author whose public key is the lexicographically greatest one. This obviously yields low-quality results, but it doesn't require user interaction. Collaborative CRDT-based text editing often falls back to such arbitrary measures.
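As a tiny js sketch of such an arbitrary tiebreak (field names made up):

```js
// Keep the title whose author key is lexicographically greatest: commutative
// and associative, but the "winner" has nothing to do with quality.
function mergeTitle(a, b) {
  // Updates by the same author would need a further tiebreak (e.g. on the
  // title string itself) to stay well-defined; omitted here.
  return a.author > b.author ? a : b;
}

mergeTitle(
  { author: '@aaa...ed25519', title: 'Gathering at the lake' },
  { author: '@zzz...ed25519', title: 'lake meetup' }
);
// => the '@zzz...ed25519' title, regardless of the order the updates arrived in
```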
A less formal lens is that of "non-destructive" updates.
I like that view of splitting gathering messages into distinct parts with different merge requirements. Seems to be a first step towards defining composable merge mechanisms.
Caption: an accurate representation of my thought process when thinking about merges on graph structures.
@mix.windows A fairly important distinction: which data structure is this about? Are you investigating "mutable" plaintext, or is this about general tree structures (some markup, json, arbitrary ssb messages)?
This is a topic I've wanted to explore for a long time, but I probably won't have the capacity to dive into it in the next weeks. But I'm excited to follow your progress and make annoying comments =) I'll probably look at things rather abstractly (tangles as DAGs, edit operations as operations on monoids or groups, merges as, well, that is where the fun begins...), but hopefully that might become useful at some point.
Boring but quite possibly true: conflict resolution mechanisms are domain specific and looking for the one true way is futile. Compare how e.g. git, a wiki, and a collaborative text editor all handle conflicts on similar data structures in completely different ways.
Corollary (not related to this thread, I just need to vent): if you claim that CRDTs are the one true solution for conflicts in distributed systems, you are probably wrong.
Nevertheless, settling on a good manual merge algorithm that could be used by different applications in completely different domains would be valuable.
Aside from git, I can also recommend a look at pijul which does merging in an arguably nicer (and definitely more principled) way than git.
@cft Are these distribution maps a feature of pcapng? Otherwise, this sounds like private-box (the format used for encrypted ssb messages) would fit the usecase (modulo the low maximal number of recipients).
The While Loop
A while loop allows executing a block (the body of the loop) multiple times, for as long as a condition expression evaluates to a truthy value. The syntax of the expression is `while <expr> <block>`, where `<block>` is a block (a sequence of semicolon-chained expressions within curly braces). Its evaluation works as follows:
1. Evaluate the condition. If the result is truthy go to 2., else go to 3.
2. Evaluate the body, then go to 1.
3. Evaluation of the while loop is over, the result is the result of the last evaluation of the body, or `nil` if the body has not been evaluated at all.
A small example program:
let mut a = true;
let mut b = true;
while a {
  if b {
    b = false;
  } else {
    a = false;
    42;
  }
}
In this program, the body of the loop is executed two times, the condition is evaluated three times, and the final result is the value `42`.
Using the while loop, it is possible to write programs that never stop evaluating, for example `while true {}`. Before we introduced the while loop, evaluation always mapped a program to a value. Now, evaluation only maps some programs to a value, while others don't have a well-defined result. Even worse, it is now impossible to define an algorithm that can determine whether a given pavo program terminates or not. There are good reasons for having the while loop in the language though: It allows us to express arbitrary computations. Going into further details would go beyond the scope of this text; the branch of computer science that deals with these things (computability theory) is one of the oldest and most fundamental ones. The main takeaway is that there are some problems that cannot be solved by computer programs, and unfortunately, categorizing arbitrary programs based on some runtime criteria is among them.
There are two special expressions that can appear in the body of a while loop (but that can not appear in other places of a program). When a `continue` expression is evaluated, the loop jumps to the next iteration: the condition gets evaluated again, and iteration either continues or is finished (evaluating to `nil`). The `break` expression directly finishes the evaluation of the loop, and the loop expression evaluates to `nil`. Both `break` and `continue` may optionally be followed by an additional expression. This expression is evaluated as part of evaluating the break/continue expression. In the case of `continue`, it yields the result of the loop iteration (so the loop evaluates to it if the loop condition evaluates to `false` or `nil` after the continue expression). In the case of `break`, it directly yields the result of evaluating the loop.
let mut a = true;
while true {
  if a {
    a = false;
    continue;
    break 42; # never evaluated
  } else {
    break 0;
    break 1; # never evaluated
  }
} # evaluates to 0, takes two iterations
Exceptions
Remember how evaluation was supposed to turn an expression (or sequence thereof) into a value? That was only a simplification. The actual mechanism is a tiny bit more complicated: Evaluation either successfully yields a value (as is the case with all expressions covered so far), or it might result in an exception. We refer to an expression that yields an exception as throwing the exception. An exception is just an arbitrary value. Whenever the evaluation of a subexpression results in an exception, evaluation of the outer expression (or sequence of expressions) yields that same exception - without evaluating any further subexpressions. So in the program `<expr-1>; <expr-2>`, if evaluating `<expr-1>` throws `42`, then the whole program throws `42` and `<expr-2>` is never evaluated. By extension, this means that running a pavo program has one of three possible outcomes: It might successfully evaluate to a value, it might throw a value, or it might end up in an infinite loop and never terminate.
The expression `throw <expr>` evaluates `<expr>` and then throws the resulting value.
(throw 42) == (throw 43) # throws 42, the `throw 43` expression is never evaluated
At this point, exceptions don't seem very useful, they are just a mechanism for terminating a program early. Their usefulness stems from another expression that allows us to react to exceptions: `try <block-1> catch <name> <block-2>`. The try expression is evaluated by first evaluating the expressions in `<block-1>`. If none of them throws, then the whole expression evaluates to the value of the block. But if the block throws, evaluation of the try expression resumes by executing `<block-2>` in an environment in which `<name>` is bound to the exception. The name can optionally be preceded by `mut` to make the binding mutable. A few examples:
try {42} catch n { 17 }; # evaluates to 42, the catch block is never executed
try {
  0;
  throw 1; # will be caught, execution resumes at the catch block
  throw 2; # never executed
} catch n {
  n
} # evaluates to 1
try { throw 0 } catch mut n {
  n = 1;
  1
} # evaluates to 1
try { throw 0 } catch n { throw 1 } # throws 1
The whole point of adding exceptions to a programming language is to make it such that every action (i.e. evaluation of an expression) can either succeed or fail. We expect things to succeed most of the time, so the way that programs are written reflects this: as long as nothing goes wrong, they are essentially executed in sequence. In the exceptional case, we leave this linear control flow and the try/catch mechanism kicks in to handle things. This is another design choice in the language that helps with readability: We don't need to pollute the "happy path" of program execution with error handling all the time, instead, we can isolate the error handling logic in specific places (the catch blocks).
Chapter Recap
This chapter covered:
- the final shape of the evaluation process: An expression or sequence of expressions is evaluated in an environment and either yields a value, throws an exception, or diverges
- scoping rules for name binding within blocks
- the if expression for conditional execution
- the while expression for repeated execution
- the throw and try/catch expressions for working with exceptions
- that there are problems that no computer program can solve
Chapter 3: Go With the Flow
So far, the evaluation of pavo programs has always been straightforward: Evaluate all expressions of the program in order, recursively including their subexpressions. In this chapter we will learn about expressions that allow us to change this control flow, skipping over some expressions or otherwise deviating from linear execution order.
Conditional Evaluation
An if expression allows evaluating some sequence of expressions only if a condition expression evaluates to a value other than `false` or `nil`. Such a value is called truthy.
- `if <expr> { <exprs> }` is an expression, where `<exprs>` is a semicolon-separated sequence of expressions (just like a program).
- It is evaluated by first evaluating `<expr>`. If it evaluates to `false` or `nil`, the whole expression evaluates to `nil`. Otherwise, it evaluates to the result of evaluating the chained `<exprs>`.
A few examples:
if true { 42 }; # 42
if 42 {0; 1}; # 1
if 42 {}; # nil
if false {42}; # nil, the expression 42 is never evaluated
if nil {42}; # nil, the expression 42 is never evaluated
There is also a second form of the if expression: the if-else expression. An if expression may continue with `else { <exprs> }`, where `<exprs>` is again a sequence of chained expressions. These are only evaluated if the condition evaluated to `false` or `nil` (falsey for short).
if false {
  42 # never evaluated
} else {
  0;
  1
} # 1
Finally, instead of following the `else` with braces, it may be followed by exactly one blocky expression. A blocky expression is an expression beginning with `if`, or one beginning with `while`, `for`, `case` or `loop` (all expressions we haven't encountered yet). This is purely to make the code more readable by saving on curly braces, as shown in the following example:
if nil {
  0
} else if false {
  1
} else {
  2
} # 2
Blocks and Scope
A sequence of semicolon-chained expressions within curly braces is called a block. Blocks have an impact on names and the environment: they restrict the scope of names. Name bindings only extend to the end of their block. A concrete example:
let a = 42; # a comes into scope, can be used until the end of the program
if true {
  let b = 43; # b comes into scope, can be used until the end of the block
  a == b; # evaluates to false
} # b goes out of scope, any occurrences of b beyond this point would be free
a;
This is necessary to prevent situations where a name could be used even though it was never bound:
if false {
  let a = 42
};
a # This is not a valid pavo program, because a is free. There would be no sensible semantics if this was valid.
The interaction between scopes and shadowing can be a bit tricky:
let a = 0;
if true {
  let a = 1;
  a; # evaluates to 1
};
a # evaluates to 0
To get a clearer grasp of how this works, you can imagine the inner name was completely different:
# this example is equivalent to the prior one
let a = 0;
if true {
  let inner_a = 1;
  inner_a; # evaluates to 1
};
a # evaluates to 0
The same goes for shadowing where mutability is involved:
let a = 0;
let mut b = 0;
if true {
  let mut a = 1;
  a = 2; # this is ok because the immutable binding is shadowed by a mutable one
  let mut b = 1;
  b = 2;
  a; # evaluates to 2
  b; # evaluates to 2
};
a; # evaluates to 0, the outer binding has never been touched
b; # evaluates to 0, the outer binding has never been touched
Renaming the above program for clarity yields the following, equivalent program:
let a = 0;
let mut b = 0;
if true {
  let mut inner_a = 1;
  inner_a = 2;
  let mut inner_b = 1;
  inner_b = 2;
  inner_a; # evaluates to 2
  inner_b; # evaluates to 2
};
a; # evaluates to 0, the outer binding has never been touched
b; # evaluates to 0, the outer binding has never been touched
Transcript: http://softwarefreedom.org/news/2019/jun/17/transcript-for-republica19/
Interesting quote for @cft and @piet (emphasis mine)
We need to appreciate that the goal of the network is not the constant subsidiary mental activity of human beings. The goal of the network is not push, it’s pull. That is to say, all of the effort to make the network operate on internet time is a form of pollution. What we really wanted was for human beings to initiate requests for what they want, what they need, what they wish for, what they think and what they learn.
Just dropping a note that trying to keep the paths of sublogs with a hierarchical log design private through such a salting scheme doesn't really work: If we maintain a follow-graph, then the follow messages might leak the existence of a sublog under a certain prefix (or in general every cypherlink would, just like in ssb every mention of a public key leaks the information that such a feed exists). I guess the only way around that is to use meaningless path identifiers (e.g. arbitrary numbers) and then provide a mapping from identifiers to actual meaning (e.g. "you can expect the sublog at `0/2/6` to be a git repo") within the payloads of dedicated entries. Back to the drawing board... On the plus side, keeping the path identifiers public enables much more efficient routing/replication.
Maybe we should think in terms of two webs, the global and the social one. There is always a tradeoff between resilience, performance and practicability. For global visibility, some tradeoffs cannot be made. Same goes for gossip-networks.
I'm on it =P
In my idea of an ssb-based social-web (websites only exist within your social graph) the publisher is identifiable, but the recipient is a lot harder to track.
Agreed for browsing public content, but for browsing private content that is not kept private through encryption, you need to reveal your identity in order to gain access to it. But since you are getting access based on a web of trust, the trust is hopefully mutual so you don't mind revealing your identity.
My design sketches for replication support three different roles one can play: authenticated as the owner of a log, anonymous, and authenticated with a pseudonymous keypair that is not tied to a log. The latter two would only be given access to public content, the authenticated-as-log-owner (which is ssb's default option) would also be given access to private content based on their position in the social graph. The pseudonymous keypairs allow repeated interactions between otherwise anonymous users (so you can track "preferred" anonymous users who don't freeload etc). I'd run this over tcp via tor onion services (though primarily because it alleviates the need for NAT traversal...).
For a whole lot more context, see here; bamboo would be a suitable candidate for the "lower scuttlebutt" part. But I'm not designing it to specifically fit that role, this has just been a coincidence. Hypercore (dat's log) is another format that could be used for "lower scuttlebutt". There's been some discussion about adding new log formats for ssb, but these have stalled by now. The last specific push forward has been done by @cryptix, I don't know about the status of that proposal.
Mutable Bindings
Bindings as introduced so far are immutable: Once a name is bound to a value, this binding does not change. Sometimes we may want to change the value a name is bound to over time. To do so, we must first mark the binding as mutable by introducing it as `let mut <name> = <expr>`. This works just like a regular let expression. Next, we introduce the assignment expression for changing the value to which a mutable binding is bound: `<name> = <expr>` evaluates `<expr>` and rebinds the name `<name>` to the resulting value. The assignment expression itself evaluates to `nil`.
let mut a = 42;
a = 43;
a; # evaluates to 43
Only bindings that have been declared as mutable can be assigned to. A program that tries to assign to an immutable binding is invalid and cannot be evaluated at all. This is the third and final kind of static error that pavo programs can have (the other two were invalid syntax and occurrence of free names).
In principle, immutable bindings are strictly less powerful than mutable bindings. So why aren't mutable bindings the default (and why are there immutable bindings at all)? This is primarily so that programmers can protect themselves from writing incorrect programs by accident. If you don't intend to mutate a binding, then the language will stop you from accidentally assigning to it. This can be very helpful, especially as programs become large. Additionally, it makes reading the program much simpler: The reader knows they don't have to mentally keep track of a binding's evolution over time unless it is explicitly marked as mutable.
Writing programs such that other humans can easily read them is one of the harder aspects of programming - but also one of the most crucial ones. Nobody will use your program or build upon your code if they cannot discern what it does. And they cannot perform changes (e.g. to make it more efficient, add new features, or fix errors) if they cannot understand it in the first place.
Chapter Checklist
This chapter covered:
- programs as sequences of expressions
- the `nil` value for conveying absence of meaningful information
- the concept of an environment as a partial mapping from names to values, and its role in the evaluation process
- the let expression to bind names to values
- the assignment expression that can manipulate mutable bindings
- the importance of writing programs that are easy to read
Chapter 2: Name of the Game
Suppose you wrote a lengthy, sophisticated expression, for example `42 == 43`. Suppose further you wanted to use the resulting value at multiple points in your program, e.g. to check whether it was equal to itself. You would have to duplicate the expression, resulting in an awkwardly long program: `(42 == 43) == (42 == 43)`. If you wanted to change the original expression later, you would need to manually find all places where you used it and then update them. This process is error-prone and inefficient.
Pavo offers a solution: The result of an evaluation step can be given a name. At later points in the program, the name can be used and its expression evaluates to the value to which the name was bound earlier. This chapter explains the mechanisms that are involved to make this work.
Syntax and Semantics of Names
In order to use names in the programming language, we need a new kind of expression:
- A name is an expression. A name is a sequence of at least one and at most 255 characters that meets the following criteria:
  - it only consists of the characters `a` to `z`, `A` to `Z`, `0` to `9` and `_`
  - it does not begin with a digit (`0` to `9`)
  - it is not one of the following reserved words: `nil`, `true`, `false`, `if`, `else`, `return`, `break`, `while`, `mut`, `loop`, `case`, `throw`, `try`, `catch`, `for`, `let`, `rec`, `_`
The semantics of names present a problem: We want to assign names dynamically, and we might even want the same name to refer to different values at different expressions in the same program. These requirements cannot be met with an evaluation function that only depends on the expression that is evaluated. So we need to introduce a new concept: An environment. We look at a simplified definition first, and will later refine it.
For now, an environment simply associates a set of names with one corresponding value per name. A name that appears in an environment is called bound (to the corresponding value and within the environment), and a name that is not part of an environment is called free. The pair of a name and its corresponding value is called a binding.
Evaluation of an expression always occurs in the context of an environment. There is a default environment that is used for evaluating pavo programs. Implementations may provide mechanisms to evaluate programs in a non-default environment, but unless explicitly stated otherwise, we always assume programs/expressions to be evaluated in the default environment. The default environment includes many useful bindings, for now we only need to concern ourselves with the name `int_val_max` that is bound to the value `9223372036854775807`.
With environments defined, the semantics of bound names are straightforward: A bound name expression evaluates to the value to which the name is bound in the evaluation environment. For example, in the default environment, `int_val_max == 42` evaluates to `false`.
What about free names? What would `kjhkjhjk` (which is not bound in the default environment) evaluate to? This is a trick question: such an expression cannot be evaluated at all, similar to how `*&[ z$ $` cannot be evaluated. An expression can only be evaluated in an environment if it contains no free names. Just like a syntax error, this is a static error: It is detected before evaluation of the program starts at all. This means that defining the semantics of bound names only is sufficient to get a well-defined semantics overall.
Binding Names
So far, names and environments are not very useful, we are stuck with the same environment throughout the whole evaluation of a program. The next expression changes that: `let <name> = <expr>` is an expression that evaluates `<expr>` and binds `<name>` to that value. Even if the name had been bound to a value previously, the new binding is used when evaluating further expressions. This process of "overwriting" a binding is called shadowing.
Before we can further explore these concepts, we need to introduce a couple of things. First, what does the let expression itself evaluate to? We don't really care, the expression is used to modify the environment, not to compute a value. To this end, pavo provides the `nil` value. The value `nil` is used whenever we don't actually care about the result of evaluating an expression. There is a corresponding `nil` expression, it evaluates to the value `nil` (just like e.g. the expression `true` evaluates to the value `true`).
The second problem we need to address is that a program consists of only a single expression, so once we wrote a let expression, there's no further expression left that could use the new binding. We fix this by expanding the definition of a program:
- A program consists of zero or more expressions, each separated by any amount of whitespace, a semicolon, and again any amount of whitespace. There may be up to one trailing semicolon (wrapped with any amount of whitespace).
- A program of zero expressions evaluates to `nil`. A program of one or more expressions is evaluated by evaluating the expressions in sequence and yielding the result of evaluating the last expression.
So now we can write programs like the following:
let a = 42;
let b = 43;
a == b; # This expression (and thus the whole program) evaluates to `false`.
I'm pretty sure I wrote down four album titles, but I can only remember three of them. Oh well, posting them on ssb would be too easy anyways. Guess we'll have to wait for the next non-digital connection to open up, perhaps I'll remember the fourth one by then.
@moid More like an sbot but for bamboo/whatever-comes-after-bamboo, or actually a subset of an sbot.
There are a few themes I'll need to explore more before doing a proper writeup. Backpressure via limits on resource usage is one of them. Another crucial one seems to be to separate queries that deal with the logical data structures (e.g. "Give me the best approximation of A's log you can.") from queries that deal with the state of local replicas (e.g. "Give me the list of all feeds for which you currently have a local replica"). When the latter are presented as a stream over time, then deletions need to be conveyed ("The feed of B has been deleted, go ahead and remove their data from your indexes") whereas the best-effort descriptions of append-only datastructures never need to convey destructive updates (there's no "Please un-replicate these messages").
These "append-only queries" are far simpler to do, that's why I wrote "replication server" instead of "application server" (which is what I'd call sbot). In many cases, sbot presents apps with the idealized feed view rather than the actual local content, which is one of the reasons for deletion being a pain, blocked posts sometimes being displayed, etc... Anyways, there's a lot to chew on and some code to write, so this will take a while.
@hoodownr Fun fact: Tales from Topographic Oceans is a very important album to me, it is the album that first made me explore popular music (I came from an almost exclusively classical musical background). I've since grown somewhat disillusioned with the album, but still, there are some wonderful moments on there. Which reminds me: Did you ever check out the album recommendations I put on your postcard from scuttlecamp?
=P
I'm glad that people like it. But there's the privacy stuff, hierarchical logs, and the idea of hardcoding follows and access control management. So lots of stuff left to figure out. Bamboo is sort of a "savepoint" that makes sense in itself but is not the final destination (for me at least).
But I'm currently sketching out designs for a replication server that is robust to adversarial connections (essentially by enforcing a bunch of limits for resource usage). So there will be progress towards an implementation of some format to report eventually. But I'm not rushing anything out, so it might take some time.
Public service announcement: #bamboo is me messing around with design ideas, it is not an ssb thing. It is also just an intermediate step, not the final design I want to eventually build upon.
@hoodownr There isn't a single unique way to sort it, there are multiple topological sortings. The important part is to not violate the partial order, e.g. you should never sort `me::6` before `alice::11`, since we know that `alice::11` already existed when `me::6` was created. But we have no idea whether `me::8` existed prior to `alice::12` or the other way around, so you could tiebreak in any way you want (in ssb, you could e.g. tiebreak using the claimed timestamp, or you might simply use whatever ordering the topsort algorithm produces).
Your table corresponds to a graph where there is a directed edge from `me::4` to `me::3`, from `me::5` to `me::4` and `alice::11` and `bob::44`, from `me::6` to `me::5`, from `me::7` to `me::6`, ..., from `me::9` to `me::8` and `alice::13` and `bob::45`, and from `me::10` to `me::9` and `bob::46`. That's the graph you do the topological sort on. Since every entry points to its predecessor (not as explicit tangle links, but through the backlink in the metadata), entries within a single feed are always ordered correctly. The exact interleaving of the feeds might be arbitrary, but it would always respect the causal order expressed by the tangle.
That's a fairly transaction-based (or interaction-based) mindset. The use of follow and access messages as sketched decouples the actions in time. The follow->approve flow imposes an arbitrary order that doesn't need to be there, because approve->follow really works just as well.
There's a risk for approve->nothing which is of course an inefficient case. But by accepting the risk (and the cost is ridiculously low, especially with transitive access control), we are relieved of needing synchronization. Which means that everything can run concurrently and is thus more efficient and delay/partition-tolerant.
Of course you might have a UI that prompts you whenever you can see a log issuing a follow message for you that doesn't have access yet. But still the underlying data structure (-updates) can be kept "lock-free".
@hoodownr Very quickly:
- yup, tangle backlinks would be in the payloads
- vector clock: technically not the same, but they are pretty similar. In some sense, if you made vector clocks sparse and transitive, you'd get tangles.
- backlinks for the tangle should include a hash of the target's metadata, to guarantee that the resulting graphs are acyclic. Without the hash, you could point the tangle to an entry that doesn't exist yet and later create that entry pointing to the first one.
- pseudocode topsort: https://en.wikipedia.org/wiki/Topological_sorting#Algorithms
- getting stuff done within two days: don't spend too much time on heuristics for which feeds to include in the tangle if there are too many, just pick a random subset of the appropriate size. If you really need to save time, don't compute where tangle backlinks are needed, just point to the newest entries of randomly selected logs (this will have a noticeable effect on the quality of the sorting though).
- cats > computers
{ "type": "about", "about": "%gpTylRB8jyShnJ1pdWnRDOwqC0pTW1fmjqHB+GhZSXc=.sha256", "attendee": { "link": "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519" } }
@hoodownr This is exactly the problem that tangles solve.
In the most trivial implementation, each message would refer to the newest available entry of all known logs. That includes a lot of redundant information though, you might want to go for the transitive reduction: Each entry should have at most one incoming reference from among all messages you currently know about.
This is the scheme that gives you the best possible ordering information with the least amount of references. This might still be problematic if there is a large number of updates, so you might want to put a cap on the maximal number of tangle-tip references per message. If there have been more updates than the cap, you'd need some tiebreaker to decide which ones to reference. This means that the partial order induced by the tangle is not the tightest one possible, but you can probably approximate it well enough to be useful. Tiebreakers could be metrics (e.g. pointing to the entries such that the largest number of nodes becomes part of the tangle, or minimizing the number of non-tangle nodes per log), random choices, or any combinations thereof.
When you want to order stuff, you'd compute a topological order on the tangle, with a tiebreaker of your choice. Doing this dynamically can become complicated (and is a longstanding open research problem), but at least batch topsort is simple.
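For completeness, a batch topological sort (Kahn's algorithm) over such a tangle, with the tangle given as a map from entry id to the ids it points back at. The ids are made up, and the tiebreaker is just whatever order the ready set happens to have; plug in your own.

```js
function topsort(backlinks) {
  // backlinks: { 'me::5': ['me::4', 'alice::11'], ... }
  const ids = new Set(Object.keys(backlinks));
  const remaining = new Map(
    Object.entries(backlinks).map(([id, deps]) => [
      id,
      // ignore links to entries we don't have locally
      new Set(deps.filter((d) => ids.has(d))),
    ])
  );
  const sorted = [];
  while (remaining.size > 0) {
    // entries whose predecessors have all been emitted already
    const ready = [...remaining.keys()].filter((id) => remaining.get(id).size === 0);
    if (ready.length === 0) throw new Error('cycle: not a valid tangle');
    for (const id of ready) {
      sorted.push(id);
      remaining.delete(id);
      for (const deps of remaining.values()) deps.delete(id);
    }
  }
  return sorted;
}

topsort({
  'alice::11': [],
  'me::5': ['alice::11'],
  'me::6': ['me::5'],
});
// => ['alice::11', 'me::5', 'me::6']
```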
Added the more privacy-preserving variant of bamboo as an optional extension to the spec:
Extensions
This section outlines modifications to the protocol that augment its capabilities at the cost of increased complexity. When talking explicitly about the original format without any extensions, the term vanilla bamboo is used.
Private Bamboo
Vanilla bamboo leaks some private data. If a peer is supposed to have access to only entry number two, they still get the metadata of entry number one. This allows them to learn the size of the payload, and to confirm guesses about the payload (by computing the hash of the guess and comparing against the actual hash).
This can be fixed by adding a 96-bit salt to the logical data of an entry and adjusting the encoding. The salt should be randomly generated, and there should be no correlation between the salts of different entries. In the encoding that determines the data that gets signed, instead of signing the size and the payload hash, a (yamf-) hash of the concatenation of the salt, size and payload hash is signed.
- vanilla: `sign(tag | size | payload_hash | remaining_stuff)`
- private: `sign(tag | hash(salt | size | payload_hash) | remaining_stuff)`
When a peer requests the payload of an entry, the salt of the entry is delivered as well, so that they can recompute the hash and check that it is indeed the one that was signed. Salts must thus always be remembered and transferred; private bamboo incurs an overhead of 96 more bits per log entry (as well as the cost of generating salts).
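An illustrative sketch of where the salt enters the signed data. Node's sha256 and plain buffer concatenation stand in for the real yamf-hash and binary layout, so this is not the actual bamboo encoding, just the shape of it.

```js
const { createHash, randomBytes } = require('node:crypto');

const hash = (buf) => createHash('sha256').update(buf).digest();

// tag, size, payloadHash, rest are assumed to be Buffers here.
function vanillaSigningBytes({ tag, size, payloadHash, rest }) {
  return Buffer.concat([tag, size, payloadHash, rest]);
}

function privateSigningBytes({ tag, size, payloadHash, rest }) {
  const salt = randomBytes(12); // 96 bits, uncorrelated across entries
  const salted = hash(Buffer.concat([salt, size, payloadHash]));
  // the salt has to be stored and handed out together with the payload, so a
  // requester can recompute `salted` and compare it against the signed data
  return { salt, signingBytes: Buffer.concat([tag, salted, rest]) };
}
```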
@VI You might like the ideas behind bamboo that would allow you to "forget" parts of a feed while still preserving the ability to verify everything.
Perhaps the confusion is this: While you can have (and verify) metadata without the corresponding payload (the whole point of offchain-content), the reverse situation does not arise: You never ask for a payload without the corresponding metadata. That's exactly because the metadata allows you to perform all the integrity checks you talked about: Verify that the payload actually belongs to the metadata (by checking that the hash of the claimed payload fits with the signed hash in the metadata), and then verify the integrity of the metadata by checking backlinks/lipmaalinks for absence of forks (or more precisely: by checking that the backlinks/lipmaalinks don't conflict with any other metadata you have about the same feed).
Payloads are not content-addressed. There might be some hashes involved, but we are really binding a name (author + seqnum) rather than deriving a name from the content.
(without a hash)
There is a hash, and you would use it to verify that you were given the correct data. But you can't use it to request the data (in the sense that nobody would answer your request).
The number of hops doesn't even need to be global, it can be part of the follow/access messages. See also here for a model (developed together with @keks) that is mostly up to date.
Access control would be an allowlist, by default the feed is not replicated to anyone but yourself. But there'd be an option for giving open access to everyone. This would be overridden by blocking (I think? Haven't gone through the technical implications yet.), in which case you'd effectively get a blocklist. Secure-by-default makes much more sense from a technical and theoretical perspective than open-by-default. If some software wants an open-by-default experience, then it would have to set the open-access flag automatically.
While access control and something like nickname policies are both privacy related, I'd still put them in very different categories. Access control is fundamentally about routing, something that has implications that reach deeply into the implementation stack. Nickname policies are at the end of the day nothing but a shallow hint for UI behavior. In user-facing settings, it might make sense to bundle things into something that appears as a singular privacy policy. But on the technical level, these concerns are ultimately orthogonal and should be independent from each other.
I've even been playing with the idea of making follows and access control part of the actual log format, as opposed to ssb's approach where the information resides in regular messages. The reasoning is that subjective interpretation of messages is a core principle of ssb. The friends graph of ssb is nothing but a convention that many clients adhere to (because the js ones all use the same plugin, and other languages are copying the exact functionality of that plugin). But I'm completely free to ignore the friends graph, I might even claim that I had no idea how these messages were intended to be interpreted. With access control, this would be pretty shitty. I'd like to have a situation where "private by default, only forward along an allowlist" is part of the protocol. By implementing (or using an implementation of) the protocol, you'd signal that you are aware of that fact. So there would be no excuse for inappropriate sharing of content beyond the intended recipients - it would always be malicious (or negligent), but you couldn't play the "I didn't know" card. This is going beyond purely technical considerations - from a technical perspective I'd actually prefer the elegance of encoding this in regular messages. But it just seems really important to not leave room for excuses for violating privacy.
The whole point of that paragraph was initially to justify keeping this separate from a nickname policy, but I guess the content of that paragraph itself is interesting as well =D
It requires more network roundtrips to follow the rope all the way back during replication
Who said the network wasn't aware of the rope structure and couldn't deliver the rope from a single request? =D I'm currently exploring this in unrestrained design mode, not bound by SSB's limitations, and this would be totally possible. In the ranges doc of bamboo there's an option for replicating along the shortest path, which is exactly following the lipmaalinks. So then all that remains is to put the state changes into their own log. Whether that would be done through a second log disjoint from the main log, a sub-log, or just a protocol-specific logical thing that the replication layer is aware of, is fairly unimportant. But the targeted complexities are: Send a request of size O(1) ("please give me the rope foo in the range from x to y"), receive a response of size O(log(n)), one round trip overall.
A tangent: [...]
I can understand the urge for a per-feed mutable register, but by now I think that the rope approach combined with hierarchical feeds solves this in a nicer way. In general, I'm always very skeptical of "It is inefficient if we do it on every message, but it will be totally fine if we do it every n messages". It still means copying the state data a bunch of times.
Could the replication / access control be integrated into the friend graph information as an aspect of your relationship with someone?
I don't quite understand that question, can you elaborate on this? Conceptually granting access and following somebody are completely orthogonal. UIs might hide this and just have a button that both follows and grants access, but I don't see any reason for complecting these on the technical level.
Dang, I forgot to insert those links:
- access control: %zzanl5H...
- append-only ropes: %iS7URZV...
So every message of the feed would include the link to the newest (at that point in time) graph-delta of the access-control-rope. This way, each message efficiently specifies its access controls. In our example, if B1 indeed revoked access rights for A, then B3 would tell Z that A's claim could not be trusted because it was outdated. If B2 was a non-access message though, then Z would know that A's claim is up to date and could deliver the data.
Quick sidenote: This does not help if B revoked A's access after message 3. But while we should use that knowledge if available and reject A access, we can't guarantee that retroactive revocation works: A might already have retrieved B3 before the access was removed.
Transitivity
This does not yet work for transitive access. Suppose C gives access to B, and B to A. Now A should get access to C. Z has C's feed, but does not know about B. Even if A handed Z the messages of B that grant access to A, Z couldn't know whether they were still up to date. This is because the access messages only give access to a feed in total, without specifying any specific point within that feed. The obvious solution and insight number 2: The access messages should point to the newest point in the access-control-rope of the access-grantee at the point of publishing the message. Then, Z can refuse (potentially) stale claims of access even across feeds.
This information can grow stale: If B updated their graph, C would still point to the old update point. So we do the usual thing: Augment the rope with a tangle that refers to the newest points in the feeds that have access. We can even go so far as to publish a message to our rope that does nothing but update the tangle in case one of the access-holding feeds updates their graph. Whether there are explicit messages, or whether this information is only piggybacked on changes to the own graph, and whether the tangle is kept as tight as possible or saves some space by waiting to accumulate updates are policy decisions. They are generic to any tangle maintenance and don't belong in this text, so I'll move on.
Feeds in Time
This whole thing leads to a more general observation: Our cypher-identifiers (`@some_key`) only let us talk about a feed in general, without anchoring it in time. A reference to a message however implies a specific point in time in its feed. Besides the feed at any point in time (`@foo`) and the feed at a specific point in time (`%foo`), there's also the concept of the feed at any point in time after a certain message. This would be the semantics that the access graph I sketched above would want to use (there's a separate discussion on whether the graph shouldn't use specific points in time instead, but I won't go into that here).
With partial replication, you could even go so far as to talk about arbitrary subsets of a feed, "feed at a point in time" would simply be a special case of the subset that includes all numbers from 0 up to a point. As usual, encoding arbitrary subsets is inefficient, so other mechanisms might need to be introduced.
In the future, I will definitely try to pay more attention to whether a reference to a feed shouldn't really come with logical time information as well. While this whole text argues with a hypothetical access control graph, similar arguments can be made for the friends graph. With the friends graph it is not as crucial to strictly reject outdated claims, but still a redesign from first principles might end up with starting points as well.
Two Insights on Access Control for Feeds
A bunch of realizations that are painfully obvious in hindsight, i.e. the good stuff. Apologies for needing so many words, but these ideas are quite fresh. They do seem worth sharing, and I'm excited about them, so I want to ping a bunch of people: @arj, @keks, @mix, @mikey, @cinnamon
Setting the Stage: (Transitive) Access Control
Prior writings on access control here. Imagine feeds were not replicated by default, instead they were completely private. To allow replication, you'd put a message into your log that says "@somebutty is allowed to access this feed". Now, when @somebutty asks a node for your feed, it would be given to them. Like the current friends graph, this could be done transitively: If A gives access to B, and B gives access to C, then C has gained transitive access to A's log. We don't want to do this without limits, instead the access message includes the limit of how many hops of transitivity are allowed. We trust all peers to honour these settings. To make this more realistic, we can also add "deny" messages that allow you to explicitly shut someone out even though a transitive path from your feed might exist (this is completely analogous to blocking in the follow graph (aka friends graph)).
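For illustration, a small Python sketch of such a transitive check, under one possible reading of the hop limits (the data shapes and the `has_access` helper are hypothetical, not part of any spec):

```python
def has_access(requester, owner, grants, denies):
    """Sketch of the transitive access check described above; the data shapes
    are hypothetical. grants[f] maps the feeds that f granted access to the
    hop limit of that grant, denies[f] is the set of feeds f explicitly shut
    out. One possible reading of hop limits: a grant with limit h lets the
    grantee read and, if h > 1, pass access on with a budget of h - 1."""
    if requester == owner:
        return True
    if requester in denies.get(owner, set()):
        return False
    best = {}  # feed -> largest remaining hop budget found so far
    stack = [(grantee, hops) for grantee, hops in grants.get(owner, {}).items()]
    while stack:
        feed, budget = stack.pop()
        if budget <= best.get(feed, 0):
            continue
        best[feed] = budget
        if feed == requester:
            return True
        if budget > 1:
            for grantee, hops in grants.get(feed, {}).items():
                stack.append((grantee, min(budget - 1, hops)))
    return False
```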
Setting the Stage: Interplay With Partial Replication
Further imagine that partial replication was a thing. It wouldn't necessarily have to be implemented via lipmaalinks, but I'll assume understanding of that particular graph structure in this text. So if "lipmaalink" doesn't ring a bell, refer to that link. Now, here is the main problem I want to write about: Suppose there is a feed A, a feed Z, and there is a feed B whose first entry (B1) grants access to A and Z, and whose third entry (B3) is a message that A and Z are interested in. Z has a local copy of B3, but no other content from feed B, due to partial replication. A is currently talking to Z and asking for B3. Since Z knows nothing about B's access control settings, they deny the request. What can we do about this? A could give B1 to Z. Then Z will see that A has access and should hand over B3. Right? Nope, wrong!
The Problem
Z has no idea whether the access information in B1 is still up to date. B2 might be a message that revokes A's access. If that was the case, and A was a malicious peer, then A would obviously withhold that information from Z, claiming to not even have B2. So if peers like Z naively trusted such a "pseudo-proof of access", then A could gain access to stuff it shouldn't by preying on uninformed peers and claiming to still be trusted by B. This makes it very hard to revoke trust, and thus issuing trust becomes risky. That's bad, that's the opposite of what we want.
There is a trivial solution: With each message, include the full information of who gets access to it (including the information on how many hops of transitivity). This does not scale at all. So here is insight number one: We can actually do this in an efficient manner.
First, we'll add some indirection: If we had a single message whose content was the full set of access-grants and denies at some point in the log, then other messages would only need to include a link to that message and would thus convey the full access information. That's already much better than pasting that data into every single message, especially since it is repetitive across messages. But now, we still need to copy almost all the access data whenever we change our permissions, because then we need to create a new message to point to.
Here is where append-only-ropes come in: They allow us to encode the access graph over multiple messages, such that updating it is very efficient (O(1)) and only a logarithmic number of messages needs to be looked at (i.e. fetched) in order to reconstruct the full graph. If you don't feel like reading that link, but understand lipmaalinks, the basic idea is very simple. Each "access" or "deny" message expresses a change with respect to the prior accumulated state of the access graph. The state of the graph at some point in the log is computed by starting with the empty graph and then applying all changes in order. To form a rope, the message does not only include the change to the direct predecessor, but also the change to the lipmaa-predecessor. Thus, we can start from the empty graph and reconstruct along these lipmaa-predecessor changes to quickly reach the state of the graph at any point.
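A tiny Python sketch of that reconstruction (the delta representation is made up for this sketch; the real rope stores a delta to both the direct predecessor and the lipmaa-predecessor, here we assume the deltas along the lipmaa path have already been picked out):

```python
def access_graph_at(deltas_along_path):
    """Start from the empty graph and apply the (logarithmically many) deltas
    encountered along the lipmaa path, in order. Each delta is a list of
    ("grant", feed, hops) or ("deny", feed) changes relative to the previous
    point on the path."""
    graph = {}  # feed -> ("grant", hops) or ("deny",)
    for delta in deltas_along_path:
        for change in delta:
            if change[0] == "grant":
                _, feed, hops = change
                graph[feed] = ("grant", hops)
            else:
                _, feed = change
                graph[feed] = ("deny",)
    return graph
```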
continued...
@aljoscha I can speak Cantonese/Mandarin too, perhaps really rustily about decentralised stuff lol… and a bit intimidated about that ivory tower.
The ivory tower was a weak attempt at humor, nothing more. At the end of the day there are going to be a bunch of humans interested in taking in new perspectives - or at least that would be my hope =)
I assume everybody can get everybody’s bamboo log entries. getting the payloads is a different matter. Am I correct?
Yes. My main point here is that even while there might be access control on the payloads, bamboo still leaks some information about these payloads, i.e. their size. So while it is possible to build a reasonably private system on bamboo, if you care about such leaks, then bamboo is not the right format.
This would be exacerbated by replication protocols that freely throw around this metadata to increase efficiency. This would become much more pronounced in the (as-of-now hypothetical) hierarchical version: Leaking a payload size might not be so bad, but leaking the sub-log name `/chess/match-against-happy0` is a whole different story - especially if the routing layer forwards that information about me across the globe.
will there be some meta protocol for that somewhere some day?
That would be an application-layer concern. So from the bamboo-perspective: no. From the eventually-I-want-to-build-stuff-on-bamboo perspective: Absolutely.
@mix You got the problem description exactly right, but the content would not be requestable as a blob. Content is addressed by the author's public key and the sequence number, no hashes involved. While nothing would stop you from asking Piet for a message of a certain hash, Piet simply wouldn't give it to you. Unless he is malicious, which falls into the same category as him taking a screenshot of private data and posting it on twitter. Can't protect against that, that's the point where trust comes in. Payload `!=` Blob, in fact bamboo has no concept of blobs at all.
It is a good observation though that keeping data that is associated with an author (i.e. messages aka log entries) private is fundamentally different from keeping data that is not associated with an author (i.e. blobs) private. This is one of the points in favor of forgoing content-addressing completely and relying on identity+offset-based addressing exclusively when trying to keep things private.
@cft That's an interesting point. In principle, you are obviously right: nothing can be more resilient than "everybody gets all the data as quickly as possible". But that doesn't mean that we can't approximate this safely enough while still maintaining privacy. Point2point can operate even under the presence of a listener via session-based encryption. But it fails against an adversary that can prevent direct connectivity. Delay-tolerant public broadcast can't deal with listeners, but makes the job of the adversary incredibly difficult: They need to keep sender and recipient in disparate network partitions forever. The compromise that SSB demonstrates is that of using point-to-point encryption for the transfer between the parties that are allowed to have access. This beats a passive listener, while still keeping the job of an active adversary hard: They again need to keep sender and recipient in disparate partitions forever, but on a smaller network (namely the induced subgraph of all the nodes that do have access to the data). So if that graph is large enough, i.e. if sufficiently many nodes trust each other, active attacks should still be sufficiently hard to pull off. The best bet of the attacker is probably to either isolate the sender or the recipient, or to rely on social engineering. And those are the base scenarios we will always have to live with, no matter how sophisticated our system becomes.
Just to give a concrete example of the additional work for making #bamboo (spec) private enough: With bamboo, you can request individual entries of a log, but you also get the metadata of a few other entries in order to be able to verify the integrity of the entry you requested. So all metadata is conceptually public. Suppose I gave you access to only log entry 2. If you requested it, you'd also get the metadata of entry 1. Since you shouldn't have access to entry 1, everything in its metadata that allows you to derive information about its payload is a data leak.
Bamboo signs (and thus puts into the metadata) a bunch of stuff (that does not pertain to the actual payload), a `payload_hash`, and the `payload_size`. The `payload_size` is sensitive metadata, we should try to hide it. Solution: sign a secure hash of the size instead. But that leaves another problem: An attacker can confirm guesses about such hashes, both for the size and for the actual content. To stop that, we should add some random bits - 96 bits should be sufficiently paranoid. So the new signing algorithm could be `sign(stuff | hash(96_random_bits | payload_size | payload_hash))`, where `|` denotes concatenation.
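A quick Python sketch of that scheme (not part of bamboo as specified; sha256 stands in for whatever hash the format would actually mandate):

```python
import hashlib
import os

def blinded_size_commitment(payload: bytes) -> tuple[bytes, bytes]:
    """Instead of signing payload_size and payload_hash directly, sign
    hash(96_random_bits | payload_size | payload_hash). Returns the
    commitment that goes into the signed metadata, and the nonce that
    everyone with access needs to store and forward for verification."""
    nonce = os.urandom(12)  # 96 bits of blinding randomness
    payload_hash = hashlib.sha256(payload).digest()
    payload_size = len(payload).to_bytes(8, "big")
    commitment = hashlib.sha256(nonce + payload_size + payload_hash).digest()
    return commitment, nonce
```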
This introduces one more hash computation, the random number generation, and it requires everyone who has access and wants to be able to replicate the entry to store and transmit an additional 96 bits per entry. And the whole thing got more complicated.
I guess the practical cost is actually not that bad. The randomness is awkward since it introduces nondeterminism, but you could always use a cryptographically secure prng to get around that if determinism is required or if gathering sufficient entropy is too expensive.
This doesn't change at all that a routing layer that needs to keep the things it routes upon private incurs some very real (even in practice) penalties. But a private log format could serve as a basis for a system where open data can be routed efficiently and access-controlled data can use a more privacy-aware routing system without that effort being negated by leaky log metadata.
So thanks @mixmix and @mikey, looks like I won't get to chicken out of the hard stuff...
CC @piet
Pinging rad humans who could totally shake up that conference: @dominic, @mix, @mnin, @Luandro, @noffle, this list is obviously very incomplete but I don't care. You can go on and ping more people of transitive subjective radness =)
Also @paul and mafintosh (whose radness I can only conjecture, but with high confidence)
@cft, please try to find out whether/how the travelling costs would be covered.
This seems like a great opportunity to infuse ethics and diversity into the academic ivory tower.
@mix Agreed that this is the way to go in the long term, but it means engineering an even more complex system. There'd probably be a lot of insights to be had from realizing the simpler system first. I'm probably just rationalizing my (not-quite-yet) decision though...
My first instinct would be to then build the simple case such that it is a strict subset of the more complex system and then later extend it with the additional functionality. In the concrete case of hierarchical-log-based system however, this would mean that even the initial simplified system would need to avoid metadata leaks in the log structure, which makes the whole thing more difficult.
Meh, these things are hard =(
Practical subtext: let’s keep in the same aye?
So, since most of the backlash in this thread came against adding ads specifically, not adding anything in general, let's bikeshed a call-for-donations text and print it from the postinstall of ssb-keys.
Re the problem of libraries/middleware not having the same visibility to end users: They still have direct end users, namely the devs who embed them in their projects. If there was a general culture of paying forward to your dependencies, donations to the application devs would trickle through. So that would be another actionable point for ssb: Establish which library code the js ecosystem relies on, forward some of our money their way, and be vocal about this practice.
forking from %H18RJ2+...
mostly rambling, not really working towards a coherent argument/point
think we came to the conclusion that I do not believe I want “unstoppable information”. I want resilient information, and I also think it’s really important to be able to exert some degree of control over information - not all information should go everywhere.
This is one of the main tensions I've been exploring lately. Unfortunately, engineering systems for unstoppable information is much simpler and results in better performance and reliability. It's really hard to do routing of information if not everyone is allowed to forward the information in the first place. SSB's constraint of only routing the data through peers who actually care about it is effectively equivalent to only routing data through peers who are allowed to access it. This is an area where point-to-point connections (i.e. classical tcp/ip) are at an advantage over content-addressability: You can encrypt data specifically to the intended receiver and then route it through untrusted parties. Privacy-aware-ssb can't do this. The randomized overlay of SSB gossip seems to be the main trick for getting it to work - but it comes at the cost of potentially large, randomized delays (that need to be brought down through careful engineering).
As I've mentioned here, I kinda want to explore the simple case of open information first. I think there's a lot of value to be had from a decentralized solution for putting data into the commons. And I feel like the ipfs approach (which comes from a very similar angle in terms of goals) just doesn't work well enough (a flat namespace addressed through a huge, singleton dht seems super vulnerable, and I won't even go into the wasteful polling that needs to be done on a pull-only delivery mechanism).
But on the other hand, it feels like taking the easy way out when it would be more valuable to tackle the harder problem. Between an access-control graph similar to the friends-graph, and a randomized overlay, this should be doable. In principle, I feel like relying on the random overlay has a definite limit in terms of scalability (once there are many users with disjoint interests, the probability of finding a peer with overlapping interests drops significantly). Then again, there's a ton of engineering one could pour into this to combat that effect. I just don't like relying on engineering power, I strongly prefer systems that just work out. And open information systems fall into that nice category. Also, the more this relies on engineering, the more trouble malicious peers can cause. Perhaps reading walkaway makes one overly paranoid, but I think resilience of such infrastructure against attackers will be crucial (another very strong argument against ipfs as the main open information commons).
I'm all for bringing attention to the current status quo of how self-exploitation is expected in open source. I don't think ads are the best mechanism to do so.
Somewhat stating the obvious here, but nevertheless: Even before they turned into the justification for a rapid descent towards orwellian surveilance, ads haven't been particularly charming or suave. They could make users support open source, but in a non-consensual way. And they necessarily add an intermediate party between the users and the devs they support. I'd like to see ssb stay true to its counterantidisintermediation roots.
So instead of displaying an add, we could display a call for donations, directly from the user to devs, and only if the users choose to do so.
A donation request could actively bring up the issue we want to raise awareness for. An ad would only do so in a subtle, passive-aggressive way that would probably go over the head of most people - especially considering the default mentality when encountering an ad ("I will ignore this, this is annoying, I don't want to engage with this (not even critically).").
We are building ssb without compensation, because we genuinely believe that it can transform the world for the better. Unfortunately, this belief does not satisfy our needs for food and shelter. If you feel like this software had a positive impact on your life, and you can spare the change, please consider donating at https://ssb.nz/donate
Or if there is another free software project you use that you like even more, why not donate to them instead? Writing software can sometimes feel like a thankless job, even for all the good we know it can do. Even a small amount helps to spread the load.
Seems like a more promising way of raising awareness than
<company-name> is your friend. You should like <company-name>.
The deadline for the final version of the paper is in two days, and we are almost finished. Everyone who plans on reading the final paper might as well do it now and notify us of all our embarrassing typos while we can still fix them: the-mostly-final-paper.pdf
Hmm, responses are actually more complicated than needed. Try this instead (introduction and request-section are unchanged):
Naive Bamboo Replication
This text specifies a fairly naive protocol for replicating bamboo logs between two endpoints over a reliable, ordered, bidirectional connection (tcp, websocket, etc). It is inadequate in both efficiency (statelessness makes implementation easier but leads to redundant data transfer) and robustness (no backpressure, no heartbeats, no multiplexing, large payloads might block everything else). On the plus side, it is easy to implement yet pretty expressive: it supports ranges and honors certificate pools.
Packets
Peers exchange packets, each packet is either a request packet, a metadata packet or a payload packet. In the specification for packets, all numbers are encoded as VarU64s.
Requests
A request packet is the concatenation of the following bytes:
- the byte `0x00` to indicate that this is a request packet
- the 32 bytes of the public key of the log that is requested
- the start seqnum of the requested range
- the end seqnum of the range, or zero if the range is open
- the head-max seqnum corresponding to the range
- if the end seqnum is nonzero, the tail-min seqnum corresponding to the range
- a byte between 0 and 31 whose five least-significant bits are determined by the following bit flags (left to right):
- 1 if the range is sparse, 0 otherwise
- 1 if the range is a no-metadata range, 0 otherwise
- 1 if there is a minimum payload size, 0 otherwise
- 1 if there is a maximum payload size, 0 otherwise
- 1 if there is a set of forbidden seqnums, 0 otherwise
- if the minimum payload flag is set: the minimum payload size
- if the maximum payload flag is set: the maximum payload size
- if the forbidden seqnum flag is set: the number of forbidden seqnums, followed by the forbidden seqnums (should be in ascending order)
The protocol can be simplified by dropping support for some of the flags. If no flags are supported, the flag byte itself should be omitted.
You could simplify even further by dropping the head-max and tail-min seqnums. It would make me sad though.
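For illustration, a Python sketch of encoding a request that supports none of the optional flags (so the flag byte itself is omitted, as allowed above). The `varint` helper is a simple stand-in, not the real VarU64 encoding:

```python
def varint(n: int) -> bytes:
    # Stand-in: a one-byte length prefix followed by the big-endian bytes.
    # Swap in a proper VarU64 encoder for anything real.
    raw = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
    return bytes([len(raw)]) + raw

def encode_request(pubkey: bytes, start: int, end: int, head_max: int,
                   tail_min: int | None = None) -> bytes:
    # Tag byte, public key, then the range description as listed above.
    assert len(pubkey) == 32
    packet = bytes([0x00]) + pubkey + varint(start) + varint(end) + varint(head_max)
    if end != 0:
        packet += varint(tail_min)
    return packet
```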
Metadata
A metadata packet is the concatenation of the following bytes:
- the byte `0x01` to indicate that this is a metadata packet
- the metadata of a log entry
Payload
A payload packet is the concatenation of the following bytes:
- the byte `0x02` to indicate that this is a payload packet
- the 32 bytes of the public key of the log to which the payload belongs
- the seqnum of the payload within that log
- the size of the payload
- the payload itself
Metadata and payloads that satisfy a request should be transmitted in the following order: the metadata of the entries in the head of the certificate pool of the start seqnum of the request, followed by the requested data (always metadata first, payload second), followed by the metadata of the entries in the tail of the certificate pool. This should be sorted such that the seqnums of the involved entries are ascending. Certificate pool entries should be trimmed as much as possible, as specified by the head-max and tail-min seqnums.
I’d actually like to sit back and see what you come up with
Aaaaand I wrote a simple replication spec, never mind... Feel free to ignore it I guess.
Naive Bamboo Replication
This text specifies a fairly naive protocol for replicating bamboo logs between two endpoints over a reliable, ordered, bidirectional connection (tcp, websocket, etc). It is inadequate in both efficiency (statelessness makes implementation easier but leads to redundant data transfer) and robustness (no backpressure, no heartbeats, no multiplexing, large payloads might block everything else). Additionally, it is pull-only: New log entries cannot be pushed automatically, they must be polled for. On the plus side, it is easy to implement, yet pretty expressive: it supports ranges and honors certificate pools.
Packets
Peers exchange packets, each packet is either a request packet or a response packet. Peers should only send response packets if a corresponding request came in. In the specification for packets, all numbers are encoded as VarU64s.
Requests
A request packet is the concatenation of the following bytes:
- the byte `0x00` to indicate that this is a request packet
- the 32 bytes of the public key of the log that is requested
- the start seqnum of the requested range
- the end seqnum of the range, or zero if the range is open
- the head-max seqnum corresponding to the range
- if the end seqnum is nonzero, the tail-min seqnum corresponding to the range
- a byte between 0 and 31 whose five least-significant bits are determined by the following bit flags (left to right):
- 1 if the range is sparse, 0 otherwise
- 1 if the range is a no-metadata range, 0 otherwise
- 1 if there is a minimum payload size, 0 otherwise
- 1 if there is a maximum payload size, 0 otherwise
- 1 if there is a set of forbidden seqnums, 0 otherwise
- if the minimum payload flag is set: the minimum payload size
- if the maximum payload flag is set: the maximum payload size
- if the forbidden seqnum flag is set: the number of forbidden seqnums, followed by the forbidden seqnums (should be in ascending order)
The protocol can be simplified by dropping support for some of the flags. If no flags are supported, the flag byte itself should be omitted.
You could simplify even further by dropping the head-max and tail-min seqnums. It would make me sad though.
Responses
A response packet is the concatenation of the following bytes:
- the byte `0x01` to indicate that this is a response packet
- the 32 bytes of the public key of the log that is delivered
- the number of data items that follow
- that many data items, satisfying the request to which this is responding
A data item is either the byte `0x00` followed by the metadata of a log entry, or the byte `0x01` followed by a seqnum, a payload size, and then that many bytes of data (the payload of the entry of the given seqnum).
The data items that are included in the response should be in the following order: The metadata of the entries in the head of the certificate pool of the start seqnum of the request, followed by the requested data (always metadata item first, payload item second), followed by the metadata of the entries in the tail of the certificate pool. This should be sorted such that the seqnums of the involved entries are ascending. Certificate pool entries should be trimmed as much as possible, as specified by the head-max and tail-min seqnums.
Every request should be responded to, with as many data items as available. That number might be zero, but that response is still preferred over silently ignoring the request.
That's it. There are a few obvious optimizations that were excluded for reasons of simplicity. Keep in mind that this is only for toy purposes, a more serious protocol would be stateful, more efficient, and more robust.
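To make the response layout above concrete, a Python sketch of building data items and a response packet (again with a stand-in varint rather than the real VarU64 encoding):

```python
def varint(n: int) -> bytes:
    # Same stand-in as in the request sketch, NOT the real VarU64 encoding.
    raw = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
    return bytes([len(raw)]) + raw

def metadata_item(entry_metadata: bytes) -> bytes:
    # 0x00 tags a data item that carries the metadata of one log entry.
    return bytes([0x00]) + entry_metadata

def payload_item(seqnum: int, payload: bytes) -> bytes:
    # 0x01 tags a data item carrying a payload, prefixed by seqnum and size.
    return bytes([0x01]) + varint(seqnum) + varint(len(payload)) + payload

def response_packet(pubkey: bytes, items: list) -> bytes:
    # Tag byte, public key, number of data items, then the items themselves.
    assert len(pubkey) == 32
    return bytes([0x01]) + pubkey + varint(len(items)) + b"".join(items)
```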
To clarify: These path examples would all apply within a single log, there'd be no global `/foo/bar`. `some_public_key.ed25519:///foo/bar` would have nothing to do with `another_public_key.ed25519:///foo/bar`. Replication would be guided by a public key first, and then a path to filter within the log addressed by the public key. (and again: the url-like notation is just to clarify things, not something that would actually end up in the raw bytes)
This is only one possible way.
Bamboo replicated over ssb-blobs, let's do this!!11
(I thought about bamboo-over-ftp first, but bamboo-over-blobs is even worse =D)
want to do as much as possible.
Now I really am tempted to write down more specs I'd like to see in the world =)
Out of general interest: For the stuff I plan on building eventually, I want hierarchical feeds, not just flat ones like bamboo. Imagine a bamboo log, but each entry would live at a path (think file path or url) inside that log. If you subscribed to `/foo/bar`, you'd get (the properly ordered) entries posted under `/foo/bar`, `/foo/bar/baz`, `/foo/bar/what/ever` etc, but you would not get anything under `/qux` or `/qux/bla` etc. Asking for `/` would give you the full log. (also note that this is just a convenient notation, the format would use more efficient arrays of byte strings rather than human-readable path names)
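A sketch of that prefix rule in Python, with paths as arrays of byte strings (the `matches_subscription` helper is hypothetical, purely to illustrate the matching semantics):

```python
def matches_subscription(entry_path: list, subscription: list) -> bool:
    # An entry matches if its path starts with the subscribed prefix.
    # Subscribing to the empty path (i.e. "/") matches the whole log.
    return entry_path[:len(subscription)] == subscription

# matches_subscription([b"foo", b"bar", b"baz"], [b"foo", b"bar"])  -> True
# matches_subscription([b"qux", b"bla"], [b"foo", b"bar"])          -> False
```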
This would not be privacy focused (and neither is bamboo actually, it wouldn't include payload size metadata if it were focused on privacy), the paths under which messages are posted would be public. This design would allow a very powerful routing/replication layer. The privacy stuff is important and interesting too, but I'd like to do the simpler one first. The overarching motivation is to allow building digital commons on top of this stuff (code repositories, package managers, public domain libraries, art, etc), not private social networks.
so if I were to hammer out more specs, I'd kinda want to move to the hierarchical feed format. @piet and @hoodownr, would you be up for following that route, or would you rather want to continue playing with flat feeds? I don't mind fleshing out bamboo-land either, since I'll be able to port a lot of that over to the hierarchical logs eventually.
@hoodownr I'm having trouble parsing two out of three paragraphs in the above post =D
The nice thing is that the data in bamboo is content-addressed (or rather identity+offset-addressed), so it doesn't really matter how you replicate. I might disagree with what you come up with, but that's totally fine. There won't be a single, definite replication spec anyways.
@hoodownr Cool! I'd actually like to sit back and see what you come up with, rather than dumping a whole overengineered protocol for multiplexing exchange of arbitrary ranges on you. Unless you explicitly want me to do exactly that.
Correction:
So you can reasonably expect the sneaky breaking changes to stop now. You can definitely expect the sneaky breaking changes to stop on 30/04/2020 (well unless there actually are any changes, in which case I would move the freeze-date further back)
=P
And also, I’m not up to date with the latest of @Aljoscha’s spec
In related news, I've just updated the status section of the readme:
Status: Well-defined and useful. Pending stabilization: If I don't see any need for further changes by 30/04/2020, this will be declared stable and any later breaking changes will result in new, non-bamboo specifications.
So you can reasonably expect the sneaky breaking changes to stop now =)
It does assume prior programming experience, and neither do the further chapters.
Forgetting a negation is clearly the most annoying kind of typo =/
Devil in the Details
Before we go on introducing new concepts (i.e. new kinds of expressions or values), we will look at a few details that we have glossed over so far.
Whitespace
In the pavo syntax, whitespace characters (spaces, tabs, linebreaks) are mostly ignored. They can be inserted to improve readability. Additionally, a `#` character and everything following it until the end of the line is considered whitespace as well. This is called a comment.
# This is a valid pavo program, even though it *technically* does not
# consist of solely an expression (it begins and ends with whitespace)
true
When the definition of an expression contains whitespace (such as with `<a> == <b>`), this actually means that any amount of whitespace is ok. For example `0== 0` is a valid expression. But whitespace can not be inserted inside an expression where it is not expected, e.g. `tr ue` is not a valid expression.
Domain of integers
Pavo does not support arbitrary integers, it only supports integers from `-9223372036854775808` (`-2^63`) to `9223372036854775807` (`2^63 - 1`). This is a concession to the fact that computers should be able to represent values efficiently. Restricting the range makes it much easier for hardware to deal with the integers. This particular range is commonly supported in hardware directly, as 64 bit words in two's complement. For now, you don't need to care about this at all.
Equality
The semantics of the `==` expression has been defined in terms of values "being equal", but we never defined what that actually means. Let's quickly fix this:
- the value `true` is equal to the value `true` and not equal to any other value
- the value `false` is equal to the value `false` and not equal to any other value
- an integer value is equal to an integer value that corresponds to the same integer, and not equal to any other value
This all seems rather obvious, but we will later encounter values where the question of equality is less trivial and multiple, mutually contradictory definitions would all make sense. In general, the topic of equality is a surprisingly complex one, both from a mathematical and a philosophical perspective.
Nested Expressions
`<a> == <b>` is an expression if `<a>` and `<b>` are expressions themselves. In particular, one of them might be a `==` expression itself. This means that e.g. `0 == 0 == true` is a valid expression. But how do we know which `==` is the outer and which is the inner one? If it was `(0 == 0) == true`, the expression would evaluate to `true`, but if it was `0 == (0 == true)` it would evaluate to `false`. This is the problem of operator associativity.

Pavo defines some rules to resolve such ambiguities, they are given in Appendix A (sufficient at this point: everything is left-associative, so the first interpretation is correct). Instead of writing ambiguous code, it is much better style to use explicit parentheses, as allowed by the following syntax and semantics:
- If `<a>` is an expression, then `(<a>)` is an expression. It evaluates exactly like `<a>`.
This allows coders to both clarify their code and override the default associativity where necessary.
An interesting observation about the recursively defined `==` expression is that there are expressions of arbitrarily large size, but there are no expressions of infinite size. This is nice: we avoid the infinite number of problems that usually come with objects of infinite size, yet we will never run into an arbitrary limit on the size of expressions.
Mathemagical Adventures
The concepts that have been introduced so far might seem rather obvious, self-evident, or even boring. That is a very good thing. The main trouble with programming is keeping complexity low, even when programs consist of millions of lines of code. Having a simple core is a necessary prerequisite to achieve that goal.
Math is often about starting out with extremely simple definitions and then exploring the surprisingly rich structures that arise. Compared to most branches of pure mathematics, pavo is actually a ridiculously complex mathematical object. But it is a mathematical object nonetheless, namely the triple of the set of programs, the set of values, and the evaluation function.
This triple really is pavo, unlike any particular implementation. This allows us to write different implementations (programs that execute pavo programs) independently, that might work completely differently. We can reason about pavo programs without knowing the details of how they will actually be executed. And we can use all known mathematical tools and techniques to do so.
There are very few things which we know, which are not capable of being reduc'd to a Mathematical Reasoning; and when they cannot it's a sign our knowledge of them is very small and confus'd; and when a Mathematical Reasoning can be had it's as great a folly to make use of any other, as to grope for a thing in the dark, when you have a Candle standing by you.
-- John Arbuthnot, Of the Laws of Chance (1692)
Chapter Checklist
In this chapter, we learned:
- about syntax and semantics
- about expressions (bools, ints, `==`, and parentheses)
- about values (bools and ints)
- about the evaluation of expressions into values
- that math is nice, pavo is math, and thus pavo is nice
Chapter 2: More Expressions (TODO: needs better title)
- (bound) identifiers
- throwing
- operators (equality and ordering)
- control flow
Chapter 3: Functions (TODO: needs better title)
- function literals
- lexical scoping
- tail-call recursion
Chapter 4: Valuable Lessons
- Overview of the proper values
Chapter 5: A Pattern Language
- Patterns and Destructuring
Chapter 6: Questionable Values
- general concept of proper values
- functions
- cells
- opaques
Chapter 7: Misc (TODO: needs better title)
- eval
- require
Epilogue
- what next? -> learn, teach, and use your skills for good
- theoretical CS is worth learning
- always remember that your code has effects on humans first, machines second
Appendix A: Operator Associativity and Precedence
Appendix B: Formal Grammar of the Syntax
Appendix C: API-Docs for the Toplevel Values
I have drafted the outline and the first chapter for "The Pavo Programming Language" ( #tppl ), the main, tutorial-style pavo documentation. So for the curious:
The Pavo Programming Language
A human-friendly introduction to the pavo programming language, for coders and not-yet-coders alike. May contain traces of math. Which may actually be a good thing.
Imagine a full-page, peafowl-related ("pavo" is latin for "peafowl") title illustration here. It might show one or more birds, or just a crest, perhaps a feather. Let your imagination flow. Try to settle on a nice composition, then picture the details of the artwork.
Now draw the image and send it to me, so that I can use a better illustration than this one.
Prologue
- why yet another language
- a case for libre software
- minimizing complexity as a prerequisite for inclusiveness
Chapter 1: Tabula Rasa
This chapter introduces the basic functioning of the pavo programming language. It does not assume prior programming experience, and neither do the further chapters.
If you have never coded before, this text might introduce many unfamiliar concepts, sometimes rather densely packed. Don't rush yourself, re-read passages as needed, take breaks - the text will still be there tomorrow. The reliability of computers can also make them very unforgiving, so it is often better to backtrack and solidify your understanding than to press forward on shaky foundations.
If you do have prior knowledge of programming, you should nonetheless read this chapter carefully and with an open mind. Pavo might do some things differently than you are used to, and mostly-but-not-quite-correct intuition can be a formidable source of frustrating errors.
Syntax and Semantics
A very basic model of programming is the following: At some point in time, a programmer writes down instructions in a way that a computer can understand. Then, at some later point in time, the computer executes these instructions to compute a result. A programming language thus deals with two fundamentally different concepts: It defines what counts as valid instructions (syntax), and it defines how to execute valid instructions (semantics).
Here are three simple pavo programs:
true
42
42 == 42
And here is something that is not a pavo program:
*&[ z$ $
All of these are sequences of characters, but the fourth one does not conform to the syntax of pavo. If you try to execute the fourth "program", pavo refuses to run it - it can't assign meaning to this string of characters. This is called a syntax error. Running the first program yields a result: `true`, running the second one yields `42`, and running the third program yields `true` again. This might look rather unexciting, but it is worth taking a detailed look at what is going on there.
Expressions and Values
What is a valid pavo program? That's simple: A pavo program is exactly one pavo expression. So what is an expression? There is a set of rules that defines expressions. These rules will be introduced throughout this text, step by step. We start out with the following ones:
- `true` is an expression
- `false` is an expression
- integer numbers are expressions, they may optionally start with a sign (`+` or `-`)
  - the truth is a tiny bit more complicated, we'll refine this later
- assume there are two expressions `<a>` and `<b>`, then `<a> == <b>` is an expression
Note that we do not assign any meaning to these things, the rules merely describe those sequences of characters that we happen to call expressions. The last of these rules is the most complicated one, it allows us to create expressions such as `true == true` or `true == false`. This trick of using expressions within the definition of expressions themselves is an example of the general principle of recursive (or inductive) data types.

Equipped with a proper definition of expressions, we can now take a look at what happens when you run a program that consists of one such expression, that is we'll examine pavo's semantics. To talk about the semantics, we must first define yet another data type: that of values. Values are the things that an execution of a pavo program yields. Just like expressions, they are defined by a set of rules that we will explore throughout the text. We begin with the following rules:
- `true` and `false` are values, collectively called bools (also booleans or truth values)
- integer numbers are values, also called ints
  - again, this is not fully accurate and will be made more precise later
The semantics of pavo define how to turn a program (i.e. an expression) into a value. This process is called evaluation. For each kind of expression, there is a rule that defines how to compute the corresponding value. A computer blindly follows these rules, there is nothing magic about it. Humans can do just the same, they only take longer.
The semantics for our current set of expressions are:
- the expression `true` evaluates to the value `true`
- the expression `false` evaluates to the value `false`
- an integer expression evaluates to the corresponding integer value
- Let `<a>` and `<b>` be expressions. To evaluate the expression `<a> == <b>`, first evaluate `<a>`, then evaluate `<b>`. If the two resulting values are equal, the expression evaluates to the value `true`, otherwise it evaluates to the value `false`.
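For readers who already know some Python, here is a sketch of these evaluation rules as a function (the tuple encoding of expressions is made up for this sketch, it is not part of pavo):

```python
# Expressions as nested tuples: ("true",), ("false",), ("int", n), ("==", a, b).
def evaluate(expr):
    kind = expr[0]
    if kind == "true":
        return True
    if kind == "false":
        return False
    if kind == "int":
        return expr[1]
    if kind == "==":
        a, b = evaluate(expr[1]), evaluate(expr[2])
        # pavo bools are never equal to ints, so compare types explicitly
        # (in Python, True == 1 would otherwise hold).
        return type(a) is type(b) and a == b
    raise ValueError("not a valid expression")

# evaluate(("==", ("int", 42), ("int", 42))) mirrors `42 == 42` and yields True.
```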
Note that e.g. the expression `true` and the value `true` are two completely different things. For example it would not make sense to evaluate the value `true`, while it is totally fine to evaluate the expression `true`. Most of the time, the context determines whether just `true` refers to the expression or the value.
This is essentially it, programming in pavo always works the same way: you write down an expression, and the computer mindlessly evaluates it according to the pavo semantics and yields the resulting value.
continued in next post
status update: I implemented my own persistent arrays, sets and maps to work with the gc. Might have been overkill, but was a good exercise in any case (I've never actually implemented self-balancing trees before ).
Then I decided that integrating them into the implementation would be boring, so to shake things up I reverted the language design to the original non-lispy one. Ok, the real reason is that I've made up my mind on macros not being worth the substantial increase in complexity. Having a decidable compilation process puts me at ease, and I'm looking forward to building all the cool tooling that would have been prohibited by turing-complete macros.
This pushes the initial release back by a few months. On the plus side, I can now write implementation, tutorial docs, and reference docs in lockstep, so there will be well-documented and working subsets of pavo to share along the way.
{ "type": "about", "description": "not-so-private gathering for the banana ring", "mentions": [ "@C6fAmdXgqTDbmZGAohUaYuyKdz3m6GBoLLtml3fUn+o=.ed25519", "@HEqy940T6uB+T+d9Jaa58aNfRzLx9eRWqkZljBmnkmk=.ed25519", "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519" ], "about": "%p2A+mLsYDKOwNyh5lIwgd3f2Cu5KCcg7zQ+9eMh9Jzw=.sha256", "branch": [ "%CSkoAzNlLCPNLg7oBNUwKTGZLOJmuA9Rvwp8sBFxIcs=.sha256" ] }
{ "type": "about", "about": "%p2A+mLsYDKOwNyh5lIwgd3f2Cu5KCcg7zQ+9eMh9Jzw=.sha256", "attendee": { "link": "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519" } }
Appropriate technology reminds us that before we choose our tools and techniques, we must choose our dreams and values. For some technologies serve them, while others make them unobtainable.
-- Tom Bender, Rainbook.
Ooh, I might steal that quote if/when I finally have the guts to write about the why behind #pavo-lang
With linked-list versus tree, isn’t it the case that regardless of whether backlinks don’t match up or we see two messages with the same sequence number, the result is that the log “forks”, and is no longer trusted?
Yup, both cases are a fork. But without backlinks, you could only detect one of them.
Doesn’t retroactively swapping a message also result in forking, so long as some other peer has the prior, signed message with the same sequence number?
Yes and no. Some other peer having the "original" (in the intuitive sense - the concept of an original doesn't really make sense in a cypherlink-less setting) alone doesn't really help: a single peer must hold both conflicting messages to detect a fork in this world. But with regular ssb replication, that situation doesn't arise. If you have a message of some seqnum, you will never request a message of that seqnum again (within the same feed). Catching seqnum-doubling without relying on cypherlinks while still maintaining efficient replication might be impossible. Cypherlinks themselves only give probabilistic guarantees, this (or fingerprints/checksums in general) allows us to "cheat" and get both efficient replication and (good-enough) fork detection.
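As a small sketch (with a hypothetical entry shape): once a single peer holds two entries for the same slot, the check is trivial:

```python
def proves_fork(entry_a: dict, entry_b: dict) -> bool:
    # Two (signature-verified) entries of the same feed and seqnum whose
    # hashes differ are proof of a fork. Mismatching backlinks/lipmaalinks
    # are the other way a single peer can catch one.
    return (entry_a["author"] == entry_b["author"]
            and entry_a["seqnum"] == entry_b["seqnum"]
            and entry_a["hash"] != entry_b["hash"])
```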
Does that make sense?
Don't most clients use the same plugin for displaying profile descriptions as well? So all of this (currently) also applies to descriptions.
@mix, can you confirm this?
Wrote down some words on indicating (sub-)ranges of bamboo logs:
Interesting for @piet and @hoodownr because bamboo, also interesting for ssb since this lists a few concerns that ssb replication rpcs will need to address with respect to offchain-content and partial replication: @arj, @keks, @cryptix, @Dominic
Beyond pure replication concerns, this informs some capabilities that the storage of a bamboo log should provide.
In the list of ssb specs, you missed the ssb spec =P
I went ahead and updated the spec. This was a breaking change. Piet, please delete those misleading recent posts that claim you have a working bamboo implementation with bindings, you only have a cheap, outdated non-bamboo thingy.
(A very, very cool thingy though.)
The remaining potential for breaking changes that I can currently see is including the payload size. It is still giving me headaches, because conceptually it just doesn't fit in (it is not strictly necessary for verification) and it leaks metadata (beyond the length of the log). But I just don't see a way to achieve its benefits without resorting to metadata creep. This thing has been bugging me for months now =/
I never started drinking, because I've been afraid to become dependent on the anti-social-awkwardness effects it seems to have (since I could definitely use those a lot of the time...). Not even trying seems to be the easiest way of not missing it. I find it quite fascinating though to observe how people react to that decision. E.g. yesterday an anonymous butt remembered that I don't drink and offered a lemonade instead, which made me feel at ease.
Not much to share here, just adding one more person to the list of self-exiled non-drinkers.
@RangerMauve I just saw your mostly-minimal-spanning-tree on github and it reminded me of @erick's paper here. Not sure whether there's a good overlap, but I guess it won't hurt to ping you about it.
At a first glance, the only room for nondeterminism are the different choices for encoding integers. Just slap "Integers MUST use the smallest possible encoding" on there and spare all implementers the additional headaches.
i just don’t want to require it.
Instead you require everyone to store the content in the form in which it was received (of course they can check whether they received canonic data, but if they can do that, then they probably wouldn't be mad about mandatory canonicity either, and they still need to be able to deal with noncanonic feeds). If canonic encoding is mandatory, then implementers can still choose to store the original bytes if they feel it is a good choice.
Mandatory canonicity forces everyone to check incoming data for canonicity (cost: a ridiculously tiny amount of time, once). Non-mandatory canonicity forces everyone to store a small (but definitely not ridiculously tiny) amount of data, consuming space forever. Which of these is more efficient doesn't really seem up for debate. What this is really about is trading protocol efficiency for developer convenience. But I don't even buy into the convenience argument, managing additional persistent storage can't possibly be more convenient in the long term than implementing ~20 lines of canonicity checks on integers and then forgetting about it forever. And I did implement the (definitely more than 20 lines of) canonicity checks for legacy ssb from scratch, so this is not ignorance speaking (though I very much enjoyed the "forgetting about it" part).
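The check itself is generic and tiny, something along these lines (a sketch; `decode` and `encode` are whatever codec the format mandates, with `decode` returning the value and the number of bytes consumed):

```python
def is_canonical(encoded: bytes, decode, encode) -> bool:
    # Decode, re-encode with the canonical (deterministic) encoder, and
    # check that the bytes round-trip exactly.
    value, consumed = decode(encoded)
    return consumed == len(encoded) and encode(value) == encoded
```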
As for the lipmaalinks: We can simply add them and then completely ignore them for now. Then at a later point we can upgrade rpcs and implementations, but nobody will need to migrate their feed. Fairly certain Dominic also wants to do it this way.
Unrelated: Should timestamps really be unsigned integers? Imo supporting the year 1969 is more valuable than supporting the year 292277026596. I can think of a few creative applications for feeds that are retroactively created as if they belonged to the past.
If every entry starts with the byte `0x85`, isn't it simpler to drop that byte?
Okay, now I'm just trolling... <3
; possible values for content encoding are 0, 1 and 2
; 0 means arbitrary bytes (like an private message, image or tar)
; 1 means json
; 2 means cbor
Why support json at all?
@erick Here is a draft that continues where the above sketch degraded into bullet points. Most of it feels depressingly self-evident, there's simply not a lot of information in there (then again, some of it is very brief right now) =/
Emulating SSB identities over NDN hierarchical names seems to be simple, since hierarchical names are more general than flat ones: Identities can be addressed as public keys under a well-known namespace, e.g. `/ssb/somePublicKey.ed25519`. This approach does inherit some properties from NDN that the layer 7 replication mechanism of SSB actively avoids. The canonical repository under the `ssb` prefix is a point of centralization, and it inhibits producer mobility.
NDN over SSB
Emulating NDN's pull-based system over SSB's push-based streams is conceptually simple. Streams are the more general concept, a single piece of data can be regarded as a stream of exactly one item. For each piece of NDN data, there would be a new identity whose stream only contains that single item.
TODO pick this up in the section on partial replication, to regain some of the "locality" between related pieces of NDN data
This does however forfeit the ability to organize data through meaningful (hierarchical) names, since each SSB stream has a flat, opaque identifier. Such a hierarchy would need to be established through an external mechanism such as a public key infrastructure. SSB could not directly leverage such a structure, instead it would have to rely on indirection: Hierarchies would be encoded as streams of metadata pertaining to the (trivial) data streams.
Combining NDN and SSB
From the prior sections, it becomes clear that push-based streams are more general than single data items, and hierarchical names are more general than flat ones. We thus briefly sketch a system that combines the two in order to bring some of SSB's benefits to layer 3.
The combination of SSB and NDN would bind hierarchical names of opaque components to streams, not data pieces. The identifier for a piece of data would be the pair of the stream name and its sequence number.
NDN's routing implementation with FIB, PITs, and content stores could be used for the stream names. Subscriptions to a stream would be implemented with long polling. Unlike regular NDN, the sequence-number based backpressure window can be efficiently manipulated, since the sequence numbers would not be opaque strings but first-class citizens of the routing protocol. Whether this system would constitute a push-based system with backpressure or a pull-based system that can talk about the future is ultimately a meaningless distinction.
By separating data identifiers into two components, the combination gets the best of both worlds: The gains in routing simplicity and efficiency that come with hierarchical names, and the improved efficiency of back-pressured data streams that comes with totally ordered sequence numbers.
While this combination falls short of some of SSB's ideals (full decentralization, consumer mobility), it demonstrates how SSB's push-based worldview could bring benefits to NDN and other ICN architectures.
Is the absence of lipmaalinks a feature or a bug?
I still maintain that using a self-describing format for the metadata is a source of unnecessary incidental complexity, but I won't keep fighting over this... But if we ever get to a situation where some overlooked implementation detail of some cbor implementation leaks into the spec, I'll happily dish out some annoying "I told you so"s.
I still strongly disagree with a nondeterministic encoding spec. Since there are clear advantages to having a proper function rather than a relation, I'd like to see some justification for your choice. Even if it's just "Ease of reusing the cbor spec and implementations.".
@elavoie Nope, it isn't that complicated. By self-describing I simply meant a format that can be parsed without having to know the schema of the data beforehand. Think relational vs nosql databases: In the latter, the data does not need to conform to a predefined schema, instead it is received and stored in a way such that its structure can be recovered on-demand.
A different way to formulate it: The application data can be an arbitrarily complex instance of a well-known and unchanging algebraic data type (e.g. for json: null, bools, numbers and strings are the base cases, maps and arrays are the recursive cases). That is to say: Application data is dynamically typed.
I've always been annoyed that all creative commons licenses require attribution. If there was CC-share-alike-no-attribution, I'd slap it on pretty much all my non-code output.
Some time ago there was mention of a license on here that basically required you to claim the stuff as your own. So forced plagiarism as the opposite of attribution. @Dominic, do you remember the name of that license?
Section 4 is giving me such a headache, I decided to do a draft of how I'd write it without being constrained by the previous stuff. I feel like I accomplished a stronger presentation, but it is far more abstract than the current version. So now we have not one problematic section, but two. I guess we'll try to interpolate between the two. Anyways, here it is:
SSB in the context of ICN
Although SSB data replication is currently implemented as a layer-7 protocol, we believe that its underlying principles are worth studying from a layer-3 perspective. To back this claim, we juxtapose SSB and NDN, the latter chosen because it embodies some polar-opposite design choices. The comparison shows that while both conceptual models are able to emulate each other, this emulation comes at significant runtime costs.
We start out by describing what we consider the core conceptual model of SSB. Then, after a brief introduction to NDN on a similar conceptual level, we examine how these models can emulate each other. We then use these lessons to sketch a fruitful way of combining SSB with NDN.
Conceptual Model of SSB
If ICN is about the delivery of named data objects [cite], then SSB can be said to be about the delivery of named data streams. The basic unit of addressing is not the individual message, but a full log that might still produce new messages in the future (i.e. a stream). The streams are self-certifying and guarantee reliable causal ordering.
Delivery of streams follows a push model: once a receiver has expressed interest in a stream, new items are transferred automatically without being requested individually. This process must be subject to receiver-driven backpressure; in the current SSB implementation this is done implicitly through the flow control of TCP connections.
Streams are tied to a single identity; only this identity can produce new items in the stream. Due to the self-certifying nature of streams, the items that already exist can be served from anywhere and by anybody. The identity producing the stream may be mobile.
An elaboration on this conceptual model is given in [cite].
Conceptual Model of NDN
In NDN, the basic elements of networking are single pieces of data, identified by hierarchical names. Each name is a sequence of opaque identifiers. Data is accessed via a pull model: a consumer issues an interest in a name, and the network delivers the corresponding data. The data is signed such that the correctness of the name-to-content binding can be verified. This allows data to be served from any location in the network.
The name of a piece of data not only identifies it, it also induces a unique repository that serves the data. When an NDN node cannot directly serve a request, the request is forwarded towards this canonical repository instead.
The main conceptual differences between SSB and NDN are thus the pull vs push model, and the decentralized identity-centric approach vs the centralizing name-centric approach.
| | Pull | Push |
| --- | --- | --- |
| Name-Centric | NDN | ??? |
| Identity-Centric | ??? | SSB |
SSB over NDN
Emulating SSB over NDN means emulating a push-based system over a pull-based one, and an identity-centric system over a name-based one. Both turn out to be problematic.
Implementing push with the pure request-reply model of NDN comes down to two basic options [cite]: the producer could send an interest to the consumer, to signal that the consumer should itself issue an interest in the newly available data. This approach - aside from misusing the semantics of interest requests - incurs a high latency penalty.
The other approach is regular polling: The consumer periodically signals interest for some data the producer may or may not have created yet. In its simplest form, this can be done by publishing data under a name that ends with a sequence number that is incremented with each produced piece of data. Under this model, the consumer can decide how many items into the "future" to poll for simultaneously. This whole process can be abstracted over with a consumer-side library[cite][cite].
In our conceptual model of NDN, polling is resource intensive. A natural extension would be the inclusion of long-lived or persistent interests [cite]. But even then, a pull implementation would be needlessly inefficient: polling ahead for multiple items effectively amounts to controlling back-pressure through a sliding window, comparable to TCP. But unlike TCP, this window could only be advanced by one item per (interest) packet. Introducing some form of sequence number arithmetic to increase efficiency would necessitate dropping the concept of purely opaque names.
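For illustration, the consumer-side "poll ahead" emulation could look roughly like this sketch (`expressInterest` is a hypothetical stand-in for issuing an NDN interest, not a real API; interest timeouts and out-of-order arrival are ignored):

```typescript
// Hypothetical stand-in: resolves once the network satisfies an interest in `name`.
declare function expressInterest(name: string): Promise<Uint8Array>;

// Emulate a push subscription by keeping `window` interests for future
// sequence numbers outstanding at all times ("polling ahead").
async function subscribe(
  prefix: string, // e.g. "/ssb/someKey.ed25519"
  window: number, // backpressure: how many items ahead to poll
  onItem: (seq: number, data: Uint8Array) => void,
): Promise<void> {
  let nextToRequest = 0;

  const request = (seq: number): void => {
    void expressInterest(`${prefix}/${seq}`).then((data) => {
      onItem(seq, data);
      // Each satisfied interest frees exactly one window slot, so the window
      // can only ever advance by one item per (interest) packet.
      request(nextToRequest++);
    });
  };

  while (nextToRequest < window) {
    request(nextToRequest++);
  }
}
```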
In conclusion, implementing the push aspects of SSB over NDN would cost either time (interests triggering interests), space (polling), or it would require significant changes to NDN (long-lived interests + non-opaque names) that would effectively turn it into a name-based SSB.
- identity-centric vs name-based:
  - hierarchical names are strictly more general than flat ones: /ssb/someKey.ed25519
  - problem one: centralization
  - problem two: producer-mobility
NDN over SSB
- pull vs push:
  - logs are strictly more general than single data items: can express standalone data as a log of one entry
- identity-centric vs name-based:
  - needs a distributed (but not decentralized) component (e.g. a pki) to map names to feed ids
  - since NDN gets to declare the pki to be out of scope, so do we here
Combining NDN and SSB
- prior subsections show that:
  - push is more general than pull
  - hierarchical names are more general than flat names
- why not both?
  - streams have hierarchical names, a data identifier is a pair of a stream name and a seqnum
  - can do seqnum arithmetic for efficient backpressure-window maintenance while still keeping the name components opaque
  - gets to benefit from the simplicity/efficiency gains that hierarchical (and not fully decentralized) names offer
  - inherits producer-mobility problems (but then again, we don't have an answer for those on L3-SSB either)
Fun fact: if you choose the less general options instead (pull and flat names), you get ipfs.
@Rabble But you'd rarely do a "cold call" to an onion address of a non-pub peer, since they would be offline with high probability. Instead, I imagined these to be primarily useful for gossiping around. Address-per-session should work in that setting: the peers that learn your address through gossip learn the current one. The initial connection to the gossip network would still work over the well-known, static addresses of a pub (which may or may not be onion addresses).
So basically: Start up the local server, generate an onion address, connect to a pub, and tell the pub to start sharing that onion address so that other peers who are currently online can initiate a connection to you through it. Also, the pub would hand out the onion addresses of other (non-pub) nodes that they are currently connected to. And then, this exchange continues between the non-pub nodes as well.
Would generating a fresh onion address per session be a viable option, or is it important to keep the onion address stable?
@cinnamon Well, they are working on a set of conventions that allows users to ask anybody who adheres to these conventions to drop specific messages from their local replicas. This still doesn't change the conceptual append-only nature. Of course, if everybody who stores a replica of the affected log truly deletes the message, then the effect of deletion is effectively achieved. You can never be sure whether this situation has occurred, though.
I fully agree that building social networks on top of immutable, public logs is a bad idea. Unfortunately there seem to be a lot of people who don't hold that view. And for some reason, they seem to cluster around this particular virtual space we are in right now.
@cinnamon Offchain content does not allow people to retroactively alter their log by deleting content. It allows anyone to locally drop data from any log replica without losing the ability to pass on this replica to a peer (the deleted data obviously can't be transmitted, but the verifiability of the log is preserved). But the (logical) logs themselves are just as immutable as ever.
@kemitchell They are necessary to check that all the messages one has form a linked list rather than a tree. With seqnums alone, the author might create a tree like

      1a - 2a
     /
    0
     \
      1b - 2b

and hand you 0 - 1a - 2b. You couldn't detect that the log wasn't a log but a tree. A bit of prior discussion: %bxliFX0...
Another aspect: The author could retroactively swap out a message, replacing it with a different one that signs the same seqnum but includes different content. Prior discussion: %bxliFX0...
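A minimal sketch of the check that backlinks make possible (simplified: real messages carry more fields, and the hash covers the whole signed encoding, which `hashOf` stands in for here):

```typescript
interface Msg {
  seq: number;
  previous: string | null; // backlink: hash of the preceding message, null for the first one
}

// Stand-in for hashing the full signed encoding of a message.
declare function hashOf(msg: Msg): string;

// With backlinks, being handed 0 - 1a - 2b is detectable: 2b's `previous`
// points at 1b, not at the 1a that sits before it in the handed-over list.
function isSingleChain(msgs: Msg[]): boolean {
  for (let i = 1; i < msgs.length; i++) {
    const prev = msgs[i - 1];
    const cur = msgs[i];
    if (cur.seq !== prev.seq + 1) return false;
    if (cur.previous !== hashOf(prev)) return false;
  }
  return true;
}
```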
More thoughts on push-vs-pull (definitely out of scope for this paper), CC @cft:
Polling means that the state resides in the client. In IP land, pushing doesn't remove state, but merely moves it into the server (now the server needs to know all the addresses to push to). But with a PIT-based routing scheme, we have native multicast. The state is moved into the network, and more importantly, it is greatly compressed. So a PIT-based push system needs to remember less state overall than a PIT-based pull system. Path-label routing loses this advantage: the server needs to store the path labels of all subscribers (as far as I understand it). So I'm a bit surprised that Antonio seems to favor it.
When implementing pubsub on top of ndn via long polling, the naive approach of only polling for one packet in advance leads to long delays. So the obvious "fix" is to poll for multiple packets in advance. The client can choose how far to poll ahead, thus implementing backpressure. This results in (logically) the same situation as TCP windows. Since any push-based approach will also need to support backpressure, it would need to offer an approach that is better than this window-based long polling. Thankfully, the pull-based poll-windows are laughably inefficient: imagine the TCP window could only be incremented in steps of one per transferred packet. Vint Cerf would turn in his grave if he weren't still alive. The moment ndn supported explicit arithmetic on sequence numbers, it would effectively turn into a push-based system with a sensible backpressure mechanism. If they'd insist on still calling it pull-based, that would be fine by me.
One additional nice property of native support for this approach: it allows separating backpressure from congestion control. Endpoints would control the backpressure window to indicate the rate at which they can receive, and routers would forward (potentially) smaller windows to apply congestion control. All the additional state they'd need for that would be an integer indicating the difference between the sequence numbers of the actual request and the throttled forwarding.
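As a sketch of how little router state that separation would need (nothing here corresponds to an existing NDN implementation; names are invented):

```typescript
// Per subscription, the endpoint advertises how far ahead it can receive
// (backpressure), and the router forwards a possibly smaller window upstream
// (congestion control). The only extra state is the difference between the two.
interface ForwardedSubscription {
  advertisedUpTo: number; // highest seqnum the endpoint is willing to be pushed
  forwardedUpTo: number;  // highest seqnum this router has actually requested upstream
}

// Release up to `budget` additional items upstream, never exceeding what the
// endpoint asked for; returns how many items are still being withheld.
function applyCongestionBudget(sub: ForwardedSubscription, budget: number): number {
  sub.forwardedUpTo = Math.min(sub.advertisedUpTo, sub.forwardedUpTo + budget);
  return sub.advertisedUpTo - sub.forwardedUpTo;
}
```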
I hope I'm reinventing obvious wheels here, but the papers I've read so far didn't mention any of this =(
@arj But any new key would imply a new feed type anyways. That feed type can encode things in whatever way it wants. Cypherlinks are a different topic, but those are too high-level for bamboo (quick aside: that's one of the arguments for baking the hierarchy for partial subscriptions into the feed - it can be done without relying on the concept of cypherlinks, which is nice for things built on top of the append-only log that are less generic than a full-blown, generic application framework).
A couple of notes on push vs pull, and on pull-only systems trying to emulate push:
- Antonio et al provide a good summary of the ways ndn can emulate push here (sections 2 and 3)
- they don't mention the problems that polling causes for the polling client (as opposed to the network):
- it might be fine to issue 10k subscriptions once, but polling 10k things over and over consumes significant resources (time, power, space (see next point))
- with pubsub, the client can issue requests and forget about them, whereas polling requires to maintain state
- pretty much every pull-only system that gets used in the real world uses hacks (polling) to emulate push but eventually incorporates native push. Why should ndn be different? Empirical evidence:
- http got websockets
- imap got the idle command
- modern databases support change notifications of some form (unfortunately I couldn't find a paper to cite)
- hardware using interrupts rather than polling
- all modern operating systems offer alternatives to polling for ipc systems: kqueue, epoll, etc.
- many programming languages offer push notifications for asynchronously completed events (arguably how nodejs made javascript the currently most popular language in the world; other examples are Go, C#, and the proposal for Swift (hey look, a topic on which all tech giants agree))
- thousands of human-hours are poured into engineering solutions for languages without native async programming: libuv and friends for C, futures crate for rust, netty and friends for java, etc.
- but surely, pull-only icn will be different...
Unfortunately, I couldn't find good citable sources for any of those; it just seems to be everyone going through the same cycle and ending up in push land. And probably most of these involved the same ideological war. Here's a quote from a pre-websocket RFC on HTTP push hacks that reeks of politics and strategizing:
> The authors acknowledge that both the HTTP long polling and HTTP streaming mechanisms stretch the original semantic of HTTP and that the HTTP protocol was not designed for bidirectional communication. This document neither encourages nor discourages the use of these mechanisms, and takes no position on whether they provide appropriate solutions to the problem of providing bidirectional communication between clients and servers. Instead, this document merely identifies technical issues with these mechanisms and suggests best practices for their deployment.
Reviewer #19F
The paper does not provide running code or an evaluation of the different options proposed.
EL: It does... Should be clearer that we describe an actual working implementation.
cft: perhaps point out the code size (meh), number of sub-projects/modules/number of client applications?
I think the authors are wrong. You can have a push API in NDN so that SB can take all the benefits NDN provides. The mistake, IMHO, made by the authors comes from the fact that they do not consider a host stack model between SB and NDN. NDN is a L3 forwarding technology. The host stack is not specified, but it does not mean that it cannot be specified. There are host stack models for NDN in the literature that should be used in this paper to obtain all different possible interactions required by the SB application, including the push API.
eric: We think you are wrong.
alj: point to https://conferences.sigcomm.org/sigcomm/2011/papers/icn/p56.pdf for pull bashing
cft: this is also about the pubsub library being submitted to this conference
cft: need a paragraph "being there first" so others can't claim "you are a special case of us"
cft: also important to point out the SPECIAL pubsub we have: authenticated (alj: even self-authenticating), append-only, reliable, immutable
cft: defend how we defined the "waist"
Also, there is no communication system that is purely pull or purely push from a full stack point of view. IP is push but by using the host stack model all possible interaction can be obtained by using the socket API for instance. In NDN the network is pull but again all interactions can be developed and exposed using a socket API.
alj: There's probably a logical fallacy here... it's only deflecting... it glosses over efficiency
cft: I need time to see what hides behind "full stack semantics"
general discussion: push vs pull, our push as a special case/refinement/generalization, link to long poll in the web space.
EL: Clarify wording.
Unfortunately the authors fall into the trap by assuming that the request/reply semantics in the NDN protocol is a full stack semantics. Which is not. It is a common mistake in my opinion that has also impaired the usage of NDN for more applications than what we see today.
cft: blocked by finding out the meaning of "full stack semantics"
alj: Again: ignores efficiency. Emulation isn't free - otherwise we'd be exchanging turing machines
Authors need to cut down the discussion on rather impossible integration of SB with NDN in the final version by focusing on SB on top of NDN. There need to be discussion with text about the integration of SB over NDN using a host stacks as available in the literature and comment about a working integration. Essentially advantages that NDN brings to SB as opposed to using TCP for example.
EL: ???
eric: positive counterattack (to communicate to Luca (meta-level)): reversing the layering gives new insights - even on their properties
eric: this review has no argument why one should abandon this "wrong way of layering"; this (deep) change request is not motivated. (again meta-level)
[Inserting a host stack at the end-point to make the integration between SSB and NDN] would have made the contribution of this paper huge because the SB application is totally meaningful if run on top of NDN.
EL: ???
alj: [redacted because he's being mean]
Other Revisions
alj: related work, DONA: not just a "plain copy", our names are totally ordered (cft: agreed, good point)
Reviewer #19D
whether SSB is an information-centric (as stated in the abstract), or actually identity-centric protocol (as described in section 4.6). Here the issue is not so much of "name-centric" versus "identity-centric", but rather, the question is what the name/identifier applies to. An information-centric architecture applies the the name/identifier delivery to data packets that get delivered through the network .
EL: Clarify, perhaps revise previous sections to re-organize the paper around identity-centric? (kudos to Aljosha for the insight of identity-centric ;-))
EL: Clarify in the intro what identity-centric means.
exactly where in the protocol stack that SSB belongs to; the fact that the paper attempted a comparison between SSB and NDN gives an impression that the authors believe the two protocols are at the same layer in the protocol stack (otherwise comparing protocols at different layers would be like comparing apple with orange), which I do not believe is the case.
EL: Clarify
alj: explicitly talk about abstracting over the layer (argue that we get to do it because of info-centric)
cft + alj: talk about this later
also point to the "pure" paper
SSB does not seem belonging to [the narrow waist on the network protocol stack], can SSB still work, if IP goes away completely?
cft: same as above
can SSB work as a universal interconnect layer to support all applications? e.g. including IoT, V2V.
alj: unfair q? erik: too defensive?
EL: Clarify both.
cft: push IS the more general model, covers ANY communication situation by replicating every BLEEP to everywhere and everybody.
Reviewer #19E
This paper is very interesting, since decentralized applications are challenging. But the discussion about L3 protocol (ICN) and overlay network (SSB) is confusing.
SSB is an event-sharing protocol and an architecture “for social apps”, but ICN is a general network architecture “for all apps and all communications”.
See above
The difference between ICN and SSB is the existence of intermediate nodes. Section 4.4 considered ICN forwarders to SSB nodes, but the intermediate nodes are not “users” and do not have any “interesting" subset of the global data pool. Alternatively, you can develop a cache management strategy according to "interest" of SSB node, and improve SSB's performance. Almost peer-to-peer networks do not have a concept of how to use in-network resources, but an important feature of ICN is utilization of in-network resources, unlike TCP/IP. Therefore, it is important to extend SSB to utilize in-network resources. And we need to consider the design of the “underlay” architecture, vanilla ICN, extended ICN or a new L3 architecture.
EL: No actionable comment...
alj: first part is about "buying friends" (covered above), second part seems confused about SSB's mandatory "mem-in-the-net" stance
cft: this is just the reviewer being confused - alj + eric in unison: Yes!
Reviewer #19C
"Comparing SSB to NDN": "The second difference is more subtle and seems to be rooted in what NDN considers the main focus of networking." Such anthropomorphism (NDN does not "consider" anything) should be avoided.
EL:
alj: "seems to be rooted in what NDN's designers consider the main focus of networking"
More importantly, your insistence that NDN relies on repos => it is not decentralized, seems to me to be off target. Repos are a scalability mechanism and not a fundamental part of the CCN/NDN architecture.
EL: Clarify the wording
CFT: In SSB there is no repo. NDN can't work without repo - the prefix information in the routing layer IS exactly assuming the repo thing, no way to work in NDN without the repo concept. --> we have to flag this to Luca.
Is it clear that SSB can scale to provide global services - even ones based on social graphs - without relying on, say, an underlying global IP service (which embodies all sorts of centralized facilities), or well-known rendezvous points (cf. Section 7.2)?
CFT: This is a hypothesis that it should scale. (alj: my claim: if flat-label routing can scale, so can we) Perhaps cite pub/sub papers (Carzaniga/recent pubsub publications) such that if they can do it, we can too. Clarify that we are currently at L7, but research (in another paper) may show that we will be able to do it at L3.
alj: the paper is radical, showing a) push and b) decentralized. There may be a spectrum where you can sacrifice some of these properties, find "compromised" intermediate solutions. Too radical and believe-carrying?
The relationships among certain concepts did not come across clearly. In particular, "user/identity", "relay" and "peer" seem to have similar meanings, but it was not clear whether relays have a log separate from a user. Can there be multiple users per "relay"?
EL: Do a full pass on the paper to make the terminology uniform and clarify definitions of each word.
The "waist" in Figure 2 contains "log format, peer IDs, blob objs", but the text refers to "follow" and "block" messages. Are these specific types of log entries? A somewhat more complete description of the relay protocol would be useful.
alj: this info is in Sect 3, we attempted to keep this sect 2 short and dense.
cft: can we have a forward pointer (if necessary)? Perhaps give some help to interested readers, get them to make a mental picture of how SSB works (without reading the SSB code).
erik: somebody has to read through, make sure that enough information is here so that this question does not show up.
alj: explicitly use the term "follow message" when introducing the concept of following
The term/concept "tangle" comes out of left field in the beginning of Section 3. It would be nice to introduce the concept explicitly. CRUD also may not be immediately known to all readers.
EL: Provide definitions. Focus/refer to sect 4?
alj: we have 5x the term 'tangle'. "= DAG with single source node", plus (tight) partial order property.
Figure 3 was not helpful at all. What relationships are encoded in the positioning? I can't figure out what would replace the "..."s.
alj: expand the block diagram to show two things of the same thing (instead of only ...)
EL: Revise figure and/or add more detail in the caption.
I don't understand your point about the social contract in NDN. [...] What is the motivation for a user (peer? relay?) to forward logs - which can potentially have very high cost - in SSB?
alj: the power of friendship (erik: instead of dominance of economic-centric world view). You store things anyway (because of offline-first, and you care about that content), so it's only about the communication.
alj: maybe remove that referencing to the social graph? Streamline it to decent vs centr, and then one can drop this?
erik: or better add one clarifying sentence? 90% of content is born local?
cft claims that alj said: reference to Haggle/pocket switched network
erik: at L7, SSB uses the economic solution of IP. At L3, intermediate relays need future work/exploration re economics. --> alj: "Future work" should pick up the L3 work.
alj: point out as out of scope
EL: Add more detail.
Speaking of social contracts, I was surprised that the relay API allows applications to "add to the peer's log". This seems like it opens a significant resource-exhaustion attack, by causing the peer to "follow" a whole bunch of identities.
alj: is this a misunderstanding? Clarify. (don't go into authentication)
alj: long-term identity brings a social cost to abusive behaviour and people will block an abusive actor.
More generally, you don't mention DoS attacks at all. One advantage of NDN (according to some) is that its pull model makes DoS attacks harder than in IP. You might mention how SSB's push model stacks up on this score.
alj: trust-based
cft: conceptually, needs backpressure (controlled push, not shove-in-your-face) (cft likes this way of phrasing it!)
cft: have 2 sentences on DoS: social control/trust, also backpressure in future work
In section 4.5, the statement "Either some item is already in one of the eagerly replicated local logs, or it is not available yet (because SSB is push-based)." What is the response to an NDN pull request in the latter case? It seems that the major problem is not the hierarchical namespace, but the definition of "eventual".
cft: misunderstanding? Just check, but we don't expect much to change to this.
Reviewer #19B
Briefly mention Haggle, SCAMPI, TwiMight, NetInf from ICN, Usenet NetNews (replication strategy) in related work
cft + alj: add very few sentences (and what these papers were working on: social media, replication strategy) and make the point that we are really adding something new.
alj: core difference: self-certifying streams
it would be useful to understand how an identity extends to multiple devices of a user (is this one or many identities) and how the system deals with compromised keys. [...] So, basically, section 2 seems fine even though I would have liked to see more details.
cft: attentive reader recognizes the multidevice problem, we pay them a service by mentioning it (Even though our answer might be unexciting/disappointing)
Section 3 of the paper, IMO, lacks rigor. This is essentially a set of examples without enough technical depth of precision to allow the reader to follow (and appreciate or criticise) the design details.
EL: I had to cut much of the material that talked about the finer concurrency points of each app because of the page limit...
CFT: Tell Luca that this section is essential to show the breadth of applications supported
The comparison to NDN is section 4 is weird. Why NDN? Ok, it's popular. But then the description is not comprehensive and there are no clear takeaways. I just wanna note that there are pub/sub extensions to NDN which would make the SSB-over-NDN in section 4.3 probably easier (even though this comes at an overhead).
CFT: pub/sub extensions were not published at the time of submission. To check?
alj: which "clear takeaways" did we want to convey? push vs pull and decentralization?
Don't sell work-in-progress as a feature because it is not done yet. A section on "Benefits" belongs into a white paper or marketing material but not here.
EL: Maybe re-read to see whether there really are unsubstantiated claims?
How would you then define eventual consistency and under which assumptions and over which expected time periods would you consider reaching consistency?
CFT: "partial eventual consistency", maybe define that? Clarify that it is not the classical eventual consistency.
cft: nice argument for ssb: we reach "consistency" very quickly even though there might be no end-to-end connectivity because of the optimistic forwarding/caching
Notes from two calls (ca 4 hours):
ICN 2019 Camera-Ready Version
Publication Opportunity
Christian has been invited to run a panel on applications over ICN, including decentralized Web.
Literature Review Insights (Aljoscha)
Which layer are we living in? Clearly split out the logical concepts that we are talking about and the implementation. Ex: IPFS is NDN with flat-names.
CFT: We can reference the CCR Paper in the SSB paper now on the push communication model.
We are not naming individual pieces of data, but we refer to an entire stream. Maybe should be Stream-Centric?
CFT: NDN has seq no, but it cannot benefit from the constraints that SSB is leveraging.
Push
CFT: NDN has no notification system.
CFT: Pub-Sub has plenty of interesting literature.
CFT: What about scaling to 10M?
CFT: We have to figure out how the NDN namespace is not the whole story.
SSB manages to have a self-certifying stream, which no other system does. Claim: True decentralization can only work with self-certifying names.
CFT: NDN still relies on certificate authorities, so that is a good point.
Identity-centric kind of implies having a social graph in the background. Maybe the core idea is stream-centric?
CFT: NDN is biased towards content providers that would certify the content. We should clarify the identity centric.
Actionable Points
Reviewer #19A
It would be good if a discussion was added that elaborates on the scope of applications that are reasonable for an SSB approach [i.e. it seems it would not be appropriate for sensor devices]. Also, if SSB and ICN is combined how would different application realms be interconnected and/or interworked?
Would SSB be compatible with sensor devices?
CFT: yes - can think of future with pruned logs - no basic interoperability problems. But mention this in an outlook/future work section.
Is SSB general-purpose or is it restricted to specific social applications?
Is it powerful enough? Even if yes, can we back it up? Instead: Claim that push (pure model) is powerful enough rather than ssb.
Alj: Synergies between NDN and SSB, emphasizes the integration rather than an opposition.
How concepts like append only logs can scale for use in a global network might not be obvious to all readers, it would be good to add some text that provides some intuitive explanation or example.
cft: not sure to which text place/section this refers to, but I had hoped that the "social graph heuristics" is SSB's current argument. Action: find that location, and emphasize this point.
alj: too long logs can be handled by starting a new one.
Section 4.6, last sentence: "Beside the push/pull theme, SSB’s identity-centric approach seems to introduce a yet unseen element for ICN."
Please elaborate on what you mean by this.
CFT and alj: one such element is the "streams" things, that data lives in a very controlled (and verifiable) context. More?
Editorial comments:
The keyword section is missing.
cft: to be added
Section 1, §2:
s/, true to is decentralized point of view,/, true to its decentralized point of view,/
Section 3.1, §4:
You introduce the concept "CRUD" without defining or explaining it.
Section 3.3,
For clarity it would be good to add a time axis to Figure 4.
just fix it.
Applications like Patchwork and Beaker don’t need to know whether they’re running on Scuttlebutt or Dat, and if we could agree on some simple levelup-like API for our append-only logs then everything we build on top of that (storage, replication, queries, interfaces, etc) could be portable between our implementations.
I like this idea very much. I don't think you should compare ssb with hypercore directly:
- hypercore has partial replication, ssb does not (yet)
- not insurmountable, the api hooks to control partial replication could extend (in the object-oriented sense) the hooks for regular replication
- hypercore doesn't have blobs (i.e. datums addressed by their hash rather than a position in a particular log)
- ssb places restrictions on the format of messages (valid json, must be object with valid type, max size), hypercore is far more relaxed
What should however be possible is a common api for lower scuttlebutt and hypercore (and also bamboo). You could actually go ahead and implement upper ssb on either of these, just like you could implement dat on top of either of these (modulo payload size limits).
A nice thing that would spare this undertaking some headaches: All of these identify logs by an ed25519 public key, there's no need for compatibility shims.
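Something like the following TypeScript interface is the level of API I'd imagine for the common lower layer (all names invented here; partial replication and the differing payload restrictions are left out):

```typescript
// Hypothetical common interface over lower ssb logs, hypercores, and bamboo logs.
// All three identify a log by an ed25519 public key, so no compatibility shim there.
interface AppendOnlyLog {
  readonly publicKey: Uint8Array; // 32-byte ed25519 key identifying the log

  append(payload: Uint8Array): Promise<number>;      // only for logs we hold the secret key of; returns the new seqnum
  get(seq: number): Promise<Uint8Array | undefined>; // payload at seqnum, if locally replicated
  length(): Promise<number>;                         // highest locally known seqnum

  // Push-style notification about newly appended or newly replicated entries.
  onAppend(listener: (seq: number, payload: Uint8Array) => void): void;
}
```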
living systems are only ‘scale free’ across a few orders of magnitude, unlike mathematical ones
This seems to contain a lot of compressed but important thoughts. As someone who heavily drifts towards the mathematical ones and would be interested in learning more about how this would clash with the "real world", I'd love to read an elaboration =)
@piet has pointed out that the author of a bamboo log can never change, and thus it doesn't need to be encoded using a multiformat (context for why it is included in the first place is given here). Which is nice, not really because it saves two bytes, but more because the whole thing becomes a tiny bit simpler. So unless anyone can point to something we are missing, I'll update the spec to encode the key as a 32 byte string rather than a yamf-signatory. When ed25519 needs to be replaced, a new format can use the initial flags byte to indicate whether it is a legacy ed25519 feed (tag byte 0 or 1 as in current bamboo) or a feed that uses the new signing primitive (tag byte 2 or 3).
@Dominic @keks @Anders (but I still believe that using a multiformat for the backlinks is the correct choice, I just don't think a broken hash function should force everybody to migrate to a new identity)
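For concreteness, this is roughly how I picture reading the start of an entry after that change (a sketch of my own, not the updated spec; in particular the exact tag-byte semantics beyond "0 or 1 = legacy ed25519, 2 or 3 = new primitive" are assumptions):

```typescript
// Sketch: the initial flags/tag byte doubles as the feed-type indicator,
// and the author key follows as raw 32 bytes (no yamf-signatory wrapper).
interface EntryStart {
  feedType: "ed25519" | "future-primitive";
  isEndOfFeed: boolean; // assumption: the odd tag marks end-of-feed, mirroring today's 0 vs 1
  author: Uint8Array;   // 32 bytes
}

function parseEntryStart(bytes: Uint8Array): EntryStart {
  const tag = bytes[0];
  if (tag > 3) throw new Error(`unknown tag byte: ${tag}`);
  return {
    feedType: tag <= 1 ? "ed25519" : "future-primitive", // 0/1 as in current bamboo, 2/3 new
    isEndOfFeed: tag % 2 === 1,
    author: bytes.subarray(1, 33),
  };
}
```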
Repo of the paper source: https://github.com/tschudin/ssb-icn2019-paper
@osakachan The same is true for every system of communication ever. Doesn't mean we should give up improving the default case.
Image description: A stylized drawing of an Indian peacock with colorful spirals as its tail. Also my first attempt at using markers.
Current state of pavo dev
I'm currently swamped with other stuff to do, as well as some wrist pain, so pavo dev is not a top priority right now. Also, I managed to mess up the garbage collection. I have some vague ideas on what might cause this, but I haven't investigated properly. I might have to fork the library of immutable collections to use garbage collection internally rather than reference counting.
But I'm still looking forward to the next steps (after fixing the gc): Writing a few base libraries and then a pavo frontend with good error reporting (beware: everything will be AGPL-3.0), writing an entry-level tutorial/introduction (pavo is small enough to actually explain the full language rather than hand-waving things), then work on a package management solution (which will be language-agnostic). And long-term, there's always an optimizing compiler to work on.
Together with the tutorial I also want to set up a few webpages, github readmes are so ugly. A language called pavo (peafowl in latin) offers quite a few avenues for a strong visual style.
I'm still committed to moving beyond the "toy" stage and making this a "real" language. It will take some time, but that's fine by me. But at this stage, I can't really invite help, there's still too much context that only lives inside my brain, and getting that context out is part of the work that needs to be done.
CC @GoodieHicks
I’m assuming Permit means extra access, and (null) means a default level of access to my more public things
In the proposal, the default is "no access". I think of it either as a capability system (you can't do anything unless you are very explicitly allowed to) or as the bottom element of a bounded lattice (like in a security type system). An effect of this principled approach is that the algorithms on the data structure tend to work out very nicely. I guess that's not very satisfying from a user-driven perspective, and I don't want to imply that there'd be no merit to your suggestion. But see also the next point:
my more public things
Since SSB feeds are basically all-or-nothing, that's not really possible in our setting. This is imo a good reason for making "no access at all" the default, since the only real alternative ("full access by default") somewhat defeats the whole point. I've spent a good deal of time thinking about how to apply this sort of access control to hierarchical feeds of some sort, and in those settings there's a good chance of having a middle ground as the default. I'll keep you posted when I make progress on or write up more of my explorations in that area.
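In case the lattice view helps, here is a toy version in a few lines (the three levels are only an example, not the actual proposal):

```typescript
// Toy bounded lattice of access levels; the proposal's default is the bottom
// element, i.e. "no access unless explicitly granted".
type Access = "none" | "read" | "write"; // none ⊑ read ⊑ write

const order: Access[] = ["none", "read", "write"];
const bottom: Access = "none";

// Least upper bound: combining grants can only ever add access.
function join(a: Access, b: Access): Access {
  return order[Math.max(order.indexOf(a), order.indexOf(b))];
}

// A peer that was never granted anything gets the default, i.e. bottom.
function accessOf(grants: Map<string, Access>, peer: string): Access {
  return grants.get(peer) ?? bottom;
}
```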
@cinnamon (sorry for being kinda brief)
See here for a concrete mechanism proposal for trust-based harm reduction. Imo per-peer (i.e. per-edge) settings are the way to go, per-feed (i.e. per-vertex) settings don't really cut it.
I personally am more interested in working on trust-based solutions than crypto-based ones, in the sense that I enjoy working on the former but lack knowledge of (and interest in) the latter. But both definitely have their place. In some sense though, even if you encrypt it, you are still asking the people who can decrypt it not to pass on the decrypted information. So in some sense, the crypto stuff is not inherently less trust-based than the "naive" replication "hints". There are however very real differences in the underlying dissemination framework: encrypted content can be cached and forwarded by untrusted parties.
Thank you for writing about privacy risks and privilege! (section 7.1) 🌻
That section was very much inspired by some of your posts on the topic
Review #19E
Overall merit
- Weak accept
Reviewer expertise
- Some familiarity
Paper summary
This paper compared Secure Scuttlebutt (SSB) and ICN.
SSB is a decentralized protocol for social applications, and is based on "single-write" and "append-only" log.
Comments for author
This paper is very interesting, since decentralized applications are challenging. But the discussion about L3 protocol (ICN) and overlay network (SSB) is confusing.
SSB is an event-sharing protocol and an architecture “for social apps”, but ICN is a general network architecture “for all apps and all communications”.
The difference between ICN and SSB is the existence of intermediate nodes. Section 4.4 considered ICN forwarders to SSB nodes, but the intermediate nodes are not “users” and do not have any “interesting" subset of the global data pool. Alternatively, you can develop a cache management strategy according to "interest" of SSB node, and improve SSB's performance. Almost peer-to-peer networks do not have a concept of how to use in-network resources, but an important feature of ICN is utilization of in-network resources, unlike TCP/IP. Therefore, it is important to extend SSB to utilize in-network resources. And we need to consider the design of the “underlay” architecture, vanilla ICN, extended ICN or a new L3 architecture.
Review #19F
Overall merit
- Strong accept
Reviewer expertise
- Knowledgeable
Paper summary
The paper presents a specific application level protocol
called ScuttleButt and the different possible relations
it can have with NDN.
The protocol is a decentralized application level protocol
to provide secure communications between groups (social network).
The paper is not about ScuttleButt but how to use it on top
of NDN. An overview of the L7 protocol is given as well as
NDN. The authors go through the analysis how the two architecture
can serve one another.
Comments for author
The paper does not provide running code or an evaluation of the
different options proposed.
The main problem for SB to used NDN is that the latter does not
provide a push API. SB needs it and the authors identify this is
as the main problem to run SB over NDN.
As a result the paper goes through a series of speculations about
how to influence one architecture with the other in different respects.
I think the authors are wrong. You can have a push API in NDN
so that SB can take all the benefits NDN provides.
The mistake, IMHO, made by the authors comes from the fact that they
do not consider a host stack model between SB and NDN.
NDN is a L3 forwarding technology. The host stack is not specified,
but it does not mean that it cannot be specified.
There are host stack models for NDN in the literature that should be
used in this paper to obtain all different possible interactions
required by the SB application, including the push API.
Also, there is no communication system that is purely pull or purely push
from a full stack point of view. IP is push but by using the host stack
model all possible interaction can be obtained by using the socket API for
instance. In NDN the network is pull but again all interactions can be
developed and exposed using a socket API.
Some references on the topic
Moiseenko et al. Consumer-Producer API for Named Data Networking, ACM ICN 2014
Sardara et al. A Transport Layer and Socket API for (h)ICN:
Design, Implementation and Performance Analysis, ACM ICN 2018,
As an example, take any P2P protocol for file sharing such as BitTorrent.
Is it a push or pull L7 communication protocol?
It needs both interactions to work. The rarest first chunk distribution policy
requires a pull operations first and later on a push operation based on the joint peer
and chunk selection scheduler.
The FTP protocol requires an open before the TCP connection can be used to
push data. HTTP is a request reply protocol that triggers push like copies
by using TCP, but it is not the only way to move data from one host to another.
In pub/sub a join is used to trigger a push. gRPC interactions extend HTTP
RESTful model. Kafka, RSocket and many more application level communication
frameworks can generate all sort of interactions which require the end-points,
which can be two or many more, to determine how to interact.
Even in NDN the producer is more than just a stateless memory with named-data.
Some entity has named the data, pushed the data in the memory, decided how
to segment the data, sign it etc.
The paper is well written and the use case brought up by the authors
is timely and the quality of the presentation is sound and does a good
job to explain all the difficulties in changing the transport protocol
used by the application.
Unfortunately the authors fall into the trap by assuming that the request/reply
semantics in the NDN protocol is a full stack semantics. Which is not.
It is a common mistake in my opinion that has also impaired the usage of NDN
for more applications than what we see today.
On a positive note the authors keep an objective view about the different
options they face while trying to integrate SB over NDN. The analysis is useful
but I'm disappointed by the fact that the authors have not understood
the need to insert an host stack at the end-point to make the integration possible.
It would have made the contribution of this paper huge because the SB application
is totally meaningful if run on top of NDN.
Authors need to cut down the discussion on rather impossible integration of SB
with NDN in the final version by focusing on SB on top of NDN.
There need to be discussion with text about the integration of SB over NDN using
a host stacks as available in the literature and comment about a working
integration. Essentially advantages that NDN brings to SB as opposed to using TCP for example.
Review #19C
Overall merit
- Accept
Reviewer expertise
- Knowledgeable
Paper summary
The paper presents a novel approach to building information-centric applications. While content-retrieval applications are paradigmatic for CCN and NDN, social networking applications seem to be the target of SSB. The fundamental concept of SSB is that nodes - which presumably correspond one-one, or possibly many-one, with users - form connections based on trust. Neighbors in this "social graph" replicate and relay each others' append-only, cryptographically sealed logs. In a manner reminiscent of SPKI/SDSI for public key infrastructure, the focus is not on global consistency, but on each user's subjective (and eventual) consistency - what matters to a user is that user's view of the data universe. Instead of a common global hierarchical namespace, the interoperability layer consists of public keys, crypto algorithms, and a common log entry/message format.
The paper is well-written and readable, and does a very good job covering many aspects including how several applications have been implemented, features to be added, and some limitations of the architecture. It also compares NDN and SSB on several scores, and provides a fairly extensive discussion of their differences (see below for comments).
Comments for author
Overall, this a real contribution - a thorough description of a new approach to information centricity. It is sure to bring out good discussion at the conference. The following are a few specific comments.
In the introduction, subsection "Comparing SSB to NDN": "The second difference is more subtle and seems to be rooted in what NDN considers the main focus of networking." Such anthropomorphism (NDN does not "consider" anything) should be avoided. More importantly, your insistence that NDN relies on repos => it is not decentralized, seems to me to be off target. Repos are a scalability mechanism and not a fundamental part of the CCN/NDN architecture. Is it clear that SSB can scale to provide global services - even ones based on social graphs - without relying on, say, an underlying global IP service (which embodies all sorts of centralized facilities), or well-known rendezvous points (cf. Section 7.2)?
The relationships among certain concepts did not come across clearly. In particular, "user/identity", "relay" and "peer" seem to have similar meanings, but it was not clear whether relays have a log separate from a user. Can there be multiple users per "relay"?
The "waist" in Figure 2 contains "log format, peer IDs, blob objs", but the text refers to "follow" and "block" messages. Are these specific types of log entries? A somewhat more complete description of the relay protocol would be useful.
The term/concept "tangle" comes out of left field in the beginning of Section 3. It would be nice to introduce the concept explicitly. CRUD also may not be immediately known to all readers.
Figure 3 was not helpful at all. What relationships are encoded in the positioning? I can't figure out what would replace the "..."s.
I don't understand your point about the social contract in NDN. Of course there is a social contract among forwarders/relays - this is true in any network service. It seems to be at least as crucial in SSB as in any other service. In IP networks (and presumably CCN/NDN, though there doesn't seem to be much discussion of that to date), service providers typically are driven by profit motive to fulfill this social contract. What is the motivation for a user (peer? relay?) to forward logs - which can potentially have very high cost - in SSB?
Speaking of social contracts, I was surprised that the relay API allows applications to "add to the peer's log". This seems like it opens a significant resource-exhaustion attack, by causing the peer to "follow" a whole bunch of identities. More generally, you don't mention DoS attacks at all. One advantage of NDN (according to some) is that its pull model makes DoS attacks harder than in IP. You might mention how SSB's push model stacks up on this score.
In section 4.5, the statement "Either some item is already in one of the eagerly replicated local logs, or it is not available yet (because SSB is push-based)." What is the response to an NDN pull request in the latter case? It seems that the major problem is not the hierarchical namespace, but the definition of "eventual".
Review #19D
Overall merit
- Weak accept
Reviewer expertise
- Knowledgeable
Paper summary
This paper describes the design of Secure Scuttlebutt (SSB), an identity-centric protocol for building distributed applications. The paper made a comparison between SSB and NDN through a thought exercise of SSB-over-NDN, SSB-alongside-NDN, and NDN-over-SSB.
Comments for author
This is an interesting paper to read, especially given I did not know SSB before, the reading is very informative.
However the paper seems suffering from an unclear understanding of exactly what is SSB:
i) whether SSB is an information-centric (as stated in the abstract), or actually identity-centric protocol (as described in section 4.6). Here the issue is not so much of "name-centric" versus "identity-centric", but rather, the question is what the name/identifier applies to. An information-centric architecture applies the the name/identifier delivery to data packets that get delivered through the network .
ii) exactly where in the protocol stack that SSB belongs to; the fact that the paper attempted a comparison between SSB and NDN gives an impression that the authors believe the two protocols are at the same layer in the protocol stack (otherwise comparing protocols at different layers would be like comparing apple with orange), which I do not believe is the case.
I believe NDN is designed to be the new narrow waist on the network protocol stack, i.e. the universal interconnect layer (as IP has been), upon which all sorts of applications can be supported.
SSB does not seem belonging to that same spot, for at least two reasons:
i) SSB depends on some underly protocol (i.e. IP) to deliver its data. One can confirm this by asking the question: can SSB still work, if IP goes away completely?
ii) can SSB work as a universal interconnect layer to support all applications? e.g. including IoT, V2V.
It seems the answer to both questions is a NO, thus SSB is not a L3 protocol.
Review #19A
Overall merit
- Accept
Reviewer expertise
- Some familiarity
Paper summary
The paper discusses how Secure Scuttlebutt (SSB) relates to the ICN approach NDN. It investigates different combinations of SSB and NDN. It observes, that besides the obvious difference that SSB is a Push protocol and NDN is a Pull protocol, they also differ in being identity-centric and name-centric approaches respectively.
Comments for author
The paper is well written and well structured.
The main contribution of the paper is that it widens the discussion of what ICN is and which solutions are of interest to the community. It should stimulate good discussions at the conference and in the community at large.
The paper discusses how SSB can be used for distributed social networks applications, GitHub, Chess, etc. All are quite "heavy" applications that are intended to run on general purpose computers and/or capable smart phones. Traditional ICN has a wider scope, which includes constrained devices such as sensors. It seems unfeasible that constrained devices should use append only logs as a basis for communication. It would be good if a discussion was added that elaborates on the scope of applications that are reasonable for an SSB approach. Also, if SSB and ICN is combined how would different application realms be interconnected and/or interworked?
How concepts like append only logs can scale for use in a global network might not be obvious to all readers, it would be good to add some text that provides some intuitive explanation or example.
Section 4.6, last sentence: "Beside the push/pull theme, SSB’s identity-centric approach seems to introduce a yet unseen element for ICN."
Please elaborate on what you mean by this.
Editorial comments:
The keyword section is missing.
Section 1, §2:
s/, true to is decentralized point of view,/, true to its decentralized point of view,/
Section 3.1, §4:
You introduce the concept "CRUD" without defining or explaining it.
Section 3.3,
For clarity it would be good to add a time axis to Figure 4.
Review #19B
Overall merit
- Weak reject
Reviewer expertise
- Knowledgeable
Paper summary
The authors present the some of the design and some implementation aspects of Secure Scuttlebutt, a practically deployed system for some class of distributed applications. The basic idea is users generate logs of actions that are protected by their own private/public key signatures and can then be replicated. Log entry n+1 refers back to n by means of a hash chain, thereby securing the entire log. The authors discuss the assumptions and present the architecture and then carry out a comparison to NDN (why not also a pub/sub ICN system which would fit much better?). They discuss some related work and present future topics. Evaluation is restricted to saying this is deployed by some 10K nodes (great!) but no details are given.
Comments for author
This is a tricky paper, leaving aside that it probably won't hold water in terms of double-blind submission guidelines, even though one cannot tell which of the 23 members of the GitHub Scuttlebutt community are behind this paper. Overall, the paper looks more like a engineering project description (which is nice) than a paper.
I liked the first two pages a lot, with the idea, the simple concept of single writer logs and the idea of building a simple platform for fully distributed applications. It's just not exactly new. Especially in the context of opportunistic networks, systems such as Haggle and SCAMPI, patly TwiMight and maybe even NetInf from the ICN world, are doing exactly that, albeit with slightly different emphasis. Most, if not all, use public key-based self-generated identities, Haggle has routing along social graphs for information replication, SCAMPI use attribute-value pairs with a pub/sub-style mode of operation for matching and replicating contents, and all serve as platforms for building fully distributed applications that lack central servers and are delay tolerant. Some researchers looked at the properties of information propagation, synchronisation, and even multiple writers and their interactions for such applications. Btw, the replication strategy reminds me of Usenet NetNews
. Also wanna note that we have seen systems for distributed online social networks such as DBook (a delay-tolerant version of Facebook) and Diaspora.
So, fundamentally, distributed systems of this nature are not particularly new. I like the notion of identity-centric. But in this context it would be useful to understand how an identity extends to multiple devices of a user (is this one or many identities) and how the system deals with compromised keys. I also like the user directory idea as well as the protocol stack architecture. From this perspective, the fundamental design choices appear sensible. And I certainly do appreciate the full implementation and deployment effort. Worthwhile playing with. So, basically, section 2 seems fine even though I would have liked to see more details.
Section 3 of the paper, IMO, lacks rigor. This is essentially a set of examples without enough technical depth of precision to allow the reader to follow (and appreciate or criticise) the design details.
The comparison to NDN is section 4 is weird. Why NDN? Ok, it's popular. But then the description is not comprehensive and there are no clear takeaways. I just wanna note that there are pub/sub extensions to NDN which would make the SSB-over-NDN in section 4.3 probably easier (even though this comes at an overhead).
Don't sell work-in-progress as a feature because it is not done yet. A section on "Benefits" belongs into a while paper or marketing material but not here.
Having worked in this space myself, I am curious about the statements on eventual consistency in the presence of independently acting parallel entities and network partitions or disconnected nodes. Getting at some point a complete log of one individual seems reasonable, but an application as described in the paper would probably be interested in the logs of several users. How would you then define eventual consistency and under which assumptions and over which expected time periods would you consider reaching consistency?
The SSB Paper got accepted =)
Now more official than before: the ssb paper written by @Dominic, @erick, @cft and me has been accepted at the ACM ICN 2019 conference. Since we aren't forced into the secrecy of the double-blind review process anymore, we'll move the remaining work into the open (with the exception of higher-bandwidth phone calls).
Here is the version we submitted: icn2019-paper_19-submitted.pdf
We now have until the 23rd of August to prepare the final version, with guidance from an assigned "shepherd" and based on these reviews we got (see the next posts).
For the very curious, there's also an extended, augmented version with bad drawings and bad jokes (at least two of which get slightly better if you imagine "zine" rhymed with, well, "rhyme"). CC @Angelica, @Zach, @andreas, this is probably the closest I've come to making a zine.
@sean For some reason, I've always parsed your old profile image as Ziggy Stardust.
Between the incompleteness theorems, undecidable languages, and the real numbers being fricking weird, I sometimes wish I'd never walked my monkey brain this far into math land. I love how Gödel's face captures those feelings. I tend to retreat into the realm of finite or at least countable sets. Which is an option that pure mathematicians often might not get, I suppose, so hooray for computer science (we have complexity theory though, most of which can also be summarized as "You can't have nice things, and I can prove it.").
Seriously though, sometimes I'm amazed how many impossibility results this universe can throw at us while still continuing to work. And should we ever encounter other sentient lifeforms, we'll be able to bond over that. Well, the nerds at least...
@cinnamon I don't think this necessarily requires offchain content, offchain content is so low-level it doesn't even know what an image is. Images that are used in "post" messages are blobs, so they are not downloaded by default. It's the clients that tell the ssb server to download them immediately, but not all clients do this by default (patchfoo? @cel).
So ultimately this would come down to client support rather than protocol support: When authoring a post, the message content could include the blurhashes of the images it contains. For rendering, the client would then use the decoded blurhash as a placeholder until the real image has been fetched (which might or might not be initiated automatically).
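A rough sketch of what that could look like; the `mentions`/`blurhash` field names and the blob reference are made up for illustration, nothing here is part of any ssb spec:

```python
# Hypothetical "post" content that ships a blurhash next to each blob reference,
# so a client can render a blurry placeholder before the blob is fetched.
post_content = {
    "type": "post",
    "text": "Look, a peacock!",
    "mentions": [{
        "link": "&aGFzaG9mdGhlYmxvYg==.sha256",      # made-up blob reference
        "blurhash": "LEHV6nWB2yk8pyo0adR*.7kCMdnj",  # example blurhash string
    }],
}

def render_image(mention, fetched_blobs):
    """Return the real blob if it has been fetched, else the blurhash that the
    client would decode into a placeholder image."""
    return fetched_blobs.get(mention["link"], ("placeholder", mention["blurhash"]))

# Before the blob arrives, the client gets the placeholder:
print(render_image(post_content["mentions"][0], {}))
```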
@mulrich Do you do the weird encoding dance (encode as utf16, drop the more significant byte) in your hash computation?
If you want to eliminate corner cases, I recommend testing against this dataset.
Here is a badly-written language reference (warning: 5k lines of markdown), and here is a badly-written implementation (that has all the examples from the reference as passing unit tests). The interpreter is depressingly non-optimizing.
I haven't written a proper introduction to the language yet, so I'm not considering it "released". But I will start writing some pavo code next, before getting the documentation to a point where all this work becomes useful to other people as well.
CC @frankie, looks like you'll actually get your language soon-ish. Please be a responsible world-liberator.
@Linas With the push-model of ssb, there's this inherent tension in how restrictive you are with granting push rights. Ultimately each user is able to decide for themselves whom they allow to push content to them. The system that is currently used by pretty much everyone is fairly relaxed. If this starts becoming a problem, people might become more conservative with their follows, or they might switch to a different system altogether (e.g. a friends graph where you could specify the degree of transitivity on a per-follow basis rather than the currently global cutoff that applies to all of your follows).
I guess you could outright reject any push-based approach; in that case ssb would be out of the question for you. The validity of a push-based propagation system is one of the core assumptions that underlie ssb, one you wouldn't be able to "fix" while still being ssb.
Not really relevant to the discussion, but just as an fyi:
As I don’t think you and I are directly following each other, that puts that person maybe 5 hops away from me… yet I presume that they will see my comment
Nope, if you are "too far away", they would only see my posts, but not yours. Since my posts include cypherlinks to your posts, they might be able to fetch your posts on demand, but that would be pull-based rather than push-based (and thus not spammable).
A real example of the problems of a transitive, push-based system: Two weeks ago, @enkiv2 "turned evil" (they misconfigured a script to publish multiple messages per minute), so I ended up blocking that feed. There's a bunch of junk left in my database, but in principle I could delete it.
Re corrupt feeds with malicious intentions, quoting from here:
There’s a bit of tension along requiring certain structure in content: What do you do with invalid content? It invalidates the feed, just like invalid metadata would. But since we have offchain content, we might only retroactively realize that a feed was broken all along. A peer could deliberately publish a message with broken content but never send it out, thus gaining the ability to retroactively cut off a suffix of their feed (and terminate their feed in the process). This sounds bad, but it is actually the world we are already living in: Any peer might at any time fork their feed at an arbitrary sequence number, achieving the exact same effects.
If a feed is broken, we stop replicating the feed beyond the point of breakage, but we want to propagate the information that it is broken rather than doing so silently. So we need proofs of broken feeds. For forks, this is simple: the proof consists of the set of messages with clashing hashes/backlinks/seqnums (partial verifiability makes this nontrivial, but it’s still simple and requires little space). For invalid contents (in particular invalid content size), this is more difficult. You’d need to transmit the whole invalid content so that the peer can check the signature (otherwise you could just claim that something was broken), but the content might be very (in fact invalidly) large. This creates a situation where there’s a tiny negotiation between two peers: “I could send you a proof of breakage, but it would be xxx bytes large. If you don’t want it, we’ll pretend I only had the feed up to that point instead”.
With current SSB, deleting a message's content in your local replica means that you can not replicate that feed to people who have not obtained that message yet. They'd need to check the signature, but you can't provide them the data they'd need to check it. If they already had the message locally and wanted to receive a later suffix of the log from you, that would work out. In practice this hasn't been implemented.
Bamboo signatures are not signing the payload, but only a hash of the payload. So you could delete a payload in your local replica (e.g. because storing it would be illegal) but keep the hash that would be needed to verify the signature. Replication can then proceed just fine. They obviously won't be able to receive the content from you, but they could still verify the feed and thus accept all the other messages from the same feed. SSB is aiming to incorporate such a feature as well (see #offchain-content).
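A minimal sketch of why that works; hmac stands in for ed25519 and the metadata encoding is entirely made up, but the point carries over: the signature covers only the payload's hash (and size), never the payload bytes themselves:

```python
import hashlib
import hmac

SECRET = b"stand-in key"  # hmac as a stand-in for a real signing keypair

def make_entry(payload: bytes, seqnum: int) -> dict:
    # made-up metadata encoding: seqnum, payload size, payload hash
    meta = b"%d:%d:%s" % (seqnum, len(payload), hashlib.sha256(payload).digest())
    return {"meta": meta, "sig": hmac.digest(SECRET, meta, "sha256"), "payload": payload}

def verify_entry(entry: dict) -> bool:
    # verification touches only the metadata, so it works with or without the payload
    return hmac.compare_digest(entry["sig"], hmac.digest(SECRET, entry["meta"], "sha256"))

entry = make_entry(b"something we later have to delete", seqnum=1)
entry["payload"] = None     # local deletion
assert verify_entry(entry)  # the entry still verifies and can be passed along
```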
Re DOS: The replication mechanism is not tied to the ssb protocol at all. The current system of publicly maintaining a friends graph in your log and using these graphs for (transitive) replication is just a convention that has worked sufficiently well so far (although it's definitely not without flaws). If you are worried about getting caught in the crossfire of a DOS through the transitivity of the current mechanism, you can either be very careful with whom you follow, or you might want to use an implementation that uses a different mechanism to determine which feeds to pull in.
The main problem of the PGP attack (anyone can append information) doesn't apply to ssb, here only the author can append information. The core assumption behind the transitive, follow-based mechanism is that individuals can trust other individuals to not spam them. And should they revoke this trust, they can unfollow (or even block) once they found out that their trust has been misplaced. Whether they have to find out manually or through e.g. a collaboratively maintained "spam filter" should be up to them (just as it should be up to them whether e.g. blocking implies local deletion of the data).
Hey @Daan, thanks for the feedback =)
After thinking it over a bit, this would probably not work because the payload’s hash would not match either in that case. It might be worth being explicit here that the hash is to be checked first, and then the size.
Yup, that's correct. Will make this more explicit.
So how does this play together with ssb’s current signing (and in the case of flume storage) format?
It doesn't. Every time I write up a format people assume it is about ssb =(
We'll probably end up using a modified version of bamboo for ssb, but in principle bamboo itself is a standalone protocol not connected to ssb.
Is that a typo?
Yes, will be fixed. It's my natural instincts that want to start at zero fighting against the fact that in this particular setting the math works out far better when starting at one.
Evolution, Compatibility, Naming
The above sections are nice and all, but mostly abstract blabbering. We have a very concrete protocol on our hands right now, and while we want to improve it, we also want to preserve backwards compatibility. Finding precise definitions for this turned out to be surprisingly difficult (and also fun); we ended up drawing quite a few graphs.
An unexpected observation: L-SSB does not involve concerns about backwards compatibility, those only arise with the need for feed identifiers and message identifiers. Yet it is our desire to change L-SSB that drives the current work on protocol evolution. This was reflected in how little time we spent on the details of the new L-SSB format. The real blockers are living in a higher layer.
Another general observation before we dive into the specifics: Dominic proposed an answer to "what is scuttlebutt" here that relied on a global view of the system. As long as there were connections between different implementations where each connection shared a common format, things would be fine. There are two problems with this: First, verifiability is not transitive. Suppose we have three servers, A, B and C, all connected to each other. A only speaks "old-SSB", C only speaks "new-ssb", B speaks both. C can not verify A's messages, and it doesn't help that B could "translate" - C would have to trust B to not change any meaning. And once we allow trust into the equation, we could simply say that C trusts A and everything would be fine. Also we wouldn't need SSB anymore, we'd use email instead.
Aside from the transitivity problem, it is dangerous to define SSB in a way that requires a global view of the system - by design "the system" doesn't exist in the first place, and even if it did, getting a global, consistent view is impossible. Instead we need a purely local definition: Given a computational node, does it implement SSB?
Backwards compatibility means that if the answer to the above question is "yes" at some point, then it must always remain "yes" even if the protocol changes but the node does not.
For our purpose (notably we are fully excluding replication), the SSB protocol consists of sets of encodings, both for L-SSB and U-SSB. Right now, there is exactly one of these for each, let's call them l-json and u-json. We plan on adding new encodings with nicer properties, let's call those l-future and u-future.
Right now, for a node to implement SSB, it must support l-json and u-json. In the future, all nodes must still be able to verify l-json and u-json logs to preserve backwards-compatibility. They do not need to be able to produce new messages in that format (in fact you don't need to be able to produce messages at all to be an SSB implementation, but people would probably switch to an implementation that wasn't read-only pretty quickly).
Once l-future and u-future are introduced, you can still call yourself scuttlebutt even without supporting them. But you'd be an old version of scuttlebutt and might not be able to verify newly produced content.
As time progresses, there might be l-far-future and u-far-future, and so on. New implementations would still be required to be able to verify messages in all the old formats, and they'd still be able to choose which formats to support for publishing (probably only the newest one).
It might become too much trouble to support verification for all the old formats. So you could choose to only implement a subset of scuttlebutt. That would be totally fine and there are settings where this makes sense. But you wouldn't be a full ssb node.
While there are few constraints on the L-SSB formats, we've decided to place a stronger constraint on the U-SSB formats: Any new self-describing data format to become part of SSB must support feed identifiers and message identifiers (and also blobs I guess?) for all prior formats.
Who decides when a format becomes official? Right now we don't have a formal structure for this. The way of the default world would be to register a trademark to control the name "ssb". Without that, there might be situations where two different parties (say the sunrise choir and verse) design their individual formats independently and build implementations that support the old stuff and their own, but not the format of the respective other party. Even if at some point both of these formats became part of the "official" (whatever that may be) SSB, these implementations would not be full implementations, they'd only be implementing a subset.
In any case, we expect these kinds of situations to arise rarely if at all. Rather than claiming to be the "true new SSB", you could just say "Hey, we've implemented this extension, please use it because we think it is a good idea, but remember that there is a certain risk that it might not be actual SSB and might never become actual SSB either.". Similar to how html evolved, but hopefully with less reduction to the lowest common denominator. This will probably also be how the new formats will be rolled out: provisional (we might put in a horrible error by accident) until we feel confident that they could become official (and thus all far-future implementations would have to support them as well).
That's it.
Verification aka Lower Scuttlebutt
Lower ssb (L-SSB) is about assigning metadata to some content such that the resulting package can be securely replicated. From the point of view of L-SSB the content is just a string of arbitrary bytes, it doesn't care about json or other self-describing formats. Currently, this is instantiated by a json-based format with a few (ahem) loveable (cough cough) quirks. Did you know that the `sha` in `"hash": "sha256"` stands for "Scuttlebutt Happened Anyways"? Bamboo, birch and similar proposals target this layer of the protocol.
A semi-formal definition of L-SSB: An L-SSB is a triple of algorithms/functions (append, verify, hash), such that:
- `append` takes a log, a secret and a content, and (if the secret is valid) returns a new log whose newest message has the given content
- `verify` takes a log (roughly) and returns whether it is "valid" (I did call it semi-formal, didn't I)
- `hash` takes a message and returns its hash (modulo support for different primitives)
Ok, maybe this should be called an informal definition instead...
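A toy sketch of that triple, with sha256 in place of real signatures and a made-up encoding; it only illustrates the shape of the interface, not the actual ssb (or bamboo) format:

```python
import hashlib

def hash_msg(msg: dict) -> str:
    # made-up canonical encoding of a message's metadata
    return hashlib.sha256(repr(sorted(msg.items())).encode()).hexdigest()

def append(log: list, secret: str, content: bytes) -> list:
    # a real L-SSB would sign with the secret key; here the "author" is just derived from it
    msg = {
        "author": hashlib.sha256(secret.encode()).hexdigest(),
        "seqnum": len(log) + 1,
        "backlink": hash_msg(log[-1]) if log else None,
        "content_hash": hashlib.sha256(content).hexdigest(),
    }
    return log + [msg]

def verify(log: list) -> bool:
    for i, msg in enumerate(log):
        if msg["seqnum"] != i + 1:
            return False
        if msg["backlink"] != (hash_msg(log[i - 1]) if i > 0 else None):
            return False
    return True

log = append(append([], "my-secret", b"hello"), "my-secret", b"world")
assert verify(log)
```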
This is the part where we actually had the least to discuss; the path forward for the metadata seems fairly clear (a rough sketch of the resulting metadata follows the list):
- include a backlink
- include a "lipmaalink" to support efficient verification of out-of-order messages
- include the public key of the feed's author
- include the size of the content
- include a hash of the content, but not the content itself (#offchain-content)
- include the sequence number
- remove the timestamp (moving it into application data)
- don't add any additional sequence numbers for fancy partial replication stuff
- leave this to the application level, or for a later protocol revision
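Put together as a logical structure (field names and types are illustrative, this is not a wire format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntryMetadata:
    """Roughly the fields listed above; a logical view, not an encoding."""
    author: bytes                 # public key of the feed's author
    seqnum: int                   # starts at 1
    backlink: Optional[bytes]     # hash of the previous entry (None for the first)
    lipmaalink: Optional[bytes]   # hash of the lipmaa predecessor, for out-of-order verification
    content_size: int             # size of the (offchain) content in bytes
    content_hash: bytes           # hash of the content; the content itself is not included
    # deliberately absent: a timestamp (application data) and any extra seqnums
    # for partial replication (left to the application or a later revision)
```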
Self-Describing Data aka Upper Scuttlebutt
Upper Scuttlebutt (U-SSB) is about additional meaning in the content bytes. Roughly speaking, we want the content to be in some well-known, self-describing data format that allows us to specify and detect references to feeds, messages and blobs. A nice side-effect of self-describing data is that everything is interoperable by default and we can build general-purpose databases. Of course you could still use a proprietary format encoded as a byte string, but this would be frowned upon.
Blobs are an interesting concept in that they seem to be integral to SSB at first glance, yet they don't appear in L-SSB at all. In fact you could take the current js implementation, rip out the blob handling and everything would still work. You might get a very different view of the scuttleverse (lots of weirdly encoded strings where there used to be pretty images), but that is fully in line with the freedom of all users to interpret the data that they receive subjectively. You'd just happen to interpret blob references as literal strings. This has been a fun realization, but still U-SSB currently supports blobs, and so future iterations of U-SSB probably want to keep them (also, images are neat).
Next, the self-describing data format has a standard way of referring to a feed; such a reference consists of an indicator for the signing primitive (e.g. ed25519) and the public key. There are not a lot of interesting things to say about this.
Referring to messages however is more interesting. Since L-SSB assumes totally ordered feeds, we can uniquely address a message through the pair of its author and its sequence number. Unlike hash-based addressing, this uses intrinsic properties of the message rather than imposing some external criterion (a hash function). Most replication mechanisms for SSB will be optimized for finding and retrieving feeds, not hashes (use ipfs if you want a flat namespace of hashes). So besides being arguably more pleasing from a theoretical point of view, including the author in a message's identifier will make it easier to retrieve it in an out-of-order setting.
We do however want the graph of linked messages to stay acyclic. To guarantee this, we still need to include a cryptographically secure hash of a message. But even if the hash function was broken, only the feed's author could use this as an attack opportunity, by producing a new message with the same hash and the same sequence number. But producing a message that duplicates an existing sequence number means forking their feed.
As further protection against broken hash functions, it might make sense to allow multiple hash values in a single message identifier (e.g. sha256 and blake2b). Instead of complicating the format for message identifiers, we could instead define a hash function `sha256blake2b` that outputs the sha256 hash followed by the blake2b hash. While this is hacky, it allows us to keep the format simpler. And when humanity reaches the point where they can break `blake2bsha3`, it is very likely that either the advances in computing power or in mathematics would be so immense that they've probably invalidated a bunch of further implicit assumptions of ours as well.
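A sketch of such a concatenated hash, using the example names from above (nothing about the layout is normative):

```python
import hashlib

def sha256blake2b(data: bytes) -> bytes:
    # 32 bytes of sha256 followed by 64 bytes of blake2b; forging an identifier
    # would require breaking both functions at once
    return hashlib.sha256(data).digest() + hashlib.blake2b(data).digest()

assert len(sha256blake2b(b"some message bytes")) == 32 + 64
```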
There's a bit of tension along requiring certain structure in content: What do you do with invalid content? It invalidates the feed, just like invalid metadata would. But since we have offchain content, we might only retroactively realize that a feed was broken all along. A peer could deliberately publish a message with broken content but never send it out, thus gaining the ability to retroactively cut off a suffix of their feed (and terminate their feed in the process). This sounds bad, but it is actually the world we are already living in: Any peer might at any time fork their feed at an arbitrary sequence number, achieving the exact same effects.
If a feed is broken, we stop replicating the feed beyond the point of breakage, but we want to propagate the information that it is broken rather than doing so silently. So we need proofs of broken feeds. For forks, this is simple: the proof consists of the set of messages with clashing hashes/backlinks/seqnums (partial verifiability makes this nontrivial, but it's still simple and requires little space). For invalid contents (in particular invalid content size), this is more difficult. You'd need to transmit the whole invalid content so that the peer can check the signature (otherwise you could just claim that something was broken), but the content might be very (in fact invalidly) large. This creates a situation where there's a tiny negotiation between two peers: "I could send you a proof of breakage, but it would be xxx bytes large. If you don't want it, we'll pretend I only had the feed up to that point instead".
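For the fork case, a toy version of such a proof check; hmac stands in for the author's real signature scheme and the metadata encoding is made up:

```python
import hashlib
import hmac

KEY = b"the author's key"  # stand-in for the author's signing key

def signed(author: str, seqnum: int, content_hash: str) -> dict:
    meta = f"{author}:{seqnum}:{content_hash}".encode()  # made-up encoding
    return {"author": author, "seqnum": seqnum, "meta": meta,
            "sig": hmac.digest(KEY, meta, "sha256")}

def is_fork_proof(a: dict, b: dict) -> bool:
    """Two validly signed messages by the same author with the same seqnum but
    differing contents prove that the feed is forked."""
    both_valid = all(
        hmac.compare_digest(m["sig"], hmac.digest(KEY, m["meta"], "sha256")) for m in (a, b)
    )
    return (both_valid and a["author"] == b["author"]
            and a["seqnum"] == b["seqnum"] and a["meta"] != b["meta"])

m1 = signed("@alice", 42, hashlib.sha256(b"one history").hexdigest())
m2 = signed("@alice", 42, hashlib.sha256(b"another history").hexdigest())
assert is_fork_proof(m1, m2)
```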
continued in next post
What is Secure Scuttlebutt?
July 2019 Edition, Aljoscha's point of view.
We want to turn a crappy encoding into a crabby encoding.
(Quote surprisingly irrelevant to what follows)
What is ssb? Well, this is going to take a while. The following is a (necessarily biased) synthesis of a discussion between Dominic, Arj, Cryptix, Keks and me. It took place at a dedicated real-life meeting and evolved over four to five hours, so this writeup will not be able to recap the whole discussion. I'll rather aim to summarize the consensus we reached, giving more space to more contentious topics. This writeup might explore certain topics a bit further than we did in the discussion, and also I didn't take notes, so take everything with a grain of salt.
Over the course of our exploration, we subdivided the problem space into mostly independent chunks:
- The conceptual space occupied by SSB
- Verification ("lower scuttlebutt")
- Self-describing data ("upper scuttlebutt")
- Evolution, compatibility, naming
Conceptual Space
At this stage of our exploration, we are not talking about the current SSB protocol per se, but rather about the "SSB approach" of solving whatever problem Dominic set out to solve when he devised SSB. Defining this approach gives us a guideline on how SSB can evolve; it helps set the scope for future endeavors. If we underconstrain it, "SSB" becomes meaningless (if SSB ever encompasses a text editor, something went wrong). If we overconstrain it, it might not be able to evolve in a certain direction even though that might become necessary at some point.
One aspect we didn't even talk about is verifiability. SSB wants to be able to pass messages along untrusted peers, so we always need to be able to verify the correctness of data (yeah, "correctness" is horribly unspecific, but I don't really want to wake the others to have an hour-long discussion right now).
The uncontentious aspects were the focus on identities and the append-only nature: Replication happens at the granularity of messages that are being produced over time, tied to a cryptographic identity.
A more fuzzy concept was the "immutability" of SSB logs. A more precise term might be "persistent" in the sense of a purely functional data structure: "Modifications" to the data structure leave older versions intact. While a feed id refers to a growing data structure, the identifier for a specific message always refers to a fixed data structure that can never change: The message itself, and its predecessors.
Finally, we have the issue of total order along the messages of a feed. Informally, this means that you can assign unique sequence numbers to the messages of a (non-forked) feed such that the sequence numbers reflect the order in which the messages were created. This is a surprisingly strong requirement and forms the basis for SSB's replication: If two peers care about the same feed, they exchange the newest sequence numbers they know about, and can then easily deduce which messages they should transmit.
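A toy version of that exchange, assuming each peer stores a feed as a plain list indexed by sequence number:

```python
def messages_to_send(my_feed: list, their_newest_seqnum: int) -> list:
    """Total order makes this trivial: the peer tells us the newest sequence
    number they have, and we send everything after that."""
    return my_feed[their_newest_seqnum:]  # seqnums start at 1, list indices at 0

my_feed = ["msg1", "msg2", "msg3", "msg4", "msg5"]
assert messages_to_send(my_feed, their_newest_seqnum=3) == ["msg4", "msg5"]
```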
Baking total order into the core assumptions of SSB has important consequences: SSB will not support different replication models (such as replicating unordered sets or partially ordered DAGs). Applications can rely on the total ordering property whenever they consume a feed (or subset thereof). The latter is nice for application developers, but there is a good case to be made that applications should only rely on partial ordering instead: Partial orders arise naturally in the context of aggregating multiple feeds into a single logical entity ("same-as").
If we removed the guarantee that a feed is totally ordered, this would be a breaking change (e.g. the friends graph could start behaving weirdly, as could gatherings and a bunch of other applications in common use today). Instead, applications should be offered two separate interfaces: One for consuming totally ordered, "physical" feeds, and one for consuming partially ordered, "logical" aggregates (don't call them feeds or logs please, these terms imply total order). If they use the partially ordered interface, the actual mechanism for determining how feeds are aggregated (i.e. the implementation details of same-as and other such approaches) can be swapped out transparently, the applications will just work.
The aggregation mechanism is thus pushed into userspace, the SSB protocol itself can focus on totally ordered feeds and is free to utilize their nice properties. In particular, we will leverage this decision in the discussion of message identifiers (roughly "cypherlinks"). With message identifiers relying on this property, we have fully committed to this design: Introducing replication of partially ordered sets in the future would become very hairy. So if in the future the drawbacks that currently prevent us from replicating partially ordered sets can be eliminated, SSB might not be able to incorporate this new paradigm. Instead, a new format should arise and take its place. And that is ok, we can't reasonably design for immortality - all such protocols end up turning into monstrosities.
Some notable (and deliberate) omissions in the conceptual space: Blobs, encryption, replication, cypherlinks. We'll return to cypherlinks and blobs later, encryption really should have been a userspace concern, and replication doesn't really matter: As long as we deal with finite data structures, humans will be able to exchange them, and we don't really care how they do it. I mean, all of us actually strongly care about it, but not at this level of abstraction.
continued in next post
{ "type": "tamaki:publication", "img": "&nDXZKpILvI1iaqeJ/qfCb8dAxge81+Iw12zvEaZep3Q=.sha256", "title": "Peacock", "description": "It's a peacock! Also, markers are hard. As are spirals.\n\nI like the name 'pavo' (latin for 'peafowl') for the programming language I'm about to finish prototyping. In addition to being short and memorable, the name also suggest a whole range of visuals and imagery that could be used. I'm somewhat hesitant to focus on the tail feathers too much though, because the extravagant tails are only present in males. The characteristic crest is present for all peafowl, so if pavo get's a logo, it'll likely be a peahen head (rather than the obvious choice of a peacock tail feather).\n\nThe peacock is also a symbol for pride/vanity. So if the imagery ended up suggesting \"male pride\", that would suck =/\n\n---\n\nCompletely unrelated: I'm hyped for space, even though it might get lonely at times. (Yes, I am regularly taking a look at hacky-art)", "caption": "A clumsily executed, stylized drawing of an Indian peacock with colored markers. The tail is not rendered faithfully, it consists of colorful spirals." }
No, this is fine because the publisher of the (extended) Meta/Event item will have done the serialization and signed the serialized bytes and you receive the serialized bytes inside the transfer item. No need for you to redo the serialization when verifying.
Yes, this is possible. But it forces the receiver to store the exact bytes, rather than letting them choose how to persist the data. Since they will likely have a different representation to efficiently access the data, they will need to store both.
My main gripe with this approach is that deterministic functions should be the default. Mathematically speaking, functions map an input to a deterministic value by definition. Algorithms are deterministic by definition. It actually takes work to get non-determinism. So why go out of your way to invite nondeterminism into a setting where it is actively detrimental?
If we abstract over the issue of how to encode the individual items (author, seqnum, etc.), what remains is the question of how to encode the product of the items. The natural answer is to concatenate the encodings of the individual items. Anything more is incidental complexity (and in particular, using self-describing maps means refusing to utilize the inherent order of a product - for no good reason). No established format will be simpler than concatenating a bunch of bytes. If you can't trust implementors to concatenate bytes, then you can't trust them to get any of the actually interesting stuff right either. And there won't be ready-made libraries for the other stuff to save them.
This is not nih-syndrome, there's nothing to invent at all.
starts to calm down but hits the publish button anyways
I'm going to Basel!
This photo was taken half a year ago, when @elavoie and me visited @cft in Basel for some ssb-related paper planning.
The hat in this photo will return to Basel in October, and the corresponding human will spend half a year as an intern in @cft's research group. An internship was the easiest way of satisfying the bureaucracy to let a TU Berlin student do a semester at Basel University. Whatever I'll end up doing, it will probably involve replicated append-only logs. You know, like this scuttlebutt thing (and also #bamboo and super-secret future projects).
=)
If only we could convince @Aljoscha to go read all the papers on the subject. ;-D
That can't be a sustainable solution to all your problems =P More seriously, I lack the crypto background to be really helpful here. I took a very brief look at some PSI stuff a while ago, but didn't find it enjoyable.
There is however some really neat stuff in the realm of (non-interactive) approximate set reconciliation, and it is somewhat private since it uses bloom filters (and variations thereof).
Intuitively it doesn't feel like it should be possible to prevent an attacker from confirming guesses (and useful slowdowns through sensible rate-limiting seem hard to achieve as well). At least public keys have good entropy... Still, it might make more sense to have some trust-based system in place where a component doesn't answer arbitrary requests. This means that there'd need to be some way for the information to flow transitively, which might conflict with privacy requirements... Hmm...
Dammit, now I started thinking about this! Rabble, perhaps you could elaborate a bit on what kind of system you are thinking of (not how it works, but what it does and what properties it should have).
No promises though that I'll still be interested in this when I wake up tomorrow. Or that I'd come up with anything useful even if I was.
- new faerie
- UTC+2 (Berlin)
- (re)connect with people, regular check-ins, occasionally (but not primarily) vent feelings about the utterly depressing futility of computer science
- blank
I might be fully or mostly offline until the full moon, so I'd be grateful if a forming ring would take me in.
Wildcard butt I'd like to interact more with, beyond all the lovely humans I have crab-met already: @cblgh
@Piet I'd love to see such a data set. As with my choir work, I'd recommend (and can help with) setting up a fuzzer to generate weird data sets.
There is however an important public service announcement: Please, please, please use lowercase base16 instead of base64 for encoding binary content in json: It's simple, unambiguous, bijective to the original data (lowercase being an arbitrary choice), and can be implemented in 10 minutes. Base64 is a mess with competing specs, odd legacy whitespace cornercases, and canonicity issues. Unless you really care about the compression (but for some reason can use neither a binary format nor actual compression), don't default to base64. End public service announcement.
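For what it's worth, in Python that roundtrip needs no library and no configuration choices at all:

```python
data = bytes([0, 255, 16, 42])
encoded = data.hex()                    # lowercase base16: "00ff102a"
assert bytes.fromhex(encoded) == data   # bijective roundtrip, no padding, no alphabets to pick
```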
Re TagNet: I can see why the gains in expressivity that subset matches provide over prefix matches are desirable, although I'm not convinced that they are worth the additional complexity. But even ignoring the question of merit, I don't think tag subsets can be made feasible for our setting: We need to include a seqnum for every partial subscription that an entry could match, so that we can detect when peers maliciously withhold matching entries. For prefix matching, the number of seqnums is equal to the number of path components. With subset matching, the number of seqnums would be equal to the size of the powerset of its tags, that is exponential.
Then again, if the number of tags was bounded, say to seven... 2^7=128 seqnums, many of them quite small (and thus only 1 byte as a varint), probably a lot of duplicates that allow some clever compression scheme... I guess this could work out. (7 would also be a realistic bound for the number of path components, though I could see going for 15 as well)
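The back-of-the-envelope arithmetic behind that:

```python
# seqnums a peer has to track per partial subscription so that withheld
# matching entries can be detected
def seqnums_prefix(num_path_components: int) -> int:
    return num_path_components      # one per prefix of the path

def seqnums_subset(num_tags: int) -> int:
    return 2 ** num_tags            # one per subset of the tag set (its powerset)

assert seqnums_prefix(7) == 7
assert seqnums_subset(7) == 128
```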
Thanks for the link @cft, I'll push that to the front of my reading queue.
I've also been thinking about parallels to ndn namespaces. Hierarchical bamboo in comparison would be more restricted (all subfeeds are controlled by the same keypair). On the other hand, it can do without the global ndn root. With hierarchical bamboo, you could even do prefix-based routing on subfeeds (and replication of multiple feeds is the inefficient special case of the shared prefix being the empty word). Makes me wonder about supporting subfeeds with different keypairs, essentially arriving at a push-based ndn.
Btw, I put your iterative lipmaalink computation into the README:
A python function computing lipmaalinks that doesn't explicitly use logarithms (credit goes to cft):
```python
def lipmaa_iterative(n):  # for the fixed graph
    m, po3, x = 1, 3, n
    # find k such that (3^k - 1)/2 >= n
    while m < n:
        po3 *= 3
        m = (po3 - 1) // 2
    po3 //= 3
    # find longest possible backjump
    if m != n:
        while x != 0:
            m = (po3 - 1) // 2
            po3 //= 3
            x %= m
        if m != po3:
            po3 = m
    return n - po3
```
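A few spot checks; the expected values are the `lipmaa` entries of the 40-node data set posted elsewhere in this thread:

```python
assert lipmaa_iterative(2) == 1
assert lipmaa_iterative(4) == 1
assert lipmaa_iterative(13) == 4
assert lipmaa_iterative(14) == 13
assert lipmaa_iterative(40) == 13
```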
I just made a breaking change to the spec, now the author's public key is part of the metadata. CC @piet @hoodownr
Also added a bold status note saying that further breaking changes might happen.
The particular change I'm considering is baking hierarchical feeds (some ramblings here, there's no clean writeup yet) into the spec. Those are one particular mechanism through which it could be specified which subset of a feed one wants to request. I could leave bamboo as is, and require log formats that want to support hierarchical feeds to provide the additional data as part of the content. The big advantages are that it keeps bamboo simpler and that other log formats could use different mechanisms for partial subscriptions while still getting to use bamboo. But there are a few arguments in favor of hard coding:
- if the data was part of the content, how do we deal with deleted content?
- the above would not be a problem for full feed subscriptions, only for subfeeds -> subfeeds are second-class compared to full feeds
- in some settings, it would be nice to conceptually consider full feeds as subfeeds of the "universal feed" that contains all entries of all authors - again, that doesn't really work out if subfeeds are second-class
- greater cohesion among implementors - can't fracture into different user partitions that use different partial subscription models
- would be a distinguishing feature from ssb (which probably won't hardwire a partial subscription mechanism?)
As for the additional metadata for that, I'd only want to sign its hash to reduce leaked metadata. And I'd probably move the size indicator from the signed data into the hashed metadata as well.
In any case, those ideas will have to stew for a while, I won't rush them out. So there might be a larger breaking change coming at some point. Then again, you could just ignore it and keep using the old (i.e. current) format (bamboo-lite? bambino? bambi?) - there's nothing wrong with it.
Bug report: Nicknames are not revowelerized and can thus be used as a channel for unrestricted communication.
See also this implementation of computing lipmaalinks (I suppose I should include it in the README...)
@hoodownr Some improvements to bamboo that didn't make it into the README yet but probably will at some point:
- include the author's hash in the signed data
- sign only a hash of the metadata, not the metadata itself (subfeeds can still be verified, but without leaking metadata about messages not part of the subfeed, also slightly reduces the amount of data that needs to traverse the network)
I'd been secretly hoping there'd be a post of yours eventually that would contain both a panda and bamboo, and it happened sooner than I expected =)
@enkiv2 Just wanted to let you know that the new subheading ("a daily digest of what I’ve been reading", replacing the former "a daily digest of 23 links you (should) have already seen") is sooo much more friendly and welcoming. I now skim over the lists and check out a few things if I have the time, rather than flat-out ignoring it. And there's often some interesting stuff in there. So thanks for sharing it =)
I'm very sorry to shake your confidence @Mia Gooper, but that might very well be a false negative: I sold my soul at some point (though I did eventually get it back through mutual cancellation of the contract), yet most automatic doors continued to open for me. Then again, they might have been confused about a spare soul I acquired a couple of years earlier. I'm still fuzzy on whether I actually got that soul, given there was nothing but a verbal agreement.
{ "type": "chess_game_end", "status": "mate", "ply": 66, "fen": "8/k1p5/bR6/3Q1p2/3Pp3/8/6PP/4q1K1 w - - 0 34", "root": "%I949aoEkh+lnewBr40vajJJIyZ3H7iyb4SzCQnPPWqU=.sha256", "winner": "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519", "orig": "c1", "dest": "e1", "pgnMove": "Qxe1#", "branch": "%tILFJ6OGkQPeJukrsBp2MZ3Vx4jsvbeJIv6wMk1S4Wc=.sha256" }
There is a galactic council already, but they only reveal themselves to those civilizations that attempt to start one on their own.
Falsify this claim by starting a galactic council!
You should still be able to delete messages locally and then receive future messages without any problems, it’s just that you wouldn’t be able to share those messages with your peers.
That's already true for current ssb, it's just not implemented in jsbot. Simply keep the hash of the message you want to delete, then use the hash for backlink verification.
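A toy illustration of that, with sha256 over the raw message bytes standing in for ssb's actual message hashing:

```python
import hashlib

# local store: we keep either the full message bytes or just their hash
store = {41: {"bytes": b"...message 41...",
              "hash": hashlib.sha256(b"...message 41...").hexdigest()}}

def delete_locally(seqnum: int) -> None:
    """Drop the content but keep the hash needed for backlink verification."""
    store[seqnum]["bytes"] = None

def backlink_ok(backlink_of_next_msg: str, prev_seqnum: int) -> bool:
    return backlink_of_next_msg == store[prev_seqnum]["hash"]

delete_locally(41)
assert backlink_ok(hashlib.sha256(b"...message 41...").hexdigest(), 41)
```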
Absolutely, but I think this would imply that you could receive message content after the message.
Yes. That's basically the point (consider e.g. the scenario where you only request contents up to a certain size and at a later point raise that limit).
but if you know a way then I’d be super interested in a better solution.
Flip the responsibilities. If you run a legacy js implementation, somebody offers to give you messages 1-41 and 43-100, but you don't want to rebuild the indexes once you get 42: just pretend you never got messages 43-100. (We'd obviously extend the replication rpcs to indicate that you don't want to replicate beyond "gaps" so that the messages would actually never be sent, but conceptually you'd still choose to ignore some messages.) Ignoring the data is a simple, local and free operation, but from your local perspective it has the exact same effect as preventing everybody else from sending you data with gaps. In the future, people might build backends that can handle out-of-order content. Why preemptively constrain them? Also good luck enforcing a global replication restriction in the first place =P People can still write down messages by hand and exchange their favorite messages in person, manually inserting them into their order-agnostic databases. Or whatever. Replication is whatever two nodes use to interchange data, you can't control it.
In this system peers replicating feeds would be required to replicate all content that hasn’t been deleted by the author, which means that peers can’t selectively censor message content from a feed.
A large part of the motivation for offchain-content was the ability to locally delete arbitrary data without losing the ability to replicate the remainder of the feed. Forfeiting this seems odd.
How do you differentiate between actual “deleted” content and a peer without the content?
Counter question: Can we do without being able to differentiate? I strongly suspect that the answer is "yes", the cost is lower than not having local deletion.
Also, since ssb will likely move to supporting partial replication, peers will be allowed to not have some data anyways. The concept of "selective censoring" won't really apply.
@piet caught an off-by-one in the lipmaalink function definition: in the first case of `f`, it should be `return n - (3^(k-1));`, not `return n - (3^k);`.
Macaroni is better than spaghetti. If Guy Steele says so, I trust it. Other than the title, that paper has little to do with your question though.
"Software metrics" is probably the keyword you are looking for. They are not uncontroversial and stem mostly from enterprisey, oo-based environments. So beyond (and sometimes within) the papers, there are many opinion pieces out there.
Some metrics regarding spaghetti:
Agenda suggestion: Going through birch so that cft can tell us all the things we should address in Hamburg without him.
{ "type": "about", "about": "%ssEFz6BNCjV1oX3x+Gc9kY+OaNFaLIGxRrgmMT/Wt2c=.sha256", "attendee": { "link": "@zurF8X68ArfRM71dF3mKh36W0xDM8QmOnAS5bYOq8hA=.ed25519" } }
@Emmi Agreed that the setting is far more complex, I share your objections. There surely is some neat math through which you could establish expected reputation outcomes and widen the propagation range if the actual reputation outcome deviates far enough. This and other safety mechanisms could be part of the negotiation process, or they might even be inalienable. But then this probably still leaves some other loopholes (or might even create new ones). My intuition is that no matter how much engineering you pour into this, it's just not "solvable" in the conventional engineering sense (which is still deeply rooted in my brain, though I'm trying to make it more of an optional asset rather than my main mode of thinking).
Still, since engineering is fun: I should also be allowed to retroactively widen the spread of reputation that concerns only me - e.g. if I am rated better than I expected, or if I want to give a clearer picture of past-me. A generalization to sets of people would be conjunctive, all affected people would need to agree.
Outside the pure engineering view: Since it is basically impossible to restrict the propagation of reputation information through purely technical means, there would likely be social regulations (possibly laws, etc.) (arg, I barely know any proper (English) terminology in these areas). There'd hopefully be some basic rights that can overrule the regulations governing reputation propagation, which helps with some of the objections.
Thanks for the #nerd-sniping btw =)
A clumsy attempt at pointing out a privacy/information-flow aspect: There could (should?) be more nuance than "You put some data into the reputation network, now everybody in that network can evaluate you based on it". I'm quite fine with people I directly interacted with judging me based on those interactions, even if those interactions were "off the record". I'm less ok with a third party (say a state or financial institution) judging me based on off-the-record interactions they didn't participate in. Then again, if they told their friends about our interaction, that's yet another, different story.
It's interesting to think about a system where prior to an interaction, the parties negotiate how far the resulting reputation changes may propagate (might be asymmetric) (let's ignore that a negotiation is an interaction in itself). The aim being to give agency to the actors themselves to decide how information that pertains to them spreads. A simplified view: Low propagation serves privacy, far propagation serves security. If the needs of the participants can not be reconciled, the interaction does not happen at all.
Append-Only Ropes
An observation on binary antimonotone graphs outside the context of log verification.
Suppose you want to build a version control system like git. You have a file that evolves over time, and you want to be able to access it at arbitrary points in its evolution. The trivial solution is to store a full copy at each timestep. This is however very inefficient. An alternative is to only store the changes (deltas) between the versions. The very first version is stored as the delta to the empty file.
While this helps with storage needs, there's a different problem now: Reconstructing a file takes time linear in its time point. To solve this, one could introduce caches at regular intervals that store the file at that particular point. Binary antimonotone graphs offer a different solution.
At each point in time, we store both the delta to the previous state, and the delta to the predecessor in the above graph. This ensures that for each node, we can reconstruct the state by applying only a logarithmic number of deltas. Another cool feature: We can construct the delta between any two nodes by combining the deltas on the shortest path between them - taking time logarithmic in the difference of the time points.
Viewed abstractly, restoring a state corresponds to quickly computing a bunch of monoidal operations, and computing a delta between two arbitrary points corresponds to subtraction in an additive group. A different, less mathematical viewpoint regards this example as a specific instance of an event sourcing architecture.
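A sketch of the reconstruction step under these assumptions: `lipmaa` is a function like the `lipmaa_iterative` one further up, and a delta is just a function from the old state to the new one (the representation is entirely made up):

```python
def path_to_root(n: int, lipmaa) -> list:
    """Follow the graph's backjump links from version n down to version 1."""
    path = [n]
    while n > 1:
        n = lipmaa(n)
        path.append(n)
    return list(reversed(path))          # e.g. [1, 4, 13, 14] for n = 14

def reconstruct(n: int, deltas: dict, lipmaa) -> bytes:
    """deltas[k] turns the state at lipmaa(k) into the state at k (for k = 1 it
    turns the empty file into version 1). Only O(log n) deltas get applied."""
    state = b""
    for step in path_to_root(n, lipmaa):
        state = deltas[step](state)
    return state
```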
While this is kinda nice, it's fairly unexciting if you already know about ropes. Ropes can do the same, but also handle deletions and insertions at arbitrary points, whereas the binary antimonotone graphs are append-only. But this restricted scope comes with some advantages.
First, ropes need rebalancing logic, an additional source of complexity. The inner nodes change over time, which means updating a bunch of state even though it lies in the past. To fix this, you could take the general concept of a rope but disallow all operations other than appending at the end. You could then build the rope out of perfect binary trees. But suppose we had such a rope of eight entries (forming a tree of height four). A ninth entry would have an edge to the root of that tree. When adding the tenth entry though, that edge would need to be deleted; instead, nine and ten would get a common ancestor, and that one would link to the root. This would continue as more nodes are added: adding the 16th node would require restructuring four nodes (since there would be that many complete binary trees before adding it). Overall, the complexity of the append operation is O(log(n)). Interestingly enough, this O(log(n)) append is more or less exactly what #hypercore does.
The binary antimonotone graph however has O(1) append, while still providing the same quick computations over monoid and group data as a general rope. In that sense, it seems fitting to call an (optimal) binary antimonotone graph a (ternary) append-only rope.
(Yes, I am aware that the version control example is unrealistic because compression is a thing.)
Our garden is real, and our gardeners are magicians.
I love this sequence of words
@mix A few quick remarks:
- with offchain content, you'd only need the metadata for the messages that are only fetched to verify the messages of interest, but not their content. So you won't even have the chance to get an `about` by happy accident.
- there are some good arguments for including the author with every cypherlink, I hope that we'll go there eventually.
- whether it is possible to find out more about that author without just requesting and scanning their full feed is a specific instance of a more general problem: given some semantic (as opposed to sequence-number based) criterion that specifies a subset of a feed, is it possible to request and validate that subset? We don't know yet which mechanism we'll end up with. Here is a wall of text on the topic.
Past me has spent some time on this.
First off, this can easily be generalized from tangles to arbitrary DAGs. There's no need to be restricted to a single root: simply transmit a set of nodes (let's call them sources) to collect all their transitive successors. So you send a set of sources, and then receive back everything that is reachable from them.
If you are missing N messages, you must ask the network N times. That is gonna be slow, especially if you have bad latency.
That's technically not correct: if you have a node with 20 outgoing links and you ask for those 20 messages concurrently, then you still get the results after one rtt. The total rtt isn't proportional to the number of messages, it is proportional to the maximal depth of any node in the DAG (where the depth of a node is simply the length of the shortest path from a source to that node).
The idea of tasking the peer to send a node and everything reachable from there is similar to promise pipelining.
This probably needs some sort of mechanism to specify which cypherlinks are of interest (should be traversed) and which ones should be ignored. In effect, this would end up as a database query language. Cap'n Proto comes to mind since it provides this combination of promise pipelining and filtering; another example is graphql. In addition, it might be nice to specify maximal recursion depths to prevent being flooded with far more messages than expected. If the peer signals that more data would have been available but the maximal depth has been reached, then you can simply send new requests. That way you basically interpolate between promise pipelining and request/reply.
Transmitting a set of nodes one already has can be regarded as simply another mechanism for filtering/restricting the result on the peer's side. It's good to keep in mind that this is not always an optimization, since it can actually make things worse: Suppose the peer only knows about 5 messages in the graph at all, but you just sent them 800 messages they should filter out. That exchange would have been more efficient if you hadn't sent a blacklist at all. So really this becomes a game of heuristics. Sometimes it might be a good idea to send only a subset of the nodes that you already know about.
It doesn't matter asymptotically whether we send the set losslessly or use an AMQ, so for simplicity I tend to think about this in terms of regular sets rather than bloom filters.
Rather than sending a set of nodes to ignore, one could also send a set of nodes for which you already have all the messages that are reachable from there. This actually still works with bloom filters. When the server traverses their DAG locally, they would cut off the traversal whenever a node is in the set (whereas with your blacklist they'd skip over the node but still traverse the successors). This does of course amplify the effect of false positives in an AMQ.
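A sketch of the server-side traversal with such a set, using plain Python sets instead of bloom filters:

```python
from collections import deque

def collect(sources, successors, have_all_below):
    """Return everything reachable from `sources`, cutting the traversal off at
    nodes for which the requester already has all reachable messages."""
    seen, out = set(), []
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in seen or node in have_all_below:
            continue                      # neither send it nor descend into it
        seen.add(node)
        out.append(node)
        queue.extend(successors.get(node, ()))
    return out

# toy DAG: a -> b -> c, a -> d; the requester already has b and everything below it
dag = {"a": ["b", "d"], "b": ["c"]}
assert collect(["a"], dag, have_all_below={"b"}) == ["a", "d"]
```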
Just for completeness, all of this has been discussing declarative query specifications. Conceptually it would also be possible to send some procedural code to the peer that then does the filtering locally.
@andrestaltz Yay for taking the necessary time, our scuttleverse doesn't need to rush by
I completely agree with that choice of architecture - it also nicely mirrors how all nodes agree on the static rpc definitions, but they might issue rpcs according to different implementations.
However, the ideas I sketched in those posts would involve introducing new rpcs to the system, e.g. a set of remote procedures for implementing a peer sampling service. Most papers on this define a very narrow protocol for running the pss. I think ssb would benefit from a different approach: The set of rpcs should be expressive enough to experiment with the particular mechanism to use. So there would be support for random walks, for push, pull and shuffling based view propagation, a counter how often an address has been shuffled, possibly even a free-form data section.
I have deliberately not pushed this forward yet, since the more low-level stuff is more important to me. Introducing new rpcs doesn't require consensus from the whole network, so I'll be able to play with this whenever I want. But nevertheless, if anyone wants to introduce a pss to ssb's default overlay, I'll be happy to exchange ideas and put in some work.
@dominic That's actually a very convincing point, I guess I won't need to argue. Perhaps @cft has something to add here, I think he also preferred not including the author (and unlike me he knows his crypto).
Btw 100% agreed on an optimizing transport format not caring either way.
Oh, hey @pepyakin, you were at the p2p meetup, right? Did you figure out the relation between trust, determinism and side-effects yet? That was fun to explore. Would the concept of trust make sense in a nondeterministic universe? Shouldn't the definition be somehow tied to a computation model with multiple processes? Is trustlessness the opposite of trust? Do side-effects even have a meaningful "opposite"?
The first array defines the graph (with a leading null for convenience, such that the n-th vertex has index n). `vertebra` can be ignored, I just needed that value to compute the certificate pools (it's the next-largest connection to the spine). The second array contains the certificate pools, split into two segments: `"upper"` is the shortest path from the vertebra to the node, `"lower"` is the shortest path from the node to 1. Again, the leading null is in there so that the indices match.
I looked into making such an interactive svg diagram, but didn't really find an easy starting point. But if you could build an initial diagram, I'd be more than happy to add some stuff to it.
Some neat non-obvious things we could do with the diagram:
- draw those colored boxes (there's one box per node, with its top-right corner at that node, the height of that node's x coordinate, and the width of three to the power of that node's x coordinate)
- clicking on a node locks in the certificate pool, hovering over other nodes then shows both pools and highlights their intersection.
There's no hurry to get this done, just wait until you feel like coding up some neat svgs =)
@mix Here's a data set to build nice interactive svg diagrams with:
[null,{"n":1,"lipmaa":0,"vertebra":1,"x":1,"y":1},{"n":2,"lipmaa":1,"vertebra":4,"x":2,"y":1},{"n":3,"lipmaa":2,"vertebra":4,"x":3,"y":1},{"n":4,"lipmaa":1,"vertebra":4,"x":3,"y":2},{"n":5,"lipmaa":4,"vertebra":13,"x":4,"y":1},{"n":6,"lipmaa":5,"vertebra":13,"x":5,"y":1},{"n":7,"lipmaa":6,"vertebra":13,"x":6,"y":1},{"n":8,"lipmaa":4,"vertebra":13,"x":6,"y":2},{"n":9,"lipmaa":8,"vertebra":13,"x":7,"y":1},{"n":10,"lipmaa":9,"vertebra":13,"x":8,"y":1},{"n":11,"lipmaa":10,"vertebra":13,"x":9,"y":1},{"n":12,"lipmaa":8,"vertebra":13,"x":9,"y":2},{"n":13,"lipmaa":4,"vertebra":13,"x":9,"y":3},{"n":14,"lipmaa":13,"vertebra":40,"x":10,"y":1},{"n":15,"lipmaa":14,"vertebra":40,"x":11,"y":1},{"n":16,"lipmaa":15,"vertebra":40,"x":12,"y":1},{"n":17,"lipmaa":13,"vertebra":40,"x":12,"y":2},{"n":18,"lipmaa":17,"vertebra":40,"x":13,"y":1},{"n":19,"lipmaa":18,"vertebra":40,"x":14,"y":1},{"n":20,"lipmaa":19,"vertebra":40,"x":15,"y":1},{"n":21,"lipmaa":17,"vertebra":40,"x":15,"y":2},{"n":22,"lipmaa":21,"vertebra":40,"x":16,"y":1},{"n":23,"lipmaa":22,"vertebra":40,"x":17,"y":1},{"n":24,"lipmaa":23,"vertebra":40,"x":18,"y":1},{"n":25,"lipmaa":21,"vertebra":40,"x":18,"y":2},{"n":26,"lipmaa":13,"vertebra":40,"x":18,"y":3},{"n":27,"lipmaa":26,"vertebra":40,"x":19,"y":1},{"n":28,"lipmaa":27,"vertebra":40,"x":20,"y":1},{"n":29,"lipmaa":28,"vertebra":40,"x":21,"y":1},{"n":30,"lipmaa":26,"vertebra":40,"x":21,"y":2},{"n":31,"lipmaa":30,"vertebra":40,"x":22,"y":1},{"n":32,"lipmaa":31,"vertebra":40,"x":23,"y":1},{"n":33,"lipmaa":32,"vertebra":40,"x":24,"y":1},{"n":34,"lipmaa":30,"vertebra":40,"x":24,"y":2},{"n":35,"lipmaa":34,"vertebra":40,"x":25,"y":1},{"n":36,"lipmaa":35,"vertebra":40,"x":26,"y":1},{"n":37,"lipmaa":36,"vertebra":40,"x":27,"y":1},{"n":38,"lipmaa":34,"vertebra":40,"x":27,"y":2},{"n":39,"lipmaa":26,"vertebra":40,"x":27,"y":3},{"n":40,"lipmaa":13,"vertebra":40,"x":27,"y":4}]
[null,{"n":1,"upper":[1],"lower":[1]},{"n":2,"upper":[4,3,2],"lower":[2,1]},{"n":3,"upper":[4,3],"lower":[3,2,1]},{"n":4,"upper":[4],"lower":[4,1]},{"n":5,"upper":[13,12,8,7,6,5],"lower":[5,4,1]},{"n":6,"upper":[13,12,8,7,6],"lower":[6,5,4,1]},{"n":7,"upper":[13,12,8,7],"lower":[7,6,5,4,1]},{"n":8,"upper":[13,12,8],"lower":[8,4,1]},{"n":9,"upper":[13,12,11,10,9],"lower":[9,8,4,1]},{"n":10,"upper":[13,12,11,10],"lower":[10,9,8,4,1]},{"n":11,"upper":[13,12,11],"lower":[11,10,9,8,4,1]},{"n":12,"upper":[13,12],"lower":[12,8,4,1]},{"n":13,"upper":[13],"lower":[13,4,1]},{"n":14,"upper":[40,39,26,25,21,17,16,15,14],"lower":[14,13,4,1]},{"n":15,"upper":[40,39,26,25,21,17,16,15],"lower":[15,14,13,4,1]},{"n":16,"upper":[40,39,26,25,21,17,16],"lower":[16,15,14,13,4,1]},{"n":17,"upper":[40,39,26,25,21,17],"lower":[17,13,4,1]},{"n":18,"upper":[40,39,26,25,21,20,19,18],"lower":[18,17,13,4,1]},{"n":19,"upper":[40,39,26,25,21,20,19],"lower":[19,18,17,13,4,1]},{"n":20,"upper":[40,39,26,25,21,20],"lower":[20,19,18,17,13,4,1]},{"n":21,"upper":[40,39,26,25,21],"lower":[21,17,13,4,1]},{"n":22,"upper":[40,39,26,25,24,23,22],"lower":[22,21,17,13,4,1]},{"n":23,"upper":[40,39,26,25,24,23],"lower":[23,22,21,17,13,4,1]},{"n":24,"upper":[40,39,26,25,24],"lower":[24,23,22,21,17,13,4,1]},{"n":25,"upper":[40,39,26,25],"lower":[25,21,17,13,4,1]},{"n":26,"upper":[40,39,26],"lower":[26,13,4,1]},{"n":27,"upper":[40,39,38,34,30,29,28,27],"lower":[27,26,13,4,1]},{"n":28,"upper":[40,39,38,34,30,29,28],"lower":[28,27,26,13,4,1]},{"n":29,"upper":[40,39,38,34,30,29],"lower":[29,28,27,26,13,4,1]},{"n":30,"upper":[40,39,38,34,30],"lower":[30,26,13,4,1]},{"n":31,"upper":[40,39,38,34,33,32,31],"lower":[31,30,26,13,4,1]},{"n":32,"upper":[40,39,38,34,33,32],"lower":[32,31,30,26,13,4,1]},{"n":33,"upper":[40,39,38,34,33],"lower":[33,32,31,30,26,13,4,1]},{"n":34,"upper":[40,39,38,34],"lower":[34,30,26,13,4,1]},{"n":35,"upper":[40,39,38,37,36,35],"lower":[35,34,30,26,13,4,1]},{"n":36,"upper":[40,39,38,37,36],"lower":[36,35,34,30,26,13,4,1]},{"n":37,"upper":[40,39,38,37],"lower":[37,36,35,34,30,26,13,4,1]},{"n":38,"upper":[40,39,38],"lower":[38,34,30,26,13,4,1]},{"n":39,"upper":[40,39],"lower":[39,26,13,4,1]},{"n":40,"upper":[40],"lower":[40,13,4,1]}]
{ "type": "about", "about": "@TXKFQehlyoSn8UJAIVP/k2BjFINC591MlBC2e2d24mA=.ed25519", "image": { "link": "&jPj78+2mgdpWVLMkMnxryVaFuM5eoqN6cgppIUchgFw=.sha256" } }
I compiled a list of questions that any upcoming protocol design work will answer. The goal is to make those answers explicit, not implicit, and to have a guiding framework for the design processes. It is necessarily incomplete and biased, but hopefully still better than no list at all.
@dominic, @cft, @keks, @Anders
SSB Redesign Notes
- should specify encodings last, move on a purely logical layer first
- extensibility
- versions, status fields (trimming, deletion, etc.)
- migration path (or hard fork?)
- if hard fork, porting old content
- planning (replication) rpcs
- what can malicious actors do, what is our story why it is ok?
- specify (sensible) maximum sizes for all the things
General: Take a layered approach:
- specify verification metadata (e.g. bamboo)
- specify content-level metadata (type, timestamp, etc)
- specify ssb's self-describing freeform data format
Should the protocol(s) be aware of (and if so on which layer):
- same-as?
- soft deletion requests?
- "banned" content?
- private messages?
- private groups?
VerificationMeta
- lipmaalinks: 2 or 3 based?
- soft deletion?
- how many pieces of offchain-content: 1, 0 or 1, 1 or more, many, 2 (content-meta and content)?
- how many pieces of onchain-content?
- privacy: everything in the verification meta is globally visible
- sign hash of metadata instead of the metadata itself (solves the above privacy problem)?
- end-of-feed indicator: just a flag or carrying content (and if content, what kind of content)?
- support free-form outer meta?
- tangle-building: here or in content(meta)?
- fork recovery?
- do we need a sequence number?
- lipmaalink must be a hash, not a seqnum?
birch signatureInfo?
cypherlinks: exactly one or multiple primitives per link?
- also relevant for VerificationMeta
content-meta not always available - potential advantage for moving things (additional seqnums in particular) to the outer meta?
@cft with birch:
An interesting discussion/decision is whether off-chain data may refer to other logs, or MUST be in the same log as the event itself i.e., in-log attachments. I clearly prefer the latter, because it's about the content of that event that we want to process, and it would be cumbersome if we have to postpone this processing until we got that other log's data. Note that inside the content field there is full freedom to refer to external blobs or log entries.
- content: what does ssb specify?
content-meta
- free-form or predetermined?
- timestamp?
- type?
- subfeed indicator?
- additional seqnums?
where are max sizes specified, on which layer?
soft-deletion on an inner layer?
what is a cypherlink?
- hash vs author+seqnum+hash vs author+seqnum+hashes?
- blobs vs contents
FreeformData (self-describing)
- human-readable?
- extensible?
which of the following data types do we want?
- null
- bool
- int
- fixed-width(s)? (un)signed?
- float
- fixed-width(s)? infinities, nan, negative zero?
- rational
- fixed-width(s)?
- utf-8 string
- char (unicode scalar values)
- byte string
- array
- map
- arbitrary keys or string keys?
- set
- cypherlink
- public key
Partial subscriptions not based on sequence numbers
This will be crucial to get right (or design in a way such that not getting it right is also ok). Some options:
- none at all
- by type
- hierarchical subfeeds
- free-form, relay (server) implementations can choose what to support
What is a good way to fill up “remote” append-only logs on my machine (replication)?
You can use the same conceptual replication protocol as ssb does: Each peer maintains a list of public keys it is interested in. When two peers connect to each other, they each send a map containing those public keys they are interested in, associated with the newest message (i.e. largest sequence number) of that log that they have locally. Upon receiving the other peer's map, you compute the intersection of those key sets. For all keys that you both care about, check whether the number you received was less than the number you sent. If so, start streaming the new messages to the peer, since you have them but they don't. This simple replication protocol is the main reason for choosing the append-only log architecture.
This approach circumvents the problem of receiving stuff in the wrong order or with gaps in the log. Just discard all messages that are not the direct successor of your current tip of the log. When interacting with a conforming peer, this never happens.
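A minimal sketch of that exchange, assuming made-up types (PublicKey, SeqNum) and leaving out all the actual I/O; this is an illustration of the idea above, not how any real ssb implementation is structured:

use std::collections::HashMap;

// Hypothetical types for illustration only.
type PublicKey = [u8; 32];
type SeqNum = u64;

struct Peer {
    // Newest locally stored sequence number for each feed we are interested in.
    frontier: HashMap<PublicKey, SeqNum>,
}

impl Peer {
    // The map we send to the remote peer when the connection is established.
    fn announce(&self) -> HashMap<PublicKey, SeqNum> {
        self.frontier.clone()
    }

    // Given the remote peer's map, compute which feeds we should start
    // streaming to them: feeds both of us care about where our copy is newer.
    fn to_send(&self, remote: &HashMap<PublicKey, SeqNum>) -> Vec<(PublicKey, SeqNum)> {
        self.frontier
            .iter()
            .filter_map(|(key, &ours)| {
                remote.get(key).and_then(|&theirs| {
                    if theirs < ours {
                        // Stream messages theirs + 1 ..= ours to the peer.
                        Some((*key, theirs + 1))
                    } else {
                        None
                    }
                })
            })
            .collect()
    }
}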
How do I get the public key of a peer for data verification?
They can just tell you. Take the above replication protocol but also send your own public key in addition to the map.
The handshakes are used by the "real" systems for point-to-point encryption and authentication. For a toy example, that's not strictly necessary. If you do want the encryption, you could use the rust implementation of secret handshake.
If you want to make sure that a peer really has the corresponding private key to their claimed public key, send them a challenge (a sufficiently large number of random bytes) and ask them to send back the valid signature. Note that you should definitely not use such a scheme outside a toy context, since unlike secret-handshake this scheme has really bad metadata privacy.
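The shape of that toy challenge-response is roughly the following; the SignatureScheme trait is a placeholder for whatever ed25519 library you end up using, not a real API, and as said above this is for toy use only:

// Placeholder for a real signature library; not an actual API.
trait SignatureScheme {
    fn sign(&self, secret_key: &[u8], data: &[u8]) -> Vec<u8>;
    fn verify(&self, public_key: &[u8], data: &[u8], signature: &[u8]) -> bool;
}

// Verifier side: after sending `challenge` (a sufficiently large number of
// random bytes), accept the peer only if the signature they return verifies
// against their claimed public key.
fn peer_holds_key<S: SignatureScheme>(
    scheme: &S,
    claimed_public_key: &[u8],
    challenge: &[u8],
    response_signature: &[u8],
) -> bool {
    scheme.verify(claimed_public_key, challenge, response_signature)
}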
what would be a good way to “merge” all given append-only logs
The main question (beyond implementation issues) is how you want to determine the ordering between messages from different logs. A couple of choices:
- Have each log entry include a claimed timestamp, sort by those.
- Sort by receive-time.
- When publishing a log entry, include the hashes of the tips of your local replicas of other logs, do a topological sort based on these hashes.
(And then there are some pathological choices like "sort by author" or "sort by a hash of the message" that don't really make sense for a chat ui...)
The topological sort based mechanism is the most complicated one (you need to determine which hashes to include, and maintaining the topological order can be annoying), but also gives the best results. Claimed timestamp is fine if you trust the authors to be non-malicious and have reasonably accurate system clocks. Quick anecdote, my machine's system clock lives ca 30 seconds in the future, which tends to mess up ssb-chess chats (those currently sort by claimed timestamp). Sorting by receive time is easy to implement, but can lead to inconsistent message orderings across multiple machines.
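For the simpler end of that spectrum, here is a sketch of merging by claimed timestamp with a deterministic tie-break, so that two machines holding the same logs at least agree on the order (the Message struct is invented for the example):

// Invented message type for illustration.
#[derive(Clone)]
struct Message {
    claimed_timestamp: u64,
    hash: [u8; 32],
    text: String,
}

// Merge several logs into one display order: claimed timestamp first,
// message hash as a deterministic tie-breaker.
fn merge_logs(logs: Vec<Vec<Message>>) -> Vec<Message> {
    let mut all: Vec<Message> = logs.into_iter().flatten().collect();
    all.sort_by(|a, b| {
        a.claimed_timestamp
            .cmp(&b.claimed_timestamp)
            .then_with(|| a.hash.cmp(&b.hash))
    });
    all
}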
Entering (even more) opinionated territory:
Is a RPC protocol recommendable?
Nah. For a toy, just have a byte indicating the message type, followed by the data. Probably not worth pulling in the additional complexity of a full-blown rpc protocol.
What is a good way to implement a simple terminal UI
termion is nice if you don't need a fully-fledged UI framework. Has nonblocking input, but you might need to bridge it to the futures ecosystem somehow.
CC @andreas and @Zach who would probably be sad if they missed this post
But then I figured that f as a whole can’t work, really. For example, when n is 8, the term ((3^g(n)) - 1) / 2 in the fourth line of f would have to be 3 but there is no value of g() that can make this happen.
It's the drawing that was buggy, not the function. See this post for the correct drawing.
Sorry that this error wasted your time too.
The tripling-based scheme needs base-3 logarithms, which are pretty painful. Your best bet will usually be to convert to a float (64 bit floats can represent integers up to 2^53 precisely, so that should be fine most of the time), use the built-in function for computing arbitrary logarithms, then floor to an integer.
Actually it isn't that bad: floor(log_3(n)/2)
only has 41 different results for n between 1 and 2^64, so you can simply do a case distinction:
fn log_3_half(n: u64) -> usize {
    if n == 1 {
        1
    } else if n <= 4 {
        2
    } else if n <= 13 {
        3
    } else if n <= 40 {
        ...
If you can't trust your language implementation to convert this into a binary search, you can do so by hand. You could even do a not-quite-binary search that is biased towards smaller numbers (since most feeds are small and this is invoked in a recursive function with decreasing values).
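Something like the following would do (my own sketch, not a reference implementation): precompute the 41 thresholds of the form (3^k - 1) / 2 once and binary-search them. The bias towards small n mentioned above could be added by probing the first few thresholds linearly before falling back to the search.

fn log_3_half(n: u64) -> usize {
    // thresholds[i] == (3^(i + 1) - 1) / 2, i.e. 1, 4, 13, 40, ...
    // In real code you would compute this table once, not on every call.
    let mut thresholds = [0u64; 41];
    let mut t: u64 = 1;
    for slot in thresholds.iter_mut() {
        *slot = t;
        t = t.saturating_mul(3).saturating_add(1);
    }

    // Index of the smallest threshold >= n, whether or not n hits it exactly.
    // (The handful of n above the largest threshold fall through to 42;
    // adjust if you care about that corner.)
    match thresholds.binary_search(&n) {
        Ok(i) | Err(i) => i + 1,
    }
}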
That mostly eliminates my concerns regarding the tripling construction, I think it is the way to go.
@keks The individual elements are the signed metadata (including the signature). So with the current bamboo spec and assuming 64 bytes per signature and hash and 9 bytes per seqnum, that would be 1 + 33 + 9 + 33 + 33 + 65 = 174 bytes. We could get this down to 33 + 65 == 98 bytes by signing a hash of the metadata instead of the raw metadata. In that setting, we'd send hashes of the metadata instead of the actual metadata when transferring metadata that is only needed for validation but not for its content. Yet another possible optimization that increases the overall complexity.
57 * 174 == 9918, so the overhead for fetching a message around seqnum 1,000,000 would be that of fetching two messages of the current maximal message size. I wouldn't call that "huge".
Also keep in mind that certificates overlap, so when fetching multiple entries from the same feed, the overall transferred metadata grows sublinearly.
to be fair we’re still a long way away from feeds that large
Imagine a sensor that writes a message every ten seconds. After four months, that's 1,000,000 messages. These orders of magnitude aren't that far-fetched.
Now what does this mean for the decision on which construction to use? The most obvious argument is the certificate size, and that points to the tripling-based construction. There is however one important aspect that favors the doubling-based one: Computing the lipmaalink requires computing floored logarithms.
The tripling-based scheme needs base-3 logarithms, which are pretty painful. Your best bet will usually be to convert to a float (64 bit floats can represent integers up to 2^53 precisely, so that should be fine most of the time), use the built-in function for computing arbitrary logarithms, then floor to an integer.
The doubling-based scheme however only needs floored base-2 logarithms, which is the same as finding the index of the first nonzero bit. Hardware is really good at that. @cft, could this be significant from a layer2/3 perspective?
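For completeness, the doubling-based computation really is a one-liner in most languages; in Rust, for example:

// floor(log_2(n)) is just the position of the highest set bit.
fn floor_log2(n: u64) -> u32 {
    debug_assert!(n > 0);
    63 - n.leading_zeros()
}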
This post is probably difficult to follow. The important part is at the bottom: A table that compares the (approximate) number of metadata entries you need to verify a message of a sequence number n in the worst case. You can ignore my rambling and still get some insight from that table.
About those certificate sizes: The largest certificate sizes occur in the bottom row of those graph drawings. A certificate path "travels" through a box of each layer (note that conceptually all of the nodes in the bottom row sit in a box that contains exactly one node - I omitted those from the drawings). For the graph that duplicates subgraphs at each step, that gives 2 * floor(log_2(n)): floor(log_2(n)) is the depth of the boxes, 2 is the number of nodes it takes to traverse a box. In addition to that, the certificate includes the "spine" of the graph (the "diagonal" from the bottom-left to the top-right) for an additional floor(log_2(n)) nodes. So in total, that yields (1 + 2) * floor(log_2(n)).
With the tripling-based graph, it takes three nodes to traverse a box, but we get the more efficient base-3 logarithm instead, for a total of (1 + 3) * floor(log_3(n)). This is not correct yet, the size can be slightly larger: Because of the division by 2 in the formula, the "height" of the graph grows slightly faster. To fix this, we multiply n by 2 before taking the logarithm: 4 * floor(log_3(2 * n)). Take this formula with a grain of salt (did I mention I'm bad at math yet?), but I think this is correct...
For sufficiently large n, the tripling-based formula is indeed more efficient. "Sufficiently large" happens to be 128. For feeds of 128 or more messages, the certificates obtained by the tripling-based construction have smaller worst-case size than those obtained from the simpler, doubling-based construction.
Quick sanity check: If we quadruple at each step, we get 5 * floor(log_4(3 * n)), which grows faster than the tripling-based one, as expected. (Disclaimer: the 3 in that formula might be incorrect, but asymptotically that doesn't matter.)
Here's a table of some worst-case certificate sizes. The actual certificates for those numbers might be smaller - not every number is a worst case. But since half of the numbers have worst-case length for the simple scheme, and two thirds of the numbers have worst-case length for the tripling-based scheme, this should be accurate enough. For example 127 actually has a very small certificate for the simple scheme, since it is on the "spine". But there are "nearby" numbers that do hit the worst case (121 in this case).
n | 3 * floor(log_2(n)) | 4 * floor(log_3(2 * n)) |
---|---|---|
2 | 3 | 4 |
4 | 6 | 4 |
7 | 6 | 8 |
11 | 9 | 8 |
29 | 12 | 12 |
53 | 15 | 16 |
127 | 18 | 20 |
128 | 21 | 20 |
1000 | 27 | 24 |
10,000 | 39 | 36 |
100,000 | 48 | 44 |
1,000,000 | 57 | 52 |
10,000,000 | 69 | 60 |
100,000,000 | 78 | 68 |
1,000,000,000 | 87 | 76 |
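For anyone who wants to double-check the table, the two columns are straightforward to recompute; this sketch evaluates exactly the two formulas above, nothing more:

fn main() {
    let ns: [u64; 15] = [
        2, 4, 7, 11, 29, 53, 127, 128, 1_000, 10_000, 100_000,
        1_000_000, 10_000_000, 100_000_000, 1_000_000_000,
    ];
    for n in ns {
        // 64 bit floats represent these integers exactly, so the floored
        // logarithms come out right for this range.
        let doubling = 3 * ((n as f64).log2().floor() as u64);
        let tripling = 4 * (((2 * n) as f64).log(3.0).floor() as u64);
        println!("{} | {} | {}", n, doubling, tripling);
    }
}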
@mix Yup, that's about right.
Two minor nitpicks: Conceptually, we don't have this wrapping object with the key, there's only the value object. Also, bamboo itself uses some different metadata than ssb (no author, content is just a hash, size of the content (actual, not the hash) is included in the metadata). But as for projecting the graph structure onto current scuttlebutt, your depiction is correct.
@krl Nope, it's all the messages in the bottom row, i.e. 1 to 3, 5 to 7, 9 to 11, 14 to 16, etc.
Those could actually omit the lipmaalink as an optimization, but they can't link to different nodes to improve the scheme, since that would always violate antimonotonicity.
I didn't draw the graphs correctly... Here are the (hopefully correct) versions:
Perhaps the certificate size will start making more sense when I'm not looking at the wrong graph -_-
If the content is a post, then one can use attachments for shipping the images, or PDFs, i.e. one can refer to attachments from within the content. And voila: No need for a blob discovery protocol as today (which REALLY scares me a LOT).
I'm not convinced that this additional complexity is really necessary. I can just as well imagine a system where a blob request may contain information about who might likely have that blob - i.e. the id of the message(s) that referenced it, and the ids of the author(s) that referenced it. This is even more flexible than your proposal (if we saw the blob reference multiple times, we can supply even more information), but keeps the complexity out of the log format.
I'm giving up on figuring out which graph construction is more efficient, I'm simply too bad at math that involves numbers =( It would be super cool if anybody who knows their logarithms could take a look at this. I'd be happy to help clarify stuff.
To evaluate this graph (which is constructed from repeatedly tripling the previous subgraph and adding a single additional node) compared to the simpler graph that is constructed from repeatedly doubling the previous graph, we'd want to look at:
- worst-case certificate pool size for n
- average certificate pool size for all n in some subset of the natural numbers
For the graph based on doubling, the worst-case size is 3 * floor(log_2(n)), but I'm unable to come up with (non-recursive) formulas for the tripling-based one... Still feels like the latter might be more efficient (for some reason it seems to matter that 3 is closer to e than 2 is, compare radix economy...).
Honestly I'm reaching a point where I think the simplicity of the doubling-based construction would outweigh efficiency gains of the tripling-based one. But that might be frustration speaking...
@beroal Yes, it is possible to define a more-or-less arbitrary total order. But there might be situations where "arbitrary" isn't good enough. Also there might be situations where explicit ordering based on user input rather than automatic ordering is desired (think manual merge conflict resolution in git as opposed to automatic conflict resolution in a real-time collaborative text editor).
A different challenge is that a single feed is gap-less: If you have messages of consecutive sequence numbers, you know that nothing will have happened in-between. New messages are always appended to the head of the log. With same-as, we don't have that property: Messages might have to be retroactively inserted into the log. This breaks a bunch of assumptions some clients make, in particular flumeview-reduce runs into trouble.
Ok, I'm starting to get a better understanding of the optimality mismatch. The main difference between bamboo and the work by the Estonian mathematicians is that bamboo allows the certificate pool to include messages that may not exist yet. The Estonians solve the problem in a different way, by batching entries into rounds. Each round is a graph of |V| vertices, and their certificates contain the path from n to 1 and from |V| to n. This differs from ours: our certificates contain the path from n to 1 and from the next-largest spine node to n. This next-largest spine node places a bound on the maximum certificate size that depends solely on n, not on the size of the feed (compare the remark here that the certificate pool won't grow even if the feed exceeds 40 messages).
The 3-based graph is optimal for this setting where |V| might be much larger than the connection point to the spine. But since we have this bound, the 2-based graph should be better for ssb.
I'd be grateful if anyone could verify this, but that'll probably involve skimming Lipmaa's thesis (it might be sufficient to start at section 6.1 and then backtrack for the required definitions).
@mix But I want to properly look at certificate sizes first, to check whether we are actually discussing the correct graph right now. I really hope that I simply got it wrong and the base-2 graph actually has smaller certificate pools.
Is this a set of packets where only new packets (to be appended to the log) will be added to the set? Or could it be that the producer suddenly would have to go back in its log and swap some of the packets in the set with old packets from the log? That would rule out SSB for IoT devices which could only keep a limited window of their log in memory.
The sets should be stable enough. Let's call the nodes 1, 4, 13, 40, 121, ... the spine of the graph. The certificate pool of msg n basically connects n to the spine (path from n to the next smaller node in the spine) and travels down the spine to the origin - this latter part is stable enough for the iot usecase. Finally, the certpool contains the path from the next larger node on the spine to n. I can't tell you whether the connections to the spine are stable enough without knowing more about the requirements.
hey @Aljoscha if you can write got for the certificate pool
got?
then I’d be happy to make a nice little interactive SVG where people could hover over different numbers and see the nodes and edges followed
That would be great. I started a similar thing at arthack that included computing the graph and how to lay it out. But just doing it for the fixed set of 40 nodes would be sufficient and far simpler.
The number of old packets is bounded by (6/(log_2(3))) * log_2(n) (I think, the paper is pretty unreadable…)
I guess I shouldn't put off looking at this any further. I'm getting the feeling that the optimality in the paper is not the one we care about. What we want are certificate pools of minimal size. And intuitively it looks to me like the graph that doubles subgraphs at each construction step is actually more efficient in that regard than the bamboo one (which triples subgraphs).
To be fair, both the claim regarding path inclusion and the claim that this suffices for our guarantees need proofs.
Still too lazy to write it out, but as for "In an antimonotone graph, any path between u and v includes the shortest path between u and v.", you can do a proof by contradiction fairly easily. Assume you found a path that doesn't include the shortest one, then that path is "skipping" over one of the nodes in the shortest path, which contradicts antimonotonicity.
Simplified the definition of the certificate pool and added a graphic:
For some entry x the peer is interested in, the certificate pool of x is the (logarithmically sized) set of further entries the peer needs to store. It is defined as the union of the shortest link paths from:
- x to 1
- z to x, where z is the smallest natural number greater than or equal to x such that there exists a natural number l with z == (((3^l) - 1) / 2)
The following graphic shows the certificate pool for entry 23. The path from 23 to 1 is marked in blue, the path from 40 to 23 in orange. Note that even if the log is larger than 40 messages, the certificate pool does not grow.
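A small sketch of how one might compute these sets, assuming a lipmaa(n) function as in the readme excerpt quoted further down this thread. The greedy rule (take the lipmaalink whenever it does not overshoot the target, otherwise fall back to the backlink) reproduces the path tables posted elsewhere in this thread, but treat this as my illustration rather than a normative definition:

// Shortest link path from `from` down to `to` (from >= to >= 1): greedily take
// the lipmaalink when it does not jump past the target, otherwise the backlink.
fn shortest_path(from: u64, to: u64) -> Vec<u64> {
    let mut path = vec![from];
    let mut current = from;
    while current > to {
        let jump = lipmaa(current);
        current = if jump >= to { jump } else { current - 1 };
        path.push(current);
    }
    path
}

// Certificate pool of x: union of the shortest paths from x to 1 and from z
// to x, where z is the smallest number >= x of the form (3^l - 1) / 2.
fn certificate_pool(x: u64) -> Vec<u64> {
    let mut z = 1;
    while z < x {
        z = z * 3 + 1; // next number of the form (3^l - 1) / 2
    }
    let mut pool = shortest_path(x, 1);
    pool.extend(shortest_path(z, x));
    pool.sort_unstable();
    pool.dedup();
    pool
}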
For inserting the first n numbers in order
should have been "For inserting the first n numbers in random order".
I'll make some certificate pool visualizations later, answering first.
Do I get it right that I need to download the following packets, in order to validate that claim?
32, 31, 30, 27, 26, 17, 14, 13, 4, 1
yup
When 38 comes, I need 37 and 35. When 39 comes, I need 38 and 30. When 40 comes, I need 39 and 13.
38 doesn't need 37, only 35, 34, etc. 39 doesn't need 38, only 30, 27, etc. 40 doesn't need 39, only 13, 4, etc.
This is quite a bit of old packets to keep around.
The number of old packets is bounded by (6/(log_2(3))) * log_2(n) (I think, the paper is pretty unreadable...), where n is the sequence number. For small n this looks like a lot; the difference becomes more drastic for larger n. Also consider that since we are going for offchain-content, we only need to store the metadata of those older messages, which is quite small.
You call this the certificate pool in the Bamboo Git repo, right?
Not quite, the certificate pool also includes the shortest path from n to the next larger (3^k - 1) / 2. That is not necessary for immediate validation, but it guarantees that the path from any future entry to 1 includes an entry for which there is a path to n.
[...] meaning that we have to do the cert pool exercise every time a new entry is added to the log, right?
Not quite sure what you mean by the "cert pool exercise". What we need to do is to check the shortest path from the new entry to an entry that has already been verified. For inserting the first n numbers in order, this takes O(n) time overall. It's amortized O(1), O(log2(n)) worst case per insertion.
Unlike how SSB works today, with Lipmaa it is not sufficient anymore to check the chaining to the currently oldest entry I have stored, but I must now check deep and many times into the past.
We only need to verify some path to the beginning, not necessarily the shortest path. We still catch all potential conflicts, since with these antimonotone graphs, any path between two vertices u, v contains the shortest path between u and v. So we always check the same information (plus sometimes a bit more).
To be fair, both the claim regarding path inclusion and the claim that this suffices for our guarantees need proofs, all I can offer right now are gut feeling and a few unsuccessful attempts at finding a counterexample.
Flying Microtonal Banana by King Gizzard and the Lizard Wizard (!)
(!)
And while I'm at it, have an svg of the typeset formula:
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\thispagestyle{empty}
\begin{align*}
f(n) &= \begin{cases}
n - 3^{k - 1} &\mbox{if } n = \frac{3^k - 1}{2} \\
n - \frac{3^{g(n)} - 1}{2} & \mbox{otherwise}\\
\end{cases}\\
g(n) &= \begin{cases}
k &\mbox{if } n = \frac{3^k - 1}{2} \\
g(n - \frac{3^{(k - 1)} - 1}{2}) & \mbox{if } \frac{3^{(k - 1)} - 1}{2} \leq n \leq \frac{3^k - 1}{2}
\end{cases}\\
\end{align*}
\end{document}
Coming out of a call with @mix and @piet where we talked about the intuition behind the graph that the lipmaalinks of bamboo form, I added a drawing and some ramblings on ternary encodings to the readme. CC @Anders @andreas @Powersource @dominic
The relevant excerpt:
Links and Entry Verification
The lipmaalinks are chosen such that for any pair of entries there is a path from the newer to the older one of a length logarithmic in their distance. Here is a graphical representation of the lipmaalinks in a log of 40 entries, the colored boxes indicating its recursive structure:
The lipmaalink target of the entry of sequence number n is computed through the function f, defined below:
f(n) := if (n == (((3^k) - 1) / 2) for some natural number k) then {
return n - (3^(k - 1));
} else {
return n - (((3^g(n)) - 1) / 2);
}
g(n) := if (n == (((3^k) - 1) / 2) for some natural number k) then {
return k;
} else {
let k := the natural number k such that (((3^(k - 1)) - 1) / 2) <= n <= (((3^(k)) - 1) / 2);
return g(n - (((3^(k - 1)) - 1) / 2));
}
Sorry for the math, but on the plus side, it works! This computes the edges according to the scheme presented in Buldas, A., & Laud, P. (1998, December). New linking schemes for digital time-stamping. For a (slightly) more enjoyable overview of the theory behind this, I'd recommend Helger Lipmaa's thesis.
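As a cross-check, here is a direct (and deliberately naive) Rust transcription of f and g; it is my own sketch, not the bamboo reference implementation, but it matches the lipmaalink values in the JSON table posted elsewhere in this thread (lipmaa(4) == 1, lipmaa(13) == 4, lipmaa(40) == 13, and so on):

// Returns Some(k) if n == (3^k - 1) / 2 for some natural number k.
fn spine_index(n: u64) -> Option<u32> {
    let mut k = 1;
    let mut spine = 1u64; // (3^1 - 1) / 2
    while spine < n {
        k += 1;
        spine = spine * 3 + 1; // 3 * ((3^k - 1) / 2) + 1 == (3^(k + 1) - 1) / 2
    }
    if spine == n { Some(k) } else { None }
}

fn g(n: u64) -> u32 {
    match spine_index(n) {
        Some(k) => k,
        None => {
            // Find k such that (3^(k - 1) - 1) / 2 <= n <= (3^k - 1) / 2.
            let mut upper = 1u64;
            while upper < n {
                upper = upper * 3 + 1;
            }
            let lower = (upper - 1) / 3; // (3^(k - 1) - 1) / 2
            g(n - lower)
        }
    }
}

// Lipmaalink target of the entry with sequence number n.
fn lipmaa(n: u64) -> u64 {
    match spine_index(n) {
        Some(k) => n - 3u64.pow(k - 1),
        None => n - (3u64.pow(g(n)) - 1) / 2,
    }
}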
An alternate way of gaining some intuition about the lipmaalinks is to think of the sequence numbers in ternary (base 3). In ternary, 3^k is represented as a 1 digit followed by k zero digits. (3^k) - 1 is thus k 2 digits, and ((3^k) - 1) / 2 is k 1 digits. So if the sequence number consists solely of 1 digits in ternary, we enter the first branch of the formulas and k is the number of digits. Otherwise (i.e. if the sequence number is not half of the predecessor of a power of three), k is the number of ternary digits of the next smaller number whose ternary representation consists solely of 1 digits.
@beroal Yes, that's correct. But it is possible to declare your regular identity as a pub (if you have a stable, static ip address). I don't know how to do that, so pinging @kas and @cel who did this iirc.
@beroal On the bottom left of the text area, there's a small attachment icon that lets you select a file to attach to the post.
I hope there are more regular meet-ups in Berlin? Unfortunately I couldn’t come last week.
Not sure about regular meetups, but there are enough crabs in Berlin by now that we can always organize a one-off hangout. Given your "history in more artsy domains", you'd probably get along well with @andreas and @cafca.
I’d also think it would be amazing to install a pub at places like the campus of TU Berlin. How realistic is that?
As somebody typing this response from a room and network at the TU, I'd very much like the idea of a pub running here. And it'd be great to get more TU people involved here - both because of activist streaks, and with ssb slowly entering academic territory. This wouldn't even be the first academic pub, there's one at the uni basel already.
@cryptix Would it be a good idea to try setting up a go pub? I'm kinda scared of setting up a containerized, memory-leaking, regularly crashing js pub...
I won't have time for detailed feedback for a few days, but here are some notes during the initial read-through (which might be a bit harsh, I'm sorry but I currently don't have the time for well-thought-out wording):
This document deals with the transport format.
Are you sure you don't mean the signing format? Two peers can transport data however they want, as long as they agree amongst themselves.
I'm a big fan of version fields on protocols, which is why I added it. Bamboo uses yamf-hash for some future proofability without changing the format, but I would rather keep the format simple and do changes as version numbers.
Bamboo has a "tag byte" at the beginning, which can serve as a version number (jumbled together with additional information like the end-of-feed boolean for efficiency). Only the bytes 0 and 1 are currently used, everything else would indicate a new version.
The version number is global consensus, so in that regard I'm not sure if it fits the spirit of ssb.
This is the one point in the protocol where we absolutely need global consensus. It's global in the sense of "Sure you can disagree, but what you are doing isn't ssb anymore.".
General comment regarding version numbers: If the change is non-breaking, did you really need the version number at all? If it is breaking, does the version number help in any way beyond signalling the breakage and allowing implementations to disambiguate accordingly? The answers to those questions are what makes me think that the tag byte is sufficient. In particular, this whole idea of classic version numbers inducing an order and ranges does not apply to ssb.
I think we should encode the above using cbor in the order the fields are specified.
Why?
There is the issue of linking the new with the old log.
Why does everybody assume there would be a new log? =(
Just append the new messages to the old log. Make sure the encoding never accidentally ends up being valid json, and everything is fine. Old servers will reject the new messages, but they would reject a new log the same way.
Message size restriction? Might be good to lift it from 16kb to something like 64kb?
Or even more, considering the size is signed and can be used to filter replication. Imo the guiding question here is "Should clients be able to assume that a message fits into memory?".
Content could be included as current design or could be fetched via some other channel (blobs, dat etc.) for offchain-content. Currently leaning towards including it from a latency perspective
Nothing stops our replication rpcs from "inlining" offchain content and sending it together with the metadata. Offchain-content does not imply additional roundtrips.
Must include the following fields from the old format: timestamp, author and type. Still schemaless.
There's lots to comment on here. Imo this shouldn't be an object with mandatory fields, but a tuple of (timestamp, author, type, self-describing-free-form-thing), the last of which could be cbor or whatever. There should be no reasons to restrict content to be an object.
Why the author at all?
An important question though: Should this content-metadata (timestamp, type) really be specced out, or should it be another self-describing-free-form-thing?
The reason I like canonical cbor is that it is a well spec'ed standard
Hahaha. No.
The rfc gives a few recommendations, but nothing binding.
multiple implementations in different languages
None of which fulfil ssb's idiosyncratic requirements (how many off-the-shelf libraries will happen to use the exact same canonization rules, disallow the exact same set of cbor features (tags, timestamps, float16, etc.), reject infinities and NaNs?)
Implementing clmr took a day or two - including hammering out the final design issues. This should imo not be a reason to settle for additional complexity just because it's already implemented somewhere.
Require a type seq that would specify the seq number for that particular type.
Partial subscriptions by type might be the most obvious partial subscriptions, but they are also pretty useless (and they conflate distinct concepts: Partial subscription and indication to clients about the content). We can try to come up with a better one (my favorite: Conceptually each feed consists of hierarchical subfeeds, each message specifies the subfeed to which it belongs. E.g. a subscription to [foo, bar] will catch a message whose subfeed-specifier is [foo, bar], [foo, bar, baz] or [foo, bar, baz, qux] etc. Requires one seqnum per array entry, the size of which is dominated by the size of the array itself. This scheme allows applications to organize subfeeds exactly how they need them, much better than the arbitrary, flat subscribe-by-type).
If we want to hardcode such a scheme into ssb, why not put the data at the bamboo-level? it is replication metadata after all.
But do we want to hardcode? Possibly not. If not, where does this subscription information live? Part of the content-metadata? Do we need a fourth category (ssb-replication-metadata, freeform-replication-metadata, content-metadata, content)?
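For concreteness, the matching rule of the hierarchical subfeed scheme described above boils down to a prefix check; a minimal illustration with invented names:

// A subscription matches every message whose subfeed specifier starts with
// the subscribed path.
fn subscription_matches(subscription: &[&str], subfeed: &[&str]) -> bool {
    subfeed.len() >= subscription.len()
        && subscription.iter().zip(subfeed.iter()).all(|(a, b)| a == b)
}

fn main() {
    assert!(subscription_matches(&["foo", "bar"], &["foo", "bar"]));
    assert!(subscription_matches(&["foo", "bar"], &["foo", "bar", "baz"]));
    assert!(!subscription_matches(&["foo", "bar"], &["foo"]));
    assert!(!subscription_matches(&["foo", "bar"], &["foo", "qux"]));
}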
FIXME: better format for private messages
Alternatively: No format for private messages? Why should ssb need to know about them?
Missing questions:
- Which kinds of data values should the free-form format support? Sets? Bytestrings? Etc.
- Should ssb even mandate a self-describing format?
- What is a msg cypherlink? Author+seqnum+hash? +Size?
- Do we need a new cypherlink-type for off-chain content? Or can we address it like a blob?
- Isn't an offchain content thing nothing but a blob with a different size limit and self-describing format? Is it possible to collapse the two?
I'm off for the weekend now. Have fun discussing this =)
Don't mind me...
You can switch the tokenizer to return numbers as strings
Why does that help? You'll need to serialize arbitrary floats anyways (since your clients might produce them), no way around that. In rust, I forked an implementation of the ryu float formatting algorithm and made some changes to it (also this, not sure whether that's relevant for go). There's a go implementation of ryu as well. Both the rust and go implementation claim to be straightforward translations of the original C implementation, so you might be able to apply the same changes I did without having to dig into the details.
The test data should be pretty comprehensive for the floats, it felt like the fuzzer spent most of its time exploring the inner workings of ryu.
{ "type": "chess_game_end", "status": "resigned", "root": "%xs6GBYPFNIicAFwHfEIZF/lw+Q9ebIj+CMEFDJ3KlS4=.sha256", "branch": "%854a7HlobExavYKpUNo3yZjDLOcg4legXyaBsF9u+cM=.sha256" }
See here for a couple of points that need to be addressed by a serious fork recovery proposal. TLDR: The difficult part is not deciding on how to recover, the difficult part is dealing with the consequences of a system that allows fork recovery.
{ "type": "about", "about": "@EMovhfIrFk4NihAKnRNhrfRaqIhBv1Wj8pTxJNgvCCY=.ed25519", "name": "Dominic" }
@arj Yes, I think that's the most recent thread. Some quick annotations that reflect my current view on this matter:
I'm pretty convinced that dedicated (as opposed to string-encoded) multifeeds and multihashes are the way to go, as is the addition of byte strings. Ints should probably be done like in cbor (i.e. not tagged with a width). Does ssb really need 32 bit floats in addition to 64 bit floats? Probably not. Does it need sets? I'd prefer to have them, but I'm not militant about it.
In general, most of the details are fairly unimportant. Efficiency gains over json are nice to have, but nothing more. Extending the value set with feeds/hashes/bytestrings/sets/integers makes sense imo, but it's not like ssb has really suffered from not having those.
The more important question is whether ssb should even insist on a particular self-describing data format (i.e. declaring a message as invalid if its content is not syntactically valid under that format), or whether it should leave interpretation open? Last time I talked to Dominic about this (more than half a year ago), the latter was the direction he wanted to explore.
But I really need to find a framing for this that won’t ruin my weekend
You are getting paid.
=P
I'm not sure I understand the output you quoted: Is your implementation recognizing this as valid json? Are you rolling a hand-written parser/serializer or are you going with the standard go one?
I would have expected most problems to have to do with float serialization and string escape sequence parsing.
I think the fuzzer got a bit overly eager when minimizing the test cases for invalid surrogate pair escapes in strings, removing the closing quotation marks from some test cases (since my rust implementation emitted the error directly upon encountering the invalid surrogate pair). To make sure that your implementation rejects unpaired surrogates, you should probably add trailing quotes to those nay/surrogate tests that lack them. Since go is not as strict about unicode as ssb, I expect you could run into problems otherwise.
@bobhaugen Just to clarify: Ssb does not depend on implicit behavior, the ordering of object entries is precisely specified here. It is a mess, but it is a well-defined mess without any implementation-specific behavior. Please pass that information on.
As @arj said, conversation about a new signing encoding took up steam again after #dtn. We'll publish a few RFCs within the month, and aim at having figured out the way forward by the end of July.