@Dominic %Igm25FZEje8LeruZ0MnCajFz9e1LoMO3EHB5C0fRMmw=.sha256

Dev Diary: replication/ebt

I've been working on my grant, getting ebt ready for full deployment.
I feel the grant process gives me license not to just do the most expedient thing, but to do the right thing, which has so far involved rewriting the muxrpc internals.

And, though this part is only half finished, rewriting epidemic-broadcast-trees. I didn't rewrite for the hell of it. Firstly, there were various tests in ssb-ebt that were too complicated, and I didn't understand why they weren't passing. Secondly, the low level features were provided by the epidemic-broadcast-trees module, then that was meant to be glued to ssb in the ssb-ebt module... except far too much functionality ended up in ssb-ebt, and it needed its own tests! ebt uses a data structure oriented programming style, with a deterministic simulation for tests, but because of the glue layer I wasn't able to test ssb-ebt that way. So, I realized I really needed to just rewrite ebt. EBT is probably the 3rd or 4th time I rewrote scuttlebutt replication (if you include the original "insecure" scuttlebutt module).

This time the state model encompasses the fact that you may be replicating with multiple peers at one time, and connecting and disconnecting with them. I haven't got to this yet, but I could also make tests that include randomly connecting and disconnecting to peers.

@Dominic %FMAStvgt1SZYrxqZOv64GYyBojhC3Fl3Pv38/UvRVNI=.sha256

Okay, so my rewritten epidemic-broadcast-trees seems to be passing all the tests, but I'm having trouble running it in production. I'm running into problems in sbot (the gossip scheduler was creating more than one connection in parallel), and something about my pub seems to be using so much CPU that it gets stuck; I can only connect to it successfully shortly after restarting it.

Today I made ssb-proc (github) to monitor that. I haven't fixed that problem yet... but I kinda like to go a little into one task, then go back and work on another one...

This one being debugging the ssb-ebt rewrite against my actual server. This seems to work; I've committed the changes and made notes about the additional tests needed...

My benchmark for initial replication is very close (currently replicating 100k messages in 84 vs 81 seconds for old style replication). Close enough, because the main way I plan on making replication faster is to replicate way less and lean on ssb-ooo instead...

@Dominic %5m6hK0l7a583OKU4W8/HmYQh+kloTAgOierDjTlQT3Q=.sha256

current plan: pull the built-in replication plugin out of sbot (breaking change, so I'll remove that annoying private plugin too) ... I want to make this stable enough (in the presence of peers running the legacy client). I've already got legacy replication mode disabled on my pub, which means it's only receiving messages via peers running ebt (which is just a handful, but enough!)

I'm wondering if there is a nice way to make sbot relatively stable, even in the presence of all the legacy replicators. Maybe, I'm thinking, start disconnecting peers once too many peers are replicating too much...

Then I'll test onboarding via ssb-ebt (with low hops setting) and ssb-ooo. We'll need more clients to implement support for ooo, though.

@Dominic %GLC4OFSPFI/eJvYnCpeg+fjdeAk3vl0b8mGkMN65olw=.sha256

the new ebt is coming along well. I'm running it as my only means of getting messages in and out, and it feels much more smooth than the previous implementation. Also it's less code! less code and more stable. Definitely feel happy about rewriting it!

Today I added a thing so that you try to only replicate in eager mode with a single peer.

There is just one edge case, which I'll deal with next: you are replicating from one peer, then that peer just stalls (maybe it's a bug, or it's overwhelmed, or it's trying to block you from having those feeds). EBT needs to realize that a peer hasn't sent a message for a while, and switch replicating feeds over to some other peer.

Which peer you switch to doesn't really matter, you could choose the oldest peer or the newest peer or a random peer.
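
A minimal sketch of that stall-and-switch idea, under assumptions: the function names, the peers map and the 5 second default timeout are all made up for illustration and are not taken from epidemic-broadcast-trees. It keeps the current eager peer unless it has gone quiet, and otherwise picks the first other live peer, since the choice doesn't really matter.

```javascript
// Sketch only: names and the 5s default timeout are assumptions, not ebt's API.
// peers maps a peer id to { lastReceived: timestamp in ms }.
function isStalled(peer, now, timeoutMs) {
  return now - peer.lastReceived > timeoutMs
}

// Returns the id of the peer that should send us this feed eagerly.
// Keeps the current peer unless it has stalled, in which case any other
// live peer will do (here: the first one found).
function chooseEagerPeer(peers, currentId, now, timeoutMs = 5000) {
  const current = peers[currentId]
  if (current && !isStalled(current, now, timeoutMs)) return currentId
  for (const id of Object.keys(peers)) {
    if (id !== currentId && !isStalled(peers[id], now, timeoutMs)) return id
  }
  return null // nobody is live; wait for a new connection
}

const peers = {
  alice: { lastReceived: 1000 },
  bob: { lastReceived: 9000 }
}
// alice last sent something 9s ago, so the feed moves over to bob
console.log(chooseEagerPeer(peers, 'alice', 10000)) // 'bob'
```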

key distinction: protocol vs behaviour. the "protocol" is what you must do to be intelligible to other peers. behaviour is everything else. is there a better word for this than behaviour?

@Dominic %6CduerhHioJ2tqGw629YxqtYejLomurF4hI0QtrhZ3o=.sha256

Also, cross reference this thread over to scalable scuttlebutt paper (draft)

@Dominic %oNMNfovHYg5LmjNf3WAuA0SEXcsXrimoPb5KvTCTIdI=.sha256

Okay! I have now implemented timeouts. I tested replication locally. I set friends.hops: 2, and ran sbot in another test folder (ssb_appname sbot server), then followed myself. This took about 4 and a half minutes, including building indexes.
This replicated 160k messages, and produced a 135mb offset file.

Building the indexes seemed to be a fair proportion of the load there, and the way I am calculating progress currently isn't very good (progress appears to stay very close to zero, then suddenly moves up to 100% near the end)

@Dominic %DVDZ0M/In1q5+2QdbovrCfWrOVMjr1FL8hY9vyJzwj4=.sha256

Last night I couldn't sleep, so I got up and worked on ebt for a bit. I just tidied stuff up, moved stuff that was performed frequently into a function. This made the code shorter, and easier to read, and removed a couple of inconsistencies.

I also realized a problem: for each feed, there are several things you want to communicate: the sequence, and whether you want to be in receive mode. The sequence is a number, but receive mode is 1 bit. There is also whether you choose not to replicate this feed (because you don't know who that is). If you send unreplicate you don't have anything to say about the others.

Currently these are all encoded as a single number: if it's positive or zero, it's rx+seq; if it's -1, it's unreplicate; if it's -2 or lower, it's !rx+~seq. Note: ~ is the bitwise NOT op, which turns -1 to zero, and -2 to 1.

But the problem is that there is no way to encode seq 0 + !rx. This is needed so you can ask for a new feed from just one peer, but have other peers tell you if they have something new. Without that, it's necessary to request them all in receive mode, which may mean they send you the same stuff for a while until you can get another message back to them.
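
To make the gap concrete, here is a small sketch of the single-number encoding as described above. It is an illustration only, not the epidemic-broadcast-trees source; the function names and the shape of the note object are made up.

```javascript
// Sketch of the single-number encoding described above (an illustration,
// not the epidemic-broadcast-trees source; names are made up).
function encode(note) {
  if (!note.replicate) return -1             // unreplicate
  return note.receive ? note.seq : ~note.seq // rx+seq, or !rx+~seq
}

function decode(n) {
  if (n === -1) return { replicate: false }
  if (n >= 0) return { replicate: true, receive: true, seq: n }
  return { replicate: true, receive: false, seq: ~n }
}

console.log(encode({ replicate: true, receive: true, seq: 5 }))  // 5
console.log(encode({ replicate: true, receive: false, seq: 5 })) // -6
// The gap: seq 0 with receive off encodes to ~0 === -1,
// which decodes as "unreplicate" rather than "seq 0, !rx".
console.log(encode({ replicate: true, receive: false, seq: 0 })) // -1
console.log(decode(-1)) // { replicate: false }
```

Because ~0 is -1, "seq 0, not receiving" and "unreplicate" land on the same number, so the receiver cannot tell them apart.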

Changing this will mean that peers running ebt@<5 won't be compatible with 5. I think that is okay though, since this hasn't been widely deployed yet.

Making upgrades in distributed systems is hard and we must embrace this! If we don't, and this protocol is successful, we'll end up with various legacy layers that made sense at the time (when we didn't understand the problem) and future generations will just have to accept that. I don't want to trap future generations in my innocent mistakes, I want them to be able to build past my bad ideas and keep my good ones if they still seem good.

@Dominic %DmE2OAso63FWy3pIWrqjJZWy5xXkPLBRDN6jv9j+jds=.sha256

More today - spent a while refactoring stuff so that the code can support both v3 and v2 representations. In this process I also found a couple of bugs where v2 assumptions were still present; everything is fixed now. Now you can pass a version number to a stream. It thinks in version 3 internally, but if you tell it you want a v2 stream, it will translate incoming requests into 3 before receiving and back into 2 before sending.

This will mean I won't be cut off from people running v2. They'll be able to connect to my pub and speak v2, but I'll be able to talk v3 to my pub. When people update to v3, they'll only be able to speak to v3 pubs, but it's fair to assume they'll update their pubs too.

This was all possible because ssb-ebt already passes a version option, since I already changed the representation before, however, I didn't bother to support any backwards compatibility that time.

@Dominic %3vbOg0LY9i4tSHXib7Cp7LgDtAryzCFouz7eZLoNvkA=.sha256

Trying to finish this up and get it actually deployed. I refactored sbot's tests to use ebt instead of legacy replication, and apart from a few small changes they mostly passed!

One thing that is missing is block: if A blocks B, C won't give B messages from A. That's how it is in the tests. We can't really guarantee this, because a peer could just change the code, but I'm gonna make sure the tests pass as they are.
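
As a rough sketch of that rule (not the ssb-ebt implementation; the `blocks` map and the function name are hypothetical), the check a well-behaved peer C would run before forwarding looks something like this:

```javascript
// Sketch only: `blocks` is a hypothetical map of feed id -> Set of ids it blocks.
// Returns false when `author` has blocked `recipient`, so an honest peer
// withholds the author's messages from that recipient.
function maySend(blocks, author, recipient) {
  const blocked = blocks[author]
  return !(blocked && blocked.has(recipient))
}

const blocks = { A: new Set(['B']) }
console.log(maySend(blocks, 'A', 'B')) // false: C withholds A's messages from B
console.log(maySend(blocks, 'A', 'D')) // true
```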

@Dominic %oz3uwgoAr2GwKcPh+/5BIMgm7hPMuSq0W91eP7AQieM=.sha256

Okay, I got blocks implemented in ebt, hooked up in ssb-ebt, and even passing the scuttlebot tests. Those tests are so ugly! Not happy with ssb-ebt, too coupled to ssb-friends right now, might have to fix that.

Okay, so that, and make progress smooth, and this is ready!

@Dominic %lp54mt+/0C5OCdJJvB/6M/uAstMeQLQVjQojeSoG0y4=.sha256

This is all working now. Now we just need to deploy it. I have conducted tests replicating (through the real internet) and it took 4 minutes to do initial sync, following myself with hops 2 of replication. That loads 136 mb of log. I think the next thing that would make it faster is faster indexes!

Okay, so how to deploy it. I guess the first thing is to put out a new sbot version with ssb-ebt as default, get pub owners to upgrade, then put out a new patchwork version.

Given the fallbacks we already have in place, old sbot versions should still work, but as we phase out legacy replication we should get way less weight on the pub servers.

@Dominic %rEyqdKWB5jmAx4+9EBD0BEtLNyKtDNHmHIc/0TSqDyw=.sha256

Okay, I wrote a script to connect to a pub and, by observing its behaviour, check what version of the replication protocol it runs.

  • really old: it just calls createHistoryStream loads of times
  • legacy: it calls createHistoryStream once, and then more times if you call createHistoryStream.
  • ebt: supports ebt, should fall back to createHistoryStream if client calls createHistoryStream. Backwards compatible with legacy.
  • ebt, legacy=false: doesn't do createHistoryStream no matter what. not compatible with legacy.
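
One way to read the four cases above as a decision procedure is sketched below. This is an interpretation, not the actual script: the observation fields (`supportsEbt`, `unpromptedCalls`, `callsAfterOurs`) are hypothetical names for what such a script might record while connected to a pub.

```javascript
// Sketch only: the observation fields are hypothetical, not the real script's.
//   supportsEbt     - the pub accepted an ebt replication stream
//   unpromptedCalls - createHistoryStream calls the pub made before we called it
//   callsAfterOurs  - createHistoryStream calls the pub made after we called it
function classifyPub(obs) {
  if (obs.supportsEbt) {
    // ebt pubs only differ in whether they fall back when we go legacy
    return obs.callsAfterOurs > 0 ? 'ebt' : 'ebt, legacy=false'
  }
  // without ebt: many unprompted calls means really old, a single one is legacy
  return obs.unpromptedCalls > 1 ? 'really old' : 'legacy'
}

console.log(classifyPub({ supportsEbt: true, unpromptedCalls: 0, callsAfterOurs: 0 }))
// 'ebt, legacy=false'
```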

I have updated sbot to include ssb-ebt and ssb-ooo by default. I'm gonna have a cup of tea and see if I can think of anything I still need to do before publishing this.

@Dominic %nxKAEchndiwtuQcFo1kwwDPfOdM0AEOGAX1KD2Cu60o=.sha256

Okay, I decided that ssb-ooo is not ready just yet. So that is not gonna be included in scuttlebot@11 (but it will be part of user-invites so I'll work on it next this month)

Attention, active pub owners:

@olizilla
@dinosaur
@kas
@ktorn
@xj9
@noffle
@andrestaltz
@ev
@cel

please update to scuttlebot@11.0.0

@mix %2ABY3lYCTcn1Tu1swkUdxKBLgO3iwZoLg2s37gLYYa4=.sha256

nice work @dominic I'm super excited for this :flags:

As part of the conversation with @matt the other day about scuttle-shell, he was suggesting the idea that it might be good to make bundled binaries of scuttlebot - this would insulate us from changes in the dependencies and would allow us to say "use version 10.5.5" and for that to mean one thing for sure for everyone.

@ktorn %A6RVbgiR9BbZh+nTPuhnEqBS3okNg6HRaUkP4DMNGeU=.sha256

done.

@Dominic %vHGqTZHtCpDugCwiz/fJZ0CDpICldMOsMME/6ac8kh4=.sha256

@ktorn oh, sorry, I also removed scuttlebot/plugins/private. If you add https://github.com/ssbc/ssb-private that should get patchfoo working again!

@Dominic %hjTApsaD6R86JB7Uc39OFJmZQ6BzsuN7yJd9HdCo2yo=.sha256

oops, I mean @kas!

@mikey %JU6/zV/oXUbqurKb8mbKjBqriNz3vE1F4AiE678Skak=.sha256

updated ssb-pub (and corresponding docker image) to use scuttlebot@11.0.0. :tada:

updated @ssb.mikey.nz to use this new image. :construction:

@mix %8O7eZfDKWk/uiJmJD329Q7/rB9PU2nJwxttY2tpu/48=.sha256

hey @dominic started moving Patchbay to 11. Moved my flume folder to see how long the replication would take. I bailed out at 13 minutes as it was still less than half done (based on final size of up to date log)

I also fucked up restoring my flume folder somehow, so I have scrambled the order of my messages ... this is probably something we should practice because it's the view that new people arriving are going to get as their feeds fill up in different orders (cc @matt )

@mix %btYGSIEh4S4ZCa72UpzbsKUJqRBaDohwMPhFKRmGrb4=.sha256

p.s. someone please acknowledge you see this message so I know I've not forked my feed D:

@Anders %n1X56bfp07C+v+Q6hRa57wp5tZ8vL79/VuRt2Na/j0A=.sha256

Upgraded my pub. Exciting times!

@Gordon %LKenL5XuXprnYQXfOOKiJfX8lA2cdoh9sRWhyl/s8GM=.sha256

@dominic @mikey

Hey @dominic , does the following error mean anything to you (when starting scuttlebot 11.0.0)?

scuttlebot 11.0.0 /home/node/.ssb logging.level:notice
my key ID: SOJaamc8LB5js+OKLM40J76YUgxUfLhlR7CxGeLluC8=.ed25519
error appending: Error: invalid message: expected different previous message
    at Object.exports.checkInvalidCheap (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/ssb-validate/index.js:86:14)
    at Object.exports.checkInvalid (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/ssb-validate/index.js:101:21)
    at Object.exports.append (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/ssb-validate/index.js:149:20)
    at reduce (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/secure-scuttlebutt/minimal.js:54:15)
    at queue (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/async-write/index.js:30:31)
    at /home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/secure-scuttlebutt/minimal.js:96:5
    at EventEmitter.queue (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/secure-scuttlebutt/minimal.js:88:17)
    at EventEmitter.db.add (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/secure-scuttlebutt/index.js:76:8)
    at apply (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/muxrpc-validation/index.js:173:15)
    at EventEmitter.<anonymous> (/home/node/.npm-global/lib/node_modules/scuttlebot/node_modules/muxrpc-validation/index.js:82:14)

I am using @mikey's latest docker image (sudo docker pull ahdinosaur/ssb-pub) on my VPS :).

@Gordon %kfNkRGv3jlG43sz4/BVLIFXytW6MDur1iqQgELKKt4c=.sha256

Oh, there's more errors in the log:

Dat paste link for beaker browser: dat://f3f8719c4604416ce15257d32408a1d01b183cdae41d15674aea1515a3a5e552/

http version for convenience if you don't use beaker browser =p : https://sbot-errs-happy0.hashbase.io/errs.txt

@Gordon %8Q8+blxfjamk6I6rnJA+wcjccTe2BJxnyEqJpElz/wg=.sha256

hmm, the errors above seem to concern the feed of @5yP0SoZ... . Is this an ebt only pub with a forked feed or something?

If that was the case, I guess I wouldn't have seen any such errors up until now because my VPS didn't have ebt enabled until the update.

Is anyone else seeing those errors on their pub's std err log?

They're spamming my pub's std err log quite a lot.. I guess I should find a way to make sure the docker container log doesn't fill my disk eventually =p

@Dominic %UjsbJPRt2oNmvexnmdKmvZtWq3E8+W4n1jD2n1zfMgw=.sha256

@happy0 yeah there are a handful of forked feeds in the network. You'll see results like that. I'm gonna push a patch that will handle those better.

@mixmix I tested with a fresh id that followed you, then 2 hops. It took 4.5 minutes to replicate from my pub and build indexes. If you did that as you, it would take longer if you are on 2 hops from you. Probably shouldn't update patchbay till the pubs are mostly running 11

@Dominic %NzHai683hgOOQ/0CAPwDQ741ldb/or+uPFJnuwmXmIs=.sha256

another forked feed: @Your pub

I'm gonna push a patch that will cause ebt to treat a forked feed like a blocked feed. This will stop replicating it over and over, but better than that will be to send a "fork proof".

We want to know that everyone who was following the forked feed sees the fork proof eventually. What you could do, is if anyone requests the feed, send them the fork proof.
But what should we send as our state? If we indicate blocking/not-replicating and then send the proof, they might not get the proof (if the connection breaks at the wrong moment). But if we sent another signal to indicate that the feed is corrupted, they would ask for that feed again until they receive the fork proof. Everyone who follows that feed should then receive the proof!
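
For illustration, a sketch of what checking a fork proof might involve. This rests on an assumption the thread does not spell out: that a fork proof is simply a pair of different messages from the same author claiming the same sequence number. The `{ key, value: { author, sequence } }` shape follows the usual ssb message layout; the ids below are placeholders, and signature verification is deliberately left out.

```javascript
// Sketch only: assumes a "fork proof" is a pair of messages from the same
// author that claim the same sequence number but are different messages.
// Signature verification is deliberately left out.
function isForkProof(a, b) {
  return a.value.author === b.value.author &&
    a.value.sequence === b.value.sequence &&
    a.key !== b.key
}

// Placeholder ids, not real feed or message ids.
const m1 = { key: '%msg-one', value: { author: '@feed', sequence: 7 } }
const m2 = { key: '%msg-two', value: { author: '@feed', sequence: 7 } }
console.log(isForkProof(m1, m2)) // true: two different messages at sequence 7
console.log(isForkProof(m1, m1)) // false: the same message twice proves nothing
```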

@mikey %KHqlfyveE/RegsrHUQIfwVHA6h33YI8zwxSblXGV7wE=.sha256

updated @one.butt.nz and @ssb.rootsystems.nz :construction:

i was secretly hoping that the new muxrpc backpressure would fix the overloaded cpu and memory problems, which doesn't seem to be the case as @ssb.mikey.nz spent the morning hanging out at full cpu (and no external bandwidth usage). will see if this changes as clients begin to update to use ebt replication.

even if it goes without saying, thanks for your work on this update @dominic! :cat:

@Dominic %H/ImURyf46ruCWBKqJs8CZoZefvbux4mOthBhy4ialY=.sha256

@dinosaur it won't fix it right away. clients connecting and hammering createHistoryStream is most of the problem.

@Dominic %YQbnjpDysNSSVpOCzipKLjT/aNNVp4RwK9H9a415Ql0=.sha256

btw, @mixmix I did a test replication starting from you plus 2 hops which took 9 minutes, and replicated a 237 mb log. My full log is only slightly larger at 252.

I think there are two possible explanations:

  1. indexes. patchcore uses several extra indexes, and flumeview-level could be faster. (also, probably the index setup could be refactored to use less indexes)
  2. I'm connecting to less pubs, and am configured to only use ebt. You are probably connecting to legacy pubs, and falling back to old replication (with a tendency to download the same message twice?)

or maybe some combination of these?

@mix %AxT/9WBWeKjKySPwPRzfYz3MmBw8BjdJipivVdRQyRw=.sha256

thanks for that breakdown @dominic. I think the indexes are a wee bit slow.

Did you post somewhere how to disable legacy replication?
