Okay, less dread now. Things were really dark at some point, but over several pair coding sessions with @arj we poked around everything, and we found horrible things. The feeling is that we've been pushing a rock uphill for months (literally) and have finally reached the top of the hill. Things are supposed to go butter smooth downhill from here, but we're so tired from pushing the rock that we're just taking a break up here, having a picnic and chatting about how shitty the uphill was before gathering strength to do the easy downhill.
The biggest thing is pull-weird in muxrpc. That was the big culprit, although other culprits are responsible for other types of slowdowns. You know, Dominic always admitted that there are funky streams in muxrpc that are neither pull-streams nor push-streams, they're just weird streams. Specifically, what was happening in Manyverse was:
The frontend asks the backend for a pull-stream of threads, and there's a pull.take(3) on the frontend side (this detail is important). What you would expect to happen is that the frontend pulls one thread from the backend and updates the take counter to 1, pulls another thread and updates the counter to 2, pulls another thread and updates the counter to 3, then stops. What ACTUALLY happened was that the frontend pulled one thread from the backend, and the backend SPAMMED the frontend with an unbounded number of threads. Then the frontend proceeded to fetch "about" msgs for each of those incoming threads. Yes, a ton of wasted work. No wonder it was slow.
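To make the "expected" behavior concrete, here's a minimal sketch of per-item backpressure. It doesn't use the real pull-stream package: countingSource, take and collect are hand-rolled stand-ins I'm assuming for illustration, following the pull-stream convention where a source is a function read(abort, cb). The point is that the source gets asked exactly 3 times and is then told to stop.

```javascript
// Hypothetical stand-in for the backend: an endless source of fake "threads".
// It only produces an item when the sink asks for one.
function countingSource(stats) {
  let n = 0;
  return function read(abort, cb) {
    if (abort) {
      stats.aborted = true;
      return cb(true);
    }
    stats.reads += 1;
    n += 1;
    cb(null, { thread: n });
  };
}

// Hand-rolled take(limit): forwards at most `limit` items, then aborts upstream.
function take(limit) {
  return function (source) {
    let taken = 0;
    return function read(abort, cb) {
      if (abort || taken >= limit) return source(true, () => cb(true));
      taken += 1;
      source(null, cb);
    };
  };
}

// Drains a source into an array and hands the array to `done`.
function collect(source, done) {
  const items = [];
  function next() {
    source(null, (end, data) => {
      if (end) return done(items);
      items.push(data);
      next();
    });
  }
  next();
}

const stats = { reads: 0, aborted: false };
let received = [];
collect(take(3)(countingSource(stats)), (items) => {
  received = items;
});

console.log(received.length, stats.reads, stats.aborted); // 3 3 true
```

With real per-item backpressure the source never produces a 4th thread, because nobody asked for it.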
Here's the catch, though. It was slow because it was fast. And with ssb-db1, it was fast because it was slow. I know that's unintuitive, but it's true, and I'll explain how.
With ssb-db1, fetching one thread was slow enough that the frontend had time to fetch "about" msgs for the received thread before the next thread arrived. So after it received 3 threads, it "closed" or killed the backend stream, which under pull-weird semantics is allowed. pull-weird doesn't give you per-item backpressure (like normal pull-streams do), it only gives you a "shut up now please" backpressure to terminate the stream. So with ssb-db1, things were slow enough that the frontend had time to terminate the stream before the backend could spam us more. So it was fast because it was slow.
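Here's a toy model of that, with virtual time and no real streams. Everything in it is an assumption for illustration: the function name, the millisecond numbers, and the simplification that the frontend only processes the 3 threads it wanted (in reality it also started work for every extra thread that came in, which is worse). It just counts how many items the backend manages to push before the "shut up now please" can be sent.

```javascript
// Toy model, virtual time only. All numbers are made up for illustration.
// produceMs: how long the backend needs per item (it pushes without being asked)
// processMs: how long the frontend needs per item (e.g. fetching "about" msgs)
// wanted:    how many items the frontend actually wants
function itemsPushedBeforeClose({ produceMs, processMs, wanted }) {
  let finishedAt = 0; // when the frontend is done with its current item
  for (let k = 1; k <= wanted; k++) {
    const arrivedAt = k * produceMs; // backend pushes item k at this time
    finishedAt = Math.max(arrivedAt, finishedAt) + processMs;
  }
  // The frontend can only close the stream once it has handled the last item
  // it wanted. Everything pushed up to that moment already crossed the bridge.
  return Math.floor(finishedAt / produceMs);
}

const slowBackend = itemsPushedBeforeClose({ produceMs: 100, processMs: 10, wanted: 3 });
const fastBackend = itemsPushedBeforeClose({ produceMs: 1, processMs: 10, wanted: 3 });

console.log(slowBackend, fastBackend); // 3 31
```

Same frontend, same "give me 3": the slow backend delivers 3 items, the fast one delivers 31 before the close lands.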
With ssb-db2, each stream emission was friggin fast, and it spammed the frontend with so many threads that the frontend could barely manage anything at all. This also explains a super old bug in Manyverse! The raw DB screen was always very slow, and this always gave me WTFs, and I've tried to fix it multiple times. Because the raw DB screen should be just a straightforward scan of the log, it should be the fastest query. Yet, it was the slowest thing. If you've ever opened that screen, you know what I'm talking about. Now we know the reason! Scanning the raw DB is friggin fast, and that's why the frontend got choked with too much work. It was slow because it was fast!
Yes, now you see why I was in deep hell. The whole of Manyverse was behaving like the old raw DB screen, even though I was expecting things to be much faster. It was very strange, because in our microbenchmarks everything was very fast. And that's because in our benchmarks there was no muxrpc boundary, so nothing crossed a pull-weird bridge.
It's not new that pull-weird has this strange behavior. It has always affected createHistoryStream over the network, for instance, and @cryptix has felt it more often than he'd want to feel it.
I found a simple and hacky solution. The basic idea is that muxrpc source is considered harmful: only use it when you really know what you're doing (e.g. "live" sources that are unlikely to spam you with thousands of items, that's an okay use). The solution is to use only muxrpc async APIs. So I devised a secret-stack plugin called deweird that presents to you a "source" API, which under the hood translates to only "async" calls to the backend. It works! I'll prepare the package and put it on npm.
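To show the general idea (NOT the actual deweird code or its API, which I'm not reproducing here), below is a rough sketch of a pull-stream-shaped source where every read is one request/response call. createBackend, asyncBackedSource and the open/read/close method names are all hypothetical, and the backend is a plain in-memory object rather than anything going over muxrpc.

```javascript
// Hypothetical "backend" that exposes only async-style calls (cb(err, value)).
// It keeps one cursor per open stream and never sends anything unprompted.
function createBackend(allThreads) {
  const cursors = new Map();
  let nextId = 0;
  let calls = 0;
  return {
    open(cb) {
      const id = nextId++;
      cursors.set(id, 0);
      cb(null, id);
    },
    read(id, cb) {
      calls += 1;
      const pos = cursors.get(id);
      if (pos === undefined || pos >= allThreads.length) {
        return cb(null, { done: true });
      }
      cursors.set(id, pos + 1);
      cb(null, { done: false, value: allThreads[pos] });
    },
    close(id, cb) {
      cursors.delete(id);
      cb(null);
    },
    readCalls() {
      return calls;
    },
  };
}

// Hypothetical "frontend" side: looks like a pull-stream source,
// but each read(null, cb) becomes exactly one async call to the backend.
function asyncBackedSource(backend) {
  let id = null;
  let ended = false;
  return function read(abort, cb) {
    if (ended) return cb(true);
    if (abort) {
      ended = true;
      if (id === null) return cb(true);
      return backend.close(id, () => cb(true));
    }
    const pullOne = () =>
      backend.read(id, (err, res) => {
        if (err) { ended = true; return cb(err); }
        if (res.done) { ended = true; return cb(true); }
        cb(null, res.value);
      });
    if (id !== null) return pullOne();
    backend.open((err, newId) => {
      if (err) { ended = true; return cb(err); }
      id = newId;
      pullOne();
    });
  };
}

// Usage: pull 3 threads, then abort, like a take(3) would.
const backend = createBackend(['t1', 't2', 't3', 't4', 't5']);
const source = asyncBackedSource(backend);
const got = [];
function pullThreads(remaining) {
  if (remaining === 0) return source(true, () => {});
  source(null, (end, thread) => {
    if (end) return;
    got.push(thread);
    pullThreads(remaining - 1);
  });
}
pullThreads(3);

console.log(got, backend.readCalls()); // [ 't1', 't2', 't3' ] 3
```

The backend can be as fast as it wants here. It only does work when asked, so the frontend gets per-item backpressure back.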
Next up: case-by-case optimizations of queries, and fixing/improving the migration part to take care of both log.offset and log.bipf.