@Dominic %7v9fH3Hz+u3RuH4X5FR2Qu4cj0VHC59ZwxokOSrCM6A=.sha256

## towards a _fast enough_ database

This weekend I put a bunch of bits I have been tinkering with together.
It's enough to demonstrate that we can make the scuttlebutt database significantly faster
if we stop doing stupid things, like using JSON.

This also incorporates some bold experiments, such as replacing leveldb indexes with indexes
written purely in javascript. This, combined with a binary in-place format, is actually as fast
as it was with level, but the indexes are 5x smaller now! This can also run in the browser, where
leveldb can't (indexeddb is horribly slow, unfortunately!)
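
For a taste of what the in-place format buys us: reading one field of a record with bipf looks roughly like this (from memory - check the bipf readme for the exact api):

```js
var bipf = require('bipf')

// encode a record once into its binary in-place form...
var record = { author: '@abc', content: { type: 'post', text: 'hello' } }
var buf = Buffer.alloc(bipf.encodingLength(record))
bipf.encode(record, buf, 0)

// ...then read a single field in-place, without parsing the whole record
var p = bipf.seekKey(buf, 0, Buffer.from('author'))
console.log(bipf.decode(buf, p)) // => '@abc'
```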

I put a bunch of code up here that you can run: https://github.com/dominictarr/ssb-binary-database-experiments.

Next thing is to figure out how much baseline IO performance mobile, browsers, and embedded systems give us.

Please run the scripts and tell me how long it took!

@arj @cel @andrestaltz @dinosaur @mix @rabble

@mmckegg %lK0Ynxm1dlSjaSdXFjqoIQ4i9Zuh6Jj0NYpifvel3xg=.sha256

@Dominic

After installing, when I try to run I am getting Error: Cannot find module 'varint32'

Then I try:

$ npm install varint32

But get:

npm ERR! code E404
npm ERR! 404 Not Found: varint32@latest

Unpublished dep?

@Dominic %SiEt5S7WjzFL0sPLyylGdlpTd2MgQ2OzsoY6CyVXytY=.sha256

@matt oops! fixed in bipf@1.2.1 (I tried a strictly 32-bit version of the varint module, but never published it because it didn't make any noticeable perf difference, so I just switched back to regular varint)

@mmckegg %9WLKWZlNbayxPj2L7uFfD/vCwBqiP31u+U3Nw5+mag8=.sha256

@Dominic On my machine I get:

records, mb, seconds
25636, 17.430975914001465, 1.001
62658, 45.49252891540527, 2.002
99339, 73.39055728912354, 3.003
133206, 100.99117183685303, 4.004
172779, 130.03430271148682, 5.005
213781, 156.456298828125, 6.006
257011, 182.23061752319336, 7.007
299590, 206.3144016265869, 8.01
338234, 232.39395809173584, 9.011
378044, 258.3344135284424, 10.012
415893, 286.0402135848999, 11.013
454827, 312.4459638595581, 12.014
492972, 338.4016389846802, 13.015
525924, 373.08435440063477, 14.016
565806, 398.6913871765137, 15.017
604355, 425.10098457336426, 16.018
642319, 448.4728412628174, 17.019
648715, 453.09368801116943, 17.196

Is there something I can compare this to?

@mmckegg %vrV20miaPzI5LMUZeeOf1ufgLNeWMDrSzkC0Dt3j/iM=.sha256

Whoops, I didn't run the index generation. Here's the output (filtered to just show the csv):

seconds, key, log, cts, clk, tyt, tya, rtt, cta, aty, att
1.01, 19257877, 19256508, 19257223, 19256508, 19256508, 19256508, 19257223, 19257223, 19256023, 19256023
2.01, 19455877, 19457246, 19455395, 19454534, 19457246, 19458474, 19457759, 19455865, 19458257, 19455019
3.303, 11714615, 11712574, 11714780, 11717327, 11714615, 11713387, 11714828, 11713746, 11713588, 11715792
4.985, 0, 506, 470, 0, 470, 470, 495, 564, 501, 564
6.101, 0, 1535, 501, 0, 1466, 1466, 5334, 1441, 1441, 1441
7.101, 2094942, 2091187, 2094942, 2094942, 2095322, 2091701, 2089988, 2096788, 2092691, 2089751
9.184, 4129717, 4131774, 4129717, 4128490, 4126174, 4121074, 4126896, 4126896, 4115006, 4133211
11.42, 744, 471, 744, 722, 0, 971, 505, 505, 485, 0
13.693, 12764, 13010, 12764, 13042, 0, 7750, 13007, 13007, 9577, 0
15.813, 7161, 6423, 7161, 6177, 8923, 0, 7094, 7094, 4232, 8227
16.821, 20590627, 20592251, 20590627, 20592582, 20603600, 20611684, 20591195, 20591195, 20611684, 20604296
17.874, 18762290, 18765661, 18764319, 18763330, 18763834, 18766312, 18764822, 18764822, 18766312, 18764822
20.698, 8828614, 8826082, 8826585, 8827574, 8823692, 8826594, 8826589, 8828455, 8827080, 8822704
21.798, 0, 0, 0, 0, 2317, 486, 656, 1071, 724, 2317
22.899, 0, 0, 0, 0, 251441, 251445, 251931, 249650, 250721, 251441
23.898, 0, 14375405, 0, 9745560, 14124554, 14122311, 14121373, 14121373, 14122311, 14124554
28.196, 0, 1641461, 0, 2654368, 1640531, 1639589, 1640998, 1640527, 1640060, 1640531
30.65, 0, 504, 0, 482, 0, 471, 1401, 471, 0, 0
32.866, 0, 767, 0, 482, 0, 0, 504, 1401, 0, 0
35.816, 0, 6948145, 0, 6538541, 6706575, 0, 6948397, 6948252, 6121067, 5857889
36.911, 0, 14382665, 10400359, 14075789, 14414966, 0, 14382682, 14383829, 14359420, 14310097
38.569, 0, 13236567, 12534588, 13628464, 13389555, 4640718, 13238005, 13239563, 13456139, 13511331

Oh and the final output from query:

query 119 0.043

@Dominic %GJwAVeRlztEY4XKbKN6FkeUbLGh014xZ1922+eAnmV0=.sha256

@matt thanks!

I think I'll update this script to collect data in a more useful way (and then graph it)
but for me, the initial build takes:

records, mb, seconds
13626, 9.349052429199219, 1.001
32983, 22.63656711578369, 2.002
53663, 37.63745594024658, 3.003
74265, 54.183616638183594, 4.004
95688, 67.20764827728271, 5.005
118166, 81.02961921691895, 6.006
139934, 95.34706974029541, 7.007
161673, 111.41179656982422, 8.008
182736, 127.66858196258545, 9.009
203171, 141.54617595672607, 10.01
224960, 156.47870540618896, 11.011
247371, 170.9478521347046, 12.012
268893, 185.0323362350464, 13.016
289713, 199.6049451828003, 14.017
309950, 212.98091411590576, 15.018
331260, 226.5898561477661, 16.02
353054, 241.3536434173584, 17.021
374353, 255.59946632385254, 18.022
394535, 268.16277408599854, 19.023
413438, 281.231237411499, 20.024
433027, 295.26458740234375, 21.025
453278, 310.0141181945801, 22.026
469252, 321.6740198135376, 23.027
487763, 334.75016593933105, 24.028
506709, 349.0513792037964, 25.029
526669, 363.0400505065918, 26.03
545476, 376.8821449279785, 27.031
564196, 389.16582584381104, 28.07
582917, 402.17419052124023, 29.071
601132, 414.7186632156372, 30.072
618735, 427.1110019683838, 31.073
620274, 428.1830577850342, 31.156

(almost twice as long as on your computer, @matt)
(it's not a race, the point is to figure out how much perf we have to work with)
for me, the query took 122 ms - so less difference there.

Oh, it would also be good to get some representative queries - the default I have here is just posts within the last day... hmm, though, patchwork builds the latest threads too, right? but also filters by your friends?
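
For reference, that default query is roughly this, written in flumeview-query's $filter syntax (the script may phrase it slightly differently):

```js
// posts from the last 24 hours
var query = [{
  $filter: {
    value: {
      content: { type: 'post' },
      timestamp: { $gt: Date.now() - 24 * 60 * 60 * 1000 }
    }
  }
}]
```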

@mix %f5SuSjAsl1nLwQf0huefKNoxT3E69AL0osW+AxkPgMU=.sha256

me :

29686, 18.46924591064453, 1.001
67656, 44.71892070770264, 2.002
106214, 72.51329898834229, 3.003
143263, 98.11196041107178, 4.004
180141, 123.9466962814331, 5.005
216808, 149.23049068450928, 6.006
251931, 178.84262371063232, 7.007
293310, 200.8111171722412, 8.008
331022, 225.62857151031494, 9.009
369877, 251.55218696594238, 10.01
407401, 276.41363048553467, 11.011
443067, 301.6712980270386, 12.012
478618, 326.32406997680664, 13.013
502675, 343.5799493789673, 14.014
540598, 379.9094982147217, 15.015
578744, 405.2534589767456, 16.016
616723, 430.10236644744873, 17.017
653964, 454.9683074951172, 18.018
657343, 457.16283226013184, 18.113

The indexes are still going ...

You could try loading all the data needed for a particular mega thread: avatars, names, likes, backlinks, who's blocked, and whether any of the messages you're reading are by those people.

ok, here's a dump from the tail of that indexing:

sort time: 16 124625 2546 1982
indexes/key:compacted! 18 657342
{ input: [ { size: 1982, latest: 483705954 } ],
  count: 2,
  total: 1983,
  average: 991.5,
  since: 483705954,
  meta: 
   { since: 483705954,
     index: 
      [ '1543225491218.idx',
        '1543225488533.idx',
        '1543225489708.idx',
        '1543225448618.idx' ] } }
ended, key, 657342, 483705954
161.505, 1348035, 0, 0, 0, 0, 0, 0, 0, 0, 0
normalized-index:sorts 86 124625 2546

@mix %zXaBFWSRwgV1+2/4y/NpfX4vdIFa1jna/2sc0IF8Ffc=.sha256

I've not been so concerned about index sizes. Blobs are an order of magnitude bigger problem space, I think. Figuring out how to make indexing faster, OR finding good times to do particular indexing, and when to allow queries through without waiting for the indexing to complete, would be important.

OH, that was an idea I wanted to inject @dominic - the ability to make a query and say "just give me what you've got now, I don't want to wait for the indexes".

example use cases: I want to see posts, but I don't really need the updated names and faces of everyone to read that content.
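
Something like this, maybe (the flag name is totally made up, just to illustrate the idea):

```js
// answer from whatever the indexes have right now,
// instead of waiting for the views to catch up with the log
sbot.query.read({
  query: [{ $filter: { value: { content: { type: 'post' } } } }],
  waitForIndexes: false // hypothetical flag: don't block on `since`
})
```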

User has not chosen to be hosted publicly
@Anders %4UmtzYAbzcqdKug9wTD++k4U9mO5oFnET8rm50yT4g4=.sha256

Interesting! With the new formats that @aljoscha proposed, where do you see bipf in this?

The source file check in init.js doesn't seem to work on node 10; I removed it to allow things to continue.

init is:
646902, 449.78984451293945, 97.865

Indexes have been running for 500 seconds and counting. Memory usage is at 20%.

@Anders %LyjbU1wU8sIWgl7SVHBvBg4HZv0FjriAEaJxbv6J/A8=.sha256

Indexing finished after 591 seconds.

Query:

query 93 0.327
normalized-index:sorts 4 0 0

real    0m0.957s
user    0m0.776s
sys    0m0.100s

@Anders %P9hdKoMibZB1/HZbAx2nGwEGZfX1uX35uwgAbCO/3os=.sha256
Voted ## towards a _fast enough_ database This weekend I put a bunch of bits I h
@Dominic %KRcUysIGB3lzvTUUPd5A4sDBfuQc95bVVF3+8ttSZpE=.sha256

@arj bipf is a proof of concept, to figure out the api. I plan to work with @aljoscha to either a) design a format that works for signing and also is in-place, or b) make sure we have a design that is easy to translate for database usage. We've currently got some translation to do anyway, and it's easier to change the database's internal format than the signing format. If the cost of translation is low compared to verifying a signature, then there isn't much reason to avoid it. (He and I discussed this on a call recently, and we agreed to take these factors into consideration while evaluating designs.)

Quite a curious result there, @arj: your indexing took longer, but then the query was somewhat faster! Maybe your system leans more heavily on OS caching?

@Dominic %QT28n+HBTQC3wSzSXc6NLPb8/iWZk+lAIRdl7d/MCZM=.sha256

I managed to run the indexes on my bottom-of-the-barrel phone (zte-blade), using termux (without rooting the phone, which I couldn't get to work). I had to jump through a few hoops to get it to work: installing level didn't work, so I couldn't run the query, and indexes.js ran for 957 seconds, then crashed ("Killed"... is that out of memory?). It was almost finished at that point, so starting it again (which continues from where it left off) took another 345 seconds. 957+345 = 1302 seconds; 1302/60 ≈ 21.7 minutes.

Later I hacked out the bit that touched level (since it wasn't actually needed) and got the query to run on the phone: 97 ms! Although the node process has a several-second startup time...

I noticed that the key index took a lot longer. This is a value that every record has, and it's uniformly distributed, because it's a hash. Other indexes happen to be mostly in order, or have chunks of order (say, most replies to a thread come soon after the thread is posted), so compaction/sorting for key is more work. It took a bit of messing around to get this working, and I was a bit disappointed it was slower at first - out of laziness I was reading an int straight out of the base64, but it seems this caused many hash collisions and was really slow. When I decoded the base64 first, my laptop could build that index in 3 seconds! On my phone it took 40 seconds (because io is slower), but that's a big improvement on the average of 2 minutes per index.
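
Roughly the difference, as a sketch (the slice offsets and tableSize are made up for illustration):

```js
var id = '%7v9fH3Hz+u3RuH4X5FR2Qu4cj0VHC59ZwxokOSrCM6A=.sha256'

// lazy version: an int read straight out of the base64 *text*.
// each ascii char only carries 6 bits, so buckets cluster and collide.
var lazy = Buffer.from(id.slice(1, 5)).readUInt32BE(0)

// decode first: the raw hash bytes are uniformly distributed.
var decoded = Buffer.from(id.slice(1, 45), 'base64').readUInt32BE(0)

var tableSize = 1 << 16 // made-up table size
var bucket = decoded % tableSize
```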

I'm hoping that my phone is just particularly slow, and that better devices will have better perf...

@cel %OBzInnu2Ody9S9s2ePSDRZk02qB9hNOLs0zqGA357oU=.sha256
Voted I managed to run indexes on my bottom-of-the-barrel phone (zte-blade), usin
@Dominic %RBbl2D9SXqmn/xwSh2Y1zBYLZ4Nhrsg3DhKAg2mCS4k=.sha256

I got termux working on my nexus 5. This involved updating to android 6, which wasn't simple because I had rooted it, making the normal update not work. Updating it using TWRP was actually fairly simple, but the information explaining how to do this is generally quite disorganized.

With the hashtable for key (which now finishes first!) I built the indexes in 618 seconds. That's only 10 minutes - twice as fast as the zte-blade.

I have an idea for a hashtable + linked vector index. It would be good for things like author, sequence, or root (replies). You put the author / root message into the hash table, which points to a "vector". A vector is like a linked list, but really it's a list of linked buffers: if you start off allocating small buffers but make each next one twice as big, at most half the space ends up empty - O(2*n) space - but you get O(log(n)) access time, and it wouldn't have the compaction overhead that LSMTs (log-structured merge trees) have. That's considered good.
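
A minimal sketch of the vector part, storing 32-bit ints (e.g. log offsets) - sizes and names made up, not a real implementation:

```js
// append-only "vector": a list of buffers, each twice the size of the last,
// so n entries need only O(log n) buffers and at most half the space is empty
function createVector () {
  var buffers = [Buffer.alloc(16 * 4)] // first buffer: 16 four-byte entries
  var capacity = 16, length = 0

  // find the buffer an index lives in - O(log n), because sizes double
  function locate (i) {
    var k = 0, base = 0, size = 16
    while (i >= base + size) { base += size; size *= 2; k++ }
    return { k: k, offset: (i - base) * 4 }
  }

  return {
    push: function (value) {
      if (length === capacity) {
        var next = buffers[buffers.length - 1].length * 2
        buffers.push(Buffer.alloc(next)) // twice the size of the last one
        capacity += next / 4
      }
      var p = locate(length)
      buffers[p.k].writeUInt32LE(value, p.offset)
      return length++
    },
    get: function (i) {
      if (i >= length) throw new Error('index out of bounds')
      var p = locate(i)
      return buffers[p.k].readUInt32LE(p.offset)
    }
  }
}
```

Then the hashtable just maps, say, an author id to one of these, and indexing a new message by that author is a single push - no sorting, no compaction.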

@andrestaltz %Sd7OeozzOKLjaVHGieUadXGJ6hOcFuAhweMxwreA+M4=.sha256

@dominic Now that your nexus 5 is updated, you should be able to run manyverse and sync over LAN. I'm curious how long it takes for your nexus 5 to sync your @EMovhfIrFk4Nih... account

@Anders %XeNWAcFx63HoBz01URyVprHuBjFFrRiYhUEQZG744go=.sha256
Voted I managed to run indexes on my bottom-of-the-barrel phone (zte-blade), usin
@cel %FnusPoExHFJ2VJ6jToiIB9A5pInAIidLZySUfC285jM=.sha256
Voted I got termux working on my nexus 5. This involved updating to android 6, wh
@Dominic %OI9OXvJsb5z7lgSSQrl2WVuKamwUeDJb/CVMk7IcLe8=.sha256

update: wrote a script to convert the currently-base64 ids into buffers. I know @aljoscha is planning to give these types, so I added placeholders for that.

my full json-format log.offset is 503 mb. Converted to bipf, keeping ids as base64 strings, it's 440 mb (takes 34 seconds); converting them to buffers gives 355 mb (but takes 48 seconds).

@Dominic %d0Ktf/xiuGn9ynmMtvRZziLbeO4Bb8Aw2d6z7h5X3HQ=.sha256

not having the ids in base64 directly makes the hashtable index faster: it doesn't need to parse the base64, it can just read an int out of the key and use that as the hash key. Also, I realized that the current hashtable view clears a timeout and creates a new one on every write. (The idea here is to write only when things stop changing, but clearing and creating half a million timeouts actually takes some time.)

Adding:

// if a write is already scheduled, don't schedule another one
if(timer) return
// otherwise schedule a single write for the next tick
timer = setTimeout(function () {
  timer = null
  async.write(data)
})

means that the index gets built in 2 seconds instead of 7! that's a huge improvement! I also suspect this change will benefit flumeview-reduce too.

@Dominic %Ijm5LqNLCyWYttn057JqRtwYR3+RTt6AYFlPBGHts7A=.sha256

More updates: okay, so I got a bit excited and started learning C. I wrote just enough to read records from the aligned log, and to seek for bipf values. I wrote a program that scanned the entire log and dumped every text content to stdout. It runs in 0.6 seconds on my laptop, and 6 seconds on the nexus 5. (on the nexus that's reading 58 mb/s while also writing 10 mb/s!)

I think this is very strong evidence that a fast enough embedded database is entirely possible on mobile.

@Dominic %EBY+DLCWWB1QnQyhIkMLXlwWY1w6hqV9k9LuR/Ji0Kg=.sha256

hmm, and it took 10 seconds on the zte-blade! that's really pretty okay. The next steps needed to make something actually useful:

  • test it in web assembly!
  • ability to do scan queries with map-filter-reduce filters (the filter will be fairly easy, the other parts probably aren't worth it yet - rough sketch below)
  • some sort of index. I'm thinking that a hash table + vector index will be very fast to build and also be usable for replication.
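
The scan-query bit could look something like this (pull-streams; `log.stream` and the exact bipf calls are assumptions):

```js
var pull = require('pull-stream')
var bipf = require('bipf')

// full scan: stream every record out of the log, decode just enough of
// each one to apply the filter, and collect the matches. no index needed.
pull(
  log.stream(), // assumption: the log exposes a pull-stream of bipf buffers
  pull.filter(function (buf) {
    var c = bipf.seekKey(buf, 0, Buffer.from('content'))
    if (c < 0) return false
    var t = bipf.seekKey(buf, c, Buffer.from('type'))
    return t > -1 && bipf.decode(buf, t) === 'post'
  }),
  pull.collect(function (err, posts) {
    if (err) throw err
    console.log(posts.length, 'posts')
  })
)
```
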
@Anders %Fy1D+khqmfZSojDuM1X7A9gw3m5WsPMPWINuDXlNHYw=.sha256

Also, I realized that the current hashtable view clears a timeout and creates a new one on every write. (The idea here is to write only when things stop changing, but clearing and creating half a million timeouts actually takes some time.)

Adding:

// if a write is already scheduled, don't schedule another one
if(timer) return
// otherwise schedule a single write for the next tick
timer = setTimeout(function () {
  timer = null
  async.write(data)
})

means that the index gets built in 2 seconds instead of 7! that's a huge improvement! I also suspect this change will benefit flumeview-reduce too.

Do you have a PR for this? Would be interested in testing it.

@Dominic %llEVnI2C5hN7GGcKQbk1mdIW81bq+7vF3k/1tsJJnic=.sha256

@arj here you go: https://github.com/dominictarr/async-single/pull/1

@Anders %XzjZqBZ3+RwFfV4dXdj/Tr27IpRK9wCzsq4eSoqBn2s=.sha256
Voted @arj here you go: https://github.com/dominictarr/async-single/pull/1
User has not chosen to be hosted publicly
@Dominic %tgryNbEcpORs6m5sVJynMVhETyDT0XDGx5VZDwM3rc0=.sha256

Okay, made some progress building wasm. (It was not easy to get everything set up, but I have at least successfully compiled c to wasm.) I have run a bipf benchmark in wasm and it was several times faster than js! It seems clang has lots of compiler options and quirks that greatly affect the speed of the output. The default output wasn't faster than javascript, but I found a mode that was.

@Dominic %ze5auiSY7j8t2cBmpYv371SPIhLH+/nOrkAxqFWrGIs=.sha256

I've made some good progress with wasm. I've compiled the basic parts of the flume.c code I've been writing to wasm, and I just tested it with a node script around it. Node handles the IO: it reads 64k blocks into the wasm memory space, then I use wasm to scan for the values. Writing output is somewhat slower, but I can scan the entire database for one thread in half a second. This is very good news! It means it's possible to incrementally deploy bits of things written in wasm, and boost perf, without having to deal with another native addon.
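
The node wrapper is shaped roughly like this (a sketch: `scan` stands in for the real wasm export, the filenames are made up, and it assumes the module exports its memory):

```js
var fs = require('fs')

// node owns the IO: read the log in 64k blocks, copy each block into the
// wasm module's memory, and let the wasm code scan it
WebAssembly.instantiate(fs.readFileSync('flume.wasm')).then(function (m) {
  var wasm = m.instance.exports
  var memory = Buffer.from(wasm.memory.buffer) // view over wasm memory
  var block = Buffer.alloc(64 * 1024)
  var fd = fs.openSync('log.bipf', 'r')
  var pos = 0, n
  while ((n = fs.readSync(fd, block, 0, block.length, pos)) > 0) {
    block.copy(memory, 0, 0, n) // copy the block to offset 0 of wasm memory
    wasm.scan(0, n)             // hypothetical export: scan n bytes
    pos += n
  }
  fs.closeSync(fd)
})
```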

@Anders %E4zNX04WwBYlksEkcY2mDOPXaGO3JeO1GINMlDspiwc=.sha256
Voted I've made some good progress with wasm. I've compiled the basic parts of [f
@andrestaltz %2bgBbDkJb9QdqZCflkoGtk+k69GKHBwjIeuSTmaB3gA=.sha256

Some results on mobile:

@dominic

init

records, mb, seconds
101, 0.03850746154785156, 0.167

indexes

seconds, key, log, cts, clk, tyt, tya, rtt, cta, aty, att
reindex, key, 39975, -1
ended, key, 100, 39974
reindex, log, 39975, -1
reindex, cts, 39975, -1
reindex, clk, 39975, -1
reindex, tyt, 39975, -1
reindex, tya, 39975, -1
reindex, rtt, 39975, -1
reindex, cta, 39975, -1
reindex, aty, 39975, -1
reindex, att, 39975, -1
indexes/log:compacting [ 100 ]
sort time: 14 14 14 100
indexes/cts:compacting [ 100 ]
indexes/clk:compacting [ 100 ]
indexes/tyt:compacting [ 100 ]
indexes/tya:compacting [ 100 ]
indexes/aty:compacting [ 100 ]
indexes/att:compacting [ 100 ]
sort time: 15 29 15 100
sort time: 26 55 26 100
sort time: 36 91 36 100
sort time: 22 113 36 100
indexes/rtt:compacted! NaN undefined
undefined
ended, rtt, 100, 39974
indexes/cta:compacted! NaN undefined
undefined
ended, cta, 100, 39974
sort time: 31 144 36 100
sort time: 32 176 36 100
indexes/log:compacted! 213 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671080.idx' ] } }
ended, log, 100, 39974
indexes/cts:compacted! 193 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671102.idx' ] } }
ended, cts, 100, 39974
indexes/clk:compacted! 193 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671103.idx' ] } }
ended, clk, 100, 39974
indexes/tyt:compacted! 195 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671103.idx' ] } }
ended, tyt, 100, 39974
indexes/tya:compacted! 195 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671104.idx' ] } }
ended, tya, 100, 39974
indexes/aty:compacted! 196 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671105.idx' ] } }
ended, aty, 100, 39974
indexes/att:compacted! 196 100
{ input: [ { size: 100, latest: 39974 } ],
  count: 2,
  total: 101,
  average: 50.5,
  since: 39974,
  meta: { since: 39974, index: [ '1547035671106.idx' ] } }
ended, att, 100, 39974
0.415, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974

query

(notice that it ends with an error)

seconds, key, log, cts, clk, tyt, tya, rtt, cta, aty, att
reindex, key, 0, 39974
ended, key, 0, 39974
reindex, rtt, 39975, -1
reindex, cta, 39975, -1
reindex, log, 0, 39974
ended, log, 0, 39974
reindex, cts, 0, 39974
ended, cts, 0, 39974
reindex, clk, 0, 39974
ended, clk, 0, 39974
reindex, tyt, 0, 39974
ended, tyt, 0, 39974
reindex, tya, 0, 39974
ended, tya, 0, 39974
reindex, aty, 0, 39974
ended, aty, 0, 39974
reindex, att, 0, 39974
ended, att, 0, 39974
indexes/rtt:compacted! NaN undefined
undefined
ended, rtt, 100, 39974
indexes/cta:compacted! NaN undefined
undefined
ended, cta, 100, 39974
0.054, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974, 39974
flumeview-query: _ in memory log or no indexes defined, will always use full scan, queries will likely be slow
{ since: { [Function: many] set: [Function], once: [Function], value: 39974 },
  get: undefined,
  methods: { get: 'async', read: 'source' },
  read: [Function: read],
  createSink: [Function: createSink] }
normalized-index:sorts 7 176 36
/data/data/se.manyver/files/nodejs-project/index.js:75774
      query.add(Index(indexes[k]))
            ^

TypeError: query.add is not a function
    at /data/data/se.manyver/files/nodejs-project/index.js:75774:13
    at next (/data/data/se.manyver/files/nodejs-project/index.js:81668:11)
    at /data/data/se.manyver/files/nodejs-project/index.js:81610:38
    at again (/data/data/se.manyver/files/nodejs-project/index.js:51544:26)
    at /data/data/se.manyver/files/nodejs-project/index.js:78830:9
    at /data/data/se.manyver/files/nodejs-project/index.js:51218:7
    at /data/data/se.manyver/files/nodejs-project/index.js:83541:5
    at /data/data/se.manyver/files/nodejs-project/index.js:42463:9
    at /data/data/se.manyver/files/nodejs-project/index.js:83546:5
    at /data/data/se.manyver/files/nodejs-project/index.js:78827:14

@andrestaltz %Yl6AwlCHDFxC4nJEw5Iivpo0fKDPMMAF5RWiJyjzT7s=.sha256

And I ran again, now with the mobile (empty) account having replicated my @andrestaltz account:

init

1677, 1.4109058380126953, 1.001
3809, 3.569573402404785, 2.008
5109, 4.938605308532715, 2.72
User has not chosen to be hosted publicly
@yi %sH0vW161QzmX3ToQnncGV3LPCMzbBjlMNbt9YUp8ECI=.sha256
Voted ## towards a _fast enough_ database This weekend I put a bunch of bits I h