@Dominic %Gs2NhjWxbNQrsOu1AtL4w8x7BudcvbE5CY8Uc+14DA4=.sha256

benchmarking flumelog-offset

In recent performance experiments, I started to realize that flumelog-offset (the module that manages the main storage of messages in ssb) has significant problems.

First I tried creating a binary format optimized for in-place access, and found reading and accessing a couple of fields to be significantly faster than JSON parsing an entire record.
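For concreteness, here is a rough sketch of the contrast (the field names and offsets are made up for illustration; this is not the actual codec):

```js
// In-place access: read just the bytes you need straight out of a Buffer.
var record = Buffer.alloc(64)
record.writeDoubleLE(Date.now(), 0) // hypothetical: timestamp stored at offset 0
record.writeUInt32LE(42, 8)         // hypothetical: sequence number at offset 8

var timestamp = record.readDoubleLE(0) // no parsing of the rest of the record
var sequence = record.readUInt32LE(8)

// JSON access: the whole record must be parsed before touching any field.
var json = JSON.stringify({ timestamp: Date.now(), sequence: 42, content: {} })
var parsed = JSON.parse(json)
var timestamp2 = parsed.timestamp
```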

Next I tried using that as the codec in flumelog-offset, but I was disappointed to see that the improvement wasn't as big as in the direct comparison with JSON. Today I made a benchmark to test each layer in the offset format, to figure out where the slowness came from.

flumelog-offset_linegraph.png

I discovered a couple of things: block size makes a significant difference, and we weren't using a good block size (16k by default; it should probably be 64k). Also, since we use JSON, that overwhelms the block size effect - we weren't noticing the effect of block size because JSON was slowing things down even more.

flumelog-offset_barchart.png

But also: just reading the blocks is surprisingly slow! I think the log layer (reading records but not parsing them) could be closer to the raw block-reading speed.

That led me to implementing a simpler parser for flumelog-offset, flumelog-offset-parser, which, when used with the proposed binary format, is significantly faster.

barchart2.png

Note, the vertical axis is milliseconds - the difference is between waiting under 1 second and waiting 6 seconds.

Additional analysis at bench-flumelog-offset

All tests were conducted on a Lenovo X201 with a spinning disk, but with a warmed OS filesystem cache.


The major TODO here: rewrite flumelog-offset for performance.
The time to scan the entire log file should be close to the time to just read the file. /cc @arj @C-Keen

@Anders %dfcC767AhfyiLcoUxsaJ/ygq0J2fVJDq/S/qw8dVOJc=.sha256

This very much matches my experience when delving into flume.

The code also didn't have a fast path for the normal case where the data you need has already been read, but that could be obscured by JSON parsing and block size, as you pointed out.
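A minimal sketch of what such a fast path could look like (the names and structure here are hypothetical, not the actual flume code):

```js
// Hypothetical block reader with a fast path: if the aligned block containing
// the requested offset is already cached, serve it without touching the disk.
var fs = require('fs')

var BLOCK_SIZE = 64 * 1024
var cache = new Map() // blockIndex -> Buffer

function readBlock (fd, blockIndex, cb) {
  var cached = cache.get(blockIndex)
  if (cached) return cb(null, cached) // fast path: no I/O at all

  var block = Buffer.alloc(BLOCK_SIZE)
  fs.read(fd, block, 0, BLOCK_SIZE, blockIndex * BLOCK_SIZE, function (err) {
    if (err) return cb(err)
    cache.set(blockIndex, block)
    cb(null, block)
  })
}
```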

I'll have a look at the code tomorrow. Very good progress! I'm excited.

@Dominic %Vq60k4fWMn6mNq62LLw0pzTf4EmLSHHmvumTti6g5Bc=.sha256

@keks the high-perf "stream" measurement actually parses the same flumelog-offset format. The problem is that the main implementation isn't very good (mmap, or seek and read). I read into fixed-size, aligned blocks (hence why I could parameterize the block size) and then read the records out of those. In the current format a record sometimes overlaps two blocks.
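To illustrate the overlap problem (the block size and framing here are illustrative, not the exact flumelog-offset@3 layout):

```js
// A record occupying bytes [offset, offset + length) overlaps two blocks
// whenever its first and last byte fall into different aligned blocks.
var BLOCK_SIZE = 16 * 1024

function overlapsBlockBoundary (offset, length) {
  return Math.floor(offset / BLOCK_SIZE) !==
         Math.floor((offset + length - 1) / BLOCK_SIZE)
}

// e.g. a 300-byte record starting 100 bytes before the end of a block:
overlapsBlockBoundary(BLOCK_SIZE - 100, 300) // => true: two reads must be stitched together
```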

I do have an idea for a format that always aligns records to the start of a block (with an empty gap at the end). This would probably be easier to implement than dealing with overlaps, anyway.

But the main thing here is just creating a lighter implementation than the current one.

@mix %ETz9A71d92jdNfg/6vNBOD9jqXymGSiqO875oGbGRAM=.sha256

bits of this are over my head, but I love the graphs and reading along.
Sounds really rad, thanks for sharing

@Anders %Tzokz/IJiRcNMk8qb2YoakF21Lrg0ICKjPUAC90lImE=.sha256

I'm trying to replicate your results locally to see how performance is on my machine. How did you create the graphs? I ran all.sh and modified the number of rounds a bit as my machine is really slow :-)

File is around 3.5k, blocks 4.2k, raw log 9.6k, json 25k

@kieran %zw7fNjiANXedDXan7SqVkEW9UhJmogYtu6SCKSmAvzc=.sha256

I want to understand, but this is also going largely over my head.

@Dominic %9UHm+xDJWVWRaKssB5oOmwXo1MMubNkiWBn8NfL0fmc=.sha256

Update: I spent some time today working on a new log format. Like flumelog-offset@3, the file is handled in blocks aligned to a fixed size to facilitate caching (e.g. 64k blocks), and records have a maximum size. Messages are framed by their length (as UInt16LE's), but unlike the flumelog-offset@3 format, records do not overlap blocks. A non-empty block always has a record at the start. If you want to append a record and there isn't enough space in the block, you pad the block (including a pointer back to the last item in the block) and start a new block. Lengths are little-endian, because that is faster for today's processors.

So an ordinary record is <length LE16><record><length LE16>. Padding is <...records><block-1 LE16><zeros...><pointer to start of padding LE32>. The start of the padding is marked with a value one less than the block size (0xFFFF if you have 64k blocks). The end of the padding is a pointer to the start of the padding, so it's easy to find the last record in a block. This makes it easy to scan forwards and backwards.
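Here is a minimal sketch of how appending could work under that scheme, based purely on the description above (it is not Dominic's implementation, and it ignores edge cases such as a block with fewer than 6 free bytes):

```js
var BLOCK_SIZE = 64 * 1024
var PAD_MARKER = BLOCK_SIZE - 1 // length value meaning "rest of this block is padding"

// A toy in-memory log: `offset` is where the next byte will be written.
var log = { chunks: [], offset: 0 }
function write (buf) {
  log.chunks.push(buf)
  log.offset += buf.length
}

function append (record) {
  // frame the record as <length LE16><record><length LE16>
  var framed = Buffer.alloc(2 + record.length + 2)
  framed.writeUInt16LE(record.length, 0)
  record.copy(framed, 2)
  framed.writeUInt16LE(record.length, 2 + record.length)

  var free = BLOCK_SIZE - (log.offset % BLOCK_SIZE)
  if (framed.length > free) {
    // pad out the block: <PAD_MARKER LE16><zeros...><start of padding LE32>
    var pad = Buffer.alloc(free) // already zero-filled
    pad.writeUInt16LE(PAD_MARKER, 0)
    // interpreted here as the within-block offset where the padding begins
    pad.writeUInt32LE(log.offset % BLOCK_SIZE, free - 4)
    write(pad)
  }
  // the record now either fits in the current block or starts a fresh one
  write(framed)
}

append(Buffer.from('hello'))
```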

Also, with the binary format it's giving performance consistent with the above benchmarks!

To visualize this, imagine the log is a bookshelf: adding a record means adding a book, and adding a block means adding a shelf. In the old format, sometimes a book overlaps two shelves. In the new format, if a book can't fit on one shelf, there is a spacer and the next book starts at the beginning of the next shelf. This makes the format slightly more complicated, but the code to read it is simpler, because it never has to deal with books that are on two shelves at once.

Also, since we know that there is always a record at the start of every block, we can just jump straight there! This means we can do a binary search over 64k chunks, which then becomes iteration within a single block. This would only really make sense for querying messages received in a particular time range, but still, it's something we can do without adding another index.
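A sketch of how that block-level binary search could look (the helpers for reading the first record of a block and extracting its receive timestamp are hypothetical):

```js
// Binary search over block starts: find the last block whose first record has
// a timestamp <= the target, then scan the records inside that block.
function findBlock (target, numBlocks, firstRecordOfBlock, timestampOf) {
  var lo = 0, hi = numBlocks - 1, found = 0
  while (lo <= hi) {
    var mid = (lo + hi) >> 1
    if (timestampOf(firstRecordOfBlock(mid)) <= target) {
      found = mid   // the target could be in this block or a later one
      lo = mid + 1
    } else {
      hi = mid - 1  // the target is in an earlier block
    }
  }
  return found      // iterate within this block to find the exact record
}
```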

@Dominic %adwjSt3aQnDE5vfoDuk6KnovJHuANYFBSa6G2Y5y1Uo=.sha256

Okay, I got this together enough to run it against the benchmark I used for flumelog-offset. Following are the results of using it with raw buffers (no JSON).

results for flumelog-offset:

| name | ops/second | mb/second | ops | total-mb | seconds |
|------|------------|-----------|-----|----------|---------|
| append | 119387.461 | 15.1 | 1193994 | 151.021 | 10.001 |
| stream | 325871.724 | 41.217 | 1193994 | 151.021 | 3.664 |
| stream no cache | 427648.28 | 54.09 | 1193994 | 151.021 | 2.792 |
| stream10 | 421363.363 | 53.295 | 4218690 | 533.597 | 10.012 |
| random | 32891.11 | 4.16 | 328944 | 41.606 | 10.001 |

In particular, it can read back the stream at 41 mb/second.

And here are the results for flumelog-random-access-file:

| name | ops/second | mb/second | ops | total-mb | seconds |
|------|------------|-----------|-----|----------|---------|
| append | 147213.757 | 18.623 | 1472432 | 186.272 | 10.002 |
| stream | 1084265.095 | 137.166 | 1472432 | 186.272 | 1.358 |
| stream no cache | 1505554.192 | 190.462 | 1472432 | 186.272 | 0.978 |
| stream10 | 1904335.23 | 240.91 | 14724320 | 1862.721 | 7.732 |
| random | 447818.718 | 56.652 | 4478635 | 566.578 | 10.001 |

41 mb/s becomes 137 mb/s - over 3x faster! stream10 is about 4.5x faster, and reading random records is over 10x faster!

Using JSON the difference isn't as big... but it's still good: 16 mb/s streaming for -offset, and 25 mb/s streaming for -random-access-file.

So, I want to use this with an in-place format, and see how much difference it makes once the whole application is running like that.

@dan %SVl+KprJ+IprKTrp/uHRAFNxfIwuod7Xxq+tW17G/sg=.sha256

studio-gibli-dog-running-grass.gif

@Dominic %gxPMicZjl/2KUWu0zUlE5WOJDYM+wFZeCC5Vy5iz3CM=.sha256

@keks how big are your blocks? Is your format documented anywhere / can you point me to the code?

@Anders %HAheX9/amYKXLdvfQPIAZf0budfUhyo7Nr5L19dMj7Y=.sha256

@Dominic is that code from the latest results available as a branch somewhere?

I'm getting quite close to having well-tested patches ready for turbo-json-parse; I just wanted to make sure we are testing the same things.
