in-place format performance
I did some experiments to test the proposed in-place format.
Earlier experiments indicated it to be very fast indeed, so I started to wonder how fast it would be if it just scanned the entire database. I figured a reasonable test case was to look for all messages with a particular content.root, since that defines a thread.
I had previously observed that something was amiss with the performance of flumelog-offset. Unix commands such as cat, wc and grep can process a file of several hundred megabytes in well under a second (once the fs cache is hot, which of course makes testing quite difficult). But streaming the whole file through flumelog-offset takes several seconds.
In my test script, I took a copy of my entire .ssb/flume/log.offset file and converted it to in-place-read binary. Streaming the entire thing through the JSON parser and checking for value.content.root == msg_id took about 9 seconds, but the binary encoded one took 20 seconds. (All tests were run with a hot cache, since I don't have a reliable way to cool the cache.) Since I had found in-place-read to be really fast in the previous tests, this surprised me.
However, there are a couple of layers involved in reading the framing in the log file, so I implemented a simpler stream parser: instead of returning sliced buffers, it just calls a method with (buffer, start, end) parameters. Once I got this working, it scanned the JSON format log in 5 seconds... and the binary format log in 0.8 seconds! That means there is definitely some extra stuff that doesn't need to be happening in flume land. But also, very promising results for the in-place format.
Note: I think I may have found a reliable way to clear the cache (may be particular to my machine). If I read one large file a couple of times, it becomes way faster. Then, after reading a different file, that first file becomes slower again. I could just graph the results of the Nth consecutive run of the benchmark, and that would show the interaction of the benchmark with the cache.
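A rough sketch of how the per-run timings could be collected for such a graph. The helper name timeRuns is hypothetical, and the workload here is a placeholder; in practice fn would be the benchmark scan over the log file.

```javascript
// Time `runs` consecutive calls of fn and return the elapsed
// milliseconds of each call, in order (index 0 is the first run).
function timeRuns(fn, runs) {
  const timings = []
  for (let i = 0; i < runs; i++) {
    const t0 = process.hrtime.bigint()
    fn()
    const t1 = process.hrtime.bigint()
    timings.push(Number(t1 - t0) / 1e6) // nanoseconds -> milliseconds
  }
  return timings
}

// Placeholder workload; swap in the real scan of the log file.
const timings = timeRuns(() => {
  let sum = 0
  for (let i = 0; i < 1e6; i++) sum += i
  return sum
}, 5)
timings.forEach((ms, n) => console.log('run ' + (n + 1) + ': ' + ms.toFixed(2) + ' ms'))
```

Plotting the timings by run number, and interleaving a read of a different large file between batches, should show how warm the cache was on each run.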
(okay, let me tidy this stuff up and publish it...)