flumefind.c
Given a flumelog-offset file and a message id, finds the offset of that message in the log, by scanning in reverse through the log file.
Given a flumelog-offset file and a message id, finds the offset of that message in the log, by scanning in reverse through the log file.
@cel Thanks a lot for that C code. In this message I mentioned that I wanted to try playing with flume, so I've picked your code and reimplemented that flumefind utility using #racket. It is much slower than your code of course, not only because it is interpreted but because it parses the json for each frame. Still, it works and I am happy.
@Dominic thanks for the help with format description. @dinoƧ𝔸Ⓤᖇ thanks for the Rust code even though my baby knowledge of Rust got me stumbling on some parts.
The main loop is quite cute:
(define (flumefind pos)
(define f (read-frame pos))
(if (equal? (message-id) (frame-key f))
(displayln (frame-start-position f))
(flumefind (- (frame-start-position f) 4))))
(flumefind i)
Regarding speed, this are the results from running on my Surface Go under WSL (which has a notoriously slow filesystem access):
Cel's C version:
soapdog@FafiGo$ time ./flumefind ./log.offset %5Pmvr9fijzjX/ouPf3pq+RfkvH5KX7RqYqN8RMltCDc=.sha256
676860406
real 0m0.256s
user 0m0.047s
sys 0m0.141s
SoapDog's Racket version:
soapdog@FafiGo$ time racket ./flumefind.rkt --id %5Pmvr9fijzjX/ouPf3pq+RfkvH5KX7RqYqN8RMltCDc=.sha256 ./log.offset
676860406
real 0m15.977s
user 0m11.422s
sys 0m4.547s
@SoapDog you can surely speed this up by testing the key with a regular expression and then parsing the json only if that works. That way you save the json parsing time, but also avoid allocating memory for the parsed structure.
Made a new version that allows you to either decode json or use RegEx. The speed gain is great. Below is the version running with decode-json?
set to true:
> (time (flume-find "%5Pmvr9fijzjX/ouPf3pq+RfkvH5KX7RqYqN8RMltCDc=.sha256" #:decode-json? #t))
676860406
cpu time: 19453 real time: 19810 gc time: 4096
And this is the version using RegEx:
> (time (flume-find "%5Pmvr9fijzjX/ouPf3pq+RfkvH5KX7RqYqN8RMltCDc=.sha256"))
676860406
cpu time: 7640 real time: 7772 gc time: 1000
I'm running from inside DrRacket so it doesn't count the overhead of starting Racket but it is surely impacted by the other stuff DrRacket is running inside it. Still, the RegEx version is using less than half the time to find the message, and I am not being clever in the code. I'm sure that if I knew more Racket I could speed this even more.
I wonder how a version built with SBCL (which is CL not Racket) or Chicken Scheme (which is Scheme not Racket) would run. Both are able to compile down to native machine code which should enable it to run in near C speed.
It would be fun if people recreated flumefind.c in their favorite language so that we had more examples to look at...
@SoapDog challenge accepted!
Just a reminder, if you want the most efficient lookup of a message in ssb's flumelog, you can use the hashtable stored at file keys.ht
to look up the offset of a message in the log by its id. That is what ssbget.c does. flumefind.c
just linearly scans the log in reverse to find a message. So it works best for recent messages. This fits the use case I had for it, which was to find the offset of a message I recently published, so that I could truncate the log file to that length and then regenerate indexes, thus attempting to delete that message (this being done while offline or before gossipping the message). I made the tool because more than once I had executed that procedure and it was a little tedious to manually calculate the offset. (And I didn't want to erase the whole flumelog).
Also another pointer: some of the code in flumefind.c
and ssbget.c
came from flumecat.c.
I'm glad this can serve as a cross-language reference "Rosetta Code"-style task. The flumelog-offset format is simple and useful for many SSB users so it is nice to see it implemented in more languages. Nice work @SoapDog.
Also, part of the reason the C code is fast is because it is using memory-mapped I/O (mmap). Although with I/O caching maybe this doesn't matter. Good luck with the challenge.
@cryptix have you cooked some go amazing gizmo for this? :-P
Languages that I'd love to see an implementation of this:
I will send a digital thank you note with an original bad haiku about the language you used for anyone who sends an implementation in any of these languages :-P