%9UHm+xDJWVWRaKssB5oOmwXo1MMubNkiWBn8NfL0fmc=.sha256

@Dominic 6 years ago %9UHm+xDJWVWRaKssB5oOmwXo1MMubNkiWBn8NfL0fmc=.sha256

update: I spent some time today working on a new log format. Like flumelog-offset@3, the file is handled in blocks aligned to a fixed size to facilitate caching (e.g., 64k blocks), and records have a maximum size. Messages are framed by their length (as UInt16LEs), but unlike the flumelog-offset@3 format, records do not overlap blocks. A non-empty block always has a record at the start. If you want to append a record and there isn't enough space in the block, you pad the block (including a pointer back to the last item in the block) and start a new block. Lengths are little endian, because that is the native byte order of most of today's processors, so it is faster.

So an ordinary record is <length LE16><record><length LE16>. A padded block is <...records><block-1 LE16><zeros...><pointer to start of padding LE32>. The start of the padding is marked by one less than the block size (0xFFFF if you have 64k blocks). The padding ends with a pointer back to its own start, so it's easy to find the last record in a block. Together with the length on both sides of every record, this makes it easy to scan forwards and backwards.
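A minimal Node.js sketch of that layout, not the actual flumelog code. The names (`appendRecord`, `readForward`, `readBackward`, `PAD_OVERHEAD`) are hypothetical, and it makes three assumptions the post doesn't state: the LE32 pointer is relative to the start of the block, every block keeps 6 bytes free for the padding marker and pointer, and a length of 0 means unused space (so empty records are not allowed).

```javascript
// Sketch only: names and the assumptions below are illustrative, not the real format spec.
const BLOCK_SIZE = 64 * 1024
const PAD_MARKER = BLOCK_SIZE - 1 // 0xFFFF for 64k blocks
const PAD_OVERHEAD = 2 + 4 // padding marker (LE16) + trailing pointer (LE32)

// Append `record` at `offset` inside `block`.
// Returns the offset just past the record, or -1 if the block was padded instead
// (the caller should then start a new block and append there).
function appendRecord (block, offset, record) {
  const framed = 2 + record.length + 2 // <length LE16><record><length LE16>
  if (record.length === 0 || framed + PAD_OVERHEAD > BLOCK_SIZE) {
    throw new RangeError('record is empty or larger than the maximum size')
  }
  // Assumption: always leave room for the padding marker and pointer.
  if (offset + framed + PAD_OVERHEAD > BLOCK_SIZE) {
    block.writeUInt16LE(PAD_MARKER, offset)
    block.fill(0, offset + 2, BLOCK_SIZE - 4)
    // Assumption: the pointer is block-relative.
    block.writeUInt32LE(offset, BLOCK_SIZE - 4)
    return -1
  }
  block.writeUInt16LE(record.length, offset)
  record.copy(block, offset + 2)
  block.writeUInt16LE(record.length, offset + 2 + record.length)
  return offset + framed
}

// Scan forwards using the leading length of each record.
function readForward (block) {
  const records = []
  let offset = 0
  while (offset + 2 <= BLOCK_SIZE - 4) {
    const length = block.readUInt16LE(offset)
    if (length === 0 || length === PAD_MARKER) break // unused space or padding
    records.push(block.subarray(offset + 2, offset + 2 + length))
    offset += 2 + length + 2
  }
  return records
}

// Scan backwards through a padded block: follow the trailing pointer to the
// end of the last record, then use each record's trailing length.
// In a block that has not been padded yet the pointer is 0, so this returns [].
function readBackward (block) {
  const records = []
  let end = block.readUInt32LE(BLOCK_SIZE - 4)
  while (end > 0) {
    const length = block.readUInt16LE(end - 2)
    records.push(block.subarray(end - 2 - length, end - 2))
    end -= 2 + length + 2
  }
  return records
}

// Usage: two small records, then one that does not fit and forces padding.
const demo = Buffer.alloc(BLOCK_SIZE)
let next = appendRecord(demo, 0, Buffer.from('hello'))
next = appendRecord(demo, next, Buffer.from('world!'))
appendRecord(demo, next, Buffer.alloc(65520))
console.log(readForward(demo).map((r) => r.toString()))
console.log(readBackward(demo).map((r) => r.toString()))
```

Reserving the 6 bytes up front keeps the sketch simple, because a reader never has to guess whether the last 4 bytes of a block are a pointer or the tail of a record. The real format may handle an exactly full block differently.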

Also, with the binary format it's giving performance consistent with the above benchmarks!

To visualize this, imagine the log is a bookshelf: adding a record means adding a book, and adding a block means adding a shelf. In the old format, sometimes a book overlaps two shelves. In the new format, if a book can't fit in the space left on a shelf, there is a spacer, and the next book goes at the start of the next shelf. This makes the format slightly more complicated, but the code to read it is simpler, because it never has to deal with books that are on two shelves at once.

Also, since we know that there is always a record at the start of every block, we can just jump straight there! This means we can do a binary search over the 64k blocks, then iterate within the block it lands on. This would only really make sense for querying messages received in a particular time range, but still, it's something we can do without adding another index.
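A rough Node.js sketch of that search, under assumptions that are not from the post: the whole log sits in one Buffer whose length is a multiple of the block size, every record starts with a UInt32LE receive timestamp (a hypothetical layout), timestamps never decrease along the log, and only trailing blocks can be empty. `findFirstAtOrAfter` and `timestampAt` are made-up names.

```javascript
// Sketch only: the timestamp layout and all names here are assumptions.
const BLOCK_SIZE = 64 * 1024
const PAD_MARKER = BLOCK_SIZE - 1 // 0xFFFF marks the start of padding

// Timestamp of the record framed at absolute offset `at`.
function timestampAt (log, at) {
  return log.readUInt32LE(at + 2) // skip the LE16 length prefix
}

// Returns the offset of the first record with timestamp >= target, or -1.
function findFirstAtOrAfter (log, target) {
  const blocks = Math.floor(log.length / BLOCK_SIZE)

  // Binary search for the last block whose first record is older than target.
  // This works because every non-empty block has a record at its start.
  let lo = 0
  let hi = blocks - 1
  let start = 0
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    const at = mid * BLOCK_SIZE
    const empty = log.readUInt16LE(at) === 0
    if (!empty && timestampAt(log, at) < target) {
      start = mid
      lo = mid + 1
    } else {
      hi = mid - 1
    }
  }

  // The answer is either inside that block or the first record of the next one.
  const last = Math.min(start + 1, blocks - 1)
  for (let b = start; b <= last; b++) {
    let offset = b * BLOCK_SIZE
    const limit = offset + BLOCK_SIZE - 4 // last 4 bytes hold the padding pointer
    while (offset + 2 <= limit) {
      const length = log.readUInt16LE(offset)
      if (length === 0 || length === PAD_MARKER) break // unused space or padding
      if (timestampAt(log, offset) >= target) return offset
      offset += 2 + length + 2
    }
  }
  return -1
}
```

The binary search only ever reads the first few bytes of a block, so on a large log it touches a logarithmic number of blocks before the linear scan, which is then bounded by one block's worth of records.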