@aljoscha %EaCbf31VBcRlqXqAMGKKDjmx53voO7U39CLO3MGERlk=.sha256
Re: %ZzBw5zlHZ


My first intuition is to use a Bloom filter [...] but something better may be possible.

"Set reconciliation" might be the (annoyingly non-obvious) name of what you're looking for. See the related work section of this (chapter 5, page 50) for a literature overview that is fairly complete to the best of my knowledge (disclaimer: I wrote this).

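To make the term concrete, here is a minimal sketch of one way two peers could reconcile their sets: compare fingerprints of ranges and recurse only into the ranges that differ. It is not taken from the text linked above. The names `fingerprint` and `reconcile`, the XOR-of-hashes fingerprint, the integer items, and the `threshold` parameter are all assumptions for illustration, and both sets live in one process here, whereas a real protocol would exchange the fingerprints over the network.

```python
import hashlib


def fingerprint(items):
    """Order-independent fingerprint: XOR of the SHA-256 digests of all items."""
    acc = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc


def reconcile(a, b, lo, hi, threshold=2):
    """Return (missing_in_a, missing_in_b) for the integer items in [lo, hi)."""
    a_range = {x for x in a if lo <= x < hi}
    b_range = {x for x in b if lo <= x < hi}
    # Equal fingerprints: assume the ranges hold the same items and stop here.
    if fingerprint(a_range) == fingerprint(b_range):
        return set(), set()
    # Few items on either side: cheaper to exchange them directly.
    if len(a_range) <= threshold or len(b_range) <= threshold:
        return b_range - a_range, a_range - b_range
    # Otherwise split the range in half and reconcile each half separately.
    mid = (lo + hi) // 2
    left = reconcile(a, b, lo, mid, threshold)
    right = reconcile(a, b, mid, hi, threshold)
    return left[0] | right[0], left[1] | right[1]


alice = {1, 3, 5, 7, 200}
bob = {1, 3, 5, 8, 200, 900}
missing_in_alice, missing_in_bob = reconcile(alice, bob, 0, 1024)
```

Unlike a Bloom filter, this kind of scheme never reports an item as present when it is absent, short of a hash collision; the cost is several rounds of communication instead of one.
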
An interesting property of pull is that it's possible to ask a peer to look for a post they don't have yet.

Is this really inherent to pull/push? You can also have a push-based system in which pushing is suspended until the missing data becomes available; this is effectively what happens whenever data is live-streamed between peers in ssb. In a system where data is linearly ordered but individual data may be missing, you can of course also skip the missing data and continue pushing newer data rather than suspending, and eventually push the missing data should it surface.
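The two push strategies can be sketched roughly as follows. This is a toy model, not ssb's replication code: `push_suspending`, `SkippingPusher`, and the dictionary from sequence number to data are hypothetical names chosen for the example.

```python
def push_suspending(available, start, end):
    """Push sequence numbers start..end in order, stopping at the first gap."""
    sent = []
    for seq in range(start, end + 1):
        if seq not in available:
            break  # suspend until the missing datum becomes available
        sent.append(seq)
    return sent


class SkippingPusher:
    """Push whatever is available, and backfill gaps once they surface."""

    def __init__(self):
        self.sent = set()

    def push(self, available, end):
        """Return the sequence numbers pushed in this round."""
        batch = [
            seq
            for seq in range(1, end + 1)
            if seq in available and seq not in self.sent
        ]
        self.sent.update(batch)
        return batch


# Entry 3 is missing at first and surfaces later.
available = {1: "a", 2: "b", 4: "d", 5: "e"}
suspended_at_gap = push_suspending(available, 1, 5)

pusher = SkippingPusher()
first_round = pusher.push(available, 5)
available[3] = "c"
second_round = pusher.push(available, 5)
```

In both variants the receiver never asks for anything; the sender simply decides what to do about a gap.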

Another viewpoint: pure push ("you may send arbitrary amounts of data to me") is not practical, as no one can guarantee arbitrarily large amounts of resources to process those arbitrary amounts of data. Any sensible push system takes the form of "you may send arbitrary parts of this finite collection of data to me". Squint a bit and this is just pull with some additional implementation details: you request a single datum (the finite collection) and the means of receiving it can be split up over time.

I'm pretty much just rambling now, but the "2, 20, 20000?" question of yours resonated with me. I think I'm looking for a clarification of why "in case some got deleted before reaching me" seems to be problematic for you. (I'm not trying to say it isn't: I spent a good while trying to work out, conceptually, how efficient replication of ssb messages could work gracefully when only the metadata is required and arbitrary pieces of content are unavailable, and eventually gave up. See the #offchain-content tag if you have way too much time and want to dig into some discussions that happened on ssb around that topic. Unfortunately I don't think there are posts about the replication difficulties.)

this conceptually being a cache rather than a log

Do you still have sequence numbers in your model? If you replace sequence numbers with arbitrarily chosen ids, and allow overriding values in the resulting cache, you pretty much get the data model behind earthstar: each peer holds an updatable (i.e., mutable) mapping from (author, id) pairs to data.
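As a rough illustration of that mapping (a sketch of the idea only, not earthstar's actual API or storage format), the class name `Cache` and its methods below are made up for the example, and how peers resolve conflicting writes is deliberately left out:

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (author, id)


class Cache:
    """Mutable mapping from (author, id) pairs to data; writes overwrite."""

    def __init__(self) -> None:
        self.entries: Dict[Key, bytes] = {}

    def put(self, author: str, id_: str, data: bytes) -> None:
        # No sequence number: a later write to the same key replaces the value.
        self.entries[(author, id_)] = data

    def get(self, author: str, id_: str) -> Optional[bytes]:
        return self.entries.get((author, id_))


cache = Cache()
cache.put("alice", "profile", b"v1")
cache.put("alice", "profile", b"v2")  # overrides the earlier value
cache.put("bob", "profile", b"hello")
```

Because old values are simply replaced, there is no gap to detect when one is missing, which is the difference from a log with sequence numbers.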
