You are reading content from Scuttlebutt
@Christian Bundy %A4807Ky0Usu2VIL2J3BczoNMiJooJyiD3V/9YdO1vMQ=.sha256

Mutability and flumedb

Riffing off of my issue:

I'm under the impression that some view backends (e.g. Level) give us the option of mutating a piece of data after the fact, which seems like it could be useful for encrypted groups, mutable messages, and blob content. Would it be possible to get a view like flumeview-level to regenerate views for a subset of messages rather than regenerating the entire view, or would this have unintended consequences?

  • encrypted groups: when you get access to a new group key, regenerate views for all private messages
  • mutable messages: when you get a type: edit message, regenerate the views for the referenced message
  • blob content: when you get a new blob, regenerate the views for each type: blob message that references that blob
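The three cases above can be sketched with one mechanism. This is a hypothetical illustration, not the real flumedb API (createToyView, reindex, and the log shape are all invented names): a view that can re-derive just the affected records when new information arrives, instead of rebuilding from scratch.

```javascript
// Hypothetical sketch, not the real flumedb API: a view that can
// re-derive a subset of records when new information arrives (a group
// key, an edit, a blob), instead of rebuilding the whole view.

function createToyView (mapFn) {
  const state = new Map() // seq -> derived value

  return {
    add: (seq, value) => { state.set(seq, mapFn(value)) },
    get: (seq) => state.get(seq),
    // Re-run the map function over just the given seqs, pulling the
    // (possibly now-decryptable) values back out of the log.
    reindex: (log, seqs) => {
      for (const seq of seqs) state.set(seq, mapFn(log.get(seq)))
    }
  }
}
```

The point of the sketch is that regenerating "views for a subset of messages" only needs the affected sequence numbers plus read access to the log.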

cc: @dominic

I'm not very familiar with the intimate details of flumedb, but over time I've been getting more concerned about mutability in our append-only database. I've prototyped a hacky application that can delete data from the instance of flumelog-offset that powers Scuttlebutt, but it requires:

  1. Restarting Scuttlebutt.
  2. Destroying all views in ~/.ssb/flume/.
  3. Rebuilding views when you open an application that needs them.

It's probably possible to fix the first problem, but the other two seem hard: if you modify the log then you must rebuild the views that depend on it, but there's no way to tell which application generated the views in the first place. Unfortunately, mutability is required any time we're exposed to new information that would change a view:

  • If we block a user and delete their feed, the view won't reflect that.
  • If we receive a blob after receiving a #blob-content message then the view won't reflect that.
  • If we're added to a private group (or receive an unbox key) then the view won't reflect that.

The only tool we have to deal with this is to destroy all views and then rebuild them all when the right application is started. This means that you might open one application, wait five minutes for it to finish building views, and then have the same experience with a different application a week later because they may use different views (based on the same log).

While each feed may be append-only, I don't think that means our database has to be append-only as well. The ability to delete data from a flumelog seems critical for the scalability and maintenance of the network. I'd love some feedback on the feasibility of various solutions, or ideas on how to solve this!

  • Implementing a flumelog.delete() method that takes a sequence number (or array of seqs) as input.
    • Unfortunately I think this would have to be done on each flumelog, but otherwise seems simple enough.
  • Implementing a way to delete individual messages from a flumeview without rebuilding the whole thing.
    • This would require that each flumeview ships with an inverse function that's capable of removing messages as well.
    • Rebuilding individual messages (for new blobs or unbox keys) would be triggered by removing and re-adding a message.
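The "inverse function" idea in the second bullet can be made concrete with a toy reduce-style view. All names here are invented for illustration: the view ships an unreduce alongside its reduce, so a single message can be backed out without rebuilding everything.

```javascript
// Sketch of the "inverse function" idea, with invented names: a tiny
// reduce-style view that ships an unreduce alongside its reduce, so
// one message can be backed out without a full rebuild.

function createReversibleTally () {
  let total = 0
  return {
    reduce: (msg) => { total += msg.votes },   // apply one message
    unreduce: (msg) => { total -= msg.votes }, // exact inverse
    value: () => total
  }
}
```

Note this only works when the reduce function actually has an inverse; a lossy reduce (e.g. `max`) couldn't ship one, which is the caveat behind "trivial" later in the thread.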

Sparked by %1bz0TXD... and %YMUkbRH....

#apps #scuttlebot #scuttleshell #flume #flumedb #against-consensus

:rocket:

@mikey %IL+ogF/+aO83i99V/EmBu+E9uZnW/sOq5VPhaI6DeAY=.sha256

/cc @Piet, %meil3xW..., %QRQaadQ...

@Christian Bundy %tvERTIWyLo+h8PIXTWsHBlyZD1DlFG3oBxIEzcQv8fQ=.sha256

@keks

The more I think about this, the more I wonder whether we should be appending changes to flumedb rather than content. For example, instead of appending msg to the flumelog, we would append { add: msg }, which would give us the option of message deletion with something like:

{
  remove: {
    id: msg.id
  }
}

If views respected these semantics, most of them could easily run a few small commands to delete individual messages, feeds, or even rebuild (with a remove and a subsequent add operation). Do you think that's too naive an approach? Or too similar to MongoDB? 🙃
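A minimal sketch of the operation-envelope idea, assuming the invented shapes { add: msg } and { remove: { id } } from above: a view just folds the operations in order, so deletion becomes one more appended entry rather than a mutation of the log.

```javascript
// Minimal sketch of the operation-envelope idea, assuming the invented
// shapes { add: msg } and { remove: { id } }: the view folds operations
// in order, so deletion is just another appended entry.

function applyOps (ops) {
  const byId = new Map()
  for (const op of ops) {
    if (op.add) byId.set(op.add.id, op.add)
    else if (op.remove) byId.delete(op.remove.id)
  }
  return byId
}
```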

I'll admit I'm not very familiar with CRDTs; if you have a learning resource you like, I'd love it if you could share.

@Christian Bundy %codU1kaafCOwI+biuV4EFwj6izjRYLgSWMrMdHvVj28=.sha256

flumedb.delete()

The above message was written under the incorrect assumption that flumedb.append() is our only interface into the flumeview API. After doing a bit more research, I think our best bet would be to implement deletion from logs and views with some new API surface area: maybe flumedb.delete()?

[image: flumedb.png]

Above I've put together my best impression of what's going on under the hood and which interfaces this might touch, and I think this could actually work. It might be useful to prototype feed deletion with flumelogs first, and if that's successful then I may start implementing it for flume views.

I'm likely getting ahead of myself here, but from a shallow glance at the above I think this idea might actually have legs. Are there any major problems that I should be aware of while hacking on this?

cc: @dominic

@Dominic %p++m9rdfZekCzz3sN2X8tu+MeYRzbP1CkVc63qyJyWw=.sha256

The more I think about this, the more I wonder whether we should be appending changes to flumedb rather than content.

Yes! that is how you should interpret what we are already doing.

@Christian Bundy %qG24F9Cl1FDtn8VBpUGzzGdHPr5TT0aECPpc1NZ4tgI=.sha256

@dominic

Do you think something like flumedb.delete() would be viable, or is there a better way to delete/rebuild individual messages?

@Dominic %6GzDVmtO3lVs1uPKo+Z1oYS9/s75wHH1ezhqQu0hyLE=.sha256

@christianbundy well, the thing that worries me is the spiraling concerns about who has the right to delete what content. I don't think there is a simple answer. The obvious approach is that you can delete your own content - but once your content is on my machine, I ought to have some say. It may affect me now - for example, say Alice threatens Bob, waits till Bob sees that, then deletes the message. Or say a politically powerful person deletes a message which causes a scandal - maybe they should be held to the record? So peers should have some say in whether something is really deleted. On the other hand, if anyone can delete anything, what if someone deletes something you want, for reasons you do not agree with? Basically, it makes it really complicated, and if you respect everyone's right to have a different opinion on a particular deletion... you end up with something that isn't gonna work very well... and at worst a Streisand Effect button, when someone makes the deleted-messages view.

On the other hand, I think removing an entire feed after you've blocked them is straightforward,
and suitable for removing offensive content - remove the whole feed, and refuse to host it, rebuild indexes.

On the other hand, for fixing typos, etc (which I believe will be the majority of the use of deleting a message) I think edit messages will be fine, and don't complicate the protocol.
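For the typo case, a hypothetical shape for a type: edit message (not a finalized schema - the root field and ids here are invented for illustration), plus a resolver that prefers the newest edit referencing the original:

```javascript
// Hypothetical shape for a type: edit message (not a finalized
// schema), plus a resolver that prefers the newest edit referencing
// the original. All field names here are illustrative.

const original = { id: '%orig', content: { type: 'post', text: 'Helo world' } }
const edit = { id: '%edit', content: { type: 'edit', root: '%orig', text: 'Hello world' } }

function resolveText (msg, edits) {
  const mine = edits.filter((e) => e.content.root === msg.id)
  // Assume edits arrive in log order, so the last one wins.
  return mine.length ? mine[mine.length - 1].content.text : msg.content.text
}
```

This keeps the log append-only: the original message is never touched, and readers apply edits at display time.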

@andrestaltz %qJQ4fWGgtXjeoFzZJ54vBjZy8TdYm2GECAaUdTEua3U=.sha256

Found the cypherlink: %lvyOzx/...

@Christian Bundy %Kw7lqhqn9szBAPb59HLB3lu12mR57rzIrG1XQ001Fgg=.sha256

@dominic

spiraling concerns about who has the right to delete what content

I'm only talking about deleting from flumedb on my machine, although if we were talking about network deletion over Scuttlebutt I'd absolutely share your concerns. I'm working on this, which seems straightforward, although I agree that trying to propagate deletion across the network would require loads of nuance.

For what it's worth, this thread is a continuation of our discussion here, the only change is that I'm trying to avoid a full rebuild of each flumeview when we only delete a single message. Do you think that's viable, or do you think there's a better way to implement this?

@Christian Bundy %Nz/iy8zhmxlIICoWNHxXFunPx3+E+p/O5a4vo0qI34o=.sha256

I took a few hours today and wrote two small components:

  • an in-memory flumelog that acts like flumelog-memory without the filename argument
  • a small flumeview (which depends on the above flumelog) that just converts each item to boolean with !!

They're both impressively useless, but I feel like I have a much better handle on how things are working under the hood. My log supports deletion via flumelog.del(), but I'm unclear on two details of the implementation:
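A sketch along the lines of that toy pair (names are illustrative, not the flume API): an in-memory log with del(), plus a "view" that converts each item to a boolean with !!.

```javascript
// Toy pair, illustrative names only: an in-memory log with del(), and
// a "view" that converts each item to a boolean with !!.

function createMemoryLog () {
  const items = []
  return {
    append: (value) => items.push(value) - 1, // returns the new seq
    get: (seq) => items[seq],
    del: (seq) => { items[seq] = undefined }  // tombstone the record
  }
}

function booleanView (log, seqs) {
  return seqs.map((seq) => !!log.get(seq))
}
```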

since

Is since meant to track the number of messages in the log or the number of operations executed? Having flumelog.del() bump since seems to make the most sense, but I don't want to overload any semantics.

deleteSink()

Speaking of overloading semantics, I'm thinking that a theoretical flumedb.del() would work like this: check for flumelog.deleteSink(); if it exists, pass it the { seq, value } item to be deleted, and if it doesn't exist, run flumelog.destroy() and force a full rebuild.

This would give us an optional deletion method that matches the creation method, the only problem is that it changes the semantics behind flumelog.createSink() so that "create" is a noun rather than a verb.

I was originally thinking flumelog.del() or flumelog.delete(), but I'm really enjoying the createSink() pattern and I think the consistency would be nice. With that said, I don't have any strong opinions on how this is named or implemented.
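The dispatch described above could look something like this (all method names are the invented ones from this message, not an existing API): prefer a targeted deleteSink() when the view provides one, otherwise destroy and force a full rebuild.

```javascript
// Sketch of the dispatch described above, with invented method names:
// use a view's deleteSink() when it exists, otherwise fall back to
// destroying the view and forcing a full rebuild.

function deleteFromView (view, item) {
  if (typeof view.deleteSink === 'function') {
    view.deleteSink(item) // targeted deletion of one { seq, value }
    return 'partial'
  }
  view.destroy() // no inverse available: rebuild from scratch
  return 'rebuild'
}
```

Making deleteSink() optional means old views keep working unchanged; they just pay the full-rebuild cost.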

@mikey %MEZOrE37EyY5lUz3etEyBxNF89EUzWzEKQXO8pQmdpI=.sha256

@Christian Bundy

Is since meant to track the number of messages in the log or the number of operations executed? Having flumelog.del() bump since seems to make the most sense, but I don't want to overload any semantics.

my naive understanding is that since was meant to be a sequence number, like what you'd pass into flumedb.get(sequence, callback). i might be totally wrong though.

@Dominic %cbivnB/VcyNGOXE3yhuWakPRck9nwljwhlzD4SrGml0=.sha256

@christianbundy since is the offset of the last record added to the log. In different flumelogs this is a different thing: in flumelog-offset and flumelog-aligned-offset, it's the byte offset in the log file where that record starts; in flumelog-memory it's the integer index of that record; in flumelog-level it's the local timestamp when that record was added, which is used as the key.

Following that logic, deleting a record shouldn't change since.

hmm, a way simpler way than streaming deletes to the views, would be to filter out deletes from the views when querying them. views such as flumeview-level call to the log to get the message - if it's not there, drop it, done.

aggregation views like flumeview-reduce would need to be rebuilt, though.
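The filter-on-read approach can be sketched in a few lines (stand-in names, not flumeview-level's actual internals): the view keeps its keys, lookups go back to the log, and anything missing from the log is silently dropped from query results.

```javascript
// Sketch of the filter-on-read approach, using stand-in names: the
// view keeps its keys, lookups go back to the log, and records that
// are gone from the log are dropped from query results.

function queryView (viewSeqs, log) {
  return viewSeqs
    .map((seq) => log.get(seq))
    .filter((value) => value !== undefined) // deleted from log -> drop
}
```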

@rabble great, I'm on board with deleting blobs, including subscription-based moderation on how that happens exactly. It still has some of the same problems, but it feels like a different level, since the expectation for blobs is not as high as for messages.

@Christian Bundy %1sVYSgSb3D739aAwcrzRc2RpvR31ow80ylErCShpcLM=.sha256

@dominic

hmm, a way simpler way than streaming deletes to the views, would be to filter out deletes from the views when querying them. views such as flumeview-level call to the log to get the message - if it's not there, drop it, done.

That's a really interesting idea, I hadn't thought of that. I think that makes a lot of sense, but I'm having trouble trying to grok how to:

  • know when all items deleted from the log have been deleted from the view(s)
    • because since won't change, I'm not sure how we could be sure items are deleted
    • because items are only deleted on get(), could a rarely-touched item survive indefinitely?
  • rebuild all views for a thread when we receive information to correctly build a view
    • because of the sub-list above, could we be sure the view for an item is correctly rebuilt?
    • for example:
      • unbox keys (encrypted groups)
      • message edits (mutable messages)
      • blobs (off-chain content)

Also, the bit about since in logs makes so much sense. Do flumeviews follow the same pattern? For example, do they always increase as an integer, or do they follow the format of the log (e.g. a timestamp)? I was originally thinking that rebuilds could be triggered by del() and an immediate re-append(), but I just noticed this in the flumedb readme:

a flumeview must process items from the main log in order, otherwise inconsistencies will occur

Are these inconsistencies dangerous, or are they things like item.seq + item.value varying because there's no guarantee that you'll always have the same seq and value matched up?


@regular

That's so rad, thanks for pointing that out! I love that you're forging ahead and you've already worked on reversible reduce functions. Your message didn't come in until I'd already started this message, but here's what I originally wrote in reply to what you quoted:

True, it's possible to write the inverse of some [trivial] reduce functions, but I'm happy to take some easy performance wins before even trying to think about that.

P.S. You may already know, but on the latest Patchbay/Patchwork you can start your fenced code block with ```js and it should do syntax highlighting. It should also work with most other languages.

@Christian Bundy %/IDvl5y9iAF+35644SmBrAE0ZJvcksHLL16MoGTnpfc=.sha256

Had another thought about this today. One thing that might reduce rebuild times would be storing the state of each feed independently. This would have a few benefits:

  • Index the important things first
    • Index your messages first
    • Index only the people that you follow
  • Lazy-load everything else
    • Friends of friends get indexed when you need it
    • Or, more likely, very slowly in the background
  • Rebuild indexes for one feed only
    • Did Alice delete a message? Rebuild her index
    • Is Bob's index corrupt again? Rebuild it now

There are probably indexes that won't be able to save each feed's state separately, but my guess is most of them could do this just fine. I think the only universal constraint that we can't ignore is the number of file descriptors we have available.
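The per-feed idea above can be sketched with invented names: each feed's derived state lives under its own key, so a single feed can be dropped (after blocking) or re-indexed without touching the rest.

```javascript
// Sketch of per-feed view state, with invented names: each feed's
// derived state lives under its own key, so one feed can be dropped
// (after blocking) or re-indexed without touching the others.

function createPerFeedView (mapFn) {
  const feeds = new Map() // feedId -> Map(seq -> derived value)
  return {
    add: (feedId, seq, value) => {
      if (!feeds.has(feedId)) feeds.set(feedId, new Map())
      feeds.get(feedId).set(seq, mapFn(value))
    },
    dropFeed: (feedId) => feeds.delete(feedId), // e.g. Alice's rebuild
    get: (feedId, seq) => {
      const feed = feeds.get(feedId)
      return feed && feed.get(seq)
    }
  }
}
```

The file-descriptor constraint mentioned above would show up here if each feed's state were a separate file or sub-database rather than a key prefix in one store.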
