@Christian Bundy %Sv2DRcGmHb+YnxPFGGj2t7uoZ887JkNKWFn8/Yg/PmY=.sha256

How can we deploy new transforms to Scuttlebutt?

The JavaScript implementation has a database called flumedb where each message is passed to a handful of views, which extract the information they need and save it in their own little database. These views are isolated from each other, which is usually fine but presents a problem when you want to transform the messages before the view sees the data.

Problem

The most prominent use-case is that we always want to decrypt messages before the view sees the message, so we transform the data before it's passed to the view. This is great! But what happens if we want to deploy a different transform to the data? Some examples:

  • decryption with a different key
  • decryption with a different algorithm
  • decryption of a group chat we've just been invited to
  • my use-case: off-chain content

If we add one of the above then future messages that we receive will be correctly transformed before they're passed to a view, but previous messages will still be in their encrypted (or unlinked) state. Consider the following:

  1. Scuttlebutt deploys group chats!
  2. Alice upgrades her client
  3. Alice starts an encrypted group chat with Bob
  4. Bob upgrades his client
  5. Bob checks his group chat view and doesn't see anything

That's bad.
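
The failure mode above can be reproduced with a toy model. This is only a sketch: `createDb`, `addTransform`, `addView`, and the `boxed` field are hypothetical names chosen for illustration, not flumedb's actual API, and "decryption" is faked by unwrapping a field.

```javascript
// Toy model of a log whose messages pass through transforms before
// reaching views. All names here are hypothetical, not flumedb's API.
function createDb () {
  const log = []          // append-only list of raw messages
  const transforms = []   // functions applied in order before indexing
  const views = []        // each view sees a message exactly once

  return {
    addTransform (fn) { transforms.push(fn) },
    addView (view) { views.push(view) },
    append (msg) {
      log.push(msg)
      const transformed = transforms.reduce((m, fn) => fn(m), msg)
      views.forEach((view) => view.index(transformed))
    }
  }
}

// A view that only collects messages it can read as group chat posts.
const groupChatView = {
  posts: [],
  index (msg) {
    if (msg.type === 'group-post') this.posts.push(msg.text)
  }
}

// Stand-in for group decryption: "unboxes" a message marked as boxed.
const unboxGroup = (msg) =>
  msg.boxed ? { type: 'group-post', text: msg.boxed } : msg

const bob = createDb()
bob.addView(groupChatView)

// Alice's group message arrives before Bob upgrades: indexed still boxed.
bob.append({ boxed: 'hi Bob' })

// Bob upgrades. Only messages received from now on are transformed.
bob.addTransform(unboxGroup)
bob.append({ boxed: 'are you there?' })

console.log(groupChatView.posts) // the first message never shows up
```

In this model the raw message is still in the log, but nothing ever replays it through the new transform, so the view stays stale until it is rebuilt.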

Solutions

Rebuild

The recommended way to handle this is to delete your views and rebuild them from scratch. This is slow, resource-intensive, and frustrating for everyone involved. I've talked to Dominic about partial rebuilds where we could rebuild an individual message but he didn't seem keen, and this still wouldn't solve the problem of reduce functions where you can't just rebuild an individual message.
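
As a rough illustration of why reduce functions resist per-message rebuilds, consider the sketch below. The names `mapView`, `reduceView`, `indexMap`, and `indexReduce` are made up for this example and do not correspond to any flumeview module.

```javascript
// Map-style view: one entry per message key, so a single message can be
// re-indexed by overwriting its entry.
const mapView = new Map()
const indexMap = (msg) => mapView.set(msg.key, msg.type)

// Reduce-style view: a single accumulated value (message count per type).
const reduceView = {}
const indexReduce = (msg) => {
  reduceView[msg.type] = (reduceView[msg.type] || 0) + 1
}

const original = { key: 'a', type: 'encrypted' }
const redone = { key: 'a', type: 'post' } // same message, newly decrypted

;[indexMap, indexReduce].forEach((index) => index(original))
;[indexMap, indexReduce].forEach((index) => index(redone))

console.log(mapView.get('a')) // 'post': the stale entry was replaced
console.log(reduceView) // { encrypted: 1, post: 1 }: counted twice
```

The reduce view keeps no record of what each message contributed, so it cannot undo the old contribution without replaying the whole log.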

Soft-fork

We could deploy a small change to the protocol when we deploy the new transform function, which would stop old clients from replicating new messages so they wouldn't receive messages that their client couldn't understand. Once they update their client to the newest version, they'd receive (and transform) the new messages.

I'm not super excited about solving an implementation problem by soft-forking the protocol, but it's an option. This is a duct tape solution at best, but it might be better than forcing every client to do a full rebuild.

Radical change

Maybe these transforms should be their own views? Maybe views should be able to query each other? Maybe partial rebuilds on map views are worth trying? Maybe we should be using SQL? Maybe there's an option I haven't considered? I'd really love to be able to deploy these improvements without breaking everything in the process.

I don't know

I've been working a bunch on off-chain content and I'm really happy with it, but I'm concerned that deploying it to the JavaScript implementation is going to be a real challenge. The current options seem to be rebuild, fork, or refactor, and I'm not feeling particularly excited about any of these options. I'd love some advice on this.

cc: @dominic @mix @andré @SoapDog @mikey @arj @regular @happy

@Daan Patchwork %dxuuD7cztRbnvvs/w8MOM9NBzosOxVKa819HVQQjfLU=.sha256

Personally, I'd go with a full rebuild, and mitigate what's possible.
Assuming that sunrise-choir, go-ssb, and similar high-performance implementations will eventually form the backbone for most ssb apps, this should not be a huge problem.
At the moment, #ssb-patchql churns through a gigabyte worth of offset log in a few minutes and stays fully operational for read access in that time. I expect that to become even better over time, and with the "me frontier" offset log from the drank specs thread, it should not even have to be read-only for more than a minute or so.
So this might realistically become a non-issue soon. In the meantime, we can just rebuild.

just my two cents :)

@Christian Bundy %gjHcQV85pDgPpmcNd8lQ71qfH5FJ5HDZSYojAaQ2+fg=.sha256

One problem I should've mentioned with the rebuild: if clients implement different sets of transforms then some messages might be indexed with the correct transforms and others won't. For example, Patchbay could support group messages and do a huge rebuild only to have the user download some more messages with Patchwork that require another full rebuild once Patchbay is started again. Maybe this big rebuild might be a good time to switch to client-specific view directories?

@SoapDog %LnDiHFnrMp7YVvlVGCEFhhtYzFsqWFcFZ7693QE6En0=.sha256

Maybe a message should carry a field detailing the transforms that were applied to it so that at least a client could infer if it could support it or not. Also client specific views make a ton of sense to me.
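
A minimal sketch of that idea, assuming a hypothetical `transforms` field on the message and a client-side set of supported transform names (neither exists in the current message format):

```javascript
// Hypothetical message shape: `transforms` lists what a reader must apply.
const supported = new Set(['private-box'])

// Returns the names of transforms this client cannot apply to `msg`.
function unsupportedTransforms (msg) {
  return (msg.transforms || []).filter((name) => !supported.has(name))
}

const msg = { key: '%abc', transforms: ['private-box', 'group-box'] }
console.log(unsupportedTransforms(msg)) // [ 'group-box' ]
```

A client could use a check like this to record which messages it had to skip, so it knows what to revisit after an upgrade.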

@mix %bHMc04RqUUNix/Qr/Hf4y8RC0tVpv6R42Gsa/qlStck=.sha256

@christianbundy my gut says go all in on SQL - in retrospect it's kinda hilarious we've been storing relational data in a no-sql way.... and then building the relational meaning (clumsily) by doing a range of queries and reductions.

This thing changed my mind a lot : http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

@Rabble %OYFayYQwMx4e7Hh5L37fQGiC6U+tugbbUZghN6LgUKQ=.sha256

@mixmix whenever Sarah says something i listen. She's got mad engineering wisdom.

@Christian Bundy I think we definitely should do different views for the different apps. That would let you switch back and forth much more easily.

@Christian Bundy %MFpnXOOTweda24S/YROBcD7sad5rM/vIW5X5BX5dPB4=.sha256

@SoapDog

> Maybe a message should carry a field detailing the transforms that were applied to it so that at least a client could infer if it could support it or not.

I think this is a great idea, but in the current architecture I don't think it would help for the database to know that it can't support a transform. It would say "oops, I don't know how to handle this message" and pass it to the views, and we'd be in the same place where we need to do a full rebuild.

@mix

I'm super down with SQL, but I'm not sure how we'd be able to solve this particular problem with it without rewriting the majority of the stack. The problem is between the log and the view, so it seems like the only way to use SQL here would be to scrap or rewrite flumedb. The simplest way forward today seems to be baking off-chain content into the protocol (just like private-box decryption), but long-term it seems like the value of msg.value.content should be an application-level concern.

@Rabble

Yeah! I'd like to look into client-specific views, it should be pretty straightforward.

@mikey %butTGKDGqk1p61zsRhDktdf+wfCbpiLqrMooPlNx9Js=.sha256

my thinking is that we're on the right path, but the journey still requires more hard work before we reap all the sweet rewards. :smiley_cat:

what i mean, is that our data model is "event sourcing" ("every change to the data is an event"), our database model is "kappa architecture" ("we persist the events to an append-only log, we use the events to construct views that can efficiently and effectively answer questions (queries) about the data"). to this end, @dominic made flumedb (most similar database is Apache Kafka, but Flume is for a single machine not a distributed cluster), which provides abstract interfaces for a "FlumeLog" (how to persist the events to an append-only log) and a "FlumeView" (how to index the events to answer queries about the data).
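
To make the Log/View split concrete, here is a small in-memory sketch. `MemoryLog`, `AuthorCountView`, and `rebuild` are illustrative stand-ins written for this example; they are not the real FlumeLog or FlumeView interfaces.

```javascript
// Minimal in-memory stand-ins for the two abstractions. These are
// illustrative classes, not the actual flumedb interfaces.
class MemoryLog {
  constructor () { this.events = [] }
  append (event) { this.events.push(event); return this.events.length - 1 }
  stream () { return this.events.values() } // replay from the start
}

class AuthorCountView {
  constructor () { this.counts = {} }
  index (event) {
    this.counts[event.author] = (this.counts[event.author] || 0) + 1
  }
  get (author) { return this.counts[author] || 0 }
}

// Because the log is the source of truth, any view can be thrown away
// and rebuilt by replaying the log.
function rebuild (log, ViewClass) {
  const view = new ViewClass()
  for (const event of log.stream()) view.index(event)
  return view
}

const log = new MemoryLog()
log.append({ author: 'alice', text: 'hello' })
log.append({ author: 'bob', text: 'hi' })
log.append({ author: 'alice', text: 'how are you?' })

const view = rebuild(log, AuthorCountView)
console.log(view.get('alice')) // 2
```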

to me, i'd be very disappointed if we stopped using the flume abstractions (Log and View), since i think they are necessary complexity for our system, and so this is something i continue to maintain as our direction for the #sunrise-choir.

however i think there are some implementation details that we can improve on with regards to using flume:

  • using a FlumeLog doesn't mean we have to use flumelog-offset, i care about the abstract interface but not the current implementation
  • similarly, i find the current JavaScript FlumeViews (flumeview-level, etc) leave much to be desired, i reckon SQL is a match made in heaven for our FlumeViews
  • a FlumeLog's purpose is to persist the log, a FlumeView's purpose is to index the log. this means the FlumeView can be rebuilt at any time using the FlumeLog's persisted data. a common concern i've heard is that we're storing the data twice, in the Log and in the View(s), so it's inefficient and duplicating and why not just have one database for everything. then let's solve that problem without throwing out the baby with the bathwater: if the purpose of a Log is simply to persist, why not have a Log which heavily compresses the data? because to me it's fundamental that we can change Views at any time for any reason without complicated migrations.
  • and yes, every app (or ecosystem of apps who create a shared agreement) should have its own database, i don't think we should make decisions based on this still being the case
  • we should also expect a Scuttlebutt future with partial replication. last night at Art~Hack i was introduced to the latest thinking on this by @joshalja, i'm very excited for what could be possible in the future. i think this question of how to index Views out-of-order will continue to require effort, but i think the data concerns will be minimized.

@Christian Bundy %qHwrcaA6LRh5JhBdyD2aaHed2TDnVjlZH/BuMm4g/6M=.sha256

@mikey

Do you have any specific thoughts on how to deal with new transforms? I agree with all of the above, but I'm stuck on the problem of transforming data between the log and the views. The current solution doesn't give us a way to add new transforms in the future, which smells to me like we need a new way of handling decryption.

One solution might be views that pass data to other views, so you might have a private-box view that passes data to your ssb-friends view, but I'm unclear how we could handle multiple input views (private-box, group-box, off-chain content, etc). Maybe some sort of intermediate view that reduces the output of transform views?

[image: Screenshot from 2019-06-06 17-27-46.png]

This is the bit I'm unclear on: what do we do when we have a new transform without rebuilding every single view?

@jaccarmac %xcpxEiZjdtrXsndstADai79BCiURmW4ltWo+CC6R4Ts=.sha256

Perhaps I'm misunderstanding SSB's architecture (I have mainly admired event-stream architectures from a distance, but I'm not completely understanding the problem with so-called "input views". The data stream is more-or-less immutable, but there's no reason that views constructed from it need to be immutable, or streams, correct? That way only views in question would need to change, and even could change without being blown away entirely, on a data stream update. Of course, rebuilding from scratch is always possible given a single source of truthy data. I may be simply parroting @dinoƧ𝔸Ⓤᖇ's thoughts from above.

@jaccarmac %2zCXi7E3Up5L2s2YU1NMOhZcJsp0UzOqD1xDmIDCkJo=.sha256

whoopsie missed a paren there after distance, my bad

@mikey %ButTG9gkzt5lPL0RdhhiowZYvcUdXIEOGDdWeepo/Zs=.sha256

> Do you have any specific thoughts on how to deal with new transforms?

thinking out loud:

let's say we add a Transform abstraction to flume, where a Transform receives a Log (or transformed Log) and returns a transformed stream of "new" messages (decrypted with private box key, decrypted with group key, supplemented with off-chain content, etc). in this case, a "new" message might be an old message with a new transformation, for example if the group-box transform received a new key, it might check all messages it's previously seen that it was unable to decrypt, and push along any "new" decrypted messages.

as you propose, we might have multiple of these transforms, which probably will need to run in series rather than parallel (for example if a decrypted private message contains off-chain content), which together form a single transformed log that we feed into each view.

if a new transform is added, it invalidates everything after it in the chain (every further Transform and every View), which i think should be fine. but when i say "new transform" i mean a new way of transforming messages, i think every transform should be able to handle new keys or access to existing messages by keeping track of those it wasn't able to transform and being able to push along new messages when it can.

in this way, each View can no longer assume that it receives the messages in order. they must be able to process any "new" message in any order, in order to update their indexes.

but is this Transform different enough from a View to warrant a new abstraction? or should Views simply be able to depend on other Views, with the guarantee that any "new" message given to a View has already been processed by Views that it depends on. i think both are valid approaches, the one benefit of the Transform abstraction is that it is a stream of messages (like a Log), where a View is not necessarily such.
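
A small sketch of transforms running in series, under the simplifying assumption that each transform is a plain function from message to message. `decryptPrivate`, `attachOffChain`, `chain`, `invalidatedBy`, and the `boxed`/`offChain` fields are invented for this example.

```javascript
// Each transform is modeled as a function from message to message.
// `decryptPrivate` and `attachOffChain` are stand-ins, not real modules.
const offChainStore = { '&blob1': { type: 'post', text: 'hello' } }

const decryptPrivate = (msg) =>
  msg.boxed ? { key: msg.key, content: msg.boxed } : msg

const attachOffChain = (msg) =>
  msg.content && msg.content.offChain
    ? { key: msg.key, content: offChainStore[msg.content.offChain] }
    : msg

// Run transforms in series: the output of one is the input of the next.
const chain = (transforms) => (msg) =>
  transforms.reduce((current, transform) => transform(current), msg)

// A transform added at `position` invalidates everything downstream of it.
const invalidatedBy = (names, position) => names.slice(position + 1)

const msg = { key: '%1', boxed: { offChain: '&blob1' } }

const inSeries = chain([decryptPrivate, attachOffChain])(msg)
const wrongOrder = chain([attachOffChain, decryptPrivate])(msg)

console.log(inSeries.content) // { type: 'post', text: 'hello' }
console.log(wrongOrder.content) // { offChain: '&blob1' }: never resolved

console.log(invalidatedBy(['private-box', 'group-box', 'off-chain', 'friends view'], 1))
// [ 'off-chain', 'friends view' ]
```

The ordering matters because the off-chain reference only becomes visible after decryption, which is why parallel transforms would miss it.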

okay, enough rambles, hope that was at all helpful. :heart:

@Christian Bundy %bXT1g7pIVGZgJ4oKztHXmhQTPNMFZj+jNmcNIsheeTU=.sha256

@jaccarmac

Yeah, I think you're spot on. If we zoom out far enough it seems simple to say "when a new decryption scheme is added, pass the newly-decrypted messages to each view", but flume views expect to see each message once, in the order it was received, and the only option for mutability is rebuilding the whole view from scratch. This rebuild is especially painful for low-resource devices (think hours, not minutes) and it's important to me that we avoid it unless absolutely necessary.

@mikey

Yes! I think one of the big foundational changes here is that a message with the same key might be passed to a view multiple times with different content. Using the example from the thread root:

  1. Scuttlebutt deploys group chats!
  2. Alice upgrades her client
  3. Alice starts an encrypted group chat with Bob
  4. Bob's decryption view passes along the encrypted messages to the other views
  5. Bob upgrades his client
  6. Bob's decryption view streams all messages and passes any newly decrypted messages
  7. The other views receive these messages for the second time, now decrypted, and index them
  8. Bob checks his group chat view and sees the new group chat with Alice

One problem with chaining views is that the intermediate views (i.e. transforms) might need a higher since value than the log, since they might receive X messages but could send > X messages downstream to other views (in the form of updates).

The log might send:

{ key: 0, color: "red" }
{ key: 1, color: "green" }
{ key: 2, color: "blue" }
{ key: 3, color: { decryptedContent: "e2c692693217e4a2" } }
{ key: 4, color: { decryptedContent: "95fc83207e59f07f" } }
{ key: 5, color: "yellow" }
{ key: 6, color: "magenta" }
{ key: 7, decryptionKey: "1372dda6acbb8835" }

Which the decryption view would receive, and pass along to other views as:

{ key: 0, color: "red" }
{ key: 1, color: "green" }
{ key: 2, color: "blue" }
{ key: 3, color: { decryptedContent: "e2c692693217e4a2" } }
{ key: 4, color: { decryptedContent: "95fc83207e59f07f" } }
{ key: 5, color: "yellow" }
{ key: 6, color: "magenta" }
{ key: 7, decryptionKey: "1372dda6acbb8835" }
{ key: 3, color: "magenta" }
{ key: 4, color: "turquoise" }
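
The stream above could be produced by something like the following sketch. It reuses the message shape from the example (including the `decryptedContent` field name, which here holds the not-yet-decrypted value), while `createDecryptionView`, `decrypt`, and the lookup table are assumptions made so the example can run without real cryptography.

```javascript
// Stand-in for real decryption: a lookup table keyed by ciphertext that
// only "works" with the matching key. Purely illustrative.
const KEY = '1372dda6acbb8835'
const plaintext = {
  'e2c692693217e4a2': 'magenta',
  '95fc83207e59f07f': 'turquoise'
}
const decrypt = (ciphertext, key) =>
  key === KEY ? plaintext[ciphertext] : undefined

function createDecryptionView (emit) {
  const keys = []     // decryption keys seen so far
  const pending = []  // messages we could not decrypt yet

  const tryDecrypt = (msg) => {
    for (const key of keys) {
      const color = decrypt(msg.color.decryptedContent, key)
      if (color !== undefined) return { key: msg.key, color }
    }
  }

  return function receive (msg) {
    const boxed = typeof msg.color === 'object'
    const result = boxed ? tryDecrypt(msg) : undefined
    emit(result || msg) // always pass something downstream
    if (boxed && !result) pending.push(msg)

    if (msg.decryptionKey) {
      keys.push(msg.decryptionKey)
      // Retry everything that was waiting and emit only what changed.
      for (const old of pending.splice(0)) {
        const redone = tryDecrypt(old)
        if (redone) emit(redone)
        else pending.push(old)
      }
    }
  }
}

const out = []
const receive = createDecryptionView((msg) => out.push(msg))

const logMessages = [
  { key: 0, color: 'red' },
  { key: 1, color: 'green' },
  { key: 2, color: 'blue' },
  { key: 3, color: { decryptedContent: 'e2c692693217e4a2' } },
  { key: 4, color: { decryptedContent: '95fc83207e59f07f' } },
  { key: 5, color: 'yellow' },
  { key: 6, color: 'magenta' },
  { key: 7, decryptionKey: '1372dda6acbb8835' }
]
logMessages.forEach((msg) => receive(msg))

console.log(out.length) // 10: eight originals plus two updates
console.log(out.slice(8))
// [ { key: 3, color: 'magenta' }, { key: 4, color: 'turquoise' } ]
```

Eight messages go in and ten come out, which is the mismatch that makes a single shared since value awkward for intermediate views.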

I think the big changes here are:

  • views can depend on views
  • views can receive a message with the same key multiple times
  • views can receive messages out-of-order (probably? important for partial replication)
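
On the receiving side, a view would need to treat a repeated key as an update rather than a new message. One possible shape, with `createColorView`, `receive`, and `keysFor` as hypothetical names:

```javascript
// A view that tolerates updates: it remembers what it indexed per key so
// a later version of the same message replaces the earlier one.
function createColorView () {
  const byKey = new Map()    // message key -> color last indexed
  const byColor = new Map()  // color -> Set of message keys

  return {
    receive (msg) {
      if (typeof msg.color !== 'string') return // still encrypted, skip
      const previous = byKey.get(msg.key)
      if (previous !== undefined) byColor.get(previous).delete(msg.key)
      byKey.set(msg.key, msg.color)
      if (!byColor.has(msg.color)) byColor.set(msg.color, new Set())
      byColor.get(msg.color).add(msg.key)
    },
    keysFor (color) {
      return [...(byColor.get(color) || [])].sort((a, b) => a - b)
    }
  }
}

const view = createColorView()
;[
  { key: 6, color: 'magenta' }, // arrives out of order
  { key: 3, color: { decryptedContent: 'e2c692693217e4a2' } },
  { key: 0, color: 'red' },
  { key: 3, color: 'magenta' } // same key, now decrypted
].forEach((msg) => view.receive(msg))

console.log(view.keysFor('magenta')) // [ 3, 6 ]
```

This only works for map-like indexes where the old contribution of a key can be looked up and removed.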

cc: @noffle, forgot to cc you above but I imagine you've thought about this lots for both ssb and kappa-core

@Christian Bundy %KCyN4lA0xOcydrwE4d1XAqrl4CPZOU0/wBZ7Rt8sbEw=.sha256

One last addition!

> I think the big changes here are:

  • when a view is depended on by another view, a rebuild should only pass the diff between the previous state and the new state
    • this means that adding a new decryption key doesn't mean you re-stream all messages to other views
    • this also means that views need to maintain their own state, which means that transforms should act like real views

@Anders %0YJiR+8RNUwOWVfJFznywzdgTxttasc6csv0HTHZ4sE=.sha256

Don't have very much capacity to contribute here, but this is really important stuff @christianbundy so thanks for bringing this up.

My only 2 cents to add for now is that it might be useful to specify if a view is "pure" or if it has state. With this information it should be possible to not have to rebuild every view from scratch depending on the changes coming in.
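
One way this could look, assuming each view declares a hypothetical `pure` flag (no such flag exists in flumedb today) and a planner decides between re-indexing a few messages and a full rebuild:

```javascript
// Hypothetical view descriptors: `pure` means the entry for a message
// depends only on that message (a map), not on accumulated state.
const views = [
  { name: 'by-type', pure: true },
  { name: 'friends-graph', pure: false },
  { name: 'full-text', pure: true }
]

// Decide how each view should react when some messages are re-transformed.
function planUpdate (views, changedKeys) {
  return views.map((view) => view.pure
    ? { view: view.name, action: 'reindex', keys: changedKeys }
    : { view: view.name, action: 'rebuild' })
}

const plan = planUpdate(views, ['%msg3', '%msg4'])
console.log(plan.map((step) => `${step.view}: ${step.action}`))
// [ 'by-type: reindex', 'friends-graph: rebuild', 'full-text: reindex' ]
```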

@jaccarmac %/dYOejG5TR8C6BD59DZASMdkm/GcF1wZxfsaTqpo3Mo=.sha256

@Christian Bundy

> but flume views expect to see each message once, in the order it was received

This seems to suggest that currently, all views are expected to behave like event-streams themselves? My intuition is that while event streams can be a useful abstraction, trying to use them everywhere is probably not the best of ideas. But it's entirely possible I'm misunderstanding encryption and that that view is necessarily a stream.

When I hear "event streams/sourcing" I tend to think of Kleppmann's talk where relational materialized views are important parts of the pipeline. On the other hand everything as a stream reminds me of reactive streams which I think about in an entirely different way. Again, quite possibly in error.

@Dominic %cmJUsgQmsqJ1XEHT6qAJtrGaJ61AKJaJK5jPRHAUR6w=.sha256

@christianbundy I think we can substantially improve the performance of a full rebuild (even in a pure JavaScript implementation) by using an in-place format (avoid parsing into another memory structure, just to throw it away immediately)

@dinosaur I'm happy you see the advantages of flume! Also, because flume has views, it was possible to add SQL without even asking me to merge anything. Permission-less Innovation! Personal aside, the thing I dislike about SQL is now you have to have a schema, and how do you add things to that schema, so it's both flexible and inflexible.

Another idea: I've been thinking about having a separate log for decrypted messages. Newly decrypted messages would just be copied into this log. The same indexes would run on both, and when you query them they'd be merged; the main log has messages in receive order, and the decrypted log would have messages in decrypt order. Off-chain content could be handled this way also? I'd like this discussion of "future transforms" to be a lot more concrete. I wanna hear about specific use-cases!
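
A rough sketch of the two-log idea, using plain arrays as logs. `main`, `decrypted`, `buildTypeIndex`, and `queryByType` are names invented for this example, and the index is rebuilt on every query only to keep the code short.

```javascript
// Two logs: `main` in receive order, `decrypted` in decrypt order.
// The same index function runs over both; queries merge the results.
const main = [
  { key: '%1', type: 'post', text: 'public hello' },
  { key: '%2', type: 'encrypted' },
  { key: '%3', type: 'post', text: 'public again' }
]
const decrypted = [] // newly decrypted copies get appended here

// Same indexing logic for either log: group messages by type.
function buildTypeIndex (log) {
  const index = new Map()
  for (const msg of log) {
    if (!index.has(msg.type)) index.set(msg.type, [])
    index.get(msg.type).push(msg)
  }
  return index
}

// Query both indexes and merge: main log results first, then decrypted.
function queryByType (type) {
  const fromMain = buildTypeIndex(main).get(type) || []
  const fromDecrypted = buildTypeIndex(decrypted).get(type) || []
  return [...fromMain, ...fromDecrypted].map((msg) => msg.key)
}

// A key arrives later: copy the decrypted form into the second log.
decrypted.push({ key: '%2', type: 'post', text: 'secret hello' })

console.log(queryByType('post')) // [ '%1', '%3', '%2' ]
```

Nothing indexed from the main log has to be rebuilt in this model; the cost moves to merging results at query time.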

@Dominic %HTvxRYknxDgznZHNW7ZkTVUiGGytB11ahSnw9f3/ows=.sha256

> a common concern i’ve heard is that we’re storing the data twice, in the Log and in the View(s), so it’s inefficient and duplicating and why not just have one database for everything

Btw, someone making this complaint needs to learn more about how databases work. Databases need to arrange data so it is easily readable. There are many approaches to this but just because you have "one database" does not mean that database doesn't internally save the data twice, in some way or another.
