You are reading content from Scuttlebutt
@Christian Bundy %rciFuVmIAi6WxamBNcF+EYSnJtngAbHwxxmtudsz3v4=.sha256

Prototyping Blob Content

If you haven't read @aljoscha's thoughtful and insightful post on moving message content out of the sigchain, it may be useful backstory. The gist is that #offchain-content has many benefits but that the best possible implementation doesn't involve using the current blob mechanism. I agree wholeheartedly.

Instead, I've built the worst possible implementation, and I'd like to share it.

Private Unboxer

When secure-scuttlebutt sees a message with content as a string, it passes it off to a function that decrypts the content string into an object. This means that you can query for messages with the type chess_chat and you don't need to worry about whether the message was public, unboxed privately, or something else. Emphasis on the something else.
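For context, an unboxer is conceptually just a function from a raw message value to decoded content, returning null when it can't handle the message. A minimal sketch of that shape using ssb-keys (the key path here is illustrative):

const ssbKeys = require('ssb-keys')

// Illustrative key path; a real client would use its own secret file.
const keys = ssbKeys.loadOrCreateSync('/tmp/example-secret')

// An unboxer: given a message value, return the decrypted content object,
// or null to pass the message along to the next step untouched.
function privateUnboxer (value) {
  if (typeof value.content !== 'string') return null
  return ssbKeys.unbox(value.content, keys) || null
}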

My first move was to create the ability to chain unboxers, so that multiple unboxers can work on the same message. My original intention was to add a blob_unboxer that made a call to multiblob, but multiblob's async methods don't mesh well with flumelog-offset.

Blob Unboxer

I'm calling this the worst possible implementation for a reason: instead, I wrote a hacky reimplementation of multiblob.get and added it as an unboxer. It does three simple things (sketched below):

  1. Looks for a content property that starts with the blob sigil (&).
  2. Reads that blob into memory and attempts to parse it as JSON.
  3. Passes the result on to the next unboxer, and then to the database.
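Something like this, where blobDir and the path mapping are illustrative stand-ins rather than the real multiblob layout:

const fs = require('fs')
const path = require('path')

const blobDir = '/tmp/blobs' // illustrative; really lives inside the multiblob store

// Illustrative mapping from "&hash.sha256" to a file path; multiblob's
// actual on-disk layout differs.
function blobPath (id) {
  return path.join(blobDir, id.slice(1).replace(/\//g, '_'))
}

function blobUnboxer (value) {
  // 1. Look for a content string that starts with the blob sigil.
  if (typeof value.content !== 'string') return null
  if (!value.content.startsWith('&')) return null
  try {
    // 2. Read the blob into memory and attempt to parse it as JSON.
    return JSON.parse(fs.readFileSync(blobPath(value.content)))
  } catch (e) {
    // 3. Missing or unparseable blob: return null so the message
    //    falls through to the next unboxer and then the database.
    return null
  }
}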

Gross

Yeah. It's gross. But it works! I want to be very clear that this is a hacky prototype that I hope will be made obsolete immediately, but if nothing else it's an interesting experiment in ergonomics. I've only prototyped viewing blob content, but my workflow for adding blob content is even worse (rough sketch after the list):

  1. Attaching a JSON file.
  2. Copying the hash.
  3. Patching ssb-validate to allow my terrible message.
  4. Running sbot publish $hash.
  5. Viewing the message.
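Roughly the same workflow in code, assuming a running sbot, my patched ssb-validate, and a hypothetical content.json:

const fs = require('fs')
const pull = require('pull-stream')
const toPull = require('stream-to-pull-stream')
const ssbClient = require('ssb-client')

ssbClient((err, sbot) => {
  if (err) throw err
  pull(
    toPull.source(fs.createReadStream('content.json')), // 1. attach the JSON file
    sbot.blobs.add((err, hash) => {                     // 2. the hash comes back
      if (err) throw err
      // 3./4. publish the bare hash -- only works with ssb-validate patched
      sbot.publish(hash, (err, msg) => {
        if (err) throw err
        console.log(msg)                                // 5. view the message
        sbot.close()
      })
    })
  )
})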

I may be able to get away with skipping the first step, but I'm not very familiar with the specifics of blob replication and I want to make completely sure that my prototype blob is replicated for this post.

screenshot of prototype blob content

???

In the interest of iterative development, I'd love to hear some input on what my next steps should be (or whether this works for you). I think this can be tested by running npm i --save https://github.com/ssbc/secure-scuttlebutt#blob-unboxer in your client directory. Unless the consensus is an overwhelming "delete it", I think my next step will likely be either:

  • an asynchronous flumelog-offset codec
  • a synchronous multiblob option/implementation

Anyway, that's all I've got. What do you think?

hermies firehose


Note: It turns out that posting the proof-of-concept message actually crashed Scuttlebot for anyone who tried to replicate with me. I've opened an issue for the bug and deleted the post from my feed until I'm confident I can re-post without getting #marooned again.

cc: #ssb #ssb-dev #scuttlebot #blobs

@Dominic %Osa+no8tcL7eX7meShSMGynjiXpMy6VroWKXjqrmgCo=.sha256

my understanding of the current message validation rules is that that should not even be allowed. What version of ssb-validate are you on? In any case, though, nothing should crash just because something was wrong with the decryption - it should just fail to decrypt.

Okay, I wrote some tests to confirm that... see comments on your PR: https://github.com/ssbc/secure-scuttlebutt/issues/217

@aljoscha %xtSX3Iu0Cxlg1/GYEDU5sJqyL0Nccn6yJJ7CQMetIP8=.sha256

I'm thoroughly impressed @Christian Bundy. This is horrible on so many levels at once, I've been literally laughing out loud while reading that post.

How do you signal errors (blob can't be fetched, or blob is not json)? Just return a content object with a special type field (how about the empty string) and no other entries?

Congratulations on the creative act of bringing together seemingly disparate ugly hacks, combining them into an even uglier one. This approach could actually work, which is amazing.

Please don't merge this, ever.

User has not chosen to be hosted publicly
@aljoscha %dm8Fi2UEiGGpEgXoLS9AkNeZj0GIXRH0x0n5F0bi6Ig=.sha256

You are right, my previous post reads rather harshly, sorry for that. My point of view was that of somebody being actually impressed, and very amused (or maybe "flabbergasted" is the right word here?). So please read it through that lens - lighthearted and jokingly - if you can.

The question about error handling was genuine; I'm interested in exploring where this approach leads. And I really do admire how @Christian Bundy saw how this could be done without any protocol changes, and then went on to implement it immediately (uncovering a bug in the process).

This can serve as a valuable starting point for figuring out which backend changes might be needed for offchain-content, and what challenges are involved in implementing them. Ideally, those could be worked on without relying on details of the actual offchain-content implementation.

User has chosen not to be hosted publicly
@Christian Bundy %/bxI3icyCb+TRst5yz3vcPa2/nk6rueIbmi1emnRYSw=.sha256

@dominic

my understanding of the current message validation rules is that should not even be allowed.

I think this was my problem -- I was under the impression that ssb-validate was only hooked up to sbot publish, so my local patch meant that I was publishing a message that others wouldn't replicate. Thanks for helping me troubleshoot this, definitely a learning experience!

@aljoscha

So please read it through that lens - lighthearted and jokingly - if you can.

❤️

I've just refactored and posted a new message, which shows up to me as:

{
  "key": "%59UufAGSLgEUQWsvdZEHSIGw/x2ezxY5jczavjpKFZ8=.sha256",
  "value": {
    "previous": "%2+uc1az8Xk9NGV4xMSnaMyXyq2eLeBWGzQlcpT/emac=.sha256",
    "sequence": 4925,
    "author": "@+oaWWDs8g73EZFUMfW37R/ULtFEjwKN/DczvdYihjbU=.ed25519",
    "timestamp": 1537379262052,
    "hash": "sha256",
    "content": {
      "type": "post",
      "text": "If you're reading this, #blob-content works.\n\n...or not. Depends on how you're viewing this message.",
      "mentions": [
        {
          "link": "#blob-content"
        }
      ]
    },
    "signature": "j8/lzTjN/FJYEMHyaJqFdJj5tyDPR9D1R+l8foO3XE7sZTfOVZrLM7zXsggipXEXtWo/WYgylMfE3Hkhie7IDA==.sig.ed25519",
    "blob": "&sbBmsB7XWvmIzkBzreYcuzPpLtpeCMDIs6n/OJGSC1U=.sha256",
    "blobContent": true
  }
}

Currently there seem to be two states (sketch after this list):

  • Blob is found and returned with value.blob: "&blob" and value.blobContent: true.
  • Blob can't be found/unboxed/etc and the original value object is returned as-is.
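In code, roughly; blobRefFor() and getBlobJson() are hypothetical helpers, one finding the "&blob" reference on the raw value and the other returning the parsed blob or null:

// Sketch only: both helpers are hypothetical stand-ins for the real lookup.
function resolveBlobContent (value) {
  const ref = blobRefFor(value)
  const parsed = ref && getBlobJson(ref)
  if (!parsed) return value                  // state 2: original value as-is
  return Object.assign({}, value, {
    content: parsed,
    blob: ref,                               // state 1: keep the "&blob" pointer
    blobContent: true
  })
}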

I'm not familiar enough to understand why, but I've even been able to post a blob message and then add the blob in a later message, and after restarting sbot it seems to be unboxed correctly. I'm sure this throws a wrench in the whole "timestamp" bit, as we now have:

  • when the message was received
  • when the message was [supposedly] authored
  • when the blob was received

Anyway, it turns out that I was able to hook into the unboxing code without setting the invalid value.content: "&blob" property, which seems to resolve the bug I caused/experienced. Fingers crossed that I didn't just break my feed again!

User has not chosen to be hosted publicly
@Christian Bundy %uUdnfih066BmhWXhNKnmJsEWlgtg9/ZJXNGF64JW4OM=.sha256

Blob Content II: Return of the Multiblob

This is still a hack but it turns out I might have implemented async codecs for flumelog-offset, which lets us use multiblob the way the gods intended it. I'm sure there are bugs, but this seems to be working well enough that I think I'm planning to use it as my daily driver. I won't be posting blob content messages regularly, but I want to see if I can find any new failure modes.

Testing

Let me know if you end up testing this out! I don't think there's any risk of breaking your feed with this version of blob content, but don't trust me. Please let me know if this gives you any trouble.

Patchbay

I'm running a branch of Patchbay on the SSBC repo called @christianbundy/master if you'd like to test it out; just bear in mind that this is pretty experimental.

git pull origin @christianbundy/master

Not Patchbay

I think this should be relatively easy to plug into an existing client by just running npm i --save https://github.com/ssbc/secure-scuttlebutt#blob-unboxer and staying up-to-date with npm update.

Feedback

The first round of feedback was super helpful in:

  • pointing out my misunderstandings (especially @dominic's test)
  • pushing me in the right direction
  • helping me think about failure modes (especially @aljoscha's questions)
  • making me feel warm and fuzzy and such

So for that, thank you. I'd really love some more feedback on this latest iteration if you've got the time. I'm going to include some notes below that might be good points to start discussion. Really looking forward to hearing more thoughts!

Notes

Good

  • you can now view blob content messages (!)
  • all the hard stuff is handled by multiblob (!!)
  • it doesn't break your feed anymore (!!!)

Bad

  • if you receive a blob after the blob content message:
    • it will remain hidden as a "private" message
    • the only workaround I've found is restarting sbot
  • the only way to publish messages is sbot publish --type 'blob' --blob '&abcdef.sha256'

Ugly

  • the changes to secure-scuttlebutt are experimental
  • the changes to flumelog-offset are experimental

???

  • benchmarks are showing no performance changes from the codec change(s)
  • console.error wasn't working without console = require('console')
  • I'm super curious whether private blob messages work -- anyone want to try?

Next

  • I'd like to bikeshed a name for the post type
    • I was originally thinking blob_content but got sidetracked trying to understand why underscores are often used in post types (at least for #ssb-chess)
    • I'm a bit surprised that the blob type had never been used (within 2 hops)
    • I'd imagine blob is probably fine unless there are other objections
  • working with the actual API instead of forking secure-scuttlebutt
    • this is dependent on the async codec code being merged into flumelog-offset and secure-scuttlebutt
    • is there a difference between secure-scuttlebutt plugins and scuttlebot plugins?
    • I'm unclear on how this would actually be implemented or used by client authors
  • working through the two "bad" bits mentioned above
    • it would be nice to not have to restart sbot when a blob comes after the corresponding message
    • it would be nice to be able to post from a non-CLI client
  • probably heaps of other stuff, I'd love some feedback on what to work on next
@Dominic %jYQnj8z5DcaXF9PSlyDt0KtWh/dhcuqkfYPl47lsMFY=.sha256

@christianbundy another way of looking at this, if you flip it around, is that you are raising the question of queryable blobs. We've always had various ways to query the messages, but the only way of searching for blobs is to query the messages and then find one that points to a blob. You've essentially found a way to insert blobs into that pipeline.

What applications could be built if you could do a database query over blobs?

@Christian Bundy %xJcymGKXalpsdsv/zuhrlzrx1TtPaKKGIhrFt/A2Q6A=.sha256

@dominic

What applications could be built if you could do a database query over blobs?

I'm not sure -- I understand that abstracting content away from the sigchain has some nice properties for engineering and security (especially forward secrecy), but the idea of querying blobs doesn't really click for me yet. Could you unpack that idea a bit?

My brain jumped to technologies like Dat, IPFS, BitTorrent, and other data-centric software that would benefit from #offchain-content, but I get the feeling you're talking more about the shape of these applications rather than just throwing out techy buzzwords that have to do with blobs and data.

@dan %dqV+IuXUYyneg9axJOzX7Wmu55j5plYPTCCwb9yOLB0=.sha256

This surely deserves the invocation of a new gif, I hear you all say!

Happy to oblige.

mad-science-scientist-experiment.gif

@Dominic %mdZvmTCrEdsSai537eUuPf2CUTp57kbz9GORDWGuwz8=.sha256

@christianbundy well, in your case, you are putting some json in a blob, then feeding it into the indexing pipeline. That means some of the constraints on messages are relaxed... so you could build different applications.

So right now you might do a query such as "all messages that have X as the value.content.root" but what if you could ask for "all blobs?"

@Christian Bundy %f4U9CyAqlSFnazDUmkAYBZU7u9OquLhmIq5jUcn7nBE=.sha256

@dominic Did we just break out of JSON message content?

@aljoscha %eA3JFhootBuIHxFScFWinZaHDkN33QiT12497/YCFO4=.sha256

Isn't the whole point of blobs that the protocol can treat them as opaque? Indexable blobs sound like an oxymoron, and somewhat unnecessary if we end up lifting the message size limit.

I'll take the conservative opinion of leaving self-describing data to messages, opaque data to blobs.

@Dominic %TN6HsJYmeBTQ2Vd13x/ym9FKy16ZWtjEMhrpojcsh4c=.sha256

somewhat unnecessary if we end up lifting the message size limit.

Messages have a limit for a reason: so you can know bounds on how much you are committing to when you replicate a feed.

Blobs are a release valve for that - you can make something much bigger - but also, your followers aren't committed to replicating your blobs. Depends on the application. There are already several applications that have some sort of structured data in blobs: for example, ssb-git has packfiles, and ticktack has blogs in markdown format. If a blob is plaintext, I think it definitely makes sense to be able to search inside of it.

Also, searchable blobs is a pure feature - it's not removing anything, so we don't have to change how something works to add that.

@aljoscha %dWbw9cYZroN8CF16RQF0XsLfKQ/PYObPtkjpPliA2aY=.sha256

Messages have a limit for a reason: so you can know bounds on how much you are committing to when you replicate a feed.

With offchain-content, that's not a problem anymore. You can always resort to replicating nothing but metadata. Your peers can't know whether you even have the content data or not.

To get the current behavior, we can do offchain-content replication that says "give me all content below size XXX, send me other blobs on demand".

Also, searchable blobs is a pure feature - it's not removing anything, so we don't have to change how something works to add that.

But if it ends up in the protocol, it is a feature everyone else is going to have to implement/support as well.

@Dominic %8ebUR0K+PWALf0YJ+H2v7S0FojKJIHje04jaJm/WJug=.sha256

@aljoscha lots of problems spin out of that... like, if each peer can choose the XXX limit, then what happens if the peer you are replicating has chosen a different one and you don't have that? If there is a fixed limit... then how is that different to the current setup with 8k messages and blobs?

Whether something like searchable blobs becomes "part of the protocol" depends on how many applications use it, and whether it's considered indispensable. To answer that question, we need to experiment - and the nice thing about that particular experiment is that it doesn't mean changing anything that already exists.

@aljoscha %ZdAiHrpK/Z36RzjCTmFlyuVRwHVZBGsi65q1tdr9vno=.sha256

@Dominic

then what happens if the peer you are replicating has chosen a different one and you don't have that?

Unavailable message content is something we'd need to handle in any case. It could happen for all kinds of reasons: Garbage collection of rarely accessed messages, or just a server crashing during replication, at a point where metadata has already been transmitted, but (part of the) content has not yet. This is a problem we will face regardless of any message size limit.

You wrote that the message size limit gives "bounds on how much you are committing to when you replicate a feed". Offchain-content can do even better: it lets you choose your own bounds, rather than having to accept the prescribed, arbitrary ones.

To answer that question, we need to experiment - and the nice thing about that particular experiment is that it doesn't mean changing anything that already exists.

Sure, as long as those APIs are clearly marked as experimental and unstable, so we don't force ourselves to support them for backwards compatibility. Having to attempt parsing of all blobs and creating indexes if parsing succeeds would be a huge commitment, computationally.

The whole question of how much of the server-client communication is part of ssb is one we will need to figure out eventually, just like the plugin story. That'll be a lot more effort than the message format changes.

@Christian Bundy %I2jPmQcXvvPROp0KDpc6RahAxKKfYNc1woEK08un5GM=.sha256

If a blob is plaintext, I think it definitely makes sense to be able to search inside of it.

I think this extends to anything with a syntax. The current prototype is a JSON blob unboxer, but we can sidestep JSON message verification and the JSON signing format for everything in value.content and experiment with any syntax we'd like (like SDN?). No more arbitrarily nested JSON; we just have to agree on what's appended to the feed and we can remain #against-consensus for everything else.

{
  "key": "%key.sha256",
  "value": {
    "previous": "%previous.sha256",
    "sequence": 42,
    "author": "@feed.ed25519",
    "timestamp": 8675309,
    "hash": "sha256",
    "content": {
      "type": "blob",
      "blob": "&blob"
    },
    "signature": "abcdef.sig.ed25519"
  }
}

You can always resort to replicating nothing but metadata.

I think this would be really neat, so you could set up a system where you (hypothetical sketch after the list):

  1. Download the metadata of anyone within 3 hops
  2. Download blobs posted by anyone within 2 hops
  3. Download blobs linked by anyone within 1 hop
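Purely hypothetically, that might be expressed as a replication policy object; nothing like this exists today:

// Hypothetical policy shape, just to make the tiers concrete.
const policy = {
  metadata:      { hops: 3 }, // 1. metadata of anyone within 3 hops
  authoredBlobs: { hops: 2 }, // 2. blobs posted by anyone within 2 hops
  linkedBlobs:   { hops: 1 }  // 3. blobs linked by anyone within 1 hop
}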

I think this may have some feature overlap with #ssb-ooo, but I'm not familiar enough to have any strong intuitions or opinions.

Having to attempt parsing of all blobs and creating indexes if parsing succeeds would be a huge (computationally) commitment.

This ties into two problems I'd love feedback on:

How should unboxers work together? Originally the code just used the first unboxer that returned something truthy, so I made a small change that allows unboxers to be chained. Unfortunately, this means that they have a canonical order, so we have to choose between _unbox(blobs.get(x)) and blobs.get(_unbox(x)). I think the former is the correct answer for now, but if other unboxers are added in the future this may get complicated.

More importantly: what's the right way to add an unboxer? I'm sure this experiment shouldn't be baked into the secure-scuttlebutt package, so I'd like to use the db.addUnboxer() API. Does this mean writing a plugin for secure-scuttlebutt? Or maybe a scuttlebot plugin? Or is this the sort of thing that goes straight into the client? My goal is to get this experiment working in my "fork" of Patchbay.
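For what it's worth, here's my guess at the plugin shape, following secret-stack's { name, version, manifest, init } convention; whether addUnboxer() is actually reachable on the server object like this is exactly what I'm unsure about:

const blobUnboxer = require('./blob-unboxer') // hypothetical module

module.exports = {
  name: 'blobContent',
  version: '1.0.0',
  manifest: {},
  init: function (server, config) {
    // Assumes the db's addUnboxer() is exposed on the server object.
    server.addUnboxer(blobUnboxer)
    return {}
  }
}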

@Christian Bundy %a51hTXuupUQ0VEgzJ1ylnY28jZKqajnW0PYeXQvXp9A=.sha256

More importantly: what's the right way to add an unboxer?

Just for reference, my current thought is that this should be a Scuttlebot plugin (please correct me if there's a better path). I think my steps for this iteration would look something like this:

@Dominic %fHtYWTfreCGk+cthi5dKhn1wTOjw+nn7YoYMy5Wlj0s=.sha256

@christianbundy yes, add an unboxer via sbot plugin.

{ type: 'blob', blob: '&...' }

The problem with this is that it doesn't tell us anything about whether we actually want to get that blob. Maybe I want to follow your photo galleries, but not your git repos. Blobs cost in both bandwidth and round-trips. If we have ways to avoid downloading too many blobs, we can make more blob-heavy applications - without having to force that weight on everyone. So a blob-based message should include some minimal metadata to make it easier for applications to use, without parsing every blob.
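for example, a blob message could carry just enough metadata to filter on before fetching (field names hypothetical):

{
  "type": "blob",
  "blob": "&blob.sha256",
  "mediaType": "application/json",
  "size": 4096,
  "application": "gallery"
}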

Of course, now we've gone in a circle back to the current design.

@Christian Bundy %6ut9dE74bC7Ii+Zj7/XOWjKTEVXdWR8zZ0SUPVhFejA=.sha256

@dominic

You're right, the metadata aspect is critical. I'll focus on the basic stuff for now, like merging those PRs, but it sounds like patchcore's blob input might be a good place to look once I start working on the plugin itself. Thanks for all the help.

Do you have any input/opinions on _unbox(blobs.get(x)) versus blobs.get(_unbox(x))? Or should we just use one unboxer per message, letting the unboxer call secondary functions if necessary? If multiple unboxers want to work on the same message we'd get a race condition, but fixing that may be a premature optimization.

@Dominic %1tz0/TiXdc3T3UBLKcY9y6Z3yViiglYXaHhae6Q3+LY=.sha256

avoiding race conditions is not a premature optimization in a database or protocol; that stuff needs to be correct from the get-go.

@Christian Bundy %+bsq5qi6RUJKuPupP1+LwlgQjvPAcKpV3oFUivX5d6s=.sha256

@dominic Sorry, to be specific I mean "can multiple unboxers try to unbox the same content?" and if so, "how should that be handled?". This seems analogous to depject.

I think we have a few options:

  • first: This is what we're currently doing, which means that if main_unboxer returns plaintext, no other unboxers are tried. This also has the downside of being synchronous.
  • race (async first): The async version of first, where all of the unboxers are started and the first one to finish gets to unbox the content (see the sketch after this list). This is the minimum change necessary to allow async unboxers, and basically moves the priority from "first in the array" to "first to return successfully". To be clear, I mean a literal Promise.race(), not an unplanned race condition.
  • merge (reduce): This is the behavior of the current PR, which calls all of the unboxers, waits for them, and merges the successful ones into an object that's returned. My concern is that this is slow, complicated, and seems to provide no benefit, as the current unboxers would never operate on the same data.
  • walk (recursive reduce): A computational nightmare: this would be trying every unboxer, merging the results, and retrying all of the unboxers with the new result. The process would continue until the unboxers stop returning modified data. Again, it's a nightmare, but it's the only way to do stuff like _unbox(blobs.get(x)) with an arbitrary number and order of unboxers.
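Here's a sketch of the race option. One wrinkle: a bare Promise.race() settles on the first unboxer to return, even unsuccessfully, so this sketch resolves on the first success and falls back to null when all of them decline:

// usage: raceUnboxers([privateUnboxer, blobUnboxer], msg.value).then(...)
function raceUnboxers (unboxers, value) {
  return new Promise((resolve) => {
    let pending = unboxers.length
    if (pending === 0) return resolve(null)
    unboxers.forEach((unboxer) => {
      Promise.resolve()
        .then(() => unboxer(value))
        .catch(() => null)
        .then((content) => {
          if (content) return resolve(content) // first successful unboxer wins
          if (--pending === 0) resolve(null)   // nobody could unbox: plaintext
        })
    })
  })
}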

My PR would move us from "first" to "reduce", but the more I think about the problem the more I wonder whether "race" is a better call. We don't get any additional benefits from merging results unless we go full-pickle and recursively walk through every unboxer combination, which gives me the impression that "race" solves our synchrony problem without inheriting a new monster.

@Christian Bundy %lGtaW0RtC9SlC+8kR3LM3HddYDb7TSWwkcJBo5FXa3c=.sha256

Please disregard, just read your PR comments and it sounds like there's a more elegant way to do this. Thanks for all the thoughtful feedback!

@Christian Bundy %f1GbYUu8Tu8/XJ6fF75BM640FE7L8J5UGvfq3hYiaU8=.sha256

Blob Content III: The Flume Wars

After watching a video on flumedb from @mix, I started a pull request, and @dominic walked me through a handful of iterations and improvements until we ended up with something vaguely resembling an async map for Flume. This means that we'll be able to avoid the unboxer code, which circumvents all sorts of headaches.
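As a rough illustration (the exact signature is whatever lands in the PR; assume an sbot in scope), the map gets each stored value plus a callback and may rewrite it:

const pull = require('pull-stream')

function blobContentMap (value, cb) {
  if (!value || typeof value.content !== 'string') return cb(null, value)
  pull(
    sbot.blobs.get(value.content),      // pull-stream source of blob bytes
    pull.collect((err, bufs) => {
      if (err) return cb(null, value)   // blob missing: pass through untouched
      try {
        const content = JSON.parse(Buffer.concat(bufs).toString())
        cb(null, Object.assign({}, value, { content: content, blobContent: true }))
      } catch (e) {
        cb(null, value)                 // not JSON: pass through untouched
      }
    })
  )
}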

Also important, the diff from master actually looks reasonable. Here are my current todos/questions/etc:

  • Ensure blobs are replicated to others when publishing blob content
  • Ensure blobs are downloaded from others when receiving blob content
  • Figure out how to abstract this out of the secure-scuttlebutt package
    • As mentioned above, this seems like it should be a Scuttlebot plugin, but I'm not sure how.
    • Does this mean Scuttlebot needs to expose some sort of addMap() method to plugins?
      • If so, what's the best way to pass that map function to secure-scuttlebutt?

At this point I've got more questions than answers, but I wanted to publish what I've got.

@Christian Bundy %oXuVPDqnirdV0ttIGUh1/Tjpicd/vhdHQIafixK5ADQ=.sha256

Transplanted to Scuttlebot as a plugin!

I've opened this pull request to expose the addMap() method I mentioned in my above post. 🔨

@Christian Bundy %4HqPe3xvuqmYwCCxeJy+IpNTvrYLCQxvuGreWpw9OH4=.sha256

I'm probably about done for the day, so to summarize I've:

  • Opened a small PR to expose flume's map to secure-scuttlebutt.
  • Opened a small-ish PR to expose flume's map to Scuttlebot and its plugins.
  • Moved the entire ssb-blob-content package to its own repo.
  • Replaced multiblob with ssb-blobs.
  • Triggered blobs.want() for hashes we want (see the sketch after this list).
  • Triggered blobs.push() for hashes we have.
    • This is a hack and should eventually be added to some publish function.
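A sketch of those two triggers, using the ssb-blobs methods exposed on the server object (the hash is a placeholder):

const hash = '&blob.sha256' // placeholder blob reference

// Decide whether to announce or request a blob; has(), push(), and want()
// are ssb-blobs methods, reachable here as sbot.blobs.*.
sbot.blobs.has(hash, (err, weHaveIt) => {
  if (err) return console.error(err)
  if (weHaveIt) {
    sbot.blobs.push(hash, (err) => { if (err) console.error(err) }) // announce it
  } else {
    sbot.blobs.want(hash, (err) => { if (err) console.error(err) }) // request it
  }
})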

Assuming that those pull requests are merged without too many gigantic changes, this means the next steps for full #blob-content integration would be:

  • Writing tests to ensure it really works.
  • Including the plugin in a client for read-only access.
  • Exposing a method to publish content blobs via Scuttlebot.
  • Adding an interface or settings for users to choose to publish blob content.

The only weirdness I've found has been centered around blobs you don't have: sbot get $key seems to hang until the blob is downloaded, which doesn't seem ideal. On the other hand, I think blobs are replicating just fine, which is really neat to see.

@mix %t3rMf+amRgWSrpYTHAqSXzAIrUedZPpHNQuzMnffR6w=.sha256

async mapping in flume!

this looks like mapping on read, correct? awesome work

on my wishlist is async mapping in flumeviews - the use case is: a new message comes in, and you want to check if, say, a message it's asserting something about is a gathering-type message. You couldn't guarantee that the message is in your db yet, which means you'd be making a bloom-filter-type index (dominic made flumeview-bloom I think, but I haven't looked at what he did in that)... mmm

@Christian Bundy %+COo4I0I9Qnkk9cAO71JB7Zn6v5B4n/z5PRytAvKgYg=.sha256

@mix Yeah, thanks! It runs on db.get() and db.stream(). I don't think I understand flumeviews well enough yet; could you go into more depth on how that would work, or point me to some docs to read (and probably reread)?

@mix %CAPrNBEFzfZTCTgNXFUczr8KVDfAtdGlw1ykXEb+Nfc=.sha256

can talk about this on a call. It's an early-formed idea and may be weak.
We should schedule that call @christianbundy ... want to scry again? (We probably should, given daylight savings hitting here tonight.)

Can you start it, I've got more time flexibility at my end I think?

User has not chosen to be hosted publicly