@aljoscha %5kJUD3o5ukQQHWUh4mSMhMh07pQf3slmWq3BzznpJyw=.sha256

Canonical Base64 in SSB data types

When decoding base64 data, in some cases the last few bits don't encode any data. The RFC mandates that those bits be zero. Node helpfully decodes noncanonical data anyway:

> Buffer.from("iYW=", "base64") // Not canonical, last bits are `10` rather than `00`
<Buffer 89 85>
> Buffer.from("iYU=", "base64") // Canonical representation of the data
<Buffer 89 85>
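
For illustration (not part of the REPL session above), a decode/re-encode round trip makes the difference visible, which is also the simplest way to detect noncanonical input:

> Buffer.from("iYW=", "base64").toString("base64") // Re-encoding does not reproduce the noncanonical input
'iYU='
> Buffer.from("iYU=", "base64").toString("base64") // Canonical input survives the round trip
'iYU='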

Ssb-ref doesn't check for canonicity either.

As a result, ssb currently accepts multiple encodings for a single public key/hash/signature/encrypted message content. This means that implementations can't just parse incoming messages and handle the actual data: for correct signature-and-hash-preserving re-serialization, they also need to store details about the possibly noncanonical encoding. It's not that bad (only a single byte of data per base64-encoded chunk), but it's definitely weird.

So instead I'd like to specify that all the base64 used in the ssb protocols MUST be canonical and invalid base64 MUST be rejected. @Dominic, is that ok?

CC @cel, @keks, @cryptix

@mikey %l8MBxGCNQ6m4kXMUy5f3rFxyp5jh6aB7igbqbstZoXI=.sha256

all the base64 used in the ssb protocols MUST be canonical and invalid base64 MUST be rejected

:+1:

@Dominic %WH06R1BZtYDsj4ZBsIQm6/wyz2FhvB5sq+Fc+GxBHqg=.sha256

Okay, but first let's check whether we already have all canonical base64 in the current data. Hopefully we do; if so, then yes, let's make that a rule immediately. I'm pretty sure that the JavaScript-created messages will be canonical, but it's possible someone else has posted a message with non-canonical base64.

@aljoscha %NyoXRbMzJ9tdsdoDSiMg7GsHRjQeUcU49iJQus20gmg=.sha256

How do you iterate over all messages your local sbot knows about? If someone who has experience with this sort of thing would like to do the check, here's a function to call for each message in your database:

function check_msg(msg) {
  // Message ids ("%...sha256") and feed ids ("@...ed25519") wrap a 32-byte value,
  // i.e. 44 base64 characters; slice(1, 45) extracts exactly that base64 part.
  if (msg.previous != null && !is_canonic(msg.previous.slice(1, 45))) {
    throw "non-canonic previous";
  }

  if (!is_canonic(msg.author.slice(1, 45))) {
    throw "non-canonic author";
  }

  // Signatures are 64 bytes, i.e. 88 base64 characters before ".sig.ed25519".
  if (!is_canonic(msg.signature.slice(0, 88))) {
    throw "non-canonic signature";
  }

  // Encrypted content is a base64 string followed by ".box".
  if (typeof msg.content === "string" && !is_canonic(msg.content.slice(0, msg.content.indexOf(".box")))) {
    throw "non-canonic private message";
  }
}

// A base64 string is canonical iff decoding and re-encoding reproduces it exactly.
function is_canonic(str) {
  return Buffer.from(str, "base64").toString("base64") === str;
}

If it never throws, then your sbot does not know about any messages with invalid base64 encoding.

Ideally this should be done by somebody who has the feeds of @Vendan, @Vendan-Phone and @cft (those contain data not published by sbot).

CC @Dominic and @Christian Bundy, @regular (that's what you get for helping me once - I'll ping you again. Please let me know if that's not ok)

@mikey %uEEmuKzSuw7MLGrciBbzYjpHsZHqOxe6gHARDDYRI/0=.sha256

@Aljoscha

I didn't find any non-canonicals:

var Client = require('ssb-client')
var pull = require('pull-stream')

Client((err, client) => {
  if (err) throw err

  pull(
    client.createLogStream(),
    pull.drain(
      check_msg,
      (err) => {
        if (err) throw err
        client.close()
      }
    )
  )
})


function check_msg(msg) {
  // createLogStream yields { key, value, timestamp }; the message itself is in msg.value.
  if (msg.value.previous != null && !is_canonic(msg.value.previous.slice(1, 45))) {
    throw `non-canonic previous: ${msg.key}`;
  }

  if (!is_canonic(msg.value.author.slice(1, 45))) {
    throw `non-canonic author: ${msg.key}`;
  }

  if (!is_canonic(msg.value.signature.slice(0, 88))) {
    throw `non-canonic signature: ${msg.key}`;
  }

  if (typeof msg.value.content === "string" && !is_canonic(msg.value.content.slice(0, msg.value.content.indexOf(".box")))) {
    throw `non-canonic private message: ${msg.key}`;
  }
}

function is_canonic(str) {
  return Buffer.from(str, "base64").toString("base64") === str;
}

took (per `time`): 23.24s user 5.16s system 43% cpu 1:05.70 total

@aljoscha %S/g1v46Qo42QjPf6/mrS4prJnghUmBGHxvJ785+t0sg=.sha256

Thanks both of you. I'll go ahead and add the canonicity requirement to the spec.

Updating ssb-ref to use stricter checks will make sbot reject noncanonical base64, I think.

CC @Dominic

@aljoscha %PxhkKV8GEoyJX+wqhwaZ5trYeQlpAAZpo/gkGlU3PA8=.sha256

Ssb-ref alone is not sufficient; sbot also needs to explicitly check signatures and encrypted messages.

@aljoscha %ue+DhZqXw6lJALDERqSjQ+Yl74iAs/n+r8mGs7Dehrs=.sha256

Note for the js devs: The most efficient way of checking for canonicity is probably to count the number of padding =s and check the last nonpadding character against a list of allowed characters. Less efficient but simpler to implement: if there is one padding =, decode and reencode the last two nonpadding characters and check for equality; if there are two padding ==, decode and reencode the last three nonpadding characters. That is at least more efficient than reencoding the full string.
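
A minimal sketch of the first approach in plain js (untested; the helper name is mine, not from any ssb module). It assumes the input is otherwise well-formed base64 and only checks that the bits hidden behind the padding are zero:

// Hypothetical helper: true iff the padding bits of a well-formed base64 string are zero.
function has_canonical_tail(str) {
  if (str.endsWith("==")) {
    // last data char carries 2 data bits + 4 padding bits => only 4 legal values
    return "AQgw".indexOf(str[str.length - 3]) !== -1;
  }
  if (str.endsWith("=")) {
    // last data char carries 4 data bits + 2 padding bits => 16 legal values
    return "AEIMQUYcgkosw048".indexOf(str[str.length - 2]) !== -1;
  }
  return true; // no padding, no hidden bits
}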

The rust implementation will perform the check as part of the decoding process, but js doesn't get that luxury.

@Dominic %VbgIb7/yjmBvAnWhDoaVPTI+n/tjbWNyC5gjUL3HzCs=.sha256

We could actually do it with a regular expression... If there are two ==, then the last data char only encodes 2 bits + 4 zeros, so there are only 4 possibilities. If there is one =, then the last data char encodes 4 bits + 2 zeros, so there are 16 possibilities.
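
For illustration only (a sketch of that idea with the character classes spelled out by hand, not any published module), the two fixed-length cases in ssb could look like this:

// 32-byte keys/hashes: 44 base64 chars ending in one "=", so 16 legal final data chars
var b64_32_bytes = /^[A-Za-z0-9+\/]{42}[AEIMQUYcgkosw048]=$/
// 64-byte signatures: 88 base64 chars ending in "==", so 4 legal final data chars
var b64_64_bytes = /^[A-Za-z0-9+\/]{85}[AQgw]==$/

// e.g. b64_32_bytes.test(msg.author.slice(1, 45))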

@Dominic %47+qhPn2a9nrkEqusfQG87BsrDvw5APx53dwcDTXsOw=.sha256

Well, I wrote a thing to check if base64 is canonical: is-canonical-base64. It seems this got caught in a new npm spam filter... I did manage to find this blog post, but I don't understand why it denied is-canonical-base64 when it accepted iscanonicalbase64 (without punctuation).

hmm, also good news: I notice that ssb-ref already enforces canonical base64 on identifiers, because the tests have fixed numbers for the lengths of the ids (although not for signatures or encrypted content currently)

@Dominic %ZrFBa/OhWcvhNf6UxUgySVEL0cENYQlACAw4WjpxfBI=.sha256

Sorry, I lied. It didn't check the correct last chars, just the length, so the spare bits could still be non-zero. But my new module actually tests this, so it will be safe.

@Dominic %BHxIv0d489WUg+NpwOudU1nws2GhJ/hV3vf1thR7IGU=.sha256

for review:
in ssb-ref https://github.com/ssbc/ssb-ref/pull/21
in ssb-validate https://github.com/ssbc/ssb-validate/pull/11

The top-level properties are checked (previous, author, signature, and content if it's a string) but references inside content are not checked. I think if these are invalid, they should just be treated as arbitrary strings, which is what the node.js implementation will do once the first link has been merged.
