@aljoscha %NeOkZhPNhthqfXXAQ7cIU/NMe3S8SHtDhA8E5hu0H5I=.sha256

Did you know that the byte sequence [34, 228, 166, 164, 110, 237, 166, 164, 44, 34] ("䦤n���,", quotes are part of the string itself) is considered valid utf8 by ECMAScript, but not by the rust std library? By extension, js claims that this is valid json; my rust json parser claims it isn't. One of those two has to be buggy. I really hope it is the rust standard library, or else we will have to put those errors into the ssb spec.
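
A minimal node sketch of the js half of that claim (the lossy U+FFFD replacement shown in the comments is node's default Buffer decoding, which this thread digs into further below):

  var bytes = Buffer.from([34, 228, 166, 164, 110, 237, 166, 164, 44, 34])
  // node's default decoding is lossy: the three surrogate bytes become three U+FFFD
  var str = bytes.toString('utf8')  // '"䦤n���,"'
  // and JSON.parse accepts the result as a json string literal without complaint
  console.log(JSON.parse(str))      // 䦤n���,
  // rust's std::str::from_utf8 rejects the very same bytes with an Utf8Error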

At least I'm gaining confidence that my testing process is thorough enough...

@kas %MlU8AXuf3q77tubFfxD0308VDLBhbTySup0F64MF4vY=.sha256

Python says:

>>> bytes([34, 228, 166, 164, 110, 237, 166, 164, 44, 34]).decode('utf-8')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5: invalid continuation byte

@aljoscha %GECd9ubBG82NEHQMGWRcj308vbMORNzVsxaDdf7w5Rk=.sha256

So much fun: https://github.com/rust-lang/rust/issues/54845

Ok, I think I understand it now: It is a valid utf8 encoding of a unicode code point, but the code point happens to not be a valid unicode scalar value (which is what rust chars are).
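
To make that concrete, here is the plain utf8 bit arithmetic for the suspicious three bytes (nothing implementation-specific, just the three-byte layout from the utf8 spec):

  // utf8 three-byte layout: 1110xxxx 10xxxxxx 10xxxxxx
  var b = [0xed, 0xa6, 0xa4]
  var cp = ((b[0] & 0x0f) << 12) | ((b[1] & 0x3f) << 6) | (b[2] & 0x3f)
  console.log(cp.toString(16))  // 'd9a4'
  // U+D9A4 lies in the surrogate range D800..DFFF: a unicode code point,
  // but not a unicode scalar value, and hence not a valid rust char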

Follow-up question (if you don't mind me derailing this):

Wikipedia says:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

Does that mean that nodejs decoding a buffer of those utf8 bytes into a string without complaining is a unicode violation?

I'm not sure I fully understand the interplay of all the involved specs here (unicode, utf8, utf16, ECMAScript JSON.parse, ECMA-404 aka json), but if this is actually a bug in nodejs, we'll need to decide whether we want to

  • explicitly allow utf8 encodings of surrogate half codepoints, or
  • stick to valid unicode and change the js implementation

I'd strongly advocate the latter option, but that would break all feeds containing messages that store invalid unicode code points in json strings. Which quite likely is only my own feed, unless anybody else has also pasted some very abstruse strings into their posts...

CC @Dominic, @cryptix, @keks, @cel, @dinosaur

@aljoscha %7b7QeCvaJfrof8++oK590199jr+l0gWK9RKCxqAA9pY=.sha256

Okay, I'm off to reading the wtf-8 spec, which might be what ssb accidentally uses. Quoting the first two paragraphs of the wtf-8 spec, to give you an idea of how great that would be:

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

Let us hope I'm somehow terribly wrong about all of this.

@Daan (deprecated) %i/ML7wWyGwttXenrkj4LtBjunDbxuSi3n8Nq5h1KZSk=.sha256

@Aljoscha Much as I definitely don't envy you all of this, it is always entertaining, in a very morbid way, to watch people deal with this kind of thing and think to oneself "it is 2018" 0_o

Best of luck! :)

@cryptix %PwsL6VMEqaFOf21ZOHiHhrJS5RaRHrQmIIA3w95fBMU=.sha256

:cold_sweat: not much to add right now, besides that Go seems to be fine with it: https://play.golang.org/p/nquklPbV5Vs

(meaning that's where I would have stumbled over this. The godoc for encoding/json says it implements RFC 7159.)

@aljoscha %3b+Xn4mGgmqx1iK5PI8sZpzoBjjZF1B7Yb2ojYkCMzU=.sha256

Filed a nodejs issue, let's see how they react. If this is indeed a bug, maybe you could go ahead and do a bug report on the go side of things.

@aljoscha %tUOwdQ482+SlFsRd77Y45NSb/ytaetUVOaQz5I9HSOo=.sha256

Ok, the string I pasted into my message just contains the replacement character U+FFFD, not any invalid utf8, so that's a plus, my feed won't be broken.

Turns out Buffer.toString doesn't check validity at all; it even returns a string for Buffer.from([255]).toString('utf8'). Which is technically allowed, since js strings are not required to be valid unicode.

The mismatch happened because I used Buffer.toString for my test data generation; I'll need to look at what exactly the js ssb implementation does. Maybe this is actually handled correctly and all of this was just a false alarm.

@aljoscha %T5O1h0ZrWB4dZJq1FFliwUls0JH4HUVx5XgQ21i85nM=.sha256

The js codebase remains as impenetrable to me as ever. If anybody could point me to where the conversion from byte buffers into strings happens, I'd be very grateful.

@aljoscha %bd96n3wGwweq6PnWTyVCMO8zUe17nllRbEItxJRsnGg=.sha256

Clarification: Buffer.toString does not emit invalid strings, it just replaces invalid byte sequences with the unicode replacement character U+FFFD : https://github.com/nodejs/node/issues/23280#issuecomment-427402427

So if that is what happens in the js implementation, then things are actually fine, I guess. In any case, I'm done for today; I really hope someone knows or can find out what js ssb does.

@aljoscha %7cFmte5ThRzZ/0dLYsHL3AZHhYXzL1XiMCTsRp0NqQ8=.sha256

If the js implementation is just replacing invalid utf8 with U+FFFD, then only the transport encoding is affected, and there are two ways forward:

  • change the current ssb implementations to treat non-utf8 json as invalid instead of silently patching it, and make that the spec
    • let's do that one, I'd hate to always say "it's a json subset except that invalid utf8 is ok for stupid reasons"
  • make the replacement part of the spec, with the weird consequence that a string which actually contains U+FFFD can then be encoded by substituting any invalid utf8 byte sequence for it. Implementations would be required to e.g. decode [34 255 34] into the string "�"...

To fix the js implementation (if it turns out to do the silent character replacement), use code such as new TextDecoder('utf-8', { fatal: true }).decode(Buffer.from([34, 255, 34])), which errors on invalid utf8.

@aljoscha %F586gTOCDtmAOIWSc2TlWEdD5HBGNt1WL+Mt/Ne4mg8=.sha256

Call for Help

This issue is still blocking me: I need to know what happens when sbot receives a message where a string literal contains invalid utf8. There are three possibilities, and I don't know which one applies:

  • it might replace invalid bytes with the unicode replacement character U+FFFD
    • this is what happens by default when converting from buffer to string via buf.toString()
  • it might reject the message as invalid
  • it might store invalid utf-16 in the javascript strings

Any of the following would unblock me:

  • get a clear statement from @Dominic which of these three is the desired behavior, and treat any divergent sbot behavior as a bug
  • find the place(s) in the javascript implementation(s) where the conversion from buffer to string happens
  • manage to send a message with invalid utf8 in a string literal to sbot and detect how it reacts
    • I don't think the js API is low-level enough to do so, but maybe another implementor could help me out here: @keks, @cryptix, @cel
    • a simple invalid string is [0x22, 0xff, 0x22] (double-quote, invalid utf8 byte, double-quote)

I'd be super grateful if anyone could help me here.
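
For reference, a sketch of how the three possibilities could be told apart, using a hypothetical classify helper wrapped around whatever buffer-to-string conversion sbot turns out to use (the second probe assumes a node where TextDecoder is available; on older versions it lives in require('util')):

  // hypothetical helper: pass in sbot's buffer -> string conversion
  function classify (decode) {
    var bad = Buffer.from([0x22, 0xff, 0x22])  // the simple invalid string from above
    try {
      var s = decode(bad)
      if (s.indexOf('\ufffd') !== -1) return 'replaces invalid bytes with U+FFFD'
      return 'stores invalid data in the js string'
    } catch (e) {
      return 'rejects the message as invalid'
    }
  }

  classify(function (b) { return b.toString('utf8') })
  // => 'replaces invalid bytes with U+FFFD'
  classify(function (b) { return new TextDecoder('utf-8', { fatal: true }).decode(b) })
  // => 'rejects the message as invalid'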


The first unblocking option (defining the desired behavior instead of looking at what sbot does) is important in any case. In my opinion, rejecting invalid utf-8 is the only sensible behavior. Even if sbot performs character replacement instead, I'd very much prefer if we treated that as a bug rather than gospel.

@Christian Bundy %PglzpRsqN4AeE7WlwdawYysF4V0TBUdAxHdxdPi+hcM=.sha256

@aljoscha I'd love to help you however I can. Would it be helpful for me to just do it and report back with scuttlebot's current behavior, or have you already tried that?

@aljoscha %33e3Ws8CJjCFyRjmmg8R1Ic1hsSi/OV34vzU5H3bkXU=.sha256

@Christian Bundy

That would be great. I haven't done this myself; I'm not familiar enough with the js codebase to get this running without spending a lot of time.

User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
User has not chosen to be hosted publicly
@Anders %EIPYNto+/1bQqAcq7CA6QZzhf6BL5rNUFw+ZP1idOQU=.sha256

@Aljoscha this is where it converts buffer to str and then json when reading from db: https://github.com/ssbc/secure-scuttlebutt/blob/master/minimal.js#L83

@aljoscha %zx0GVnlMl179osXbP2r8lC1a9VsDn8tiZli+1BdA6kA=.sha256

Thank you @regular.

Sbot replaced the invalid utf8 with the replacement character U+FFFD. So now we need to decide whether to codify this behavior in the spec, or whether to change sbot.

I see three main reasons for changing sbot to reject invalid utf8 instead of performing lossy replacement:

  • If we allow invalid utf8, then the rpc protocol isn't valid json anymore. This is against the json spec js ssb would otherwise follow.
  • Any string that actually contains U+FFFD has multiple valid encodings, namely all those that substitute an invalid utf8 sequence for the U+FFFD character. Conforming implementations have to accept and convert all of them.
  • The legacy message format becomes even more ridiculous.

Reasons for keeping the silent replacement behavior:

  • No need to fix the sbot code base

My preferred choice is clear, but I guess it is @Dominic who gets to decide this?

@Humberto Ortiz Zuazaga %aqU0CDJalTw+6QxWwlZaGLcGsTfoFftqiV992ObFEoU=.sha256

and patchfoo

@Humberto Ortiz Zuazaga %21xHVuHGwMP8eoSG9Qq0fxh1gW9MiN2UbDEC6ND+XIE=.sha256

@aljoscha, @regular

bad-unicode.jpeg

@aljoscha %l/q/53suwl+0PwaZ6FAiwP3bYoNnn4wAVuE4oSDtB0o=.sha256

@arj

@Aljoscha this is where it converts buffer to str and then json when reading from db: https://github.com/ssbc/secure-scuttlebutt/blob/master/minimal.js#L83

Isn't encode only encoding arbitrary js values into js strings, and decode decoding js strings into arbitrary js values? Those js strings are an internal, utf-16 encoded representation, but ssb servers exchange utf-8 encoded json. So there's still a conversion missing, and that's the one I'm after: in particular, the decoding from utf-8 encoded json into js values, possibly with a utf-16 string as an intermediate representation.

@Anders %wYffYkhQBsalTCj03udPQDP2c7kx94h3IbTiW2sQ0eQ=.sha256

@aljoscha I actually haven't looked at that part of the code, strangely enough, but my guess is that it's https://github.com/dominictarr/packet-stream-codec?

@Dominic %wqxVOw3+VRs6c1wf8s8LQLWNiH60i8SglgC4mJPUvx8=.sha256

thankfully, something sensible is happening here @regular, and you did not just create a message that makes @aljoscha's spec more complicated ;)

hex dumping that message:

000000e0  0a 20 20 20 20 22 74 79  70 65 22 3a 20 22 5c 22  |.    "type": "\"|
000000f0  ef bf bd 5c 22 22 0a 20  20 7d 2c 0a 20 20 22 73  |...\"".  },.  "s|

5c 22 ef bf bd 5c 22 22 which is \ " <a_now_valid_utf8> \ "

investigating:

// default is to convert to valid utf8
> Buffer.from(JSON.stringify('\x22\xff\x22'))
<Buffer 22 5c 22 c3 bf 5c 22 22> // 0xff becomes c3 bf, the utf8 encoding of U+00FF: valid utf8

// turns out v8 does actually allow invalid chars, if you force read as ascii:
> Buffer.from(JSON.stringify('\x22\xff\x22'), 'ascii')
<Buffer 22 5c 22 c3 ff 22 22> // the raw 0xff byte gets through: invalid utf8

packet-stream-codec converts json strings into buffers with the buffer default,
which is utf8, so this means that the reference implementation never sends invalid utf8 inside of json, and would not validate a signature of a message containing invalid utf8.

So @aljoscha good news, reject such a message!

@aljoscha %8fFtpLe5b3nxd6rYPUjHcqRBTvtEMAtPjnTNzTO04Ww=.sha256

@arj Yeah, that might be right. This converts from string to buffer, and this from buffer to string.

Does sbot talk to the db via packet-stream as well? If it doesn't, then there must be at least one other place where these conversions happen. Changing only some of the conversion functions but not all of them would result in some nasty bugs.

@mix %OFjy6fymtJRCkDNu4NAb5vJg+ty8ae+MYIxJsnmpFMA=.sha256

damn @arj beat me to it (and I didn't see because of replication delays!)

following this down, I got to looking at that minimal.js too

  var codec = {
    encode: function (obj) {
      return JSON.stringify(obj, null, 2)
    },
    decode: function (str) {
      return unbox(JSON.parse(str.toString()), unboxers)
    },
    buffer: false,    ////  << watch this value
    type: 'ssb'
  }

  var log = OffsetLog(path.join(dirname, 'log.offset'), {blockSize:1024*16, codec:codec})

diving into flumelog-offset/inject.js (where this codec object is used):

      var data = {
        value: codec.decode(codec.buffer ? value : value.toString()),
        prev: prev,
        next: next
      }

I didn't dig further to check that value is a Buffer...
but it seems like arj is correct: value -> value.toString() -> str.toString() -> JSON.parse -> unbox (toString twice, though the second one is a no-op on a string!)

@Dominic %rh5Brc5bzzwqiFokc96wsq1uj3iH7lDx8G+A224K8G8=.sha256

@aljoscha node uses utf8 encoding by default for all of those, we only have the latin1 quirk because crypto.createHash('sha256').update(string) most unfortunately defaulted to 'binary' in node@<=6.
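
That quirk can be made visible directly; a sketch (not from the thread) using the explicit input-encoding argument of update, where 'binary' is an alias for 'latin1' in current node:

  var crypto = require('crypto')
  // 'ÿ' is U+00FF, any non-ascii string shows the difference
  crypto.createHash('sha256').update('ÿ', 'utf8').digest('hex')   // hashes the bytes c3 bf
  crypto.createHash('sha256').update('ÿ', 'latin1').digest('hex') // hashes the byte ff (the old default)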

@aljoscha %2Uu3OB1KCLFqkXDqcC4LWH/6Y2GjpVZO+0sTz7JEREA=.sha256

@Dominic

thankfully, something sensible is happening here @regular, and you did not just create a message that makes @aljoscha's spec more complicated ;)
[...]
So @aljoscha good news, reject such a message!

It's not this simple: If I sent this message to an sbot during replication, the sbot would not reject it. It would perform the character replacement and then accept it if the hash is correct. This is the behavior we have to put into the spec, unless you deem it a bug and change sbot to flat-out reject such messages rather than patching them up and then validating them as if nothing happened.

I'll take your "good news" as confirmation that a conforming implementation should fully reject such a message, but you'll need to update the js implementations (apparently here and here, and you might want to double-check whether those are indeed the only places where buffer -> string conversions happen).

Can I get a final ack that unconditional rejection of invalid utf-8 is the behavior to put into the spec?

@aljoscha %KKzmtqYDiXifHzHkhJl7r9/ouHt95aksTNtpe0qja9I=.sha256

^^^ @Dominic

@Dominic %W0/US4odp23xgalqbtFyv8dTGCknDeEVeYrJhag2rRk=.sha256

@aljoscha it would reject it, because if you signed the invalid character, and sbot then replaced that before checking the signature, the signature would not be valid, and messages without valid signatures are rejected.

Although, if you signed the valid utf8, but then sent invalid utf8 that would be converted back into that valid utf8, then that would be accepted. But we could safely just ignore/error on anything that sends invalid utf8, because the reference implementation never sends that. The wire protocol is easier to change than the signing protocol!

The rest of the api would currently accept and convert invalid utf8. I'm pretty confident nothing depends on this.

Yes, let's just say "all json must be valid utf8"
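
A sketch of what that rule could look like at the decoding boundary; the function name is made up, and it assumes a node where TextDecoder is available:

  // hypothetical boundary check: "all json must be valid utf8"
  function parseSsbJson (buf) {
    // fatal: true makes the decoder throw on invalid utf8
    // instead of silently substituting U+FFFD
    var text = new TextDecoder('utf-8', { fatal: true }).decode(buf)
    return JSON.parse(text)
  }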

@aljoscha %CwfKiUCf626/HStCdBNxnHXOICKCMWkR/djHI1HPu0s=.sha256

Yes, let's just say "all json must be valid utf8"

:green_heart:

@Daan (deprecated) %yguQGbduG2F3xglPGs+kB9Xv01aVl9ortkJa8A15q+s=.sha256

so for a layman: does that mean @Aljoscha's feed will be considered invalid? You made some posts with invalid strings in them, so any (new?) peer would now reject them, right?

@aljoscha %lFrGAxYhNuVNNfyMrV1m5tBRZrTfbnvgpCJ1sPn8TKI=.sha256

@Daan (the other one):

so for a layman: does that mean @Aljoscha's feed will be considered invalid? You made some posts with invalid strings in them, so any (new?) peer would now reject them, right?

No, fortunately not. The implementation that posted my (and @regular's) data substituted it with something that is safe to replicate.

The outcome of this discussion means that newer implementations do not need to perform that exact same data substitution, instead they can (and in certain circumstances must) completely reject invalid data, without trying to patch it up.
