You are reading content from Scuttlebutt
@aljoscha %kXIiDGhzLT837IxoMsay7z1uoNRsHiqSJV6OWAtWb4k=.sha256

Ready for more unicode fun?

Me neither.

The following is not a problem, the canonicity requirements prevent it. But since I wrote it up before realizing that, I might as well share it.

past Aljoscha:

Json does unicode escapes via four hexadecimal characters that specify the utf-16 code unit to escape. Each of these code units is a valid unicode code point. But not all of those code points are valid unicode scalar values: surrogate code points are not scalar values.
These surrogate code points are ok when they are matched: a high surrogate followed by a low surrogate. The two code units, taken together, encode a single code point outside the basic multilingual plane. That's how json escape sequences are expected to do it, and the world is good. There's a problem though: nothing stops a user from escaping one of those surrogate code points without properly matching it: "\udc00". This cannot be decoded into valid utf-8: a lone surrogate is not a scalar value, and the three bytes a naive encoder would emit for it (0xed, 0xb0, 0x80) are rejected by conforming utf-8 implementations.
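
To make the pairing concrete, here is a small sketch of how a matched high/low surrogate pair combines into one code point. The helper name `combineSurrogates` is made up for illustration; the arithmetic is the usual utf-16 decoding rule.

```javascript
// Hypothetical helper: combine a matched surrogate pair into one code point.
function combineSurrogates(high, low) {
  if (high < 0xD800 || high > 0xDBFF) {
    throw new RangeError("not a high surrogate");
  }
  if (low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a low surrogate");
  }
  // Each surrogate carries 10 bits; the result is offset by 0x10000.
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// The json escape "\ud83d\ude00" is a matched pair.
const codePoint = combineSurrogates(0xD83D, 0xDE00);
console.log(codePoint.toString(16)); // 1f600
console.log(JSON.parse('"\\ud83d\\ude00"').codePointAt(0) === codePoint); // true
```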

But: JSON.parse('"\\udc00"') happily accepts the input and returns a js string containing invalid unicode (which is allowed, js strings are not guaranteed to contain valid unicode). The backslash is doubled so that the json parser, not the js string literal, gets to see the escape. Run JSON.stringify(JSON.parse('"\\udc00"')) in a console to see the string literal, helpfully rendering the invalid code point.
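
A minimal sketch of what that looks like in practice. It only uses built-ins; the choice of encodeURIComponent as the "strict consumer" is just one convenient example of an api that has to produce utf-8.

```javascript
// The double backslash makes sure the json parser, not the js string
// literal, sees the escape sequence.
const parsed = JSON.parse('"\\udc00"');

console.log(parsed.length);                     // 1
console.log(parsed.charCodeAt(0).toString(16)); // dc00

// Strict consumers reject the string: encodeURIComponent has to produce
// utf-8 and throws a URIError on an unpaired surrogate.
let rejected = false;
try {
  encodeURIComponent(parsed);
} catch (e) {
  rejected = e instanceof URIError;
}
console.log(rejected); // true
```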

If we accepted this behavior, it would become impossible to use the language-native unicode string implementations in any language that provides proper validity guarantees for its strings.
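
For illustration, a sketch of how a js implementation could refuse such input instead. `parseJsonStrict` and `LONE_SURROGATE` are hypothetical names, not part of any spec or library; the sketch assumes it is enough to check object keys and string values after parsing.

```javascript
// With the u flag the regex matches by code point, so a matched surrogate
// pair counts as one character and only unpaired surrogates hit this class.
const LONE_SURROGATE = /[\uD800-\uDFFF]/u;

// Hypothetical helper: parse json, but reject unpaired surrogates in
// object keys and string values.
function parseJsonStrict(text) {
  return JSON.parse(text, (key, value) => {
    if (LONE_SURROGATE.test(key)) {
      throw new SyntaxError("unpaired surrogate in object key");
    }
    if (typeof value === "string" && LONE_SURROGATE.test(value)) {
      throw new SyntaxError("unpaired surrogate in string value");
    }
    return value;
  });
}

// A matched pair is fine, a lone surrogate is rejected.
console.log(parseJsonStrict('{"ok":"\\ud83d\\ude00"}').ok.length); // 2
try {
  parseJsonStrict('"\\udc00"');
} catch (e) {
  console.log(e.message); // unpaired surrogate in string value
}
```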

At least that's another obscure detail to add to the spec. There aren't enough obscure details in there yet.

@aljoscha %iuJDgSyh+4Hjcj3EUcjXZKfsuSZF5zFlsRKGpeBh6W8=.sha256

Fun fact: Js strings may contain invalid unicode, but they don't provide a method to check whether they are valid or invalid. If you ever need to know:

// Iterating a string with for...of yields whole code points: a matched
// surrogate pair arrives as a single code point above 0xFFFF, so any
// surrogate that shows up here is unpaired.
for (const code_point_str of str) {
  const code_point = code_point_str.codePointAt(0);
  if (code_point >= 0xD800 && code_point <= 0xDBFF) {
    throw new Error("high surrogate without following low surrogate");
  } else if (code_point >= 0xDC00 && code_point <= 0xDFFF) {
    throw new Error("low surrogate without preceding high surrogate");
  }
}
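
The same check can also be done over utf-16 code units instead of code points, in which case the pairing has to be tracked by hand. A minimal sketch, with `hasOnlyPairedSurrogates` as a made-up name:

```javascript
// Hypothetical helper: returns true if every surrogate in str is paired.
function hasOnlyPairedSurrogates(str) {
  let expectLow = false; // true right after a high surrogate
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (expectLow) {
      if (!isLow) return false; // high surrogate not followed by a low one
      expectLow = false;
    } else if (isLow) {
      return false; // low surrogate without preceding high surrogate
    } else if (isHigh) {
      expectLow = true;
    }
  }
  return !expectLow; // a trailing high surrogate is invalid too
}

console.log(hasOnlyPairedSurrogates("\ud83d\ude00")); // true
console.log(hasOnlyPairedSurrogates("\ud800a"));      // false
```

Newer js engines also ship String.prototype.isWellFormed(), which answers the same question where it is available.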