So much fun: https://github.com/rust-lang/rust/issues/54845
Ok, I think I understand it now: It is a valid utf8 encoding of a unicode code point, but the code point happens to not be a valid unicode scalar value (which is what rust `char`s are).
Follow-up question (if you don't mind me derailing this):
Wikipedia says:
Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.
Does that mean that nodejs decoding a buffer of those utf8 bytes into a string without complaining is a unicode violation?
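
For comparison, here is a quick Rust sketch of how a strict decoder treats those bytes. `encode_three_byte` and `is_scalar_value` are hypothetical helpers I made up for illustration (they are not std functions); the first one just applies the generic three-byte utf8 bit pattern without any surrogate check:

```rust
/// Hypothetical helper: apply the generic three-byte UTF-8 bit pattern
/// (1110xxxx 10xxxxxx 10xxxxxx) to a code point in 0x0800..=0xFFFF.
/// It deliberately does NOT reject surrogates, so it can emit invalid UTF-8.
fn encode_three_byte(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

/// Hypothetical helper: a code point is a Unicode scalar value
/// exactly when it can be turned into a Rust `char`.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Strict decoding: returns true only for well-formed UTF-8.
fn strict_decode_ok(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With these, `encode_three_byte(0xD800)` gives the bytes `ED A0 80`, `strict_decode_ok` returns false for them, and `is_scalar_value(0xD800)` is false as well. A lossy decoder such as `String::from_utf8_lossy` would not fail on them, but it would substitute U+FFFD replacement characters rather than keep the surrogate.
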
I'm not sure I fully understand the interplay of all the involved specs here (unicode, utf8, utf16, ECMAScript JSON.parse, ECMA-404 aka json), but if this is actually a bug in nodejs, we'll need to decide whether we want to
- explicitly allow utf8 encodings of surrogate half codepoints, or
- stick to valid unicode and change the js implementation
I'd strongly advocate the latter option, but that would break all feeds containing messages that store invalid unicode code points in json strings. That is quite likely only my own feed, unless anybody else has also pasted some very abstruse strings into their posts...
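
To make the second option a bit more concrete, here is a minimal Rust sketch of the kind of check it would need. It assumes the string is available as utf16 code units (which is how a js string stores its contents), and `is_well_formed_utf16` is a hypothetical name, not part of any existing implementation:

```rust
use std::char::decode_utf16;

/// Hypothetical check: returns true if the UTF-16 code units contain
/// no unpaired surrogate, i.e. every high surrogate is immediately
/// followed by a low surrogate and no low surrogate stands alone.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // `decode_utf16` yields Err for each unpaired surrogate it encounters.
    decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}
```

A properly paired sequence such as `[0xD83D, 0xDE00]` passes, while a lone `[0xD800]` or a reversed pair is rejected. Whether messages failing such a check should be dropped or rejected at parse time is exactly the decision that still needs to be made.
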