Took a closer look at message signing, hash computation and length limitation today, as a first step towards a precise and comprehensive ssb spec. The results might be somewhat surprising to a few of you:
Hash Computation
The hash of a message is not computed directly over the signing encoding, but it is derived from it. The signing encoding always results in valid unicode. Represent that unicode in utf-16, which is a sequence of code units of two bytes each. The data to hash is obtained by keeping only the less significant byte of each code unit.
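A minimal sketch of that byte extraction in TypeScript against node's crypto module; sha256 as the hash function is my assumption (it matches what ssb uses for message ids), and the function names are made up for illustration:

```typescript
import { createHash } from "crypto";

// Keep only the less significant byte of every utf-16 code unit.
function hashInputBytes(signingEncoding: string): Uint8Array {
  const bytes = new Uint8Array(signingEncoding.length);
  for (let i = 0; i < signingEncoding.length; i++) {
    // charCodeAt returns the utf-16 code unit (0..0xFFFF);
    // masking with 0xff drops the more significant byte.
    bytes[i] = signingEncoding.charCodeAt(i) & 0xff;
  }
  return bytes;
}

// Hash those bytes; sha256 is an assumption here, not taken from this post.
function legacyMessageHash(signingEncoding: string): Buffer {
  return createHash("sha256").update(hashInputBytes(signingEncoding)).digest();
}
```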
Example: Suppose you want to compute the hash for "ß" (including the quotes); the corresponding utf-8 is [22, C3, 9F, 22]. In big-endian utf-16 this is [(0, 22), (0, DF), (0, 22)], in little-endian utf-16 it is [(22, 0), (DF, 0), (22, 0)]. In both cases, the sequence of less significant bytes per code unit is [22, DF, 22]. That is the byte array over which to compute the hash.

Note that this means two strings with different utf-8 encodings can result in the same hash, because the information in the more significant byte of each utf-16 code unit is dropped.
Length Computation
Ssb places a limit on the size of legacy messages. To determine whether a message is too long, compute the signing encoding (which is always valid unicode), encode that unicode as utf-16, and count the number of code units. This number must be smaller than 16385 (== 8192 * 2 + 1), or the message is considered too long (16384 code units are still ok).
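If that reading is right, the check is trivial in a utf-16 based language, since a js/ts string's length already counts utf-16 code units. A sketch (the function name is mine):

```typescript
// "Smaller than 16385" is the same as "at most 16384 code units".
function isTooLong(signingEncoding: string): boolean {
  return signingEncoding.length > 16384;
}
```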
I'll set up test suites to verify that this is indeed the behavior of the js implementation, and for other implementations to test for conformance. But it'll be some time before I reach that point. My next goal is to spec out the encoding used for signing without just bailing out by referencing the ecmascript spec.
CC @Dominic (the last posts I read from you on these issues talk about utf8, even where things are based on utf16 instead), @cryptix (is the go implementation compatible with this?), @duncan (useful information for the protocol guide)