[+cc aaron]

We recently added an implementation of BIP 38 (password protected private keys) to bitcoinj. It came to my attention that the third test vector may be broken. It gives a hex version of what the NFC normalised version of the input string should be, but this does not match the results of the Java unicode normaliser, and in fact I can't even get Python to print the names of the characters past the embedded null. I'm curious where this normalised version came from.

Given that "pile of poo" is not a character I think any sane user would put into a passphrase, I question the value of this test vector. NFC form is intended to collapse things like umlaut control characters onto their prior code point, but here we're feeding the algorithm what is basically garbage so I'm not totally surprised that different implementations appear to disagree on the outcome.

Proposed action: we remove this test vector as it does not represent any real world usage of the spec, or if we desperately need to verify NFC normalisation I suggest using a different, more realistic test string, like Zürich, or something written in Thai.



Test 3:
  • Passphrase ϓ␀𐐀💩 (\u03D2\u0301\u0000\U00010400\U0001F4A9GREEK UPSILON WITH HOOKCOMBINING ACUTE ACCENTNULLDESERET CAPITAL LETTER LONG IPILE OF POO)
  • Encrypted key: 6PRW5o9FLp4gJDDVqJQKJFTpMvdsSGJxMYHtHaQBF3ooa8mwD69bapcDQn
  • Bitcoin Address: 16ktGzmfrurhbhi6JGqsMWf7TyqK9HNAeF
  • Unencrypted private key (WIF): 5Jajm8eQ22H3pGWLEVCXyvND8dQZhiQhoLJNKjYXk9roUFTMSZ4
  • Note: The non-standard UTF-8 characters in this passphrase should be NFC normalized to result in a passphrase of0xcf9300f0909080f09f92a9 before further processing