From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <natanael.l@gmail.com>
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org
	[172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id 35F6F49F
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Tue, 20 Nov 2018 01:52:02 +0000 (UTC)
X-Greylist: whitelisted by SQLgrey-1.7.6
Received: from mail-ed1-f47.google.com (mail-ed1-f47.google.com
	[209.85.208.47])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 9F22D5D4
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Tue, 20 Nov 2018 01:52:00 +0000 (UTC)
Received: by mail-ed1-f47.google.com with SMTP id f4so519877edq.10
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Mon, 19 Nov 2018 17:52:00 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to; 
	bh=doBt+voFUWIiDvnR/ld2Qk96sd5Mm3UQeALmFt+/LWA=;
	b=GlHrhfj3AkDckpwMsjQzrSYV/ORullO5id2OiUM5d8ggrCwq2fmnBqhrd3600KbDaq
	R6YrX1tndTXCsXMIXAPlJQUJQED/bYOmePmFWYO59Olv3CieC6bMvBcLDEO3FN1eJwwq
	VblROZvTh0mjVAipV3Qt8XRYrckG6WuPbAIPT1V6U71OcPIEvghyEdngruthzPelcap/
	1qFCQCK+O4xHGXNE2rMy9e1HEZS5ZNzh6CfefVwJM2Wcy3Qap/zjeipfFHf/QOWcjtUX
	3fZtzR6D1KvKOYzZpiEb6a4KPlZhqGBy6gpPpxFoR7JMQVkmqjbqnmhhoEjl4RR3iHgU
	DWBQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to;
	bh=doBt+voFUWIiDvnR/ld2Qk96sd5Mm3UQeALmFt+/LWA=;
	b=DAsm9mz9bHaVlnTqiLAF3uFeeuHxdW6wd5he2sKWx80A1kjpDis0HzYAnEu9pYDavx
	OvDYpFwzRSPFKsp7EeZQIilTvOZiOCCx4EO1j/CDTsD/NSoR1CiW6CuOdUkxY4So27kR
	ClNMq4vLkGq4REqtUU+tiahx/UnyiVlmiwu+HnnIRohIavSHKA1RRGGkfL00urunJSiI
	XopHBXADnPtqYKQXOi0lggEbIUFAE16R1/LTZws8iRqoEqcd9HXSJGEYzF9jDHO9InHZ
	lPNZ0nv4jBDQ7G58rtxtATGE4ij5vJplS2K0tdXNQ/Hw0YR0u8vt9ioLuVf3uBNehs32
	9ZhQ==
X-Gm-Message-State: AA+aEWYpIBWP8U5nyrQdzXfYERIWX9+Iw6+uk7eraDYxF8kQM5oMph2F
	raDlmoyr6eh9grwO6MYS3IiadggjM2M8TCD9OJ8=
X-Google-Smtp-Source: AFSGD/X/2nxfOSAh+g/IZkORdIw+gliEwpFmc7Shd58mIyvLrD6dCw8QqB4ZTkow8KbfdvVNvjsJfYEj4WZxsX+6piA=
X-Received: by 2002:aa7:d487:: with SMTP id b7mr420242edr.256.1542678719115;
	Mon, 19 Nov 2018 17:51:59 -0800 (PST)
MIME-Version: 1.0
References: <CABsxsG27bJN0vGRJOP3=zriPvkL+G8n3t2nd6Y8L6KwW4ePdeg@mail.gmail.com>
	<CABsxsG1QZ7h9Cxs=rzuBEOsMY+oJ1n4q5zhRxVE1JsS1e-hgew@mail.gmail.com>
In-Reply-To: <CABsxsG1QZ7h9Cxs=rzuBEOsMY+oJ1n4q5zhRxVE1JsS1e-hgew@mail.gmail.com>
From: Natanael <natanael.l@gmail.com>
Date: Tue, 20 Nov 2018 02:51:44 +0100
Message-ID: <CAAt2M1_5apKwQoBc_WkNox36Y9xRdo5rZRVA51RQHyJVJX0nFg@mail.gmail.com>
To: Steven Hatzakis <shatzakis@gmail.com>, 
	Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Content-Type: multipart/alternative; boundary="0000000000000e36ab057b0ee146"
X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM, HTML_MESSAGE,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	smtp1.linux-foundation.org
X-Mailman-Approved-At: Tue, 20 Nov 2018 03:21:14 +0000
Subject: Re: [bitcoin-dev] BIP- & SLIP-0039 -- better multi-language support
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Bitcoin Protocol Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Tue, 20 Nov 2018 01:52:02 -0000

--0000000000000e36ab057b0ee146
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, 19 Nov 2018 at 21:21, Steven Hatzakis via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> Hi Weiji, and Everyone,
>
> I think this is an important topic, so I am sharing my two cents in
> case it helps. It makes sense for users to know that they can't simply
> translate a word from one language into another and expect it to map
> to the same underlying entropy, as the wordlists are not the same (i.e.
> words differ at the same index values across languages).
>
> However, while the words for each language cannot translate directly to
> their equivalent in another language, in terms of entropy (bits), the
> underlying entropy is, in fact, the same, when comparing mnemonics
> generated across languages (see English/Spanish comparison below) when
> sourced from the same initial entropy.
>
> Importantly, the entropy is a pre-image of the resulting mnemonic and
> doesn't change as the language changes, where the only changes are to the
> resulting words which depend on the language chosen, for a given entropy
> string. Ideally, the wallet/software should deal with these nuances. I
> don't think the protocol needs any revision (except, perhaps, for how
> the BIP39 seed is derived). Even if someone made up their own wordlist,
> as long as the wallet/software has a copy of it to map those words to
> the underlying index values, *those underlying index values and the
> entropy they map to are what really matter.*
>
> I fully support the idea of users backing up this pre-image (initial
> entropy), as it can also be used to check the validity of the mnemonic
> and that it mapped correctly. See Ian Coleman's BIP39 tool, which shows
> index values, a feature that I proposed last year and that has since
> been implemented. Below is an example of how two mnemonics generated
> from the same entropy produce different BIP39 seeds.
>
> *Example initial entropy of 128 bits + 4-bit checksum derived from the
> hash of the byte array:*
>
> 10001101000 01010100100 11011010000 11100001101 01010001101 00010010001
> 01100000010 10101110100 00100100011 11110000111 01100011010 1100010
> (+1110 checksum)
>
> *In English*: minimum fee sure ticket faculty banana gate purse caught
> valley globe shift
>
> The same initial entropy above (all 132 bits) produces this mnemonic:
>
> *In Spanish*: mercado faja soledad tarea evadir aries gafas peine
> bu=CC=81ho tumor gerente reja
>
> And the underlying index values below are the same for both the English
> and Spanish mnemonics above:
>
> Word Indexes: 1128, 676, 1744, 1805, 653, 145, 770, 1396, 291, 1927, 794,
> 1582
>
> *ISSUE AT HAND*:  While the initial entropy is the same, and word indexes
> the same for a given entropy, (i.e. same pre-image), the resulting BIP39
> seed is not the same when comparing the above English mnemonic with its
> Spanish counterpart:
>
>    - *English BIP39 seed:*
>    ce7618075099c89e986f18dc495daa3be190450ed07bef77d4334a54dbc1cd7e205797=
ffed2615ac0999a5d691f65bf316e2cdbfd2c9d7d90b03e77ff1e6a6f5
>    - *Spanish BIP39 seed*:
>    9f164de0fb09af51b5831886e424d6d2479d49b5e5a1b28f5c09467ea36089b144cd94=
bb9b636b3c27ccff96a8958e5b7ce43cf1dea81423fc66fa7fef0aea2c
>
>
> *Option 1:* Without changing anything in terms of the entropy
> generation/mapping process in the BIP39 spec, the wallet/client-side
> software would ideally recognize the language, show the corresponding
> index value per wordlist, reverse-calculate the entropy, and then
> re-map it to the language selected.
>
> *Option 2*: Perhaps a revision is needed to how the BIP39 seed is
> generated in the first place, such as by hashing the entropy instead of
> the words. Any thoughts on how viable it would be to feed the initial
> entropy, rather than the words, into the PBKDF2 function?
>
> *Closing thoughts and tiny checksum nitpick:*
>
>       - The multiple BIP39 seeds per language bear some similarity to
> BIP44 multi-account, so perhaps this can be an advantage, depending on
> how it is applied in UIs (compared to having one BIP39 seed regardless
> of language, for a given initial entropy).
>       - There is perhaps an opportunity to add greater detail to the
> BIP39 spec in terms of standards/best practices for computing checksum
> values, as some software may be hashing bits, versus hashing bytes, or
> hashing the entropy as a hex string, etc. The same entropy can then
> yield different checksum values, so a mnemonic that is "valid" in one
> wallet might not be "valid" in another wallet that formats the data
> differently before hashing to compute the checksum.
>
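
For reference, a minimal Python sketch of the quoted example. It splits
the 132 bits into 11-bit word indexes and derives a seed from each
sentence. I am assuming the standard BIP39 derivation here
(PBKDF2-HMAC-SHA512 over the NFKD-normalized sentence, salt "mnemonic",
2048 rounds, empty passphrase); the helper names are my own.

```python
import hashlib
import unicodedata

# Bit groups copied from the quoted example (128 entropy bits + 4
# checksum bits); the last 7-bit group plus the checksum form index 12.
BITS = (
    "10001101000 01010100100 11011010000 11100001101 01010001101 "
    "00010010001 01100000010 10101110100 00100100011 11110000111 "
    "01100011010 1100010" + "1110"
).replace(" ", "")


def word_indexes(bits):
    """Split a bit string into 11-bit word indexes (0..2047)."""
    return [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]


def bip39_seed(mnemonic, passphrase=""):
    """Assumed BIP39 derivation: PBKDF2 over the sentence itself."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase)
    return hashlib.pbkdf2_hmac(
        "sha512", password, salt.encode("utf-8"), 2048, dklen=64
    )


ENGLISH = ("minimum fee sure ticket faculty banana gate purse caught "
           "valley globe shift")
# \u0301 is the combining accent used in the quoted Spanish sentence.
SPANISH = ("mercado faja soledad tarea evadir aries gafas peine "
           "bu\u0301ho tumor gerente reja")

indexes = word_indexes(BITS)
seed_en = bip39_seed(ENGLISH)
seed_es = bip39_seed(SPANISH)
print(indexes)
print(seed_en.hex())
print(seed_es.hex())
```

Since the sentence itself is the PBKDF2 input, the two seeds differ even
though the indexes match. On feeding something other than the words into
PBKDF2 (your option 2):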

This probably wouldn't work as a drop-in replacement, but having the
identifier of the chosen wordlist be part of the mnemonic might work?
Perhaps the raw seed would then be [hash of chosen dictionary]+[sequence of
word indexes].
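
A rough sketch of what I mean, in Python. Everything here is an
assumption for illustration: hashing the newline-joined wordlist with
SHA-256 as the dictionary identifier, packing each index into two bytes,
and the toy four-word lists (real lists have 2048 entries).

```python
import hashlib


def dictionary_id(words):
    """Hypothetical identifier: SHA-256 of the newline-joined list."""
    return hashlib.sha256("\n".join(words).encode("utf-8")).digest()


def raw_seed(words, indexes):
    """[hash of chosen dictionary] + [sequence of word indexes]."""
    packed = b"".join(i.to_bytes(2, "big") for i in indexes)
    return dictionary_id(words) + packed


# Toy dictionaries, made up for this example.
ENGLISH = ["apple", "bridge", "cloud", "door"]
SPANISH = ["agua", "barco", "cielo", "dedo"]
INDEXES = [2, 0, 3, 1]

seed_en = raw_seed(ENGLISH, INDEXES)
seed_es = raw_seed(SPANISH, INDEXES)
print(seed_en.hex())
print(seed_es.hex())
```

The index part is identical for both lists; only the dictionary prefix
differs, so the choice of dictionary is bound into whatever is derived
from the raw seed.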

The user experience then involves always selecting a dictionary by name. I
also suggest maintaining an official list of named dictionaries.

The purpose of including the dictionary in the seed is that, if you use
the last word as a checksum, you can verify that the dictionary selection
is correct as well as the word sequence.
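
As a sketch only (Python), assuming the checksum word is the top 11 bits
of SHA-256 over the dictionary identifier followed by the entropy. The
function names, the identifier scheme and the placeholder entropy are
all made up for illustration.

```python
import hashlib


def dictionary_id(words):
    """Hypothetical identifier: SHA-256 of the newline-joined list."""
    return hashlib.sha256("\n".join(words).encode("utf-8")).digest()


def checksum_index(words, entropy):
    """Top 11 bits of SHA-256(dictionary id + entropy) as an index."""
    digest = hashlib.sha256(dictionary_id(words) + entropy).digest()
    return int.from_bytes(digest[:2], "big") >> 5


def verify(words, entropy, last_index):
    """True if the last word index fits this dictionary and entropy."""
    return checksum_index(words, entropy) == last_index


# Toy dictionary and placeholder 128-bit entropy.
ENGLISH = ["apple", "bridge", "cloud", "door"]
ENTROPY = bytes(range(16))

last = checksum_index(ENGLISH, ENTROPY)
ok = verify(ENGLISH, ENTROPY, last)
print(last, ok)
```

With an 11-bit checksum, picking the wrong dictionary would pass this
check only by chance, roughly once in 2048 tries.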

This allows substitution of words to other languages by manually specifying
a different input dictionary, but you would then have to remember both the
seed language and the translated language so you can specify both
correctly.

The user experience here matches your option 1, while the implementation
matches option 2.

If you drop the explicit specification of the seed's original language,
you would need auto-detection during entry, since the raw seed is then
just the word indexes. I do not recommend trying that, especially if any
language ends up with multiple competing dictionaries. Even more so if
there are many related languages whose wordlists might collide (like all
the Latin languages, or even US vs UK English...).
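
To illustrate the collision problem, a small Python sketch with made-up
miniature wordlists (not the real BIP39 lists). Detection here simply
keeps every dictionary that contains all the entered words.

```python
def candidate_dictionaries(mnemonic_words, dictionaries):
    """Return names of all dictionaries containing every word."""
    entered = set(mnemonic_words)
    return sorted(
        name for name, words in dictionaries.items()
        if entered <= set(words)
    )


# Hypothetical lists that share spellings across languages.
DICTIONARIES = {
    "english-us": ["color", "animal", "idea", "hotel"],
    "english-uk": ["colour", "animal", "idea", "hotel"],
    "spanish": ["color", "animal", "idea", "hotel"],
}

ambiguous = candidate_dictionaries(["animal", "idea"], DICTIONARIES)
unique = candidate_dictionaries(["colour", "hotel"], DICTIONARIES)
print(ambiguous)
print(unique)
```

The first sentence fits all three lists, so auto-detection cannot decide
which dictionary was meant; only the second one is unambiguous.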


--0000000000000e36ab057b0ee146--