From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from sog-mx-1.v43.ch3.sourceforge.com ([172.29.43.191] helo=mx.sourceforge.net) by sfs-ml-3.v29.ch3.sourceforge.com with esmtp (Exim 4.76) (envelope-from ) id 1VcSsA-0007q6-Sl for bitcoin-development@lists.sourceforge.net; Sat, 02 Nov 2013 04:31:42 +0000 Received-SPF: pass (sog-mx-1.v43.ch3.sourceforge.com: domain of midnightdesign.ws designates 50.87.144.70 as permitted sender) client-ip=50.87.144.70; envelope-from=boydb@midnightdesign.ws; helo=gator3054.hostgator.com; Received: from gator3054.hostgator.com ([50.87.144.70]) by sog-mx-1.v43.ch3.sourceforge.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.76) id 1VcSs9-0002EC-6p for bitcoin-development@lists.sourceforge.net; Sat, 02 Nov 2013 04:31:42 +0000 Received: from [74.125.82.54] (port=35570 helo=mail-wg0-f54.google.com) by gator3054.hostgator.com with esmtpsa (TLSv1:RC4-SHA:128) (Exim 4.80) (envelope-from ) id 1VcSs2-0000or-St for bitcoin-development@lists.sourceforge.net; Fri, 01 Nov 2013 23:31:35 -0500 Received: by mail-wg0-f54.google.com with SMTP id c11so275468wgh.33 for ; Fri, 01 Nov 2013 21:31:33 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=j20tqOPQbTrxaIvRjNHmXmEb1JxhKRxiOKCwraFtvMg=; b=EXHgXZUT3+bgqPR0gWQzTxSS/a+PG2oPZtXptGktUfbaKp5bc0vNkq1kWOro3TrSjr 1QK/QxGYC6eYd1hm1eqoBe/JTdHucom8xHPOYHVIXvjtp0rqI0A9+H9OrBT+6Dg/lVo2 ytklJfGtUP3C50v9nzCSKir9aIzC7GaEG1L4OdsQer1K5mgpwx5ZwskaOk9vCP1RzkCG rc7xRutcjmYGGT35TB9yk75VizDRl4Po5nvn3JZjhwFpiRCD20dMofdmSziiiw1aCZMG Dhc0YovIwz+e8/K+mEN3/HzaVejjELA50GNe4meOk6RZ2k3hKajrZsg7R9xiTmwrfVXv B5SQ== X-Gm-Message-State: ALoCoQn84Qw9M1iEq966hO7SZ0A0Z/7hcdZqXF8engBMiIJsncVEY+QfugoyqIqc5hcsvdHl3V1Q MIME-Version: 1.0 X-Received: by 10.180.72.238 with SMTP id g14mr4683488wiv.17.1383366693105; Fri, 01 Nov 2013 21:31:33 -0700 (PDT) Received: by 10.227.60.6 with HTTP; Fri, 
1 Nov 2013 21:31:33 -0700 (PDT) In-Reply-To: References: Date: Fri, 1 Nov 2013 23:31:33 -0500 Message-ID: From: Brooks Boyd To: slush Content-Type: multipart/alternative; boundary=f46d043d673756507404ea2a2850 X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - gator3054.hostgator.com X-AntiAbuse: Original Domain - lists.sourceforge.net X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - midnightdesign.ws X-BWhitelist: no X-Source: X-Source-Args: X-Source-Dir: X-Source-Sender: (mail-wg0-f54.google.com) [74.125.82.54]:35570 X-Source-Auth: midnight X-Email-Count: 1 X-Source-Cap: bWlkbmlnaHQ7bWlkbmlnaHQ7Z2F0b3IzMDU0Lmhvc3RnYXRvci5jb20= X-Spam-Score: -0.5 (/) X-Spam-Report: Spam Filtering performed by mx.sourceforge.net. See http://spamassassin.org/tag/ for more details. -1.5 SPF_CHECK_PASS SPF reports sender host as permitted sender for sender-domain -0.0 SPF_HELO_PASS SPF: HELO matches SPF record -0.0 SPF_PASS SPF: sender matches SPF record 0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. 
 [URIs: midnightdesign.ws]
 1.0 HTML_MESSAGE BODY: HTML included in message
X-Headers-End: 1VcSs9-0002EC-6p
Cc: bitcoin-development@lists.sourceforge.net
Subject: Re: [Bitcoin-development] BIP39 word list
X-BeenThere: bitcoin-development@lists.sourceforge.net
X-Mailman-Version: 2.1.9
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 02 Nov 2013 04:31:43 -0000

--f46d043d673756507404ea2a2850
Content-Type: text/plain; charset=ISO-8859-1

That would be a way to go, though iterating through all possibilities of a
similar-letter misspelling would take significantly more processing
(4x3x3 = 36 total possibilities, only to cull it back to 2, in your example)
than iterating through a list of pre-calculated possibilities. It's
definitely not a hard computation on any modern device, though, and
depending on how "helpful" the program wants to try to be, it could even try
to help with misspellings due to hitting a keyboard key next to the correct
one or hitting a letter twice, depending on how big a comparison matrix it
wants to create.

I do agree it should not be required for clients implementing the BIP to
help fix mis-translations, though I like keeping the similar-letter unit
test in there, since it helps convey the thought that went into culling some
words from the dictionary. Though to Allen's point, what did happen with the
words that were found to be similar: was one of the similar words left in
the list, or were all the similar words removed?

Brooks
MidnightLightning


On Fri, Nov 1, 2013 at 7:04 PM, slush wrote:

> Hi Brooks,
>
> I've already been thinking about the eat -> cat typing mistake. Actually
> there may be a simpler solution than having a wordlist with duplicated words.
> Because there's already a mapping of similar characters in the source code
> (currently only in the unit test, but it can be moved), when a user types a
> word which isn't in the wordlist, the application may try to use that
> mapping to find a combination which actually is in the wordlist. This may
> be ambiguous in some cases, but giving a choice between a few words may be
> better than a hard fail. And it is actually quite easy to implement.
> Although I think an application can make such smart suggestions and help
> the user recover a badly written mnemonic, I don't think it is necessary to
> standardize such a method directly in the BIP. It may or may not be
> implemented by developers, and it is just a nice-to-have feature.
>
> Example:
>
> The user types ear, but it isn't in the wordlist.
>
> Regarding the mapping,
> E is similar to A, C, F, O
> A is similar to E, C, O
> R is similar to B, P, H
>
> So the application can calculate combinations of possible characters:
>
> a) when the app considers that the user mistyped only one character
> AAR, CAR, FAR, OAR
> EER, ECR, EOR
> EAB, EAP, EAH
>
> b) when the app considers that the user may have mistyped more characters,
> it may do the full combination matrix
> AEB, ACB, AOB, ... OEH, OCH, OOH
>
> and then ask the user to select among only those combinations which are
> actually present in the wordlist. In this particular case it may be only
> CAR or FAR (both cannot be in the wordlist because of the similarity rules).
>
> Marek
>
>
> On Fri, Nov 1, 2013 at 9:14 PM, Brooks Boyd wrote:
>
>> I was inspired to join the mailing list to comment on some of these
>> discussions about BIP39, which I think will have great use in the Bitcoin
>> community and outside it as a way to transcribe binary data.
>>
>> The one thought I had, as the discussions about similar characters are
>> resulting in culling words from the list, is that it only helps to
>> validate input, not to help the user if it is incorrect.
>>
>> For example, if both "cat" and "eat" were in the word list, and someone
>> wrote down "eat", but later mis-translated it and put "cat" back into the
>> translator, the result would be a checksum error; "cat" is a different
>> number, so the checksum would fail.
>>
>> As it currently stands, "cat" would not be a valid word ("eat" is the
>> real word, and no other number is "cat"), so the translator can throw a
>> different error which is more helpful (i.e. "'cat' isn't a valid word
>> choice"), but still doesn't get the user to the proper translation.
>>
>> What if the wordlist included those "words that are so similar to each
>> other that we only kept one of them" and had them all refer to the same
>> number? I propose the wordlist have the possibility of multiple words on
>> a single line, with the first word on the line being the "primary" or
>> "real" word to be used, and the other similar words included so that a
>> translation program, if it wanted to assist the user, could fix their
>> input for them (verbosely or not), along the lines of "'cat' isn't a
>> valid word choice; assuming you meant 'eat', which is valid". You might
>> still hit a checksum error if that similar word is still the wrong word.
>> But as it stands now, I know you culled a bunch of words from the wordlist
>> as "too similar", yet if I want to help the user fix a bad input, I need
>> to write a translation program with a full English dictionary alongside
>> the BIP39 dictionary.
>>
>> I'd be willing to create a pull request for such an update, but before I
>> delve into that, does this sound like a good idea? I could see it devolving
>> into a slippery slope if every number in the 2048 set had a dozen word
>> variations (misspellings, similar words, slang terms for the real word,
>> etc.)
>> which could get confusing about how similar is similar enough to be
>> added as an alternate, and the standard would need to be clear that when
>> translating binary to words, you only use the "main" word for that row,
>> not any of the variations.
>>
>> MidnightLightning
>>
>>
>> > I've just pushed an updated wordlist which is filtered for similar
>> > characters taken from this matrix.
>> > BIP39 now considers the following character pairs as similar:
>> >         similar = (
>> >             ('a', 'c'), ('a', 'e'), ('a', 'o'),
>> >             ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
>> >             ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
>> >             ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
>> >             ('e', 'f'), ('e', 'o'),
>> >             ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
>> >             ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
>> >             ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
>> >             ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
>> >             ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
>> >             ('k', 'x'),
>> >             ('l', 't'),
>> >             ('m', 'n'), ('m', 'w'),
>> >             ('n', 'u'), ('n', 'z'),
>> >             ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
>> >             ('p', 'q'), ('p', 'r'),
>> >             ('q', 'y'),
>> >             ('s', 'z'),
>> >             ('u', 'v'), ('u', 'w'), ('u', 'y'),
>> >             ('v', 'w'), ('v', 'y')
>> >         )
>> > Feel free to review and comment on the current wordlist, but I think
>> > we're slowly moving toward the final list.
>> > slush
>>
>>
>> _______________________________________________
>> Bitcoin-development mailing list
>> Bitcoin-development@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/bitcoin-development
>>
>>
>

--f46d043d673756507404ea2a2850
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

--f46d043d673756507404ea2a2850--
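
As a rough illustration of the lookup slush describes in the quoted message (cases a and b for the mistyped word "ear"), here is a minimal Python sketch. It reuses the `similar` pairs quoted in the thread. The function names (`build_similarity_map`, `single_substitutions`, `full_substitutions`, `suggest`) and the three-word sample wordlist are assumptions made for illustration only; they are not taken from BIP39 or its reference code.

```python
from itertools import product

# Similar-character pairs, copied from the list quoted in the thread.
SIMILAR = (
    ('a', 'c'), ('a', 'e'), ('a', 'o'),
    ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
    ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
    ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
    ('e', 'f'), ('e', 'o'),
    ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
    ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
    ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
    ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
    ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
    ('k', 'x'),
    ('l', 't'),
    ('m', 'n'), ('m', 'w'),
    ('n', 'u'), ('n', 'z'),
    ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
    ('p', 'q'), ('p', 'r'),
    ('q', 'y'),
    ('s', 'z'),
    ('u', 'v'), ('u', 'w'), ('u', 'y'),
    ('v', 'w'), ('v', 'y'),
)


def build_similarity_map(pairs):
    """Map each letter to the set of letters it may be confused with."""
    mapping = {}
    for x, y in pairs:
        # Similarity is symmetric, so record the pair in both directions.
        mapping.setdefault(x, set()).add(y)
        mapping.setdefault(y, set()).add(x)
    return mapping


SIMILAR_MAP = build_similarity_map(SIMILAR)


def single_substitutions(word):
    """Case (a): candidates differing from `word` in exactly one position."""
    candidates = set()
    for i, ch in enumerate(word):
        for alt in SIMILAR_MAP.get(ch, ()):
            candidates.add(word[:i] + alt + word[i + 1:])
    return candidates


def full_substitutions(word):
    """Case (b): candidates with every position replaced by a similar letter."""
    choices = [sorted(SIMILAR_MAP.get(ch, ())) for ch in word]
    return {''.join(combo) for combo in product(*choices)}


def suggest(word, wordlist):
    """Return wordlist entries reachable from `word` via case (a) or case (b)."""
    word = word.lower()
    if word in wordlist:
        return [word]
    candidates = single_substitutions(word) | full_substitutions(word)
    return sorted(candidates & set(wordlist))


# Hypothetical three-word wordlist, for illustration only.
SAMPLE_WORDLIST = {"car", "eat", "zoo"}
print(suggest("ear", SAMPLE_WORDLIST))
```

For "ear" this enumerates 10 single-substitution candidates and 4x3x3 = 36 full-matrix candidates, the count mentioned in the reply above. A precomputed table of alternates, as proposed earlier in the thread, would replace this enumeration with a single lookup per word.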