[Bitcoin-development] BIP39 word list

public inbox for bitcoindev@googlegroups.com

* [Bitcoin-development] BIP39 word list
@ 2013-10-18 23:52 jan
  2013-10-18 23:58 ` Gregory Maxwell
  2013-10-23  0:56 ` slush
  0 siblings, 2 replies; 9+ messages in thread
From: jan @ 2013-10-18 23:52 UTC (permalink / raw)
  To: bitcoin-development


The words 'public', 'private' and 'secret' could be confusing when
encoding public and private keys, e.g. a private key that begins with
the word 'public'.

I think avoiding words that could look similar when written down would
be a good idea as well. I searched for words that differ only by the
letters c & e, g & y, u & v and found the following:

car ear
cat eat
gear year
value valve

Other combinations could potentially be problematic depending on the
handwriting style: ft, ao, ij, vy, possibly even lt and il?

I've included the search utility I used below.


#include <stdbool.h>
#include <string.h>
#include <stdio.h>

char *similar_char_pairs[] = { "ce", "gy", "uv", NULL };

bool is_similar_char(char c1, char c2)
{
  char **pairs = similar_char_pairs;
  do {
    char *p = *pairs;
    if ((c1 == p[0] && c2 == p[1]) ||
        (c1 == p[1] && c2 == p[0]))
      return true;
  } while (*++pairs);

  return false;
}

bool print_words_if_similar(char *word1, char *word2)
{
  /* reject words of different lengths */
  if (strlen(word1) != strlen(word2))
    return false;

  size_t i, similarcount = 0;
  
  for (i = 0; i < strlen(word1); i++) {
    /* skip identical letters */
    if (word1[i] == word2[i])
      continue;

    /* reject words that don't match */
    if (is_similar_char(word1[i], word2[i]) == false)
      return false;

    similarcount++;
  }

  /* reject words with more than 1 different letter */
  //if (similarcount > 1)
  //  return false;

  printf("%s %s\n", word1, word2);

  return true;
}

int main(void)
{
  /* english.txt is assumed to exist in the working directory
     download from:
     https://github.com/trezor/python-mnemonic/blob/master/mnemonic/wordlist/english.txt */
  FILE* f = fopen("english.txt", "r");
  if (!f) {
    fprintf(stderr, "failed to open english.txt\n");
    return 1;
  }

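  /* read in word list, assumes one word per line */
  #define MAXWORD 16
  char wordlist[2048][MAXWORD];
  char line[MAXWORD];
  int word = 0;
  while (fgets(line, MAXWORD, f)) {
    /* strip trailing whitespace, assumes no leading whitespace */
    char *ch = strpbrk(line, " \r\n\t");
    if (ch)
      *ch = '\0';
    /* store only while there is room, but keep counting so that the
       length check below still rejects an over-long file */
    if (word < 2048)
      strcpy(wordlist[word], line);
    word++;
  }
  fclose(f);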

  if (word != 2048) {
    fprintf(stderr, "word list incorrect length\n");
    return 1;
  }

  /* check each word for similarity against every other word */
  int i, j, count = 0;
  for (i = 0; i < 2048; i++) {
    for (j = i+1; j < 2048; j++) {
      if (print_words_if_similar(wordlist[i], wordlist[j]))
        count++;
    }
  }

  printf("%d matches\n", count);
  
  return 0;
}




* Re: [Bitcoin-development] BIP39 word list
  2013-10-18 23:52 [Bitcoin-development] BIP39 word list jan
@ 2013-10-18 23:58 ` Gregory Maxwell
  2013-10-19 10:11   ` Pavol Rusnak
  2013-10-24 13:26   ` slush
  2013-10-23  0:56 ` slush
  1 sibling, 2 replies; 9+ messages in thread
From: Gregory Maxwell @ 2013-10-18 23:58 UTC (permalink / raw)
  To: jan; +Cc: Bitcoin Development

some fairly old wordlist solver code of mine:

https://people.xiph.org/~greg/wordlist.visual.py

it has a 52x52 letter visual similarity matrix in it (along with a citation)


* Re: [Bitcoin-development] BIP39 word list
  2013-10-18 23:58 ` Gregory Maxwell
@ 2013-10-19 10:11   ` Pavol Rusnak
  2013-10-24 13:26   ` slush
  1 sibling, 0 replies; 9+ messages in thread
From: Pavol Rusnak @ 2013-10-19 10:11 UTC (permalink / raw)
  To: Bitcoin Development

On 19/10/13 01:58, Gregory Maxwell wrote:
> https://people.xiph.org/~greg/wordlist.visual.py

>> I've included the search utility I used below.

Yeah, there are lots of tools on the Internet. Posting links to them is
not helping. Sending pull requests with particular changesets with
explanation is. Well, or rather was. I think we are past the point where
it was wise to introduce changes to the word list ... (especially when
50 people have 51 different opinions on this topic)

-- 
Best Regards / S pozdravom,

Pavol Rusnak <stick@gk2.sk>


* Re: [Bitcoin-development] BIP39 word list
  2013-10-18 23:52 [Bitcoin-development] BIP39 word list jan
  2013-10-18 23:58 ` Gregory Maxwell
@ 2013-10-23  0:56 ` slush
  1 sibling, 0 replies; 9+ messages in thread
From: slush @ 2013-10-23  0:56 UTC (permalink / raw)
  Cc: bitcoin-development

I think this is a good idea; I just pushed a new unit test, test_similarity(),
to GitHub, which finds such similar words. Right now it identifies ~90
similar pairs in the current wordlist. I'll update the wordlist tomorrow so
that it passes this test.

slush


* Re: [Bitcoin-development] BIP39 word list
  2013-10-18 23:58 ` Gregory Maxwell
  2013-10-19 10:11   ` Pavol Rusnak
@ 2013-10-24 13:26   ` slush
  1 sibling, 0 replies; 9+ messages in thread
From: slush @ 2013-10-24 13:26 UTC (permalink / raw)
  To: Gregory Maxwell; +Cc: Bitcoin Development

I've just pushed an updated wordlist, filtered using similar characters
taken from this matrix.

BIP39 now considers the following character pairs as similar:

        similar = (
            ('a', 'c'), ('a', 'e'), ('a', 'o'),
            ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
            ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
            ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
            ('e', 'f'), ('e', 'o'),
            ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
            ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
            ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
            ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
            ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
            ('k', 'x'),
            ('l', 't'),
            ('m', 'n'), ('m', 'w'),
            ('n', 'u'), ('n', 'z'),
            ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
            ('p', 'q'), ('p', 'r'),
            ('q', 'y'),
            ('s', 'z'),
            ('u', 'v'), ('u', 'w'), ('u', 'y'),
            ('v', 'w'), ('v', 'y')
        )

Feel free to review and comment on the current wordlist, but I think we're
slowly moving toward the final list.
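
A minimal sketch of how a wordlist could be checked against these pairs. This
is not the test_similarity() code from the repository. The rule used here
(equal length, exactly one differing position, and that letter pair listed in
the table) is an assumption, and words_similar, find_similar_pairs and the
sample words are illustrative names only.

```python
from itertools import combinations

# The pairs from the list above, written compactly as two-letter strings.
PAIRS = (
    "ac ae ao bd bh bp bq br ce cg cn co cq cu dg dh do dp dq ef eo "
    "fi fj fl fp ft gj go gp gq gy hk hl hm hn hr ij il it iy jl jp jq jy "
    "kx lt mn mw nu nz op oq ou ov pq pr qy sz uv uw uy vw vy"
).split()
SIMILAR = {frozenset(p) for p in PAIRS}


def words_similar(w1, w2):
    """Assumed rule: same length, one differing position, pair in SIMILAR."""
    if len(w1) != len(w2):
        return False
    diffs = [frozenset((a, b)) for a, b in zip(w1, w2) if a != b]
    return len(diffs) == 1 and diffs[0] in SIMILAR


def find_similar_pairs(wordlist):
    """Return every unordered pair of words that the rule flags."""
    ordered = sorted(wordlist)
    return [(a, b) for a, b in combinations(ordered, 2) if words_similar(a, b)]


# Sample words taken from the first message in this thread, plus one filler.
sample = ["car", "ear", "cat", "eat", "gear", "year", "value", "valve", "dog"]
for a, b in find_similar_pairs(sample):
    print(a, b)
```

A wordlist would pass such a check when find_similar_pairs returns an empty
list.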

slush



* Re: [Bitcoin-development] BIP39 word list
  2013-11-02  0:04 ` slush
@ 2013-11-02  4:31   ` Brooks Boyd
  0 siblings, 0 replies; 9+ messages in thread
From: Brooks Boyd @ 2013-11-02  4:31 UTC (permalink / raw)
  To: slush; +Cc: bitcoin-development

That would be a way to go, though iterating through all possibilities of a
similar-letter misspelling would take significantly more processing (4x3x3
= 36 total possibilities, only to cull it back to 2, in your example) than
iterating through a list of pre-calculated possibilities. It's definitely
not a hard computation on any modern device, though, and depending on how
"helpful" the program wants to be, it could even try to help with
misspellings caused by hitting a keyboard key next to the correct one or
hitting a letter twice, depending on how big a comparison matrix it wants
to create.
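
The trade-off in the paragraph above can be sketched roughly as follows. The
first function reproduces the 4x3x3 = 36 count for "ear" from the pair table
posted earlier in the thread. The second builds the kind of pre-calculated
lookup mentioned above. The function names and the three-word toy list are
assumptions for illustration; this is not code from the BIP39 repository.

```python
from math import prod

# Similar-letter pairs from slush's list, as two-letter strings.
PAIRS = (
    "ac ae ao bd bh bp bq br ce cg cn co cq cu dg dh do dp dq ef eo "
    "fi fj fl fp ft gj go gp gq gy hk hl hm hn hr ij il it iy jl jp jq jy "
    "kx lt mn mw nu nz op oq ou ov pq pr qy sz uv uw uy vw vy"
).split()

# Letter -> set of letters it may be confused with.
NEIGHBOURS = {}
for a, b in PAIRS:
    NEIGHBOURS.setdefault(a, set()).add(b)
    NEIGHBOURS.setdefault(b, set()).add(a)


def full_substitution_count(word):
    """Candidates produced when every letter is swapped for a similar one."""
    return prod(len(NEIGHBOURS.get(ch, ())) for ch in word)


def build_correction_table(wordlist):
    """Pre-calculate misspelling -> valid words (one swapped letter each)."""
    valid = set(wordlist)
    table = {}
    for word in wordlist:
        for i, ch in enumerate(word):
            for sub in NEIGHBOURS.get(ch, ()):
                typo = word[:i] + sub + word[i + 1:]
                if typo not in valid:
                    table.setdefault(typo, set()).add(word)
    return table


print(full_substitution_count("ear"))  # 4 * 3 * 3

toy_wordlist = ["car", "far", "dog"]  # hypothetical, not the real wordlist
table = build_correction_table(toy_wordlist)
print(sorted(table["ear"]))
```

With the table built once, correcting an input is a single dictionary lookup,
at the cost of storing every one-letter variant of every word.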

I do agree that clients implementing the BIP should not be required to
help fix mis-translations, though I like keeping the similar-letter unit
test in there, since it helps convey the thought that went into culling some
words from the dictionary. To Allen's point, though: what did happen with
the words that were found to be similar? Was one of the similar words left
in the list, or were all the similar words removed?

Brooks
MidnightLightning



* Re: [Bitcoin-development] BIP39 word list
  2013-11-01 20:14 Brooks Boyd
  2013-11-01 23:41 ` Allen Piscitello
@ 2013-11-02  0:04 ` slush
  2013-11-02  4:31   ` Brooks Boyd
  1 sibling, 1 reply; 9+ messages in thread
From: slush @ 2013-11-02  0:04 UTC (permalink / raw)
  To: Brooks Boyd; +Cc: bitcoin-development

Hi Brooks,

I've already been thinking about the eat -> cat typing mistake. Actually there
may be a simpler solution than having a wordlist with duplicated words.
Because there's already a mapping of similar characters in the source code
(currently only in the unit test, but it can be moved), when the user types a
word which isn't in the wordlist, the application may try to use that mapping
to find a combination which actually is in the wordlist. This may be ambiguous
in some cases, but giving a choice between a few words may be better than a
hard fail. And it is actually quite easy to implement. Although I think an
application can make such smart suggestions and help the user recover a badly
written mnemonic, I don't think it is necessary to standardize such a method
directly in the BIP. It may or may not be implemented by developers; it is
just a nice-to-have feature.

Example:

The user types ear, but it isn't in the wordlist.

According to the mapping,
E is similar to A, C, F, O
A is similar to E, C, O
R is similar to B, P, H

So the application can calculate combinations of possible characters:

a) when the app assumes that the user mistyped only one character
AAR, CAR, FAR, OAR
EER, ECR, EOR
EAB, EAP, EAH

b) when the app assumes that the user may have mistyped more characters, it
may build the full combination matrix
AEB,  ACB, AOB,  ... OEH, OCH, OOH

and then ask the user to select only from those combinations which are
actually present in the wordlist. In this particular case it may be only CAR
or FAR (both cannot be in the wordlist because of the similarity rules).
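
The suggestion step described above might look roughly like this. It is a
sketch under stated assumptions, not code from the repository: the function
names and the toy wordlist are made up, and case (b) is interpreted here as
"each letter is either kept or replaced", which is broader than the 36
all-letters-replaced combinations listed in the example.

```python
from itertools import product

# Similar-letter pairs from the list posted earlier in this thread.
PAIRS = (
    "ac ae ao bd bh bp bq br ce cg cn co cq cu dg dh do dp dq ef eo "
    "fi fj fl fp ft gj go gp gq gy hk hl hm hn hr ij il it iy jl jp jq jy "
    "kx lt mn mw nu nz op oq ou ov pq pr qy sz uv uw uy vw vy"
).split()

NEIGHBOURS = {}
for a, b in PAIRS:
    NEIGHBOURS.setdefault(a, set()).add(b)
    NEIGHBOURS.setdefault(b, set()).add(a)


def single_substitutions(word):
    """Case (a): exactly one letter replaced by a similar one."""
    out = []
    for i, ch in enumerate(word):
        for sub in sorted(NEIGHBOURS.get(ch, ())):
            out.append(word[:i] + sub + word[i + 1:])
    return out


def all_substitutions(word):
    """Case (b), as assumed here: every letter kept or replaced."""
    options = [[ch] + sorted(NEIGHBOURS.get(ch, ())) for ch in word]
    candidates = ("".join(combo) for combo in product(*options))
    return [c for c in candidates if c != word]


def suggest(word, wordlist, generate=single_substitutions):
    """Return the generated candidates that are actually in the wordlist."""
    valid = set(wordlist)
    return [c for c in generate(word) if c in valid]


toy_wordlist = ["car", "far", "dog"]  # hypothetical, not the real wordlist
print(single_substitutions("ear"))
print(suggest("ear", toy_wordlist))
```

An application would only call suggest() after a plain lookup has failed, and
would present the result as a choice rather than silently picking one word.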

Marek


On Fri, Nov 1, 2013 at 9:14 PM, Brooks Boyd <boydb@midnightdesign.ws> wrote:

> I was inspired to join the mailing list to comment on some of these
> discussions about BIP39, which I think will have great use in the Bitcoin
> community and outside it as a way to transcribe binary data.
>
> The one thought I had as the discussions about similar characters are
> resulting in culling words from the list, is that it only helps to validate
> input, not help the user if it is incorrect.
>
> For example, if both "cat" and "eat" were in the word list, and someone
> wrote down "eat", but later mis-translated it and put "cat" back into
> translator, the result would be a checksum error; "cat" is a different
> number, so the checksum would fail.
>
> As it currently stands, "cat" would not be a valid word ("eat" is the real
> word, and no other number is "cat"), so the translator can throw a
> different error which is more helpful (i.e. "'cat' isn't a valid word
> choice"), but still doesn't get the user to the proper translation.
>
> What about if the wordlist included those "words that are so similar to
> each other that we only kept one of them" and had them all refer to the
> same number? I propose the wordlist have the possibility of multiple words
> on a single line, with the first word on the line being the "primary" or
> "real" word to be used, with the other similar words be included so that a
> translation program if it wanted to assist the user could fix their input
> for them (verbosely or not), along the lines of "'cat' isn't a valid word
> choice; assuming you meant 'eat', which is valid". You might still hit a
> checksum error if that similar word is still the wrong word, but as it
> stands now, I know you culled a bunch of words from the wordlist as "too
> similar", but if I want to try and help the user fix a bad input, I need to
> write a translation program with a full english dictionary alongside the
> BIP39 dictionary.
>
> I'd be willing to create a pull request for such an update, but before I
> delve into that, does this sound like a good idea? I could see it devolving
> into a slippery slope if every number in the 2048 set had a dozen word
> variations (misspellings, similar words, slang terms for the real word,
> etc.) which could get confusing of how similar is similar enough to be
> added as an alternate, and the standard would need to be clear that when
> translating binary to words, you only use the "main" word for that row, not
> any of the variations.
>
> MidnightLightning
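
The multi-word-per-line format proposed in the quoted message could be parsed
along these lines. This is only an illustration of the idea: no such file
format exists in BIP39, and parse_alias_wordlist and the sample lines are
hypothetical.

```python
def parse_alias_wordlist(lines):
    """Parse lines of the form 'primary alias1 alias2 ...'.

    Returns (primary, lookup): primary[i] is the word used when encoding
    index i, and lookup maps every accepted spelling to its index.
    """
    primary = []
    lookup = {}
    for index, line in enumerate(lines):
        words = line.split()
        if not words:
            raise ValueError(f"line {index} is empty")
        primary.append(words[0])
        for w in words:
            if w in lookup:
                raise ValueError(f"{w!r} appears on more than one line")
            lookup[w] = index
    return primary, lookup


# Hypothetical three-entry list; the first word on each line is the main one.
sample_lines = ["eat cat", "year gear", "dog"]
primary, lookup = parse_alias_wordlist(sample_lines)

typed = "cat"
index = lookup[typed]
if primary[index] != typed:
    print(f"'{typed}' isn't a valid word choice; assuming you meant "
          f"'{primary[index]}'")
```

Encoding would always use primary[index], so the variations never appear in a
generated mnemonic. The duplicate check rejects a spelling listed on two
lines, since it could not be mapped back to a single index.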

* Re: [Bitcoin-development] BIP39 word list
  2013-11-01 20:14 Brooks Boyd
@ 2013-11-01 23:41 ` Allen Piscitello
  2013-11-02  0:04 ` slush
  1 sibling, 0 replies; 9+ messages in thread
From: Allen Piscitello @ 2013-11-01 23:41 UTC (permalink / raw)
  To: Brooks Boyd; +Cc: Bitcoin Development

The problem with this is that you might have word A which is similar to B,
but B is also similar to C.  So we scrub B from the list, someone enters B,
and we have no way to know if it means A or C.  It leads to a much more
complicated scheme to ensure that all errors are correctable.

Scrubbing A, B, and C is preferable, since it leads to no ambiguity and
there is no need to try to correct an error.
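
The A/B/C situation above arises because letter similarity in the posted pair
table is not transitive. A small sketch, with illustrative function names,
that lists such chains:

```python
# Similar-letter pairs from slush's list earlier in the thread.
PAIRS = (
    "ac ae ao bd bh bp bq br ce cg cn co cq cu dg dh do dp dq ef eo "
    "fi fj fl fp ft gj go gp gq gy hk hl hm hn hr ij il it iy jl jp jq jy "
    "kx lt mn mw nu nz op oq ou ov pq pr qy sz uv uw uy vw vy"
).split()
SIMILAR = {frozenset(p) for p in PAIRS}


def similar(a, b):
    """True if the table lists letters a and b as a similar pair."""
    return frozenset((a, b)) in SIMILAR


def similarity_chains(letters):
    """Triples (a, b, c): b is similar to a and to c, but a is not to c."""
    return [
        (a, b, c)
        for b in letters
        for a in letters
        for c in letters
        if a < c and similar(a, b) and similar(b, c) and not similar(a, c)
    ]


chains = similarity_chains("abcdefghijklmnopqrstuvwxyz")
print(("c", "e", "f") in chains)
```

The triple (c, e, f) is one such chain: going by the table alone, "car" and
"far" are not similar to each other and could both stay in a list, while a
scrubbed "ear" would be one similar letter away from each of them.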


On Fri, Nov 1, 2013 at 3:14 PM, Brooks Boyd <boydb@midnightdesign.ws> wrote:

> I was inspired to join the mailing list to comment on some of these
> discussions about BIP39, which I think will have great use in the Bitcoin
> community and outside it as a way to transcribe binary data.
>
> The one thought I had as the discussions about similar characters are
> resulting in culling words from the list, is that it only helps to validate
> input, not help the user if it is incorrect.
>
> For example, if both "cat" and "eat" were in the word list, and someone
> wrote down "eat", but later mis-translated it and put "cat" back into
> translator, the result would be a checksum error; "cat" is a different
> number, so the checksum would fail.
>
> As it currently stands, "cat" would not be a valid word ("eat" is the real
> word, and no other number is "cat"), so the translator can throw a
> different error which is more helpful (i.e. "'cat' isn't a valid word
> choice"), but still doesn't get the user to the proper translation.
>
> What if the wordlist included those "words that are so similar to each
> other that we only kept one of them" and had them all refer to the same
> number? I propose that the wordlist allow multiple words on a single
> line, with the first word on the line being the "primary" or "real" word
> to be used, and the other similar words included so that a translation
> program, if it wanted to assist the user, could fix their input for them
> (verbosely or not), along the lines of "'cat' isn't a valid word choice;
> assuming you meant 'eat', which is valid". You might still hit a checksum
> error if that similar word is still the wrong word. As it stands now,
> though, a bunch of words were culled from the wordlist as "too similar",
> so if I want to help the user fix a bad input, I need to write a
> translation program with a full English dictionary alongside the BIP39
> dictionary.
>
> I'd be willing to create a pull request for such an update, but before I
> delve into that, does this sound like a good idea? I could see it becoming
> a slippery slope if every number in the 2048 set had a dozen word
> variations (misspellings, similar words, slang terms for the real word,
> etc.); it could get confusing to decide how similar is similar enough to
> be added as an alternate. The standard would also need to be clear that
> when translating binary to words, you only use the "main" word for that
> row, not any of the variations.
>
> MidnightLightning
>
>
> > I've just pushed an updated wordlist which is filtered for similar
> > characters taken from this matrix.
> > BIP39 now considers the following character pairs as similar:
> >         similar = (
> >             ('a', 'c'), ('a', 'e'), ('a', 'o'),
> >             ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
> >             ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
> >             ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
> >             ('e', 'f'), ('e', 'o'),
> >             ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
> >             ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
> >             ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
> >             ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
> >             ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
> >             ('k', 'x'),
> >             ('l', 't'),
> >             ('m', 'n'), ('m', 'w'),
> >             ('n', 'u'), ('n', 'z'),
> >             ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
> >             ('p', 'q'), ('p', 'r'),
> >             ('q', 'y'),
> >             ('s', 'z'),
> >             ('u', 'v'), ('u', 'w'), ('u', 'y'),
> >             ('v', 'w'), ('v', 'y')
> >         )
> > Feel free to review and comment on the current wordlist, but I think
> > we're slowly moving toward a final list.
> > slush
>
>


* Re: [Bitcoin-development] BIP39 word list
@ 2013-11-01 20:14 Brooks Boyd
  2013-11-01 23:41 ` Allen Piscitello
  2013-11-02  0:04 ` slush
  0 siblings, 2 replies; 9+ messages in thread
From: Brooks Boyd @ 2013-11-01 20:14 UTC (permalink / raw)
  To: bitcoin-development

[-- Attachment #1: Type: text/plain, Size: 3722 bytes --]

I was inspired to join the mailing list to comment on some of these
discussions about BIP39, which I think will have great use in the Bitcoin
community and outside it as a way to transcribe binary data.

The one thought I had, as the discussions about similar characters result
in culling words from the list, is that culling only helps to validate
input; it does not help the user recover when the input is incorrect.

For example, if both "cat" and "eat" were in the word list, and someone
wrote down "eat" but later mis-transcribed it and put "cat" back into the
translator, the result would be a checksum error; "cat" is a different
number, so the checksum would fail.

As it currently stands, "cat" would not be a valid word ("eat" is the real
word, and no other number is "cat"), so the translator can throw a
different error which is more helpful (i.e. "'cat' isn't a valid word
choice"), but that still doesn't get the user to the proper translation.

What if the wordlist included those "words that are so similar to each
other that we only kept one of them" and had them all refer to the same
number? I propose that the wordlist allow multiple words on a single
line, with the first word on the line being the "primary" or "real" word
to be used, and the other similar words included so that a translation
program, if it wanted to assist the user, could fix their input for them
(verbosely or not), along the lines of "'cat' isn't a valid word choice;
assuming you meant 'eat', which is valid". You might still hit a checksum
error if that similar word is still the wrong word. As it stands now,
though, a bunch of words were culled from the wordlist as "too similar",
so if I want to help the user fix a bad input, I need to write a
translation program with a full English dictionary alongside the BIP39
dictionary.

I'd be willing to create a pull request for such an update, but before I
delve into that, does this sound like a good idea? I could see it becoming
a slippery slope if every number in the 2048 set had a dozen word
variations (misspellings, similar words, slang terms for the real word,
etc.); it could get confusing to decide how similar is similar enough to
be added as an alternate. The standard would also need to be clear that
when translating binary to words, you only use the "main" word for that
row, not any of the variations.
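
To make the proposal concrete, here is a minimal sketch (an illustration,
not code from this thread) of how a translator might load such a file. The
three-line wordlist, the whitespace-separated layout, and the names
`load_wordlist`, `decode_word`, and `encode_index` are all assumptions made
for the example, not part of BIP39.

```python
# Hypothetical wordlist in the proposed format: one line per number,
# primary word first, optional look-alike alternates after it.
WORDLIST_TEXT = """\
eat cat
gear year
valve value
"""

def load_wordlist(text):
    """Return (primaries, lookup) for a multi-word-per-line wordlist."""
    primaries = []   # index -> primary word (the only word used when encoding)
    lookup = {}      # any accepted word -> index (used when decoding)
    for index, line in enumerate(text.splitlines()):
        words = line.split()
        primaries.append(words[0])
        for word in words:
            if word in lookup:
                # An alternate on two lines would be ambiguous.
                raise ValueError(f"{word!r} appears on more than one line")
            lookup[word] = index
    return primaries, lookup

def decode_word(word, primaries, lookup):
    """Return (index, note); note is None unless an alternate was corrected."""
    index = lookup[word]   # raises KeyError for a completely unknown word
    primary = primaries[index]
    if word == primary:
        return index, None
    return index, f"'{word}' isn't a valid word choice; assuming you meant '{primary}'"

def encode_index(index, primaries):
    """Binary-to-words direction: always emit the primary word."""
    return primaries[index]

primaries, lookup = load_wordlist(WORDLIST_TEXT)
print(decode_word("cat", primaries, lookup))
print(encode_index(0, primaries))
```

Rejecting a word that appears on two lines keeps each alternate tied to
exactly one number, which is the property the correction step relies on.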

MidnightLightning

> I've just pushed an updated wordlist which is filtered for similar
> characters taken from this matrix.
> BIP39 now considers the following character pairs as similar:
>         similar = (
>             ('a', 'c'), ('a', 'e'), ('a', 'o'),
>             ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
>             ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
>             ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
>             ('e', 'f'), ('e', 'o'),
>             ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
>             ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
>             ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
>             ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
>             ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
>             ('k', 'x'),
>             ('l', 't'),
>             ('m', 'n'), ('m', 'w'),
>             ('n', 'u'), ('n', 'z'),
>             ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
>             ('p', 'q'), ('p', 'r'),
>             ('q', 'y'),
>             ('s', 'z'),
>             ('u', 'v'), ('u', 'w'), ('u', 'y'),
>             ('v', 'w'), ('v', 'y')
>         )
> Feel free to review and comment on the current wordlist, but I think
> we're slowly moving toward a final list.
> slush
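
One way a pair matrix like the one quoted above could be applied to a
candidate list is sketched below (an illustration, not code from this
thread). Only three pairs are copied from the matrix, and the matching
rule (equal length, every differing position is a listed pair) is an
assumption; the actual BIP39 filter may differ.

```python
from itertools import combinations

# Subset of the quoted matrix; the remaining pairs would be added the same way.
similar = (('c', 'e'), ('g', 'y'), ('u', 'v'))

# Store both orderings so that lookups are symmetric.
SIMILAR = {(a, b) for a, b in similar} | {(b, a) for a, b in similar}

def too_similar(w1, w2):
    """True if the words differ only at positions holding a similar pair."""
    if len(w1) != len(w2):
        return False
    diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
    return bool(diffs) and all(d in SIMILAR for d in diffs)

def flagged_pairs(words):
    """All pairs of candidate words that the matrix marks as confusable."""
    return [(a, b) for a, b in combinations(sorted(words), 2) if too_similar(a, b)]

candidates = ["car", "ear", "cat", "eat", "gear", "year", "value", "valve", "dog"]
for a, b in flagged_pairs(candidates):
    print(a, b)
# prints:
# car ear
# cat eat
# gear year
# value valve
```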


end of thread, other threads:[~2013-11-02  4:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
2013-10-18 23:52 [Bitcoin-development] BIP39 word list jan
2013-10-18 23:58 ` Gregory Maxwell
2013-10-19 10:11   ` Pavol Rusnak
2013-10-24 13:26   ` slush
2013-10-23  0:56 ` slush
2013-11-01 20:14 Brooks Boyd
2013-11-01 23:41 ` Allen Piscitello
2013-11-02  0:04 ` slush
2013-11-02  4:31   ` Brooks Boyd
