I just wanted to thank cb4960 for making these changes.
I have reached 1000 words in Core6K and, while doing the next set of kanji, I tried reading NHK Easy News articles. It was extremely hit and miss: in some articles I understood a lot of the words, in others none. I wanted to practice more on the articles where I knew a majority of the words, and this change to the tool now makes that possible.
For example, with a known word list (~1000 words) exported from Anki and 450 NHK articles, here are the top and bottom 5 results:
Code:
77.78 207 161 46 67.06 85 57 28 2013_04_26_2.txt
75.59 127 96 31 61.76 68 42 26 2012_7_11.txt
75.47 159 120 39 63.41 82 52 30 2013_04_25.txt
73.30 221 162 59 54.63 108 59 49 2012_10_9_1.txt
72.92 192 140 52 55.91 93 52 41 2013_3_29_1.txt
...
47.31 167 79 88 33.65 104 35 69 2012_7_25_1.txt
47.16 176 83 93 42.35 85 36 49 2013_3_4_2.txt
45.70 151 69 82 35.80 81 29 52 2013_05_13.txt
44.69 273 122 151 37.59 133 50 83 2013_04_22_1.txt
43.18 176 76 100 36.73 98 36 62 2012_8_17_1.txt
Reading the articles from the top of the list is much more satisfying than my previous attempts at reading NHK, though it appears I need to learn a lot more grammar.
If you are going to use the tool on your own archives, keep one thing in mind. The parsers it uses add particles and other grammar-related elements to the word list, and these are unlikely to be in your own vocab lists. I created a separate file, added these types of words to it from the word_freq_report, and appended that file to my vocab list. This improves the ratings of the documents a lot.
By the way, if someone were to know 1000 words from Core6K, and that person happened to have a copy of some totally innocent book reviews, and that person ran this tool over them, they might sadly discover that the top ratings on those book reviews are around the 15-20% range.
Such a person might then decide to go learn more vocab.
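In case it helps anyone picking reading material from a report like the one in this post, here is a rough Python sketch that keeps only the articles above a chosen "known words" percentage. The column meanings are a guess from the arithmetic in the rows (for example 161 / 207 = 77.78% and 161 + 46 = 207, then 57 / 85 = 67.06% and 57 + 28 = 85), not something taken from the tool's documentation. The names parse_row and readable and the 70% cutoff are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    # Column meanings are assumptions inferred from the numbers, not documented.
    pct_known: float         # percent of all word occurrences that are known
    total: int               # total word occurrences
    known: int               # known word occurrences
    unknown: int             # unknown word occurrences
    pct_known_unique: float  # percent of unique words that are known
    total_unique: int        # unique words
    known_unique: int        # known unique words
    unknown_unique: int      # unknown unique words
    filename: str


def parse_row(line: str) -> Row:
    """Parse one whitespace-separated report line into a Row."""
    parts = line.split(maxsplit=8)  # the file name is the ninth field
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}: {line!r}")
    return Row(
        float(parts[0]), int(parts[1]), int(parts[2]), int(parts[3]),
        float(parts[4]), int(parts[5]), int(parts[6]), int(parts[7]),
        parts[8].strip(),
    )


def readable(lines: Iterable[str], min_pct: float = 70.0) -> List[Row]:
    """Return rows whose known-word percentage is at least min_pct, best first."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line == "...":  # skip blanks and the elision marker
            continue
        rows.append(parse_row(line))
    picked = [r for r in rows if r.pct_known >= min_pct]
    return sorted(picked, key=lambda r: r.pct_known, reverse=True)


# Three rows copied from the listing in this post.
sample = """\
77.78 207 161 46 67.06 85 57 28 2013_04_26_2.txt
73.30 221 162 59 54.63 108 59 49 2012_10_9_1.txt
47.31 167 79 88 33.65 104 35 69 2012_7_25_1.txt
""".splitlines()

for row in readable(sample):
    print(f"{row.pct_known:6.2f}  {row.filename}")
```

Lowering min_pct widens the reading list as the articles at the top start to feel easy.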
