cb's Japanese Text Analysis Tool

#40
I just wanted to thank cb4960 for making these changes.

I have reached 1000 words in Core6K, and while doing the next set of kanji I tried reading NHK Easy News articles. It was extremely hit and miss. In some articles I understood a lot of the words; in others I understood none. I wanted to practice more on the articles where I knew a majority of the words, and this change to the tool now makes that possible.

For example, with a known word list (~1000 words) exported from Anki and 450 NHK articles, here are the top and bottom 5 results:

Code:
77.78    207    161    46    67.06    85    57    28    2013_04_26_2.txt
75.59    127    96    31    61.76    68    42    26    2012_7_11.txt
75.47    159    120    39    63.41    82    52    30    2013_04_25.txt
73.30    221    162    59    54.63    108    59    49    2012_10_9_1.txt
72.92    192    140    52    55.91    93    52    41    2013_3_29_1.txt
...
47.31    167    79    88    33.65    104    35    69    2012_7_25_1.txt
47.16    176    83    93    42.35    85    36    49    2013_3_4_2.txt
45.70    151    69    82    35.80    81    29    52    2013_05_13.txt
44.69    273    122    151    37.59    133    50    83    2013_04_22_1.txt
43.18    176    76    100    36.73    98    36    62    2012_8_17_1.txt
Reading the articles from the top of the list is much more satisfying than previous attempts at reading NHK. Though it appears I need to learn a lot more grammar.
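The columns in the listing above are not labeled, but the arithmetic suggests a reading: column 1 is column 3 divided by column 2, and column 5 is column 7 divided by column 6, each as a percentage. A minimal sketch that checks this against the first row follows. The interpretation (total / known / unknown counts, first for all word occurrences and then for unique words) is my assumption from the numbers, not something the tool documents here, and `percent_known` is just an illustrative helper name.

```python
def percent_known(known: int, total: int) -> float:
    """Known words as a percentage of total, rounded to two decimals."""
    if total == 0:
        return 0.0
    return round(100.0 * known / total, 2)


# First row of the listing above, split on whitespace.
row = "77.78    207    161    46    67.06    85    57    28    2013_04_26_2.txt".split()

# Assumed meaning: the first group counts every word occurrence,
# the second group counts each distinct word once.
all_total, all_known, all_unknown = int(row[1]), int(row[2]), int(row[3])
uniq_total, uniq_known, uniq_unknown = int(row[5]), int(row[6]), int(row[7])

# Known + unknown should add up to the total in each group.
assert all_known + all_unknown == all_total
assert uniq_known + uniq_unknown == uniq_total

print(percent_known(all_known, all_total))    # compare with row[0]
print(percent_known(uniq_known, uniq_total))  # compare with row[4]
```

If that reading is right, the list is sorted by the first percentage, so the top files are the ones where the largest share of the running text is already known.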

If you are going to use the tool on your own archives, keep one thing in mind: the parsers it uses add particles and other grammar-related elements to the word list. These are unlikely to be in your own vocab lists. I created a separate file, added these kinds of words from the word_freq_report, and appended that file to my vocab list. This improves the document ratings a lot.
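To show why the workaround above moves the ratings so much, here is a rough sketch of the idea, not the tool's actual code. It assumes a document is already split into words and that the known list is a plain set of words; `coverage` and the sample words are made up for illustration.

```python
def coverage(words: list[str], known: set[str]) -> float:
    """Percentage of word occurrences in `words` that appear in `known`."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in known)
    return round(100.0 * hits / len(words), 2)


# Hypothetical parsed sentence: content words plus the particles and
# copula that a parser would also emit as separate "words".
document = ["私", "は", "学生", "です"]

# Vocab exported from a flashcard deck usually has only content words.
vocab = {"私", "学生"}

# Supplementary list of grammar words picked out of a frequency report.
grammar_words = {"は", "です"}

before = coverage(document, vocab)
after = coverage(document, vocab | grammar_words)

print(before, after)
```

Since particles are among the most frequent tokens in any Japanese text, leaving them out of the known list drags every document's rating down by a similar large amount.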

By the way, if someone were to know 1000 words from Core6K, and that person happened to have a copy of some totally innocent books (I mean, book reviews), and that person ran this tool over them, they might sadly discover that the top ratings on those books (book reviews!) are around the 15-20% range.

Such a person might then decide to go learn more vocab ;)