lauri_ranta Wrote: A few weeks ago I downloaded Japanese subs for about 10000 anime episodes and used mecab to make a frequency list of words with at least one kanji:
By the way, do you happen to still have the unparsed list, with the hiragana words and everything? I'm still going through this list and it's working a lot better than the Core decks' order, but I'm not getting any kana words, which are a pretty big chunk of the language.
for f in *.srt *.ass; do mecab -F'%t %f[6]\n' "$f" | sed -En 's/^2 (.+)/\1/p'; done | sort | uniq -c | sort -rn
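On the kana question: as far as I can tell, the `sed` step keeps only tokens whose character type field (`%t`) is 2, which should be kanji with the default IPADIC character definitions, so kana words never reach the count. Here is a rough Ruby sketch of the same tally that keeps kana words as well. The helper name `tally_lemmas` and the `kanji_only` option are made up for illustration, and it assumes its input lines look like the output of `mecab -F'%t %f[6]\n'`; it is not the script that produced the list.

```ruby
# Tally lemmas from lines shaped like "<char type id> <lemma>", i.e. the
# output of mecab -F'%t %f[6]\n'. Hypothetical helper, not the original script.
def tally_lemmas(lines, kanji_only: false)
  counts = Hash.new(0)
  lines.each do |line|
    _type, lemma = line.chomp.split(' ', 2)
    # Skip "EOS" lines (no second field) and "*" (assumed to mean no lemma).
    next if lemma.nil? || lemma.empty? || lemma == '*'
    # Drop punctuation, digits and Latin text: require a kanji or kana character.
    next unless lemma.match?(/[\p{Han}\p{Hiragana}\p{Katakana}]/)
    # Optionally reproduce the kanji-only behaviour of the shell pipeline.
    next if kanji_only && !lemma.match?(/\p{Han}/)
    counts[lemma] += 1
  end
  # Most frequent first; ties broken by the lemma itself.
  counts.sort_by { |lemma, n| [-n, lemma] }
end

# Possible usage, with mecab output piped in on standard input:
#   tally_lemmas(STDIN.each_line).each { |lemma, n| puts "#{n}\t#{lemma}" }
```

Filtering on the lemma's characters instead of the `%t` id is a deliberate swap: it avoids depending on the dictionary's type numbering, at the cost of treating every kana particle as a word.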
I wrote a Ruby script that removed words that didn't contain kanji (932/6000), and then words that weren't on the first 10000 lines of the frequency list (1436/5068).
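The two steps could look roughly like this in Ruby. This is only a sketch under assumptions: `filter_core`, its arguments, and the returned hash are hypothetical names, the word list and the frequency list are taken to be plain arrays of strings, and the frequency list is assumed to be sorted most frequent first. The counts in the post work out as 6000 - 932 = 5068 and 5068 - 1436 = 3632.

```ruby
require 'set'

# Hypothetical reconstruction of the two filtering steps; names and data
# shapes are assumptions, not the original script.
def filter_core(core_words, frequency_list, top_n: 10_000)
  # Step 1: drop words that contain no kanji.
  with_kanji = core_words.select { |word| word.match?(/\p{Han}/) }

  # Step 2: keep only words found in the first top_n lines of the
  # frequency list (assumed sorted by descending frequency).
  frequent = frequency_list.first(top_n).to_set
  kept = with_kanji.select { |word| frequent.include?(word) }

  {
    no_kanji_removed: core_words.size - with_kanji.size,
    infrequent_removed: with_kanji.size - kept.size,
    kept: kept
  }
end
```

Using a `Set` for the top lines keeps each lookup cheap, which matters little at this size but costs nothing.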
A random sample of the 1436 words removed after the second step:
有害 harmful, hazardous
願書 application form
停留所 (bus) stop
物差し ruler
重役 executive
鰻 eel
お辞儀 bow
細長い long and thin, long and narrow
観賞 ornamental
時給 hourly wage
It's not perfect, because mecab doesn't recognize all words, and the corpus is too small and not diverse enough.
Anyway, you can download the list as TSV from http://lri.me/upload/core3632.txt. Most Anki decks are based on an older version of the data from iKnow's website. Audio files that match this version can be downloaded from this thread.
I won't make an Anki deck for it, but I'm thinking about making a huge vocabulary deck based on different word frequency lists later. Most of the top 20000 words have sound files from JapanesePod101 or iKnow.
Edited: 2013-02-05, 8:47 am

