Animosophy
Member
Registered: 2013-02-19
Posts: 180
I'm hoping there's something similar to cb's text analysis tool that can create lists of vocab in order of appearance (obviously ignoring subsequent appearances of the same word). That way I can read a few pages of something, then study the vocab I read afterwards.
I know JadeReader, for example, lets you add new words to a text file as they come up, but that would still mean manually adding them to an Anki deck each time. I'd rather have all of the words organised in one go.
What I'm trying to do...
I've used cb's text analysis tool to run word frequency reports on the first 9 light novels of the Suzumiya Haruhi series and these are the numbers it came up with:
Novel 2: 3125 new words
Novel 3: 2569 new words
Novel 4: 1568 new words
Novel 5: 2242 new words
Novel 6: 1739 new words
Novel 7: 1759 new words
Novel 8: 1272 new words
Novel 9: 1323 new words
Basically, if I read them sequentially, those are the numbers of new words I will encounter in each novel. I exclude all vocab from Core 2k/6k and the kanji compound deck from ja-min. Each report orders words by their relative frequency, not by their order of appearance.
The first book's vocab list is full of familiar grammar, particles, and character names, but including all of these, there are ~5,350 new "words" in the first novel. I'm going to assume it's closer to 5,000. Altogether that would make ~20,597 words (+ core + grammar = a ~27,000-word vocabulary! yum). This is one of the goals I am setting myself before Japan in late 2015.
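For what it's worth, the totals above check out; summing the per-novel counts plus the ~5,000 estimate for novel 1:

```shell
# new words in novels 2-9 (from the list above) plus ~5,000 estimated for novel 1
awk 'BEGIN{print 3125+2569+1568+2242+1739+1759+1272+1323+5000}'
# prints 20597
```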
I'd reaaally like to create an Anki deck that gathers all of these words in the order that they will appear. It just seems like such a beneficial way to reinforce new vocabulary.
You can use MeCab to get a list of lemmas in different files:
mecab -F '%t %f[6]\n' -E '' *.txt | awk '$1~/[26]/{print $2}' | awk '!a[$0]++' > words.txt
%t is the word type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, i.e. field 6 of the default output. Change [26] to 2 if you only want to include words with at least one kanji.
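To see what the two awk stages do, here is the pipeline run on a made-up sample of that "type lemma" output (the lines are invented for illustration; real lemmas depend on your MeCab dictionary):

```shell
# toy stand-in for `mecab -F '%t %f[6]\n'` output: one "<type> <lemma>" per token
printf '2 本\n6 を\n6 読む\n3 。\n2 本\n' |
  awk '$1~/[26]/{print $2}' |   # keep only kanji (2) and hiragana (6) tokens
  awk '!a[$0]++'                # drop repeats, preserving order of first appearance
# prints 本, を, 読む — the second 本 is deduplicated
```

The `!a[$0]++` trick is what gives you order-of-first-appearance rather than frequency order.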
This would only keep words that appear between lines 15,000 and 70,000 on a word frequency list:
curl lri.me/japanese/word-frequency.txt | tail -n+3 > word-frequency.txt; awk 'NR==FNR{if(NR>=15000&&NR<=70000)a[$0];next}$0 in a' word-frequency.txt words.txt > words2.txt
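The `NR==FNR` idiom reads the first file (the frequency list) into an array, then filters the second file against it. A toy run with invented file names and a rank window of 2-4 standing in for 15,000-70,000:

```shell
# tiny stand-in frequency list: one word per line, most frequent first
printf 'の\nする\n読む\n本\n猫\n' > freq-sample.txt
# words from the text, in order of appearance
printf '猫\n読む\n本\nカピバラ\n' > words-sample.txt
# keep only the words ranked 2-4 in the sample list
awk 'NR==FNR{if(NR>=2&&NR<=4)a[$0];next}$0 in a' freq-sample.txt words-sample.txt
# prints 読む and 本 — 猫 is too frequent (rank 5... wait, rank 5 is *rarer*),
# i.e. 猫 (rank 5) and カピバラ (unranked) fall outside the window
```

Note the surviving words keep the order they had in words.txt, which is exactly the order-of-appearance property you want for the deck.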
You can install MeCab with `brew install mecab mecab-ipadic` on OS X or by using mecab-0.996.exe from http://code.google.com/p/mecab/downloads/list on Windows. I don't know how to get the shell commands above to work on Windows though.