Any text analysis tools that organise vocab in order of appearance?


Animosophy Member
Registered: 2013-02-19 Posts: 180

I'm hoping there's something similar to cb's text analysis tool that can create lists of vocab in order of appearance (ignoring subsequent appearances of the same word, obviously). That way I can read a few pages of something, then study the vocab that I read afterwards.

I know JadeReader, for example, lets you add new words to a text file as they come, but that would still mean manually adding them to an Anki deck each time. I'd rather have all of the words organised in one go :)
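For reference, the core behaviour I'm after — keep each word the first time it appears, skip repeats and anything already known — could be sketched in a few lines (a rough illustration, not a real tool; the token list would come from whatever parser you use):

```python
# Sketch: build a vocab list in order of first appearance, skipping
# words already known (e.g. everything from Core 2k/6k).

def first_appearance(tokens, known=()):
    seen = set(known)   # words to skip entirely
    ordered = []        # new words, in order of first appearance
    for tok in tokens:
        if tok and tok not in seen:
            seen.add(tok)
            ordered.append(tok)
    return ordered

# Repeats and known words are dropped; order of first sighting is kept.
print(first_appearance(["猫", "が", "猫", "を", "見る"], known={"が", "を"}))
# → ['猫', '見る']
```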

What I'm trying to do...

I've used cb's text analysis tool to run word frequency reports on the first 9 light novels of the Suzumiya Haruhi series and these are the numbers it came up with:

Novel 2: 3125 new words
Novel 3: 2569 new words
Novel 4: 1568 new words
Novel 5: 2242 new words
Novel 6: 1739 new words
Novel 7: 1759 new words
Novel 8: 1272 new words
Novel 9: 1323 new words

Basically, if I read them sequentially, those are the numbers of new words I'll encounter in each novel. I exclude all vocab from Core 2k/6k and the kanji compound deck from ja-min. Each report orders words by their relative frequency, not by their order of appearance.

The first book's vocab list is full of familiar grammar, particles and character names, but including all of these, there are ~5,350 new "words" in the first novel. I'm going to assume it's closer to 5,000. Altogether that would make ~20,597 words (+ core + grammar = a ~27,000-word vocabulary! yum). This is one of the goals I'm setting myself before Japan in late 2015.

I'd reaaally like to create an Anki deck that gathers all of these words in the order that they will appear. It just seems like such a beneficial way to reinforce new vocabulary.

PotbellyPig Member
From: New York Registered: 2012-01-29 Posts: 337

Yomichan and Rikaisama on the PC connect directly to the desktop copy of Anki and add each word as you look it up. No additional work needed. I have 5,000 words stored via Yomichan, and not only are they in order with the book name recorded for each one, the sentence from the book is copied in as the example sentence. It's great stuff. Those two programs only work on the PC. There is a Rikaisama on Android, but it creates a text file. I have a Windows 8 tablet and use those programs a lot on it.

lauri_ranta Member
Registered: 2012-03-31 Posts: 139 Website

You can use MeCab to get a list of lemmas in different files:

mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt

%t is word type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. Change 26 to 2 if you only want to include words with at least one kanji.
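The deduplication step relies on a standard awk idiom; in isolation, with placeholder input, it behaves like this:

```shell
# '!a[$0]++' prints a line only the first time it is seen, so order of
# first appearance is preserved:
printf '%s\n' neko ga neko wo ga | awk '!a[$0]++'
# → neko
# → ga
# → wo
```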

This would only keep words that appear between lines 15,000 and 70,000 on a word frequency list:

curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=15000&&NR<=70000)a[$0];next}$0 in a' word-frequency.txt words.txt>words2.txt
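The same rank-band filter can be written without awk (a sketch; it assumes the frequency list has one word per line, most frequent first):

```python
# Sketch: keep only words whose rank in a frequency list falls between
# lo and hi (1-based line numbers, inclusive), preserving their order.

def band_filter(words, freq_list, lo=15000, hi=70000):
    band = set(freq_list[lo - 1:hi])  # words ranked lo..hi
    return [w for w in words if w in band]

# Toy example: keep words ranked 2-3 in a four-word frequency list.
print(band_filter(["a", "b", "c"], ["x", "a", "c", "y"], lo=2, hi=3))
# → ['a', 'c']
```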

You can install MeCab with `brew install mecab mecab-ipadic` on OS X or by using mecab-0.996.exe from http://code.google.com/p/mecab/downloads/list on Windows. I don't know how to get the shell commands above to work on Windows though.
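For Windows, a rough Python equivalent of the shell pipeline above might look like this (an untested sketch; it assumes mecab.exe is on PATH and accepts the same -F/-E options):

```python
# Sketch: reproduce the mecab | awk | awk pipeline in Python.
import subprocess

def mecab_lines(paths):
    # Ask MeCab for "<type> <lemma>" per token, with no EOS line.
    out = subprocess.run(
        ["mecab", "-F", "%t %f[6]\n", "-E", ""] + list(paths),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return out.stdout.splitlines()

def first_lemmas(lines):
    # Keep kanji (type 2) and hiragana (type 6) words, first
    # occurrence only, in order of appearance.
    seen, ordered = set(), []
    for line in lines:
        parts = line.split(" ", 1)
        if len(parts) == 2 and parts[0] in ("2", "6") and parts[1] not in seen:
            seen.add(parts[1])
            ordered.append(parts[1])
    return ordered

# Usage (writes words.txt like the shell version):
# with open("words.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(first_lemmas(mecab_lines(["novel1.txt"]))))
```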

Animosophy Member
Registered: 2013-02-19 Posts: 180

PotbellyPig wrote:

Yomichan and Rikaisama on the PC connect directly to the desktop copy of Anki and add each word as you look it up. No additional work needed. I have 5,000 words stored via Yomichan, and not only are they in order with the book name recorded for each one, the sentence from the book is copied in as the example sentence. It's great stuff. Those two programs only work on the PC. There is a Rikaisama on Android, but it creates a text file. I have a Windows 8 tablet and use those programs a lot on it.

Ooh, I like this even more! Thanks :) problem solved

lauri_ranta wrote:

You can use MeCab to get a list of lemmas in different files:

mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt

%t is word type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. Change 26 to 2 if you only want to include words with at least one kanji.

This would only keep words that appear between lines 15,000 and 70,000 on a word frequency list:

curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=15000&&NR<=70000)a[$0];next}$0 in a' word-frequency.txt words.txt>words2.txt

You can install MeCab with `brew install mecab mecab-ipadic` on OS X or by using mecab-0.996.exe from http://code.google.com/p/mecab/downloads/list on Windows. I don't know how to get the shell commands above to work on Windows though.

This seems quite technical for my purposes, but thanks for sharing :) I may find a use for MeCab some time in the future.
