yogert909
Hi, I've been looking for a tool that can extract all the vocabulary words from a text, compare them against a list of known words, and return a list of the unknown vocab. It seems that MorphMan does this, but it's giving me errors.
You can use MeCab to find lemmas of specific types:
mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt
%t is the word's character type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, i.e. feature field 6 (counting from 0) in the default output. Change [26] to 2 in the first awk command if you only want to include words with at least one kanji.
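If you want to see what the two awk stages do without running MeCab, here is a minimal sketch. The file names and the sample lines are made up for illustration; they only imitate the "type lemma" shape that the -F format above asks for, and are not real MeCab output.

```shell
# Hypothetical stand-in for the output of mecab -F '%t %f[6]\n' -E '':
# one "type lemma" pair per line (hand-written, not real MeCab output).
printf '%s\n' '2 食べる' '6 は' '7 テレビ' '2 食べる' '6 これ' '3 。' > sample-mecab.txt

# Stage 1: keep lines whose first field contains 2 or 6 and print the lemma.
# Stage 2: print each lemma only the first time it is seen (order is preserved).
awk '$1~/[26]/{print $2}' sample-mecab.txt | awk '!a[$0]++' > sample-words.txt

cat sample-words.txt
```

The katakana word and the punctuation are dropped, and the repeated lemma is printed only once.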
Removing words that are included in a text file for learned words:
awk 'NR==FNR{a[$0];next}!($0 in a)' learned.txt words.txt>words2.txt
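The NR==FNR part is a common awk two-file idiom: it is only true while the first file is being read, so those lines are stored as array keys, and lines of the second file are printed only if they are not among the keys. A small sketch with made-up file names and contents:

```shell
# Hypothetical sample files, one word per line.
printf '%s\n' 'は' 'これ' > learned-sample.txt
printf '%s\n' '食べる' 'は' 'これ' '飲む' > words-sample.txt

# First file: remember every line as a key of array a, then skip to the next line.
# Second file: print only the lines that are not keys of a.
awk 'NR==FNR{a[$0];next}!($0 in a)' learned-sample.txt words-sample.txt > words2-sample.txt

cat words2-sample.txt
```

One thing to watch for: if the learned-words file is empty, NR==FNR stays true while the second file is read, so nothing gets printed.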
Removing words that don't appear within specific lines on a word frequency list:
curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=20000&&NR<=100000)a[$0];next}$0 in a' word-frequency.txt words2.txt>words3.txt
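The same idiom does the frequency filtering: only the words on a given range of lines of the first file are stored, and a word from the second file is printed only if it was stored. Below is a rough sketch on a tiny made-up list, assuming each line of the frequency list holds just the word. The lo and hi variables are my own addition (passed with -v) so the range is easy to change; the command above hardcodes 20000 and 100000 instead.

```shell
# Hypothetical five-line frequency list (most frequent first) and unknown-word list.
printf '%s\n' 'の' 'は' '食べる' '飲む' '走る' > freq-sample.txt
printf '%s\n' '走る' '食べる' 'の' > unknown-sample.txt

# First file: store only the words found on lines lo..hi.
# Second file: print a word only if it was stored.
awk -v lo=2 -v hi=4 'NR==FNR{if(NR>=lo&&NR<=hi)a[$0];next}$0 in a' \
    freq-sample.txt unknown-sample.txt > words3-sample.txt

cat words3-sample.txt
```

Here only the word on line 3 of the list survives; the words on lines 1 and 5 fall outside the range.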