kanji koohii FORUM
Tool to extract unknown vocab from text?


Tool to extract unknown vocab from text? - yogert909 - 2013-09-19

Hi, I've been looking for a tool that can extract all the vocabulary words from a text, compare them against a list of known words, and return a list of unknown vocab. It seems that MorphMan does this, but it's giving me errors.


Tool to extract unknown vocab from text? - Animosophy - 2013-09-19

If nothing else, use cb's text analysis tool: http://forum.koohii.com/showthread.php?pid=168481#pid168481

Then go to http://www.tracemyip.org/tools/remove-duplicate-words-in-text , copy and paste your known words in front of the new vocabulary, and separate the two lists with a row of hyphens or something similar. Once the duplicates are removed, whatever is left of the extracted word list after the divider is your unknown vocab.


Tool to extract unknown vocab from text? - yogert909 - 2013-09-19

cb's tool looks interesting. I'll give it a try and see.

Thanks!


Tool to extract unknown vocab from text? - lauri_ranta - 2013-09-20

You can use MeCab to find lemmas of specific types:

mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt

%t is the character type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: Latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. Change [26] to 2 in the first awk command if you only want to include words with at least one kanji.
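
To see what the two awk stages do without running MeCab, here is a small sketch that feeds them hand-written lines in the same "type lemma" shape. The function name filter_lemmas and the sample lines are made up for illustration, and the type test is anchored (^[26]$) instead of the looser [26] used above, which is my assumption about the intent.

```shell
# Keep lemmas whose character type is 2 (kanji) or 6 (hiragana),
# then drop repeats while preserving first-seen order.
filter_lemmas() {
  awk '$1 ~ /^[26]$/ { print $2 }' | awk '!seen[$0]++'
}

# Hand-written stand-in for the output of: mecab -F '%t %f[6]\n' -E ''
printf '%s\n' '2 食べる' '6 の' '7 テレビ' '2 食べる' | filter_lemmas
```

With real input you would pipe the mecab command into filter_lemmas instead of the printf line.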

Removing words that are included in a text file for learned words:

awk 'NR==FNR{a[$0];next}!($0 in a)' learned.txt words.txt>words2.txt
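
The NR==FNR idiom above can be tried on throwaway files first. A minimal sketch, assuming one word per line in both files; the function name unknown_words and the sample words are hypothetical:

```shell
# Print the lines of the second file that do not occur in the first file.
# While awk reads the first file, NR equals FNR, so its lines are stored as keys.
unknown_words() {
  awk 'NR == FNR { known[$0]; next } !($0 in known)' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 猫 犬 > "$tmp/learned.txt"
printf '%s\n' 猫 鳥 犬 魚 > "$tmp/words.txt"

# Only the words missing from learned.txt are printed.
unknown_words "$tmp/learned.txt" "$tmp/words.txt"
```

One caveat: if the learned-words file is empty, NR==FNR is also true while awk reads the word list, so nothing is printed at all.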

Removing words that don't appear within specific lines on a word frequency list:

curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=20000&&NR<=100000)a[$0];next}$0 in a' word-frequency.txt words2.txt>words3.txt
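
The same filter with the line range pulled out as parameters, as a rough sketch. It assumes, as the command above implies, that the frequency list has one word per line and that the line number is the rank; the function name in_rank_range and the tiny sample list are made up for illustration.

```shell
# Usage: in_rank_range FIRST LAST FREQUENCY_LIST CANDIDATES
# Prints candidate words found on lines FIRST..LAST of the frequency list.
in_rank_range() {
  awk -v lo="$1" -v hi="$2" '
    NR == FNR { if (FNR >= lo && FNR <= hi) keep[$0]; next }
    $0 in keep
  ' "$3" "$4"
}

tmp=$(mktemp -d)
printf '%s\n' の 猫 鳥 魚 犬 > "$tmp/word-frequency.txt"
printf '%s\n' 鳥 犬 魚 > "$tmp/words2.txt"

# Keep only candidates ranked 2 through 4 on the sample list.
in_rank_range 2 4 "$tmp/word-frequency.txt" "$tmp/words2.txt"
```

For the real list you would call it as in_rank_range 20000 100000 word-frequency.txt words2.txt > words3.txt.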


Tool to extract unknown vocab from text? - yogert909 - 2013-09-20

Edit: never mind. I just realized MeCab is part of Mac OS and the man page is in English. Cool! I'm not so good with shell scripting, but your examples should get me started.

Thanks! Is there English documentation for MeCab? I've done cursory searches, but nothing turned up.