Tool to extract unknown vocab from text? - Printable Version
+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Learning resources (http://forum.koohii.com/forum-9.html)
+--- Thread: Tool to extract unknown vocab from text? (/thread-11182.html)
Tool to extract unknown vocab from text? - yogert909 - 2013-09-19

Hi, I've been looking for a tool that can extract all the vocabulary words from a text, compare them against a list of known words, and return a list of the unknown vocab. It seems that MorphMan does this, but it's giving me errors.

Tool to extract unknown vocab from text? - Animosophy - 2013-09-19

If nothing else, use cb's text analysis tool: http://forum.koohii.com/showthread.php?pid=168481#pid168481

Then go to http://www.tracemyip.org/tools/remove-duplicate-words-in-text , copy and paste your known words in front of the new vocabulary, and divide the two lists with a bunch of hyphens or something. Once the duplicates are removed, keep all of the vocab that is left from the extracted word list.

Tool to extract unknown vocab from text? - yogert909 - 2013-09-19

cb's tool looks interesting. I'll give it a try and see. Thanks!

Tool to extract unknown vocab from text? - lauri_ranta - 2013-09-20

You can use MeCab to find lemmas of specific types:

    mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt

%t is the word type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. Change 26 to 2 if you only want to include words with at least one kanji.

Removing words that are included in a text file for learned words:

    awk 'NR==FNR{a[$0];next}!($0 in a)' learned.txt words.txt>words2.txt

Removing words that don't appear within specific lines on a word frequency list:

    curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=20000&&NR<=100000)a[$0];next}$0 in a' word-frequency.txt words2.txt>words3.txt

Tool to extract unknown vocab from text? - yogert909 - 2013-09-20

edit: never mind. I just realized MeCab is part of Mac OS and the man page is in English. Cool! I'm not so good with shell scripting, but your examples should get me started.
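
For anyone else new to shell scripting, the two awk idioms in lauri_ranta's commands can be tried on their own, without MeCab installed. What follows is only a rough sketch: the sample words, the temporary file names, and the function names dedupe_lines and drop_known are all made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# Sketch only: the sample data and the function names are hypothetical.

# Print each input line the first time it appears, keeping the original order.
# seen[$0]++ is 0 (false) on first sight, so !seen[$0]++ is true exactly once per line.
dedupe_lines() {
    awk '!seen[$0]++' "$@"
}

# Usage: drop_known LEARNED_FILE WORDS_FILE
# While reading the first file, NR==FNR holds, so its lines are stored as array keys.
# For the second file, only lines that are not among those keys are printed.
drop_known() {
    awk 'NR==FNR { known[$0]; next } !($0 in known)' "$1" "$2"
}

# Made-up stand-ins for the MeCab lemma output and the learned-word list.
demo_dir=$(mktemp -d)
printf '%s\n' 食べる 猫 食べる 走る 猫 読む > "$demo_dir/lemmas.txt"
printf '%s\n' 猫 読む > "$demo_dir/learned.txt"

dedupe_lines "$demo_dir/lemmas.txt" > "$demo_dir/words.txt"
drop_known "$demo_dir/learned.txt" "$demo_dir/words.txt" > "$demo_dir/words2.txt"

cat "$demo_dir/words2.txt"
rm -r "$demo_dir"
```

With this sample data the script prints 食べる and 走る, one per line. One caveat with the NR==FNR idiom: if the learned-word file is empty, NR==FNR stays true while the second file is read, so nothing is printed at all.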