Hi, I've been looking for a tool that can extract all the vocabulary words from a text, compare them against a list of known words, and return a list of unknown vocab. It seems that MorphMan does this, but it's giving me errors.
2013-09-19, 4:51 pm
2013-09-19, 7:39 pm
If nothing else use cb's text analysis tool: http://forum.koohii.com/showthread.php?p...#pid168481
Then go to http://www.tracemyip.org/tools/remove-du...ds-in-text , copy and paste your known words in front of the new vocabulary, and divide the two lists with a bunch of hyphens or something, then keep whatever vocab is left from the extracted word list.
Edited: 2013-09-19, 7:41 pm
2013-09-19, 8:38 pm
cb's tool looks interesting. I'll give it a try and see.
Thanks!
2013-09-20, 5:15 am
You can use MeCab to find lemmas of specific types:
mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt
%t is word type (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. Change 26 to 2 if you only want to include words with at least one kanji.
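To see what the two awk stages do without having MeCab installed, here is a small sketch that feeds them a few hand-written lines in the same "type lemma" shape. The sample lines and the variable names are made up for illustration; they are not real MeCab output.

```shell
# Hand-written stand-in for the output of: mecab -F '%t %f[6]\n' -E ''
# (assumed sample data, not produced by MeCab)
sample='2 食べる
6 の
7 テレビ
2 食べる
4 2013'

# Stage 1: keep lines whose type field is 2 or 6 and print only the lemma.
# Stage 2: print each lemma the first time it is seen, preserving order.
result=$(printf '%s\n' "$sample" | awk '$1~/[26]/{print $2}' | awk '!a[$0]++')
printf '%s\n' "$result"
```

With this sample, the katakana and number lines are dropped and the repeated lemma is printed only once.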
Removing words that are included in a text file for learned words:
awk 'NR==FNR{a[$0];next}!($0 in a)' learned.txt words.txt>words2.txt
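A quick way to check how that filter behaves is to run it on two tiny files. This is only a sketch; the file contents and the temporary directory are made up for illustration.

```shell
# Throwaway directory with small sample lists (assumed data).
dir=$(mktemp -d)
printf '%s\n' 食べる 猫 > "$dir/learned.txt"
printf '%s\n' 食べる 走る 猫 泳ぐ > "$dir/words.txt"

# While reading the first file (NR==FNR), store each line as an array key.
# While reading the second file, print only lines that are not stored keys.
unknown=$(awk 'NR==FNR{a[$0];next}!($0 in a)' "$dir/learned.txt" "$dir/words.txt")
printf '%s\n' "$unknown"
```

One thing to watch for: if learned.txt is empty, NR==FNR stays true while awk reads words.txt, so nothing is printed. Lines are also compared exactly, so stray trailing spaces or Windows line endings in either file will stop words from matching.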
Removing words that don't appear within specific lines on a word frequency list:
curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt;awk 'NR==FNR{if(NR>=20000&&NR<=100000)a[$0];next}$0 in a' word-frequency.txt words2.txt>words3.txt
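The line-range filter is easier to follow on a small scale. The sketch below uses a made-up five-line frequency list and keeps only candidates found on lines 2 to 4; the data and the range are assumptions for illustration, standing in for the 20000 to 100000 range above.

```shell
# Made-up frequency list, most frequent word first (assumed data).
freq=$(mktemp)
printf '%s\n' の 食べる 走る 泳ぐ 潜る > "$freq"

# Made-up candidate words, such as what words2.txt might hold.
cand=$(mktemp)
printf '%s\n' 潜る 走る の 泳ぐ > "$cand"

# First file: remember only the words on lines 2 through 4.
# Second file: print candidates that were remembered, in candidate order.
kept=$(awk 'NR==FNR{if(NR>=2&&NR<=4)a[$0];next}$0 in a' "$freq" "$cand")
printf '%s\n' "$kept"
```

Here the word on line 1 and the word on line 5 fall outside the range, so only the two candidates from the middle of the list are kept.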
2013-09-20, 11:57 am
Edit: never mind. I just realized MeCab is part of Mac OS and the man page is in English. Cool! I'm not so good with shell scripting, but your examples should get me started.
Thanks! Is there English documentation for MeCab? I've done cursory searches, but nothing turned up.
Edited: 2013-09-20, 12:04 pm
