Tool to extract unknown vocab from text?


yogert909 Member
From: Los Angeles, Ca Registered: 2013-05-03 Posts: 270 Website

Hi, I've been looking for a tool that can extract all the vocabulary words from a text, compare them against a list of known words, and return a list of unknown vocab. It seems that MorphMan does this, but it's giving me errors.

Animosophy Member
Registered: 2013-02-19 Posts: 180

If nothing else use cb's text analysis tool: http://forum.koohii.com/viewtopic.php?pid=177549

Then go to http://www.tracemyip.org/tools/remove-d … ds-in-text , copy and paste your known words in front of the new vocabulary, and divide the two lists with a bunch of hyphens or something. After removing duplicates, keep only the vocab left below the divider, i.e. what remains of the extracted word list.
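For anyone who would rather do this offline, here is a rough shell sketch of the same idea. It assumes the web tool drops repeated lines and keeps the first occurrence of each, which I haven't verified; the word lists and the `-----` divider are made up for illustration.

```shell
# Hypothetical sample lists, one word per line
known='食べる
これ'
extracted='食べる
の
これ
見る'

# Known words first, then a divider, then the extracted words.
# The first awk drops repeated lines (keeping the first occurrence);
# the second prints only the lines that come after the divider.
new=$(printf '%s\n' "$known" ----- "$extracted" \
  | awk '!seen[$0]++' \
  | awk 'found; /^-----$/{found=1}')

printf '%s\n' "$new"
```

Anything that already appeared above the divider is gone from below it, so what is left should be the words you don't know yet.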

Last edited by Animosophy (2013 September 19, 7:41 pm)

yogert909 Member
From: Los Angeles, Ca Registered: 2013-05-03 Posts: 270 Website

cb's tool looks interesting.  I'll give it a try and see.

Thanks!

lauri_ranta Member
Registered: 2012-03-31 Posts: 139 Website

You can use MeCab to find lemmas of specific types:

mecab -F '%t %f[6]\n' -E '' *.txt|awk '$1~/[26]/{print $2}'|awk '!a[$0]++'>words.txt

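%t is the character type of the word (0: other, 2: kanji or kanji with kana, 3: special character, 4: number, 5: latin, 6: hiragana, 7: katakana), and %f[6] is the lemma, or field 6 in the default output. The first awk keeps types 2 and 6 and prints the lemma; the second removes duplicate lines. Change [26] to 2 in the first awk pattern if you only want to include words with at least one kanji.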
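To see what the two awk filters do without installing anything, here is a small sketch that feeds them hand-written lines in the same "type lemma" shape. The sample is made up and is not real MeCab output.

```shell
# Made-up stand-in for the output of: mecab -F '%t %f[6]\n' -E ''
sample='2 食べる
6 の
7 テレビ
2 食べる
6 これ
2 見る'

# Keep types 2 (kanji) and 6 (hiragana), print the lemma column,
# then drop repeated lemmas while preserving first-seen order.
words=$(printf '%s\n' "$sample" \
  | awk '$1~/[26]/{print $2}' \
  | awk '!a[$0]++')

printf '%s\n' "$words"
```

The katakana word (type 7) is filtered out and the repeated 食べる is printed only once.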

Removing words that are included in a text file for learned words:

awk 'NR==FNR{a[$0];next}!($0 in a)' learned.txt words.txt>words2.txt
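A small worked example of that command, in case the NR==FNR idiom is unfamiliar. The file contents are made up; only the awk program is the one from above.

```shell
tmp=$(mktemp -d)

# Hypothetical inputs, one word per line
printf '%s\n' 食べる これ > "$tmp/learned.txt"
printf '%s\n' 食べる の これ 見る > "$tmp/words.txt"

# While reading the first file (NR==FNR), remember each line as a key.
# For the second file, print only lines that are not among those keys.
unknown=$(awk 'NR==FNR{a[$0];next}!($0 in a)' "$tmp/learned.txt" "$tmp/words.txt")

printf '%s\n' "$unknown"
rm -r "$tmp"
```

One thing to watch: if learned.txt is empty, NR==FNR is also true for every line of words.txt, so nothing gets printed. Put at least one line in learned.txt before running it.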

Removing words that don't appear within specific lines on a word frequency list:

curl lri.me/japanese/word-frequency.txt|tail -n+3>word-frequency.txt
awk 'NR==FNR{if(NR>=20000&&NR<=100000)a[$0];next}$0 in a' word-frequency.txt words2.txt>words3.txt
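The same filter on a toy frequency list, as a sketch. The list, the words, and the `lo`/`hi` variables are made up for illustration, and it assumes the frequency file has exactly one word per line, most frequent first.

```shell
tmp=$(mktemp -d)

# Hypothetical frequency list (line number = rank) and candidate words
printf '%s\n' の 見る 走る 読む > "$tmp/freq.txt"
printf '%s\n' 読む 見る 泳ぐ > "$tmp/words2.txt"

# Keep only candidates whose rank falls between lo and hi (inclusive)
lo=2
hi=3
kept=$(awk -v lo="$lo" -v hi="$hi" \
  'NR==FNR{if(NR>=lo&&NR<=hi)a[$0];next}$0 in a' \
  "$tmp/freq.txt" "$tmp/words2.txt")

printf '%s\n' "$kept"
rm -r "$tmp"
```

Words that are missing from the frequency list altogether are dropped too, since they never end up in the array.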

yogert909 Member
From: Los Angeles, Ca Registered: 2013-05-03 Posts: 270 Website

Edit: never mind. I just realized MeCab is part of Mac OS and the man page is in English. Cool! I'm not so good with shell scripting, but your examples should get me started.

Thanks!  Is there English documentation for MeCab?  I've done cursory searches, but nothing turned up.

Last edited by yogert909 (2013 September 20, 12:04 pm)
