
What software can extract words from a text file?

#1
Hello,

I was wondering if there is any software that could help me make glossaries by extracting the words from the texts I am using (newspaper articles, web pages, ebooks, etc.). I remember JWPce could extract individual characters, but what about compounds and kana words?
Thank you.
#2
I'm not quite sure I'm understanding what you're looking for exactly...

But it seems like Copy/Paste would suit this quite fine...?
#3
Asriel Wrote:I'm not quite sure I'm understanding what you're looking for exactly...

But it seems like Copy/Paste would suit this quite fine...?
I think he's looking for something a little more automated. ;)

I don't have a solution for the problem, but I think you're looking for a program that will extract each individual word from the text, then create a list of the words so you can put them in Anki or Smart.fm or something.

I would love to see something like this as well. I've thought about programming it, but I always get stuck when it comes to distinguishing one word from another in the text. I know some programs manage, I just don't know how they do it.
#4
There's a program called "chasen" but I don't know if it does exactly what you want.
#5
I think you can try the technique used in the "Ultimate Study pack". It's somewhere on the forum...
#6
Thanks for the replies.
http://jgloss.sourceforge.net/
JGloss works best, but it has some trouble when the text contains a lot of romaji, so you need to clean that out first. By the way, is there a way to do this automatically instead of running find-and-replace-all for every letter of the alphabet?

Cheers
#7
clemente Wrote:Thanks for the replies.
http://jgloss.sourceforge.net/
JGloss works best, but it has some trouble when the text contains a lot of romaji, so you need to clean that out first. By the way, is there a way to do this automatically instead of running find-and-replace-all for every letter of the alphabet?

Cheers
Writing a program to remove all alphanumerics from a text wouldn't be hard. I don't know of any native Windows program that does it, but this would be a piece of cake on a UNIX-based OS and could be done from the command line easily enough. I'm no UNIX whiz, but I suspect that with sed or grep you could find and replace, then write out a new file minus the alphanumerics. Windows ports of sed and grep should be available as well; someone just needs to work out the proper command line.
#8
(I really don't know what's with all the *NIX love on this forum...)

Just use a regex:

([a-zA-Z0-9.,;:&!"\\/\[\]*])

Test it here: http://gskinner.com/RegExr/
Paste your own text into the top text box.
Use it with a replace (replacing each match with nothing) and it will remove all ASCII letters and digits, plus some punctuation, from a text.
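
If you'd rather script it than paste into a web page, here is a minimal Python sketch of the same replace. The function name and the sample string are made up for illustration; the pattern is the character class from above. Note that it only covers ASCII, so full-width romaji and the spaces around removed words are left alone.

```python
import re

# The character class from the post: ASCII letters, digits and some punctuation.
ROMAJI_AND_PUNCT = re.compile(r'[a-zA-Z0-9.,;:&!"\\/\[\]*]')


def strip_romaji(text: str) -> str:
    """Delete every character matched by the pattern; keep everything else."""
    return ROMAJI_AND_PUNCT.sub("", text)


# Hypothetical sample line mixing romaji and Japanese.
sample = "JWPce は kanji を抽出できます. OK!"
print(strip_romaji(sample))
```

To clean a whole file, read it, pass the contents through `strip_romaji`, and write the result to a new file before feeding it to the glossary tool.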

Also, can't you just paste the text into WWWJDIC and get a translation per line, i.e. a glossary?

[edit] This will only remove the romaji; it won't split the text up into words. To do that, you might try feeding the output from this into MeCab from SourceForge. On Windows, it opens a DOS box that you can paste Japanese text into, and it will then split the text up for you.
Edited: 2010-03-16, 8:43 pm
#9
I'm not sure, but I think WaKan can do this:

http://wakan.manga.cz/
#10
I feel like I should be excited by these programs, but my tool-using brain functions aren't working properly at the moment. I can still use a stick to poke an ant-hill, though.
Edited: 2010-03-17, 5:33 pm
#11
Thank you all for the new suggestions on how to "skin" the text.
I also tried WaKan, but I could not easily figure out how to make the word list from my text (although I could with a smaller, Japanese-only text).
Cheers.

One last question: is there a way to change kana to romaji in OpenOffice?
I tried using ChaSen, but it's a bit complicated.
#12
ChaSen is the old way to do this. There's also KAKASI, but it too is outdated.

The best way to segment words now is to use MeCab, which was written by the same author as Chasen. It has a mode which will split sentences into their constituent parts.
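
A rough sketch of how that could be turned into a word list, assuming the `mecab` command is on your PATH with a dictionary installed and set up for UTF-8. The `-O wakati` option makes MeCab print each sentence with spaces between the words. The function names and the sample line are made up for illustration, and the sample is hand-written rather than real MeCab output.

```python
import subprocess
from collections import Counter


def segment_with_mecab(text: str) -> str:
    """Run `mecab -O wakati` on the text and return its space-separated output.

    Assumes mecab is installed and its dictionary uses UTF-8.
    """
    result = subprocess.run(
        ["mecab", "-O", "wakati"],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def word_counts(wakati_output: str) -> Counter:
    """Count each space-separated token in wakati-style output."""
    return Counter(wakati_output.split())


# Hand-written stand-in for MeCab output, so this part runs without MeCab.
sample = "私 は 本 を 読む\n本 は 面白い\n"
for word, count in word_counts(sample).most_common():
    print(word, count)
```

With MeCab installed, `word_counts(segment_with_mecab(your_text))` should give the same kind of list for a real text. Punctuation and particles come out as tokens too, so you would probably want to filter those before putting the list into a flashcard program.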