I am in the process of attempting to create a large frequency-based deck similar in nature to the Core decks. This has proved to be a vastly larger undertaking than I had originally thought!
One thing that I am going to need to be able to do is search for specific words in my sentence collection (I have millions of sentences). I can't do this with a standard "find" command in a text editor, because the dictionary form of a word is usually different from the way the word is actually used in the sentences.
Does anyone know of any tools or have any idea how I can perform such a search without resorting to writing my own application to do it?
Very interested to hear how this goes for you. The mecab tool looks great.
Just curious, Zarxrax, where did you get your corpus of millions of sentences from?
I finished writing a basic search application, and I would be glad to make it available if anyone else needs it (Windows only). I probably need to clean it up a bit first, though, and it currently only works for a very specific use case to meet my own needs. Let me know if you would like me to upload it, because if no one needs it, I won't waste my time.
Your input needs to be a text file. Each line of your text file is considered a "sentence". The application will convert each word in the file into the dictionary form and store the data in a database. Then you are able to search for a single word in the text (multi-word is not supported at the moment).
The advantage of this over a standard search function is that you can search for 食べる and you will get back sentences containing 食べて 食べなさい 食べられて etc. It is also quite fast. In a database of 15 million sentences, it takes ~3 seconds to output 26,000 sentences containing 食べる.
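For anyone curious about the general approach, here is a minimal sketch of the lemmatize-and-index idea in Python with SQLite. This is not the actual application; a tiny hand-written lemma table and space-separated input stand in for what MeCab would produce (MeCab would supply the real tokenization and dictionary forms).

```python
import sqlite3

# Toy lemma table standing in for a real morphological analyzer such as MeCab.
# In practice, each surface form and its dictionary form would come from
# MeCab's per-token output, not a hand-written dict.
LEMMAS = {"食べて": "食べる", "食べなさい": "食べる", "食べた": "食べる"}

def lemmatize(token):
    # Fall back to the token itself when no dictionary form is known.
    return LEMMAS.get(token, token)

def build_index(sentences):
    # One row per (lemma, sentence id); an index on lemma keeps lookups fast
    # even over millions of rows.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sent (id INTEGER PRIMARY KEY, text TEXT)")
    db.execute("CREATE TABLE occ (lemma TEXT, sent_id INTEGER)")
    db.execute("CREATE INDEX idx_lemma ON occ (lemma)")
    for i, s in enumerate(sentences):
        db.execute("INSERT INTO sent VALUES (?, ?)", (i, s))
        # Assumes pre-tokenized, space-separated input; real tokenization
        # would come from the analyzer.
        for tok in s.split():
            db.execute("INSERT INTO occ VALUES (?, ?)", (lemmatize(tok), i))
    db.commit()
    return db

def search(db, lemma):
    # Returns every sentence containing any inflected form of the lemma.
    rows = db.execute(
        "SELECT DISTINCT text FROM sent JOIN occ ON sent.id = occ.sent_id "
        "WHERE occ.lemma = ?", (lemma,))
    return [r[0] for r in rows]
```

Searching for 食べる then matches sentences containing 食べて or 食べた, since all of them were indexed under the same dictionary form.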
This is already proving quite useful. For instance, my word frequency report was showing a high occurrence of the word 「す」, and I couldn't figure out where it was coming from. After searching my sentences for 「す」, I could see that it was due to many sentences ending in elongated forms like ま～す, which caused the す to be reported as a separate word.
I'll toss in an idea. If you look into MorphMan, you can see how it creates morphemes from vocabulary, turning 食べる into 食べ. You could perhaps utilize that in your program.
I would be very interested in seeing the application if you have time to upload it. For now I'm managing to do most of what I need by searching inside Anki and/or using MorphMan, but seeing another approach would be useful. I only wish I could figure out how to do this for Latin and Greek (both highly inflected languages, much more so than Japanese, and with a lot of ambiguous forms).
Snagged - thanks so much! Will let you know if I have any feedback.