
Any way to search for words in my own sentence corpus?

#1
I am in the process of attempting to create a large frequency-based deck similar in nature to the Core decks. This has proved to be a vastly larger undertaking than I had originally thought!


One thing that I am going to need to be able to do is search for specific words in my sentence collection (I have millions of sentences). I cannot do this with a standard "find" command in a text editor, because the dictionary form of the word is usually different from the way the word is actually used in the sentences.

Does anyone know of any tools or have any idea how I can perform such a search without resorting to writing my own application to do it?
#2
(2017-02-11, 9:53 am)Zarxrax Wrote: I am in the process of attempting to create a large frequency-based deck similar in nature to the Core decks. This has proved to be a vastly larger undertaking than I had originally thought!


One thing that I am going to need to be able to do is search for specific words in my sentence collection (I have millions of sentences). I cannot do this with a standard "find" command in a text editor, because the dictionary form of the word is usually different from the way the word is actually used in the sentences.

Does anyone know of any tools or have any idea how I can perform such a search without resorting to writing my own application to do it?

Can't you just use wildcards?  E.g., if the dictionary form is "kaku" (to write), do a search on kak* in order to get kaku, kakimasita etc.
#3
You may try searching only for the invariable part of the verb or adjective; that way you will get more hits.
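For example (a rough illustration in Python; the sample sentences are just made up), searching for the invariable stem 食べ instead of the dictionary form 食べる catches the conjugated forms too:

sentences = [
    "ご飯を食べました。",    # polite past
    "何を食べていますか。",  # -te iru form
    "食べるのが好きです。",  # dictionary form
]

stem, dict_form = "食べ", "食べる"
print(sum(stem in s for s in sentences))       # 3 hits
print(sum(dict_form in s for s in sentences))  # 1 hit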

Still, you are guaranteed to miss examples because of the mess that is Japanese orthography.
Here is an interesting study on The Challenges of Intelligent Japanese Searching.
Edited: 2017-02-11, 12:04 pm
#4
(2017-02-11, 10:31 am)phil321 Wrote: Can't you just use wildcards?  E.g., if the dictionary form is "kaku" (to write), do a search on kak* in order to get kaku, kakimasita etc.

Yeah, but with some words I'm going to get a lot of false positives, probably more false positives than hits for the actual word I'm seeking.
#5
(2017-02-11, 10:33 am)Zarxrax Wrote:
(2017-02-11, 10:31 am)phil321 Wrote: Can't you just use wildcards?  E.g., if the dictionary form is "kaku" (to write), do a search on kak* in order to get kaku, kakimasita etc.

Yeah, but with some words I'm going to get a lot of false positives, probably more false positives than hits for the actual word I'm seeking.

OK, this is similar to what I'm trying to do (see my thread here).

Without going into any detail, what you can do is:

1) use MeCab to put spaces between words, so that your search will be limited to single words;

2) programmatically attach the dictionary form of the word to each word, and let your program search for that instead of the actual word (you can use MeCab to do this too);

I'm doing something very similar to what you need, but as I'm not a programmer, this will require a lot of time for me, so it won't be completed soon.

I leave this here in the hope that someone more expert than me will develop something similar to (and surely way better than) mine.

If somebody wants to tackle this, I can add more details on how to do it (the part that I've already figured out how to do).
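For now, here's a very rough sketch of the idea in Python (untested, and it assumes the MeCab Python bindings and the default IPAdic dictionary, where the dictionary form is the 7th comma-separated feature field):

import MeCab  # assumes the MeCab Python bindings are installed

tagger = MeCab.Tagger()

def to_dictionary_forms(sentence):
    # Turn a sentence into space-separated dictionary forms, e.g.
    # "ご飯を食べました" -> "ご飯 を 食べる ます た"
    lemmas = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS" or not line.strip():
            break
        surface, features = line.split("\t")
        fields = features.split(",")
        # With IPAdic, fields[6] is the dictionary form; "*" means unknown
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        lemmas.append(base)
    return " ".join(lemmas)

The join with spaces gives you 1) and taking fields[6] gives you 2), both in one pass.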
#6
(2017-02-11, 10:33 am)Zarxrax Wrote:
(2017-02-11, 10:31 am)phil321 Wrote: Can't you just use wildcards?  E.g., if the dictionary form is "kaku" (to write), do a search on kak* in order to get kaku, kakimasita etc.

Yeah, but with some words I'm going to get a lot of false positives, probably more false positives than hits for the actual word I'm seeking.
Wildcards are indeed not an option, and kaku is a good example of why. The results of 書* won't help you much.

What about cophnia61's suggestion to use MeCab [or a similar tool] to change the words into dictionary form? If you can do that, it would be very easy. Just duplicate all sentences in your database and use the dictionary-form column for the search and the original-sentence column for the output.
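Something like this, for example (a quick, untested sketch in Python; the table and column names are just made up, and it assumes the sentences have already been converted to space-separated dictionary forms, e.g. with MeCab as in cophnia61's sketch above):

import sqlite3

conn = sqlite3.connect("sentences.db")
conn.execute("CREATE TABLE IF NOT EXISTS sentences (original TEXT, dict_forms TEXT)")

def add_sentence(original, dict_forms):
    conn.execute("INSERT INTO sentences VALUES (?, ?)", (original, dict_forms))
    conn.commit()

def search(word):
    # Search the dictionary-form column, output the original column.
    # Padding with spaces makes it a whole-word match rather than a substring match.
    cur = conn.execute(
        "SELECT original FROM sentences WHERE ' ' || dict_forms || ' ' LIKE ?",
        ("% " + word + " %",))
    return [row[0] for row in cur]

add_sentence("ご飯を食べました", "ご飯 を 食べる ます た")
print(search("食べる"))  # -> ['ご飯を食べました']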
#8
(2017-02-11, 11:44 am)cophnia61 Wrote: 2) programmatically attach the dictionary form of the word to each word, and let your program search for that instead of the actual word (you can use MeCab to do this too);

Looking at Mecab right now, and I'm not sure how to do this step. Do you know what options I need to use?
#9
(2017-02-11, 4:15 pm)Zarxrax Wrote:
(2017-02-11, 11:44 am)cophnia61 Wrote: 2) programmatically attach the dictionary form of the word to each word, and let your program search for that instead of the actual word (you can use MeCab to do this too);

Looking at Mecab right now, and I'm not sure how to do this step. Do you know what options I need to use?

I'm just uploading a simple tool which does something like that. As I'm going to work on it over the next weeks/months, I'm going to post it in a separate thread (wait 5 mins and I'll post the link here!).

EDIT: click here!
Edited: 2017-02-11, 5:13 pm
#10
(2017-02-11, 4:47 pm)cophnia61 Wrote:
(2017-02-11, 4:15 pm)Zarxrax Wrote:
(2017-02-11, 11:44 am)cophnia61 Wrote: 2) programmatically attach the dictionary form of the word to each word, and let your program search for that instead of the actual word (you can use MeCab to do this too);

Looking at Mecab right now, and I'm not sure how to do this step. Do you know what options I need to use?

I'm just uploading a simple tool which does something like that. As I'm going to work on it over the next weeks/months, I'm going to post it in a separate thread (wait 5 mins and I'll post the link here!).

EDIT: click here!

Cool, thanks, I'll check it out.
#11
Very interested to hear how this goes for you. The mecab tool looks great.

Just curious, Zarxrax, where did you get your corpus of millions of sentences from?
#12
(2017-02-11, 8:00 pm)tanaquil Wrote: Very interested to hear how this goes for you. The mecab tool looks great.

Just curious, Zarxrax, where did you get your corpus of millions of sentences from?

Made it from TV scripts.
#13
If possible, use kuromoji (http://www.atilika.org/ and note https://github.com/atilika/kuromoji/issues/111) instead of mecab. Kuromoji gives better segmentations.
Edited: 2017-02-14, 7:36 am
#14
(2017-02-11, 9:53 am)Zarxrax Wrote: One thing that I am going to need to be able to do is search for specific words in my sentence collection (I have millions of sentences). I cannot do this with a standard "find" command in a text editor, because the dictionary form of the word is usually different from the way the word is actually used in the sentences.

Does anyone know of any tools or have any idea how I can perform such a search without resorting to writing my own application to do it?

Anki has this ability with Japanese, doesn't it? I guess the code would be in a plugin.
#15
(2017-02-15, 12:53 am)juniperpansy Wrote: Anki has this ability with Japanese, doesn't it? I guess the code would be in a plugin.

No, at least not with just the standard Japanese support plugin. Anki will only search the specific form of the word that you type.

Anyway, I've got something halfway rigged up now, though. I've managed to index and store my text in a database; I just need to make an interface to search and return sentences.
#16
(2017-02-15, 6:26 am)Zarxrax Wrote:
(2017-02-15, 12:53 am)juniperpansy Wrote: Anki has this ability with Japanese, doesn't it? I guess the code would be in a plugin.

No, at least not with just the standard Japanese support plugin. Anki will only search the specific form of the word that you type.

Anyway, I've got something halfway rigged up now, though. I've managed to index and store my text in a database; I just need to make an interface to search and return sentences.
The standard Japanese support plugin installs mecab, so if you wanted to do it inside Anki, you could.
#17
I finished writing a basic search application, and I would be glad to make it available if anyone else needs it (Windows only). I probably need to clean it up a bit first, though. And it currently only works for a very specific use case to meet my own needs. Let me know if you would like me to upload it, because if no one needs it, I won't waste my time.

Your input needs to be a text file. Each line of your text file is considered a "sentence". The application will convert each word in the file into the dictionary form and store the data in a database. Then you are able to search for a single word in the text (multi-word is not supported at the moment).

The advantage of this over a standard search function is that you can search for 食べる and you will get back sentences containing 食べて 食べなさい 食べられて etc.  It is also quite fast. In a database of 15 million sentences, it takes ~3 seconds to output 26,000 sentences containing 食べる.
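I won't paste the application's actual code, but for anyone who would rather roll their own, one way to get the same kind of behaviour is to index sentences by dictionary form so that a search becomes a lookup rather than a scan of the text. A stripped-down, untested sketch (SQLite again; dict_forms is the space-separated MeCab output from earlier in the thread):

import sqlite3

conn = sqlite3.connect("corpus.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, original TEXT);
    CREATE TABLE IF NOT EXISTS lemmas (lemma TEXT, sentence_id INTEGER);
    CREATE INDEX IF NOT EXISTS idx_lemma ON lemmas (lemma);
""")

def index_sentence(original, dict_forms):
    # dict_forms: the sentence as space-separated dictionary forms
    cur = conn.execute("INSERT INTO sentences (original) VALUES (?)", (original,))
    sid = cur.lastrowid
    conn.executemany("INSERT INTO lemmas VALUES (?, ?)",
                     [(lemma, sid) for lemma in set(dict_forms.split())])
    conn.commit()

def search(word):
    # Exact match on the dictionary form; the index on lemma keeps this fast
    cur = conn.execute(
        "SELECT s.original FROM lemmas AS l JOIN sentences AS s ON s.id = l.sentence_id "
        "WHERE l.lemma = ?", (word,))
    return [row[0] for row in cur]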

This is already proving quite useful. For instance, my word frequency report was showing a high occurrence of the word 「す」. I couldn't figure out where this was coming from. After searching my sentences for 「す」, I was able to see that this was due to many sentences ending with something like ま~す, which caused the す to be reported as a separate word.
#18
I'll toss in an idea. If you look into MorphMan, you can see how it creates morphemes from vocabulary, turning 食べる into 食べ. You could perhaps utilize that in your program.
#19
In addition, check this out: https://github.com/mifunetoshiro/kanjium

Oops, wrong thread.
Edited: 2017-02-18, 1:49 pm
#20
I would be very interested in seeing the application if you have time to upload it. For now I'm managing to do most of what I need by searching inside Anki and/or using MorphMan, but seeing another approach would be useful. I only wish I could figure out how to do this for Latin and Greek (both of which are highly inflected, much more so than Japanese, and with a lot of ambiguous forms).
#21
Here it is: https://mega.nz/#!JVVkXLRa!D4jUKNnyk6gKS...eD51AO8Clg
There are probably still a few unhandled exceptions if you do weird stuff, like trying to search without having a database.

I've included a subset of the Tanaka Corpus as examples.txt.
Upon starting the program, press "run indexer", and then load examples.txt.
After indexing it, go back to the main screen and load the .db file that was generated.
#22
Snagged - thanks so much! Will let you know if I have any feedback.