kanji koohii FORUM
Idea for a software - Printable Version



Idea for a software - cophnia61 - 2015-10-04

I was thinking about writing a program that, given a word 'x' (or a list of words) in dictionary form and a native text file (or a directory of native text files), outputs a list of all the sentences that contain the word 'x'.

The main issue is what order to put the sentences in. A good idea would be to order them by difficulty, but how do you evaluate that?
Maybe by using a frequency list, so that sentences with common words are put first? But I fear this would require too much CPU work and be too slow.
So maybe it's best to limit it to one word at a time: you enter a word, it analyzes your corpus of Japanese texts, and it returns all the sentences with that word in order of difficulty.
Do you know if something similar exists?
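
A rough sketch of the frequency-list ordering described above, in Python. It assumes each sentence has already been split into words (Japanese has no spaces, so a real tool would need a tokenizer for that step) and that the frequency list is available as a word-to-rank mapping. The names `difficulty`, `rank_sentences` and `freq_rank` are made up for illustration, and "average frequency rank" is only one possible way to score difficulty.

```python
def difficulty(tokens, freq_rank, unknown_rank=None):
    """Score a sentence as the mean frequency rank of its words.

    freq_rank maps word -> rank (1 = most common). Words missing from the
    list get unknown_rank, which defaults to one past the end of the list.
    Lower scores mean "easier" under this (assumed) definition.
    """
    if unknown_rank is None:
        unknown_rank = len(freq_rank) + 1
    if not tokens:
        return 0.0
    return sum(freq_rank.get(t, unknown_rank) for t in tokens) / len(tokens)


def rank_sentences(target, sentences, freq_rank):
    """Return the sentences containing target, easiest first.

    sentences is a list of token lists (pre-tokenized, an assumption).
    """
    hits = [s for s in sentences if target in s]
    return sorted(hits, key=lambda s: difficulty(s, freq_rank))


if __name__ == "__main__":
    # Toy frequency list and corpus, invented for the example.
    freq_rank = {"私": 1, "は": 2, "本": 3, "を": 4, "読む": 5, "哲学": 6}
    sentences = [
        ["私", "は", "哲学", "を", "読む"],
        ["私", "は", "本", "を", "読む"],
        ["私", "は", "本"],
    ]
    for s in rank_sentences("読む", sentences, freq_rank):
        print("".join(s))
```

Each sentence is scored with one dictionary lookup per word, so the cost grows linearly with the size of the corpus.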


Idea for a software - vix86 - 2015-10-05

CPUs are pretty fast. If you pre-cut the native files into sentences and loaded those into RAM, I can't see it taking more than a minute. It'd take 5+ million (maybe not even that) sentences before you start noticing long processing times.
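
Something like the following would be a minimal way to try that out. It is only a sketch: it assumes the corpus is a directory of UTF-8 `.txt` files, treats 。, ！ and ？ as sentence endings, and matches the word as a plain substring, so conjugated forms of a dictionary-form word would be missed without a morphological analyzer. The function names are invented for the example.

```python
import re
from pathlib import Path

# A sentence is a run of text up to and including 。, ！ or ？ (an assumption;
# real text also has quotes, ellipses and so on that this ignores).
_SENTENCE = re.compile(r"[^。！？]+[。！？]?")


def split_sentences(text):
    """Cut raw text into sentences, line by line."""
    sentences = []
    for line in text.splitlines():
        for match in _SENTENCE.findall(line):
            sentence = match.strip()
            if sentence:
                sentences.append(sentence)
    return sentences


def load_corpus(directory):
    """Read every .txt file under directory and keep all sentences in RAM."""
    sentences = []
    for path in sorted(Path(directory).rglob("*.txt")):
        sentences.extend(split_sentences(path.read_text(encoding="utf-8")))
    return sentences


def find_sentences(word, sentences):
    """Return every sentence that contains word as a substring."""
    return [s for s in sentences if word in s]


if __name__ == "__main__":
    text = "今日は暑い。明日も暑い？\nたぶん"
    print(find_sentences("暑い", split_sentences(text)))
```

The corpus is cut into sentences once up front, so each later lookup is just a single pass over a list that is already in memory.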

There's a problem you haven't considered though, which is that just having the word won't necessarily mean you have the right sentence. Think of how many different possible meanings there are for あげる or 付ける. So you are still going to have to parse the sentences on your own after getting them.

Ranking difficulty could maybe be done by taking the top 100-1000 frequency words, discarding particles, using those as a starting point, and then searching for N+1 matches from there.
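
To make the N+1 idea concrete, here is a small sketch under some loud assumptions: sentences are already tokenized into word lists, the frequency list is ordered most-common-first, and the particle set below is a short hand-picked sample rather than a complete list. All the names (`PARTICLES`, `build_known`, `n_plus_one`) are hypothetical.

```python
# A small, incomplete sample of particles (assumption, for illustration only).
PARTICLES = {"は", "が", "を", "に", "で", "と", "も", "の", "へ", "や", "か", "ね", "よ"}


def build_known(freq_list, top_n=1000, particles=PARTICLES):
    """Seed the 'known' set with the top_n frequency words, minus particles."""
    return {w for w in freq_list[:top_n] if w not in particles}


def unknown_words(tokens, known, particles=PARTICLES):
    """Words in the sentence that are neither known nor particles."""
    return {t for t in tokens if t not in known and t not in particles}


def n_plus_one(sentences, known, particles=PARTICLES):
    """Return (sentence, new_word) pairs with exactly one unknown word."""
    result = []
    for tokens in sentences:
        unknown = unknown_words(tokens, known, particles)
        if len(unknown) == 1:
            result.append((tokens, next(iter(unknown))))
    return result


if __name__ == "__main__":
    # Toy data invented for the example.
    freq_list = ["の", "は", "私", "本", "読む", "を"]
    known = build_known(freq_list, top_n=6)
    sentences = [
        ["私", "は", "本", "を", "読む"],    # nothing new
        ["私", "は", "哲学", "を", "読む"],  # one new word: N+1
        ["哲学", "の", "論文"],              # two new words
    ]
    for tokens, new_word in n_plus_one(sentences, known):
        print("".join(tokens), "->", new_word)
```

After learning a new word you would add it to `known` and search again, so the pool of N+1 sentences grows from the starting set.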


Idea for a software - Roketzu - 2015-10-05

This reminds me of something I did about a year ago, only I wanted to have an audio version of what you are looking for. What I did was use subs2srs on whole seasons of Japanese-subbed J-dramas and anime, where I changed some of the settings so that the files subs2srs output were named after whatever the sentence happened to be, so it looked like this:

[Image: hLxPMzm.png]

I ended up with over 33,000 individual files like this from 90 episodes of 7 different shows and used the program Everything to search through them like a dictionary whenever I wanted to hear a word being used in context. So searching for 一応, for example, would look like this:

[Image: eUz6sb2.png]

The problem was that even with this huge number of files it was still nowhere near comprehensive, and a lot of the time you'd get sentences that meant very little without context. Most words I wanted to find never came up, and I figured I'd have to do this for hundreds of shows/anime before having anything comprehensive. I still think it would be possible, though quite time-consuming.

Still, it was nice when it worked and I got a bunch of audio files of naturally spoken sentences containing a word I wanted to hear in action.