kanji koohii FORUM
Is there a program to put spaces in Japanese Sentences? - Printable Version




Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-26

I know about mecab. What I want is something on top of mecab (or anything similar) that will take Japanese text and put spaces in between the words.

I am currently working on something to do just this, but I thought I should check here to see if there is anything that already does this.

Example

Input: 校長先生、私の部屋が一番近いです。
Output: 校長先生_私の_部屋_が_一番_近い_です
Note:
- I replaced spaces with underscores to make it a bit more readable.
- I don't care too much about the fine details (like if it should be 部屋_が or 部屋が)

Background: Why I am asking this: http://stackoverflow.com/q/6424211/377384

cheers


Is there a program to put spaces in Japanese Sentences? - Hashiriya - 2011-06-26

doesn't make much sense to me why you would want to... it's like using romaji instead of kana. Just keep reading and you will get it. I read Japanese without spaces just as fast as I do English. Increase your exposure!


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-26

Hashiriya Wrote:doesn't make much sense to me why you would want to... it's like using romaji instead of kana. Just keep reading and you will get it. I read Japanese without spaces just as fast as I do English. Increase your exposure!
I don't want the sentences with spaces in them for reading. I am only using it for a sentence aligner's heuristic, which takes 'words' as input. And the heuristic defines words as characters delimited by spaces. So I need spaces.

There is a bit of background in the original post that might explain (though looking back over it, maybe it is too confusing).


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-26

Once you have the output from mecab I would think that it would be trivial to put in the spaces.
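
A rough Python sketch of that idea, assuming MeCab's default output layout (one token per line: the surface form, a tab, then the feature string, with a line reading EOS after each sentence). The function name and the hand-written sample are illustrative only, not output from a real run:

```python
def space_tokens(mecab_output):
    """Join the surface forms found in MeCab's default output with spaces."""
    surfaces = []
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and end-of-sentence markers
        # The surface form is everything before the first tab.
        surfaces.append(line.split("\t", 1)[0])
    return " ".join(surfaces)


# Hand-written sample in MeCab's default layout (feature strings abbreviated).
sample = "部屋\t名詞,一般\nが\t助詞,格助詞\n近い\t形容詞,自立\nEOS\n"
print(space_tokens(sample))
```

The sample prints the three surface forms separated by single spaces, which is the shape a space-delimited aligner expects.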


Is there a program to put spaces in Japanese Sentences? - xfact2007 - 2011-06-26

You can use the ChaSen POS tagger, which is used for mid-sized corpora and for kanji-to-hiragana converters:
http://chasen.aist-nara.ac.jp/

You can read more about others' experiences:
http://forum.koohii.com/showthread.php?pid=13665#pid13665
(search for "Chasen", pm215 #45 post)
You should convert your text to a Japanese encoding (EUC-JP in the command below) if your source text is Unicode. I used that program on the Linux command line. Pretty straightforward to use.

chasen eucjp.txt | iconv -f eucjp -t utf8 | sed 's/\t.*//' | tr '\n' ' '
Example output: 国 の 中央 防災 会議 の 専門 調査会 は 2 6 日 、 東日本 大震災 を 教訓 に 、


Is there a program to put spaces in Japanese Sentences? - iSoron - 2011-06-26

Using awk:

mecab input.txt | awk '!/^EOS/ {printf "%s ", $1}'

If you don't care about punctuation:

mecab input.txt | awk '!/^EOS/ && $2 !~ /^記号/ {printf "%s ", $1}'


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-27

iSoron Wrote:Using awk:

mecab input.txt | awk '!/^EOS/ {printf "%s ", $1}'

If you don't care about punctuation:

mecab input.txt | awk '!/^EOS/ && $2 !~ /^記号/ {printf "%s ", $1}'
I am not sure how to make this work in awk (I am using OSX > Terminal (bash), if that makes a difference), but I think the problem is my setup rather than your code. This is the first time I have tried to use AWK. I only have mecab through Anki. I tried to install it myself, but have not had any luck so far.

In any case, I have managed to get good results through just adding spaces between each character instead of trying to figure out the words. I am then using the MS sentence aligner:

http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/

It works really well - from a sentence set of about 8000, it matched over 4000 with good accuracy (have not found a mistake yet). I expect that the main thing that is holding this back is the way I break up sentences. I am currently just using the period and the '。' to delimit sentences. This is not great, since the English sentences use fewer periods than the Japanese text uses the "ideographic full stop". At least in the test files I am using. I have not yet thought of a good way to get around this.
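
A minimal sketch of that naive splitting rule in Python, assuming the only delimiters are the ASCII period and '。'. The function name is hypothetical, and the rule will wrongly split things like "3.5" or "Mr. Smith":

```python
import re


def split_sentences(text):
    """Split text after each '.' or '。', keeping the delimiter on the sentence."""
    # Each match is a run of non-delimiter characters plus an optional delimiter.
    pieces = re.findall(r"[^.。]+[.。]?", text)
    return [p.strip() for p in pieces if p.strip()]


print(split_sentences("朝食が始まった。いざこざだ。"))
print(split_sentences("It began. Again."))
```

Because the two languages do not break sentences in the same places, the two resulting lists will often differ in length, which is the mismatch described above.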

I guess I could break up the file into known segments (i.e. chapters), run it through the sentence aligner and use the results to fiddle with what is left over. I am rambling now. But exciting times.

Back to study now (this time with 50% of the cards with English translations)


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-28

Boy.pockets Wrote:have not found a mistake yet
I have now - a couple. I will keep on trying to get Mecab working in Python. I will also try working with a larger data set - to see if the accuracy is increased.


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-28

The commands given by iSoron should work. You can then just redirect the output to a file (using ">"). If you're using mecab with Anki's Japanese plug-in, then the mecab packages are already installed and you should be able to run mecab in a command line. Anki's Japanese plug-in uses the mecab-ipadic version which is in euc-jp format (encoding), so if your input text files are not in euc-jp, you might have to convert them first (which is easy to do).
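
If iconv is not handy, the conversion can also be sketched in Python. This assumes the source file is UTF-8. The function and file names are placeholders, and any character that EUC-JP cannot represent will raise UnicodeEncodeError:

```python
from pathlib import Path


def convert_to_euc_jp(src, dst, src_encoding="utf-8"):
    """Read src in src_encoding and write the same text to dst as EUC-JP."""
    text = Path(src).read_text(encoding=src_encoding)
    # "euc_jp" is the name of Python's built-in EUC-JP codec.
    Path(dst).write_bytes(text.encode("euc_jp"))


# Example call (file names are placeholders):
# convert_to_euc_jp("input.txt", "input.eucjp.txt")
```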


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-28

A caveat about the awk command given above, though. It works, but you will lose paragraph formatting. So you might want to use the EOS mark after all if you want to keep paragraph formatting.


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-28

Seamoby Wrote:The commands given by iSoron should work. You can then just redirect the output to a file (using ">"). If you're using mecab with Anki's Japanese plug-in, then the mecab packages are already installed and you should be able to run mecab in a command line. Anki's Japanese plug-in uses the mecab-ipadic version which is in euc-jp format (encoding), so if your input text files are not in euc-jp, you might have to convert them first (which is easy to do).
Thanks for trying to help me. I have had no luck yet. This is what I am putting in the command line:
Code:
mecab input.txt | awk '!/^EOS/ && $2 !~ /^記号/ {printf "%s ", $1}' > output.txt

mecab input.txt | awk '!/^EOS/ {printf "%s ", $1}' > output.txt
contents of output.txt:
tagger.cpp(151)
I have tried it in the anki directory that has the mecab Executable file, and in another directory. I have also tried to change the encoding to euc-jp.


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-28

So one thing at a time. Are you able to use Anki's Japanese plug-in (i.e. generate readings automatically in cards you create)? This is a good test for checking if the mecab packages are installed correctly.


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-28

Seamoby Wrote:So one thing at a time. Are you able to use Anki's Japanese plug-in (i.e. generate readings automatically in cards you create)? This is a good test for checking if the mecab packages are installed correctly.
Yes - generating automatic readings works.

In case it matters: I am running OSX and the command line is Terminal (Bash)


Is there a program to put spaces in Japanese Sentences? - iSoron - 2011-06-28

What are the contents of input.txt?
What happens when you run 'mecab input.txt'?

Seamoby Wrote:A caveat about the awk command given above, though. It works, but you will lose paragraph formatting. So you might want to use the EOS mark after all if you want to keep paragraph formatting.
This will insert a line break at the end of each sentence:

mecab input.txt | awk '!/^EOS/ && $2 !~ /^記号/ { printf "%s ", $1 } /^EOS/ { printf "\n"}'
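
For anyone who would rather stay in Python than awk, here is a rough equivalent of the command above. It assumes the same default MeCab output (surface form, tab, feature string, EOS after each sentence) and that punctuation is tagged 記号 as in the awk filter. The function name and sample data are made up for illustration:

```python
def space_sentences(mecab_output):
    """Space-separate tokens, drop 記号 tokens, and emit one line per EOS."""
    lines_out = []
    current = []
    for line in mecab_output.splitlines():
        if line == "EOS":
            # End of sentence: flush the tokens collected so far as one line.
            lines_out.append(" ".join(current))
            current = []
            continue
        surface, _, features = line.partition("\t")
        if not surface or features.startswith("記号"):
            continue  # skip empty lines and symbol/punctuation tokens
        current.append(surface)
    if current:  # tokens after the last EOS, if any
        lines_out.append(" ".join(current))
    return "\n".join(lines_out)


# Hand-written sample, not real MeCab output.
sample = (
    "部屋\t名詞,一般\nが\t助詞,格助詞\n近い\t形容詞,自立\n。\t記号,句点\nEOS\n"
    "猫\t名詞,一般\nだ\t助動詞\n。\t記号,句点\nEOS\n"
)
print(space_sentences(sample))
```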


Is there a program to put spaces in Japanese Sentences? - nest0r - 2011-06-28

@boy.pockets

Perhaps not something you're interested in, but my solution for this in the past, while dissing the Listen-Read method and promoting the possibilities for @balloonguy's tool, which I call Kage Shibari (keep in mind subs2srs supports such files now so you can turn the interactive audiobooks into Anki cards), is to have the sentence translations in mouseover popups, because I think marginal/parallel/aligned stuff is inferior to hypertext style integrations where you maintain uninterrupted reading as much as possible.

I feel there's a way to do this as an extra layer in addition to the individual words, where alignment precision can be coarser and more flexible... something where the choice of mouseover location or an extra navigational function can increase the span or whathaveyou to compensate for the slight length differences.

Edit 3: For example, you could put a popup trigger (perhaps just an alt thingy?) at every significant punctuation element in the Japanese sentence, then have the contents be the translation which would contain a minimum of 2-3 sentences/sections? And you could have those overlap, perhaps, with the next popup element containing the last part of the one before it, etc.

I also think Google Translate's on to something there in the way you can mouseover and it has the items linked, but that's still parallel, and with machine translations they generate.

At any rate, if you want to be roundabout regarding space insertions, see my last post in the JNovel Formatter thread and MorphMan threads.

Edit: Although parallel concordancing seems interesting, I'm still reading up on that.

Edit 2: I didn't put the regexp process in that JNF post, but the idea is, you take the list MorphMan gives you and replace the line breaks with |, add a parenthesis to the beginning of the first word and the end of the last word, and use the resulting (Word|Word|Word) as your Find and Replace with spaces around the backslash1 (can't get that to show up properly but you know what I mean). Or maybe just an initial space would be better.
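
A small Python sketch of that Find and Replace, using Python's re module rather than UltraEdit. The word list stands in for whatever MorphMan exports, and the function name is hypothetical. It assumes a non-empty word list:

```python
import re


def space_known_words(text, words):
    """Wrap every occurrence of a listed word in spaces, then tidy the spacing."""
    # Build (Word|Word|Word); re.escape guards against regex metacharacters.
    pattern = "(" + "|".join(re.escape(w) for w in words) + ")"
    spaced = re.sub(pattern, r" \1 ", text)
    # Collapse the doubled spaces and trim the ends.
    return " ".join(spaced.split())


print(space_known_words("猫になる", ["猫", "に", "なる"]))
```

Text that is not in the list is left as it was, and the order of the alternatives matters whenever one listed word is the beginning of another.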


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-28

iSoron Wrote:What are the contents of input.txt?
contents (note, this is just a test file)

プリベット通り4番地、朝食の席で今朝もまたいざこざが始まった。

iSoron Wrote:What happens when you run 'mecab input.txt'?
result:

tagger.cpp(151) [load_dictionary_resource(param)] param.cpp(71) [ifs] no such file or directory: /usr/local/lib/mecab/dic/ipadic/dicrc

Looks like there is something missing in my mecab setup :)

Looking at the code in the Japanese plugin, Environment variables are added, during setup of mecab, that point to the mecab install directory and some other things. I can try that when I get home.


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-28

You can try the command "mecab -P". This will dump the mecab parameters, including the directory for the dictionary and resource file.


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-29

@nest0r

I am interested. But there were many things in your posts that I did not understand. Unfortunately, many of the Dropbox links in your linked posts have become Necker Cubes. But I think I finally found what you are talking about with the brothers grimm.

The original purpose of this thread was to find something that would put spaces in the "right" places in a Japanese sentence. The purpose of this was to enable a heuristic to 'align' sentences. After I have aligned the sentences, I create Anki cards with them (also providing additional contextual information like the surrounding sentences and chapter name). Once in Anki, I gloss the sentences and put them through MorphMan. So in the end I have a Japanese sentence (with a bit of context), with a reading that might be right (good chance that it is - but if it is not, there is a low chance that I will be able to realize it) and either:
* no English translation,
* an English translation that is the "correct" one, or
* an English translation that is obviously wrong (I think)

I feel I am not up to uninterrupted reading yet. My Japanese is pretty low and the things that I want to read are way too difficult. I even have trouble with "konichiwa inu, konichiwa ari" - which is actually an interesting book, but does not hold me for very long. But with the text I have at the moment, I can spaced-repetition and MorphMan my way to content that A) I want to read and B) I can read. The MorphMan part is crucial for this.

I think that the brothers grimm resource is a good step after the Anki stage: once you have SRS/MorphMan'ed the book well enough, but before you can just read freely.

Nest0r Wrote:At any rate, if you want to be roundabout regarding space insertions, see my last post in the JNovel Formatter thread and MorphMan threads.
I will have a look at that as soon as I can. Thank you.

Ahhh, so much I don't know. I need to look into linking up the text and the audio book. Whoever can bring all this stuff together with a nice ribbon on top will be a very.... happy person.


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-29

Seamoby Wrote:You can try the command "mecab -P". This will dump the mecab parameters, including the directory for the dictionary and resource file.
That gives the same output:

tagger.cpp(151) [load_dictionary_resource(param)] param.cpp(71) [ifs] no such file or directory: /usr/local/lib/mecab/dic/ipadic/dicrc


Is there a program to put spaces in Japanese Sentences? - Seamoby - 2011-06-29

Yeah, figures. It looks like you have to find where the dictionary directory is and then tell mecab. Odd that it works in Anki but not on the command line.
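
One way to hunt for the dictionary is to look for the dicrc file named in the error message and then pass its directory to mecab. The sketch below assumes mecab's -d option sets the dictionary directory. The function names are made up, and the root to search (for example the folder where Anki keeps its plug-in files) is something you would have to supply yourself:

```python
import os


def find_dicdirs(root):
    """Return every directory under root that contains a file named 'dicrc'."""
    return sorted(
        dirpath
        for dirpath, _dirnames, filenames in os.walk(root)
        if "dicrc" in filenames
    )


def mecab_command(dicdir, input_path):
    """Build a mecab command line that names the dictionary directory with -d."""
    return ["mecab", "-d", dicdir, input_path]


# Example (paths are placeholders; needs mecab on the PATH):
# import subprocess
# for dicdir in find_dicdirs("/path/to/search"):
#     subprocess.run(mecab_command(dicdir, "input.txt"), check=True)
```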


Is there a program to put spaces in Japanese Sentences? - nest0r - 2011-06-29

Kage Shibari is here, in the first link within the first link within the first link. ;p http://dl.dropbox.com/u/263833/version1.html (OP); Edit 2: instructions here: http://forum.koohii.com/showthread.php?pid=127148#pid127148 ; examples to use as source here: http://forum.koohii.com/showthread.php?pid=129120#pid129120

The DingLabs Reader I originally didn't like because it was online only, but there's a standalone version, except I ended up not using it because the Google Analytics code bugged me, though it was likely residual (I'm picky that way) and balloonguy's tool did everything I wanted, plus it's entirely offline and minimal (but also you can add it to any dropbox for sharing).

And I see. You basically want ParallelText2SRS. I wonder if WordNet could be useful here. Like have it align cards by the number of synsets their words share? It could use Japanese as the source, look at the words' synsets, and find the English sentences that match up, since WWWJDIC is linked to Japanese WordNet, which is in turn linked to Princeton's English WordNet. (I don't think you'd need WWWJDIC though, you can probably do it directly through JW's site or their offline resources which I'm thinking of turning into Anki decks with Japanese/English definitions/examples as cards, the Japanese part sorted by iPlusN. Plus I think referencing synsets could be useful with other tools such as Rikaisan or creating/disambiguating card relations, but that's another topic.)
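
To make the synset-counting idea concrete, here is a toy Python sketch. Everything in it is hypothetical: the synset IDs are invented placeholders rather than real WordNet identifiers, the lookup tables are hand-written dictionaries instead of a WordNet API, and both sentences are assumed to be split into words already:

```python
def shared_synsets(ja_words, en_words, ja_synsets, en_synsets):
    """Count the synset IDs that the two word lists have in common."""
    ja_ids = set().union(*(ja_synsets.get(w, set()) for w in ja_words))
    en_ids = set().union(*(en_synsets.get(w, set()) for w in en_words))
    return len(ja_ids & en_ids)


def best_match(ja_words, en_candidates, ja_synsets, en_synsets):
    """Return the index of the English candidate sharing the most synsets."""
    scores = [
        shared_synsets(ja_words, cand, ja_synsets, en_synsets)
        for cand in en_candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented lookup tables, for illustration only.
ja_synsets = {"犬": {"syn-dog"}, "走る": {"syn-run"}}
en_synsets = {
    "dog": {"syn-dog"}, "runs": {"syn-run"},
    "cat": {"syn-cat"}, "sleeps": {"syn-sleep"},
}
candidates = [["the", "cat", "sleeps"], ["the", "dog", "runs"]]
print(best_match(["犬", "が", "走る"], candidates, ja_synsets, en_synsets))
```

With this toy data the second candidate wins because it shares two synsets with the Japanese words. Real data would need tie-breaking and some handling of words with many senses.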

The JNF link is just a way to get the spaces in the words by extracting the morphemes using MorphMan's GUI database manager for texts and then reapplying the extracted morphemes to the original text using regexp, but it looks like you moved past that already.

Here's the kind of thing I meant before, that @ahibba linked me to; what I had been thinking before I saw it (I can't remember if I originally got the idea from there or Google Translate) was something in slightly shorter segments, but this length and with multiple popups per line or something where the translations are overlapping could work: http://www.interactiveselfstudy.com/images/lepetitprincetranseng.gif

Edit: As for uninterrupted, I meant that you bring the information to you, so to speak (for more see link/quote here: http://forum.koohii.com/showthread.php?pid=141095#pid141095).


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-29

Seamoby Wrote:Yeah, figures. It looks like you have to find where the dictionary directory is and then tell mecab. Odd that it works in Anki but not on the command line.
I think the easiest thing to do from here is just work from the MorphMan code. Thanks for all the help!


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-29

@nest0r:

thanks for all the links. I think I have got it now. Wow, things are really coming together. If only I knew what I know today... yesterday...

nest0r Wrote:The JNF link is just a way to get the spaces in the words by extracting the morphemes using MorphMan's GUI database manager for texts and then reapplying the extracted morphemes to the original text using regexp, but it looks like you moved past that already.
I understood the first time - maybe my response was confusing. And I am not past that yet. I did start to follow your advice, but after initial enthusiasm and getting to the point of loading the words into the MorphMan GUI, I did not continue because I became skeptical. In your example, you have the words spaced, with one "word" being "なる" and another being "な", but unless "な" is just a residual word (i.e. between two recognised words, and not in the .db file), why would it not chop "なる" into two words (incorrectly)?

In any case, I think I will try and use overture's code as a base and just add the code necessary to space out the lines. I did not know about the "extract morphemes from file" before, so another thank you is due.

Incidentally, for anyone looking to learn Python, I recommend this set of Google tutorials. It does assume you know a little about programming, but it is pretty good. Two fun facts:
* The dude that created Python works for Google
* Python was not named after the snake (even though the current logo suggests otherwise).


Is there a program to put spaces in Japanese Sentences? - nest0r - 2011-06-29

I'm not sure what you mean. As you can see, it didn't split the なる.

Edit: Here's another example: http://pastebin.com/HQL2zuHi (From the first half of this which I randomly picked from the audiobooks thread; I noticed MorphMan doesn't like longer texts: http://www.hyuki.com/trans/magi.html)

This is the Find I used (UltraEdit's Perl regexp rather than Unix or its built-in engine): http://pastebin.com/rDGvAJYq


Is there a program to put spaces in Japanese Sentences? - Boy.pockets - 2011-06-29

nest0r Wrote:I'm not sure what you mean. As you can see, it didn't split the なる.
Good point - but why? I think I am missing something. If it is looking for なる and な, how does it know which match to use in which case? For example, we have in our list of words:
* A
* AB
* BC
* ABC

If the input is "AABC" how does it know what is the right word to choose? A is the first choice, but then what? "ABC" or "AB"? Maybe this is not a problem when the text is small and the regular expression always chooses the longest word...? Anyway, I should just try it myself and stop asking why.
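
A quick experiment along those lines, in Python. With Python's re module (and, as far as I know, Perl-style engines in general) an alternation does not pick the longest word. At each position it takes the first alternative in the list that matches, so the order of the list decides the outcome. The helper name is made up:

```python
import re


def tokenize(text, words):
    """Scan left to right, taking the first listed word that matches each time."""
    pattern = "|".join(re.escape(w) for w in words)
    return re.findall(pattern, text)


words = ["A", "AB", "BC", "ABC"]

# List order as given: "A" is tried first and wins wherever it fits.
print(tokenize("AABC", words))  # ['A', 'A', 'BC']

# Longest words first: "ABC" now gets its chance before "AB" and "A".
print(tokenize("AABC", sorted(words, key=len, reverse=True)))  # ['A', 'ABC']
```

So sorting the word list longest-first gives a greedy longest-match segmentation, which is one plausible reason なる was not chopped up. Whether nest0r's list was ordered that way is an assumption on my part.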