Mecab, Learning With Texts

kiokuwarui Member
From: Germany Registered: 2012-01-23 Posts: 17

Hello,

I've been trying to use LWT because I find it very helpful. Many of its functions remind me of LingQ, which, unfortunately, is commercial.

The one main weakness of LWT is that its Japanese support is more or less nonexistent. Unless you can actually insert spaces between the words, there's no way to use it properly as a learning tool for this language. The only way I've found to do this (apart from inserting them manually, which, for long texts, is タイヘン) is using MeCab.

Well, I've checked that out, but I don't have the slightest idea how to use it. I tried various command-line parameters, but nothing happened (no error message either). I set an input and an output file (UTF-8) for it, to no avail.

However, it would be optimal to have a little app where you paste text, press a button and have it "spaced". Kind of like a little GUI, usable for dummies (or for people who would rather spend their time learning the language instead of learning programs).

Do you know of such a tool, or is it possible to make MeCab somehow usable without twisting my brain?

Thank you very much

P.S.: I tried importing text into LingQ, then copying and pasting it into LWT, which led to cluttered text (-.-

travis Member
Registered: 2008-08-11 Posts: 178

See here

mecab -O wakati input.txt -o output.txt
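If you'd rather drive it from a script later on, the same call can be wrapped in Python. A hedged sketch (it assumes the `mecab` binary is on your PATH and that the dictionary encoding matches your input file; the actual invocation is left commented out):

```python
# Sketch: the wakati command above, wrapped for use from Python.
import subprocess

def wakati(in_path: str, out_path: str) -> list:
    """Build (and optionally run) the mecab wakati command."""
    cmd = ["mecab", "-O", "wakati", in_path, "-o", out_path]
    # subprocess.run(cmd, check=True)  # uncomment to actually invoke mecab
    return cmd
```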

kiokuwarui Member
From: Germany Registered: 2012-01-23 Posts: 17

The output.txt doesn't contain anything. Could it have to do with the Win7 security settings? I always have to confirm admin rights if I want to change anything inside that directory..

Okay, now it works, but the output file always seems to be ANSI, not UTF-8, despite setting it every time (-.-

とても面白いことになってます。バズたんと、協会が仲間割れしまして、あいつが悪い嫌おまえがとのことになってます。
でもってこの協会も、あのKEKの野尻先生とか数名を訴えるらしい。日本でも名だたる先生ばかりですが。どうみても分が悪いのは協会。明細を出さないのはおかしいでしょうよ。
そろそろ放射能詐欺が色々ばれてくる段階になってきましたね。

results in this:

ï» ¿ ã  ¨ ã  ¦ ã‚ ‚é  ¢ç™ ½ ã  „ã  “ ã  ¨ ã  « ã  ª ã  £ ã  ¦ ã  ¾ 㠙㠀‚ ム㠂º ã Ÿã ‚“ ã  ¨ 〠 å  ” ä¼ šã  Œ ä» ² é– “å ‰² ã‚Œã  — ã  ¾ ã —ã  ¦ 〠 ã ‚ ã  „ã  ¤ ã Œæ ‚ª ã  „å « Œã  Š ã  ¾ ã ˆã  Œ ã  ¨ ã  ® ã “ã  ¨ ã  « ã  ª ã  £ ã  ¦ ã  ¾ 㠙㠀‚
ã  § ã‚ ‚ã  £ ã  ¦ ã “ã  ® å ”ä ¼ šã ‚‚ 〠 ã ‚ ã  ® KEK ã  ® é‡Žå °» å…ˆç ”Ÿ ã  ¨ ã ‹æ •°å   ã ‚’ è¨ ´ 㠈㠂‹ ã‚‰ã  — ã  „。 æ— ¥ æœ ¬ ã  § ã‚ ‚å    ã   ã  Ÿ ã‚‹å …ˆ ç”Ÿã  ° ã ‹ã ‚Š ã  § ã ™ã  Œ 〠‚ã  © ã  †ã  ¿ ã  ¦ ã‚ ‚å ˆ† ã Œæ ‚ª ã  „ã ® ã  ¯ å ”ä ¼ šã €‚ æ˜ Žç ´° ã‚’å ‡º ã •ã  ª ã  „ã ® ã  ¯ ã Šã  ‹ ã —ã  „ ã  § ã —ã ‚‡ ã  †ã ‚ˆ 〠‚
ã  ã ‚  ã  ã ‚  æ” ¾ å° „能 è© æ ¬º ã Œè ‰² 〠…ã ° ã‚Œã  ¦ ã  ã ‚‹ æ® µ éšŽã  « ã  ª ã  £ ã  ¦ ã  ã  ¾ ã —ã  Ÿ ã  ­ 〠‚
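(For reference: that run of "ã" characters is the classic sign of UTF-8 bytes being displayed through a one-byte-per-character Windows "ANSI" view, rather than a MeCab failure as such. A minimal demonstration, using Latin-1 as a stand-in for the ANSI code page:)

```python
# UTF-8 bytes viewed as a single-byte "ANSI" code page produce "ã ..." garbage.
utf8_bytes = "とても".encode("utf-8")   # e3 81 a8 e3 81 a6 e3 82 82
mojibake = utf8_bytes.decode("latin-1")  # each byte becomes one character
print(mojibake)  # begins with "ã", just like the output above
```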

Last edited by kiokuwarui (2012 January 23, 7:53 am)

 
kiokuwarui Member
From: Germany Registered: 2012-01-23 Posts: 17

I've now moved the complete MeCab folder to the desktop to avoid the Win7 security issues. The output file is still always in ANSI, even after I open it with Notepad++ and convert it to UTF-8, then save.

There must be something I'm missing here, please help wink

Last edited by kiokuwarui (2012 January 23, 9:30 am)

overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

It's likely that mecab is outputting EUC-JP, not UTF-8.

Last edited by overture2112 (2012 January 23, 9:41 am)

kiokuwarui Member
From: Germany Registered: 2012-01-23 Posts: 17

Do you have an idea on how to fix it? I've tried with -t UTF-8, but it doesn't help..

travis Member
Registered: 2008-08-11 Posts: 178

kiokuwarui wrote:

Do you have an idea on how to fix it? I've tried with -t UTF-8, but it doesn't help..

When you installed MeCab, did you pick the UTF-8 option for the dictionary? I just tried the Windows version and it was fine. I usually use it on Linux, so I'm not really sure.

In Notepad++ did you pick the right encoding before converting to UTF-8, so:
Encoding->Character Sets->Japanese->Shift JIS?

kiokuwarui Member
From: Germany Registered: 2012-01-23 Posts: 17

Setting it to Shift JIS makes the text appear like this:

・ ソ 縺 ィ 縺 ヲ 繧 る 擇逋 ス 縺 ・ % 縺 ィ 縺 ォ 縺 ェ 縺 」 縺 ヲ 縺 セ 縺吶 €・繝舌 ぜ 縺溘 s 縺 ィ 縲 ∝ 鵠 莨 壹 ′ 莉 イ 髢 灘 牡 繧後 @ 縺 セ 縺励 ※ 縲 √≠ 縺 ・ ▽ 縺梧 が 縺 ・ ォ 後 ♀ 縺 セ 縺医 ′ 縺 ィ 縺 ョ 縺薙 → 縺 ォ 縺 ェ 縺 」 縺 ヲ 縺 セ 縺吶 €・
縺 ァ 繧 ゅ ▲ 縺 ヲ 縺薙 ・ 蜊比 シ 壹 b 縲 √≠ 縺 ョ KEK 縺 ョ 驥主 ーサ 蜈育 函 縺 ィ 縺区 焚蜷 阪 r 險 エ 縺医 k 繧峨 @ 縺 ・€・譌 ・ 譛 ャ 縺 ァ 繧 ょ 錐 縺 ・◆ 繧句 ・ 逕溘 ・ 縺九 j 縺 ァ 縺吶 ′ 縲 ゅ ← 縺 ・ ∩ 縺 ヲ 繧 ょ ・ 縺梧 が 縺 ・・ 縺 ッ 蜊比 シ 壹 €・譏 守 エー 繧貞 ・ 縺輔 ↑ 縺 ・・ 縺 ッ 縺翫 ° 縺励 > 縺 ァ 縺励 g 縺 ・ h 縲 ・
縺昴 m 縺昴 m 謾 セ 蟆 ・・ 隧先 ャコ 縺瑚 牡 縲 ・・ 繧後 ※ 縺上 k 谿 オ 髫弱 ↓ 縺 ェ 縺 」 縺 ヲ 縺阪 ∪ 縺励 ◆ 縺 ュ 縲 ・

How would I set mecab to UTF-8?

Last edited by kiokuwarui (2012 January 25, 9:59 am)

khalhern Member
From: UK Registered: 2011-04-11 Posts: 33

I've been looking into this as well... and I have no idea how to use MeCab :s I'm not trying to hijack the thread or anything, but are there any decent manuals for MeCab in English, for relative computer noobs? I'm struggling to find anything, either on this forum or elsewhere sad

Edit: Using Windows 7, forgot to mention!

Last edited by khalhern (2012 January 30, 6:35 pm)

Reply #10 - 2012 January 30, 9:17 pm
fifo_thekid Member
From: Fukui Japan Registered: 2011-07-20 Posts: 94

During the installation it will ask you which type of encoding you want to use with MeCab.

Reply #11 - 2012 June 15, 9:18 am
Fillanzea Member
From: New York, NY Registered: 2009-10-02 Posts: 534 Website

I failed at setting up mecab; I really only know enough about the command line to get myself into trouble, so I don't know if I'm going to take another crack at it.

So far the most practical way for me to use LWT has been to use the option that treats every character as its own word, and then define words of more than one character as multi-word expressions. This isn't ideal, because a word composed of two or more kanji that I know well will appear as "known," but it's not too difficult to read through, capture them, and add in the definitions. Easier than manually inserting white space, anyway.
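For anyone curious, the per-character fallback described above amounts to little more than this (a sketch, not LWT's actual code):

```python
# Crude fallback: make every character its own "word" by spacing them out.
# This approximates LWT's "every character is a word" option.
text = "日本でも名だたる先生ばかりです"
spaced = " ".join(text)
print(spaced)  # 日 本 で も 名 だ た る 先 生 ば か り で す
```

Compound words then have to be rebuilt by hand as multi-word expressions, which is exactly the tedium Fillanzea describes.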

Reply #12 - 2012 June 15, 9:51 am
chamcham Member
Registered: 2005-11-11 Posts: 1444

mecab sounds too much like macabre.... :-(

Reply #13 - 2012 June 24, 9:12 pm
Cacawate Member
From: California Registered: 2006-12-07 Posts: 32 Website

It seems I decide to drop in once a year. I will attempt to make a little tutorial on setting MeCab up for Windows 7 64-bit, and if need be, I can virtualize 32-bit Win7, as I believe its install was a bit different.

Since my last post, I've moved on to using Linux and it has made working with MeCab (and Python) a comfortable stroll in the park. I recommend it, but I understand the obstacles. 

One last note: using MeCab alongside Python is a wonderful way to create little scripts that do exactly what you want, but getting them to work together on Windows can be a pain for new users. If that's intriguing to you, I can make a rundown of how to do that as well.

One laster note: LWT is amazing and MeCab has a wrapper for PHP, but the maker of the software is pretty adamant about adding features that are relevant to his interests and that is perfectly fine. It is open source. wink

Last edited by Cacawate (2012 June 24, 9:12 pm)

Reply #14 - 2012 June 25, 2:30 am
Seamoby Member
From: USA Registered: 2011-01-11 Posts: 175

The default encoding of the ipadic dictionary that mecab uses is euc-jp.  If you're using this euc-jp version of ipadic, convert your input file's encoding to euc-jp before feeding it into mecab.  The output will also be in euc-jp, which you would want to encode into utf-8.

Alternatively, there is an ipadic dictionary that is compiled to use utf-8 encoding which you can install.  But if you're using Anki, the utf-8 version of ipadic will most likely cause incompatibilities with the Anki Japanese plug-in.

Another thing to check is that your Windows environment is set to display UTF-8. Your output file may be in UTF-8 encoding, but if Windows isn't set to display UTF-8, you'll see garbage on the screen when you display the file's contents.
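The conversion steps above can be scripted. A hedged sketch of the round-trip for the euc-jp ipadic build (the mecab call itself is not shown, and the stand-in line below is only there to keep the sketch self-contained):

```python
# Seamoby's workflow for the euc-jp ipadic build:
# 1. re-encode the UTF-8 input as EUC-JP before feeding it to mecab,
# 2. decode mecab's EUC-JP output back to UTF-8.

utf8_text = "明細を出さないのはおかしいでしょうよ。"

# Step 1: produce the bytes the euc-jp dictionary expects
euc_input = utf8_text.encode("euc_jp")

# (here you would run: mecab -O wakati  on euc_input, getting euc_output)
euc_output = euc_input  # stand-in: mecab's output is also EUC-JP bytes

# Step 2: decode the EUC-JP result; save this string as UTF-8
recovered = euc_output.decode("euc_jp")
assert recovered == utf8_text  # the round-trip is lossless
```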

Last edited by Seamoby (2012 June 25, 2:49 am)

Reply #15 - 2012 August 11, 8:08 am
Cacawate Member
From: California Registered: 2006-12-07 Posts: 32 Website

I finished the installation video and I hope it helps some of the people here in troubleshooting the issues they have.

http://www.youtube.com/watch?v=1wqwWji4 … ture=g-upl

Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 801

Cacawate wrote:

I finished the installation video and I hope it helps some of the people here in troubleshooting the issues they have.

http://www.youtube.com/watch?v=1wqwWji4 … ture=g-upl

It sure did, thanks.

Btw., the syntax for spacing text, just so we don't have to watch the video every time we forget it, is

mecab.lnk -O wakati input.txt -o output.txt

Daichi Member
From: Washington Registered: 2009-02-04 Posts: 450

kiokuwarui wrote:

The one main weakness of LWT is that its Japanese support is more or less nonexistent. Unless you can actually insert spaces between the words, there's no way to use it properly as a learning tool for this language. The only way I've found to do this (apart from inserting them manually, which, for long texts, is タイヘン) is using MeCab.

Adding spaces to Japanese text is a pretty awful solution. Even if it's automated.

It's sad; I don't think LWT is going to be very useful unless someone actually takes the source and forks it for real Japanese support. The Anki morphology plugin just does so much more than LWT, just without a custom dictionary for each word. And Anki just isn't really made for reading something chronologically.

If you built in proper Japanese support, you could treat different verb stems as the same word. That alone would make LWT more useful. Anyway, I'm going to quit ranting about this.

Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 801

Daichi wrote:

Adding spaces to Japanese text is a pretty awful solution. Even if it's automated.

Why?

Daichi Member
From: Washington Registered: 2009-02-04 Posts: 450

For one, you can't have hybrid English/Asian text. Well, you can, but it's going to look ugly and space-less: "―Theworldisnotbeautiful.Therefore,itis.―" It completely fails at dealing with furigana or ruby text in a nice manner. It doesn't have any idea how to deal with different verb conjugations, so it treats them all as different words.

I'll be honest, I didn't get very far in using it because it was too tedious for me. I'm not saying it doesn't work, I'm just saying it doesn't work well enough. A lot of the programming is too European-language focused; it could really use some programming specific to Asian languages to make it far, far more useful.

Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 801

Daichi wrote:

For one, you can't have hybrid English/Asian text. Well, you can, but it's going to look ugly and space-less: "―Theworldisnotbeautiful.Therefore,itis.―" It completely fails at dealing with furigana or ruby text in a nice manner.

Good. The point is to practice reading actual Japanese, not English or kana. That would defeat the purpose.

Daichi wrote:

It doesn't have any idea how to deal with different verb conjugations, so it treats them all as different words.

Does it deal with verb conjugations in European languages?

Last edited by Stansfield123 (2013 February 08, 7:27 am)

Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 801

Btw., for those who wouldn't use MeCab for anything except to separate words (I keep saying "words" even though I know the right term is morphemes, because I'm guessing not everyone knows what that means), there's no need to install it anymore.

There's a webpage that runs Mecab server-side to separate morphemes for you:

http://nihongo.dpwright.com/spaces/index.php

Haven't tried it for large texts, hope it doesn't have some silly, tiny limit.

Daichi Member
From: Washington Registered: 2009-02-04 Posts: 450

Stansfield123 wrote:

Daichi wrote:

For one, you can't have hybrid English/Asian text. Well, you can, but it's going to look ugly and space-less: "―Theworldisnotbeautiful.Therefore,itis.―" It completely fails at dealing with furigana or ruby text in a nice manner.

Good. The point is to practice reading actual Japanese, not English or kana. That would defeat the purpose.

That isn't the point; it should display all languages properly regardless of the one you're trying to learn. And part of that is proper display of furigana. Adding spaces to Japanese is just a hack to make LWT work. It is certainly something that could be improved upon.

Stansfield123 wrote:

Daichi wrote:

It doesn't have any idea how to deal with different verb conjugations, so it treats them all as different words.

Does it deal with verb conjugations in European languages?

Probably not, but the Anki morphology plugin can pick up the dictionary versions of these verbs just fine, so I don't see why it couldn't be done.

Also, when you feed something into MeCab (like with the site in the above post) for spacing, you get lots of weird language fragments that you don't have to deal with in European languages.

思ってる becomes 思って る

「る」 is not exactly a vocab word; sure, I can ignore it, but it would be far easier for the user if the LWT parser just saw it as 思う.
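For what it's worth, MeCab's default output does include the dictionary form, so a parser could recover 思う from the fragments. A sketch, using a hand-written sample line in ipadic's feature layout (the line is illustrative, not captured from a real run):

```python
# ipadic features are comma-separated: POS, POS subtypes, conjugation type,
# conjugation form, dictionary form, reading, pronunciation.
# Illustrative sample approximating MeCab's default output for 思っ:
line = "思っ\t動詞,自立,*,*,五段・ワ行促音便,連用タ接続,思う,オモッ,オモッ"

surface, features = line.split("\t")
base_form = features.split(",")[6]  # field 6 is the dictionary form
print(surface, "->", base_form)  # 思っ -> 思う
```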

Again, I'm not saying LWT doesn't work; the point I'm trying to make is that it could be far better and far more useful for learning to read with. I'm just trying to point out some of the faults. It's too cumbersome for me. If it works for you, great. Go about your way and ignore me.

Last edited by Daichi (2013 February 09, 6:01 am)
