Joined: Jan 2012
Posts: 17
Thanks:
0
Hello,
I've been trying to use LWT, because I find it very helpful. Many of its features remind me of LingQ, which, unfortunately, is commercial.
The one main weakness of LWT is that its Japanese support is more or less nonexistent. Unless you can actually insert spaces between the words, there's no chance to properly use it as a learning tool for this specific language. The only way I've found to do this (apart from inserting them manually, which, for long texts, is タイヘン, i.e. a real pain) is using MeCab.
Well, I've looked into that, but I don't have the slightest idea how to use it. I tried it with command line parameters, but nothing happened (no error message either). I set an input and an output file (UTF-8) for it, to no avail.
However, it would be ideal to have a little app where you paste text, press a button and have it "spaced". Kind of like a little GUI, usable for dummies (or people who would rather spend their time learning the language than learning programs for learning).
Do you know of such a tool, or is it possible to make MeCab somehow usable without twisting my brain?
Thank you very much
P.S.: I tried to import something into LingQ, then copy and paste it into LWT, which led to cluttered text (-.-
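For what it's worth, the "paste text, press a button, get it spaced" idea could be sketched in Python roughly like this. It is only a sketch under assumptions: it assumes a `mecab` executable is on the PATH and that its dictionary is UTF-8. `-O wakati` is MeCab's space-separated output mode; the names `space_text`, `mecab_wakati` and `main` are made up for illustration.

```python
import subprocess


def space_text(text, segment):
    """Apply `segment` to each non-empty line and keep the line breaks."""
    out = []
    for line in text.splitlines():
        out.append(segment(line).strip() if line.strip() else "")
    return "\n".join(out)


def mecab_wakati(line, mecab_cmd="mecab", encoding="utf-8"):
    """Send one line to MeCab and return it with spaces between words.

    Assumption: `mecab_cmd` is on the PATH and its dictionary uses `encoding`.
    """
    result = subprocess.run(
        [mecab_cmd, "-O", "wakati"],
        input=(line + "\n").encode(encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(encoding)


def main():
    # Imported here so the helpers above also work on machines without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Space Japanese text")
    box = tk.Text(root, width=80, height=20)
    box.pack()

    def on_click():
        # Replace the pasted text with its spaced version.
        spaced = space_text(box.get("1.0", "end-1c"), mecab_wakati)
        box.delete("1.0", "end")
        box.insert("1.0", spaced)

    tk.Button(root, text="Insert spaces", command=on_click).pack()
    root.mainloop()
```

Calling `main()` opens the window. Because the text goes through stdin/stdout instead of files, there is no output file to write into a protected directory.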
Joined: Jan 2012
Posts: 17
Thanks:
0
The output.txt doesn't contain anything. Could it have to do with Win7 security settings? I always have to confirm admin rights if I want to change anything inside that directory.
Okay, now it works, but the output file always seems to be ANSI, not UTF-8, despite setting it every time (-.-
とても面白いことになってます。バズたんと、協会が仲間割れしまして、あいつが悪い嫌おまえがとのことになってます。
でもってこの協会も、あのKEKの野尻先生とか数名を訴えるらしい。日本でも名だたる先生ばかりですが。どうみても分が悪いのは協会。明細を出さないのはおかしいでしょうよ。
そろそろ放射能詐欺が色々ばれてくる段階になってきましたね。
results in this:
ï» ¿ ã ¨ ã ¦ ã‚ ‚é ¢ç™ ½ ã „ã “ ã ¨ ã « ã ª ã £ ã ¦ ã ¾ 㠙㠀‚ ム㠂º ã Ÿã ‚“ ã ¨ ã€ å ” ä¼ šã Œ ä» ² é– “å ‰² れ㠗 ã ¾ ã —ã ¦ 〠㠂 ã „ã ¤ ã Œæ ‚ª ã „å « Œã Š ã ¾ 㠈㠌 ã ¨ ã ® ã “ã ¨ ã « ã ª ã £ ã ¦ ã ¾ 㠙㠀‚
ã § ã‚ ‚ã £ ã ¦ ã “ã ® å ”ä ¼ šã ‚‚ 〠㠂 ã ® KEK ã ® é‡Žå °» å…ˆç ”Ÿ ã ¨ ã ‹æ •°å ã ‚’ è¨ ´ 㠈㠂‹ ら㠗 㠄。 æ— ¥ æœ ¬ ã § ã‚ ‚å ã ã Ÿ ã‚‹å …ˆ 生㠰 ã ‹ã ‚Š ã § 㠙㠌 〠‚ã © 㠆㠿 ã ¦ ã‚ ‚å ˆ† ã Œæ ‚ª ã „ã ® ã ¯ å ”ä ¼ šã €‚ æ˜ Žç ´° ã‚’å ‡º ã •ã ª ã „ã ® ã ¯ ã Šã ‹ ã —ã „ ã § ã —ã ‚‡ 㠆㠂ˆ 〠‚
ã ã ‚ ã ã ‚ æ” ¾ å° „能 è© æ ¬º ã Œè ‰² 〠…ã ° れ㠦 ã ã ‚‹ æ® µ 階㠫 ã ª ã £ ã ¦ ã ã ¾ ã —ã Ÿ ã 〠‚
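The leading "ï»¿" in that output is what a UTF-8 byte order mark looks like when the bytes are read as a single-byte Windows encoding, so the file may well be UTF-8 that is merely being displayed as ANSI. A small Python sketch of that effect (latin-1 stands in for Windows "ANSI" here because it maps every byte; the function names are illustrative only):

```python
import codecs


def as_seen_in_ansi(text):
    """Show how UTF-8 bytes look when misread as a single-byte encoding."""
    return text.encode("utf-8").decode("latin-1")


def undo_misread(garbled):
    """Reverse the misreading, provided no bytes were added or removed."""
    return garbled.encode("latin-1").decode("utf-8")


def starts_with_utf8_bom(data):
    """True if the raw bytes begin with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)
```

Note that `undo_misread` cannot repair text where spaces have already been inserted in the middle of a multi-byte character, which is what the output above looks like.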
Edited: 2012-01-23, 8:53 am
Joined: May 2010
Posts: 421
Thanks:
0
It's likely that MeCab is outputting in EUC-JP and not UTF-8.
Edited: 2012-01-23, 10:41 am
Joined: Jan 2012
Posts: 17
Thanks:
0
Do you have any idea how to fix it? I've tried with -t UTF-8, but it doesn't help.
Joined: Jan 2012
Posts: 17
Thanks:
0
Setting it to ShiftJIS makes the text appear like this:
・ ソ 縺 ィ 縺 ヲ 繧 る 擇逋 ス 縺 ・ % 縺 ィ 縺 ォ 縺 ェ 縺 」 縺 ヲ 縺 セ 縺吶 €・繝舌 ぜ 縺溘 s 縺 ィ 縲 ∝ 鵠 莨 壹 ′ 莉 イ 髢 灘 牡 繧後 @ 縺 セ 縺励 ※ 縲 √≠ 縺 ・ ▽ 縺梧 が 縺 ・ ォ 後 ♀ 縺 セ 縺医 ′ 縺 ィ 縺 ョ 縺薙 → 縺 ォ 縺 ェ 縺 」 縺 ヲ 縺 セ 縺吶 €・
縺 ァ 繧 ゅ ▲ 縺 ヲ 縺薙 ・ 蜊比 シ 壹 b 縲 √≠ 縺 ョ KEK 縺 ョ 驥主 ーサ 蜈育 函 縺 ィ 縺区 焚蜷 阪 r 險 エ 縺医 k 繧峨 @ 縺 ・€・譌 ・ 譛 ャ 縺 ァ 繧 ょ 錐 縺 ・◆ 繧句 ・ 逕溘 ・ 縺九 j 縺 ァ 縺吶 ′ 縲 ゅ ← 縺 ・ ∩ 縺 ヲ 繧 ょ ・ 縺梧 が 縺 ・・ 縺 ッ 蜊比 シ 壹 €・譏 守 エー 繧貞 ・ 縺輔 ↑ 縺 ・・ 縺 ッ 縺翫 ° 縺励 > 縺 ァ 縺励 g 縺 ・ h 縲 ・
縺昴 m 縺昴 m 謾 セ 蟆 ・・ 隧先 ャコ 縺瑚 牡 縲 ・・ 繧後 ※ 縺上 k 谿 オ 髫弱 ↓ 縺 ェ 縺 」 縺 ヲ 縺阪 ∪ 縺励 ◆ 縺 ュ 縲 ・
How would I set mecab to UTF-8?
Edited: 2012-01-25, 10:59 am
Joined: Jul 2011
Posts: 93
Thanks:
0
During the installation it will ask you which encoding you want to use with MeCab.
Joined: Oct 2009
Posts: 533
Thanks:
1
I failed at setting up MeCab; I really only know enough about the command line to get myself into trouble, so I don't know if I'm going to take another crack at it.
So far the most practical way for me to use LWT has been to use the option that treats every character as its own word, and then define words of more than one character as multi-word expressions. This isn't ideal because words composed of two or more kanji that I know well will appear as "known," but it's not too difficult to read through and capture them and add in the definitions. Easier than manually inserting white space, anyway.
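To illustrate the idea (this is not LWT's actual code, just a sketch of the behaviour described, with a made-up function name): every character becomes its own token unless it starts an expression you have already defined, in which case the longest matching expression is kept whole.

```python
def tokenize(text, expressions):
    """Greedy longest match: known expressions stay whole,
    everything else becomes one token per character."""
    longest = max((len(e) for e in expressions), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, fall back to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in expressions:
                tokens.append(chunk)
                i += size
                break
    return tokens
```

For example, with `{"日本語"}` defined as an expression, `tokenize("日本語を学ぶ", {"日本語"})` keeps 日本語 together and splits the rest into single characters.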
Joined: Nov 2005
Posts: 1,442
Thanks:
2
mecab sounds too much like macabre.... :-(
Joined: Jan 2011
Posts: 172
Thanks:
0
The default encoding of the ipadic dictionary that MeCab uses is EUC-JP. If you're using this EUC-JP version of ipadic, convert your input file's encoding to EUC-JP before feeding it into MeCab. The output will also be in EUC-JP, which you would then want to convert to UTF-8.
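A rough Python sketch of that round trip, under the assumption that a `mecab` executable is on the PATH and uses an EUC-JP dictionary (the function names and the `mecab_cmd` parameter are made up for illustration; `-O wakati` is MeCab's space-separated output mode):

```python
import subprocess


def utf8_to_eucjp(data):
    """Re-encode UTF-8 bytes as EUC-JP; 'utf-8-sig' drops a leading BOM."""
    return data.decode("utf-8-sig").encode("euc_jp")


def eucjp_to_utf8(data):
    """Re-encode EUC-JP bytes as UTF-8."""
    return data.decode("euc_jp").encode("utf-8")


def run_mecab_on_file(in_path, out_path, mecab_cmd="mecab"):
    """Read a UTF-8 file, pass it to MeCab as EUC-JP, write UTF-8 back.

    Assumption: `mecab_cmd` is on the PATH with an EUC-JP dictionary.
    """
    with open(in_path, "rb") as f:
        euc_in = utf8_to_eucjp(f.read())
    result = subprocess.run(
        [mecab_cmd, "-O", "wakati"],
        input=euc_in,
        stdout=subprocess.PIPE,
        check=True,
    )
    with open(out_path, "wb") as f:
        f.write(eucjp_to_utf8(result.stdout))
```

Characters that have no EUC-JP representation will raise a `UnicodeEncodeError` in `utf8_to_eucjp`, so this only suits text that fits in that encoding.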
Alternatively, there is an ipadic dictionary that is compiled to use utf-8 encoding which you can install. But if you're using Anki, the utf-8 version of ipadic will most likely cause incompatibilities with the Anki Japanese plug-in.
Another thing to check is that your Windows environment is set to display UTF-8. Your output file may be in UTF-8 encoding, but if your windows are not set to display UTF-8, then you'll see garbage on the screen when you display the file's contents.
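One way to check what the file really contains, independent of how it is displayed, is to try decoding its raw bytes. A minimal Python sketch (the function name and candidate list are just illustrative, and this is a heuristic: bytes that happen to be valid in several encodings are reported as the first one that fits):

```python
def guess_encoding(data, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the first candidate encoding that decodes `data` without
    errors, or None if none of them does."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

You would call it on the bytes read with `open(path, "rb").read()`.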
Edited: 2012-06-25, 2:49 am
Joined: Feb 2009
Posts: 453
Thanks:
0
For one, you can't have hybrid English/Asian text. Well, you can, but it's going to look ugly and space-less: "―Theworldisnotbeautiful.Therefore,itis.―" It completely fails at dealing with furigana or ruby text in a nice manner. It doesn't have any idea how to deal with different verb conjugations, so it treats them all as different words.
I'll be honest, I didn't get very far in using it because it was too tedious for me. I'm not saying it doesn't work, I'm just saying it doesn't work well enough. A lot of the programming is too focused on European languages; it really could use some programming specific to Asian languages to make it far, far more useful.