kanji koohii FORUM
MeCab - Printable Version




MeCab - cmertb - 2015-08-03

Has anyone successfully used MeCab on Windows? I vaguely remember that it worked for me the first time I downloaded it, but I'm trying it now, and I just can't get it to understand my input. It seems that it uses neither UTF-8 nor Shift-JIS. Whatever I type in using Windows IME doesn't even get split up into morphemes; it just outputs a single line that I can't even read because I don't know what encoding it is... I tried setting Windows to Japanese locale as well, but I think that only works for Shift-JIS encoding, not the devil's creation that MeCab uses by default. In any case, I still couldn't read the output or get it to parse my input. I also tried rebuilding the dictionary with UTF-8 based on some instructions I googled up ( http://nymemo.com/mecab/502/ ), but nothing seems to have changed.

If someone could give me any hints, I would really appreciate it!

(Encoding is, of course, just the first problem. I have more programming specific questions too.)


MeCab - Flamerokz - 2015-08-04

The default encoding for MeCab is EUC-JP, I believe (this is what Anki's Japanese plugin uses to interface with MeCab). I never personally tried getting MeCab to work on Windows, so I probably can't help you further.

EDIT: It looks like another koohii user made a youtube guide for people here three years ago, so this can probably help you out https://youtu.be/1wqwWji4u0E


MeCab - Savii - 2015-08-04

Yes, it should work with EUC-JP. I used it with PHP on Windows once and just ran all input and output through mb_convert_encoding() to convert to and from UTF-8, which worked fine.
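For anyone who wants the same round trip in Python rather than PHP, here is a minimal sketch. The helper name transcode is made up for illustration, the sample sentence is arbitrary, and no MeCab call is shown; the snippet only demonstrates the conversion step, assuming the dictionary really is EUC-JP.

```python
# -*- coding: utf-8 -*-
# Sketch only: "transcode" is a hypothetical helper, not part of MeCab or PHP.

def transcode(data, src, dst):
    """Re-encode a byte string from charset `src` to charset `dst`."""
    return data.decode(src).encode(dst)

utf8_input = u"太郎はこの本を渡した。".encode("utf-8")

# Bytes in the form an EUC-JP dictionary would expect as input.
eucjp_bytes = transcode(utf8_input, "utf-8", "euc_jp")

# Whatever comes back in EUC-JP can be converted to UTF-8 the same way.
utf8_again = transcode(eucjp_bytes, "euc_jp", "utf-8")

print(utf8_again == utf8_input)
```

If the dictionary was built as Shift-JIS instead, the same helper should work with "cp932" in place of "euc_jp".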

I recall the MeCab installer for Windows also asks you which encoding (EUC-JP/Shift-JIS/UTF-8) you want to use for its internal dictionary. Can't say for sure, but maybe this is also what determines the input and output encoding?


MeCab - yogert909 - 2015-08-04

Savii Wrote:Yes, it should work with EUC-JP. I used it with PHP on Windows once and just ran all input and output through mb_convert_encoding() to convert to and from UTF-8, which worked fine.

I recall the MeCab installer for Windows also asks you which encoding (EUC-JP/Shift-JIS/UTF-8) you want to use for its internal dictionary. Can't say for sure, but maybe this is also what determines the input and output encoding?
I haven't used it, and certainly not on Windows. But I seem to recall reading that there's a command to change the default encoding. Aldebrn, a user here, has been doing some things with MeCab, so maybe he'll come in with some real solutions.


MeCab - cmertb - 2015-08-04

Oh man, I've googled all over the internet, but somehow I did not come across that video. I think I can solve my encoding problems with its help. At least I'll be able to use it from C or C++ code if I give up on UTF-8 and stick with Shift-JIS.

But actually I wanted to try Python. Has anyone been able to successfully install the MeCab Python module on Windows? It fails miserably for me, and various workarounds I found don't work either. Of course, the fact that I've never installed a Python module before and don't know how it's supposed to work doesn't help.

Thanks for all your help!


MeCab - Flamerokz - 2015-08-04

cmertb Wrote:Oh man, I've googled all over the internet, but somehow I did not come across that video. I think I can solve my encoding problems with its help. At least I'll be able to use it from C or C++ code if I give up on UTF-8 and stick with Shift-JIS.

But actually I wanted to try Python. Has anyone been able to successfully install the MeCab Python module on Windows? It fails miserably for me, and various workarounds I found don't work either. Of course, the fact that I've never installed a Python module before and don't know how it's supposed to work doesn't help.

Thanks for all your help!
Well LUCKY FOR YOU I decided to be miserable and spend the last 12+ hours trying to compile MeCab on Windows 7 x64 and then also the corresponding Python module for x64 (2.7.10)... Bleh. I finally got it working though. If you're looking to do it on x86 though the following instructions won't work (a bunch of hardcoding for x64 configuration).

I hope you can either read enough Japanese or use Rikaichan/Rikaisama/Google-Translate. Basically, I followed the instructions on this article
http://pop365.cocolog-nifty.com/blog/2015/03/windows-64bit-m.html

The article refers to the old Google Code repo, but you can find the files at the latest repo page: https://taku910.github.io/mecab/

However, just following those instructions alone will not compile correctly (at least it didn't for me). [1]

Before you proceed to the section "MeCab 本体 Source ビルド" (building the MeCab core from source), there are two more changes you need to make. They are described at the bottom of this article: http://orion.bluememe.jp/2011/09/windows-64bitmecab.html

Quote:Make the following fix to mecab.h (no change is needed if you want the same symbols as the 32-bit DLL that ships with the Windows installer):
 change "#ifndef SIWG" to "#ifndef SWIG"
So basically, there are two locations in src\mecab.h that have the above preprocessor typo...
At both line 1125 and 1414, change
Code:
#ifndef SIWG
to
Code:
#ifndef SWIG
Then proceed with the rest of the instructions from the first article I linked. By doing this, you should be able to get both an x64 build of MeCab 0.996 and the corresponding Python module bindings.
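If you would rather not hunt for the two lines by hand (line numbers can shift between releases), the same replacement can be scripted. This is only a sketch: the function name fix_swig_guard is made up for illustration, and the sample text below is a stand-in, not the real contents of mecab.h.

```python
import re

def fix_swig_guard(source_text):
    """Return (patched_text, replacement_count) with the SIWG typo corrected."""
    # Only touch the preprocessor guard, not other occurrences of the letters.
    return re.subn(r"(#\s*ifndef\s+)SIWG\b", r"\1SWIG", source_text)

# Stand-in header text with two misspelled guards and one correct guard.
sample = (
    "#ifndef SIWG\nint a;\n#endif\n"
    "#ifndef SWIG\nint b;\n#endif\n"
    "#ifndef SIWG\nint c;\n#endif\n"
)

patched, count = fix_swig_guard(sample)
print(count)
```

To apply it for real, read src\mecab.h into a string, run it through the function, and write the result back, keeping a backup of the original. A count other than 2 would suggest the header differs from the one described here.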

If there's something that's unclear to you from the articles I linked I can clarify.

I sent a pull request on the git repo for this but it looks like the developer doesn't know how it works (the repo seems to have moved around a bunch over the years)... There are still several pull requests that have been completely unaddressed for months.

[1] The specific error I got after attempting to build the Python module was:
Code:
python setup.py install
running build
running build_py
running build_ext
building '_MeCab' extension
<bunch of compiler commands/warning/error stuff>
build\lib.win-amd64-2.7\_MeCab.pyd : fatal error LNK1120: 11 unresolved externals
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\Bin\amd64\link.exe"' failed with exit status 1120
EDIT: If you are on x86 then you probably only need to make changes to the setup.py file that comes with the python bindings file to point to the directory where you installed MeCab, the same way the changes are indicated in the first article instructions. I haven't tested this myself personally though.


MeCab - cmertb - 2015-08-05

Thank you Flamerokz, this is more than I expected! :)

I'll try this on the weekend.


MeCab - Flamerokz - 2015-08-05

cmertb Wrote:Thank you Flamerokz, this is more than I expected! :)

I'll try this on the weekend.
It was coincidental since I had wanted to get the python bindings for mecab working on Windows x64 too... and then it sucked my life out. Hopefully it saves you that pain.


MeCab - cmertb - 2015-08-15

Well, things didn't go smoothly once I got to python module installation. First of all, it was failing to find vcvarsall.bat, even though I set the env variable as instructed. I ended up hard coding the path in find_vcvarsall() in msvc9compiler.py.
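Before hard-coding anything, it may help to check where the build would look for vcvarsall.bat. The sketch below rests on assumptions the thread does not confirm: that the variable in question is VS90COMNTOOLS (the one Python 2.7's distutils consults, as far as I recall) and that vcvarsall.bat sits in a VC folder two levels above the tools directory. The function name guess_vcvarsall is made up for illustration.

```python
import os

def guess_vcvarsall(tools_dir):
    """Return the vcvarsall.bat path implied by a ...COMNTOOLS directory, or None.

    Assumed layout: <VS root>\\Common7\\Tools  ->  <VS root>\\VC\\vcvarsall.bat
    """
    product_dir = os.path.abspath(
        os.path.join(tools_dir, os.pardir, os.pardir, "VC")
    )
    candidate = os.path.join(product_dir, "vcvarsall.bat")
    return candidate if os.path.isfile(candidate) else None

# VS90COMNTOOLS is an assumption; the thread never names the variable.
tools_dir = os.environ.get("VS90COMNTOOLS")
if tools_dir is None:
    print("VS90COMNTOOLS is not set")
else:
    print(guess_vcvarsall(tools_dir) or "vcvarsall.bat not found from VS90COMNTOOLS")
```

If the script finds nothing, the variable is probably either unset in the shell that runs setup.py or pointing at a different Visual Studio layout.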

But now I'm getting linker errors which I don't know what to do with:
Code:
c:\mecab-python-0.996>python setup.py build
running build
running build_py
running build_ext
building '_MeCab' extension
C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\BIN\cl.exe /c /nologo /Ox
/MD /W3 /GS- /DNDEBUG -IC:\MeCab\sdk -IC:\Python27\include -IC:\Python27\PC /Tp
MeCab_wrap.cxx /Fobuild\temp.win32-2.7\Release\MeCab_wrap.obj
MeCab_wrap.cxx
MeCab_wrap.cxx(3747) : warning C4530: C++ exception handler used, but unwind sem
antics are not enabled. Specify /EHsc
C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\BIN\link.exe /DLL /nologo
/INCREMENTAL:NO /LIBPATH:C:\MeCab\sdk /LIBPATH:C:\Python27\libs /LIBPATH:C:\Pyt
hon27\PCbuild libmecab.lib /EXPORT:init_MeCab build\temp.win32-2.7\Release\MeCab
_wrap.obj /OUT:build\lib.win32-2.7\_MeCab.pyd /IMPLIB:build\temp.win32-2.7\Relea
se\_MeCab.lib /MANIFESTFILE:build\temp.win32-2.7\Release\_MeCab.pyd.manifest
   Creating library build\temp.win32-2.7\Release\_MeCab.lib and object build\tem
p.win32-2.7\Release\_MeCab.exp
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static char
const * __cdecl MeCab::Model::version(void)" (?version@Model@MeCab@@SAPBDXZ) ref
erenced in function __catch$__wrap_Model_swap$5
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static class
MeCab::Model * __cdecl MeCab::Model::create(int,char * *)" (?create@Model@MeCab
@@SAPAV12@HPAPAD@Z) referenced in function __catch$__wrap_delete_Model$4
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static class
MeCab::Model * __cdecl MeCab::Model::create(char const *)" (?create@Model@MeCab
@@SAPAV12@PBD@Z) referenced in function __catch$__wrap_Model_create__SWIG_0$5
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static bool
__cdecl MeCab::Tagger::parse(class MeCab::Model const &,class MeCab::Lattice *)"
(?parse@Tagger@MeCab@@SA_NABVModel@2@PAVLattice@2@@Z) referenced in function __
catch$__wrap_new_Model__SWIG_1$3
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static class
MeCab::Tagger * __cdecl MeCab::Tagger::create(int,char * *)" (?create@Tagger@Me
Cab@@SAPAV12@HPAPAD@Z) referenced in function __catch$__wrap_delete_Tagger$4
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static class
MeCab::Tagger * __cdecl MeCab::Tagger::create(char const *)" (?create@Tagger@Me
Cab@@SAPAV12@PBD@Z) referenced in function __catch$__wrap_Tagger_create__SWIG_0$
5
MeCab_wrap.obj : error LNK2019: unresolved external symbol "public: static char
const * __cdecl MeCab::Tagger::version(void)" (?version@Tagger@MeCab@@SAPBDXZ) r
eferenced in function __catch$__wrap_Tagger_create__SWIG_1$4
MeCab_wrap.obj : error LNK2019: unresolved external symbol "__declspec(dllimport
) class MeCab::Lattice * __cdecl MeCab::createLattice(void)" (__imp_?createLatti
ce@MeCab@@YAPAVLattice@1@XZ) referenced in function "class MeCab::Lattice * __cd
ecl new_MeCab_Lattice(void)" (?new_MeCab_Lattice@@YAPAVLattice@MeCab@@XZ)
MeCab_wrap.obj : error LNK2019: unresolved external symbol "__declspec(dllimport
) class MeCab::Model * __cdecl MeCab::createModel(char const *)" (__imp_?createM
odel@MeCab@@YAPAVModel@1@PBD@Z) referenced in function "class MeCab::Model * __c
decl new_MeCab_Model(char const *)" (?new_MeCab_Model@@YAPAVModel@MeCab@@PBD@Z)
MeCab_wrap.obj : error LNK2019: unresolved external symbol "__declspec(dllimport
) class MeCab::Tagger * __cdecl MeCab::createTagger(char const *)" (__imp_?creat
eTagger@MeCab@@YAPAVTagger@1@PBD@Z) referenced in function "class MeCab::Tagger
* __cdecl new_MeCab_Tagger(char const *)" (?new_MeCab_Tagger@@YAPAVTagger@MeCab@
@PBD@Z)
MeCab_wrap.obj : error LNK2019: unresolved external symbol "__declspec(dllimport
) char const * __cdecl MeCab::getLastError(void)" (__imp_?getLastError@MeCab@@YA
PBDXZ) referenced in function "class MeCab::Tagger * __cdecl new_MeCab_Tagger(ch
ar const *)" (?new_MeCab_Tagger@@YAPAVTagger@MeCab@@PBD@Z)
build\lib.win32-2.7\_MeCab.pyd : fatal error LNK1120: 11 unresolved externals
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\BIN\\
link.exe' failed with exit status 1120
This looks like it's trying to do a 32-bit instead of a 64-bit build, but I have no idea how to force it to x64. Any ideas?


MeCab - Flamerokz - 2015-08-16

The linker errors look like the same ones I snipped out of the log in my previous post.

Are you sure you fixed the "#ifndef SIWG" typos I mentioned in the previous post? That in particular fixed the errors I got, and I am fairly certain they are the direct cause of the unresolved external symbol error. SWIG is used only for python/ruby/etc. binding generation so even if you left the typos in there, you could compile and use MeCab fine from the command line without any difference.

After fixing those you need to compile MeCab from source and make the file replacements in your MeCab install folder, as indicated by the first article.

If you did do all that, are you sure you're actually using 64-bit Python? (Don't mean to be insulting but I honestly don't know why else it would try to do a 32-bit build if you fixed the typos above; might be worth trying to compile some other python module and see if it does 32-bit as well?)
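A quick way to check which interpreter is actually running is to ask it for its pointer size. This is a small sketch using only the standard library; the function name interpreter_bits is made up for illustration.

```python
import platform
import struct
import sys

def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    return struct.calcsize("P") * 8

print(interpreter_bits())
print(platform.architecture()[0])
print(sys.maxsize > 2**32)  # True on a 64-bit build
```

Run it with the same python that runs setup.py. A 32-bit interpreter will build 32-bit extensions no matter how MeCab itself was compiled.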

Also, I have no idea why it wouldn't work with the environment variables set. It sounds like the hardcoding is working as it should, though.


MeCab - cmertb - 2015-08-16

Well, as embarrassing as it is, you're right, I had 32-bit python. I had it for a while and for some reason I assumed I'd have 64-bit version, but it actually installs 32-bit if you take what they give you. Had to browse the site for 64-bit specifically.

Everything compiled and installed (other than not finding vcvarsall.bat).

For those who might try to replicate this later, note that if the instructions say to get Visual Studio 2013, then get that version. In e.g. 2015 they rearranged where all the files are, so you'd have to figure out all the new paths and settings yourself.

Anyway, mecab now works for me on the command line, but it doesn't want to recognize python strings. I ran the test app that has the hardcoded string "太郎はこの本を二郎を見た女性に渡した。" and the output I get with Shift-JIS dictionary is:

Code:
螟      名詞,一般,*,*,*,*,*
ェ       名詞,一般,*,*,*,*,*
驛弱    名詞,一般,*,*,*,*,*
・      記号,一般,*,*,*,*,*
縺薙    名詞,固有名詞,組織,*,*,*,*
・      記号,一般,*,*,*,*,*
譛      名詞,固有名詞,組織,*,*,*,*
ャ       名詞,一般,*,*,*,*,*
繧剃    名詞,一般,*,*,*,*,*
コ       名詞,一般,*,*,*,*,*
碁      名詞,一般,*,*,*,*,碁,ゴ,ゴ
ヮ      名詞,一般,*,*,*,*,*
繧定    名詞,一般,*,*,*,*,*
ヲ       名詞,一般,*,*,*,*,*
九      名詞,数,*,*,*,*,九,キュウ,キュー
◆      記号,一般,*,*,*,*,◆,◆,◆
螂      名詞,固有名詞,組織,*,*,*,*
ウ       名詞,一般,*,*,*,*,*
諤      名詞,一般,*,*,*,*,*
ァ       名詞,一般,*,*,*,*,*
縺      名詞,一般,*,*,*,*,*
ォ       名詞,一般,*,*,*,*,*
貂      名詞,一般,*,*,*,*,貂,テン,テン
。       名詞,サ変接続,*,*,*,*,*
縺励    名詞,一般,*,*,*,*,*
◆      記号,一般,*,*,*,*,◆,◆,◆
縲      名詞,固有名詞,組織,*,*,*,*
・記号,一般,*,*,*,*,*
EOS
With UTF-8 dictionary it's even worse:
Code:
螟ェ驛・蜷崎ゥ・蝗コ譛牙錐隧・莠コ蜷・蜷・*,*,螟ェ驛・繧ソ繝ュ繧ヲ,繧ソ繝ュ繝シ
縺ッ     蜉ゥ隧・菫ょ勧隧・*,*,*,*,縺ッ,繝・繝ッ
縺薙・  騾」菴楢ゥ・*,*,*,*,*,縺薙・,繧ウ繝・繧ウ繝・
譛ャ     蜷崎ゥ・荳€闊ャ,*,*,*,*,譛ャ,繝帙Φ,繝帙Φ
繧・蜉ゥ隧・譬シ蜉ゥ隧・荳€闊ャ,*,*,*,繧・繝イ,繝イ
莠・蜷崎ゥ・謨ー,*,*,*,*,莠・繝・繝・
驛・蜷崎ゥ・荳€闊ャ,*,*,*,*,驛・繝ュ繧ヲ,繝ュ繝シ
繧・蜉ゥ隧・譬シ蜉ゥ隧・荳€闊ャ,*,*,*,繧・繝イ,繝イ
隕・蜍戊ゥ・閾ェ遶・*,*,荳€谿オ,騾」逕ィ蠖「,隕九k,繝・繝・
縺・蜉ゥ蜍戊ゥ・*,*,*,迚ケ谿翫・繧ソ,蝓コ譛ャ蠖「,縺・繧ソ,繧ソ
螂ウ諤ァ  蜷崎ゥ・荳€闊ャ,*,*,*,*,螂ウ諤ァ,繧ク繝ァ繧サ繧、,繧ク繝ァ繧サ繧、
縺ォ     蜉ゥ隧・譬シ蜉ゥ隧・荳€闊ャ,*,*,*,縺ォ,繝・繝・
貂。縺・蜍戊ゥ・閾ェ遶・*,*,莠疲ョオ繝サ繧オ陦・騾」逕ィ蠖「,貂。縺・繝ッ繧ソ繧キ,繝ッ繧ソ繧キ
縺・蜉ゥ蜍戊ゥ・*,*,*,迚ケ谿翫・繧ソ,蝓コ譛ャ蠖「,縺・繧ソ,繧ソ
縲・險伜捷,蜿・轤ケ,*,*,*,*,縲・縲・縲・
EOS
Not just the input, but even the output is messed up. My Windows is currently in Japanese locale.

When I saved test.py with Shift-JIS encoding, everything printed correctly (with Shift-JIS dictionary). Is it simply that Windows cmd doesn't support UTF-8?
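The garbling above looks like UTF-8 bytes being read as Shift-JIS, and that is easy to reproduce without MeCab. This is a sketch under that assumption; the helper names misread and repair are made up, and cp932 is used as Python's name for the Windows flavour of Shift-JIS.

```python
# -*- coding: utf-8 -*-

def misread(text, actual="utf-8", assumed="cp932"):
    """Encode `text` as `actual`, then decode the bytes as if they were `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")

def repair(garbled, actual="utf-8", assumed="cp932"):
    """Undo misread(); only works when no bytes were lost to replacement."""
    return garbled.encode(assumed).decode(actual)

garbled = misread(u"太")
print(garbled)          # compare with the start of the garbled listings above
print(repair(garbled))  # back to the original character
```

If the result matches the garbage in the listings, the source file is most likely being handed to MeCab as UTF-8 while the dictionary and console expect Shift-JIS.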

EDIT: Let me ask a more specific question: If I don't care about what's printed to stdout, I just want to check which part of speech certain words are in input text, then I should be fine using UTF-8 string literals in python source with UTF-8 input files and UTF-8 MeCab dictionary?
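For the part-of-speech check itself, the text output can be split without ever printing it. The sketch below assumes the default output format (surface form, a tab, then comma-separated features, with EOS closing each sentence) and that the first two features are the part of speech and its subcategory, as in the listings above. The function name parse_mecab_output is made up, and the two sample lines are copied from the first listing with a tab assumed as the separator.

```python
# -*- coding: utf-8 -*-

def parse_mecab_output(text):
    """Yield (surface, features) pairs from MeCab-style text output."""
    for line in text.splitlines():
        if line == "EOS" or not line.strip():
            continue  # skip the sentence terminator and blank lines
        surface, _, feature_str = line.partition("\t")
        yield surface, feature_str.split(",")

# Sample lines taken from the listing above (tab separator assumed).
sample = (
    u"碁\t名詞,一般,*,*,*,*,碁,ゴ,ゴ\n"
    u"九\t名詞,数,*,*,*,*,九,キュウ,キュー\n"
    u"EOS\n"
)

for surface, features in parse_mecab_output(sample):
    print(surface, features[0], features[1])
```

As long as the string handed to MeCab and the dictionary share one encoding, the comparison against names like 名詞 should work regardless of what the console can display.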

P.S. I implore the Japanese to get their encoding anarchy under control.


MeCab - Flamerokz - 2015-08-16

Yep, as far as I can tell the windows command line doesn't support UTF-8, which is quite annoying. Unless someone can show me otherwise (I hope!)

EDIT: Although you can still keep your python files in UTF-8 and just encode a unicode string to SHIFT-JIS before shoving it into MeCab.

EDIT2: I should probably note that whatever your dictionary charset is determines both its input and output encoding. So it *can* do UTF-8 input and output, but you won't be able to display it properly in the command line (you should be able to pipe into a tool like iconv if you really want to display the text). The data from MeCab will still be fed back into Python correctly.


MeCab - cmertb - 2015-08-16

Oops, edited my post while you were replying.

Anyway, thank you very much for all your help, Flamerokz!

Now that it's working, I'll try to do something useful with it. :)


MeCab - Flamerokz - 2015-08-16

Oh no now with all this editing the correspondence is no longer linear. Noooooooooooooooooooooooooooo


MeCab - aldebrn - 2015-08-16

Out of curiosity, do the Python bindings just shell out to mecab and parse the command's output, or do they link against the MeCab library? From the amount of work described above, it seems like the latter, but then the discussion about changing encodings before invoking mecab makes me think it's the former.

Off topic, but on Mac/Linux I build MeCab and IPADIC for UTF-8 from the source tarball on Google Code (now GitHub). I'm guessing it's a Windows thing that makes you run into the horrible "SIWG" vs "SWIG" thing (what a typo…), because I've never had to patch the source.

Even more off-topic, I use Ve, a Ruby front-end and post-processor for MeCab that's quite snazzy. I like its fancy re-assembling of morphemes into lexemes so much I actually call it from Node.js apps.


MeCab - Flamerokz - 2015-08-16

aldebrn Wrote:Out of curiosity, do the Python bindings just shell out to mecab and parse the command's output, or do they link against the MeCab library? From the amount of work described above, it seems like the latter, but then the discussion about changing encodings before invoking mecab makes me think it's the former.
It does link against the MeCab library. The input/output encoding is determined by how the dictionary file is compiled, which is separate from the compilation of the mecab executable.

aldebrn Wrote:Off topic, but on Mac/Linux I build MeCab and IPADIC for UTF-8 from the source tarball on Google Code (now GitHub). I'm guessing it's a Windows thing that makes you run into the horrible "SIWG" vs "SWIG" thing (what a typo…), because I've never had to patch the source.
I'm not sure if it's just a Windows thing or if it's a 64-bit versus 32-bit thing. I would assume the latter but if it's working fine for you on Mac/Linux 64-bit then I guess it is a Windows thing. Almost all the patching I noted above is because the source is hard-coded for 32-bit builds.