MeCab
kanji koohii FORUM (http://forum.koohii.com), Learning resources forum (http://forum.koohii.com/forum-9.html), thread /thread-12919.html

MeCab - cmertb - 2015-08-03
Has anyone successfully used MeCab on Windows? I vaguely remember that it worked for me the first time I downloaded it, but I'm trying it now, and I just can't get it to understand my input. It seems that it uses neither UTF-8 nor Shift-JIS. Whatever I type in using the Windows IME doesn't even get split up into morphemes; it just outputs a single line that I can't even read because I don't know what encoding it is... I tried setting Windows to the Japanese locale as well, but I think that only works for Shift-JIS encoding, not the devil's creation that MeCab uses by default. In any case, I still couldn't read the output or get it to parse my input. I also tried rebuilding the dictionary with UTF-8 based on some instructions I googled up ( http://nymemo.com/mecab/502/ ), but nothing seems to have changed. If someone could give me any hints, I would really appreciate it! (Encoding is, of course, just the first problem. I have more programming-specific questions too.)

MeCab - Flamerokz - 2015-08-04
The default encoding for MeCab is EUC-JP, I believe (this is what Anki's Japanese plugin uses to interface with MeCab). I never personally tried getting MeCab to work on Windows, so I probably can't help you further.
EDIT: It looks like another koohii user made a YouTube guide for people here three years ago, so this can probably help you out: https://youtu.be/1wqwWji4u0E

MeCab - Savii - 2015-08-04
Yes, it should work with EUC-JP. I used it with PHP on Windows once and just ran all input and output through mb_convert_encoding() to go from and to UTF-8; this worked fine. I recall the MeCab installer for Windows also asks you which encoding (EUC-JP/Shift-JIS/UTF-8) you want to use for its internal dictionary. Can't say for sure, but maybe this is also what determines the input and output encoding?

MeCab - yogert909 - 2015-08-04
Savii Wrote: Yes, it should work with EUC-JP.
I used it with PHP on Windows once and just ran all input and output through mb_convert_encoding() to go from and to UTF-8, this worked fine.
I haven't used it, and certainly not on Windows. But I seem to recall reading that there's a command to change the default encoding. Aldebrn, a user here, has been doing some things with MeCab, so maybe he'll come in with some real solutions.

MeCab - cmertb - 2015-08-04
Oh man, I've googled all over the internet, but somehow I did not come across that video. I think I can solve my encoding problems with its help. At least I'll be able to use it from C or C++ code if I give up on UTF-8 and stick with Shift-JIS. But actually I wanted to try Python. Has anyone been able to successfully install the MeCab Python module on Windows? It fails miserably for me, and the various workarounds I found don't work either. Of course, the fact that I've never installed a Python module before and don't know how it's supposed to work doesn't help. Thanks for all your help!

MeCab - Flamerokz - 2015-08-04
cmertb Wrote: Oh man, I've googled all over the internet, but somehow I did not come across that video. I think I can solve my encoding problems with its help. At least I'll be able to use it from C or C++ code if I give up on UTF-8 and stick with Shift-JIS.
Well LUCKY FOR YOU I decided to be miserable and spend the last 12+ hours trying to compile MeCab on Windows 7 x64 and then also the corresponding Python module for x64 (2.7.10)... Bleh. I finally got it working though. If you're looking to do it on x86, though, the following instructions won't work (a bunch of hardcoding for the x64 configuration). I hope you can either read enough Japanese or use Rikaichan/Rikaisama/Google Translate.
Basically, I followed the instructions in this article: http://pop365.cocolog-nifty.com/blog/2015/03/windows-64bit-m.html
There are references in it to the old Google Code repo, but you can find the files at the latest repo page: https://taku910.github.io/mecab/
However, just following those instructions alone will not compile correctly (at least it didn't for me). [1] Before you proceed to the section "MeCab 本体 Source ビルド" (building the MeCab core from source), there are two more changes you need to make. Notably, at the bottom of the article here: http://orion.bluememe.jp/2011/09/windows-64bitmecab.html
Quote: Make the following fix to mecab.h (no change is needed if you want the same symbols as the 32-bit DLL that comes with the Windows installer)
So basically, there are two locations in src\mecab.h that have the above preprocessor typo... At both line 1125 and line 1414, change
Code: #ifndef SIWG
to
Code: #ifndef SWIG
If there's something that's unclear to you from the articles I linked, I can clarify. I sent a pull request on the git repo for this, but it looks like the developer doesn't know how it works (the repo seems to have moved around a bunch over the years)... There are still several pull requests that have been completely unaddressed for months.
[1] The error in particular I got is as follows, after attempting to build the Python module:
Code: python setup.py install

MeCab - cmertb - 2015-08-05
Thank you Flamerokz, this is more than I expected! I'll try this on the weekend.

MeCab - Flamerokz - 2015-08-05
cmertb Wrote: Thank you Flamerokz, this is more than I expected!
It was coincidental, since I had wanted to get the Python bindings for MeCab working on Windows x64 too... and then it sucked my life out. Hopefully it saves you that pain.

MeCab - cmertb - 2015-08-15
Well, things didn't go smoothly once I got to the Python module installation. First of all, it was failing to find vcvarsall.bat, even though I set the env variable as instructed. I ended up hard-coding the path in find_vcvarsall() in msvc9compiler.py.
But now I'm getting linker errors which I don't know what to do with:
Code: c:\mecab-python-0.996>python setup.py build

MeCab - Flamerokz - 2015-08-16
The linker errors look like the same ones I got when I made the snip in my previous post. Are you sure you fixed the "#ifndef SIWG" typos I mentioned in the previous post? That in particular fixed the errors I got, and I am fairly certain they are the direct cause of the unresolved external symbol error. SWIG is used only for Python/Ruby/etc. binding generation, so even if you left the typos in there, you could compile and use MeCab fine from the command line without any difference. After fixing those, you need to compile MeCab from source and make the file replacements in your MeCab install folder, as indicated by the first article. If you did do all that, are you sure you're actually using 64-bit Python? (I don't mean to be insulting, but I honestly don't know why else it would try to do a 32-bit build if you fixed the typos above; it might be worth trying to compile some other Python module to see if it does 32-bit as well.) Also, I have no idea why it wouldn't work with the environment variables set. It sounds like the hardcoding is working as it should, though.

MeCab - cmertb - 2015-08-16
Well, as embarrassing as it is, you're right, I had 32-bit Python. I had it for a while and for some reason I assumed I'd have the 64-bit version, but it actually installs 32-bit if you take what they give you. I had to browse the site for 64-bit specifically. Everything compiled and installed (other than not finding vcvarsall.bat). For those who might try to replicate this later, note that if the instructions say to get Visual Studio 2013, then get that version. In 2015, for example, they rearranged where all the files are, so you'd have to figure out all the new paths and settings yourself. Anyway, MeCab now works for me on the command line, but it doesn't want to recognize Python strings.
I ran the test app that has the hardcoded string "太郎はこの本を二郎を見た女性に渡した。" and the output I get with the Shift-JIS dictionary is:
Code: 螟 名詞,一般,*,*,*,*,*
Code: 螟ェ驛・蜷崎ゥ・蝗コ譛牙錐隧・莠コ蜷・蜷・*,*,螟ェ驛・繧ソ繝ュ繧ヲ,繧ソ繝ュ繝シ
When I saved test.py with Shift-JIS encoding, everything printed correctly (with the Shift-JIS dictionary). Is it simply that Windows cmd doesn't support UTF-8?
EDIT: Let me ask a more specific question: if I don't care about what's printed to stdout and just want to check which part of speech certain words in the input text are, then I should be fine using UTF-8 string literals in Python source with UTF-8 input files and a UTF-8 MeCab dictionary?
P.S. I implore the Japanese to get their encoding anarchy under control.

MeCab - Flamerokz - 2015-08-16
Yep, as far as I can tell the Windows command line doesn't support UTF-8, which is quite annoying. Unless someone can show me otherwise (I hope!)
EDIT: Although you can still keep your Python files in UTF-8 and just encode a unicode string to Shift-JIS before shoving it into MeCab.
EDIT2: I should probably note that whatever your dictionary charset is determines both its input and output encoding. So it *can* do UTF-8 input and output, but you won't be able to display it properly in the command line (you should be able to pipe it into a tool like iconv if you really want to display the text). The data from MeCab will still be fed back into Python correctly.

MeCab - cmertb - 2015-08-16
Oops, edited my post while you were replying. Anyway, thank you very much for all your help, Flamerokz! Now that it's working, I'll try to do something useful with it.

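The garbled output quoted above is consistent with UTF-8 bytes being read as Shift-JIS, and the workaround Flamerokz describes (encode the unicode string to the dictionary's charset before handing it to MeCab, then decode what comes back) can be sketched in plain Python. This is only an illustration, not code from anyone in the thread: it never calls MeCab, the helper names `to_dictionary_bytes` and `from_dictionary_bytes` are made up, and it assumes Python's `cp932` codec as a stand-in for the Shift-JIS charset of a Windows dictionary.

```python
# Illustrative only: no MeCab call is made here. The helper names and the
# choice of "cp932" (Python's codec for Windows Shift-JIS) are assumptions.
DICT_CHARSET = "cp932"


def to_dictionary_bytes(text, charset=DICT_CHARSET):
    """Encode a unicode string into the charset the dictionary was built with."""
    return text.encode(charset)


def from_dictionary_bytes(raw, charset=DICT_CHARSET):
    """Decode bytes in the dictionary charset back into a unicode string."""
    return raw.decode(charset)


sentence = u"太郎はこの本を二郎を見た女性に渡した。"

# What happens when UTF-8 bytes are interpreted as Shift-JIS: the bytes are
# regrouped into unrelated characters, so the text becomes unreadable.
garbled = sentence.encode("utf-8").decode(DICT_CHARSET, errors="replace")

# Converting explicitly in both directions keeps the text intact.
round_trip = from_dictionary_bytes(to_dictionary_bytes(sentence))

print(garbled == sentence)     # compares the misread text with the original
print(round_trip == sentence)  # compares the converted text with the original
```

Only booleans are printed so that the sketch does not itself depend on what the console can display. With a UTF-8 dictionary the same idea would apply with `DICT_CHARSET = "utf-8"`, in which case no conversion is lost between Python and MeCab even if cmd cannot show the result.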
MeCab - Flamerokz - 2015-08-16
Oh no, now with all this editing the correspondence is no longer linear. Noooooooooooooooooooooooooooo

MeCab - aldebrn - 2015-08-16
Out of curiosity, do the Python bindings just shell out to mecab and parse the command's output, or do they link against the MeCab library? From the amount of work described above, it seems like the latter, but then the discussion about changing encodings before invoking MeCab makes me think it's the former.
Off topic, but on Mac/Linux I build MeCab and IPADIC for UTF-8 from the source tarball on Google Code (now GitHub). I'm guessing it's a Windows thing that makes you run into the horrible "SIWG" vs "SWIG" thing (what a typo…), because I've never had to patch the source.
Even more off-topic, I use Ve, a Ruby front-end and post-processor for MeCab that's quite snazzy. I like its fancy re-assembling of morphemes into lexemes so much that I actually call it from Node.js apps.

MeCab - Flamerokz - 2015-08-16
aldebrn Wrote: Out of curiosity, do the Python bindings just shell out to mecab and parse the command's output, or do they link against the MeCab library? From the amount of work described above, it seems like the latter, but then the discussion about changing encodings before invoking MeCab makes me think it's the former.
It does link against the MeCab library. The input/output encoding is determined by how the dictionary file is compiled, which is separate from the compilation of the mecab executable.
aldebrn Wrote: Off topic, but on Mac/Linux I build MeCab and IPADIC for UTF-8 from the source tarball on Google Code (now GitHub). I'm guessing it's a Windows thing that makes you run into the horrible "SIWG" vs "SWIG" thing (what a typo…), because I've never had to patch the source.
I'm not sure if it's just a Windows thing or if it's a 64-bit versus 32-bit thing. I would assume the latter, but if it's working fine for you on 64-bit Mac/Linux then I guess it is a Windows thing.
Almost all the patching I noted above is because the source is hard-coded for 32-bit builds.
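Since the linker errors earlier in the thread came down to a 32-bit Python being used with a 64-bit MeCab build, it may be worth checking the interpreter before building the bindings. A minimal sketch using only the standard library follows; the function name `interpreter_bits` is made up for illustration.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running Python interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


bits = interpreter_bits()

# sys.maxsize exceeds 2**32 only on a 64-bit build of Python.
is_64bit = sys.maxsize > 2**32

print("Python %s, %d-bit (%s)" % (platform.python_version(), bits, platform.machine()))
```

Running this with the same `python` that will run `setup.py` shows whether a 64-bit compile of the module can be expected to link against a 64-bit MeCab library.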