I hate to sound like a complete newb here, but can you point me to any resources on understanding how to do this in Windows? I've found the kakasi site, but I have no clue how to use this due to programming ignorance. I'll be Googling in the meantime, but any assistance would be much appreciated.
Ok, so I've fiddled with this for a month coming from absolutely no programming experience and created a spacer that you can use for LWT. You have to have Python installed on your computer (this was made in 2.7). Also, I'm almost done with the meat of a GUI for this, but still have to learn how to turn that into an executable file for windows as well as translate it into a web app. I don't have access to root on my web server, so I may have to learn PHP or something. I really have absolutely no experience with programming. Here's the code. Save it as .py:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import MeCab
import codecs
import os

# Ask for Japanese text and stash it in a temporary UTF-8 file
inputJP = raw_input("Please input Japanese here: ")
codecs.open("pholder.txt", "w", "utf-8").write(inputJP)

# Read it back and have MeCab insert spaces (-Owakati = word-splitting mode)
read_from = open("pholder.txt", "r").read()
mecab = MeCab.Tagger("-Owakati")
output = mecab.parse(read_from)
print output

# Write the spaced text to a file of your choice and open it in Notepad
save_to = raw_input("Please name the file you'd like to save it to: ")
open(save_to, "w").write(output)
os.system("start notepad.exe " + save_to)
I don't want you to think you're being ignored. I don't have time to mess with anything right now, but when I do, I'll be looking at using the knowledge you've provided with LWT and seeing how I can make it happen. Thanks for working on this!
Oh, it's no problem. This thread gets good hits on Google for LWT and parsing Japanese, so I'm updating it to help others that may not be forum users.
Actually, I'm not sure how much help it is as I'm still new to this programming business.
Gah, sorry I haven't been tracking this thread, it seems you've worked out most problems for parsing by now, right?
If you've got mecab installed I don't see why you're going through all the scripting business. I just copy text of interest into a text file "article.txt" and run "mecab -O wakati article.txt -o art.out" and then "art.out" contains the parsed text, which I drop right into LWT.
Anyway, glad to see LWT is getting some attention -- I'm still using it and loving it.
K.
Any advice for a simple-minded Windows user?
Typing mecab <arguments> in a Windows command line doesn't do anything (duh). Tried using cacawate-san's script, but python doesn't recognise the mecab module. Tried saving the script in the same folder as the mecab exe's. Didn't help.
EDIT: Just had to add the Mecab folder to the PATH variable. Works perfectly now.
Last edited by Arcturus_Red (2012 January 04, 8:31 am)
In which directory do I put the input.txt files? I keep getting "No such file or directory" no matter where I put it
khalhern wrote:
In which directory do I put the input.txt files? I keep getting "No such file or directory" no matter where I put it
C:\users\(your username) should work.
Or you could just cd to whichever directory they're in... Look for a DOS tutorial on google.
Arcturus_Red wrote:
Any advice for a simple-minded Windows user?
Typing mecab <arguments> in a Windows command line doesn't do anything (duh). Tried using cacawate-san's script, but python doesn't recognise the mecab module. Tried saving the script in the same folder as the mecab exe's. Didn't help.
EDIT: Just had to add the Mecab folder to the PATH variable. Works perfectly now.
Where does one find the PATH variable?
Hi digitlhand, which OS are you using? In both XP and Windows 7 it's accessible by right-clicking on My Computer and choosing Properties, then:
If it's Windows 7: Click on the link on the right hand side of the properties window that opens that says "Advanced system settings".
Then choose the "Advanced" tab, and at the bottom you'll see "Environment Variables". Click that, and you'll see a list of variables. Find the one that says "Path", and edit it. Add something like: ;C:\MeCab\bin\ (Or where ever you have mecab installed).
You need the preceding ";" to divide paths, so if you ever need to add a new one after the mecab one, you need to start that with a ";" as well.
Hope that helps somehow!
(Also, see these guides:)
XP: http://support.microsoft.com/kb/310519
Windows 7: http://www.itechtalk.com/thread3595.html
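One more thing you could try after editing the Path variable: a quick sanity check from Python. This is just a little diagnostic sketch of my own (it needs Python 3.3+ for shutil.which, so it's separate from the 2.7 spacer script above):

```python
import shutil

def find_mecab():
    """Return the full path to the mecab executable, or None if it is not on PATH."""
    return shutil.which("mecab")

path = find_mecab()
if path is None:
    print("mecab not found -- check your Path variable")
else:
    print("mecab found at:", path)
```

If it prints the path, the command prompt will find mecab too.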
@Arcturus_Red: For some reason, I tried putting the input file in "users" and got no results... then realised I hadn't cd'd to the directory in DOS (doh!). Thank you SO much for the help - this worked PERFECTLY! I'm so happy
Last edited by khalhern (2012 February 12, 8:00 pm)
Hello everybody!
I am struggling with this problem (no spaces in Japanese) right now. I read this thread but I cannot figure out what to do. I have Anki with the Japanese model working - it inserts furigana as the reading. So, do I have mecab? It may sound stupid, but I can't remember what I installed to get Anki running as it does now - it was quite a while ago. And btw I am working on MacOS.
Please could someone explain what to do on a Mac to get this Mecab thing running. I don't even know where to look to figure out if Mecab is already installed.
And: could you possibly translate the DOS command line for Mac users?
I hope it doesn't sound too stupid! I was searching this forum and Google for this problem with LWT and Japanese, but since I am no computer pro I guess I am missing something. I hope you can shed some light on it!
Thank you very much in advance!
Deja
Hi derDeja,
Sorry, but I can't help with how to follow these instructions in Mac OS
But if you have the Japanese plug-in installed, there should be a directory called "mecab" within the directory that you installed Anki. Inside the "mecab" directory is a /bin/ directory, which contains the mecab.exe file that you need to run to split the file.
An example in windows would be simply to do something like:
cd C:\Program Files\Anki\mecab\bin
Then inside that directory (bin), you need an input.txt file, which contains Japanese text (saved to UTF-8), then you can run something like this from the DOS-Prompt:
mecab.exe -O wakati input.txt -o output.txt
So the full sequence of commands looks something like:
C:\>cd C:\Program Files\Anki\mecab\bin
C:\Program Files\Anki\mecab\bin>mecab.exe -O wakati input.txt -o output.txt
So you'll need to research how to a) change directories through your Mac OS command prompt (I think the "cd" command works?), and b) find out how to run programs with arguments (Again, it might be exactly the same, simply running mecab.exe -O etc etc through the prompt).
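If it's easier, you could also drive mecab from Python instead of typing the command by hand. This is just a sketch of my own (the helper names and the default "mecab" command are assumptions; on Windows you'd pass the full path to mecab.exe instead):

```python
import subprocess

def build_wakati_cmd(infile, outfile, mecab="mecab"):
    """Build the argument list for MeCab's wakati (word-spacing) mode.

    Pass the full path to mecab.exe as `mecab` on Windows; on a Mac,
    plain "mecab" should work once the binary is on your PATH.
    """
    return [mecab, "-O", "wakati", infile, "-o", outfile]

def run_wakati(infile, outfile, mecab="mecab"):
    """Run MeCab on a UTF-8 input file; raises an error if mecab is missing or fails."""
    subprocess.check_call(build_wakati_cmd(infile, outfile, mecab))

# Example (only does something useful if mecab is actually installed):
# run_wakati("input.txt", "output.txt")
```

The nice part is that the exact same call works on Windows, Mac, and Linux, so you don't have to learn two sets of prompt commands.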
Sorry if that doesn't help much but I hope it does!
Dear khalhern!
Thank you very much for your explanations!
I finally found a Unix file called mecab which should do the job. I tried to find it via Spotlight before, the macOS search system, but that didn't find anything. But your hints set me on track! Thank you!
Now I have to find out how to run this program via the terminal - the macOS equivalent of DOS. Maybe I will figure it out myself, although I am a little afraid to crash my system if I fumble too much inside its heart. So any hints about running little programs plus arguments in the terminal would still be heartily appreciated.
Kind regards
Deja
How do you guys deal with separated words such as verb endings, e.g., 行っ て い ます? I know for the definition you can expand to multiple words, but this doesn't actually bring the word together as far as I've been able to tell. So, for example, for "ます", do you just hit "ignore" or "well known"? The same with "て"?
Actually it kind of sucks, but I just delete by hand. If it's a long text that probably isn't viable, but if you know the root, you might be able to figure out the ending, I guess, because you can always just search individual kanji.
What I've been doing is just searching separated terms without spaces and so on, and just seeing if I can get some results. If I get ones that make sense, then just click the "Edit Text" button at the top, and remove the spaces
I don't know if anyone else has a more productive way to do things, though
I unchecked the "show all" button and that makes things a bit better, showing the longest saved term.
I guess the best practice may be to just group the different endings: います, いました, etc?
I've just been using the program in conjunction with rikaisama (and EBwin) since it usually indicates the tense of the verb.
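For what it's worth, the re-joining could also be automated instead of deleted by hand. This is a naive sketch of my own idea, not anything LWT or MeCab supports: the auxiliary list is a guess, and it will sometimes wrongly glue a genuine particle onto the previous word, so treat it as a starting point only:

```python
# Tokens that MeCab splits off in wakati mode but that you may want
# glued back onto the preceding word. This set is a guess; adjust to taste.
AUXILIARIES = {"て", "い", "ます", "まし", "た", "です", "ん"}

def rejoin(tokens):
    """Merge auxiliary tokens back onto the preceding word."""
    words = []
    for tok in tokens:
        if words and tok in AUXILIARIES:
            words[-1] += tok
        else:
            words.append(tok)
    return words

print(rejoin("行っ て い ます".split()))  # -> ['行っています']
```

You'd run the wakati output through this before pasting it into LWT, so conjugated verbs arrive as single terms.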
Can anyone point me in the right direction for a large csv or tsv file with Japanese terms, readings and translations? (to import into LWT)
I'm willing to convert from other formats, but please let me know which source would be a good option.
I take it there are none? I guess I'll have to give converting an Anki deck or an Edict file or something a shot. Both seem tricky.
Unfortunately, I noticed that there are some size limitations in the LWT term import feature as well, so importing an entire dictionary file won't be possible even after conversion. But that can be solved by copy/pasting chunks of it.
Perhaps improving this feature to use some common public domain dictionary formats and allow full size imports, in the future, would be helpful.
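In the meantime, the copy/pasting of chunks could at least be prepared automatically. A rough sketch (the 1000-line default is my own guess, since I don't know LWT's actual import limit):

```python
def split_tsv(path, lines_per_chunk=1000):
    """Split a big term file into numbered chunks for pasting into LWT's importer.

    The 1000-line default is a guess; adjust it to whatever your install accepts.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    chunk_paths = []
    for i in range(0, len(lines), lines_per_chunk):
        out = f"{path}.part{i // lines_per_chunk + 1:03d}"
        with open(out, "w", encoding="utf-8") as g:
            g.writelines(lines[i:i + lines_per_chunk])
        chunk_paths.append(out)
    return chunk_paths
```

That turns one huge file into terms.tsv.part001, part002, and so on, each small enough to paste in one go.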
Stansfield123 wrote:
I take it there are none? I guess I'll have to give converting an Anki deck or an Edict file or something a shot.
I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.
RawToast wrote:
I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.
I've found some JLPT (level 2 through 4) vocab files with about 6-7000 entries total. When I imported them, they barely made a dent in my sample text (it was still full of unknown terms). How many more would a Core deck have?
The other problem is that these Anki decks don't have verb stems, they just use the imperfective (the one which often ends in ru) form. Importing them won't help LWT recognize most verbs.
I don't know much about dictionaries, but I assume a dictionary file would have stems? That's gotta be how Rikaisama, Jisho.org etc. recognize verbs, no?
I gave up on LWT a long time ago. Such a hassle to install and set up everything including MECAB, and then you have to fiddle around with your words inside the program as well. The author was zero supportive in this matter, nobody seems to know it anyways, so I don't see Japanese with LWT getting any better in the near future (or at all).
TLDR: LWT sucks for Japanese.
If you use the free LINGQ and export your vocab items with Rikaisama into ANKI, you've basically got the paid LINGQ with a perfectly configurable SRS.
Or you just read a lot and don't bother with SRS at all, and use Yomichan or Wakan.
andikaze wrote:
I gave up on LWT a long time ago. Such a hassle to install and set up everything including MECAB, and then you have to fiddle around with your words inside the program as well. The author was zero supportive in this matter, nobody seems to know it anyways, so I don't see Japanese with LWT getting any better in the near future (or at all).
TLDR: LWT sucks for Japanese.
If you use the free LINGQ and export your vocab items with Rikaisama into ANKI, you've basically got the paid LINGQ with a perfectly configurable SRS.
Or you just read a lot and don't bother with SRS at all, and use Yomichan or Wakan.
It took me less than a minute to install LWT. I didn't install mecab, I just use this: http://nihongo.dpwright.com/spaces/index.php
My only complaint is that I can't find a dictionary file I can easily import into it, because LWT only recognizes csv and tsv, and even that not very well. So someone will just have to write a program to convert EDICT-format or something like that to a csv that works. I might do it, but I'm not sure when I'll get around to it. I'll post it if I do.
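For anyone who gets to it before I do, here's a rough sketch of what such a converter could look like. Assumptions on my part: an EDICT entry has the shape `term [reading] /gloss/gloss/` with the reading absent for kana-only entries, the raw file is EUC-JP encoded, and term/translation/reading is a column order LWT's importer will accept:

```python
import re

# An EDICT entry looks like:  言葉 [ことば] /word/language/(P)/
# (the [reading] part is absent for kana-only entries)
EDICT_LINE = re.compile(r"^(\S+)(?:\s+\[([^\]]+)\])?\s+/(.+)/\s*$")

def edict_to_tsv_line(line):
    """Turn one EDICT entry into term<TAB>translation<TAB>reading, or None."""
    m = EDICT_LINE.match(line.strip())
    if not m:
        return None
    term, reading, glosses = m.groups()
    translation = "; ".join(g for g in glosses.split("/") if g and g != "(P)")
    return "\t".join([term, translation, reading or ""])

def convert(edict_path, tsv_path):
    """Stream the (large) EDICT file line by line, so it never has to fit in memory."""
    with open(edict_path, encoding="euc-jp", errors="replace") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        for line in src:
            out = edict_to_tsv_line(line)
            if out:
                dst.write(out + "\n")
```

Streaming line by line also sidesteps the problem of editors like Notepad++ choking on the file size.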
Thought you could set up dictionary URLs in LWT somewhere. I used Wadoku.de when I tried it out.
Stansfield123 wrote:
RawToast wrote:
I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.
I've found some JLPT (level 2 through 4) vocab files with about 6-7000 entries total. When I imported them, they barely made a dent in my sample text (it was still full of unknown terms). How many more would a Core deck have?
I think a Core deck will leave you with the same problem. Core 6k is pretty close to N2 in vocabulary so I doubt you'll get any further -- unless you need all those business/politics terms!
I guess you should start with an edict dictionary file from here instead:
http://www.edrdg.org/jmdict/edict.html
Since the edict file is simply a specially formatted text file, it shouldn't be too difficult to change the separator. Notepad++ struggles with the size of the file, though, so I couldn't really get a look at how verbs are handled.
andikaze wrote:
Thought you could set up dictionary URLs in LWT somewhere. I used Wadoku.de when I tried it out.
You can. Not what I'm trying to do though. I'm trying to import a dict. file into the database itself, using the "import terms" feature.