"Learning With Texts" software + Japanese?

Cacawate Member
From: California Registered: 2006-12-07 Posts: 32 Website

I hate to sound like a complete newb here, but can you point me to any resources on understanding how to do this in Windows? I've found the kakasi site, but I have no clue how to use it because I don't know any programming. I'll be Googling in the meantime, but any assistance would be much appreciated.

Reply #27 - 2011 October 25, 3:55 pm
Cacawate Member
From: California Registered: 2006-12-07 Posts: 32 Website

Ok, so I've fiddled with this for a month, coming from absolutely no programming experience, and created a spacer that you can use for LWT. You have to have Python installed on your computer (this was made in 2.7), along with the MeCab module that the script imports. Also, I'm almost done with the meat of a GUI for this, but I still have to learn how to turn that into an executable file for Windows as well as translate it into a web app. I don't have root access on my web server, so I may have to learn PHP or something. I really have absolutely no experience with programming. Here's the code. Save it as a .py file:

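#!/usr/bin/python
# -*- coding: utf-8 -*-
# Written for Python 2.7 with the MeCab Python module installed.
import MeCab
import codecs
import os

inputJP = raw_input("Please input Japanese here: ")

# Round-trip the text through a UTF-8 file, then read it back as raw bytes for MeCab.
with codecs.open("pholder.txt", "w", "utf-8") as holder:
    holder.write(inputJP)
with open("pholder.txt", "r") as holder:
    read_from = holder.read()

# -Owakati makes MeCab print the words separated by spaces.
mecab = MeCab.Tagger("-Owakati")
output = mecab.parse(read_from)
print output

save_to = raw_input("Please name the file you'd like to save it to: ")
with open(save_to, "w") as out_file:
    out_file.write(output)

# Open the result in Notepad (Windows only).
os.system("start notepad.exe " + save_to)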

Reply #28 - 2011 October 26, 1:58 pm
wccrawford Member
From: FL US Registered: 2008-03-28 Posts: 1551

I don't want you to think you're being ignored. I don't have time to mess with anything right now, but when I do, I'll be looking at using the knowledge you've provided with LWT and seeing how I can make it happen. Thanks for working on this!

Reply #29 - 2011 October 28, 3:11 pm
Cacawate Member
From: California Registered: 2006-12-07 Posts: 32 Website

Oh, it's no problem. This thread gets good hits on Google for LWT and parsing Japanese, so I'm updating it to help others that may not be forum users.

Actually, I'm not sure how much help it is as I'm still new to this programming business. :)

Reply #30 - 2011 October 28, 5:14 pm
kodorakun Member
From: Seattle Registered: 2008-10-15 Posts: 276 Website

Gah, sorry I haven't been tracking this thread, it seems you've worked out most problems for parsing by now, right?

If you've got mecab installed, I don't see why you're going through all the scripting business. I just copy the text of interest into a text file, "article.txt", and run "mecab -O wakati article.txt -o art.out". Then "art.out" contains the parsed text, which I drop right into LWT.
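For anyone who still wants to drive that same command from a script, here is a minimal sketch in Python 3 (not the 2.7 used earlier in the thread). The function names are made up for illustration, and it assumes the mecab executable is on your PATH and that the file names are the ones from the example above.

```python
import subprocess


def build_wakati_command(input_path, output_path):
    """Return the argument list for the mecab call quoted above."""
    return ["mecab", "-O", "wakati", input_path, "-o", output_path]


def run_wakati(input_path, output_path):
    """Run mecab; raises CalledProcessError if mecab reports a failure."""
    subprocess.run(build_wakati_command(input_path, output_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_wakati(...) once mecab is installed.
    print(" ".join(build_wakati_command("article.txt", "art.out")))
```

Passing the arguments as a list avoids quoting problems when a file name contains spaces.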

Anyway, glad to see LWT is getting some attention -- I'm still using it and loving it.

K.

Arcturus_Red New member
From: India Registered: 2011-04-22 Posts: 2

Any advice for a simple-minded Windows user?
Typing mecab <arguments> in a Windows command line doesn't do anything (duh). Tried using cacawate-san's script, but python doesn't recognise the mecab module. Tried saving the script in the same folder as the mecab exe's. Didn't help.


EDIT: Just had to add the Mecab folder to the PATH variable. Works perfectly now.

Last edited by Arcturus_Red (2012 January 04, 8:31 am)

khalhern Member
From: UK Registered: 2011-04-11 Posts: 33

In which directory do I put the input.txt files? I keep getting "No such file or directory" no matter where I put it :(

Arcturus_Red New member
From: India Registered: 2011-04-22 Posts: 2

khalhern wrote:

In which directory do I put the input.txt files? I keep getting "No such file or directory" no matter where I put it :(

C:\users\(your username) should work.

Or you could just cd to whichever directory they're in... Look for a DOS tutorial on google.
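The "No such file or directory" message usually means the file name is being looked up relative to whatever directory the prompt is currently in. A small Python 3 sketch that shows which full path a bare file name resolves to, and whether anything is there (the function name is made up for illustration):

```python
import os


def describe_input(filename, cwd=None):
    """Return the absolute path a bare file name resolves to, and whether it exists."""
    base = cwd if cwd is not None else os.getcwd()
    full_path = os.path.abspath(os.path.join(base, filename))
    return full_path, os.path.isfile(full_path)


if __name__ == "__main__":
    path, found = describe_input("input.txt")
    print(path, "found" if found else "NOT found")
```

If it reports "NOT found", either move the file to the path it prints or cd to the folder that actually holds the file.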

digitlhand Member
From: Los Angeles CA Registered: 2007-12-04 Posts: 98 Website

Arcturus_Red wrote:

Any advice for a simple-minded Windows user?
Typing mecab <arguments> in a Windows command line doesn't do anything (duh). Tried using cacawate-san's script, but python doesn't recognise the mecab module. Tried saving the script in the same folder as the mecab exe's. Didn't help.


EDIT: Just had to add the Mecab folder to the PATH variable. Works perfectly now.

Where does one find the PATH variable?

khalhern Member
From: UK Registered: 2011-04-11 Posts: 33

Hi digitlhand, which OS are you using? In both XP and Windows 7 it's accessible by right-clicking on My Computer and choosing Properties, then:

If it's Windows 7: Click on the link on the right hand side of the properties window that opens that says "Advanced system settings".

Then choose the "Advanced" tab, and at the bottom you'll see "Environment Variables". Click that, and you'll see a list of variables. Find the one that says "Path", and edit it. Add something like: ;C:\MeCab\bin\ (or wherever you have mecab installed).

You need the preceding ";" to divide paths, so if you ever need to add a new one after the mecab one, you need to start that with a ";" as well.
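To double-check the result after editing, a rough Python 3 sketch like the one below splits the variable on the separator and looks for the MeCab folder. The function names are invented for the example and C:\MeCab\bin is just the sample location from above. Open a new command prompt first, since an already-open one keeps the old value.

```python
import os


def split_path(path_value, separator=os.pathsep):
    """Split a PATH-style string into its non-empty entries."""
    return [entry for entry in path_value.split(separator) if entry]


def is_on_path(folder, path_value, separator=os.pathsep):
    """Compare case-insensitively and ignore a trailing slash or backslash."""
    wanted = folder.rstrip("\\/").lower()
    return any(entry.rstrip("\\/").lower() == wanted
               for entry in split_path(path_value, separator))


if __name__ == "__main__":
    print(is_on_path(r"C:\MeCab\bin", os.environ.get("PATH", "")))
```

On Windows os.pathsep is ";", which is the same divider described above.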

Hope that helps somehow!

(Also: )
XP: http://support.microsoft.com/kb/310519
Windows 7: http://www.itechtalk.com/thread3595.html



@Arcturus_Red: For some reason, I tried putting the input file in "users" and got no results... then realised I hadn't cd'd to the directory in DOS (doh!). Thank you SO much for the help - this worked PERFECTLY! I'm so happy :D

Last edited by khalhern (2012 February 12, 8:00 pm)

derDeja New member
Registered: 2010-02-16 Posts: 2

Hello everybody!

I am struggling with this problem (no spaces in Japanese) right now. I read this thread but I cannot figure out what to do. I have Anki with the Japanese model working - it inserts furigana as the reading. So, do I have mecab? It may sound stupid, but I can't remember what I installed to get Anki running as it does now - it was quite a while ago. And btw I am working on MacOS.
Could someone please explain what to do on a Mac to get this Mecab thing running? I don't even know where to look to figure out if Mecab is already installed.
And: Could you possibly translate the DOS command line for Mac users?
I hope it doesn't sound too stupid! I have searched this forum and Google for this problem with LWT and Japanese, but since I am no computer pro I guess I am missing something. I hope you can shed some light on it!

Thank you very much in advance!
Deja

khalhern Member
From: UK Registered: 2011-04-11 Posts: 33

Hi derDeja,

Sorry, but I can't help with how to follow these instructions in Mac OS :(

But if you have the Japanese plug-in installed, there should be a directory called "mecab" within the directory that you installed Anki. Inside the "mecab" directory is a /bin/ directory, which contains the mecab.exe file that you need to run to split the file.

An example in windows would be simply to do something like:

Code:

cd C:\Program Files\Anki\mecab\bin

Then inside that directory (bin), you need an input.txt file, which contains Japanese text (saved to UTF-8), then you can run something like this from the DOS-Prompt:

Code:

mecab.exe -O wakati input.txt -o output.txt

So that the full list of commands looks something like:

Code:

C:\>cd C:\Program Files\Anki\mecab\bin
C:\Program Files\Anki\mecab\bin>mecab.exe -O wakati input.txt -o output.txt

So you'll need to research how to a) change directories in the Mac OS command prompt (I think the "cd" command works?), and b) run programs with arguments (again, it might be exactly the same: simply running mecab.exe -O etc. through the prompt).
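For what it's worth, "cd" does work the same way in the macOS Terminal. To find out whether a mecab program is already reachable from the prompt on any OS, a tiny Python 3 sketch like this can help. It assumes the program is named "mecab" (no .exe on a Mac), and it only searches the folders listed in PATH, so a copy bundled inside another application's folder would not show up.

```python
import shutil


def find_program(name="mecab"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_program()
    print(location or "mecab was not found on PATH")
```

If a path is printed, the same "-O wakati input.txt -o output.txt" arguments shown above can be passed to that program.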

Sorry if that doesn't help much but I hope it does!

derDeja New member
Registered: 2010-02-16 Posts: 2

Dear khalhern!

Thank you very much for your explanations!
I finally found a Unix file called mecab which should do the job. I had tried to find it via Spotlight, the macOS search system, but that didn't find anything. Your hints set me on track, though. Thank you!
Now I have to find out how to run this program via Terminal, the macOS equivalent of DOS. Maybe I will figure it out myself, although I am a little afraid of crashing my system if I fumble too much inside its heart. So any hints about running little programs with arguments in Terminal would still be heartily appreciated.

Kind regards
Deja

Earthlark Member
From: Japan Registered: 2008-12-23 Posts: 25

How do you guys deal with separated words such as verb endings, e.g., 行っ て い ます?  I know for the definition you can expand to multiple words, but this doesn't actually bring the word together as far as I've been able to tell.  So, for example, for "ます", do you just hit "ignore" or "well known"?  The same with "て"?

khalhern Member
From: UK Registered: 2011-04-11 Posts: 33

Actually it kind of sucks, but I just delete by hand. If it's a long text that probably isn't viable, but if you know the root, you might be able to figure out the ending, I guess, because you can always just search individual kanji.

What I've been doing is just searching separated terms without spaces and so on, and just seeing if I can get some results. If I get ones that make sense, then I just click the "Edit Text" button at the top and remove the spaces ;)

I don't know if anyone else has a more productive way to do things, though :o

Earthlark Member
From: Japan Registered: 2008-12-23 Posts: 25

I unchecked the "show all" button and that makes things a bit better, showing the longest saved term.

I guess the best practice may be to just group the different endings: います, いました, etc?
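One way to try that grouping before the text ever reaches LWT is to glue common endings back onto the word in front of them. This is only a rough Python 3 sketch: the ENDINGS set is a hand-picked, hypothetical list, and a blunt rule like this will sometimes join things that should stay apart (い, for instance, is not always part of a verb ending).

```python
# Hypothetical, hand-picked list of endings; extend it for your own texts.
ENDINGS = {"て", "い", "ます", "まし", "た"}


def merge_endings(tokens, endings=ENDINGS):
    """Glue any token found in `endings` onto the token before it."""
    merged = []
    for token in tokens:
        if merged and token in endings:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


if __name__ == "__main__":
    print(" ".join(merge_endings("学校 に 行っ て い ます".split())))
```

Run it on the space-separated output from mecab, then paste the rejoined text into LWT as usual.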

I've just been using the program in conjunction with rikaisama (and EBwin) since it usually indicates the tense of the verb.

Reply #42 - February 17, 8:51 am
Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 799

Can anyone point me in the right direction for a large csv or tsv file with Japanese terms, readings and translations? (to import into LWT)

I'm willing to convert from other formats, but please let me know which source would be a good option.

Reply #43 - February 18, 7:55 am
Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 799

I take it there are none? I guess I'll have to give converting an Anki deck or an Edict file or something a shot. Both seem tricky.

Unfortunately, I noticed that there are some size limitations in the LWT term import feature as well, so importing an entire dictionary file won't be possible even after conversion. But that can be solved by copy/pasting chunks of it.
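Cutting the file into chunks doesn't have to be done by hand. A minimal Python 3 sketch follows. The default chunk size of 1000 lines is an arbitrary guess (I don't know LWT's actual limit), and the ".partN" file naming is made up for the example.

```python
def chunk_lines(lines, chunk_size):
    """Yield successive lists of at most `chunk_size` lines."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def split_file(path, chunk_size=1000, encoding="utf-8"):
    """Write path.part1, path.part2, ... and return the names written."""
    with open(path, encoding=encoding) as source:
        lines = source.readlines()
    names = []
    for number, chunk in enumerate(chunk_lines(lines, chunk_size), start=1):
        name = "%s.part%d" % (path, number)
        with open(name, "w", encoding=encoding) as target:
            target.writelines(chunk)
        names.append(name)
    return names
```

Each part file can then be pasted or uploaded into the term import one at a time.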

Perhaps improving this feature to use some common public domain dictionary formats and allow full size imports, in the future, would be helpful.

Reply #44 - February 18, 8:14 am
RawToast お巡りさん
From: UK Registered: 2012-09-03 Posts: 431 Website

Stansfield123 wrote:

I take it there are none? I guess I'll have to give converting an Anki deck.

I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.

Reply #45 - February 18, 8:49 am
Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 799

RawToast wrote:

I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.

I've found some JLPT (level 2 through 4) vocab files with about 6-7000 entries total. When I imported them, they barely made a dent in my sample text (it was still full of unknown terms). How many more would a Core deck have?

The other problem is that these Anki decks don't have verb stems, they just use the imperfective (the one which often ends in ru) form. Importing them won't help LWT recognize most verbs.

I don't know much about dictionaries, but I assume a dictionary file would have stems? That's gotta be how Rikaisama, Jisho.org etc. recognize verbs, no?

Reply #46 - February 18, 10:08 am
andikaze Member
From: Kumagaya, Japan Registered: 2013-12-15 Posts: 201

I gave up on LWT a long time ago. Such a hassle to install and set up everything including MECAB, and then you have to fiddle around with your words inside the program as well. The author was zero supportive in this matter, nobody seems to know it anyways, so I don't see Japanese with LWT getting any better in the near future (or at all).

TLDR: LWT sucks for Japanese.

If you use the free LINGQ and export your vocab items with Rikaisama into ANKI, you got basically the paid LINGQ with a perfectly configurable SRS.
Or you just read a lot and don't bother with SRS at all, and use Yomichan or Wakan.

Reply #47 - February 18, 3:30 pm
Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 799

andikaze wrote:

I gave up on LWT a long time ago. Such a hassle to install and set up everything including MECAB, and then you have to fiddle around with your words inside the program as well. The author was zero supportive in this matter, nobody seems to know it anyways, so I don't see Japanese with LWT getting any better in the near future (or at all).

TLDR: LWT sucks for Japanese.

If you use the free LINGQ and export your vocab items with Rikaisama into ANKI, you got basically the paid LINGQ with a perfectly configurable SRS.
Or you just read a lot and don't bother with SRS at all, and use Yomichan or Wakan.

It took me less than a minute to install LWT. I didn't install mecab, I just use this: http://nihongo.dpwright.com/spaces/index.php

My only complaint is that I can't find a dictionary file I can easily import into it, because LWT only recognizes csv and tsv, and even that not very well. So someone will just have to write a program to convert EDICT-format or something like that to a csv that works. I might do it, but I'm not sure when I'll get around to it. I'll post it if I do.

Reply #48 - February 18, 10:28 pm
andikaze Member
From: Kumagaya, Japan Registered: 2013-12-15 Posts: 201

Thought you could set up dictionary URLs in LWT somewhere. I used Wadoku.de when I tried it out.

Reply #49 - February 19, 3:52 am
RawToast お巡りさん
From: UK Registered: 2012-09-03 Posts: 431 Website

Stansfield123 wrote:

RawToast wrote:

I guess that core extended deck with 17 or 27k entries? I don't think any of the text analysis files have translations.

I've found some JLPT (level 2 through 4) vocab files with about 6-7000 entries total. When I imported them, they barely made a dent in my sample text (it was still full of unknown terms). How many more would a Core deck have?

I think a Core deck will leave you with the same problem. Core 6k is pretty close to N2 in vocabulary so I doubt you'll get any further -- unless you need all those business/politics terms!

I guess you should start with an edict dictionary file from here instead:

http://www.edrdg.org/jmdict/edict.html

Since the EDICT file is simply a specially formatted text file, it shouldn't be too difficult to change the separator. Notepad++ struggles with the size of the file, so I couldn't really get a look at how verbs are handled.
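As a starting point for that conversion, here is a rough Python 3 sketch. It assumes each line follows the classic EDICT layout, HEADWORD [READING] /gloss/gloss/, with kana-only entries leaving out the [READING] part. The function names and the output column order (term, translation, reading) are my own choices and not something LWT requires. Check the file's encoding before opening it, since the classic file is, as far as I know, distributed as EUC-JP rather than UTF-8.

```python
import re

# One classic EDICT line: HEADWORD [READING] /gloss/gloss/.../
EDICT_LINE = re.compile(r"^(\S+)\s+(?:\[([^\]]*)\]\s+)?/(.*)/\s*$")


def parse_edict_line(line):
    """Return (term, reading, translation) or None if the line does not match."""
    match = EDICT_LINE.match(line)
    if match is None:
        return None
    term, reading, glosses = match.groups()
    # Kana-only entries have no [reading]; reuse the headword.
    return term, reading or term, "; ".join(glosses.split("/"))


def edict_to_tsv(lines):
    """Yield tab-separated rows: term, translation, reading (assumed order)."""
    for line in lines:
        parsed = parse_edict_line(line)
        if parsed:
            term, reading, translation = parsed
            yield "\t".join((term, translation, reading))
```

Lines that don't fit the pattern are skipped, so it's worth comparing the number of rows written against the number of lines read.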

Reply #50 - February 19, 7:39 am
Stansfield123 Member
From: Europe Registered: 2011-04-17 Posts: 799

andikaze wrote:

Thought you could set up dictionary URLs in LWT somewhere. I used Wadoku.de when I tried it out.

You can. Not what I'm trying to do though. I'm trying to import a dict. file into the database itself, using the "import terms" feature.