Back

"Learning With Texts" software + Japanese?

#1
Hi All,

I recently explored Lingq's free account options out of curiosity and found it pretty interesting. One feature that I really, really loved is the highlighting of words being studied as they appear in _any_ future content. Once you mark a word as one you'd like to study, it shows up yellow-highlighted, and if that word appears in any other content you read, it's still highlighted.

While longing for some desktop-based Lingq-like software, I randomly stumbled across "Learning with Texts" via the RtK wiki, and it seemed very promising: an almost complete and even more robust Lingq-like software suite -- for free! I set it up and got it working (a lot easier than the install description makes it seem), and the only major letdown is that it doesn't parse Japanese automatically. So I went from thinking Lingq's Japanese parser really sucks to "at least it works".

Has anyone else explored using Japanese texts with "learning with texts"? Is there any good way to get Japanese texts parsed for use with such software?

Alternatively, is there any interest in (or does there already exist) a browser plugin that maintains a list of "words currently being studied" and highlights them whenever they appear _anywhere_?

It seems, with all the modifications Rikaichan has undergone, that we're _almost_ there, and Rikaichan's parser is pretty awesome...

Anyway, this is really just a longish way of saying I wish there were a plugin that interactively highlighted or de-highlighted the words I was studying across all content -- the current lookup and import-to-Anki features in various software/plugins are sufficient. Using the "Learning with Texts" software would be super awesome for managing audio, keeping a comprehensive archive of documents read, etc., but isn't absolutely necessary...

Here's the link for reference. Thanks for any suggestions!
http://lwt.sourceforge.net/

K.

p.s. For reference and discussion of methods, what I currently do to study and archive my reading is as follows:
1) create a text file for a month of the year
2) any content from the web of interest is copy-pasted into text file
3) read the text file in Yomichan, creating cards on the go.
4) update text file as the days go by
5) at the end of the month, start a new text file (e.g. "2011.08_jp_reading.txt")

I put the URL of each pasted item in, and that's that. Yomichan is really great for integration with Anki, and I like the "non-procrastination" continuous text file thing -- it keeps growing, I keep chipping away at it. Everything I want to read gets read, eventually. (This was my solution to having ten million browser tabs open with Japanese content I wanted to read.)

6) optional step: when reviewing the cards from Yomichan, add a J-J definition, which is one more quick copy-paste via OSX's native Japanese dictionary.
Reply
#2
Reference review @Fluentin3months

http://www.fluentin3months.com/forum/res...-language/

And, barring the possibility that I've idiotically missed an option to make the software work as I wished, I added some "customer feedback" for the software. If enough people vote, maybe it will get some attention!

http://lwt.uservoice.com/forums/124481-c...nese-texts-
Reply
#3
That's the first I've heard of it, but since it's better than Lingq AND open source, this is awesome.

I haven't looked into what it takes to import a text for it, but since it's open source (again, yay!) it shouldn't be too hard to make things work well. If nobody else writes a converter, I'll probably look into it. (There are so many programmers on here that I'll probably be beaten to the punch, though, as I have far too many hobbies.)
Reply
#4
Importing texts is NOT hard at all. The hard part is having the interface parse things properly. As it is now, it either recognizes every character as a single word (which has its uses, but isn't practical) or breaks words on white space. So you either manually enter white space at each word boundary (not practical) or... don't use it for Japanese. Something that utilizes Rikaichan to scan a text and insert spaces, or some (invisible?) delimiter, seems like it would be best.
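
For what it's worth, the "scan a text and insert spaces" idea could be sketched with a greedy longest-match lookup, which is roughly what Rikaichan-style tools do. This is only a toy illustration -- the tiny hard-coded dictionary stands in for a real one (EDICT, or MeCab's lexicon):

```python
# Toy sketch of the space-insertion idea: greedy longest-match against a
# small hand-made dictionary. A real preprocessor would use MeCab or the
# Rikaichan/EDICT dictionary; this word list is purely for illustration.

DICTIONARY = {"私", "は", "日本語", "を", "勉強", "し", "ます"}
MAX_LEN = max(len(w) for w in DICTIONARY)

def insert_spaces(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        # Try the longest dictionary match first, then shorter ones.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in DICTIONARY:
                out.append(chunk)
                i += length
                break
        else:
            # Unknown character: emit it as a one-character "word".
            out.append(text[i])
            i += 1
    return " ".join(out)

print(insert_spaces("私は日本語を勉強します"))  # 私 は 日本語 を 勉強 し ます
```

The output could then be pasted straight into LWT's text import, since LWT already handles space-separated words fine.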

Still an awesome piece of software, would love to see it get going with Japanese development though.
Reply
#5
kodorakun Wrote:Importing texts is NOT hard at all. The hard part is having the interface parse things properly. As it is now, it either recognizes every character as a single word (which has its uses, but isn't practical) or breaks words on white space. So you either manually enter white space at each word boundary (not practical) or... don't use it for Japanese. Something that utilizes Rikaichan to scan a text and insert spaces, or some (invisible?) delimiter, seems like it would be best.

Still an awesome piece of software, would love to see it get going with Japanese development though.
Ah, well that should be easy enough with mecab, I think. Might even be able to hack in Japanese support directly. (PHP is my day job.)
Reply
#6
I think the solution is to use Mecab to add the spaces for the text, and add a style class to the spans for the spaces. Then you can set the style to be hidden to make the spaces disappear. I think that's what Lingq does.
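
A rough sketch of that hidden-space trick: wrap each word in a span, put the separating spaces in their own span with a dedicated class, and hide that class with CSS so the displayed text runs together again. The class and markup names here are made up for illustration; LWT's (or LingQ's) actual HTML will differ:

```python
# Sketch: each word goes in its own <span>; the inserted spaces get a
# separate class ("sp") so CSS can hide them in the rendered page.

def to_spans(segmented: str) -> str:
    words = segmented.split(" ")
    sep = '<span class="sp"> </span>'
    return sep.join(f'<span class="word">{w}</span>' for w in words)

CSS = ".sp { display: none; }"  # makes the inserted spaces invisible

print(to_spans("私 は 日本語"))
```

With that CSS applied, the reader sees 私は日本語 with no gaps, while the software still has clean word boundaries to hang lookups and highlighting on.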
Reply
#7
Does Lingq use Mecab? If so... Maybe an alternate to Mecab would be nice Big Grin

I guess I should appreciate the difficulty of intelligently parsing Japanese text automatically...
Reply
#8
kodorakun Wrote:Does Lingq use Mecab? If so... Maybe an alternate to Mecab would be nice Big Grin

I guess I should appreciate the difficulty of intelligently parsing Japanese text automatically...
It does, and it's still the best... Heh.
Reply
#9
What is it about Rikaichan that enables it to find and define words that Mecab doesn't even notice? It seems the Lingq results leave off the conjugating okurigana, so you get spaces like "食べ[space]た", and then the vocab parser sees the word "食べ" and treats it differently from "食べました", which it somehow doesn't screw up. Rikaichan has that nice de-conjugation, which seems like it would be good to implement in something like this. A rule like "de-conjugate, find word/meaning/conjugation, note definition, highlight -the entire conjugated word-, move on" seems... reasonable? right? Big Grin
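
That rule could be sketched roughly like this. The ending table is a made-up toy (the "た" rule, for instance, is only safe for ichidan verbs); Rikaichan's real de-conjugation table is far larger and handles chained inflections:

```python
# Very rough sketch of rule-based de-conjugation: strip a known ending to
# recover the dictionary form, while keeping the whole conjugated surface
# available so the -entire- word can be highlighted, not just the stem.

RULES = [
    ("ました", "る"),  # polite past:   食べました -> 食べる
    ("ます",   "る"),  # polite:        食べます   -> 食べる
    ("た",     "る"),  # plain past:    食べた     -> 食べる (ichidan verbs only)
]

def deconjugate(word: str) -> str:
    for ending, replacement in RULES:
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)] + replacement
    return word  # no rule matched: assume it's already dictionary form

print(deconjugate("食べました"))  # 食べる
```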
Reply
#10
kodorakun Wrote:Does Lingq use Mecab? If so... Maybe an alternate to Mecab would be nice Big Grin

I guess I should appreciate the difficulty of intelligently parsing Japanese text automatically...
I don't know what Lingq uses; I was talking about the spaces. Somehow the text is parsed and spaces are inserted between the words; then the text can be processed just like any other language, with each word wrapped in an HTML <span>. To make the spaces disappear, they just use CSS to hide the space <span>s.
Reply
#11
Rikai (chan and kun) hedges its bets. It shows you the lookups for the next few characters... and if different lengths match, it'll show you multiple results.

Mecab has to pick one and stick with it, because it needs to parse the whole document non-interactively.

I'm not saying Mecab couldn't be improved... but I don't think Rikai is doing anything special that would help.
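
To make the difference concrete: an interactive tool can return every dictionary hit among the prefixes of the next few characters, while a batch parser must commit to exactly one segmentation. The three-entry dictionary here is purely illustrative:

```python
# Sketch of Rikai's "hedge its bets" behavior: look up every prefix of the
# next few characters and return ALL dictionary hits, rather than picking
# one segmentation the way a non-interactive parser like MeCab must.

DICTIONARY = {"日": "sun; day", "日本": "Japan", "日本語": "Japanese language"}

def lookup_prefixes(text: str, max_len: int = 6):
    hits = []
    for length in range(1, min(max_len, len(text)) + 1):
        prefix = text[:length]
        if prefix in DICTIONARY:
            hits.append((prefix, DICTIONARY[prefix]))
    return hits

print(lookup_prefixes("日本語を勉強"))
# [('日', 'sun; day'), ('日本', 'Japan'), ('日本語', 'Japanese language')]
```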
Reply
#12
I see. Thanks for the explanation... I guess I'd be happy with Mecab set up to parse any text files I give it. Space insertion would do, I s'pose, and then this program would be good to go as is. Maybe I'll look into that tomorrow.

Of course, if you want to do a smooth Mecab integration, that's fine too Big Grin The nice thing about it running in a web browser is that Rikaichan/kun still function on top of all the built-in dictionary lookup stuff too.

Of course, I still encourage people to click a vote+ for my suggestion to the original developer to enhance the Japanese parsing Tongue
Reply
#13
Rikaisan, I suspect, works similarly to Breen's sentence glossing tool: http://forum.koohii.com/showthread.php?p...#pid141670

I mentioned it there for the purposes of getting an offline version in the works, for the super-reader tool I hope we'll eventually have, which would include both glossing functions and formatting based on known.db information.

For more rambles on that topic and on the topic of a browser plugin that uses known.db (or the possibilities of some kind of hack involving the FoxReplace and Vocabulary Highlighter extensions): http://forum.koohii.com/showthread.php?p...#pid146276

Edit: By the way, feel free to add more explanations/usage tips/limitations to the LWT wiki entry, if you think they'd be useful: http://rtkwiki.koohii.com/wiki/Learning_with_Texts

Oh, and there's also this, for spaces: http://forum.koohii.com/showthread.php?p...#pid141729 (Edit #2 + link therein) and I think Boy.pockets made something that adds spaces: http://forum.koohii.com/showthread.php?p...#pid145350
Edited: 2011-08-23, 12:32 pm
Reply
#14
Well, I'm not sure if this is what you're looking for, but if you're familiar with Visual Novels, you can play those with a special program named Translation Aggregator that hooks the text and does all kinds of neat things, like giving you definitions and furigana via Mecab. I've actually just been reading VNs with this program for the last two weeks or so; my reading speed has significantly increased and I've also picked up a lot of new vocab. Then again, I prefer to learn through more practical means than just Anki reps.
Reply
#15
Well Mecab install broke on OSX, almost made it though... Guess I'll keep goin' with Yomichan for now until I make the time to check out some of these other suggestions. Hopefully the developer will respond to my request. Thanks for the support votes and comments!
Reply
#16
kodorakun Wrote:Well Mecab install broke on OSX, almost made it though... Guess I'll keep goin' with Yomichan for now until I make the time to check out some of these other suggestions. Hopefully the developer will respond to my request. Thanks for the support votes and comments!
If you have Anki installed and automatic reading generation works, then you already have Mecab installed. It's probably (just guessing) inside Applications/Anki. You'll need to look inside the package.
Reply
#17
Yeah, I'm guessing Anki uses the Python binding of Mecab, but I'm not 100% sure. I did a quick glance and didn't see anything too obvious. I'll try installing the Python binding myself too... But then I'll still have to figure out how to make Mecab take a text file as input and do nothing other than spit out the text with spaces, since the lookup and defining will all be integrated in LWT. If anyone knows the command line or option, it'd be appreciated Big Grin
Reply
#18
It looks like Anki just spawns a process, this will probably help:
https://github.com/dae/ankiplugins/blob/...reading.py
Reply
#19
Hi Travis, thanks for the tip. Actually, I spent a bit of time reading Japanese install pages and finally got the official release (not the Python binding) running all right.

I now have it set up so that I copy whatever into a text file and Mecab spits the file back out with spaces. I push it into LWT and it works great! One feature I didn't realize it had: unlike LingQ, LWT lets you augment the term you're trying to define by clicking on it.

As an example, if some common word gets split because the root is recognized, you just click the root and there's an option to start concatenating the following characters. It will then glom those together for the definition. It's pretty sweet. The automatically generated cloze testing and export to Anki are also awesome.

Pretty happy atm.
Reply
#20
Well don't keep us in suspense, tell us how you did it. ;D
Reply
#21
It's 2am here, meeting at 7am... must go to bed. The terse answer: I read the instructions carefully, opted for the UTF-8-only mode (instructions therein), and the "insert spaces" command is something like "./mecab -O wakati input-file.txt -o output-file.txt"

After that it's all LWT. Will post tomorrow after my meeting if any more help/comments needed.

file://mecab-0.98/doc/index.html (part of package download)

http://xn--p8ja5bwe1i.jp/wiki/%E3%83%86%...%E3%81%86/

(Sorry, that's a whack-ass URL; it's the first URL I've seen that's actually a Japanese domain name in hiragana... weird.)
Reply
#22
You're forgiven. Go get some rest!
Reply
#23
I return! Have you tried to get it set up at all? The author of LWT shot down my request for a Rikaichan-like extension, but I'm OK with that now that things are looking up.

(http://lwt.uservoice.com/forums/124481-c...ese-texts-)

He suggested that one import a list of known terms if that is desired -- I guess it would have an identical effect and seems pretty useful. Now that I'm slightly more familiar with the structure of the word tagging and grouping, I'm more appreciative of the system.

The documentation reads (from the "Import Terms" section):
"Import a list of terms for a language, and set the status for all to a specified value. You can specify a file to upload or type/paste the data directly into the textbox. Format: one term per line, fields (columns) are separated either by comma ("CSV" file, e.g. used in LingQ as export format), TAB ("TSV" file, e.g. copy and paste from a spreadsheet program, not possible if you type in data manually) or # (if you type in data manually). "
(http://lwt.sourceforge.net/#howto)
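
Based on that description, generating such an import file from a word list is trivial. A minimal sketch using tab-separated ("TSV") columns of term and translation -- the exact set and order of fields is whatever LWT's import screen expects, so check there first:

```python
# Sketch: build an LWT-style "Import Terms" file, one term per line,
# columns separated by TAB. The (term, translation) layout is illustrative;
# LWT's docs say CSV, TSV, or '#'-separated values are all accepted.

known_terms = [
    ("食べる", "to eat"),
    ("飲む", "to drink"),
]

tsv = "\n".join("\t".join(columns) for columns in known_terms)
print(tsv)
```

Writing `tsv` to a UTF-8 text file and uploading it (or pasting it into the textbox) should mark the whole list at the chosen status in one shot.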

So excited about this software...
Reply
#24
Have you compiled such a list? I would like to create a multi-user environment on my site for this, and I'd like to have these lists for Japanese, Korean, Thai, etc. for the users to be able to easily bypass this annoyance.

Would you like to brainstorm this with me? Please email me if you're interested.
Edited: 2011-09-20, 8:05 pm
Reply
#25
kodorakun Wrote:I see. Thanks for the explanation... I guess I'd be happy with Mecab setup to parse any text files I give it. Space-insertion would do, I s'pose, and then this program would be good to go as is. Maybe I'll look into that tomorrow.
If you're just looking to do space insertion between words, kakasi does a pretty good job with

$ kakasi -w

I have an alias I use for this in bash (Mac OS version):

alias add_spaces='cat | iconv -t EUC-JP | kakasi -w -ieuc -oeuc | iconv -f EUC-JP'

Hope this helps,

CJ
Reply