
To all the programmers

#1
I'm trying LingQ and I really like the fact that new words are highlighted in blue while the words you're learning are highlighted in yellow. But, as you already know if you've tried LingQ, it treats every conjugation as a separate word. So I was thinking of building something similar, but without this drawback.
The thing is, I don't understand how it works for Japanese behind the scenes. If someone could help me understand this, it would be much appreciated :)
Here are the points that puzzle me:

1) When you import a text, it segments the words in a default way, I think with a MeCab extension. So far, no problem. But it also gives you the option to modify this default behavior.

For example, if I import a text containing the word "飲みます", LingQ parses it as "飲み" + "ます", i.e. as two words. But if I select the two words "飲み" and "ます" together, it gives me the option to treat them as a single word and add a lingq for it. So I add a lingq for "飲みます", and from that moment on, every other instance of "飲み ます" in that and other "lessons" is automatically treated as a single word instead of two. How is that possible if every lesson is parsed via MeCab (at least, that's my assumption, but maybe I'm wrong)? How does MeCab know that from now on it must parse that string as a single word? Does it rely on a glossary or something? When you add a lingq that changes the default way a word is parsed, does it modify the entry for that word in the glossary so that from then on MeCab will parse it that way?

It wouldn't be difficult to build something similar that treats every conjugation as a single word, thanks to MeCab's ability to deconjugate words (or am I remembering wrong?), but I would also like to give the user the option to modify the way a word is parsed, as LingQ does, because by default MeCab breaks words in a way that I think is not optimal... So, if anyone knows the mechanism behind this, it would be awesome :D
#2
Another thing I noticed: when I parse a word like "聞きませんでした", it gets split into "聞き" "ませ" "ん" "でした".
Then I select the whole word and create a lingq as if it were a single word, and what I described in the previous post happens. But behind the scenes it still parses it as four different words, as shown in the HTML:

<span id="w-0" class="word word-918660 unknown">聞き</span>
<span id="w-1" class="word word-895913 unknown">ませ</span>
<span id="w-2" class="word word-690879 unknown">ん</span>
<span id="w-3" class="word word-755882 unknown">でした</span>

But in the text it still shows the word as a single yellow lingq. However, if I add the word "聞き" on its own, without the rest of the conjugation, it keeps showing the word as a new (unknown) blue word, and in the HTML the word is tagged with the same id (in this case 918660).

So it obviously keeps parsing the text the standard MeCab way, but afterwards there is a sort of second pass over the text, where it searches for custom lingqs.
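That second-pass hypothesis is easy to prototype. A minimal sketch (my own guess at the mechanism, not LingQ's actual code): take the tokens MeCab produced and greedily merge any adjacent run whose concatenation matches a phrase the user has saved, preferring the longest match.

```python
def merge_saved_phrases(tokens, saved_phrases):
    """Greedily merge adjacent MeCab tokens whose concatenation
    matches a user-saved phrase (longest span first)."""
    merged = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest possible span starting at position i.
        for j in range(len(tokens), i, -1):
            if "".join(tokens[i:j]) in saved_phrases:
                match_end = j
                break
        if match_end:
            merged.append("".join(tokens[i:match_end]))
            i = match_end
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["聞き", "ませ", "ん", "でした"]
print(merge_saved_phrases(tokens, {"聞きませんでした"}))
# → ['聞きませんでした']
```

With an empty phrase set the tokens pass through untouched, which matches the observed behavior: the raw MeCab segmentation (and its span ids) survives in the HTML, and only the display layer regroups it.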

Any suggestion?

EDIT:

Apparently it has a feature to add phrases:

[LingQ screenshot]

In light of this, it would be a pain in the axx to make it recognize every different conjugation of a word as belonging to that single word.
Edited: 2014-10-12, 8:43 am
#3
I'm not sure what you are trying to do. Are you trying to create a new program like LingQ, but with different features, or are you trying to modify LingQ? The first is obviously possible, but probably more challenging than you think; the second isn't possible: you can only change how stuff appears on the page, not how it is processed by LingQ in the background (unless they give you access to fine-tune the settings).
#4
vix86 Wrote:I'm not sure what you are trying to do. Are you trying to create a new program like LingQ, but with different features or are you trying to modify LingQ? The first is obviously possible, but probably more challenging than you think, the second isn't possible; you can only change how stuff appears on the page, not how it is processed by LingQ in the background (unless they give you access to fine tune the settings).
I wanted to do the first, but as you said it's too much work... I don't understand why they don't offer an option for this... I've read a lot of complaints from people who don't subscribe solely because, for example, 読まない is treated as a word distinct from 読みます. If all those words are in a database, why not add a field with a unique id pointing to a row in another table where the dictionary form of the word is stored? That way, without needing to modify anything else, when someone tags 読まない, it automatically looks up the dictionary form and also tags every other conjugation of that word. Maybe as an option each user can toggle, so those who like the current behavior can keep it, and vice versa. But above all, why am I writing this here? xD
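The schema being proposed here is just a one-to-many relation from a lemmas table to a forms table. A minimal sketch with sqlite3 (table and column names are my own invention, purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemmas (
    id              INTEGER PRIMARY KEY,
    dictionary_form TEXT UNIQUE
);
CREATE TABLE forms (
    surface  TEXT PRIMARY KEY,
    lemma_id INTEGER REFERENCES lemmas(id)
);
""")

conn.execute("INSERT INTO lemmas (id, dictionary_form) VALUES (1, '読む')")
conn.executemany(
    "INSERT INTO forms (surface, lemma_id) VALUES (?, 1)",
    [("読まない",), ("読みます",)],
)

# Tagging 読まない finds its lemma_id, and through it every stored form,
# so all conjugations can be marked "known" in one step.
rows = conn.execute("""
    SELECT f2.surface
    FROM forms f1
    JOIN forms f2 ON f1.lemma_id = f2.lemma_id
    WHERE f1.surface = '読まない'
""").fetchall()
print(sorted(r[0] for r in rows))
# → ['読まない', '読みます']
```

The self-join is the whole trick: one lookup by surface form fans out to every sibling conjugation of the same lemma.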
#5
In the end I'm doing it, but I have a problem.
In order to treat different forms of the same morpheme as a single word, I use MeCab to obtain the dictionary form of the word. The problem is that MeCab doesn't always give the correct morpheme. For example, if I give "嫌わなかった" or "嫌ったり" as input, it returns the morpheme "嫌う", and the same goes for ALMOST all the forms of "嫌う".

One exception is, for example, "嫌いたくなくて", which it breaks into "嫌; いたく; な; くて". So, without going into detail, since I'm using the first part to obtain the base morpheme, I pass "嫌" to MeCab with the lattice option, but obviously it gives me "嫌" back as the base morpheme.

But if I pass "嫌いたくなくて" to MeCab with the allmorph option, it gives this output:

嫌い 動詞,自立,*,*,五段・ワ行促音便,連用形,嫌う,キライ,キライ
嫌 名詞,形容動詞語幹,*,*,*,*,嫌,イヤ,イヤ
いたく 形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,いたい,イタク,イタク
いた 動詞,自立,*,*,五段・ラ行,体言接続特殊2,いたる,イタ,イタ
い 動詞,自立,*,*,一段,連用形,いる,イ,イ
たく 動詞,自立,*,*,五段・カ行イ音便,基本形,たく,タク,タク
た 動詞,非自立,*,*,五段・ラ行,体言接続特殊2,たる,タ,タ
く 動詞,非自立,*,*,カ変・クル,体言接続特殊2,くる,ク,ク
なく 動詞,自立,*,*,五段・カ行イ音便,基本形,なく,ナク,ナク
な 動詞,非自立,*,*,五段・ラ行,体言接続特殊2,なる,ナ,ナ
く 動詞,非自立,*,*,カ変・クル,体言接続特殊2,くる,ク,ク
て 動詞,自立,*,*,五段・ラ行,体言接続特殊2,てる,テ,テ
So it obviously is capable of breaking "嫌いたくなくて" as "嫌い; ...たくなくて", and if it broke it like that I would be able to use "嫌い" to obtain the base morpheme, which is "嫌う".

But for some reason, for that form of the word, MeCab prefers to assign the い to "いたく", due to dictionary priorities, I suppose.

Evidently いたく has a higher priority than 嫌い in the ipadic dictionary. I've read someone suggesting to obtain the dictionary source and modify its entries to change the priority of certain words. A solution would be to collect all the endings (not the right term, but I don't know what to call them) for which MeCab messes up the word-breaking process (such as "いたくなくて") and put them at the end of the priority list, so it will always prefer to attach the okurigana to the kanji. (In my example I used a na-adjective, but the same thing also happens with some forms of verbs and i-adjectives.)
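Rather than editing ipadic itself, MeCab supports user dictionaries: a CSV of entries compiled with mecab-dict-index, where a lower cost makes an entry win against competing segmentations. A sketch that just builds one such CSV line (the cost value 100 is an arbitrary illustrative choice; as I understand it, the context-ID columns can be left empty for mecab-dict-index to fill in, but verify against the MeCab docs):

```python
def user_dict_entry(surface, pos_fields, base, reading, cost=100):
    """Build one CSV line for a MeCab user dictionary:
    surface,left_id,right_id,cost,POS...,base,reading,pronunciation.
    Context IDs are left empty for mecab-dict-index to assign;
    a low cost biases MeCab toward this segmentation."""
    return ",".join(
        [surface, "", "", str(cost)] + pos_fields + [base, reading, reading]
    )

entry = user_dict_entry(
    "嫌い",
    ["動詞", "自立", "*", "*", "五段・ワ行促音便", "連用形"],
    "嫌う",
    "キライ",
)
print(entry)
# → 嫌い,,,100,動詞,自立,*,*,五段・ワ行促音便,連用形,嫌う,キライ,キライ
```

The CSV would then be compiled into a binary user dictionary and loaded alongside the system dictionary, which avoids touching the ipadic source at all.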

But I don't know where to find the ipadic source, or whether it's possible to modify it without the source (it seems the words are contained in a .iso file; maybe it would be enough to just decompress it).

Any suggestion? Please xD

EDIT: it's not an iso file but a bin file :/ EDIT2: wait, MeCab ships the dictionary source by default, so you can recompile it in different formats (utf8 etc.), or at least it seems so...
Edited: 2014-10-14, 10:56 am
#6
OK, so apparently the only solution is to remove from the dictionary all the words which cause this issue, like "いたく" in my previous example. Could someone kindly tell me how I can find all possible conjugations for every kind of Japanese word type? Like for every kind of verb (ru-verbs, u-verbs with different endings, etc.)?
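The regular conjugation classes can also be generated rather than enumerated by hand. A toy sketch for just the plain negative (nai-form), covering ichidan verbs and the godan u-row endings, with the う→わ irregularity handled explicitly (extending this to every conjugation and to i/na-adjectives is more of the same table-driven work):

```python
# Dictionary-form ending of a godan verb mapped to its a-row kana.
GODAN_A_ROW = {
    "う": "わ", "く": "か", "ぐ": "が", "す": "さ", "つ": "た",
    "ぬ": "な", "ぶ": "ば", "む": "ま", "る": "ら",
}

def negative(verb, kind):
    """Plain negative (nai-form) for 'ichidan' and 'godan' verbs."""
    if kind == "ichidan":
        # 食べる -> 食べない: drop る, append ない.
        return verb[:-1] + "ない"
    if kind == "godan":
        # 嫌う -> 嫌わない, 飲む -> 飲まない: shift ending to a-row.
        return verb[:-1] + GODAN_A_ROW[verb[-1]] + "ない"
    raise ValueError(f"unknown verb kind: {kind}")

print(negative("嫌う", "godan"))     # → 嫌わない
print(negative("飲む", "godan"))     # → 飲まない
print(negative("食べる", "ichidan"))  # → 食べない
```

Irregulars (する, 来る) would still need a hand-written exception table, and you also need a verb list annotated with its class, which ipadic's conjugation-type field (e.g. 五段・ワ行促音便, 一段) already provides.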
#7
Did you ever end up making progress on this tool?
#8
I just started using LingQ today (based on Stephen Krashen's recommendation) and now I finally have a clue what you're talking about. A couple of notes.

Raw MeCab is certainly a bit of a pain to use. I recommend http://beta.jisho.org. Running it on "嫌いたくなくて" gives me a *single* word, based on "嫌う", and Jisho will helpfully tell you that "嫌いたくなくて looks like an inflection of 嫌う, with these forms: Nai-form. It indicates the negative form of the verb" (multiple conjugations are handled well).

Beta.jisho uses MeCab on the backend, but adds (1) a custom dictionary plus lots of other text, Wikipedia entries, etc.; (2) Ve, also written by Kimtaro, which combines the morphemes output by MeCab into higher-level "words"; and (3) some secret-sauce conjugation analyzer that Kimtaro will open-source once beta.jisho moves out of beta (this is the part that can tell you "X looks like an inflection of Y with these forms...").

If you use beta.jisho.org, does that obviate the need to hack existing dictionaries or create new ones? (Though those are not hard to do, I can point you to some resources to do them.)

As for how LingQ knows to group together words you've already marked as "phrases", I imagine that's just a very simple regular-expression text search: text that matches your pre-defined phrases overrides the segmentation suggested by MeCab.

cophnia61 Wrote:In the end I'm doing it, but I have a problem.
What did you wind up doing?

Also, what lessons are you doing on LingQ? I'm looking for *riveting* content and mainly find kinda boring stories... I may need to lower my standards :).
#9
Are there other problems with Lingq for Japanese that are waiting for me?

I think there's benefit in writing a LingQ clone just for Japanese to fix this conjugation problem. And so I can upload copyrighted content xD though what are the odds that Shueisha would give us a free or low-cost license to use One Piece content for educational purposes?
#10
The One Piece mangaka and publisher have always been very loose about uses of the franchise. Instead of locking down people who use images of One Piece, they let them. They figure the exposure will bolster the series' popularity and bring people to read and buy the manga.
#11
vix86 Wrote:The One Piece mangaka and publisher has always been very loose with usages of the franchise. Instead of locking down people who use images of One Piece, they let them. They figure the exposure to the series will bolster the popularity and bring people to read and buy the manga.
Thanks for that, I didn't know! Is this also true of the US publisher?

(Of course, the whole unlicensed use of copyrighted material was a joke, but this is interesting.)
Edited: 2015-01-16, 10:51 pm
#12
No idea about the US publisher. They own the license in the US, so they can do whatever they wish unless there were stipulations in the licensing agreement.