
Japanese NLP Libraries

#1
Long story short: I’m wondering what lightweight Japanese NLP libraries people are using in apps (preferably Java-based). 

Basically, of the ones I've tried, none are able to find the base forms of Godan verbs accurately (usually for imperative and potential conjugations). I wouldn't mind this limitation for what I'm doing, but no apps I've used seem to have any trouble like this, so it's making me curious. Maybe I'm searching for the wrong terms or misunderstanding something.

I'll list the libraries I've tried below from searching for terms like tokenizer, morphological analyzer, stemmer, 形態素解析. None of them seem to handle the Godan issue.  

Sanmoku has been stale for the past 4 years, but it runs well on Android and tokenises pretty accurately. 

MeCab seems to be one of the most popular libraries, but it's written in C++.

Kuromoji looked good but seems to be too heavy for Android. 

I also found Gosen and Igo, but they didn’t seem suitable for Android.
Reply
#2
Something that might be worth exploring is how difficult it would be to port MeCab to an NDK library for Android. You'd have to check whether the Android C headers are sufficient for MeCab's requirements. You could then import the library and access it through some JNI wrappers you wrote.
Reply
#3
(2016-02-27, 12:00 am)vix86 Wrote: Something that might be worth exploring is how difficult it would be to port Mecab to a NDK .lib for android. You'd have to check if the android C headers are sufficient for Mecab's requirements. You could then import the library and access it using some NDK wrappers you made.

Apparently there are some Java bindings available, but I don't know how it would perform. MeCab also has the same issues if this online version is anything to go by. For example, the output of 待てる is the same as Sanmoku's output. The -eru endings seem to be mistaken for ichidan verbs, but like I said I'm not sure if this is technically correct or not.

inflection: "一段,基本形",
baseform:   "待てる"


I haven't had time to do much research, but these tools are "trained" with special dictionaries. The most popular and performant seems to be IPADIC. The godan issue could be a limitation of IPADIC rather than of the tools themselves. Here's Kuromoji's output using four of the available dictionaries. It looks like Jumandic can detect the potential form, but you pay for it with an extra 8MB on the JAR (and probably in memory too). 

Ipadic     (13MB) [動詞, 自立, *, *, 一段, 基本形, 待てる, マテル, マテル]
Jumandic   (21MB) [動詞, *, 母音動詞, 基本形, 待てる, まてる, 代表表記:待てる/まてる 可能動詞:待つ/まつ]
Naist-jdic (16MB) [動詞, 自立, *, *, 一段, 基本形, 待てる, マテル, マテル, まてる/待てる/待る, ]
Unidic     (46MB) [動詞, 一般, *, *, 下一段-タ行, 終止形-一般, マツ, 待つ, 待てる, マテル, 待てる, マテル, 和, *, *, *, *]
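In case it helps anyone reading along: with output like the above, pulling out the base form is just a fixed index into the feature array. A minimal sketch, using the IPADIC field layout shown in the Ipadic line above (the class and method names are made up for illustration):

```java
public class FeatureDemo {
    // IPADIC convention: index 4 = conjugation type, index 6 = base form.
    static String baseForm(String[] features) {
        return features[6];
    }

    static String conjugationType(String[] features) {
        return features[4];
    }

    public static void main(String[] args) {
        // Copied from the Ipadic output for 待てる above.
        String[] f = {"動詞", "自立", "*", "*", "一段", "基本形", "待てる", "マテル", "マテル"};
        System.out.println(baseForm(f));        // 待てる
        System.out.println(conjugationType(f)); // 一段 (wrongly ichidan, per this thread)
    }
}
```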
Edited: 2016-02-27, 1:25 am
Reply
#4
Have you looked around on the Google IME website to see whether they mention any licenses for NLP packages/software? That might be a good place to look for hints on whether they are using anything in their IME. If you can't find anything, then it's probably home-grown.

I wouldn't worry about the memory requirements too much. For something like NLP, you'll probably need a dictionary of some sort anyway, depending on what you're doing.
Reply
#5
If worst comes to worst and none of the available libraries do what you want, you could use a dictionary that gets it right (maybe EDICT, even if it's not the best) to automatically generate the entries you need.

I tried the imperative forms of a few words on the online MeCab site you linked, and all of them were correctly identified. The potential forms of godan verbs, however, came back as ichidan, just as you said. So I looked around a bit and found this explanation of the potential form that might be relevant here (emphasis mine):
Quote:The other is the potential form of the verb, typically constructed by adding られる to the base (一段 verbs) or adding る to a potential base formed by changing the last kana of the plain form from the う column to the え column (読む to 読める, 話す to 話せる, etc.) (五段 verbs). The potential form is itself an 一段 verb, and is an intransitive verb, generally taking the particle が instead of を.

   この本を読むことが出来ますか。 (Can you read this book?)
   but
   この本が読めますか。(Can you read this book?)

The special class of single-kanji する verbs (害する, 愛する, etc.), which are marked as "vs-s" in WWWJDIC, have a potential form which ends in either しえる or しうる. The two are inter-changeable, (however just to complicate matters, the しうる form is not 五段, and does not inflect further; other inflections must be derived from the しえる form.) Some of these verbs also have a せる potential, which denotes capability to carry out something, and appears to be rather colloquial.
Source: http://nihongo.monash.edu/wwwverbinf.html
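The 五段-to-potential derivation described in that quote (last kana moves from the う column to the え column, then add る) is mechanical enough to sketch in a few lines. This is a hand-rolled illustration, not code from any of the libraries discussed, and the kana table only covers the common godan endings:

```java
import java.util.Map;

public class PotentialForm {
    // う-column → え-column mapping for common godan verb endings.
    private static final Map<String, String> U_TO_E = Map.of(
            "う", "え", "く", "け", "ぐ", "げ", "す", "せ", "つ", "て",
            "ぬ", "ね", "ぶ", "べ", "む", "め", "る", "れ");

    // Derive the potential form of a godan verb: replace the final う-column
    // kana with its え-column counterpart and append る (読む → 読める).
    static String potential(String godanPlainForm) {
        String last = godanPlainForm.substring(godanPlainForm.length() - 1);
        String stem = godanPlainForm.substring(0, godanPlainForm.length() - 1);
        return stem + U_TO_E.get(last) + "る";
    }

    public static void main(String[] args) {
        System.out.println(potential("読む")); // 読める
        System.out.println(potential("話す")); // 話せる
        System.out.println(potential("待つ")); // 待てる
    }
}
```

Going the other direction (待てる back to 待つ) is the hard part, of course, since -eru surface forms are ambiguous between true ichidan verbs and derived potentials.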

Quote:Grammatically, the potential auxiliary 「れる」 attaches to the 未然形 (irrealis form) of 五段 verbs and サ行変格 verbs (in the latter case the result is 「される」, so alongside the 「し」 that precedes 「ない」, 「さ」 is also treated as a 未然形), while 「られる」 attaches to the 未然形 of 上一段, 下一段, and カ行変格 verbs. So for 「取る」 the correct expression would be 「取られる」, and for 「見る」, 「見られる」. However, hardly anyone says 「あの山では松茸が取られる」; normally you would say 「取れる」. Grammatically, 「取れる」 is not treated as a verb plus auxiliary but as a single verb, a 可能動詞 (potential verb). Shifting a 五段 verb to 下一段 conjugation produces a potential verb. In the case of 「読める」, for example, it is not that a 「ら」 dropped out of some 「読らめる」, so there is no choice but to treat it as a single verb. 「取れる」 looks at first glance like 「取られる」 with the 「ら」 dropped, but just as 「読める」 arose from 「読む」 shifting to 下一段, 「取れる」 is 「取る」 shifted to 下一段, so it is not a ら抜き ("ra-dropped") word. The judgment that 「取れる」 is acceptable while 「見れる」 is not rests on exactly this reasoning: 「見れる」 attaches 「れる」 where 「られる」 should attach to the 未然形 「見」 of a 上一段 verb, so it is considered a ら抜き error.
Source: http://shouyouki.web.fc2.com/ranuki2.htm

And directly from the dictionary, which I just noticed after searching for other things:
Quote:A verb whose 五段 (四段) conjugation shifted to 下一段 conjugation to express potential meaning; for example, 「読める」 and 「書ける」 for 「読む」 and 「書く」. It has no imperative form. It arose in early modern Edo speech and gradually spread from the Meiji period onward.
Source: https://kotobank.jp/word/%E5%8F%AF%E8%83...%9E-465488

So what MeCab etc. return seems to be technically "correct", just not what you want, and not in a form that helps you determine whether something is a potential verb or not.

Here's a potential solution:
http://nlp.ist.i.kyoto-u.ac.jp/index.php...page=JUMAN

The Juman dictionary even makes it clear whether something is a potential verb or not. It will still return that it's an ichidan verb (母音動詞, "vowel verb") rather than a godan verb (子音動詞, "consonant verb"), but at the same time it flags it as a 可能動詞 in the comment field.
Quote:C:\Program Files\juman>juman.exe < "C:\Users\Owner\Desktop\test.txt"
待てる まてる 待てる 動詞 2 * 0 母音動詞 1 基本形 2 "代表表記:待てる/まてる 可能動詞:待つ/まつ"
EOS
待つ まつ 待つ 動詞 2 * 0 子音動詞タ行 6 基本形 2 "代表表記:待つ/まつ"
EOS
待った まった 待つ 動詞 2 * 0 子音動詞タ行 6 タ形 10 "代表表記:待つ/まつ"
EOS

Juman's output is a little cryptic (no idea what all of those numbers mean), but it has a fairly thick manual if you want to look into that.
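For what it's worth, recovering the godan base from that comment field looks straightforward. Here's a sketch of parsing the 可能動詞 tag out of a comment string like the one in the output above; the class and method names are made up, and only the field format shown above is assumed:

```java
public class JumanComment {
    // Pull the original godan verb out of Juman's comment field, e.g.
    // "代表表記:待てる/まてる 可能動詞:待つ/まつ" → 待つ.
    // Returns null when the token isn't flagged as a potential verb.
    static String potentialBase(String comment) {
        for (String tag : comment.split(" ")) {
            if (tag.startsWith("可能動詞:")) {
                // The value looks like 待つ/まつ (surface/reading); keep the surface.
                return tag.substring("可能動詞:".length()).split("/")[0];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(potentialBase("代表表記:待てる/まてる 可能動詞:待つ/まつ")); // 待つ
        System.out.println(potentialBase("代表表記:待つ/まつ")); // null
    }
}
```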

Juman itself isn't suitable for your purpose, but while searching about MeCab and 可能動詞 I found a tool that says it can convert Juman's dictionary to MeCab format: http://hayashibe.jp/tr/mecab/dictionary/juman


I hope some of that helps you out.
Edited: 2016-02-27, 6:23 am
Reply
#6
If you don't mind GPL virality, you could use Rikaichan's approach of mapping inflections to stem and verifying the results with a dictionary. Here's a straight port of Rikaichan's code in Java: https://github.com/ispedals/FloatingJapa...einflector with DeInflector.java containing the main logic.
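In miniature, that approach is just a table of (ending → replacement) rules applied to the surface form, with every candidate checked against a dictionary. A toy sketch follows; the four-rule table and three-word "dictionary" are stand-ins I made up, not Rikaichan's actual deinflect.dat or EDICT:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Deinflect {
    // Toy rule table: inflected ending → dictionary-form ending.
    // The real deinflect.dat has hundreds of rules plus POS constraints.
    private static final Map<String, String> RULES = Map.of(
            "てる", "つ",   // potential: 待てる → 待つ
            "める", "む",   // potential: 読める → 読む
            "った", "つ",   // past: 待った → 待つ
            "ない", "る");  // negative: 食べない → 食べる (ichidan only; toy rule)

    // Toy dictionary standing in for EDICT.
    private static final Set<String> DICT = Set.of("待つ", "読む", "食べる");

    static List<String> deinflect(String surface) {
        List<String> hits = new ArrayList<>();
        if (DICT.contains(surface)) hits.add(surface); // already a base form
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (surface.endsWith(rule.getKey())) {
                String candidate = surface.substring(0,
                        surface.length() - rule.getKey().length()) + rule.getValue();
                // Only keep candidates the dictionary confirms are real words.
                if (DICT.contains(candidate)) hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(deinflect("待てる")); // [待つ]
        System.out.println(deinflect("読める")); // [読む]
    }
}
```

The dictionary check is what keeps the rule table small: rules can massively overgenerate, and bogus candidates are simply discarded.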
Reply
#7
Thanks for all the input. balloonguy's suggestion sounds very promising, so I think I'm going to look into that. Using a tokenizer for this task was probably overkill in the first place, although I need that library for other things.
Reply
#8
Have you looked into how Rikaichan (and the Chrome/Safari forks) do it? They're JavaScript-based, so it seems unlikely they're using some huge C lexical library.
Reply
#9
Rikaichan uses MeCab, I believe.
Reply
#10
I don't see MeCab in the package, and this seems to be the custom code they use for deinflection. That deinflect.dat is a pretty tiny dataset as well. However, that code is all GPL, so jimeux probably shouldn't look at it, but Rikaichan does get it done with a small deinflection database and a relatively small amount of JavaScript code.
Reply
#11
Oh crap, sorry. You are 100% right. I was thinking of the furigana adder in Anki, which runs on MeCab, I think.

@OP:
Another option might be to write a web server and deliver your results to the app. You'd have to host it somewhere, but there are free hosting services (Heroku, Google App Engine) that can run Java, Python, Go, etc. If you don't need sub-50 ms response times, this could be another option.
Reply
#12
The code I linked to is a straight JavaScript-to-Java port of Rikaichan's deinflection algorithm with deinflect.dat embedded. Even though deinflect.dat is small, it uses all of EDICT to validate its results.
Reply
#13
There's also a romaji-based one in the Aedict source with the same license.
Reply
#14
(2016-02-29, 6:37 am)balloonguy Wrote: The code I linked to is a straight Javascript to Java port of Rikaichan's deinflection algorithm with deinflect.dat embedded.

 I'm just going to blame this all on not having my coffee before I posted ...  Blush
Reply
#15
I'm planning to put Kuromoji behind a Jetty web server powering a REST endpoint on a $5/mo DigitalOcean VPS. With Let's Encrypt giving away TLS certs for free, you could lock it down so that only your Android app has access to it, or let others hit it too if you were feeling generous (and willing to risk freeloaders eating into your app's throughput). Of course this doesn't help if your app needs to work offline.
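For the curious, the shape of such an endpoint is small. Here's a sketch using the JDK's built-in com.sun.net.httpserver instead of Jetty, with the analyzer stubbed out; a real version would call Kuromoji in analyze(), would percent-decode the query, and would add TLS and auth. The /tokenize path and JSON shape are made up:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class TokenizeServer {
    // Stand-in for the real analyzer; a Kuromoji-backed version would
    // tokenize here and serialize the tokens.
    static String analyze(String text) {
        return "{\"input\":\"" + text + "\"}";
    }

    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/tokenize", exchange -> {
            // Query looks like q=<text>; real code would URL-decode it.
            String query = exchange.getRequestURI().getQuery();
            String text = query == null ? "" : query.replaceFirst("^q=", "");
            byte[] body = analyze(text).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type",
                    "application/json; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        start(8080);
        System.out.println("listening on :8080");
    }
}
```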

There's really no way around the large file sizes, since the dictionaries themselves are so large, unless you test how much accuracy you lose when you shrink the dictionary (and tell me how to do that).

We're using UniDic for all our applications; it's newer and designed from the ground up to power projects like BCCWJ (which in turn powered Nyar182's "Core5000" deck, by way of Routledge's Frequency Dictionary of Japanese). It's smarter than IPADIC in our experience.

Another option that I haven't tested much: all-JavaScript segmentation & POS-tagging by Rakuten's R&D branch: https://github.com/rakuten-nlp/rakutenma It supports reducing dictionary sizes.

KyTea is another product in this space, which alas I haven't evaluated either: http://www.phontron.com/kytea/ It's in C++.

Just a note about GPL: if you get a buddy to help, you can create a clean-room implementation of GPL software that's no longer bound by the GPL. One of you reads the GPL code and translates it to pseudocode or some human description of its operations. The other never looks at the GPL code, but only consults the pseudocode and implements that. (And hopefully releases it under a BSD/Apache/Eclipse/etc. license, since the resulting work is no longer GPL'd.)

Edit: I realize Rakuten-MA and KyTea aren't Java, but in dire straits someone could potentially run Rakuten-MA inside Nashorn.

Edit: Also, one downside to Kuromoji that I haven't been able to figure out is how to get it to give me the top-N results (equivalent of MeCab's "--nbest" flag)—any info?
Edited: 2016-03-01, 1:44 pm
Reply