
Japanese NLP Libraries

#15
I'm planning to put Kuromoji behind a Jetty web server exposing a REST endpoint, on a Digital Ocean VPS ($5/mo). With Let's Encrypt giving away TLS certs for free, you could secure it so that only your Android app is allowed to access it, or let others hit it too if you're feeling generous (and willing to risk freeloaders cutting into your app's throughput). Of course, this doesn't help if your app needs to work offline.
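For anyone who wants to picture the shape of that endpoint, here's a rough sketch. To keep it dependency-free it uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in for Jetty, and the `segmenter` function is a hypothetical hook where a Kuromoji tokenizer would be plugged in. The class name, the /tokenize path, and the `text` parameter are all made up for illustration, and there's no TLS or access control here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

public class SegmentServer {
    /**
     * Starts an HTTP server exposing GET /tokenize?text=...
     * `segmenter` is a hypothetical hook: in a real deployment it would wrap
     * the morphological analyzer (e.g. Kuromoji). Pass port 0 to let the OS
     * pick a free port.
     */
    public static HttpServer start(int port, Function<String, List<String>> segmenter)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/tokenize", exchange -> {
            // Pull the percent-encoded "text" parameter out of the query string.
            String query = exchange.getRequestURI().getRawQuery();
            String text = "";
            if (query != null) {
                for (String pair : query.split("&")) {
                    if (pair.startsWith("text=")) {
                        text = URLDecoder.decode(pair.substring(5), "UTF-8");
                    }
                }
            }
            // Respond with one token per line, UTF-8 encoded.
            byte[] body = String.join("\n", segmenter.apply(text))
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling SegmentServer.start(8080, myFunction) and then requesting /tokenize?text=... would return whatever tokens myFunction produces, one per line. Swapping in Jetty and a real tokenizer changes the plumbing but not the overall shape.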

There's really no way around the large file sizes, since the dictionaries themselves are so large, unless you measure how much accuracy you lose when you shrink the dictionary (and tell me how to do this).
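I don't know how to shrink the dictionary, but the "how much accuracy you lose" half could be measured along these lines. This is only a sketch under my own assumptions: it treats the full dictionary's segmentation as the reference and scores the reduced dictionary's output by token-level F1 over character spans. The class and method names are made up, and it ignores POS tags entirely.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentationScore {
    /** Converts a token sequence into a set of "start:end" character spans. */
    static Set<String> spans(List<String> tokens) {
        Set<String> out = new HashSet<>();
        int pos = 0;
        for (String t : tokens) {
            out.add(pos + ":" + (pos + t.length()));
            pos += t.length();
        }
        return out;
    }

    /**
     * Token-level F1 of `candidate` against `reference`.
     * Both lists are assumed to segment the same underlying text.
     */
    public static double f1(List<String> reference, List<String> candidate) {
        Set<String> ref = spans(reference);
        Set<String> cand = spans(candidate);
        // Tokens count as correct only if both boundaries match the reference.
        Set<String> common = new HashSet<>(ref);
        common.retainAll(cand);
        if (common.isEmpty()) {
            return 0.0;
        }
        double precision = (double) common.size() / cand.size();
        double recall = (double) common.size() / ref.size();
        return 2 * precision * recall / (precision + recall);
    }
}
```

For example, if the full dictionary gives 東京 / 都 / に / 住む and the reduced one gives 東京都 / に / 住む, two of the candidate's three tokens match the reference's four, so precision is 2/3, recall is 1/2, and F1 comes out to 4/7. Averaging this over a test corpus at several dictionary sizes would give a rough accuracy-versus-size curve.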

We're using UniDic for all our applications. It's newer and was designed from the ground up to power applications like BCCWJ (which in turn powered Nyar182's "Core5000" deck, by way of Routledge's Frequency Dictionary of Japanese). It's smarter than IPADIC in our experience.

Another option that I haven't tested much is all-JavaScript segmentation & POS-tagging by Rakuten's R&D branch: https://github.com/rakuten-nlp/rakutenma. It supports reducing dictionary sizes.

KyTea is another product in this space, which alas I haven't evaluated either: http://www.phontron.com/kytea/. It's in C++.

Just a note about GPL: if you get a buddy to help, you can create a clean-room implementation of GPL software that isn't bound by the GPL. One of you reads the GPL code and translates it to pseudocode or some human description of its operations. The other never looks at the GPL code, only consults the pseudocode, and implements that. (And hopefully makes it available under a BSD/Apache/Eclipse/etc. license, since the resulting work is no longer GPL'd.)

Edit: I realize Rakuten-MA and KyTea aren't Java, but in dire straits someone could potentially run Rakuten-MA inside Nashorn.

Edit: Also, one downside to Kuromoji: I haven't been able to figure out how to get it to give me the top-N results (the equivalent of MeCab's "--nbest" flag). Any info?
Edited: 2016-03-01, 1:44 pm

Messages In This Thread
Japanese NLP Libraries - by jimeux - 2016-02-26, 8:29 am
RE: Japanese NLP Libraries - by vix86 - 2016-02-27, 12:00 am
RE: Japanese NLP Libraries - by jimeux - 2016-02-27, 1:24 am
RE: Japanese NLP Libraries - by vix86 - 2016-02-27, 2:41 am
RE: Japanese NLP Libraries - by zx573 - 2016-02-27, 6:13 am
RE: Japanese NLP Libraries - by balloonguy - 2016-02-27, 4:09 pm
RE: Japanese NLP Libraries - by jimeux - 2016-02-29, 12:09 am
RE: Japanese NLP Libraries - by tokyostyle - 2016-02-29, 12:26 am
RE: Japanese NLP Libraries - by vix86 - 2016-02-29, 12:50 am
RE: Japanese NLP Libraries - by tokyostyle - 2016-02-29, 1:11 am
RE: Japanese NLP Libraries - by vix86 - 2016-02-29, 5:05 am
RE: Japanese NLP Libraries - by balloonguy - 2016-02-29, 6:37 am
RE: Japanese NLP Libraries - by tokyostyle - 2016-02-29, 12:23 pm
RE: Japanese NLP Libraries - by jimeux - 2016-02-29, 8:20 am
RE: Japanese NLP Libraries - by aldebrn - 2016-03-01, 1:38 pm