
unnamed japanese text analyzer

#1
https://github.com/wareya/analyzer/releases

Use 64-bit Java if the analyzer gets stuck in initialization.

[Image: 1W6yYdl.png]

This serves a similar purpose to cb's frequency list generator, except that it uses kuromoji instead of MeCab or JParser. Kuromoji is the state-of-the-art bulk text segmenter for Japanese.

Note that this is a command line program and only works on a single text file. It also only produces vocabulary frequency lists, not readability reports or kanji frequency lists; those are a lot more straightforward to generate, and this program will be easier to keep working well if it stays as small as possible.

Kuromoji makes it very hard (impossible) to get the "dictionary entry" of each word it encounters, so this tool works around that by guessing based on the dictionary information kuromoji does expose about each segment. The specific dictionary used is incredibly large and brings the download to a whopping 47MB, but I picked it because it has the best chance of giving unambiguous identifying information about words: it includes spelling, pronunciation (ou -> o~, ei -> e~), and pitch accent, as well as specific morphological categories (e.g. godan ~nu).
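
To illustrate the idea of identifying a word by the combination of fields the dictionary exposes, here is a minimal sketch in Java. It is not the analyzer's actual code and it does not call kuromoji: `FrequencySketch`, `Segment`, and `count` are hypothetical names, and the segments are supplied by hand. It assumes two segments count as the same word only when every field matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are not from the analyzer or from kuromoji.
class FrequencySketch {
    // One segment, reduced to fields like those in the post's example output.
    static final class Segment {
        final String spelling, reading, pronunciation, accent, origin, partOfSpeech;

        Segment(String spelling, String reading, String pronunciation,
                String accent, String origin, String partOfSpeech) {
            this.spelling = spelling;
            this.reading = reading;
            this.pronunciation = pronunciation;
            this.accent = accent;
            this.origin = origin;
            this.partOfSpeech = partOfSpeech;
        }

        // Composite key: every field takes part in identifying the word.
        String key() {
            return String.join("\t", spelling, reading, pronunciation,
                               accent, origin, partOfSpeech);
        }
    }

    // Counts segments per composite key and returns entries, most frequent first.
    static List<Map.Entry<String, Integer>> count(List<Segment> segments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Segment s : segments) {
            counts.merge(s.key(), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so ties keep first-seen order.
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("何", "ナン", "ナン", "1", "和", "代名詞"),
            // Invented row for illustration: same spelling, different reading.
            new Segment("何", "ナニ", "ナニ", "1", "和", "代名詞"),
            new Segment("何", "ナン", "ナン", "1", "和", "代名詞"));
        for (Map.Entry<String, Integer> e : count(segments)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

With a key like this, the two readings of the same spelling end up as separate entries instead of being merged, which is the kind of disambiguation a richer dictionary makes possible.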

Example output (top 10 results):

Code:
1684    それ    ソレ    ソレ    0    和    代名詞    *    *    *
1407    こと    コト    コト    2    和    名詞    普通名詞    一般    *
1206    そう    ソウ    ソー    1    和    副詞    *    *    *
1195    何    ナン    ナン    1    和    代名詞    *    *    *
1170    ギー    ギー    ギー    1    外    名詞    普通名詞    一般    *
1018    この    コノ    コノ    0    和    連体詞    *    *    *
954    彼    カレ    カレ    1    和    代名詞    *    *    *
904    もの    モノ    モノ    2,0    和    名詞    普通名詞    サ変可能    *
846    都市    トシ    トシ    1    漢    名詞    普通名詞    一般    *
786    その    ソノ    ソノ    0    和    連体詞    *    *    *

(ギー is the name of a character in the text)
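
If you want to post-process a list like the one above, a row can be split into its columns. This is only a sketch under assumptions: the column meanings (count, spelling, kana reading, pronunciation, pitch accent, word origin, then four part-of-speech columns) are my reading of the example rather than a documented format, the columns are assumed to be whitespace-separated with no empty fields, and `OutputRowSketch`, `Row`, and `parse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the column layout is inferred from the example output.
class OutputRowSketch {
    static final class Row {
        final int count;
        final String spelling, reading, pronunciation, accent, origin;
        final List<String> partOfSpeech; // four columns, "*" where unused

        Row(int count, String spelling, String reading, String pronunciation,
            String accent, String origin, List<String> partOfSpeech) {
            this.count = count;
            this.spelling = spelling;
            this.reading = reading;
            this.pronunciation = pronunciation;
            this.accent = accent;
            this.origin = origin;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Splits one output line on runs of whitespace (tabs or spaces).
    static Row parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 columns, got " + f.length);
        }
        List<String> pos = new ArrayList<>(Arrays.asList(f).subList(6, 10));
        return new Row(Integer.parseInt(f[0]), f[1], f[2], f[3], f[4], f[5], pos);
    }

    public static void main(String[] args) {
        Row row = parse("1206    そう    ソウ    ソー    1    和    副詞    *    *    *");
        System.out.println(row.spelling + " occurs " + row.count
            + " times, pronounced " + row.pronunciation);
    }
}
```

The accent column is kept as a string because it can hold more than one value, as in the `2,0` on the もの row.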

Companion program: https://github.com/wareya/normalizer/releases/tag/test1
Edited: 2017-08-12, 12:27 pm

unnamed japanese text analyzer - by wareya - 2017-03-21, 12:35 am