
Unnamed Japanese text analyzer

#1
https://github.com/wareya/analyzer/releases

[Image: Doc0oDg.png]

This serves a similar purpose to cb's frequency list generator, except that it uses Kuromoji instead of MeCab or JParser. Kuromoji is the state-of-the-art bulk text segmenter for Japanese.

Note that this is a command line program and only works on a single text file. It also only produces vocabulary frequency lists, not readability reports or kanji frequency lists; those are a lot more straightforward to do, and this program is going to be easier to keep working well if it's as small as possible.

Kuromoji actually makes it very hard (impossible) to get the "dictionary entry" of each word it encounters, so this tool works around that by guessing based on the dictionary information kuromoji does expose about segments. The specific dictionary used is incredibly large and makes the download a whopping 47MB, but I picked it because it has the best chance of giving unambiguous identifying information about words, since it includes spelling, pronunciation (ou -> o~, ei -> e~), and pitch accent, as well as specific morphological categories (e.g. godan ~nu).

Example output (top 10 results):

Code:
1684    それ    ソレ    ソレ    0    和    代名詞    *    *    *
1407    こと    コト    コト    2    和    名詞    普通名詞    一般    *
1206    そう    ソウ    ソー    1    和    副詞    *    *    *
1195    何    ナン    ナン    1    和    代名詞    *    *    *
1170    ギー    ギー    ギー    1    外    名詞    普通名詞    一般    *
1018    この    コノ    コノ    0    和    連体詞    *    *    *
954    彼    カレ    カレ    1    和    代名詞    *    *    *
904    もの    モノ    モノ    2,0    和    名詞    普通名詞    サ変可能    *
846    都市    トシ    トシ    1    漢    名詞    普通名詞    一般    *
786    その    ソノ    ソノ    0    和    連体詞    *    *    *

(ギー is the name of a character in the text)
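For anyone post-processing a list like the one above, here is a minimal Java sketch that splits one row into named fields. The field names (count, lemma, reading, pronunciation, accent, origin, part of speech) are inferred from the example rows and from the description of the dictionary; they are assumptions, not names the tool defines, and the class name FrequencyRow is made up for illustration. It also assumes the columns are separated by tabs or runs of spaces.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one row of the frequency list (not part of the analyzer).
class FrequencyRow {
    public final int count;             // how often the word occurred
    public final String lemma;          // spelling, e.g. もの
    public final String reading;        // kana reading, e.g. ソウ
    public final String pronunciation;  // pronunciation with long vowels marked, e.g. ソー
    public final String accent;         // pitch accent, may hold several values such as "2,0"
    public final String origin;         // 和 / 漢 / 外 (assumed to be word origin)
    public final List<String> partOfSpeech; // remaining columns, "*" where unused

    FrequencyRow(int count, String lemma, String reading, String pronunciation,
                 String accent, String origin, List<String> partOfSpeech) {
        this.count = count;
        this.lemma = lemma;
        this.reading = reading;
        this.pronunciation = pronunciation;
        this.accent = accent;
        this.origin = origin;
        this.partOfSpeech = partOfSpeech;
    }

    // Split on tabs or runs of spaces; assumes no field contains whitespace.
    static FrequencyRow parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 7) {
            throw new IllegalArgumentException("expected at least 7 columns: " + line);
        }
        return new FrequencyRow(Integer.parseInt(f[0]), f[1], f[2], f[3], f[4], f[5],
                Arrays.asList(Arrays.copyOfRange(f, 6, f.length)));
    }

    public static void main(String[] args) {
        FrequencyRow row = parse("904\tもの\tモノ\tモノ\t2,0\t和\t名詞\t普通名詞\tサ変可能\t*");
        System.out.println(row.count + " " + row.lemma + " " + row.partOfSpeech);
    }
}
```

The accent column is kept as a string because some rows list more than one value.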
Edited: 2017-03-22, 11:25 am
#2
Looks good. I might make use of it in my subs2srs project on Memrise. If you're the programmer, can I make a request?

A column that lists how many characters or words into the document(s) the word first appears. So a word might appear 28 times (frequency), and the first time it appears might be 1012 words (or 5308 characters) into the document. We should get a column with 28 and a column with 1012.

This would be useful in creating vocabulary lists via order of appearance that can be divided by X number of pages, words or even characters if the user wants. For myself, I'm creating a non-repeating vocabulary list for every 50 sentences in a drama subtitle so this could be useful.
#3
Yeah, I can do it based on line pretty easily.
#4
Sounds cool. Will this work on my mac?
#5
I haven't tested it, but it's Java so it should just work.
#6
(2017-03-21, 1:45 pm)wareya Wrote: I haven't tested it, but it's Java so it should just work.

This looks really interesting, as I'm currently working on my own frequency list/deck.

I really hate Java, but I guess I'll bite the bullet and install it.


Edit:
Very interesting results. Compared to CB's analysis tool, it seems to have far fewer "nonsense" words near the top, but it's also missing things like の, and expressions like こんにちは.
It's also a little interesting that words don't necessarily have the same frequency in both apps, though they are generally pretty close.
Edited: 2017-03-21, 5:08 pm
#7
Ah, if Kuromoji ever gave the ability to fetch the dictionary entry of a match, that would be amazing. Even though I am not a fan of Java either, I'd deal with it to remove the MeCab Python bindings in my scripts...

EDIT: It seems that there's a WIP for a C implementation ( http://www.atilika.com/en/products/kuromoji.html )

Quote:Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.

Sorry to derail a bit from the thread.
Edited: 2017-03-21, 8:36 pm
#8
How do I use this? I ran the jar but nothing happens :/
#9
(2017-03-21, 3:39 pm)Zarxrax Wrote: Very interesting results. Compared to CB's analysis tool, it seems to have far fewer "nonsense" words near the top, but it's also missing things like の, and expressions like こんにちは.
It's also a little interesting that words don't necessarily have the same frequency in both apps, though they are generally pretty close.

That's because of my output filter. You can disable all filters except the punctuation filter by invoking the analyzer like this:

Code:
java -jar analyzer.jar <corpus.txt> -w -l -d > output.txt

Kuromoji insists on using shorter segments in general too.

(2017-03-21, 8:22 pm)Flamerokz Wrote: EDIT: It seems that there's a WIP for a C implementation ( http://www.atilika.com/en/products/kuromoji.html )

Quote:Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.

I wouldn't hold my breath.

(2017-03-21, 10:52 pm)vladz0r Wrote: How do I use this? I ran the jar but nothing happens :/

It's a command line program, but it tries to force the terminal output to UTF-8. If it can't, it will die with an exception. I don't know whether this exception is visible or not. I might add a GUI in a future release, but I don't really want to deal with whatever UI frameworks Java makes available.
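For reference, one common way for a Java program to force UTF-8 on its console output is to wrap the standard streams in a PrintStream with an explicit charset. The sketch below shows that general technique only; it is an assumption that the analyzer does something along these lines, and Utf8Stdout and utf8 are hypothetical names.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not taken from the analyzer's source.
class Utf8Stdout {
    // Wrap any stream so that text written to it is encoded as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            // autoflush = true so each println reaches the terminal immediately
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this is not expected.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        // Replace the default streams, which otherwise use the platform charset.
        System.setOut(utf8(System.out));
        System.setErr(utf8(System.err));
        System.out.println("それ\tソレ");
    }
}
```

This only controls the bytes the program writes. Whether the terminal displays them correctly still depends on the terminal's own encoding and font, and if the jar is started by double-clicking there may be no console to show output or exceptions at all.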

Until I make a second test release you can read the command line options here and see if it works regardless of whatever's going on on your end: https://github.com/wareya/analyzer/blob/....java#L250

If that doesn't work, let me know and I'll try to figure out what's going on.
Edited: 2017-03-22, 3:46 am
#10
(2017-03-21, 9:16 am)Nukemarine Wrote: A column that lists how many characters or words into the document(s) the word first appears. So a word might appear 28 times (frequency), and the first time it appears might be 1012 words (or 5308 characters) into the document. We should get a column with 28 and a column with 1012.

I just added an option that does this, but counts lines rather than words/letters. Please test it.

https://github.com/wareya/analyzer/releases/tag/test2
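
The idea behind a first-appearance column can be sketched in a few lines of Java. This is not the analyzer's actual code: splitting on whitespace is a stand-in for Kuromoji segmentation, and FirstAppearance, Stats and collect are hypothetical names. Each word gets a count plus the 1-based line number where it was first seen.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: frequency plus line of first appearance for each word.
class FirstAppearance {
    static final class Stats {
        public int count;            // total occurrences
        public final int firstLine;  // 1-based line of the first occurrence

        Stats(int firstLine) {
            this.firstLine = firstLine;
        }
    }

    // LinkedHashMap keeps words in the order they first appear in the text.
    static Map<String, Stats> collect(List<String> lines) {
        Map<String, Stats> stats = new LinkedHashMap<>();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            // Assumption: whitespace splitting stands in for a real segmenter.
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank line
                }
                Stats s = stats.get(word);
                if (s == null) {
                    s = new Stats(lineNo);
                    stats.put(word, s);
                }
                s.count++;
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("それ は 都市", "", "都市 の こと");
        for (Map.Entry<String, Stats> e : collect(lines).entrySet()) {
            System.out.println(e.getValue().count + "\t" + e.getValue().firstLine + "\t" + e.getKey());
        }
    }
}
```

Because the map preserves first-appearance order, the result can be cut into chunks (for example every 50 subtitle lines) by grouping on the first-line value, which gives a non-repeating vocabulary list per chunk.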

EDIT: There's a GUI now.

https://github.com/wareya/analyzer/releases/tag/test3

[Image: Doc0oDg.png]
Edited: 2017-03-22, 11:25 am