Back

unnamed japanese text analyzer

#1
https://github.com/wareya/analyzer/releases

Use 64-bit Java if the analyzer gets stuck in initialization.

[Image: 1W6yYdl.png]

This serves a similar purpose to cb's frequency list generator, except that it uses Kuromoji instead of MeCab or jparser. Kuromoji is the state-of-the-art bulk text segmenter for Japanese.

Note that this is a command-line program and only works on a single text file. It also only produces vocabulary frequency lists, not readability reports or kanji frequency lists; those are a lot more straightforward to do, and this program will be easier to keep working well if it stays as small as possible.

Kuromoji makes it impossible to directly get the "dictionary entry" of each word it encounters, so this tool works around that by guessing based on the dictionary information Kuromoji does expose about segments. The specific dictionary used is incredibly large and makes the download a whopping 47MB, but I picked it because it has the best chance of giving unambiguous identifying information about words: it includes spelling, pronunciation (ou -> o~, ei -> e~), and pitch accent, as well as specific morphological categories (e.g. godan ~nu).
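The work-around can be sketched like this: build a composite key out of the identifying fields the segmenter does expose, and count occurrences per key. This is a minimal illustration of the idea, not the analyzer's actual code, and the field names are made up, not Kuromoji's real API.

```java
import java.util.HashMap;
import java.util.Map;

// Guess "dictionary entries" by keying on every identifying field the
// tokenizer exposes. Two segments only merge if spelling, reading, pitch
// accent, and part of speech all agree. Field names are illustrative.
public class EntryGuess {
    // Join the identifying fields into one tab-separated key.
    static String key(String lemma, String reading, String accent, String pos) {
        return lemma + "\t" + reading + "\t" + accent + "\t" + pos;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        // Pretend these rows came out of the segmenter.
        String[][] tokens = {
            {"此れ", "コレ", "0", "代名詞"},
            {"此れ", "コレ", "0", "代名詞"},
            {"来る", "クル", "1", "動詞"},
        };
        for (String[] t : tokens)
            freq.merge(key(t[0], t[1], t[2], t[3]), 1, Integer::sum);
        System.out.println(freq.get(key("此れ", "コレ", "0", "代名詞"))); // 2
    }
}
```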

Example output (top 10 results):

Code:
1684    それ    ソレ    ソレ    0    和    代名詞    *    *    *
1407    こと    コト    コト    2    和    名詞    普通名詞    一般    *
1206    そう    ソウ    ソー    1    和    副詞    *    *    *
1195    何    ナン    ナン    1    和    代名詞    *    *    *
1170    ギー    ギー    ギー    1    外    名詞    普通名詞    一般    *
1018    この    コノ    コノ    0    和    連体詞    *    *    *
954    彼    カレ    カレ    1    和    代名詞    *    *    *
904    もの    モノ    モノ    2,0    和    名詞    普通名詞    サ変可能    *
846    都市    トシ    トシ    1    漢    名詞    普通名詞    一般    *
786    その    ソノ    ソノ    0    和    連体詞    *    *    *

(ギー is the name of a character in the text)

Companion program: https://github.com/wareya/normalizer/releases/tag/test1
Edited: 2017-08-12, 12:27 pm
Reply
#2
Looks good. I might make use of it in my subs2srs project on Memrise. If you're the programmer, can I make a request?

A column that lists how many characters or words into the document(s) the word first appears. So a word might appear 28 times (frequency), and the first time it appears might be 1012 words (or 5308 characters) into the document. We would get a column with 28 and a column with 1012.

This would be useful for creating vocabulary lists ordered by appearance that can be divided every X pages, words, or even characters if the user wants. For myself, I'm creating a non-repeating vocabulary list for every 50 sentences in a drama's subtitles, so this could be useful.
Reply
#3
Yeah, I can do it based on line pretty easily.
Reply
#4
Sounds cool. Will this work on my mac?
Reply
#5
I haven't tested it, but it's Java, so it should just work.
Reply
#6
(2017-03-21, 1:45 pm)wareya Wrote: I haven't tested it, but it's Java, so it should just work.

This looks really interesting, as I'm currently working on my own frequency list/deck.

I really hate Java, but I guess I'll bite the bullet and install it.


Edit:
Very interesting results. Compared to CB's analysis tool, it seems to have a lot fewer "nonsense" words at the beginning, but it's also missing things like の, and expressions like こんにちは.
It's also a little interesting to see that words don't necessarily have the same frequency rank in both apps, though they are generally pretty close.
Edited: 2017-03-21, 5:08 pm
Reply
#7
Ah, if Kuromoji ever gave the ability to fetch the dictionary entry of a match, that would be amazing. Even though I'm not a fan of Java either, I'd deal with it to remove the MeCab Python bindings in my scripts...

EDIT: It seems that there's a WIP for a C implementation ( http://www.atilika.com/en/products/kuromoji.html )

Quote:Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.

Sorry to derail a bit from the thread.
Edited: 2017-03-21, 8:36 pm
Reply
#8
How do I use this? I ran the jar but nothing happens :/
Reply
#9
(2017-03-21, 3:39 pm)Zarxrax Wrote: Very interesting results. Compared to CB's analysis tool, it seems to have a lot fewer "nonsense" words at the beginning, but it's also missing things like の, and expressions like こんにちは.
It's also a little interesting to see that words don't necessarily have the same frequency rank in both apps, though they are generally pretty close.

That's because of my output filter. You can disable all filters except the punctuation filter by invoking the analyzer as:

Code:
java -jar analyzer.jar <corpus.txt> -w -l -d > output.txt

Kuromoji insists on using shorter segments in general too.

(2017-03-21, 8:22 pm)Flamerokz Wrote: EDIT: It seems that there's a WIP for a C implementation ( http://www.atilika.com/en/products/kuromoji.html )

Quote:Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.

I wouldn't hold my breath.

(2017-03-21, 10:52 pm)vladz0r Wrote: How do I use this? I ran the jar but nothing happens :/



It's a command-line program, but it tries to force the terminal to open in UTF-8. If it can't, it will die with an exception. I don't know whether that exception is visible. I might add a GUI in a future release, but I don't really want to deal with whatever UI frameworks Java makes available.

Until I make a second test release you can read the command line options here and see if it works regardless of whatever's going on on your end: https://github.com/wareya/analyzer/blob/....java#L250

If that doesn't work, let me know and I'll try to figure out what's going on.
Edited: 2017-03-22, 3:46 am
Reply
#10
(2017-03-21, 9:16 am)Nukemarine Wrote: A column that lists how many characters or words into the document(s) the word first appears. So a word might appear 28 times (frequency), and the first time it appears might be 1012 words (or 5308 characters) into the document. We would get a column with 28 and a column with 1012.

I just added an option that does this, but counts lines rather than words/letters. Please test it.

https://github.com/wareya/analyzer/releases/tag/test2
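The line-based version of the request can be sketched like this: while walking the corpus line by line, record a word's line number only the first time it is seen, alongside its running frequency. This mirrors the behavior described, not the tool's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Track frequency and first-appearance line for each word.
public class FirstSeen {
    // Returns word -> {count, firstLine} (lines are 1-indexed).
    static Map<String, int[]> scan(String[][] lines) {
        Map<String, int[]> stats = new LinkedHashMap<>();
        for (int i = 0; i < lines.length; i++) {
            for (String word : lines[i]) {
                // The first-seen line is fixed the moment the word appears.
                int line = i + 1;
                int[] s = stats.computeIfAbsent(word, k -> new int[]{0, line});
                s[0]++; // bump frequency
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        String[][] corpus = {
            {"それ", "は", "それ"},
            {"もの", "それ"},
        };
        int[] sore = scan(corpus).get("それ");
        System.out.println(sore[0] + " " + sore[1]); // 3 1
    }
}
```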

EDIT: There's a GUI now.

https://github.com/wareya/analyzer/releases/tag/test3

[Image: Doc0oDg.png]
Edited: 2017-03-22, 11:25 am
Reply
#11
OK, finally got to try something I wanted to do for a while and that's create a vocabulary list from Harry Potter book 1 that can be sorted by order of appearance. This analyzer made that possible and the result is more than usable by anyone I'm sure:

Japanese Translation of Harry Potter Book 1 - Vocabulary list

If you're interested in the steps I used to create this, check out this Reddit thread.
Edited: 2017-04-02, 2:13 am
Reply
#12
(2017-04-02, 2:11 am)Nukemarine Wrote: OK, finally got to try something I wanted to do for a while and that's create a vocabulary list from Harry Potter book 1 that can be sorted by order of appearance. This analyzer made that possible and the result is more than usable by anyone I'm sure:

Japanese Translation of Harry Potter Book 1 - Vocabulary list

If you're interested in the steps I used to create this, check out this Reddit thread.

Having a vocabulary list for a book you're reading is a great idea.

Hareeeee Pottaaaaaah is not my cup of tea however so I'll have to figure out how to apply these tools to the books I'm actually interested in reading.

[I managed to find Japanese translations of "Rosemary's Baby" and "The Stepford Wives" by Ira Levin.  These are among my favorite books.  I can't wait to see how they translated "the Welcome Wagon Lady" (in the opening paragraph of The Stepford Wives) into Japanese....].
Edited: 2017-04-02, 8:05 am
Reply
#13
(2017-04-02, 8:03 am)phil321 Wrote: Having a vocabulary list for a book you're reading is a great idea.

Hareeeee Pottaaaaaah is not my cup of tea however so I'll have to figure out how to apply these tools to the books I'm actually interested in reading.

[I managed to find Japanese translations of "Rosemary's Baby" and "The Stepford Wives" by Ira Levin.  These are among my favorite books.  I can't wait to see how they translated "the Welcome Wagon Lady" (in the opening paragraph of The Stepford Wives) into Japanese....].

So long as you have the books in a text file format of some sort, it's easy. 

About books, just know that not all translations are created equally. For example, my wife told me she thought the translation for A Game of Thrones was horrible, but the one for The Martian was good. Not sure how one can research that ahead of time though.
Reply
#14
(2017-04-02, 9:40 am)Nukemarine Wrote:
(2017-04-02, 8:03 am)phil321 Wrote: Having a vocabulary list for a book you're reading is a great idea.

Hareeeee Pottaaaaaah is not my cup of tea however so I'll have to figure out how to apply these tools to the books I'm actually interested in reading.

[I managed to find Japanese translations of "Rosemary's Baby" and "The Stepford Wives" by Ira Levin.  These are among my favorite books.  I can't wait to see how they translated "the Welcome Wagon Lady" (in the opening paragraph of The Stepford Wives) into Japanese....].

So long as you have the books in a text file format of some sort, it's easy. 

About books, just know that not all translations are created equally. For example, my wife told me she thought the translation for A Game of Thrones was horrible, but the one for The Martian was good. Not sure how one can research that ahead of time though.

Thanks for the comments about the quality of the translations.  I'm not too concerned about that, because I figure that a mass-market Japanese translation of a popular American novel will be in grammatically correct, fluent Japanese which is what I want.  Whether the translation is 100% faithful to the original is something I'm less concerned about.
Reply
#15
Cool. How does your method handle excluding words with the same spelling but different readings or meanings?
Reply
#16
Bug/Issue:
If the Input file is not in the same folder as the Analyzer, you get an error. I had a working file, tried to open it from a different folder, and it didn't work. I pasted it into the Analyzer folder and it worked.

I was trying to figure out the reason for like 10 minutes until I figured this out. Gotta fix that sometime~
Reply
#17
Thanks, I somehow managed to miss that problem because I had files with the same names in both folders when I went to test for it. I'll fix it some time soon. For now I just put a heads-up on the release page.

EDIT: Fixed (I think), new test release uploading.
Edited: 2017-04-02, 2:23 pm
Reply
#18
Would it be possible to flag words that are possible duplicates?
Such as one word has kanji, then the same word appears written in hiragana, then the same word appears written in katakana. Or a word appears with an お in front of it vs the same word without the お.
I wouldn't expect the software to be able to determine actual duplicates, but the point would just be to flag it for the user so they can make the decision.

Also, if it is determined that 2 entries are duplicates, is there a mathematical formula that I can apply to get the combined frequency ranking?
Reply
#19
(2017-04-02, 2:48 pm)Zarxrax Wrote: Would it be possible to flag words that are possible duplicates?
Such as one word has kanji, then the same word appears written in hiragana, then the same word appears written in katakana. Or a word appears with an お in front of it vs the same word without the お.
I wouldn't expect the software to be able to determine actual duplicates, but the point would just be to flag it for the user so they can make the decision.
The best way to do this is to open the frequency list in a spreadsheet and sort it in a way that's useful for identifying certain kinds of "duplicates". There are too many possibilities for duplicate words; for example, there are something like four versions of 通る・通り in one of the lists I generated, and that's just the tip of the iceberg.

Another issue is that some "duplicates" should be learned as separate words from the word they're based on, because they work differently, like how "annoying" in English isn't literally just a participle of "to annoy", despite looking like it is. I still think "annoying" should contribute to the frequency of "annoy"; it's just a nuanced issue.

The analyzer already combines terms that have the same spelling, reading, pronunciation, pitch accent, and lexical data, since that's part of how it identifies "dictionary words" from the information kuromoji spits out.

(2017-04-02, 2:48 pm)Zarxrax Wrote: Also, if it is determined that 2 entries are duplicates, is there a mathematical formula that I can apply to get the combined frequency ranking?

Add the number of occurrences together and re-sort the frequency list with the new occurrence count.
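That merge-and-re-rank step can be sketched as below. The entries and counts are made up for illustration; this is not the analyzer's own code.

```java
import java.util.*;

// Fold one duplicate entry's count into another, then re-sort by count.
public class MergeDupes {
    // Remove `drop`, add its count to `keep`, return list ranked by count.
    static List<Map.Entry<String, Integer>> rerank(
            Map<String, Integer> freq, String keep, String drop) {
        freq.merge(keep, freq.remove(drop), Integer::sum);
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(freq.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Say 取る appears 30 times in kanji and 12 times as とる.
        Map<String, Integer> freq = new HashMap<>(
            Map.of("取る", 30, "とる", 12, "見る", 25));
        System.out.println(rerank(freq, "取る", "とる")); // [取る=42, 見る=25]
    }
}
```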
Edited: 2017-04-02, 3:05 pm
Reply
#20
(2017-03-21, 12:35 am)wareya Wrote: https://github.com/wareya/analyzer/releases

[Image: wGedAmz.png]


This may be a dumb question, but what exactly goes into the "Input" field and what goes into the "Output" field?  Can you show us an example?
Reply
#21
Press the "Input" or "Output" button; it'll open a file browser. The text fields are left editable because that makes some ways of using the UI easier. The text fields should contain a path to the input file you want to read and a path to the output file you want to generate. The input file should be UTF-8 plain text. The output file will be a tab-separated-values frequency list.
Reply
#22
Both this and cb's text analyzer run into a similar problem that seems to really throw off the results, and I'm wondering if this is a solvable problem.
The results can end up showing a lot of high-frequency words that in most cases aren't being used as words at all, but are actually just a single kanji from someone's name. This is a problem because in many cases those single kanji could potentially be an actual word or suffix, but the results get completely skewed by their appearance in names.

It seems like it should be a fairly easy problem to fix, largely by including a large names dictionary to help with word identification. But perhaps this is actually a much more difficult problem to solve than I suspect.

Spending the past couple of months making my own frequency list, this has been the single largest issue I have faced. It would be nice to find a way to increase accuracy of the results in this regard.
Reply
#23
It is possible. Two notes:

1) The dictionary the analyzer uses, UniDic, is worse about names than IPAdic. However, IPAdic makes it easier to confuse real words because it doesn't give pitch accent information.

2) I can probably supply a names dictionary, but I don't know how that would affect Kuromoji internally. Kuromoji uses a Viterbi implementation, which models the text as one of many possible strings of tokens and picks the most likely one. The key takeaway is that different token pairs have different likelihoods, and I don't know how a custom dictionary interacts with that. It might make the parser stupider overall, and it might mask real words that share or contain the same character sequence as the name.
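The core idea can be shown with a toy version: score every way of splitting a string into known tokens and keep the cheapest path. Real Kuromoji also charges for the transition between each pair of adjacent tokens, which is exactly why adding custom dictionary entries can shift which path wins. All costs here are made up.

```java
import java.util.*;

// Toy lattice search: cheapest segmentation of `text` into dictionary words.
public class TinyViterbi {
    static int bestCost(String text, Map<String, Integer> wordCost) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = cheapest cost to cover text[0..i)
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int i = 0; i < n; i++) {
            if (best[i] == Integer.MAX_VALUE) continue; // position unreachable
            for (int j = i + 1; j <= n; j++) {
                Integer c = wordCost.get(text.substring(i, j));
                if (c != null) best[j] = Math.min(best[j], best[i] + c);
            }
        }
        return best[n];
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = Map.of("ab", 3, "a", 2, "b", 2, "abc", 10, "c", 1);
        // Candidate paths: ab+c=4, a+b+c=5, abc=10 -> picks 4.
        System.out.println(bestCost("abc", dict)); // 4
    }
}
```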

To me it's primarily a corpus problem. Kuromoji already tries to segment unknown words on their own; it just prefers interpreting them as a real word when it can.

There's a neologism version of unidic that supposedly contains names. Kuromoji supports it, but not with pitch accent information.
Edited: 2017-04-06, 2:45 pm
Reply
#24
I uploaded a new test release. This one has a second download that uses the neologd dictionary, which is massive, but handles the names it includes better.

https://github.com/wareya/analyzer/releases/tag/test5
Edited: 2017-04-06, 5:56 pm
Reply
#25
(2017-04-06, 5:55 pm)wareya Wrote: I uploaded a new test release. This one has a second download that uses the neologd dictionary, which is massive, but handles the names it includes better.

https://github.com/wareya/analyzer/releases/tag/test5

Cool, I'll test it out this weekend.
Reply