
unnamed japanese text analyzer

#26
I missed it before, but this is really useful.
#27
Is there a way to merge together different forms of the same word when it appears in kanji, hiragana, or katakana?
Like 私 and わたし?
cb's analysis tool does this with the JParser option: it merges them into a single word and selects the most common form, unless you check the "kanji" option. Is it possible to do the same thing with this tool?
#28
The only way to do it for now is manually.
#29
(2017-04-07, 8:03 am)cophnia61 Wrote: Is there a way to merge together different forms of the same word when it appears in kanji, hiragana, or katakana?
Like 私 and わたし?
cb's analysis tool does this with the JParser option: it merges them into a single word and selects the most common form, unless you check the "kanji" option. Is it possible to do the same thing with this tool?

At higher levels, once you're tackling lots of homophones, you'll want these unmerged so you can pick up on which words are commonly written in kana, especially as more of the words you meet have synonyms. Sometimes the word written in kana might not be the <most common kanji word with the same hiragana>, so I think it's fine to keep them as they appear in the original text.

Just my 2円
#30
Wareya, are you planning to include a "filter these words" option that omits words listed in a column of a supplied text file? Currently I have to use Excel to do the trimming, but if it happened automatically that would be great. That would also solve the problem listed above about people wanting kanji and their kana variants unlisted: they could use a filter list with both forms in the column.
#31
I plan on it. I also want to make the existing filters customizable. I just have to get around to doing it.
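In the meantime, the filtering itself is simple enough to script. A minimal sketch of the idea in Java; the file names (filter.txt, freqlist.txt, filtered.txt) and the tab-separated, word-first layout are assumptions, not the tool's actual format:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class FilterList {
        public static void main(String[] args) throws IOException {
            // Collect filter words: first tab-separated column of each line in filter.txt
            Set<String> filter = new HashSet<>();
            for (String line : Files.readAllLines(Paths.get("filter.txt"), StandardCharsets.UTF_8)) {
                if (!line.isEmpty()) filter.add(line.split("\t")[0]);
            }
            // Copy only the frequency-list rows whose word isn't in the filter
            try (BufferedReader in = Files.newBufferedReader(Paths.get("freqlist.txt"), StandardCharsets.UTF_8);
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("filtered.txt"), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!filter.contains(line.split("\t")[0])) {
                        out.println(line);
                    }
                }
            }
        }
    }

A filter list with both 私 and わたし in the first column would also cover the kanji/kana case from a few posts up.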
#32
On the test 5 neologd version and on test 6, I am getting an out-of-memory error: Exception in thread "Thread-4" java.lang.OutOfMemoryError: Java heap space

This occurs even with a small (less than 1 MB) input file.

The normal version of test 5 and previous releases worked.


Also, just curious: why are there two reading fields? What is the difference?
Edited: 2017-04-09, 10:38 am
#33
That means Java's running out of memory. If you have a lot of physical memory, make sure you're using 64-bit Java instead of 32-bit. The neologd dictionary is definitely large and makes kuromoji take up a lot of memory. If it still doesn't work, try the test6 release; it uses the normal unidic-kanaaccent dictionary with some names added as dummy fields, and it doesn't improve accuracy spectacularly, but it does help: http://i.imgur.com/zU9NWeN.jpg
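If you're not sure which JVM you're actually running, here's a quick probe (sun.arch.data.model is a HotSpot-specific property, so treat this as a rough check; -Xmx is the standard flag for raising the heap cap, e.g. java -Xmx4g JvmCheck):

    public class JvmCheck {
        public static void main(String[] args) {
            // "32" or "64" on HotSpot JVMs; other JVMs may not set this property
            System.out.println("data model: " + System.getProperty("sun.arch.data.model") + "-bit");
            // The maximum heap this JVM will grow to, in MiB (raise it with -Xmx)
            System.out.println("max heap: " + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MiB");
        }
    }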

(2017-04-09, 10:37 am)Zarxrax Wrote: Also, just curious: why are there two reading fields? What is the difference?

One of them is the "spelling" of the word and the other is the "pronunciation". In other words, where the first one writes long お sounds with おう, the second will use おー. This helps disambiguate words a little.
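If you want to poke at the distinction yourself, kuromoji-ipadic exposes the same split through getReading() and getPronunciation(). A tiny demo, assuming the com.atilika.kuromoji:kuromoji-ipadic artifact is on the classpath (the tool itself uses unidic-kanaaccent, but the idea is the same):

    import com.atilika.kuromoji.ipadic.Token;
    import com.atilika.kuromoji.ipadic.Tokenizer;

    public class ReadingDemo {
        public static void main(String[] args) {
            Tokenizer tokenizer = new Tokenizer();
            for (Token token : tokenizer.tokenize("東京へ行こう")) {
                // getReading() is the kana spelling (東京 -> トウキョウ);
                // getPronunciation() is the phonetic form (東京 -> トーキョー)
                System.out.println(token.getSurface() + "\t"
                        + token.getReading() + "\t"
                        + token.getPronunciation());
            }
        }
    }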
#34
(2017-04-09, 4:10 pm)wareya Wrote: If you have a lot of physical memory, make sure you're using 64-bit Java instead of 32-bit. The neologd dictionary is definitely large and makes kuromoji take up a lot of memory. If it still doesn't work, try the test6 release; it uses the normal unidic-kanaaccent dictionary with some names added as dummy fields, and it doesn't improve accuracy spectacularly, but it does help: http://i.imgur.com/zU9NWeN.jpg

That was on the test6 release as well. Looks like switching to 64-bit Java solved the issue, though.
#35
Glad that fixed it.
#36
Have you thought about including collocations as a feature for this?
i.e. the most common combinations of 2-3 words in a given text.
I think it'd be pretty awesome to have something like this.

In http://forum.koohii.com/thread-14477-post-243650.html we were talking a bit about collocations and how they'd be pretty legit when using a large source (10,000s of subtitle files, for example).
#37
Kuromoji's segmentation is too fine-grained to do it in a particularly simple way. I could probably do it for specific kinds of collocations, like <noun><particle><verb> patterns, but I'd pretty much have to hardcode patterns like that to get any useful data out of it. There's definitely potential there, though.
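For the curious, the matching itself would look something like this over kuromoji's part-of-speech tags. A rough sketch only, using ipadic's level-1 POS names (名詞/助詞/動詞) and a fixed three-token window, which is exactly the kind of hardcoding I mean:

    import com.atilika.kuromoji.ipadic.Token;
    import com.atilika.kuromoji.ipadic.Tokenizer;
    import java.util.*;

    public class CollocationSketch {
        public static void main(String[] args) {
            List<Token> tokens = new Tokenizer().tokenize("本を読む人は映画を見る");
            Map<String, Integer> counts = new HashMap<>();
            // Slide a 3-token window and keep <noun><particle><verb> triples
            for (int i = 0; i + 2 < tokens.size(); i++) {
                if (tokens.get(i).getPartOfSpeechLevel1().equals("名詞")
                        && tokens.get(i + 1).getPartOfSpeechLevel1().equals("助詞")
                        && tokens.get(i + 2).getPartOfSpeechLevel1().equals("動詞")) {
                    String key = tokens.get(i).getSurface()
                            + tokens.get(i + 1).getSurface()
                            + tokens.get(i + 2).getSurface();
                    counts.merge(key, 1, Integer::sum);
                }
            }
            counts.forEach((k, v) -> System.out.println(k + "\t" + v));
        }
    }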
#38
(2017-04-10, 3:41 am)wareya Wrote: Kuromoji's segmentation is too fine-grained to do it in a particularly simple way. I could probably do it for specific kinds of collocations, like <noun><particle><verb> patterns, but I'd pretty much have to hardcode patterns like that to get any useful data out of it. There's definitely potential there, though.

I'm wondering if it could be more beneficial (from the standpoint of finding useful collocations) if it based the search on the highest-frequency verbs identified in the frequency analysis.


However, regarding the main function of the software for now: do you think there would be any way of tweaking it to identify longer words rather than shorter ones? For instance, it will never pick up words like 飛行機 or 救急車, instead opting for the less useful 飛行 and 救急.
#39
The only way to do that is to modify the weights given to all the long compounds in the dictionary, one by one. There's no right answer about whether to track a compound or its parts when the compound's meaning is just the sum of the small bits and the small bits are common on their own. If anything, it's the same problem as collocations. I'd rather focus on improving the filters for the time being.
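(If you want to force particular compounds through for your own texts, kuromoji does support user dictionaries, which is a different mechanism from the weight editing above. This sketch uses ipadic's surface,segmentation,reading,part-of-speech entry format; whether a given compound actually needs an entry depends on the dictionary in use, and this is not something the analyzer does for you:)

    import com.atilika.kuromoji.ipadic.Token;
    import com.atilika.kuromoji.ipadic.Tokenizer;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    public class UserDictDemo {
        public static void main(String[] args) throws Exception {
            // One user-dictionary entry: surface, segmentation, reading, part of speech
            String entries = "飛行機,飛行機,ヒコウキ,カスタム名詞\n";
            Tokenizer tokenizer = new Tokenizer.Builder()
                    .userDictionary(new ByteArrayInputStream(entries.getBytes(StandardCharsets.UTF_8)))
                    .build();
            for (Token token : tokenizer.tokenize("飛行機で行く")) {
                System.out.println(token.getSurface());
            }
        }
    }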

As a side note, 飛行 is actually quite common.
Edited: 2017-04-10, 5:48 am
#40
There's this whole huge article about tackling the idea of grabbing collocations in Japanese: http://www.heliyon.com/article/e00189?vi...c=y%3D

Looks tough. I wonder if they'd share their tools if you emailed them, though...
#41
"counting input lines" takes a lot of time, could you make it skip that part if "include index of line of first occurrence" is unchecked?

EDIT: this happens only with the latest version; the test5 version works just fine (it briefly shows the "counting input lines" string, then proceeds to "parsing file ..."). So maybe it's just my computer's fault!?
Edited: 2017-04-11, 9:46 am
#42
(2017-04-11, 9:34 am)cophnia61 Wrote: "counting input lines" takes a lot of time; could you make it skip that part if "include index of line of first occurrence" is unchecked?

EDIT: this happens only with the latest version; the test5 version works just fine (it briefly shows the "counting input lines" string, then proceeds to "parsing file ..."). So maybe it's just my computer's fault!?

Are you sure it's not just hung? If the program runs out of memory, it will hang at that part and stop running. Make sure you are using 64-bit Java, as was suggested to me a few posts ago.
#43
Is there a way to use this with very big files?
I've downloaded the Japanese Wikipedia offline dump, which is over 10 GB, but when I try to parse it with your analyzer, it gives me the same issue as before (stuck at the "counting input lines" phase). I'm using 64-bit Java as suggested by Zarxrax.

PS: I know a Wikipedia word frequency list already exists, but I want to try it anyway with Kuromoji, because it seems pretty accurate to me compared to other morphological analyzers.
#44
EDIT: it's working now, I just had to wait ._.
#45
(2017-04-15, 1:19 pm)cophnia61 Wrote: EDIT: it's working now, I just had to wait ._.
So how does this Kuromoji list compare to the previously existing one?
#46
In the end I aborted it and decided to use another list instead, suggested by Zarxrax in another thread:

http://forum.koohii.com/thread-14490-pos...#pid243814
#47
Thanks for letting me know that speed is a problem. I'll consider ways to split the work into small chunks so you can close the program without losing all its progress. I'll also make it so you can see how far along it is in counting the lines in the file.
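The counting pass is just a scan for newline bytes, so progress can be reported against the file size. Roughly like this (a sketch; counting '\n' bytes is safe for UTF-8 because multibyte sequences never contain that byte, and input.txt is a placeholder path):

    import java.io.*;

    public class CountLines {
        public static void main(String[] args) throws IOException {
            File file = new File("input.txt"); // placeholder path
            long totalBytes = file.length();
            long readBytes = 0;
            long lines = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                byte[] buf = new byte[1 << 20];
                int n;
                while ((n = in.read(buf)) != -1) {
                    for (int i = 0; i < n; i++) if (buf[i] == '\n') lines++;
                    readBytes += n;
                    // Progress as a percentage of bytes consumed
                    System.out.printf("counting input lines: %d%%\r", 100 * readBytes / Math.max(1, totalBytes));
                }
            }
            System.out.println();
            System.out.println("lines: " + lines);
        }
    }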
Edited: 2017-04-16, 12:35 am
#48
Here's a list merger/normalizer:

https://github.com/wareya/normalizer/releases/tag/test1

Normalization rescales each input frequency list as if it contained exactly one million tokens, and the output frequency list is scaled the same way.
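A sketch of that arithmetic in isolation (the tab-separated word-then-count layout here is an assumption and may not match the normalizer's real format): each input list's counts get scaled by 1,000,000 over that list's total, the scaled counts are summed per word, and the merged list is rescaled to one million again.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class NormalizeMerge {
        static final double MILLION = 1_000_000.0;

        public static void main(String[] args) throws IOException {
            Map<String, Double> merged = new HashMap<>();
            for (String path : args) {
                // Read word<TAB>count rows and total up this list's token count
                Map<String, Long> counts = new HashMap<>();
                long total = 0;
                for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
                    String[] cols = line.split("\t");
                    if (cols.length < 2) continue;
                    long c = Long.parseLong(cols[1]);
                    counts.merge(cols[0], c, Long::sum);
                    total += c;
                }
                // Pretend this list had exactly one million tokens
                for (Map.Entry<String, Long> e : counts.entrySet()) {
                    merged.merge(e.getKey(), e.getValue() * MILLION / total, Double::sum);
                }
            }
            // Rescale the merged result so it also sums to one million
            double mergedTotal = merged.values().stream().mapToDouble(Double::doubleValue).sum();
            merged.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .forEach(e -> System.out.printf("%s\t%.2f%n",
                            e.getKey(), e.getValue() * MILLION / mergedTotal));
        }
    }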
Edited: 2017-04-26, 7:11 pm
#49
(2017-04-02, 8:03 am)phil321 Wrote:
(2017-04-02, 2:11 am)Nukemarine Wrote: OK, finally got to try something I wanted to do for a while and that's create a vocabulary list from Harry Potter book 1 that can be sorted by order of appearance. This analyzer made that possible and the result is more than usable by anyone I'm sure:

Japanese Translation of Harry Potter Book 1 - Vocabulary list

If you're interested in the steps I used to create this, check out this Reddit thread.

Having a vocabulary list for a book you're reading is a great idea.

Hareeeee Pottaaaaaah is not my cup of tea, however, so I'll have to figure out how to apply these tools to the books I'm actually interested in reading.

[I managed to find Japanese translations of "Rosemary's Baby" and "The Stepford Wives" by Ira Levin.  These are among my favorite books.  I can't wait to see how they translated "the Welcome Wagon Lady" (in the opening paragraph of The Stepford Wives) into Japanese....].

Today I received my copy of The Stepford Wives in Japanese.

Here's how they translated "the Welcome Wagon Lady": 歓迎ワゴンのレヂイは. The kanji 歓迎 are glossed with katakana furigana as follows: ウェルカム. So even though 歓迎, which means "welcome", is pronounced "kangei", it is in this case glossed as "werukamu". Bizarre, but I guess that is the only way to translate "the Welcome Wagon Lady".
Edited: 2017-05-11, 10:48 pm
#50
New version of the frequency list merger, which adds two options that control how it handles the merging process.

https://github.com/wareya/normalizer/releases/tag/test2