
cb's Japanese Text Analysis Tool

#51
I'll take a look at those things during the weekend. No promises though.
Edited: 2013-12-04, 9:15 pm
Reply
#52
I found what the problem was with my word list file: if you leave an extra line feed at the end of the file, the program crashes. If you export an Excel sheet as a txt file, there always seems to be an extra line feed at the end, which you need to remove manually...
Reply
#53
Hello,

I have just released version 4.4 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v4.4 via SourceForge

What Changed?

● Added "percentage" and "cumulative percentage" fields to the word frequency report and kanji frequency report. (Thanks Jiroukun!).

cb4960
Reply
#54
kanttuvei,

I was unable to reproduce the issue. Were you using version 4.3?
Reply
#55
Maybe it's related to Excel saving the file in UCS-2 instead of UTF-8?
Reply
#56
WOAH O.O Thanks CB!

Just downloaded it and it works like a charm!

You're really really amazing ^.^
Reply
#57
I'm not quite sure I understand what this is for, or whether it's what I'm looking for.

Does it take a text file and output the frequency of words or characters? I would like to create a word frequency chart for things I am interested in reading, so that I can see which words I need to study to be able to read a given document. Essentially, a vocab study list for a particular story, article, chapter, etc.

Will it work for text that contains kana? Some of my reading material uses less kanji, so I would like to know whether it would recognize にほんご as a word, rather than just 日本語. Also, for the user-generated list, does it have to be kanji, or can it include hiragana words too?
Reply
#58
I have a question about how this tool works.

Looking at various frequency lists, including your novel frequency list, I'd like to know how many of the words that show up are proper nouns, grammatical particles, etc., and how many are actual vocabulary words. MeCab at least has part-of-speech analysis. Does your tool have a way to generate the list while using the part-of-speech feature to screen out particles, proper nouns, etc.?

Also, is there some tool that can combine the lemmas in the frequency list into a word-family frequency list? (E.g., sushi and sushi-ya are two different words, but they are part of the same word family.)
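For reference, MeCab's default output puts the part of speech in the first comma-separated feature field of each line ("surface<TAB>features"), so the kind of screening described above could be done in a short post-processing step. This is only a sketch in Python that parses pre-generated MeCab-style lines; it does not invoke MeCab itself, and the feature layout assumed here is that of the standard IPA dictionary:

```python
# Sketch: screen particles (助詞) and proper nouns (固有名詞) out of
# MeCab-style output lines of the form "surface\tPOS,subPOS,...".
# This only parses text in MeCab's output format; it does not run MeCab.

def content_words(mecab_lines):
    words = []
    for line in mecab_lines:
        if not line.strip() or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        pos = fields[0]
        sub_pos = fields[1] if len(fields) > 1 else ""
        if pos == "助詞":          # grammatical particle
            continue
        if sub_pos == "固有名詞":  # proper noun
            continue
        words.append(surface)
    return words

# Hand-written example lines in MeCab's IPA-dictionary output format:
sample = [
    "日本語\t名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー",
    "する\t動詞,自立,*,*,*,する,スル,スル",
    "EOS",
]
print(content_words(sample))  # ['日本語', '勉強', 'する']
```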
Reply
#59
MaxHayden Wrote:I have a question about how this tool works.

Looking at various frequency lists, including your novel frequency list, I'd like to know how many of the words that show up are proper nouns, grammatical particles, etc., and how many are actual vocabulary words. MeCab at least has part-of-speech analysis. Does your tool have a way to generate the list while using the part-of-speech feature to screen out particles, proper nouns, etc.?

Also, is there some tool that can combine the lemmas in the frequency list into a word-family frequency list? (E.g., sushi and sushi-ya are two different words, but they are part of the same word family.)
The tool does not currently output part-of-speech or have an ability to filter by part-of-speech. I suppose I could add it in the future though.

I personally do not know of any such tool that can generate a word family list.
Edited: 2014-07-04, 12:56 pm
Reply
#60
Thank you for the info.

If it isn't too difficult to add in the next release, please do add support for outputting the part of speech (so that it can be filtered and/or sorted). I think this would be very useful for generating better vocabulary lists, since it would let us exclude particles and proper nouns and focus only on the actual vocabulary words that need to be learned.

As for the word family thing, I've emailed some Japanese researchers and will report back if they have anything useful to say.
Reply
#61
KH Coder might have the word family feature: http://sourceforge.net/projects/khc/

Or AntConc: http://www.antlab.sci.waseda.ac.jp/software.html
Edited: 2014-07-04, 1:19 pm
Reply
#62
I spoke to Dr. Tatsuhiko Matsushita at the University of Tokyo. He says that word families aren't as useful for Japanese vocabulary learning because of the way the morphology works. He recommends just using lexemes/lemmas and learning more words. If we are interested in Japanese word formation, he says we should look at the papers by Prof. Masahiko Nomura (野村雅昭) and Prof. Masahiko Ishii (石井正彦) which are in Japanese. (He doesn't know of anything in English.)

So the bottom line is that if cb can add support for "part of speech" to the text analysis, that's about as good as we can currently do.
Reply
#63
I have some questions about how the tool works, and some small suggestions based on my limited usage of this neat tool. Thank you very much for sharing it; it has been very helpful!

1) How should the input be composed, and how does it get processed?
I used the tool to analyze a video game's localization files. These files contain a lot of formatting, like
Subtitles[0]=(Subtitle=(Speaker=\"Booker\",Subtitle=\"
etc., so I first went to the trouble of getting rid of all of it with my poor knowledge of regular expressions. However, I later discovered that the tool seems to work even without this manual pre-processing. So now I wonder whether HTML and similar markup gets stripped from the input files.

2) Related to 1): What kinds of files can be parsed?
I noticed that when the files were named *.int or *.jpn, the tool didn't process them but ran on an empty set of files. Maybe it would be helpful to warn the user when that happens. Do the files need to be *.txt, or does the tool also work on *.html and similar?

3) The 'Completed' window is hidden somewhere in the background.
When I processed a lot of files, I alt-tabbed away as it took quite some time. When alt-tabbing back, even after processing had completed, I was always shown the main window rather than the window indicating completion. The completion window can be found by alt-tabbing some more, but I think it would be helpful to force it into the foreground, as I actually ended up killing the tool via the task manager by accident ... several times!

4) Could you add an option to generate a word frequency list containing only unknown words?
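In case it helps with (1): one crude but dependency-free way to do that pre-processing is a regular expression that keeps only Japanese-script characters. This is just a sketch; the character ranges chosen here are my assumption about what should count as Japanese text:

```python
import re

# Sketch: strip localization-file markup by keeping only runs of
# hiragana, katakana (incl. the ー long-vowel mark), and CJK ideographs.
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")

def extract_japanese(line):
    return "".join(JAPANESE_RUN.findall(line))

line = 'Subtitles[0]=(Subtitle=(Speaker=\\"Booker\\",Subtitle=\\"こんにちは\\"))'
print(extract_japanese(line))  # こんにちは
```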
Reply
#64
MaxHayden Wrote:I spoke to Dr. Tatsuhiko Matsushita at the University of Tokyo. He says that word families aren't as useful for Japanese vocabulary learning because of the way the morphology works. He recommends just using lexemes/lemmas and learning more words. If we are interested in Japanese word formation, he says we should look at the papers by Prof. Masahiko Nomura (野村雅昭) and Prof. Masahiko Ishii (石井正彦) which are in Japanese. (He doesn't know of anything in English.)

So the bottom line is that if cb can add support for "part of speech" to the text analysis, that's about as good as we can currently do.
Cb,

I actually thought of a much simpler way to add support for this that would probably work better in practice anyway. Prof. Matsushita has a list of "Assumed Known Words". (Proper names that you should be able to read if you know the language, etc.) He also has a vocabulary list that includes part of speech information.

So really all we need your software to do is to allow for an "exclusion list" of words that don't get counted as part of the tabulation. And if people wanted to generate a frequency list that didn't include proper names, they could just use the existing list and exclude those words. (I guess technically we don't need your software to do this either since it probably wouldn't be very involved to create a Perl script that would exclude a list of words and adjust the statistics. But if it's something you could add without much trouble, I think it would be a good idea.)
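As a sketch of the kind of script described (in Python rather than Perl; the tab-separated "word<TAB>count" input format is an assumption on my part, not the exact output format of cb's tool):

```python
# Sketch: drop excluded words from a frequency list ("word\tcount" per
# line) and recompute each remaining word's percentage of the new total.

def filter_frequency_list(freq_lines, excluded):
    excluded = set(excluded)
    kept = []
    for line in freq_lines:
        word, _, count = line.partition("\t")
        if word not in excluded:
            kept.append((word, int(count)))
    total = sum(count for _, count in kept)
    return [(word, count, 100.0 * count / total) for word, count in kept]

freq = ["の\t120", "東京\t40", "食べる\t30", "本\t10"]
for word, count, pct in filter_frequency_list(freq, ["の", "東京"]):
    print(f"{word}\t{count}\t{pct:.1f}%")
# 食べる  30      75.0%
# 本      10      25.0%
```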
Reply
#65
hyvel Wrote:1) How should the input be composed, and how does it get processed?
I used the tool to analyze a video game's localization files. These files contain a lot of formatting, like
Subtitles[0]=(Subtitle=(Speaker=\"Booker\",Subtitle=\"
etc., so I first went to the trouble of getting rid of all of it with my poor knowledge of regular expressions. However, I later discovered that the tool seems to work even without this manual pre-processing. So now I wonder whether HTML and similar markup gets stripped from the input files.
The only pre-processing that JTAT does is to replace Aozora gaiji with their UTF-8 equivalents and remove all other Aozora formatting. After that, the text is passed directly to MeCab, which seems to ignore non-Japanese text. You shouldn't have to manually remove the formatting present in the localization files.

hyvel Wrote:2) Related to 1): What kinds of files can be parsed?
I noticed that when the files were named *.int or *.jpn, the tool didn't process them but ran on an empty set of files. Maybe it would be helpful to warn the user when that happens. Do the files need to be *.txt, or does the tool also work on *.html and similar?
By default, only .txt files within the directory will be processed. To add additional extensions, simply edit settings.txt and modify the "extensions" option.

For example:
extensions = txt;html;jpn;int

hyvel Wrote:3) The 'Completed' window is hidden somewhere in the background.
When I processed a lot of files, I alt-tabbed away as it took quite some time. When alt-tabbing back, even after processing had completed, I was always shown the main window rather than the window indicating completion. The completion window can be found by alt-tabbing some more, but I think it would be helpful to force it into the foreground, as I actually ended up killing the tool via the task manager by accident ... several times!
Hmm, I suppose I can add the Completed Window to the taskbar so that if it does get hidden it will be easier to open.

hyvel Wrote:4) Could you add an option to generate a word frequency list containing only unknown words?
I think that can be arranged for a future version.
Edited: 2014-07-10, 11:38 pm
Reply
#66
MaxHayden Wrote:
MaxHayden Wrote:I spoke to Dr. Tatsuhiko Matsushita at the University of Tokyo. He says that word families aren't as useful for Japanese vocabulary learning because of the way the morphology works. He recommends just using lexemes/lemmas and learning more words. If we are interested in Japanese word formation, he says we should look at the papers by Prof. Masahiko Nomura (野村雅昭) and Prof. Masahiko Ishii (石井正彦) which are in Japanese. (He doesn't know of anything in English.)

So the bottom line is that if cb can add support for "part of speech" to the text analysis, that's about as good as we can currently do.
Cb,

I actually thought of a much simpler way to add support for this that would probably work better in practice anyway. Prof. Matsushita has a list of "Assumed Known Words". (Proper names that you should be able to read if you know the language, etc.) He also has a vocabulary list that includes part of speech information.

So really all we need your software to do is to allow for an "exclusion list" of words that don't get counted as part of the tabulation. And if people wanted to generate a frequency list that didn't include proper names, they could just use the existing list and exclude those words. (I guess technically we don't need your software to do this either since it probably wouldn't be very involved to create a Perl script that would exclude a list of words and adjust the statistics. But if it's something you could add without much trouble, I think it would be a good idea.)
Should be easy enough to do for the next version. Same goes for the original part-of-speech suggestion.
Reply
#67
Hello,

I have just released version 5.0 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.0 via SourceForge

What Changed?

● Added the part-of-speech field to the Word Frequency report. (Thanks, MaxHayden!)

● Added the Word Removal option. (Thanks, MaxHayden!)

● Analysis is faster and now utilizes all cores.

● Added the Directory Search Options group.

● Added the Remove Aozora Formatting option (in the previous version you didn't get a choice; it was always enabled).

cb4960
Reply
#68
Thanks for adding those features.
Reply
#69
It's been over half a year since I used this program, so I can't remember what I'm doing wrong. When I try to analyze a UTF-8 encoded txt file for word frequency, I just get an empty txt file, no matter what parsing method or encoding I choose. The kanji frequency analysis works, though. I seem to remember having this problem last time too, but I can't remember how I fixed it...

It's an exported Anki txt file, so it's a garbled mess of HTML code lines mixed with Japanese words.
Reply
#70
Termy Wrote:It's been over half a year since I used this program, so I can't remember what I'm doing wrong. When I try to analyze a UTF-8 encoded txt file for word frequency, I just get an empty txt file, no matter what parsing method or encoding I choose. The kanji frequency analysis works, though. I seem to remember having this problem last time too, but I can't remember how I fixed it...

It's an exported Anki txt file, so it's a garbled mess of HTML code lines mixed with Japanese words.
I get the same problem when I try to import a directory containing files in a format different from the one specified in the "settings.txt" file; maybe it could be that? :/ Hope this is useful :)
Reply
#71
Does the previous version work any better for you? Perhaps I screwed something up in the 5.0 release.
Reply
#72
For me, the word removal option does not delete the word from the generated frequency report; is this the intended behavior, or perhaps a bug? Also, for some reason I can't get the part of speech to show up.
Reply
#73
cophnia61: the file extensions specified in settings.ini are txt;srt;ass;htm;html. I haven't changed anything there since back when it did work.

cb4960: I first tried the version I used the last time I ran the program (March/April), though I'm not sure which version that was. Then I upgraded to the latest version, but it didn't make a difference.

I'm thinking there's something obvious I'm missing here and I'm too daft to realize it... I need to keep a diary or something...
Reply
#74
Can you email me the following:

1) A screenshot of Japanese Text Analysis Tool showing all of the options that you are using.

2) The input file (or some subset) that you are using.

I'll try to reproduce it on my end.
Reply
#75
cb4960 Wrote:Can you email me the following:

1) A screenshot of Japanese Text Analysis Tool showing all of the options that you are using.

2) The input file (or some subset) that you are using.

I'll try to reproduce it on my end.
Thanks! The email is sent.
Reply