I'll take a look at those things during the weekend. No promises though.
Edited: 2013-12-04, 9:15 pm
MaxHayden Wrote:I have a question about how this tool works.
The tool does not currently output part-of-speech information or have the ability to filter by part of speech. I suppose I could add it in the future, though.
Looking at various frequency lists, including your novel frequency list, I'd like to know how many of the words that show up are proper nouns, grammatical particles, etc., and how many are actual vocabulary words. MeCab at least provides part-of-speech analysis. Does your tool have a way to generate the list while using the part-of-speech feature to screen out particles, proper nouns, etc.?
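For what it's worth, MeCab's default output makes this kind of screening fairly easy as a post-processing step, independent of whatever the tool does internally. A minimal sketch, assuming an IPA-style dictionary where particles are tagged 助詞 and proper nouns are 名詞 with subtype 固有名詞 (the sample lines below are hypothetical MeCab output, not anything JTAT produces):

```python
# Sketch: filter MeCab-style output lines by part of speech.
# Each line of MeCab's default output looks like:
#   surface<TAB>POS,POS-subtype1,POS-subtype2,...
# With an IPA-style dictionary, particles are 助詞 and proper nouns
# are 名詞 with subtype 固有名詞 (an assumption that depends on the
# dictionary actually installed).

def filter_content_words(mecab_lines):
    """Keep surface forms that are neither particles nor proper nouns."""
    kept = []
    for line in mecab_lines:
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        pos = fields[0]
        subtype = fields[1] if len(fields) > 1 else ""
        if pos == "助詞":  # grammatical particle
            continue
        if pos == "名詞" and subtype == "固有名詞":  # proper noun
            continue
        kept.append(surface)
    return kept

# Hypothetical MeCab output for a short sentence:
sample = [
    "東京\t名詞,固有名詞,地域,一般",  # proper noun -> dropped
    "で\t助詞,格助詞,一般",           # particle -> dropped
    "寿司\t名詞,一般",                # common noun -> kept
    "を\t助詞,格助詞,一般",           # particle -> dropped
    "食べる\t動詞,自立",              # verb -> kept
    "EOS",
]
print(filter_content_words(sample))  # ['寿司', '食べる']
```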
Also, is there some tool that can combine the lemmas in the frequency list into a word-family frequency list? (E.g., "sushi" and "sushi-ya" are two different words, but they are part of the same word family.)
MaxHayden Wrote:I spoke to Dr. Tatsuhiko Matsushita at the University of Tokyo. He says that word families aren't as useful for Japanese vocabulary learning because of the way the morphology works. He recommends just using lexemes/lemmas and learning more words. If we are interested in Japanese word formation, he says we should look at the papers by Prof. Masahiko Nomura (野村雅昭) and Prof. Masahiko Ishii (石井正彦), which are in Japanese. (He doesn't know of anything in English.)
Cb,
So the bottom line is that if cb can add support for "part of speech" to the text analysis, that's about as good as we can currently do.
hyvel Wrote:1) How should the input be composed and how does it get processed?
The only pre-processing that JTAT does is to replace Aozora gaiji with UTF-8 equivalents and remove all other Aozora formatting. After that, the text is passed directly to MeCab, which seems to ignore non-Japanese text. You shouldn't have to manually remove the formatting present in the localization files.
I used the tool to analyze the video game localization files. These files contain a lot of formatting stuff like
Subtitles[0]=(Subtitle=(Speaker=\"Booker\",Subtitle=\"
etc. So I first went to the trouble of getting rid of all of those with my limited knowledge of regular expressions. However, I later discovered that the tool seems to work even without this manual pre-processing. So now I wonder whether e.g. HTML and similar markup get stripped from the input files.
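For anyone who does want to pre-strip that kind of markup anyway, one crude approach is to keep only runs of Japanese characters and drop everything else. A sketch, where the chosen Unicode ranges are my own assumption about what should survive (JTAT itself apparently just hands the raw text to MeCab):

```python
import re

# Sketch: keep only runs of Japanese characters (hiragana, katakana,
# CJK ideographs, plus the common CJK punctuation block) and drop
# everything else, including ASCII localization markup.
JAPANESE_RUN = re.compile(
    r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3000-\u303F]+"
)

def strip_formatting(line):
    """Concatenate all Japanese-character runs found in the line."""
    return "".join(JAPANESE_RUN.findall(line))

# Hypothetical localization line with an invented Japanese subtitle:
line = 'Subtitles[0]=(Subtitle=(Speaker=\\"Booker\\",Subtitle=\\"東京へ行こう。\\"'
print(strip_formatting(line))  # 東京へ行こう。
```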
hyvel Wrote:2) Related to 1): What kind of files can be parsed?
By default, only .txt files within the directory will be processed. To add additional extensions, simply edit settings.txt and modify the "extensions" option.
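I don't know the exact syntax off-hand, but if the "extensions" option is a comma-separated list (a guess at the format; check your own settings.txt), adding the localization extensions might look something like:

```
extensions=.txt,.int,.jpn
```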
I noticed that when the files were named *.int or *.jpn, the tool didn't process them but instead ran on an empty set of files. Maybe it would be helpful to warn the user when that happens. Do the files need to be *.txt, or does the tool also work on *.html and the like?
hyvel Wrote:3) The 'Completed Window' is hidden somewhere in the background.
Hmm, I suppose I can add the Completed window to the taskbar so that if it does get hidden it will be easier to open.
When I processed a lot of files, I alt-tabbed away because it took quite some time. When I alt-tabbed back, even after processing had completed, the main window was always shown rather than the window indicating completion. The completion window can be found by alt-tabbing some more, but I think it would be helpful to force it into the foreground, since I actually used the task manager to kill the tool by accident ... several times!
hyvel Wrote:4) Could you add an (optional) option to generate a word frequency list only containing unknown words?
I think that can be arranged for a future version.
MaxHayden Wrote:Should be easy enough to do for the next version. Same goes for the original part-of-speech suggestion.
I actually thought of a much simpler way to add support for this that would probably work better in practice anyway. Prof. Matsushita has a list of "Assumed Known Words". (Proper names that you should be able to read if you know the language, etc.) He also has a vocabulary list that includes part of speech information.
So really all we need your software to do is to allow for an "exclusion list" of words that don't get counted as part of the tabulation. And if people wanted to generate a frequency list that didn't include proper names, they could just use the existing list and exclude those words. (I guess technically we don't need your software to do this either since it probably wouldn't be very involved to create a Perl script that would exclude a list of words and adjust the statistics. But if it's something you could add without much trouble, I think it would be a good idea.)
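For anyone who wants to try the post-processing route in the meantime, here is a minimal sketch of that script (in Python rather than Perl). The tab-separated word/count format and the percentage column are assumptions about what the frequency list looks like, not necessarily what JTAT actually writes:

```python
# Sketch: drop excluded words from a frequency list and recompute
# the percentage column. Assumes each input line is "word<TAB>count";
# the real list format may differ.

def apply_exclusion_list(freq_lines, excluded):
    """Return "word<TAB>count<TAB>percent" rows with excluded words removed."""
    rows = []
    for line in freq_lines:
        word, count = line.split("\t")
        if word not in excluded:
            rows.append((word, int(count)))
    total = sum(count for _, count in rows)
    return [
        f"{word}\t{count}\t{100.0 * count / total:.2f}%"
        for word, count in rows
    ]

# Hypothetical frequency list and exclusion set:
freq = ["東京\t50", "寿司\t30", "食べる\t20"]
excluded = {"東京"}  # e.g. proper names from an "assumed known" list
for row in apply_exclusion_list(freq, excluded):
    print(row)
```

Recomputing the percentages after exclusion (rather than keeping the original ones) is the point: the remaining words should sum back to 100% of the filtered corpus.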
Termy Wrote:It's been over half a year since I used this program, so I can't remember what I'm doing wrong. When I try to analyze a UTF-8 formatted txt file for word frequency, I just get an empty txt file, no matter what parsing method or encoding I choose. The kanji frequency analysis works, though. I seem to remember having this problem last time too, but can't remember how I fixed it...
I get the same problem when I try to import a directory containing files in a format different from the one specified in the "settings.txt" file; maybe it could be this? :/ Hope this is useful!
It's an exported Anki txt file, so it's a garbled mess of HTML code mixed with Japanese words.
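As a quick check, it might be worth stripping the HTML before feeding the file to the tool, to see whether the markup is what's confusing it. A crude sketch using a regex tag-stripper (not a real HTML parser, but Anki's field markup is usually simple) on a hypothetical field:

```python
import re
import html

# Sketch: crude clean-up of an Anki export field: drop HTML tags and
# unescape entities before handing the text to the analyzer.
TAG = re.compile(r"<[^>]+>")

def strip_html(field):
    """Remove HTML tags and decode entities like &amp; in one field."""
    return html.unescape(TAG.sub("", field))

print(strip_html("<div>日本語の<b>単語</b></div>"))  # 日本語の単語
```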
cb4960 Wrote:Can you email me the following:
1) A screenshot of Japanese Text Analysis Tool showing all of the options that you are using.
2) The input file (or some subset) that you are using.
I'll try to reproduce it on my end.
Thanks! The email is sent.