cb's Japanese Text Analysis Tool

#76
Hi, Thanks! This is a great tool that I'll be sure to use very soon.

I hope you don't mind me making a feature request, I imagine it could be useful. In addition to the user based readability report, wouldn't it be great to have a list of all the unknown words/kanji so that you could study them?
#77
yogert909 Wrote:Hi, Thanks! This is a great tool that I'll be sure to use very soon.

I hope you don't mind me making a feature request, I imagine it could be useful. In addition to the user based readability report, wouldn't it be great to have a list of all the unknown words/kanji so that you could study them?
You can make a list of unknown words and kanji. For example, export all of your Anki decks to txt files, analyze them (word and kanji frequency), merge the resulting reports with "combine frequency reports" under Tools, and finally use "compare frequency reports" to compare your merged Anki report with the report for whatever txt file you want to check for unknown words/kanji.
#78
Termy Wrote:
yogert909 Wrote:Hi, Thanks! This is a great tool that I'll be sure to use very soon.

I hope you don't mind me making a feature request, I imagine it could be useful. In addition to the user based readability report, wouldn't it be great to have a list of all the unknown words/kanji so that you could study them?
You can make a list of unknown words and kanji. For example, export all of your Anki decks to txt files, analyze them (word and kanji frequency), merge the resulting reports with "combine frequency reports" under Tools, and finally use "compare frequency reports" to compare your merged Anki report with the report for whatever txt file you want to check for unknown words/kanji.
Additionally, you can take a look at Word List Duplicate Remover.
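For anyone who wants to see the logic of the combine-then-compare workflow outside the GUI, here is a minimal sketch. It is not how JTAT or Word List Duplicate Remover is implemented. It assumes each frequency report has already been loaded into a plain word-to-count mapping (the report file format is not covered here), and the names `combine_reports` and `unknown_words` are hypothetical.

```python
from collections import Counter


def combine_reports(*reports):
    """Sum several word -> count mappings into a single Counter."""
    combined = Counter()
    for report in reports:
        combined.update(report)  # Counter.update adds counts together
    return combined


def unknown_words(target, known):
    """Return (word, count) pairs that appear in target but not in known.

    Sorted by descending count, then by word, so the most frequent
    unknown words come first.
    """
    missing = [(word, count) for word, count in target.items() if word not in known]
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample data standing in for two Anki deck reports and one novel.
deck_a = {"本": 12, "読む": 5}
deck_b = {"本": 3, "日本": 7}
novel = {"本": 40, "魔法": 9, "読む": 6, "剣": 9}

known = combine_reports(deck_a, deck_b)
todo = unknown_words(novel, known)
```

Here `todo` holds only the words from the novel that never appear in either deck, which is the study list being asked for.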
#79
Oh, thanks! I thought I probably wasn't the first person to think of this :D
#80
5.0 does indeed generate blank word frequency reports from some files. Switching to 4.4 worked for me.
#81
I could have sworn that 4.4 didn't work when I tried it, right before installing 5.0, but when I switched back to 4.4 now it works fine.
#82
This blank file thing is happening for me too.

Happens selectively with 4.4 as well. Sometimes Mecab and always with JParser.
Edited: 2014-09-02, 4:01 pm
#83
Thanks for the responses. I'll set the primary download on SourceForge to v4.4 until I have the time to investigate the issue further.
#84
I'm on a mac and can't get v4.4 to open in parallels (windows says it is not a proper exe file or something). It opens in wine, and using jparser it exports a kanji report, but the word report is blank. Mecab crashes the program.

I so want to get this working. Any suggestions?
#85
yogert909 Wrote:I'm on a mac and can't get v4.4 to open in parallels (windows says it is not a proper exe file or something). It opens in wine, and using jparser it exports a kanji report, but the word report is blank. Mecab crashes the program.

I so want to get this working. Any suggestions?
Other users are having issues with v4.4 even using Windows. I've tried to reproduce the issues on a number of other computers using various versions of Windows (Windows 7 Home Premium 64-bit, Windows 7 Enterprise 64-bit, Windows 7 Ultimate 32-bit, Windows 8 64-bit, Windows XP 32-bit, Windows Vista 32-bit) without success. Getting it to work properly on a Mac might not be possible and I don't know enough about the Mac environment to provide any real help. Maybe your Wine environment is missing whatever version of the Visual Studio runtime that Mecab uses.

If you have a one-time analysis that you would like to perform, feel free to send me your files and I will analyze them for you.
#86
cb4960 Wrote:
yogert909 Wrote:I'm on a mac and can't get v4.4 to open in parallels (windows says it is not a proper exe file or something)
Other users are having issues with v4.4 even using Windows. ... Getting it to work properly on a Mac might not be possible and I don't know enough about the Mac environment to provide any real help.
Windows running in a Parallels virtual machine could be a good way to look into the problem, since the virtualization layer can (hopefully) be assumed to be fixed even if your host (Mac, Win) changes. I would try making a clean install of Win7 or whatever inside VirtualBox (free, robust, works great with Mac and Win hosts) and seeing what happens with version 4.4.

Or compile MeCab with Emscripten and rewrite the tool itself in JavaScript so it all runs in a browser, the ultimate cross-platform layer :)
#87
@CB Thanks for your offer and for taking the time to respond. It's strange, because last night I tried installing Wine via Homebrew instead of MacPorts and I got the same Mecab crash and blank file behavior. I'm going to try installing another version of Windows in Parallels this weekend and see if that does the trick. I might even try a dual boot if that doesn't do it. I'm obsessed with making it work!

@aldebrn I think that'll be the easiest: installing a new version of Windows. Rewriting in JavaScript is a bit outside my capabilities. I found a bash script that I might mess around with, though.
Edited: 2014-09-11, 2:16 pm
#88
yogert909 Wrote:I might even try a dual boot if that doesn't do it. I'm obsessed with making it work!
Or you could Run Windows Vista/7 From a Flash Drive Without Installing.
#89
A fresh install of Windows 7 did the trick. I was able to make good use of the text analysis tool and Word List Duplicate Remover. Thanks!
#90
Thanks for the software cb4960. You really saved a lot of people's time with this one. I was thinking of writing a simple script for kanji analysis myself, but you came up with a full-blown software plus GUI!

Anyway, any chance that you can add the order of appearance of kanji and word as well? I prefer to learn by the order of appearance, rather than the frequency.

EDIT: On second thought, I don't like to learn by order of appearance after all. There's 8000 words in one volume of a light novel I'm trying to read!
Edited: 2015-02-24, 12:49 am
#91
blindbox Wrote:Thanks for the software cb4960. You really saved a lot of people's time with this one. I was thinking of writing a simple script for kanji analysis myself, but you came up with a full-blown software plus GUI!

Anyway, any chance that you can add the order of appearance of kanji and word as well? I prefer to learn by the order of appearance, rather than the frequency.

EDIT: On second thought, I don't like to learn by order of appearance after all. There's 8000 words in one volume of a light novel I'm trying to read!
Still, not a terrible idea... Maybe I can add a new column containing the order in which the words/kanji were found.

Maybe you can run the tool once for each new chapter so you aren't presented with 8000 words all at one time.
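As a rough illustration of what an order-of-appearance column could look like next to the frequency count, here is a minimal sketch for kanji only. It is not JTAT's code. The function name `kanji_stats` is hypothetical, and the kanji test is simplified to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters in extension blocks are ignored.

```python
def kanji_stats(text):
    """Map each kanji to (count, order of first appearance).

    Assumption: a "kanji" is any character in U+4E00..U+9FFF.
    Order of appearance starts at 1 for the first kanji seen.
    """
    stats = {}
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            if ch not in stats:
                # New kanji: count 0 for now, rank = number seen so far + 1
                stats[ch] = [0, len(stats) + 1]
            stats[ch][0] += 1
    return {kanji: tuple(values) for kanji, values in stats.items()}


sample = kanji_stats("日本語の本を日本で読む")
```

Sorting the result by the second element gives the order-of-appearance list, and sorting by the first gives the usual frequency list. Running it once per chapter keeps each list short.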
#92
I think having that order would be useful.
#93
This tool fails when it encounters certain gibberish or control characters, as can be seen in this example file:
https://mega.co.nz/#!xQEjARRY!S_GCZtd7Ds...hecqoAtIaQ
Can this be fixed so that it will bypass the junk and still get all of the data from the Japanese?
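Until the tool handles this itself, one possible workaround is to clean the text before analysis. The sketch below is an assumption about what might help, not a confirmed fix, since the exact characters that trip the tool are not identified here. It drops every character whose Unicode general category starts with "C" (control, format, surrogate, private use, unassigned) while keeping newlines, carriage returns, and tabs. The name `strip_control_chars` is hypothetical.

```python
import unicodedata

# Whitespace control characters that are worth keeping in a text file.
KEEP = {"\n", "\r", "\t"}


def strip_control_chars(text):
    """Remove characters in Unicode categories C* except newline, CR, and tab."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("C")
    )


cleaned = strip_control_chars("日本\x00語\x07\nテスト")
```

Kana, kanji, and ordinary punctuation are untouched, so the Japanese content of the file stays the same.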
#94
Gayle Wrote:5.0 does indeed generate blank word frequency reports from some files. Switching to 4.4 worked for me.
ryuudou Wrote:This blank file thing is happening for me too.

Happens selectively with 4.4 as well. Sometimes Mecab and always with JParser.
cb4960 Wrote:Thanks for the responses. I'll set the primary download on SourceForge to v4.4 until I have the time to investigate the issue further.
@cb4960: I managed to track down this issue. If I put JTAT in a folder whose path has no spaces, e.g. E:\JTAT\, the blank word report issue using JParser doesn't occur. However, if I put JTAT in a folder whose path contains a space, e.g. E:\J T A T\, the blank word report issue does occur.

To summarize, here are my tests.
E:\JTAT\ - no blank
E:\J T A T\ - blank word report
E:\foobar\JTAT\ - no blank
E:\foo bar\JTAT\ - blank word report
E:\foobar\J T A T\ - blank word report

This has been tested with both MeCab and JParser on JTAT 5.0. Same behavior with both.

In the meantime, others can work around this by placing JTAT in a folder or subfolder whose path contains no spaces. The output location is not affected by this bug.

That is, JTAT must be in a folder like one of the following:
E:\JTAT\
E:\foobar\JTAT\
E:\foo\bar\JTAT\
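To check an install location against the test results above, a small helper can list which folder names in a path contain a space. This is only a sketch of the workaround check. It says nothing about why spaces would break the word report, and the name `parts_with_spaces` is hypothetical. `PureWindowsPath` parses Windows-style paths on any operating system.

```python
from pathlib import PureWindowsPath


def parts_with_spaces(path):
    """Return the components of a Windows path that contain a space."""
    return [part for part in PureWindowsPath(path).parts if " " in part]


# The paths from the tests above (backslashes doubled for Python strings).
results = {
    "E:\\JTAT\\": parts_with_spaces("E:\\JTAT\\"),
    "E:\\J T A T\\": parts_with_spaces("E:\\J T A T\\"),
    "E:\\foo bar\\JTAT\\": parts_with_spaces("E:\\foo bar\\JTAT\\"),
    "E:\\foo\\bar\\JTAT\\": parts_with_spaces("E:\\foo\\bar\\JTAT\\"),
}
```

An empty list means the path matches the "no blank" cases. Any non-empty list names the folder to rename or move out of.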
Edited: 2015-05-10, 11:07 am
#95
Thanks for investigating this. I'll try to reproduce this later with spaces in the path. If this is the issue, it should be an easy fix.

Update:
I'm unable to reproduce this. For me it works fine with or without spaces in the path.
Edited: 2015-05-14, 9:04 pm
#96
Hm, I just tried it again myself, word_freq_report.txt is still blank (both meCAB and JDICT) if my folders or subfolders have spaces. Tested on 5.1.

Maybe some sanitization of the source code is required? Maybe a dll version mismatch issue? I'm running the software on Windows 7. I've tried to install .net 3.5 but I can't (also tried the workaround that tells you how to activate .net 3.5 on windows 7).

Gonna try update my windows.
Edited: 2015-05-23, 5:02 am
#97
cb4960's tools that need novel texts to work are not very useful to me because I don't have these novel texts or know where to get them.

His tools that don't need novel texts to work, namely Rikaisama and the OCR one, are very useful, I found.
#98
jcdietz03 Wrote:cb4960's tools that need novel texts to work are not very useful to me because I don't have these novel texts or know where to get them.

His tools that don't need novel texts to work, namely Rikaisama and the OCR one, are very useful, I found.
what tools need novel texts to work?

his japanese text analysis tool just needs a japanese text to analyze...but no surprise there.
#99
Are you aware of this morpheme splitter?
http://atilika.com/en/products/kuromoji.html
It uses a statistical/context model that keeps it from doing shit like misinterpreting それ+は as the adverb ソレハ. It also handles made-up words and names elegantly.
Old demo/talk (1-2 yrs):

I'd be interested in seeing support for this because it avoids the ソレハ thing, which seems to be a current problem that the things we're currently on have. Everything I've gotten out of JTAT has way too many false positives of ソレハ. I would just blacklist bad terms, but it negatively impacts the ranking of それ regardless, so I don't see a point.
Also it generally seems to be an incredibly good splitter for japanese.
wareya Wrote:Are you aware of this morpheme splitter?
http://atilika.com/en/products/kuromoji.html
It uses a statistical/context model that keeps it from doing shit like misinterpreting それ+は as the adverb ソレハ. It also handles made-up words and names elegantly.
Old demo/talk (1-2 yrs):

I'd be interested in seeing support for this because it avoids the ソレハ thing, which seems to be a current problem that the things we're currently on have. Everything I've gotten out of JTAT has way too many false positives of ソレハ. I would just blacklist bad terms, but it negatively impacts the ranking of それ regardless, so I don't see a point.
Also it generally seems to be an incredibly good splitter for japanese.
Can you provide an example sentence that demonstrates this behavior?