cb's Japanese Text Analysis Tool

Using 5.1.0.0:
>『それは違う。あなた間違っていますよ』って言ったらどうします?
MeCab:
1 それは 7.14285714 85.71428571 副詞,一般,*,*
JParser:
1 それは 10.00000000 70.00000000 adv
Blacklisting それは removes it from the list without causing it to pick up on それ, of course.
Reply
Thanks for the example. Unfortunately kuromoji appears to be a Java library, so I won't be able to integrate it as easily as I may have hoped. There's also the issue of requiring Java to be installed.
Reply
From their FAQ:

Quote:Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.
Reply
After some further analysis, I've determined that the issue is not with MeCab itself but with the user dictionary that I am providing to it. I plan to remove それは and perhaps other entries from the user dictionary in the next release. I will also add an option to disable the user dictionary.
Reply
Thank you for your efforts.
Reply
Hello,

I have just released version 5.2 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.2 via SourceForge

What Changed?

● Removed entries from the MeCab user dictionary that interfered with MeCab's statistical model.

● Added the Use Enhanced Dictionary option when MeCab is selected. This enables the MeCab user dictionary.

cb4960
Edited: 2015-07-03, 11:26 pm
Reply
Word frequency reports are turning out empty with the newest version. Is this a bug or just me?
EDIT
Fixed by following the steps detailed by blindbox in this thread.
Edited: 2015-07-20, 6:23 am
Reply
I found a text of the book I want to read and I analyzed it using this tool.

How can I use the output from this tool to further my studies?

What I did was
1. Copy and paste the word frequency report into editpad.org.
2. Scan down the list for unfamiliar words with f >= 10 (my text is a light novel, approx. 50,000 "words" long; not sure how many characters).
2a. For each unfamiliar word, use Rikaisama to build an Anki-importable TSV of words and definitions.
3. Import into Anki.

Is that a good method?
Reply
There is a tool in the program to compare two word lists. If you have a word list of known words, you could filter out all of the words that you already know, which would automate step 2. There's also an Anki add-on that adds definitions to cards, so you could automate step 2a as well.
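If you would rather script step 2 yourself, here is a rough Python sketch of the filtering. It assumes the report is tab-separated with the count in field 1 and the word in field 2; the function name, the sample rows, and the known-word set are all made up for illustration and are not part of JTAT.

```python
import io

def unknown_frequent_words(report_lines, known_words, min_count=10):
    """Return (count, word) pairs that are frequent enough and not already known.

    Assumption: tab-separated rows, count in field 1 and word in field 2.
    """
    picked = []
    for line in report_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        count, word = int(fields[0]), fields[1]
        if count >= min_count and word not in known_words:
            picked.append((count, word))
    return picked

# Made-up sample rows, not real JTAT output.
sample = io.StringIO(
    "25\t彼女\t0.05\t0.05\t名詞\n"
    "12\t頷く\t0.02\t0.07\t動詞\n"
    "3\t瞬間\t0.01\t0.08\t名詞\n"
)
known = {"彼女"}  # hypothetical list of words you already know
result = unknown_frequent_words(sample, known)
print(result)
```

For a real report you would pass an open file instead of the sample, and load the known words from your own list.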
Reply
Or you could use cb's EPWING2ANKI for step 2a. It adds example sentences too so that's a plus.
Reply
Hello,

I have just released version 5.3 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.3 via SourceForge

What Changed?

● Added some optimizations.

● Added analysis time to the Complete dialog.

● Fixed bug in user-readability report not using the Use Enhanced Dictionary option.

● Added the max_tasks option to settings.txt.

● Now targets .NET 4.5.

cb4960
Reply
Getting out of memory exceptions with large files in the new version.
[spoiler]
See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ToString()
at System.IO.StreamReader.ReadToEnd()
at JapaneseTextAnalysisTool.Mecab.parseWithExe(String input, Boolean userDic)
at JapaneseTextAnalysisTool.Mecab.parseFields(String input, Boolean userDic, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FreqWord.addWithMecab(String text, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FreqWord.addFileText(String text, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FormMain.analyzeFile(FileInfo file)
at JapaneseTextAnalysisTool.FormMain.<>c__DisplayClass4.<analyzeFileAsync>b__3()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at JapaneseTextAnalysisTool.FormMain.<callAnalyzeFile>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<ThrowAsync>b__4(Object state)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.NET/Framework/v4.0.30319/mscorlib.dll
----------------------------------------
JapaneseTextAnalysisTool
Assembly Version: 5.3.0.0
Win32 Version: 5.3.0.0
CodeBase: file:///D:/JTAT/JapaneseTextAnalysisTool.exe
----------------------------------------
System.Windows.Forms
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34251 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Windows.Forms/v4.0_4.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System.Drawing
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Drawing/v4.0_4.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34238 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System/v4.0_4.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Configuration
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Configuration/v4.0_4.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34234 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Xml/v4.0_4.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.
[/spoiler]

On failure the mecab folder has some really large files left over (mecab_in and mecab_out) which aren't deleted.
Reply
I just uploaded the 64-bit version, try that one instead. It should allow JTAT access to more memory. Not sure why it's taking up so much memory to begin with though. Analysis of the 5000+ innocent novel set doesn't use more than ~200 MB on my machine. How large is the file that you are analyzing?
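For what it's worth, the trace above fails inside StreamReader.ReadToEnd, which pulls the parser's whole output into one string. Below is a minimal Python sketch of the alternative, counting tokens one line at a time. It is not JTAT's code; it assumes MeCab's default output layout (surface form, a tab, then comma-separated features, with EOS closing each sentence), and the function name is made up.

```python
from collections import Counter

def count_surface_forms(lines):
    """Count tokens from an iterable of MeCab-style output lines, one line at a time."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line == "EOS":
            continue  # blank line or end-of-sentence marker
        surface, _, _features = line.partition("\t")
        counts[surface] += 1
    return counts

# Hand-written sample in MeCab's default layout, not captured from a real run.
sample_output = [
    "それ\t名詞,代名詞,一般,*,*,*,それ,ソレ,ソレ\n",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n",
    "違う\t動詞,自立,*,*,五段・ワ行促音便,基本形,違う,チガウ,チガウ\n",
    "EOS\n",
    "それ\t名詞,代名詞,一般,*,*,*,それ,ソレ,ソレ\n",
    "EOS\n",
]
counts = count_surface_forms(sample_output)
print(counts.most_common())
```

Because the function only ever holds one line plus the running counts, passing it an open file handle keeps memory use roughly proportional to the number of distinct words rather than to the size of the output.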
Reply
Working perfectly now. Thank you. As for the size, I was analyzing a 400 MB folder. The largest file should have been about 40 MB.
Reply
Some time ago I made a small modification to the app (I think I used the 5.0 version). Basically I added the option to save separate reports for each file when processing directories. I planned to use those reports for making vocab drills based on chapters of web novels (don't ask). My vocab drill app got stuck in a very beta stage, so I guess it's not very usable for anyone but me, but maybe someone will find the mod to Japanese Text Analysis Tool useful.

The code and executable can be found here: https://github.com/jahu00/JTATmod

I tried contacting cb about this before (through SourceForge), but I'm not sure if he ever got my message.
Reply
Hello,

I have just released version 5.4 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.4 via SourceForge

What Changed?

● Added the "Frequency Group" and "Frequency Rank" fields to both the Word Frequency Report and Kanji Frequency Report.

Frequency Group: All words in the analysis that share the exact same frequency (Field 1) will be assigned to a numbered Frequency Group, with group 1 containing the most common word(s), group 2 containing the next most common word(s), and so on.

Frequency Rank: For a given word, the Frequency Rank is the total number of words in the analysis that are more frequent than the given word, plus 1. For example, if the given word has a Frequency Rank of 500, then there are 499 other words in the analysis that are more frequent than the given word.
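To make the two definitions concrete, here is a small Python sketch that derives both values from a word-to-count mapping. It follows the wording above and is not JTAT's actual implementation; the function name and the sample counts are made up.

```python
from collections import Counter

def frequency_groups_and_ranks(counts):
    """Map each word to (count, frequency_group, frequency_rank)."""
    # Distinct counts, highest first: position + 1 is the Frequency Group.
    distinct = sorted(set(counts.values()), reverse=True)
    group_of = {c: i + 1 for i, c in enumerate(distinct)}

    # Frequency Rank = number of words with a strictly higher count, plus 1.
    words_with_count = Counter(counts.values())
    rank_of = {}
    more_frequent = 0
    for c in distinct:
        rank_of[c] = more_frequent + 1
        more_frequent += words_with_count[c]

    return {w: (c, group_of[c], rank_of[c]) for w, c in counts.items()}

# Made-up counts for illustration.
counts = {"の": 50, "は": 50, "猫": 7, "犬": 7, "鳥": 2}
table = frequency_groups_and_ranks(counts)
for word, (count, group, rank) in table.items():
    print(count, word, group, rank)
```

Note that words with the same count share both a group and a rank, so ranks skip ahead after a tie while groups do not.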

New Word Frequency Report format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Frequency Group
Field 4: Frequency Rank
Field 5: Percentage (Field 1 / Total number of words)
Field 6: Cumulative percentage
Field 7: Part-of-speech

New Kanji Frequency Report format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Field 3: Frequency Group
Field 4: Frequency Rank
Field 5: Percentage (Field 1 / Total number of kanji)
Field 6: Cumulative percentage
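
For anyone post-processing the new reports, a possible Python reader for one row of either format is sketched below. The field order follows the lists above; the tab delimiter, the type and function names, and the sample rows are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class ReportRow(NamedTuple):
    count: int            # Field 1
    item: str             # Field 2 (word or kanji)
    group: int            # Field 3
    rank: int             # Field 4
    percentage: float     # Field 5
    cumulative: float     # Field 6
    pos: Optional[str]    # Field 7, word reports only

def parse_row(line, sep="\t"):
    """Parse one report line; 7 fields for words, 6 for kanji."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}")
    pos = fields[6] if len(fields) == 7 else None
    return ReportRow(int(fields[0]), fields[1], int(fields[2]), int(fields[3]),
                     float(fields[4]), float(fields[5]), pos)

# Made-up rows, not taken from the sample output.
word_row = parse_row("120\t猫\t35\t412\t0.05\t61.2\t名詞,一般,*,*\n")
kanji_row = parse_row("300\t猫\t20\t150\t0.1\t45.0\n")
print(word_row)
print(kanji_row)
```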

Innocent Novel analysis (Sample_Output_151003.zip) can be found at the link above.

cb4960
Edited: 2015-10-10, 9:57 pm
Reply
You're a beautiful man cb. Danke.
Reply
Getting this on Windows 7 when clicking the analyze button:

************** Exception Text **************
System.TypeLoadException: Could not load type 'System.Runtime.CompilerServices.IAsyncStateMachine' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.
at JapaneseTextAnalysisTool.FormMain.callAnalyzeFile(FileInfo file)
at JapaneseTextAnalysisTool.FormMain.performAnalysis()
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

64bit version of the program. I also can't tell you if that blank report error was ever fixed as I'm not on Windows 8 anymore where that took place.
Edited: 2015-10-05, 4:34 pm
Reply
ryuudou Wrote:Getting this on Windows 7 when clicking the analyze button:

************** Exception Text **************
System.TypeLoadException: Could not load type 'System.Runtime.CompilerServices.IAsyncStateMachine' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.

64bit version of the program. I also can't tell you if that blank report error was ever fixed as I'm not on Windows 8 anymore where that took place.
Do you have .NET version 4.5 installed?

To find .NET Framework versions by viewing the registry (.NET Framework 4.5 and later)
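
If you'd rather not dig through regedit, the same registry check can be scripted. This is a rough Python sketch that assumes the key and value Microsoft documents for 4.5 and later (the Release DWORD under NDP\v4\Full, where 378389 or higher means 4.5 or later); the function names are made up.

```python
import sys

NET45_MIN_RELEASE = 378389  # Release value Microsoft documents for .NET Framework 4.5

def has_net45(release):
    """True if the registry Release value indicates .NET Framework 4.5 or later."""
    return release is not None and release >= NET45_MIN_RELEASE

def read_release():
    """Return the Release DWORD, or None if it cannot be read (e.g. not on Windows)."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard-library module
    key_path = r"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "Release")
            return value
    except OSError:
        return None  # key or value missing: no 4.5+ install detected

if __name__ == "__main__":
    print(".NET 4.5 or later installed:", has_net45(read_release()))
```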
Reply
Hello,

I have just released version 2.3 of cb's Japanese Frequency List Sorter.

Download cb's Japanese Frequency List Sorter via SourceForge

What Changed?

● Added "Output entire line" option.
● Added "Append number of times encountered" option.
● Added "Append Frequency Group" option.
● Added "Append Frequency Rank" option.
● Updated word_freq_report_mecab.txt and kanji_freq_report.txt.

cb4960
Reply
cb4960 Wrote:
Do you have .Net version 4.5 installed?

To find .NET Framework versions by viewing the registry (.NET Framework 4.5 and later)
It seems to work now but now I'm getting blank word reports with both parsers. This also happened in Windows 8.
Reply
ryuudou Wrote:
It seems to work now but now I'm getting blank word reports with both parsers. This also happened in Windows 8.
Someday I'll have to add logging to JTAT to get to the bottom of this.

Another user who was having a similar issue apparently found a workaround:

http://forum.koohii.com/showthread.php?p...#pid220227
Reply
I have the blank word report problem, and I can't download version 4.4.

Sorry, never mind. Having no spaces does the trick.

By the way, is there a way to blacklist parts of speech?
I want JTAT to ignore 感動詞 (interjections), 助詞 (particles) and 助動詞 (auxiliary verbs).
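
Until something like that exists in JTAT, the report could be filtered after the fact. Here is a rough Python sketch; it assumes tab-separated rows with the part-of-speech as the last field, written the way the MeCab line earlier in the thread shows it (副詞,一般,*,*). The function name and sample rows are made up.

```python
BLACKLISTED_POS = {"感動詞", "助詞", "助動詞"}

def drop_blacklisted(report_lines, blacklist=BLACKLISTED_POS, sep="\t"):
    """Keep only the rows whose top-level part-of-speech is not blacklisted."""
    kept = []
    for line in report_lines:
        fields = line.rstrip("\r\n").split(sep)
        # Assumption: the part-of-speech is the last field, e.g. "副詞,一般,*,*".
        top_level_pos = fields[-1].split(",")[0]
        if top_level_pos not in blacklist:
            kept.append(line)
    return kept

# Made-up rows for illustration.
sample = [
    "40\tは\t4.0\t4.0\t助詞,係助詞,*,*\n",
    "12\tです\t1.2\t5.2\t助動詞,*,*,*\n",
    "9\t猫\t0.9\t6.1\t名詞,一般,*,*\n",
    "2\tああ\t0.2\t6.3\t感動詞,*,*,*\n",
]
kept = drop_blacklisted(sample)
print("".join(kept), end="")
```

The percentage and cumulative columns in the kept rows still reflect the unfiltered totals, so they would need recomputing if you rely on them.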
Edited: 2015-10-31, 8:31 pm
Reply
Anyone know a tool or some code I can use to take the word frequency report, and limit it to only words which contain kanji?
Reply
Zarxrax Wrote:Anyone know a tool or some code I can use to take the word frequency report, and limit it to only words which contain kanji?
Open the report in a text editor that supports regular expressions, search for the following pattern, and replace it with nothing. This deletes every line that contains no kanji.

Search for:  ^((?![\x{4e00}-\x{9faf}]).)*\n
Replace with:
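
If you prefer a script, the same idea can be sketched in Python. One caveat: with MeCab the part-of-speech column is itself written in kanji (see the 副詞,一般,*,* line earlier in the thread), so a whole-line match would keep every row. This sketch therefore checks only the word, which it assumes is the second tab-separated field; the function name and sample rows are made up.

```python
import re

# Same range as the editor pattern above.
KANJI = re.compile(r"[\u4e00-\u9faf]")

def rows_with_kanji(report_lines, sep="\t"):
    """Keep only the rows whose word (field 2) contains at least one kanji."""
    kept = []
    for line in report_lines:
        fields = line.rstrip("\r\n").split(sep)
        word = fields[1] if len(fields) > 1 else ""
        if KANJI.search(word):
            kept.append(line)
    return kept

# Made-up rows for illustration.
sample = [
    "7\tこれ\t0.7\t0.7\t名詞,代名詞,一般,*\n",
    "5\t猫\t0.5\t1.2\t名詞,一般,*,*\n",
    "3\t食べる\t0.3\t1.5\t動詞,自立,*,*\n",
]
kept = rows_with_kanji(sample)
print("".join(kept), end="")
```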
Edited: 2015-11-30, 6:35 pm
Reply