kanji koohii FORUM
cb's Japanese Text Analysis Tool - Printable Version



cb's Japanese Text Analysis Tool - wareya - 2015-07-02

Using 5.1.0.0:
>『それは違う。あなた間違っていますよ』って言ったらどうします?
MeCab:
1 それは 7.14285714 85.71428571 副詞,一般,*,*
JParser:
1 それは 10.00000000 70.00000000 adv
Blacklisting それは removes it from the list without causing it to pick up on それ, of course.


cb's Japanese Text Analysis Tool - cb4960 - 2015-07-03

Thanks for the example. Unfortunately kuromoji appears to be a Java library, so I won't be able to integrate it as easily as I may have hoped. There's also the issue of requiring Java to be installed.


cb's Japanese Text Analysis Tool - Tamba - 2015-07-03

from their FAQ:

Quote: Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.



cb's Japanese Text Analysis Tool - cb4960 - 2015-07-03

After some further analysis, I've determined that the issue is not with MeCab itself, but rather with the user dictionary that I am providing to it. I plan to remove それは and perhaps other entries from the user dictionary in the next release. I will also add an option to disable the user dictionary.


cb's Japanese Text Analysis Tool - wareya - 2015-07-03

Thank you for your efforts.


cb's Japanese Text Analysis Tool - cb4960 - 2015-07-03

Hello,

I have just released version 5.2 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.2 via SourceForge

What Changed?

● Removed entries from the MeCab user dictionary that interfered with MeCab's statistical model.

● Added the Use Enhanced Dictionary option when MeCab is selected. This enables the MeCab user dictionary.

cb4960


cb's Japanese Text Analysis Tool - xd1986k - 2015-07-13

Word frequency reports are turning out empty with the newest version. Is this a bug or just me?
EDIT: Fixed by following the steps detailed by blindbox in this thread.


cb's Japanese Text Analysis Tool - jcdietz03 - 2015-07-29

I found a text of the book I want to read and I analyzed it using this tool.

How can I use the output from this tool to further my studies?

What I did was
1. Copy and paste the word frequency report into editpad.org.
2. Scan down the list for unfamiliar words with f >= 10 (my text is a light novel, approx. 50,000 "words" long; not sure how many characters).
2a. For unfamiliar words, use Rikaisama to make an Anki-importable TSV of the words and their definitions.
3. Import into Anki.

Is that a good method?


cb's Japanese Text Analysis Tool - yogert909 - 2015-07-29

There is a tool in the program to compare 2 wordlists. If you have a wordlist of known words, you could filter out all of the words that you already know, which would automate step 2. There's also an Anki add-on that will add definitions to cards, so you could automate step 2a as well.
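
For anyone who would rather script the filtering, here is a minimal Python sketch of the same idea. It is not part of JTAT. It assumes the report is tab-separated with the count in the first field and the word in the second; the function name, the sample rows and the known-word set are all made up for illustration.

```python
def unknown_frequent_words(report_lines, known_words, min_count=10):
    """Yield (count, word) for words seen at least min_count times and not already known.

    Assumes tab-separated report lines: field 1 = count, field 2 = word.
    """
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        count, word = int(fields[0]), fields[1]
        if count >= min_count and word not in known_words:
            yield count, word


# Hypothetical sample rows (count, word, percentage, cumulative percentage, part of speech).
report = [
    "25\t彼女\t0.5\t10.0\tnoun",
    "12\t確か\t0.2\t12.0\tadj",
    "3\t偶然\t0.1\t13.0\tnoun",
]
known = {"彼女"}  # hypothetical list of words you already know

for count, word in unknown_frequent_words(report, known):
    print(count, word)
```

With real data you would read the report and the known-word list from files (opened as UTF-8) instead of the inline samples.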


cb's Japanese Text Analysis Tool - xd1986k - 2015-07-30

Or you could use cb's EPWING2ANKI for step 2a. It adds example sentences too so that's a plus.


cb's Japanese Text Analysis Tool - cb4960 - 2015-07-31

Hello,

I have just released version 5.3 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.3 via SourceForge

What Changed?

● Added some optimizations.

● Added analysis time to the Complete dialog.

● Fixed bug in user-readability report not using the Use Enhanced Dictionary option.

● Added the max_tasks option to settings.txt.

● Now targets .Net 4.5.

cb4960


cb's Japanese Text Analysis Tool - xd1986k - 2015-08-02

Getting out of memory exceptions with large files in the new version.
[spoiler]
See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ToString()
at System.IO.StreamReader.ReadToEnd()
at JapaneseTextAnalysisTool.Mecab.parseWithExe(String input, Boolean userDic)
at JapaneseTextAnalysisTool.Mecab.parseFields(String input, Boolean userDic, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FreqWord.addWithMecab(String text, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FreqWord.addFileText(String text, CancellationTokenSource cancelTokenSource)
at JapaneseTextAnalysisTool.FormMain.analyzeFile(FileInfo file)
at JapaneseTextAnalysisTool.FormMain.<>c__DisplayClass4.<analyzeFileAsync>b__3()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at JapaneseTextAnalysisTool.FormMain.<callAnalyzeFile>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<ThrowAsync>b__4(Object state)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.NET/Framework/v4.0.30319/mscorlib.dll
----------------------------------------
JapaneseTextAnalysisTool
Assembly Version: 5.3.0.0
Win32 Version: 5.3.0.0
CodeBase: file:///D:/JTAT/JapaneseTextAnalysisTool.exe
----------------------------------------
System.Windows.Forms
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34251 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Windows.Forms/v4.0_4.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System.Drawing
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Drawing/v4.0_4.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34238 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System/v4.0_4.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Configuration
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34209 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Configuration/v4.0_4.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 4.0.0.0
Win32 Version: 4.0.30319.34234 built by: FX452RTMGDR
CodeBase: file:///C:/Windows/Microsoft.Net/assembly/GAC_MSIL/System.Xml/v4.0_4.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.
[/spoiler]

On failure the mecab folder has some really large files left over (mecab_in and mecab_out) which aren't deleted.


cb's Japanese Text Analysis Tool - cb4960 - 2015-08-02

I just uploaded the 64-bit version, try that one instead. It should allow JTAT access to more memory. Not sure why it's taking up so much memory to begin with though. Analysis of the 5000+ innocent novel set doesn't use more than ~200 MB on my machine. How large is the file that you are analyzing?


cb's Japanese Text Analysis Tool - xd1986k - 2015-08-03

Working perfectly now. Thank you. As for the size, I was analyzing a 400 MB folder. The largest file should've been 40 MB.


cb's Japanese Text Analysis Tool - jahu00 - 2015-08-16

Some time ago I made a small modification to the app (I think I used the 5.0 version). Basically I added the option to save separate reports for each file when processing directories. I planned to use those reports for making vocab drills based on chapters of web novels (don't ask). My vocab drill app got stuck in a very beta stage, so I guess it's not very usable for anyone but me, but maybe someone will find the mod to Japanese Text Analysis Tool useful.

The code and executable can be found here: https://github.com/jahu00/JTATmod

I tried contacting cb about this before (through sourceforge), but I'm not sure if he ever got my message.


cb's Japanese Text Analysis Tool - cb4960 - 2015-10-03

Hello,

I have just released version 5.4 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v5.4 via SourceForge

What Changed?

● Added the "Frequency Group" and "Frequency Rank" fields to both the Word Frequency Report and Kanji Frequency Report.

Frequency Group: All words in the analysis that share the exact same frequency (Field 1) will be assigned to a numbered Frequency Group, with group 1 containing the most common word(s), group 2 containing the next most common word(s), and so on.

Frequency Rank: For a given word, the Frequency Rank is the total number of words in the analysis that are more frequent than the given word, plus 1. For example, if the given word has a Frequency Rank of 500, then there are 499 other words in the analysis that are more frequent than the given word.
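
To make the two definitions concrete, here is a small Python sketch that derives both values from a table of word counts. It only illustrates the definitions above and is not JTAT's actual code; the function name and the sample counts are invented, and "words" is taken to mean distinct words.

```python
from itertools import groupby


def add_group_and_rank(counts):
    """Return rows of (word, count, frequency_group, frequency_rank), most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    rows = []
    seen = 0  # number of distinct words that are more frequent than the current group
    grouped = groupby(ordered, key=lambda item: item[1])
    for group, (count, members) in enumerate(grouped, start=1):
        members = list(members)
        for word, _ in members:
            # Every word sharing this count gets the same group and the same rank.
            rows.append((word, count, group, seen + 1))
        seen += len(members)
    return rows


# Invented sample counts.
counts = {"の": 50, "は": 50, "猫": 7, "犬": 7, "鳥": 2}
for row in add_group_and_rank(counts):
    print(row)
```

Here の and は share group 1 and rank 1, 猫 and 犬 share group 2 and rank 3 (two words are more frequent), and 鳥 lands in group 3 with rank 5.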

New Word Frequency Report format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Frequency Group
Field 4: Frequency Rank
Field 5: Percentage (Field 1 / Total number of words)
Field 6: Cumulative percentage
Field 7: Part-of-speech

New Kanji Frequency Report format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Field 3: Frequency Group
Field 4: Frequency Rank
Field 5: Percentage (Field 1 / Total number of kanji)
Field 6: Cumulative percentage

Innocent Novel analysis (Sample_Output_151003.zip) can be found at the link above.

cb4960


cb's Japanese Text Analysis Tool - rainmaninjapan - 2015-10-05

You're a beautiful man cb. Danke.


cb's Japanese Text Analysis Tool - ryuudou - 2015-10-05

Getting this on Windows 7 when clicking the analyze button:

************** Exception Text **************
System.TypeLoadException: Could not load type 'System.Runtime.CompilerServices.IAsyncStateMachine' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.
at JapaneseTextAnalysisTool.FormMain.callAnalyzeFile(FileInfo file)
at JapaneseTextAnalysisTool.FormMain.performAnalysis()
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

64bit version of the program. I also can't tell you if that blank report error was ever fixed as I'm not on Windows 8 anymore where that took place.


cb's Japanese Text Analysis Tool - cb4960 - 2015-10-05

ryuudou Wrote:Getting this on Windows 7 when clicking the analyze button:

System.TypeLoadException: Could not load type 'System.Runtime.CompilerServices.IAsyncStateMachine' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.

64bit version of the program. I also can't tell you if that blank report error was ever fixed as I'm not on Windows 8 anymore where that took place.

Do you have .Net version 4.5 installed?

To find .NET Framework versions by viewing the registry (.NET Framework 4.5 and later)
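
As a rough illustration of that registry check, here is a Python sketch. It reads the `Release` value under `HKLM\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full` and maps it to a version using the minimum release numbers Microsoft publishes. The function names are made up for this example, and the registry read only runs on Windows.

```python
import sys

# Minimum "Release" value for each .NET Framework version, newest first.
_RELEASE_TABLE = [
    (528040, "4.8"),
    (461808, "4.7.2"),
    (461308, "4.7.1"),
    (460798, "4.7"),
    (394802, "4.6.2"),
    (394254, "4.6.1"),
    (393295, "4.6"),
    (379893, "4.5.2"),
    (378675, "4.5.1"),
    (378389, "4.5"),
]


def framework_version(release):
    """Map a Release DWORD to a version string, or None if it is older than 4.5."""
    for minimum, name in _RELEASE_TABLE:
        if release >= minimum:
            return name
    return None


def read_release_from_registry():
    """Read the Release DWORD from the registry (Windows only)."""
    import winreg  # only available on Windows

    key_path = r"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _ = winreg.QueryValueEx(key, "Release")
    return value


if sys.platform == "win32":
    try:
        print(framework_version(read_release_from_registry()))
    except OSError:
        # Key or value missing: .NET Framework 4.5 or later is not installed.
        print("No .NET Framework 4.5+ Release value found")
```

If the script prints 4.5 or anything newer, the installed framework should be recent enough for a program that targets .NET 4.5.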


cb's Japanese Text Analysis Tool - cb4960 - 2015-10-10

Hello,

I have just released version 2.3 of cb's Japanese Frequency List Sorter.

Download cb's Japanese Frequency List Sorter via SourceForge

What Changed?

● Added "Output entire line" option.
● Added "Append number of times encountered" option.
● Added "Append Frequency Group" option.
● Added "Append Frequency Rank" option.
● Updated word_freq_report_mecab.txt and kanji_freq_report.txt.

cb4960


cb's Japanese Text Analysis Tool - ryuudou - 2015-10-11

cb4960 Wrote:Do you have .Net version 4.5 installed?

It seems to work now but now I'm getting blank word reports with both parsers. This also happened in Windows 8.


cb's Japanese Text Analysis Tool - cb4960 - 2015-10-11

ryuudou Wrote:It seems to work now but now I'm getting blank word reports with both parsers. This also happened in Windows 8.

Someday I'll have to add logging to JTAT to get to the bottom of this.

Another user who was having a similar issue apparently found a workaround:

http://forum.koohii.com/showthread.php?pid=220227#pid220227


cb's Japanese Text Analysis Tool - Gensan - 2015-10-31

I have the blank word report problem, and I can't download the 4.4 version.

Sorry, never mind. Using no spaces does the trick.

By the way, is there a way to blacklist by part of speech?
I want JTAT to ignore 感動詞, 助詞 and 助動詞.


RE: cb's Japanese Text Analysis Tool - Zarxrax - 2015-11-30

Anyone know a tool or some code I can use to take the word frequency report, and limit it to only words which contain kanji?


RE: cb's Japanese Text Analysis Tool - yogert909 - 2015-11-30

Zarxrax Wrote:Anyone know a tool or some code I can use to take the word frequency report, and limit it to only words which contain kanji?

If you open the report in a text editor that supports regular expressions, you can search for the following pattern and replace it with nothing. This deletes every line that contains no kanji.

Search for:  ^((?![\x{4e00}-\x{9faf}]).)*\n
Replace with:
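
If you would rather do it with code, here is a minimal Python sketch of the same filter. It uses the same character range as the regex above (U+4E00 to U+9FAF) but checks only the word field, since the MeCab report's part-of-speech column (e.g. 副詞) itself contains kanji and a whole-line match would then keep every line. It assumes a tab-separated report with the word in the second field; the function name and sample rows are made up.

```python
import re

# Same range as the editor regex: CJK Unified Ideographs U+4E00 to U+9FAF.
KANJI = re.compile(r"[\u4e00-\u9faf]")


def kanji_words_only(report_lines):
    """Keep only report lines whose word field (field 2) contains at least one kanji."""
    kept = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and KANJI.search(fields[1]):
            kept.append(line)
    return kept


# Invented sample rows: count, word, part of speech.
sample = ["12\t確か\t副詞", "30\tそれは\t副詞", "5\t食べる\t動詞"]
for line in kanji_words_only(sample):
    print(line)
```

Only the 確か and 食べる rows survive; それは is dropped even though its part-of-speech field contains kanji.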