
cb's Japanese Text Analysis Tool

#26
This looks like a fun tool, so I thought I would play with it a bit. I decided to run it on more than 5600 Japanese subtitle files taken from http://kitsunekko.net/

There were 66,603 unique words reported, from a total of 11,838,104 words.
The 100 most common entries made up 53% of the total.
The 500 most common entries made up 70% of the total.
The 1000 most common entries made up 76% of the total.
The 2000 most common entries made up 83% of the total.
The 6000 most common entries made up 92% of the total.
The 10000 most common entries made up 95% of the total.
The 20000 most common entries made up 98% of the total.
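
For anyone wanting to reproduce coverage figures like these from their own frequency report, here is a minimal sketch (the function name and toy counts are mine, not output from the tool):

```python
# Sketch: cumulative coverage of the top-N most common entries,
# given raw occurrence counts from a word frequency report.

def coverage(counts, top_n):
    """Percent of all occurrences covered by the top_n most common entries."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return 100.0 * sum(ordered[:top_n]) / total

# Toy counts, not real data:
counts = [50, 30, 10, 5, 3, 2]
print(coverage(counts, 2))  # 80.0
```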

However, the 100 most common entries consisted mostly of single characters or short groups of characters that one would be more likely to consider "grammar" than actual words, so I'm not sure it's really fair to include them.
If we completely remove those 100 from consideration and then redo the calculations, everything changes dramatically.

The 100 most common entries made up 16% of the total.
The 500 most common entries made up 39% of the total.
The 1000 most common entries made up 51% of the total.
The 2000 most common entries made up 64% of the total.
The 6000 most common entries made up 83% of the total.
The 10000 most common entries made up 89% of the total.
The 20000 most common entries made up 96% of the total.


Keep in mind, there are so many different factors at play here that I don't think we can draw many conclusions from this. I think it's interesting, though.
Also interesting is that while reading through the list of words that occurred only a SINGLE time, I saw many that actually seemed like rather normal words, including many katakana words that I could immediately understand.
Reply
#27
I'm getting Unhandled Exception errors when I try to run the Frequency list sorter (2.0) on my Anki deck using a custom word list. (I'm trying to see if the words in the list already exist in my deck.)

Here's the error I get:

Unhandled exception has occurred in your application. [blah blah blah]

Unable to read frequency report.
System.IndexOutOfRangeException: Index was outside the bounds of the array.

at FrequencyListSorter.FormMain.readFreqReport().

Here are the details:

Quote: See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.Exception: Unable to read frequency report.
System.IndexOutOfRangeException: Index was outside the bounds of the array.
at FrequencyListSorter.FormMain.readFreqReport()
at FrequencyListSorter.FormMain.readFreqReport()
at FrequencyListSorter.FormMain.sortList()
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.5456 (Win7SP1GDR.050727-5400)
CodeBase: file:///C:/Windows/Microsoft.NET/Framework64/v2.0.50727/mscorlib.dll
----------------------------------------
FrequencyListSorter
Assembly Version: 2.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Users/Rich/Downloads/FrequencyListSorter/FrequencyListSorter.exe
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.5460 (Win7SP1GDR.050727-5400)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.5456 (Win7SP1GDR.050727-5400)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.5462 (Win7SP1GDR.050727-5400)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.
Reply
#28
Make sure that each line in your custom frequency report is in this format:
frequency<tab>word

Example:
1780 昼飯
1780 みすぼらしい
1779 人家
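
A quick way to check a custom list against that format — a sketch, assuming UTF-8 text with one entry per line (the function name is mine; a malformed line is a plausible cause of the IndexOutOfRangeException above):

```python
# Sketch: parse one "frequency<tab>word" line, raising on malformed input.

def parse_freq_line(line):
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2 or not parts[0].isdigit():
        raise ValueError("expected frequency<tab>word, got: %r" % line)
    return int(parts[0]), parts[1]

print(parse_freq_line("1780\t昼飯"))  # (1780, '昼飯')
```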
Reply
#29
Zarxrax Wrote:This looks like a fun tool, so I thought I would play with it a bit. I decided to run it on more than 5600 Japanese subtitle files taken from http://kitsunekko.net
Would you mind sharing these lists on Google Docs or MediaFire as spreadsheets? I shared a similar list for Mandarin on the other forum (also generated from subtitle files). It would be interesting to compare the results.
Reply
#30
EDIT: Never mind. I think I'm trying to get the program to do something it's just not supposed to do. See my post in the E2A thread.
Edited: 2012-08-16, 12:28 am
Reply
#31
Ran this on my corpus of 4450 SRS vocab cards (using JParser).
2151 kanji
10704 words

Seems rather high, as I'm not even finished studying for N2 yet and I don't know half of the joyo kanji. Then again, 1800 of the cards use Japanese definitions from the goo dictionary, and most of the cards have example sentences. Also, I created all of them myself instead of using a pre-made deck.


Also, for Kanzen Master 2kyuu Grammar
921 kanji
2587 words
OBI level: 9
Edited: 2012-08-19, 2:36 pm
Reply
#32
cb4960 Wrote:Reports from cb's Japanese Text Analysis Tool v3.0 based on 5000+ novels (27 May 2012):
Download via MediaFire

Includes word frequency report via Mecab, word frequency report via JParser, differences between the Mecab report and JParser report, kanji frequency report, and readability report.
How accurate would these reports be? If a word is in the top 10,000 on this list, is it worth learning? For example, a word like 世界中 (around the world) is 5366th on the list, but it doesn't even appear as common on jisho.org. I know jisho isn't perfect either, but do you think this list would be good to study from? What I would like to do is, when I learn a new kanji, go to this list and find a few new words that include it to help me remember the pronunciations and meanings of the kanji.
Reply
#33
I don't know if it would make a difference, but would it be possible to do a text analysis based on your own morphology database? I guess to figure out what would be theoretically easier or more difficult for you.
Reply
#34
egoplant Wrote:
cb4960 Wrote:Reports from cb's Japanese Text Analysis Tool v3.0 based on 5000+ novels (27 May 2012):
Download via MediaFire

Includes word frequency report via Mecab, word frequency report via JParser, differences between the Mecab report and JParser report, kanji frequency report, and readability report.
How accurate would these reports be? If a word is in the top 10,000 on this list, is it worth learning? For example, a word like 世界中 (around the world) is 5366th on the list, but it doesn't even appear as common on jisho.org. I know jisho isn't perfect either, but do you think this list would be good to study from? What I would like to do is, when I learn a new kanji, go to this list and find a few new words that include it to help me remember the pronunciations and meanings of the kanji.
It is accurate in the sense that 世界中 is probably the 5366th most frequent word in the corpus of novels that I used. Can't comment on jisho.org.
Reply
#35
Daichi Wrote:I don't know if it would make a difference, but would it be possible to do a text analysis based on your own morphology database? I guess to figure out what would be theoretically easier or more difficult for you.
Not a bad idea. Maybe it would output something like the ratio of known words to unknown words.
Reply
#36
I was looking for some of the files from your first report on 5000 books, but couldn't find them in my collection. So I tried running the tool against the books I had.

All reports were done with MeCab and are available here:
http://406notacceptable.com/wp-content/uploads/2013/06/

Some of the books have negative readability scores... which is strange! I had a quick look at those, and they aren't actually books. :) For example, one of them just contained this:

  ∧_∧
 ( ´∀`)< ぬるぽ


:S
Reply
#37
Hello,

I have just released version 4.0 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v4.0 via SourceForge

What Changed?

● Added the User-based Readability Report option.

Using a list of words that the user already knows, this report helps determine the readability of a text based on the percentage of words in the text that the user knows.

Name: user_based_readability_report.txt

Format:
Field 1: Readability expressed as a percentage (0-100) of the total number
of non-unique known words vs. the total number of non-unique words.
Field 2: Total number of non-unique words
Field 3: Total number of non-unique known words
Field 4: Total number of non-unique unknown words
Field 5: Readability expressed as a percentage (0-100) of the total number
of unique known words vs. the total number of unique words.
Field 6: Total number of unique words
Field 7: Total number of unique known words
Field 8: Total number of unique unknown words
Field 9: Filename

Report is sorted based on Readability (Field 1).

To generate this report, the "File that contains a list of words that you already know" option must be filled in. If a line contains multiple tab-separated columns, then the word is assumed to be in the first column.
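
As a sketch, a line of this report could be read back like so (the tab separator and these field names are my assumptions based on the field descriptions above):

```python
# Sketch: split one user-based readability report line into the
# nine fields described above.

FIELDS = ("readability_pct", "total_words", "known_words", "unknown_words",
          "unique_readability_pct", "unique_words", "unique_known_words",
          "unique_unknown_words", "filename")

def parse_report_line(line):
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    record["readability_pct"] = float(record["readability_pct"])
    return record

rec = parse_report_line("77.78\t207\t161\t46\t67.06\t85\t57\t28\texample.txt")
print(rec["readability_pct"], rec["filename"])  # 77.78 example.txt
```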

● Renamed the old Readability report to Formula-based Readability report.

● Updated Mecab to version 0.996.

cb4960
Reply
#38
After some additional testing, I've noticed some instability with the way I'm calling the newer version of Mecab. I'll see about releasing version 4.1 to fix this later tonight.
Reply
#39
Hello,

I have just released version 4.1 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v4.1 via SourceForge

What Changed?

● Fixed a crash bug when processing more than ~150 novels, due to poor memory management on my end (not Mecab, as I suspected in my previous post).

● Fixed a crash bug when closing JTAT while it was still processing.

cb4960
Reply
#40
I just wanted to thank cb4960 for making these changes.

I have reached 1000 words in Core6K, and while doing the next set of kanji I tried reading NHK Easy News articles. It was extremely hit and miss: in some articles I understood a lot of the words, in others I understood none. I wanted to practice more on the articles where I knew the majority of the words, and this change to the tool now makes that possible.

For example, with a known-word list (~1000 words) exported from Anki and 450 NHK articles, here are the top and bottom five results:

Code:
77.78    207    161    46    67.06    85    57    28    2013_04_26_2.txt
75.59    127    96    31    61.76    68    42    26    2012_7_11.txt
75.47    159    120    39    63.41    82    52    30    2013_04_25.txt
73.30    221    162    59    54.63    108    59    49    2012_10_9_1.txt
72.92    192    140    52    55.91    93    52    41    2013_3_29_1.txt
...
47.31    167    79    88    33.65    104    35    69    2012_7_25_1.txt
47.16    176    83    93    42.35    85    36    49    2013_3_4_2.txt
45.70    151    69    82    35.80    81    29    52    2013_05_13.txt
44.69    273    122    151    37.59    133    50    83    2013_04_22_1.txt
43.18    176    76    100    36.73    98    36    62    2012_8_17_1.txt
Reading the articles from the top of the list is much more satisfying than previous attempts at reading NHK. Though it appears I need to learn a lot more grammar.

If you are going to use the tool on your own archives, keep one thing in mind: the parsers add particles and other grammar-related elements to the word list, and these are unlikely to be in your own vocab lists. I created a separate file with these types of words from the word_freq_report and appended it to my vocab list. This improves the ratings of the documents a lot.
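
That workaround amounts to a small merge step — a sketch, with illustrative filenames and function name, assuming one word per line (or tab-separated with the word in the first column, as the tool expects):

```python
# Sketch: append a hand-made particles/grammar file to the exported
# vocab list, de-duplicating, before feeding the result to JTAT.

def merge_known_words(vocab_path, grammar_path, out_path):
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (vocab_path, grammar_path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    word = line.split("\t")[0].strip()
                    if word and word not in seen:
                        seen.add(word)
                        out.write(word + "\n")
```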

By the way, if someone were to know 1000 words from Core6K, and that person happened to have a copy of some totally innocent ~~books~~ book reviews, and that person ran this tool over them, they might sadly discover that the top ratings on those ~~books~~ book reviews are around the 15-20% range.

Such a person might then decide to go learn more vocab ;)
Reply
#41
Thanks for sharing Wrenn! I should probably add an option to automatically append particles and grammar and such to the known words list.
Reply
#42
Hello,

I have just released version 4.2 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v4.2 via SourceForge

What Changed?

● Fixed bug where statistics were not reset for the User-based Readability Report when performing more than one full analysis in a row.

cb4960
Reply
#43
Hi cb4960, I really admire the resources you've shared here! Thanks for all of the hard work :)

I'm wondering how feasible it would be to implement a tool that can create word reports out of text files based on order of appearance. I just made a thread about it because I didn't think to make a suggestion/request here, lol.
Reply
#44
Animosophy Wrote:Hi cb4960, I really admire the resources you've shared here! Thanks for all of the hard work :)

I'm wondering how feasible it would be to implement a tool that can create word reports out of text files based on order of appearance. I just made a thread about it because I didn't think to make a suggestion/request here, lol.
Looks like you've already found a solution in that other thread. Still, maybe I'll add this feature anyway.
Reply
#45
cb4960 Wrote:
Animosophy Wrote:Hi cb4960, I really admire the resources you've shared here! Thanks for all of the hard work :)

I'm wondering how feasible it would be to implement a tool that can create word reports out of text files based on order of appearance. I just made a thread about it because I didn't think to make a suggestion/request here, lol.
Looks like you've already found a solution in that other thread. Still, maybe I'll add this feature anyway.
That would be great! To be honest, it's more convenient for me to use JadeReader on my Android, because I don't have a tablet to make reading with Yomichan very convenient. I'd still happily use it if there were no other option, though. It would just limit where and how often I could read, knowing the words will already be prepared for reviewing.
Edited: 2013-09-05, 6:49 pm
Reply
#46
Hello,

I have just released version 4.3 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v4.3 via SourceForge

What Changed?

● Updated the Mecab database.

cb4960
Reply
#47
Hey CB,

Thanks for the amazing tool.

Quick question about Mecab and other parsers. I need a parser built with the life science dictionary http://lsd.pharm.kyoto-u.ac.jp/en/servic...index.html (also available over at Jim Breen's site). Is it (im)possible to create a custom parser based on a dictionary?

Thanks for any insights.
Reply
#48
Here are the results for this drama subtitle database (4000+ subtitle files):
http://www.mediafire.com/download/xg1xbx.../drama.rar
Edited: 2013-12-03, 11:20 pm
Reply
#49
There is a minor bug:
If you click "User-based readability" and choose a file, then uncheck "User-based readability", you still get a popup:
Unable to read the "known words" file.
(Yes, there was something wrong with my wordlist.txt; the program crashes while reading it.)
Reply
#50
Hey CB様,

Would it be possible to add percentages to the analysis?

That is, add what percentage each entry contributes to the whole, plus a cumulative percentage as the list goes down?

Kind of like how the percentages are here:

https://docs.google.com/spreadsheet/ccc?...WVkE#gid=0

Thanks in advance!

Note: I don't have a lick of programming knowledge, so I don't know how difficult or easy this would be o.o
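
For what it's worth, the requested columns are cheap to compute from a frequency list. A sketch with toy numbers (not output from the tool):

```python
# Sketch: per-entry percentage of the whole, plus a running
# cumulative percentage down the list.

def with_percentages(entries):
    """entries: (word, count) pairs; yields (word, count, pct, cumulative_pct)."""
    total = sum(count for _, count in entries)
    running = 0
    for word, count in entries:
        running += count
        yield word, count, 100.0 * count / total, 100.0 * running / total

for row in with_percentages([("の", 50), ("に", 30), ("猫", 20)]):
    print("%s\t%d\t%.2f%%\t%.2f%%" % row)
```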
Reply