Japanese Text Analysis Tool allows users to generate 3 kinds of reports:
1) Word Frequency Report
2) Kanji Frequency Report
3) Readability Report
You may analyze a single text file or an entire directory of text files (including sub-directories).
Before text analysis begins, all Aozora formatting is removed (if present) and a user-specified number of lines is trimmed from the beginning and end of each text file.
By default, only files with a .txt extension will be analyzed. To change this, edit the "extensions" setting in settings.txt.
All generated reports will be placed in the specified output directory.
In the Tools menu, you can launch tools that allow you to combine multiple frequency reports or compare two frequency reports.
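For readers wondering what the Aozora clean-up and line trimming described above might involve, here is a rough sketch. It is not taken from JTAT's source. It assumes the usual Aozora Bunko conventions (readings in 《…》, a ｜ marking the start of a ruby base, and editor notes in ［＃…］), and the function names and the order of the two steps are my own guesses.

```python
import re

# Assumed Aozora Bunko markup (not taken from JTAT's source):
#   漢字《かんじ》  ruby reading in double angle brackets
#   ｜              marks where a ruby base starts
#   ［＃...］       editorial annotations
RUBY = re.compile(r"《[^》]*》")
ANNOTATION = re.compile(r"［＃[^］]*］")


def strip_aozora(line: str) -> str:
    """Remove ruby readings, ruby-base markers, and annotations from one line."""
    line = RUBY.sub("", line)
    line = ANNOTATION.sub("", line)
    return line.replace("｜", "")


def clean_text(text: str, trim_start: int = 0, trim_end: int = 0) -> str:
    """Trim lines from both ends, then strip Aozora markup from what is left."""
    lines = text.splitlines()
    end = len(lines) - trim_end
    kept = lines[trim_start:end] if end > trim_start else []
    return "\n".join(strip_aozora(line) for line in kept)
```

Trimming is useful because Aozora files usually start with a title block and end with transcription credits, which would otherwise be counted as part of the text.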
Download Japanese Text Analysis Tool via Google Code
Word Frequency Report
Name: word_freq_report.txt
Format:
Field 1: Number of times word was encountered
Field 2: Word
Report is sorted from most frequent word to least frequent word.
You have two methods of generating a report: MeCab or JParser.
MeCab is a widely used morphological analyzer and is quite fast.
JParser is an alternative method that uses a larger dictionary (EDICT + ENAMDICT),
so it recognizes more words and seems to have better support for names
and short expressions. However, it is much slower than MeCab.
Kanji Frequency Report
Name: kanji_freq_report.txt
Format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Report is sorted from most frequent kanji to least frequent kanji.
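As an illustration of the report format above, a kanji frequency count could be sketched like this. It is only a guess at the idea, not JTAT's code. In particular, treating "kanji" as the CJK Unified Ideographs block plus Extension A, and separating the two fields with a tab, are assumptions.

```python
from collections import Counter


def is_kanji(ch: str) -> bool:
    # Assumption: "kanji" means CJK Unified Ideographs (U+4E00-U+9FFF)
    # plus Extension A (U+3400-U+4DBF). JTAT may draw the line elsewhere.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def kanji_frequency(text: str):
    """Return (kanji, count) pairs, most frequent first."""
    counts = Counter(ch for ch in text if is_kanji(ch))
    # Ties are broken by code point so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_report(rows) -> str:
    """Field 1 is the count, field 2 is the kanji (tab-separated by assumption)."""
    return "\n".join(f"{count}\t{kanji}" for kanji, count in rows)
```

For example, `format_report(kanji_frequency("本本本日日語"))` gives three lines, with 本 first and 語 last.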
Readability Report
Name: readability_report.txt
Format:
Field 1: OBI-2 Grade Level (1-13, where 1 is the most readable)
Field 2: Hayashi Score (0-100, where 100 is the most readable)
Field 3: Filename
Report is sorted from most readable to least readable.
Have Fun!
Last edited by cb4960 (2012 May 27, 1:06 pm)
Reports from cb's Japanese Text Analysis Tool v3.0 based on 5000+ novels (27 May 2012):
Download via MediaFire
Includes word frequency report via MeCab, word frequency report via JParser, differences between the MeCab report and JParser report, kanji frequency report, and readability report.
================================================================
cb's Frequency List Sorter. Sorts a list of Japanese words based on their frequency.
Download cb's Frequency List Sorter via Google Code
(Requires .Net Framework 3.5)
From the readme.txt:
This tool sorts a list of Japanese words or kanji based on their frequency.
The list can be simple like this:
...
合図
挨拶
愛情
...
Or the list can have multiple columns like this:
...
合図 あいず sign, signal
挨拶 あいさつ greeting
愛情 あいじょう love, affection
...
In the latter case, each value is separated with tabs.
If the lines contain multiple columns, you will need to specify which column
contains the word or kanji. In both of the above cases, the word is in column 1.
You will need to select the frequency report to sort against with the
"Frequency report to sort against" option. Three good, pre-made frequency reports
are available for use. They were generated using cb's Japanese Text Analysis Tool
(JTAT) on a corpus of 5000+ Japanese novels. If you want to sort a list of kanji,
select "Kanji Frequency Report". If you want to sort a list of words, select either
"Word Frequency Report (MeCab)" or "Word Frequency Report (JParser)". Advanced
users may generate their own frequency report with JTAT and use it by selecting
the "(user-specified frequency report)" option and providing the path of the report.
The sorted output list would look something like this:
...
挨拶 あいさつ greeting 22040
...
愛情 あいじょう love, affection 11568
...
合図 あいず sign, signal 10860
...
The number in the last column is the frequency. To prevent the frequency from
being added, uncheck the "Add new column frequency to the output file" option.
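The behaviour described in the readme can be sketched roughly as follows. This is not the Sorter's actual code. The function names are made up, the report is assumed to be tab-separated (count, then word), and words missing from the report are assumed to count as frequency 0.

```python
def load_frequency_report(lines):
    """Map each word (field 2) to its count (field 1)."""
    freq = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0].isdigit():
            # Keep the first occurrence if a word is listed twice.
            freq.setdefault(fields[1], int(fields[0]))
    return freq


def sort_by_frequency(entries, freq, column=1, add_frequency=True):
    """Sort tab-separated entries by the frequency of the word in `column` (1-based)."""
    def frequency_of(entry):
        fields = entry.split("\t")
        word = fields[column - 1] if len(fields) >= column else ""
        return freq.get(word, 0)  # assumption: unknown words sort last

    # sorted() is stable, so entries with equal frequency keep their input order.
    ordered = sorted(entries, key=lambda e: -frequency_of(e))
    if add_frequency:
        return [f"{e}\t{frequency_of(e)}" for e in ordered]
    return ordered
```

With the three example words and the frequencies shown in the readme, this puts 挨拶 first, then 愛情, then 合図, each with its frequency appended as a new last column.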
Last edited by cb4960 (2012 May 27, 11:18 pm)
Man, just want to say thanks for all your hard work.
Been using Capture2Text and cbJisho a lot, incredibly useful.
Never used JReadability for some reason, but used this, very nice.
Not sure how the Hayashi score works, but the OBI-2 seems to work well, correctly identifying the Hans Christian Andersen translations and Japanese folk tales as easier.
Also, interesting to see that after parsing about 3000 texts, it shows about 6100 different kanjis in the Kanji Frequency Report. I expected a higher number.
visualsense wrote:
Man, just want to say thanks for all your hard work.
Been using Capture2Text and cbJisho a lot, incredibly useful.
Never used JReadability for some reason, but used this, very nice.
Not sure how the Hayashi score works, but the OBI-2 seems to work well, correctly identifying the Hans Christian Andersen translations and Japanese folk tales as easier.
Also, interesting to see that after parsing about 3000 texts, it shows about 6100 different kanjis in the Kanji Frequency Report. I expected a higher number.
You're welcome.
For the 5000+ innocent novels, 6430 unique kanji were recognized, and of those:
4685 appeared fewer than 10,000 times
3364 appeared fewer than 1,000 times
2024 appeared fewer than 100 times
864 appeared fewer than 10 times
240 appeared exactly once.
My everyday kanwa dictionary has about 6500 kanji in it, and I don't think I've ever seen a kanji that I couldn't find in there.
I don't do any Classical Literature or kanbun or serious historical stuff, but I think that you'll need to get into that kind of thing before you need more than about those 6500.
And now a graph:
cb4960, I'm always really impressed with the software that you create. Amazing.
Unfortunately, I don't have a PC, so I can't use this.
Is there any chance anyone would be kind enough to run this text file through the 'word frequency report'? :
http://filebin.ca/1s8mUTFYcGT/gyakuten.txt
{The file's saved as Unicode (UTF-8) encoding }
Last edited by luckynumber7 (2012 May 15, 7:54 am)
@cb4960
With your graph, you just beat me to it.
These days I'm trying to create an index of the kanji in Kanji in Context (I'm halfway there) and, once it's done, compare it with your data. (It might also be interesting to know how KO2001 compares, but it's not going to be me who checks, as I don't work with KO.) The authors of KiC claim that the 500 most frequently used kanji account for 80% of the kanji in newspapers, and that 1,000 raises that number to 94%. Well, that's newspapers, but what about literature?
http://i.imgur.com/6WzGP.png
Edit: Replaced bogus numbers with a graph.
Last edited by Inny Jan (2012 May 15, 7:03 pm)
This is a great tool. For me, it's useful to show students that they are indeed getting more bang for their buck by learning in bunches. Then there's the side benefit of ranking books by their ease of reading.
Now that I think about it, has anyone ever collected all the Dramanote scripts into a useful group? Might be a nice project if it can't be done automatically, as these make for easy reading.
On a related note:
In the comprehensive Kanken spreadsheet, there's a column which lists the various primitives in each kanji. I wondered how it would look if we attempted to get a primitive count, then use that count to create a list to help sort kanji from their primitives.
With that, it'd be possible for users to create their own RTK order for any subset of Kanji. Heisig's order is good for the most part, but it's limited in that he made the list on the idea you're learning all 2041 at once. If you wanted to learn the first 555 of the KO kanji, there'd probably be a more efficient order as some primitives used more in a 2041 group appear only once upon introduction in the 555 subset.
@luckynumber7:
Here: http://pastebin.ca/2149188
@Inny Jan:
Can you give a few words explaining how to interpret that graph for statistics n00bs like me?
@Nukemarine:
Good idea. Maybe someone can write an algorithm to determine the most efficient order in which to learn a given subset of kanji based on their primitives. Give the algorithm a list of the 2,000 most frequent kanji and who needs Heisig?
Nukemarine wrote:
This is a great tool. For me, it's useful to show students that they are indeed getting more bang for their buck by learning in bunches. Then there's the side benefit of ranking books by their ease of reading.
Now that I think about it, has anyone ever collected all the Dramanote scripts into a useful group? Might be a nice project if it can't be done automatically, as these make for easy reading.
I guess you could also analyze subtitles (though probably you would have to strip the time codes before) and get lists of dorama ordered by ease of language, and vocabulary lists you could use in advance before watching a dorama to boost your understanding and learn new vocabulary in an easy and fun way.
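Stripping the time codes is simple enough to sketch. The snippet below assumes SubRip-style (.srt) subtitles, with a cue number on one line and a "start --> end" time code line under it. Other subtitle formats would need different handling, and the function name is just for illustration.

```python
import re

# Assumed SubRip time code line, e.g. "00:00:01,000 --> 00:00:03,500".
TIMECODE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def subtitle_text(srt: str) -> str:
    """Keep only the dialogue lines of an .srt file."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Drop blank lines, cue numbers, and time code lines.
        # Caveat: a dialogue line made up only of digits is dropped too.
        if not line or line.isdigit() or TIMECODE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```

The result can be saved as a .txt file and analyzed like any other text.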
On a related note:
In the comprehensive Kanken spreadsheet, there's a column which lists the various primitives in each kanji. I wondered how it would look if we attempted to get a primitive count, then use that count to create a list to help sort kanji from their primitives.
With that, it'd be possible for users to create their own RTK order for any subset of Kanji. Heisig's order is good for the most part, but it's limited in that he made the list on the idea you're learning all 2041 at once. If you wanted to learn the first 555 of the KO kanji, there'd probably be a more efficient order as some primitives used more in a 2041 group appear only once upon introduction in the 555 subset.
I think it would be harder than it sounds, but if such a tool existed, it would be of great use for lots of people.
cb4960 wrote:
Can you give a few words explaining how to interpret that graph for statistics n00bs like me?
The blue trend plots numbers that are calculated based on the formula:
p_i = k_i/n
where:
p_i - probability of encountering the i-th kanji (from your sample)
k_i - number of times the i-th kanji has been encountered
n - total number of "encounters" for all kanji
So the formula answers the question: when you look at some kanji, what is the probability that it is this specific (i-th) kanji?
The red one does:
cp_j = sum_{i=1}^j p_i
where:
cp_j - cumulative probability of recognising a kanji where the range of known kanji is from 1 to j
p_i - probability calculated from the formula above
sum_{i=1}^j - does summation of p_i probabilities starting from p_1 and ending at p_j
Here we want to know: given that you know the kanji in the range 1 to j, what is the probability that some kanji you are looking at is in that range?
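The two formulas above can be tried out on any kanji frequency report with a few lines of code. This is a minimal sketch: it assumes the counts are already sorted from most to least frequent (as in the report), and the helper names are made up.

```python
from itertools import accumulate


def probabilities(counts):
    """p_i = k_i / n for each kanji count k_i."""
    n = sum(counts)
    return [k / n for k in counts]


def cumulative(counts):
    """cp_j = (k_1 + ... + k_j) / n, which equals the sum of p_1..p_j."""
    n = sum(counts)
    return [running / n for running in accumulate(counts)]


def kanji_needed(counts, target):
    """Smallest j such that knowing kanji 1..j covers at least `target` of the text."""
    for j, cp in enumerate(cumulative(counts), start=1):
        if cp >= target:
            return j
    return len(counts)
```

For a toy report with counts `[50, 30, 15, 5]`, the cumulative values are 0.5, 0.8, 0.95 and 1.0, so `kanji_needed(counts, 0.9)` is 3. Running the same thing on the real report would let you check claims like "the 500 most common kanji cover 80%" for novels.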
cb4960 wrote:
@luckynumber7:
Here: http://pastebin.ca/2149188
Thank you so much! This means a lot to me.
Say I have a large list of vocab words from the novels I read, and I want to learn the most frequent ones first and leave the rarer for last... Can I somehow use this program to reorder my list like that?
Thanks for the tool!
I analyzed the components of 2136 joyo kanji and here are the 149 most common ones (I left out those occurring 8 times or less):
http://pastebin.com/CzUTQapV
I've also analyzed my full database of 6776 kanji and these are the 15 most common elements:
1391 口
652 木
648 日
493 田
484 土
444 艹
429 亠
427 十
405 二
395 丶
394 丷
390 目
378 氵
366 小
328 月
toshiromiballza wrote:
Thanks for the tool!
I analyzed the components of 2136 joyo kanji and here are the 149 most common ones (I left out those occurring 8 times or less):
http://pastebin.com/CzUTQapV
I've also analyzed my full database of 6776 kanji and these are the 15 most common elements:
1391 口
652 木
648 日
493 田
484 土
444 艹
429 亠
427 十
405 二
395 丶
394 丷
390 目
378 氵
366 小
328 月
Seems off. You'd think ONE would be near the top. Is this based on the radicals and not Heisig primitives? Anyway, it's a cool list.
Does anybody know a good text frequency analyzer, and maybe even a sorting program for lists that have things like "language, talk, cellphone, mouth, self, five, ceiling, car keys, mouth2" in the primitive column?
I didn't include most one-stroke characters because practically every kanji would have one, but yes, 一 would top the list.
It's based on radicals and their variants and some characters that aren't radicals themselves, but occur enough times to constitute their inclusion as parts (元, 五, 廿, ...).
This is from my own database which was based on KRADFILE, but I did a complete overhaul, added 400+ kanji, added new elements, corrections, etc.
Hello,
I have just released version 2.0 of cb's Japanese Text Analysis Tool.
Download cb's Japanese Text Analysis Tool v2.0 via Google Code
What Changed?
Added option to choose the parse method for the word frequency report. You may now choose to use MeCab or JParser.
MeCab is a widely used morphological analyzer. However, for certain word conjugations, it will parse words in an undesirable fashion. Example: 挟まれる is parsed as 挟ま and れる instead of 挟む.
JParser is the method used by Translation Aggregator to generate furigana. It has no problem with the above conjugation and has better support for short expressions. The downside is that JParser is much slower than MeCab.
For JParser you have the option of always using the kanji form of a word even if it is normally written in kana. If checked, ともに will be converted to 共に. If unchecked, 共に will be converted to ともに.
==============================================================
I have also released two new tools (see the second post for details):
1) cb's Frequency Report Combiner. Combines two reports generated by cb's Japanese Text Analysis Tool.
2) cb's Frequency Report Diff Tool. Compares two reports generated by cb's Japanese Text Analysis Tool.
[Edit: These tools are now deprecated. I merged their functionality into JTAT v3.0]
==============================================================
Here are the latest Japanese Text Analysis Tool reports for a large number of innocent novels:
Download via MediaFire
Includes word frequency report via MeCab, word frequency report via JParser (new), differences between the MeCab report and JParser report (new), kanji frequency report, and readability report.
cb4960
Last edited by cb4960 (2012 May 27, 1:40 pm)
CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?
If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
cb4960 wrote:
MeCab is a widely used morphological analyzer. However, for certain word conjugations, it will parse words in an undesirable fashion. Example: 挟まれる is parsed as 挟ま and れる instead of 挟む.
While MeCab parses this way, it also includes the root form of the word in the part-of-speech field information. In this case it would return 挟ま 動詞,自立,*,*,五段・マ行,未然形,挟む,ハサマ,ハサマ. So the 挟む form is included.
So maybe you could include an option to use this root form of a word, to aggregate the frequencies of all conjugations into one?
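To make the suggestion concrete, here is a small sketch that works on MeCab's text output rather than calling MeCab itself. It assumes the default IPADIC-style layout, where the surface form is followed by a tab and a comma-separated feature list whose seventh field is the root (dictionary) form. The function names are hypothetical.

```python
from collections import Counter


def root_form(mecab_line: str) -> str:
    """Return the root form from one line of MeCab output, or the surface form."""
    surface, _, features = mecab_line.partition("\t")
    fields = features.split(",")
    # Assumption: IPADIC layout, root form at index 6; "*" means "unknown".
    if len(fields) > 6 and fields[6] != "*":
        return fields[6]
    return surface


def count_roots(mecab_lines) -> Counter:
    """Aggregate token counts by root form, skipping the EOS markers."""
    counts = Counter()
    for line in mecab_lines:
        line = line.rstrip("\n")
        if not line or line == "EOS":
            continue
        counts[root_form(line)] += 1
    return counts
```

With this, 挟ま (from 挟まれる) and a plain 挟む both end up counted under 挟む. Note that the れる suffix is still counted as its own token.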
cryptica wrote:
CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?
If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
Do you have a link to Cangy's script?
cryptica wrote:
So maybe you could include an option to use this root form of a word, to aggregate the frequencies of all conjugations into one?
D'oh! You're right. I have no idea why I didn't use the root field. I'll release a fix build soon and issue a new word frequency report for the 5000+ innocent novels.
Hello,
I have just released version 3.0 of cb's Japanese Text Analysis Tool.
Download cb's Japanese Text Analysis Tool v3.0 via Google Code
What Changed?
1) Added "Compare Frequency Reports" to the Tools menu. This allows you to compare two frequency reports.
2) Added "Combine Frequency Reports" to the Tools menu. This allows you to combine multiple frequency reports.
3) Fixed MeCab Word Frequency Report so that it uses the 'root' field of the MeCab output. (Thanks cryptica!)
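For anyone curious what combining and comparing reports could look like, here is a rough sketch. It is a guess, not the tool's code. It assumes tab-separated reports (count, then word or kanji), that combining means adding up the counts, and that one useful comparison is "items that appear only in the first report". The real tools may do more, or do it differently.

```python
from collections import Counter


def parse_report(lines) -> Counter:
    """Read 'count<TAB>item' lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        count, sep, item = line.rstrip("\n").partition("\t")
        if sep and count.isdigit():
            counts[item] += int(count)
    return counts


def combine_reports(*reports):
    """Sum the counts of several reports; most frequent item first."""
    total = Counter()
    for report in reports:
        total.update(parse_report(report))
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


def only_in_first(first, second):
    """Items that appear in the first report but not in the second."""
    a, b = parse_report(first), parse_report(second)
    return sorted(item for item in a if item not in b)
```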
==============================================================
Reports from cb's Japanese Text Analysis Tool v3.0 based on 5000+ novels (27 May 2012):
Download via MediaFire
Includes word frequency report via MeCab, word frequency report via JParser, differences between the MeCab report and JParser report, kanji frequency report, and readability report.
cb4960
Last edited by cb4960 (2012 May 27, 12:52 pm)
Hello,
I have just released version 2.0 of cb's Frequency List Sorter.
Download cb's Frequency List Sorter v2.0 via Google Code
What Changed?
Added option to specify which frequency report to compare against. You may choose from 3 pre-made frequency reports or you may use your own frequency report. The 3 pre-made frequency reports are:
1) Kanji Frequency Report
2) Word Frequency Report (MeCab)
3) Word Frequency Report (JParser)
Each of the 3 pre-made reports was generated with JTAT v3.0.
cb4960
Last edited by cb4960 (2012 May 27, 11:20 pm)
cb4960 wrote:
Nukemarine wrote:
CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?
If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
Do you have a link to Cangy's script?
I can't find it in Cangy's Core thread or on Cangy's updated site. Maybe I'm just misremembering where I got the program, but I honestly thought it was Cangy's.
I've got the sorting program saved, so I'll forward that to you when I get home.
Here's a copy of the Perl Kanji sorting program by Fugounashi. Sorry for the long delay in posting this:
#!/usr/bin/perl -w
# Usage:
# $ kanji-sort --kanji kanjiorder.txt --sentence-field 2 < mydeck-exported.txt > mydeck-toimport.txt
# $Revision: 1.5 $ $Date: 2010/01/08 08:22:33 $
# http://ichi2.net/anki/wiki/ContribFugounashi
use open qw( :std :encoding(UTF-8) );
use strict;
use Getopt::Long;
use utf8;

my $kanjifile;
my $sentence_field;
GetOptions(
    'sentence-field=i' => \$sentence_field,
    'kanji=s'          => \$kanjifile
);
die "usage: $0 --kanji FILE --sentence-field N\n"
    unless defined $kanjifile && defined $sentence_field;

# Map each kanji to its line number (its rank) in the kanji order file.
my %kanji;
open my $kanji_fh, '<', $kanjifile or die "$0: cannot open $kanjifile: $!\n";
while (<$kanji_fh>) {
    chomp;
    next unless length;      # skip empty lines
    $_ = (split /\t/)[0];    # the kanji is the first tab-separated field
    if (exists $kanji{$_}) {
        print STDERR "$0: warning: ignoring duplicate kanji: $_: $kanjifile: $.\n";
    } else {
        $kanji{$_} = $.;
    }
}
close $kanji_fh;

# For each input line, record the highest kanji rank found in the sentence field.
my @max;
my @lines;
while (<>) {
    chomp;
    my $i = $. - 1;
    $lines[$i] = $_;
    # Fields are tab-separated and counted from 0.
    my $sentence = (split /\t/, $_)[$sentence_field];
    $sentence = '' unless defined $sentence;
    $max[$i] = 0;
    foreach my $char (split //, $sentence) {
        if ($kanji{$char} && $kanji{$char} > $max[$i]) {
            $max[$i] = $kanji{$char};
        }
    }
}

# Print the lines ordered by their highest kanji rank, appending that rank
# and the jump in rank from the previous line.
my @index  = 0 .. $#max;
my @sorted = sort { $max[$a] <=> $max[$b] } @index;
my $last   = 0;
foreach my $i (@sorted) {
    my $step = $max[$i] - $last;
    $last = $max[$i];
    print "$lines[$i]\t$max[$i]\t$step\n";
}

