
cb's Japanese Text Analysis Tool

#1
Japanese Text Analysis Tool allows users to generate 4 kinds of reports:
1) Word Frequency Report
2) Kanji Frequency Report
3) Formula-based Readability Report
4) User-based Readability Report

You may analyze a single text file or an entire directory of text
files (including sub-directories).

Before text analysis begins, all Aozora formatting is removed (if present) and
a user-specified number of lines are trimmed from the beginning and end of the
text files.

All generated reports will be placed in the specified output directory.

In the Tools menu, you can launch tools that allow you to combine multiple
frequency reports or compare two frequency reports.

Download Japanese Text Analysis Tool via SourceForge
(Requires .NET Framework 4.5)

[Image: main.png]

Word Frequency Report
Name: word_freq_report.txt

Format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of words)
Field 6: Cumulative percentage
Field 7: Part-of-speech

Frequency Group: All words in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
word(s), group 2 containing the next most common word(s), and so on.

Frequency Rank: For a given word, the Frequency Rank is the total number of words
in the analysis that are more frequent than the given word, plus 1. For example, if the
given word has a Frequency Rank of 500, then there are 499 other words in the analysis
that are more frequent than the given word.

Report is sorted from most frequent word to least frequent word.
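For the curious, the Frequency Group and Frequency Rank fields can be sketched in a few lines of Python. This is my own illustration of the definitions above, not the tool's actual code:

```python
from collections import Counter

def frequency_report(words):
    """Build (count, word, group, rank) rows.

    Words with the same count share a Frequency Group (group 1 is
    the most common); a word's Frequency Rank is 1 + the number of
    words that are strictly more frequent than it.
    """
    counts = Counter(words)
    # Sort from most frequent to least frequent.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    # Map each distinct count to a group number (1 = most common).
    distinct = sorted(set(counts.values()), reverse=True)
    group = {c: i + 1 for i, c in enumerate(distinct)}
    rows = []
    more_frequent = 0  # words strictly more frequent than this count
    prev_count = None
    seen = 0
    for word, count in ordered:
        if count != prev_count:
            more_frequent = seen
            prev_count = count
        rows.append((count, word, group[count], more_frequent + 1))
        seen += 1
    return rows
```

Note that tied words share both a group and a rank, which is why a rank of 500 implies exactly 499 strictly more frequent words.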

You have two methods of generating a report: MeCab or JParser.

MeCab is a widely used morphological analyzer and is quite fast.

JParser is an alternate method that uses a larger dictionary (EDICT + ENAMDICT)
and thus recognizes more words, and it seems to have better support for names
and short expressions. However, it is much slower than MeCab.

The Part-of-speech field is printed verbatim from MeCab/JParser.

Kanji Frequency Report
Name: kanji_freq_report.txt

Format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of kanji)
Field 6: Cumulative percentage

Frequency Group: All kanji in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
kanji, group 2 containing the next most common kanji, and so on.

Frequency Rank: For a given kanji, the Frequency Rank is the total number of kanji
in the analysis that are more frequent than the given kanji, plus 1. For example, if the
given kanji has a Frequency Rank of 500, then there are 499 other kanji in the analysis
that are more frequent than the given kanji.

Report is sorted from most frequent kanji to least frequent kanji.

Formula-based Readability Report:
Readability report generated based on Hayashi and OBI-2 readability calculations.

Name: formula_based_readability_report.txt

Format:
Field 1: OBI-2 Grade Level (1-13, where 1 is the most readable)
Field 2: Hayashi Score (0-100, where 100 is the most readable)
Field 3: Filename

Report is sorted from most readable to least readable.

Hayashi Score Information:
http://www.ideosity.com/ideosphere/seo-i...ts#Hayashi

OBI-2 Grade Level Information:
http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/obi_e.html

User-based Readability Report
Using a list of words that the user already knows, this report helps determine the readability of a text based on the percentage of words in the text that the user already knows.

Name: user_based_readability_report.txt

Format:
Field 1: Readability expressed as a percentage (0-100) of the total number
of non-unique known words vs. the total number of non-unique words.
Field 2: Total number of non-unique words
Field 3: Total number of non-unique known words
Field 4: Total number of non-unique unknown words
Field 5: Readability expressed as a percentage (0-100) of the total number
of unique known words vs. the total number of unique words.
Field 6: Total number of unique words
Field 7: Total number of unique known words
Field 8: Total number of unique unknown words
Field 9: Filename

Report is sorted based on Readability (Field 1).

To generate this report, the "File that contains a list of words that you already know" option must be filled in. If a line contains multiple tab-separated columns, then the word is assumed to be in the first column.
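To illustrate how Fields 1-8 relate to each other, here is a rough Python sketch (a hypothetical helper of my own, not the tool's code); the word list would come from the parser and the known words from the user's file:

```python
def user_readability(text_words, known_words):
    """Compute the user-based readability fields for one text.

    text_words: list of words as parsed from the text (with repeats)
    known_words: collection of words the user already knows
    Returns (pct_nonunique, total, known, unknown,
             pct_unique, utotal, uknown, uunknown).
    """
    known = set(known_words)
    total = len(text_words)                              # Field 2
    known_count = sum(1 for w in text_words if w in known)  # Field 3
    unique = set(text_words)                             # Field 6
    uknown = len(unique & known)                         # Field 7
    pct = 100.0 * known_count / total if total else 0.0  # Field 1
    upct = 100.0 * uknown / len(unique) if unique else 0.0  # Field 5
    return (pct, total, known_count, total - known_count,
            upct, len(unique), uknown, len(unique) - uknown)
```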


Have Fun!
Edited: 2015-10-05, 9:10 pm
#2
Reports from cb's Japanese Text Analysis Tool v5.4 based on 5000+ novels (Sample_Output_151003.zip):
Download via SourceForge

Includes Word Frequency Report via Mecab, Kanji Frequency Report, and Formula-based Readability Report.

================================================================

cb's Frequency List Sorter. Sorts a list of Japanese words based on their frequency.

Download cb's Japanese Frequency List Sorter via SourceForge
(Requires .NET Framework 3.5)

[Image: main.png]

From the readme.txt:

This tool sorts a list of Japanese words or kanji based on their frequency.

The list can be simple like this:

合図
挨拶
愛情

Or the list can have multiple columns like this:

合図 あいず sign, signal
挨拶 あいさつ greeting
愛情 あいじょう love, affection

In the latter case, each value must be separated with tabs.

If the lines contain multiple columns, you will need to specify which column
contains the word or kanji. In both of the above cases, the word is in column 1.

You will need to select the frequency report to sort against with the
"Frequency report to sort against" option. Three good, pre-made frequency reports
are available for use. They were generated using cb's Japanese Text Analysis Tool
(JTAT) on a corpus of 5000+ Japanese novels. If you want to sort a list of kanji,
select "Kanji Frequency Report". If you want to sort a list of words, select either
"Word Frequency Report (MeCab)" or "Word Frequency Report (JParser)". Advanced
users may generate their own frequency report with JTAT and use it by selecting
the "(user-specified frequency report)" option and providing the path of the report.

The sorted output list would look like this:

挨拶
愛情
合図
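The sort shown above could be sketched in Python like this (a hypothetical sort_by_frequency helper of my own; the actual tool reads report files rather than in-memory rows):

```python
def sort_by_frequency(words, report_rows):
    """Sort words most-frequent-first using (count, word, ...) rows
    from a frequency report; words absent from the report sink to
    the bottom of the output."""
    rank = {}
    for i, row in enumerate(report_rows):
        rank.setdefault(row[1], i)  # report is already sorted by frequency
    return sorted(words, key=lambda w: rank.get(w, len(report_rows)))
```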

To output the entire line from the input file instead of just the word,
check the "Output entire line" option. With this option selected, the
sorted output list would look like this:

挨拶 あいさつ greeting
愛情 あいじょう love, affection
合図 あいず sign, signal

You may append frequency information to the output list by checking one of
the following options:

1) Append number of times encountered

This number is taken from the selected frequency report.

2) Append Frequency Group

Description of Frequency Group from Japanese Text Analysis Tool:

"All words in the analysis that share the exact same frequency
will be assigned to a numbered Frequency Group, with group 1
containing the most common word(s), group 2 containing the
next most common word(s), and so on."

3) Append Frequency Rank

Description of Frequency Rank from Japanese Text Analysis Tool:

"For a given word, the Frequency Rank is the total
number of words in the analysis that are more frequent
than the given word, plus 1. For example, if the given word has a
Frequency Rank of 500, then there are 499 other words in the
analysis that are more frequent than the given word."

With all 3 options checked, the sorted output list would look like this:

挨拶 22018 1243 1256
愛情 11568 2271 2338
合図 10860 2389 2465
Edited: 2015-10-10, 9:47 pm
#3
Man, just want to say thanks for all your hard work.

Been using Capture2Text and cbJisho a lot, incredibly useful.

Never used JReadability for some reason, but used this, very nice.

Not sure how the Hayashi score works, but the OBI-2 seems to work well, correctly identifying the Hans Christian Andersen translations and Japanese folk tales as easier.

Also, it's interesting to see that after parsing about 3000 texts, it shows about 6100 different kanji in the Kanji Frequency Report. I expected a higher number.
#4
visualsense Wrote:Man, just want to say thanks for all your hard work.

Been using Capture2Text and cbJisho a lot, incredibly useful.

Never used JReadability for some reason, but used this, very nice.

Not sure how the Hayashi score works, but the OBI-2 seems to work well, correctly identifying the Hans Christian Andersen translations and Japanese folk tales as easier.

Also, it's interesting to see that after parsing about 3000 texts, it shows about 6100 different kanji in the Kanji Frequency Report. I expected a higher number.
You're welcome.

For the 5000+ innocent novels, 6430 unique kanji were recognized, and of those:
4685 appeared less than 10,000 times
3364 appeared less than 1,000 times
2024 appeared less than 100 times
864 appeared less than 10 times
240 appeared exactly once.
#5
My everyday kanwa dictionary has about 6500 kanji in it, and I don't think I've ever seen a kanji that I couldn't find in there.

I don't do any Classical Literature or kanbun or serious historical stuff, but I think that you'll need to get into that kind of thing before you need more than about those 6500.
#6
And now a graph:

[Image: LQgMQ.png]
#7
cb4960, I'm always really impressed with the software that you create. Amazing.

Unfortunately, I don't have a PC, so I can't use this. :(
Is there any chance anyone would be kind enough to run this text file through the 'word frequency report'? :

http://filebin.ca/1s8mUTFYcGT/gyakuten.txt

(The file is saved with UTF-8 encoding.)
Edited: 2012-05-15, 7:54 am
#8
@cb4960

With your graph, you just beat me to it. :)

These days I'm trying to create an index of the kanji in Kanji in Context (I'm halfway there) and, once it's done, compare it with your data. (It might also be interesting to know how KO2001 compares, but it's not going to be me, as I don't work with KO.) The authors of KiC claim that the 500 most often used kanji represent 80% of the kanji in newspapers, and 1,000 raises that number to 94%. Well, that's newspapers, but what about literature?

http://i.imgur.com/6WzGP.png

Edit: Replaced bogus numbers with a graph.
Edited: 2012-05-15, 7:03 pm
#9
This is a great tool. For me, it's useful to show students that they are indeed getting more bang for their buck by learning in bunches. Then there's the side benefit of ranking books by their ease of reading.

Now that I think about it, has anyone ever collected all the Dramanote scripts into a useful group? Might be a nice project if it can't be done automatically, as these make for easy reading.

On a related note:

In the comprehensive Kanken spreadsheet, there's a column which lists the various primitives in each kanji. I wondered how it would look if we attempted to get a primitive count, then use that count to create a list to help sort kanji from their primitives.

With that, it'd be possible for users to create their own RTK order for any subset of kanji. Heisig's order is good for the most part, but it's limited in that he built the list on the assumption that you're learning all 2041 at once. If you wanted to learn the first 555 of the KO kanji, there'd probably be a more efficient order, since some primitives that are reused heavily across the full 2041 appear only once in the 555 subset.
#10
@luckynumber7:

Here: http://pastebin.ca/2149188

@Inny Jan:

Can you give a few words explaining how to interpret that graph for statistics n00bs like me?

@Nukemarine:

Good idea. Maybe someone can write an algorithm to determine the most efficient order in which to learn a given subset of kanji based on their primitives. Give the algorithm a list of the 2,000 most frequent kanji and who needs Heisig?
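Here's one rough way such an algorithm might look in Python: a greedy, Kahn-style ordering that places components before the kanji that use them, favoring the most-reused components first. This is purely my own sketch; the component data would have to come from something like the Kanken spreadsheet mentioned above:

```python
from collections import Counter

def component_order(subset, components):
    """Order a subset of kanji so that any kanji in the subset that is
    also a component of another comes first; ties go to the component
    that is reused by the most kanji in the subset.

    components: dict mapping kanji -> list of component kanji/primitives.
    """
    reuse = Counter()
    for k in subset:
        for c in components.get(k, []):
            reuse[c] += 1
    subset = list(subset)
    placed = set()
    order = []
    while len(order) < len(subset):
        # A kanji is ready when all its in-subset components are placed.
        ready = [k for k in subset
                 if k not in placed
                 and all(c in placed or c not in subset
                         for c in components.get(k, []))]
        if not ready:  # cyclic or bad data: place the remainder as-is
            ready = [k for k in subset if k not in placed]
        ready.sort(key=lambda k: -reuse[k])  # most-reused first
        k = ready[0]
        placed.add(k)
        order.append(k)
    return order
```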
#11
Nukemarine Wrote:This is a great tool. For me, it's useful to show students that they are indeed getting more bang for their buck by learning in bunches. Then there's the side benefit of ranking books by their ease of reading.

Now that I think about it, has anyone ever collected all the Dramanote scripts into a useful group? Might be a nice project if it can't be done automatically, as these make for easy reading.
I guess you could also analyze subtitles (though you would probably have to strip the time codes first) and get lists of dorama ordered by ease of language, plus vocabulary lists you could use in advance of watching a dorama to boost your understanding and learn new vocabulary in an easy and fun way.
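For .srt files, stripping the cue numbers and time codes could look roughly like this (a hedged Python sketch, assuming standard SubRip formatting):

```python
import re

def strip_srt(lines):
    """Keep only dialogue lines from an .srt subtitle file,
    dropping cue numbers, timecode lines, and blank lines."""
    timecode = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->")
    out = []
    for line in lines:
        s = line.strip()
        if not s or s.isdigit() or timecode.match(s):
            continue
        out.append(s)
    return out
```

The surviving dialogue lines could then be fed straight into the word frequency analysis.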


Quote:On a related note:

In the comprehensive Kanken spreadsheet, there's a column which lists the various primitives in each kanji. I wondered how it would look if we attempted to get a primitive count, then use that count to create a list to help sort kanji from their primitives.

With that, it'd be possible for users to create their own RTK order for any subset of kanji. Heisig's order is good for the most part, but it's limited in that he built the list on the assumption that you're learning all 2041 at once. If you wanted to learn the first 555 of the KO kanji, there'd probably be a more efficient order, since some primitives that are reused heavily across the full 2041 appear only once in the 555 subset.
I think it would be harder than it sounds, but if such a tool existed, it would be of great use for lots of people.
#12
cb4960 Wrote:Can you give a few words explaining how to interpret that graph for statistics n00bs like me?
The blue trend plots numbers that are calculated based on the formula:
p_i = k_i/n
where:
p_i - probability of encountering the i-th kanji (from your sample)
k_i - number of times the i-th kanji has been encountered
n - total number of "encounters" for all kanji

So the formula basically answers the question: what is the probability that a kanji you are looking at is this specific (i-th) kanji?

The red one does:
cp_j = sum_{i=1}^j p_i
where:
cp_j - cumulative probability of recognising a kanji where the range of known kanji is from 1 to j
p_i - probability calculated from the formula above
sum_{i=1}^j - does summation of p_i probabilities starting from p_1 and ending at p_j

Here we want to know: given that you know the kanji in the range 1 to j, what is the probability that a kanji you are looking at falls within that range?
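Both formulas are easy to reproduce in a few lines of Python (my own sketch, using made-up counts):

```python
def kanji_probabilities(counts):
    """Given per-kanji counts sorted most-to-least frequent, return
    p_i = k_i / n for each kanji plus the running cumulative
    probability cp_j = sum of p_1 .. p_j."""
    ordered = sorted(counts, reverse=True)
    n = sum(ordered)          # total number of "encounters"
    probs = [k / n for k in ordered]
    cumulative = []
    running = 0.0
    for p in probs:
        running += p
        cumulative.append(running)
    return probs, cumulative
```

The cumulative curve is exactly the red trend in the graph: it climbs toward 1.0 as the range of known kanji grows.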
#13
cb4960 Wrote:@luckynumber7:

Here: http://pastebin.ca/2149188
Thank you so much! This means a lot to me.
#14
Say I have a large list of vocab words from the novels I read, and I want to learn the most frequent ones first and leave the rarer for last... Can I somehow use this program to reorder my list like that?
#15
Thanks for the tool!

I analyzed the components of 2136 joyo kanji and here are the 149 most common ones (I left out those occurring 8 times or less):

http://pastebin.com/CzUTQapV

I've also analyzed my full database of 6776 kanji and these are the 15 most common elements:

1391 口
652 木
648 日
493 田
484 土
444 艹
429 亠
427 十
405 二
395 丶
394 丷
390 目
378 氵
366 小
328 月
#16
toshiromiballza Wrote:Thanks for the tool!

I analyzed the components of 2136 joyo kanji and here are the 149 most common ones (I left out those occurring 8 times or less):

http://pastebin.com/CzUTQapV

I've also analyzed my full database of 6776 kanji and these are the 15 most common elements:

1391 口
652 木
648 日
493 田
484 土
444 艹
429 亠
427 十
405 二
395 丶
394 丷
390 目
378 氵
366 小
328 月
Seems off. You'd think ONE would be near the top. Is this based on the radicals and not Heisig primitives? Anyway, it's a cool list.

Anybody know a good text frequency analyzer, and maybe even a sorting program, for lists that have things like "language, talk, cellphone, mouth, self, five, ceiling, car keys, mouth2" in the primitive column?
#17
I didn't include most one-stroke characters because practically every kanji would have one, but yes, 一 would top the list.

It's based on radicals and their variants, plus some characters that aren't radicals themselves but occur often enough to warrant their inclusion as parts (元, 五, 廿, ...).

This is from my own database which was based on KRADFILE, but I did a complete overhaul, added 400+ kanji, added new elements, corrections, etc.
#18
Hello,

I have just released version 2.0 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v2.0 via Google Code

What Changed?

Added option to choose the parse method for the word frequency report. You may now choose to use MeCab or JParser.

MeCab is a widely used morphological analyzer. However, for certain word conjugations, it will parse words in an undesirable fashion. Example: 挟まれる is parsed as 挟ま and れる instead of 挟む.

JParser is the method used by Translation Aggregator to generate furigana. It has no problem with the above conjugation and has better support for short expressions. The downside is that JParser is much slower than Mecab.

For JParser you have the option of always using the kanji form of a word even if it is normally written in kana. If checked, ともに will be converted to 共に. If unchecked, 共に will be converted to ともに.

==============================================================

I have also released two new tools (see the second post for details):

1) cb's Frequency Report Combiner. Combines two reports generated by cb's Japanese Text Analysis Tool.

2) cb's Frequency Report Diff Tool. Compares two reports generated by cb's Japanese Text Analysis Tool.

[Edit: These tools are now deprecated. I merged their functionality into JTAT v3.0]

==============================================================

Here are the latest Japanese Text Analysis Tool reports for a large number of innocent novels:
Download via MediaFire

Includes word frequency report via Mecab, word frequency report via JParser (new), differences between the Mecab report and JParser report (new), kanji frequency report, and readability report.

cb4960
Edited: 2012-05-27, 1:40 pm
#19
CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?

If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
#20
cb4960 Wrote:MeCab is a widely used morphological analyzer. However, for certain word conjugations, it will parse words in an undesirable fashion. Example: 挟まれる is parsed as 挟ま and れる instead of 挟む.
While MeCab parses this way, it also includes the root form of the word in the part-of-speech field information. In this case it would return 挟ま 動詞,自立,*,*,五段・マ行,未然形,挟む,ハサマ,ハサマ. So the 挟む form is included.

So maybe you could include an option to use this root form of a word, to aggregate the frequencies of all conjugations into one?
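A rough Python sketch of that aggregation idea (my own illustration, following the IPAdic-style output shown above, where the 7th comma-separated feature is the root form):

```python
from collections import Counter

def count_by_root(mecab_lines):
    """Aggregate word counts by root (dictionary) form.

    Each line is 'surface<TAB>features'; the 7th feature is the
    root form. A '*' or missing field falls back to the surface
    form, and 'EOS' sentence markers are skipped.
    """
    counts = Counter()
    for line in mecab_lines:
        if line == "EOS" or not line.strip():
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        root = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        counts[root] += 1
    return counts
```

With this, 挟まれる would contribute to the count for 挟む instead of producing a separate 挟ま entry.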
#21
cryptica Wrote:CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?

If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
Do you have a link to Cangy's script?

cryptica Wrote:So maybe you could include an option to use this root form of a word, to aggregate the frequencies of all conjugations into one?
D'oh! You're right. I have no idea why I didn't use the root field. I'll release a fix build soon and issue a new word frequency report for the 5000+ innocent novels.
#22
Hello,

I have just released version 3.0 of cb's Japanese Text Analysis Tool.

Download cb's Japanese Text Analysis Tool v3.0 via Google Code

What Changed?

1) Added "Compare Frequency Reports" to the Tools menu. This allows you to compare two frequency reports.

2) Added "Combine Frequency Reports" to the Tools menu. This allows you to combine multiple frequency reports.

3) Fixed MeCab Word Frequency Report so that it uses the 'root' field of the MeCab output. (Thanks cryptica!)

==============================================================

Reports from cb's Japanese Text Analysis Tool v3.0 based on 5000+ novels (27 May 2012):
Download via MediaFire

Includes word frequency report via Mecab, word frequency report via JParser, differences between the Mecab report and JParser report, kanji frequency report, and readability report.

cb4960
Edited: 2012-05-27, 12:52 pm
#23
Hello,

I have just released version 2.0 of cb's Frequency List Sorter.

Download cb's Frequency List Sorter v2.0 via Google Code

What Changed?

Added option to specify which frequency report to compare against. You may choose from 3 pre-made frequency reports or you may use your own frequency report. The 3 pre-made frequency reports are:

1) Kanji Frequency Report
2) Word Frequency Report (MeCab)
3) Word Frequency Report (JParser)

Each of the 3 pre-made reports was generated with JTAT v3.0.

cb4960
Edited: 2012-05-27, 11:20 pm
#24
cb4960 Wrote:
Nukemarine Wrote:CB, seeing all these awesome Kanji analyzer programs, I'm wondering if you already made a Kanji sorting program similar to the one Cangy made for perl?

If you do have one, is there an option to either have kana only words placed at the top or evenly placed throughout the list?
Do you have a link to Cangy's script?
I can't find it on Cangy's Core Thread or the updated Cangy's site. Maybe I'm just misremembering where I got the program, but I honestly thought it was Cangy's.

I've got the sorting program saved, so I'll forward that to you when I get home.
#25
Here's a copy of the Perl Kanji sorting program by Fugounashi. Sorry for the long delay in posting this:

#!/usr/bin/perl -w

# Usage:
# $ kanji-sort --kanji kanjiorder.txt --sentence-field 2 < mydeck-exported.txt > mydeck-toimport.txt
# $Revision: 1.5 $ $Date: 2010/01/08 08:22:33 $
# http://ichi2.net/anki/wiki/ContribFugounashi

use open qw( :std :encoding(UTF-8) );
use strict;
use Getopt::Long;
use utf8;

my $kanjifile;
my $sentence_field;
GetOptions(
    'sentence-field=i' => \$sentence_field,
    'kanji=s'          => \$kanjifile
);

# Read the kanji order file; the kanji is in the first tab-separated
# column, and its line number becomes its rank.
my %kanji;
open KANJI, "<$kanjifile" or die "$0: cannot open $kanjifile: $!\n";
while (<KANJI>) {
    chomp;
    $_ = (split /\t/)[0];
    if (exists $kanji{$_}) {
        print STDERR "$0: warning: ignoring duplicate kanji: $_: $kanjifile: $.\n";
    } else {
        $kanji{$_} = $.;
    }
}

# For each input line, find the highest rank among the kanji that
# appear in the sentence field.
my @max;
my @lines;
while (<>) {
    chomp;
    my $i = $. - 1;
    $lines[$i] = $_;
    my $sentence = (split '\t', $_)[$sentence_field];
    my @chars = split //, $sentence;
    $max[$i] = 0;
    foreach my $char (@chars) {
        if (exists $kanji{$char} && $kanji{$char} > $max[$i]) {
            $max[$i] = $kanji{$char};
        }
    }
}

# Sort lines by their highest-ranked kanji, then print each line
# with its rank and the step up from the previous line's rank.
my @index  = 0 .. (@max - 1);
my @sorted = sort { $max[$a] <=> $max[$b] } @index;

my $last = 0;
foreach my $i (@sorted) {
    my $step = $max[$i] - $last;
    $last = $max[$i];
    print "$lines[$i]\t$max[$i]\t$step\n";
}