cb's Japanese Text Analysis Tool

Japanese Text Analysis Tool allows users to generate 4 kinds of reports:
1) Word Frequency Report
2) Kanji Frequency Report
3) Formula-based Readability Report
4) User-based Readability Report

You may analyze a single text file or an entire directory of text
files (including sub-directories).

Before text analysis begins, all Aozora formatting is removed (if present) and
a user-specified number of lines are trimmed from the beginning and end of the
text files.
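
The post does not spell out the exact clean-up rules, so the following is only a rough sketch of what this pre-processing step could look like. It assumes the usual Aozora Bunko markup (ruby readings in 《》, the ｜ ruby-start marker, and editor notes in ［＃…］); the function names and the `trim_start`/`trim_end` parameters are made up for illustration and are not taken from the tool.

```python
import re

# Assumed Aozora Bunko markup (not confirmed to match the tool's own rules):
RUBY = re.compile(r"《[^》]*》")      # ruby readings, e.g. 東京《とうきょう》
NOTE = re.compile(r"［＃[^］]*］")    # editor annotations, e.g. ［＃改ページ］

def strip_aozora(line):
    """Remove ruby readings, editor notes, and the ruby-start marker."""
    line = RUBY.sub("", line)
    line = NOTE.sub("", line)
    return line.replace("｜", "")

def preprocess(lines, trim_start=0, trim_end=0):
    """Drop trim_start leading and trim_end trailing lines, then strip markup."""
    end = len(lines) - trim_end
    kept = lines[trim_start:end] if end > trim_start else []
    return [strip_aozora(line) for line in kept]
```

For example, `preprocess(["header", "漢字《かんじ》", "footer"], 1, 1)` keeps only the middle line and strips its ruby reading.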

All generated reports will be placed in the specified output directory.

In the Tools menu, you can launch tools that allow you to combine multiple
frequency reports or compare two frequency reports.

Download Japanese Text Analysis Tool via SourceForge
(Requires .NET Framework 4.5)

[Image: main.png]

Word Frequency Report
Name: word_freq_report.txt

Format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of words)
Field 6: Cumulative percentage
Field 7: Part-of-speech

Frequency Group: All words in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
word(s), group 2 containing the next most common word(s), and so on.

Frequency Rank: For a given word, the Frequency Rank is the total number of words
in the analysis that are more frequent than the given word, plus 1. For example, if the
given word has a Frequency Rank of 500, then there are 499 other words in the analysis
that are more frequent than the given word.

Report is sorted from most frequent word to least frequent word.
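
To make the difference between Frequency Group and Frequency Rank concrete, here is a small sketch that computes Fields 1 to 6 from a list of already-segmented words. It is not the tool's code: the function name is hypothetical, "total number of words" is assumed to mean the total number of word occurrences, and ties are broken alphabetically only to keep the output stable.

```python
from collections import Counter

def frequency_rows(tokens):
    """Return rows of (count, word, group, rank, percent, cumulative_percent)."""
    counts = Counter(tokens)
    total = sum(counts.values())  # assumed: total word occurrences
    # Most frequent first; alphabetical tie-break is an arbitrary choice here.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    group = 0
    rank = 0
    prev_count = None
    cumulative = 0
    for index, (word, count) in enumerate(ordered):
        if count != prev_count:
            group += 1        # a new group starts whenever the count changes
            rank = index + 1  # words that are strictly more frequent, plus 1
            prev_count = count
        cumulative += count
        rows.append((count, word, group, rank,
                     100.0 * count / total, 100.0 * cumulative / total))
    return rows
```

With the words a, a, a, b, b, c, c, d, the words b and c share group 2 and rank 2, while d lands in group 3 but has rank 4, because three words are more frequent than it.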

You have two methods of generating a report: MeCab or JParser.

MeCab is a widely used morphological analyzer and is quite fast.

JParser is an alternate method that uses a larger dictionary (EDICT + ENAMDICT)
and thus recognizes more words and seems to have better support for names
and short expressions. However, it is much slower than MeCab.

The Part-of-speech field is printed verbatim from MeCab/JParser.

Kanji Frequency Report
Name: kanji_freq_report.txt

Format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of kanji)
Field 6: Cumulative percentage

Frequency Group: All kanji in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
kanji, group 2 containing the next most common kanji, and so on.

Frequency Rank: For a given kanji, the Frequency Rank is the total number of kanji
in the analysis that are more frequent than the given kanji, plus 1. For example, if the
given kanji has a Frequency Rank of 500, then there are 499 other kanji in the analysis
that are more frequent than the given kanji.

Report is sorted from most frequent kanji to least frequent kanji.
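
Unlike the word report, kanji counting needs no morphological analyzer, since each character can be checked on its own. The sketch below shows one way to get the counts (Field 1) and percentages (Field 5). Which code points the tool actually treats as kanji is not stated in this post; the two Unicode ranges used here (CJK Unified Ideographs and Extension A) are an assumption, and the function names are hypothetical.

```python
from collections import Counter

def is_kanji(ch):
    """Assumed definition: CJK Unified Ideographs plus Extension A."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF

def kanji_counts(text):
    """Count every kanji occurrence, ignoring kana, punctuation, and so on."""
    return Counter(ch for ch in text if is_kanji(ch))

def kanji_percentages(text):
    """Each kanji's count as a percentage of all kanji occurrences."""
    counts = kanji_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kanji: 100.0 * n / total for kanji, n in counts.items()}
```

The group and rank fields then follow the same rules as in the word report.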

Formula-based Readability Report
This report is generated from the Hayashi and OBI-2 readability calculations.

Name: formula_based_readability_report.txt

Format:
Field 1: OBI-2 Grade Level (1-13, where 1 is the most readable)
Field 2: Hayashi Score (0-100, where 100 is the most readable)
Field 3: Filename

Report is sorted from most readable to least readable.

Hayashi Score Information:
http://www.ideosity.com/ideosphere/seo-i...ts#Hayashi

OBI-2 Grade Level Information:
http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/obi_e.html

User-based Readability Report
Using a list of words that the user already knows, this report can help to determine readability of a text based on the percentage of words in the text that the user already knows.

Name: user_based_readability_report.txt

Format:
Field 1: Readability expressed as a percentage (0-100) of the total number
of non-unique known words vs. the total number of non-unique words.
Field 2: Total number of non-unique words
Field 3: Total number of non-unique known words
Field 4: Total number of non-unique unknown words
Field 5: Readability expressed as a percentage (0-100) of the total number
of unique known words vs. the total number of unique words.
Field 6: Total number of unique words
Field 7: Total number of unique known words
Field 8: Total number of unique unknown words
Field 9: Filename

Report is sorted based on Readability (Field 1).

To generate this report, the "File that contains a list of words that you already know" option must be filled in. If a line contains multiple tab-separated columns, then the word is assumed to be in the first column.
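
As an illustration of how Fields 1 to 8 relate to each other, here is a minimal sketch that works on an already-segmented list of words. The function names are hypothetical, the handling of blank lines in the known-words file is a guess, and the tool itself may count things differently (for example, it may normalize words before comparing them).

```python
def load_known_words(lines):
    """Take the first tab-separated column of each non-blank line."""
    known = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumed: blank lines are skipped
        known.add(line.split("\t")[0].strip())
    return known

def user_readability(tokens, known):
    """Return Fields 1 to 8 of the report for one text."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    total = len(tokens)                              # non-unique words
    known_total = sum(1 for t in tokens if t in known)
    unique = set(tokens)                             # unique words
    unique_known = unique & known
    return (pct(known_total, total), total, known_total, total - known_total,
            pct(len(unique_known), len(unique)), len(unique),
            len(unique_known), len(unique) - len(unique_known))
```

For a text segmented as 猫, が, 好き, 猫, 犬 with 猫 and 好き in the known list, Field 1 comes out as 60 (3 of 5 occurrences are known) and Field 5 as 50 (2 of 4 distinct words are known).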


Have Fun!
Edited: 2015-10-05, 9:10 pm