This tool is deprecated. Use cb's Japanese Text Analysis Tool instead. Thread for cb's Japanese Text Analysis Tool.
Here is the utility that I mentioned in the previous post. I call it cb's Japanese Word Frequency List Generator.
Download cb's Japanese Word Frequency List Generator v1.1 via MediaFire
Download cb's Japanese Word Frequency List Generator v1.1 Source Code via MediaFire
Here is the custom Mecab user dictionary that I created. Used together with the default Mecab dictionary, it adds several thousand words:
Download custom Mecab user dictionary via MediaFire
![[Image: main_v1.1.png]](http://subs2srs.sourceforge.net/JapaneseWordFrequencyListGenerator/main_v1.1.png)
From the readme file:
What is cb's Japanese Word Frequency List Generator?
As the name implies, it is a utility for generating word frequency lists from
files that contain Japanese text.
The input to the program is a directory that contains the .txt files with Japanese
text (files from subdirectories will also be used).
The output of the program is a text file with each line consisting of a Japanese
word that was encountered in the input files and a number indicating the number
of times that the word was encountered. Example:
...
8297 謝る
8290 王女
8284 欲望
...
In this example, 謝る was encountered 8297 times in the input files, 王女 was
encountered 8290 times, etc.
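
For anyone curious how an output file like this can be produced, here is a minimal Python sketch. It is not the tool's actual code. The function name format_frequency_lines is made up for illustration, and the single space between count and word is an assumption based on the example above.

```python
from collections import Counter


def format_frequency_lines(counts):
    """Return 'count word' lines, most frequent word first.

    Hypothetical helper: the single-space separator is an assumption
    taken from the readme example, not from the tool's source code.
    """
    # Sort by descending count; break ties by the word itself so the
    # order is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count} {word}" for word, count in ordered]


if __name__ == "__main__":
    counts = Counter({"王女": 8290, "謝る": 8297, "欲望": 8284})
    for line in format_frequency_lines(counts):
        print(line)
```

Feeding it the three words from the example gives them back in the same order as the readme excerpt.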
How to Install and Launch:
1) Unzip cb's Japanese Word Frequency List Generator.
2) In the unzipped directory, simply double-click
JapaneseWordFrequencyListGenerator.exe.
Note: Requires the .NET Framework.
How to Use:
It's fairly straightforward: fill out each of the fields and press the Start
button. The fields are explained below:
Root Directory:
The directory that contains the .txt files to analyze. Files from subdirectories
will also be used.
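
As a rough illustration of the "files from subdirectories will also be used" behavior, a recursive search for .txt files could look like the Python sketch below. The function name find_text_files is hypothetical, and this is only one way to do it, not necessarily how the program does it.

```python
from pathlib import Path


def find_text_files(root_directory):
    """Return every .txt file under root_directory, including subdirectories.

    Hypothetical helper for illustration only.
    """
    root = Path(root_directory)
    # rglob walks the whole tree below root; is_file() skips any
    # directory that happens to be named like "something.txt".
    return sorted(path for path in root.rglob("*.txt") if path.is_file())
```

Note that the "*.txt" pattern is case-sensitive on some systems, so files ending in ".TXT" may need a second pattern.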
Encoding:
The encoding of the input files.
Output File:
The output file that will contain the frequency information.
Mecab Location:
The location of Mecab. Mecab is the sophisticated morphological analyzer that this
program relies on to extract words from the input files. You can download it from:
https://sourceforge.net/projects/mecab/f...cab-win32/
During the installation of Mecab, you will be prompted for the encoding to
use. Select UTF-8.
Mecab User Dictionary Location (Optional):
If you have a user dictionary that you would like to use in addition to the default Mecab
dictionary, you may enter it here. Leave blank if you don't have a user dictionary.
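
To give an idea of how the Mecab location and the optional user dictionary might fit together, here is a small Python sketch that builds a Mecab command line. The function name build_mecab_command is hypothetical. The -O wakati flag (space-separated output) and the -u flag (user dictionary) are standard Mecab command-line options, but using them this way is my assumption and not necessarily what the program does internally.

```python
def build_mecab_command(mecab_path, user_dictionary=""):
    """Build an argument list for running Mecab on text from stdin.

    Hypothetical helper: the choice of output format is an assumption.
    """
    # -O wakati asks Mecab to print the words separated by spaces.
    command = [mecab_path, "-O", "wakati"]
    # Only add -u when a user dictionary was actually supplied, which
    # mirrors the "leave blank" behavior described above.
    if user_dictionary:
        command += ["-u", user_dictionary]
    return command
```

The resulting list can be passed to subprocess.run together with the UTF-8 encoded input text.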
Chop Front:
The # of lines to ignore from the start of each file. Can be useful for removing
TOC and other header information.
Chop Back:
The # of lines to ignore from the end of each file.
Can be useful for removing copyright and publisher information.
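
The Chop Front and Chop Back idea can be sketched in a few lines of Python. The function name chop_lines is hypothetical, and the sketch assumes both values are zero or greater. It only shows the trimming, not the rest of the analysis.

```python
def chop_lines(lines, chop_front=0, chop_back=0):
    """Drop chop_front lines from the start and chop_back from the end.

    Hypothetical helper; assumes both counts are non-negative.
    """
    # Compute the end index explicitly: a slice like lines[a:-0] would
    # return an empty list when chop_back is 0.
    end = len(lines) - chop_back
    if end <= chop_front:
        # The file is shorter than the requested chops, so nothing is left.
        return []
    return lines[chop_front:end]
```

For example, chop_lines(["TOC", "a", "b", "Copyright"], 1, 1) keeps only the two middle lines.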
To customize the field defaults, edit the "settings.txt" file that is in the
same directory as the .exe file.
How Much Time Does the Analysis Take?
On my 3.5-year-old computer, analysis of 5109 Japanese novels took 34 minutes.
cb4960
Edited: 2012-05-27, 1:12 pm
