cb's Japanese Text Analysis Tool

#26
This looks like a fun tool, so I thought I would play with it a bit. I decided to run it on more than 5600 Japanese subtitle files taken from http://kitsunekko.net/

There were 66,603 unique words reported, from a total of 11,838,104 words.
The 100 most common entries made up 53% of the total.
The 500 most common entries made up 70% of the total.
The 1000 most common entries made up 76% of the total.
The 2000 most common entries made up 83% of the total.
The 6000 most common entries made up 92% of the total.
The 10000 most common entries made up 95% of the total.
The 20000 most common entries made up 98% of the total.

However, the 100 most common entries consisted mostly of single characters or groups of characters that one would be more likely to consider "grammar" than to consider as words. I'm not sure it's really fair to count these.
If we completely remove those 100 entries from consideration (so their occurrences no longer count toward the total either) and then do the same calculations, everything changes dramatically.

The 100 most common entries made up 16% of the total.
The 500 most common entries made up 39% of the total.
The 1000 most common entries made up 51% of the total.
The 2000 most common entries made up 64% of the total.
The 6000 most common entries made up 83% of the total.
The 10000 most common entries made up 89% of the total.
The 20000 most common entries made up 96% of the total.
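
For anyone who wants to redo this kind of calculation on their own frequency list, here is a rough sketch of how the percentages above can be computed. It assumes the word counts have already been loaded into a dict of word to count; the function name `coverage`, the `skip` parameter, and the toy data are all made up for illustration and are not part of the tool or its report format.

```python
from collections import Counter


def coverage(counts, top_n, skip=0):
    """Fraction of all occurrences covered by the top_n most frequent entries.

    The `skip` most frequent entries are dropped first, and their
    occurrences are also left out of the total.
    """
    # Sort the raw counts from most to least frequent, then drop the skipped ones.
    ranked = sorted(counts.values(), reverse=True)[skip:]
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:top_n]) / total


# Toy data (invented): two "grammar" entries and three ordinary words.
toy = Counter({"の": 50, "は": 30, "猫": 12, "犬": 5, "テレビ": 3})

all_entries = coverage(toy, 2)              # top 2 of everything: 80 / 100
without_top_2 = coverage(toy, 2, skip=2)    # top 2 of the rest: 17 / 20

print(f"{all_entries:.0%}")
print(f"{without_top_2:.0%}")
```

With the real list, `coverage(counts, 500)` would correspond to the first set of numbers and `coverage(counts, 500, skip=100)` to the second, assuming I have described the calculation the same way I actually did it.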


Keep in mind, there are so many different factors at play here that I don't think we can draw much of a conclusion from this. I thought it was interesting though.
Also interesting: while reading through the list of words that occurred only a SINGLE time, I saw many that actually seemed like rather normal words, including many katakana words that I could immediately understand.