It has been mentioned several times that the usual kanji frequency lists people refer to come from mining financial newspapers, and are thus rather skewed in their vocabulary. So, since you can download a static snapshot of ja.wikipedia.org, I thought I'd run a simple frequency analysis on it. I just finished the first version of the program, and it's now going through the approx. 25 GB of data, counting occurrences of single kanji as well as compounds (i.e. uninterrupted strings of kanji that contain no kana).
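For anyone curious what that counting looks like, here is a minimal sketch in Python. It is not my actual program, just an illustration of the idea. It assumes "kanji" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so the repetition mark 々 and rarer extension-block characters are not counted, and it assumes the text has already been stripped of HTML. The names `count_kanji` and `top_percentages` are made up for this example.

```python
import re
from collections import Counter

# Assumption: a "kanji" is any character in the basic CJK Unified
# Ideographs block. Kana, punctuation, Latin text etc. end a run.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def count_kanji(text, kanji=None, compounds=None):
    """Count single kanji and kanji-only runs ("compounds") in text.

    Existing Counters can be passed in so that counts accumulate
    across many pages.
    """
    kanji = Counter() if kanji is None else kanji
    compounds = Counter() if compounds is None else compounds
    for run in KANJI_RUN.findall(text):
        kanji.update(run)          # every character in the run is one kanji
        if len(run) > 1:           # runs of two or more count as a compound
            compounds[run] += 1
    return kanji, compounds

def top_percentages(counter, n=10):
    """Return the n most common entries as (item, percent of all occurrences)."""
    total = sum(counter.values())
    return [(item, 100.0 * count / total)
            for item, count in counter.most_common(n)]

if __name__ == "__main__":
    sample = "最近の出来事を報告する。最近"
    kanji, compounds = count_kanji(sample)
    print(top_percentages(kanji, 3))
    print(compounds.most_common(3))
```

Note that a run-based definition like this treats every maximal kanji string as one compound, so long names and titles each become their own entry, which is one way the number of unique compounds can grow quickly.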
I wrote this mostly as a fun exercise, but I can share the data after the process finishes (it might take a while...) if people here think it might be of any use. The vocabulary used in Wikipedia is most likely also skewed, and there are probably a lot of ways to improve how the program handles the Wikipedia pages, but let me know what you think.
At the moment, the program has crunched through 60,000 HTML pages and encountered 4377 unique kanji and 176987 unique compounds (hrrmh.. that sounds like too many. I think I need to check the logic for bugs).
Top kanji by frequency so far
年 2.840%
日 2.646%
最 2.374%
月 2.317%
事 1.965%
用 1.872%
新 1.868%
利 1.625%
者 1.611%
国 1.394%
Top 10 compounds
利用者 113599 occurrences
編集 77492 occurrences
投稿 76718 occurrences
参照 65074 occurrences
報告 59338 occurrences
記事 57930 occurrences
井戸端 57915 occurrences
出来事 56760 occurrences
著作権 56554 occurrences
最近 56448 occurrences
Do these look like they make any sense?
