Kanji Frequency in Wikipedia


 
Reply #1 - 2009 June 04, 2:32 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

It has been mentioned several times that the usual kanji frequency lists people refer to come from mining financial newspapers, and are thus rather skewed in their vocabulary. So, since you can download a static snapshot of ja.wikipedia.org, I thought I'd run some simple frequency analysis on it. I just finished the first version of the program, and it's now going through the roughly 25 GB of data, counting occurrences of single kanji as well as compounds (i.e. uninterrupted strings of kanji that don't contain kana).
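For the curious, the counting step is roughly this in Python (a simplified sketch; the character class is just the basic CJK Unified Ideographs block, so anything in the extension blocks gets missed):

```python
import re
from collections import Counter

# Runs of characters from the basic CJK Unified Ideographs block.
KANJI_RUN = re.compile(r'[\u4e00-\u9fff]+')

kanji_counts = Counter()
compound_counts = Counter()

def count_page(text):
    for run in KANJI_RUN.findall(text):
        kanji_counts.update(run)       # every character individually
        if len(run) > 1:
            compound_counts[run] += 1  # the uninterrupted run as one "compound"
```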

I wrote this mostly as a fun exercise, but I can share the data after the process finishes (it might take a while...) if people here think it might have any use. The vocabulary used in Wikipedia is most likely also skewed and there's probably a lot of ways to improve the way the program handles the Wikipedia pages, but let me know what you think.

At the moment, the program has crunched through 60,000 HTML pages and encountered 4,377 unique kanji and 176,987 unique compounds (hrrmh... that sounds like too many; I think I need to check the logic for bugs).

Top kanji by frequency so far
年 2.840%
日 2.646%
最 2.374%
月 2.317%
事 1.965%
用 1.872%
新 1.868%
利 1.625%
者 1.611%
国 1.394%

Top 10 compounds
利用者 113599 occurrences
編集 77492 occurrences
投稿 76718 occurrences
参照 65074 occurrences
報告 59338 occurrences
記事 57930 occurrences
井戸端 57915 occurrences
出来事 56760 occurrences
著作権 56554 occurrences
最近 56448 occurrences

Do these look like they make any sense?

Reply #2 - 2009 June 04, 2:37 pm
LTze0 Member
From: UK - Cambridge Registered: 2009-04-26 Posts: 19

I think you might need some pre-processing to strip down to just the core content, if it's full HTML pages off Wikipedia (which seems likely given the top compounds you're getting ;)).

Though knowing the kanji compounds for getting around Wikipedia is useful, I don't think they belong in the numbers of a kanji-frequency report if you want accuracy. :D

Reply #3 - 2009 June 04, 2:46 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

LTze0 wrote:

I think you might need some pre-processing to strip down to just the core content, if it's full HTML pages off Wikipedia (which seems likely given the top compounds you're getting ;)).

Though knowing the kanji compounds for getting around Wikipedia is useful, I don't think they belong in the numbers of a kanji-frequency report if you want accuracy. :D

Yeah, I just looked up translations for the most common compounds. Doh! I tried to limit the mining to just the actual content, but clearly the parser is not doing what it's supposed to.

Reply #4 - 2009 June 04, 2:54 pm
vosmiura Member
From: SF Bay Area Registered: 2006-08-24 Posts: 1085

The top compounds look like Wikipedia boilerplate. For example, 編集 appears everywhere you can click to edit an article.

Reply #5 - 2009 June 04, 3:44 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

I changed the program to prune a lot more stuff, hopefully getting rid of most of the Wikipedia boilerplate. Here are some new numbers from the beginning of the second run:

Top 10 most frequent kanji
年 2.162%
日 1.320%
月 0.999%
用 0.994%
発 0.900%
本 0.852%
作 0.742%
大 0.722%
行 0.717%
機 0.715%

Top 10 compounds
日本 7381 occurrences
使用 7341 occurrences
開発 7174 occurrences
場合 6195 occurrences
放送 6179 occurrences
発売 5120 occurrences
現在 4809 occurrences
番組 4137 occurrences
可能 3868 occurrences
搭載 3865 occurrences
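For reference, the new pruning is conceptually just "parse the page, keep the article body, drop the obvious boilerplate". A rough sketch with an HTML parser (the id and class names here are stand-ins for whatever the MediaWiki skin actually uses, not necessarily what my code matches):

```python
from bs4 import BeautifulSoup

def extract_body_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # The MediaWiki skin wraps the article proper in a div like this.
    body = soup.find('div', id='bodyContent')
    if body is None:
        return ''
    # Drop tables, scripts and styles wholesale, plus known boilerplate divs.
    for tag in body.find_all(['table', 'script', 'style']):
        tag.decompose()
    for div in body.find_all('div', class_=['toc', 'navbox']):
        div.decompose()
    return body.get_text()
```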

Last edited by shang (2009 June 04, 3:44 pm)

Reply #6 - 2009 June 04, 4:04 pm
bombpersons Member
From: UK Registered: 2008-10-08 Posts: 907 Website

How did you download a static snapshot of the wiki?

Reply #7 - 2009 June 04, 4:09 pm
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

bombpersons wrote:

How did you download a static snapshot of the wiki?

http://en.wikipedia.org/wiki/Wikipedia_database

@Shang: Your parsing methodology is flawed as you have described it. A word can be all kanji, kanji+kana, kana+kanji, kanji+kana+kanji, kanji+kana+kanji+kana, all kana, etc. The only reliable method of detection is to use dictionary lookups.

However, there is already a good Japanese parser out there called ChaSen ( http://chasen.naist.jp/hiki/ChaSen/ ). You might want to run Wikipedia through that to get a proper word frequency list. It even accounts for conjugation!
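Driving ChaSen from a script should be easy too; something like this for pulling out a word list (untested; I'm going by its default output of one morpheme per line, tab-separated with the surface form first, plus an EOS marker after each sentence; depending on the build you may need EUC-JP input instead of UTF-8):

```python
import subprocess
from collections import Counter

def chasen_words(text):
    # Default ChaSen output: one morpheme per line, tab-separated
    # fields with the surface form first, "EOS" after each sentence.
    out = subprocess.run(['chasen'], input=text,
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line and line != 'EOS':
            yield line.split('\t')[0]

word_counts = Counter(chasen_words('今日は良い天気です。'))  # sample input
```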

Last edited by Jarvik7 (2009 June 04, 4:19 pm)

Reply #8 - 2009 June 04, 4:11 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Yeah, your new list looks a lot more authentic, so to speak. While Wikipedia is definitely extremely skewed, I think this is useful stuff. Good job!

Reply #9 - 2009 June 04, 4:12 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

bombpersons wrote:

How did you download a static snapshot of the wiki?

http://en.wikipedia.org/wiki/Wikipedia_database

Actually, now that I read the page a bit more carefully, it seems you can get an XML snapshot of just the articles and nothing else. That would probably be a lot easier to work with than the 25 GB of HTML files I'm using now. :P
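Streaming the article text out of the XML dump should then be simple, something along these lines (the export namespace varies by dump version, so check the file header):

```python
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.3/}'  # version from the dump's header

def article_texts(dump_path):
    # iterparse streams the file, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(dump_path):
        if elem.tag == NS + 'text' and elem.text:
            yield elem.text
        elem.clear()  # free each subtree once processed
```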

Some more data:

Top 5 length 2 compounds
日本 27275 occurrences
場合 22140 occurrences
現在 20852 occurrences
使用 17358 occurrences
放送 16453 occurrences

Top 5 length 3 compounds
一般的 3364 occurrences
基本的 3223 occurrences
可能性 2582 occurrences
主人公 2291 occurrences
日発売 1766 occurrences

Top 5 length 4 compounds
株式会社 2540 occurrences
核心部分 1934 occurrences
江戸時代 1335 occurrences
携帯電話 1215 occurrences
登場人物 1027 occurrences

Reply #10 - 2009 June 04, 4:22 pm
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

On a side note, people may want to check out http://corpus.leeds.ac.uk/frqc/internet-jp.num (if you get mojibake, set your browser's text encoding to Unicode).

It is a word frequency list made from a corpus of internet content.

Reply #11 - 2009 June 04, 4:29 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Jarvik7 wrote:

On a side note, people may want to check out http://corpus.leeds.ac.uk/frqc/internet-jp.num (if you get mojibake, set your browser to unicode text encoding)

It is a word frequency list made from a corpus of internet content.

I love how 的 is so high on that list, being the most common character used in Chinese. :D

Reply #12 - 2009 June 04, 4:29 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

Jarvik7 wrote:

@Shang: Your parsing methodology is flawed as you have described it. A word can be all kanji, kanji+kana, kana+kanji, kanji+kana+kanji, kanji+kana+kanji+kana, all kana, etc. The only reliable method of detection is to use dictionary lookups.

However, there is already a good Japanese parser out there called ChaSen ( http://chasen.naist.jp/hiki/ChaSen/ ). You might want to run Wikipedia through that to get a proper word frequency list. It even accounts for conjugation!

Yeah, that was just the quickest way of getting some meaningful results without doing any linguistic analysis. My original idea was just to analyze kanji frequencies; the compounds were more of a freebie on the side.

Thanks for the ChaSen link! I'll see if I have time during the weekend to figure out how it works and plug it into my program.

Reply #13 - 2009 June 04, 4:34 pm
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

Tobberoth wrote:

Jarvik7 wrote:

On a side note, people may want to check out http://corpus.leeds.ac.uk/frqc/internet-jp.num (if you get mojibake, set your browser's text encoding to Unicode).

It is a word frequency list made from a corpus of internet content.

I love how 的 is so high on that list, being the most common character used in Chinese. :D

Yeah, that parser takes a very formal view of Japanese grammar, so suffixes, particles, etc. are on the same level as what we would call a word.

@shang: There is a precompiled Windows binary on the FAQ page. I don't know if it works out of the box or needs to be fed some dictionary files, though. I've been meaning to analyze my set of previous JLPT tests but never got around to it.

Last edited by Jarvik7 (2009 June 04, 4:40 pm)

Reply #14 - 2009 June 05, 3:24 am
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

My computer finished crunching through the whole Japanese Wikipedia last night. There were 1,194,582 files in total, most of which were discarded (discussions, user pages, article stubs, etc.). For the remaining 416,563 pages, I included only body text, discarding all tables, lists, sidebars, headings, etc., and also all kana and romaji. This left me with a sample of 106,067,614 kanji, with 8,916 unique characters.

I thought people might be interested in some threshold values:

173 kanji make up 50% of all kanji occurrences in Wikipedia.
454 kanji cover 75%.
874 kanji cover 90%.
1,214 kanji cover 95%.
2,061 kanji cover 99%.
2,456 kanji cover 99.5%.
3,489 kanji cover 99.9%.
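In case anyone wants to reproduce these from their own counts: sort the frequencies in descending order, keep a running sum, and note the rank where each percentage is crossed. Roughly:

```python
def coverage_thresholds(counter, thresholds=(0.50, 0.75, 0.90, 0.95, 0.99, 0.995, 0.999)):
    # counter: collections.Counter mapping kanji -> occurrence count
    total = sum(counter.values())
    running, pending, results = 0, list(thresholds), {}
    for rank, (_kanji, count) in enumerate(counter.most_common(), start=1):
        running += count
        while pending and running >= pending[0] * total:
            results[pending.pop(0)] = rank
    return results  # e.g. {0.5: 173, 0.75: 454, ...}
```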

Reply #15 - 2009 June 05, 4:22 am
captal Member
From: San Jose Registered: 2008-03-22 Posts: 682

It's amazing that 173 kanji make up 50% of Wikipedia!

Reply #16 - 2009 June 05, 8:32 am
ファブリス Administrator
From: Belgium Registered: 2006-06-14 Posts: 4044 Website

Hi shang, that's very interesting.

Can you make the kanji lists available for the 50%, 75%, 90%, and 95% "tiers"?

I'd like to add support for reviewing/studying subsets of kanji on the website in the future, and I could use your list along with the JLPT sets etc. If you have it on Google Spreadsheets or as a downloadable file, I may be able to use it. If you put the data online, you may want to add a Creative Commons license or something like that, so we know who to credit for the original data.

captal wrote:

That's amazing that 173 kanji make up 50% of wikipedia!

It's very tempting to learn those first, isn't it :)

I remember vaguely reading something about how you could handle basic communication in most languages with a select list of 200 words or so.

Reply #17 - 2009 June 05, 9:34 am
transalpin Member
From: on the move Registered: 2008-10-30 Posts: 27

Would it be possible to repeat this exercise on the Chinese Wikipedia? The XML index, however, contains both traditional and simplified characters.
You could analyse the HTML source for simplified (简体) and traditional (繁體), Mainland (大陆) and Taiwan Standard (台灣正體) separately in order to get the most accurate results for either orthography. (Wikipedia uses its own software to convert on the fly in either direction, even for compounds.)

A CC license of some sort seems like a good idea! You could allow people to use your data freely as long as they link to your site.

Reply #18 - 2009 June 05, 10:05 am
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

Google Spreadsheet

I included the most common kanji up to rank 3500, and after that just the ones that have a Heisig keyword. The Heisig keywords and numbers were cross-referenced from Nukemarine's kanken+heisig googledoc. The "Cumulative%" field shows how much of the Japanese Wikipedia is covered up to that kanji, so if you e.g. want a list of all kanji required for 75% coverage, just copy from the beginning of the list until Cumulative% = 75%.

You can freely use the data available in the Google spreadsheet. After all, the content is from Wikipedia. I just applied a few hours of my time and a bit of Python code to it. :-)

I also have the full kanji compound statistics for the whole of Wikipedia, with the caveats pointed out by Jarvik7 (only pure-kanji compounds; compounds that include kana are not handled properly). So if people find that useful/interesting, I can dig out some trivia from that too.

Reply #19 - 2009 June 05, 10:17 am
ファブリス Administrator
From: Belgium Registered: 2006-06-14 Posts: 4044 Website

Great work, thanks for sharing, sammi ;)

If this file is auto-generated, it's probably best to make it read-only. I'm not sure if it is, or how that works, as I haven't published anything on Google Docs yet. I noticed it shows other users viewing in real time; if it allows editing, errors could be introduced by mistake.

Reply #20 - 2009 June 05, 12:12 pm
nac_est Member
From: Italy Registered: 2006-12-12 Posts: 617 Website

Thank you!
If I were to add one more caveat (besides Jarvik's): the "uninterrupted strings of kanji that don't contain kana" are not all going to be real compounds. On many occasions, one word ending with a kanji is immediately followed by another word starting with a kanji. In these cases your algorithm will list the two as a single compound.

Example: この買収は今日両社から発表されました (taken randomly from ALC)
Here 今日両社 would be considered a compound, when it's actually two words.

That's just to say that the compound info has to be used with care. :)

Reply #21 - 2009 June 05, 12:22 pm
shang Member
From: Finland Registered: 2009-04-09 Posts: 57

nac_est wrote:

Thank you!
If I were to add one more caveat (besides Jarvik's): the "uninterrupted strings of kanji that don't contain kana" are not all going to be real compounds. On many occasions, one word ending with a kanji is immediately followed by another word starting with a kanji. In these cases your algorithm will list the two as a single compound.

Example: この買収は今日両社から発表されました (taken randomly from ALC)
Here 今日両社 would be considered a compound, when it's actually two words.

That's just to say that the compound info has to be used with care. :)

That's a good point I hadn't thought of. Thanks! I was counting on the fact that, most of the time, such strings of kanji would either be a set phrase or there would be a hiragana particle between them, but I didn't consider that those pesky time-related words don't need a particle.

As Jarvik said, the only way to get perfect data is to compare the strings against a dictionary, but I still think the current data is somewhat useful, at least for shorter compounds. One option I've been considering is to download the JMDict/EDICT data. That should be a pretty exhaustive set of words, in a format that's relatively easy to use.
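With a word list like that, even a greedy longest-match pass over the all-kanji runs should catch most of the run-together cases nac_est mentioned. A quick sketch (words here stands for the set of headwords parsed out of EDICT):

```python
def split_run(run, words, max_len=6):
    # Greedily peel the longest dictionary word off the front of the run;
    # single characters not in the dictionary pass through as-is.
    parts, i = [], 0
    while i < len(run):
        for j in range(min(len(run), i + max_len), i, -1):
            if run[i:j] in words or j == i + 1:
                parts.append(run[i:j])
                i = j
                break
    return parts

# e.g. split_run('今日両社', words) -> ['今日', '両社'] if both are headwords
```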

Reply #22 - 2009 June 05, 12:52 pm
bodhisamaya Guest

I am so jealous of all the tricks you computer guys can do 8)

Reply #23 - 2009 June 05, 12:57 pm
nac_est Member
From: Italy Registered: 2006-12-12 Posts: 617 Website

The data IS useful. I find it interesting, for instance, that most of the least frequent kanji are contained in RTK3.
Perhaps all the non-RTK, middle-ranking characters are just 人名用 (name-use kanji)...

Reply #24 - 2009 June 05, 7:32 pm
Katsuo M.O.D.
From: Tokyo Registered: 2007-02-06 Posts: 894 Website

shang wrote:

Google Spreadsheet

I included the most common kanji up to rank 3500, and after that just the ones that have a Heisig keyword.

This is interesting. Is it possible to see the complete list? (I'd like to check out some kanji that aren't on the spreadsheet.)

Reply #25 - 2009 June 05, 7:40 pm
activeaero Member
From: Mobile-AL Registered: 2008-08-15 Posts: 500

nac_est wrote:

Thank you!
If I were to add one more caveat (besides Jarvik's): the "uninterrupted strings of kanji that don't contain kana" are not all going to be real compounds. On many occasions, one word ending with a kanji is immediately followed by another word starting with a kanji. In these cases your algorithm will list the two as a single compound.

Example: この買収は今日両社から発表されました (taken randomly from ALC)
Here 今日両社 would be considered a compound, when it's actually two words.

That's just to say that the compound info has to be used with care. :)

You know, I was thinking about this, and I wonder if it really matters that much.

If two kanji compounds appear together often enough to rank high on a frequency list of "single" compounds, then that is just as meaningful, IMO at least.

What would be really interesting is if someone could figure out a way to rank the most common phrases. Could you run a script that searches between punctuation marks and treats each batch of text between them as a single piece?

I mean, since most of us are all about the sentence method, think how neat it would be to have a list of the actual most common complete sentences/phrases.
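Even something as simple as splitting on the Japanese punctuation might do for a first pass, e.g.:

```python
import re
from collections import Counter

SPLIT = re.compile(r'[。、！？!?]+')  # sentence- and clause-level punctuation

def phrase_counts(texts):
    counts = Counter()
    for text in texts:
        for chunk in SPLIT.split(text):
            chunk = chunk.strip()
            if chunk:
                counts[chunk] += 1
    return counts
```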

Last edited by activeaero (2009 June 05, 7:48 pm)