Kanji frequency analysis in Wikipedia

FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

I thought some of you guys might be interested in this. I wrote some scripts to run character analysis on roughly 20 gigabytes of the downloadable version of the Japanese Wikipedia and generated some cool reports :)

http://foosoft.net/japanese/kanji_frequency/

If you just want to view the largest report, it's here (it's a fairly large file, though; the link above has some smaller versions of it):

http://foosoft.net/japanese/kanji_frequ … port1.html

Update: and here is one from an assortment of novels:
Report table: http://foosoft.net/japanese/kanji_frequ … eport.html

Last edited by FooSoft (2010 February 21, 1:07 am)

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Is this different from our 3-page thread here? Kanji Frequency in Wikipedia

With additions from Katsuo: http://forum.koohii.com/viewtopic.php?pid=57575#p57575

Not for my sake but for others, I hope someone soon runs an analysis on all those .txt files of novels I didn't post here: http://forum.koohii.com/viewtopic.php?id=5252

Nice jisho.org links. I'll leave all this fancy analysis stuff to you eggheads. ;p

Last edited by nest0r (2010 February 20, 5:40 pm)

FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

Ack! I was beaten I see. Oh well :p

Yeah, I can totally run an analysis on those novels. I actually wanted to do something outside of Wikipedia too!

nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

I never actually looked at the previously posted lists, but now that I look at yours and theirs, a layperson like me really enjoys your presentation, so there's my expert opinion. ^_^

wccrawford
Member
From: FL US
Registered: 2008-03-28
Posts: 1548

I would totally love to see kanji analysis on the files that nest0r didn't post. In particular, I'd like to know which ones stick to easier kanji, fewer kanji, and other stuff like that. Knowing where they sit by grade and JLPT level would be really nice.

Spines11
Member
From: California
Registered: 2009-11-04
Posts: 39

Also, I made this tool a while back to figure out your "Kanji coverage Percent" - http://answercowapp.appspot.com/app/kanjicoverage/index

It is based off the Kanji frequencies in Wikipedia.
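
For anyone curious how a "coverage percent" like this can be computed, here is a minimal sketch. It is not the code behind the tool above; the function name and the tiny frequency table are made up for illustration. The idea is simply: of all kanji occurrences in the corpus, what share belongs to kanji you already know?

```python
def coverage_percent(freq, known):
    """Share (in percent) of all kanji occurrences that fall on known kanji.

    freq  -- mapping of kanji to occurrence count (hypothetical sample data below)
    known -- set of kanji the learner already knows
    """
    total = sum(freq.values())
    if total == 0:
        return 0.0
    covered = sum(n for kanji, n in freq.items() if kanji in known)
    return 100.0 * covered / total


# Made-up counts, not taken from any of the reports in this thread.
freq = {"日": 50, "本": 30, "語": 15, "鬱": 5}
print(coverage_percent(freq, {"日", "本"}))  # 80.0
```

Because frequent kanji dominate the counts, a fairly small known set can already cover a large share of running text.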

yudantaiteki
Member
From: 東京
Registered: 2009-10-03
Posts: 3008

Someone should probably write a more "intelligent" script to mine Wikipedia that strips out the things on every page like 編集 and 利用; I don't remember if the other list did that or not.

FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

I actually strip out most of the boilerplate code. Wikipedia conveniently puts tags around the "content", and that's what I grab. There might be a couple of minor things like that left to filter, but the vast majority of the stuff you see in navigation bars and the like is gone.
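
A rough sketch of that approach, for readers who want to try it. This is not the actual script: the content markers below are an assumption (the thread does not say which tags were used), and "kanji" is approximated as the CJK Unified Ideographs block.

```python
import re
from collections import Counter

# Assumption: the article body sits between a marker pair like this one.
# The tags the real scripts keyed on are not given in the thread.
CONTENT_RE = re.compile(r"<!-- start content -->(.*?)<!-- end content -->", re.DOTALL)

# CJK Unified Ideographs block. Kana, Latin text and HTML markup fall outside it,
# so no tag handling is needed once the content region is isolated.
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")


def count_kanji(html, counts=None):
    """Add the kanji found inside the content region of one page to counts."""
    counts = Counter() if counts is None else counts
    for body in CONTENT_RE.findall(html):
        counts.update(KANJI_RE.findall(body))
    return counts


# 編集 sits in the navigation part here, so it is not counted.
page = "<div>編集</div><!-- start content --><p>日本語の本</p><!-- end content -->"
print(count_kanji(page).most_common())
```

Passing the same `Counter` back in for every page accumulates totals over a whole dump.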

I have also just finished running my scripts on the novels that nest0r did not post. I just needed to make a small tweak to handle Shift-JIS encoding. I believe it was a couple of gigs, so the sample size is good. Here are the results:

Report table: http://foosoft.net/japanese/kanji_frequ … eport.html
Raw frequency CSV: http://foosoft.net/japanese/kanji_frequ … /novel/raw
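
The Shift-JIS tweak mentioned above might look roughly like the sketch below. Everything here is an assumption rather than the real script: the function names, the `.txt` layout, and the choice of Python's `cp932` codec (Microsoft's superset of Shift-JIS), with undecodable bytes replaced instead of aborting the run.

```python
import csv
import re
from collections import Counter
from pathlib import Path

KANJI_RE = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block


def count_kanji_in_dir(root, encoding="cp932"):
    """Count kanji over every .txt file under root, decoding as Shift-JIS (cp932)."""
    counts = Counter()
    for path in Path(root).rglob("*.txt"):
        # errors="replace" turns bad bytes into U+FFFD, which the regex ignores.
        text = path.read_bytes().decode(encoding, errors="replace")
        counts.update(KANJI_RE.findall(text))
    return counts


def write_csv(counts, out_path):
    """Write one 'kanji,count' row per kanji, most frequent first, as UTF-8."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for kanji, n in counts.most_common():
            writer.writerow([kanji, n])
```

Usage would be something like `write_csv(count_kanji_in_dir("novels"), "raw.csv")`, where both paths are placeholders.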

Last edited by FooSoft (2010 February 20, 8:16 pm)

ruiner
Member
Registered: 2009-08-20
Posts: 751

That was fast, FooSoft! There is some 'boilerplate' formatting across most of those files as well; I hope you considered that (it's mostly on the first page or so, I think). Hmm, now how to organize and apply this information?

Reply #10 - 2010 April 12, 4:30 pm
nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

By the way, we discussed this one before, but here are a couple of word frequency lists for novels: http://pomax.nihongoresources.com/index … 1222520260 - I never used them because I didn't know what texts they came from.

And also here's a list taken from Goo blogs:

"Another frequency list was created in 2008 by Hiroshi Utsumi using the blog pages on the Goo site in Japan. His explanation is here.

    *edict-freq-20081002.tar.gz (3952168 bytes) archive containing the marked-up EDICT file, plus the Ruby code used to generate the markup. "

Last edited by nest0r (2010 April 12, 4:33 pm)

Reply #11 - 2010 April 17, 8:28 pm
clemente
Member
From: venexia
Registered: 2008-11-06
Posts: 22

Cool thread. I am posting here now hoping that some of you programming-minded people could help me with this:
http://forum.koohii.com/viewtopic.php?p … 60#p100960
The idea is to check each text in an archive against all the others to find sentences they share (in Japanese).
Cheers
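
One possible way to approach the request above, as a sketch only: split each text into sentences on Japanese sentence-ending punctuation, then index every sentence by the texts it appears in. The function names and the two sample texts are made up, and the splitting rule is deliberately simple (it ignores quotes, brackets and the like).

```python
import re
from collections import defaultdict

# A sentence is a run of non-terminator characters plus an optional 。！？ at the end.
SENTENCE_RE = re.compile(r"[^。！？\n]+[。！？]?")


def split_sentences(text):
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def shared_sentences(texts):
    """texts maps a name to its text; returns sentences found in two or more texts."""
    seen = defaultdict(set)
    for name, text in texts.items():
        for sentence in split_sentences(text):
            seen[sentence].add(name)
    return {s: sorted(names) for s, names in seen.items() if len(names) > 1}


# Hypothetical sample data.
texts = {
    "a.txt": "猫が好きです。犬も好きです。",
    "b.txt": "鳥を見た。猫が好きです。",
}
print(shared_sentences(texts))  # {'猫が好きです。': ['a.txt', 'b.txt']}
```

Using a dictionary keyed by sentence avoids comparing every text against every other one directly, so it scales with the total number of sentences.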

Reply #12 - 2010 April 17, 8:56 pm
killeralgae
Member
From: New York
Registered: 2009-05-27
Posts: 15
Website

Would there be any way to run a search of, say, the German wiki to find the most common words, for SRS? Or, say, to compare the entire wiki against a list of 5,000 common words and determine which wiki entries use most of them?

Reply #13 - 2010 April 17, 8:59 pm
nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

I like the direction this is going. I think there are bound to be easy ways to do the above as a less complex version of this: http://forum.koohii.com/viewtopic.php?id=3955 - to help narrow down useful swaths of text within books/stories/etc. (I'm big on going through swaths of text, spans of episodes, etc., especially with regard to using Transcriber to make interactive audiobooks and sampling many different episodes for subs2srs decks.) The paper seems to indicate domains of 1000-3000 words for books. It's easier to read than the abstract might indicate, though the math is way over my head.

Last edited by nest0r (2010 April 17, 9:03 pm)

Reply #14 - 2010 April 18, 4:01 am
FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

killeralgae wrote:

Would there be any way to run a search of, say, the German wiki to find the most common words, for SRS? Or, say, to compare the entire wiki against a list of 5,000 common words and determine which wiki entries use most of them?

Of course. It would take a little more work, since with 漢字 you can take some shortcuts (for example, you know they won't show up in HTML tags, so you don't have to handle those), but yeah, it would be cool to run this kind of analysis on different versions of Wikipedia.
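
To illustrate the extra work for a language like German, here is a hedged sketch: tags have to be removed first (a crude regex here, not a real HTML parser), then words are taken as runs of letters. The second function scores one article against a common-word list, as asked in the quoted question. All names and the sample page are hypothetical.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")     # crude tag stripper; a real parser is more robust
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters, so ä, ö, ü and ß are kept


def words(html):
    """Lower-cased words of a page, with markup removed first."""
    return [w.lower() for w in WORD_RE.findall(TAG_RE.sub(" ", html))]


def word_counts(pages):
    """Word frequencies over an iterable of pages."""
    counts = Counter()
    for html in pages:
        counts.update(words(html))
    return counts


def list_coverage(html, common_words):
    """Fraction of a page's words that appear in common_words."""
    tokens = words(html)
    if not tokens:
        return 0.0
    return sum(t in common_words for t in tokens) / len(tokens)


page = '<p class="intro">Der Hund und die Katze</p>'
print(word_counts([page]).most_common(3))
print(list_coverage(page, {"der", "die", "und"}))  # 0.6
```

Without the tag-stripping step, attribute text such as `class` and `intro` would be counted as words, which is exactly the problem that does not arise when counting kanji.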

Reply #15 - 2011 January 22, 9:12 pm
nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

FooSoft wrote:

I actually strip out most of the boilerplate code. Wikipedia conveniently puts tags around the "content", and that's what I grab. There might be a couple of minor things like that left to filter, but the vast majority of the stuff you see in navigation bars and the like is gone.

I have also just finished running my scripts on the novels that nest0r did not post. I just needed to make a small tweak to handle Shift-JIS encoding. I believe it was a couple of gigs, so the sample size is good. Here are the results:

Report table: http://foosoft.net/japanese/kanji_frequ … eport.html
Raw frequency CSV: http://foosoft.net/japanese/kanji_frequ … /novel/raw

Okay FooSoft. Make us another list. Words this time. Chop-chop!

Reply #16 - 2011 January 22, 9:57 pm
FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

Words are harder because you need to be able to figure out where one ends and another begins. Having no spaces can be annoying like that :(

Edit: I'm sure it's possible; I'll just have to think about how to do it first.

Last edited by FooSoft (2011 January 22, 10:05 pm)

Jarvik7
Member
From: 名古屋
Registered: 2007-03-05
Posts: 3940

MeCab etc. could handle that with a reasonable margin of error.
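
To show what "figuring out where one word ends" involves, here is a toy sketch of greedy longest-match segmentation against a word list. To be clear, this is not how MeCab works (MeCab is a proper morphological analyzer and far more accurate); the function name and the five-word dictionary are made up purely for illustration.

```python
def segment(text, dictionary, max_len=8):
    """Split text by repeatedly taking the longest dictionary word at the cursor.

    Characters not covered by any dictionary word come out as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical mini dictionary.
dictionary = {"日本語", "日本", "勉強", "する", "を"}
print(segment("日本語を勉強する", dictionary))  # ['日本語', 'を', '勉強', 'する']
```

Greedy matching picks 日本語 over 日本 here, which is right, but it makes mistakes whenever the longest match is not the intended word. That is where the margin of error of real tools comes from, and why they use statistical models on top of a dictionary.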

Reply #18 - 2011 January 23, 2:00 am
Womacks23
Member
From: 死国
Registered: 2008-01-10
Posts: 576

When running a kanji frequency analysis on Wikipedia, wouldn't the thousands of articles on all things Chinese kind of throw off the results with a whole lot of kanji that aren't really used in Japanese?

Reply #19 - 2011 January 23, 2:19 am
mezbup
Member
From: sausage lip
Registered: 2008-09-18
Posts: 1679
Website

Someone should run this stuff on dorama scripts to get a feel for the frequency of spoken words.

Reply #20 - 2011 January 23, 2:25 am
Jarvik7
Member
From: 名古屋
Registered: 2007-03-05
Posts: 3940

Womacks23 wrote:

When running a kanji frequency analysis on Wikipedia, wouldn't the thousands of articles on all things Chinese kind of throw off the results with a whole lot of kanji that aren't really used in Japanese?

They would be greatly outnumbered by all of the kanji used normally in article text and end up as just noise. Remember that this is an exercise in statistics.

Reply #21 - 2011 January 23, 9:49 am
FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 498
Website

Ah, MeCab is a good idea. I will try this after I get a couple more features into Yomichan :D

Yufina
Member
From: Finland
Registered: 2007-06-18
Posts: 49

How about adding a list of all the 常用 (2009) kanji? It would be nice to see which ones are missing.
