kanji koohii FORUM
Kanji frequency analysis in Wikipedia - Printable Version




Kanji frequency analysis in Wikipedia - FooSoft - 2010-02-20

I thought some of you might be interested in this. I wrote some scripts to run character analysis on roughly 20 gigabytes of the downloadable version of the Japanese Wikipedia and generated some cool reports :)

http://foosoft.net/japanese/kanji_frequency/

If you just want to view the largest report, it's here (it's a fairly large file, though; the link above has some smaller versions of it):

http://foosoft.net/japanese/kanji_frequency/downloads/report1.html

Update: and here is one from an assortment of novels:
Report table: http://foosoft.net/japanese/kanji_frequency/downloads/novel/report.html
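
For readers who want to try this kind of character analysis themselves, here is a minimal sketch in Python. It is not FooSoft's actual script (the thread doesn't show it); the function names are made up for illustration, and it assumes that "kanji" means the basic CJK Unified Ideographs block only.

```python
from collections import Counter

def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
    # counts as kanji; extension blocks and compatibility forms are ignored.
    return "\u4e00" <= ch <= "\u9fff"

def kanji_frequencies(text):
    # Count every kanji occurrence; kana, punctuation and Latin text are skipped.
    return Counter(ch for ch in text if is_kanji(ch))

sample = "日本語の漢字と日本の文化"
freq = kanji_frequencies(sample)
for kanji, count in freq.most_common(3):
    print(kanji, count)
```

For a multi-gigabyte dump you would feed the text in chunks and add the resulting counters together rather than load everything at once.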


Kanji frequency analysis in Wikipedia - nest0r - 2010-02-20

Is this different from our 3-page thread here? Kanji Frequency in Wikipedia

With additions from Katsuo: http://forum.koohii.com/showthread.php?pid=55631#pid55631

Not for my sake but for others, I hope someone soon runs an analysis on all those .txt files of novels I didn't post here: http://forum.koohii.com/showthread.php?tid=5048

Nice jisho.org links. I'll leave all this fancy analysis stuff to you eggheads. ;p


Kanji frequency analysis in Wikipedia - FooSoft - 2010-02-20

Ack! I was beaten I see. Oh well :p

Yeah, I can totally run an analysis on those novels. I actually wanted to do something outside of Wikipedia too!


Kanji frequency analysis in Wikipedia - nest0r - 2010-02-20

I never actually looked at the previously posted lists, but now that I look at yours and theirs, a layperson like me really enjoys your presentation, so there's my expert opinion. ^_^


Kanji frequency analysis in Wikipedia - wccrawford - 2010-02-20

I would totally love to see kanji analysis on the files that nest0r didn't post. In particular, I'd like to know which ones stick to easier kanji, fewer kanji, and other things like that. Knowing where they sit by grade and JLPT level would be really nice.
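
A per-text profile like the one asked for here could be sketched as below. The `grade_of` mapping is a hypothetical input (kanji to school grade or JLPT level) that you would have to supply from a real list; the three entries in the example are only a toy stand-in.

```python
from collections import Counter

def kanji_profile(text, grade_of):
    # Count kanji occurrences (basic CJK block only, an assumption).
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    # Sum occurrences per grade; kanji missing from the mapping go to "other".
    by_grade = Counter()
    for kanji, n in counts.items():
        by_grade[grade_of.get(kanji, "other")] += n
    shares = {g: n / total for g, n in by_grade.items()} if total else {}
    return {"distinct": len(counts), "total": total, "shares": shares}

toy_grades = {"日": 1, "本": 1, "語": 2}  # toy mapping, not a real grade list
profile = kanji_profile("日本語の憂鬱", toy_grades)
print(profile)
```

A text with a low "distinct" count and a high share in the early grades would be one that "sticks to easier kanji".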


Kanji frequency analysis in Wikipedia - Spines11 - 2010-02-20

Also, I made this tool a while back to figure out your "Kanji coverage Percent" - http://answercowapp.appspot.com/app/kanjicoverage/index

It is based off the Kanji frequencies in Wikipedia.
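
How the tool computes its number isn't described here, but one plausible reading of "coverage percent" is the share of all kanji occurrences in the corpus that belong to kanji you already know. A sketch under that assumption, with made-up counts:

```python
def coverage_percent(known, freq):
    # freq maps kanji -> occurrence count in the corpus.
    total = sum(freq.values())
    if total == 0:
        return 0.0
    covered = sum(n for kanji, n in freq.items() if kanji in known)
    return 100.0 * covered / total

# Toy counts for illustration only, not real Wikipedia figures.
toy_freq = {"日": 50, "本": 30, "語": 15, "鬱": 5}
print(coverage_percent({"日", "本"}, toy_freq))  # 80.0
```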


Kanji frequency analysis in Wikipedia - yudantaiteki - 2010-02-20

Someone should probably write a more "intelligent" script to mine Wikipedia that strips out the things on every page like 編集 and 利用; I don't remember if the other list did that or not.


Kanji frequency analysis in Wikipedia - FooSoft - 2010-02-20

I actually strip out most of the boilerplate: Wikipedia conveniently puts tags around the "content", and that's what I grab. There might be a couple of minor things like that left to filter, but the vast majority of the stuff you see in navigation bars and the like is gone.
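
To illustrate the idea of grabbing only what sits inside the content tags, here is a sketch using Python's standard `html.parser`. The post doesn't say which tag or attribute marks the content, so the `id="content"` container below is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class ContentExtractor(HTMLParser):
    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0    # nesting depth inside the content element; 0 = outside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside the content element.
        if self.depth > 0:
            self.parts.append(data)

def extract_content(html, content_id="content"):
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

page = ('<div id="nav">編集 利用</div>'
        '<div id="content"><p>漢字の<b>頻度</b></p></div>'
        '<div id="footer">編集</div>')
print(extract_content(page))  # 漢字の頻度
```

The depth counter assumes reasonably well-formed markup; badly nested pages would need a more forgiving parser.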

I have also just finished running my scripts on the novels that nest0r did not post. I just needed to make a small tweak to handle Shift-JIS encoding. I believe it was a couple of gigabytes, so the sample size is good. Here are the results:

Report table: http://foosoft.net/japanese/kanji_frequency/downloads/novel/report.html
Raw frequency CSV: http://foosoft.net/japanese/kanji_frequency/downloads/novel/raw
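
The Shift-JIS tweak itself isn't shown, but one common way to handle a mix of encodings is to try each in turn. A sketch, assuming the files are either UTF-8 or Shift-JIS (the function name is hypothetical):

```python
from pathlib import Path

def read_text_guess(path, encodings=("utf-8", "shift_jis")):
    # Try each encoding in order and return the first clean decode.
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the last encoding and
    # replace undecodable bytes instead of failing.
    return data.decode(encodings[-1], errors="replace")
```

UTF-8 goes first because Shift-JIS text containing kana or kanji is almost never valid UTF-8, so a wrong guess fails loudly. Files produced on Windows may need "cp932" in place of "shift_jis".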


Kanji frequency analysis in Wikipedia - ruiner - 2010-02-21

That was fast, FooSoft! There is some 'boilerplate' formatting across most of those files as well; I hope you considered that (it's mostly on the first page or so, I think). Hmm, now how to organize and apply this information?


Kanji frequency analysis in Wikipedia - nest0r - 2010-04-12

By the way, we discussed this one before, but here are a couple of word frequency lists from novels: http://pomax.nihongoresources.com/index.php?entry=1222520260 - I never used them because I didn't know what texts they came from.

And also here's a list taken from Goo blogs:

"Another frequency list was created in 2008 by Hiroshi Utsumi using the blog pages on the Goo site in Japan. His explanation is here.

*edict-freq-20081002.tar.gz (3952168 bytes) archive containing the marked-up EDICT file, plus the Ruby code used to generate the markup. "


Kanji frequency analysis in Wikipedia - clemente - 2010-04-17

Cool thread. I am posting here now hoping that some of you programming-minded people could help me with this:
http://forum.koohii.com/showthread.php?pid=90037#pid90037
The task: check each text in an archive against all the others to find identical sentences (in Japanese).
Cheers
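
One way to approach the task described above, sketched in Python: instead of comparing every pair of texts, index every sentence once and keep those that turn up in more than one text. The sentence splitting on 。！？ and newlines is a crude assumption, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

def split_sentences(text):
    # A run of non-terminator characters plus an optional 。！？ at the end.
    parts = re.findall(r"[^。！？\n]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]

def shared_sentences(texts):
    # texts: dict of name -> full text.
    # Returns sentence -> sorted names, for sentences found in 2+ texts.
    seen = defaultdict(set)
    for name, text in texts.items():
        for sentence in split_sentences(text):
            seen[sentence].add(name)
    return {s: sorted(names) for s, names in seen.items() if len(names) > 1}

texts = {
    "a": "今日は晴れ。明日は雨。",
    "b": "明日は雨。昨日は雪。",
}
print(shared_sentences(texts))
```

This only finds exact matches; sentences that differ by a space or a punctuation variant would need normalizing first.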


Kanji frequency analysis in Wikipedia - killeralgae - 2010-04-17

Would there be any way to run a search of, say, the German wiki to find the most common words, for SRS? Or, say, to compare the entire wiki against a word list of 5,000 common words and determine which wiki entries use most of them?


Kanji frequency analysis in Wikipedia - nest0r - 2010-04-17

I like the direction this is going. I think there are bound to be easy ways to do the above as a less complex version of this: http://forum.koohii.com/showthread.php?tid=3792 - To help narrow down useful swaths of text within books/stories/etc. (I'm big on going through swaths of text, spans of episodes, etc., especially with regard to using Transcriber to make interactive audiobooks and sampling many different episodes for subs2srs decks.) -- The paper seems to indicate domains of 1000-3000 words for books. It's easier to read than the abstract might indicate, though the math is way over my head.


Kanji frequency analysis in Wikipedia - FooSoft - 2010-04-18

killeralgae Wrote: Would there be any way to run a search of, say, the German wiki to find the most common words, for SRS? Or, say, to compare the entire wiki against a word list of 5,000 common words and determine which wiki entries use most of them?
Of course. It would take a little more work, since with 漢字 you can take some shortcuts (for example, you know they won't show up in HTML tags, so you don't have to handle those), but yeah, it would be cool to run this kind of analysis on other language versions of Wikipedia.
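
A rough sketch of the second idea, ranking entries by how much of their text comes from a common-word list. Unlike the kanji case, the tags have to be stripped first, because tag names are made of letters too. The regex-based tag removal is a simplification, and the function names and the two sample entries are invented for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")       # crude tag stripper, fine for a sketch
WORD_RE = re.compile(r"[^\W\d_]+")    # runs of letters, including ä, ö, ü, ß

def words(html):
    # Strip tags first so tag names are not counted as words.
    return [w.lower() for w in WORD_RE.findall(TAG_RE.sub(" ", html))]

def rank_by_coverage(articles, common_words):
    # articles: dict of title -> HTML. Returns (share, title), best first.
    scores = []
    for title, html in articles.items():
        ws = words(html)
        share = sum(w in common_words for w in ws) / len(ws) if ws else 0.0
        scores.append((share, title))
    return sorted(scores, reverse=True)

articles = {
    "A": "<p>Der Hund und die Katze</p>",
    "B": "<p>Quantenchromodynamik und Eichtheorie</p>",
}
common = {"der", "die", "und", "hund", "katze"}
ranked = rank_by_coverage(articles, common)
print(ranked)
```

Matching on surface forms like this ignores inflection, which matters a lot for German; a real run would want a lemmatizer or a word list that includes inflected forms.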


Kanji frequency analysis in Wikipedia - nest0r - 2011-01-22

FooSoft Wrote: I actually strip out most of the boilerplate: Wikipedia conveniently puts tags around the "content", and that's what I grab. There might be a couple of minor things like that left to filter, but the vast majority of the stuff you see in navigation bars and the like is gone.

I have also just finished running my scripts on the novels that nest0r did not post. I just needed to make a small tweak to handle Shift-JIS encoding. I believe it was a couple of gigabytes, so the sample size is good. Here are the results:

Report table: http://foosoft.net/japanese/kanji_frequency/downloads/novel/report.html
Raw frequency CSV: http://foosoft.net/japanese/kanji_frequency/downloads/novel/raw
Okay FooSoft. Make us another list. Words this time. Chop-chop!


Kanji frequency analysis in Wikipedia - FooSoft - 2011-01-22

Words are harder because you need to be able to figure out where one ends and another begins. Having no spaces can be annoying like that :(

Edit: I'm sure it's possible; I'll just have to think about how to do it first.


Kanji frequency analysis in Wikipedia - Jarvik7 - 2011-01-23

MeCab etc. could handle that with a reasonable margin of error.
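
As a sketch of what that could look like: the counting function below takes any tokenizer, and `mecab_tokenize` shows one way to plug in MeCab through the third-party mecab-python3 package. That package and an installed dictionary are assumptions here; the function names are made up for illustration.

```python
from collections import Counter

def word_frequencies(text, tokenize):
    # tokenize: any callable that turns a string into a list of tokens.
    return Counter(tok for tok in tokenize(text) if tok.strip())

def mecab_tokenize(text):
    # Assumption: the mecab-python3 package and a dictionary are installed.
    # "-Owakati" makes MeCab return the tokens separated by spaces.
    import MeCab
    tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).split()

# With pre-segmented text, plain str.split works as a stand-in tokenizer:
print(word_frequencies("猫 が 猫 を 見る", str.split))
```

With MeCab available, the call would be `word_frequencies(text, mecab_tokenize)`. For a large corpus you would create the tagger once and reuse it rather than build one per call.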


Kanji frequency analysis in Wikipedia - Womacks23 - 2011-01-23

When running a kanji frequency analysis on Wikipedia, wouldn't the thousands of articles on all things Chinese throw off the results with a whole lot of kanji that aren't really used in Japanese?


Kanji frequency analysis in Wikipedia - mezbup - 2011-01-23

Someone should run this stuff on dorama scripts to get a feel for the frequency of spoken words.


Kanji frequency analysis in Wikipedia - Jarvik7 - 2011-01-23

Womacks23 Wrote: When running a kanji frequency analysis on Wikipedia, wouldn't the thousands of articles on all things Chinese throw off the results with a whole lot of kanji that aren't really used in Japanese?
They would be greatly outnumbered by all of the kanji used normally in article text and end up as just noise. Remember that this is an exercise in statistics.


Kanji frequency analysis in Wikipedia - FooSoft - 2011-01-23

Ah, MeCab is a good idea. I will try this after I get a couple more features into Yomichan :D


Kanji frequency analysis in Wikipedia - Yufina - 2011-02-21

How about adding a list of (all) 常用2009 kanji? It would be nice to see which ones are missing.
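
Given a frequency table and a reference list, finding the missing ones is a small set operation. A sketch, where `reference` stands in for the real 常用 list (which you would have to load from somewhere) and the counts are invented:

```python
def missing_kanji(reference, freq):
    # Kanji on the reference list that never occur in the counted corpus.
    return sorted(k for k in reference if freq.get(k, 0) == 0)

toy_reference = set("日本語鬱")        # toy stand-in, not the real list
toy_counts = {"日": 12, "本": 7, "語": 3}
print(missing_kanji(toy_reference, toy_counts))  # ['鬱']
```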