Joined: Feb 2009
Posts: 532
Thanks: 9
Ack! I was beaten I see. Oh well :p
Yeah, I can totally run an analysis on those novels. I actually wanted to do something outside of Wikipedia too!
Joined: Oct 2007
Posts: 4,582
Thanks: 0
I never actually looked at the previously posted lists, but now that I look at yours and theirs, a layperson like me really enjoys your presentation, so there's my expert opinion. ^_^
Joined: Mar 2008
Posts: 1,533
Thanks: 0
I would totally love to see kanji analysis on the files that nest0r didn't post. Especially, I'd like to know which ones stick to easier kanji, fewer kanji, and other stuff like that. Knowing where they sit by grade and JLPT level would be really nice.
Joined: Oct 2009
Posts: 3,944
Thanks: 11
Someone should probably write a more "intelligent" script to mine Wikipedia that strips out the things on every page like 編集 and 利用; I don't remember if the other list did that or not.
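A rough sketch of what such a script might look like, assuming the page text has already been extracted into plain strings. The `BOILERPLATE` tuple is hypothetical: only 編集 and 利用 come from the post above, and a real list would have to be built by looking at actual pages. Note that a blunt string replace like this also drops legitimate uses of those words inside article text, so it is only a starting point.

```python
from collections import Counter

# Hypothetical list of interface strings to drop. Only 編集 and 利用 are
# named in the thread; extend it after inspecting real pages.
BOILERPLATE = ("編集", "利用")


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_counts(pages):
    """Count kanji across page texts after removing boilerplate strings."""
    counts = Counter()
    for text in pages:
        for phrase in BOILERPLATE:
            text = text.replace(phrase, "")
        counts.update(ch for ch in text if is_kanji(ch))
    return counts
```

Calling `kanji_counts(pages).most_common(50)` would then give the 50 most frequent kanji with the listed strings excluded.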
Joined: Aug 2009
Posts: 710
Thanks: 0
That was fast, FooSoft! There is some 'boilerplate' formatting across most of those files as well, hope you considered that (mostly on the first page or so, I think). Hmm, now how to organize and apply this information?
Joined: May 2009
Posts: 14
Thanks: 0
Would there be any way to run a search of, say, the German wiki to find the most common words, for SRS? Or, say, compare the entire wiki against a word list of 5,000 common words and determine which wiki entries use the most of them?
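Both questions could be sketched roughly as below, assuming the articles are already available as a dict of title to plain text. The function names and the naive regex tokenizer are illustrative assumptions; real German text would probably want a proper tokenizer or lemmatizer so that inflected forms are counted together.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive split into lowercase runs of letters (assumption: no lemmatization).
    return re.findall(r"[^\W\d_]+", text.lower())


def most_common_words(articles, n):
    """Return the n most frequent words across all article texts."""
    counts = Counter()
    for text in articles.values():
        counts.update(tokenize(text))
    return counts.most_common(n)


def rank_by_coverage(articles, common_words):
    """Rank article titles by how many distinct common words each one uses."""
    common = {w.lower() for w in common_words}
    scores = []
    for title, text in articles.items():
        used = set(tokenize(text)) & common
        scores.append((title, len(used)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Counting distinct common words favors long articles, so dividing by article length might be a fairer score depending on what you want out of it.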
Joined: Mar 2007
Posts: 3,851
Thanks: 0
MeCab etc. could handle that with a reasonable margin of error.
Joined: Jan 2008
Posts: 591
Thanks: 0
When running a kanji frequency analysis on Wikipedia, wouldn't the thousands of articles on all things Chinese kind of throw off the results with a whole lot of kanji not really recognized in Japanese?
Joined: Sep 2008
Posts: 1,674
Thanks: 1
Someone should run this stuff on dorama scripts to get a feel for the frequency of spoken words.
Joined: Jun 2007
Posts: 49
Thanks: 0
How about adding a list of (all) the 常用2009 kanji? It would be nice to see which ones are missing.
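Finding the missing ones is basically a set difference. A minimal sketch, assuming you supply the reference kanji yourself (as a string or list) and already have the counted kanji from a frequency run; `missing_kanji` is a made-up helper name, not something from the posted tools.

```python
def missing_kanji(reference_list, counted):
    """Return reference kanji that never appear in the counted data.

    reference_list: iterable of kanji, e.g. a string holding the whole list.
    counted: any iterable or mapping whose items/keys are the kanji seen.
    """
    seen = set(counted)
    # Keep the reference order so the output matches the source list.
    return [k for k in reference_list if k not in seen]
```

Passing the reference list as one long string works because iterating over a Python string yields one character at a time.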