
Kanji Frequency in Wikipedia

#26
"collocations"
Edited: 2009-06-05, 8:34 pm
Reply
#27
Katsuo Wrote:This is interesting. Is it possible to see the complete list? (I'd like to check out some kanji that aren't on the spreadsheet.)
Sure. I've put the full raw list up at http://shang.pp.fi/public/kanji/jawp-kanji-full.csv (utf-8 encoding).
Reply
#28
activeaero Wrote:You know I was thinking about this and wonder if it really matters that much?
No, no, it doesn't matter at all! :)
It only means that if you study from that list, you should double-check whether each compound is actually a single word. The list may still be useful, though: for example, it shows which words can appear together without particles in between (which may not be as obvious as it sounds).
Reply
#29
The unformatted raw list of 15,000 compounds can be viewed at http://shang.pp.fi/public/kanji/jawp-compounds.csv, but I'll put a prettier, pruned version in Google docs at some point (possibly, with translations from edict).

Note that this also contains "compounds" of length 1. I thought it might be informative to compare how often each kanji appears alone (i.e. not adjacent to any other kanji; okurigana may still be attached) with how often it appears in multi-kanji compounds.
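For anyone curious how this kind of counting works, here's a rough Python sketch. The kanji character range and the exact splitting rules are my assumptions, not necessarily what shang's script does:

```python
import re
from collections import Counter

# Assumption: "kanji" means the basic CJK Unified Ideographs block.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def compound_counts(text):
    """Count each maximal run of adjacent kanji. A run of length 1 is
    a kanji standing alone (surrounding okurigana doesn't count as
    adjacent); longer runs are the multi-kanji "compounds"."""
    counts = Counter()
    for m in KANJI_RUN.finditer(text):
        counts[m.group()] += 1
    return counts

counts = compound_counts("現在、使用する。使う。")
# 現在 and 使用 are length-2 runs; 使 (in 使う) stands alone
```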
Reply
#30
shang Wrote:Sure. I've put the full raw list up
Thanks for that full list.

I added a reference number for KANJIDIC (here) so that it's easy to combine the kanji frequency with data from other spreadsheets. Hope that's ok.

Of the 6,355 kanji in KANJIDIC (JIS X 0208), all but 114 were in the frequency list. I added those (giving them "zero" occurrences) in order to make aligning columns easier.
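The zero-filling step is easy to script. A minimal sketch (the column layout is assumed):

```python
def align_with_kanjidic(freq, kanjidic_chars):
    """Give every KANJIDIC kanji a row, filling in zero occurrences
    for the ones missing from the frequency list, so columns line up
    when the list is pasted next to other spreadsheets."""
    return {k: freq.get(k, 0) for k in kanjidic_chars}

freq = {"日": 120, "本": 80}
aligned = align_with_kanjidic(freq, ["日", "本", "語"])
```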
Reply
#31
I'm still playing around with the Wikipedia DB, and I've added some code to search through edict2 and kanjidic. Now I can generate all sorts of things. For example, here's a file that lists the 100 most frequent kanji in Wikipedia in frequency order with their readings, and cross-references the compound list to pick examples of common compounds (with translations from edict) for each one.

http://shang.pp.fi/public/kanji/kanji-in...t-stub.txt (utf-8 encoding)


Here's an example snippet from the file:

---------------------------------
在 685 exist
onyomi:ザイ
kunyomi:あ.る
name readings:あり
(145035) 現在 【げんざい】 (n-adv,n-t) now; current; present; present time; as of; (P)
(87824) 存在 【そんざい】 (n,vs,adj-no) existence; being; (P)
(6174) 在籍 【ざいせき】 (n,vs) enrollment; enrolment; (P)
(5284) 所在 【しょざい】 (n,vs) whereabouts; (P)
(3380) 在位 【ざいい】 (n,vs) reign (i.e. of a ruler)
(3050) 実在 【じつざい】 (n,vs) reality; existence; (P)
(2838) 所在地 【しょざいち】 (n) location; (P)
---------------------------------

The number in brackets before the compound is the number of occurrences in the compound db. If the compound has okurigana in edict, it lists all entries marked as common. E.g.

(2639) 見直し 【みなおし】 (n,vs) review; reconsideration; revision; (P)
(2639) 見直す 【みなおす】 (v5s) to look again; to get a better opinion of; (P)

Here the (2639) is the number of hits for the compound 見直. I don't currently have any way to count actual numbers for different conjugations.
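The okurigana lookup described above could work something like this sketch. The matching rule is my guess, and the edict entries here are stand-in tuples, not the real file format:

```python
import re

def entries_for_stem(stem, edict_entries):
    """Find common (P-marked) entries whose headword is the kanji
    stem plus optional trailing kana, e.g. stem 見直 matches 見直し
    and 見直す. `edict_entries` is a list of (headword, gloss)."""
    pattern = re.compile(re.escape(stem) + r"[\u3040-\u30ff]*")
    return [(w, g) for w, g in edict_entries
            if pattern.fullmatch(w) and "(P)" in g]

edict = [
    ("見直し", "(n,vs) review; reconsideration; revision; (P)"),
    ("見直す", "(v5s) to look again; to get a better opinion of; (P)"),
    ("見本", "(n) sample; (P)"),
]
hits = entries_for_stem("見直", edict)
```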
Edited: 2009-06-07, 9:04 am
Reply
#32
Nice cross-referencing with the linking to frequencies of compounds.

When you say it's a DB you've got, is it in an easy-to-export format, SQL-based or such? It would be fun to have a go at some statistics on the data too.
Reply
#33
LTze0 Wrote:When you say it's a DB you've got, is it in an easy-to-export format, SQL-based or such? It would be fun to have a go at some statistics on the data too.
Currently it's all in a custom homebrewed format, but I guess I could make an SQL schema for it and export it all to an SQLite database file. I probably won't have time to do that today, but let me know if that's a format you could use.

Another problem is the sheer amount of data. The resulting database will be hundreds of megs, so I don't think I will be able to host it anywhere.
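For the SQLite idea, a minimal sketch of what such an export could look like. The table and column names are made up; shang's actual schema would differ:

```python
import sqlite3

def export_frequencies(conn, compounds):
    """Dump a {compound: count} mapping into a single SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compound_freq ("
        "compound TEXT PRIMARY KEY, count INTEGER NOT NULL)")
    conn.executemany(
        "INSERT OR REPLACE INTO compound_freq VALUES (?, ?)",
        compounds.items())
    conn.commit()

conn = sqlite3.connect(":memory:")  # a real export would use a file path
export_frequencies(conn, {"現在": 145035, "存在": 87824})
```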
Reply
#34
shang Wrote:
LTze0 Wrote:When you say it's a DB you've got, is it in an easy-to-export format, SQL-based or such? It would be fun to have a go at some statistics on the data too.
Currently it's all in a custom homebrewed format, but I guess I could make an SQL schema for it and export it all to an SQLite database file. I probably won't have time to do that today, but let me know if that's a format you could use.

Another problem is the sheer amount of data. The resulting database will be hundreds of megs, so I don't think I will be able to host it anywhere.
Storage is no problem, I have a server I can give you FTP to.

And yeah, SQLite or a MySQL insert format would be ideal. Given the fairly simple data, I shouldn't think it would be too hard to make a fairly dialect-free export file? Barring the multi-row INSERT syntax that MySQL supports (not sure about SQLite, I haven't used it much).
Reply
#35
shang Wrote:The unformatted raw list of 15,000 compounds can be viewed at http://shang.pp.fi/public/kanji/jawp-compounds.csv, but I'll put a prettier, pruned version in Google docs at some point (possibly, with translations from edict).

Note that this also contains "compounds" of length 1. I thought it might be informative to compare how often each kanji appears alone (i.e. not adjacent to any other kanji; okurigana may still be attached) with how often it appears in multi-kanji compounds.
Shang, would it be possible to get that pruned version? It could be really helpful.
Thanks!
Reply
#36
It makes me happy how people are getting interested in corpus-based linguistic analysis :D

It's too bad Google doesn't open up their crawl database for raw queries. It would be the best corpus of all.
Reply
#37
shang Wrote:The unformatted raw list of 15,000 compounds can be viewed at http://shang.pp.fi/public/kanji/jawp-compounds.csv, but I'll put a prettier, pruned version in Google docs at some point (possibly, with translations from edict).
Just noticed a minor mistake: this file only goes up to 77.565% instead of 100%. My guess is you did some pruning after collecting the total count. I made that mistake myself last night while working with some frequency lists ;)
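The fix is just to recompute the cumulative column from the counts that survived pruning. A sketch:

```python
def renormalize(rows):
    """Recompute cumulative percentages over the pruned list itself,
    so the last row ends at 100% instead of the pre-pruning total.
    `rows` is a list of (word, count) pairs in descending order."""
    total = sum(count for _, count in rows)
    out, running = [], 0
    for word, count in rows:
        running += count
        out.append((word, count, 100.0 * running / total))
    return out

table = renormalize([("する", 60), ("いる", 30), ("ある", 10)])
# last cumulative value is now exactly 100.0
```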

Any plans to generate a list of words from your wikipedia corpus?
Edited: 2009-06-23, 1:30 pm
Reply
#38
scuda Wrote:
shang Wrote:The unformatted raw list of 15,000 compounds can be viewed at http://shang.pp.fi/public/kanji/jawp-compounds.csv, but I'll put a prettier, pruned version in Google docs at some point (possibly, with translations from edict).
Just noticed a minor mistake: this file only goes up to 77.565% instead of 100%. My guess is you did some pruning after collecting the total count. I made that mistake myself last night while working with some frequency lists ;)

Any plans to generate a list of words from your wikipedia corpus?
Yeah, the list was already pruned, because the total number of compounds was over 200,000, with a lot of compounds that only appear once or twice. I've been planning to get back to the Wikipedia frequency project (and, for example, use a proper Japanese parser), but I'm nearing a deadline at work and doing long days, so I haven't been able to do anything Japanese-related lately (except for trying to keep up with my Anki reviews).
Reply
#39
BTW, I found out that MeCab is supposed to be a superior parser to ChaSen. I stumbled across a paper on lemmas/morphology that compared the two, and it basically said that MeCab handles lemmas better (and is better for regular words too). Perhaps try using that instead.

I actually made my own corpus using 683 j-drama subtitles, and I used ChaSen for it. I basically did something along the lines of 'cat corpus | chasen | awk '{print $1}' | sort | uniq -c > output', and that gave me the list of words and their counts. I'll have to try running MeCab on the j-drama corpus later and see what kind of difference it makes.
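That pipeline translates almost directly to Python. In this sketch the parser is abstracted to any callable returning a word list, since in practice it would shell out to ChaSen or MeCab:

```python
from collections import Counter

def word_counts(lines, tokenize):
    """The 'chasen | awk | sort | uniq -c' pipeline in Python:
    tokenize each line, pool the tokens, and return (word, count)
    pairs in descending frequency order."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common()

# str.split is a toy stand-in for the real morphological parser
ranked = word_counts(["a b a", "b a"], str.split)
```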
Reply
#40
I'm currently running the Wikipedia data through MeCab (250,000 pages and counting...), but it seems to split words a bit too eagerly for this sort of word frequency analysis. For example, it splits 再利用 into 再 (prefix) + 利用 (noun), when it might be more interesting to count the frequency of the full 再利用 combination. I've been thinking of changing the processing so that if a word is identified as a prefix, it's concatenated to the word following it, but I'm a bit worried that this might lead to unwanted combinations too.
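The prefix-merging idea could be sketched like this. The POS tags are simplified strings here; MeCab's real feature strings are more detailed:

```python
def merge_prefixes(tokens):
    """Glue any token tagged as a prefix onto the token that follows
    it, so 再 + 利用 is counted as 再利用. `tokens` is a list of
    (surface, pos) pairs in sentence order."""
    out, i = [], 0
    while i < len(tokens):
        surface, pos = tokens[i]
        if pos == "prefix" and i + 1 < len(tokens):
            next_surface, next_pos = tokens[i + 1]
            out.append((surface + next_surface, next_pos))
            i += 2
        else:
            out.append((surface, pos))
            i += 1
    return out

merged = merge_prefixes([("再", "prefix"), ("利用", "noun"), ("する", "verb")])
```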

Which option do you think would result in the most useful set of data?
Reply
#41
Both? I'd be interested in how often 再利用 occurs together as well as how often just 利用 appears. Not sure how that would affect the aggregate statistics though...
Reply
#42
Okay, I'm now counting both total occurrences and occurrences in isolation separately, so you can get stats like this:

The word 使用 occurs alone in 28.3% of all occurrences
73.6% as 使用する
0.50% as 使用時
0.47% as 使用者
0.41% as 使用料

etc.
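Percentages like those can be derived from two counters, something along these lines (the numbers below are made up for illustration):

```python
def context_shares(total, contexts):
    """`total` is the word's total occurrence count; `contexts` maps
    each longer surrounding form (e.g. 使用する) to its count. The
    remainder is the word occurring alone. Returns percentages."""
    alone = total - sum(contexts.values())
    shares = {"alone": 100.0 * alone / total}
    for form, count in contexts.items():
        shares[form] = 100.0 * count / total
    return shares

shares = context_shares(1000, {"使用する": 700, "使用者": 50})
```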

I'll post the full statistics after my computer finishes going through all the data (will take a few hours).
Reply
#43
Here's the list of all words that have at least 10 occurrences. The full list was just too large, and contained a lot of entries that I suspect were typos or glitches in the parser.

http://tables.googlelabs.com/DataSource?...6438/36438

EDIT: Hmm, the table seems to get corrupted after the 100th item even if I try to re-upload. The original file looks fine, so the Fusion Tables project might still be a bit buggy.
Edited: 2009-07-04, 9:08 am
Reply
#44
Oh well. I can't get it to import correctly into Fusion Tables, and it's a bit too large for Google Spreadsheets I think, so here's the plain csv file.

http://shang.pp.fi/public/kanji/jawp-mec...pruned.csv
Reply
#45
"I'm still playing around with the Wikipedia DB, and I've added some code to search through edict2 and kanjidic. Now I can generate all sorts of things. For example, here's a file that lists the 100 most frequent kanji in Wikipedia in frequency order with their readings, and cross-references the compound list to pick examples of common compounds (with translations from edict) for each one.

http://shang.pp.fi/public/kanji/kanji-i … t-stub.txt (utf-8 encoding)"

Can one prevail on you to generate exactly the same list, but with the 173 most frequent kanji, so that the list covers the 50% cumulative frequency point? Reason - I'm working on converting the list into an Anki deck or the like, and having one that covers the 50% mark would be great. Obviously, this can be extended to the 75% and 90% points as well...
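Finding the cutoff for any coverage target is a one-pass scan over the frequency column. A sketch:

```python
def coverage_cutoff(counts, target):
    """Return how many of the top entries (counts in descending
    order) are needed to reach `target` as a fraction of all
    occurrences -- e.g. target=0.5 for the 50% point."""
    total = sum(counts)
    running = 0
    for i, count in enumerate(counts, 1):
        running += count
        if running / total >= target:
            return i
    return len(counts)

n = coverage_cutoff([50, 30, 10, 5, 5], 0.8)
```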
Reply
#46
Does anyone still have a copy of shang's list? His site is gone :(
Reply
#47
I have a copy; where should I upload it?

And what happened to shang???
Reply
#48
About shang, I don't know... but as for uploading the list: if it's too large for Google Docs, just put the text file on Megaupload.com, Rapidshare.com, or another hosting site. It's pretty easy.
Reply
#49
I'm not sure if this will be of use to anybody but me, but I find it interesting.

I took the original kanji frequency list that was posted in Google Docs and color-coded it by Heisig number, the lighter colors marking the kanji learned earliest and the darkest the latest.

I've also added a calculator that tells you what percentage of the characters you should be able to recognize based on which frame you are on in Heisig.

It's funny to think that you could power through memorizing the 173 most frequent kanji and be able to recognize 50% of all the kanji on Wikipedia, but you would have to work up to frame 887 to achieve the same percentage using RTK's ordering. Although, I somehow feel it would be easier for me to learn and remember 887 kanji with RTK than 173 by rote repetition.
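The spreadsheet calculator presumably does something equivalent to this sketch; the (count, Heisig number) pairing is assumed from the color coding described above:

```python
def coverage_at_frame(freq_rows, frame):
    """Percentage of all kanji occurrences covered once you've
    reached `frame` in Heisig's ordering. `freq_rows` is a list of
    (count, heisig_number) pairs, with heisig_number None for kanji
    outside RTK."""
    total = sum(count for count, _ in freq_rows)
    known = sum(count for count, h in freq_rows
                if h is not None and h <= frame)
    return 100.0 * known / total

pct = coverage_at_frame([(60, 1), (30, 500), (10, None)], 100)
```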

You can view my color coded list here:

http://spreadsheets.google.com/ccc?key=0...0MVE&hl=en
Reply
#50
What can I say... well organized and beautiful! But the calculator doesn't work: I receive ###### as the answer. Something with the formula...
Edited: 2009-08-20, 5:30 pm
Reply