Frequency of kana-based vocabulary?

#1
I know this site is about remembering the Kanji, and I'm certainly enjoying working my way thru RtK1. I've also spent time with Genki and some other introductory texts. But I'm curious about the balance between kana-based vocabulary and kanji-based vocabulary (either individual kanji or kanji compounds). Does anyone have a rough sense of how much vocabulary falls into one group or the other?

I know it's difficult to count. E.g., some words are expressed in either kana or kanji; and many kana are particles and -mashita and things like that. And, as in any language, there's difficulty in separating different forms of the same word. Nonetheless, there are also nouns, adjectives and adverbs that are generally expressed only in kana, and of course many words that are expressed only in kanji. So I'm still curious for a rough estimate of how vocabulary breaks down. Is it 70% kanji vocab, 30% kana, say?

Perhaps there are frequency-of-use studies that could answer my question? I recall reading one post here indicating that the Japanese translation of the Harry Potter books requires 12,000 unique words. Any idea how many of them were kanji compounds, and how many were kana? If I've missed such a post, I apologize.

Many thanks in advance.
#2
This is somewhat in the area of what I'm researching for my bachelor's degree in Japanese. Basically, how much is made up of wago and kango (and gairaigo) depends on what you're talking about. Spoken conversation? Newspapers? Blogs? Magazines? Are you talking on a per-word or per-distinct-word basis?

A rough estimate is that kango makes up about 45 percent while wago makes up 35 percent, gairaigo making up the remaining 10%, but it differs a lot from case to case.

A good source of information is this: http://www.kokken.go.jp/katsudo/seika/goityosa/

It's a study by Kokuritsu Kokugo Kenkyuujo, where 2 000 000 words were counted and analysed from magazines published in 1994. It lists how common words were, which forms they appeared in, which lexical strata they belong to and some other stuff. Really nice.
Edited: 2010-03-31, 7:02 pm
#3
Thanks for your reply. If only my Japanese were good enough to read the webpage you linked! But I was able to dope out the basics, especially from the PDF linked there. Thanks.

Can you or anyone else point me to English-language stats on usage of kana versus kanji?
#4
Tobberoth Wrote:A rough estimate is that kango makes up about 45 percent while wago makes up 35 percent, gairaigo making up the remaining 10%
...the final 10%, of course, is smilies and other emoticons.
#5
Are you interested in word script (kanji vs hiragana), word origin (wago vs kango), or proportion of characters that are kanji vs hiragana?

If the first (word script), there is a raw LEEDS word frequency list (45K items) based on a large internet corpus. Looking at the first ~15K actual words (i.e., ignoring the nonsense bits), it's approximately:
Kanji words: 73%
Hiragana words: 16%
Katakana words: 11%

Note:
- Kanji includes mixed kanji/hiragana, proper nouns, wago and kango words.
- Some words are in both kanji and hiragana categories
- Hiragana % should be a bit higher (I probably deleted particles, basic demonstratives, etc.)
- If it were only the first 1000 words, hiragana % would be higher because those are mostly frequent function words
- If you're reading about %s, make note of whether it's by running total or by word type (kanji will often be higher if by word type)
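
A breakdown like the one above can be approximated with a short script. This is only a minimal sketch, not the method actually used for the numbers in this post: the function names `script_of` and `script_shares` are made up for illustration, the sample word list is a toy stand-in for a real frequency list, and the rule "any kanji makes it a kanji word" is an assumption that mirrors the first note above.

```python
from collections import Counter


def script_of(word: str) -> str:
    """Classify a word as 'kanji', 'hiragana', 'katakana', or 'other'.

    Assumption: a word containing at least one kanji counts as a kanji
    word, so mixed kanji/hiragana words such as 食べる land in 'kanji'.
    """
    if not word:
        return "other"
    # CJK Unified Ideographs block, plus the iteration mark 々 (U+3005).
    if any("\u4e00" <= ch <= "\u9fff" or ch == "\u3005" for ch in word):
        return "kanji"
    # Hiragana block: U+3040..U+309F.
    if all("\u3040" <= ch <= "\u309f" for ch in word):
        return "hiragana"
    # Katakana block: U+30A0..U+30FF (includes the long-vowel mark ー).
    if all("\u30a0" <= ch <= "\u30ff" for ch in word):
        return "katakana"
    return "other"


def script_shares(words):
    """Return the percentage of distinct words falling in each script class."""
    counts = Counter(script_of(w) for w in set(words))
    total = sum(counts.values())
    return {script: 100 * n / total for script, n in counts.items()}


# Toy sample only; a real run would read the words from a frequency list.
sample = ["食べる", "日本", "これ", "テレビ", "学校", "きれい", "コーヒー", "山"]
shares = script_shares(sample)
print(shares)
```

Because this counts each distinct word once, it gives a by-word-type figure; weighting each word by its frequency would give the running-total figure instead.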

If you're interested in word origin or character %, there's stuff in English if you Google it. (studies/articles with breakdowns of both and time comparisons)
#6
Thora, thanks for your reply. I am interested in "all of the above," but especially word script, so your post was of particular interest to me. The numbers you report are pretty striking to me: 73% Kanji words, 16% hiragana, 11% katakana. That's a somewhat higher percentage of Kanji words than I might have expected, but then I'm a relative beginner.

Incidentally, I couldn't read the LEEDS list you linked; I guess I don't have the right settings enabled on my PC. It all looks like funny ASCII characters to me.

It's funny, when I first started studying Japanese, I found myself wishing there were more frequent kana and less frequent kanji. Now that I've become more obsessed with the language, and now about halfway through RtK1, I find the opposite -- I find it more fun to study kanji-based words than kana-based words. I get a big kick out of guessing what a compound means (or is pronounced), even if my guesses are wrong half the time.

Anyway, thanks again for your very helpful post.
#7
Groot Wrote:Incidentally, I couldn't read the LEEDS list you linked; I guess I don't have the right settings enabled on my PC. It all looks like funny ASCII characters to me.
I think I set my browser character encoding (under View) to Unicode UTF-8. Or maybe it was just auto detect Japanese. I don't know much about that stuff - perhaps someone else will have a better idea.
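
For anyone curious why the wrong setting produces "funny characters", here is a small illustration. It is a sketch under assumptions: it takes the list to be UTF-8 (which is what fixed it here), and it guesses that the wrong setting was a Western single-byte encoding such as Latin-1; the actual browser default may have been something else.

```python
# The same bytes, decoded two ways.
text = "漢字とかな"
raw = text.encode("utf-8")        # what is actually stored in the file

# Wrong guess: Latin-1 maps every byte to one character, so decoding
# never fails, but each 3-byte Japanese character turns into 3 junk chars.
garbled = raw.decode("latin-1")

# Right guess: decoding as UTF-8 restores the original text.
fixed = raw.decode("utf-8")

print(len(raw), len(garbled), len(fixed))
print(fixed)
```

The file itself is not damaged in either case; only the interpretation of the bytes changes, which is why switching the encoding setting is enough.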

It's great that you've discovered a passion for kanji and words. I figure curiosity and appreciation are what keep us motivated and make knowledge stick.
#8
Tobberoth Wrote:A rough estimate is that kango makes up about 45 percent while wago makes up 35 percent, gairaigo making up the remaining 10%, but it differs a lot from case to case.
One thing to keep in mind is that those stats look at the makeup of the most commonly used 'n' words (let's say 10,000). Kango will outnumber wago in the list, but that smaller number of wago are used much more frequently than kango, so if you look at frequency within that list, the wago will all be higher up.
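
The difference between counting distinct words and counting every occurrence can be shown with a toy example. This is only a sketch: the token list and the stratum labels below are invented for illustration and are not taken from any of the studies mentioned in this thread, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical stratum labels for a handful of words.
STRATUM = {
    "する": "wago",
    "ある": "wago",
    "研究": "kango",
    "調査": "kango",
    "分析": "kango",
}

# Invented running text: the two wago words repeat, the kango words do not.
tokens = ["する", "する", "する", "ある", "ある", "研究", "調査", "分析"]


def to_percent(counts):
    """Convert a Counter of raw counts into percentages."""
    total = sum(counts.values())
    return {key: 100 * n / total for key, n in counts.items()}


# By distinct word (type): each word is counted once.
type_shares = to_percent(Counter(STRATUM[w] for w in set(tokens)))

# By running word (token): every occurrence is counted.
token_shares = to_percent(Counter(STRATUM[w] for w in tokens))

print("by distinct word:", type_shares)
print("by running word: ", token_shares)
```

In this toy data kango make up 60% of the distinct words but only 37.5% of the running words, which is the same kind of flip described above.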

It will of course vary depending on the discourse though. I translated some documents for a beauty clinic yesterday and it was mostly katakana-ed English.
Edited: 2010-04-02, 4:36 am
#9
Jarvik7 Wrote:
Tobberoth Wrote:A rough estimate is that kango makes up about 45 percent while wago makes up 35 percent, gairaigo making up the remaining 10%, but it differs a lot from case to case.
One thing to keep in mind is that those stats look at the makeup of the most commonly used 'n' words (let's say 10,000). Kango will outnumber wago in the list, but that smaller number of wago are used much more frequently than kango, so if you look at frequency within that list, the wago will all be higher up.

It will of course vary depending on the discourse though. I translated some documents for a beauty clinic yesterday and it was mostly katakana-ed English.
Correct, the stats I posted (which I managed to mix up with a lacking 10% lol), were supposed to be distinct words, so each word is only counted once. Of course, like you said, wago are usually repeated a lot more than kango in a given text.
Edited: 2010-04-02, 7:35 am
#10
Quote:I think I set my browser character encoding (under View) to Unicode UTF-8.
Well what do you know, that worked perfectly! Thanks!
#11
Thought I'd add a note here that Shang also created a word frequency list based on Wikipedia (20,000 words, with frequency indicated for the first x amount). See post: http://forum.koohii.com/showthread.php?p...7#pid97597

Kanji, katakana and hiragana are mixed together, but it's a cleaner list than the LEEDS one. The LEEDS covers a broader range of writing though.
#12
Don't forget FooSoft's analysis not only of this but those light novels no one posted here: http://forum.koohii.com/showthread.php?p...8#pid90568

And iSoron: http://forum.koohii.com/showthread.php?p...5#pid89705

Also: http://forum.koohii.com/showthread.php?p...2#pid96732
Edited: 2010-04-19, 6:50 pm