
For those who finished Core 2k/6k/10k

#26
Thanks for all your work, MaxHayden.
Reply
#27
Is there software that compares two lists and gives you statistics about word coverage?

For example, I compared a list of words extracted from Code Geass's subtitles with Prof. Matsushita's 60k word list. I used OpenOffice Calc, but the work was tedious. For every 10k words in Matsushita's list I extracted the words that also appear in Code Geass, and from there I calculated the percentage of coverage. Is there something that does this automatically? xD
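A small Python sketch of the kind of comparison described, assuming both lists are plain sequences of surface forms (the word lists below are hypothetical stand-ins for real data files):

```python
# Sketch: compare a media word list against a ranked frequency list and
# report cumulative coverage per ranking band, the way the post describes.

def coverage_by_band(media_words, ranked_list, band=10_000):
    media = set(media_words)
    matched_total = len(media & set(ranked_list))
    results = []
    found = 0
    for start in range(0, len(ranked_list), band):
        chunk = ranked_list[start:start + band]
        found += sum(1 for w in chunk if w in media)
        results.append((start + band, found / matched_total * 100))
    return results

# Toy example: a 30-entry "ranked list" and a 5-word "media list"
ranked = [f"w{i}" for i in range(30)]
media = ["w0", "w5", "w12", "w25", "not_in_list"]
for top_n, pct in coverage_by_band(media, ranked, band=10):
    print(f"0-{top_n}: {pct:.2f}%")
```

Note that, as in the post, words not found anywhere in the ranked list are excluded from the denominator.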

BTW the data for Code Geass are:


Lexeme:

0-10k = 74.42%
0-20k = 88.41%
0-30k = 94.23%
0-40k = 96.94%
0-50k = 99.65%
0-60k = 100%


Standard (Newspaper) Orthography:

0-10k = 71.34%
0-20k = 84.82%
0-30k = 92.18%
0-40k = 96.37%
0-50k = 98.57%
0-60k = 100%


How should I interpret these data, considering that the word list from Code Geass was generated by cb's utility, and of its 5242 words only about 3500 were found in Prof. Matsushita's list?
#28
I'm not 100% clear on what you did. You ran cb's utility to get the frequency counts of words from Code Geass, then you looked up those words in this frequency list. But how did you get rid of proper names, fictitious technical terms, etc.? And how did you calculate the coverage frequency once you had done that? Are those percentages the percent of *words* covered or the percent of *instances of words*? And what is the "3500" found at the bottom of your post? It sounds like you are saying that there are 5242 unique words in Code Geass and only 3500 of them are on the list, but that doesn't match your numbers, which say that all of them are found if you include all 60k. It also isn't clear which of the three lists you used. He has a "general learners" list (not the default sort order, but the one you would want to use for this) along with two others (one for international students and one for reading academic literature).

In any event, this list is a list of words used in written Japanese. And obviously spoken and written Japanese have somewhat different vocabularies (technically they have a different "register", which covers a bit more than vocabulary...).

It's worth noting that for the 10 written sub-domains, a few extra words go a long way. For "ST" (science and technology), ~750 words not in the first 20k account for over 50% of the unknown vocabulary. The same is true of "BM" (biomedical). Even "OC", the internet forums, exhibits this to some degree, with ~25% of the unknown words accounted for by the highest-frequency 750. So odds are that if you look at the frequencies of the words not covered, you can get similar coverage for spoken Japanese if you use a large enough database of subtitles. (And that might not be a bad project for someone to take on.)
#29
Which data did you use by the way? Here's my brief analysis:

VDLJ_Ver1_0_General-Learners.xlsx gives word rankings. It doesn't give frequencies though, so it won't be enough to calculate the coverage % for the top N words.

The BCCWJ frequency lists from here http://www.ninjal.ac.jp/corpus_center/bc...-list.html give you lexeme rankings as well as frequencies for various corpora, so we can calculate the coverage % with those.

The SUW lists include base lexemes only, while the LUW lists include more forms and compounds. For example, in SUW you will find the base frequency for the lexeme 案内, but in LUW you will find 案内、案内する、御案内、御案内する、案内人、案内図, etc.

Obviously you need far more entries from LUW to get the same coverage as from SUW.

Luckily, MeCab seems to produce lexemes compatible with SUW. In other words, if I give MeCab this: 御案内、案内、案内する、御案内する、案内人、案内図, it outputs this:

する 動詞 自立 スル
図 名詞 接尾 ズ
案内 名詞 サ変接続 アンナイ
御 接頭詞 名詞接続 ゴ
人 名詞 接尾 ジン

I checked the SUW list; the core_pmw field seems to be the frequency of the lexeme per 1 million words. It shows 10,000 lexemes ≈ 95% coverage, and 20,000 lexemes ≈ 98% coverage.
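If core_pmw really is occurrences per million tokens and the list is sorted by rank, cumulative coverage of the top N lexemes is just a running sum. A toy sketch under that assumption (the pmw values below are made up, not from the real list):

```python
# Sketch: cumulative token coverage from per-million-word frequencies,
# assuming the pmw values of all entries sum to roughly one million.

def cumulative_coverage(pmw_sorted, n):
    return sum(pmw_sorted[:n]) / 1_000_000 * 100  # percent of tokens

pmw = [600_000, 250_000, 100_000, 50_000]  # four lexemes covering 1M tokens
print(cumulative_coverage(pmw, 2))  # 85.0
print(cumulative_coverage(pmw, 4))  # 100.0
```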
Edited: 2014-07-09, 2:44 am
#30
I just did a similar test with Death Note against the SUW list. I have a small Python script for loading the lists and comparing.

First 10,000 lexemes: Found 2320 Missing 1538
First 20,000 lexemes: Found 2639 Missing 1219
First 30,000 lexemes: Found 2757 Missing 1101
First 40,000 lexemes: Found 2819 Missing 1039
First 50,000 lexemes: Found 2853 Missing 1005

Trouble is, a lot is missing because of irregular writing; e.g. Death Note uses 取調べる while the SUW list uses 取り調べる, so 取調べる counts as missing. :(
#31
Some stats for Death Note, excluding morphemes that aren't matched at all, since those aren't genuinely rare.

Loaded lemmas up to: 185136
Using core_rank order
Found 2725 Missing 1133
Coverage of found lemmas
Range Match Tot %
0 ~ 9999 : 2282 2282 83.74%
10000 ~ 19999 : 326 2608 95.71%
20000 ~ 29999 : 117 2725 100.00%
30000 ~ 39999 : 0 2725 100.00%
40000 ~ 49999 : 0 2725 100.00%
50000 ~ 59999 : 0 2725 100.00%
60000 ~ 185136 : 0 2725 100.00%

The main "rank" looks odd for some words in the SUW list, for example 二つ is rank 121,932! Doesn't make sense.
#32
Sorry MaxHayden, my explanation was awful! Thanks Max and everyone for your help. I think these statistics can be very useful to some of us, so it is good to continue with this sort of analysis!

So what do you think is the best word list to use? And what is the best way to extract word frequencies from subtitles? Is cb's utility OK, or is it better to use MeCab?

Sorry if I keep going on about this topic, but it is very interesting :P
#33
vosmiura Wrote:I just did a similar test with Death Note against the SUW list. I have a small Python script for loading the lists and comparing.

First 10,000 lexemes: Found 2320 Missing 1538
First 20,000 lexemes: Found 2639 Missing 1219
First 30,000 lexemes: Found 2757 Missing 1101
First 40,000 lexemes: Found 2819 Missing 1039
First 50,000 lexemes: Found 2853 Missing 1005

Trouble is, a lot is missing because of irregular writing; e.g. Death Note uses 取調べる while the SUW list uses 取り調べる, so 取調べる counts as missing. :(
Vosmiura, could you kindly do the same thing for this list, if and when you have the time/inclination?

My Girl JDrama word frequency list

Thank you in advance :P
#34
vosmiura Wrote:VDLJ_Ver1_0_General-Learners.xlsx gives word rankings. It doesn't give frequencies though, so it won't be enough to calculate the coverage % for the top N words.
The academic one gives frequencies both overall and for each of the 10 corpora. The one for teachers might have that information as well, but I can't remember.

Quote:The BCCWJ frequency lists from here http://www.ninjal.ac.jp/corpus_center/bc...-list.html give you lexeme rankings as well as frequency for various corpora so we can calculate the coverage % with those.

The SUW lists include base lexemes only, while the LUW lists include more forms and compounds. For example in SUW you will find base frequency for lexeme 案内 but in LUW you will find 案内、案内する、御案内、御案内する、案内人、案内図 etc.

Obviously you need way more entries from LUW to get the same coverage as from SUW.
That's very helpful. Thanks.

Quote:I just did a similar test with Death Note against the SUW list. I have a small Python script for loading the lists and comparing.
Can you share that Python script please? I want to compare the VDLJ with the SUW list and with the community-generated one that's floating around. (I realize that they split things up differently b/c of the differences in morphological analysis, but that's what I want to look at.) I also want to compare the "Core" vocabulary list that we use against these frequency lists to see how we can improve Core, and having your script to start from would be very helpful.

Also, how does the SUW list compare against the VDLJ list in your opinion? I'm going to try to make an "improved Core" Anki deck, and my current inclination is to go with the VDLJ list plus any very high-frequency items from the SUW list that aren't covered for whatever reason. And to make the "Plus" deck by looking at the individual corpora and grabbing the most frequent 750-1000 words for each one that weren't covered by the core deck.

cophnia61 Wrote:Sorry MaxHayden, my explanation was awful! Thank Max and you all for your help, I think those statistics can be very useful to some of us, so it is good to continue with those sort of analysis!

So what do you think it is the best word list to use? And what is the best way to extract word frequency from subtitles? Is cb' utility ok or it is best to use MeCab?

Sorry if I endure with this topic but it is very interesting tongue
I'm not sure what the best approach is yet. I'm still trying to figure that out. If vosmiura shares his script, the *easiest* thing to do would be to use MeCab and the SUW list I think.
Edited: 2014-07-09, 11:34 am
#35
vosmiura Wrote:Trouble is a lot is missing because of irregular writing, e.g. Death Note uses 取調べる and the SUW list uses 取り調べる, so 取調べる counts as missing Sad.
As this is a common feature of the Japanese language, I think these and other anomalies can be handled easily in your Python script. For instance, when parsing, you can omit any hiragana that comes between two kanji and then compare.
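A sketch of that normalization idea. It is a crude heuristic: it can also merge genuinely distinct spellings, so treat it as a first pass before matching, not an exact equivalence.

```python
import re

# Strip any hiragana sandwiched between two kanji, so okurigana variants
# like 取り調べる and 取調べる compare equal. Trailing okurigana (e.g. the
# べる in 食べる) is left alone, since there is no kanji after it.

KANJI = '\u4e00-\u9fff'
BETWEEN_KANJI = re.compile(f'(?<=[{KANJI}])[\u3041-\u309f]+(?=[{KANJI}])')

def normalize(word):
    return BETWEEN_KANJI.sub('', word)

print(normalize('取り調べる'))  # 取調べる
print(normalize('取調べる'))    # unchanged
print(normalize('食べる'))      # unchanged
```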

If you decide to post your script, I'd like to have a look at it as well.
#36
I'll share a copy of the script as soon as I get home.
#37
Shared the script: http://pastebin.com/ax4nDLGR

How to use, by example on the file from cophnia61:

c:\temp>python lexeme_match.py -f "C:\temp\VDRJ_Ver1_1_Research_Top60894.csv" -m "C:\temp\My Girl JDrama Morphemes.txt" -r "Word Ranking for General Learners"
Using lexeme DB from 'C:\temp\VDRJ_Ver1_1_Research_Top60894.csv'
Matching lexemes in 'C:\temp\My Girl JDrama Morphemes.txt'
Using 'Standard (Newspaper) Orthography' lexemes
Using 'Word Ranking for General Learners' ranks
Loaded 60894 lexemes
Found 2626 Missing 1294
Coverage of found lexemes
Range Match Tot %
1 ~ 5000 : 1822 1822 69.38%
5001 ~ 10000 : 411 2233 85.03%
10001 ~ 15000 : 175 2408 91.70%
15001 ~ 20000 : 78 2486 94.67%
20001 ~ 25000 : 60 2546 96.95%
25001 ~ 30000 : 32 2578 98.17%
30001 ~ 35000 : 13 2591 98.67%
35001 ~ 40000 : 11 2602 99.09%
40001 ~ 45000 : 11 2613 99.50%
45001 ~ 60894 : 13 2626 100.00%
#38
Thanks.
#39
vosmiura Wrote:Shared the script: http://pastebin.com/ax4nDLGR

How to use, by example on the file from cophnia61:

c:\temp>python lexeme_match.py -f "C:\temp\VDRJ_Ver1_1_Research_Top60894.csv" -m "C:\temp\My Girl JDrama Morphemes.txt" -r "Word Ranking for General Learners"
Using lexeme DB from 'C:\temp\VDRJ_Ver1_1_Research_Top60894.csv'
Matching lexemes in 'C:\temp\My Girl JDrama Morphemes.txt'
Using 'Standard (Newspaper) Orthography' lexemes
Using 'Word Ranking for General Learners' ranks
Loaded 60894 lexemes
Found 2626 Missing 1294
Coverage of found lexemes
Range Match Tot %
1 ~ 5000 : 1822 1822 69.38%
5001 ~ 10000 : 411 2233 85.03%
10001 ~ 15000 : 175 2408 91.70%
15001 ~ 20000 : 78 2486 94.67%
20001 ~ 25000 : 60 2546 96.95%
25001 ~ 30000 : 32 2578 98.17%
30001 ~ 35000 : 13 2591 98.67%
35001 ~ 40000 : 11 2602 99.09%
40001 ~ 45000 : 11 2613 99.50%
45001 ~ 60894 : 13 2626 100.00%
Thank you vosmiura!

So from those statistics we can say that, for this drama, if we know 15,000 lexemes we're going to see an unknown word roughly every 11 words. Is that right?
Edited: 2014-07-10, 2:21 pm
#40
cophnia61 Wrote:So from those statistics we can say that, for this drama, if we know 15,000 lexemes we're going to see an unknown word roughly every 11 words. Is that right?
It means that with those 15,000 lexemes, you won't know 1 out of 11 of the distinct words overall. You'll run into ~350 new words.

It doesn't mean that 1 out of every 11 words you read will be unknown, because that depends on how often those words appear. Those less common words might occur at a frequency of, say, 2 per 1,000,000 words, so 350 × 2 / 1,000,000 means you'll see an unknown word roughly 1 in every 1,400 words on average, unless one of the words happens to be really important to the drama.
Edited: 2014-07-10, 5:08 pm
#41
cophnia61,

I'm not sure if you'd be interested in this as well, but I organized your list by the rank Matsushita gave each morpheme (lexeme?) in terms of frequency.

(With rank)
http://pastebin.com/yfTGuCWF
(word only)
http://pastebin.com/3vPt4B8E

I'm thinking about taking sentences and averaging the words into a "sentence rank." The only thing that worries me about this is the i+1 issue when learning cards. Ranking sentences would seem to destroy this, but if anyone has any ideas I'm all ears.

Also, I'm still new to a lot of the programs and Anki plugins users use here, so please bear with my ignorance.
Edited: 2014-07-11, 12:54 am
#42
vosmiura Wrote:
cophnia61 Wrote:So from those statistics we can say that, for this drama, if we know 15,000 lexemes we're going to see an unknown word roughly every 11 words. Is that right?
It means that with those 15,000 lexemes, you won't know 1 out of 11 of the distinct words overall. You'll run into ~350 new words.

It doesn't mean that 1 out of every 11 words you read will be unknown, because that depends on how often those words appear. Those less common words might occur at a frequency of, say, 2 per 1,000,000 words, so 350 × 2 / 1,000,000 means you'll see an unknown word roughly 1 in every 1,400 words on average, unless one of the words happens to be really important to the drama.
Now that you point it out to me it's so obvious! I'm so stupid for not thinking of that! ._.

Btw, cb's utility (and maybe MeCab too?) gives the number of times each word appears. So if we have a list of words taken from a drama/anime/novel etc., with exactly how many times each word appears in that media, we can make a script that tells you on average how often you'll encounter the words above a certain rank in the VDRJ list!

Like:

0-30k: 2500 words, appearing 'x' times in total
30k-100k: 50 words, appearing 'y' times in total
...
30k-100k: you'll encounter one of them every 'z' words (where 'z' = 'x' / 'y')
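That arithmetic as a trivial helper, using x / y as defined above (the counts in the example call are just illustrative numbers):

```python
# x = total appearances of words you know, y = total appearances of words
# you don't; you hit an unknown word roughly once every x / y words.

def words_per_unknown(known_occurrences, unknown_occurrences):
    return known_occurrences / unknown_occurrences

print(words_per_unknown(74088, 588))  # 126.0
```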

But I'm not good at math, so feel free to correct me :P

EDIT2:

I just did it for the My Girl drama, and the conclusion is that the words that fall within the first 10k of the VDRJ list appear 74088 times in total in that drama, while the remaining words appear 588 times. So if we know only the first 10k words in the VDRJ list, we'll encounter a word we don't know once every 126 words.

Furthermore, while those words appear 588 times, there are only 393 distinct words, so they are unknown only the first time you encounter them... and since some of those words appear more often than the others, if you look up their meaning the first time, the real number of unknown encounters is even lower.

So I think it's not so bad; at least for me those numbers are comforting! Maybe someday we'll build a database of media ranked by simplicity, e.g.: Doraemon has N words; if you know N words from [put here a frequency list], then Doraemon has N unknown vocabs for you, and you'll encounter one every N words. (I'm daydreaming here)

EDIT1:

and btw thank you jeff for those lists!
Edited: 2014-07-11, 10:03 am
#43
jeffberhow Wrote:I'm thinking about taking sentences and averaging the words into a "sentence rank." The only thing that worries me about this is the i+1 issue when learning cards. Ranking sentences would seem to destroy this, but if anyone has any ideas I'm all ears.

Also, I'm still new to a lot of the programs and Anki plugins users use here, so please bear with my ignorance.
The issue, as I understand it from reading older threads, is that most attempts to sort sentences into i+1 order end up grouping a bunch of generally unrelated sentences together because each contains 1 rare kanji.

If you can come up with a better way to do this, I'd be open to hearing it. One thing I was thinking of playing with was using an m-estimator instead of "average" or "maximum" to estimate sentence difficulty b/c this would be immune to those rare outlier words and b/c the math isn't that complicated. But I'm not sure how to make this i+1, so it's still just a drawing board idea at this point...
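One cheap robust stand-in in that spirit is a trimmed mean of the words' frequency ranks, rather than a full M-estimator. A sketch with made-up ranks:

```python
# Score sentence difficulty from its words' frequency ranks with a
# trimmed mean, so one rare outlier word can't dominate the score the way
# max() (or even the plain average) does. A real M-estimator (e.g. Huber)
# would use iteratively reweighted means; this is the simplest robust
# stand-in.

def trimmed_mean(ranks, trim=0.2):
    xs = sorted(ranks)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] or xs  # fall back if trimming removes all
    return sum(kept) / len(kept)

common = [120, 300, 450, 800, 95]             # all common words
with_outlier = [120, 300, 450, 800, 121_932]  # one very rare word

print(trimmed_mean(common))        # 290.0
print(trimmed_mean(with_outlier))  # ~516.7, while max() would give 121932
```

This only addresses the outlier problem, not the i+1 ordering itself, which is the part still on the drawing board.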

cophnia61 Wrote:EDIT2:

I just did it for the My Girl drama, and the conclusion is that the words that fall within the first 10k of the VDRJ list appear 74088 times in total in that drama, while the remaining words appear 588 times. So if we know only the first 10k words in the VDRJ list, we'll encounter a word we don't know once every 126 words.
Yeah. This seems much more reasonable. FWIW, for spoken English, 95% coverage (1 unknown word in 20) is 3000 word families; 98% (1 in 50) is 7000. Because we are using lexemes instead of word families, the numbers for Japanese should be a little different. But the point is that you can probably get to a point where you can understand that drama and use it for comprehensible input with fewer than 10k words. (Maybe it would be a useful project to take a whole bunch of subs and generate a table listing the vocabulary levels you need for 95% and 98% coverage, so that people can see where each show ranks.)
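A sketch of that table-building idea for one show: given the token stream from its subs, count how many of its own most frequent words are needed to reach a coverage target (the token list below is a toy stand-in for a parsed subtitle file):

```python
from collections import Counter

# How many of a show's most frequent words do you need to know to cover
# 95% / 98% of its running text?

def words_needed(tokens, target=0.95):
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for i, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return i
    return len(counts)

toy = ["a"] * 90 + ["b"] * 8 + ["c", "d"]
print(words_needed(toy, 0.95))  # 2
print(words_needed(toy, 0.99))  # 3
```

Run per show over a subtitle corpus, this would give exactly the ranking table proposed above.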
Edited: 2014-07-11, 10:01 am
#44
MaxHayden Wrote:
cophnia61 Wrote:EDIT2:

I just did it for the My Girl drama, and the conclusion is that the words that fall within the first 10k of the VDRJ list appear 74088 times in total in that drama, while the remaining words appear 588 times. So if we know only the first 10k words in the VDRJ list, we'll encounter a word we don't know once every 126 words.
Yeah. This seems much more reasonable. FWIW, for spoken English, 95% coverage (1 unknown word in 20) is 3000 word families; 98% (1 in 50) is 7000. Because we are using lexemes instead of word families, the numbers for Japanese should be a little different. But the point is that you can probably get to a point where you can understand that drama and use it for comprehensible input with fewer than 10k words. (Maybe it would be a useful project to take a whole bunch of subs and generate a table listing the vocabulary levels you need for 95% and 98% coverage, so that people can see where each show ranks.)
It would be great! Man, just let me tell you, you're great, this community is great! These past days I was depressed thinking about how many words I need to know and miscalculating them... it was killing my enthusiasm for studying Japanese... but now, thanks to your and other users' courtesy and willingness to share this data, I feel relieved and reassured!

Honestly, I was reading old threads full of negativity like "wow, with 30k words consider yourself lucky if you understand books for 3-year-olds... you need to consult the dictionary 3000 times every half word..." and the like xD Obviously I'm exaggerating, but I do think some people exaggerate the difficulty of vocabulary acquisition. I know it's true for novels and other difficult material, but reading those threads it feels like there is nothing you can read before you know 45k words!

Sorry for the overenthusiasm but it's like I feel now xD
Edited: 2014-07-11, 10:16 am
#45
I've noticed there have been various trends on this forum regarding the amount of vocabulary needed for fluency. Reading topics from five years ago or so, Core 6k seems to be the ultimate goal, and after that it's just "context learning", "watching fun stuff" and so on, with only the perfectionists interested in Core 10k; if you read recent topics, it's just like you said: "3 billion words needed, no shortcuts"! :D

I think the first trend was way too optimistic, maybe because of the influence of AJATT in those years, but the latest one may be even worse. I too was getting depressed reading those 40,000-word estimates... so I stopped and did Anki reps instead! :P
#46
MaxHayden Wrote:One thing I was thinking of playing with was using an m-estimator instead of "average" or "maximum" to estimate sentence difficulty b/c this would be immune to those rare outlier words and b/c the math isn't that complicated. But I'm not sure how to make this i+1, so it's still just a drawing board idea at this point...
You're absolutely right about the m-estimator thing. I'll look into it; it seems similar to the least-squares idea from linear algebra.

I may put the i+1 thing on the back burner for now. Is there a plugin that will look at a sentence, evaluate the kanji in it, and only show it if you've learned those kanji via an RTK deck? That would be pretty valuable for a person who hasn't gotten through RTK yet (like me :D) but would like to learn sentences/vocab from a monster deck ranked by frequency.

I feel like I'm stepping into diminishing returns land.

cophnia61 Wrote:Honestly, I was reading old threads full of negativity like "wow, with 30k words consider yourself lucky if you understand books for 3-year-olds... you need to consult the dictionary 3000 times every half word..." and the like xD Obviously I'm exaggerating, but I do think some people exaggerate the difficulty of vocabulary acquisition. I know it's true for novels and other difficult material, but reading those threads it feels like there is nothing you can read before you know 45k words!
This made my day, haha. I've read those same threads. I doubt I even know 10k words in English! ;)

edit: I was curious and took an online test; apparently I only know about 22k words in English as a native speaker: http://testyourvocab.com/result?user=4258365

I enjoy books by authors like Vonnegut and Asimov, and I don't look words up (although I really should for Vonnegut). This gives me a better sense of goal for Japanese.
Edited: 2014-07-11, 1:03 pm
#47
jeffberhow Wrote:You're absolutely right about the m-estimator thing. I'll look into it; it looks similar to the Linear Algebra least squares idea.

I may put the i+1 thing on the backburner for now. Is there a plugin that will look at a sentence, evaluate the kanji in it, and only show it if you've learned that kanji via an RTK deck? That would be pretty valuable for a person who hasn't gotten through RTK yet (like me Big Grin) but would like to learn sentences/vocab from a monster deck ranked by frequency.

I feel like I'm stepping into diminishing returns land.
I think Morphman does what you want.

Quote:This made my day, haha. I've read those same threads. I doubt I even know 10k words in English! Wink
If you are a native English speaker, you probably know around 30k word families (which is many more words). That's the normal amount...
#48
cophnia61 Wrote:I did it just now for MyGirl drama and the conclusion is that the words which fall under the 10k in VDRJ's list do appear in total 74088 times in that drama, while the remaining words appear 588 times. So if we know only the first 10k words in VDRJ's list, we'll encounter a word we don't know one time every 126 words.
Yeah, as I was thinking about this: the frequency in the whole corpus is 2/1,000,000, but since those words appear in the drama at least once, their frequency in the drama is already higher, at least 1 in 74k.

1 in 126 is really not bad. After reading some posts, it was starting to sound like we need years of consistent vocab study just to understand anything.
Edited: 2014-07-11, 3:23 pm
#49
Moregon Wrote:I've noticed there have been various trends on this forum regarding the amount of vocabulary needed for fluency. Reading topics from five years ago or so, Core 6k seems to be the ultimate goal, and after that it's just "context learning", "watching fun stuff" and so on, with only the perfectionists interested in Core 10k; if you read recent topics, it's just like you said: "3 billion words needed, no shortcuts"! :D

I think the first trend was way too optimistic, maybe because of the influence of AJATT in those years, but the latest one may be even worse. I too was getting depressed reading those 40,000-word estimates... so I stopped and did Anki reps instead! :P
Absolutely agree. It's only recently that I've seen people say that 10k is not a level where you can just switch to reading a lot.

What people have to realize is not that 10k or 15k words is all you need, but that additional words become very rare, so if you mass-study them, the majority will just live in your Anki deck with no connection to a real-world encounter. You might as well focus on learning words you encounter in context rather than burning through random words.
Edited: 2014-07-11, 3:15 pm
#50
cophnia61, vosmiura:

Here's a great read I came across when looking up the meaning of word family.
http://www.nflrc.hawaii.edu/rfl/PastIssu...2hirsh.pdf

This also helped me understand vosmiura's stats up there and gave me more confidence in the structured learning of vocab. I hope I'm not taking things too far off topic.