MaxHayden Wrote:
cophnia61 Wrote:
EDIT2:
I just did it for the My Girl drama, and the conclusion is that the words that fall within the first 10k of VDRJ's list appear 74088 times in total in that drama, while the remaining words appear 588 times. So if we know only the first 10k words in VDRJ's list, we'll encounter a word we don't know about once every 126 words.
Yeah. This seems much more reasonable. FWIW, for spoken English, 95% coverage (1 word in 20) is 3000 word families. 98% (1 in 50) is 7000. Because we are using lexemes instead of word families, the numbers for Japanese should be a little different. But the point is that you can probably get to a point where you can understand that drama and use it for comprehensible input with fewer than 10k words. (Maybe it would be a useful project to take a whole bunch of subs and generate a table listing the vocabulary levels you need for 95% and 98% coverage so that people can see what shows rank where.)
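For anyone who wants to check the arithmetic, here is a minimal Python sketch that turns the two token counts quoted above (74088 known, 588 unknown) into a coverage percentage and a "one unknown word every N words" figure. The function names are made up for illustration, and the sketch assumes the two counts together cover every running word in the subtitles.

```python
def coverage_and_interval(known_tokens: int, unknown_tokens: int):
    """Return (coverage, interval) for a text.

    coverage: fraction of running words that are known.
    interval: average number of running words per unknown word.
    """
    total = known_tokens + unknown_tokens
    coverage = known_tokens / total
    interval = total / unknown_tokens
    return coverage, interval


def interval_from_coverage(coverage: float) -> float:
    """Convert a coverage fraction (e.g. 0.95) into 'one unknown word in N'."""
    return 1.0 / (1.0 - coverage)


# Counts quoted in the post above (10k VDRJ words vs. the rest).
coverage, interval = coverage_and_interval(74088, 588)
print(f"coverage: {coverage:.2%}, one unknown word every {interval:.0f} words")

# The rule-of-thumb thresholds mentioned for English.
for target in (0.95, 0.98):
    print(f"{target:.0%} coverage is one unknown word in {interval_from_coverage(target):.0f}")
```

With those counts the coverage comes out a little above 99%, which is above both the 95% and 98% thresholds.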
Man this is a real drama

I think I did that calc wrong :/ Now that I've written a script to calculate it, it seems that with the first 15k words you're going to see 1 unknown word every 8.5 words... Is that possible? Oh no :'( I'm going to kill myself
My Girl JDrama word frequency report
Can someone write a script just to see if I'm wrong? The first column is the number of times that particular word appears in the drama. Could you tell me, with hypothetical knowledge of the first 15000 words (in one of those lists), how often I'll encounter an unknown word in that drama? Because it's clear my math skills are way worse than I thought...
I did this:
- For the first 15000 words in the SUW list:
  - find the words in the "My Girl word frequency list" that are present among those 15000 words
  - for every word that is present, add the number of times it appears to the running total of the previous words
- For the remaining words in the SUW list:
  - ...do the same thing...
- Divide the overall number of appearances of the words in the first 15000 by the overall number of appearances of the remaining words
Please tell me I did it wrong
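The steps above can be sketched in Python roughly like this. It is only a sketch under assumptions: the data layout (a dict of word to count for the drama, and a list of words already sorted by rank for the frequency list), the function name, and the toy data are all made up for illustration, not taken from the actual files. It also tallies drama words that are missing from the frequency list entirely as a separate bucket, since the steps above do not say what happens to them.

```python
from typing import Dict, Sequence, Tuple


def split_by_cutoff(
    drama_counts: Dict[str, int],
    ranked_words: Sequence[str],
    cutoff: int,
) -> Tuple[int, int, int]:
    """Return (known, beyond_cutoff, unlisted) token totals.

    known:         occurrences of drama words ranked within the cutoff
    beyond_cutoff: occurrences of drama words in the list but past the cutoff
    unlisted:      occurrences of drama words not in the list at all
    """
    top = set(ranked_words[:cutoff])
    rest = set(ranked_words[cutoff:]) - top

    known = beyond = unlisted = 0
    for word, count in drama_counts.items():
        if word in top:
            known += count
        elif word in rest:
            beyond += count
        else:
            unlisted += count
    return known, beyond, unlisted


# Toy data, purely hypothetical: word -> times it appears in the drama.
drama_counts = {"私": 50, "行く": 30, "空席": 2, "マイガール": 3}
# Toy frequency list, most frequent first.
ranked_words = ["私", "行く", "食べる", "空席"]

known, beyond, unlisted = split_by_cutoff(drama_counts, ranked_words, cutoff=3)
print(known, beyond, unlisted)

# Ratio as in the last step above (known occurrences per remaining occurrence).
print(known / beyond)
```

Keeping the third bucket separate makes it easy to see how many occurrences come from words the list does not contain at all, and to decide whether to count them as unknown.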
EDIT:
Only now did I realize I can try reading one of those subtitles to see for myself... I read the first 100 words of the first episode and looked up every word in a couple of those frequency lists: they are all frequent words, and in truth I already know most of them xD
The only word above 15000 was 空席, and it appeared twice in the same sentence.
Now... either the first 100 words are extremely easy and after that the drama becomes difficult, with an infrequent word every 8 words, or I don't know...
Tomorrow I'll continue reading. I'll read the whole first episode, and after that I'll report how many infrequent words appear and how often.
Edited: 2014-07-12, 3:08 pm