For those who are finishing or have finished the Core deck: are you close to N2, N1, or fluency? Can you watch a Japanese film or variety show and understand 90% or more of what's being said? When you listen to natives speaking Japanese, can you follow with almost full comprehension, without translating in your head? When there's incorrect grammar, do you know it's wrong because it feels wrong, because it sounds wrong? Did you pair the deck with anything, and if so, how much did that help compared to doing the Core deck alone? What's your next step? What are you doing now? Are you done learning Japanese? Finally, for those who stopped doing the Core deck: why did you stop, and what are you studying now to compensate for having stopped?
2014-06-21, 7:31 pm
2014-06-21, 8:05 pm
I've done the Core 10k plus 5k extra words of my own, and I'm nowhere near fluent. I still have to look up tons of words while reading manga, let alone real novels and newspapers. My listening comprehension isn't that great, much lower than my reading ability. When I first started, I thought 10k would be a lot too, but in reality it's just the absolute basics.
Edited: 2014-06-21, 8:05 pm
2014-06-21, 8:43 pm
I've done 2k and am currently on 3k. NHK Easy articles are becoming, well... easy to read. I can also read Doraemon and pick up most unknown words from context. I only listen to Japanese music and watch lots of J-dramas. Like, a whole damn lot.
I probably understand about 80% of variety shows. When I watched both of the new Sadako movies, I understood about 99% of what they said (short, simple dialogue). I don't have to translate from Japanese to English because Japanese has become natural to me.
I'm, however, far from fluent. I'm not going to bother studying grammar, ever; I don't feel the need to. I never had to in English, nor in my mother tongue. I can't hear when the grammar is wrong unless it's simple grammar, but I expect to pick this up once I start reading more difficult books.
I'm probably going to stop doing Anki at 10k and learn the rest by having fun (reading, watching, and listening).
Side note: I did 2k production ztyle (with a z for dem ztreet creds) in 2.5 months. That was 5 months ago. Since then I've only added about 1k new words, and most of them seem essential for everyday use. So at 3k words there's still a whole lot I can't comprehend. Anki + Core is without a doubt the best thing that ever happened to my Japanese.
2014-06-27, 12:53 pm
Well, those decks don't have grammar, but assuming you mean "how many words can you understand?", I think this is objectively answerable. The 10k deck is the 10k most frequent Japanese words. So if you have the frequency of each word, you can calculate the cumulative coverage of those words.
It takes about 95% coverage to understand basic meaning and about 98% coverage to understand smoothly. I don't know what the figures are for Japanese, but you could calculate per above. (If you do please post here.)
But for reference, for spoken English, you need about 3000 word families to have 95% coverage and 7000 for 98%. For English writing, you need 4000/9000. It's also worth noting that the coverage is very skewed. The first 2000 words will get you 86% coverage. Recognition of common proper nouns, exclamations, etc. gets you another 3-4%. The next 7000 word families will only get you 9%. To get that last 1%, you need another 50,000 words. (Of course native speakers don't know all 50,000 of them. These words are specialized and occur in certain contexts. The people dealing with those materials know the "specialist" vocabulary for their area.)
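The cumulative-coverage calculation described above is just a running sum over a frequency-sorted word list. A minimal Python sketch; the Zipf-like counts are invented for illustration, and a real list would of course come from a corpus:

```python
def cumulative_coverage(counts):
    """Given per-word token counts sorted from most to least frequent,
    return, for each n, the fraction of all tokens covered by the
    top-n words."""
    total = sum(counts)
    coverage, running = [], 0
    for c in counts:
        running += c
        coverage.append(running / total)
    return coverage

# Toy Zipf-like distribution: the k-th most frequent word occurs
# roughly 1/k as often as the most frequent one.
counts = [round(10000 / k) for k in range(1, 1001)]
cov = cumulative_coverage(counts)
# The skew is visible immediately: the top 10% of words cover the
# large majority of tokens.
print(f"top 100: {cov[99]:.0%}, top 1000: {cov[999]:.0%}")
```

With a real frequency list you would read (word, count) pairs, sort by count, and find the n at which coverage crosses 0.95 or 0.98.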
2014-06-27, 1:56 pm
MaxHayden Wrote: Well, those decks don't have grammar, but assuming you mean "how many words can you understand?", I think this is objectively answerable. The 10k deck is the 10k most frequent Japanese words. So if you have the frequency of each word, you can calculate the cumulative coverage of those words.
The problem is, doing Core 10k is not enough to know those 10k words well enough to read them or pick them out in conversation, let alone use them. For a lot of them, it's not even enough to know exactly what they mean.
2014-06-27, 4:04 pm
Well I guess we read the OP differently. I understood him as saying "If I learned the vocabulary in Core 10k etc." but you understood him as saying "If I just SRS the vocabulary in Core 10k etc." I.e. I took it as asking "how many words beyond this would I need to master?" Whereas you seem to have taken it to mean "What else would I need to do beyond SRS to master these words?"
Obviously these are different things. And I agree with you. It takes more than SRSing the core deck to gain input-fluency in those words.
I also don't know what kind of coverage you'd get with these words b/c I don't know how to interpret the vocab-frequency field that comes with the Core 10k deck. E.g. it says that する has a frequency of 9 and that 人 has a frequency of 27 (less frequent). Obviously it doesn't mean that every 9th word in Japanese is する (because that would imply that there were only 40 words in Japanese), but I don't know what that "9" represents, so I can't calculate a cumulative frequency for the OP. But if someone can explain what it means, I'll run a spreadsheet and report the calculations here.
Edit: I figured out that those entries are an index into a word-frequency file. That file includes grammatical particles and other tokens that aren't actually words. But I can use it to calculate cumulative frequency coverage. I'll try to get to it on Sunday.
Edited: 2014-06-27, 5:57 pm
2014-06-27, 6:10 pm
After finishing the core 6k sentences my reading comprehension was still at an absolute basic level, contrary to my expectations. It's only through general reading that it has improved.
I guess in hindsight this should have been obvious. Learning the answers to comprehension problems is not the same as learning to solve them.
I'm now at ~14k words and lookups are still frequent enough to be annoying.
2014-06-29, 6:49 pm
Okay, so I got the word-frequency.txt file that the core10k deck uses and calculated the cumulative frequency coverage of various vocabulary sizes. The problem is that the list includes proper names and grammatical particles (and maybe some other stuff) and the Core decks of various types aren't in strict frequency order. So Core 2k includes words that rank 3k-4k on the list, but also words that rank 10k to 20k.
So, using the word-frequency file and ignoring the fact that it includes grammar and other stuff: knowing the 200 most common "words" gets you 55% average coverage. 1200 gets you to 72%. 2000 is 78%. 6000 is 88%. 10000 is 92%. 20000 is 96%. 35000 is 98%. And that's the *average* across three different corpora. If you look at just the "novels" sub-component, you need 55200 entries to get 98% coverage.
Now, these numbers probably over-estimate what you have to learn by a great deal b/c of all the proper names, etc., but even if you say that 60% of the stuff in that list isn't a vocabulary word you have to know, you would need about 14000/22000 vocab to be able to communicate effectively / read. I.e. WAY more than the 10k in Core.
If someone has a good suggestion for easily "pruning" the frequency list to eliminate grammatical particles, proper names, and other non-words, I'll go back and make a better calculation.
But even under overly generous assumptions, the answer is that 10k words isn't enough to read smoothly even if you master them. And furthermore that mastery requires lots of comprehensible input and fluency practice, not just Anki work.
Edit: A somewhat different, and probably more pertinent, question to ask is how much vocabulary outside of the 10k any given novel contains that you would need to "pre-learn" to get up to 98-99% word recognition. Provided that these are individually manageable chunks, you could still read without having to have that large a vocabulary all at once.
It's also worth mentioning that Japanese is highly agglutinative. The ratio of morphemes per word is much higher than in English, and word composition is more regular. So it should be easier to learn a given "word family" in Japanese than in English, and the fall-off for Japanese from counting families instead of lemmas (like I did above) should be greater than the fall-off for English. Furthermore, because of the nature of the language, the number of *morphemes* you need to learn is probably much lower than the numbers above would imply. Unfortunately, I don't know of any good vocabulary books that teach you how Japanese word building works in the way that there are books for English and other languages. Maybe someone else here can suggest something.
Edited: 2014-06-29, 7:21 pm
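The per-book question in the edit above can be estimated directly once a text is tokenized. A sketch, assuming you already have the novel as a list of dictionary-form tokens (e.g. from a morphological analyzer) and a set of known words; the ten-token "novel" is obviously just for illustration:

```python
from collections import Counter

def prelearn_list(tokens, known, target=0.98):
    """Return the unknown words (most frequent first) that must be
    learned so that `known` plus those words cover `target` of the
    text's tokens."""
    total = len(tokens)
    covered = sum(1 for t in tokens if t in known)
    unknown = Counter(t for t in tokens if t not in known)
    to_learn = []
    for word, count in unknown.most_common():
        if covered / total >= target:
            break
        to_learn.append(word)
        covered += count
    return to_learn

# Tiny example: 8 of 10 tokens are known (80% coverage), so two
# words must be pre-learned to reach 95%.
tokens = ["猫", "が", "魚", "を", "食べる", "猫", "は", "可愛い", "仔猫", "だ"]
known = {"猫", "が", "を", "食べる", "は", "可愛い", "だ"}
print(prelearn_list(tokens, known, target=0.95))  # ['魚', '仔猫']
```

Because the unknown words are taken most-frequent-first, the pre-learn list for a real novel is much shorter than the raw count of unknown types.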
2014-06-29, 8:49 pm
The average native high-school graduate knows 40-45k words. You can take away some of that for words that might only come up in school or in childhood. Even so, I would say 30k is the minimum to be able to read something as smoothly as a native can. I think that is a realistic goal to shoot for, to be honest.
2014-06-29, 9:53 pm
kameden Wrote: Average high school graduate native knows 40-45k. You can take away some of that for words that might only come up in school or in childhood. Even so, I would say 30k is minimum to be able to read something as smoothly as a native can. I think that is a realistic goal to shoot for to be honest.
I also think it comes down to how you define a word: are 慣れる、慣れ、and 不慣れ variants of one word, or are they three separate words? Your number will depend heavily on whether you are counting individual words or word families.
2014-06-30, 5:14 am
Is this a good list, considering that Wikipedia covers a wide range of topics?
Wiktionary:Frequency lists/Japanese
Wiktionary:Frequency lists/Japanese10001-20000
EDIT:
Do you think it would be useful to extract statistical data about readings from that list?
Edited: 2014-06-30, 5:18 am
2014-06-30, 8:50 am
This made me curious about the origin of Core6K, so I did some digging. Here's what I found:
1. iKnow claims it's the 6k most frequent words. This, I found, is an oft-repeated claim around here too. But they also claim their Core 1k is the 1k most frequent words, which is not very likely. Plus, they don't mention WHERE these words are supposed to be the most frequent.
2. The most frequent opinion on where the word list comes from is that "no one knows".
3. For the second most frequent opinion, I found this detailed discussion: http://iknow.jp/qa/topics/213-Japanese-Core . People there say that it's mostly from newspapers, and was compiled back in the 80s by Japanese teachers.
4. It does not contain some very common words.
5. Towards the end, it has some archaic words that won't really help you.
In conclusion, it would probably be a good idea not to take the claim that these are the most frequent words at face value, and to use a better-documented frequency list to select which Core cards to review and which not to bother with. But I suspect that, as long as you're doing it in an optimized order, this won't be an issue early on. Still, later on there's no point in doing a bunch of words that aren't frequently used anymore.
2014-06-30, 2:14 pm
Well, if enough people are interested, we could probably make an "improved Core" for those who aren't using smart.fm to learn the vocab. Regrouping by frequency and rerunning the sort isn't that complicated.
The issue is that I'd like some confidence that the stuff that isn't in the 10k list (or the extra 15k deck that's floating around) but is on the lemma list is either grammatical or a proper name. Ideally, we'd also leave out transparent compound words made with agglutination so as to maximize the number of word families and morphemes you are learning. (Those words could be put in a separate deck that could be done more quickly.) But I don't really know how to go about that. Maybe we can look into how researchers did it with the British National Corpus for English since I'm fairly sure that they didn't do it by-hand.
If people are interested in this, I'll try to find out how this was done, and then we can decide whether we have enough people with the right skills to do it for Japanese.
Edit: So it turns out that MeCab can identify parts of speech, so it's theoretically possible to generate a frequency list with the particles and proper nouns filtered. I posted asking cb if his word frequency program supported using this feature. If it does, we could take his novels data, wikipedia, and maybe some other stuff to generate a vocabulary lemma list. That means the only issue would be going from lemmas to word families.
Edited: 2014-06-30, 3:44 pm
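For what it's worth, once MeCab has tagged a text, dropping particles and proper nouns from a frequency count takes only a few lines. The sketch below parses lines in MeCab's default tab-separated shape (surface form, then comma-separated features beginning with the part of speech); the tagged sample is hard-coded so it runs without MeCab installed, and the set of POS labels to skip is one reasonable choice, not the only one:

```python
from collections import Counter

# Lines in the shape MeCab emits by default: "surface\tPOS,subPOS,...".
# Hard-coded sample; in practice you would feed in
# MeCab.Tagger().parse(text).splitlines().
tagged = [
    "東京\t名詞,固有名詞,地域,一般",
    "の\t助詞,連体化,*,*",
    "猫\t名詞,一般,*,*",
    "が\t助詞,格助詞,一般,*",
    "走る\t動詞,自立,*,*",
    "猫\t名詞,一般,*,*",
    "EOS",
]

SKIP_POS = {"助詞", "助動詞", "記号"}  # particles, auxiliaries, symbols

def vocab_counts(lines):
    counts = Counter()
    for line in lines:
        if "\t" not in line:        # skips the trailing "EOS" marker
            continue
        surface, features = line.split("\t", 1)
        pos = features.split(",")
        if pos[0] in SKIP_POS:
            continue
        if pos[0] == "名詞" and len(pos) > 1 and pos[1] == "固有名詞":
            continue                # drop proper nouns
        counts[surface] += 1
    return counts

print(vocab_counts(tagged))  # 猫 twice, 走る once; 東京/の/が dropped
```

Run over a whole corpus, the resulting Counter is exactly the pruned frequency list the post is asking for.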
2014-07-01, 12:06 pm
What is the difference between a lemma and a word family? For example, if we take 白, 白い and 白バイ (taken from Core 1000; "white police motorcycle" ._. ), what is the lemma? And what is the word family? Do they coincide?
Or, when we say "I know 10k words and I still have difficulty reading", do we count words like 好き and 大好き as one word or two? Because if we count those as two separate words, and we do the same with similar words/compounds, then we are cheating with that "10k words".
Also, I think that when you know 10k words, some (many?) of the next 10k words are easier to learn. I'm still on chapter 10 of Genki and my vocabulary is practically non-existent, yet I can already infer some of the words in the second Wiktionary list:
http://en.wiktionary.org/wiki/Wiktionary...0001-20000
But if we aggregate words that could easily count as one (and delete proper names and surnames), then many of those 10k-20k words by frequency actually fit better in the 1-10k list...
(sorry if I don't explain myself well ._. )
Edited: 2014-07-01, 12:08 pm
2014-07-01, 12:21 pm
In English, a lemma is the "headword" and its inflected forms (book and books) plus variant spellings (favor, favour). The English inflections are plural, third person singular present tense, past tense, past participle, -ing, comparative, superlative, and possessive. There are a lot more inflections in Japanese, but the word lists we have list words by the headword of the lemma.
A word family would be everything included in the lemma and also modified by affixes like -ness, or un-. The issue is that Japanese has a more complicated morphology so there are a lot more things you can attach to a word. But generally speaking, if you say you know 10k words, it means you know 10k word families. Unfortunately our frequency lists are all by lemmas and I'm not sure how to turn them into word families. (For people learning English there are special vocabulary building textbooks that teach you morphology and introduce word families and affixes in a systematic order. I haven't been able to find anything comparable for Japanese.)
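To make the lemma-versus-family distinction concrete, here is a toy sketch using English morphology (the affix lists are illustrative and nothing like a real stemmer): lemmas collapse into one family when stripping a known derivational affix leaves a known headword.

```python
# Toy derivational affixes; real morphology has spelling changes,
# stacked affixes, and many more of them.
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "less", "ful", "ly")

def family_head(lemma, heads):
    """Map a lemma to its family headword by stripping one affix."""
    for p in PREFIXES:
        if lemma.startswith(p) and lemma[len(p):] in heads:
            return lemma[len(p):]
    for s in SUFFIXES:
        if lemma.endswith(s) and lemma[:-len(s)] in heads:
            return lemma[:-len(s)]
    return lemma

heads = {"happy", "use", "do"}
lemmas = ["happy", "unhappy", "useful", "useless", "redo", "do"]
families = {family_head(l, heads) for l in lemmas}
print(families)  # six lemmas collapse into three families
```

This is why counting families always yields smaller vocabulary-size figures than counting lemmas, and why the two kinds of frequency list in this thread aren't directly comparable.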
2014-07-01, 5:14 pm
Would you have words like 良 - りょう in your word list? Even though it's not really a word used by itself, it can be in words like 不良 and 最良.
Edited: 2014-07-01, 5:14 pm
2014-07-01, 7:05 pm
MaxHayden Wrote: I haven't been able to find anything comparable for Japanese.
Do[BIA]JG have appendices that address that issue to some extent. There you will find counters, declension tables, and common pre- and suffixes. I don't think they aim to be exhaustive, though.
2014-07-01, 7:29 pm
Thanks. That's extremely helpful. I took a look at them and see that there's a lot of information in B & I.
2014-07-06, 2:37 pm
So I spoke to someone at the University of Tokyo. He says that word families don't work as well for Japanese as they do for English and that he thinks we should just learn from a larger lexeme/lemma list instead.
He has recently compiled a frequency list from a large corpus. And he has a character frequency list as well. He suggests also looking at the frequency lists generated from the Balanced Contemporary Corpus of Written Japanese which is organized along slightly different lines.
Since these lists exclude proper nouns, grammatical particles, and the like, I'm going to rerun my frequency coverage stats against them when I have time. I'm also going to try to identify any words that are in Core 10k but aren't in the most frequent parts of those lists, and vice versa.
2014-07-07, 6:04 am
So Core 10k only teaches you the absolute bare basics? Then what do I do after I finish it? Go the AJATT route?
2014-07-07, 6:49 am
There's another deck that has the next 15k words after Core 10k, if you're interested.
But it's also pretty common to just do whatever you want in Japanese and use those materials as sources of new words.
Edited: 2014-07-07, 9:24 am
2014-07-07, 7:11 am
murtada Wrote: So Core 10k only teaches you the absolute bare basics? Then what do I do after I finish it? Go the AJATT route?
Yes, Core teaches you the absolute bare basics of 10,000 words. To learn more about those words (and other words, too), you have to read native materials. To learn any grammar at all, you had to look elsewhere in the first place.
2014-07-07, 7:30 am
murtada Wrote: So Core 10k only teaches you the absolute bare basics? Then what do I do after I finish it? Go the AJATT route?
Yes, in a way; however, you could get further with 10k sentences than Core does. After Core you may wish to read up on some basic grammar (Tae Kim, etc.) and then read native materials, using Rikaisama to import words into Anki. Core has a few problems:
1. Dated vocabulary. It includes (or used to, depending on your version) words like "communist party member" within the first 6k, since the list was generated from old newspapers. I believe Rikaisama puts that word at ~30k in frequency.
2. The vocabulary has been taken from newspapers, so it is more beneficial if you intend to read newspapers after core (this is why core contains many political terms). It's going to be less helpful for novels, manga, etc.
3. If you're reading the sentences, the grammar barely goes above JLPT N5. This is good for learning words (especially your first few thousand), but isn't so useful afterwards. You would get more from Core 6k if you knew some grammar and changed the example sentences to include some more grammar points.
Ideally, you could have the original sentence and one using another grammar point.
4. As Stansfield123 said, it doesn't include some very basic words (e.g. "hello"). These you can easily add yourself or pick up from most audio courses.
There's an optimised core deck on here (and ankiweb) that contains another frequency index based on the analysis of a few thousand books. You could possibly remove the least frequent items from core...
2014-07-07, 11:55 am
If you like anime then why not start doing some subs2srs on your favorite shows, get your comprehension way up? I like Death Note, with useful phrases like '素敵な殺し方' ;-)
You can start consuming material way before finishing Core 6k. Having a 6k or 10k vocab with N5 grammar is a shame, really. You need to process sentences that require some thought.
2014-07-07, 4:26 pm
MaxHayden Wrote: Since these lists exclude proper nouns, grammatical particles, and the like, I'm going to rerun my frequency coverage stats against them when I have time. I'm also going to try to identify any words that are in Core 10k that aren't in the most frequent parts of those lists and vice versa.
So I took a look at these today. It seems that both of them include grammatical particles (there are fewer than 200) as words, but not proper names and other "assumed known words".
For the sake of clarity: 95% coverage means you can get the gist of what is being said and probably understand it with help from some of the software available here; less than this and your working memory will probably be under too much strain. 98% is the ability to read smoothly and learn from context. 99% would be near-native.
Prof. Matsushita's word list is more usable and groups some simply-related very-high-frequency words (e.g. it assumes you know 二十 if you know 二 and 十). He has a separate database of "assumed known words" which includes proper names and some other stuff that you should be able to understand if you know the language. (But strange proper names that you *wouldn't* be able to know are included in his vocabulary list.) His list is based on a large collection of books which probably puts it more in line with what people here want to do. (And if I understood what his 10 "domains" meant, we might be able to exclude certain kinds of academic words to shorten the amount you needed to learn.)
So, using his word list: 200 words gets you 62% coverage. 1200 gets you 74%. 2000 gets 83%. 6000 gets you 92%. 10k gets you 95%. 20k gets you about 98%. 31k gets you over 99% coverage. These numbers line up with about what we'd expect given other languages' frequency distributions for lemmas.
The BCCWJ word list has several versions. I used the "suw" version b/c it's the first one on the page (and the second largest file). I don't know what the different codes for the sub-corpora stand for, so I just used the overall frequency. 95% coverage occurs around the 14k mark. 98% coverage is around 32k. 99% coverage is around 50k. I'm not sure exactly what the difference is between this and Matsushita's list, but it doesn't seem to be b/c of proper names (searching for e.g. 東京 gets no hits). I *think* the difference has to do with how the two lists were grouped morphologically, with the BCCWJ treating some things as compound words that Matsushita treats as two separate ones (and hence he has fewer words and higher coverage).
So, as time allows over the next few months, I'm going to try to make use of Prof. Matsushita's word list to make a "Core 20k" deck out of the Core10k and CorePlus decks. If time allows I'll also make a "Plus" deck that includes the next 11k (and anything left over from the Core, Plus, or JLPT lists that didn't make it.)
Edited: 2014-07-07, 4:26 pm
