kanji koohii FORUM
Future Self to Past Self: To 10k or not to 10k? - Printable Version

+--- Thread: Future Self to Past Self: To 10k or not to 10k? (/thread-12773.html)



Future Self to Past Self: To 10k or not to 10k? - z1bbo - 2015-06-11

Even if you learn all of the words contained in, let's say, one anime, it would take you quite some time (probably months if you are low level). And you will end up with lots of rarer words that in turn won't be of any use for the next piece of media you turn to.

What you could do is create a new core by scanning a few hundred (or thousand) subtitle files and learning the top xx k words or something.
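
A minimal sketch of that idea, under a big assumption: the subtitle text has already been split into words. Japanese has no spaces, so a real pipeline would need a morphological analyzer first; here the tokens are simply whitespace-separated. The function name `build_frequency_list` and the sample lines are hypothetical, made up for illustration.

```python
from collections import Counter
from typing import Iterable, List


def build_frequency_list(documents: Iterable[str], top_n: int) -> List[str]:
    """Return the top_n most frequent tokens across all documents.

    Assumption: every document is already segmented into words,
    separated by whitespace.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    # most_common orders tokens from most to least frequent
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical pre-segmented subtitle lines standing in for real files.
subs = [
    "私 は 学生 です",
    "私 は 猫 が 好き です",
    "猫 は 可愛い です",
]
top_words = build_frequency_list(subs, top_n=3)
print(top_words)
```

With a few hundred real subtitle files in place of `subs`, the same counting would give a "core" ordered by how often each word actually shows up in that kind of material.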


Future Self to Past Self: To 10k or not to 10k? - cophnia61 - 2015-06-11

yogert909 Wrote:
EratiK Wrote:I've been watching anime with J-subs for a year now (and considering anime and manga are more or less the same material), and except for maybe 10% all the core 2k/6k words are used. Why? Because core vocab are really, really frequent words. I'd even go as far as saying about half of 10k have already showed up in shows I've watched.
I highly doubt 90% of core6k regularly shows up in anime. As I mentioned earlier, I've been comparing wordlists of various corpora and the overlap is a lot less than I expected. It's a project that I'm still working on when I have time between work and studying (I'm still studying core actually..) so I hesitate to throw precise numbers out there. But I will say that I've compared core 6k against the several thousand subtitles on dramanote and there's not one episode that core covers even 60% of unique words. The average coverage seems to be in the low 40% range.

As I mentioned, I'm still studying core 6k but my little experiments with wordlists are making me consider adding more native material into my study and focusing my study to specific works. Not that core words aren't common, useful words and I intend to learn every one of them eventually. But I'm studying Japanese in order to use Japanese. I've studied hundreds of hours and I'd like to use Japanese NOW if possible, not next year.

This question for me seems to boil down to enjoyment. Clearly it would be useful to know every word of core10k. But wouldn't it be more enjoyable if I studied 2k words and could understand every word of an anime or manga or news article? If I have an electronic text, it's not hard to use a few tools like cb's japanese text analysis tool to make myself a core norwegian wood, or a core totoro, or a core whatever. I don't know how many unique words there are in the average anime movie - but it's got to be around 2k, manga even less, and a news article, a few hundred if that. What if I study those words and then enjoy Japanese a lot sooner? Then learn some more words and enjoy another book or movie? I'd still learn 10k common words, but it would be much more enjoyable.

I haven't tried this yet and like I said, I'm still studying core... But I'm getting a little disillusioned with core, so we'll see..
It's true that all 10k core words are common, but the same is true for the 22k words marked as common in jisho.org. I think if you ask a Japanese person whether he knows all those 22k words, he probably knows almost all of them and more.

But in order to use the Japanese language as soon as possible, you must know the words specifically used in the material you want to consume. So if one chooses to study from frequency lists, why not use a frequency list based on what he wants to read?

I'm almost sure that if you do the same experiment using a novel-based word list, the number will be higher than with core10k.


Future Self to Past Self: To 10k or not to 10k? - cophnia61 - 2015-06-11

Ok, I tried it with cb's Japanese Text Analysis Tool, and these are the results:


Zero no Tsukaima vs the first 7000 words in Core10k:

- 28,04 45596 12783 32813 40,00 4538 1815 2723


Zero no Tsukaima vs the first 7000 words in that famous list based on the innocent novel corpus:

- 90,91 45596 41452 4144 58,15 4538 2639 1899


the readme says:

Field 1: Readability expressed as a percentage (0-100) of the total number
of non-unique known words vs. the total number of non-unique words.
Field 2: Total number of non-unique words
Field 3: Total number of non-unique known words
Field 4: Total number of non-unique unknown words
Field 5: Readability expressed as a percentage (0-100) of the total number
of unique known words vs. the total number of unique words.
Field 6: Total number of unique words
Field 7: Total number of unique known words
Field 8: Total number of unique unknown words
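
To make the eight fields concrete, here is a rough sketch of how they could be computed from a list of words and a set of known words. This is not the tool's actual code, just my reading of the readme; the function name `readability_fields` and the tiny sample are hypothetical.

```python
from typing import List, Set, Tuple


def readability_fields(
    tokens: List[str], known: Set[str]
) -> Tuple[float, int, int, int, float, int, int, int]:
    """Compute the eight fields described in the readme (as I read it)."""
    total = len(tokens)                                  # field 2
    known_total = sum(1 for t in tokens if t in known)   # field 3
    unique = set(tokens)
    unique_total = len(unique)                           # field 6
    unique_known = len(unique & known)                   # field 7
    pct_total = 100.0 * known_total / total if total else 0.0            # field 1
    pct_unique = 100.0 * unique_known / unique_total if unique else 0.0  # field 5
    return (
        pct_total, total, known_total, total - known_total,
        pct_unique, unique_total, unique_known, unique_total - unique_known,
    )


# Hypothetical, already-segmented sample text.
tokens = ["猫", "は", "猫", "が", "好き", "は"]
known = {"猫", "は"}
fields = readability_fields(tokens, known)
print(fields)
```

In this sample the known words are only half of the unique words but two thirds of the running text, which is the same kind of gap as between field 5 and field 1 in the results above.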


So if I understand it right: if you study from the second list, you will know only 58% of the unique words in Zero no Tsukaima, but those are the most frequently used ones, so while reading you will recognize about 91% of the words you encounter. The 42% of unique words you don't know account for only about 9% of the text.

Whereas if you study the first list (Core10k), you will know a good 40% of the unique words used in that light novel, but those words make up only 28% of the running text, because the 60% of unique words you don't know are the ones used the most.
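
The percentages can be rechecked from the raw counts in the two result lines. A small sketch (the helper name `coverage_pct` and the dictionary keys are mine, the numbers are the ones posted above):

```python
def coverage_pct(known: int, total: int) -> float:
    """Known items as a percentage of all items, rounded to 2 decimals."""
    return round(100.0 * known / total, 2)


# Counts copied from the two result lines (fields 2, 3, 6 and 7).
core10k = {"words": 45596, "known_words": 12783, "unique": 4538, "known_unique": 1815}
novel_list = {"words": 45596, "known_words": 41452, "unique": 4538, "known_unique": 2639}

results = {}
for name, r in (("core10k", core10k), ("novel_list", novel_list)):
    results[name] = (
        coverage_pct(r["known_words"], r["words"]),    # share of running text
        coverage_pct(r["known_unique"], r["unique"]),  # share of unique words
    )
    print(name, results[name])
```

Both pairs come out matching fields 1 and 5 of the tool's output, so the gap between "unique words known" and "text covered" is in the counts themselves, not a rounding artifact.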


So if you want to read visual novels as soon as possible, the word selection and especially the ordering of Core10k may not be the best. Obviously those words are used, maybe not all in this novel but in thousands of others. But frequency lists exist because you want to learn the most-used words first, so the "they are all words a Japanese person knows" argument doesn't hold here.

A good idea could be to study the first 7000 words that will put you at 90% coverage; at that point the remaining 10% depends on what you want to read, so you can mine it from Japanese sources.
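
Mining the remaining words could look roughly like this: take the words of the text you want to read, drop the ones you already know, and sort what is left by how often it appears. This is only a sketch that assumes the text is already split into words; `mine_unknown_words` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def mine_unknown_words(tokens: List[str], known: Set[str]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs for unknown words, most frequent first."""
    counts = Counter(t for t in tokens if t not in known)
    return counts.most_common()


# Hypothetical segmented text and known-word set.
tokens = ["使い魔", "は", "魔法", "を", "使い魔", "魔法", "使い魔"]
known = {"は", "を"}
to_study = mine_unknown_words(tokens, known)
print(to_study)
```

Studying `to_study` from the top down means the words that block you most often in that particular text get learned first.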


Future Self to Past Self: To 10k or not to 10k? - EratiK - 2015-06-11

yogert909 Wrote:I highly doubt 90% of core6k regularly shows up in anime. As I mentioned earlier, I've been comparing wordlists of various corpora and the overlap is a lot less than I expected. It's a project that I'm still working on when I have time between work and studying (I'm still studying core actually..) so I hesitate to throw precise numbers out there. But I will say that I've compared core 6k against the several thousand subtitles on dramanote and there's not one episode that core covers even 60% of unique words. The average coverage seems to be in the low 40% range.
Is that core 6k alone (4,000 words)? How is the coverage of core 2k+6k on those dramanote subs? Also I've found drama to be more challenging than anime, so when you have the time, it would be nice to know how much of strictly core 10k and how much of core 2+6+10 is really in your dramanote corpus for a better overview.

But yes, I don't have hard numbers, it's just my impression. Browsing through 2k, it definitely feels like 90%. I've methodically browsed through about 5% of 6k and so far the average coverage seems to be 80% (once you've seen a word in the wild, the feeling "seen in the wild" remains), some I even heard in Gingitsune just a few days ago (I watched that with English subs, but I can still listen).

That said you're not the first one to express disappointment in core -- it's usually from people who try to read LN/novels as I recall -- so I guess I can count myself lucky not to have experienced it. As broad a tool as core might be, it's common sense that statistically it shouldn't be useful to some, but I'm not convinced those are the majority yet.

yogert Wrote:This question for me seems to boil down to enjoyment. Clearly it would be useful to know every word of core10k. But wouldn't it be more enjoyable if I studied 2k words and could understand every word of an anime or manga or news article? If I have an electronic text, it's not hard to use a few tools like cb's japanese text analysis tool to make myself a core norwegian wood, or a core totoro, or a core whatever. I don't know how many unique words there are in the average anime movie - but it's got to be around 2k, manga even less, and a news article, a few hundred if that. What if I study those words and then enjoy Japanese a lot sooner? Then learn some more words and enjoy another book or movie? I'd still learn 10k common words, but it would be much more enjoyable.
Yes, but some of us can't use those tools. Plus as a beginner I was often overwhelmed, and I was more interested in a general understanding skill than just understanding Totoro. Plus, let's say there are 2k unique words in a movie? 2k is two months of learning, meaning you would spend 2 months on Totoro? In a way it's almost more boring than core.

There's also the problem of learning the words before you enjoy the material (at least for me, enjoyment is directly linked to understanding). Regardless of where the words are from, you still have to grind through the gruesome task of ankying them. This is also why intensive reading (of subs) is so much more enjoyable for me, because I barely anki anything: just do a look-up and move on, it will stick eventually. But I just couldn't imagine doing that at a low level (because the better you get, the more any text tends to look like an i+1, and even if it doesn't, you have the power to parse your own i+1 from an i+4. As a newbie, whenever there's too much information you break down).


Future Self to Past Self: To 10k or not to 10k? - ktcgx - 2015-06-11

Apparently the CJK databases contain over 24 million entries, so I would say that can't all be from newspapers.

http://en.wikipedia.org/wiki/CJK_Dictionary_Institute

As for those having problems with LN after Core, it's pretty much the case with authors around the world, that they have a large vocabulary and don't tend to use the more common words, because the less common ones they do use have a better 'flavour' or 'atmosphere' or whatever to suit their purposes.


Future Self to Past Self: To 10k or not to 10k? - ktcgx - 2015-06-11

yogert909 Wrote:But I'm studying Japanese in order to use Japanese. I've studied hundreds of hours and I'd like to use Japanese NOW if possible, not next year.
I think this is a common misconception that language learners have, that somehow after X words or grammar, they'll be able to 'use' the language. It's not true, you can't ever 'use' the language until you dive straight in and *use* it. Whether that's right from the beginning trying to muddle your way around Japan with a phrase book, picking up basic grammar and vocab as you need it, or by joining a conversation group, or tweeting in Japanese, or whatever, you will never be a language user until you actively use the language, no matter your level.


Future Self to Past Self: To 10k or not to 10k? - CureDolly - 2015-06-11

yogert909 Wrote:Not that core words aren't common, useful words and I intend to learn every one of them eventually. But I'm studying Japanese in order to use Japanese. I've studied hundreds of hours and I'd like to use Japanese NOW if possible, not next year.

This question for me seems to boil down to enjoyment. Clearly it would be useful to know every word of core10k. But wouldn't it be more enjoyable if I studied 2k words and could understand every word of an anime or manga or news article? If I have an electronic text, it's not hard to use a few tools like cb's japanese text analysis tool to make myself a core norwegian wood, or a core totoro, or a core whatever. I don't know how many unique words there are in the average anime movie - but it's got to be around 2k, manga even less, and a news article, a few hundred if that. What if I study those words and then enjoy Japanese a lot sooner? Then learn some more words and enjoy another book or movie? I'd still learn 10k common words, but it would be much more enjoyable.
Yes, this is pretty much what I've been saying too. In fact the approach can be even less "study"-oriented than this. I just plunged in and added words as I go along. I realize some people hate doing that, and since the idea is to use and enjoy Japanese, if making decks in advance works better for an individual, that's the way to go.

One of my most admired (and advanced) senpai doesn't use Anki at all and never has. She relies on massive exposure to reinforce words. I am still using the safety net of Anki to capture and reinforce words, but I'm moving slowly in her direction.


Future Self to Past Self: To 10k or not to 10k? - yogert909 - 2015-06-11

cophnia61 Wrote:Zero no Tsukaima vs the first 7000 words in Core10k:
- 28,04 45596 12783 32813 40,00 4538 1815 2723

Zero no Tsukaima vs the first 7000 words in that famous list based on the innocent novel corpus:
- 90,91 45596 41452 4144 58,15 4538 2639 1899
What's crazy is the first number. I wouldn't have guessed you could go from knowing 28% of the words on the page to over 90% just by changing which frequency list you study. That's HUGE. 90% is close to the level where you can infer the meaning of many of the unknown words.


Future Self to Past Self: To 10k or not to 10k? - yogert909 - 2015-06-11

CureDolly Wrote:Yes, this is pretty much what I've been saying too. In fact the approach can be even less "study"-oriented than this. I just plunged in and added words as I go along. I realize some people hate doing that, and since the idea is to use and enjoy Japanese, if making decks in advance works better for an individual, that's the way to go.
Unfortunately I do 99% of my study on my iPhone, so Rikaisama is out. I'm still figuring out how I want to go about it. I'm currently considering either SABU or parallel texts. SABU allows you to save words to a flashcard, but I don't believe there's an export function... maybe a little hacking will get me there. Parallel texts, well, you don't ever have to look anything up or save anything.

I've also become VERY interested in buonaparte's listening-reading method. Or as he/she would say, LISTENING-reading. It's hard to figure out based on the descriptions you'll find by casual googling, but I did a little write-up a few days ago if anyone is interested.


Future Self to Past Self: To 10k or not to 10k? - CureDolly - 2015-06-11

Thank you for this link and for writing the piece. It actually makes the very confusing descriptions of the method a lot clearer!

Interestingly I would say that while I don't use this method, something akin to this is why we have always recommended the Japanese-subtitled anime learning method, rather than, say, doing the same thing with manga or books. While the latter is indeed possible, with anime you are hearing Japanese at the same time as reading it.

Currently I am going pretty rapidly through an entire 52-episode series with subtitles. It is one of my first experiments in truly massive exposure. Essentially I am doing what I have been doing from the beginning but at a much faster pace, and I do find that word/collocation reinforcement is much more than I might have expected. While I am using written material too, for me at least, reading/listening at the same time so that the two are pretty much seamless is fundamentally valuable.

And this is relevant to the Anki discussion because I have a (pragmatic) eye on building massive exposure to the point where it can replace Anki.


Future Self to Past Self: To 10k or not to 10k? - yogert909 - 2015-06-11

I should mention for anyone who's curious about L-R that buonaparte specifically cautions that watching movies with subtitles is not L-R for a few reasons:
1. The picture distracts from listening.
2. You have to stop the audio to look up new words
3. Subtitles come on when the actor speaks so you cannot read ahead.
4. Subtitle translations are usually sloppily done and often don't accurately express what the dialog is saying.

Of course I will probably go the subtitle route, so that's not to say there's anything wrong with subtitles. Only to clarify L-R and also maybe suggest some avenues for improvement - (e.g. printing out parallel text subtitles in advance negates 2 and 3)


Future Self to Past Self: To 10k or not to 10k? - yogert909 - 2015-06-11

EratiK Wrote:
yogert909 Wrote:...I've compared core 6k against the several thousand subtitles on dramanote and there's not one episode that core covers even 60% of unique words. The average coverage seems to be in the low 40% range.
Is that core 6k alone (4,000 words)? How is the coverage of core 2k+6k on those dramanote subs?
Sorry, my core 6k deck has 6,000 words, so I meant 6,000 words. Of course it's not exactly 6k words, as a few 'words' are combinations of previous words (e.g. 長袖 - long sleeves is a rehash of 長 long and 袖 sleeves, which are also included in the count).


Future Self to Past Self: To 10k or not to 10k? - CureDolly - 2015-06-12

yogert909 Wrote:I should mention for anyone who's curious about L-R that buonaparte specifically cautions that watching movies with subtitles is not L-R for a few reasons:
1. The picture distracts from listening.
2. You have to stop the audio to look up new words
3. Subtitles come on when the actor speaks so you cannot read ahead.
4. Subtitle translations are usually sloppily done and often don't accurately express what the dialog is saying.

Of course I will probably go the subtitle route, so that's not to say there's anything wrong with subtitles. Only to clarify L-R and also maybe suggest some avenues for improvement - (e.g. printing out parallel text subtitles in advance negates 2 and 3)
Yes, the subtitle method isn't the same as L-R and doesn't have exactly the same aims. I am sure the criticisms are valid from an L-R perspective. From our perspective I would say:

1. The picture distracts from listening.
The picture helps immersion in the entire experience which is what we would consider the primary concern.

2. You have to stop the audio to look up new words
Yes, but that isn't a problem from our perspective.

3. Subtitles come on when the actor speaks so you cannot read ahead.
Not a problem from our perspective, but it is easy to adjust if you want to. You don't even need to re-time the subs. In VLC you can move soft subs forward or back as far as you like in relation to the video with the H and J keys. I do this all the time to get the subs just where I want them in relation to the speech.

4. Subtitle translations are usually sloppily done and often don't accurately express what the dialog is saying.
Very true. In fact I wrote a whole article about why even the best translations are bad.

However we don't use translations. The subtitles are Japanese only.

This isn't to criticize L-R. I don't know enough about it to criticize in any case. These are two different approaches, though they do seem to have an area of overlap, particularly in the key principle of hearing and reading in conjunction.


Future Self to Past Self: To 10k or not to 10k? - kraemder - 2015-07-18

I would go to 3k or maybe 4k if I did it again. Basically up until N3. After that, the example sentences are just too simple and you need to be practicing with something harder. You may think that the native audio makes up for it but I think it hurt me when studying for the N2.