
Nayr's Core VS Core2k

#26
yogert909, that's interesting. Could you do the same analysis for the complete 10k too (incl. the word/lemma count)? Thanks!

Can you explain what "coverage" means? How is the calculation done, and what influences the values? Is NHK Easy the smallest of the 4 corpora, and is that why its coverage is the highest? Does word frequency play any role?

What about the mentioned の, に & ね (and は, を, が, も, だ/です, ...)?

What about the more than 30 一年, 二年, 三年, 四年, ...? Are these separate words? Probably not; in that case there would be loads of "word" entries for years and numbers in Wikipedia.
#27
Matthias Wrote:yogert909, that's interesting. Could you do the same analysis for the complete 10k too (incl. the word/lemma count)? Thanks!
I'd love to but no promises. I'll see what I have time for.

Matthias Wrote:Can you explain what "coverage" means? How is the calculation done, and what influences the values? Is NHK Easy the smallest of the 4 corpora, and is that why its coverage is the highest? Does word frequency play any role?
Coverage is the number of lemma tokens in the corpus that also appear in the deck, divided by the total number of lemma tokens in the corpus. The lemmas are non-unique, so の, for instance, is counted however many times it appears. For example, core6k has ~76% coverage of the anime corpus, so if you know every lemma in core, you should know 76% of the lemmas you read.

The calculation would be much different had I used unique lemmas as that would treat the most common lemma as just as valuable as the rare lemmas. But I don't think that calculation is as helpful. The calculation would also be much different had I used the single vocabulary word on each card instead of all of the vocabulary used in each sentence. But again, I think this calculation is most helpful as I believe most people are studying the sentences.
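The two calculations described above can be sketched in a few lines of Python. This is a minimal sketch with toy lemma lists; the real analysis would first need a morphological analyzer such as MeCab to lemmatize the corpus and the deck sentences, and the function names are mine, not yogert909's.

```python
from collections import Counter

def token_coverage(corpus_lemmas, deck_lemmas):
    """Share of corpus lemma tokens (non-unique) covered by the deck."""
    counts = Counter(corpus_lemmas)               # lemma -> occurrence count
    known = set(deck_lemmas)
    covered = sum(n for lemma, n in counts.items() if lemma in known)
    return covered / sum(counts.values())

def type_coverage(corpus_lemmas, deck_lemmas):
    """Share of unique corpus lemmas covered -- weighs a rare lemma
    the same as the most common one."""
    types = set(corpus_lemmas)
    return len(types & set(deck_lemmas)) / len(types)

# toy data: の appears three times, so the token count weighs it three times
corpus = ["の", "の", "の", "猫", "走る"]
deck = {"の", "猫"}
print(token_coverage(corpus, deck))   # 0.8 (4 of 5 tokens)
print(type_coverage(corpus, deck))    # 0.666... (2 of 3 unique lemmas)
```

Even on this toy corpus the two ratios diverge exactly the way the post describes: the token version rewards knowing の three times over, while the type version does not.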

edit: you are correct, nhk easy is the smallest corpus at roughly 900 articles. D-addicts is ~3300 episodes and kitsunekko is about 20k episodes. Wikipedia is by far the largest corpus. Corpus size shouldn't matter much in the calculation.

Matthias Wrote:What about the mentioned の, に & ね (and は, を, が, も, だ/です, ...)?
Yes, all of those are included and are very common, so you probably get 10% coverage just from those 10.

Matthias Wrote:What about the more than 30 一年, 二年, 三年, 四年, ...? Are these separate words? Probably not; in that case there would be loads of "word" entries for years and numbers in Wikipedia.
一 and 年, 二 and 年, 三 and 年, 四 and 年 are all individual lemmas, so the combinations don't count as lemmas of their own. Romaji words don't count and neither do Roman numerals, but 一つ、二つ、三つ、四つ are counted as lemmas. If you are curious what counts as a lemma, click this link and you will see the frequencies for the first 10,000 lemmas of the wikipedia corpus.
Edited: 2015-09-18, 2:18 am
#28
I'm having trouble finding a working link for core10k. If anyone can direct me, I will add it to the comparison. Also, I'm not sure if my innocent corpus is complete, so if I can find a link I'll add that to the comparison as well.
#29
For a more apples-to-apples comparison, here is the same analysis, except this time it's only the first 4995 core sentences (nayr's deck has 4995). This time nayr's deck takes the win in the wikipedia category, while core5k is still slightly ahead in the other 3 categories. But overall they are pretty close.

core 5k coverage
74.10% anime
60.17% drama
83.42% nhk
61.82% wikipedia

nayr 5k coverage
72.65%: anime
59.47%: drama
81.57%: nhk easy
62.42%: wikipedia
#30
yogert909 Wrote:For a more apples-to-apples comparison, here is the same analysis, except this time it's only the first 4995 core sentences (nayr's deck has 4995). This time nayr's deck takes the win in the wikipedia category, while core5k is still slightly ahead in the other 3 categories. But overall they are pretty close.

core 5k coverage
74.10% anime
60.17% drama
83.42% nhk
61.82% wikipedia

nayr 5k coverage
72.65%: anime
59.47%: drama
81.57%: nhk easy
62.42%: wikipedia
I really find that hard to believe, because doesn't Nayr's core have particles and stuff? Those account for almost all of the language, and core doesn't have them. Wouldn't that account for an incredibly high amount of the text?
#31
yogert909 Wrote:If you are curious what counts as a lemma, click this link and you will see the frequencies for the first 10,000 lemmas of the wikipedia corpus.
I was indeed curious, thanks! The first 10 "no real word" lemmas give a coverage of 23% of Wikipedia, so 10% was an underestimate.

(In contrast to Nayr's deck,) Core does not have any of these particles or morphological suffixes as separate vocab entries, and the same goes for several very basic words like いる, ある or 日本.

Pretty sure that no one who goes through Core misses that stuff - but just the first 30 uncovered lemmas of Wikipedia already add up to a missing calculated coverage of 30%. So I guess you have to be quite cautious when drawing conclusions from these kinds of figures:

#  Wiki rank  Lemma  Missing (cumulative)
01 01 の 5.2%
02 02 に 8.1%
03 04 は 10.7%
04 05 を 13.1%
05 06 た 15.3%
06 07 が 17.3%
07 09 と 18.9%
08 10 て 20.4%
09 11 で 21.7%
10 12 だ 22.9%
11 13 れる 23.9%
12 15 いる 24.8%
13 16 ある 25.7%
14 18 も 26.2%
15 19 から 26.7%
16 21 ない 27.1%
17 22 こと 27.4%
18 23 第 27.8%
19 24 として 28.0%
20 25 や 28.3%
21 29 など 28.5%
22 30 日本 28.8%
23 32 ため 29.0%
24 34 この 29.1%
25 36 られる 29.3%
26 37 その 29.5%
27 39 へ 29.6%
28 41 まで 29.8%
29 44 号 29.9%
30 47 という 30.0%
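The running "Missing" column above can be reproduced from any (lemma, count) frequency list with a short script. This is a sketch with hypothetical counts for illustration, not the real Wikipedia numbers; the function name is mine.

```python
def cumulative_missing(freq_list, uncovered):
    """For each uncovered lemma, the running share of corpus tokens
    lost by not having it as a vocab entry.

    freq_list: (lemma, count) pairs sorted by corpus frequency.
    uncovered: lemmas assumed absent from the deck's vocab entries.
    """
    total = sum(count for _, count in freq_list)
    missing = 0
    rows = []
    for rank, (lemma, count) in enumerate(freq_list, start=1):
        if lemma in uncovered:
            missing += count
            rows.append((rank, lemma, missing / total))
    return rows

# hypothetical counts, chosen only to show the shape of the table
freqs = [("の", 52), ("に", 29), ("猫", 10), ("は", 9)]
for rank, lemma, share in cumulative_missing(freqs, {"の", "に", "は"}):
    print(f"{rank:2d} {lemma} {share:.1%}")
```

Each printed row mirrors the table: corpus rank, the uncovered lemma, and the cumulative share of tokens missed so far.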
#32
23%? That's amazing. I suspected it might be higher than 10%, but I couldn't imagine it was over 20%.

Just to be clear about the numbers I posted - I extracted the lemmas from the sentences in both decks and then compared them against each corpus. So both decks include "words" like の、二、いる、etc. Indeed, it would be much different if I had used only the single vocabulary words. But I think including the sentences is more helpful, as most people do the sentences. At the very least, including particles and the like makes sense, as it's inconceivable that somebody does core 6k without learning particles somehow.
#33
So I am currently working on the sentence aspect of the Japanese learning system I'm making.

Initially I am going to cover the most frequent 5000 Japanese words based on the BCCWJ and CSJ corpora. I am working to not only cover all of the 5000 words, but also to cover all of the (non-obscure) varying nuances of each word.

I'm expecting 15,000 - 20,000 original sentences, keeping their frequency order, but also being n+1. I am also manually linking each newly introduced word to a dictionary definition from WWWJDIC.

I want it to be accessible and usable by beginners, so before that will come a modified version of the sentences taken from Tae Kim's grammar guide (with voice), also ordered via n+1.

I am trying to set it up so there won't be needless double-ups. So if you have already learned a nuance of a word or a grammar point in the modified Tae Kim's, it won't appear again as a learning point in the cards that follow.

I then intend to do the same for the 5001 - 10,000 next most frequent words from the BCCWJ and CSJ corpora.
#34
yogert909 Wrote:It's also worth noting that nayr's deck seems to re-use more words, as his sentences contain only 4656 unique words (lemmas) vs 7057 (52% more!) in core.
yogert909 Wrote:For a more apples-to-apples comparison, here is the same analysis, except this time it's only the first 4995 core sentences (nayr's deck has 4995).
Hm. This will make me seem a little entrenched in my original position, but that is cheating. Just because you cram more words per sentence into your deck doesn't make your deck better. In fact, it makes it worse. Why are there more than 6000 words in a 6000-card deck?

I think the most relevant stat would be calculated like this: find the cutoff point in Core6k, where it covers 4656 unique words. Then see how much of the Japanese language those 4656 words cover, compared to the 4656 words Nayr's deck uses. That will tell you which deck is more relevant.

Comparing 6K sentences with 7057 words to 5K sentences with 4656 words is obviously unfair. But comparing 5K sentences with ~6000 words to 5K sentences with 4656 words is also unfair. What matters is the number of words, not the number of sentences. Just because you cram more words into the same number of sentences doesn't make those words easier to learn.

Nayr's deck teaches you 4656 words, that cover a certain percentage of the Japanese language. I think it's fair to assume that learning 4656 words by going through a deck that has 5K sentences isn't harder than learning 4656 words by going through a deck that has fewer sentences.

So, let's see what percentage the first 4656 words that Core6K (the optimized version) teaches you, cover. If it's more than what Nayr's 4656 words cover, then hats off to Core6K optimized. If it's less, then people should go with Nayr's deck, because it achieves the same coverage with a lot less effort.
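The cutoff point proposed above is easy to compute once each sentence is reduced to its lemmas. A sketch, with a toy deck; in practice sentences_lemmas would come from lemmatizing the real deck with a morphological analyzer, and the function name is hypothetical:

```python
def cutoff_for_vocab(sentences_lemmas, target):
    """Number of sentences (from the top of the deck) needed before
    `target` unique lemmas have appeared; None if never reached."""
    seen = set()
    for i, lemmas in enumerate(sentences_lemmas, start=1):
        seen.update(lemmas)
        if len(seen) >= target:
            return i
    return None

# toy deck: 4 sentences already split into lemmas
deck = [["の", "猫"], ["猫", "走る"], ["犬", "に"], ["鳥"]]
print(cutoff_for_vocab(deck, 4))   # 3 -- sentences 1-3 introduce 5 unique lemmas
```

Running this with target=4656 over Core's sentence-optimized order gives the trimmed deck for the fair comparison being proposed here.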
Edited: 2015-09-20, 8:26 pm
#35
Matthias Wrote:I was indeed curious, thanks! The first 10 "no real word" lemmas give a coverage of 23% of Wikipedia, so 10% was an underestimate.
Why is that relevant? Both decks cover those words. Just because Nayr covers them explicitly, doesn't mean it takes less effort to learn them through Core.

So those "not real words" should be counted the same way for both decks. Which yogert's stats do. Yogert's stats don't care whether a word shows up in the "word" field or the "sentence" field. They correctly assume that the only field that matters is the sentence field (since sentence reviews are the best way to study with these decks). So this is not an issue; both decks teach those particles.
Edited: 2015-09-20, 8:24 pm
#36
Stansfield123 Wrote:I think the most relevant stat would be calculated like this: find the cutoff point in Core6k, where it covers 4656 unique words. Then see how much of the Japanese language those 4656 words cover, compared to the 4656 words Nayr's deck uses. That will tell you which deck is more relevant.

Comparing 6K sentences with 7057 words to 5K sentences with 4656 words is obviously unfair. But comparing 5K sentences with ~6000 words to 5K sentences with 4656 words is also unfair. What matters is the number of words, not the number of sentences. Just because you cram more words into the same number of sentences doesn't make those words easier to learn.
There are lots of different ways to slice and dice the differences, and your suggestion is a good one. I'll try it if I have time today. Of course, the counterargument is that a deck is a deck is a deck, and maybe the nayr5k vs core6k comparison is the most appropriate way to compare the decks as they are. ...Or not, because I just noticed the thread is entitled Nayr's Core vs Core2k. Maybe I need to do that comparison too!

Anyway, at the end of the day I think the decks are pretty similar based on vocabulary coverage. Maybe there are other metrics to consider, like sentence complexity or counting grammar points (if that were even possible). Based on the ~1% difference from subtracting 1k words from core6k, we can probably extrapolate another ~1-2% drop from subtracting another 1k sentences from core. That still puts them pretty close, with nayr's deck probably taking the lead by a small amount. I don't think that really makes much difference in practice, but I'll try the calculation for curiosity's sake.
#37
Stansfield123 Wrote:Why are there more than 6000 words in a 6000-card deck?
I know three reasons:
-Core is not a 6k deck, it is a 10k deck, and it reuses lots of its words (positively put: it reinforces the vocab)
-It does not contain many easy words like いる, ある or 日本, アメリカ, ヨーロッパ, etc. as vocab entries, but uses them in sentences. Er, and yes, it even uses all the の, に, は, ...
-Some words in the sentences are not covered as vocab entries

Number 3 is not good - but it is almost like a real-world situation: from time to time you encounter a word that you do not know ...

Normally the translation gives enough help in those cases. If not, it is easy to add reading / understanding hints so that your learning is not hampered. And if you think you like that word, just copy the card, change the vocab, and you have an easily created additional vocab entry.

Stansfield123 Wrote:Why is that relevant?
Yogert was so kind as to explain and link - so I had the chance to check this out myself. I found it quite surprising, and Yogert was amazed too. And that's it.

I don't think that you learn の, に, は, etc. through a deck. But even if Nayr's wastes a few dozen cards on them, it doesn't do much harm. Still a good deck.

yogert909 Wrote:a deck is a deck is a deck ...
That is true, as the sizes are about the same, 5k vs 6k. => Most people who do Core aim at 6k and do not plan to stop at 5k.

But size matters. So it is unfair to compare 2k to 5k, or 10k to 5k - and draw conclusions about quality from it.

Nayr's has 5k, and until he expands it, this will remain valid:

cophnia61 Wrote:I think both are ok but obviously core10k has more words so you'll end up with more cards already done for you to use.
Edited: 2015-09-21, 8:17 pm
#38
Matthias Wrote:
Stansfield123 Wrote:Why are there more than 6000 words in a 6000-card deck?
I know three reasons:
-Core is not a 6k deck, it is a 10k deck, and it reuses lots of its words (positively put: it reinforces the vocab)
There are a lot of iterations of core, but the original word list from iknow/smart.fm was 6k afaik. The 10k deck was expanded with vocab from another source. But that's all trivia; you can make it whatever is useful. Personally, I don't intend to go for 10k, probably not much more than 3k. I'd rather study by reading interesting things... but that's me.

I think what Stansfield is getting at is: why would you study a sentence that has more than one unknown word in it? There is a school of thought where, ideally, there would be no more than one unknown word per sentence so you understand the sentence well. Of course, in practice, that kind of deck would be hard to pull off if you are only adding one new word per sentence while keeping the deck in frequency order. Not to mention what you do with the first sentence (presumably the sentence の。, which doesn't seem useful). But again, I don't think it's of much practical use to quibble over. If you don't understand a sentence, just suspend it and come back later.
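The one-unknown-word-per-sentence idea (n+1, the ordering that tools like MorphMan automate) can be sketched as a greedy pass over the deck. A toy example with my own function name; real tools work on morphemes produced by a morphological analyzer, not pre-split lists:

```python
def n_plus_one_order(sentences, known=()):
    """Greedy n+1 ordering: repeatedly pick a sentence with exactly one
    unknown lemma, learn that lemma, and rescan. Sentences that never
    reach the one-unknown state are returned separately."""
    known = set(known)
    remaining = list(sentences)
    ordered = []
    progress = True
    while progress:
        progress = False
        for sent in remaining:
            unknown = set(sent) - known
            if len(unknown) == 1:
                known |= unknown          # learn the single new lemma
                ordered.append(sent)
                remaining.remove(sent)
                progress = True
                break                     # rescan from the top
    return ordered, remaining

sents = [["猫", "走る"], ["の", "猫"], ["走る", "犬", "鳥"]]
ordered, leftover = n_plus_one_order(sents, known={"の"})
# ["の", "猫"] becomes learnable first, then ["猫", "走る"];
# the last sentence still has two unknowns and is left over
```

The leftover list illustrates the problem raised above: a strictly n+1 deck in strict frequency order may simply have no valid sentence to offer at some steps.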
Edited: 2015-09-21, 9:20 pm
#39
This is just me repeating myself to clarify my previous point, but I think the most important thing to note about Nayr's deck is that it has 4656 unique words. That is the vocabulary set that you'll be learning if you do it.

And it's that vocabulary set that should be compared to other decks with a similar-size vocabulary set, for a 100% fair comparison. The reason Core5K has better coverage than Nayr's 5000 sentences is that it has a bigger vocabulary set, not better word selection. That is why I suggested that a fair comparison should find the first 4656 words in Core6k, and compare the coverage of that set of words to Nayr's.

If that still has a bigger or comparable coverage, then I'll stop claiming that Nayr's deck has better word selection. But I think it won't come to that. I think Nayr's deck indeed has better word selection. It's just smaller than the Core deck. But that's not necessarily a bad thing.

Matthias Wrote:Yogert was so kind to explain and link - so I had the chance to check this out by myself. I found it quite surprising and Yogert was amazed too. And that's it.
Oh, OK. Just thought you were making an argument that it makes Core better.
yogert909 Wrote:I think what Stansfield is getting at is why would you study a sentence that has more than one unknown word in it?
Yes. In fact ideally there should be fewer unique words than sentences, because some of the sentences will inevitably contain a new sentence pattern, or a new meaning for the same word...those sentences shouldn't also contain a new word on top of that.
Edited: 2015-09-22, 2:37 am
#40
Stansfield123 Wrote:And it's that vocabulary set that should be compared to other decks with a similar-size vocabulary set, for a 100% fair comparison. The reason Core5K has better coverage than Nayr's 5000 sentences is that it has a bigger vocabulary set, not better word selection. That is why I suggested that a fair comparison should find the first 4656 words in Core6k, and compare the coverage of that set of words to Nayr's.
Well, that's one valid way to look at it. But another way to look at it is that they are both decks you could use, period. Comparing the decks as they are is valid, the same way it's 100% fair to compare a Volkswagen to a Ferrari without restricting the Ferrari's horsepower to the VW's.
#41
Nayr182 Wrote:Initially I am going to cover the most frequent 5000 Japanese words based on the BCCWJ and CSJ corpora. I am working to not only cover all of the 5000 words, but also to cover all of the (non-obscure) varying nuances of each word.
Can't wait for this! I'm still working on Heisig myself, but getting to start on your project immediately after finishing Heisig would be perfect.
#42
Ok, by request from Stansfield, here is another comparison of the two decks, each with the same number of words. Trimming core down to 3884 sentences yields 4656 words, exactly the same number as nayr's deck. By this method nayr's deck handily wins 3 categories, while core still has slightly better coverage of nhk easy. As you might recall, the sorting is sentence-optimized, so it should be as close to n+1 as possible. Of course, you could sort it any number of ways and get different results. Sorting by vocabulary frequency would likely improve core's coverage quite a bit.

I'll leave it to individuals to decide which deck is a better fit for their style of study.

nayr
72.65% anime
59.47% drama
81.57% nhk
62.42% wikipedia

core6k
76.46% anime
63.67% drama
84.46% nhk
63.98% wikipedia

core5k
74.10% anime
60.17% drama
83.42% nhk
61.82% wikipedia

core 3884 sentences (4656 words)
71.65% anime
57.39% drama
81.92% nhk
58.07% wikipedia
Edited: 2015-09-22, 4:39 pm
#43
Core organizes its vocabulary by the vocab entry's frequency, not the frequency of every word in the sentences. So wouldn't less frequent words get mixed into your comparison, yogert909?
#44
Yes, there are many different ways of sorting and comparing. I chose sentence-optimized because it seemed to match the way nayr's deck was organized, so it seemed like a fair comparison. Keep in mind that the way the sentences are sorted puts the most frequent 2k words at the beginning, so it's not as if the most frequent words are being left out of core's comparison. But you are right, I could probably find a more optimal sort to make core do better in the comparison; I'm just not sure that comparison would be any more useful.
#45
I would personally suggest doing both with MorphMan. That way, you can get the best of both decks. If you don't like a sentence in one, just press 'l' to use the other deck's card. If you need to learn a specific word, you'll have more options to choose from, and you won't get stuck with a sentence with multiple unknowns.

While I personally like Nayr's deck better due to the sentence structure (and its frequency list), I do both decks. If I learn a word with one deck, MorphMan ensures I don't accidentally relearn it.
#46
I really need to get to using morphman.
#47
sokino Wrote:I really need to get to using morphman.
Agreed. It seems intimidating to me, but I suppose I just need to take it step by step.
#48
Where the heck are all these decks you speak of?
#49
Alec_xvi Wrote:Where the heck are all these decks you speak of?
See my reply at http://forum.koohii.com/showthread.php?p...#pid225695.