kanji koohii FORUM
smart.fm core 2000/6000 and JLPT coverage - Printable Version


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-06

has anyone computed what percentage of each JLPT level the vocab in core 2000 and 6000 covers?

or more practically, would one's vocab be strong enough for 1級 after completing core 6000?


smart.fm core 2000/6000 and JLPT coverage - nest0r - 2009-08-06

I'm curious about how many unique words are in the 6000 sentences as a whole, rather than just the titular # of common words targeted by the collection. From the start, I've been mining the Core series in Anki in their entirety--dictation/shadowing/writing/every new word/deconstructing the grammar--but I don't think I'll finish the 6000 till December. Also, I won't be doing them in isolation, so I couldn't claim then that whatever results I have are solely the work of C6k.

Katsuo, Master of Lists posted these stats when I asked (http://forum.koohii.com/showthread.php?pid=39540#pid39540) about the iKnow kanji a while back:

"In case anyone's interested, here are the equivalent figures for the present total of 6,000 iKnow sentences:
1648 total unique kanji.
Jouyou: 1523 of 1945 (78.3%)"

However, as I go along, I always convert any word to kanji if the word is spelled in kana but doesn't have 'uk (usually kana)' in the definition. So once I finish I could check the stats again.


smart.fm core 2000/6000 and JLPT coverage - Nukemarine - 2009-08-06

There are a few words in the course's sample sentences that are not in the course's vocabulary list. And like Nest0r says, some words are spelled in kana where one can use kanji.

What you can do is take the spreadsheet of Core 2k and 6k's vocabulary (kanji and kana), merge it with the JLPT list (kanji and kana), and use a formula that looks for matches and marks another cell to tag JLPT words not in the Core list. It's similar to what I did with the Tanuki list and the Core list, tagging words to remove blatant duplicates (about 2600 of the 7100 Tanuki words). Just remember to keep columns with the original sort number and whether each row comes from the Core or Tanuki list. Also, when you sort, sort by the kana column, then by the kanji column.
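
The same merge could also be sketched in Python instead of a spreadsheet. This is only an illustration: the sample entries, the `tag_missing` helper, and the choice of a (kanji, kana) pair as the match key are assumptions, not the actual Core or JLPT lists.

```python
# Hypothetical sample data: each entry is a (kanji, kana) pair.
core_words = [("学校", "がっこう"), ("食べる", "たべる"), ("水", "みず")]
jlpt_words = [("学校", "がっこう"), ("概念", "がいねん"), ("水", "みず")]

def tag_missing(jlpt, core):
    """Return JLPT entries with their original position and an in-core flag."""
    core_keys = set(core)  # match on kanji and kana together
    return [
        {"order": i, "kanji": kanji, "kana": kana,
         "in_core": (kanji, kana) in core_keys}
        for i, (kanji, kana) in enumerate(jlpt, start=1)
    ]

tagged = tag_missing(jlpt_words, core_words)
missing = [row for row in tagged if not row["in_core"]]
# Sort by kana, then kanji, as suggested for the spreadsheet.
missing.sort(key=lambda row: (row["kana"], row["kanji"]))
for row in missing:
    print(row["order"], row["kanji"], row["kana"])
```

Keeping the `order` field plays the same role as the "original sorting number" column: the list can be put back in its source order after sorting.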


smart.fm core 2000/6000 and JLPT coverage - Transparent_Aluminium - 2009-08-06

I have completed the core 6000 course and have added anki facts for any words included in the JLPT2 list but missing from core 6000. That comes to about 390 words. So, IMO, core 6000 would be adequate for JLPT2 but not JLPT1.


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

ok, i wrote some python code to do it. here are my results. they're not perfect, because there are just too many complications trying to match up words, but i think it's close. i used the lists for jlpt 1 and jlpt 2-4 from http://www.thbz.org/kanjimots/jlpt.php3

sentences: 6000
unique words: 6435
1級 only: 40.3% (1177/2922)
1-4級: 66.0% (5064/7667)
2-4級: 81.9% (3887/4745)

my script also spits out those jlpt words without a match in core 2000+6000, so one could study them to ensure jlpt coverage. perhaps it could be a group effort with the new collaborative lists on smart.fm. is there interest? personally i have to finish core 6000 first :P
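
For readers who want to try something similar, here is a minimal sketch of the coverage calculation described above. It is not radical_tyro's script: the `coverage` function and the toy word lists are assumptions, and it uses exact string matching only, which sidesteps the word-matching complications mentioned in the post.

```python
def coverage(jlpt_words, corpus_words):
    """Return (matched, total, percent, unmatched) for a JLPT list
    checked against the vocabulary found in a sentence collection."""
    jlpt = set(jlpt_words)
    matched = jlpt & set(corpus_words)
    unmatched = sorted(jlpt - matched)
    percent = 100.0 * len(matched) / len(jlpt) if jlpt else 0.0
    return len(matched), len(jlpt), percent, unmatched

# Hypothetical toy data, not the real lists.
corpus_vocab = {"学校", "水", "食べる", "行く"}
jlpt_list = ["学校", "水", "概念", "行く", "把握"]

hit, total, pct, todo = coverage(jlpt_list, corpus_vocab)
print(f"{pct:.1f}% ({hit}/{total})")
print("still to study:", todo)
```

The `todo` list corresponds to the "jlpt words without a match" output, i.e. the words one would add to a supplementary study list.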


smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-09

can you calculate these figures for KO2001? I'd be interested to know.


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

sure. i used the "2001.Kanji.Odyssey Compiled List 01-12" lists on smart.fm:

sentences: 3437
unique words: 4704
1級 only: 30.6% (895/2922)
1-4級: 49.5% (3793/7667)
2-4級: 61.1% (2898/4745)


smart.fm core 2000/6000 and JLPT coverage - blackmacros - 2009-08-09

Wow 4704 unique words? I've read here before that Coscom says the KO book contains something on the order of ~3600 unique words. I guess the smart.fm sentences are much more word dense?


smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-09

Thanks heaps. Those figures are very interesting and that's a really handy comparison to have.

So after completing KO2001 you would effectively be able to read 50% of anything and after completing Core6000 you would effectively be able to read 66% of anything.

Although, I heard that KO2001 covers 80% of written material? Maybe not in terms of totality but frequency?


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

@blackmacros: my results do show the smart.fm lists are slightly more word dense. i don't have the official KO2001 sentences list, but i found 2336 sentences for testing purposes. out of those 2336 sentences from the book, there were 3495 unique words. for the first 2336 sentences from core 6000, there were 3921 unique words. so that's about 12% more words for core 6000. however, core 2000 is less word dense than KO2001.

@mezbup: it's more complicated than that. knowing all the jlpt words doesn't mean you can 'read anything'. and the 80% of written material figure, what sort of written material? how much of it? surely it is some limited set of written material, because otherwise the number of unique words would be extremely high. without knowing what that material is, that figure means nothing. Edit: oh, frequency interpretation would make sense. maybe they used newspapers?
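
The difference between the two readings of "coverage" (share of distinct words versus share of running text) can be illustrated with a small sketch. The toy corpus and the `type_and_token_coverage` function are hypothetical and only meant to show why the two numbers diverge.

```python
from collections import Counter

def type_and_token_coverage(known_words, corpus_tokens):
    """Return (type coverage, token coverage) as fractions between 0 and 1."""
    counts = Counter(corpus_tokens)
    known = set(known_words)
    # Share of distinct words that are known.
    type_cov = sum(1 for w in counts if w in known) / len(counts)
    # Share of running words (weighted by frequency) that are known.
    token_cov = sum(c for w, c in counts.items() if w in known) / sum(counts.values())
    return type_cov, token_cov

# Hypothetical toy corpus: a few very common words and several rare ones.
tokens = ["の"] * 5 + ["する"] * 3 + ["概念", "把握", "郵便局", "黄色い"]
known = {"の", "する"}

type_cov, token_cov = type_and_token_coverage(known, tokens)
print(f"types: {type_cov:.0%}, tokens: {token_cov:.0%}")
```

In this toy case, knowing a third of the distinct words covers two thirds of the running text, which is the kind of gap a frequency-based "80% of written material" claim would rely on.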


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

i switched the jlpt data to the "JLPT Vocabulary" shared anki deck. it seems better, plus i get each level separate. here are the new figures:

core 2000+6000:
sentences: 6000
unique words: 6435
word repeats: 13700
cumulative:
1級: 65.9% (5431/8243)
2級: 82.6% (3781/4580)
3級: 94.5% (1092/1156)
4級: 97.1% (572/589)
exclusive:
1級 only: 45.0% (1650/3663)
2級 only: 78.5% (2689/3424)
3級 only: 91.7% (520/567)
4級 only: 97.1% (572/589)

KO2001 smart.fm:
sentences: 3437
unique words: 4704
word repeats: 8519
cumulative:
1級: 49.8% (4106/8243)
2級: 62.5% (2863/4580)
3級: 82.5% (954/1156)
4級: 88.6% (522/589)
exclusive:
1級 only: 33.9% (1243/3663)
2級 only: 55.8% (1909/3424)
3級 only: 76.2% (432/567)
4級 only: 88.6% (522/589)
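
Judging from the numbers alone, each cumulative row appears to be the exclusive row for that level summed with every easier level. The sketch below checks that reading against the core 2000+6000 figures; the `cumulative` helper is hypothetical and is inferred from the figures, not taken from the script.

```python
# Exclusive (per-level) counts for core 2000+6000 from the post above:
# level -> (matched, total), ordered from hardest to easiest.
exclusive = {
    "1級": (1650, 3663),
    "2級": (2689, 3424),
    "3級": (520, 567),
    "4級": (572, 589),
}

def cumulative(exclusive_counts):
    """Sum each level with all easier levels listed after it."""
    levels = list(exclusive_counts)
    result = {}
    for i, level in enumerate(levels):
        matched = sum(exclusive_counts[name][0] for name in levels[i:])
        total = sum(exclusive_counts[name][1] for name in levels[i:])
        result[level] = (matched, total)
    return result

for level, (matched, total) in cumulative(exclusive).items():
    print(f"{level}: {100 * matched / total:.1f}% ({matched}/{total})")
```

So the cumulative 1級 row answers "how much of everything a 1級 candidate needs is covered", while the "1級 only" row isolates the words that first appear at that level.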


smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-10

Yeah, according to frequency is what I meant. Like, for instance, you might be able to read an entire passage of a simple text that doesn't include any specialized vocab, but in the next passage you might get 4 out of every 5 words. Either way it's still a huge boost in reading capability. So long as you can comprehend it, that is.

I'm not so sure newspapers are an accurate measure of frequency because of the style of writing. I'd say to get an extremely accurate average you'd have to take into account 5+ mediums and hundreds of articles per medium. But you know... I'm sure these things have been done before.

I'm more than happy to attempt to annihilate KO2001. The rest can be gleaned from mining natural resources.


smart.fm core 2000/6000 and JLPT coverage - greatfool - 2009-08-11

This seems to confirm my suspicion that the majority of JLPT1 specific words are currently not on any list anywhere. (not counting the pure vocab lists)


smart.fm core 2000/6000 and JLPT coverage - Transparent_Aluminium - 2009-08-11

The only "list" with JLPT1 coverage would be the Kanji in context sentences. I believe they cover a majority of the vocab required for the JLPT. Also, books like the UNICOM 1kyuu vocab book have a lot of sentences which cover some amount of the required vocab.


smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-12

radical_tyro, do you think you could put your script in an Anki plugin? Or if you share it I can have a go at it. It'd be nice to have a word count estimate in Anki.

I also want to get statistics for Reibun de Manabu which I've been going through. I can also check what I have from KiC (bit more than half of workbook 1).


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-12

@vosmiura: i believe mafried is working on the anki plugin for this and i posted the code to help him out with it over here http://forum.koohii.com/showthread.php?pid=65894#pid65894


smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

Thanks.

I ran it on the 1500 sentences I have from Reibun de Manabu so far. This is a cherry picked selection of the first 3/4 of the book, excluding things I thought were too easy, so the actual counts would be higher.

unique words: 3271
word repeats: 6401
1級 only: 17.8% (652/3663)
2級 only: 54.6% (1868/3424)
3級 only: 71.6% (406/567)
4級 only: 79.5% (468/589)
cumulative:
1級: 41.2% (3394/8243)
2級: 59.9% (2742/4580)
3級: 75.6% (874/1156)
4級: 79.5% (468/589)

And this is from the first 44 chapters (~850 sentences) of KiC:

unique words: 2596
word repeats: 3542
1級 only: 17.2% (631/3663)
2級 only: 31.2% (1068/3424)
3級 only: 56.3% (319/567)
4級 only: 71.5% (421/589)
cumulative:
1級: 29.6% (2439/8243)
2級: 39.5% (1808/4580)
3級: 64.0% (740/1156)
4級: 71.5% (421/589)


smart.fm core 2000/6000 and JLPT coverage - Nukemarine - 2009-08-13

Well, since you are doing this, how does the Tanuki list look? At 7000 entries covering the entire jouyou, it's bound to cover a lot of JLPT material.


smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

Tanuki list:

unique words: 8682
word repeats: 20801
1級 only: 42.8% (1568/3663)
2級 only: 57.9% (1983/3424)
3級 only: 78.3% (444/567)
4級 only: 85.1% (501/589)
cumulative:
1級: 54.5% (4496/8243)
2級: 63.9% (2928/4580)
3級: 81.7% (945/1156)
4級: 85.1% (501/589)

Hmm, not a great match either. It seems that Tanuki has a lot of words not on those JLPT lists. At the same time, it doesn't kanjify all the words, which is part of why they don't match up.


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-13

vosmiura Wrote:These counts should be taken as lower bounds.

MeCab splits up some words, for example 郵便局 gets split into 郵便 - 局, and so 郵便局 is not counted. Same for 外国-人 and others.

Some words use different kana like 曲る vs 曲がる, and again don't get counted. Also some things like 主に don't get counted. 黄色 versus 黄色い, etc.
sort of. i noticed the mecab word splitting thing and attempted to get around it by trying a simple text search if the dictionary form matching didn't succeed. so, 郵便局, 外国人, 主に, etc. should be ok.

you are correct about things like 曲る/曲がる and 黄色/黄色い not being handled properly. that may be tricky.

yeah i guess tanuki is beyond jlpt @_@
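
A rough sketch of the two-step matching described here: try the dictionary form first, then fall back to a plain text search. The `find_matches` function and the sample data are assumptions; in particular, `tokenized` is a hand-written stand-in for MeCab output, and no real tokenizer is called.

```python
def find_matches(jlpt_words, sentences, tokenized):
    """Match JLPT words against token dictionary forms first,
    then fall back to a plain substring search in the raw sentences."""
    dictionary_forms = {tok for sent in tokenized for tok in sent}
    matched = set()
    for word in jlpt_words:
        if word in dictionary_forms:
            matched.add(word)
        elif any(word in sentence for sentence in sentences):
            # Fallback: catches compounds the tokenizer split apart.
            matched.add(word)
    return matched

# Hypothetical input; the token lists imitate dictionary-form output.
sentences = ["郵便局へ行った。", "道が曲がる。"]
tokenized = [["郵便", "局", "へ", "行く"], ["道", "が", "曲がる"]]
jlpt = ["郵便局", "行く", "曲る", "黄色い"]

print(sorted(find_matches(jlpt, sentences, tokenized)))
```

In this sketch 郵便局 is recovered by the fallback even though it was split into 郵便 and 局, while the okurigana variant 曲る still fails to match 曲がる. A substring fallback can also over-count, since a short word may appear inside an unrelated longer one.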


smart.fm core 2000/6000 and JLPT coverage - mafried - 2009-08-13

If you run the lists through mecab first, all of those issues should go away. I'll be able to report when the plugin is finished (soon).


smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

Yeah, i just realised it was my own hackery that broke the search for 郵便局, 外国人. I couldn't get MeCab for python working, so I changed to use MeCab through command line, and inadvertently broke part of it. I've updated the counts with that fixed; a little higher but not very different than before.


smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

mafried Wrote:If you run the lists through mecab first, all of those issues should go away. I'll be able to report when the plugin is finished (soon).
Do you mean run the JLPT lists through? That's a good idea.


smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-16

Can you run the whole of KIC through this?


smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-16

has KIC been transcribed?