Thread: smart.fm core 2000/6000 and JLPT coverage (kanji koohii FORUM, http://forum.koohii.com, /thread-3689.html)
smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-06

has anyone computed what percentage of each JLPT level the vocab in core 2000 and 6000 covers? or more practically, would one's vocab be strong enough for 1級 after completing core 6000?

smart.fm core 2000/6000 and JLPT coverage - nest0r - 2009-08-06

I'm curious about how many unique words are in the 6000 sentences as a whole, rather than just the titular number of common words targeted by the collection. From the start, I've been mining the Core series in Anki in their entirety (dictation/shadowing/writing/every new word/deconstructing the grammar), but I don't think I'll finish the 6000 till December. Also, I won't be doing them in isolation, so I couldn't claim then that whatever results I have are solely the work of C6k.

Katsuo, Master of Lists, posted these stats when I asked (http://forum.koohii.com/showthread.php?pid=39540#pid39540) about the iKnow kanji a while back:

"In case anyone's interested, here are the equivalent figures for the present total of 6,000 iKnow sentences: 1648 total unique kanji. Jouyou: 1523 of 1945 (78.3%)"

However, as I go along, I always convert any word to kanji if the word is spelled in kana but doesn't have 'uk (usually kana)' in the definition. So once I finish I could check the stats again.

smart.fm core 2000/6000 and JLPT coverage - Nukemarine - 2009-08-06

There are a few words in the course's sample sentences that are not in the course's vocabulary list. And like Nest0r says, some words are spelled in kana where one could use kanji.

What you can do is take the spreadsheet of Core 2k and 6k's vocabulary (kanji and kana), merge it with the JLPT list (kanji and kana), and use a formula that looks for matches and marks another cell to tag JLPT words not in the Core list. It's similar to what I did with the Tanuki list and Core list, tagging words to remove blatant duplicates (about 2600 from the 7100 Tanuki words).
Just remember to have columns with the original sorting number and whether it's the Core or Tanuki list. Also, when you sort, sort by the kana column and then by the kanji column.

smart.fm core 2000/6000 and JLPT coverage - Transparent_Aluminium - 2009-08-06

I have completed the core 6000 course and have added anki facts for any words included in the JLPT2 list but missing from core 6000. That comes to about 390 words. So, IMO, core 6000 would be adequate for JLPT2 but not JLPT1.

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

ok, i wrote some python code to do it. here are my results. they're not perfect, because there are just too many complications trying to match up words, but i think it's close. i used the lists for jlpt 1 and jlpt 2-4 from http://www.thbz.org/kanjimots/jlpt.php3

sentences: 6000
unique words: 6435
1級 only: 40.3% (1177/2922)
1-4級: 66.0% (5064/7667)
2-4級: 81.9% (3887/4745)

my script also spits out those jlpt words without a match in core 2000+6000, so one could study them to ensure jlpt coverage. perhaps it could be a group effort with the new collaborative lists on smart.fm. is there interest? personally i have to finish core 6000 first
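
For anyone wanting to reproduce this kind of count, here is a minimal sketch. It is not radical_tyro's script (which isn't shown in this post), and it assumes the words have already been reduced to comparable dictionary forms, which is exactly the complication the post mentions. The function name `coverage` and the toy word lists are made up for illustration. The last lines only re-check the arithmetic of the posted figures.

```python
def coverage(jlpt_words, known_words):
    """Return (matched, total, percent, missing) for one JLPT list.

    Both arguments are iterables of dictionary-form strings. Matching is
    exact string equality, so spelling variants will count as misses.
    """
    jlpt = set(jlpt_words)
    known = set(known_words)
    missing = sorted(jlpt - known)
    matched = len(jlpt) - len(missing)
    percent = 100.0 * matched / len(jlpt) if jlpt else 0.0
    return matched, len(jlpt), percent, missing


# Toy data, invented for this example (not taken from any real list).
core_words = ["食べる", "学校", "郵便局", "主に"]
jlpt_list = ["食べる", "学校", "曖昧", "郵便局"]

matched, total, percent, missing = coverage(jlpt_list, core_words)
print(f"{percent:.1f}% ({matched}/{total}), missing: {missing}")

# Sanity check on the figures posted above: the 1-4級 line is the
# 1級-only and 2-4級 counts pooled together.
pooled_matched = 1177 + 3887
pooled_total = 2922 + 4745
print(f"1-4級: {100.0 * pooled_matched / pooled_total:.1f}% ({pooled_matched}/{pooled_total})")
```

The `missing` list plays the role of the "jlpt words without a match" output described in the post. With real data, most of the effort would go into normalising both lists before they reach this function.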
smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-09

radical_tyro Wrote: ok, i wrote some python code to do it. here are my results. they're not perfect, because there are just too many complications trying to match up words, but i think it's close. i used the lists for jlpt 1 and jlpt 2-4 from http://www.thbz.org/kanjimots/jlpt.php3

can you calculate these figures for KO2001? I'd be interested to know.

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

mezbup Wrote: can you calculate these figures for KO2001? I'd be interested to know.

sure. i used the "2001.Kanji.Odyssey Compiled List 01-12" lists on smart.fm:

sentences: 3437
unique words: 4704
1級 only: 30.6% (895/2922)
1-4級: 49.5% (3793/7667)
2-4級: 61.1% (2898/4745)

smart.fm core 2000/6000 and JLPT coverage - blackmacros - 2009-08-09

Wow, 4704 unique words? I've read here before that Coscom says the KO book contains something on the order of ~3600 unique words. I guess the smart.fm sentences are much more word dense?

smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-09

Thanks heaps. Those figures are very interesting and that's a really handy comparison to have. So after completing KO2001 you would effectively be able to read 50% of anything and after completing Core6000 you would effectively be able to read 66% of anything. Although, I heard that KO2001 covers 80% of written material? Maybe not in terms of totality but frequency?

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

@blackmacros: my results do show the smart.fm lists are slightly more word dense. i don't have the official KO2001 sentences list, but i found 2336 sentences for testing purposes.
out of those 2336 sentences from the book, there were 3495 unique words. for the first 2336 sentences from core 6000, there were 3921 unique words. so that's about 12% more words for core 6000. however, core 2000 is less word dense than KO2001.

@mezbup: it's more complicated than that. knowing all the jlpt words doesn't mean you can 'read anything'. and the 80% of written material figure, what sort of written material? how much of it? surely it is some limited set of written material, because otherwise the number of unique words would be extremely high. without knowing what that material is, that figure means nothing.

Edit: oh, frequency interpretation would make sense. maybe they used newspapers?
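
The distinction being circled here (share of distinct words known versus share of running text known) can be made concrete with a small sketch. The token list and known-word set are invented, and the tokens are assumed to be pre-split; real Japanese text would need a tokenizer such as MeCab first. The final lines re-check the "about 12%" density figure from the post.

```python
from collections import Counter


def type_and_token_coverage(tokens, known_words):
    """Compare two notions of coverage for a tokenized text.

    type coverage  = share of distinct words that are known
    token coverage = share of running words that are known, so frequent
                     words count once per occurrence
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    known = set(known_words)
    known_types = sum(1 for w in counts if w in known)
    known_tokens = sum(n for w, n in counts.items() if w in known)
    type_cov = 100.0 * known_types / len(counts)
    token_cov = 100.0 * known_tokens / sum(counts.values())
    return type_cov, token_cov


# Invented example: a few very common words dominate the running text.
tokens = ["の"] * 5 + ["は"] * 3 + ["学校", "郵便局"]
known = {"の", "は"}

type_cov, token_cov = type_and_token_coverage(tokens, known)
print(f"type coverage: {type_cov:.0f}%, token coverage: {token_cov:.0f}%")

# Re-checking the density comparison above: 3921 vs 3495 unique words
# over the same number of sentences.
extra = 100.0 * (3921 - 3495) / 3495
print(f"core 6000 has {extra:.1f}% more unique words")
```

The JLPT percentages in this thread are of the first kind (distinct list words matched), while a claim like "covers 80% of written material" only makes sense as the second kind, measured against some specific body of text.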
smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-09

i switched the jlpt data to the "JLPT Vocabulary" shared anki deck. it seems better, plus i get each level separate. here are the new figures:

core 2000+6000:
sentences: 6000
unique words: 6435
word repeats: 13700
cumulative:
1級: 65.9% (5431/8243)
2級: 82.6% (3781/4580)
3級: 94.5% (1092/1156)
4級: 97.1% (572/589)
exclusive:
1級 only: 45.0% (1650/3663)
2級 only: 78.5% (2689/3424)
3級 only: 91.7% (520/567)
4級 only: 97.1% (572/589)

KO2001 smart.fm:
sentences: 3437
unique words: 4704
word repeats: 8519
cumulative:
1級: 49.8% (4106/8243)
2級: 62.5% (2863/4580)
3級: 82.5% (954/1156)
4級: 88.6% (522/589)
exclusive:
1級 only: 33.9% (1243/3663)
2級 only: 55.8% (1909/3424)
3級 only: 76.2% (432/567)
4級 only: 88.6% (522/589)

smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-10

radical_tyro Wrote: @blackmacros: my results do show the smart.fm lists are slightly more word dense. i don't have the official KO2001 sentences list, but i found 2336 sentences for testing purposes. out of those 2336 sentences from the book, there were 3495 unique words. for the first 2336 sentences from core 6000, there were 3921 unique words. so that's about 12% more words for core 6000. however, core 2000 is less word dense than KO2001.

Yeah, according to frequency is what I meant. Like, for instance, you might be able to read an entire passage of a simple text that doesn't include any specialized vocab, but in the next passage you might get 4 out of every 5 words. Either way it's still a huge boost in reading capability. So long as you can comprehend it, that is.

I'm not so sure newspapers are an accurate measure of frequency because of the style of writing. I'd say to get an extremely accurate average you'd have to take into account 5+ mediums and hundreds of articles per medium. But you know... I'm sure these things have been done before.

I'm more than happy to attempt to annihilate KO2001.
The rest can be gleaned from mining natural resources.

smart.fm core 2000/6000 and JLPT coverage - greatfool - 2009-08-11

This seems to confirm my suspicion that the majority of JLPT1 specific words are currently not on any list anywhere. (not counting the pure vocab lists)

smart.fm core 2000/6000 and JLPT coverage - Transparent_Aluminium - 2009-08-11

The only "list" with JLPT1 coverage would be the Kanji in Context sentences. I believe they cover a majority of the vocab required for the JLPT. Also, books like the UNICOM 1kyuu vocab book have a lot of sentences which cover some amount of the required vocab.

smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-12

radical_tyro, do you think you could put your script in an Anki plugin? Or if you share it I can have a go at it. It'd be nice to have a word count estimate in Anki.

I also want to get statistics for Reibun de Manabu, which I've been going through. I can also check what I have from KiK (bit more than half of workbook 1).

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-12

@vosmiura: i believe mafried is working on the anki plugin for this and i posted the code to help him out with it over here: http://forum.koohii.com/showthread.php?pid=65894#pid65894

smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

Thanks. I ran it on the 1500 sentences I have from Reibun de Manabu so far. This is a cherry picked selection of the first 3/4 of the book, excluding things I thought were too easy, so the actual counts would be higher.
unique words: 3271
word repeats: 6401
1級 only: 17.8% (652/3663)
2級 only: 54.6% (1868/3424)
3級 only: 71.6% (406/567)
4級 only: 79.5% (468/589)
cumulative:
1級: 41.2% (3394/8243)
2級: 59.9% (2742/4580)
3級: 75.6% (874/1156)
4級: 79.5% (468/589)

And this is from the first 44 chapters (~850 sentences) of KiC:

unique words: 2596
word repeats: 3542
1級 only: 17.2% (631/3663)
2級 only: 31.2% (1068/3424)
3級 only: 56.3% (319/567)
4級 only: 71.5% (421/589)
cumulative:
1級: 29.6% (2439/8243)
2級: 39.5% (1808/4580)
3級: 64.0% (740/1156)
4級: 71.5% (421/589)

smart.fm core 2000/6000 and JLPT coverage - Nukemarine - 2009-08-13

Well, since you are doing this, how does the Tanuki list look? At 7000 entries covering the entire jouyou, it's bound to cover a lot of JLPT material.

smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

Tanuki list:

unique words: 8682
word repeats: 20801
1級 only: 42.8% (1568/3663)
2級 only: 57.9% (1983/3424)
3級 only: 78.3% (444/567)
4級 only: 85.1% (501/589)
cumulative:
1級: 54.5% (4496/8243)
2級: 63.9% (2928/4580)
3級: 81.7% (945/1156)
4級: 85.1% (501/589)

Hmm, not a great match either. It seems that Tanuki has a lot of words not on those JLPT lists. At the same time, it doesn't kanjify all the words, which is part of why they don't match up.

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-13

vosmiura Wrote: These counts should be taken as lower bounds.

sort of. i noticed the mecab word splitting thing and attempted to get around it by trying a simple text search if the dictionary form matching didn't succeed. so, 郵便局, 外国人, 主に, etc. should be ok. you are correct about things like 曲る/曲がる and 黄色/黄色い not being handled properly. that may be tricky.

yeah i guess tanuki is beyond jlpt @_@

smart.fm core 2000/6000 and JLPT coverage - mafried - 2009-08-13

If you run the lists through mecab first, all of those issues should go away. I'll be able to report when the plugin is finished (soon).
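
The fallback radical_tyro describes (dictionary-form matching first, then a plain text search) might look roughly like the sketch below. This is a guess at the idea, not the actual script. The `lemmas` set stands in for tokenizer output and is hard-coded here, and the split of 郵便局 into 郵便 + 局 is assumed for illustration rather than taken from real MeCab output.

```python
def is_covered(word, lemmas, raw_text):
    """Return True if a JLPT word is found in the sentence collection.

    Step 1: exact match against the dictionary forms from the tokenizer.
    Step 2: fall back to a plain substring search of the raw sentences,
            which rescues compounds the tokenizer split into pieces.
    """
    if word in lemmas:
        return True
    return word in raw_text


# Hypothetical data: one sentence and the dictionary forms a tokenizer
# might return for it (assumed, not real MeCab output).
raw_text = "郵便局の角を曲がる。"
lemmas = {"郵便", "局", "の", "角", "を", "曲がる"}

for word in ["郵便局", "曲がる", "曲る"]:
    print(word, is_covered(word, lemmas, raw_text))
```

In this toy case 郵便局 is only found by the substring fallback, 曲がる is found by the dictionary-form match, and the variant spelling 曲る is found by neither. That last case is the okurigana problem mentioned above, which would need some extra normalisation of both lists.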
smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

radical_tyro Wrote: sort of. i noticed the mecab word splitting thing and attempted to get around it by trying a simple text search if the dictionary form matching didn't succeed. so, 郵便局, 外国人, 主に, etc. should be ok.

Yeah, i just realised it was my own hackery that broke the search for 郵便局, 外国人. I couldn't get MeCab for python working, so I changed to use MeCab through command line, and inadvertently broke part of it. I've updated the counts with that fixed; a little higher but not very different than before.

smart.fm core 2000/6000 and JLPT coverage - vosmiura - 2009-08-13

mafried Wrote: If you run the lists through mecab first, all of those issues should go away. I'll be able to report when the plugin is finished (soon).

Do you mean run the JLPT lists through? That's a good idea.

smart.fm core 2000/6000 and JLPT coverage - mezbup - 2009-08-16

Can you run the whole of KIC through this?

smart.fm core 2000/6000 and JLPT coverage - radical_tyro - 2009-08-16

has KIC been transcribed?