
Coverage of Core 2000 in strictly Core 6000 sentences

#1
I'm thinking of ditching my Core 2000 cards because the sentences are way too simple, so I was wondering if anyone knows (or has an easy way of finding out) how many of the Core 2000 vocabulary entries are reused in Core 6000-only sentences. My impression is that it's a lot (say, at least two thirds), but hard numbers would give me confidence. Even better would be a list of the unique cards, so I could keep just those, but that's probably too much to ask (though I really don't know, given some of the wizard-class people who hang around here sometimes). If it's too complicated to find out, don't bother. Thanks in advance for any help, because it's not the kind of thing I can do by myself.

Out of curiosity -- and if anyone can do all that -- I also wonder whether the coverage increases when looking at strictly Core 6k+10k sentences. It might or might not, since the 10k sentences may themselves prioritize reusing 6k words.
Edited: 2014-02-19, 2:31 pm
Reply
#2
I'm sure you can answer at least the first of those questions using morphology.
Reply
#3
Splatted Wrote:I'm sure you can answer at least the first of those questions using morphology.
If you could go to the trouble of developing this answer a little, my decidedly un-tech-savvy self would be very grateful.
Edited: 2014-02-21, 4:31 pm
Reply
#4
Another option would be to simply delete the sentence cards out of the Core2k deck and blast through straight vocab, deleting whatever it is you think is too easy and keeping the rest.
Reply
#5
Bumping for Splatted. If you meant using Morphman: I can't manage to export decks with only one field to use as a db. My idea was to compare the "kanji vocab" field of Core 2k with the "sentence kanji" field of Core 6k.
Edited: 2014-02-21, 4:34 pm
Reply
#6
My first reply was so brief because I was short on time, but the truth is I'm not sure I can add much more. I haven't used Morphman (or whatever it's called) in years so I obviously can't remember the specifics of how it works. I'm pretty confident in saying that it is the right tool for the job though, since it was basically designed to do this.

That being said, some thoughts that come to mind are:

a) The Core 2k sentences probably don't contain many words that aren't in the 2k, so why not just include them in the analysis?
b) I'm pretty sure you can create a db from a standard word document/spreadsheet/whatever, so if you want to compare just the actual 2k vocab, you can export the whole deck, delete the excess fields yourself (which is easy in a spreadsheet), and use that to create your db.
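To make (b) concrete: if the deck export comes out as tab-separated text (one note per line), stripping it down to one field is a one-liner. The file name and field position below are made up for illustration; check which column your vocab field actually lands in.

```shell
# Fake two-field export (vocab<TAB>sentence). A real Anki export has
# whatever fields the note type defines, in field order.
printf 'word1\tsentence one\nword2\tsentence two\n' > /tmp/core2k-export.txt

# Keep only the first field; the result is a plain word list that can
# serve as a db source.
cut -f1 /tmp/core2k-export.txt > /tmp/core2k-vocab.txt
cat /tmp/core2k-vocab.txt
```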

If this doesn't help you could try asking on the Morphman thread. I think Overture keeps an eye on it and habitual Morphman users are probably more likely to check it too.
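For what it's worth, here's a rough Morphman-free sketch of the whole comparison, with toy files standing in for the exported 2k vocab and 6k sentences. It's fixed-string matching only, so conjugation and word-boundary issues are ignored; treat it as an approximation.

```shell
# Toy stand-ins: one vocab word per line, and a file of sentences.
printf 'neko\ninu\ntori\n' > /tmp/vocab2k
printf 'neko runs fast\nthe inu sleeps\n' > /tmp/sent6k

# grep -oFf prints every vocab word found verbatim in the sentences;
# sort -u collapses repeats, leaving the list of covered words.
grep -oFf /tmp/vocab2k /tmp/sent6k | sort -u > /tmp/covered
cat /tmp/covered
```

Here neko and inu would come out as covered, tori as not.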
Reply
#7
So I was on the right track then. Thank you very much for your answer, I'll persevere.
Reply
#8
If I didn't make any errors, there are 2405 unique tokens recognized by MeCab in the first 2000 sentences, 5931 in the last 4000 sentences, and 6701 in all sentences. So there are 770 tokens that appear in the first 2000 sentences but not the last 4000 sentences. "Tokens" include word endings (like ます and た in 読みました), punctuation characters, and numbers.

$ curl jptxt.net/core-6000.txt|awk -F\; '/^[^#]/{gsub(/<\/?b>/,"");print $5}'>/tmp/sentences
$ mecab -F '%f[6] ' -E '\n' /tmp/sentences>/tmp/tokens
$ head -n2000 /tmp/tokens|tr ' ' \\n|sort -u|wc -l
2405
$ tail -n4000 /tmp/tokens|tr ' ' \\n|sort -u|wc -l
5931
$ cat /tmp/tokens|tr ' ' \\n|sort -u|wc -l
6701
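If anyone wants the list of those 770 tokens themselves (e.g. to keep just the unique cards, as asked in #1), comm -23 on the two sorted token sets gives it. Here is a self-contained toy version of the idea, with a stand-in for /tmp/tokens (bash, for the process substitution); on the real file you'd use head -n2000 / tail -n4000, and the line count should come back to 770 (2405 + 5931 - 6701 = 1635 shared, 2405 - 1635 = 770).

```shell
# Stand-in for /tmp/tokens: one line of space-separated tokens per
# sentence.
printf 'a b c\nd e\nb f\ng\n' > /tmp/tokens-demo

# comm -23 prints lines unique to the first sorted input: here, the
# tokens in the first two "sentences" that never recur in the last two.
comm -23 <(head -n2 /tmp/tokens-demo | tr ' ' '\n' | sort -u) \
         <(tail -n2 /tmp/tokens-demo | tr ' ' '\n' | sort -u) \
         > /tmp/tokens-unique
cat /tmp/tokens-unique
```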

I used the newer version of the Core 6000 data that is available as JSON files from iknow.jp.

The cumulative number of unique tokens is really linear:

[Image: coreuniquetokens.png]

$ awk '{for(i=1;i<=NF;i++)a[$i];print length(a)}'</tmp/tokens>/tmp/cumulative
$ gnuplot<<<'set term png size 400,300;set output "/tmp/plot.png";plot "/tmp/cumulative" w lines'
Edited: 2014-04-02, 11:53 pm
Reply
#9
That's beautiful, lauri-ranta, thanks for sharing. (In the end I didn't get Morphman to work; I'll never use it again, so I don't have time to investigate what went wrong.) So that's roughly 68% reused (1635 of the 2405 tokens recur in the last 4000 sentences)? More than I'd have thought. A higher number would have been ideal, but I'm not going to nitpick. They must have gone to some trouble just to reach that figure. Another upside of this deck.

Yes, it's remarkably linear. You'd expect the reuse to be steady, given that Core 6k sentences are longer than the 2k ones, but going by your data it looks like one could get away with keeping only cards 3000 to 6000. All the more reason not to cling to them too much.

Anyway, for now I'll do a speed run and suspend the easiest half of 2k. I could get rid of more, but easy cards help build recognition speed, which is also useful. Thanks again.
Edited: 2014-02-24, 10:31 am
Reply