I added a little GUI manager so you can open databases and view / compare them. You can also create databases from a UTF-8 text file now.
Some interesting results quickly achieved with the new manager:
Only 1100/3204 = 34.3%, 227/655 = 34.7%, and 839/1681 = 49.9% of morphemes from Fate/Stay Night ep1-9, Nogizaka Haruka no Himitsu ep1, and JSPfEC, respectively, are in the core6k words. If you expand it to include all morphemes in the core6k words and example sentences, those numbers only grow to 1706/3204 = 53%, 415/655 = 63%, and 1283/1681 = 76%.
For the morphemes in FSN that aren't in core6k, 39% are nouns and 36% are verbs. For Nogizaka it's 32% and 26%, and for JSPfEC it's 59% and 25%. Also, only 2%, 3%, and 0% were listed as 'unknown' by MeCab, which suggests core6k's deficiency isn't due to loan words. I also filtered out punctuation and particles, since counting them wouldn't be fair to core6k.
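For anyone who wants to reproduce this kind of part-of-speech breakdown outside the manager, here's a rough sketch in Python. It is not the manager's actual code: the function name `pos_breakdown` and the representation of a morpheme as a `(base_form, pos)` tuple are my own assumptions, and the sample data is made up.

```python
from collections import Counter


def pos_breakdown(source, reference):
    """Percentage of each part of speech among the morphemes in
    `source` that do not appear in `reference`.

    Both arguments are iterables of (base_form, pos) tuples
    (a hypothetical representation, not the manager's format).
    """
    missing = set(source) - set(reference)
    if not missing:
        return {}
    counts = Counter(pos for _, pos in missing)
    return {pos: 100 * n / len(missing) for pos, n in counts.items()}


# Made-up toy data, just to show the shape of the input and output.
show = [("食べる", "verb"), ("猫", "noun"), ("犬", "noun"),
        ("走る", "verb"), ("剣", "noun")]
deck = [("食べる", "verb"), ("猫", "noun")]

for pos, pct in sorted(pos_breakdown(show, deck).items()):
    print(f"{pos}: {pct:.0f}%")
```

Note that matching on the full tuple means the same base form with a different part-of-speech tag counts as a different morpheme; whether that is what you want depends on how the tagger output is stored.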
As for words that are in core6k but not in the 3 dbs listed above, i.e. wasted effort:
JSPfEC: 4982/5821 = 85.6%
FSN: 4721/5821 = 81%
Nogizaka: 5594/5821 = 96%
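Both directions of comparison (how much of a show is covered by core6k, and how much of core6k never shows up in the show) boil down to set operations. Here's a minimal sketch, assuming each database can be reduced to a set of morpheme strings; the function names and the toy sets are hypothetical and not taken from the manager.

```python
def coverage(source, reference):
    """Return (hits, total, percent) for the morphemes of `source`
    that also appear in `reference`."""
    source, reference = set(source), set(reference)
    hits = len(source & reference)
    percent = 100 * hits / len(source) if source else 0.0
    return hits, len(source), percent


def unused(reference, source):
    """Return (misses, total, percent) for the morphemes of `reference`
    that never appear in `source`, i.e. the 'wasted effort' figure."""
    reference, source = set(reference), set(source)
    misses = len(reference - source)
    percent = 100 * misses / len(reference) if reference else 0.0
    return misses, len(reference), percent


# Made-up toy sets standing in for a show db and a deck db.
show = {"猫", "犬", "剣", "走る"}
deck = {"猫", "走る", "学校", "先生", "電車"}

print("covered: %d/%d = %.1f%%" % coverage(show, deck))
print("unused:  %d/%d = %.1f%%" % unused(deck, show))
```

The two percentages use different denominators (the show's morpheme count versus the deck's), which is why a deck can cover a fair share of a show while most of the deck still goes unused.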
This is interesting as it gives considerable evidence that core6k's word selection is far from optimal if you have particular shows/books/etc in mind that you'd like to comprehend. This may be obvious, but I was surprised by the degree of its inefficacy.
If anyone has large Anki decks from other sources, it'd be neat to see more analysis.
