yogert909, that's interesting. Could you run the same analysis for the complete 10k as well (including the word/lemma count)? Thanks!
Can you explain what "coverage" means? How is the calculation done, and what influences the values? Is NHK Easy the smallest of the 4 corpora, and is that why its coverage is the highest? Does word frequency play any role?
What about the previously mentioned の, に & ね (and は, を, が, も, だ/です, ...)?
What about the more than 30 entries like 一年, 二年, 三年, 四年, ...? Are these counted as separate words? Probably not, because otherwise there would be loads of "word" entries for years and numbers in Wikipedia.
