Word frequency list, what corpus is ideal?

#1
OK, looking at the word frequency list generated from the "innocent novel corpus", I see that many infrequent words rank high in the list because they are used over and over within a single novel.
So I wonder how big a corpus must be to dilute those words, so that a couple of novels won't skew the entire list.
Just think of 才人: it's not a frequent word, but it's high in that list because it is the name of the main character of Zero no Tsukaima.
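To make the problem concrete, here is a minimal sketch (not how the existing list was built) that counts, for each word, both its total frequency and the number of documents it appears in, then drops words that occur in too few documents. It assumes the texts are already tokenized (for Japanese, with something like MeCab); the names `frequency_with_range`, `ranked_words`, the `min_share` threshold and the toy data are all made up for illustration.

```python
from collections import Counter

def frequency_with_range(docs):
    """docs: dict mapping a document name to its list of tokens."""
    total = Counter()      # occurrences across the whole corpus
    doc_range = Counter()  # number of documents containing the word
    for tokens in docs.values():
        total.update(tokens)
        doc_range.update(set(tokens))  # each document counts once per word
    return total, doc_range

def ranked_words(docs, min_share=0.5):
    """Rank words by total count, keeping only words that occur in at
    least min_share of the documents (hypothetical threshold)."""
    total, doc_range = frequency_with_range(docs)
    n_docs = len(docs)
    kept = [w for w in total if doc_range[w] / n_docs >= min_share]
    return sorted(kept, key=lambda w: (-total[w], w))

# Toy data: 才人 is frequent overall but appears in one novel only.
docs = {
    "novel_a": ["才人", "才人", "才人", "才人", "言う", "見る"],
    "novel_b": ["言う", "見る", "言う"],
    "novel_c": ["見る", "言う", "行く"],
}
total, doc_range = frequency_with_range(docs)
print(total["才人"], doc_range["才人"])
print(ranked_words(docs))
```

With a filter like this, a character name stops ranking high no matter how often one novel repeats it, so the corpus would not need to be huge just to dilute it.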

Also, what different kinds of sources could I use to get a relatively unbiased frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a certain selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites, like the Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?
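If the sources end up very different in size, one option would be to compute relative frequencies per source and average them, so that the Wikipedia dump doesn't swamp the smaller ones. A rough sketch, assuming each source is a non-empty list of tokens; `relative_freq`, `combine_sources` and the sample data are hypothetical names and values, not from any existing tool:

```python
from collections import Counter

def relative_freq(tokens):
    """Map each word to its share of all tokens in one source."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def combine_sources(sources):
    """sources: dict mapping a source name to a non-empty token list.
    Returns word -> mean relative frequency, each source weighted equally."""
    combined = Counter()
    for tokens in sources.values():
        for word, freq in relative_freq(tokens).items():
            combined[word] += freq / len(sources)
    return dict(combined)

# Toy data: the "wikipedia" source is five times larger than "news".
sources = {
    "wikipedia": ["the"] * 6 + ["species"] * 4,
    "news": ["the", "said"],
}
combined = combine_sources(sources)
for word in sorted(combined, key=combined.get, reverse=True):
    print(word, round(combined[word], 3))
```

Here "said" ends up above "species" even though its raw count is lower, because every source contributes equally regardless of size. Whether equal weighting is the right choice is a judgment call.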

Messages In This Thread
Word frequency list, what corpus is ideal? - by cophnia61 - 2017-04-15, 12:03 pm