Word frequency list: what corpus is ideal?

Ok, looking at the word frequency list generated from the "innocent novel corpus", I see that many otherwise infrequent words rank high in the list because they are used heavily within a single novel.
So I wonder how big a corpus must be to dilute those words, so that a couple of novels can't skew the list for the entire corpus?
Take 才人, for example: it's not a frequent word, but it ranks high in that list because it is the name of the main character of Zero no Tsukaima.
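
To make the problem concrete, here is a rough Python sketch of how such words could be spotted: besides the raw count, it records how many documents each word appears in, and flags words that are frequent overall but confined to one document. Everything in it is an assumption for illustration only: the function names, the thresholds, and the tiny toy corpus are made up, and the texts are assumed to be already split into tokens.

```python
from collections import Counter

def frequency_and_spread(docs):
    """docs maps a document name to its list of tokens (already segmented)."""
    total = Counter()     # raw occurrences across the whole corpus
    doc_freq = Counter()  # number of documents that contain each word
    for tokens in docs.values():
        total.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    return total, doc_freq

def concentrated_words(docs, max_docs=1, min_count=3):
    """Words seen at least min_count times but in at most max_docs documents."""
    total, doc_freq = frequency_and_spread(docs)
    return sorted(
        word for word, count in total.items()
        if count >= min_count and doc_freq[word] <= max_docs
    )

# Hypothetical toy corpus, not real data.
toy_corpus = {
    "novel_a": ["才人", "は", "才人", "の", "才人", "は"],
    "novel_b": ["彼", "は", "本", "の", "は"],
    "novel_c": ["猫", "は", "の", "は"],
}

print(concentrated_words(toy_corpus))
```

On a real corpus the two thresholds would need tuning, and the document count could also be used to re-rank the list instead of just flagging words.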

Also, what different kinds of sources could I use to get a relatively domain-agnostic frequency list?

I've just downloaded the Wikipedia dump, but obviously, as an encyclopedia, it uses a particular selection of words.
I'm downloading a bunch of news websites with WinHTTrack, and I plan to use those too.
I'm thinking of using generalist websites, like the Huffington Post, which covers all sorts of topics, and maybe others. Any suggestions?

