I found some interesting links to various language corpuses (corpii?) and frequency lists on the web, in case folks are interested. I figured I'd post links and let people do with them what they will. (Or just ignore them completely. That's cool, too.) Most of these are used for linguistic research, but I'm tempted to snag one of the free ones and play with it a bit, just to see what I can get out of it. (Not Tanaka, of course.)
David's Corpus-Based Linguistics Links (huge list of corpus sites, some pay, some free):
http://personal.cityu.edu.hk/~davidlee/d...LLinks.htm
Wortschatz (at the University of Leipzig): A free compendium of multi-language corpora, in MySQL format.
http://corpora.informatik.uni-leipzig.de/download.html
Some corpus data and frequency lists from the University of Leeds. (You can't access the JP corpus unless you're at the Uni there, but you can peruse their frequency lists.)
http://corpus.leeds.ac.uk/list.html
Interesting article on frequency lists, which is what led me down the rabbit hole in the first place:
http://www.lextutor.ca/research/
I'm not particularly interested in pursuing frequency-based learning, but I was kind of curious about it anyway, to see what others have done with it.
Either way, I thought this data might come in handy for someone out there with a crazy idea or two, so I figured I'd serve it up for y'all, and save you the work of digging it up again.
The Leeds frequency data is different from the frequency data we've all seen before, because it comes from their own corpus. And the Leipzig corpus data comes from the web, stripped of slang and orz and such, so it's not just a pile of newspaper clippings, nor is it a bunch of 2chan stuff. I suppose a frequency list built from that would give different results than the old frequency list we've seen before, which came from 5 years of newspapers, if I remember correctly.
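For anyone who wants to try building a frequency list themselves, here's a rough Python sketch. It assumes the corpus has already been exported to plain text, one sentence per string, with whitespace between tokens. Japanese text has no spaces, so it would need to go through a tokenizer first, which isn't shown here. The `frequency_list` name and the sample sentences are made up for illustration; this is not how Leeds or Leipzig built their lists.

```python
from collections import Counter


def frequency_list(sentences, top_n=None):
    """Count whitespace-separated tokens across an iterable of sentences.

    Returns (token, count) pairs, most frequent first.
    Assumes the text is already tokenized with spaces.
    """
    counts = Counter()
    for sentence in sentences:
        # Lowercase so "The" and "the" are counted together.
        counts.update(sentence.lower().split())
    return counts.most_common(top_n)


# Hypothetical stand-in for real corpus sentences.
sample = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

for word, count in frequency_list(sample, top_n=3):
    print(word, count)
```

Swapping `sample` for lines read from a real corpus export should be enough to compare, say, a web-based list against a newspaper-based one.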
