rich_f
Member
From: north carolina
Registered: 2007-07-12
Posts: 1708
I found some interesting links to various language corpuses (corpii?) and frequency lists on the web, in case folks are interested. I figured I'd post links and let people do with them what they will. (Or just ignore them completely. That's cool, too.) Most of these are used for linguistic research, but I'm tempted to snag one of the free ones and play with it a bit, just to see what I can get out of it. (Not Tanaka, of course.)
David's Corpus Based Linguistics Links: (Huge list of corpus sites, some pay, some free):
http://personal.cityu.edu.hk/~davidlee/ … LLinks.htm
Wortschatz (at the University of Leipzig): A free compendium of multi-language corpuses, in MySQL format.
http://corpora.informatik.uni-leipzig.de/download.html
Some Corpus data and frequency lists from the University of Leeds. (Can't access the JP corpus unless you're at the Uni there, but you can peruse their frequency lists.)
http://corpus.leeds.ac.uk/list.html
Interesting article on frequency lists, which is what led me down the rabbit hole in the first place:
http://www.lextutor.ca/research/
I'm not particularly interested in pursuing frequency-based learning, but I was kind of curious about it anyway, to see what others have done with it.
Either way, I thought this data might come in handy for someone out there with a crazy idea or two, so I figured I'd serve it up for y'all, and save you the work of digging it up again.
The Leeds frequency data is different from the frequency data we've all seen before, because it comes from their own corpus. And the Leipzig corpus data comes from the web, stripped of slang and orz and such, so it's not just a pile of newspaper clippings, nor is it a bunch of 2chan stuff. I suppose generating a frequency list from that would yield different results than the old frequency list we've seen before, which came from 5 years of newspapers, if I remember correctly.
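(If anyone wants to try rolling their own list from raw corpus text, the basic mechanics are simple. Here's a minimal sketch in Python; the sample text and the whitespace/regex tokenization are just placeholders for illustration. A real Japanese corpus would need a proper morphological segmenter such as MeCab, since Japanese isn't whitespace-delimited.)

```python
import re
from collections import Counter

def frequency_list(text, top_n=10):
    """Return the top_n (word, count) pairs from a text,
    most frequent first. Crude tokenizer: lowercase the text
    and pull out runs of letters/apostrophes. A real corpus
    needs language-appropriate tokenization (e.g. MeCab for JP)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "the cat sat on the mat and the dog sat too"
print(frequency_list(sample, top_n=3))
# → [('the', 3), ('sat', 2), ('cat', 1)]
```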
Nukemarine
Member
From: 神奈川
Registered: 2007-07-15
Posts: 2347
I find it works best if you have a frequency list, but re-arranged into an order more conducive to learning.
Example:
Hiragana - Arranged in pronunciation patterns. Heisig did a different take, which some liked and some did not.
Kanji - Heisig took the JLPT (edit: sorry, I meant Jouyou) list (a very, very poor example of frequency) and rearranged it into an order that's easier to learn.
Vocabulary - KO2001 took a kanji frequency and word frequency list, and arranged them by kanji themes.
It seems best if you use the frequency list to choose a large block of information you want to learn as a whole, then arrange that list into a better order. However, the frequency of the individual words is a poor guide for the actual order. The best current example I can think of is iKnow. Their Core 2000 uses a frequency-based list. Unfortunately, they present it to you in frequency order. This made the jump from the first two lists (basic) to the next 1400 (low intermediate) very drastic. KO2001 uses similar vocabulary, but a better arrangement.
I think about this every time I read somewhere "Why are you learning X word before M word, X word is more common?" I just want to scream at times "You're going to learn them both eventually, this order helps you learn as a whole faster."
Last edited by Nukemarine (2009 January 30, 8:11 pm)
rich_f
Member
From: north carolina
Registered: 2007-07-12
Posts: 1708
Yeah, but a corpus isn't just for frequency lists. It's a good sampling of a language, so you can use it for other stuff, too. The Leipzig site has a good program that works as a "corpus explorer" for the various languages they host, with a lot of interesting features. The idea is that you can see how words are related to each other through their use patterns in a target language, and what sort of relationships they form. It might give you an insight into the language, it might not. *Shrug*
It's just something to mess around with, for the most part.
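(For the curious: I don't know exactly what the Leipzig explorer does under the hood, but the general idea behind "how words relate through use patterns" can be sketched as simple co-occurrence counting, i.e. tallying which words show up near each other within a sliding window. The `window` size and the sample sentence below are arbitrary placeholders.)

```python
from collections import defaultdict

def cooccurrences(tokens, window=2):
    """Count, for each word, how often every other word appears
    within `window` positions of it. A crude collocation measure;
    real tools normalize by frequency (e.g. log-likelihood scores)."""
    pairs = defaultdict(int)
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs[(w, tokens[j])] += 1
    return dict(pairs)

toks = "strong tea and strong coffee".split()
co = cooccurrences(toks, window=1)
print(co[("strong", "tea")])     # 1
print(co[("strong", "coffee")])  # 1
```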
rich_f
Member
From: north carolina
Registered: 2007-07-12
Posts: 1708
Yeah, that's a good point.
To be honest, I'm not very interested in the frequency list stuff anymore. I'm more interested in what other things can be done with the corpus data. But I'm pretty sure this thread is going to go on and on about frequency lists anyway.
I've actually found some very useful things in the Tanaka corpus that was dropped on our doorstep last year, sentence-wise, which I've been able to use in my deck when I was looking for example sentences I liked. I'm going to look at the Leipzig corpus later on, when I have some time, to see just what's in there. Like I said, the software does some pretty neat stuff when it comes to showing connections between words in a language, none of which has anything to do with frequency lists.
For me, the point isn't whether the organizing logic (frequency, grammar levels, etc.) behind these resources is true and accurate across the board. It's about providing a transient structure that enables the creation, curation, and annotation of resources for language study. I take what's useful now and discard what isn't, but I prefer these resources to be dynamic and accessible. The rest, design- and regimentation-wise, is up to me, as I use reference points to support me while I develop my own fluency. I think it's all kind of a moot point anyway, because there's clearly a meeting of top-down and bottom-up forces at work, driving the innovation of new resources. Thanks to technology and its self-studying users... *wanders off into techno-utopian fantasies*
Last edited by nest0r (2009 January 30, 10:14 pm)