Corpus Sources

rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

I found some interesting links to various language corpuses (corpii?) and frequency lists on the web, in case folks are interested. I figured I'd post links and let people do with them what they will. (Or just ignore them completely. That's cool, too.) Most of these are used for linguistic research, but I'm tempted to snag one of the free ones and play with it a bit, just to see what I can get out of it. (Not Tanaka, of course.)

David's Corpus Based Linguistics Links: (Huge list of corpus sites, some pay, some free):
http://personal.cityu.edu.hk/~davidlee/ … LLinks.htm

Wortschatz (at the University of Leipzig): A free compendium of multi-language corpuses, in MySQL format.
http://corpora.informatik.uni-leipzig.de/download.html

Some Corpus data and frequency lists from the University of Leeds. (Can't access the JP corpus unless you're at the Uni there, but you can peruse their frequency lists.)
http://corpus.leeds.ac.uk/list.html

Interesting article on frequency lists, which is what led me down the rabbit hole in the first place:
http://www.lextutor.ca/research/

I'm not particularly interested in pursuing frequency-based learning, but I was kind of curious about it anyway, to see what others have done with it.

Either way, I thought this data might come in handy for someone out there with a crazy idea or two, so I figured I'd serve it up for y'all, and save you the work of digging it up again.

The Leeds frequency data is different from the frequency data we've all seen before, because it comes from their own corpus. And the Leipzig corpus data comes from the web, stripped of slang and orz and such, so it's not just a pile of newspaper clippings, nor is it a bunch of 2chan stuff. I suppose generating a frequency list from that would generate different results than the old frequency list we've seen before, which came from 5 years of newspapers, if I remember correctly.
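As a rough illustration of why the source corpus matters, here is a minimal Python sketch that builds a frequency list from a handful of sentences. It assumes the text is already split into tokens by spaces (real Japanese text would need a tokenizer first), and the function name `frequency_list` and the two tiny sample "corpora" are made up for the example; they are not taken from the Leeds or Leipzig data.

```python
from collections import Counter


def frequency_list(sentences, top_n=None):
    """Return (token, count) pairs, most frequent first.

    Assumes each sentence is already tokenized with single spaces.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts.most_common(top_n)


# Two hypothetical mini-corpora: one newspaper-like, one web-like.
newspaper = ["経済 は 成長 した", "経済 の 話"]
web = ["猫 は かわいい", "猫 の 話"]

if __name__ == "__main__":
    print(frequency_list(newspaper, top_n=3))
    print(frequency_list(web, top_n=3))
```

Even on toy data like this, the top entry changes with the source, which is the same effect you would expect at scale between a newspaper corpus and a web corpus.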

woodwojr Member
From: Boston Registered: 2008-05-02 Posts: 530

Just corpi. It'd be corpii if it was corpius in the nominative singular.

(Edit: this is wrong, see below.)

~J

Last edited by woodwojr (2009 February 18, 5:43 am)

Nukemarine Member
From: 神奈川 Registered: 2007-07-15 Posts: 2347

I find it best if you have a frequency list, but re-arranged in an order more conducive to learning.

Example:

Hiragana - Arranged in pronunciation patterns. Heisig did a different take, which some liked and some did not.

Kanji - Heisig took the JLPT (edit: sorry, I meant Jouyou) (a very, very poor example of frequency), and rearranged it in an order easier to learn

Vocabulary - KO2001 took a kanji frequency and word frequency list, and arranged them by kanji themes.

It seems best if you use the frequency list to choose a large block of information you want to learn as a whole, then arrange that list into a better order. However, the frequency of the individual words is a poor guide for the actual order. Best current example I can think of is iKnow. Their Core 2000 uses a frequency-based list. Unfortunately, they present it to you in frequency order. This made the jump from the first two lists (basic) to the next 1400 (low intermediate) very drastic. KO2001 uses similar vocabulary, but better arrangement.

I think about this every time I read somewhere "Why are you learning X word before M word, X word is more common?"  I just want to scream at times "You're going to learn them both eventually, this order helps you learn as a whole faster."
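The "pick a block by frequency, then reorder it" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: grouping words by their first kanji is a stand-in for a real thematic arrangement, it is not how KO2001 or iKnow actually order their material, and the names `is_kanji` and `arrange_by_first_kanji` are invented for the example.

```python
from collections import defaultdict


def is_kanji(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def arrange_by_first_kanji(freq_ranked_words, block_size):
    """Take the top `block_size` words, then group them by first kanji.

    Frequency only decides which words make it into the block; inside
    the block, words sharing a first kanji are placed together.
    """
    block = freq_ranked_words[:block_size]
    groups = defaultdict(list)
    for word in block:
        # Words with no kanji at all fall into the "" group.
        key = next((ch for ch in word if is_kanji(ch)), "")
        groups[key].append(word)
    ordered = []
    # Dicts keep insertion order, so each group appears where its
    # most frequent member first showed up.
    for key in groups:
        ordered.extend(groups[key])
    return ordered


if __name__ == "__main__":
    ranked = ["日本", "学校", "日曜日", "学生", "大学", "する"]
    print(arrange_by_first_kanji(ranked, block_size=5))
```

With the sample list, the five most frequent words are kept, but 日本 and 日曜日 end up next to each other, as do 学校 and 学生, instead of being interleaved by raw rank.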

Last edited by Nukemarine (2009 January 30, 8:11 pm)

Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Nukemarine wrote:

Kanji - Heisig took the JLPT (a very, very poor example of frequency), and rearranged it in an order easier to learn

Actually, the 常用 list of kanji has very little to do with the JLPT. The JLPT just happens to stick to the jouyou (as it should; it's the official list from the government, after all). Remember that the jouyou list was finalized in 1981, and the JLPT didn't start until 1984.

rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Yeah, but a corpus isn't just for frequency lists. It's a good sampling of a language, so you can use it for other stuff, too. The Leipzig site has a good program that works as a "corpus explorer" for the various languages they host, and it has a lot of interesting features. The idea is that you can see how words are related to each other in their use patterns in a target language, and what sort of relationships they make. It might give you an insight into the language, it might not. *Shrug*

It's just something to mess around with, for the most part.
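For anyone curious what "how words are related in their use patterns" might look like in practice, here is a small Python sketch that counts which words appear together in the same sentence. It is an assumption-laden toy, not the method the Leipzig software uses: sentences are assumed to be pre-tokenized with spaces, and `cooccurrence_counts` and `neighbours` are hypothetical helper names.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences each unordered pair of words shares."""
    pairs = Counter()
    for sentence in sentences:
        # Sort the unique tokens so each pair has one canonical order.
        tokens = sorted(set(sentence.split()))
        pairs.update(combinations(tokens, 2))
    return pairs


def neighbours(pairs, word, top_n=5):
    """Return the words that most often share a sentence with `word`."""
    related = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            related[b] += n
        elif b == word:
            related[a] += n
    return related.most_common(top_n)


if __name__ == "__main__":
    sample = ["経済 が 成長 する", "経済 が 悪化 する", "猫 が 寝る"]
    print(neighbours(cooccurrence_counts(sample), "経済"))
```

Raw sentence-level counts like these are dominated by particles and other very common words, so a real tool would presumably weight or filter them in some way.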

Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

My own view on frequency is that there is no such thing as an objective frequency list that actually matters. It all depends on how you measure it and how you use the language. The frequency of the word 経済 is obviously insane if you check newspapers, economic programs, etc. If you want to read manga, it's a much less relevant word. When I lived in Japan, I seldom heard it. Therefore, I also find learning by way of frequency a bad idea. You have to learn by what YOU will see most frequently...

There's also another very important concept relevant to that. IF you will see a word extremely frequently... you don't have to study it! Why waste time studying a word you will learn from exposure in no time since it's so frequent? It's more important to focus on the words which you will encounter, but not so often that you learn them without effort.
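The "skip what you'll absorb anyway" idea amounts to picking a middle band out of a frequency list. A minimal Python sketch follows, assuming you already have per-word counts from your own reading material; the thresholds, the sample counts, and the name `study_candidates` are all placeholders for illustration, not recommended values.

```python
def study_candidates(counts, min_count, max_count):
    """Pick words seen often enough to matter but not often enough
    to be learned from exposure alone.

    `counts` maps word -> number of occurrences in your own material.
    Results are ordered from most to least frequent within the band.
    """
    in_band = (w for w, n in counts.items() if min_count <= n <= max_count)
    return sorted(in_band, key=lambda w: -counts[w])


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real corpus.
    counts = {"する": 500, "経済": 40, "成長": 12, "悪化": 1}
    print(study_candidates(counts, min_count=5, max_count=100))
```

Here する is dropped as too frequent to need study and 悪化 as too rare to be worth it yet, leaving 経済 and 成長 as the words to review.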

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Corpora. ;p

Here's a good overview: http://www-writing.berkeley.edu/TESL-EJ/ej32/a1.html

It talks about the debates over the value of corpora, among other things.

Last edited by nest0r (2009 January 30, 7:47 pm)

rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Yeah, that's a good point.

To be honest, I'm not very interested in the frequency list stuff anymore. I'm more interested in what other things can be done with the corpus data. But I'm pretty sure that this thread is going to go on and on about frequency lists anyway.

I've actually found some very useful things in the Tanuki corpus that was dropped on our doorstep last year, sentence-wise, that I've been able to use in my deck when I was looking for example sentences I liked. I'm going to look at the Leipzig corpus later on when I have some time to see just what's in there. Like I said, the software does some pretty neat stuff when it comes to showing connections between words in a language, none of which has anything to do with frequency lists.

woodwojr Member
From: Boston Registered: 2008-05-02 Posts: 530

nest0r wrote:

Corpora. ;p

You are absolutely right. Frickin' third declension.

~J

Reply #10 - 2009 January 30, 8:39 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

For me, the point isn't whether the organizing logic (frequency, grammar levels, etc) behind these resources is true and accurate across the board. It's about providing a transient structure that enables the creation, the curation and annotation of resources for language study. I take what's useful now and discard what isn't, but I prefer these resources to be dynamic and accessible. The rest, design/regimentation wise, is up to me as I use reference points to support me as I develop my own fluency. I think that it's all kind of a moot point, because there's clearly a meeting of top-down and bottom-up forces at work, driving the innovation of new resources. Thanks to technology and its self-studying users... *wanders off into techno-utopian fantasies*

Last edited by nest0r (2009 January 30, 10:14 pm)

onafarm Member
Registered: 2005-11-12 Posts: 129 Website

Indeed corpora. Or corpuses.
