How were the 25,000-word frequency decks created?

Reply #1 - 2012 July 05, 11:35 pm
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

I'm really curious how these decks were made, and whether it's possible to do the same for another language. Is there a program out there that can scan ebooks, websites, etc. and build these lists? And is there a program that could convert such a list into cards?

I would really like to have something like that for Korean. I have a 1,700-card deck (3,400 if you count the generated reverse cards) that I typed by hand, but I would like something larger and better organized.

Reply #2 - 2012 July 05, 11:40 pm
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

Mods: I don't think this is a technical question, since it's about generating large word lists for whatever subject the user is studying, i.e. a learning resource, so please don't delete this one.

Reply #3 - 2012 July 05, 11:41 pm
HonyakuJoshua
Member
From: The Unique City of Liverpool
Registered: 2011-06-03
Posts: 572
Website

I can't seem to find this deck?

Reply #4 - 2012 July 05, 11:52 pm
vix86
Member
From: Tokyo
Registered: 2010-01-19
Posts: 1246

theadamie wrote:

Mods: I don't think this is a technical question, since it's about generating large word lists for whatever subject the user is studying, i.e. a learning resource, so please don't delete this one.

This question is fine.

In response: I think the 25k word decks basically take the priority words from EDICT. If you look on Jim Breen's website, I believe it explains where he got the figure from. Probably the Tanaka corpus, if I had to guess.

Reply #5 - 2012 July 05, 11:55 pm
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

HonyakuJoshua wrote:

I can't seem to find this deck?

That's strange, I just looked and it's not there. I tried it a long time ago but found that it worked much better for me to use the decks that went with my textbooks. I deleted it and don't have a backup. Hmm, maybe someone here can email it or re-upload it...

Reply #6 - 2012 July 05, 11:57 pm
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

theadamie wrote:

HonyakuJoshua wrote:

I can't seem to find this deck?

That's strange, I just looked and it's not there. I tried it a long time ago but found that it worked much better for me to use the decks that went with my textbooks. I deleted it and don't have a backup. Hmm, maybe someone here can email it or re-upload it...

So is there any way to easily re-create what he did, but for Korean? I would be willing to put up to 10 hours of work into it, plus fine-tuning if need be, but I probably don't have time to take on a bigger project than that.

Last edited by theadamie (2012 July 05, 11:57 pm)

Reply #7 - 2012 July 06, 12:20 am
erlog
Member
From: Japan
Registered: 2007-01-25
Posts: 518

The thing you need to understand about a lot of this data is that it wasn't really created by any of the people in any of the places you're looking. Breen did some cross-referencing, added that frequency data to EDICT, and then people just dumped the data out of EDICT into flashcards.

I believe the frequency data is based on a study of newspapers that was done somewhere else.
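To make the "dump it out of EDICT" step concrete, here is a rough sketch of what that could look like. It assumes the classic one-line-per-entry EDICT text layout (`HEADWORD [READING] /gloss/gloss/`) with common words carrying a `(P)` gloss marker. The sample lines are hand-written in that style rather than copied from the dictionary file, and the function names are made up for illustration. This is not the script anyone actually used for the 25k decks.

```python
import csv
import io
import re

# Assumed EDICT-style line: "HEADWORD [READING] /gloss/gloss/"
# The bracketed reading is optional (kana-only entries omit it).
LINE_RE = re.compile(
    r"^(?P<word>\S+)(?: \[(?P<reading>[^\]]+)\])? /(?P<glosses>.*)/$"
)


def parse_line(line):
    """Parse one EDICT-style line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    glosses = [g for g in m.group("glosses").split("/") if g]
    is_priority = "(P)" in glosses
    glosses = [g for g in glosses if g != "(P)"]  # drop the marker from the meaning
    return {
        "word": m.group("word"),
        "reading": m.group("reading") or m.group("word"),
        "meaning": "; ".join(glosses),
        "priority": is_priority,
    }


def priority_cards(lines):
    """Keep only entries flagged (P) and return (word, reading, meaning) tuples."""
    cards = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["priority"]:
            cards.append((entry["word"], entry["reading"], entry["meaning"]))
    return cards


def to_tsv(cards):
    """Render cards as tab-separated text, one card per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buf.getvalue()


# Hand-written sample lines in the assumed format (not real dictionary data).
SAMPLE = [
    "食べる [たべる] /(v1,vt) to eat/(P)/",
    "これ /(pn) this/(P)/",
    "薔薇 [ばら] /(n) rose/",
]

if __name__ == "__main__":
    print(to_tsv(priority_cards(SAMPLE)), end="")
```

The tab-separated output is the kind of plain text file that Anki can import, with one column per card field.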

This project is probably going to take you more than 10 hours' worth of work. Also, it's really unclear how useful frequency data actually is. It's always a Zipf distribution.

http://en.wikipedia.org/wiki/Zipf%27s_law

Once you get past roughly the first 1,000 words, the word frequencies become so close to one another that the differences stop mattering much in practice. So the 4,000th word tends to be only slightly more common than the 15,000th.
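A quick way to see this is to compute the numbers for an idealized Zipf distribution. The sketch below assumes frequency proportional to 1/rank (exponent 1) over a 25,000-word vocabulary. Both of those are simplifying assumptions, not measurements from any real corpus. Under that model the 4,000th word is still 3.75 times as frequent as the 15,000th, but both are so rare that the gap works out to only a handful of occurrences per million words.

```python
def zipf_frequencies(vocab_size, exponent=1.0):
    """Relative frequencies for ranks 1..vocab_size under an idealized Zipf law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]  # normalized so the list sums to 1


def coverage(freqs, top_n):
    """Share of running text covered by the top_n most frequent words."""
    return sum(freqs[:top_n])


if __name__ == "__main__":
    # Assumed vocabulary size, chosen to match the 25k deck size in this thread.
    freqs = zipf_frequencies(25_000)
    for rank in (1, 1_000, 4_000, 15_000):
        per_million = freqs[rank - 1] * 1_000_000
        print(f"rank {rank:>6}: about {per_million:.1f} occurrences per million words")
    for top_n in (1_000, 4_000, 15_000):
        print(f"top {top_n:>6} words cover {coverage(freqs, top_n):.1%} of running text")
```

Real corpora only follow Zipf's law approximately, so treat the printed numbers as a picture of the shape of the curve, not as predictions.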

Last edited by erlog (2012 July 06, 12:33 am)

Reply #8 - 2012 July 06, 12:56 am
yudantaiteki
Member
From: 東京
Registered: 2009-10-03
Posts: 3020

And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading.  I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.

Reply #9 - 2012 July 06, 1:20 am
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

yudantaiteki wrote:

And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading.  I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.

This makes sense. I'm *guessing* I have a vocabulary of 5,000-10,000 words, but who really knows, right? I found a list of 3,000 words and knew about 90% of them all the way through.

Reply #10 - 2012 July 06, 1:22 am
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

yudantaiteki wrote:

And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading.  I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.

good post

Reply #11 - 2012 July 06, 6:58 pm
theadamie
Member
From: Kentucky-Seoul
Registered: 2011-07-31
Posts: 90
Website

Is there a program where I can plug in material that I would find useful, like websites on nutrition, and generate a list or cards from that?
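For what it's worth, the counting part is simple to sketch, assuming the text has already been saved as plain text. The snippet below pulls runs of Hangul syllables out of a string and counts them. The function names and the sample string are made up for illustration. Note the big limitation: it counts space-separated chunks as they appear, so a noun with different particles attached shows up as several different "words". Getting dictionary forms would need a Korean morphological analyzer, which this sketch does not include.

```python
import re
from collections import Counter

# Runs of precomposed Hangul syllables (U+AC00 to U+D7A3).
HANGUL_RUN = re.compile(r"[가-힣]+")


def frequency_list(text, min_count=1):
    """Return (token, count) pairs sorted from most to least frequent."""
    counts = Counter(HANGUL_RUN.findall(text))
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]


def to_card_lines(freq_list):
    """Format each (token, count) pair as a tab-separated line for import."""
    return [f"{tok}\t{n}" for tok, n in freq_list]


if __name__ == "__main__":
    # Tiny made-up sample standing in for text copied from a nutrition site.
    sample = "단백질 섭취 단백질 비타민"
    for line in to_card_lines(frequency_list(sample)):
        print(line)
```

Meanings would still have to be filled in by hand or looked up in a dictionary file, since the counts alone only say which words to prioritize.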
