
How were the 25,000-word frequency decks created?

#1
I'm really curious how these decks were made and whether it's possible to do the same for another language. Is there a program out there that can scan ebooks, websites, etc. and make these lists? And is there a program that could convert that list into cards?

I would really like to have something like that for Korean. I have a 1,700-card deck (3,400 if you count the generated reverse cards) that I typed by hand, but I would like something larger and better organized.
#2
mods- I don't think this is a technical question as it's relevant for generating large lists for the user's given subject aka learning resource, so please don't delete this one.
#3
I can't seem to find this deck?
#4
theadamie Wrote:mods- I don't think this is a technical question as it's relevant for generating large lists for the user's given subject aka learning resource, so please don't delete this one.
This question is fine.

In response: the 25k word decks basically take the priority-marked words from EDICT, I think. If you look on Jim Breen's website, I believe it explains where he got the frequency figures from. Probably the Tanaka corpus, if I had to guess.
#5
HonyakuJoshua Wrote:I can't seem to find this deck?
that's strange, I just looked and it's not there. I tried it a long time ago but found that it was much better for me to use the decks that went with my textbooks. I deleted it and don't have it on backup :/ maybe someone here can email it or re-upload it...
#6
theadamie Wrote:
HonyakuJoshua Wrote:I can't seem to find this deck?
that's strange, I just looked and it's not there. I tried it a long time ago but found that it was much better for me to use the decks that went with my textbooks. I deleted it and don't have it on backup :/ maybe someone here can email it or re-upload it...
So is there any way to easily re-create what he did, but for Korean? I would be willing to put up to 10 hours of work into it, plus fine-tuning if need be, but I probably don't have time to take on a bigger project than that.
Edited: 2012-07-05, 11:57 pm
#7
The thing you need to understand about a lot of this data is that it wasn't really created by any of the people in any of the places you're looking. Breen did some cross-referencing, added that frequency data to EDICT, and then people just dumped the data out of EDICT into flashcards.

I believe the frequency data is based on a study of newspapers that was done somewhere else.
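For anyone wondering what that kind of frequency count looks like mechanically, here is a minimal Python sketch. It is not how the EDICT figures were actually produced. The function names and the sample sentence are made up for illustration, and it assumes words can be split on whitespace and punctuation. That assumption fails for Japanese, which needs a morphological analyzer to find word boundaries, and it is only rough for Korean, where particles attach to the word. The output is a tab-separated file, which is a format flashcard programs such as Anki can import.

```python
import csv
import io
import re
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, most frequent first.

    Assumption: a "word" is any run of letters/digits. Real Japanese or
    Korean text would need a proper tokenizer instead of this regex.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


def write_cards(ranked, out):
    """Write one tab-separated row per word: word, rank, count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for rank, (word, count) in enumerate(ranked, start=1):
        writer.writerow([word, rank, count])


# Made-up sample text standing in for a scanned ebook or website.
sample = "the cat sat on the mat and the dog sat too"
ranked = count_words(sample)

# Written to an in-memory buffer here; a real run would open a .txt file.
buf = io.StringIO()
write_cards(ranked, buf)
tsv = buf.getvalue()
print(tsv)
```

The list only contains the words themselves. Definitions for the back of each card would still have to come from a dictionary file or be typed in by hand, which is where most of the time goes.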

This project is probably going to take you more than 10 hours' worth of work. Also, it's really unclear how useful frequency data actually is. Word frequencies pretty much always follow a Zipf distribution.

http://en.wikipedia.org/wiki/Zipf%27s_law

Once you get past roughly the first 1,000 words, the frequencies are all so low that the differences between them stop mattering much in practice. In absolute terms, the 4,000th word tends to be only slightly more common than the 15,000th.
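To put rough numbers on that, here is a small Python sketch of an idealized Zipf law, where a word's frequency is proportional to 1/rank. The 25,000-word vocabulary size, the exponent of 1 and the function names are all assumptions for illustration. Real corpora only follow this shape approximately.

```python
def zipf_norm(vocab_size=25000, s=1.0):
    """Normalizing constant: sum of 1/k**s over every rank in the vocabulary."""
    return sum(1.0 / k**s for k in range(1, vocab_size + 1))


def zipf_share(rank, vocab_size=25000, s=1.0):
    """Fraction of all running text taken up by the word at this rank."""
    return (1.0 / rank**s) / zipf_norm(vocab_size, s)


def coverage(top_n, vocab_size=25000, s=1.0):
    """Fraction of all running text covered by the top_n most common words."""
    head = sum(1.0 / k**s for k in range(1, top_n + 1))
    return head / zipf_norm(vocab_size, s)


share_4000 = zipf_share(4000)
share_15000 = zipf_share(15000)

print(f"share of rank 4,000:  {share_4000:.6%}")
print(f"share of rank 15,000: {share_15000:.6%}")
print(f"ratio between them:   {share_4000 / share_15000:.2f}")
print(f"coverage of top 1,000 words: {coverage(1000):.1%}")
```

In this model the 4,000th word is 3.75 times as frequent as the 15,000th, but each of them accounts for well under a hundredth of a percent of running text, so the absolute gap is tiny. Most of the text is covered by the first thousand or so ranks.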
Edited: 2012-07-06, 12:33 am
#8
And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading. I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.
#9
yudantaiteki Wrote:And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading. I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.
This makes sense. I'm *guessing* I have a vocabulary of 5,000-10,000 words, but who really knows, right? I found a list of 3,000 words and knew about 90% of them all the way through.
#10
yudantaiteki Wrote:And just as with kanji, once you get past a certain threshold of common words that appear in everything, the frequency of a word is heavily dependent on what you are going to be reading. I know people have this idealistic notion that they'll learn everything so that they'll always be prepared no matter what they read, but that's probably not realistic for most people -- there will almost always be an adjustment period when you read new material to find out the vocabulary specific to it.
good post
#11
Is there a program where I can plug in material that I would find useful, like websites on nutrition, and generate a list/cards from that?