I read a lot and save up a lot of vocab for studying later. Is there any way to reorder a personal vocab list based on frequency order?
By frequency order, I mean something like cangy's frequency-sorted core6k, EDICT's priority tags or whatever the hive mind can think of.
(I'm looking for an automatic or quick manual way of doing this; if it's too long winded, then never mind.)
One way to roughly simulate that would be to add words only after you've seen them a set number of times. That way the list will be concentrated on the more frequently occurring words.
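A minimal sketch of that "add only after N sightings" idea in Python. The log format (one entry per encounter) and the name words_to_add are assumptions made for illustration, not part of any existing tool:

```python
from collections import Counter

def words_to_add(sightings, min_count=3):
    """Return words seen at least min_count times, most-seen first."""
    counts = Counter(sightings)
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical reading log: one entry each time an unknown word is encountered.
log = ["木琴", "神籤", "木琴", "荒ぶ", "木琴", "神籤", "神籤", "神籤"]
print(words_to_add(log))  # ['神籤', '木琴']
```

Raising min_count makes the list stricter; words seen only once, like 荒ぶ here, never make it into the deck.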
This thread has a list of words ordered by their frequency, based on a reasonable sample of Japanese literature. May be helpful.
As Inny Jan linked, CB released a program that calculates word and kanji frequency. He even provides the list generated from a large corpus of 5,000+ Japanese novels, so the data is sound. As for learning in frequency order, I don't recommend it, based on my own experience and that of others. Having the word groups follow a pattern allows for better learning and retention.
Here's a tip: Take the ranked words from CB's list and split them into groups of 1,000. Use Cangy's word sorting program to sort those 1,000 words into the 2k1KO order. Learn those words. This gives you ready-to-use words, but in an order that's easier to digest.
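The splitting step of that tip could look roughly like this in Python. It covers only the grouping; the 2k1KO reordering is left to Cangy's program, and the placeholder words stand in for the real ranked list:

```python
def split_into_groups(ranked_words, size=1000):
    """Split a frequency-ranked word list into consecutive groups of `size`."""
    return [ranked_words[i:i + size] for i in range(0, len(ranked_words), size)]

# Placeholder data standing in for the ranked list (most frequent first).
ranked = [f"word{i}" for i in range(1, 2501)]
groups = split_into_groups(ranked)
print([len(g) for g in groups])  # [1000, 1000, 500]
```

Each group can then be saved to its own file and run through the sorter separately.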
@Nukemarine
Thank you, but I can't follow your suggestion.
I referred to self-collected vocab precisely because I have outgrown premade vocab lists, e.g. from the 35 words I studied yesterday, only 3 appear in the mammoth list Inny Jan linked to. And I referred to frequency order because I've been hitting the wall of diminishing returns, i.e. I'm learning words I almost never find again, while I keep struggling with contextually important but unpredictable words that I haven't found even once before.
@nadiatims
Agreed. Perhaps I should stop being so liberal (obsessive?) about adding unknown vocab. The problem is how not to plateau in the process.
tuliaoth wrote:
@Nukemarine
Thank you, but I can't follow your suggestion.
I referred to self-collected vocab precisely because I have outgrown premade vocab lists, e.g. from the 35 words I studied yesterday, only 3 appear in the mammoth list Inny Jan linked to. And I referred to frequency order because I've been hitting the wall of diminishing returns, i.e. I'm learning words I almost never find again, while I keep struggling with contextually important but unpredictable words that I haven't found even once before.
@nadiatims
Agreed. Perhaps I should stop being so liberal (obsessive?) about adding unknown vocab. The problem is how not to plateau in the process.
Well, if the words are rare (not in the top 50,000 counts as rare in my book) I'm not sure if frequency order is the solution. Assuming your word list is large enough to benefit, the KO2k1 sorting order would still be the best bet.
Sorting a word list is not easy at first. You have to load perl and then run Franki hoping everything is set up right. It's likely another person would do it for you if you made the list available on google documents or something.
Hope it helps.
Nukemarine wrote:
Well, if the words are rare (not in the top 50,000 counts as rare in my book) I'm not sure if frequency order is the solution. Assuming your word list is large enough to benefit, the KO2k1 sorting order would still be the best bet.
Sorting a word list is not easy at first. You have to load perl and then run Franki hoping everything is set up right. It's likely another person would do it for you if you made the list available on google documents or something.
Hope it helps.
Thanks!
Regarding word frequency, I'm referring to vocab from light novels of basic to intermediate difficulty, e.g. words like 玉手箱, 木琴, 荒ぶ, 吸い付ける, 松竹梅, 強請る and 神籤, which didn't seem (to me) all that far-fetched to merit exclusion from a top 50k list.
And as for the list I have, it's simply the periodic 1k vocab I collect during any given month and then 'prime' for learning during the following month, which logically doesn't justify bothering anyone with scripts or the like.
Anyway, I'm beginning to see that it may be too difficult to sort a custom vocab list to prioritize the most useful words. Unless one of the generous programmers here is ever attracted to the idea, I'll just try to be more selective and keep going as I have.
In case you are curious, I zipped up the full Word Frequency Report (along with the other reports):
Download link (via MediaFire)
All of those words appear in the full list.
Last edited by cb4960 (2012 May 24, 10:28 pm)
tuliaoth wrote:
Anyway, I'm beginning to see that it may be too difficult to sort a custom vocab list to prioritize the most useful words. Unless one of the generous programmers here is ever attracted to the idea, I'll just try to be more selective and keep going as I have.
I wrote a simple program called cb's Frequency List Sorter that sorts a list of Japanese words based on their frequency. See the second part of this post: http://forum.koohii.com/viewtopic.php?p … 76#p176876
Last edited by cb4960 (2012 May 25, 2:06 am)
cb4960 wrote:
tuliaoth wrote:
Anyway, I'm beginning to see that it may be too difficult to sort a custom vocab list to prioritize the most useful words. Unless one of the generous programmers here is ever attracted to the idea, I'll just try to be more selective and keep going as I have.
I wrote a simple program called cb's Frequency List Sorter that sorts a list of Japanese words based on their frequency. See the second part of this post: http://forum.koohii.com/viewtopic.php?p … 76#p176876
Excellent work! With the data already available, harnessing it like this was bound to happen sooner or later. I copy-pasted your full list into Excel and found that...
The 10k most frequent words have a frequency index of 2,317 or above.
The 20k most frequent words have a frequency index of 949 or above.
The 30k most frequent words have a frequency index of 528 or above.
The 40k most frequent words have a frequency index of 328 or above.
The 50k most frequent words have a frequency index of 220 or above.
(From a total of 238,265 unique words.)
As I see it, an intermediate reader focused on vocab acquisition has no reason not to use the word frequency list generator. Even if 'limited' to 5,000 novels, I assume it's still a fairly accurate indicator of a word's absolute frequency in the language. In fact, I'm surprised this sorting-by-frequency idea wasn't brought up before by post-core learners.
Sorting collected vocab by frequency yields significant benefits:
- You will fill larger vocab gaps before smaller ones, so reading itself becomes easier as quickly as possible, much as it does with texts that keep reusing JLPT vocab.
- You can learn less vocab without compromising the overall progress of your reading ability. Simply decide not to learn words whose frequency index is lower than X.
- You have an easier time SRSing, as you're more likely to re-encounter the words you're reviewing, depending on the frequency index average of your word list.
Thanks for this. Good idea, effective implementation and possibly amazing results.
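For anyone who wants to script the cut-off idea (skip words whose frequency index is lower than X), here is a rough Python sketch. The function name and the frequency numbers are invented for illustration; real values would come from the frequency report:

```python
def sort_by_frequency(words, freq, min_freq=0):
    """Keep words whose frequency index is at least min_freq, most frequent first.

    Words missing from `freq` count as 0, so they sort last (or are dropped
    when min_freq is above 0).
    """
    kept = [w for w in words if freq.get(w, 0) >= min_freq]
    return sorted(kept, key=lambda w: freq.get(w, 0), reverse=True)

# Made-up frequency indexes, NOT taken from the actual report.
freq = {"玉手箱": 900, "木琴": 250, "松竹梅": 400}
collected = ["木琴", "神籤", "玉手箱", "松竹梅"]

print(sort_by_frequency(collected, freq))
print(sort_by_frequency(collected, freq, min_freq=300))
```

The first call keeps everything and puts the unlisted word last; the second drops anything below the chosen threshold.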
One problem with using the word frequency list generator (which is a remarkable achievement and I think useful in many respects!) is that it doesn't account well for different conjugations of the same verb, or kana/kanji variations. You have 1203 instances of 挟む, 1832 instances of 挟ま (eg 挟まれる), 5681 instances of 挟ん (挟んで・挟んだ), 1322 of はさむ, 1975 of はさま, and 7750 of はさん (some of which may well be 破算 or 破産). If you don't combine those, then verbs look as if they're somewhat rarer than they actually are, by comparison with nouns. And if you're going down the list in frequency order, the first one you'll come to is はさん -- so hopefully people will pay enough attention to realize that it's probably はさむ that accounts for the majority of those, and not 破産.
I'm not well-versed enough in the various Japanese text processing tools to know whether it would be feasible to do a list that combines conjugations of the same verb, but it would be really useful.
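As a rough illustration of what combining would do, here is a Python sketch using the counts quoted above. The surface-to-dictionary-form table is written by hand and covers only the kanji forms; a real solution would presumably get the dictionary form from a morphological analyzer such as MeCab, and the kana forms are left alone because はさん is ambiguous:

```python
from collections import Counter

# Surface-form counts quoted in the post above.
surface_counts = {
    "挟む": 1203, "挟ま": 1832, "挟ん": 5681,
    "はさむ": 1322, "はさま": 1975, "はさん": 7750,
}

# Hand-written mapping (an assumption for this example only).
to_lemma = {"挟ま": "挟む", "挟ん": "挟む"}

def merge_counts(counts, mapping):
    """Add up counts of surface forms that map to the same dictionary form."""
    merged = Counter()
    for surface, count in counts.items():
        merged[mapping.get(surface, surface)] += count
    return dict(merged)

merged = merge_counts(surface_counts, to_lemma)
print(merged["挟む"])  # 8716
```

Merged this way, 挟む already outranks はさん (7,750) even before any of the kana forms are folded in.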
@cb4960
Exactly what I was looking for. I knew this wasn't impossible. Thanks a million! Forwarding to my friends right now.
@ivanov
Wow, I hadn't really thought that far out, but I do see how unexpectedly useful it can be.
@Fillanzea
I agree with Nukemarine that beginners can benefit more from patterns than frequency, so I think this tool would be more useful for intermediate learners, who are more likely to be aware of repetitions due to verb conjugations.
But still, this is for sorting a custom list based on that full 200k list with repetitions, so your custom list shouldn't have repetitions unless you put them there yourself. And even then, most vocab collecting comes from rikaichan, which (I guess) most users set up to save the dictionary form rather than the word as originally found in the text.
Re: kana/kanji variations, think of it as learning the most frequent variation first. I bet that most learners, by oversight or not, do sometimes add the same word with different variations to their vocab decks anyway.
If you use OS X or GNU/Linux, you can paste something like this in a terminal application:
curl -s lri.me/japanese/word-frequency.txt | tail -n +3 > word-frequency.txt
awk -F$'\t' 'NR==FNR{a[$0]=n++;next}{print a[$1]"\t"$0}' word-frequency.txt wordlist.txt | sort -n
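Here wordlist.txt is either a file with one word per line or a TSV file where the words are in the first column. Each output line is prefixed with the word's rank in word-frequency.txt (counting from 0). Words that are missing from word-frequency.txt get an empty rank, so they sort to the top rather than the bottom.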
word-frequency.txt is an average of word frequency lists based on anime and drama subtitles, novels, and websites. All of them are for lemmas (dictionary forms), not for surface forms.
Last edited by lauri_ranta (2013 August 13, 10:10 am)
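For those without awk, roughly the same lookup can be sketched in Python. This is not a port of the exact command: the function name is made up, it returns the rows without the rank prefix, and it deliberately puts words missing from the frequency list at the end instead of the top:

```python
def sort_rows_by_rank(freq_lines, rows):
    """Sort rows (word in the first tab-separated column) by frequency rank.

    freq_lines holds one word per line, most frequent first. Rows whose
    word is not listed keep their relative order and go last.
    """
    rank = {}
    for i, line in enumerate(freq_lines):
        rank.setdefault(line.rstrip("\n"), i)  # first occurrence wins
    unlisted = len(rank)

    def key(row):
        word = row.rstrip("\n").split("\t")[0]
        return rank.get(word, unlisted)

    return sorted(rows, key=key)

# Tiny stand-in for word-frequency.txt and wordlist.txt.
freq_lines = ["する\n", "見る\n", "挟む\n"]
rows = ["神籤\tfortune slip", "挟む\tto sandwich", "見る\tto see"]
for row in sort_rows_by_rank(freq_lines, rows):
    print(row)
```

With real files, the two lists could be read with open(..., encoding="utf-8") and readlines().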

