
Counting the words you know?

#1
There's a point when you can't keep track of how many words you know. At least, for people who don't catalog and SRS everything. I don't know whether I know 7000 words or 10 000 or more. It doesn't really matter, but I'd like to know. Especially so I can go over the words I should know but have forgotten. I thought I could use Trinity and enter all the words I want to keep track of, but I realize that I can't start a new account :/ Is there another program that will keep track of how many unique words you add to it?
#2
Uhm, an Anki plugin would be neat.
#3
I can tell you how you'd go about writing one. But if someone's already gone and written it, that'd be news to me.
#4
Well, I can't write computer programs. I'll find some other alternative. Anki will prevent you from adding a card that's the same as another, right? I guess I'll just use that.
#5
I've thought about this. If you were learning a language that (discounting contractions like "I've", "there're", etc.) used exactly one character to separate words (i.e. " "), such as English or any other language that uses the Roman alphabet, then it is fairly straightforward to count the words you have learned quite accurately, since you need no underlying knowledge of the language to count the distinct words in your SRS sentences. This doesn't quite capture all the information, though, because it is arguable that knowing "press" and "conference" doesn't automatically mean you know "press conference", so if "press conference" appeared, then perhaps it should be treated as a word in its own right. I suppose this is more personal opinion. However, determining word boundaries in Japanese strikes me as a much more involved problem, and one that would almost certainly require a knowledge of grammar and a comprehensive dictionary to approximate accurately.
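
For the space-separated case, a minimal sketch in Python could look like the following. The function name and the two sample sentences are made up for illustration, and the regular expression is only one possible choice: it keeps apostrophes so that a contraction like "I've" counts as a single word.

```python
import re


def distinct_words(sentences):
    """Return the set of distinct lowercase words found in the sentences."""
    words = set()
    for sentence in sentences:
        # Letters and apostrophes only, so "I've" stays one token
        # and trailing punctuation is dropped.
        words.update(re.findall(r"[a-z']+", sentence.lower()))
    return words


# Invented sample sentences standing in for an SRS deck.
deck = [
    "The press conference starts at noon.",
    "I've read the press release.",
]
vocab = distinct_words(deck)
print(len(vocab))
```

Note that this treats "press" and "conference" as two separate words, so the "press conference" question raised above would still have to be handled by hand.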
#6
bandwidthjunkie Wrote:I've thought about this. If you were learning a language that (discounting contractions like "I've", "there're", etc.) used exactly one character to separate words (i.e. " "), such as English or any other language that uses the Roman alphabet, then it is fairly straightforward to count the words you have learned quite accurately, since you need no underlying knowledge of the language to count the distinct words in your SRS sentences. This doesn't quite capture all the information, though, because it is arguable that knowing "press" and "conference" doesn't automatically mean you know "press conference", so if "press conference" appeared, then perhaps it should be treated as a word in its own right. I suppose this is more personal opinion. However, determining word boundaries in Japanese strikes me as a much more involved problem, and one that would almost certainly require a knowledge of grammar and a comprehensive dictionary to approximate accurately.
Let's not forget all the words you know which aren't in your SRS.
#7
Tobberoth Wrote:Let's not forget all the words you know which aren't in your SRS.
How could you possibly have learned anything that isn't in your SRS? ;) lol
#8
bandwidthjunkie Wrote:I've thought about this. If you were learning a language that (discounting contractions like "I've", "there're", etc.) used exactly one character to separate words (i.e. " "), such as English or any other language that uses the Roman alphabet, then it is fairly straightforward to count the words you have learned quite accurately, since you need no underlying knowledge of the language to count the distinct words in your SRS sentences. This doesn't quite capture all the information, though, because it is arguable that knowing "press" and "conference" doesn't automatically mean you know "press conference", so if "press conference" appeared, then perhaps it should be treated as a word in its own right. I suppose this is more personal opinion. However, determining word boundaries in Japanese strikes me as a much more involved problem, and one that would almost certainly require a knowledge of grammar and a comprehensive dictionary to approximate accurately.
fortunately, strong people have done the hard work. see for example mecab or kakasi.
#9
Actually I've thought about it some more, and there might be an even easier way. I'll look into writing a plugin tonight or tomorrow.

I know this is a "just google it" sort of question, but I'm very short on time: does anyone have a machine readable list of vocab for the JLPT levels? (preferably the new ones, but both would be ideal). If you're interested in the plugin, this would help it along.
Edited: 2009-08-10, 11:42 pm
#10
mafried Wrote:Actually I've thought about it some more, and there might be an even easier way. I'll look into writing a plugin tonight or tomorrow.

I know this is a "just google it" sort of question, but I'm very short on time: does anyone have a machine readable list of vocab for the JLPT levels? (preferably the new ones, but both would be ideal). If you're interested in the plugin, this would help it along.
http://www.thbz.org/kanjimots/jlpt.php3
#11
Perfect. Thanks, Tobberoth.
#12
mafried, i think the jlpt vocabulary shared anki deck is better. it has more words, and it doesn't include levels 3 and 4 in the level 2 file. also, it doesn't list a bunch of words in kana in level 4.

what are you trying to do? i can send you the python code i used for http://forum.koohii.com/showthread.php?p...7#pid65607 if you want.
#13
The same thing you already did, it seems. I wasn't aware of your efforts. Thanks for the link.

I am/was in the process of writing a plugin that would run the "Expression" or "Kanji" fields of a Japanese deck through mecab to extract a conjugation-free word list, then print statistical information gleaned from that (not unlike what you have posted). The JLPT data would have been for comparison's sake. I was also planning on showing a weighted percent coverage of a frequency list as well.
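
Roughly, the two pieces could be sketched like this in Python. This is not the plugin itself, just an illustration under some assumptions: it parses MeCab's text output rather than calling MeCab, it assumes the default IPADIC feature layout (base form as the 7th comma-separated feature, which other dictionaries may not follow), and the function names, the sample output and the frequency counts are all invented for the example.

```python
def base_forms(mecab_output):
    """Extract dictionary (base) forms from MeCab text output.

    Assumes IPADIC-style lines: surface<TAB>comma-separated features,
    with the base form as the 7th feature. Other dictionaries differ.
    """
    forms = []
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Fall back to the surface form when no base form is given.
        has_base = len(fields) > 6 and fields[6] != "*"
        forms.append(fields[6] if has_base else surface)
    return forms


def weighted_coverage(known, frequencies):
    """Share of total frequency mass accounted for by known words."""
    total = sum(frequencies.values())
    covered = sum(n for word, n in frequencies.items() if word in known)
    return covered / total if total else 0.0


# Hand-written sample in IPADIC style, not captured from a real run.
sample = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "まし\t助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)
known = set(base_forms(sample))

# Hypothetical frequency counts, invented for the example.
freq = {"食べる": 50, "ます": 30, "飲む": 20}
coverage = weighted_coverage(known, freq)
print(f"{coverage:.0%}")
```

Weighting by frequency means that knowing a few very common words moves the number much more than knowing many rare ones, which is the point of comparing against a frequency list rather than a flat word list.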

How is your Python script set up?
#14
What I did to count the words in my SRS was export the content, use the ChaSen parser to break down the sentences, and then count the unique words in the result.
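
The counting step can be sketched in Python along these lines. This is an illustration, not my actual setup: the export is assumed to be tab-separated with the sentence in the first column, the function names are made up, and `toy_tokenize` is only a stand-in that splits on spaces. A real run would plug in ChaSen or MeCab at that point.

```python
import csv
import io
from collections import Counter


def count_unique(export_text, tokenize, column=0):
    """Count distinct tokens in one column of a tab-separated export."""
    counts = Counter()
    reader = csv.reader(
        io.StringIO(export_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if len(row) > column:
            counts.update(tokenize(row[column]))
    return counts


def toy_tokenize(sentence):
    """Stand-in tokenizer; a real run would call ChaSen or MeCab here."""
    return sentence.split()


# Invented export: sentence column (pre-spaced for the toy tokenizer),
# then a meaning column.
export = "猫 が 好き\tI like cats\n犬 が 好き\tI like dogs\n"
counts = count_unique(export, toy_tokenize)
print(len(counts))
```

Using a `Counter` instead of a plain set also tells you how often each word shows up in the deck, which costs nothing extra.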

I wonder if it's a good thing to know how many words you know. It's easy to become afflicted by the "are we there yet?" syndrome: "I only know 5,000 words!? But I need at least 20,000; it's going to take me forever." *discouragement ensues*
#15
Tobberoth Wrote:Let's not forget all the words you know which aren't in your SRS.
That's very true. Every time I look at the list of Jouyou Kanji that I have not yet entered into my sentence deck (it's a feature of anki), I can come up with one or more words that I already know for each of them. Osmosis is amazing.
#16
This plugin would be super useful for me right now. It would be very convenient to know exactly how many words I know so that I can adjust (and hopefully reduce...) the number of words I need to learn for the JLPT in December.
#17
mafried Wrote:The same thing you already did, it seems. I wasn't aware of your efforts. Thanks for the link.

I am/was in the process of writing a plugin that would run the "Expression" or "Kanji" fields of a Japanese deck through mecab to extract a conjugation-free word list, then print statistical information gleaned from that (not unlike what you have posted). The JLPT data would have been for comparison's sake. I was also planning on showing a weighted percent coverage of a frequency list as well.

How is your Python script set up?
here you go. i put some comments in it for you. hope it's useful. frequency list info would be cool. looking forward to seeing your improvements.

http://dl.getdropbox.com/u/1144428/word%20analysis.zip