Hi everyone! I'm thinking of creating a word list for use in an SRS with several (potentially novel) modifications to increase learning efficiency. To that end, I'd like to ask for people's thoughts and recommendations for data sources. First though, some motivation:
I've been learning Japanese for about 2.5 years on my own. In Anki I have Heisig books 1 and 3, the core 6000, and about 4000 other sentence cards from several grammar books. Despite this, I still regularly come across words I don't know. Having to stop to look up definitions really annoys me (even with rikaichan/sama), as does the loss of understanding if I skip over them. Weirdly though, I don't mind spending several hours a day grinding away at an Anki deck (go figure). So I thought I'd try something different: the much-maligned word list! Of course, I'd like to minimise the effort it takes to get through it, so I've been thinking of several potential improvements (in no specific order):
1. High-frequency words should appear earlier.
2. Words with similar kanji/kanji compounds should be grouped together.
3. Words with related meanings (e.g. same meaning, opposite meaning) should be grouped together.
4. Homophones should be grouped together.
The idea behind (1) is to learn the "most useful" words earlier, while (2), (3), and (4) are about leveraging similarities to make the whole process easier. Unfortunately, it is impossible to satisfy all these constraints at once, so there must be some weighting as to which are more important. Currently, I'm thinking of organising the words into groups based on the kanji (and compounds of 2+ kanji) they contain; the groups would then be ordered by the combined frequencies of their constituent words. I'm not entirely sure how (3) and (4) might be profitably incorporated into this arrangement (or if they should be).
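To make the grouping idea a bit more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under some assumptions: it groups on single kanji only (no 2+ kanji compounds yet), it treats anything in the basic CJK Unified Ideographs block as a kanji, the function names are ones I made up, and the frequency numbers in the toy data are invented, not taken from any real corpus.

```python
from collections import defaultdict

def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return "\u4e00" <= ch <= "\u9fff"

def group_by_kanji(freqs):
    """Map each kanji to the list of words that contain it."""
    groups = defaultdict(list)
    for word in freqs:
        for kanji in set(filter(is_kanji, word)):
            groups[kanji].append(word)
    return groups

def order_groups(freqs):
    """Order kanji groups by combined word frequency, highest first.

    A word is listed only in the first group it appears in, and words
    inside a group are sorted by their own frequency. Kana-only words
    belong to no group and are skipped in this sketch.
    """
    groups = group_by_kanji(freqs)
    ranked = sorted(
        groups.items(),
        key=lambda item: sum(freqs[w] for w in item[1]),
        reverse=True,
    )
    seen = set()
    ordered = []
    for kanji, words in ranked:
        fresh = sorted(
            (w for w in words if w not in seen),
            key=lambda w: freqs[w],
            reverse=True,
        )
        if fresh:
            ordered.append((kanji, fresh))
            seen.update(fresh)
    return ordered

# Invented counts, purely for illustration.
toy_freqs = {"日本": 4000, "日": 3000, "本": 2000,
             "学生": 1000, "学校": 1500, "先生": 1200}
ordered = order_groups(toy_freqs)
for kanji, words in ordered:
    print(kanji, words)
```

Listing each word only once is just one possible choice; a word with two kanji could equally be repeated in both groups, which might actually help with (2) at the cost of a longer list.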
So, what do people think?
a. How many words would you recommend using (I'm currently thinking the top 30K written words)?
b. What do you think about the ordering scheme?
c. Do you know of anything similar (I'd hate to reinvent the wheel)?
Also, can anyone recommend any decent (free) data sources:
d. A large, solid frequency list for *written* Japanese. Ideally this would contain relative frequencies (e.g. 日 comprises x% of the corpus) and not just an ordering (e.g. 日 is the nth most frequent word). Bigger lists would be better (as long as they are built on a decent corpus), since I can always cut words below a certain frequency.
e. A Japanese-English dictionary for definitions. I'll probably just use JMDICT/EDICT.
f. A Japanese-Japanese dictionary for definitions. I'm not sure what's available here.
g. (less important) A lexical database (à la WordNet: http://en.wikipedia.org/wiki/WordNet) for Japanese, with information about synonyms, antonyms, etc.
h. Anything else you think might help.
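On (d), the reason I care about relative frequencies rather than ranks is that ranks can't be summed across a group or compared against a cutoff. If a list only gives raw counts, converting is easy enough. A minimal sketch, assuming the list can be loaded as a word-to-count mapping (the function names, counts, and threshold below are all made up for illustration):

```python
def relative_frequencies(counts):
    """Turn raw corpus counts into fractions of the whole corpus."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def cut_below(rel_freqs, threshold):
    """Keep only words whose relative frequency is at least threshold."""
    return {w: f for w, f in rel_freqs.items() if f >= threshold}

# Invented counts, purely for illustration.
toy_counts = {"日": 50, "本": 30, "学校": 19, "薔薇": 1}
rel = relative_frequencies(toy_counts)
kept = cut_below(rel, 0.05)
print(sorted(kept, key=kept.get, reverse=True))
```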
Of course, I'd be more than happy to release the code and results of this project. I'd try to make it plug-and-play to some extent, so that you could use your own word list, dictionary sources, etc. Thanks for your patience in reading all that! I'm looking forward to hearing what people think.
Edited: 2012-03-05, 9:15 am
