
Improved word list

#1
Hi everyone! I'm thinking of creating a word list for use in an SRS, with several (potentially novel) modifications to increase learning efficiency. To that end, I'd like to ask for people's thoughts and for recommendations on data sources. First though, some motivation:

I've been learning Japanese for about 2.5 years on my own. In Anki I have Heisig books 1 and 3, the core 6000, and about 4000 other sentence cards from several grammar books. Despite this, I still regularly come across words I don't know. Having to stop to look up definitions really annoys me (even with rikaichan/sama), as does the lessened understanding which occurs if I skip over them. Weirdly though, I don't mind spending several hours a day grinding away at an Anki deck (go figure). So I thought I'd try something different: the much maligned word list! Of course, I'd like to minimise the effort it takes to get through it so I've been thinking of several potential improvements (in no specific order):

1. High-frequency words should appear earlier.
2. Words with similar kanji/kanji compounds should be grouped together.
3. Words with related meanings (i.e. same meaning, opposite meaning, etc.) should be grouped together.
4. Homophones should be grouped together.

The idea behind (1) is to learn the "most useful" words earlier, while (2), (3), and (4) are about leveraging similarities to make the whole process easier. Unfortunately, it is impossible to satisfy all these constraints at once, so there must be some weighting as to which are more important. Currently, I'm thinking of organising the words into groups based on the kanji (and compounds of 2+ kanji) they contain. The groups would then be ordered by the combined frequencies of their constituent words. I'm not entirely sure how (3) and (4) might be profitably incorporated into this arrangement (or if they should be).
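To make the grouping idea concrete, here is a rough Python sketch. The function names and the toy frequencies are made up for illustration, and it makes several simplifying assumptions: it groups on single kanji only (not 2+ kanji compounds), it places each word in the highest-ranked group that contains it, and it skips kana-only words.

```python
from collections import defaultdict


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return "\u4e00" <= ch <= "\u9fff"


def group_by_kanji(freqs):
    """Map each kanji to the list of words that contain it."""
    groups = defaultdict(list)
    for word in freqs:
        # dict.fromkeys drops repeated characters while keeping their order.
        for ch in dict.fromkeys(word):
            if is_kanji(ch):
                groups[ch].append(word)
    return groups


def order_groups(freqs):
    """Return [(kanji, words)] with groups ranked by combined word frequency."""
    groups = group_by_kanji(freqs)
    # Highest combined frequency first; ties are broken by the kanji itself.
    ranked = sorted(
        groups.items(),
        key=lambda kv: (-sum(freqs[w] for w in kv[1]), kv[0]),
    )
    seen = set()
    ordered = []
    for kanji, words in ranked:
        # A word is listed only in the first (highest-ranked) group it appears in.
        new_words = sorted(
            (w for w in words if w not in seen),
            key=lambda w: -freqs[w],
        )
        seen.update(new_words)
        if new_words:
            ordered.append((kanji, new_words))
    return ordered


if __name__ == "__main__":
    # Toy counts, not taken from any real corpus.
    sample = {"日本": 50, "日曜日": 20, "本": 30, "曜日": 5, "学校": 10}
    for kanji, words in order_groups(sample):
        print(kanji, words)
```

Note that a word containing several kanji contributes its frequency to every group it belongs to, which is only one possible reading of "combined frequencies"; weighting (3) and (4) into the ranking is left open here.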

So, what do people think?
a. How many words would you recommend using (I'm currently thinking the top 30K written words)?
b. What do you think about the ordering scheme?
c. Do you know of anything similar (I'd hate to reinvent the wheel)?

Also, can anyone recommend any decent (free) data sources:
d. A large, solid frequency list for *written* Japanese. Ideally this would contain relative frequencies (e.g. 日 comprises x% of the corpus) and not just an ordering (e.g. 日 is the nth most frequent word). Bigger lists would be better (as long as they are built on a decent corpus), since I can always cut words below a certain frequency.
e. A Japanese-English dictionary for definitions. I'll probably just use JMDICT/EDICT.
f. A Japanese-Japanese dictionary for definitions. I'm not sure what's available here.
g. (less important) A lexical database (à la WordNet: http://en.wikipedia.org/wiki/WordNet) for Japanese, with information about synonyms, antonyms, etc.
h. Anything else you think might help.
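
For (d), if a source only provides raw counts, relative frequencies and a cutoff are easy to derive. A minimal Python sketch, assuming the list has already been loaded into a dict of word to count (the function names and the numbers are illustrative only):

```python
def relative_frequencies(counts):
    """Convert raw counts into shares of the whole corpus (0.0 to 1.0)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cut_below(shares, min_share):
    """Keep only the words whose share of the corpus is at least min_share."""
    return {word: share for word, share in shares.items() if share >= min_share}


if __name__ == "__main__":
    # Toy counts, not taken from any real corpus.
    counts = {"日": 60, "本": 30, "猫": 10}
    shares = relative_frequencies(counts)
    print(cut_below(shares, 0.2))
```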

Of course, I'd be more than happy to release the code and results of this project. I'd try to make it plug and play to some extent, so that you could use your own word list, dictionary sources, etc. Thanks for your patience in reading all that! I'm looking forward to hearing what people think.
Edited: 2012-03-05, 9:15 am
#2
No. You've had time to internalize grammar and have a solid enough vocab base. Just keep reading. Use material under your level, on the same topic or by the same author. Do you have any idea how much effective use of the language you could learn in the time it takes to code, cram and review 30k sterile words?
#3
ivanov Wrote: No. You've had time to internalize grammar and have a solid enough vocab base. Just keep reading. Use material under your level, on the same topic or by the same author. Do you have any idea how much effective use of the language you could learn in the time it takes to code, cram and review 30k sterile words?
Agreed. However, if you're really committed to using a word list, I recommend the "Japanese corePlus" shared deck available via Anki. It includes about 25,000 words. Assuming you already have 10,000 of these words in your Anki rotation, the remaining 15,000 will take only 150 days if you add 100 new words/day, 75 days at 200/day, and 50 days at 300/day.

Considering that you already have a base vocabulary of 10,000 words, start with 100 new words per day and then read, read, and read some more. Do the new words first thing in the morning so you're free to read the rest of the day. If you're able to handle 100/day, then consider adding more. However, don't add more than 300/day because you will not have enough time for reading.

Note: I should mention that I'm assuming you do Anki reviews quickly (i.e. 100-200 reviews/10min).
#4
There's this list: http://www.manythings.org/japanese/words/leeds/
#5
A deck much like what you're describing already exists: the Japanese corePlus. It has 25k words, lists homophones, and is sorted (more or less) by frequency, though there will obviously be a few odd ones out. It's not as ideal as yours, but I think it would be easier to modify an existing deck than to make a new one.

Why don't you just make a new deck with all the new words, though? It should do a much better job at it than SRSing a monster. Especially if you're reading the same type of books/articles then most of the vocab should coincide.
#6
Aside from corePlus, there's a list here of common words from 4 years of the Mainichi Shinbun. (This was in the 1990s, so it's a little bit dated, but it does include word counts.)

http://ftp.ftp.edrdg.org/pub/Nihongo/wordfreq_ck.gz

Since it is from a newspaper, it's tilted towards words related to government and finance.

This is provided for informational purposes only; I don't think this is actually a good idea, especially on a scale of 30,000 words (most educated speakers of English know somewhere around 20,000!). A dedicated regime of reading easy-to-intermediate texts would probably be better.
#7
English is not a good comparison, since it is changing and growing so drastically. It's hard to know exactly how many words one knows, but from the standard word-count tests (the ones which test you with words from certain levels of frequency, difficulty and whatnot), IIRC most of my class was ranked at about 30,000-50,000 (probably counting words rather than lexemes), and this was back in high school. We do tend to have a larger vocab than average English natives though.

However, this was 30,000-50,000 words learned over 15+ years of constant exposure, use and intensive classes. I would never aim for that amount again, especially not solely through study over a short amount of time.