
Printable core 2000 vocabulary list

#1
Currently I'm studying the original Core2k using Anki. I wanted a copy on paper so I could study the words away from a screen as a complement to SRS, so I did some coding and produced a (hopefully) good-looking PDF. Perhaps other people will also find this useful, so I thought I'd post it here.

Update: there are now several PDF versions available as well as a spreadsheet that can be used as source data for an SRS deck.
Update 2: added new versions of the iKnow PDFs and TSV (with furigana and several other enhancements), and links to derived works.
Update 3 (a few years later): started producing updated files in light of the recent developments regarding iKnow and Anki. I don't need Core for myself anymore, but it would be nice to have some up-to-date archived source material for people to work with in case the API disappears behind a paywall manned by evil lawyers or something. Please let me know if I screwed something up; I didn't have much time to double-check the results.

March 2015 iKnow! list
Same as below, based on a recent copy of the site's data. Also added some fields (romaji transcriptions and sentence difficulty rating). For now only the raw data, spreadsheet and media files are provided, so you can make your own deck and possibly share it with others (I'll gladly include quality decks in this post). I'll try to add furigana and new PDFs (like the section below has) later when I have time to sort that out.

TSV (spreadsheet with Anki compatible media references): Mediafire, Mega
Media pack (sentence/word audio and pictures): Mega [321 MiB]
JSON (raw data for programmers): Mediafire, Mega

November 2012 iKnow! list
Ordering and translations are much more up to date than Core PLUS below. Furigana version is now available (though there may be markup glitches in some places). Goes up to 6000. Not in sync with most shared Anki core decks, but the TSV file can be used to create one.

PDF variations (divided into 6 documents with 1k each):
- Furigana & translations: preview (first 1k), Mediafire download
- Furigana but no translations: Mediafire download
- Kana transcriptions & translations: preview (first 1k), Mediafire download

TSV (spreadsheet with all 6k): Mediafire
JSON (raw data for programmers): Mediafire

Other works based on this data:
- Created by buonaparte: a set of Office documents including media
- Created by frony0: a ready-to-use Anki deck with media packages

Core PLUS list
Based on an older Anki deck from the smart.fm days; has some outdated and incorrect entries. It has full furigana for the example sentences. Contains the original core 2000. Its only advantage over the newer lists above: the order is synchronized with Core PLUS and many other popular pre-made Anki decks.

PDF (single document): preview (complete 2k), Mediafire download
Edited: 2015-03-05, 5:42 pm
#2
Very nice! I hope this doesn't get lost in the forum archives. Very useful.
#3
At the bottom:

今日は一人で映画を見ます
I saw a movie by myself today.

That can't be right.
#4
turvy Wrote:At the bottom:

今日は一人で映画を見ます
I saw a movie by myself today.

That can't be right.
Well, the list is extracted from a public Anki deck, which in turn is based on an official smart.fm deck from when that website still existed. With so many items, some errors are bound to slip in. I've already fixed a few entries manually and I don't mind fixing a few more, but personally I don't think it's worth extensively quality-checking the whole list.

I'm considering re-doing this with a different source (current iKnow! website) though. It'll be a lot more up-to-date and I could easily go up to 6k as well. Downside would be the misalignment with most pre-made Anki decks, since Cerego has shuffled the order quite a bit.
Edited: 2012-08-13, 8:38 am
#5
It would interest me if you can do this with iKnow! as the source. iKnow! updated their definitions since the Anki deck was created. Is there any way to take the HTML source from the page where you can see all the words in a lesson and add them to a spreadsheet? This would be a great time saver for me.
#6
http://pastebin.com/s8HEf27Z

Pastebin link for CORE2k vocabulary with single character lines removed. Thanks to Nestor / http://darkjapanese.wordpress.com/
#7
PotbellyPig Wrote:It would interest me if you can do this with iKnow! as the source. iKnow! updated their definitions since the Anki deck was created. Is there any way to take the HTML source from the page where you can see all the words in a lesson and add them to a spreadsheet? This would be a great time saver for me.
I did some research just now and it's actually easier than I'd thought: there is no need to parse any HTML pages, because the entire 6k is served as programmer-friendly JSON. In fact I've already managed to update the list with this as the new source. The only thing I'm still struggling with is furigana; iKnow has sentences in both kanji and kana, and I'm trying to create a procedure for reliably merging those into a single kanji sentence with furigana.
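
A minimal sketch of one way such a merge could work, not the procedure actually used for these files: treat the kana that already appear in the kanji sentence as anchors and let a regular expression hand the remaining stretches of the kana sentence to the kanji runs. The function name `merge_furigana`, the Anki-style `漢字[かんじ]` output, and the assumption that both sentences contain identical kana and punctuation without spaces are all illustrative.

```python
import re

# Runs of CJK ideographs (plus the iteration mark 々) are treated as "kanji".
KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")

def merge_furigana(kanji_sentence, kana_sentence):
    """Return Anki-style 漢字[かんじ] markup, or None if the two sentences cannot be aligned."""
    pattern_parts = []
    pieces = []  # (text, is_kanji) in sentence order
    pos = 0
    for m in KANJI_RUN.finditer(kanji_sentence):
        if m.start() > pos:
            literal = kanji_sentence[pos:m.start()]
            pattern_parts.append(re.escape(literal))  # kana must match literally
            pieces.append((literal, False))
        pattern_parts.append("(.+?)")                 # reading of this kanji run
        pieces.append((m.group(), True))
        pos = m.end()
    if pos < len(kanji_sentence):
        literal = kanji_sentence[pos:]
        pattern_parts.append(re.escape(literal))
        pieces.append((literal, False))

    match = re.fullmatch("".join(pattern_parts), kana_sentence)
    if match is None:
        return None
    readings = iter(match.groups())
    out = []
    for text, is_kanji in pieces:
        out.append(f" {text}[{next(readings)}]" if is_kanji else text)
    return "".join(out).strip()
```

For 今日は一人で映画を見ます and きょうはひとりでえいがをみます this gives 今日[きょう]は 一人[ひとり]で 映画[えいが]を 見[み]ます. The alignment can be ambiguous when a reading contains the same kana as the text that follows it, and it cannot tell a wrong reading from a right one, so the output would still need checking.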
#8
Perhaps it's possible to update the decks on Anki to use the new data too. There's a Core1k deck, but no more of the revised series. Can you describe the API?
#9
Update: furigana is going pretty well, except I'm not sure how reliable it is. I'll try to check the result against possible kanji readings to verify correctness.

frony0 Wrote:Perhaps it's possible to update the decks on Anki to use the new data too. There's a Core1k deck, but no more of the revised series. Can you describe the API?
The website uses asynchronous requests (AJAX) to fetch JSON packages containing 100 vocab items each (e.g. Core 3000 Step 4), which are then formatted client-side. So the entire core 6k consists of 60 JSON datasets. I uploaded an archive with all of the files in it, named by course id; if they are processed in this order you should get the complete 6k in proper order. I could also generate an Anki-importable (tab-separated) file if that helps, but no more than that; I don't have any knowledge about creating high-quality Anki decks.
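
For anyone who wants to process the archive themselves, here is a rough sketch of the merge step under some loud assumptions: the files are named `<course id>.json`, each file holds a plain list of item objects, and the field names in `FIELDS` are placeholders (the real iKnow schema isn't described here, so adjust them to whatever the JSON actually contains).

```python
import csv
import json
from pathlib import Path

# Hypothetical field names; replace them with the keys found in the real JSON.
FIELDS = ["word", "reading", "meaning"]

def load_items(json_dir):
    """Read every <course id>.json in numeric order and return one flat item list."""
    files = sorted(Path(json_dir).glob("*.json"), key=lambda p: int(p.stem))
    items = []
    for path in files:
        with path.open(encoding="utf-8") as fh:
            items.extend(json.load(fh))  # assumes each file is a list of item objects
    return items

def write_tsv(items, out_path):
    """Write one numbered, tab-separated line per item."""
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for number, item in enumerate(items, start=1):
            writer.writerow([number] + [item.get(field, "") for field in FIELDS])
```

Sorting on the numeric value of the file name matters: a plain alphabetical sort would put `10.json` before `9.json` and scramble the order.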

By the way, I assume this thread is okay with the forum rules, since the raw information is available publicly and free of charge, and the provided compilations are original work?
Edited: 2012-08-14, 12:52 pm
#10
Great work! Would it be possible to place the full 6k into a spreadsheet? Of course I want your PDF as well, but with a spreadsheet it's easier to manipulate the items and we can update the Anki deck with the iKnow! order and the new definitions.
#11
Ah, I have little to no experience with AJAX or manipulating JSON responses, so if you could convert that into a database or CSV format or similar, that would be great :) I'm happy to format it all into a deck, although I think Nukemarine has some scripts he should run on the list first to "optimize" it.
#12
The furigana didn't work out after all so I'll keep it at separate kanji/kana for now. I've produced a PDF version (split into six files) and a TSV dump (one big file). TSV is basically CSV with tabs instead of commas; it's importable in Anki and any modern spreadsheet program.
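
If you'd rather script against the TSV than open it in a spreadsheet, a minimal Python sketch looks like this. The helper name `read_tsv` is made up for illustration, and turning off quote handling is an assumption on my part that the fields are not quoted.

```python
import csv

def read_tsv(path):
    """Return the rows of a tab-separated file as lists of strings."""
    with open(path, encoding="utf-8", newline="") as fh:
        # QUOTE_NONE: treat quote characters inside sentences as ordinary text.
        return list(csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Calling `read_tsv("iknow_core6k_complete.tsv")` should then give one list per vocabulary item, with the columns in the same order as in the spreadsheet.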

Will this be sufficient? I'm not sure what formatting details are optimal for an SRS deck, so if anything in the sheet needs changing just say the word. Especially the audio/pictures; right now it's just the original URIs on iknow.jp, but I remember something about Anki needing some special setup for media.

I'll update my original post with the new download links.

frony0 Wrote:I'm happy to format it all into a deck, although I think Nukemarine has some scripts he should run on the list first to "optimize" it.
I suppose you're talking about i+1 order? That's pretty nice, but there are advantages to frequency order as well, so I suggest publishing it as an alternative.
#13
Savii Wrote:I'll update my original post with the new download links.
Can't wait to check out that .tsv file. I'll let you know how it looks after you post a link for it.
#14
Savii Wrote:The furigana didn't work out after all so I'll keep it at separate kanji/kana for now. I've produced a PDF version (split into six files) and a TSV dump (one big file). TSV is basically CSV with tabs instead of commas; it's importable in Anki and any modern spreadsheet program.

Will this be sufficient? I'm not sure what formatting details are optimal for an SRS deck, so if anything in the sheet needs changing just say the word. Especially the audio/pictures; right now it's just the original URIs on iknow.jp, but I remember something about Anki needing some special setup for media.

I'll update my original post with the new download links.

frony0 Wrote:I'm happy to format it all into a deck, although I think Nukemarine has some scripts he should run on the list first to "optimize" it.
I suppose you're talking about i+1 order? That's pretty nice, but there are advantages to frequency order as well, so I suggest publishing it as an alternative.
Yeah, I saw that when I had a fiddle with the JSON files. I'm pretty sure Anki can handle remote audio and pictures though, and it's fairly easy to download them in-app too. One big TSV is fine though; it's easy to split it up based on lines, and many people here can work wonders with spreadsheets...

As for the order, it's just helpful to have that as an extra index, instead of the same multiple separate spreadsheets situation we have now.
Edited: 2012-08-14, 1:30 pm
#15
PotbellyPig Wrote:
Savii Wrote:I'll update my original post with the new download links.
Can't wait to check out that .tsv file. I'll let you know how it looks after you post a link for it.
Wow. Great job! I've been using iKnow! for a while now (up to Core 4000 Step 5). I've been maintaining a spreadsheet based on the Anki deck and updating the definitions where applicable by cutting & pasting from the web site. This will save me countless hours of work. Plus, when I'm finished going through all 6000 words via iKnow!, I'll have an Anki deck with the updated definitions. You really saved me from a boatload of work and the time can be used instead for studying.
#16
I just noticed something. The TSV file doesn't have the part of speech for each word (verb, noun, etc.). Can you add it?
#17
PotbellyPig Wrote:I just noticed something. The TSV file doesn't have the part of speech for each word (verb, noun, etc.). Can you add it?
Sure, I've added that column to the spreadsheet. The TSV download links are updated.

Also I'll be updating the documents once more in the next few days. Personally I really like furigana so I gave it another go by finding out how Anki's Japanese Support plugin does the job. I got the method they use (a morphological analyzer tool called MeCab) to work in my script, so if nothing unexpected pops up I should be able to roll out furigana versions soon. I suppose I won't need to add it to the TSV since programs like Anki can add it dynamically anyway?
#18
You need to:
Code:
awk -F'\t' '{print $2 "\t" $3 "\t" $4 "\t" $5 "\t[sound:" $6 "]\t" $7 "\t" $8 "\t" $9 "\t[sound:" $10 "]\t<img src=\"" substr($11,0,length($11)-1) "\" />\t" $1}' < iknow_core6k_complete.tsv
(Or copy my one)
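
For anyone without awk at hand, the same column shuffle can be sketched in Python. This is only an approximation of the awk one-liner above: it assumes exactly the 11 columns that command references, and it assumes the `substr` call there was meant to strip a line ending from the last column, which is replaced here by stripping `\r\n`. The function name `to_anki_row` is made up.

```python
def to_anki_row(line):
    """Reorder one 11-column TSV line and wrap the media columns for Anki."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 11:
        raise ValueError(f"expected 11 columns, got {len(cols)}")
    number, *rest = cols                  # rest holds awk's $2..$11
    rest[4] = f"[sound:{rest[4]}]"        # awk $6
    rest[8] = f"[sound:{rest[8]}]"        # awk $10
    rest[9] = f'<img src="{rest[9]}" />'  # awk $11
    return "\t".join(rest + [number])     # the item number moves to the end
```

Apply it to every line of iknow_core6k_complete.tsv and write the results to a new file to get something importable.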

I've made a quick deck, I'll upload zips/tgzs of the media a little later.

Do add the furigana though, at the very least because the mobile client doesn't generate it dynamically like the desktop one does.
Edited: 2012-08-16, 5:17 am
#19
http://users.bestweb.net/~siom/martian_m...iKnow6000/
For people who prefer parallel texts.
The texts generated from iknow_core6k_complete.tsv posted here http://forum.koohii.com/showthread.php?tid=9762
Three Japanese-spaced hiragana-English .doc files with links to off-line audio.
iKnow6000 Sentences.doc
iKnow6000 Vocab + Sentences.doc
iKnow6000 Vocabulary.doc
Mp3 files renamed:
0001s.mp3 – 6000s.mp3 – sentences
0001.mp3 – 6000.mp3 – vocabulary.
Audio playlists added – sentences only.

The original mp3 file names are a complete mess; the ones here have been renamed.
It is not an Anki deck; the files are meant mostly for listening practice/reviewing. It is easy to check against the written file provided here when in doubt.
Edited: 2012-08-18, 11:26 am
#20
Savii Wrote:
PotbellyPig Wrote:I just noticed something. The TSV file doesn't have the part of speech for each word (verb, noun, etc.). Can you add it?
Sure, I've added that column to the spreadsheet. The TSV download links are updated.
Thanks for adding it so quickly.
#21
It appears to no longer be possible to upload decks to the old Ankiweb, so see
http://dl.dropbox.com/u/8980026/core.anki for the deck. For the media I have
http://dl.dropbox.com/u/8980026/core1.zip for the vocab audio,
http://dl.dropbox.com/u/8980026/core2.zip for sentence audio, and
http://dl.dropbox.com/u/8980026/core3.zip for (sentence) images.

The media files have been renamed to something more "neat" :)
Edited: 2012-08-18, 11:57 am
#22
I suggest nobody upload the old core decks to AnkiWeb 2; we can use these instead.
#23
MeCab's fully automatic furigana generation wasn't as accurate as I'd hoped, so I ended up spending a lot more time on this than I thought I'd need. Thanks to some additional generation methods it should be good to go now: 99.5% of the readings verified against the original kana transcriptions, and I've fixed the remaining ones myself. Both the PDF and TSV files are now updated with these furigana transcriptions. I've also added rōmaji transcriptions to the spreadsheet (for completeness, though I doubt anyone wants them) and made several improvements to the printable lists.
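
The idea behind checking against the kana transcriptions can be sketched roughly like this (an illustration, not my actual script): collapse each `漢字[かんじ]` group back to its reading and compare the result with the kana sentence. The function names and the Anki-style bracket markup are assumptions.

```python
import re

# A kanji run followed by its bracketed reading, with the optional separating space.
FURIGANA = re.compile(r" ?[\u4e00-\u9fff\u3005]+\[([^\]]+)\]")

def furigana_to_kana(marked_up):
    """Replace every 漢字[かんじ] group with just its reading."""
    return FURIGANA.sub(r"\1", marked_up)

def is_verified(marked_up, kana_sentence):
    """True if the furigana readings reproduce the kana sentence (spaces ignored)."""
    return furigana_to_kana(marked_up).replace(" ", "") == kana_sentence.replace(" ", "")

def verified_share(pairs):
    """Fraction of (marked_up, kana_sentence) pairs that pass the check."""
    pairs = list(pairs)
    return sum(is_verified(m, k) for m, k in pairs) / len(pairs) if pairs else 0.0
```

Anything that fails `is_verified` goes on the list for manual fixing.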

To frony0 and buonaparte: great work on those files, I've added them to the first post as well.
Edited: 2012-08-25, 3:24 am
#24
I discovered a bug in my scripts which caused about 1% of the words to have an incorrect furigana reading. Fixed PDFs and TSV have been uploaded.
#25
Is the second sentence missing? "I went to the pool."

P.S. I was looking at the JSON.

Looks like the ready-made deck doesn't even bother with half the sentences.

Has anyone got an up-to-date deck with all the sentences?
Edited: 2012-11-13, 2:41 pm