lauri_ranta wrote:
A few weeks ago I downloaded Japanese subs for about 10000 anime episodes and used mecab to make a frequency list of words with at least one kanji:
for f in *.srt *.ass; do mecab -F'%t %f[6]\n' "$f" | sed -En 's/^2 (.+)/\1/p'; done | sort | uniq -c | sort -rn
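For anyone who wants to see what the counting half of that pipeline does, here is a rough Python sketch of it. It assumes the mecab output has already been collected as lines shaped like "<character type> <base form>" and that type 2 marks kanji, which is what the sed filter suggests. The function name and the sample lines are made up for illustration.

```python
from collections import Counter


def count_kanji_words(mecab_lines):
    """Count base forms on lines shaped like '<char type> <base form>'.

    Only lines whose first field is exactly '2' are kept, mirroring the
    sed filter in the one-liner. Anything else (including 'EOS') is skipped.
    """
    counts = Counter()
    for line in mecab_lines:
        char_type, sep, word = line.rstrip("\n").partition(" ")
        if char_type == "2" and sep and word:
            counts[word] += 1
    # most_common() orders by count, highest first.
    return counts.most_common()


# Hypothetical sample of formatted mecab output.
sample = ["2 食べる", "6 は", "2 食べる", "2 有害", "EOS"]
ranked = count_kanji_words(sample)
```

The result is a list of (word, count) pairs, most frequent first, which plays the same role as the sorted uniq -c output.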
I wrote a Ruby script that removed words that didn't contain kanji (932/6000), and then words that weren't on the first 10000 lines of the frequency list (1436/5068).
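The Ruby script itself isn't posted, so the following is only a guess at the two filtering steps, written in Python. The kanji test is an assumption (basic CJK ideographs, Extension A, and the iteration mark 々), and the function and variable names are hypothetical. The numbers in the post work out as 6000 - 932 = 5068 and 5068 - 1436 = 3632, which seems to match the core3632.txt file name.

```python
import re

# Assumed definition of "contains kanji": CJK Unified Ideographs,
# Extension A, and the iteration mark 々.
KANJI = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3005]")


def filter_words(words, frequency_list, top_n=10000):
    """Return (kept, dropped_no_kanji, dropped_rare).

    Step 1 drops words without any kanji.
    Step 2 drops words missing from the first top_n frequency entries.
    """
    frequent = set(frequency_list[:top_n])
    with_kanji = [w for w in words if KANJI.search(w)]
    no_kanji = [w for w in words if not KANJI.search(w)]
    kept = [w for w in with_kanji if w in frequent]
    rare = [w for w in with_kanji if w not in frequent]
    return kept, no_kanji, rare


# Tiny made-up inputs; the real lists have thousands of entries.
words = ["食べる", "ずれる", "有害", "お辞儀"]
frequency_list = ["する", "食べる", "お辞儀"]  # most frequent first
kept, no_kanji, rare = filter_words(words, frequency_list)
```

Returning the dropped words separately makes it easy to eyeball a sample of what was removed at each step.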
A random sample of the 1436 words removed after the second step:
有害 harmful, hazardous
願書 application form
停留所 (bus) stop
物差し ruler
重役 executive
鰻 eel
お辞儀 bow
細長い long and thin, long and narrow
観賞 ornamental
時給 hourly wage
It's not perfect, because mecab doesn't recognize all words, and the corpus is too small and not diverse enough.
Anyway, you can download the list as TSV from http://lri.me/upload/core3632.txt. Most Anki decks are based on an older version of the data from iKnow's website. Audio files that match this version can be downloaded from this thread.
I won't make an Anki deck for it, but I'm thinking about making a huge vocabulary deck based on different word frequency lists later. Most of the top 20000 words have sound files from JapanesePod101 or iKnow.
By the way, do you happen to still have the unparsed list, with the hiragana words and everything? I'm still going through this list and it's working a lot better than the Core decks' order, but I'm not getting any kana words, which are a pretty big chunk of the language.
Last edited by egoplant (2013 February 05, 7:47 am)
egoplant wrote:
By the way, do you happen to still have the unparsed list, with the hiragana words and everything? I'm still going through this list and it's working a lot better than the Core decks' order, but I'm not getting any kana words, which are a pretty big chunk of the language.
You can open http://lri.me/japanese/Core%206000.txt in a spreadsheet application and sort it by the word type column. There are only 510 hiragana-only words, though.
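If a spreadsheet is inconvenient, the same kind of selection could probably be scripted. This is a minimal Python sketch that picks hiragana-only words out of tab-separated text by looking at the characters rather than the word type column. The column position (word in the first column) is an assumption, not something checked against the actual file, and the helper name is hypothetical.

```python
import csv
import io
import re

# Hiragana letters (U+3041 to U+3096) plus the long vowel mark ー.
HIRAGANA_ONLY = re.compile(r"[\u3041-\u3096\u30fc]+")


def hiragana_only_rows(tsv_text, word_column=0):
    """Return the rows whose word field consists only of hiragana.

    word_column is an assumed position; adjust it to the real file layout.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row
        for row in reader
        if len(row) > word_column and HIRAGANA_ONLY.fullmatch(row[word_column])
    ]


# Made-up sample rows in "word<TAB>meaning" form.
sample_tsv = "ずれる\tto slide\n物差し\truler\nじめじめ\tdamp and humid\n"
rows = hiragana_only_rows(sample_tsv)
```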
I have tried to learn hiragana words with audio lessons (like those that can be created with Audio Lesson Studio) or recently by using lists like this as input in a type training application:
zureru to slide
jimejime damp and humid
andake to that extent
lauri_ranta wrote:
...
I recently made a list of hiragana-only words to review. (It's in rōmaji because I used it as input in a type training application.) There is also a version of EDICT with word frequencies indicated by number of Yahoo search hits. Here are all words tagged with on-mim (onomatopoeia / mimetic) sorted by frequency: http://lri.me/upload/edict-freq-on-mim.txt.
Thank you! ^___^
wasn't sure if i should post this to anki forums or here, but i figured it might be useful to be here if anyone else is so dim as i that they can't figure this out: i downloaded the 'core 2k/6k optimized japanese vocabulary' deck from shared decks. up until then i had been using the 'japanese core 2000 step 01', etc. up to step 8 before i realized the value of switching to the optimized deck, and the value of having all 6000 in one deck.
how do i filter out the cards i have already learned/have reviews of in the first 8 decks. it should be about 1600 cards i think?
thanks.
tashippy wrote:
how do i filter out the cards i have already learned/have reviews of in the first 8 decks. it should be about 1600 cards i think?
I'm not sure of the best way, but I can immediately think of two suggestions:
1) Use Anki duplication detection:
Import one deck into the other and have Anki flag duplicates. Then suspend the duplicates and remove the other deck.
2) Use MorphMan to suspend notes you've already learnt:
Have both decks in collection, run Morph Man. Suspend any i+0 facts (as they were already covered by your other decks).
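Outside of Anki, the idea behind both suggestions is just finding the overlap between two word lists. Here is a small Python sketch of that, assuming the word field of each deck has been exported to plain text first (the export step isn't shown, and the function name and sample words are hypothetical). It doesn't touch Anki's own duplicate detection or MorphMan.

```python
def already_learned(new_deck_words, old_deck_words):
    """Split the new deck into words already covered and words still unseen.

    Whitespace around each word is ignored when comparing.
    """
    known = {w.strip() for w in old_deck_words}
    covered = [w for w in new_deck_words if w.strip() in known]
    unseen = [w for w in new_deck_words if w.strip() not in known]
    return covered, unseen


# Made-up word lists standing in for the exported decks.
old_decks = ["食べる", "有害 ", "時給"]
optimized_deck = ["時給", "願書", "食べる", "鰻"]
covered, unseen = already_learned(optimized_deck, old_decks)
```

The covered list is what you would then suspend in the new deck.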
I'm enjoying reading this thread. I finished Core 6k quite a while ago and have been struggling to find a better source of sentences since. It looks like there are a few different ways I can do it based on what's been said in here.
But one thing though, the 10k list has been mentioned quite a few times -- does anyone know where I can download the remaining 4k words and accompanying sentences from? I'd love to go through those too but I've never actually been able to find them.
Cheers guys!
overture2112 wrote:
tashippy wrote:
how do i filter out the cards i have already learned/have reviews of in the first 8 decks. it should be about 1600 cards i think?
I'm not sure of the best way, but I can immediately think of two suggestions:
1) Use Anki duplication detection:
Import one deck into the other and have Anki flag duplicates. Then suspend the duplicates and remove the other deck.
2) Use MorphMan to suspend notes you've already learnt:
Have both decks in collection, run Morph Man. Suspend any i+0 facts (as they were already covered by your other decks).
thanks. i might try using morphman. if i try your first suggestion, how do i do that? i see that finding duplicates is explained in the anki manual, but how do i combine the decks like you say? i guess what i was looking for is a list of sorts for each deck that i can just highlight multiple items and suspend or delete them. thanks
NightSky wrote:
But one thing though, the 10k list has been mentioned quite a few times -- does anyone know where I can download the remaining 4k words and accompanying sentences from? I'd love to go through those too but I've never actually been able to find them.
Afaik Core10k was what we decided to call the 9669 words from the jsensei app that I reverse engineered. My last post on it is here and you can find links to the spreadsheet and audio files in that thread if you look carefully.
Ultimately I moved on to writing MorphMan for use with subs2srs to replace core 2k/6k/10k, so I haven't kept up with it and don't know if someone posted a shared anki deck or not.
tashippy wrote:
if i try your first suggestion, how do i do that? i see that finding duplicates is explained in the anki manual, but how do i combine the decks like you say?
In Anki 2 I believe* you would do:
1) modify note type so the appropriate fields are marked as not being allowed to have duplicates
2) change the note type of the notes in one deck to the note type of the other deck, mapping fields as appropriate
* Untested. Always back up the collection file before making large changes like this, just in case.
tashippy wrote:
i guess what i was looking for is a list of sorts for each deck that i can just highlight multiple items and suspend or delete them
Not sure what you mean by this. Based on what you wrote, it sounds like you're describing the card browser, but I'm guessing you intended some sort of 2-pane version or something?
Last edited by overture2112 (2013 February 12, 12:04 pm)
A quick question regarding Core 6000 which I don't think merits its own thread, so I'll ask here:
Are the sentences from Core 6000 licensed under any licence that needs to be mentioned when displaying them on your website? Or, worse yet, are they the "property" of smart.fm/iknow.jp and you are not allowed to reproduce them at all on your website, and thus all the Anki decks, etc. are "illegal"?
toshiromiballza wrote:
Are the sentences from Core 6000 licensed under any licence that needs to be mentioned when displaying them on your website? Or, worse yet, are they the "property" of smart.fm/iknow.jp and you are not allowed to reproduce them at all on your website, and thus all the Anki decks, etc. are "illegal"?
You can find out the details if you search on this forum, but I think it basically went like this (someone correct me if I'm wrong).
Core10k was produced by some company (I forget the name). The first 6000 of those sentences were licensed by SmartFM. SmartFM used to have a nice API that allowed those sentences to be farmed out. They were put into the Anki deck we know as Core6k.
Then later some other company licensed all of Core10k and put it into some kind of computer software. The software was reverse engineered (by someone on this forum) and the remaining 4k sentences were extracted and combined with Core6k to produce the Anki deck we know as Core10k.
I think the Anki decks are technically infringing on the original company's copyright.
If you wanted to put them on your website you would probably need to license them from the original company that created them.
Last edited by partner55083777 (2013 February 13, 6:08 am)
I searched the forum, and came up with this:
http://forum.koohii.com/viewtopic.php?pid=29522#p29522
Nukemarine wrote:
Some other notes from that video presentation and the site:
Common Content (CC) - All of the material on iKnow is listed as common content. That means it's free to download, share, use for public and private purposes so long as iKnow's contribution (or original contributor) is acknowledged.
http://forum.koohii.com/viewtopic.php?pid=30133#p30133
Nukemarine wrote:
It's in the FAQ that commercial and private entities can develop content for use with iKnow.
[...]
Seeing as it's Common Creative License, you're not doing anything wrong.
http://forum.koohii.com/viewtopic.php?pid=81271#p81271
Nukemarine wrote:
I'd be much more at ease had the iKnow Core 2000 and Core 6000 Decks been used since they are common creative license.
http://forum.koohii.com/viewtopic.php?p … 22#p126122
Blahah wrote:
The Japanese Sensei app doesn't sell Smart.fm material, they each licensed data from the CJK dictionary institute.
http://forum.koohii.com/viewtopic.php?p … 11#p130811
Blahah wrote:
They released all that material for free, and made a public API specifically so people could connect to their servers and access the material. It's perfectly legal to use it, even after they start charging for the site.
Nukemarine claims the iKnow data is licensed under Creative Commons, and Blahah claims that even though they are licensed from the CJK Institute, it is legal to use because they (iKnow) released it for free. So I'm confused.
Last edited by toshiromiballza (2013 February 13, 8:06 am)
toshiromiballza wrote:
Common Content (CC) - All of the material on iKnow is listed as common content.
I assumed this only applied for user created content on the site and not smart.fm's published content, but could be wrong.
partner55083777 wrote:
The software was reverse engineered (by someone on this forum) and the remaining 4k sentences were extracted and combined with Core6k to produce the Anki deck we know as Core10k.
I think the Anki decks are technically infringing on the original company's copyright.
If you wanted to put them on your website you would probably need to license them from the original company that created them.
Correct. If you're using Core10k you should have purchased the jsensei app (only $14) and only share anki decks and the other materials with people who also own it (or at least use the honor system).
overture2112 wrote:
I assumed this only applied for user created content on the site and not smart.fm's published content, but could be wrong.
It doesn't go into specifics on the website: http://web.archive.org/web/200902281539 … now.co.jp/
Content on developer.iknow.co.jp is licensed under the Creative Commons Attribution 3.0 License
Loophole?
The Core6K sentences (the original 2000 + 4000) were licensed by iKnow (smart.fm) under Creative Commons. I'm not sure about the 10k deck. The only thing that wasn't licensed like that was the images, which came from a stock library that they couldn't sublicense.
I remember checking up on this because the only thing of value for them was the sentences, which they gave away. Once you had the sentences, what was the point of using their site? Anyone could compete with them using their own data, which is exactly what happened when everyone just used Anki.
I'm guessing they changed the license at some point, so any new additions, updates etc are not CC.
travis wrote:
I'm guessing they changed the license at some point, so any new additions, updates etc are not CC.
So if you downloaded the Core6k deck back when they were still under CC, does this mean you can legally share them now, even though they no longer are?
Edit:
From the CC FAQ:
What if I change my mind?
CC licenses are not revocable. Once a work is published under a CC license, licensees may continue using the work according to the license terms for the duration of copyright protection. Notwithstanding, CC licenses do not prohibit licensors from ceasing distribution of their works at any time. Additionally, CC licenses provide a mechanism for licensors and authors to ask that others using their work remove the credit to them that is otherwise required by the license. You should think carefully before choosing a Creative Commons license.
It seems so. Awesome!
Last edited by toshiromiballza (2013 February 13, 2:27 pm)

