I'm wondering if anyone has something like the core deck, but without the massive amounts of useless business terms. If the core deck is a newspaper, I want the manga version. Anyone have any lists like this? It doesn't even have to be an Anki deck, just a list.
Isn't it easier just to suspend the cards you don't like as you review, rather than making an entirely new deck from a list, if that's what you're thinking of? The shortcut used to be ctrl+shift+S or something in Anki1, and it was pretty quick.
egoplant wrote:
If the core deck is a newspaper, I want the manga version.
It doesn't exist. And even if it did it wouldn't be worth it.
Manga vary so much from genre to genre and publisher to publisher. Some will have easier words vs harder words based on their target audience. You are better off just reading the manga and adding the words you encounter as you read.
You could try this. It's not pre-made, but it's not a whole lot of work either. Find an HTML document of something you want to read, say a light novel or short story. Install Rikai-sama from this very site. Once you open your HTML, you can save words you mouse over, or even entire sentences, to Anki with the click of a button (the default is R). The other nice thing is that while you're adding words, you can also read through the story. For example, you sit down and read your story until you get to a word you don't know. You mouse over it and get the definition (oh! that means "anger!"). Press R to save it, and just continue reading. It doesn't even break the flow of the sentence once you get used to it.
He wasn't asking for light novels or short stories though...
If he was, then he'd be better served actually doing core6k because after you finish it all, you actually can read many light novels without much trouble.
The strength of core isn't simply that it's a word frequency list... It's that it has good audio for the 10,000 sample sentences. This is incredibly useful in
a) hearing the pronunciation when you learn the word (like the pitch-accent people want you to do, but better) and
b) helping your listening skills progress.
Vanilla sentences are good for reading but not nearly as useful for speech/listening... Meanwhile, when learning a new card, the brain does the same amount of work whether or not audio is present.
I also don't like them because typing in a large volume of cards is really boring.
My advice would be subs2srs for your fav anime if you don't like core.
Unless you don't care about speech/listening that is.
Last edited by dtcamero (2013 January 27, 7:03 pm)
What sort of "useless business terms" are in the list?
yudantaiteki wrote:
What sort of "useless business terms" are in the list?
Yeah, there are maybe only five words in the whole Core6000 that are really quite business-specific. And a lot of slightly business-y words show up in manga (and daily life) fairly frequently.
Last edited by Tzadeck (2013 January 27, 7:39 pm)
One route to go (the way I do it) is to suspend all the words in the Core deck, then just un-suspend the ones you want as you find vocabulary from other places. If the word is not there, just make a card. This saves a bit of time, and you will only be learning what you want. You can also edit the cards and add pics, extra example sentences, or definitions (I do this a bit). I'll also read the Core example sentence on the card, and if there's a word I don't know, I un-suspend that too.
Edit: Here's the Optimized Book Frequency word list. I sort of cheated and grouped the counting and months in addition to a few other words that seemed to be better as an entire group. /edit
If it helps, I've hacked together a list based on CB's 5100 Japanese Novel Word Frequency. First, I took all the kana words from the top 10,000 and put those words into their own list. Though the first 100 are bound to be grammar words, these will be important to study.
I then took the first 10,000 kanji-only words and bunched them into groups of 500, 1500, 2000, 2000, 2000 and 2000. Each group was sorted by the KO2k1 order. Finally, I organized it so that the first 1000 words came from KO2k1 555, then the next 1000 from KO2k1 1110, the next 2000 words were based on the entire 2k1, and after that there was no filtering.
What that gives you is a go-to list of "should learn these" words to help you read novels more easily. I don't think it's that difficult to use the equivalent Core 2k/6k/10k cards where there's a match, and fill in the missing info from the Kenkyusha dictionary, which I think was the go-to dictionary when they made the sample sentences for Core 2k/6k/10k.
Anyway, I'll put the list on Google Drive, and if it helps you then all is good. If you have a better frequency list, such as one based on thousands of manga or anime scripts, then it's not much work to sort it via the method I think works (group frequency in bunches, sort those bunches via KO2k1 order, spread kana words throughout).
Last edited by Nukemarine (2013 January 29, 7:37 am)
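A minimal Python sketch of the bucketing method described above (group the frequency list in bunches, then sort each bunch by kanji order). The function names, the sample data, and the rule that a word is placed by its latest-appearing kanji are assumptions for illustration, not Nukemarine's actual procedure. Spreading the kana words through the list is left out.

```python
from itertools import islice


def is_kanji(ch):
    # Simplification: only the main CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def order_key(word, kanji_rank):
    # Assumed rule: a word sorts by the position of its latest-learned kanji.
    # Kanji missing from the order (and words with no kanji) sort last.
    ranks = [kanji_rank.get(ch, len(kanji_rank)) for ch in word if is_kanji(ch)]
    return max(ranks, default=len(kanji_rank))


def bucket_and_sort(words_by_freq, kanji_order, bucket_sizes):
    """words_by_freq: words, most frequent first.
    kanji_order: kanji in study order (e.g. a KO2k1-style sequence).
    bucket_sizes: e.g. [500, 1500, 2000, 2000, 2000, 2000]."""
    kanji_rank = {k: i for i, k in enumerate(kanji_order)}
    remaining = iter(words_by_freq)
    result = []
    for size in bucket_sizes:
        bucket = list(islice(remaining, size))
        # sorted() is stable, so ties keep their frequency order.
        result.extend(sorted(bucket, key=lambda w: order_key(w, kanji_rank)))
    return result


if __name__ == "__main__":
    words = ["政治", "日本", "学校", "本日"]  # toy frequency order
    print(bucket_and_sort(words, "日本学校政治", [2, 2]))
```

Words beyond the total of the bucket sizes are simply dropped, so the sizes decide how long the final list is.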
I said business, but I guess I meant political as well.
yudantaiteki wrote:
What sort of "useless business terms" are in the list?
Here are just a couple words I added today alone:
政治家
府立
政党
野党
I can't imagine words like this popping up outside of the news or something.
Nukemarine wrote:
If it helps, I've hacked together a list based on CB's 5100 Japanese Novel Word Frequency. First, I took all the kana words from the top 10,000 and put those words into their own list. Though the first 100 are bound to be grammar words, these will be important to study.
I then took the first 10,000 kanji-only words and bunched them into groups of 500, 1500, 2000, 2000, 2000 and 2000. Each group was sorted by the KO2k1 order. Finally, I organized it so that the first 1000 words came from KO2k1 555, then the next 1000 from KO2k1 1110, the next 2000 words were based on the entire 2k1, and after that there was no filtering.
What that gives you is a go-to list of "should learn these" words to help you read novels more easily. I don't think it's that difficult to use the equivalent Core 2k/6k/10k cards where there's a match, and fill in the missing info from the Kenkyusha dictionary, which I think was the go-to dictionary when they made the sample sentences for Core 2k/6k/10k.
Anyway, I'll put the list on Google Drive, and if it helps you then all is good. If you have a better frequency list, such as one based on thousands of manga or anime scripts, then it's not much work to sort it via the method I think works (group frequency in bunches, sort those bunches via KO2k1 order, spread kana words throughout).
I have used CB's list too. I usually do a ctrl+f on that list before I add a card to make sure it's not useless. However, that list isn't perfect, because a lot of "newer" words do not appear to be frequent, and a lot of older words that aren't used as much anymore do appear as frequent. For example, there are a lot of war-related terms in that list, but I don't know how common they are outside of books written about wars. I have, however, run his program on a batch of anime subtitles, and that seems to be an alright list. I guess I was just wondering whether a list like that already existed, one that had been analyzed by a human (like the core) and not just generated.
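The ctrl+f check could also be scripted. Here is a rough Python sketch, assuming the frequency list is a plain text file with one word per line, most frequent first (the real format of CB's list may differ). The names `load_ranks` and `worth_adding` and the default cutoff are made up for the example.

```python
def load_ranks(lines):
    """Map each word to its 1-based line number in a frequency list.

    Assumes one word per line, most frequent first."""
    ranks = {}
    for rank, line in enumerate(lines, start=1):
        word = line.strip()
        if word and word not in ranks:
            ranks[word] = rank
    return ranks


def worth_adding(word, ranks, cutoff=10000):
    # True only if the word is on the list and within the first `cutoff` lines.
    rank = ranks.get(word)
    return rank is not None and rank <= cutoff


if __name__ == "__main__":
    ranks = load_ranks(["の", "政治家", "鰻"])  # toy list, not real data
    print(ranks.get("政治家"), worth_adding("鰻", ranks, cutoff=2))
```

With a real file you would pass `open(path, encoding="utf-8")` as `lines`. This only automates the lookup; it doesn't fix the problem of the list itself being skewed toward older or war-related vocabulary.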
egoplant wrote:
政治家
府立
政党
野党
*Shrugs*, political words pop up in manga all the time.
Perhaps 府立 might be a bit rare since it's location specific. But actually ~立 is really the vocab word, right? 県立 府立 都立 are all basically the same except that some prefectures have special titles. I hear 府立 all the time, but I live in a 府 rather than a 県.
政治家 and 政党 are both common enough words that they appear in manga often (Evangelion, 20th Century Boys, off the top of my head...). 野党 is a bit more of a speciality word, I suppose.
(Note, as an example, when I was learning Genki 2 I remember learning the word 編む, to knit, and I thought why the hell am I learning this useless word? But after coming to Japan I've heard it a million times, for one reason or another. It's hard to tell what words will be useful to you.)
Last edited by Tzadeck (2013 January 27, 10:29 pm)
You will see those words in manga and they aren't that large in number, so there's no point in skipping them.
If you're set on having a list of common manga/anime words, you can look up the transcripts to a series (not a guaranteed find), combine them, and then run a frequency check.
You could do something similar with a VN by using ITH (I prefer it to AGTH). Just run through a bunch of text with the scroll wheel and then check the frequency. Of course, it may be easier to just open the script if you have the tools to decrypt it, though you'll get more than the game text.
I don't see much point in doing that from the beginning, but I may be wrong. If you do try something like this though, please report on the results.
Last edited by sholum (2013 January 27, 11:14 pm)
egoplant wrote:
Nukemarine wrote:
If it helps, I've hacked together a list based on CB's 5100 Japanese Novel Word Frequency. First, I took all the kana words from the top 10,000 and put those words into their own list. Though the first 100 are bound to be grammar words, these will be important to study.
I have used CB's list too. I usually do a ctrl+f on that list before I add a card to make sure it's not useless. However, that list isn't perfect, because a lot of "newer" words do not appear to be frequent, and a lot of older words that aren't used as much anymore do appear as frequent. For example, there are a lot of war-related terms in that list, but I don't know how common they are outside of books written about wars. I have, however, run his program on a batch of anime subtitles, and that seems to be an alright list. I guess I was just wondering whether a list like that already existed, one that had been analyzed by a human (like the core) and not just generated.
You're trying to second-guess too much. Until you've read a lot of material in your area of expertise or interest, you're not really able to say "this is not needed, that will be needed". Anyway, word lists are meant more as a cover-your-bases type of deal. You can look at that CB list and see it has 50,000+ results, with the top 10k covering over 99% of what you'll randomly encounter.
If that's not specific enough for you, as you mentioned, CB offered up the program that will do a word count on any documents you feed it. So, if by some manner of magic you know right now all the Japanese documents you'll ever read, then you can have the perfect word list generated. Barring that minor miracle (sorry for the sarcasm), you have to put a bit of faith in one of those word lists if that's the way you want to study. The Core 2k/6k will skew heavily to politics, science and other topics highlighted in news articles. The CB list will skew toward all genres, which means war, politics, medical, fantasy, teen love, animal antics, etc. The list you generated from subtitles was likely a small sample, but at least it's up your alley.
Ok, perhaps there's a solution here for you. There's a subtitle site being filled out that includes most if not all of the subtitles for dramas since 2010. Collect those and run them through the frequency program. The sheer volume of subtitles should give you the best bang-for-your-buck frequency list of words in the most current version of the spoken language. Don't second-guess the results; assume that the top 2,000 words, or at least the words that account for 99% of what you're likely to hear, are necessary.
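To make the "words that account for 99%" idea concrete, here is a small Python sketch that walks down a word-count table until the chosen share of all occurrences is covered. The function name and the toy counts are assumptions; real counts would come from whatever frequency program you run on the subtitles.

```python
def words_for_coverage(counts, target_percent=99):
    """Return the most frequent words whose combined occurrences reach
    target_percent of all occurrences in `counts` (word -> count)."""
    total = sum(counts.values())
    running = 0
    chosen = []
    # Most frequent first; ties keep their original order (stable sort).
    for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Integer arithmetic avoids floating-point edge cases at the cutoff.
        if running * 100 >= target_percent * total:
            break
        chosen.append(word)
        running += n
    return chosen


if __name__ == "__main__":
    toy = {"する": 90, "政治家": 9, "野党": 1}  # made-up counts
    print(words_for_coverage(toy, 99))
```

How many words the 99% mark corresponds to depends entirely on the corpus, so it is worth checking on your own counts rather than assuming it is 2,000 or 10,000.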
A few weeks ago I downloaded Japanese subs for about 10000 anime episodes and used MeCab to make a frequency list of words with at least one kanji:
for f in *.srt *.ass; do mecab -F'%t %f[6]\n' "$f" | sed -En 's/^2 (.+)/\1/p'; done | sort | uniq -c | sort -rn
I wrote a script that removed words that didn't contain kanji (932/6000) and then words that were not on the first 10000 lines of the frequency list (1436/5068). A random sample of the words removed after the second step:
有害 harmful, hazardous
願書 application form
停留所 (bus) stop
物差し ruler
重役 executive
鰻 eel
お辞儀 bow
細長い long and thin, long and narrow
観賞 ornamental
時給 hourly wage
It's not perfect, because MeCab doesn't recognize all words, and the corpus is too small and not balanced at all.
Anyway, you can download the list from http://lri.me/upload/core3632.txt. Most Anki decks are based on older versions of the data. Audio files that match this version can be downloaded from this thread.
Edit: I also added word frequencies to http://lri.me/japanese/Core%206000.txt. They are currently based on anime and drama subs, cb's novel frequency list, and http://corpus.leeds.ac.uk/frqc/internet-jp-forms.num.
Last edited by lauri_ranta (2013 February 24, 1:57 am)
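The two filtering steps can be sketched in Python as below. This is not the original script, which isn't shown in the thread; the function names and sample words are assumptions, and the kanji test only checks the main CJK Unified Ideographs block. The counts quoted in the post do add up: 6000 - 932 = 5068 and 5068 - 1436 = 3632, which matches the file name core3632.txt.

```python
def has_kanji(word):
    # Simplification: only U+4E00..U+9FFF is treated as kanji, so marks
    # like 々 and the extension blocks are not counted.
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)


def filter_core(core_words, freq_words, top_n=10000):
    """Step 1: drop words without kanji.
    Step 2: drop words not among the first top_n frequency-list entries.
    Returns (words after step 1, words after step 2)."""
    with_kanji = [w for w in core_words if has_kanji(w)]
    frequent = set(freq_words[:top_n])
    kept = [w for w in with_kanji if w in frequent]
    return with_kanji, kept


if __name__ == "__main__":
    core = ["政治家", "ここ", "鰻", "お辞儀", "テレビ"]  # toy data
    freq = ["政治家", "お辞儀", "鰻"]                    # most frequent first
    step1, step2 = filter_core(core, freq, top_n=2)
    print(len(step1), len(step2))
```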
Thanks for that list, I'm going through it now. It might take a while, but I should eventually get to some new words I don't know.
lauri_ranta wrote:
A few weeks ago I downloaded Japanese subs for about 10000 anime episodes and used mecab to make a frequency list of words with at least one kanji:
for f in *.srt *.ass; do mecab -F'%t %f[6]\n' "$f" | sed -En 's/^2 (.+)/\1/p'; done | sort | uniq -c | sort -rn
I wrote a Ruby script that removed words that didn't contain kanji (932/6000), and then words that weren't on the first 10000 lines of the frequency list (1436/5068).
A random sample of the 1436 words removed after the second step:
有害 harmful, hazardous
願書 application form
停留所 (bus) stop
物差し ruler
重役 executive
鰻 eel
お辞儀 bow
細長い long and thin, long and narrow
観賞 ornamental
時給 hourly wage
It's not perfect, because mecab doesn't recognize all words, and the corpus is too small and not diverse enough.
Anyway, you can download the list as TSV from http://lri.me/upload/core3632.txt. Most Anki decks are based on an older version of the data from iKnow's website. Audio files that match this version can be downloaded from this thread.
I won't make an Anki deck for it, but I'm thinking about making a huge vocabulary deck based on different word frequency lists later. Most of the top 20000 words have sound files from JapanesePod101 or iKnow.
Personally I think this whole thing is a long palaver over a handful of words' difference from core. If you don't like a word, skip it, but doesn't the audio, not to mention the fact that it's pre-made, make it heads and tails a winner? I mean, are you planning on finishing your Japanese project at 10k words? If not, you're going to need all those (really basic) words, like politician, eventually.
dtcamero wrote:
Personally I think this whole thing is a long palaver over a handful of words' difference from core. If you don't like a word, skip it, but doesn't the audio, not to mention the fact that it's pre-made, make it heads and tails a winner? I mean, are you planning on finishing your Japanese project at 10k words? If not, you're going to need all those (really basic) words, like politician, eventually.
The audio files can be downloaded from the thread I linked to.
I wouldn't use most premade decks without modifying them in some way first, like removing the most or least frequent words or katakana words, adding furigana, or replacing the translations with only the first translation.
Last edited by lauri_ranta (2013 February 24, 2:01 am)
That it's pre-made doesn't make it the perfect choice. Some people, like me, don't want to learn too many words from textbook examples. Also, it has English definitions for all 10k words; I started adding Japanese definitions after my first 1000, gradually of course. Nowadays, I only add cards with Japanese definitions. And copying to Anki takes me less time than it takes to learn through Anki; the process takes about a minute per card, but I save that time during review.
And I prefer to listen to native audio rather than to native speakers reading boring sentences. And the core sentences don't seem to be made in an n+1 way, which would be ideal.
Last edited by Stian (2013 January 28, 1:28 am)
dtcamero wrote:
Personally I think this whole thing is a long palaver over a handful of words' difference from core. If you don't like a word, skip it, but doesn't the audio, not to mention the fact that it's pre-made, make it heads and tails a winner? I mean, are you planning on finishing your Japanese project at 10k words? If not, you're going to need all those (really basic) words, like politician, eventually.
That's a bad argument. If your only goal were to learn as many words as possible, then frequency would be a really bad way to do it. The reason people use frequency lists is that their goal is to actually use the language, and the sooner the better. I might eventually need to know politician, but that doesn't mean I need to learn it first, because it will never come up in the stuff I like to read.
lauri_ranta wrote:
A few weeks ago I downloaded Japanese subs for about 10000 anime episodes and used mecab to make a frequency list of words with at least one kanji:
*snip*
I won't make an Anki deck for it, but I'm thinking about making a huge vocabulary deck based on different word frequency lists later. Most of the top 20000 words have sound files from JapanesePod101 or iKnow.
I'm impressed with this. It's probably better to leave the kana-only words in their own list. If I may ask, where did you get the anime list of subtitles? Next, do you have all the non-anime subtitles from the two or three subtitle repositories?
My guess is that a subtitle file equates to about 20 pages of a novel per hour of show. Those 10,000 anime subtitles would then, I guess, be the same as 200,000 pages, or about 400 books. As you say, that's a small sample. Filling it up with files for both Japanese and translated foreign shows should allow for a sample 10x that.
dtcamero wrote:
Personally I think this whole thing is a long palaver over a handful of words' difference from core. If you don't like a word, skip it, but doesn't the audio, not to mention the fact that it's pre-made, make it heads and tails a winner? I mean, are you planning on finishing your Japanese project at 10k words? If not, you're going to need all those (really basic) words, like politician, eventually.
There's quite a bit of use in this. Not everybody doing Japanese is going to go for a full 10k deck of vocabulary. If one can create a corpus of words that can be described as "drawn from 30,000 subtitles from TV (both Japanese and foreign) and anime shows over the last 10 years", then the top 2,000 words of such a list would fill a big gap for people interested in fan-based learning.
Yes, as the vocabulary list gets larger, the overlap increases. However, improving the out-of-the-box benefit can't be a bad thing. It's only in recent years that we've been able to generate such a list of words instead of depending on the Asahi Newspaper word list. So it stands to reason that people will get a bit more nitpicky about what they learn.
And just like with the discussions of kanji learning, you have to remember that rote learning from a list is only part of learning. Some people seem quick to jump to the conclusion that "I am removing this from a basic Anki deck" means "I will never learn or need this word in my life", or that the number of words in your Anki deck exactly equals the number of words you know.
Nukemarine wrote:
If I may ask, where did you get the anime list of subtitles?
[...]
My guess is that a subtitle file equates to about 20 pages of a novel per hour of show.
I downloaded them from http://kitsunekko.net/dirlist.php?dir=s … apanese%2F with curl. The typical number of non-ASCII characters per file is about 4000, and novels have maybe 500 characters per page, so 20 novel pages per hour is pretty close.
Last edited by lauri_ranta (2013 January 28, 9:34 am)
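The estimate above can be checked with a few lines of Python. The 4000 characters per file and 500 characters per page come from the post; the 24 minutes per file is an assumption (a typical anime episode length) that the thread does not state, and the function names are just for the example.

```python
def pages_per_file(chars_per_file, chars_per_page):
    return chars_per_file / chars_per_page


def pages_per_hour(chars_per_file, chars_per_page, minutes_per_file):
    # Scale the per-file page count up to one hour of show.
    return pages_per_file(chars_per_file, chars_per_page) * 60 / minutes_per_file


def total_pages(n_files, chars_per_file, chars_per_page):
    return n_files * pages_per_file(chars_per_file, chars_per_page)


if __name__ == "__main__":
    print(pages_per_file(4000, 500))       # novel pages per subtitle file
    print(pages_per_hour(4000, 500, 24))   # assumes 24-minute episodes
    print(total_pages(10000, 4000, 500))   # whole corpus, in novel pages
```

Under that assumption the figures work out to 8 pages per file and 20 pages per hour, which agrees with the guess quoted above. Note that 10,000 files at 8 pages each would come to about 80,000 pages; the 200,000-page figure only holds if each file is a full hour of show.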
egoplant wrote:
I said business, but I guess I meant political as well.
Here are just a couple words I added today alone:
政治家
府立
政党
野党
I can't imagine words like this popping up outside of the news or something.
If there were a manga list, these might well appear there too. In fact, I swear, the above four words look like they'd be among the most frequent words in "Kaji Ryuusuke no Gi", a manga about the son of a late politician who himself runs for office as the youngest candidate in his district.

