
Beta: Anime2k vocab deck

#1
For the past several months I have been working on a new vocabulary deck, and I've reached a point where I thought it would be alright to release an early version of it and get some feedback.

What is this?
The Anime2k deck is based on a frequency list generated from a huge custom corpus that was created from subtitles of anime and dramas. It includes over 15 million lines of text taken from over 25,000 episodes, with a mix of about 60% anime and 40% drama.
The deck is intended to be used as either an alternative or a supplement to the Core2k/6k decks, for people who are primarily interested in consuming media such as anime, manga, dramas, games, etc.
This deck has largely been compiled by hand, with over 100 hours of effort put into it so far.

Difference from other decks?
About half of the terms in this deck are different from terms that you will find in the Core2k deck. Furthermore, nearly 20% of the terms in this deck are not found in the 6000 words of the Core6k. The Core lists were also sourced from newspapers, while Anime2k is sourced from spoken (scripted) dialogue. There is quite a large difference between the types of language that appear in print and those that appear in spoken dialogue.

How was it created? / What is included?
The initial corpus was created from massive archives of subtitle files, primarily SRT files to avoid issues with heavily styled scripts which could lead to false duplicates of words. An initial word list was generated using cb's Japanese Text Analysis Tool. Later on, a supplemental list was created using wareya's unnamed japanese text analyzer. This list was used to help find any gaps or issues present in the initial list. 
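As a rough illustration of the pipeline described above (not the actual code behind cb's tool or wareya's analyzer), here is a minimal sketch that pulls dialogue lines out of SRT text and counts tokens. The function names are made up for this example, and the `tokenize` argument is a placeholder: a real run would plug in a Japanese morphological analyzer, while the demo just splits on spaces.

```python
import re
from collections import Counter

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Simple inline styling tags such as <i>...</i>.
TAG = re.compile(r"<[^>]+>")


def srt_dialogue_lines(srt_text):
    """Yield dialogue lines, skipping cue numbers, timing lines and blanks.

    Limitation: a dialogue line made up only of digits is dropped too,
    because it looks like a cue number.
    """
    for raw in srt_text.splitlines():
        line = TAG.sub("", raw).strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        yield line


def word_frequencies(lines, tokenize):
    """Count tokens across lines; `tokenize` is any callable str -> list of str."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Hypothetical two-cue subtitle file, pre-split with spaces for the demo.
sample = """1
00:00:01,000 --> 00:00:03,000
<i>貴様 は 誰 だ</i>

2
00:00:03,500 --> 00:00:05,000
誰 だ
"""

# Whitespace splitting stands in for a real morphological analyzer here.
freq = word_frequencies(srt_dialogue_lines(sample), str.split)
print(freq.most_common(3))
```

Keeping the tokenizer as a parameter means the same counting code can be reused with whichever analyzer is being compared.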
I decided not to include particles and some other types of "grammatical" words in the list, as most of them cannot be sufficiently reduced to a short definition. I DID choose to include many conjunctions, which are largely missing from the Core decks.
Towards the end of creating this deck, I made a decision to remove most katakana words that come from English and mostly have the same meaning as their English counterpart. I ultimately made this decision because I felt that most of these words are "freebies". They don't really require any study to learn. You can hear them once in context and know what they mean. I felt that much more value could be obtained by including additional Japanese words in their place. I did keep a handful of English loanwords which I felt have some nuance or difference from their English meaning.
A goal that I had with this deck was to not have multiple entries for a single word. Stemming from this decision, when a word has multiple common readings and meanings, I put these additional details into a "notes" field.
I have also largely avoided including counters and suffixes as words, except in a handful of cases.
Definitions were kept as concise as possible while trying to capture the most common meanings. Primary sources for definitions were Jisho.org (JMdict) and the definitions in the Core6k. Definitions were supplemented from a large variety of other sources when I felt it necessary.
Kanji was used if I found the kanji version of the word reasonably high in the frequency list (within the top 8-10k words). If the Kanji was less frequent than this, I generally went with the kana version.
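The kanji-or-kana rule of thumb above could be written down roughly like this. This is only a sketch: the function name and the exact cutoff of 10,000 are assumptions for illustration (the post gives a range of 8-10k and says "generally", so the real decisions were made by hand).

```python
def choose_written_form(kanji_form, kana_form, kanji_rank, cutoff=10_000):
    """Pick the spelling to show on the card.

    kanji_rank is the 1-based rank of the kanji spelling in the frequency
    list, or None if that spelling never appeared in the corpus.
    """
    if kanji_rank is not None and kanji_rank <= cutoff:
        return kanji_form
    return kana_form


# Kanji spelling never seen in the corpus: fall back to kana.
print(choose_written_form("此れ", "これ", None))
# Hypothetical rank of 1500 for the kanji spelling: keep the kanji.
print(choose_written_form("攻撃", "こうげき", 1500))
```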


What is still coming?
I plan to include 3 example sentences from the actual subtitles for each word. I don't intend to provide English translations. Collecting the sentences will be a long and time-consuming process that I anticipate taking a few weeks or months (assuming 1 minute per sentence, 2,000 words with 3 sentences each comes to 6,000 minutes, or 100 hours of work). While I could easily pull some sentences automatically, they would be of such low quality that they would not be of much use to anyone. Thus, I need to do this by hand.
In the process of collecting sentences, I will likely come across some words or expressions that the list generation tools missed, which may lead to some changes in the actual list of words included in Anime2k. I don't anticipate any huge changes in the list though.

Feedback requested
I feel that the list is currently in a state that can be seriously studied, as long as you don't care about missing the example sentences. I am not a native speaker, nor am I yet fluent in Japanese. There may be situations where the included definitions are not the best choices. I would appreciate any feedback in this regard, to make this deck as good as possible. I would also consider any suggestions for important words or expressions that I may have missed, provided that my data can show that it is genuinely a high-frequency word. I would also appreciate any suggestions for things to be added into the "notes" field, such as alternative forms of words.

Spreadsheet
https://docs.google.com/spreadsheets/d/1...sp=sharing
Comments are enabled on the spreadsheet, if you would like to provide feedback there.

Update 4月11日
Added core index column
Added "kanjified" versions of words into the notes column
Removed about 25 words and replaced with new ones
Fixed several errors.

Update 4月23日
Added sentences for words not contained in the core6k
Removed (and added) 60+ words, most of which were mainly used as names or had been incorrectly attributed by the automated tools. Also removed some words that were counters or suffixes, and removed most interjections.
Several small corrections and changes

Update 5月5日
Rearranged numerous words based on frequency from the unnamed Japanese Text Analyzer, which felt more accurate in some cases.
Replaced 20+ words, most of which were single kanji mainly used as parts of names.
Added sentences for the first 500 words.

I am currently taking a break from this, and will not be updating the list any more for the time being.
Edited: 2017-05-05, 4:45 pm
#2
This is serious work and I personally appreciate it. Thanks. Are you planning to insert this in an Anki deck anytime soon? Also, I would prefer it if kanjis were used if all kanjis in the word belonged to the 2200 Heisig selection from RtK1 6th edition. In case a kanji outside from those 2200 shows up, then check its frequency like you did. My personal preference. =)
#3
(2017-04-08, 4:50 pm)KameDemaK Wrote: This is serious work and I personally appreciate it. Thanks. Are you planning to insert this in an Anki deck anytime soon? Also, I would prefer it if kanjis were used if all kanjis in the word belonged to the 2200 Heisig selection from RtK1 6th edition. In case a kanji outside from those 2200 shows up, then check its frequency like you did. My personal preference. =)

You can import it into Anki if you download as a Tab Separated Values .tsv file. I'll post it onto Anki's shared decks once I finish the final version.
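If you would rather build the tab-separated file yourself instead of downloading it from the spreadsheet, a minimal sketch with Python's standard csv module looks like this. The column order (word, reading, definition, notes) and the two rows are assumptions for illustration, not copied from the spreadsheet.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows of fields into tab-separated text, one note per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative rows only: word, reading, definition, notes.
rows = [
    ("忍者", "にんじゃ", "ninja", ""),
    ("海賊", "かいぞく", "pirate", ""),
]

tsv_text = rows_to_tsv(rows)
print(tsv_text)

# To save it for import (the file name is arbitrary):
# with open("anime2k.tsv", "w", encoding="utf-8", newline="") as f:
#     f.write(tsv_text)
```

Saving as UTF-8 matters here, since the fields contain Japanese text.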

I disagree on using kanji if it's not typically used in the native material. For instance, words like これ and それ technically have kanji, but they weren't used even once in my entire corpus of material. Plus it can easily lead to "collisions" where two words have the same kanji, even though one of them doesn't actually use the kanji in practice.
#4
Way cool, thank you! I'm adding it to my study rotation (going to download and convert to anki), will let you know how it goes. Thank you for sharing.
#5
(2017-04-08, 4:08 pm)Zarxrax Wrote: What is this?

The deck is intended to be used as either an alternative or a supplement to the Core2k/6k decks, for people who are primarily interested in consuming media such as anime, manga, dramas, games, etc.

Approximately 25% of the terms in this deck are not found in the 6000 words of the Core6k. 

What is still coming?
I plan to include 3 example sentences from the actual subtitles for each word. I don't intend to provide English translations. Collecting the sentences will be a long and time consuming process that I anticipate taking a few months (assuming 1 minute per sentence, this is 100 hours of work). While I could easily pull some sentences automatically, they would be of such low quality that they would not be of much use to anyone. 
1. If you have already made the comparison with Core, it would be helpful to bring in a column the comment Core2k/Core6k/Core10k. By this it can be easily used as a real supplement.

2. You can reduce the workload by going for 2 example sentences. It won't make a big difference, especially for the material which the user is not familiar with.

I guess you are automatically pre-selecting e.g. 5 sentences (there were some posts in this direction) and then you do a manual final selection. For your private use it would make sense to rearrange the order of the material, that in the automated pre-selection the top entries are from material you are familiar with / you like most.
#6
I like the sound of this deck! If this had been around when I was starting out, I'd definitely want to use it over Core.

RE: Kanji
While not usually in manga, I've noticed that a lot of VNs (and some webnovels) like to use the kanji for 此処 其処 何処 何故, etc. I don't think they're so commonly used that they need to be on the default card, but they might be useful in the Notes field you mentioned.

I don't want to make extra work for you, but these have shown up often enough for me that I wish I had learned them earlier, since it's just a matter of learning some kanji for some very common words.
#7
(2017-04-08, 6:54 pm)Matthias Wrote: 1. If you have already made the comparison with Core, it would be helpful to bring in a column the comment Core2k/Core6k/Core10k. By this it can be easily used as a real supplement.

2. You can reduce the workload by going for 2 example sentences. It won't make a big difference, especially for the material which the user is not familiar with.

I guess you are automatically pre-selecting e.g. 5 sentences (there were some posts in this direction) and then you do a manual final selection. For your private use it would make sense to rearrange the order of the material, that in the automated pre-selection the top entries are from material you are familiar with / you like most.

1. Good idea.
2. Reducing it to 2 example sentences won't really reduce the time much. I have made a tool that I can use to automatically pull out a set number of sentences, and then I manually select from there. Problem is, a lot of what I'm pulling over isn't really ideal to use, so I might end up having to go back and pull some new ones. Or in some cases, I might decide that the form of the word that I've used on the card might not actually be ideal. For instance, while しまう showed up as a very high frequency verb in the frequency list tools, when I actually looked at the sentences, I saw that it was mostly being used in the しまった! form, which has a separate meaning. This kind of stuff is really what is taking a lot of the time. I'm also trying to avoid sentences with names, because seeing weird kanji combinations that you don't know can often make the sentence a lot more difficult to read, and it turns out that a lot of sentences have names, so I end up having to pull more sentences.  I guess I could probably make some tweaks to my tool that could make some of this a bit easier for me though.
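To give an idea of the pre-selection step described above, here is a minimal sketch. It is not the actual tool: the function name, the length limit and the name list are all assumptions for illustration. It pulls a fixed number of distinct lines that contain the word and skips any line containing a known name.

```python
def candidate_sentences(lines, word, names, limit=5, max_len=30):
    """Return up to `limit` distinct lines containing `word` and none of `names`.

    Matching is plain substring search, so conjugated forms are missed
    (searching for しまう will not find しまった); a fuller version would
    match on the dictionary form produced by a morphological analyzer.
    """
    seen = set()
    picked = []
    for line in lines:
        if word not in line or line in seen:
            continue
        if len(line) > max_len or any(name in line for name in names):
            continue
        seen.add(line)
        picked.append(line)
        if len(picked) == limit:
            break
    return picked


# Hypothetical subtitle lines and name list.
lines = ["忍者だ!", "田中は忍者だ", "忍者だ!", "あの忍者は強い"]
names = {"田中"}

for sentence in candidate_sentences(lines, "忍者", names, limit=3):
    print(sentence)
```

The manual step still comes afterwards: the function only narrows the pool, and a person picks the sentences that actually show the word well.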

(2017-04-08, 6:58 pm)sholum Wrote: I like the sound of this deck! If this had been around when I was starting out, I'd definitely want to use it over Core.

RE: Kanji
While not usually in manga, I've noticed that a lot of VNs (and some webnovels) like to use the kanji for 此処 其処 何処 何故, etc. I don't think they're so commonly used that they need to be on the default card, but they might be useful in the Notes field you mentioned.

I don't want to make extra work for you, but these have shown up often enough for me that I wish I had learned them earlier, since it's just a matter of learning some kanji for some very common words.

Fair enough. I can add the ones that appear in the core decks, but for any others I would probably be relying on someone to point it out to me. I think there probably aren't a whole lot of these though, maybe less than 100.
Edited: 2017-04-08, 9:09 pm
#8
Can you share your total analysis of all those subtitle files, or zip up those subtitle files and share them here (and mention which tool you'd use). I'd SUPER appreciate it, since I was looking to do something similar, and to have something to use for vocab sorting. I couldn't find any batch of j-drama subs, and the Japanese Text Analysis Tool froze up when I tried to do too large of a batch.

On しまう
"しまう showed up as a very high frequency verb in the frequency list tools, when I actually looked at the sentences, I saw that it was mostly being used in the しまった! form, which has a separate meaning"
しまう has multiple meanings, including the "to accidentally do/finish" commonly used in しまった by the way, but I know what you meant here.
Edited: 2017-04-08, 10:32 pm
#9
(2017-04-08, 10:25 pm)vladz0r Wrote: Can you share your total analysis of all those subtitle files, or zip up those subtitle files and share them here (and mention which tool you'd use). I'd SUPER appreciate it, since I was looking to do something similar, and to have something to use for vocab sorting. I couldn't find any batch of j-drama subs, and the Japanese Text Analysis Tool froze up when I tried to do too large of a batch.

I got the majority of my drama subs from here: http://jpsubbers.web44.net/Japanese-Subtitles/
Anime subs from kitsuneko.
I also had some others sitting around on my computer from over the years that I added in, not a significant amount though.

Frequency report from cb's japanese text analysis tool: https://drive.google.com/open?id=0B3_ufk...FlhcDAxODQ

I don't really remember the settings used. Probably default or close to default.
I never had any issues with the tool freezing up, but I did combine all of my subs into one file before processing, so maybe that has something to do with it.
#10
This is only tangentially related, but what kind of tools, and how much work, would it take to do the same type of deck, but with the most frequent collocations (let's say the most common 3 to 6 words used together repeatedly) instead of the most frequent words?

Would it even be possible, with one of these text analyzers? Because if it was, it would be an invaluable resource.
#11
Thanks a ton. I feel like you probably saw the v2k thing I posted https://www.reddit.com/r/LearnJapanese/c...al_novels/

Your analysis is probably more accurate for general anime use, since your sampling was way bigger. Man, if there was some way to get some good example sentences for each card, that'd be pretty cool. I can see someone doing a good job of defining the top 6k+ words and hand-selecting sentence cards and pictures if they were fluent in Japanese. Maybe someday...
There needs to be an alternative to the fu*cking Core2k/6k/10k, because that shit is boring, and I know anime-based frequency works and gets you big gains, as long as you know some kanji. I never see a large chunk of the core6k words used in anime past a certain point. It's useful for the first 1k or so, but for having fun in Japanese and understanding anime and j-drama, you won't master those Core decks enjoyably.

I think having proper definitions to learn the words the first time is pretty important. Google image search helps a lot with having a more accurate definition than the English definition, though, since you get to instantly see common usage. I've had to relearn so many hundreds of words I "knew" from Anki, since I knew english definitions for them but the common usage was different. Quick google searches and exposure aligned me with proper meanings, at least, but most people aren't going to get enough exposure and rely on Anki.

(2017-04-09, 11:01 am)Stansfield123 Wrote: This is only tangentially related, but what kind of tools, and how much work, would it take to do the same type of deck, but with the most frequent collocations (let's say the most common 3 to 6 words used together repeatedly) instead of the most frequent words?

Would it even be possible, with one of these text analyzers? Because if it was, it would be an invaluable resource.

Oh man, it definitely would. There was a tool out there for collocations in Japanese, but it didn't work out well for me. It's hard to actually get the definitions for the collocations, and it's probably best if someone who's fluent does it, and combines it with http://eow.alc.co.jp/search?q=%E4%BA%8C%E6%89%8B&ref=sa for reference. (Though this site's accuracy is apparently not the most reliable.)

Ahh but man, a well-done collocation deck could speed things up a bit with comprehension. You'd kinda cheat your way to quick comprehension by drilling those in conjunction with AJATT, and you could just suspend all the intuitive collocations.
Edited: 2017-04-09, 11:17 am
#12
(2017-04-09, 11:01 am)Stansfield123 Wrote: This is only tangentially related, but what kind of tools, and how much work, would it take to do the same type of deck, but with the most frequent collocations (let's say the most common 3 to 6 words used together repeatedly) instead of the most frequent words?

Would it even be possible, with one of these text analyzers? Because if it was, it would be an invaluable resource.

I attempted to search for collocations using this tool: http://www.laurenceanthony.net/software/antgram/
I first had to segment my text into words using mecab.

I did not have great results with this, even when specifically searching for lines containing particles like を に が
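For anyone who wants to experiment without AntGram, the simplest version of the idea is counting contiguous runs of 3 to 6 tokens over text that has already been segmented. The sketch below is not what AntGram does internally, and the function name is made up. It assumes each line has already been split into tokens (for example by mecab).

```python
from collections import Counter


def ngram_counts(token_lines, n_min=3, n_max=6):
    """Count every contiguous run of n_min..n_max tokens within each line."""
    counts = Counter()
    for tokens in token_lines:
        for n in range(n_min, n_max + 1):
            # range() is empty when the line is shorter than n tokens.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical pre-segmented lines.
token_lines = [
    ["気", "を", "つけ", "て"],
    ["気", "を", "つけ", "ろ"],
]

counts = ngram_counts(token_lines)
for ngram, count in counts.most_common(3):
    print(" ".join(ngram), count)
```

Raw counts like these are only a starting point: every n-gram is counted regardless of whether it is a meaningful expression, so the top of the list still needs filtering by hand or by some association measure.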
#13
(2017-04-09, 12:17 pm)Zarxrax Wrote:
(2017-04-09, 11:01 am)Stansfield123 Wrote: This is only tangentially related, but what kind of tools, and how much work, would it take to do the same type of deck, but with the most frequent collocations (let's say the most common 3 to 6 words used together repeatedly) instead of the most frequent words?

Would it even be possible, with one of these text analyzers? Because if it was, it would be an invaluable resource.

I attempted to search for collocations using this tool: http://www.laurenceanthony.net/software/antgram/
I first had to segment my text into words using mecab.

I did not have great results with this, even when specifically searching for lines containing particles like を に が

Yeah, this was the thing I tried to use.
I think something to find collocations could probably be programmed. It seems kinda complicated though, as you'd have to account for word stems, i.e. taking the い off of i-adjectives, the る off of ru-verbs, but also include whole word versions of the word, so that we don't miss any words. You'd need to use whatever logic cb's tool uses for morphemizing words.

Someone's gotta make it though. We have to see how much more efficiently we can learn Japanese instead of actually just learning it.
#14
(2017-04-09, 1:47 pm)vladz0r Wrote: Yeah, this was the thing I tried to use.
I think something to find collocations could probably be programmed. It seems kinda complicated though, as you'd have to account for word stems, i.e. taking the い off of i-adjectives, the る off of ru-verbs, but also include whole word versions of the word, so that we don't miss any words. You'd need to use whatever logic cb's tool uses for morphemizing words.

Someone's gotta make it though. We have to see how much more efficiently we can learn Japanese instead of actually just learning it.

Mecab can do that to the words, but even then, you probably aren't finding anything too useful. I really suspect that the actual most common collocations that would appear are things that are not useful, while the useful ones are much less common.
I highly recommend the book "common Japanese collocations" for daily use collocations, but it wouldn't have a lot of ones that might be common in anime and such.
#15
(2017-04-09, 11:06 am)vladz0r Wrote: Thanks a ton. I feel like you probably saw the v2k thing I posted https://www.reddit.com/r/LearnJapanese/c...al_novels/

Found that through the anki shared decks when I was about 1/2 way finished with mine. I did a comparison and saw that there were a lot of differences from the words in my deck, so I decided to keep going forward with this.
#16
awesome idea!
#17
Updated the spreadsheet.

- Added core index column
- Added "kanjified" versions of words into the notes column
- Removed about 25 words and replaced with new ones
- Fixed several errors.

The removed words included some duplicates, some things that probably shouldn't have been counted as words in the first place, some obvious combos of two words like ここから, and some counters and suffixes which I didn't really want to include a lot of in the first place.

The suggestion to add the core index column turned out to be a HUGE help overall, as it helped me to identify a lot of the words that needed to be removed, and helped me find some duplicates as well. I was also able to get more accurate stats on how much of the vocab is already covered by the core decks, which was actually a little higher than I previously thought.
- About 300 words (15%) do not appear in core10k
- About 400 words (20%) do not appear in core6k
- About 1000 words (50%) do not appear in core2k
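For reference, numbers like these boil down to a set difference between the deck and each core list. A minimal sketch with toy data (the function name and word lists are made up for illustration, not taken from the spreadsheet):

```python
def missing_from(deck_words, reference_words):
    """Return the deck words absent from the reference list, and their share.

    Comparison is by exact spelling, so a word written in kanji in one list
    and in kana in the other counts as missing unless both are normalized.
    """
    reference = set(reference_words)
    missing = [w for w in deck_words if w not in reference]
    share = len(missing) / len(deck_words) if deck_words else 0.0
    return missing, share


# Toy data only.
deck = ["貴様", "これ", "忍者", "はい"]
core = ["これ", "はい", "新聞"]

missing, share = missing_from(deck, core)
print(missing, f"{share:.0%}")
```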

The deck can now truly be used as a supplement if you have already studied one of the core decks.
I've also decided to focus on first collecting sentences for the words that do not appear in the core6k, as these are going to be the most time consuming to collect.

Some examples of fun words that you aren't going to find in core:
貴様 野郎 召喚 親父 妖怪 海賊 兄貴 攻撃力 殺害 忍者 怪獣 天使 仙人
Edited: 2017-04-23, 1:15 pm
#18
Many thanks for this.  Having already done Core6K, it is incredibly helpful having a way to remove the cards I already know quickly!
#19
(2017-04-11, 8:57 pm)Zarxrax Wrote: The suggestion to add the core index column turned out to be a HUGE help overall, as it helped me to identify a lot of the words that needed to be removed, and helped me find some duplicates as well. I was also able to get more accurate stats on how much of the vocab is already covered by the core decks, which was actually a little higher than I previously thought.


I've also decided to focus on first collecting sentences for the words that do not appear in the core6k, as these are going to be the most time consuming to collect.
1. Glad the suggestion helped you.

2. It can help you even more as with this you can get without any effort example sentences with translation and mostly very good audio from Core. [actually I do not understand why you think you need 3 sentences - after some repetitions all sentences will be boring]

And here is a bonus on many of the remaining 296 which is almost free. You received in another thread some suggestions how to pick up sentences from a corpus (e.g. wildcard). Just apply this on core 10k and you will get more example sentences with translation and mostly very good audio. I noticed this when I saw 拳銃 in your list.

Actually for some of the words I ask myself whether it makes any sense to include them in any corpus (and make all the effort you planned to spend for 3 sentences): -さん, この, はい, ...
#20
(2017-04-12, 8:05 am)Matthias Wrote: 1. Glad the suggestion helped you.

2. It can help you even more as with this you can get without any effort example sentences with translation and mostly very good audio from Core. [actually I do not understand why you think you need 3 sentences - after some repetitions all sentences will be boring]

And here is a bonus on many of the remaining 296 which is almost free. You received in another thread some suggestions how to pick up sentences from a corpus (e.g. wildcard). Just apply this on core 10k and you will get more example sentences with translation and mostly very good audio. I noticed this when I saw 拳銃 in your list.

Actually for some of the words I ask myself whether it makes any sense to include them in any corpus (and make all the effort you planned to spend for 3 sentences): -さん, この, はい, ...

I want to have natural sentences from the anime rather than reusing core sentences. And I want three because I feel that one sentence can't give a complete picture of what a word means. Many words can be used in multiple different situations and have different meanings, and one sentence just didn't capture that. Three sentences can't fully capture the meaning of a word either, but it does a much better job than just one. I don't study sentences. They are just used in my view to clarify the meaning of a word by showing usage in context.

As far as whether it makes sense to include certain words, yes, those decisions are basically 90 percent of the work that has led to this so far. Every single word is a decision. How do you choose what's worthy to include or reject? How do you even define what is a word? Are some words too "easy" to include? Where do you draw the line? Maybe -さん is something that everyone would already know, but there are other name suffixes that people might not know because textbooks don't really teach them. If you are including some name suffixes, why not others? If I don't include この then what about これ?(core included one but not the other). One person already commented that they think it's important to have the rarely used kanji for these words.

I'm very open to removing more words, but I would like to hear a good argument for why. I want this to be useful both for beginners as well as people who are just looking to fill in a few gaps.
#21
(2017-04-12, 9:51 am)Zarxrax Wrote: I want to have natural sentences from the anime rather than reusing core sentences. And I want three because I feel that one sentence can't give a complete picture of what a word means. Many words can be used in multiple different situations and have different meanings, and one sentence just didn't capture that. Three sentences can't fully capture the meaning of a word either, but it does a much better job than just one. I don't study sentences. They are just used in my view to clarify the meaning of a word by showing usage in context.

As far as whether it makes sense to include certain words, yes, those decisions are basically 90 percent of the work that has led to this so far. Every single word is a decision. How do you choose what's worthy to include or reject? How do you even define what is a word? Are some words too "easy" to include? Where do you draw the line? Maybe -さん is something that everyone would already know, but there are other name suffixes that people might not know because textbooks don't really teach them. If you are including some name suffixes, why not others? If I don't include この then what about これ?(core included one but not the other). One person already commented that they think it's important to have the rarely used kanji for these words.

I'm very open to removing more words, but I would like to hear a good argument for why. I want this to be useful both for beginners as well as people who are just looking to fill in a few gaps.
Nobody loves the Core sentences - but they are free, have normally a reasonable translation and a good audio.

I don't study sentences, but I either actively searched for additional sentences or happily picked them up when I noticed a learned word is included. And my impression is that even "great" sentences from "great" scenes from "great" movies are in the end not any better or more helpful than the "boring" Core sentences. And I talk about my personally selected sentences. OK, and now one step further: You select a sentence from a scene that I do not know - I could not care less, right from the beginning.

Recently I started using Yomichan dictionary entries and by that I noticed that many of the "boring" Core sentences are constructed around word combinations [collocations?] you would also find in dictionary entries. That improved my opinion about the Core sentences. :-)

There was no request from me to exclude any words, I just noticed while I let Excel count from the beginning to the end of the list that a lot of the words "missing" in Core are not worth mentioning. It seems that the "fu*cking Core" covers anime much better than expected.
#22
(2017-04-12, 2:08 pm)Matthias Wrote: Nobody loves the Core sentences - but they are free, have normally a reasonable translation and a good audio.
Well, that's debatable. iKnow/Cerego have claimed that the content was never released under a CC license, though several people dispute that. Whether it ought to be free or not, there is still a possibility of it being taken down for whatever reason. So I think having an alternative is a good thing. It's certainly fair use to pull a quote or two from an episode of a tv show.

(2017-04-12, 2:08 pm)Matthias Wrote: It seems that the "fu*cking Core" covers anime much better than expected.

Quite true, it even surprised me a bit. However, it's not really fair to compare the coverage of a 6000 or 10000 word list to a 2,000 word list, because OF COURSE the larger list is going to get most of the low hanging fruit. It's just a matter of how fast you want to get there. For someone who is just starting out, and doesn't want to study 6000 business and politics terms, this could be a decent alternative path.
#23
1. My fault: I should have said effort free. The legal side is disputable, but if you want to share it at the ANKI webpage, you better not use Core. Well, the effort seems to be no issue for you.
2. Yes indeed, size matters. And when you are through Core you probably don't want to "learn" -さん, この, -ちゃん, はい, その, あの, こんにちは or 東京 and 日本人.
#24
The problem with Core on the anki shared decks page was actually the stock images. Not sure about the legal issues aside from that.
#25
I have made a big update to the spreadsheet. I have completed collecting 3 sentences for the words that do not appear in core6k.
In the process, I found a lot of words that shouldn't have been in the list, and have removed them. I also removed a few more counters/suffixes and some interjections. In all, more than 60 words have been replaced, which was more than I expected.

Going forward, I expect very few changes to the word list, probably fewer than 10 words at most. Collecting sentences for the remaining words should be significantly faster on a per-word basis, though I'm not sure how much free time I'll have left for this. I've mostly been doing this at work because I've had a lot of free time, but that free time will likely be drying up soon.