How to parse a deck for unique words

Reply #1 - February 20, 7:50 pm
afterglowefx Member
From: Gunma, Japan Registered: 2013-12-01 Posts: 228

Forgive me if this has been brought up, but a few searches of the forum returned nothing and I haven't been able to find much on Google.

Can somebody please explain to me, in relatively straightforward language, the tools and methods necessary for figuring out the total number of unique words in a given Anki deck?

I'm not computer illiterate but I've never coded anything, never parsed anything, nor anything of the sort.

If there's a rather straightforward process for doing what I would like to do, I'd really be indebted to whoever walks me through it.

Thanks in advance.

Reply #2 - February 20, 8:13 pm
comeauch Member
From: Canada Registered: 2011-11-04 Posts: 175

It depends on your deck: I'm assuming you know how to get the number of facts of a deck? If you only have 1 word/card, then it's simply a matter of finding the number of facts.

If your cards comprise many words (sentences for example), it seems you don't have much choice but to export your deck as text (File>Export...) and then parse the resulting txt file with something like cb's Japanese Text Analysis Tool (http://forum.koohii.com/viewtopic.php?id=9815)

Hope it helps!

Reply #3 - February 20, 9:08 pm
afterglowefx Member
From: Gunma, Japan Registered: 2013-12-01 Posts: 228

Wow, that was a lot easier than I expected.

Thanks a lot for pointing me to that tool; it's straightforward and simple to use!

For those possibly coming after me and wondering what I did specifically:

1) Export cards from the chosen decks in plain text format.
2) Run cb's Japanese Text Analysis Tool on each deck (if using multiple).
3) Combine the resulting frequency reports, again using cb's Japanese Text Analysis Tool (only needed if using multiple decks).
4) This gives you a nice big frequency report of every word the parser recognizes in your deck (I saw は 17,000 times!) but no total number of words. To get the total number of words I then imported the data into Excel: Data->From text->hit okay about 5 times.

Then I just scrolled down to see the total number of fields, which, to my surprise, was over 7000. I don't know that things like は and で count as words (I guess 'is' and 'of' do, though..) and I don't know that I feel comfortable claiming 7000 words, but this was nevertheless a fun little bit of trivia to have about my decks.
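
For anyone who would rather skip the spreadsheet, the counting step can be sketched in a few lines of Python. This is only a rough sketch: it assumes the frequency report is tab-separated with the count in the first column and the word in the second, which may not match the tool's actual layout, and the name `unique_words` is made up for illustration.

```python
def unique_words(report_text, word_column=1):
    """Collect the distinct words found in a tab-separated frequency report.

    Assumption: one entry per line, with the word in column `word_column`
    (0-based). Adjust the column to match the real report.
    """
    words = set()
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) > word_column and fields[word_column].strip():
            words.add(fields[word_column].strip())
    return words


# Made-up sample data standing in for two per-deck reports.
sample_report = "17000\tは\n120\t話\n3\t猫\n"
deck_a = unique_words(sample_report)
deck_b = unique_words("5\t話\n2\t犬\n")

print(len(deck_a))           # number of unique entries in one report
print(len(deck_a | deck_b))  # unique entries across both reports combined
```

For a real report you would read the file first, for example with `open(path, encoding="utf-8").read()`, where both the path and the encoding are assumptions about your own export.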

Cheers again to comeauch for the help, and to cb4960 for the awesome parser!

Reply #4 - February 20, 10:45 pm
afterglowefx Member
From: Gunma, Japan Registered: 2013-12-01 Posts: 228

Had another look at what I was doing. I realized that tons of my cards had WWWJDIC definitions with multiple kanji readings. I also realized that the 'reading' field in the Core deck was likely being counted on top of the kanji, thus 話 and はなし would be counted as two words instead of one.

So I exported my decks, made a test profile, imported the decks, deleted every field from the cards except the expression field, re-exported them in plain text, re-ran and recombined the parses, put it all back into Excel, and came up with a bit over 5,000 words. That feels much more accurate to me than the prior 7,000, and confirms my fears that a lot of words were being counted twice.
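
The "delete every field except the expression field" step could probably also be done on the exported text itself instead of in a test profile. A minimal sketch, assuming the plain text export is tab-separated with one note per line and the expression in the first column (both are assumptions about the deck, and `keep_expression_only` is a hypothetical helper name):

```python
def keep_expression_only(export_text, expression_column=0):
    """Return only the expression field of each exported note, one per line.

    Assumption: fields are separated by tabs and the expression sits in
    column `expression_column` (0-based).
    """
    kept = []
    for line in export_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) > expression_column:
            kept.append(fields[expression_column])
    return "\n".join(kept)


# Made-up sample: expression, reading, meaning.
sample_export = "話\tはなし\ttalk; story\n猫\tねこ\tcat\n"
print(keep_expression_only(sample_export))
```

The output can then be saved and fed to the parser, so that readings like はなし are not counted on top of 話.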

Interesting, but a fair amount of fiddling required!

Reply #5 - February 21, 12:04 am
JunePin Member
Registered: 2011-10-12 Posts: 49

It's also possible in Excel to sort, find and remove duplicates. You could also remove all particles using find and replace and then remove all empty rows.
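
The same cleanup can be sketched outside of Excel. The particle list below is a small hand-picked example and not a complete list, and `content_words` is a hypothetical name, so treat this as a starting point only:

```python
# Illustrative, incomplete set of particles to leave out of the count.
PARTICLES = {"は", "が", "を", "に", "で", "と", "も", "の", "へ"}


def content_words(words, stop_words=PARTICLES):
    """Drop particles and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result


print(content_words(["は", "話", "で", "話", "猫"]))
```

Whether particles should count as words at all is a judgment call, so it may be worth comparing the totals with and without the filter.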

Reply #6 - February 21, 2:41 pm
sholum Member
Registered: 2011-09-19 Posts: 265

So, is there a way to get it to produce cards with the frequency list in one go or do I need to use a spreadsheet for that (or JGlossator, if it has that kind of functionality)?

Also, I find it much more convenient to look at the total number of unique words/phrases it finds by opening it in Notepad++ (or something similar), which has numbered lines. I was surprised when I ran a particular innocent book through it and found that the total number of unique entries was only 5406 (using JParser)

I can't believe it took me this long to check out such a useful set of tools.

Reply #7 - February 21, 6:12 pm
afterglowefx Member
From: Gunma, Japan Registered: 2013-12-01 Posts: 228

You actually want the frequency number on your study cards? Man that'd be depressing once you break out of beginner Japanese ;)

I didn't even think of Notepad++. I do suppose that would be much easier than importing it into Excel. It's really neat having a huge list of words I (ostensibly) know--having never taken JLPT or a single class in Japanese or a single graded metric at all whatsoever, it's a nice little bit of feedback.

By the way, I keep seeing the phrase "innocent novel" come up--is an "innocent" novel in some way different from a normal one?

Reply #8 - February 21, 10:00 pm
yogert909 Member
From: Los Angeles, Ca Registered: 2013-05-03 Posts: 269 Website

There's an Anki add-on (I think it's called kanji stats) which will scan your decks and tell you how many unique kanji there are, among other things.

Reply #9 - February 21, 10:07 pm
afterglowefx Member
From: Gunma, Japan Registered: 2013-12-01 Posts: 228

yogert909 wrote:

There's an Anki add-on (I think it's called kanji stats) which will scan your decks and tell you how many unique kanji there are, among other things.

Yep, I've got it, but the number of unique kanji is in no way indicative of the number of words. It's a fun add-on, but it's something completely different.

Reply #10 - February 22, 12:16 am
sholum Member
Registered: 2011-09-19 Posts: 265

afterglowefx wrote:

You actually want the frequency number on your study cards? Man that'd be depressing once you break out of beginner Japanese ;)

I didn't even think of Notepad++. I do suppose that would be much easier than importing it into Excel. It's really neat having a huge list of words I (ostensibly) know--having never taken JLPT or a single class in Japanese or a single graded metric at all whatsoever, it's a nice little bit of feedback.

By the way, I keep seeing the phrase "innocent novel" come up--is an "innocent" novel in some way different from a normal one?

[To answer your last question, see the very last segment]

No, not the frequency number. I just want to auto-populate some fields with the relevant dictionary information and add them to a deck ordered by frequency. I figured it'd be helpful when reading, since, even though I've studied close to 7000 words now, I still run into a lot of new vocabulary. That's why I was surprised to find that a particular LN I was reading had fewer than 5406 unique results (that's including particles, names, and really, really common words as well); furthermore, nearly half of those results (which include phrases) only show up once in the entire book. I don't think those once-only results would be spaced so evenly that I'm constantly seeing unknown words on every page (not just unknown, but unstudied), so that means the 2825 entries that appear twice or more must contain words that I don't understand.
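
As a quick check of the "nearly half" figure, taking 5406 as the total and 2825 as the entries that appear at least twice, the arithmetic can be sketched like this. The helper `once_vs_repeated` and the sample counts are made up for illustration; a real run would build the counts from the frequency report.

```python
def once_vs_repeated(counts):
    """Split entries into those seen exactly once and those seen 2+ times."""
    once = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c >= 2)
    return once, repeated


# Figures quoted in the post.
total_entries = 5406
repeated_entries = 2825
once_entries = total_entries - repeated_entries
share_once = round(once_entries / total_entries, 3)
print(once_entries, share_once)  # 2581 0.477

# The same split on a tiny made-up frequency table.
sample_counts = {"は": 900, "魔法": 42, "竜": 2, "呪文": 1, "城": 1}
print(once_vs_repeated(sample_counts))
```

So roughly 48% of the entries show up only once, which fits the "nearly half" estimate.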

Basically, the idea of having a frequency deck for a book (or series) is that I efficiently learn words that are more common in that book or its genre, but may not be common elsewhere (hence, the reason I haven't seen them in Core yet: they just aren't common in the news).

I understand that it will only get me so far, but I'm a stickler for efficiency, so if I'm going to be reading fantasy novels, I want to get used to fantasy words; I think the best way to do that would be to learn the common fantasy words first and just learn the others as you see them (or, if you like Anki, you could import the entire list into a deck, which would prevent duplicates, for the most part; meaning you could study every unstudied word or phrase in the book).
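
The "common words first" idea can be sketched as a simple filter and sort. This assumes you already have the word counts for the book and a set of words you have studied; `study_list`, `min_count`, and the sample data are all hypothetical:

```python
def study_list(counts, known_words, min_count=2):
    """List unstudied words, most frequent first.

    Words seen fewer than `min_count` times are left out; ties are
    broken by the word itself so the order is stable.
    """
    candidates = [
        (word, count)
        for word, count in counts.items()
        if word not in known_words and count >= min_count
    ]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in candidates]


# Made-up counts for a fantasy novel and a made-up set of studied words.
book_counts = {"魔法": 42, "剣": 17, "は": 900, "竜": 17, "呪文": 1}
already_studied = {"は", "剣"}
print(study_list(book_counts, already_studied))
```

Setting `min_count=1` would keep every unstudied entry, which matches the "import the entire list" option. Filling in readings and definitions automatically is a separate step that this sketch does not cover.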

It's just for some primers really, I don't like SRSing for more than two hours a day anymore (I've slowed to about 20 new cards a day, due to school; it takes thirty minutes at the most) and I've never cared for looking up every single unknown word (only if they seem important or show up enough to irritate me); even then, I'm pretty lazy about adding cards (hence, using Core), so I often don't think to make a card for a word I just looked up and just continue reading; so I only want to start with the most important words of books of varying genres and demographics to make future reads go a bit smoother.

And, since I'm lazy, automated is better, even if it's a bit ugly at first (it wouldn't be the first time I had to edit a card while reviewing), which is why I was wondering about automated reading and definition/translation entry for making cards.

As for 'innocent books':
http://forum.koohii.com/viewtopic.php?id=9575
figure it out from there; it's better that way.
