
Kanji list for manga

#1
Here's a suggestion:

How about using Japanese OCR software to scan through manga files (since many raw manga are released in JPG and PNG format)?

Eliminate all the duplicate kanji....and voila....you have a list of all the kanji you'll need to understand that particular manga volume! Then enter them into your favorite SRS and study the list before reading that manga :-)

You could even use Excel to eliminate duplicate kanji instead of writing a program.
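For anyone who would rather script it than use Excel, here is a minimal sketch in Python. It assumes the OCR output is already available as plain text, and it treats only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) as kanji, which is a simplification. The function name `unique_kanji` is made up for illustration.

```python
def unique_kanji(text):
    """Return the distinct kanji in text, in order of first appearance."""
    seen = set()
    result = []
    for ch in text:
        # Assumption: "kanji" means the main CJK Unified Ideographs block only.
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


# Kana and punctuation are skipped; repeated kanji are listed once.
sample = "神々のトライフォース、魔法の神殿"
print("".join(unique_kanji(sample)))
```

Rarer kanji outside that block (and the repetition mark 々) would be missed, so widen the range check if your manga needs them.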

At the moment, Readiris Pro Asian Edition ($200) seems to be one of the best-known Asian OCR packages. I've tried searching for freeware or shareware OCR software, but couldn't find any. (Btw, there is SmartOCR Lite, which is free, but I couldn't get it to work on my XP SP3 with .NET 3.0. The software requires .NET Framework 1.1 and wouldn't install, giving a message saying that I HAVE to install 1.1. Oh well...).

Let me know if anyone finds a more suitable software package. For the time being, I guess Readiris is the best option.


Thanks.
Edited: 2008-12-12, 12:44 pm
Reply
#2
Or simply learn all the jouyou kanji and you will know all the kanji needed for pretty much any manga....
Reply
#3
Sounds like it would be a good idea if the OCR software can pick out kanji compounds (I don't know if it can), then you could add sentences with those compounds in, then you could test yourself reading it...
Reply
#4
My suggestion is actually for when you're done with RTK.
It's not a replacement or shortcut. You should always keep
your Heisig reviews up-to-date.


It would dramatically reduce the need to use a dictionary
while reading (for those that use dictionaries) and make the reading experience
a lot smoother. And by "study" the kanji, I mean in the Heisig sense (keyword to kanji).

So you'd get:

1) The kanji that you'll know to expect when reading
2) The random kanji that come up in daily reviews
Edited: 2008-12-12, 1:03 pm
Reply
#5
bombpersons Wrote:Sounds like it would be a good idea if the OCR software can pick out kanji compounds (I don't know if it can), then you could add sentences with those compounds in, then you could test yourself reading it...
Why not simply read the manga and learn the compounds as you go...

I guess it could be good for making a database of "manga you can read if you know these words" but outside of that, I don't see the point.
Reply
#6
bombpersons Wrote:Sounds like it would be a good idea if the OCR software can pick out kanji compounds (I don't know if it can), then you could add sentences with those compounds in, then you could test yourself reading it...
You can find OCR'd light novels on the Internet if you search hard enough, and you can extract every compound with some help from MeCab (actually, MeCab can even deinflect verbs and do morphological analysis).

So, yeah, it's possible, and it sounds interesting.
I may give it a try when I'm idle enough.
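To give a rough idea of what that extraction could look like, here is a hedged sketch in Python. It does not run MeCab itself. It only parses text in the layout that MeCab prints with the IPAdic dictionary (surface form, a tab, then comma-separated features with the dictionary form in the seventh field), which is an assumption about your setup. `SAMPLE` is typed by hand in that layout rather than produced by a real run, and `count_base_forms` is a made-up name.

```python
from collections import Counter

# Parts of speech to keep: noun, verb, adjective, adverb.
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}

# Hand-typed sample in the assumed IPAdic layout (not a real MeCab run).
SAMPLE = "\n".join([
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "読ん\t動詞,自立,*,*,五段・マ行,連用タ接続,読む,ヨン,ヨン",
    "だ\t助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ",
    "。\t記号,句点,*,*,*,*,。,。,。",
    "EOS",
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "買っ\t動詞,自立,*,*,五段・ワ行促音便,連用タ接続,買う,カッ,カッ",
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ",
    "。\t記号,句点,*,*,*,*,。,。,。",
    "EOS",
])


def count_base_forms(mecab_output):
    """Count dictionary forms of content words in MeCab-style output."""
    counts = Counter()
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # sentence boundary or unexpected line
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        if fields[0] not in CONTENT_POS:
            continue  # skip particles, auxiliaries, punctuation
        # Index 6 is the dictionary form; "*" means the analyzer had none.
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        counts[base] += 1
    return counts


# Most frequent words first.
for word, n in count_base_forms(SAMPLE).most_common():
    print(word, n)
```

Counting dictionary forms instead of surface forms is what lets 読んだ and 読む end up as one entry in the study list.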
Reply
#7
I don't understand. Knowing which kanji will appear in a particular manga doesn't do anything about your need to use a dictionary, because kanji aren't words.
Compounds are different, but as Tobberoth mentioned, going out of your way to find sentences with the compounds you want to memorize because they're going to appear in a book is... fairly backwards when you could just take the sentences that contain the vocabulary out of the actual book. Of course, that's a choice. The first part just doesn't make any sense, though.
Edited: 2008-12-12, 1:28 pm
Reply
#8
Tobberoth Wrote:
bombpersons Wrote:Sounds like it would be a good idea if the OCR software can pick out kanji compounds (I don't know if it can), then you could add sentences with those compounds in, then you could test yourself reading it...
Why not simply read the manga and learn the compounds as you go...

I guess it could be good for making a database of "manga you can read if you know these words" but outside of that, I don't see the point.
Theoretically, learning the vocab/compounds beforehand would make the actual reading of the material more enjoyable. Looking up words in a dictionary while reading takes any enjoyment out of the process for me personally, which is why I don't use a dictionary while I read. I write down the words I don't know and study/look them up later.

I think the idea has merit, but requires too much work. If such a list existed for a series I had interest in reading, I'd definitely give it a try.
Reply
#9
As a proof of concept: list of nouns, verbs, adverbs and adjectives from the first chapter of Suzumiya Haruhi no Yuuutsu. And here's the script that extracted that information.

Not completely accurate, but pretty close.

Edit: Adjectives. How could I forget them?
Edit 2: Ordered by the number of occurrences.
Edited: 2008-12-13, 2:09 am
Reply
#10
This is less useful for people still learning RTK than it is for people who are done with RTK or who don't ever intend to do RTK.

For those other people, I think this could work quite well. Add in a way to actually study the kanji/vocabulary properly (iKnow comes to mind) and it could be quite a nice tool.

0) Someone creates the study lesson
1) Study the lesson thoroughly
2) Read the manga with little effort
3) FUN!
Reply
#11
I'm all for this -- I've always wanted to do something like this, but I'm too lazy to actually bring it about. I'd certainly be willing to help if a project gets underway.
Reply
#12
wccrawford Wrote:This is less useful for people still learning RTK than it is for people who are done with RTK or who don't ever intend to do RTK.

For those other people, I think this could work quite well. Add in a way to actually study the kanji/vocabulary properly (iKnow comes to mind) and it could be quite a nice tool.

0) Someone creates the study lesson
1) Study the lesson thoroughly
2) Read the manga with little effort
3) FUN!
As long as we're talking words instead of kanji this could work. This could become iKnow lessons or even Anki decks. I kind of like this idea but it might be more useful to pick long manga for it. Writers usually use very similar vocabulary throughout the whole manga so you wouldn't have to put in much work for a lot of reading.
Edited: 2008-12-12, 10:26 pm
Reply
#13
There's also the point that OCR software still kind of sucks -- it often can't even do English right, so surely kanji recognition is also problematic -- and can easily mistranscribe the characters, so you might end up with the wrong characters in your list. Heck, I'm a human, and I know I had trouble reading a few kanji when I compiled a list of kanji used in ゼルダの伝説 神々のトライフォース (The Legend of Zelda: A Link to the Past), which I needed to extract its text. I even had to post one on a forum asking what the hell it was. (It turned out to be 魔 -- one I'd have recognized easily if it had been at higher resolution.)

And manga that's lettered by hand is probably a lost cause. :)

- Kef
Reply
#14
It's true that OCR isn't perfect and won't get 100% of everything right.
But if it can get 90%-95% accuracy most of the time, that's more than
enough to help out.

Some businesses these days use OCR systems to archive their large pile of documents.
Also, OCR software today can reliably partition the paragraphs for easier
recognition, even with kanji. So it definitely has come a long way.
Reply
#15
I remember initially thinking to myself "great! I finished RTK1 and now I can read anything!", but no. On the introduction page of 20世紀少年 Volume 1, I came across "絆", an RTK3 kanji. If you don't go through RTK3 then you'll frequently come across characters you've never seen before. Even after completing RTK3 I occasionally see kanji I don't recognise, the most recent being 訛 (accent) and 璧 (from 完璧【かんぺき】).

If you do RTK3, you won't regret it.
Edited: 2008-12-12, 9:15 pm
Reply
#16
I concur with those above. For kanji frequency, this is not useful. For word frequency, it's a damn good idea.

Here's been my current thought process on a baseline:

1. Finish RTK asap
2. Learn grammar through sentences. I'd suggest Tae Kim or UBJG, but don't sweat the vocabulary, and make these cards recognition only.
3. Get a base line of vocabulary. KO2001, a frequency chart, or iKnow are great for this. Use example sentences, but production and recognition must focus on the one vocabulary word. Mentally kick yourself if you miss other parts of the sentence (grammar, vocabulary, kanji), but fail the card if you miss the vocabulary word (writing or reading).

With the above baseline out of the way (basic kanji, grammar, vocabulary), you can pretty much go any direction: add in more kanji from RTK3, do an OCR sweep of a manga to find words you don't know ahead of time, or do a sweep of your favorite jdorama's script to find new words (no OCR needed, as many are online).

So yeah, sweep mangas, but I think that should be for word frequency.

PS: For those following iKnow, you can see how one can do the manga vocabulary sweep with iKnow. Consider: You do Core 2000 and 6000, which means you've got about 6000 vocabulary words under your belt. You scan 20th Century Boys and get a 4500-word list. When you upload this (broken down into 250-word chunks) via the API, the iKnow system will mark those words you already know, leaving just those you'll need to study.

So of those 4500 words, maybe 500 of them are new words, of which many will be unique names and places.
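The same sweep can be sketched locally in Python. This does not call the iKnow API (its calls are not shown in this thread); it only mimics the two steps described: dropping words you already know and splitting the remainder into 250-word chunks. `new_words` and `chunks` are hypothetical helper names, and the word lists are toy data.

```python
def new_words(scanned, known):
    """Return scanned words that are not known, deduplicated, in scan order."""
    known_set = set(known)
    seen = set()
    result = []
    for word in scanned:
        if word not in known_set and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def chunks(items, size=250):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy data standing in for a manga word list and an already-studied list.
scanned = ["友達", "絆", "友達", "地球", "絆"]
known = ["友達"]
to_study = new_words(scanned, known)
print(to_study)
print([len(c) for c in chunks(list(range(600)))])
```

Names and places will still slip through, so the resulting list probably needs a quick manual pass before studying.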
Reply
#17
nest0r Wrote:Also, I'm so sick of that lowercase 'i' being everywhere. Should I be cursing Apple for that little piece of branding?
iEsu. iUshould.
Reply
#18
furrykef Wrote:There's also the point that OCR software still kind of sucks -- it often can't even do English right, so surely kanji recognition is also problematic -- and can easily mistranscribe the characters, so you might end up with the wrong characters in your list. Heck, I'm a human, and I know I had trouble reading a few kanji when I compiled a list of kanji used in ゼルダの伝説 神々のトライフォース (The Legend of Zelda: A Link to the Past), which I needed to extract its text. I even had to post one on a forum asking what the hell it was. (It turned out to be 魔 -- one I'd have recognized easily if it had been at higher resolution.)

And manga that's lettered by hand is probably a lost cause. :)

- Kef
A Link to the Past was originally called 神々のトライフォース?
Reply
#19
Yep! Triforce of the gods. :)
Reply
#20
I also think it would be a nice idea to compile a list of vocabulary/kanji used in certain works. For instance, it seems that many people are reading 20th Century Boys. It would be nice to have a list of the vocab that shows up in the manga.

I'm also reading that at the moment and I'm compiling all the words I search for in a Wakan list. If anyone's interested, I could share that list, though it's of course only a partial list of the vocab that shows up in the manga.

What would be really useful is if someone uploaded the list to iKnow using their API and then added example sentences for each of the words (I dunno if the API can do that).
That way, it would be easy to study the vocab before starting to read the manga.
Reply
#21
Turns out that Office 2007 has built-in OCR of TIF and MDI files.
It's called "Microsoft Office Document Imaging".
Just convert those JPG/PNG files to TIF.
Even works for Japanese.
Unfortunately, it crashes right when it does the OCR for me........ :-(

Will let everyone know how things turn out....
Edited: 2008-12-15, 8:08 pm
Reply
#22
Well, I could never get that OCR imaging in Microsoft Office to work for Japanese.
Readiris 11 however seems to do the trick.

It even works well for manga, but you have to highlight every text box and then run OCR. So it's pretty tedious.

But it turns out to be REALLY great for light novels. Since most light novels are just words on a page. It automatically highlights each page as a single OCR target and works beautifully. I was impressed.

The output is filled with junk characters, but it seems to recognize nearly all the kanji, which is amazing. The junk text characters are Roman characters in between the kanji and hiragana. The only other step would be to write a program that strips out all the junk characters and then processes the remaining document.
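A minimal sketch of that stripping step in Python, assuming the junk really is limited to Roman letters (ASCII or full-width). The pattern and the name `strip_roman_junk` are illustrative only and have nothing to do with Readiris itself.

```python
import re

# ASCII letters plus full-width Latin letters (U+FF21..U+FF3A, U+FF41..U+FF5A).
ROMAN_JUNK = re.compile(r"[A-Za-z\uFF21-\uFF3A\uFF41-\uFF5A]+")


def strip_roman_junk(text):
    """Remove runs of Roman letters, leaving kanji, kana, and punctuation."""
    return ROMAN_JUNK.sub("", text)


print(strip_roman_junk("今日xはq晴れrtだ"))
```

The catch is that this also deletes legitimate Roman text, such as an acronym in a title, so it is only safe when the source is known to be all Japanese.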

I can see this being easy if all you wanted was just the list of kanji appearing in each page. But stripping out all the kanji compounds is a MUCH more difficult task IMHO.

I guess you could try longest prefix match against a Japanese dictionary, but I'm not sure how well that would work. Probably fairly decent, but it'd take a long time to process.
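Here is a hedged sketch of that longest prefix match in Python, with a tiny made-up word set standing in for a real Japanese dictionary. At each position it tries the longest candidate first and shrinks until it finds an entry; characters that start no entry are skipped. `longest_match_segment` and `max_len` are names invented for this example.

```python
def longest_match_segment(text, dictionary, max_len=8):
    """Greedy longest-prefix segmentation of text against a set of words."""
    words = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            i += 1  # nothing in the dictionary starts here; skip one character
        else:
            words.append(match)
            i += len(match)
    return words


# Toy dictionary; a real run would load the entries of a full dictionary file.
toy_dictionary = {"東京", "東京都", "都", "行く", "に"}
print(longest_match_segment("東京都に行く", toy_dictionary))
```

If the dictionary is held in a set, each lookup takes roughly constant time, so the work is about `max_len` lookups per character and speed is probably less of a worry than accuracy. Greedy matching has no notion of grammar, so inflected verbs and ambiguous splits will still come out wrong.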
Edited: 2008-12-30, 6:43 pm
Reply
#23
http://www.wanzafran.com/2009/01/languag...kanji.html

Here's wanzafran's tutorial on making a kanji list, if anyone's interested.
Reply
#24
For me this whole "learning the kanji/vocabulary before reading" seems to be the opposite of what I want. I don't want to learn lists with no connection to a story, image or sentence. Isn't the idea that many of us pursue here (and maybe more on ajatt) that we learn FROM the material, not FOR it? If you already went to the trouble of drilling those kanji/vocab, then is the reading of the manga really so valuable?

The techniques (OCR, getting words from manga) discussed sound interesting and promising though, making this a valuable thread. So please don't see this as mere criticism.

What I would do (or am already doing) is:
- read the manga and input words as you look them up; this way the words stick better, I think
- read the manga, mark the sentences that sound interesting, then later go back and put them all in through OCR or by whatever means.


I was actually planning to write about this as part of a different topic some time soon, so I guess I will sound repetitive by then.
Reply
#25
zwarte_kat Wrote:If you already went to the trouble of drilling those kanji/vocab, then is the reading of the manga really so valuable?
Now that's backward. I'm learning Japanese so that I can read Japanese manga, books, newspapers etc. Not the other way round.
Reply