
Manga Corpus or Frequency List?

#1
I was wondering if anyone could suggest a way to find lots of manga content in text-file format. I'm interested in extracting a word frequency list from it ;)

Thanks
#2
It'll be easier to use that subtitle site for various anime. The subtitles are already text files, and they're dialogue-based just like manga.

In addition, use subtitle text from d-addicts, and maybe scrape the dramanote.seesaa.net script site. You should get more than enough conversation-based vocabulary from all that.
#3
I don't know how good OCR software is these days, but you could try running OCR on some manga.
#4
Download all the subtitle files here and use them to extract frequency lists:

http://kitsunekko.net/JapaneseAnimeSubtitles.aspx

This software is very good for working with corpora and frequency lists.
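
For anyone who would rather script it, here is a minimal sketch of pulling a frequency list out of subtitle text. It assumes the files are in SRT format and uses a crude stand-in for a tokenizer (runs of kanji, katakana, or hiragana); a real morphological analyzer would split words far more accurately. The names `dialogue_lines` and `frequency_list` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Lines that are SRT cue numbers or timestamps carry no dialogue.
_TIMING = re.compile(r"^\d+$|^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->")
# Crude stand-in for a tokenizer (an assumption, not real word
# segmentation): runs of kanji, katakana, or hiragana.
_RUNS = re.compile(r"[\u4e00-\u9fff々]+|[\u30a0-\u30ff]+|[\u3040-\u309f]+")


def dialogue_lines(srt_text):
    """Yield only the spoken lines from SRT-formatted text."""
    for line in srt_text.splitlines():
        line = line.strip()
        if line and not _TIMING.match(line):
            yield line


def frequency_list(srt_text, tokenize=_RUNS.findall):
    """Count tokens across all dialogue lines, most common first."""
    counts = Counter()
    for line in dialogue_lines(srt_text):
        counts.update(tokenize(line))
    return counts.most_common()


# Invented two-cue sample, just to show the shape of the input.
sample = """1
00:00:01,000 --> 00:00:03,000
海賊になる

2
00:00:04,000 --> 00:00:06,000
海賊だ
"""
print(frequency_list(sample))
```

The `tokenize` parameter is there so a proper Japanese tokenizer can be swapped in later without touching the counting code.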


bombersons Wrote:I don't know how good OCR software is these days
Very good. He must try them.
Edited: 2009-06-21, 3:44 am
#5
ahibba Wrote:
bombersons Wrote:I don't know how good OCR software is these days
Very good. He must try them.
Do you have any suggestions for good OCR software, ahibba? I've been looking around, but I don't know which ones are any good =( I'm on Linux, but I can run Windows in VirtualBox if need be =)
#6
ReadIRIS 12 Asian Edition. (version 11 is also good.)

What do you want to OCR?
#7
I just wanted to try it out on some manga to see how well it worked =D

I'll take a look at that =) thanks
#8
What's annoying about ReadIRIS 11 is its different box types (image box, text box, and so on). If you want to automate the process for, say, 100 images, the result is horrible because it can't tell whether a picture (especially a PNG extracted from sub/idx files) is an image or text.
You have to manually change the box for every picture from image to text, which is pretty boring :(
However, for Asian-language OCR, I think it's the best.
Edited: 2009-06-21, 8:56 am
#9
=( It's not free, and I can't find other places to "acquire" it. Is there any free OCR software that handles Japanese text? I've tried GOCR and Tesseract, but I can't seem to get them to work with Japanese =(
#10
SmartOCR was posted on these forums a while ago, I think.

But... If you want to learn words used in manga so that you can um, read manga, why don't you just read a manga you like and look up the words you don't know? Then not only will you learn the words common in manga, but they'll be exactly the words you need for the text.
#11
A frequency list of words from manga might be interesting, but I'm not too sure it'd be very useful for delving into other aspects of the language (real-life conversation, novels, etc.) compared to other types of frequency list.

Then again, I guess it just depends on the manga. For some reason I keep imagining a frequency list that ranks 海賊 and 忍術 above です, haha :D
#12
Aijin, my primary interest/motivation for learning Japanese is to read manga ;)

But I think I would aim for at least 95% coverage, since that makes it much easier to learn the remaining words organically and lets me enjoy manga more. That 95% coverage would probably translate to something like 85% coverage of the words seen in novels and conversations anyway, as long as I extract the frequency list from multiple manga.
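
To make the coverage idea concrete, here is a rough sketch of how one might measure it. It picks the most frequent words needed to reach a target share of all tokens in one corpus, then checks what share of a second corpus those words cover. The function names and the toy counts are invented for illustration; the real percentages would depend entirely on the actual corpora.

```python
from collections import Counter


def words_for_coverage(counts, target=0.95):
    """Return the most frequent words needed to cover `target` of all tokens."""
    total = sum(counts.values())
    chosen, covered = [], 0
    for word, n in counts.most_common():
        if covered >= target * total:
            break
        chosen.append(word)
        covered += n
    return chosen


def coverage(vocab, counts):
    """Fraction of tokens in `counts` whose word is in `vocab`."""
    total = sum(counts.values())
    known = sum(n for word, n in counts.items() if word in vocab)
    return known / total if total else 0.0


# Toy counts, made up purely to exercise the functions.
manga = Counter({"です": 50, "海賊": 30, "忍術": 15, "大脳基底核": 5})
novels = Counter({"です": 60, "海賊": 5, "しかし": 35})

vocab = set(words_for_coverage(manga, 0.95))
print(len(vocab), coverage(vocab, novels))
```

Running the same two functions on frequency lists from manga and from novels would show how far the 95%-to-85% guess actually holds.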
Edited: 2009-06-21, 11:03 am
#13
Yeah, as long as your source material covers multiple genres, reading levels, and demographics, I think it'd help with most other areas of the language too. Some manga actually have pretty good vocabulary in them, especially ones involving the sciences, where you'll get exposed to more academic and scientific terms. Nothing more important than knowing how to say 'basal ganglia' in Japanese, right? :D
#14
ahibba Wrote:Download all the subtitle files here and use them to extract frequency lists: http://kitsunekko.net/JapaneseAnimeSubtitles.aspx
Off-topic:
You helped me without knowing it :) I was looking for subtitles for Seto no Hanayome.