Hi all,
So I have never really used a scanner before. But my friend has one along with his Japanese computer, and I tried scanning one of my textbooks, which contains a lot of example sentences. It scans it as a photo into MS Word.
So, what is the best way to convert it to some kind of text that I can insert into Anki? Advice for a newbie to scanners is appreciated.
You need OCR software, which will take the image created by your scanner, and try to recognize all of the text in the image, and will allow you to dump the results into a text file. In the pen scanner thread, I think someone mentioned some good OCR software for Japanese. (OCR stands for Optical Character Recognition, by the way.)
Whatever software you get, make sure it does a good job on Japanese before you buy.
http://www.ofzenandcomputing.com/zanswers/1017
Office has its own built-in OCR according to this article, but I'm not sure whether it does Japanese. It's called Microsoft Office Document Imaging.
Last edited by dabidos (2008 January 14, 3:48 pm)
Sorry to revive this thread, but I am still searching in vain for a Japanese OCR. I don't own a scanner, so I don't have any software of any sort. I do, however, have several books already digitised (currently in DjVu format, but convertible to .pdf or image files if necessary). Microsoft OneNote has a good OCR built in, but I can only get it to work with English text.
For PDFs, try Foxit. It has Text PDF -> Text conversion capabilities, and it supposedly does Asian fonts as well. It's 100 times better than Acrobat, because it's not bloatware, not adware, and not a resource hog. Loads fast, runs fast. The only downside is that to get the PDF -> plain text ability, you have to pay a little for the non-free version.
I don't know how well it handles Asian text, though. Haven't had a chance to try it.
There are some Japanese OCR programs you can buy, but the only common software that has Japanese OCR support is Adobe Acrobat, which will let you create a new PDF file with the writing as text, which you can then extract.
Last edited by mystes (2008 April 27, 9:36 am)
I've got the full suite of Foxit programs (full versions), but nowhere can I find an OCR option. I even checked on their forums, and it specifically says no OCR functionality. Am I missing something?
Which version of Adobe do I need for the Japanese OCR?
Update:
After more playing around I found the "text viewer" function, which works fine for English but seemingly not at all for Japanese. Same for the "Export to .txt" option. Where did you see that it had Asian language support?
Last edited by tomusan (2008 April 27, 11:26 am)
Saw it somewhere on their website... there are separate things you have to download for Asian language support, but like I said, I'm not sure how well they work. Let me see if I can find some JP PDFs and see if I can get them to work.
EDIT: No, it only seems to recognize English. Ah, crud. Sorry about that.
Last edited by rich_f (2008 April 27, 2:14 pm)
tomusan wrote:
Which version of Adobe do I need for the Japanese OCR?
You need Acrobat Pro. It's expensive and probably not ideal, but you don't need a special Japanese version.
Alright, I got hold of Acrobat Professional, and once again I've been reminded of my hatred for pdf files. I'm trying to use one of the PDF files from japanesepod. First thing I tried was to OCR just one page.
Met with the message "Acrobat could not perform operation, page contains renderable text"
I'll admit I don't know much about PDF as a format, so I went to Google, and this message apparently means that OCR is unnecessary as there is already text there. I didn't quite understand that but accepted it anyway, and next tried the plain "export" function, choosing export to Word file.
Met with the message -
"Acrobat was unable to make this document accessible because of the following error
Bad PDF; Could not read page structure [95] <font error: cannot find CMap resource file> [1]"
No idea what that means. The last option I tried was exporting a page to .jpg and then running OCR on the image. That sort of worked, but the exported jpg looked horrible and the OCR was hit and miss at best, often turning hiragana into kanji.
Anyone have any experience with Acrobat that can help me out?
Update: Tried another file, a page from UBJG, saved to bitmap. Tried to run OCR.
Met with message "This page is larger than the maximum 45x45 inches." - wtf. "It's too big" counts as a reason for an error now?
Update2: Resized the picture and the OCR worked perfectly. Good news I guess, however I don't think I can go through the hassle of converting every page (several hundred) to an image file, resizing it, then OCRing it. There must be a way to get the text extraction to work directly from PDF.
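For what it's worth, the 45x45 inch limit is about the image's physical size, which is its pixel dimensions divided by its DPI. Here is a rough sketch of the arithmetic in Python, assuming the limit applies to each side separately. The function names and the sample numbers are made up for illustration.

```python
MAX_INCHES = 45.0  # limit quoted in the Acrobat error message

def page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by pixel dimensions and resolution."""
    return width_px / dpi, height_px / dpi

def scale_to_fit(width_px, height_px, dpi, max_inches=MAX_INCHES):
    """Scale factor (at most 1.0) that brings both sides within max_inches."""
    width_in, height_in = page_size_inches(width_px, height_px, dpi)
    longest = max(width_in, height_in)
    if longest <= max_inches:
        return 1.0  # already small enough, no resize needed
    return max_inches / longest

# Hypothetical example: a 3600x4800 pixel bitmap tagged as 72 DPI
w, h, dpi = 3600, 4800, 72
print(page_size_inches(w, h, dpi))  # width is 50 inches, height about 66.7
print(scale_to_fit(w, h, dpi))      # roughly 0.675
```

So a bitmap exported at a low DPI setting can come out as a "huge" page even when the pixel count is modest, which would explain why shrinking it made the error go away.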
Last edited by tomusan (2008 April 28, 8:03 am)
Tomusan,
Seems to be a limitation in the JPod101 PDF files. This is from the Jpod101 forum dated 06 April 2008
............ this is due to a technical limitation having to do with the way fonts are embedded in our dynamically generated PDF documents. Currently only a subset of the original font, containing only the glyphs used, is embedded in the output document. This embedded font contains only the minimum data needed to be embedded in a PDF document, and does not contain any codepage information. The PDF document contains indexes to the glyphs in the font instead of to encoded characters. While the document will be displayed correctly, the net effect of this is that searching, indexing, and cut-and-paste will not work properly.
We are searching for ways to overcome this, but at the moment have not found a solution that would work without making the size of the PDF files HUGE! One way to overcome this is to copy the text from our Line-By-Line transcripts in the Learning Center if you're a Premium member.
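To illustrate the problem described in that quote, here is a toy model in Python. It is not how a PDF is actually stored; the dictionaries and names are invented just to show why a page can display correctly while copy-and-paste returns junk when the glyph-to-character mapping is missing.

```python
# Toy model: a subset font holds only the glyph shapes the document uses,
# numbered in the order they were first needed.
text = "日本語の本"

glyph_ids = {}  # character -> glyph index in the subset font
for ch in text:
    glyph_ids.setdefault(ch, len(glyph_ids) + 1)

# The page content stores glyph indexes, not character codes.
page_content = [glyph_ids[ch] for ch in text]
print(page_content)  # [1, 2, 3, 4, 2]

# A viewer can draw glyph 1, 2, 3... so the page looks right.
# Extraction needs a reverse map from glyph index back to a character.
to_unicode = {gid: ch for ch, gid in glyph_ids.items()}

def extract(content, mapping=None):
    """Return text if a glyph-to-character map exists, else placeholders."""
    if mapping is None:
        return "".join("\ufffd" for _ in content)
    return "".join(mapping[gid] for gid in content)

print(extract(page_content, to_unicode))  # 日本語の本
print(extract(page_content))              # five replacement characters
```

When that reverse map is left out of the file, a text extractor has nothing to work from, and running OCR on the rendered page is about the only way to get the text back.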
From your previous description it sounded like you were trying to OCR scanned books. For PDF files that are actually encoded as text, don't use OCR. You just need to find a better program to extract the text. I don't use Windows, so I can't advise you on this.
Last edited by mystes (2008 April 28, 11:20 am)
Tourne - that explains a lot, thank you!
Still having problems, however, albeit different ones. Ran a different PDF through the exporter.
Met with the message "Acrobat was able to make the document accessible but found the following oddities. Some difficult pages were encountered requiring all graphics on those pages to be labeled as figures"
Clicking OK on this then brings up a new error message. (Although one I was getting before also)
"Save as failed to process this document. No File was created"
Then tried another way, ticking the option to include all image files. An hour later I had an 834MB Word document, which I was happy with until I tried to open it and Word told me it can't open it as it's bigger than 512MB. At which point I sighed and went to bed, to try again in the morning by splitting it into two separate PDFs first. Unless someone has a better idea?
Tried again this morning with the file split up. Letting Acrobat extract pictures as well just allowed it to make a jpg of every page and put it in a Word document - not so useful. Any more ideas? Currently the only option is to make every page into a jpg and run OCR on it. Which, although preferable to typing, is still going to take a while.
Last edited by tomusan (2008 April 28, 8:54 pm)
Thanks for the link, will have a look. Although I am thankful to say I finally managed to get Acrobat to just OCR the PDF, and it seems to be working so far. It doesn't seem to like doing several hundred pages at once though. Regardless, it is going to be a big timesaver for me.
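Since running several hundred pages at once seems to be the sticking point, one option is to work through the document in fixed-size page ranges. A small Python sketch that only computes the ranges; the batch size of 50 is an arbitrary guess, not a known Acrobat limit.

```python
def page_batches(total_pages, batch_size=50):
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]

# Hypothetical 320-page document
for first, last in page_batches(320, 50):
    print(f"OCR pages {first}-{last}")
```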
Soon I hope to resume my Japanese study and plan to use a method similar to AJATT. However, typing all those sentences in is going to be a real pain in the neck, and takes time away from revision, which is how you really learn stuff. If I could OCR stuff that would be great, but it seems to be a source of a lot of frustration. Has anyone found a program that does a reasonable job of OCRing mixed English/Japanese documents?
If I could get an accuracy of 95% or better that would be good, but that is still a hell of a lot of errors to correct. Although this plan might be sacrilegious to some, I was thinking of getting some books, cutting the spine off, and then scanning the whole book in one shot using the document feeder on my scanner at work. This would produce a PDF or series of images which I would then OCR. Then I would put the pages from the book in a binder, and as I worked through the book, I would check the sentences and add them into the SRS.
Don't know how much time I would save, but if I could reduce the time to enter each sentence by 25% that would add up to a huge time saving over the course of studying a whole book.
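To put rough numbers on that, here is a back-of-the-envelope calculation in Python. The sentence length, sentence count and seconds per sentence are invented for illustration, and it assumes every character is misread independently, which real OCR errors are not.

```python
def expected_errors(chars, accuracy):
    """Expected number of misread characters."""
    return chars * (1.0 - accuracy)

def clean_sentence_chance(chars, accuracy):
    """Chance that a sentence of `chars` characters has no errors at all."""
    return accuracy ** chars

def hours_saved(sentences, seconds_each, reduction):
    """Total hours saved if entry time per sentence drops by `reduction`."""
    return sentences * seconds_each * reduction / 3600.0

# Hypothetical: 20-character sentences at 95% per-character accuracy
print(expected_errors(20, 0.95))                  # about 1 error per sentence
print(round(clean_sentence_chance(20, 0.95), 2))  # 0.36

# Hypothetical: 2000 sentences at 60 seconds each, entered 25% faster
print(round(hours_saved(2000, 60, 0.25), 1))      # 8.3
```

Under those assumptions only about a third of the sentences would come through clean at 95%, so every sentence still needs a check against the page.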
Anyway let me know your experiences with OCR of combined Japanese English text. I did read that this is pretty problematic for a lot of software. Thanks.
For good quality OCR you should probably use Japanese software, not Western software that also supports Japanese in some way.
Example of Japanese OCR software: Yomitori kakumei Ver. 12 (Recognition revolution), made by Panasonic.
Homepage: http://panasonic.co.jp/pss/pstc/product … index.html
10 day trial version: http://eww.gp-club.panasonic.co.jp/data … 200TRY.exe
Should be good if you can handle the Japanese interface.
Last edited by kanjapan (2008 June 04, 6:53 am)
What I've been doing lately (for the JPOD PDFs at least) is to download them, convert them to jpg using Adobe, then remake a PDF using the jpgs and run OCR. Works well most of the time. Really useful for longer lessons.
tomusan wrote:
What I've been doing lately (for the JPOD PDFs at least) is to download them, convert them to jpg using Adobe, then remake a PDF using the jpgs and run OCR. Works well most of the time. Really useful for longer lessons.
Thanks for the information tomusan. So are you still using the OCR built into Acrobat? Or another product that works with PDF files? What % accuracy do you get?
I use the Asian version of Readiris Pro. I haven't had a problem unless there is furigana or something similar. I don't have a percentage though, as I don't have any novels or anything that text-heavy to easily get an accurate figure.
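If anyone wants an actual percentage, one rough way is to type out a short passage by hand and compare it with what the OCR produced. Here is a sketch using Python's standard difflib; the two sample strings are made up, and the score is only an approximation of character accuracy, not a proper edit-distance measure.

```python
from difflib import SequenceMatcher

def char_accuracy(reference, ocr_output):
    """Share of reference characters that also appear, in order, in the OCR output."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)

reference = "私は学生です"   # hand-typed from the book
ocr_output = "私は字生です"  # hypothetical OCR result, 学 misread as 字
print(f"{char_accuracy(reference, ocr_output):.1%}")  # 83.3%
```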
Last edited by cracky (2008 June 05, 6:17 am)

