
Japanese OCR Software

#1
Rather than necropost or tack on to the IRIS Pen topic (which was wandering way off topic), I figured I'd start a new thread. After all of that discussion, I've decided to start looking for new Japanese OCR software (and I know some of you are as well), so here's what I've found so far. I'm kind of tired of trying to get ReadIris Asian to play nice with things like underlined text and mixed JP and EN text, and its inability to learn kanji drives me nuts, so...

First, there's 読んde!!ココ v.13. (Windows only.) Here's the basic info:

Main Page:
http://ai2you.com/ocr/

More Details:
http://ai2you.com/ocr/product/koko13/workings.asp

Free Trial:
http://ai2you.com/ocr/product/koko13/trial01.asp

Buy it here:
http://ai2you.com/shopai2you/ocr/koko13.asp

Works with TWAIN and WIA scanners, and supports Windows all the way up to Win 7 64-bit.

It claims to handle smudged kanji and underlined words, and has a learning mode. Plus, it has a bunch of built-in dictionaries to help with recognition. It also claims to handle kana, kanji, and alphanumeric text on the same page, something that ReadIris choked on frequently when I used it.

If it does what it claims to, then it would be a heck of a lot better than anything IRIS puts out, for a lot cheaper. ~13,000 yen for the full download version. 20,000 yen if you want a box. Interface is all Japanese.


The other software I'm looking at is e.Typist (Windows only, supports Mac via Boot Camp... I think. It's vague about Mac support.) :

Main Page-- details along the sidebar links:
http://mediadrive.jp/products/et/index.html

Try the Eval version here:
http://mediadrive.jp/products/et/index8.html

Buy the Download version for cheap here:
http://shop.mediadrive.jp/item_list.html...ge&request=

Looks pretty similar to 読んde!!ココ, feature-wise, with a few notable exceptions. First, you can buy the Neo edition, which only does EN and JP, for 9,800 yen (download), or the standard edition, which does a bunch of languages, for ~13,000 yen (download). If you buy at a store, expect to pay 13,000/20,000 yen. The download discounts are nice here, just like with 読んde!!ココ. The Neo edition is nice if you don't care about other languages.

Otherwise, it seems to do just about everything that 読んde!!ココ does, with a few exceptions. First, it has a "preview mode" where it superimposes what it thinks it sees over the text it scans, so you can correct it. Also, it doesn't say whether it supports WIA scanners. It's vague about that. It says it supports Win 7 64, but it's kind of sketchy about which scanners it supports. I guess I'll try the eval version first to see if it likes my Brother MFC 7840.

Both handle image files, PDFs, scans, photos, and various input devices, and will output to txt, rtf, excel, word, etc., with some variations between the two. Check the websites to see if your flavors are supported.

Both have large dictionaries, and it looks like both support learning modes for Japanese, which ReadIris does not.

And if you want Free Japanese OCR, there's this thread here:
http://forum.koohii.com/showthread.php?tid=2480
#2
Another one that I often use, just because I have the software, is Adobe Acrobat Pro (both version 8 and 9 can do it). It works well, but if the document is mixed Japanese/English, it won't pick up on the spaces between words in the English prose.
#3
Yeah, as part of their Black Friday, I just picked up a cheap light portable Canon flatbed at Newegg. They have one for $49, reduced from $56, I think. My Brother MFC is great for basic home use, but I'm starting to travel some again, and I wanted something light enough to carry with me. (And I'm sick of the pen.)

My old Canon won't do Win7. 10 years old, and it still works just fine, it's just that there are no drivers available for it anymore.

Due to the JLPT, I haven't had time to mess with the OCR software yet. It's on my pile of "Stuff to do on Dec. 6th..."
#4
Necropost, but on-topic. Just started using the e.Typist 13 demo to scan some of my Kanzen Master questions into Anki, and wow. It's awesome.

It is not for the Japanese newbie, as the whole thing is in Japanese. But it does such a great job of scanning Japanese text that it just blows the IRIS products' doors off. It's not even close.

I haven't tried underlined text yet, but that's definitely on my list.

It comes with tutorials that will guide you through the scanning process as well.

A few really neat features: After you OCR, it puts the OCR'd text right next to the scan, and it marches both selections down the screen as you double-check one with the other. When you put the cursor on the text to make modifications, it highlights the original scan in yellow to show where you are in the original scan.

It also has a dictionary to check the scanned text to make sure it's all proper Japanese, and will mark anything weird in red, and offer autosuggestions as you move the cursor next to the red characters for fast fixes.

It will save as a plaintext file in Shift-JIS (default), or in UTF-8 if that's your preference, or in a ton of different other formats, like pdf, word, excel, csv, etc.
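If you saved with the Shift-JIS default and later need UTF-8 (Anki's text import wants UTF-8, as far as I know), you can re-encode the file afterwards instead of rescanning. Here's a minimal Python sketch. It assumes the "Shift-JIS" output is really Windows code page 932 (cp932), which is a guess on my part and not something confirmed for e.Typist; the function name and file names are made up for illustration.

```python
from pathlib import Path


def sjis_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a Shift-JIS (cp932) text file as UTF-8."""
    # ASSUMPTION: the file is cp932, Microsoft's superset of Shift-JIS.
    # decode() raises UnicodeDecodeError if the bytes are not valid cp932,
    # which is a useful sign that the file was saved in another encoding.
    text = src.read_bytes().decode("cp932")
    # Working on raw bytes keeps the original line endings untouched.
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names: a scan saved by the OCR program,
    # and the UTF-8 copy to import elsewhere.
    source = Path("scan_sjis.txt")
    if source.exists():
        sjis_to_utf8(source, Path("scan_utf8.txt"))
```

If the decode step throws an error, the file probably wasn't saved as Shift-JIS in the first place, so check the save dialog before blaming the script.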

One annoying bit-- the "30 day trial" starts as soon as you download the software. I downloaded it 25 days ago, and only got around to installing it today, so I only get 5 days to try it. That sucks, to be honest.

Also, some of the pictograms on the menu bar come out with ???s instead of Japanese in Win7, even with proper Japanese support enabled. Not sure why, but it's not a big deal, just a minor annoyance for a few icons. The drop-down menus all work just fine. It's just the ones with the pictures.
#5
Good to know. I think I started to use it, then found a bunch of documents scanned with eTypist with a lot of errors and decided not to bother. Maybe earlier versions weren't as good, and/or the scanner didn't do a good job selecting parameters and such.
Edited: 2011-05-18, 10:46 am
#6
After using it for a few hours last night, I found that it does have some quirks (all software does), but for accuracy, it blows anything you can get in the US out of the water.

Most of the problems I had were with scanning irregularly-formatted JLPT practice books, where the answer sets have too much whitespace around them. It wanted to OCR everything in weird chunks. The way around it is to read the fine manual and figure out how to make it do what you want it to do. Not optimal, but not terrible, either.

The ReadIRIS ProAsian software I had had a nasty habit of randomly barfing all over my scans, or not recognizing strings of kana, or not recognizing certain kanji, no matter what I did. The difference is like 99% to 90%, but that 9% is work.
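To put rough numbers on that gap: the 99% and 90% are ballpark figures, not measurements, and the 1,000-character page below is just an assumed size for illustration. A quick sketch of how many characters you'd expect to fix by hand per page:

```python
def expected_fixes(chars_on_page: int, accuracy_pct: int) -> int:
    """Characters you would expect to correct by hand on one page.

    accuracy_pct is a whole-number percentage, e.g. 99 for 99%.
    Integer math is used so the result is exact.
    """
    return chars_on_page * (100 - accuracy_pct) // 100


# ASSUMPTION: a page of about 1,000 characters.
page = 1000
for pct in (99, 90):
    print(f"{pct}% accurate: about {expected_fixes(page, pct)} fixes per page")
```

Under those assumptions that works out to about 10 fixes per page versus about 100, so the 9-point gap means roughly ten times as much cleanup.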

The only thing I can't get e.Typist to recognize well is a sentence with a ( )。 at the end. It can't handle the )。 as a solitary clump of text just floating out on the page, so it freaks out. The spacing puts it far enough away from the sentence that the OCR is forced to scan it as a separate thing, and then the OCR just barfs. Not a dealbreaker, since it's easily fixable in a text editor. Takes a minute. Just annoying.

I also figured out how to deal with sentences with a _____ in them for filling in answers. (I haven't found a way to get the program to OCR the _____ as part of the data yet.) There's a setting that will keep all of the spacing in the original: look for 空白出力 in the pop-up menu and select Y. It literally means "whitespace output," so I'm guessing it preserves whitespace.
#7
I can't stop thinking 'OverClock Remix' when I see OCR. @_@
#8
Found another neat thing e.Typist 13 does, and a way around some of the text selection problems.

First, if you have a page that's scanned crooked, it will sort it out in the software when you bring it in to e.Typist. So if it's on the flatbed at a 10 degree angle, the program will auto-rotate it those 10 degrees to square it up.

Also, if you scan 2 pages at once so that it looks rotated on the flatbed, it will auto-rotate it 90 degrees, so you can save some time with multiple-page scans.

To get around some of the text-selection issues, I found that scanning a bigger chunk of text and selecting more "stuff" will *sometimes* get around the auto-selection issues. But sometimes it won't. Meh.

The only real alternative right now is to go through and just select each sentence manually and order things the way you want them to show up in the final text file. Kind of a pain in the butt.

EDIT: I've been noticing that using the software over a long period of time has started to cause problems with weird artifacts on the text display that shows up in the software when you're in the OCR editing mode. At first, it only showed up on the first page. Now it shows up on both pages when I do a 2-page scan. It makes it hard to proofread. It's like looking at a really crappy photocopy. The scan is pristine... the OCR'd version looks like something that has been copied about 30 times, and then walked on by chickens with dirty feet. But when I save it as a .txt file, it's fine for the most part.

EDIT2 for this bit: did all the rest of the tutorials, and learned how to fix the hard-to-read bits. View Menu, second item from the top. Or just hit ctrl-B. That toggles the "view scan underneath text" mode, and removes all of the noise problems I was having with the OCR'd text. No need to have it if I'm going to read the scan side-by-side with the text anyway.

Doing multilingual scanning is kind of hit-and-miss. It's much better than IRIS, but that's not really saying a whole lot, since IRIS doesn't even support it. The multilingual mode will supposedly catch mixes of Japanese and other languages... but yeah... hit-and-miss.

Scanning white text on black backgrounds is mostly a miss, but IRIS also freaks out when it sees that.

Accuracy in general has dropped a bit, but that's because I have been scanning a lot of stuff. It's still high enough for my needs, and the workflow is pretty smooth. It's not perfect. I'm not sure just how much time I'm saving over typing. But then again, I hate copying stuff from books, so I'm biased.

When the eval is up later this week, I'll try the Yondeココ! (or whatever it's called) demo.

EDIT 2: Do all of the tutorials if you get this program. You really will catch a lot of somewhat "hidden" stuff there. It's easier to learn the interface that way, and it only takes about an hour or so. (Plus, 日本語レベルアップ!)

After learning how to tweak stuff, accuracy is better. More importantly, I've figured out what some of the buttons do, and how to clean up the scans so they OCR better. Multi-lang mode is still kind of hit and miss. Sometimes さ comes out as x. Whyyyy?
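For misreads like さ turning into x, one way to catch them faster while proofreading is to flag any single Latin letter that's wedged between two Japanese characters. This is a rough sketch, not anything built into e.Typist. The function name is made up, the character ranges only cover hiragana, katakana, and the main kanji block, and it won't catch full-width letters like ｘ.

```python
import re

# Hiragana + katakana (U+3040-U+30FF) and the main CJK ideograph block.
JP = r"\u3040-\u30ff\u4e00-\u9fff"

# One ASCII letter with a Japanese character directly on both sides.
STRAY_LATIN = re.compile(rf"(?<=[{JP}])[A-Za-z](?=[{JP}])")


def find_stray_latin(text: str) -> list[tuple[int, str]]:
    """Return (index, letter) for each suspicious lone Latin letter."""
    return [(m.start(), m.group()) for m in STRAY_LATIN.finditer(text)]


if __name__ == "__main__":
    # The second entry simulates the kind of misread described above.
    for line in ["うるさい", "うるxい", "JLPTの文法"]:
        print(line, find_stray_latin(line))
```

It only flags candidates for a human to look at. Legitimate text like このTシャツ will also get flagged, so don't auto-replace anything based on it.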
Edited: 2011-05-19, 12:34 am
#9
So, a few last things about e.Typist: after you scan in a page, you'll get a window with your scan, and in the upper left corner, there's a drop-down box, set to 範囲設定. This is where you select chunks of text for OCR. BUT, if you click that box, and select the other option, 画像修正, which is for image editing, you can remove/edit anything problematic on the page you just scanned-- you know, stuff like weird formatting, or "interesting" graphics that always screw up OCR.

Looking at the toolbar, you'll have the drop-down box, a rectangular selection box, a series of circles (this sets brush size), a ruler, and a yellow chunk of tofu that's actually an eraser. Of those 4 tools, I use all but the ruler. I have no idea what the ruler does. Probably something amazing.

Here's a practical application: I've been scanning KM3 to get the questions to dump into my Anki deck as a sort of general review of stuff I should know. Well, on each page, the main heading is put in a box with a drop-shadow. That's no good. It won't OCR properly. So I grab the tofu/eraser, click on the circles to set its size to 15, because that's what I'm comfortable with, and erase the line art around the heading. Then if there are any weird graphics that are in tight places that don't look like words, I erase those, too.

Then I click on the box tool, and select the graphics I don't want, press DEL, and they're gone. Easy.

Sometimes, text in KM3 is not lined up horizontally, for whatever reason. This totally freaks out the OCR engine. So I select the non-aligned text with the box, then drag it to the nearest line. Then I can use the arrow keys to fine-tune the alignment.

All of this takes less than 2 minutes, by the way.

There are a bunch of other tools there, but the pop-up text is broken, the images aren't intuitive to me, and I don't feel like looking them up. And to be honest, I can get plenty done with just these tools.

Clearing up the image before OCR like this makes selecting text MUCH easier. I don't have to go manually selecting 50 lines of text now-- the program will do it automatically in large chunks.

I have noticed that every now and then, I will get what I call "furigana fail." That's when the OCR mushes the furigana into the kanji, and creates new and annoying kanji, instead of giving me a line of furigana and a line of kanji. It happens much more often in the IRIS software than in e.Typist, but it does happen in e.T about 4-5% of the time. It could be a scanner setting problem, or that I'm using a $50 scanner, or that the book's formatting is weird, or just that OCR is hard to do.

EDIT: I have finally figured out how to lessen "furigana fail" quite a bit, and I figured out the main reason why it happens in e.T. The main cause comes from scanning 2 pages at once-- the auto-rotate can't compensate for the slight skew I get when I smush the book under the scanner lid, and since it sees the 2 pages as one page, it just sort of gives up on adjusting the initial scan after a certain point. One page will be slightly rotated sometimes, while the other page will be nice and perpendicular.

When it OCRs, it needs parallel lines of text that don't creep up above perfectly horizontal, even more so with furigana. If it's out of horizontal, then I get the OCR failures.

There's a way to get around this. If you look at the second row of buttons on the scanned image, the 4th button from the left (that looks like 2 arrows chasing each other) is the one you want. Go into 画像修正 mode, then select 手動補正 (M) from the drop-down menu underneath the chasing arrows, and draw a nice, long, horizontal line following the angle of the slanted text you want to straighten out, then hit the "do it" button. (Whatever that is in Japanese.) OCR that page, save the text, then go back and fix the other page, OCR it, and save the text again. Or, just don't slant the line all the way to fix it completely, and OCR it once and just fix 1 or 2 errors.

Or just scan a page at a time, I guess.

It depends on how much time you want to spend on this, and how slow your scanner is.
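On the manual straightening step: the reason a long line helps is plain geometry. The sketch below is not e.Typist's actual code, just the standard way to get a skew angle from the two endpoints of a drawn line. The function name and the pixel coordinates are made up for illustration.

```python
import math


def skew_angle_degrees(x1: float, y1: float, x2: float, y2: float) -> float:
    """Angle of the line (x1, y1) -> (x2, y2) relative to horizontal.

    Uses image coordinates as given; whether a positive angle means
    clockwise depends on whether y grows downward in the image.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


if __name__ == "__main__":
    # The same 1-pixel slip at the end of a short and a long line.
    short = skew_angle_degrees(0, 0, 100, 1)
    long_ = skew_angle_degrees(0, 0, 1000, 1)
    print(f"100 px line:  {short:.3f} degrees off")
    print(f"1000 px line: {long_:.3f} degrees off")
```

A 1-pixel wobble at the end of a 100-pixel line throws the angle off by roughly 0.57 degrees, while the same wobble on a 1,000-pixel line is only about 0.06 degrees. So drawing the line across as much of the text as possible gives a much more accurate correction.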

I'm also finding that OCR is pretty hit and miss with underlined text, but I can't say I'm too surprised. Mostly miss.
Edited: 2011-05-22, 4:05 pm
#10
Sorry to bother you people, but I have absolutely no experience with Japanese. Would it be possible for someone to post a picture of the GUI with translations? Or suggest some other program that is free and pretty good at OCR?