
Capture2Text - Japanese OCR Utility

#76
I select the image and get the correct kanji in the upper corner.
When I paste however, it often shows me the kanji of the previous selection. (exactly one before.)
Any idea or solution?

cheers
Reply
#77
Filip Wrote:I select the image and get the correct kanji in the upper corner.
When I paste however, it often shows me the kanji of the previous selection. (exactly one before.)
Any idea or solution?

cheers
The OCR isn't instantaneous - it depends on the size of the selection and the speed of your PC. Wait a second or two before pasting.
Reply
#78
This is pretty handy. :D
Reply
#79
I have just posted version 2.1 of Capture2Text.

Download Capture2Text v2.1 via SourceForge (source code is included)

What Changed?

● Added command line options. From the readme:
Code:
You may OCR the screen via command line by calling Capture2Text in this format:

Capture2Text.exe x1 y1 x2 y2 [output_file]

  Required arguments:
    x1 - X1-Coordinate of the screen
    y1 - Y1-Coordinate of the screen
    x2 - X2-Coordinate of the screen
    y2 - Y2-Coordinate of the screen

  Optional arguments:
    output_file - The OCR'd text will be written to this file if specified.

Capture2Text will read settings.ini to determine settings such as OCR language
and output options (clipboard, popup, etc.).

Examples:
  Capture2Text.exe 10 152 47 321 output.txt
  Capture2Text.exe 10 152 47 321
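
The call format above is easy to drive from a script. Here is a minimal Python sketch, assuming Capture2Text.exe is reachable from the current directory; the helper names `build_capture_command` and `run_capture` are made up for illustration, and only the documented argument order (x1 y1 x2 y2 [output_file]) is used.

```python
import subprocess


def build_capture_command(x1, y1, x2, y2, output_file=None, exe="Capture2Text.exe"):
    """Build the argument list for: Capture2Text.exe x1 y1 x2 y2 [output_file]."""
    # int() rejects anything that is not a whole-number screen coordinate.
    cmd = [exe] + [str(int(c)) for c in (x1, y1, x2, y2)]
    if output_file is not None:
        cmd.append(output_file)
    return cmd


def run_capture(x1, y1, x2, y2, output_file=None, exe="Capture2Text.exe"):
    """Run the OCR. Needs Windows and a real Capture2Text.exe at `exe` (assumption)."""
    return subprocess.run(build_capture_command(x1, y1, x2, y2, output_file, exe))


# Same call as the first readme example; the command is only printed here, not executed.
print(build_capture_command(10, 152, 47, 321, "output.txt"))
```

Language and output options still come from settings.ini, as described above, so the script only has to supply the coordinates.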
● Added the "substitutions" feature. From the readme:
Code:
Sometimes Capture2Text consistently makes the same OCR mistakes such as
recognizing an "M" as "I\/|".

By editing the substitutions.txt file in the Capture2Text directory, you may
tell Capture2Text to substitute one text string for another text string.

Just find the appropriate language section and add one substitution
per line in this format:
  from_text = to_text

Example (adding 2 substitutions to the English section):
  English:
    I\/| = M
    >< = X

To create a substitution regardless of language, add the substitution to
the "All:" section.

Special tokens and escape characters:
  %A_Space% = Single space character
  %A_Tab%   = Single tab character
  %Equal%   = Equals (=)
  `,        = Comma (,)
  `%        = Percent sign (%)
  ``        = Backtick (`)
  `n        = Single linefeed character (\n)
  `r        = Single carriage return character (\r)

You may disable a substitution by adding a "#" in front.
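
To make the substitution format described above concrete, here is a rough Python sketch of how such a file could be read and applied. It is not Capture2Text's actual code. The function names, the rule that a line ending in ":" starts a section, splitting on the first "=", and applying the "All:" rules before the language rules are all assumptions made for the example.

```python
ESCAPES = {",": ",", "%": "%", "`": "`", "n": "\n", "r": "\r"}
TOKENS = {"%A_Space%": " ", "%A_Tab%": "\t", "%Equal%": "="}


def decode_tokens(s):
    """Expand the special tokens and backtick escapes listed in the readme."""
    out, i = [], 0
    while i < len(s):
        # Backtick escapes are checked first so that `% never starts a token.
        if s[i] == "`" and i + 1 < len(s) and s[i + 1] in ESCAPES:
            out.append(ESCAPES[s[i + 1]])
            i += 2
            continue
        for token, value in TOKENS.items():
            if s.startswith(token, i):
                out.append(value)
                i += len(token)
                break
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_substitutions(text):
    """Return {section: [(from_text, to_text), ...]} for all enabled lines."""
    rules, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # blank or disabled line
            continue
        if line.endswith(":") and "=" not in line:  # assumed section header
            section = line[:-1]
            rules.setdefault(section, [])
            continue
        if section is None or "=" not in line:
            continue
        source, target = line.split("=", 1)
        rules[section].append((decode_tokens(source.strip()), decode_tokens(target.strip())))
    return rules


def apply_substitutions(text, rules, language):
    """Apply the 'All' rules, then the rules for one language (assumed order)."""
    for source, target in rules.get("All", []) + rules.get(language, []):
        if source:
            text = text.replace(source, target)
    return text


sample = r"""
English:
  I\/| = M
  >< = X
"""
print(apply_substitutions(r"I\/|A><", parse_substitutions(sample), "English"))  # MAX
```

Whitespace around each side of the "=" is trimmed in this sketch, which is why a token such as %A_Space% is needed to express a leading or trailing space.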
 
 
 
  
  
cb4960
Edited: 2012-10-07, 3:23 pm
Reply
#80
I have just posted version 2.2 of Capture2Text.

Download Capture2Text v2.2 via SourceForge (source code is included)

What Changed?

● Upgraded to Tesseract v3.02.02 (see http://code.google.com/p/tesseract-ocr/w...leaseNotes for the changelist).

● Simplified the special tokens used in the substitution feature a bit and fixed a whitespace bug.

Things of limited interest to Japanese learners:

● Added a whitelist option for Tesseract dictionaries. It allows you to limit the characters that Capture2Text can recognize, for example to digits only.

● Added support for more languages. The complete list:
Afrikaans, Albanian, Ancient Greek, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Cherokee, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, Frankish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovakian, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Vietnamese.
Reply
#81
I have just posted version 2.3 of Capture2Text.

Download Capture2Text v2.3 via SourceForge (source code is included)

What Changed?

● When using the Japanese (Tesseract) dictionary, revert to Tesseract v3.01. It is MUCH more accurate than v3.02.02, both with vertical text and particularly with horizontal text.

● Added option to remove the capture box before a preview OCR. This is more accurate, particularly with NHocr, but causes the capture box to flicker. Disabled by default but you should probably enable it if you use the preview feature and the Japanese (NHocr) dictionary.

● Fixed bug that caused the text direction (horizontal/vertical) to be ignored for Chinese/Japanese. The bug was introduced in the previous release.

● Now passing a .ppm image to NHocr instead of a .pgm image to better handle non-grayscale captures. You might see slightly better accuracy with the Japanese (NHocr) dictionary now.

● Fixed bug that caused the capture box to stick around after it was supposed to be removed.

● Changed the snapshot enlargement from 300% to 320% to meet Tesseract's minimum recommended DPI.

● Increased update rate of the capture box to make it appear more fluid.
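
The move from 300% to 320% can be checked with quick arithmetic. The sketch below assumes a screen at the Windows default of 96 DPI and a recommended minimum of 300 DPI for Tesseract; neither number is stated in the release notes above, so treat both as assumptions.

```python
SCREEN_DPI = 96        # assumption: default Windows display DPI
RECOMMENDED_DPI = 300  # assumption: commonly cited Tesseract minimum


def effective_dpi(scale_percent, screen_dpi=SCREEN_DPI):
    """DPI of a screen capture after enlarging it by scale_percent."""
    return screen_dpi * scale_percent / 100


for scale in (300, 320):
    dpi = effective_dpi(scale)
    print(f"{scale}% -> {dpi:.1f} DPI, meets minimum: {dpi >= RECOMMENDED_DPI}")
```

Under those assumptions, 300% gives 288 DPI, just short of the minimum, while 320% gives about 307 DPI.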
Reply
#82
This seems like a pretty amazing tool. Unfortunately, I am totally unable to use it. Whatever language I'm using and whatever text I select, the preview box stays blank, and nothing is copied to my clipboard.

Does anyone have any idea what could be going wrong? I'd really love to be able to use this program, but I'm clueless about how to get it working for me.
Reply
#83
honeybunch Wrote:This seems like a pretty amazing tool. Unfortunately, I am totally unable to use it. Whatever language I'm using and whatever text I select, the preview box stays blank, and nothing is copied to my clipboard.

Does anyone have any idea what could be going wrong? I'd really love to be able to use this program, but I'm clueless about how to get it working for me.
From the readme:

Quote:1) Unzip the contents of the zip file. Make sure that there are no Asian or
other non-ASCII characters in the path where you unzipped it. Also, if you
are on Windows 7, don't unzip it to the Program Files directory (this will
avoid issues related to write privileges).
...
If you did that, try unzipping the program to a very simple directory such as c:\temp and see what happens.

Also, it would be helpful if you wrote what operating system you are using (XP, Vista, 7, or 8).
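
The two path conditions from the readme quote can be checked mechanically. This is a small Python sketch with a made-up helper name, `check_install_path`; it tests only those two conditions and will not catch every path that causes trouble.

```python
def check_install_path(path):
    """Return a list of readme-based warnings for an unzip location."""
    problems = []
    if not path.isascii():
        problems.append("path contains non-ASCII characters")
    # The readme's Program Files warning is specific to Windows 7.
    if "program files" in path.lower():
        problems.append("path is under Program Files (possible write-privilege issues)")
    return problems


print(check_install_path(r"c:\temp"))  # []
print(check_install_path("C:\\ダウンロード\\Capture2Text"))
```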
Reply
#84
cb4960 Wrote:If you did that, try unzipping the program to a very simple directory such as c:\temp and see what happens.
I thought I did that. I had unzipped it to "C:\Documents and Settings\Michael\Desktop\Capture2Text_v2.4", but I guess something in that file path screwed it up. I just unzipped it again to "C:\temp", as you suggested, and now it works great. Guess I'll just keep it there and make a shortcut.

Also, sorry for not mentioning my operating system. I'm usually better about giving information like that if I have to ask for help. I'm on XP Professional with Service Pack 3, if you still want to know.

Anyhow, thanks for your help, and for making this program. It's great.
Reply
#85
honeybunch Wrote:
cb4960 Wrote:If you did that, try unzipping the program to a very simple directory such as c:\temp and see what happens.
I thought I did that. I had unzipped it to "C:\Documents and Settings\Michael\Desktop\Capture2Text_v2.4", but I guess something in that file path screwed it up. I just unzipped it again to "C:\temp", as you suggested, and now it works great. Guess I'll just keep it there and make a shortcut.

Also, sorry for not mentioning my operating system. I'm usually better about giving information like that if I have to ask for help. I'm on XP Professional with Service Pack 3, if you still want to know.

Anyhow, thanks for your help, and for making this program. It's great.
I'm glad that you got it to work. I'll boot up XP and investigate why it doesn't like that path.
Reply
#86
There are a whole bunch of ways to handle Japanese OCR, both on Mac and Windows. The end of this post on my blog goes over a bunch of ways to do it (in the context of getting the text of lyrics for your Japanese songs), from some of the free ways mentioned here to the top-of-the-line $200 Adobe software.
Reply
#87
Sadly, there is no Linux support.
Has anybody tried this with AutoHotkey and Wine?
(http://appdb.winehq.org/objectManager.ph...ngId=44317)
I couldn't get AutoHotkey to work on my first try, and had no time for more testing.

Sadly, the links to bombpersons' script are offline.
Reply
#88
I have just posted version 2.5 of Capture2Text.

Download Capture2Text v2.5 via SourceForge (source code is included)

What Changed?

Just a minor update:

● Updated NHocr from v0.20 to v0.21.

● Now compiled with Ahk2Exe v1.1.11.01 instead of v1.1.05.06.

cb4960
Reply
#89
Hi cb4960,
I really like your software. I just registered on koohii.com to say "Thank you", and I think I have an idea that could improve Capture2Text's OCR accuracy: extending your ConvertImageFormat.exe so that it removes the background of the captured image and smooths it may improve the OCR output.

You may not know that some people use Potrace and MKBitmap to improve the accuracy of Tesseract. Download them here: http://potrace.sourceforge.net/

They use MKBitmap to remove the colored background from an image: http://potrace.sourceforge.net/mkbitmap.html

In this picture, every color except black (the color of most text in images) has been removed: http://potrace.sourceforge.net/img1/loxie-t3.png

They then use Potrace to smooth the image, which makes it easier for Tesseract to recognize the text and improves the accuracy of its output: http://potrace.sourceforge.net/samples.html

You wrote Capture2Text in AutoHotkey, so this script may give you some ideas. It is also written in AutoHotkey: it uses NConvert to convert the image to bitmap format, MKBitmap to make the image black and white and remove the background, and then Potrace to smooth the image: http://www.autohotkey.com/board/topic/10.../?p=431921

I hope this gives you some ideas for making Capture2Text more accurate.

___________________________

I also have another method: use Textcleaner, a script for ImageMagick ( http://www.imagemagick.org/script/index.php ), to remove the picture background and keep only the text: http://www.fmwconcepts.com/imagemagick/t.../index.php

Then use Tesseract to OCR that image. This improves the output; here is a result that is even better than commercial OCR software:

http://vbridge.co.uk/wp-content/uploads/...24x512.png

More information about using Tesseract with Textcleaner is available here: http://www.imagemagick.org/script/index.php

Thank you. I hope I can lend you a hand :)
Edited: 2013-07-27, 12:07 am
Reply
#90
@magicz123,

Thanks for the info! I'll try to experiment with these tools in the coming weeks.
Reply
#91
I have just posted version 3.0 of Capture2Text.

Download Capture2Text v3.0 via SourceForge (source code is included)

What Changed?

● Added option to binarize captured image before sending it to the OCR engine. It is disabled by default. To enable, you can either hit Win+b, or check the box in Preferences -> Output.

Binarization (aka thresholding) is just the process of converting an image to 1 bpp (i.e., black and white). However, it can DRAMATICALLY improve OCR accuracy for manga and other sources.

Comparison showing the same capture both before and after binarization:

[Image: binarization.png]

When reading manga, it is usually best to leave binarization enabled. If you find a word that the OCR engine fails on, try using Win+b to toggle binarization OFF to see if that helps (but be sure to toggle it back ON afterwards). In my testing, the difference between it being enabled and disabled was like night and day. I should have added this ages ago. If you were unsatisfied with Capture2Text in the past, you might want to give it another shot with binarization enabled.
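
As a rough illustration of what thresholding does, here is a minimal pure-Python sketch. It picks the threshold with Otsu's method, which is an assumption: the notes above do not say how Capture2Text chooses its threshold. The function names and the list-of-rows image format are also made up for the example, and the result uses the values 0 and 255 rather than packed 1 bpp data.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image):
    """Map each grey pixel (0-255) to 0 (dark) or 255 (light)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= t else 255 for p in row] for row in image]


# Dark strokes (30-40) on a light, slightly uneven background (200-220).
grey = [[30, 40, 200],
        [35, 210, 220]]
print(binarize(grey))  # [[0, 0, 255], [0, 255, 255]]
```

The uneven background values collapse to a single white level, which is the kind of cleanup that helps an OCR engine separate text from screentone or shading.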

(Note: As with previous releases, the default OCR language is English. Press Win+2 to switch the language to Japanese using the primary Tesseract OCR engine, or press Win+1 to switch the language to Japanese using the secondary Japanese NHocr OCR engine, or right-click the Capture2Text icon in the tray and select Japanese that way.)

cb4960
Reply
#92
Thank you for the update :)
Reply
#93
Will there be a possibility for a linux port??
Edited: 2013-12-09, 11:29 pm
Reply
#94
shan109 Wrote:Will there be a possibility for a linux port??
I have no plans to port Capture2Text to linux.
Reply
#95
The new(er) functionality sounds awesome.
Edited: 2013-12-10, 8:34 am
Reply
#96
I have just posted version 3.1 of Capture2Text.

Download Capture2Text v3.1 via SourceForge (source code is included)

What Changed?

● Improved Japanese OCR accuracy through use of better image pre-processing and more finely tuned Tesseract configuration options. The previous version was only 70% accurate in my test suite. The new version is 90% accurate.

● Now supports text and backgrounds of any color when OCR pre-processing is enabled. In the previous version, only dark text on a light background was supported.

● Added option to place the preview text beside the capture box. See Preferences -> OCR -> Preview Box -> Location.

(Note: As with previous releases, the default OCR language is English. Press Win+2 to switch the language to Japanese using the primary Tesseract OCR engine, or press Win+1 to switch the language to Japanese using the secondary Japanese NHocr OCR engine, or right-click the Capture2Text icon in the tray and select Japanese that way.)

cb4960
Reply
#97
Just for reference, in case anybody wants to test them, there are 読んde ココ and SmartOCR Lite (I couldn't find the professional version). Those are unbelievably good OCR programs. With 読んde ココ you can OCR an entire light novel that comes as JPEGs in a few clicks, and it's very precise.
Reply
#98
arnaldosfjunior Wrote:Just for reference, in case anybody wants to test them, there are 読んde ココ and SmartOCR Lite (I couldn't find the professional version). Those are unbelievably good OCR programs. With 読んde ココ you can OCR an entire light novel that comes as JPEGs in a few clicks, and it's very precise.
e.Typist is another good option. The cheaper NEO version supports Japanese and English.
Edited: 2014-03-22, 4:24 pm
Reply
#99
I have just posted version 3.4 of Capture2Text.

Download Capture2Text v3.4 via SourceForge (source code is included)

What Changed?

● Added option to strip furigana. It is enabled by default. To disable: "Preferences > OCR > Strip Furigana".

[Image: furigana_demo.png]

The text direction preference affects how this feature operates.

● Added the "Auto" choice to the "Text direction" preference. It is enabled by default. It uses very simple logic: If the width is more than twice as long as the height, text direction is assumed to be horizontal, otherwise text direction is assumed to be vertical. As you can see, it is biased in favor of vertical text.

● Removed the "OCR pre-processing" hotkey option from the Preferences. By default it is now set to the awkward key combination Shift-Ctrl-Windows-B. It may still be edited in settings.ini.
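
The "Auto" text direction rule described above fits in one line of code. This Python sketch restates it; the function name and the string return values are chosen for illustration only.

```python
def auto_text_direction(width, height):
    """Horizontal only if the capture is more than twice as wide as it is tall."""
    return "horizontal" if width > 2 * height else "vertical"


print(auto_text_direction(300, 100))  # horizontal
print(auto_text_direction(200, 100))  # vertical (exactly twice is not enough)
print(auto_text_direction(100, 300))  # vertical
```

A square or only moderately wide capture is treated as vertical, which is the bias toward vertical text mentioned above.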

(Note: As with previous releases, the default OCR language is English. Press Win+2 to switch the language to Japanese using the primary Tesseract OCR engine, or press Win+1 to switch the language to Japanese using the secondary Japanese NHocr OCR engine, or right-click the Capture2Text icon in the tray and select Japanese that way.)

cb4960
Reply
Is there any chance of getting a way to use this without a keyboard? I use a convertible tablet, so when I'm lounging around reading, I don't have the keyboard attached and can't really use OCR.

For example, you click the icon in the taskbar and it puts up a transparent grey layer; you then click and drag over that layer to make a selection, making the selected area fully transparent (similar to how the Snipping Tool looks when taking snips).
Reply