KanjiTomo - New OCR program for Japanese text
kanji koohii FORUM (http://forum.koohii.com), Forum: Learning resources, Thread: /thread-9971.html

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

Hello. I have written a program called KanjiTomo. It identifies Japanese characters under the mouse cursor and looks up the corresponding words in a dictionary. It works a bit like Rikaichan, but my program can be used with any image on screen, not just text on web pages. This is what it looks like:

[image]

KanjiTomo uses a custom OCR algorithm optimized for kanji recognition. Especially with small font sizes, its accuracy is often better than that of generic algorithms (for example NHocr or Tesseract). Kana recognition is not as good, but considering the target audience (like people in this forum), I think this is a fair trade-off.

You can download the program from www.kanjitomo.net

I hope you find it useful!

KanjiTomo - New OCR program for Japanese text - toshiromiballza - 2012-09-23

"Error: Could not find or load main class KanjiTomo.jar"

Windows 7, Java 1.7.0_07-b10.

Quote: It is recommended to install Java JDK instead of JRE, it will increase recognition speed by about 25%.

How so? Will other Java apps run/load faster on JDK too, or is it just the kanji recognition in your app?

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

toshiromiballza Wrote: "Error: Could not find or load main class KanjiTomo.jar"

JDK contains a different Java virtual machine that can be launched with the "-server" option. That means it will produce more optimized code at the price of a slightly longer startup time. All Java programs can in theory benefit from this, but the actual improvement depends on the type of program. KanjiTomo uses a fairly CPU-intensive algorithm, so it is useful to have the -server option on.

As for the "Could not find or load main class" error, could you try to launch the jar from a command line (cd to the unzipped directory and run: "java -verbose -jar KanjiTomo.jar") and send the output to me? Preferably to kanjitomo@gmail.com. I'll post the fix/results here for other people if we find a solution.

KanjiTomo - New OCR program for Japanese text - avelicu - 2012-09-23

That's awesome, dude. Works really well for me. Thank you very much for your programming efforts.

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

avelicu Wrote: Thank you very much for your programming efforts.

Thank you! Good to hear that it works for you.

I just uploaded a new version to www.kanjitomo.net. It now has an option to specify the Java installation in the config.txt file if auto-detection fails.

KanjiTomo - New OCR program for Japanese text - Zarxrax - 2012-09-23

This is really interesting. Though I am not having much luck with it. It only seems to work with kanji that are quite large. And a lot of the time, I can't even get it to put a box around the kanji.

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

Zarxrax Wrote: This is really interesting.

Do you have a link for an example? I could try it myself. My own tests have been the opposite: even quite small kanji work, but very large characters (in titles for example) sometimes don't fit in the internal buffer the program is using (I'm planning to make this a configurable option at some point).

KanjiTomo - New OCR program for Japanese text - toshiromiballza - 2012-09-23

Kurotowa Wrote: That means it will produce more optimized code at the price of a bit longer startup time.

I see.

Kurotowa Wrote: As for the "Could not find or load main class" error, could you try to launch the jar from a command line (cd to unzipped directory and run: "java -verbose -jar KanjiTomo.jar")

I tried it with "java -jar KanjiTomo.jar" now and it worked. I only ran "java KanjiTomo.jar" before; I didn't know I had to run it with the -jar option, since I don't think I had problems running any other Java program by simply double-clicking it or typing "java program.jar" in the command prompt.

This is quite awesome work; not having to manually scan images is really great!

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

toshiromiballza Wrote: I tried it with "java -jar KanjiTomo.jar" now and it worked. I only ran "java KanjiTomo.jar" before; didn't know I had to run it with the -jar option, since I don't think I had problems with running any other java program by simply double-clicking them or typing "java program.jar" in the command prompt.

I have now added a launch.bat file that contains the "java -jar" command. It can be used to run the program if double-clicking the jar file doesn't work.

KanjiTomo - New OCR program for Japanese text - toshiromiballza - 2012-09-23

Some personal suggestions: Could you add an option to accumulate the kanji/words found somewhere, with the possibility of copying the text afterwards? And an option to disable loading the dictionaries, in case I just want the raw text but don't need the EDICT translations of words?

I'm also confused by the control panel. What types of files is this used for? Next page of what? Page number of what? I tried to open a pdf file, but it's not supported, so I'm wondering. Also, what are the 3rd and 4th boxes in the "OCR results" panel for? They're always empty. Or are four character compounds something you'll add in the future?

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

toshiromiballza Wrote: Could you add an option to accumulate the kanji/words found somewhere and the possibility of copying the text afterwards? And an option to disable loading the dictionaries, in case I just want the raw text, but don't need the EDICT translations of words?

Yes, this would be possible. I'll take this into consideration for future versions.

toshiromiballza Wrote: I'm also confused by the control panel. What types of files is this used for? Next page of what? Page number of what? I tried to open a pdf file, but it's not supported, so I'm wondering. Also, what are the 3rd and 4th boxes in the "OCR results" panel for? They're always empty.
Or are four character compounds something you'll add in the future?

It can open common image file formats (png, jpeg, ...). The page number is used if you have multiple image files in a directory with numbers in the file names. I might also add pdf support if I can find a Java library that can open pdf files.

You are correct that the extra boxes are reserved for four character compounds. I'm planning to add those in later versions.

KanjiTomo - New OCR program for Japanese text - toshiromiballza - 2012-09-23

Kurotowa Wrote: Page number is used if you have multiple image files in a directory with numbers in file names.

I see. I think you should add this information to the description on your website.

Kurotowa Wrote: I might also add pdf support if I can find a Java library that can open pdf files.

Great. DjVu support would be much appreciated as well, since most pdfs are searchable/text anyway, but djvu files are image scans. An open djvu library for Java already exists: http://javadjvu.foxtrottechnologies.com/ I don't even think any major/commercial OCR suites support djvu files.

On second thought, since your program allows scanning text from different programs anyway, this isn't really important, since it's already supported in a way!

Kurotowa Wrote: My own tests have been the opposite: even quite small kanji works, but very large characters (in titles for example) sometimes don't fit in the internal buffer the program is using (I'm planning to make this a configurable option at some point).

I can confirm this. Even tiny ones work, but large ones not at all.

KanjiTomo - New OCR program for Japanese text - visualsense - 2012-09-23

Wow, reeaally cool. Great work! Amazing! It is like a 15+ year dream come true!

obs1: It seems it is made with dark chars on clear background as default? It would be cool if you could press a key like "ctrl" and it switched to detecting white letters on backgrounds of other colors.
obs2: It sometimes detects text as vertical and other times as horizontal in densely packed text, when the letters happen to line up as something in the dictionary. A switch/key to force one reading direction or the other would also be cool.

KanjiTomo - New OCR program for Japanese text - Zarxrax - 2012-09-23

Kurotowa Wrote: Do you have a link for an example? I could try it myself.

I tested on this site and could hardly get anything to work: http://www3.nhk.or.jp/

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

visualsense Wrote: Wow, reeaally cool.

Thank you! I'm glad you liked it.

visualsense Wrote: obs1: It seems it is made with dark chars on clear background as default?

Both of these options are already on my "upcoming features" list.

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-23

Zarxrax Wrote: I tested on this site and couldn't hardly get anything to work: http://www3.nhk.or.jp/

The underlining effect you get from text links works against the OCR. Also, some of the characters on this site are white on a dark background. Currently black text on a white background is expected. But on a site like this, it is better to use Rikaichan anyway.

I would suggest testing the program on amazon.co.jp, for example. Find some book with a preview, like this one

KanjiTomo - New OCR program for Japanese text - Zarxrax - 2012-09-24

Ok, I guess the problem was that the characters are white. I was mainly trying to read the text in the images, not the page text. Going back and trying again, I see that it is able to get most of the black text just fine.

KanjiTomo - New OCR program for Japanese text - Zarxrax - 2012-09-25

This looks really useful for reading manga on the pc. The biggest problem seems to be that it has trouble selecting exactly what I would want it to select. For instance, it will often just select a single kanji or a part of a word, rather than the whole word, or it might include a comma as part of a character. If there was a way to draw our own box indicating what we want to look up, that would be awesome.

What would be even more awesome is if you made a manga viewer for android with this technology. That way we can read manga on a tablet while looking up the unknown words. It could also read the image files directly rather than going from a screen capture, which should help with accuracy at any zoom level. Just wishful thinking.

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-25

Zarxrax Wrote: This looks really useful for reading manga on the pc. Biggest problem seems to be that it has trouble selecting exactly what I would want it to select. For instance, it will often just select a single kanji or a part of a word, rather than the whole word, or it might include a comma as part of a character. If there was a way to draw our own box indicating what we want to look up, that would be awesome.

It's true that sometimes the program doesn't know where one character ends and another starts. Let me show how it works. Here's a screenshot of the area around the mouse cursor:

[image]

KanjiTomo first finds parts of the image where pixels are touching each other:

[image]

Then areas that are not touching but whose bounding boxes overlap are merged:

[image]

See where the program made a mistake? It's 今度 on the left. Drawing boxes around problem areas would be a solution and I'm going to implement it at some point, but my first priority is to tune the auto-detection so that manual drawing is not needed too often.

Zarxrax Wrote: What would be even more awesome is if you made a manga viewer for android with this technology. That way we can read manga on a tablet while looking up the unknown words. It could also read the image files directly rather than going from a screen capture, which should help with accuracy at any zoom level.

I don't think that tablets could handle it just yet; it takes quite a lot of CPU power to run the program in real time. So it would first need to analyze the images on a PC.

Actually, if you open a file inside the program, it reads the pixels directly from the image at the original resolution even if the display is scaled. Screen capture is used only if the mouse cursor is outside the program.

KanjiTomo - New OCR program for Japanese text - shinsen - 2012-09-25

Could this perhaps be used to batch process .idx/sub subtitles into .srt?

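Editor's sketch: the two segmentation steps Kurotowa describes in the post above (find groups of touching pixels, then merge groups whose bounding boxes overlap) might look roughly like this in Python. This is not KanjiTomo's actual code. The function names, the 0/1 image representation, and the use of 8-connectivity are all assumptions made for illustration, and the input is assumed to be already thresholded to dark text on a light background.

```python
from collections import deque


def connected_components(image):
    """Return one bounding box (left, top, right, bottom) per group of touching dark pixels.

    `image` is a list of rows of 0/1 values, 1 meaning a dark (text) pixel.
    Pixels count as touching when they are 8-connected (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill this group and track its extent.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def overlaps(a, b):
    """True if two inclusive bounding boxes share at least one pixel position."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(boxes):
    """Repeatedly merge any two boxes that overlap until none do."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


# A frame with a detached dot inside it, plus a separate vertical bar.
rows = [
    "#####...#",
    "#...#...#",
    "#.#.#...#",
    "#...#...#",
    "#####...#",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]

parts = connected_components(image)
characters = merge_overlapping(parts)
print(parts)
print(characters)
```

In this toy input the dot inside the frame is its own group of pixels, but its box lies within the frame's box, so the two are merged into one candidate character while the bar stays separate. The same rule would also merge two neighbouring characters whenever their boxes happen to overlap, which is one way a mistake like the 今度 example could arise.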
KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-09-27

shinsen Wrote: Could this perhaps be used to batch process .idx/sub subtitles into .srt?

I don't know the details of these formats, but it would be possible to add general image -> text file batch features at some point. At the moment, though, my focus is on interactive use.

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-10-07

I have uploaded a new version (0.9.5) of the program to www.kanjitomo.net

Here's a list of new features:
- four character compounds
- selectable text orientation
- selectable text color
- hotkeys
- faster startup speed

To use hotkeys, you must first enable them by setting ENABLE_HOTKEYS=1 in the config.txt file. This is because the hotkeys might override keys defined in other programs. The default keys are:
- Alt+A = change text orientation
- Alt+S = change text color
- Alt+X = toggle automatic OCR on/off

Currently the options for text color are "black on white" and "white on black". As long as there's enough contrast, other colors should also work. Some combinations (such as white text on a light blue background) are not yet recognized, but I have some ideas on how to improve this further in future versions.

KanjiTomo - New OCR program for Japanese text - shinsen - 2012-10-07

Sikasiisti! It works really well on ranobe and manga jpg scans. Interestingly, the algorithm is really good at detecting low-res manga kanji that the human eye has trouble with, but when the background is not uniform (in pictures, or subtitles overlaid on video) the algorithm will often fail even though the kanji is crystal clear to the human eye.

The .jar file didn't launch on OS X, by the way.

KanjiTomo - New OCR program for Japanese text - Hashiriya - 2012-10-07

This works amazingly well on light novels... thanks a lot!

KanjiTomo - New OCR program for Japanese text - Kurotowa - 2012-10-07

shinsen Wrote: Sikasiisti! It works really well on ranobe and manga jpg scans.
Interestingly, the algorithm is really good at detecting low-rez manga kanji that the human eye has trouble with but when the background is not uniform the algorithm will often fail even though the kanji is crystal clear to the human eye - in pictures or subtitles overlaid on video.

You are right! This shows how some problems are easy for humans but difficult for computers. It's very easy for us to see the character boundaries, but the computer doesn't really know what is text and what is background.

shinsen Wrote: The .jar file didn't launch on OS X by the way.

At least one person has been able to run the program on a Mac, but it was necessary to launch it from the command line. I have included a file "launch.sh" that can be used to run the program on a Mac, but you must first set its executable bit. Since I can't test this myself, it's not "officially" supported (i.e. not advertised on the kanjitomo web page).