Hi everybody!
I started learning Chinese just a few weeks ago, and I'm quite interested in what OCR can offer for Asian languages: specifically, the chance to make progress in some core areas of the language (here, speaking and listening, by studying through romanized pinyin) while postponing others that matter less for communication, such as writing and reading.
I've tried a lot of OCR software (ABBYY FineReader, Acrobat, Tesseract, etc.), and none of it is trained, or seems to be trainable, to get diacritics like ū or ǒ right, so here we are.
Could anyone help me with this issue please?
Thank y'all in advance.
PS: I know this forum is about Japanese, but admittedly, comparing it to the RH one, the latter is simply derisory, so congratulations on such an impressive effort.
I have many resources only in hanzi, such as grammars not aimed at foreign learners and linguistic material (with an Asian approach), and, after studying a lot of Chinese phonology, I would like to practice tone processes as I read it all aloud; in short, I'm trying to develop a kind of optimizing theory applied to languages.
Thanks anyway!
comeauch
Member
From: Canada
Registered: 2011-11-04
Posts: 175
If you can get the text OCR'd in hanzi, the hanzi to pinyin conversion shouldn't be very difficult: Google Translate does a good job and includes diacritics... Just copy-paste the hanzi text and under it, the pinyin appears... For example, I copy-pasted something from China Daily: "正在中国访问的新西兰外交部长默里麦卡利于22日举行新闻发布会,对持续发酵的“毒奶粉”事件深表遗憾,称恒天然公司若让消费者失望,也就是让新西兰失望。" and it instantly "decodes" it as "Zhèngzài zhōngguó fǎngwèn de xīnxīlán wàijiāo bùzhǎng mò lǐ mài kǎ lìyú 22 rì jǔxíng xīnwén fābù huì, duì chíxù fāxiào de “dú nǎifěn” shìjiàn shēn biǎo yíhàn, chēng héng tiānrán gōngsī ruò ràng xiāofèi zhě shīwàng, yě jiùshì ràng xīnxīlán shīwàng."
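If you'd rather do the hanzi to pinyin step offline, here is a minimal Python sketch of the idea. It is not what Google Translate does internally: `SAMPLE_READINGS` is a tiny hand-typed table I made up for illustration (a real one would have to come from a full dictionary), and the function names are hypothetical too. It looks up one reading per character and turns numbered pinyin like `guo2` into `guó` using Unicode combining marks, so it ignores characters with several readings and does not group syllables into words.

```python
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) gets no mark.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}

# Hypothetical sample table (assumption): one numbered reading per character.
SAMPLE_READINGS = {
    "中": "zhong1",
    "国": "guo2",
    "访": "fang3",
    "问": "wen4",
    "的": "de5",
}


def mark_syllable(syllable: str) -> str:
    """Turn numbered pinyin such as 'guo2' into 'guó'."""
    if not syllable or not syllable[-1].isdigit():
        return syllable
    base, tone = syllable[:-1], int(syllable[-1])
    mark = TONE_MARKS.get(tone)
    if mark is None:  # neutral tone or unknown digit: drop the number only
        return base
    lower = base.lower()
    # Placement rule: 'a' or 'e' takes the mark; in 'ou' the 'o' takes it;
    # otherwise the last vowel of the syllable takes it.
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in "aeiouü"]
        if not vowels:
            return base
        idx = vowels[-1]
    marked = base[: idx + 1] + mark + base[idx + 1:]
    # NFC folds letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)


def to_pinyin(text: str, readings=SAMPLE_READINGS) -> str:
    """Convert each known character; pass unknown ones through unchanged."""
    out = []
    for ch in text:
        reading = readings.get(ch)
        out.append(mark_syllable(reading) if reading else ch)
    return " ".join(out)


print(to_pinyin("中国访问"))
```

With a complete reading table, the same loop could run over a whole OCR'd text file, though you would still need some context handling for characters that have more than one pronunciation.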
Yeah, that's why I want to OCR the material I have, and do it in bulk, because websites with a 5 MB upload limit are useless in this case. There must be some software developed by somebody somewhere, and that's what I'm trying to find out. I can't understand why no commercial program in the world includes such diacritics yet, not even ABBYY.
Thank you very much for your help anyway.
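For the bulk part, one possible workaround for upload limits is to run the hanzi OCR locally over a whole folder. The sketch below assumes the `tesseract` command-line program is installed and on the PATH with the simplified Chinese language data (`chi_sim`); the helper names and folder names are made up for illustration. It only produces hanzi text files, which would then still need the pinyin conversion.

```python
from pathlib import Path
import subprocess

# File types treated as scanned pages (assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_command(image: Path, out_dir: Path, lang: str = "chi_sim") -> list[str]:
    """Build the tesseract call: tesseract <image> <output base> -l <lang>."""
    # tesseract adds the ".txt" extension to the output base on its own.
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", lang]


def ocr_folder(in_dir: Path, out_dir: Path, lang: str = "chi_sim") -> list[Path]:
    """OCR every image in in_dir and return the expected text file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(in_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # check=True stops the batch if tesseract reports an error.
        subprocess.run(build_command(image, out_dir, lang), check=True)
        written.append(out_dir / (image.stem + ".txt"))
    return written
```

Calling `ocr_folder(Path("scans"), Path("ocr_text"))` would then write one `.txt` file per page image, with no per-file size limit beyond what the machine can handle.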