
Amazon ebooks to text files with ruby?

#1
I have a formula using Calibre's search and replace function to remove furigana completely from text files that someone helped me out with. But I can't help thinking it's possible to alter the formula a bit to retain the furigana and just make it ruby compatible. I love reading books using Wakaru instead of the Kindle app since lookups are so much easier and faster. I'm hoping someone reading is comfortable with Regular Expressions and might be able to help. Here's the formula I have to remove furigana outright:

\<rt\>.*?\</rt\>
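For anyone who wants to try that pattern outside Calibre, here's a minimal Python sketch of the same idea. The function name `strip_furigana` and the extra step that drops the leftover `<ruby>` wrappers are assumptions added for illustration, and the sample line is the kind of markup posted later in this thread. Calibre's own search and replace may treat escapes slightly differently.

```python
import re

# Same idea as the Calibre search pattern: lazily match each <rt>...</rt> pair.
RT_PATTERN = re.compile(r"<rt>.*?</rt>")
# Hypothetical extra step: also drop the <ruby> wrappers left behind.
RUBY_TAG_PATTERN = re.compile(r"</?ruby>")


def strip_furigana(html):
    """Remove the readings (<rt> elements), then the <ruby> tags around the base text."""
    without_readings = RT_PATTERN.sub("", html)
    return RUBY_TAG_PATTERN.sub("", without_readings)


sample = "<ruby>短<rt>みじか</rt></ruby>い<ruby>黒<rt>くろ</rt>髪<rt>かみ</rt></ruby>と"
print(strip_furigana(sample))
```

The lazy `.*?` matters here: a greedy `.*` would swallow everything from the first `<rt>` to the last `</rt>` on the line.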

Here's the bit about ruby from the Wakaru manual. It had an actual picture that I can't copy and paste here, but the first symbol is basically the marker showing where the furigana starts in the event it's not at the beginning of a word.


"Wakaru" supports rubi which are readings for certain kanjis given by the author. "Wakaru" uses the convention adopted by Aozora Bunko to encode the rubi as follows: 
[img=300x0]file:///var/containers/Bundle/Application/03ED3250-081D-44A3-97FB-D7E23091EC76/Wakaru.app/help_rubi_01@2x.png[/img]

"|" is unicode 0xFF5C.
"《" is unicode 0x300A.
"》" is unicode 0x300B
#2
Images have to be hosted on a website like https://imgur.com/ to be shared.

The normal AB ruby text format is そんな漢字《かんじ》, which unfortunately isn't enough to tell how many kanji the kana are supposed to represent. Normally this is worked around by putting an ascii space before the kanji, or by using extra punctuation around the kanji, or putting both the kanji and the kana in the 《》 but separating them with something like your |. Which are you looking at here?
Edited: 2017-06-04, 4:56 am
#3
(2017-06-04, 4:55 am)wareya Wrote: Images have to be hosted on a website like https://imgur.com/ to be shared.

The normal AB ruby text format is そんな漢字《かんじ》, which unfortunately isn't enough to tell how many kanji the kana are supposed to represent. Normally this is worked around by putting an ascii space before the kanji, or by using extra punctuation around the kanji, or putting both the kanji and the kana in the 《》 but separating them with something like your |. Which are you looking at here?

Thanks for the link to the image sharing site.

http://imgur.com/uBH17dE
#4
Unfortunately you can't do that with a regular expression or something similarly generic, because stuff with writing system changes like this is valid:

Quote:|明るく《エンライテン》

The system isn't actually formally defined so there's no telling whether this will happen without the | present, which would break any regex for it.

I'm sure there's programs out there that have a good enough way to do this though.
Edited: 2017-06-04, 1:14 pm
#5
I don't think the writing system matters for HTML-to-Aozora conversion (though it does the other way around). The problem here is that you can put more than one thing in a ruby tag.

kraemder, could you post a sample of the book's HTML code, covering a few sentences? There might be something that works for this book, even if it doesn't work for all books.
#6
Oh, they're converting from html to aozora. Yeah, you can do that.
Edited: 2017-06-04, 1:15 pm
#7
(2017-06-04, 12:54 pm)wareya Wrote:
Quote:|明るく《エンライテン》
The system isn't actually formally defined so there's no telling whether this will happen without the | present, which would break any regex for it.
I remember reading a formal definition somewhere. If there is no bar, the furigana are attached to the string of kanji immediately before. So I believe the above expression without a bar would be illegal.

(2017-06-04, 4:55 am)wareya Wrote: そんな漢字《かんじ》
A legal expression with no bar, so this is where the writing systems matter. In this case there's 2 kanji characters before the furigana, so this parses as そんな|漢字《かんじ》.
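A rough Python sketch of that rule, in case it helps to see it spelled out. This is only my reading of the convention as described in this thread, not a formal Aozora parser: the name `parse_ruby`, the decision to accept both the full-width bar (U+FF5C) and the ASCII one, and the kanji character class (CJK Unified Ideographs plus 々) are all assumptions.

```python
import re
from typing import List, Tuple

# Assumption: "kanji" = CJK Unified Ideographs block plus the iteration mark 々 (U+3005).
KANJI = r"[\u4e00-\u9fff\u3005]"

RUBY = re.compile(
    # With a bar: the base text is everything between the bar and 《.
    r"[｜|](?P<marked>[^｜|《》]+)《(?P<r1>[^《》]+)》"
    # Without a bar: the base text is the run of kanji immediately before 《.
    rf"|(?P<kanji>{KANJI}+)《(?P<r2>[^《》]+)》"
)


def parse_ruby(text: str) -> List[Tuple[str, str]]:
    """Return (base text, reading) pairs found in an Aozora-style string."""
    pairs = []
    for m in RUBY.finditer(text):
        if m.group("marked") is not None:
            pairs.append((m.group("marked"), m.group("r1")))
        else:
            pairs.append((m.group("kanji"), m.group("r2")))
    return pairs


print(parse_ruby("そんな漢字《かんじ》"))
print(parse_ruby("|明るく《エンライテン》"))
print(parse_ruby("明るく《エンライテン》"))
```

With these assumptions, the last call finds nothing, because the character right before 《 is kana rather than kanji and there is no bar to mark the start of the base text.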

I don't think it's a regular language though. But then, neither is HTML.
Edited: 2017-06-04, 1:36 pm
#8
If it's not recursive you can probably make a regular expression out of it. After all, that's the constraint. The issue is that some things that look like they're not recursive require a state machine capable of recursion to parse, because they accidentally sneak in recursive rules or whatever. (Note: regular expressions as used in pattern matching can use context-sensitive rules, even though formal regular grammars can't.)
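A small Python illustration of that note, using a backreference. The helper name `is_doubled` is made up for the example. The set of "some string written twice in a row" is not a regular language, yet the `re` module handles it because `\1` has to repeat exactly what group 1 captured.

```python
import re

# \1 is a backreference: it must match the same text that group 1 matched.
DOUBLED = re.compile(r"(.+)\1")


def is_doubled(s):
    """True if s is some non-empty string written twice in a row."""
    return DOUBLED.fullmatch(s) is not None


print(is_doubled("かみかみ"))
print(is_doubled("abcab"))
```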
#9
(2017-06-04, 1:14 pm)HelenF Wrote: kraemder, could you post a sample of the book's HTML code, covering a few sentences? There might be something that works for this book, even if it doesn't work for all books.

I second this. If the format is as I expect it to be, regexps would be up to the job and we would definitely be able to help you. And, even if regexps won't do, we could whip up a quick script to do it.
#10
Well I don't know how to view the source for AZW3 files. Using Calibre to convert to HTMLZ, I can bring up the source for that in Firefox. Here's an example sentence.

<ruby>短<rt>みじか</rt></ruby>い<ruby>黒<rt>くろ</rt>髪<rt>かみ</rt></ruby>と<ruby>青<rt>あお</rt></ruby>い<ruby>目<rt>め</rt></ruby>。</p>

Is this helpful enough?

Found the option to edit books in Calibre (there's a lot of options in Calibre I haven't noticed). Here's a copy and paste of a sentence from the AZW3 file. It looks a lot like the HTMLZ.

<p> マスコミ<ruby>各<rt>かく</rt>社<rt>しや</rt></ruby>も<ruby>取<rt>しゆ</rt>材<rt>ざい</rt>規<rt>き</rt>制<rt>せい</rt></ruby>をうけ、<ruby>星<rt>ほし</rt>菱<rt>びし</rt>邸<rt>てい</rt></ruby>に<ruby>近<rt>ちか</rt></ruby>づくことができない。ときおり、テレビ<ruby>局<rt>きよく</rt></ruby>のヘリコプターが<ruby>飛<rt>と</rt></ruby>んでくるが、<ruby>警<rt>けい</rt>察<rt>さつ</rt></ruby>のヘリに<ruby>星<rt>ほし</rt>菱<rt>びし</rt>邸<rt>てい</rt></ruby>の<ruby>上<rt>じよう</rt>空<rt>くう</rt></ruby>から<ruby>追<rt>お</rt></ruby>いはらわれている。</p>
Edited: 2017-06-05, 1:08 am
#11
(2017-06-05, 12:24 am)kraemder Wrote: Well I don't know how to view the source for AZW3 files.  Using Calibre to convert to HTMLZ, I can bring up the source for that in Firefox.  Here's an example sentence.

<ruby>短<rt>みじか</rt></ruby>い<ruby>黒<rt>くろ</rt>髪<rt>かみ</rt></ruby>と<ruby>青<rt>あお</rt></ruby>い<ruby>目<rt>め</rt></ruby>。</p>

Is this helpful enough?

Yep, I think so. So, if I got it right, your objective is turning that text into this:
Code:
|短《みじか》い|黒《くろ》|髪《かみ》と|青《あお》い|目《め》。</p>
Is that correct?

If that's the case, here's the pair of regexps that would do it:
Search string:
Code:
(?:<ruby>|</rt>)(.+?)<rt>(.+?)(?:</rt></ruby>|(?=</rt>)(?!</ruby>))
Replacement string:
Code:
|\1《\2》
But keep in mind there could be lots of cases in your books where the markup differs from what you posted here, so take it with a grain of salt, especially since I tested this in a plain Python interpreter, not inside Calibre d¬_¬;b
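For reference, this is roughly how the pair can be checked in plain Python. It's a sketch only: the function name `ruby_to_aozora` is made up for the example, and Calibre's search and replace may behave differently from Python's `re` module in edge cases.

```python
import re

# Search string from above. Each match covers one base/reading pair:
# it starts at <ruby> (first pair) or at the previous pair's </rt>,
# and ends at </rt></ruby> (last pair) or just before the next </rt>.
SEARCH = re.compile(
    r"(?:<ruby>|</rt>)(.+?)<rt>(.+?)(?:</rt></ruby>|(?=</rt>)(?!</ruby>))"
)
# Replacement string from above: bar, base text, then the reading in 《》.
REPLACEMENT = r"|\1《\2》"


def ruby_to_aozora(html):
    """Rewrite <ruby>base<rt>reading</rt></ruby> markup as |base《reading》."""
    return SEARCH.sub(REPLACEMENT, html)


sample = (
    "<ruby>短<rt>みじか</rt></ruby>い<ruby>黒<rt>くろ</rt>髪<rt>かみ</rt></ruby>"
    "と<ruby>青<rt>あお</rt></ruby>い<ruby>目<rt>め</rt></ruby>。</p>"
)
print(ruby_to_aozora(sample))
```

The lookahead `(?=</rt>)` is what lets a multi-kanji `<ruby>` element work: the match stops just before the closing `</rt>`, so the next match can start from that same `</rt>` and pick up the following kanji.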

(2017-06-05, 12:24 am)kraemder Wrote: Found the option to edit books in Calibre (there's a lot of options in Calibre I haven't noticed).  Here's a copy and paste of a sentence from the AZW3 file.  It looks a lot like the HTMLZ.

<p> マスコミ<ruby>各<rt>かく</rt>社<rt>しや</rt></ruby>も<ruby>取<rt>しゆ</rt>材<rt>ざい</rt>規<rt>き</rt>制<rt>せい</rt></ruby>をうけ、<ruby>星<rt>ほし</rt>菱<rt>びし</rt>邸<rt>てい</rt></ruby>に<ruby>近<rt>ちか</rt></ruby>づくことができない。ときおり、テレビ<ruby>局<rt>きよく</rt></ruby>のヘリコプターが<ruby>飛<rt>と</rt></ruby>んでくるが、<ruby>警<rt>けい</rt>察<rt>さつ</rt></ruby>のヘリに<ruby>星<rt>ほし</rt>菱<rt>びし</rt>邸<rt>てい</rt></ruby>の<ruby>上<rt>じよう</rt>空<rt>くう</rt></ruby>から<ruby>追<rt>お</rt></ruby>いはらわれている。</p>

The same regexps applied to this last text render this result:
Code:
<p> マスコミ|各《かく》|社《しや》も|取《しゆ》|材《ざい》|規《き》|制《せい》をうけ、|星《ほし》|菱《びし》|邸《てい》に|近《ちか》づくことができない。ときおり、テレビ|局《きよく》のヘリコプターが|飛《と》んでくるが、|警《けい》|察《さつ》のヘリに|星《ほし》|菱《びし》|邸《てい》の|上《じよう》|空《くう》から|追《お》いはらわれている。</p>
Check it out carefully and tell me if this is what you want ;-)
#12
I just did a quick convert and it's looking wonderful! Thanks very much! If anyone else is reading this thread and looking to duplicate the effect to read AZW3/MOBI/EPUB books with furigana in the Wakaru app for iOS, just convert to a UTF-8 txt file and, after you import it to Wakaru, open the editor for the selected title and turn on Recognize Aozora Format. For whatever reason the default is off. This helps me a lot. I'm pretty eager to get reading now.