kanji koohii FORUM
How do I remove Furigana from an Epub? - Printable Version


How do I remove Furigana from an Epub? - Green_Airplane - 2015-05-13

Hello,

I just downloaded a Japanese ebook in Epub3 format (I actually paid for it, responsible citizen that I am). I need to convert it to a plain text file, so I can use it in a learning tool. I tried some conversion tools I found on Google, but they mix the furigana right into the output text. Say I have a word:
たんじょうび
誕生日
in the Epub (たんじょうび being furigana, appearing above 誕生日). What I get in the resulting txt is this:
誕たん生じょう日び
This is obviously no good: it can't be processed further, because all the markup that separated the furigana from the kanji has been removed. So the removal needs to happen on the epub file itself, or during conversion.

I know epub is an open format, so it must be possible to do this without any hacking. I tried to open the file in a text editor, to see if I can identify the furigana markup and remove it, but the file looked like binary (or at least I couldn't figure out the right encoding).
I've downloaded an eBook reader called Calibre, which has a convert function with a lot of options. But I couldn't find anything specific to furigana.
Any suggestions?

EDIT: apparently the mechanism used for furigana is called "ruby characters" which is very inconvenient, because when I google things like "remove ruby characters from epub", what I get are tutorials for text manipulation in a programming language called Ruby.
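
On the file looking like binary in a text editor: an EPUB is a ZIP archive, and the book text sits in (X)HTML files inside it. Below is a minimal Python sketch that lists those files using only the standard `zipfile` module. The function name `list_text_members` and the file name `book.epub` are placeholders for illustration, not something from the book or from Calibre.

```python
import zipfile


def list_text_members(epub_path):
    """Return the names of the (X)HTML files inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_path) as archive:
        return [
            name
            for name in archive.namelist()
            # The readable text, including any <ruby> markup, lives in these files.
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]
```

Calling `list_text_members("book.epub")` on a real file should return the chapter files, which can then be read with `archive.read(name).decode("utf-8")`, assuming the book is UTF-8 encoded.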


How do I remove Furigana from an Epub? - Green_Airplane - 2015-05-13

So I figured out how to do it in Calibre. I thought I'd share it in case someone else is looking for this.
1. In Calibre, select Convert books
2. On the left click the Search and replace tab
3. You'll be adding regular expressions to match the stuff you want to remove, and replacing it with nothing (empty string). The markup looks like this:

<ruby>誕<rt>たん</rt></ruby><ruby>生<rt>じょう</rt></ruby><ruby>日<rt>び</rt></ruby>

If you're a Regex guru, you may be able to write a pattern that removes it all in one go. What I did was add 3 patterns:
\<ruby\> - removes all occurrences of the opening tag - <ruby>
\</ruby\> - removes all occurrences of the closing tag - </ruby>
\<rt\>.{1,7}\</rt\> - removes everything that starts with <rt>, ends with </rt> and has 1 to 7 characters in between. Why 1 to 7? Because the number of characters between <rt> and </rt> varies, and simple testing showed that the longest one in my text is 7 characters:

<rt>ガーゴイルぞう</rt>

You can change the value to fit your needs. You need to restrict the length though, otherwise you'd match anything between any <rt> and any </rt> that follows it anywhere in the text - including between the very first <rt> and very last </rt> which would practically delete the whole book.
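
For anyone who wants to try the three patterns outside Calibre, here is a minimal Python sketch. It assumes Calibre's search and replace behaves like Python's `re` module, where `<` and `>` need no backslash, so the patterns are written unescaped. The name `strip_furigana` is made up for this example.

```python
import re

# The three patterns from the steps above, each replaced with an empty string.
PATTERNS = [
    r"<ruby>",           # opening tag
    r"</ruby>",          # closing tag
    r"<rt>.{1,7}</rt>",  # a reading of 1 to 7 characters plus its <rt> tags
]


def strip_furigana(markup):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in PATTERNS:
        markup = re.sub(pattern, "", markup)
    return markup


sample = (
    "<ruby>誕<rt>たん</rt></ruby>"
    "<ruby>生<rt>じょう</rt></ruby>"
    "<ruby>日<rt>び</rt></ruby>"
)
print(strip_furigana(sample))  # 誕生日
```

A reading longer than 7 characters would not match the third pattern and would stay in the text, so the limit has to be at least as long as the longest reading in the book.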

Enjoy.


How do I remove Furigana from an Epub? - yogert909 - 2015-05-13

Green_Airplane Wrote: You need to restrict the length though, otherwise you'd match anything between any <rt> and any </rt> that follows it anywhere in the text - including between the very first <rt> and very last </rt> which would practically delete the whole book.
I'm glad you got it working. Just a small regex tip...

You can combine the first two regexes into one using '?', which matches one or zero of whatever precedes it:
\</?ruby\> - matches <ruby> and </ruby>

You can use '*?' to restrict to the smallest (lazy) match:
\<rt\>.*?\</rt\> - matches just what's enclosed between each <rt> and </rt>
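
A quick way to see the difference between the greedy and lazy forms is to run both on a short line. This is a sketch in Python, assuming Calibre uses the same regex flavor; the variable names are only for illustration.

```python
import re

line = "<ruby>誕<rt>たん</rt></ruby><ruby>生<rt>じょう</rt></ruby>"

# Greedy: .* runs from the first <rt> to the LAST </rt> on the line.
greedy = re.sub(r"<rt>.*</rt>", "", line)
print(greedy)  # <ruby>誕</ruby>

# Lazy: .*? stops at the nearest </rt>, so each reading is removed separately.
lazy = re.sub(r"<rt>.*?</rt>", "", line)
print(lazy)  # <ruby>誕</ruby><ruby>生</ruby>

# Both tips combined into one pattern using alternation.
combined = re.sub(r"</?ruby>|<rt>.*?</rt>", "", line)
print(combined)  # 誕生
```

Note that the greedy version also swallows the kanji 生, which is exactly the over-matching the length limit was guarding against. In Python, '.' does not match a newline by default, so even the greedy form cannot run past the end of a line.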


How do I remove Furigana from an Epub? - Green_Airplane - 2015-05-16

Thanks for the tips. I get to use Regex so rarely that even if I do learn some more advanced techniques, I usually forget them by the next time I need them. And when I do use Regex, the strategy is usually to get the results I need with a minimal amount of googling. I'll try to remember those.