How do I remove Furigana from an Epub?

#1
Hello,

I just downloaded a Japanese ebook in Epub3 format (I actually paid for it, responsible citizen that I am). I need to convert it to a plain text file, so I can use it in a learning tool. I tried some conversion tools I found on Google, but they mix the furigana right into the output text. Say I have a word:
たんじょうび
誕生日
in the Epub (たんじょうび being furigana, appearing above 誕生日). What I get in the resulting txt is this:
誕たん生じょう日び
This is obviously no good: it can't be processed further, because all the markup that separated the furigana from the base text has been removed. So the removal needs to happen on the epub file itself, or during conversion.

I know epub is an open format, so it must be possible to do this without any hacking. I tried to open the file in a text editor, to see if I can identify the furigana markup and remove it, but the file looked like binary (or at least I couldn't figure out the right encoding).
I've downloaded an eBook reader called Calibre, which has a convert function with a lot of options. But I couldn't find anything specific to furigana.
Any suggestions?
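On the file looking like binary: an EPUB is a ZIP archive, and the book text sits in XHTML files inside it, which is why a plain text editor shows gibberish. A minimal Python sketch for peeking inside, assuming the chapter files end in .xhtml, .html or .htm (the helper name list_content_files is made up for illustration, and real books may organize their files differently):

```python
import zipfile


def list_content_files(epub_path):
    """Return the names of the (X)HTML files stored inside an EPUB.

    An EPUB is a ZIP container, so the standard zipfile module can open it.
    Assumption: chapter files end in .xhtml, .html or .htm.
    """
    with zipfile.ZipFile(epub_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]
```

Calling list_content_files on the path to your book should list the chapter files. Each one can then be read with archive.read(name) and decoded (usually UTF-8, though that is an assumption too) to see the actual markup.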

EDIT: apparently the mechanism used for furigana is called "ruby characters" which is very inconvenient, because when I google things like "remove ruby characters from epub", what I get are tutorials for text manipulation in a programming language called Ruby.
Edited: 2015-05-13, 1:53 pm
#2
So I figured out how to do it in Calibre. I thought I'd share it in case someone else is looking for this.
1. In Calibre, select Convert books
2. On the left click the Search and replace tab
3. You'll be adding regular expressions to match the stuff you want to remove, and replacing each match with nothing (an empty string). The markup looks like this:

<ruby>誕<rt>たん</rt></ruby><ruby>生<rt>じょう</rt></ruby><ruby>日<rt>び</rt></ruby>

If you're a Regex guru, you may be able to write a pattern that removes it all in one go. What I did was add 3 patterns:
\<ruby\> - removes all occurrences of the opening tag - <ruby>
\</ruby\> - removes all occurrences of the closing tag - </ruby>
\\<rt\>.{1,7}\</rt\> - removes everything that starts with <rt>, ends with </rt> and has 1 to 7 characters in-between. Why 1 to 7? Because the number of characters between <rt> and </rt> varies, and simple testing showed that the longest one in my text is 7 characters:

<rt>ガーゴイルぞう</rt>

You can change the value to fit your needs. You need to restrict the length though, otherwise you'd match anything between any <rt> and any </rt> that follows it anywhere in the text - including between the very first <rt> and very last </rt> which would practically delete the whole book.
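For anyone who wants to try the same three replacements outside Calibre, here is a rough Python sketch. It assumes the markup is exactly the plain ruby/rt form shown above, with no attributes on the tags. The function name strip_ruby is made up for illustration. The third pattern is written with a single leading backslash, because in Python's re module a doubled backslash would look for a literal backslash character in the text.

```python
import re

sample = (
    "<ruby>誕<rt>たん</rt></ruby>"
    "<ruby>生<rt>じょう</rt></ruby>"
    "<ruby>日<rt>び</rt></ruby>"
)

# The three search patterns; every match is replaced with an empty string.
PATTERNS = [
    r"\<ruby\>",             # opening tag
    r"\</ruby\>",            # closing tag
    r"\<rt\>.{1,7}\</rt\>",  # a reading of 1 to 7 characters plus its tags
]


def strip_ruby(text):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return text


print(strip_ruby(sample))  # prints 誕生日
```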

Enjoy.
#3
Green_Airplane Wrote: You need to restrict the length though, otherwise you'd match anything between any <rt> and any </rt> that follows it anywhere in the text - including between the very first <rt> and very last </rt> which would practically delete the whole book.
I'm glad you got it working. Just a small regex tip...

You can combine the first two regexes into one using '?', which matches zero or one of whatever precedes it:
\</?ruby\> - matches <ruby> and </ruby>

You can use '*?' to restrict the match to the smallest possible (lazy) one:
\<rt\>.*?\</rt\> - matches each <rt>...</rt> pair and what it encloses, one pair at a time
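A quick way to see the difference between the lazy and the greedy form is to run both on the sample markup from earlier in the thread. This is only a sketch using Python's re module, not Calibre itself, and remove_furigana is a made-up name:

```python
import re

sample = (
    "<ruby>誕<rt>たん</rt></ruby>"
    "<ruby>生<rt>じょう</rt></ruby>"
    "<ruby>日<rt>び</rt></ruby>"
)


def remove_furigana(text):
    """Drop each <rt>...</rt> pair (lazy match), then the ruby tags."""
    text = re.sub(r"\<rt\>.*?\</rt\>", "", text)
    return re.sub(r"\</?ruby\>", "", text)


# Same thing with a greedy '.*': the match runs from the first <rt>
# to the last </rt>, so the base characters in between are lost too.
greedy_result = re.sub(r"\</?ruby\>", "", re.sub(r"\<rt\>.*\</rt\>", "", sample))

print(remove_furigana(sample))  # prints 誕生日
print(greedy_result)            # prints 誕
```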
#4
Thanks for the tips. I get to use Regex so rarely that even if I do learn some more advanced techniques, I usually forget them by the next time I get to use them. And when I do use Regex, the strategy is usually to get the results I need with minimal amount of googling. I'll try to remember those.
#5
(2015-05-13, 3:32 pm)Green_Airplane Wrote: So I figured out how to do it in Calibre. I thought I'd share it in case someone else is looking for this.

Hi, I've been searching for ages and trying to do the same thing, but I'm not very technically minded, so I'm bumping this post in the hope that someone can help. I'm converting Kindle Japanese books to txt format with Calibre to read in the Wakaru app, and I want to remove the ruby characters. I can do the first two steps, but I'm not sure how to enter the third one in the "Search Regular Expression" field. Also, will this remove the ruby characters? I've been trying for far too long!
#6
I could be wrong, but it seems Green_Airplane may have added an extra backslash at the beginning of the 3rd regex. Try the following:

Code:
\<rt\>.{1,7}\</rt\>

Actually, a small improvement should work with furigana strings of any length, not just 1-7 characters:
Code:
\<rt\>.*?\</rt\>

Let me know if this does what you need.
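To illustrate why the length limit matters, here is a small Python sketch comparing the two patterns. The second test string uses an artificial 8-character reading (just あ repeated) purely to exceed the limit; it is not from any real book.

```python
import re

BOUNDED = r"\<rt\>.{1,7}\</rt\>"  # readings of 1 to 7 characters only
LAZY = r"\<rt\>.*?\</rt\>"        # readings of any length

short_reading = "<ruby>像<rt>ぞう</rt></ruby>"
# Artificial example: a reading that is 8 characters long.
long_reading = "<ruby>語<rt>" + "あ" * 8 + "</rt></ruby>"

# Both patterns handle the short reading.
bounded_short = re.sub(BOUNDED, "", short_reading)
lazy_short = re.sub(LAZY, "", short_reading)

# Only the lazy pattern handles the long one; the bounded pattern
# finds no match and leaves the text untouched.
bounded_long = re.sub(BOUNDED, "", long_reading)
lazy_long = re.sub(LAZY, "", long_reading)

print(bounded_long == long_reading)  # prints True
print(lazy_long)                     # prints <ruby>語</ruby>
```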
Edited: 2017-02-23, 9:17 pm
#7
Thanks for the quick reply - much appreciated!

As I said, I don't know anything about Regex, so am guessing how to input it and copying and pasting.

I get that \<rt\>.*?\</rt\> and \<rt\>.{1,7}\</rt\> should be put in the Replacement Text field, but what should I put in Search Regular Expression field? Apologies if I'm making a really obvious mistake here!
#8
You will be searching for the ruby code and replacing it with nothing, so put this in the search field:
\<rt\>.*?\</rt\>

Put nothing in the replacement field.

----
Here's my regex quickstart if you are interested:
You could search for <rt>, but it seems in Calibre you need to escape '<' and '>'. So you add a '\' before each '<' and '>', which gives \<rt\>. Next, the period (.) matches any single character. After that, the star (*) repeats the preceding item zero or more times. But without the question mark (?) the match is greedy: it runs from the first '<rt>' to the last '</rt>' it can reach, which isn't what we want. The question mark makes the match stop at the first '</rt>' and go no further. Then we just need the closing tag </rt>, again with '<' and '>' escaped, so it's \</rt\>.

so:
\<rt\> -matches <rt>
.*? -matches whatever is between
\</rt\> -matches </rt>
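The same three-part breakdown can be written out as a commented pattern in Python, which may help when coming back to it later. This is just a sketch with Python's re module (the name RT_BLOCK is made up). As far as Python's re is concerned, the backslashes before '<' and '>' are harmless but not required.

```python
import re

# re.VERBOSE lets the pattern be split over lines with comments;
# whitespace inside the pattern is ignored in this mode.
RT_BLOCK = re.compile(
    r"""
    \<rt\>     # matches the opening tag <rt>
    .*?        # matches whatever is between, as little as possible
    \</rt\>    # matches the closing tag </rt>
    """,
    re.VERBOSE,
)

text = "誕<rt>たん</rt>生<rt>じょう</rt>"

print(RT_BLOCK.findall(text))  # prints ['<rt>たん</rt>', '<rt>じょう</rt>']
print(RT_BLOCK.sub("", text))  # prints 誕生
```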
Edited: 2017-02-23, 10:11 pm
#9
(2017-02-23, 10:10 pm)yogert909 Wrote: You will be searching for the ruby code and replacing it with nothing so put this in the search field:
\<rt\>.*?\</rt\>

put nothing in the replacement field.

Wow - I feel really stupid now. I spent most of last night and this morning trying to figure this out, searching Google and randomly changing Calibre settings. I knew I must have been missing something really simple, and posted in desperation.

That explanation was excellent, and I have a much better understanding of what I'm actually inputting now.
I just input your code and it converted exactly as I wanted it. Many thanks, yogert909, and have a great day!
#10
Cool! I'm glad it worked.