Hello,
I just downloaded a Japanese ebook in EPUB3 format (I actually paid for it, responsible citizen that I am). I need to convert it to a plain text file so I can use it in a learning tool. I tried some conversion tools I found on Google, but they mix the furigana right into the output text. Say I have a word:
たんじょうび
誕生日
in the Epub (たんじょうび being furigana, appearing above 誕生日). What I get in the resulting txt is this:
誕たん生じょう日び
This is obviously no good: it can't be processed further, because the markup that separated the furigana from the base text has been removed. So the removal needs to happen on the EPUB file itself, or during conversion.
I know EPUB is an open format, so it must be possible to do this without any hacking. I tried opening the file in a text editor to see if I could identify the furigana markup and remove it, but the file looked like binary (or at least I couldn't figure out the right encoding).
I've downloaded an eBook reader called Calibre, which has a convert function with a lot of options. But I couldn't find anything specific to furigana.
Any suggestions?
EDIT: Apparently the mechanism used for furigana is called "ruby characters", which is very inconvenient: when I google things like "remove ruby characters from epub", what I get are tutorials for text manipulation in a programming language called Ruby.
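For reference, here is a minimal sketch of what removing the furigana during conversion could look like. It rests on a few assumptions: that the EPUB is an ordinary ZIP archive of XHTML files (which would also explain why it looks binary in a text editor), that the book has no DRM, that the content files are UTF-8, and that the furigana is marked up with the standard HTML ruby elements, where `<rt>` holds the reading and `<rp>` holds fallback parentheses. The function names `strip_furigana` and `epub_to_text` are made up for illustration, and the regex-based tag stripping is a shortcut rather than a real HTML parser.

```python
import html
import re
import zipfile

# <rt> holds the furigana reading, <rp> holds fallback parentheses.
# Both are dropped together with their contents.
RUBY_ANNOTATION = re.compile(r"<(rt|rp)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
# The <head> section (title, styles, metadata) is not part of the book text.
HEAD = re.compile(r"<head\b.*?</head\s*>", re.DOTALL | re.IGNORECASE)
# Turn the end of common block elements into line breaks.
BLOCK_END = re.compile(r"</(p|div|h[1-6]|li)\s*>|<br\s*/?>", re.IGNORECASE)
# Every remaining tag is removed, keeping only the text between tags.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_furigana(xhtml: str) -> str:
    """Return the plain text of one XHTML document without ruby readings."""
    text = HEAD.sub("", xhtml)
    text = RUBY_ANNOTATION.sub("", text)
    text = BLOCK_END.sub("\n", text)
    text = ANY_TAG.sub("", text)
    return html.unescape(text)


def epub_to_text(epub_path: str) -> str:
    """Concatenate the plain text of every (X)HTML file in the archive.

    Assumption: files are visited in archive order, which is not guaranteed
    to match the reading order defined in the book's package document.
    """
    parts = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parts.append(strip_furigana(archive.read(name).decode("utf-8")))
    return "\n".join(parts)
```

Calling `epub_to_text("book.epub")` (a placeholder file name) would return the text with the readings dropped, so the example word should come out as 誕生日 rather than 誕たん生じょう日び. If chapters come out in the wrong order, the reading order would have to be taken from the package document inside the archive instead.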
Edited: 2015-05-13, 1:53 pm

