Proof of concept: fully annotated, kindle-ready Japanese novels in PDF

#1
I just made this work and was way too excited about the idea to keep it to myself. It's definitely not perfect yet but the (rather large) picture ought to speak for itself: the first page of an entirely annotated Kindle-ready copy of Norwegian Wood!

Doing any book I happen to have in .txt format would now take me about 5 minutes. The resulting PDF for Norwegian Wood is just over 3 megs.

Unfortunately, jgloss, which is what I've used for annotations, fails rather horribly with two-syllable words written in hiragana in the text. I've yet to see a single circumstance where it didn't get them hilariously wrong. On the first page it translates "tachi" as "the dominant partner in a homosexual relationship." It translates "toki" as "Japanese crested ibis," and "nara" as "narcissistic" rather than simply "if." However, believe it or not, with words of more than two syllables written in hiragana (i.e. most gitaigo/giongo) and words written in kanji, it's bafflingly accurate. I'm positively smitten with the prospect of reading Mishima without having to constantly hover over a dictionary.

However, getting here was a bit of a process. I'm not a Python wiz by any stretch, so I had to use what tools I had available. I used Notepad++ to prune EDICT and ENAMDICT of parentheses with a bunch of regex, rotated all the full-width MS Mincho glyphs 90 degrees counterclockwise with Font Creator, and used jgloss with the Chasen morphological analyzer installed to get the annotations. At that point I tweaked the font sizes in jgloss and actually resized the jgloss window to avoid text getting cut off when I used the Bullzip PDF printer to output to PDF. Lastly, I used Acrobat to rotate the whole thing 180 degrees, as the Bullzip output was upside down, and then saved it as an Acrobat 6.0 compatible file. Acrobat is one of two non-free pieces of the puzzle (the other is Font Creator), but it could easily be replaced with PDF Toolkit.
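For anyone who would rather script the dictionary pruning than do it by hand in Notepad++, here is a minimal Python sketch of the idea. It is not the regex I actually used. It assumes the usual EDICT line layout of `HEADWORD [READING] /gloss/gloss/`, and the function name `prune_parens` is made up for illustration.

```python
import re

# One parenthesised group with no nested parentheses, e.g. "(n)" or "(P)".
PAREN = re.compile(r"\([^()]*\)")

def prune_parens(line: str) -> str:
    """Strip parenthesised tags from the gloss part of one dictionary line."""
    if "/" not in line:
        return line  # not an entry line (assumed layout), leave untouched
    head, _, rest = line.partition("/")
    glosses = []
    for gloss in rest.split("/"):
        # Remove the tags, then collapse any leftover runs of whitespace.
        cleaned = " ".join(PAREN.sub(" ", gloss).split())
        if cleaned:  # drop glosses that were nothing but a tag
            glosses.append(cleaned)
    return head + "/" + "/".join(glosses) + "/"

sample = "時 [とき] /(n) time/hour/moment/(P)/"
print(prune_parens(sample))
```

To run it over a whole file you would loop over the lines and write the results back out. The dictionary files are usually distributed in EUC-JP, so opening them with `encoding="euc_jp"` is probably what you want, but check your copy.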

The kludgiest part of it all is actually having to resize the screen so that Bullzip produces a PDF with proper line and page breaks. As far as I've seen, it's nigh impossible to get the horizontal page breaks quite right using this method, though you can get pretty close. Moreover, the way that jgloss fails with two-syllable hiragana words is just begging to be fixed. I've already begun to put some thought into it myself, though my Java skills were never good to begin with, and Java is what jgloss's openly available source code is written in and probably where you'd need to put the fix.
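One blunt way around the two-syllable hiragana problem would be to simply not gloss very short all-hiragana words. The sketch below shows that check in Python purely for illustration; it is not jgloss code (a real fix would live in jgloss's Java source), the function names are hypothetical, and "two syllables" is approximated here as "at most two kana characters".

```python
def is_hiragana(ch: str) -> bool:
    """True if ch falls inside the Unicode hiragana block."""
    return "\u3041" <= ch <= "\u309f"

def should_skip_gloss(word: str, max_len: int = 2) -> bool:
    """True for short all-hiragana words, which are the ones glossed badly."""
    return 0 < len(word) <= max_len and all(is_hiragana(c) for c in word)

for word in ["たち", "とき", "なら", "きらきら", "時"]:
    print(word, should_skip_gloss(word))
```

The trade-off is that you lose annotations for the occasional short hiragana word you genuinely don't know, which seems a fair price for getting rid of the ibises.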

Mostly just food for thought! In fact, I readily consider it to be the coolest damn addition to my Japanese study arsenal pretty much ever, and it alone makes my Kindle one of the best purchases in recent memory. If anybody is interested in getting this working for themselves, I'll readily help and answer questions. Not sure what the status is on sending around custom versions of MS Mincho, but that's pretty much the only thing you'd have to have in order to obviate the need for non-free stuff (ignoring of course the possibility that you may be getting your source .txt from somewhere other than Aozora Bunko!)
Edited: 2011-08-19, 5:09 am
#2
I think it's neat, but I find the English words too distracting. I read English really, really fast, even if it's sideways, upside-down or mirrored. -sigh- That means my brain reads the English first automatically, and only then can I try to read the Japanese... which is the wrong way around.

Have you considered having a mini-dictionary at the bottom of each page with the Japanese root-word and English meaning? (I say root-word because it's obvious that it's not picking up the verb tenses.)
#3
Yeah, actually. Jgloss lets you dump the annotations to a separate file if you want (and gives you the option to display only furigana, or only translation, or neither). The trick would be telling the annotation file where to put the page breaks, and then formatting it into each page. If it were just a matter of putting together a jgloss mod (though I have no clue how complicated that would turn out), it definitely seems doable. Otherwise you could just load the annotation file into Anki and study it separately. Norwegian Wood has 3k words or so.
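As a rough idea of the Anki route: Anki can import tab-separated text files, so the job is mostly reshaping the glossary into one word per line. The sketch below assumes you have already parsed the jgloss annotation dump into (word, reading, meaning) tuples; that parsing step is left out because it depends on the export format, and `to_anki_tsv` is a hypothetical helper name.

```python
import csv
import io

def to_anki_tsv(entries):
    """Render (word, reading, meaning) tuples as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for word, reading, meaning in entries:
        writer.writerow([word, reading, meaning])
    return buf.getvalue()

# Made-up sample entries, not real jgloss output.
entries = [("時", "とき", "time"), ("森", "もり", "forest")]
print(to_anki_tsv(entries), end="")
```

Save the result as a UTF-8 .txt file and map the three columns to whatever fields your note type uses when importing.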

Another thing you could do is have one page with only furigana annotation (the one you read), followed by the same page with just the English.

Though I will say, I'm a fast reader of English as well, and on an actual Kindle with the "darker" PDF setting the English text is so tiny and fat that it's impossible for me to see unless I actually turn the Kindle on its side. If this weren't the case, I would be distracted as well.

(Edit: I've ultimately decided to have completely annotation-free pages alternating with annotated pages, and I just flip back and forth between them if I need to "look up" a word on a page. Otherwise the page just feels too "busy".)
Edited: 2011-08-22, 4:18 am
#4
Thanks for sharing your method and tools! I could see how something like this could be very helpful.
#5
Oh yeah, and the process for furigana-only is actually a lot simpler:

1) Import the source text into jgloss and save the annotated version as plain text.
2) Load the new file into ChainLP.
3) Check the box for tategaki and the box for ruby and select your type of reader.
4) Save as PDF. ChainLP will work its incredible magic and spit out a thing of beauty. Upload to reader of choice.

Along the way, you can easily have jgloss output a glossary and use it to study separately, which would probably be great for short stories, though probably too onerous (3-4k word decks!) for novels.

Personally I think getting the English in there was well worth the extra effort. I usually hate furigana in study materials, because they draw my eye to the pronunciations even for kanji I've already mastered, let alone English. This, however, is meant for the stuff I've definitely never seen in my life -- and LOTS of it -- which several notable novelists (like Mishima) seem to throw around with aplomb.

Also: Jgloss by default will only annotate the first instance of each word, and this is critical. Authors really do tend to have a pet lexicon, so that by two-thirds of the way into any given novel there are going to be *substantially* fewer annotations. In the case of Norwegian Wood there are maybe one or two per page once you get past the exposition, if that. The real goal, then, is to get that far in as timely and uninterrupted a manner as possible, so that you can enjoy the home stretch with a bunch of new words in your brain and an easy way to look them up again if you really need to.
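To make the first-instance behaviour concrete, here is a small Python sketch of the general idea. It is not how jgloss implements it; it assumes the text is already split into pages of tokens, and the glossary and pages are toy data invented for the example.

```python
def first_occurrence_annotations(pages, glossary):
    """For each page, list glossary words appearing for the first time."""
    seen = set()
    result = []
    for tokens in pages:
        page_notes = []
        for tok in tokens:
            if tok in glossary and tok not in seen:
                seen.add(tok)  # later occurrences stay unannotated
                page_notes.append((tok, glossary[tok]))
        result.append(page_notes)
    return result

# Toy data: three "pages" that reuse the same small vocabulary.
glossary = {"森": "forest", "井戸": "well", "草原": "meadow"}
pages = [["森", "井戸", "森"], ["井戸", "草原"], ["森", "草原"]]

notes = first_occurrence_annotations(pages, glossary)
print([len(page) for page in notes])
```

With a repeating vocabulary the per-page counts shrink as you go, which is the same effect that thins out the annotations late in a novel.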
Edited: 2011-08-19, 1:37 pm
#6
cool, neat stuff