I just made this work and was way too excited about the idea to keep it to myself. It's definitely not perfect yet but the (rather large) picture ought to speak for itself: the first page of an entirely annotated Kindle-ready copy of Norwegian Wood!
Doing any book I happen to have in .txt format would now take me about 5 minutes. The resulting PDF for Norwegian Wood is just over 3 megs.
Unfortunately, jgloss, which is what I've used for the annotations, fails rather horribly on two-syllable words written in hiragana. I've yet to see a single case where it didn't get them hilariously wrong. On the first page it translates "tachi" as "the dominant partner in a homosexual relationship," "toki" as "Japanese crested ibis," and "nara" as "narcissistic" rather than simply "if." However, believe it or not, with hiragana words of more than two syllables (i.e. most gitaigo/giongo) and words written in kanji, it's bafflingly accurate. I'm positively smitten with the prospect of reading Mishima without having to constantly hover over a dictionary.
However, getting here was a bit of a process. I'm not a Python wiz by any stretch, so I had to use the tools I had available. I used Notepad++ to prune EDICT and ENAMDICT of parentheses with a bunch of regex, rotated all the full-width MS Mincho glyphs 90 degrees counterclockwise with Font Creator, and used jgloss with the Chasen morphological analyzer installed to get the annotations. At that point I tweaked the font sizes in jgloss and actually resized the jgloss window to avoid text getting cut off when I used the Bullzip PDF printer to output to PDF. Lastly, I used Acrobat to rotate the whole thing 180 degrees, as the Bullzip output was upside down, and then saved it as an Acrobat 6.0 compatible file. Acrobat's one of two non-free pieces of the puzzle (the other is Font Creator), and it could easily be replaced with PDF Toolkit.
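For anyone who'd rather script the dictionary pruning than do it by hand in Notepad++, here's a rough Python sketch of the same idea. It's not the regex I actually used: the pattern, the function names, and the sample line are all made up for illustration, and the sample is a simplified EDICT-style entry rather than a line copied from the real file. It strips every parenthesised group and then collapses any empty glosses that leaves behind.

```python
import re

# One parenthesised group with no nesting inside it, plus trailing whitespace.
# (Assumed pattern for illustration; not the regex used in Notepad++.)
PAREN = re.compile(r"\([^()]*\)\s*")
# Runs of slashes left behind when a gloss consisted only of a tag.
EMPTY_GLOSS = re.compile(r"/{2,}")


def prune_parens(line: str) -> str:
    """Remove parenthesised tags such as (n) or (P) from one dictionary line."""
    previous = None
    while previous != line:  # repeat so nested groups get peeled off too
        previous = line
        line = PAREN.sub("", line)
    return EMPTY_GLOSS.sub("/", line)


def prune_file(src: str, dst: str, encoding: str = "euc_jp") -> None:
    """Prune a whole dictionary file line by line.

    The default encoding assumes the EUC-JP edition of EDICT; change it if
    your copy is in another encoding.
    """
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(prune_parens(line))


if __name__ == "__main__":
    sample = "時 [とき] /(n) time/(P)/"  # simplified, hypothetical entry
    print(prune_parens(sample))
```

Whether dropping every parenthesised group is really what you want depends on the dictionary, since some groups carry usage notes rather than grammar tags, so treat this as a starting point.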
The kludgiest part of it all is having to resize the jgloss window so that Bullzip produces a PDF with proper line and page breaks. As far as I've seen, it's nigh impossible to get the horizontal page breaks quite right using this method, though you can get pretty close. Moreover, the way jgloss fails with two-syllable hiragana words is just begging to be fixed, and I've already begun to put some thought into it myself. The fix would probably have to go into jgloss's openly available source code, though, which is Java, and my Java skills were never good to begin with.
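To give a feel for the kind of fix I mean, here's a minimal sketch of one possible workaround: simply skip annotations for very short hiragana-only words. This is Python for illustration only, not jgloss code, and the function names and the three-character threshold are my own assumptions. It also counts characters rather than syllables, which is only a rough proxy (small kana like ょ count as a character here).

```python
def is_hiragana(text: str) -> bool:
    """True if every character falls in the Unicode hiragana range."""
    return bool(text) and all("\u3041" <= ch <= "\u309f" for ch in text)


def should_annotate(token: str, min_kana_len: int = 3) -> bool:
    """Decide whether a token should get a dictionary annotation.

    Hiragana-only tokens shorter than min_kana_len are skipped, since those
    are the ones that tend to be mistranslated. Everything else is kept.
    """
    if is_hiragana(token) and len(token) < min_kana_len:
        return False
    return True


if __name__ == "__main__":
    for word in ["たち", "とき", "なら", "きらきら", "時"]:
        print(word, should_annotate(word))
```

The obvious downside is that this throws away correct annotations for short hiragana words too, so a real fix inside jgloss would presumably want to lean on the part-of-speech information from the morphological analyzer instead of a blunt length cutoff.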
Mostly just food for thought! That said, I readily consider it the coolest damn addition to my Japanese study arsenal pretty much ever, and it alone makes my Kindle one of the best purchases in recent memory. If anybody is interested in getting this working for themselves, I'll gladly help and answer questions. I'm not sure what the status is on sending around custom versions of MS Mincho, but that's pretty much the only thing you'd need in order to obviate the need for non-free stuff (ignoring, of course, the possibility that you may be getting your source .txt from somewhere other than Aozora Bunko!).
Edited: 2011-08-19, 5:09 am
