
cb's JNovel Formatter

#1
cb's JNovel Formatter is a utility that will convert Japanese novels to nicely formatted HTML files.

Download JNovel Formatter via SourceForge (source code is available as well)

You will need Windows (XP/Vista/7/8/10) with .NET Framework v3.5 installed.

[Image: main.png]

Features:
▲ Most novel encodings are supported (SHIFT-JIS, UTF-8, UTF-16, etc.).
The output HTML will be encoded in UTF-8.

▲ Supports Aozora formatting (青空文庫形式), including:
- Ruby/furigana (ルビ・振り仮名). Kana that are placed over kanji to indicate
pronunciation.
- Emphasis (傍点). Marks or dots used to emphasize a word or passage.
- Images (挿絵)
- Underlining (傍線)
- Gaiji (外字)
- The following formatting constructs are not useful or not currently
supported and will be removed:
  - Comments (コメント). Comments are found between the 2 dashed lines at
  the start of the novel.
  - Indentation (字下げ、地上げ、中央揃え)
  - New page (改ページ)
  - [#改丁]

▲ Option to add bookmark anchors to "。" characters. Just click a "。" character
and bookmark the page. When you load the bookmark, the page will return to
the sentence that you left off at.

▲ Option to break up the novel into smaller files that won't hose up your browser.

▲ Options to change the HTML style (font, colors, orientation, alignment, etc.).

Firefox users: Vertical orientation requires version 41+.

▲ Can process a single novel or an entire directory of novels.

▲ No installation required.
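
The bookmark option above comes down to giving every "。" its own anchor, so that clicking it puts the anchor in the URL before you bookmark the page. A minimal Python sketch of that idea, assuming ids of the form `s1`, `s2`, ... (a hypothetical naming scheme, not necessarily what JNovel Formatter emits):

```python
import html


def add_sentence_anchors(text: str) -> str:
    """Wrap each '。' in a link that points at its own anchor.

    The id scheme (s1, s2, ...) is an assumption made for illustration.
    """
    out = []
    count = 0
    for ch in text:
        if ch == "。":
            count += 1
            # Clicking the link sets the URL fragment to #sN, so a bookmark
            # made afterwards reopens the page at this sentence.
            out.append(f'<a id="s{count}" href="#s{count}">。</a>')
        else:
            # Escape everything else so stray '<' or '&' stay literal.
            out.append(html.escape(ch))
    return "".join(out)
```

Calling `add_sentence_anchors("吾輩は猫である。")` should return the sentence with its final "。" wrapped in `<a id="s1" href="#s1">`.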

Have Fun!
cb4960
Edited: 2015-08-09, 5:04 pm
#2
Tips:

★ You can use the output of this utility with Rikaisama to add words directly to an Anki deck from your browser.

Related Software:

★ 青P. (Recommended). A Japanese program that will turn a Japanese novel into a nicely formatted PDF file with vertical text support. Highly configurable. Supports Aozora formatting. See this post for more information.

★ TextMiru. (Recommended). A Japanese novel viewer. Supports vertical text. Supports Aozora formatting. Highly configurable. Bookmark feature. Doesn't require installation. Text is fully selectable. The default text size is very difficult to read on a netbook-sized screen but it should be acceptable on a normal-sized desktop display. See this post to learn how to change the default layout.

★ FooSoft's yomichan Anki plugin. It has a built-in dictionary and allows you to add vocabulary cards directly to your deck. I haven't actually tried it myself though.

★ ArisuViewer. Another Japanese novel viewer. Supports vertical text. Supports Aozora formatting. Bookmark feature. Doesn't require installation. The text is NOT selectable!
Edited: 2015-08-09, 5:10 pm
#3
Very interesting! If they're .html files though, perhaps you can turn the readings into some sort of superscript? Edit: Or xhtml or whatever the Firefox ruby plugins identify. (The ones in question being HTML Ruby and XHTML Ruby Support: https://addons.mozilla.org/en-US/firefox...html-ruby/ and https://addons.mozilla.org/en-US/firefox...y-support/)

Found this: http://www.w3.org/TR/ruby/
Edited: 2011-02-07, 10:05 pm
#4
nest0r Wrote:Very interesting! If they're .html files though, perhaps you can turn the readings into some sort of superscript? Edit: Or xhtml or whatever the Firefox ruby plugins identify. (The ones in question being HTML Ruby and XHTML Ruby Support: https://addons.mozilla.org/en-US/firefox...html-ruby/ and https://addons.mozilla.org/en-US/firefox...y-support/)

Found this: http://www.w3.org/TR/ruby/
I'll see what I can do. Though, with the possible exception of names, there isn't much value added. That is, assuming that you are using rikaichan or a similar tool.

Edit: Actually now that I think about it, it is also useful for when the "reading" isn't really the reading.
Edited: 2011-02-08, 1:03 am
#5
Yeah, just going through a brief swath of text from a light novel, using furigana injector and comparing the rubi'd results in Firefox with the bracketed readings, I noticed a half dozen instances of kanji without readings, inaccurate readings, or incorrect readings. I think 90% of the time automatically generated readings or assuming the reading from searching in a dictionary will be fairly accurate and adequate, but I don't know, if the readings are there already in the text, and marked, it seems a shame to just discard them and get imperfect exposure to the language.

Edit: I just did a find/replace on the opening and closing brackets with <sup> and </sup> and I'd be happy to make do with that, even, for now. ;p (Though I'd want to look into ways to adjust the size and position.)
Edited: 2011-02-08, 12:24 am
#6
Oooh, think I found something good! http://www.dokuwiki.org/plugin:xhtmlruby - It has the regex thingy it used but I've no idea how to change it for the brackets.

Edit: Ironically it looks like that pomax person wrote it. ;p Pomax strikes again!

Edit 2: Okay here's my super rough solution because I'm a n00b. I did two find/replaces with regex.

find: (.)《
replace: <ruby><rb>\1</rb><rp>《</rp><rt>

find: 》
replace: </rt><rp>》</rp></ruby>

I did this on a .txt and saved as .html, then opened in Firefox and the HTML Ruby plugin displayed it just fine. It looks weird to have all the furigana over one character, though. Still, better than nothing.

I guess having it recognize whether it's 1-3 or 4 kanji would be better but I don't know how to do that. (Edit 3: Figured that out—the curly braces limitation of 1 through 3, preceded by a range of kanji codes taken from the thread rich/jimmyseal/fabrice participated in—but can't get it to work in UltraEdit.)

Edit 4: I think I got it! Well, for me at least. I switched to Perl for the regex in UltraEdit and did the first Find (as listed above) as ([\x{4e00}-\x{9fa5}]{1,3})《

I'm sure others can think of better ranges but that seemed to do the trick.

Final edit: So in closing, this works, a single find/replace using Perl regex in UltraEdit:

Find: ([\x{4e00}-\x{9fa5}]{1,3})《([\x{3040}-\x{30FF}]+)》
Replace: <ruby><rb>\1</rb><rp>《</rp><rt>\2</rt><rp>》</rp></ruby>

Ahem. Final final edit: Forgot about that marker that sometimes appears so this seems to work: |*([\x{4e00}-\x{9fa5}]{1,3})《([\x{3040}-\x{30FF}]+)》 (Although perhaps 1,4 is a better range?)
Edited: 2011-02-08, 6:25 am
#7
By the way, now that I've actually tried the program... I can't actually get the resulting .html files to display properly. None of the encodings I try work.
#8
nest0r Wrote:By the way, now that I've actually tried the program... I can't actually get the resulting .html files to display properly. None of the encodings I try work.
What browser are you using? What version? What does it look like? What was the encoding of the novel? The output should be in UTF-8, but perhaps I screwed something up. Does it look okay in Notepad++?

BTW, I only tested with FF 3.6 and IE 8.

Also, thanks for the ruby info in your other post.
Edited: 2011-02-08, 9:08 am
#9
I get errors in Win7 when I try to select non-TT fonts. Maybe find a way to make them non-selectable? (Or add support for them.) I wanted to use Adobe Heiti, but all I get is an error message saying non-TT fonts are not supported. Since I have a lot of fonts on my box, and I hate Mincho, this makes finding the font I want tricky.

Also, fonts with Kanji in their names don't show up in the drop-down box, like 小塚ゴシックPro.

EDIT: I get the same problem with character encodings. All I get is gobbledy-gook, in all encodings. Tried latest FF.
Edited: 2011-02-08, 2:14 pm
#10
cb4960 Wrote:
nest0r Wrote:By the way, now that I've actually tried the program... I can't actually get the resulting .html files to display properly. None of the encodings I try work.
What browser are you using? What version? What does it look like? What was the encoding of the novel? The output should be in UTF-8, but perhaps I screwed something up. Does it look okay in Notepad++?

BTW, I only tested with FF 3.6 and IE 8.

Also, thanks for the ruby info in your other post.
I tried it on Netscape Navigator 3.0! Just kidding. I tried it on both Firefox 3.6.something and Google Chrome 9.something. I tried literally probably about 20 encodings just to try and figure out if it showed up in anything, but no luck. Couldn't read the resulting .html file in Notepad++ or UltraEdit either.

Original file I used was Shift_JIS (edit: pretty sure I also converted the original file to UTF-8 and tried that with the same results, but I was half asleep so I can't remember). The thing is, Firefox says it's UTF-8. My expert intuition tells me that something happens where the text is processed in a gobbledy-gook state and thus frozen that way somehow? Yep.
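
To illustrate how text can get "frozen" as garbage while the file is still genuinely UTF-8, here is a small Python sketch. It is only a guess at the kind of failure involved, not a diagnosis of what JNovel Formatter actually does; the function names and the choice of Latin-1 as the wrongly assumed encoding are assumptions for the example.

```python
def freeze_mojibake(text: str, real: str = "shift_jis", assumed: str = "latin-1") -> bytes:
    """Simulate a converter that misreads the input encoding.

    The bytes are really `real`, but get decoded as `assumed` and then
    written out as UTF-8. The result is valid UTF-8 holding the wrong characters.
    """
    raw = text.encode(real)
    garbled = raw.decode(assumed)   # wrong interpretation of the bytes
    return garbled.encode("utf-8")  # garbage is now saved as proper UTF-8


def convert_correctly(text: str, real: str = "shift_jis") -> bytes:
    """Decode with the real encoding before writing UTF-8."""
    raw = text.encode(real)
    return raw.decode(real).encode("utf-8")


bad = freeze_mojibake("青空文庫")
good = convert_correctly("青空文庫")
print(bad.decode("utf-8") == "青空文庫")   # False
print(good.decode("utf-8") == "青空文庫")  # True
```

In the bad case the browser is right to report UTF-8, and no encoding chosen from its menu will bring the original text back.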

Unrelated: Just want to stick in the finished ruby regex product that works for me for now, for form's sake. At least with a .txt converted to UTF-8 on Windows 7, using UltraEdit's Perl regex and Firefox 3.6 w/ HTML Ruby add-on:

Find: |*([\x{4e00}-\x{9fa5}]{1,4})《([\x{3040}-\x{30FF}]+)》
Replace: <ruby><rb>\1</rb><rp>《</rp><rt>\2</rt><rp>》</rp></ruby>
Edited: 2011-02-08, 2:48 pm
#11
Thanks for the info guys. Sounds like there are some major issues with the encoding library that I'm using. It makes calls to the Windows API to heuristically determine the encoding of the input file (like IE or FF does). But maybe the API works differently on Vista/7 than it does on XP. I'll look into it and hopefully post a v2 soon.

Edit: I would also like to put in some sort of ruby support.

Edit 2: Looks like others using the same encoding library are also having Windows 7 issues.
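
For comparison, a much cruder way to guess the input encoding without the Windows API is to try a few candidates in order and keep the first one that decodes cleanly. The Python sketch below shows that swapped-in technique; it is not what the encoding library does, and the candidate list and its order are assumptions.

```python
from typing import Optional, Sequence

# Strict encodings first; more permissive ones would accept almost anything.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def guess_encoding(data: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes `data` without errors, else None."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


print(guess_encoding("青空文庫".encode("shift_jis")))  # shift_jis
print(guess_encoding("青空文庫".encode("utf-8")))      # utf-8
```

A clean decode only proves the bytes are legal in that encoding, not that it is the right one, so a guess like this can still be wrong.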
Edited: 2011-02-08, 3:13 pm
#12
Where can I download or purchase novels in text format?
#13
Same question, but it amounts to asking for illegal material, right? Which is not allowed per the forum rules. So I withdraw the question.

Are there free novels available? There are legally free novels available in English (usually very old ones) but I'm not quite sure where to get them.

Can you tell me where I can download Japanese fan fiction (or other similar free novels)?
#14
Aozora has a lot of free novels online.

http://www.aozora.gr.jp/

A Japanese friend pointed me at a couple of her favorites:

http://www.aozora.gr.jp/cards/000121/fil...13341.html

http://www.aozora.gr.jp/cards/000121/fil...14895.html
Edited: 2011-02-08, 7:18 pm
#15
suffah Wrote:Where can I download or purchase novels in text format?
This isn't the thread for such a discussion, but you might check this comment, making sure to read it all before clicking: http://forum.koohii.com/showthread.php?p...#pid126290

Or: http://forum.koohii.com/showthread.php?p...#pid114529
#16
Hello,

I have just released version 2.0 of JNovel Formatter.

Download JNovel Formatter v2.0 via MediaFire

Download JNovel Formatter v2.0 Source Code via MediaFire

What changed?

Fix: Input encoding is now user-selectable. JNovel Formatter will attempt to guess the encoding, but since it's not perfect the final decision is left to the user.

Thanks nest0r and rich_f!

(Oh, and if the encoding issue still isn't fixed on Windows 7, please let me know.)

New: Added HTML5 ruby tag option for readings.

Note: Some browsers don't support these tags yet. IE8+ does. Firefox requires the HTML Ruby add-on.

Thanks nest0r! In case you're interested, I used this C# regex based on your regex:
(?<Kanji>\p{IsCJKUnifiedIdeographs}{1,4})《(?<Reading>.*?)》

It works quite well. Here's an example:

[Image: ruby_example.png]

I should also note that rikaichan doesn't seem to have any issues with the ruby text.
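
For readers who want to experiment outside C#, here is a rough Python counterpart of the regex above. Python's built-in `re` module has no `\p{IsCJKUnifiedIdeographs}`, so the sketch assumes the explicit range U+4E00 to U+9FFF instead; the exact HTML that JNovel Formatter writes is not shown in this post, so the output markup and the name `to_html5_ruby` are also assumptions.

```python
import re

# Up to four kanji as the base, then anything (lazily) between 《 and 》.
RUBY = re.compile(r"(?P<Kanji>[\u4e00-\u9fff]{1,4})《(?P<Reading>.*?)》")


def to_html5_ruby(text: str) -> str:
    """Turn 漢字《reading》 into an HTML5 <ruby> element with <rp> fallbacks."""
    def repl(m: "re.Match[str]") -> str:
        return (
            f"<ruby>{m.group('Kanji')}"
            f"<rp>《</rp><rt>{m.group('Reading')}</rt><rp>》</rp></ruby>"
        )
    return RUBY.sub(repl, text)


print(to_html5_ruby("萩原《はぎわら》村"))
# <ruby>萩原<rp>《</rp><rt>はぎわら</rt><rp>》</rp></ruby>村
```

Since the reading group accepts any characters, annotations that are not kana (the cases where the "reading" isn't really the reading) are kept as well.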

Edit: This version doesn't include a fix to rich_f's font dialog issues. That one may be a bit more tricky.

cb4960
Edited: 2011-02-08, 11:14 pm
#17
Woohoo! It works splendidly. Thanks for writing all this great stuff, you've been on a roll lately.

I don't have problems with selecting and using fonts with kanji, but I confirm the error regarding OpenType (rather than TrueType, for example).

Also, an extreme trifle, but I shall mention it regardless: I noticed that the | is still present in the finished text. Does the way I added it to the beginning of the regex with an asterisk not work well? I thought it might be good to use it when it appears as a kind of extra boundary to prevent overlap in rare cases.

Edit: Also, HTML Ruby has a version for Opera: http://htmlrubyopera.codeplex.com/ and I think Google Chrome is fancy and supports ruby annotation without the need for an add-on.

I don't think the TrueType thing really matters either (Google said it might have something to do with unicode mapping tables?), though perhaps eliminating non-TT stuff as options, or having a warning would be good.

Edit: to note I removed other edits. I think I had it vaguely right the first time. I was in the zone for 5 minutes with regex and now I've no idea what I was thinking about the stick thingy at the time. ;p I guess I was thinking it would prevent overlap, but I think that's wrong for the overlap I was thinking of. But at the same time if it designated where a reading should go for certain kanji in certain compounds, then maybe it works to incorporate it? But then maybe it should be used to count characters and create a limit to the range function? And really one could easily just ditch the blasted thing entirely. My brain hurts.
Edited: 2011-02-09, 1:04 am
#18
I thought the | was a typo. You've kinda lost me with its intended purpose. You say "the | is still present in the finished text." Where exactly do you see this? Do you actually see a pipe "|" character somewhere? Do you notice the problem with the ruby screenshot that I provided? If so, could you point it out? Maybe I'm just being dense again.

Also, thanks for your analysis of the font issue and your research into ruby support for other browsers.

Edit: Ok, I just read your edit... I think I need some sleep.
Edited: 2011-02-09, 1:08 am
#19
cb4960 Wrote:I thought the | was a typo. You've kinda lost me with its intended purpose. You say "the | is still present in the finished text." Where exactly do you see this? Do you actually see a pipe "|" character somewhere? Do you notice the problem with the ruby screenshot that I provided? If so, could you point it out? Maybe I'm just being dense again.

Also, thanks for your analysis of the font issue and your research into ruby support for other browsers.
No problem, just trying to give back despite being a n00b.

Here's what I meant. You'll see this on most (all?) Aozora texts and I think that goes for hypothetical novels in .txt format.

|:ルビの付く文字列の始まりを特定する記号 (This is generally at the beginning of a text.)

Example: (例)東山梨郡|萩原《はぎわら》村

via: http://www.aozora.jp/misc/cards/000283/f...ttoryu.txt

It seems to act as a marker for where the reading should be (i.e. which kanji in a compound).

So after coming up with the regex thingy I remembered through reviewing stuff pm215 and rich_f said in another thread that I'd forgotten to incorporate that, so I experimented and stuck that symbol followed by the asterisk at the beginning, and it seemed to work. By ‘work’ I mean that it replaced that bar and the kanji following it with the ruby code (within the 4 character range) where it appeared and the regular expression also continued, in that same find/replace, to rubify kanji/kana as normal where the symbol didn't appear. Not sure why it worked and still iffy on what that symbol does and how to improve the regex, but I decided to roll with it. ;p

Edit: And by still being in the finished text I just meant that JNF doesn't remove it/use it or whatnot, it's still there in the text in the style the example demonstrates.
Edited: 2011-02-09, 1:19 am
#20
I just use ArisuViewer. It takes .txt files containing ruby and images and parses them into book format in whatever way you want (text font, size, furigana size, text orientation, romaji orientation, etc. can be adjusted). The default title page looks something like this:

[Image: arisuviewer.jpg]

It requires no installation to use, so you can run it off of a flash drive. I keep all of my book files in the same folder as the ArisuViewer.exe file for convenience and portability.

It also saves your place in the book you're reading for you. When you reopen any file with it, it opens up at the page you were on when you closed it.

It was recommended to me in one of those "read me" files that came with a book I downloaded and I'm very pleased with it. You can download and read more about it here: http://www.forest.impress.co.jp/lib/offc...iewer.html

I'm not knocking the OP's program here. cb's Formatter looks very nice if you want to read books in your web browser for whatever reason (rikaichan comes to mind =P). However, nothing beats ArisuViewer when it comes down to simply reading Japanese books on your computer screen.
Edited: 2011-02-09, 3:28 am
#21
@hereticalrants

I don't know, it's nice, but it doesn't seem much different to me than smoopy and various other readers. It doesn't seem to have selectable text? I haven't looked through it, but I imagine there's some text editing option in this viewer similar to smoopy, where you have to essentially open it in a user-specifiable, separate text editor? It's not merely for definitions: having the text selectable and in a browser format opens it to many options for expansion/extensibility, I think.

I don't want to particularly argue the issue in this thread, though. I think we were all familiar with the existence of such readers before cb4960 made the JNF. Thanks for sharing this resource, though, it forms a nice pair with smoopy for when I want vertical text (at the expense of other constraints). Although there's also Wakaru for the iOS when I want to read on the iDevice(s).
Edited: 2011-02-09, 4:29 am
#22
someday when im old and graying, ill tell the story of how one day, CB came and changed the way we all learned japanese forever
#23
@hereticalrants
Thanks for the tip. I added a link to ArisuViewer in my second post.
#24
nest0r Wrote:So after coming up with the regex thingy I remembered through reviewing stuff pm215 and rich_f said in another thread that I'd forgotten to incorporate that, so I experimented and stuck that symbol followed by the asterisk at the beginning, and it seemed to work. By ‘work’ I mean that it replaced that bar and the kanji following it with the ruby code (within the 4 character range) where it appeared and the regular expression also continued, in that same find/replace, to rubify kanji/kana as normal where the symbol didn't appear. Not sure why it worked and still iffy on what that symbol does and how to improve the regex, but I decided to roll with it. ;p
OK, I get it now. It takes me a while sometimes. I'll try to add that tonight along with Linux support and other minor things.
Edited: 2011-02-09, 3:19 pm
#25
nest0r Wrote:Here's what I meant. You'll see this on most (all?) Aozora texts and I think that goes for hypothetical novels in .txt format.

|:ルビの付く文字列の始まりを特定する記号 (This is generally at the beginning of a text.)

Example: (例)東山梨郡|萩原《はぎわら》村

via: http://www.aozora.jp/misc/cards/000283/f...ttoryu.txt

It seems to act as a marker for where the reading should be (i.e. which kanji in a compound).
Yeah. See section (5) of the Aozora docs, which covers ruby markup and has some examples you can check your implementation against. Basically the "full" form is 武州|青梅《おうめ》の宿, where the bar says where the ruby base starts and the ruby itself is in the double angle brackets. But the bar can be omitted if it would fall at the first boundary between "character classes" before the opening bracket (where character class = kanji, kana, alphabet, etc.), which is the case most of the time.

I wouldn't be surprised if trying to do this with regexes turned out to be more trouble than it was worth. The scripts in this thread from iSoron just do a manual scan along looking for transitions.
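
To make the transition idea concrete, here is a minimal Python sketch of such a scan. It is my own illustration, not iSoron's scripts: the character-class ranges are simplified assumptions, and it ignores gaiji notes and every other Aozora annotation.

```python
import re


def char_class(ch: str) -> str:
    """Very rough character classes; the ranges are simplifying assumptions."""
    if "\u4e00" <= ch <= "\u9fff" or ch in "々〆":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isascii() and ch.isalnum():
        return "latin"
    return "other"


def convert_ruby(text: str) -> str:
    """Convert base《reading》 to ruby markup, honouring an explicit | marker."""
    out = []
    pos = 0
    for m in re.finditer(r"《([^》]*)》", text):
        segment = text[pos:m.start()]
        bar = segment.rfind("|")
        if bar != -1:
            # Explicit marker: the base is everything after the bar.
            before, base = segment[:bar], segment[bar + 1:]
        else:
            # No marker: walk back until the character class changes.
            i = len(segment)
            if segment and char_class(segment[-1]) != "other":
                cls = char_class(segment[-1])
                while i > 0 and char_class(segment[i - 1]) == cls:
                    i -= 1
            before, base = segment[:i], segment[i:]
        out.append(before)
        if base:
            out.append(f"<ruby>{base}<rp>《</rp><rt>{m.group(1)}</rt><rp>》</rp></ruby>")
        else:
            out.append(m.group(0))  # nothing to annotate; leave the text alone
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)


print(convert_ruby("武州|青梅《おうめ》の宿"))
# 武州<ruby>青梅<rp>《</rp><rt>おうめ</rt><rp>》</rp></ruby>の宿
```

Without the bar, an input like これは萩原《はぎわら》村 has its base found by walking back from 《 until the class changes from kanji to hiragana, so only 萩原 is annotated.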