cb's JNovel Formatter

#26
pm215 Wrote:
nest0r Wrote: Here's what I meant. You'll see this on most (all?) Aozora texts, and I think the same goes for hypothetical novels in .txt format.

|:ルビの付く文字列の始まりを特定する記号 ("|: the symbol that marks the start of the string a ruby reading attaches to". This legend is generally at the beginning of a text.)

Example: (例)東山梨郡|萩原《はぎわら》村

via: http://www.aozora.jp/misc/cards/000283/f...ttoryu.txt

It seems to act as a marker for where the reading should be (i.e. which kanji in a compound).
Yeah. See section (5) of the Aozora docs, which talks about ruby markup and has some examples you can check your implementation against. Basically the "full" form is 武州|青梅《おうめ》の宿, where the bar says where the ruby base starts and the ruby itself is in the angle brackets. But the bar can be omitted if it would fall at the first boundary between "character classes" before the open angle bracket (where character class = kanji, kana, alphabet, etc.), which is the case most of the time.
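For anyone who wants to try the rule just described, here is a minimal sketch of a manual scan in Python. It is not iSoron's script or anything from the Aozora docs: the names `char_class` and `parse_ruby` are made up for illustration, the character-class ranges are a rough assumption, and both the ASCII and the full-width bar are accepted because I'm not sure which one a given file uses.

```python
def char_class(ch):
    """Rough character class. The ranges are an assumption, not the official table."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF or ch in "々〆ヶ":
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


def parse_ruby(text):
    """Split text into (base, reading) pairs; reading is None for plain text."""
    segments = []
    plain = ""   # text collected since the last ruby (or the start)
    bar = None   # offset in `plain` of an explicit bar, if one was seen
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "|｜":
            bar = len(plain)          # explicit start of the ruby base
            i += 1
        elif ch == "《":
            close = text.find("》", i)
            if close == -1:           # unterminated bracket: keep it as plain text
                plain += text[i:]
                break
            reading = text[i + 1:close]
            if bar is not None:
                start = bar
            else:
                # No bar: walk back to the first character-class boundary.
                start = len(plain)
                if plain:
                    cls = char_class(plain[-1])
                    while start > 0 and char_class(plain[start - 1]) == cls:
                        start -= 1
            if plain[:start]:
                segments.append((plain[:start], None))
            segments.append((plain[start:], reading))
            plain = ""
            bar = None
            i = close + 1
        else:
            plain += ch
            i += 1
    if plain:
        segments.append((plain, None))
    return segments
```

With the bar, `parse_ruby("武州|青梅《おうめ》の宿")` attaches おうめ to 青梅 only; without a bar, the base is whatever run of same-class characters sits directly before the 《.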

I wouldn't be surprised if trying to do this with regexes turned out to be more trouble than it was worth. The scripts in this thread from iSoron just do a manual scan along looking for transitions.
Actually that makes me more confident in that single find/replace regex above.

If that mark (does it have a name, other than "stick thingy"?) shows where the reading starts when a string of kanji runs across multiple words, so that readings can be notated without overlapping, and its absence means some non-kanji character already interrupts the string, then the range in the regex can be increased or even made unlimited, no? Like 1,6 or something.

Because the range refers to the CJK Unified Ideographs (plus whatever else one thinks should be included), in the majority of instances, which aren't marked by the stick thingy, it doesn't matter if the regex has a higher upper limit (like 6) than there are kanji in a given instance. The match stops at the end of the previous word, which won't end in a CJK Unified Ideograph, so the furigana will always start over the first character of the word immediately preceding the reading brackets.

And handling the stick thingy first, at the beginning as I have it (assuming it isn't working for me as described only by a fluke), sets the limit in the cases where kanji from several words form one unbroken run in the CJK Unified Ideographs range. In other words, it's a fail-safe for when a range of 6 or infinity would otherwise place the reading over the end of the previous word rather than starting at the beginning of the correct word.
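To make the argument concrete, here is a rough sketch in Python. It is not the regex from earlier in the thread, just one possible pattern built on the same idea. The names `RUBY` and `to_html`, the HTML `<ruby>` output, and the choice of U+4E00 to U+9FFF plus 々 as "kanji" are all assumptions for illustration. The bar alternative comes first, and the unmarked alternative uses an unbounded kanji run.

```python
import re

# Assumption: "kanji" means the basic CJK Unified Ideographs block plus 々.
KANJI = r"[\u4E00-\u9FFF々]"

RUBY = re.compile(
    # 1) Explicit bar (ASCII or full-width): the base is everything up to 《.
    r"[|｜](?P<marked>[^《》|｜]+)《(?P<r1>[^》]+)》"
    # 2) No bar: the base is an unbounded run of kanji directly before 《.
    r"|(?P<run>" + KANJI + r"+)《(?P<r2>[^》]+)》"
)


def to_html(text):
    """Convert Aozora-style ruby markup to HTML <ruby> tags (illustrative only)."""
    def repl(m):
        base = m.group("marked") or m.group("run")
        reading = m.group("r1") or m.group("r2")
        return f"<ruby>{base}<rt>{reading}</rt></ruby>"
    return RUBY.sub(repl, text)
```

With the bar present, 東山梨郡|萩原《はぎわら》村 puts the reading over 萩原 only. If the bar is deleted from that same string, the unbounded run swallows all six kanji, which is exactly the case the bar exists to prevent. Note that this sketch only handles kanji bases in the unmarked case; an unmarked base in katakana or Latin letters would need another alternative.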
Edited: 2011-02-09, 10:18 pm