How many readings do you need to know?

yukamina Member
From: Canada Registered: 2006-01-09 Posts: 761

nest0r wrote:

Jarvik7 wrote:

The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.

I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

Not sure, but I think the Firefox addon HTML Ruby takes care of that.

Reply #102 - 2010 February 16, 1:17 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

iSoron wrote:

nest0r wrote:

I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

This is what I use:

    http://isoron.org/stuff/japanese/aozora/aozora_to_html
    http://isoron.org/stuff/japanese/aozora/aozora_ruby.py

More scripts, sorry. tongue

And iSoron, dude, I don't know what you're talking about! If anything, the OCRs seem disturbingly good, as if some robot went through and hand-corrected everything page by page.

I've seen some pretty bad ones, with page headers/numbering still there, for example. But well, never mind. Will recheck later.

You're worse than bombperson! I've still got a collection of scripts somewhere from them, with a mental note to one day figure them out. Damn sciencey brainiacs learning Japanese. They're a gift..... and a curse.

You had me freaked out, thinking I'd wasted my time gather--I mean relieved I'm not a pirate. ^_^ But I'm good now, haven't noticed any problems, myself.

Hmm, maybe you're just going by the initial pages (stuff before the actual stories begin)? I noticed some of those are messy, perhaps due to the mixture of 'logo'-like phrases and cover images preceding the actual book, and then there are the files that explain the symbols such as [#], 《》, etc. Most confusing, I think, are the txts that simply place the furigana on its own line, so you can't really tell until you read the line after and see where the kanji would line up with the short all-kana line above.

I did notice the page-number stuff on initial pages a couple times, but those were easily identified and didn't seem to continue. The only consistent eye-catching format problem seems to be the indications of words with 傍点 (emphasis dots; I thought it was something else, but nope, that seems to be it: the txt simply brackets and repeats the associated word with 'に傍点').

Last edited by nest0r (2010 February 16, 1:38 pm)

Reply #103 - 2010 February 16, 1:19 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

yukamina wrote:

nest0r wrote:

Jarvik7 wrote:

The bracketed ones are aozora-bunko format, so I assume if you open it in an appropriate reader it will look pretty.

I never could get any of those readers and suchlike to work properly. Is there a program/addon thingy that turns those brackets back into ruby without a special reader?

Not sure, but I think the Firefox addon HTML Ruby takes care of that.

Doesn't seem to work for stuff like 漢字《かんじ》 :/

Reply #104 - 2010 February 16, 1:30 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Jarvik7 wrote:

You could write a shell script to turn it into HTML (or just do a search and replace).. Safari now supports ruby tags (in the nightlies)

I guess I'll just make a copy of the file so I can refer back to the original, and have one version with all those bracketed furigana sets removed. Yep.

Edit: Yes! I now have the original Legend of Galactic Heroes sci-fi txts--I mean uh 'texts', sorry, typo *wink wink*.

Last edited by nest0r (2010 February 16, 1:32 pm)

Reply #105 - 2010 February 16, 2:38 pm
pm215 Member
From: UK Registered: 2008-01-26 Posts: 1354

iSoron wrote:

    http://isoron.org/stuff/japanese/aozora/aozora_to_html
    http://isoron.org/stuff/japanese/aozora/aozora_ruby.py

Cool; I didn't realise you could programmatically identify which of the preceding kanji the ruby applied to in the aozora format. (Also, it's a 'ruby' script but it's a python script :-))
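For the curious, the identification pm215 mentions can be sketched with one regex: a run of kanji immediately before 《…》 is taken as the base text, and aozora's ｜ marker handles bases that aren't plain kanji runs. This is just an illustrative sketch, not the logic from iSoron's scripts:

```python
import re

# A kanji run (CJK unified ideographs plus the iteration marks 々 and 〆)
# immediately before 《reading》 is taken as the base text; an explicit ｜
# marks where the base starts when it isn't a plain kanji run.
RUBY = re.compile(r'(?:｜([^《]+)|([々〆\u4e00-\u9fff]+))《([^》]+)》')

def find_ruby(line):
    """Return a list of (base, reading) pairs found in one line."""
    return [(m.group(1) or m.group(2), m.group(3)) for m in RUBY.finditer(line)]

print(find_ruby('その馬鹿気《ばかげ》た話'))
# → [('馬鹿気', 'ばかげ')]
```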

Reply #106 - 2010 February 16, 3:01 pm
Thora Member
From: Canada Registered: 2007-02-23 Posts: 1691

Nestor: Sounds like you've done a pile of work! I wouldn't mind doing a bit of proofreading, if that would help. Perhaps if a few of us did a bit each?

Jarvik7: pls let us know what you think of NHKにようこそ! Also, btw, what was the name of the dictionary you recommended in a recent thread?  I know you collect dictionaries, so I'm curious about this one. (started with "ko"?)

Reply #107 - 2010 February 16, 4:44 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I've gone through some more and compared with scans. The ones with justified blocks of text seem to be the most problematic, as they stick the furigana on the same lines as the base sentence. For my purposes this isn't a problem, and I think it's also not a problem for statistical analyses.

I think, of the texts with the romaji titles, the person who posted them converted the first couple themselves (and those were glitchy), but at some point they decided to just grab properly revised txts from elsewhere. Essentially 'scene' quality releases are the dominant form in the batches we aren't talking about because that would be wrong.

But yes, feel free to compare scans w/ txts, but be sure to keep in mind where the initial words line up (vs. table of contents and those other initial image-heavy pages before the prologues) and the often bracketed formatting aspects.

Last edited by nest0r (2010 February 16, 4:44 pm)

Reply #108 - 2010 February 16, 5:07 pm
pm215 Member
From: UK Registered: 2008-01-26 Posts: 1354

iSoron wrote:

    http://isoron.org/stuff/japanese/aozora/aozora_to_html
    http://isoron.org/stuff/japanese/aozora/aozora_ruby.py

I tweaked these a little not to need to write to a temporary file:
http://www.chiark.greenend.org.uk/~pmay … ra_to_html
http://www.chiark.greenend.org.uk/~pmay … ra_ruby.py

Reply #109 - 2010 February 16, 8:32 pm
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

Thora wrote:

Jarvik7: pls let us know what you think of NHKにようこそ! Also, btw, what was the name of the dictionary you recommended in a recent thread?  I know you collect dictionaries, so I'm curious about this one. (started with "ko"?)

The only dictionaries I remember mentioning recently are the Dictionaries of Basic/intermediate/advanced Grammar, and 研究社和英中辞典 / 漢字源 / 大辞林 for iphone.

On the topic of light novels, which has taken over this thread, I found this interesting post on the JP intarwebs:
425 :[名無し]さん(bin+cue).rar[sage]:2008/11/16(日) 01:12:11 ID:VlRzgjvF0
10代で読んでいないと恥ずかしい必読書 ("must-read books it's embarrassing not to have read in your teens")

谷川流『涼宮ハルヒの憂鬱』角川スニーカー文庫
高橋弥七郎『灼眼のシャナ』電撃文庫
ヤマグチノボル『ゼロの使い魔』MF文庫J
奈須きのこ『空の境界』同人作品
今野緒雪『マリア様がみてる』コバルト文庫
上遠野浩平『ブギーポップ』電撃文庫
鎌池和馬『とある魔術の禁書目録』電撃文庫
神坂一『スレイヤーズ!』富士見書房
ハセガワケイスケ『しにがみのバラッド。』電撃文庫
水野良『ロードス島戦記』角川スニーカー文庫
喬林知『今日からマのつく自由業!』角川ビーンズ文庫
橋本紡『半分の月がのぼる空』電撃文庫
秋田禎信『魔術師オーフェン』富士見書房, ブログ
賀東招二『フルメタル・パニック!』富士見書房
桑原水菜『炎の蜃気楼』コバルト文庫
竹宮ゆゆこ『とらドラ!』電撃文庫
田中芳樹『銀河英雄伝説』徳間書店
築地俊彦『まぶらほ』富士見書房
栗本薫『グイン・サーガ』早川書房
秋山瑞人『イリヤの空、UFOの夏』電撃文庫
三田誠『レンタルマギカ』角川スニーカー文庫
小野不由美『十二国記』講談社X文庫―ホワイト ハート
深沢美潮『フォーチュン・クエスト』角川スニーカー文庫, 電撃文庫
時雨沢恵一『キノの旅』電撃文庫

Last edited by Jarvik7 (2010 February 16, 8:38 pm)

Thora Member
From: Canada Registered: 2007-02-23 Posts: 1691

Thanks anyway J7. I finally found the thread. It was kotonoko - which I gather is a Mac EPWING viewer? (I read it as kotonoko w/ EBWin and thought it was a dictionary.)

PS  That deleted one really didn't seem like much of a light novel. ;-)
PPS  2nd gold for Canada! You're missing a few parties.

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

I removed it since it's linguistically an atrocity, not natively Japanese, and not a light novel tongue I do recommend that people read it though because it's the best way to realize how backwards and ridiculous organized religion is. I read it (KJ) while I was in elementary school. I got a free hardcover bilingual copy of the new testament, but no old testament. I guess missionaries try to hide the more ugly parts of Christianity until you're converted tongue

PS: I'm watching Japanese women's curling on the news now. It's way less energetic than Canadian curling where there are burly men shouting HAAAAARRRRRRRDDDDD all the time.

Last edited by Jarvik7 (2010 February 17, 12:18 am)

Thora Member
From: Canada Registered: 2007-02-23 Posts: 1691

Missed those guys. At the hill, there were complaints that the Canadian boardercross guys' pants were too tight though.
[Edit: er... that sounded weird. I was referring to the claim that they were gaining a competitive advantage by wearing narrower (more aerodynamic?) pants rather than abiding by the gentleman's agreement to wear cooler baggy snowboard pants.] /OT =]

Last edited by Thora (2010 February 17, 12:13 pm)

Reply #113 - 2010 February 17, 3:20 pm
iSoron Member
From: Canada Registered: 2008-03-24 Posts: 490

nest0r wrote:

Essentially 'scene' quality releases are the dominant form in the batches we aren't talking about because that would be wrong.

Yes, I think that would be enough. It's a shame.

Question: do you guys think visual novels should be allowed in the corpus?

pm215 wrote:

I tweaked these a little not to need to write to a temporary file:

Nice. smile

Reply #114 - 2010 February 17, 4:15 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

iSoron wrote:

nest0r wrote:

Essentially 'scene' quality releases are the dominant form in the batches we aren't talking about because that would be wrong.

Yes, I think that would be enough. It's a shame.

Question: do you guys think visual novels should be allowed in the corpus?

pm215 wrote:

I tweaked these a little not to need to write to a temporary file:

Nice. smile

No, scene quality is goo--oh, *wink wink*, yes a 'shame'.

Has anyone else checked out these imaginary txts? I still haven't found any problems, but I kind of got lost at some point in the sea of 一般小説 blah blah stuff.

No idea about including visual novels. Aren't they basically the same as light novels, just shorter?

Last edited by nest0r (2010 February 17, 4:15 pm)

Reply #115 - 2010 February 23, 3:21 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

I don't suppose someone could explain how to run these scripts that could theoretically convert a text file with ruby in it into an html file? I've tried running them on my mac under python 2.6.4, and all I get is an errorfest.

Here's what I tried: Jamming all 5 files into a directory with the txt file to be converted, then running the bash script on the txt file I want to convert. I've got python 2.6.4 properly installed, and it's set to be the default. (And it's in the PATH.) At some point, it pukes on the UTF encoding and conks out.

What am I doing wrong? I'm not an old hand at python or bash, but I'm pretty good at figuring things out if you give some directions.

(And if there's a way to run it in WinXP, I'd be even happier...)
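In case it helps rich_f: one guess (without seeing the actual traceback) is Python 2 choking when it prints UTF-8 to a redirected stdout. Forcing the IO encoding before running the script sometimes clears that up; the filename below is just a placeholder:

```shell
export PYTHONIOENCODING=utf-8
export LC_ALL=en_US.UTF-8    # or any UTF-8 locale installed on the system
# then re-run the converter on the file, e.g.:
# ./aozora_to_html yourbook.txt > yourbook.html
```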

Reply #116 - 2010 February 23, 4:11 pm
yudantaiteki Member
Registered: 2009-10-03 Posts: 3619

iSoron wrote:

Question: do you guys think visual novels should be allowed in the corpus?

I don't see why not. ひぐらしの鳴く頃に is the only one I've "read" but it has at least as much text as a light novel (and I had to look up way too many kanji...)

Reply #117 - 2010 February 23, 4:19 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Anyone able to get the readers smoopy, tobira, or azur to display the ルビ in any of the formats in the various txt files none of us have?

http://jpzin.com/reading-software-for-aozora/

tobira/smoopy require applocale to work properly, which I can't get to work on my PC, so I tried out the trial of azur, and it displays the text with the furigana in various brackets.

Hmm, looking into the t-time thingy, just realized t-time isn't a release group, it's a format with ruby tags, and there's a 15mb program for it. ;p

http://www.voyager.co.jp/hodo/070928_hodo_e.html

Well none of it works for me, just displays gibberish. But theoretically for ttz (t-time) files, at least, as they already have BR code for line breaks, I could find/replace the ruby tags and open in Firefox as html. I have the xhtml ruby support add-on, but what tags would I need to replace the t-time stuff with? (Also could apply this to the other aozora txt files I suppose, using some sort of software that adds line breaks--Frontpage? And converting those double-brackets into ruby tags.)

Last edited by nest0r (2010 February 23, 5:00 pm)

Reply #118 - 2010 February 23, 5:02 pm
hereticalrants Member
From: Winterland Registered: 2009-10-23 Posts: 289

Benny Bullocks wrote:

hey, yeh dun't havvta no nonnna dem kanjiis. evn dem japs dunt no der kanjiis.

Hahaha.

Last edited by hereticalrants (2010 February 23, 5:27 pm)

Reply #119 - 2010 February 24, 3:28 am
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

So I think I've slowly worked my way towards what some of you were talking about earlier. So, given that I can't run any of the readers, or rather, they only display gibberish, but of course Firefox + xhtml ruby works, would my mission, if I want to de-uglify the bracketed readings in certain texts, be to convert them into:

<ruby><rb>馬鹿気</rb><rp>(</rp><rt>ばかげ</rt><rp>)</rp></ruby> ? That is to say, I need some simple way to tell a text editor to find instances of the furigana readings, replace them with those tags, as well as the kanji words prior to them. I suppose, as Jarvik I think mentioned, the only other option would be simply to remove them. ^_^
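That find/replace can be sketched as a single substitution, assuming the simple case (a plain kanji run before 《かんじ》, no ｜ markers):

```python
import re

PAT = re.compile(r'([々〆\u4e00-\u9fff]+)《([^》]+)》')

def to_ruby_tags(text):
    # Wrap each kanji-run + 《reading》 pair in <ruby> markup; the <rp>
    # parentheses keep the text readable in browsers without ruby support.
    return PAT.sub(r'<ruby><rb>\1</rb><rp>(</rp><rt>\2</rt><rp>)</rp></ruby>',
                   text)

print(to_ruby_tags('馬鹿気《ばかげ》た'))
# → <ruby><rb>馬鹿気</rb><rp>(</rp><rt>ばかげ</rt><rp>)</rp></ruby>た
```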

Reply #120 - 2010 February 24, 3:44 am
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

@nest0r: Use a text editor that supports grep / regular expressions for its search and replace function. Using that you can convert to html/css ruby or remove altogether.

Reply #121 - 2010 February 24, 4:44 am
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Jarvik7 wrote:

@nest0r: Use a text editor that supports grep / regular expressions for its search and replace function. Using that you can convert to html/css ruby or remove altogether.

I think I'm just going to remove them with basic find/replace, too lazy to learn regex and figure out how to replace preceding kanji or whatnot.

Frankly I'm not even sure why I care beyond my own OCD, it's not like I use them. (Stardict + my particular methods for self-study.) Suppose I was just being stubborn, once I realized that for no good reason readers don't work for me. ;p Having a streamlined copy (in addition to original for reference, plus there won't always be a handy dictionary parallel for readings) for certain strategies of study I'm still working out is good enough. /end ramble
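For what it's worth, the removal route needs no regex knowledge beyond one pattern; a minimal sketch:

```python
import re

def strip_furigana(text):
    # Drop every 《reading》 group and any ｜ base-text markers,
    # leaving only the base text behind.
    return re.sub(r'｜', '', re.sub(r'《[^》]*》', '', text))

print(strip_furigana('その｜馬鹿気《ばかげ》た話'))  # → その馬鹿気た話
```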

Last edited by nest0r (2010 February 24, 4:54 am)

Reply #122 - 2010 February 24, 5:27 am
pm215 Member
From: UK Registered: 2008-01-26 Posts: 1354

nest0r wrote:

I think I'm just going to remove them with basic find/replace, too lazy to learn regex and figure out how to replace preceding kanji or whatnot.

Regular expressions are a tool which is a pain to learn but which lets you bypass a huge amount of manual messing around later on. It's a bit of an upfront investment which pays huge dividends later, so the really lazy approach is to learn them :-)

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Actually that program ChainLP does work for me, perhaps I can use that to convert to something-or-other.

Edit: Nevermind. Has trouble with Aozora tags for most of the files I pretend to have but don't really because that would be wrong. Edit 2: n/m again I think that's just for the image tags because I removed the images they point to. ;p

Last edited by nest0r (2010 February 24, 11:32 am)

Reply #124 - 2010 February 24, 1:03 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

nest0r wrote:

So I think I've slowly worked my way towards what some of you were talking about earlier. So, given that I can't run any of the readers, or rather, they only display gibberish, but of course Firefox + xhtml ruby works, would my mission, if I want to de-uglify the bracketed readings in certain texts, be to convert them into:

<ruby><rb>馬鹿気</rb><rp>(</rp><rt>ばかげ</rt><rp>)</rp></ruby> ? That is to say, I need some simple way to tell a text editor to find instances of the furigana readings, replace them with those tags, as well as the kanji words prior to them. I suppose, as Jarvik I think mentioned, the only other option would be simply to remove them. ^_^

Several failures later: So, anyone want to write something up that newbies can run, that will do the above? Or that will turn "kanji《kana reading》" into a phonetic guide in Word, or something? I know I'm not the only nub, there are many of us n00bs here that will love you for it. Yes, many. Not just me. Many others. They're just quiet. sad
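Not one of the scripts linked earlier, but here's a small standalone sketch of the kind of thing being asked for: it turns kanji《reading》 (or ｜base《reading》) into <ruby> markup and wraps the result in a bare-bones HTML page. The command-line handling is just a guess at what a newbie-runnable version might look like:

```python
import re
import sys

RUBY = re.compile(r'(?:｜([^《]+)|([々〆\u4e00-\u9fff]+))《([^》]+)》')

def aozora_to_html(text):
    """Turn kanji《reading》 (or ｜base《reading》) into <ruby> markup
    and wrap the whole thing in a minimal HTML page."""
    body = RUBY.sub(lambda m: '<ruby>{}<rt>{}</rt></ruby>'.format(
        m.group(1) or m.group(2), m.group(3)), text)
    return ('<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n'
            '<p>{}</p>\n</body></html>\n'.format(body.replace('\n', '<br>\n')))

if __name__ == '__main__' and len(sys.argv) > 1:
    # usage: python aozora2html.py input.txt > output.html
    with open(sys.argv[1], encoding='utf-8') as f:
        sys.stdout.write(aozora_to_html(f.read()))
```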

Reply #125 - 2010 February 24, 1:04 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

I picked up EditPadPro (http://www.editpadpro.com) for US$50. It's pretty sweet, even if it's a bit pricey. It's a feature-dense text editor, and it has a built-in regular expression editor. If you go into the help documentation (under the help menu), there's a 47-page tutorial at the back of the PDF manual on how to create regular expressions. It has a separate pane you can open up to create regular expressions with, and you can slice and dice documents however you see fit.

It can convert text files on the fly, too, and it seems to handle JP pretty well. You can create files in JP as well, you just have to change the encoding. It handles many different kinds of encodings, and it has popup help windows when you're just starting out and still playing with the menus. (It'll tell you what each feature does when you enable a feature.) You can dismiss this as soon as it becomes annoying.

It even has some sort of e-mail sending ability buried in there somewhere.

I already like it a lot more than TotalEdit, which I used for a long time.

There's a lite version you can try first for free, which I'd recommend, because $50 isn't cheap for what is essentially Notepad+grep on steroids.