So in this article on tofugu:
http://www.tofugu.com/2010/08/30/5-step-...-for-kids/
If you scroll down, you can see my wall-o-text reply suggesting using Regular Expressions to get rid of the offending kana in () in Japanese newspaper articles for kids.
The problem is that the only approach I could think of is, well, brute-force-ish. Basically, it entails using Dot (.), then 2 Dots (..), then 3, etc. And I know that Dot is an awful choice, but I don't have the time to study RegEx. It works, but it's clumsy.
Obviously, there's a more elegant solution out there; I just don't know what it is. Anyone else care to take a stab?
The idea would be to take text like this:
円高（えんだか）が進（すす）み、景気回復（けいきかいふく）の見通（みとお）しが立（た）たないなか、中央銀行（ちゅうおうぎんこう）である日本銀行（にっぽんぎんこう）は5日（いつか）、追加（ついか）の金融緩和（きんゆうかんわ）の政策（せいさく）を決（き）めました。
And turn it into text like this:
円高が進み、景気回復の見通しが立たないなか、中央銀行である日本銀行は5日、追加の金融緩和の政策を決めました。
using something like EditPad Pro (or another text editor with RegEx-capable search/replace) to make quick work of the stuff in parentheses, removing the parentheses as well. And note that the parentheses are not EN parentheses (), they're JP parentheses （）, a different critter.
You can find a lot of samples to experiment on here:
http://mainichi.jp/life/edu/maishou/index.html
If you find a better solution, post it here, and please post it on the tofugu thread, too.
EDIT: Here are the better solutions. They will find all hiragana and katakana in parentheses: (Thanks to JimmySeal, quincy, ファブリス, and everyone else for their help.)
For Mainichi and sites that use Japanese parentheses (incl. Goo): （[\u3040-\u30FF]+）
For Yomiuri and sites that use English parentheses: \([\u3040-\u30FF]+\)
Mainichi site: http://mainichi.jp/life/edu/maishou/index.html
Yomiuri site: http://www.yomiuri.co.jp/kyoiku/children/
Kids@Goo: http://kids.goo.ne.jp/index.html?SY=0&MD=2
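If you'd rather script it than use an editor's search/replace, here's a rough Python sketch of the same idea. The names FURIGANA and strip_furigana are made up for illustration, and folding both parenthesis styles into one pattern is just a convenience, not something from the thread. It assumes the readings are kana only (U+3040 to U+30FF), same as the patterns above.

```python
import re

# Kana run (hiragana + katakana, U+3040-U+30FF) wrapped in either
# full-width JP parentheses (U+FF08 / U+FF09) or ASCII parentheses.
# Hypothetical name; the character range is the one from the patterns above.
FURIGANA = re.compile(r"[（(][\u3040-\u30FF]+[）)]")


def strip_furigana(text: str) -> str:
    """Delete parenthesized kana readings, parentheses included."""
    return FURIGANA.sub("", text)


if __name__ == "__main__":
    sample = "円高（えんだか）が進（すす）み"
    print(strip_furigana(sample))  # 円高が進み
```

Parentheses holding anything other than kana (kanji, romaji, numbers) are left alone, which is usually what you want. One caveat: this combined pattern would also match a mismatched pair like a JP opening parenthesis with an EN closing one.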
Edited: 2010-10-10, 2:39 pm


