kanji koohii FORUM
Using RegEx to get rid of kana in () in articles - Printable Version



Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-09

So in this article on tofugu:
http://www.tofugu.com/2010/08/30/5-step-jlpt-study-method-using-japanese-newspapers-for-kids/

If you scroll down, you can see my wall-o-text reply suggesting using Regular Expressions to get rid of the offending kana in () in Japanese newspaper articles for kids.

The problem is that the only approach I could think of is, well, brute-force-ish. Basically, it entails using Dot (.), then 2 Dots (..), then 3, etc. And I know that Dot is an awful choice, but I don't have the time to study RegEx. It works, but it's clumsy.

Obviously, there's a more elegant solution out there, I just don't know what it is. Anyone else care to take a stab?

The idea would be to take text like this:

円高(えんだか)が進(すす)み、景気回復(けいきかいふく)の見通(みとお)しが立(た)たないなか、中央銀行(ちゅうおうぎんこう)である日本銀行(にっぽんぎんこう)は5日(いつか)、追加(ついか)の金融緩和(きんゆうかんわ)の政策(せいさく)を決(き)めました。

And turn it into text like this:

円高が進み、景気回復の見通しが立たないなか、中央銀行である日本銀行は5日、追加の金融緩和の政策を決めました。

using something like EditPad Pro (or another text editor with a RegEx-capable search/replace) to make quick work of the stuff in parentheses, removing the parentheses as well. And note that the parentheses are not EN parentheses (), they're JP parentheses （）, a different critter.

You can find a lot of samples to experiment on here:
http://mainichi.jp/life/edu/maishou/index.html

If you find a better solution, post here, but also please post on the tofugu thread, too.

EDIT: Here are the better solutions. They will find all hiragana and katakana in parentheses: (Thanks to JimmySeal, quincy, ファブリス, and everyone else for their help.)

For Mainichi and sites that use Japanese parentheses (incl. Goo): （[\u3040-\u30FF]+）
For Yomiuri and sites that use English parentheses: \([\u3040-\u30FF]+\)

Mainichi site: http://mainichi.jp/life/edu/maishou/index.html
Yomiuri site: http://www.yomiuri.co.jp/kyoiku/children/
Kids@Goo: http://kids.goo.ne.jp/index.html?SY=0&MD=2
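
For anyone who would rather script this than use an editor, here is a minimal Python sketch of the same idea. The function name strip_furigana is made up for illustration; the character class is the \u3040-\u30FF kana range from the expressions above, and the pattern accepts either ASCII () or full-width （） parentheses so one expression covers both kinds of site.

```python
import re

# A run of hiragana/katakana (U+3040-U+30FF) wrapped in either ASCII
# parentheses or full-width Japanese parentheses (U+FF08 / U+FF09).
FURIGANA = re.compile(r"[(\uFF08][\u3040-\u30FF]+[)\uFF09]")

def strip_furigana(text: str) -> str:
    """Remove parenthesized kana readings, parentheses included."""
    return FURIGANA.sub("", text)

sample = "円高(えんだか)が進(すす)み、景気回復(けいきかいふく)の見通(みとお)しが立(た)たないなか"
print(strip_furigana(sample))
```

Because only kana are allowed between the parentheses, a parenthesized phrase that contains kanji, such as (政策金利(せいさくきんり)), keeps its outer parentheses and loses only the reading.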


Using RegEx to get rid of kana in () in articles - JimmySeal - 2010-10-09

([ぁ-ー]+)


Using RegEx to get rid of kana in () in articles - quincy - 2010-10-09

I did come up with a regex that would work for this, but then I found out that most text editors have limited regex support (I tried Notepad++ and vim). What I use is (.\{1,6})
Actually I have no idea why that '\' is in there now, it might be a vim specific thing. It's just what I have in vim's memory and it works.

If you're using vim you can input it like this
:%s/(.\{1,6})//gi

The expression I came up with that should work if the editor supports it is this: (.+?)


Using RegEx to get rid of kana in () in articles - JimmySeal - 2010-10-09

Actually neither of your expressions works successfully on this piece of text:

注目(ちゅうもく)されるようになったのは江戸幕府(えどばくふ)を開(ひら)いた徳川家康(とくがわいえやす)が、1616年(ねん)に死(し)んでからのこと。駿府(すんぷ)(今(いま)の静岡(しずおか))で死(し)んだ家康(いえやす)は富士山(ふじさん)の眺(なが)めがきれいな久能山(くのうざん)にまつられましたが、「江戸(えど)の北(きた)を守(まも)れ」という遺言(ゆいごん)によって翌年(よくねん)、この場所(ばしょ)に改葬(かいそう)されました。最初(さいしょ)は東照社(とうしょうしゃ)、やがて東照宮(とうしょうぐう)(家康(いえやす)を東照大権現(とうしょうだいごんげん)という神様(かみさま)としてまつる神社(じんじゃ))と呼(よ)ばれるようになります。

Not only will they remove text that's not furigana, they will also leave behind two closing parentheses.
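
The failure is easy to reproduce outside an editor. The sketch below is a Python illustration, not anything posted in the thread: it runs a lazy any-character pattern and a kana-only pattern over one nested phrase from the text above. The variable names are arbitrary.

```python
import re

text = "駿府(すんぷ)(今(いま)の静岡(しずおか))で死(し)んだ"

# Lazy "anything in parentheses": stops at the first closing parenthesis,
# so it cannot tell a nested reading from the phrase that contains it.
lazy_dot = re.compile(r"\(.+?\)")

# Kana only: a match must be nothing but hiragana/katakana between the parentheses.
kana_only = re.compile(r"\([\u3040-\u30FF]+\)")

lazy_result = lazy_dot.sub("", text)
kana_result = kana_only.sub("", text)

print(lazy_result)  # 駿府の静岡)で死んだ
print(kana_result)  # 駿府(今の静岡)で死んだ
```

The lazy pattern swallows 今 along with its reading and strands a closing parenthesis, while the kana-only pattern leaves the parenthetical note (今の静岡) intact.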


Using RegEx to get rid of kana in () in articles - quincy - 2010-10-09

Must have to do with it being in parentheses. I used these expressions for books in text form with the furigana in 《》 brackets. I don't think there's actually any way to get it to work with this article because of this line: 駿府(すんぷ)(今(いま)の静岡(しずおか))


Using RegEx to get rid of kana in () in articles - JimmySeal - 2010-10-09

quincy Wrote:Must have to do with it being in parentheses. I used these expressions for books in text form with the furigana in 《》 brackets. I don't think there's actually any way to get it to work with this article because of this line: 駿府(すんぷ)(今(いま)の静岡(しずおか))
The expression I suggested works just fine, in EditPad Pro and Notepad++.


Using RegEx to get rid of kana in () in articles - quincy - 2010-10-09

Oh, didn't notice your post. I didn't know regex had Japanese support, this is handy.


Using RegEx to get rid of kana in () in articles - FooSoft - 2010-10-09

I've always just used a small Python script I wrote to do this.


Using RegEx to get rid of kana in () in articles - rachels - 2010-10-09

When tidying up an Anki deck at one stage, I was wondering whether there might be a regular expression to remove all kanji, leaving the kana. Would anyone have any ideas? A regex compatible with Anki, i.e. one that can be used within Anki, would be ideal.


Using RegEx to get rid of kana in () in articles - ropsta - 2010-10-09

JimmySeal Wrote:The expression I suggested works just fine, in EditPad Pro and Notepad++.
Can't get it to work. Maybe it's encoding related.

Edit: Got it to work in Notepad++ but not Editpad.

So it does seem to work.


Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-09

Yeah, ([ぁ-ー]+) seems to work the best.

Especially on this bit here: (政策金利(せいさくきんり)), so that the furigana bit gets removed, but not the kanji bit.

It seems like (.+?) works about 95% of the time. It's just times like: (政策金利(せいさくきんり)) where it seems to run into problems.

I can't get (.\{1,6}) to work in EPP. It just gacks on it.

But yeah, ([ぁ-ー]+) works great. Thanks JimmySeal, quincy, and everyone else for your input. This was driving me slightly nuts.

@JimmySeal is there a good website for Japanese input for RegEx, just for future reference?


Using RegEx to get rid of kana in () in articles - JimmySeal - 2010-10-09

rachels Wrote:When tidying up an Anki deck at one stage, I was wondering whether there might be a regular expression to remove all kanji, leaving the kana. Would anyone have any ideas? A regex compatible with Anki, i.e. one that can be used within Anki, would be ideal.
This expression should find all the kanji and leave everything else (but see the note below):

[㐀-﨩]

EDIT: Actually, that expression works in EPP, but not in Notepad++, so I'm not sure how it would work in Anki. Whatever you do, make sure to back up your deck first.

rich_f Wrote:@JimmySeal is there a good website for Japanese input for RegEx, just for future reference?
Not sure what you mean by Japanese input for RegEx, but I'm pretty sure I don't know of any.
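
As a rough illustration of the kanji-stripping idea, here is a small Python sketch. It is an assumption-laden stand-in, not Anki's own search/replace: the function name strip_kanji is invented, and the range is simply the code points of 㐀 (U+3400) through 﨩 (U+FA29) written as escapes. That span also covers non-kanji characters such as Hangul syllables, so it is only reasonable for text known to be Japanese.

```python
import re

# Same span as [㐀-﨩], written with escapes: U+3400 through U+FA29.
# Note: this is a broad range and also contains non-kanji (e.g. Hangul).
KANJI = re.compile(r"[\u3400-\uFA29]")

def strip_kanji(text: str) -> str:
    """Delete every character in the range, leaving kana and punctuation."""
    return KANJI.sub("", text)

print(strip_kanji("日本語を勉強する"))  # をする
```

Kana and Japanese punctuation sit below U+3400, so they pass through untouched.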


Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-09

Oh, I figured out this: ([一-龯]+) will work for 4e00-9faf, but not for the kanji in 3400-4dbf (which are kind of obscure.)

This ([㐀-龯]+) seems to work better, as it grabs the whole 3400-9faf range. (Seems to, anyway.)

I figured it out from this table: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml (EDIT: Looks like this is outdated.)

It will remove kanji in parens like kana in parens, but in this phrase (政策金利(せいさくきんり)), it won't touch the kanji, and just ignores everything.

Don't mind me, I'm just intrigued by the whole thing.

Edit: Oh, I see 﨩 is FA29.. is that an extension of Unicode for Japanese? Using this to get the Unicode: http://www.unicode.org/charts/unihan.html

So I guess the proper expression I'm looking for is: ([㐀-﨩]+)

Maybe Notepad++ is choking on the FA29 extended bits?


Using RegEx to get rid of kana in () in articles - rachels - 2010-10-09

Seems to work fine in anki. So easy (when you know how) - Thanks !!!


Using RegEx to get rid of kana in () in articles - JimmySeal - 2010-10-09

rich_f Wrote:Edit: Oh, I see 﨩 is FA29.. is that an extension of Unicode for Japanese? Using this to get the Unicode: http://www.unicode.org/charts/unihan.html
I don't actually know. I just went into Charmap, set it on SimSun font, and took the first and last kanji in the chart. [㐀-龯] is probably plenty, and stays out of uncharted territory.

Quote:Maybe Notepad++ is choking on the FA29 extended bits?
It seems like maybe Notepad++ just can't handle ranges of kanji in its regular expressions at all. Even if I use a run-of-the-mill character as in [簡-簡], it winds up matching almost everything.


Using RegEx to get rid of kana in () in articles - quincy - 2010-10-09

Do you have the search set as case sensitive? Notepad++ gets really weird about kanji when it's not set.


Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-09

Interesting. When I used MS ゴシック in the Charmap, I got 頻 instead as the last kanji (FA6A). Most other fonts go to FA6A (but there are a bunch of other characters in there, like hangul, etc.).

Arial Unicode MS only goes up to 鶴 (FA2D) before it chokes. SimSun seems to be similar to Arial Unicode MS.

Then again, I know that there are a lot of inconsistencies with unicode and the representation of East Asian languages.

Anyway, this is only going for the truly obscure stuff, kanji-wise.

Oh, these are my new favorite characters: ಠ_ಠ


Using RegEx to get rid of kana in () in articles - mezbup - 2010-10-09

Why not just read stuff that doesn't have annoying parentheses in the first place?


Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-10

The idea is pretty simple. If you're having trouble reading full-blown newspaper content, you can use the kids' versions, which avoid some of the more obscure kanji. The problem is that some of them (I know Mainichi does this) put every reading of every kanji in parens. It's really annoying, and not a good way to check to see if you can read easier material.

If you can just strip out all of that stuff quickly and easily, however, you have access to a large amount of short reading excerpts that are ~ N2 or N3 level. You can also analyze them to see what vocab you're lacking, too.


Using RegEx to get rid of kana in () in articles - ファブリス - 2010-10-10

You can match any unicode ranges like this:

[\x{3040}-\x{30ff}]

You can also match unicode blocks, but these are a shortcut for above:

[\x{3040}-\x{309f}] => \p{InHiragana}

And don't forget the Unicode modifier; in PHP it's "u" at the end: /expression/u

You must run your regexp in Unicode mode! And feed it a UTF-8 string. Otherwise, yes, you will match one or two kanji, but it will soon fail when one byte of a UTF-8 kanji is matched as if it were a single-byte character.


If you're getting your text from Shift JIS or another encoding, then I would look at iconv to easily convert between character sets.

Reference: Unicode Kanji Code table
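
The same two points (match on Unicode text, and convert the encoding first) look like this in Python. This is only a sketch under assumptions: the Shift JIS bytes are produced in the script to stand in for a downloaded file, and Python's built-in codecs do the conversion that iconv would do elsewhere.

```python
import re

# Stand-in for text that arrived as Shift JIS bytes (e.g. a file read in binary mode).
raw = "円高(えんだか)が進(すす)み".encode("shift_jis")

# Decode to a Unicode string before matching; the regex then sees whole
# characters rather than individual bytes.
text = raw.decode("shift_jis")

kana_in_parens = re.compile(r"\([\u3040-\u30FF]+\)")
cleaned = kana_in_parens.sub("", text)

print(cleaned)                  # 円高が進み
utf8_bytes = cleaned.encode("utf-8")  # re-encode as UTF-8 for saving
```

In Python 3 a str pattern always works on code points, so there is no separate Unicode flag to remember; the part that still matters is decoding with the right source encoding.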


Using RegEx to get rid of kana in () in articles - rich_f - 2010-10-10

Yeah, I got ([\u3400-\uFA6A]+) to work in EditPad Pro as a Kanji search query, and ([\u3040-\u30FF]+) works just fine as a hiragana+katakana query.

Before the \u searches wouldn't work for some odd reason. Probably because I wasn't careful, and had some error in it somewhere. Maybe EPP was being strict about capitalizing the hex in the Unicode?

It's weird. 頻 is FA6A in Unicode, and doesn't appear on the Rikai table, yet, it shows up in Rikai-chan. If you hover over it, it shows it as Unicode FA6A. Then again, it's probably at the far end of obscure.

The Unicode block should end at 9FAF, but if you open up Character Map in Windows, you can find a small band of characters showing up in the F91D to FA6A range. Some of them look awfully familiar, but not quite. 殺 is not quite 殺. There's a dot over the "tree," although it's very tiny. 勉 looks a lot like 勉, except it's just drawn differently style-wise.

*Shrug* Maybe they're just older versions?

It's not like I'm going to use the kanji search query that often, anyway. This was just out of curiosity. Thanks for the help, everyone.


Using RegEx to get rid of kana in () in articles - aphasiac - 2010-10-10

The whole point of that kana in parentheses is that it gets automatically converted to furigana by your web browser. Been a web standard for ages, hasn't it? You shouldn't ever see them..

Works perfectly for me in Chrome.

EDIT: confirmed - furigana rendered in Google Chrome, Safari and IE8, but not with Firefox 3.6 (instead you see kanji then kana in parentheses). Weird - maybe there's a plugin available? Anyway, looks like the solution is to use a browser that isn't Firefox..

EDIT EDIT: Oops, may have misread your post - maybe you just don't want the furigana above the kanji, cos it interferes with your reading? If so, it's easy to train yourself not to notice it..


Using RegEx to get rid of kana in () in articles - pm215 - 2010-10-10

rich_f Wrote:The Unicode block should end at 9FAF, but if you open up Character Map in Windows, you can find a small band of characters showing up in the F91D to FA6A range.
Kanji in Unicode are not all in a contiguous range. This reflects the history, where a large block were put in initially, and then there were later additions for various reasons. I think the following blocks will all have kanji in:
CJK Unified Ideographs Extension A (3400-4DBF)
CJK Unified Ideographs (4E00-9FFF)
CJK Compatibility Ideographs (F900-FAFF)
CJK Unified Ideographs Extension B (20000-2A6DF)
CJK Compatibility Ideographs Supplement (2F800-2FA1F)

(I haven't listed the blocks which only have radicals, or the ones with CJK-style punctuation, or oddities like ㌔, the single-character katakana キロ.)
[\u3400-\uFA6A]+ is a little overenthusiastic since for example it will match Hangul as well, but if you're only dealing with Japanese text it's probably good enough.
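
To make the difference concrete, here is a Python sketch that builds a character class from the five blocks listed above and compares it with the single broad range. The names KANJI_BLOCKS, KANJI and BROAD are made up for the example, and the mixed Korean/Japanese test string is invented.

```python
import re

# (start, end) code points for the blocks listed above.
KANJI_BLOCKS = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

# Join the ranges into one character class, e.g. "[㐀-䶿一-鿿...]+".
char_class = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in KANJI_BLOCKS)
KANJI = re.compile(f"[{char_class}]+")

# The single broad range from the thread, for comparison.
BROAD = re.compile(r"[\u3400-\uFA6A]+")

mixed = "한국어と日本語のテキスト"
print(KANJI.findall(mixed))  # ['日本語']
print(BROAD.findall(mixed))  # ['한국어', '日本語']
```

The block-based class skips the Hangul run that the broad range picks up, and it also reaches the Extension B characters above U+FFFF, which a \uXXXX-only range cannot express.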


Using RegEx to get rid of kana in () in articles - Asriel - 2010-10-10

Slightly-related question:

How about going the other way? Say you want to add furigana to text you find. Sure, you've got rikaichan, but how about something like that Expression->Reading thing they have in Anki?
Insert Japanese text -> Out comes furigana-ized text?

There's gotta be something, preferably that can be used offline, that does this?


Using RegEx to get rid of kana in () in articles - aphasiac - 2010-10-10

Asriel Wrote:Slightly-related question:

How about going the other way? Say you want to add furigana to text you find. Sure, you've got rikaichan, but how about something like that Expression->Reading thing they have in Anki?
Insert Japanese text -> Out comes furigana-ized text?

There's gotta be something, preferably that can be used offline, that does this?
https://addons.mozilla.org/en-US/firefox/addon/6178/

Haven't actually tested it though..