Scripts or other methods for adding furigana

lauri_ranta Member
Registered: 2012-03-31 Posts: 139 Website

toshiromiballza wrote:

Any chance one of you (Savii, lauri_ranta, cangy) would release your "furiganizer" scripts to the public? I'm not aware of any software that does this in bulk automatically, except for an ancient program called JGloss which is really buggy for me and produces terrible (when it does work) results (picks the first entry from EDICT, instead of for example going with those marked as common first). If output was possible in HTML5 ruby tag, so much better.

I have posted two Ruby scripts I use at http://jptxt.net/scripts.html. The first one adds furigana based on hiragana versions of words or sentences. It gets a few words like 物の怪/もののけ wrong though. The second one uses MeCab to generate furigana, like https://github.com/dae/ankiplugins/blob … reading.py. When I tried using Core 6000 sentences as input, about 5% had at least one difference from the correct furigana, but many of the differences are not necessarily errors.
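For anyone who wants to see the general idea of the first approach, here is a minimal sketch of aligning a word with its hiragana reading. It is not the script from the link above; the name `add_furigana` and the regex-based alignment are assumptions made for illustration. Kana in the word are treated as fixed anchors, and each run of other characters captures whatever part of the reading lies between them.

```ruby
# encoding: utf-8

# A run of kana (hiragana, katakana, or the long-vowel mark).
KANA = /\A[\p{Hiragana}\p{Katakana}ー]+\z/
# Splits a word into alternating kana runs and non-kana runs.
RUNS = /[\p{Hiragana}\p{Katakana}ー]+|[^\p{Hiragana}\p{Katakana}ー]+/

# Katakana to hiragana, so katakana in the word can match a hiragana reading.
def to_hiragana(text)
  text.tr("ァ-ヶ", "ぁ-ゖ")
end

# Hypothetical helper: wrap each non-kana run of `word` in HTML ruby markup,
# using `reading` (all hiragana) to work out which kana belong to which run.
def add_furigana(word, reading)
  parts = word.scan(RUNS)
  # Kana runs become literal text; other runs become lazy capture groups.
  pattern = parts.map { |part|
    part =~ KANA ? Regexp.escape(to_hiragana(part)) : "(.+?)"
  }.join
  match = /\A#{pattern}\z/.match(reading)
  return word unless match # leave the word untouched if nothing lines up

  group = 0
  parts.map do |part|
    if part =~ KANA
      part
    else
      group += 1
      "<ruby><rb>#{part}</rb><rt>#{match[group]}</rt></ruby>"
    end
  end.join
end

puts add_furigana("言い訳", "いいわけ")
puts add_furigana("くっ付ける", "くっつける")
```

Because the capture groups are lazy, a word like 物の怪/もののけ is ambiguous for this sketch: the first の in the reading is taken as the literal の, so the split comes out wrong.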

Savii and cangy have added (almost?) perfect furigana for Core 6000 sentences though. I think I figured out how to do it from Savii's description, but I'm still looking for other ways to add furigana as well.

Last edited by lauri_ranta (2013 November 28, 11:45 am)

toshiromiballza Member
Registered: 2010-10-27 Posts: 277

There's a script from the tatoeba.org project to add furigana and romaji available here (in PHP) which also uses mecab:

https://www.assembla.com/code/tatoeba2/ … ntence.php

Last edited by toshiromiballza (2013 March 13, 3:39 pm)

Oniichan Member
From: 名古屋 Registered: 2009-02-02 Posts: 269

lauri_ranta wrote:

I have posted two Ruby scripts at http://lri.me/japanese/Notes.html. The first one adds furigana from hiragana versions, and the error rate is maybe 0.1% for vocabulary. The second one uses mecab to generate furigana (like Anki's Japanese support plugin / addons/japanese/reading.py), but it gets some readings wrong in about 5% of Core 6000 sentences.

I saved your scripts as ruby files and had success with the furigana one, getting this output when testing with the commented-out section:

Code:

Active code page: 65001

C:\Users\Administrator\scripts\Ruby\Sandbox>ruby furigana.rb
<ruby><rb>次々</rb><rt>つぎつぎ</rt></ruby>
<ruby><rb>次々</rb><rt>つぎつぎ</rt></ruby>
ユニークな
<ruby><rb>痛</rb><rt>いた</rt></ruby>い
<ruby><rb>困難</rb><rt>こんなん</rt></ruby>な
<ruby><rb>言</rb><rt>い</rt></ruby>い<ruby><rb>訳</rb><rt>わけ</rt></ruby>
ごろごろ
カット
くっ<ruby><rb>付</rb><rt>つ</rt></ruby>ける
ジェット<ruby><rb>機</rb><rt>き</rt></ruby>
<ruby><rb>湿</rb><rt>しめ</rt></ruby>っぽい
<ruby><rb>東京</rb><rt>とうきょう</rt></ruby>ドーム
<ruby><rb>3月</rb><rt>さんげつ</rt></ruby>
<ruby><rb>一ヶ月</rb><rt>いっかげつ</rt></ruby>
<ruby><rb>X線</rb><rt>えっくすせん</rt></ruby>
<ruby><rb>八ッ橋</rb><rt>やつはし</rt></ruby>
<ruby><rb>4ヵ年</rb><rt>よんかねん</rt></ruby>
<ruby><rb>ィ形容詞</rb><rt>いけいようし</rt></ruby>
<ruby><rb>命</rb><rt>い</rt></ruby>の<ruby><rb>親</rb><rt>ちのおや</rt></ruby>
<ruby><rb>千円貸</rb><rt>せんえんか</rt></ruby>してください

However, when I try to run the mecab script I get this error:

Code:

C:\Users\Administrator\scripts\Ruby\Sandbox>ruby mecabruby.rb
C:/Users/Administrator/scripts/Ruby/Sandbox/furigana.rb:9:in `furigana': wrong number of arguments (1 for 2) (ArgumentError)
        from mecabruby.rb:9:in `block in mecab_furigana'
        from mecabruby.rb:4:in `map'
        from mecabruby.rb:4:in `mecab_furigana'
        from mecabruby.rb:31:in `block in <main>'
        from mecabruby.rb:30:in `each'
        from mecabruby.rb:30:in `<main>'

I also tested mecab.exe with the following command to make sure it is set up correctly:

C:\Users\Administrator\scripts\Ruby\Sandbox>mecab -O wakati in.txt -o out.txt

which yielded an output file with spaces between the morphemes, as expected.

For further reference, I'm running the scripts from a command prompt on a Windows 7 machine using Ruby 1.9.3p125. The scripts are saved as UTF-8 files and my cmd's code page is set to UTF-8 as well.

Any ideas why the script breaks down when I run it?

Oniichan Member
From: 名古屋 Registered: 2009-02-02 Posts: 269

Here is another output when the code page is set to '932 (ANSI/OEM - Japanese Shift-JIS)'

Code:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\Administrator\scripts\Ruby\Sandbox>ruby mecabruby.rb
mecabruby.rb:9:in `split': invalid byte sequence in Windows-31J (ArgumentError)
        from mecabruby.rb:9:in `block in mecab_furigana'
        from mecabruby.rb:4:in `map'
        from mecabruby.rb:4:in `mecab_furigana'
        from mecabruby.rb:31:in `block in <main>'
        from mecabruby.rb:30:in `each'
        from mecabruby.rb:30:in `<main>'

I'm guessing that the console's code page has no effect on the script, since the script doesn't output anything to the screen, but I'm less sure about the encoding of MeCab's dictionaries. I chose UTF-8 for the MeCab dictionary during setup. Do I need to reinstall it as EUC-JP or something to make this script work?

Last edited by Oniichan (2013 March 06, 11:22 pm)

Reply #5 - 2013 March 11, 7:50 am
lauri_ranta Member
Registered: 2012-03-31 Posts: 139 Website

I finally got it to work (at least on my dad's Windows 8). I had to add "Encoding.default_external = Encoding::UTF_8" to the top and replace "mecab" with "C:/Program Files (x86)/MeCab/bin/mecab.exe". I installed Ruby 2.0 with RubyInstaller and MeCab with mecab-0.996.exe (using a UTF-8 dictionary), and I ran the script with just ctrl+B in Sublime Text.
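To show where those two changes fit, here is a rough sketch of calling MeCab from Ruby with the encoding stated explicitly. It is not the posted script: the names `hiragana`, `parse_mecab` and `mecab_readings` are made up for illustration, and the parser assumes the default IPADIC output format, where each line is the surface form, a tab, and comma-separated features with the katakana reading in the eighth field. The sample line is only an illustration of that format.

```ruby
# encoding: utf-8

# Read external data (such as MeCab's output) as UTF-8 instead of the
# console's code page. This assumes MeCab was installed with a UTF-8 dictionary.
Encoding.default_external = Encoding::UTF_8

# Katakana to hiragana; MeCab's IPADIC readings are in katakana.
def hiragana(text)
  text.tr("ァ-ヶ", "ぁ-ゖ")
end

# Hypothetical parser for MeCab's default output (IPADIC layout assumed).
# Returns [surface, reading] pairs; reading is nil when the field is missing.
def parse_mecab(output)
  pairs = []
  output.each_line do |line|
    line = line.chomp
    next if line.empty? || line == "EOS"
    surface, features = line.split("\t", 2)
    reading = features.to_s.split(",")[7]
    pairs << [surface, reading ? hiragana(reading) : nil]
  end
  pairs
end

# Hypothetical wrapper: pipe text through MeCab and parse the result.
# On Windows, pass the full path to mecab.exe as the second argument.
def mecab_readings(text, mecab = "mecab")
  output = IO.popen([mecab], "r+:UTF-8") do |io|
    io.write(text)
    io.close_write
    io.read
  end
  parse_mecab(output)
end

sample = "東京\t名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー\nEOS\n"
p parse_mecab(sample)
```

With MeCab installed, something like `mecab_readings("東京ドーム", "C:/Program Files (x86)/MeCab/bin/mecab.exe")` would be the Windows call. Passing the command as an array avoids problems with the spaces in that path.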

On OS X you could install the command line tools package, save the two scripts as furigana.rb and mecab_furigana.rb, and then paste this in Terminal:

Code:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
brew install ruby mecab mecab-ipadic
echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bash_profile
. ~/.bash_profile
ruby mecab_furigana.rb
Reply #6 - 2013 March 11, 9:55 pm
cangy Member
From: 平安京 Registered: 2006-12-13 Posts: 372 Website

lauri_ranta wrote:

Savii and cangy have added (almost?) perfect furigana for Core 6000 sentences and vocabulary though. I think I figured out how to do it from Savii's description, but I'm still looking for other scripts or methods for adding furigana as well.

cranki is here: https://sites.google.com/site/ankinihongo/

(see also http://forum.koohii.com/viewtopic.php?p … 20#p198020)

Last edited by cangy (2013 March 11, 9:56 pm)

Oniichan Member
From: 名古屋 Registered: 2009-02-02 Posts: 269

lauri_ranta wrote:

I finally got it to work (at least on my dad's Windows 8). I had to add "Encoding.default_external = Encoding::UTF_8" to the top and replace "mecab" with "C:/Program Files (x86)/MeCab/bin/mecab.exe". I installed Ruby 2.0 with RubyInstaller and MeCab with mecab-0.996.exe (using a UTF-8 dictionary), and I ran the script with just ctrl+B in Sublime Text...

Sorry for the delay, I just got back from a short vacation. I'll give this a try and see if it works on win7. Thank you.

Reply #8 - 2013 March 13, 3:54 am
toshiromiballza Member
Registered: 2010-10-27 Posts: 277

Yep, it works now.

I had to add:

Code:

#!/bin/env ruby
# encoding: utf-8

at the top of furigana.rb and:

Code:

#!/bin/env ruby
# encoding: utf-8
Encoding.default_external = Encoding::UTF_8

at the top of mecab.rb

Works like it's supposed to! Thanks.

Oniichan Member
From: 名古屋 Registered: 2009-02-02 Posts: 269

Thank you lauri_ranta and toshiromiballza, it's working for me too now!

Reply #10 - 2013 May 03, 7:59 pm
bambi73 Member
From: Czech Republic Registered: 2012-08-15 Posts: 16

Does anyone know an easy way to generate furigana readings per character instead of per word? Like 自[じ]分[ぶん] instead of 自分[じぶん]?
I checked the MeCab dictionary, and it stores each reading in one piece, so I guess it isn't possible. The same goes for Kakasi and the online dictionaries I know of (because of my level, I have never looked at Japanese J-J online dictionaries).
The only place where I have found per-character readings is some EPWing dictionaries, but generating them from there would be quite complicated. Isn't there an easier way?
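To make the problem concrete, here is a small sketch of what per-character splitting could look like if a table of candidate readings per kanji were available. The table below is hand-written and hypothetical, as are the names `split_reading` and `per_character_furigana`; a real table would have to come from some kanji dictionary. The sketch only handles words written entirely in kanji and does not account for sound changes such as rendaku.

```ruby
# encoding: utf-8

# Hypothetical, hand-written table of candidate readings per kanji.
KANJI_READINGS = {
  "自" => ["じ", "し"],
  "分" => ["ぶん", "ふん", "ぶ"],
  "東" => ["とう", "ひがし"],
  "京" => ["きょう", "けい"],
}

# Try to split `reading` across the characters of `word` by backtracking
# over each character's candidate readings. Returns an array of
# [character, reading] pairs, or nil if no split fits the table.
def split_reading(word, reading, table = KANJI_READINGS)
  return [] if word.empty? && reading.empty?
  return nil if word.empty? || reading.empty?

  char = word[0]
  rest = word[1..-1]
  table.fetch(char, []).each do |candidate|
    next unless reading.start_with?(candidate)
    tail = split_reading(rest, reading[candidate.length..-1], table)
    return [[char, candidate]] + tail if tail
  end
  nil
end

# Format as 自[じ]分[ぶん]; fall back to per-word 自分[じぶん] if no split is found.
def per_character_furigana(word, reading, table = KANJI_READINGS)
  pairs = split_reading(word, reading, table)
  return "#{word}[#{reading}]" unless pairs
  pairs.map { |char, r| "#{char}[#{r}]" }.join
end

puts per_character_furigana("自分", "じぶん")
puts per_character_furigana("東京", "とうきょう")
```

Words whose kanji are missing from the table simply keep the per-word form, so the quality of the output depends entirely on how complete the table is.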

Reply #11 - 2013 May 04, 3:02 am
toshiromiballza Member
Registered: 2010-10-27 Posts: 277
Reply #12 - 2013 May 04, 8:50 am
bambi73 Member
From: Czech Republic Registered: 2012-08-15 Posts: 16

This looks promising, thanks for the hint.
