Typing the Kanji--RtK-based IME?

wildweathel Member
Registered: 2009-08-04 Posts: 255

DISCLAIMER:  If you just want to type Japanese, you probably want the other thread.  You know, the one with the actual, professional IMEs used by millions of Japanese.  One component of Typing the Kanji has to be installed in your brain.  Wildweathel is not responsible for any brain damage that may result from installing TtK, repetitive stress injuries from your use of TtK, having to relearn because he changed TtK, or typing atrociously-poor kanji puns like 「俺は黄坑生だぞ」 because TtK won't stop you.

Abandon all hope, ye who enter here.

----
I'm pretty much certain that no, this doesn't exist yet, and that it'll turn into a personal project, but I figured I'd ask before I start.

Can someone recommend a good direct-kanji (漢字直接) method for Japanese?  The shortcomings of kana to kanji conversion (漢字変換)--even with a kana keyboard--are really starting to bother me.

My ideal IME would be based on stroke-order decomposition of the elements within a character: thanks to RTK, 凝 is already in my memory as 冫匕矢マ疋, why can't I just type it that way?

(The obvious answer is that there aren't enough keys on the keyboard.  There are maybe 300 primitive elements, and only 32 easily-reached character keys on a Western keyboard.  However, those 32 keys allow for 1024 two-key combinations.  In TUT-code, ~2500 kanji and all the (basic) kana can be typed in two or three strokes.)
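
A quick way to check that arithmetic, as a minimal Python sketch. The function name is made up for illustration; only the figure of 32 keys comes from the post.

```python
# Count the ordered key sequences available from a fixed set of keys.
KEYS = 32  # easily reached character keys, the figure quoted above


def sequences(keys: int, length: int) -> int:
    """Number of distinct ordered sequences of exactly `length` strokes."""
    return keys ** length


two_stroke = sequences(KEYS, 2)    # 32 * 32
three_stroke = sequences(KEYS, 3)  # 32 * 32 * 32
print(two_stroke, three_stroke)
```

Two strokes already cover more than the ~300 primitives, and three strokes leave ample room for ~2500 kanji plus kana, which is consistent with the TUT-code figures above.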

The CangJie method is fairly close to that ideal, with the added wrinkle of (sometimes heavy) abbreviation.  But, (most obviously) it's designed for Chinese and thus can't input kana.

Some Googling turned up a surprising variety of direct methods.  But, with the exception of TUT-CODE, they seem to be defunct.

I spent a couple hours today trying to get TUT-CODE installed without success.  It's probably just as well: I can't see any organizing principles behind the kanji layout.  Memorizing it would have been a huge pain.  The one thing I like about it most is that there aren't separate kana and kanji modes: you just type, some sequences correspond to kanji and some to kana.

Fortunately, my OS makes creating new input methods very easy (as long as prediction or multiple modes aren't required.  That's okay, I'm sick of both!).  http://code.google.com/p/ibus/wiki/HowT … rIBusTable

Anyway, just wondering if anyone else has an interest.

Last edited by wildweathel (2009 November 21, 6:59 pm)

yudantaiteki Member
Registered: 2009-10-03 Posts: 3619

I'm pretty sure such a thing does not exist, and there really wouldn't be that much market or interest in it, IMO.  When you're actually typing Japanese, how do you know you need to type 凝?  Usually, it's because whatever you're typing has the word 凝視, or 凝る, or whatever -- and if you have the word, why go through all the rigmarole of typing in parts of kanji when you can just type the word you're already thinking of and let the IME do the job of constructing the kanji for you?

markal Member
From: Tokyo Registered: 2007-10-22 Posts: 84

First, once you have learned Japanese well enough to be typing it properly you will abandon the system you spent so much time designing and learning because it is inefficient compared to regular typing.

Second, if you learn this system before learning how to type words via the standard method it might even interfere with the standard method later on.

magamo Member
From: Pasadena, CA Registered: 2009-05-29 Posts: 1039

I guess most people want to type what they say, not what they see. I push a key because that's what I'm hearing in my head. IME is great because it interprets what I said in my head and automatically kanjifies it.

mezbup Member
From: sausage lip Registered: 2008-09-18 Posts: 1681 Website

I don't wanna sound mean but this is kinda the most asinine idea I've heard of in a while. It'd be about 10x slower than the normal IME which is already fast enough and easy enough to use.

If you actually think it through, it'll take you 3x longer for simpler characters and 5 - 7x longer for more complex ones. On top of that you'd actually have to learn how to use it which would be a whole other skill in and of itself.

Try and write 原子力発電所. Now how much longer would that take? All I have to do is type in げんしりょくはつでんしょ and hit space then enter. How much more would you have to type in? Aside from that you'd have to remember all the little pieces you'd have to type in and in which order. This is something the current IME saves you from.

Moral of the story. Don't fix what ain't broken.

/rant

Last edited by mezbup (2009 November 19, 7:14 pm)

wildweathel Member
Registered: 2009-08-04 Posts: 255

原子力発電所
Anthy, romaji: gensiryokuhatudensho<SP><RET> (22 strokes)
Anthy, kana: け濁んしりょくはつて濁んしょ<SP><RET> (16 strokes, 2 shifted)
CangJie3: 一竹日火 弓木 大尸 弓人一大山 一月田山 竹尸竹一中<SP> (28 strokes)
TUT-CODE: ;REYSOHJJRKL (12 strokes)

romaji: 3.7 strokes per character
kana: 2.7
CangJie3: 4.7
TUT: 2.0
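
Those averages are just total strokes divided by the six characters of 原子力発電所. A small Python sketch reproducing them; the stroke totals are copied from the list above, and the names are illustrative only.

```python
# Strokes per character for each method, using the totals counted above.
WORD = "原子力発電所"  # six characters

TOTAL_STROKES = {
    "romaji": 22,    # includes <SP><RET>
    "kana": 16,      # includes <SP><RET>
    "CangJie3": 28,  # includes <SP>
    "TUT": 12,
}


def strokes_per_char(total: int, text: str) -> float:
    """Average number of key strokes needed per output character."""
    return total / len(text)


for method, total in TOTAL_STROKES.items():
    print(f"{method}: {strokes_per_char(total, WORD):.1f}")
```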

I think I can average 4-6 strokes per kanji, putting me on par with CangJie or a little slower than romaji conversion.  Of course, I can't "prove it" yet, but it certainly seems possible.

For kana, I'm within 1-3 strokes per kana, probably with an average of two: the system I have worked out is basically a rearrangement of TUT's kana input plus faster 拗音. 

げんしりょくはつでんしょ is 20 strokes.  That's exactly equal to romaji input.

To be honest, speed is not my goal.  If it was, I'd be implementing and learning TUT.  The real issue I have with 変換 is later in the typing process.

My main goal is to make a dumb system.  When using conversion, you have to a) type what you would say, b) check it for mistakes, c) check the conversion for mistakes, d) fix them.

So far, I've only analyzed stage A.  Stage B mistakes can be reduced (but not eliminated) through training.  Stage C mistakes can be reduced (but not eliminated) with better conversion software.  Stage D is universally a pain.  If you missed a typing error, you not only have to correct it, you have to find it first.  If you have a conversion error, you have to figure out what it is, how to correct it, and issue the right sequence of commands to fix it.

This requires a fairly high degree of reading fluency.  Say you want to type 判断, but you get 三段.  If you can read well, you'll recognize that you typed an s when you should have typed an h.  Now let's say you want 以外, but you get 意外.  That's a case where you did everything correctly, but the conversion failed.  This isn't a problem for someone who already reads Japanese well, but what about students who don't yet?  Should we say "learn to read fluently before you do any typing?"  We might as well stick to paper dictionaries while we're at it.

Yudan actually does make a good point.  Most typing is in words.  However, let's say you're a student looking up a new four-character compound.   You know an 音読み for the first and third character, a 訓読み for the first, and (thanks to RtK) how to write all of them.  The best you can do with 変換 is type in the characters you know--which only works if they don't have too many homophones--and multi-radical the ones you don't. 

I don't know about other students, but I hate using a mouse to hunt-and-peck Kangxi radicals.

Finally, Magamo's right: typists "hear" what they're typing more than they feel how it would be written by hand.  For example, when I type the word "the," I don't think "T---H---E," I just think "the" and my fingers press the right sequence (actually "kjd").  Those sequences don't have to be based on sound.  The popularity of non-phonetic input among Chinese typists proves that.  I guess the biggest difference is where the conversion happens: in the computer (which can't read your mind) or in your cerebellum (which can, but requires training).

Anyway, that's the why.  I'm not trying to convince anyone to give up 変換.  I'm just not happy with it myself.  If it works for you there's no reason to switch.  I was hoping that either someone could point out something I missed, or be willing to give my personal hack a try.

If anyone is interested in that, I'm more than happy to share.

EDIT: PS.  This will require me creating a database of Heisig-style decompositions, which may be useful for other projects, even if you don't use it for your normal typing.  Yes, I'm willing to share that, too.

Last edited by wildweathel (2009 November 20, 10:18 am)

Codexus Member
From: Switzerland Registered: 2007-11-27 Posts: 721

wildweathel wrote:

That's a case where you did everything correctly, but the conversion failed.  This isn't a problem for someone who already reads Japanese well, but what about students who don't yet?  Should we say "learn to read fluently before you do any typing?"  We might as well stick to paper dictionaries while we're at it.

But your method requires that you remember which kanji to write and how to write them which I think is more difficult than selecting the correct conversion or fixing a typo.

Still, interesting idea. If you implement it, let us know the results.

sethg Member
From: m Registered: 2008-11-07 Posts: 505

That's a really cool idea :) Would certainly help with not forgetting how to write kanji just because you type most of the time :D

donjorge22 Member
From: UK Registered: 2009-08-03 Posts: 73 Website

Not so sure I like this idea.  If you're so bothered about getting the right kanji, why not just use a tablet IME (mine cost just a little more than £20 and works great).  Then you don't forget how to write them either.  If you want to type quickly, the current methods work really well too.

Save your time man! Go and do some learning instead, this just sounds like a great way to procrastinate to me :p

wildweathel Member
Registered: 2009-08-04 Posts: 255

donjorge22 wrote:

Not so sure I like this idea.  If you're so bothered about getting the right kanji, why not just use a tablet IME (mine cost just a little more than £20 and works great).

I've looked.  There simply aren't decent handwriting recognition systems for free OSes, no matter how much you pay.  Maybe someone will get around to cloning OS X's hanzi-on-touchpad feature, but I doubt it.

The reverse problem applies as well.  MS-IME can't load custom tables.  IBus, SCIM, and the OpenSolaris IME can.  Since I'm currently implementing this as a Python program that makes an IBus table, I don't know a good way to include Windows users.  (Maybe AutoHotkey, but I don't know if it has the ability to handle the type of conversions I want.)

Save your time man! Go and do some learning instead, this just sounds like a great way to procrastinate to me :p

Don't worry.  I'm timeboxing this.

That said, if someone already had a database of the RTK primitive decompositions, that would save me a lot of time.  A lot.

donjorge22 Member
From: UK Registered: 2009-08-03 Posts: 73 Website

I've looked.  There simply aren't decent handwriting recognition systems for free OSes, no matter how much you pay.  Maybe someone will get around to cloning OS X's hanzi-on-touchpad feature, but I doubt it.

The reverse problem applies as well.  MS-IME can't load custom tables.  IBus, SCIM, and the OpenSolaris IME can.  Since I'm currently implementing this as a Python program that makes an IBus table, I don't know a good way to include Windows users.  (Maybe AutoHotkey, but I don't know if it has the ability to handle the type of conversions I want.)

Ah.  Yes.  Actually, this in particular was something that had been causing me considerable grief - I wiped Windows from my system in favour of Ubuntu, and then discovered that I couldn't make my pad work, and even if I could, there wasn't any handwriting recognition, so I take back that comment.  I did see a collaborative thing for Hanzi recognition somewhere though, though they said they'd only done 20% or so, and it wasn't forgiving of improper stroke order.

Can I take this opportunity to ask how IBus and SCIM work - something else I couldn't get going :S.

donjorge22 Member
From: UK Registered: 2009-08-03 Posts: 73 Website

Sorry for double post, but...

Since I'm currently implementing this as a Python program that makes an IBus table, I don't know a good way to include Windows users.

Leave Windows users out? I think your idea would be far more useful to Linux users in any case (myself included), since the alternatives aren't up to scratch.  Would actually be great if you did that :)

Pauline Member
From: Sweden Registered: 2005-10-04 Posts: 134

wildweathel wrote:

My ideal IME would be based on stroke-order decomposition of the elements within a character: thanks to RTK, 凝 is already in my memory as 冫匕矢マ疋, why can't I just type it that way?

I have read about an input method called NIK-Code that seems to do what you want. Its home page disappeared in 1998, but I could get most of the pages through an archive. You can use the descriptions (especially the keyboard layout) if you decide to implement a similar input method. Here are the pages I could get.

Some Googling turned up a surprising variety of direct methods.  But, with the exception of TUT-CODE, they seem to be defunct.

I spent a couple hours today trying to get TUT-CODE installed without success.  It's probably just as well: I can't see any organizing principles behind the kanji layout.  Memorizing it would have been a huge pain.  The one thing I like about it most is that there aren't separate kana and kanji modes: you just type, some sequences correspond to kanji and some to kana.

I'm using T-Code on Ubuntu which is similar to TUT-Code. The characters are placed according to frequency with the most common characters being the most comfortable to type.

Sebastian Member
Registered: 2008-09-09 Posts: 582

wildweathel wrote:

My ideal IME would be based on stroke-order decomposition of the elements within a character: thanks to RTK, 凝 is already in my memory as 冫匕矢マ疋, why can't I just type it that way?

I remember a couple of months ago there was someone who used something that sounded a lot like what you're describing. Unfortunately, I don't remember the name of the member nor the system he used, but those posts are certainly available if you can find them.

fugu68 Member
From: Tokyo Registered: 2005-11-30 Posts: 115

donjorge22 wrote:

Not so sure I like this idea.  If you're so bothered about getting the right kanji, why not just use a tablet IME (mine cost just a little more than £20 and works great).

Could you explain this a little further - are you using the standard MS IME Pad for this, or something fancier?

magamo Member
From: Pasadena, CA Registered: 2009-05-29 Posts: 1039

Are you using MS-IME or other default Japanese IMEs for your OS and complaining about the input method? It's like saying PCs suck when you use your Dell laptop out of the box and surf the internet without knowing what an internet browser means. People in the know use better software than bundled crap, right? The same goes for IMEs. People who know how IMEs should work use ATOK.

It analyzes your sentence and gives much more accurate suggestions based on grammar, collocation, context, etc. It also catches certain grammar errors, non-standard conjugations, unusual writing style changes (e.g., です/ます to だ/である switch), casual/informal conjugations and so on, and gives you warnings. You can type in a major dialect too.

For example, I'm converting がっこうからかえってきたらてれびみれるよ right now, and it shows 学校から帰ってきたらテレビ見れるよ, warns that 見れるよ is ら抜き表現, and suggests 見られるよ as a proper conjugation. I convert わかんない, and it says "くだけた表現" because it's a casual form of 分からない.

Some versions have 明鏡国語辞典 as a built-in J-J dictionary, so you can read the entry for each alternative kanjification while typing; a shortened dictionary entry pops up right next to each suggestion. As I said somewhere on RTK forums, the dictionary is like "learners' dictionary" for native speakers. It focuses on contemporary word usage and differences between synonyms/homonyms. For example, you can read the dictionary's explanation for the difference between 会う, 逢う, 遭う, and 遇う, all of which mean "meet" or "see" in English, while you're typing あう. There is a short-cut command to read the unabridged version of the dictionary entry for each word you're typing too.

Of course it doesn't work very well if your grammar isn't good because your sentence only confuses ATOK. It can't analyze your Japanese well if you ignore collocations and use unusual wording. But it's not the software's problem. If your Japanese is too irregular and unnatural, any kind of writing support software doesn't work well.

I think it's great if you make a similar IME for non-native speakers. You might modify the grammar analyzer so it detects/corrects typical errors by native English speakers; ATOK is good at picking up on errors by native speakers but may just be confused by unusual errors. It might be a good idea to replace the J-J dictionary with a J-E dictionary for beginning/intermediate learners.

Anyway, I think input methods should be improved in this direction. IMEs are not just kanji converters. A good IME assists your writing.

mezbup Member
From: sausage lip Registered: 2008-09-18 Posts: 1681 Website

@magamo: great post and that's the sort of IME that would be beneficial to us, you're quite right.

killeralgae Member
From: New York Registered: 2009-05-27 Posts: 15 Website

Is there any easy way to create an IME in Windows? I know it is pretty simple on Linux/Mac.

samesong Member
From: Nagano Registered: 2008-06-13 Posts: 242 Website

magamo wrote:

Save your time man! Go and do some learning instead, this just sounds like a great way to procrastinate to me :p

Don't worry.  I'm timeboxing this.

You missed his point. The amount of hours you are going to spend creating, implementing, and learning this system will probably be hundreds. You could spend a fraction of that time correctly learning how to use a standard IME, and the rest improving your Japanese skills.

Not to mention when you are at any other computer, you won't be able to use your own system to type.

If it's just something you want to make for shits and giggles, go for it. Everybody has side projects that they do for the hell of it. But as for a serious alternative to IME, you should seriously reconsider the amount of time and effort required of you to create something that will ultimately never be as fast or smart as what already exists.


magamo wrote:

I convert わかんない, and it says "くだけた表現" because it's a casual form of 分からない.

That sounds like it would be quite annoying after a while! :P I have both ATOK and the standard MS IME installed at work, and honestly, sometimes ATOK tries to be TOO smart. I'm not at work so I can't give you a specific example right now, but I remember trying to get just one kanji to show up, like typing ぎょう to get 凝 or something. ATOK didn't recognize that as a word, so I couldn't get the damned kanji to show up.

Though I'm sure the benefits outweigh small annoyances such as that.

magamo Member
From: Pasadena, CA Registered: 2009-05-29 Posts: 1039

samesong wrote:

magamo wrote:

I convert わかんない, and it says "くだけた表現" because it's a casual form of 分からない.

That sounds like it would be quite annoying after a while! :P I have both ATOK and the standard MS IME installed at work, and honestly, sometimes ATOK tries to be TOO smart. I'm not at work so I can't give you a specific example right now, but I remember trying to get just one kanji to show up, like typing ぎょう to get 凝 or something. ATOK didn't recognize that as a word, so I couldn't get the damned kanji to show up.

Though I'm sure the benefits outweigh small annoyances such as that.

You can turn off specific features though. You can also add specific words and personalize it. But if I want to get 凝 and that's the only kanji I want, I'd type ぎょうし and delete 視 because I know I can get 凝視; that's one of the most frequent words using 凝 and no homonyms would appear. Actually if you fail to get a correct kanji compound from kana and try to get the right one by this method, ATOK asks you if you want to include the kanjified version for the kana so it gives the kanji compound next time.

ATOK isn't perfect, but it works much better than MS-IME and ことえり (the default IME for Mac) for me. The most annoying thing when it comes to MS-IME is that it gets stupider and stupider as it "learns" users' styles... Have you experienced this?

http://furukawablog.spaces.live.com/Blo … 9079.entry

By the way, I think you misquoted another person's post in the first half of your post. It's not me.

samesong Member
From: Nagano Registered: 2008-06-13 Posts: 242 Website

magamo wrote:

IME stuff

from article wrote:

検証苑 "腱鞘炎"とどうしても変換入力できないので、検索エンジンでネットにて検索してコピペしました

Hah, I actually had this exact problem. It means tendinitis, and I was looking around for some problems regarding this because I'm an avid (but crappy) rock climber with some pain in my elbows and wrists. Stupid IME wouldn't convert it correctly.

Last edited by samesong (2009 November 20, 10:39 pm)

Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

wildweathel wrote:

原子力発電所
Anthy, romaji: gensiryokuhatudensho<SP><RET> (22 strokes)
Anthy, kana: け濁んしりょくはつて濁んしょ<SP><RET> (16 strokes, 2 shifted)

Nitpick: <RET> is not required. You can simply start typing whatever you want to enter next or click a relevant button. The only time you'd need a <RET> is when you want to submit the default action for a form or start a new line, in which case you'd have to do <RET><RET> instead of <RET>.

Keystrokes should not be the metric though, avg. kanji-per-minute rate should be. If it takes longer because you have to think of every part of the kanji, the order they come in, and the corresponding code and then input it than it does to just type the reading, then there is no point. This IS an issue because most Japanese people have become ワープロ馬鹿.

Direct entry is interesting, but superfluous given a smart enough IME. It would make a useful alternative to 手書き when you are trying to look up an unknown character though.

samesong wrote:

from article wrote:

検証苑 "腱鞘炎"とどうしても変換入力できないので、検索エンジンでネットにて検索してコピペしました

Hah, I actually had this exact problem. It means tendinitis, and I was looking around for some problems regarding this because I'm an avid (but crappy) rock climber with some pain in my elbows and wrists. Stupid IME wouldn't convert it correctly.

It sounds like a quote from someone who (no offence) doesn't know how to use computers very well. Most IMEs learn, so you can teach them the correct kanjification. Ex: my girlfriend's name is written with weird readings. I only had to hunt through the suggestion box individually for each kanji a few times before it automatically started outputting 英里名 for えりな. There are occasions where a word's kanji will not even appear in the individual selection box. Using the IME keyboard interface you can then change which part of the reading it will try to convert and find the kanji individually this way, or you can just enter it into your user dictionary so it'll come up easily next time. There are a number of pre-compiled "improved" dictionaries on the internet for various IME systems to save you the effort. For reference, Kotoeri on OS X brings it up as the first suggestion no problem in the default dictionary.

Also, as magamo says, commercial IMEs like ATOK are a lot smarter.

Kotoeri has only gotten smarter over time for me, but the English auto-spellcheck on my iPhone has gotten idiotic over the past few months. It doesn't even fix im -> I'm much of the time now. It's probably because it is meant to fix typos, not something the user repeatedly enters on purpose (in my case to speed typing by skipping the punctuation kb), so it is learning "im" as a word.

Last edited by Jarvik7 (2009 November 21, 12:12 am)

wildweathel Member
Registered: 2009-08-04 Posts: 255

First up, a huge thanks to Pauline.  That is very helpful indeed.  I've also found a Javascript implementation of NIK-50, which contains a large code list.  If anyone's curious, it shouldn't be hard to port NIK to IBus.
http://panathenaia.halfmoon.jp/key/nik50demo.html

Also, kudos to Magamo.  As usual, you're very helpful and insightful.  I would like to particularly thank you for mentioning 明鏡国語辞典, which is definitely going on my "to-buy" list.  I get the sense that we don't see eye-to-eye here: basically you see the IME as a tool for creating Japanese text, while I see it as a part of the keyboard driver.  The IME is, like any other keyboard driver, to enter characters, not second-guess language abuse such as weird names, bad kanji-substitution puns (that only I find funny), neologisms, or 漢字Esperanto.  Language support (spell check, grammar check, pop-up dictionary, prediction, syntax highlighting, etc) is language dependent, and thus belongs to a different tool entirely: the text editor or word processor.

-------
Changing gears, I would really appreciate it if no one else chimes in with a post that boils down to "just make things easy for yourself: use a standard IME. 出る杭は打たれる。 etc"  There are some people who are interested in my progress.  For their sake, I'd like to keep this thread clear of unconstructive criticism.  If you feel you need to warn people about the evils of 漢直, that's fair enough.  Please do that in another thread, and I'll link to it in post #1.  Even if not, I'm putting a big disclaimer up there to the effect that this is in fact a crazy personal project, not something that's necessarily (or even likely) better than standards.

And yes, I am just doing this for shits and giggles.  I try to live as much as possible for shits and giggles.  Sometimes it stinks, but at least you're always giggling.
-----
Okay, then, progress report for today:

The good:
- Hiragana works, and I've pretty much settled on my layout.
- I have a rough outline of how kanji input will work.  Basically, it's a hybrid of decomposition (NIK, CangJie) and ergonomic-not-mnemonic table methods (T-CODE, TUT-CODE, Table30).  Kanji -> up to 4 primitives (decomp) -> up to two strokes per primitive (table).  This means that complex characters might be even longer than CJ, but
- I think I can make good use of stroked pairs (like QWERTY er and oi or Dvorak th sh ch tr ld nd ...).  Actually, I think I can do better than Dvorak here, since I'm not limited by the same constraints.  I have the beginnings of a way to visualize assignment of pairs.  Ideally this visualization would show the same data that's in the tables, so I don't have to keep different things synchronized.  That should be possible with the build script.
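
The two-stage lookup outlined above (kanji -> primitives -> strokes) could be roughed out like this in Python. This is only a sketch under stated assumptions: the key codes are invented placeholders, and "keep the first four primitives" is a guessed abbreviation rule, since the outline doesn't say how longer decompositions get shortened. The tab-separated output is a generic layout for illustration, not a claim about any particular table format.

```python
# All data below is hypothetical: the key codes and the "keep the first four
# primitives" rule are placeholders, not the layout described in the post.
DECOMP = {
    "凝": ["冫", "匕", "矢", "マ", "疋"],  # decomposition quoted earlier in the thread
}
PRIMITIVE_KEYS = {"冫": "a", "匕": "hu", "矢": "te", "マ": "o", "疋": "nk"}

MAX_PRIMITIVES = 4  # "up to 4 primitives"
MAX_STROKES = 2     # "up to two strokes per primitive"


def key_sequence(kanji: str) -> str:
    """Stage 1: kanji -> primitives (decomp); stage 2: primitive -> keys (table)."""
    primitives = DECOMP[kanji][:MAX_PRIMITIVES]  # assumed abbreviation rule
    codes = []
    for primitive in primitives:
        code = PRIMITIVE_KEYS[primitive]
        if not 1 <= len(code) <= MAX_STROKES:
            raise ValueError(f"{primitive} needs 1-{MAX_STROKES} keys, got {code!r}")
        codes.append(code)
    return "".join(codes)


def table_rows(kanji_list):
    """Tab-separated 'keys<TAB>character' rows, one per kanji."""
    return [f"{key_sequence(k)}\t{k}" for k in kanji_list]


print("\n".join(table_rows(DECOMP)))
```

With these placeholder codes 凝 takes six strokes; the worst case under the stated limits is 4 × 2 = 8 strokes per kanji.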

The bad:
- Figure out the Capslock vs. Shiftlock issue.  This is a shortcoming of the table-based approach.
- Dvorak currently hard-coded.
- Build script is currently an ugly nest of for-loops.  Needs to be refactored.

Moving ahead:
- Make regression test for hiragana table.
- Refactor build script to be layout neutral.
- Add support for QWERTY.
- Add katakana.
- DEV RELEASE so anyone interested can play with it.
- English-keyword input of kanji and primitives.  (Can be copy/pasted, should be very easy)
- DEV RELEASE
- Rough-out table visualization feature.
- Rough layout of primitive elements.
- DEV RELEASE
- Database of primitive decompositions.
- Tune primitive layout.
- ALPHA
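
For the regression-test item, one possible shape is a check that no key sequence maps to two different outputs and that no sequence is a prefix of another. The prefix rule is an assumption: it matters only if a sequence is committed as soon as it matches, with no candidate selection. The function name and the sample entries are invented for illustration.

```python
def find_conflicts(table):
    """Check a list of (keys, output) pairs.

    Returns (duplicates, prefix_clashes):
      duplicates     - (keys, first_output, other_output) for reused sequences
      prefix_clashes - (short, long) pairs where `short` is a prefix of `long`;
                       at least one pair is reported per offending prefix
    """
    seen = {}
    duplicates = []
    for keys, output in table:
        if keys in seen and seen[keys] != output:
            duplicates.append((keys, seen[keys], output))
        seen.setdefault(keys, output)

    # After sorting, any sequence that is a prefix of another one is
    # immediately followed by a sequence that starts with it.
    codes = sorted(seen)
    prefix_clashes = [(a, b) for a, b in zip(codes, codes[1:]) if b.startswith(a)]
    return duplicates, prefix_clashes


# Hypothetical sample entries, not the real hiragana layout.
sample = [("ka", "か"), ("ki", "き"), ("k", "く"), ("ki", "ぎ")]
dups, clashes = find_conflicts(sample)
print(dups)
print(clashes)
```

Running this against the generated table after every layout change would catch accidental collisions before a dev release.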

PS: I just (re)discovered Katsuo's lists.  Wow.  Very cool.  I'm beside myself with warm bubblies.  I mean, I'd looked at them before, but I didn't think of them until now.

Last edited by wildweathel (2009 November 21, 7:27 pm)
