I know we have quite a few knowledgeable programmers here so I thought you might help me with a question.
For a future area of the website, I want to be able to tell the readings of the individual kanji that appear in compounds. Does anybody know how to do this?
So far I've looked at EDICT (from Jim Breen), but the compound readings are just strings of phonetics. I need something like a separator in there, so at least I could tell which phonetics are tied to which kanji in the compound.
Another way to put it: how would I determine the exact readings if I wanted to display furigana over each kanji of a compound?
Also if you know of any useful resources like EDICT and KANJIDIC, that I can use in developing the website and fill in the database with useful data, please let me know.
PS: it looks like ChaSen could help, but I can't read its documentation yet. Can someone read it and give an overview of what it can do (and what language it's written in)?
PPS: getting closer, hmmm...
Seems like you want to do something similar to Rikai? Maybe the solution could be the same (just show the reading of the compound, and the possible readings for each kanji). Even if it's not exactly what you want, it's pretty close. ;-)
You could also write software to "guess" the readings, as you already have the kanji and compound readings, but I guess there will be times when this won't work...
BTW, did you try to contact Mr. Jim Breen about this subject?
Last edited by Ricardo (2007 March 22, 11:24 am)
You could also write software to "guess" the readings, as you already have the kanji and compound readings, but I guess there will be times when this won't work...
Yes, that would be pretty difficult to do and it's very unlikely I would go that route, since I would be "reinventing the wheel". I think it's called "tokenizing" Japanese sentences and that's a lot of work.
It looks like that's what "ChaSen" is doing. I hope it can do the furigana for lists of separate words, that way I could run it on all the EDICT entries and my problem will be solved. But I can't tell yet as all the site and documentation is in Japanese.
Perhaps ChaSen can find the furigana, but only for the compounds as a whole, so that won't help me much.
ファブリス wrote:
You could also write software to "guess" the readings, as you already have the kanji and compound readings, but I guess there will be times when this won't work...
Yes, that would be pretty difficult to do and it's very unlikely I would go that route, since I would be "reinventing the wheel". I think it's called "tokenizing" Japanese sentences and that's a lot of work.
It doesn't sound difficult to me, unless I don't understand what you're looking for. You find the compound reading in EDICT or somewhere else, and then check within KANJIDIC to see if the beginning of the reading is a match for any of the readings of the first kanji, and if that's a success, you repeat for the remaining kanji in the compound. It wouldn't work 100% of the time but it doesn't sound difficult.
Why not just hold the shift key when hovering over a kanji using rikaichan? It'll give you the onyomi and kunyomi for individual kanji as you move your mouse over them. And you can figure it out from there.
There are also problems you'll run into.
For example.
一ヶ月 pronounced "ikkagetsu"
一本 pronounced "ippon"
井上 (it's a name and it can be pronounced "Ine" or "Inoue")
I might even argue that "ippon" is the pronunciation for both kanji together and that there's no way to separate it into readings for individual kanji. Somehow I doubt that "ppon" is listed as a reading for the "hon" kanji.
Also, there's the issue of names. There are many names that have many possible readings. In such a case, the only way to know the pronunciation would be to know what people call that person!
I think such a project is more trouble than it's worth. You can usually guess
the individual pronunciation by just looking at the kanji and the reading. It's really
not that hard. After a while it'll become second nature. And moreover, you can just use
rikaichan.
Last edited by chamcham (2007 March 22, 8:51 pm)
I think Mr. Fabrice is thinking of making a component for this website that would utilize that kind of information and therefore holding shift while using Rikaichan wouldn't get him very far.
Yes, he would have to allow for euphonic changes but that's still not very hard. They're very predictable. You certainly can split 一本 into two parts. イッ(いつ)+ポン(ほん).
I did also say that it won't work 100% of the time, and I think it's pretty obvious that it wouldn't be smart to try and process names in this way, so he wouldn't have to deal with 井上.
Last edited by JimmySeal (2007 March 22, 10:28 pm)
I think that a simple and small script could scan the EDICT and KANJIDIC files and count how many compounds can be matched with certainty and how many cannot, so there would be no guessing about how well this would work.
OTOH, I really don't know if all the effort (or any at all) for such a feature is really worth it, if you already have the compound reading AND the readings for each kanji...
Just my 2厘...
I need to tell readings of kanji apart, for the purposes of covering part of the RTK2 methodology, i.e. for the purpose of allowing people to review the kanji readings separately from vocab or sentences.
Additional review of kanji readings could be very beneficial in helping learners acquire more vocabulary. I think that's part of the idea behind RTK2. But most people won't do RTK2 systematically, and will learn readings "on the fly". That's where my plan comes in: it will blend the best of both approaches.
When someone enters a new compound for review, like 電話, I want them to be able to review kanji readings separately, and only those readings that actually appeared as part of the vocab they have studied! So it's all progressive and totally adapts to your learning material, what kanji you know, any order you like.
For this, I need to tell kanji readings apart.
I need to be able to split the phonetic string that is given for each compound in EDICT, so that I can tell with 100% accuracy which part of the reading corresponds to which character. It seems evident in most cases with a single onyomi, but in other cases I guess you'd have to do something like the back-and-forth matching (backtracking) that regular expressions do. Say you have a compound read with three kana, te tsu ku (not a real word): maybe both tetsu and te appear as readings for the first kanji, and maybe the second one could be read as tsuku or just ku. This could be rare, but it could happen, I don't know.
chamcham, good point. Proper names would not be covered so that would normally not be an issue.
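To make the "te tsu ku" worry concrete, here is a minimal sketch that enumerates every possible split instead of stopping at the first one. The placeholders X and Y stand in for kanji, and their readings are made up to mirror the example above; the function name `all_splits` is just for illustration.

```python
def all_splits(chars, reading, readings):
    """Return every way to cut `reading` into one piece per character."""
    if not chars:
        # Success only if the whole reading has been consumed.
        return [[]] if reading == "" else []
    results = []
    for candidate in readings[chars[0]]:
        if reading.startswith(candidate):
            for rest in all_splits(chars[1:], reading[len(candidate):], readings):
                results.append([candidate] + rest)
    return results


# Made-up data: X and Y are placeholders, not real kanji.
readings = {"X": ["てつ", "て"], "Y": ["つく", "く"]}
splits = all_splits("XY", "てつく", readings)
print(splits)  # [['てつ', 'く'], ['て', 'つく']]
```

More than one result means the compound is genuinely ambiguous under the given reading lists and would need a manual decision.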
Do you think there are kanji compounds like that which are not proper names, but can also have multiple readings?
JimmySeal wrote:
I think Mr. Fabrice is thinking of making a component for this website that would utilize that kind of information and therefore holding shift while using Rikaichan wouldn't get him very far.
Yes. Could you expand a bit on the euphonic changes?
Ricardo wrote:
I think that a simple and small script could scan the EDICT and KANJIDIC files and count how many compounds can be matched with certainty and how many cannot, so there would be no guessing about how well this would work.
I don't understand what you mean, Ricardo, could you explain again?
Ricardo wrote:
OTOH, I really don't know if all the effort (or any at all) for such a feature is really worth it, if you already have the compound reading AND the readings for each kanji...
It's not absolutely necessary; however, it's a substantial part of what I have in mind, and would be very beneficial for learners to have. Reviewing kanji readings separately could help create yet more "links" in memory and really solidify all the material, and help learn new compounds at an increasingly higher pace.
For this, in the spirit of RTK1, I want people to be able to review only those readings that they actually use. Presenting the learner with a flashcard containing all the possible readings from the KANJIDIC database would be very inefficient at best.
I think this would work a lot better than the experimental On Yomi area I had developed.
Kanji chains are just too much work, and people want to learn vocab and use it straight away!
PS: by the way, in regard to the "extra developers" thread, if you want to give a hand, this is something I could use extra help with, and it would be extremely useful for an upcoming area of the site.
Here's a simple algorithm explaining what I mean:
You start with the compound, like 電話.
From Edict, you get the compound reading: でんわ.
From Kanjidic, you get the readings for each kanji:
電 = でん
話 = わ, はな.す, はな.し
Following the order of the kanji, check if the kanji reading is present in the compound reading:
1) is でん (reading for 電) present in the start of でんわ? Yes, we found the first reading, we continue with the remaining part of the reading (わ);
2) is わ (one of the readings of 話) present in わ? Yes, we found another reading, and there's nothing else to look for - finished.
Of course, this won't work with something like 今日 if the reading is きょう (but would if reading is こんにち).
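The steps above can be sketched in a few lines of Python. This is only a rough illustration: the reading lists are typed in by hand (the ones for 今 and 日 are a small subset, not a full KANJIDIC entry), the function name `split_greedy` is made up, and it takes the first matching reading for each kanji without ever going back.

```python
def split_greedy(compound, reading, readings):
    """Match each kanji's readings against the start of what is left."""
    parts = []
    rest = reading
    for kanji in compound:
        for candidate in readings.get(kanji, []):
            if rest.startswith(candidate):
                parts.append(candidate)
                rest = rest[len(candidate):]
                break
        else:
            return None  # no reading of this kanji fits
    # Leftover kana also count as a failure.
    return parts if rest == "" else None


# Hand-typed sample data (assumption, not read from the real files).
readings = {
    "電": ["でん"],
    "話": ["わ", "はな.す", "はな.し"],
    "今": ["こん", "きん", "いま"],
    "日": ["にち", "じつ", "ひ", "か"],
}

print(split_greedy("電話", "でんわ", readings))    # ['でん', 'わ']
print(split_greedy("今日", "こんにち", readings))  # ['こん', 'にち']
print(split_greedy("今日", "きょう", readings))    # None
```

The `None` result is the signal that a compound could not be split and needs to be flagged.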
Here's a bit more complicated example: 話し合い = はなしあい
話 = ワ, はな.す, はなし
し = し
合 = ゴウ, ガッ, カッ, あう, -あう, あ.い, あい-, -あい, あわす, あわせる, -あわせる, あう, あん, い, か, こう, ごお, に, ね, や, り, わい
い = い
For each kanji, scan each kanji reading in the compound reading:
1) start processing はなしあい
2) 話: ワ = false; はな(.す) = true! continue processing しあい
3) し is not a kanji (the reading is itself); continue processing あい
4) 合: ゴウ, ガッ, カッ, あう, -あう = false; あ(.い) = true!
5) い = い
Of course, this is a bit oversimplified, and I'm not even mentioning the cases where readings of kanji can overlap each other. As I said, I have no idea if this can work or not without running this algorithm over the data and counting the cases where it works and where it surely won't (when the function finishes without finding all elements of the compound reading). The routine can be improved by analyzing the failed cases.
BTW, as you can see, this can be implemented as a recursive procedure/function... I think that a test routine like this can be written in 15min~30min (in Perl, anyway). :-P
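As a rough idea of what such a recursive routine could look like, here is a sketch in Python rather than Perl. It assumes KANJIDIC-style entries where a dot marks the start of the okurigana and a hyphen marks a prefix or suffix form, converts katakana on-yomi to hiragana before comparing, and lets kana in the compound match themselves. The function names and the hand-typed reading lists (a subset of those quoted above) are for illustration only.

```python
def to_hiragana(text):
    """Shift katakana code points down to their hiragana counterparts."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def stem(entry):
    """Reduce an entry such as 'はな.す' or '-あい' to the part the kanji covers."""
    return to_hiragana(entry.strip("-").split(".")[0])


def split_reading(compound, reading, readings):
    """Return one piece of `reading` per character of `compound`, or None."""
    if not compound:
        return [] if reading == "" else None
    head, tail = compound[0], compound[1:]
    if "\u3041" <= head <= "\u30ff":
        candidates = [to_hiragana(head)]  # kana reads as itself
    else:
        candidates = [stem(e) for e in readings.get(head, [])]
    for cand in candidates:
        if cand and reading.startswith(cand):
            rest = split_reading(tail, reading[len(cand):], readings)
            if rest is not None:  # otherwise try the next candidate
                return [cand] + rest
    return None


# Hand-typed subset of the readings listed above.
readings = {
    "話": ["ワ", "はな.す", "はなし"],
    "合": ["ゴウ", "ガッ", "カッ", "あう", "-あう", "あ.い", "あい-", "-あい"],
}
print(split_reading("話し合い", "はなしあい", readings))
# ['はな', 'し', 'あ', 'い']
```

Because a failed branch returns `None` and the loop moves on to the next candidate, a wrong early match does not sink the whole compound.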
Last edited by Ricardo (2007 March 23, 11:44 am)
ファブリス wrote:
It seems evident in most cases with a single onyomi, but in other cases I guess you'd have to do something like the back-and-forth matching (backtracking) that regular expressions do. Say you have a compound read with three kana, te tsu ku (not a real word): maybe both tetsu and te appear as readings for the first kanji, and maybe the second one could be read as tsuku or just ku. This could be rare, but it could happen, I don't know.
I can't think of ever encountering a case like this, but I guess it could happen. There definitely are non-name kanji compounds that have more than one reading, for example 博士 and 宿主. In such cases you'd have to either allow multiple readings or choose the more common one. And of course there are 熟字訓 compounds that really can't be split (like 大人, 小豆, 今日 and 明日); this is what I was mainly thinking of when I said the method wouldn't work 100% of the time.
ファブリス wrote:
Could you expand a bit on the euphonic changes?
The ones I can think of are the following:
On-yomi ending in チ or ツ often become ッ when followed by an unvoiced syllable.
On-yomi beginning in ハヒフヘホ often become パピプペポ (pa, pi, pu, pe, po) when preceded by ッ.
The beginning of a kun-yomi (and sometimes an on-yomi) often becomes voiced when it is not at the beginning of a compound (for example はち becomes ばち in みつばち).
The simplest thing would be to just check all the possible variants as described right here. This might cause a few situations like the one you described at the top of my post, but probably not many.
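One way to "check all the possible variants" is to expand each dictionary reading into a small set of candidate forms before matching. The sketch below is an assumption about how that could be done, not a tested rule set: it deliberately over-generates (it tries the ぱ-row for any non-initial reading, not only after ッ), expects non-empty hiragana input, and the name `variants` is made up.

```python
# First-kana substitutions for voicing and for the h-row to p-row change.
VOICED = str.maketrans("かきくけこさしすせそたちつてとはひふへほ",
                       "がぎぐげござじずぜぞだぢづでどばびぶべぼ")
HALF_VOICED = str.maketrans("はひふへほ", "ぱぴぷぺぽ")


def variants(reading, is_first, is_last):
    """Return the reading plus the euphonic forms it might take in a compound."""
    forms = {reading}
    # A final ち or つ may contract to っ when another kanji follows.
    if not is_last and reading[-1] in "ちつ":
        forms.add(reading[:-1] + "っ")
    # A non-initial reading may have its first kana voiced or turned into the p-row.
    if not is_first:
        for form in list(forms):
            forms.add(form[0].translate(VOICED) + form[1:])
            forms.add(form[0].translate(HALF_VOICED) + form[1:])
    return forms


print(sorted(variants("いち", is_first=True, is_last=False)))  # ['いち', 'いっ']
print(sorted(variants("ほん", is_first=False, is_last=True)))  # ['ほん', 'ぼん', 'ぽん']
```

With these sets, 一本 can be matched as いっ + ぽん and the second half of みつばち as ばち. The price of over-generating is a few more ambiguous splits to look at by hand.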
I think it would be easy to go through EDICT, processing the entries this way, splitting the readings up with some separator like / or -, and flagging the ones that couldn't be matched. I'd be happy to make a program to do this, but it would have to be in a few days, since I'm leaving in a few hours to attend a wedding that's tomorrow.
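A sketch of that pass over the file, assuming EDICT lines of the form `word [reading] /gloss/.../`. The actual matcher is passed in as `split`; in the demo it is replaced by a tiny hard-coded stand-in so the example runs on its own. The function name `annotate` and the sample lines are illustrative.

```python
import re

# word, space, [reading], space, then the glosses
EDICT_LINE = re.compile(r"^(\S+) \[([^\]]+)\] /")


def annotate(lines, split, sep="-"):
    """Return (annotated entries, words that could not be split)."""
    done, failed = [], []
    for line in lines:
        m = EDICT_LINE.match(line)
        if not m:
            continue  # no bracketed reading (kana-only entry or header)
        word, reading = m.groups()
        parts = split(word, reading)
        if parts is None:
            failed.append(word)  # flag for manual review
        else:
            done.append(f"{word} [{sep.join(parts)}]")
    return done, failed


# Stand-in for a real matcher: knows one word and fails on everything else.
known = {("電話", "でんわ"): ["でん", "わ"]}
sample = [
    "電話 [でんわ] /(n) telephone/",
    "今日 [きょう] /(n-t) today/",
    "ああ /(int) Ah!/",
]
done, failed = annotate(sample, lambda w, r: known.get((w, r)))
print(done)    # ['電話 [でん-わ]']
print(failed)  # ['今日']
```

Keeping the failures in a separate list makes it easy to post them for review or to count them after each change to the matching rules.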
Yeah create your own database, processing the whole thing and marking the breaks with a separator. I'd manually check out the list of failed kanji and do something with each of those. Then post that list of failed kanji here, as that'd be interesting!
Or better add some rules so that they too can be processed automatically...
Well, there are definitely going to be some that can't be split up at all, but I will rework the algorithm until I get the best results I can.
Anyway, I've been working on this task. It's more of a production than I had foreseen, but I should have some preliminary results within 24 hours.
RoboTact wrote:
Or better add some rules so that they too can be processed automatically...
Well, if there are in fact rules for the exceptions, then it would work, but I'm thinking of those like 今日 where I think we can assume there is no good rule that will solve that one. I'd like to see a list of all those like that.
JimmySeal wrote:
Well, there are definitely going to be some that can't be split up at all, but I will rework the algorithm until I get the best results I can.
See Other complications. It doesn't make sense to split these readings anyway, as far as I understand.
For the database we can have a string with the readings separated by a special character, and a string for the whole reading as it comes off EDICT. Only the On and Kun readings will be used.
In the case of 紫陽花 (あじさい) for example, the string with the split readings will be null, since no On or Kun readings could be matched.
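As a small illustration of that layout, here is a sketch using an in-memory SQLite table. The table name, column names and the `|` separator are all made up for the example and say nothing about the site's real schema; it also assumes the split string holds exactly one piece per character of the compound.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE compound (word TEXT, reading TEXT, split_reading TEXT)")
db.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [
        ("電話", "でんわ", "でん|わ"),
        ("紫陽花", "あじさい", None),  # no On/Kun match: split stays NULL
    ],
)


def kanji_readings(word):
    """Return (character, reading) pairs, or [] if the compound was not split."""
    row = db.execute(
        "SELECT split_reading FROM compound WHERE word = ?", (word,)
    ).fetchone()
    if row is None or row[0] is None:
        return []
    return list(zip(word, row[0].split("|")))


print(kanji_readings("電話"))    # [('電', 'でん'), ('話', 'わ')]
print(kanji_readings("紫陽花"))  # []
```

An empty list means the kanji are skipped for separate review, while the whole reading is still available for vocabulary review.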
Those kanji won't be reviewed separately, at least not in relation with this compound. But the compound itself can be reviewed as part of the vocabulary review. If that doesn't make sense email me and I'll give you some more details.
JimmySeal wrote:
Anyway, I've been working on this task. It's more of a production than I had foreseen, but I should have some preliminary results within 24 hours.
Any help is appreciated. You're writing a Perl script right? I won't be working on the new area until I'm back from Japan so there's plenty of time. But if you have a script that can fill in the on/kun readings in the database, for each compound, I'll be able to start right off with the interface. Thanks for looking at it. Btw, have you looked at ChaSen? It's made in Perl also I think, could save some time.
ファブリス wrote:
Any help is appreciated. You're writing a Perl script right? I won't be working on the new area until I'm back from Japan so there's plenty of time. But if you have a script that can fill in the on/kun readings in the database, for each compound, I'll be able to start right off with the interface. Thanks for looking at it. Btw, have you looked at ChaSen? It's made in Perl also I think, could save some time.
I have an overwhelming fear of Perl code, so though Perl would probably get the job done a little faster, I'm working with Java. But since I'm planning to simply go through EDICT, split the readings up with separators, and write it back to a file, the output is not implementation-dependent. And once I have it ready, it should be easy to modify the parser to deal with any other collection of Entry/Reading pairs if need be.
Anyway, I'll let y'all know when I have some deliverables.
From what I can tell, ChaSen seems to be a Japanese tokenizer, which is not applicable to this task.
Last edited by JimmySeal (2007 March 26, 7:34 pm)
JimmySeal wrote:
From what I can tell, ChaSen seems to be a Japanese tokenizer, which is not applicable to this task.
Hmm yes, I had doubts about that. So when it gets down to compounds it uses the whole reading. I also looked at furigana in books and noticed that they don't really try to stick each reading next to its kanji, but rather put the whole reading for the compound above or next to the word (if it's vertical writing). So I considered looking for a code sample that adds furigana, and then realised it might not handle the task of separating the readings either.
I don't have experience with Java but if you have the logic down and I need to adapt it somehow I can rewrite parts of it in Perl, it will still have saved me a bunch of time so no problem!
I like Perl, although the syntax is a little obscure. But I find switching between Javascript, PHP (bleh), and Perl is confusing.
I find myself having a similar need: breaking up compound readings into kanji readings in an automated way, for statistical purposes. I was planning to go with regular expressions, but if anyone has solved the problem already, I'd rather use that.

