sikieiki wrote:
Noticed something which I believe to be a bug.
If you highlight a word in certain conjugations, it is recognized by the English dictionary but not by Sanseido. Copying it with C produces the unconjugated form from the English dictionary along with its definition, and Sanseido will then recognize that form once it is pasted somewhere.
Not sure how the dictionary lookups are performed, so it may be impossible, but could the English dictionary result be used as a cross-reference to search Sanseido again if the first Sanseido lookup fails to produce anything? Something like a hybrid dictionary?
Example :
こき下ろされた
Rikaisama currently uses the unconjugated kanji form of the word (こき下ろす), which doesn't exist in Sanseido.
For a future release, I would like to do something like this:
1) First try looking up the unconjugated kanji form.
2) If nothing is found, try looking up the unconjugated reading.
3) If nothing is found, fallback to the default EDICT dictionary.
I might have time this weekend for a Rikaisama update. Maybe this will make it in.
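The three-step fallback described above could look roughly like the following. This is only a sketch in Python, not Rikaisama's actual code: `lookup_with_fallback`, the `Lookup` type, and the two stand-in dictionaries are hypothetical names made up for illustration, and the real extension performs Sanseido lookups in its own way.

```python
from typing import Callable, Optional, Tuple

# Hypothetical lookup type: takes a headword, returns a definition or None.
Lookup = Callable[[str], Optional[str]]


def lookup_with_fallback(
    kanji_form: str,
    reading: str,
    sanseido: Lookup,
    edict: Lookup,
) -> Optional[Tuple[str, str]]:
    """Return (source, definition), trying Sanseido before EDICT."""
    # Steps 1 and 2: try the kanji form first, then the kana reading.
    for key in (kanji_form, reading):
        definition = sanseido(key)
        if definition is not None:
            return ("sanseido", definition)
    # Step 3: fall back to the default EDICT entry.
    definition = edict(kanji_form)
    if definition is not None:
        return ("edict", definition)
    return None


# Stand-in dictionaries with placeholder text, for illustration only.
fake_sanseido = {"こきおろす": "<Sanseido entry for こきおろす>"}
fake_edict = {"こき下ろす": "<EDICT entry for こき下ろす>"}

result = lookup_with_fallback(
    "こき下ろす", "こきおろす", fake_sanseido.get, fake_edict.get
)
print(result)
```

Returning the source name along with the definition lets the caller tell the user which dictionary the text came from.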
Fangio wrote:
If there is a new version on the way, I would really welcome support for pitch accent, as already suggested (there are still so few materials for working on pitch accent!).
Can you explain how pitch accents are traditionally presented to the user and also point me to a resource that I can use to extract them?
I have just uploaded version 16.0 of the Rikaisama Firefox extension.
Download version 16.0 via SourceForge
What's New?
● Merged code with rikaichan 2.07 baseline.
● Implemented a fallback mechanism for Sanseido mode. (Thanks sikieiki!).
1) First the kanji form of the word is searched (example: こき下ろす)
2) If the kanji form is not found, search for the kana form (example: こきおろす)
3) If all else fails, display the default non-Sanseido definition.
● Fixed display of definition numbers and sub-definition numbers in Sanseido mode. Also, sub-definition numbers are now circled.
cb4960
Thank you for the update. The Sanseido lookup seems to work as expected now. I noticed the voice lookup still has the old problem though.
sikieiki wrote:
Thank you for the update. The Sanseido lookup seems to work as expected now. I noticed the voice lookup still has the old problem though.
I'll see about fixing the voice lookup code for the next version. Thanks.
cb4960 wrote:
Fangio wrote:
If there is a new version on the way, I would really welcome support for pitch accent, as already suggested (there are still so few materials for working on pitch accent!).
Can you explain how pitch accents are traditionally presented to the user and also point me to a resource that I can use to extract them?
Here's a lot of input about pitch accent and how it was implemented (in the case of spreadsheets): http://forum.koohii.com/viewtopic.php?id=9787
Also an Anki plugin (unfortunately not yet translated to Anki 2.0): http://forum.koohii.com/viewtopic.php?p … 01#p176601
Both use Japanese-only online dictionaries, which use the common numbering system (the number of the "accented" mora, i.e. the mora after which the pitch falls). A great (but probably somewhat complex) option (as an addition rather than an alternative) is the representation used here: http://accent.u-biq.org/english.html
Thanks for the info. Pardon my original laziness as I could have just looked at the Wikipedia entry which contained everything that I needed to know about the problem domain.
In a sense, Rikaisama already supports pitch accents by allowing you to use EPWING mode with the 大辞林 dictionary.
To enable pitch accents for non-EPWING mode, I could create a database beforehand by running each word in EDICT through the 大辞林 EPWING dictionary and parsing out the pitch accent information. Should be simple enough.
The method of display presented in the accent.u-biq.org site is interesting. Perhaps I could be a little less fancy and use underlined/bold/colored text for mora with high pitch.
cb4960 wrote:
Perhaps I could be a little less fancy and use underlined/bold/colored text for mora with high pitch.
But how would words with two (or more?) possible accents be treated? Some are marked as 01, etc.
toshiromiballza wrote:
cb4960 wrote:
Perhaps I could be a little less fancy and use underlined/bold/colored text for mora with high pitch.
But how would words with two (or more?) possible accents be treated? Some are marked as 01, etc.
Very good point. Until somebody can think of something more clever, I'll just append 01 or [0][1] and forgo the fancy visualization.
Last edited by cb4960 (2012 December 01, 2:33 pm)
Here is a human-readable version of the pitch accent database that is used by Rikaisama:
Tab-Separated Values Format:
Download japanese_pitch_accents_121216.txt via MediaFire
Notes:
The database contains 117,101 entries.
The database was generated by dumping the full contents of the 『大辞林 第2版』, 『NHK日本語発音アクセント辞典』, and 『新明解国語辞典第五版』 EPWING dictionaries to a text file using EBライブラリー and then parsing out the pitch information for words used by Rikaisama's default dictionary (EDICT). Processing takes a combined 15 seconds.
Database Format:
Column 1 = Expression -or- reading if word has no expression.
Column 2 = Reading -or- blank if word has no expression.
Column 3 = Pitch accent.
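For anyone who wants to read the file from a script, here is a minimal Python sketch of the three-column layout described above. It assumes one entry per line, tab-separated, with no header row; none of that is confirmed beyond the column list, and the names `PitchEntry` and `parse_pitch_tsv` are invented for this example. The sample rows are modelled on words mentioned in this thread rather than copied from the file.

```python
from typing import Iterable, Iterator, NamedTuple


class PitchEntry(NamedTuple):
    expression: str  # empty when the word has no kanji expression
    reading: str
    pitch: str       # raw pitch accent field, left unparsed


def parse_pitch_tsv(lines: Iterable[str]) -> Iterator[PitchEntry]:
    """Yield one PitchEntry per well-formed tab-separated line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # skip lines that do not have all three columns
        first, second, pitch = cols[0], cols[1], cols[2]
        if second:
            # Column 1 is the expression, column 2 is the reading.
            yield PitchEntry(first, second, pitch)
        else:
            # No expression: column 1 holds the reading, column 2 is blank.
            yield PitchEntry("", first, pitch)


# Illustrative rows only (assumed shape, not taken from the real file).
sample = ["願う\tねがう\t2\n", "ドライバー\t\t0\n"]
entries = list(parse_pitch_tsv(sample))
for entry in entries:
    print(entry)
```

To read the real download, something like `open(path, encoding="utf-8")` could be passed in place of `sample`, assuming the file is UTF-8.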
Last edited by cb4960 (2012 December 16, 4:52 pm)
I have just uploaded version 17.0 of the Rikaisama Firefox extension.
Download version 17.0 via SourceForge
What's New?
● Added pitch accent information to the right of the reading if available. (Thanks Fangio!).
To enable, check the "Options... -> Dictionaries Tab -> Show pitch accent" checkbox.
Perhaps a future version will have a better visualization of the pitch accent such as underlining the mora with the high pitch (where possible).
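As a rough idea of what such a visualization needs, the Python sketch below turns a reading plus a single accent number into a high/low label per mora, using the usual simplified textbook model for standard Japanese (0 = low then high, 1 = high then low, n = low, high up to the nth mora, then low). It is not taken from Rikaisama; `split_morae` and `pitch_pattern` are hypothetical helper names, and the mora splitting only handles small kana such as ゃ, ゅ, ょ.

```python
from typing import List, Tuple

# Small kana that combine with the preceding kana into a single mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(reading: str) -> List[str]:
    """Split a kana reading into morae (っ, ん and ー count as morae)."""
    morae: List[str] = []
    for ch in reading:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def pitch_pattern(reading: str, accent: int) -> List[Tuple[str, str]]:
    """Label each mora 'H' or 'L' for the given accent number."""
    morae = split_morae(reading)
    if not 0 <= accent <= len(morae):
        raise ValueError("accent number is larger than the mora count")
    labelled = []
    for position, mora in enumerate(morae, start=1):
        if accent == 0:
            high = position > 1            # flat: low start, then high
        elif accent == 1:
            high = position == 1           # high first mora, then low
        else:
            high = 1 < position <= accent  # rise after mora 1, fall after mora n
        labelled.append((mora, "H" if high else "L"))
    return labelled


print(pitch_pattern("ねがう", 2))
print(pitch_pattern("ちょさくけん", 3))
```

The morae labelled "H" are the ones that would be underlined, bolded, or colored.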
cb4960
Last edited by cb4960 (2012 December 01, 11:57 pm)
I have just uploaded version 17.1 of the Rikaisama Firefox extension.
Download version 17.1 via SourceForge
What's New?
● Added pitch accents for katakana words.
cb4960
Thanks for the pitch database. People have been trying to make that for ages using website-scraping tools. Looks like you've found a much more effective solution.
HelenF wrote:
Thanks for the pitch database. People have been trying to make that for ages using website-scraping tools. Looks like you've found a much more effective solution.
You're welcome. Just keep in mind that it is a minimal database designed specifically to be used with the words that Rikaisama can recognize. 大辞林 第2版 has pitch accent information for many more words than exist in the database.
Last edited by cb4960 (2012 December 02, 5:03 pm)
Adding accent data to Rikaisama sounds great!
cb4960 wrote:
toshiromiballza wrote:
cb4960 wrote:
Perhaps I could be a little less fancy and use underlined/bold/colored text for mora with high pitch.
But how would words with two (or more?) possible accents be treated? Some are marked as 01, etc.
Very good point. Until somebody can think of something more clever, I'll just append 01 or [0][1] and forgo the fancy visualization.
You could just write the reading several times to cover the different accents.
For example:
飛車
Sub-definition 1
ひしゃ\ - ひ\しゃ
Sub-definition 2
ひしゃ ̄
Where \ and  ̄ represent the underlining and "overline" that would signal the accent.
Last edited by Sebastian (2012 December 02, 5:32 pm)
Sebastian wrote:
You could just write the reading several times to cover the different accents.
I think that's a bad idea. It would create too much unnecessary clutter.
cb4960 wrote:
I have just uploaded version 17.0 of the Rikaisama Firefox extension.
Download version 17.0 via SourceForge
What's New?
● Added pitch accent information to the right of the reading if available. (Thanks Fangio!).
I should be the one thanking you, this is great! Thanks a million!
Would it be possible to get a key to copy the highlighted word? You can now use C to copy, but it copies the definition as well. Can you make alt + C copy the highlighted word/phrase only? Sometimes there is no definition and you want to search manually. Of course, you can do this with right clicking and using "Search google for XXX" to get the same effect, but it might be useful to get just the word in the clipboard by itself.
Additionally, I was wondering if there are any plans to support Firefox on Android? It looks like Firefox on Android supports add-ons, but I don't know how much would have to change to make it work.
How is it that some words are missing the accent? Are they not in Daijirin?
伸ばす or 並べる for example.
toshiromiballza wrote:
Sebastian wrote:
You could just write the reading several times to cover the different accents.
I think that's a bad idea. It would create too much unnecessary clutter.
Agreed.
sikieiki wrote:
Would it be possible to get a key to copy the highlighted word? You can now use C to copy, but it copies the definition as well. Can you make alt + C copy the highlighted word/phrase only? Sometimes there is no definition and you want to search manually. Of course, you can do this with right clicking and using "Search google for XXX" to get the same effect, but it might be useful to get just the word in the clipboard by itself.
You can press Ctrl-C to copy just the highlighted word/phrase.
sikieiki wrote:
Additionally, I was wondering if there are any plans to support Firefox on Android? It looks like Firefox on Android supports add-ons, but I don't know how much would have to change to make it work.
No plans.
toshiromiballza wrote:
How is it that some words are missing the accent? Are they not in Daijirin?
伸ばす or 並べる for example.
Good catch. Those words are in Daijirin and have pitch accents.
The problem seems to be either a bug in the library that I'm using to perform EPWING lookups or in the way I'm using the library. I'll need to investigate. Hopefully this is fixable.
Edit:
I believe that I have found the bug in EBライブラリ. It appears that for Daijirin, it is incorrectly converting the voiced consonants of the JIS 0208 form of the word to their non-voiced equivalents (i.e. ば -> は). Bypassing that particular conversion routine (eb_convert_voiced_consonants_jis) seems to fix the problem for my test words (伸ばす, 並べる, 黄ばむ). More testing is needed though.
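To illustrate why that kind of conversion breaks a lookup, here is a small Python sketch that strips the voicing marks from kana. It is not the EB Library routine and says nothing about how eb_convert_voiced_consonants_jis works internally; it works on Unicode text rather than JIS 0208, and `strip_voicing` is a made-up helper. It only shows how a search key changes once ば becomes は.

```python
import unicodedata

# Combining dakuten (U+3099) and handakuten (U+309A).
VOICING_MARKS = {"\u3099", "\u309a"}


def strip_voicing(text: str) -> str:
    """Replace voiced kana with their unvoiced base (e.g. ば -> は)."""
    # NFD splits a voiced kana into its base kana plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in VOICING_MARKS)
    return unicodedata.normalize("NFC", kept)


for word in ("伸ばす", "並べる", "黄ばむ", "行う"):
    print(word, "->", strip_voicing(word))
```

A key such as 伸ばす turns into 伸はす, which no longer matches the dictionary headword, while a word without voiced kana passes through unchanged.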
Last edited by cb4960 (2012 December 04, 11:55 pm)
cb4960 wrote:
I believe that I have found the bug in EBライブラリ. It appears that for Daijirin, it is incorrectly converting the voiced consonants of the JIS 0208 form of the word to their non-voiced equivalents (i.e. ば -> は). Bypassing that particular conversion routine (eb_convert_voiced_consonants_jis) seems to fix the problem for my test words (伸ばす, 並べる, 黄ばむ). More testing is needed though.
What about 表れる, 賜る or 行う? No voiced consonants.
Also, what are the accents marked with a hyphen between them? [1]-[1], [0]-[0], [1]-[0], [1][1]-[0], etc.
Last edited by toshiromiballza (2012 December 06, 4:16 pm)
toshiromiballza wrote:
What about 表れる, 賜る or 行う? No voiced consonants.
Pitch accents for these words are in v17.2, which I'm about to post. If you find any more, please don't hesitate to report them.
toshiromiballza wrote:
Also, what are the accents marked with a hyphen between them? [1]-[1], [0]-[0], [1]-[0], [1][1]-[0], etc.
I'm not sure. Maybe some smart person can enlighten us.
I have just uploaded version 17.2 of the Rikaisama Firefox extension.
Download version 17.2 via SourceForge
What's New?
● Added pitch accents for 10,751 more words. This brings the total to 107,309.
● For many words, added which part-of-speech a pitch accent applies to.
● Added the "Hide pitch accent part-of-speech unless ',' or '|' is present" option. It is enabled by default. Read the next section to learn what ',' and '|' represent.
● Changes to format of pitch accents.
The new pitch accent format:
<blank> - Example: 単眼鏡 たんがんきょう
No pitch accent information available for this word.
0 – Example: 洗う あらう 0
Zero means no accent. From Wikipedia: "Word doesn't have an accent, the pitch rises from a low starting point on the first mora or two, and then levels out in the middle of the speaker's range, without ever reaching the high tone of an accented mora. Japanese describe the sound as 'flat' (平板 heiban) or 'accentless'."
2 – Example: 願う ねがう 2
The "2" indicates that the accent is on the 2nd mora (the が).
32 – Example: 著作権 ちょさくけん 32
The "32" indicates that the accent can be on either the 3rd mora (く) or 2nd mora (さ). This is in frequency order, meaning that it is more common for the accent to be on the 3rd mora than the 2nd mora.
{11} – Example: 超越論的観念論 ちょうえつろんてきかんねんろん {11}
Curly braces are placed around pitch accents that are in the double digits. The "11" indicates that the accent is on the 11th mora.
21,0 – Example: 飛車 ひしゃ 21,0
For some words, Daijirin contains multiple sub-definitions in an entry. Sometimes each sub-definition can have a different pitch. A comma separates the pitch accents for the multiple sub-definitions. The "21,0" means that in the 1st sub-definition of the word, the accent is on either the 2nd mora (しゃ) or 1st mora (ひ), and that in the 2nd sub-definition of the word, no accent is present.
1-2 – Example: 思案投げ首 しあんなげくび 1-2
I'm not sure what the "-" is supposed to represent. It is present in Daijirin so I left it in.
1|Ø – Example: 朝日 あさひ 1|Ø
For some words, Daijirin contains multiple entries that have identical expressions and readings. The "|" separates the pitch found in each entry. The "1" indicates that in the first Daijirin entry, the pitch accent was on the first mora. The "Ø" symbol indicates that the other Daijirin entry contained no pitch accent information.
(part-of-speech) – Example: 点々 てんてん (名)03,(副)0,(形動タリ)0
Sometimes the pitch accent changes depending on the word's part-of-speech. The part-of-speech is placed inside parentheses. The above example shows that the pitch accent is "03" when the word is used as a noun, "0" when it is used as an adverb, and "0" when it is used as a noun adjective (specifically a "classical form of na-adjective inflection formed by contraction of the particle 'to' with the classical verb 'ari' ('aru')").
Valid part-of-speech options:
(名) 名詞
(代) 代名詞
(動五) 動詞五段活用
(動五[四]) 動詞口語五段活用・文語四段活用
(動四) 動詞四段活用
(動上一) 動詞上一段活用
(動上二) 動詞上二段活用
(動下一) 動詞下一段活用
(動下二) 動詞下二段活用
(動カ変) 動詞カ行変格活用
(動サ変) 動詞サ行変格活用
(動ナ変) 動詞ナ行変格活用
(動ラ変) 動詞ラ行変格活用
(動特活) 動詞特別活用
(形) 形容詞
(形ク) 形容詞ク活用
(形シク) 形容詞シク活用
(形動) 形容動詞
(形動ナリ) 形容動詞ナリ活用
(形動タリ) 形容動詞タリ活用
(ト|タル) 「~と」(副)「~たる」(連体詞)の形で用いられるもの
(連体) 連体詞
(副) 副詞
(接続) 接続詞
(感) 感動詞
(助動) 助動詞
(格助) 格助詞
(接助) 接続助詞
(副助) 副助詞
(係助) 係助詞
(終助) 終助詞
(間投助) 間投助詞
(並立助) 並立助詞
(準体助) 準体助詞
(接頭) 接頭語
(接尾) 接尾語
(連語) 連語
(枕詞) 枕詞
Random examples:
私年号 しねんごう 2
ドライバー 20,0
我がまま わがまま (名・形動)34
角 かく (名・形動)21,12,20
去る さる (動ラ五四)1|(連体)1
現金自動支払機 げんきんじどうしはらいき {10}3-6
十重二十重 とえはたえ 11-2
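If you want to process these strings in a script, the format above can be broken down mechanically. The following Python sketch is one possible reading of the rules described in this post, not the code Rikaisama uses; `SubDef` and `parse_pitch` are hypothetical names. Since the meaning of "-" is unknown, the parser just keeps the hyphen-separated groups apart without interpreting them, and a part-of-speech is recorded only on the sub-definition where it literally appears.

```python
import re
from typing import List, NamedTuple, Optional


class SubDef(NamedTuple):
    pos: Optional[str]       # part-of-speech text without parentheses
    groups: List[List[int]]  # hyphen-separated groups of accent numbers


# A part-of-speech is matched as one token, so the "|" inside (ト|タル)
# is not mistaken for an entry separator.
TOKEN = re.compile(
    r"\((?P<pos>[^)]*)\)|\{(?P<multi>\d+)\}|(?P<digit>\d)|(?P<sep>[-,|])|(?P<none>Ø)"
)


def parse_pitch(field: str) -> List[List[SubDef]]:
    """Parse a pitch field into entries ('|'), each a list of sub-definitions (',')."""
    field = field.strip()
    entries: List[List[SubDef]] = [[]]
    pos: Optional[str] = None
    groups: List[List[int]] = [[]]

    def flush() -> None:
        nonlocal pos, groups
        cleaned = [g for g in groups if g]
        if pos is not None or cleaned:
            entries[-1].append(SubDef(pos, cleaned))
        pos, groups = None, [[]]

    i = 0
    while i < len(field):
        m = TOKEN.match(field, i)
        if m is None:
            raise ValueError(f"unexpected character at offset {i}: {field!r}")
        i = m.end()
        if m.group("pos") is not None:
            pos = m.group("pos")
        elif m.group("multi"):
            groups[-1].append(int(m.group("multi")))  # {11} -> 11
        elif m.group("digit"):
            groups[-1].append(int(m.group("digit")))  # one digit per accent
        elif m.group("sep") == "-":
            groups.append([])
        elif m.group("sep") == ",":
            flush()
        elif m.group("sep") == "|":
            flush()
            entries.append([])
        # "Ø" adds nothing: that entry has no pitch information.
    flush()
    return entries


print(parse_pitch("(名)03,(副)0,(形動タリ)0"))
print(parse_pitch("{10}3-6"))
print(parse_pitch("1|Ø"))
```

An entry that comes back as an empty list corresponds to Ø (or a blank field), meaning no pitch information was available.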
cb4960
Last edited by cb4960 (2012 December 08, 12:59 am)

