falsinsoft
New member
Registered: 2013-07-09
Posts: 9
Hi all
I'm developing a software tool for Japanese study. Currently I'm writing the code to import EDICT2 content into an internal database. However, because of my poor knowledge of the Japanese language, I have some doubts about how the data is formatted inside the EDICT2 file. The official documentation explains the following:
-------------------------------------------------------------------------------
the EDICT2 file is in an expanded form of the original EDICT format. The main differences are the inclusion of multiple kanji headwords and readings, and the inclusion of cross-reference and other information fields, e.g.:
KANJI-1;KANJI-2 [KANA-1;KANA-2] /(general information) (see xxxx) gloss/gloss/.../
-------------------------------------------------------------------------------
What I understand from this explanation is that each line of the file contains data following this schema:
kanji1;kanji2;kanji3 [reading of kanji1;reading of kanji2;reading of kanji3]
However, checking the file, it seems this schema is not correct. I can find lines like:
嘈囃;そう囃 [そうざつ]
Does this mean both of these kanji forms share the same single reading? Or again:
噯;噯気;噫気;噯木(iK) [おくび(噯,噯気);あいき(噯気,噫気,噯木)]
Is this a different way of assigning readings to different kanji? In this case it seems that 噯気 has two different readings.
Can someone explain the correct format of the file to me, or point me to a site with a better explanation than the "official" one?
Thank you
Haych
Member
From: Canada
Registered: 2008-09-28
Posts: 168
It looks like what you have there is the words arranged according to their definition. The kanji-1, kanji-2 are different ways of writing a word with the same definition, and kana-1, kana-2 are different pronunciations for words with the same definition.
That format looks correct. The first one is saying that 嘈 = そう and 囃 = ざつ and the word can be written as そう囃 or 嘈囃.
The second one is for words that mean 'burp', and it gives おくび which can be written as 噯,噯気 or あいき which can be written as 噯気,噫気,噯木.
You need to keep in mind it is completely normal to have multiple possibilities for kanji, especially with more obscure stuff, and these both look like pretty obscure words. You should look up the terms 'ateji' and 'gikun' and familiarize yourself with the concept.
netsplitter
Member
From: Melbourne
Registered: 2008-07-13
Posts: 183
I've done my own read-into-database type of thing.
Basically, don't use EDICT or EDICT2: they are old formats that are difficult to parse and only exist for legacy purposes. Grab a copy of JMdict, which is an XML document that can be parsed (mostly) sensibly. The format is very well defined in its documentation (look at the DTD).
In fact, EDICT is generated from the JMdict data (which is itself generated from JMdictDB).
Last edited by netsplitter (2013 August 25, 9:38 pm)
The edict2 file has multiple headwords or readings for some entries. For example, in "脛(P);臑 [すね(P);はぎ(脛)(ok)]", すね is marked as a reading for both 脛 and 臑, but はぎ is only marked as a reading for 脛. (P) is used for the prioritized headwords or readings that have elements like <ke_pri>ichi1</ke_pri> or <re_pri>ichi1</re_pri> in JMdict.xml. (ok) means out-dated or obsolete kana usage.
$ grep shin/shank edict2
脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
$ grep shin/shank edict
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
脛 [はぎ] /(ok) (n) (uk) shin/shank/lower leg/
臑 [すね] /(n) (uk) shin/shank/lower leg/
$ xmlstarlet sel -t -c "/JMdict/entry[ent_seq=1570850]" JMdict_e.xml
<entry>
<ent_seq>1570850</ent_seq>
<k_ele>
<keb>脛</keb>
<ke_pri>ichi1</ke_pri>
</k_ele>
<k_ele>
<keb>臑</keb>
</k_ele>
<r_ele>
<reb>すね</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>はぎ</reb>
<re_restr>脛</re_restr>
<re_inf>out-dated or obsolete kana usage</re_inf>
</r_ele>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<misc>word usually written using kana alone</misc>
<gloss>shin</gloss>
<gloss>shank</gloss>
<gloss>lower leg</gloss>
</sense>
</entry>
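The restriction rule above (a reading with no parenthesised headword list applies to every headword, otherwise only to the listed ones) can be sketched in Python roughly like this. The function names `split_item` and `parse_head` are made up for illustration. It also assumes that an ASCII-only group such as (P) or (ok) is a tag and that any other group is a headword restriction, which is a simplification and not something the EDICT2 documentation guarantees.

```python
import re

# One headword or reading followed by zero or more "(...)" groups.
_ITEM = re.compile(r"^([^()]+)((?:\([^()]*\))*)$")
# Headword part, optionally followed by " [readings]".
_HEAD = re.compile(r"^(\S+)(?: \[([^\]]*)\])?$")


def split_item(item):
    """Split e.g. 'はぎ(脛)(ok)' into (text, restrictions, tags)."""
    m = _ITEM.match(item)
    if m is None:
        raise ValueError(f"unexpected item: {item!r}")
    restrictions, tags = [], []
    for group in re.findall(r"\(([^()]*)\)", m.group(2)):
        if group.isascii():
            # Assumption: ASCII-only groups like P, ok, iK are tags.
            tags.append(group)
        else:
            # Anything else is treated as a comma-separated headword list.
            restrictions.extend(group.split(","))
    return m.group(1), restrictions, tags


def parse_head(line):
    """Return (headword, reading) pairs for one EDICT2 line."""
    head = line.split(" /", 1)[0].strip()
    m = _HEAD.match(head)
    if m is None:
        raise ValueError(f"unexpected line: {line!r}")
    headwords = [split_item(h)[0] for h in m.group(1).split(";")]
    readings = [split_item(r) for r in (m.group(2) or "").split(";") if r]
    pairs = []
    for reading, restrictions, _tags in readings:
        for headword in headwords:
            # An empty restriction list means "applies to all headwords".
            if not restrictions or headword in restrictions:
                pairs.append((headword, reading))
    return pairs


line = "脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/"
for headword, reading in parse_head(line):
    print(headword, reading)
```

For the 脛 line this yields the same three headword/reading pairings as the three edict lines shown above. Kana-only entries have no bracketed part, so this sketch returns an empty list for them; a real importer would need to handle that case separately.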
Other examples of entries in the edict2 file:
醜い(P);見憎い [みにくい] /(adj-i) (1) ugly/unattractive/(2) (See 醜い争い) unsightly/unseemly/(P)/EntL1333810X/
じゃが芋(P);ジャガ芋(P);馬鈴薯 [じゃがいも(じゃが芋,馬鈴薯)(P);ジャガいも(ジャガ芋)(P);ジャガイモ;ばれいしょ(馬鈴薯)(P)] /(n) (uk) (See ジャガタラ芋) potato (Solanum tuberosum)/(P)/EntL1005930X/
洒落(ateji) [しゃれ(P);シャレ] /(adj-na,n) (1) joke/pun/witticism/(adj-na) (2) (See お洒落・1) smartly dressed/stylish/fashion-conscious/refined/(P)/EntL1568640X/
御洒落(P);お洒落 [おしゃれ(P);オシャレ] /(adj-na,adj-no) (1) (uk) (See 洒落・しゃれ) smartly dressed/stylish/fashion-conscious/(n) (2) someone smartly dressed/(vs) (3) to dress up/to be fashionable/(P)/EntL1002770X/
In EntL1333810X, 1333810 is the ID used in JMdict and JMdictDB (see http://www.edrdg.org/jmdictdb/cgi-bin/e … ;q=1333810), and X means that at least one of the readings has an audio file hosted at JapanesePod101. お洒落・1 refers to the first sense of the entry that has お洒落 as a headword.
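If you want the ID and the audio flag in your database, a minimal sketch for pulling them out of the last field could look like this. `parse_entl` is a hypothetical helper name, and the pattern assumes the EntL field is always the final slash-delimited field of the line, as in the examples above.

```python
import re

# Assumption: the EntL field closes the line, e.g. ".../EntL1570850X/".
_ENTL = re.compile(r"/EntL(\d+)(X?)/$")


def parse_entl(line):
    """Return (sequence_id, has_audio) or None if there is no EntL field."""
    m = _ENTL.search(line.rstrip())
    if m is None:
        return None
    return int(m.group(1)), m.group(2) == "X"


line = "脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/"
print(parse_entl(line))
```

The older edict lines shown earlier carry no EntL field, so the function returns None for them.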
JMdict has <sense> elements for the senses and <xref> elements for the cross-references:
<sense>
<xref>醜い争い</xref>
<gloss>unsightly</gloss>
<gloss>unseemly</gloss>
</sense>
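Reading the same information from JMdict is mostly a matter of walking the elements. Here is a rough sketch with Python's standard `xml.etree.ElementTree`, run against a shortened copy of the 脛 entry from above. The helper names `entry_pairs` and `entry_senses` are made up for illustration, and the sketch only looks at `<re_restr>`, `<xref>` and `<gloss>`, ignoring everything else the DTD defines.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry shown earlier in this post.
SAMPLE = """<entry>
<ent_seq>1570850</ent_seq>
<k_ele><keb>脛</keb><ke_pri>ichi1</ke_pri></k_ele>
<k_ele><keb>臑</keb></k_ele>
<r_ele><reb>すね</reb><re_pri>ichi1</re_pri></r_ele>
<r_ele><reb>はぎ</reb><re_restr>脛</re_restr></r_ele>
<sense>
<gloss>shin</gloss>
<gloss>shank</gloss>
<gloss>lower leg</gloss>
</sense>
</entry>"""


def entry_pairs(entry):
    """Return (headword, reading) pairs, honouring <re_restr>."""
    headwords = [k.findtext("keb") for k in entry.findall("k_ele")]
    pairs = []
    for r_ele in entry.findall("r_ele"):
        reading = r_ele.findtext("reb")
        restrictions = [e.text for e in r_ele.findall("re_restr")]
        for headword in headwords:
            # No <re_restr> means the reading applies to every headword.
            if not restrictions or headword in restrictions:
                pairs.append((headword, reading))
    return pairs


def entry_senses(entry):
    """Return one dict per <sense> with its cross-references and glosses."""
    return [
        {
            "xref": [x.text for x in sense.findall("xref")],
            "gloss": [g.text for g in sense.findall("gloss")],
        }
        for sense in entry.findall("sense")
    ]


entry = ET.fromstring(SAMPLE)
print(entry.findtext("ent_seq"), entry_pairs(entry), entry_senses(entry))
```

For the whole JMdict file you would loop over the `entry` elements under the root instead of parsing a single string.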
Last edited by lauri_ranta (2013 August 26, 2:02 am)