kanji koohii FORUM

jmdict vs edict2 - yogert909 - 2016-06-10

Does anybody know the difference between JMdict and EDICT2? Both seem to be actively maintained by edrdg.org, which says EDICT2 "contains almost all the information in the JMdict file." What is missing from EDICT2 that JMdict contains, aside from the additional languages? JMdict also seems to have the advantage of being in XML format, so it should be easier to parse. Should I consider JMdict superior?

Also, are there any other J-E dictionary databases that I might want to consider aside from these?

Thanks


RE: jmdict vs edict2 - jimeux - 2016-06-10

I had a little trouble with this too. JMdict is the new format, so you should go with that. I think they were, or perhaps still are, updating EDICT2 in unison, but you can basically think of it as deprecated. There's a DTD for JMdict, and somewhere on the site there's a page explaining the relational data model they use for JMdict themselves.

It took me quite a bit of work to parse it successfully into a DB, but then I was using a fairly low-level Java library. There are a lot of elements, so I'd study the DTD and the data model page before writing any code. Since JMdict is a relatively new format, the Bibliography, Link and Audit tags are currently completely empty, so I wouldn't prioritise those for now. 
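
For anyone who wants a starting point, here is a minimal Python sketch of streaming entries out of the XML while keeping each sense separate. The element names (entry, ent_seq, k_ele/keb, r_ele/reb, sense, gloss) come from the JMdict DTD; the sample data, the ent_seq value, and the iter_entries helper are made up for illustration, and this has not been run against the full file.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source):
    """Yield one dict per <entry>, keeping each <sense> as its own list."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        yield {
            "seq": elem.findtext("ent_seq"),
            "kanji": [k.text for k in elem.iterfind("k_ele/keb")],
            "readings": [r.text for r in elem.iterfind("r_ele/reb")],
            "senses": [
                [g.text for g in sense.iterfind("gloss")]
                for sense in elem.iterfind("sense")
            ],
        }
        elem.clear()  # drop the processed entry's children to save memory


# Hypothetical, heavily trimmed sample; the real file has many more elements.
SAMPLE = """<JMdict>
<entry>
<ent_seq>9000001</ent_seq>
<k_ele><keb>加減</keb></k_ele>
<r_ele><reb>かげん</reb></r_ele>
<sense><gloss>degree</gloss><gloss>extent</gloss></sense>
<sense><gloss>condition</gloss><gloss>state of health</gloss></sense>
</entry>
</JMdict>"""

entries = list(iter_entries(io.BytesIO(SAMPLE.encode("utf-8"))))
for number, glosses in enumerate(entries[0]["senses"], start=1):
    print(number, "; ".join(glosses))
```

To try it on the real thing, pass the path of the downloaded XML file to iter_entries instead of the in-memory sample.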

If you're making a J-E dictionary, you'll probably be interested in Kanjidic2, the Tanaka Corpus, Kradfile-u (check license), KanjiVG, and some kind of radical list (I had to make my own).


RE: jmdict vs edict2 - yogert909 - 2016-06-11

jimeux, thanks for the info. It's much appreciated.


RE: jmdict vs edict2 - Roketzu - 2016-06-11

(2016-06-10, 9:25 pm)jimeux Wrote: I had a little trouble with this too. JMdict is the new format, so you should go with that. I think they were, or perhaps still are, updating EDICT2 in unison, but you can basically think of it as deprecated. There's a DTD for JMdict, and somewhere on the site there's a page explaining the relational data model they use for JMdict themselves.

It took me quite a bit of work to parse it successfully into a DB, but then I was using a fairly low-level Java library. There are a lot of elements, so I'd study the DTD and the data model page before writing any code. Since JMdict is a relatively new format, the Bibliography, Link and Audit tags are currently completely empty, so I wouldn't prioritise those for now. 

If you're making a J-E dictionary, you'll probably be interested in Kanjidic2, the Tanaka Corpus, Kradfile-u (check license), KanjiVG, and some kind of radical list (I had to make my own).

You sound like someone who knows what's what, so I wonder if you'd be able to enlighten me on something related to this.

I have built a somewhat large vocabulary deck in Anki, 99% of which has the English definition taken from EDICT (not sure if it's 1 or 2), and I'm wondering if there is any kind of batch process I could perform on around 40K notes to output definitions taken from JMDict.

The only piece of software I know of that can perform this sort of function is Epwing2Anki, and unfortunately it is only compatible with EDICT, not JMDict. It uses a .sqlite version of EDICT to pull from I believe, and this may be a question for the author, but he unfortunately hasn't been around here in a while, so I'll ask you just in case, if that's OK. Would it be possible to TRICK or DECEIVE this piece of software into thinking it's pulling from EDICT when in fact you have sneakily replaced the edict.sqlite with jmdict.sqlite? Now, this is even assuming there is such a thing available, I don't know and obviously I should have researched this first before typing this out, but alas here we are.

I'm not a coder and am probably ignorant of a whole slew of reasons why this wouldn't work, perhaps even just the difference in format between the 2 dictionaries would make this task impossible, but what say you sir? If this is a no-go, would you know of any other way to go about such a task?


RE: jmdict vs edict2 - anotherjohn - 2016-06-11

*Ahem* I'll just leave this here.

Not sure if it's relevant to this thread though.


RE: jmdict vs edict2 - Roketzu - 2016-06-11

(2016-06-11, 4:46 am)anotherjohn Wrote: *Ahem* I'll just leave this here.

Not sure if it's relevant to this thread though.

This is cool, and I can imagine somehow finding a way to use this to replace the EDICT-style definition field in my own deck if there's no other way to go about it. Not sure if there's a really simple way to do that, but I'm sure it'd at least be possible.


RE: jmdict vs edict2 - pm215 - 2016-06-11

(2016-06-11, 4:08 am)Roketzu Wrote: I have built a very large vocabulary deck in Anki, 99% of which has the English definition taken from EDICT (not sure if it's 1 or 2), and I'm wondering if there is any kind of batch process I could perform on around 40K notes to output definitions taken from JMDict.
If you already have definitions via EDICT, what are you hoping to achieve with this?
Quote: Would it be possible to TRICK or DECEIVE this piece of software into thinking it's pulling from EDICT when in fact you have sneakily replaced the edict.sqlite with jmdict.sqlite?
The sqlite is presumably this tool's internal choice of database. I wouldn't expect there to be any difference whether it was populated directly from jmdict info, or from edict (which is automatically created from jmdict, as I understand it).


RE: jmdict vs edict2 - Roketzu - 2016-06-11

pm215 Wrote: If you already have definitions via EDICT, what are you hoping to achieve with this?

JMDict numbers disparate definitions, whereas EDICT just lists them all together.

1 degree; extent; just right amount of [noun (common) (futsuumeishi); adjectival nouns or quasi-adjectives (keiyodoshi); noun, used as a suffix]
2 condition; state of health [noun (common) (futsuumeishi); noun, used as a suffix]
3 adjustment; moderation [noun (common) (futsuumeishi); noun or participle which takes the aux. verb suru]
4 addition and subtraction

vs

(n,adj-na,n-suf,vs) degree; extent; just right amount of; condition; state of health; adjustment; moderation; addition and subtraction; (P)

---------

So far I have about 20K notes that have Japanese definitions and example usages filled in. So for 加減, for example, I could match the Japanese and English numbers:

1 degree; extent; just right amount of [noun (common) (futsuumeishi); adjectival nouns or quasi-adjectives (keiyodoshi); noun, used as a suffix]
2 condition; state of health [noun (common) (futsuumeishi); noun, used as a suffix]
3 adjustment; moderation [noun (common) (futsuumeishi); noun or participle which takes the aux. verb suru]
4 addition and subtraction

(Definition)
1 ある条件の具合によって変わる程度
2 物事の具合
3 調節すること。その程度
4 加えることと減らすこと

(Usage)
1 腹の減り―で時刻が分かる。ばかさ― 
2 寒さの―で腰が痛む。腹の減り―で食事を調節する
3 力[温度]を―する。湯―をみる
4 両方の数を―して調整する

They wouldn't all work out this neatly, but I like the idea of making a deck containing all of this information. Not even for myself, but as something that could potentially be shared and improved upon over time.
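
As a rough illustration of the matching idea, here is a small Python sketch that lines up the three numbered lists for 加減 by position. The align function and the "(none)" placeholder are my own assumptions; it only pairs sense 1 with definition 1 and so on, and pads with the placeholder when one list is shorter, so mismatched orderings would still have to be sorted by hand.

```python
from itertools import zip_longest

english = [
    "degree; extent; just right amount of",
    "condition; state of health",
    "adjustment; moderation",
    "addition and subtraction",
]
definitions = [
    "ある条件の具合によって変わる程度",
    "物事の具合",
    "調節すること。その程度",
    "加えることと減らすこと",
]
usages = [
    "腹の減り―で時刻が分かる。ばかさ―",
    "寒さの―で腰が痛む。腹の減り―で食事を調節する",
    "力[温度]を―する。湯―をみる",
    "両方の数を―して調整する",
]


def align(english, definitions, usages, missing="(none)"):
    """Pair the lists by sense number, padding whichever list runs out first."""
    rows = []
    combined = zip_longest(english, definitions, usages, fillvalue=missing)
    for number, (en, ja, example) in enumerate(combined, start=1):
        rows.append(f"{number} {en} / {ja} / {example}")
    return rows


for row in align(english, definitions, usages):
    print(row)
```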

(2016-06-11, 9:41 am)pm215 Wrote: The sqlite is presumably this tool's internal choice of database. I wouldn't expect there to be any difference whether it was populated directly from jmdict info, or from edict (which is automatically created from jmdict, as I understand it).

So you're saying if there was a jmdict.sqlite file used in the place of the current edict.sqlite file, it would work just fine? Sounds great, but I can't find a .sqlite version of JMDict.


RE: jmdict vs edict2 - pm215 - 2016-06-11

OK, I see that having the split senses would be nice. What I meant by "wouldn't expect any difference" is that the sqlite db schema probably assumes EDICT-style entries, with one combined definition per word, so you lose the same information (split senses, examples, etc.) whether you go jmdict -> edict -> sqlite or jmdict -> sqlite, and the end product is the same thing you have now.


RE: jmdict vs edict2 - jimeux - 2016-06-12

(2016-06-11, 4:08 am)Roketzu Wrote: I'm not a coder and am probably ignorant of a whole slew of reasons why this wouldn't work, perhaps even just the difference in format between the 2 dictionaries would make this task impossible, but what say you sir? If this is a no-go, would you know of any other way to go about such a task?
I think pm215 is right. You'd likely need to code something yourself, and I find that homonyms often make it difficult to automate this kind of thing 100%, though it may be less of an issue in your case.

I haven't looked into how JMdict numbers its definitions, but I wouldn't be surprised if the order, or even the definitions themselves, differed from your monolingual ones. Even different Japanese dictionaries can differ from each other.


RE: jmdict vs edict2 - Roketzu - 2016-06-12

(2016-06-12, 5:45 am)jimeux Wrote:
(2016-06-11, 4:08 am)Roketzu Wrote: I'm not a coder and am probably ignorant of a whole slew of reasons why this wouldn't work, perhaps even just the difference in format between the 2 dictionaries would make this task impossible, but what say you sir? If this is a no-go, would you know of any other way to go about such a task?
I think pm215 is right. You'd likely need to code something yourself, and I find that homonyms often make it difficult to automate this kind of thing 100%, though it may be less of an issue in your case.

I haven't looked into how JMdict numbers its definitions, but I wouldn't be surprised if the order, or even the definitions themselves, differed from your monolingual ones. Even different Japanese dictionaries can differ from each other.

Oh yeah, they would absolutely differ and there would be definitions covered one way and not the other, even non-numbered English definitions that have separate Japanese entries depending on the dictionary like you said. The thing is there are crazy people like me out there who would enjoy going through and sorting it all one by one, making it neat and trying to get it right—then realizing 6 months later I should have done X differently from the start. Really living on the edge here.


RE: jmdict vs edict2 - yogert909 - 2016-06-13

(2016-06-11, 4:08 am)Roketzu Wrote: I have built a somewhat large vocabulary deck in Anki, 99% of which has the English definition taken from EDICT (not sure if it's 1 or 2), and I'm wondering if there is any kind of batch process I could perform on around 40K notes to output definitions taken from JMDict.

The only piece of software I know of that can perform this sort of function is Epwing2Anki, and unfortunately it is only compatible with EDICT, not JMDict. It uses a .sqlite version of EDICT to pull from I believe, and this may be a question for the author, but he unfortunately hasn't been around here in a while, so I'll ask you just in case, if that's OK. Would it be possible to TRICK or DECEIVE this piece of software into thinking it's pulling from EDICT when in fact you have sneakily replaced the edict.sqlite with jmdict.sqlite? Now, this is even assuming there is such a thing available, I don't know and obviously I should have researched this first before typing this out, but alas here we are.

I'm not a coder and am probably ignorant of a whole slew of reasons why this wouldn't work, perhaps even just the difference in format between the 2 dictionaries would make this task impossible, but what say you sir? If this is a no-go, would you know of any other way to go about such a task?

You could export anotherjohn's deck as a text file and import it back into your EDICT deck with "update duplicates" turned on. You would need to tag your existing cards first so that afterwards you can find and delete the newly added cards that aren't duplicates. You could even create new fields to receive the imported data, so that the fields you don't want to overwrite are left alone.
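
A variant of the same idea, sketched in Python: instead of tagging and deleting afterwards, filter the other deck's text export down to the words already in your own deck before importing. This assumes both exports are tab-separated with the word in the first column, which depends on how the decks are laid out; filter_updates and the sample rows are hypothetical.

```python
import csv
import io


def filter_updates(own_export, other_export):
    """Keep only rows of other_export whose first field is also in own_export."""
    own_words = {row[0] for row in csv.reader(own_export, delimiter="\t") if row}
    return [
        row
        for row in csv.reader(other_export, delimiter="\t")
        if row and row[0] in own_words
    ]


# In-memory stand-ins for the two exported text files.
own = io.StringIO("加減\told definition\n程度\told definition\n")
other = io.StringIO("加減\t(1) degree; extent\n調節\tadjustment\n")

updates = filter_updates(own, other)
for row in updates:
    print("\t".join(row))
```

With real files you would open both exports with encoding="utf-8" and write the filtered rows back out with csv.writer(..., delimiter="\t") before importing.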

If for some reason this doesn't work, I would imagine you could rename the SQLite database, but you would also need to make sure the structure of the two databases is the same. This may or may not be as complicated as it sounds. SQLite databases are kind of like Excel spreadsheets with columns and rows, so you could conceivably open a copy of the database with an SQLite browser and change the names of the columns to match edict.sqlite, and your add-on wouldn't know the difference. However, it's possible the database is more complicated, with several different tables (sheets in my Excel analogy), and it could get tricky. But it may be worth looking into, because there's a very good chance each database is simply one big table with the columns named differently.
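
If you'd rather check the structure without a GUI browser, a short Python sketch like this lists every table and its columns. The describe helper and the example table are made up for illustration; I don't know what the actual edict.sqlite looks like inside.

```python
import sqlite3


def describe(conn):
    """Map each table name to the list of its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# Hypothetical single-table layout, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (kanji TEXT, kana TEXT, definition TEXT)")

schema = describe(conn)
print(schema)
```

To inspect a real database, replace ":memory:" with the path to a copy of the file and skip the CREATE TABLE line. Comparing the output for two databases shows quickly whether they are "one big table with different column names" or something more involved.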


RE: jmdict vs edict2 - Roketzu - 2016-06-13

(2016-06-13, 1:00 pm)yogert909 Wrote:
(2016-06-11, 4:08 am)Roketzu Wrote: I have built a somewhat large vocabulary deck in Anki, 99% of which has the English definition taken from EDICT (not sure if it's 1 or 2), and I'm wondering if there is any kind of batch process I could perform on around 40K notes to output definitions taken from JMDict.

The only piece of software I know of that can perform this sort of function is Epwing2Anki, and unfortunately it is only compatible with EDICT, not JMDict. It uses a .sqlite version of EDICT to pull from I believe, and this may be a question for the author, but he unfortunately hasn't been around here in a while, so I'll ask you just in case, if that's OK. Would it be possible to TRICK or DECEIVE this piece of software into thinking it's pulling from EDICT when in fact you have sneakily replaced the edict.sqlite with jmdict.sqlite? Now, this is even assuming there is such a thing available, I don't know and obviously I should have researched this first before typing this out, but alas here we are.

I'm not a coder and am probably ignorant of a whole slew of reasons why this wouldn't work, perhaps even just the difference in format between the 2 dictionaries would make this task impossible, but what say you sir? If this is a no-go, would you know of any other way to go about such a task?

You could export anotherjohn's deck as a text file and import it back into your EDICT deck with "update duplicates" turned on. You would need to tag your existing cards first so that afterwards you can find and delete the newly added cards that aren't duplicates. You could even create new fields to receive the imported data, so that the fields you don't want to overwrite are left alone.

If for some reason this doesn't work, I would imagine you could rename the SQLite database, but you would also need to make sure the structure of the two databases is the same. This may or may not be as complicated as it sounds. SQLite databases are kind of like Excel spreadsheets with columns and rows, so you could conceivably open a copy of the database with an SQLite browser and change the names of the columns to match edict.sqlite, and your add-on wouldn't know the difference. However, it's possible the database is more complicated, with several different tables (sheets in my Excel analogy), and it could get tricky. But it may be worth looking into, because there's a very good chance each database is simply one big table with the columns named differently.

I'm leaning towards going with the first suggestion, because there is no SQLite version of JMDict out there; I've only been able to find it as .XML and .EPWING. When I searched for how to convert .XML to .sqlite, I mostly found people with coding skills asking other coders how best to go about it, and from my cursory reading of the responses it doesn't seem like a trivial task.
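
For reference, the core of such a conversion can be sketched in a few lines of Python. This is only an outline under assumptions: the two-table layout (entry, sense) and the load function are invented here, the sample XML and its ent_seq value are made up, and Epwing2Anki would not necessarily understand a database shaped like this. What it does show is one way to keep the numbered senses as separate rows.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample using element names from the JMdict DTD.
SAMPLE = """<JMdict>
<entry>
<ent_seq>9000001</ent_seq>
<k_ele><keb>加減</keb></k_ele>
<r_ele><reb>かげん</reb></r_ele>
<sense><gloss>degree</gloss><gloss>extent</gloss></sense>
<sense><gloss>condition</gloss><gloss>state of health</gloss></sense>
</entry>
</JMdict>"""


def load(conn, xml_text):
    """Store each entry once and each of its senses as a numbered row."""
    conn.execute(
        "CREATE TABLE entry (seq INTEGER PRIMARY KEY, kanji TEXT, reading TEXT)"
    )
    conn.execute("CREATE TABLE sense (seq INTEGER, num INTEGER, glosses TEXT)")
    for entry in ET.fromstring(xml_text).iterfind("entry"):
        seq = int(entry.findtext("ent_seq"))
        conn.execute(
            "INSERT INTO entry VALUES (?, ?, ?)",
            (seq, entry.findtext("k_ele/keb"), entry.findtext("r_ele/reb")),
        )
        for num, sense in enumerate(entry.iterfind("sense"), start=1):
            glosses = "; ".join(g.text for g in sense.iterfind("gloss"))
            conn.execute("INSERT INTO sense VALUES (?, ?, ?)", (seq, num, glosses))
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, SAMPLE)
sense_rows = conn.execute(
    "SELECT num, glosses FROM sense WHERE seq = 9000001 ORDER BY num"
).fetchall()
print(sense_rows)
```

The full dictionary has far more to it (multiple kanji and readings per entry, part-of-speech tags, cross-references), which is where the non-trivial part comes in.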


RE: jmdict vs edict2 - yogert909 - 2016-06-13

Anki's database is sqlite, so if you import anotherjohn's deck into another user account, you automatically have an sqlite database.

But yeah, the first option is the easiest and most likely to work. It's also a great trick to know any time you want to merge new information into an existing deck.


RE: jmdict vs edict2 - Roketzu - 2016-06-13

Well I feel pretty dumb right now. Turns out the EDICT file bundled with Epwing2Anki does in fact number definition entries when there's more than one. I was assuming the whole time that, since I don't get the numbering when using EDICT via Rikaisama, the same would be the case with Epwing2Anki (both tools were made by cb4960).


RE: jmdict vs edict2 - Roketzu - 2016-06-13

I think I was able to successfully go through the process and update all of my notes, and it was all surprisingly seamless. I did it with a back-up, not my real deck, just to test it out.

Here is one result:

[Image: ZGfjhhp.png]

Now imagine how great it'd be to have an entire deck of tens of thousands just like it... Yeah, I really should have gone with the numbered style from the beginning. It'd probably take years to go through and sort them all so they align like this.