Hello everyone,
Introduction
I wrote a script to automatically look up the pitch accent of entries in a spreadsheet because I plan on beginning Core6k in a month or two and would like to have the pitch accents in my anki deck. The script automatically gets the pitch accent data from yahoo.co.jp (e.g. http://dic.yahoo.co.jp/dsearch?p=ときどき【時時...&dname=0ss)
This was intended for Core6k, but the program is designed to work with anyone's study list.
Edit: A primitive Anki plugin has been made for Anki 1.2.X: http://forum.koohii.com/showthread.php?p...#pid167578
Core 6k
I ran it with Nukemarine's Optimized Kore (http://forum.koohii.com/showthread.php?tid=5110) and managed to confidently determine the pitch accents of 4872/5094 (96%) of the entries that have kanji in them. The 906 (14%) kana-only entries would have to be manually looked over to make sure that the right dictionary entry was found, because yahoo.co.jp often returns an incorrect homonym for these entries.
Files
Here is the spreadsheet of the Optimized Core6k with the pitch accents that my program has determined (in columns S, T, U and V):
Only trust the pitch accents with uncertainties below 30 (81% of entries, or 96% of kanji+kana entries)
https://docs.google.com/spreadsheet/ccc?...mw5dWJXVHc
Here is the folder containing the results of the program's analysis:
Rar: https://docs.google.com/open?id=0B5wqBC4...3hJN0YxTU0
Zip: https://docs.google.com/open?id=0B5wqBC4...zBKTHphbEU
Incidentally, the uncertainties in the spreadsheet refer to where the analyzed file is located in the directory.
Here is the spreadsheet divided by the categories that need to be checked (use the tabs at the bottom to navigate):
https://docs.google.com/spreadsheet/ccc?...1REE#gid=4
Failed Dictionary Lookups [All] (150, 2.5%)
Figure Out Pitches [w/ Kanji] (52, 0.9%)
Verify Correct Page & Figure Out Pitches [Kana-Only] (40, 0.7%)
Verify Correct Page [w/ Kanji] (36, 0.6%)
Verify Correct Page [Kana-Only] (850, 14.2%)
Should Be Good [w/ Kanji] (4872, 81.2%)
I Would Like Help
I would really appreciate some help with the entries that my program has not been able to automatically parse, especially if other people would be interested in having the pitch accents for all of Core6k. I'm not very proficient in Japanese, and that is in fact why I intend to do Core6k. I shall be slowly going through these entries myself, but I made my spreadsheet at (https://docs.google.com/spreadsheet/ccc?...1REE#gid=4) editable so that you can contribute. All I ask is that you delete the "No" in the verify column if the entry to be verified is correct.
In particular, I would need help with the files in Figure Out Pitches [w/ Kanji] (52, 0.9%), Verify Correct Page & Figure Out Pitches [Kana-Only] (40, 0.7%) and Failed Dictionary Lookups [All] (150, 2.5%)
Failed Dictionary Lookups
(should be fairly easy to do)
My program failed to find the dictionary entry. Someone needs to find the dictionary entry and input the correct pitches. This is divided into two main sections: "Did not find Dictionary Lookup?" and "Failed Dictionary Lookup?"
The "Failed Dictionary Lookup" entries were found in yahoo's second dictionary, the Daijisen. This usually makes it very easy to find the Daijirin's dictionary entry, because the Daijisen may provide an alternate spelling that is contained in the Daijirin (e.g. http://dic.yahoo.co.jp/dsearch?p=レインコーと&...&dname=0ss and http://dic.yahoo.co.jp/dsearch?p=ごはん【ご飯】...&dname=0ss; for the latter, one must look up 御飯 instead of ご飯 to find the entry).
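One way the ご飯/御飯 kind of case could be retried is sketched below in Python. This is only an illustration under assumptions and not something my program currently does: the function name kanji_prefix_variants is hypothetical, and the rule of swapping a leading ご or お for 御 is a guess that will certainly not cover every alternate spelling.

```python
from typing import List


def kanji_prefix_variants(word: str) -> List[str]:
    """Suggest alternate spellings to retry in the dictionary.

    Assumption: a leading kana honorific prefix (ご or お) may be
    written as 御 in the dictionary headword, as with ご飯 / 御飯.
    """
    variants = []
    if len(word) > 1 and word[0] in ("ご", "お"):
        # Replace only the first character, keep the rest unchanged.
        variants.append("御" + word[1:])
    return variants


if __name__ == "__main__":
    print(kanji_prefix_variants("ご飯"))
```

Any spelling suggested this way would still need to be checked by hand against the page that comes back.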
Verify Correct Page
For this part, it must be verified whether the program found the right page (homonyms or other incorrect results are possible). If the right page has been found, then the right pitches should already be in place (except for the 40 in Verify Correct Page & Figure Out Pitches [Kana-Only]). In some cases, one can simply compare the title of the webpage (provided in the spreadsheet) to the actual word. In other cases, it may be necessary to check the page itself to make sure the right page was chosen.
e.g. [Kana-Only] For entry 5291 (by Opt-Voc-Index number), the word in Core6k is ミュージック (or "music" in English). The title of the webpage that my program found was "ミュージック1【music】". Obviously, the correct webpage was found.
e.g. [w/ Kanji] Entry 2042 今に/いまに (before long, someday) has the webpage title "今(いま)に始まった事ではない". Hmm, that is strange. The page that the program analyzed was: http://dic.yahoo.co.jp/dsearch?p=いまに【今に】...&dname=0ss, which is NOT the right page! The right page can be found on the right-hand side (choice number 2). The entry in the list must then be updated accordingly.
Figure Out Pitches
My program has identified some pitches, but is unable to figure out when these pitches apply, or if these really are pitches. Someone has to manually take a look at each file in here and give a verdict as to the usage of the pitch accents.
e.g. [w/ Kanji] For 46 water 水, http://dic.yahoo.co.jp/dsearch?p=みず【水】&s...&dname=0ss, the second subscript (2) is part of "H2O". This is obviously not a pitch. Therefore, "0;2? Check" would be changed to "0;"
e.g. [w/ Kanji] Dictionary entry #4 of http://dic.yahoo.co.jp/dsearch?p=ひ【日】&st...&dname=0ss has a unique pitch.
Should Be Good
All entries here are probably good. Theoretically one should verify if the correct page was analyzed by the program, but chances are the correct page was analyzed.
A bit about the program
My program essentially looks up the webpage on the internet and finds the pitch. Here are four examples of files it may find:
Case 1: (2) 一つ + ひとつ -> http://dic.yahoo.co.jp/dsearch?p=一つ【ひとつ】...&dname=0ss (the number in the brackets refers to the Opt-Voc-Index number)
The pitch is 2. The program gives "2;" as the result
Case 2: (31) 時々 + ときどき -> http://dic.yahoo.co.jp/dsearch?p=ときどき【時時...&dname=0ss
The program finds that for (名), the pitch is 2 or 0, but also finds that for (副), the pitch is 0. The program returns "20(名)0(副)".
Case 3: (46) 水 + みず -> http://dic.yahoo.co.jp/dsearch?p=みず【水】&s...&dname=0ss
The program finds a pitch of 0 at the top. However, it also finds a 2 (as part of H2O in the first line of the dictionary). The program is confused because it doesn't understand why there is a 2 at that part of the dictionary definition. The program returns "0;2? Check"
Case 4: (146) ご飯 + ごはん -> http://dic.yahoo.co.jp/dsearch?p=ごはん【ご飯】...&dname=0ss
The entry is not found in the Daijirin dictionary, so yahoo automatically gives the result from the Daijisen. The Daijisen does not have any information about the pitch, so the program does not find anything. It returns a dictionary lookup error. (Note that while the Daijirin has no entry for ご飯 + ごはん, it WOULD return a result for 御飯 + ごはん, but my program is not smart enough to figure this out.)
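To make the four cases above more concrete, here is a minimal Python sketch of how findings like these could be turned into the result strings. It is an illustration only, not my actual source code: the Finding class, its fields, the format_result function and the exact wording of the error text are all hypothetical, and it assumes the page has already been parsed into a list of pitch numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    pitches: str         # digits as found on the page, e.g. "2" or "20"
    pos: Optional[str]   # part-of-speech tag such as "名" or "副", if any
    in_headword: bool    # True if found at the top of the entry,
                         # False if found inside a definition


def format_result(findings: List[Finding]) -> str:
    # Case 4: nothing found at all (e.g. only a Daijisen page came back).
    if not findings:
        return "Dictionary lookup error"  # hypothetical wording

    # Case 2: pitches attached to part-of-speech tags.
    tagged = [f for f in findings if f.pos is not None]
    if tagged:
        return "".join(f"{f.pitches}({f.pos})" for f in tagged)

    # Cases 1 and 3: trusted pitches first, then doubtful ones flagged.
    sure = [f for f in findings if f.in_headword]
    unsure = [f for f in findings if not f.in_headword]
    result = "".join(f"{f.pitches};" for f in sure)
    if unsure:
        result += "".join(f"{f.pitches}?" for f in unsure) + " Check"
    return result


if __name__ == "__main__":
    print(format_result([Finding("2", None, True)]))
    print(format_result([Finding("20", "名", False), Finding("0", "副", False)]))
    print(format_result([Finding("0", None, True), Finding("2", None, False)]))
    print(format_result([]))
```

The point of keeping the doubtful numbers in the string (with "?" and "Check") rather than dropping them is that a person can then decide, as with the 2 in H2O.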
About the folder with the results:
The files are organized by their uncertainty so that it is easier to check the entries.
In particular, I'm confused as to what to do with a few of the files found in the folder 0_Kanji_and_Kana/6_Unknown_Location/. For example, for http://dic.yahoo.co.jp/dsearch?p=みぎ【右】&s...&dname=0ss, the pitch changes when the meaning of the word corresponds to the third definition. Also, I'm not proficient enough in Japanese to actually read the definitions and know what exactly these individual entries mean (hence my plea for help in looking at the remaining files).
Questions
So anyways, some specific questions for you:
1. In the example of http://dic.yahoo.co.jp/dsearch?p=ときどき【時時...&dname=0ss, for (名), does 20 mean that the pitch accent may be either 2 or 0, but that 2 is more common?
2. (名) means noun & (副) means adverb.
3. I'm not sure in what form my program should actually OUTPUT the pitches for the complicated cases (e.g. "20(名)0(副)"). Is there any way to output the pitch in a more organized or meaningful way?
4. Is there interest in getting the pitch accent like I have done? i.e. is my program useful to people other than me?
5. Anything else?
I'm organizing my program a bit and will shortly be posting the source code here so that people can use it for their own lists if they want. Edit: Here is the code: https://docs.google.com/open?id=0B5wqBC4...zJXUWwtSkk
Thanks,
Seren
Edited: 2012-05-09, 5:32 pm

