Pitch Accent Lookup (Python Script)

Reply #1 - 2012 May 08, 7:55 pm
Seren Member
Registered: 2012-02-27 Posts: 26

Hello everyone,


Introduction

I wrote a script to automatically look up the pitch accent of entries in a spreadsheet because I plan on beginning Core6k in a month or two and would like to have the pitch accents in my anki deck. The script automatically gets the pitch accent data from yahoo.co.jp (e.g. http://dic.yahoo.co.jp/dsearch?p=ӗ … ;dname=0ss)

This was intended for Core6k, but the program is designed to work with anyone's study list.

Edit: A primitive Anki plugin has been made for Anki 1.2.X: http://forum.koohii.com/viewtopic.php?p … 01#p176601

Core 6k

I ran it with Nukemarine's Optimized Kore (http://forum.koohii.com/viewtopic.php?id=5322) and managed to confidently determine the pitch accents of 4872/5094 (96%) of the entries that have kanji in them. The 906 (15%) kana-only entries would have to be checked manually to make sure the right dictionary entry was found, because yahoo.co.jp often returns an incorrect homonym for these entries.


Files

Here is the spreadsheet of the Optimized Core6k with the pitch accents that my program has determined (in columns S, T, U and V):

Only trust the pitch accents with uncertainties below 30 (81% of entries, or 96% of kanji+kana entries)
https://docs.google.com/spreadsheet/ccc … mw5dWJXVHc

Here is the folder containing the results of the program's analysis:
Rar: https://docs.google.com/open?id=0B5wqBC … 3hJN0YxTU0
Zip: https://docs.google.com/open?id=0B5wqBC … zBKTHphbEU
Incidentally, the uncertainties in the spreadsheet refer to where the analyzed file is located in the directory.

Here is the spreadsheet divided by the categories that need to be checked (use the tabs at the bottom to navigate):
https://docs.google.com/spreadsheet/ccc … 1REE#gid=4
Failed Dictionary Lookups [All] (150, 2.5%)
Figure Out Pitches [w/ Kanji] (52, 0.9%)
Verify Correct Page & Figure Out Pitches [Kana-Only] (40, 0.7%)
Verify Correct Page [w/ Kanji] (36, 0.6%)
Verify Correct Page [Kana-Only] (850, 14.2%)
Should Be Good [w/ Kanji] (4872, 81.2%)


I Would Like Help

I would really appreciate some help with the entries that my program has not been able to parse automatically, especially if other people would be interested in having the pitch accents for all of core6k. I'm not very proficient in Japanese, and that is in fact why I intend to do core6k. I shall be slowly going through these entries myself, but I made my spreadsheet at (https://docs.google.com/spreadsheet/ccc … 1REE#gid=4) editable so that you can contribute. All I ask is that you delete the "No" in the verify column if the entry to be verified is correct.

In particular, I would need help with the files in Figure Out Pitches [w/ Kanji] (52, 0.9%), Verify Correct Page & Figure Out Pitches [Kana-Only] (40, 0.7%) and Failed Dictionary Lookups [All] (150, 2.5%)

Failed Dictionary Lookups
(should be fairly easy to do)

My program failed to find the dictionary entry. Someone needs to find the dictionary entry and input the correct pitches. This is divided into two main sections: "Did not find Dictionary Lookup?" and "Failed Dictionary Lookup?"

The "Failed Dictionary Lookup" entries were found in yahoo's second dictionary, the Daijisen. This makes it usually very easy to find the Daijirin's dictionary entry (the Daijisen may provide an alternate spelling that is contained in the Daijirin e.g. http://dic.yahoo.co.jp/dsearch?p=Ӥ … ;dname=0ss and http://dic.yahoo.co.jp/dsearch?p=ӕ … dname=0ss; must look up 御飯 instead of ご飯 to find the entry)
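A minimal sketch of how the ご飯 -> 御飯 retry could be automated. This is not part of the posted script; the `lookup` callable is a hypothetical stand-in for whatever function fetches an entry, and the only rule tried is rewriting a leading ご/お as 御, which will not cover every alternate spelling.

```python
def spelling_candidates(expression):
    """Yield the expression, then a variant with a leading honorific
    ご/お written as the kanji 御 (e.g. ご飯 -> 御飯)."""
    yield expression
    if len(expression) > 1 and expression[0] in "ごお":
        yield "御" + expression[1:]


def lookup_with_fallback(expression, lookup):
    """Try each candidate spelling until `lookup` returns a result.

    `lookup` is a hypothetical callable that returns None when the
    dictionary has no entry for that spelling.
    """
    for candidate in spelling_candidates(expression):
        result = lookup(candidate)
        if result is not None:
            return candidate, result
    return None, None
```

Returning the spelling that actually matched makes it easy to record in the spreadsheet which form the pitch was taken from.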


Verify Correct Page

For this part, it must be verified if the program found the right page (due to the possibility of homonyms or other incorrect results). If the right page has been found, then the right pitches should already be in place (except for the 40 in Verify Correct Page & Figure Out Pitches [Kana-Only]). In some cases, one can simply compare the title of the webpage (provided in the spreadsheet) to the actual word. In other cases, it may be necessary to check the page itself to make sure the right page was chosen.

e.g. [Kana-Only] For entry 5291 (by Opt-Voc-Index number), the word in Core6k is ミュージック (or "music" in English). The title of the webpage that my program found was "ミュージック1【music】". Obviously, the correct webpage was found.
e.g. [w/ Kanji] Entry 2042 今に/いまに (before long, someday) has the webpage title "今(いま)に始まった事ではない". Hmm, that is strange. The page that the program analyzed was: http://dic.yahoo.co.jp/dsearch?p=ӓ … dname=0ss, which is NOT the right page! The right page can be found on the right (choice number 2). The list must then be updated accordingly.
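The title comparison in the two examples above could be pre-screened with something like the sketch below. It is only a heuristic under assumptions: that titles look like the two shown here (an optional homonym number, an optional 【…】 gloss, optional readings in parentheses). The function name is made up for illustration, and a match still does not rule out a wrong homonym.

```python
import re


def title_matches(title, expression):
    """Heuristic: does a page title look like the headword for `expression`?"""
    headword = re.sub(r"【.*?】", "", title)          # drop the gloss, e.g. 【music】
    headword = re.sub(r"[((].*?[))]", "", headword)  # drop readings, e.g. (いま)
    headword = re.sub(r"\d+", "", headword)           # drop homonym numbers
    return headword.strip() == expression


print(title_matches("ミュージック1【music】", "ミュージック"))
print(title_matches("今(いま)に始まった事ではない", "今に"))
```

Entries for which this returns False would be the ones worth opening by hand first.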


Figure Out Pitches

My program has identified some pitches, but is unable to figure out when these pitches apply, or if these really are pitches. Someone has to manually take a look at each file in here and give a verdict as to the usage of the pitch accents.

e.g. [w/ Kanji] For 46 water 水, http://dic.yahoo.co.jp/dsearch?p=ә … dname=0ss, the second subscript (2) is part of "H2O". This is obviously not a pitch. Therefore, "0;2? Check" would be changed to "0;"
e.g. [w/ Kanji] Dictionary entry #4 of http://dic.yahoo.co.jp/dsearch?p=Ә … ;dname=0ss has a unique pitch.


Should Be Good

All entries here are probably good. Theoretically one should verify if the correct page was analyzed by the program, but chances are the correct page was analyzed.


A bit about the program

My program essentially looks up the webpage on the internet and finds the pitch. Here are four examples of files it may find:

Case 1: (2) 一つ + ひとつ -> http://dic.yahoo.co.jp/dsearch?p=ߌ … ;dname=0ss (the number in the brackets refers to the Opt-Voc-Index number)
    The pitch is 2. The program gives "2;" as the result

Case 2: (31) 時々 + ときどき -> http://dic.yahoo.co.jp/dsearch?p=ӗ … ;dname=0ss
    The program finds that for (名), the pitch is 2 or 0, but also finds that for (副), the pitch is 0. The program returns "20(名)0(副)"

Case 3: (46) 水 + みず -> http://dic.yahoo.co.jp/dsearch?p=ә … ;dname=0ss
    The program finds a pitch of 0 at the top. However, it also finds a 2 (as part of H2O in the first line of the dictionary). The program is confused because it doesn't understand why there is a 2 at that part of the dictionary definition. The program returns "0;2? Check"

Case 4: (146) ご飯 + ごはん -> http://dic.yahoo.co.jp/dsearch?p=ӕ … ;dname=0ss
    The entry is not found for the Daijirin dictionary, so yahoo automatically gives the result from the Daijisen. The Daijisen does not have any information about the pitch, so the program does not find anything. It returns a dictionary lookup error (Note that while the Daijirin has no entry for ご飯 + ごはん, it WOULD return a result for 御飯 + ごはん, but my program is not smart enough to figure this out)
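Cases 1, 3 and 4 can be sketched roughly as follows. This is not the actual script: it assumes (hypothetically) that the headword line and the definition text have already been pulled out of the page as plain strings, and that accent numbers show up as bracketed digits such as [2]. The real page markup may differ, and the part-of-speech handling of Case 2 is left out.

```python
import re

# Assumption: accent numbers appear as bracketed digits, e.g. "[2]".
ACCENT = re.compile(r"\[(\d+)\]")


def summarize_pitch(headword_line, definition_text):
    """Return a result string in the style of the cases above."""
    head = ACCENT.findall(headword_line)
    stray = ACCENT.findall(definition_text)
    if not head and not stray:
        return "Lookup error"            # Case 4: no accent data at all
    result = "".join(head) + ";"         # Case 1: accents on the headword
    if stray:
        # Case 3: numbers outside the headword need a human to check them
        result += "".join(stray) + "? Check"
    return result
```

Keeping the headword accents and the stray numbers separate is what lets the uncertain ones be flagged instead of silently merged.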


About the folder with the results:

The files are organized by their uncertainty so that it is easier to check the entries.

In particular, I'm confused as to what to do with a few of the files found in the folder 0_Kanji_and_Kana/6_Unknown_Location/. For example, for http://dic.yahoo.co.jp/dsearch?p=ә … dname=0ss, the pitch changes when the meaning of the word corresponds to the third definition. Also, I'm not proficient enough in Japanese to actually read the definitions and know what exactly these individual entries mean (hence my plea for help in looking at the remaining files).

Questions

So anyways, some specific questions for you:

1. In the example of http://dic.yahoo.co.jp/dsearch?p=ӗ … dname=0ss, for (名), does 20 mean that the pitch accent may be either 2 or 0, but that 2 is more common?
2. (名) means noun & (副) means adverb.
3. I'm not sure in what form my program should actually OUTPUT the pitches for the complicated cases (e.g. "20(名)0(副)"). Is there any way to output the pitch in a more organized or meaningful way?
4. Is there interest in getting the pitch accent like I have done? i.e. is my program useful to people other than me?
5. Anything else?

I'm organizing my program a bit and will be shortly posting the source code here so that people can use it for their own lists if they want. Edit: Here is the code: https://docs.google.com/open?id=0B5wqBC … zJXUWwtSkk

Thanks,

Seren

Last edited by Seren (2012 May 09, 5:32 pm)

Reply #2 - 2012 May 08, 10:58 pm
Thora Member
From: Canada Registered: 2007-02-23 Posts: 1691

Seren wrote:

1. In the example of http://dic.yahoo.co.jp/dsearch?p=ӗ … dname=0ss, for (名), does 20 mean that the pitch accent may be either 2 or 0, but that 2 is more common?

That has been my assumption, at least wrt Daijirin. I took a quick look at Daijirin's explanation of abbreviations, but didn't see anything specifically about alternate accents. Interestingly, I noticed that Daijisen gives only one accent, (2), for ときどき.

2. Just to confirm: does (名) mean noun? Does (副) mean adjective?

副 is adverb (副詞). i- adjective is 形 (形容詞).  na-adjective is 形動 (形容動詞).

3. I'm not sure in what form my program should actually OUTPUT the pitches for the complicated cases (e.g. "20(名)0(副)"). Is there any way to output the pitch in a more organized or meaningful way?

People who use accent markings might have views on this, but I imagine any format is okay so long as it's understandable. (名)2,0(副)0 is perhaps slightly more obvious?
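A minimal sketch of producing that layout, assuming the accents have already been grouped by part of speech (the function name and the list-of-pairs input are illustrative, not taken from the script):

```python
def format_pitches(groups):
    """Format [(part_of_speech, [accent numbers]), ...] as e.g. '(名)2,0(副)0'.

    An empty part-of-speech label means the accents apply to the whole entry.
    """
    parts = []
    for pos, accents in groups:
        label = "(%s)" % pos if pos else ""
        parts.append(label + ",".join(str(a) for a in accents))
    return "".join(parts)


print(format_pitches([("名", [2, 0]), ("副", [0])]))
```

Separating alternates with commas also avoids reading "20" as accent twenty.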

4. Is there interest in getting the pitch accent like I have done? i.e. is my program useful to people other than me?

Some people have been adding accent marks manually, so I imagine your efforts will be greatly appreciated. :)

5. Anything else?

The usual caveats that word accent can vary (change in accent location or pitch range) depending on conjugation, adjacent words, location in sentence, existence of focus/emphasized word in sentence, intonation phrasing, etc. Also, pitch contour isn't actually comprised of just 2 tones: Low and High. It's more like a crazy wave. (Marking the accent location at least doesn't give the impression there are only  L and H tones.) You're probably aware of these limitations, but I wonder if it might be worthwhile to include a brief note somewhere for beginners who might not be?

In case you aren't aware, a member here, AlexandreC, has been putting together some information on pitch accent (summary and a video). He'd be a good source of info and will probably be interested to know about your project.

Reply #3 - 2012 May 08, 11:34 pm
HonyakuJoshua Member
From: The Unique City of Liverpool Registered: 2011-06-03 Posts: 617 Website

I messaged al on Facebook about this

Reply #4 - 2012 May 09, 2:43 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

Can you put your code up on github or somewhere?  I've been thinking about doing something like this, but also adding in support for EPWING lookups using epywing[1].  This would allow you to do everything offline without hitting yahoo's servers.  It would also give you access to the 新明解国語辞典 (New Meikai Japanese Dictionary?) and possibly the NHK 日本語発音アクセント辞典 (NHK Japanese Pronunciation Accent Dictionary).  I imagine comparing the accents given from two different dictionaries would give you pretty accurate results.

[1]
- eb library [c library for working with epwing dictionaries]: http://www.sra.co.jp/people/m-kasahr/eb/
- ebmodule [python wrapper for eb library]: https://github.com/aehlke/ebmodule
- epywing [easier interface for working with eb library in python (uses ebmodule above)]:  https://github.com/aehlke/epywing
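The idea of cross-checking two dictionaries could look something like this sketch. It deliberately uses no EPWING calls; it assumes each dictionary's accents have already been extracted into a list of integers, and the status labels are made up for illustration.

```python
def reconcile(accents_a, accents_b):
    """Compare accent lists from two dictionaries.

    Returns (accents, status), where status is one of
    'agree', 'partial', 'conflict' or 'single'.
    """
    set_a, set_b = set(accents_a), set(accents_b)
    if not set_a or not set_b:
        return sorted(set_a | set_b), "single"      # only one source has data
    if set_a == set_b:
        return list(accents_a), "agree"
    common = set_a & set_b
    if common:
        # keep the first dictionary's ordering for the shared accents
        return [a for a in accents_a if a in common], "partial"
    return sorted(set_a | set_b), "conflict"
```

For ときどき, where one dictionary lists 2 and 0 and the other only 2, this would return ([2], "partial"), which could map to a low uncertainty, while "conflict" entries would go to manual review.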

Reply #5 - 2012 May 09, 7:01 am
Irixmark Member
From: 加奈陀 Registered: 2005-12-04 Posts: 291

Excellent project! But wouldn't it make more sense to create an Anki plugin that looks up the pitch accent "on demand"?

That way you could use it beyond Core6k. Especially useful for words you might encounter in text without audio.

Reply #6 - 2012 May 09, 7:34 am
Tori-kun このやろう
Registered: 2010-08-27 Posts: 1193 Website

I can only repeat what Irixmark said! Excellent project! Is there any way users who are already working with core6k, or have finished it, could profit from this? I'd like to have all pitch codes added automatically to all of my cards. Doing so manually/on demand is just a pain in the ass, right? If 94% or so were already added automatically, doing the rest by hand would be a piece of cake...

Reply #7 - 2012 May 09, 8:58 am
AlexandreC Member
From: Canada Registered: 2008-09-26 Posts: 309

A few notes:

It's my understanding that something like 20 means that 2 is more common, but to be honest, I can't remember a case off the top of my head where it was 02. In other words, they might be numerically ordered. In any case, there is a recent tendency to flatten (zero'ize? HA!) words. Although I do remember asking people which was more common and they gave me different answers, some even saying that one of the 2 sounded wrong. (if this sounds too discouraging, simply pretend I never wrote it)

As for the fact that pitch changes, yes, that's a major issue, but not one that any dictionary addresses in its entries anyway (accent dictionaries do provide a detailed explanation at the end though); that's just how it is. If you learn the rules, you can derive the changes, but you still need to know if there is pitch, and if so, where it is.

Someone mentions that there are more heights than just L and H. That is true phonetically, but the system is binary, it's up to you to determine where pitch from one mora to the next will lie. Much like a stressed syllable in English can be lower than an unstressed syllable elsewhere in the sentence.
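To make the binary system concrete, here is a small sketch that expands a dictionary accent number into an L/H pattern for the word plus one following particle, using the usual Tokyo-dialect convention (0 = no drop; n = the pitch drops after mora n). The function name is only illustrative.

```python
def pitch_pattern(mora_count, accent):
    """Return the L/H pattern for a word plus one following particle.

    accent 0 = no downstep; accent n = pitch drops after mora n.
    """
    if not 0 <= accent <= mora_count:
        raise ValueError("accent must be between 0 and mora_count")
    pattern = []
    for i in range(1, mora_count + 2):      # +1 for the particle slot
        if accent == 1:
            high = i == 1                   # high first mora, then low
        elif accent == 0:
            high = i > 1                    # low first mora, then high throughout
        else:
            high = 1 < i <= accent          # rise after mora 1, drop after mora n
        pattern.append("H" if high else "L")
    return "".join(pattern[:-1]) + "-" + pattern[-1]


print(pitch_pattern(3, 2))   # a 3-mora word with accent 2, e.g. ひとつ
print(pitch_pattern(2, 0))   # a 2-mora word with accent 0, e.g. みず
```

The particle slot matters because accent 0 and an accent on the final mora only differ in what happens after the word.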

If you're trying to create an efficient system, I'd like to add something. In many cases, the most important data is whether a word has pitch at all or not (most don't). For instance, other than the base dictionary form of a verb, the only thing that matters is whether the verb has pitch or not because the placement of the pitch can be derived from the ending. So, I would recommend using different colours for accented vs. unaccented words, say blue if it's zero (and no need to put a number) and red if it's accented. If there is a single accent, then the accented mora could be bolded, which is visually much easier to remember than thinking "ok, number 2, so blah, BLAH, blah, blah...". If there is more than one possible pitch, you could bold the first and list the numbers at the end. Something like "(名)2,0(副)0" means zero except that 2 is possible for nouns. If you can simplify the notation in some odd cases, it's no big deal, anyway.
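A rough sketch of the colour/bold idea as HTML for a card field. The assumptions here are that the reading is available in kana and that a single accent number is known; the mora splitting simply attaches small ゃゅょ-type kana to the preceding kana, and the function names are illustrative.

```python
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_morae(kana):
    """Split a kana string into morae; small kana attach to the previous kana."""
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def render_accent(kana, accent):
    """Blue for unaccented (0); red with the accented mora in bold otherwise."""
    morae = split_morae(kana)
    if accent == 0:
        return '<span style="color:blue">%s</span>' % kana
    if not 1 <= accent <= len(morae):
        raise ValueError("accent outside the word")
    morae[accent - 1] = "<b>%s</b>" % morae[accent - 1]
    return '<span style="color:red">%s</span>' % "".join(morae)


print(render_accent("ひとつ", 2))
print(render_accent("みず", 0))
```

Entries with several possible accents would still need the numbers appended after the word, as suggested above.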

Reply #8 - 2012 May 09, 9:25 am
vileru Member
From: Cambridge, MA Registered: 2009-07-08 Posts: 750

Adding to what AlexandreC said, here are the basic rules for pitch accents:

Wikipedia wrote:

Most of the time the accent falls on the ante-penultimate mora (academese for "third to last"), or on the first mora for shorter words. A smaller number of nouns are accented on other syllables. (I-adjectives, however, are usually accented, and always on the penultimate mora ("next to last").)

I know it's Wikipedia, but the above description is accurate. Another observation I've made is that 3-syllable verbs tend to be accented, especially if the last two syllables are「りる」. There's plenty of other patterns I've noticed, but I haven't the time to recall and summarize them. If others would like to chime in with similar observations, I think this thread could become an especially useful resource. It's much easier to notice and mimic pitch accents once aware of the patterns and rules that govern their use.

Reply #9 - 2012 May 09, 9:48 am
AlexandreC Member
From: Canada Registered: 2008-09-26 Posts: 309

vileru wrote:

Adding to what AlexandreC said, here are the basic rules for pitch accents:

Wikipedia wrote:

Most of the time the accent falls on the ante-penultimate mora (academese for "third to last"), or on the first mora for shorter words. A smaller number of nouns are accented on other syllables. (I-adjectives, however, are usually accented, and always on the penultimate mora ("next to last").)

I know it's Wikipedia, but the above description is accurate. Another observation I've made is that 3-syllable verbs tend to be accented, especially if the last two syllables are「りる」. There's plenty of other patterns I've noticed, but I haven't the time to recall and summarize them. If others would like to chime in with similar observations, I think this thread could become an especially useful resource. It's much easier to notice and mimic pitch accents once aware of the patterns and rules that govern their use.

Well, hiTO (and hiTOga but hiTONO), taME, toKI... tendency doesn't mean much when you have to know where the pitch actually is. Then, KAnada, but kanaDAjin and nihonJIn -- there are too many other things going on to bother with tendencies. Otherwise, just say everything flat, and you'll be right most of the time!

However, 3-mora verbs are definitely mostly accented on 2. Not that that helps the OP with his software in any way...

Reply #10 - 2012 May 09, 10:24 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

Irixmark wrote:

Excellent project! But wouldn't it make more sense to create an Anki plugin that looks up the pitch accent "on demand"?

Ideally it would be nice to have a script that does the accent lookup, and then create a plugin that makes use of the script.

Reply #11 - 2012 May 09, 12:43 pm
Seren Member
Registered: 2012-02-27 Posts: 26

Thank you Thora for this information.

In reply to partner, Irixmark, Tori-kun:

I'm not a frequent programmer, so I've never used GitHub and don't know how to. I actually haven't done any programming in the last three (or so) years, and this is the first time I've ever used Python. I didn't even know Anki used Python! But do not despair:

For the last two hours I've been messing around with trying to make a basic Anki plugin by looking at the source code of other plugins to see how it works with Anki, and I think I'll be able to make a plugin. I'll work on it today, but I cannot guarantee I'll have a completed project by the end of today; and if I don't, you'll probably have to wait a week before I continue because I have exams coming up for the end of my school year for which I have to study. At the moment my plugin consists of a primitive interface from which I will shortly add the code to check for pitches.

AlexandreC, vileru:

Compare
http://dic.yahoo.co.jp/dsearch?p=ӕ … ;dname=0ss
to
http://dic.yahoo.co.jp/dsearch?p=ә … ;dname=0ss

Thank you for this information! At the moment I'll work on getting an Anki plugin to retrieve the pitches as my current program does right now, and then I shall take a look at maybe changing the output format.

Last edited by Seren (2012 May 09, 12:46 pm)

Reply #12 - 2012 May 09, 5:17 pm
Seren Member
Registered: 2012-02-27 Posts: 26

Update!

I have a primitive Anki plugin that will look up the pitch accent for the current card you are reviewing. (Anki 1.2.X)

https://docs.google.com/open?id=0B5wqBC … DNwWkVEZ3M

Extract this file into your plugin folder (Settings>Plugins>Open Plugin Folder)

Once you reboot Anki, it will generate a settings file which you can edit by going to (Tools>Pitch Lookup>Edit Settings)

Some of the settings are self-explanatory. openWebsiteThreshold is the threshold at which the program will open up the yahoo page for you to review.

openWebsiteThreshold:
0 = basic pitch accent
10 = pitch accent specific to adverb/noun
30 = the program *thinks* it's a basic pitch accent; it's about 90% sure
50 = the program doesn't know what the pitch it found is
70 = there is no pitch accent or the program made a mistake
90 = the program could not find the dictionary entry
Plus 100 for any entry whose Kana-Only field = Expression field (due to possibility of homonyms)
So for example, setting the openWebsiteThreshold to 11 is realistically what one would probably do.
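In code, the scoring described in the list above might look like the sketch below. It is not taken from the plugin: the function names are invented, and it assumes the page is opened when the score is at or above the threshold, which is consistent with 11 letting codes 0 and 10 through without opening anything.

```python
def uncertainty(base_code, expression, kana):
    """Base code from the list above, plus 100 when the entry is kana-only."""
    return base_code + (100 if expression == kana else 0)


def should_open_website(score, open_website_threshold=11):
    # Assumption: the page opens when the score reaches the threshold.
    return score >= open_website_threshold


print(should_open_website(uncertainty(10, "時々", "ときどき")))
print(should_open_website(uncertainty(0, "ミュージック", "ミュージック")))
```

With the default of 11, every kana-only entry gets opened for review, since the extra 100 always pushes it over the threshold.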

Also, pitchLookupShortcut is the shortcut to hit to run the pitch lookup.

You might want to back up your deck before trying this plugin, but I don't anticipate any problems with corrupting your deck or something.

If you download the plugin, please reply with feedback.

Obviously, I'm thinking of making it possible to get the pitch while in the fact editor (and batches, probably), but I'm not sure how exactly my program should handle pitches it's not sure of in batch in Anki. For now, you can always use my original python script to do a batch analysis.

Reply #13 - 2012 August 30, 5:59 am
dizmox Member
Registered: 2007-08-11 Posts: 1149

Hi, I'm running Anki 1.2.8 and tried your plugin, but "Lookup Pitch" is always greyed out for me. Why do you think that could be?
