
Core10k with Accent

#1
I wanted to start studying intonation/accent, so I added accent markings to all of Core10000. It took me like 15+ hours, so I hope someone else can benefit from my work.

https://docs.google.com/spreadsheet/ccc?...3FtbXU3WXc

There are about 9600 cards in this deck, and I got about 9000 of the accent markings using my program DictScrape. I manually had to add about 600 accent markings.

Just when I thought I was finished, I realized that my program might not have been working correctly when there was no kanji in the "Target Kanji" field, so I had to go back and double check about another 1000 cards that only have kana (no kanji).

Are there many mistakes?

I don't think there are that many. I checked about 1500 cards by hand, so I could have made a mistake when updating those cards. For the other cards, I think the program did a pretty good job.

There may theoretically be a mistake if there are two separate words listed in the dictionary that have the same Kanji + Reading (kana) but have different accents. In my experience, I don't think this is very common (in fact, I don't think it happens at all in the Daijirin).
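
One way to check how often this actually happens is to group entries by kanji + reading and flag any pair that carries more than one accent. A minimal sketch, assuming each dictionary entry has already been reduced to a (kanji, kana, accent) tuple. The function name and the sample entries are made up for illustration and are not taken from Daijirin:

```python
from collections import defaultdict

def find_accent_collisions(entries):
    """Return {(kanji, kana): sorted accents} for keys with more than one distinct accent."""
    seen = defaultdict(set)
    for kanji, kana, accent in entries:
        seen[(kanji, kana)].add(accent)
    return {key: sorted(accents) for key, accents in seen.items() if len(accents) > 1}

# Hypothetical placeholder entries, not real dictionary data
sample = [
    ("語A", "ごえー", "0"),
    ("語A", "ごえー", "1"),
    ("語B", "ごびー", "2"),
]
print(find_accent_collisions(sample))
```

Any key that shows up in the result would need to be checked by hand.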

What should I do if I find a mistake?

Post it here! I will love you for it.

Where did you get these accents?

DictScrape queries Yahoo's Daijirin dictionary and parses out the accent for an entry. When going through and checking the cards by hand, I used a combination of NHK 日本語発音アクセント辞典, 新明解国語辞典 第五版, and 三省堂 スーパー大辞林.

Some words have multiple accents listed, which one did you use?

For the cards processed automatically, all available accents are listed. For instance, if the accent for a card is 01, then the 0 accent is more common, and the 1 accent is less common (at least, I think this is the way Daijirin does it).
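
To make the notation concrete, a field like 01 can be split into its individual accents, most common first. A minimal sketch, assuming every accent is a single digit (a word accented on the 10th mora or later would need a different encoding). `split_accents` is a name made up for this example, not part of DictScrape:

```python
def split_accents(field):
    """Split an accent field such as "01" into [0, 1], most common accent first."""
    field = field.strip()
    if not (field.isascii() and field.isdigit()):
        raise ValueError(f"unexpected accent field: {field!r}")
    return [int(ch) for ch in field]

print(split_accents("01"))  # [0, 1]
```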

There are some words that have different accents depending on how the word is used. For instance, when the word is used as a noun, it has a 0 and when it is used as an adverb, it has a 1. I don't really make a distinction in my program. The accents are just listed as 01 as normal. Sorry :-(.

For the cards that I checked/updated by hand, I used the accent that was in common between the three dictionaries I used. For instance, sometimes Daijirin would list a word as being 01, but the NHK dictionary would only list it as 0. In that case, I would just use 0.
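
The "accent in common" rule is just an intersection that keeps the first dictionary's ordering. A minimal sketch, assuming each dictionary's accents are already available as a list. The Daijirin and NHK values mirror the example above, while the third list and the function name are assumptions for illustration:

```python
def common_accents(*dictionaries):
    """Keep accents present in every dictionary, in the order given by the first one."""
    if not dictionaries:
        return []
    first, rest = dictionaries[0], dictionaries[1:]
    return [a for a in first if all(a in other for other in rest)]

daijirin = [0, 1]
nhk = [0]
shinmeikai = [0, 1]  # assumed value, only for the example
print(common_accents(daijirin, nhk, shinmeikai))  # [0]
```

If the result comes back empty, the dictionaries disagree completely and the card needs a judgment call.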

What are your Anki templates going to look like?

I plan on doing accent production two ways. One way will be just looking at the Kanji (and maybe kana?). The other way will just be hearing the word (I imagine this will get too easy after a while...?). I also plan on listening to the whole sentence and trying to repeat it, while really concentrating on the accent. This will be hard to grade myself on.

Where can I get the audio?

Look for a file online called Core10Kv4.7z.

Doesn't the accent for some words change depending on where the word is in the sentence?

Yes, I believe that is true. You should ask AlexandreC about it ;-p (I'm looking forward to that 3rd youtube video!)
Edited: 2012-12-08, 1:30 am
#2
One problem with this is that the accent of words changes when it's in a sentence compared to when it's by itself, or in a compound.
#3
yudantaiteki Wrote:One problem with this is that the accent of words changes when it's in a sentence compared to when it's by itself, or in a compound.
You're completely right. One thing that I've noticed so far is verbs. They tend to change based on their conjugation.

In case anyone hasn't seen this, there is a wiki page created by AlexandreC that has more info about pitch accent. Also, he has a youtube video on pitch accent.

So I guess the question is, "Is it worth memorizing pitch accent for words even though it sometimes changes?" I want to say 'yes', but really I don't have anything to back this up.
#4
Pitch changes in verbs are predictable if you know whether a verb is accented or not. It is more important to know whether it is accented than where the accent is. For instance, the -te or -ta ending will have no effect on unaccented verbs (they will stay flat: shiTA, shiTE), but on accented verbs it will (almost) systematically shift the accent to the 3rd mora from the end (so I call -te and -ta 0/-3 endings).
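
As a rough illustration of the 0/-3 idea, here is a minimal sketch. It assumes the conjugated form is given in kana, that small kana (ゃ, ゅ, ょ and so on) merge with the preceding mora, and that forms shorter than three morae fall back to the first mora. It ignores the exceptions covered by "(almost)", and the function names are made up for this example:

```python
# Small kana that do not count as a mora of their own (small っ does count)
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def count_morae(kana):
    """Count morae in a kana string."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)

def te_ta_accent(conjugated_kana, verb_is_accented):
    """Accent position of a -te/-ta form, counted from the start (0 = flat)."""
    if not verb_is_accented:
        return 0  # unaccented verbs stay flat
    # 3rd mora from the end, expressed as a position from the start
    return max(1, count_morae(conjugated_kana) - 2)

print(te_ta_accent("した", False))     # 0
print(te_ta_accent("たべた", True))    # 1
print(te_ta_accent("はなした", True))  # 2
```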
#5
AlexandreC Wrote:Pitch changes in verbs are predictable if you know whether a verb is accented or not. It is more important to know whether it is accented than where the accent is. For instance, the -te or -ta ending will have no effect on unaccented verbs (they will stay flat: shiTA, shiTE), but on accented verbs it will (almost) systematically shift the accent to the 3rd mora from the end (so I call -te and -ta 0/-3 endings).
Hey, thanks for the info! Good to know. A lot of that kind of stuff is on the learnlanguages wiki page, right?
#6
Errors

- 未だ(まだ): The accent was listed as "0" when it should really be "1". (It looks like it was using いまだ's accent instead of just まだ.) Hopefully there are not too many of these types of errors...
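
This is the kind of error you would expect when a lookup is keyed on the kanji alone, so that 未だ picks up whichever reading happens to match first. A minimal sketch of keying on kanji + reading instead. The table is a hypothetical stand-in for the scraped data, and whether DictScrape's lookup has this shape is an assumption:

```python
# Hypothetical scraped entries: (kanji, kana) -> accent
ACCENTS = {
    ("未だ", "いまだ"): "0",
    ("未だ", "まだ"): "1",
}

def lookup_accent(kanji, kana, table=ACCENTS):
    """Return the accent for this exact kanji + reading, or None if missing."""
    return table.get((kanji, kana))

print(lookup_accent("未だ", "まだ"))  # 1
```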
#7
This sounds like exactly what I've been looking for, but... what exactly can I do with that google document? Can I somehow edit my own core10k deck to include the accents? I'm not starting from scratch, that much is for certain heh.
#8
Betelgeuzah Wrote:This sounds like exactly what I've been looking for, but... what exactly can I do with that google document? Can I somehow edit my own core10k deck to include the accents? I'm not starting from scratch, that much is for certain heh.
It depends on what your goals are. Do you just want to start studying the accents? In that case, you can use the google document to create a new deck and then make templates that give you a word and you have to produce the accent.

If you want to add accents to the deck you currently have, it will be a little more work for you. The problem is that in order to import something into an existing Anki deck, you need a field in both the deck and the import file that you can join on. I don't know how many core6k/10k decks have an existing "unique field" that you are able to join on. The google document currently doesn't (and even if it did, it's possible it would be different from your current deck, so you wouldn't be able to use it anyway).

The way I did it was to write scripts to compare my current deck with the google document, and intelligently merge the accents into my deck based on a couple different fields. If you're a programmer, this shouldn't be too hard, but if not it might be hard to do.
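
A minimal sketch of that kind of merge, assuming both the deck and the spreadsheet have been exported as lists of rows with `target_kanji`, `kana_reading` and `accent` fields (these field names are assumptions, your deck may use different ones). It matches on kanji + kana and returns anything it can't match so those cards can be handled by hand:

```python
def merge_accents(deck_rows, sheet_rows):
    """Copy accents from sheet_rows into deck_rows, matching on (kanji, kana).

    Returns the deck rows that had no match in the spreadsheet.
    """
    accents = {
        (row["target_kanji"], row["kana_reading"]): row["accent"]
        for row in sheet_rows
    }
    unmatched = []
    for row in deck_rows:
        key = (row["target_kanji"], row["kana_reading"])
        if key in accents:
            row["accent"] = accents[key]
        else:
            unmatched.append(row)
    return unmatched

# Made-up sample rows for illustration
deck = [
    {"target_kanji": "曲", "kana_reading": "きょく"},
    {"target_kanji": "猫", "kana_reading": "ねこ"},
]
sheet = [{"target_kanji": "曲", "kana_reading": "きょく", "accent": "01"}]
leftover = merge_accents(deck, sheet)
print(deck[0]["accent"], [row["target_kanji"] for row in leftover])
```

A real script would probably fall back to other fields when kanji + kana is not enough, which is the "couple different fields" part.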

If you want to start studying accents, I would suggest just creating a new deck and using a template that just tests you on the accents.
#9
Thanks for the quick reply! I guess I should indeed create a new deck for studying the accent, since I'm no programmer.

I wonder what the best course of action would be. Front: word and back: the accent perhaps?

Also, how can I import the document into my anki, can I download it using a filetype that Anki supports, or...?
#10
Betelgeuzah Wrote:I wonder what the best course of action would be. Front: word and back: the accent perhaps?
I use three different templates. I'll just paste them here:

1) Accent Production from Kanji.

Front: Kanji

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">
<span style="font-size: 20px; color: #FF0000;">[Accent]</span>

{{target_kanji}}
{{{kana_reading}}}

{{sentence}}
</span>
Back: Accent

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">{{accent}}
{{target_kanji}} -- {{meaning}}
{{reading}}
{{sentence_meaning}}
{{word_audio}}
</span>
2) Accent Production from Audio (this should get really easy after a bunch of cards...)

Front: Word Audio

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">
<span style="font-size: 20px; color: #00AA00;">[Accent]</span>
{{word_audio}}
</span>
Back: Accent

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">{{accent}}
{{target_kanji}} -- {{kana_reading}} -- {{meaning}}
{{sentence}}
{{reading}}
{{sentence_meaning}}
{{word_audio}}
</span>
3) Sentence reading/imitation (just for fun and practice -- always grade yourself "Good")

Front: sentence and audio

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">
<span style="font-size: 20px; color: #4444FF;">[Sentence Imitation]</span>

{{{target_kanji}}}

{{sentence}}
{{sentence_audio}}
</span>
Back: nothing interesting

Code:
<span style="font-family: IPAGothic; font-size: 20px; color: #000000; white-space: pre-wrap;">{{accent}}
{{target_kanji}} -- {{kana_reading}} -- {{meaning}}
{{sentence}}
{{reading}}
{{sentence_meaning}}
{{word_audio}}
</span>
Betelgeuzah Wrote:Also, how can I import the document into my anki, can I download it using a filetype that Anki supports, or...?
You can export the document as a CSV file, and from there import into Anki. If you use the column names from the first row of the file, then you will be able to use my templates and everything should work correctly (...maybe?).

Or, if you know somewhere I could upload it, I could just upload the Anki deck I've already created.
#11
Finally managed to make the deck work using the first template. Thanks a ton!
#12
Betelgeuzah Wrote:Finally managed to make the deck work using the first template. Thanks a ton!
No problem, hope it works for you.

Update

そう: This should be 10. I'll update it on the spreadsheet.
#13
Update
ここ (row 108): this should be 0
#14
The following rated as "1" in the spreadsheet are "01" in the Daijirin, i.e. "0" is the more common reading. Each word below is preceded by its Uindex number.

192.曲;543.チャンネル;1004.苺;1863.価格;2069.消費;2378.書店;2433.区分;2461.当日;2476.意外;2586.大陸;2602.豊富;2674.プラス;3230.続々;3333.学部;3390.列車;3697.ふと;3887.今更;4041.無用;4099.ほっと;4344.どっと;4399.種目;4565.出入り;4619.気力;4710.対比;4777.自国;5142.すっと;5328.すべすべ;5441.ほうき;5478.盛る;5595.かっと;5926.添加;5939.学;5986.不断;6037.突破;6085.設置;6128.証券;6131.司法;6300.帝国;6361.配布;6428.治安;6810.燃費;6977.脱衣;7005.懸念;7017.何とも;7053.書記;7165.遺憾;7198.分岐;7411.未熟;7417.苦悩;7479.嫉妬;7496.特異;7561.自首;7671.舞う;7724.巧み;7876.不気味;8380.巡査;8472.専ら;8496.他面;9069.御輿;9389.ぶかぶか
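
For anyone who wants to apply this list to their own copy, the format above is easy to parse. A minimal sketch, assuming the list is pasted in as a string and the accents are held in a dict keyed by Uindex (both are stand-ins for illustration, and the function names are made up):

```python
def parse_uindex_list(text):
    """Parse "192.曲;543.チャンネル" into [(192, "曲"), (543, "チャンネル")]."""
    pairs = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        uindex, word = item.split(".", 1)
        pairs.append((int(uindex), word))
    return pairs

def fix_truncated_accents(accents, pairs):
    """Set the accent to "01" for listed entries currently stored as "1"."""
    for uindex, _word in pairs:
        if accents.get(uindex) == "1":
            accents[uindex] = "01"
    return accents

pairs = parse_uindex_list("192.曲;543.チャンネル;1004.苺")
accents = {192: "1", 543: "1", 1004: "01", 4: "2"}  # made-up starting state
print(fix_truncated_accents(accents, pairs))
```

Entries that are not on the list, or that already hold something other than "1", are left alone.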
#15
Katsuo Wrote:The following rated as "1" in the spreadsheet are "01" in the Daijirin, i.e. "0" is the more common reading. Each word below is preceded by its Uindex number.
Hey, thanks for double checking this. I figured out why this is happening. When I upload the CSV file to google documents, it's turning the accent column into a numeric column and treating 01 as the number 1. It shows up as just "1", not "01" like you would expect.

I tried to upload the CSV file again but it doesn't look like there is any way to set column options when uploading the file. I don't have Microsoft Office (and LibreOffice keeps crashing), so I have no way of uploading a correct file to google documents.

Would someone be able to create the google doc for me? I threw the csv file up on my webserver.

http://72.13.95.4/Core10kWithAccent.txt

I need someone to download it, open it in Excel and make sure to mark the "Accent" column as Text (not a number). Search for the entry for 曲 (Uindex 192), make sure it is "01" and not just "1". If it is correct, then upload it to Google docs and send me the link. I'll use the data to update the google doc I'm currently maintaining (unless it would just be easier to use the document you created...).
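
The same check can be scripted instead of done in Excel: the leading zero is only lost when the column is parsed as a number, so reading every field as text keeps 01 intact. A minimal sketch using Python's csv module. The comma delimiter, the header row, and the exact `Uindex` and `Accent` header names are assumptions, and the sample text is a made-up excerpt:

```python
import csv
import io

def accent_for(csv_text, uindex):
    """Return the Accent field for the given Uindex as a string, or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["Uindex"] == str(uindex):
            return row["Accent"]
    return None

sample = "Uindex,Kanji,Accent\n192,曲,01\n4,月,2\n"
print(accent_for(sample, 192))       # 01
print(int(accent_for(sample, 192)))  # 1 (what a numeric column does to it)
```

For the real file, the text would come from something like `open(path, encoding="utf-8").read()`, where the encoding is also an assumption.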

@Katsuo, did you check those in an automated fashion? Would you be able to check again after someone helps me with the document?
I've been thinking of comparing each entry against cb14512's edict+accent database. It wouldn't be too hard to do and it might increase the quality of these results. I wonder how cb is handling homonyms that don't have kanji...?
#16
partner55083777 Wrote:I need someone to download it, open it in Excel and make sure to mark the "Accent" column as Text (not a number). Search for the entry for 曲 (Uindex 192), make sure it is "01" and not just "1". If it is correct, then upload it to Google docs and send me the link.
OK, here's the link.

Quote:@Katsuo, did you check those in an automated fashion?
No, I noticed a few discrepancies and then listened to every entry whose accent started with "1" and investigated all those that didn't match.
#17
Katsuo Wrote:No, I noticed a few discrepancies and then listened to every entry whose accent started with "1" and investigated all those that didn't match.
Thanks a lot for taking the time to do that! When I get some time (probably tomorrow night) I will update the spreadsheet.

Here's one more update:

こう This should be 10 instead of just 1.
#18
Okay, I updated the spreadsheet. I had some trouble so I ended up just creating a new spreadsheet. Here is the new link:

https://docs.google.com/spreadsheet/ccc?...3FtbXU3WXc

The problem that Katsuo brought up has been fixed, and all of the words I have found so far have been fixed. I'll continue to post in this thread when I find a word that needs to be updated.
#19
I went through and compared my list with the list that cb made for Rikaisama. There were about 120 differences:

Uindex Kanji Kana UpdatedAccent
1910 気 き 0
656 一番 いちばん 2/0
223 右 みぎ 0/1
663 多分 たぶん 0/1
1716 六つ むっつ 3/0
952 辺 へん 0/1
826 五日 いつか 30/0
607 一杯 いっぱい 1/0
1360 三つ みっつ 3/0
1503 四つ よっつ 3/0
7 上 うえ 0/2
240 昨日 きのう 2/0
1158 九日 ここのか 4/0
43 先 さき 0/01
10 下 した 0/2
4 月 つき 2
73 何 なに 1/0/01
526 見える みえる 2
1842 結果 けっか 0/1
1999 全体 ぜんたい 0/1
1836 後 のち 20/0
491 全く まったく 0/40
2401 年間 ねんかん 0/1
2263 教授 きょうじゅ 01/0
295 毎日 まいにち 1/10
184 数字 すうじ 0/1
851 戸 と 0
911 遠く とおく 3/0
1384 厚い あつい 0/02
192 曲 きょく 01/02
1112 大勢 おおぜい 3/0
1034 正月 しょうがつ 4/0
914 親切 しんせつ 1/0
2 日 ひ 0/1
815 遠慮 えんりょ 0/1
565 都合 つごう 0/1
6227 都合 つごう 0/1
296 明日 あした 3/0
1464 幾ら いくら 1/10
1338 一昨日 おととい 3/0
110 九 く 1
27 金 きん 1
413 大事 だいじ 13/0/3
269 血 ち 0
1094 昨夜 ゆうべ 3
2681 結局 けっきょく 40/0
2442 将来 しょうらい 1/0
3474 一層 いっそう 1/0
3965 余り あまり 3/01/0
6970 余り あまり 3/01/0
2338 結構 けっこう 03/1
2803 一面 いちめん 02/0
2220 明日 あす 2/0
3423 親切 しんせつ 1/0
2830 詰まり つまり 3/1
2337 結構 けっこう 03/1
1907 網 あみ 2
1831 品 しな 0/12
3336 折角 せっかく 0/40
5507 漢和 かんわ 1/0
4839 幾ら いくら 1/10
1784 会 かい 1
1767 小 しょう 1
4776 抱く だく 0
3719 発明 はつめい 0/2
2118 中古 ちゅうこ 0/1
4374 皮肉 ひにく 1/0
3742 手前 てまえ 0/01
1778 元 もと 1
4689 船長 せんちょう 1/0
848 大体 だいたい 0/03
3278 大体 だいたい 0/03
4878 凸凹 でこぼこ 0/1
1752 日 にち 1
4406 陽気 ようき 0/1
4369 頼み たのみ 31/13
2335 豆 まめ 2
2267 田 た 1
4505 長持ち ながもち 34/30
3579 振り ふり 02
1789 能 のう 1/01
2281 根 ね 1
1788 地 ち 1
1809 面 めん 0/1
1754 大 だい 1
1790 能 のう 1/01
4824 挟む はさむ 2
2882 蛇 へび 1
2697 本店 ほんてん 0/1
1973 間 ま 0
2308 末 まつ 10
2249 皆 みな 2/0
2481 虫 むし 0
2150 世 よ 10/1
2643 楽 らく 2/1
2450 礼 れい 1/0
5929 作 さく 12/02
5953 個 こ 1
5986 不断 ふだん 01/1
6020 角 かく 21/12/20
6053 松 まつ 1
6132 潮 しお 2
6215 文庫 ぶんこ 01/0
6307 同期 どうき 1/0
6357 一等 いっとう 03/0
6449 由来 ゆらい 0/1
6680 流し ながし 3/1/31
7017 何とも なんとも 01/0
7092 人道 じんどう 1/0
7106 雑 ざつ 0/1
7113 加減 かげん 01/0
7201 漁 りょう 1
7261 年頃 としごろ 0/2
7965 下見 したみ 0/2
8053 惚け ぼけ 2
8213 一円 いちえん 02/0
8496 他面 ためん 01/0
8747 訛り なまり 30
8948 擦る する 1
9111 陰気 いんき 1/0
9402 根掘り葉掘り ねほりはほり 1-14
9516 惚ける ぼける 2

I haven't updated these in the spreadsheet yet.
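
To apply these in bulk rather than by hand, the rows above can be parsed directly. A minimal sketch, assuming the four columns are separated by whitespace and that the deck's accents are held in a dict keyed by Uindex (a stand-in for however your data is stored, with made-up starting values):

```python
def parse_updates(text):
    """Parse "Uindex Kanji Kana UpdatedAccent" rows into {uindex: accent}."""
    updates = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # skips the header row and blank lines
        uindex, _kanji, _kana, accent = parts
        updates[int(uindex)] = accent
    return updates

rows = """Uindex Kanji Kana UpdatedAccent
1910 気 き 0
656 一番 いちばん 2/0
9402 根掘り葉掘り ねほりはほり 1-14
"""
updates = parse_updates(rows)
deck = {1910: "1", 656: "2", 7: "0"}  # made-up starting state
deck.update({k: v for k, v in updates.items() if k in deck})
print(deck)
```

The accent is kept as a string on purpose, so values like 0/01 and 1-14 pass through unchanged.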