
Systematic On and Kun Review Cards using a sort program.

#1
*edit* Here's a rough spreadsheet I made, Optimized On Kun Sort, that should be usable. It only has on'yomi and kun'yomi for the words that matched up with the Tanuki list I had on hand. It's not much, but it should offer a proof of concept for anyone interested. *edit*

I'm sure people have done this manually, but is there a way to use a vocabulary list such as the Kore 2k/6k to create a systematic approach to reviewing the Kanji's On and Kun readings?

Basically, once you learn three words or more for a particular kanji's on or kun, you unlock a review card. It's simply a passive review card with the Kanji in large font and the three to six words underneath that use that Kanji's same pronunciation. The answer is simply the yomi in addition to the furigana version of the words. Maybe have the sorting number (in case of Kore 2k/6k) under each word.
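For what it's worth, the unlock rule described above fits in a few lines of Python. Everything here is illustrative: the toy `VOCAB` list, the `build_cards` name, and the assumption that each vocab entry already carries the kanji's reading within that word (deriving that split automatically is the hard part):

```python
from collections import defaultdict

# Toy data: (word, kanji, reading of that kanji in the word, sort number).
# A real run would pull these from the Kore 2k/6k list.
VOCAB = [
    ("海岸", "海", "かい", 120),
    ("海外", "海", "かい", 87),
    ("航海", "海", "かい", 910),
    ("海", "海", "うみ", 45),
]

def build_cards(vocab, min_words=3, max_words=6):
    """Group words by (kanji, reading); unlock a card once min_words accumulate."""
    groups = defaultdict(list)
    for word, kanji, reading, number in vocab:
        groups[(kanji, reading)].append((number, word))
    cards = []
    for (kanji, reading), entries in groups.items():
        if len(entries) >= min_words:
            entries.sort()  # lowest sort number first
            cards.append({
                "front": kanji,                             # kanji in large font
                "words": [w for _, w in entries[:max_words]],
                "answer": reading,                          # the yomi
            })
    return cards

cards = build_cards(VOCAB)  # one card: 海 → かい (うみ has only one word so far)
```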

I figure the benefit is this can be mixed in easily when learning vocabulary and further emphasize the readings using pre-existing knowledge. You create fewer cards, and since it's passive, reviewing these would be much faster. It's not about knowing what the words mean, as your vocab deck takes care of that. It's about reinforcing readings and maybe helping with difficulties when there are multiple readings.

Anyway, it seems like something a sorting program could do. It'd be cool if it could work with any list so any student can apply it regardless of the path chosen. Has something like this been done? I figure many have done it manually, but that seems tedious even with the find feature in spreadsheets.
Edited: 2013-01-14, 3:22 am
#2
I have had something like that on my todo list for months. But I don't use Anki, so I used a Ruby script to generate an HTML file like this:

[Image: 9c353d1ae6ee.png]

The script goes through random jōyō on-yomi (like カイ / 海) and selects words that start or end with them. It misses a lot of words that way, but there aren't many errors if it's restricted to two-kanji compounds. The words and translations are taken from JMdict, excluding words I've saved to a text file of learned words.
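As a rough illustration of that start/end matching (the toy word list and the `matches` name are mine, not the actual Ruby script):

```python
# Toy (word, reading) pairs standing in for JMdict entries.
WORDS = [("海岸", "かいがん"), ("航海", "こうかい"), ("外出", "がいしゅつ")]

def matches(kanji, onyomi, words):
    """Two-kanji compounds whose reading starts or ends with the given on-yomi."""
    hits = []
    for word, reading in words:
        if len(word) != 2:
            continue
        if word[0] == kanji and reading.startswith(onyomi):
            hits.append(word)
        elif word[1] == kanji and reading.endswith(onyomi):
            hits.append(word)
    return hits

matches("海", "かい", WORDS)  # → ['海岸', '航海']; rendaku/sokuon forms are missed
```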
Edited: 2013-11-11, 4:48 pm
#3
Nukemarine Wrote:Now if there's a fast way to get the On and Kun for that particular kanji for that particular word into a column
It's not easy. Consider:
1 - 外出[がいしゅつ]
2 - 出発[しゅっぱつ]
3 - 開発[かいはつ]
Do you consider 1 and 2 to have the same reading for 出? And 2 and 3 to have the same reading for 発?

Of course, it's easier for a learner to have words with exact readings, but there are many words where you won't find an abundant supply of other common words (or any words) that do have exact readings. The variants are also not terribly difficult to learn once you've learned the original reading, so you likely don't want to ignore them either.
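One way to treat such variants as the same reading is to normalize rendaku and sokuon before comparing. This is only a sketch under loud assumptions: the translation table only un-voices kana, and treating a trailing っ as つ is wrong for readings whose base ends in ち, く, or き (e.g. 学 がく → がっ):

```python
# Map voiced/half-voiced kana back to their plain forms (illustrative subset).
DAKUTEN = str.maketrans(
    "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ",
    "かきくけこさしすせそたちつてとはひふへほはひふへほ",
)

def base_reading(reading):
    """Crudely normalize a reading fragment for comparison across variants."""
    r = reading.translate(DAKUTEN)
    if r.endswith("っ"):
        r = r[:-1] + "つ"  # assumption: sokuon stands for つ (fails for ち/く/き bases)
    return r

base_reading("しゅっ")  # → 'しゅつ', so 出発's 出 compares equal to 外出's
```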

The solution by lauri_ranta might be good enough for most applications, but you'll miss a large number of cases and get some false-positives.

When I first started studying Japanese, I was convinced that ordering vocabulary in this manner was optimal. So I set out to solve this problem and had some great success. Unfortunately, things happened, and I put my Japanese on hold. When I restarted learning the language, I went a different route, and I haven't worked on the project since, as it's not going to provide me any immediate value. It's on hold for now, but I might be able to temporarily re-purpose its functionality for your case. I won't bother if you're fine with the above solution.
#4
@Lauri, looks interesting.

@Netsplitter, I'm not sure it's that difficult, assuming a programming background. MeCab should be able to parse the words and compare them to the already existing kana version of the word to determine the correct on'yomi and its voiced or geminate variant.

On topic: I updated the spreadsheet with my meager skills. First, I put the on'yomi or kun'yomi from the Tanuki spreadsheet in when and where it matched (a few false positives). In addition, I sorted the kanji in the KO2k1 order. Only 2,000 matches out of the 9,000+ list. Still, it should be enough to show whether this is a good idea.
#5
netsplitter Wrote:The solution by lauri_ranta might be good enough for most applications, but you'll miss a large number of cases and get some false-positives.
Yeah, it included pairs of words like this:

由緒[ゆいしょ], 由来[ゆらい]
丸太[まるた], 太鼓[たいこ]

I updated the script to just ignore readings like タ for 太. It won't include pairs like 唯一[ゆいいつ] and 唯一[ゆいつ] either. So it now misses even more words, but it doesn't really matter in this case.

Does anyone know if mecab can be used like ryuujouji?

$ mecab -O??? <<< 漢字
カン ジ

I haven't figured out the custom output formats yet.
Edited: 2013-01-14, 6:30 am
#6
I don't think you can.

However, you did tempt me to peek at the ugly state I've left ryuujouji in and found it still mostly usable as a command-line python tool for the word-segmenting use-case that you're after. So, I've zipped up the useful parts and uploaded it here if you want to use it.

I assume you know how to run a python script. It consumes lines, each a pair of a kanji word and its reading, separated by a space (not the Japanese space or it will break). You can pipe each line like this:

Code:
$ echo "面倒臭い めんどうくさい" | python ryuujouji.py
面倒臭い[めんどうくさい]
面 めん
倒 どう
臭い くさい
Or use a file with -f file.txt:
Code:
$ python ryuujouji.py -f example.txt
赤鷽[アカウソ]
赤 アカ
鷽 ウソ

刈り手[かりて]
刈り かり
手 て

働き蟻[はたらきあり]
働き はたらき
蟻 あり

出席[しゅっせき]
出 しゅっ
席 せき

筆箱[ふでばこ]
筆 ふで
箱 ばこ
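For anyone curious how this kind of segmentation can work in principle, here's a naive greedy sketch. It is not ryuujouji's actual algorithm (which presumably also handles katakana on-yomi, rendaku, sokuon, and backtracking), and the mini reading table is made up:

```python
# Hypothetical per-kanji reading table (hiragana only, for simplicity).
# The real tool draws these from kanjidic.
READINGS = {"筆": ["ふで", "ひつ"], "箱": ["はこ", "ばこ", "そう"]}

def segment(word, reading):
    """Greedy left-to-right match of each kanji's readings against the reading."""
    result, pos = [], 0
    for ch in word:
        for r in sorted(READINGS.get(ch, []), key=len, reverse=True):
            if reading.startswith(r, pos):
                result.append((ch, r))
                pos += len(r)
                break
        else:
            return None  # no reading fits: report "unknown", like the script does
    return result if pos == len(reading) else None

segment("筆箱", "ふでばこ")  # → [('筆', 'ふで'), ('箱', 'ばこ')]
```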
Edited: 2013-01-19, 10:25 am
#7
As to using Mecab for segmentation in Python: https://gist.github.com/4551083
This is adapted from some Anki plugin code.
#8
python ryuujouji.py -f file.txt > file2.txt

Code:
Traceback (most recent call last):
  File "C:\ryuujouji_0.0.1\ryuujouji.py", line 26, in <module>
    print "Error on line %d: %s: %s" % (i+1, e.message, line)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-20: ordinal not in range(128)
On Windows XP. Without > file2.txt it works, but I want to save the results into a file...

Edit:
Adding the following code to ryuujouji.py solved it:

Code:
import codecs
import locale
import sys

sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
Here are 93 common words (as per EDICT) where the script failed to segment them and returned "unknown". This is not a complete list, by the way.

お祖父さん おじいさん
お祖母さん おばあさん
悪戯 いたずら
為替 かわせ
為替相場 かわせそうば
羽撃く はばたく
横文字 よこもじ
仮令 たとえ
何卒 なにとぞ
可哀相 かわいそう
可愛い かわいい
可愛がる かわいがる
可愛らしい かわいらしい
火傷 やけど
垣間見る かいまみる
眼差し まなざし
気質 かたぎ
逆上せる のぼせる
居心地 いごこち
強者 つわもの
強請る ねだる
区々 まちまち
経緯 いきさつ
健気 けなげ
広東 かんとん
香港 ほんこん
高麗 こうらい
合点 がてん
最早 もはや
細やか ささやか
昨夜 ゆうべ
自棄 やけ
狩人 かりゅうど
住み心地 すみごこち
住居 すまい
小火 ぼや
小文字 こもじ
真似 まね
真似る まねる
身体 からだ
身代金要求 みのしろきんようきゅう
図々しい ずうずうしい
袋小路 ふくろこうじ
大刀 たち
大文字 おおもじ
長閑 のどか
天皇 てんのう [note: the script should make an exception for renjou like it does for rendaku]
等閑 なおざり
頭文字 かしらもじ
道程 みちのり
日向 ひなた
日本 にほん
日本語 にほんご
日本酒 にほんしゅ
日本人 にほんじん
乳母車 うばぐるま
入梅 つゆいり
如何 いかん
梅雨入り つゆいり
梅雨明け つゆあけ
微笑む ほほえむ
上手い うまい
美味い うまい
布地 きれじ
文字 もじ
文字通り もじどおり
亡骸 なきがら
北京 ペキン
薬缶 やかん
唯一 ゆいつ
余所 よそ
裸足 はだし
流行 はやり
流行る はやる
一寸 ちょっと
一昨年 おととし
一昨日 おととい、おとつい
明後日 あさって
明明後日 しあさって
不味い まずい
可笑しい おかしい
団扇 うちわ
女形 おやま
従兄弟 いとこ
欠伸 あくび
流石 さすが
海老 えび
煙草 たばこ
玩具 おもちゃ
百合 ゆり
相応しい ふさわしい
美味しい おいしい

Some of these are jukujikun, others ateji and gikun, still others apparently just use a weird reading. Note however that I have not included "officially recognized" jukujikun (一人, etc), for which the script also returned "unknown." Odd that お祖父さん and お祖母さん aren't on the official list...

I always thought "ニ" was one of the readings for 日... Well, I guess it is, sort of, just not "officially recognized," or in any dictionary...

Also, completely worthless, but I was interested in the numbers... Joyo kanji whose joyo onyomi readings start with ハヒフヘホ and undergo rendaku in common jukugo: 81
http://pastebin.com/RLATrZbZ
Of those, only 7 have both a dakuten and handakuten variant:
方 ホウ ボウ,ポウ
本 ホン ボン,ポン
布 フ ブ,プ
版 ハン バン,パン
奉 ホウ ボウ,ポウ
配 ハイ バイ,パイ
遍 ヘン ベン,ペン
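Deriving those dakuten/handakuten variants mechanically is straightforward; a sketch (the mapping tables and `rendaku_variants` name are mine, not from the pastebin script):

```python
# Voiced and half-voiced counterparts of the は-row katakana.
VOICED = {"ハ": "バ", "ヒ": "ビ", "フ": "ブ", "ヘ": "ベ", "ホ": "ボ"}
HALF_VOICED = {"ハ": "パ", "ヒ": "ピ", "フ": "プ", "ヘ": "ペ", "ホ": "ポ"}

def rendaku_variants(onyomi):
    """Dakuten and handakuten variants of an on-yomi starting with ハヒフヘホ."""
    head, tail = onyomi[0], onyomi[1:]
    if head not in VOICED:
        return []
    return [VOICED[head] + tail, HALF_VOICED[head] + tail]

rendaku_variants("ホウ")  # → ['ボウ', 'ポウ'], matching the 方/奉 rows above
```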

And those that don't undergo any rendaku: 96
http://pastebin.com/Ra2N0b7a
Edited: 2014-06-06, 5:09 am
#9
ppvpp, unless I misunderstood how to use it, that file doesn't do what we are talking about. It simply extracts the morphemes. However, it's a great reference on interacting with mecab.

toshiromiballza, I've uploaded a newer version that hopefully fixes your issues. I hadn't tested it on Windows previously, and it really wasn't ready for release in any way. I've changed the syntax: you now use -i for the input file and -o for the output file. Omitting either falls back to stdin or stdout.

E.g.,
Code:
$ python ryuujouji.py -i example.txt -o output.txt
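The flag handling could look something like this with argparse (a hypothetical reconstruction, not the actual ryuujouji code):

```python
import argparse
import sys

# Assumed interface: -i names the input file, -o the output file;
# omitting either falls back to stdin/stdout.
parser = argparse.ArgumentParser(prog="ryuujouji.py")
parser.add_argument("-i", dest="infile", default=None, help="input file (default: stdin)")
parser.add_argument("-o", dest="outfile", default=None, help="output file (default: stdout)")

args = parser.parse_args(["-i", "example.txt"])
# With only -i given, outfile stays None, so results go to sys.stdout.
out = open(args.outfile, "w") if args.outfile else sys.stdout
```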
Edit: Oops, I didn't see your update (had a stale page open in a tab while I fixed it). I didn't know about getpreferredencoding(). Seems useful. I honestly spent longer than I care to admit blaming Windows for having a terrible console, even though it was mostly my fault.

It's easy for me to extract failures from a dictionary, so giving me a list won't do much I'm afraid. I don't know what counts as an "official" reading. I wouldn't mind some kind of explanation for this though, since I wonder sometimes if the reading is simply missing from kanjidic or someone decided that it isn't one.
Edited: 2013-01-19, 10:39 am
#10
netsplitter Wrote:It's easy for me to extract failures from a dictionary, so giving me a list won't do much I'm afraid. I don't know what counts as an "official" reading. I wouldn't mind some kind of explanation for this though, since I wonder sometimes if the reading is simply missing from kanjidic or someone decided that it isn't one.
The list wasn't specifically for you, it's just something I found interesting and decided to share. Kinda like an "unofficial" jukujikun list.

The official list of jukujikun can be found here: http://forum.koohii.com/showthread.php?p...2#pid36232

As for ニ in 日本, it would be about time if somebody finally decided to make it one of the readings for 日. Times change, languages evolve. Deal with it, Japanese!

Interesting how they intentionally didn't mention "日本" as an example word in the official joyo list*, probably to avoid this issue and continue ignoring it!

And yes, there are quite a few words which your program couldn't segment, because the readings aren't in KANJIDIC (yet?), however, they are in KanjiGen or kanjijiten.net for example. This is not the case with my list of 93 words, though, since those readings are not listed there either.

* http://www.bunka.go.jp/kokugo_nihongo/pd...ou_h22.pdf
Edited: 2013-01-19, 1:01 pm
#11
Thanks for those lists.

toshiromiballza Wrote:And yes, there are quite a few words which your program couldn't segment, because the readings aren't in KANJIDIC (yet?), however, they are in KanjiGen or kanjijiten.net for example.
One of my goals with this was to find missing readings and errors in kanjidic and hopefully make it more complete. I was thinking too that the leftover cases could be sifted through (by humans!) to tag them as ateji/gikun/jukujikun/whatevers in JMDict itself, then use those as a result instead of "unknown". I'll get around to it...one of those days (years).
#12
I just tried the 0.0.2 version (didn't have to before as I already got my data with 0.0.1) and it doesn't work. It just stalls there after doing ryuujouji.py -i example.txt -o output.txt.

Python 2.7 on Win7.
#13
Not sure what's happening on your end. It works for me with those exact commands on Python 2.7 on Win 7 using the standard "command prompt" (cmd.exe).
#14
Weird. I'll try on XP (where 0.0.1 worked), but can't do so sooner than tomorrow.

Yep, it works on XP with Python 2.7, but not on my Win7 PC. No idea why it just stalls there.
Edited: 2013-02-18, 9:32 am
#15
Sorry for the somewhat necro-post, but I did make a little addon for Anki that adds JLPT vocab to your RTK kanji cards. You could easily use any list that's in the same format as the one in the addon.

https://ankiweb.net/shared/info/4039910178

You can use that to add the vocab (with cloze deletion on the kanji) to your kanji cards.

Another one:
https://ankiweb.net/shared/info/1139060204

You can use this to take a bunch of vocab cards and add in RTK information. In particular, you wanted to systematically order the vocab: you can use the 'hardest heisig number' option and then order all of your vocab by difficulty according to Heisig kanji.

Hope it helps!
I assume you can work out the usage. :D

Nukemarine Wrote:*edit* Here's a rough spreadsheet I made Optimized On Kun Sort that should be usuable. It only has onyomi and kunyomi for those words that matched up with the Tanuki list I had on hand. It's not much, but should offer a proof of concept for people if interested.*edit*

I'm sure people have done this manually, but is there a way to use a vocabulary list such as the Kore 2k/6k to create a systematic approach to reviewing the Kanji's On and Kun readings?

Basically, once you learn three words or more for a particular kanji's on or kun, you unlock a review card. It's simply a passive review card with the Kanji in large font and the three to six words underneath that use that Kanji's same pronunciation. The answer is simply the yomi in addition to the furigana version of the words. Maybe have the sorting number (in case of Kore 2k/6k) under each word.

I figure the benefit is this can be mixed in easily when learning vocabulary and further emphasize the readings using pre-existing knowledge. You create less cards and it's passive so reviewing these would be much faster. It's not about knowing what the words mean as your vocab deck takes care of that. It's about re-enforcing reading and maybe help difficulties when there are multiple readings.

Anyway, it seems like something a sorting program could do. It'd be cool if it could occur with any list so any student can apply it regardless of the path chosen. Has something like this been done? I figure many have done it manually, but that seems tedious even with the ease of find feature on spreadsheets.
Edited: 2013-08-19, 4:00 am