
Mighty Morphin Morphology

#26
There is a utf-8 version of the dictionary that mecab uses which causes errors with your plugin. A user with that version needs to edit the source code to make the plugin work.

It is not something you need to fix (the official plugin for Japanese support has the same issue), but it might be good to know.

A simple fix:
Open morphemes.py and replace all occurrences of 'euc-jp' with 'utf-8'.
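If you'd rather script that edit (e.g. to redo it after a plugin update), here is a small Python sketch of the same fix. The filename comes from the post above; the helper name and the `.bak` backup are just this example's conventions:

```python
from pathlib import Path

def switch_encoding(path='morphemes.py', old='euc-jp', new='utf-8'):
    """Back up the file, then replace every occurrence of `old` with `new`."""
    p = Path(path)
    src = p.read_text(encoding='utf-8')
    p.with_name(p.name + '.bak').write_text(src, encoding='utf-8')  # safety copy
    p.write_text(src.replace(old, new), encoding='utf-8')
```

Run it from the plugin directory (the same place you would open morphemes.py by hand).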

Edit: I confused mecab with the dictionary it uses.
Edited: 2011-05-15, 4:21 am
#27
Pauline Wrote:There is a version of mecab that uses utf-8 which will cause errors when using your plugin. A user with that version needs to edit the source code to make the plugin work.

It is not something you need to fix (the official plugin for Japanese support has the same issue), but it might be good to know.

A simple fix:
Open morphemes.py and replace all occurrences of 'euc-jp' with 'utf-8'.
Really? I thought the new utf-8 support needed you to specify an option to make it use utf-8?
#28
This plugin sounds great, thanks a lot! I do think though that deciding which word to learn next should be a trade-off between the value of the word and the ease of learning. You said that it's possible to give priority to certain decks, but would it be possible to give words a rating based on how many times they appear? If possible it would be nice to be able to exclude decks from this as well, since it's obviously not relevant how many times a word appears in something like Core 6000. It might also be useful to give more significance to words that appear in multiple decks, as they are likely to be more useful for other things as well.

P.S. I was planning on making a Fate/Stay Night deck too; where did you download it all from?
Edited: 2011-05-14, 3:37 pm
#29
Splatted Wrote:You said that it's possible to give priority to certain decks, but would it be possible to give words a rating based on how many times they appear?
Unfortunately my idea of using a full-blown assignment problem algorithm, which would allow you to assign 'costs' to certain pairings, is ruled out for the moment since it's too slow. A 100x100 (ie 100 morphemes in the db to 100 sentences in the fact selection) cost matrix takes only a second, but 1000x1000 doesn't finish within an hour. I've done a bit of experimentation with implementing the matching algorithm in a different program using some fast C or Haskell code, but 1000x1000 still takes a while, and even slightly larger sizes very quickly fail to complete within an hour. I may return to this eventually, but for now I'm sticking with the maximum cardinality bipartite matching (ie, just find the largest set of possible pairings).
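To make the cost of the full assignment problem concrete, here is an illustrative brute-force solver (not the plugin's actual code): it tries every one-to-one pairing, which is O(n!) and hopeless beyond roughly n = 10. The Hungarian algorithm cuts this to O(n^3), but as noted above, even that evidently struggled at 1000x1000.

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.

    cost[i][j] = cost of pairing morpheme i with sentence j.
    Tries all n! pairings, so it is only feasible for tiny n.
    """
    n = len(cost)
    best_perm, best_cost = None, float('inf')
    for perm in permutations(range(n)):  # perm[i] = sentence given to morpheme i
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
print(brute_force_assignment(cost))  # -> ((1, 0, 2), 5)
```

Even at n = 12 this already needs about 479 million permutations, which is why the maximum cardinality matching (which ignores costs entirely) stays tractable where this does not.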

Splatted Wrote:If possible it would be nice to be able to exclude decks from this as well, since it's obviously not relevant how many times a word appears in something like Core 6000.
Note, you can still accomplish most things just by carefully crafting your DBs and fact selection. I make a separate DB for each deck, and sometimes even subsets of the deck, and then combine them (ie union) into larger DBs as well.

Splatted Wrote:It might also be useful to give more significance to words that appear in multiple decks, as they are likely to be more useful for other things as well.
The databases actually contain frequency information. The format is basically a TSV file: a 4-column morpheme entry plus a number recording how many times it was seen. I'll consider adding a frequency filter to MorphMan to make use of it.

Splatted Wrote:P.S. I was planning on making a Fate/Stay Night deck too; where did you download it all from?
I made it myself. I'm currently working on re-timing subs from a bunch of shows I have and making decks for them. Once I get a complete set of sync'd eng/jap subs for a show I'll probably post them somewhere, FSN included.
#30
overture2112 Wrote:
Pauline Wrote:There is a version of mecab that uses utf-8 which will cause errors when using your plugin. [...] A simple fix: open morphemes.py and replace all occurrences of 'euc-jp' with 'utf-8'.
Really? I thought the new utf-8 support needed you to specify an option to make it use utf-8?
New utf-8 support in mecab? I don't know anything about that, but it seems I confused mecab with the dictionary it uses. On the Anki wiki you can read this about Japanese support (my bolding):
Quote:If you're on Linux, install Mecab, Mecab's IPADIC in euc-jp format, and Kakasi. If you want to use the UTF-8 IPADIC, you'll need to edit the plugin's source. You must make sure juman is not installed.
#31
I keep getting this error message whenever I try to create a db; do you have any idea what's wrong? At least the second to last line gave me some lolz.
Anki Wrote:Traceback (most recent call last):
File "C:\Documents and Settings\Oscar\Application Data\.anki\plugins\morph\util.py", line 63, in
ed.connect( a, SIGNAL('triggered()'), lambda e=ed: doOnSelection( e, overviewMsg, progMsg, preF, perF, postF ) )
File "C:\Documents and Settings\Oscar\Application Data\.anki\plugins\morph\util.py", line 37, in doOnSelection
st = preF( ed )
File "C:\Documents and Settings\Oscar\Application Data\.anki\plugins\morph\exportMorphemes.py", line 13, in pre
return { 'ed':ed, 'srcName':name, 'filePath':path, 'ms':[], 'mp':m.mecab(None) }
File "C:\Documents and Settings\Oscar\Application Data\.anki\plugins\morph\morphemes.py", line 18, in mecab
return subprocess.Popen( mecabCmd, bufsize=-1, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, startupinfo=si )
File "C:\cygwin\home\dae\Home\anki\win\build\pyi.win32\anki\outPYZ1.pyz/subprocess", line 672, in __init__
File "C:\cygwin\home\dae\Home\anki\win\build\pyi.win32\anki\outPYZ1.pyz/subprocess", line 882, in _execute_child
WindowsError: [Error 2] The system cannot find the file specified
overture2112 Wrote:The databases actually contain frequency information. The format is basically a TSV file: a 4-column morpheme entry plus a number recording how many times it was seen. I'll consider adding a frequency filter to MorphMan to make use of it.
I don't think there's any need for anything complicated here. Just taking a batch of the most common words (e.g. the 1000 most common, or all the words that appear 3 or more times, etc.) and then sorting that into the easiest-to-learn order would make a huge difference.

overture2112 Wrote:I made it myself. I'm currently working on re-timing subs from a bunch of shows I have and making decks for them. Once I get a complete set of sync'd eng/jap subs for a show I'll probably post them somewhere, FSN included.
You are a saint ^^
#32
Splatted Wrote:I keep getting this error message whenever I try to create a db; do you have any idea what's wrong? At least the second to last line gave me some lolz.
Anki Wrote:Traceback (most recent call last): [...]
WindowsError: [Error 2] The system cannot find the file specified
It's failing to run mecab.
Do you have the Japanese plugin installed and can you generate readings?
Where is mecab's exe located on your system?
What operating system are you using?

Splatted Wrote:
I don't think there's any need for anything complicated here. Just taking a batch of the most common words (e.g. the 1000 most common, or all the words that appear 3 or more times, etc.) and then sorting that into the easiest-to-learn order would make a huge difference.
If you want to try this now, open up the DB in a spreadsheet program or something (since it's just a tsv), use that to sort the rows by the frequency column, delete stuff that doesn't appear enough, then save it (perhaps to a new file) and use that.
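The spreadsheet steps above can also be scripted. This is a hedged sketch: it assumes the layout described earlier in the thread (four tab-separated morpheme fields followed by a count column), and the function name is this example's own; adjust the column index if your DB differs.

```python
import csv

def filter_db_by_frequency(in_path, out_path, min_count=3):
    """Keep only rows seen at least min_count times, most frequent first.

    Assumes each row is four tab-separated morpheme fields followed
    by a count in the last column.
    """
    with open(in_path, encoding='utf-8', newline='') as f:
        rows = [r for r in csv.reader(f, delimiter='\t') if r]
    kept = [r for r in rows if int(r[-1]) >= min_count]
    kept.sort(key=lambda r: int(r[-1]), reverse=True)
    with open(out_path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f, delimiter='\t').writerows(kept)
```

Point it at a copy of your DB and feed the output file back to the plugin, exactly as with the manual spreadsheet edit.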
#33
Thanks overture, I think I finally got it working.

overture2112 Wrote:
If you want to try this now, open up the DB in a spreadsheet program or something (since it's just a tsv), use that to sort the rows by the frequency column, delete stuff that doesn't appear enough, then save it (perhaps to a new file) and use that.
Great! I will do that. ^^
#34
Hello, lurking for three years, first post, etc.

Boy.pockets Wrote:Another thing I thought might be cool (maybe you are already doing this): using the kanji readings to look for the next best word to use. For example, say you already know '出席「しゅっせき」', then an easy one to learn next might be '出廷「しゅってい」' (especially if you already know 'tei' from another word).
I've had this idea for years and finally got around to working on it. I've been doing this manually all along (favouring "knowable" words as much as I can), and can't imagine learning vocab any other way. I've tried, but prefer this method. Searching and adding words manually is very slow, so I am automating it!

overture2112 Wrote:Problem: given a kanji compound, how can you determine which parts of the reading are from which kanji?
That really is the core problem. I was surprised to find out nothing exists already to do this. I've started a project that uses readings of Kanji from KANJIDIC2 and tries to build a solution given a word's reading. I'll build a database of solutions for the entries in JMdict. I have the basics down, and it works well, but there are lots of other features of Japanese left to account for (like the example above, 出's reading is しゅつ, but the つ turns into a small っ in some compounds).
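A minimal sketch of that segmentation idea, for illustration only: the `READINGS` table here is a toy stand-in for KANJIDIC2, and the real ryuujouji has to handle much more (okurigana, rendaku, etc.). It does show the recursive matching plus the しゅつ → しゅっ sokuon rule from the example above.

```python
# Toy reading table standing in for KANJIDIC2 (readings in kana).
READINGS = {
    '出': ['しゅつ', 'で', 'だ'],
    '席': ['せき'],
    '廷': ['てい'],
    '停': ['てい'],
}

def segment(word, reading):
    """Return one split of `reading` across the characters of `word`, or None.

    Tries each known reading of the next character (kana map to
    themselves), including the sokuon variant (final つ -> っ), and
    recurses on whatever part of the reading is left over.
    """
    if not word:
        return [] if not reading else None
    char, rest = word[0], word[1:]
    candidates = list(READINGS.get(char, [char]))
    candidates += [r[:-1] + 'っ' for r in candidates if r.endswith('つ')]
    for r in candidates:
        if reading.startswith(r):
            tail = segment(rest, reading[len(r):])
            if tail is not None:
                return [(char, r)] + tail
    return None

print(segment('出席', 'しゅっせき'))  # -> [('出', 'しゅっ'), ('席', 'せき')]
```

A real solver would return all solutions (as the sample output below does) rather than the first one found, but the recursive shape is the same.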

It's on github: https://github.com/ntsp/ryuujouji
ryuujouji = 粒状字 = granular characters, which I think describes it well. It's nowhere near complete or usable and has no documentation, but I'll get around to that in the next few days. Here is some sample output for your entertainment:

Code:
Solving: 小牛 == こうし
Solution # 0
小 -- こ  -- kanji
牛 -- うし  -- kanji

Solving: バス停 == バスてい
Solution # 0
バ -- バ  -- kana
ス -- ス  -- kana
停 -- てい  -- kanji

Solving: 非常事態 == ひじょうじたい
Solution # 0
非 -- ひ  -- kanji
常 -- じょう  -- kanji
事 -- じ  -- kanji
態 -- たい  -- kanji

Solving: 建て替える == たてかえる
Solution # 0
建て -- たて  -- kanji (た) with okurigana (て)
替える -- かえる  -- kanji (か) with okurigana (える)
Solution # 1
建て -- たて  -- kanji (た) with okurigana (て)
替え -- かえ  -- kanji (か) with okurigana (え)
る -- る  -- kana
Edited: 2011-05-21, 10:44 pm
#35
netsplitter Wrote:It's on github: https://github.com/ntsp/ryuujouji
Fantastic! If you look at some of the code for my plugin, it should be easy to see how to add a better vocab ranking system using your code once you think it's mature. Then you could add it to the shared plugins (and just note the dependency) or I could add it into mine if you wish.
#36
netsplitter Wrote:Hello, lurking for three years, first post, etc.
Happy to be in your first post : )

Your code is in Python, so it shouldn't be too difficult to use with this plugin to refine the morphology results. If you've already got the basics going, then I think it's enough to have something useful. I mean, you might not be able to achieve certainty with some words, but you should be able to make a good guess. And given the amount of kanji we are working with, that should be good enough (if there are some mistakes, I don't think that will make much of an impact). Anyway, be sure to post an update when you make more progress.
#37
Just wanted to say thank you overture2112; I'm using both this plugin and your glossing plugin. Very helpful... wish I had both 2 years ago Big Grin.
#38
netsplitter: following up on my last post, what I am trying to say is 'close enough is good enough'. If it helps to improve the accuracy, then that is what we need.
#39
Boy.pockets Wrote:'close enough is good enough'.
For now, okay. But I'm still aiming for a 99% match rate for the entries in JMdict. In the meantime, I'll create something more portable out of the current version so you can use it right away. overture2112, I haven't looked at your plugin in great detail, so I think I'll just let you add it. I'll let you know when it's ready.

Edit: Okay, it should be good to go (I hope). Just look at the example in the README. Let me know how that works out.
Edited: 2011-05-22, 11:43 pm
#40
Just had to post a screenshot of how cool this is getting...

Images, audio & subs auto extracted with subs2srs.
Furigana auto generated by Anki.
Gloss auto generated by jmrGloss plugin.
Cards ordered by iPlusN with the Morphology plugin.
Portable on iPad...

[Image: 2mg84r6.png]
Edited: 2011-05-25, 5:45 am
#41
Still haven't tinkered with this yet, but I posted some tools here you might want to play with: http://forum.koohii.com/showthread.php?p...#pid138877

The dependency parser (which uses KNP [which seems superior] and CaboCha [via ChaSen]) at Langrid is very interesting. As is the concept dictionary, though at the moment I'm probably going to explore the idea of using Japanese WordNet itself as a kind of extemporaneous hypertext trove of semantic relations.

I'm also taken with KH Coder's suite of tools, but at the moment I've turned my focus to AntConc's cluster and file view stuff to quickly identify, sort, and list collocations and the like.

I would bug cb4960 about this too but I figure they're taking a vacation. What do you think about, for example, adding dependency parsing to Anki?

Edit: Also, as for constructing a known.db and keeping it up-to-date, do you think a reminder plugin would come in handy?
Edited: 2011-05-25, 12:47 pm
#42
overture2112 Wrote:v1.5 implements a feature to find a maximum cardinality bipartite matching between morphemes in a DB to facts whose Expression field contains said morphemes.
I had been considering making a plugin to do what yours does, since I had actually lost track of where I heard of a plugin that can track n+1. But obviously I found it again Big Grin. Anyway, one of the rough points I hit in just pondering how to build the app was the issue of "how do you programmatically build a deck such that it's like Core2k and all sentences build on each other in an n+1 or n+2 fashion?" Is that basically what the maximum cardinality matching algorithm does?
#43
nest0r Wrote:Edit: Also, as for constructing a known.db and keeping it up-to-date, do you think a reminder plugin would come in handy?
I would like to see something like this too. My first thought was that it would be nice to be able to specify "input decks" which Morfin* monitors. When cards come into and out of maturity, Morfin would add and remove them from the known db. This way you would only have to specify the source decks and the rest would be automatic. Of course, this would mean that you would have to trust Anki's way of calculating maturity (which I think is fine).

Another nice-to-have would be the ability to specify target vocabulary which you don't know but would like to study next. Not sure how this would fit with the current metrics; perhaps a new one with a combined calculation would work best? This sort of functionality would be nice for people studying for tests, or studying from a textbook. E.g. for tests, if you are studying for Kanken, you could specify the vocabulary for your target level and the plugin would take this into account. Note, this would mean that the learning would not be as "efficient", as you would be learning sentences with more unknowns because of this introduced bias toward certain words.


Notes:
*Morfin = Mighty Morphin Morphology - does it have an official name yet?

edit: changed wording because it did not sound polite.
Edited: 2011-05-26, 2:43 am
#44
vosmiura Wrote:Just had to post a screenshot of how cool this is getting...

Images, audio & subs auto extracted with subs2srs.
Furigana auto generated by Anki.
Gloss auto generated by jmrGloss plugin.
Cards ordered by iPlusN with the Morphology plugin.
I also add context sentences (2 before and 2 after), unknowns and matchedMorpheme (so I know what's new and thus what to focus on), i+N and vocabRank (mostly for seeing how well these gauge difficulty; no real pedagogical purpose), and the subs2srs time (a reference so I can watch a scene in full for extra context / to appease my curiosity).

vix86 Wrote:Anyway, one of the rough points I hit in just pondering how to build the app was dealing with the issue of "how do you programmatically build a deck such that its like Core2k and all sentences build on each other in an n+1 or n+2 fashion?" Is that basically what the Maximum cardinality matching algorithm does?
No. You could use the plugin to write some code that would do that though:

0) Given some source corpus of cards C, empty deck D, and known db K.
1) Copy K to K'.
2) Using known db K', add all i+1 cards in C to deck D and db K'.
3) Repeat (2) until deck is as large as you want it or there are no more i+1 cards to add.
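The steps above can be sketched directly in Python. This is a hypothetical illustration, not plugin code: cards are given as (id, morpheme set) pairs, i.e. the mecab analysis is assumed to have happened already.

```python
def build_n_plus_1_deck(cards, known, max_size=None):
    """Greedily order cards so each one introduces exactly one new morpheme.

    cards: list of (card_id, morpheme_set) pairs from the source corpus C;
    known: set of morphemes already in the known db K.
    Returns the ordered list of card ids for deck D.
    """
    known = set(known)                  # work on a copy (the K' of step 1)
    remaining = list(cards)
    deck = []
    while remaining:
        progressed = False
        for card in list(remaining):    # iterate over a snapshot (step 2)
            if max_size is not None and len(deck) >= max_size:
                return deck
            card_id, morphs = card
            unknown = morphs - known
            if len(unknown) == 1:       # an i+1 card relative to K'
                deck.append(card_id)
                known |= unknown        # its new morpheme is now "known"
                remaining.remove(card)
                progressed = True
        if not progressed:              # step 3: no i+1 cards left
            break
    return deck
```

Swapping the `== 1` test for `<= 2` would give the looser n+2 variant mentioned above.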


A maximum cardinality bipartite matching:

Given two distinct sets of things A and B (eg, A = words to learn, B = cards you can learn them from), there is a set S of all possible combinations of 1 to 1 pairings of things in A to things in B.

That is, A and B form a bipartite graph and thus S contains bipartite matchings of A and B. So a maximum cardinality bipartite matching is simply the largest member of S (ie, the combination of pairings which has the most pairings).
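For the curious, the standard way to compute such a matching is with augmenting paths (Kuhn's algorithm); here is a compact sketch with toy data — the morpheme-to-card adjacency is made up for illustration.

```python
def max_bipartite_matching(adj):
    """Maximum cardinality bipartite matching via augmenting paths (Kuhn).

    adj maps each left node (e.g. a morpheme) to the list of right
    nodes (e.g. cards) it could be paired with. Returns {right: left}.
    """
    match = {}                          # right node -> its current partner

    def try_assign(left, seen):
        """Try to find an augmenting path starting from `left`."""
        for right in adj[left]:
            if right in seen:
                continue
            seen.add(right)
            # Pair up if `right` is free, or if its partner can re-pair.
            if right not in match or try_assign(match[right], seen):
                match[right] = left
                return True
        return False

    for left in adj:
        try_assign(left, set())
    return match

# Toy example: three morphemes competing for two candidate cards.
adj = {'は': ['card1'], '猫': ['card1', 'card2'], '食べる': ['card2']}
print(len(max_bipartite_matching(adj)))  # -> 2
```

This runs in roughly O(V·E), which is why it stays fast where the cost-aware assignment problem above does not.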
Edited: 2011-05-26, 11:31 am
#45
Boy.pockets Wrote:
nest0r Wrote:...reminder plugin...would come in handy?
...it would be nice to be able to specify "input decks" which Morfin* monitors. When cards come into and out of maturity, Morfin would add and remove them from the known db. This way you would only have to specify the source decks and then the rest would be automatic
Yes, I noticed this after creating a handful of new subs2srs decks and trying to keep them all in sync as I switch around between them (how could I possibly choose just one Kugimiya Rie show?). I started some work on automating everything, but it will probably have to wait until next week.

Boy.pockets Wrote:Another nice to have would be to be able to specify target vocabulary, which you don't know, but would like to study next....This sort of functionality would be nice for people studying for tests, or studying from a text book. E.g. for tests, say if you are studying for Kanken, you could specify the vocabulary for your target level and the plugin would take this into account...
This is basically the point of the morpheme matching feature.
1) Make a db from the list of words you want to learn. If your goal is certain sentences, select them and export (possibly merging if they're from multiple decks). If you have just a list of words, save them as a utf-8 text file and use MorphMan to extract a DB from that file.
2) Select some cards you want to learn these words from, then use It's Morphin Time > Match Morphemes. This will set the 'matchedMorpheme' field on some of the cards with a word from the DB.

It's mostly useful if you have short term goals like a chapter vocab list from a grammar book, all the words from a particular episode of a show you want to try watching, etc.

Boy.pockets Wrote:*Morfin = Mighty Morphin Morphology - does it have a official name yet?
Not really. I'm sure there's an excellent name using the word "Morph" or へんしん, but I guess I don't watch enough magical girl anime.
#46
See edit.

Traceback (most recent call last):
File "C:\Users\vix\AppData\Roaming\.anki\plugins\morph\util.py", line 63, in
ed.connect( a, SIGNAL('triggered()'), lambda e=ed: doOnSelection( e, overviewMsg, progMsg, preF, perF, postF ) )
File "C:\Users\vix\AppData\Roaming\.anki\plugins\morph\util.py", line 54, in doOnSelection
postF( st )
File "C:\Users\vix\AppData\Roaming\.anki\plugins\morph\exportMorphemes.py", line 24, in post
m.mergeFiles( st['filePath'], util.knownDbPath, util.knownDbPath )
File "C:\Users\vix\AppData\Roaming\.anki\plugins\morph\morphemes.py", line 120, in mergeFiles
a, b = loadDb( aPath ), loadDb( bPath )
File "C:\Users\vix\AppData\Roaming\.anki\plugins\morph\morphemes.py", line 60, in loadDbU
buf = open( path, 'rb' ).read().decode('utf-8')
IOError: [Errno 2] No such file or directory: 'C:\\Users\\vix\\AppData\\Roaming\\.anki\\plugins\\morph\\dbs\\known.db'

Edit: Not Python's fault. There just simply isn't a 'known.db' file there.
Edited: 2011-05-26, 9:06 pm
#47
Okay, so I've started playing with this. It's amazing! Just updated, haven't tested out the morpheme to expression matching, but I'm sure it'll be very useful, as you noted, re: subs2srs + vocabulary.

Have you already talked about vocab card creation somewhere in the matching process? Like when it determines unknown morphemes based on cards' expressions, it'll create cards for them, that sort of thing? Perhaps integrated with the glossing function? Guess I should read through your comments on this plugin.

Thanks for creating it. It's always lovely when someone brilliant and innovative appears to give us free tools that are revolutionizing language self-study, which is what I think happens on a strikingly regular basis here at RevTK.
Edited: 2011-05-26, 9:28 pm
#48
vix86 Wrote:See edit.

Traceback (most recent call last): [...]
IOError: [Errno 2] No such file or directory: 'C:\\Users\\vix\\AppData\\Roaming\\.anki\\plugins\\morph\\dbs\\known.db'

Edit: Not Python's fault. There just simply isn't a 'known.db' file there.
Yep, there's a little mishandling in that case.

Workaround is to export morphemes to known.db first.
#49
vosmiura Wrote:
vix86 Wrote:Traceback (most recent call last): [...]
IOError: [Errno 2] No such file or directory: 'C:\\Users\\vix\\AppData\\Roaming\\.anki\\plugins\\morph\\dbs\\known.db'

Edit: Not Python's fault. There just simply isn't a 'known.db' file there.
Yep, there's a little mishandling in that case.

Workaround is to export morphemes to known.db first.
It looks like you said 'Yes' when it asked if you wanted to merge with known.db, and so it failed since there's no known.db to merge with. That's why the instructions say to click 'No' when making your initial known.db.
#50
overture2112 Wrote:It looks like you said 'Yes' when it asked if you wanted to merge with known.db, and so it failed since there's no known.db to merge with. That's why the instructions say to click 'No' when making your initial known.db.
Logic dictated I didn't have a known.db yet so maybe I should just make my own. Seemed to fix the problem Tongue.