
Mighty Morphin Morphology

#76
Just a heads up that I shared my modified version of the plugin: "Japanese Morphology - mod".

Changes from original v1.5 plugin:

* Verbs and adjectives are converted to dictionary form, so 行く, 行きます, 行った, 行こう, 行け, etc... are treated as one morpheme instead of multiple morphemes in the original plugin.

* The reading form is changed; for example, 学校 now has the reading ガッコウ instead of ガッコー as in the original plugin. (This is just a personal preference.)

* Changed the vocabRank calculation
- Now vocabRank measures how much of a fact you don't know, rather than how much you know.
- A lower number means the fact is easier to learn, given what you already know.
- Fixed a bug where the vocabRank database was missing some morphemes from the known database.
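For anyone curious how the dictionary-form change might work under the hood: MeCab's default IPADIC output carries the base form as the 7th comma-separated feature field, so collapsing inflections is largely a matter of reading that field. A minimal sketch (the sample output lines are representative of IPADIC's format, not copied from the plugin):

```python
# Sample MeCab-style output (surface \t features) for 行った / 行きます.
SAMPLE = (
    "行っ\t動詞,自立,*,*,五段・カ行促音便,連用タ接続,行く,イッ,イッ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "行き\t動詞,自立,*,*,五段・カ行促音便,連用形,行く,イキ,イキ\n"
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
    "EOS\n"
)

def base_forms(mecab_output):
    """Map each surface token to its dictionary (base) form."""
    forms = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t")
        fields = features.split(",")
        # Field 7 (index 6) is the base form in IPADIC; fall back to surface.
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        forms.append((surface, base))
    return forms

if __name__ == "__main__":
    for surface, base in base_forms(SAMPLE):
        print(surface, "->", base)  # e.g. 行っ -> 行く
```

This is why 行く, 行った, 行きます, etc. can all collapse to the single morpheme 行く.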
Edited: 2011-06-13, 2:37 am
#77
Perhaps these options can be incorporated into the original. I really can't decide which I like better, re: base form vs. inflected.

Edit: Perhaps inflected forms could be weighted differently, or adjusted over time to reflect how frequently inflected forms are stored in the brain (okay, that last bit is much more hypothetical and might require integration with a Rikaichan-mod tracking feature that doesn't yet exist ;p).
Edited: 2011-06-13, 8:00 am
#78
vosmiura Wrote:Just a heads up that I shared my modified version of the plugin: "Japanese Morphology - mod".
Awesome! Will they work side by side? For the sake of more awesome-ness, could it not be merged back into the original plugin, with some configuration options maybe?
#79
Of course. I tried contacting the OP about merging but haven't heard back yet. If some people find the old behavior better in certain cases, then adding configuration options makes sense.

@nest0r, regarding weighting inflected forms differently - could be good.

I've been thinking in general about analyzing grammar patterns and giving them a score based on their occurrence in your decks and status of cards.

For example, I came across the point ~ざるをえない recently and I thought: what if I had a tool that would pull example sentences from the net that use this grammar point and are i+1, and Google-TTS them? That would rock.

Playing around with http://langrid.org/playground/dependency-parser.html

行かざるをえない。
行か ざる を え ない 。
[verb] 行く [other] ぬ [other] を [verb] える [other] ない [other] 。

竜崎はいつかは二人を解放せざるをえなくなる、そして記憶を失った僕は…
解放 せ ざる を え なく なる 、
[noun.other] 解放 [verb] する [other] ぬ [other] を [verb] える [other] ない [other] なる [other] 、

So... we need a grammar parser pass that runs the sentence through CaboCha or KNP and then matches the sequence [verb]+ぬ+を+える+ない as the ざるをえない pattern.
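A rough sketch of what that matching step could look like, assuming the parser output has already been reduced to (pos, base-form) pairs like the lines quoted above (the function name and token format are illustrative, not from any actual tool):

```python
# Match [verb] + ぬ + を + える + ない as the ざるをえない pattern.
# None in a pattern slot means "any base form is fine here".
PATTERN = [("verb", None), ("other", "ぬ"), ("other", "を"),
           ("verb", "える"), ("other", "ない")]

def find_pattern(tokens, pattern=PATTERN):
    """Return the start indices at which the pattern matches."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(pos == p_pos and (p_base is None or base == p_base)
               for (pos, base), (p_pos, p_base) in zip(window, pattern)):
            hits.append(i)
    return hits

# 行かざるをえない。 as parsed in the post:
tokens = [("verb", "行く"), ("other", "ぬ"), ("other", "を"),
          ("verb", "える"), ("other", "ない"), ("other", "。")]
print(find_pattern(tokens))  # [0]
```

The same pattern list would also catch 解放せざるをえなくなる from the second example, since する parses to a [verb] slot followed by the same ぬ/を/える/ない tokens.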
#80
After using the dependency parser quite a bit, I've continued to find CaboCha consistently inferior to KNP, both in how ChaSen vs. JUMAN handle part-of-speech tagging and morpheme segmentation (which can skew the bunsetsu identification), and in how it handles what that KNP paper calls coordinate structures (the "PA" referring to parallel rather than dependent constructions).

As for morphology and part-of-speech tagging alone (e.g. to identify grammar points across sentences): MeCab could possibly handle that, and it might be better for targeted, user-selected grammar-point generation. But if bunsetsu (the stuff on each line) were the target, the dependency parser would come into play; or perhaps the parser could do the coarse work and MeCab could then refine it separately?

Something else I'm interested in is adding another dimension (beyond semantics and readings) to the N-ness: syntactic complexity, with interdependencies as the metric. Perhaps the number of features that lead to a given sentence root (e.g. the arrows pointing to it, though the interdependencies of other bunsetsu might also factor in). I think jReadability (Obi-2) and lexical density (using MeCab or whathaveyou to tag parts-of-speech and count content vs. function words in the equation) might also be useful there.
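For the lexical-density part, a minimal sketch, assuming you already have (surface, POS) pairs from MeCab or similar (which POS categories count as "content words" is my assumption here):

```python
# Content-word POS tags (noun, verb, adjective, adverb); an assumption,
# not a standard fixed by MeCab.
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}

def lexical_density(tagged):
    """tagged: list of (surface, pos) pairs; returns content/total ratio."""
    words = [(s, p) for s, p in tagged if p != "記号"]  # drop punctuation
    if not words:
        return 0.0
    content = sum(1 for _, p in words if p in CONTENT_POS)
    return content / len(words)

# 学校に行った。 tagged roughly as MeCab would tag it:
tagged = [("学校", "名詞"), ("に", "助詞"), ("行っ", "動詞"),
          ("た", "助動詞"), ("。", "記号")]
print(round(lexical_density(tagged), 2))  # 2 content / 4 words = 0.5
```

Higher density (more content words per word) roughly tracks "more information packed per clause", which is one of the knobs readability formulas like Obi-2 turn.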

I hadn't thought much beyond that for user implementation (e.g. that tool you imagined). Interesting possibilities. I like that bit about finding frequent grammar points.
Edited: 2011-06-13, 1:19 pm
#81
vosmiura Wrote:Just a heads up that I shared my modified version of the plugin: "Japanese Morphology - mod".
Neato. Sorry I didn't catch your email this weekend, but I like your changes.

I thought to avoid the base form because it might cause more false positives, but in retrospect maybe that isn't the case. Perhaps someone could try both and report their findings?
#82
@vos

Just as a tangent related to the dependency parser, I added a dependency field to various decks, with this in the card layout: <table><tr><td>
{{Dependency}} - I added the table stuff because it helps the arrows line up. (In retrospect I don't really need the <tr>. ;p)

I'm sure it could be done better, but it's useful for when I choose to add the KNP results to certain types of new cards (I have a couple decks I'm using it with at the moment in different ways, each having a certain type of card).
Edited: 2011-06-14, 8:16 pm
#83
That sounds neat. Are you importing them manually?
#84
overture2112 Wrote:-- To create a morpheme database and initial known.db from anki cards
Open a deck, go to card browser, select all the ones you've studied, go to Actions>It's Morphin Time->Export Morphemes, select a source name (doesn't do anything in this initial release) then a file to save it to.
I just realized "source" wasn't implemented. When it is, will it allow me to select which fields to extract morphemes from? That's a feature I'd really like.
#85
@vos - Yes, unfortunately, for now; but I've integrated it into my workflow for studying new cards. When I first encounter them I open the browser, copy the sentence, paste it into KNP, process it, and copy/paste the result. For one card type that's enough right there, and I use it as a general reference; for another type I also use the bunsetsu information to highlight bunsetsu containing unknown morphemes (quickly select the bunsetsu and hit F7/F6). All in all it adds seconds to the study of new cards, but it's novel at the moment and fun. ;p

Edit: And I have KNP/Juman installed as an offline tool on my laptop but no idea how to run that script stuff: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KNP ; I'd also played with the source code of the Langrid site because it's provided in an easy-to-use format, but still no luck.

@Daichi - I use different source fields for certain cards in this way: http://forum.koohii.com/showthread.php?p...#pid140024 (By the way, that Tanuki stuff is incredibly enjoyable when you do it that way, even with stuff you already know as it's cast into a monolingual light.)
Edited: 2011-06-15, 8:20 am
#86
@Daichi - I was going to make a small joke/comment about your use of "n+1" rather than "i+N" a couple times, and then just recently I stumbled across this, so it actually is an equation and an interesting one!

http://psycnet.apa.org/journals/xhp/36/6/1669/ (Where N is the word being fixated on, and the +X is the upcoming word[s].)

Related to this and previous stuff about syntactic complexity: http://www.mendeley.com/research/depende...ce-metric/ and http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf (Edit: Although for the latter and second language learners we must keep in mind that what constitutes a unit is complicated in that new words, for example, are not a single unit till their compositional elements have been learned and chunked and that chunking and retrieval are strategy-friendly [via metacognition], dynamic processes; not to mention the different types of representations and how they interact [e.g. phonology and orthography]; but in terms of identifying and quantifying another dimension of N-ness on the syntactic level, it's interesting.)
Edited: 2011-06-15, 1:30 pm
#87
I really liked the i+N idea of this, but didn't like how integrated it was with Anki. So I threw together a command-line program to do something similar on a plain text file.
http://pastebin.com/ktZui2M7

Code:
$ nplus1.py /tmp/toadd.txt
00 [質問は利用案内でどうぞ。] []
01 [複数のアカウントを使用している方、] [アカウント]
01 [方針に賛同していただけるなら、] [いただける]
02 [誰でも記事を編集したり新しく作成したりできます。] [新しく, 作成]
03 [ウィキペディアはオープンコンテントの百科事典です。] [オープンコンテント, 事典, 百科]
03 [ガイドブックを読んでから、サンドボックスで練習してみましょう。] [サンド, ガイドブック, ボックス]
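For anyone who doesn't want to open the pastebin, the core i+N idea can be sketched in a few lines. (This is my own reconstruction, not ToasterMage's actual code; a real version segments the sentences with MeCab, while here the segmentation is supplied by hand.)

```python
def n_plus_i(sentences, known):
    """sentences: {sentence: [morphemes]}.
    Returns (n_unknown, sentence, unknowns) tuples, easiest first."""
    ranked = []
    for sent, morphs in sentences.items():
        unknown = [m for m in morphs if m not in known]
        ranked.append((len(unknown), sent, unknown))
    ranked.sort(key=lambda t: t[0])
    return ranked

# Toy known.db and hand-segmented sentences from the output above:
known = {"質問", "は", "利用", "案内", "で", "どうぞ",
         "複数", "の", "を", "使用", "方"}
sentences = {
    "質問は利用案内でどうぞ。": ["質問", "は", "利用", "案内", "で", "どうぞ"],
    "複数のアカウントを使用している方、": ["複数", "の", "アカウント", "を", "使用", "方"],
}
for n, sent, unknown in n_plus_i(sentences, known):
    print(f"{n:02d} [{sent}] {unknown}")
```

Sentences with n=0 are fully known; n=1 lines are the classic i+1 candidates, which is exactly what the numbered output above is showing.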
#88
ToasterMage Wrote:I really liked the i+N idea of this, but didn't like how integrated it was with Anki. So I threw together a command-line program to do something similar on a plain text file.
http://pastebin.com/ktZui2M7

Code:
$ nplus1.py /tmp/toadd.txt
00 [質問は利用案内でどうぞ。] []
01 [複数のアカウントを使用している方、] [アカウント]
01 [方針に賛同していただけるなら、] [いただける]
02 [誰でも記事を編集したり新しく作成したりできます。] [新しく, 作成]
03 [ウィキペディアはオープンコンテントの百科事典です。] [オープンコンテント, 事典, 百科]
03 [ガイドブックを読んでから、サンドボックスで練習してみましょう。] [サンド, ガイドブック, ボックス]
Nice. I was tempted to use the MeCab Python library but decided not to, since most people only have Python via Anki. Have you noticed any speed difference?
#89
Exporting morphemes through Anki takes ~5 seconds.
Code:
$ time nplus1.py /tmp/toadd.txt
00 [質問は利用案内でどうぞ。] []
01 [複数のアカウントを使用している方、] [アカウント]
01 [方針に賛同していただけるなら、] [いただける]
02 [誰でも記事を編集したり新しく作成したりできます。] [新しく, 作成]
03 [ウィキペディアはオープンコンテントの百科事典です。] [オープンコンテント, 事典, 百科]
03 [ガイドブックを読んでから、サンドボックスで練習してみましょう。] [サンド, ガイドブック, ボックス]

real    0m1.215s
user    0m0.724s
sys    0m0.488s
I fiddled around with your code a bit to get the MeCab library working (http://pastebin.com/AmYiYBUB). There was no difference in speed. However, my version did seem to pick up some things yours did not: mine vs. yours. I might have just broken something, I dunno.
#90
There's apparently a homophone/homograph/synonym plugin for Anki; from the description, the synonym part only works manually. I wonder if some of those ideas could tie into the morphology/reading stuff discussed here to reduce interference; perhaps paired with Anki's 'sibling cards' (or the review-session sorting/filtering I mentioned before), and maybe Japanese WordNet could be useful for automatically categorizing word senses/synsets.

Just more thinking aloud, don't mind me.
Edited: 2011-06-17, 5:54 pm
#91
I'm curious, are others making use of the plugin and if so how?

For what it's worth, I've settled into using only my modified 'vocabRank' for sorting new cards. At a high level it has about the same order as 'iPlusN', but it has finer steps.

For example in the Death Note deck, I have around ~3000 new cards left to go through, and ~1300 of those are iPlusN=1. That's quite a lot to choose from. With 'vocabRank' I can narrow down further to ~100 'easiest' cards out of those ~1300.

Picking the set of ~100 cards with the lowest 'vocabRank', I'm also finding that it tends to capture a lot of cards containing the same word. For example, in one set I must have seen a dozen uses of 裁く (to judge; since Kira is judging criminals), conjugated in all sorts of ways: 裁き、裁いた、裁いて、裁かれた. So it helps to learn a word easily in multiple contexts. That's also one plus for treating all conjugations of a word as one word.
Edited: 2011-06-27, 3:13 am
#92
vosmiura Wrote:Picking the set of ~100 cards with the lowest 'vocabRank', I'm also finding that it tends to capture a lot of cards containing the same word. For example, in one set I must have seen a dozen uses of 裁く (to judge; since Kira is judging criminals), conjugated in all sorts of ways: 裁き、裁いた、裁いて、裁かれた. So it helps to learn a word easily in multiple contexts. That's also one plus for treating all conjugations of a word as one word.
Do you find that this means you're spending a lot of time in the Anki browser? From what I understand, you suspend all the new cards, sort on i+N, then on vocabRank, and un-suspend the top 100 cards?

I guess I would prefer to do it this way too, but I hate having to go into the Anki browser all the time. I do my study on an iPod Touch (no Anki deck browser there yet), so there are extra steps (sync the iPod, sync the desktop). Anyway, I think I'll give your way a try. Sounds good; just a little annoying with the Anki browser.

It would be nice to combine the two fields (iPlusN and VocabRank) into one, say 'MorphSort' (whatever), that would sort on iPlusN first, then VocabRank. It could be equal to:
Code:
iPlusN + (1 - (1 / VocabRank))
Either that, or Anki should be able to schedule via sorting two fields...
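A quick sanity check of that formula: as long as VocabRank >= 1, the (1 - 1/VocabRank) term stays in [0, 1), so sorting on the single combined number is equivalent to sorting on iPlusN first, then VocabRank. A small sketch with made-up values:

```python
def morph_sort(i_plus_n, vocab_rank):
    # Assumes vocab_rank >= 1; then 0 <= 1 - 1/vocab_rank < 1, so the
    # fractional part never bumps a card into the next iPlusN band.
    return i_plus_n + (1 - 1 / vocab_rank)

cards = [(1, 4.0), (1, 2.0), (2, 1.5)]  # (iPlusN, vocabRank), made-up values
scores = sorted(cards, key=lambda c: morph_sort(*c))
print(scores)  # [(1, 2.0), (1, 4.0), (2, 1.5)]
```

Note the iPlusN=2 card sorts last even though its vocabRank is the smallest, which is the intended iPlusN-first behavior.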
Edited: 2011-06-27, 9:30 am
#93
I haven't used vocabRank yet; still looking into it. I'm not sure I like the idea of it clustering similar words together: with KO2001 I found a similar structure contributed to interference, and there are studies confirming that similar items, especially semantically similar ones, interfere when you learn them grouped together. Thematic groupings, on the other hand, facilitate one another. (e.g. http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf)

I really like where vocabRank seems to be going, looking at readings and the morphemic constituents of compounds, even factoring in positioning and such. It clicks with what I've read of studies in kanji/compound/radical/word/sentence processing and formation. (Such as http://www.valdes.titech.ac.jp/~terry/2kcw.html and http://www.lang.nagoya-u.ac.jp/~ktamaoka/gyoseki_en.htm)

You know, with the field focus, you can always play with what field it uses to establish vocabRank. Such as the unknowns field? Perhaps find a way to just do it for the iPlusN cards?

Perhaps you could create a duplicate version of the plugin with a renamed menu item (MorphSort?) specifically for cards tagged iPlus1 (which you'd find and tag by searching for iPlusN:1 or whathaveyou with the main plugin). When you tell that version to set vocabRank, it would use the iPlus1 cards' unknowns field, so sorting by it would automatically use that single unknowns field. That might reduce interference also.

On the other hand, it could be interesting, if we get that functionality I mentioned before to break the unknowns field up into single new cards, to use their sentence's vocabRank information. That would presumably store their thematic relatedness (since they'd often appear in sentences that establish such protocontexts), so future sorting might keep them grouped together in some way: 'the unknowns of (Sentence X with VocabRank Y)'. Perhaps new cards made from unknowns could have a context field holding the original Expression, plus that Expression's vocabRank field...

Edit 2: Ah, I see we could do that already, by using a different template of the same fact for iPlusN:1 cards. Just now beginning to see what overture was getting at elsewhere.

I've stuck with the original MorphMan (that's what I'm calling the whole plugin), because I think I prefer focusing on individual conjugations rather than dictionary forms (not sure how MeCab deals with lexemes and such, now that I think of it), until there's some way to grade them more incrementally and integrate this function into the original. (Plus I didn't feel like regenerating databases or creating and working with sibling DBs. ;p)

Another interesting thing would be to store maturity and factor that in, so that you'd get a lower score for Expressions containing words you've encountered that are still young or youngish-mature, relatively close to some 21-day baseline (previously I figured youngish-mature could be 22-120 days and very mature 4+ months). But that ties back into having gradients of machine awareness of learner knowledge, and a reminder plugin, which I'm still desperate for! ;p
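Those maturity tiers could be sketched as a simple weighting function (the thresholds are just the guesses above, and the weights are arbitrary placeholders, not anything Anki or MorphMan actually computes):

```python
def maturity_weight(interval_days):
    """Lower weight = count the word as less reliably 'known'."""
    if interval_days <= 21:
        return 0.5   # young: at or under the ~21-day baseline
    elif interval_days <= 120:
        return 0.8   # "youngish mature" (22-120 days)
    else:
        return 1.0   # "very mature" (4+ months)

# A sentence score could then discount still-fragile known words:
print(maturity_weight(10), maturity_weight(60), maturity_weight(200))
```

The point is only that a score summing "knownness" over a sentence's morphemes could multiply each known morpheme by such a weight, instead of treating all known words as equally known.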

Edit: I see I was using overture's vocabRank as the basis and trying to factor in unknown stuff, without realizing what vosmiura had done with vocabRank. Now I have to reverse my thinking and re-approach? Hmm. Well, I do like the idea of focusing on unknown information, so perhaps the vocabRank function could simply filter to the unknown words and then do its thing, comparing the constituent elements of the unknown morphemes against your known.db? Otherwise vocabRank is more of a field-specific readability measure than a word-profiler sort of thing.
Edited: 2011-06-27, 11:49 am
#94
@Boy.pockets

Yes, I have to use the browser each time. I guess I haven't been doing it often enough to be bothered by it. Here's what I do:

1) Calculate vocabRank for all cards.
2) Show suspended cards only and sort by vocabRank.
3) Select 100 and add a tag (I use "live"), generate glosses, un-suspend those cards.
4) Review. I suspend anything that I don't want to keep.
5) Back in the browser, export morphemes for all "live" cards.
6) Delete all suspended "live" cards.
7) Repeat.

Regarding a combined vocabRank + iPlusN, with the way it is on my modified plugin there's no real need to do that, just use vocabRank only and sort from smallest to largest.

@nest0r, I was thinking about factoring in maturity as well, but it would require some rethinking of the morpheme database: instead of just storing the morphemes, we'd tag them with all the cards in all the decks they originate from. Plus it's hard to say how much of an improvement it would make; whenever we're summing or averaging stuff over sentences, the sentence's length itself might end up being the dominant factor.
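A tiny illustration of that length concern (the per-morpheme difficulty scores here are made up): with summed scores, a long, easy sentence can look about as "hard" as a short, genuinely hard one, while a per-morpheme average tells them apart.

```python
long_easy = [1, 1, 1, 1, 1, 1, 1, 1]  # eight easy morphemes
short_hard = [5, 5]                   # two hard ones

# Summed: 8 vs 10 -- nearly indistinguishable, length dominates.
print(sum(long_easy), sum(short_hard))

# Averaged: 1.0 vs 5.0 -- the length bias disappears.
print(sum(long_easy) / len(long_easy),
      sum(short_hard) / len(short_hard))
```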
Edited: 2011-06-27, 12:29 pm
#95
vosmiura Wrote:I'm curious, are others making use of the plugin and if so how?

For what it's worth, I've settled into using only my modified 'vocabRank' for sorting new cards. At a high level it has about the same order as 'iPlusN', but it has finer steps.

For example in the Death Note deck, I have around ~3000 new cards left to go through, and ~1300 of those are iPlusN=1. That's quite a lot to choose from. With 'vocabRank' I can narrow down further to ~100 'easiest' cards out of those ~1300.

Picking the set of ~100 cards with the lowest 'vocabRank', I'm also finding that it tends to capture a lot of cards containing the same word. For example, in one set I must have seen a dozen uses of 裁く (to judge; since Kira is judging criminals), conjugated in all sorts of ways: 裁き、裁いた、裁いて、裁かれた. So it helps to learn a word easily in multiple contexts. That's also one plus for treating all conjugations of a word as one word.
I'm still building my list of known vocab for the most part. I tag everything that has an iPlusN value of 1, then sort that tag by unknowns. Then I just search for words I already know or have seen somewhere before, adding those to my known DB. Then I re-run 'set iPlusN' and un-suspend the 0 values. I add about 100 new facts at a time in my sentence deck, which is mostly from Clannad.

I'm still using the original plugin; I don't really mind it seeing different conjugations of the same word as different vocab words. I'll get to them eventually. I don't even touch the vocab rank. I'm sure I'll change what I do when I run out of words I know, but for now this works and is quite fun.
#96
vosmiura Wrote:@nest0r, I was thinking about factoring in the maturity as well, but it would require some rethinking on the morpheme database -- like instead of just storing the morphemes, tagging all cards in all decks where they originate.
I originally considered this, but the main benefits were Anki-specific, and I wanted to keep the DBs relatively small to facilitate sharing. I originally started on the plugin to make a website where people could upload DBs corresponding to various media, and let people cross-reference their known.db against them to judge what native material they have a chance of comfortably reading.

That said, perhaps I should do an experiment to at least see how large they are after compression.
#97
I think it would be rather awesome and quite useful for any application that wanted to make use of that standard measurement of SRS knowledge, such as the stuff I rambled about here: http://forum.koohii.com/showthread.php?p...#pid141450 ^_^ (See Edit 4 of that comment for maturity relevance.)

Plus once MorphMan becomes sensitive to Anki maturity levels... ;p
Edited: 2011-06-27, 6:42 pm
#98
vosmiura Wrote:@Boy.pockets
Regarding a combined vocabRank + iPlusN, with the way it is on my modified plugin there's no real need to do that, just use vocabRank only and sort from smallest to largest.
Ah - I see. Your vocab rank is different from my vocab rank.

I tried to use vocabRank (standard MorphMan plugin; let's call this 'knownVocabRank') along with iPlusN, and I found that the cards I was getting were really long. Too long. I guess that's why in your mod you changed vocabRank around (call it 'unknownVocabRank') so it measures what you don't know instead of what you do. This is probably better for what I want to do.
#99
I just figured out how to use the GUI database manager for MorphMan. Hmmph. You could've explained that stuff better. You and your equations and fancy mathematics words like cardinite bipartisan symptom matching. Maximite carbonate bipolar switching. Why can't you write sensibly and clearly like me?? How do you go from that to a goofy thread title like Mighty Morphin Morphology, anyway?? ;p

Edit: Also, I must thank you and congratulate you again. I already thought this stuff was a brilliantly simple and revolutionary implementation with tonnes of possibilities, but the database GUI presents an even greater range of integrations and innovations.
Edited: 2011-06-27, 10:59 pm
nest0r Wrote:Why can't you write sensibly and clearly like me??
I don't even...

nest0r Wrote:How do you go from that to a goofy thread title like Mighty Morphin Morphology, anyway?? ;p
Evil plot.

The clever title is to sucker people in. Then they'll click the link to the Wikipedia article on the math topics and get stuck in an infinite wiki walk.