
Mighty Morphin Morphology

#51
Okay, I read the comments in this thread. With regard to your interest in determining cost (re: matching), I have some quick thoughts:

Factor in lexical density using Mecab's parts-of-speech analysis? For an explanation and basic formula: http://www.unisanet.unisa.edu.au/Resourc...ensity.htm

Hmm, that keyword ‘readability’ reminds me of cb4960's jReadability tool; I can't recall whether that uses similar factors, but perhaps that's something extra to look into. (Edit: I do remember discussing length-related factors and how the Japanese formula finds this too inconsistent to rely on, or something: http://forum.koohii.com/showthread.php?p...6#pid64016 .)

Edit: It occurs to me that Japanese word classifications ought to be different. I can only find one paper that explicitly defines content words (内容語) and function words (機能語) for Japanese: “I assume that content words are nouns, adjectival nouns, adjectives and full verbs, while function words are particles and auxiliary verbs.” - via

I assume by auxiliary they mean 助動詞.

Another bit I found: “語彙には、名詞、動詞、形容詞、副詞などの「内容語」(content words) と、冠詞、代名詞、前置詞、助動詞などの「機能語」(function words)” - Edit 2: Although actually pronouns in Japanese are considered content words? Edit 3: Correction, personal pronouns are; more on closed class (function) words: http://ja.wikipedia.org/wiki/%E9%96%89%E...9%E3%82%B9 and open (content): http://ja.wikipedia.org/wiki/%E9%96%8B%E...9%E3%82%B9

Edit 4: Wow, I was so wrong about Japanese corpus linguistics, etc., not being developed. There's so much stuff out there! Found this (p. 373) regarding a systemic functional linguistics look at lexical density: http://www.wagsoft.com/Systemics/Archive...edings.pdf - I like the idea of the clause boundary annotation programme, though I think KNP's parsing takes care of this? Maybe not. They don't go into criteria for content words. I'm thinking it's so basic they don't bother? So I guess they use something like the parts of speech above, perhaps simply blacklisting the function stuff.
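
If anyone wants to play with that lexical-density idea, here's a rough sketch against MeCab's output. Purely illustrative, not from any plugin: it assumes mecab is on the PATH with the default IPA dictionary (where the first comma-separated feature field is the coarse part of speech) and uses the content-word classes from the definitions above.

Code:
# -*- coding: utf-8 -*-
# Rough lexical-density sketch (illustrative only). Adjectival nouns show up
# under 名詞 in the IPA dictionary, so that class covers them here.
import subprocess

CONTENT_POS = (u'名詞', u'動詞', u'形容詞', u'副詞')

def lexical_density(sentence):
    p = subprocess.Popen(['mecab'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = p.communicate(sentence.encode('utf-8'))
    poses = []
    for line in out.decode('utf-8').splitlines():
        if line == 'EOS' or '\t' not in line:
            continue
        poses.append(line.split('\t', 1)[1].split(',')[0])   # f[0] = coarse POS
    content = sum(1 for pos in poses if pos in CONTENT_POS)
    return float(content) / len(poses) if poses else 0.0     # ratio, not percentage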

Also, perhaps if dependency parsing is incorporated in the future, we could factor in a focus on the root/head words: http://forum.koohii.com/showthread.php?p...#pid138900 (see link after bold edit for most accessible explanation; langrid uses the notational variant shown on the second page [bottom right] of the .pdf).

Edit 5: I got carried away, but one more on Japanese content/function words: http://books.google.com/books?id=NDknf0v...&q&f=false

Edit 6: Hmm, something else to look at is bunsetsu identification via KNP/CaboCha (e.g. factoring the bunsetsu into other areas rather than a step in the dependency process; more of a subjective shift in focus, perhaps).

Edit 7: Most edits ever. One last confirmation for content words in Japanese: Japanese FrameNet confirms the definitions above.
Edited: 2011-05-27, 11:27 pm
Reply
#52
I got this plugin working the other day, in tandem with the gloss plugin. It took a while to configure, but when the sentences finally started coming out in perfect order, tears of joy started flowing! Thank you for making this!

overture2112 Wrote:
Boy.pockets Wrote:...This way you would only have to specify the source decks and then the rest would be automatic
Yes, I noticed this after creating a handful of new subs2srs decks and trying to keep them all in sync as I switch around between them (how could I possibly choose just one Kugimiya Rie show?). I started some work on automating everything...
Sweet.

overture2112 Wrote:
Boy.pockets Wrote:Another nice to have would be to be able to specify target vocabulary, which you don't know, but would like to study next....This sort of functionality would be nice for people studying for tests, or studying from a text book. E.g. for tests, say if you are studying for Kanken, you could specify the vocabulary for your target level and the plugin would take this into account...
This is basically the point of the morpheme matching feature.
1) Make a DB from the list of words you want to learn. If your goal is certain sentences, select them and export (possibly merging if they're from multiple decks). If you have just a list of words, save them as a UTF-8 text file and use MorphMan to extract a DB from that file.
2) Select some cards you want to learn these words from, then use It's Morphin Time > Match Morphemes. This will set the 'matchedMorpheme' field on some of the cards with a word from the DB.

It's mostly useful if you have short term goals like a chapter vocab list from a grammar book, all the words from a particular episode of a show you want to try watching, etc.
Before you posted this, I did something a little different, but I think it is working well (just not as intended, as I understand it): I just added the list of words (as Anki cards) to the DB of known words. This affects the i+n-ness of the cards I am studying from, reducing the value of the ones that contain words matching my study words (since they are treated as known words). I *think* this is working.
Reply
#53
overture2112, I think in your morpheme database you may be using the inflected words, so for example 伏せる and 伏せろ are counted as different morphemes.

Mecab returns the de-inflected word in f[6], so you might want to try that.

Also, I don't know about you, but I prefer to use the reading returned in f[7] rather than f[8].

Edit: Actually the reading returned is of the inflected word, which makes the entry unique in the database anyway. I'm going to remove the reading for now and see how it works out.
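
For reference, the f[...] numbers refer to MeCab's comma-separated feature fields in the default IPA-dictionary output (each line is "surface<TAB>f[0],f[1],...,f[8]", with f[6] = base form, f[7] = reading, f[8] = pronunciation). A throwaway helper to illustrate, not the plugin's actual code:

Code:
def base_and_reading(mecab_line):
    # mecab_line: one line of default mecab output, "surface\tf[0],f[1],...,f[8]"
    surface, feature_str = mecab_line.split('\t', 1)
    f = feature_str.split(',')
    base = f[6] if len(f) > 6 and f[6] != '*' else surface   # f[6] = base form (原形)
    reading = f[7] if len(f) > 7 else u''                    # f[7] = reading
    return base, reading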
Edited: 2011-05-29, 6:11 am
Reply
#54
Don't you mean f[8] and f[9]? The seventh column should be the base form (原形), right? Edit: Oh, if it starts with f[0], then I guess never mind.

Do the plugin functions not use the base form? Hmm. Well, cognitively speaking, we do tend to store frequent inflected forms as their own mental entries... (http://forum.koohii.com/showthread.php?p...#pid138550).

Still, if nothing else, it could be interesting or useful to have a user option to compare the base forms rather than inflected forms?
Edited: 2011-05-29, 8:45 am
Reply
#55
I edited the plugin locally to use the base forms and re-built my "known" database. As might be expected this gives more positive matches between each source of sentences.
Reply
#56
Do you mind explaining how you did that, vosmiura?
Reply
#57
In plugins\morph\morphemes.py look for the line with mecabCmd = ['mecab', '--node-format= and change it to this:

mecabCmd = ['mecab', '--node-format=%f[6]\t%f[0]\t%f[1]\t%f[6]\r', '--eos-format=\n', '--unk-format=%m\tUnknown\tUnknown\tUnknown\r']

This'll make it use the base form for both the word and reading parts.

I'm not sure how to get the reading of the base form in a simple way. Might have to run it through Mecab again.
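
Something like this might work for that second pass (just a sketch, untested; the helper name is made up, and it assumes one base form per input line with the default IPA dictionary):

Code:
import subprocess

def readings_for(base_forms):
    # emit only the reading field (f[7]) per node; echo unknowns as-is
    cmd = ['mecab', '--node-format=%f[7]', '--eos-format=\n', '--unk-format=%m']
    p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = p.communicate(u'\n'.join(base_forms).encode('utf-8'))
    return out.decode('utf-8').splitlines()   # one (katakana) reading per base form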
Edited: 2011-05-30, 12:54 pm
Reply
#58
So I have to install Python... [sigh] I guess it was inevitable after joining this forum. Rolleyes

Thanks

vosmiura Wrote:I'm not sure how to get the reading of the base form in a simple way. Might have to run it through Mecab again.
Does this mean that it won't correctly export morphemes, or is it just about filling in the reading part of a card?
Reply
#59
Splatted Wrote:So I have to install python.... [sigh] I guess it was inevitable after joining this forum. Rolleyes
You don't have to; Anki executes the Python code in plugins. Of course, languages like Python are plenty useful in general anyway.
Reply
#60
Anki executes it fine, but I see no way to edit the code.
Reply
#61
Splatted Wrote:Anki executes it fine, but I see no way to edit the code.
In Anki, Settings -> Plugins -> Open Plugin Folder
Assuming you're on Windows, in the resulting window, inside the morph folder there is a file named morphemes.py. Right-click -> open with and choose notepad or your desired text editor, edit and save.
Reply
#62
Great! I never thought of using Notepad. When Windows didn't know how to open it I just assumed that I needed Python to edit it. Thanks for the help.

@Overture: Now that I've had time to try it out I just want to say that this plugin is brilliant. Thanks a lot for sharing it. Big Grin
Reply
#63
balloonguy Wrote:
Splatted Wrote:Anki executes it fine, but I see no way to edit the code.
In Anki, Settings -> Plugins -> Open Plugin Folder
Assuming you're on Windows, in the resulting window, inside the morph folder there is a file named morphemes.py. Right-click -> open with and choose notepad or your desired text editor, edit and save.
Minor warning: some of the files use Unix newlines ( "\n" ) instead of Windows' standard ( "\r\n" ), and Notepad is dumb and thus doesn't auto-detect which to use. WordPad does, but a real text editor (e.g., vim or emacs) should be preferred anyway.
Reply
#64
Slight update.

I added the before & after context lines, and I've changed the way I review by having audio on the back of the card instead of the front, so I check my reading ability as well as comprehension.

As I mentioned before I've changed the plugin to use the normal form of words instead of inflected words.

[Image: 2djigz.png]

I'm working on 24 episodes at once - so over 7500 sentences to begin with, and I'm going through by i+n order.

Step 1: I'll review ~100 new cards with i+1, then recalculate the numbers, which usually opens up a few hundred i+0 cards.
Step 2: Review all i+0 cards and repeat from Step 1.

Going through the i+0 cards in step 2 reinforces the new words learned in step 1.

I suspend and then delete anything I don't find useful (nothing new to learn, unclear audio, insufficient context). So far I've been through 3000 cards, kept 300 and deleted 2700. I have JLPT L2, so depending on your level your mileage may vary.
Edited: 2011-06-03, 1:07 pm
Reply
#65
vosmiura Wrote:Slight update...So far I've been through 3000 cards, kept 300 and deleted 2700. I have JLPT L2, so depending on your level your mileage may vary.
That's a lot of purging! Any idea how many of those purges come from i+0, kana only, or 1-2 word sentences (as opposed to poor audio or something)?

I dig the grey context lines. I'm curious if you have tried using the English translation for context sentences on the front side instead?
Reply
#66
@vosmiura - So you're using MorphMan to further cull decks, beyond the subs2srs filters? Could you explain what you mean by ‘recalculate the numbers’ and ‘opening up i+0’ stuff?

Funny that you moved audio to the back (which, setting aside my video clip dictation cards for subs2srs, is what I do for general sentence cards w/ audio, where I silently subvocalize before flipping, then listen to and repeat the audio for feedback). I recently decided I will probably start doing single vocabulary cards with audio on the front (along with kanji text), since I'm using those primarily for non-reading purposes (e.g. semantics) where multisensory stuff (writing, listening, images) is more for extra anchors. Sentence cards from general, fixed reference corpora I'm leaving alone due to prosody and suchlike; those are more like secondary reinforcements where I'm putting it all together before branching out into specialized cards again.
Edited: 2011-06-03, 1:50 pm
Reply
#67
@overture

Reminder plugin aside: I'm still thinking about ways to apply this to my strategy. Right now I'm thinking that I take a new subs2srs deck, find the global unknowns, and compare those to general reference databases made from stuff like Core 6000. The problem is here:

So I'm thinking I'd select the unsuspended, reviewed cards and merge those into the known.db for such reference decks, and I'd turn the suspended cards into a separate, unmerged .db, such as core6k.db. So I'd want to compare the globally unknown subs2srs morphemes to this core6k.db of suspended cards, and unsuspend them. How might I do that?

Also, you mentioned before having vocab-related cards in the subs2srs deck, based on found unknown morphemes. Could you explain how I'd go about doing this so that I'm suspending those video clip cards but using the unknown morphemes derived from them as separate vocabulary cards, which will later become mature and (with a new plugin/tool) unsuspend the video clip cards?

Since the above is in part hypothetical, I guess I'm looking for a more refined theoretical explanation, to combat my fuzzy logic. I get the feeling you're thinking of something related to card models sharing facts, but I don't know much about that, never really messed with Anki that way.
Edited: 2011-06-03, 1:53 pm
Reply
#68
nest0r Wrote:@vosmiura - So you're using MorphMan to further cull decks, beyond the subs2srs filters?
The deck I started with originally was one of the older shared decks and didn't have anything filtered in subs2srs. I've just been culling manually as I review new cards, and I also used Anki's import "update" feature to upgrade the deck to include context lines.

Quote:Could you explain what you mean by ‘recalculate the numbers’ and ‘opening up i+0’ stuff?
The iPlusN is set based on the current "known" database.

If you study a bunch of sentences with iPlusN=1, then add those to the "known" database, then recalculate iPlusN on the deck, you'll get some new cards with iPlusN=0 that have the same morphemes as the ones you've just learned.
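
Roughly, as I understand it, the recalculation just re-counts each sentence's unknowns against the current database. Something like this (not the plugin's actual code):

Code:
def i_plus_n(sentence_morphemes, known_db):
    # the gist of iPlusN: morphemes in the sentence that aren't in the known database
    return sum(1 for m in sentence_morphemes if m not in known_db)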
Reply
#69
overture2112 Wrote:
vosmiura Wrote:Slight update...So far I've been through 3000 cards, kept 300 and deleted 2700. I have JLPT L2, so depending on your level your mileage may vary.
That's a lot of purging! Any idea how many of those purges come from i+0, kana only, or 1-2 word sentences (as opposed to poor audio or something)?
I didn't keep track of that, but not many were because of poor audio. A good few hundred would be for 1-2 word sentences... and I purged a high percentage of kana-only sentences. If I can understand it already, it goes.

My known.db doesn't have every word I've ever known, so some of the iPlusN=1 cards I review actually have nothing new for me.

As I progress by i+n I'm purging less and less... right now I think I purge about 2/3 or 3/4 of new cards. I'm keeping more of them not only because of new morphemes but because they have grammar that I want to practice too.

Quote:I dig the grey context lines. I'm curious if you have tried using the English translation for context sentences on the front side instead?
I haven't tried it with English, but I found it useful to have the Japanese sometimes.
Edited: 2011-06-04, 12:41 am
Reply
#70
I've been trying 'vocabRank', but I don't know if it's doing what I would expect.

I think the algorithm looks at each morpheme in a sentence and ranks it highly if it's in known.db exactly as is, and a bit less highly if it's partially known. Therefore, if a sentence is iPlusN=1 and has several known words, it's going to be ranked pretty easy regardless of what the +1 is. In particular, sentences that have kanji that occur a lot in known.db are ranked higher.

I guess that to determine difficulty of learning a new card, rather than counting how much I know in a sentence, I want to try to count how much I don't know.

What do you think of something like this, for example? Lower points = easier (rough sketch after the list).
- exact match in known.db = 0 points
- no match and no kanji = 10 points
- kanji characters that match position and sound = 5 points
- or kanji characters that exist in known.db = 10 points
- or kanji characters that are all new = 20 points
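
Here's a very rough sketch of how I picture that scoring (not real plugin code: morphemes are treated as plain base-form strings, known_morphemes / known_kanji stand in for data pulled from known.db, and the position-and-sound test is left as a caller-supplied callback):

Code:
def is_kanji(c):
    return u'\u4e00' <= c <= u'\u9fff'

def morpheme_cost(m, known_morphemes, known_kanji, reading_matches):
    if m in known_morphemes:                         # exact match in known.db
        return 0
    kanji = [c for c in m if is_kanji(c)]
    if not kanji:                                    # no match and no kanji
        return 10
    if all(reading_matches(c, m) for c in kanji):    # kanji match position and sound
        return 5
    if all(c in known_kanji for c in kanji):         # kanji exist in known.db
        return 10
    return 20                                        # some or all kanji are new

def sentence_cost(morphemes, known_morphemes, known_kanji, reading_matches):
    return sum(morpheme_cost(m, known_morphemes, known_kanji, reading_matches)
               for m in morphemes)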

Edit: I implemented a slightly different version of the above that also takes into account how many times a kanji appears in your known.db, and sent the code to overture2112 to check.

I've also changed my code to use the dictionary form for all verbs & adjectives, and to pass through Mecab a 2nd time to get the readings. If anyone's interested I could share the modified plugin; or, overture2112, if you're interested in merging, I can send you that code.
Edited: 2011-06-05, 2:24 am
Reply
#71
Know what would be nice? It's such a simple solution that I'm not sure why I didn't think of it before, re: card chains and call-and-response cards. Anyway, instead of ‘review from smallest and largest intervals’, etc., it would be nice to be able to start a session and sort the day's due card reviews by things like iPlusN.

On a tangent based on old ideas for conversational cards, I wonder if this could be applied to sorting facts divided into numerically marked adjacency pairs. Like have pairs marked A1, A2, B1, B2, etc., and as long as consecutive cards in, say, A, are due that day for that session, they're sorted so that they appear in order. Or something. ;p Perhaps refer back to linguistics (discourse analysis, SFG, pragmatics, etc.) models to schematize cards and add metalinguistic information to them (allowing for two types of cards, a single card containing an adjacency pair split across front and back, which is connected to other pairs in a sequence of cards, and having each card be half the pair, with front and back divided into prompt and metalinguistic info, respectively). I digress.
Edited: 2011-06-05, 12:05 am
Reply
#72
I need to bump this topic; this is hands down one of the best plugins ever written for Anki. I always found subs2srs a chore to even set up, but this ties everything together very nicely. N+1 is very possible now.

Edit: scratch my suggestion, apparently you can tell Anki to sort specific fields as numbers.
Edited: 2011-06-09, 8:40 am
Reply
#73
I figured I should give you an update on what I've been working on.

I've gotten it up to a point where I think it's quite usable. I've accounted for a lot of reading transformations and can find a solution for 82% of the words/readings in JMdict. A lot of them don't have standard readings to begin with, so the success rate is actually higher. I'm actually surprised at the large number of non-standard readings (I'm estimating more than 20k). I've even tagged the types of transformations so you can search specifically for those. As an example, say you have the kanji/reading 人 = ひと. You'll get back every word it knows with that reading. Now, the ひ in ひと can have a ゛(dakuten) or a ゜(handakuten) added to it. Those are two of the tags available, and you can search for only words with a particular tag, or as many tags as you want. So in this example, we restrict output to only words with those two tags (there's a "regular" tag for no transformations).

I've also recorded the index of the character, both from the start and from the end (so you can search backwards). An index of -1 means the last character, -2 the second last, etc. The index ignores kana and okurigana, so in a word like 刈り入れ人 = かりいれびと, the indexes would be:
刈 = 0
入 = 1
人 = 2
so you can easily search for something as the nth kanji in the word.
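
To make the indexing concrete, a toy version looks something like this (not my actual code, just the idea):

Code:
# -*- coding: utf-8 -*-
def is_kanji(c):
    return u'\u4e00' <= c <= u'\u9fff'

def kanji_indexes(word):
    # positions counted from the start and from the end, skipping kana/okurigana
    kanji = [c for c in word if is_kanji(c)]
    return [(c, i, i - len(kanji)) for i, c in enumerate(kanji)]

# kanji_indexes(u'刈り入れ人')  ->  刈: (0, -3), 入: (1, -2), 人: (2, -1)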

Now, the real problem is what do I do with all this information? I'd love some ideas.

As the next step, I'm going to try and see how I can integrate this into overture's plugin for ranking vocab. I also had the idea of colouring in readings of characters on the answer side of cards. I know it's probably against the whole "disfluent=memorize better" concept we've recently read about, but I'll do it anyway. I figured if association helps with memorizing things, it can't hurt to have another thing to associate a kanji reading to.
Edited: 2011-06-10, 4:36 am
Reply
#74
netsplitter Wrote:I figured I should give you an update on what I've been working on.
Awesome.

I had a bit of trouble understanding what it actually did until I read your readme - which might help others too. As I understand it, it finds the readings for individual kanji given a word.

netsplitter Wrote:Now, the real problem is what do I do with all this information? I'd love some ideas.
I guess that this is what you are already thinking of, but I will say it anyway. With all this information, we (as in you) can figure out the n + 0.5. If you know what part of the reading belongs to what part of the word, then you can look for new words that have that kanji and that reading. You can follow the exact same model that Overture currently uses in the MMM plugin:

* We have a list of what we know (known.db)
* Now, not only do we have words we know, but we have individual kanji readings too.
* Look for new, unknown words, but with readings that are known (as much as possible).

This is awesome.

Example

My known db has:
- 出発 「しゅっぱつ 」(word)
| - 出 「しゅっ 」 (reading)
| - 発 「ぱつ 」 (reading)
- 自分 「じぶん」 (word)
| - 自 「じ」 (reading)
| - 分 「ぶん」 (reading)

So, some n + 0.5s might be (in order of preference):
1. 出自 「しゅっじ」(a perfect "0.5" match) - (all kanji are known and all readings are known, it is just the 'word' we need to learn)*
2. 自発 「じはつ」 (almost full "0.5" match) - (know both readings, though the 「ぱつ」 has become 「はつ」)
3. 自慢 「じまん」(one character off a "0.5" match) - (there is one kanji whose reading is unknown)
4. 自費出版 「じひしゅっぱん」 (two characters off a "0.5" match) - (there are two characters whose readings are unknown)

Summary

So, with this information, we can calculate an even "better" n+1: instead of having to (possibly) learn a completely new reading for a new word, we can learn a new word with readings we already know, or at least one reading we already know.
Note:
*This reading is not correct (should be しゅつじ), but for the sake of the example, please let it slide...
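
Just to put the same idea in code terms, a toy ranking might look something like this (names made up; the per-kanji readings would come from your database, and 出自 keeps the simplified reading from the note above):

Code:
# -*- coding: utf-8 -*-
def unknown_reading_count(kanji_readings, known):
    # kanji_readings: list of (kanji, reading-in-this-word) pairs
    return sum(1 for k, r in kanji_readings if r not in known.get(k, set()))

known = {u'出': {u'しゅっ'}, u'発': {u'ぱつ'},    # from 出発
         u'自': {u'じ'},     u'分': {u'ぶん'}}    # from 自分

candidates = {u'出自': [(u'出', u'しゅっ'), (u'自', u'じ')],
              u'自発': [(u'自', u'じ'), (u'発', u'はつ')],
              u'自慢': [(u'自', u'じ'), (u'慢', u'まん')]}

# fewer unknown per-kanji readings = better n + 0.5 candidate
ranked = sorted(candidates, key=lambda w: unknown_reading_count(candidates[w], known))

A real version would presumably also treat near-matches (like ぱつ→はつ) as easier than completely unknown kanji, along the lines of the point scheme earlier in the thread.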
Reply
#75
Boy.pockets Wrote:As I understand, it finds the readings for individual kanji given a word.
Correct. It also goes through every word in JMdict and builds a database of those readings. That is, if it can successfully figure out the word. That way, we can search by readings. It's important to note that it solves the reading by using a kanji's dictionary reading. There's no guesswork involved. So, for a word like 日帰り = ひがえり, results look like this:
Code:
Solving: 日帰り == ひがえり
character[日] reading[ひ] tags[1] dic_reading[ひ] reading_id[6947]
character[帰] reading[がえり] tags[2, 8] dic_reading[かえ.る] reading_id[1537]
So for 帰 = がえり, it knows the reading came from かえ.る, and it knows it got there by adding a ゛ to the main reading and inflecting the okurigana (that's why it has two tags). So this reading doesn't link back to an entry for "がえり", but to the original "かえ.る".

Boy.pockets Wrote:I guess that this is what you are already thinking of, but I will say it anyway.
That's the basic idea. I think the real issue is deciding which words are useful for you. Especially when you know a lot of readings, I would imagine you would get a lot of results from an n + .5 (or n + 0, for that matter).

Anyway, there is lots of experimenting to do.
Edited: 2011-06-10, 8:40 am
Reply