
Mighty Morphin Morphology

I think this issue refers to https://github.com/kaegi/MorphMan/issues/11 which has been fixed with MorphMan 4 (specifically with this commit). The config itself looks good (except that "Japanese Genki - Sentences" is duplicated, so the second one is ignored). I didn't want to write "use the dev version" again, so I released a new stable version.

To make referring to "the new MorphMan" easier and more tangible, it is now "MorphMan 4".
Reply
(2017-02-04, 8:41 am)kaegi Wrote: I think this issue refers to https://github.com/kaegi/MorphMan/issues/11 which has been fixed with MorphMan 4 (specifically with this commit). The config itself looks good (except that "Japanese Genki - Sentences" is duplicated, so the second one is ignored). I didn't want to write "use the dev version" again, so I released a new stable version.

To make referring to "the new MorphMan" easier and more tangible, it is now "MorphMan 4".

Ok, it seems to be working fine now.

How does MorphMan handle a setup where you have both of the following:
-An RTK deck with the CJK characters Morphemizer set
-A vocab deck with the Japanese Morphemizer

And you want to sort new vocab and sentences: http://i.imgur.com/fvKTQBf.png

Basically, I learned a lot of vocab from Visual Novels before I started learning kanji through RTK, but now I've been using RTK to fill in my gaps and properly learn the kanji meanings. Does MorphMan assume that I recognize the individual kanji in 復讐 (a card I know by kanji+furigana, and which is mature) just as well as the RTK kanji 沼 that I have as mature? I don't really get how the CJK character morphemizer is factored into the sorting algorithm.

I have two main things I'm trying to improve at with Japanese now:
-Intentionally learn vocab with the kanji I've learned in RTK, to improve my reading with those kanji. There are some kanji I know in RTK for which I know zero vocab words, so I wanted to use Core10k + KWAT for this.
-Finding sentences with morphemes I'm familiar with and increasing my vocab (i+1) - I want to improve reading/listening, learn new words, etc., and MorphMan is incredible for this <33

I learned a lot of vocabulary from Visual Novels without really learning the kanji properly, so I wanted to fill in some gaps using my Core10k knowledge and KWAT-generated words. I'm afraid that MorphMan will give me a bunch of Core10k words that share morphemes I've encountered in VNs but understand only weakly, and maybe won't prioritize the kanji I know and want to learn more words for.

At a glance, though, my Core10k got re-sorted so that it has a ton of kanji parts that I understand well towards the top, so I think I should be good with easy word expansion. For a kanji like 沼 though that I don't see often, I wonder how I'll improve at learning readings for that kanji in context.
Reply
My own experience with using the CJK morphemizer on my kanji decks was that MM assumed that I wanted to learn more words spelled with kanji I don't know, and since I do know meanings and readings for the whole joyo list, it was skewing toward giving me words from Core10K spelled with rare kanji (instead of, say, words I didn't know spelled with all kana). I didn't really want that, so I removed the CJK/Kanji deck filter and started over.

Since you have both the individual kanji and words spelled with kanji flagged as "mature," I think it's going to be hard to get MM to do what you want. You might have more luck with modifying your kanji+furigana deck to only show you words in kanji; that way you can fail yourself whenever you fail to remember a reading without help. I don't know if that will change how MM sees the kanji, but it will give you a chance to relearn it.

Would be curious to know if there are other ways to teach MM to "unlearn" what it thinks you know based on card maturity.
Reply
(2017-02-04, 11:14 am)tanaquil Wrote: My own experience with using the CJK morphemizer on my kanji decks was that MM assumed that I wanted to learn more words spelled with kanji I don't know, and since I do know meanings and readings for the whole joyo list, it was skewing toward giving me words from Core10K spelled with rare kanji (instead of, say, words I didn't know spelled with all kana). I didn't really want that, so I removed the CJK/Kanji deck filter and started over.

Since you have both the individual kanji and words spelled with kanji flagged as "mature," I think it's going to be hard to get MM to do what you want. You might have more luck with modifying your kanji+furigana deck to only show you words in kanji; that way you can fail yourself whenever you fail to remember a reading without help. I don't know if that will change how MM sees the kanji, but it will give you a chance to relearn it.

Would be curious to know if there are other ways to teach MM to "unlearn" what it thinks you know based on card maturity.

Hmm, I wonder how I can get it to undo the CJK characters I just added from RTK. It upped my K count by like 1100+ from the kanji I know in RTK. (halfway let's goooo).
Hoping this doesn't mess things up. I should probably learn how MorphMan's algorithms actually work, but I'm just trying to dive into Japanese and make those gains right now.
Reply
(2017-02-04, 12:34 pm)vladz0r Wrote: Hmm, I wonder how I can get it to undo the CJK characters I just added from RTK. It upped my K count by like 1100+ from the kanji I know in RTK. (halfway let's goooo).
Hoping this doesn't mess things up. I should probably learn how MorphMan's algorithms actually work, but I'm just trying to dive into Japanese and make those gains right now.

In order to roll things back, I just deleted my original copy of the MorphMan add-on folders, recreated my filters (minus the kanji one), and recalculated. Not sure if you also need to search for and remove tags from cards that MM modified.

Then again, you could probably just ignore it (more or less) and concentrate on reviewing unless and until you notice that MM is producing behavior you really don't want. It is definitely all too easy to get sucked into tweaking addons instead of doing reviews.
Reply
Okay, I'm going to try to explain what is supposed to happen. I think I have a good grasp on the morpheme and user interface code, but I didn't mess much with the database code, so take everything I say now with a grain of salt.

There is only one database for all morphemes, regardless of the morphemizer - this is a legacy from the pre-morphemizer version. Morphemes are compared by their base form (歩いて -> 歩く), so that you won't learn the same vocab over and over again. On the other hand, it handles complex words as different words, so 出歩く and 歩く are two different morphemes for the Japanese morphemizer because they have different base forms. Substantives are generally unchanged (日 -> 日). So if you learned 日 in your RTK deck as a CJK character, then it might skip that word with the Japanese morphemizer. But it won't ever skip over 日本 -> 日本 if you use the Japanese morphemizer, because 日 and 日本 are different words (just as 本 and 木 are).
If you use the CJK morphemizer on 日本 -> 日, 本, then only 本 is marked as unknown.
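A toy sketch of that base-form matching (the dictionary and function names here are mine for illustration, not MorphMan's actual code):

```python
# Hypothetical illustration: inflected forms collapse to one base
# form, but compounds like 出歩く keep their own base form.
base_form = {
    u'歩いて': u'歩く',    # te-form -> dictionary form
    u'歩きます': u'歩く',  # polite form -> dictionary form
    u'出歩く': u'出歩く',  # a compound is its own base form
}

def is_known(word, known_bases):
    # a word counts as known if its base form is in the database
    return base_form.get(word, word) in known_bases

known = {u'歩く'}
print(is_known(u'歩いて', known))  # True: same base form
print(is_known(u'出歩く', known))  # False: different morpheme
```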

In reality, I think, words are also compared by category. Every character extracted with the CJK morphemizer has the category CJK_CHAR. So 日 is saved as (日, CJK_CHAR) by the CJK morphemizer, while mecab/the Japanese morphemizer produces something like (日, 名詞). So they get treated as separate words.
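If that's right, the separate-category behaviour could be pictured like this (a stand-in sketch; the tuple fields are illustrative, not MorphMan's actual schema):

```python
from collections import namedtuple

# Stand-in for a morpheme database entry: (base form, POS category)
Morph = namedtuple('Morph', ['base', 'pos'])

# 日 learned through the CJK morphemizer
known = {Morph(u'日', u'CJK_CHAR')}

# 日 as mecab / the Japanese morphemizer would emit it
candidate = Morph(u'日', u'名詞')

# same character, different category -> treated as a separate word
print(candidate in known)  # False
```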

Bottom line: MorphMan might or might not skip single-character words (I don't know for sure), but it will never change the order of two-or-more-character "Japanese morphemizer" morphemes based on the knowledge in the "CJK morphemizer" database. Again: I'm not 100% sure, but I hope this explanation helps nonetheless.
Reply
Ah, I see. I deleted and regenerated the database to get rid of the kanji, now.

As a random question, do you know if there's an addon out there to get the kanji breakdown for words?
ex. 徹夜 could generate:
徹 -penetrate, clear, pierce, strike home, sit up (all night)
夜 - night, evening

*edit* - Found it: https://ankiweb.net/shared/info/1041996709 Thank you based ja-dark.
Edited: 2017-02-04, 2:30 pm
Reply
Hey kaegi, will we see Chinese anytime soon? I know some people got Chinese morphemizers working with it before. It would be great to have built-in support, though. Are you familiar with people's attempts to do this? I was thinking of trying it out myself, but having it built in would be nice given the swish GUI now, since all the efforts I know of predate your creation of a nice GUI.

Have you considered providing some documentation for others to get their own Morphemizers integrated with everything?
Reply
I'm going to be busy the next few weeks and will try to refrain from too many (interesting but distracting) programming projects. Implementing morphemizers is very easy (the MorphMan part is about 15 lines for each morphemizer). You can use the
That's all you have to do, the rest is handled by other parts of the plugin. Because I don't have the time to debug morphemizers that I don't use, I will leave extra morphemizers to the community.

If you decide to implement a new morphemizer, I'd be grateful for documentation of that process in the form of a wiki entry, so new users/developers can roll out their own morphemizers right away.
Edited: 2017-02-08, 7:25 am
Reply
(2017-02-08, 7:24 am)kaegi Wrote: I'm going to be busy the next few weeks and will try to refrain from too many (interesting but distracting) programming projects. Implementing morphemizers is very easy (the MorphMan part is about 15 lines for each morphemizer). You can use the
That's all you have to do, the rest is handled by other parts of the plugin. Because I don't have the time to debug morphemizers that I don't use, I will leave extra morphemizers to the community.

If you decide to implement a new morphemizer, I'd be grateful for documentation of that process in the form of a wiki entry, so new users/developers can roll out their own morphemizers right away.

Awesome, thanks. That should be plenty to get me started. If I manage to find the time to add a morphemizer, I will definitely take the time to document the process, since I'd like to see MM become all it can be, and the ability to integrate morphemizers smoothly would go a long way toward that.
Reply
With the new version, how do I get it to recognize cards that I put the "alreadyKnown" tag on? It's not counting them, even though it is correctly updating the note type that contains them. Here's what most of my alreadyKnown cards have as tags now: "alreadyKnown comprehension mm_comprehension"

Edit: Nevermind, I added the tag "mm_alreadyKnown" to all of the cards and it works now. Why did the tags get changed? And they're not defined in the config file anymore. Sad
Edited: 2017-02-09, 6:22 am
Reply
MorphMan for Spanish and other languages would be hype. Maybe I could even code it myself someday. Someone would have to break down a verb like hablar in Spanish (meaning "to speak") into "habl" as the morpheme. For masculine/feminine forms, dropping the last letter could yield a usable stem in most cases. It could be fairly easy to do if there were a database of word stems, and with a language as easy (compared to Japanese) as Spanish, the gains with sub2srs and Spanish TV shows would be awesome.
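As a rough sketch of that stemming idea (a toy, nowhere near a real Spanish morphemizer; real stemmers such as Snowball handle far more cases):

```python
# Naive Spanish stemmer: strip common endings so that different
# inflections of a verb share one stem, e.g. hablar/hablamos -> habl.
def naive_stem(word):
    # longer suffixes first so 'amos' wins over 'o'
    for suffix in (u'amos', u'ar', u'er', u'ir', u'as', u'es', u'a', u'o'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem(u'hablar'))    # habl
print(naive_stem(u'hablamos'))  # habl
print(naive_stem(u'hablas'))    # habl
```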

@kaegi
So, the [Extra Fields](http://i.imgur.com/XYDRMry.png) are basically so we can see the values MorphMan is using in Anki, but MorphMan's sorting won't change regardless of whether our notes have all these extra fields or not, right? We just need the MorphMan_FocusMorph field here and it sorts the same as if it had the additional 6 fields?

I want to reduce the number of fields in my Anki cards, since my collection has been loading kind of slowly.

It seems to work this way, from what I've checked.
Edited: 2017-02-09, 10:49 am
Reply
(2017-02-09, 6:18 am)rainmaninjapan Wrote: With the new version, how do I get it to recognize cards that I put the "alreadyKnown" tag on? It's not counting them, even though it is correctly updating the notetype that contains them. Here's what most of my alreadyKnown's have as tags now: "alreadyKnown comprehension mm_comprehension"

Edit: Nevermind, I added the tag "mm_alreadyKnown" to all of the cards and it works now. Why did the tags get changed? And they're not defined in the config file anymore. Sad

This configuration moved into the GUI under MorphMan Preferences; the settings are now stored in Anki's internal synchronized settings store.

Quote:MorphMan for Spanish would be hype. I'd be able to make some serious gains if someone worked on it. Maybe I could even code it myself someday.

You can use the general "Languages with spaces" morphemizer. The only suboptimal thing about it is that it can't generate the base forms of words, so every inflection of a word counts as a new word.
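A whitespace morphemizer of that kind boils down to something like this sketch (simplified; not MorphMan's actual code):

```python
import re

def space_morphemes(expression):
    # split on anything that isn't a letter and lowercase the tokens;
    # no base forms, so 'hablo' and 'hablas' stay separate morphemes
    return [t.lower() for t in re.findall(r"[^\W\d_]+", expression, re.UNICODE)]

print(space_morphemes(u"Hablo, y tu hablas."))
# ['hablo', 'y', 'tu', 'hablas']
```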

Quote:So, the [Extra Fields](http://i.imgur.com/XYDRMry.png) are basically so we can see the values MorphMan is using in Anki, but MorphMan's sorting won't change regardless of whether our notes have all these extra fields or not, right? We just need the MorphMan_FocusMorph field here and it sorts the same as if it had the additional 6 fields?

I want to reduce the number of fields in my Anki cards, since my collection has been loading kind of slowly.

Yes, all fields except "MorphMan_FocusMorph" are fully optional. "MorphMan_FocusMorph" is now only used for skipping morphemes that were already seen that day, and for the "L" shortcut on new cards.
Reply
Hey kaegi. I've almost got Jieba working at a basic level, but I've run into a problem that's a tad perplexing. I'm sure it's just because I'm punching above my weight, but it seems that all the support past the readme is in Chinese, including issue tracking and the like, so I was hoping it might be something I'm doing wrong on the MM side of things and you might know what it is.

So I've added Jieba as a Morphemizer like so

Code:
class JiebaMorphemizer(Morphemizer):
    import deps.jieba as jieba

    def getMorphemes(self, e, ws=None, bs=None): # Str -> PosWhiteList? -> PosBlackList? -> IO [Morpheme]
        # keep only CJK characters and alphanumerics (drops punctuation)
        e = u''.join(re.findall(ur'[\u4e00-\u9fffa-zA-Z0-9]+', e))
        # segment with jieba; cut_all=True returns all possible words
        return [Morpheme(m, u'N/A', u'N/A', u'N/A', u'N/A') for m in self.jieba.cut(e, cut_all=True)]

    def getDescription(self):
        return 'Chinese (Jieba)'

And added it into the Morphemizer list, and it's filling in my Focus Morph fields of the appropriate models just fine.

The only problem is, it is, without fail, only filling the Focus Morph field with single characters, not whole words.

I have tested the segmenter outside MM in the command line, and used it with sentences from my deck, and it breaks them up into words/compounds correctly, spitting out a list of strings with the expected values.

However no matter what, I can't seem to get MM to fill the FM field in with the appropriate full string, which is what makes me think I'm misunderstanding something on MM's end.

Can you spot any obvious errors here?

Jieba.cut() documentation here: https://github.com/fxsjy/jieba#1-cut
Edited: 2017-02-16, 11:15 am
Reply
The only part that looks wrong is the second and last parameters of `Morpheme`. These are the base form and the reading of the morpheme (pretty important). If you don't have any special information, you should set them to the inflected form. See the other morphemizers for reference.

You should also test the "join-line": in a quick python2 REPL, this line returned an empty string. You can use the code from CjkCharMorphemizer to filter for Chinese characters.
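For reference, filtering with the basic CJK range looks like this in a standalone check (outside MorphMan, using only the main CJK unified ideograph block):

```python
import re

# keep only characters in the main CJK unified ideograph block
text = u'你好, world! 123 意思'
cjk_only = u''.join(re.findall(u'[\u4e00-\u9fff]', text))
print(cjk_only)  # 你好意思
```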

Some minor things: getMorphemes will always receive only the expression string, so the second and third parameters are always None. The other morphemizers use UNKNOWN instead of N/A (no real implications - just consistency).
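To illustrate the base-form/reading point with a stand-in tuple (the real Morpheme class lives inside MorphMan; the field order here, with the base form second and the reading last, is assumed from the advice above):

```python
from collections import namedtuple

# Hypothetical stand-in; NOT MorphMan's actual class definition.
Morpheme = namedtuple('Morpheme', ['inflected', 'base', 'pos', 'subPos', 'read'])

word = u'意思'

# Problematic: placeholder base/reading means entries never match up.
bad = Morpheme(word, u'N/A', u'UNKNOWN', u'UNKNOWN', u'N/A')

# Better: with no richer data, fall back to the inflected form itself.
good = Morpheme(word, word, u'UNKNOWN', u'UNKNOWN', word)

print(good.base == good.inflected)  # True
```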
Reply
Thanks kaegi, that put me on the right track. Switched to using 'characters' because that made more sense.

No matter what, I seem to get an empty array, even when I simply follow the zhon documentation's most basic examples in the python command line.

I'm wondering if re.findall() could somehow be the culprit, but it seems to be working for everything else, so now I'm really perplexed at this point. Gonna keep testing.

EDIT: It seems that zhon is not giving me back unicode characters, will try and figure out what encoding it's using.
CORRECTION: It's not unicode by the time it gets to re.findall(). Hmmmmm.

UPDATE: Okay, so close I can taste it. The encoding seems to be a python issue??? It's encoding the string as 'cp932'. I've managed to get it converting the string to unicode, but some of the characters (not all, but some) are lost. From what I can tell, the ones lost are those which are not the same in Simplified as they are in Traditional; I've done some comparisons and all the preserved ones are the same. A Traditional string also doesn't lose characters, which reinforces this theory. However, 'cp932' seems to be a Japanese encoding, an extension of Shift JIS.

And yes, it seems characters which are not found in Japanese are being lost. It's not clear to me if that's the only cause, but it would make sense given that many Japanese kanji have the same form as Traditional, which might explain why Traditional characters were being preserved.

And that's as far as I've gotten right now.

Even just this:

Code:
>>> print(u'你什么意思啊?')
?什?意思??

It's still treating it like it's gone through cp932. And it's not a rendering issue, as checking the values directly yields question marks again.
Edited: 2017-02-17, 1:17 am
Reply
This might be an issue with your terminal, which isn't capable of displaying unicode output. On Linux (gnome-terminal) it works:

Code:
>>> print(u'你什么意思啊?')
你什么意思啊?

Doing a re.findall with zhon also seems to work:

Code:
>>> re.findall('[%s]' % characters, u'你什么意思啊?')
[u'\u4f60', u'\u4ec0', u'\u4e48', u'\u610f', u'\u601d', u'\u554a']

>>> print(''.join(re.findall('[%s]' % characters, u'你什么意思啊?')))
你什么意思啊

I'm pretty sure Anki internally uses utf8, so this won't be a problem. You will have to read from/write to a utf8 file to test your code in the REPL.
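One way to do that file round trip when the console mangles output (a quick standalone check):

```python
import io
import os
import tempfile

# sample segmenter output to verify
tokens = [u'你', u'什么', u'意思']

# write as UTF-8, bypassing the cp932 console entirely
path = os.path.join(tempfile.gettempdir(), 'morphs.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(tokens))

# read it back to confirm nothing was lost in the round trip
with io.open(path, 'r', encoding='utf-8') as f:
    assert f.read().split(u'\n') == tokens
```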
Reply
Cheers, I'll try fiddling around with it a little more but I'm starting to just get frustrated at this point.

Here's my code as it stands. I switched to posseg.cut() just because it gives POS tagging. It hadn't worked for me before, but that was because its import of jieba was using the wrong relative path.

Anyways, here's what I have now. Switching to posseg.cut() does not seem to have resolved the issue with only single characters being added, and yeah, I'm at my wit's end on what is causing that. Running commands in the cmdline has problems with unicode, but everything else seems to be working fine.

Code:
class JiebaMorphemizer(Morphemizer):
    def getMorphemesFromExpr(self, e): # Str -> [Morpheme]
        import deps.jieba as jieba
        import deps.jieba.posseg as posseg
        from deps.zhon.hanzi import characters
        e = u''.join(re.findall('[%s]' % characters, e)) # keep only Chinese characters (zhon's CJK ranges)
        return [ Morpheme( m.word, m.word, m.tag, u'UNKNOWN', m.word) for m in posseg.cut(e) ] # find morphemes using jieba's POS segmenter
        
    def getDescription(self):
        return 'Chinese (Jieba)'

I tried writing out to a file, but maybe I don't understand python's file operations properly, as I keep getting a blank file (except once, when a garbled file took its place).
Reply
At first glance, I can't spot anything wrong with it.

I can try to play with this code, but I'd need some Chinese sentences and expected output (it's just kanji gibberish for me).
Reply
(2017-02-18, 11:11 am)kaegi Wrote: At first glance, I can't spot anything wrong with it.

I can try to play with this code, but I'd need some Chinese sentences and expected output (it's just kanji gibberish for me).

Cheers, that would be great as I'm just banging my head now. It's probably something really simple, but I'm not receiving any hints from my attempts to check what's going on inside.

One of the difficulties is that many Chinese sentences are made up mostly of one-character words, but not entirely. So it's hard to give you a corpus with enough Chinese that you should DEFINITELY start getting two-character-plus morphs.

I could export my Chinese deck sans media if you like.

https://www.dropbox.com/s/iyf61ya3z9whvj....apkg?dl=0

Otherwise let me know how you'd like the sentences to play with.

Edit: You can also use jieba.cut(, cut_all=True) to get a list of all morphs, so that you get compound words plus the words that make up those compounds, which might actually be preferable.

EDIT: Here is a string that should definitely return multi-character morphs, as I tested it with posseg, so maybe try variations on this? I dunno.

Quote:你喜欢苹果还是橘子?
Edited: 2017-02-21, 8:38 am
Reply
There are a few peculiar things/interactions with "Learn Now."
If I have a JP->Eng Card 1 and an Eng->JP Card 2 and hit "Learn Now" on Card 1, it'll also try to show me Card 2. If I hit "Learn Now" on Card 2 for a word, it also tries to show me Card 1 again for that word, messing with the scheduling. "Learn Now" returning multiple cards for a single note can also cause Anki errors, since Anki normally buries Card 2 for a word if you saw Card 1 for that word today.
Reply
I dunno. I've reached the end of my tether. I'm not going to try to get jieba working with MM for a couple of weeks, until I can come at this problem with fresh eyes. I've just been winding myself up more and more every day. Haha. Time better spent studying and letting my brain process the problem in the background.
Reply
Hello,

I want to learn about using morphman.

I'll just ask: is it fine if I include kanji decks to be analyzed by MorphMan?

Also, does MorphMan require prior knowledge to be utilized conveniently, like most/all kanji already being known/mature?

Thanks
Reply
(2017-03-06, 5:50 am)kevsestrella Wrote: I'll just ask: is it fine if I include kanji decks to be analyzed by MorphMan?

You can use any cards you want with MorphMan, meaning kanji and/or sentences. MorphMan has a "CJK Morphemizer", which means that every kanji is handled as if it were a single "word" by MorphMan. This means sentences get sorted by the number of unknown kanji (as opposed to words). You can configure it with "Tools -> MorphMan Preferences".

Kanji database entries (probably) do not overlap with your vocabulary database entries. On this thread page, I already wrote about that.

(2017-03-06, 5:50 am)kevsestrella Wrote: Also, does MorphMan require prior knowledge to be utilized conveniently, like most/all kanji already being known/mature?

MorphMan is suitable for all levels of proficiency. I did learn all jouyou kanji with Heisig beforehand, but that was only because I didn't know of MorphMan at that time. As long as you can recognize a word, you are good to go. Kanji knowledge helps with differentiating between similar looking characters, but it isn't required.
Reply
(2017-03-06, 6:22 am)kaegi Wrote: You can use any cards you want with MorphMan, meaning kanji and/or sentences. MorphMan has a "CJK Morphemizer", which means that every kanji is handled as if it were a single "word" by MorphMan. This means sentences get sorted by the number of unknown kanji (as opposed to words). You can configure it with "Tools -> MorphMan Preferences".

Kanji database entries (probably) do not overlap with your vocabulary database entries. On this thread page, I already wrote about that.

MorphMan is suitable for all levels of proficiency. I did learn all jouyou kanji with Heisig beforehand, but that was only because I didn't know of MorphMan at that time. As long as you can recognize a word, you are good to go. Kanji knowledge helps with differentiating between similar looking characters, but it isn't required.
 
Thanks for the info about the CJK Morphemizer, kaegi. I hadn't read the previous posts; I will now.
Edited: 2017-03-06, 6:43 am
Reply