Mighty Morphin Morphology

overture2112 Wrote:
lesson17lesson17 Wrote:
overture2112 Wrote: Not out of the box, but if you had a Chinese equivalent of `mecab` (the Japanese morphological analysis tool that MorphMan uses) then it could probably be adapted.
What if the Chinese sentences were pre-segmented?
You'd lose out on the part-of-speech stuff, but otherwise it could work. Try looking at morphemes.py and rewriting `getMorphemes`.
I seem to have gotten MorphMan working with Chinese (including POS tagging) by following your recommendation. Using jieba (https://github.com/fxsjy/jieba#jieba-1), I changed the `getMorphemes` function in morphemes.py as follows:

Code:
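import re
from jieba import posseg  # jieba's part-of-speech segmenter

@memoize
def getMorphemes( e, ws=None, bs=None ): # Str -> PosWhiteList? -> PosBlackList? -> IO [Morpheme]
    # Keep only CJK ideographs and ASCII alphanumerics; this drops punctuation and whitespace.
    e = u''.join(re.findall( ur'[\u4e00-\u9fffa-zA-Z0-9]+', e))
    # Each jieba result has .word and .flag (the POS tag); the remaining Morpheme fields are filled with u'N/A'.
    ms = [ Morpheme( m.word, u'N/A', m.flag, u'N/A', u'N/A') for m in posseg.cut(e) ]
    return ms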
It needs Python's `re` module and jieba's `posseg` to be imported. I have tested it with my current decks and the Chinese core deck, and it seems to be working.
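For anyone who wants to check the punctuation-stripping step without Anki or jieba installed, here is a minimal standalone sketch in Python 3. The `Morpheme` namedtuple and its field names are my own stand-in, not MorphMan's actual class, and `fake_cut` is a toy segmenter made up for the demo. The only thing taken from the code above is the regex and the order of the five constructor arguments.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for MorphMan's Morpheme class (field names are assumptions).
Morpheme = namedtuple("Morpheme", ["base", "inflected", "pos", "sub_pos", "read"])

# Same character ranges as the post: CJK Unified Ideographs plus ASCII letters and digits.
KEEP = re.compile(r"[\u4e00-\u9fffa-zA-Z0-9]+")


def strip_punctuation(text):
    """Drop every character outside the kept ranges, including spaces."""
    return "".join(KEEP.findall(text))


def get_morphemes(text, cut):
    """Build Morphemes from `cut`, any callable yielding (word, flag) pairs."""
    cleaned = strip_punctuation(text)
    return [Morpheme(word, "N/A", flag, "N/A", "N/A") for word, flag in cut(cleaned)]


def fake_cut(text):
    """Toy segmenter for demonstration only: one token per character, flag 'x'."""
    return [(ch, "x") for ch in text]


if __name__ == "__main__":
    print(strip_punctuation("我爱北京，天安门!"))
    for m in get_morphemes("你好, Anki", fake_cut):
        print(m.base, m.pos)
```

With jieba installed, `posseg.cut` should be usable in place of `fake_cut`, since recent jieba releases let you unpack each result as a (word, flag) pair. Note that because the regex also removes spaces, pre-segmented sentences would lose their word boundaries at this step, so the segmenter does all of the splitting.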
Edited: 2014-02-13, 12:28 am