Mighty Morphin Morphology

#51
Okay, I've read the comments in this thread. With regard to your interest in determining cost (re: matching), I have some quick thoughts:

Factor in lexical density using Mecab's parts-of-speech analysis? For an explanation and basic formula: http://www.unisanet.unisa.edu.au/Resourc...ensity.htm

Hmm, that keyword ‘readability’ reminds me of cb4960's jReadability tool; I can't recall whether that uses similar factors, but perhaps that's something extra to look into. (Edit: I do remember discussing length-related factors and how the Japanese formula finds this too inconsistent to rely on, or something: http://forum.koohii.com/showthread.php?p...6#pid64016 .)

Edit: It occurs to me that Japanese word classifications ought to be different. I can only find one paper that explicitly defines content words (内容語) and function words (機能語) for Japanese: “I assume that content words are nouns, adjectival nouns, adjectives and full verbs, while function words are particles and auxiliary verbs.” - via

I assume by auxiliary they mean 助動詞.
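
For what it's worth, here is a rough sketch of how the lexical density idea could be computed from MeCab output, using the usual formula (content words / total words × 100) and the content-word definition quoted above. Assumptions on my part: the tags are IPADIC-style (名詞, 動詞, 形容詞, 助詞, 助動詞, 記号, with 非自立 marking non-independent items), adjectival nouns arrive as a 名詞 subtype, punctuation is left out of the word count, and the input is MeCab's default "surface, tab, comma-separated features" text. The names `parse_mecab_output`, `is_content_word` and `lexical_density` are just mine for illustration.

```python
from typing import List, Tuple

# (surface, part of speech, part-of-speech subtype)
Token = Tuple[str, str, str]

# Assumption: IPADIC-style top-level tags treated as content words.
# Adjectival nouns are expected to show up as a subtype of 名詞.
CONTENT_POS = {"名詞", "動詞", "形容詞"}


def parse_mecab_output(raw: str) -> List[Token]:
    """Parse MeCab's default text output (surface<TAB>feature,feature,...)."""
    tokens: List[Token] = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        pos = fields[0]
        detail = fields[1] if len(fields) > 1 else "*"
        tokens.append((surface, pos, detail))
    return tokens


def is_content_word(token: Token) -> bool:
    """Nouns, adjectives and 'full' verbs count; 非自立 items do not."""
    _, pos, detail = token
    return pos in CONTENT_POS and detail != "非自立"


def lexical_density(tokens: List[Token]) -> float:
    """Content words as a percentage of all words (punctuation excluded)."""
    words = [t for t in tokens if t[1] != "記号"]
    if not words:
        return 0.0
    content = sum(1 for t in words if is_content_word(t))
    return 100.0 * content / len(words)


# Hand-written sample in the assumed format for 猫が魚を食べた。
SAMPLE_RAW = "\n".join([
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ",
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ",
    "魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ",
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ",
    "。\t記号,句点,*,*,*,*,。,。,。",
    "EOS",
])
```

Calling `lexical_density(parse_mecab_output(SAMPLE_RAW))` on the sample counts three content words (猫, 魚, 食べ) out of six words. Whether adverbs, pronouns, etc. belong in the content set is exactly the open question here, so the tag set would need adjusting.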

Another bit I found: “語彙には、名詞、動詞、形容詞、副詞などの「内容語」(content words) と、冠詞、代名詞、前置詞、助動詞などの「機能語」(function words)” [roughly: vocabulary includes “content words” such as nouns, verbs, adjectives and adverbs, and “function words” such as articles, pronouns, prepositions and auxiliary verbs] - Edit 2: Although actually pronouns in Japanese are considered content words? Edit 3: Correction, personal pronouns are; more on closed class (function) words: http://ja.wikipedia.org/wiki/%E9%96%89%E...9%E3%82%B9 and open (content): http://ja.wikipedia.org/wiki/%E9%96%8B%E...9%E3%82%B9

Edit 4: Wow, I was so wrong about Japanese corpus linguistics, etc., not being developed. There's so much stuff out there! Found this (p. 373) regarding a systemic functional linguistics look at lexical density: http://www.wagsoft.com/Systemics/Archive...edings.pdf - I like the idea of the clause boundary annotation programme, though I think KNP's parsing takes care of this? Maybe not. They don't go into their criteria for content words. I'm thinking it's so basic they don't bother? So I guess they use something like the above parts of speech, perhaps simply blacklisting the function stuff.

Also, perhaps if dependency parsing is incorporated in the future, we could factor in a focus on the root/head words: http://forum.koohii.com/showthread.php?p...#pid138900 (see link after bold edit for most accessible explanation; langrid uses the notational variant shown on the second page [bottom right] of the .pdf).

Edit 5: I got carried away, but one more on Japanese content/function words: http://books.google.com/books?id=NDknf0v...&q&f=false

Edit 6: Hmm, something else to look at is bunsetsu identification via KNP/CaboCha (e.g. factoring the bunsetsu into other areas rather than a step in the dependency process; more of a subjective shift in focus, perhaps).

Edit 7: Most edits ever. One last confirmation on content words in Japanese: Japanese FrameNet confirms the above.
Edited: 2011-05-27, 11:27 pm