Interesting. Are those %s cumulative or unique? The first two are a visual novel and an anime? As you say, I think genre/theme are big determiners. An "essential" base still seems like a good idea, though. Maybe someone should put out an Anime Core, Biz Core, Novel Core, News Core so efficiency-minded folks can specialize early. :-)
I've read about morphological analyzers doing some odd segmentation. I assume you've checked what gets counted as a unique word when it comes to conjugated forms, word stems, grammatical morphemes (~える), etc.
[Not exactly related to Morphin, but...] I have 2 questions you might know something about after looking into the individual kanji issue. I'm trying to replicate some kanji data I have (which is based on a newspaper corpus) using other genres:
1. Can MeCab indicate whether a kanji in a word is read with its kun or on reading? (wabun vs kanbun)
2. Do you know of a way to check whether individual kanji readings are in a given list? I know of software that does this for vocab, but not for individual kanji readings. (I've been doing it with a spreadsheet set up to compare, but it has involved a lot of manual rejigging.)
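To make question 2 concrete, here is a minimal Python sketch of the comparison I've been doing by spreadsheet. Everything in it is an assumption for illustration: the function names are made up, the reference list is assumed to be a mapping from each kanji to its accepted readings written KANJIDIC-style (katakana for on, hiragana for kun, with "." marking where okurigana starts), and the observed data is assumed to already be split into (kanji, reading) pairs, which is the hard part and is not solved here.

```python
def kata_to_hira(text):
    """Convert katakana to hiragana so readings compare equal in either script."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def normalize(reading):
    """Reduce a reading to the part carried by the kanji itself.

    Assumption: "." separates the kanji's reading from okurigana and "-"
    marks prefix/suffix use, as in KANJIDIC-style lists.
    """
    return kata_to_hira(reading.split(".")[0].strip("-"))


def check_readings(observed, reference):
    """Split observed (kanji, reading) pairs into listed and unlisted ones."""
    known, unknown = [], []
    for kanji, reading in observed:
        allowed = {normalize(r) for r in reference.get(kanji, ())}
        if normalize(reading) in allowed:
            known.append((kanji, reading))
        else:
            unknown.append((kanji, reading))
    return known, unknown


# Hypothetical sample data, not taken from any real corpus.
reference = {
    "食": ["ショク", "た.べる", "く.う"],
    "生": ["セイ", "ショウ", "い.きる", "なま"],
}
observed = [("食", "しょく"), ("食", "た"), ("生", "き")]

known, unknown = check_readings(observed, reference)
```

With the sample data, the two 食 readings land in `known` and ("生", "き") lands in `unknown`, because き is not in the sample list for 生. A kanji missing from the reference list altogether is also reported as unknown.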
Edited: 2011-05-10, 9:05 pm
