Thora Wrote: Interesting. Are those %s cumulative or unique? The first two are a visual novel and anime?
Actually it's the anime of Fate/Stay Night, specifically episodes 1 to 9.
The %s are unique. I was going to do one against all of FSN+Nogizaka (and maybe JSPfEC too) combined, but I forgot to push my code that adds the ability to save results (e.g., unions) in the MorphMan, so I can't easily make a combined DB to test against. I'll do so tomorrow morning.
Thora Wrote: As you say, I think genre/theme are big determiners. The idea of an "essential" base still seems like a good idea, though. Maybe someone should put out an Anime Core, Biz Core, Novel Core, News Core so efficiency-minded folks can specialize early. :-)
Rather than making genre-specific standardized decks, I'd prefer tools for creating personalized decks tailored to the particular media one actually enjoys, but with the same quality as standardized corpora like Core6k/ko2001. The 'unknowns' feature of my plugin on top of a subs2srs deck is a step in the right direction, but it's not quite there.
That said, I created a new feature that, given a DB of words to learn and a selection of cards to learn them from, finds an optimal 1:1 pairing of word-to-learn to card-to-learn-from (ie, a maximum cardinality bipartite matching) and records that word on a field for the card, so we can truly emulate everything Core does for us in any subs2srs deck.
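To make the pairing idea concrete, here is a minimal sketch of a maximum cardinality bipartite matching using the classic augmenting-path approach. It is only an illustration, not the plugin's actual code: the function name `match_words_to_cards`, the `candidates` mapping, and the card IDs are all hypothetical.

```python
from typing import Dict, List, Set


def match_words_to_cards(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Pair each word with at most one card, and each card with at most one word.

    `candidates` maps a word to the IDs of the cards whose sentence contains it
    (a hypothetical input format). Returns a word -> card ID mapping of maximum size.
    """
    card_to_word: Dict[str, str] = {}

    def try_assign(word: str, seen: Set[str]) -> bool:
        for card in candidates[word]:
            if card in seen:
                continue
            seen.add(card)
            # Take the card if it is free, or if the word currently holding it
            # can be moved to some other card (an augmenting path).
            if card not in card_to_word or try_assign(card_to_word[card], seen):
                card_to_word[card] = word
                return True
        return False

    for word in candidates:
        try_assign(word, set())
    return {word: card for card, word in card_to_word.items()}


# "犬" grabs c1 first, then gets bumped to c2 so that "猫" can use c1.
pairs = match_words_to_cards({"犬": ["c1", "c2"], "猫": ["c1"]})
```

A greedy first-come pairing would leave "猫" without a card here; the augmenting step is what lets both words get one. The recursion is fine for small decks, but a large deck would probably want an iterative version or Hopcroft-Karp.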
I almost released that as well in my update today, but my implementation had a performance bug, and I was considering scrapping it in favor of a full-blown assignment problem algorithm: something like the above, except you could also specify 'costs' for the pairings and have it minimize the total cost of all the pairings. I haven't given much thought to what kinds of cost functions would be useful, but some obvious ideas come to mind:
Increase the cost of a word (morpheme from db) <-> sentence (card from deck) pairing based on:
a) the length of the sentence (longer = harder)
b) the length of the sentence minus the word (more extra noise = harder)
c) if the user noted a fact was 'hard' (ie, add the card's 'Difficulty' field if it has one)
Not only would that finally give us 100% of everything Core provides, but it should theoretically be better, and it can always be run again as you update your deck with new facts or learn more (and thus have fewer words you want to learn, freeing up sentences to be matched with other words you've yet to learn).
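As a rough sketch of how costs (a) to (c) could feed an assignment, assuming equal weights for the three terms and plain substring matching in place of real morpheme matching (both assumptions, as are all names below): the search here is a brute-force pass over every possible pairing, which is only workable for a handful of words. A real implementation would swap in a proper assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple


def pairing_cost(word: str, sentence: str, difficulty: float = 0.0) -> Optional[float]:
    """Cost of learning `word` from `sentence`, or None if the word is absent."""
    if word not in sentence:  # assumption: substring test stands in for morpheme matching
        return None
    length_cost = len(sentence)             # (a) longer sentence = harder
    noise_cost = len(sentence) - len(word)  # (b) more extra material = harder
    return length_cost + noise_cost + difficulty  # (c) user-supplied 'Difficulty'


def cheapest_assignment(
    words: List[str], cards: Dict[str, Tuple[str, float]]
) -> Tuple[Dict[str, str], Optional[float]]:
    """Try every way of giving each word its own card and keep the cheapest.

    `cards` maps a card ID to (sentence, difficulty). Returns ({}, None) when
    no pairing covers every word. Exponential time: illustration only.
    """
    best: Dict[str, str] = {}
    best_cost: Optional[float] = None
    for chosen in permutations(cards, len(words)):
        total = 0.0
        for word, card_id in zip(words, chosen):
            sentence, difficulty = cards[card_id]
            cost = pairing_cost(word, sentence, difficulty)
            if cost is None:
                break  # this card cannot teach this word
            total += cost
        else:
            if best_cost is None or total < best_cost:
                best, best_cost = dict(zip(words, chosen)), total
    return best, best_cost


cards = {
    "c1": ("猫と犬がいる", 0),
    "c2": ("猫だ", 0),
    "c3": ("犬が庭を走っている", 2),
}
best, total = cheapest_assignment(["猫", "犬"], cards)
```

With these made-up cards, "猫" goes to the short sentence c2 and "犬" to c1, since c3 is both longer and flagged as harder.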
If anyone can think of other ideas for determining the cost of using a sentence to learn a given word, I'm all ears. I'll try to incorporate this feature into the next update.
Thora Wrote: I've read about morphological analyzers doing some odd segmentation. I guess you've checked what is considered unique in terms of conjugated variations, word stem, grammatical morphemes (~える), etc.
In playing around with my plugin thus far, I've noticed MeCab does poorly with loan words and I've seen a few mistakes with kana-only words, but generally it does the right thing otherwise.
Thora Wrote: [Not exactly related to Morphin, but...] I have 2 questions you might know something about after looking into the individual kanji issue. I'm trying to replicate some kanji data I have (which is based on a newspaper corpus) using other genres:
1. Can mecab indicate whether kanji in a word is kun or on? (wabun vs kanbun)
I didn't notice any such feature when looking at the docs, but I haven't read them thoroughly enough to say no for certain.
Thora Wrote: 2. Do you know of a way to check whether individual kanji readings are in a given list? I know of software that does this for vocab, but not for individual kanji readings. (I've been doing it with a spreadsheet set up to compare, but it has involved a lot of manual rejigging.)
I'm not entirely sure if I'm understanding correctly, but MeCab can provide the reading of individual morphemes (which you can trivially check against a provided list), as that is how Anki's 'Generate Readings' works. Or did you mean something else?
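A minimal sketch of that kind of check, assuming MeCab's default output with the IPADIC dictionary, where each line is the surface form, a tab, then comma-separated features with the katakana reading in the eighth field (other dictionaries lay the fields out differently). The sample text is hand-written to resemble that format rather than captured from a real run, and the function names are hypothetical. Note that this works per morpheme, not per individual kanji.

```python
from typing import Iterator, List, Set, Tuple


def parse_mecab_lines(output: str) -> Iterator[Tuple[str, str]]:
    """Yield (surface, reading) pairs from IPADIC-style MeCab output text."""
    for line in output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # Unknown words can have fewer fields and therefore no reading.
        reading = fields[7] if len(fields) > 7 else ""
        yield surface, reading


def readings_not_in_list(output: str, known: Set[str]) -> List[Tuple[str, str]]:
    """Return the (surface, reading) pairs whose reading is missing from `known`."""
    return [(s, r) for s, r in parse_mecab_lines(output) if r not in known]


sample = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "好き\t名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ\n"
    "EOS"
)
missing = readings_not_in_list(sample, {"ネコ", "ガ"})
```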
Edited: 2011-05-10, 10:52 pm
