
Mighty Morphin Morphology

#1
EDIT 19Feb2013: Information for the latest version, MorphMan 3, can be found here: (1) announcement/overview post (2) wiki (3) video tutorials

EDIT: I have released a new design of the plugin, called 'Morph Man 2'. More information can be found here. The rest of this post applies to the original plugin, Morph Man 1 (aka Japanese Morphology in Anki's shared plugin list).

I just released an Anki plugin called 'Japanese Morphology', which is a suite of tools relating to morphological analysis. The biggest feature of the initial release is the ability to calculate how difficult a sentence is in terms of how many morphemes it contains that you don't know. That is, it approximates the N in the "i + N"'ness of a sentence. This is useful if you subs2srs a show but want to work on only the i+1 sentences.

* Note: currently this only looks at the Expression field, although you can modify that in the source.

-- How to use --

First you must create a database of all morphemes you know, known as "known.db". You can do this by creating databases of morphemes from various sources and merging them (you are automatically prompted to merge new databases with known.db).

Then I'd suggest setting the fields with the various features, filtering by iPlusN:1, sorting by vocabRank, and then going to town on a subs2srs deck. If you run out of i+1 sentences, update known.db with the new sentences you've learned and repeat.

-- To create a morpheme database and initial known.db from anki cards
Open a deck, go to the card browser, select all the cards you've studied, and go to Actions>It's Morphin Time>Export Morphemes; select a source name (it doesn't do anything in this initial release) and then a file to save to. It will prompt you to merge with known.db. For your first export, save as 'known.db' and don't merge. Continue doing this with all the cards you know in all your decks so that known.db becomes a better approximation of your knowledge.

-- To see what morphemes a card has
Select some cards in the card browser and go to Actions>It's Morphin Time>View Morphemes or hit ctrl+V to get a popup with all the morphemes in those cards' fields. It also displays all the morphemes those cards have that aren't in your known.db under the '-----New-----' heading. Note, you can blacklist certain parts of speech to filter which morphemes appear.

The columns in the output are the morpheme, the part of speech, the sub-part of speech, and the reading. The default blacklist is punctuation and particles.
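For anyone curious what that looks like under the hood, here's a rough sketch of the idea (not the plugin's actual code; the real options sent to mecab live in morpheme.py, and the field indices below assume the standard ipadic feature layout):

Code:
# Rough sketch only: pull (morpheme, POS, sub-POS, reading) out of MeCab's
# default ipadic output and drop blacklisted parts of speech.
import MeCab

POS_BLACKLIST = {'記号', '助詞'}   # punctuation and particles (the default blacklist)

def get_morphemes(text, blacklist=POS_BLACKLIST):
    tagger = MeCab.Tagger()
    morphemes = []
    for line in tagger.parse(text).splitlines():
        if line == 'EOS' or not line.strip():
            continue
        surface, feature = line.split('\t', 1)
        fields = feature.split(',')
        pos, sub_pos = fields[0], fields[1]
        base = fields[6] if len(fields) > 6 else surface
        reading = fields[7] if len(fields) > 7 else ''
        if pos in blacklist:
            continue
        morphemes.append((base, pos, sub_pos, reading))
    return morphemes

print(get_morphemes('問おう、貴方が私のマスターか'))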

-- To find the N in the i+N'ness of your cards
Create a field in your model called 'iPlusN' and mark it to be numerically sorted.
Then select some cards in the card browser and go to Actions>It's Morphin Time>Set i+N. This will determine the morphemes in each card and count how many aren't in known.db; the result is stored in the 'iPlusN' field. Note that you can provide a parts-of-speech blacklist so that those morphemes are ignored in the calculation.
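As a rough sketch of what Set i+N boils down to (assuming, for illustration only, that known.db can be loaded as a set of morpheme tuples; the real file format may differ):

Code:
# Sketch only: approximate the N in a card's "i+N" by counting morphemes that
# aren't in known.db. Morphemes are (base, pos, sub_pos, reading) tuples like
# the ones sketched above.
import pickle

POS_BLACKLIST = {'記号', '助詞'}

def i_plus_n(morphemes, known, blacklist=POS_BLACKLIST):
    unknowns = {m for m in morphemes
                if m[1] not in blacklist and m not in known}
    return len(unknowns)

with open('known.db', 'rb') as f:      # hypothetical: the real known.db format may differ
    known = pickle.load(f)

card_morphs = [('問う', '動詞', '自立', 'トウ'),
               ('貴方', '名詞', '代名詞', 'アナタ'),
               ('マスター', '名詞', '一般', 'マスター')]
print(i_plus_n(card_morphs, known))    # 0 = i+0, 1 = i+1, ...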

-- To rank new vocab words by difficulty
Create a field called 'vocabRank', mark it to be numerically sorted, and then select some cards in the browser and go to Actions>It's Morphin Time>Set vocabRank. The higher the number, the more similar the word is to other morphemes you've seen (in terms of shared kanji and readings). *Note: the current implementation is a rough approximation (see later posts) but errs in favor of false low scores, so you can trust that high scores mean similar / easy.

-- To see which parts of a sentence are "unknown"
I wanted an easy way to see which parts of a sentence I should be focusing on when testing myself, so I added an option to set an 'unknowns' field, which gets filled with the unknown morphemes (the N in the sentence's i+N'ness). I found it helpful to show this on my cards, below the sentence being tested and in a smaller font.

-- To create a db from a UTF-8 text file
Open MorphMan (on the main window's menu bar), choose Export morphemes from file, then select a file to export from and a place to save the db.

-- To analyze, compare, or merge dbs
Open MorphMan (on the main window's menu bar), browse to open one or two dbs, then click the various buttons to compare db A to B (or just show A). You can view the resulting morphemes in 4-column mode (morpheme, part of speech, sub-part of speech, and reading) or 1-column mode (just the base morpheme).

You can then save the results to a new database. This is useful if you want a separate db for each show you have, plus larger dbs that are unions of all the shows in a genre, for more interesting analysis.
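Under the hood this kind of analysis is just set arithmetic over morpheme dbs. A rough sketch (again assuming, purely for illustration, that a db loads as a set of morpheme tuples; the plugin's on-disk format may differ, and the file names are made up):

Code:
# Sketch only: comparing and merging morpheme dbs as plain set operations.
import pickle

def load_db(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

def save_db(morphs, path):
    with open(path, 'wb') as f:
        pickle.dump(morphs, f)

a = load_db('fsn.db')          # hypothetical file names
b = load_db('core6k.db')

print('in A but not B: %d/%d' % (len(a - b), len(a)))
print('shared         : %d' % len(a & b))

save_db(a | b, 'union.db')     # eg, merge a show db into a genre-wide db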

-- To tag facts that match against a db
You can use the It's Morphin Time>Mass Tagger feature to add one or more tags to all facts whose Expression field has a morpheme found in a database you specify. For example, tag all facts in your subs2srs deck that contain words that are also in kore, or vice-versa.

-- To create a personalized vocab deck
You can use the It's Morphin Time>Morph Match feature to find an optimal matching of words (morphemes from a DB) to sentences (selected cards' Expression fields) and set the best word to learn from each sentence in the 'matchedMorpheme' field.

-- Final notes --
The code is up on GitHub (and in your plugin directory): https://github.com/jre2/JapaneseStudy/tr...morphology . Feel free to use/modify it in any way.

Mecab isn't great at determining the morphemes of loan words, so sometimes it will incorrectly break them up and wrongly inflate the i+N score of a sentence.

Please report any bugs, feature suggestions, etc.

--Changes:
v1.1: added vocabRank feature
v1.2: added unknowns feature, cleaned up menus, and renamed some files
v1.3: added MorphMan for managing dbs
v1.4: added MorphMan result saver and mass tagger
v1.5: added maximum cardinality morpheme matcher, bugfix for anki 1.0
Edited: 2013-02-19, 10:50 pm
#2
Wow seems like a very cool thing!

This would basically allow us to make core type cards from any subs2srs show.

I personally learnt about 2000 sentences from subs2srs in my beginner period, only to find out that they were too hard (sometimes i+3) and that it wasn't an effective way of studying vocab.

That is the reason why I went to core and haven't been using subs2srs since, feeling like it's too difficult and ineffective to learn from.

There is a possibility that your tool will make me consider using subs2srs again!
Edited: 2011-05-06, 5:01 am
Reply
#3
I think I like this... Still playing with it, but definitely seems rather useful.
#4
Hey, this seems really awesome!

I have just been thinking about making something similar - I had got up to the part of naming it. :)

I am interested to know how you are going about detecting the words. My initial thought was to use sentence glossing (via WWWJDIC's glossing feature), but I have no idea how well that would work.

I was also thinking it would be cool to be able to consider the grammar of the sentence as well. Say, a sentence that you know is <noun>+desu. So it could use patterns like that to help calculate the distance between two sentences (the n+1 ness).

Anyway, _very_ excited to see that you have already got something going for all this. I will be trying it soon. I will give more feedback then. Hope you keep on going with it!
#5
Boy.pockets Wrote:I have just been thinking about making something similar - I had got up to the part of naming it. :)
I got stuck on that part, thus the boring name for now.

Boy.pockets Wrote:I am interested to know how you are going about detecting the words. My initial thought was to use sentence glossing (via WWWJDIC's glossing feature), but I have no idea how well that would work.
I'm using mecab to determine the morphemes. If you look at morpheme.py you can see the specific options sent to mecab.

Boy.pockets Wrote:I was also thinking it would be cool to be able to consider the grammar of the sentence as well. Say, a sentence that you know is <noun>+desu. So it could use patterns like that to help calculate the distance between two sentences (the n+1 ness).
I considered this as well but then decided it was too difficult to determine whether you 'know' a grammar point, since I've 'seen' a lot of grammar used in various sentences and merely used a simplified understanding of them to get by. It's also not as easy to detect.

That said, I can get the morphemes in order and with parts of speech information, so perhaps someone could look at the morpheme output of some sentences and figure out some simple rules to detect various patterns.
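To give a concrete idea, here's a toy sketch of what such a rule could look like on top of the ordered, POS-tagged output (morphemes as hypothetical (base, pos, sub-pos, reading) tuples; real grammar detection would need far more, and far smarter, rules):

Code:
# Toy sketch only: detect a trivial <noun> + です pattern in an ordered list of
# (base, pos, sub_pos, reading) morpheme tuples.
def has_noun_desu(morphemes):
    for (base, pos, _, _), (next_base, next_pos, _, _) in zip(morphemes, morphemes[1:]):
        if pos == '名詞' and next_base == 'です' and next_pos == '助動詞':
            return True
    return False

morphs = [('これ', '名詞', '代名詞', 'コレ'),
          ('です', '助動詞', '*', 'デス')]
print(has_noun_desu(morphs))   # True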
#6
overture2112 Wrote:That said, I can get the morphemes in order and with parts of speech information, so perhaps someone could look at the morpheme output of some sentences and figure out some simple rules to detect various patterns.
Maybe you could find a list of particles, count the number of particles per sentence, and sort the sentences based on that count.

But there would be problems due to some nouns and adjectives being written in hiragana and katakana.
#7
jettyke Wrote:Wow seems like a very cool thing!

This would basically allow us to make core type cards from any subs2srs show.

I personally learnt about 2000 sentences from subs2srs in my beginner period, only to find out that they were too hard (sometimes i+3) and that it wasn't an effective way of studying vocab.

That is the reason why I went to core and haven't been using subs2srs since, feeling like it's too difficult and ineffective to learn from.

There is a possibility that your tool will make me consider using subs2srs again!
Lol, I went the opposite way for the same reason ;)
#8
overture2112 Wrote:
Boy.pockets Wrote:I have just been thinking about making something similar - I had got up to the part of naming it. :)
I got stuck on that part, thus the boring name for now.
"Morphology" sounds pretty cool to me. Maybe "Morfin" to make it shorter?

Another thing I thought might be cool (maybe you are already doing this): using the kanji readings to look for the next best word to use. For example, say you already know '出席「しゅっせき」', then an easy one to learn next might be '出廷「しゅってい」' (especially if you already know 'tei' from another word).
#9
jettyke Wrote:
overture2112 Wrote:That said, I can get the morphemes in order and with parts of speech information, so perhaps someone could look at the morpheme output of some sentences and figure out some simple rules to detect various patterns.
Maybe you could find a list of particles and count the number of particles per sentence and sort the sentences based on the amount of particles per sentence.

But there would be problems due to some nouns and adjectives being in hiragana and katakana
It can already detect particles correctly; just remove them from the blacklist when you do a view/iPlusN. You can change the default blacklist in the code (it's @ morph/util.py:14). I haven't experimented too much with what good blacklists would be, so the default of punctuation and particles is just my current personal preference.

I could also expose a whitelist option (eg, to only do iPlusN for particles, although you can accomplish this with a very long blacklist too), or an option to filter by the sub-part of speech group (3rd column) too, if people think it'd be useful.

Boy.pockets Wrote:Another thing I thought might be cool (maybe you are already doing this): using the kanji readings to look for the next best word to use. For example, say you already know '出席「しゅっせき」', then an easy one to learn next might be '出廷「しゅってい」' (especially if you already know 'tei' from another word).
Awesome idea! I'll try playing with that soon.

Problem: given a kanji compound, how can you determine which parts of the reading are from which kanji?
Edited: 2011-05-06, 10:39 am
#10
overture2112 Wrote:Problem: given a kanji compound, how can you determine which parts of the reading are from which kanji?
I didn't find a great solution, but I uploaded a new version of the plugin that can assign a ranking to vocab (stored in the 'vocabRank' field) that seems useful enough to be worth having.


-- For nerds:
Basically the scoring works like this:

It looks through each kanji of each morpheme in the Expression field (skipping kana) and gives +20 pts if you know a word with that kanji and an extra +50 pts if they share the same position within their respective words. If they share position, it also looks at the first and/or last character of the reading to see if they're the same as well and gives a +100 pt bonus if so, but it only does this if the kanji in question is at the very beginning or end of the word.

That is, we assume the first character of the reading comes from the first character of the expression, the last character of the reading comes from the last character of the expression (unless either is kana), and that, for a given kanji, readings that start with the same character are the same. Obviously not the best solution, but it's a start.

For each kanji it keeps the highest score across all the words you know, and then averages over the number of characters considered (ie, the non-kana ones).

Finally, all the scores for the morphemes in your expression field are averaged (kana-only morphemes are ignored) and an expression field with only kana-only morphemes gets a total score of -10.
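In (simplified) code form, the per-morpheme part looks roughly like this (a sketch only; the kana check and the known-word list are stand-ins for illustration, not the plugin's actual implementation):

Code:
# Simplified sketch of the scoring described above, not the plugin's code.
# known_words is assumed to be a list of (word, reading) pairs; this scores
# a single morpheme rather than a whole Expression field.
def is_kana(c):
    return '\u3040' <= c <= '\u30ff'

def kanji_score(kanji, pos, word, reading, known_words):
    best = 0
    for kw, kr in known_words:
        for kpos, kc in enumerate(kw):
            if kc != kanji:
                continue
            score = 20                                   # you know a word with this kanji
            if kpos == pos:
                score += 50                              # ...in the same position
                # reading bonus only when the kanji sits at the very start or end
                if pos == 0 and reading[:1] == kr[:1]:
                    score += 100
                elif pos == len(word) - 1 and kpos == len(kw) - 1 and reading[-1:] == kr[-1:]:
                    score += 100
            best = max(best, score)
    return best

def vocab_rank(word, reading, known_words):
    kanji = [(c, i) for i, c in enumerate(word) if not is_kana(c)]
    if not kanji:
        return -10                                       # kana-only morpheme
    scores = [kanji_score(c, i, word, reading, known_words) for c, i in kanji]
    return sum(scores) / float(len(scores))              # average over non-kana chars

known = [('出席', 'しゅっせき')]
print(vocab_rank('出廷', 'しゅってい', known))            # 出 scores 170, 廷 scores 0 -> 85.0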
-----

Thus: the new 'vocabRank' feature does a great job of identifying words which are easy to learn but doesn't necessarily find all of them (ie, low-scoring words could still be plenty easy due to the positional dependence and other limitations with respect to readings).
#11
I wanted an easy way to see which parts of a sentence I should be focusing on when testing myself, so I added an option to set an 'unknowns' field, which gets filled with the unknown morphemes (the N in the sentence's i+N'ness). I found it helpful to show this on my cards, below the sentence being tested and in a smaller font. For example, my sentence card looks like this (sans relative font size differences):

----------------------------------------
That's...
Servant Saber. Upon your summoning, I have come forth.
問おう、貴方が私のマスターか
貴方,マスター
I couldn't say anything.
I don't know if I was just confused by all that had happened.

i+2; VR 258
----------------------------------------



And now I can easily turn my subs2srs deck into a vocab-only deck by just using the i+1 sentences and making a template whose layout focuses on the unknowns field:

---------- Front -----------------------

Your Noble Phantasm... is it a sword?
Now, what could it be?
斧か、槍か、いや、もしや弓ということもあるかもしれんぞ
Spare me the jokes, Saber.
This is only our first meeting; any chance of settling this fight as a draw?

i+1 VR 0
---------- Back ------------------------
斧[おの]か、 槍[やり]か、いや、もしや 弓[ゆみ]ということもあるかもしれんぞ

Your Noble Phantasm... is it a sword?
Now, what could it be?
An axe? A spear? No, it could even be a bow, could it not?
Spare me the jokes, Saber.
This is only our first meeting; any chance of settling this fight as a draw?

{plus gloss from my lookup plugin}
----------------------------------------
#12
I added a little GUI manager so you can open up databases and view / compare them. Also, you can now create databases from a UTF-8 text file.

Some interesting results quickly achieved with the new manager:

Only 1100/3204 = 34.3%, 227/655 = 34.6%, and 839/1681 = 50% of the morphemes from Fate/Stay Night ep1-9, Nogizaka Haruka no Himitsu ep1, and JSPfEC, respectively, are in the core6k words. If you expand that to include all morphemes in core6k's words and example sentences, those numbers only grow to 1706/3204 = 53%, 415/655 = 63%, and 1283/1681 = 76%.

For the morphemes in FSN that aren't in core6k, 39% are nouns and 36% are verbs. Nogizaka was 32% and 26%, JSPfEC was 59% and 25%. Also, only 2%, 3%, and 0% were listed as 'unknown' by mecab, which means core6k's deficiency isn't due to loan words. I also filtered out punctuation and particles, as that wouldn't be fair to core6k.

As for words that are in core6k but not in the 3 dbs listed above, ie wasted effort:
JSPfEC: 4982/5821 = 85.6%
FSN: 4721/5821 = 81%
Nogizaka: 5594/5821 = 96%

This is interesting as it gives considerable evidence that core6k's word selection is far from optimal if you have particular shows/books/etc in mind that you'd like to comprehend. This may be obvious, but I was surprised by the degree of its inefficacy.

If anyone has large anki decks from other sources, it'd be neat to see more analysis.
#13
overture2112 Wrote:If anyone has large anki decks from other sources, it'd be neat to see more analysis.
There was that 10k core deck and the 25k (core plus?) deck on Anki.
#14
Interesting. Are those %s cumulative or unique? The first two are a visual novel and anime? As you say, I think genre/theme are big determiners. The idea of an "essential" base still seems like a good idea, though. Maybe someone should put out an Anime Core, Biz Core, Novel Core, News Core so efficiency-minded folks can specialize early. :-)

I've read about morphological analyzers doing some odd segmentation. I guess you've checked what is considered unique in terms of conjugated variations, word stem, grammatical morphemes (~える), etc.

[Not exactly related to Morphin, but...] I have 2 questions you might know something about after looking into the individual kanji issue. I'm trying to replicate some kanji data I have (which is based on a newspaper corpus) using other genres:
1. Can mecab indicate whether kanji in a word is kun or on? (wabun vs kanbun)
2. Do you know of a way to check whether individual kanji readings are in a given list? I know of software that does this for vocab, but not for individual kanji readings. (I've been doing it with a spreadsheet set up to compare, but it has involved a lot of manual rejigging.)
Edited: 2011-05-10, 9:05 pm
#15
jettyke Wrote:
overture2112 Wrote:If anyone has large anki decks from other sources, it'd be neat to see more analysis.
There was that 10k core deck and the 25 k (core plus?) deck on anki.
B = CorePlus
Code:
A        | intersection    | B-A (waste)       |
------------------------------------------------
JSPfEC   | 1143/1681 = 68% | 22177/23320 = 95% |
FSN      | 1891/3204 = 59% | 21429/23320 = 92% |
Nogizaka |  425/ 655 = 65% | 22895/23320 = 98% |
I guess I'll have to start making more subs2srs decks to get a better idea of how these standard decks fare, but I can say with certainty that if your goal is to watch/read stuff that uses terms similar to the examples above, corePlus is a terrible idea.

corePlus has lots of words like 決算 but seems to lack ones like 斧, which I guess is fine for boring people...
#16
Thora Wrote:Interesting. Are those %s cumulative or unique? The first two are a visual novel and anime?
Actually it's the anime of Fate/Stay Night, specifically episodes 1 to 9.
The %s are unique. I was going to do one against all of FSN+Nogizaka (and maybe JSPfEC too) combined, but I forgot to push my code which adds the ability to save results (eg, unions) in the MorphMan and thus can't easily make a combined DB to test against. I'll do so tomorrow morning.

Thora Wrote:As you say, I think genre/theme are big determiners. The idea of an "essential" base still seems like a good idea, though. Maybe someone should put out an Anime Core, Biz Core, Novel Core, News Core so efficiency-minded folks can specialize early. :-)
Rather than making genre-specific standardized decks, I'd prefer tools for creating personalized decks catering to the particular media one actually enjoys, but with the same quality as standardized corpora like Core6k/ko2001. The 'unknowns' feature of my plugin on top of a subs2srs deck is a step in the right direction, but it's not quite there.

That said, I created a new feature that, given a DB of words to learn and a selection of cards to learn them from, finds an optimal 1:1 pairing of word-to-learn to card-to-learn-from (ie, a maximum cardinality bipartite matching) and records that word in a field on the card, so we can truly emulate everything Core does for us in any subs2srs deck.

I almost released that as well in my update today, but my implementation had a performance bug and I was considering scrapping it and using a full-blown assignment-problem algorithm; that is, something like the above but where you could also specify 'costs' for the pairings and have it minimize the overall cost of all the pairings. I haven't thought much about what kind of cost functions would be useful, but some obvious ideas come to mind:

Increase the cost of a word (morpheme from db) <-> sentence (card from deck) pairing based on:
a) the length of the sentence (longer = harder)
b) the length of the sentence minus the word (more extra noise = harder)
c) if the user noted a fact was 'hard' (ie, add the card's 'Difficulty' field if it has one)

Not only would that finally give us 100% of everything Core provides, but it should theoretically be better and can always be run again as you update your deck with new facts or learn more (and thus have less you want to learn, freeing up sentences to be matched with other words you've yet to learn).

If anyone can think of other ideas for determining the cost of using a sentence to learn a given word, I'm all ears. I'll try to incorporate this feature into the next update.
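For anyone who wants to experiment with the idea, here's a rough sketch of what a cost-based version could look like using SciPy's Hungarian-algorithm solver (linear_sum_assignment). The costs just follow ideas (a) and (b) above, the data is made up, and none of this is the plugin's code:

Code:
# Sketch only: cost-based word<->sentence assignment via SciPy's Hungarian
# algorithm solver. Impossible pairings (sentence doesn't contain the word)
# get a huge cost and are filtered out afterwards.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(words, sentences):
    BIG = 10 ** 6
    cost = np.full((len(words), len(sentences)), BIG, dtype=float)
    for i, w in enumerate(words):
        for j, s in enumerate(sentences):
            if w in s:                                   # naive containment check
                cost[i, j] = len(s) + (len(s) - len(w))  # (a) length + (b) extra noise
    rows, cols = linear_sum_assignment(cost)
    return [(words[i], sentences[j]) for i, j in zip(rows, cols) if cost[i, j] < BIG]

words = ['斧', '弓']
sentences = ['斧か、槍か、いや、もしや弓ということもあるかもしれんぞ', '斧だ']
print(assign(words, sentences))   # 斧 gets the short sentence, 弓 gets the long one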

Thora Wrote:I've read about morphological analyzers doing some odd segmentation. I guess you've checked what is considered unique in terms of conjugated variations, word stem, grammatical morphemes (~える), etc.
In playing around with my plugin thus far, I've noticed mecab does poorly with loan words and I've seen a few mistakes with kana-only words, but generally it does the right thing otherwise.

Thora Wrote:[Not exactly related to Morphin, but...] I have 2 questions you might know something about after looking into the individual kanji issue. I'm trying to replicate some kanji data I have (which is based on a newspaper corpus) using other genres:
1. Can mecab indicate whether kanji in a word is kun or on? (wabun vs kanbun)
I didn't notice any such feature when looking at the docs, but I haven't read them thoroughly enough to say no for certain.

Thora Wrote:2. Do you know of a way to check whether individual kanji readings are in a given list. I know of software that does this for vocab, but not for individual kanji readings. (I've been doing it with a spreadsheet set up to compare, but it has involved a lot of manual rejigging.)
I'm not entirely sure if I'm understanding correctly, but Mecab can provide the reading of individual morphemes (which you can trivially check against a provided list), as that is how Anki's 'Generate Readings' works. Or did you mean something else?
Edited: 2011-05-10, 10:52 pm
#17
overture2112 Wrote:corePlus has lots of words like 決算 but seems to lack ones like 斧, which I guess is fine for boring people...
lol
#18
Question, what are you going to do about really long sentences, because there are lots of them in media?

As I understand it, you promote not learning all the sentences from a single show, but learning the most suitable words from numerous dramas, not covering all the words.
Edited: 2011-05-11, 6:03 am
#19
Is that zip file for 1.3?
#20
jettyke Wrote:
overture2112 Wrote:corePlus has lots of words like 決算 but seems to lack ones like 斧, which I guess is fine for boring people...
lol
This is why I went 'screw it' after 500 cards of kore, dropped it and started playing through FSN. Granted, it took months, but it was awesome. It's interesting to see some statistics on how much these kinds of materials differ.
#21
jettyke Wrote:Question, what are you going to do about really long sentences, because there are lots of them in media?
In what sense? I guess long sentences are useful for practicing listening comprehension, but I'd rather avoid them when learning words/grammar if possible. Thus the idea of assigning costs to word<->sentence pairings based on length. If that's the only sentence with the word you want to learn (or if the other sentences with the word are better suited for other words you want to learn) then you eat the cost.

jettyke Wrote:As I understand it, you promote not learning all the sentences from a single show, but learning the most suitable words from numerous dramas, not covering all the words.
Not sure if it's "promoting" so much as "I'd like to try it because it sounds better in theory", but yes, the idea is to create a DB of words you want to learn and throw the matching algorithm against some large subs2deck of a show you've seen before to find optimal sentences to learn them from. If you'd particularly like to focus on examples from some subset of your deck (like a particular episode of a show) then you could adjust the cost function so that those facts are relatively cheap.
#22
nest0r Wrote:Is that zip file for 1.3?
Oh, the one on GitHub I think is 1.0; I should update that. The latest release (1.4) is available through Anki's shared plugins.
#23
overture2112 Wrote:
jettyke Wrote:Question, what are you going to do about really long sentences, because there are lots of them in media?
In what sense? I guess long sentences are useful for practicing listening comprehension, but I'd rather avoid them when learning words/grammar if possible. Thus the idea of assigning costs to word<->sentence pairings based on length. If that's the only sentence with the word you want to learn (or if the other sentences with the word are better suited for other words you want to learn) then you eat the cost.

jettyke Wrote:As I understand it, you promote not learning all the sentences from a single show, but learning the most suitable words from numerous dramas, not covering all the words.
Not sure if it's "promoting" so much as "I'd like to try it because it sounds better in theory", but yes, the idea is to create a DB of words you want to learn and throw the matching algorithm against some large subs2deck of a show you've seen before to find optimal sentences to learn them from. If you'd particularly like to focus on examples from some subset of your deck (like a particular episode of a show) then you could adjust the cost function so that those facts are relatively cheap.
Yeah, my bad, I couldn't come up with a better single word quickly, so I used "promote". I'm not accusing you or anything :D :D
I actually totally support your idea. It's much more time-efficient, easier, and smarter to just make a huge database and find the most suitable sentences than to make every existing long sentence shorter.

I guess it is better to avoid them, yeah.
Edited: 2011-05-11, 11:01 am
#24
That's cool, didn't realize you'd added it to shared plugins. I'll have to test this out soon. I've been remiss, but now I have more time to waste... er, spend on all these self-study tools you programmer-gods come up with.
#25
v1.5 implements a feature to find a maximum cardinality bipartite matching between morphemes in a DB and facts whose Expression field contains said morphemes.

Effectively this means you can create a DB of words you want to learn, select some cards with sentences you want to learn them from, and then have it fill the 'matchedMorpheme' field of the facts such that each fact is assigned at most 1 word and as many words are assigned as possible. The downside is that this operation can take some time depending on how large the DB is, how large the selection of cards is, and the number of potential pairings.

Note: cards that don't contain any morphemes from the DB are automatically ignored in the calculations, as are morphemes that no card has.

----- For nerds and people who want to know about its performance -----

The algorithm is implemented via Edmonds-Karp's solution to the maximum flow problem and thus has a complexity of O( |E| * max|f| )**, where |V| is the number of vertices on the graph (ie morphemes+facts), |E| is the number of edges (ie potential pairings), and max|f| is the maximum flow (ie the size of the largest matching)***.

Also, the algorithm has to initially scan all your facts to see which morphemes can be provided by which facts, which is equivalent to the amount of time it takes to export morphemes from a deck into a db.

Lies to children:
** Actually it's the lesser of O( |V| * |E|^2 ) [via Edmonds-Karp] or O( |E| * max|f| ) [via Ford-Fulkerson], but in our case max|f| is at most the lesser of the number of morphemes or facts, so really it's at worst O( |V| * |E| ).
*** Actually the graph contains a super source which is connected to all morphemes and a super sink which is connected to all facts, so actually |V| = morphemes + facts + 2 and |E| = potential pairings + morphemes + facts.
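For reference, the matching itself (stripped of the super source/sink bookkeeping) is equivalent to the classic augmenting-path approach sketched below; this is an illustrative sketch with toy data, not the plugin's actual implementation:

Code:
# Sketch only: maximum cardinality bipartite matching via augmenting paths
# (Ford-Fulkerson specialised to bipartite graphs, aka Kuhn's algorithm).
# edges[w] lists the facts (by index) whose Expression contains morpheme w.
def max_matching(edges, n_facts):
    match = [None] * n_facts            # match[fact] = morpheme currently assigned to it

    def try_assign(w, seen):
        for f in edges[w]:
            if f in seen:
                continue
            seen.add(f)
            # take the fact if it's free, or if its current morpheme can be re-routed
            if match[f] is None or try_assign(match[f], seen):
                match[f] = w
                return True
        return False

    matched = sum(try_assign(w, set()) for w in edges)
    return matched, match

# morpheme -> facts that could teach it (hypothetical toy data)
edges = {'斧': [0], '弓': [0, 1], '槍': [1]}
print(max_matching(edges, 2))   # (2, ['斧', '弓']): each fact gets at most one morpheme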

----- Some real numbers for people curious about efficacy and efficiency -----

1) Mapping Core6k db to Fate/Stay Night ep1-2:

456/5821 morphemes from db found in 462/558 facts from deck for a potential 1176 pairings. |V| = 920, |E|=2094. Successfully matched 372 in 5sec.

2) Mapping Core6k db to Fate/Stay Night ep1-9:

1083/5821 morphemes from db found in 2121/2521 facts from deck for a potential 5511 pairings. |V| = 3206, |E| = 8715. Successfully matched 991 in 40s (15s to scan facts, 25s to solve matching).

3) Mapping a union of my FSN+Nogizaka dbs to Core6k deck:

1154/3469 morphemes from db found in 1349/6000 facts from deck for a potential 1410 pairings. |V| = 2505, |E| = 3922. Successfully matched 1152 in 1m4s (58s to scan facts, 5s to solve matching).

4) Mapping animeWordsToLearn* to FSN deck:

* union of FSN+Nogizaka dbs subtracted by known.db

2878/3243 morphemes from db found in 2480/2521 facts from deck for a potential 11709 pairings. |V| = 5360, |E|=17067. Successfully matched 2004 in 2m41s (18s to scan facts, 2m23s to solve matching).

----- Conclusions -----

Notice:
- from example 3 to example 2, a 2.22x increase in edges caused a 5x increase in matching time
- from example 3 to example 4, a 4.35x increase in edges caused a 28.6x increase in matching time

The feature shows you the complexity information before starting the matching so you can get a rough idea of how long it will take. If it's too large, try matching with a smaller db or against fewer cards.

The idea of using the assignment problem to allow costs for pairings is being put to the side since that would be far slower ( O( |V|^3 ) ) and thus wouldn't be useful for decks of any real size. I still like the idea of filtering by sentence length though.