v1.5 implements a feature to find a maximum cardinality bipartite matching between morphemes in a DB to facts whose Expression field contains said morphemes.
Effectively this means you can create a DB of words you want to learn, select some cards with sentences you want to learn them from, and then have it fill the 'matchedMorpheme' field of those facts so that each fact is assigned at most 1 word and as many words as possible are assigned. The downside is that this operation can take some time, depending on how large the DB is, how many cards are selected, and the number of potential pairings.
Note: cards that don't contain any morphemes from the DB are automatically ignored in the calculation, as are morphemes that appear in no card.
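As a rough sketch of how that pre-filtering falls out of building the pairings (this is an illustration, not the plugin's actual code; `morphemes_of` stands in for whatever morpheme analyzer splits a fact's Expression):

```python
def build_pairings(db_morphemes, facts, morphemes_of):
    """Scan each fact's Expression once and collect the potential
    pairings as morpheme -> list of facts containing it.
    Facts containing no db morphemes contribute nothing, and
    morphemes found in no fact never become keys, so both drop
    out of the matching for free."""
    db = set(db_morphemes)
    pairings = {}
    for fact in facts:
        # morphemes_of is assumed to return a set of morphemes
        for m in morphemes_of(fact) & db:
            pairings.setdefault(m, []).append(fact)
    return pairings
```

This single scan over the facts is the "export morphemes from a deck" step mentioned below; the matching itself then runs only over the surviving morphemes and facts.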
----- For nerds and people who want to know about its performance -----
The algorithm is implemented via the Edmonds-Karp solution to the maximum flow problem and thus has a complexity of O( |E| * max|f| )**, where |V| is the number of vertices in the graph (i.e. morphemes + facts), |E| is the number of edges (i.e. potential pairings), and max|f| is the maximum flow (i.e. the size of the largest matching)***.
Also, the algorithm has to initially scan all your facts to see which morphemes can be provided by which facts, which takes about as long as exporting morphemes from a deck into a db.
Lies to children:
** Actually it's the lesser of O( |V| * |E|^2 ) [via Edmonds-Karp] or O( |E| * max|f| ) [via Ford-Fulkerson], but in our case max|f| is at most the lesser of the number of morphemes or facts, so really it's at worst O( |V| * |E| ).
*** Actually the graph contains a super source which is connected to all morphemes and a super sink which is connected to all facts, so actually |V| = morphemes + facts + 2 and |E| = potential pairings + morphemes + facts.
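For illustration, here is a minimal version of the matching step. It uses the classic augmenting-path formulation (Ford-Fulkerson on the implicit unit-capacity network, with the super source and sink left implicit); the actual feature uses BFS-based Edmonds-Karp, but both produce a maximum-cardinality matching:

```python
def max_matching(pairings):
    """pairings: dict morpheme -> list of facts containing it.
    Returns dict fact -> morpheme, a maximum-cardinality matching:
    each fact gets at most one morpheme, and as many morphemes as
    possible are assigned."""
    match = {}  # fact -> morpheme currently assigned to it

    def try_assign(m, seen):
        # Try to place morpheme m on some fact, re-routing
        # already-matched morphemes along an augmenting path.
        for f in pairings.get(m, ()):
            if f in seen:
                continue
            seen.add(f)
            # f is free, or its current morpheme can move elsewhere
            if f not in match or try_assign(match[f], seen):
                match[f] = m
                return True
        return False

    for m in pairings:
        try_assign(m, set())
    return match
```

Note the recursive DFS here can hit Python's recursion limit on graphs the size of the examples below; a BFS formulation like Edmonds-Karp sidesteps that.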
----- Some real numbers for people curious about efficacy and efficiency -----
1) Mapping Core6k db to Fate/Stay Night ep1-2:
456/5821 morphemes from db found in 462/558 facts from deck for a potential 1176 pairings. |V| = 920, |E| = 2094. Successfully matched 372 in 5s.
2) Mapping Core6k db to Fate/Stay Night ep1-9:
1083/5821 morphemes from db found in 2121/2521 facts from deck for a potential 5511 pairings. |V| = 3206, |E| = 8715. Successfully matched 991 in 40s (15s to scan facts, 25s to solve matching).
3) Mapping a union of my FSN+Nogizaka dbs to Core6k deck:
1154/3469 morphemes from db found in 1349/6000 facts from deck for a potential 1410 pairings. |V| = 2505, |E| = 3922. Successfully matched 1152 in 1m4s (58s to scan facts, 5s to solve matching).
4) Mapping animeWordsToLearn* to FSN deck:
* union of FSN+Nogizaka dbs subtracted by known.db
2878/3243 morphemes from db found in 2480/2521 facts from deck for a potential 11709 pairings. |V| = 5360, |E| = 17067. Successfully matched 2004 in 2m41s (18s to scan facts, 2m23s to solve matching).
----- Conclusions -----
Notice:
- a 2.22x increase in edges caused a 5x increase in matching time
- a 4.35x increase in edges caused a 28.6x increase in matching time
The feature shows you the complexity information before starting the matching so you can get a rough idea of how long it will take. If it's too large, try matching with a smaller db or against fewer cards.
The idea of using the assignment problem to allow costs for pairings is shelved for now, since that would be far slower ( O( |V|^3 ) ) and thus wouldn't be useful for decks of any real size. I still like the idea of filtering by sentence length though.
