Hello,
The vocabulary review is very good idea
However would it be possible to select only the kanjis that user has reviewed successfully N times (configurable:1 to 7) in the kanji flashcards, instead of just all the kanji under a certain Heisig number ? The more so as that as far as i am concerned I skipped some kanji on purpose to focus on 800 or so (top 500 plus JLPT4 plus the primitives necessary to make them)
Trust me I'd love to find a way to do this. I haven't figured an effective way with MySQL to find words made only with an arbitrary set of kanji, between 130000+ dictionary items and a set of potentially 2000+ kanji.
If someone figures a way to do this with the JMDICT schema please let me know. I'm using a much simpler schema for this website but the main logic would be the same.
Today already for the vocab shuffle we have to input a max Heisig number : I believe then that you can filter words which contain only kanji with Heisig number below the number we input. Am I right ?
I may miss something, but then isn't there a way to eliminate words containing kanji that we have not reviewed successfully yet N times (N between 1 and 7 for example), but using the site's flashcard data ?
I'm not a programmer, but isn't it possible to just add a step after the db is queried, where the word's kanji is tested for whether it meets dmichaud's criterium (reviewed n times successfully)? Then, if 'yes', it would display the word; if 'no', it would grab another word from the db and repeat the test.
Oniichan wrote:
I'm not a programmer, but isn't it possible to just add a step after the db is queried, where the word's kanji is tested for whether it meets dmichaud's criterium (reviewed n times successfully)? Then, if 'yes', it would display the word; if 'no', it would grab another word from the db and repeat the test.
Its not that simple really mostly because of processing time. I also don't know how the vocab card thing is setup either, if theres a deck or what, but lets assume there's a deck of vocab words. Example:
Say we have a list of known kanji, this is the "Known set" and we'll call it K.
Then we have a list of possible words who's kanji all appear in K. This "Possible Set" is P.
And finally there is the deck we are studying, D.
Every time we add a knew kanji we learned to K, we have to re-scan the dictionary and add any new words which match all characters in K, to set P. Then you can choose what to add to D.
As you move through the dictionary you are checking against K to see if you know all the kanji in the word. While K is small, this search can be done quickly, but as the number in K grows, there will be a point where it takes more time. Interestingly enough though, as K increases in size though there will be a point where the search time begins to decrease because you know so many kanji. (This is my take on it from just thinking about it a bit)
It actually gets worse though because as you add more words your deck, you also have to run a check on D to make sure the returned word isn't one you already have.
I don't know if MySQL is actually optimal enough to deal with a problem like this, but data storage is critical in optimizing the algorithm to return results fast enough; which MySQL may not be set up to effectively handle.
So its not really as easy as one might think and if you have ~100 people requesting new words every few minutes, or even seconds, the server is going to come under some strain.
-------
If I were to actually build a system that would allow me to pull out word matches based on a known set of Kanji. I would probably go through and break my DB up. Word entries wouldn't be a single row in a table. What I would do is assign each entry an EntryID and dump the entry into a separate table called 'gloss' or something. Then take the individual word(s) and split them up into single character pieces and give each character a PositionID.
Example:
漢字 [かんじ] Chinese characters
"[かんじ] Chinese characters" might be given an EntryID of 752 in Entry table.
漢 has PosID of 1 and EntryID of 752 in table Split
字 has PosID of 2 and EntryID of 752 in table Split
You can even skip out on the PosID and just rely on "backup" copy of the word in the Entry table to tell you how the word was composed.
The Character table will serve as you search table.
Then you feed MySQL a fancy SELECT JOIN pointing to a user's set of K and have it only spit out those entries where all characters with the same EntryID intersect with set K.
There may very well be easier ways to do this but this is the idea that just popped to mind for me.
Again, today already it works for filtering based on a max heisig number. What is current test to make it work as it is now ?
In my naive assumption, my idea would mean replacing something like (sorry for using a non normalised pseudo language...)
result:=true
for all kanjis in word
if kanji.heisig > max_heisig then result:=false
end for
return result
by
result:=true
for all kanji in word
if kanji.flashcard_status(user) < min_review then result:=false
end for
return result
this is not MySQL, sorry, but i could not understand in current model where are the individual kanjis... (nowhere ?)
Oniichan wrote:
Then, if 'yes', it would display the word; if 'no', it would grab another word from the db and repeat the test.
No generally you can't run loops like that in MySQL as this is terribly inefficient. You have to keep in mind that a php page is meant to run under a tenth of second usually, because you have potentially hundreds or thousands people hitting that page in a short amount of time. Each sql querry is hitting tables, some can be millions of rows (the flashcard table on this site has one row per card per user). So when you run loops like that you have to know the limit.
For example you can imagine this forum has a php page execution every second at least, if not from you, it's going to be hit by Google's and many other web crawlers and archivers. A high activity forum would see hundreds or more hits per second. Now imagine one of those is running 100+ querries in a loop... the site will be very sluggish, and the server may eventually fail under pressure (running out of processes or memory).
dmichaud wrote:
I may miss something, but then isn't there a way to eliminate words containing kanji that we have not reviewed successfully yet N times (N between 1 and 7 for example), but using the site's flashcard data ?
No for this particular implementation, a list of vocab cards is pulled together once before the session begins. The list contains the ID's of the dictionary entries.
To filter out results means I would have to pull basically all possible results, whereas the current implementation is able to limit the results significantly because of the 130000+ dictionary entries only a few thousands match any given common character. Lots of entries are made of obscure characters. And lots of kanji in Heisig only have a handful of results that are "priority" entries (top 10000 words appearing in newspaper).
vix86 wrote:
If I were to actually build a system that would allow me to pull out word matches based on a known set of Kanji.
Yes I have a table like that with the split readings:
- dict id > the word ID
- kanji id > unique ucs_2 (unicode) unsigned id for storage
- pron id > pronunciation id referring another table, also for storage but basically it's simply the furigana (a good idea because there is only so many distinct readings)
- pos > as you suggested from 0-3 usually
- type > On or Kun or Kana ![]()
So one problem is yes you can make a join between the flashcard deck and that table USING( kanji id) (pseudo code). However that doesn't tell us if there are other kanji in that same word which are not known.
"all characters with the same EntryID intersect with set K" that's the key.
One vague idea yesterday was, maybe add the character count to those rows, so that I can group by dict id, and ... actually it might work but I'll stop here cause it's Sunday I'll give it another good look after the Hanzi forum is up ![]()
Shouldn't you be able to to make a query like this:
SELECT dict_id FROM word_table as wt
-- Include only words that DON'T have any kanji that AREN'T in the user's list
WHERE NOT EXISTS
(SELECT * FROM split_table as st
WHERE st.dict_id = wt.dict_id AND
kanji_id NOT IN
(SELECT kanji_id FROM user_kanji WHERE user_id = @userId)
)
-- Include only words that have at least 1 kanji
AND EXISTS
(SELECT * FROM split_table as st2
WHERE st2.dict_id = wt.dict_id AND st.type <> 'kana')
Or is the problem that that would be too slow?
(Bold part added afterward)
Last edited by JimmySeal (2012 June 24, 1:47 pm)
JimmySeal wrote:
Shouldn't you be able to to make a query like this:
SELECT dict_id FROM word_table as wt
WHERE NOT EXISTS
(SELECT * FROM split_table as st
WHERE st.dict_id = wt.dict_id AND
kanji_id NOT IN
(SELECT kanji_id FROM user_cards WHERE user_id = @userId)
)
Or is the problem that that would be too slow?
My SQL is pretty bad, and without having a DB in front of my to test it with I can't say. But wouldn't you want that to be "kanji_id IN (SELECT kanji_id FROM user_cards...)"? Since you want to match words from known kanji. You'll also need another conditional in there to check the words aren't already in the user's deck.
Bit I'm not sure on though is, wouldn't that spit out words that have one kanji that intersects?
Example:
User "Bob"
Known Kanji: "火カ家字"
Search hits "漢字" which he has no vocab card for.
字 matches with Known list and so will return a dict_id for 漢字 regardless of the fact that 漢 isn't in the Known list.
vix86 wrote:
My SQL is pretty bad, and without having a DB in front of my to test it with I can't say. But wouldn't you want that to be "kanji_id IN (SELECT kanji_id FROM user_cards...)"?
No, you're misinterpreting it. I wrote it to precisely not do this:
Bit I'm not sure on though is, wouldn't that spit out words that have one kanji that intersects?
The query is written to find all words that don't contain a kanji that's not in the user's list. Though I guess you'd need an additional condition to exclude words that don't contain any kanji at all (fixed above).
Example:
User "Bob"
Known Kanji: "火カ家字"
Search hits "漢字" which he has no vocab card for.
漢字 contains 漢 which is not in the user's list of characters, so this word is excluded from the results.
Last edited by JimmySeal (2012 June 24, 10:32 am)
I'd have to run some tests with subquerries to see what the cost is.
I see what you're doing there, it's smart... but the not not melts my brain ![]()
The query looks inefficient but with the right indexes it might be workable. Main problem with subquerries I think is it creates temporary tables that may have to be written to disk.
EDIT: Thanks, on further thought I think it would work. (also the outer querry is a LEFT JOIN / IS NULL variant).
@JimmySeal:
Ah! I get it now, haha. That is quite clever I must say.
JimmySeal wrote:
Shouldn't you be able to to make a query like this:
SELECT dict_id FROM word_table as wt
-- Include only words that DON'T have any kanji that AREN'T in the user's list
WHERE NOT EXISTS
(SELECT * FROM split_table as st
WHERE st.dict_id = wt.dict_id AND
kanji_id NOT IN
(SELECT kanji_id FROM user_kanji WHERE user_id = @userId)
)
-- Include only words that have at least 1 kanji
AND EXISTS
(SELECT * FROM split_table as st2
WHERE st2.dict_id = wt.dict_id AND st.type <> 'kana')
Or is the problem that that would be too slow?
(Bold part added afterward)
Looks good compared to my initial request, thanks.
I'd just like to precise the criteria for selection the user kanji list
I' d like to to restrict the list only to the one which have been reviewed successfully Min_review_level (parameter) times without errors in the RTK1 kanji flashcards (= the ones appearing in column >= Min_review_level of progress bar chart...)
so
(SELECT kanji_id FROM user_kanji WHERE user_id = @userId) :
would become something like
(SELECT kanji_id FROM user_kanji WHERE (user_id = @userId AND user_kanji.progress > Min_review_level)
Possible ?
Do you really need this level of granularity dmichaud? Or could we simplify to a fixed N consecutive succesfull reviews (ie. stack 3 or 4)?
ファブリス wrote:
Do you really need this level of granularity dmichaud? Or could we simplify to a fixed N consecutive succesfull reviews (ie. stack 3 or 4)?
If this a problem, yes we can can fix the number of review
3 or 4 is a good choice.
(in my case some kanji are repeatedly going 1-2-0-1-2-0-1-0 but once i reach stack 3 or 4 they rarely go back to 0)
Last edited by dmichaud (2012 June 27, 5:50 pm)
The query works, EXPLAIN shows two "DEPENDENT SUBQUERY". Without any keys on the main dictionary table it takes 0.84 seconds for 4929 results (500 kanji), 1.49 seconds for 63630 results (2000 kanji). So it's looking good, will have to spend more time finding ways to not exclude the kana, and what indexes/keys works best.
ファブリス wrote:
The query works, EXPLAIN shows two "DEPENDENT SUBQUERY". Without any keys on the main dictionary table it takes 0.84 seconds for 4929 results (500 kanji), 1.49 seconds for 63630 results (2000 kanji). So it's looking good, will have to spend more time finding ways to not exclude the kana, and what indexes/keys works best.
good news ! (what is "EXPLAIN" ?)
I am eager to trying it....
It gives so info about how MySQL compiles the query. Most importantly it shows what keys MySQL uses.
I'm working a different query that is easier to use with the kanji count approach I hinted in a previous post. I couldn't get it below 0.60sec with 2042 kanji due to a GROUP BY clause, but there is a way to reduce further by including an extra column in the indexes. On the other hand with few kanji it can get as low as 0.01sec.
That said JimmySeal's suggestion is very useful because I can compare the results, and the solution I'm using makes use of one subquery, which I never used before on this site. ![]()
VocabShuffle session with words made only of known kanji is working. I'm adding the "learned" kanji count to the Labs page, otherwise it may be confusing as this feature only works when kanji have been reviewed for a little while.
Do you mean you released it ?
I did now ![]()
I tried it, perfect and very fast for me (540 - only- kanji learned) ! (even though showing some complicated words, but i believe there is no other way to reduce to more common words). Thanks a lot !
Just a few detailed comments :
- when clicking on a kanji, it does not open a new window/tab -> when we come back it restarts from 1/20 -> could it open a new tab / window ?
- when pass the mouse on one kanji, the heisig# and keyword is displayed. But keyword is the original one, could it be the one user have input, if any ?
Regarding complicated words:
At the moment I am filtering words based in ichi1 + news1 + news2.
JMDICT wrote:
* - news1/2: appears in the "wordfreq" file compiled by Alexandre Girardi
* from the Mainichi Shimbun. (See the Monash ftp archive for a copy.)
* Words in the first 12,000 in that file are marked "news1" and words
* in the second 12,000 are marked "news2".
* - ichi1/2: appears in the "Ichimango goi bunruishuu", Senmon Kyouiku
* Publishing, Tokyo, 1998. (The entries marked "ichi2" were
* demoted from ichi1 because they were observed to have low
* frequencies in the WWW and newspapers.)
So it is possible to reduce further the pool of available words however that leaves many kanji with no example words, when adding example words to flashcards. That said this may be less of an issue for Vocab Shuffle.
when clicking on a kanji, it does not open a new window/tab -> when we come back it restarts from 1/20 -> could it open a new tab / window ?
The idea is that you use middle-click of shift click to open to a new tab. I would prefer not to force the user to use a new window, however maybe that's better for this review mode, I'll try it out. Likewise for the "search on google" , should probably add it there too.
when pass the mouse on one kanji, the heisig# and keyword is displayed. But keyword is the original one, could it be the one user have input, if any ?
Sure, looks like I have forgotten the custom keywords there.

