smart.fm corpus - Printable Version
+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Learning resources (http://forum.koohii.com/forum-9.html)
+--- Thread: smart.fm corpus (/thread-3959.html)

smart.fm corpus - dawhite - 2009-09-09
Nonono, see, you DON'T need to know where words start and end. No Mecab required.

If we were trying to find a list of all the words in the unsuspended sentences and then seeing whether any of those words were in our vocab deck, then we would need to know where words start and end, true. HOWEVER, what WE'RE doing is getting a list of the vocab words that we care about directly from our vocab deck and then just seeing if they are in any of the unsuspended sentences in the sentence deck AS A SUBSTRING. It's true that it may sometimes look like one of our vocab words is in a sentence when really it's only there as a substring of another word, but that's a hit I'm willing to take.

smart.fm corpus - dawhite - 2009-09-09
ruiner Wrote: [dawhite Wrote: It is a rare species of antelope.] Ohh, your c2k deck is vocabulary only? Can't you just make a list of words you don't know when you're doing Tae Kim sentences and then tell the script bombpersons wrote to unsuspend the cards in the C2k deck that have those listed words? Not sure why you want to use Tae Kim as a guide for vocab anyway, best to just use grammar cards for grammar. Don't need to memorize the words in grammar sentences as long as you get the gist from the translation enough to understand the points.

Exactly. All I'd like is a way to automate the process of making that list of words I don't know from Tae Kim and then running the script.

As far as Tae Kim and vocab are concerned, well... I'd like to be totally comfortable with the grammar sentences, and I see no downside to learning the vocab before KO2001. I'll have to learn it some day anyway.

smart.fm corpus - Nukemarine - 2009-09-09
Ruiner, I think he's doing what I'm doing: Having multiple decks that test different things. I have a vocabulary deck that I fail if I miss the ONE VOCABULARY WORD that's being tested on that card, and not on all the words in the sentence using that word.
I have a grammar deck that I fail if I miss the GRAMMAR CONCEPT that's being tested. I don't fail because I did not know a word, because I know I'm testing that in my vocabulary deck. I have a kanji deck that I fail if I don't know how to write out that kanji. The grammar and vocab decks are not testing my writing/recognition of individual kanji. Make sense?

So he wants words in his grammar deck that are on activated cards to be activated automatically in his vocabulary deck. Personally, I caution against doing this automatically, as Anki's search feature is pretty fast. Plus, you may activate the wrong word. Not a big deal, as you can re-suspend it.

PS: Yes, I know, some people like to get every part of a sentence correct. I'm just not that way, and things went faster when I started keeping separate decks for separate jobs.

smart.fm corpus - ruiner - 2009-09-09
bombpersons Wrote: [ruiner Wrote: Ahem, okay, how do I uh, make it work? I installed Python ;p and I got as far as something about an sqlmodule that I don't have? Can't I push a button and make everything work?] You need to download and install sqlalchemy, http://www.sqlalchemy.org/download.html (this is what anki uses to access the sqlite anki decks).

Now apparently I need setuptools? Can't seem to get that working. Meh, too much trouble, hehe. I only do like 15 words at a time, I don't mind browsing through myself. As a tool for a larger project though I think it's immensely useful. I'll have to keep my lists of words for sharing once I'm done w/ episodes/movies/etc. Though I guess they'll be incomplete since I don't mark down all the words, it can be added to the hive mind for parsing.

@dawhite Seems like a lot of trouble with little payoff!

smart.fm corpus - dawhite - 2009-09-09
Yeah. Incidentally, Nuke, you were my inspiration for separate decks. So... props. As far as Anki and search, sure, it's fast, but typing in each individual word isn't.
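
For readers following along, here is a minimal sketch of the substring check dawhite describes at the top of the thread. It is not the script bombpersons wrote and it does not touch Anki's API: cards are modelled as plain dicts, and the function name and the `word`, `text` and `suspended` keys are made up for illustration.

```python
def words_to_unsuspend(vocab_cards, sentence_cards):
    """Return suspended vocab words that occur, as plain substrings,
    in at least one unsuspended sentence."""
    active_sentences = [c["text"] for c in sentence_cards if not c["suspended"]]
    return [
        c["word"]
        for c in vocab_cards
        if c["suspended"] and any(c["word"] in s for s in active_sentences)
    ]


# Hypothetical stand-ins for a vocab deck and a sentence deck.
vocab_cards = [
    {"word": "水", "suspended": True},
    {"word": "行く", "suspended": True},
    {"word": "食べる", "suspended": True},
    {"word": "犬", "suspended": True},
]
sentence_cards = [
    {"text": "水曜日に学校に行く。", "suspended": False},
    {"text": "パンを食べました。", "suspended": False},
    {"text": "犬が好きです。", "suspended": True},
]

matches = words_to_unsuspend(vocab_cards, sentence_cards)
print(matches)
```

In this toy data 水 matches only because it is part of 水曜日 (Wednesday), which is the kind of false hit dawhite says he is willing to take. 食べる is missed because the sentence uses the conjugated form 食べました, and 犬 is skipped because its only sentence is suspended.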

smart.fm corpus - bombpersons - 2009-09-09
dawhite Wrote: Nonono, see, you DON'T need to know where words start and end. No Mecab required.

あ、そうか! Ok I think I might be able to do that.

smart.fm corpus - ruiner - 2009-09-09
Nukemarine Wrote: Ruiner, I think he's doing what I'm doing: Having multiple decks that test different things.

Eh? I do this too, though I don't think you necessarily need to separate the decks for it (that's another thread, tags vs. multiple decks, etc). However, if you're having each deck act as a complement to the other, you should be smart about it. Seems to me that subordinating the vocabulary you learn to grammar cards isn't a good idea.

@dawhite, I just copy paste, myself. Copy paste as I come across unfamiliar words, then copy-paste into Anki's browser. However, I do think that if it's a project that multiple users will find useful (i.e. "vocabulary from Death Note"), finding a way to automate this would be good.

smart.fm corpus - dawhite - 2009-09-09
@ruiner Not at all. Let's say it takes me 20 seconds to type a vocab word I don't know into one of those lists, and let's say I'm going to learn, say, 2000 vocab words before I can just start picking it up from context (a low estimate, btw). That's 40,000 seconds, or about 11 hours of typing that the script just saved me. You can do the math if you don't believe me.

Now, forget about me. Let's suppose that, I don't know, ten more people find this script useful. That's another 110 hours or so, more than four and a half man-days, that we just saved. This time can now be used for watching porn.

Also, re: the "lot of work" -- not at all. This is one of those cases where the algorithm takes forever to type out clearly but can be accomplished in a few lines; the only missing part is whatever part of the API lets me know whether or not a card is suspended.

smart.fm corpus - dawhite - 2009-09-09
@bombpersons: My hero!

@ruiner: Yes, subordinating vocab to grammar would be teh suck.
However, this is going to be helpful even beyond Tae Kim and into KO2001 and even sentence mining.

smart.fm corpus - dawhite - 2009-09-09
Also, bonus points if anyone can tell me how to, given a string A, where A is a substring of one or more fields of a question in Anki, make A bold wherever it occurs in the question (i.e. if I have a word I want to focus on in a particular sentence, how can I bold it every time it shows up in that sentence?)

This would be really simple if you knew:
1) A way to get at the individual text fields in the question, and
2) What format Anki uses to make things bold

I'm sorry I keep asking all of these questions, btw... is there some kind of Anki python library documentation?

EDIT: Shit. On second thought, this one wouldn't actually work without Mecab, since if a word gets conjugated then it wouldn't get bolded.

smart.fm corpus - nest0r - 2009-09-09
dawhite Wrote: @ruiner

I'm still confused about what it is you're wanting. I think in part because you popped in with this particular workflow while I'm caught on my thread's target logistics. Sometimes I read a comment and I think I get it and think we want the same thing, then I re-read and get confused and think you want the opposite. Bombpersons seems to understand what you're saying, so I'll see what they come up with and work out possible uses. I'll probably realize this is related to discussions we've had in other threads and I've been wasting my time. ;p

smart.fm corpus - dawhite - 2009-09-09
Oh, I haven't 100% been following your thread (slightly above my head). BUT, just to reiterate: what I want is, given a deck with sentences and a deck with isolated vocab words, to unsuspend the isolated vocab words in the vocab deck that feature in the unsuspended sentences in the sentence deck.

nest0r Wrote: [dawhite Wrote: @ruiner] I'm still confused about what it is you're wanting. I think in part because you popped in with this particular workflow while I'm caught on my thread's target logistics.
Sometimes I read a comment and I think I get it and think we want the same thing, then I re-read and get confused and think you want the opposite. Bombpersons seems to understand what you're saying, so I'll see what they come up with and work out possible uses. I'll probably realize this is related to discussions we've had in other threads and I've been wasting my time. ;p

As I'm reading through Tae Kim today and finding out how much everything varies based on context, I'd also sort of like to be able to go a step further and, given this list of vocab words we just unsuspended, apply it to a THIRD deck (a "vocab in context" deck, I guess) containing the original sentences, so that the unknown vocab words get bolded in those sentences.

However, as I'm typing this, I just realized that NEITHER of my ideas is going to work without Mecab because of conjugation. I.e. if a vocab word is in a sentence, but not in dictionary form, then it won't get identified by comparing it against the vocab list. I guess now I need to figure out Mecab...

smart.fm corpus - nest0r - 2009-09-09
@dawhite: Hmm, maybe what you're talking about then is this. Take, for example, a subs2srs deck where you unsuspend some sentences and want to learn the unknown words in those sentences from a Core deck. Instead of creating a list (one by one or in groups) from the subs2srs deck and then using that list as a whitelist to unsuspend cards in the Core deck, you can actually do it this way: Use the Core deck's suspended cards as a whitelist, looking for matches in the unsuspended cards of the subs2srs deck. Conjugations wouldn't matter as much because the Core deck could have the dictionary form *and* the sentence w/ its conjugation, right? iknow's importer has that option, no? So then you've actually got a broad whitelist from the Core deck that's pre-parsed, then it scans the unsuspended sentences in the subs2srs deck, and when there are matches, it unsuspends its own cards that match?
That could work!

dawhite Wrote: Oh, I haven't 100% been following your thread (slightly above my head). BUT, just to reiterate: what I want is, given a deck with sentences and a deck with isolated vocab words, to unsuspend the isolated vocab words in the vocab deck that feature in the unsuspended sentences in the sentence deck.

This is kind of what I was getting at here: http://forum.koohii.com/showthread.php?pid=67839#pid67839 and with my audio ecology 60/30/10 thread, but my scripting abilities are bleh.

smart.fm corpus - dawhite - 2009-09-09
Bam. That's exactly it.

The one caveat is the conjugation thing: you COULD make a core deck that included the word in a sentence, but unfortunately the conjugation in that sentence isn't guaranteed to be the same as the conjugation in the Subs2SRS sentence. However, I found a nifty Mecab sample by RadicalTyro on the forums and, although I'm juggling four or five things right now, I think using Mecab to handle conjugation issues shouldn't be that hard. I'll be posting some messy code on here in fifteen to thirty minutes that should do it, hopefully. I just have to get my stupid computer to start displaying Japanese characters first. Minor detail.

smart.fm corpus - bombpersons - 2009-09-09
Right, I think I have something working. The only problem is that anki automatically puts html into the answer and question of a card, so now I have to strip away the html from the question side.
Anyone know how to do this easily?

smart.fm corpus - nest0r - 2009-09-09
dawhite Wrote: Bam. That's exactly it.

Well, if the Core deck has both sentence and vocabulary sections, then you have one possible conjugation (in the sentence) and the dictionary form (via however smart.fm formats it), so at least you've narrowed it down by one possibility, though I can see how integrating a dictionary here could be handy w/ Mecab somehow. Plus this way you have the option of, when the Core cards are unsuspended, studying either vocab or sentence.

Oops, when did I log in as nest0r?

smart.fm corpus - dawhite - 2009-09-09
Lol. Who are you supposed to be exactly?

smart.fm corpus - bombpersons - 2009-09-09
Meh, I almost have it working, just need to sort out an XML error (looks like something to do with the encoding). Here's the code so far. I'm going to sleep now, so you can try and fix it if you like.
http://massmirror.com/e9e5a420137397a298767a8c0b7ea927.html

smart.fm corpus - ropsta - 2009-09-09
dawhite Wrote: Lol. Who are you supposed to be exactly?

We call him/her/it "ruiner". Watch out, you just might read something that could ruin your life.

smart.fm corpus - dawhite - 2009-09-09
bombpersons Wrote: Meh, I almost have it working, just need to sort out an XML error (looks like something to do with the encoding). Here's the code so far. I'm going to sleep now, so you can try and fix it if you like.

Ergh... I can't debug for crap =/. However: AMAZING job so far! Also, I almost have the Mecab stuff to the point where it makes sense. I feel like this is going to be crazy badass.

smart.fm corpus - dawhite - 2009-09-09
OK, right. My Mecab file is here: http://massmirror.com/e8d675b2f91d18923ad103bd02fdddaf.html

To make it work, you'll need the rest of radical_tyro's script, of which this is a minimal modification.
This can be found here: http://forum.koohii.com/showthread.php?tid=3689

Given a text file containing your sentences and a text file containing your vocab words, it outputs a text file containing the vocab words that were in at least one of the sentences. This SHOULD take things like conjugation and automatic word parsing into account, since it's using Mecab.

So, theoretically, all we need to do for a totally working product is have the following sequence of transformations:

1) Deck with isolated vocab words -> text file containing the SUSPENDED members of that deck
2) Deck with sentences -> text file containing the UNSUSPENDED members of that deck
3) Applying this script, these two text files -> text file containing the suspended vocab words from step 1 that were in some sentence from step 2
4) We apply the text file from step 3 to the deck of isolated vocab words to unsuspend all of the vocab words that were found to match.

bombpersons did step 4 already. I think steps 1 and 2 are handled by what he just (heroically) wrote, which I'll have to look into. So, once that's fixed, all we have left is step 5, which is to tie it all together, which I, hilariously, do not know enough python to do.

Also, a step 6, which would be to take the list of matching words and to make those bold in the sentences, would be AMAZING. Now that we have Mecab figured out, finding which words to bold would be easy, but I'm still not sure how to change the formatting.

######################
I should probably mention that my code is completely untested and I would have no idea how to debug it if it were broken, since I don't know python.
##############

EDIT: bombpersons, I just read through your script again. Is "card.priority > -3" the part that checks to see if it's suspended? In that case, does card.priority <= -3 mean that it's UNsuspended? Also, I'm not quite sure what's up with this "parsing" thing, and I can't see where "char" is coming from... I am not so hot at this language.
Does this export both full word and reading (or would it, if it worked)? Now... if we could only figure out how to bold those words...

smart.fm corpus - mafried - 2009-09-09
You'll really want to use mecab for this. You'll find everything you'll need in the anki sources. I'm afraid I don't have the time to work on these sorts of plugins anymore (or the motivation, since I'm no longer using SRS the way I used to). But my old JSPfEC script does some mecab analysis that may be a good starting point for what you want to do here.

smart.fm corpus - dawhite - 2009-09-09
I think I actually have the mecab part nailed (see my attached script). The problem is just figuring out which cards are suspended and which are unsuspended, and also figuring out how to bold text in an anki field.

smart.fm corpus - mafried - 2009-09-09
Suspended facts have the 'Suspended' tag, right? In any case you can ask Damien about this.

For bold text, you'll just have to insert the HTML around the word in the field.

smart.fm corpus - dawhite - 2009-09-09
Woah! Woah woah woah that is freaking AWESOME. Now, how would one go about accessing tags?
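
As a rough illustration of mafried's suggestion to "insert the HTML around the word in the field", here is a small sketch that wraps each literal occurrence of a word in `<b>` tags. It works on a plain string only; the `bold_word` helper is hypothetical, and reading or writing the actual field through Anki's API is not shown because nobody in the thread has posted how to do that.

```python
def bold_word(field_text, word):
    """Wrap every literal occurrence of `word` in <b>...</b> tags.

    This is plain string replacement: it knows nothing about conjugation,
    and it would also match text inside any HTML tags already in the field.
    """
    if not word:
        return field_text
    return field_text.replace(word, "<b>" + word + "</b>")


question = "学校に行く。明日も行く。"
bolded = bold_word(question, "行く")
print(bolded)

# A conjugated form is left untouched, which is the limitation dawhite
# points out: 食べました does not contain the dictionary form 食べる.
unchanged = bold_word("パンを食べました。", "食べる")
print(unchanged)
```

To bold conjugated forms as well, the word would first have to be located with something like Mecab, as discussed above, and the surface form found in the sentence passed in instead of the dictionary form.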