
Japanese Article Text to Anki Deck

#1
Hello,

After reading nest0r's post on how he goes through relevant vocabulary before working on his subs2srs deck, I had an idea. I am sick of looking up words while reading an article, marking them, putting them into Anki later, etc. All this work keeps me from actually enjoying the reading.

Why not parse the text, create an Anki deck containing the vocabulary, study it, and THEN read the article? I don't know if such a tool already exists, but I made one on the fly. It's not really beautiful, and probably buggy as well, since I put it together in less than an hour. It uses EDICT for the sentence lookup and then simply converts the results into an appropriate text format. I don't know what the maximum supported text length is.

Anyway, if you think this is a good idea I can improve the tool a little: make it easier to use, include some more features, etc. If it's not useful, well, it doesn't matter.

Here it is:
http://dennybritz.com/texttotsv/

Type or copy in the text, click Run, then click "Get Tab Separated File". Then import the file into Anki (fields: Kanji/Kana/Meaning).
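For anyone curious what a tool like this does under the hood, here is a minimal sketch of the same pipeline in Python. Everything here is an assumption about the implementation: `EDICT_SAMPLE`, `parse_edict`, and `text_to_tsv` are hypothetical names, the three entries stand in for the full EDICT file, and matching headwords by simple substring search stands in for proper morphological parsing.

```python
import re

# A few entries in EDICT's line format: KANJI [KANA] /gloss/
# (a real run would load the full EDICT file; these are illustrative)
EDICT_SAMPLE = [
    "学生 [がくせい] /(n) student/",
    "新聞 [しんぶん] /(n) newspaper/",
    "読む [よむ] /(v5m) to read/",
]

def parse_edict(lines):
    """Build {kanji: (kana, meaning)} from EDICT-format lines."""
    entries = {}
    for line in lines:
        m = re.match(r"(\S+) \[(.+?)\] /(.+)/", line)
        if m:
            kanji, kana, meaning = m.groups()
            entries[kanji] = (kana, meaning)
    return entries

def text_to_tsv(text, dictionary):
    """Return tab-separated rows (Kanji, Kana, Meaning) for every
    dictionary headword that appears verbatim in the text."""
    rows = []
    for kanji, (kana, meaning) in dictionary.items():
        if kanji in text:
            rows.append(f"{kanji}\t{kana}\t{meaning}")
    return "\n".join(rows)

dictionary = parse_edict(EDICT_SAMPLE)
print(text_to_tsv("学生は新聞を読む", dictionary))
```

The TSV output maps directly onto the Kanji/Kana/Meaning fields Anki expects on import; a real version would also need to handle inflected forms, which the substring match above does not.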

EDIT: Features to include
x (DONE) Checkboxes to select vocabulary for import
x (DONE) Filter automatically based on JLPT Level or custom deck (such as Core2000, Core6000, etc)
- Support longer text
- Custom Filter List
- Kanji Lookup
- Example sentences for Vocabulary
- Cross references with RTK, Audio, etc
- URL Fetch
- File Upload
Edited: 2010-04-15, 11:44 pm
#2
You'll end up with many words you already know, too. You're better off using Rikaichan's 'save entry to file' option, then importing that file into Anki later.
#3
Yeah, that's what I thought. To be honest, though, I find using Rikaichan annoying.
That's why I thought of including an option like

"Don't include words below JLPT2/JLPT1", "Don't include words contained in Core6000", etc..

Also, I think this tool is not very useful for conversational texts, since one often knows the majority of the words. When I read the news, however, I encounter A LOT of words I don't know. And skipping/suspending a word I already know in Anki takes less than a second, much less time than messing around with Rikaichan.

Of course, if your Japanese level is already around JLPT1 or higher, this might not be useful at all, since you will know the majority of the words. It might be useful for people at lower levels, though.
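The JLPT/Core filtering idea above could be sketched like this; `CORE_SAMPLE` and `filter_known` are hypothetical names, with a two-word set standing in for a real Core2000 or JLPT-level word list.

```python
# Stand-in for a real Core2000/JLPT word list loaded from a file.
CORE_SAMPLE = {"学生", "読む"}

def filter_known(rows, known_words):
    """Keep only TSV rows whose first field (the kanji headword)
    is NOT already in the chosen known-word list."""
    kept = []
    for row in rows:
        headword = row.split("\t")[0]
        if headword not in known_words:
            kept.append(row)
    return kept

rows = ["学生\tがくせい\tstudent", "新聞\tしんぶん\tnewspaper"]
print(filter_known(rows, CORE_SAMPLE))
```

The same function works for any list: a JLPT level, Core6000, or an exported deck, as long as it can be reduced to a set of headwords.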
Edited: 2010-04-15, 11:08 am
#4
While I do agree you'll end up with a lot of words you already know, for relatively short texts I can definitely see the use of this.

In fact, it now makes it a lot easier to go through srt subs and see if there are words I need to study.

Is there some way to run this offline?

Also, a suggestion: checkboxes that determine which entries are actually included in the final TSV; that way you can save a lot of time going through and deleting the words you already know.

Also, a note about the size: I copy/pasted the entire Haruhi novel and got this error:
http://pastebin.org/152419
Edited: 2010-04-15, 11:16 am
#5
Quote:Is there some way to run this offline?
Since it uses an online dictionary lookup, no. I could make a little program you can run offline, but you would still need an internet connection for the lookup. To make it run fully offline I would need to bundle the dictionary files, which is, well, not worth it, I think.

Quote:Also, a suggestion: checkboxes that determine which entries are actually included in the final TSV; that way you can save a lot of time going through and deleting the words you already know.
Yes, I thought about that as well. It's VERY easy to implement, but then I thought:
instead of going through the list and checking off words you know, why not just suspend the card in Anki once you come across it? That's probably faster. What do you think? But like I said above, I would like to include a feature to filter based on JLPT level or other decks.
#6
I don't know if this is a coincidence but the thread I created (which came right before this one) was about something similar.

http://forum.koohii.com/showthread.php?tid=5329

I have almost the same idea as you, except I'd like to do this for individual kanji instead of words/phrases.

That way I can study all of the kanji that appear in a Japanese subtitle script BEFORE I start reading it, then study them as an Anki deck and figure out which kanji I don't know (and have my RevTK stories cross-referenced in the deck).
#7
Quote:I have almost the same idea as you, except I'd like to do this for individual kanji instead of words/phrases.
That would be the same code, just a different lookup type (Kanji instead of Sentence Lookup). I could include this pretty easily.
#8
Check out the edit about the length, by the way.

Also...
Yes, I do agree with having a "not below JLPT-something" filter, but I would personally like checkboxes.
The reason is that I probably wouldn't go straight from this to Anki; I would run the output through my smart.fm sentence-selector tool, which is similar to this, except incredibly primitive and manual. That, or ALC, etc...
#9
Quote:Also, a note about the size (I copy/pasted the entire Haruhi novel) and got this error:
http://pastebin.org/152419
Thank you. The length is not really a problem; I was just too lazy to implement handling for it. The tool can split up the text, look up each part, then combine the results again, so any text size can be supported. I didn't want to go through the trouble, though, before finding out whether anyone is even interested in this.
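The split/look-up/recombine approach described above might look something like the sketch below; `chunk_text` and its `max_len` default are assumptions for illustration, not the tool's actual code.

```python
def chunk_text(text, max_len=1000):
    """Split text into pieces of at most max_len characters, cutting at
    Japanese sentence boundaries (。) where possible so each dictionary
    lookup still receives coherent sentences. A single sentence longer
    than max_len is kept whole rather than cut mid-sentence."""
    chunks, current = [], ""
    for sentence in text.replace("。", "。\n").splitlines():
        if len(current) + len(sentence) > max_len and current:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be sent through the lookup separately and the
# resulting TSV rows concatenated, so a whole novel stays within any
# per-request size limit.
print(len(chunk_text("あ。" * 10, max_len=5)))
```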

EDIT: Alright, I can include checkboxes as well.
Edited: 2010-04-15, 11:22 am
#10
Well, personally it's not really necessary for me; I just thought I'd put it to the test and see if it could handle the entire novel...

Anyway, I might be back later tonight, but I gotta head out for a while.
Really, all I can think of is checkboxes, JLPT, maybe some other "weed out" options, and stuff.

I'm personally really looking forward to this
#11
ThomasB Wrote:
Quote:I have almost the same idea as you, except I'd like to do this for individual kanji instead of words/phrases.
That would be the same code, just a different lookup type (Kanji instead of Sentence Lookup). I could include this pretty easily.
That would be great if you could include this.
If not, I can finish writing my Java program to do the same thing.
#12
ThomasB Wrote:Anyway, if you think this is a good idea I can improve the tool a little bit. I.e., make it easier to use, include some more features, etc. If it's not useful, well, it doesn't matter Wink

Here it is:
http://dennybritz.com/texttotsv/
Thanks, this is EXACTLY what I've been looking for. I love it already. I was going to use Rikaichan to save all the words I come across while reading and don't know, learn them, then come back to the article, but this would be even easier.
The only suggestion I can think of is to make it filter out stuff you already know (in my case, half of Core2000), or any vocab deck you already have, but I'm not a programmer (at all) so I don't even know if that's possible :/

But anyway, even the current version would be very useful for someone with low vocab like me.
#13
I included the checkbox feature. Now I need to actually do some work for university, so I will try to add the JLPT filter and kanji feature tonight.
#14
ThomasB Wrote:EDIT: Alright, I can include checkboxes as well.
Those checkboxes could add the selected words to a permanent list. That way you would have to mark each known word just once, and the program would skip it in any new text.

You could also allow editing the blacklist manually: for example, exporting your Anki vocabulary deck, extracting the list of words you already have, and then adding that list to the program's blacklist.

With that feature, using your program would become easier and more precise each time.

EDIT: Now that I think of it, that "blacklist" would actually be your own personal glossary of vocabulary items. It would be great if you could use the program not only for extracting words you don't know from new texts, but also for managing the words you already know: for example, tagging words and creating wordlists grouped by tag. That way you could create wordlists divided by field or source. Then you could even share wordlists, creating a whole repository for anyone who wanted to make use of them.
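The persistent blacklist idea above could be sketched like this; the file name `known_words.txt` and both function names are hypothetical. Words checked off once get appended to a plain text file and skipped on later runs, and the same file could be seeded from an exported Anki deck.

```python
import os

BLACKLIST_PATH = "known_words.txt"  # hypothetical location of the personal list

def load_blacklist(path=BLACKLIST_PATH):
    """Read the saved known-word list, one word per line."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def add_to_blacklist(words, path=BLACKLIST_PATH):
    """Append newly checked-off words, skipping ones already saved,
    so future runs filter them out automatically."""
    known = load_blacklist(path)
    with open(path, "a", encoding="utf-8") as f:
        for w in words:
            if w not in known:
                f.write(w + "\n")
```

Because the file is just one word per line, "exporting your Anki deck into the blacklist" reduces to dumping the deck's headword field into the same file.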
Edited: 2010-04-15, 12:54 pm
#15
Edit: Okay, wow, testing it out, it looks fantastic. I can't think of anything to add, seems like everyone's touched on the stuff I would've thought of. ;p

This is great for any number of texts, but for interactive audio via balloonguy's tool, it would also be great as the glossary I mentioned in those related comments. (Kind of like what Read Real Japanese has.) There's even the possibility, if one isn't going to use Anki, of having the kana/kanji fields used to generate JDIC sound links in another column or somesuch.

Also, if the program can relate back to the areas it extracted these definitions from, it could generate tooltips somehow (as mentioned to sheetz in the audiobook thread), like Google Translate has? (Though perhaps hotkeyed, since hovering will already generate highlights and clickable audio in balloonguy's tool.) Or perhaps simply highlighting the longform text/words above when you mouse over the individual words/rows at the bottom?

Since most of my word extractions come from subs2srs decks, I suppose I'm also interested in how to apply this mass vocab extractor to those. Oh right, I can just export the Anki .txt and import it into this program... I probably would just cull any redundancies when importing and studying new cards in my vocabulary deck/reference corpora.
Edited: 2010-04-15, 3:25 pm
#16
> Sebastian: That's a good idea, but I don't want to make this little "tool" more complex than it should be. If I included features such as personal data / user login, it would no longer be a "tool".
> nest0r: I don't know any of the tools/programs you mentioned, but I will look into them and see what I can incorporate. Sound links, for example, sound like a rather easy addition.

I would like to include a feature to use one's own deck for filtering the results; the problem, however, is that these decks can become rather large, on the order of several megabytes. If you had to upload your deck each time you used the website, it would take much too long. That's why I wanted to include the commonly used decks such as JLPT1/2/3/4, Core2000/6000, and KO2001 for filtering. If anyone knows an alternative way to make filtering more customizable, please let me know.

I hope I will have time this evening to work on it. Thanks for the feedback.
#17
My posts on this little tool are all over the place, but: http://forum.koohii.com/showthread.php?p...8#pid95348

Though balloonguy's main comment is here: http://forum.koohii.com/showthread.php?p...5#pid95155

Edit: And the reason I specified "if one isn't going to use Anki" for the sound files thing, in case you missed it and find it useful, is Third's plugin for Anki/JDIC: http://forum.koohii.com/showthread.php?tid=5397
Edited: 2010-04-15, 4:03 pm
#18
Personally I find it odd to SRS something before you read it. You're supposed to SRS things as you run into them, since that gives it a very personal touch. Does it suck to read something and run into stuff you don't know? Well, that's learning for you.

Hell, give it some time and you'll parse an article, put the words in an SRS and then simply not read the article.
#19
Tobberoth, I think it's a way of improving reading speed and fluency, not just increasing vocab. This is difficult to do if there are too many unknown words or if you're stopping to look words up. You might want to check out "intensive reading".

So some people use certain texts for close reading to increase knowledge of grammar, vocab, structure, etc. and use other (perhaps simpler) texts to improve reading speed and fluency.
#20
Tobberoth, you have a point. I agree with you if there are only a couple of words one needs to look up; it's not even worth putting them into an SRS.

For me, with vocabulary around JLPT2, however, it is still difficult to read news articles even remotely fluently. In a longer article there are often dozens of words I don't know, and looking all of them up is a real pain in.... Moreover, after 20 minutes and 40 lookups, when I finally finish reading the article I don't even remember what I've just read! That's because I spent most of my time looking up words, not understanding the content.
That's where it might come in handy to simply study the words beforehand.
#21
ThomasB Wrote:Tobberoth, you have a point. I agree with you if there are only a couple of words one needs to look up; it's not even worth putting them into an SRS.

For me, with vocabulary around JLPT2, however, it is still difficult to read news articles even remotely fluently. In a longer article there are often dozens of words I don't know, and looking all of them up is a real pain in.... Moreover, after 20 minutes and 40 lookups, when I finally finish reading the article I don't even remember what I've just read! That's because I spent most of my time looking up words, not understanding the content.
That's where it might come in handy to simply study the words beforehand.
IMO, that's not a problem. Say you spent 20 minutes and 40 lookups (if you used Rikaichan, that shouldn't really be an issue) and you don't remember it: read it through again. Now THAT would be awesome, since you get some more exposure to everything, and when you enter the words into your SRS, you will already know them pretty well. Yeah, on the second read-through you might need to do a few lookups again, but you will remember the context and you probably won't need to look much up, further increasing your exposure.

I can agree that it's not fun to read something where you know very few of the words it contains, but from the theories I've heard, that's not a good source for learning. We learn from comprehensible input, and an article where you don't know about 2 words per sentence is not going to be comprehensible without massive lookups. One could make the case that by pre-learning the words it contains you're forcing it into comprehensible input, but I personally don't believe it's that simple.

As for what Thora wrote, it's sort of the same thing there. Yeah, you might want easy texts to train reading fluency etc., but why not simply use simpler texts for that? Being able to read one news article fluently will probably not help you read other news articles fluently if your current ability demands massive lookups, unless you pre-learn for each individual article.
Edited: 2010-04-15, 5:54 pm
#22
"You're supposed to SRS things as you run into them since that gives it a very personal touch." - Uh, no, that's just one type of thing you can SRS. I happen to do that as just one part of my learning. But I also have different goals for different decks and media, etc. For example, I think the best strategy for native videos is to SRS it as closely to the actual experience as possible: Video clips on the front, other stuff on the back, where you're just focused on parsing and listening, as you would when actually watching stuff. To reduce overhead, I think it's best to not try to SRS the words according to what's in that deck--that is to say, not to try and learn the words in that deck during reviews, but to instead use other decks for vocabulary and other goals (such as multisensory integration of structurally streamlined sentences, or simple standalone word cards for simple yet robust encoding).

Likewise with longform texts, I think it's better to be strategic about approaching them, depending on your goals: Personally I think it's better to approach them like an assembly line, learning the words first, then focusing on maintaining semantic awareness and subvocalization while parsing sentences in larger and larger swaths of text, for longer and longer periods of time without interruption. You *could* do all this at once with each pass-through, but why would you want to do that when you can break it up into complementary elements in a streamlined workflow? And yes, you still would target materials at your level, but this would be relevant to the isolated, reduced overhead parsing and subvocalization/listening area, not the vocabulary area.

Likewise, even if you don't use this strategy, why would you prefer to have people individually, manually pick out words, rather than automatically isolate them and then choose what's relevant to their preferences and levels?
Edited: 2010-04-15, 6:02 pm
#23
There are a lot of reasons not to make it into an assembly line, varying in importance and impact.

1. It's unnatural. You certainly didn't learn your native language like that.
2. It takes away any chance you had of picking the words up through context.
3. It's ineffective. Your goal is to read the article; you might not need to know every single unknown word in there. By getting a list from it beforehand, you have no idea how important the words in it are to the article or to Japanese in general.
4. A list of words is just a list of words. As you look it up in a dictionary, how will you know which sense of the word is important to the article? There are quite a few words which can be used in several very different ways, learning the wrong one would kind of suck.
5. You're actively fooling yourself. The act of reading something with unknowns in it is a skill like any other and needs training. By learning all the words beforehand, you greatly reduce the difficulty of the text. Eventually you will need to read Japanese texts without the help of pre-learned vocabulary, and it might shock you how much harder it suddenly becomes, while someone who is used to reading texts with unknown words will have a greater skill at understanding the text regardless. For people who wish to get a good score on JLPT1 reading comprehension, this is something worth focusing on.
#24
Tobbs Wrote:1. It's unnatural. You certainly didn't learn your native language like that.
1. Unnatural like RTK? I learned my native language as a child, over the years, and I certainly did learn many of my words in separate lists, often with example sentences. I recall that much from classes. Unnatural = strategy as an adult SLA learner, much faster and with superior metacognition.
Tobbs Wrote:2. It takes away any chance you had of picking the words up through context.
2. How? You not only come back to them and have them reinforced when you SRS the original material (planned redundancy), but you use them in context when learning them by using Core6k/KO2001/any example sentences you like where the grammar/content is less important than the word in context. That's the whole point of using general reference corpora.
Tobbs Wrote:3. It's ineffective. Your goal is to read the article, you might not need to know every single unknown word in there. By getting a list from it beforehand, you have no idea how important the words in it are to the article or to Japanese in general.
3. It's extremely effective in my experience, which is why I recommend it repeatedly as the perfect multi-corpus, user-customized, general and robust method for learning Japanese: effective at reducing overhead, and as you progress, you need to do it less and less ahead of time. You don't need to learn every unknown word; it's up to you to decide how much you want to pre-SRS. I agree that there are techniques that could enhance this; that's why I go on about determining the most important swaths of the most interesting texts when discussing how to do this with reading, just as I pick the most interesting sentences of native media to do this with. It's the best of all worlds. It doesn't matter if it's important to Japanese; if it's used in media that you're interested in, that's what's important. If you want to focus only on what's important to Japanese, then I happen to think that's another function of the general reference corpora as well, or some other frequency-based corpus, or simply discarding that and focusing on preference.
Tobbs Wrote:4. A list of words is just a list of words. As you look it up in a dictionary, how will you know which sense of the word is important to the article? There are quite a few words which can be used in several very different ways, learning the wrong one would kind of suck.
4. This happens very rarely, but it has happened, and I shrugged and learned it when SRSing the new one, no big deal because overhead is still drastically reduced from the overall method, and it's not hard then to pick up an odd word or two. Likewise when I note there are multiple senses to a word, I sometimes check the original, sometimes just SRS what's in the smart.fm/Core deck, sometimes SRS them all in my extemporaneous/individual word/phrase deck.
Tobbs Wrote:5. You're actively fooling yourself. The act of reading something with unknowns in it is a skill like any other and needs training. By learning all the words beforehand, the difficulty of the text is greatly minimized. Eventually, you will need to read Japanese texts without the help of pre-learning vocabulary and it might shock you how much harder it suddenly became to read, while someone who is used to reading texts with unknown words will have a greater skill at understanding the text regardless. For people who wish to get a good score on JLPT1 reading comprehension, this is something worthy of focus.
5. You aren't fooling yourself at all, because it's self-conscious and strategic. 'Negative capability', developing that can refer to many things, and I've commented on those before. You don't eliminate all unknowns from all reading, you eliminate the bulk and brunt of it so that you can focus on parsing and semantic awareness and subvocalization/listening in certain readings. If you want to do other types of reading where you're factoring in a handful of lexical unknowns, enough to spice things up and train your negative capability for vocabulary, then I suggest either not pre-SRSing all of the words, or selecting texts that reflect that, or simply waiting until you know so many words that you come across fewer and fewer unknowns. I happen to think pre-SRSing unknown words until they're youngish/mature (according to preference), having a system for that, is best for developing reading skills--the collective benefits are superior to any other strategy, I feel. I would recommend cloze deletion for your suggestion of unknowns, and in fact made a thread specifically for that idea with regards to reading. But the prime function of that would be to glean overall meaning from imperfect knowledge, and I think this is best trained with an assembly line approach which would include some form of extraction and preparation to reduce overhead according to what your goals are.

Mostly I think it's best to develop them all progressively, chunking them together more and more until you're at a level where all this is rather moot. Till then, I recommend serious strategy. It works very well for me, and it's especially great when I realize I'm extracting fewer and fewer words beforehand, till eventually I'll know it's time to hang up that strategy... Anyway, I suggest you try it. My comprehension for stuff I come across without pre-SRSing has shot up incredibly, as has my 'negative capability', as I get used to thinking of the language as a collection of components paired up with the notion of multiliteracy.
Edited: 2010-04-15, 7:17 pm
#25
This is nice!
I think hotlinking to definitions, like here: http://language.tiu.ac.jp/tools_e.html
would be a good thing.
It's a simple thing to add the audio lookup too...