kanji koohii FORUM
text2srs? - Printable Version

+--- Thread: text2srs? (/thread-3777.html)



text2srs? - mezbup - 2009-08-16

Is there any way we could get a program where you enter a particular word or grammar point you want to study, feed it a whole lot of text from, say, a web page or a PDF, and it extracts the sentences that match your criteria and compiles them into an Anki deck?

I'm guessing it could be done and implemented much more usefully than my primitive concept of it.

But what do you guys think?


text2srs? - bombpersons - 2009-08-16

Sounds like a good idea, especially if you have a large amount of text to go through. I would imagine it would be quite easy to write a Python script to do it. I'll have a go =D

The problem comes in separating the sentences though. Split only when 。 or ? or ! appears?
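One simple answer to the splitting question is to cut after every run of sentence-ending punctuation. This is a minimal sketch, assuming that 。, ?, ! and the full-width ？ and ！ are the only terminators; quotes such as 「…」 and line breaks inside a sentence are not handled, and the name split_sentences is made up for this example.

```python
import re

# Characters treated as sentence terminators (an assumption, extend as needed).
TERMINATORS = "。？！?!"

# A sentence is a run of non-terminator characters followed by optional terminators.
_SENTENCE_RE = re.compile(r"[^{0}]+[{0}]*".format(re.escape(TERMINATORS)))


def split_sentences(text):
    """Return the sentences in text, each keeping its closing punctuation."""
    sentences = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences
```

Text with no terminator at all comes back as a single sentence, which is probably the least surprising behaviour for headings and fragments.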


text2srs? - mezbup - 2009-08-16

Cool :) I guess you'd have to experiment with it... the best way to find out is to write it and use it. That way you get a feel for what needs tweaking.

A really awesome feature to implement would be grammar points by JLPT level. So, you could select which grammar points from which JLPT level you want it to make a deck out of and hit the go button and boom. A nice deck for you to work your way through. Perfect for grammar study at the higher levels IMO.

Imagine parsing a whole book and picking out a deck full of all the JLPT2 grammar points in it. Could be a really fast way to get good sentences that are extremely relevant to your learning.
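The per-level deck idea could be sketched as a lookup table from level to grammar patterns. The table below is purely illustrative: the patterns are real grammar forms, but which JLPT level each one belongs to is a placeholder rather than an official list, and the names GRAMMAR_BY_LEVEL and sentences_for_level are invented for this example.

```python
import re

# Placeholder table: level -> {grammar point: regex}. The level assignments are
# illustrative only and would need to come from a real JLPT grammar list.
GRAMMAR_BY_LEVEL = {
    "JLPT2": {
        "わけではない": re.compile("わけでは(ない|ありません)"),
        "にもかかわらず": re.compile("にもかかわらず"),
    },
    "JLPT3": {
        "ようにする": re.compile("ように(する|します|した|して)"),
    },
}


def sentences_for_level(sentences, level):
    """Return (grammar point, sentence) pairs for every pattern of a level."""
    matches = []
    for sentence in sentences:
        for name, pattern in GRAMMAR_BY_LEVEL.get(level, {}).items():
            if pattern.search(sentence):
                matches.append((name, sentence))
    return matches
```

Plain regular expressions will miss conjugated or split forms, so a real version would probably want a morphological analyzer in front of this step.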


text2srs? - Tobberoth - 2009-08-16

Isn't this something Mecab can do?


text2srs? - mezbup - 2009-08-16

What's Mecab?


text2srs? - bombpersons - 2009-08-16

What exactly is MeCab? It has Python bindings, but I can't seem to find out what it does or find any documentation.


text2srs? - bombpersons - 2009-08-16

Well, I managed to put something together. It just finds sentences with the specified words in them.

Sorry, no GUI, just the command line at the moment. You'll need Python to run it.

Example usage:
python text2srs.py -i inputfile.txt -s search,terms,separated,by,comma,no,spaces

If you don't put in any search terms, every sentence will be used.

The resulting file, output.txt, lists all the sentences found that contain the words. It can be imported into Anki; however, there is only one field, the sentence. You'll have to do dictionary lookups etc. yourself.
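For readers who cannot get the download, here is a rough sketch of a script with the same command line. It is not the original text2srs.py: the splitting rule, the "match any term" behaviour, UTF-8 input, and the extra -o option are all assumptions made for this example.

```python
import argparse
import re
import sys

# Assumed sentence terminators; the original script's rule is not shown.
_SENTENCE_RE = re.compile(r"[^。？！?!]+[。？！?!]*")


def find_sentences(text, terms):
    """Return sentences containing any search term (all sentences if no terms)."""
    sentences = [m.group().strip() for m in _SENTENCE_RE.finditer(text)]
    sentences = [s for s in sentences if s]
    if not terms:
        return sentences
    return [s for s in sentences if any(term in s for term in terms)]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Extract sentences for an SRS deck.")
    parser.add_argument("-i", dest="infile", required=True, help="input text file")
    parser.add_argument("-s", dest="search", default="", help="comma-separated terms")
    # -o is an addition for this sketch; the post only mentions output.txt.
    parser.add_argument("-o", dest="outfile", default="output.txt", help="output file")
    args = parser.parse_args(argv)

    terms = [t for t in args.search.split(",") if t]
    with open(args.infile, encoding="utf-8") as handle:
        text = handle.read()
    # One sentence per line gives a single-field file for import.
    with open(args.outfile, "w", encoding="utf-8") as handle:
        for sentence in find_sentences(text, terms):
            handle.write(sentence + "\n")


# Only run the command line when arguments were actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```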

Download link:
text2srs.rar (updated)

*Damn character encodings =( So confusing. Had to jump through some hoops to allow unicode characters in the terminal...
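In Python 3, one way to stop depending on the terminal's default encoding is to wrap the binary output stream yourself. This is a sketch under the assumption that the terminal can actually display UTF-8; make_utf8_stream is a made-up helper and not part of text2srs.py.

```python
import io
import sys


def make_utf8_stream(raw_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        raw_stream,
        encoding="utf-8",
        errors="replace",  # substitute unencodable characters instead of raising
        line_buffering=True,
    )


if __name__ == "__main__":
    out = make_utf8_stream(sys.stdout.buffer)
    out.write("日本語のテスト。\n")
    out.flush()
    # Detach so that discarding the wrapper does not close the real stdout.
    out.detach()
```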


text2srs? - nonpoint - 2009-08-16

This is basically what I do, except the source material I search through is my subs2SRS decks... favorite drama/anime, mp3s, images, native, etc.
My thread http://forum.koohii.com/showthread.php?pid=61057#pid61057
has some perl code to help anyone who wants to write a script in perl.

Thanks for that code, bombpersons. I'll have a look and see if I can improve my own stuff :)

EDIT: With minor changes one can use bombpersons' code to search the subs2srs decks like I do. Hint: combine the .tsv files, and use a regexp on the mp3s and jpgs to import them easily into Anki.
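The hint could look roughly like this in Python. It is a sketch that assumes the exported .tsv files are UTF-8 and that media appear as file names ending in .mp3 or .jpg somewhere in each row; the actual subs2srs column layout is not assumed, and both function names are hypothetical.

```python
import csv
import glob
import re

# Assumed media naming; adjust to whatever the exported decks actually contain.
MEDIA_RE = re.compile(r"[^\s\[\]<>\"':]+\.(?:mp3|jpg)")


def search_tsv_files(pattern, term):
    """Yield rows from all TSV files matching pattern whose fields contain term."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", newline="") as handle:
            for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
                if any(term in field for field in row):
                    yield row


def media_files(row):
    """Return the mp3/jpg file names mentioned anywhere in a row."""
    names = []
    for field in row:
        names.extend(MEDIA_RE.findall(field))
    return names
```

Something like `search_tsv_files("decks/*.tsv", "猫")` (the path is a placeholder) then gives the matching rows, and `media_files(row)` lists which files to copy alongside them.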


text2srs? - bombpersons - 2009-08-16

nonpoint Wrote:This is basically what I do, except the source material I search through is my subs2SRS decks... favorite drama/anime, mp3s, images, native, etc.
My thread http://forum.koohii.com/showthread.php?pid=61057#pid61057
has some perl code to help anyone who wants to write a script in perl.

Thanks for that code bombpersons, I'll have a look and see if I can improve my own stuff :)
Yeah, I've started doing this: importing loads of stuff into Anki and suspending it so I can easily unsuspend the cards I need. One problem I've been having is that whenever I try to sync my deck, it takes ages, since the whole deck gets synced =(


text2srs? - Tobberoth - 2009-08-16

MeCab is a Japanese morphological analyzer (a parser that splits text into words). Anki, for example, uses it to generate readings.


text2srs? - nest0r - 2009-08-16

Maybe creating a searchable online database of the combined subs2srs decks would be the best path--not necessarily multimedia itself, just the text for the initial search, then users can use the results to create/download the media files alongside the desired cards. Something like that. I'm not sure precisely how these corpus tools work, but between mecab and parts-of-speech tagging and the other metadata possibilities... It could even be collaboratively annotated, using Microformats and the Firefox Operator addon? Or whatever. Or, going back to previous conversations, would it be best to try and fit this into smart.fm's lists/API stuff?

Also, the future: video cards in Anki, created via subs2srs? ;p Or stick an SRS algorithm onto a videoplayer playlist.... Eating all that rice made with my neuro fuzzy has made my logic fuzzy too.

Speaking of annotation--I haven't yet delved into the possibilities of turning Flickr images and Youtube videos into picture 'dictionaries', as it were. Or just using the ones already made by .jp natives. Wonder if you can audio-annotate as well. Or perhaps incorporating Omnisio or something into Anki images could work? I guess Omnisio is video. But you know, whatever is used for image tagging on Flickr.


text2srs? - bombpersons - 2009-08-16

nest0r Wrote:Maybe creating a searchable online database of the combined subs2srs decks would be the best path--not necessarily multimedia itself, just the text for the initial search, then users can use the results to create/download the media files alongside the desired cards. Something like that. I'm not sure precisely how these corpus tools work, but between mecab and parts-of-speech tagging and the other metadata possibilities... It could even be collaboratively annotated, using Microformats and the Firefox Operator addon? Or whatever. Or, going back to previous conversations, would it be best to try and fit this into smart.fm's lists/API stuff?

Also, the future: video cards in Anki, created via subs2srs? ;p Or stick an SRS algorithm onto a videoplayer playlist.... Eating all that rice made with my neuro fuzzy has made my logic fuzzy too.

Speaking of annotation--I haven't yet delved into the possibilities of turning Flickr images and Youtube videos into picture 'dictionaries', as it were. Or just using the ones already made by .jp natives. Wonder if you can audio-annotate as well. Or perhaps incorporating Omnisio or something into Anki images could work? I guess Omnisio is video. But you know, whatever is used for image tagging on Flickr.
This would be awesome =D All we need is a system where people can upload cards / decks and then put that in one massive deck / database which users can search through. That way, whenever you find a word you need a sentence for you can find nice interesting sentences instead of the boring dictionary ones. It would be even better with audio / video which you can easily download. I'd imagine the bandwidth costs would be quite a bit though.

I've been wanting to get back into web development for a while now, and I think this is a good excuse to start =D I'll try making a mock site and see how it goes... Perhaps if I use Python to create the site, I could use libanki to use an Anki deck as the database...
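The upload-and-search idea can be prototyped without a web framework at all. The sketch below uses Python's built-in sqlite3 module instead of libanki (a swapped-in choice, simply because it needs nothing installed); the table layout and function names are assumptions for illustration.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open (or create) the sentence database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL UNIQUE, source TEXT, audio TEXT)"
    )
    return conn


def add_sentence(conn, text, source=None, audio=None):
    """Store one uploaded sentence; exact duplicates are silently skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO sentences (text, source, audio) VALUES (?, ?, ?)",
        (text, source, audio),
    )
    conn.commit()


def search(conn, word):
    """Return (text, source, audio) rows whose sentence contains word."""
    cursor = conn.execute(
        "SELECT text, source, audio FROM sentences "
        "WHERE instr(text, ?) > 0 ORDER BY id",
        (word,),
    )
    return cursor.fetchall()
```

Only the media file name is stored here, not the media itself, which keeps the database small and leaves the bandwidth question open.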


text2srs? - cescoz - 2009-08-16

This was an idea from some time ago... imagine you have 200+ Japanese movies with 200+ OCRed Japanese subtitle files.
1. You write the Japanese words you want in a text file.
2. A program runs through the media folder, following the instructions in the text file, and extracts the video clips that match the chosen words.
3. The result is a lot of clips from different movies, showing the words in different contexts.
You know, Spielberg called me the other day and suggested this to me XD
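The matching half of those steps could be sketched like this, assuming the subtitles are in SRT format (the post does not say which format). It only finds the cue timestamps; cutting the actual clips would be left to a separate video tool. CUE_RE and find_cues are names invented for the example.

```python
import re

# One SRT cue: index line, "start --> end" line, then one or more text lines.
CUE_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})[^\n]*\n"
    r"(.*?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)


def find_cues(srt_text, words):
    """Return (start, end, text) for every cue containing any of the words."""
    hits = []
    # Normalise Windows line endings before matching.
    for match in CUE_RE.finditer(srt_text.replace("\r\n", "\n")):
        start, end = match.group(2), match.group(3)
        text = match.group(4).strip()
        if any(word in text for word in words):
            hits.append((start, end, text))
    return hits
```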


text2srs? - nonpoint - 2009-08-17

cescoz Wrote:This was an idea from some time ago... imagine you have 200+ Japanese movies with 200+ OCRed Japanese subtitle files.
1. You write the Japanese words you want in a text file.
2. A program runs through the media folder, following the instructions in the text file, and extracts the video clips that match the chosen words.
3. The result is a lot of clips from different movies, showing the words in different contexts.
You know, Spielberg called me the other day and suggested this to me XD
I'm doing this, pretty much... audio and image only though.
After I search for a word I don't know, I listen to a bunch of the sentences before actually looking up the word (in a J-J dictionary).
Sometimes (not very often, once in 30? 40 times?) I get a pretty good feel and the dictionary merely confirms my intuition (awesome feeling!).

If anyone wants to make an online version of this, the best/cheapest/most time-efficient way would probably be, as nest0r has stated, to use a smart.fm list. I guess people could start using smart.fm for serious sentence-picking then :)


text2srs? - nest0r - 2009-08-17

Following the train of thought from here and here (thank goodness I deleted my whining first post on the board wanting shared sentence collections--consensus then was 'too difficult don't bother', hehe, so there was this constant influx of frustrated RTK graduates/AJATT n00bs--thank you programmers for bringing us to the future), I do think people could convert their subs2srs decks into lists at smart.fm or some variation of its database capabilities. They can create them around themes and tropes, or grammar points, or just emulate systems like KO2001 or some of the other template examples on the site.

Just search and export what you want from there into Anki. Or since there's the copyright issue that shaped the limitations of the Anki plugin (not to mention smart.fm understandably wanting to keep their nose clean), the person who compiles the lists can create a .media folder for Anki and point a link to it offsite (I guess the link'd either be there or at that wiki people are using for sharing decks), so you'd just import the 'text' information and then once on your HDD you combine it with the media you want?

+ These lists could be mixed and matched, though doing the same with offsite media databases would be more difficult...


text2srs? - bombpersons - 2009-09-09

nest0r Wrote:Following the train of thought from here and here (thank goodness I deleted my whining first post on the board wanting shared sentence collections--consensus then was 'too difficult don't bother', hehe, so there was this constant influx of frustrated RTK graduates/AJATT n00bs--thank you programmers for bringing us to the future), I do think people could convert their subs2srs decks into lists at smart.fm or some variation of its database capabilities. They can create them around themes and tropes, or grammar points, or just emulate systems like KO2001 or some of the other template examples on the site.

Just search and export what you want from there into Anki. Or since there's the copyright issue that shaped the limitations of the Anki plugin (not to mention smart.fm understandably wanting to keep their nose clean), the person who compiles the lists can create a .media folder for Anki and point a link to it offsite (I guess the link'd either be there or at that wiki people are using for sharing decks), so you'd just import the 'text' information and then once on your HDD you combine it with the media you want?

+ These lists could be mixed and matched, though doing the same with offsite media databases would be more difficult...
I'm going to have a go at making this. I've started making something with Django (a Python web framework), but I'm still learning how to use it, so... Would anyone like to help? I can share the source on GitHub so everyone can improve the code.

I guess the basic idea is to first have a system where users can add sentences (through a basic form). Then we could move on to making an Anki plugin that automatically syncs a user's sentences (with media) to the database. Then a user could search the site and download the sentence(s) as an Anki file which can be easily imported.
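The "download and import" end of that plan could start as a plain tab-separated export, which Anki can import as a text file. This is a sketch with a hypothetical Sentence record and a three-field layout (sentence, source, media); it is not Django code and not an Anki plugin. The [sound:...] tag is Anki's syntax for referencing an audio file in a field.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One user-submitted sentence; all field names are made up for this sketch."""
    text: str
    source: str = ""
    audio: str = ""  # media file name, e.g. "ep01_0001.mp3"
    image: str = ""  # media file name, e.g. "ep01_0001.jpg"


def _clean(field):
    # Tabs and newlines would break the one-note-per-line layout.
    return field.replace("\t", " ").replace("\n", " ")


def to_anki_tsv(sentences):
    """Return tab-separated text with one note per line: sentence, source, media."""
    lines = []
    for s in sentences:
        media = ""
        if s.audio:
            media += "[sound:{}]".format(s.audio)
        if s.image:
            media += '<img src="{}">'.format(s.image)
        lines.append("\t".join(_clean(f) for f in (s.text, s.source, media)))
    return "".join(line + "\n" for line in lines)
```

The media files themselves would still have to be copied into the deck's media folder separately.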


text2srs? - ruiner - 2009-09-09

bombpersons Wrote:
I'm going to have a go at making this. I've started making something with Django (a Python web framework), but I'm still learning how to use it, so... Would anyone like to help? I can share the source on GitHub so everyone can improve the code.

I guess the basic idea is to first have a system where users can add sentences (through a basic form). Then we could move on to making an Anki plugin that automatically syncs a user's sentences (with media) to the database. Then a user could search the site and download the sentence(s) as an Anki file which can be easily imported.
I don't know how to do that other stuff, but how does the Media URL thing in Anki's deck properties work? Is it realtime? Like, if you have a .anki file on your PC, does it download the files into a cache, or can you 'import' media files from the Media URL? I haven't paid much attention to all this Dropbox/syncing stuff because I don't review on portable devices.


text2srs? - bombpersons - 2009-09-09

You can set the media URL so that when you are reviewing online, it can fetch the media from a separate server. Quite useful if you have Dropbox: you can just put your Anki deck and media in the public folder, copy and paste the public URL in as the media URL, and you can review your deck with media anywhere =)
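Conceptually, all the reviewing client has to do is join the configured media URL with each card's media file name. The snippet below is a guess at that general idea, not Anki's actual code; media_link is a hypothetical helper and the example.com address is a placeholder.

```python
from urllib.parse import quote, urljoin


def media_link(media_url, filename):
    """Join a base media URL and a media file name into a full URL."""
    # Without a trailing slash, urljoin would replace the last path segment.
    base = media_url if media_url.endswith("/") else media_url + "/"
    # quote() percent-encodes spaces and non-ASCII characters in the file name.
    return urljoin(base, quote(filename))


# Example with placeholder values:
# media_link("http://example.com/public/deck.media", "ep01 0001.mp3")
```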


text2srs? - cescoz - 2009-09-09

Maybe the most useful thing to do is collect all the Anki decks with media from all the members of the site (those who want to participate, of course!) and then build a massive deck of 50,000+ sentences... even without all the media it would be phenomenal °_° for learning.
It's only an idea though; a deck is kind of a personal thing for the person who made it... hmm, I don't know.