text2srs?

mezbup
Member
From: sausage lip
Registered: 2008-09-18
Posts: 1679
Website

Is there any way we could get some sort of program where you enter a particular word or grammar point you want to study, then feed it a whole lot of text (say, from a web page or a PDF), and it extracts the sentences that match your criteria and compiles them into an Anki deck?

I'm guessing it could be done and implemented much more usefully than my primitive concept of it.

But what do you guys think?

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

Sounds like a good idea, especially if you have a large amount of text to go through. I'd imagine it would be quite easy to write a Python script to do it. I'll have a go :D

The problem comes in separating the sentences, though. Split only when 。, ? or ! appears?
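A rough sketch of that splitting rule in Python, in case it helps anyone experimenting. The function name and the exact punctuation set (full-width ？ and ！ added alongside the ASCII ones) are assumptions, and each line of input is treated as its own unit. Quoted speech such as 「行くの？」と聞いた。 would be cut in the middle, so it would need extra handling.

```python
import re

# Sentence-ending marks: Japanese full stop, plus full-width and ASCII ? and !
# (this set is an assumption; extend it as needed)
SENTENCE_END = re.compile(r"[^。？！?!]*[。？！?!]+|[^。？！?!]+$")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    sentences = []
    for line in text.splitlines():
        for piece in SENTENCE_END.findall(line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences
```

A trailing fragment with no closing mark is kept as its own sentence rather than dropped.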

mezbup
Member
From: sausage lip
Registered: 2008-09-18
Posts: 1679
Website

Cool :) I guess you'd have to experiment with it... the best way to find out is to write it and use it. That way you get a feel for what needs tweaking.

A really awesome feature to implement would be grammar points by JLPT level. So, you could select which grammar points from which JLPT level you want it to make a deck out of and hit the go button and boom. A nice deck for you to work your way through. Perfect for grammar study at the higher levels IMO.

Imagine parsing a whole book and picking out a deck full of all the JLPT2 grammar points in it. Could be a really fast way to get good sentences that are extremely relevant to your learning.
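One way this could look, as a minimal sketch: keep a table of grammar points per level and match each one against the sentences with a regular expression. Everything in the table below is a placeholder for illustration; the patterns and the level labels are not taken from any official JLPT list, and plain regexes will miss many conjugated forms.

```python
import re

# Placeholder table: patterns and level labels are illustrative assumptions,
# not an official JLPT grammar list.
GRAMMAR_BY_LEVEL = {
    "JLPT2": {
        "にもかかわらず": r"にもかかわらず",
        "わけにはいかない": r"わけにはいか(ない|なかった|ず)",
    },
    "JLPT3": {
        "ために": r"ために",
    },
}


def find_grammar(sentences, level):
    """Map each grammar point of the given level to the sentences containing it."""
    hits = {}
    for name, pattern in GRAMMAR_BY_LEVEL.get(level, {}).items():
        regex = re.compile(pattern)
        matched = [s for s in sentences if regex.search(s)]
        if matched:
            hits[name] = matched
    return hits
```

Each key in the result could then become a tag or a sub-deck when the sentences are written out for import.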

Tobberoth
Member
From: Sweden
Registered: 2008-08-25
Posts: 3364

Isn't this something Mecab can do?

mezbup
Member
From: sausage lip
Registered: 2008-09-18
Posts: 1679
Website

What's Mecab?

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

What exactly is MeCab? It has Python bindings, but I can't seem to find out what it does or find any documentation.

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

Well, managed to put something together. It just finds sentences with specified words in it.

Sorry, no GUI, just command line at the moment. You'll need Python to run it.

Example usage
python text2srs.py -i inputfile.txt -s search,terms,separated,by,comma,no,spaces

If you don't put in any search terms, every sentence will be used.

The resulting file, output.txt, lists every sentence that contains the search words. It can be imported into Anki, but there is only one field: the sentence. You'll have to do dictionary lookups etc. yourself.
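For anyone curious how the behaviour described above could be put together, here is a minimal sketch. It is not the script behind the download link; the function names are hypothetical, and matching a sentence when it contains any one of the terms (rather than all of them) is an assumption.

```python
import re


def find_sentences(text, terms):
    """Return sentences containing any search term; every sentence if terms is empty."""
    # A sentence is a run of text up to 。 ? ! (ASCII or full-width) or a line break.
    pieces = re.findall(r"[^。？！?!\n]+[。？！?!]*", text)
    sentences = [p.strip() for p in pieces if p.strip()]
    if not terms:
        return sentences
    return [s for s in sentences if any(t in s for t in terms)]


def write_deck(sentences, path="output.txt"):
    """Write one sentence per line: a single-field text file Anki can import."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")
```

Writing the file explicitly as UTF-8 sidesteps at least some of the encoding trouble, since the output no longer depends on the terminal's settings.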

Download link:
text2srs.rar (updated)

*Damn character encodings :( So confusing. Had to jump through some hoops to allow Unicode characters in the terminal...

Last edited by bombpersons (2009 August 16, 10:36 am)

nonpoint
Member
From: KON?
Registered: 2009-07-14
Posts: 168

This is basically what I do, except the source material I search through is my subs2SRS decks... favorite drama/anime, mp3s, images, native, etc.
My thread http://forum.koohii.com/viewtopic.php?pid=63811
has some perl code to help anyone who wants to write a script in perl.

Thanks for that code, bombpersons, I'll have a look and see if I can improve my own stuff :)

EDIT: With minor changes, one can use bombpersons' code to search the subs2srs decks like I do. Hint: combine the .tsv files, then use a regexp on the mp3s and jpgs so they import easily into Anki.
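A sketch of that hint in Python rather than Perl. The column layout is an assumption (sentence text in the first column, bare media file names in the others), since the actual order depends on how the deck was exported; the function names are hypothetical. The `[sound:...]` and `<img src="...">` forms are the ones Anki uses for media in a field.

```python
import csv
import glob
import re


def search_decks(pattern, term, text_column=0):
    """Yield rows from every .tsv file matching pattern whose text column contains term."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", newline="") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if len(row) > text_column and term in row[text_column]:
                    yield row


def tag_media(field):
    """Wrap a bare media file name in the markup Anki expects; leave other fields alone."""
    if re.fullmatch(r"\S+\.mp3", field):
        return f"[sound:{field}]"
    if re.fullmatch(r"\S+\.jpg", field):
        return f'<img src="{field}">'
    return field


def write_rows(rows, path):
    """Write the matching rows as one tab-separated file ready for import."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write("\t".join(tag_media(f) for f in row) + "\n")
```

Something like `write_rows(search_decks("decks/*.tsv", "猫"), "found.tsv")` would then give one importable file, with `decks/` standing in for wherever the .tsv files live.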

Last edited by nonpoint (2009 August 16, 12:35 pm)

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

Yeah, I've started doing this: importing loads of stuff into Anki and suspending it so I can easily unsuspend the cards I need. One problem I've been having is that whenever I try to sync my deck, it takes ages, since the whole deck gets synced :(

Reply #10 - 2009 August 16, 2:15 pm
Tobberoth
Member
From: Sweden
Registered: 2008-08-25
Posts: 3364

MeCab is a Japanese morphological analyzer: it splits text into words and tags each one with its part of speech and reading. Anki, for example, uses it to generate readings.

Reply #11 - 2009 August 16, 3:19 pm
nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Maybe creating a searchable online database of the combined subs2srs decks would be the best path--not necessarily multimedia itself, just the text for the initial search, then users can use the results to create/download the media files alongside the desired cards. Something like that. I'm not sure precisely how these corpus tools work, but between mecab and parts-of-speech tagging and the other metadata possibilities... It could even be collaboratively annotated, using Microformats and the Firefox Operator addon? Or whatever. Or, going back to previous conversations, would it be best to try and fit this into smart.fm's lists/API stuff?

Also, the future: video cards in Anki, created via subs2srs? ;p Or stick an SRS algorithm onto a videoplayer playlist.... Eating all that rice made with my neuro fuzzy has made my logic fuzzy too.

Speaking of annotation--I haven't yet delved into the possibilities of turning Flickr images and Youtube videos into picture 'dictionaries', as it were. Or just using the ones already made by .jp natives. Wonder if you can audio-annotate as well. Or perhaps incorporating Omnisio or something into Anki images could work? I guess Omnisio is video. But you know, whatever is used for image tagging on Flickr.

Last edited by nest0r (2009 August 16, 3:30 pm)

Reply #12 - 2009 August 16, 3:44 pm
bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

This would be awesome :D All we need is a system where people can upload cards / decks, which then go into one massive deck / database that users can search through. That way, whenever you find a word you need a sentence for, you can find nice interesting sentences instead of the boring dictionary ones. It would be even better with audio / video which you can easily download. I'd imagine the bandwidth costs would be quite high, though.

I've been wanting to get into web development again for a while now, and I think this is a good excuse to start :D I'll try making a mock site and see how it goes... Perhaps if I use Python to create the site, I could use libanki to use an Anki deck as the database...

Reply #13 - 2009 August 16, 4:28 pm
cescoz
Member
From: Italy
Registered: 2008-01-22
Posts: 131

This was an idea from some time ago... imagine that you have 200+ Japanese movies, with 200+ OCRed Japanese subtitle files.
1. You write the Japanese words that you like in a text file.
2. A program runs through the media folder, following the instructions in the text file and extracting the video clips that match the chosen words.
3. The result is a lot of clips from different movies, showing the words in different contexts.
You know, Spielberg called me the other day and suggested this to me XD
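The lookup half of those steps can be sketched like this, assuming the subtitles are in SRT format (numbered blocks with a `start --> end` timing line); that format and the function name are assumptions. It only finds the time ranges; cutting the actual clips would be left to an external video tool.

```python
import re

# SRT timing line, e.g. "00:00:04,000 --> 00:00:06,500"
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def find_clips(srt_text, words):
    """Return (start, end, text) for each subtitle that contains any of the words."""
    clips = []
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the subtitle text.
                text = "".join(lines[i + 1:])
                if any(w in text for w in words):
                    clips.append((match.group(1), match.group(2), text))
                break
    return clips
```

Running this over every subtitle file in the media folder, with the word list read from the text file, gives the list of clips to extract.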

Reply #14 - 2009 August 17, 6:32 am
nonpoint
Member
From: KON?
Registered: 2009-07-14
Posts: 168

I'm doing this, pretty much... audio and image only though.
After I search for a word I don't know I listen to a bunch of the sentences before actually looking up the word (in a j-j dictionary)
Sometimes (not very often, once in 30? 40 times?) I get a pretty good feel and the dictionary merely confirms my intuition (awesome feeling!).

If anyone wants to make an online version of this, the best/cheapest/most time-efficient way would probably be, as nest0r has stated, to use a smart.fm list. I guess people could start using smart.fm for serious sentence-picking then :)

Reply #15 - 2009 August 17, 7:18 am
nest0r
Member
Registered: 2007-10-19
Posts: 5234
Website

Following the train of thought from here and here (thank goodness I deleted my whining first post on the board wanting shared sentence collections--consensus then was 'too difficult don't bother', hehe, so there was this constant influx of frustrated RTK graduates/AJATT n00bs--thank you programmers for bringing us to the future), I do think people could convert their subs2srs decks into lists at smart.fm or some variation of its database capabilities. They can create them around themes and tropes, or grammar points, or just emulate systems like KO2001 or some of the other template examples on the site.

Just search and export what you want from there into Anki. Or since there's the copyright issue that shaped the limitations of the Anki plugin (not to mention smart.fm understandably wanting to keep their nose clean), the person who compiles the lists can create a .media folder for Anki and point a link to it offsite (I guess the link'd either be there or at that wiki people are using for sharing decks), so you'd just import the 'text' information and then once on your HDD you combine it with the media you want?

+ These lists could be mixed and matched, though doing the same with offsite media databases would be more difficult...

Last edited by nest0r (2009 August 17, 7:40 am)

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

I'm gonna have a go at making this. I've started making something with Django (a Python web framework), but I'm still learning how to use it, so... would anyone like to help? I can share the source on GitHub so everyone can improve the code.

I guess the basic idea is to first have a system where users can add sentences (through a basic form). Then we could move on to making an Anki plugin that will automatically sync a user's sentences (with media) to the database. Then a user could search the site and download the sentence(s) in an Anki file which can be easily imported.
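To make the add / search / download flow concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the site's database. This is not Django code; the table layout and every function name are assumptions for illustration only.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the sentence database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, media TEXT, owner TEXT)"
    )
    return conn


def add_sentence(conn, text, media=None, owner=None):
    """Store one sentence, as the basic form would; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO sentences (text, media, owner) VALUES (?, ?, ?)",
        (text, media, owner),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Return (text, media) pairs for sentences that contain the term."""
    rows = conn.execute(
        "SELECT text, media FROM sentences WHERE instr(text, ?) > 0 ORDER BY id",
        (term,),
    )
    return rows.fetchall()


def export_tsv(rows):
    """Format search results as tab-separated text that Anki can import."""
    return "".join(f"{text}\t{media or ''}\n" for text, media in rows)
```

The plugin step would then amount to calling something like `add_sentence` for each card in a user's deck, with media files handled separately.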

ruiner
Member
Registered: 2009-08-20
Posts: 751

I don't know how to do that other stuff, but how does the Media URL thing in Anki's deck properties work? Is it realtime? Like, if you have a .anki file on your PC, does it download the files into a cache, or can you 'import' media files from the Media URL? I haven't paid much attention to all this Dropbox/syncing stuff because I don't review on portable devices.

bombpersons
Member
From: UK
Registered: 2008-10-08
Posts: 907
Website

You can set the media URL so that when you are reviewing online, it fetches the media from a separate server. Quite useful if you have Dropbox: you can just put your Anki deck and media in the public folder, copy and paste the public URL as the media URL, and review your deck with media anywhere :)

cescoz
Member
From: Italy
Registered: 2008-01-22
Posts: 131

Maybe the most useful thing to do is collect all the Anki decks with media from all the members of the site (those who want to participate, of course!) and then make a massive deck of 50,000+ sentences... even if not all the media is there, it would be phenomenal °_° for learning.
It's only an idea though; a deck is a personal thing for the person who made it... hmm, I don't know.
