We can use subs2srs output to search for stuff we want to learn, with the benefits of:
* Liking the sentences, since the source is a favorite drama/anime/movie
* It is real native material
* We get the audio of a real actor saying it as it should be said
* An image from the scene as a visual aid
* We know the context, since we've watched it (and perhaps listened to it a bunch of times with an mp3 player)
Note: I've combined the .tsv files generated by subs2srs into one big combo.tsv file, so combo.tsv contains all the text for many of my favorite dramas/anime/movies.
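The combining step is just concatenation; here's a minimal Perl sketch of it. The sub name combine_tsvs is my own invention, not the poster's code, and a plain `cat *.tsv > combo.tsv` in a shell does the same job:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of the combining step: append every subs2srs .tsv export
# into one combo.tsv. combine_tsvs is a hypothetical helper name.
sub combine_tsvs {
    my ($outfile, @files) = @_;
    open my $out, '>', $outfile or die "$outfile: $!";
    for my $file (@files) {
        open my $in, '<', $file or die "$file: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out;
}

# Skip combo.tsv itself so re-running doesn't duplicate everything.
combine_tsvs('combo.tsv', grep { $_ ne 'combo.tsv' } glob '*.tsv');
```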
Overview of how _I_ do it:
1) I write/copy-paste the stuff I want to learn to input.txt (example available below)
2) I run my sentence-searcher script (which merely prints the sentences that contain what I put in input.txt).
3) I sentence-pick the resulting sentences in out.csv, listening to each sentence while also reading its text (using another script).
Here's what input.txt can look like (it contains the things to search for):
何もかも
運命
判
それでも
むちゃくちゃ
ほど
[snipped, more lines...]
I've done this for ~910 sentences this month, which means ~30 sentences per day.
I also use a script that plays the sentences by line number (ENTER just plays the next line) while I'm picking through the out.csv file, and it helps A LOT. It's easy to make: just look at the code for regexp:ing out the mp3 filename, then system("mplayer $1"); where $1 is the mp3 filename.
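A sketch of that player, under my own assumptions about the details (out.csv as in step 3, mp3_from_line as a made-up helper name); the regexp is the same sound:...] idea used in the snippets, and mplayer must be on the PATH:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the mp3 filename out of a subs2srs card line; returns undef if none.
sub mp3_from_line {
    my ($line) = @_;
    return $line =~ /sound:(.*?)\]/ ? $1 : undef;
}

# Step through out.csv: each ENTER plays the next line's audio with mplayer.
if (-e 'out.csv') {
    open my $csv, '<', 'out.csv' or die "out.csv: $!";
    while (my $line = <$csv>) {
        my $mp3 = mp3_from_line($line) or next;
        print "line $.: $mp3 -- ENTER plays, q quits: ";
        last if <STDIN> =~ /^q/i;
        system('mplayer', $mp3);   # list form avoids shell-quoting surprises
    }
}
```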
The double foreach part of the Perl code I wrote:
Code:
foreach $currLineFromInput (@InputDotTxtFile){
    foreach $currSentence (@ComboFile){
        if($currSentence =~ /$currLineFromInput/){ # <--- matching
            print toFILE $currSentence; # prints only if it matched
        }
    }
}

The way I regexp the images and mp3s out of the standard subs2srs format is:
Code:
if( /sound:(.*)\]/ )
if( /img src=\"(.*)\"\>/ )

The variable $1 will contain the filename. Those snippets were actually the "hardest part" of this project.
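Putting those snippets together, here's a self-contained sketch of the whole searcher. The open/close boilerplate and the find_matches sub name are my additions, and I use \Q...\E so the input lines match as literal text rather than as regexps, which differs slightly from the original /$currLineFromInput/ match:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # input.txt/combo.tsv are UTF-8

# find_matches is a made-up name wrapping the double foreach above.
sub find_matches {
    my ($terms, $sentences) = @_;
    my @hits;
    for my $term (@$terms) {
        for my $sentence (@$sentences) {
            # \Q...\E treats the search term as literal text, not a regexp
            push @hits, $sentence if $sentence =~ /\Q$term\E/;
        }
    }
    return @hits;
}

if (-e 'input.txt' && -e 'combo.tsv') {
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    chomp(my @terms = <$in>);

    open my $combo, '<', 'combo.tsv' or die "combo.tsv: $!";
    my @sentences = <$combo>;

    open my $out, '>', 'out.csv' or die "out.csv: $!";
    print {$out} $_ for find_matches([grep { length } @terms], \@sentences);
    close $out;
}
```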

The file I/O takes the most time (a couple of seconds, depending on how many sentences you picked).
EDIT: Don't email me questions, ask here, the address I supplied to the site is fake.
EDIT: One way to filter away sentences that are not i+1 could be to have the script write the kanji we already have in our Anki deck to a file, then check whether the current sentence has only one or two unknowns. This is on hold for me; working with UTF-8 was slightly harder than I thought. Also, my collection isn't so huge that I absolutely need this.
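That filtering idea could be sketched like this (unknown_kanji_count is a hypothetical name, not working code from the post; the UTF-8 part that caused trouble mostly boils down to decoding everything into characters before comparing them):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file contains kanji literals
use open qw(:std :encoding(UTF-8));    # decode file and terminal IO as UTF-8

# Count kanji in $sentence that are not in the string of known kanji.
# A sentence is roughly "i+1" when this returns 1 (or at most 2).
sub unknown_kanji_count {
    my ($sentence, $known) = @_;
    my %known = map { $_ => 1 } split //, $known;
    my $count = 0;
    # \p{Han} matches CJK ideographs once strings are decoded as characters
    for my $ch ($sentence =~ /(\p{Han})/g) {
        $count++ unless $known{$ch};
    }
    return $count;
}
```

With the deck's kanji exported to a file, the searcher could then drop any sentence where this count is more than one or two.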
EDIT: I guess I have to say this: I'm not attacking whatever method people use to pick sentences; I'm merely showing another way of doing things.
Edited: 2009-07-19, 4:12 am
