HOWTO Finding example-sentences using Subs2SRS (native sentences ftw)

#1
We can use subs2srs output to search for stuff we want to learn, with the benefits of:

* Liking the sentences since the source is a favorite drama/anime/movies
* It is real native material
* We get the audio of a real actor saying it as it should be said
* An image from the scene as a visual aid
* We know the context since we've watched it (and perhaps listened to it a bunch of times on an mp3 player :D)

Note: I've combined the .tsv files generated by subs2srs into one big combo.tsv file, so combo.tsv contains all the text for many of my favorite dramas, anime and movies.
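
The post does not show how the .tsv files were combined, so here is a minimal sketch of one way to do it, in Python rather than the Perl used elsewhere in this post. The function name `combine_tsv`, the one-folder layout and the UTF-8 encoding are all assumptions for illustration.

```python
from pathlib import Path


def combine_tsv(source_dir, combined_path):
    """Concatenate every .tsv file in source_dir into combined_path.

    Returns the number of lines written. Blank lines are dropped.
    Assumes the subs2srs output files are UTF-8 (an assumption).
    """
    out_path = Path(combined_path)
    count = 0
    with out_path.open("w", encoding="utf-8", newline="\n") as out:
        for tsv in sorted(Path(source_dir).glob("*.tsv")):
            if tsv.resolve() == out_path.resolve():
                continue  # don't read the combined file back into itself
            with tsv.open(encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\r\n")
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count
```

Calling `combine_tsv("subs2srs_output", "combo.tsv")` would then produce one file to search, where `subs2srs_output` is a hypothetical folder holding the individual .tsv files.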

Overview of how _I_ do it:
1) I write/copy-paste the stuff I want to learn to input.txt (example available below)

2) I run my sentence-searcher script (which merely prints the sentences that contain what I put in input.txt).

3) I sentence-pick from the resulting sentences in out.csv, listening to each sentence while reading its text (using another script).


Here's what input.txt can look like (it contains the things to search for):
何もかも
運命

それでも
むちゃくちゃ
ほど
[snipped, more lines...]
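
Step 2 of the overview could be sketched end to end roughly like this, in Python rather than the author's Perl. The file names input.txt, combo.tsv and out.csv come from the post; the function names and the UTF-8 encoding are assumptions. Note that blank lines in input.txt (like the one in the sample above) are skipped, since an empty search term would match every sentence.

```python
from pathlib import Path


def load_terms(path):
    """Read search terms, one per line, skipping blank lines."""
    # utf-8-sig also accepts files saved with a byte-order mark
    text = Path(path).read_text(encoding="utf-8-sig")
    return [t.strip() for t in text.splitlines() if t.strip()]


def search(terms, sentences):
    """Yield each sentence that contains a term, grouped by term."""
    for term in terms:
        for sentence in sentences:
            if term in sentence:  # plain substring match, no regexp
                yield sentence


def run_search(input_path="input.txt", combo_path="combo.tsv",
               out_path="out.csv"):
    """Write all matching combo lines to out_path; return how many."""
    terms = load_terms(input_path)
    sentences = Path(combo_path).read_text(encoding="utf-8").splitlines()
    hits = list(search(terms, sentences))
    Path(out_path).write_text("".join(h + "\n" for h in hits),
                              encoding="utf-8")
    return len(hits)
```

A sentence that contains two of the search terms is written twice, once per term. That keeps the output grouped by what was searched for, but a deduplication pass may be wanted before importing.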

I've done this for ~910 sentences this month, which means ~30 sentences per day.

I also use a script that plays the sentences by line number (ENTER just plays the next line), which I use while picking from out.csv. It helps A LOT. It's easy to make: have a look at the code below for regexping out the mp3 filename, then call system("mplayer $1"); where $1 is the mp3 filename.
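
A rough sketch of such a player, in Python rather than Perl, might look like the following. It assumes each line carries a `[sound:...]` tag as in standard subs2srs output and that `mplayer` is on the PATH; the names `sound_file`, `play_lines` and `media_dir` are made up for illustration. The `run` and `wait` parameters exist only so the loop can be tried without actually launching a player.

```python
import re
import subprocess
from pathlib import Path

SOUND_RE = re.compile(r"\[sound:([^\]]+)\]")


def sound_file(line):
    """Return the mp3 filename from a [sound:...] tag, or None."""
    m = SOUND_RE.search(line)
    return m.group(1) if m else None


def play_lines(lines, media_dir, start=1, run=subprocess.run, wait=input):
    """Play the audio for each line from `start` (1-based); ENTER advances."""
    for number, line in enumerate(lines, start=1):
        if number < start:
            continue
        name = sound_file(line)
        if name is None:
            continue  # no audio tag on this line
        print(f"{number}: {line.rstrip()}")
        # Passing a list avoids shell quoting problems with odd filenames
        run(["mplayer", str(Path(media_dir) / name)])
        wait("ENTER for next line... ")
```

Typical use would be something like `play_lines(open("out.csv", encoding="utf-8"), "media", start=120)` to resume picking at line 120, with `media` standing in for wherever the mp3 files live.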

The double foreach part of the perl code I wrote:
Code:
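foreach my $currLineFromInput (@InputDotTxtFile) {
    chomp $currLineFromInput;
    next if $currLineFromInput eq '';   # a blank line would match every sentence
    foreach my $currSentence (@ComboFile) {
        # \Q...\E matches the search term literally instead of as a regexp
        if ($currSentence =~ /\Q$currLineFromInput\E/) {
            print toFILE $currSentence;   # prints only if it matched
        }
    }
}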
The way I regexp the image and mp3 filenames out of the standard subs2srs format is:
Code:
if ( /sound:([^\]]*)\]/ )
if ( /img src="([^"]*)"/ )
The variable $1 will contain the filename.

Those snippets were actually the "hardest part" of this project :)
The file IO takes the most time (a couple seconds depending on how many sentences you picked)

EDIT: Don't email me questions, ask here, the address I supplied to the site is fake.

EDIT: One way to filter out sentences that are not i+1 could be to have the script write the kanji we already have in our Anki deck to a file, then check whether the current sentence has only one or two unknowns. I've put this on hold; working with UTF-8 was slightly harder than I thought. Also, my collection isn't so huge that I absolutely need this.
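
The filtering idea above could be sketched like this in Python, where strings are Unicode already so the UTF-8 handling mostly goes away. This is only an illustration of the idea, not something the post implemented: the function names are hypothetical, `known` is assumed to be a set of kanji exported from the deck, and "kanji" is approximated as the main CJK Unified Ideographs block.

```python
def is_kanji(ch):
    """True for characters in U+4E00..U+9FFF (an approximation of 'kanji')."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_kanji(sentence, known):
    """Return the set of kanji in the sentence that are not in `known`."""
    return {ch for ch in sentence if is_kanji(ch) and ch not in known}


def near_i_plus_one(sentences, known, max_unknown=2):
    """Keep sentences with at least one and at most max_unknown unknown kanji."""
    return [s for s in sentences
            if 1 <= len(unknown_kanji(s, known)) <= max_unknown]


known = set("何運命日本")
print(near_i_plus_one(["運命の人", "運命", "東京大学病院"], known))
```

This counts unknown kanji, not unknown words, so a new word written only in kana or built from already-known kanji would slip through. It is a coarse first filter at best.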

EDIT: I guess I have to say this? I'm not attacking whatever method people use to pick sentences, I am merely showing another way of doing things. :)
Edited: 2009-07-19, 4:12 am
#2
1h 30m!?
Thank god I pick lines as I find them from the material I enjoy. That way, it takes me a copy+paste instead of 1h30m.
#3
Tobberoth Wrote: 1h 30m!?
Thank god I pick lines as I find them from the material I enjoy. That way, it takes me a copy+paste instead of 1h30m.
Agreed. Way too complicated. I just snatch the ones I want as I go along, and then I move on with my life. lol.
#4
That all sounded a bit convoluted. But what I think you're suggesting is:

Use a big database of text and accompanying audio/video generated by subs2srs from lots of shows to form a corpus of 'sample sentences' instead of using boring dictionary sentences. Sounds interesting, but again, way too convoluted.
Edited: 2009-07-16, 10:25 am
#5
igordesu Wrote: Agreed. Way too complicated. I just snatch the ones I want as I go along, and then I move on with my life. lol.
I paste what I want into a text-file, run a script (double-click), go through a file filled with sentences from my favorite drama/anime. I'm sorry I made the first post huge! I've edited it down to the essentials.

blackmacros hit the nail on the head. The point is searching your favorite material instead of boring stuff someone else chose.

Again, sorry for the long post...

Instead of adding this to the first post, I'll add it here to give the first post an illusion of being short.
General sentence-picking tips:
* If there are too many unknowns but you really want to learn the sentence, you can find some good intermediary sentences, each containing one of the unknowns, to put before it. This way you stay i+1 (i+1 == only one unknown per sentence, which keeps you progressing constantly. Important, IMO)

* It is fine to have a sentence with two unknowns, since you can put one of the unknowns in input.txt and turn it into a known the next time you sentence-pick

* Don't worry too much about what is before what in your deck, the spaced repetition algorithm will take care of it. :)

* Use material you love! It makes things easier.
Edited: 2009-07-16, 11:28 am
#6
Another big problem with this approach is the insane amount of space such a database must take up. 50k sentences with audio? That must be quite a few gigabytes of data you're not actively using.
Reply
#7
Tobberoth Wrote: Another big problem with this approach is the insane amount of space such a database must take up. 50k sentences with audio? That must be quite a few gigabytes of data you're not actively using.
For me the database takes up 8MB and the media files take up 1GB.
This is the space 6-7 subs2srs sets take. :)

Nope, this is not a big problem. (At least not for me, I prioritized this project over the JAV porn I had of Yoshizawa Akiho)

Please post any other problems with this little "system" I've set up, so I know what to improve.
Edited: 2009-07-18, 7:17 am
#8
Nonpoint,

First, you provide an excellent alternative to other structured methods for vocabulary learning. It's something quite a few people considered and implemented when subs2srs came about last year. I'll liken it to your own personal smart.fm.

I do the equivalent, but for vocabulary I just use iKnow sentences. However, I find it's easier to use Anki's search feature. So perhaps load your subs2srs in Anki, suspend the sentences you currently are not using, search for the word you want, and unsuspend the sentence if you like it.

But now comes the next questions: how are you finding out what these words mean? What steps do you take with sentences with new words? How are you determining what words you want to look up?
#9
Nukemarine Wrote: Nonpoint,
I do the equivalent, but for vocabulary I just use iKnow sentences. However, I find it's easier to use Anki's search feature. So perhaps load your subs2srs in Anki, suspend the sentences you currently are not using, search for the word you want, and unsuspend the sentence if you like it.

But now comes the next questions: how are you finding out what these words mean? What steps do you take with sentences with new words? How are you determining what words you want to look up?
I prefer not to use Anki when picking; I use Notepad++ with special color formatting and the ability to listen to the audio while reading. Also, I noticed Anki would get slow... The out.csv is fully compatible with Anki because I originally wanted to do as you have described here.

I find out what words mean using the http://www.sanseido.net/ J-J dictionary. I just add the definition after a tab; my card template has a field for that. For the first few sentences I actually used jisho.org English definitions, bleh.
Nukemarine, didn't you do some cool trick with Anki for using info from the web? Something with the card templates?

I just try to always be i+1 (or whatever it's called), and always have only one unknown. I don't treat a sentence with a new word any differently than a sentence with a new grammar point (I just make sure I understand the new thing).
I determine which words to look up by whatever piques my curiosity. I'm listening to the audio, so if it sounds like a cool word I'll learn it. Also, since I use Notepad++ I can see many lines simultaneously, so it is easy to spot commonly used words.
Edited: 2009-07-16, 1:25 pm
#10
I don't recall doing anything from the web. I did post my spreadsheets, and try to have a few items in Anki so it's versatile.

Anki does (or did) have a plug-in to use with Sanseido or other online dictionaries. I use the Kenkyusha J-J and J-E dictionaries I "acquired" on my computer for further ease.

Sounds like you have a good set-up. I'm not saying it's the best, I'm saying it's the best for you. For another person, "boring sentences" may be best especially at the beginning where every word is pretty much new. We all walk different paths at different speeds.

Looking forward to great stuff from you. Keep it up.
#11
Nukemarine Wrote: For another person, "boring sentences" may be best especially at the beginning where every word is pretty much new. We all walk different paths at different speeds.
Hehe, actually, when I started I did sentences from the Tae Kim spreadsheet you posted :)
Edited: 2009-07-16, 2:24 pm