Test case: using Subs2srs with Tonari no Totoro

#1
All hail cb4960!

I've spent the last four days figuring out how to use subs2srs to make an Anki deck for Tonari no Totoro. I'm not quite done yet. I'm leaving out all the false starts and errors.

Note: if I mention using a program, it doesn't mean it's the best program. It's just the first program I found when I googled for an answer. Feel free to recommend better ones.

Here's how I did it:

1. I bought My Neighbor Totoro. (DVD)

2. I used DVDFab 7 to rip the VOBs.
Why?
The newest release of the DVD uses copy protection that makes the disc look corrupt. Even HandBrake chokes on it.
Is this really necessary?
Not for most DVDs. You should be able to manually copy the VOB files to your hard drive.
What are VOBs?
VOBs are the Video Object files that contain the movie.

3. I used SubRip v1.50 Beta 4 to turn the English subtitles (and timings) into a .srt file.
Why?
I wanted the timings to use on the Japanese subtitles. The English voice actors have to match the mouth flaps when they do a dub, so the timings were going to be good enough to work with.
How does it work?
You click Open Vob (or image sequence, or hardsubbed files), pick the directory that has the VOBs in it, and click start. A box pops up and asks you to fill in the first letter it can't recognize. You do this for all 26 letters, plus upper case, plus punctuation, and it automagically writes out a text file for you. It's pretty neat, actually.

4. I found the Japanese script for Tonari no Totoro, and hit Google Translate.
What? Why?
Google Translate produces terrible translations, but when you copy and paste the selected text, it gives you the original Japanese followed by the mangled English on the same line, like so:

(サツキ) お父さん キャラメル (Satsuki) Caramel Dad
(父さん) おっ ありがとう (Father) Thank you Oh
くたびれたかい? Kai tired?
ううん No

This is a great timesaver for lining up subs.
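
Because each pasted line carries the Japanese first and the machine English after it, the two halves can be pulled apart again by script when you want them in separate columns. This is only a rough sketch: the function name is made up for illustration, and it assumes the Japanese half always ends at the last kana or kanji on the line (plus any punctuation glued to it), which holds for the sample lines above but may not hold for every script.

```python
import re

# Hiragana, katakana (including the long-vowel mark), common kanji, and the repeat mark.
JAPANESE = r"\u3040-\u30ff\u4e00-\u9fff\u3005"

# Greedy match up to the LAST Japanese character, plus punctuation attached to it.
SPLIT_RE = re.compile(rf"^(.*[{JAPANESE}][^\sA-Za-z]*)\s*(.*)$")


def split_pasted_line(line):
    """Return (japanese, english) for one line pasted from Google Translate."""
    line = line.strip()
    match = SPLIT_RE.match(line)
    if match is None:
        # No Japanese found: treat the whole line as English.
        return "", line
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for sample in ["(サツキ) お父さん キャラメル (Satsuki) Caramel Dad", "ううん No"]:
        print(split_pasted_line(sample))
```

Lines where the English half is followed by more Japanese would be split in the wrong place, so the result still needs a quick look by eye.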

4. I did something really complicated with OpenOffice to turn the .srt file into something I could manipulate in a spreadsheet.
WHAT?!?
Srt files look like this:
27
00:02:43,070 --> 00:02:46,471
-Dad, want some carameI?
-Thanks. How you doing back there?

28
00:02:46,573 --> 00:02:48,040
-Fine.
-Are you tired? Oop!

What I needed was to put everything on one line like this:
Code:
27    00:02:43,070 --> 00:02:46,471 -Dad, want some carameI? -Thanks. How you doing back there? (サツキ) お父さん キャラメル (Satsuki) Caramel Dad (父さん) おっ ありがとう (Father) Thank you Oh
A spreadsheet seemed like the thing to do, with cells for:
Number | Timing range | English translation | Japanese script line |
Why?
So I could mangle it back into an .srt file format after I lined up the script with the timings.
Okay, start the complicated explanation.

5. Complicated explanation:
Using OpenOffice Writer, open the .srt file as a text file.
Go to Edit: Find/Replace.
Pick the "More Options" button
Click "Regular Expressions"
Search for: $
(The dollar sign matches the end of a line in Regular Expression speak.)
Replace with: @
(@ means nothing - it's a placeholder for the next step)
Then:
Search for: @
Replace with: \t
(\t means tab in Regular Expression speak. Get the backslash direction right, though.)
Congratulations! You now have a tab delimited text file that any spreadsheet in the world will recognize!
You can reverse the process when you are done working in the spreadsheet. Export -> tab delimited does the trick; then open the file in a word processor and do these steps backwards. Microsoft Word doesn't use Regular Expressions, but it does let you find and replace special characters the same way.
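
For anyone who would rather script this step than do it by hand in a word processor, here is a minimal Python sketch of the same idea. It is not what was done in this post; the function names are invented for illustration, and it assumes the .srt file is saved as UTF-8 with entries separated by blank lines. Each entry becomes one row with three cells: number, timing range, and subtitle text.

```python
import csv
import re


def srt_to_rows(srt_text):
    """Turn SRT text into rows of [number, timing, subtitle text]."""
    rows = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # skip fragments that lack a number and a timing line
        number, timing = lines[0], lines[1]
        text = " ".join(lines[2:])  # fold multi-line subtitles into one cell
        rows.append([number, timing, text])
    return rows


def srt_file_to_tsv(srt_path, tsv_path):
    """Read an SRT file and write it out as tab-delimited text."""
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with open(srt_path, encoding="utf-8-sig") as src:
        rows = srt_to_rows(src.read())
    with open(tsv_path, "w", encoding="utf-8", newline="") as dst:
        csv.writer(dst, delimiter="\t").writerows(rows)
```

Calling `srt_file_to_tsv("totoro_en.srt", "totoro_en.tsv")` (both file names are placeholders) should give a file that a spreadsheet opens with one subtitle per row.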

6. Time to cut and paste!
What?!? More steps?
Yes.
Cut and paste the Japanese script with the Google Translate stuff into a column next to the English subtitles. You should be able to do hundreds of lines at a time. This beats doing this in Aegisub, where I could not figure out how to do it except one line at a time, and you lose the English lines as a reference for whether you are lined up correctly.
Cut and paste the Japanese script without the Google Translate stuff into the next column.
The Japanese script is going to have two lines where the English subtitles have one. So, after you cut and paste the script, you have to go back and combine lines, and then delete the empty cell to move things up. Do this for both columns of Japanese script.
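
The combine-and-shift-up move is the part that eats the most time in a spreadsheet. As an illustration only (the helper name is hypothetical, and the column is treated as a plain Python list of cell strings), the operation amounts to this:

```python
def merge_with_next(cells, index, sep=" "):
    """Join cells[index] with the cell below it and close the gap."""
    if not 0 <= index < len(cells) - 1:
        raise IndexError("need a cell below the one being merged")
    merged = cells[index] + sep + cells[index + 1]
    # Everything after the merged pair moves up by one row.
    return cells[:index] + [merged] + cells[index + 2:]


if __name__ == "__main__":
    column = ["(サツキ) お父さん キャラメル", "(父さん) おっ ありがとう", "くたびれたかい?"]
    for cell in merge_with_next(column, 0):
        print(cell)
```

Deciding which rows to merge still takes a human looking at the English column; the sketch only shows what happens to the cells once that decision is made.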

7. Finishing work
I'm not done yet, but the finishing steps are:
a. delete the columns with the English subs
b. delete the terrible Google Translate translation column
c. export the spreadsheet as tab delimited text
d. import the text file into a word processor and exchange the tabs for returns. (Step 5 in reverse)
e. save as a text file. Hopefully the word processor will keep it as UTF-8, or you'll lose all your kana :)
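
Steps c through e could also be handled by a short script that writes UTF-8 explicitly, which avoids the risk of losing the kana. This is a sketch under assumptions, not the method used here: the function names are invented, and it assumes the exported file has exactly three tab-separated columns left (number, timing range, Japanese line).

```python
import csv


def rows_to_srt(rows):
    """Build SRT text from rows of (number, timing, subtitle text)."""
    entries = []
    for number, timing, text in rows:
        entries.append(f"{number}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between entries.
    return "\n".join(entries)


def tsv_file_to_srt(tsv_path, srt_path):
    """Read a tab-delimited export and write it back out as an SRT file."""
    with open(tsv_path, encoding="utf-8", newline="") as src:
        reader = csv.reader(src, delimiter="\t")
        rows = [row[:3] for row in reader if len(row) >= 3]
    # Writing with an explicit encoding keeps the kana intact.
    with open(srt_path, "w", encoding="utf-8") as dst:
        dst.write(rows_to_srt(rows))
```

Rows with fewer than three cells are skipped rather than guessed at, so a quick count of entries before and after is a sensible check.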

8. Checking
Open the video in VLC and after starting the movie, go to Video: Subtitles Track: Open File and try out your file.

9. Now that it doesn't sync, download Aegisub, and open your subs in that program.
In Aegisub, after you open your subs, go to Audio: Open audio from video.
After Aegisub crashes, download BeSweet and BeLight. Dump BeSweet into the same folder as BeLight.
Open BeLight, and use it to open one of the VOBs (or better yet, avi) from the movie and pick the WAV tab. You want to save the audio track as a WAV. Do this for each VOB.
Why does Aegisub crash?
Unless it is working with WAV files, Aegisub decompresses the audio into your computer's memory as one giant WAV. In my case that was 972MB of WAV data, more than my computer could handle given the other things I had open.
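
That 972MB figure is about what you would expect for a feature-length film. As a back-of-the-envelope check, uncompressed audio takes seconds x sample rate x channels x bytes per sample. The numbers below are assumptions, not taken from the disc: 48 kHz 16-bit stereo and a running time of roughly 86 minutes.

```python
def pcm_wav_bytes(seconds, sample_rate=48_000, channels=2, bytes_per_sample=2):
    """Approximate size of uncompressed PCM audio (the small WAV header is ignored)."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    # Assumed figures: an 86-minute film at 48 kHz, 16-bit, stereo.
    size = pcm_wav_bytes(86 * 60)
    print(f"{size / 1_000_000:.0f} MB")
```

Under those assumptions the result lands a little under 1 GB, the same ballpark as the 972MB seen here.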

10. Now you have an important choice:
You can either edit the timings with multiple srt files, the original VOBs for video, and multiple WAV files
OR
You can edit the timings with the one srt file you've just spent hours creating, and you rip the VOBs into one video file, and join the WAVs into one giant WAV file.

# If you want to mess with multiple VOBs & multiple WAVs, you'll have to split your srt file to match.
# If you want to work with what you've created, you'll need DVDFab or HandBrake to turn your VOBs into a single video file. (I think mkv format works.) You can use Audacity (yet another free program) to connect all your WAV files.
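
If you would rather not load every WAV into Audacity, joining them can also be sketched with Python's standard `wave` module. This is an alternative, not what was done in this post; the function name is made up, and it assumes every file is plain PCM with the same channel count, sample width, and sample rate.

```python
import wave


def join_wavs(input_paths, output_path):
    """Concatenate PCM WAV files that share the same audio format."""
    if not input_paths:
        raise ValueError("no input files given")
    expected = None
    with wave.open(output_path, "wb") as out:
        for path in input_paths:
            with wave.open(path, "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected is None:
                    # The first file decides the format of the output.
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has a different audio format")
                # Copy in chunks so a long track never sits in memory all at once.
                while True:
                    frames = src.readframes(65536)
                    if not frames:
                        break
                    out.writeframes(frames)
```

The files need to be passed in playback order, for example `join_wavs(["part1.wav", "part2.wav"], "movie.wav")`, where the file names are placeholders.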

11. Use Aegisub to edit your subtitle timing.

12. Import your video file into Subs2SRS, along with your subtitle files. Hit preview a bunch of times to see if it complains.
13. There's a 'preview timings' thing in Subs2SRS. You should check to see if it works.
14. Run Subs2SRS.

15. Write a long-winded post describing how you did it.
16. ???
17. Profit!

Thanks for reading this far. It took me 2 1/2 hours to write this up, including going back and checking steps in the different software programs. Please let me know how I can improve this.
#2
Wow. That's a lot of work. There has to be a way to automate a good portion of that. I'll be thinking about it. :)
#3
wccrawford Wrote:Wow. That's a lot of work. There has to be a way to automate a good portion of that. I'll be thinking about it. :)
Thanks, wccrawford!
Yes, I was thinking the same thing: all of the regexing to make a .srt into a format that can be whacked around in a spreadsheet could easily be done with a single-line bash script or a bit of Perl. Unfortunately, I suck at scripting. :)

All the other fiddly little steps could probably be wrapped up in some other bit of code that feeds each program what it wants. You'd still need user intervention with the OCR (unless you had a pretty good training file to begin with), and you'd still have the manual join/split lines thing in the spreadsheet, but that would be a pretty big improvement as well. I lost track of the number of different little programs I downloaded to make this thing work.* I think any set of instructions that starts with, "okay, download these 6 or 8 programs..." is doomed to never be followed.

I'm really hoping that someone will come along and say, "hey dude, you could have done all that in Aegisub, if you just do x, y, and z". I'm okay with that. I don't mind being wrong about how to do something, if there's an easier way.

*I still don't remember why I downloaded VobSub, but fyi, there's a patch you need to download to make the SubResync module work if you have Internet Explorer 8 installed.
#4
I don't have this film or the subs, but I could do the following with little to no effort in a couple of hours, tops, methinks. ;p

1)Download Tonari no Totoro.
2)Download Japanese and English subs from kitsunekko/etc.
3)If Japanese subtitles don't match, use Aegisub to shift timing to fit whatever version of the film was downloaded, or choose to use Subs2 timing in Subs2SRS if the versions match.
4)Spend a couple minutes in Subs2SRS checking/unchecking the preferred options.
5)Run program.
6)Import into Anki.
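
For the simple case in step 3, where the downloaded subs are off by a constant amount, the shift can be sketched in a few lines of Python instead of opening Aegisub. This is an illustration only: the function names are invented, and it assumes standard SRT timestamps of the form HH:MM:SS,mmm. It will not fix subs that drift or that were timed to a different cut of the film.

```python
import re

# Matches SRT timestamps such as 00:02:43,070
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return one timestamp moved by offset_ms milliseconds."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = (h * 3600 + m * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)  # clamp so nothing starts before the film does
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_srt(srt_text, offset_ms):
    """Shift every timestamp in SRT text; negative offsets move subs earlier."""
    return TIMESTAMP.sub(lambda mt: shift_timestamp(mt, offset_ms), srt_text)


if __name__ == "__main__":
    print(shift_srt("00:02:43,070 --> 00:02:46,471", 1500))
```

A subtitle line that happens to contain text in the same HH:MM:SS,mmm shape would be shifted too, which is unlikely in dialogue but worth knowing.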

At this point, I'm not sure that there are any Japanese films that aren't online. It's just a matter of encouraging release groups to rip/share the Japanese subs. :(
Edited: 2010-03-31, 4:16 pm
#5
nest0r Wrote:2)Download Japanese and English subs from kitsunekko/etc.
3)If Japanese subtitles don't match, use Aegisub to shift timing to fit whatever version of the film was downloaded, or choose to use Subs2 timing in Subs2SRS if the versions match.
Well, that was the plan, see. :)

I did find Japanese subtitles in idx/sub format, which Aegisub choked on. Is that normal?

Anyway, once I've got this thing fixed, I'll upload the script subs to kitsunekko. I'm resisting the urge to OCR the Japanese idx/sub, since I'm "almost" done. Even though "almost done" keeps receding from me the closer I get. I swear I'll reach the horizon!!!
Tom
#6
hehe, I don't know about Aegisub (I'm pretty sure it supports idx/sub), but I know other programs could handle the format, though I would probably just tell Subs2SRS to use the timing for the English subs. There are also the fix-mismatched-lines and pre/post timing-shift options in Subs2SRS.
Edited: 2010-03-31, 4:42 pm
#7
nest0r Wrote:hehe, I don't know about Aegisub (I'm pretty sure it supports idx/sub), but I know other programs could handle the format, though I would probably just tell Subs2SRS to use the timing for the English subs.
I thought very seriously about that, but in the end I wanted to be able to cut/paste kanji to look them up, and I think idx/sub are image files, so that wouldn't be helpful.

Tom
#8
TomWatana Wrote:
nest0r Wrote:hehe, I don't know about Aegisub (I'm pretty sure it supports idx/sub), but I know other programs could handle the format, though I would probably just tell Subs2SRS to use the timing for the English subs.
I thought very seriously about that, but in the end I wanted to be able to cut/paste kanji to look them up, and I think idx/sub are image files, so that wouldn't be helpful.

Tom
You could always type what you hear and select the appropriate kanji via IME, or do an English-Japanese dictionary search, or just type the kanji readings you do know and copy paste once you have the sequence, or use the IME pad to search by radical, or use jisho.org's kanji lookup thingy.

Anyway, not trying to diss your efforts, I just think there are easier ways, especially for popular films. ;p And there are so many idx/sub files out there in Japanese (re: my thread for subs), so it would be troublesome for you to have to go through this process very often.
Edited: 2010-03-31, 4:47 pm
#9
Is there a program that can take Japanese subs off of VOB files?
I've found Japanese subs a lot harder to come by than English ones.

OK, getting idx/sub files is no problem whatsoever. Now, my problem is converting them to srt, because my OCR doesn't recognize non-ASCII characters...

Specifically, I'd like the Japanese subs for the original 機動戦士ガンダム (sometimes 0079 is tacked on to the title), as well as for any movies I care to rent from GEO.
Edited: 2010-04-02, 7:37 am
#10
Japanese OCR software is still pretty awful, and subtitle graphics are pretty low-res for what's out. Also, Japanese people do not pirate anywhere near as much as pretty much every other country (probably because of low average computer proficiency thanks to keitai). Finally, most JP DVDs don't have subs to begin with unless it's a foreign movie. That's why there are so few JP subs out on the internets.

It might be better to just find a way to export the subs as jpgs and embed them in a deck that way, or "burn-in" the subs on a re-encode and have them in the screenshot.
#11
Yeah, after a more-or-less extensive search, I've convinced myself that I'm not going to find any reliable way to go from idx/sub to srt.

The good news is that I can use subs2srs to do just what you said, except it's pngs instead of jpgs.

I've found that quite a few of the movies I rent DO have Japanese subtitles, actually. Although, you're correct about Gundam -- nothing there :/

Perhaps I could start (contribute to?) a Japanese Subtitle database here at RevTK.
Edited: 2010-04-02, 8:56 am