Japanese VOB sub to plain text?

Reply #1 - 2013 March 17, 6:36 am
kodorakun Member
From: Seattle Registered: 2008-10-15 Posts: 276 Website

Hi All,

Is there a tried, true, and ideally free way to go from ripped DVD VOBSUB subtitles to plain text? Now that I live in Japan and can rent DVDs, there's a goldmine of subtitles waiting to be mined. If anyone can help me out I'll rent a couple of DVDs in your name and rip the subs for you (if they exist -- but almost all DVDs have subtitles).

Cheers,

K.

Reply #2 - 2013 March 17, 6:41 am
Zarxrax Member
From: North Carolina Registered: 2008-03-24 Posts: 949

The only thing I have had any serious luck with was SmartOCR ( http://forum.koohii.com/viewtopic.php?id=2608 ).
I used it on a full movie, but had to OCR each image individually. It had something like 95% accuracy, and it took me several hours to do one movie. (You need to read each line eventually anyway, so doing it during the OCR step isn't really all that much extra work. It can be part of the learning process.)

Last edited by Zarxrax (2013 March 17, 6:42 am)

Reply #3 - 2013 March 17, 8:31 am
toshiromiballza Member
Registered: 2010-10-27 Posts: 277

Here's something from another forum:

I looked around for OCR software to process that sequence of BMPs generated by SubRip. I successfully tested Abbyy Finereader 10. It is expensive, but the 15-day trial version does the job (for 15 days anyway). Here is what I did:


1. Generate BMPs with SubRip using the "Custom Colors and Contrast" option, and choose black characters and white background (I set those "border" areas to white as well). I set a minimum width of 360 pixels (and minimum height of 50 pixels) for the output bitmaps.

(1.1) In my case SubRip had a problem generating a correct sequence of BMPs from the original VOB file on the DVD (ca. 5% of the subtitle BMPs were either blank or a grainy mess). I solved this problem by generating an IDX file out of the VOB file, using VSRip. I then used that IDX file with SubRip.

2. Insert the bulk of BMPs into a Word Document (e.g. simply select the whole directory on the insert dialogue in Word). The result should see all bitmaps arrayed vertically, i.e. one bitmap per "line".

3. Save that document as a PDF file.

4. Open that PDF in Abbyy Finereader. Then comes the time-consuming part of resolving the ambiguities that Abbyy Finereader has marked (i.e. manually selecting, from a list of suggestions, the right match for a character that the program wasn't able to recognize with certainty). That took me around 2 hours of work for one hour of movie. I haven't found a way to "teach" the software to learn from my corrections. For example, there may be a very common character in the text that Abbyy Finereader repeatedly fails to read. I then had to resolve each instance separately, rather than tell the software once and have it apply the correction to all instances. If anyone knows how to "teach" Abbyy Finereader, please let me know.

5. The 15-day trial version of Abbyy Finereader doesn't allow you to save the resulting document, so instead I simply copied and pasted each page separately into a new Word document. It is a good idea to insert a page break into the Word document for each new page copied from the Finereader output.

6. You will notice that Finereader does not observe the line breaks of the original document, instead putting a simple space between each line of subtitles. That is not a problem: once you have copied and pasted the complete clean OCR output into Word, you can simply replace " " with "^p", and you are good.

7. If you inserted a page break in Word after each new copy+pasted page from Finereader, then it is easy to error-check the line breaking in Word: each page should have exactly the same number of rows (provided that each of the original BMPs had the same height).

8. The result is a text that contains all subtitles as plain text -- one row per original subtitle BMP. You can go on to convert it into an SRT file (e.g. take the time stamps from the corresponding English subtitle files, which SubRip can OCR without additional software).


In this way, I obtained Chinese subtitles of the first episode of 痞子英雄 (Black & White). It is easy to create Pinyin subtitles from that file, too. Overall I am very surprised and satisfied by the OCR performance of Abbyy Finereader. This procedure is nevertheless time-consuming. It could be sped up significantly if I found a way to "teach" Finereader. If anyone can tell me how to do that, I would highly appreciate the help.
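
Step 8 above (reusing the time stamps of an existing subtitle file for the OCR'd lines) could be scripted roughly as follows. This is only a sketch: it assumes the English SRT and the OCR'd text line up one-to-one, which will not hold for every disc, and the names `extract_timings` and `build_srt` are made up for illustration. The count check plays the same role as the row-count check in step 7.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_timings(srt_text):
    """Return the timing lines of an SRT file, in order."""
    stripped = (line.strip() for line in srt_text.splitlines())
    return [line for line in stripped if TIMING.fullmatch(line)]


def build_srt(timings, ocr_lines):
    """Pair each timing line with one OCR'd subtitle line (one per original BMP)."""
    if len(timings) != len(ocr_lines):
        # A mismatch usually means a blank BMP or a broken line split.
        raise ValueError(f"{len(timings)} timings but {len(ocr_lines)} text lines")
    blocks = []
    for number, (timing, text) in enumerate(zip(timings, ocr_lines), start=1):
        blocks.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(blocks)
```

If the counts differ, the function refuses to guess; fixing the text file by hand first is safer than ending up with shifted subtitles.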

Reply #4 - 2013 March 17, 9:44 am
Oniichan Member
From: 名古屋 Registered: 2009-02-02 Posts: 269

Maybe this...

http://forum.koohii.com/viewtopic.php?p … 15#p124615

Reply #5 - 2013 March 18, 5:06 am
kodorakun Member
From: Seattle Registered: 2008-10-15 Posts: 276 Website

Oniichan wrote:

Maybe this...

http://forum.koohii.com/viewtopic.php?p … 15#p124615

Thanks oniichan, I think this is probably the easiest method. I tried it out and it worked fine and relatively fast, with minimal management too. I have to run a virtual box with Windows but that's not so bad.

If you have any subtitle requests let me know and I'll see if Tsutaya has anything in stock.

k.

Reply #6 - 2013 March 18, 6:00 am
Zarxrax Member
From: North Carolina Registered: 2008-03-24 Posts: 949

That is actually working for you?
When I tried it on a movie, the accuracy was somewhere around 30-50% or so. Completely unusable for me. Maybe it was just the movie I had, I guess.

Reply #7 - 2013 March 18, 6:12 am
Rayath Member
From: Kansai Registered: 2008-07-22 Posts: 88

kodorakun, sorry for going a little off-topic, but you got me interested.
So most of the dramas and movies on DVD in Japan have subtitles (even older ones)? Is Tsutaya the best place to rent? How much do they charge for rentals?

Reply #8 - 2013 March 18, 7:17 am
kodorakun Member
From: Seattle Registered: 2008-10-15 Posts: 276 Website

Zarxrax: No problems at all for me. I ripped the subs to an .idx file and loaded them up, and everything went fine. The subs themselves (in plaintext form) had some glitches here and there, but after eye-scanning them they looked more or less reliable and worthwhile. The glitches were caused by weird symbols like parentheses, question marks, or Japanese quotation marks as far as I could tell.

Rayath: Yeah, dude! I don't know why people on this forum haven't gone Tsutaya crazy, but there are HUGE selections of Western films, Western dramas, Japanese films, and Japanese dramas, and they all have Japanese subtitles (though J subtitles often don't match the J-dub on foreign content films). Some movies don't come with subtitles, to be fair... Unfortunately for anime fans it seems anime is one category that almost uniformly does NOT have subtitles included. The more popular animations do sometimes have subtitles, but long series (e.g. One Piece) do not.

It's 100 yen for ONE WEEK to rent not-new DVDs. And you can also rent Japanese music CDs... As far as I can tell Tsutaya is a jackpot for media content. The new releases are something like 300 or 400 yen for a night or two, can't remember... I usually rent Aibou or some old content as there is so much available and it's cheap.

K

Reply #9 - 2013 March 18, 9:46 am
Rayath Member
From: Kansai Registered: 2008-07-22 Posts: 88

Wow, sounds really good. So I've read that if you rip subs from DVDs, they come in an image format. Can you use them like that with avi files later, or do you need to change them into text somehow?

Last edited by Rayath (2013 March 18, 9:55 am)

Reply #10 - 2013 March 18, 11:36 am
tokyostyle Member
From: Tokyo Registered: 2008-04-11 Posts: 720

kodorakun wrote:

Thanks oniichan, I think this is probably the easiest method. I tried it out and it worked fine and relatively fast, with minimal management too. I have to run a virtual box with Windows but that's not so bad.

subs2srs and vob2text both work perfectly fine in Wine and CrossOver. For CrossOver specifically all you need to do is create a bottle and install the Microsoft .NET 3.5 SP1 package and it will sort out all of the dependencies for you.

Zarxrax wrote:

That is actually working for you?
When I tried it on a movie, the accuracy was somewhere around 30-50% or so. Completely unusable for me.

Just use the .SRT file from vob2text as the L2 and the .IDX from the vobsub as the L1 and then the original subtitles will be on the back of your card for corrections.

Reply #11 - 2013 March 19, 5:12 am
kodorakun Member
From: Seattle Registered: 2008-10-15 Posts: 276 Website

Rayath wrote:

Wow, sounds really good. So I've read that if you rip subs from DVDs, they come in an image format. Can you use them like that with avi files later, or do you need to change them into text somehow?

Well, they come out in VOBSUB format, which is kind of like an image. If you use subs2srs to process the vobsub files, they are inserted as PNG files (images) into Anki decks after some processing. After the OCR processing you get an srt file, which is plaintext.

In any case, skipping all that, you can rip DVDs with HandBrake and select all the subtitle tracks to be soft-coded (not permanently displayed) into the ripped mkv/mp4 file. You can extract these subs to a file if you want, or just leave them embedded in the video file so you can optionally turn them on and off or swap between E and J (if both E and J are available). Or you can even manually insert E subs you find online during the encoding. Look up HandBrake, it's pretty sweet.
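
For anyone curious what is inside the .idx half of a VOBSUB pair: it is a small text file whose entry lines carry a start time and an offset into the .sub file. A rough sketch of reading those lines, assuming the usual `timestamp: HH:MM:SS:mmm, filepos: <hex>` layout (the function name `parse_idx` is made up, and the sample text in the usage note is invented):

```python
import re

# An entry line in a VobSub .idx file typically looks like:
#   timestamp: 00:01:02:345, filepos: 000001000
ENTRY = re.compile(
    r"timestamp:\s*(\d+):(\d{2}):(\d{2}):(\d{3}),\s*filepos:\s*([0-9a-fA-F]+)"
)


def parse_idx(idx_text):
    """Return (start_ms, byte_offset) for every subtitle entry in the index."""
    entries = []
    for line in idx_text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # header, palette, and comment lines are skipped
        hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
        start_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
        # filepos is assumed to be hexadecimal, as in typical VSRip output
        entries.append((start_ms, int(match.group(5), 16)))
    return entries
```

Calling `parse_idx` on the text of an .idx file gives one tuple per subtitle image, which is handy for checking how many subtitles a disc has before spending time on OCR.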

K.

Reply #12 - 2013 April 30, 9:02 am
eslang Member
Registered: 2012-01-27 Posts: 98

Zarxrax wrote:

That is actually working for you?
When I tried it on a movie, the accuracy was somewhere around 30-50% or so. Completely unusable for me. Maybe it was just the movie I had, I guess.

It works pretty well (about 80% accuracy) on some movie (idx+sub) files, but not on others (almost 90% gibberish)... if I'm not mistaken, it may depend on the background color, font color, font border, etc. Basically, the colors and transparency of the subtitle file have to be right for the OCR to work properly.
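
To illustrate the color issue: DVD subtitle bitmaps use a tiny palette (background, text fill, outline, anti-aliasing), and OCR tends to do better when only the fill is kept as black on white, much like the "black characters and white background" setting mentioned for SubRip earlier in the thread. A minimal sketch, assuming the bitmap has already been decoded into rows of palette indices; which index is the fill varies from disc to disc, and the function names are invented:

```python
def binarize(indexed_rows, text_index):
    """Map the text fill colour to black (0) and everything else to white (255)."""
    return [
        [0 if pixel == text_index else 255 for pixel in row]
        for row in indexed_rows
    ]


def looks_blank(rows):
    """True when no black pixel survived, i.e. the wrong index was probably chosen."""
    return all(pixel == 255 for row in rows for pixel in row)


# Tiny made-up frame: 0 = background, 2 = outline, 1 = text fill.
frame = [
    [0, 2, 2, 0],
    [0, 2, 1, 2],
    [0, 2, 2, 0],
]
clean = binarize(frame, text_index=1)
```

If `looks_blank` returns True for most frames, trying another palette index is probably the first thing to check.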

Personally, I find it easier to transcribe from the image files. On average, it takes about 3-4 hours for a 90-minute (Japanese) movie.

Last edited by eslang (2013 April 30, 9:02 am)
