Does anyone know of any (preferably free) software for extracting the subtitles and the audio from DVDs?
I have a program for the audio (DVD Audio Extractor), although it's a trial version that runs out in 30 days. I'm trying a couple of subtitle programs, but they either don't export as text or need you to tell them what every character is, and still end up with errors or messed-up punctuation. (Though maybe that's unavoidable; I've never done this before.)
I don't really want to do anything special with them. Just want the audio as an mp3 to put on my iPod/etc., and subtitles as text to read/to use for an SRS.
Avidemux: free and cross-platform. I find it pretty easy to work with. There's an article/blog post on the interwebs on how to rip subs from DVDs.
What's the point of extracting the subtitles? Japanese subtitles are never a faithful transcript anyway.
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
ghinzdra wrote:
What's the point of extracting the subtitles? Japanese subtitles are never a faithful transcript anyway.
I guess it might be useful to extract the English subs to get the correct timing. Then you could run that in subs2srs to get a deck sans Japanese subs that you can work through and translate yourself. Or you could then edit the subs with something like Aegisub and replace the English with Japanese subs you find online.
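For anyone who would rather script the "keep the English timing, swap in Japanese text" idea than do it by hand in Aegisub, here is a minimal sketch in Python. It assumes the ripped English subs are already a plain .srt file and that you have the Japanese lines as a list in the same order, one per cue; the function names (`parse_srt`, `swap_text`, `format_srt`) are made up for this example and are not part of subs2srs or Aegisub.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Return a list of (start, end, text) tuples from SRT-formatted text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the subtitle text.
                cues.append((match.group(1), match.group(2), "\n".join(lines[i + 1:])))
                break
    return cues


def swap_text(cues, new_lines):
    """Keep each cue's timing but replace its text, pairing cues and lines in order."""
    if len(cues) != len(new_lines):
        raise ValueError("cue count and line count differ; align them by hand first")
    return [(start, end, new) for (start, end, _), new in zip(cues, new_lines)]


def format_srt(cues):
    """Serialize (start, end, text) tuples back to SRT text, renumbering from 1."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{number}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

The pairing is strictly positional, so it only works when the Japanese script has exactly one line per English cue. In practice the two rarely line up that neatly, which is why the sketch refuses to guess when the counts differ.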
blackmacros wrote:
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
If you are doing a drama you can use dramanote to cut and paste. I'd hope there's some sort of transcript for whatever it is you want to OCR.
Timing extraction is another way to go, but I prefer having the program do some of the work and then do QA on the subs.
Last edited by nonpoint (2009 August 13, 6:43 pm)
I hadn't heard of Avidemux, I might give it a try tomorrow. Though that seems the same as SubRip, which I've been using now (though I've only tried it for Spanish things).
ghinzdra wrote:
What's the point of extracting the subtitles? Japanese subtitles are never a faithful transcript anyway.
I do have a couple of DVDs with exact (or near-exact) subs, and even if they aren't exact I don't mind, as long as I can learn more vocabulary from them.
I don't use subs2srs, but I had also considered using the subtitles for that.
blackmacros wrote:
ghinzdra wrote:
What's the point of extracting the subtitles? Japanese subtitles are never a faithful transcript anyway.
I guess it might be useful to extract the English subs to get the correct timing. Then you could run that in subs2srs to get a deck sans Japanese subs that you can work through and translate yourself. Or you could then edit the subs with something like Aegisub and replace the English with Japanese subs you find online.
OMG, I didn't realize...
If you can get the timing this way, I guess I'll be able to mine my Japanese-dubbed DVDs... Dark Knight, Sin City, Matrix, Evolution, Snatch, MIB, Die Hard... it's gonna be great.
I just hate timing.
albion wrote:
I hadn't heard of Avidemux, I might give it a try tomorrow. Though that seems the same as SubRip, which I've been using now (though I've only tried it for Spanish things).
ghinzdra wrote:
What's the point of extracting the subtitles? Japanese subtitles are never a faithful transcript anyway.
I do have a couple of DVDs with exact (or near-exact) subs, and even if they aren't exact I don't mind, as long as I can learn more vocabulary from them.
I don't use subs2srs, but I had also considered using the subtitles for that.
I have plenty of Japanese-dubbed movies... not a single one is really faithful. Close at best. Though I heard that Miyazaki movies were faithfully transcribed... (in which case I wonder why, when you finally get your hands on Miyazaki subtitles on the internet, they're full of kanji typos and strange kanji that make it quite obvious they were typed by someone Chinese).
I didn't mean films dubbed into Japanese; I meant films that were made in Japanese. (Though, while it would be a bit of effort, the non-accurate subs could be edited into exact ones.)
Does anyone know where to find a kanji character matrix?
It seems like there are some hanzi/Chinese character matrices out there, but I'm looking for a Japanese one. (And I can't even find the Chinese one... it could be of some use, as I bet the guys who wrote these must have an idea about where to find a Japanese one.)
Great thread with good information. Thanks Nonpoint and Blackmacros for the links and hints. I'm going to try this on Last Christmas and Byakuyakou at some point.
Out of interest, not wanting to start a group project, but if people have finished .srt files they're willing to share I'm sure we'll all be happy. Plenty of sites to post finished files, although d-addicts seems the most obvious choice.
Also, is it useful to share glyph information, or does that change every DVD?
I don't want to talk much about this as long as I'm not 100 percent satisfied with the result, but I have spent the previous four days experimenting a lot on my DVDs...
I succeeded today with a very recent DVD release, so I'm extremely confident in my ability to break through protections (maybe at the cost of my DVD, I still don't know), but I prefer to warn you, nuke: it's far from being as easy as it looks in blackmacros' link, especially since we're talking about kanji and since subs2srs is based on ffmpeg, which is not very reliable. For sure, if you want to give it a shot yourself and/or you're in a hurry, go ahead. Otherwise, if I were you I would wait one week (two weeks max). By then I will post the results of my experiment and the safest/quickest way to get the most out of your DVDs... maybe even one of the decks I made.
In the meantime, if I could get my hands on a kanji character matrix it would be very much appreciated... It's extremely likely to be a dead end, but it would make things considerably easier. Even the slightest piece of information could be a precious hint...
Last edited by ghinzdra (2009 August 19, 3:51 am)
blackmacros wrote:
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
I've done this before. With Final Fantasy Advent Children it took me about 4-5 hours to OCR the subs.
ghinzdra wrote:
Does anyone know where to find a kanji character matrix?
It seems like there are some hanzi/Chinese character matrices out there, but I'm looking for a Japanese one. (And I can't even find the Chinese one... it could be of some use, as I bet the guys who wrote these must have an idea about where to find a Japanese one.)
Mh.. This might be faster. Is there a way to save a matrix / make one? If just one of us makes one, someone can upload it and hopefully this will be much faster.
bombpersons wrote:
blackmacros wrote:
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
I've done this before. With Final Fantasy Advent Children it took me about 4-5 hours to OCR the subs.
But wouldn't you have to know the readings of all the kanji used, in order to be able to type them, in order to feed them into the OCR? So I guess you'd have to have a dictionary standing by to search for readings?
blackmacros wrote:
bombpersons wrote:
blackmacros wrote:
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
I've done this before. With Final Fantasy Advent Children it took me about 4-5 hours to OCR the subs.
But wouldn't you have to know the readings of all the kanji used, in order to be able to type them, in order to feed them into the OCR? So I guess you'd have to have a dictionary standing by to search for readings?
I used the drawing pad in the Windows IME, so I didn't need to know the readings, just how to draw them.
Together with a drawing tablet it's much faster.
Last edited by bombpersons (2009 August 19, 5:00 am)
Ah, smart idea. Wish I had a tablet...
/sulk
bombpersons wrote:
blackmacros wrote:
I searched and found this just now http://www.my-guides.net/en/content/view/167/26/ after nonpoint's post. Doesn't look too difficult. Although I don't know how it will handle Japanese subs (with the whole "type in every character from the subs file so that it can OCR it"...that might take a while)
I've done this before. With Final Fantasy Advent Children it took me about 4-5 hours to OCR the subs.
ghinzdra wrote:
Does anyone know where to find a kanji character matrix?
It seems like there are some hanzi/Chinese character matrices out there, but I'm looking for a Japanese one. (And I can't even find the Chinese one... it could be of some use, as I bet the guys who wrote these must have an idea about where to find a Japanese one.)
Mh.. This might be faster. Is there a way to save a matrix / make one? If just one of us makes one, someone can upload it and hopefully this will be much faster.
As I said, it's not THAT easy.
First you must be aware of what exactly a kanji matrix is: something like 8000 characters.
Through KO and others we all know that about 1200 characters give you 85 percent coverage, but for 99 percent it's about 4000. On top of that you've got to include italic and other styles, which is why you can easily double the 4000 figure, hence 8000. Obviously the most intelligent solution would be to write a matrix for the most frequent characters and after that type in the missing characters from the subtitles. We're still in for typing 3000 characters.
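To put rough numbers on "how much would I still have to type" for one particular film, something like the sketch below could help. It assumes you already have some Japanese text to measure against (for example a transcript found online) and a set of characters you have already taught the matrix; the `coverage` helper is hypothetical, and it only looks at the main CJK Unified Ideographs block, so kana, punctuation and rarer extension characters are ignored.

```python
from collections import Counter


def is_kanji(ch):
    """True for characters in the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def coverage(subtitle_text, known_chars):
    """Return (share of kanji occurrences already known, sorted list of unknown kanji)."""
    counts = Counter(ch for ch in subtitle_text if is_kanji(ch))
    total = sum(counts.values())
    if total == 0:
        # No kanji at all: nothing left to teach.
        return 1.0, []
    known = sum(n for ch, n in counts.items() if ch in known_chars)
    missing = sorted(ch for ch in counts if ch not in known_chars)
    return known / total, missing
```

The length of the returned `missing` list is the number of distinct kanji you would still have to enter once per style, which is the figure that matters for the typing effort; the share only tells you how often the OCR would stop to ask.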
What's more, as kanji are combinations of different parts, the OCR can have some trouble identifying the character and may select just a part of it. That's why SubRip includes a new feature to enlarge the grid if needed. It still slows down the process though.
Now the best part:
Do you think that every company uses the same character set/style?
Do you think that the OCR software is particularly clever?
You must have easily figured out where I'm going. There is a very high risk that we need several matrices... in the best case one per company... in the worst one per DVD...
So yes, you can save the matrix, but if we need one matrix per DVD we're screwed.
Anyway, as I said earlier, none of my dubbed DVDs have faithful subtitles... so it's only worth it if the native DVDs have exact subtitles; otherwise it's both excruciating and useless.
That's why I said it's very likely a dead end. I just want to check out every possibility before posting a tutorial.
Last edited by ghinzdra (2009 August 19, 6:46 am)
What sort of format is this 'matrix' thing in? I'm sure there is a spreadsheet somewhere on this forum for something like the Kanji Kentei (is that it? the exam that tests kanji knowledge, for native speakers), which would have like 6000+ kanji in it right? Is something like that exportable into this matrix you're talking about?
EDIT: I was thinking of this spreadsheet. http://spreadsheets.google.com/ccc?key= … &hl=en
~6500 kanji (as well as keywords) in a spreadsheet.
Last edited by blackmacros (2009 August 19, 8:01 am)
blackmacros wrote:
What sort of format is this 'matrix' thing in? I'm sure there is a spreadsheet somewhere on this forum for something like the Kanji Kentei (is that it? the exam that tests kanji knowledge, for native speakers), which would have like 6000+ kanji in it right? Is something like that exportable into this matrix you're talking about?
EDIT: I was thinking of this spreadsheet. http://spreadsheets.google.com/ccc?key= … &hl=en
~6500 kanji (as well as keywords) in a spreadsheet.
Completely unrelated.
A matrix is the equivalence table that OCR software uses to recognize an image and turn it into a symbol. As you probably already know, subtitles on a DVD are not stored as text files (srt, ass, ssa, etc.) but as image files (sub, vob).
The function of the OCR software is to transform those images into text by identifying the letters. Now imagine a child prodigy who is able to learn the calligraphy of thousands of kanji in the blink of an eye... no matter how good he is, you still have to teach him what each kanji means, the order of the strokes, etc. Well, the OCR software is infinitely more efficient than you at copying the subtitles, but it still needs to be taught: for each and every symbol it doesn't recognize, you must type in its text-based counterpart. And it's especially dumb: it's unable to recognize the same character displayed in italic, bold or underlined style... you've got to teach it everything.
It's not much of a problem with the Western alphabet... but with kanji... see above.
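To make the "equivalence table" idea concrete, here is a toy sketch in Python. It is not SubRip's actual algorithm, only an illustration under simple assumptions: each glyph is a tiny bitmap stored as a tuple of strings, matching is exact, and the `ask` callback stands in for the dialog where you type the character. All names here are invented for the example.

```python
def recognize(glyphs, matrix, ask):
    """Turn a sequence of glyph bitmaps into text using a lookup table.

    glyphs: iterable of bitmaps, each a tuple of row strings such as ("#.", "##").
    matrix: dict mapping a bitmap to the text it stands for; updated in place.
    ask: callback invoked once for every bitmap not yet in the matrix.
    """
    out = []
    for bitmap in glyphs:
        if bitmap not in matrix:
            # Unknown shape: the user has to teach it, exactly once.
            matrix[bitmap] = ask(bitmap)
        out.append(matrix[bitmap])
    return "".join(out)
```

Because the lookup is by exact shape, an italic or bold version of a character is a different key and has to be taught again, and a DVD that renders its subtitles in another font shares no keys at all with a matrix built from a different disc. That is the reason a saved matrix may not transfer from one DVD to the next.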
Last edited by ghinzdra (2009 August 19, 8:56 am)

