
Netflix Japanese content

#26
Geeze they really have this set up to make it hard to extract the text. I suppose it's probably not intentional, but...
#27
(2017-01-14, 11:35 am)ayu_modoki Wrote: Geeze they really have this set up to make it hard to extract the text. I suppose it's probably not intentional, but...
A shot in the dark, but perhaps the reason they make Japanese subtitles into image files is that a non-Japanese computer might not have the proper Japanese fonts. Can anyone with Japanese Netflix confirm that those subtitles are still images (and not text)?
Edited: 2017-01-14, 3:33 pm
#28
(2017-01-14, 10:44 am)Zarxrax Wrote: I found out how to see the png files as they play.
In Chrome, in the network tools (F12), start recording as the video plays, and look under the "Img" filter. Each subtitle will appear right there. However, I am not able to save the images for whatever reason.

[Image: SFAaaCR.png]
I think it is because some of the Japanese subtitles come with styling such as italics, furigana ("ruby") annotations, vertical text positioning, and other features, which can make it difficult to output Japanese subtitles compared to Western languages. It is quite a common issue across most reading devices.

Meanwhile, I can help to manually type out or convert the image (png) files into something like the following:
Quote:(マスターの声)
1日が終わり
人々が家路へと急ぐ頃—

Would you like to try and see if the following trick can work at your end?

Chrome DevTools - 25 Tips and Tricks
https://www.keycdn.com/blog/chrome-devtools/
Quote:Copy Image as Data URI (base64 encoded)

You can save any image from a webpage out as a Data URI or rather base64 encoded. There is no need to use a free online converter as it is already built into Chrome DevTools. To do this simply click into the “Network” panel, click on an image, and then right-click on it and select “Copy image as Data URL.”

You will then get the image in the following format: “data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABAAAAAFt...”
[Image: copy-image-as-data-URI.gif]
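That copied string can be turned back into a real image file with a few lines of Python (a sketch; the function name and file names are made up for illustration):

```python
import base64

def data_uri_to_file(data_uri, out_path):
    """Decode a "Copy image as Data URL" string from DevTools into a file."""
    # Everything after "base64," is the image bytes, base64-encoded
    _, _, encoded = data_uri.partition("base64,")
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(encoded))

# e.g. data_uri_to_file("data:image/png;base64,iVBORw0KGgo...", "subtitle.png")
```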
#29
(2017-01-14, 8:58 pm)eslang Wrote: Would you like to try and see if the following trick can work at your end?

Worked like a charm.
So it's possible to save the png images one at a time. I suppose the next step would be finding some way to automatically grab all of them while you watch something.
#30
(2017-01-14, 9:14 pm)Zarxrax Wrote: Worked like a charm. So it's possible to save the png images one at a time. [...]
いいですね。 Good to know that the above little trick worked.

I have not seen any program that can automatically extract all the images spread across multiple webpages (with different URLs) while you watch something. It would be awesome if such a program existed.

Usually I record it while watching, then convert the file to another format, use esrXP to extract the hard subs, and then use idxsubocr to OCR the subtitles. Big Grin
#31
Guys, I've cracked the code!

Netflix recently added the ability to download videos to mobile devices, so I thought I would give it a shot and see how it handles subtitles. I downloaded a video to Android, and it gives you several files. I found two files with the extension .nfs. One of them was smaller, and when I opened it in a text editor, it showed me the English subtitles. The other .nfs file was a bit larger, so I tried renaming the extension to .zip, and when I opened it, it contained all the Japanese subtitle data!!

Here is the file (random episode of Tokyo Stories) if anyone would like to examine it: https://drive.google.com/open?id=0B3_ufk...lhSWGEwTHM

I'm pretty sure you need a rooted device, and I had to use adb to copy the file to my PC. So this is probably out of the realm of the casual person for now.
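For what it's worth, the unzip step can be scripted; a minimal sketch, assuming the larger .nfs file really is a plain zip archive (file and folder names here are hypothetical):

```python
import zipfile

def extract_nfs(nfs_path, out_dir):
    """Open the larger .nfs file directly as a zip archive -- no renaming
    needed, since zipfile only looks at the contents, not the extension."""
    with zipfile.ZipFile(nfs_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()

# e.g. extract_nfs("subtitles.nfs", "extracted")
```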
#32
(2017-01-16, 7:29 pm)Zarxrax Wrote: Guys, I've cracked the code! [...]

I'm tempted to post about this on Reddit, but I fear that Netflix might plug the hole if it becomes well known. Guess it's best to covertly collect subtitles for popular shows/movies that aren't available.
#33
(2017-01-16, 8:32 pm)Nukemarine Wrote: I'm tempted to post about this on Reddit, but I fear that Netflix might plug the hole if it becomes well known. Guess it's best to covertly collect subtitles for popular shows/movies that aren't available.

I doubt Netflix cares about it, since it's just subtitles, and they didn't bother encrypting them, unlike the video. Still only halfway there, though... next we need to figure out a decent way to OCR them. Subtitle Edit kinda works, but it uses Tesseract for the OCR, which really sucks.
I tried idxsubocr, but it's a Chinese app and the interface didn't display properly on my PC. I don't think it would work with this anyway.
#34
(2017-01-16, 8:42 pm)Zarxrax Wrote: Still only halfway there though... next need to figure out a decent way to OCR them. [...] I tried idxsubocr, but it's a Chinese app and the interface didn't display properly on my pc.
Well done, Zarxrax! Heart

I took a quick glance at the zip file that you posted, and as I suspected, the Japanese subtitles use styling such as italics, furigana "ruby" annotations, vertical text positioning, and ♪~ too (笑)

Unfortunately, idxsubocr can only handle black text on a white background. And you're right, I agree that Tesseract sucks at OCR recognition.

Anyway, I will manually type out the PNG files into text format later. I'm thinking of just posting the (323 lines of) text here. If you would prefer to have it otherwise, please let me know how you would like the text to be delivered.

[edit] Oh yes, to get idxsubocr (a Chinese app with a simple GUI) and its interface to display properly on the PC, just run it under Microsoft AppLocale and it will display nicely.
Edited: 2017-01-16, 9:44 pm
#35
(2017-01-16, 9:35 pm)eslang Wrote: Anyway, I will manually type out the PNG files into text format later.  I'm thinking of just posting the (323 lines of) text here.  If you would prefer to have it otherwise, please let me know how you would like the text to be delivered.

[edit]  Oh yes, to get idxsubocr (a Chinese app with a simple GUI) and its interface to display properly on the PC, just run it under Microsoft AppLocale and it will display nicely.

Oh, I don't particularly need these subtitles. I just downloaded that as a test.
If you feel like typing something up, I might get an episode or two of Terrace House for those folks who have been wanting it.
#36
Zarxrax Wrote:Oh, I don't particularly need these subtitles. I just downloaded that as a test.
If you feel like typing something up, I might get an episode or two of terrace house for those folks who have been wanting it.
I'll wait for the Terrace House episode then. I've been typing too many Chinese characters lately, and my itchy fingers need to practice typing more Japanese characters. Big Grin
#37
(2017-01-16, 8:42 pm)Zarxrax Wrote: Still only halfway there though... next need to figure out a decent way to OCR them. [...]

Well, the Chinese fansub groups are constantly ripping DVD titles, which also have image-based subtitles, and converting them to text before translating them into Chinese. This is the source of a bunch of the available Japanese sub files. (Most available Japanese sub files come from Chinese fansubbers ripping the text directly from closed captions, but they clearly have techniques for DVDs, since they also do OVAs, special editions, and other disc releases.)

If someone reads Chinese well enough and cares enough I'm sure they can find out how this is done.

Edit: If we're starting a manual typing effort, I could do an episode or two, but honestly, I wouldn't even bother with this subtitle-ripping effort for that; I'd just pause my Netflix stream. I'm not up to doing the whole series, though, so if you want me to do that, pick your episodes carefully Smile
Edited: 2017-01-16, 10:20 pm
#38
eslang Wrote:Unfortunately, idxsubocr can only handle black text on white background. And you're right, I agree that tesseract sucks at OCR recognition.
Programs like ImageMagick's "convert" can help you put in a solid background (and then invert the image if necessary) programmatically, so the images are better suited as input for OCR programs.
(EDIT: to add the quote)
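If ImageMagick isn't handy, the same preprocessing can be sketched with Pillow (paths and function name hypothetical); this flattens the transparent PNG onto black and then inverts, since idxsubocr reportedly wants black text on a white background:

```python
from PIL import Image, ImageOps

def prepare_for_ocr(in_path, out_path, invert=True):
    """Flatten a transparent subtitle PNG (usually white text) onto a solid
    black background, then invert to get black text on white for OCR."""
    im = Image.open(in_path).convert("RGBA")
    bg = Image.new("RGB", im.size, (0, 0, 0))
    bg.paste(im, (0, 0), im)  # the PNG's alpha channel acts as the paste mask
    if invert:
        bg = ImageOps.invert(bg)
    bg.save(out_path)

# e.g. prepare_for_ocr("sub0001.png", "sub0001_flat.png")
```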
Edited: 2017-01-16, 11:58 pm
#39
SomeCallMeChris Wrote:Well, the Chinese fansub groups are constantly ripping DVD titles which also have image-based subtitles and converting them to text before translating them to Chinese. [...] If we're starting a manual typing effort, I could do an episode or two [...]
そうなんだ!  

I doubt they have to rip the Japanese subtitles from DVDs to translate them into Chinese. Actually, I remember there was a Japanese person who was studying Chinese at the time and requested to join a Chinese fansubber group. So it turned out to be a win-win situation for them.

Quite a few Chinese exchange students who major in Japanese told me that they had to do lots of Japanese listening dictation while studying at university in China.

In any case, it is great to know that you would like to participate in the typing effort.  Big Grin

faneca Wrote:Programs like ImageMagick's "convert" can help you put in a solid background (and then invert the image if necessary) programmatically, so the images are better suited as input for OCR programs.
(EDIT: to add the quote)
faneca, thank you for the suggestion.  

I just found out that MS Paint can convert the image by selecting 変形 → 色の反転 (Invert Colors).

By the way, does anyone know how to convert the .ttml2 file to extract the timestamps into an .srt or similar file type? There are two .IDX files, but there is no .SUB file at all in the zip folder posted by Zarxrax.  Or am I missing something?  Any pointers or advice would be appreciated.  ありがとう!
[edit 2] Google is my friend.  
The solution is...... Subtitle Edit 3.4.11 can import TTML2+PNG  Smile  やった!
Converting TTML2+PNG Subtitles to BDN(XML+PNG)
http://forum.videohelp.com/threads/37634...N(XML-PNG)
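For anyone who would rather script the timing extraction, here is a minimal sketch of reading such a manifest. It assumes each cue is an element carrying begin/end attributes and wrapping an image reference, which is a guess based on this dump rather than the TTML2 spec:

```python
import xml.etree.ElementTree as ET

def ttml2_entries(path):
    """Pull (begin, end, image filename) triples out of a manifest_ttml2.xml:
    exactly the timing data an SRT file needs."""
    entries = []
    for elem in ET.parse(path).getroot().iter():
        if "begin" in elem.attrib and "end" in elem.attrib:
            # First child with a src attribute is assumed to be the subtitle image
            img = next((c.attrib.get("src") for c in elem if "src" in c.attrib), None)
            entries.append((elem.attrib["begin"], elem.attrib["end"], img))
    return entries
```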

[edit 1]  After checking my schedule, it looks like tomorrow is going to be a busy day, so I will not be able to reply here.  At any rate, I should be able to find some time around the end of the week to type out the (png file) subtitles.  Smile
Edited: 2017-01-17, 5:27 am
#40
Alright, I have the Japanese+English subs for all 46 episodes of Terrace House here: https://mega.nz/#!kBkxwRiC!wFdvE5O5a2gbU...0NWFUJGSlA
or https://www.mediafire.com/?3eeza36lunl5cmn
Japanese subs are images, so you can't use Rikaichan or anything unless they are typed out. Hopefully a good OCR solution can be found soon.

Looks like the audio files are unencrypted as well, so it could actually work pretty well with subs2srs. However, I will not be uploading the audio on here.

If anyone has requests for subs from other series I'll be glad to try and get them.
Edited: 2017-01-19, 9:58 pm
#41
(2017-01-17, 9:31 pm)Zarxrax Wrote: Alright, I have the japanese+english subs for all 46 episodes of Terrace House here: https://mega.nz/#!kBkxwRiC!wFdvE5O5a2gbU...0NWFUJGSlA [...]

This is really nice - thank you! Hopefully someone with better technical chops than me can get this into some kind of sync'd text format for subs2srs.

Edit: I looked around a bit, and it looks like the aforementioned Subtitle Edit should do it. Unfortunately, it is Windows-only. Does anyone have it, or have Windows, and can give it a shot with Zarxrax's subs?
Edited: 2017-01-17, 10:55 pm
#42
I still haven't really watched much of the show yet, but I was actually just looking into Google's Cloud Vision API earlier today, which would probably work well for this. It's not perfect, but a large majority of this could probably be run through OCR software. Here's a small sample I did with the first 18 subtitle images from the first episode:

Tesseract Wrote:
Code:
(トリント"ル)

皆様 ご無沙汰しており ます
(一同) ご無沙汰しており ます
(Y〇U) テラスハウスは
見ず知らずの男女6人が
共同生活する様子を
ただただ記録したものです

用意したのはステキなおウチと
ステキな車だけです

相変わらず
台本は一切ございません

(山里) ございませんよ
くトリント"ル) ございません

この度テラスハウスが

NETF凵Xで
復活することになりました

(山里) そうなんですよね
これすごいことなんですよね

そ の NETF凵Xさ んが
日本に上陸するにあたって

ちよっと このテラスハウスに
羽の矢が立ったと

“テラスハウスやりたいよ”
みたいなことを

多分言ったんじゃないかつて
話です

今たどたどしい
ヨ本語みたいな感じね
“徳井とか几 たいよ” って

(徳井) 偉いさん 徳井” ってぃう
ワ一 ドを出 〕てくれはったん?

(山里) 多分なったと
(徳井) ありがたい

Google Cloud Vision API Wrote:
Code:
(トリントン)
皆様ご無沙汰しております

(一同)ご無沙汰しております

(YOU)テラスハウスは
見ず知らずの男女6人が

共同生活する様子を
ただただ記録したものです

用意したのはステキなおウチと
ステキな車だけです

相変わらず
台本は一切ございません

(山里)ございませんよ

(トリントン)ございません

この度テラスハウスが

NETFLIXで
復活することになりました

(山里)そうなんですよね
これすごいことなんですよね

そのNETFLIXさんが
日本に上陸するにあたって

ちょっとこのテラスハウスに
白羽の矢が立ったと

"テラスハウスやりたいよ"
みたいなことを

多分言ったんじゃないかって
話です

今たどたどしい
日本語みたいな感じね

"徳井とか見たいよって

(徳井)偉いさん、徳井っていう
ワードを出してくれはったん?

(山里)多分なったと
(徳井)ありがたい

Tesseract is free, but has some recognition issues.

Google Cloud Vision API isn't free, but it's pretty good. You could potentially OCR all of the episodes for free by combining them into large images (one per episode, probably) and then running them through the free "Try API" part of this website: https://cloud.google.com/vision/. Then parse the JSON response to figure out the coordinates of each match, and use that to determine which original subtitle file it belongs to (a fairly easy task). There are vertical-text files which probably wouldn't play entirely nicely, though.
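The batching idea could be sketched like this with Pillow (a hypothetical helper, not part of any existing tool); the returned span map is what lets you attribute each OCR match's y-coordinate back to its source image:

```python
from PIL import Image

def stack_images(paths):
    """Paste many subtitle PNGs into one tall sheet and record each image's
    vertical span, so OCR y-coordinates map back to the source files."""
    ims = [Image.open(p) for p in paths]
    width = max(im.size[0] for im in ims)
    height = sum(im.size[1] for im in ims)
    sheet = Image.new("RGB", (width, height), (0, 0, 0))
    spans, y = {}, 0
    for path, im in zip(paths, ims):
        sheet.paste(im, (0, y))
        spans[path] = (y, y + im.size[1])  # y-range occupied by this image
        y += im.size[1]
    return sheet, spans
```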

The only real issue I see with the Cloud Vision API output is that トリンドル gets read as トリントン. Probably because they use half-width katakana for that.
Edited: 2017-01-18, 1:13 am
#43
Zarxrax Wrote:Alright, I have the japanese+english subs for all 46 episodes of Terrace House here: https://mega.nz/#!kBkxwRiC!wFdvE5O5a2gbU...0NWFUJGSlA [...] If anyone has requests for subs from other series I'll be glad to try and get them.
I have exchanged my "modern" computer for my elderly Japanese friend's "dinosaur" computer so that my friend can read the Chinese subtitles. It will be a long while before I can get my "modern" computer back.
http://forum.koohii.com/thread-13880-pos...#pid240118

The "dinosaur" computer cannot handle the encryption codes at mega and unable to download file from mega site.  Is it possible to upload the file to some other cloud drive without encryption codes?
ご迷惑をおかけして申し訳ございません。 (Sorry for the inconvenience.)

It would be wonderful if there were OCR software that could recognize Japanese fonts in italics, vertical text positioning, and ruby annotation, and that worked well together with the subtitle file (timestamps); I would not mind paying for it.

I took a look at Google's Cloud Vision API, but it does not handle the timestamps together with the image files, unlike Subtitle Edit 3.4.11 and idxsubocr, which can generate the timestamps for a subtitle file.

If it were just a matter of OCR software for Japanese material in text format, then SmartOCR Lite, which is free software, could easily do the job as well.

SomeCallMeChris Wrote:Edit: If we're starting a manual typing effort, I could do an episode or two, but honestly, I wouldn't even bother with this subtitle ripping effort to do that, I'd just pause my netflix stream. I'm not up to doing the whole series though, so if you want me to do that pick your episodes carefully. Smile
I agree with SomeCallMeChris's thoughts about this subtitle-ripping effort. Since this is a manual typing effort from the images (png files) and not a simple OCR job, it would take too much time to do the whole series of 46 episodes. Just a couple of episodes is what I am prepared to do, so let me know which two episodes, and do pick the episodes carefully.

In addition to "Terrace House", I noticed that 'kusogaijin' and 'ayu_modoki' have suggested "Hibana Spark" in this thread. Would it be possible to try and get the Japanese subtitles for it?

As with "Terrace House", I can work on a couple of episodes for "Hibana Spark", and I plan to do the manual typing for episode 1 and 2, unless there are request from Koohii forum users to do other episodes instead of the first two episodes.
#44
Eh, I've never heard of anyone not being able to download from Mega. I can try to put it up somewhere else either tomorrow or this weekend. I should be able to grab Hibana Spark this weekend too.
#45
(2017-01-18, 9:27 pm)eslang Wrote: I took a look at the Google's Cloud Vision API, but it does not handle the timestamps together with the image files, unlike Subtitle Edit 3.4.11 and Idxsubocr which can generate the timestamps for subtitle file. [...]
Google's Cloud Vision API is just an API; it doesn't (and shouldn't, to begin with) have any concept of timestamps. However, it's really easy to work with. I implemented a way to OCR Japanese text from images in less than 10 lines today using the JSON endpoint. It would most likely be trivial to write a tool to automatically generate a subtitle file based on the metadata (timestamps, positions, etc.) from the manifest_ttml2.xml file.

However, if there's already existing software that can do it (and is accurate) then I don't think there's much of a point in doing that work. Tongue

Edit: I took an hour and made the program myself since I'm the one saying it'd be easy.
Requires Pillow and requests.
Tested with Python 2.7.

Code:
import xml.etree.ElementTree as ET
import requests
import base64
import json
import os
import sys
from io import BytesIO
from PIL import Image, ImageOps

AUTH_KEY = "<API key here>"
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"
REQUEST_CHUNK_SIZE = 15 # There is a limit to the size of the request, so chunk it down to as much data as we can send without being rejected
                        # If you run into errors with it crashing during the OCR step, try reducing this value
ADD_BACKGROUND = True # Can result in better OCR
SHRINK_IMAGE = True # Can result in better OCR
SCALE_FACTOR = 0.50 # Scale down to 50%
INVERT_COLORS = False

def read_master_xml(filename):
    print "Parsing %s..." % filename

    tree = ET.parse(filename)
    root = tree.getroot()
    
    entries = []
    for child in root:
        if "body" in child.tag:
            for child in child:
                start = child.attrib['begin'].replace(".",",")
                end = child.attrib['end'].replace(".",",")
                filename = child[0].attrib['src']
                
                entries.append({"start": start, "end": end, "filename": filename})
                
    return entries
    
def ocr_text(filenames):
    output = {}
    
    # Break filenames array down into chunks so that the API can handle it
    chunked_filenames = [filenames[i:i+REQUEST_CHUNK_SIZE] for i in range(0, len(filenames), REQUEST_CHUNK_SIZE)]        

    for i in range(len(chunked_filenames)):
        filenames = chunked_filenames[i]
        print "Generating request (%d/%d)..." % (i + 1, len(chunked_filenames))
        
        data = { "requests": [] }  
        requested_filenames = []
        for filename in filenames:
            if not os.path.exists(filename):
                continue
            
            requested_filenames.append(filename)
        
            im = Image.open(filename)
            if ADD_BACKGROUND:
                # Flatten the transparent subtitle PNG onto a black background
                # for better OCR results; the image's own alpha acts as the paste mask
                bg = Image.new("RGB", im.size, (0, 0, 0))
                bg.paste(im, (0, 0), im)
                im = bg
                
            if INVERT_COLORS:
                im = ImageOps.invert(im)
                
            if SHRINK_IMAGE:
                im = im.resize((int(im.size[0] * SCALE_FACTOR), int(im.size[1] * SCALE_FACTOR)), Image.ANTIALIAS)
            
            # Save the compressed image to a memory IO buffer instead of a real file
            raw_image = BytesIO()
            im.save(raw_image, format="png")
            raw_image.seek(0)  # rewind to the start
            
            # Add request list
            data['requests'].append({ "image": { "content": base64.b64encode(raw_image.read()).decode() },
                    "imageContext": { "languageHints": [ "ja", "en" ] }, # CHANGE FOR MORE LANGUAGES
                    "features": [ { "type": "TEXT_DETECTION", "maxResults": 1 } ],
                })
                
        if len(data['requests']) == 0:
            continue
                
        print "Requesting OCR text..."
        data = json.dumps(data)
        resp = requests.post(VISION_ENDPOINT, data=data, params={"key": AUTH_KEY}, headers={'Content-Type': 'application/json'})
        
        #open("output.json","wb").write(resp.text.encode('utf-8'))
        
        for idx, r in enumerate(resp.json()['responses']):
            #open("output_%d.json" % idx,"wb").write(json.dumps(r, indent=4, ensure_ascii=False, encoding="utf-8").encode('utf-8'))
            
            if 'textAnnotations' in r:
                output[requested_filenames[idx]] = r['textAnnotations'][0]['description'].encode('utf8') # First entry is always the "final" output
        
    return output
    
    
def generate_srt_from_xml(input_folder):
    xml_filename = os.path.join(input_folder, "manifest_ttml2.xml")
    srt_filename = os.path.join(input_folder, "output.srt")

    entries = read_master_xml(xml_filename)
    
    filenames = [os.path.join(input_folder, c['filename']) for c in entries]
    results = ocr_text(filenames)
    
    print "Generating SRT file..."
    with open(srt_filename, "w") as output:
        for i in range(len(entries)):
            entry = entries[i]
            
            text = "<TEXT HERE>"
            f = os.path.join(input_folder, entry['filename'])
            if f in results:
                text = results[f]
            
            output.write("%d\n" % (i + 1))
            output.write("%s --> %s\n" % (entry['start'], entry['end']))
            output.write("%s\n" % text)
            output.write("\n")
            output.write("\n")

    print "Finished!"

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "usage: %s input_folder" % (sys.argv[0])
        exit(1)

    input_folder = sys.argv[1]        
    generate_srt_from_xml(input_folder)

To use,
Code:
python generate_srt_from_netflix.py "Terrace House J subs\Terrace House - Boys & Girls in the City 01"

It will read the manifest_ttml2.xml from the above folder, process the images (putting them on a black background), and then generate an "output.srt" in the same folder.

Requires you to use your own Google Cloud API key.

It will generate a subtitle file similar to this: http://pastebin.com/GAN5EzPn
I removed a bunch of the image files since I don't want to use a ton of API calls, so anything that I did NOT run through the OCR has the text "<TEXT HERE>" below it.

This program should work on ANY of the Netflix subtitle dumps as long as they are in the same format (manifest_ttml2.xml and .png files in the same folder).

Edit 2: Fixed a bug when you tried to send *too* much data.
Here's the first episode OCR'd entirely, no editing or fixing mistakes: http://pastebin.com/PyrWvurp
You guys can decide if it's good enough to mess with this method any further.

I think it could be more accurate if there were code to detect and remove some of the false-positive "words" that it picks up, resulting in stuff like the らら and 基地 at the very bottom. That could be something like detecting whether the text is a majority of horizontal (or vertical) words and removing the odd non-conforming words.
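A sketch of that majority-orientation filter, assuming each OCR word comes with a bounding box and that wider-than-tall means horizontal (the function name and tuple shape are made up, not from any API):

```python
def drop_nonconforming(words):
    """words: list of (text, width, height) tuples from OCR bounding boxes.
    Keep only the words matching the dominant orientation, discarding the
    odd stray detections."""
    horizontal = [w for w in words if w[1] >= w[2]]
    vertical = [w for w in words if w[1] < w[2]]
    return horizontal if len(horizontal) >= len(vertical) else vertical
```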
Edited: 2017-01-19, 11:37 pm
#46
Zarxrax Wrote:Eh, I've never heard of anyone not being able to download from Mega. I can try to put it up somewhere else either tomorrow or this weekend. I should be able to grab Hibana spark this weekend too.
Yeah, you can say that again. It was a terrible shock for me too.

Well, at least I got to see what the problems with my friend's computer are.  

There is no rush, I can wait, either this weekend or even next week is alright.

Thank you.  Heart

  At 'zx573'
Now we're talking Big Grin  Thanks for the new tip and trick.  
I'm sure it will benefit and help other users who cannot install or use Subtitle Edit or idxsubocr.

At the moment, I am not able to try out Python on my friend's terribly old computer.
It seems quite alright at first glance.  
Are there any italic font types or ruby annotations in that "Terrace House - Boys & Girls in the City 01"?

Zarxrax posted this file earlier on Google Drive (which I was able to download and have tested):
Quote:Here is the file (random episode of Tokyo Stories) if anyone would like to examine it: https://drive.google.com/open?id=0B3_ufk...lhSWGEwTHM

It is a good way to test out the subtitle file, since the image (png) files include italic font types, ruby annotation, vertical text positioning and other stuff.  It would be great if you have the time to test out this file as well.

(2017-01-17, 9:31 pm)Zarxrax Wrote: Alright, I have the japanese+english subs for all 46 episodes of Terrace House here: https://mega.nz/#!kBkxwRiC!wFdvE5O5a2gbU...0NWFUJGSlA
or https://www.mediafire.com/?3eeza36lunl5cmn
I just saw this update!  素晴らしい~ Heart You're the Best!

At 'juniperpansy'
At 'johndoe2015'
At forum members who are interested in Terrace House Japanese subtitles

Please indicate two episodes out of the 46 Terrace House episodes that you'd like to see with Japanese (SRT) subtitles.
Do give your reply by the end of this week, and we can decide which two episodes to type out the subtitles for. Thanks.
Edited: 2017-01-20, 4:59 am
#47
(2017-01-19, 10:39 pm)eslang Wrote: Are there any italic font types or ruby annotations in that "Terrace House - Boys & Girls in the City 01"? [...] Zarxrax posted this file earlier on google drive [...] It is a good way to test out the subtitle file since the image (png) files include italic font types, ruby annotation, vertical text positioning and other stuff. It will be great if you have the time to test out this file as well.

Italics aren't much of a problem, really; neural networks are pretty resilient against small changes if they're properly trained, which Google's are.

Here's the output from that episode: http://pastebin.com/ycdC3MX4

Ruby text is also read, but it's not exactly associated with anything in particular.
That's a problem with the program I wrote, not the OCR exactly.
If there's a subtitle format that lets you position text exactly on the screen using pixel coordinates, then I think it would be possible to get it pretty spot-on; but without something like that, it would take someone writing code to figure out what the ruby text is associated with, based on the x/y coordinates of the text.
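One rough way to start on that grouping, assuming each detected word comes with a bounding-box height (a hypothetical tuple shape, not the actual API response): ruby glyphs render much smaller than the base text, so box height alone can separate the two layers, and matching each ruby run to the nearest base word by x-position would be the remaining step.

```python
def separate_ruby(words, ratio=0.6):
    """words: list of (text, x, y, height) tuples (hypothetical OCR shape).
    Boxes much shorter than the typical box are treated as ruby annotations."""
    typical = sorted(w[3] for w in words)[len(words) // 2]  # median box height
    base = [w for w in words if w[3] >= typical * ratio]
    ruby = [w for w in words if w[3] < typical * ratio]
    return base, ruby
```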

One thing I noticed is that special characters like ♪ give it some difficulty.

Overall, looking at the results of this episode, I'd say it's fairly accurate. Probably at least 80% accurate (maybe more; just an eyeballed estimate) through automatic OCR without any editing. The issues it does have seem to be fairly minor, although a few times some words are completely messed up.

I updated the program in my post to include some more options that I added in to make the results slightly better.
Edited: 2017-01-19, 11:40 pm
#48
(2017-01-19, 11:37 pm)zx573 Wrote:
(2017-01-19, 10:39 pm)eslang Wrote: At 'zx573'
Now we're talking Big Grin  Thanks for the new tip and trick.  
I'm sure it will benefit and be of help to other users who cannot install and use Subtitle Edit or Idxsubocr.

At the moment, I am not able to try out Python on my friend's terribly old computer.
It seems quite alright from the first glance.  
Are there any italics font type and ruby annotation in that "Terrace House - Boys & Girls in the City 01"?

Zarxrax posted this file earlier on google drive (of which I could download and have tested it)
Quote:Here is the file (random episode of Tokyo Stories) if anyone would like to examine it: https://drive.google.com/open?id=0B3_ufk...lhSWGEwTHM

It is a good file to test with, since the images (PNG files) include italic fonts, ruby annotation, vertical text positioning and other features.  It would be great if you have the time to test this file as well.

Italics isn't much of a problem really [...] I updated the program in my post to include some more options that I added in to make the results slightly better.
Could you run the subtitles of Tiger & Dragon through that program? I would like to create a Memrise course for that series, but it would take a number of hours to transcribe the subs by hand.
Reply
#49
At 'zx573'
Thank you for your help. I really appreciate it. ありがとう~

While waiting for the forum members to decide on which episodes of Terrace House, I went ahead and proofread and edited the test file that you kindly shared.
Quote:Here's the first episode OCR'd entirely, no editing or fixing mistakes: http://pastebin.com/PyrWvurp
You guys can decide if it's good enough to mess with this method any further.

Many kudos to 'Zarxrax' and 'zx573', who shared the files and helped make this first episode possible.
"Terrace House - Boys & Girls in the City 01" (Japanese Subtitle)
Proofread & Edited --> http://pastebin.com/CMmzE8y4 おまけで~す (a little bonus!)

At the moment, I am not able to try out Python on my friend's terribly old computer.
In case some of the members would like to try out the new tip and trick by 'zx573', refer to this:
http://forum.koohii.com/thread-14158-pos...#pid241295

Here is a simple breakdown of the errors found while proofreading episode 01.
The result is surprisingly good.

TOTAL LINES = 585

TOTAL ERRORS = 109 (18.63%)

Minor Errors = 46 (7.86%)
Major Errors = 43 (7.35%)
Critical Errors = 20 (3.42%)

<7>ERRORS : person name in katakana
<1>ERRORS : person name in hiragana
<38>ERRORS : person name in kanji
[46 Minor Errors]

<21>ERRORS : small tsu misread (つ→っ, 3"→っ)
<5>ERRORS : small yo misread (よ→ょ)
<15>ERRORS : missing ー
<2>ERRORS : ASCII "-" used instead of ー
[43 Major Errors]

<3>ERRORS : misplaced word order
<7>ERRORS : boten (傍点 emphasis dots) misread as dakuten (が→か) | katakana カ misread as hiragana か
<10>ERRORS : wrong or missing word
[20 Critical Errors]

NOTE:
"False Syllables" such as らら, 基地 (字余り) identified in 95 Lines (16.24%) are not counted as errors.

zx573 Wrote:Italics isn't much of a problem really [...] I updated the program in my post to include some more options that I added in to make the results slightly better.
Thank you for taking the time and trouble to OCR the first test file from Zarxrax.

I'll run through it and make some comparisons during the weekend.

Yes, the ruby text is a headache!  I usually omit it while proofreading because I find that three lines (with the ruby text) take up too much space on the display screen.  Special characters are mainly used for 視聴障害者 (hearing-impaired viewers), and I disregard them as well, because they require special software to input. In my earlier OCR days, I found italic fonts to be something else entirely, pretty tricky when combined with vertical text positioning.

I gotta run now.  Thanks again for your great help.
Edited: 2017-01-20, 4:57 am
Reply
#50
(2017-01-19, 10:39 pm)eslang Wrote:
Zarxrax Wrote:Eh, I've never heard of anyone not being able to download from Mega. I can try to put it up somewhere else either tomorrow or this weekend. I should be able to grab Hibana spark this weekend too.
Yeah, you can say that again. It was a terrible shock for me too.

Well, at least I got to see what the problems with my friend's computer are.

There is no rush; I can wait. Either this weekend or even next week is alright.

Thank you.

At 'zx573'
Now we're talking!  Thanks for the new tip and trick. [...]

(2017-01-17, 9:31 pm)Zarxrax Wrote: Alright, I have the japanese+english subs for all 46 episodes of Terrace House here: https://mega.nz/#!kBkxwRiC!wFdvE5O5a2gbU...0NWFUJGSlA
or https://www.mediafire.com/?3eeza36lunl5cmn
I just saw this update!  素晴らしい~ (Wonderful!) You're the best!

At 'juniperpansy'  
At 'johndoe2015'
At forum members who are interested in Terrace House Japanese subtitles

Please indicate two episodes out of the 46 Terrace House episodes that you'd like to see as Japanese (SRT) subtitles.
Please reply by the end of this week, and we can decide which two episodes to type out.  Thanks.

Very cool. I would say the first two would be OK. If someone else has specific preferences, I'll go along with that. I'd learn a lot from any of them.

(2017-01-20, 4:38 am)eslang Wrote: At 'zx573'
Thank you for your help. I really appreciate it. ありがとう~ [...]

Oh wow. Thanks a million for doing this. I will check this out as well tomorrow.
Edited: 2017-01-20, 7:43 pm
Reply