
Netflix Japanese content

#51
eslang: here are the Hibana sub images: https://www.mediafire.com/?p55sm9tgbp2bw9z

Today I was able to whip up my own homemade OCR using some APIs available in Windows 10. It seems pretty decent, but not as good as the Google OCR. It's definitely better than Tesseract, though. I'll investigate further to see whether it's viable for batch-converting the whole series, since you have to pay to run the Google one on more than 1000 images.
Reply
#52
I need to use my Google Cloud Platform account for work-related test development for now (I'm writing an OCR-based program, which is why I started learning the API to begin with), so I don't really want to run a bunch of subtitles through it. But if anyone wants to make their own account and get an API key to plug into the script I posted, it's easy to sign up for a free trial. When you sign up, the trial gives you $300 of free Google Cloud Platform credit that's valid for 2 months. If my credit is close to expiring and I still have some left over, I'd be willing to run some OCR for you guys. I've already managed to burn through $5.23, mostly from testing the OCR app.

https://cloud.google.com/free-trial/
Reply
#53
(2017-01-20, 7:55 pm)Zarxrax Wrote: eslang: here are the Hibana sub images: https://www.mediafire.com/?p55sm9tgbp2bw9z

Today I was able to whip up my own homemade OCR using some APIs available in Windows 10. It seems pretty decent, but not as good as the Google OCR. It's definitely better than Tesseract, though. I'll investigate further to see whether it's viable for batch-converting the whole series, since you have to pay to run the Google one on more than 1000 images.
The Japanese subtitles for 「火花」 (Hibana) are done! Thank you, Zarxrax-san.

Hopefully your home-brew OCR can whip up some decent stuff. Looking forward to seeing it later.

(2017-01-20, 8:33 pm)zx573 Wrote: I need to use my Google Cloud Platform account for work-related test development for now (I'm writing an OCR-based program, which is why I started learning the API to begin with), so I don't really want to run a bunch of subtitles through it. But if anyone wants to make their own account and get an API key to plug into the script I posted, it's easy to sign up for a free trial. When you sign up, the trial gives you $300 of free Google Cloud Platform credit that's valid for 2 months. If my credit is close to expiring and I still have some left over, I'd be willing to run some OCR for you guys. I've already managed to burn through $5.23, mostly from testing the OCR app.

https://cloud.google.com/free-trial/
Thanks for the information on the Google Cloud Platform credit system.

Oh wow, just two test files and it already burned through $5.23! That's expensive.

So far, johndoe2015 has requested the first two episodes of Terrace House. But let us wait a little while for juniperpansy's reply.

It seems that your updated program is getting better at OCR. Amazing!
Here is a simple breakdown for a random episode of Tokyo Stories.

TOTAL LINES = 323

TOTAL ERRORS = 26 (8.05%)

Minor Errors = 2 (0.62%)
Major Errors = 18 (5.57%)
Critical Errors = 6 (1.86%)

<2>ERRORS : person name in kanji
[2 Minor Errors]

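<16>ERRORS : full-size つ / ツ / え where the small kana (っ / ッ / ぇ) was expected
<1>ERRORS : full-size や where the small ゃ was expected
<1>ERRORS : missing long-vowel mark ー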
[18 Major Errors]

<6>ERRORS : wrong or missing word
[6 Critical Errors]

NOTE:
"False Syllables" (字余り;ダブリ字) identified in 19 Lines (5.88%) are not counted as errors.

(Random episode of Tokyo Stories)
Proofread & Edited:
http://pastebin.com/c6v6sYeQ
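
As a quick check, the percentages in the breakdown above can be reproduced from the raw counts. This is only a small sketch; the dictionary keys and the `percent` helper are names chosen here for illustration, and every rate is taken relative to the 323 subtitle lines.

```python
# Counts copied from the breakdown; rates are relative to the 323 subtitle lines.
TOTAL_LINES = 323

counts = {
    "minor": 2,
    "major": 18,
    "critical": 6,
    "false_syllables": 19,  # reported separately, not counted as errors
}


def percent(count, total=TOTAL_LINES):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


total_errors = counts["minor"] + counts["major"] + counts["critical"]

print("TOTAL ERRORS = {} ({:.2f}%)".format(total_errors, percent(total_errors)))
for label in ("minor", "major", "critical", "false_syllables"):
    print("{} = {} ({:.2f}%)".format(label, counts[label], percent(counts[label])))
```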
Edited: 2017-01-20, 8:51 pm
Reply
#54
I think johndoe2015 has good suggestions! I watched the series with Japanese subtitles, and because of that I missed a lot of the sexual innuendo in some of the episodes. Those would be interesting to have done, but I have no idea which episodes they are, haha.
Reply
#55
(2017-01-20, 10:01 pm)juniperpansy Wrote: I think johndoe2015 has good suggestions! I watched the series with Japanese subtitles, and because of that I missed a lot of the sexual innuendo in some of the episodes. Those would be interesting to have done, but I have no idea which episodes they are, haha.
Oh dear, there is sexual innuendo in some of the episodes? Really?

I don't know which episodes either, since I have not watched Terrace House at all. Do you want to wait for johndoe2015 or someone else to reply, in case they know which episodes you are referring to?

Nukemarine Wrote:Could you run the subtitles of Tiger&Dragon through that program? I would like to create a Memrise course for that series, but it'd take a number of hours to transcribe the subs by hand.
It is possible to import the VobSub (idx/sub) files into Subtitle Edit and export them in the BDN XML + PNG subtitle format.
[Image: 0AlpoMo.jpg]

In theory, zx573's OCR program should be able to handle the BDN subtitle format as well, but I can't test it at the moment, so you can either give it a try yourself or wait for zx573 to reply.
Edited: 2017-01-20, 10:47 pm
Reply
#56
(2017-01-20, 10:29 pm)eslang Wrote:
(2017-01-20, 10:01 pm)juniperpansy Wrote: I think johndoe2015 has good suggestions! I watched the series with Japanese subtitles, and because of that I missed a lot of the sexual innuendo in some of the episodes. Those would be interesting to have done, but I have no idea which episodes they are, haha.
Oh dear, there is sexual innuendo in some of the episodes? Really?

I don't know which episodes either, since I have not watched Terrace House at all. Do you want to wait for johndoe2015 or someone else to reply, in case they know which episodes you are referring to?

Nukemarine Wrote:Could you run the subtitles of Tiger&Dragon through that program? I would like to create a Memrise course for that series, but it'd take a number of hours to transcribe the subs by hand.
It is possible to import the VobSub (idx/sub) files into Subtitle Edit and export them in the BDN XML + PNG subtitle format.
[Image: 0AlpoMo.jpg]

In theory, zx573's OCR program should be able to handle the BDN subtitle format as well, but I can't test it at the moment, so you can either give it a try yourself or wait for zx573 to reply.

Oooof. The comedians who narrate are full of fun comments, but pinpointing specifics would be tough. 

It just occurred to me that there is a character in the later episodes called Handa, or "Mr Perfect", who is so eloquent, mature, and clever that the women all go nuts. Maybe we should get some of his lines. I'll dig up specific episodes.

He shows up in ep 25 (season two). That would be a good one, or anything after.
Edited: 2017-01-20, 11:55 pm
Reply
#57
(2017-01-20, 8:33 pm)zx573 Wrote: I need to use my Google Cloud Platform account for work-related test development for now (I'm writing an OCR-based program, which is why I started learning the API to begin with), so I don't really want to run a bunch of subtitles through it. But if anyone wants to make their own account and get an API key to plug into the script I posted, it's easy to sign up for a free trial. When you sign up, the trial gives you $300 of free Google Cloud Platform credit that's valid for 2 months. If my credit is close to expiring and I still have some left over, I'd be willing to run some OCR for you guys. I've already managed to burn through $5.23, mostly from testing the OCR app.

https://cloud.google.com/free-trial/

I'm getting this error when I try your script. Any ideas what's wrong?


Code:
Parsing C:\Users\Alan\Desktop\hibana 01\manifest_ttml2.xml...
Generating request (1/44)...
Requesting OCR text...
Traceback (most recent call last):
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 132, in <module>
    generate_srt_from_xml(input_folder)
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 106, in generate_srt_from_xml
    results = ocr_text(filenames)
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 90, in ocr_text
    for idx, r in enumerate(resp.json()['responses']):
KeyError: 'responses'
Reply
#58
(2017-01-21, 10:50 am)Zarxrax Wrote:
I'm getting this error when I try your script. Any ideas what's wrong?


Code:
Parsing C:\Users\Alan\Desktop\hibana 01\manifest_ttml2.xml...
Generating request (1/44)...
Requesting OCR text...
Traceback (most recent call last):
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 132, in <module>
    generate_srt_from_xml(input_folder)
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 106, in generate_srt_from_xml
    results = ocr_text(filenames)
  File "C:\Users\Alan\Desktop\generate_srt_from_netflix.py", line 90, in ocr_text
    for idx, r in enumerate(resp.json()['responses']):
KeyError: 'responses'

After this line:
Code:
resp = requests.post(VISION_ENDPOINT, data=data, params={"key": AUTH_KEY}, headers={'Content-Type': 'application/json'})
Add the following code:
Code:
print(resp)       # shows the HTTP status, e.g. <Response [403]>
print(resp.text)  # shows the raw body, including any error message

And you should be able to see the response code and what error it's returning.

My guess would be an invalid API key.
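
To make this kind of failure easier to read than a bare `KeyError: 'responses'`, the parsed JSON can be checked before it is indexed. The following is a minimal sketch, not part of the posted script: `extract_responses` is a hypothetical helper, and it assumes that a failed request returns a JSON body with a top-level `error` object carrying `code`, `status`, and `message` fields. The sample payload is illustrative, not captured output.

```python
def extract_responses(payload):
    """Return the 'responses' list, or raise with the API's own error details."""
    if "responses" in payload:
        return payload["responses"]
    # ASSUMPTION: failures carry a top-level "error" object.
    error = payload.get("error", {})
    raise RuntimeError(
        "OCR request failed (code {}, status {}): {}".format(
            error.get("code", "unknown"),
            error.get("status", "unknown"),
            error.get("message", "no message in response"),
        )
    )


if __name__ == "__main__":
    # Illustrative payload only; the real message text will differ.
    sample_failure = {
        "error": {
            "code": 403,
            "status": "PERMISSION_DENIED",
            "message": "example message (illustrative)",
        }
    }
    try:
        extract_responses(sample_failure)
    except RuntimeError as exc:
        print(exc)
```

In the script, the loop from the traceback would then read `for idx, r in enumerate(extract_responses(resp.json())):`, so a disabled API or bad key shows up as a readable message.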
Edited: 2017-01-21, 11:15 am
Reply
#59
Thanks, that helped. I had not "enabled" the Cloud Vision API on Google's site yet, so that did the trick.

I'll play around with this some and see how it works. Looks like this is a better avenue to pursue than what I was working on.
I'll try running every episode of Terrace House, Hibana, and Midnight Diner through this, if there is enough credit.
If we can get it to combine the images all into one, that would probably solve the issue of pricing altogether.
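
A rough way to see why merging images would help with pricing is sketched below. It assumes billing is per submitted image, uses the 1000-image free allowance mentioned earlier in the thread, and treats the price as a placeholder parameter (the 1.50 default is an assumption, so check the current price list). The figure of 44,000 subtitle images is made up for the example.

```python
import math

FREE_UNITS = 1000       # free allowance mentioned earlier in the thread
PRICE_PER_1000 = 1.50   # ASSUMPTION: placeholder price in USD per 1000 images


def estimated_cost(num_subtitle_images, images_per_request=1,
                   free_units=FREE_UNITS, price_per_1000=PRICE_PER_1000):
    """Estimate the OCR bill when several subtitle images are merged into one."""
    # Each merged canvas counts as a single billed image under this assumption.
    billed_images = int(math.ceil(float(num_subtitle_images) / images_per_request))
    paid_units = max(0, billed_images - free_units)
    return paid_units * price_per_1000 / 1000.0


if __name__ == "__main__":
    subtitle_images = 44000  # hypothetical total, for illustration only
    for per_request in (1, 5, 20):
        cost = estimated_cost(subtitle_images, per_request)
        print("{} per canvas: ${:.2f}".format(per_request, cost))
```

Under these assumptions the cost falls roughly in proportion to the number of subtitles packed into each canvas, until the total drops under the free allowance.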
Reply
#60
(2017-01-21, 2:35 pm)Zarxrax Wrote: Thanks, that helped. I had not "enabled" the Cloud Vision API on Google's site yet, so that did the trick.

I'll play around with this some and see how it works. Looks like this is a better avenue to pursue than what I was working on.
I'll try running every episode of Terrace House, Hibana, and Midnight Diner through this, if there is enough credit.
If we can get it to combine the images all into one, that would probably solve the issue of pricing altogether.

I'd check this out if you plan on going that route.
https://cloud.google.com/vision/docs/bes...age_sizing

There are file size limits as well as recommended best resolutions you should keep in mind.

It might take a longer time to process the images if you want to combine them, but there's already code to paste an image on top of another image in my code (the code setting the background to black). You could calculate how many strings you want to put into one file and pre-calculate the required size and make a canvas that big and paste everything at fixed coordinates.
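
The canvas pre-calculation described above could look something like the sketch below. It only does the geometry: given the pixel sizes of the subtitle images, it stacks them vertically with padding and returns the canvas size plus the fixed paste coordinates. The function names and the padding value are hypothetical, and the actual pasting (for example with Pillow's `Image.new` and `Image.paste`) is left out. The second helper maps a y coordinate from the OCR result back to the subtitle image it came from.

```python
def plan_canvas(sizes, padding=10):
    """Stack images vertically; return (canvas_width, canvas_height, positions).

    sizes is a list of (width, height) tuples, one per subtitle image.
    positions holds the top-left (x, y) paste coordinate for each image.
    """
    if not sizes:
        return 0, 0, []
    canvas_width = max(width for width, _ in sizes) + 2 * padding
    positions = []
    y = padding
    for _, height in sizes:
        positions.append((padding, y))
        y += height + padding  # leave a gap so neighbouring lines do not merge
    return canvas_width, y, positions


def source_index_for(y, sizes, positions):
    """Return the index of the image whose vertical band contains y, else None."""
    for idx, ((_, height), (_, top)) in enumerate(zip(sizes, positions)):
        if top <= y < top + height:
            return idx
    return None


if __name__ == "__main__":
    sizes = [(400, 60), (320, 120), (500, 60)]  # made-up image sizes
    width, height, positions = plan_canvas(sizes)
    print(width, height, positions)
    print(source_index_for(100, sizes, positions))
```

Before sending anything, the resulting canvas size should be checked against the file size and resolution limits in the best-practices page linked above.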

I added some parameters to the latest version, such as scaling, because I found that some subs (italics in particular) seemed to work better when scaled down a little. So play with that too.
Reply
#61
(2017-01-20, 1:11 am)Nukemarine Wrote: Could you run the subtitles of Tiger&Dragon through that program? I would like to create a Memrise course for that series, but it'd take a number of hours to transcribe the subs by hand.

This link is redirecting to some Russian page. How do you download this?


So far I have downloaded subs and run OCR on all of: Terrace House, Hibana, Midnight Diner, Atelier. Currently processing: Mischievous Kiss seasons 1 & 2, Good Morning Call.
Edited: 2017-01-21, 10:45 pm
Reply
#62
(2017-01-21, 5:06 pm)Zarxrax Wrote:
(2017-01-20, 1:11 am)Nukemarine Wrote: Could you run the subtitles of Tiger&Dragon through that program? I would like to create a Memrise course for that series, but it'd take a number of hours to transcribe the subs by hand.
This link is redirecting to some Russian page. How do you download this?

So far I have downloaded subs and run OCR on all of: Terrace House, Hibana, Midnight Diner, Atelier. Currently processing: Mischievous Kiss seasons 1 & 2, Good Morning Call.
Japanese TV series/miniseries 'タイガー&ドラゴン' (2005)
(right-click on the "green arrow pointing down" under "Link" to download .rar subtitles)

I managed to download the .rar file from that site. If you still cannot download it, I'll upload it somewhere else for you.
Quote:netflix + japanese subtitles + rikaikun
As for 'Underwear' subtitles, is it this one?

Japanese-Subtitles/@Mains/@2015/@2015_10-12_Fall_Season/アンダーウェア

『アンダーウェア』(英題:Atelier)- Wikipedia
Atelier (Japanese title Underwear) is a Japanese web television drama developed by Fuji Television for Netflix.

FaultyMaxim Wrote:Thanks eslang! The subtitles you linked for アンダーウェア seem to work well with the method of loading that I mentioned in the original post.
English Title - Atelier is the same as Japanese Title - アンダーウェア (Underwear)

@johndoe2015

Quote:Oooof. The comedians who narrate are full of fun comments, but pinpointing specifics would be tough.

It just occurred to me that there is a character in the later episodes called Handa, or "Mr Perfect", who is so eloquent, mature, and clever that the women all go nuts. Maybe we should get some of his lines. I'll dig up specific episodes.

He shows up in ep 25 (season two). That would be a good one, or anything after.

Thanks for replying about which episodes juniperpansy was referring to.

I checked the image (PNG) files for episode 25, and a new person named Handa (半田) is indeed introduced to the others in the Terrace House. I will take on these two episodes, 25 and 26.
Edited: 2017-01-22, 12:47 am
Reply
#63
(2017-01-22, 12:29 am)eslang Wrote: (right-click on the "green arrow pointing down" under "Link" to download .rar subtitles)

I managed to download the rar file from that site.  If you still cannot download it, I'll upload it for you somewhere else.

Yeah, I still can't seem to download it, so it would be great if you could post it somewhere.

Sometime tomorrow I'll post all the subs I have OCR'ed so far.
Edited: 2017-01-22, 1:07 am
Reply
#64
(2017-01-22, 12:29 am)eslang Wrote:
Upon checking the image (png) files in episode 25, there is a new person with the name Handa (半田) being introduced to the others in the "Terrace House", and I will take on these two episodes - 25 and 26.

Thank you for all of the hard work! This is very much appreciated.
Reply
#65
(2017-01-22, 1:05 am)Zarxrax Wrote: Yeah, I still can't seem to download it, so it would be great if you could post it somewhere.

Sometime tomorrow I'll post all the subs I have OCR'ed so far.
Somewhere else for you
Reply
#66
I was able to rig up zx573's script to OCR the Tiger and Dragon VobSubs. It looks like they turned out pretty well. I'll post everything when I get back later today.
Reply
#67
Alright, I've got the subs all OCR'ed and ready to go.

Netflix SRT pack
Contains:
  • Atelier (Underwear)
  • Good Morning Call season 1
  • Hibana (Spark)
  • Midnight Diner: Tokyo Stories
  • Terrace House - Boys and Girls in the City
  • Mischievous Kiss ~ Love in Tokyo (Itazura na Kiss)
  • Mischievous Kiss 2 ~ Love in Tokyo (Itazura na Kiss 2)
Also, Tiger and Dragon subs per Nukemarine's request.

All Japanese subs were OCRed and may contain mistakes. All or most of the Japanese subs are closed captions rather than true subtitles. Subtitle Edit can be used to automatically strip out sounds and character names (Tools > Remove text for hearing impaired).
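
For anyone who prefers to script the clean-up, a rough equivalent is sketched below. This is not Subtitle Edit's algorithm: it assumes that sounds and character names are wrapped in full-width or half-width parentheses, which may not hold for every show. The `strip_labels` name and the sample caption text are made up for the example.

```python
# -*- coding: utf-8 -*-
import re

# ASSUMPTION: labels look like （…） or (…); real captions may use other markers.
LABEL = re.compile(u"[（(][^（）()]*[）)]")


def strip_labels(text):
    """Remove parenthesised labels from one subtitle's text and drop empty lines."""
    cleaned = []
    for line in text.splitlines():
        line = LABEL.sub(u"", line).strip()
        if line:
            cleaned.append(line)
    return u"\n".join(cleaned)


if __name__ == "__main__":
    # Made-up caption: a speaker name, a line of dialogue, and a sound effect.
    caption = u"（半田）こんにちは\n（ドアが開く音）"
    print(strip_labels(caption))
```

Because the pattern removes anything in parentheses, it will also remove parenthesised text that is genuinely part of the dialogue, so the output is worth a quick skim.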

English subtitles are included in the Netflix pack. The English subtitles were ripped as text, so they contain no OCR mistakes.

eslang has kindly proofread a few episodes to correct some of the OCR mistakes:
Terrace House 01
Terrace House episode 25
Terrace House episode 26
Midnight Diner 08
Hibana episode 1
Hibana episode 2
Edited: 2017-01-31, 5:03 pm
Reply
#68
Thank you very much, Zarxrax. That is really awesome!

Just curious, how much of the Google Cloud Platform credit did it burn through to OCR the subtitles in the Netflix pack and Tiger & Dragon?

@juniperpansy
@johndoe2015

Here are the two episodes from Terrace House:
Proofread and Edited
Terrace House episode 25
http://pastebin.com/GwrgUstH

Terrace House episode 26
http://pastebin.com/hZkHLccJ

Next, I'll proofread and edit Hibana episodes 1 and 2. They should be up here by the end of this week.
Reply
#69
(2017-01-22, 11:06 pm)eslang Wrote: Thank you very much, Zarxrax. That is really awesome!

Just curious, how much of the Google Cloud Platform credit did it burn through to OCR the subtitles in the Netflix pack and Tiger & Dragon?

I've got about $160 of my credit left, so it looks like these blew through almost half of it.
Reply
#70
(2017-01-22, 7:12 pm)Zarxrax Wrote: Alright, I've got the subs all OCR'ed and ready to go.

Netflix SRT pack
Contains:
  • Atelier (Underwear)
  • Good Morning Call season 1
  • Hibana (Spark)
  • Midnight Diner: Tokyo Stories
  • Terrace House - Boys and Girls in the City
  • Mischievous Kiss ~ Love in Tokyo (Itazura na Kiss)
  • Mischievous Kiss 2 ~ Love in Tokyo (Itazura na Kiss 2)
Also, Tiger and Dragon subs per Nukemarine's request.

All Japanese subs were OCRed and may contain mistakes. All or most of the Japanese subs are closed captions rather than true subtitles. Subtitle Edit can be used to automatically strip out sounds and character names (Tools > Remove text for hearing impaired).

English subtitles are included in the Netflix pack. The English subtitles were ripped as text, and contain no mistakes.

eslang has kindly proofread a few episodes to correct some of the OCR mistakes:
Terrace House 01
Midnight Diner 08


I'm finished proactively ripping subs, but I'll be glad to take requests (for now) if someone has some other shows that they want subs for.

This is so great. Thank you, Zarxrax.

(2017-01-22, 11:06 pm)eslang Wrote: Thank you very much, Zarxrax. That is really awesome!


Eslang, thank you. This is so helpful.
Edited: 2017-01-23, 9:56 am
Reply
#71
(2017-01-23, 6:40 am)Zarxrax Wrote: I've got about $160 of my credit left, so it looks like these blew through almost half of it.
I think zx573's updated OCR program is really cool because it handles Japanese italic fonts and vertically positioned text quite well, but the Google API is pretty pricey.

I tried a workaround for OCRing the subtitles using IrfanView, Subtitle Edit, and Idxsubocr (all free software that runs on Windows). The results are quite good on average, setting aside Japanese italic fonts and vertically positioned text.

By the way, do you (and zx573) have any requests for other Japanese drama subtitles?
I can check through my secret subtitle collection.
Edited: 2017-01-24, 2:28 am
Reply
#72
I think I could potentially improve the results even more, but it would require code that I wrote for work, so I can't release it. It would most likely fix the issue of random characters being added to the end of the OCR'd text, which happens because the OCR tries to read the same characters both left-to-right and top-to-bottom. A simplified version could be written that checks which direction most of the text runs in, and then drops results that don't run in that direction (so, no top-to-bottom results if everything else is mostly left-to-right).
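
The simplified version could be sketched roughly as follows. This is not the work code mentioned above, just one possible reading of the idea: it assumes each OCR result comes with a bounding box given as `(x, y, width, height)`, uses the box's aspect ratio as a stand-in for reading direction, and keeps only the results that agree with the majority. All names and the sample data are hypothetical.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def direction(box):
    """Classify an (x, y, width, height) box as 'horizontal' or 'vertical'."""
    _, _, width, height = box
    # ASSUMPTION: a box wider than it is tall holds left-to-right text.
    return "horizontal" if width >= height else "vertical"


def keep_dominant_direction(results):
    """Keep only the (text, box) results whose direction matches the majority."""
    if not results:
        return []
    tally = Counter(direction(box) for _, box in results)
    dominant = tally.most_common(1)[0][0]
    return [(text, box) for text, box in results if direction(box) == dominant]


if __name__ == "__main__":
    # Made-up results: two horizontal lines and one stray vertical reading.
    sample = [
        (u"こんにちは", (0, 0, 200, 40)),
        (u"ありがとう", (0, 50, 220, 40)),
        (u"ぁ", (230, 0, 20, 90)),
    ]
    for text, _ in keep_dominant_direction(sample):
        print(text)
```

A single-character result has a nearly square box, so the aspect-ratio test is weakest exactly there. Weighting the vote by character count might be a reasonable refinement.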

I'm glad to hear that everything worked out, though!
I kinda figured it would blow through a bunch of money, which is why I wasn't prepared to do it quite yet.

I don't really watch dramas, so I'm good, I guess. Unless there's a good (and not cheaply made) horror-themed drama I should know about; then you'd have my interest.
Edited: 2017-01-24, 11:17 am
Reply
#73
FYI, the wife (a major Netflix addict) just saw that TERRACE HOUSE: ALOHA STATE is now available on Netflix US. Have at it, everyone!!
Reply
#74
@zx573 and Zarxrax
Thank you both. I think you have each done quite a lot already.
I'm happy to have learned something new from this thread.

[Proofread and Edited]
Hibana episode 1
http://pastebin.com/9K1qUfxR

Hibana episode 2
http://pastebin.com/W0JTPzHJ
Reply
#75
Hmmm, this is weird. My Google credit is only around $20 now, and I haven't used it since the other day, when it last told me I had about $160 left. That is a serious delay in stats reporting or something.
In any case, it looks like I will not be able to OCR any more subs using this method.
Reply