Script to steal the audio from the news from TBS

Reply #1 - 2009 May 22, 3:28 pm
mentat_kgs Member
From: Brasil Registered: 2008-04-18 Posts: 1671 Website

Hi, I've written a cute little script that downloads the videos from TBS along with the article text and puts them in the current folder.

I'm using it on Ubuntu, but it should work on any platform where ruby and mencoder run (including Windows and Mac).

You only need ruby, hpricot and mencoder installed.

Download it from:
http://www.inf.ufsc.br/~emilio/japanese/aranha.rb

Example of use:

$ ruby aranha.rb

It will download the latest 15 articles from TBS that have video, convert the videos to mp3, and put everything in the same folder together with each article's text.


###### ONLY for windows users: ########

You can download the mplayer package (which contains mencoder) from this link:
http://sourceforge.net/project/showfile … _id=683443

Then you must set the MENCODER_PATH inside the script to wherever you install mencoder (notice the double \\ instead of only \ in pathnames). For example:

### BEGIN CONFIGURATION ###
MAX_FILES = 15
MENCODER_PATH = "c:\\my\\mencoder\\folder\\mencoder.exe"
HOST = "news.tbs.co.jp"
### END CONFIGURATION   ###
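
A small Ruby sketch of why the doubled backslashes matter, using the example path above. The constant names are only illustrative and are not taken from aranha.rb. In a double-quoted Ruby string a single backslash starts an escape sequence, so `\f` becomes a form feed and the other backslashes are silently dropped.

```ruby
# Illustrative constants only; aranha.rb itself just needs MENCODER_PATH.

# Double-quoted string with doubled backslashes: each \\ is one real backslash.
ESCAPED = "c:\\my\\mencoder\\folder\\mencoder.exe"

# Single-quoted strings keep a lone backslash literally (except before ' or \).
SINGLE_QUOTED = 'c:\my\mencoder\folder\mencoder.exe'

# Double-quoted string with single backslashes: \f turns into a form feed
# and the unknown escape \m collapses to a plain "m", so the path is ruined.
UNESCAPED = "c:\my\mencoder\folder\mencoder.exe"

puts ESCAPED                      # the intended Windows path
puts ESCAPED == SINGLE_QUOTED     # both spellings give the same string
puts UNESCAPED.include?("\f")     # the broken version contains a form feed
```

So either the doubled form shown in the configuration block or a single-quoted string should give mencoder the path you meant.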

Last edited by mentat_kgs (2009 May 25, 2:59 pm)

Reply #2 - 2009 May 22, 3:59 pm
Codexus Member
From: Switzerland Registered: 2007-11-27 Posts: 721

I've tested it on my Ubuntu, I just had to install the necessary packages and it worked without problems!

Great script! Thanks!

Reply #3 - 2009 May 22, 4:18 pm
vengeorgeb Member
Registered: 2008-12-22 Posts: 308

mentat_kgs wrote:

Hi, I've written a cute little script that downloads the videos from TBS along with the article text and puts them in the current folder.

Hey mentat, purely for educational purposes: if you have some time, could you walk through this code in detail? I think everyone, programmers and non-programmers alike, would appreciate your personal insight.

Reply #4 - 2009 May 22, 4:44 pm
denus Member
Registered: 2009-02-01 Posts: 22

Oh, wow, this is brilliant! Thanks a lot.

Reply #5 - 2009 May 22, 4:48 pm
ahibba Member
Registered: 2008-09-04 Posts: 528 Website

Users of Ubuntu and Linux are lucky.

We don't have that luxury!

Reply #6 - 2009 May 22, 4:56 pm
nac_est Member
From: Italy Registered: 2006-12-12 Posts: 617 Website

いただきま〜す

Reply #7 - 2009 May 22, 6:07 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Good stuff. Might I ask why you put it all in a begin block though? You're not using that block at all.

Reply #8 - 2009 May 22, 6:18 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

jorgebucaran wrote:

mentat_kgs wrote:

Hi, I've written a cute little script that downloads the videos from TBS along with the article text and puts them in the current folder.

Hey mentat, purely for educational purposes: if you have some time, could you walk through this code in detail? I think everyone, programmers and non-programmers alike, would appreciate your personal insight.

If you know Ruby, his code is very clear and quite easy to understand. Instead of reading an explanation on his code, you should probably learn the basics of Ruby and read it yourself. I think you would learn a lot from it. Ruby is so easy to read that it's pretty much self-documenting.

Reply #9 - 2009 May 22, 6:24 pm
kazelee Rater Mode
From: ohlrite Registered: 2008-06-18 Posts: 2132 Website

Interesting title, though, you're not actually 'stealing' it.

Reply #10 - 2009 May 23, 12:33 am
sethg Member
From: m Registered: 2008-11-07 Posts: 505

Oh my 神... this is amazing! I could kiss you!! Thanks so much :D :D :D

On a side note, paste and save the Ruby into a document in Ubuntu and it should color-code it for even easier reading :)

Reply #11 - 2009 May 23, 12:44 am
sethg Member
From: m Registered: 2008-11-07 Posts: 505

After downloading, though, when opening in Ubuntu's default text app, Text Editor, I get this:

$B!!5v2D$J$/309q?M$H@\?($7$?$H$7$F5/AJ$5$l$?%_%c%s%^!<$NL1<g2=1?F0;XF3<T%"%&%s!&%5%s!&%9!<!&%A!<$5$s$,#2#2F|!"K!Dn$G=i$a$FH/8@$N5!2q$rM?$($i$l!"!V$I$N$h$&$J:a$bHH$7$F$$$J$$!W$HL5:a$r<gD%$7$^$7$?!#(B


$B!!<+Bp$K?/F~$7$?%"%a%j%+?M$NCK@-$rGq$a$?$H$7$F9q2HKI8fK!0cH?$N:a$KLd$o$l$F$$$k%9!<!&%A!<$5$s$O#2#2F|$NK!Dn$G!"!V;d$O$I$N$h$&$J:a$bHH$7$F$$$J$$!#L5<B$G$9!W$HH]G'$7$^$7$?!#(B

$B!!$^$?!"?3M}$N$"$H!"%9!<!&%A!<$5$s$OJ[8n;N$KBP$7!"!VCK@-$,?/F~$G$-$?$N$O7YHw$K<jMn$A$,$"$C$?$+$i$@!#;d$O?MF;E*N)>l$+$iCK@-$rGq$a$?$@$1!W$H=R$Y$?$H$$$&$3$H$G$9!#(B

$B!!0lJ}!"%"%a%j%+?M$NCK@-$OK!Dn$G!V;d$,K,$l$?$N$O%9!<!&%A!<$5$s$,0E;&$5$l$kL4$r8+$?$+$i$@!W$H?/F~$7$?M}M3$r>Z8@$7$^$7$?!#(B

$B!!:[H=$OMh=50J9_$bB3$/M:Dj$G!"%9!<!&%A!<$5$s$OM-:a$H$5$l$l$P:GD9$G6X8G#5G/$N7:$K=h$5$l$k2DG=@-$,$"$j$^$9!#!J(B23$BF|(B01:33$B!K(B

However, if I Open With > Emacs22 (Client), it comes out as normal Japanese... in... an ugly editor :) Is there any way I can just view it normally with Text Editor?

Last edited by sethg (2009 May 23, 12:44 am)

Reply #12 - 2009 May 23, 2:18 am
Codexus Member
From: Switzerland Registered: 2007-11-27 Posts: 721

sethg, why not simply open them with firefox? It recognizes the character encoding automatically and that way you get to use rikai-chan.

Reply #13 - 2009 May 23, 5:32 am
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

sethg wrote:

After downloading, though, when opening in Ubuntu's default text app, Text Editor, I get this:

$B!!5v2D$J$/309q?M$H@\?($7$?$H$7$F5/AJ$5$l$?%_%c%s%^!<$NL1<g2=1?F0;XF3<T%"%&%s!&%5%s!&%9!<!&%A!<$5$s$,#2#2F|!"K!Dn$G=i$a$FH/8@$N5!2q$rM?$($i$l!"!V$I$N$h$&$J:a$bHH$7$F$$$J$$!W$HL5:a$r<gD%$7$^$7$?!#(B


$B!!<+Bp$K?/F~$7$?%"%a%j%+?M$NCK@-$rGq$a$?$H$7$F9q2HKI8fK!0cH?$N:a$KLd$o$l$F$$$k%9!<!&%A!<$5$s$O#2#2F|$NK!Dn$G!"!V;d$O$I$N$h$&$J:a$bHH$7$F$$$J$$!#L5<B$G$9!W$HH]G'$7$^$7$?!#(B

$B!!$^$?!"?3M}$N$"$H!"%9!<!&%A!<$5$s$OJ[8n;N$KBP$7!"!VCK@-$,?/F~$G$-$?$N$O7YHw$K<jMn$A$,$"$C$?$+$i$@!#;d$O?MF;E*N)>l$+$iCK@-$rGq$a$?$@$1!W$H=R$Y$?$H$$$&$3$H$G$9!#(B

$B!!0lJ}!"%"%a%j%+?M$NCK@-$OK!Dn$G!V;d$,K,$l$?$N$O%9!<!&%A!<$5$s$,0E;&$5$l$kL4$r8+$?$+$i$@!W$H?/F~$7$?M}M3$r>Z8@$7$^$7$?!#(B

$B!!:[H=$OMh=50J9_$bB3$/M:Dj$G!"%9!<!&%A!<$5$s$OM-:a$H$5$l$l$P:GD9$G6X8G#5G/$N7:$K=h$5$l$k2DG=@-$,$"$j$^$9!#!J(B23$BF|(B01:33$B!K(B

However, if I Open With > Emacs22 (Client), it comes out as normal Japanese... in... an ugly editor :) Is there any way I can just view it normally with Text Editor?

What is your locale? en_US.UTF-8?

Reply #14 - 2009 May 23, 7:23 am
mentat_kgs Member
From: Brasil Registered: 2008-04-18 Posts: 1671 Website

@jorgebucaran
Hey, I don't think a detailed walkthrough would be easy, but I'll describe what the code does:

It opens the TBS news front page.
It gets the first 15 links with a video icon.
It visits each link, finds where the video is, and copies the article text.

Then it downloads each of the videos and extracts the mp3.
wget and mencoder are external programs: wget is a download manager and mencoder is the encoding twin of mplayer.
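
The steps above can be sketched roughly like this in Ruby. This is not the code from aranha.rb: the `Article` struct, the method names, and the example URLs are all made up for illustration, and the mencoder options are just a commonly used way to extract mp3 audio, not necessarily the ones the script passes.

```ruby
# Hypothetical sketch of the selection and conversion steps.
# Names and data shapes are assumptions, not taken from aranha.rb.
MAX_FILES = 15

# One front-page entry; video_url is nil when the article has no video.
Article = Struct.new(:title, :video_url, :text)

# Keep only the articles that have a video, in page order, up to `limit`.
def pick_video_articles(articles, limit = MAX_FILES)
  articles.select { |a| a.video_url }.first(limit)
end

# File name stem derived from the video URL, e.g. ".../news1.flv" -> "news1".
def stem_for(article)
  File.basename(article.video_url, ".*")
end

# Argument list for mencoder: copy the video stream untouched, encode the
# audio with LAME, and write only the raw audio stream to the mp3 file.
def mencoder_command(mencoder_path, video_file, mp3_file)
  [mencoder_path, video_file,
   "-of", "rawaudio", "-oac", "mp3lame", "-ovc", "copy",
   "-o", mp3_file]
end

# Usage idea (not executed here):
#   pick_video_articles(articles).each do |a|
#     stem = stem_for(a)
#     File.write("#{stem}.txt", a.text)
#     system(*mencoder_command("mencoder", "#{stem}.flv", "#{stem}.mp3"))
#   end
```

Passing the command as an argument list to `system` avoids shell quoting problems if a file name contains spaces.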

@Tobberoth
I used the begin block just to separate the messy part of the code from the configuration variables.

Btw, I don't know what to do if your text file shows up garbled, but try opening the file with

$ LC_ALL=ja_JP.UTF-8 gedit dirty_file.txt

Last edited by mentat_kgs (2009 May 23, 7:27 am)

Reply #15 - 2009 May 23, 11:18 am
sethg Member
From: m Registered: 2008-11-07 Posts: 505

I've tried with Firefox, and changed it to every Japanese encoding, but nothing works :/

Tobberoth, I would assume it is, but I honestly don't know how to check that.

mentat_kgs, I tried your code, but it still showed up as the symbols I pasted above.

Awfully frustrating problem. I despise the Emacsen interface. I *could* go through and cut all the text from emacsen to Gedit and replace the file, but that just takes away from the beautiful simplicity of the script.

Anybody have any more ideas?


Edit: Actually, I have, indeed, got it working in Firefox. I just had to set Auto-detect to Japanese. I don't know why this is working now... I ran through, literally, all 3 Japanese settings yesterday. :: sigh :: Well, at least it's working, and with rikaichan for quick lookups, to boot :)

Last edited by sethg (2009 May 23, 11:29 am)

Reply #16 - 2009 May 23, 11:34 am
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Checking your locale is easy. Just open a terminal and run "locale".

As long as it ends in .UTF-8, simply copying into gedit should work...

Reply #17 - 2009 May 25, 9:51 am
denus Member
Registered: 2009-02-01 Posts: 22

If you use gedit, you can open the file manually and select the encoding ISO-2022-JP; it should then display correctly.

A script someone showed me:

Code:

for i in *.txt; do f=$(basename "$i"); iconv -f ISO-2022-JP -t UTF-8 "$i" > "$f.utf8"; done

can convert all those text files into Unicode, so it should display fine thereafter. Back up beforehand.
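
For anyone who would rather stay in Ruby, here is a minimal sketch of the same conversion using Ruby's built-in transcoding. The method names are my own, and it assumes the files really are ISO-2022-JP, which is what the `$B` and `(B` fragments in the garbled text above suggest (they look like the remains of that encoding's escape sequences).

```ruby
# Hypothetical helpers; not part of aranha.rb.

# Interpret raw bytes as ISO-2022-JP and transcode them to UTF-8.
def iso2022jp_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-2022-JP").encode("UTF-8")
end

# Write a UTF-8 copy next to the original, named like the shell loop's
# output (<name>.utf8), and return the new path. The original is untouched.
def convert_file(path)
  dest = "#{path}.utf8"
  File.binwrite(dest, iso2022jp_to_utf8(File.binread(path)))
  dest
end

# Usage idea (not executed here):
#   Dir.glob("*.txt").each { |path| puts convert_file(path) }
```

Like the shell version, this writes new files instead of overwriting, so the originals stay around as a backup.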

Edit: On that note, I just frigging wish Japan would adopt Unicode, the stubborn so-and-sos. >.<

Last edited by denus (2009 May 25, 9:56 am)

Reply #18 - 2009 May 25, 12:29 pm
mentat_kgs Member
From: Brasil Registered: 2008-04-18 Posts: 1671 Website

Ok, I just fixed the text issue. The script was not tested on windows, but it should work fine now.

Last edited by mentat_kgs (2009 May 25, 12:36 pm)

Reply #19 - 2009 May 25, 1:05 pm
sethg Member
From: m Registered: 2008-11-07 Posts: 505

mentat_kgs wrote:

Ok, I just fixed the text issue. The script was not tested on windows, but it should work fine now.

YAY! Totally works fine now :) Thanks so much :D NOW it is incredibly awesome :D

Reply #20 - 2009 May 25, 1:12 pm
sethg Member
From: m Registered: 2008-11-07 Posts: 505

denus wrote:

A script someone showed me:

Code:

for i in *.txt; do f=$(basename "$i"); iconv -f ISO-2022-JP -t UTF-8 "$i" > "$f.utf8"; done

can convert all those text files into Unicode, so it should display fine thereafter. Back up beforehand.

Thanks for this as well :) Converted all the ones I'd already downloaded, which saved me a lot of copying and pasting!

Reply #21 - 2009 May 25, 1:35 pm
denus Member
Registered: 2009-02-01 Posts: 22

Thanks so much, mentat. The script is indeed beyond awesome now. :3

Reply #22 - 2009 May 25, 2:03 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

Wow, I had no idea wget was preinstalled on Windows... Not that saving files down manually in Ruby is hard, but that's good to know, it's a useful tool.

Reply #23 - 2009 May 25, 2:08 pm
mentat_kgs Member
From: Brasil Registered: 2008-04-18 Posts: 1671 Website

It is not. I made ruby check the OS so it only downloads using wget on Linux.

Reply #24 - 2009 May 25, 2:11 pm
Tobberoth Member
From: Sweden Registered: 2008-08-25 Posts: 3364

mentat_kgs wrote:

It is not. I made ruby check the OS so it only downloads using wget on Linux.

Ah yes, I see it now... why did you decide to use wget on linux though? Is it faster? (I haven't tried it myself).

Reply #25 - 2009 May 25, 2:53 pm
mentat_kgs Member
From: Brasil Registered: 2008-04-18 Posts: 1671 Website

Not faster, but it resumes downloads and checks if the file was already downloaded.
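
A rough Ruby sketch of that kind of OS check, for the curious. The method names are invented for illustration and the real script may do this differently. `wget -c` is the standard option for continuing a partial download, which is assumed here to be the resume behaviour mentioned above.

```ruby
require "rbconfig"
require "net/http"
require "uri"

# True when the host OS string looks like Linux (e.g. "linux-gnu").
def linux?(host_os = RbConfig::CONFIG["host_os"])
  host_os.match?(/linux/i)
end

# Pick a download strategy: wget on Linux, plain Ruby everywhere else.
def download_strategy(host_os = RbConfig::CONFIG["host_os"])
  linux?(host_os) ? :wget : :ruby
end

# Argument list for wget; -c continues a partially downloaded file.
def wget_args(url)
  ["wget", "-c", url]
end

# Fallback without wget: fetch the whole body and write it to dest.
# No resume support, so an interrupted download starts over.
def fetch_with_ruby(url, dest)
  File.binwrite(dest, Net::HTTP.get(URI(url)))
end

# Usage idea (not executed here):
#   if download_strategy == :wget
#     system(*wget_args(url))
#   else
#     fetch_with_ruby(url, File.basename(url))
#   end
```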