kanji koohii FORUM
Automatic audio extraction from NHK News Easy? - Printable Version



Automatic audio extraction from NHK News Easy? - JackBS - 2014-11-21

Do you know of any way to automatically extract the audio from the five or so daily articles at NHK News Easy? http://www3.nhk.or.jp/news/easy

I can do it manually with programs like TubeMaster++ to get the mp3 files, but I'm looking for more of an automated daily process.

Thanks in advance!


Automatic audio extraction from NHK News Easy? - Inny Jan - 2014-11-21

I use something that I brewed myself - sorry, but I don't have time to provide any kind of support, so it's just you and this code:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re
import subprocess
import string
import urllib
from HTMLParser import HTMLParser

track_nb = 64
url_base = 'http://www3.nhk.or.jp/news/easy/'
time_stamp = '2014-11'

class NHKArticleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.div_cnt = 0
        self.div_cnt_article = 0
        self.is_rt = False
        self.is_img = False
        self.img_url = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            if ('id', 'mainimg') in attrs:
                self.is_img = True
                
        elif tag == 'img' and self.is_img:
            for attr in attrs:
                if attr[0] == 'src':
                    self.img_url = attr[1]

        elif tag == 'div':
            self.div_cnt += 1
            if ('id', 'newstitle') in attrs or ('id', 'newsarticle') in attrs:
                self.div_cnt_article = self.div_cnt
        
        elif tag == 'rt':
            self.is_rt = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.is_img:
            self.is_img = False
        
        elif tag == 'rt':
            self.is_rt = False
        
        elif tag == 'div':
            self.div_cnt -= 1
            if self.div_cnt < self.div_cnt_article:
                self.div_cnt_article = 0

    def handle_data(self, data):
        if self.div_cnt_article > 0 and not self.is_rt:
            self.data += data

# Parser for newslisteven.html
# newslisteven.html is a file that results from saving 'Copy as HTML' of
# <ul class="newslisteven"> from http://www3.nhk.or.jp/news/easy/index.html
class NHKNewsListEvenParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.articles = []
        
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            #print attrs[0][1][2:]
            self.articles.append(attrs[0][1][2:])

def get_article(article_id):
    global url_base
    global track_nb
    fn_img_tmp = 'fn_img'
    fn_mp3_tmp = 'fn_mp3'

    url = url_base + article_id
    (filename, headers) = urllib.urlretrieve(url)

    parser = NHKArticleParser()
    article = open(filename).read()
    article_m = article.replace('"shorturl"content', '"shorturl" content')
    parser.feed(article_m)
    parser.close()

    img_url = parser.img_url
    if img_url[0: 4] != 'http':
        img_url = os.path.dirname(url) + '/' + img_url
    urllib.urlretrieve(img_url, fn_img_tmp)
    
    data = re.sub('\n +', '\n', parser.data).lstrip('\n')
    lines = data.split('\n')
    title = lines[1] + lines[0]

    # Make sure the title is a valid filename
    title = ''.join(ch for ch in title if ch not in '\/:*?<>|')
    print title
    
    data = ''
    for line in lines[0: 3]:
        data = data + line + '\n'
    for line in lines[3:]:
        if len(line) > 0:
            data = data + line + '\n'

    file = open(unicode(title + '.txt', 'utf8'), 'w')
    file.write(data)
    file.close()

    pos = article_id.find('html')
    url_mp3 = url_base + article_id[0: pos] + 'mp3'
    print url_mp3, '-', track_nb
    fn_mp3 = unicode(title + '.mp3', 'utf8')

    urllib.urlretrieve(url_mp3, fn_mp3_tmp)

    command = [
        'python',
        r'C:\Users\<your-profile-name>\Commands\eyeD3',
        '--remove-all',
        '--no-color',
        '--to-v2.3',
        '--set-encoding=utf16-LE',
        '--force-update',
        '--artist=NHKニュース',
        '--album=NHKニュース - ' + time_stamp,
        '--genre=Non-fiction',
        '--title=' + title,
        '--track=' + str(track_nb),
        '--add-image=' + fn_img_tmp + ':FRONT_COVER',
        fn_mp3_tmp
    ]
    
    subprocess.call(
        command,
        shell = True
    )
    track_nb += 1
    os.rename(fn_mp3_tmp, fn_mp3)
    os.remove(fn_img_tmp)

# Instantiate the parser and feed it some .html
parser = NHKNewsListEvenParser()
parser.feed(open('newslisteven.html').read())
parser.close()
for article in reversed(parser.articles):
    get_article(article)
Edit:
I've found some time now to record the steps to make use of the above script.

1. Save the Python code to a file named dl.py
If you don't care about tagging the mp3 files, then
2. Remove lines 118-138 (the eyeD3 command list and the subprocess.call that runs it)
(otherwise things get even messier...)

In Google Chrome (I believe that other browsers allow a similar workflow as well):
1. Open "http://www3.nhk.or.jp/news/easy/index.html"
2. Select "11月26日(水)"
3. Right-click > Inspect Element on "「黒人を撃った警察官を訴えない」アメリカ中で抗議"

In Developer Tools:
1. Locate '<ul class="newslisteven">'
2. Right-click > Copy '<ul class="newslisteven">'
3. Save the Clipboard to the newslisteven.html file (this file has to be in the same directory as the dl.py script)

Command box part:
1. Open the command window
2. Change the current directory to the location where dl.py and newslisteven.html are
3. Execute 'python -u dl.py'

After execution of the script finishes you should see .txt and .mp3 files for 11月26日(水).
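
For readers on Python 3, the part of the script that reads newslisteven.html is small enough to sketch on its own. The following is a minimal sketch, assuming the saved list contains the article links as `<a href="./.../....html">` and that each article's mp3 sits next to its .html page, as the script's url_mp3 line implies. The names NewsListParser and mp3_urls are hypothetical, and nothing here downloads anything.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser

URL_BASE = 'http://www3.nhk.or.jp/news/easy/'  # same base URL as the script above


class NewsListParser(HTMLParser):
    """Collect the href of every <a> tag in the saved list (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.articles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Drop a leading './' so the path can be appended to URL_BASE.
                self.articles.append(href[2:] if href.startswith('./') else href)


def mp3_urls(list_html):
    """Return one mp3 URL per article link, in reversed list order like the script.

    Assumption: the mp3 has the same path as the article with '.mp3' in place of '.html'.
    """
    parser = NewsListParser()
    parser.feed(list_html)
    parser.close()
    urls = []
    for article in reversed(parser.articles):
        stem = article[:-len('.html')] if article.endswith('.html') else article
        urls.append(URL_BASE + stem + '.mp3')
    return urls
```

With the saved file in the current directory, `mp3_urls(open('newslisteven.html', encoding='utf-8').read())` would list the audio URLs for that day without fetching them.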


Automatic audio extraction from NHK News Easy? - JackBS - 2014-11-22

Thank you! I'll find out how to use this.


Automatic audio extraction from NHK News Easy? - RawToast - 2014-11-26

If you're after more content, I believe there is an archive of NHK News Easy articles with separate audio mp3s in buonaparte's resources thread.

http://forum.koohii.com/showthread.php?tid=6840


Automatic audio extraction from NHK News Easy? - JackBS - 2014-11-30

RawToast Wrote:If you're after more content, I believe there is an archive of NHK News Easy articles with separate audio mp3s in buonaparte's resources thread.

http://forum.koohii.com/showthread.php?tid=6840
Thank you! I've practiced already with a good chunk of Buonaparte's archives, but I'm incredibly "hard of hearing" and so I'd like to try to get this to work to get more free new content every day.

Inny Jan Wrote:I've found some time now to record the steps to make use of the above script...
That is very kind of you, thank you very much! I confess I've battled for days to get it to work, and I know you're busy, so the following question is for anyone who might be able to spot the error, be it another programmer or a layman like me who nevertheless got it to work.

In the python-code-saving step, I removed lines 118 to 138 and saved the resulting code as "dl.py" in the folder "C:\Users\Jack\Desktop\python":
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re
import subprocess
import string
import urllib
from HTMLParser import HTMLParser

track_nb = 64
url_base = 'http://www3.nhk.or.jp/news/easy/'
time_stamp = '2014-11'

class NHKArticleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.div_cnt = 0
        self.div_cnt_article = 0
        self.is_rt = False
        self.is_img = False
        self.img_url = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            if ('id', 'mainimg') in attrs:
                self.is_img = True
                
        elif tag == 'img' and self.is_img:
            for attr in attrs:
                if attr[0] == 'src':
                    self.img_url = attr[1]

        elif tag == 'div':
            self.div_cnt += 1
            if ('id', 'newstitle') in attrs or ('id', 'newsarticle') in attrs:
                self.div_cnt_article = self.div_cnt
        
        elif tag == 'rt':
            self.is_rt = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.is_img:
            self.is_img = False
        
        elif tag == 'rt':
            self.is_rt = False
        
        elif tag == 'div':
            self.div_cnt -= 1
            if self.div_cnt < self.div_cnt_article:
                self.div_cnt_article = 0

    def handle_data(self, data):
        if self.div_cnt_article > 0 and not self.is_rt:
            self.data += data

# Parser for newslisteven.html
# newslisteven.html is a file that results from saving 'Copy as HTML' of
# <ul class="newslisteven"> from http://www3.nhk.or.jp/news/easy/index.html
class NHKNewsListEvenParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.articles = []
        
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            #print attrs[0][1][2:]
            self.articles.append(attrs[0][1][2:])

def get_article(article_id):
    global url_base
    global track_nb
    fn_img_tmp = 'fn_img'
    fn_mp3_tmp = 'fn_mp3'

    url = url_base + article_id
    (filename, headers) = urllib.urlretrieve(url)

    parser = NHKArticleParser()
    article = open(filename).read()
    article_m = article.replace('"shorturl"content', '"shorturl" content')
    parser.feed(article_m)
    parser.close()

    img_url = parser.img_url
    if img_url[0: 4] != 'http':
        img_url = os.path.dirname(url) + '/' + img_url
    urllib.urlretrieve(img_url, fn_img_tmp)
    
    data = re.sub('\n +', '\n', parser.data).lstrip('\n')
    lines = data.split('\n')
    title = lines[1] + lines[0]

    # Make sure the title is a valid filename
    title = ''.join(ch for ch in title if ch not in '\/:*?<>|')
    print title
    
    data = ''
    for line in lines[0: 3]:
        data = data + line + '\n'
    for line in lines[3:]:
        if len(line) > 0:
            data = data + line + '\n'

    file = open(unicode(title + '.txt', 'utf8'), 'w')
    file.write(data)
    file.close()

    pos = article_id.find('html')
    url_mp3 = url_base + article_id[0: pos] + 'mp3'
    print url_mp3, '-', track_nb
    fn_mp3 = unicode(title + '.mp3', 'utf8')

    urllib.urlretrieve(url_mp3, fn_mp3_tmp)

    track_nb += 1
    os.rename(fn_mp3_tmp, fn_mp3)
    os.remove(fn_img_tmp)

# Instantiate the parser and feed it some .html
parser = NHKNewsListEvenParser()
parser.feed(open('newslisteven.html').read())
parser.close()
for article in reversed(parser.articles):
    get_article(article)
In the html-saving step, I followed the instructions for November 26 and saved the resulting code as "newslisteven.html" in the same folder "C:\Users\Jack\Desktop\python":
Code:
<ul class="newslisteven"><li><span class="newstitle"><a href="./k10013485691000/k10013485691000.html">「<ruby>黒人<rt>こくじん</rt></ruby>を<ruby>撃<rt>う</rt></ruby>った<ruby>警察官<rt>けいさつかん</rt></ruby>を<ruby>訴<rt>うった</rt></ruby>えない」アメリカ<ruby>中<rt>じゅう</rt></ruby>で<ruby>抗議<rt>こうぎ</rt></ruby></a></span><span class="date">[11月26日 17時00分]</span><span class="sound">音声</span><span class="movie">動画</span></li><li class="even"><span class="newstitle"><a href="./k10013485081000/k10013485081000.html"><ruby>中国<rt>ちゅうごく</rt></ruby>の<ruby>香港<rt>ほんこん</rt></ruby> <ruby>抗議<rt>こうぎ</rt></ruby>する<ruby>人<rt>ひと</rt></ruby>たちのバリケードを<ruby>片<rt>かた</rt></ruby>づける</a></span><span class="date">[11月26日 17時00分]</span><span class="sound">音声</span><span class="movie">動画</span></li><li><span class="newstitle"><a href="./k10013466411000/k10013466411000.html">タカタのエアバッグ <ruby>自動車会社<rt>じどうしゃがいしゃ</rt></ruby>は<ruby>早<rt>はや</rt></ruby>く<ruby>修理<rt>しゅうり</rt></ruby>して</a></span><span class="date">[11月26日 17時00分]</span><span class="sound">音声</span><span class="movie">動画</span></li><li class="even"><span class="newstitle"><a href="./k10013453111000/k10013453111000.html"><ruby>世界<rt>せかい</rt></ruby>の<ruby>人身売買<rt>じんしんばいばい</rt></ruby>のうち33%は<ruby>子<rt>こ</rt></ruby>ども</a></span><span class="date">[11月26日 11時30分]</span><span class="sound">音声</span><span class="movie">動画</span></li><li><span class="newstitle"><a href="./k10013454921000/k10013454921000.html"><ruby>鎌倉<rt>かまくら</rt></ruby>の<ruby>寺<rt>てら</rt></ruby>で<ruby>夜<rt>よる</rt></ruby>の<ruby>紅葉<rt>こうよう</rt></ruby>を<ruby>光<rt>ひかり</rt></ruby>で<ruby>照<rt>て</rt></ruby>らす</a></span><span class="date">[11月26日 11時30分]</span><span class="sound">音声</span><span class="movie">動画</span></li></ul>
And finally, in the python-script-executing step, I opened the command window and ran "python -u C:\Users\Jack\Desktop\python\dl.py" and also "python -u dl.py" without success. A small black DOS-like window opens and closes in a fraction of a second (not long enough for a screencap) and no mp3s are downloaded. I've also downloaded Python 2.7.8 for Windows and executed the script from there, but with the same results.

If anyone has any hints as to where I might be making a mistake, I'll be very much in your debt!


Automatic audio extraction from NHK News Easy? - balloonguy - 2014-11-30

In the same folder as dl.py and newslisteven.html, create a file called run.bat containing:

Code:
dl.py
Double clicking on run.bat should cause it to work.


Automatic audio extraction from NHK News Easy? - Inny Jan - 2014-11-30

@JackBS

What you are doing sounds ok. I followed the same steps as you described and also ran into some problems. I'm at work at the moment so I can't look at the issue in detail, but when I get home I will see what happens. Interestingly enough, right now I'm getting a timeout in line 79:
(filename, headers) = urllib.urlretrieve(url)

when url is:
http://www3.nhk.or.jp/news/easy/k10013454921000/k10013454921000.html

They might have implemented testing for "User Agent" (they were not doing that so far...) and if that's the case then the script will have to be updated. It's also possible that my problem is related to how our network is configured...
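
If the server does start checking the User-Agent, one possible update is to send an explicit header with each request. This is a Python 3 sketch, not something the script above does: build_request, fetch and the header string are hypothetical, and only urllib.request.Request and urllib.request.urlopen are standard-library calls.

```python
import urllib.request


def build_request(url, user_agent='Mozilla/5.0 (hypothetical example agent)'):
    """Return a Request object carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=30):
    """Download url and return the raw bytes; raises on network errors or timeout."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Passing a timeout also makes a stalled connection fail with an error instead of hanging, which helps tell a network problem apart from a rejected request.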

(BTW, are you executing "python -u dl.py" within the DOS box? I mean, you need to type that command on the command line and the DOS box should not disappear. Also, the current directory must be where the "dl.py" and "newslisteven.html" are located. balloonguy's "run.bat" can be helpful although if the issue is with the "User Agent" it's unlikely that it will fix your problem).

Edit:
Righty-oh, it's all good in my place. If balloonguy's "run.bat" has issues then you can always put "python dl.py" there.

Good luck.


Automatic audio extraction from NHK News Easy? - JackBS - 2014-12-08

Thank you for your help last week! I've been learning about the command window and Python scripts. (I was not following correctly the very last part of the instructions.) With the help of my son, I think I've gotten closer this week, but unfortunately I'm not there yet! I'll be thankful if anyone can find the error, but I'll continue trying to get this to work in any case.

The relevant files are in this directory:
C:\Users\Jack\Desktop\python\dl.py
C:\Users\Jack\Desktop\python\newslisteven.html
C:\Users\Jack\Desktop\python\run.bat
The contents of the files are exactly as posted above. We've tried "run.bat" containing "dl.py" and then another one containing "python dl.py".

Our attempts have been as follows:

1. According to the instructions...
Code:
C:\Users\Jack>
C:\Users\Jack>cd C:\Users\Jack\Desktop\python
C:\Users\Jack\Desktop\python>python -u dl.py
'python' is not recognized as an internal or external command, operable program or batch file.
2. Then using balloonguy's file...
Code:
C:\Users\Jack>
C:\Users\Jack>cd C:\Users\Jack\Desktop\python
C:\Users\Jack\Desktop\python>run.bat
C:\Users\Jack\Desktop\python>dl.py
'dl.py' is not recognized as an internal or external command, operable program or batch file.
3. Then using balloonguy's modified file...
Code:
C:\Users\Jack>
C:\Users\Jack>cd C:\Users\Jack\Desktop\python
C:\Users\Jack\Desktop\python>run.bat
C:\Users\Jack\Desktop\python>python dl.py
'python' is not recognized as an internal or external command, operable program or batch file.
4. Then we copy-pasted the three files into the Python34 folder (where the actual Python program is installed) and tried to 'turn on' Python via the command box and then tried to run the script...
Code:
C:\Users\Jack>
C:\Users\Jack>cd C:\Python34
C:\Python34>
C:\Python34>python -u dl.py
  File "dl.py", line 98
    print title
              ^
SyntaxError: Missing parentheses in call to 'print'
I think the last one was the closest, as now it seems to recognize python as being installed, but it appears to be saying there's a syntax error in the code, which seems unlikely because it's working correctly on your end!


Automatic audio extraction from NHK News Easy? - Inny Jan - 2014-12-08

Your problems can be summarised as:
1. python.exe is not on your PATH
2. Python compatibility issues

To solve problem 1. you can:
1. Modify your system wide PATH, or
2. Modify your run.bat file, so it reads:
Code:
PATH=%PATH%;C:\Python34
python dl.py
or
3. Modify your run.bat file, so it reads:
Code:
C:\Python34\python dl.py
Option 1 would, in general, be the best one. However, given your limited familiarity with the "command line" and such, I would go with option 2.
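
To check whether a command is found on PATH at all, a small sketch using the standard-library shutil.which (Python 3.3 and later) can help. find_on_path is a hypothetical helper name, and the check has to be run with an interpreter that already starts, for example C:\Python34\python.

```python
import shutil


def find_on_path(command):
    """Return the full path that would be run for `command`, or None if PATH has no match."""
    return shutil.which(command)


if __name__ == '__main__':
    location = find_on_path('python')
    if location is None:
        print("'python' is not on PATH; use the full path or extend PATH in run.bat")
    else:
        print('python found at', location)
```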

To solve problem 2. you can refer to this article ("Print Is A Function") and replace all print statements (like: print title) with print functions (print(title)) as described in the article. Alternatively, you could just remove the lines with print statements (they are there just to indicate the progress of the download).
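
For illustration, the two progress lines in the script would change roughly as follows. report_progress is a hypothetical wrapper so that the example is self-contained, and the values passed to it are placeholders. Adding `from __future__ import print_function` as the first statement of the file lets Python 2.6+ accept the same syntax.

```python
def report_progress(title, url_mp3, track_nb):
    """Print the same progress information as the script's two print statements."""
    # Python 2 statement: print title
    print(title)
    # Python 2 statement: print url_mp3, '-', track_nb
    print(url_mp3, '-', track_nb)
```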

After you've done the above you should be able to put and execute run.bat in any directory on your disk (even with your mouse if you wish :) ...).


Automatic audio extraction from NHK News Easy? - balloonguy - 2014-12-09

Actually, to solve problem 2 you need to install Python 2; you currently have Python 3. However, to make it easier, and I hope Inny Jan doesn't mind, I've turned the script into an executable so that you don't need to install Python or deal with the command prompt. It can be downloaded at http://dropcanvas.com/b5d3q. Once downloaded, place newslisteven.html in the same directory as dl.exe and just double-click dl.exe. The downside is that there is no indication of progress; the files just appear.
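
For anyone who would rather port the script than install Python 2: the print statements are only the first thing Python 3 rejects. The script also imports HTMLParser from its Python 2 module and calls urllib.urlretrieve and unicode(...), all of which moved or disappeared in Python 3. A sketch of version-tolerant imports follows; the to_text helper is hypothetical and the rest of the script would still need to be adapted.

```python
try:
    # Python 3 locations
    from html.parser import HTMLParser
    from urllib.request import urlretrieve
    text_type = str
except ImportError:
    # Python 2 locations, as used by the original script
    from HTMLParser import HTMLParser
    from urllib import urlretrieve
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def to_text(value, encoding='utf8'):
    """Return value as a text string, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```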


Automatic audio extraction from NHK News Easy? - JackBS - 2014-12-10

The exe file worked like a charm! (Before that I had managed to fix the print statements, but then some indentation errors started coming up!) Thank you very much to both of you!

For the sake of other learners, mostly those unfamiliar with coding, here is then the procedure pared down to its essentials:

1. Go to NHK Easy and go to the calendar section. Choose a day, highlight the title of the first article, right-click on it and select Inspect Element.
2. In the Inspector window, locate a nearby line called <ul class="newslisteven">, right-click on it and select Copy Outer HTML. Paste this code in a text editor and save it as "newslisteven.html".
3. Download balloonguy's exe version of Inny Jan's script, place both the html and the exe in the same directory, and double-click the exe file.

This will help immensely. Thank you again for your time and patience!


Automatic audio extraction from NHK News Easy? - Inny Jan - 2014-12-10

Great to hear it worked out for you!