Ways to rip internet-data into a spreadsheet?

Reply #1 - 2012 March 29, 5:11 am
Mesqueeb Member
From: Japan Registered: 2008-10-14 Posts: 253 Website

Hello! I am working on a project and I need some information from a certain Wiki website.
What I need is one section of several (over 500) wiki pages, automatically copied to a spreadsheet.
By hand it takes me about 20 seconds per page x 500 = 10 000 seconds = about 167 minutes = almost 3 hours (loosely counted)

Is there any program which can do this? Or any other tricks?

Thanks!

Reply #2 - 2012 April 08, 5:17 am
Blahah Member
From: Cambridge, UK Registered: 2008-07-15 Posts: 715 Website

I would do this with Python and the BeautifulSoup package. It makes it quite easy to parse HTML content.
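
A rough sketch of what that could look like, not a finished tool. It assumes Python 3 with the beautifulsoup4 package installed, and it assumes the wanted section starts at an `<h2>` whose text is known and runs until the next `<h2>`; real wiki markup often differs, so the matching would need adjusting. The function names `extract_section` and `pages_to_csv` are made up for this example.

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package


def extract_section(html, heading_text):
    """Return the text between the <h2> titled heading_text and the next <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("h2"):
        if heading.get_text(strip=True) == heading_text:
            parts = []
            for sibling in heading.find_next_siblings():
                if sibling.name == "h2":
                    break  # reached the start of the next section
                parts.append(sibling.get_text(" ", strip=True))
            return "\n".join(parts)
    return ""  # heading not found on this page


def pages_to_csv(urls, heading_text, out_path):
    """Fetch each URL and write one (url, section text) row per page."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "section"])
        for url in urls:
            html = urlopen(url).read()
            writer.writerow([url, extract_section(html, heading_text)])
```

The resulting CSV file opens directly in any spreadsheet program. For 500 pages it would be polite to pause between requests.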

Reply #3 - 2012 May 22, 6:04 am
Mesqueeb Member
From: Japan Registered: 2008-10-14 Posts: 253 Website

Blahah wrote:

I would do this with Python and the BeautifulSoup package. It makes it quite easy to parse HTML content.

I think I managed to get it installed on my python 2.7
and when I go to terminal and follow this site: http://www.crummy.com/software/Beautifu … ation.html

upon entering "python" and then:

Code:

>>> from BeautifulSoup import BeautifulSoup
I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named BeautifulSoup

Does this mean I didn't successfully install Beautiful Soup?
I have Python 3 and 2.7 both installed (don't know why); maybe that is jamming it?

I installed it like this:

Code:

Mesqueebiator:~ Mesqueeb$ cd /Users/Mesqueeb/Downloads/beautifulsoup4-4.0.5
Mesqueebiator:beautifulsoup4-4.0.5 Mesqueeb$ python setup.py install
running install
running build
running build_py
running install_lib
running install_egg_info
Removing /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/beautifulsoup4-4.0.5-py2.7.egg-info
Writing /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/beautifulsoup4-4.0.5-py2.7.egg-info
 
Reply #4 - 2012 May 22, 7:59 am
netsplitter Member
From: Melbourne Registered: 2008-07-13 Posts: 183

Is that all the output you get? I'm not too familiar with Macs, but don't you need to do that as a privileged user (with sudo)?

Mesqueeb wrote:

I have Python 3 and 2.7 both installed (don't know why); maybe that is jamming it?

It's possible. It looks like the "python" command is using python 2.7 (since that's where it installed to), so it should be working. Check anyway:
When you first enter "python", the very first line at the top will tell you which version it's running.

Code:

Python 2.7.3 (default, Apr 24 2012, 00:00:54)

That should hopefully be 2.7.x
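
The same check can be done from inside Python. This is only a sketch: the directory is copied from the install log earlier in the thread, and `describe_interpreter` is a made-up helper name. It reports which interpreter is running and whether that interpreter searches the directory the package was installed into.

```python
import sys

# Directory taken from the install output posted above (an assumption for
# any other machine; adjust as needed).
INSTALL_DIR = (
    "/Library/Frameworks/Python.framework/Versions/2.7"
    "/lib/python2.7/site-packages"
)


def describe_interpreter(install_dir=INSTALL_DIR):
    """Summarise the running interpreter and whether it searches install_dir."""
    return {
        "version": sys.version.split()[0],    # e.g. "2.7.3"
        "executable": sys.executable,         # path of the running python
        "sees_install_dir": install_dir in sys.path,
    }


info = describe_interpreter()
print(info)
```

If `sees_install_dir` comes back False, the "python" command is probably a different installation from the one setup.py installed into.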

Reply #5 - 2012 May 22, 8:25 am
Mesqueeb Member
From: Japan Registered: 2008-10-14 Posts: 253 Website

netsplitter wrote:

Is that all the output you get? I'm not too familiar with Macs, but don't you need to do that as a privileged user (with sudo)?

It gave the same result

netsplitter wrote:

When you first enter "python", the very first line at the top will tell you which version it's running.

Code:

Python 2.7.3 (default, Apr 24 2012, 00:00:54)

That should hopefully be 2.7.x

It is python 2.7.3

But I'm completely new at python, and actually don't know how to use it. xD
I would like to copy and paste information from several pages on the web into one spreadsheet or text file.
Can you help me? ^^

Reply #6 - 2012 May 22, 8:28 am
vix86 Member
From: Tokyo Registered: 2010-01-19 Posts: 1469

It's:

Code:

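from bs4 import BeautifulSoup
# Pass the page's HTML as a string; "html.parser" selects Python's built-in parser
soup = BeautifulSoup("<p>whatever here</p>", "html.parser")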

Last edited by vix86 (2012 May 22, 8:29 am)

Reply #7 - 2012 May 22, 11:46 am
Blahah Member
From: Cambridge, UK Registered: 2008-07-15 Posts: 715 Website

Personally I use ActiveState python, and I have the 2.7 distribution installed.

Then, to install beautifulsoup, you type:

Code:

pypm install beautifulsoup

Then when you load a python shell, you type:

Code:

$ python
>>> from BeautifulSoup import *

Note that this is case sensitive, so it has to be BeautifulSoup not beautifulsoup.
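
The two import lines in this thread belong to different releases: the beautifulsoup4 package is imported as `bs4`, while the older Beautiful Soup 3 is imported as `BeautifulSoup`. A script that should work with whichever one is installed could try both, roughly like this (`SOUP_VERSION` is just a name made up for the example):

```python
try:
    # Beautiful Soup 4, installed from the beautifulsoup4 package
    from bs4 import BeautifulSoup
    SOUP_VERSION = 4
except ImportError:
    try:
        # Beautiful Soup 3; the module name has a capital B and a capital S
        from BeautifulSoup import BeautifulSoup
        SOUP_VERSION = 3
    except ImportError:
        # Neither release is visible to this interpreter
        BeautifulSoup = None
        SOUP_VERSION = None

print("Beautiful Soup major version found:", SOUP_VERSION)
```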
