
Ways to rip internet-data into a spreadsheet?

#1
Hello! I am working on a project and I need some information from a certain Wiki website.
What I need is one section of several (over 500) wiki pages, automatically copied to a spreadsheet.
By hand it takes me about 20 seconds per page x 500 = 10,000 seconds ≈ 167 minutes ≈ 3 hours (loosely counted).

Is there any program which can do this? Or any other tricks?

Thanks!
#2
I would do this with Python and the BeautifulSoup package. It makes it quite easy to parse HTML content.
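As a rough sketch of what that could look like: this assumes Python 3 and Beautiful Soup 4, and it assumes the section you want sits under an <h2> heading with the rest of the section as sibling elements after it. The heading name "Readings", the sample HTML and the helper names extract_section and write_rows are all made up for illustration, so the real wiki's markup may need a different lookup.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_section(html, heading_text):
    """Return the text under the first <h2> whose text equals heading_text."""
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("h2"):
        if heading.get_text(strip=True) == heading_text:
            parts = []
            # Collect following sibling tags until the next <h2> starts a new section.
            for sibling in heading.find_next_siblings():
                if sibling.name == "h2":
                    break
                parts.append(sibling.get_text(" ", strip=True))
            return " ".join(p for p in parts if p)
    return ""  # heading not found on this page


def write_rows(rows, out_file):
    """Write (page_name, section_text) pairs as CSV, which a spreadsheet can open."""
    writer = csv.writer(out_file)
    writer.writerow(["page", "section"])
    writer.writerows(rows)


# Made-up HTML standing in for one downloaded wiki page (assumption).
sample_html = """
<h1>Example page</h1>
<h2>Readings</h2>
<p>First paragraph.</p>
<ul><li>Item one</li><li>Item two</li></ul>
<h2>Notes</h2>
<p>Not wanted.</p>
"""

buffer = io.StringIO()
write_rows([("Example page", extract_section(sample_html, "Readings"))], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For the real pages you would loop over the 500 URLs, download each one (for example with urllib.request.urlopen(url).read()), pass the HTML to extract_section, and write to a file opened with open("out.csv", "w", newline="") instead of the in-memory buffer.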
#3
Blahah Wrote:I would do this with Python and the BeautifulSoup package. It makes it quite easy to parse HTML content.
I think I managed to get it installed on my Python 2.7,
and when I go to the terminal and follow this site: http://www.crummy.com/software/Beautiful...ation.html

upon entering "python" and then:
Code:
>>> from BeautifulSoup import BeautifulSoup
I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named BeautifulSoup
Does this mean I didn't successfully install Beautiful Soup?
I have python 3 and 2.7 both installed (don't know why) maybe that is jamming it?

I installed it like this:
Code:
Mesqueebiator:~ Mesqueeb$ cd /Users/Mesqueeb/Downloads/beautifulsoup4-4.0.5
Mesqueebiator:beautifulsoup4-4.0.5 Mesqueeb$ python setup.py install
running install
running build
running build_py
running install_lib
running install_egg_info
Removing /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/beautifulsoup4-4.0.5-py2.7.egg-info
Writing /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/beautifulsoup4-4.0.5-py2.7.egg-info
#4
Is that all the output you get? I'm not too familiar with Macs, but don't you need to do that as a privileged user (with sudo)?

Mesqueeb Wrote:I have python 3 and 2.7 both installed (don't know why) maybe that is jamming it?
It's possible. It looks like the "python" command is using python 2.7 (since that's where it installed to), so it should be working. Check anyway:
When you first enter "python", the very first line at the top will tell you which version it's running.
Code:
Python 2.7.3 (default, Apr 24 2012, 00:00:54)
That should hopefully be 2.7.x
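If you want to check from inside Python instead, something like this should show which interpreter is running and where it looks for packages. It uses only the standard sys module; the variable name version_line is just for illustration. The site-packages folder from your install log should appear in the printed search path.

```python
import sys

# Path of the interpreter that is actually running this code.
print(sys.executable)

# First line of the version string, e.g. the "Python 2.7.x" part of the banner.
version_line = sys.version.split("\n")[0]
print(version_line)

# Directories searched on import; an installed package must live in one of them.
for path in sys.path:
    print(path)
```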
#5
netsplitter Wrote:Is that all the output you get? I'm not too familiar with Macs, but don't you need to do that as a privileged user (with sudo)?
It gave the same result

netsplitter Wrote:When you first enter "python", the very first line at the top will tell you which version it's running.
Code:
Python 2.7.3 (default, Apr 24 2012, 00:00:54)
That should hopefully be 2.7.x
It is python 2.7.3

But I'm completely new at python, and actually don't know how to use it. xD
I would like to copy paste information of several pages on the web into one spreadsheet or text file.
Can you help me? ^^
#6
It's this for beautifulsoup4:
Code:
from bs4 import BeautifulSoup
html = "<p>whatever here</p>"  # the page's HTML as a string
soup = BeautifulSoup(html, "html.parser")
Edited: 2012-05-22, 8:29 am
#7
Personally I use ActiveState python, and I have the 2.7 distribution installed.

Then, to install beautifulsoup, you type:
Code:
pypm install beautifulsoup
Then when you load a python shell, you type:
Code:
python
from BeautifulSoup import *
Note that this is case sensitive, so it has to be BeautifulSoup not beautifulsoup.
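The module name also depends on which release is installed: Beautiful Soup 3 is imported from the BeautifulSoup module, while Beautiful Soup 4 (the beautifulsoup4 package) is imported from bs4. A small sketch that tries both, in case you are not sure which one you ended up with (the variable name soup_version is made up for illustration):

```python
try:
    # Beautiful Soup 4, installed as the beautifulsoup4 package.
    from bs4 import BeautifulSoup
    soup_version = 4
except ImportError:
    try:
        # Beautiful Soup 3, the older release with the case-sensitive module name.
        from BeautifulSoup import BeautifulSoup
        soup_version = 3
    except ImportError:
        # Neither release is visible to this interpreter.
        BeautifulSoup = None
        soup_version = None

print("Beautiful Soup major version found:", soup_version)
```

If this prints None, the package was most likely installed for a different Python than the one you are running.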