Processing PDF file (want to extract Spanish vocabulary)

MeNoSavvy Member
Registered: 2008-05-24 Posts: 131

Hi All,
I've been getting a bit bored with Japanese recently, but I'm still chipping away at it most days. Anyway, in preparation for some travel to Spain (hopefully this year) and Central / South America (hopefully in future years), I've been studying a bit of Spanish. Well, I'm really just getting started. I'm a bit tired of too much SRS, so I'm adopting an alternative approach consisting of:
1. Listening to Pimsleur in the car, mainly to train my ears for Spanish
2. Practicing a bit of Assimil in the evenings
3. Studying some basic grammar

Now at some point I would like to start drilling some vocab. I have a PDF of Dorothy Richmond's book, Practice Makes Perfect: Spanish Vocabulary. In it she presents 10,000 vocabulary words along with the basics of Spanish grammar. Most of the vocabulary lists are presented in shaded boxes. A line by itself in a shaded box is typically the title of the vocabulary list (it is also set in a different font). Below it there are four columns: English word, corresponding Spanish word, English word, corresponding Spanish word.

What I would like to do is extract these lists into an Excel file or similar that I can import into an SRS. Does anyone have any knowledge or experience extracting information from a PDF file and can give me some hints or pointers?

If it is possible to detect when the shading begins and ends, and what font size is used, it will be relatively easy to extract the information.
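
To make the idea concrete, here is a rough Python sketch of the step after detection. It assumes the text of each shaded box has already been pulled out as rows of cells (that part is not shown and depends on the tool used). A one-cell row is treated as the list title, and a four-cell row as two English/Spanish pairs, as in the layout described above. The function names and the sample rows are made up for illustration; they are not taken from the book.

```python
import csv
import io


def rows_to_cards(rows):
    """Turn extracted box rows into (list_title, english, spanish) cards.

    Assumption: each row is a list of cell strings. A one-cell row is a
    list title; any other even-length row holds English/Spanish pairs.
    """
    cards = []
    title = ""
    for row in rows:
        cells = [cell.strip() for cell in row if cell.strip()]
        if len(cells) == 1:
            title = cells[0]
        elif len(cells) % 2 == 0:
            # Pair up cells: (english, spanish), (english, spanish), ...
            for i in range(0, len(cells), 2):
                cards.append((title, cells[i], cells[i + 1]))
        # Rows of any other shape are skipped rather than guessed at.
    return cards


def cards_to_csv(cards):
    """Serialize cards as CSV text that a spreadsheet or SRS can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["list", "english", "spanish"])
    writer.writerows(cards)
    return buffer.getvalue()


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ["Animals"],
    ["cat", "el gato", "dog", "el perro"],
    ["bird", "el pájaro", "horse", "el caballo"],
]
sample_cards = rows_to_cards(sample_rows)
print(cards_to_csv(sample_cards))
```

Keeping the list title as its own column means the cards can later be tagged or grouped by topic in the SRS.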

There are a lot of similarities between English and Spanish vocabulary, so hopefully I can absorb the vocab at a faster rate than Japanese vocabulary. Although learning all those irregular verb conjugations looks to be an obstacle! Ha ha

unauthorized Member
Registered: 2009-08-04 Posts: 64

I suppose it would be possible if you use Ghostscript, but unless you speak C++, that's not much of an option.
I don't think there are any existing utilities to achieve what you want. Your best bet is a crash course on Python and its HTML parsing library.

There are a number of ways to convert PDF to HTML which you can use to convert your ebook:
http://www.google.com/#hl=en&source … gle+Search
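
As a rough idea of what the Python side might look like once the PDF is converted: the sketch below uses `html.parser` from the standard library to collect each text run together with its font size. It assumes the converter writes text inside `<span>` tags with an inline `font-size` style, which varies from converter to converter, so check the real output first. The class name `FontRunParser` and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: font sizes appear inline, e.g. style="font-size:12px".
FONT_SIZE = re.compile(r"font-size:\s*([\d.]+)")


class FontRunParser(HTMLParser):
    """Collect (font_size, text) pairs from <span style="font-size:..."> runs."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self._sizes = []  # stack of font sizes for the currently open spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            match = FONT_SIZE.search(style)
            if match:
                size = float(match.group(1))
            else:
                # No size on this span: inherit from the enclosing span, if any.
                size = self._sizes[-1] if self._sizes else None
            self._sizes.append(size)

    def handle_endtag(self, tag):
        if tag == "span" and self._sizes:
            self._sizes.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._sizes:
            self.runs.append((self._sizes[-1], text))


# Hypothetical converter output, for illustration only.
sample_html = (
    '<div><span style="font-family:Helvetica; font-size:12px">Animals</span></div>'
    '<div><span style="font-size:9px">cat</span>'
    '<span style="font-size:9px">el gato</span></div>'
)
parser = FontRunParser()
parser.feed(sample_html)
parser.close()
print(parser.runs)
```

If the list titles really are set in a different font size, the size in each pair is enough to tell titles apart from vocabulary entries. Detecting the shaded boxes themselves would need whatever markup the converter uses for backgrounds, which this sketch does not attempt.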
