Joined: Mar 2014
Posts: 81
Thanks: 0
Hi,
I've been trying to find a good program/tool for automatically storing an entire webpage into one text file, so I could quickly and easily find all the relevant words on that site by then using CB's text analysis tool. Any suggestions?
My main interest in this is going through profession-specific pages (physical therapy, to be precise) to find perhaps the 2,000-3,000 most commonly used terms/words, without having to read through entire sites page by page.
Joined: Mar 2008
Posts: 1,049
Thanks: 4
In Firefox there is an option to save a page as text. Or did you need something more complex than that?
When you really get down to it, HTML pages are just text anyway.
Joined: Mar 2014
Posts: 81
Thanks: 0
Thanks for the replies.
Stansfield123: that program looks interesting for sure. I'll check it out.
Zarxrax: the idea was to have some program or function that saves an entire website in one go, not just one page of it. Otherwise I'd have to spend a lot of time clicking every link on the site trying to find everything (unless there's an option in Firefox to do all of that automatically that I'm not aware of).
Joined: Mar 2008
Posts: 1,049
Thanks: 4
Hmmm, well actually there is a Firefox extension called DownThemAll which can probably do it. It basically follows links on a page and downloads them, using rules which you specify.
Joined: Apr 2008
Posts: 298
Thanks: 0
On Linux (or after installing Cygwin on Windows) you can just use wget to mirror the whole website. Then if you want all the .html files as one, you can just concatenate them together. Easy peasy.
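To illustrate the concatenation step: once wget (or any mirroring tool) has left you a folder full of .html files, a few lines of Python can glue them together. The folder and file names below are just placeholders — a sketch, not a polished tool:

```python
from pathlib import Path

def merge_html(mirror_dir: str, output: str) -> int:
    """Concatenate every .html file under mirror_dir into one file.

    Returns the number of pages merged. Undecodable bytes are
    replaced rather than crashing the merge.
    """
    pages = sorted(Path(mirror_dir).rglob("*.html"))
    with Path(output).open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return len(pages)
```

Then something like `merge_html("example.com", "all_pages.txt")` after a `wget --mirror` run, with the paths changed to wherever wget actually put things.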
Joined: Apr 2011
Posts: 1,085
Thanks: 15
I don't think CB's text analysis tool would fare all that well with HTML files. It needs plain text.
HTML isn't text. It's markup. Since it isn't that difficult to convert HTML into text (by getting rid of said markup), why not advise the OP to find a tool that does it, just like he was planning to?
The tricky thing here is gathering the pages he needs, not converting them to text. It seems like he doesn't just need all the contents of one website; he needs all sorts of pages with similar content. There are plenty of companies that parse and analyze online content in that manner, so maybe he will find something that does it automatically.
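As a sketch of that markup-stripping step: Python's standard html.parser can pull the visible text out of a page. A dedicated library would be more robust on messy real-world pages, but for word counting something this simple may be enough:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping
    <script> and <style> blocks, which are code rather than prose."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html: str) -> str:
    """Strip markup from an HTML string, returning plain text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(c.strip() for c in parser.chunks if c.strip())
```

For example, `html_to_text("<p>Hello <b>world</b></p>")` gives `"Hello world"` — plain text a word-frequency tool can digest.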
Joined: Mar 2014
Posts: 81
Thanks: 0
I tried the Firefox extensions now. ScrapBook seems to save entire pages, but I haven't figured out how to restrict it to just the HTML without saving images and so on (I'm a klutz when it comes to technical matters).
I then found some cmd commands to rename the html files to txt ("ren *.html *.txt") and then merge the txt files into one ("for %f in (*.txt) do type "%f" >> output.txt"). Then I used Notepad++ to open the file and convert it to UTF-8 (otherwise the kanji didn't show up properly).
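For what it's worth, the rename/merge/re-encode steps could also be done in one pass with a short Python script. The encoding list here is a guess (Shift-JIS seems plausible given the kanji problem) — adjust it to whatever the pages actually use:

```python
from pathlib import Path

# Candidate encodings to try, in order. "cp932" (Shift-JIS) is an
# assumption based on the kanji issue -- change the list as needed.
ENCODINGS = ("utf-8", "cp932")

def decode_page(raw: bytes) -> str:
    """Decode raw page bytes, trying each candidate encoding in turn."""
    for enc in ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing any undecodable bytes.
    return raw.decode("utf-8", errors="replace")

def merge_as_utf8(folder: str, output: str) -> None:
    """Merge every .html file in folder into one UTF-8 text file,
    so no rename or Notepad++ conversion step is needed."""
    with Path(output).open("w", encoding="utf-8") as out:
        for page in sorted(Path(folder).glob("*.html")):
            out.write(decode_page(page.read_bytes()))
            out.write("\n")
```

E.g. `merge_as_utf8("saved_pages", "output.txt")`, with the folder name swapped for wherever ScrapBook saved the files.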
The merged 22 MB txt file seems to crash the analysis tool for some reason, though ("Runtime error. The application has requested the Runtime to terminate it in an unusual way" etc.). Guess I'll keep trying some more.
Joined: Mar 2014
Posts: 81
Thanks: 0
There's only one possible reply to that... Hail to the king!
Now I just need to figure out that Firefox Scrapbook a little better (or something else) to make the website dumping process a bit smoother and without unnecessary junk.