A program for storing an entire website in a textfile? - Printable Version

+- kanji koohii FORUM (http://forum.koohii.com)
+-- Forum: Learning Japanese (http://forum.koohii.com/forum-4.html)
+--- Forum: Learning resources (http://forum.koohii.com/forum-9.html)
+--- Thread: A program for storing an entire website in a textfile? (/thread-11775.html)
A program for storing an entire website in a textfile? - Termy - 2014-04-18

Hi, I've been trying to find a good program/tool for automatically storing an entire website in one text file, so I could quickly and easily find all the relevant words on that site using CB's text analysis tool. Any suggestions? My main interest is going through profession-specific pages (physical therapy, to be precise) to find perhaps the 2000-3000 most commonly used terms/words, without having to read through entire sites page by page.

A program for storing an entire website in a textfile? - Stansfield123 - 2014-04-18

I don't know much about the subject, but this sounds like something you could use: http://text-template-parser.software.informer.com/ It's the first one on this list of similar programs: http://softwaresolution.informer.com/Web-Page-Parsers/

A program for storing an entire website in a textfile? - Zarxrax - 2014-04-18

In Firefox there is an option to save a page as text. Or did you need something more complex than that? When you really get down to it, HTML pages are just text anyway.

A program for storing an entire website in a textfile? - Termy - 2014-04-19

Thanks for the replies. Stansfield123: that program looks interesting for sure. I'll check it out. Zarxrax: the idea was to have some program or function to save an entire website all in one go, not just one page of it. Otherwise I'd have to spend a lot of time clicking every link on the site trying to find everything (unless there's an option in Firefox to do all of that automatically that I'm not aware of).

A program for storing an entire website in a textfile? - Zarxrax - 2014-04-19

Hmm, actually there is a Firefox extension called DownThemAll which can probably do it. It can follow links on a page and download them, using rules which you specify.

A program for storing an entire website in a textfile?
- NightSky - 2014-04-19

On Linux (or after installing Cygwin on Windows) you can just use wget to mirror the whole website. Then, if you want all the .html files as one, you can just concatenate them together. Easy peasy.

A program for storing an entire website in a textfile? - Stansfield123 - 2014-04-19

I don't think CB's text analysis tool would fare all that well with HTML files. It needs plain text. HTML isn't text, it's markup. Since it isn't that difficult to convert HTML into text (by getting rid of said markup), why not advise the OP to find a tool that does it, just like he was planning to? The tricky thing here is parsing the pages he needs, not converting them to text. It seems like he doesn't just need all the contents of one website; he needs all sorts of pages with similar content. There are plenty of companies that parse and analyze online content in that manner, so maybe he will find something that does it automatically.

A program for storing an entire website in a textfile? - Termy - 2014-04-19

I tried the Firefox extensions now. That ScrapBook seems to save entire pages, but I haven't figured out how to restrict it to just the HTML and not save images and so on (I'm a klutz when it comes to technical matters). I then found some cmd commands to rename the html files to txt files ("ren *.html *.txt") and then merge the txt files into one ("for %f in (*.txt) do type "%f" >> output.txt"). Then I used Notepad++ to open the file and convert it to UTF-8 (otherwise the kanji didn't show up properly). The merged 22 MB txt file seems to crash the analysis tool for some reason though ("Runtime error. The application has requested the Runtime to terminate it in an unusual way", etc). Guess I'll keep trying some more.

A program for storing an entire website in a textfile? - cb4960 - 2014-04-19

Stansfield123 Wrote: I don't think CB's text analysis tool would fare all that well with html files. It needs plain text.

It should work fine with html files.
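NightSky's wget-and-concatenate approach can be sketched in a couple of shell commands. This is a minimal sketch, not a definitive recipe: https://example.com stands in for the target site, and `site/` and `all_pages.html` are hypothetical names.

```shell
# Mirror the site, skipping images and other non-HTML assets
# (https://example.com is a placeholder for the real site).
wget --mirror --no-parent \
     --reject "jpg,jpeg,png,gif,css,js,pdf" \
     --directory-prefix=site \
     https://example.com/

# Concatenate every downloaded page into one file that the
# analysis tool can be pointed at.
find site -name "*.html" -exec cat {} + > all_pages.html
```

The `--reject` list keeps the download to HTML only, which is roughly what ScrapBook was being asked to do.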
Termy Wrote: I tried the Firefox extensions now. That scrapbook seems to save entire pages, but haven't figured out how to restrict it to just the html and not saving images and so on (I'm a klutz when it comes to technical matters).

No need to merge the files. Just enter the root directory that contains the html files; html files in sub-directories will also be analyzed. No need to rename the files to have a .txt extension either. In the settings.txt file for the tool, modify the final line from this:

extensions = txt

to this:

extensions = txt;html

A program for storing an entire website in a textfile? - Termy - 2014-04-19

There's only one possible reply to that... Hail to the king! Now I just need to figure out that Firefox ScrapBook a little better (or something else) to make the website dumping process a bit smoother and without unnecessary junk.
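If you do want plain text rather than raw HTML, as Stansfield123 suggested, the crudest conversion is simply deleting the tags with sed. A sketch, assuming a saved page named page.html (hypothetical filename); note this is not a real HTML parser and can mangle pages with `>` inside attributes, but it is usually good enough for word-frequency counting:

```shell
# Crude HTML-to-text conversion: delete everything between < and >.
# Good enough for frequency analysis, not for faithful rendering.
sed -e 's/<[^>]*>//g' page.html > page.txt
```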