My level is still pretty basic, and I have found this CD from coscom quite useful. I wanted to extract the information and put it into an Anki deck for revision (after initially learning from the CD). However, rather than just the basic sentences, I wanted to extract some additional information as well, such as whether the verb is transitive or intransitive, the number of the verb for sorting purposes, the dictionary form of the verb, etc.
There was a Perl script floating around that extracted some information, but I don't really know Perl and wasn't interested in learning it. I asked the original author for some assistance, but he wasn't really interested. Anyway, since I've started learning Python, I wrote a basic Python script to extract the information.
Rather than use regular expressions, I tried to use the built-in HTMLParser. The whole exercise turned out to be more complicated than I thought it would be. Some people told me not to use HTMLParser but a different parser (such as Beautiful Soup), but I wasn't convinced those other parsers would make the job easier. I had a lot of trouble with little irregularities in the formatting on the CD. Are there any Python gurus who can suggest how to handle this task better?
Anyway, I've pretty much got the damn thing to work, although the code is not very elegant and it's pretty obvious I'm only a beginner at Python.
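To give an idea of the general HTMLParser approach, here is a minimal sketch. It is not the code from my script. It assumes Python 3 (where the module is html.parser) and assumes the data sits in ordinary table rows and cells, which may not match the real pages on the CD. The names CellExtractor and extract_rows are made up for illustration. The one thing it tries to show is how to survive small irregularities such as a missing closing tag.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    Hypothetical helper: the real CD pages may use different tags.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()        # tolerate a missing </tr>
            self._row = []
        elif tag == "td":
            self._close_cell()       # tolerate a missing </td>
            if self._row is None:    # tolerate a <td> with no <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()            # flush a row left open at end of input

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace and trim the ends.
            text = " ".join("".join(self._cell).split())
            self._row.append(text)
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling extract_rows() on a page's text gives back one list per table row, so the extra fields (verb number, dictionary form and so on) can then be picked out by position, assuming the columns are in a consistent order.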
Here is a link to the code:
http://massmirror.com/e0def19d155227ecb7...2c90e.html
Maybe someone will find it useful.
The way to use it is to copy the CD to a directory on the hard disk, then run the Python script in that directory using something like: python jv19.py --output file.txt
Maybe you guys are far beyond this CD, but if anyone does find this useful, or has any problems getting it to work, let me know and I will try to help if I can.
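For reference, a command line like that could be wired up roughly as follows. This is only a sketch under assumptions, not what jv19.py actually does: the --source option, the function names, and the placeholder rows are all made up for illustration. It writes tab-separated text, which is a format Anki can import.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --source is a hypothetical extra option."""
    parser = argparse.ArgumentParser(
        description="Extract data from the copied CD into a text file.")
    parser.add_argument("--output", default="file.txt",
                        help="tab-separated file to write")
    parser.add_argument("--source", default=".",
                        help="directory the CD was copied to")
    return parser.parse_args(argv)


def find_html_files(source):
    """Return every .htm/.html file under source, in sorted order."""
    return sorted(p for p in Path(source).rglob("*")
                  if p.is_file() and p.suffix.lower() in {".htm", ".html"})


def write_rows(rows, out):
    """Write rows as tab-separated lines, one row (card) per line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real script would parse each file into fields here.
    rows = [[path.stem, str(path)] for path in find_html_files(args.source)]
    with open(args.output, "w", encoding="utf-8", newline="") as out:
        write_rows(rows, out)
```

To run it as a script, call main() under the usual if __name__ == "__main__": guard at the bottom of the file.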
