
Extracting sentences from Coscom Essential Japanese Verbs

#1
My level is still pretty basic and I have found this CD from Coscom quite useful. I wanted to extract the information and put it into an Anki deck for revision (after initial learning from the CD). However, rather than just the basic sentences, I wanted to extract some additional information as well, such as whether the verb is transitive or intransitive, the number of the verb for sorting purposes, the dictionary form of the verb, etc.

There was a Perl script floating around that extracted some information, but I don't really know Perl and wasn't interested in learning it. I tried to ask the original author for some assistance, but he wasn't really interested. Anyway, since I've started learning Python, I wrote a basic script to extract the information.

Rather than use regular expressions I tried to use the built-in HTMLParser. The whole exercise turned out to be more complicated than I thought it would be. Some people told me not to use HTMLParser but to use a different parser (such as Beautiful Soup), but I wasn't convinced those other parsers would make the job easier. I had a lot of trouble with little irregularities in the formatting on the CD. Are there any Python gurus who can make suggestions as to how to better handle this task?
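For anyone unfamiliar with the state-machine style that HTMLParser forces on you, here is a minimal sketch using the modern standard-library module. The `exeng` class name comes from the CD's markup as mentioned later in the thread; the HTML fragment itself is made up for illustration:

```python
from html.parser import HTMLParser

# A tiny state-machine extractor: set a flag when we enter the target
# <td>, collect text in handle_data while the flag is set, clear it on
# the closing tag.
class SentenceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.sentences = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == 'td' and ('class', 'exeng') in attrs:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_target = False

    def handle_data(self, data):
        if self.in_target and data.strip():
            self.sentences.append(data.strip())

p = SentenceParser()
p.feed('<table><tr><td class="exeng">May I sit here?</td></tr></table>')
print(p.sentences)  # → ['May I sit here?']
```

With many different cell types, each one needs its own flag or state variable, which is how scripts like this grow so quickly.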

Anyway, I've pretty much got the damn thing to work, although the code is not very elegant, and it's pretty obvious I'm only a beginner at Python.

Here is a link to the code:
http://massmirror.com/e0def19d155227ecb7...2c90e.html

Maybe someone will find it useful.

The way to use it is to copy the CD to a directory on the hard disk, then run the Python script in that directory with something like: python jv19.py --output file.txt

Maybe you guys are far beyond this CD, but if anyone does find this useful, or has any problems getting it to work, let me know and I will try to help if I can.
#2
I thought the Perl script was easy to use. Well, here is the deck that gets created if you use my scripts:
ejv.zip (141MB)
ejv.z01 (150MB)
This is a split zip archive, so you need both files to extract the deck. It contains 6322 cards for 3161 facts, with full audio from the CD.

Here is the link to my script if anybody would like to improve on it:
EssentialJapaneseVerbsScripts.zip (1kb)
Edited: 2009-10-16, 1:10 pm
#3
I'm not a Python guru, but. Using an HTML parser rather than regular expressions was definitely the right approach -- doing HTML parsing 'by hand' is just too much effort. I hear good things about Beautiful Soup, though I haven't used it myself. I think it might have made your code easier to understand. The approach you have basically has lots of methods that say 'if this is a td tag then update a pile of state variables to keep track of where we are in the thing we're trying to parse, so that when we get the data callback we can store it to the right place'. The Soup approach parses everything up front and then you can just say things like "find the contents of the TD tag with an attribute of 'exeng'". I suspect this would be both shorter and clearer.
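The Soup-style lookup described here can be sketched in a couple of lines (this assumes Beautiful Soup 4 is installed; the markup fragment is made up, reusing the 'exeng' class name from the post):

```python
from bs4 import BeautifulSoup

# Parse everything up front, then query for the cell we want by its
# class attribute instead of tracking state in callbacks.
soup = BeautifulSoup(
    '<table><tr><td class="exeng">May I sit here?</td></tr></table>',
    'html.parser')
print(soup.find('td', class_='exeng').get_text())  # → May I sit here?
```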

This kind of script is typically a one-off job, though: now you have the data you're done, so there's not much point rewriting the script except as an exercise in learning about python or Beautiful Soup...
#4
pm215 Wrote:I'm not a Python guru, but. Using an HTML parser rather than regular expressions was definitely the right approach -- doing HTML parsing 'by hand' is just too much effort.
Code:
$ wc exm.pl essential.py
  72  283 3658 exm.pl
  44  107 1411 essential.py
 116  390 5069 total
Code:
$ wc jv19.py
  694  2782 30591 jv19.py
If you measure "effort" in the amount of code, then regular expressions were the right approach in this case. Well, there are a lot of comments in jv19.py, so that skews the result.
Edited: 2009-10-16, 2:07 pm
#5
xaarg: you're amazing!
#6
xaarg Wrote:This is a split zip archive, so you need both files to extract the deck. It contains 6322 cards for 3161 facts. It contains full audio from the CD.
Didn't find all the audio there? Only 1465 audio files even though all facts contain audio filenames. Hmmm.
#7
xaarg Wrote:
Code:
$ wc exm.pl essential.py
  72  283 3658 exm.pl
  44  107 1411 essential.py
 116  390 5069 total
Code:
$ wc jv19.py
  694  2782 30591 jv19.py
I'd love to play golf here but I don't have the input file to test a beautiful-soup approach with :-)
#8
brianobush Wrote:Didn't find all the audio there? Only 1465 audio files even though all facts contain audio filenames. Hmmm.
The deck got an mp3 file for every card. As Anki stores files by their hash, there can be no duplicates here.
Code:
$ ls ejv.media | wc -l
3161
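That hash-based storage can be illustrated with a small sketch (a simplified illustration, not Anki's actual code): files with identical contents map to the same key, so only one copy survives.

```python
import hashlib

def dedupe_by_hash(files):
    # files is a list of (name, bytes) pairs; files whose contents hash
    # to the same digest collapse into a single entry.
    unique = {}
    for name, data in files:
        digest = hashlib.sha1(data).hexdigest()
        unique.setdefault(digest, name)  # keep the first name seen per hash
    return unique

files = [('a.mp3', b'same audio'),
         ('b.mp3', b'same audio'),   # duplicate contents of a.mp3
         ('c.mp3', b'other')]
print(len(dedupe_by_hash(files)))  # → 2
```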
Checking the CD, there are even some mp3s (~800) that are not used in the deck:
Code:
$ find /mnt/EssentialVerbsCD -regex '.*\.mp3' | wc -l
3942
Here is the text output of the Perl script, if you would like to check which mp3 files are used for each card:
list.txt (1kb)
Google Spreadsheet

pm215 Wrote:I'd love to play golf here but I don't have the input file to test a beautiful-soup approach with :-)
The CD is quite small without all the mp3 files, so here you go:
ejvcd.zip (15 MB)
Edited: 2009-10-16, 6:04 pm
#9
Got it! I was on Linux and tried to repair the zip file, and ended up truncating it. Finally I stole my co-worker's PC to unzip it. 3161 mp3 files.
Thanks!
#10
brianobush Wrote:Got it! I was on Linux and tried to repair the zip file, and ended up truncating it. Finally I stole my co-worker's PC to unzip it. 3161 mp3 files.
Thanks!
I created the file on Linux, so that should not have been a problem if you used the right program and had both files (ejv.zip and ejv.z01) in the same folder. Well, I am glad it is working for you now.
Edited: 2009-10-16, 6:06 pm
#11
pm215 Wrote:I'd love to play golf here but I don't have the input file to test a beautiful-soup approach with :-)
Was this just a lie? I still see no beautiful-soup approach even though I uploaded the CD containing all input files except the audio files.
#12
xaarg Wrote:
pm215 Wrote:I'd love to play golf here but I don't have the input file to test a beautiful-soup approach with :-)
Was this just a lie? I still see no beautiful-soup approach even though I uploaded the CD containing all input files except the audio files.
Sorry, I wrote half a script (enough to convince myself that it would be much shorter than MeNoSavvy's script, but probably about the same length as your regex stuff), hit Python's stupid "won't output utf8 to console" brokenness, and moved on to something else. If you're interested I could finish it.
#13
pm215 Wrote:If you're interested I could finish it.
If it's not too much work for you.
#14
So, I finally got round to looking at this again this evening. Beautiful Soup Python script is here: http://www.chiark.greenend.org.uk/~pmayd...pmm.py.txt
38 lines. You could trivially knock about another 10 lines off if you didn't care about readability.

Output is the same as MeNoSavvy's script with a few minor exceptions where I think my script actually produces better output (trimming trailing spaces, not inserting spurious [] in caption text).
#15
Nice. A bit slow perhaps (the regex-using Perl script returns instantly), but really nice.

Can you add extraction of the references to 2001.Kanji.Odyssey (e.g. <div class="tankan2">私76,花68,割334</div>)?

Another nice thing would be to merge the kanji and kana output to get one compatible with the xfurigana plugin for Anki, e.g.
ここに 座っても いいですか。
and
ここに すわっても いいですか。
turns into
ここに 座[すわ]っても いいですか。

I guess splitting "gettext(ks, 'exjpk')" and "gettext(kks, 'exjpk')" at " ", "、", ":" and "。" (and keeping the split characters) would do. If both pieces match, just output them as is. If they don't match, remove matching characters from both ends, and then output the left-over kks part with the ks part inside [] and put back the removed characters.

I tried to do this, but I always get errors like "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 1: ordinal not in range(128)" if I use Japanese characters in a split pattern. Probably there is a way to tell Python that a regex contains Unicode characters.
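The merge described above could be sketched like this. This is a hypothetical helper (the name merge_furigana is invented, not from any of the posted scripts), written for modern Python 3 where strings are Unicode by default so the decode error doesn't arise; it assumes both sentences split into the same number of pieces:

```python
import re

# Split on the delimiters listed above, keeping them via the capture group.
DELIMS = re.compile(r'([ 、:。])')

def merge_furigana(kanji_sent, kana_sent):
    out = []
    for kj, kn in zip(DELIMS.split(kanji_sent), DELIMS.split(kana_sent)):
        if kj == kn:
            out.append(kj)          # identical piece: output as is
            continue
        # strip matching characters from both ends
        pre = 0
        while pre < min(len(kj), len(kn)) and kj[pre] == kn[pre]:
            pre += 1
        suf = 0
        while (suf < min(len(kj), len(kn)) - pre
               and kj[len(kj) - 1 - suf] == kn[len(kn) - 1 - suf]):
            suf += 1
        # left-over kanji part with the kana reading inside [],
        # with the stripped ends put back
        out.append(kj[:pre] + kj[pre:len(kj) - suf]
                   + '[' + kn[pre:len(kn) - suf] + ']'
                   + kj[len(kj) - suf:])
    return ''.join(out)

print(merge_furigana('ここに 座っても いいですか。',
                     'ここに すわっても いいですか。'))
# → ここに 座[すわ]っても いいですか。
```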
Edited: 2009-10-31, 11:20 pm
#16
xaarg Wrote:Nice. A bit slow perhaps (the regex-using Perl script returns instantly), but really nice.
Yes, I was expecting the regex approach to beat it for speed. I'm not planning to add any features to it, though -- I was only really doing this to play around with Beautiful Soup, not because I wanted the output :-) (Also, I realised last night that there was a bug in my script: it assumes the two glob.glob()s give the kana and kanji files in matching order, which isn't guaranteed. Either they both need a sort, or the kana filename should be generated from the kanji one.)
Quote:I tried to do this, but I always get errors like "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 1: ordinal not in range(128)" if I use Japanese characters in a split pattern. Probably there is a way to tell Python that a regex contains Unicode characters.
Code:
>>> print kana
ここに すわっても いいですか。
>>> print kana.split(' ')[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> print kana.split(u' ')[0]
ここに
#17
Awesome work pm215. I am impressed by your superior programming skills. After working a while on my script I realised the approach wasn't the best, but I had to weigh up starting again versus finishing with my existing approach. The setting of flags for the different types of information didn't really work as well as I thought, and I later had to use additional variables to deal with the situation where there might be a couple of English sentences instead of one, or where a "br" tag caused handle_data to return with only half the sentence.

The python approach you used seems quite clear.

There seems to be a lot of debate around which parser is best, and some others may be faster than Beautiful Soup:

http://blog.ianbicking.org/2008/03/30/py...rformance/

although Beautiful Soup seems to be by far the most popular.

I definitely need to learn more about parsing HTML, XML and the like.
#18
I also gave it another go.
http://rapidshare.com/files/302780688/xmlp.py

This version uses lxml, is pretty fast and finds 3181 sentences.

It also correctly parses the two "Group 3" verbs kuru and suru and correctly tags the verbs that are both transitive and intransitive. I also fixed three cases where the 2001KO refs were missing. They are still missing for all sentences of verb 224, but I was too lazy to add them manually.

It also merges the mixed kanji/kana field with the kana field to create a field hopefully suitable for the xfurigana plugin of Anki. I still haven't checked how it looks, but I soon will, when I create an updated version of the deck.
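For anyone curious what the lxml lookup looks like, here is a tiny sketch (assumes lxml is installed; the markup fragment is made up, reusing the 'exeng' class name mentioned earlier in the thread):

```python
from lxml import html

# Parse a fragment and pull out cells by class with an XPath query;
# lxml's C-backed parser is what makes this version fast.
page = html.fromstring(
    '<table><tr><td class="exjpk">ここに 座っても いいですか。</td>'
    '<td class="exeng">May I sit here?</td></tr></table>')
english = page.xpath('//td[@class="exeng"]/text()')
print(english[0])  # → May I sit here?
```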
Edited: 2009-11-05, 10:37 am