dictscrape: library/anki plugin for semi-automatic card creation

Reply #1 - 2012 June 23, 7:05 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

Dictscrape is a library and an Anki plugin.  The library is basically a webscraper for Yahoo's online Japanese dictionaries.  The Anki plugin uses the library for semi-automatic card creation.  You can create new cards with definitions and example sentences from Yahoo's dictionaries in just a few clicks.

https://github.com/cdepillabout/dict-scrape

Here is a screenshot showing the Anki plugin.  You can see the definition/example sentence selector in action.

https://github.com/cdepillabout/dict-scrape/raw/master/screenshots/dictscrape.png

Many words still do not get parsed correctly, but if you want to try out the program, try looking up these words:

バリカン バリカン
うなる 唸る
きょうはく 強迫
なりすます 成り済ます
とかげ 蜥蜴
いく 行く
あかし 赤し
おもしろい 面白い
らくだ 駱駝

Current Status -- alpha01

Everything is still in the alpha stages.  Currently, if you want to try it out and you're not a developer, you're out of luck.  If you have experience with the command line, then check out the README.md in the git repo.

Development Help

I would really appreciate help from any other developers interested in this project.

There are a couple places that are specifically lacking.

1) The GUI.  This was my first PyQt project, so some of the code is kind of funky.  I would appreciate suggestions/pull requests fixing any of the stupid things I did.  The GUI code is in scraper.py and scraper_gui/.  Unfortunately, I haven't gotten around to adding docstrings to the GUI code, so if you have any questions while you're reading/hacking the code please feel free to ask.

Everything works for the most part, but I'm not sure of the correct way to handle the card creation dialogs.  After you pick the parts of the definition and example sentences you want to use, the selector window closes and the next window pops up.  It lets you rearrange the order of the definitions and pick the example sentence to use on your main card.  When you click 'Okay', that window closes and another window pops up...  What's the best way to do this?  Basically I want to make a "Wizard"-like system, where the same window is used and you just have to keep clicking "Next" to go to the next screen.  Should I use the Wizard dialog that comes with Qt?  Do I need to use stacked widgets or something?  I haven't done any research in this area so I really have no idea.

2) The HTML/CSS.  As you can see in the screenshot, the definition/sentence selector screen uses QWebView widgets to display the definitions from the dictionaries.  The HTML/CSS for these widgets is located at the top of scraper_gui/ui/defwebviewui.py.  I'm not much of a designer, so I just threw the design together (適当に), but I'm sure someone with talent could make it look a lot nicer.  I would really appreciate any pull requests that fix the design (I guess mostly the CSS).

3) The big hurdle left to conquer is parsing the definitions from Yahoo's dictionaries.  There is so much crap in there and it seems like nearly every entry is formatted differently.  Parsing the definitions is currently being done in the parse_definitions() function in dictscrape/dictionaries/yahoo/*.py.

I would really appreciate any suggestions/pull requests with regards to parsing.  Be sure to check out the section on testing in the README.md. 
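On the wizard question in item 1): Qt does ship a QWizard class that keeps everything in one window and supplies the Back/Next/Finish buttons. A QStackedWidget with your own "Next" button would also work, but you would have to wire up the buttons yourself. Below is a minimal sketch, not the plugin's actual code. It assumes PyQt5 (in PyQt4 the same classes live in PyQt4.QtGui), and the page titles and the build_wizard name are made up for illustration.

```python
# Titles for the two steps described in item 1); the wording is illustrative.
PAGE_TITLES = [
    "Select definitions and example sentences",
    "Reorder definitions and pick the main sentence",
]


def build_wizard(parent=None):
    """Build a QWizard with one placeholder page per title.

    PyQt5 (an assumption) is imported lazily, so this module can be
    imported even on a machine without Qt installed.
    """
    from PyQt5.QtWidgets import QLabel, QVBoxLayout, QWizard, QWizardPage

    wizard = QWizard(parent)
    wizard.setWindowTitle("Create cards")
    for title in PAGE_TITLES:
        page = QWizardPage()
        page.setTitle(title)
        # The real selector/reorder widgets would go in this layout.
        layout = QVBoxLayout(page)
        layout.addWidget(QLabel("Placeholder for the real widgets."))
        wizard.addPage(page)
    return wizard
```

To try it, create a QApplication first and then call build_wizard().exec_().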

For non-developers, it would be awesome if you could let me know any words that don't get parsed correctly.  Ideally it would be nice if you could create an Issue on github. If you don't have a github account just post here (although I can't promise I'll be able to fix the parsing for your word).

Keep in mind I try to take out a lot of the "useless" information from the definition, like the verb conjugation, 補説 (supplementary note) parts, etc.
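To give an idea of what that kind of cleanup can look like, here is a rough sketch. It is not the real parse_definitions() code. It assumes a note section starts at a "［補説］" marker and runs to the end of the line, which may not match Yahoo's actual markup, and the sample string is made up.

```python
import re

# Assumption: a note section begins at "［補説］" and continues to the end
# of the line. The real dictionary markup may be structured differently.
NOTE_PATTERN = re.compile(r"\s*［補説］.*$", re.MULTILINE)


def strip_notes(definition_text):
    """Return the definition with any assumed 補説 note sections removed."""
    return NOTE_PATTERN.sub("", definition_text).strip()


# Made-up sample text, for illustration only.
sample = "獣などが低く力の入った声を出す。［補説］「呻る」とも書く。"
cleaned = strip_notes(sample)
```

A regex like this only works when the marker is reliable; entries that format their notes differently would each need their own rule.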

Reply #2 - 2012 June 23, 8:16 am
Tori-kun このやろう
Registered: 2010-08-27 Posts: 1193 Website

@partner55083777: Wow, I had been waiting for someone to come up with this!! Great plugin, but unfortunately I cannot code/help... I'm getting tired adding definitions/example sentences from Eijiro/Kenkyusha (I find Kenkyusha has lots of great example sentences illustrating the correct usage) manually for all the cards I have in my Rikai-chan deck (about 5000 cards).

Reply #3 - 2012 June 23, 8:20 am
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

I'm not a developer, but I'm eager to try this out. This is an awesome idea.

Reply #4 - 2012 June 23, 9:04 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

rich_f wrote:

I'm not a developer, but I'm eager to try this out. This is an awesome idea.

Thanks.  I just added some cards with it and it's so fast.  The generated cards are really high quality.

You'll have to wait a little bit until it is released properly.  If someone is able to get this set up and running on Windows or OSX, I would really appreciate it if you could write up some simple directions.

Also, I haven't tested it on anything but Arch Linux, so if you try it out on another OS it may not work.  YMMV.

Last edited by partner55083777 (2012 June 23, 9:16 am)

Reply #5 - 2012 June 23, 9:32 am
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Trying now to get it running on Windows, without much luck. I can't create the symlinks, even with Admin privileges. Win7 says file already exists. I'll have to do it another way.

Copying the code into the plugins directory causes multiple errors.

Also, which version of Python did you create this under? 2.x or 3?

Reply #6 - 2012 June 23, 10:14 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

rich_f wrote:

Trying now to get it running on Windows, without much luck. I can't create the symlinks, even with Admin privileges. Win7 says file already exists. I'll have to do it another way.

Copying the code into the plugins directory causes multiple errors.

I don't know much about Windows, but if the errors are python stack traces you could try pasting them here and I could try to help you debug them.

rich_f wrote:

Also, which version of Python did you create this under? 2.x or 3?

2.7.3

Last edited by partner55083777 (2012 June 23, 10:15 am)

Reply #7 - 2012 June 24, 6:26 am
Inny Jan Member
From: Cichy Kącik Registered: 2010-03-09 Posts: 720

rich_f wrote:

I can't create the symlinks, even with Admin privileges.

I haven't seen the code, but on Windows 7 you probably need to resort to using mklink. An API for this should also exist, but maybe it's not available in Python yet?

Here is the first article that popped up in my Google search that talks about junctions and hard links on Windows 7:
http://ipggi.wordpress.com/2009/09/07/w … ard-links/
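For reference, os.symlink gained Windows support in Python 3.2; as far as I know it is not available on Windows under Python 2.7. A small sketch, written for Python 3, that tries to create the link and reports failure instead of raising (the try_symlink name is made up):

```python
import os


def try_symlink(src, dst):
    """Try to create dst as a symlink to src; return True on success."""
    try:
        os.symlink(src, dst, target_is_directory=os.path.isdir(src))
    except (OSError, NotImplementedError):
        # Typical causes: dst already exists, or Windows refused because
        # the process lacks the privilege to create symlinks.
        return False
    return True
```

Note that a "file already exists" error means the destination path is already taken, so it would have to be removed or renamed before any link could be created there.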

Reply #8 - 2012 June 24, 9:10 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

I guess you don't really need to use symlinks unless you plan on doing development.

From the root of the source code directory, copy "scraper.py" and the "scraper_gui/" directory into your anki plugins directory.  Then copy the "dictscrape/" directory into the "scraper_gui/" directory.  It might work...?
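The same copy steps as a small script, in case that is easier than doing it by hand. This is only a sketch written for Python 3; install_plugin is a made-up name, and both directory paths are whatever they happen to be on your machine.

```python
import os
import shutil


def install_plugin(source_root, plugins_dir):
    """Copy the plugin files into plugins_dir using the layout above.

    plugins_dir must already exist; copytree raises an error if
    scraper_gui/ is already present there.
    """
    shutil.copy2(os.path.join(source_root, "scraper.py"), plugins_dir)
    gui_dst = os.path.join(plugins_dir, "scraper_gui")
    shutil.copytree(os.path.join(source_root, "scraper_gui"), gui_dst)
    # dictscrape/ goes inside scraper_gui/, not next to it.
    shutil.copytree(os.path.join(source_root, "dictscrape"),
                    os.path.join(gui_dst, "dictscrape"))
    return gui_dst
```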

Reply #9 - 2012 June 24, 11:28 am
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Here's the error message I get:

An error occurred in a plugin. Please contact the plugin author.
Please do not file a bug report with Anki.

Traceback (most recent call last):
  File "C:\cygwin\home\dae\Home\anki\win\build\pyi.win32\anki\outPYZ1.pyz/ankiqt.ui.main", line 2679, in loadPlugins
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\scraper.py", line 22, in <module>
    from scraper_gui.selector import MainWindowSelector
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\scraper_gui\selector.py", line 24, in <module>
    from dictscrape import DaijirinDictionary, DaijisenDictionary, \
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\__init__.py", line 28, in <module>
    from .dictionaries import Dictionary, YahooDictionary, \
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\dictionaries\__init__.py", line 23, in <module>
    from .dictionary import Dictionary
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\dictionaries\dictionary.py", line 21, in <module>
    from lxml import etree
  File "c:\pyi\iu.py", line 458, in importHook
ImportError: No module named lxml
Traceback (most recent call last):
  File "C:\cygwin\home\dae\Home\anki\win\build\pyi.win32\anki\outPYZ1.pyz/ankiqt.ui.main", line 2679, in loadPlugins
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\XXXX\AppData\Roaming\.anki\plugins\test.py", line 28, in <module>
    import argparse
  File "c:\pyi\iu.py", line 458, in importHook
ImportError: No module named argparse

Last edited by rich_f (2012 June 24, 11:29 am)

Reply #10 - 2012 June 24, 11:35 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

rich_f wrote:

Here's the error message I get:

This looks like 2 separate errors.

1) Don't put test.py in the plugins directory.  Just scraper.py and scraper_gui/.  (Don't forget to put dictscrape/ inside scraper_gui/.  You'll also have to set up models in your deck as described in the README.md file.)

2) In order to run this plugin, you need lxml and the python bindings for it installed.  I'm not sure where you can get this for windows.  Some quick googling brought me to this page:

http://lxml.de/index.html#download

Reply #11 - 2012 June 25, 1:05 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Okay, got it. I'm out of town for the next few days, but I'll give it a shot on Friday.

Reply #12 - 2012 June 25, 5:42 pm
nohika M.O.D.
From: America Registered: 2010-06-13 Posts: 384

This looks like this could be an awesome plug-in...I know absolutely nothing about coding/stuff like that, though, unfortunately.

It could also be worth it to work on it through Anki 2.0's beta since that should be coming out pretty soon...

Reply #13 - 2012 June 25, 10:00 pm
Sebastian Member
Registered: 2008-09-09 Posts: 582

Is there any possibility that in the future you add the capability to work with EPWing dictionaries?

Would that be harder, easier or the same difficulty compared to accessing online dictionaries?

Reply #14 - 2012 June 26, 10:25 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

nohika wrote:

It could also be worth it to work on it through Anki 2.0's beta since that should be coming out pretty soon...

Yeah, I plan on doing this as soon as Anki 2 goes out of beta.  The anki-related code in my project is currently very small.  It's probably less than 5%.  It should be pretty easy to port to Anki 2.

Reply #15 - 2012 June 26, 10:38 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

Sebastian wrote:

Is there any possibility that in the future you add the capability to work with EPWing dictionaries?

Would that be harder, easier or the same difficulty compared to accessing online dictionaries?

I also plan on doing this in the future.  Unfortunately it's not exactly top priority.

Here's what my roadmap looks like:

0) Continue working on getting the parsing to work correctly.  It's slowly coming along.  Right now, I'd say it's completely working for about 50% of the entries, and it's usable for about 90% of the entries.

1) Get preferences set up so that you can control how the fields are laid out in your model.  This way you don't have to have your models set up exactly like I do.  I basically just plan on stealing the preferences code from Yomichan.  This should take a while, but I don't expect it to be particularly difficult.

After getting this working, I plan on releasing the plugin in beta.  I'm still not sure what to do about the external lxml dependency.  Has anyone gotten the plugin working on Windows?

2) Add in support for weblio.jp's example sentences, 三省堂's really simple dictionary, and edict.  These three should be relatively simple to add support for.

3) Add in support for epwing dictionaries.  First I plan on adding 研究社 新和英大辞典 第5版, and then after that maybe 広辞苑 or 明鏡国語辞典 or 新明解国語辞典. 

I haven't played around with epwing libraries extensively, so it's hard to say how difficult this will be.  I imagine it will be as difficult as the Yahoo dictionaries, if not more so.

Reply #16 - 2012 June 30, 8:33 am
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Sorry, can't get lxml installed on my machine. I've tried a number of things, but it just won't work. easy_install gives me errors for some vcvsomethingorother file that doesn't exist.

I tried using a windows installer I found somewhere, but it said it couldn't find python in my registry, which is aggravating, because python 2.7.3 runs just fine. Not really sure what that one wants.

Let me know when you have a new version out that has the necessary libraries built in, and I'll be happy to test it.

Reply #17 - 2012 July 07, 9:55 am
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Okay, I managed to get past the vcvarsall.bat problem by installing MS Visual Studio Express 2008 (C++). Now I'm having problems getting libxslt and ... forgot the other one... installed and recognized by easy_install. -_-

I'll probably get it up and running by the time you release the plugin.

Reply #18 - 2012 July 07, 3:01 pm
shinsen Member
Registered: 2009-02-18 Posts: 181

It's a hot day and my brain is being throttled so even though I read the readme on github I still don't understand what exactly this program is supposed to be doing. I gathered that it scrapes online dictionaries and somehow "semi-automates" and "aids in the process of" the creation of Anki cards with definitions and example sentences but it sounds really vague. Like vague enough to not make sense on a hot day without airconditioning. The screenshot shows some tabs and text fields but I don't understand how exactly they are being used. Do you select one field and get a card made from it?

Also, just a thought, maybe make it a web app instead of a Qt program if your interface is HTML/CSS anyway. Especially since python isn't half bad as a web language.

Reply #19 - 2012 July 07, 3:47 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

It's an Anki plugin, and it's in early alpha. Basically, you give it words, it lets you suck out sentences, meanings, etc. into your Anki deck from the Yahoo.co.jp dictionaries without you having to do a lot of copy/paste work. (Been there, done that. It's tedious and slow.)

If you can add sentences from the dictionaries to your deck just by clicking on them (same goes for other dictionary bits), then life gets a lot easier. Less time mucking with the deck, and more time for actual studying.

Reply #20 - 2012 July 07, 7:06 pm
shinsen Member
Registered: 2009-02-18 Posts: 181

Thanks for the explanation. I've never used an Anki plugin so I guess it made things less obvious for me. I could use a library like this for my dorama watching though. I'm not sure how feasible it would be so I'll just throw it out there as food for thought.

My setup looks something like this though the video is usually on a separate monitor:

http://i.imgur.com/OeXDkl.png

I'll explain for those not familiar with this (you can search the forum for more info) - AGTH under the video is something that hooks into the media player and pipes the subtitles into the system clipboard. I got Translation Aggregator on the left, it's getting the text from the clipboard and using WWWJDIC API to parse the subtitles into separate words with translations. In the middle I have Wakan which is also monitoring the clipboard (though its parser is much weaker) but Wakan has an "Examples" window at the bottom. The examples come from an edict file imported from WWWJDIC and they're very sparse and often there aren't any.

You can plug modules into Translation Aggregator such as MeCab for parsing kanji into furigana and such. In the screenshot below I'm using Google Translate API (just for example) and JParser (same as MeCab basically):

http://i.imgur.com/ryMDal.png

What this setup could use:
1. A block in Translation Aggregator with examples a la Wakan (you can see Wakan's example block in action in the second screenshot, bottom of its main window).
2. Some app to make Anki cards. I use Wakan to add words to its vocab list which can be exported to Anki later though I don't usually bother. If there was a way to quickly generate an Anki card with word plus meaning plus examples I'd probably start using Anki with this setup. In fact, number 1. is not necessary if I could have this open.

I'm guessing the only extra functionality DictScrape would need is something like "get text from clipboard" since that's how the data is being passed around anyway.
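The "get text from clipboard" part mostly comes down to noticing when the clipboard contents change. A rough sketch of just that logic follows; actually reading the clipboard is left out, since it depends on the toolkit (with Qt it would probably be QApplication.clipboard().text()), and the changed_texts name is made up.

```python
def changed_texts(snapshots):
    """Yield each non-empty clipboard text that differs from the previous one.

    snapshots is any iterable of strings, e.g. the results of polling the
    clipboard at a fixed interval.
    """
    previous = None
    for text in snapshots:
        if text and text != previous:
            yield text
        previous = text


# Simulated polling results: the same word twice, an empty read, a new word.
polled = ["唸る", "唸る", "", "蜥蜴"]
lookups = list(changed_texts(polled))
```

Each yielded string would then be handed to the dictionary lookup.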

Reply #21 - 2012 July 08, 6:36 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

rich_f wrote:

Okay, I managed to get past the vcvarsall.bat problem by installing MS Visual Studio Express 2008 (C++). Now I'm having problems getting libxslt and ... forgot the other one... installed and recognized by easy_install. -_-

I'll probably get it up and running by the time you release the plugin.

I did a little research about lxml on windows and it all seems really complicated. 

This looks like a page that has binaries for libxml and libxslt:
http://www.zlatkovic.com/libxml.en.html

It basically says you need to install iconv, zlib, libxml, and libxslt, in that order.

Then, it looks like you can install the binary packages for the python binding for libxml:

http://users.skynet.be/sbi/libxml-python/

Reading the skynet.be website, it looks like those installers might just install everything for you, including iconv, zlib, libxml, libxslt, and the python bindings.  It might be all that you need.

Try it out and let me know how it works. 

I really didn't expect it to be so much trouble :-/

Reply #22 - 2012 July 08, 6:15 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Okay, the skynet.be site seemed to install just fine. Then I closed and reopened Anki, and was met with another screen of errors:

I don't know what pyi or cygwin are. I don't have them installed, so I'm not sure what's up with that.

Traceback (most recent call last):
  File "C:\cygwin\home\dae\Home\anki\win\build\pyi.win32\anki\outPYZ1.pyz/ankiqt.ui.main", line 2679, in loadPlugins
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\Rich\AppData\Roaming\.anki\plugins\scraper.py", line 22, in <module>
    from scraper_gui.selector import MainWindowSelector
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\Rich\AppData\Roaming\.anki\plugins\scraper_gui\selector.py", line 24, in <module>
    from dictscrape import DaijirinDictionary, DaijisenDictionary, \
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\Rich\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\__init__.py", line 28, in <module>
    from .dictionaries import Dictionary, YahooDictionary, \
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\Rich\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\dictionaries\__init__.py", line 23, in <module>
    from .dictionary import Dictionary
  File "c:\pyi\iu.py", line 439, in importHook
  File "c:\pyi\iu.py", line 524, in doimport
  File "C:\Users\Rich\AppData\Roaming\.anki\plugins\scraper_gui\dictscrape\dictionaries\dictionary.py", line 21, in <module>
    from lxml import etree
  File "c:\pyi\iu.py", line 458, in importHook
ImportError: No module named lxml

I'll try rebooting later tonight. I have a lot of things going on with the machine right now, and I can't afford to reboot yet.

Reply #23 - 2012 July 09, 10:01 pm
rich_f Member
From: north carolina Registered: 2007-07-12 Posts: 1708

Tried rebooting, twice, and it's still not working, even though the lxml installer said it was installed properly. I still get the same list of errors in Anki.

Reply #24 - 2012 July 10, 6:33 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

It's basically saying it can't find the lxml bindings for python.  It's unclear whether this is because the lxml library is not installed, or the python bindings are not installed.  If I get some time I'll have to find a way to check this out.  I'm really surprised lxml is so hard to install on Windows :-/
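One way to narrow this down would be to ask a given Python interpreter directly whether it can see the module. The sketch below is written for Python 3 and the function names are made up. The important caveat: it only tells you about the interpreter that runs it. The c:\pyi\ paths in the traceback suggest (my assumption) that the Windows build of Anki runs its own bundled interpreter, so a module installed into the system Python might still be invisible to Anki.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


def report(names=("lxml", "argparse")):
    """Return printable lines describing the interpreter and each module."""
    lines = ["interpreter: " + str(sys.executable)]
    for name in names:
        status = "found" if module_available(name) else "MISSING"
        lines.append(name + ": " + status)
    return lines
```

Printing "\n".join(report()) from the system Python and comparing it with the error shown inside Anki would show whether the two are looking in the same place.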

Reply #25 - 2012 July 10, 6:49 am
partner55083777 Member
From: Tokyo Registered: 2008-04-23 Posts: 397

@shinsen, basically the only functionality available now is to input a word in kanji and kana, and then get the dictionary results from the 4 yahoo dictionaries for that word.  You are then able to click on which definitions and example sentences you want, and cards are automatically created for each of those sentences.
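In code terms, the flow is roughly "word plus chosen definitions in, one card per chosen sentence out". The sketch below only illustrates that shape; the class, function, and field names are hypothetical and are not the ones used in dictscrape, and the definition and sentence strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """A chosen definition plus the example sentences picked for it."""
    text: str
    sentences: list = field(default_factory=list)


def build_cards(kanji, kana, chosen):
    """Return one card (a plain dict) per chosen example sentence."""
    cards = []
    for definition in chosen:
        for sentence in definition.sentences:
            cards.append({
                "kanji": kanji,
                "kana": kana,
                "definition": definition.text,
                "sentence": sentence,
            })
    return cards


# Placeholder selection: two definitions, three sentences in total.
chosen = [
    Definition("definition 1", ["sentence A", "sentence B"]),
    Definition("definition 2", ["sentence C"]),
]
cards = build_cards("唸る", "うなる", chosen)
```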

To be honest, it's mostly aimed at an intermediate/advanced learner who is doing sentence/vocab mining from example sentences in yahoo's dictionaries. 

It's in early alpha stages, so it isn't really ready to be used by anyone other than developers.  That being said, it does seem to work pretty well on Linux.  I don't plan on really doing much with it other than fixing the parsing of entries until Anki 2 is released.  At that point I'll port it to Anki 2, add in support for configuring how the newly created cards are laid out (the model), try to get it working on Windows, and hopefully release it in beta.