Automatic google search - crazy idea?

Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Hi all! Since this is the most technologically savvy and progressive (not to mention the friendliest wink) Japanese community on the web, I'd like to ask you this. Do you think it'd be possible and worthwhile to make a plugin of sorts that would let you put in a list of Japanese words, and then it would search the web and give you back articles/posts/whatever that contain the highest ratio of these words? It'd be useful if you've just learned a bunch of words and want to practice reading them in context, to minimize googling/skimming time. Or has it already been made and I'm living in the Dark Ages? smile
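
For what it's worth, the ranking half of this idea can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: the pages have already been fetched and reduced to plain text (the web search itself is not shown), and a word counts as "present" if it appears as a substring, since Japanese has no spaces to split on. The names coverage and rank_texts and the sample sentences are made up for the example.

```python
def coverage(text, words):
    """Fraction of the listed words that occur at least once in the text."""
    if not words:
        return 0.0
    found = sum(1 for w in words if w in text)
    return found / len(words)


def rank_texts(texts, words):
    """Return (score, title) pairs, with the best-covering text first."""
    scored = [(coverage(body, words), title) for title, body in texts.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical word list and page texts, just for illustration.
    words = ["天気", "電車", "図書館"]
    texts = {
        "A": "今日は天気がいいので、電車で図書館に行きました。",
        "B": "電車が遅れています。",
        "C": "猫が好きです。",
    }
    for score, title in rank_texts(texts, words):
        print(f"{title}: {score:.2f}")
```

Plain substring matching will miss inflected forms and can match inside longer words, so a real tool would probably want a morphological analyzer in front of this.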

jettyke Member
From: 九州 Registered: 2008-04-07 Posts: 1194

One problem is that a single article would probably contain lots of other, more difficult words, which could make it hard to learn from.

Otherwise that's a good idea.

edit: Maybe if you would combine it with Cb's difficulty rating tool, you could sort articles like that and read articles that are not too hard.

Last edited by jettyke (2011 August 15, 3:07 pm)

Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Yes, that's a good point!

So, any volunteer willing to work on this? I'm no programmer smile

aphasiac Member
From: 台湾 Registered: 2009-03-16 Posts: 1036

Lindley wrote:

Do you think it'd be possible and worthwhile to make a plugin of sorts that would let you put in a list of Japanese words, and then it would search the web and give you back articles/posts/whatever that contain the highest ratio of these words?

How is this different to just entering the words into a google search?

Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Well, for one, the search would be automatic, meaning you wouldn't have to manually paste these words into a browser and then click through a dozen pages trying to find a good article (and what if you want more than one?). As many people (me included), even at the intermediate stage, aren't yet comfortable browsing the Japanese web, this would make the whole process less painful and time-consuming. Plus, as has been wisely mentioned, getting the articles in the order optimal for *your* learning would be pretty awesome. This way the articles could even be used for incremental reading, something that has already been discussed on the forum. And if these articles could be run in Yomichan, that would make creating cards easier.

jcdietz03 Member
From: Boston Registered: 2008-12-19 Posts: 324 Website

You must do it [browse the Japanese web] and get used to it. How is the hard part. It can be hard finding something interesting to you.

I recommend trying this thread and clicking things that sound interesting. It won't work as a long-term solution though.

Other things you could try:
Do you like online shopping? Try amazon.co.jp. Their site is easy to navigate. They have an English option too if you only want to read product names and descriptions in Japanese.

Try asahi.com (news). There are other news options too. Search around on this forum.

Try checking the website of your favorite Japanese game developer. Try checking the website of your favorite Japanese game. But it can be hard as many are flash and Rikaichan won't work.

Try checking the website of your favorite anime, manga, or publisher. Try checking the website of your favorite Japanese TV station.

Try finding a message board you like. But I don't like them.
There is one about Precure here: http://bbs.eastan.net/
There's 2ch (a message board) but I don't know much about it. Try asking about it if interested. I heard it's like 4chan but I don't know much about that either.

Bookmark these sites. Try to check one of them once per day.

abarone22 Member
From: Atlanta Registered: 2009-12-28 Posts: 14

One possible option is something like jisho.org, where you can type a word and the site will come back with a ton of sentences that include that word. It won't be autogenerated from a list of vocab, but it's pretty quick to use.

nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I don't think this is very doable because you're trying to add a deep, wide web crawling functionality, no? This tool would need to search and parse each of the hits that turn up with batch web searches? Doubt it'll happen.

What I would like to see is more usage of the known.db (created with MorphMan), applied to web browsers (see: http://forum.koohii.com/viewtopic.php?p … 37#p149237 [towards the end of my comment] and also: http://forum.koohii.com/viewtopic.php?p … 37#p149337 for a bit on adding articles to Anki and sorting w/ MorphMan). Like an extension that lets you parse a webpage and determine the number of unknowns, perhaps format them. Perhaps it could be tied to Rikaisan, since we were discussing that in the cbJisho thread and the Rikaisan thread (http://forum.koohii.com/viewtopic.php?p … 02#p149202 and http://forum.koohii.com/viewtopic.php?p … 16#p149016 respectively [also relevant are the reader tools I mention there. Edit: Or was it here?]).

There's also: http://forum.koohii.com/viewtopic.php?p … 04#p147804

Last edited by nest0r (2011 August 16, 6:41 pm)

Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

nest0r wrote:

...

nest0r has an avatar!!!

Reply #10 - 2011 August 16, 8:22 pm
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

On a more serious note, I was thinking about something like this not too long ago.
I started reading the MorphMan thread, and while admittedly I haven't finished it yet, I plan to get back to it -- I don't have internet access at home until Saturday.

Edit:
Apparently a lot of the stuff I mentioned before isn't really too much of a problem.

Although unless you have a set of sites that you want it to parse through, i.e. a set of news sites that follow a similar format for their articles, etc., I can see it taking a long time. Going through Google results and trying to match the best one for you could take a really long time.

Last edited by Asriel (2011 August 16, 8:26 pm)

Reply #11 - 2011 August 16, 9:04 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Asriel wrote:

nest0r wrote:

...

nest0r has an avatar!!!

What are you talking about, I've always had this avatar. I guess your spirit level wasn't high enough to see it, until now. ^_^

Reply #12 - 2011 August 17, 11:22 am
jettyke Member
From: 九州 Registered: 2008-04-07 Posts: 1194

Lol, what comes to my mind when I see it is "you're stupid, man"

Reply #13 - 2011 August 17, 12:44 pm
overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

nest0r wrote:

What I would like to see is more usage of the known.db (created with MorphMan), applied to web browsers...

Yes, a much better idea would be to have a large list of interesting articles recommended by other users and then use morphman to analyze which are best to read first given your current knowledge and goal vocab.  Then perhaps an English version for getting through a nest0r reading list wink

Reply #14 - 2011 August 17, 1:07 pm
Nagareboshi Member
From: Austria Registered: 2010-10-11 Posts: 569 Website

There is software that does exactly this. It is called "Copernic Agent" and I have been using it in different versions since late '98. Of course this software comes with a little price tag, $39.95. I'm not sure whether the free version offers what you are looking for; you'd have to try it and see: http://www.copernic.com/en/products/agent/personal.html If not, it's worth the money, and it saves time looking things up.

Reply #15 - 2011 August 17, 1:27 pm
Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Jcdietz03 - thanks for the recommendations! The news sites are still too difficult for me, but I'll check out those forums you mention.

Abarone22 - aren't those online j-e dic sites using not-so-natural Japanese? Looking things up in a j-j dic might be a good idea, although again - lots of manual typing.

Nest0r - yes, such a tool for article parsing and reading would be incredibly useful, like you and Jettyke say. If it existed, I wouldn't mind searching for reading material manually smile

Asriel - I'll be interested in what you come up with! I've only recently started using the morph plugin myself, so maybe I'm missing some of its finer points.

Overture2112 - tempting, I agree smile There used to be (or still is?) a feature on this site which allowed you to paste a text and it would color the kanji you've learned. Maybe something similar could be applied - showing known/unknown/difficult-to-remember words and using that to create cards or use the text for incremental reading?

Nagareboshi - thanks, I'll check it out. Not sure it will work for Japanese, though.

Reply #16 - 2011 August 26, 2:12 pm
Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

A quick fix for looking up new words in context - install Google Toolbar, turn on highlighting, and enter a list of words. This is best done on text-heavy pages, like print versions of forum threads, expanded view of Google Reader blogs, etc. Every word will be highlighted in a different color, so you can select the paragraph with the most words, or whatever suits your fancy.

Reply #17 - 2011 August 26, 2:18 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Yeah I made note of that too (after searching for a way to search a page for multiple words and highlight them, I think I found that possible solution via a MeFi thread that recommended Google Toolbar), but didn't feel like installing the Google Toolbar and wanted something automatic, based off thousands of known/mature morphemes.

What I ended up doing is here if you're interested: http://forum.koohii.com/viewtopic.php?p … 29#p155029

It's a bit roundabout with the regexp and the filtering, so it'd probably work best to just do it like once a month. Maybe in the future Morph Man can support exporting lists from the GUI into a Vocabulary Highlighter-friendly .xml database. If it was a custom browser plugin, it could even be like an automatically updating thing.

Last edited by nest0r (2011 August 26, 2:19 pm)

Reply #18 - 2011 August 26, 3:00 pm
Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Ah, okay, thanks. I'll have to sit down with a cuppa steaming coffee and read through your explanations, then smile

Reply #19 - 2011 August 26, 4:18 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Lindley wrote:

Ah, okay, thanks. I'll have to sit down with a cuppa steaming coffee and read through your explanations, then smile

Yeah you need to have Morph Man installed and running so you have a nice mature.db/known.db, and access the GUI to display what you want, then you need to know enough regexp/find/replace stuff to have the morpheme list you get in the GUI formatted to match the .xml you generate with Vocabulary Highlighter. It's mostly just tedious if you're all set with that stuff, but can seem daunting, I imagine, otherwise.

I already have Morph Man up and running with my knowns/matures, and the GUI is pretty intuitive plus I'm already familiar with it, and I know enough regexp to add the <vocabulary><word> to the beginning of each morpheme and the </word> ... </vocabulary> to the end, and then it was just a matter of first adding a couple of words to the extension and exporting the extension's database so I had an .xml with its bare essentials in place, and then replacing the entries with the above while leaving the opening and ending sections of the .xml alone.

In the future I can imagine having it all automated, and/or you can do stuff like: first derive a list of matures without the knowns and the knowns without the matures, and have the matures the darkest and the knowns medium dark, so the unknowns are lightest and you have a kind of gradient of your best to least knowns, if you want.

Plus you can control not only text and background colours, but the font types and sizes. So theoretically you could do something like make the unknown text match the colour of the background (e.g. white), and you could reveal it by mousing over with Rikaisan or double-clicking to select text.

It's not going to be perfect though, since the headwords that, say, Rikaisan uses, are different from morphemes, but I think it looks pretty great for the most part. Plus you can filter the stuff by part of speech.

Last edited by nest0r (2011 August 26, 4:25 pm)

Reply #20 - 2011 August 26, 4:24 pm
overture2112 Member
From: New York Registered: 2010-05-16 Posts: 400

nest0r wrote:

In the future I can imagine having it all automated, and/or you can do stuff like: first derive a list of matures without the knowns and the knowns without the matures, and have the matures the darkest and the knowns medium dark, so the unknowns are lightest and you have a kind of gradient of your best to least knowns, if you want.

Yea, I should add an "export db as" button and we can make a few convertors for useful things. If you can document what exactly the xml needs to look like it should be easy enough.

nest0r wrote:

...To aru Majutsu no Index volume 1

I don't suppose you have a subs2srs deck for the show?

Reply #21 - 2011 August 26, 4:28 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

overture2112 wrote:

nest0r wrote:

In the future I can imagine having it all automated, and/or you can do stuff like: first derive a list of matures without the knowns and the knowns without the matures, and have the matures the darkest and the knowns medium dark, so the unknowns are lightest and you have a kind of gradient of your best to least knowns, if you want.

Yea, I should add an "export db as" button and we can make a few convertors for useful things. If you can document what exactly the xml needs to look like it should be easy enough.

nest0r wrote:

...To aru Majutsu no Index volume 1

I don't suppose you have a subs2srs deck for the show?

No I don't, though I love that show. (I removed that section from my comment just now since I think it would be redundant if you're already showing all the known morphemes for anything in Firefox, but I guess specific stuff could be useful if you're messing with custom pens in the extension.)

And export stuff would be awesome. If you have the extension you can export the .xml yourself and it'll have little variations depending on your colour settings, but here's what it looks like with everything default except the background changed to dark gray, and with 3 words added:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<vhdatabase.sqlite xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <vocabulary>
        <word>morpheme1</word>
        <note/>
        <htmlnote/>
        <matchcase>0</matchcase>
        <nothighlight>0</nothighlight>
        <pen penname="">
            <style>background: none repeat scroll 0% 0% darkgray ! important; border: medium none transparent ! important; color: black ! important; display: inline ! important; float: none ! important; font: inherit ! important; margin: 0px ! important; padding: 0px ! important; position: static ! important; text-decoration: inherit ! important; text-transform: inherit ! important; vertical-align: baseline ! important; visibility: inherit ! important;</style>
            <leftboundary>0</leftboundary>
            <rightboundary>0</rightboundary>
            <findderivatives>0</findderivatives>
        </pen>
    </vocabulary>
    <vocabulary>
        <word>morpheme2</word>
        <note/>
        <htmlnote/>
        <matchcase>0</matchcase>
        <nothighlight>0</nothighlight>
        <pen/>
    </vocabulary>
    <vocabulary>
        <word>morpheme3</word>
        <note/>
        <htmlnote/>
        <matchcase>0</matchcase>
        <nothighlight>0</nothighlight>
        <pen/>
    </vocabulary>
</vhdatabase.sqlite>

So normally when you import an .xml, you can replace or append. But from the extension options it seems you can only adjust custom colours entry by entry, one at a time. So I figure you could adjust the colours and such in the .xml when you're adding the morphemes to the list (if manually). Not sure how it'd work if automatically exporting. Not that having different colours for different batches is essential, but it came to me that having both matures and known could be interesting to differentiate, since I'm always going on about gradients and 60/30/10, etc.

Edit: And this entry <vocabulary><word>。</word> would break it and make it say incorrect format till you removed that entry. (Not sure what else might break it. ‘!’ was okay, as was ‘,’ and 0-9/0-9. I had to remove those one by one, as well as other combinations of numbers, since the editor for the extension sucks like that and I didn't want all the numbers and commas, etc. on pages highlighted. In the future hopefully filtering/Morph Man can remove those, or maybe someone knows a good regex to remove them.)
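
On the "good regex to remove them" question, here is one possible approach, sketched in Python rather than as an editor find/replace. Instead of listing every bad symbol, it keeps an entry only if it contains at least one hiragana, katakana, or kanji character, so lone punctuation, commas, and half-width or full-width numbers all drop out. The function name clean_entries is made up, and note the side effect: entries written purely in Latin letters would be dropped too.

```python
import re

# Matches one hiragana, katakana, or kanji character (plus the iteration mark 々).
# Punctuation such as 。 and 、, digits, and full-width digits fall outside these ranges.
JAPANESE_CHAR = re.compile(r"[\u3041-\u3096\u30a1-\u30fa\u4e00-\u9fff\u3005]")


def clean_entries(lines):
    """Keep only non-empty entries that contain at least one kana or kanji."""
    cleaned = []
    for line in lines:
        entry = line.strip()
        if entry and JAPANESE_CHAR.search(entry):
            cleaned.append(entry)
    return cleaned


if __name__ == "__main__":
    sample = ["分かる", "。", "12", "09", ",", "", "ねこ", "テレビ"]
    print(clean_entries(sample))
```

Running the list through something like this before wrapping it in the .xml tags should at least get rid of the 。 entry that broke the import.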

Last edited by nest0r (2011 August 26, 4:59 pm)

Reply #22 - 2011 August 27, 2:27 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Okay Lindley my friend, here's what you can do for now:

1) Show your known.db in the MorphMan GUI as single column. (If you don't have this stuff working, use whatever list of words you prefer, then.)

2) Copy the list into your favourite text editing program.

3) Add <vocabulary><word> to the beginning of each line and </word></vocabulary> to the end of each line. (The regexp for this in UltraEdit is Find: ^ and Replace: <vocabulary><word> and Find: $ and Replace: </word></vocabulary>.)

4) Paste that between this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<vhdatabase.sqlite xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
[PASTE HERE]
</vhdatabase.sqlite>

(Make sure you replace the [PASTE HERE] hehe.)

5) Save this as an .xml file (e.g. add the .xml extension to the filename, save as All Files or whathaveyou.)

6) Since you should have https://addons.mozilla.org/en-US/firefo … ghlighter/ installed, right click a webpage and click the Vocabulary Highlighter from the context menu (should be at or near the top), click Database → Import → Browse to the .xml you just saved.

7) Replace or append to whatever you already have listed.

8) If you want to customize the stuff, just select Pens (right next to Database) → Customize pens → Change whatever you like.

9) Visit a Japanese webpage or refresh the one you're already at and voila! Your knowns or whatever are darkened or lightened.

10) Bonus: If you want to make toggling easy, you can click the orange Firefox menu thingy at the top left, Options → Toolbar Layout and drag the Vocabulary Highlighter icon where you like (I have it down at the bottom next to the Rikaisan icon.)
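
Steps 2 to 5 above could also be scripted. The following is a minimal Python sketch, assuming the word list has been saved as a UTF-8 text file with one entry per line. The function names and file names are hypothetical, and it writes the same bare <vocabulary><word> form described in step 3; whether the extension accepts that minimal form in every case has not been verified here.

```python
from xml.sax.saxutils import escape

HEADER = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    '<vhdatabase.sqlite xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
)
FOOTER = "</vhdatabase.sqlite>\n"


def build_xml(words):
    """Wrap each word in <vocabulary><word>...</word></vocabulary> (steps 3 and 4)."""
    entries = "".join(
        f"<vocabulary><word>{escape(w)}</word></vocabulary>\n" for w in words
    )
    return HEADER + entries + FOOTER


def convert(list_path, xml_path):
    """Read one word per line and save the result as UTF-8 (steps 2 and 5)."""
    with open(list_path, encoding="utf-8") as src:
        words = [line.strip() for line in src if line.strip()]
    with open(xml_path, "w", encoding="utf-8") as dst:
        dst.write(build_xml(words))


if __name__ == "__main__":
    # Prints the generated XML for two sample words; to convert a real list,
    # call convert("known_list.txt", "known_list.xml") with your own file names.
    print(build_xml(["分かる", "猫"]))
```

Writing the file as UTF-8 matters because the header declares that encoding, and escape() takes care of any stray &, < or > in the list.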

Last edited by nest0r (2011 August 27, 4:39 pm)

Reply #23 - 2011 August 27, 4:38 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Ohh, and if you don't want to use regex, forgot about this: http://textmechanic.com/Merge-Text-Line-by-Line.html

Just paste the list on whichever side (since we're not merging), and add a prefix (e.g. <vocabulary><word>) and a suffix (e.g. </word></vocabulary>), then click Merge. (By the way, just noticed I wrote the suffix in the wrong order in the above comment. Fixed.)

Edit: For now though this whole replace with extracted morphemes doesn't work so great, because Morph Man 2 doesn't use inflected forms, so you'll end up with 分かりました showing up as unknown, because 分かる is in the .db but not 分かり.

There's also the problem with words that are written in hiragana instead of kanji (e.g. わかりました).

Last edited by nest0r (2011 August 27, 4:43 pm)

Reply #24 - 2011 August 28, 5:32 am
Lindley Member
From: Ukraine Registered: 2008-04-03 Posts: 61

Okay, so I followed the steps you outlined (used textmechanic since UltraEdit shows me ??? instead of kanji), but the extension keeps telling me that the database has an incorrect format sad I've made a db with just one page's worth of morphemes, to check, and it doesn't contain any of those symbols you mentioned that might "break" the db, only kanji and kana. Any idea what might be wrong? This sounds so awesome that I'm dying to have it working smile

Reply #25 - 2011 August 28, 12:31 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

I'm not sure what might be wrong with it. The only time I had it break was when I had that 。 in there; since then I haven't had any problems playing with different lists.

Maybe try pasting the original list (without formatting) here: http://textmechanic.com/Sort-Text-Lines.html and sort it just to make sure there's nothing hinky? If it's not a super secret list, feel free to post it or a shortened version that's still broken.

But like I said, it's not quite so awesome at the moment because you'll see a lot of inflected forms that should be marked known but aren't, but it's still nice to get a kind of overview sort of like what you wanted. Plus I mentioned in another thread, there's doing that, and then when you want to be more specific, you can extract the morphemes specifically from an article or novel and then turn that into a .db in Morph Man, and subtract the known.db from that which will show you all the unknown morphemes in that particular text, so you can then import and highlight just the unknown morphemes.
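
The subtraction step described above is basically a set difference. As a small Python sketch, assuming both morpheme lists have already been exported as plain sequences of strings (this does not read Morph Man's .db files, and the function name is made up):

```python
def unknown_morphemes(text_morphemes, known_morphemes):
    """Morphemes from the text that are not in the known list, in first-seen order."""
    known = set(known_morphemes)
    seen = set()
    unknown = []
    for morpheme in text_morphemes:
        if morpheme not in known and morpheme not in seen:
            seen.add(morpheme)
            unknown.append(morpheme)
    return unknown


if __name__ == "__main__":
    # Hypothetical lists, just to show the shape of the input and output.
    from_text = ["猫", "走る", "猫", "魔術", "走る", "禁書"]
    known = ["猫", "走る"]
    print(unknown_morphemes(from_text, known))
```

The result is the list you would then import and highlight, with duplicates removed and the order of first appearance in the text kept.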

That's a bit of effort so it probably works best with novels that you did not get from the ‘totally innocent’ 日本語 books thread, because that would be wrong and there's nothing there anyway so don't bother looking because you won't find thousands of books there.

But soon, maybe, in a coming version of Morph Man, we'll have nice exporting and inflected form abilities. After that, there'll still be the occasional words written in kana, but I think if you add kana to your list, you might end up with a lot of false positives, plus really it's not so often that words that show up with kanji in Mecab are written as kana elsewhere.

Last edited by nest0r (2011 August 28, 12:32 pm)