search engines and homophones

 
Reply #1 - 2010 April 04, 9:07 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Here's a random thought: Being able to search Google with a 'sounds like' option (example: http://cogweb.ucla.edu/SearchTips.html)... Is this already implemented and I missed it? I also wonder at how it works. Integrate a database of known homophones? But then there's the 'fuzzy' stuff, like if you wanted to search Google for lyrics you heard but weren't sure of and could only roughly approximate.

The above example doesn't seem to work well, as if it simply uses wildcards, though it did give me the result 'share' when I searched for 'cher'.
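For reference, the textbook way to do 'sounds like' matching is a phonetic key such as Soundex: words that reduce to the same key are treated as sound-alikes. Below is a minimal sketch of American Soundex in Python. It is not a claim about what Google or the linked page actually does, and the function name is just illustrative. Note that 'cher' and 'share' only differ in the first letter of their keys, which shows why letter-based keys alone miss some homophones.

```python
# Letter groups used by American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key for an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


print(soundex("their"), soundex("there"))  # same key
print(soundex("cher"), soundex("share"))   # keys differ in the first letter
```

A real 'sounds like' search would probably need something looser than this, since Soundex pins the first letter and was designed for English surnames.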

Maybe they can use some form of captcha or misheard lyrics sites somehow to create a database of commonly misheard stuff? I threw in 'captcha' because of the 'recaptcha' OCR thing: http://recaptcha.net/learnmore.html

Edit: Hmm, how about OCRing Japanese vobsubs by incorporating kanji captcha (and kana) into sites like this somehow? As a game or for Serious Business™.

Last edited by nest0r (2010 April 04, 9:33 pm)

ropsta Member
From: 闇の底 Registered: 2009-07-23 Posts: 253

*Sees word captcha and immediately bails*

Reply #3 - 2010 April 05, 1:12 am
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

From my searches through the depths of the internet, I've come to the conclusion that it would be quite difficult to come up with a Japanese OCR for vobsubs, because apparently it's different from DVD to DVD.
Granted, I've never actually gone through and tried to do it, but that's what people seem to say.

Plus, the OCR software I've stumbled upon only does ASCII characters, and the ones that go beyond that... for instance, I had 浜村, and for 浜 it gave me the water radical, the "axe," and the "animal feet" as possible characters. Perhaps OCR programs were programmed with Heisig in mind?

 
Reply #4 - 2010 April 05, 1:48 am
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

@Asriel - Try searching the My Study Method thread for OCR-related stuff. Japanese OCRing isn't that great, but it's not too bad, depending on the quality of the source. I'm only familiar--I mean my friend is--with Readiris, as mentioned in the above reference. I haven't--my friend hasn't--tried e.Typist yet. Or maybe I did and forgot. That was ages ago.

Also, I got the impression you didn't know what I was suggesting. I didn't mean machine/automated OCRing. If you read what reCAPTCHA does, apply that to kanji captcha as a 'game', etc. Just a thought-experiment. ;p

Last edited by nest0r (2010 April 05, 1:50 am)

Reply #5 - 2010 April 05, 2:32 am
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

From what I understand about OCRing (which isn't much, mind you...next to none, actually), it will scan the picture for characters, "recognize" each one as a character/word, and save it to a text file.

Would doing this kanji-captcha thing be equivalent to making a giant database of kanji/words that show up in vobsubs, so that the OCR can recognize them more effectively?

OR! It just occurred to me that the captcha could actually BE the OCR.
But, in my experience (i.e. 浜), the words/characters it shows you don't always match up with actual characters. I've seen some programs that allow you to fix this, but I haven't used any of them, so I'm afraid I don't know much about it.

Reply #6 - 2010 April 05, 2:57 am
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

Did you read what recaptcha does? ;p

FINE I will copy/paste from the above link:

"reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct. "
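The mechanism in that quote can be sketched roughly as follows: an answer for the unknown word only counts if the same user got the known (control) word right, and the unknown word is settled once enough users agree. This is a minimal illustration in Python, not reCAPTCHA's actual code; the class name, the thresholds, and the exact-match comparison are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class PairedCaptcha:
    """Hypothetical control-word / unknown-word voting scheme."""

    def __init__(self, min_votes=3, min_agreement=0.75):
        self.min_votes = min_votes          # assumed threshold
        self.min_agreement = min_agreement  # assumed threshold
        self.votes = defaultdict(Counter)   # unknown image id -> answer counts

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Record a vote for the unknown word only if the control word is right."""
        if control_answer.strip() != control_truth:
            return False  # user failed the known word; ignore the other answer
        self.votes[unknown_id][unknown_answer.strip()] += 1
        return True

    def consensus(self, unknown_id):
        """Return the agreed reading, or None if there is no confident answer yet."""
        counts = self.votes[unknown_id]
        total = sum(counts.values())
        if total < self.min_votes:
            return None
        answer, n = counts.most_common(1)[0]
        return answer if n / total >= self.min_agreement else None


captcha = PairedCaptcha()
for _ in range(3):
    captcha.submit("浜村", "浜村", "img-42", "泣")
captcha.submit("wrong", "浜村", "img-42", "河")  # discarded: control word failed
print(captcha.consensus("img-42"))
```

The same shape would work for kanji pulled from vobsubs: the image shown to the user is always a crop of the source, never the OCR's guess.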

Last edited by nest0r (2010 April 05, 2:59 am)

Reply #7 - 2010 April 05, 3:05 am
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

ropsta wrote:

*Sees word captcha and immediately bails*

I used to hate captchas. Once they started using reCAPTCHA everywhere, I became okay with it. Also, the reCAPTCHA words are more entertaining and fuzzier than what I remember in the past, in terms of what you can mistype and still pass.

So I say, let's make a 'what's this Japanese text?' reCAPTCHA game, but with words taken from vobsubs! Maybe have an OCR program provide a best guess; then, with the context/translation presented as part of the game, users can do the rest. It's fun. Yes, fun.

Annotating books with audio in Transcriber is also fun and great listening/reading practice, and you end up with interactive books for use in the Kage Shibari reader! It's fun, damn it!

Last edited by nest0r (2010 April 05, 3:13 am)

Reply #8 - 2010 April 05, 6:25 am
Nemotoad Member
Registered: 2010-03-17 Posts: 66

In the spirit of obscurity, I never worked for a search engine so I can't tell you for sure that a 'sounds like' capability has been implemented for a while, albeit more obviously in certain contexts (such as the spellchecker). It's not an overt (user) option as such, but on the server side homophones (and synonyms etc.) usually do come out in the wash, for example when you look at a query session (e.g. someone types in something that sounds like what they want, then retries the query by respelling it, then clicks on some result that seems good to them) and then aggregate that over several users.
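To make the session idea concrete, here is a minimal sketch in Python. It assumes a made-up log format in which each session is a list of the queries typed plus a flag for whether the user clicked a result at the end; the function names and the `min_count` cutoff are hypothetical, and no real search engine's pipeline is being described.

```python
from collections import Counter


def rewrite_counts(sessions):
    """Count (first query -> final query) pairs from sessions that ended in a click."""
    pairs = Counter()
    for queries, clicked in sessions:
        if clicked and len(queries) >= 2 and queries[0] != queries[-1]:
            pairs[(queries[0], queries[-1])] += 1
    return pairs


def suggest(query, pairs, min_count=2):
    """Return the most common successful respelling of query, if any is frequent enough."""
    candidates = {fixed: n for (typed, fixed), n in pairs.items()
                  if typed == query and n >= min_count}
    return max(candidates, key=candidates.get) if candidates else None


# Made-up sessions: (queries typed in order, did the user click a result?)
sessions = [
    (["cher lyrics", "share lyrics"], True),
    (["cher lyrics", "share lyrics"], True),
    (["cher lyrics", "chair lyrics"], True),
    (["cher lyrics", "sheer lyrics"], False),
]
print(suggest("cher lyrics", rewrite_counts(sessions)))
```

No homophone dictionary is involved; sound-alike spellings show up simply because many users make the same correction.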

That reCAPTCHA thing is pretty interesting. Didn't Luis von Ahn do a lot of work on creating multiplayer games where players would actually compete to do work? Or was that someone else? For example players were pitted against opponents all over the world to label images or categorise items or evaluate a search engine's results for relevancy or the like. reCAPTCHA seems like a handy natural extension of that.

If an OCR program provides a best guess and gets it wrong, how will it be corrected? I mean for captcha a user would just type what they see for that word. Out of context it could be right, right?

Reply #9 - 2010 April 05, 7:48 am
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

nest0r wrote:

Did you read what recaptcha does? ;p

I know what reCAPTCHA does. I've known the general idea for a while, just never looked into exactly how it works.

But that's not my point, nor does it answer my concerns. Even if someone were to develop a Japanese captcha for vobsubs, it still doesn't solve the problem that I've had with OCRs in the past.
What happens if you get http://japanese.about.com/library/weekly/graphics/sanzui.gif instead of 泣, and the next character it gives you is 立?
You could end up with "彼女は止まらなく三水立いた".

Reply #10 - 2010 April 05, 8:11 am
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

Asriel wrote:

What happens if you get http://japanese.about.com/library/weekl … sanzui.gif instead of 泣、and the next character it gives you is 立.
you could end up with "彼女は止まらなく三水立いた"

How exactly would you end up with that? Users entering the captcha would need to type in 泣. It isn't OCR...

Reply #11 - 2010 April 05, 8:26 am
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

Jarvik7 wrote:

How exactly would you end up with that? Users entering the captcha would need to type in 泣. It isn't OCR...

More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

Doesn't it work like
1. OCR goes through, and gets things wrong
2. Images are made of things that are incorrect
3. Users type in what is wrong in order to correct them
4. Yay, we can read now!

The problem that I see occurs in between step 1 and 2. It goes through and sees 泣いた. When I was OCRing 浜村, I was prompted with the 三水, as opposed to 浜.
So, if an image is made of the 三水 as opposed to 泣, how are the users going to know to type in 泣, as opposed to 河, 浜, or even 水 for that matter?

I must be missing something somewhere...

Reply #12 - 2010 April 05, 8:52 am
Jarvik7 Member
From: 名古屋 Registered: 2007-03-05 Posts: 3946

Japanese OCR is in a pretty bad state, but any halfway decent one should be able to guess that most texts aren't going to have random radicals in the text. I haven't read a technical page about reCAPTCHA, but I don't think it does the OCR stage. Even if it does, any Japanese captcha you design could leave it out and just require users to type in everything.

Reply #13 - 2010 April 05, 1:33 pm
nest0r Member
Registered: 2007-10-19 Posts: 5236 Website

@Asriel - As the quote says, it doesn't make an image of the incorrect item. It takes the piece of the source image that it's not sure about or can't read and presents that to the user, alongside something it does know. Then, provided the user gets the known one correct, it assumes the same for the other, building a more and more certain aggregate over bazillions of attempts. From there I have no idea what happens.

So you don't have to worry about the user looking at the incorrect image, just the unknown but accurate one, should anyone be able to decipher it.

For learners of Japanese rather than experts, doing vobsubs, I suppose different logistics would be in order.

For vobsub, I only did it once, ages ago, but it seems, from a d-addicts tutorial I glanced through, that it's mostly user-oriented anyway. The computer simply presents the images to you and you input the text yourself, teaching the system? The tutorial recommended using the IME input pad for stuff you don't know. -- http://www.d-addicts.com/forum/viewtopic.php?t=16017 (bottom, 'The Alternative')

As an aside, you speak from this single problem, so I wonder, how many different Japanese OCRs have you used, Asriel? Did you read that thing I didn't post in a totally innocent thread? That was from a low quality source, required perhaps 2 corrections a line if memory serves, and was easily fixed via human-looking-at-source-image.

At any rate, between the possibilities of providing users with source images and context, different input methods, compounds, translations, etc., I think there's lots of possibilities there to 'crowdsource' or whatever rather than relying on machine vision. One movie's vobsubs vs. 200 people OCRing it w/ human vision, etc.

There's also Transcriber's style... What if you have the film audio playing, the vobsubs in front of you, and that way you have the 'readings' that you can add phonetically, pausing and inputting kanji as you go along? Or something. Anyway, I'm just brainstorming. ;p

I didn't even go into how reCAPTCHA refines it to combine the two, because with the OCRs I've used, I didn't know there was a place where it told you when it was spitting out something it wasn't sure about. If I could tell it not to do that and instead to simply mark that image as unknown, that would be good...
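The 'mark it as unknown instead of guessing' idea could look something like the sketch below, in Python. It assumes the OCR step hands back a (image id, best guess, confidence) tuple per subtitle image; that tuple format, the `triage` name, and the 0.9 threshold are all invented for illustration, since OCR programs expose confidence in different ways, if at all.

```python
def triage(ocr_results, threshold=0.9):
    """Split OCR output into accepted guesses and images left for humans.

    ocr_results: iterable of (image_id, guess, confidence) tuples, where
    confidence is assumed to be a float between 0 and 1.
    """
    accepted = {}  # image_id -> text the OCR was confident about
    unknown = []   # image_ids to show to humans as source images
    for image_id, guess, confidence in ocr_results:
        if confidence >= threshold:
            accepted[image_id] = guess
        else:
            unknown.append(image_id)  # drop the shaky guess entirely
    return accepted, unknown


# Made-up results for three subtitle images.
results = [("sub-001", "彼女は", 0.97), ("sub-002", "三水立いた", 0.41), ("sub-003", "浜村", 0.93)]
accepted, unknown = triage(results)
print(accepted)
print(unknown)
```

Anything in the unknown list would then go to the captcha/game side as a crop of the original subtitle image.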

Also, it did answer your concerns and was your point, because even if you knew what reCAPTCHA does, as you said, you didn't know exactly how it works, even though I went out of my way to post the quote in question that answered your concerns. So don't get snarky with me! ;p

Also, can't we type in Anki? Maybe people can start typing up their vobsub'd film cards/decks made in subs2srs, and donate the results?

Last edited by nest0r (2010 April 05, 2:12 pm)

Reply #14 - 2010 April 06, 12:03 am
Asriel Member
From: 東京 Registered: 2008-02-26 Posts: 1343

I think I'm just getting caught up on details that don't even make a difference at this point in time.

Let's just forget I ever said anything, and if this actually does get into some sort of development, I will either go "I told you so," or "Gee, I was stupid."
