clemente
Member
From: venexia
Registered: 2008-11-06
Posts: 22
Hello everyone,
I was wondering if anyone knows how to compare several Japanese texts and see whether any of their sentences match (and, obviously, see what the matches are).
I have an archive of texts (around 100) and I want to compare each one against all the others to see if any sentences are exactly the same. It would be nice to be able to set some parameters, such as minimum length, exact or similar match, etc.
I have found a piece of software called Corsis (formerly Tenka Text), but its documentation is not yet complete and I have not been able to use it for my purpose.
I also tried anti-plagiarism software, but those tools generally either don't handle Japanese characters or are too expensive.
Thanks for any help and suggestions.
Cheers.
FooSoft
Member
From: Seattle, WA
Registered: 2009-02-15
Posts: 513
That sounds like a pretty specific task, so I think your best bet would be to use a scripting language (Python/Ruby/Perl). You would write some regular expressions to match the kinds of text you want (whole sentences, expressions, whatever) and process the results as needed. Then you unleash the script on the globs of input.
If you are at all technically minded, you should try your hand at scripting. It's not hard, and it lets you harness the power of your PC to do repetitive tasks very quickly.
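To give a rough idea of what such a script could look like, here is a minimal Python sketch. It is not a finished tool: the function names (`split_sentences`, `shared_sentences`, `similar_sentences`, `load_texts`), the default minimum length, the 0.8 similarity threshold, and the assumption that the texts are UTF-8 `.txt` files in one folder are all made up for illustration. Sentences are split on 。！？ and line breaks, which is only an approximation of real sentence boundaries.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from pathlib import Path

# A "sentence" here is a run of characters ending at 。！？!? or a line break.
# This is a simplification, not a real Japanese sentence segmenter.
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]?")


def split_sentences(text):
    """Return the stripped, non-empty sentences found in text."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def shared_sentences(texts, min_length=10):
    """Find sentences that appear verbatim in two or more texts.

    texts maps a name (e.g. a file name) to that text's content.
    Returns {sentence: sorted list of names containing it}.
    """
    seen = defaultdict(set)
    for name, text in texts.items():
        for sentence in split_sentences(text):
            if len(sentence) >= min_length:
                seen[sentence].add(name)
    return {s: sorted(names) for s, names in seen.items() if len(names) > 1}


def similar_sentences(text_a, text_b, min_length=10, threshold=0.8):
    """Return (sentence_a, sentence_b, ratio) for near-identical pairs.

    Compares every sentence of text_a with every sentence of text_b,
    so it is slow on long texts; exact duplicates are skipped.
    """
    candidates_b = [s for s in split_sentences(text_b) if len(s) >= min_length]
    pairs = []
    for a in split_sentences(text_a):
        if len(a) < min_length:
            continue
        for b in candidates_b:
            if a == b:
                continue
            ratio = SequenceMatcher(None, a, b).ratio()
            if ratio >= threshold:
                pairs.append((a, b, ratio))
    return pairs


def load_texts(folder, pattern="*.txt", encoding="utf-8"):
    """Read every file matching pattern in folder into a {name: text} dict."""
    return {
        path.name: path.read_text(encoding=encoding)
        for path in sorted(Path(folder).glob(pattern))
    }
```

With the files in a (hypothetical) folder called `texts`, `shared_sentences(load_texts("texts"), min_length=15)` would check all files against each other in one pass, since each sentence is simply recorded together with the files it occurs in. The similarity check works on one pair of texts at a time and would need to be looped over the pairs of interest.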
Last edited by FooSoft (2010 April 17, 11:42 pm)
UltraEdit lets you compare at least two texts; I'm not sure how many more, or exactly how it works. I've used UltraEdit for years, but mostly out of habit: when I started using it, it was one of the few editors I knew of that could open texts in multiple tabs. In fact, I think it might have been the only one? Can't remember that far back.
Last edited by nest0r (2010 April 17, 10:55 pm)
clemente
Thank you all for the kind replies.
Unfortunately I am not yet able to write scripts, although I hope to soon have some time to learn. Do you know if anyone has ever done anything like this?
As for the other software, it works quite well, but I have more than a hundred files, and checking them one by one against all the others would take a really long time.
Cheers
Rereading your initial post, if all you want to do is find instances of words/phrases, just use UltraEdit's "Find in Files". Lots of options there and it lists the results by line in a bottom pane (or a new tab).
Pretty stupid of me, because I asked this here: http://forum.koohii.com/viewtopic.php?pid=93929#p93929 - And I thought I had to make do with Word, but it's much better in UltraEdit to actually see the results. And even though I already knew I could do this, it was like a brain glitch because I didn't put the ideas together till just now.
Edit: Okay, with the new version of UE it works for Shift-JIS and other encodings, fantastic!
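For anyone without UltraEdit, the same kind of "find in files" search can be roughly approximated with a short Python script. This is only a sketch under my own assumptions: the names `read_text_any` and `find_in_files` are invented for illustration, and the list of encodings to try is a guess at what a Japanese text archive might contain. Guessing an encoding by trial is a heuristic and can occasionally pick the wrong one.

```python
from pathlib import Path

# Encodings to try, in order. UTF-8 goes first because it is the strictest;
# this list is an assumption, adjust it to match your files.
ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def read_text_any(path, encodings=ENCODINGS):
    """Decode a file with the first encoding that does not raise an error."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")


def find_in_files(phrase, folder, pattern="*.txt"):
    """Return (file name, line number, line) for every line containing phrase."""
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        lines = read_text_any(path).splitlines()
        for number, line in enumerate(lines, start=1):
            if phrase in line:
                hits.append((path.name, number, line.strip()))
    return hits
```

Each hit is reported with its file name and line number, similar in spirit to the results pane described above.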
Last edited by nest0r (2010 April 18, 10:27 pm)