Comparing kanji lists


thecite Member
From: Adelaide Registered: 2009-02-05 Posts: 781

I've got two big lists of kanji in separate Word documents, and I want to compare them to see which kanji from list #1 aren't in list #2. Does anyone know how I'd go about doing this?

Any help would be appreciated, thanks.

Mushi Member
From: USA Registered: 2010-07-06 Posts: 252

Could you copy the kanji only from the two sources to two text files with one kanji on each line, sort them, then do a line by line comparison?

It depends on your OS, but on Windows, for example, I believe you could save them from Notepad as UTF-8 text, sort each text file with sort.exe, then diff them textually with fc.exe (file compare) or with a text comparison tool like Beyond Compare.

When doing similar things before, I've found that the only minor issue is remembering to preserve the little UTF-8 file marker (the byte order mark) at the beginning of the text files when managing the output...
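For anyone who would rather script the sort-then-compare steps above, here is a minimal Ruby sketch. It assumes each file is UTF-8 with one kanji per line, as described; the helper names read_entries and only_in_first are made up for illustration. Opening the file with the BOM|UTF-8 mode drops the marker on reading, so it cannot end up glued to the first kanji.

```ruby
# Sketch only: helper names are hypothetical, not from any tool mentioned above.

# Read a UTF-8 text file into an array of entries, one per line.
# "r:BOM|UTF-8" strips a leading byte order mark if one is present.
def read_entries(path)
  File.read(path, mode: "r:BOM|UTF-8")
      .lines
      .map(&:strip)        # drop the line ending (LF or CRLF) and stray spaces
      .reject(&:empty?)    # ignore blank lines
end

# Entries that appear in the first file but not in the second, sorted.
def only_in_first(path1, path2)
  (read_entries(path1) - read_entries(path2)).uniq.sort
end
```

Calling only_in_first("1.txt", "2.txt") would then return the kanji that are in list #1 but missing from list #2 (the file names here are placeholders).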

Mushi Member
From: USA Registered: 2010-07-06 Posts: 252

Or, it occurred to me that since you're using Word, you probably also have Excel, which may be easier for you to use. If so, you may want to search Excel help for the various ways you can do this, for example, http://office.microsoft.com/en-us/excel … 3915.aspx.

frony0 Member
From: London United Kingdom Registered: 2011-12-10 Posts: 257

Learn SQL!

(I'd tell you how, but I've forgotten exactly how myself...)

thecite Member
From: Adelaide Registered: 2009-02-05 Posts: 781

Thanks for your help!
Keep in mind that there are likely 1500+ characters that aren't in list #2, so any manual compilation would be an extreme hassle.

Anyway, I've got the UTF-8 .txt files, but how do I run them through sort.exe? I Googled it, but couldn't find any good explanations.

Last edited by thecite (2013 January 03, 12:30 pm)

Katsuo M.O.D.
From: Tokyo Registered: 2007-02-06 Posts: 887 Website

If you have an Apple computer then here's a simple AppleScript to do that using the TextEdit word processor application.

The script will go through each character of a TextEdit document (named docOne.rtf) and see if it is present anywhere in another document (named docTwo.rtf). It will make a list of all those that are not in docTwo. (First set up the two documents then paste everything below this paragraph into the AppleScript Editor and click "Run". The result will appear in the bottom pane. If there are thousands of characters it may take a few minutes.)


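-- Lists every character of docOne.rtf that does not appear anywhere in docTwo.rtf.
-- Both documents are referred to by name, so they should be open in TextEdit.
tell application "TextEdit"
    set totalList to "" -- collects the characters missing from docTwo
    set text1 to text of document "docOne.rtf"
    set text2 to text of document "docTwo.rtf"
    set totalChar to count of characters of text1

    -- check each character of docOne against the whole text of docTwo
    repeat with characX from 1 to totalChar
        set nexChar to character characX of text1
        if nexChar is not in text2 then set totalList to totalList & nexChar
    end repeat
    say "I have finished" -- spoken alert when the loop is done
    totalList -- the last value is what shows up in the result pane
end tell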
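For those not on a Mac, the same character-by-character idea can be sketched in Ruby. This is not a port of the script above, just a rough equivalent under a couple of assumptions: missing_kanji is a made-up helper name, and it only looks at characters that Unicode classifies as Han (via \p{Han}), so kana, punctuation and line breaks are ignored. Each missing kanji is reported once, in order of first appearance.

```ruby
# Sketch only: `missing_kanji` is a hypothetical helper name.

# Returns the kanji that occur in text1 but nowhere in text2.
def missing_kanji(text1, text2)
  known = text2.scan(/\p{Han}/).uniq   # every distinct kanji in the second text
  text1.scan(/\p{Han}/).uniq - known   # distinct kanji of the first text, minus those
end
```

To run it on files, the two arguments could be read with File.read("1.txt", mode: "r:BOM|UTF-8") and File.read("2.txt", mode: "r:BOM|UTF-8"), where the file names are placeholders.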

thecite Member
From: Adelaide Registered: 2009-02-05 Posts: 781

Thanks Katsuo, that sounds like exactly what I'm looking for. Do the two documents have to be saved to any particular location?

Edit: Worked like a charm, thank you.

Last edited by thecite (2013 January 03, 12:49 pm)

lauri_ranta Member
Registered: 2012-03-31 Posts: 139 Website

On OS X you could also save the lists to plain text files and run this in Terminal:

comm -23 <(sort 1.txt) <(sort 2.txt)

Another option using Ruby:

ruby -e 'puts File.readlines("1.txt") - File.readlines("2.txt")'
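In case it helps to see why the comm version sorts both files first: comm walks the two sorted inputs in step, and the -23 flags hide the "only in file 2" and "in both" columns, leaving the lines unique to file 1. A rough Ruby sketch of that walk, assuming both arrays are already sorted the same way (only_in_first_sorted is a hypothetical name, not anything comm exposes):

```ruby
# Sketch of a comm -23 style comparison on two pre-sorted arrays.
def only_in_first_sorted(a, b)
  result = []
  i = 0
  j = 0
  while i < a.length
    if j >= b.length || a[i] < b[j]
      result << a[i]  # a[i] cannot appear further along in b
      i += 1
    elsif a[i] == b[j]
      i += 1          # present in both; comm would put this in column 3
      j += 1
    else
      j += 1          # only in b; comm would put this in column 2
    end
  end
  result
end
```

For example, only_in_first_sorted(%w[a b d], %w[b c]) walks both lists once and keeps "a" and "d".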
