Joined: Feb 2009
Posts: 755
Thanks:
0
I've got two big lists of kanji in separate Word documents. I want to compare them and see which kanji from list #1 aren't in list #2. Does anyone know how I'd go about doing this?
Any help would be appreciated, thanks.
Joined: Jul 2010
Posts: 252
Thanks:
0
Could you copy the kanji only from the two sources to two text files with one kanji on each line, sort them, then do a line by line comparison?
It depends on your OS, but on Windows, for example, I believe you could save them from Notepad as UTF-8 text, sort each file with sort.exe, then compare them textually with fc.exe (file compare) or with a text comparison tool like Beyond Compare.
When I've done similar things before, the only minor issue I've found is remembering to preserve the little UTF-8 file marker (the byte order mark) at the beginning of the text files when managing the output...
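If a scripting language is available, the same "one kanji per line, then compare" idea can be sketched in a few lines of Python. This is only a rough sketch, not a tested recipe for your files: the function names are made up for illustration, and it assumes each file is UTF-8 text with one entry per line. Opening the file as "utf-8-sig" is what takes care of the file marker mentioned above.

```python
from pathlib import Path


def read_entries(path):
    """Return the non-empty lines of a UTF-8 text file (one kanji per line)."""
    # "utf-8-sig" silently drops a leading byte order mark if the editor wrote one.
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_from_second(first, second):
    """Entries of `first` that never appear in `second`, sorted, without duplicates."""
    return sorted(set(first) - set(second))
```

With two hypothetical files named list1.txt and list2.txt, calling missing_from_second(read_entries("list1.txt"), read_entries("list2.txt")) would give the kanji that are only in the first list.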
Joined: Dec 2011
Posts: 256
Thanks:
1
Learn SQL!
(I'd tell you how, but I've forgotten exactly how myself...)
Joined: Feb 2009
Posts: 755
Thanks:
0
Thanks for your help!
Keep in mind that there are likely 1500+ characters that aren't in list #2, so any manual compilation would be an extreme hassle.
Anyway, I've got the UTF-8 .txt files, but how do I run them through sort.exe? I Googled it, but couldn't find any good explanations.
Edited: 2013-01-03, 1:30 pm
Joined: Feb 2007
Posts: 915
Thanks:
5
If you have an Apple computer then here's a simple AppleScript to do that using the TextEdit word processor application.
The script will go through each character of a TextEdit document (named docOne.rtf) and see if it is present anywhere in another document (named docTwo.rtf). It will make a list of all those that are not in docTwo. (First set up the two documents then paste everything below this paragraph into the AppleScript Editor and click "Run". The result will appear in the bottom pane. If there are thousands of characters it may take a few minutes.)
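tell application "TextEdit"
    -- Characters found in docOne.rtf but not in docTwo.rtf accumulate here
    set totalList to ""
    set text1 to text of document "docOne.rtf"
    set text2 to text of document "docTwo.rtf"
    set totalChar to count of characters of text1
    -- Check each character of docOne against the whole text of docTwo
    repeat with characX from 1 to totalChar
        set nexChar to character characX of text1
        if nexChar is not in text2 then set totalList to totalList & nexChar
    end repeat
    say "I have finished"
    -- The last expression is the script's result, shown in the bottom pane
    totalList
end tell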
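For anyone not on a Mac, the same character-by-character check can be sketched in Python. This is a loose equivalent rather than a translation of the script above: the function name chars_not_in is made up for illustration, and unlike the AppleScript it skips whitespace and reports each missing character only once.

```python
def chars_not_in(text1, text2):
    """Return the characters of text1 that do not occur anywhere in text2."""
    seen = set(text2)  # set lookup avoids rescanning text2 for every character
    missing = []
    for ch in text1:
        if ch.isspace() or ch in seen:
            continue
        seen.add(ch)  # remember it so a repeated character is reported once
        missing.append(ch)
    return "".join(missing)
```

The two arguments are plain strings, so the contents of the two documents would first have to be saved as text and read in.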
Joined: Feb 2009
Posts: 755
Thanks:
0
Thanks Katsuo, that sounds like exactly what I'm looking for. Do the two documents have to be saved to any particular location?
Edit: Worked like a charm, thank you.
Edited: 2013-01-03, 1:49 pm
Joined: Mar 2012
Posts: 128
Thanks:
1
On OS X you could also save the lists to plain text files, with one kanji per line, and run this in Terminal:
comm -23 <(sort 1.txt) <(sort 2.txt)
Another option using Ruby:
ruby -e 'puts File.readlines("1.txt").map(&:chomp) - File.readlines("2.txt").map(&:chomp)'