
Hanzi counter?

#1
I'm looking for an application or script that can count the number of unique hanzi in a given text. The Anki plugin for this doesn't seem to work, at least not for me.

Anyone know of any?
#2
Nobody? Sad
#3
http://lingua.mtsu.edu/chinese-computing...85187ede20
Maybe this will be of help to you.
#4
What you ask for can be found here:

http://www.chinese-forums.com/showthread.php?t=28000

(Check out HedgePig's Excel sheet.)
Edited: 2010-03-03, 12:26 pm
#5
If it's any help, regular expressions can match Unicode blocks and Unicode scripts (see bottom of page). You could filter out all unwanted characters with one simple regexp, split the string into characters with another, and then remove all duplicates by using the characters as keys in a hash or something like that.

I actually didn't know about these until last year.. and had been using hexadecimal code points in the RevTK code for a long while (>_>)
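
The filter-then-dedupe idea above might look roughly like this in Python. This is only a sketch under some assumptions: Python's standard `re` module has no Unicode script or block properties, so explicit code point ranges stand in for them here, and "hanzi" is assumed to mean CJK Unified Ideographs plus Extension A. The names `NON_HANZI` and `unique_hanzi` are made up for illustration. No separate split step is needed because a Python string already iterates character by character.

```python
import re

# Assumption: "hanzi" = CJK Unified Ideographs (U+4E00-U+9FFF)
# plus Extension A (U+3400-U+4DBF). Widen the class if you need more.
NON_HANZI = re.compile(r'[^\u3400-\u4DBF\u4E00-\u9FFF]')

def unique_hanzi(text):
    """Return the distinct hanzi in text, in order of first appearance."""
    # Step 1: one regexp strips every character that is not a hanzi.
    hanzi_only = NON_HANZI.sub('', text)
    # Step 2: dict keys act as the "hash" that drops duplicates.
    return list(dict.fromkeys(hanzi_only))

sample = '一匹馬, 兩隻貓! 一個人 abc 123'
chars = unique_hanzi(sample)
print(len(chars), ''.join(chars))
```

Using dict keys instead of a plain set keeps the characters in the order they first appear, which is handy if you want the list and not just the count.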
#6
Do you just want the number? This could be done in Python in one line of code. If you need it done, let me know. (I can also get you the list of characters in one line, if need be.)
#7
vorpal wrote: Do you just want the number? This could be done in Python in one line of code. If you need it done, let me know. (I can also get you the list of characters in one line, if need be.)
Well, it could be done in one line of C, but that doesn't mean you should stack up all your code on the same line. Weird formatting was cute the first time someone did it on IOCCC, but not anymore.

There are multiple solutions above that can do what the OP wanted. He has no doubt figured it out or given up by now.
#8
Thanks for your responses, guys. As a programmer myself I could build it quite easily, but I was hoping to save myself the time: if one already existed, I wouldn't have to divert time from studying. HerrPeterson's link had the right solution. An Excel file posted there lets you input a list of characters to match against, and it gives the unique characters in the text that are in and not in that list. Quite handy.

I should probably create a web-based version of it for anyone who needs it.

Thanks for the help, guys.
#9
You can use or tweak my Perl scripts. I did this exact same thing for counting kanji in Wikipedia!

http://foosoft.net/japanese/kanji_frequency/
Edited: 2010-04-27, 10:19 pm
#10
For the record, in Python it is this difficult:
Code:
print(len(set(u'日本語今後コンゴ日本語')))
Only counting hanzi (and not kana, spaces, etc.) would actually require some work.

excluding from a list:
Code:
sometext=u"""
一匹馬
兩隻貓
三個人
"""
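print(''.join(set(sometext) - set(u'一二三四五六')))
gives (in arbitrary order, since sets are unordered, and along with the newline character from the text): 兩個馬貓匹隻人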
#11
You can just iterate over every character in the file, checking its Unicode block, right? All the characters are nicely organized, so you just have to know the ranges.
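
Something along these lines, as a rough Python sketch. The ranges listed are my assumption about which Unicode blocks should count as hanzi, so adjust them to taste; `HANZI_RANGES`, `is_hanzi` and `count_unique_hanzi` are just illustrative names.

```python
# Assumed set of Unicode blocks to treat as hanzi; extend or trim as needed.
HANZI_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def is_hanzi(ch):
    """True if the character's code point falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANZI_RANGES)

def count_unique_hanzi(text):
    """Iterate over every character and count the distinct hanzi."""
    return len({ch for ch in text if is_hanzi(ch)})

# Kana are skipped because they sit outside the ranges above.
print(count_unique_hanzi(u'日本語今後コンゴ日本語'))  # prints 5
```

For a file you would read it with the right encoding first, e.g. `open(path, encoding='utf-8').read()`, and pass the result in.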