Alphabetical order of Kanji

#1
This came up in a discussion at work about the string collation class in Java and the Unicode collation algorithm as described on unicode.org.

So, what's the typical ordering, or aiueo-jun, used in Japanese for kanji strings? For example, if you were to sort words such as prefecture names or people's names in Excel or a database, is the ordering typically done phonetically, in straight Unicode order, or by some other means? Heisig order, maybe?
#2
When you sort comments on nicovideo (the only sort I can think of offhand), it does roman letters first, then symbols, then hiragana (in aiueo), then katakana, then kanji, then JIS symbols, then JIS roman letters (I guess), and then some other symbols.

The kanji appear to be ordered by radical.
#3
Apparently there's a standard for sorting order (JIS X 4061:1996). The standard itself doesn't seem to be conveniently available, but the Perl ShiftJIS::Collate docs give an idea of what it does: basically, sort by character class, with kanji simply in JIS order.

Prefecture names: the list at http://www.tvguide.or.jp seems to be in geographical order, so it starts with 北海道、青森, works its way down through 東京 and 愛知 and ends with 沖縄. http://www.bookoff.co.jp/shop/shop.php seems to have the same list.
#4
Thanks, that's interesting that things like JIS order come into play. Japanese must be a very hard language to alphabetize due to multiple readings.
#5
I wondered about this too. When I sort by kanji in Anki or OpenOffice Calc, it doesn't seem to be ordered by JIS or Kuten codes (which have similar ordering, but different actual codes). Do different applications select their own order instead of following that JIS order guideline? Or is the order somehow based on that hex number thingy instead of the 4-digit grid code we see? I couldn't figure it out.

This shows the JIS order - which isn't what I get. hmm
#6
For kanji, the default ordering of the Unicode collation algorithm is more or less code point order, with a small number of exceptions. It's meant to be a reasonable default, and is not phonetic.

EDIT: Those applications probably use Unicode, and Unicode is not JIS. I would expect similar, but markedly different orderings.
Edited: 2010-07-27, 12:45 am
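A rough way to see the two orders side by side, assuming an `iconv` with EUC-JP support is installed: sorting UTF-8 text bytewise in the C locale gives Unicode code point order, while converting to EUC-JP first and sorting bytewise gives JIS X 0208 row/cell order. The four sample kanji are just an illustration.

```shell
# Four sample kanji, one per line (UTF-8 text assumed).
kanji=$(printf '%s\n' 右 愛 一 亜)

# UTF-8 byte order matches Unicode code point order, so a bytewise
# sort in the C locale sorts by code point.
by_codepoint=$(printf '%s\n' "$kanji" | LC_ALL=C sort)

# EUC-JP byte order follows the JIS X 0208 row/cell (kuten) order, so
# convert, sort bytewise, and convert back to UTF-8.
by_jis=$(printf '%s\n' "$kanji" | iconv -f UTF-8 -t EUC-JP | LC_ALL=C sort | iconv -f EUC-JP -t UTF-8)

echo "code point order:" $by_codepoint
echo "JIS order:" $by_jis
```

With these four characters the code point sort starts with 一 and the JIS sort starts with 亜, which matches the mismatch described above.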
#7
Your locale settings can affect sort order. On Linux, you can override the locale for an application. To experiment, create a short text file with a mix of ASCII, 8-bit, and Japanese text, then try:

$ sort file.txt
$ LC_ALL=C sort file.txt
$ LC_ALL=en_US.UTF-8 sort file.txt
$ LC_ALL=ja_JP.UTF-8 sort file.txt

to see possible differences. For example, the default sort (on Fedora with no LC_* settings) does case folding, while the "C" locale does not (it's strictly byte-value order, I believe).

So, replace "sort ..." with your application to see if it helps.

(LC_ALL is overkill -- you probably can use LC_COLLATE instead, but it depends on the application.)
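A self-contained version of this experiment, as a sketch: the `sample.txt` file name and the sample words are made up for illustration, and only the C locale is pinned down because other locales may not be installed on a given machine.

```shell
# Build a small test file mixing ASCII, kana and kanji (UTF-8 assumed).
# The file name and the words are arbitrary examples.
printf '%s\n' banana Apple 東京 あおもり apple カナ > sample.txt

# In the C locale, sort compares raw byte values: uppercase ASCII sorts
# before lowercase, and for UTF-8 input ASCII sorts before hiragana,
# hiragana before katakana, and katakana before kanji.
c_sorted=$(LC_ALL=C sort sample.txt)
printf '%s\n' "$c_sorted"

# For comparison, sort with whatever locale the shell is currently using.
sort sample.txt
```

If the two listings differ, the current locale is applying its own collation rules rather than plain byte order.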
#8
Yes, thanks, mafried. It looks like both Anki and OpenOffice sort by Unicode code point. The order is quite different.

Does this mean that to sort a list so that it can be visually searched (phonetic order), we'd need to add either a romaji or a hiragana column? I wonder if some programs let you choose JIS or Unicode sorting? (I guess I'm assuming Japanese programs would use JIS or have the option?)
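One way the extra-column idea could look, as a sketch: keep a hiragana reading next to each kanji entry and sort on that column. The tab-separated layout and the readings are typed in by hand for this example. Bytewise order on hiragana only approximates gojūon order, since voiced and small kana are treated as separate characters from their base kana.

```shell
# Tab-separated "kanji<TAB>reading" pairs; layout and readings are
# supplied by hand for this example.
tab=$(printf '\t')
list=$(printf '%s\t%s\n' \
  東京 とうきょう \
  愛知 あいち \
  北海道 ほっかいどう \
  沖縄 おきなわ \
  青森 あおもり)

# Sort on the second (reading) field only, bytewise, then keep the kanji.
by_reading=$(printf '%s\n' "$list" | LC_ALL=C sort -t "$tab" -k 2,2 | cut -f 1)
printf '%s\n' "$by_reading"
```

The same approach works with a romaji column, and the C locale keeps the result independent of the machine's locale settings.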
mushi wrote: "Japanese must be a very hard language to alphabetize due to multiple readings."
Some info on the JIS kanji ordering (wikipedia):

Level 1 kanji: (most common)
- by “representative reading”, either on or kun (chosen for this standard only)
- in gojūon order (あいうえお)
- for verb kun readings, the 連用形 (continuative form) is used
- where kanji share a reading, those for which it is an on reading go first
- where kanji share the same on or kun reading, they are ordered by primary radical and stroke count

Level 2 kanji are arranged in order of primary radical and stroke count, then by reading. Variants in both levels directly follow their exemplar form.