Collating Strings

Different locales can apply different rules for collating strings, even when they use the same encoding.

The Issue

In English, sorting rules are extremely simple: each character sorts to exactly one unique place. Under ASCII, the numeric codes of the letters even follow alphabetical order. However, neither of those statements is necessarily true for other languages and other codesets. Furthermore:

Sorting order for a language may be completely unrelated to the (numerical) order of the characters in a given encoding.
Even with a correctly sorted list of the characters in a character set, you may not be able to sort words properly.
Locales using identically encoded character sets may use very different sorting rules.

Programs using ASCII can do simple arithmetic on characters and directly calculate sorting relationships; such programs frequently rely on truisms such as the fact that

'a' < 'b'

in ASCII. But internationalized programs cannot rely on ASCII and English sorting rules. Consider some non-English collation rule types:

One-to-Two mappings collate certain characters as if they were two. For example, the German ß collates as if it were "ss".
Many-to-One mappings collate a string of characters as if they were one. For example, Spanish sorts "ch" as one character, following "c" and preceding "d". In Spanish, the following list is in correct alphabetical order: calle, creo, chocolate, decir.
Don't-Care Character rules collate certain characters as if they were not present. For example, if "-" were a don't-care character, "co-op" and "coop" would sort identically.
First-Vowel rules sort words based first on the first vowel of the word, then by consonants (which may precede or follow the vowel in question).
Primary/Secondary sorts treat certain characters as equal on a first pass and distinguish them only to break a tie. For example, in French, a, á, à, and â all sort to the same primary location. If two strings (such as "tache" and "tâche") collate to the same primary order, then the secondary sort distinguishes them.
Special case sorts exist for some Asian languages. For example, Japanese kanji has no strict sorting rules. Kanji strings can be sorted by the strokes that make up the characters, by the kana (phonetic) spellings of the characters, or by other agreed-upon rules.

Clearly, a programmer cannot hope to collate strings correctly by simple arithmetic on character codes or by other traditional methods.
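
To make the point concrete, the sketch below checks whether a word list is in plain byte-value order, which is all that strcmp() can see. The helper name is_sorted_by_bytes and the two arrays are illustrative and not part of any library. The Spanish list given earlier fails the check because the code for "h" is smaller than the code for "r", so byte order puts "chocolate" ahead of "creo".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Word list in the Spanish alphabetical order quoted in the text. */
static const char *const spanish_order[] = { "calle", "creo", "chocolate", "decir" };

/* The same words arranged by raw byte value. */
static const char *const byte_order[] = { "calle", "chocolate", "creo", "decir" };

/* Hypothetical helper: returns 1 if words[0..n-1] is non-decreasing
 * under strcmp(), and 0 otherwise. */
int is_sorted_by_bytes(const char *const *words, size_t n)
{
    assert(words != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (strcmp(words[i - 1], words[i]) > 0)
            return 0;   /* an earlier word has a larger byte value */
    }
    return 1;
}
```

Here is_sorted_by_bytes(spanish_order, 4) returns 0, while is_sorted_by_bytes(byte_order, 4) returns 1: the list that is correct for the language is not the list that character arithmetic produces.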

The Solution

Locale-specific collation should be performed with strcoll() and strxfrm(). These are table-driven functions; the tables are supplied as part of locale support. The value of LC_COLLATE determines which ordering table to use. (See the strcoll(3) and strxfrm(3) reference pages.)
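
A minimal sketch of locale-aware sorting with qsort() and strcoll() follows. The names compare_collated and sort_words are illustrative. The sketch assumes the program has already selected a locale, typically by calling setlocale(LC_ALL, "") once at startup; without that call a C program runs in the "C" locale, where strcoll() typically orders strings by character value, as strcmp() does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: each argument points at one element of a
 * char* array; the strings are compared under the LC_COLLATE rules
 * of the current locale. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of n strings in place using
 * the collation order of the current locale. */
void sort_words(const char **words, size_t n)
{
    assert(words != NULL || n == 0);
    qsort(words, n, sizeof words[0], compare_collated);
}
```

Once a locale is in effect, sort_words(list, count) reorders list according to that locale's collation table rather than by character codes.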

strcoll() has the same interface as strcmp() and can be directly substituted into code that uses strcmp(). However, strcoll() can consume more CPU time, so code that calls it in a time-critical loop may need to be redesigned.
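
One possible redesign, sketched here rather than prescribed, is to transform each string once with strxfrm() and then compare the resulting keys with strcmp(). Comparing two transformed strings with strcmp() gives the same ordering as comparing the originals with strcoll(), so the table lookup is paid once per string instead of once per comparison. The helper name make_sort_key is illustrative, and the caller is assumed to free() the returned key.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a sort key for s under the current
 * locale. Returns a malloc'ed string, or NULL if allocation fails. */
char *make_sort_key(const char *s)
{
    assert(s != NULL);

    /* With a size of 0, strxfrm() writes nothing and only reports
     * how many bytes the key needs (excluding the terminator). */
    size_t len = strxfrm(NULL, s, 0);

    char *key = malloc(len + 1);
    if (key != NULL)
        strxfrm(key, s, len + 1);   /* second call fills the buffer */
    return key;
}
```

A sort routine could store each key alongside its original string, sort on the keys with strcmp(), and free the keys afterward. This trades extra memory for fewer table-driven comparisons, which tends to pay off when each string is compared many times.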