
Collating Strings

Different locales can have different rules governing collation of strings, even within identical encodings.


The Issue

In English, sorting rules are simple: each character occupies exactly one position in the sort order. Under ASCII, the character codes of the letters within each case are even in alphabetical order. However, neither of those statements is necessarily true for other languages and other codesets.

Programs using ASCII can do simple arithmetic on characters and directly calculate sorting relationships; such programs frequently rely on truisms such as the fact that

'a' < 'b'

in ASCII. But internationalized programs cannot rely on ASCII and English sorting rules. Consider some non-English collation rule types:

It should be clear that a programmer cannot hope to collate strings by simple arithmetic or by traditional methods.
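The pitfall shows up even with plain English text. The following is a minimal sketch, assuming an ASCII-compatible execution character set; the helper name sorts_before_bytewise() is hypothetical and exists only for illustration. Because strcmp() compares code values, every uppercase letter sorts ahead of every lowercase letter.

```c
#include <string.h>

/*
 * Hypothetical helper: returns nonzero if a sorts before b when the
 * strings are compared purely by character code values, as strcmp() does.
 */
int sorts_before_bytewise(const char *a, const char *b)
{
    return strcmp(a, b) < 0;
}
```

Under ASCII, sorts_before_bytewise("Zebra", "apple") is true because 'Z' (90) has a lower code than 'a' (97), which is not the order a dictionary would use.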


The Solution

Locale-specific collation should be performed with strcoll() and strxfrm(). These functions are table-driven; the collation tables are supplied as part of locale support, and the current setting of the LC_COLLATE locale category determines which table is used. (See the strcoll(3) and strxfrm(3) reference pages.)
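As a rough sketch of how these pieces fit together, the code below selects collation rules with setlocale() and sorts an array of strings with qsort() and strcoll(). The function names compare_collated(), sort_strings_collated(), and use_environment_collation() are hypothetical, chosen for this illustration only.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() callback: orders two "const char *" elements with strcoll(). */
int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sorts an array of strings under the current LC_COLLATE setting. */
void sort_strings_collated(const char **strings, size_t count)
{
    qsort(strings, count, sizeof strings[0], compare_collated);
}

/*
 * Adopts the collation rules named by the user's environment.
 * Returns 0 on success, -1 if the requested locale is unavailable.
 */
int use_environment_collation(void)
{
    return setlocale(LC_COLLATE, "") != NULL ? 0 : -1;
}
```

A program would typically call use_environment_collation() once at startup and sort_strings_collated() wherever it previously sorted with strcmp(). Until setlocale() is called, the program runs in the "C" locale, where strcoll() orders strings the same way strcmp() does.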

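strcoll() has the same interface as strcmp() and can be directly substituted into code that uses strcmp(). However, strcoll() can consume more CPU time than strcmp(), so where it is called in a time-critical loop you may need to restructure the code, for example by transforming each string once with strxfrm() and then comparing the transformed strings with strcmp().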
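One way to reduce the cost of repeated comparisons is to precompute a sort key for each string with strxfrm(). The sketch below is illustrative only: make_sort_key() is a hypothetical helper, and whether caching keys actually pays off depends on how many comparisons each string takes part in.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a malloc()ed sort key for s under the
 * current LC_COLLATE setting, or NULL if memory cannot be allocated.
 * Comparing two keys with strcmp() gives the same ordering as calling
 * strcoll() on the original strings. The caller must free() the key.
 */
char *make_sort_key(const char *s)
{
    /* With a zero-length buffer, strxfrm() only reports the key length. */
    size_t needed = strxfrm(NULL, s, 0) + 1;   /* +1 for the terminator */
    char *key = malloc(needed);

    if (key != NULL)
        strxfrm(key, s, needed);
    return key;
}
```

A sort routine could build one key per string up front, sort on the keys with strcmp(), and free the keys afterward. This trades extra memory for fewer table lookups inside the comparison loop.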

