Collation Details

Guide

If you are an advanced user interested in trying out more rules, here is a brief explanation of how they work. The Collation Rules field is a list of rules, where each rule takes one of three forms: a modifier, a relation followed by a text argument, or a reset followed by a text argument.

Text Argument

A text argument is any sequence of characters, excluding special characters (that is, whitespace characters and the characters used in modifier, relation and reset). If those characters are desired, you can put them in single quotes (as in the Ampersand example).
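As a rough illustration of the quoting rule, the following Python sketch splits a rule string into special characters and text arguments. The function name `tokenize` and the token layout are assumptions made for this example, not part of the Collation object, and the sketch does not handle a quote character inside a quoted run.

```python
SPECIAL = set("@<;,=&")  # modifier, relations and reset

def tokenize(rules):
    """Split a rule string into ("special", ch) and ("text", str) tokens."""
    tokens, text, has_text, i = [], [], False, 0
    while i < len(rules):
        ch = rules[i]
        if ch == "'":
            # Quoted run: everything up to the next quote is literal text.
            end = rules.index("'", i + 1)  # raises ValueError if unterminated
            text.append(rules[i + 1:end])
            has_text = True
            i = end + 1
            continue
        if ch in SPECIAL or ch.isspace():
            # A special or whitespace character ends the current text argument.
            if has_text:
                tokens.append(("text", "".join(text)))
                text, has_text = [], False
            if ch in SPECIAL:
                tokens.append(("special", ch))
        else:
            text.append(ch)
            has_text = True
        i += 1
    if has_text:
        tokens.append(("text", "".join(text)))
    return tokens

# The quoted ampersand is read as text, not as a reset.
example = tokenize("< a < '&'")
```

Here `example` holds four tokens: the two `<` relations, the text argument `a`, and the text argument `&`.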

Modifier

There is only a single modifier currently, which is used to specify that all accents (secondary differences) are backwards.

@

Indicates that accents are sorted backwards, as in French
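To make the effect of backwards accents concrete, here is a minimal Python sketch. It assumes that accents can be modelled as the combining marks left after Unicode decomposition; the `sort_key` function and its `backwards` flag are hypothetical names for this example and do not describe how the Collation object is implemented.

```python
import unicodedata

def sort_key(word, backwards=False):
    """Build (primary, secondary) keys: base letters, then accents per letter."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch  # attach the accent to the preceding letter
        else:
            primary.append(ch)
            secondary.append("")
    if backwards:
        # French-style: accent differences are compared from the end of the word.
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=sort_key)
french = sorted(words, key=lambda w: sort_key(w, backwards=True))
```

With forward accents the words come out as cote, coté, côte, côté; with backwards accents the middle two swap, giving cote, côte, coté, côté.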

Relation

The relations are the following:

<

Greater, as a letter difference (primary)

;

Greater, as an accent difference (secondary)

,

Greater, as a case difference (tertiary)

=

Equal
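One way to picture the three strengths is as a triple of weights per text argument. The sketch below is a simplification under stated assumptions: tokens are separated by whitespace, every text argument is a single character, and there are no modifiers or resets. The names `assign_keys` and `word_key` are invented for the example.

```python
def assign_keys(rules):
    """Map each text argument to a (primary, secondary, tertiary) weight."""
    keys = {}
    primary = secondary = tertiary = 0
    tokens = rules.split()
    for relation, text in zip(tokens[0::2], tokens[1::2]):
        if relation == "<":      # letter difference resets the lower levels
            primary, secondary, tertiary = primary + 1, 0, 0
        elif relation == ";":    # accent difference
            secondary, tertiary = secondary + 1, 0
        elif relation == ",":    # case difference
            tertiary += 1
        elif relation != "=":    # "=" keeps the previous weight unchanged
            raise ValueError(f"unknown relation: {relation!r}")
        keys[text] = (primary, secondary, tertiary)
    return keys

def word_key(word, keys):
    """Compare all primary weights first, then secondary, then tertiary."""
    weights = [keys[ch] for ch in word]
    return tuple([w[level] for w in weights] for level in range(3))

# "\u00e4" is a-umlaut and "\u00c4" is its capital form.
KEYS = assign_keys("< a , A ; \u00e4 , \u00c4 < b")
words = ["ab", "Aa", "\u00e4a", "aa"]
ordered = sorted(words, key=lambda w: word_key(w, KEYS))
```

The result is aa, Aa, äa, ab: the case and accent differences in the first letter only matter because the words are otherwise identical at the primary level, while "ab" sorts last on its second letter alone.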

Reset

There is only a single reset currently, which is used primarily for contractions and expansions, but which can also be used to add a modification at the end of a set of rules.

&

Indicates that the next rule follows the position where the reset text-argument would be sorted.

The reset does not put the text-argument into the sorting sequence.

This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:

Notice that the order is important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:
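Both statements can be checked with a small Python sketch of how a reset moves the insertion point. The rule strings below are illustrative assumptions chosen for this example, and `build_order` is a hypothetical helper that understands only the primary relation "<" and the reset "&", with whitespace between tokens.

```python
def build_order(rules):
    """Return the text arguments in sorted order ('<' and '&' only)."""
    order = []
    position = 0  # index at which the next item will be inserted
    tokens = rules.split()
    for operator, text in zip(tokens[0::2], tokens[1::2]):
        if operator == "&":
            # Reset: continue just after `text`; nothing is added to the sequence.
            position = order.index(text) + 1  # raises ValueError if absent
        elif operator == "<":
            order.insert(position, text)
            position += 1
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return order

# Three ways of writing the same order.
same_1 = build_order("< a < b < c")
same_2 = build_order("< a < b & b < c")
same_3 = build_order("< a < c & a < b")

# Swapping which item follows the reset changes the result.
differs_1 = build_order("< a < b & a < c")
differs_2 = build_order("< a < c & a < b")
```

The first three calls all produce a, b, c. In the last pair, the item after the reset lands immediately after "a", so the first call gives a, c, b and the second gives a, b, c.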

Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present (for example, "a < b & ae < e" is valid, since "a" is present in the sequence before "ae" is reset). In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e".

This difference appears in natural languages: in traditional Spanish "ch" is treated as though it contracts to a single character (expressed as "c < ch < d"), while in traditional German "ä" (a-umlaut) is treated as though it expands to two characters (expressed as "a & ae ; ä < b").
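The contrast between contraction and expansion can be sketched in Python as follows. This is a simplified model that covers only primary differences: a contraction is matched as the longest entry in the order, and an expansion is a plain substitution before matching. The names `collation_elements`, `SPANISH` and `GERMAN`, and the lowercase-only alphabets, are assumptions for the example.

```python
def collation_elements(word, order, expansions=None):
    """Return the rank of each element of `word`, using longest-match lookup."""
    expansions = expansions or {}
    # Expansion: replace one character by the sequence it sorts as.
    word = "".join(expansions.get(ch, ch) for ch in word)
    rank = {text: i for i, text in enumerate(order)}
    longest = max(len(text) for text in order)
    elements, i = [], 0
    while i < len(word):
        # Contraction: try the longest candidate first, so "ch" beats "c".
        for size in range(min(longest, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in rank:
                elements.append(rank[piece])
                i += size
                break
        else:
            raise KeyError(f"no rule covers {word[i]!r}")
    return elements

# Traditional Spanish: "ch" is one element between "c" and "d".
SPANISH = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
words = ["chico", "cuna", "dama", "casa"]
traditional = sorted(words, key=lambda w: collation_elements(w, SPANISH))

# Traditional German: a-umlaut ("\u00e4") sorts as if written "ae".
GERMAN = list("abcdefghijklmnopqrstuvwxyz")
same_primary = (collation_elements("b\u00e4r", GERMAN, {"\u00e4": "ae"})
                == collation_elements("baer", GERMAN))
```

With the contraction in place, "chico" sorts after "cuna" rather than before it, and the expanded a-umlaut yields the same primary elements as "ae". A fuller model would also record the secondary difference that the ";" relation places between the two.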

Ignorable Characters

The first rule must start with a relation (the examples we have used above are really fragments; "a < b" really should be "< a < b"). If, however, the first relation is not "<", then all the text-arguments up to the first "<" are ignorable. For example, ", - < a < b" makes "-" an ignorable character, as we saw earlier in the word "black-birds". In the samples for different languages, you see that most accents are ignorable.
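The following Python sketch shows one way to read the ignorable characters out of a rule string and apply them when sorting. It assumes whitespace-separated tokens and single-character text arguments. The tie-break, which places the word without ignorable characters first, is an assumption made for the example, and `ignorables` and `sort_key` are hypothetical names.

```python
def ignorables(rules):
    """Collect the text arguments that appear before the first '<'."""
    tokens = rules.split()
    found = set()
    for relation, text in zip(tokens[0::2], tokens[1::2]):
        if relation == "<":
            break
        found.add(text)
    return found

def sort_key(word, ignored):
    """Compare without ignorable characters; use them only to break ties."""
    stripped = "".join(ch for ch in word if ch not in ignored)
    return (stripped, word != stripped)

IGNORED = ignorables(", - < a < b")
words = ["black-birds", "blackbird", "blackbirds", "black-bird"]
ordered = sorted(words, key=lambda w: sort_key(w, IGNORED))
```

Because "-" is ignorable, "black-bird" sorts next to "blackbird" instead of ahead of every unhyphenated word, giving blackbird, black-bird, blackbirds, black-birds.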

Normalization and Accents

The Collation object automatically normalizes text internally to separate accents from base characters where possible. This is done both when processing the rules and when comparing two strings. Collation also uses the Unicode canonical mapping to ensure that combining sequences are sorted properly (for more information, see The Unicode Standard, Version 2.0).
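The kind of normalization described here can be illustrated with Python's standard `unicodedata` module. This is only an analogy: the module follows the Unicode version shipped with the Python interpreter, not Version 2.0, and nothing here claims to match the Collation object's internals. The helper name `decompose` is made up for the example.

```python
import unicodedata

def decompose(text):
    """Separate accents from base characters (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)

# A precomposed a-umlaut becomes "a" followed by a combining diaeresis.
precomposed = "\u00e4"
decomposed = decompose(precomposed)

# Two combining marks typed in either order end up in one canonical order,
# so both spellings compare as the same sequence.
dot_below_first = "a\u0323\u0307"
dot_above_first = "a\u0307\u0323"
same = decompose(dot_below_first) == decompose(dot_above_first)
```

After decomposition, a comparison sees the base letter "a" and the accent as separate items, which is what lets an ignorable accent be skipped at the primary level.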

Most languages that use accents sort them in a consistent fashion, immediately after the unmodified base character. This can be achieved by making the accents ignorable, and putting them in the right order at the beginning of the collation rules. When this is done, only special cases like the German "ä" need to be handled by explicit rules.

Errors

The following are errors:

If you produce one of these errors, a message at the bottom of the screen will tell you where the error is and select the incorrect text (note: on some browsers, the selection will not appear correctly).
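As a sketch of how such errors could be detected and located, the Python function below checks two conditions that the syntax described above implies must fail: a relation or reset that is not followed by a text argument, and a reset whose text argument has no initial substring already in the sequence. These two checks, the function name `find_error`, and the message wording are assumptions for the example; quoting is not handled, and leading text with no relation is tolerated as a fragment.

```python
OPERATORS = "<;,=&"

def find_error(rules):
    """Return (offset, message) for the first error found, or None."""
    known = []      # text arguments entered into the sequence so far
    pending = None  # (operator, offset) still waiting for its text argument
    i = 0
    while i < len(rules):
        ch = rules[i]
        if ch.isspace() or ch == "@":
            i += 1  # whitespace and the modifier need no text argument
        elif ch in OPERATORS:
            if pending:
                return (pending[1], f"no text argument after {pending[0]!r}")
            pending = (ch, i)
            i += 1
        else:
            start = i
            while (i < len(rules) and not rules[i].isspace()
                   and rules[i] not in OPERATORS and rules[i] != "@"):
                i += 1
            text = rules[start:i]
            if pending and pending[0] == "&":
                # A reset needs the text, or a prefix of it, in the sequence.
                if not any(text.startswith(entry) for entry in known):
                    return (start, f"reset text {text!r} is not in the sequence")
            else:
                known.append(text)
            pending = None
    if pending:
        return (pending[1], f"no text argument after {pending[0]!r}")
    return None

valid = find_error("< a < b & ae < e")
missing_text = find_error("< a < , b")
bad_reset = find_error("< a < b & e < f")
```

The returned offset is the position in the rule string where a caller could place the selection: `valid` is `None`, `missing_text` points at the second "<", and `bad_reset` points at the "e" that follows the reset.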



© Copyright 1997. All rights reserved. Taligent, Inc., IBM Corp.