IBM's Classes for Unicode

Collation Framework


What is collation?

The collation framework performs locale-sensitive string comparison. You can use it to build searching and sorting routines for natural-language text, build tables of contents for large documents, or create efficient index lookups for database entries.
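As a rough illustration of locale-sensitive sorting, the sketch below uses the JDK's java.text.Collator as a stand-in; the ICU class names and signatures may differ, and the class and method names here (CollatorSortSketch, sortWithCollator) are hypothetical. It contrasts a plain code-point sort with a collator-based sort.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorSortSketch {
    // Returns a sorted copy of the input, ordered by the collator for the given locale.
    static String[] sortWithCollator(String[] words, Locale locale) {
        String[] copy = words.clone();
        Collator collator = Collator.getInstance(locale);
        Arrays.sort(copy, collator); // Collator is a Comparator, so it plugs into sort directly
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};

        // Plain String ordering compares code units, so uppercase 'B' sorts before 'a'.
        String[] byCodeUnit = words.clone();
        Arrays.sort(byCodeUnit);
        System.out.println(Arrays.toString(byCodeUnit)); // [Banana, apple, cherry]

        // The collator compares letters first and treats case as a minor difference.
        System.out.println(Arrays.toString(sortWithCollator(words, Locale.ENGLISH))); // [apple, Banana, cherry]
    }
}
```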

The ICU Collator classes provide services to allow:

The Collator classes have four comparison levels, which control which kinds of difference are considered significant:
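To illustrate how the comparison level changes what counts as a difference, here is a small sketch that again uses the JDK's java.text.Collator as a stand-in. The strength constants Collator.PRIMARY and Collator.TERTIARY are the JDK's names and are an assumption with respect to the ICU classes; StrengthSketch and compareAt are hypothetical names.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthSketch {
    // Compares two strings with an English collator set to the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // At primary strength only base letters matter, so case is ignored.
        System.out.println(compareAt(Collator.PRIMARY, "abc", "ABC") == 0);  // true

        // At tertiary strength the case difference becomes significant.
        System.out.println(compareAt(Collator.TERTIARY, "abc", "ABC") != 0); // true
    }
}
```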

The rule symbols and their usage

When used with the collation classes, a string is decomposed into one or more collation elements. The collation rules specify the order of these collation elements. The collation table is composed of a list of collation rules, where each rule takes one of three forms:

  1. <modifier>
  2. <relation> <text-argument>
  3. <reset> <text-argument1> <relation> <text-argument2>
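The second and third forms can be sketched with the JDK's java.text.RuleBasedCollator, used here as a stand-in. The sketch assumes the symbol `<` for a primary relation and `&` for a reset, which is what the JDK class accepts; RuleFormsSketch and build are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class RuleFormsSketch {
    // Builds a collator from a small custom rule string.
    static RuleBasedCollator build() {
        // "< a < b < d" : three rules of the form <relation> <text-argument>
        // "& b < c"     : one rule of the form <reset> <text-argument1> <relation> <text-argument2>,
        //                 which places c immediately after b at the primary level
        String rules = "< a < b < d & b < c";
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = build();
        // The resulting order is a, b, c, d even though c was declared last.
        System.out.println(collator.compare("b", "c") < 0); // true
        System.out.println(collator.compare("c", "d") < 0); // true
    }
}
```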

<modifier>

<text-argument>

A text-argument is any sequence of characters, excluding special characters (that is, common whitespace characters [0009-000D, 0020] and rule syntax characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). To use one of those characters in a text-argument, put it in single quotes (e.g. ampersand => '&'). Note that unquoted whitespace characters are ignored; e.g. "b c" is treated as "bc".
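A minimal sketch of quoting a rule syntax character, using the JDK's java.text.RuleBasedCollator as a stand-in and assuming `<` denotes a primary relation. QuotedRuleSketch and build are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class QuotedRuleSketch {
    // Builds a collator whose rules place a literal ampersand between a and b.
    static RuleBasedCollator build() {
        // The ampersand is a rule syntax character, so it is quoted to be read as text.
        String rules = "< a < '&' < b";
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = build();
        System.out.println(collator.compare("a", "&") < 0); // true
        System.out.println(collator.compare("&", "b") < 0); // true
    }
}
```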

<relation>

<reset>

Interesting Examples

The following is a list of interesting examples of rules, together with some string comparison results obtained using those rules. In the results, "<" denotes a primary less-than difference, "<<" a secondary less-than difference, "<<<" a tertiary less-than difference, and "==" equality:

Implementation Details

Three parts of the code will be carefully examined here:

Building the Collation Table

The process of building a collation table is as follows:

 

Incremental Comparison Diagram

 

Generating a Collation Key

The control flow for generating a collation key is as follows:

  1. Retrieve the next collation element of the source string. If the end of the string has been reached, go to step 5.
  2. Append the primary weight of the element to the primary weight buffer.
  3. Check whether secondary weights need to be processed. If so, append the secondary weight to the secondary weight buffer. If the collator is marked to process French secondary, reverse the order of all the secondary weights before the next primary weight is encountered.
  4. Check whether tertiary weights need to be processed. If so, append the tertiary weight to the tertiary weight buffer. Return to step 1.
  5. Concatenate the primary, secondary and tertiary weight buffers, separating them with a null delimiter, and return the concatenated buffer as the collation key.
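The steps above can be sketched roughly as follows. This is not the framework's actual code: it uses the JDK's java.text.CollationElementIterator to obtain weights, omits the French secondary reversal, skips zero primary weights, and offsets secondary and tertiary weights by one so that they cannot collide with the null delimiter. Those choices, and the names CollationKeySketch, buildKey and sampleCollator, are assumptions of the sketch.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class CollationKeySketch {
    // Builds a string key whose code-unit order mirrors the collator's ordering (sketch only).
    static String buildKey(RuleBasedCollator collator, String source) {
        StringBuilder primary = new StringBuilder();
        StringBuilder secondary = new StringBuilder();
        StringBuilder tertiary = new StringBuilder();
        int strength = collator.getStrength();
        CollationElementIterator it = collator.getCollationElementIterator(source);
        int element;
        // Step 1: fetch elements until the end of the string.
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            // Step 2: append the primary weight (zero means "ignorable" here and is skipped).
            int p = CollationElementIterator.primaryOrder(element);
            if (p != 0) {
                primary.append((char) p);
            }
            // Step 3: secondary weights, only if the strength asks for them.
            if (strength >= Collator.SECONDARY) {
                secondary.append((char) (CollationElementIterator.secondaryOrder(element) + 1));
            }
            // Step 4: tertiary weights, only if the strength asks for them.
            if (strength >= Collator.TERTIARY) {
                tertiary.append((char) (CollationElementIterator.tertiaryOrder(element) + 1));
            }
        }
        // Step 5: join the three buffers with null delimiters.
        return new StringBuilder()
                .append(primary).append('\0')
                .append(secondary).append('\0')
                .append(tertiary)
                .toString();
    }

    // A tiny rule set for demonstration: primary order a < b < c, with uppercase as a tertiary variant.
    static RuleBasedCollator sampleCollator() {
        try {
            return new RuleBasedCollator("< a , A < b , B < c , C");
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = sampleCollator();
        // Keys can be compared as plain strings instead of re-running the collator.
        System.out.println(buildKey(collator, "ab").compareTo(buildKey(collator, "ac")) < 0);
        System.out.println(buildKey(collator, "a").compareTo(buildKey(collator, "A")) < 0);
    }
}
```

Because all primary weights precede all secondary and tertiary weights in the key, a primary difference anywhere in the string outweighs any lower-level difference, which is the point of buffering the three levels separately.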

Q & A

  1. How do I customize the collation sequence?
    A: Use the RuleBasedCollator constructor to create your own Collator from a customized rule.
  2. Will the collation framework support surrogate and private use characters?
    A: It is one of our future work items.  However, no firm schedule has been set for it yet.
  3. How does turning on French secondary affect the generation of a collation key?
    A: In French, secondary differences are sorted backwards, so the secondary weights in the collation key are reversed.
  4. Is there any support for composing characters? If so, how does it work?
    A: Yes, it is based on the Normalizer interface.  When an expanding character is detected, the rule builder internally constructs collation entries for the precomposed version so that composed characters are handled correctly.
  5. Is there any plan for performance improvement, for instance, in contracting/expanding character lookup?
    A: Yes, performance enhancement is an ongoing work item.

 
