The RuleBasedCollator class provides a simple implementation of Collator, using data-driven tables. The user can create a customized table-based collation. RuleBasedCollator maps characters to collation keys.
Table collation has the following restrictions for efficiency (other subclasses may be used for more complex languages):
1. If the French secondary ordering is specified in a collation object, it is applied to the whole object.
2. All non-mentioned Unicode characters are at the end of the collation order.
3. Private use characters are treated as identical. The private use area in Unicode is U+E000-U+F8FF.
The collation table is composed of a list of collation rules, where each rule takes one of three forms:
    < modifier >
    < relation > < text-argument >
    < reset > < text-argument >
The following demonstrates how to create your own collation rules:
- Text Argument: A text argument is any sequence of characters, excluding special characters (that is, whitespace characters and the characters used in modifier, relation and reset). If those characters are desired, you can put them in single quotes (e.g. ampersand => '&').
- Modifier: There is a single modifier, which is used to specify that all secondary differences are sorted backwards.
'@' : Indicates that secondary differences, such as accents, are sorted backwards, as in French.
- Relation: The relations are the following:
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
- Reset: There is a single reset, which is used primarily for contractions and expansions, but which can also be used to add a modification at the end of a set of rules.
'&' : Indicates that the next rule follows the position where the reset text-argument would be sorted.
This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:
    a < b < c
    a < b & b < c
    a < c & a < b
Notice that the order is important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:
    a < b & a < c
    a < c & a < b
Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present (e.g. "a < b & ae < e" is valid since "a" is present in the sequence before "ae" is reset). In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e". This difference appears in natural languages: in traditional Spanish "ch" is treated as though it contracts to a single character (expressed as "c < ch < d"), while in traditional German "ä" (a-umlaut) is treated as though it expands to two characters (expressed as "a & ae ; ä < b").
Ignorable Characters
For ignorable characters, the first rule must start with a relation (the examples we have used above are really fragments; "a < b" really should be "< a < b"). If, however, the first relation is not "<", then all the text-arguments up to the first "<" are ignorable. For example, ", - < a < b" makes "-" an ignorable character, as we saw earlier in the word "black-birds". In the samples for different languages, you see that most accents are ignorable.
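The reset behaviour described earlier (a new item goes immediately after the reset text-argument) can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: it handles single-character text-arguments, the '<' relation and the '&' reset, and the function name buildOrder is hypothetical. It is not how RuleBasedCollator parses its rules.

```cpp
#include <cctype>
#include <string>

// Toy model (not the ICU parser): builds the primary order from a rule
// string that uses only single-character text-arguments, '<' and '&'.
// Returns an empty string if a reset names a character not yet entered.
std::string buildOrder(const std::string& rules) {
    std::string order;         // characters in collation order so far
    std::size_t insertAt = 0;  // position where the next '<' item goes
    char pending = '<';        // a leading "a" is read as "< a"
    for (char ch : rules) {
        if (std::isspace(static_cast<unsigned char>(ch))) continue;
        if (ch == '<' || ch == '&') { pending = ch; continue; }
        if (pending == '&') {
            // Reset: continue right after a character that is already present.
            const std::size_t pos = order.find(ch);
            if (pos == std::string::npos) return "";
            insertAt = pos + 1;
        } else {
            // Relation: the item goes immediately after the current position.
            order.insert(order.begin() + insertAt, ch);
            ++insertAt;
        }
    }
    return order;
}
```

In this model the three equivalent rule strings shown earlier all produce the order "abc", while "a < b & a < c" produces "acb", because "c" is inserted directly after "a" and therefore ahead of "b".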
Normalization and Accents
The Collator object automatically normalizes text internally to separate accents from base characters where possible. This is done both when processing the rules, and when comparing two strings. Collator also uses the Unicode canonical mapping to ensure that combining sequences are sorted properly (for more information, see The Unicode Standard, Version 2.0.)
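As a rough illustration of what separating accents from base characters means, the following sketch decomposes a few precomposed letters using a tiny hand-written table. The function name toyDecompose and the three-entry table are assumptions made for this example; the real Collator relies on the full Unicode canonical mappings.

```cpp
#include <map>
#include <string>

// Toy canonical decomposition: replaces a few precomposed letters with
// base letter + combining accent. The table is illustrative only.
std::u32string toyDecompose(const std::u32string& text) {
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00E1', U"a\u0301"},  // a-acute  -> a + combining acute
        {U'\u00E0', U"a\u0300"},  // a-grave  -> a + combining grave
        {U'\u00E4', U"a\u0308"},  // a-umlaut -> a + combining diaeresis
    };
    std::u32string out;
    for (char32_t c : text) {
        const auto it = table.find(c);
        if (it != table.end()) {
            out += it->second;  // expand the precomposed character
        } else {
            out += c;           // everything else is copied unchanged
        }
    }
    return out;
}
```

After this step a precomposed a-acute and the sequence "a" followed by U+0301 are the same string, so a comparison that runs on the decomposed form treats them alike.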
Errors
The following are errors:
- A text-argument contains unquoted punctuation symbols (e.g. "a < b-c < d").
- A relation or reset character not followed by a text-argument (e.g. "a < , b").
- A reset where the text-argument (or an initial substring of the text-argument) is not already in the sequence. (e.g. "a < b & e < f")
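The three error conditions above can be sketched as a small stand-alone checker. This is a simplified model written for illustration, not the validation performed by RuleBasedCollator: the names RuleError and checkRules are hypothetical, only ASCII input is considered, and an unterminated quote is reported as unquoted punctuation.

```cpp
#include <cctype>
#include <string>
#include <vector>

enum class RuleError { None, UnquotedPunctuation, MissingTextArgument, UnknownResetTarget };

// Relation and reset characters; each must be followed by a text-argument.
static bool isRelationOrReset(char c) {
    return c == '<' || c == ';' || c == ',' || c == '=' || c == '&';
}

// Toy checker for the three documented errors (illustrative only).
RuleError checkRules(const std::string& rules) {
    std::vector<std::string> entered;  // text-arguments placed by a relation
    char op = '<';          // a leading bare text-argument is read as "< text"
    bool needText = false;  // true right after a relation or reset character
    std::size_t i = 0;
    while (i < rules.size()) {
        const char c = rules[i];
        // Whitespace and the '@' modifier take no text-argument.
        if (std::isspace(static_cast<unsigned char>(c)) || c == '@') { ++i; continue; }
        if (isRelationOrReset(c)) {
            if (needText) return RuleError::MissingTextArgument;
            op = c; needText = true; ++i;
            continue;
        }
        // Read one text-argument, up to the next relation, reset or modifier.
        std::string text;
        while (i < rules.size() && !isRelationOrReset(rules[i]) && rules[i] != '@') {
            const unsigned char u = static_cast<unsigned char>(rules[i]);
            if (rules[i] == '\'') {
                // Quoted run: copy everything up to the closing quote.
                const std::size_t close = rules.find('\'', i + 1);
                if (close == std::string::npos) return RuleError::UnquotedPunctuation;
                text.append(rules, i + 1, close - i - 1);
                i = close + 1;
            } else if (std::ispunct(u)) {
                return RuleError::UnquotedPunctuation;
            } else {
                if (!std::isspace(u)) text += rules[i];
                ++i;
            }
        }
        if (op == '&') {
            // A reset needs the text, or an initial substring of it, to be present.
            bool found = false;
            for (const std::string& e : entered)
                if (!e.empty() && text.compare(0, e.size(), e) == 0) found = true;
            if (!found) return RuleError::UnknownResetTarget;
        } else {
            entered.push_back(text);
        }
        needText = false;
    }
    return needText ? RuleError::MissingTextArgument : RuleError::None;
}
```

Applied to the three sample strings in the list, this model reports UnquotedPunctuation, MissingTextArgument and UnknownResetTarget respectively, and it accepts "a < b & ae < e" because "a" is already in the sequence.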
Examples:
    Simple:    "< a < b < c < d"
    Norwegian: "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J
                < k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T
                < u,U< v,V< w,W< x,X< y,Y< z,Z
                < å=a\u030A,Å=A\u030A
                ;aa,AA< æ,Æ< ø,Ø"
(Here \u030A stands for the combining ring above.)
To create a table-based collation object, simply supply the collation rules to the RuleBasedCollator constructor. For example:
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedCollator *mySimple = new RuleBasedCollator(Simple, status);
Another example:
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedCollator *myNorwegian = new RuleBasedCollator(Norwegian, status);
To add rules on top of an existing table, simply supply the original rules and the modifications to the RuleBasedCollator constructor. For example:
    Traditional Spanish (fragment): ... & C < ch , cH , Ch , CH ...
    German (fragment): ...< y , Y < z , Z
                       & AE, Ä & AE, ä
                       & OE , Ö & OE, ö
                       & UE , Ü & UE, ü
    Symbols (fragment): ...< y, Y < z , Z
                        & Question-mark ; '?'
                        & Ampersand ; '&'
                        & Dollar-sign ; '$'
To create a collation object for traditional Spanish, the user can take the English collation rules and add the additional rules to the table. For example:
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString rules(DEFAULTRULES);
    rules += "& C < ch, cH, Ch, CH";
    RuleBasedCollator *mySpanish = new RuleBasedCollator(rules, status);
To sort symbols in an order similar to that of their alphabetic equivalents, you can do the following:
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString rules(DEFAULTRULES);
    rules += "& Question-mark ; '?' & Ampersand ; '&' & Dollar-sign ; '$' ";
    RuleBasedCollator *myTable = new RuleBasedCollator(rules, status);
Another way of creating the table-based collation object, mySimple, is:
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedCollator *mySimple = new
        RuleBasedCollator(" < a < b & b < c & c < d", status);
Or:
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedCollator *mySimple = new
        RuleBasedCollator(" < a < b < d & b < c", status);
This works because "< a < b < c < d" is the same as "< a < b < d & b < c" or "< a < b & b < c & c < d".
To combine collations from two locales (without error handling, for clarity):
    UErrorCode success = U_ZERO_ERROR;

    // Create an en_US Collator object
    Locale locale_en_US("en", "US", "");
    RuleBasedCollator* en_USCollator = (RuleBasedCollator*)
        Collator::createInstance( locale_en_US, success );

    // Create a da_DK Collator object
    Locale locale_da_DK("da", "DK", "");
    RuleBasedCollator* da_DKCollator = (RuleBasedCollator*)
        Collator::createInstance( locale_da_DK, success );

    // Combine the two
    // First, get the collation rules from en_USCollator
    UnicodeString rules = en_USCollator->getRules();
    // Second, get the collation rules from da_DKCollator
    rules += da_DKCollator->getRules();
    RuleBasedCollator* newCollator = new RuleBasedCollator( rules, success );
    // newCollator has the combined rules
Another, more interesting example is to modify an existing table to create a new collation object. For example, add "& C < ch, cH, Ch, CH" to the en_USCollator object to create your own English collation object:
    // Create a new Collator object with additional rules
    rules = en_USCollator->getRules();
    rules += "& C < ch, cH, Ch, CH";
    RuleBasedCollator* myCollator = new RuleBasedCollator( rules, success );
    // myCollator contains the new rules
The following example demonstrates how to change the order of non-spacing accents:
    UChar contents[] = {
        '=', 0x0301, ';', 0x0300, ';', 0x0302,
        ';', 0x0308, ';', 0x0327, ',', 0x0303,    // main accents
        ';', 0x0304, ';', 0x0305, ';', 0x0306,    // main accents
        ';', 0x0307, ';', 0x0309, ';', 0x030A,    // main accents
        ';', 0x030B, ';', 0x030C, ';', 0x030D,    // main accents
        ';', 0x030E, ';', 0x030F, ';', 0x0310,    // main accents
        ';', 0x0311, ';', 0x0312,                 // main accents
        '<', 'a', ',', 'A', ';', 'a', 'e', ',', 'A', 'E',
        ';', 0x00e6, ',', 0x00c6, '<', 'b', ',', 'B',
        '<', 'c', ',', 'C', '<', 'e', ',', 'E', '&',
        'C', '<', 'd', ',', 'D', 0 };
    UnicodeString oldRules(contents);
    UErrorCode status = U_ZERO_ERROR;
    // change the order of accent characters
    UChar addOn[] = { '&', ',', 0x0300, ';', 0x0308, ';', 0x0302, 0 };
    oldRules += addOn;
    RuleBasedCollator *myCollation = new RuleBasedCollator(oldRules, status);
The last example shows how to put a new primary ordering in before the default setting. For example, in Japanese collation, you can sort English characters either before or after Japanese characters:
    UErrorCode status = U_ZERO_ERROR;
    // get en_US collation rules
    RuleBasedCollator* en_USCollation =
        (RuleBasedCollator*) Collator::createInstance(Locale::US, status);
    // Always check the error code after each call.
    if (U_FAILURE(status)) return;
    // add a few Japanese characters to sort before English characters
    // suppose the last character before the first base letter 'a' in
    // the English collation rule is 0x2212
    UChar jaString[] = { '&', 0x2212, '<', 0x3041, ',', 0x3042, '<', 0x3043, ',', 0x3044, 0 };
    UnicodeString rules( en_USCollation->getRules() );
    rules += jaString;
    RuleBasedCollator *myJapaneseCollation = new RuleBasedCollator(rules, status);
NOTE: Typically, a collation object is created with Collator::createInstance().
Note: RuleBasedCollators with different Locale, CollationStrength and Decomposition mode settings will return different sort orders for the same set of strings. Locales have specific collation rules, and the way in which secondary and tertiary differences are taken into account, for example, will result in a different sorting order for the same strings.
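To make the effect of the strength setting concrete, here is a minimal sketch of a multi-level comparison over a hand-written weight table. Everything in it is an assumption made for illustration: the names ToyStrength, Weights, toyTable and toyCompare are hypothetical, the table covers five characters only, and the optional frenchSecondary flag models the '@' modifier by comparing secondary weights from the end of the string. It does not reproduce RuleBasedCollator's actual algorithm.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

enum ToyStrength { kPrimary = 0, kSecondary = 1, kTertiary = 2 };

struct Weights { int primary, secondary, tertiary; };

// Illustrative weights: letter, accent and case differences respectively.
static const std::map<char32_t, Weights>& toyTable() {
    static const std::map<char32_t, Weights> table = {
        {U'a', {1, 0, 0}}, {U'A', {1, 0, 1}},
        {U'\u00E1', {1, 1, 0}},  // a-acute: same letter, accent difference
        {U'b', {2, 0, 0}}, {U'B', {2, 0, 1}},
    };
    return table;
}

static int weightAt(const Weights& w, int level) {
    return level == 0 ? w.primary : level == 1 ? w.secondary : w.tertiary;
}

// Compares level by level up to `strength`; returns -1, 0 or 1.
// Characters missing from the toy table make at() throw std::out_of_range.
int toyCompare(const std::u32string& x, const std::u32string& y,
               ToyStrength strength, bool frenchSecondary = false) {
    for (int level = 0; level <= static_cast<int>(strength); ++level) {
        std::vector<int> kx, ky;
        for (char32_t c : x) kx.push_back(weightAt(toyTable().at(c), level));
        for (char32_t c : y) ky.push_back(weightAt(toyTable().at(c), level));
        if (level == 1 && frenchSecondary) {
            // Backwards secondary: the last accent difference decides.
            std::reverse(kx.begin(), kx.end());
            std::reverse(ky.begin(), ky.end());
        }
        if (kx != ky) return kx < ky ? -1 : 1;
    }
    return 0;
}
```

With this table, "a" and "A" compare equal at primary and secondary strength and differ only at tertiary strength, while "a" and a-acute already differ at secondary strength. Turning on the backwards flag reverses the result for a pair such as a-acute followed by "a" versus "a" followed by a-acute.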