Normalizer

class U_COMMON_API Normalizer

Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text

Public Classes

enum EMode

The mode of a Normalizer object

NO_OP: Null operation for use with the {@link #Normalizer constructors} and the static {@link #normalize normalize} method
COMPOSE: Canonical decomposition followed by canonical composition
COMPOSE_COMPAT: Compatibility decomposition followed by canonical composition
DECOMP: Canonical decomposition
DECOMP_COMPAT: Compatibility decomposition

enum

The options for a Normalizer object

IGNORE_HANGUL: Option to disable Hangul/Jamo composition and decomposition

Public Fields
static const UChar DONE

Public Methods
Normalizer(const UnicodeString& str, EMode mode): Creates a new Normalizer object for iterating over the normalized form of a given string
Normalizer(const UnicodeString& str, EMode mode, int32_t opt): Creates a new Normalizer object for iterating over the normalized form of a given string
Normalizer(const CharacterIterator& iter, EMode mode): Creates a new Normalizer object for iterating over the normalized form of the given text
Normalizer(const CharacterIterator& iter, EMode mode, int32_t opt): Creates a new Normalizer object for iterating over the normalized form of the given text
Normalizer(const Normalizer& copy): Copy constructor
~Normalizer(): Destructor
static void normalize(const UnicodeString& source, EMode mode, int32_t options, UnicodeString& result, UErrorCode &status): Normalizes a String using the given normalization operation
static void compose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status): Compose a String
static void decompose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status): Static method to decompose a String
UChar current(void) const: Return the current character in the normalized text
UChar first(void): Return the first character in the normalized text
UChar last(void): Return the last character in the normalized text
UChar next(void): Return the next character in the normalized text and advance the iteration position by one
UChar previous(void): Return the previous character in the normalized text and decrement the iteration position by one
UChar setIndex(UTextOffset index): Set the iteration position in the input text that is being normalized and return the first normalized character at that position
void reset(void): Reset the iterator so that it is in the same state that it was just after it was constructed
UTextOffset getIndex(void) const: Retrieve the current iteration position in the input text that is being normalized
UTextOffset startIndex(void) const: Retrieve the index of the start of the input text
UTextOffset endIndex(void) const: Retrieve the index of the end of the input text
bool_t operator==(const Normalizer& that) const: Returns true when both iterators refer to the same character in the same character-storage object
Normalizer* clone(void) const: Returns a pointer to a new Normalizer that is a clone of this one
int32_t hashCode(void) const: Generates a hash code for this iterator
void setMode(EMode newMode): Set the normalization mode for this object
EMode getMode(void) const: Return the basic operation performed by this Normalizer
void setOption(int32_t option, bool_t value): Set options that affect this Normalizer's operation
bool_t getOption(int32_t option) const: Determine whether an option is turned on or off
void setText(const UnicodeString& newText, UErrorCode &status): Set the input text over which this Normalizer will iterate
void setText(const CharacterIterator& newText, UErrorCode &status): Set the input text over which this Normalizer will iterate
void getText(UnicodeString& result): Copies the text under iteration into the UnicodeString referred to by "result"

Documentation

Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. Normalizer supports the standard normalization forms described in Unicode Technical Report #15.
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):
00C1    LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041    LATIN CAPITAL LETTER A
0301    COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
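As a rough illustration of why the two encodings must compare equal, the sketch below decomposes text with a tiny hand-written table before comparing. It does not use Normalizer itself: `toyDecompose` and `canonicallyEqual` are hypothetical names, and the table covers only U+00C1, whereas a real implementation relies on the full Unicode data.

```cpp
#include <string>

// Hypothetical one-entry decomposition table: U+00C1 -> U+0041 U+0301.
// Real canonical decomposition uses the full Unicode character database.
std::u16string toyDecompose(const std::u16string& in) {
    std::u16string out;
    for (char16_t c : in) {
        if (c == 0x00C1) {
            out += char16_t(0x0041);   // LATIN CAPITAL LETTER A
            out += char16_t(0x0301);   // COMBINING ACUTE ACCENT
        } else {
            out += c;
        }
    }
    return out;
}

// Two strings match at the "user level" if their decomposed forms are identical.
bool canonicallyEqual(const std::u16string& a, const std::u16string& b) {
    return toyDecompose(a) == toyDecompose(b);
}
```

A plain comparison of the two sequences reports a mismatch, while `canonicallyEqual` treats them as the same text.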
Similarly, the string "ffi" can be encoded as three separate letters:
0066    LATIN SMALL LETTER F
0066    LATIN SMALL LETTER F
0069    LATIN SMALL LETTER I
or as the single character
FB03    LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
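The effect of a compatibility mapping on searching can be sketched as follows. This is an illustration only, not the Normalizer implementation: `toyCompatDecompose` and `containsFolded` are hypothetical helpers, and the table knows just the single mapping U+FB03 to "ffi".

```cpp
#include <string>

// Hypothetical one-entry compatibility table: U+FB03 -> U+0066 U+0066 U+0069.
std::u16string toyCompatDecompose(const std::u16string& in) {
    std::u16string out;
    for (char16_t c : in) {
        if (c == 0xFB03) {
            out += u"ffi";   // replace the ligature with its three letters
        } else {
            out += c;
        }
    }
    return out;
}

// Search for a pattern after folding compatibility characters on both sides.
bool containsFolded(const std::u16string& text, const std::u16string& pattern) {
    return toyCompatDecompose(text).find(toyCompatDecompose(pattern))
           != std::u16string::npos;
}
```

With this folding, a search for "office" finds text that was stored with the ligature, which a raw substring search would miss.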
Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
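The accent rearrangement mentioned above corresponds to the canonical ordering step of the Unicode Standard: each run of combining marks is stably sorted by canonical combining class. The sketch below assumes a hypothetical lookup `toyCombiningClass` that knows only four marks. It illustrates the idea and is not the code Normalizer uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical lookup covering a handful of combining marks; the values are
// the canonical combining classes from the Unicode Character Database.
int toyCombiningClass(char16_t c) {
    switch (c) {
        case 0x0327: return 202;  // COMBINING CEDILLA
        case 0x0323: return 220;  // COMBINING DOT BELOW
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        case 0x0308: return 230;  // COMBINING DIAERESIS
        default:     return 0;    // treated as a base character in this sketch
    }
}

// Stable-sort each run of combining marks by class. Marks with the same
// class keep their relative order, because that order is significant.
std::u16string toyCanonicalOrder(std::u16string s) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (toyCombiningClass(s[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < s.size() && toyCombiningClass(s[j]) != 0) ++j;
        std::stable_sort(s.begin() + i, s.begin() + j,
                         [](char16_t a, char16_t b) {
                             return toyCombiningClass(a) < toyCombiningClass(b);
                         });
        i = j;
    }
    return s;
}
```

An acute accent and a dot below end up in the same order whichever way they were typed, while an acute accent and a diaeresis (both class 230) are left in their original order.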
Normalizer adds one optional behavior, {@link #IGNORE_HANGUL}, that differs from the standard Unicode Normalization Forms. This option can be passed to the {@link #Normalizer constructors} and to the static {@link #compose compose} and {@link #decompose decompose} methods. This option, and any that are added in the future, will be turned off by default.
There are three common usage models for Normalizer. In the first, the static {@link #normalize normalize()} method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling {@link #first} and {@link #next}. Finally, you can use the {@link #setIndex setIndex()} and {@link #getIndex} methods to perform random-access iteration, which is very useful for searching.
Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the {@link CharacterIterator} interface.
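The mismatch between input indices and output characters can be seen in a small model. `ToyDecomposingIterator` below is a hypothetical stand-in, not the Normalizer class: it decomposes only U+00C1, its `next` and `getIndex` are simplified, and the sentinel value `TOY_DONE` is an assumption made for the sketch.

```cpp
#include <cstddef>
#include <string>

const char16_t TOY_DONE = 0xFFFF;  // assumed end-of-text sentinel for this sketch

class ToyDecomposingIterator {
public:
    explicit ToyDecomposingIterator(const std::u16string& input)
        : input_(input), inputIndex_(0), bufferPos_(0) {}

    // Returns the next unit of normalized output, or TOY_DONE at the end.
    char16_t next() {
        if (bufferPos_ == buffer_.size()) {
            if (inputIndex_ == input_.size()) return TOY_DONE;
            fill(input_[inputIndex_++]);
        }
        return buffer_[bufferPos_++];
    }

    // Position in the *input* text, not in the normalized output.
    std::size_t getIndex() const { return inputIndex_; }

private:
    // Hypothetical one-entry decomposition: U+00C1 -> U+0041 U+0301.
    void fill(char16_t c) {
        buffer_.clear();
        bufferPos_ = 0;
        if (c == 0x00C1) {
            buffer_ += char16_t(0x0041);
            buffer_ += char16_t(0x0301);
        } else {
            buffer_ += c;
        }
    }

    std::u16string input_;
    std::size_t inputIndex_;
    std::u16string buffer_;
    std::size_t bufferPos_;
};
```

For a two-character input consisting of U+00C1 followed by "b", this model returns three output characters, and the input index stays the same across the two characters produced from U+00C1.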
Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's {@link java.lang.Character} class does not have any data.

static const UChar DONE

enum EMode

The mode of a Normalizer object

NO_OP

Null operation for use with the {@link #Normalizer constructors} and the static {@link #normalize normalize} method. This value tells the Normalizer to do nothing but return unprocessed characters from the underlying String or CharacterIterator. If you have code which requires raw text at some times and normalized text at others, you can use NO_OP for the cases where you want raw text, rather than having a separate code path that bypasses Normalizer altogether.

See Also:: setMode

COMPOSE

Canonical decomposition followed by canonical composition. Used with the {@link #Normalizer constructors} and the static {@link #normalize normalize} method to determine the operation to be performed.

If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form C.

See Also:: setMode

COMPOSE_COMPAT

Compatibility decomposition followed by canonical composition. Used with the {@link #Normalizer constructors} and the static {@link #normalize normalize} method to determine the operation to be performed.

If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form KC.

See Also:: setMode

DECOMP

Canonical decomposition. This value is passed to the {@link #Normalizer constructors} and the static {@link #normalize normalize} method to determine the operation to be performed.

If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form D.

See Also:: setMode

DECOMP_COMPAT

Compatibility decomposition. This value is passed to the {@link #Normalizer constructors} and the static {@link #normalize normalize} method to determine the operation to be performed.

If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned off, this operation produces output that is in Unicode Normalization Form KD.

See Also:: setMode

enum

The options for a Normalizer object

IGNORE_HANGUL: Option to disable Hangul/Jamo composition and decomposition. This option applies to Korean text, which can be represented either in the Jamo alphabet or in Hangul characters, which are really just two or three Jamo combined into one visual glyph. Since Jamo takes up more storage space than Hangul, applications that process only Hangul text may wish to turn this option on when decomposing text.
The Unicode standard treats Hangul to Jamo conversion as a canonical decomposition, so this option must be turned off if you wish to transform strings into one of the standard Unicode Normalization Forms.

See Also:: setOption
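For reference, the Hangul to Jamo conversion that this option disables is an arithmetic mapping defined by the Unicode Standard. The sketch below follows that published algorithm for a single syllable. The function name `decomposeHangul` is hypothetical and the code is not taken from Normalizer.

```cpp
#include <string>

// Constants from the Unicode Standard's Hangul syllable algorithm.
const int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
const int VCount = 21, TCount = 28;
const int NCount = VCount * TCount;   // 588 syllables per leading consonant
const int SCount = 19 * NCount;       // 11172 precomposed syllables

// Decompose one precomposed Hangul syllable into two or three Jamo.
// Any other code unit is returned unchanged.
std::u16string decomposeHangul(char16_t syllable) {
    int sIndex = syllable - SBase;
    if (sIndex < 0 || sIndex >= SCount) {
        return std::u16string(1, syllable);
    }
    std::u16string jamo;
    jamo += char16_t(LBase + sIndex / NCount);              // leading consonant
    jamo += char16_t(VBase + (sIndex % NCount) / TCount);   // vowel
    int tIndex = sIndex % TCount;
    if (tIndex != 0) {
        jamo += char16_t(TBase + tIndex);                   // trailing consonant
    }
    return jamo;
}
```

The syllable U+D55C becomes the three Jamo U+1112, U+1161, U+11AB, which shows why decomposed Korean text takes more storage than the precomposed form.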

Normalizer(const UnicodeString& str, EMode mode)

Creates a new Normalizer object for iterating over the normalized form of a given string.

Parameters:: str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.

Normalizer(const UnicodeString& str, EMode mode, int32_t opt)

Creates a new Normalizer object for iterating over the normalized form of a given string.

The options parameter specifies which optional Normalizer features are to be enabled for this object.

Parameters:: str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is {@link #IGNORE_HANGUL}. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.

Normalizer(const CharacterIterator& iter, EMode mode)

Creates a new Normalizer object for iterating over the normalized form of the given text.

Parameters:: iter - The input text to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.

Normalizer(const CharacterIterator& iter, EMode mode, int32_t opt)

Creates a new Normalizer object for iterating over the normalized form of the given text.

Parameters:: iter - The input text to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is {@link #IGNORE_HANGUL}. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.

Normalizer(const Normalizer& copy)

Copy constructor

~Normalizer()

Destructor

static void normalize(const UnicodeString& source, EMode mode, int32_t options, UnicodeString& result, UErrorCode &status)

Normalizes a String using the given normalization operation.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is {@link #IGNORE_HANGUL}. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.

Parameters:: source - the input string to be normalized.
mode - the normalization mode.
options - the optional features to be enabled.
result - The normalized string (on output).
status - The error code.

static void compose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status)

Compose a String.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is {@link #IGNORE_HANGUL}. If you want the default behavior corresponding to Unicode Normalization Form C or KC, use 0 for this argument.

Parameters:: source - the string to be composed.
compat - Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
result - The composed string (on output).
status - The error code.

static void decompose(const UnicodeString& source, bool_t compat, int32_t options, UnicodeString& result, UErrorCode &status)

Static method to decompose a String.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is {@link #IGNORE_HANGUL}. The desired options should be OR'ed together to determine the value of this argument. If you want the default behavior corresponding to Unicode Normalization Form D or KD, use 0 for this argument.

Parameters:: source - the string to be decomposed.
compat - Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
result - The decomposed string (on output).
status - The error code.

UChar current(void) const

Return the current character in the normalized text

UChar first(void)

Return the first character in the normalized text. This resets the Normalizer's position to the beginning of the text.

UChar last(void)

Return the last character in the normalized text. This resets the Normalizer's position to be just before the input text corresponding to that normalized character.

UChar next(void)

Return the next character in the normalized text and advance the iteration position by one. If the end of the text has already been reached, {@link #DONE} is returned.

UChar previous(void)

Return the previous character in the normalized text and decrement the iteration position by one. If the beginning of the text has already been reached, {@link #DONE} is returned.

UChar setIndex(UTextOffset index)

Set the iteration position in the input text that is being normalized and return the first normalized character at that position.

Note: This method sets the position in the input text, while {@link #next} and {@link #previous} iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and {@link #getIndex}.

Returns:: the first normalized character that is the result of iterating forward starting at the given index. The index should not be less than {@link #startIndex} or greater than {@link #endIndex}.
Parameters:: index - the desired index in the input text.

void reset(void)

Reset the iterator so that it is in the same state that it was just after it was constructed. A subsequent call to next will return the first character in the normalized text. In contrast, calling setIndex(0) followed by next will return the second character in the normalized text, because setIndex itself returns the first character
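The difference between reset and setIndex(0) can be modeled with a small iterator over raw text. `ToyTextIterator` is a hypothetical stand-in that performs no normalization, and the sentinel `SKETCH_DONE` is an assumed value. It mirrors only the state behavior described above.

```cpp
#include <cstddef>
#include <string>

const char16_t SKETCH_DONE = 0xFFFF;  // assumed end-of-text sentinel

class ToyTextIterator {
public:
    explicit ToyTextIterator(const std::u16string& text)
        : text_(text), nextPos_(0) {}

    // Back to the freshly constructed state: nothing has been returned yet.
    void reset() { nextPos_ = 0; }

    // Moves to `index` and returns the character there, consuming it.
    char16_t setIndex(std::size_t index) {
        nextPos_ = index;
        return next();
    }

    // Returns the next unreturned character, or SKETCH_DONE at the end.
    char16_t next() {
        return nextPos_ < text_.size() ? text_[nextPos_++] : SKETCH_DONE;
    }

private:
    std::u16string text_;
    std::size_t nextPos_;
};
```

In this model, reset followed by next yields the first character, while setIndex(0) already returns the first character, so the following next yields the second.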

UTextOffset getIndex(void) const

Retrieve the current iteration position in the input text that is being normalized. This method is useful in applications such as searching, where you need to be able to determine the position in the input text that corresponds to a given normalized output character.

Note: This method returns a position in the input, while {@link #next} and {@link #previous} iterate through characters in the output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from {@link #setIndex} and getIndex.

UTextOffset startIndex(void) const

Retrieve the index of the start of the input text. This is the begin index of the CharacterIterator or the start (i.e. 0) of the String over which this Normalizer is iterating

UTextOffset endIndex(void) const

Retrieve the index of the end of the input text. This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating

bool_t operator==(const Normalizer& that) const

Returns true when both iterators refer to the same character in the same character-storage object

Normalizer* clone(void) const

Returns a pointer to a new Normalizer that is a clone of this one. The caller is responsible for deleting the new clone.

int32_t hashCode(void) const

Generates a hash code for this iterator

void setMode(EMode newMode)

Set the normalization mode for this object.

Note: If the normalization mode is changed while iterating over a string, calls to {@link #next} and {@link #previous} may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call {@link #setText setText()}, {@link #first}, {@link #last}, etc. after calling setMode.

Parameters:

newMode - the new mode for this Normalizer. The supported modes are:

{@link #COMPOSE} - Unicode canonical decomposition followed by canonical composition.
{@link #COMPOSE_COMPAT} - Unicode compatibility decomposition followed by canonical composition.
{@link #DECOMP} - Unicode canonical decomposition
{@link #DECOMP_COMPAT} - Unicode compatibility decomposition.
{@link #NO_OP} - Do nothing but return characters from the underlying input text.

See Also:

getMode

EMode getMode(void) const

Return the basic operation performed by this Normalizer

See Also:: setMode

void setOption(int32_t option, bool_t value)

Set options that affect this Normalizer's operation. Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is:

{@link #IGNORE_HANGUL} - Do not decompose Hangul syllables into the Jamo alphabet and vice-versa. This option is off by default (i.e. Hangul processing is enabled) since the Unicode standard specifies that Hangul to Jamo is a canonical decomposition. For any of the standard Unicode Normalization Forms, you should leave this option off.
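Since option values are meant to be OR'ed together, one plausible way to model setOption and getOption is as a bit mask. The sketch below is an assumption for illustration: `ToyOptions` is a hypothetical class, and `TOY_IGNORE_HANGUL` uses a made-up flag value, because the actual value of IGNORE_HANGUL is not stated here.

```cpp
#include <cstdint>

// Hypothetical flag value chosen for this sketch only.
enum ToyOption : std::int32_t { TOY_IGNORE_HANGUL = 0x0001 };

class ToyOptions {
public:
    ToyOptions() : bits_(0) {}   // every option starts turned off

    // Turn a single option bit on or off without touching the others.
    void setOption(std::int32_t option, bool value) {
        if (value) {
            bits_ |= option;
        } else {
            bits_ &= ~option;
        }
    }

    // Report whether the given option bit is currently set.
    bool getOption(std::int32_t option) const {
        return (bits_ & option) != 0;
    }

private:
    std::int32_t bits_;
};
```

A freshly constructed `ToyOptions` reports the option as off, matching the documented default, and leaving it off corresponds to the standard normalization behavior.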