
The Unicode Basis of CFString Objects

Conceptually, a CFString object represents an array of Unicode characters (UniChar) along with a count of the number of characters. Unicode-based strings in Core Foundation provide a solid basis for internationalizing the software you develop. Unicode makes it possible to develop and localize a single version of an application for users who speak most of the world’s written languages, including Russian (Cyrillic), Arabic, Chinese, and Japanese.
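The conceptual model described above can be sketched in plain C. This is an illustration only, not how CFString is implemented: the names SketchUniChar, SketchString, sketch_char_at, and sketch_privet are hypothetical, and SketchUniChar merely stands in for the 16-bit UniChar type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UniChar: one 16-bit unsigned value per character. */
typedef uint16_t SketchUniChar;

/* Conceptual model only: an array of 16-bit characters plus a count. */
typedef struct {
    const SketchUniChar *chars;
    size_t length;
} SketchString;

/* Return the character at a given index, checking the bound first. */
static SketchUniChar sketch_char_at(SketchString s, size_t index) {
    assert(index < s.length);
    return s.chars[index];
}

/* The Russian word "Привет" as six Cyrillic code points. */
static const SketchUniChar kPrivet[] = {
    0x041F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442
};

/* Wrap the static array in the conceptual string structure. */
static SketchString sketch_privet(void) {
    SketchString s = { kPrivet, sizeof kPrivet / sizeof kPrivet[0] };
    return s;
}
```

In this sketch the length is a count of 16-bit values, not of bytes, so the six-letter Cyrillic word has a length of 6 regardless of how many bytes it would occupy in an 8-bit encoding.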

The Unicode standard is published by the Unicode Consortium (http://www.unicode.org), an international standards organization. As originally designed, the standard defines a universal, uniform encoding scheme that is 16 bits per character (later versions of the standard extend the code space beyond 16 bits, while UniChar remains a 16-bit value). A “character” in this scheme is the smallest useful element of text in a language; thus it can be a character as understood in most European languages, an ideogram (Chinese Han), a syllable (Japanese hiragana), or some other linguistic unit. Encoded characters also include mathematical, technical, and other symbols as well as diacritics and computer control characters. Each Unicode character is assigned a name and a unique numeric value, termed its “code point.”

The Unicode standard provides the capacity for encoding all the characters used for written languages throughout the world. With 16-bit encoding, Unicode makes over 65,000 code points possible. This capacity is in marked contrast to standard 8-bit encodings, which permit only 256 characters and thus necessitate elaborate ancillary schemes, such as shift or escape bits, to express characters other than those found in the common Indo-European scripts.
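The figure of over 65,000 is 2^16 = 65,536, the number of distinct 16-bit values. Later versions of the Unicode standard define code points above U+FFFF, and the UTF-16 encoding form represents each of them as a pair of 16-bit values (a surrogate pair). The following is a minimal sketch of that arithmetic in plain C; the names SketchUniChar and sketch_encode_utf16 are hypothetical and are not Core Foundation API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit UniChar value. */
typedef uint16_t SketchUniChar;

/*
 * Encode one code point as UTF-16. Writes one or two 16-bit values
 * into out and returns how many were written.
 */
static size_t sketch_encode_utf16(uint32_t code_point, SketchUniChar out[2]) {
    /* Valid code points end at U+10FFFF and exclude the surrogate range. */
    assert(code_point <= 0x10FFFF);
    assert(code_point < 0xD800 || code_point > 0xDFFF);

    if (code_point <= 0xFFFF) {
        /* Fits in a single 16-bit value. */
        out[0] = (SketchUniChar)code_point;
        return 1;
    }

    /* Split the 20 bits above 0x10000 into two 10-bit halves. */
    uint32_t offset = code_point - 0x10000;
    out[0] = (SketchUniChar)(0xD800 + (offset >> 10));   /* high surrogate */
    out[1] = (SketchUniChar)(0xDC00 + (offset & 0x3FF)); /* low surrogate */
    return 2;
}
```

Under this scheme, a count of 16-bit values can exceed the number of characters a reader would see, which matters when a string contains characters outside the original 16-bit range.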

Figure 1 Unicode versus other encodings of the same characters

In addition to its encoding scheme, the Unicode standard provides case mappings and sets aside 6,400 code points (U+E000 through U+F8FF) for private use. It also specifies mappings from the Unicode scheme to repertoires of international, national, and industry character sets. Figure 1 illustrates two of these mappings. String objects make frequent use of the encoding mappings. The underlying representation (and in many cases the underlying storage) of strings is Unicode-based. However, the encodings required by the programming interfaces and output devices that actually display the strings in the user interface are commonly 8-bit. Thus there is a need for efficient and accurate conversion between Unicode and other encodings. String objects largely fulfill that need.
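To make the conversion problem concrete, the following plain C sketch converts between 16-bit Unicode values and one 8-bit encoding, ISO Latin-1, whose 256 byte values map directly to U+0000 through U+00FF. It is an illustration under that assumption, not the conversion code that string objects use, and the names SketchUniChar, sketch_from_latin1, and sketch_to_latin1 are hypothetical. Characters that the 8-bit encoding cannot express are replaced with a caller-supplied substitute byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit UniChar value. */
typedef uint16_t SketchUniChar;

/* Widen ISO Latin-1 bytes to 16-bit values: byte 0xNN maps to U+00NN. */
static void sketch_from_latin1(const uint8_t *bytes, size_t count,
                               SketchUniChar *out) {
    assert(bytes != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        out[i] = bytes[i];
    }
}

/*
 * Narrow 16-bit values to ISO Latin-1. Values above U+00FF have no
 * Latin-1 equivalent and are replaced with loss_byte. Returns the
 * number of characters that had to be replaced.
 */
static size_t sketch_to_latin1(const SketchUniChar *chars, size_t count,
                               uint8_t loss_byte, uint8_t *out) {
    assert(chars != NULL && out != NULL);
    size_t lost = 0;
    for (size_t i = 0; i < count; i++) {
        if (chars[i] <= 0x00FF) {
            out[i] = (uint8_t)chars[i];
        } else {
            out[i] = loss_byte;
            lost++;
        }
    }
    return lost;
}
```

Widening never loses information, but narrowing can, which is why a conversion away from Unicode needs a policy for characters that are missing from the target encoding. Most 8-bit encodings do not map this directly and require lookup tables instead.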

For more information on the Unicode standard, see the consortium’s website. The consortium also publishes charts of Unicode code points and glyphs at www.unicode.org/charts/.


Last updated: 2008-03-11
