Character Sets, Codesets, and Encodings

One major difference between nationalized and internationalized software is that internationalized software has a wide variety of methods available for encoding characters. Developers of internationalized software can no longer assume that all text is ASCII. The following three terms describe groupings of characters:

character set	An abstract collection of characters.
codeset	A character set with exactly one associated numerical encoding for each character. The English alphabet is a character set; ASCII is a codeset.
encoding	A set of characters and associated numbers; however, this term is more general than "codeset." A single encoding may include multiple codesets; Extended UNIX Code (EUC), for instance, is an encoding that provides for four codesets in one data stream.
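
To make the distinction between a codeset and an encoding concrete, the following is a minimal sketch of how a program might tell which of the four EUC codesets each character in a data stream belongs to. It assumes the EUC-JP byte layout (codeset 0 is one ASCII byte, codeset 1 is two bytes, codeset 2 is the single-shift byte 0x8E plus one byte, codeset 3 is the single-shift byte 0x8F plus two bytes); other EUC locales may use different lengths. The names euc_codeset, eucjp_char_len, and eucjp_count_codesets are hypothetical helpers written for this illustration, not library routines.

```c
#include <stddef.h>

/* Single-shift bytes that introduce EUC codesets 2 and 3. */
#define EUC_SS2 0x8E
#define EUC_SS3 0x8F

/*
 * Return the EUC codeset (0-3) of the character whose first byte is s[0],
 * or -1 if that byte cannot start a character. Hypothetical helper.
 */
int euc_codeset(const unsigned char *s)
{
    unsigned char c = s[0];

    if (c < 0x80)
        return 0;               /* codeset 0: ASCII */
    if (c == EUC_SS2)
        return 2;               /* codeset 2: introduced by SS2 */
    if (c == EUC_SS3)
        return 3;               /* codeset 3: introduced by SS3 */
    if (c >= 0xA1 && c <= 0xFE)
        return 1;               /* codeset 1: high bit set on every byte */
    return -1;
}

/*
 * Return the length in bytes of the character starting at s, or -1 if the
 * lead byte is invalid. The lengths are an assumption: they follow EUC-JP.
 */
int eucjp_char_len(const unsigned char *s)
{
    switch (euc_codeset(s)) {
    case 0:  return 1;
    case 1:  return 2;
    case 2:  return 2;          /* SS2 + 1 byte  */
    case 3:  return 3;          /* SS3 + 2 bytes */
    default: return -1;
    }
}

/*
 * Count how many characters of a NUL-terminated string fall in each
 * codeset. Returns 0 on success, or -1 on an invalid or truncated
 * character. Trail bytes are only checked against 0xA1-0xFE.
 */
int eucjp_count_codesets(const unsigned char *s, size_t counts[4])
{
    int i;

    for (i = 0; i < 4; i++)
        counts[i] = 0;

    while (*s != '\0') {
        int set = euc_codeset(s);
        int len = eucjp_char_len(s);

        if (len < 0)
            return -1;
        /* A NUL trail byte fails this check, so we never read past the end. */
        for (i = 1; i < len; i++) {
            if (s[i] < 0xA1 || s[i] == 0xFF)
                return -1;
        }
        counts[set]++;
        s += len;
    }
    return 0;
}
```

Given the bytes 'A', 0xA4 0xA2, 0x8E 0xB1, and 0x8F 0xB0 0xA1 followed by a NUL, eucjp_count_codesets reports one character in each of the four codesets, even though all of them arrive in a single data stream.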

This section describes these topics:

"Eight-Bit Cleanliness" explains how to make software 8-bit clean.
"Character Representation" discusses multibyte and wide characters.
"Multibyte Characters" covers using and handling multibyte characters, conversions to constant-size characters, and the number of bytes in a character and string.
"Wide Characters" explains wchar strings, support routines, and conversion to multibyte characters.
"Reading Input Data" covers data that does not originate from the user.

For information on installing and using fonts with an application, refer to Chapter 5, "Working With Fonts."
