Multibyte Characters

Multibyte Characters

A multibyte character is a series of bytes. The character itself contains information on how many bytes long it is. Multibyte characters are referenced as strings (and are therefore of type char *); before parsing, a string is indistinguishable from a multibyte character. The zero byte is still used as a string (and MB character) terminator.

A string of MB characters can be considered a null-terminated array of bytes, exactly like a traditional string. A multibyte string may contain characters from multiple codesets. Usually, this is done by incorporating special bytes that indicate that the next character (and only the next character) will be in a different codeset. Very little application code should ever need to be aware of that, though; you should use the available library routines to find out information about multibyte strings rather than look at the underlying byte structure, because that structure varies from one encoding to another. For one example of an encoding that allows characters from multiple codesets, see "EUC".

Use of Multibyte Strings

Multibyte strings are very easy to pass around. They efficiently use space (both data and disk space), since "extra" bytes are used only for characters that require them. MB strings can be read and written without regard to their contents, as long as the strings remain intact. Displaying MB strings on a terminal is done with the usual routines: printf(), puts(), and so on. Many programs (such as cat) need never concern themselves with the multibyte nature of MB strings, since they operate on bytes rather than on characters; so MB strings are often used for string I/O.

Manipulation of individual characters in an MB string can be difficult, since finding a particular character or position in a string is nontrivial (see "Handling Multibyte Characters," below). Therefore, it is common to convert to WC strings for that kind of work.

Handling Multibyte Characters

Usually, multibyte characters are handled just like char strings. Editing such strings, however, requires some care.

You cannot tell how many bytes are in a particular character until you look at the character. You cannot look at the nth character in a string without looking at all the previous n - 1 characters, because you cannot tell where a character starts without knowing where the previous character ends. Given a byte, you don't know its position within a character. Thus, we say the string has state or is context-sensitive; that is, the interpretation we assign to any given byte depends on where we are in a character.

This analysis of characters is locale-dependent, and therefore must be done by routines that understand locale.

Conversion to Constant-Size Characters

Multibyte characters and strings are convertible to wchars using mbtowc() for individual characters and mbstowcs() for strings (see the mbtowc(3) and mbstowcs() reference pages).

How Many Bytes in a Character?

To find out how many bytes make up a given single MB character, use mblen(), as shown in Example 6-1 (see also the mblen(3) reference page).

Example 6-1 : Find Number of Bytes in an MB Character

#include <stdlib.h>
. . .
size_t n;
int len;
char *pStr;
. . .
len = mblen(pStr, n); /* examine no more than n bytes */

It is the application's responsibility to ensure that pStr points to the beginning of a character, not to the middle of a character.

The maximum number of bytes in a multibyte character is MB_LEN_MAX, which is defined in limits.h. The maximum number of bytes in a character under the current locale is given by the macro MB_CUR_MAX, defined in stdlib.h.

How Many Bytes in an MB String?

Since strlen() simply counts bytes before the first NULL, it tells you how many bytes are in an MB string.

How Many Characters in an MB String?

When mbstowcs() converts MB strings to WC strings, it returns the number of characters converted. This is the simplest way to count characters in an MB string.

Note: Many code segments that deal with individual characters within a string are better served by wide character strings. Since counting often involves conversion, such segments are often better served by working with a WC string, then converting back. Getting the length without performing the conversion is straightforward, but not as simple. mbtowc() converts one character and returns the number of bytes used, but returns the same information without conversion if a NULL is passed as the address of the WC destination. Thus

len = mblen(pStr, n);

is equivalent to

len = mbtowc((wchar_t *) NULL, pStr, n);

In fact, mblen() calls mbtowc() to perform its count. Therefore, counting characters in an MB string without converting would look like the code in Example 6-2.

Example 6-2 : Counting MB Characters Without Conversion

int cLen;
char *tStr = pStr;
numChars = 0;
cLen = mbtowc((wchar_t *) NULL, tStr, MB_CUR_MAX);
while (cLen > 0) {
    tStr += cLen;
    numChars++;
    cLen = mbtowc((wchar_t *) NULL, tStr, MB_CUR_MAX);
    if (cLen == -1)
        numChars = cLen; /* invalid MB character */
}