A string of MB characters can be considered a null-terminated array of bytes, exactly like a traditional string. A multibyte string may contain characters from multiple codesets. Usually, this is done by incorporating special bytes that indicate that the next character (and only the next character) will be in a different codeset. Very little application code should ever need to be aware of that, though; you should use the available library routines to find out information about multibyte strings rather than look at the underlying byte structure, because that structure varies from one encoding to another. For one example of an encoding that allows characters from multiple codesets, see "EUC".
Manipulation of individual characters in an MB string can be difficult, since finding a particular character or position in a string is nontrivial (see "Handling Multibyte Characters," below). Therefore, it is common to convert to WC strings for that kind of work.
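As a minimal sketch of the conversion step described above, the helper below copies an MB string into a newly allocated WC string using the standard mbstowcs() routine. The name mb_to_wc_copy is hypothetical and is not part of any library. The sketch assumes the program has already selected its locale with setlocale().

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns a newly allocated wide character copy of the
   multibyte string s, or NULL on an invalid character or allocation failure.
   The caller frees the result. */
wchar_t *mb_to_wc_copy(const char *s)
{
    /* A WC string never holds more characters than the MB string has bytes,
       so strlen(s) + 1 elements always leave room for the terminating null. */
    size_t max = strlen(s) + 1;
    wchar_t *w = malloc(max * sizeof *w);

    if (w == NULL)
        return NULL;
    if (mbstowcs(w, s, max) == (size_t) -1) {   /* invalid MB character */
        free(w);
        return NULL;
    }
    return w;
}
```

Once the string is in WC form, each element is exactly one character, so indexing and searching work as they do on a traditional single-byte string.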
You cannot tell how many bytes are in a particular character until you look at the character. You cannot look at the nth character in a string without looking at all the previous n - 1 characters, because you cannot tell where a character starts without knowing where the previous character ends. Given a byte, you don't know its position within a character. Thus, we say the string has state or is context-sensitive; that is, the interpretation we assign to any given byte depends on where we are in a character.
This analysis of characters is locale-dependent, and therefore must be done by routines that understand locale.
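The need to scan from the start of the string can be sketched as follows. The helper name mb_nth_char is hypothetical. It steps over n characters with mblen(), which interprets bytes according to the current locale, and it assumes that s points to the beginning of a character.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a pointer to character number n (counting
   from 0) in the multibyte string s. Returns a pointer to the terminating
   null if n equals the number of characters, and NULL if the string is
   shorter than that or contains an invalid character. */
const char *mb_nth_char(const char *s, size_t n)
{
    (void) mblen(NULL, 0);      /* reset any shift state before scanning */

    while (n > 0) {
        int len = mblen(s, MB_CUR_MAX);

        if (len <= 0)           /* 0: end of string, -1: invalid character */
            return NULL;
        s += len;               /* skip every byte of this character */
        n--;
    }
    return s;
}
```

The cost grows with n because every earlier character must be examined, which is one reason WC strings are preferred for this kind of work.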
Example 6-1 : Find Number of Bytes in an MB Character
    #include <stdlib.h>
    . . .
    size_t n;
    int len;
    char *pStr;
    . . .
    len = mblen(pStr, n);    /* examine no more than n bytes */

It is the application's responsibility to ensure that pStr points to the beginning of a character, not to the middle of a character.
The maximum number of bytes in a multibyte character is MB_LEN_MAX, which is defined in limits.h. The maximum number of bytes in a character under the current locale is given by the macro MB_CUR_MAX, defined in stdlib.h.
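One practical difference between the two limits is sketched below. MB_LEN_MAX is a compile-time constant and so can size an array, while MB_CUR_MAX depends on the current locale and may not be a constant expression. The helper name wc_to_bytes is hypothetical; wctomb() is the standard routine that does the work.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: stores the multibyte form of wc in buf and returns
   its length in bytes, or -1 if wc has no multibyte form in the current
   locale. buf must have room for MB_LEN_MAX bytes, which is enough for a
   character in any supported locale. */
int wc_to_bytes(wchar_t wc, char buf[MB_LEN_MAX])
{
    /* The result is not null-terminated. */
    return wctomb(buf, wc);
}
```

A caller would typically declare the buffer as char buf[MB_LEN_MAX]; and can rely on MB_CUR_MAX never exceeding MB_LEN_MAX.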
Note: Many code segments that deal with individual characters within a string are better served by wide character strings. Since counting characters often involves conversion anyway, such segments are often better written to convert to a WC string, do the work there, and then convert back. Getting the length without performing the conversion is straightforward, though it takes a little more code. mbtowc() converts one character and returns the number of bytes used; if NULL is passed as the address of the WC destination, it returns the same count without storing a converted character. Thus

    len = mblen(pStr, n);

is equivalent to

    len = mbtowc((wchar_t *) NULL, pStr, n);

In fact, mblen() calls mbtowc() to perform its count. Therefore, counting characters in an MB string without converting would look like the code in Example 6-2.
Example 6-2 : Counting MB Characters Without Conversion
    int cLen;
    char *tStr = pStr;

    numChars = 0;
    /* mbtowc() returns 0 at the terminating null, -1 on an invalid character */
    while ((cLen = mbtowc((wchar_t *) NULL, tStr, MB_CUR_MAX)) > 0) {
        tStr += cLen;    /* advance past this character */
        numChars++;
    }
    if (cLen == -1)
        numChars = -1;   /* invalid MB character */