Thus, when accepting UTF-8 input, you need to check if the input is
valid UTF-8.
Here is a list of all legal UTF-8 sequences; any character
sequence not matching this table is not a legal UTF-8 sequence.
In the following table, the first column shows the various character
values being encoded into UTF-8.
The second column shows how those characters are encoded as binary values;
an ``x'' indicates where the data is placed (either a 0 or 1), though
some values should not be allowed because they're not the shortest possible
encoding.
The last row shows the valid values each byte can have
(in hexadecimal).
Thus, a program should check that every character meets one of the patterns
in the right-hand column.
A ``-'' indicates a range of legal values (inclusive).
Of course, just because a sequence is a legal UTF-8 sequence doesn't
mean that you should accept it (you still need to do all your other
checking), but generally you should check any UTF-8 data for UTF-8 legality
before performing other checks.
Table 4-1. Legal UTF-8 Sequences
UCS Code (Hex) | Binary UTF-8 Format | Legal UTF-8 Values (Hex) |
---|
00-7F | 0xxxxxxx | 00-7F |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0*-BF 80-BF |
1000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EF 80-BF 80-BF |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90*-BF 80-BF 80-BF |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
40000-FFFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
100000-10FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F* 80-BF 80-BF |
200000-3FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | too large; see below |
04000000-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | too large; see below |
As I noted earlier, there are two standards for character sets,
ISO 10646 and Unicode, who have agreed to synchronize their
character assignments.
The definition of UTF-8 in ISO/IEC 10646-1:2000 and the IETF RFC
also currently support
five and six byte sequences to encode characters outside the range
supported by Uniforum's Unicode, but such values can't be used to
support Unicode characters and it's expected that a future version of
ISO 10646 will have the same limits.
Thus, for most purposes the five and six byte UTF-8 encodings aren't legal,
and you should normally reject them (unless you have a special purpose
for them).
This is set of valid values is tricky to determine, and in fact
earlier versions of this document got some entries
wrong (in some cases it permitted overlong characters).
Language developers should include a function in their libraries
to check for valid UTF-8 values, just because it's so hard to get right.
I should note that in some cases, you might want to cut slack (or use
internally) the hexadecimal sequence C0 80. This is an overlong sequence
that, if permitted, can represent ASCII NUL (NIL). Since C and C++
have trouble including a NIL character in an ordinary string,
some people have taken
to using this sequence when they want to represent NIL as part of the
data stream; Java even enshrines the practice.
Feel free to use C0 80 internally while processing data, but technically
you really should translate this back to 00 before saving the data.
Depending on your needs, you might decide to be ``sloppy'' and accept
C0 80 as input in a UTF-8 data stream.
If it doesn't harm security, it's probably a good practice to accept this
sequence since accepting it aids interoperability.
Handling this can be tricky.
You might want to examine the C routines developed by Unicode to
handle conversions, available at
ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c.
It's unclear to me if these routines are open source software (the
licenses don't clearly say whether or not they can be modified), so
beware of that.