Columbia Kermit


Text File | 1999-09-12 | 3KB | 64 lines

UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn <mkuhn@acm.org> - 1999-04-28

This test text examines how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences.

According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
UTF-8 shall interpret a "malformed sequence in the same way that it
interprets a character that is outside the adopted subset".

Test sequences (all enclosed in ""):

Correct UTF-8 text (Greek word 'kosme'):          "κόσμε"

Correct 2-byte sequence (U+00000080):             "┬Ç"
Correct 3-byte sequence (U+00000800):             "αáÇ"
Correct 4-byte sequence (U+00010000):             "≡ÉÇÇ"
Correct 5-byte sequence (U+00200000):             "°êÇÇÇ"
Correct 6-byte sequence (U+04000000):             "ⁿäÇÇÇÇ"

Correct 2-byte sequence (U+000007ff):             "▀┐"
Correct 3-byte sequence (U+0000ffff):             "∩┐┐"
Correct 4-byte sequence (U+001fffff):             "≈┐┐┐"
Correct 5-byte sequence (U+03ffffff):             "√┐┐┐┐"
Correct 6-byte sequence (U+7fffffff):             "²┐┐┐┐┐"

Overlong 2-byte sequence (U+0000):                "└Ç"
Overlong 3-byte sequence (U+0000):                "αÇÇ"
Overlong 4-byte sequence (U+0000):                "≡ÇÇÇ"
Overlong 5-byte sequence (U+0000):                "°ÇÇÇÇ"
Overlong 6-byte sequence (U+0000):                "ⁿÇÇÇÇÇ"

Unexpected continuation byte (10000000):          "Ç"
Another lonely continuation byte (10111111):      "┐"

Sequence of 2 unexpected continuation bytes:      "Ç┐"
Sequence of 3 unexpected continuation bytes:      "Ç┐Ç"
Sequence of 4 unexpected continuation bytes:      "Ç┐Ç┐"
Sequence of 5 unexpected continuation bytes:      "Ç┐Ç┐Ç"
Sequence of 6 unexpected continuation bytes:      "Ç┐Ç┐Ç┐"
Sequence of 7 unexpected continuation bytes:      "Ç┐Ç┐Ç┐Ç"

Sequence of all 64 possible continuation bytes (10000000-10111111):

   "ÇüéâäàåçêëèïîìÄÅ ÉæÆôöòûùÿÖÜ¢£¥₧ƒ áíóúñÑªº¿⌐¬½¼¡«» ░▒▓│┤╡╢╖╕╣║╗╝╜╛┐"

Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
each followed by a space character:

   "└ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧ ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀ "

Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
each followed by a space character:

   "α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩ "

Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
each followed by a space character:

   "≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ "

Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
each followed by a space character:

   "° ∙ · √ "

Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
each followed by a space character:

   "ⁿ ² "

Impossible byte (11111110):                       "■"
Impossible byte (11111111):                       " "

2-byte sequence with last byte missing:           "└"
3-byte sequence with last byte missing:           "αÇ"
4-byte sequence with last byte missing:           "≡ÇÇ"
5-byte sequence with last byte missing:           "°ÇÇÇ"
6-byte sequence with last byte missing:           "ⁿÇÇÇÇ"

All these 5 sequences with last byte missing concatenated:

   "└αÇ≡ÇÇ°ÇÇÇⁿÇÇÇÇ"
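
The categories exercised by the test sequences (complete sequences, overlong forms, lonely continuation bytes, impossible bytes, and sequences with the last byte missing) can be told apart by a small scanner. The following is a minimal sketch in Python, not part of the original test file. It assumes the original 1- to 6-byte UTF-8 definition that this test text uses, not the later restriction to 4 bytes and U+10FFFF in RFC 3629. The function name scan_utf8 and the category labels are hypothetical, chosen for illustration only.

```python
# Illustrative sketch (hypothetical helper, not from the test file).
# Assumes the original 1..6-byte UTF-8 scheme covering U+0000..U+7FFFFFFF.

# Smallest value that legitimately needs n bytes (index = n).
MIN_FOR_LENGTH = [0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def scan_utf8(data: bytes):
    """Return a list of (offset, length, kind, value) tuples.

    kind is one of "char", "overlong", "lonely-continuation",
    "impossible-byte" or "truncated"; value is the decoded integer
    for "char" and "overlong", and None for the malformed kinds.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # 0xxxxxxx: plain ASCII
            out.append((i, 1, "char", b))
            i += 1
            continue
        if b < 0xC0:                      # 10xxxxxx without a first byte
            out.append((i, 1, "lonely-continuation", None))
            i += 1
            continue
        if b >= 0xFE:                     # 11111110 and 11111111
            out.append((i, 1, "impossible-byte", None))
            i += 1
            continue
        # First byte: the number of leading 1 bits is the sequence length.
        n = 2
        while b & (0x80 >> n):
            n += 1
        value = b & (0x7F >> n)           # payload bits of the first byte
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] < 0xC0:
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        if j < i + n:                     # ran out of continuation bytes
            out.append((i, j - i, "truncated", None))
        elif value < MIN_FOR_LENGTH[n]:   # fits in a shorter sequence
            out.append((i, n, "overlong", value))
        else:
            out.append((i, n, "char", value))
        i = j
    return out


if __name__ == "__main__":
    # The five "last byte missing" sequences, concatenated.
    sample = bytes.fromhex("c0 e080 f08080 f8808080 fc80808080")
    for unit in scan_utf8(sample):
        print(unit)
```

A decoder following the ISO 10646-1 requirement quoted above would presumably substitute one replacement for each malformed unit this scanner reports; how many replacement characters to emit per unit is a design choice the sketch leaves open.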