Usenet 1994 January

Text File | 1992-12-26 | 1.8 KB | 43 lines

Submitted-by: enag@ifi.uio.no (Erik Naggum)

Peter da Silva <peter@ferranti.com> writes:
>In article <16rpgaINNol0@ftp.UU.NET> david@mks.com (David Rowley) writes:
>> Note that UTF and 8-bit Latin 1 (ISO 8859-1) are identical for
>> characters 0x00 to 0x9f. Codepoints above 0x9f are used to
>> introduce the multibyte sequences.
>
>That seems strange. 0x80 through 0x9f are all controls, and all the
>national characters in Latin-1 are in 0xA0 to 0xFF. Why would they allow
>Latin-1 control codes (CSI, etc) and blow off all the graphics? Are you
>sure they didn't overload the high control range (0x80 to 0x9f)? That
>would seem a much more useful encoding.

Character numbers 128 (0x80) through 159 (0x9F) are not used in ISO
10646, and are not used in UTF, either. It's highly misleading to
claim that they are used, since, in fact, they aren't even graphic
characters in _any_ ISO 4873-conforming coded character set (of which
the ISO 8859 family is an instance), and row 0 of ISO 10646 (but only
row 0) conforms to ISO 4873 with respect to not populating the control
character ranges with graphic characters.

ISO 8859-1 characters (i.e. the right half of row 0) are introduced
with character number 160 (0xA0). Following this "code extension"
character is a single ISO 8859-1 character with the same character
number that the character has in ISO 8859-1.

For example, if the original string is (hex)

	A1 43 61 72 61 6d 62 61 21

("!Caramba!" with the first ! up-side down) in ISO 8859-1, it will be (hex)

	A0 A1 43 61 72 61 6d 62 61 21

in ISO 10646 UTF.

Best regards,
</Erik>
--
Erik Naggum             |  ISO  8879 SGML     |  +47 295 0313
                        |  ISO 10744 HyTime   |
<erik@naggum.no>        |  ISO 10646 UCS      |  Memento, terrigena.
<enag@ifi.uio.no>       |  ISO  9899 C        |  Memento, vita brevis.

Volume-Number: Volume 29, Number 10
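
The rule described in the post (each ISO 8859-1 character in the range 0xA0 to 0xFF is preceded by a single 0xA0 byte, and 0x00 to 0x7F pass through unchanged) can be sketched in a few lines of Python. This is only an illustration of the post's description of the early ISO 10646 UTF, not of modern UTF-8, and it covers only the right half of row 0. The function name `latin1_to_utf_sketch` is hypothetical, and rejecting bytes 0x80 to 0x9F is an assumption made here because the post says those values are not used.

```python
def latin1_to_utf_sketch(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes following the rule described in the post.

    Hypothetical helper for illustration; covers row 0 only.
    """
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            # Assumption: the post says 0x80..0x9F are not used,
            # so this sketch simply refuses them.
            raise ValueError(f"byte 0x{b:02X} is in the unused range 0x80..0x9F")
        if b >= 0xA0:
            # Right half of row 0: emit the 0xA0 "code extension" byte first.
            out.append(0xA0)
        # The character keeps the number it has in ISO 8859-1.
        out.append(b)
    return bytes(out)


# The example string from the post: inverted "!" followed by "Caramba!".
original = bytes.fromhex("A1 43 61 72 61 6d 62 61 21")
encoded = latin1_to_utf_sketch(original)
print(encoded.hex(" "))
```

Under these assumptions, the encoded bytes for the post's example are the original nine bytes with one extra 0xA0 inserted in front of the A1.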