InfoMagic Internet Tools 1993 July

Text File | 1991-04-05 | 6.8 KB | 162 lines

Guidelines for character mnemonics in a minimal character set

By Keld Simonsen, Danish UNIX User Group (DKUUG)
Representative to SC22 WG on Character Set Usage for
Danish Standards Association (DS), Denmark.

Draft January 1991.


Aim of Character Mnemonics

The aim of the mnemonics is to be able to represent all characters in
all standard coded character sets in any standard coded character set.
Thus all standard coded character sets will be related, and a
conversion between them can take place.

The character mnemonics are primarily intended for use within computer
operating systems, programming languages and applications. This
document reflects the current state of the work, which has been
presented to the ISO working group responsible for these
computer-related issues, namely the ISO/IEC JTC1/SC22 special working
group on character set usage.


Covered Coded Character Sets

Almost all characters in the standard coded character sets have been
given a mnemonic name in the minimal character set. The minimal
character set is defined as the basic character set of ISO 646, where
12 positions are left undefined. The standard coded character sets are
taken as the sum of all ISO defined or ISO registered character sets.

The most significant ISO coded character set is the ISO 10646 coded
character set, whose aim is to code all characters in the world in
32 bits. These guidelines can be seen as assigning mnemonic attributes
to most characters in ISO 10646, currently at DIS stage.

Other ISO coded character sets covered include all parts of ISO 8859,
ISO 6937-2 and all ISO 646 conforming coded character sets in the ISO
character set registry managed by ECMA according to ISO 4873. Some
non-ISO character sets are also covered for convenience.


The Character Mnemonics Classes

The character mnemonics are classified into two groups:

1. A group with two-character mnemonics - primarily intended for
   alphabetic scripts like Latin, Greek, Cyrillic, Hebrew and Arabic,
   and for special characters.

2. A group with variable-length mnemonics - primarily intended for
   non-alphabetic scripts like Japanese and Chinese.

All mnemonics are given a long descriptive name, written in the
reference character set and taken from ISO 10646, if possible.


The Two-Character Mnemonics

The two-character mnemonics include various accented Latin letters,
Greek, Cyrillic, Hebrew, Arabic, Hiragana, Katakana and Bopomofo.
Quite a few special characters are also included. Almost all ISO or
ISO registered 7- and 8-bit coded character sets are covered by these
two-character mnemonics. Thus conversions between these character sets
can be done via a two-character conversion table.

The two characters are chosen so that their graphical appearance in
the reference set resembles the graphical appearance of the character
as closely as possible (within the possibilities available). The basic
character set of ISO 646 is used as the reference set, as mentioned
above.

The characters in the reference character set are chosen to represent
themselves. You may consider them as two-character mnemonics where the
second character is a space.

Control character mnemonics are chosen according to ISO 2047 and
ISO 6429.

Letters, including Greek, Cyrillic, Arabic and Hebrew, are represented
with the base letter as the first character; the second character
represents an accent or a relation to a non-Latin script. Non-Latin
letters are transliterated to Latin letters, following transliteration
standards as closely as possible.

After a letter, the second character signifies the following:

    Exclamation mark    !   Grave
    Apostrophe          '   Acute accent
    Greater-Than sign   >   Circumflex accent
    Question Mark       ?   Tilde
    Hyphen-Minus        -   Macron
    Left parenthesis    (   Breve
    Full Stop           .   Dot Above/Ring above
    Colon               :   Diaeresis
    Comma               ,   Cedilla
    Underline           _   Underline
    Solidus             /   Stroke
    Quotation mark      "   Double acute accent
    Semicolon           ;   Ogonek
    Less-Than sign      <   Caron
    Equals              =   Cyrillic
    Asterisk            *   Greek
    Percent sign        %   Greek/Cyrillic special
    Plus                +   small letters: Arabic, capitals: Hebrew
    Four                4   Bopomofo
    Five                5   Hiragana
    Six                 6   Katakana

The ampersand "&" is reserved as an intro character, indicating that
the following string is in the mnemonic character set. Another
character could be used instead, e.g. one from the control character
set. One common choice in the control character set is decimal 29,
which seems to have no effect on almost all current equipment. The
intro character can be negotiated between the communicating parties,
but the default is the ampersand "&". Two intro characters in a row
signify the intro character itself.

The underscore is reserved for the variable-length mnemonics. This
does not rule out its use as an accent or language identifier.

The right parenthesis ")" is not in use at the moment as an accent or
language identifier. This is also the case for some digits.

Special characters are encoded with some mnemonic value. The mnemonics
are not systematic throughout, but most of them start with a special
character of the reference set. Special characters that relate to a
character of the reference character set normally have that character
as the first character of the mnemonic.


The Variable-Length Character Mnemonics

The variable-length character mnemonics are primarily meant for the
ideographic characters in the larger Asian character sets. Short names
are preferred, since they both save storage and are easier to type.

The Chinese standard GB 2312-1980 and the Japanese standards JIS X0208
and JIS X0212 all identify their characters by row and column numbers
between 1 and 99. So two positions each for row and column, plus a
character set identifier of one character, would be almost as short as
possible.
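
The row and column layout just described can be sketched in a few
lines of Python. This is only an illustration, not part of the draft:
the function names format_position and parse_position are hypothetical,
and the sketch assumes that row and column are each written as exactly
two decimal digits after a one-character identifier.

```python
def format_position(charset_id, row, column):
    """Build the body of a variable-length mnemonic, e.g. "j3210".

    Assumption (not stated in the draft): row and column are each
    zero-padded to two decimal digits.
    """
    if len(charset_id) != 1:
        raise ValueError("character set identifier must be one character")
    if not (1 <= row <= 99 and 1 <= column <= 99):
        raise ValueError("row and column must be between 1 and 99")
    return f"{charset_id}{row:02d}{column:02d}"


def parse_position(body):
    """Split a body such as "j3210" into (identifier, row, column)."""
    if len(body) != 5 or not body[1:].isdigit():
        raise ValueError(f"not an identifier plus four digits: {body!r}")
    return body[0], int(body[1:3]), int(body[3:5])
```

With these helpers, identifier "j" with row 32 and column 10 gives the
five-character body "j3210", and parsing that body returns the same
three values.
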

The following character set identifiers are defined:

    c   GB 2312-1980
    j   JIS X0208-1990
    J   JIS X0212-1990
    k   KS C 5601-1987

The first idea was to have a name in Latin letters describing the
pronunciation, but according to Asian sources that is not possible.

One prominent character in the reference character set is reserved for
identifying variable-length mnemonics, namely the underscore "_". This
character is intended as a delimiter at both the front and the end of
the mnemonic. An example of its use, with "&" as the intro character,
would be:

    &_j3210_ &_j4436_&_j6530_

The variable-length character mnemonics can also be used for less-used
Latin letters with more than one accent, or for other less-used
special characters.
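
How a reader of such text might separate mnemonics from ordinary
characters can be sketched as follows. This is a minimal Python sketch
under stated assumptions, not a definitive implementation: the function
name tokenize and the token labels are hypothetical, and it assumes
that each intro character introduces exactly one mnemonic, which is
either two characters long or enclosed in underscores.

```python
INTRO = "&"  # default intro character; the parties may negotiate another


def tokenize(text, intro=INTRO):
    """Split text into ("char", c) and ("mnemonic", name) tokens.

    Assumption: every intro character is followed by exactly one
    mnemonic (two characters, or "_..._" for variable length).
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != intro:
            # Reference-set characters represent themselves.
            tokens.append(("char", ch))
            i += 1
        elif text[i + 1:i + 2] == intro:
            # Two intro characters in a row signify the intro itself.
            tokens.append(("char", intro))
            i += 2
        elif text[i + 1:i + 2] == "_":
            # Variable-length mnemonic, delimited by "_" at both ends.
            end = text.find("_", i + 2)
            if end == -1:
                raise ValueError("unterminated variable-length mnemonic")
            tokens.append(("mnemonic", text[i + 2:end]))
            i = end + 1
        else:
            # Two-character mnemonic, such as "a:" for a with diaeresis.
            name = text[i + 1:i + 3]
            if len(name) != 2:
                raise ValueError("truncated two-character mnemonic")
            tokens.append(("mnemonic", name))
            i += 3
    return tokens
```

Applied to the example string above, the sketch yields the mnemonic
bodies j3210, j4436 and j6530, with the single space between the first
two kept as an ordinary character. Mapping each body to an actual coded
character would still need a conversion table, which is outside the
scope of this sketch.
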