Sun Dec 5 14:34:25 1993

A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

********** NOTE: This is a work in progress, and will be updated from time to time. **********

Christine M. Gianone
cmg@watsun.cc.columbia.edu
Manager, Kermit Development and Distribution

Frank da Cruz
fdc@watsun.cc.columbia.edu
Manager, Network Planning

Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA

DRAFT NUMBER 7.1
Dec 5, 1993

ABSTRACT

An extension to the presentation layer of the Kermit file transfer protocol is proposed to allow transfer of non-English-language text files between unlike computers by substitution of standard character sets other than ASCII in Kermit's text-file transfer data packets. Methods for selection, announcement, and use of these character sets are described.

The reader is assumed to be familiar with the Kermit file transfer protocol and with basic computing and terminology. The relevant ANSI and ISO character-set related standards are summarized in Appendix B of this document.

This is a nearly final draft. The protocol and many of the commands described in this document have been successfully implemented in major Kermit programs including MS-DOS Kermit 3.0, C-Kermit 5A for UNIX and VAX/VMS, and IBM Mainframe Kermit 4.2 for VM/CMS, MVS/TSO, MUSIC, and CICS.

Special thanks to John Chandler and Hirofumi Fujii for extensive contributions to this draft, and to John Klensin for his comments and support.

SUMMARY OF CHANGES SINCE DRAFT #5, April, 1990

- Abandonment of the two-level concept. Mixed languages will be handled by using ISO 10646 or UNICODE as the transfer character set. The details remain to be specified.
- Abandonment of CSN 36 91 03, Czechoslovak Standard alphabet as a transfer character set. Czech is adequately covered by ISO 8859-2.
- Adoption of Japanese EUC as the transfer character set for Japanese text files, rather than JIS X 0208.
- Explanation of Japanese EUC added to Appendix B.
- Reference to Kermit's new locking shift transport protocol.
- Removal of (unworkable) design for user-defined translations.
- Addition of mechanism for automatic translation-table selection.
- Addition of notion of "translation goal" and related commands.
- Deletion of irrelevant or redundant appendices.
- Addition of an annotated References section.
- Short sections added on terminology and notation.
- Note: Table I moved to Appendix B, so table numbers are out of order.

SUMMARY OF CHANGES SINCE DRAFT #4, August, 1989

- Changes for Level 1 only, to reflect experience in writing the code to implement it for MS-DOS Kermit 3.0, C-Kermit 5A, and Kermit 370 4.2. Level 2 is on hold indefinitely pending ISO 10646 & Unicode developments.
- Abandonment of separate attributes for encoding and character set.
- Change all references to ASCII as I2 into I6 (ISO Registration Number).
- Change description of SET LANGUAGE to remove side effects.
- Differentiation of SET TRANSFER CHARACTER ASCII and TRANSPARENT.
- The section on terminal emulation has not been changed, even though this subject needs detailed treatment in this document.

SUMMARY OF CHANGES SINCE DRAFT #3, July 20, 1989

- Expanded & more precise definition of Kermit's character set designators
- Simplification of the syntax of the (former) SET TRANSFER-SYNTAX command
- Addition of SET LANGUAGE command
- Clarification of Kermit's behavior when it receives an unknown character set
- Addition of Appendix F to specify how each Kermit Level is invoked
- Correction of numerous typographical and other errors

ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments during the drafting process. In most cases, their suggestions or the information they provided have been incorporated into this or previous drafts.
John Chandler (Harvard/Smithsonian Center for Astrophysics, USA) Alan Curtis (University of London, UK) Joe Doupnik (Utah State University, USA) Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo) John Klensin (Massachusetts Institute of Technology, USA) Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo) Vladimir Novikov (VNIIPAS, Moscow, USSR) Jacob Palme (Stockholm University, Sweden) Andre Pirard (University of Liege, Belgium) Paul Placeway (Ohio State University, USA) Gisbert W. Selke (WIdO, Bonn, Germany) Fridrik Skulason (University of Iceland, Reykjavik, Iceland) Johan van Wingen (Leiden, Netherlands) Konstantin Vinogradov (ICSTI, Moscow, USSR) Amanda Walker (InterCon Systems Corp, USA) Thanks also to the following people for organizing meetings or conferences in their countries at which the issues of this proposal were discussed: Kohichi Nishimoto (Nihon DEC, Tokyo, Japan) Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR) and thanks also to those who attended these gatherings! Thanks to the Kermit developers who have implemented this extension in their Kermit programs: John Chandler (Kermit-370) Frank da Cruz (C-Kermit) Joe Doupnik (MS-DOS Kermit) Hirofumi Fujii (C-Kermit, MS-DOS Kermit, and NEC PC9801 Kermit) Finally, thanks to other experts who provided valuable information: Jerry Andersen, IBM Lloyd Anderson, Ecological Linguistics Joe Becker, Xerox Corporation and UNICODE Consortium James Do, Mentor Graphics, San Jose, CA Edwin Hart, Johns Hopkins University Applied Physics Laboratory NOTATION This document is written in plain 7-bit US ASCII, and to be understood correctly it should be displayed in plain 7-bit US ASCII. 
The notation: <xxx> is used to express a non-ASCII or non-graphic character, where "xxx" is replaced by the name of the character, for example: <ESC> (All capital letters: the name of a control character) or: <A-grave> (Lower or mixed case: a letter with a diacritical mark) In other places (which should be clear from the context), the same notation is used to denote a parameter to a Kermit command, for example: <filename> to stand for the name of any file. TERMINOLOGY A "character" is the minimum unit of a writing system: a letter, a digit, a punctuation mark, an ideogram, without regard to the style of rendering except for capitalization in scripts where that is possible, and without regard to computer encoding. A "character set" is a particular, specified group of characters, for example (and most typically) all the letters, digits, and punctuation marks needed for a particular writing system. A "coded character set" is the internal computer representation of a character set, in which each character is assigned a unique code, often with the addition of special control codes. In this document, "character set" and "coded character set" are used synonymously unless otherwise noted. "Code page" is the term used by IBM and Microsoft to mean "coded character set". A "code point" is the association between a character and its encoding in a particular character set. An "octet" is a computer storage unit of 8 bits. A "byte" is an octet, unless otherwise noted. The word "translation" is used loosely in this document to denote conversion between character set encodings, not translation between languages or any other higher-level notion. When characters are intentionally replaced by different characters, the word "transliteration" is used. STATEMENT OF THE PROBLEM The Kermit file transfer protocol has always been able to transfer text files between unlike computers (e.g. a UNIX system with ASCII stream text files and an IBM mainframe with EBCDIC record-oriented text files). 
To do the text file code conversion, Kermit transfers text in ASCII. However, ASCII includes only enough letters and symbols for English.

There are now computers capable of representing the characters of other languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew, Arabic, and Greek characters; Chinese, Japanese, and Korean ideograms. However, different computer manufacturers use different codes for these characters.

For example, the IBM PS/2 and the Apple Macintosh both have character sets that are "8-bit ASCII". When the character value is 32-127, the character is (normally) a standard ASCII graphic (printable) character. When the value is 128 or higher, it is a "special" character. Unfortunately, the PC and the Macintosh assign different special characters to these values. Here are just a few examples:

  Value   PS/2 Character        Macintosh Character
   138    Small e grave         Small a diaeresis
   143    Capital A ring        Small e grave
   144    Capital E acute       Small e circumflex
   136    Small e circumflex    Small a grave

When a file contains "8-bit ASCII", basic Kermit transfers it without any character translation. Therefore, a text file written in French, German, Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain the wrong characters when it arrives at its destination: the PS/2's e-grave becomes a-diaeresis on the Macintosh, etc.

There are many computer vendors in the world and nobody controls what codes they use to represent characters. Without a standard protocol for transferring non-ASCII text, each computer would have to know the codes of all the other computers in order for correct transfer of non-English text files to occur between all combinations of unlike systems.

To complicate matters, many computers now support more than one character set. IBM mainframes have not only "standard" US EBCDIC, but also several EBCDIC-based Country Extended Code Pages (CECPs) for the support of West European languages, Hebrew, Kanji, etc.
The IBM PC and PS/2 have a variety of ASCII-based 8-bit code pages for the same purpose. These character sets are a welcome addition because they allow users of these computers to create, display, and print documents in languages other than English. Unfortunately, the computer's file system keeps no record of which character set is used in each file. IBM is not the only source of private character sets. The Apple Macintosh has many character sets and fonts. DEC supports its own multinational character set as well as private encodings for Greek, Hebrew, etc. The NeXT workstation has its own unique character set. Similarly for Data General, Atari, Commodore, and many other computer manufacturers. In the USSR, up to five different Cyrillic character sets are in use. In Japan, there are many different encodings for Roman, Katakana, and Kanji characters. China and Taiwan use different encodings for Chinese characters. NORMAL KERMIT FILE TRANSFER SYNTAX The Kermit file transfer protocol makes a distinction between text and binary files. Binary files are transmitted with no translation or conversion. For text files, the Kermit protocol defines a standard intermediate representation ("transfer syntax") for text files, namely ASCII characters with carriage return and linefeed (CRLF) after each line, so text can be stored in useful fashion on any computer to which it is transferred. Each Kermit program knows how to translate from the local text-file storage conventions to ASCII/CRLF syntax, and vice versa. This is the basic, required, and default mode of operation for any Kermit program. INTERNATIONAL KERMIT TRANSFER SYNTAX This proposal adds a new mechanism that permits the use of character sets other than ASCII in file transfer. These additional character sets are taken from recognized national or international standards, such as the ISO 8859 Latin Alphabets. 
Using a standard character set (other than ASCII), it is possible to transfer a text file written in a language other than English between unlike computers, and it is also possible to transfer a text file containing more than one language. For example, Latin Alphabet 1 can represent a file containing a mixture of Italian, Norwegian, French, German, English, and Icelandic.

The character set used in a text file stored in a particular computer is called the "file character set" (FCS). When the characters in a text file can be represented by a single standard character set, that character set can be used in place of ASCII in Kermit's transfer syntax. This is called the "transfer character set" (TCS).

Whatever the transfer character set, there must be a mapping between the local file character set and the transfer character set. That is, there must be a pair of translation functions in the program: one from the local file character set to the transfer character set, and one from the transfer set to the local set:

       COMPUTER A                                COMPUTER B
  +------------------+                      +------------------+
  | +-------------+  |                      | +-------------+  |
  | | Translation |  |       Transfer       | | Translation |  |
  | | Function:   |-------------------------->| Function:   |  |
  | | FCS to TCS  |  |    Character Set     | | TCS to FCS  |  |
  | +-------------+  |                      | +-------------+  |
  |        ^         |                      |        |         |
  |        |         |                      |        v         |
  |  Kermit Program  |                      |  Kermit Program  |
  |       SEND       |                      |     RECEIVE      |
  +------------------+                      +------------------+
           ^                                         |
           |                                         v
  +------------------+                      +------------------+
  |    Local File    |                      |    Local File    |
  | Character Set A  |                      | Character Set B  |
  +------------------+                      +------------------+

The use of common, standard transfer character sets means that each Kermit program only has to know about its own local character sets and a small number of standard ones.

International transfer syntax is an optional feature for Kermit programs, and is designed to interoperate (with, of course, no claim to correct translation) with Kermit programs that do not support it.
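The pair of translation functions shown in the diagram can be illustrated with a minimal sketch. It is not taken from any Kermit program: the function names are hypothetical, and Python's built-in codecs "cp437", "latin_1", and "mac_roman" are assumed to be adequate stand-ins for the PS/2 file character set, the LATIN1 transfer character set, and the Macintosh file character set. Python passes through Unicode internally, whereas a real Kermit program would more likely use direct byte-to-byte tables.

```python
# Hypothetical sketch: Python codec names stand in for Kermit's
# file character sets (FCS) and transfer character set (TCS).

def fcs_to_tcs(data: bytes, fcs: str, tcs: str) -> bytes:
    """Sender side: translate file bytes into the transfer character set."""
    return data.decode(fcs).encode(tcs, errors="replace")


def tcs_to_fcs(data: bytes, tcs: str, fcs: str) -> bytes:
    """Receiver side: translate transfer bytes into the local file character set."""
    return data.decode(tcs).encode(fcs, errors="replace")


# Computer A (PS/2) stores small e grave as value 138.
pc_file = bytes([138])

# The sender translates to the transfer character set (here Latin Alphabet 1).
on_the_wire = fcs_to_tcs(pc_file, fcs="cp437", tcs="latin_1")

# The receiver (Macintosh) translates to its own file character set.
mac_file = tcs_to_fcs(on_the_wire, tcs="latin_1", fcs="mac_roman")

print(list(pc_file), list(on_the_wire), list(mac_file))
```

With the translation in place, the PS/2 value 138 arrives on the Macintosh as 143, the Macintosh code for small e grave in the example table given earlier, rather than remaining 138 (a-diaeresis). Neither side needs to know the other's file character set; each knows only its own set and the transfer set.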
SPECIFYING THE FILE CHARACTER SET

The following command allows the Kermit user to specify the local file character set:

  SET FILE CHARACTER-SET <file-character-set-name>

The file character set name is normally a system-dependent item. Some computers have only one character set, in which case the SET FILE CHARACTER-SET command is unnecessary.

This command will be required on computers where more than one file character set is used. These include private (corporate) character sets or the 7-bit national variants allowed by ISO Standard 646 (See Appendix B). A consistent, or at least sensible, naming convention should be used for private character sets.

The following names are recommended for the 7-bit national character sets: ASCII, BRITISH, CUBAN, DANISH, DUTCH, FINNISH, FRENCH, FRENCH-CANADIAN, GERMAN, HUNGARIAN, ITALIAN, JAPANESE-ROMAN, NORWEGIAN, PORTUGUESE, SPANISH, SWEDISH, and SWISS (note: most of these are ISO-646 sets, but several of them are private 7-bit sets).

The Apple character sets might include APPLE-STANDARD, APPLE-QUICKDRAW, and APPLE-SYMBOL. The DEC Multinational Character Set can be called DEC-MULTINATIONAL, DEC Greek would be DEC-GREEK, DEC Hebrew would be DEC-HEBREW, etc. The NeXT character set can be NEXT-MULTINATIONAL. The Data General international character set can be DATA-GENERAL-MULTINATIONAL, and so on. Later, when these companies add new and no doubt unique character sets, these can be called NEXT-GREEK, NEXT-HEBREW, DATA-GENERAL-GREEK, DATA-GENERAL-HEBREW, etc.

For the IBM character sets (code pages), the notation CPnnn is used, where nnn is the code page number: CP037, CP437, CP500, CP850, etc. EBCDIC should be used for "standard" USA EBCDIC. An alternative notation, more in keeping with the ones above, would be something like IBM-PC-STANDARD, IBM-PC-MULTINATIONAL, IBM-PC-PORTUGUESE, IBM-370-MULTINATIONAL, IBM-370-USA, IBM-370-JAPAN, etc.
But because there are often several code pages that fit one such description, the CPnnn notation is preferred. These are simply samples and guidelines for naming conventions for corporate character sets. File character set names should be both precise and mnemonic when possible but, as in the IBM case, precision should take precedence. In countries like the USSR, character sets are not associated with particular companies, but have grown up as a matter of usage in several different computing environments, or have grown out of several different generations of standards. In such cases, it makes the most sense to stick to common usage. USSR character sets include KOI-7, KOI-8, DKOI, CP866 (Microsoft Cyrillic), ALT-CYRILLIC ("Alternative Cyrillic"), and CYRILLIC (ISO 8859-5). In Japan, a mixture of standard (JIS), modified standard, and corporate character sets are used: JIS-7, JIS-8, SHIFT-JIS, JAPAN-EUC, DEC-KANJI, FUJITSU-KANJI, HITACHI-KANJI, etc. Example: Consider a computer where the ASCII character set is used for programming and the German ISO 646 variant is used for text. The German phrase: Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln would be rendered in ASCII as "Gr}~e aus K|ln", and the ASCII C-language phrase "{~a[x]}" would appear as: <a-diaeresis><ess-zet>a<A-diaeresis>x<U-diaeresis><u-diaeresis> in German ISO 646 (ess-zet is the German double-s character, similar in appearance to Greek beta). The German-speaking user would want Kermit to interpret the local file characters as German (SET FILE CHARACTER-SET GERMAN) in the former case, and as ASCII (SET FILE CHARACTER-SET ASCII) in the latter. SPECIFYING THE TRANSFER CHARACTER SET To select the transfer character set for file transfer, the user enters the command: SET TRANSFER CHARACTER-SET <name> where <name> is the name of a standard character set. If the name is TRANSPARENT, Kermit does no character set conversion at all, but it still does text record format conversion. 
For ASCII-based systems, this is equivalent to Kermit's normal, basic mode of operation.

If a name other than TRANSPARENT is given, and FILE TYPE is set to TEXT, Kermit translates between the current file character set and the named transfer character set when constructing or deciphering file data packets.

If the transfer character set is ASCII, Kermit converts between the current file character set and 7-bit ASCII. This mode of operation is roughly equivalent to Kermit's basic mode of operation on non-ASCII based systems like IBM mainframes. But if the local file character set contains accented Roman characters, the accents are dropped in the transfer character set, for example a-acute becomes simply a. (But see SET LANGUAGE, described later.)

Other transfer character sets must be chosen from among approved national or international standards. The sets shown in Table 2 are recommended. The criteria for including a character set in this table are:

1. 7-bit US ASCII (= ISO-646 US version) is included, for compatibility with the original Kermit protocol and the hundreds of programs that implement it.

2. An 8-bit single-byte character set, such as those in the ISO 8859 series, may be included if it is registered, as in (4) below.

3. A multibyte character set may be included, if it is registered as in (4).

4. The set must be listed in the ISO International Register of Character Sets under the provisions of ISO Standard 2375 (see Appendix A), so it has a unique registration number and designating escape sequence with which the sending Kermit program can identify the character set to the receiving Kermit program. (An exception to this provision is made for Japanese EUC, which is a combination of two registered standards.) Allowance is made for the possibility of other registration authorities, should they appear.

5.
The set must be a national or international standard graphic character set, intended for use in computer text processing or programming (as opposed to Videotex, Teletex, OCR, device control, or other applications). This category may include standard line-drawing or technical character sets which fit the other criteria.

Note in particular that the national variants of ISO 646 are not included, since these are covered adequately by the ISO Latin alphabets.

Standard character sets containing "composed characters", such as CCITT T.61, in which an accented letter is represented by a two-character sequence (for example, c-cedilla is encoded as a cedilla character followed by a "c" character), are not included at this time. The issue of composed versus precomposed characters will be addressed later.

Standard "Kermit names" (for use with the SET TRANSFER CHARACTER-SET command) are given to these character sets so they may be referred to uniformly in all Kermit implementations. These names are chosen to be mnemonic so users don't have to remember cryptic designations like "ISO-8859-3". The choice of single words like "CYRILLIC" implies that there will not be more than one transfer syntax for Cyrillic text. However, if standards change in the future, it will be possible to add further identifying material to these names, e.g. "CYRILLIC-2, CYRILLIC-ANCIENT", etc.

The Kermit names are English, as this is the language of the standards themselves. The Kermit commands are English words, and this document is written in English. Non-English user interface issues are beyond the scope of this document.

_____________________________________________________________________________

Table 2: Standard Character Sets

US 7-bit ASCII.
  English, Latin, Gaelic without accents, Dutch without y-diaeresis, German without umlauts (vowels marked by diaeresis) or ess-zet.
  Kermit name: ASCII.
  ISO Registration Number: 6.
  Kermit Designator: none (this is the default transfer character set).

ISO 8859-1, Latin Alphabet 1.
  Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.
  Kermit name: LATIN1.
  ISO Registration Number: 100.
  Kermit Designator: I6/100.

ISO 8859-2, Latin Alphabet 2.
  Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbocroatian (Croatian), Slovak, and Slovene.
  Kermit name: LATIN2.
  ISO Registration Number: 101.
  Kermit Designator: I6/101.

ISO 8859-3, Latin Alphabet 3.
  Afrikaans, Catalan, Dutch, English, Esperanto, French, Galician, German, Italian, Maltese, Spanish, and Turkish.
  Kermit name: LATIN3.
  ISO Registration Number: 109.
  Kermit Designator: I6/109.

ISO 8859-4, Latin Alphabet 4.
  Danish, English, Estonian, Finnish, German, Greenlandic, Lappish (Sami), Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.
  ISO Registration Number: 110.
  Kermit Designator: I6/110.

ISO 8859-5, the Latin/Cyrillic Alphabet.
  Bulgarian, Byelorussian, English, Macedonian, Russian, Serbocroatian (Serbian), and Ukrainian (Compatible with USSR GOST Standard 19768-1987 and ECMA-113 = "New KOI-8").
  Kermit name: CYRILLIC.
  ISO Registration Number: 144.
  Kermit Designator: I6/144.

ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.
  ISO Registration Number: 127.
  Kermit Designator: I6/127.

ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.
  ISO Registration Number: 126.
  Kermit Designator: I6/126.
  AKA: ELOT 928 (OMADA ELLINIKON ELOT 928)

ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.
  ISO Registration Number: 138.
  Kermit Designator: I6/138.

ISO DIS 8859-9, Latin Alphabet 5, in which the Icelandic letters Thorn and Eth plus upper and lowercase Y acute from Latin Alphabet 1 are replaced by six other letters needed for Turkish.
  Danish, Dutch, English, Faeroese, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish.
  Kermit name: LATIN5.
  ISO Registration Number: 148.
  Kermit Designator: I6/148.

JIS X 0201, a 1-byte code for Japanese Katakana, used in conjunction with a slightly modified ASCII (backslash is replaced by Yen sign, tilde by overbar).
  Kermit name: KATAKANA.
  ISO Registration Numbers: 14 (Roman), 13 (Katakana).
  Kermit Designator: I14/13.

Japanese EUC
  A variable-length code containing ASCII and Japanese Katakana in their JIS X 0201 representations, plus 2-byte JIS X 0208. JIS X 0208, in turn, includes Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian characters, plus special symbols, etc. ASCII codes are single bytes with their 8th bit set to zero. JIS X 0208 codes are double bytes with the 8th bit of each byte set to one. JIS X 0201 Katakana bytes are preceded by Single Shift 2 (see Appendix B). This mixture allows single-width Roman and Katakana characters to coexist with double-width JIS X 0208 characters, a common practice in many Japanese computing environments.
  Kermit name: JAPAN-EUC
  ISO Registration Numbers: 14 (Right half of JIS X 0201), 87 (JIS X 0208).
  Kermit Designator: I14/87/13.

Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.
  ISO Registration Number: 58.
  Kermit Designator: I6/58.

KS C 5601 (1989), a 2-byte code for Korean.
  Kermit name: KOREAN.
  ISO Registration Number: 149.
  Kermit Designator: I6/149.

TCVN 5712:1993, an ISO 2022-compliant pair of single-byte sets for Vietnamese, one for uppercase letters, the other for lowercase.
  Kermit name: VIETNAMESE.
  ISO Registration Number: 180.
  Kermit Designator: I6/180.

ISO/IEC 10646-1.
  International Standard 10646, Information Processing -- Multiple-Octet Coded Character Set, 1993.

Table 2: Standard Character Sets
_____________________________________________________________________________

BEWARE: The Latin-4 alphabet is confused.
The original ECMA 94 standard was designed for the Scandinavian and Baltic languages and thus included the character A-ring (necessary for Swedish and Lappish/Sami), but some editions of the ISO Registry substitute L-acute (not used by any of the covered languages). NOTE: CNS 11643 (Taiwan) is not included because (a) one Chinese transfer character set should be sufficient, and (b) CNS 11643 does not show up in the ISO Register. The issue of "Han Unification" (combining Chinese, Japanese, and Korean ideograms into a single code set) is not addressed by this proposal, except insofar as it occurs in the base multilingual plane (BMP) of ISO 10646. Until and unless Kermit programs are updated to take advantage of ISO 10646, additional transfer character sets must be added to Kermit's repertoire for languages with writing systems not yet covered: Burmese, Thai, Lao, Khmer, Armenian, Georgian, Amharic, Sinhalese, Tibetan, Mongolian, Cherokee, many African languages, etc etc. The ISO Latin alphabets are 8-bit character sets whose left half is identical with ASCII, and whose right half contains the characters required for languages other than English. All accented letters are "precomposed", i.e. single code points. The ISO registration number refers only to the right half of each of these character sets, but each of these sets must be used in its entirety, because the unaccented Roman letters, the digits, and the common punctuation marks appear only in the ASCII left half, which is ALWAYS (unless otherwise noted) US ASCII, ISO Registration Number 6. The Kermit character-set name refers to the two halves combined as a single set. A particular Kermit program need not incorporate all of these character sets. In many cases, a single 8-bit character set will suffice, such as LATIN1 for Western Europe, LATIN2 for Eastern European countries with Roman-alphabet based writing systems, CYRILLIC for most of the USSR, and so on. 
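The two-halves structure of the ISO Latin alphabets described above can be checked with a short sketch. It assumes that Python's built-in "iso8859_1", "iso8859_2", and "iso8859_5" codecs faithfully represent the LATIN1, LATIN2, and CYRILLIC sets; the helper names are hypothetical and are not part of any Kermit program.

```python
# Assumption: Python's ISO 8859 codecs are used as stand-ins for the
# LATIN1, LATIN2, and CYRILLIC transfer character sets.
SETS = {"LATIN1": "iso8859_1", "LATIN2": "iso8859_2", "CYRILLIC": "iso8859_5"}


def left_half_is_ascii(codec: str) -> bool:
    """True if values 0-127 decode to the same characters as US ASCII."""
    return all(bytes([v]).decode(codec) == chr(v) for v in range(128))


def right_half(codec: str) -> str:
    """Characters assigned to values 160-255 (the graphic right half)."""
    return bytes(range(160, 256)).decode(codec)


# Every set shares the ASCII left half ...
same_left = {name: left_half_is_ascii(codec) for name, codec in SETS.items()}

# ... but each has its own right half.
distinct_right = len({right_half(codec) for codec in SETS.values()}) == len(SETS)

print(same_left, distinct_right)
```

Because the left half never changes, unaccented Roman text survives translation into any of these sets; only the choice of right half determines which additional letters can be represented.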
When a language is representable in more than one character set from this table, as are English, German, Finnish, Turkish, etc., the character set highest on the list which adequately represents the language should be preferred. More precisely, when a character set other than ASCII is to be used in Kermit's transfer syntax, the ISO 8859 sets are preferred to other registered sets which contain the same characters. Within the ISO 8859 family, lower-numbered sets which contain the characters of interest are preferred to higher-numbered sets which contain the same characters. This guideline maximizes the chance that any two particular Kermit programs will interoperate. For example, LATIN1 would be chosen for French, German, Italian, Spanish, Danish, Dutch, Swedish, etc; LATIN3 for Turkish; JAPAN-EUC for Japanese text that includes Kanji characters, KATAKANA for Japanese text that includes only Roman and Katakana characters, etc. Unfortunately, but unavoidably, the burden of choosing the best transfer character set must be placed upon the user. If a file containing a mixture of English, Finnish, and Latvian must be transferred, the user must find a character set that can adequately represent all three languages, in this case Latin Alphabet 4. A table like Table 3 should be provided in the user documentation to help the user make this selection. 
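The selection guideline described above (prefer the set highest on the list that adequately represents every language in the file) can be sketched as a small lookup. The coverage entries below are a handful of language/character-set pairings copied from this document's tables; the data layout and the function name choose_tcs are hypothetical, and the sketch makes no claim to reflect how any Kermit program actually assists the user.

```python
# Order in which the sets appear in Table 2 (only the sets used below).
PREFERENCE = ["ASCII", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "CYRILLIC", "LATIN5"]

# A few sample rows; English is also listed under ISO 8859-5 in Table 2.
COVERAGE = {
    "English": {"ASCII", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "CYRILLIC", "LATIN5"},
    "Finnish": {"LATIN1", "LATIN4", "LATIN5"},
    "French": {"LATIN1", "LATIN3", "LATIN5"},
    "German": {"LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"},
    "Latvian": {"LATIN4"},
    "Russian": {"CYRILLIC"},
    "Turkish": {"LATIN3", "LATIN5"},
}


def choose_tcs(languages):
    """Return the first set in PREFERENCE that covers every language, or None."""
    candidates = set(PREFERENCE)
    for language in languages:
        candidates &= COVERAGE[language]
    for name in PREFERENCE:
        if name in candidates:
            return name
    return None


print(choose_tcs(["English", "Finnish", "Latvian"]))
print(choose_tcs(["French", "German"]))
print(choose_tcs(["Latvian", "Russian"]))
```

For the English, Finnish, and Latvian mixture mentioned above, the lookup arrives at LATIN4. When no single set covers all of the languages, as with Latvian plus Russian, it returns None, and the user would have to fall back on some other arrangement.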
_____________________________________________________________________________

  Afrikaans      LATIN3                     Irish            LATIN1,5
  Albanian       LATIN2                     Italian          LATIN1,3,5
  Arabic         ARABIC                     Japanese Kanji   JAPAN-EUC
  Bulgarian      CYRILLIC                   Japan.Katakana   KATAKANA,JAPAN-EUC
  Byelorussian   CYRILLIC                   Korean           KOREAN
  Catalan        LATIN3                     Lappish (Sami)   LATIN4
  Chinese        CHINESE                    Latvian          LATIN4
  Czech          LATIN2                     Lithuanian       LATIN4
  Danish         LATIN1,4,5                 Macedonian       CYRILLIC
  Dutch          LATIN1,2,3,4,5             Maltese          LATIN3
  English        ASCII,LATIN1,2,3,4,5,etc   Norwegian        LATIN1,4,5
  Esperanto      LATIN3                     Polish           LATIN2
  Estonian       LATIN4                     Portuguese       LATIN1,5
  Faeroese       LATIN1,5                   Romanian         LATIN2
  Finnish        LATIN1,4,5                 Russian          CYRILLIC
  French         LATIN1,3,5                 *Serbocroatian   LATIN2, CYRILLIC
  Galician       LATIN3                     Slovak           LATIN2
  German         LATIN1,2,3,4,5             Slovene          LATIN2
  Greek          GREEK                      Spanish          LATIN1,5
  Greenlandic    LATIN4                     Swedish          LATIN1,4,5
  Hebrew         HEBREW                     Turkish          LATIN3,5
  Hungarian      LATIN2                     Ukrainian        CYRILLIC
  Icelandic      LATIN1

Table 3: Preferred Transfer Syntax Character Sets

*If written in Cyrillic, this language is called Serbian. If written in Roman letters, it is called Croatian.
_____________________________________________________________________________

Note, Table 3 is only a sample. To produce a comprehensive and definitive table would require a team of language experts. The information in the current table is based purely upon the claims made within the standards themselves, in which there is no mention of languages like Farsi, Urdu, Welsh, Cornish, Manx, Inuit, Old Church Slavonic, Armenian, Georgian, Tagalog, Swahili, Latin, etc, nor definitions of exactly what is meant by terms like "Greenlandic", "Irish", etc. Nevertheless, it is the intention of this proposal to support any language for which a computer character set can be standardized.

OTHER NON-UNIVERSAL CHARACTER SETS

This section lists character sets that are not listed in Table 2, but that are likely candidates for eventual inclusion therein (i.e. after they are registered with the ISO).
ISO 6438, Extended Roman for African Languages. More information needed.

ISCII-1991 (IS 13194-1991), Indian Script Code for Information Interchange. Supports the nine official Indian scripts derived from Brahmi: Devanagari, Gurmukhi, Gujarati, Bengali, Assamese, Oriya, Telugu, Kannada, Malayalam, and Tamil, with Roman transliteration. A series of single-byte codes with 00/00-07/15 = ASCII, different right halves for each script. All of the right halves are structurally identical to each other to facilitate automatic transliteration and display of alternate alphabets using the same system software. As Perso-Arabic scripts have a different alphabet, a different standard is envisaged for them. ISCII-1991 is a successor to earlier codes ISSCII-83 and ISCII-88 announced by the Department of Electronics, Government of India. Bureau of Indian Standards, Manak Bhavan, 09 Bahadur Shah Zafar Marg, New Delhi 110002. No ISO registration number.

(Add others...)

THE UNIVERSAL CHARACTER SET

Though ISO Standard 10646 was approved in 1993, it will continue to undergo change as national standards bodies evolve and engage in the ISO process, and it will take many years before it replaces the many existing private and standard character sets in data processing and communication. Therefore there is no intention to drop support in the Kermit protocol for the standard character sets listed above at any time in the foreseeable future.

ISO 10646 can be added (in at least one form, most likely a compacted version of the two-byte Base Multilingual Plane) to Kermit's list of transfer character sets.

IMPLEMENTATION

Character set translation can be added to existing Kermit programs with a minimum of effort. The following steps are required for each Kermit program:

1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't have it already. SET FILE TYPE TEXT enables text-file character set conversion.
SET FILE TYPE BINARY disables conversions of all kinds, but does not destroy the file and transfer character-set selections (2 and 3 below), so that a subsequent SET FILE TYPE TEXT command will still be able to use them. 2. Add the SET FILE CHARACTER-SET <name> command. The set of <names> should include ASCII or EBCDIC (as appropriate, used for program source, etc) plus the names of any "national" or special character sets that are used on the particular computer. 3. Add the SET TRANSFER CHARACTER-SET <name> command. The set of <names> should include TRANSPARENT and ASCII plus the names of one or more other standard character sets from Table 2 which contain the characters from the computer's local character set(s). 4. Add translation tables (or functions) between each compatible pair of character sets in (2) and (3). For each pair, two translation tables are necessary: one from the local file character set to the transfer character set, and one from the transfer set to the local one. 5. Add SHOW commands to let the user find out what character sets are available, and which ones are currently selected, for the transfer syntax and for local files. The exact syntax of this command will vary. In some Kermit implementations, every SET command has a corresponding SHOW command, in which case it will be possible to SHOW FILE CHARACTER-SET and SHOW TRANSFER CHARACTER-SET. In others, related SET parameters are lumped together into broader categories for purposes of SHOW, for example SHOW FILE would show all file-related parameters; SHOW PROTOCOL would show all protocol-related parameters. Any particular Kermit program can support several (perhaps many) file character sets (FCS) and transfer character sets (TCS). No particular combination of them should be forbidden. If a useful translation between, say, Hebrew and Katakana can be devised, there is no reason the user should not be allowed to select it. 
However, programs that support large numbers of file and transfer character sets must bow to the limitations of the computer's architecture and memory space, as well as the knowledge and patience of the programmer. Hence, purely as a matter of implementation, certain combinations of FCS and TCS -- preferably the ones that would be least frequently used -- can remain unsupported. In that case, the SET { FILE, TRANSFER } CHARACTER-SET command that causes the conflict can issue an error message or switch automatically to a combination (if any) that makes sense. Optionally, several additional related commands can be included: 6. The command SET LANGUAGE may be added to allow the program to apply heuristics in the translation process that would not otherwise be possible. See discussion below. 7. Commands for modifying, loading, and saving translation tables (to be specified in a future draft of this document). 8. Once the new commands and translation tables are in place, it is simple to add a TRANSLATE command, to translate a local file from one character set to another, using a transfer character set as an intermediate step. With this command, Kermit may be used as a character-set conversion utility for local files (see Appendix D). 9. Commands governing automatic pairing of file and transfer character set and setting the goal for translation, described below. Translation occurs only in the data field of the D packets. Packet control fields are not translated, nor are the data fields of any other kind of packet, including F (filename) packets. (Filename packets cannot be translated because the attribute packet that announces the file's character set does not arrive until after the F packet.) As always, IBM Mainframe Kermit is a special case, since most character strings must be translated between EBCDIC and ASCII. 
Nonetheless, the rule applies even there, as long as we take "translation" to mean the specific translation between the transfer and file character sets, rather than the standard ASCII/EBCDIC conversion.

Internally, the Kermit program that is sending a file:

1. Reads characters (one or more bytes) or lines of text from the file.

2. Translates the character from the FILE CHARACTER-SET to the TRANSFER CHARACTER-SET, applying any selected and applicable special rules or goals, and converting the record format if necessary.

3. Follows the negotiated lower-level encoding options: control prefixing, shifting, and compression.

4. Assembles and sends the packet.

The Kermit program that is receiving a file:

1. Reads an incoming data packet.

2. Decodes the packet data according to the negotiated prefixing, shifting, and compression options.

3. Translates the resulting characters from the TRANSFER CHARACTER-SET to the FILE CHARACTER-SET, applying any selected and applicable special rules or goals, converting the record format if necessary.

4. Writes the translated characters to the output file.

EXAMPLE

To transfer a Finnish-language text file from a computer that uses the Finnish ISO 646 national variant to an IBM PS/2, and to store the file using the PS/2's Multilingual Code Page:

  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET FILE CHARACTER-SET FINNISH       SET TRANSFER CHARACTER-SET LATIN1
  SET TRANSFER CHARACTER-SET LATIN1    SET FILE CHARACTER-SET CP850
  SEND filename                        RECEIVE

The file sender translates from Finnish ISO 646 to Latin Alphabet 1, the most appropriate transfer character set (see Table 3), and the file receiver translates from Latin-1 to Code Page 850.
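As a rough illustration of the translation steps in the Finnish example above (step 2 for the sender, step 3 for the receiver), the following Python sketch uses a 256-entry table for the Finnish ISO 646 to Latin-1 direction and Python's built-in cp850 codec for the Latin-1 to Code Page 850 direction. For brevity it assumes that only the six national letter positions differ from ASCII. The names FINNISH_LETTERS, make_table, sender_translate, and receiver_translate are hypothetical and are not taken from any Kermit program.

```python
# Sketch only. Finnish ISO 646 positions that differ from ASCII (assumed
# here to be just the six national letters) and their Latin-1 code points.
FINNISH_LETTERS = {
    0x5B: 0xC4,  # A-diaeresis
    0x5C: 0xD6,  # O-diaeresis
    0x5D: 0xC5,  # A-ring
    0x7B: 0xE4,  # a-diaeresis
    0x7C: 0xF6,  # o-diaeresis
    0x7D: 0xE5,  # a-ring
}


def make_table(overrides):
    """Return a 256-byte table: identity except for the given overrides."""
    table = bytearray(range(256))
    for src, dst in overrides.items():
        table[src] = dst
    return bytes(table)


# File character set (Finnish ISO 646) to transfer character set (Latin-1).
FCS_TO_TCS = make_table(FINNISH_LETTERS)


def sender_translate(file_bytes):
    """Sender side: translate file data before packet encoding."""
    return file_bytes.translate(FCS_TO_TCS)


def receiver_translate(packet_bytes):
    """Receiver side: Latin-1 packet data to CP850 file data.

    Characters with no CP850 counterpart become '?', which is a
    readability-biased choice, not an invertible one.
    """
    return packet_bytes.decode("latin-1").encode("cp850", errors="replace")
```

Prefixing, shifting, compression, and record-format conversion are deliberately left out; the sketch covers only the character-set step.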
To transfer a C-language source program between the same two computers:

  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET TRANSFER CHARACTER-SET ASCII     SET FILE CHARACTER-SET ASCII
  SET FILE CHARACTER-SET ASCII         SET TRANSFER CHARACTER-SET ASCII
  SEND filename                        RECEIVE

Here all translations are from ASCII to ASCII, hence no translation at all.

LANGUAGE-SPECIFIC TRANSLATIONS

When national or international text must be translated into ASCII, information is necessarily lost. ASCII does not include accented or non-Roman letters. For readability, accented letters can be converted to their unaccented counterparts, but that can introduce ambiguities or mistakes (to use Andr'e Pirard's example: "a la francaise" without accents means "has the French girl").

If we know that the text is written in a specific language, sometimes certain language-specific rules can be applied to reduce the loss of information. For example, consider text containing the y-diaeresis character. It is acceptable to render y-diaeresis as "ij" if the language is Dutch, but not otherwise (yielding "Rijksmuseum" -- correct spelling -- rather than "Ryksmuseum"). Similarly, o-diaeresis can be rendered as "oe" in German or Swedish but not in English ("co<o-diaeresis>peration").

The command for selecting language-specific translation rules is:

  SET LANGUAGE <name>

where <name> is the (English) name of the language, for example ITALIAN, NORWEGIAN, PORTUGUESE.

Example: The command SET LANGUAGE GERMAN would allow the Kermit program, when translating from Latin-1 or the German ISO 646 variant into ASCII, to render:

  Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln

as "Gruesse aus Koeln" (correct German) rather than "Gruse aus Koln" (Gruse means something entirely different from Gruesse -- something like "scum" rather than "greetings").

TRANSLATION MECHANISMS

When translating from one character set to another, there are two possibly conflicting goals:

1.
Readability (R): Achieving a translation that makes the most sense to the reader. 2. Invertibility (I): Achieving a translation that can be translated back to the original character set without loss or distortion of information. When readability is desired, nonmatching characters are converted to the closest matching character, for example Latin-1 e-grave becomes simply e in ASCII. But now "e" represents two different characters in the translation, so invertibility is lost. When no sensible counterpart exists, a special "this can't be translated" character is used (a unique character if possible, otherwise a question mark "?"). When this special character is used for more than one purpose, invertibility is lost. Invertibility is possible only when both character sets are the same size. When invertibility is desired, the characters of the intersection of the two sets are paired together: A in one set to A in the other, A-grave in one set to A-grave in the other, etc. The members of the two sets of differences between the two character sets are paired together in a way that gives every character a unique translation in each direction. The exact method for pairing is problematic, and frequently a particular pair makes no sense at all, for example "L-with-stroke" with "Vulgar fraction 3/4". Any such pairing will give an invertible translation, but to achieve the most useful translation it is necessary to examine all the character sets involved. To illustrate, Latin Alphabet 1 lacks the OE digraph character but this character is found in the DEC Multinational Character Set, the Apple Quickdraw character set, and the NeXT character set, but at different code points in each. Ideally, each of these character sets should map OE digraph into the same Latin-1 code point. Let's look at a few common translation scenarios. 1. From a 7-bit set to a different 7-bit set, e.g. from ISO 646 Spanish version to ASCII (or vice versa). The two sets do not contain the same characters. 
Here we must choose between readability and invertibility. To achieve readability in the Spanish-to-ASCII direction, we strip diacritical marks (n-tilde becomes simply n). To achieve invertibility (at least in this case), we make no translation at all. 2. From a 7-bit set to an 8-bit set. The 7-bit sets are usually ASCII or an ISO 646 national variant. Normally, all the characters from the 7-bit set are also present in the 8-bit set, and there is no R vs I conflict. Otherwise, we must choose between R and I. Normal example: ASCII (and most ISO 646 national variants) to Latin-1 -- here we satisfy both R and I. Bizarre example: ISO 646 Swiss national variant to ISO Latin / Arabic -- here we must choose between R and I. 3. From an 8-bit set to another 8-bit set. The common case here is converting between one of the corporate "extended ASCII" sets (DEC, IBM, Apple, NeXT, Data General, Commodore, etc) and ISO Latin-1. The two sets share a large percentage of common characters. How do we handle the characters that differ? Again, we must choose between R and I. To complicate this case, the IBM, Apple, and NeXT sets use the forbidden (by ISO standards) C1 control-character area for graphics characters; in this case there must be a mapping between graphics and C1 controls. 4. From an 8-bit set to a 7-bit set. For example, from Latin-1 to ASCII or to an ISO 646 national set. Here we are forced to accept a great deal of information loss. We cannot possibly achieve invertibility, so we should aim for maximum readability. The SET LANGUAGE command can be used to help. 5. From a single-byte character set to a multibyte character set. Most multibyte character sets include ASCII and sometimes several other alphabets (such as Greek and Cyrillic in JIS X 0208). Here we translate each character into its equivalent, if it has one, and if not we pick some unique nonsense value to ensure the translation is invertible (for the single-byte set). 6. 
From a multibyte set to a single-byte set, for example Japanese EUC into Latin-1 (or Latin/Cyrillic, Latin/Greek, or even ASCII). Here we lose the vast majority of characters -- there is no hope for a readable or even a sensible translation. The only way to translate Kanji into (say) ASCII is to replace ideograms by words, and that is beyond the scope of a simple character-set conversion scheme. Hence, we normally replace ideograms by the "this can't be translated" character. 7. From one national multibyte set to another. These sets are for Chinese, Japanese, and Korean, and have at least a large number of ideograms (Han characters) in common, and probably also Roman characters. How to translate among them is an item for study by language experts: by shape, by meaning, etc. How do we choose between readability and invertibility? It depends on what the user needs at a particular moment. We (Kermit designers and programmers) can give the user the ability to make this choice. Or we can make the choice for them, knowing full well that whatever our choice, it will be wrong. To give the user a choice -- at the expense of increased size and complexity in the program itself and of the user interface -- the following command can (optionally) be included: SET TRANSFER TRANSLATION { INVERTIBLE, READABLE } The existence of this command requires a dual set of translation tables and/or functions -- one optimized for invertibility (totally invertible if the two character sets are the same size), the other for readability. When a Kermit program handles many character sets, this can result in a significant increase in program size. When this command is not provided, the bias of the translation mechanisms -- readability or invertibility -- must be clearly stated in the user documentation. All else being equal, the bias should be towards invertibility; if an invertible translation is possible (i.e. the two character sets are of the same size), it should be provided. 
This ensures round-trip consistency PROVIDED the same invertible tables are always used. It is the programmer's choice whether translation is accomplished by tables or by functions that implement translation algorithms, or a combination of both. Functions provide maximum flexibility and tend to reduce program size, at some cost in execution overhead. Tables provide greatest speed, but generally with greater cost in program size. THE POLITICS OF INVERTIBILITY If two character sets are the same size and contain the same repertoire of characters, translation is simply a matter of rearranging code points. But when two character sets intended to serve the same language or group of languages differ on non-alphabetic code points or in other minor ways, arbitrary decisions must be made in assigning the nonoverlapping characters from the two sets. Who makes those decisions? The classic example is the translation between IBM Code Page 850 (the "Multilingual Code Page") and ISO Latin-1. Because IBM assigns graphics characters to its C1 area, it has 32 more graphics characters than Latin-1. Most of these are line- and box-drawing characters sprinkled throughout the code page. How should these be paired with Latin-1's C1 set? Such decisions are beyond the scope of the national and international standards activities, and they should not be made by Kermit designers or programmers. These tables (or algorithms) are most appropriately furnished by the creators of each private character set. This lends the appropriate "official" air, and encourages the makers of all software packages that need such a translation to use the official one so all such applications on a particular computer can interoperate. IBM has specified an invertible translation between certain of its code pages and ISO Latin-1 in its Character Data Representation Architecture (CDRA). Similarly, Apple should specify the translation between Quickdraw and Latin-1. 
Microsoft should specify the translation between CP866 and the Latin/Cyrillic alphabet. And so on. In the absence of such vendor-provided translations, Kermit programmers are forced to produce their own, but should continue to press vendors for official versions.

Eventually, the actual contents of each invertible translation table or algorithm should be specified in a document or set of documents to accompany this proposal, or references to the relevant corporate standards should be listed in Appendix G.

Before leaving this topic, let's also remember to encourage designers of computer operating systems to RECORD THE CHARACTER SET in a text file's directory entry, so Kermit or any other application program can find out what it is automatically without requiring the user to identify it manually. (Of course, this begs the larger question of recording the file type as well... item for further study.)

THE POLITICS OF READABILITY

Similarly, we can ask: Who decides what is readable? Transliteration of a language like Greek or Russian into ASCII can be done in many different ways, depending upon -- among other things -- the language spoken by the person reading the result. For example, the surname of a former leader of the USSR has only six letters when written in Cyrillic; transliterated into English, the name is "Khrushchev", and into German, "Khruschtschew".

There are few, if any, widely recognized standards for transliteration, and yet it is often desirable. Newspapers and magazines, library catalogers, immigrant bureaus, and many other organizations have procedures for transliterating "foreign" writing systems. Not just in "ASCII-speaking" lands, but everywhere: Russian names are written in Arabic newspapers, Hebrew names in Greek journals, English names on Chinese passports, Korean publications in Vietnamese library catalogs.
When a translation function is optimized for readability -- and some must be -- the designer must consider whether to force a particular kind of readability on the user, or to give the user a choice. The precise mechanism for this (if indeed any such mechanism can be precise!) is another topic for further study: How to best transliterate from Language A in Writing System B to Language X in Writing System Y?

USER-DEFINED TRANSLATIONS

It should be possible for users to override the decisions made by Kermit programmers regarding the bias of the translation mechanism or its particular details, as well as to add totally new translations, by introducing their own translation tables or functions. Methods for doing this will be described in a future draft of this document. This is primarily a user-interface design issue.

*** How to do user-defined translations:

  SET FILE CHARACTER-SET USER-DEFINED
  SET XFER CHARACTER-SET <valid-xfer-charset>
  SET USER-TRANSLATION FROM <tcs> xxx yyy   ; for incoming files
  SET USER-TRANSLATION TO <tcs> yyy xxx     ; for outbound files

Applies to <tcs> + USER-DEFINED FCS. Can have one pair for each TCS. Now announcements work right, etc etc.

  DUMP USER-TRANSLATION <tcs> [ <file> ]    ; list tables (to file)

For C-Kermit, we have 4 TCS's, so 8 x 2 x 256 = 2K bytes, not bad:

 . Add FC_USER FCS
 . Add 6 tables (not supported for Kanji)
 . Initialize each table to identity function
 . Add 8 functions (even for TRANSPARENT, but NULL for Kanji)
 . Figure out a way of telling user whether table has been defined.
 . Add SHOW USER-TRANSLATION <tcs> [ <file> ]
   (= a bunch of SET USER-TRANSL commands, with comments)
 . Add DUMP USER-TRANSLATION <tcs> [ <file> ]
   (= just the numbers, comma-separated? space-separated? one per line?)
 . Add LOAD USER-TRANSLATION <tcs> <file>
   (= read table written by DUMP, watch out for value and table-size overflow)
 . Add some kind of built-in test pattern?
*** ATTRIBUTE PACKETS We want to accommodate as many computers as possible with a minimum of programming effort, but this approach places a burden on the user in the form of new commands and the confusion that results if the user forgets to issue these commands. This protocol extension does not require support for Kermit File Attribute Packets, whose use is negotiated in the Kermit Initialization exchange, but their use is recommended; the user's burden can be alleviated if the sending Kermit program uses an attribute field to inform the receiving Kermit of the character set used in the file data packets. The receiving program can accept or refuse the file based on whether it supports the specified character set. If the receiving program refuses a file, the user can override this refusal, for example, if a long file contains only a word or two in an unknown character set. The most common user-override is the command SET ATTRIBUTES OFF. However, this also disables other desirable effects of attribute packets, such as prenotification of file size. Therefore, it is desirable to let the user specify exactly which attributes are to be "turned off", e.g., SET ATTRIBUTES ENCODING OFF. When the transfer character set is ASCII (or TRANSPARENT when sent from an ASCII-based system), the Encoding attribute should have the traditional value of "A" (for ASCII): "*!A". In order for the sender to inform the receiver of transfer character sets other than ASCII, a new value for the Encoding attribute ("*") is defined, namely "C", which is substituted for the normal value "A" (ASCII). "C" means that the actual character set is specified as an operand which begins with a single letter that designates the character set registration authority, e.g. 
I for ISO, followed by a registration-authority-specific identifier, as in:

  Ixxx/yyy

where the letter "I" (for ISO) is followed by a pair of ISO registration numbers for the character set, xxx for the "left half" and yyy for the right, expressed in decimal ASCII digits, for example:

  +---+---+---+--------+
  | * | ' | C | I6/100 |
  +---+---+---+--------+

where "*" is code for the Encoding Attribute (or transfer syntax), "'" is the length of its value. In this case, "CI6/100" is 7 characters long, and "'" is the printable encoding for 7 (7 + 32 = 39, the ASCII code for "'"). The character "C" means "I'm using the specified Character set", and "I6/100" specifies the character set: "ISO registration number 6", i.e. US ASCII, in the left half, and ISO registration number 100, which is the right half of Latin-1, in the right. The "I" stands for ISO, and is included to allow for the possibility of other character set registration authorities.

Designators for each character set are given in Table 2, labeled "Kermit Designator". Japanese EUC is a special case, because it is a mixture of single-byte JIS X 0201 (two character sets) and double-byte JIS X 0208. Its Kermit designator is I14/87/13: Japanese Roman in G0, Japanese Kanji in G1, Japanese Katakana in G2 (Katakana characters are indicated by SS2 in the data -- the SS2 is considered part of the file).

In the event that a character set standard changes, but keeps the same registration number, the registration number for the new character set should be preceded by a non-numeric character which indicates the revision number: @ (atsign) = 1, A=2, B=3, and so on (as suggested in ISO 2022). For example "I@2/B100" would indicate an 8-bit single-byte character set having Revision 1 of ASCII as its left half and Revision 3 of Latin-1 as its right. Note: "Revision 1" does not mean the original version, but rather the first revision AFTER the original version.
The Kermit designator for an original version does not have a revision indicator. The form of the character-set designator was chosen because the standards currently provide no single code to designate an 8-bit character set in its entirety. Each half of the character set has its own registration number. For example, ISO 8859-1 (Latin-1) is a single 8-bit character set, but registration number 100 only refers to its right half. Registration number 6 denotes ASCII, which is used as the left half of all ISO 8859 character sets. To promote maximum interoperability among extended Kermit programs, the Kermit designator should be treated as a character string, to be looked up in a small table, rather than as a flexible mechanism to be used for piecing together character sets from an arbitrary assortment of left and right halves. However, the Ixxx/yyy notation leaves open this possibility should it become desirable at a later time. In the event that a new class of registration numbers appears, for example, to denote a single-byte 8-bit character set in its entirety rather than just its left or right half, a different initial letter will be used in the designator, even if the registration authority is the ISO. In the event that other character-set registration authorities appear, they too can be assigned their own unique Kermit designator prefixes (for example, "K" for Kermit Development and Distribution), to avoid ambiguity from conflict of registration numbers. For the present, standards organizations like ANSI and CCITT are not treated as separate registration authorities, because their character sets are also registered by the ISO. Should these organizations adopt character sets that have no ISO counterpart, then special Kermit designator prefixes will be assigned for them. Based on the attribute information, the receiver may accept or reject the file, using Kermit's normal attribute response mechanism. 
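The Encoding attribute layout and the table-lookup treatment of designators described above can be sketched as follows in Python. This is a minimal illustration, not code from any Kermit program: the function names are hypothetical, and the small table contains only the two designators quoted in this section (the full list belongs to Table 2).

```python
# Sketch only. Designators quoted in this section; a real program would
# carry the full list from Table 2.
KNOWN_DESIGNATORS = {
    "I6/100": "LATIN1",
    "I14/87/13": "JAPAN-EUC",
}


def tochar(n):
    """Kermit printable encoding of a small number (n + 32)."""
    return chr(n + 32)


def unchar(c):
    """Inverse of tochar."""
    return ord(c) - 32


def encoding_attribute(designator=None):
    """Build the Encoding attribute: '*', length character, value.

    With no designator the traditional value "A" (ASCII) is used;
    otherwise the value is "C" followed by the designator string.
    """
    value = "A" if designator is None else "C" + designator
    return "*" + tochar(len(value)) + value


def parse_encoding_attribute(field):
    """Return a character-set name, or None if the set is unknown."""
    if not field.startswith("*"):
        raise ValueError("not an Encoding attribute")
    length = unchar(field[1])
    value = field[2:2 + length]
    if value == "A":
        return "ASCII"
    if value.startswith("C"):
        # The designator is looked up as an opaque string, not taken apart.
        return KNOWN_DESIGNATORS.get(value[1:])
    return None
```

For example, encoding_attribute("I6/100") yields the nine characters *'CI6/100 shown in the diagram earlier in this section, and a None result from parse_encoding_attribute is the case in which the receiver has to decide whether to refuse the file.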
To accept, it puts a "Y" as the first character of the data field of the acknowledgement to the attribute packet. To refuse, it puts an "N" instead of a "Y", followed by "*". If the file is refused in this manner, the sending Kermit should respond by sending a "Z" (end-of-file) packet containing a "D" (for Discard) in its data field. The behavior of the receiving Kermit program when an unknown character set is announced to it is governed by the command SET UNKNOWN-CHARACTER-SET. SET UNKNOWN-CHARACTER-SET KEEP means that it should not reject the file, but store it the best way it can (e.g., without translating any characters), DISCARD means that the file should be rejected. AUTOMATIC SELECTION OF FILE CHARACTER-SET BY THE FILE RECEIVER When a file arrives whose transfer character-set is announced in the attribute packet, it is desirable to include a mechanism to allow the receiving Kermit program to select the most appropriate file character-set automatically. Similarly, if the user gives a SET FILE CHARACTER-SET command, it would be desirable to switch to an appropriate TRANSFER CHARACTER-SET automatically, and vice-versa. Any such mechanism should also include a "manual override" to let the user disable it. Suppose, for example, an MS-DOS Kermit program that is about to receive a file has CP437 as its FILE CHARACTER-SET, but the arriving file is announced as CYRILLIC. The receiving Kermit can (a) translate the Cyrillic characters into ASCII using a transliteration scheme (like "Short KOI" phonetic transcription), or (b) switch its file character set to one that contains the greatest number of characters that are also in the transfer character set, in this case CP866. We can design Kermit programs to supply translations between every possible combination of file and transfer character set. Or we can allow only certain combinations, for example Roman-to-Roman, Cyrillic-to-Cyrillic, Hebrew-to-Hebrew, etc. 
In the former case, it is the user's responsibility to choose the most useful combination. In the latter, the receiving Kermit must either reject the file when the file character set is not valid for the incoming transfer character set (or accept it without translation, depending on the setting of UNKNOWN-CHARACTER-SET), or else switch to an appropriate file character set automatically. An optional automatic switching mechanism, configurable by the user, can be provided by the following command: SET SEND AUTOMATIC-TRANSLATION { OFF, ON, <FCS> [ <TCS> ] } Automatic translation action when sending files. OFF means don't automatically switch translations. ON means enable automatic translation. <FCS> <TCS> means: If the current file character set is <FCS>, then use <TCS> as the transfer character set. If <TCS> is omitted, automatic selection of a transfer character set for <FCS> is not done, and the current transfer character set is used. In either case, any previous entry for <FCS> is superseded. SET RECEIVE AUTOMATIC-TRANSLATION { OFF, ON, <TCS> [ <FCS> ] } Automatic translation action when receiving files. OFF means don't automatically switch translations. ON means enable automatic translation. <TCS> <FCS> means: if the announced transfer character set of the incoming file is <TCS>, then use <FCS> as the file character set. If <FCS> is omitted, automatic selection of a file character set for <TCS> is not done, and the current file character set is used. In either case, any previous entry for <TCS> is superseded. Many of these commands can be executed. Their effect is to build a pair of lookup tables. When AUTOMATIC-TRANSLATION is OFF, or the character set is not found in these tables, the prevailing settings are used. ON can be used to enable any tables that had been previously disabled by OFF. The programmer may preload the Kermit program with a default set of tables. However, the default AUTOMATIC-TRANSLATION setting in both directions should be OFF. 
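The lookup-table behavior of SET RECEIVE AUTOMATIC-TRANSLATION described above can be sketched in Python as shown below. The class and method names are hypothetical. The sketch also assumes that adding a table entry does not itself turn the feature ON, a point the text leaves open.

```python
class ReceiveAutoTranslation:
    """Sketch of SET RECEIVE AUTOMATIC-TRANSLATION { OFF, ON, <TCS> [ <FCS> ] }."""

    def __init__(self):
        self.enabled = False  # the recommended default is OFF
        self.table = {}       # announced TCS -> FCS, or None if omitted

    def set(self, *args):
        """Apply one SET RECEIVE AUTOMATIC-TRANSLATION command."""
        if args == ("OFF",):
            self.enabled = False
        elif args == ("ON",):
            self.enabled = True  # re-enables any previously built table
        else:
            tcs = args[0]
            fcs = args[1] if len(args) > 1 else None
            self.table[tcs] = fcs  # supersedes any previous entry for tcs

    def choose_fcs(self, announced_tcs, current_fcs):
        """Pick the file character set for an incoming file."""
        if not self.enabled:
            return current_fcs
        fcs = self.table.get(announced_tcs)
        # No entry, or an entry with <FCS> omitted: keep the prevailing set.
        return current_fcs if fcs is None else fcs
```

With the MS-DOS example given earlier, an entry pairing CYRILLIC with CP866 causes choose_fcs("CYRILLIC", "CP437") to return "CP866" once the feature is ON. The sending side would use the same structure keyed by file character set.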
INTEROPERABILITY WITH UNEXTENDED KERMIT PROGRAMS Extended Kermit programs must be fully interoperable with unextended ones. When the file sender is extended and the receiver is not, the receiver ignores the encoding attribute and stores the file data as received, but after applying any required record-format conversions. In case the sender's encoding attribute causes problems for the receiver, the sending Kermit should have an option to omit this attribute: SET ATTRIBUTE ENCODING OFF (or as a last resort, SET ATTRIBUTES OFF altogether). The sender has the option of translating from a local file character set to any desired transfer character set, including ASCII, that will be useful on the receiving computer. When the file receiver is extended and the sender is not, the receiver has the option of translating the received characters to a local file character set. This will be useful if the character set used in the packets corresponds with one of the receiver's transfer character sets, and it requires the user to manually inform the receiving Kermit of both the transfer and the file character sets. In other cases, the extended Kermit's TRANSLATE command can be used to pre- or postprocess a file to achieve the desired results if the desired translations are available. PERFORMANCE There is nothing in this proposal that affects the performance of the Kermit file transfer protocol. The efficiency of file transfer is the same with or without this extension. However, it is recognized that transfer of 8-bit text will not always be efficient. Since the special characters have their 8th bits set to one, there will be a lot of 8th-bit prefixing in the 7-bit environment -- the higher the proportion of special characters to ASCII characters, the lower the efficiency. For "left-handed" languages like Italian, Norwegian, and Portuguese (in which the preponderance of text characters are ASCII), the impact is negligible. 
For "right-handed" languages like Russian, Greek, Hebrew, and Arabic, where characters come from the right half of the character set, efficiency will be poor in the 7-bit environment. The situation is even worse for Japanese EUC, in which all Kanji bytes have their 8th bit set to 1. For this reason, it is recommended that Kermit programs that implement transfer character sets for non-Roman-based writing systems also include Kermit's locking shift protocol, which is specified and analyzed in a separate document. TERMINAL EMULATION While not part of the Kermit file transfer protocol, terminal emulation is an essential feature of many Kermit programs. It is hoped that all of Kermit's terminal emulators will evolve along the lines of the ISO standards described in the Appendices. In some cases, this is already a fact, insofar as DEC VT200 and 300 series terminals already follow these standards and Kermit programs are available that emulate these terminals. The following Kermit commands are recommended for terminal emulation: SET TERMINAL TYPE <name> Identify the type of terminal to be emulated, for example VT320. SET TERMINAL BYTESIZE <number> Tell how many bits of each arriving character are to be displayed on the screen. This command is used to protect the user from parity bits sent by the host during terminal emulation, even when PARITY is set to NONE, so the normal setting is 7. SET TERMINAL BYTESIZE 8 allows reception of 8-bit bytes. SET TERMINAL CHARACTER SET <remote-character-set> [ <local-character-set> ] Tell how to translate characters during terminal emulation. The <remote-character-set> denotes the codes sent by, and expected by, the remote host. The <local-character-set>, if given, specifies the character codes generated by the local keyboard and displayed on the local screen. If the <local-character-set> is not specified, the current FILE CHARACTER-SET is assumed. 
  Since it is likely that neither one of the two character sets is a standard (TRANSFER) character set, the terminal emulator cannot always use Kermit's built-in file translation tables or functions directly. However, it is often possible to use them in a two-step process, using one of Kermit's transfer character sets as an intermediary.

SET TERMINAL TRANSLATION { INVERTIBLE, READABLE }
  Specifies the desired style of character translation to use during terminal emulation.

SET LANGUAGE
  Should not apply to terminal emulation -- characters should not be added or deleted during translation, because that would interfere with the formatting of the screen.

SET TERMINAL DIRECTION { LEFT-TO-RIGHT, RIGHT-TO-LEFT }
  Specifies the direction of screen writing during terminal emulation. RIGHT-TO-LEFT can be used for Hebrew and Arabic.

SET TERMINAL LOCKING-SHIFT { ON, OFF }
  Specifies whether the terminal emulator should use locking shifts (normally SO/SI) when sending and receiving 8-bit data in the 7-bit communications environment. This behavior is built in to certain terminal emulators (such as VT220, VT320); this command is for use with terminal emulators that do not have this capability built in.

SET TRANSLATION INPUT \aaa \bbb or SET TERMINAL TRANSLATION \aaa \bbb
  Specify that when the character \aaa is received from the communication medium, it should be translated to \bbb before display on the screen. Many such commands can be given, allowing the user to form a custom-made terminal character set.

SET KEY <code> <value>
  Specify that when the key whose code is <code> is pressed, the Kermit program sends the specified <value>. Many such commands can be given, allowing the user to customize the keyboard for any desired character set. The <value> can be a single character or a string of characters.
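
The two-step terminal translation described above (local set to a transfer set, then transfer set to the remote set) can be sketched as follows. This is only an illustration, not the way any Kermit program implements it: Python's built-in codecs stand in for Kermit's translation tables, the function name is invented, and the choice of CP437, Latin-1, and Mac Roman as the three character sets is an assumption made for the example.

```python
# Illustrative sketch only: Python codecs stand in for Kermit's translation
# tables, and the character sets used in the demo are assumptions.

def two_step_translate(data: bytes, source: str, transfer: str, target: str) -> bytes:
    """Translate bytes from `source` to `target` by way of `transfer`.

    Characters that the intermediary (or the target) cannot represent are
    replaced by "?", so the result is not invertible.
    """
    text = data.decode(source)
    # Step 1: local character set -> transfer character set.
    in_transfer = text.encode(transfer, errors="replace")
    # Step 2: transfer character set -> remote character set.
    return in_transfer.decode(transfer).encode(target, errors="replace")


if __name__ == "__main__":
    # u-diaeresis is 0x81 in CP437, 0xFC in Latin-1, and 0x9F in Mac Roman.
    print(two_step_translate(b"\x81ber", "cp437", "latin_1", "mac_roman"))
```

A character that exists in the local set but not in the intermediary (a CP437 line-drawing character, for instance) is lost in step 1, which is why the choice of transfer character set matters.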
Terminal character-set translation should be used in screen capture (session logging), non-transparent screen-print operations, and "raw uploading" of text files (TRANSMIT command, when FILE TYPE is TEXT). Character-set translation should NOT be used in scripting commands such as INPUT and OUTPUT.

APPENDIX A: STANDARDS

ANSI X3.4-1986, "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files.

ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII with provision for substituting "national characters" in selected positions.

ISO 4873 (1986), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple intermediate graphics sets.

ANSI X3.134.1 (1991), "8-Bit ASCII - Structure and Rules", the USA equivalent of ISO 4873.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets.

ISO International Register of Coded Character Sets to be Used with Escape Sequences. This is the source of the ISO registration numbers.

ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape Sequences". The procedure by which a character set gets into the above register and has a registration number and designating escape sequence assigned to it.

JIS X 0202, "Code Extension Techniques for Use with the Code for Information Interchange", the Japanese counterpart of ISO 2022.

ISO 6429-1983, "C1 Control Character Set".
ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022. (Now obsolete?)

ISO 8859 (1987-present) (see Table 6 for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages. The left half of each of these is the same as ASCII and ISO 646 IRV. Each character, including those with diacritics, is represented by a single byte.

ANSI X3.134.2 (1991), "7-Bit and 8-Bit ASCII Supplemental Multilingual Graphic Character Set", the USA equivalent of ISO 8859-1.

JIS X 0201, Japanese Roman / Katakana set (need full reference).

JIS X 0208, Japanese Kanji set (need full reference).

JIS X 0212, Japanese Kanji set (superset of JIS X 0208, reportedly not in use yet, need full reference).

ISO is the International Organization for Standardization. ANSI is the American National Standards Institute. ECMA is the European Computer Manufacturers Association. JIS means Japanese Industrial Standard.

The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested.

In general, the ISO member body from each country acts as the local sales agent for ISO Standards in that country, for example ANSI in the USA:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY 10018
  Telephone 212-354-3300

Each such organization has its own arrangements for disseminating printed documents. ANSI sells them for US dollars; organizations in other countries may either sell them for local currency or give them away, depending on how they are funded to operate.
ISO standards and CCITT recommendations can also be ordered from the UN bookstore, but not free of charge:

  United Nations Bookstore
  United Nations Building
  New York, NY 10017

CCITT recommendations are also available by mail order from ANSI, and via anonymous FTP on the Internet from host BRUNO.CS.COLORADO.EDU or DIGITAL.RESOURCE.ORG in the directory /pub/standards/ccitt/.

APPENDIX B: HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set. This set is divided into two parts:

  1. 33 "control characters" (characters 0 through 31, and character 127).
  2. 95 "graphic characters" (32-126).

Graphic characters make ink appear on the page or phosphor glow on the screen. Control characters are used as fillers or format effectors and for transmission or device control.

The ASCII / ISO-646 IRV character set is shown in Figure 1, arranged in a table of 16 rows and 8 columns. The graphic characters are shown literally (except SP stands for the space character), the control characters by name (control character names and functions are defined in ISO 646).

_____________________________________________________________________________

      00  01  02  03  04  05  06  07
    +---+---+---+---+---+---+---+---+
00  |NUL DLE| SP  0   @   P   `   p |
01  |SOH DC1| !   1   A   Q   a   q |
02  |STX DC2| "   2   B   R   b   r |
03  |ETX DC3| #   3   C   S   c   s |
04  |EOT DC4| $   4   D   T   d   t |
05  |ENQ NAK| %   5   E   U   e   u |
06  |ACK SYN| &   6   F   V   f   v |
07  |BEL ETB| '   7   G   W   g   w |
08  |BS  CAN| (   8   H   X   h   x |
09  |HT  EM | )   9   I   Y   i   y |
10  |LF  SUB| *   :   J   Z   j   z |
11  |VT  ESC| +   ;   K   [   k   { |
12  |FF  FS | ,   <   L   \   l   | |
13  |CR  GS | -   =   M   ]   m   } |
14  |SO  RS | .   >   N   ^   n   ~ |
15  |SI  US | /   ?   O   _   o  DEL|
    +---+---+---+---+---+---+---+---+

Figure 1: The ASCII / ISO-646 International Reference Version 7-bit Character Set
_____________________________________________________________________________

Characters are often referred to by their column and row position in this type of table.
For example, character 05/08 in Figure 1 is "X". Columns 00-01, plus character 07/15, comprise the control set. Columns 02-07, minus character 07/15, comprise the graphics.

ISO Standard 646 allows for national variant 7-bit character sets in which certain non-alphanumeric ASCII graphic characters are replaced by "national characters". The character positions in which replacements are permitted, along with the replacements used by four of the ten ISO 646 national variants, are shown in Table B-1.

_____________________________________________________________________________

Column/Row  ASCII          German        Finnish       Norwegian     French
  04/00     at-sign        section       at-sign       at-sign       a-grave
  05/11     left-bracket   A-diaeresis   A-diaeresis   AE-digraph    degree
  05/12     backslash      O-diaeresis   O-diaeresis   O-slash       c-cedilla
  05/13     right-bracket  U-diaeresis   A-circle      A-circle      section
  05/14     circumflex     circumflex    U-diaeresis   circumflex    circumflex
  06/00     accent-grave   accent-grave  e-acute       accent-grave  accent-grave
  07/11     left-brace     a-diaeresis   a-diaeresis   ae-digraph    e-acute
  07/12     vertical-bar   o-diaeresis   o-diaeresis   o-slash       u-grave
  07/13     right-brace    u-diaeresis   a-circle      a-circle      e-grave
  07/14     tilde          ess-zet       u-diaeresis   tilde         diaeresis

Table B-1: Selected ISO 646 National Variants, Differences from ASCII
_____________________________________________________________________________

The ISO-registered 7-bit national sets are listed in Table B-2.
_____________________________________________________________________________

Description                              ISO Reg.#
International Reference Version               2
British Version, BSI 4730                     4
USA Version, ANSI X3.4-1986                   6
Swedish Version, SEN 850200/B                10
Japanese Version, Roman Chars                14
Italian Version                              15
Spanish Version                              17
German Version                               21
Norwegian Version, NS 4551                   60
French Version, NF Z 62010                   69
Portuguese Version                           84
Hungarian Version, HS 7795/3                 86
Cuba National Standard NC 99-10:81          151
Finnish (DEC Private)                        --
French Canadian (DEC Private)                --
Swiss (DEC Private)                          --

Table B-2: National 7-Bit Character Sets
_____________________________________________________________________________

8-bit character sets are described in ISO 4873 and related standards (see Appendix A). An 8-bit character set has two sides. Each side has a control set and a graphics set.

The "left half" consists of the control set C0 and the graphics set GL (Graphics Left). GL has 94 characters, and corresponds to ASCII (and ISO 646 IRV) positions 02/01-07/14. SP (space) and DEL are special: they are pieces of the template (the upper left and lower right corners, respectively) into which any 94-character graphic character set must fit. All the characters in the left half (C0, GL, SP, and DEL) have their high-order, or 8th, bit set to zero, and are therefore representable in 7 bits.

The "right half" consists of the control set C1 and the graphics set GR (Graphics Right). All characters in the right half have their 8th bits set to one. Figure 2 shows the layout of an 8-bit character set, with C1 occupied by the ISO 6429 control character set.

_____________________________________________________________________________

     <--C0--> <---------GL---------->   <--C1--> <---------GR---------->
      00  01  02  03  04  05  06  07     08  09  10  11  12  13  14  15
    +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
00  |NUL DLE| SP  0   @   P   `   p |  |    DCS|---+                   |
01  |SOH DC1| !   1   A   Q   a   q |  |    PU1|                       |
02  |STX DC2| "   2   B   R   b   r |  |    PU2|                       |
03  |ETX DC3| #   3   C   S   c   s |  |    STS|                       |
04  |EOT DC4| $   4   D   T   d   t |  |IND CCH|                       |
05  |ENQ NAK| %   5   E   U   e   u |  |NEL MW |                       |
06  |ACK SYN| &   6   F   V   f   v |  |SSA SPA|                       |
07  |BEL ETB| '   7   G   W   g   w |  |ESA EPA|                       |
08  |BS  CAN| (   8   H   X   h   x |  |HTS    |       (special        |
09  |HT  EM | )   9   I   Y   i   y |  |HTJ    |        graphics)      |
10  |LF  SUB| *   :   J   Z   j   z |  |VTS    |                       |
11  |VT  ESC| +   ;   K   [   k   { |  |PLD CSI|                       |
12  |FF  FS | ,   <   L   \   l   | |  |PLU ST |                       |
13  |CR  GS | -   =   M   ]   m   } |  |RI  OSC|                       |
14  |SO  RS | .   >   N   ^   n   ~ |  |SS2 PM |                       |
15  |SI  US | /   ?   O   _   o  DEL|  |SS3 APC|                   +---|
    +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
     <--C0--> <---------GL---------->   <--C1--> <---------GR---------->

Figure 2: An 8-Bit Character Set
_____________________________________________________________________________

GR character sets can have either 94 or 96 characters. A 94-character GR set begins in position 10/01 and ends in position 15/14, with Space (SP) occupying position 10/00 and DEL in position 15/15, just like GL (the corners shown in GR in the diagram). A 96-character set has graphic characters in all 96 positions, 10/00 through 15/15.

An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters. This number is sufficient to represent the characters in many of the world's written languages, but not necessarily sufficient to represent all the graphic symbols required in a given application, for instance a multi-language document.

To represent a greater number of graphic characters, ISO 4873 defines four "intermediate sets" of graphic characters, of either 94 or 96 characters each. These are called G0, G1, G2, and G3. The G0 set never has more than 94 graphic characters, and G1-G3 can have up to 96 each. Therefore there can be up to:

  94 + (3 x 96) = 382

graphics characters simultaneously within the repertoire of a given device, assuming all are single-byte sets. These intermediate graphics sets are kept in tables in the memory of the terminal or computer.
One of the intermediate sets (usually G0) is assigned to GL, and (in the 8-bit communications environment) another may be assigned to GR. When the terminal or computer receives a data byte, the numeric value of its bits denotes the position of the character in GL or GR. For example, the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII. In the 8-bit environment, any byte with its 8th bit set to zero is from GL, and a byte with its 8th bit set to one is from GR.

A language like English can be represented adequately by ASCII in GL, because all the required characters fit there. When a language has more than 94 characters, two techniques are used to represent all the characters:

  1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and the special characters (like accented letters) in GR. French, German, and Russian are examples.

  2. For languages with many symbols (e.g. where a symbol is assigned to each word, rather than to each sound), represent each character with multiple bytes rather than one byte. Japanese Kanji, for example, uses a 2-byte code. A multibyte code may be assigned to G0, G1, G2, or G3, just like a single-byte code.

How do we assign actual character sets to G0-G3, and how do we associate the intermediate character sets with the active character set?

Selection of character sets is accomplished using special control characters and escape sequences embedded within the data stream as described in ISO Standard 2022. An ESCAPE SEQUENCE is used to DESIGNATE a particular alphabet (such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular intermediate graphics set (G0, G1, G2, or G3). A SHIFT FUNCTION is used to INVOKE a particular intermediate graphics set into GL or GR. In programmer's terms, GL and GR are pointers into the array of tables G0..G3, and the shift functions simply change the values of these pointers.
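
The "pointers into an array of tables" view can be made concrete with a small sketch. It is only a model of the idea, not part of ISO 2022 or of any Kermit program: the class, method, and table names are invented for this example, only single-byte sets are handled, and SP, DEL, and control characters are assumed to be dealt with elsewhere.

```python
# Illustrative sketch: GL and GR modelled as indices into the array G0..G3.
# All names here are invented for the example.

# A 94-character set occupies 7-bit positions 02/01-07/14.
ASCII_94 = {pos: chr(pos) for pos in range(0x21, 0x7F)}
# The right half of Latin Alphabet 1 is a 96-character set (10/00-15/15),
# stored here under its 7-bit positions 02/00-07/15.
LATIN1_RIGHT_96 = {pos: chr(pos | 0x80) for pos in range(0x20, 0x80)}


def col_row(byte: int) -> str:
    """Return the column/row notation of a byte, e.g. 65 -> '04/01'."""
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


class CodeExtensionState:
    def __init__(self):
        self.g = [ASCII_94, None, None, None]  # intermediate sets G0..G3
        self.gl = 0      # GL points at G0
        self.gr = None   # nothing invoked into GR yet

    def designate(self, n: int, charset: dict) -> None:
        """Designate a character set to Gn (the job of an escape sequence)."""
        self.g[n] = charset

    def invoke_left(self, n: int) -> None:
        """Locking shift: make GL point at Gn."""
        self.gl = n

    def invoke_right(self, n: int) -> None:
        """Locking shift right: make GR point at Gn."""
        self.gr = n

    def lookup(self, byte: int) -> str:
        """Map a graphic data byte to a character through GL or GR."""
        pointer = self.gr if byte & 0x80 else self.gl
        if pointer is None or self.g[pointer] is None:
            raise ValueError("no character set invoked for this byte")
        return self.g[pointer][byte & 0x7F]


if __name__ == "__main__":
    state = CodeExtensionState()
    state.designate(1, LATIN1_RIGHT_96)
    state.invoke_right(1)
    print(col_row(65), state.lookup(65))        # position and character for byte 65
    print(col_row(0xFC), state.lookup(0xFC))    # a byte taken from GR
```

Note that designating a set changes only the contents of one G table, while a shift changes only a pointer; this separation is what allows the same data byte to mean different characters at different times.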
In our discussion, we use the following notation (numbers are decimal unless otherwise noted):

  <ESC>  Escape (ASCII 27, character 01/11)
  <SP>   Space (ASCII 32, character 02/00)
  <SO>   Shift Out (Ctrl-N, ASCII 14, character 00/14)
  <SI>   Shift In (Ctrl-O, ASCII 15, character 00/15)

Table 5 shows the alphabet designation functions for single-byte and multi-byte character sets in both the 7-bit and 8-bit environments. The character which is substituted for "F" identifies the actual character set to be used.

_____________________________________________________________________________

Escape Sequence  Function                                         Invoked By
  <ESC>(F        assigns 94-character graphics set "F" to G0.     SI  or LS0
  <ESC>)F        assigns 94-character graphics set "F" to G1.     SO  or LS1
  <ESC>*F        assigns 94-character graphics set "F" to G2.     SS2 or LS2
  <ESC>+F        assigns 94-character graphics set "F" to G3.     SS3 or LS3
  <ESC>-F        assigns 96-character graphics set "F" to G1.     SO  or LS1
  <ESC>.F        assigns 96-character graphics set "F" to G2.     SS2 or LS2
  <ESC>/F        assigns 96-character graphics set "F" to G3.     SS3 or LS3
  <ESC>$(F       assigns multibyte character set "F" to G0.       SI  or LS0
  <ESC>$)F       assigns multibyte character set "F" to G1.       SO  or LS1
  <ESC>$*F       assigns multibyte character set "F" to G2.       SS2 or LS2
  <ESC>$+F       assigns multibyte character set "F" to G3.       SS3 or LS3

Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________

Table 6 shows the escape sequences used to designate the appropriate parts of each of the registered character sets discussed in this proposal to G1 (except that ASCII is designated to G0, which is the normal situation). It is important to note that the final letter of the escape sequence is not always sufficient to designate a character set.
For example, Czech Standard and JIS Katakana are both designated by letter I, but the two can be distinguished by the intermediate characters of the escape sequence, which specify whether the set is single- or multibyte, or, when both sets are single-byte, whether there are 94 or 96 characters.

_____________________________________________________________________________

                            Escape     ISO           ECMA        ISO/ECMA
Alphabet Name               Sequence   Reference     Reference   Registration
ASCII (ANSI X3.4-1986)      <ESC>(B    ISO 646 IRV   ECMA-6          6
Latin Alphabet No. 1        <ESC>-A    ISO 8859-1    ECMA-94       100
Latin Alphabet No. 2        <ESC>-B    ISO 8859-2    ECMA-94       101
Latin Alphabet No. 3        <ESC>-C    ISO 8859-3    ECMA-94       109
Latin Alphabet No. 4        <ESC>-D    ISO 8859-4    ECMA-94       110
Latin/Cyrillic              <ESC>-L    ISO 8859-5    ECMA-113      144
Latin/Arabic                <ESC>-G    ISO 8859-6    ECMA-114      127
Latin/Greek                 <ESC>-F    ISO 8859-7    ECMA-118      126
Latin/Hebrew                <ESC>-H    ISO 8859-8    ECMA-121      138
Latin Alphabet No. 5        <ESC>-M    ISO 8859-9    ECMA-128      148
* Math/Technical Set        <ESC>-K    ????          ????          143
Chinese (CAS GB 2312-80)    <ESC>$)A   none          none           58
Japanese (JIS X 0208)       <ESC>$)B   none          none           87
JIS-Katakana (JIS X 0201)   <ESC>)I    none          none           13
JIS-Roman (JIS X 0201)      <ESC>)J    none          none           14
Korean (KS C 5601-1989)     <ESC>$)C   none          none          149

Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________

* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series, and other math/technical/line-drawing characters, but there is apparently no such standard set at this time (ISO 6862? ISO DIS 10367?)

Tables 7 and 8 show the shift functions that are used to invoke the intermediate character sets. These shift functions may be either locking or single. "Locking shift" is like shift-lock on a typewriter. It means that all subsequent characters until the next shift are to be taken from the designated intermediate character set. "Single shift" applies only to the character (either single or multibyte) that follows it immediately, but single shift functions are only available for the G2 and G3 sets. Locking shift functions remain in effect across alphabet changes.

In the 7-bit environment, only one character set, GL, can be active at a time. The active character set can be selected from among the intermediate sets G0-G3 by the shifts shown in Table 7. Control characters from C0 are transmitted as-is. Each control character from the C1 set is sent as <ESC> followed by the character whose value is the C1 character's value minus 64. For example, the C1 character 10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").

_____________________________________________________________________________

Shift  Representation  Name             Function
SI     Ctrl-O          Shift In         invoke G0 into GL
SO     Ctrl-N          Shift Out        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2  invoke G2 into GL
LS3    <ESC>o          Locking Shift 3  invoke G3 into GL
SS2    <ESC>N          Single Shift 2   select single character from G2
SS3    <ESC>O          Single Shift 3   select single character from G3

Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________

ISO 2022 also allows for an alternative C0 set in which the SS2 function is assigned to the 7-bit control character EM (Control-Y, 01/09). This set must be designated by ESC 2/1 4/12 ("The C0 set of control characters of ISO 646 with EM replaced by SS2", ISO Registration number 140). This set is not in common use.

In the 8-bit environment two character sets, GL and GR, can be active at once. A GL character is selected by a byte whose 8th bit is zero, and a GR character by a byte whose 8th bit is one. The actual character sets assigned to GL and GR are selected by the shifts shown in Table 8. Control characters from both the C0 and C1 sets are sent as is.
_____________________________________________________________________________

Shift  Representation  Name                   Function
LS0    Ctrl-O          Locking Shift 0        invoke G0 into GL
LS1    Ctrl-N          Locking Shift 1        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2        invoke G2 into GL
LS3    <ESC>o          Locking Shift 3        invoke G3 into GL
LS1R   <ESC>~          Locking Shift 1 Right  invoke G1 into GR
LS2R   <ESC>}          Locking Shift 2 Right  invoke G2 into GR
LS3R   <ESC>|          Locking Shift 3 Right  invoke G3 into GR
SS2    08/14           Single Shift 2         select single character from G2
SS3    08/15           Single Shift 3         select single character from G3

Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________

So we have a 3-tiered system. At the bottom tier lie all the world's coded character sets. We can designate up to four of them, one to each of the intermediate graphics sets G0, G1, G2, and G3 using the escape sequences shown in Tables 5 and 6. The terminal or computer keeps each of the selected intermediate sets in memory. There is also one active set, composed of GL and GR. The intermediate sets are invoked to GL or GR (one at a time) by the shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8. A simplified diagram for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed diagrams of both the 7-bit and 8-bit environments). On a more sophisticated output device, Figure 3 would contain numerous arrows pointing upwards to demonstrate the operation of the designators and shifts.
_____________________________________________________________________________

               +--+--------+             +--+--------+
               |C0|   GL   |             |C1|   GR   |
               |  |        |             |  |        |
               |  |        |             |  |        |        8-Bit
               |  |        |             |  |        |        Code
               |  |        |             |  |        |        In Use
               +--+--------+             +--+--------+

            LS0        LS1,LS1R     LS2,LS2R     LS3,LS3R     Shifts
                                      SS2          SS3

         +--------+   +--------+   +--------+   +--------+    Intermediate
         |        |   |        |   |        |   |        |    Graphics
         |   G0   |   |   G1   |   |   G2   |   |   G3   |    Sets
         |        |   |        |   |        |   |        |
         +--------+   +--------+   +--------+   +--------+

                                                      Alphabet Designation
               <ESC>(B    <ESC>-A    <ESC>-B    <ESC>-L    <ESC>$)B   Sequences

              +--------+ +--------+ +--------+ +--------+ +--------+
 The world's  |  ISO   | |  ISO   | |  ISO   | |  ISO   | | JIS X  |
 registered   | 646IRV | | Latin  | | Latin  | | Latin  | |  0208  |
 character    |(ASCII) | |   1    | |   2    | |Cyrillic| | Kanji  |
 sets         +--------+ +--------+ +--------+ +--------+ +--------+

Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________

For example, the following sequence could be used to transmit the German word "<u-diaeresis>bern<a-diaeresis>chtig" using Latin Alphabet 1 in the 7-bit environment:

  <ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig

where:

  <ESC>(B  designates ASCII to G0
  <ESC>-A  designates the right half of Latin Alphabet 1 to G1
  <SO>     invokes G1 to GL
  |        is character 07/12, but since G1 is invoked to GL, it really denotes character 15/12, which is <u-diaeresis>
  <SI>     invokes G0 to GL
  bern     are characters from G0, which is invoked in GL
  <SO>     invokes G1 to GL
  d        is character 06/04, but since G1 is invoked to GL, it really denotes character 14/04, which is <a-diaeresis>
  <SI>     invokes G0 to GL
  chtig    are characters from G0

The same word could be transmitted in the 7-bit environment using single shifts, if Latin Alphabet 1 were designated to G2 (or G3):

  <ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig

(where <ESC>.A designates the 96-character right half of Latin-1 to G2, as in Table 5, and <ESC>N is Single Shift 2).
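
The locking-shift example above can be reproduced with a short encoder. This is a minimal sketch under stated assumptions, not a description of any Kermit implementation: the function names are invented, only ASCII in G0 and the right half of Latin Alphabet 1 in G1 are supported, G0 is assumed to be in GL at the start, and data that itself contains SO, SI, or ESC is not treated specially.

```python
# Illustrative sketch of 7-bit transmission with SO/SI locking shifts,
# limited to ASCII (G0) and the right half of Latin Alphabet 1 (G1).
# Function names are invented for this example.

ESC, SO, SI = 0x1B, 0x0E, 0x0F


def designator(n: int, final: str, size: int = 94, multibyte: bool = False) -> bytes:
    """Build the escape sequence that designates set `final` to Gn (Table 5)."""
    if multibyte:
        intermediates = "$" + "()*+"[n]
    elif size == 94:
        intermediates = "()*+"[n]
    elif size == 96 and 1 <= n <= 3:
        intermediates = "-./"[n - 1]        # no 96-character designator for G0
    else:
        raise ValueError("unsupported combination of set size and Gn")
    return bytes([ESC]) + (intermediates + final).encode("ascii")


def encode_7bit(text: str) -> bytes:
    """Encode Latin-1 text for a 7-bit path using SO/SI locking shifts."""
    out = bytearray(designator(0, "B") + designator(1, "A", size=96))
    in_g1 = False                            # assume GL initially holds G0
    for code in text.encode("latin_1"):
        if code < 0x20:                      # C0 control: sent as is
            out.append(code)
        elif 0x80 <= code < 0xA0:            # C1 control: <ESC> + (value - 64)
            out += bytes([ESC, code - 64])
        else:
            want_g1 = code >= 0xA0           # right-half graphic needs G1 in GL
            if want_g1 != in_g1:             # shift only when the set changes
                out.append(SO if want_g1 else SI)
                in_g1 = want_g1
            out.append(code & 0x7F)          # 8th bit is never transmitted
    return bytes(out)


if __name__ == "__main__":
    print(encode_7bit("\u00fcbern\u00e4chtig"))
```

Because the shifts are locking, a run of consecutive right-half characters costs only one SO and one SI, which is why this method suits text in which such characters cluster together.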
In the 8-bit environment it could be transmitted using no shifts at all:

  <ESC>(B<ESC>-A<u-diaeresis>bern<a-diaeresis>chtig

The designation escape sequences are transmitted only at the beginning of a session and need not be repeated after the initial designations are made, unless an intermediate set (G0-G3) is to be recycled.

To understand the three-tiered design of ISO 2022, imagine a computer programmed to display a mixture of character sets on its screen. A large collection of fonts might be stored on the disk, one font per file. These are the character sets of the bottom tier. When a font is needed, it will be read from the disk and stored in memory in an array, for rapid access. If several fonts are needed, they will be stored in several arrays. These arrays are the intermediate character sets, G0-G3. When a data byte arrives to be displayed, the actual graphic representation is taken from GL or GR (depending on the byte's 8th bit). GL is associated with one of the intermediate graphic sets, and GR with another. If no more than four character sets are used, then each one needs to be read from the disk only once, and display is rapid and efficient thereafter.

Perhaps the most common application of ISO 2022 shifting techniques is with the Japanese EUC (Extended UNIX Code) character set, which combines JIS X 0201 (which in turn consists of an ASCII-like Roman alphabet in the left half and Japanese Katakana characters in the right) and JIS X 0208 (a double-byte Japanese Kanji character set). EUC encoding is used not only in data communications, but also in files, e-mail, etc. EUC is used as follows:

  Left half of JIS X 0201 (Roman, similar to ASCII) is designated to G0.
  JIS X 0208 (Kanji) is designated to G1.
  Right half of JIS X 0201 (Katakana) is designated to G2.
  G0 is initially invoked to GL.
  G1 is initially invoked to GR.

In the 8-bit environment, any byte with its 8th bit equal to zero is a Roman G0 graphic or a C0 control character.
A byte with its 8th bit equal to 1 and low-order 7 bits falling in the graphic range is the first byte of a Kanji character pair. Others are C1 controls. The C1 control character SS2 selects the subsequent single byte from the Katakana set. In the 7-bit environment, SO and SI are used to shift G1 in and out of GL, and Kanji bytes are transmitted without their high-order bits. C1 controls, including SS2, are transmitted in their 2-byte 7-bit form (SS2 becomes <ESC>N).

ANNOUNCING ISO 2022 FACILITIES

A large portion of ISO 2022 is devoted to describing how 8-bit characters may be transmitted on a 7-bit communication path, for example when parity is in use. In the 7-bit environment, there is only GL -- no GR. Therefore, all characters are transmitted with their 8th bit removed, and shifts are used to specify which intermediate set they belong to.

In fact, there are many possible ways to use the ISO 2022 code extension facilities within both 7-bit and 8-bit environments. For example, the sender may inform the receiver in advance whether G1, G2, or G3 will be used, etc, so that the receiver can allocate the appropriate resources.

At the beginning of any particular data transfer, the facilities that actually will be used can be announced with a sequence of the form <ESC><SP>F, where F is replaced by an ISO 2022 announcer. Several of the most important ones are described here. Table 9 lists all the defined announcers in summary form. For details, see ISO 2022.

<ESC><SP>A means that only the G0 set will be used, invoked into GL. No shift functions will be used. In the 8-bit environment, GR is not used. In other words, only a single 7-bit character set is used.

<ESC><SP>B means the G0 and G1 sets will be used with locking shifts. In the 7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL. In the 8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL.
In other words, two character sets are used, with characters from both sets always sent as 7-bit values, with locking shifts used to specify the 8th bit.

<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0 invoked in GL and G1 in GR. No locking shift functions are used. In other words, a single 8-bit character set is used, with all 8 bits transmitted as data. GL is selected when the character's 8th bit is zero, GR is selected when the 8th bit is one.

<ESC><SP>D means that G0 and G1 will be used with locking shifts. In the 7-bit environment, <SI> invokes G0 into GL and <SO> invokes G1 into GL. In the 8-bit environment, all 8 bits of each character are transmitted with no shifts.

<ESC><SP>L means that Level 1 of ISO 4873 will be used. That is, a single 8-bit character set with C0, G0, C1, and G1, with no shift functions. This is like <ESC><SP>C.

<ESC><SP>M means that Level 2 of ISO 4873 will be used. This is equivalent to Level 1, with the addition of G2 and G3. Characters from G2 and G3 are invoked only by the single-shift functions SS2 and SS3.

<ESC><SP>N means that Level 3 of ISO 4873 will be used. This is equivalent to Level 2 with the addition of the locking shift functions LS1R, LS2R, and LS3R. (Note that ISO 4873 does not concern itself with the 7-bit environment, and therefore does not discuss the use of LS0, LS1, LS2, or LS3.)
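
As a small illustration of the <ESC><SP>F form, the announcers described above can be built and recognized with a lookup table. This is only a sketch: the function names and the wording of the descriptions are invented for the example, and only the announcers discussed in the preceding paragraphs are included.

```python
# Illustrative sketch: building and recognizing ISO 2022 announcers of the
# form <ESC><SP>F. Names and description strings are invented for the example.

ESC, SP = "\x1b", " "

ANNOUNCERS = {
    "A": "G0 only, invoked into GL; no shift functions",
    "B": "G0 and G1, selected by locking shifts in both environments",
    "C": "G0 in GL and G1 in GR (8-bit environment); no locking shifts",
    "D": "G0 and G1; locking shifts in 7-bit, none in 8-bit",
    "L": "Level 1 of ISO 4873",
    "M": "Level 2 of ISO 4873",
    "N": "Level 3 of ISO 4873",
}


def announce(final: str) -> str:
    """Return the announcer sequence for the given final character."""
    if final not in ANNOUNCERS:
        raise ValueError("announcer not covered by this sketch")
    return ESC + SP + final


def describe_announcer(sequence: str) -> str:
    """Return a short description of a three-character announcer sequence."""
    if len(sequence) != 3 or sequence[:2] != ESC + SP:
        raise ValueError("not an <ESC><SP>F sequence")
    return ANNOUNCERS.get(sequence[2], "announcer not covered by this sketch")


if __name__ == "__main__":
    print(describe_announcer(announce("C")))
```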
_____________________________________________________________________________

Esc Sequence  7-Bit Environment            8-Bit Environment
<ESC><SP>A    G0->GL                       G0->GL
<ESC><SP>B    G0-(SI)->GL, G1-(SO)->GL     G0-(LS0)->GL, G1-(LS1)->GL
<ESC><SP>C    (not used)                   G0->GL, G1->GR
<ESC><SP>D    G0-(SI)->GL, G1-(SO)->GL     G0->GL, G1->GR
<ESC><SP>E    Full preservation of shift functions in 7 & 8 bit environments
<ESC><SP>F    C1 represented as <ESC>F     C1 represented as <ESC>F
<ESC><SP>G    C1 represented as <ESC>F     C1 represented as 8-bit quantity
<ESC><SP>H    All graphic character sets have 94 characters
<ESC><SP>I    All graphic character sets have 94 or 96 characters
<ESC><SP>J    In a 7 or 8 bit environment, a 7 bit code is used
<ESC><SP>K    In an 8 bit environment, an 8 bit code is used
<ESC><SP>L    Level 1 of ISO 4873 is used
<ESC><SP>M    Level 2 of ISO 4873 is used
<ESC><SP>N    Level 3 of ISO 4873 is used
<ESC><SP>P    G0 is used in addition to any other sets:
              G0 -(SI)-> GL                G0 -(LS0)-> GL
<ESC><SP>R    G1 is used in addition to any other sets:
              G1 -(SO)-> GL                G1 -(LS1)-> GL
<ESC><SP>S    G1 is used in addition to any other sets:
              G1 -(SO)-> GL                G1 -(LS1R)-> GR
<ESC><SP>T    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL               G2 -(LS2)-> GL
<ESC><SP>U    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL               G2 -(LS2R)-> GR
<ESC><SP>V    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL               G3 -(LS3)-> GL
<ESC><SP>W    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL               G3 -(LS3R)-> GR
<ESC><SP>Z    G2 is used in addition to any other sets:
              SS2 invokes a single character from G2
<ESC><SP>[    G3 is used in addition to any other sets:
              SS3 invokes a single character from G3

Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________

STANDARD VERSUS PRIVATE CHARACTER SETS

Most of the popular private 8-bit character sets, notably the IBM PC code pages and the Apple Macintosh character sets (but they are not alone), differ from the standard character sets in four important ways:

  1. The repertoire of characters is different.
  2. The encoding of characters is different.
  3. The C1 area is sometimes used for graphics, which is forbidden by the standards.
  4. In some cases, even the C0 area is used for graphics.

However, most of these character sets conform to the requirement that the left half be identical with US ASCII.

APPENDIX C: (deleted)

APPENDIX D: SUMMARY OF KERMIT COMMANDS RELATED TO CHARACTER SET TRANSLATION

SET FILE TYPE { BINARY, TEXT }
  BINARY means no translation, and overrides all other file-related commands, including SET TRANSFER. TEXT is the default; it enables file transfer character set translation, depending on the setting of SET TRANSFER.

SET FILE CHARACTER-SET <name>
  Effective only when file type is TEXT. Tell Kermit what character set the file is coded in, or what character set to translate an incoming file to.

SET TRANSFER { CHARACTER-SET <name>, LOCKING-SHIFT { ON, OFF, FORCED } }

  CHARACTER-SET <name>
    Invoke file transfer character set translation. <name> is TRANSPARENT, ASCII, LATIN1, LATIN2, ..., CYRILLIC, JAPAN-EUC, etc.

  LOCKING-SHIFT { ON, OFF, FORCED }
    Enable, disable, or force locking-shift transport protocol for efficient transfer of 8-bit data in the 7-bit communications environment. Normally enabled. Used only if both Kermit programs agree in the feature negotiation phase to use it (essentially, if PARITY is not NONE, and they both have locking-shift capability).

SET LANGUAGE <name>
  This command informs the program which language is being translated, to allow for special language-based transliteration rules, such as replacing a-diaeresis by ae.

SET { TRANSFER, TERMINAL } TRANSLATION { INVERTIBLE, READABLE }
  Specify the goal of the specified translation: invertibility or readability.

SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }
  Tell the file receiver whether to keep or cancel an incoming file that contains an unknown character set. KEEP is the default.

SET { SEND, RECEIVE } AUTOMATIC-TRANSLATION { ON, OFF, <set1> [ <set2> ] }
  Enable or disable automatic selection of a file transfer translation
  table in the indicated direction, or specify pairs of character sets to
  be used: given <set1>, automatically translate to <set2>.  Default in
  both directions is OFF.

SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name-of-attribute> { ON, OFF }
  Enables or disables processing of attribute packets, or specific
  attribute fields such as DATE, ENCODING, LENGTH, etc.

SET TERMINAL { CHARACTER-SET, DIRECTION, LOCKING-SHIFT, TRANSLATION }
  Specifies terminal emulation character-set translation, screen writing
  direction, locking shift usage, and translation goal.

SHOW { CHARACTER-SETS, LANGUAGE, FILE, TRANSFER, PROTOCOL, TERMINAL }
  Display what character sets, translation tables, and languages are
  available, and which ones are currently selected.

TRANSLATE <file1> <file2> [ <file1-character-set> [ <file2-character-set> ] ]
  Copies local file <file1> to local file <file2>, translating <file1> from
  <file1-character-set> to <file2-character-set>.  If <file1-character-set>
  is not specified, the current FILE CHARACTER-SET is used.  If
  <file2-character-set> is not specified, the current TRANSFER
  CHARACTER-SET is used.  Note that this command can be used to convert
  between two different FILE CHARACTER-SETS, in which case an appropriate
  TRANSFER CHARACTER-SET can be used as an intermediate step.


APPENDIX E: (Deleted)


APPENDIX F: (Deleted)


APPENDIX G: OFFICIAL CHARACTER SET TRANSLATIONS

Apple: ???

Atari: ???

IBM:

  IBM lists its character sets in the following manuals:

    "Graphic Character Identification System, Graphic Character Global
    Identifier (GCGID) Structure", C-H 3-3220-055, 1989 (Internal Use
    Only).

    "Registry of Graphic Character Sets and Code Pages", C-H 3-3220-050
    (Internal Use Only).

  The translations between its corporate code pages and ISO standard
  character sets are given in the following manuals:

    "SAA Character Data Representation Architecture (CDRA)"
      Executive Overview:  GC09-1392-00 (15 pages)
      Level-1, Reference:  SC09-1390-00 (64 pages)
      Level-1, Registry:   SC09-1391-00 (tables, 720 pages)

  In particular, IBM has adopted ISO 8859-1 Latin Alphabet 1 as IBM Code
  Page 0819, and publishes its official, invertible translations between
  this code page and various private IBM code pages (such as CP850 and
  CECP500), as well as invertible or noninvertible translations between
  many other pairs of IBM code pages.  From these, it is possible to infer
  other translations, for example between Code Page 437 and Latin-1.

Commodore: ???

Data General: ???

Digital Equipment Corporation: ???

Microsoft: ???

(Much work is needed on this section...)


REFERENCES:

The standards listed in Appendix A, the documents in Appendix G, plus:

CCITT Recommendation T.61, "Character Repertoire and Coded Character Sets
  for the International Teletex Service", Geneva (1980, amended at
  Malaga-Torremolinos 1984).

Chandler, John, "IBM System/370 Kermit User's Guide", version 4.2, (1991)
  (Internet: watsun.cc.columbia.edu:kermit/b/ik[cmtx]ker.{doc,ps}).  For
  VM/CMS, MVS/TSO, CICS, and MUSIC.  Detailed description of how to use
  Kermit's character set translation facilities in the IBM mainframe
  environment.

da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
  The specification of the Kermit file transfer protocol before the
  addition of this extension.

Do, James, Ngo^ Thanh Nha`n, Hoa`ng Nguye^n, "A proposal for Vietnamese
  character encoding standards in a unified text processing framework",
  Computer Standards & Interfaces 14 (1992) 3-12, Elsevier North-Holland.

Gianone, Christine M., "It's Time to Prepare for International Computing",
  PC Week, October 2, 1989.

Gianone, Christine M., "Using MS-DOS Kermit", Second Edition, Digital Press
  (1991).
  Chapter 13 describes how to use the character set translation facilities
  of MS-DOS Kermit 3.0 and later on IBM PCs, PS/2s, and compatibles, for
  both terminal emulation and file transfer.  Also included are character
  set and conversion tables for many Roman and Cyrillic character sets.

Gianone, Christine M., and Frank da Cruz, "C-Kermit User Guide", version 5A
  (1991) (Internet: watsun.cc.columbia.edu:kermit/sw/ckuker.{doc,ps}).
  Description of the terminal and file transfer character set translation
  features of C-Kermit 5A for UNIX and VAX/VMS.

Gianone, Christine M., and Frank da Cruz, "A Locking Shift Mechanism for
  the Kermit File Transfer Protocol", unpublished paper, Columbia
  University, October 1991 (watsun.cc.columbia.edu:kermit/e/lshift.txt).
  A Kermit protocol extension for transferring 8-bit text efficiently in
  the 7-bit communication environment.

Hart, Edwin (ed.), "ASCII and EBCDIC Character Set and Code Issues in a
  Systems Applications Architecture", SSD #366, SHARE Inc., Chicago, IL,
  USA (June 1989).  Commonly called the "SHARE White Paper".  A cogent
  description of the problems of character set translation in the IBM
  computing environment, with recommendations adopted by SHARE, an
  international, voluntary organization of users of IBM systems.

IBM System/370 Reference Summary, IBM GX20-1850-6.  The definitive
  US-ASCII / US-EBCDIC translation table.

ISO 639, "Code for Representation of Names of Languages" (1988).  Useful
  for naming language-related symbols in Kermit programs.

ISO 3166, "Country Codes" (1988 + Registration Newsletter updates).  Useful
  for naming country-related symbols in Kermit programs.

ISO/IEC 10646-1:1993, Multiple-Octet Coded Character Set.  The universal
  character set.

Pirard, Andr'e, "Guidelines to Use 8-Bit Character Codes", University of
  Liege, Belgium, unpublished paper on character set translation problems,
  written from the West European perspective, listing numerous suggested
  invertible translation tables.
  Files: watsun.cc.columbia.edu:kermit/charsets/iso8859.networking and
  iso8859.moretran.

"The Unicode Standard, Worldwide Character Encoding", Version 1.0, Volume
  1, Addison-Wesley (1991).

[End of ISOK6.TXT]