Sun Dec 5 14:34:25 1993
A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS
**********
NOTE: This is a work in progress, and will be updated from time to time.
**********
Christine M. Gianone
cmg@watsun.cc.columbia.edu
Manager, Kermit Development and Distribution
Frank da Cruz
fdc@watsun.cc.columbia.edu
Manager, Network Planning
Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA
DRAFT NUMBER 7.1
Dec 5, 1993
ABSTRACT
An extension to the presentation layer of the Kermit file transfer protocol
is proposed to allow transfer of non-English-language text files between
unlike computers by substitution of standard character sets other than ASCII
in Kermit's text-file transfer data packets. Methods for selection,
announcement, and use of these character sets are described. The reader is
assumed to be familiar with the Kermit file transfer protocol and with basic
computing and terminology. The relevant ANSI and ISO character-set related
standards are summarized in Appendix B of this document.
This is a nearly final draft. The protocol and many of the commands described
in this document have been successfully implemented in major Kermit programs
including MS-DOS Kermit 3.0, C-Kermit 5A for UNIX and VAX/VMS, and IBM
Mainframe Kermit 4.2 for VM/CMS, MVS/TSO, MUSIC, and CICS. Special thanks
to John Chandler and Hirofumi Fujii for extensive contributions to this
draft, and to John Klensin for his comments and support.
SUMMARY OF CHANGES SINCE DRAFT #5, April, 1990
- Abandonment of the two-level concept. Mixed languages will be
handled by using ISO 10646 or UNICODE as the transfer character set. The
details remain to be specified.
- Abandonment of CSN 36 91 03, Czechoslovak Standard alphabet as a transfer
character set. Czech is adequately covered by ISO 8859-2.
- Adoption of Japanese EUC as the transfer character set for Japanese
text files, rather than JIS X 0208.
- Explanation of Japanese EUC added to Appendix B.
- Reference to Kermit's new locking shift transport protocol.
- Removal of (unworkable) design for user-defined translations.
- Addition of mechanism for automatic translation-table selection.
- Addition of notion of "translation goal" and related commands.
- Deletion of irrelevant or redundant appendices.
- Addition of an annotated References section.
- Short sections added on terminology and notation.
- Note: Table I moved to Appendix B, so table numbers are out of order.
SUMMARY OF CHANGES SINCE DRAFT #4, August, 1989
- Changes for Level 1 only, to reflect experience in writing the code
to implement it for MS-DOS Kermit 3.0, C-Kermit 5A, and Kermit 370 4.2.
Level 2 is on hold indefinitely pending ISO 10646 & Unicode developments.
- Abandonment of separate attributes for encoding and character set.
- Change all references to ASCII as I2 into I6 (ISO Registration Number).
- Change description of SET LANGUAGE to remove side effects.
- Differentiation of SET TRANSFER CHARACTER ASCII and TRANSPARENT.
- The section on terminal emulation has not been changed, even though
this subject needs detailed treatment in this document.
SUMMARY OF CHANGES SINCE DRAFT #3, July 20, 1989
- Expanded & more precise definition of Kermit's character set designators
- Simplification of the syntax of the (former) SET TRANSFER-SYNTAX command
- Addition of SET LANGUAGE command
- Clarification of Kermit's behavior when it receives an unknown character set
- Addition of Appendix F to specify how each Kermit Level is invoked
- Correction of numerous typographical and other errors
ACKNOWLEDGEMENTS
Many thanks to these people for their helpful and constructive comments during
the drafting process. In most cases, their suggestions or the information
they provided have been incorporated into this or previous drafts.
John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
Alan Curtis (University of London, UK)
Joe Doupnik (Utah State University, USA)
Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
John Klensin (Massachusetts Institute of Technology, USA)
Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
Vladimir Novikov (VNIIPAS, Moscow, USSR)
Jacob Palme (Stockholm University, Sweden)
Andre Pirard (University of Liege, Belgium)
Paul Placeway (Ohio State University, USA)
Gisbert W. Selke (WIdO, Bonn, Germany)
Fridrik Skulason (University of Iceland, Reykjavik, Iceland)
Johan van Wingen (Leiden, Netherlands)
Konstantin Vinogradov (ICSTI, Moscow, USSR)
Amanda Walker (InterCon Systems Corp, USA)
Thanks also to the following people for organizing meetings or conferences
in their countries at which the issues of this proposal were discussed:
Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)
and thanks also to those who attended these gatherings!
Thanks to the Kermit developers who have implemented this extension in their
Kermit programs:
John Chandler (Kermit-370)
Frank da Cruz (C-Kermit)
Joe Doupnik (MS-DOS Kermit)
Hirofumi Fujii (C-Kermit, MS-DOS Kermit, and NEC PC9801 Kermit)
Finally, thanks to other experts who provided valuable information:
Jerry Andersen, IBM
Lloyd Anderson, Ecological Linguistics
Joe Becker, Xerox Corporation and UNICODE Consortium
James Do, Mentor Graphics, San Jose, CA
Edwin Hart, Johns Hopkins University Applied Physics Laboratory
NOTATION
This document is written in plain 7-bit US ASCII, and to be understood
correctly it should be displayed in plain 7-bit US ASCII. The notation:
<xxx>
is used to express a non-ASCII or non-graphic character, where "xxx" is
replaced by the name of the character, for example:
<ESC> (All capital letters: the name of a control character)
or:
<A-grave> (Lower or mixed case: a letter with a diacritical mark)
In other places (which should be clear from the context), the same notation is
used to denote a parameter to a Kermit command, for example:
<filename>
to stand for the name of any file.
TERMINOLOGY
A "character" is the minimum unit of a writing system: a letter, a digit, a
punctuation mark, an ideogram, without regard to the style of rendering except
for capitalization in scripts where that is possible, and without regard to
computer encoding.
A "character set" is a particular, specified group of characters, for example
(and most typically) all the letters, digits, and punctuation marks needed for
a particular writing system.
A "coded character set" is the internal computer representation of a character
set, in which each character is assigned a unique code, often with the
addition of special control codes. In this document, "character set" and
"coded character set" are used synonymously unless otherwise noted.
"Code page" is the term used by IBM and Microsoft to mean "coded character
set".
A "code point" is the association between a character and its encoding in a
particular character set.
An "octet" is a computer storage unit of 8 bits.
A "byte" is an octet, unless otherwise noted.
The word "translation" is used loosely in this document to denote conversion
between character set encodings, not translation between languages or any
other higher-level notion. When characters are intentionally replaced by
different characters, the word "transliteration" is used.
STATEMENT OF THE PROBLEM
The Kermit file transfer protocol has always been able to transfer text files
between unlike computers (e.g. a UNIX system with ASCII stream text files and
an IBM mainframe with EBCDIC record-oriented text files). To do the text
file code conversion, Kermit transfers text in ASCII. However, ASCII
includes only enough letters and symbols for English.
There are now computers capable of representing the characters of other
languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew,
Arabic, and Greek characters; Chinese, Japanese, and Korean ideograms.
However, different computer manufacturers use different codes for these
characters. For example, the IBM PS/2 and the Apple Macintosh both have
character sets that are "8-bit ASCII". When the character value is 32-127,
the character is (normally) a standard ASCII graphic (printable) character.
When the value is 128 or higher, it is a "special" character. Unfortunately,
the PC and the Macintosh assign different special characters to these values.
Here are just a few examples:
  Value   PS/2 Character       Macintosh Character
   138    Small e grave        Small a diaeresis
   143    Capital A ring       Small e grave
   144    Capital E acute      Small e circumflex
   136    Small e circumflex   Small a grave
When a file contains "8-bit ASCII", basic Kermit transfers it without any
character translation. Therefore, a text file written in French, German,
Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain
the wrong characters when it arrives at its destination: the PS/2's e-grave
becomes a-diaeresis on the Macintosh, etc.
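The mismatch in the table above can be reproduced with a short sketch. It
assumes that Python's standard "cp437" and "mac_roman" codecs are reasonable
stand-ins for the PS/2 and Macintosh character sets; the text itself does not
name specific code pages, and the helper name describe() is made up for
illustration.

```python
# Decode the same byte values under two different "8-bit ASCII" sets.
# Assumption: cp437 stands in for the PS/2 set, mac_roman for the Macintosh.
import unicodedata


def describe(value, codec):
    """Return the Unicode name of the character that `codec` assigns to `value`."""
    return unicodedata.name(bytes([value]).decode(codec))


for value in (138, 143, 144, 136):
    print(value, "|", describe(value, "cp437"), "|", describe(value, "mac_roman"))
```

Each row printed shows two different characters for one byte value, which is
what happens to a file that is moved without translation.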
There are many computer vendors in the world and nobody controls what codes
they use to represent characters. Without a standard protocol for
transferring non-ASCII text, each computer would have to know the codes of
all the other computers in order for correct transfer of non-English text
files to occur between all combinations of unlike systems.
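The size of this problem can be estimated with a little arithmetic. The
sketch below assumes, for simplicity, one character set per system and a
single common transfer character set; both function names are illustrative
only.

```python
# Count the translation tables needed among n unlike systems.
# Simplifying assumptions: one character set per system, one transfer set.

def pairwise_tables(n):
    """Every system needs a table from each of the other n - 1 systems' codes."""
    return n * (n - 1)


def tables_with_transfer_set(n):
    """Every system needs one table to the transfer set and one back."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_tables(n), tables_with_transfer_set(n))
```

The pairwise count grows with the square of the number of systems, while the
count with a common intermediate set grows only linearly.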
To complicate matters, many computers now support more than one character
set. IBM mainframes have not only "standard" US EBCDIC, but also several
EBCDIC-based Country Extended Code Pages (CECPs) for the support of West
European languages, Hebrew, Kanji, etc. The IBM PC and PS/2 have a variety
of ASCII-based 8-bit code pages for the same purpose. These character sets
are a welcome addition because they allow users of these computers to create,
display, and print documents in languages other than English. Unfortunately,
the computer's file system keeps no record of which character set is used in
each file.
IBM is not the only source of private character sets. The Apple Macintosh has
many character sets and fonts. DEC supports its own multinational character
set as well as private encodings for Greek, Hebrew, etc. The NeXT workstation
has its own unique character set. Similarly for Data General, Atari,
Commodore, and many other computer manufacturers. In the USSR, up to five
different Cyrillic character sets are in use. In Japan, there are many
different encodings for Roman, Katakana, and Kanji characters. China and
Taiwan use different encodings for Chinese characters.
NORMAL KERMIT FILE TRANSFER SYNTAX
The Kermit file transfer protocol makes a distinction between text and binary
files. Binary files are transmitted with no translation or conversion. For
text files, the Kermit protocol defines a standard intermediate
representation ("transfer syntax") for text files, namely ASCII characters
with carriage return and linefeed (CRLF) after each line, so text can be
stored in useful fashion on any computer to which it is transferred. Each
Kermit program knows how to translate from the local text-file storage
conventions to ASCII/CRLF syntax, and vice versa. This is the basic,
required, and default mode of operation for any Kermit program.
INTERNATIONAL KERMIT TRANSFER SYNTAX
This proposal adds a new mechanism that permits the use of character sets
other than ASCII in file transfer. These additional character sets are taken
from recognized national or international standards, such as the ISO 8859
Latin Alphabets.
Using a standard character set (other than ASCII), it is possible to transfer
a text file written in a language other than English between unlike
computers, and it is also possible to transfer a text file containing more
than one language. For example Latin Alphabet 1 can represent a file
containing a mixture of Italian, Norwegian, French, German, English, and
Icelandic.
The character set used in a text file stored in a particular computer is
called the "file character set" (FCS). When the characters in a text file
can be represented by a single standard character set, that character set can
be used in place of ASCII in Kermit's transfer syntax. This is called the
"transfer character set" (TCS). Whatever the transfer character set, there
must be a mapping between the local file character set and the transfer
character set. That is, there must be a pair of translation functions in the
program: one from the local file character set to the transfer character set,
and one from the transfer set to the local set:
     COMPUTER A                                   COMPUTER B
+------------------+                         +------------------+
| +-------------+  |                         |  +-------------+ |
| | Translation |  |        Transfer         |  | Translation | |
| | Function:   |--------------------------->|  | Function:   | |
| | FCS to TCS  |  |      Character Set      |  | TCS to FCS  | |
| +-------------+  |                         |  +-------------+ |
|        ^         |                         |         |        |
|        |         |                         |         v        |
|  Kermit Program  |                         |  Kermit Program  |
|       SEND       |                         |     RECEIVE      |
+------------------+                         +------------------+
         ^                                             |
         |                                             v
+------------------+                         +------------------+
|    Local File    |                         |    Local File    |
| Character Set A  |                         | Character Set B  |
+------------------+                         +------------------+
The use of common, standard transfer character sets means that each Kermit
program only has to know about its own local character sets and a small
number of standard ones.
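The pair of translation functions can be sketched as follows. This is only an
illustration: it uses Python's codecs (with Unicode as a convenient pivot)
rather than the byte-to-byte tables a real Kermit program would carry, and the
function names send_text() and receive_text() are hypothetical. The choice of
cp437, latin_1, and mac_roman as sample character sets is also an assumption.

```python
# Sender translates FCS -> TCS; receiver translates TCS -> its own FCS.
# Neither side needs to know the other's local character set.

def send_text(local_bytes, fcs, tcs):
    """Sender side: file character set to transfer character set."""
    return local_bytes.decode(fcs).encode(tcs)


def receive_text(packet_bytes, tcs, fcs):
    """Receiver side: transfer character set to file character set."""
    return packet_bytes.decode(tcs).encode(fcs)


original = "\u00e8".encode("cp437")                           # e-grave on computer A
on_the_wire = send_text(original, "cp437", "latin_1")         # e-grave in the TCS
arrived = receive_text(on_the_wire, "latin_1", "mac_roman")   # e-grave on computer B

print(original, on_the_wire, arrived)
```

The byte value changes at each step, but it denotes the same character
throughout.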
International transfer syntax is an optional feature for Kermit programs, and
is designed to interoperate (with, of course, no claim to correct translation)
with Kermit programs that do not support it.
SPECIFYING THE FILE CHARACTER SET
The following command allows the Kermit user to specify the local file
character set:
SET FILE CHARACTER-SET <file-character-set-name>
The file character set name is normally a system-dependent item. Some
computers have only one character set, in which case the SET FILE
CHARACTER-SET command is unnecessary.
This command will be required on computers where more than one file character
set is used. These include private (corporate) character sets or the 7-bit
national variants allowed by ISO Standard 646 (See Appendix B).
A consistent, or at least sensible, naming convention should be used for
private character sets.
The following names are recommended for the 7-bit national character sets:
ASCII, BRITISH, CUBAN, DANISH, DUTCH, FINNISH, FRENCH, FRENCH-CANADIAN,
GERMAN, HUNGARIAN, ITALIAN, JAPANESE-ROMAN, NORWEGIAN, PORTUGUESE, SPANISH,
SWEDISH, and SWISS (note: most of these are ISO-646 sets, but several of them
are private 7-bit sets).
The Apple character sets might include APPLE-STANDARD, APPLE-QUICKDRAW, and
APPLE-SYMBOL. The DEC Multinational Character Set can be called
DEC-MULTINATIONAL, DEC Greek would be DEC-GREEK, DEC Hebrew would be
DEC-HEBREW, etc. The NeXT character set can be NEXT-MULTINATIONAL. The Data
General international character set can be DATA-GENERAL-MULTINATIONAL, and so
on. Later, when these companies add new and no doubt unique character sets,
these can be called NEXT-GREEK, NEXT-HEBREW, DATA-GENERAL-GREEK,
DATA-GENERAL-HEBREW, etc.
For the IBM character sets (code pages), the notation CPnnn is used, where nnn
is the code page number: CP037, CP437, CP500, CP850, etc. EBCDIC should be
used for "standard" USA EBCDIC. An alternative notation, more in keeping with
the ones above, would be something like IBM-PC-STANDARD, IBM-PC-MULTINATIONAL,
IBM-PC-PORTUGUESE, IBM-370-MULTINATIONAL, IBM-370-USA, IBM-370-JAPAN, etc.
But because there are often several code pages that fit one such description,
the CPnnn notation is preferred.
These are simply samples and guidelines for naming conventions for corporate
character sets. File character set names should be both precise and mnemonic
when possible but, as in the IBM case, precision should take precedence.
In countries like the USSR, character sets are not associated with particular
companies, but have grown up as a matter of usage in several different
computing environments, or have grown out of several different generations of
standards. In such cases, it makes the most sense to stick to common usage.
USSR character sets include KOI-7, KOI-8, DKOI, CP866 (Microsoft Cyrillic),
ALT-CYRILLIC ("Alternative Cyrillic"), and CYRILLIC (ISO 8859-5).
In Japan, a mixture of standard (JIS), modified standard, and corporate
character sets are used: JIS-7, JIS-8, SHIFT-JIS, JAPAN-EUC, DEC-KANJI,
FUJITSU-KANJI, HITACHI-KANJI, etc.
Example: Consider a computer where the ASCII character set is used for
programming and the German ISO 646 variant is used for text.
The German phrase:
Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln
would be rendered in ASCII as "Gr}~e aus K|ln", and the ASCII C-language
phrase "{~a[x]}" would appear as:
<a-diaeresis><ess-zet>a<A-diaeresis>x<U-diaeresis><u-diaeresis>
in German ISO 646 (ess-zet is the German double-s character, similar in
appearance to Greek beta). The German-speaking user would want Kermit to
interpret the local file characters as German (SET FILE CHARACTER-SET GERMAN)
in the former case, and as ASCII (SET FILE CHARACTER-SET ASCII) in the latter.
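The two renderings in this example can be checked with a small sketch. The
mapping below covers the positions at which the German ISO 646 variant
differs from ASCII; the entries for "@" and "\" do not appear in the example
above and are taken from the German national variant as commonly published.
The table and function names are illustrative, not part of any Kermit program.

```python
# Positions where German ISO 646 and ASCII assign different characters.
# Non-ASCII characters are written as escapes to keep this listing 7-bit.
GERMAN_AT_ASCII_POSITION = {
    "\u00a7": "@",    # section sign (not used in the example above)
    "\u00c4": "[",    # A-diaeresis
    "\u00d6": "\\",   # O-diaeresis (not used in the example above)
    "\u00dc": "]",    # U-diaeresis
    "\u00e4": "{",    # a-diaeresis
    "\u00f6": "|",    # o-diaeresis
    "\u00fc": "}",    # u-diaeresis
    "\u00df": "~",    # ess-zet
}
TO_ASCII_VIEW = str.maketrans(GERMAN_AT_ASCII_POSITION)
TO_GERMAN_VIEW = str.maketrans({v: k for k, v in GERMAN_AT_ASCII_POSITION.items()})


def german_shown_as_ascii(text):
    """How German ISO 646 text looks when its codes are read as ASCII."""
    return text.translate(TO_ASCII_VIEW)


def ascii_shown_as_german(text):
    """How ASCII text looks when its codes are read as German ISO 646."""
    return text.translate(TO_GERMAN_VIEW)


print(german_shown_as_ascii("Gr\u00fc\u00dfe aus K\u00f6ln"))
print(ascii_shown_as_german("{~a[x]}").encode("unicode_escape"))
```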
SPECIFYING THE TRANSFER CHARACTER SET
To select the transfer character set for file transfer, the user enters the
command:
SET TRANSFER CHARACTER-SET <name>
where <name> is the name of a standard character set. If the name is
TRANSPARENT, Kermit does no character set conversion at all, but it still does
text record format conversion. For ASCII-based systems, this is equivalent to
Kermit's normal, basic mode of operation.
If a name other than TRANSPARENT is given, and FILE TYPE is set to TEXT,
Kermit translates between the current file character set and the named
transfer character set when constructing or deciphering file data packets.
If the transfer character set is ASCII, Kermit converts between the current
file character set and 7-bit ASCII. This mode of operation is roughly
equivalent to Kermit's basic mode of operation on non-ASCII based systems like
IBM mainframes. But if the local file character set contains accented Roman
characters, the accents are dropped in the transfer character set, for example
a-acute becomes simply a. (But see SET LANGUAGE, described later.)
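The dropping of accents can be illustrated with a minimal sketch. It uses
Unicode canonical decomposition as a stand-in for the fixed translation
tables a Kermit program would actually contain; the function name and the
choice of "?" for characters with no ASCII base letter are assumptions made
for this example only.

```python
# Reduce accented Roman letters to their unaccented ASCII base letters.
# Assumption: Unicode NFD decomposition approximates a hand-built table.
import unicodedata


def to_ascii_dropping_accents(text, substitute="?"):
    """Replace each character by its base letter, or by `substitute` if none."""
    out = []
    for ch in text:
        base = unicodedata.normalize("NFD", ch)[0]   # base letter comes first
        out.append(base if ord(base) < 128 else substitute)
    return "".join(out)


print(to_ascii_dropping_accents("\u00e1"))             # a-acute
print(to_ascii_dropping_accents("d\u00e9j\u00e0 vu"))
```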
Other transfer character sets must be chosen from among approved national or
international standards. The sets shown in Table 2 are recommended. The
criteria for including a character set in this table are:
1. 7-bit US ASCII (= ISO-646 US version) is included, for compatibility
with the original Kermit protocol and the hundreds of programs that
implement it.
2. An 8-bit single-byte character set, such as those in the ISO 8859 series,
may be included if it is registered, as in (4) below.
3. A multibyte character set may be included, if it is registered as in (4).
4. The set must be listed in the ISO International Register of Character Sets
under the provisions of ISO Standard 2375 (see Appendix A), so it has a
unique registration number and designating escape sequence with which the
sending Kermit program can identify the character set to the receiving
Kermit program. (An exception to this provision is made for Japanese EUC,
which is a combination of two registered standards.) Allowance is made for
the possibility of other registration authorities, should they appear.
5. The set must be a national or international standard graphic character
set, intended for use in computer text processing or programming (as
opposed to Videotex, Teletex, OCR, device control, or other applications).
This category may include standard line-drawing or technical character sets
which fit the other criteria.
Note in particular that the national variants of ISO 646 are not included,
since these are covered adequately by the ISO Latin alphabets.
Standard character sets containing "composed characters", such as CCITT T.61,
in which an accented letter is represented by a two-character sequence (for
example, c-cedilla is encoded as a cedilla character followed by a "c"
character), are not included at this time. The issue of composed versus
precomposed characters will be addressed later.
Standard "Kermit names" (for use with the SET TRANSFER CHARACTER-SET command)
are given to these character sets so they may be referred to uniformly in all
Kermit implementations. These names are chosen to be mnemonic so users don't
have to remember cryptic designations like "ISO-8859-3". The choice of single
words like "CYRILLIC" implies that there will not be more than one transfer
syntax for Cyrillic text. However, if standards change in the future, it will
be possible to add further identifying material to these names, e.g.
"CYRILLIC-2, CYRILLIC-ANCIENT", etc.
The Kermit names are English, as this is the language of the standards
themselves. The Kermit commands are English words, and this document is
written in English. Non-English user interface issues are beyond the scope
of this document.
_____________________________________________________________________________
Table 2: Standard Character Sets
US 7-bit ASCII. English, Latin, Gaelic without accents, Dutch without
y-diaeresis, German without umlauts (vowels marked by diaeresis) or ess-zet.
Kermit name: ASCII.
ISO Registration Number: 6.
Kermit Designator: none (this is the default transfer character set).
ISO 8859-1, Latin Alphabet 1. Danish, Dutch, English, Faeroese, Finnish,
French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish,
and Swedish.
Kermit name: LATIN1.
ISO Registration Number: 100.
Kermit Designator: I6/100.
ISO 8859-2, Latin Alphabet 2. Albanian, Czech, English, German, Hungarian,
Polish, Romanian, Serbocroatian (Croatian), Slovak, and Slovene.
Kermit name: LATIN2.
ISO Registration Number: 101.
Kermit Designator: I6/101.
ISO 8859-3, Latin Alphabet 3. Afrikaans, Catalan, Dutch, English, Esperanto,
French, Galician, German, Italian, Maltese, Spanish, and Turkish.
Kermit name: LATIN3.
ISO Registration Number: 109.
Kermit Designator: I6/109.
ISO 8859-4, Latin Alphabet 4. Danish, English, Estonian, Finnish, German,
Greenlandic, Lappish (Sami), Latvian, Lithuanian, Norwegian, and Swedish.
Kermit name: LATIN4.
ISO Registration Number: 110.
Kermit Designator: I6/110.
ISO 8859-5, the Latin/Cyrillic Alphabet. Bulgarian, Byelorussian, English,
Macedonian, Russian, Serbocroatian (Serbian), and Ukrainian.
(Compatible with USSR GOST Standard 19768-1987 and ECMA-113 = "New KOI-8").
Kermit name: CYRILLIC.
ISO Registration Number: 144.
Kermit Designator: I6/144.
ISO 8859-6, the Latin/Arabic Alphabet.
Kermit name: ARABIC.
ISO Registration Number: 127.
Kermit Designator: I6/127.
ISO 8859-7, the Latin/Greek Alphabet.
Kermit name: GREEK.
ISO Registration Number: 126.
Kermit Designator: I6/126.
AKA: ELOT 928 (OMADA ELLINIKON ELOT 928)
ISO 8859-8, the Latin/Hebrew Alphabet.
Kermit name: HEBREW.
ISO Registration Number: 138.
Kermit Designator: I6/138.
ISO DIS 8859-9, Latin Alphabet 5, in which the Icelandic letters Thorn and
Eth plus upper and lowercase Y acute from Latin Alphabet 1 are replaced by
six other letters needed for Turkish. Danish, Dutch, English, Faeroese,
Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish,
Swedish, and Turkish.
Kermit name: LATIN5.
ISO Registration Number: 148.
Kermit Designator: I6/148.
JIS X 0201, a 1-byte code for Japanese Katakana, used in conjunction
with a slightly modified ASCII (backslash is replaced by Yen sign,
tilde by overbar).
Kermit name: KATAKANA.
ISO Registration Numbers: 14 (Roman), 13 (Katakana).
Kermit Designator: I14/13.
Japanese EUC
A variable-length code containing ASCII and Japanese Katakana in their JIS
X 0201 representations, plus 2-byte JIS X 0208. JIS X 0208, in turn,
includes Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian
characters, plus special symbols, etc. ASCII codes are single bytes with
their 8th bit set to zero. JIS X 0208 codes are double bytes with the
8th bit of each byte set to one. JIS X 0201 Katakana bytes are preceded
by Single Shift 2 (see Appendix B). This mixture allows single-width Roman
and Katakana characters to coexist with double-width JIS X 0208 characters,
a common practice in many Japanese computing environments.
Kermit name: JAPAN-EUC
ISO Registration Numbers: 14 (JIS X 0201 Roman), 87 (JIS X 0208),
13 (JIS X 0201 Katakana).
Kermit Designator: I14/87/13.
Chinese Standard GB 2312-80, a 2-byte code for Chinese.
Kermit name: CHINESE.
ISO Registration Number: 58.
Kermit Designator: I6/58.
KS C 5601 (1989), a 2-byte code for Korean.
Kermit name: KOREAN.
ISO Registration Number: 149.
Kermit Designator: I6/149.
TCVN 5712:1993, an ISO 2022-compliant pair of single-byte sets for
Vietnamese, one for uppercase letters, the other for lowercase.
Kermit name: VIETNAMESE.
ISO Registration Number: 180.
Kermit Designator: I6/180.
ISO/IEC 10646-1. International Standard 10646,
Information Processing -- Multiple-Octet Coded Character Set, 1993.
Table 2: Standard Character Sets
_____________________________________________________________________________
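The byte structure described in the Japanese EUC entry of Table 2 can be
observed with a short sketch. It assumes that Python's standard "euc_jp" codec
is an acceptable model of the encoding described there; the helper name
euc_bytes() is illustrative.

```python
# Inspect the three kinds of Japanese EUC sequences described in Table 2.
# Assumption: Python's euc_jp codec models the encoding described there.
SS2 = 0x8E   # Single Shift 2, which introduces a JIS X 0201 Katakana byte


def euc_bytes(ch):
    """Return the EUC byte sequence for a single character."""
    return ch.encode("euc_jp")


roman = euc_bytes("A")          # single byte, 8th bit zero
kanji = euc_bytes("\u6f22")     # a JIS X 0208 Kanji: two bytes, 8th bit one
katakana = euc_bytes("\uff71")  # single-width Katakana A: SS2 plus one byte

for name, seq in (("roman", roman), ("kanji", kanji), ("katakana", katakana)):
    print(name, [hex(b) for b in seq])
```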
BEWARE: The Latin-4 alphabet is confused. The original ECMA 94 standard
was designed for the Scandinavian and Baltic languages and thus included
the character A-ring (necessary for Swedish and Lappish/Sami), but some
editions of the ISO Registry substitute L-acute (not used by any of the
covered languages).
NOTE: CNS 11643 (Taiwan) is not included because (a) one Chinese
transfer character set should be sufficient, and (b) CNS 11643 does not
show up in the ISO Register. The issue of "Han Unification" (combining
Chinese, Japanese, and Korean ideograms into a single code set) is not
addressed by this proposal, except insofar as it occurs in the base
multilingual plane (BMP) of ISO 10646.
Until and unless Kermit programs are updated to take advantage of ISO
10646, additional transfer character sets must be added to Kermit's
repertoire for languages with writing systems not yet covered: Burmese,
Thai, Lao, Khmer, Armenian, Georgian, Amharic, Sinhalese, Tibetan,
Mongolian, Cherokee, many African languages, etc etc.
The ISO Latin alphabets are 8-bit character sets whose left half is identical
with ASCII, and whose right half contains the characters required for
languages other than English. All accented letters are "precomposed", i.e.
single code points. The ISO registration number refers only to the right half
of each of these character sets, but each of these sets must be used in its
entirety, because the unaccented Roman letters, the digits, and the common
punctuation marks appear only in the ASCII left half, which is ALWAYS (unless
otherwise noted) US ASCII, ISO Registration Number 6. The Kermit
character-set name refers to the two halves combined as a single set.
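The shared left half, and the small difference between Latin Alphabets 1 and
5 noted in Table 2, can be verified with the following sketch. It assumes
that Python's "iso8859_1" and "iso8859_9" codecs faithfully represent the two
standards; the variable and function names are illustrative.

```python
# Compare the left and right halves of two ISO 8859 sets.
# Assumption: Python's iso8859_1 / iso8859_9 codecs match the standards.
LATIN1, LATIN5 = "iso8859_1", "iso8859_9"


def decode_byte(value, codec):
    """Return the character a single-byte set assigns to `value`."""
    return bytes([value]).decode(codec)


# Left half (0-127): identical to ASCII in both sets.
left_half_is_ascii = all(
    decode_byte(v, LATIN1) == decode_byte(v, "ascii") == decode_byte(v, LATIN5)
    for v in range(0x00, 0x80)
)

# Right half graphic area (160-255): positions where the two sets disagree.
differing = [
    v for v in range(0xA0, 0x100)
    if decode_byte(v, LATIN1) != decode_byte(v, LATIN5)
]

print(left_half_is_ascii, [hex(v) for v in differing])
```

The six differing positions are those where Latin Alphabet 5 replaces Thorn,
Eth, and Y-acute (upper and lower case) with letters needed for Turkish.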
A particular Kermit program need not incorporate all of these character sets.
In many cases, a single 8-bit character set will suffice, such as LATIN1 for
Western Europe, LATIN2 for Eastern European countries with Roman-alphabet
based writing systems, CYRILLIC for most of the USSR, and so on.
When a language is representable in more than one character set from this
table, as are English, German, Finnish, Turkish, etc., the character set
highest on the list which adequately represents the language should be
preferred. More precisely, when a character set other than ASCII is to be
used in Kermit's transfer syntax, the ISO 8859 sets are preferred to other
registered sets which contain the same characters. Within the ISO 8859
family, lower-numbered sets which contain the characters of interest are
preferred to higher-numbered sets which contain the same characters. This
guideline maximizes the chance that any two particular Kermit programs will
interoperate.
For example, LATIN1 would be chosen for French, German, Italian, Spanish,
Danish, Dutch, Swedish, etc; LATIN3 for Turkish; JAPAN-EUC for Japanese text
that includes Kanji characters, KATAKANA for Japanese text that includes only
Roman and Katakana characters, etc.
Unfortunately, but unavoidably, the burden of choosing the best transfer
character set must be placed upon the user. If a file containing a mixture of
English, Finnish, and Latvian must be transferred, the user must find a
character set that can adequately represent all three languages, in this case
Latin Alphabet 4. A table like Table 3 should be provided in the user
documentation to help the user make this selection.
_____________________________________________________________________________
Afrikaans     LATIN3                    Irish           LATIN1,5
Albanian      LATIN2                    Italian         LATIN1,3,5
Arabic        ARABIC                    Japanese Kanji  JAPAN-EUC
Bulgarian     CYRILLIC                  Japan.Katakana  KATAKANA,JAPAN-EUC
Byelorussian  CYRILLIC                  Korean          KOREAN
Catalan       LATIN3                    Lappish (Sami)  LATIN4
Chinese       CHINESE                   Latvian         LATIN4
Czech         LATIN2                    Lithuanian      LATIN4
Danish        LATIN1,4,5                Macedonian      CYRILLIC
Dutch         LATIN1,2,3,4,5            Maltese         LATIN3
English       ASCII,LATIN1,2,3,4,5,etc  Norwegian       LATIN1,4,5
Esperanto     LATIN3                    Polish          LATIN2
Estonian      LATIN4                    Portuguese      LATIN1,5
Faeroese      LATIN1,5                  Romanian        LATIN2
Finnish       LATIN1,4,5                Russian         CYRILLIC
French        LATIN1,3,5                *Serbocroatian  LATIN2, CYRILLIC
Galician      LATIN3                    Slovak          LATIN2
German        LATIN1,2,3,4,5            Slovene         LATIN2
Greek         GREEK                     Spanish         LATIN1,5
Greenlandic   LATIN4                    Swedish         LATIN1,4,5
Hebrew        HEBREW                    Turkish         LATIN3,5
Hungarian     LATIN2                    Ukrainian       CYRILLIC
Icelandic     LATIN1
Table 3: Preferred Transfer Syntax Character Sets
*If written in Cyrillic, this language is called Serbian. If written
in Roman letters, it is called Croatian.
_____________________________________________________________________________
Note, Table 3 is only a sample. To produce a comprehensive and definitive
table would require a team of language experts. The information in the
current table is based purely upon the claims made within the standards
themselves, in which there is no mention of languages like Farsi, Urdu, Welsh,
Cornish, Manx, Inuit, Old Church Slavonic, Armenian, Georgian, Tagalog,
Swahili, Latin, etc, nor definitions of exactly what is meant by terms like
"Greenlandic", "Irish", etc. Nevertheless, it is the intention of this
proposal to support any language for which a computer character set can be
standardized.
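The selection that the user is asked to make from Table 3 can be sketched as
a simple lookup. The code below copies a few rows of Table 3 and orders the
candidate sets as they appear in Table 2; the names TABLE3, PREFERENCE, and
choose_transfer_set() are hypothetical, and nothing in this proposal requires
a Kermit program to make the choice automatically.

```python
# Pick the first set in Table 2 order that covers every language in a file.
# Only a few rows of Table 3 are reproduced here, for illustration.
PREFERENCE = ["ASCII", "LATIN1", "LATIN2", "LATIN3", "LATIN4",
              "CYRILLIC", "ARABIC", "GREEK", "HEBREW", "LATIN5"]

ALL_LATIN = {"LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"}
TABLE3 = {
    "English": {"ASCII"} | ALL_LATIN,
    "Finnish": {"LATIN1", "LATIN4", "LATIN5"},
    "French":  {"LATIN1", "LATIN3", "LATIN5"},
    "German":  set(ALL_LATIN),
    "Latvian": {"LATIN4"},
    "Russian": {"CYRILLIC"},
    "Turkish": {"LATIN3", "LATIN5"},
}


def choose_transfer_set(languages):
    """Return the preferred set covering all `languages`, or None if none does."""
    candidates = set(PREFERENCE)
    for language in languages:
        candidates &= TABLE3[language]
    for name in PREFERENCE:          # earlier in the list is preferred
        if name in candidates:
            return name
    return None


print(choose_transfer_set(["English", "Finnish", "Latvian"]))
```

When no single set covers all the languages (for example Russian together
with Latvian), the function returns None, which corresponds to the mixed
language case left to ISO 10646.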
OTHER NON-UNIVERSAL CHARACTER SETS
This section lists character sets that are not listed in Table 2, but that
are likely candidates for eventual inclusion therein (i.e. after they are
registered with the ISO).
ISO 6438, Extended Roman for African Languages.
More information needed.
ISCII-1991 (IS 13194-1991), Indian Script Code for Information Interchange.
Supports the nine official Indian scripts derived from Brahmi: Devanagari,
Gurmukhi, Gujarati, Bengali, Assamese, Oriya, Telugu, Kannada, Malayalam,
and Tamil, with Roman transliteration. A series of single-byte codes with
00/00-07/15 = ASCII, different right halves for each script. All of the
right halves are structurally identical to each other to facilitate
automatic transliteration and display of alternate alphabets using the same
system software. As Perso-Arabic scripts have a different alphabet, a
different standard is envisaged for them. ISCII-1991 is a successor to
earlier codes ISSCII-83 and ISCII-88 announced by the Department of
Electronics, Government of India. Bureau of Indian Standards, Manak Bhavan,
09 Bahadur Shah Zafar Marg, New Delhi 110002. No ISO registration number.
(Add others...)
THE UNIVERSAL CHARACTER SET
Though ISO Standard 10646 was approved in 1993, it will continue to undergo
change as national standards bodies evolve and engage in the ISO
process, and it will take many years before it replaces the many existing
private and standard character sets in data processing and communication.
Therefore there is no intention to drop support in the Kermit protocol for the
standard character sets listed above at any time in the foreseeable future.
ISO 10646 can be added (in at least one form, most likely a compacted
version of the two-byte Base Multilingual Plane) to Kermit's list of transfer
character sets.
IMPLEMENTATION
Character set translation can be added to existing Kermit programs with
a minimum of effort. The following steps are required for each Kermit program:
1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't
have it already. SET FILE TYPE TEXT enables text-file character set
conversion. SET FILE TYPE BINARY disables conversions of all kinds, but
does not destroy the file and transfer character-set selections (2 and 3
below), so that a subsequent SET FILE TYPE TEXT command will still be able
to use them.
2. Add the SET FILE CHARACTER-SET <name> command. The set of <names> should
include ASCII or EBCDIC (as appropriate, used for program source, etc) plus
the names of any "national" or special character sets that are used on the
particular computer.
3. Add the SET TRANSFER CHARACTER-SET <name> command. The set of <names>
should include TRANSPARENT and ASCII plus the names of one or more other
standard character sets from Table 2 which contain the characters from the
computer's local character set(s).
4. Add translation tables (or functions) between each compatible pair of
character sets in (2) and (3). For each pair, two translation tables are
necessary: one from the local file character set to the transfer character
set, and one from the transfer set to the local one.
5. Add SHOW commands to let the user find out what character sets are
available, and which ones are currently selected, for the transfer syntax
and for local files. The exact syntax of this command will vary. In
some Kermit implementations, every SET command has a corresponding SHOW
command, in which case it will be possible to SHOW FILE CHARACTER-SET and
SHOW TRANSFER CHARACTER-SET. In others, related SET parameters are lumped
together into broader categories for purposes of SHOW, for example SHOW
FILE would show all file-related parameters; SHOW PROTOCOL would show all
protocol-related parameters.
Any particular Kermit program can support several (perhaps many) file
character sets (FCS) and transfer character sets (TCS). No particular
combination of them should be forbidden. If a useful translation between,
say, Hebrew and Katakana can be devised, there is no reason the user should
not be allowed to select it.
However, programs that support large numbers of file and transfer character
sets must bow to the limitations of the computer's architecture and memory
space, as well as the knowledge and patience of the programmer. Hence, purely
as a matter of implementation, certain combinations of FCS and TCS --
preferably the ones that would be least frequently used -- can remain
unsupported. In that case, the SET { FILE, TRANSFER } CHARACTER-SET command
that causes the conflict can issue an error message or switch automatically
to a combination (if any) that makes sense.
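A minimal sketch of steps 2 through 4 and of the "unsupported combination" rule, in Python. It assumes that Python's standard codecs can stand in for hand-built translation tables; the dictionaries FILE_SETS, TRANSFER_SETS, and SUPPORTED, and the functions make_table and select, are illustrative names that do not come from any Kermit program.

```python
# Hypothetical name-to-codec maps; Python codecs stand in for real tables.
FILE_SETS = {"ASCII": "ascii", "CP437": "cp437", "CP850": "cp850", "CP866": "cp866"}
TRANSFER_SETS = {"ASCII": "ascii", "LATIN1": "latin_1", "CYRILLIC": "iso8859_5"}

# Combinations this sketch chooses to support (an arbitrary illustrative subset).
SUPPORTED = {
    ("ASCII", "ASCII"),
    ("CP437", "LATIN1"),
    ("CP850", "LATIN1"),
    ("CP866", "CYRILLIC"),
}


def make_table(src, dst, fallback=ord("?")):
    """Build a 256-entry table from single-byte codec src to single-byte codec dst.

    Characters with no counterpart become the fallback ("this can't be
    translated") character, so the table is biased toward readability.
    """
    table = bytearray(256)
    for code in range(256):
        try:
            table[code] = bytes([code]).decode(src).encode(dst)[0]
        except UnicodeError:
            table[code] = fallback
    return bytes(table)


def select(fcs, tcs):
    """Return (file-to-transfer, transfer-to-file) tables, or raise ValueError."""
    if (fcs, tcs) not in SUPPORTED:
        raise ValueError("no translation between %s and %s" % (fcs, tcs))
    f, t = FILE_SETS[fcs], TRANSFER_SETS[tcs]
    return make_table(f, t), make_table(t, f)
```

Here select plays the role of the SET { FILE, TRANSFER } CHARACTER-SET command that detects a conflict; a real program could equally switch to a sensible combination instead of raising an error.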
Optionally, several additional related commands can be included:
6. The command SET LANGUAGE may be added to allow the program to apply
heuristics in the translation process that would not otherwise be possible.
See discussion below.
7. Commands for modifying, loading, and saving translation tables (to be
specified in a future draft of this document).
8. Once the new commands and translation tables are in place, it is simple to
add a TRANSLATE command, to translate a local file from one character set
to another, using a transfer character set as an intermediate step. With
this command, Kermit may be used as a character-set conversion utility for
local files (see Appendix D).
9. Commands governing automatic pairing of file and transfer character set
and setting the goal for translation, described below.
Translation occurs only in the data field of the D packets. Packet control
fields are not translated, nor are the data fields of any other kind of
packet, including F (filename) packets. (Filename packets cannot be
translated because the attribute packet that announces the file's character
set does not arrive until after the F packet.) As always, IBM Mainframe
Kermit is a special case, since most character strings must be translated
between EBCDIC and ASCII. Nonetheless, the rule applies even there, as long
as we take "translation" to mean the specific translation between the transfer
and file character sets, rather than the standard ASCII/EBCDIC conversion.
Internally, the Kermit program that is sending a file:
1. Reads characters (one or more bytes) or lines of text from the file.
2. Translates the character from the FILE CHARACTER-SET to the TRANSFER
CHARACTER-SET, applying any selected and applicable special rules or goals,
and converting the record format if necessary.
3. Follows the negotiated lower-level encoding options: control prefixing,
shifting, and compression.
4. Assembles and sends the packet.
The Kermit program that is receiving a file:
1. Reads an incoming data packet.
2. Decodes the packet data according to the negotiated prefixing, shifting,
and compression options.
3. Translates the resulting characters from the TRANSFER CHARACTER-SET to the
FILE CHARACTER-SET, applying any selected and applicable special rules or
goals, converting the record format if necessary.
4. Writes the translated characters to the output file.
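The ordering of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the code of any Kermit program: Python codecs stand in for the translation tables, only control prefixing ("#") and 8th-bit prefixing ("&") are modeled, and shifting, compression, record-format conversion, and packet assembly are omitted. The function names are hypothetical.

```python
CTL = ord("#")  # control-prefix character
EBQ = ord("&")  # 8th-bit prefix character


def kermit_encode(data):
    """Apply 8th-bit and control prefixing so the result is printable 7-bit."""
    out = bytearray()
    for b in data:
        if b & 0x80:
            out.append(EBQ)
            b &= 0x7F
        if b < 0x20 or b == 0x7F:
            out += bytes([CTL, b ^ 0x40])   # control character, "controllified"
        elif b in (CTL, EBQ):
            out += bytes([CTL, b])          # a literal prefix character
        else:
            out.append(b)
    return bytes(out)


def kermit_decode(data):
    """Undo kermit_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        high = 0
        b = data[i]
        i += 1
        if b == EBQ:
            high = 0x80
            b = data[i]
            i += 1
        if b == CTL:
            b = data[i]
            i += 1
            if 0x3F <= b <= 0x5F:           # encoded control character
                b ^= 0x40
        out.append(b | high)
    return bytes(out)


def send_data(file_bytes, fcs, tcs):
    """Sender: translate FCS to TCS first, then apply the packet encoding."""
    translated = file_bytes.decode(fcs).encode(tcs, errors="replace")
    return kermit_encode(translated)


def receive_data(packet_data, tcs, fcs):
    """Receiver: decode the packet data first, then translate TCS to FCS."""
    decoded = kermit_decode(packet_data)
    return decoded.decode(tcs).encode(fcs, errors="replace")
```

The point of the sketch is the order of operations: translation always happens on the unencoded characters, so the prefixing layer never needs to know which character sets are in use.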
EXAMPLE
To transfer a Finnish-language text file from a computer that uses the Finnish
ISO 646 national variant to an IBM PS/2, and to store the file using the
PS/2's Multilingual Code Page:
  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET FILE CHARACTER-SET FINNISH       SET TRANSFER CHARACTER-SET LATIN1
  SET TRANSFER CHARACTER-SET LATIN1    SET FILE CHARACTER-SET CP850
  SEND filename                        RECEIVE
The file sender translates from Finnish ISO 646 to Latin Alphabet 1, the
most appropriate transfer character set (see Table 3), and the file receiver
translates from Latin-1 to Code Page 850.
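The two translations in this example can be sketched in Python. The Finnish table below is an assumption and is deliberately partial: it covers only the six letter positions ([ \ ] { | }) that the Finnish 7-bit set is generally understood to replace, and leaves every other code point unchanged. The function names are hypothetical, and Python's cp850 codec stands in for the receiver's table.

```python
# Partial, assumed mapping from the Finnish ISO 646 variant to Latin-1:
# [ \ ] { | }  become  A-diaeresis, O-diaeresis, A-ring (upper, then lower case).
FINNISH_TO_LATIN1 = bytes.maketrans(b"[\\]{|}", b"\xc4\xd6\xc5\xe4\xf6\xe5")


def finnish_to_latin1(data):
    """Sender side: FILE CHARACTER-SET FINNISH to TRANSFER CHARACTER-SET LATIN1."""
    return data.translate(FINNISH_TO_LATIN1)


def latin1_to_cp850(data):
    """Receiver side: TRANSFER CHARACTER-SET LATIN1 to FILE CHARACTER-SET CP850."""
    return data.decode("latin_1").encode("cp850", errors="replace")


# A Finnish phrase as it would be stored in the 7-bit national variant.
seven_bit = b"T{m{ on ty|t{"
on_the_wire = finnish_to_latin1(seven_bit)
on_the_pc = latin1_to_cp850(on_the_wire)
```

Neither side needs to know the other's file character set; each only needs a table between its own set and Latin-1.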
To transfer a C-language source program between the same two computers:
  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET TRANSFER CHARACTER-SET ASCII     SET FILE CHARACTER-SET ASCII
  SET FILE CHARACTER-SET ASCII         SET TRANSFER CHARACTER-SET ASCII
  SEND filename                        RECEIVE
Here all translations are from ASCII to ASCII, hence no translation at all.
LANGUAGE-SPECIFIC TRANSLATIONS
When national or international text must be translated into ASCII, information
is necessarily lost. ASCII does not include accented or non-Roman letters.
For readability, accented letters can be converted to their unaccented
counterparts, but that can introduce ambiguities or mistakes (to use Andr'e
Pirard's example: "a la francaise" without accents means "has the French
girl"). If we know that the text is written in a specific language, sometimes
certain language-specific rules can be applied to reduce the loss of
information.
For example, consider text containing the y-diaeresis character. It is
acceptable to render y-diaeresis as "ij" if the language is Dutch, but not
otherwise (yielding "Rijksmuseum" -- correct spelling -- rather than
"Ryksmuseum"). Similarly, o-diaeresis can be rendered as "oe" in German or
Swedish but not in English ("co<o-diaeresis>peration").
The command for selecting language-specific translation rules is:
SET LANGUAGE <name>
where <name> is the (English) name of the language, for example ITALIAN,
NORWEGIAN, PORTUGUESE.
Example: The command SET LANGUAGE GERMAN would allow the Kermit program, when
translating from Latin-1 or the German ISO 646 variant into ASCII, to render:
Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln
as "Gruesse aus Koeln" (correct German) rather than "Gruse aus Koln" (Gruse
means something entirely different from Gruesse -- something like "scum"
rather than "greetings").
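One way the effect of SET LANGUAGE might be modeled is as a small per-language rule table consulted before a generic accent-stripping fallback. The sketch below is illustrative only: the rule tables contain just the cases mentioned in this section, the generic rendering of ess-zet as "s" is an assumption chosen to reproduce the example above, and the names LANGUAGE_RULES, GENERIC_RULES, and to_ascii are hypothetical.

```python
import unicodedata

# Language-specific rules, limited to the cases discussed in the text.
LANGUAGE_RULES = {
    "GERMAN": {"\u00fc": "ue", "\u00f6": "oe", "\u00df": "ss"},  # u-, o-diaeresis, ess-zet
    "DUTCH": {"\u00ff": "ij"},                                   # y-diaeresis
}

# Assumed generic rule for a letter that has no accent to strip.
GENERIC_RULES = {"\u00df": "s"}


def to_ascii(text, language=None):
    """Translate text to ASCII, applying language rules when one is selected."""
    rules = dict(GENERIC_RULES)
    rules.update(LANGUAGE_RULES.get(language, {}))
    out = []
    for ch in text:
        if ch in rules:
            out.append(rules[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Strip diacritics: keep the base letter of the decomposed form.
            base = unicodedata.normalize("NFD", ch)[0]
            out.append(base if ord(base) < 128 else "?")
    return "".join(out)
```

Note that the language-specific rules change the length of the text, which is acceptable in file transfer but not on a terminal screen.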
TRANSLATION MECHANISMS
When translating from one character set to another, there are two possibly
conflicting goals:
1. Readability (R): Achieving a translation that makes the most sense to
the reader.
2. Invertibility (I): Achieving a translation that can be translated back
to the original character set without loss or distortion of information.
When readability is desired, nonmatching characters are converted to the
closest matching character, for example Latin-1 e-grave becomes simply e in
ASCII. But now "e" represents two different characters in the translation, so
invertibility is lost. When no sensible counterpart exists, a special "this
can't be translated" character is used (a unique character if possible,
otherwise a question mark "?"). When this special character is used for more
than one purpose, invertibility is lost.
Invertibility is possible only when both character sets are the same size.
When invertibility is desired, the characters of the intersection of the two
sets are paired together: A in one set to A in the other, A-grave in one set
to A-grave in the other, etc. The members of the two sets of differences
between the two character sets are paired together in a way that gives every
character a unique translation in each direction. The exact method for
pairing is problematic, and frequently a particular pair makes no sense at
all, for example "L-with-stroke" with "Vulgar fraction 3/4". Any such pairing
will give an invertible translation, but to achieve the most useful
translation it is necessary to examine all the character sets involved. To
illustrate, Latin Alphabet 1 lacks the OE digraph character but this character
is found in the DEC Multinational Character Set, the Apple Quickdraw character
set, and the NeXT character set, but at different code points in each.
Ideally, each of these character sets should map OE digraph into the same
Latin-1 code point.
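The pairing procedure can be illustrated with a short Python sketch. It assumes two single-byte sets in which all 256 code points are defined (Code Page 850 and Latin-1 are used in the test), and it pairs the leftover characters simply in code-point order, which is exactly the kind of arbitrary choice discussed above. The function name is hypothetical.

```python
def invertible_tables(codec_a, codec_b):
    """Return (a_to_b, b_to_a): 256-byte tables that are exact inverses."""
    chars_a = [bytes([i]).decode(codec_a) for i in range(256)]
    chars_b = [bytes([i]).decode(codec_b) for i in range(256)]
    position_in_b = {ch: i for i, ch in enumerate(chars_b)}

    a_to_b = [None] * 256
    used_in_b = set()
    leftovers_a = []
    # First pair the intersection: each shared character maps to itself.
    for code, ch in enumerate(chars_a):
        if ch in position_in_b:
            a_to_b[code] = position_in_b[ch]
            used_in_b.add(position_in_b[ch])
        else:
            leftovers_a.append(code)
    # Then pair the two difference sets, here arbitrarily in code order.
    leftovers_b = [code for code in range(256) if code not in used_in_b]
    for code_a, code_b in zip(leftovers_a, leftovers_b):
        a_to_b[code_a] = code_b

    b_to_a = [0] * 256
    for code_a, code_b in enumerate(a_to_b):
        b_to_a[code_b] = code_a
    return bytes(a_to_b), bytes(b_to_a)
```

Any other ordering of the leftovers would be equally invertible; only comparison across all the character sets involved can show which ordering is the most useful.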
Let's look at a few common translation scenarios.
1. From a 7-bit set to a different 7-bit set, e.g. from ISO 646 Spanish
version to ASCII (or vice versa). The two sets do not contain the same
characters. Here we must choose between readability and invertibility. To
achieve readability in the Spanish-to-ASCII direction, we strip diacritical
marks (n-tilde becomes simply n). To achieve invertibility (at least in
this case), we make no translation at all.
2. From a 7-bit set to an 8-bit set. The 7-bit sets are usually ASCII or an
ISO 646 national variant. Normally, all the characters from the 7-bit set
are also present in the 8-bit set, and there is no R vs I conflict.
Otherwise, we must choose between R and I. Normal example: ASCII (and most
ISO 646 national variants) to Latin-1 -- here we satisfy both R and I.
Bizarre example: ISO 646 Swiss national variant to ISO Latin / Arabic --
here we must choose between R and I.
3. From an 8-bit set to another 8-bit set. The common case here is converting
between one of the corporate "extended ASCII" sets (DEC, IBM, Apple, NeXT,
Data General, Commodore, etc) and ISO Latin-1. The two sets share a large
percentage of common characters. How do we handle the characters that
differ? Again, we must choose between R and I. To complicate this case,
the IBM, Apple, and NeXT sets use the forbidden (by ISO standards) C1
control-character area for graphics characters; in this case there must be
a mapping between graphics and C1 controls.
4. From an 8-bit set to a 7-bit set. For example, from Latin-1 to ASCII or to
an ISO 646 national set. Here we are forced to accept a great deal of
information loss. We cannot possibly achieve invertibility, so we should
aim for maximum readability. The SET LANGUAGE command can be used to help.
5. From a single-byte character set to a multibyte character set. Most
multibyte character sets include ASCII and sometimes several other
alphabets (such as Greek and Cyrillic in JIS X 0208). Here we translate
each character into its equivalent, if it has one, and if not we pick some
unique nonsense value to ensure the translation is invertible (for the
single-byte set).
6. From a multibyte set to a single-byte set, for example Japanese EUC
into Latin-1 (or Latin/Cyrillic, Latin/Greek, or even ASCII). Here we
lose the vast majority of characters -- there is no hope for a readable
or even a sensible translation. The only way to translate Kanji into
(say) ASCII is to replace ideograms by words, and that is beyond the
scope of a simple character-set conversion scheme. Hence, we normally
replace ideograms by the "this can't be translated" character.
7. From one national multibyte set to another. These sets are for Chinese,
Japanese, and Korean, and have at least a large number of ideograms (Han
characters) in common, and probably also Roman characters. How to
translate among them is an item for study by language experts: by shape,
by meaning, etc.
How do we choose between readability and invertibility? It depends on what
the user needs at a particular moment. We (Kermit designers and programmers)
can give the user the ability to make this choice. Or we can make the choice
for them, knowing full well that whatever our choice, it will be wrong.
To give the user a choice -- at the expense of increased size and complexity
in the program itself and of the user interface -- the following command can
(optionally) be included:
SET TRANSFER TRANSLATION { INVERTIBLE, READABLE }
The existence of this command requires a dual set of translation tables and/or
functions -- one optimized for invertibility (totally invertible if the two
character sets are the same size), the other for readability. When a Kermit
program handles many character sets, this can result in a significant increase
in program size.
When this command is not provided, the bias of the translation mechanisms --
readability or invertibility -- must be clearly stated in the user
documentation. All else being equal, the bias should be towards
invertibility; if an invertible translation is possible (i.e. the two
character sets are of the same size), it should be provided. This ensures
round-trip consistency PROVIDED the same invertible tables are always used.
It is the programmer's choice whether translation is accomplished by tables or
by functions that implement translation algorithms, or a combination of both.
Functions provide maximum flexibility and tend to reduce program size, at some
cost in execution overhead. Tables provide greatest speed, but generally with
greater cost in program size.
THE POLITICS OF INVERTIBILITY
If two character sets are the same size and contain the same repertoire of
characters, translation is simply a matter of rearranging code points. But
when two character sets intended to serve the same language or group of
languages differ on non-alphabetic code points or in other minor ways,
arbitrary decisions must be made in assigning the nonoverlapping characters
from the two sets. Who makes those decisions?
The classic example is the translation between IBM Code Page 850 (the
"Multilingual Code Page") and ISO Latin-1. Because IBM assigns graphics
characters to its C1 area, it has 32 more graphics characters than Latin-1.
Most of these are line- and box-drawing characters sprinkled throughout
the code page. How should these be paired with Latin-1's C1 set?
Such decisions are beyond the scope of the national and international
standards activities, and they should not be made by Kermit designers or
programmers. These tables (or algorithms) are most appropriately furnished by
the creators of each private character set. This lends the appropriate
"official" air, and encourages the makers of all software packages that need
such a translation to use the official one so all such applications on a
particular computer can interoperate.
IBM has specified an invertible translation between certain of its code pages
and ISO Latin-1 in its Character Data Representation Architecture (CDRA).
Similarly, Apple should specify the translation between Quickdraw and Latin-1.
Microsoft should specify the translation between CP866 and the Latin/Cyrillic
alphabet. And so on. In the absence of such vendor-provided translations,
Kermit programmers are forced to produce their own, but should continue to
press vendors for official versions.
Eventually, the actual contents of each invertible translation table or
algorithm should be specified in a document or set of documents to accompany
this proposal, or references to the relevant corporate standards should be
listed in Appendix G.
Before leaving this topic, let's also remember to encourage designers of
computer operating systems to RECORD THE CHARACTER SET in a text file's
directory entry, so Kermit or any other application program can find out what
it is automatically without requiring the user to identify it manually. (Of
course, this raises the larger question of recording the file type as well...
item for further study.)
THE POLITICS OF READABILITY
Similarly, we can ask: Who decides what is readable? Transliteration of a
language like Greek or Russian into ASCII can be done in many different ways,
depending upon -- among other things -- the language spoken by the person
reading the result. For example, the surname of a former leader of the USSR
has only six letters when written in Cyrillic; transliterated into English,
the name is "Khrushchev", and into German, "Chruschtschow".
There are few, if any, widely recognized standards for transliteration, and
yet it is often desirable. Newspapers and magazines, library catalogers,
immigrant bureaus, and many other organizations have procedures for
transliterating "foreign" writing systems. Not just in "ASCII-speaking"
lands, but everywhere: Russian names are written in Arabic newspapers, Hebrew
names in Greek journals, English names on Chinese passports, Korean
publications in Vietnamese library catalogs.
When a translation function is optimized for readability -- and some must be
-- the designer must consider whether to force a particular kind of
readability on the user, or to give the user a choice. The precise mechanism
for this (if indeed any such mechanism can be precise!) is another topic for
further study: How to best transliterate from Language A in Writing System B
to Language X in Writing System Y?
USER-DEFINED TRANSLATIONS
It should be possible for users to override the decisions made by Kermit
programmers regarding the bias of the translation mechanism or its particular
details, as well as to add totally new translations, by introducing their
own translation tables or functions.
Methods for doing this will be described in a future draft of this document.
This is primarily a user-interface design issue.
***
How to do user-defined translations:
SET FILE CHARACTER-SET USER-DEFINED
SET XFER CHARACTER-SET <valid-xfer-charset>
SET USER-TRANSLATION FROM <tcs> xxx yyy ; for incoming files
SET USER-TRANSLATION TO <tcs> yyy xxx ; for outbound files
Applies to <tcs> + USER-DEFINED FCS.
Can have one pair for each TCS.
Now announcements work right, etc etc.
DUMP USER-TRANSLATION <tcs> [ <file> ] ; list tables (to file)
For C-Kermit, we have 4 TCS's, so 4 x 2 x 256 = 2K bytes, not bad:
. Add FC_USER FCS
. Add 6 tables (not supported for Kanji)
. Initialize each table to identity function
. Add 8 functions (even for TRANSPARENT, but NULL for Kanji)
. Figure out a way of telling user whether table has been defined.
. Add SHOW USER-TRANSLATION <tcs> [ <file> ]
(= a bunch of SET USER-TRANSL commands, with comments)
. Add DUMP USER-TRANSLATION <tcs> [ <file> ]
(= just the numbers, comma-separated? space-separated? one per line?)
. Add LOAD USER-TRANSLATION <tcs> <file>
(= read table written by DUMP, watch out for value and table-size overflow)
. Add some kind of built-in test pattern?
***
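The notes above can be made concrete with a small sketch of one user-defined table pair for a single transfer character set. Everything here is an assumption made for illustration: the class name, the method names, and in particular the DUMP format (two lines of 256 space-separated decimal values), which the notes leave undecided.

```python
class UserTranslation:
    """One pair of user-defined tables for a single transfer character set."""

    def __init__(self):
        # Each table starts out as the identity function.
        self.incoming = bytearray(range(256))   # TCS to user-defined FCS
        self.outbound = bytearray(range(256))   # user-defined FCS to TCS
        self.defined = False                    # has the user changed anything?

    def set_from(self, xxx, yyy):
        """SET USER-TRANSLATION FROM <tcs> xxx yyy (for incoming files)."""
        self.incoming[xxx] = yyy
        self.defined = True

    def set_to(self, yyy, xxx):
        """SET USER-TRANSLATION TO <tcs> yyy xxx (for outbound files)."""
        self.outbound[yyy] = xxx
        self.defined = True

    def dump(self):
        """DUMP: incoming table on one line, outbound table on the next."""
        lines = (" ".join(str(v) for v in table)
                 for table in (self.incoming, self.outbound))
        return "\n".join(lines) + "\n"

    @classmethod
    def load(cls, text):
        """LOAD: read what dump wrote, checking value and table-size overflow."""
        lines = text.strip().splitlines()
        if len(lines) != 2:
            raise ValueError("expected exactly two tables")
        tables = []
        for line in lines:
            values = [int(v) for v in line.split()]
            if len(values) != 256 or any(not 0 <= v <= 255 for v in values):
                raise ValueError("table size or value out of range")
            tables.append(bytearray(values))
        result = cls()
        result.incoming, result.outbound = tables
        result.defined = True
        return result
```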
ATTRIBUTE PACKETS
We want to accommodate as many computers as possible with a minimum of
programming effort, but this approach places a burden on the user in the
form of new commands and the confusion that results if the user forgets to
issue these commands.
This protocol extension does not require support for Kermit File Attribute
Packets, whose use is negotiated in the Kermit Initialization exchange, but
their use is recommended; the user's burden can be alleviated if the sending
Kermit program uses an attribute field to inform the receiving Kermit of the
character set used in the file data packets. The receiving program can accept
or refuse the file based on whether it supports the specified character set.
If the receiving program refuses a file, the user can override this refusal,
for example, if a long file contains only a word or two in an unknown
character set. The most common user-override is the command SET ATTRIBUTES
OFF. However, this also disables other desirable effects of attribute
packets, such as prenotification of file size. Therefore, it is desirable to
let the user specify exactly which attributes are to be "turned off", e.g.,
SET ATTRIBUTES ENCODING OFF.
When the transfer character set is ASCII (or TRANSPARENT when sent from an
ASCII-based system), the Encoding attribute should have the traditional value
of "A" (for ASCII): "*!A".
In order for the sender to inform the receiver of transfer character sets
other than ASCII, a new value for the Encoding attribute ("*") is defined,
namely "C", which is substituted for the normal value "A" (ASCII). "C" means
that the actual character set is specified as an operand which begins with a
single letter that designates the character set registration authority, e.g.
I for ISO, followed by a registration-authority-specific identifier, as in:
Ixxx/yyy
where the letter "I" (for ISO) is followed by a pair of ISO registration
numbers for the character set, xxx for the "left half" and yyy for the right,
expressed in decimal ASCII digits, for example:
+---+---+---+--------+
| * | ' | C | I6/100 |
+---+---+---+--------+
where "*" is code for the Encoding Attribute (or transfer syntax), "'" is the
length of its value. In this case, "CI6/100" is 7 characters long, and
"'" is the printable encoding for 7 (7 + 32 = 39, the ASCII code for "'").
The character "C" means "I'm using the specified Character set", and
"I6/100" specifies the character set: "ISO registration number 6", i.e. US
ASCII, in the left half, and ISO registration number 100, which is the right
half of Latin-1, in the right. The "I" stands for ISO, and is included to
allow for the possibility of other character set registration authorities.
Designators for each character set are given in Table 2, labeled "Kermit
Designator".
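The layout of the Encoding attribute can be checked with a few lines of Python. This is only a sketch of the field format described above; the function names are hypothetical, and tochar and unchar implement the usual "add 32" printable encoding of small numbers.

```python
def tochar(n):
    """Encode a small number as a printable character (n + 32)."""
    return chr(n + 32)


def unchar(c):
    """Inverse of tochar."""
    return ord(c) - 32


def encoding_attribute(designator=None):
    """Build the Encoding attribute: "*", length, then "A" or "C" + designator."""
    value = "A" if designator is None else "C" + designator
    return "*" + tochar(len(value)) + value


def parse_encoding_attribute(field):
    """Return (encoding letter, designator) from an Encoding attribute field."""
    if field[0] != "*":
        raise ValueError("not an Encoding attribute")
    length = unchar(field[1])
    value = field[2:2 + length]
    return value[0], value[1:]
```

For Latin-1 the result is the nine-character field shown in the diagram: the attribute code, the length character, and the seven-character value.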
Japanese EUC is a special case, because it is a mixture of single-byte JIS X
0201 (two character sets) and double-byte JIS X 0208. Its Kermit designator
is I14/87/13: Japanese Roman in G0, Japanese Kanji in G1, Japanese Katakana
in G2 (Katakana characters are indicated by SS2 in the data -- the SS2 is
considered part of the file).
In the event that a character set standard changes, but keeps the same
registration number, the registration number for the new character set should
be preceded by a non-numeric character which indicates the revision number: @
(atsign) = 1, A=2, B=3, and so on (as suggested in ISO 2022). For example
"I@6/B100" would indicate an 8-bit single-byte character set having Revision
1 of ASCII as its left half and Revision 3 of Latin-1 as its right. Note:
"Revision 1" does not mean the original version, but rather the first
revision AFTER the original version. The Kermit designator for an original
version does not have a revision indicator.
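A designator with optional revision indicators can be taken apart as sketched below. The function name and the returned structure are assumptions made for this illustration; a revision of 0 is used here to mean "original version, no indicator present".

```python
def parse_designator(designator):
    """Split a Kermit designator into (authority, [(registration, revision), ...])."""
    authority, rest = designator[0], designator[1:]
    parts = []
    for item in rest.split("/"):
        revision = 0                    # 0 = original version, no indicator
        if not item[0].isdigit():
            # "@" = revision 1, "A" = revision 2, "B" = revision 3, and so on.
            revision = ord(item[0]) - ord("@") + 1
            item = item[1:]
        parts.append((int(item), revision))
    return authority, parts
```

A parser like this is mainly useful for diagnostics; for the actual selection of a character set, a lookup of the whole designator string in a small table is simpler and less error-prone.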
The form of the character-set designator was chosen because the standards
currently provide no single code to designate an 8-bit character set in its
entirety. Each half of the character set has its own registration number.
For example, ISO 8859-1 (Latin-1) is a single 8-bit character set, but
registration number 100 only refers to its right half. Registration number 6
denotes ASCII, which is used as the left half of all ISO 8859 character sets.
To promote maximum interoperability among extended Kermit programs, the
Kermit designator should be treated as a character string, to be looked up in
a small table, rather than as a flexible mechanism to be used for piecing
together character sets from an arbitrary assortment of left and right
halves. However, the Ixxx/yyy notation leaves open this possibility should
it become desirable at a later time.
In the event that a new class of registration numbers appears, for example, to
denote a single-byte 8-bit character set in its entirety rather than just its
left or right half, a different initial letter will be used in the designator,
even if the registration authority is the ISO. In the event that other
character-set registration authorities appear, they too can be assigned their
own unique Kermit designator prefixes (for example, "K" for Kermit Development
and Distribution), to avoid ambiguity from conflict of registration numbers.
For the present, standards organizations like ANSI and CCITT are not treated
as separate registration authorities, because their character sets are also
registered by the ISO. Should these organizations adopt character sets that
have no ISO counterpart, then special Kermit designator prefixes will be
assigned for them.
Based on the attribute information, the receiver may accept or reject the
file, using Kermit's normal attribute response mechanism. To accept, it puts
a "Y" as the first character of the data field of the acknowledgement to the
attribute packet. To refuse, it puts an "N" instead of a "Y", followed by
"*". If the file is refused in this manner, the sending Kermit should respond
by sending a "Z" (end-of-file) packet containing a "D" (for Discard) in its
data field.
The behavior of the receiving Kermit program when an unknown character set
is announced to it is governed by the command SET UNKNOWN-CHARACTER-SET.
SET UNKNOWN-CHARACTER-SET KEEP means that it should not reject the file, but
store it the best way it can (e.g., without translating any characters);
SET UNKNOWN-CHARACTER-SET DISCARD means that the file should be rejected.
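The receiver's decision can be summarized in a short sketch. The function name, its arguments, and the returned pair are hypothetical; the sketch assumes the receiver keeps a set of the designators it can translate and returns both the data field of its acknowledgement and a flag saying whether translation will be applied.

```python
def attribute_response(designator, known_designators, unknown_character_set):
    """Return (ACK data field, translate?) for an announced character set.

    unknown_character_set is the SET UNKNOWN-CHARACTER-SET value,
    either "KEEP" or "DISCARD".
    """
    if designator in known_designators:
        return "Y", True        # accept and translate
    if unknown_character_set == "KEEP":
        return "Y", False       # accept, but store without translation
    return "N*", False          # refuse, naming the Encoding attribute
```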
AUTOMATIC SELECTION OF FILE CHARACTER-SET BY THE FILE RECEIVER
When a file arrives whose transfer character-set is announced in the attribute
packet, it is desirable to include a mechanism to allow the receiving Kermit
program to select the most appropriate file character-set automatically.
Similarly, if the user gives a SET FILE CHARACTER-SET command, it would be
desirable to switch to an appropriate TRANSFER CHARACTER-SET automatically,
and vice-versa. Any such mechanism should also include a "manual override"
to let the user disable it.
Suppose, for example, an MS-DOS Kermit program that is about to receive a file
has CP437 as its FILE CHARACTER-SET, but the arriving file is announced as
CYRILLIC. The receiving Kermit can (a) translate the Cyrillic characters into
ASCII using a transliteration scheme (like "Short KOI" phonetic
transcription), or (b) switch its file character set to one that contains the
greatest number of characters that are also in the transfer character set, in
this case CP866.
We can design Kermit programs to supply translations between every possible
combination of file and transfer character set. Or we can allow only certain
combinations, for example Roman-to-Roman, Cyrillic-to-Cyrillic,
Hebrew-to-Hebrew, etc.
In the former case, it is the user's responsibility to choose the most useful
combination. In the latter, the receiving Kermit must either reject the file
when the file character set is not valid for the incoming transfer character
set (or accept it without translation, depending on the setting of
UNKNOWN-CHARACTER-SET), or else switch to an appropriate file character set
automatically.
An optional automatic switching mechanism, configurable by the user, can be
provided by the following command:
SET SEND AUTOMATIC-TRANSLATION { OFF, ON, <FCS> [ <TCS> ] }
Automatic translation action when sending files.
OFF means don't automatically switch translations.
ON means enable automatic translation.
<FCS> <TCS> means: If the current file character set is <FCS>, then use
<TCS> as the transfer character set. If <TCS> is omitted, automatic
selection of a transfer character set for <FCS> is not done, and the
current transfer character set is used. In either case, any previous
entry for <FCS> is superseded.
SET RECEIVE AUTOMATIC-TRANSLATION { OFF, ON, <TCS> [ <FCS> ] }
Automatic translation action when receiving files.
OFF means don't automatically switch translations.
ON means enable automatic translation.
<TCS> <FCS> means: if the announced transfer character set of the incoming
file is <TCS>, then use <FCS> as the file character set. If <FCS> is
omitted, automatic selection of a file character set for <TCS> is not
done, and the current file character set is used. In either case, any
previous entry for <TCS> is superseded.
Any number of these commands can be given. Their effect is to build a pair of
lookup tables. When AUTOMATIC-TRANSLATION is OFF, or the character set is not
found in these tables, the prevailing settings are used. ON can be used to
enable any tables that had been previously disabled by OFF.
The programmer may preload the Kermit program with a default set of tables.
However, the default AUTOMATIC-TRANSLATION setting in both directions should
be OFF.
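The lookup tables built by these commands can be sketched as below, one object per direction. The class and method names are hypothetical. The sketch assumes that adding a pair does not itself turn automatic translation ON, a point the command description leaves open.

```python
class AutomaticTranslation:
    """Lookup table for SET SEND or SET RECEIVE AUTOMATIC-TRANSLATION."""

    def __init__(self, defaults=None):
        self.enabled = False                 # the default setting is OFF
        self.pairs = dict(defaults or {})    # optional preloaded table

    def command(self, first, second=None):
        """Handle the operands OFF, ON, or <set> [ <partner set> ]."""
        if first == "ON":
            self.enabled = True
        elif first == "OFF":
            self.enabled = False
        elif second is None:
            # No partner given: drop any previous entry for this set.
            self.pairs.pop(first, None)
        else:
            # A new entry supersedes any previous one.
            self.pairs[first] = second

    def choose(self, character_set, current):
        """Return the partner set to use, or the prevailing setting."""
        if not self.enabled:
            return current
        return self.pairs.get(character_set, current)
```

For a receiver, character_set is the announced transfer character set and current is the prevailing FILE CHARACTER-SET; for a sender the roles are reversed.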
INTEROPERABILITY WITH UNEXTENDED KERMIT PROGRAMS
Extended Kermit programs must be fully interoperable with unextended ones.
When the file sender is extended and the receiver is not, the receiver ignores
the encoding attribute and stores the file data as received, but after
applying any required record-format conversions. In case the sender's
encoding attribute causes problems for the receiver, the sending Kermit should
have an option to omit this attribute: SET ATTRIBUTES ENCODING OFF (or as a
last resort, SET ATTRIBUTES OFF altogether). The sender has the option of
translating from a local file character set to any desired transfer character
set, including ASCII, that will be useful on the receiving computer.
When the file receiver is extended and the sender is not, the receiver has the
option of translating the received characters to a local file character set.
This will be useful if the character set used in the packets corresponds with
one of the receiver's transfer character sets, and it requires the user to
manually inform the receiving Kermit of both the transfer and the file
character sets.
In other cases, the extended Kermit's TRANSLATE command can be used to
pre- or postprocess a file to achieve the desired results if the desired
translations are available.
PERFORMANCE
There is nothing in this proposal that affects the performance of the Kermit
file transfer protocol. The efficiency of file transfer is the same with
or without this extension.
However, it is recognized that transfer of 8-bit text will not always be
efficient. Since the special characters have their 8th bits set to one, there
will be a lot of 8th-bit prefixing in the 7-bit environment -- the higher the
proportion of special characters to ASCII characters, the lower the
efficiency. For "left-handed" languages like Italian, Norwegian, and
Portuguese (in which the preponderance of text characters are ASCII), the
impact is negligible. For "right-handed" languages like Russian, Greek,
Hebrew, and Arabic, where characters come from the right half of the character
set, efficiency will be poor in the 7-bit environment. The situation is even
worse for Japanese EUC, in which all Kanji bytes have their 8th bit set to 1.
For this reason, it is recommended that Kermit programs that implement
transfer character sets for non-Roman-based writing systems also include
Kermit's locking shift protocol, which is specified and analyzed in a separate
document.
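The effect described above can be estimated by counting the bytes that need an 8th-bit prefix. The sketch below is a rough illustration only: it uses Python codecs for the transfer character sets, ignores control-character prefixing and compression, and the function name and sample phrases are chosen purely for this example.

```python
def prefix_overhead(text, codec):
    """Fraction of bytes that would need an 8th-bit prefix on a 7-bit link."""
    data = text.encode(codec)
    high = sum(1 for b in data if b & 0x80)
    return high / len(data)


# "Left-handed": an Italian phrase in Latin-1 (one accented letter).
italian = prefix_overhead("perch\u00e9 no", "latin_1")

# "Right-handed": a Russian phrase in Latin/Cyrillic (only the space is ASCII).
russian = prefix_overhead(
    "\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440", "iso8859_5")

# Japanese EUC: every byte of a Kanji character has its 8th bit set.
kanji = prefix_overhead("\u65e5\u672c", "euc_jp")
```

A fraction near 1.0 means the encoded data is nearly twice the size of the original when every such byte carries its own prefix, which is the case locking shifts are meant to address.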
TERMINAL EMULATION
While not part of the Kermit file transfer protocol, terminal emulation is an
essential feature of many Kermit programs. It is hoped that all of Kermit's
terminal emulators will evolve along the lines of the ISO standards described
in the Appendices. In some cases, this is already a fact, insofar as DEC
VT200 and 300 series terminals already follow these standards and Kermit
programs are available that emulate these terminals.
The following Kermit commands are recommended for terminal emulation:
SET TERMINAL TYPE <name>
Identify the type of terminal to be emulated, for example VT320.
SET TERMINAL BYTESIZE <number>
Tell how many bits of each arriving character are to be displayed on the
screen. This command is used to protect the user from parity bits sent by
the host during terminal emulation, even when PARITY is set to NONE, so the
normal setting is 7. SET TERMINAL BYTESIZE 8 allows reception of 8-bit
bytes.
SET TERMINAL CHARACTER SET <remote-character-set> [ <local-character-set> ]
Tell how to translate characters during terminal emulation. The
<remote-character-set> denotes the codes sent by, and expected by, the
remote host. The <local-character-set>, if given, specifies the character
codes generated by the local keyboard and displayed on the local screen. If
the <local-character-set> is not specified, the current FILE CHARACTER-SET
is assumed. Since it is likely that neither one of the two character sets
is a standard (TRANSFER) character set, the terminal emulator cannot always
use Kermit's built-in file translation tables or functions directly.
However, it is often possible to use them in a two-step process, using one
of Kermit's transfer character sets as an intermediary.
SET TERMINAL TRANSLATION { INVERTIBLE, READABLE }
Specifies the desired style of character translation to use during
terminal emulation.
SET LANGUAGE
Should not apply to terminal emulation -- characters should not be added
or deleted during translation, because that would interfere with the
formatting of the screen.
SET TERMINAL DIRECTION { LEFT-TO-RIGHT, RIGHT-TO-LEFT }
Specifies the direction of screen writing during terminal emulation.
RIGHT-TO-LEFT can be used for Hebrew and Arabic.
SET TERMINAL LOCKING-SHIFT { ON, OFF }
Specifies whether the terminal emulator should use locking shifts
(normally SO/SI) when sending and receiving 8-bit data in the 7-bit
communications environment. This behavior is built in to certain
terminal emulators (such as VT220, VT320); this command is for use
with terminal emulators that do not have this capability built in.
SET TRANSLATION INPUT \aaa \bbb
or
SET TERMINAL TRANSLATION \aaa \bbb
Specify that when the character \aaa is received from the communication
medium, it should be translated to \bbb before display on the screen.
Many such commands can be given, allowing the user to form a custom-made
terminal character set.
SET KEY <code> <value>
Specify that when the key whose code is <code> is pressed, the Kermit program
sends the specified <value>. Many such commands can be given, allowing the
user to customize the keyboard for any desired character set. The <value> can
be a single character or a string of characters.
Terminal character-set translation should be used in screen capture (session
logging), non-transparent screen-print operations, and "raw uploading" of
text files (TRANSMIT command, when FILE TYPE is TEXT). Character-set
translation should NOT be used in scripting commands such as INPUT and OUTPUT.
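The two-step translation described under SET TERMINAL CHARACTER SET, and the per-character overrides of SET TRANSLATION INPUT, can both be modeled as operations on a 256-entry table. The following is a minimal sketch under stated assumptions: Python's codecs stand in for Kermit's own translation tables, Latin Alphabet 1 is assumed as the intermediary, the code pages "cp437" and "mac_roman" are arbitrary examples of private remote and local sets, and characters with no equivalent are simply left unchanged. All function names are hypothetical.

```python
def build_terminal_table(remote: str, local: str,
                         intermediary: str = "latin-1") -> bytearray:
    """Build a 256-entry remote-to-local table via an intermediary set."""
    table = bytearray(range(256))  # start from the identity mapping
    for code in range(256):
        try:
            # Step 1: remote set -> intermediary (transfer) character set.
            mid = bytes([code]).decode(remote).encode(intermediary)
            # Step 2: intermediary -> local set.
            out = mid.decode(intermediary).encode(local)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # no equivalent: leave the byte unchanged (a simplification)
        if len(out) == 1:
            table[code] = out[0]
    return table


def set_translation_input(table: bytearray, received: int, displayed: int) -> None:
    """Model of SET TRANSLATION INPUT \\aaa \\bbb: override one table entry."""
    table[received] = displayed


def translate_input(data: bytes, table: bytearray) -> bytes:
    """Apply the table to bytes arriving from the communication medium."""
    return data.translate(bytes(table))


table = build_terminal_table("cp437", "mac_roman")
print(translate_input(b"caf\x82", table))   # e-acute moves to its local code
set_translation_input(table, 0x82, ord("e"))
print(translate_input(b"caf\x82", table))   # custom override replaces it
```

Because each byte maps to exactly one byte, no characters are added or deleted, which matches the requirement that translation must not disturb screen formatting.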
APPENDIX A: STANDARDS
ANSI X3.4-1986, "Coded Character Sets - 7-bit American Standard Code for
Information Interchange" (US ASCII), is the 7-bit code currently used by
Kermit for transferring text files.
ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character
Sets for Information Interchange", gives us a 7-bit character set equivalent
to ASCII with provision for substituting "national characters" in selected
positions.
ISO 4873 (1986), "Information Processing - ISO 8-bit Code for Information
Interchange - Structure and Rules for Implementation", defines 8-bit
character sets, their graphic and control regions, and how to extend an
8-bit character set by using multiple intermediate graphics sets.
ANSI X3.134.1 (1991), "8-Bit ASCII - Structure and Rules", the USA equivalent
of ISO 4873.
ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit
Coded Character Sets - Code Extension Techniques", describes how to use
8-bit character sets in both 7-bit and 8-bit environments, and how to switch
among different character sets.
ISO International Register of Coded Character Sets to be Used with Escape
Sequences. This is the source of the ISO registration numbers.
ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape
Sequences". The procedure by which a character set gets into the above
register and has a registration number and designating escape sequence
assigned to it.
JIS X 0202, "Code Extension Techniques for Use with the Code for Information
Interchange", the Japanese counterpart of ISO 2022.
ISO 6429-1983, "C1 Control Character Set".
ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded
Character Set of the American National Standard Code for Information
Interchange", describes 7- and 8-bit codes and extension techniques in
approximately the same manner as ISO 4873 and ISO 2022. (Now obsolete?)
ISO 8859 (1987-present) (see Table 6 for ECMA equivalents), "Information
Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the
actual 8-bit character sets to be used for many of the world's languages.
The left half of each of these is the same as ASCII and ISO 646 IRV. Each
character, including those with diacritics, is represented by a single byte.
ANSI X3.134.2 (1991), "7-Bit and 8-Bit ASCII Supplemental Multilingual
Graphic Character Set", the USA equivalent of ISO 8859-1.
JIS X 0201, Japanese Roman / Katakana set (need full reference).
JIS X 0208, Japanese Kanji set (need full reference).
JIS X 0212, Japanese Kanji set (superset of JIS X 0208, reportedly not in
use yet, need full reference).
ISO is the International Organization for Standardization. ANSI is the
American National Standards Institute. ECMA is the European Computer
Manufacturers Association. JIS means Japanese Industrial Standard.
The ISO/ECMA standards discussed in this proposal may be obtained free of
charge in their ECMA form by writing to:
ECMA Headquarters
Rue du Rhone 114
CH-1204 Geneva
SWITZERLAND
Be sure to specify the title and the ECMA number of each standard requested.
In general, the ISO member body from each country acts as the local sales
agent for ISO Standards in that country, for example ANSI in the USA:
Sales Department
American National Standards Institute
1430 Broadway
New York, NY 10018
Telephone 212-354-3300
Each such organization has its own arrangements for disseminating printed
documents. ANSI sells them for US dollars; organizations in other countries
may either sell them for local currency or give them away, depending on how
they are funded to operate.
ISO standards and CCITT recommendations can also be ordered from the UN
bookstore, but not free of charge:
United Nations Bookstore
United Nations Building
New York, NY 10017
CCITT recommendations are also available by mail order from ANSI.
CCITT recommendations are also available via anonymous FTP on the Internet
from host BRUNO.CS.COLORADO.EDU or DIGITAL.RESOURCE.ORG in the directory
/pub/standards/ccitt/.
APPENDIX B: HOW THE STANDARDS WORK
ASCII and ISO 646 give us a 128-character 7-bit character set. This set is
divided into two parts:
1. 33 "control characters" (characters 0 through 31, and character 127).
2. 95 "graphic characters" (32-126).
Graphic characters make ink appear on the page or phosphor glow on the screen.
Control characters are used as fillers or format effectors and for
transmission or device control. The ASCII / ISO-646 IRV character set is
shown in Figure 1, arranged in a table of 16 rows and 8 columns. The graphic
characters are shown literally (except SP stands for the space character), the
control characters by name (control character names and functions are defined
in ISO 646).
_____________________________________________________________________________
00 01 02 03 04 05 06 07
+---+---+---+---+---+---+---+---+
00 |NUL DLE| SP 0 @ P ` p |
01 |SOH DC1| ! 1 A Q a q |
02 |STX DC2| " 2 B R b r |
03 |ETX DC3| # 3 C S c s |
04 |EOT DC4| $ 4 D T d t |
05 |ENQ NAK| % 5 E U e u |
06 |ACK SYN| & 6 F V f v |
07 |BEL ETB| ' 7 G W g w |
08 |BS CAN| ( 8 H X h x |
09 |HT EM | ) 9 I Y i y |
10 |LF SUB| * : J Z j z |
11 |VT ESC| + ; K [ k { |
12 |FF FS | , < L \ l | |
13 |CR GS | - = M ] m } |
14 |SO RS | . > N ^ n ~ |
15 |SI US | / ? O _ o DEL|
+---+---+---+---+---+---+---+---+
Figure 1: The ASCII / ISO-646 International
Reference Version 7-bit Character Set
_____________________________________________________________________________
Characters are often referred to by their column and row position in this type
of table. For example, character 05/08 in Figure 1 is "X". Columns 00-01,
plus character 07/15, comprise the control set. Columns 02-07, minus
character 07/15, comprise the graphics.
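The column/row notation is simply the high and low four bits of the character code, each written as a two-digit decimal number. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of any standard):

```python
def to_position(code: int) -> str:
    """Return the column/row notation (e.g. '05/08') for a code 0-255."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    return f"{code >> 4:02d}/{code & 0x0F:02d}"


def from_position(position: str) -> int:
    """Return the numeric code for a column/row string such as '04/01'."""
    column, row = (int(part) for part in position.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0-15")
    return column * 16 + row


print(to_position(ord("X")))
print(from_position("04/01"))
```

The same notation extends to the 8-bit tables later in this appendix, where columns 08 through 15 hold the characters whose 8th bit is one.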
ISO Standard 646 allows for national variant 7-bit character sets in which
certain non-alphanumeric ASCII graphic characters are replaced by "national
characters". The character positions in which replacements are permitted,
along with the replacements used by four of the ten ISO 646 national variants,
are shown in Table B-1.
_____________________________________________________________________________
Column/Row  ASCII          German        Finnish       Norwegian     French
04/00       at-sign        section       at-sign       at-sign       a-grave
05/11       left-bracket   A-diaeresis   A-diaeresis   AE-digraph    degree
05/12       backslash      O-diaeresis   O-diaeresis   O-slash       c-cedilla
05/13       right-bracket  U-diaeresis   A-circle      A-circle      section
05/14       circumflex     circumflex    U-diaeresis   circumflex    circumflex
06/00       accent-grave   accent-grave  e-acute       accent-grave  accent-grave
07/11       left-brace     a-diaeresis   a-diaeresis   ae-digraph    e-acute
07/12       vertical-bar   o-diaeresis   o-diaeresis   o-slash       u-grave
07/13       right-brace    u-diaeresis   a-circle      a-circle      e-grave
07/14       tilde          ess-zet       u-diaeresis   tilde         diaeresis
Table B-1: Selected ISO 646 National Variants, Differences from ASCII
_____________________________________________________________________________
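A national variant can be treated as ASCII plus a small substitution table. The sketch below decodes text in the German variant using the German column of Table B-1; the dictionary and function names are illustrative, and the result is expressed as Unicode text purely for convenience.

```python
# German ISO 646 replacements, keyed by the ASCII code of each position
# (taken from the German column of Table B-1).
GERMAN_646 = {
    0x40: "\u00a7",  # 04/00 at-sign       -> section sign
    0x5B: "\u00c4",  # 05/11 left-bracket  -> A-diaeresis
    0x5C: "\u00d6",  # 05/12 backslash     -> O-diaeresis
    0x5D: "\u00dc",  # 05/13 right-bracket -> U-diaeresis
    0x7B: "\u00e4",  # 07/11 left-brace    -> a-diaeresis
    0x7C: "\u00f6",  # 07/12 vertical-bar  -> o-diaeresis
    0x7D: "\u00fc",  # 07/13 right-brace   -> u-diaeresis
    0x7E: "\u00df",  # 07/14 tilde         -> ess-zet
}


def decode_german_646(data: bytes) -> str:
    """Interpret 7-bit bytes as German ISO 646 rather than ASCII."""
    return data.decode("ascii").translate(GERMAN_646)


print(decode_german_646(b"Gr}~e"))
```

The positions not listed (05/14 and 06/00 for German) keep their ASCII meaning, as the table shows.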
The ISO-registered 7-bit national sets are listed in Table B-2.
_____________________________________________________________________________
                                      ISO
Description                           Reg.#
International Reference Version       2
British Version, BSI 4730             4
USA Version, ANSI X3.4-1986           6
Swedish Version, SEN 850200/B         10
Japanese Version, Roman Chars         14
Italian Version                       15
Spanish Version                       17
German Version                        21
Norwegian Version, NS 4551            60
French Version, NF Z 62010            69
Portuguese Version                    84
Hungarian Version, HS 7795/3          86
Cuba National Standard NC 99-10:81    151
Finnish (DEC Private)                 --
French Canadian (DEC Private)         --
Swiss (DEC Private)                   --
Table B-2: National 7-Bit Character Sets
_____________________________________________________________________________
8-bit character sets are described in ISO 4873 and related standards (see
Appendix A). An 8-bit character set has two sides. Each side has a control
set and a graphics set. The "left half" consists of the control set C0 and
the graphics set GL (Graphics Left). GL has 94 characters, and corresponds to
ASCII (and ISO 646 IRV) positions 02/01-07/14. SP (space) and DEL are
special: they are pieces of the template (the upper left and lower right
corners, respectively) into which any 94-character graphic character set must
fit.
All the characters in the left half (C0, GL, SP, and DEL) have their
high-order, or 8th, bit set to zero, and are therefore representable in 7
bits. The "right half" consists of the control set C1 and the graphics set GR
(Graphics Right). All characters in the right half have their 8th bits set to
one. Figure 2 shows the layout of an 8-bit character set, with C1 occupied
by the ISO 6429 control character set.
_____________________________________________________________________________
<--C0--> <---------GL----------> <--C1--> <---------GR---------->
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
00 |NUL DLE| SP 0 @ P ` p | | DCS|---+ |
01 |SOH DC1| ! 1 A Q a q | | PU1| |
02 |STX DC2| " 2 B R b r | | PU2| |
03 |ETX DC3| # 3 C S c s | | STS| |
04 |EOT DC4| $ 4 D T d t | |IND CCH| |
05 |ENQ NAK| % 5 E U e u | |NEL MW | |
06 |ACK SYN| & 6 F V f v | |SSA SPA| |
07 |BEL ETB| ' 7 G W g w | |ESA EPA| |
08 |BS CAN| ( 8 H X h x | |HTS | (special |
09 |HT EM | ) 9 I Y i y | |HTJ | graphics) |
10 |LF SUB| * : J Z j z | |VTS | |
11 |VT ESC| + ; K [ k { | |PLD CSI| |
12 |FF FS | , < L \ l | | |PLU ST | |
13 |CR GS | - = M ] m } | |RI OSC| |
14 |SO RS | . > N ^ n ~ | |SS2 PM | |
15 |SI US | / ? O _ o DEL| |SS3 APC| +---|
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
<--C0--> <---------GL----------> <--C1--> <---------GR---------->
Figure 2: An 8-Bit Character Set
_____________________________________________________________________________
GR character sets can have either 94 or 96 characters. A 94-character GR set
begins in position 10/01 and ends in position 15/14, with Space (SP) occupying
position 10/00 and DEL in position 15/15, just like GL (the corners shown in
GR in the diagram). A 96-character set has graphic characters in all 96
positions, 10/00 through 15/15.
An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters.
This number is sufficient to represent the characters in many of the world's
written languages, but not necessarily sufficient to represent all the graphic
symbols required in a given application, for instance a multi-language
document.
To represent a greater number of graphic characters, ISO 4873 defines four
"intermediate sets" of graphic characters, of either 94 or 96 characters each.
These are called G0, G1, G2, and G3. The G0 set never has more than 94
graphic characters, and G1-G3 can have up to 96 each. Therefore there can be
up to:
94 + (3 x 96) = 382
graphics characters simultaneously within the repertoire of a given device,
assuming all are single-byte sets.
These intermediate graphics sets are kept in tables in the memory of the
terminal or computer. One of the intermediate sets (usually G0) is assigned
to GL, and (in the 8-bit communications environment) another may be assigned
to GR. When the terminal or computer receives a data byte, the numeric value
of its bits denotes the position of the character in GL or GR. For example,
the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII. In the
8-bit environment, any byte with its 8th bit set to zero is from GL, and a
byte with its 8th bit set to one is from GR.
A language like English can be represented adequately by ASCII in GL, because
all the required characters fit there. When a language has more than 94
characters, two techniques are used to represent all the characters:
1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and
the special characters (like accented letters) in GR. French, German,
and Russian are examples.
2. For languages with many symbols (e.g. where a symbol is assigned
to each word, rather than to each sound), represent each character
with multiple bytes rather than one byte. Japanese Kanji, for example,
uses a 2-byte code. A multibyte code may be assigned to G0, G1, G2, or
G3, just like a single-byte code.
How do we assign actual character sets to G0-G3, and how do we associate the
intermediate character sets with the active character set?
Selection of character sets is accomplished using special control characters
and escape sequences embedded within the data stream as described in ISO
Standard 2022. An ESCAPE SEQUENCE is used to DESIGNATE a particular alphabet
(such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular
intermediate graphics set (G0, G1, G2, or G3). A SHIFT FUNCTION is used to
INVOKE a particular intermediate graphics set into GL or GR. In programmer's
terms, GL and GR are pointers into the array of tables G0..G3, and the shift
functions simply change the values of these pointers.
In our discussion, we use the following notation (numbers are decimal unless
otherwise noted):
<ESC> Escape (ASCII 27, character 01/11)
<SP> Space (ASCII 32, character 02/00)
<SO> Shift Out (Ctrl-N, ASCII 14, character 00/14)
<SI> Shift In (Ctrl-O, ASCII 15, character 00/15)
Table 5 shows the alphabet designation functions for single-byte and
multi-byte character sets in both the 7-bit and 8-bit environments. The
character which is substituted for "F" identifies the actual character set to
be used.
_____________________________________________________________________________
Escape
Sequence  Function                                          Invoked By
<ESC>(F   assigns 94-character graphics set "F" to G0.      SI or LS0
<ESC>)F   assigns 94-character graphics set "F" to G1.      SO or LS1
<ESC>*F   assigns 94-character graphics set "F" to G2.      SS2 or LS2
<ESC>+F   assigns 94-character graphics set "F" to G3.      SS3 or LS3
<ESC>-F   assigns 96-character graphics set "F" to G1.      SO or LS1
<ESC>.F   assigns 96-character graphics set "F" to G2.      SS2 or LS2
<ESC>/F   assigns 96-character graphics set "F" to G3.      SS3 or LS3
<ESC>$(F  assigns multibyte character set "F" to G0.        SI or LS0
<ESC>$)F  assigns multibyte character set "F" to G1.        SO or LS1
<ESC>$*F  assigns multibyte character set "F" to G2.        SS2 or LS2
<ESC>$+F  assigns multibyte character set "F" to G3.        SS3 or LS3
Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________
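The designation sequences of Table 5 follow a regular pattern that a receiver can recognize mechanically. The following sketch handles only the forms listed in Table 5; the function name and the returned tuple layout are assumptions made for illustration.

```python
ESC = 0x1B

# Intermediate characters from Table 5 and the set each one designates to.
TARGETS_94 = {ord("("): "G0", ord(")"): "G1", ord("*"): "G2", ord("+"): "G3"}
TARGETS_96 = {ord("-"): "G1", ord("."): "G2", ord("/"): "G3"}


def parse_designation(seq: bytes) -> tuple:
    """Return (intermediate set, kind, final character) for one sequence.

    kind is "94", "96", or "multibyte".
    """
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    if body[0] == ord("$"):
        # Multibyte form: ESC $ <intermediate> F
        if len(body) != 3 or body[1] not in TARGETS_94:
            raise ValueError("unsupported multibyte designation")
        return (TARGETS_94[body[1]], "multibyte", chr(body[2]))
    if len(body) != 2:
        raise ValueError("unsupported designation length")
    if body[0] in TARGETS_94:
        return (TARGETS_94[body[0]], "94", chr(body[1]))
    if body[0] in TARGETS_96:
        return (TARGETS_96[body[0]], "96", chr(body[1]))
    raise ValueError("unknown intermediate character")


print(parse_designation(b"\x1b-A"))
print(parse_designation(b"\x1b$)B"))
```

Returning the kind together with the final character reflects the point that the final character alone does not always identify a character set.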
Table 6 shows the escape sequences used to designate the appropriate parts of
each of the registered character sets discussed in this proposal to G1 (except
that ASCII is designated to G0, which is the normal situation). It is
important to note that the final letter of the escape sequence is not always
sufficient to designate a character set. For example, Czech Standard and JIS
Katakana are both designated by letter I, but the two can be distinguished by
the intermediate characters of the escape sequence, which specify whether the
set is single- or multibyte, or, when both sets are single-byte, whether there
are 94 or 96 characters.
_____________________________________________________________________________
                             Escape    ISO          ECMA       ISO/ECMA
Alphabet Name                Sequence  Reference    Reference  Registration
ASCII (ANSI X3.4-1986)       <ESC>(B   ISO 646 IRV  ECMA-6     6
Latin Alphabet No. 1         <ESC>-A   ISO 8859-1   ECMA-94    100
Latin Alphabet No. 2         <ESC>-B   ISO 8859-2   ECMA-94    101
Latin Alphabet No. 3         <ESC>-C   ISO 8859-3   ECMA-94    109
Latin Alphabet No. 4         <ESC>-D   ISO 8859-4   ECMA-94    110
Latin/Cyrillic               <ESC>-L   ISO 8859-5   ECMA-113   144
Latin/Arabic                 <ESC>-G   ISO 8859-6   ECMA-114   127
Latin/Greek                  <ESC>-F   ISO 8859-7   ECMA-118   126
Latin/Hebrew                 <ESC>-H   ISO 8859-8   ECMA-121   138
Latin Alphabet No. 5         <ESC>-M   ISO 8859-9   ECMA-128   148
* Math/Technical Set         <ESC>-K   ????         ????       143
Chinese (CAS GB 2312-80)     <ESC>$)A  none         none       58
Japanese (JIS X 0208)        <ESC>$)B  none         none       87
JIS-Katakana (JIS X 0201)    <ESC>)I   none         none       13
JIS-Roman (JIS X 0201)       <ESC>)J   none         none       14
Korean (KS C 5601-1989)      <ESC>$)C  none         none       149
Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________
* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series,
and other math/technical/line-drawing characters, but there is apparently
no such standard set at this time (ISO 6862? ISO DIS 10367?)
Tables 7 and 8 show the shift functions that are used to invoke the
intermediate character sets. These shift functions may be either locking or
single. "Locking shift" is like shift-lock on a typewriter. It means that
all subsequent characters until the next shift are to be taken from the
designated intermediate character set. "Single shift" applies only to the
character (either single or multibyte) that follows it immediately, but single
shift functions are only available for the G2 and G3 sets. Locking shift
functions remain in effect across alphabet changes.
In the 7-bit environment, only one character set, GL, can be active at a time.
The active character set can be selected from among the intermediate sets
G0-G3 by the shifts shown in Table 7. Control characters from C0 are
transmitted as-is, and those from the C1 set are sent prefixed by <ESC>
followed by the character value, minus 64. For example, the C1 character
10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").
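The 7-bit representation of C1 controls is a fixed arithmetic transformation, sketched here in both directions. The function names are illustrative; the rule itself (ESC followed by the character value minus 64) is the one stated above.

```python
ESC = b"\x1b"


def c1_to_7bit(code: int) -> bytes:
    """Return the two-byte 7-bit form of a C1 control (128-159)."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError("not a C1 control character")
    return ESC + bytes([code - 64])


def c1_from_7bit(seq: bytes) -> int:
    """Return the 8-bit C1 control denoted by ESC followed by 04/00-05/15."""
    if len(seq) != 2 or seq[:1] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 representation")
    return seq[1] + 64


print(c1_to_7bit(129))    # the example from the text
print(c1_to_7bit(0x8E))   # SS2, character 08/14
```

Applied to SS2 (08/14) and SS3 (08/15), the same rule yields the ESC N and ESC O forms used for single shifts in the 7-bit environment.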
_____________________________________________________________________________
Shift  Representation  Name             Function
SI     Ctrl-O          Shift In         invoke G0 into GL
SO     Ctrl-N          Shift Out        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2  invoke G2 into GL
LS3    <ESC>o          Locking Shift 3  invoke G3 into GL
SS2    <ESC>N          Single Shift 2   select single character from G2
SS3    <ESC>O          Single Shift 3   select single character from G3
Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________
ISO 2022 also allows for an alternative C0 set in which the SS2 function is
assigned to the 7-bit control character EM (Control-Y, 01/09). This set must
be designated by ESC 2/1 4/12 ("The C0 set of control characters of ISO 646
with EM replaced by SS2", ISO Registration number 140). This set is not in
common use.
In the 8-bit environment two character sets, GL and GR, can be active at once.
A GL character is selected by a byte whose 8th bit is zero, and a GR character
by a byte whose eighth bit is one. The actual character sets assigned to GL
and GR are selected by the shifts shown in Table 8. Control characters from
both the C0 and C1 sets are sent as is.
_____________________________________________________________________________
Shift  Representation  Name                   Function
LS0    Ctrl-O          Locking Shift 0        invoke G0 into GL
LS1    Ctrl-N          Locking Shift 1        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2        invoke G2 into GL
LS3    <ESC>o          Locking Shift 3        invoke G3 into GL
LS1R   <ESC>~          Locking Shift 1 Right  invoke G1 into GR
LS2R   <ESC>}          Locking Shift 2 Right  invoke G2 into GR
LS3R   <ESC>|          Locking Shift 3 Right  invoke G3 into GR
SS2    08/14           Single Shift 2         select single character from G2
SS3    08/15           Single Shift 3         select single character from G3
Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________
So we have a 3-tiered system. At the bottom tier lie all the world's coded
character sets. We can designate up to four of them, one to each of the
intermediate graphics sets G0, G1, G2, and G3 using the escape sequences shown
in Tables 5 and 6. The terminal or computer keeps each of the selected
intermediate sets in memory. There is also one active set, composed of GL and
GR. The intermediate sets are invoked to GL or GR (one at a time) by the
shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8. A simplified diagram
for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed
diagrams of both the 7-bit and 8-bit environments). On a more sophisticated
output device, Figure 3 would contain numerous arrows pointing upwards to
demonstrate the operation of the designators and shifts.
_____________________________________________________________________________
+--+--------+ +--+--------+
|C0| GL | |C1| GR |
| | | | | | 8-Bit
| | | | | | Code
| | | | | | In Use
+--+--------+ +--+--------+
LS0 LS1,LS1R LS2,LS2R LS3,LS3R Shifts
SS2 SS3
+--------+ +--------+ +--------+ +--------+ Intermediate
| | | | | | | | Graphics
| G0 | | G1 | | G2 | | G3 | Sets
| | | | | | | |
+--------+ +--------+ +--------+ +--------+
Alphabet
Designation
<ESC>(B <ESC>-A <ESC>-B <ESC>-L <ESC>$)B Sequences
+---------+
+--------+ +--------+ +--------+ +--------+ +--------+ | The world's
| ISO | | ISO | | ISO | | ISO | | JIS X | | registered
| 646IRV | | Latin | | Latin | | Latin | | 0208 | | character
|(ASCII) | | 1 | | 2 | |Cyrillic| | Kanji | + sets
+--------+ +--------+ +--------+ +--------+ +--------+
Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________
For example, the following sequence could be used to transmit the German word
"<u-diaeresis>bern<a-diaeresis>chtig" using Latin Alphabet 1 in the 7-bit
environment:
<ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig
where:
<ESC>(B designates ASCII to G0
<ESC>-A designates the right half of Latin Alphabet 1 to G1
<SO> invokes G1 to GL
| is character 07/12, but since G1 is invoked to GL, it really
denotes character 15/12, which is <u-diaeresis>
<SI> invokes G0 to GL
bern are characters from G0, which is invoked in GL
<SO> invokes G1 to GL
d is character 06/04, but since G1 is invoked to GL, it really
denotes character 14/04, which is <a-diaeresis>
<SI> invokes G0 to GL
chtig are characters from G0
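The walk-through above can be expressed as a small state machine in which GL is a pointer to one of the intermediate sets. This is a deliberately narrow sketch, not a general ISO 2022 decoder: it assumes only the two designations used in the example (ASCII to G0 and the right half of Latin Alphabet 1 to G1) and only the SO and SI shifts. The function name is hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def decode_7bit(stream: bytes) -> str:
    """Decode a 7-bit stream that uses ASCII in G0 and Latin-1 in G1."""
    designated = {"G0": None, "G1": None}  # the intermediate sets
    gl = "G0"                              # which set GL currently points to
    out = []
    i = 0
    while i < len(stream):
        b = stream[i]
        if b > 0x7F:
            raise ValueError("8-bit byte in a 7-bit stream")
        if b == ESC:
            seq = stream[i:i + 3]
            if seq == b"\x1b(B":
                designated["G0"] = "ascii"
            elif seq == b"\x1b-A":
                designated["G1"] = "latin1-right"
            else:
                raise ValueError("unsupported escape sequence")
            i += 3
            continue
        if b == SO:
            gl = "G1"                      # invoke G1 into GL
        elif b == SI:
            gl = "G0"                      # invoke G0 into GL
        elif b < 0x20:
            out.append(chr(b))             # other C0 controls pass through
        elif designated[gl] == "ascii":
            out.append(chr(b))
        elif designated[gl] == "latin1-right":
            out.append(chr(b | 0x80))      # restore the 8th bit
        else:
            raise ValueError("no character set designated to " + gl)
        i += 1
    return "".join(out)


print(decode_7bit(b"\x1b(B\x1b-A\x0e|\x0fbern\x0ed\x0fchtig"))
```

Restoring the 8th bit is sufficient here only because Latin Alphabet 1 code values coincide with the character numbers Python uses for text; any other G1 set would need a lookup table.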
The same word could be transmitted in the 7-bit environment using single
shifts, if Latin Alphabet 1 were designated to G2 (or G3):
<ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig
(where <ESC>.A designates the 96-character right half of Latin-1 to G2, as in
Table 5, and <ESC>N is Single Shift 2).
In the 8-bit environment it could be transmitted using no shifts at all:
<ESC>(B<ESC>-A<u-diaeresis>bern<a-diaeresis>chtig
The designation escape sequences are transmitted only at the beginning of a
session and need not be repeated after the initial designations are made,
unless an intermediate set (G0-G3) is to be recycled.
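The sending side of these examples can be sketched in the same spirit. The functions below produce the locking-shift and single-shift 7-bit forms for text that fits in Latin Alphabet 1. They are illustrative only: the names are hypothetical, C1 controls are rejected rather than converted, and the designations are emitted unconditionally at the start. The single-shift variant uses <ESC>.A, the 96-character G2 designator of Table 5, on the assumption that the right half of Latin Alphabet 1 is handled as a 96-character set.

```python
SO, SI = 0x0E, 0x0F


def encode_locking(text: str) -> bytes:
    """7-bit form using SO/SI, with ASCII in G0 and Latin-1 in G1."""
    out = bytearray(b"\x1b(B\x1b-A")
    in_g1 = False
    for b in text.encode("latin-1"):
        if 0x80 <= b <= 0x9F:
            raise ValueError("C1 controls are not handled in this sketch")
        want_g1 = b >= 0xA0
        # C0 controls need no shift; graphics may require one.
        if b >= 0x20 and want_g1 != in_g1:
            out.append(SO if want_g1 else SI)
            in_g1 = want_g1
        out.append(b & 0x7F)           # strip the 8th bit
    if in_g1:
        out.append(SI)                 # leave GL pointing at G0
    return bytes(out)


def encode_single(text: str) -> bytes:
    """7-bit form using Single Shift 2, with Latin-1 designated to G2."""
    out = bytearray(b"\x1b(B\x1b.A")
    for b in text.encode("latin-1"):
        if 0x80 <= b <= 0x9F:
            raise ValueError("C1 controls are not handled in this sketch")
        if b >= 0xA0:
            out += b"\x1bN"            # SS2 applies to the next character only
        out.append(b & 0x7F)
    return bytes(out)


word = "\u00fcbern\u00e4chtig"
print(encode_locking(word))
print(encode_single(word))
```

Locking shifts cost two bytes per run of special characters, single shifts two bytes per special character, so the better choice depends on how the special characters are distributed in the text.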
To understand the three-tiered design of ISO 2022, imagine a computer
programmed to display a mixture of character sets on its screen. A large
collection of fonts might be stored on the disk, one font per file. These are
the character sets of the bottom tier. When a font is needed, it will be read
from the disk and stored in memory in an array, for rapid access. If several
fonts are needed, they will be stored in several arrays. These arrays are the
intermediate character sets, G0-G3. When a data byte arrives to be displayed,
the actual graphic representation is taken from GL or GR (depending on the
byte's 8th bit). GL is associated with one of the intermediate graphic sets,
and GR with another. If no more than four character sets are used, then each
one needs to be read from the disk only once, and display is rapid and
efficient thereafter.
Perhaps the most common application of ISO 2022 shifting techniques is with
the Japanese EUC (Extended UNIX Code) character set, which combines JIS X
0201 (which in turn consists of an ASCII-like Roman alphabet in the left half
and Japanese Katakana characters in the right) and JIS X 0208 (a double-byte
Japanese Kanji character set). EUC encoding is used not only in data
communications, but also in files, e-mail, etc. EUC is used as follows:
Left half of JIS X 0201 (Roman, similar to ASCII) is designated to G0.
JIS X 0208 (Kanji) is designated to G1.
Right half of JIS X 0201 (Katakana) is designated to G2.
G0 is initially invoked to GL.
G1 is initially invoked to GR.
In the 8-bit environment, any byte with its 8th bit equal to zero is a Roman
G0 graphic or a C0 control character. A byte with its 8th bit equal to 1 and
low-order 7 bits falling in the graphic range is the first byte of a Kanji
character pair. Others are C1 controls. The C1 control character SS2
selects the subsequent single byte from the Katakana set.
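The byte classification rules for EUC in the 8-bit environment can be sketched as a simple scanner. This is an illustration of the rules just stated, not a complete EUC parser: the function name and token labels are hypothetical, Katakana bytes are assumed to lie in the range 10/01 through 13/15, and SS3 is treated like any other C1 control.

```python
SS2 = 0x8E  # C1 control 08/14


def split_euc(data: bytes) -> list:
    """Split 8-bit EUC data into (label, bytes) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # 8th bit zero: Roman G0 graphic or C0 control.
            tokens.append(("left", data[i:i + 1]))
            i += 1
        elif b == SS2:
            # Single shift: the next byte comes from the Katakana set (G2).
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xDF:
                raise ValueError("SS2 not followed by a Katakana byte")
            tokens.append(("katakana", data[i + 1:i + 2]))
            i += 2
        elif 0xA1 <= b <= 0xFE:
            # Graphic range with 8th bit set: first byte of a Kanji pair.
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError("incomplete Kanji pair")
            tokens.append(("kanji", data[i:i + 2]))
            i += 2
        elif b <= 0x9F:
            tokens.append(("c1", data[i:i + 1]))
            i += 1
        else:
            raise ValueError("byte outside the 94-character graphic range")
    return tokens


print(split_euc(b"A\xb0\xa1\x8e\xb1"))
```

Because every Kanji byte has its 8th bit set, a scanner of this kind never needs to look backward to decide whether a byte with the 8th bit clear is Roman text.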
In the 7-bit environment, SO and SI are used to shift G1 in and out of GL,
and Kanji bytes are transmitted without their high-order bits. C1 controls,
including SS2, are transmitted in their 2-byte 7-bit form (SS2 becomes
<ESC>N).
ANNOUNCING ISO 2022 FACILITIES
A large portion of ISO 2022 is devoted to describing how 8-bit characters may
be transmitted on a 7-bit communication path, for example when parity is in
use. In the 7-bit environment, there is only GL -- no GR. Therefore, all
characters are transmitted with their 8th bit removed, and shifts are used to
specify which intermediate set they belong to.
In fact, there are many possible ways to use the ISO 2022 code extension
facilities within both 7-bit and 8-bit environments. For example, the sender
may inform the receiver in advance whether G1, G2, or G3 will be used, etc, so
that the receiver can allocate the appropriate resources. At the beginning of
any particular data transfer, the facilities that actually will be used can be
announced with a sequence of the form <ESC><SP>F, where F is replaced by an
ISO 2022 announcer. Several of the most important ones are described here.
Table 9 lists all the defined announcers in summary form. For details, see
ISO 2022.
<ESC><SP>A means that only the G0 set will be used, invoked into GL. No
shift functions will be used. In the 8-bit environment, GR is not used.
In other words, only a single 7-bit character set is used.
<ESC><SP>B means the G0 and G1 sets will be used with locking shifts. In the
7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL. In the
8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL. In other
words, two character sets are used, with characters from both sets always
sent as 7-bit values, with locking shifts used to specify the 8th bit.
<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0
invoked in GL and G1 in GR. No locking shift functions are used. In other
words, a single 8-bit character set is used, with all 8 bits transmitted as
data. GL is selected when the character's 8th bit is zero, GR is selected
when the 8th bit is one.
<ESC><SP>D means that G0 and G1 will be used with locking shifts. In the
7-bit environment, <SI> invokes G0 into GL and <SO> invokes G1 into GL. In
the 8-bit environment, all 8 bits of each character are transmitted with no
shifts.
<ESC><SP>L means that Level 1 of ISO 4873 will be used. That is, a single
8-bit character set with C0, G0, C1, and G1, with no shift functions.
This is like <ESC><SP>C.
<ESC><SP>M means that Level 2 of ISO 4873 will be used. This is equivalent
to Level 1, with the addition of G2 and G3. Characters from G2 and G3 are
invoked only by the single-shift functions SS2 and SS3.
<ESC><SP>N means that Level 3 of ISO 4873 will be used. This is equivalent
to Level 2 with the addition of the locking shift functions LS1R, LS2R, and
LS3R. (Note that ISO 4873 does not concern itself with the 7-bit
environment, and therefore does not discuss the use of LS0, LS1, LS2, or
LS3.)
_____________________________________________________________________________
Esc Sequence 7-Bit Environment 8-Bit Environment
<ESC><SP>A G0->GL G0->GL
<ESC><SP>B G0-(SI)->GL, G1-(SO)->GL G0-(LS0)->GL, G1-(LS1)->GL
<ESC><SP>C (not used) G0->GL, G1->GR
<ESC><SP>D G0-(SI)->GL, G1-(SO)->GL G0->GL, G1->GR
<ESC><SP>E Full preservation of shift functions in 7 & 8 bit environments
<ESC><SP>F C1 represented as <ESC>F C1 represented as <ESC>F
<ESC><SP>G C1 represented as <ESC>F C1 represented as 8-bit quantity
<ESC><SP>H All graphic character sets have 94 characters
<ESC><SP>I All graphic character sets have 94 or 96 characters
<ESC><SP>J In a 7 or 8 bit environment, a 7 bit code is used
<ESC><SP>K In an 8 bit environment, an 8 bit code is used
<ESC><SP>L Level 1 of ISO 4873 is used
<ESC><SP>M Level 2 of ISO 4873 is used
<ESC><SP>N Level 3 of ISO 4873 is used
<ESC><SP>P G0 is used in addition to any other sets:
G0 -(SI)-> GL G0 -(LS0)-> GL
<ESC><SP>R G1 is used in addition to any other sets:
G1 -(SO)-> GL G1 -(LS1)-> GL
<ESC><SP>S G1 is used in addition to any other sets:
G1 -(SO)-> GL G1 -(LS1R)-> GR
<ESC><SP>T G2 is used in addition to any other sets:
G2 -(LS2)-> GL G2 -(LS2)-> GL
<ESC><SP>U G2 is used in addition to any other sets:
G2 -(LS2)-> GL G2 -(LS2R)-> GR
<ESC><SP>V G3 is used in addition to any other sets:
G3 -(LS3)-> GL G3 -(LS3)-> GL
<ESC><SP>W G3 is used in addition to any other sets:
G3 -(LS3)-> GL G3 -(LS3R)-> GR
<ESC><SP>Z G2 is used in addition to any other sets:
SS2 invokes a single character from G2
<ESC><SP>[ G3 is used in addition to any other sets:
SS3 invokes a single character from G3
Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________
STANDARD VERSUS PRIVATE CHARACTER SETS
Most of the popular private 8-bit character sets, notably the IBM PC code
pages and the Apple Macintosh character sets (but they are not alone), differ
from the standard character sets in four important ways:
1. The repertoire of characters is different.
2. The encoding of characters is different.
3. The C1 area is sometimes used for graphics, which is forbidden by the
standards.
4. In some cases, even the C0 area is used for graphics.
However, most of these character sets conform to the requirement that the
left half be identical with US ASCII.
APPENDIX C: (deleted)
APPENDIX D: SUMMARY OF KERMIT COMMANDS RELATED TO CHARACTER SET TRANSLATION
SET FILE TYPE { BINARY, TEXT }
BINARY means no translation, and overrides all other file-related
commands, including SET TRANSFER.
TEXT is the default. Enables file transfer character set translation,
depending on the setting of SET TRANSFER.
SET FILE CHARACTER-SET <name>
Effective only when file type is TEXT.
Tell Kermit what character set the file is coded in,
or what character set to translate an incoming file to.
SET TRANSFER { CHARACTER-SET <name>, LOCKING-SHIFT { ON, OFF, FORCED } }
CHARACTER-SET <name>
Invoke file transfer character set translation. <name> is
TRANSPARENT, ASCII, LATIN1, LATIN2, ..., CYRILLIC, JAPAN-EUC, etc.
LOCKING-SHIFT { ON, OFF, FORCED }
Enable, disable, or force locking-shift transport protocol for
efficient transfer of 8-bit data in the 7-bit communications
environment. Normally enabled. Used only if both Kermit programs
agree in the feature negotiation phase to use it (essentially, if
PARITY is not NONE, and they both have locking-shift capability).
SET LANGUAGE <name>
This command informs the program which language is being translated,
to allow for special language-based transliteration rules, such as
replacing a-diaeresis by ae.
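A language-based rule of this kind might be sketched in Python as follows.
Only the a-diaeresis rule comes from this document; the other entries in
GERMAN_RULES follow common German practice and are assumptions, as are the
names GERMAN_RULES and transliterate.

```python
# Hypothetical rule table for German. Only "ä" -> "ae" is taken from the text;
# the remaining entries are assumed by analogy.
GERMAN_RULES = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate(text: str, rules: dict) -> str:
    """Replace each character that has a rule; pass all others through."""
    return "".join(rules.get(ch, ch) for ch in text)

print(transliterate("Käse", GERMAN_RULES))
```

Without a language setting, a translation into ASCII can only drop the
diacritic or substitute a placeholder, since it has no way to know which
multi-character replacement is conventional.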
SET { TRANSFER, TERMINAL } TRANSLATION { INVERTIBLE, READABLE }
Specify the goal of the specified translation: invertibility or
readability.
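The distinction can be illustrated with 256-entry translation tables,
assuming (as the terms suggest) that an invertible translation is one-to-one
while a readable one may map several codes to the same look-alike. The
following Python sketch uses toy tables; the function names are hypothetical.

```python
def is_invertible(table: bytes) -> bool:
    """A 256-entry table is invertible if no two codes share an image."""
    return len(table) == 256 and len(set(table)) == 256

def invert(table: bytes) -> bytes:
    """Build the reverse table of an invertible translation."""
    if not is_invertible(table):
        raise ValueError("table is not one-to-one")
    inverse = bytearray(256)
    for src, dst in enumerate(table):
        inverse[dst] = src
    return bytes(inverse)

identity = bytes(range(256))

# Toy invertible table: exchange codes 0xE4 and 0x84, leave the rest alone.
swapped = bytearray(identity)
swapped[0xE4], swapped[0x84] = 0x84, 0xE4
swapped = bytes(swapped)

# Toy readable table: show code 0xE4 as a plain "a"; information is lost.
readable = bytearray(identity)
readable[0xE4] = ord("a")
readable = bytes(readable)

print(is_invertible(swapped), is_invertible(readable))
```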
SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }
Tell the file receiver whether to keep or cancel an incoming file whose
character set is unknown. KEEP is the default.
SET { SEND, RECEIVE } AUTOMATIC-TRANSLATION { ON, OFF, <set1> [ <set2> ] }
Enable or disable automatic selection of a file transfer
translation table in the indicated direction, or specify pairs of
character sets to be used: given <set1>, automatically translate to
<set2>. Default in both directions is OFF.
SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name-of-attribute> { ON, OFF }
Enables or disables processing of attribute packets, or specific
attribute fields such as DATE, ENCODING, LENGTH, etc.
SET TERMINAL { CHARACTER-SET, DIRECTION, LOCKING-SHIFT, TRANSLATION }
Specifies terminal emulation character-set translation, screen writing
direction, locking shift usage, and translation goal.
SHOW { CHARACTER-SETS, LANGUAGE, FILE, TRANSFER, PROTOCOL, TERMINAL }
Display what character sets, translation tables, and languages are
available, and which ones are currently selected.
TRANSLATE <file1> <file2> [ <file1-character-set> [ <file2-character-set> ] ]
Copies local file <file1> to local file <file2>, translating <file1> from
<file1-character-set> to <file2-character-set>. If <file1-character-set>
is not specified, the current FILE CHARACTER-SET is used. If
<file2-character-set> is not specified, the current TRANSFER CHARACTER-SET
is used. Note that this command can be used to convert between two
different FILE CHARACTER-SETS, in which case an appropriate TRANSFER
CHARACTER-SET can be used in an intermediate step.
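The two-step conversion through an intermediate transfer character set can
be sketched in Python as follows. The function translate and its parameters
are hypothetical, Python's bundled codecs stand in for Kermit's translation
tables, and characters missing from a target set are replaced by "?"; none of
this is claimed to match the behavior of any particular Kermit program.

```python
def translate(data: bytes, file1_cs: str, file2_cs: str,
              transfer_cs: str = "latin-1") -> bytes:
    """Convert data from file1_cs to file2_cs by way of transfer_cs."""
    # Step 1: file character set -> transfer character set.
    intermediate = data.decode(file1_cs).encode(transfer_cs, errors="replace")
    # Step 2: transfer character set -> second file character set.
    return intermediate.decode(transfer_cs).encode(file2_cs, errors="replace")

cent_437 = "¢".encode("cp437")
print(translate(cent_437, "cp437", "cp850") == "¢".encode("cp850"))
```

A character that exists in both file character sets but not in the transfer
character set is lost in the intermediate step, which is why the choice of an
appropriate TRANSFER CHARACTER-SET matters.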
APPENDIX E: (Deleted)
APPENDIX F: (Deleted)
APPENDIX G: OFFICIAL CHARACTER SET TRANSLATIONS
Apple: ???
Atari: ???
IBM:
IBM lists its character sets in the following manuals:
"Graphic Character Identification System, Graphic Character Global
Identifier (GCGID) Structure", C-H 3-3220-055, 1989 (Internal Use Only).
"Registry of Graphic Character Sets and Code Pages", C-H 3-3220-050
(Internal Use Only).
The translations between its corporate code pages and ISO standard character
sets are given in the following manuals:
"SAA Character Data Representation Architecture (CDRA)"
Executive Overview: GC09-1392-00 (15 pages)
Level-1, Reference: SC09-1390-00 (64 pages)
Level-1, Registry: SC09-1391-00 (tables, 720 pages)
In particular, IBM has adopted ISO 8859-1 Latin Alphabet 1 as IBM Code Page
0819, and publishes its official, invertible translations between this code
page and various private IBM code pages (such as CP850 and CECP500), as
well as invertible or noninvertible translations between many other pairs of
IBM code pages. From these, it is possible to infer other translations, for
example between Code Page 437 and Latin-1.
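Inferring one translation from two others amounts to composing their
tables. The Python sketch below shows the idea with tables derived from
Python's bundled codecs. These are NOT IBM's official CDRA tables; the
functions build_table and compose are hypothetical, and codes with no
counterpart in the target set are mapped to "?".

```python
def build_table(src: str, dst: str) -> bytes:
    """256-entry table: index is the code in src, value is the code in dst."""
    return bytes(range(256)).decode(src).encode(dst, errors="replace")

def compose(first: bytes, second: bytes) -> bytes:
    """Table equivalent to applying `first` and then `second`."""
    return first.translate(second)

# Stand-in tables built from Python codecs (an assumption of this sketch).
cp437_to_cp850 = build_table("cp437", "cp850")
cp850_to_latin1 = build_table("cp850", "latin-1")

# The inferred Code Page 437 -> Latin-1 table.
cp437_to_latin1 = compose(cp437_to_cp850, cp850_to_latin1)

print("¢".encode("cp437").translate(cp437_to_latin1) == "¢".encode("latin-1"))
```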
Commodore: ???
Data General: ???
Digital Equipment Corporation: ???
Microsoft: ???
(Much work is needed on this section...)
REFERENCES:
The standards listed in Appendix A, the documents in Appendix G, plus:
CCITT Recommendation T.61, "Character Repertoire and Coded Character Sets for
the International Teletex Service", Geneva (1980, amended at
Malaga-Torremolinos 1984).
Chandler, John, "IBM System/370 Kermit User's Guide", version 4.2, (1991)
(Internet: watsun.cc.columbia.edu:kermit/b/ik[cmtx]ker.{doc,ps}). For VM/CMS,
MVS/TSO, CICS, and MUSIC. Detailed description of how to use Kermit's
character set translation facilities in the IBM mainframe environment.
da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
The specification of the Kermit file transfer protocol before the addition
of this extension.
Do, James, Ngo^ Thanh Nha`n, Hoa`ng Nguye^n, "A proposal for Vietnamese
character encoding standards in a unified text processing framework", Computer
Standards & Interfaces 14 (1992) 3-12, Elsevier North-Holland.
Gianone, Christine M., "It's Time to Prepare for International Computing",
PC Week, October 2, 1989.
Gianone, Christine M., "Using MS-DOS Kermit", Second Edition, Digital Press
(1991). Chapter 13 describes how to use the character set translation
facilities of MS-DOS Kermit 3.0 and later on IBM PCs, PS/2s, and compatibles,
for both terminal emulation and file transfer. Also included are character
set and conversion tables for many Roman and Cyrillic character sets.
Gianone, Christine M., and Frank da Cruz, "C-Kermit User Guide", version 5A
(1991) (Internet: watsun.cc.columbia.edu:kermit/sw/ckuker.{doc,ps}).
Description of the terminal and file transfer character set translation
features of C-Kermit 5A for UNIX and VAX/VMS.
Gianone, Christine M., and Frank da Cruz, "A Locking Shift Mechanism for the
Kermit File Transfer Protocol", unpublished paper, Columbia University,
October 1991 (watsun.cc.columbia.edu:kermit/e/lshift.txt). A Kermit protocol
extension for transferring 8-bit text efficiently in the 7-bit communication
environment.
Hart, Edwin (ed.), "ASCII and EBCDIC Character Set and Code Issues in a
Systems Applications Architecture", SSD #366, SHARE Inc., Chicago, IL, USA
(June 1989). Commonly called the "SHARE White Paper". A cogent description
of the problems of character set translation in the IBM computing environment,
with recommendations adopted by SHARE, an international, voluntary
organization of users of IBM systems.
IBM System/370 Reference Summary, IBM GX20-1850-6. The definitive US-ASCII /
US-EBCDIC translation table.
ISO 639, "Code for Representation of Names of Languages" (1988). Useful for
naming language-related symbols in Kermit programs.
ISO 3166, "Country Codes" (1988 + Registration Newsletter updates). Useful
for naming country-related symbols in Kermit programs.
ISO/IEC 10646-1:1993, Multiple-Octet Coded Character Set. The universal
character set.
Pirard, Andr'e, "Guidelines to Use 8-Bit Character Codes", University of
Liege, Belgium, unpublished paper on character set translation problems,
written from the West European perspective, listing numerous suggested
invertible translation tables.
Files: watsun.cc.columbia.edu:kermit/charsets/iso8859.networking and
iso8859.moretran.
"The Unicode Standard, Worldwide Character Encoding", Version 1.0, Volume 1,
Addison-Wesley (1991).
[End of ISOK7.TXT]