********************
NOTE: THIS VERSION IS OBSOLETE, AND HAS BEEN SUPERSEDED BY ISOK6.TXT.
********************
A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS
Christine Gianone
Manager, Kermit Development and Distribution
Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA
DRAFT NUMBER 5
APRIL 25, 1990
ABSTRACT
A two-level extension to the presentation layer of the Kermit file transfer
protocol is proposed to allow transfer of non-English-language text files
between unlike computers. Level 1 allows substitution of single character
sets other than ASCII in Kermit's normal text-file transfer syntax. Level 2
specifies a new transfer syntax in which multiple character sets may be used,
along with mechanisms for switching among them as defined in ISO Standard
2022.
This is still a DRAFT proposal. Readers with knowledge of real-world
multi-alphabet applications and file formats are urged to comment on the
suitability of this proposal. It is assumed the reader is familiar with the
Kermit file transfer protocol. It is also assumed that the reader is
familiar with ISO Standards 4873 and 2022, but these are summarized in
Appendix B.
SUMMARY OF CHANGES SINCE DRAFT #4, August, 1989
- Changes for Level 1 only, to reflect experience in writing the code
to implement it for MS-DOS Kermit 3.0, C-Kermit 5A, and Kermit 370 4.2.
Level 2 is on hold indefinitely pending ISO 10646 & Unicode developments.
- Abandonment of separate attributes for encoding and character set.
- Change all references to ASCII as I2 into I6.
- Change description of SET LANGUAGE to remove side effects.
- Differentiation of SET TRANSFER CHARACTER ASCII and TRANSPARENT.
- The section on terminal emulation has not been changed, even though
this subject needs detailed treatment in this document.
SUMMARY OF CHANGES SINCE DRAFT #3, July 20, 1989
- Expanded & more precise definition of Kermit's character set designators
- Simplification of the syntax of the (former) SET TRANSFER-SYNTAX command
- Addition of SET LANGUAGE command
- Clarification of Kermit's behavior when it receives an unknown character set
- Addition of Appendix F to specify how each Kermit Level is invoked
- Correction of numerous typographical and other errors
ACKNOWLEDGEMENTS
Many thanks to these people for their helpful and constructive comments on the
first three drafts. In most cases, their suggestions or the information they
provided have been incorporated into this or previous drafts.
John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
Alan Curtis (University of London, UK)
Frank da Cruz (Columbia University, USA)
Joe Doupnik (Utah State University, USA)
Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
John Klensin (Massachusetts Institute of Technology, USA)
Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
Vladimir Novikov (VNIIPAS, Moscow, USSR)
Jacob Palme (Stockholm University, Sweden)
Andre Pirard (University of Liege, Belgium)
Paul Placeway (Ohio State University, USA)
Gisbert W. Selke (University of Bonn, West Germany)
Fridrik Skulason (University of Iceland, Reykjavik)
Johan van Wingen (Leiden, Netherlands)
Konstantin Vinogradov (ICSTI, Moscow, USSR)
Amanda Walker (InterCon Systems Corp, USA)
Thanks also to the following people for organizing meetings or conferences
in their countries at which the issues of this proposal were discussed:
Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)
and thanks also to those who attended these gatherings!
STATEMENT OF THE PROBLEM
Kermit has always been able to transfer text files between unlike systems
(e.g. a UNIX system with ASCII stream text files and an IBM mainframe with
EBCDIC record-oriented text files). To do the text file code conversion,
Kermit transfers text in ASCII. But ASCII only includes enough letters and
symbols for English.
There are now computers capable of representing the characters of other
languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew,
Arabic, and Greek characters, Japanese and Chinese ideograms. But different
computer manufacturers use different codes for these characters.
For example, the IBM PS/2 and the Apple Macintosh have character sets that are
"8-bit ASCII". When the character value is 32-127, the character is
(normally) a standard ASCII graphic (printable) character. When the value is
128 or higher, it is a special character. But the PC and the Macintosh assign
different special characters to these values. Here are just a few examples:
  Value   PS/2 Character        Macintosh Character
   138    Small e grave         Small a umlaut
   143    Capital A ring        Small e grave
   144    Capital E acute       Small e circumflex
   136    Small e circumflex    Small a grave
When a file contains "8-bit ASCII", Kermit presently transfers it without any
character translation. Therefore, a text file written in French, German,
Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain
the wrong characters when it arrives at its destination: the PS/2's e-grave
becomes a-umlaut on the Macintosh, etc.
The problem is compounded when a file is composed of characters from more than
one character set, for example a Japanese text file that contains Kanji,
Katakana, and Roman characters.
There are many computer vendors in the world and nobody controls what codes
they use to represent characters. Without a standard protocol for
transferring non-ASCII text, each computer would have to know the codes of all
the other computers in order for correct transfer of non-English text files to
occur between unlike systems.
NORMAL KERMIT FILE TRANSFER SYNTAX
The Kermit file transfer protocol makes a distinction between text and binary
files. Binary files are transmitted with no translation or conversion. For
text files, the Kermit protocol defines a standard transfer syntax, namely
ASCII characters with carriage return and linefeed (CRLF) after
each line, so that text may be stored in useful fashion on any computer to
which it is transferred. Each Kermit program knows how to translate from the
local text-file storage conventions to ASCII/CRLF syntax, and vice versa.
This is the basic, required, and default mode of operation for any Kermit
program, and it will be referred to as Kermit's "Normal" or "Level 0" syntax.
EXPANDED KERMIT TRANSFER SYNTAX
This proposal adds two additional levels of transfer syntax, Levels 1 and 2.
Level 1 permits the use of a single character set other than ASCII in the
transfer syntax. These additional character sets are taken from recognized
national or international standards, such as ISO 8859-1 (Latin Alphabet 1),
JIS X 0208 (Japanese), etc.
By using a standard character set (other than ASCII), it is possible to
transfer a text file written in a language other than English, and it is also
possible to transfer a text file containing more than one language. For
example Latin Alphabet 1 can represent a file containing a mixture of Italian,
Norwegian, French, German, English, and Icelandic.
Level 2 allows a mixture of character sets to transfer mixed-language text
that requires characters from more than one standard character set, for
example a document written in Russian, French, and Greek.
The additional levels are optional features for Kermit programs, except that
Level 2 should not be provided without Level 1.
The following discussion applies to text-file transfer only. When the Kermit
user has selected binary file transfer, none of the text-file conversions
discussed here apply.
EXPANDED SYNTAX, LEVEL 1
When all the characters in a text file can be represented by a single
character set, then that character set can be used in place of ASCII in
Kermit's transfer syntax.
Whatever the transfer character set, there must be a mapping between the local
file character set and the character set of the common transfer syntax. That
is, there must be a pair of translation functions in the program, one from the
local file character set to the transfer character set, and one from the
transfer set to the local set.
Until now, many Kermit programs have lacked such a translation function,
because the local file character set was the same as the transfer character
set, namely ASCII. But there have always been exceptions. For example, IBM
System/370 mainframe Kermit must translate between ASCII and its local EBCDIC
character set.
To complicate matters, many computers now support a variety of character sets.
IBM mainframes have not only "standard" US EBCDIC, but also several
EBCDIC-based Country Extended Code Pages (CECPs) for the support of West
European languages, Hebrew, etc. The IBM PC and PS/2 have a variety of
ASCII-based 8-bit code pages for the same purpose. These character sets are a
welcome addition, because they allow users of these computers to create,
display, and print documents in languages other than English. Unfortunately,
it is usually the case that the computer's file system keeps no record of
which character set is used in each file.
For this reason, the following command should be provided to allow the Kermit
user to specify the local file character set:
SET FILE CHARACTER-SET <file-character-set-name>
The file character set name is a system-dependent item. Some computers have
only one character set, in which case the SET FILE CHARACTER-SET command would
be unnecessary.
This command will be necessary on computers that use the "national replacement
characters" allowed by ISO Standard 646. This standard specifies a 7-bit
character set equivalent to ASCII, but with national variants in which certain
non-alphanumeric ASCII graphic characters are replaced by "national
characters", as shown in Table 1.
_____________________________________________________________________________
Column/Row  ASCII          German        Finnish   Norwegian     French
  04/00     at-sign        section       at-sign   at-sign       a-grave
  05/11     left-bracket   A-umlaut      A-umlaut  AE-digraph    degree
  05/12     backslash      O-umlaut      O-umlaut  O-slash       c-cedilla
  05/13     right-bracket  U-umlaut      A-circle  A-circle      section
  06/00     accent-grave   accent-grave  e-acute   accent-grave  accent-grave
  07/11     left-brace     a-umlaut      a-umlaut  ae-digraph    e-acute
  07/12     vertical-bar   o-umlaut      o-umlaut  o-slash       u-grave
  07/13     right-brace    u-umlaut      a-circle  a-circle      e-grave
  07/14     tilde          ess-zet       u-umlaut  tilde         umlaut
             Table 1: ISO 646 Usage in Selected Countries
_____________________________________________________________________________
(See Figure 1 in Appendix B for an explanation of column/row notation.)
For example, the German phrase "Gr<u-umlaut><ess-zet>e aus K<o-umlaut>ln"
would be rendered in ASCII as "Gr}~e aus K|ln", and the ASCII C-language
phrase "{~a[x]}" would become:
<a-umlaut><ess-zet>a<A-umlaut>x<U-umlaut><u-umlaut>
in German ISO 646. The German user would want Kermit to interpret the local
file characters as German in the former case, and as ASCII in the latter.
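As an illustration only (it is not part of the proposal), the German column
of Table 1 can be written as a small replacement map applied on top of ASCII.
The function name decode_7bit, the set names "GERMAN" and "ASCII", and the use
of Unicode strings as the result are assumptions made for this sketch.

```python
# German ISO 646 national replacement characters, taken from Table 1.
# Keys are the ASCII characters that occupy the same code positions.
GERMAN_NRC = {
    "@": "\u00a7",   # 04/00 section sign
    "[": "\u00c4",   # 05/11 A-umlaut
    "\\": "\u00d6",  # 05/12 O-umlaut
    "]": "\u00dc",   # 05/13 U-umlaut
    "{": "\u00e4",   # 07/11 a-umlaut
    "|": "\u00f6",   # 07/12 o-umlaut
    "}": "\u00fc",   # 07/13 u-umlaut
    "~": "\u00df",   # 07/14 ess-zet
}


def decode_7bit(data, file_character_set):
    """Interpret 7-bit file bytes as ASCII or as German ISO 646.

    file_character_set plays the role of SET FILE CHARACTER-SET; only the
    hypothetical values "ASCII" and "GERMAN" are handled in this sketch.
    """
    text = data.decode("ascii")
    if file_character_set == "GERMAN":
        return text.translate(str.maketrans(GERMAN_NRC))
    return text


# The same bytes mean different things under the two interpretations.
greeting = decode_7bit(b"Gr}~e aus K|ln", "GERMAN")
c_source = decode_7bit(b"{~a[x]}", "ASCII")
```

The bytes themselves carry no indication of which reading is intended, which
is why the choice has to come from the user.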
SPECIFYING THE TRANSFER CHARACTER SET:
To select Level 1, the user enters the command:
SET TRANSFER CHARACTER-SET <name>
Where <name> is the name of a standard character set other than ASCII. If the
name is TRANSPARENT, then Kermit does no character set conversion at all, but
it still may do text record format conversion. For ASCII-based systems, this
is equivalent to Kermit's normal, basic mode of operation.
If a name other than TRANSPARENT is given, and FILE TYPE is set to TEXT,
then Kermit will translate between the current file character set and the
named transfer character set during all packet operations. If the transfer
character set is ASCII, then Kermit converts between the current file
character set and 7-bit ASCII. This mode of operation is roughly equivalent
to Kermit's basic mode of operation on non-ASCII based systems like IBM
mainframes. If the local file character set contains accented characters, the
accents are dropped in the transfer character set, for example a-acute becomes
simply a. (But see SET LANGUAGE, described later.)
Other transfer character sets must be chosen from among approved national or
international standards. As a starting point, the sets shown in Table 2 are
recommended. The criteria for including a character set in this table are:
1. 7-bit ASCII (= ISO-646 International Reference Version, IRV) is included,
for compatibility with the original Kermit protocol and the hundreds of
programs that implement it.
2. An 8-bit single-byte character set, such as those in the ISO 8859 series,
if it is registered, as in (4) below, is included.
3. A multibyte character set may be included, if it is registered as in (4).
4. The set must be listed in the ISO International Register of Character Sets
under the provisions of ISO Standard 2375 (see Appendix A), so that it has
a unique registration number and designating escape sequence, in order that
the sending Kermit program can identify the character set to the receiving
Kermit program. Allowance is made for the possibility of other
registration authorities, should they appear.
5. The set must be a national or international standard graphic character
set, intended for use in computer text processing or programming (as
opposed to Videotex, Teletex, OCR, device control, or other applications).
This category may include line-drawing or technical character sets which
fit the other criteria.
Note in particular that the national variants of ISO 646 are not included,
since these are covered adequately by ASCII and the ISO Latin alphabets.
Standard "Kermit names" (for use with the SET TRANSFER CHARACTER-SET command)
are given to these character sets so that they may be referred to uniformly in
all Kermit implementations. These names are chosen to be mnemonic, so that
users don't have to remember cryptic designations like "ISO-8859-3". The
choice of single words like "CYRILLIC" implies that there will not be more
than one Level-1 transfer syntax for Cyrillic text. However, if these
standards change in the future, it will be possible to add further identifying
material to these names, e.g. "CYRILLIC2", "CYRILLIC3", etc.
The mnemonicity of the Kermit names is based upon English, as this is the
language of the standards themselves. The Kermit commands are English words,
and this document is written in English. Once we have solved the problem of
transferring non-English text files between unlike computers, we may begin to
consider the challenge of language-independent user interfaces and
documentation!
_____________________________________________________________________________
Table 2: Standard 8-Bit Character Sets
US 7-bit ASCII, English, Latin, Gaelic, German without umlauts or ess-zet, etc.
Kermit name: ASCII.
ISO Registration Number: 6.
Kermit Designator: none (this is the default transfer alphabet).
ISO 8859-1, Latin Alphabet 1, for Dutch, English, Faeroese, Finnish, French,
German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish,
and Swedish.
Kermit name: LATIN1.
ISO Registration Number: 100.
Kermit Designator: I6/100.
ISO 8859-2, Latin Alphabet 2, for Albanian, Czech, English, German, Hungarian,
Polish, Romanian, Serbocroatian (Croatian), Slovak, and Slovene.
Kermit name: LATIN2.
ISO Registration Number: 101.
Kermit Designator: I6/101.
ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto,
French, Galician, German, Italian, Maltese, and Turkish.
Kermit name: LATIN3.
ISO Registration Number: 109.
Kermit Designator: I6/109.
ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish, German,
Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
Kermit name: LATIN4.
ISO Registration Number: 110.
Kermit Designator: I6/110.
ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, English,
Macedonian, Russian, Serbocroatian (Serbian), and Ukrainian
(Compatible with USSR GOST Standard 19768-1987 and ECMA-113).
Kermit name: CYRILLIC.
ISO Registration Number: 144.
Kermit Designator: I6/144.
ISO 8859-6, the Latin/Arabic Alphabet.
Kermit name: ARABIC.
ISO Registration Number: 127.
Kermit Designator: I6/127.
ISO 8859-7, the Latin/Greek Alphabet.
Kermit name: GREEK.
ISO Registration Number: 126.
Kermit Designator: I6/126.
ISO 8859-8, the Latin/Hebrew Alphabet.
Kermit name: HEBREW.
ISO Registration Number: 138.
Kermit Designator: I6/138.
ISO DIS 8859-9, Latin Alphabet 5, in which six Icelandic letters from
Latin Alphabet 1 are replaced by six other letters needed for Turkish.
Kermit name: LATIN5.
ISO Registration Number: 148.
Kermit Designator: I6/148.
CSN 36 91 03, Czechoslovak Standard alphabet.
Kermit name: CZECH.
ISO Registration Number: 139.
Kermit Designator: I6/139.
JIS X 0201, a 1-byte code for Japanese Katakana, used in conjunction
with a slightly modified ASCII (backslash is replaced by Yen sign,
tilde by overbar).
Kermit name: KATAKANA.
ISO Registration Numbers: 14 (Roman), 13 (Katakana).
Kermit Designator: I14/13.
JIS X 0208, a 2-byte code containing Japanese Kanji, Katakana, Hiragana,
Roman, Greek, and Russian characters, plus special symbols, etc. All
characters in this set are displayed in double width, therefore it is
commonly used in conjunction with JIS X 0201 so that Katakana and Roman
characters and digits may be displayed in single width.
Kermit name: KANJI.
ISO Registration Number: 87.
Kermit Designator: M87.
Chinese Standard GB 2312-80, a 2-byte code for Chinese.
Kermit name: CHINESE.
ISO Registration Number: 58.
Kermit Designator: M58.
KS C 5601 (1987), a 2-byte code for Korean.
Kermit name: KOREAN.
ISO Registration Number: 149.
Kermit Designator: M149.
Table 2: Standard 8-Bit Character Sets
_____________________________________________________________________________
The ISO Latin alphabets and the Czech character set are 8-bit character sets
whose left half is identical with ASCII, and whose right half contains the
special characters. The ISO registration number refers only to the right half
of each of these character sets. But each of these sets must be used in its
entirety, because the unaccented Roman letters, the digits, and the
punctuation marks appear only in the ASCII left half, which is ALWAYS (unless
otherwise noted) US ASCII, ISO Registration Number 6. The Kermit
character-set name refers to the two halves combined as a single set.
A particular Kermit program need not incorporate all of these character sets.
In many cases, a single 8-bit character set will suffice, such as LATIN1 for
Western Europe, LATIN2 for Eastern European countries with Roman-alphabet
based languages, LATIN4 for Scandinavia, CYRILLIC for most of the USSR, etc.
When a language is representable in more than one character set from this
table, as are English, German, Finnish, Czech, Turkish, etc., the character
set highest on the list which adequately represents the language should be
preferred. More precisely, when a character set other than ASCII is to be
used in Kermit's transfer syntax, the ISO 8859 sets are preferred to other
registered sets which contain the same characters. Within the ISO 8859
family, lower-numbered sets which contain the characters of interest are
preferred to higher-numbered sets which contain the same characters.
This guideline maximizes the chance that any two particular Kermit programs
will recognize the same character sets.
For example, LATIN1 would be chosen for French, LATIN1 for German (because it
represents German better than ASCII), LATIN5 for Turkish (because it
represents Turkish better than LATIN3), KANJI or KATAKANA for Japanese
(because none of the ISO 8859 sets contain Japanese characters), etc.
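The preference rule can be approximated mechanically, as in the following
sketch. It treats "adequately represents" as "every character can be encoded",
which is a simplification, and it uses Python's built-in codecs as a stand-in
for a Kermit program's own tables. The function name and the restriction to
ASCII and the ISO 8859 sets are assumptions of the example, not part of the
proposal.

```python
# Table 2 order: ASCII first, then the ISO 8859 sets as listed there.
# The right-hand names are Python codec names, used only for illustration.
PREFERENCE = [
    ("ASCII", "ascii"),
    ("LATIN1", "iso8859_1"),
    ("LATIN2", "iso8859_2"),
    ("LATIN3", "iso8859_3"),
    ("LATIN4", "iso8859_4"),
    ("CYRILLIC", "iso8859_5"),
    ("ARABIC", "iso8859_6"),
    ("GREEK", "iso8859_7"),
    ("HEBREW", "iso8859_8"),
    ("LATIN5", "iso8859_9"),
]


def preferred_transfer_set(text):
    """Return the first Kermit name in Table 2 order that can hold text.

    Returns None when none of the listed single-byte sets is sufficient,
    for example for Japanese text.
    """
    for kermit_name, codec in PREFERENCE:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # some character is missing from this set; try the next
        return kermit_name
    return None
```

A purely mechanical test like this cannot judge how well a set suits a
language (the Turkish case above is one where a human would pick a later
set), so it is at best an aid to the user's choice.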
Unfortunately, but unavoidably, the burden of choosing the best transfer
syntax character set must be placed upon the user. If a file containing a
mixture of Finnish, English, and Danish must be transferred, the user must
find a character set that can adequately represent all three languages, in
this case Latin Alphabet 4. A table like Table 3 should be provided in the
user documentation to help the user make this selection.
_____________________________________________________________________________
Arabic      ARABIC                     Italian         LATIN1,3
Bulgarian   CYRILLIC                   Kanji           KANJI
Chinese     CHINESE                    Katakana        KATAKANA, KANJI
Czech       CZECH, LATIN2              Korean          KOREAN
Danish      LATIN4                     Latvian         LATIN4
Dutch       LATIN1,2,3,4               Lithuanian      LATIN4
English     ASCII,LATIN1,2,3,4,5,etc   Norwegian       LATIN1,4
Esperanto   LATIN3                     Polish          LATIN2
Estonian    LATIN4                     Portuguese      LATIN1
Finnish     LATIN1,4                   Romanian        LATIN2
Flemish     LATIN1,2,3,4,5             Russian         CYRILLIC
French      LATIN1,3,5                *Serbocroatian   LATIN2, CYRILLIC
German      LATIN1,2,3,4,5             Slovak          LATIN2
Greek       GREEK                      Spanish         LATIN1
Hebrew      HEBREW                     Swedish         LATIN1,4
Hungarian   LATIN2                     Turkish         LATIN5,3
Icelandic   LATIN1                     Ukrainian       CYRILLIC
Table 3: Preferred Transfer Syntax Character Sets
*If written in Cyrillic, this language is called Serbian. If written
in Roman letters, it is called Croatian.
_____________________________________________________________________________
Note that Table 3 is only a sample. To produce a comprehensive and definitive
table would require a team of language experts. The information in the
current table is based purely upon the claims made within the standards
themselves, in which there is no mention of languages like Albanian, Erse,
Farsi, Urdu, Welsh, Cornish, Manx, Inuit, Old Church Slavonic, Armenian,
Georgian, Tagalog, Swahili, Latin, Vietnamese, etc, nor definitions of
exactly what is meant by terms like "Greenlandic", "Irish", etc. Obviously,
it is the intention of this proposal to support any language for which a
computer character set can be standardized.
IMPLEMENTATION OF LEVEL 1
The Level-1 Kermit extension can be added to existing Kermit programs with
a minimum of effort. The following steps are required for each Kermit program:
1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't
have it already. SET FILE TYPE TEXT enables text-file character set
conversion at all levels. SET FILE TYPE BINARY disables conversions of
all kinds, but does not destroy the file and transfer character-set
selections (2 and 3 below), so that a subsequent SET FILE TYPE TEXT
command will still be able to use them.
2. Add the SET FILE CHARACTER-SET <name> command. The set of <names> should
include ASCII or EBCDIC (as appropriate, used for program source, etc) plus
the names of any "national" or special character sets that are used on this
particular computer.
3. Add the SET TRANSFER CHARACTER-SET <name> command. The set of <names>
should include TRANSPARENT and ASCII plus the names of one or more other
standard character sets from Table 2 which contain the characters from the
computer's local character set(s).
4. Add translation tables (or functions) between each pair of character sets
in (2) and (3). For each pair, two translation tables are necessary: one
from the local file character set to the transfer character set, and one
from the transfer set to the local one.
5. Add SHOW commands to let the user find out what character sets are
available, and which ones are currently selected, for the transfer syntax
and for local files. The exact syntax of this command will vary. In
some Kermit implementations, every SET command has a corresponding SHOW
command, in which case it will be possible to SHOW FILE CHARACTER-SET and
SHOW TRANSFER CHARACTER-SET. In others, related SET parameters are lumped
together into broader categories for purposes of SHOW, for example SHOW
FILE would show all file-related parameters; SHOW PROTOCOL would show all
protocol-related parameters.
Optionally, several additional related commands may be included:
6. The command SET LANGUAGE may be added to allow the program to apply
heuristics in the translation process that would not otherwise be possible
(see discussion of German below). The choices for SET LANGUAGE should
include ASCII (for transferring program source code, etc) plus the names of
any languages the particular Kermit implementation supports, such as
ITALIAN, NORWEGIAN, PORTUGUESE.
7. To allow for user-defined character-set translations, also add the LOAD
TRANSLATION-TABLE, SHOW TRANSLATION-TABLE, and DUMP TRANSLATION-TABLE
commands (described in the next section).
8. Once the new commands and translation tables are in place, it is simple to
add a TRANSLATE command, to translate a local file from one character set
to another. With this command, Kermit may be used as a character-set
conversion utility for local files.
LEVEL-1 EXAMPLE
To transfer a Finnish-language text file from a computer that uses the Finnish
ISO 646 national replacement set to an IBM PS/2, and to store the file using
the PS/2's Multilingual Code Page:
On the sending computer:             On the receiving computer:
  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET FILE CHARACTER-SET FINNISH       SET TRANSFER CHARACTER-SET LATIN1
  SET TRANSFER CHARACTER-SET LATIN1    SET FILE CHARACTER-SET CP850
  SEND filename                        RECEIVE
To transfer a C-language source program between the same two computers:
On the sending computer:             On the receiving computer:
  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET TRANSFER CHARACTER-SET ASCII     SET FILE CHARACTER-SET ASCII
  SET FILE CHARACTER-SET ASCII         SET TRANSFER CHARACTER-SET ASCII
  SEND filename                        RECEIVE
To emphasize the value of the SET LANGUAGE command, consider German text
containing the Ess-Zet character and vowels with umlauts. It is acceptable
to render Ess-Zet as "ss", and to render a vowel with umlaut as the same
vowel without an umlaut but followed by "e". But this should not necessarily
be done for languages other than German. The command SET LANGUAGE GERMAN
would allow the Kermit program to perform these functions when translating
from Latin-1 or German NRC into ASCII, so that "Gr<u-umlaut><ess-zet>e aus
K<o-umlaut>ln" would become "Gruesse aus Koeln" (correct German) rather than
"Gruse aus Koln" (Gruse means something entirely different from Gruesse --
something like "scum" rather than "greetings").
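A minimal sketch of the difference SET LANGUAGE GERMAN would make when the
transfer character set is ASCII. The rule tables and the function name
to_ascii are hypothetical, and Unicode decomposition is used here merely as a
convenient way to find the unaccented base letter; a Kermit program would
more likely use its own tables.

```python
import unicodedata

# Language-specific rules, applied only when the language is GERMAN.
GERMAN_RULES = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",   # a-, o-, u-umlaut
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",   # A-, O-, U-umlaut
    "\u00df": "ss",                                   # ess-zet
}

# Language-neutral fallbacks for letters that have no accent to strip.
NEUTRAL_RULES = {"\u00df": "s"}


def to_ascii(text, language=None):
    """Translate text to 7-bit ASCII, optionally using German heuristics."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # already ASCII
        elif language == "GERMAN" and ch in GERMAN_RULES:
            out.append(GERMAN_RULES[ch])        # e.g. u-umlaut -> "ue"
        elif ch in NEUTRAL_RULES:
            out.append(NEUTRAL_RULES[ch])
        else:
            # Drop the accent: keep the base letter of the decomposition.
            base = unicodedata.normalize("NFD", ch)[0]
            out.append(base if ord(base) < 128 else "?")
    return "".join(out)
```

With these rules the greeting comes out as "Gruesse aus Koeln" when the
language is GERMAN and as "Gruse aus Koln" otherwise, matching the two
renderings discussed above.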
TRANSLATION TABLES
In many cases, translation tables will be 1-for-1. That is, the two character
sets are the same size, and each character from one set can be found in the
other set. In such cases, the translation table need be only a list of
numbers, in which position "n" in the table contains the translation for
character number "n".
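The following sketch shows such a list-of-numbers table for the Latin-1 to
CP850 direction. The table is derived from Python's built-in "latin_1" and
"cp850" codecs purely for convenience; the function name build_table and the
choice of "?" for characters with no counterpart are assumptions of the
example.

```python
def build_table(src_codec, dst_codec, fallback=b"?"):
    """Build a 256-entry table: position n holds the translation of n."""
    table = bytearray(256)
    for n in range(256):
        try:
            ch = bytes([n]).decode(src_codec)
            table[n] = ch.encode(dst_codec)[0]
        except (UnicodeDecodeError, UnicodeEncodeError):
            table[n] = fallback[0]   # no counterpart in the other set
    return bytes(table)


LATIN1_CP850 = build_table("latin_1", "cp850")
CP850_LATIN1 = build_table("cp850", "latin_1")

# Translating a buffer is then a single table lookup per byte.
packet_data = b"caf\xe9".translate(LATIN1_CP850)
```

Because the lookup is indexed directly by character value, translation costs
one array access per character, which is the speed advantage of tables over
functions.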
In some cases, the two character sets will be the same size, but certain
characters from one will be lacking in the other, and/or vice versa. For
example, IBM Code Page 850 and the Apple Macintosh sets are both "8-bit
ASCII", but the IBM set lacks the Macintosh's Y-umlaut, and the Macintosh
set lacks IBM's Y-acute.
In other cases the character sets will be different sizes. We have long been
familiar with this problem when translating between 7-bit ASCII and 8-bit
EBCDIC. In Japan, there must be translations between single-byte Roman,
Greek, and Cyrillic characters and the two-byte JIS X 0208 character set.
It is recommended that translation tables built into Kermit programs be as
general and useful as possible, substituting the closest possible character
when an exact match is not available. For instance, when translating from
French Latin-1 or NRC to ASCII, accented letters should normally be
translated into the corresponding unaccented letters: a-acute becomes a, etc.
It is a matter of choice to the programmer whether translation be accomplished
by tables or by functions which implement translation algorithms, or a
combination of both. Functions provide maximum flexibility and tend to reduce
program size, at some cost in execution overhead. Tables provide greatest
speed, with generally greater cost in program size.
It is further recommended that the actual contents of each translation table
eventually be specified in this standard, and that translations be invertible.
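For 1-for-1 tables, invertibility can be checked mechanically: a 256-entry
table is invertible exactly when no two positions hold the same value. The
sketch below, with the hypothetical name invert_table, derives the reverse
table from a forward one and rejects tables that lose information.

```python
def invert_table(table):
    """Return the inverse of a 256-entry translation table.

    Raises ValueError if two source characters share a translation,
    since the original file could then not be recovered.
    """
    if len(table) != 256 or len(set(table)) != 256:
        raise ValueError("translation table is not invertible")
    inverse = bytearray(256)
    for n, translated in enumerate(table):
        inverse[translated] = n
    return bytes(inverse)


# Example: a toy table that only exchanges "A" and "B".
forward = bytearray(range(256))
forward[0x41], forward[0x42] = 0x42, 0x41
forward = bytes(forward)
backward = invert_table(forward)
```

A table that substitutes a single fallback character such as "?" for every
missing character fails this check, which illustrates the tension between
"closest possible character" translations and invertible ones.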
USER-DEFINED TRANSLATIONS
It should be possible for users to alter Kermit's translation tables or to add
new ones, without having to change the program's source code. For example, in
certain situations it might be preferable to have a-grave rendered in ASCII as
"a", but in others as "`a", "a`", or even "?". It is also possible that new
character sets will appear which are unknown to the Kermit program.
For these reasons, a standard format is suggested for translation tables,
together with a LOAD TRANSLATION-TABLE command to allow the user to add new
character sets to a Kermit program's repertoire, or to alter current
translations.
Each table within a program is assigned an arbitrary tablename. For example,
LATIN1-CP850 could be the name for the Latin-1 to CP850 table in the PS/2, and
CP850-LATIN1 could be the name for the table in the other direction. To load
a replacement table, the user would issue the command:
LOAD TRANSLATION-TABLE <tablename> <filename>
where <tablename> is the name to assign to the new table. If a table with
that name already exists, it is replaced. A suggested layout for a
loadable translation table is given in Appendix C.
A Kermit program, upon loading one of these files, would set up the
translation table, add the names of the table and of the character sets
themselves to the appropriate keyword tables, and so on.
So that the translation-table related commands can also be effective for
built-in translation tables, it is recommended that the built-in tables be
designed in the same format as the loadable tables.
Two additional commands should be furnished to allow the user to get
information about the currently loaded tables:
SHOW TRANSLATION-TABLE <tablename>
which would give summary information, and:
DUMP TRANSLATION-TABLE <tablename> <filename>
which would write out a translation table (even a built-in one) in the form
shown in Appendix C, so that it could be edited and loaded again.
ATTRIBUTE PACKETS AT LEVEL 1
The objective of Kermit's Level-1 extension is to accommodate as many
computers as possible with a minimum of programming effort. But this approach
places a burden on the user in the form of new commands and the confusion
which results if the user forgets to issue these commands.
Level 1 does not require support for Kermit File Attribute Packets, whose use
is negotiated in the Kermit Initialization exchange. But the user's burden
can be alleviated if the sending Kermit program uses an attribute field to
inform the receiving Kermit of the character set to be used in the transfer
syntax. The receiving program can accept or refuse the file based on whether
it supports the specified character set. If the receiving program refuses a
file, the user can override this refusal, for example, if a long file contains
only a word or two in an unknown character set. The most common user-override
is the command SET ATTRIBUTES OFF. However, this also disables other
desirable effects of attribute packets, such as prenotification of file size.
Therefore, it is desirable to let the user specify exactly which attributes
are to be "turned off", e.g. SET ATTRIBUTES CHARACTER-SET OFF.
When the transfer character set is ASCII (or TRANSPARENT when sent from an
ASCII-based system), the Encoding attribute alone will suffice, with a value
of "A" (for ASCII): "*!A".
In order for the sender to inform the receiver of transfer alphabets other
than ASCII, a new value for the Encoding attribute ("*") is defined, namely
"C", which is substituted for the normal value "A" (ASCII). "C" means that
the actual character set is specified by an operand that follows it. The
operand begins with a single letter that designates the character set
registration authority (e.g. I for ISO), followed by an identifier specific
to that registration authority, as in:
Ixxx/yyy
where the letter "I" (for ISO) is followed by a pair of ISO registration
numbers for the character set, xxx for the "left half" and yyy for the right,
expressed in decimal ASCII digits, for example:
+---+---+---+--------+
| * | ' | C | I6/100 |
+---+---+---+--------+
where "*" is the code for the Encoding attribute (or transfer syntax), "'" is
the length of its value (ASCII 39 - 32 = 7), and the value consists of the
single character "C", which means "I'm using the specified character set",
followed by the six characters "I6/100", which mean "ISO registration number
6", i.e. US ASCII, in the left half, and ISO registration number 100, which is
the right half of Latin-1, in the right. The "I" stands for ISO, and is
included to allow for the possibility of other character set registration
authorities. Designators for each character set are given in Table 2, labeled
"Kermit Designator".
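The field layout just described can be sketched in a few lines of Python.
This is only an illustration, not part of the proposal: the function names
are invented here, and the sketch handles just the "A" and "C" values of the
Encoding attribute.

```python
def tochar(n: int) -> str:
    """Kermit's printable encoding of a small number: add 32."""
    if not 0 <= n <= 94:
        raise ValueError("number does not fit in one printable character")
    return chr(n + 32)


def encoding_attribute(designator=None) -> str:
    """Build the "*" attribute: "A" for ASCII, or "C" plus a designator."""
    value = "A" if designator is None else "C" + designator
    return "*" + tochar(len(value)) + value


def parse_encoding_attribute(field: str):
    """Return the Kermit designator, or None when the value is plain "A"."""
    if not field.startswith("*") or len(field) < 3:
        raise ValueError("not an Encoding attribute")
    length = ord(field[1]) - 32
    value = field[2:2 + length]
    if len(value) != length:
        raise ValueError("attribute value is truncated")
    if value == "A":
        return None
    if value.startswith("C") and len(value) > 1:
        return value[1:]
    raise ValueError("unsupported Encoding value: " + value)


print(encoding_attribute())          # the ASCII case
print(encoding_attribute("I6/100"))  # ASCII left half, Latin-1 right half
```

The designator is carried as an opaque string here; the sketch does not
check it against Table 2.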
For self-contained ISO standard multibyte character sets, the Kermit
Designator starts with the letter "M", rather than "I", to indicate (a) that
it is a multibyte, rather than single-byte, set and (b) that there is no "left
half", i.e. "M" is always followed by a single ISO registration number.
In the event that a character set standard changes, but keeps the same
registration number, the registration number for the new character set should
be preceded by a non-numeric character which indicates the revision number: @
(atsign) = 1, A=2, B=3, and so on (as suggested in ISO 2022). For example
"I@6/B100" would indicate an 8-bit single-byte character set having Revision
1 of ASCII as its left half and Revision 3 of Latin-1 as its right. Note:
"Revision 1" does not mean the original version, but rather the first
revision AFTER the original version. The Kermit designator for an original
version does not have a revision indicator.
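To make the revision notation concrete, the following Python sketch splits a
designator into its registration numbers and revision levels. The names
(Half, parse_designator) are invented for illustration, the "M87" test value
is simply an arbitrary registration number, and the sketch covers only the
"I" and "M" forms described above.

```python
import re
from typing import NamedTuple


class Half(NamedTuple):
    registration: int
    revision: int  # 0 = original version, 1 = first revision, and so on


_HALF = r"([@A-Z]?)(\d+)"
_PATTERN = re.compile(rf"^(?:I{_HALF}/{_HALF}|M{_HALF})$")


def _revision(letter: str) -> int:
    # "@" = 1, "A" = 2, "B" = 3, ...; no letter means the original version.
    return 0 if not letter else ord(letter) - ord("@") + 1


def parse_designator(text: str):
    """Return (prefix, halves) for an Ixxx/yyy or Mxxx designator."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError("unrecognized designator: " + text)
    if text[0] == "I":
        left = Half(int(m.group(2)), _revision(m.group(1)))
        right = Half(int(m.group(4)), _revision(m.group(3)))
        return "I", (left, right)
    return "M", (Half(int(m.group(6)), _revision(m.group(5))),)


print(parse_designator("I6/100"))
print(parse_designator("I@6/B100"))
```

A parser like this is useful mainly for diagnostics and logging; a table
lookup on the whole designator string remains the simpler way to decide
whether a character set is supported.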
The form of the character-set designator was chosen because the standards
currently provide no single code to designate an 8-bit character set in its
entirety. Each half of the character set has its own registration number.
For example, ISO 8859-1 (Latin-1) is a single 8-bit character set, but
registration number 100 only refers to its right half. Registration number 6
denotes ASCII, which is used as the left half of all ISO 8859 character sets.
To promote maximum interoperability among extended Kermit programs, the
Kermit designator should be treated as a character string, to be looked up in
a small table, rather than as a flexible mechanism to be used for piecing
together character sets from an arbitrary assortment of left and right
halves. However, the Ixxx/yyy notation leaves open this possibility should
it become desirable at a later time.
In the event that a new class of registration numbers appears, for example, to
denote a single-byte 8-bit character set in its entirety rather than just its
left or right half, a different initial letter will be used in the designator,
even if the registration authority is the ISO. In the event that other
character-set registration authorities appear, they too can be assigned their
own unique Kermit designator prefixes (for example, "K" for Kermit Development
and Distribution), to avoid ambiguity from conflict of registration numbers.
For the present, standards organizations like ANSI and CCITT are not treated
as separate registration authorities, because their character sets are also
registered by the ISO. Should these organizations adopt character sets that
have no ISO counterpart, then special Kermit designator prefixes will be
assigned for them.
Based on the attribute information, the receiver may accept or reject the
file, using Kermit's normal attribute response mechanism. To accept, it puts
a "Y" as the first character of the data field of the acknowledgement to the
attribute packet. To refuse, it puts an "N" instead of a "Y", followed by
"*". If the file is refused in this manner, the sending Kermit should respond
by sending a "Z" (end-of-file) packet containing a "D" (for Discard) in its
data field.
The behavior of the receiving Kermit program when an unknown character set
is announced to it is governed by the command SET UNKNOWN-CHARACTER-SET.
SET UNKNOWN-CHARACTER-SET KEEP means that it should not reject the file, but
store it the best way it can (e.g. without translating any characters);
SET UNKNOWN-CHARACTER-SET DISCARD means that the file should be rejected.
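The accept/refuse exchange and the effect of SET UNKNOWN-CHARACTER-SET can be
sketched as follows. This is a simplified illustration: the table of
supported sets, the function names, and the packet representation as a
(type, data) pair are assumptions made for the example, not part of the
protocol definition.

```python
# Hypothetical table of designators this receiver can translate.
SUPPORTED_SETS = {"I6/100": "Latin-1"}


def attribute_reply(designator, unknown_policy, supported=SUPPORTED_SETS):
    """Data field of the receiver's acknowledgement to the attribute packet.

    designator is None when the sender announced plain ASCII ("A").
    unknown_policy is "KEEP" or "DISCARD" (SET UNKNOWN-CHARACTER-SET).
    """
    if unknown_policy not in ("KEEP", "DISCARD"):
        raise ValueError("policy must be KEEP or DISCARD")
    if designator is None or designator in supported:
        return "Y"
    if unknown_policy == "KEEP":
        return "Y"   # accept; the file is stored without translation
    return "N*"      # refuse, naming the Encoding attribute as the reason


def sender_response(reply: str):
    """What the sender does next: a Z packet with "D" after a refusal."""
    if reply.startswith("N"):
        return ("Z", "D")
    return None      # no special action; data transfer proceeds normally


print(attribute_reply("I6/100", "DISCARD"))
print(attribute_reply("I6/144", "DISCARD"))
print(sender_response("N*"))
```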
It is recognized that there are presently Kermit implementations in the USSR,
Japan, and elsewhere that use character sets other than the ones listed in
Table 2 in their transfer syntax, and/or sets that are not listed in the
International Register. It is recommended that these Kermit programs be
converted to use the recommended standard character sets, or if there is a
strong reason why this cannot be done, that these character sets be registered
with Kermit Development and Distribution at Columbia University.
LEVEL 1 PERFORMANCE
Level 1 can be used to transfer files containing special characters when
character-set switching is not required. However, Level-1 transfer will not
always be efficient. Since the special characters have their 8th bits set to
one, there will be a lot of 8th-bit prefixing in the 7-bit environment -- the
higher the proportion of special characters to ASCII characters, the lower the
efficiency. For a language like Russian, in which all letters come from the
right half of the character set, efficiency will be poor in the 7-bit
environment.
Therefore, even though Russian (and Greek, Hebrew, and Arabic) are served by
Level 1 of this proposal, files encoded in these character sets can be
transferred more efficiently using the facilities of Level 2 in the 7-bit
communication environment. See Table 4.
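The difference in efficiency can be estimated with a rough count. The Python
sketch below compares the number of characters needed for 8th-bit prefixing
with the number needed for locking shifts. It is a back-of-the-envelope model
only: it assumes each locking shift costs two characters (a control prefix
plus the shift character), and it ignores designating escape sequences,
repeat counts, and the quoting of other control or prefix characters.

```python
def prefixing_cost(data: bytes) -> int:
    """Characters sent when every 8-bit byte gets an 8th-bit prefix."""
    return len(data) + sum(1 for b in data if b & 0x80)


def locking_shift_cost(data: bytes, shift_cost: int = 2) -> int:
    """Characters sent when runs of 8-bit bytes are bracketed by shifts."""
    shifts = 0
    in_right_half = False
    for b in data:
        right = bool(b & 0x80)
        if right != in_right_half:
            shifts += 1
            in_right_half = right
    if in_right_half:
        shifts += 1  # final shift back to the left half
    return len(data) + shifts * shift_cost


russian = "\u0440\u0430\u0437\u043e\u0447\u0430\u0440\u043e\u0432\u0430\u043d\u043d\u044b\u0439".encode("iso8859_5")
french = "d\u00e9\u00e7u".encode("latin-1")

for name, data in (("Russian", russian), ("French", french)):
    print(name, prefixing_cost(data), locking_shift_cost(data))
```

In this model the Russian word is cheaper with locking shifts, while the
French word, which alternates between halves, is cheaper with prefixing.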
In the future, a separate proposal will address this problem in a general
way, independent of transfer syntax, by specifying a locking shift mechanism
to be used in Kermit's packet encoding.
EXPANDED SYNTAX, LEVEL 2: MULTIPLE CHARACTER SETS
DO NOT PAY MUCH ATTENTION TO THIS SECTION. IT WILL PROBABLY NEVER BE
IMPLEMENTED IN ITS PRESENT FORM BECAUSE COMPUTER FILE SYSTEMS SIMPLY DO NOT
EXIST THAT ALLOW MIXTURES OF CHARACTER SETS WITHIN A SINGLE FLAT FILE. IN
ANY CASE, THE EMERGING ISO-10646 AND UNICODE STANDARDS MAY RENDER THIS
DISCUSSION OBSOLETE ANYWAY.
Suppose there is a computer that can store a file containing characters from
many languages. It may do this by using a multibyte character code, or by
imbedding some kind of control information in the file to mark each change of
character set.
One such computer is the Xerox Star and its successors, described by Joseph D.
Becker in the Scientific American articles "Multilingual Word Processing"
(July 1984) and "Arabic Word Processing" (July 1987). The Star stores textual
data intermixed with special codes. A byte of all 1's means "alphabet shift",
followed by another byte or two to identify the alphabet.
Another, more limited, example is a computer using one of the AT&T Extended
UNIX Codes (EUC), such as JAE for Japan. In this code, a byte with its high
order bit set to zero is ASCII. If it is set to one, then it is either a
1-byte Kanji or 1-byte Katakana character, or (if it has a certain special
value) a shift character indicating that the next two bytes are a Kanji
character. (See N. Takahashi and W. Krone, "The Language Problem", UNIX
Review, February 1987.)
A third example is an IBM PC or PS/2 running a commercial word processor which
uses the PC's graphics adapter to display characters from different alphabets
(Roman, Greek, etc) in different renditions (bold, italic, underlined). A
multilanguage word processor file may contain not only alphabet information,
but also formatting and rendition information. The format of these word
processor files is proprietary, and differs from product to product.
A final example is a simple IBM PC "8-bit ASCII" text file which also
contains the PC's line-drawing characters. These characters have no
equivalents in the ISO Latin alphabets, and so at least two standard
character sets would be required to represent these files during
transmission.
Now suppose we want to transfer a multi-language text file from one computer
to a different kind of computer. Since there will be a growing need to do
this, and a growing number of computers and applications that will support
multi-language text in incompatible ways, it is clearly impractical to require
each computer to know the formats and codes of each other computer.
Once again, a standard common intermediate representation, or transfer syntax,
is required so that each Kermit program need only know the codes and formats
used on its own computer, plus the transfer syntax. But unlike Kermit's
normal transfer syntax, and unlike Kermit's Level-1 extended transfer syntax,
the multi-language syntax must embody an in-band mechanism for identifying
character sets and switching among them.
Fortunately, these mechanisms are already well-defined in the host-terminal
communications environment, and they can be readily adapted to Kermit file
transfer. The mechanisms proposed are defined in the following international
standards:
ISO 4873, "Information Processing - ISO 8-bit code for information
interchange - Structure and rules for implementation"
ISO 2022, "Information Processing - ISO 7-bit and 8-bit coded character
sets - Code extension techniques"
ISO 2375, "Data Processing - Procedure for registration of Escape Sequences"
ISO "International Register of coded Character Sets to be Used with Escape
Sequences"
These standards are summarized in Appendix B, "How the Standards Work". The
following discussion assumes familiarity with these standards, so consult the
appendix if necessary.
KERMIT MULTI-CHARACTER-SET FILE TRANSFER
Level 2 Kermit syntax is intended for transferring multilanguage files that
cannot be adequately represented in a single standard character set. The new
"international" transfer syntax preannounces character sets by their ISO
registration numbers, designates them by their registered escape sequences,
and invokes them by single or locking shifts as defined in ISO Standard 2022.
ENABLING LEVEL-2 TRANSFER SYNTAX
Level-2 transfer syntax is selected when the user issues the command SET
TRANSFER INTERNATIONAL. The user may return to Levels 0 or 1 using the SET
TRANSFER CHARACTER-SET command, or SET TRANSFER NORMAL (See Appendix F).
PROTOCOL PRIOR TO DATA TRANSFER
It is strongly recommended that any Kermit program which is to use
international syntax also support file attribute (A) packets. These are used
for two purposes: (1) to inform the receiver that international syntax will
be used and with which ISO-2022 facilities, and (2) to preannounce the file's
character sets. This will give the receiver the opportunity to refuse files
that it cannot translate, and to allocate the necessary resources for those
files which it can accept.
In Level 2, the value of the encoding attribute "*" should be the uppercase
letter "I" (for International), optionally followed by one or more ISO-2022
announcer letters (the letter after <ESC><SP>), as listed in Appendix B, for
example "IBZ" to declare that G0 and G1 will be used with locking shifts, and
G2 with single shifts.
+---+---+-----+
| * | # | IBZ |
+---+---+-----+
In addition, the sender may (but is not required to) preannounce the transfer
syntax character sets by listing them in the new attribute, "2", "Character
Sets". The value of the character-sets attribute is a comma-separated list of
Kermit character set designators. For example:
+---+---+----------------------+
| 2 | 4 | I6/100,I6/127,I6/144 |
+---+---+----------------------+
where "2" is the character-set attribute, "4" is the length of the following
value (in this case "4" = ASCII 52 - 32 = 20) and the next 20 bytes list ISO
character set numbers 100, 127, and 144, each number prefixed by "I" to
denote the ISO registration authority, and "6/" to indicate that the left half
of the character set is ASCII.
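As an illustration of how the two Level-2 attribute fields are assembled, here
is a short Python sketch. The function names are invented for the example,
the announcer letters and designators are passed in as opaque strings, and
the designators shown use registration number 6 for the ASCII left half, as
in the Level 1 examples.

```python
def tochar(n: int) -> str:
    """Kermit's printable encoding of a small number: add 32."""
    if not 0 <= n <= 94:
        raise ValueError("number does not fit in one printable character")
    return chr(n + 32)


def attribute(kind: str, value: str) -> str:
    """One attribute field: type character, length character, value."""
    return kind + tochar(len(value)) + value


def level2_attributes(announcers: str = "", designators=()) -> str:
    """Encoding attribute "I" plus announcers, then the optional "2" list."""
    fields = attribute("*", "I" + announcers)
    if designators:
        fields += attribute("2", ",".join(designators))
    return fields


print(level2_attributes("BZ"))
print(level2_attributes("BZ", ["I6/100", "I6/127", "I6/144"]))
```

Because the length is a single character, the list of designators in one
field is limited to 94 characters in this sketch.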
If the sending Kermit can ascertain a file's character sets easily, it should
send this information in the attribute packet. Otherwise preannouncement of
character sets could require a time-consuming scan through the file prior to
sending, which is undesirable for large files not only because it reduces
Kermit's efficiency, but also because it could cause the entire Kermit session
to time out. Therefore, preannouncement of character sets is not required.
The receiver may accept or refuse a file using Kermit's normal attribute reply
mechanism. When accepting the file, it should include, at minimum, the "*"
attribute in its acceptance, so that the sender will know that the receiver
understands international syntax. When refusing a file, it should indicate
what caused the problem: "*" means it can't do international transfer syntax,
but "2" (without "*") means that one or more of the announced character sets
are unknown.
If the file is refused in this manner, the sending Kermit can issue an
informative message to the user, and the user can find some other way to
transfer the file (for example, binary mode, or normal text mode with pre- and
postprocessing, or even by loading a new translation table).
DATA TRANSFER PROTOCOL
Transfer of a multi-character-set text file in international transfer syntax
by Kermit is similar to transfer of a 7-bit ASCII text file, except that it
may contain embedded control characters and escape sequences to identify and
switch between character sets. The file sender translates the file's
characters (if necessary) into one or more registered alphabets, imbedding
character-set designation and shifting codes in the data stream, and
terminates lines of text (records) with CRLF as in ASCII text mode. The file
receiver translates from international transfer syntax into the format
demanded by the local system or application. All of this occurs before Kermit
packet encoding by the sender, and after Kermit packet decoding by the
receiver.
ISO 2022 states that "at the beginning of information interchange, except
where the interchanging parties have agreed otherwise, all designations shall
be defined by use of the appropriate escape sequences, and the shift status
shall be defined by the use of the appropriate locking-shift functions."
Kermit programs should "agree otherwise" that the default G0 character set is
the US-ASCII/ISO-646-IRV (International Reference Version) 7-bit character
set; thus international transfer syntax can be identical to Normal Kermit
transfer syntax when transferring 7-bit text files. There are no defaults for
G1, G2, or G3, in the interest of fairness to all countries and peoples.
When the text contains characters outside the ASCII range, an escape sequence
from Table 5 must be issued, designating the alphabet to which they belong
(using the identification letters shown in Table 5) to the desired
intermediate character set G0, G1, G2, or G3. This sequence must be given
before the first occurrence of a character in that alphabet. If no such
sequence is given, then the receiver treats all characters as ASCII data,
including <ESC>, the shift characters, and bytes with their 8th bits set to
one. In other words, the file transfer behaves in the normal Kermit fashion
for text files.
Since ISO 8859 character sets are subject to revision from time to time, an
alphabet selector may be preceded by <ESC>&F, where F is the revision number
(@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet
Number One, Revision One. (This information is from ISO 2022 6.3.13.)
ISO 2022 escape sequences are inserted into the data, and are
indistinguishable by the Kermit packet encoder/decoder from the data itself.
Therefore these escape sequences may be broken across packets, just as any
other data may be.
UNKNOWN ALPHABETS
It is not required that the sender preannounce all of a file's character sets
prior to transfer. Suppose a file contains a mixture of alphabets, some known
to the receiver, others not. At some point, an alphabet designator arrives
which the receiving Kermit does not recognize. Should the receiving Kermit
cancel the file transfer, or accept the unknown code? A new command is
provided to let the user control what happens in this situation:
SET UNKNOWN-ALPHABET {KEEP, CANCEL}.
If the user elects CANCEL, then the receiver will behave as if the user
had manually cancelled the file, i.e. it will put the character "X" in the
data field of its next acknowledgement, and the sender (assuming it supports
this feature) will stop sending the file.
If the user elects KEEP, the file will be accepted in its entirety. But the
unknown code should be marked in case the user wants to fix it afterwards. To
do this, the receiving program accepts the designator for the unknown alphabet and
stores it in the file as data, with subsequent characters stored untranslated.
When the unknown character set is shifted out of (or the end of file arrives),
the receiving Kermit stores the ISO-2022 Coding Method Delimiter, <ESC>d, and
resumes translation. If the unknown alphabet is shifted back into, the
designating escape sequence is stored again, and the process resumes.
A list of the designators of the unknown alphabets should be recorded in the
transaction log (if there is one), for later reference.
The default behavior should be "KEEP". This command should also be effective
at Level 1, where it would simply prevent the receiving Kermit from refusing
a file on the basis of the character set used to transfer it.
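The KEEP and CANCEL behavior might be sketched as below. This is a heavily
simplified model, offered only to illustrate the marking scheme: it assumes
the incoming data has already been split into (designating escape sequence,
raw bytes) segments, and the names store_segments and translators are
invented for the example.

```python
ESC_D = b"\x1bd"  # ISO 2022 coding method delimiter, <ESC>d


def store_segments(segments, translators, policy):
    """Return (stored_bytes, unknown_designators).

    segments    -- list of (designator, raw_bytes) pairs, in file order
    translators -- maps a known designator to a function raw -> local bytes
    policy      -- "KEEP" or "CANCEL" (SET UNKNOWN-ALPHABET)
    On CANCEL the first unknown designator ends processing and None is
    returned in place of the data; the caller would then put "X" in the
    data field of its next acknowledgement.
    """
    if policy not in ("KEEP", "CANCEL"):
        raise ValueError("policy must be KEEP or CANCEL")
    stored = bytearray()
    unknown = []
    for designator, raw in segments:
        translate = translators.get(designator)
        if translate is not None:
            stored += translate(raw)
        elif policy == "CANCEL":
            return None, [designator]
        else:
            # Keep the designator as data, store the text untranslated,
            # and mark the end of the unknown run with <ESC>d.
            stored += designator + raw + ESC_D
            if designator not in unknown:
                unknown.append(designator)
    return bytes(stored), unknown


# Example: Latin-1 in G1 is known; "<ESC>-X" stands for an unknown set.
translators = {b"\x1b-A": lambda raw: bytes(b | 0x80 for b in raw)}
segments = [(b"\x1b-A", b"d"), (b"\x1b-X", b"abc"), (b"\x1b-A", b"v")]
print(store_segments(segments, translators, "KEEP"))
```

The returned list of unknown designators is what would be written to the
transaction log.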
LOCAL FILE REPRESENTATION
This proposal assumes nothing about the representation of the file on the
local storage medium. It may be ASCII, EBCDIC, a proprietary word processor
format, IBM code page, or anything else. It is an implementation "detail" for
the Kermit programmer to convert between the local file representation for
multi-alphabet text files and Kermit's file transfer syntax.
In some cases, the file itself (or its directory entry) might contain the
necessary identifying information, in which case the sending Kermit program
can automatically emit the appropriate escape sequences during file transfer.
In others, the user will have to tell the sending program how the file is
encoded. The suggested command is:
SET FILE TYPE <xxx>
where <xxx> specifies how the file is (or when receiving, is to be) encoded on
disk. This will necessarily be highly dependent on the system's conventions,
or the conventions of the applications to be used with the file (e.g. a
multi-language word processing program). Possibilities for <xxx> might
include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE,
ALEPH-BET, PC-HANGUL.
BREAKING THE RULES
If the local file is not encoded according to ISO 2022 rules, it may contain
<ESC>, <SO>, and <SI> characters. It is up to the Kermit program to know
what these characters mean in the context of the file's format, and to either
strip them from the file or translate them to something else. The ISO 2022
rules forbid the use of these characters as data to be transferred.
If a file is to be transferred using international syntax, and it contains
any of the control characters significant to this syntax, namely <ESC>, <SI>,
<SO>, <SS2>, or <SS3>, then such characters should be prefixed during
transmission with Datalink Escape, <DLE>, C0 character 01/00 (Control-P).
Furthermore, if <DLE> itself occurs in the data, it should also be prefixed
with <DLE>.
All shifting and escape characters are subject to normal Kermit encoding
rules. Therefore, if a file contains an <SO> character, it must be sent as
<control-prefix><P><control-prefix><N>, normally "#P#N". If it contains an
<SS2>, and it is to be transmitted in the 7-bit environment, the encoding will
be:
<control-prefix><P><8th-bit-prefix><control-prefix><N>
(normally "#P&#N") -- That is, five characters to transmit one!
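The two stages (DLE prefixing in the transfer syntax, then ordinary Kermit
packet encoding) can be illustrated with the following Python sketch. It
assumes the usual prefix characters "#" and "&", and it omits repeat counts
and everything else about packet construction; the function names are
invented for the example.

```python
CTL_PREFIX = "#"
EIGHTH_BIT_PREFIX = "&"
DLE = 0x10
# <ESC>, <SO>, <SI>, <SS2>, <SS3>, and <DLE> itself
SIGNIFICANT = {0x1B, 0x0E, 0x0F, 0x8E, 0x8F, DLE}


def dle_escape(data: bytes) -> bytes:
    """Prefix each control significant to international syntax with <DLE>."""
    out = bytearray()
    for b in data:
        if b in SIGNIFICANT:
            out.append(DLE)
        out.append(b)
    return bytes(out)


def kermit_encode(data: bytes, eighth_bit_prefixing: bool) -> str:
    """Apply Kermit control and (optionally) 8th-bit prefixing."""
    out = []
    for b in data:
        if eighth_bit_prefixing and b & 0x80:
            out.append(EIGHTH_BIT_PREFIX)
            b &= 0x7F
        low = b & 0x7F
        if low < 0x20 or low == 0x7F:
            out.append(CTL_PREFIX + chr(b ^ 0x40))  # controlify
        elif low == ord(CTL_PREFIX) or (
            eighth_bit_prefixing and low == ord(EIGHTH_BIT_PREFIX)
        ):
            out.append(CTL_PREFIX + chr(b))         # quote a prefix character
        else:
            out.append(chr(b))
    return "".join(out)


print(kermit_encode(dle_escape(b"\x0e"), False))  # a literal <SO>
print(kermit_encode(dle_escape(b"\x8e"), True))   # a literal <SS2>, 7-bit path
```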
Also note that a file containing data that happens to correspond to a
character-set designator (e.g. "<ESC>-X") could confound later efforts at
reconstruction when SET UNKNOWN-ALPHABET KEEP is in effect. In this
circumstance, this character sequence can be distinguished from genuine
alphabet designators that were inserted into the file by the SET
UNKNOWN-ALPHABET KEEP feature in one of two ways: (1) by lacking a matching
<ESC>d (coding method delimiter), or (2) by not being listed in the Kermit
transaction log.
LEVEL-2 PERFORMANCE
Kermit programs may use the full range of ISO 2022 code extension techniques,
including use of G0, G1, G2, and G3 in both the 7-bit and 8-bit environments,
with both single-byte and multibyte character sets. In the general case, G0
will be used for ASCII and English, G1 for the special characters of the
"native language" of the local country or region, G2 for a third language, and
G3 for a fourth. Additional character sets may be swapped in and out of G2
and G3 as required.
Transmission of 8-bit data in the 7-bit environment is accomplished by Kermit
using 8th-bit prefixing, which is an optional feature of the Kermit protocol.
However, most popular implementations of Kermit do include this feature. If a
Kermit program cannot do 8th-bit prefixing, then it must operate in the ISO
2022 7-bit environment, shifting GL among the intermediate graphics sets
G0-G3.
If the Kermit program can do 8th-bit prefixing, the choice of the ISO 2022
7-bit or 8-bit environment is entirely independent of the communication
channel. Selection of the ISO 2022 7-bit or 8-bit environment should be made
on other grounds, such as transmission efficiency or program simplicity. For
example, if the ISO 2022 8-bit environment is used on a 7-bit channel, then
Kermit will have to do 8th-bit prefixing.
On a 7-bit communication channel, the best choice of ISO 7-bit or 8-bit
environment depends on the nature of the data to be transferred. If there is
little or no 8-bit data (as in English text), it doesn't matter. If there is
frequent shifting between 7-bit and 8-bit characters (as in French or
Portuguese), then single shifts would tend to be more efficient than locking
shifts, and Kermit's 8th-bit prefixing is equivalent to a single shift.
Therefore, use the ISO 8-bit environment and let Kermit do the prefixing. If
there are long strings of 8-bit characters, as in "right-sided" languages
like Russian, Greek, Arabic, and Hebrew, then locking shifts are more
efficient -- use the ISO 7-bit environment.
In Japan, many computer systems use at least three character sets, Roman
(close to ASCII), Katakana (a 1-byte code), and Kanji (a 2-byte code). Kanji
is specified in JIS X 0208, which also includes Roman, Hiragana, Katakana, and
some other character sets, but these are double width and not normally used.
Roman characters are usually taken from the left half of JIS X 0201, and
Katakana from the right half. Japanese text frequently shifts between Roman,
Kana, and Kanji, and therefore requires three active character sets, for
example G0 (Roman), G1 (Kana), and G2 or G3 (Kanji). In the 8-bit
environment, data transfer can be quite efficient: locking shifts are used to
shift GL between Roman and Kana, and any bytes with the 8th bit set to one
automatically invoke Kanji in GR as a multi-byte character set. In the 7-bit
environment, locking shifts would also be used to select Kanji. Note that
locking shifts are more efficient in this case than Kermit 8th-bit prefixing
because Kanji characters consist of more than one byte, and tend to occur in
runs. For Japanese, therefore, it is better to use the ISO 7-bit environment
on a 7-bit communication channel.
The situation is summarized in Table 4.
_____________________________________________________________________________
                           ISO 2022 Environment
                   7-bit                          8-bit
      +------------------------------+-----------------------------+
      | Recommended for right-       | Recommended for 2-sided     |
7-bit | sided languages like Greek,  | languages like French,      |
data  | Russian, Arabic, Hebrew.     | German, etc. Use Kermit's   |
path  | Use ISO 2022 locking shifts. | 8th-bit prefix for special  |
      | Also for Japanese.           | characters.                 |
      +------------------------------+-----------------------------+
      | No reason to use ISO 7-bit   | Clear transmission of 8-bit |
8-bit | environment on a clear 8-bit | characters. Use for both    |
data  | communication channel.       | left- and right-sided       |
path  | OK for 7-bit ASCII, though.  | languages, and Japanese.    |
      |                              |                             |
      +------------------------------+-----------------------------+
Table 4: Selecting ISO 7- vs 8-Bit Environment
_____________________________________________________________________________
The user should have control over whether the ISO-2022 7-bit or 8-bit
environment is used. To allow this, the command SET TRANSFER
INTERNATIONAL may be extended as follows:
SET TRANSFER INTERNATIONAL [ {7, 8} ]
which means that an optional final field may be included to specify the
7- or 8-bit ISO-2022 environment. The default should be 8, since it is the
most efficient method in most cases.
If Kermit -- at all levels -- offered locking shifts in addition to single
shifts, then international syntax could always proceed in the 8-bit
environment, and this would simplify implementation considerably. A proposal
on locking shifts for Kermit is forthcoming.
FILE TRANSFER SYNTAX EXAMPLES
A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner
for text files, without any escapes or shifts, even in international syntax.
A text file containing characters from a language or languages covered by a
single alphabet other than ASCII can be transferred exactly like an ASCII text
file, except that the attribute, if used, would denote the character set, e.g.
"*'CI6/100" for Latin-1. In the 7-bit environment, international syntax can
be used to cut down on Kermit's 8th-bit prefixing overhead, in which case the
attributes might look like "*#IBJ2&I6/144", and any strings of GR characters
would be preceded by LS1 and transmitted with their high-order bits set to
zero.
A multi-character-set text file will require an escape sequence to identify
each alphabet. The attribute packet would show international encoding,
optionally including the ISO 2022 facilities announcers, and the character
sets, as in "*#ICK2-I6/100,I6/144".
In the 7-bit environment, <SO> and <SI> are used to shift between the G0 and
G1 sets. In the absence of any specific designators, the G0 set is presumed
to be ASCII. Example:
A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".
In this case, the only extended character is the umlaut-a in "gefaehrlich"
(where ae is how to write umlaut-a without an umlaut). <ESC>-A designates
Latin-1 into G1, <SO> shifts GL out to G1, "d" is the left-half equivalent
of umlaut-a, and <SI> shifts GL back in to G0.
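A sender-side sketch of this 7-bit encoding, written in Python, is shown
below. It is an illustration under several assumptions: the local text is
available as a Unicode string, a single right-half set (Latin-1 by default)
is designated to G1, and control characters are rejected rather than
DLE-prefixed. The function name and parameters are invented for the example.
The optional explicit_start flag begins the text with explicit designations
and a shift into G0 instead of designating G1 on first use.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"


def encode_7bit(text, g1_codec="latin-1", g1_designator=b"-A",
                explicit_start=False):
    """Encode text for the ISO 2022 7-bit environment using SO and SI."""
    out = bytearray()
    designated = False
    if explicit_start:
        # Designate ASCII to G0 and the right-half set to G1, then shift in.
        out += ESC + b"(B" + ESC + g1_designator + SI
        designated = True
    in_g1 = False
    for byte in text.encode(g1_codec):
        if byte < 0x20 or 0x7F <= byte < 0xA0:
            raise ValueError("control characters are outside this sketch")
        right_half = byte >= 0xA0
        if right_half and not designated:
            out += ESC + g1_designator   # designate just before first use
            designated = True
        if right_half != in_g1:
            out += SO if right_half else SI
            in_g1 = right_half
        out.append(byte & 0x7F)          # send the 7-bit equivalent
    if in_g1:
        out += SI
    return bytes(out)


print(encode_7bit('A dangerous German word is "gef\u00e4hrlich".'))
```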
For clarity and consistency with the ISO-2022 recommendations, it is
recommended that the text begin with explicit character set designations, and
then explicitly shift into the G0 set, rather than defaulting to it:
<ESC>(B<ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".
A text file containing characters from multiple ISO 8859 alphabets requires a
designation sequence for each alphabet. In the 7-bit environment, SO and SI
can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B
can be used to select G0 of any of the alphabets, since these are all the
same. For example, the following text contains the same word in English,
French, and Russian:
<ESC>-A<SI>Disappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.
The first escape sequence assigns Latin Alphabet No. 1 to G1, and the
subsequent <SO> and <SI> shifts apply to its G0 and G1 set, which is used to
form the English and French words. The second escape sequence assigns the
Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this
new set.
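The receiving side of such a 7-bit stream can be sketched in Python as
follows. The table of designators is a deliberately tiny, hypothetical
subset (Latin-1 and Latin/Cyrillic only), the Python codec names stand in
for whatever the local representation happens to be, and single shifts, G2,
G3, and DLE prefixing are all left out.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Hypothetical lookup: G1 designation -> Python codec for the 8-bit set.
G1_CODECS = {b"-A": "latin-1", b"-L": "iso8859_5"}


def decode_7bit(data: bytes) -> str:
    """Decode a 7-bit ISO 2022 stream that uses G0 = ASCII and one G1 set."""
    out = []
    g1_codec = None
    shifted_out = False
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            seq = data[i + 1:i + 3]
            if seq == b"(B":
                pass                      # G0 = ASCII, already the default
            elif seq in G1_CODECS:
                g1_codec = G1_CODECS[seq]
            else:
                raise ValueError("unknown designation: %r" % seq)
            i += 3
            continue
        if b == SO:
            shifted_out = True
        elif b == SI:
            shifted_out = False
        elif shifted_out and b >= 0x20:
            if g1_codec is None:
                raise ValueError("shift to G1 before any designation")
            out.append(bytes([b | 0x80]).decode(g1_codec))
        else:
            out.append(chr(b))
        i += 1
    return "".join(out)


sample = (b"\x1b-A\x0fDisappointed, d\x0eig\x0fu, "
          b"\x1b-L\x0e`PW^gP`^RP]]kY\x0f.")
print(decode_7bit(sample))
```

Note that the decoder keeps the shift state when a new set is designated to
G1, so a locking shift stays in effect across a change of alphabet.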
Another 7-bit example, in which the same word is repeated in English,
Russian, and German, shows how a locking shift remains in effect when the
alphabet is changed. We begin in Latin/Cyrillic, start with an English word
from G0, shift to G1 for the Russian word, and while still in G1 switch to
Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of
Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest
of the word:
<ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.
Some rules and hints to remember:
1. In the 8-bit communication environment, always use 8-bit character
transmission -- it's more efficient.
2. There can be no more than four character sets designated at one time.
Generally designate ASCII to G0, the most frequently-used non-ASCII set
to G2, less frequently used sets to G3 and G1. If a file has more than
four sets, swap the least frequently used sets in and out of G3 and G1.
3. Single shifts can only be used with G2 and G3. This is why G2 and G3
are preferred to G1.
4. Only two character sets can be invoked at once in the 8-bit communication
environment, and only one in the 7-bit environment.
SPECIAL EFFECTS
Today, most multi-alphabet files are produced by proprietary text processing
programs. These programs have many functions besides switching among
alphabets. They may also endow text with special attributes such as boldface,
italic, underline, super- or subscript, color, etc, and render characters in a
variety of type styles and sizes. Each text processing program may have its
own unique formats and conventions.
These special effects are not addressed by this proposal. Nevertheless, it is
likely that a multi-alphabet file produced by a text processing program also
contains special effects. In order for a Kermit program to send a
multi-alphabet file, it must have detailed knowledge of the file's format and
coding conventions. Therefore, the Kermit program should be able to strip out
the special effects, and send only the text. Otherwise the result would be
meaningless when received on an unlike system or for use with a different
application. (When transferring such files between like systems or compatible
applications, Kermit binary mode transfers will suffice.)
At some future time, it might be possible to adapt one of the popular document
description languages to the Kermit protocol, so that Kermit will be able to
transfer formatted documents between unlike systems and applications.
Presently, there are many competing would-be standards including IBM DCA and
DIA, DEC DDIF, US Navy DIF, and PostScript. There are also two ISO standards
emerging in this area: Standard Generalized Markup Language (ISO 8879, 9069,
and 9573), and Office Document Architecture (ISO 8613). This is an area for
further study.
TERMINAL EMULATION
While not part of the Kermit file transfer protocol, terminal emulation is a
feature of many Kermit programs. It is hoped that these terminal emulators
will evolve along the lines of the ISO standards mentioned above. In some
cases, this is already a fact, insofar as DEC VT300 series terminals already
follow these standards and Kermit programs are beginning to emulate these
terminals.
Kermit should be as easy to use as possible, but should still give the user
the ability to specify exactly what character codes are in use for both
terminal emulation and file transfer. There should also be a consistent set
of commands for all Kermit programs.
APPENDIX A: STANDARDS
ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for
Information Interchange" (US ASCII), is the 7-bit code currently used by
Kermit for transferring text files.
ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character
Sets for Information Interchange", gives us a 7-bit character set equivalent
to ASCII with provision for substituting "national characters" in selected
positions.
ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for
Information Interchange - Structure and Rules for Implementation", defines
8-bit character sets, their graphic and control regions, and how to extend
an 8-bit character set by using multiple intermediate graphics sets.
ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit
Coded Character Sets - Code Extension Techniques", describes how to use
8-bit character sets in both 7-bit and 8-bit environments, and how to switch
among different character sets and alphabets.
ISO International Register of Coded Character Sets to be Used with Escape
Sequences. This is the source of the ISO registration numbers.
ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape
Sequences". The procedure by which a character set gets into the above
register and has a registration number and designating escape sequence
assigned to it.
JIS X 0202, "Code Extension Techniques for Use with the Code for Information
Interchange", the Japanese counterpart of ISO 2022.
ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded
Character Set of the American National Standard Code for Information
Interchange", describes 7- and 8-bit codes and extension techniques in
approximately the same manner as ISO 4873 and ISO 2022.
ISO 8859 (1987-present) (see Table 6 for ECMA equivalents), "Information
Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the
actual 8-bit character sets to be used for many of the world's languages.
The left half of each of these is the same as ASCII and ISO 646 IRV. Each
character, including those with diacritics, is represented by a single byte.
ISO is the International Organization for Standardization, ANSI is the American
National Standards Institute, and ECMA is the European Computer Manufacturers
Association. JIS means Japanese Industrial Standard.
The ISO/ECMA standards discussed in this proposal may be obtained free of
charge in their ECMA form by writing to:
ECMA Headquarters
Rue du Rhone 114
CH-1204 Geneva
SWITZERLAND
Be sure to specify the title and the ECMA number of each standard requested.
In general, the ISO member body from each country acts as the local sales
agent for ISO Standards in that country, for example ANSI in the USA:
Sales Department
American National Standards Institute
1430 Broadway
New York, NY 10018
Telephone 212-354-3300
Each such organization has its own arrangements for disseminating printed
documents. ANSI sells them for US dollars; organizations in other countries
may either sell them for local currency or give them away, depending on how
they are funded to operate.
ISO standards and CCITT recommendations can also be ordered from the UN
bookstore, but not free of charge:
United Nations Bookstore
United Nations Building
New York, NY 10017
CCITT recommendations are also available from ANSI.
APPENDIX B: HOW THE STANDARDS WORK
ASCII and ISO 646 give us a 128-character 7-bit character set. This set is
divided into two parts:
1. 33 "control characters" (characters 0 through 31, and character 127).
2. 95 "graphic characters" (32-126).
"Graphics" means printing characters -- characters that make ink appear on the
page or phosphor glow on the screen (as opposed to pixel- or line-oriented
picture graphics), plus the space character. The ASCII / ISO-646 IRV
character set is shown in Figure 1, arranged in a table of 16 rows and 8
columns.
_____________________________________________________________________________
    00  01  02  03  04  05  06  07
   +---+---+---+---+---+---+---+---+
00 |NUL DLE| SP  0   @   P   `   p |
01 |SOH DC1| !   1   A   Q   a   q |
02 |STX DC2| "   2   B   R   b   r |
03 |ETX DC3| #   3   C   S   c   s |
04 |EOT DC4| $   4   D   T   d   t |
05 |ENQ NAK| %   5   E   U   e   u |
06 |ACK SYN| &   6   F   V   f   v |
07 |BEL ETB| '   7   G   W   g   w |
08 |BS  CAN| (   8   H   X   h   x |
09 |HT  EM | )   9   I   Y   i   y |
10 |LF  SUB| *   :   J   Z   j   z |
11 |VT  ESC| +   ;   K   [   k   { |
12 |FF  FS | ,   <   L   \   l   | |
13 |CR  GS | -   =   M   ]   m   } |
14 |SO  RS | .   >   N   ^   n   ~ |
15 |SI  US | /   ?   O   _   o  DEL|
   +---+---+---+---+---+---+---+---+
Figure 1: The ASCII / ISO-646 International
Reference Version 7-bit Character Set
_____________________________________________________________________________
Characters are often referred to by their column and row position in this type
of table. For example, character 05/08 in Figure 1 is "X". Columns 00-01,
plus character 07/15, comprise the control set. Columns 02-07, minus
character 07/15, comprise the graphics.
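The column/row notation maps directly onto numeric character codes: the
column is the high-order four bits and the row is the low-order four bits.
The following is a minimal sketch of that conversion; the function names
to_code and to_position are illustrative only and are not part of any Kermit
program.

```python
def to_code(column: int, row: int) -> int:
    """Convert column/row notation (e.g. 05/08) to a numeric character code."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return column * 16 + row


def to_position(code: int) -> str:
    """Format a character code 0..255 in column/row notation."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in 0..255")
    return f"{code // 16:02d}/{code % 16:02d}"


if __name__ == "__main__":
    print(chr(to_code(5, 8)))      # character 05/08
    print(to_position(ord("A")))   # position of uppercase A
    print(to_code(7, 15))          # DEL, the last control character
```

The same arithmetic applies to the 8-bit tables described below, where
columns 08 through 15 hold the characters whose 8th bit is one.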
8-bit character sets are described in ISO 4873. An 8-bit character set has
two sides. Each side has a control set and a graphics set. The "left half"
consists of the control set C0 and the graphics set GL (Graphics Left). GL
has 94 characters, and corresponds to ASCII (and ISO 646 IRV) positions
02/01-07/14. SP (space) and DEL are not considered part of GL. All the
characters in the left half have their high-order, or 8th, bit set to zero,
and are therefore representable in 7 bits. The "right half" consists of the
control set C1 and the graphics set GR (Graphics Right). All characters in
the right half have their 8th bits set to one. Figure 2 shows the layout of
an 8-bit character set.
_____________________________________________________________________________
    <--C0--> <---------GL---------->  <--C1--> <---------GR---------->
    00  01  02  03  04  05  06  07    08  09  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
00 |NUL DLE| SP  0   @   P   `   p | |    DCS|---+                   |
01 |SOH DC1| !   1   A   Q   a   q | |    PU1|                       |
02 |STX DC2| "   2   B   R   b   r | |    PU2|                       |
03 |ETX DC3| #   3   C   S   c   s | |    STS|                       |
04 |EOT DC4| $   4   D   T   d   t | |IND CCH|                       |
05 |ENQ NAK| %   5   E   U   e   u | |NEL MW |                       |
06 |ACK SYN| &   6   F   V   f   v | |SSA SPA|                       |
07 |BEL ETB| '   7   G   W   g   w | |ESA EPA|                       |
08 |BS  CAN| (   8   H   X   h   x | |HTS    |    (special           |
09 |HT  EM | )   9   I   Y   i   y | |HTJ    |     graphics)         |
10 |LF  SUB| *   :   J   Z   j   z | |VTS    |                       |
11 |VT  ESC| +   ;   K   [   k   { | |PLD CSI|                       |
12 |FF  FS | ,   <   L   \   l   | | |PLU ST |                       |
13 |CR  GS | -   =   M   ]   m   } | |RI  OSC|                       |
14 |SO  RS | .   >   N   ^   n   ~ | |SS2 PM |                       |
15 |SI  US | /   ?   O   _   o  DEL| |SS3 APC|                   +---|
   +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
    <--C0--> <---------GL---------->  <--C1--> <---------GR---------->
Figure 2: An 8-Bit Character Set
_____________________________________________________________________________
GR character sets can have either 94 or 96 characters. A 94-character GR set
begins in position 10/01 and ends in position 15/14, with Space (SP) occupying
position 10/00 and DEL in position 15/15, just like GL (the corners shown in
GR in the diagram). A 96-character set has graphic characters in all 96
positions, 10/00 through 15/15.
An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters.
This number is sufficient to represent the characters in many of the world's
written languages, but not necessarily sufficient to represent all the graphic
symbols required in a given application, for instance a multi-language
document.
To represent a greater number of graphic characters, ISO 4873 defines four
"intermediate sets" of graphic characters, of either 94 or 96 characters each.
These are called G0, G1, G2, and G3. The G0 set never has more than 94
graphic characters, and G1-G3 can have up to 96 each. Therefore there can be
up to:
94 + (3 x 96) = 382
graphics characters simultaneously within the repertoire of a given device.
These intermediate graphics sets are kept in tables in the memory of the
terminal or computer. One of the intermediate sets (usually G0) is assigned
to GL, and (in the 8-bit communications environment) another may be assigned
to GR. When the terminal or computer receives a data byte, the numeric value
of its bits denotes the position of the character in GL or GR. For example,
the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII. In the
8-bit environment, any byte with its 8th bit set to zero is from GL, and a
byte with its 8th bit set to one is from GR.
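As an illustration of how a received byte selects a region of the 8-bit code
table, here is a small sketch. The function name region is hypothetical, and
reporting SP and DEL separately from GL is a choice made here to match the
statement above that they are not considered part of GL.

```python
def region(byte: int) -> str:
    """Name the part of an 8-bit code table that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    right = bool(byte & 0x80)   # 8th bit set: right half of the table
    low = byte & 0x7F           # position within that half
    if low < 0x20:
        return "C1" if right else "C0"
    if not right and low == 0x20:
        return "SP"
    if not right and low == 0x7F:
        return "DEL"
    return "GR" if right else "GL"


if __name__ == "__main__":
    for value in (0x0E, 0x41, 0x8E, 0xFC):
        position = f"{value // 16:02d}/{value % 16:02d}"
        print(f"{value:3d} = {position} -> {region(value)}")
```

Which graphic character a GL or GR position finally denotes depends on the
intermediate set currently assigned to it; this sketch only identifies the
region.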
A language like English can be represented adequately by ASCII in GL, because
all the required characters fit there. When a language has more than 94
characters, two techniques are used to represent all the characters:
1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and
the special characters (like accented letters) in GR. French, German,
and Russian are examples.
2. For languages with many symbols (e.g. where a symbol is assigned
to each word, rather than to each sound), represent each character
with multiple bytes rather than one byte. Japanese Kanji, for example,
uses a 2-byte code. A multibyte code may be assigned to G0, G1, G2, or
G3, just like a single-byte code.
How do we assign actual character sets to G0-G3, and how do we associate the
intermediate character sets with the active character set?
Selection of character sets is accomplished using special control characters
and escape sequences embedded within the data stream as described in ISO
Standard 2022. An ESCAPE SEQUENCE is used to DESIGNATE a particular alphabet
(such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular
intermediate graphics set (G0, G1, G2, or G3). A SHIFT FUNCTION is used to
INVOKE a particular intermediate graphics set into GL or GR. In programmer's
terms, GL and GR are pointers into the array of tables G0..G3, and the shift
functions simply change the values of these pointers.
In our discussion, we use the following notation (numbers are decimal unless
otherwise noted):
<ESC> Escape (ASCII 27, character 01/11)
<SP> Space (ASCII 32, character 02/00)
<SO> Shift Out (Ctrl-N, ASCII 14, character 00/14)
<SI> Shift In (Ctrl-O, ASCII 15, character 00/15)
Table 5 shows the alphabet designation functions for single-byte and
multi-byte character sets in both the 7-bit and 8-bit environments. The
character which is substituted for "F" identifies the actual character set to
be used.
_____________________________________________________________________________
Escape
Sequence  Function                                             Invoked By
<ESC>(F   assigns 94-character graphics set "F" to G0.         SI or LS0
<ESC>)F   assigns 94-character graphics set "F" to G1.         SO or LS1
<ESC>*F   assigns 94-character graphics set "F" to G2.         SS2 or LS2
<ESC>+F   assigns 94-character graphics set "F" to G3.         SS3 or LS3
<ESC>-F   assigns 96-character graphics set "F" to G1.         SO or LS1
<ESC>.F   assigns 96-character graphics set "F" to G2.         SS2 or LS2
<ESC>/F   assigns 96-character graphics set "F" to G3.         SS3 or LS3
<ESC>$(F  assigns multibyte character set "F" to G0.           SI or LS0
<ESC>$)F  assigns multibyte character set "F" to G1.           SO or LS1
<ESC>$*F  assigns multibyte character set "F" to G2.           SS2 or LS2
<ESC>$+F  assigns multibyte character set "F" to G3.           SS3 or LS3
Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________
Table 6 shows the escape sequences used to designate the appropriate parts of
each of the registered character sets discussed in this proposal to G1 (except
that ASCII is designated to G0, which is the normal situation). It is
important to note that the final letter of the escape sequence is not always
sufficient to designate a character set. For example, Czech Standard and JIS
Katakana are both designated by letter I. But the two can be distinguished by
the intermediate characters of the escape sequence, which specify whether the
set is single- or multibyte, or, when both sets are single-byte, whether there
are 94 or 96 characters.
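The way the intermediate characters disambiguate the final letter can be
sketched as follows. This is only an illustration covering the sequences
listed in Table 5; the function name parse_designation and the labels it
returns are assumptions made for this example.

```python
ESC = "\x1b"

# Intermediate characters from Table 5 and the set each one designates to.
G_94 = {"(": "G0", ")": "G1", "*": "G2", "+": "G3"}
G_96 = {"-": "G1", ".": "G2", "/": "G3"}


def parse_designation(seq: str):
    """Return (kind, target set, final character) for a Table 5 sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designating escape sequence")
    body = seq[1:]
    if body[0] == "$":                       # multibyte character set
        if len(body) == 3 and body[1] in G_94:
            return ("multibyte", G_94[body[1]], body[2])
    elif len(body) == 2:
        if body[0] in G_94:                  # 94-character single-byte set
            return ("94-character", G_94[body[0]], body[1])
        if body[0] in G_96:                  # 96-character single-byte set
            return ("96-character", G_96[body[0]], body[1])
    raise ValueError("unrecognized designating escape sequence")


if __name__ == "__main__":
    print(parse_designation(ESC + "-I"))   # Czech Standard
    print(parse_designation(ESC + ")I"))   # JIS Katakana
    print(parse_designation(ESC + "$)B"))  # JIS X 0208
```

Both the Czech and the Katakana sequences end in the letter I, but the
parsed results differ in their first element, which is what allows a
receiver to tell the two sets apart.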
_____________________________________________________________________________
                            Escape    ISO          ECMA       ISO/ECMA
Alphabet Name               Sequence  Reference    Reference  Registration
ASCII (ANSI X3.4-1986)      <ESC>(B   ISO 646 IRV  ECMA-6       6
Latin Alphabet No. 1        <ESC>-A   ISO 8859-1   ECMA-94    100
Latin Alphabet No. 2        <ESC>-B   ISO 8859-2   ECMA-94    101
Latin Alphabet No. 3        <ESC>-C   ISO 8859-3   ECMA-94    109
Latin Alphabet No. 4        <ESC>-D   ISO 8859-4   ECMA-94    110
Latin/Cyrillic              <ESC>-L   ISO 8859-5   ECMA-113   144
Latin/Arabic                <ESC>-G   ISO 8859-6   ECMA-114   127
Latin/Greek                 <ESC>-F   ISO 8859-7   ECMA-118   126
Latin/Hebrew                <ESC>-H   ISO 8859-8   ECMA-121   138
Latin Alphabet No. 5        <ESC>-M   ISO 8859-9   ECMA-128   148
Czech Standard CSN 369 03   <ESC>-I   none         none       139
* Math/Technical Set        <ESC>-K   ????         ????       143
Chinese (CAS GB 2312-80)    <ESC>$)A  none         none        58
Japanese (JIS X 0208)       <ESC>$)B  none         none        87
JIS-Katakana (JIS X 0201)   <ESC>)I   none         none        13
JIS-Roman (JIS X 0201)      <ESC>)J   none         none        14
Korean (KS C 5601-1987)     <ESC>$)C  none         none       149
Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________
* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series,
and other math/technical/line-drawing characters, but there is apparently no
such standard set at this time.
Tables 7 and 8 show the shift functions that are used to invoke the
intermediate character sets. These shift functions may be either locking or
single. "Locking shift" is like shift-lock on a typewriter. It means that
all subsequent characters until the next shift are to be taken from the
designated intermediate character set. "Single shift" applies only to the
character (either single or multibyte) that follows it immediately, but single
shift functions are only available for the G2 and G3 sets. Locking shift
functions remain in effect across alphabet changes.
In the 7-bit environment, only one character set, GL, can be active at a time.
The active character set can be selected from among the intermediate sets
G0-G3 by the shifts shown in Table 7. Control characters from C0 are
transmitted as-is, and those from the C1 set are sent prefixed by <ESC>
followed by the character value, minus 64. For example, the C1 character
10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").
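The 7-bit representation of C1 control characters described above can be
sketched as a pair of small functions. The names c1_to_7bit and
c1_from_7bit are hypothetical and chosen only for this illustration.

```python
ESC = 0x1B


def c1_to_7bit(c1: int) -> bytes:
    """Represent a C1 control (128..159) as ESC followed by (value - 64)."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not a C1 control character")
    return bytes([ESC, c1 - 64])


def c1_from_7bit(seq: bytes) -> int:
    """Recover the C1 control from its two-byte 7-bit representation."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 representation")
    return seq[1] + 64


if __name__ == "__main__":
    print(c1_to_7bit(129))          # the example from the text, ESC A
    print(c1_to_7bit(0x8E))         # SS2 (08/14) becomes ESC N
    print(c1_from_7bit(b"\x1bO"))   # ESC O is SS3 (08/15)
```

Applied to SS2 (08/14) and SS3 (08/15), the same rule yields the 7-bit forms
<ESC>N and <ESC>O.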
_____________________________________________________________________________
Shift  Representation  Name             Function
SI     Ctrl-O          Shift In         invoke G0 into GL
SO     Ctrl-N          Shift Out        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2  invoke G2 into GL
LS3    <ESC>o          Locking Shift 3  invoke G3 into GL
SS2    <ESC>N          Single Shift 2   select single character from G2
SS3    <ESC>O          Single Shift 3   select single character from G3
Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________
In the 8-bit environment two character sets, GL and GR, can be active at once.
A GL character is selected by a byte whose 8th bit is zero, and a GR character
by a byte whose eighth bit is one. The actual character sets assigned to GL
and GR are selected by the shifts shown in Table 8. Control characters from
both the C0 and C1 sets are sent as is.
_____________________________________________________________________________
Shift  Representation  Name                   Function
LS0    Ctrl-O          Locking Shift 0        invoke G0 into GL
LS1    Ctrl-N          Locking Shift 1        invoke G1 into GL
LS2    <ESC>n          Locking Shift 2        invoke G2 into GL
LS3    <ESC>o          Locking Shift 3        invoke G3 into GL
LS1R   <ESC>~          Locking Shift 1 Right  invoke G1 into GR
LS2R   <ESC>}          Locking Shift 2 Right  invoke G2 into GR
LS3R   <ESC>|          Locking Shift 3 Right  invoke G3 into GR
SS2    08/14           Single Shift 2         select single character from G2
SS3    08/15           Single Shift 3         select single character from G3
Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________
So we have a 3-tiered system. At the bottom tier lie all the world's coded
character sets. We can designate up to four of them, one to each of the
intermediate graphics sets G0, G1, G2, and G3 using the escape sequences shown
in Tables 5 and 6. The terminal or computer keeps each of the selected
intermediate sets in memory. There is also one active set, composed of GL and
GR. The intermediate sets are invoked to GL or GR (one at a time) by the
shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8. A simplified diagram
for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed
diagrams of both the 7-bit and 8-bit environments). On a more sophisticated
output device, Figure 3 would contain numerous arrows pointing upwards to
demonstrate the operation of the designators and shifts.
_____________________________________________________________________________
+--+--------+ +--+--------+
|C0| GL | |C1| GR |
| | | | | | 8-Bit
| | | | | | Code
| | | | | | In Use
+--+--------+ +--+--------+
LS0 LS1,LS1R LS2,LS2R LS3,LS3R Shifts
SS2 SS3
+--------+ +--------+ +--------+ +--------+ Intermediate
| | | | | | | | Graphics
| G0 | | G1 | | G2 | | G3 | Sets
| | | | | | | |
+--------+ +--------+ +--------+ +--------+
Alphabet
Designation
<ESC>(B <ESC>-A <ESC>-B <ESC>-L <ESC>$)B Sequences
+---------+
+--------+ +--------+ +--------+ +--------+ +--------+ | The world's
| ISO | | ISO | | ISO | | ISO | | JIS X | | registered
| 646IRV | | Latin | | Latin | | Latin | | 0208 | | character
|(ASCII) | | 1 | | 2 | |Cyrillic| | Kanji | + sets
+--------+ +--------+ +--------+ +--------+ +--------+
Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________
For example, the following sequence would be used to transmit the German word
"<umlaut-u>bern<umlaut-a>chtig" using Latin Alphabet 1 in the 7-bit
environment:
<ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig
where:
<ESC>(B designates ASCII to G0
<ESC>-A designates the right half of Latin Alphabet 1 to G1
<SO> invokes G1 to GL
| is character 07/12, but since G1 is invoked to GL, it really
denotes character 15/12, which is <umlaut-u>
<SI> invokes G0 to GL
bern are characters from G0, which is invoked in GL
<SO> invokes G1 to GL
d is character 06/04, but since G1 is invoked to GL, it really
denotes character 14/04, which is <umlaut-a>
<SI> invokes G0 to GL
chtig are characters from G0
The same word could be transmitted in the 7-bit environment using single
shifts, if Latin Alphabet 1 were designated to G2 (or G3):
<ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig
(where <ESC>.A designates the right half of Latin Alphabet 1, a 96-character
set, to G2, and <ESC>N is Single Shift 2).
In the 8-bit environment it could be transmitted using no shifts at all:
<ESC>(B<ESC>-A<umlaut-u>bern<umlaut-a>chtig
The designation escape sequences are transmitted only at the beginning of a
session and need not be repeated after the initial designations are made,
unless an intermediate set (G0-G3) is to be recycled.
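To make the interplay of designation and invocation concrete, here is a
minimal sketch of a decoder for the 7-bit environment. It is not taken from
any Kermit program: the names are hypothetical, only ASCII and the right half
of Latin Alphabet 1 are known to it, ASCII in G0 is assumed as the initial
state, and the single-shift test designates Latin Alphabet 1 to G2 with the
96-character designator <ESC>.A listed in Table 5.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Intermediate character -> (index of G-set, size of set), per Table 5.
DESIGNATORS = {
    "(": (0, 94), ")": (1, 94), "*": (2, 94), "+": (3, 94),
    "-": (1, 96), ".": (2, 96), "/": (3, 96),
}


def graphic(charset, byte):
    """Map a 7-bit byte to text; only two sets are known to this sketch."""
    if charset == (94, "B"):         # ASCII
        return chr(byte)
    if charset == (96, "A"):         # right half of Latin Alphabet 1
        return chr(byte | 0x80)      # Unicode 128-255 coincides with Latin-1
    raise ValueError(f"unsupported character set {charset}")


def decode_7bit(data: bytes) -> str:
    g = [(94, "B"), None, None, None]   # G0..G3 (ASCII in G0 is assumed)
    gl = 0          # which of G0..G3 is currently invoked into GL
    single = None   # pending single shift; applies to one character only
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            nxt = chr(data[i + 1])
            if nxt in DESIGNATORS:           # designation: ESC I F
                index, size = DESIGNATORS[nxt]
                g[index] = (size, chr(data[i + 2]))
                i += 3
            elif nxt in ("n", "o"):          # locking shifts LS2, LS3
                gl = 2 if nxt == "n" else 3
                i += 2
            elif nxt in ("N", "O"):          # single shifts SS2, SS3
                single = 2 if nxt == "N" else 3
                i += 2
            else:
                raise ValueError(f"unsupported escape sequence ESC {nxt}")
        elif b == SO:
            gl = 1
            i += 1
        elif b == SI:
            gl = 0
            i += 1
        elif b < 0x20:
            out.append(chr(b))               # other C0 controls pass through
            i += 1
        else:
            index = gl if single is None else single
            single = None
            out.append(graphic(g[index], b))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    locking = b"\x1b(B\x1b-A\x0e|\x0fbern\x0ed\x0fchtig"
    print(decode_7bit(locking))
```

The variable gl plays the role of the GL pointer into the array of tables
G0..G3; the locking shifts change it, while a single shift overrides it for
exactly one character.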
To understand the three-tiered design of ISO 2022, imagine a computer
programmed to display a mixture of character sets on its screen. A large
collection of fonts might be stored on the disk, one font per file. These are
the character sets of the bottom tier. When a font is needed, it will be read
from the disk and stored in memory in an array, for rapid access. If several
fonts are needed, they will be stored in several arrays. These arrays are the
intermediate character sets, G0-G3. When a data byte arrives to be displayed,
the actual graphic representation is taken from GL or GR (depending on the
byte's 8th bit). GL is associated with one of the intermediate graphic sets,
and GR with another. If no more than four character sets are used, then each
one needs to be read from the disk only once, and display is rapid and
efficient thereafter.
ANNOUNCING ISO 2022 FACILITIES
A large portion of ISO 2022 is devoted to describing how 8-bit characters may
be transmitted on a 7-bit communication path, for example when parity is in
use. In the 7-bit environment, there is only GL -- no GR. Therefore, all
characters are transmitted with their 8th bit removed, and shifts are used to
specify which intermediate set they belong to.
In fact, there are many possible ways to use the ISO 2022 code extension
facilities within both 7-bit and 8-bit environments. For example, the sender
may inform the receiver in advance whether G1, G2, or G3 will be used, etc, so
that the receiver can allocate the appropriate resources. At the beginning of
any particular data transfer, the facilities that actually will be used can be
announced with a sequence of the form <ESC><SP>F, where F is replaced by an
ISO 2022 announcer. Several of the most important ones are described here.
Table 9 lists all the defined announcers in summary form. For details, see
ISO 2022.
<ESC><SP>A means that only the G0 set will be used, invoked into GL. No
shift functions will be used. In the 8-bit environment, GR is not used.
In other words, only a single 7-bit character set is used.
<ESC><SP>B means the G0 and G1 sets will be used with locking shifts. In the
7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL. In the
8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL. In other
words, two character sets are used, with characters from both sets always
sent as 7-bit values, with locking shifts used to specify the 8th bit.
<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0
invoked in GL and G1 in GR. No locking shift functions are used. In other
words, a single 8-bit character set is used, with all 8 bits transmitted as
data. GL is selected when the character's 8th bit is zero, GR is selected
when the 8th bit is one.
<ESC><SP>D means that G0 and G1 will be used with locking shifts. In the
7-bit environment, <SI> invokes G0 into GL and <SO> invokes G1 into GL. In
the 8-bit environment, all 8 bits of each character are transmitted with no
shifts.
<ESC><SP>L means that Level 1 of ISO 4873 will be used. That is, a single
8-bit character set with C0, G0, C1, and G1, with no shift functions.
This is like <ESC><SP>C.
<ESC><SP>M means that Level 2 of ISO 4873 will be used. This is equivalent
to Level 1, with the addition of G2 and G3. Characters from G2 and G3 are
invoked only by the single-shift functions SS2 and SS3.
<ESC><SP>N means that Level 3 of ISO 4873 will be used. This is equivalent
to Level 2 with the addition of the locking shift functions LS1R, LS2R, and
LS3R. (Note that ISO 4873 does not concern itself with the 7-bit
environment, and therefore does not discuss the use of LS0, LS1, LS2, or
LS3.)
_____________________________________________________________________________
Esc Sequence  7-Bit Environment           8-Bit Environment
<ESC><SP>A    G0->GL                      G0->GL
<ESC><SP>B    G0-(SI)->GL, G1-(SO)->GL    G0-(LS0)->GL, G1-(LS1)->GL
<ESC><SP>C    (not used)                  G0->GL, G1->GR
<ESC><SP>D    G0-(SI)->GL, G1-(SO)->GL    G0->GL, G1->GR
<ESC><SP>E    Full preservation of shift functions in 7 & 8 bit environments
<ESC><SP>F    C1 represented as <ESC>F    C1 represented as <ESC>F
<ESC><SP>G    C1 represented as <ESC>F    C1 represented as 8-bit quantity
<ESC><SP>H    All graphic character sets have 94 characters
<ESC><SP>I    All graphic character sets have 94 or 96 characters
<ESC><SP>J    In a 7 or 8 bit environment, a 7 bit code is used
<ESC><SP>K    In an 8 bit environment, an 8 bit code is used
<ESC><SP>L    Level 1 of ISO 4873 is used
<ESC><SP>M    Level 2 of ISO 4873 is used
<ESC><SP>N    Level 3 of ISO 4873 is used
<ESC><SP>P    G0 is used in addition to any other sets:
              G0 -(SI)-> GL               G0 -(LS0)-> GL
<ESC><SP>R    G1 is used in addition to any other sets:
              G1 -(SO)-> GL               G1 -(LS1)-> GL
<ESC><SP>S    G1 is used in addition to any other sets:
              G1 -(SO)-> GL               G1 -(LS1R)-> GR
<ESC><SP>T    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL              G2 -(LS2)-> GL
<ESC><SP>U    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL              G2 -(LS2R)-> GR
<ESC><SP>V    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL              G3 -(LS3)-> GL
<ESC><SP>W    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL              G3 -(LS3R)-> GR
<ESC><SP>Z    G2 is used in addition to any other sets:
              SS2 invokes a single character from G2
<ESC><SP>[    G3 is used in addition to any other sets:
              SS3 invokes a single character from G3
Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________
APPENDIX C: PRELIMINARY DESIGN FOR LOADABLE TRANSLATION TABLES
Note the word "PRELIMINARY". This design will be refined as attempts are made
to program it.
The translation table is specified in a file written entirely in printable
ASCII, with line divisions as shown. Numbers are represented as ASCII decimal
digits.
Line        Contents
 1.         Name of this table
 2.         The word "COMMON" or "LOCAL"
 3.         Name of SOURCE character set (translating FROM)
 4.         Number of bytes per character of source set (1, 2, 3, 1-2, etc)
 5.         Number of characters per plane of source set (94, 96, 128)
 6.         Name of TARGET character set (translating TO)
 7.         Number of bytes per character of target set (1, 2, 3, 1-2, etc)
 8.         Number of characters per plane of target set (94, 96, 128)
 9.         Designating sequence for COMMON character set.
10.         Version number of common character set (blank if none)
11.         Registration number of common character set (e.g. I6/100,
            blank if none)
12.         Direction of writing (Left-to-right, Right-to-left, Upwards, etc)
13.         Number of entries in the translation table.
14.         Count of lines, n, between this line and beginning of
            translation table.
15 - 15+n.  Reserved for future use.
n+16...     The translation table itself.
Line 2 is either COMMON or LOCAL, and applies to the SOURCE character set.
LOCAL means that the source character set is local, and the target character
set is common, i.e. the one used during transmission in the transfer syntax.
COMMON means vice-versa.
Line 3 gives the name of the source character set, which is either local or
common, depending on line 2.
Line 4 specifies the number of bytes per character in the source character
set. For example, 1 for ASCII, ISO Latin-1, etc, 2 for JIS X 0208, etc.
The notation "1-2" means that a character can be either one or two bytes, as
in (for instance) CCITT T.61, where "A" is the single character "A", but
"`A" is the single character A-grave.
Line 5 specifies the number of "characters per plane". In a single-byte
character set, there is one plane, in a multibyte set there are many. In the
ISO world, an important distinction is made between 94-byte sets and 96-byte
sets. See Appendix B for a fuller explanation.
Lines 6-8 are like lines 3-5, but for the target character set. If the source
set was local, the target set is common, and vice versa.
Lines 9-11 give further information about the standard, COMMON character set:
Line 9 specifies the designating sequences required to assign the set to G0,
G1, G2, and G3 (see Table 6), in that order, with the bytes written as decimal
numbers, each byte separated by a space, and each sequence separated by a
comma. For example, the entry for a 94-character set whose final designating
letter is "B" would look like this:
27 40 66, 27 41 66, 27 42 66, 27 43 66
If a character set cannot be assigned to G0 (as is the case with a 96-byte
set), then the first entry would be left blank (the final letter here is A):
, 27 45 65, 27 46 65, 27 47 65
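The two example entries above can be generated mechanically from the final
letter and the size of the set. The following sketch shows one way to do
this for single-byte sets; the function name designating_line is
hypothetical, and multibyte sets are deliberately left out.

```python
ESC = 27


def designating_line(final: str, size: int) -> str:
    """Build the line-9 entry for a single-byte set with the given final."""
    if size == 94:
        intermediates = ["(", ")", "*", "+"]    # G0, G1, G2, G3
    elif size == 96:
        intermediates = [None, "-", ".", "/"]   # a 96-set cannot go in G0
    else:
        raise ValueError("size must be 94 or 96")
    entries = []
    for inter in intermediates:
        if inter is None:
            entries.append("")                  # blank entry for G0
        else:
            entries.append(f"{ESC} {ord(inter)} {ord(final)}")
    return ", ".join(entries)


if __name__ == "__main__":
    print(designating_line("B", 94))
    print(designating_line("A", 96))
```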
Line 10 gives the revision number of the common character set, as described in
the "Data Transfer Protocol" section of the description of Level 2. This
should be blank if the character set has not been revised, @ (atsign) for the
first revision, A for the second revision, B for the third revision, etc.
Line 11 gives the Kermit designator of the common character set, such as
I6/100 for ISO Latin Alphabet 1.
Line 12, direction of writing, has nothing to do with file transfer, but is
included in case the same table is also to be used with terminal emulation.
The actual notation should be the letter L (Left-to-right), R (Right-to-left),
U (Upwards), D (Downwards), or B (Boustrophedon, i.e. alternating L and R).
Line 13 is self explanatory.
Line 14 allows for future expansion of this "information header".
The lines following the reserved area (whose length is given on line 14)
contain the translation table itself. Each of these lines contains a pair of
characters or strings in ASCII decimal representation, with the members of
the pair separated by a comma, optionally followed by a semicolon and a
comment such as "Uppercase A Circumflex":
<character from transfer set>, <character from local set> ; <comment>
Each byte of a character is separated by a space, for example:
231, 135 ; c Cedilla (Latin-1 to CP850)
228, 97 101 ; Latin-1 a-umlaut to ASCII "ae"
97 101, 228 ; "ae" to Latin-1 a-umlaut (dangerous!)
123 456, 234 567 ; Something from a pair of 2-byte character sets
The character pair is listed, rather than a single value (as in most
translation tables) to allow for special translations like a-umlaut to "ae".
There is no rule against having different numbers of bytes on either side of
the comma. There is also no requirement to always have the same number of
bytes on the left or right side of the comma, nor to have every position
filled. If a position is vacant, the program should take some kind of default
action, like substituting a question mark.
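As a rough illustration of how a program might read the table entries and
apply the default action, consider the sketch below. It is not part of the
design: the names parse_entry and translate are hypothetical, and the
translate function handles only single-byte source characters.

```python
def parse_entry(line: str):
    """Split one table line into (source bytes, target bytes, comment)."""
    body, _, comment = line.partition(";")
    left, comma, right = body.partition(",")
    if not comma:
        raise ValueError("entry must contain a comma")
    source = tuple(int(token) for token in left.split())
    target = tuple(int(token) for token in right.split())
    return source, target, comment.strip()


def translate(data: bytes, table: dict) -> bytes:
    """Translate single-byte source characters; vacant positions become '?'."""
    out = bytearray()
    for b in data:
        out.extend(table.get((b,), (ord("?"),)))
    return bytes(out)


if __name__ == "__main__":
    lines = [
        "231, 135 ; c Cedilla (Latin-1 to CP850)",
        '228, 97 101 ; Latin-1 a-umlaut to ASCII "ae"',
    ]
    table = {}
    for text in lines:
        source, target, _ = parse_entry(text)
        table[source] = target
    print(translate(bytes([228, 231, 200]), table))
```

Keying the table on a tuple of bytes rather than on a single value is what
permits entries with different numbers of bytes on either side of the comma.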
APPENDIX D: SUMMARY OF NEW KERMIT COMMANDS
SET FILE TYPE { BINARY, TEXT, <other> }
  BINARY means no translation, and overrides all other file-related
    commands, including SET TRANSFER.
  TEXT is the default. Enables Level 0, 1, or 2 transfer syntax,
    depending on the setting of SET TRANSFER.
  <other> means any application-specific format known to the Kermit
    program, like WORDPERFECT. The meaning of such a command is
    system- and/or application-dependent.
SET FILE CHARACTER-SET <name>
  Effective only when file type is TEXT.
  Tells Kermit what character set the file is coded in,
  or what character set to translate an incoming file to.
SET TRANSFER { CHARACTER-SET <name>, INTERNATIONAL [{7,8}], NORMAL }
  CHARACTER-SET <name> invokes the Level 1 extension, unless the
    <name> is TRANSPARENT or ASCII.
  INTERNATIONAL invokes the Level 2 extension. 7 or 8 specifies the
    ISO-2022 7- or 8-bit environment.
  NORMAL - SET TRANSFER NORMAL is a synonym for SET TRANSFER
    CHARACTER-SET ASCII.
SET LANGUAGE <name>
  This command informs the program which language is being translated,
  to allow for special language-based translation tricks, such as
  a-umlaut => ae.
SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }
  Tells the file receiver whether to keep or cancel an incoming file that
  contains an unknown character set. KEEP is the default.
LOAD TRANSLATION-TABLE <tablename> <filename>
  Load a new translation table, or overlay an existing one, from the
  specified file.
SHOW TRANSLATION-TABLE <name>
  Show information about the named translation table. If <name> is omitted,
  show information about all translation tables.
DUMP TRANSLATION-TABLE <name> <filename>
  Write the contents of the named table to the specified file, in a format
  compatible with the LOAD TRANSLATION-TABLE command.
DROP TRANSLATION-TABLE <name>
  Remove the named translation table from Kermit's memory and command
  keyword tables.
SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name-of-attribute> { ON, OFF }
  Enables or disables processing of attribute packets, or specific
  attribute fields such as DATE, CHARACTER-SET, LENGTH, etc.
SHOW { CHARACTER-SETS, TRANSLATION-TABLES, LANGUAGE }
  Display what character sets, translation tables, and languages are
  available, and which ones are currently selected.
TRANSLATE <file1> <file2>
  Copies local file <file1> to local file <file2>, translating <file1>
  from the current file character-set into the current transfer
  character-set.
APPENDIX E:
ESCAPE SEQUENCES AND CONTROL CHARACTERS FOR KERMIT LEVEL-2 TRANSFER SYNTAX
1. Designation of character sets.
The final letter "F" denotes the character set, e.g. "A" for ISO Latin-1.
<ESC>(F assigns 94-character graphics set "F" to G0.
<ESC>)F assigns 94-character graphics set "F" to G1.
<ESC>*F assigns 94-character graphics set "F" to G2.
<ESC>+F assigns 94-character graphics set "F" to G3.
<ESC>-F assigns 96-character graphics set "F" to G1.
<ESC>.F assigns 96-character graphics set "F" to G2.
<ESC>/F assigns 96-character graphics set "F" to G3.
<ESC>$(F assigns multibyte character set "F" to G0.
<ESC>$)F assigns multibyte character set "F" to G1.
<ESC>$*F assigns multibyte character set "F" to G2.
<ESC>$+F assigns multibyte character set "F" to G3.
2. Shift functions:
Character(s)  Name    Function
Ctrl-O        SI,LS0  Shift In (invoke G0 to GL)
Ctrl-N        SO,LS1  Shift Out (invoke G1 to GL)
<ESC>n        LS2     Locking Shift 2 (invoke G2 to GL)
<ESC>o        LS3     Locking Shift 3 (invoke G3 to GL)
<ESC>~        LS1R    Locking Shift 1 Right (invoke G1 to GR)
<ESC>}        LS2R    Locking Shift 2 Right (invoke G2 to GR)
<ESC>|        LS3R    Locking Shift 3 Right (invoke G3 to GR)
<ESC>N        SS2     Single Shift 2, 7-bit version, single char from G2
08/14         SS2     Single Shift 2, 8-bit version, single char from G2
<ESC>O        SS3     Single Shift 3, 7-bit version, single char from G3
08/15         SS3     Single Shift 3, 8-bit version, single char from G3
3. Coding method delimiter:
When receiving text in an unknown character set, store the character set
designator, then store the untranslated characters, and terminate with the
coding method delimiter.
<ESC>d
4. Special characters in data:
If any of the following characters appear in the data itself, they must be
prefixed during transmission with <DLE>, datalink escape, 01/00, Control-P:
<SO> 00/14
<SI> 00/15
<DLE> 01/00
<ESC> 01/11
<SS2> 08/14
<SS3> 08/15
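The prefixing rule in item 4 can be sketched as follows. The function names
dle_quote and dle_unquote are hypothetical, and the sketch assumes that the
prefixed character itself is sent unchanged after the <DLE>, which the text
above does not state explicitly.

```python
DLE = 0x10

# SO, SI, DLE, ESC, SS2, SS3: the characters listed in item 4.
SPECIAL = frozenset({0x0E, 0x0F, 0x10, 0x1B, 0x8E, 0x8F})


def dle_quote(data: bytes) -> bytes:
    """Prefix each special character that occurs in the data with DLE."""
    out = bytearray()
    for b in data:
        if b in SPECIAL:
            out.append(DLE)
        out.append(b)
    return bytes(out)


def dle_unquote(data: bytes) -> bytes:
    """Remove DLE prefixes, taking the following byte literally."""
    out = bytearray()
    stream = iter(data)
    for b in stream:
        if b == DLE:
            try:
                b = next(stream)   # the prefixed character is literal data
            except StopIteration:
                raise ValueError("DLE at end of data") from None
        out.append(b)
    return bytes(out)


if __name__ == "__main__":
    quoted = dle_quote(b"a\x1bb")
    print(quoted)
    print(dle_unquote(quoted))
```

Because <DLE> is itself in the list, a literal <DLE> in the data is sent as
two <DLE> characters, so the receiver can always distinguish a prefix from
data.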
APPENDIX F: SELECTING LEVEL 0, LEVEL 1, AND LEVEL 2
A Kermit program operates in Level 0 by default. The transfer character set
is ASCII. If Attribute packets are used, the encoding attribute is "*!A" and
the character-set attribute ("2") is not used.
To enter Level 0:
SET TRANSFER CHARACTER-SET ASCII enters Level 0 from any level. SET TRANSFER
NORMAL is a synonym.
To enter Level 1:
SET TRANSFER CHARACTER-SET <name>, where <name> is not ASCII, enters Level 1
from any level. If Attribute packets are used, the encoding attribute is
"*!C" and the character-set attribute must be specified. If Level 1 is
entered, exited, and entered again, the transfer character set must be
respecified.
To enter Level 2:
SET TRANSFER INTERNATIONAL enters Level 2 from any level. Attribute packets
should be used at this level. The encoding attribute is "*xIyyy", where "x"
is the number of characters to follow, "I" signifies international transfer
syntax, and "yyy" is zero or more ISO-2022 facility announcers. If the
program knows in advance which character sets are in the file, it may also
include the character-set attribute, listing the Kermit character-set
designator for each set, separated by commas, for example "2-I2/100,I2/144".
APPENDIX G: SIMPLIFIED FLOW DIAGRAM OF KERMIT TRANSFER SYNTAX OPTIONS
SET FILE TYPE BINARY (overrides SET TRANSFER command)
|   |
N   Y--> Transfer file unmodified. END.
|
Text mode. Three possibilities:
SET TRANSFER CHARACTER-SET TRANSPARENT (or ASCII)
|   |
N   Y--> LEVEL 0: Transfer syntax is ASCII with CRLF as line terminator.
|        Sending program translates from local format to transfer syntax,
|        Receiving program translates from transfer syntax to local format.
|        END.
|
SET TRANSFER CHARACTER-SET LATIN1 (any single character set other than ASCII)
|   |
N   Y--> LEVEL 1: Transfer syntax is specified character set with CRLFs.
|        Sender translates from local format to specified character set.
|        Receiver translates from specified character set to local format.
|        END.
|
File composed of more than one character set:
SET TRANSFER INTERNATIONAL
|   |
N   Y--> LEVEL 2: Transfer syntax is ISO-2022. Assumes that sender can
|        identify the different character sets in the local file, and
|        can translate them to registered character sets if necessary.
|        |
|        Sender specifies encoding ("*") to be International ("I"),
|        and lists ISO-2022 announcers. Sender also optionally lists the
|        alphabets to be used in new character-set ("2") attribute.
|        |
|        Receiver agrees to these facilities and alphabets?
|        |   |
|        Y   N --> Receiver rejects the file, indicating "*" and/or "2"
|        |         as the reason. END.
|        |
|        Receiver accepts the file.
|        |
|        Transfer begins. Sender translates from local file format to
|        the character sets of the transfer syntax, using ISO-2022
|        announcers, designators, and shifts to switch among them.
|        |
|  2 --> Receiver heeds announcers, designators, and shifts, and
|        translates from the indicated character sets to local
|        representation.
|        |
|        If the receiver encounters an alphabet it does not know, it
|        will act according to the SET UNKNOWN-CHARACTER-SET command:
|        |      |
|        KEEP   CANCEL --> Reject the file by putting an X (Cancel
|        |                 File) code in the data field of its
|        |                 Acknowledgement. END.
|        |
|        (default) Continue to receive the file, but store the
|        designator for the unknown alphabet along with the
|        untranslated characters from that alphabet, until the next
|        known alphabet is encountered. Mark the end of the
|        untranslated material with <ESC>d. Warn user.
|        |
|        END.
|
Reserved for future
(END)
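The top-level decisions in the diagram above can be summarized in a few lines
of Python. This is only a sketch under simplifying assumptions: the two SET
TRANSFER commands are collapsed into one hypothetical string argument, NORMAL
is accepted as a synonym for ASCII as described in Appendix F, and the
function names and returned strings are illustrative, not part of this
proposal.

```python
LEVEL0_SETTINGS = {"ASCII", "TRANSPARENT", "NORMAL"}


def select_transfer_syntax(file_type, transfer="ASCII"):
    """Return "BINARY", or the level (0, 1 or 2) chosen by the settings.

    file_type -- "TEXT" or "BINARY" (SET FILE TYPE)
    transfer  -- "INTERNATIONAL", one of the Level 0 settings, or the name
                 of a single transfer character set such as "LATIN1"
    """
    if file_type.upper() == "BINARY":
        return "BINARY"            # overrides any SET TRANSFER command
    setting = transfer.upper()
    if setting in LEVEL0_SETTINGS:
        return 0
    if setting == "INTERNATIONAL":
        return 2
    return 1                       # any single character set other than ASCII


def unknown_alphabet_action(setting="KEEP"):
    """Describe what a Level 2 receiver does on meeting an unknown alphabet."""
    setting = setting.upper()
    if setting == "CANCEL":
        return "reject file: X (Cancel File) in data field of ACK"
    if setting == "KEEP":
        return "store designator and untranslated characters, end with <ESC>d, warn user"
    raise ValueError("SET UNKNOWN-CHARACTER-SET must be KEEP or CANCEL")
```

For instance, select_transfer_syntax("TEXT", "LATIN1") returns 1, while
select_transfer_syntax("BINARY", "INTERNATIONAL") returns "BINARY" because
the file type takes precedence.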