A LOCKING SHIFT MECHANISM FOR THE KERMIT FILE TRANSFER PROTOCOL
Christine M. Gianone
Frank da Cruz
Columbia University
New York, NY USA
DRAFT 4.2
October 2, 1991
ABSTRACT
7-bit communication channels remain quite common: they are in use on IBM
mainframes, public data networks, in virtual terminal protocols like
TCP/IP TELNET, and on any connection in which a device uses parity.
The Kermit file transfer protocol achieves transparency over hostile
communication environments by encoding all data as printable characters. In
the 7-bit communications environment, 8-bit data is encoded in 7-bit form
using the single shift; the "&" character acts as a prefix, meaning that the
following character should have its 8th bit set to 1 after decoding. Kermit's
single-shift 8th-bit quoting mechanism can add excessive transmission overhead
to certain kinds of files, particularly text encoded in character sets like
ISO 8859 Cyrillic, Greek, Hebrew, or Arabic, or 8-bit Japanese Kanji codes
like EUC in which most data bytes have their 8th bit set to 1, resulting in
8th-bit quoting overhead up to 100%.
A new locking shift mechanism is proposed to allow 8-bit data to be
transferred more efficiently. This mechanism is an adaptation of the
familiar Shift-In / Shift-Out scheme combined with Kermit's present
single-shift technique, with some quoting rules added.
This proposal was prompted not only by the longstanding need for increased
efficiency in this area, but by a conference between the authors and Dr.
Hirofumi Fujii of the Japan National Laboratory for High Energy Physics
regarding the establishment of an official Kermit transfer syntax for
Japanese, the subject of a separate proposal, and subsequent meetings in
Japan. The algorithm and user interface were designed by Gianone and
the detailed protocol design was contributed by da Cruz in the course of
programming a trial implementation.
The reader is assumed to be familiar with the Kermit file transfer protocol
and with commonly used computer character sets.
TERMINOLOGY
In this proposal, the term "character" refers to an 8-bit byte, or octet,
even if the data is encoded in a multibyte character set, or if it is not
encoded in any character set at all (such as a binary file).
An "8-bit character" is a data byte with its 8th bit set to 1. A "7-bit
character" is one whose 8th bit is set to 0.
A "control character" is a byte in the range 0-31 or 127 decimal (the "C0"
set) or 128-159 or 255 (the "C1" set).
A "printable character" is any character that is not a control character.
NOTATION
Numbers are written in decimal.
"<XXX>" stands for an ASCII control character. "XXX" is replaced by the
character's name, for example "<SOH>" for Start of Heading (Control-A).
"<1>X" stands for an 8-bit character. The "X" can be a literal printable
character (for example, "<1>A" is the ASCII letter A with its 8th bit set to
1) or a control character (for example "<1><SOH>" is a Control-A with its 8th
bit set to 1).
Similarly, "<0>X" stands for a 7-bit character.
BACKGROUND
The Kermit protocol presently specifies three separate prefix characters to
be used within Kermit packets for transparency, compression, and quoting:
The Control Prefix
For transparency on serial communication links that are sensitive to
control characters, the file sender precedes each C0 and C1 control with
the control prefix, normally "#" (ASCII 35), and then encodes the control
character itself by "exclusive-ORing" it with 64 decimal (i.e. inverting
bit 6) to produce a character in the printable ASCII range. For example,
Control-C (ASCII 3) becomes "#C" (3 XOR 64 = 67, which is the ASCII code
for the letter C). Similarly, NUL becomes "#@", Control-A becomes "#A",
Control-Z becomes "#Z", Escape becomes "#[", and DEL becomes "#?". The
receiver decodes by discarding the prefix and XORing the character with
64 again. For example, in "#C", C = ASCII 67, and 67 XOR 64 = 3 =
Control-C. Control prefixing is mandatory. The control prefix is also
used for quoting prefix characters that occur in the data itself; see
"The Prefix Quote" below.
The 8th-bit Prefix
When one or both of the two Kermit programs knows that the connection
between them is not transparent to the 8th bit (e.g. because the Kermit
PARITY variable is not NONE, or because the program always operates that
way), a feature called "8th-bit prefixing" is used if the two Kermit
programs negotiate an agreement to do so. The 8th-bit prefix is Kermit's
single shift, normally the ampersand character "&" (ASCII 38). When the
file sender encounters an 8-bit character, it inserts the "&" prefix in
front of it, and then inserts the data character itself with its 8th bit
set to 0. If the data character is a control character, it is inserted
after the 8th-bit prefix in control-prefixed form. Examples: an "A" with
its 8th bit set to 1 ("<1>A") becomes "&A"; a Control-A with its 8th bit
set to 1 ("<1><SOH>") becomes "&#A".
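As a rough illustration of how the control prefix and the 8th-bit prefix combine on a 7-bit connection, the following Python sketch encodes a single data byte. The function name encode_byte is invented for this example. The sketch deliberately ignores parity, repeat counts, and the quoting of "#", "&", and "~" when they occur as data.

```python
CTL_PREFIX = "#"    # control prefix, ASCII 35
BIT8_PREFIX = "&"   # 8th-bit prefix (single shift), ASCII 38

def encode_byte(byte: int) -> str:
    """Encode one data byte (0-255) for a 7-bit connection.

    Illustrative only: prefix characters occurring as data are not quoted.
    """
    out = ""
    if byte & 0x80:                 # 8-bit character: emit the single shift
        out += BIT8_PREFIX
        byte &= 0x7F                # and send it with its 8th bit set to 0
    if byte < 32 or byte == 127:    # control character: prefix, then XOR with 64
        out += CTL_PREFIX + chr(byte ^ 64)
    else:
        out += chr(byte)
    return out

print(encode_byte(0x03))   # Control-C
print(encode_byte(0xC1))   # "A" with its 8th bit set to 1
print(encode_byte(0x81))   # Control-A with its 8th bit set to 1
```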
The Repeat-Count Prefix
The repeat-count prefix provides a simple form of data compression. It
is used only when both Kermit programs support this feature and agree to
use it. This prefix, normally tilde "~" (ASCII 126), precedes a repeat
count, which can range from 0 to 94. The repeat count is encoded as a
printable ASCII character in the range SP (32) - tilde (126) by adding
32. For example, a series of 36 G's would be encoded as "~DG" (D = ASCII
68 - 32 = 36). The repeat-count prefix applies to the following prefixed
sequence, which may be a single character ("~DG"), an 8th-bit prefixed
character ("~D&G" = 36 G's with their 8th bits set to 1), a
control-prefixed character ("~D#M" = 36 Control-M's), or an
8th-bit-and-control-prefixed character ("~~&#Z" = 94 Control-Z's with
their 8th bits set to 1).
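The repeat-count arithmetic can be sketched in a few lines of Python. The helper names tochar, unchar, and repeat are chosen for this illustration only. The sequence passed to repeat is assumed to be already prefixed as described above.

```python
REPEAT_PREFIX = "~"   # repeat-count prefix, ASCII 126

def tochar(n: int) -> str:
    """Encode a number in the range 0-94 as a printable character by adding 32."""
    if not 0 <= n <= 94:
        raise ValueError("repeat count must be in the range 0-94")
    return chr(n + 32)

def unchar(c: str) -> int:
    """Invert tochar: recover the number from its printable form."""
    return ord(c) - 32

def repeat(count: int, sequence: str) -> str:
    """Apply a repeat count to an already-prefixed sequence such as 'G' or '&#Z'."""
    return REPEAT_PREFIX + tochar(count) + sequence

print(repeat(36, "G"), repeat(36, "#M"), repeat(94, "&#Z"))
```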
The Prefix Quote
The control prefix, normally "#", is also used to quote the control prefix
itself if it occurs in the data: "##" means that the "#" character should
be taken literally. If 8th-bit prefixing is in effect, the control prefix
also quotes the 8th-bit prefix: "#&", so "#&D" stands for "&D" rather than
"<1>D". If repeat count prefixing is in effect, the control prefix is also
used to quote the repeat count prefix: "#~", so "#~CG" stands for "~CG"
rather than 35 "G" characters. So the complete meaning of the "#" prefix
is: if the value of the following character is 63-95 or 191-223, the
prefixed character is to be XORed with 64, otherwise it is to be taken
literally. The prefix quote can also be used harmlessly to quote 8th-bit
or repeat-count prefixing characters even when these types of prefixing are
not in effect.
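The complete meaning of the "#" prefix stated above reduces to a single range test. A minimal Python sketch follows; the function name after_control_prefix is hypothetical, and the argument is the numeric value of the character that follows the prefix.

```python
def after_control_prefix(code: int) -> int:
    """Return the data value of the character that follows a '#' prefix."""
    if 63 <= code <= 95 or 191 <= code <= 223:
        return code ^ 64      # an encoded control character
    return code               # a quoted literal such as '#', '&' or '~'

for ch in "C?#&~":
    print(ch, after_control_prefix(ord(ch)))
```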
On a 7-bit connection the file sender, after encoding the data, adds the
appropriate parity bit to all characters -- prefixes as well as data -- before
transmission, and the file receiver strips the parity bit from all received
characters before processing them.
On an 8-bit-clean connection, 8th-bit prefixing need not be (and normally is
not) done, and data characters retain their original 8th bit. For example,
"A" with its 8th bit set to 1 is transmitted literally, without any
prefixing ("<1>A"). Control-A with its 8th bit set to 1 is transmitted as
"#" followed by the letter A with its 8th bit set to 1 ("#<1>A") because
control prefixing is always in effect.
SINGLE AND LOCKING SHIFTS
The shift key on a typewriter lets the regular keys do "double duty". A
given key produces different results depending on whether the shift key is
up or down. Kermit's single shift (8th-bit prefix) is like the shift key:
just as you must press two keys on the typewriter for every uppercase
letter, Kermit must send two 7-bit characters for every 8-bit character when
8th-bit prefixing is in effect.
Certain types of files have many 8-bit characters in a row. When this is
the case, the overhead of single shifting could be as high as 100%.
Efficiency could be much improved by the use of "locking shifts": the file
sender tells the file receiver "Here comes a sequence of 8-bit characters"
and then sends these characters in 7-bit form, relying on the receiver to
put their 8th bits back before storing them.
The locking shift behaves like the shift-lock key on a typewriter: to type a
series of uppercase letters, you press the shift lock key once and then type
the letters, one key per letter, rather than two. To go back to lowercase
letters, release the shift lock key and then type more letters.
When the data communications "shift-lock" key is active, 7-bit characters
are said to be "shifted": they are not what they appear to be, but instead
represent 8-bit characters. When the locking shift is not in effect, 7-bit
characters stand for themselves; they are "unshifted".
The locking shift characters are SO (Shift Out, Control-N, ASCII 14), and SI
(Shift In, Control-O, ASCII 15). SO is sent at the beginning of a shifted
sequence, SI is sent to return to normal unshifted operation. For example,
on a 7-bit connection, the following string of characters (written using our
notation):
<0>A<0>B<0>C<1>D<1>E<1>F<1>G<1>H<1>I<0>J<0>K<0>L<0>M (13 characters)
would be transmitted like this with single shifts:
ABC&D&E&F&G&H&IJKLM (19 characters)
and like this with locking shifts:
ABC<SO>DEFGHI<SI>JKLM (15 characters)
On an 8-bit connection, of course, the string of 13 characters can be
transmitted as-is, with no overhead at all.
Now suppose we have the following character sequence:
<1>A<1>B<1>C<0>D<1>E<1>F<1>G<0>H<1>I<1>J<1>K<0>L<1>M (13 characters)
Here several isolated 7-bit characters are found in the middle of a long run
of 8-bit characters. Using locking shifts alone, this would be encoded as:
<SO>ABC<SI>D<SO>EFG<SI>H<SO>IJK<SI>L<SO>M (20 characters)
But using a combination of locking and single shifts, it can be encoded more
compactly, as in this example, in which "&" is the single-shift character:
<SO>ABC&DEFG&HIJK&LM (17 characters)
This proposal adds the locking Shift-In/Shift-Out mechanism to the Kermit
file transfer in a way that it can be used in conjunction with single shifts
for maximum efficiency.
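The character counts quoted for the single-shift-only and locking-shift-only examples can be reproduced with a short Python sketch. The helper names are invented for this illustration. As in the examples, SO and SI are each counted as one character, and control prefixing is ignored.

```python
import re

def parse(notation: str):
    """Turn the '<0>A<1>B' notation into [(False, 'A'), (True, 'B')]."""
    return [(bit == "1", ch) for bit, ch in re.findall(r"<([01])>(.)", notation)]

def single_shifts(data):
    """List of transmitted characters using only the '&' single shift."""
    out = []
    for high, ch in data:
        if high:
            out.append("&")
        out.append(ch)
    return out

def locking_shifts(data):
    """List of transmitted characters using only SO/SI (each counted once)."""
    out, shifted = [], False
    for high, ch in data:
        if high != shifted:
            out.append("<SO>" if high else "<SI>")
            shifted = high
        out.append(ch)
    return out

first = parse("<0>A<0>B<0>C<1>D<1>E<1>F<1>G<1>H<1>I<0>J<0>K<0>L<0>M")
second = parse("<1>A<1>B<1>C<0>D<1>E<1>F<1>G<0>H<1>I<1>J<1>K<0>L<1>M")
for data in (first, second):
    print(len(data), len(single_shifts(data)), len(locking_shifts(data)))
```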
NEGOTIATION
Locking shifts are, like all new additions, an optional feature of the
Kermit protocol. To allow old Kermit programs to interoperate transparently
with the new ones that implement locking shifts, the use of this feature
must be negotiated and agreed upon by both Kermit programs before it can be
used.
Two Kermit programs agree to use the locking shift extension via a new
capability bit, together with the existing 8th-bit prefixing (QBIN) field.
The capabilities mask is the 10th character in the initialization string. It
contains a bit mask encoded as a printable character by adding 32 (ASCII
Space).
Capability number 1 (bit 5, which until now has been reserved for future
use) will be used to indicate the locking shift capability: 1 if enabled, 0
if not. Thus old Kermits automatically disable the use of locking shifts
because they never set this bit. The format of Kermit's capability mask is:
   bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0
  +----+----+----+----+----+----+----+----+
  | X  | X  | 1  | 2  | 3  | 4  | 5  | Z  |
  +----+----+----+----+----+----+----+----+
where:
X = Must not be used
1 = Locking Shift Capability
2 = Extra-Long Packet Capability (9025-857374)
3 = Attribute Packet Capability
4 = Sliding Window Capability
5 = Long Packet Capability (95-9024)
Z = Capability Mask Extension Bit (allows addition of new mask bytes)
The locking shift protocol is used if and only if:
1. The file sender sets the Locking Shift Capability bit in the S (Send
Initialization) packet;
2. The file receiver also sets the same bit in its acknowledgement to the S
packet; and
3. The parties have agreed to use single shifts via the QBIN field.
Thus, locking shifts REQUIRE 8th-bit prefixing. This is reasonable because
(a) 8th-bit prefixing is easy to program; (b) all the popular Kermit programs
already implement it; (c) little is gained by using locking shifts without
single shifts; (d) it simplifies the user interface and the negotiation
process; and (e) it allows the file receiver as well as the sender to request
locking shifts.
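The three conditions above can be checked mechanically. The following Python sketch assumes that each side's capabilities character and the outcome of the QBIN negotiation are already known; the names capas_value and use_locking_shifts are hypothetical.

```python
LOCKING_SHIFT = 0x20   # capability number 1 occupies bit 5 of the mask

def capas_value(capas: str) -> int:
    """Decode the capabilities character by subtracting 32 (ASCII Space)."""
    return ord(capas) - 32

def use_locking_shifts(sender_capas: str, receiver_capas: str,
                       qbin_agreed: bool) -> bool:
    """True only if both sides set the bit and single shifts were agreed."""
    both = capas_value(sender_capas) & capas_value(receiver_capas) & LOCKING_SHIFT
    return bool(both) and qbin_agreed

# "@" is 32 + 0x20: a mask with only the locking shift bit set.
print(use_locking_shifts("@", "@", True))
print(use_locking_shifts("@", " ", True))
```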
ENCODING RULES
Kermit's locking shift protocol uses the C0 control character Shift Out (SO,
Control-N, ASCII 14) to precede a sequence of 8-bit characters, and Shift In
(SI, Control-O, ASCII 15) to precede a sequence of 7-bit characters.
Whether or not locking shift protocol is in effect, all of Kermit's normal or
negotiated prefixing rules also remain in effect, so SO appears in the packet
as "#N" and SI appears as "#O".
Each Kermit program maintains a SHIFT-STATE, which may be SHIFTED (shifted
out) or UNSHIFTED (shifted in). SHIFTED means that 8-bit characters are
being transmitted in 7-bit form (preceded by a Shift-Out character) and
UNSHIFTED means that 7-bit characters represent themselves. For each file,
the initial SHIFT-STATE is defined to be UNSHIFTED, so there is no need for
the sender to transmit an initial Shift-In (but it does no harm).
A. When the file sender's SHIFT-STATE is UNSHIFTED and it reads a 7-bit
character, it adds the character to the packet according to Kermit's
other prefixing rules (control and repeat count), and adds the
appropriate parity bits. Thus, any number of 7-bit characters can be
transmitted in a row.
B. When the file sender's SHIFT-STATE is UNSHIFTED and it reads an 8-bit
data character, there are two possibilities:
1. If single-shifting (8th-bit prefixing) is in effect, insert a
single-shift character ("&") with the appropriate parity bit before
the 8-bit data character, and add the data character itself with its
8th bit replaced by the appropriate parity bit.
OR:
2. Insert a Shift Out (SO) character into the packet (encoded as "#N"
with the appropriate parity bits), change the SHIFT-STATE to SHIFTED,
and then add the data character with its 8th bit replaced by the
appropriate parity bit.
C. When the file sender's SHIFT-STATE is SHIFTED and it reads an 8-bit
character, it adds the character to the packet according to Kermit's
other prefixing rules (control and repeat count), replacing the
character's 8th bit by the appropriate parity bit. Thus, any number of
8-bit characters may be transmitted in a row in 7-bit form after the SO.
D. When the file sender's SHIFT-STATE is SHIFTED and a 7-bit character is
encountered, there are two possibilities:
1. If single-shifting is in effect, insert a single-shift character ("&")
before the 7-bit character and add the appropriate parity bits.
OR:
2. Insert a Shift-In (SI) character (encoded as "#O" with the appropriate
parity bits) into the packet and change the SHIFT-STATE to UNSHIFTED,
and then insert the data character itself with the appropriate parity
bit.
E. If a repeated sequence of characters occurs where the shift state changes,
the locking shift is encoded BEFORE the repeat-count sequence: #O~xA,
not ~x#OA.
F. If the file ends in SHIFTED state, there is no need to issue a Shift-In
code at the end of the file, but it does no harm either.
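Rules A through D can be sketched as a small Python encoder. This is an illustration under stated assumptions, not a reference implementation: the names and the threshold parameter are invented here, parity and repeat counts (rule E) are omitted, and SO, SI, and DLE occurring as data are rejected because their quoting is a separate matter. The choice between a single shift and a locking shift is made by one simple policy: use a locking shift when the run of characters that do not fit the current SHIFT-STATE is at least threshold characters long.

```python
def run_length(data: bytes, start: int, high: bool) -> int:
    """Length of the run beginning at `start` whose 8th bit equals `high`."""
    n = 0
    while start + n < len(data) and bool(data[start + n] & 0x80) == high:
        n += 1
    return n

def encode_char(low7: int) -> str:
    """Control-prefix or quote one 7-bit value."""
    if low7 in (14, 15, 16):
        raise ValueError("SO, SI and DLE as data need DLE quoting (not shown)")
    if low7 < 32 or low7 == 127:
        return "#" + chr(low7 ^ 64)
    if chr(low7) in "#&~":
        return "#" + chr(low7)
    return chr(low7)

def encode(data: bytes, threshold: int = 5) -> str:
    out, shifted = [], False              # each file starts UNSHIFTED
    for i, byte in enumerate(data):
        high = bool(byte & 0x80)
        body = encode_char(byte & 0x7F)
        if high == shifted:               # rules A and C: fits current state
            out.append(body)
        elif run_length(data, i, high) >= threshold:
            out.append("#N" if high else "#O")   # rules B.2 and D.2
            shifted = high
            out.append(body)
        else:                             # rules B.1 and D.1: single shift
            out.append("&" + body)
    return "".join(out)                   # rule F: no final Shift-In is sent

high_bytes = bytes(ord(c) | 0x80 for c in "ABCAB")
print(encode(high_bytes + b"XY" + bytes(ord(c) | 0x80 for c in "BCA")))
```

This run-length test is only one possible lookahead policy. For data in which short runs of 8-bit characters are separated by isolated 7-bit characters, it single-shifts every 8-bit character even though an initial Shift Out would be shorter.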
SINGLE AND LOCKING SHIFTS
When locking shifts and single shifts are in effect, the meaning of the
single-shift character is reversed when the SHIFT-STATE is SHIFTED. Single
shifts can be used to efficiently encode isolated characters that don't fit
the current SHIFT-STATE. For example:
     Data                                   Encoding
  1. ABCABC<1>EBCABC                        ABCABC&EBCABC
  2. <1>A<1>B<1>C<1>A<1>BXY<1>B<1>C<1>A     #NABCAB&X&YBCA
In (1) the single shift "&" sets the 8th bit of "E" to 1 (normal Kermit
practice), but in (2) the single shift sets the 8th bit of "X" and "Y" to 0
because the SHIFT-STATE is SHIFTED (#N).
The file sender can decide whether to use single or locking shifts by
looking ahead in the input file data. Single shifts are more efficient when
there are one, two, or three n-bit characters in a row; locking shifts are
more efficient when there are five or more n-bit characters in a row (n is
either 7 or 8):
  Single Shift         Locking Shift
  &A          (2)      #OA#N      (5)  (worse)
  &A&B        (4)      #OAB#N     (6)  (worse)
  &A&B&C      (6)      #OABC#N    (7)  (worse)
  &A&B&C&D    (8)      #OABCD#N   (8)  (same)
  &A&B&C&D&E  (10)     #OABCDE#N  (9)  (better)
Thus five-character lookahead is sufficient to make the best decision.
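The costs in the table follow a simple pattern: single-shifting a run of n characters costs 2n packet characters, while bracketing the run with a pair of locking shifts costs n + 4, since each shift occupies two characters in the packet. A short Python sketch (function names invented for illustration) reproduces the table and the break-even point at n = 4.

```python
def single_shift_cost(n: int) -> int:
    """Packet characters needed to single-shift a run of n characters."""
    return 2 * n

def locking_shift_cost(n: int) -> int:
    """Packet characters needed for the run plus '#O' before and '#N' after."""
    return n + 4

for n in range(1, 6):
    s, l = single_shift_cost(n), locking_shift_cost(n)
    verdict = "worse" if l > s else "same" if l == s else "better"
    print(n, s, l, verdict)
```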
REPEAT COUNTS AND LOCKING SHIFTS
A repeated sequence of 8-bit characters that occurs while in UNSHIFTED state,
for example abc<1>X<1>X<1>X<1>X, can be encoded by using a single shift:
abc~$&X
A repeated sequence of 8-bit characters that occurs while in SHIFTED
state, for example:
abc<1>A<1>B<1>C<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>X<1>D<1>E<1>F
is encoded using the same repeat-count notation:
abc#NABC~(XDEF
Just as the # and & prefixes are used as prefixes in both UNSHIFTED and
SHIFTED states, so is the repeat-count prefix, ~. The same sequence could
also be encoded less efficiently as:
abc#NABC#O~(&X#NDEF
PREFIX CHARACTERS THAT OCCUR IN THE DATA
Since Kermit prefix characters can occur within file data, they must be
prefixed to distinguish them from true prefixes. The following encoding
is used:
                 STATE..............
  CHARACTER      UNSHIFTED   SHIFTED
   #             ##          &##
   &             #&          &#&
   ~             #~          &#~
   <1>#          &##         ##
   <1>&          &#&         #&
   <1>~          &#~         #~
QUOTING THE LOCKING SHIFT CHARACTERS
Since Control-O and Control-N can appear within file data, there has to be
a way to distinguish the use of these characters as locking shifts from
their use as data characters.
When (and only when) locking shift protocol is in effect, SO and SI
characters that appear in the data must be prefixed by Data Link Escape
(DLE, Control-P, ASCII 16), normally encoded as "#P". If DLE itself appears
in the file, it too must be prefixed by DLE.
The DLE character applies to the ENTIRE PREFIXED SEQUENCE that follows it.
This may be a single character, a control-prefixed character, an 8th-bit
prefixed character, or a repeat-count-prefixed sequence of any combination
of these. To illustrate the difference between quoting by "#" and DLE,
"##O" indicates a literal "#" character followed by the letter "O", whereas
"<DLE>#O" indicates a literal Control-O. In practice, the file sender
should use DLE only to prefix SO, SI, and itself, but the receiver should
treat DLE as a general "prefixed sequence" quote: it should discard the DLE,
decode the following prefixed sequence, and treat the result as data rather
than Kermit protocol information.
Should a repeated sequence of SO's, SI's, or DLE's occur within the data,
the entire sequence may be encoded with a repeat count and prefixed by a
single DLE, which applies to all copies of the repeated character. For
example, "#P~A#N" indicates 33 SO characters in a row that are not to be
treated as locking shifts.
When locking shift protocol is in effect, we must handle the C1 counterparts
of SO, SI, and DLE (that is, using our notation, <1>SO, <1>SI, and <1>DLE).
These characters would be inserted into the packet in their 7-bit form when
the SHIFT-STATE is SHIFTED, and the receiver would have no way of
distinguishing a data #O from a Shift-In #O, or a data #N from a Shift-Out #N,
or a data #P from a quoting #P. Therefore these characters too should be
prefixed by DLE when in SHIFTED state.
If a 7-bit SO, SI, or DLE appears in the data during SHIFTED state, the file
sender can "single-shift" it in the normal manner, for example "&#O". The
file receiver must treat such sequences as literal data characters, as if
they had been prefixed by DLE, not as shifts and quotes.
The rule, therefore, is that if #O, #N, and #P have no prefix of any kind,
then they are used for shifting and quoting. When these characters are
prefixed by either "&" or DLE, no matter what the SHIFT-STATE is, they are
data characters:
  File               SHIFT-STATE
  Character    UNSHIFTED        SHIFTED
   SI          #P#O             &#O or #P&#O
   <1>SI       &#O or #P&#O     #P#O
   SO          #P#N             &#N or #P&#N
   <1>SO       &#N or #P&#N     #P#N
   DLE         #P#P             &#P or #P&#P
   <1>DLE      &#P or #P&#P     #P#P
The "&#O" form need not be prefixed by "#P", but no harm is done if it is.
The packet receiver must respond to these prefixed sequences as follows:
  Packet               SHIFT-STATE
  Sequence        UNSHIFTED        SHIFTED
   #O             Discard*         Shift In
   #P#O           Literal SI       Literal <1>SI
   &#O or #P&#O   Literal <1>SI    Literal SI
   #N             Shift Out        Discard*
   #P#N           Literal SO       Literal <1>SO
   &#N or #P&#N   Literal <1>SO    Literal SO
   #P             Quote            Quote
   #P#P           Literal DLE      Literal <1>DLE
   &#P or #P&#P   Literal <1>DLE   Literal DLE
The "Discard*" entries are for when a redundant shift is received, for
example an unprefixed Shift-Out when the Kermit receiver is already shifted
out. Redundant shifts do not affect the current SHIFT-STATE and are not
interpreted as data; they are simply ignored and discarded by the receiver.
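The receiver's side of these rules can be sketched in Python as follows. The function name decode is invented for this example. The sketch assumes the packet data has already had its parity bits stripped and is well formed, so there is no error handling. It also reads the rule "no prefix of any kind" literally, so a shift or DLE that follows a repeat count, an "&", or a DLE is treated as data.

```python
SO, SI, DLE = 14, 15, 16

def decode(packet: str, shifted: bool = False):
    """Decode 7-bit packet data; return (data bytes, final SHIFT-STATE)."""
    out = bytearray()
    quoted = False                     # True right after an unprefixed "#P"
    i = 0
    while i < len(packet):
        prefixed = quoted              # any prefix seen for this sequence?
        count, single = 1, False
        if packet[i] == "~":           # repeat-count prefix
            count = ord(packet[i + 1]) - 32
            i += 2
            prefixed = True
        if packet[i] == "&":           # single shift
            single = True
            i += 1
            prefixed = True
        value = ord(packet[i])
        i += 1
        if value == ord("#"):          # control prefix or prefix quote
            code = ord(packet[i])
            i += 1
            value = code ^ 64 if 63 <= code <= 95 else code
            if not prefixed and value in (SO, SI, DLE):
                if value == DLE:
                    quoted = True      # next prefixed sequence is literal data
                else:
                    shifted = (value == SO)   # redundant shifts change nothing
                continue
        high = shifted != single       # "&" inverts the current SHIFT-STATE
        out += bytes([value | 0x80 if high else value]) * count
        quoted = False
    return bytes(out), shifted

data, state = decode("#NABCAB&X&YBCA")
print(data, state)
```

The returned SHIFT-STATE would be passed back in when decoding the next packet of the same file, since locking shifts remain in effect across packet boundaries.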
BOUNDARY CONDITIONS
Although sequences of characters prefixed by "#", "&", or "~" may not be
broken across packet boundaries, locking shifts are effective across packet
boundaries. However, locking shifts are not effective across file
boundaries; when a group of files is being transferred, the SHIFT-STATE must
be set to UNSHIFTED at the beginning of each file.
THE FILE RECEIVER
The file receiver has no decisions to make; it is driven entirely by the
sequence of characters in each packet it receives. The receiver operates as
it does without the locking shift protocol, but with additional rules: it must
recognize the locking shift indicators "#N" and "#O", set the SHIFT-STATE to
SHIFTED when it sees "#N" and to UNSHIFTED when it sees "#O", and set the value
of the 8th bit of each data character according to the current SHIFT-STATE.
It must treat #, &, and ~ as prefix characters even when the SHIFT-STATE is
SHIFTED, remembering that the meaning of the single-shift prefix "&" is
inverted. (The file receiver can also store the shift characters as is -- see
the COMMANDS section below.)
COMMANDS
One new command is required:
SET TRANSFER LOCKING-SHIFT { ON, OFF, FORCED }
The options are as follows:
ON: Enables the use of locking shifts. The Kermit program sets the locking
shift capability bit in any S or I packets it sends, or in any
acknowledgement to an S or I packet. Locking shifts are actually used if
and only if both Kermits set this bit AND single-shifts are successfully
negotiated. If a Kermit program implements the locking shift protocol,
the default TRANSFER LOCKING-SHIFT setting should be ON.
OFF: Disables the use of locking shifts. The Kermit program sets the
locking shift capability bit to zero in all negotiation packets, and
treats SO, SI, and DLE as ordinary data characters in Kermit data
packets.
FORCED: Forces the use of locking shifts, regardless of the PARITY setting and
capability negotiation. The file sender sets the locking shift bit in the
capability mask, sets the QBIN (8th-bit prefix) field to "N", and ignores
the receiver's reply. The file receiver sets the same values, regardless
of the sender's values. A Kermit program that has been given this command
acts as if locking shift protocol had been successfully negotiated and
single shifts have been disabled.
With these facilities and defaults in effect, the Kermit user will get
locking shift protocol automatically whenever PARITY is not NONE and both
Kermits support locking shifts (which implies they also support single shifts
and that single shifts were negotiated successfully).
SET TRANSFER LOCKING-SHIFT FORCED can be used to force the file sender to
use locking shifts even if the receiver doesn't understand this protocol, or
to force the file receiver to treat SO/SI/DLE codes in arriving files as
prescribed by this proposal. This allows an 8-bit data file to be sent
through a 7-bit connection to a Kermit program that does not implement
8th-bit prefixing or locking shifts. The result can be displayed on terminals
or printers that respond appropriately to Shift-In/Shift-Out codes, sent
through e-mail, or postprocessed with a simple SO/SI filter to reconstruct
it, provided the original file does not contain SO, SI, or DLE characters.
If a file containing SO/SI codes is sent to a Kermit program with SET
TRANSFER LOCKING-SHIFT FORCED in effect, the data is reconstructed according
to the imbedded shifts.
The SET TRANSFER LOCKING-SHIFT FORCED option is, of course, risky, and can
result in undesired effects if used improperly. For example, if the file
contains SO or SI characters as data, the shift state can become inverted.
Furthermore, DLE does not serve to "quote" SO or SI characters in ordinary
data communication; SO and SI usually act as locking shifts even when
preceded by DLE (or any other character). For example, when the sequence
"<SI>ABC<DLE><SO>DEF" is sent to a VT300 terminal, the DLE is ignored and
the characters DEF are shifted.
Here are the possible SET TRANSFER LOCKING-SHIFT combinations and their
effects. The OFF entries also apply to Kermit programs that don't implement
locking shift protocol at all:
  Sender   Receiver   Effect
  ON       ON         Locking shift protocol done if single shifts negotiated
  ON       OFF        No locking shifts
  ON       FORCED     SO/SI/DLE in data interpreted as shifts by receiver
  OFF      ON         No locking shifts
  OFF      OFF        No locking shifts
  OFF      FORCED     SO/SI/DLE in data interpreted as shifts by receiver
  FORCED   ON         Sender adds shifts, receiver stores them as data (*)
  FORCED   OFF        Sender adds shifts, receiver stores them as data
  FORCED   FORCED     Locking shift protocol is done with no single shifts

  (*) Sender announces that it WON'T do single shifts, which disables
      the receiver's locking-shift protocol.
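The table of combinations collapses into a few ordered tests. The following Python sketch is only a restatement of the table; the function name and the wording of the returned strings are invented for this illustration.

```python
def locking_shift_effect(sender: str, receiver: str) -> str:
    """Summarize the effect of the two SET TRANSFER LOCKING-SHIFT settings."""
    if sender == "FORCED" and receiver == "FORCED":
        return "locking shifts used, no single shifts"
    if receiver == "FORCED":
        return "receiver interprets SO/SI/DLE in data as shifts"
    if sender == "FORCED":
        return "sender adds shifts, receiver stores them as data"
    if sender == "ON" and receiver == "ON":
        return "locking shifts used if single shifts are negotiated"
    return "no locking shifts"

for s in ("ON", "OFF", "FORCED"):
    for r in ("ON", "OFF", "FORCED"):
        print(s, r, locking_shift_effect(s, r))
```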
CHARACTER SET TRANSLATION
SET TRANSFER LOCKING-SHIFT FORCED (or any other LOCKING-SHIFT setting)
does not affect character set translation. Translation is still done if the
user has elected to do it.
Here are the possibilities when the sender has SET TRANSFER LOCKING-SHIFT
FORCED and has announced an 8-bit transfer character set in the Attribute
packet, and the receiver supports character-set translation but is not doing
locking shift protocol:
1. Receiver translates the transfer character set into an 8-bit file
character set whose first 128 characters are ASCII, such as an IBM code
page, KOI-8, the Apple or NeXT character set, etc. In this case, the
desired effect is achieved automatically.
2. Receiver translates the transfer character set into a 7-bit file
character set such as an ISO 646 NRC or Short KOI. In this case the
result is garbage. Locking shifts should not be used here. For the
languages covered by ISO 646 NRCs, single shifts are more efficient.
3. The receiver does not understand the transfer character set. The
situation here is no different with locking shifts than without them.
PERFORMANCE
A preliminary implementation of the shifting algorithms described in this
proposal was coded and tested on a large number of text and binary files and
worked correctly: the result of encoding and then decoding each file was
identical to the original. All combinations of single shift, locking shift,
and repeat-count compression were tested successfully in both text and
binary file mode.
The following table shows the number of characters required to encode files of
different representative types (taken from a much larger sample) using
different combinations of single shifts (SS) and locking shifts (LS), but
without repeat-count compression (R). For comparison, the final column
includes repeat-count compression. The number in parentheses is the
"expansion factor" showing how much the data grew in the encoding process.
The .TXT files were encoded in text mode, the others were encoded in binary
mode.
File                   Encoding..................................................
Name          Length   SS...........  LS...........  LS+SS........  LS+SS+R......
ASCII.TXT     190689   202173 (1.06)  202126 (1.06)  202173 (1.06)  194938 (1.02)
GERMAN.TXT     39611    42159 (1.06)   43336 (1.09)   42169 (1.06)   41558 (1.05)
FRENCH.TXT    108021   116426 (1.08)  124446 (1.15)  116446 (1.08)  115531 (1.07)

CYRILL1.TXT    52046    95700 (1.84)   80998 (1.56)   64602 (1.24)   64476 (1.24)
CYRILL2.TXT    13699    25293 (1.85)   23429 (1.71)   18306 (1.34)   18078 (1.32)
CYRILL3.TXT    28434    49834 (1.75)   43029 (1.51)   37104 (1.30)   35519 (1.25)
CYRILL4.TXT    51011    89419 (1.75)   78217 (1.53)   63157 (1.24)   63010 (1.24)
Cyrillic
 Totals       145190   260246 (1.79)  225673 (1.55)  183169 (1.26)  181083 (1.25)

KANJI.TXT      29706    59494 (2.00)   32527 (1.09)   32629 (1.10)   32648 (1.10)
KANJIA.TXT    106943   157536 (1.47)  122043 (1.14)  121822 (1.14)  118563 (1.11)
Kanji
 Totals       136649   217030 (1.59)  154570 (1.13)  154451 (1.13)  151211 (1.11)

MSVIBM.EXE    146989   247766 (1.69)  302348 (2.06)  248991 (1.69)  210598 (1.43)
WERMIT        419861   737812 (1.76)  923451 (2.20)  760912 (1.81)  713830 (1.70)
FILE.ZIP       96911   173145 (1.79)  226407 (2.34)  172627 (1.78)  172841 (1.78)
ASCII.TXT is a plain US ASCII text file containing English prose and no
8-bit characters. GERMAN.TXT and FRENCH.TXT are German- and French-language
documents coded in ISO 8859-1 Latin Alphabet 1.
CYRILL1.TXT is a chapter from a Russian computer book, containing only a few
English words. CYRILL2.TXT is a poem, The Bronze Horseman by Pushkin; its
lines are short and there are many blank lines so there is a higher CRLF-to-
text ratio. CYRILL3.TXT is "Murphy's Laws" in Russian, in which lines tend to
be short, blank, or indented. CYRILL4 is a RussTeX source file in which the
TeX commands are ASCII and the text is Cyrillic. The Cyrillic text in all
these files is ISO 8859-5 Latin/Cyrillic 8-bit text.
KANJI.TXT is a Japanese-language text file encoded in the Japanese EUC code.
KANJIA.TXT contains a mixture of ASCII English and Japanese Kanji encoded in
EUC.
MSVIBM.EXE is an IBM PC binary executable program image. WERMIT is a SUN-4
(Sparc) binary executable program image. FILE.ZIP is a binary MS-DOS ZIP
archive.
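The expansion factor is the encoded size divided by the original length. As a worked check, the following Python sketch recomputes the Cyrillic totals and their factors from the four per-file rows of the table; the variable names are chosen for this illustration.

```python
# (length, SS, LS, LS+SS, LS+SS+R) for each file, copied from the table
cyrillic = {
    "CYRILL1.TXT": (52046, 95700, 80998, 64602, 64476),
    "CYRILL2.TXT": (13699, 25293, 23429, 18306, 18078),
    "CYRILL3.TXT": (28434, 49834, 43029, 37104, 35519),
    "CYRILL4.TXT": (51011, 89419, 78217, 63157, 63010),
}

# Sum each column, then divide every encoded total by the total length.
totals = [sum(column) for column in zip(*cyrillic.values())]
factors = [round(encoded / totals[0], 2) for encoded in totals[1:]]

# Fraction of packet characters saved by LS+SS relative to SS alone.
saving = 1 - totals[3] / totals[1]

print(totals)
print(factors)
print(round(saving, 2))
```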
ANALYSIS
For binary files, locking (combined) shifts generally provide no benefit
over single shifts. These files tend to have a high percentage of bytes in
the C0 and C1 ranges, and therefore suffer high overhead from control
prefixing. Furthermore, they rarely have long runs of 8-bit characters.
The reason the combined shift is less efficient than the single shift is the
necessity to quote SO, SI, and DLE characters that occur in the data.
For text files encoded in "left-handed" 8-bit character sets such as ISO
8859 Latin Alphabets 1-4 and 9 (for languages based on Roman characters),
8-bit characters generally occur only in isolation, and so locking
(combined) shifts provide no significant benefit over single shifts.
Locking and combined shifts provide a substantial performance improvement
over single shifts for text files written in "right-handed" 8-bit character
sets like the Latin Arabic, Cyrillic, Greek, and Hebrew alphabets where long
sequences of 8-bit bytes predominate, and for certain multibyte character
sets such as Japanese EUC, in which all Kanji-character bytes have their 8th
bits set to 1.
CONCLUSION
The locking shift algorithm is easy to program and is inexpensive in both
execution time and code space. Implementation of locking shift protocol is
recommended for Kermit programs that must transfer files likely to contain
many sequences of 5 or more consecutive 8-bit GR bytes over 7-bit
communication channels. Such files tend to be text files encoded in the ISO
character sets for non-Roman alphabets and in EUC Kanji codes, but there might
be other candidates too: binary image (raster) data, spreadsheet data, etc.
For such files, the efficiency improvement can approach 100%.
REFERENCES
Gianone, Christine M., "A Kermit Protocol Extension for International
Character Sets", Columbia University (1990).
da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for
Information Interchange".
ISO 2022, "Information processing - ISO 7-bit and 8-bit coded character
sets - Code extension techniques" (1985).
ISO 8859, "Information processing - 8-bit single-byte coded graphic
character sets", parts 1-9 (1987-present).
"JIS X 0212 Study Group Interim Report"
ACKNOWLEDGEMENTS
Thanks to John Chandler, John Klensin, Paul Placeway, and Konstantin
Vinogradov for their detailed comments on this proposal, and to Gisbert W.
Selke for the German file, Andre' Pirard for the French, Konstantin
Vinogradov and Dimitri Vulis for the Russian files, and Hirofumi Fujii for
the Japanese files.
(The End)