- Xref: sparky comp.std.internat:873 soc.culture.nordic:7922 soc.culture.german:9309
- Newsgroups: comp.std.internat,soc.culture.nordic,soc.culture.german
- Path: sparky!uunet!mcsun!sunic!nobeltech!admin.kth.se!ojarnef
- From: ojarnef@admin.kth.se (Olle Jarnefors), psv@nada.kth.se (Peter Svanberg)
- Subject: Re: ISO Latin 1 to 7-bit ASCII conversion (final draft!)
- Message-ID: <1992Dec16.165027.9152@admin.kth.se>
- Followup-To: comp.std.internat
- Summary: Table modifiability needed. New conversions suggested for several
- characters. Special problems with Scandinavian 7-bit codes. The C3 system.
- Keywords: character sets, ISO 8859-1, terminals, user interface
- Sender: ojarnef@admin.kth.se (Olle Jarnefors)
- Organization: Royal Institute of Technology, Sweden
- References: <1gi1rnEINN1cg@uni-erlangen.de>
- Date: Wed, 16 Dec 1992 16:50:27 GMT
- Lines: 294
-
- Some comments on Markus's proposal:
-
- > ------------------------------------------------------------------------
- > Representation of ISO 8859-1 characters with 7-bit ASCII
- > --------------------------------------------------------
- >
- > Markus Kuhn -- 1992-12-13
- >
- > SUMMARY: This text describes a technique of displaying the 8-bit
- > character set, which is used today in many modern network services, on
- > old 7-bit terminals. Authors of software dealing with text received
- > from international networks are strongly encouraged to implement this
- > or similar methods as options in their software for the convenience of
- > users all over the world. Implementation is often trivial.
-
- With only four tables to choose from, it is impossible to
- satisfy the different needs of the many languages and national
- cultures covered by ISO 8859-1. The summary should therefore
- already stress that system managers and end users must be able
- to change the conversion table easily.
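
One way to picture such modifiability is as layered tables. The following is only a minimal sketch under our own assumptions: the names BUILTIN_TABLE, build_table and convert are hypothetical, the three entries are illustrative rather than taken from Markus's tables, and "?" stands in for the draft's fallback character.

```python
# Hypothetical built-in defaults: Latin-1 code point -> ASCII replacement.
# Only three entries are shown; they are illustrative, not the draft's tables.
BUILTIN_TABLE = {
    0xC4: "A",  # CAPITAL LETTER A WITH DIAERESIS
    0xD6: "O",  # CAPITAL LETTER O WITH DIAERESIS
    0xFC: "u",  # SMALL LETTER U WITH DIAERESIS
}


def build_table(builtin, site=None, user=None):
    """Layer overrides: user entries win over site entries, which win over built-ins."""
    table = dict(builtin)
    table.update(site or {})
    table.update(user or {})
    return table


def convert(text, table, fallback="?"):
    """Replace every non-ASCII character via the table, else use the fallback."""
    out = []
    for ch in text:
        code = ord(ch)
        out.append(ch if code < 0x80 else table.get(code, fallback))
    return "".join(out)


# Preferences of a hypothetical German-speaking user.
german_user = {0xC4: "Ae", 0xD6: "Oe", 0xFC: "ue"}

print(convert("M\u00fcnchen", build_table(BUILTIN_TABLE)))                    # Munchen
print(convert("M\u00fcnchen", build_table(BUILTIN_TABLE, user=german_user)))  # Muenchen
```

Keeping the overrides as separate small tables means a site or a user only has to state the entries that differ from the defaults.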
-
- > The disadvantages of this approach are often acceptable:
- >
- > a) No one-to-one mapping between Latin 1 and ASCII strings possible
- > b) Text layout may be destroyed by multi-character substitutions
-
- Add "..., especially in tables" to mention the practically most
- important case. We would also suggest the addition of some text
- about a related disadvantage:
-
- : *) Truncation may be necessary to fit textual data into
- : fields of fixed width
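
To illustrate the truncation problem, here is a small sketch. The table EXPANDING and the helper fit_field are hypothetical names of our own, and the field width of 9 is arbitrary.

```python
# Illustrative two-character substitutions (not taken from the draft's tables).
EXPANDING = {0xF6: "oe", 0xE5: "aa"}


def fit_field(text, width):
    """Truncate or pad text so that it occupies exactly `width` columns."""
    return text[:width].ljust(width)


name = "J\u00f6rgensen"                # 9 characters in Latin 1
converted = name.translate(EXPANDING)  # 10 characters after conversion

print(repr(fit_field(name, 9)))        # the original fits the field exactly
print(repr(fit_field(converted, 9)))   # 'Joergense' (the final "n" is lost)
```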
-
- > ... Users should be able to switch between the different
- > tables and the 8-bit transparent normal mode. ...
-
- Many users can get along with only one default conversion that
- is always activated. For most others, one alternative
- conversion is sufficient. We suggest this addition to the text:
-
- : It should be possible and not difficult to configure a default
- : conversion and possibly a default alternative conversion for
- : the program on a per user basis.
-
- > ... User defineable tables
- > are always a nice possible extension.
-
- In our opinion this is not only a nice feature but a user need
- almost as important as the capability of switching between the
- five conversions defined in this text. Write something like:
-
- : A simple way for users to modify the conversion tables is also
- : desirable.
-
- > Users should know if the text they read has been converted from the
- > original Latin 1 text. ...
-
- Do you have in mind any specific way of visually indicating that
- conversion takes place? Underlining converted characters?
- Something else?
-
- > ... This avoids confusion if e.g. someone asks for
- > sending him a 3<fraction 1/2>" disk [3½"], which will be displayed
- > after the conversion as 31/2" (= 15.5").
-
- This particular problem is most easily solved, we suggest, by
- converting the character not to "1/2" but to " 1/2", with an
- initial space character.
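
A minimal sketch of the difference, assuming the conversion is a plain per-character table lookup. DRAFT and SUGGESTED are our own names for one-entry tables.

```python
# One-entry tables for VULGAR FRACTION ONE HALF (code 189, hex BD).
DRAFT = {0xBD: "1/2"}       # conversion in the current draft
SUGGESTED = {0xBD: " 1/2"}  # conversion with an initial space

text = '3\u00bd" disk'

print(text.translate(DRAFT))      # 31/2" disk
print(text.translate(SUGGESTED))  # 3 1/2" disk
```

The price is that a fraction standing at the start of a line or after a space picks up one extra space.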
-
- > ... First of all, a table with the
- > real characters in the range 160 - 255 (0xa0 - 0xff):
-
- Two of the "high" characters of ISO 8859-1
-
- 160 "A0 '240 NO-BREAK SPACE (NBSP)
- 173 "AD '255 SOFT HYPHEN (SHY)
-
- are not ordinary graphic characters but hybrid characters with
- both a graphic component and a control component.
-
- For soft hyphen the graphic component is an ordinary hyphen
- glyph. The control component is that this glyph should only
- be displayed or printed if the character is at the end of a
- line. If it is somewhere else in the line, _nothing_ should be
- displayed or printed.
-
- In the simple, context-insensitive conversion that we are
- dealing with here, SHY should be converted to the empty string,
- since it will occur less often at the end of a line than
- elsewhere.
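
As a sketch, the context-insensitive rule is a single replacement. For comparison we also show what a line-aware variant could look like; it is not something the draft or this comment proposes, and both function names are hypothetical.

```python
SHY = "\u00ad"  # SOFT HYPHEN, code 173


def drop_soft_hyphens(text):
    """Context-insensitive rule: SHY always becomes the empty string."""
    return text.replace(SHY, "")


def line_aware_soft_hyphens(text):
    """Comparison only: show "-" for a SHY that ends a line, nothing otherwise."""
    out = []
    for line in text.split("\n"):
        if line.endswith(SHY):
            out.append(line[:-1].replace(SHY, "") + "-")
        else:
            out.append(line.replace(SHY, ""))
    return "\n".join(out)


sample = "con\u00adver\u00ad\nsion"
print(drop_soft_hyphens(sample))        # "conver" and "sion" on two lines
print(line_aware_soft_hyphens(sample))  # "conver-" and "sion" on two lines
```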
-
- > Table 0 is a universal table that is expected to be suitable for many
- > languages. The letters are simply the ASCII versions without the
- > diacritics. The character '?' as a fallback substitution where no ASCII
- > string is suitable is used as little as possible ...
-
- Even if it is used as little as possible we suggest that "_" is
- a better fallback substitution than "?". Genuine question marks
- can be important for the correct interpretation of a text.
- It's unfortunate if they are mixed with replacement characters.
- By the way, there is a case of a genuine conversion to "?" in
- the tables, viz. of the Spanish inverted question mark. This of
- course must not be replaced by "_".
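
The distinction can be sketched as follows. The two-entry TABLE is hypothetical, and PILCROW SIGN (code 182) is treated as having no table entry purely for the sake of the example.

```python
# Genuine conversions; anything else above 127 gets the fallback character.
TABLE = {
    0xBF: "?",  # INVERTED QUESTION MARK: a real conversion, not a fallback
    0xE9: "e",  # SMALL LETTER E WITH ACUTE
}


def convert(text, table, subst):
    """Look up non-ASCII characters in the table, else emit `subst`."""
    out = []
    for ch in text:
        code = ord(ch)
        out.append(ch if code < 0x80 else table.get(code, subst))
    return "".join(out)


text = "\u00bfQu\u00e9? \u00b6 2"

print(convert(text, TABLE, "?"))  # ?Que? ? 2   (which "?" is genuine?)
print(convert(text, TABLE, "_"))  # ?Que? _ 2   (the fallback stands out)
```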
-
- For TABLE 0 we suggest the following changes:
-
- 0a: 164 "A4 '244 CURRENCY SIGN
- Now: SUBST (general substitution character)
- Suggestion: DOLLAR SIGN
-
- This character was actually invented for the international
- reference 7-bit character set of the first version of ISO 646.
- It is the only character that can replace the dollar sign in
- ISO 646-compatible character sets and no other character may
- be allocated to code position 36. Therefore these two characters
- have in practice become almost equivalent in those countries
- where the currency sign is known at all. (ISO has adopted a
- system for currency codes that doesn't use the currency sign
- at all.)
-
- 0b: 173 "AD '255 SOFT HYPHEN (SHY)
- Now: "-"
- Suggestion: ""
-
- 0c: 175 "AF '257 MACRON
- Now: SUBST
- Suggestion: "-"
-
- Macron is a diacritical mark used in semi-phonetical
- notation to indicate that a vowel is to be pronounced with a
- long sound. In fine typography it has the same width as the
- vowel letter and its vertical position is adjusted to the
- height of the letter. In ISO 8859-1 text it has to be
- written before or, preferably, after the vowel. In the
- conversion to a 7-bit character set its principal graphical
- form can still be preserved.
-
- 0d: 176 "B0 '260 DEGREE SIGN
- Now: SUBST
- Suggestion: "o"
-
- This is most often used in numerical data and can, without
- risk of misunderstanding, be substituted with the lowercase
- "o", as is often done.
-
- 0e: 188 "BC '274 VULGAR FRACTION ONE QUARTER
- Now: "1/4"
- Suggestion: " 1/4"
-
- 0f: 189 "BD '275 VULGAR FRACTION ONE HALF
- Now: "1/2"
- Suggestion: " 1/2"
-
- 0g: 190 "BE '276 VULGAR FRACTION THREE QUARTERS
- Now: "3/4"
- Suggestion: " 3/4"
-
- 0h: 247 "F7 '367 DIVISION SIGN
- Now: ":"
- Suggestion: "-:"
-
- This symbol has the meaning of subtraction in some countries
- and some application fields. In addition, division is
- in some countries normally indicated by "/" rather than ":".
- We therefore suggest that the conversion should be neutral
- by trying to approximate the appearance of the symbol,
- rather than its meaning. "-:" is better than ":-", since
- the "-" can't be misinterpreted as a minus on a following
- number.
-
- These suggestions also apply to TABLE 3.
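
Collected as a set of overrides, suggestions 0a to 0h could look like the sketch below. The name TABLE_0_SUGGESTIONS is our own, and the overrides are applied on their own here rather than on top of the full Table 0.

```python
# Suggested overrides for Table 0, keyed by Latin-1 code point.
TABLE_0_SUGGESTIONS = {
    0xA4: "$",     # 0a CURRENCY SIGN
    0xAD: "",      # 0b SOFT HYPHEN
    0xAF: "-",     # 0c MACRON
    0xB0: "o",     # 0d DEGREE SIGN
    0xBC: " 1/4",  # 0e VULGAR FRACTION ONE QUARTER
    0xBD: " 1/2",  # 0f VULGAR FRACTION ONE HALF
    0xBE: " 3/4",  # 0g VULGAR FRACTION THREE QUARTERS
    0xF7: "-:",    # 0h DIVISION SIGN
}

for sample in ["\u00a4100", "co\u00adoperate", "a\u00af", "25\u00b0C",
               "2\u00bc", "6\u00f73"]:
    # str.translate accepts a dict from code points to replacement strings.
    print(sample.translate(TABLE_0_SUGGESTIONS))
```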
-
- For TABLE 1 we suggest the following changes:
-
- 1a: 164 "A4 '244 CURRENCY SIGN
- Now: SUBST (general substitution character)
- Suggestion: DOLLAR SIGN
-
- 1b: 173 "AD '255 SOFT HYPHEN (SHY)
- Now: "-"
- Suggestion: ""
-
- 1c: 175 "AF '257 MACRON
- Now: SUBST
- Suggestion: "-"
-
- 1d: 176 "B0 '260 DEGREE SIGN
- Now: SUBST
- Suggestion: "o"
-
- 1e: 171 "AB '253 LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
- Now: '<'
- Suggestion: '"'
-
- The generic double quotation mark of ASCII can be used to
- reduce the risk of confusion with a genuine "<" used as a
- left angle parenthesis.
-
- 1f: 187 "BB '273 RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
- Now: '>'
- Suggestion: '"'
-
- 1g: 188 "BC '274 VULGAR FRACTION ONE QUARTER
- Now: SUBST
- Suggestion: "/"
-
- By using "/" instead of the general fallback character at
- least we indicate that the real character was a vulgar
- fraction.
-
- 1h: 189 "BD '275 VULGAR FRACTION ONE HALF
- Now: SUBST
- Suggestion: "/"
-
- 1i: 190 "BE '276 VULGAR FRACTION THREE QUARTERS
- Now: SUBST
- Suggestion: "/"
-
- > In some north european languages, any US-ASCII replacement for the
- > relevant Latin 1 characters is unacceptable for many people. In these
- > countries, national variants of 7-bit ISO 646 are still very popular.
- > These use the codes of []{}\|~^`$ for national characters. ...
-
- The last sentence is partially incorrect. Write instead: "These
- use the codes of [ ] \ { } | in a consistent manner for national
- letters. One of the codes also uses @ ^ ` ~ for additional
- letters." (The spaces make it easier to identify the special
- characters.)
-
- > ... Table 2 has
- > been designed for Danish, Finnish, Norwegian and Swedish users of ISO
- > 646 terminals:
-
- Unfortunately it uses @ ^ ` ~ as conversions for the letters
- E WITH ACUTE and U WITH DIAERESIS. This will give good results
- only with one of the two Swedish 7-bit character sets and
- disastrous results with the other Swedish code and with the
- Danish/Norwegian code.
-
- 2a: 201 "C9 '311 CAPITAL LETTER E WITH ACUTE
- Now: "@"
- Suggestion: "E"
-
- 2b: 233 "E9 '351 SMALL LETTER E WITH ACUTE
- Now: "`"
- Suggestion: "e"
-
- 2c: 220 "DC '334 CAPITAL LETTER U WITH DIAERESIS
- Now: "^"
- Suggestion: "Ue"
-
- 2d: 252 "FC '374 SMALL LETTER U WITH DIAERESIS
- Now: "~"
- Suggestion: "ue"
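
The effect of changes 2a to 2d can be sketched like this. DRAFT_TABLE_2 and SUGGESTED_TABLE_2 are our own names and hold only the four entries under discussion; the output shows what reaches a terminal that displays these codes as plain ASCII symbols rather than as national letters.

```python
# Only the four entries under discussion, keyed by Latin-1 code point.
DRAFT_TABLE_2 = {0xC9: "@", 0xE9: "`", 0xDC: "^", 0xFC: "~"}
SUGGESTED_TABLE_2 = {0xC9: "E", 0xE9: "e", 0xDC: "Ue", 0xFC: "ue"}

for name in ["M\u00fcller", "Lind\u00e9n", "\u00dcbung"]:
    draft = name.translate(DRAFT_TABLE_2)
    suggested = name.translate(SUGGESTED_TABLE_2)
    print(draft, "->", suggested)
```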
-
- Finally we would like to say that we are impressed by Markus's
- work. It identifies an important and neglected problem, is
- based on principles that are technically sound, and offers a
- practical solution. The alternative approaches used by people
- working with EBCDIC-ASCII conversion in the IBM mainframe world
- and by Kermit developers are less promising.
-
- We are ourselves active in this area, and have designed a more
- ambitious system, called "The C3 System for Character Code
- Conversion", as part of a joint project between Swedish
- universities. Actually, it's not quite finished yet. The big
- difference is that Markus's solution only treats two conversion
- paths:
-
- Latin-1 -> ASCII
- Latin-1 -> Scandinavian 7-bit character sets
-
- The C3 system will provide 3 types of conversion - 1-1, legible,
- and reversible - between any two of about 20 coded character
- sets by means of conversion tables on 3 levels - prototype,
- elementary, and working table.
-
- An obvious complement to Markus's work is a system that makes
- _input_ of all Latin-1 characters from a 7-bit terminal
- possible. To make this reasonably user-friendly seems to be a
- rather difficult task though.
-
- --
- Olle Jarnefors Internet: ojarnef@admin.kth.se
- Information Management Services UUCP: ...!uunet!mcsun!sunic!kth!ojarnef
- Royal Institute of Technology (KTH) BITNET: ojarnef@sekth Fax:+46 8 10 25 10
- S-100 44 Stockholm, Sweden Phone: +46 8 790 71 26 (time zone +0100)
-
- ---
- Peter Svanberg, NADA, KTH Email: psv@nada.kth.se
- Dept of Num An & CS,
- Royal Inst of Tech Phone: +46 8 790 71 46
- S-100 44 Stockholm, SWEDEN Fax: +46 8 790 09 30
-