NetNews Usenet Archive 1992 #30

Text File | 1992-12-16 | 11.3 KB | 311 lines

Xref: sparky comp.std.internat:873 soc.culture.nordic:7922 soc.culture.german:9309
Newsgroups: comp.std.internat,soc.culture.nordic,soc.culture.german
Path: sparky!uunet!mcsun!sunic!nobeltech!admin.kth.se!ojarnef
From: ojarnef@admin.kth.se (Olle Jarnefors), psv@nada.kth.se (Peter Svanberg)
Subject: Re: ISO Latin 1 to 7-bit ASCII conversion (final draft!)
Message-ID: <1992Dec16.165027.9152@admin.kth.se>
Followup-To: comp.std.internat
Summary: Table modifiability needed. New conversions suggested for several characters. Special problems with Scandinavian 7-bit codes. The C3 system.
Keywords: character sets, ISO 8859-1, terminals, user interface
Sender: ojarnef@admin.kth.se (Olle Jarnefors)
Organization: Royal Institute of Technology, Sweden
References: <1gi1rnEINN1cg@uni-erlangen.de>
Date: Wed, 16 Dec 1992 16:50:27 GMT
Lines: 294

Some comments on Markus's proposal:

> ------------------------------------------------------------------------
> Representation of ISO 8859-1 characters with 7-bit ASCII
> --------------------------------------------------------
>
> Markus Kuhn -- 1992-12-13
>
> SUMMARY: This text describes a technique of displaying the 8-bit
> character set, which is used today in many modern network services, on
> old 7-bit terminals. Authors of software dealing with text received
> from international networks are strongly encouraged to implement this
> or similar methods as options in their software for the convenience of
> users all over the world. Implementation is often trivial.

By only offering four tables to choose from it's impossible to
satisfy the different needs of many languages and national
cultures covered by ISO 8859-1. Therefore it should be stressed
already in the summary that it is important that the conversion
table can be easily changed by system managers and end users.
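One way to read the modifiability requirement is as layered tables: a built-in default, a site-wide table set by the system manager, and a per-user table, with the most specific layer winning. The following is only a minimal sketch under that assumption. The names BUILTIN, make_table and convert are hypothetical, and the few entries shown are illustrative, not taken from the draft's tables.

```python
from collections import ChainMap

# Hypothetical built-in defaults (a tiny illustrative excerpt only).
BUILTIN = {0xC4: "A", 0xD6: "O", 0xE4: "a", 0xF6: "o", 0xFC: "u"}


def make_table(*overrides):
    """Layer override dicts over BUILTIN; earlier arguments take priority."""
    return ChainMap(*overrides, BUILTIN)


def convert(text, table, fallback="?"):
    """Replace every non-ASCII character using the table.

    Characters missing from every layer become the fallback string.
    """
    return "".join(
        ch if ord(ch) < 0x80 else table.get(ord(ch), fallback) for ch in text
    )


# Assumed example layers: a site table and one user's personal changes.
site = {0xE4: "ae", 0xF6: "oe"}   # chosen by the system manager
user = {0xF6: "o"}                # chosen by the end user

print(convert("sch\xf6n, K\xe4se", make_table(user, site)))
```

Here the user's entry for o-diaeresis overrides the site's, while a-diaeresis still comes from the site table, so neither layer has to copy the whole table to change one character.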
> The disadvantages of this approach are often acceptable:
>
> a) No one-to-one mapping between Latin 1 and ASCII strings possible
> b) Text layout may be destroyed by multi-character substitutions

Add "..., especially in tables" to mention the case that is most
important in practice.

We would also suggest the addition of some text about a related
disadvantage:

: *) Truncation may be necessary to fit textual data into
:    fields of fixed width

> ... Users should be able to switch between the different
> tables and the 8-bit transparent normal mode. ...

Many users can get along with only one default conversion that
is always activated. For most others, one alternative conversion
is sufficient. We suggest this addition to the text:

: It should be possible and not difficult to configure a default
: conversion and possibly a default alternative conversion for
: the program on a per user basis.

> ... User defineable tables
> are always a nice possible extension.

In our opinion this is not only a nice feature but a user need
almost as important as the capability of switching between the
five conversions defined in this text. Write something like:

: A simple way for users to modify the conversion tables is also
: desirable.

> Users should know if the text they read has been converted from the
> original Latin 1 text. ...

Do you have in mind any specific way of visually indicating that
conversion takes place? Underlining converted characters?
Something else?

> ... This avoids confusion if e.g. someone asks for
> sending him a 3<fraction 1/2>" disk [3½"], which will be displayed
> after the conversion as 31/2" (= 15.5").

This particular problem is most easily solved, we suggest, by
converting the character not to "1/2" but to " 1/2", with an
initial space character.

> ...
> First of all, a table with the
> real characters in the range 160 - 255 (0xa0 - 0xff):

Two of the "high" characters of ISO 8859-1

     160 "A0 '240  NO-BREAK SPACE (NBSP)
     173 "AD '255  SOFT HYPHEN (SHY)

are not ordinary graphic characters but a sort of hybrid
character with both a graphic component and a control
component.

For soft hyphen the graphic component is an ordinary hyphen
glyph. The functional component is that this glyph should only
be displayed or printed if the character is at the end of a
line. If it is somewhere else in the line, _nothing_ should be
displayed or printed. In the simple, context-insensitive
conversion that we are dealing with here, SHY should be
converted to the empty string, since it will occur less often at
the end of a line than elsewhere.

> Table 0 is a universal table that is expected to be suitable for many
> languages. The letters are simply the ASCII versions without the
> diacritics. The character '?' as a fallback substitution where no ASCII
> string is suitable is used as little as possible ...

Even if it is used as little as possible we suggest that "_" is
a better fallback substitution than "?". Genuine question marks
can be important for the correct interpretation of a text. It is
unfortunate if they are mixed with replacement characters.

By the way, there is a case of a genuine conversion to "?" in
the tables, viz. of the Spanish inverted question mark. This of
course must not be replaced by "_".

For TABLE 0 we suggest the following changes:

0a:  164 "A4 '244  CURRENCY SIGN
     Now:         SUBST (general substitution character)
     Suggestion:  DOLLAR SIGN

This character was actually invented for the international
reference 7-bit character set of the first version of ISO 646.
It is the only character that can replace the dollar sign in
ISO 646-compatible character sets and no other character may be
allocated to code position 36.
Therefore these two characters have in practice become almost
equivalent in those countries where the currency sign is known
at all. (ISO has adopted a system for currency codes that
doesn't use the currency sign at all.)

0b:  173 "AD '255  SOFT HYPHEN (SHY)
     Now:         "-"
     Suggestion:  ""

0c:  175 "AF '257  MACRON
     Now:         SUBST
     Suggestion:  "-"

Macron is a diacritical mark used in semi-phonetic notation to
indicate that a vowel is to be pronounced with a long sound. In
fine typography it has the same width as the vowel letter and
its vertical position is adjusted to the height of the letter.
In ISO 8859-1 text it has to be written before or, preferably,
after the vowel. In the conversion to a 7-bit character set its
principal graphical form can still be preserved.

0d:  176 "B0 '260  DEGREE SIGN
     Now:         SUBST
     Suggestion:  "o"

This is most often used in numerical data and can, without risk
of misunderstanding, be substituted with the lowercase "o", as
is often done.

0e:  188 "BC '274  VULGAR FRACTION ONE QUARTER
     Now:         "1/4"
     Suggestion:  " 1/4"

0f:  189 "BD '275  VULGAR FRACTION ONE HALF
     Now:         "1/2"
     Suggestion:  " 1/2"

0g:  190 "BE '276  VULGAR FRACTION THREE QUARTERS
     Now:         "3/4"
     Suggestion:  " 3/4"

0h:  247 "F7 '367  DIVISION SIGN
     Now:         ":"
     Suggestion:  "-:"

This symbol has the meaning of subtraction in some countries and
some application fields. In addition, division is in some
countries normally indicated by "/" rather than ":". We
therefore suggest that the conversion should be neutral by
trying to approximate the appearance of the symbol, rather than
its meaning. "-:" is better than ":-", since the "-" can't be
misinterpreted as a minus on a following number.

These suggestions also apply to TABLE 3.
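The Table 0 suggestions (0a to 0h), together with "_" as fallback and the genuine "?" for the inverted question mark, could be sketched as a simple lookup. This is a minimal illustration and not Markus's table: it lists only the characters discussed in this article, so every other character in the range 160 - 255 falls through to the fallback here. The names SUGGESTED_TABLE_0 and to_ascii are hypothetical.

```python
# Hypothetical excerpt: only the entries discussed in this article.
SUGGESTED_TABLE_0 = {
    0xA4: "$",      # 0a: CURRENCY SIGN -> DOLLAR SIGN
    0xAD: "",       # 0b: SOFT HYPHEN -> empty string
    0xAF: "-",      # 0c: MACRON
    0xB0: "o",      # 0d: DEGREE SIGN
    0xBC: " 1/4",   # 0e: leading space keeps "3 1/2" from becoming "31/2"
    0xBD: " 1/2",   # 0f
    0xBE: " 3/4",   # 0g
    0xBF: "?",      # INVERTED QUESTION MARK: a genuine "?", not a fallback
    0xF7: "-:",     # 0h: DIVISION SIGN, approximating its appearance
}

FALLBACK = "_"      # suggested instead of "?" as substitution character


def to_ascii(text, table=SUGGESTED_TABLE_0, fallback=FALLBACK):
    """Context-insensitive conversion of a Latin-1 string to 7-bit ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)                      # ASCII passes through
        else:
            out.append(table.get(code, fallback))
    return "".join(out)


print(to_ascii('a 3\xbd" disk'))    # the disk example from the draft
print(to_ascii("co\xadoperate"))    # soft hyphen inside a line vanishes
```

Because the conversion is context-insensitive, a soft hyphen that really does sit at the end of a line is dropped as well, which is the trade-off accepted above.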
For TABLE 1 we suggest the following changes:

1a:  164 "A4 '244  CURRENCY SIGN
     Now:         SUBST (general substitution character)
     Suggestion:  DOLLAR SIGN

1b:  173 "AD '255  SOFT HYPHEN (SHY)
     Now:         "-"
     Suggestion:  ""

1c:  175 "AF '257  MACRON
     Now:         SUBST
     Suggestion:  "-"

1d:  176 "B0 '260  DEGREE SIGN
     Now:         SUBST
     Suggestion:  "o"

1e:  171 "AB '253  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
     Now:         '<'
     Suggestion:  '"'

The generic double quotation mark of ASCII can be used to reduce
the risk of confusion with a genuine "<" used as a left angle
parenthesis.

1f:  187 "BB '273  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
     Now:         '>'
     Suggestion:  '"'

1g:  188 "BC '274  VULGAR FRACTION ONE QUARTER
     Now:         SUBST
     Suggestion:  "/"

By using "/" instead of the general fallback character at least
we indicate that the real character was a vulgar fraction.

1h:  189 "BD '275  VULGAR FRACTION ONE HALF
     Now:         SUBST
     Suggestion:  "/"

1i:  190 "BE '276  VULGAR FRACTION THREE QUARTERS
     Now:         SUBST
     Suggestion:  "/"

> In some north european languages, any US-ASCII replacement for the
> relevant Latin 1 characters is unacceptable for many people. In these
> countries, national variants of 7-bit ISO 646 are still very popular.
> These use the codes of []{}\|~^`$ for national characters. ...

The last sentence is partially incorrect. Write instead: "These
use the codes of [ ] \ { } | in a consistent manner for national
letters. One of the codes also uses @ ^ ` ~ for additional
letters." (The spaces make it easier to identify the special
characters.)

> ... Table 2 has
> been designed for Danish, Finnish, Norwegian and Swedish users of ISO
> 646 terminals:

Unfortunately it uses @ ^ ` ~ as conversions for the letters
E WITH ACUTE and U WITH DIAERESIS. This will give good results
only with one of the two Swedish 7-bit character sets and
disastrous results with the other Swedish code and with the
Danish/Norwegian code.
2a:  201 "C9 '311  CAPITAL LETTER E WITH ACUTE
     Now:         "@"
     Suggestion:  "E"

2b:  233 "E9 '351  SMALL LETTER E WITH ACUTE
     Now:         "`"
     Suggestion:  "e"

2c:  220 "DC '334  CAPITAL LETTER U WITH DIAERESIS
     Now:         "^"
     Suggestion:  "Ue"

2d:  252 "FC '374  SMALL LETTER U WITH DIAERESIS
     Now:         "~"
     Suggestion:  "ue"

Finally we would like to say that we are impressed by Markus's
work. It identifies an important and neglected problem, is based
on principles that are technically sound, and offers a practical
solution. The alternative approaches used by people working with
EBCDIC-ASCII conversion in the IBM mainframe world and by Kermit
developers are less promising.

We are ourselves active in this area, and have designed a more
ambitious system, called "The C3 System for Character Code
Conversion", as part of a joint project between Swedish
universities. Actually, it's not quite finished yet.

The big difference is that Markus's solution only treats two
conversion paths:

     Latin-1  ->  ASCII
     Latin-1  ->  Scandinavian 7-bit character sets

The C3 system will provide 3 types of conversion - 1-1, legible,
and reversible - between any two of about 20 coded character
sets by means of conversion tables on 3 levels - prototype,
elementary, and working table.

An obvious complement to Markus's work is a system that makes
_input_ of all Latin-1 characters from a 7-bit terminal
possible. To make this reasonably user-friendly seems to be a
rather difficult task though.

--
Olle Jarnefors                           Internet: ojarnef@admin.kth.se
Information Management Services          UUCP: ...!uunet!mcsun!sunic!kth!ojarnef
Royal Institute of Technology (KTH)      BITNET: ojarnef@sekth
S-100 44 Stockholm, Sweden               Fax:   +46 8 10 25 10
                                         Phone: +46 8 790 71 26 (time zone +0100)
---
Peter Svanberg, NADA, KTH                Email: psv@nada.kth.se
Dept of Num An & CS, Royal Inst of Tech  Phone: +46 8 790 71 46
S-100 44 Stockholm, SWEDEN               Fax:   +46 8 790 09 30