NetNews Usenet Archive 1992 #30

Internet Message Format | 1992-12-21 | 6.5 KB

Path: sparky!uunet!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!fauern!uni-erlangen.de!not-for-mail
From: unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn)
Newsgroups: comp.std.internat
Subject: Re: ISO Latin 1 to 7-bit ASCII conversion (final draft!)
Date: 18 Dec 1992 19:14:12 +0100
Organization: Regionales Rechenzentrum Erlangen
Message-ID: <1gt4dkEINNi5i@uni-erlangen.de>
References: <1gi1rnEINN1cg@uni-erlangen.de> <1992Dec16.165027.9152@admin.kth.se>
Reply-To: mskuhn@immd4.informatik.uni-erlangen.de
NNTP-Posting-Host: cd4680fs.rrze.uni-erlangen.de
Lines: 152
Keywords: character sets, ISO 8859-1, terminals, user interface

ojarnef@admin.kth.se, psv@nada.kth.se (Olle Jarnefors) writes:

>> Users should know if the text they read has been converted from the
>> original Latin 1 text. ...

>Do you have in mind any specific way of visually indicating that
>conversion takes place? Underlining converted characters?
>Something else?

I just had in mind a good explanation in the documentation and a
reminder message after program start. Extra characters like [(c)] and
[x] make the replacements longer again and destroy the layout even
more. In some applications they might be of use, so I'll describe them
as a possible option. If underlining etc. is possible, that would be
more attractive. But powerful terminals that allow underlining, bold
face etc. often also provide ISO 8859-1, and then we have the only REAL
solution for the whole problem.

BTW: Kermit translates <copyright> to @, which looks similar, but has
already confused me a lot. But reading USENET articles about
transcriptions using a transcription system is always very confusing.

>> ... This avoids confusion if e.g. someone asks for
>> sending him a 3<fraction 1/2>" disk [3="], which will be displayed
>> after the conversion as 31/2" (= 15.5").
>This particular problem is most easily solved, we suggest, by
>converting the character not to "1/2" but to " 1/2", with an
>initial space character.

This was only one example from a long list of possible problems that
can't be solved by a non 1-1 mapping solution. 1-1 mapping solutions
(e.g. [a:] according to RFC 1345) have the problem that you need to
transform the possible pure ASCII sequences like [, a, : and ] with an
escape mechanism. This will modify even 7-bit texts, and that was not
my intention. I don't want to design a strict encoding, but something
that makes reading e.g. 8-bit USENET articles easier on old terminals.

>Two of the "high" characters of ISO 8859-1
>160 "A0 '240 NO-BREAK SPACE (NBSP)
>173 "AD '255 SOFT HYPHEN (SHY)
>are not ordinary graphic characters but a sort of hybrid
>characters with both a graphic component and a control
>component.

>For soft hyphen the graphic component is an ordinary hyphen
>glyph. The functional component is that this glyph should only
>be displayed or printed if the character is at the end of a
>line. If it is somewhere else in the line, _nothing_ should be
>displayed or printed.

I agree with you completely here, and that is how I would use these
characters if I had to develop a simple text editor with a few word
processing functions. WordStar users will be very familiar with the
SHY and NBSP characters. But the text of ISO 8859-1:1987(E) does not
define the functionality you describe in your second and third
sentences.

>In the simple, context-insensitive conversion that we are
>dealing with here, SHY should be converted to the empty string,
>since it will occur less often at the end of a line than
>elsewhere.

NO! ISO 8859-1 and I absolutely disagree with you here. SHY has to be
displayed as something similar to a hyphen. If you remove SHYs that
are not at the end of the line or are not followed by a space, then
this might be acceptable, but please NEVER remove SHYs at the end of
the line.
Not even in the trivial context-insensitive case that I selected in
order to keep things simple, in the hope that many PD developers will
use the system.

>For TABLE 0 we suggest the following changes:

>0b: 173 "AD '255 SOFT HYPHEN (SHY)
>    Now: "-"
>    Suggestion: ""

No, see above.

>0c: 175 "AF '257 MACRON
>    Now: SUBST
>    Suggestion: "-"

My first suggestion was " ", but Steve Summit insisted on SUB. Perhaps
"-" is the best solution, especially if MACRON becomes popular for
underlining the previous line.

>0d: 176 "B0 '260 DEGREE SIGN
>    Now: SUBST
>    Suggestion: "o"
>    This is most often used in numerical data and can, without
>    risk of misunderstanding, be substituted with the lowercase
>    "o", as is often done.

A better suggestion was " ", as 25 C and 23 34' 44'' will still be
understood. I'll change this to " ".

>0e: 188 "BC '274 VULGAR FRACTION ONE QUARTER
>    Now: "1/4"
>    Suggestion: " 1/4"

One of my goals was to keep the length below 3. There are many other
strings that might cause confusion. In a context-sensitive system,
this surely would make sense.

> DIVISION SIGN
>    Suggestion: "-:"
>    This symbol has the meaning of subtraction in some countries
>    and some application fields. In addition, division is
>    in some countries normally indicated by "/" rather than ":".
>    We therefore suggest that the conversion should be neutral
>    by trying to approximate the appearance of the symbol,
>    rather than its meaning. "-:" is better than ":-", since
>    the "-" can't be misinterpreted as a minus on a following
>    number.

I didn't know this, as both DIVISION SIGN and : are used in Germany
for division. "-:" seems quite artificial, so if ":" really causes
much confusion, SUB may be better here. Which countries use ":" for
subtraction?

>1f: 187 "BB '273 RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
>    Now: '>'
>    Suggestion: '"'

Why?

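The TABLE 0 entries discussed so far can be pictured with a minimal
sketch of a context-insensitive conversion. This is not the
Latin1toASCII function from the draft: the names TABLE0, SUB and
latin1_to_ascii are hypothetical, the fallback character "?" is an
assumption, and the MACRON entry uses the "-" that is only tentatively
favoured above.

```python
# Hypothetical excerpt of a context-insensitive Latin-1 -> ASCII table.
# Only entries mentioned in this thread are listed; everything else
# falls back to SUB, whose value "?" is an assumption.
SUB = "?"

TABLE0 = {
    0xAD: "-",    # SOFT HYPHEN (SHY): kept visible, never dropped
    0xAF: "-",    # MACRON (tentative choice)
    0xB0: " ",    # DEGREE SIGN
    0xBC: "1/4",  # VULGAR FRACTION ONE QUARTER
    0xBD: "1/2",  # VULGAR FRACTION ONE HALF
}


def latin1_to_ascii(data: bytes) -> str:
    """Replace every byte >= 0x80 by its table string, or by SUB."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # 7-bit text passes through unchanged
        else:
            out.append(TABLE0.get(byte, SUB))
    return "".join(out)


print(latin1_to_ascii(b'a 3\xbd" disk at 25\xb0C'))
# prints: a 31/2" disk at 25 C
```

The printed line shows both effects mentioned in the thread: the
fraction runs into the preceding digit, while the degree sign replaced
by a space stays readable.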
>1g: 188 "BC '274 VULGAR FRACTION ONE QUARTER
>    Now: SUBST
>    Suggestion: "/"
>    By using "/" instead of the general fallback character at
>    least we indicate that the real character was a vulgar
>    fraction.

The important information has been lost, and I would prefer one single
fallback character.

Thank you for your comments. I'll include at least some of them in my
text.

BTW: There is a serious bug in the Latin1toASCII function and only one
person has detected it so far ...

Markus

-- 
Markus Kuhn, Computer Science student -=-=- University of Erlangen, Germany
Internet: mskuhn@immd4.informatik.uni-erlangen.de  |  X.500 entry available
----- Anyone participating in the use of MS-DOS, Heroin or Cocaine is -----
---- simply not getting the most out of life possible. (Brian Downing) ----
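
The single-character table (entries 1f and 1g above) can be sketched
in the same hedged way. The names TABLE1, FALLBACK and
latin1_to_ascii_1to1 are hypothetical, and "?" as the single fallback
character is an assumption; only the two entries quoted in this
message are modelled.

```python
# Hypothetical sketch of a one-character-per-character conversion.
# Each 8-bit character becomes exactly one ASCII character, so the
# text layout (column positions) is preserved.
FALLBACK = "?"  # assumed single fallback character

TABLE1 = {
    0xBB: ">",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    # 0xBC (VULGAR FRACTION ONE QUARTER) is deliberately absent:
    # it gets the general fallback rather than "/".
}


def latin1_to_ascii_1to1(data: bytes) -> str:
    """Map each byte to exactly one ASCII character."""
    return "".join(
        chr(b) if b < 0x80 else TABLE1.get(b, FALLBACK) for b in data
    )


text = b"\xbb 3\xbc cup"
converted = latin1_to_ascii_1to1(text)
print(converted)                     # prints: > 3? cup
print(len(converted) == len(text))   # prints: True
```

Because every replacement has length one, the output always has the
same length as the input, which is the property that the longer
TABLE 0 strings give up.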