NetNews Usenet Archive 1992 #30

Text File | 1992-12-14 | 13.4 KB | 290 lines

Xref: sparky comp.std.internat:833 soc.culture.nordic:7849 soc.culture.german:9228
Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!fauern!uni-erlangen.de!not-for-mail
From: unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn)
Newsgroups: comp.std.internat,soc.culture.nordic,soc.culture.german
Subject: ISO Latin 1 to 7-bit ASCII conversion (final draft!)
Date: 14 Dec 1992 14:23:03 +0100
Organization: Regionales Rechenzentrum Erlangen
Message-ID: <1gi1rnEINN1cg@uni-erlangen.de>
Reply-To: mskuhn@immd4.informatik.uni-erlangen.de
NNTP-Posting-Host: cd4680fs.rrze.uni-erlangen.de
Lines: 276
Keywords: character sets, ISO 8859-1, terminals, user interface

I received quite a number of suggestions for improving the conversion
tables which I posted to comp.std.internat a few weeks ago. Before I
send the following text to a number of public domain software authors,
I'd like to ask you again for final comments on these tables.

Would you, as a user who still has to live with a 7-bit terminal and
who receives ISO Latin 1 texts, be happy if your favourite piece of
networking software were able to display the characters as described
below? If not, what would you like to change?

Markus

------------------------------------------------------------------------

Representation of ISO 8859-1 characters with 7-bit ASCII
--------------------------------------------------------

Markus Kuhn -- 1992-12-13

SUMMARY: This text describes a technique for displaying the 8-bit
ISO 8859-1 character set, which is used today in many modern network
services, on old 7-bit terminals. Authors of software dealing with
text received from international networks are strongly encouraged to
implement this or similar methods as options in their software for
the convenience of users all over the world. Implementation is often
trivial.

The "Latin alphabet No. 1" defined in part 1 of the international
standard

  ISO 8859:1987 Information processing -- 8-bit single-byte coded
  graphic character sets

is an increasingly popular 8-bit extension of the traditional 7-bit
US-ASCII character set. It is already supported by many operating
systems, and its 191 graphic characters include those used in the
following 14 languages: Danish, Dutch, English, Faeroese, Finnish,
French, German, Icelandic, Irish, Italian, Norwegian, Portuguese,
Spanish and Swedish. ISO 8859-1 contains graphic characters used in
at least 44 countries.

Many people believe that ISO Latin 1 is already the de-facto
replacement of the old 7-bit US-ASCII character set. In addition, the
first 256 characters of the new 16-bit character set ISO 10646/Unicode,
which contains nearly all characters used on this planet and is
expected to be the final solution of most of today's character set
troubles, are identical with ISO 8859-1.

ISO 8859-1 uses only the codes 32-126 (which are identical with
US-ASCII) and 160-255. The positions 0-31 and 127-159 are reserved for
control codes and are normally used in the same way as with ASCII.

As this character set gains more and more popularity in international
data communication (e.g. the Internet gopher service, the Internet
MIME, parts of USENET), the need arises to extend existing software
with the ability to display strings containing ISO 8859-1 characters
on old hardware that is only capable of displaying 7-bit US-ASCII
characters.

Today, many users of old hardware suffer from seeing the Latin 1
characters between 160 and 255 displayed as the corresponding US-ASCII
characters with the highest bit cleared. They see e.g. a ')' instead of
the copyright symbol. Pessimists expect that these old 7-bit terminals
will be in use for at least the next ten years.
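
The effect described above is easy to reproduce. The following C
fragment is only an illustrative sketch; the helper names
strip_high_bit and show_stripped are hypothetical and are not part of
the tables proposed in this text. It shows what a terminal that clears
the highest bit makes of a few Latin 1 codes:

```c
#include <stdio.h>

/* Hypothetical helper (an assumption for illustration): the byte a
   7-bit terminal ends up displaying when it drops the highest bit. */
unsigned char strip_high_bit(unsigned char c)
{
    return (unsigned char)(c & 0x7f);
}

/* Print the stripped form of three Latin 1 codes:
   0xa9 (copyright sign), 0xe4 (a with diaeresis),
   0xfc (u with diaeresis). */
void show_stripped(void)
{
    static const unsigned char samples[] = { 0xa9, 0xe4, 0xfc };
    size_t i;

    for (i = 0; i < sizeof samples; i++)
        printf("0x%02x -> '%c'\n", samples[i], strip_high_bit(samples[i]));
}
```

Calling show_stripped() prints ')' for 0xa9, 'd' for 0xe4 and '|' for
0xfc, since 0xa9 & 0x7f is 0x29, 0xe4 & 0x7f is 0x64 and 0xfc & 0x7f
is 0x7c.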

One approach for such a Latin 1 to ASCII conversion is to use the
replacements that people traditionally use when they have to live
with a US-ASCII keyboard. This seems to be the most natural method,
and it often won't even be noticed by users who already use these
traditional replacements on their old hardware today. The
disadvantages of this approach are often acceptable:

  a) No one-to-one mapping between Latin 1 and ASCII strings is
     possible
  b) Text layout may be destroyed by multi-character substitutions
  c) Different replacements may be in use for different languages

Apart from buying new hardware, there is no optimal solution to the
problem of displaying text with ISO Latin 1 characters on old
terminals. The conversion tables proposed here are only intermediate
solutions. They are intended to make life easier for people who
currently get Latin 1 characters displayed as the corresponding 7-bit
US-ASCII symbols with the highest bit cleared, which is awful and
frustrates the users of old hardware.

Including the tables below in programs like mail user agents, news
readers, gopher clients, file browsers, tty drivers etc. is often a
trivial task. Users should be able to switch between the different
tables and the 8-bit transparent normal mode. User-definable tables
are always a nice possible extension.

Users should know if the text they read has been converted from the
original Latin 1 text. This avoids confusion if e.g. someone asks to
be sent a 3<fraction 1/2>" disk [3½"], which will be displayed after
the conversion as 31/2" (= 15.5").

I collected four tables, based on information I received from many
USENET readers in various countries, in order to cope with the
different needs of ISO Latin 1 users.
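
The requirement that users should know when a text has been converted
can be sketched in C as follows. This is only an assumption about how
a program might organise it: the enum display_mode and the functions
contains_latin1_g1 and should_mark_converted are hypothetical names,
not part of the proposal. The idea is that a notice is only needed
when a table is selected and the text really contains characters in
the range 0xa0-0xff:

```c
#include <stddef.h>

/* Hypothetical display modes: 8-bit transparent mode plus the four
   conversion tables (the numbering 0-3 follows the tables). */
enum display_mode {
    MODE_TRANSPARENT = -1,
    MODE_TABLE0 = 0,
    MODE_TABLE1 = 1,
    MODE_TABLE2 = 2,
    MODE_TABLE3 = 3
};

/* Returns 1 if s contains at least one character in 0xa0-0xff,
   i.e. a character that a conversion table would replace. */
int contains_latin1_g1(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (s == NULL)
        return 0;
    for (; *p; p++)
        if (*p >= 0xa0)
            return 1;
    return 0;
}

/* Returns 1 if the program should tell the user that the displayed
   text differs from the original Latin 1 text. */
int should_mark_converted(const char *s, int mode)
{
    return mode != MODE_TRANSPARENT && contains_latin1_g1(s);
}
```

How the notice is shown (a status line, a header field, a marker
character) is left open here; the sketch only decides whether one is
needed.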

First of all, a table with the real characters in the range
160 - 255 (0xa0 - 0xff):

      ¡   ¢   £   ¤   ¥   ¦   §   ¨   ©   ª   «   ¬   ­   ®   ¯
  °   ±   ²   ³   ´   µ   ¶   ·   ¸   ¹   º   »   ¼   ½   ¾   ¿
  À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
  Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
  à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
  ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ

(The first position, 0xa0, is the no-break space, and position 0xad
is the soft hyphen; both may be invisible here.)

Table 0 is a universal table that is expected to be suitable for many
languages. The letters are simply the ASCII versions without the
diacritics. The fallback character '?' is used as sparingly as
possible, and only where no ASCII string is suitable: a '?' carries no
information, and if we were pedantic we would have to replace nearly
every Latin 1 character above 160 with a question mark.

      !   c   ?   ?   Y   |   ?   "   (c) a   <<  -   -   (R) ?
  ?   +/- 2   3   '   mu  P   .   ,   1   o   >>  1/4 1/2 3/4 ?
  A   A   A   A   A   A   AE  C   E   E   E   E   I   I   I   I
  D   N   O   O   O   O   O   x   O   U   U   U   U   Y   Th  ss
  a   a   a   a   a   a   ae  c   e   e   e   e   i   i   i   i
  d   n   o   o   o   o   o   :   o   u   u   u   u   y   th  y

Table 1 replaces Latin 1 characters only with single ASCII characters.
This won't destroy the layout of texts designed to be printed with
monospaced fonts, but the replacements are often not very
satisfactory:

      !   c   ?   ?   Y   |   ?   "   c   a   <   -   -   R   ?
  ?   ?   2   3   '   u   P   .   ,   1   o   >   ?   ?   ?   ?
  A   A   A   A   A   A   A   C   E   E   E   E   I   I   I   I
  D   N   O   O   O   O   O   x   O   U   U   U   U   Y   T   s
  a   a   a   a   a   a   a   c   e   e   e   e   i   i   i   i
  d   n   o   o   o   o   o   :   o   u   u   u   u   y   t   y

In some languages, merely removing the diacritics as in table 0 gives
orthographically incorrect and inappropriate results. The following
table 2 might be much more suitable than table 0 for Danish, Dutch,
German, Norwegian and Swedish:

      !   c   ?   ?   Y   |   ?   "   (c) a   <<  -   -   (R) ?
  ?   +/- 2   3   '   mu  P   .   ,   1   o   >>  1/4 1/2 3/4 ?
  A   A   A   A   Ae  Aa  AE  C   E   E   E   E   I   I   I   I
  D   N   O   O   O   O   Oe  x   O   U   U   U   Ue  Y   Th  ss
  a   a   a   a   ae  aa  ae  c   e   e   e   e   i   i   i   i
  d   n   o   o   o   o   oe  :   o   u   u   u   ue  y   th  ij

In some northern European languages, any US-ASCII replacement for the
relevant Latin 1 characters is unacceptable to many people. In these
countries, national variants of 7-bit ISO 646 are still very popular.
These use the codes of []{}\|~^`$ for national characters.
Table 3 has been designed for Danish, Finnish, Norwegian and Swedish
users of ISO 646 terminals:

      !   c   ?   $   Y   |   ?   "   (c) a   <<  -   -   (R) ?
  ?   +/- 2   3   '   mu  P   .   ,   1   o   >>  1/4 1/2 3/4 ?
  A   A   A   A   [   ]   [   C   E   @   E   E   I   I   I   I
  D   N   O   O   O   O   \   x   \   U   U   U   ^   Y   Th  ss
  a   a   a   a   {   }   {   c   e   `   e   e   i   i   i   i
  d   n   o   o   o   o   |   :   |   u   u   u   ~   y   th  y

For the convenience of C programmers, I have included the code of
these tables in this text. Just copy the following lines into your
software:

-----------------------------------------------------------------------
/*
  Conversion tables for displaying the G1 set (0xa0-0xff) of
  ISO Latin 1 (ISO 8859-1) with 7-bit ASCII characters.

  Suggestions for improvement are welcome!

  Table   Purpose
    0     universal table for many languages
    1     single-spacing universal table
    2     table for Danish, Dutch, German, Norwegian and Swedish
    3     table for Danish, Finnish, Norwegian and Swedish
          using ISO 646

  Markus Kuhn <mskuhn@immd4.informatik.uni-erlangen.de>
*/

#define SUB "?"       /* used if no reasonable ASCII string is possible */
#define ISO_TABLES 4

/* iso2asc[t][c - 0xa0] is the replacement string for code c in table t */
static const char *iso2asc[ISO_TABLES][96] = {{
  " ","!","c",SUB,SUB,"Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
  SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
  "A","A","A","A","A","A","AE","C","E","E","E","E","I","I","I","I",
  "D","N","O","O","O","O","O","x","O","U","U","U","U","Y","Th","ss",
  "a","a","a","a","a","a","ae","c","e","e","e","e","i","i","i","i",
  "d","n","o","o","o","o","o",":","o","u","u","u","u","y","th","y"
},{
  " ","!","c",SUB,SUB,"Y","|",SUB,"\"","c","a","<","-","-","R",SUB,
  SUB,SUB,"2","3","'","u","P",".",",","1","o",">",SUB,SUB,SUB,"?",
  "A","A","A","A","A","A","A","C","E","E","E","E","I","I","I","I",
  "D","N","O","O","O","O","O","x","O","U","U","U","U","Y","T","s",
  "a","a","a","a","a","a","a","c","e","e","e","e","i","i","i","i",
  "d","n","o","o","o","o","o",":","o","u","u","u","u","y","t","y"
},{
  " ","!","c",SUB,SUB,"Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
  SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
  "A","A","A","A","Ae","Aa","AE","C","E","E","E","E","I","I","I","I",
  "D","N","O","O","O","O","Oe","x","O","U","U","U","Ue","Y","Th","ss",
  "a","a","a","a","ae","aa","ae","c","e","e","e","e","i","i","i","i",
  "d","n","o","o","o","o","oe",":","o","u","u","u","ue","y","th","ij"
},{
  " ","!","c",SUB,"$","Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
  SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
  "A","A","A","A","[","]","[","C","E","@","E","E","I","I","I","I",
  "D","N","O","O","O","O","\\","x","\\","U","U","U","^","Y","Th","ss",
  "a","a","a","a","{","}","{","c","e","`","e","e","i","i","i","i",
  "d","n","o","o","o","o","|",":","|","u","u","u","~","y","th","y"
}};
-----------------------------------------------------------------------

One might perhaps replace the "?" in SUB with another code that will
be displayed as a blinking question mark or something similar. Then
the user will know that the software is telling him/her that it can't
display this symbol, and that it isn't a real question mark.

If your software runs on hardware that already supports another 8-bit
character set (e.g. IBM PC with code page 437, etc.), then it might be
a much better idea to include only one single table that uses the
supported symbols as well as possible and uses the strings suggested
here only if no better alternative is available. (BTW: IBM code page
850, which is supported by MS-DOS and OS/2, contains ALL Latin 1
characters, but at other positions, in order to stay compatible with
the old IBM PC character set.)

The following string conversion routine uses these tables. It may
easily be called before a text received from the network is sent to
the terminal, if the user has selected one of the tables:

-----------------------------------------------------------------------
#include <stddef.h>   /* NULL */

/*
 *  Transform an 8-bit ISO Latin 1 string s1 into a 7-bit ASCII string s2
 *  readable on old terminals using conversion table t.
 *
 *  worst case: strlen(s2) == 3*strlen(s1), so s2 must provide room
 *  for 3*strlen(s1)+1 characters.
 *
 *  Nothing is written if s1 or s2 is NULL or if t is not a valid
 *  table number (0 .. ISO_TABLES-1).
 */
void Latin1toASCII(const unsigned char *s1, unsigned char *s2, int t)
{
  const char *p;

  if (s1 == NULL || s2 == NULL || t < 0 || t >= ISO_TABLES) return;
  while (*s1) {
    if (*s1 > 0x9f) {
      /* G1 character: copy its replacement string */
      p = iso2asc[t][*(s1++) - 0xa0];
      while (*p) *(s2++) = (unsigned char) *(p++);
    } else {
      /* ASCII and control codes are passed through unchanged */
      *(s2++) = *(s1++);
    }
  }
  *s2 = '\0';   /* terminate the converted string */
}
-----------------------------------------------------------------------

Many users all over the world are looking forward to your next
software release that will allow them to participate without pain in
the world of 8-bit character communication even before they get modern
hardware with ISO 8859-1 (or even better ISO 10646) character sets.

Feel free to contact me or the experts in the USENET group
comp.std.internat if you have any questions about modern character
sets.

Many thanks to everyone who helped me to improve these tables!

Markus

--
Markus Kuhn, Computer Science student -=-=- University of Erlangen, Germany
Internet: mskuhn@immd4.informatik.uni-erlangen.de   |   X.500 entry available
 ----- Anyone participating in the use of MS-DOS, Heroin or Cocaine is -----
 ---- simply not getting the most out of life possible. (Brian Downing) ----
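
A possible follow-up to the Latin1toASCII routine above: since the
replacement strings can triple the length of a text, a caller with a
fixed-size buffer may prefer a variant that is told the buffer size.
The following is only a sketch under stated assumptions. The name
latin1_to_ascii_n, its return value and its truncation behaviour are
hypothetical choices and not part of the original proposal, and only
table 0 is reproduced so that the example is self-contained:

```c
#include <stddef.h>

/* Table 0 (universal table) for the codes 0xa0-0xff, copied from the
   tables above with SUB written out as "?". */
static const char *const table0[96] = {
  " ","!","c","?","?","Y","|","?","\"","(c)","a","<<","-","-","(R)","?",
  "?","+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
  "A","A","A","A","A","A","AE","C","E","E","E","E","I","I","I","I",
  "D","N","O","O","O","O","O","x","O","U","U","U","U","Y","Th","ss",
  "a","a","a","a","a","a","ae","c","e","e","e","e","i","i","i","i",
  "d","n","o","o","o","o","o",":","o","u","u","u","u","y","th","y"
};

/* Hypothetical bounded variant (an assumption, not the posted routine):
   converts s1 into s2 using table 0, never writes more than n bytes
   including the terminating NUL, and silently truncates if the result
   does not fit. Returns the number of characters stored, not counting
   the NUL. */
size_t latin1_to_ascii_n(const char *s1, char *s2, size_t n)
{
    const unsigned char *in = (const unsigned char *)s1;
    size_t out = 0;

    if (s1 == NULL || s2 == NULL || n == 0)
        return 0;
    while (*in && out + 1 < n) {
        if (*in >= 0xa0) {
            /* G1 character: copy as much of the replacement as fits */
            const char *p = table0[*in - 0xa0];
            while (*p && out + 1 < n)
                s2[out++] = *p++;
        } else {
            /* ASCII and control codes pass through unchanged */
            s2[out++] = (char)*in;
        }
        in++;
    }
    s2[out] = '\0';
    return out;
}
```

With a 32-byte buffer buf, the call
latin1_to_ascii_n("\xa9 1992 K\xf6ln", buf, sizeof buf) stores
"(c) 1992 Koln" and returns 13. Truncating in the middle of a
replacement string is a design choice of this sketch; a program might
equally well drop the whole replacement if it does not fit.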