home *** CD-ROM | disk | FTP | other *** search
- Xref: sparky comp.std.internat:833 soc.culture.nordic:7849 soc.culture.german:9228
- Path: sparky!uunet!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!ira.uka.de!fauern!uni-erlangen.de!not-for-mail
- From: unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn)
- Newsgroups: comp.std.internat,soc.culture.nordic,soc.culture.german
- Subject: ISO Latin 1 to 7-bit ASCII conversion (final draft!)
- Date: 14 Dec 1992 14:23:03 +0100
- Organization: Regionales Rechenzentrum Erlangen
- Message-ID: <1gi1rnEINN1cg@uni-erlangen.de>
- Reply-To: mskuhn@immd4.informatik.uni-erlangen.de
- NNTP-Posting-Host: cd4680fs.rrze.uni-erlangen.de
- Lines: 276
- Keywords: character sets, ISO 8859-1, terminals, user interface
-
- I received quite a number of suggestions for improvement of the conversion
- tables which I posted to comp.std.internat a few weeks ago. Before I
- will send the following text to a number of public domain software
- authors, I'd like to ask you again for a final comment on these
- tables. Would you as a user who has still to live with an 7-bit
- terminal and who receives ISO Latin 1 textes be happy, if your
- favourite piece of networking software would be able to display
- the characters as described below? If not, what would you like
- to change?
-
- Markus
-
- ------------------------------------------------------------------------
- Representation of ISO 8859-1 characters with 7-bit ASCII
- --------------------------------------------------------
-
- Markus Kuhn -- 1992-12-13
-
- SUMMARY: This text describes a technique of displaying the 8-bit
- character set, which is used today in many modern network services, on
- old 7-bit terminals. Authors of software dealing with text received
- from international networks are strongly encouraged to implement this
- or similar methods as options in their software for the convenience of
- users all over the world. Implementation is often trivial.
-
-
-
- The "Latin alphabet No. 1" defined in part 1 of the international
- standard
-
- ISO 8859:1987 Information processing -- 8-bit single-byte
- coded graphic character sets
-
- is an increasingly popular 8-bit extension of the traditional 7-bit
- US-ASCII character set. It is already supported by many operating
- systems and its 191 graphic characters include those used in the
- following 14 languages:
-
- Danish, Dutch, English, Faeroese, Finnish, French, German,
- Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and
- Swedish.
-
- ISO 8859-1 contains graphic characters used in at least 44 countries.
- Many people believe that ISO Latin 1 is already the de-facto
- replacement of the old 7-bit US-ASCII character set. In addition, the
- first 256 characters of the new 16-bit character set ISO 10646/Unicode,
- which contains nearly all characters used on this planet and is
- expected to be the final solution of most of today's character set
- troubles, are identical with ISO 8859-1.
-
- ISO 8859-1 uses only the codes 32-126 (which are identical with
- US-ASCII) and 160-255. The positions 0-31 and 127-159 are reserved for
- control codes and normally used in the same way in which they are used
- with ASCII.
-
- As this character set gains more and more popularity in international
- data communication (e.g. the Internet gopher service, the Internet
- MIME, parts of USENET), the need arises to extend existing software
- with the ability of displaying strings containing ISO 8859-1 characters
- on old hardware that is only capable of displaying 7-bit US-ASCII
- characters. Today, many users of old hardware suffer from getting the
- Latin 1 characters between 160 and 255 only displayed as the
- corresponding US-ASCII characters with the highest bit cleared. Then
- they see e.g. a ')' instead of the copyright symbol. Pessimists expect
- that these old 7-bit terminals will be in use at least for the next ten
- years.
-
- One approach for such a Latin 1 to ASCII conversion is to use the
- replacements, that people use traditionally, when they have to live
- with an US-ASCII keyboard. This seems to be the most natural method,
- which often won't even be noticed by users that use these traditional
- replacements already today on their old hardware.
-
- The disadvantages of this approach are often acceptable:
-
- a) No one-to-one mapping between Latin 1 and ASCII strings possible
- b) Text layout may be destroyed by multi-character substitutions
- c) Different replacements may be in use for different languages
-
- There is no optimal solution possible for the problem of displaying
- text with ISO Latin 1 characters on old terminals apart from buying new
- hardware. The conversion tables proposed here are only intermediate
- solutions that are intended to make life easier for people who get
- Latin 1 characters currently displayed as the corresponding 7-bit
- US-ASCII symbols with the highest bit cleared, which is awful and
- frustrates the users of old hardware.
-
- Including the tables below in programs like mail user agents, news
- readers, gopher clients, file browsers, tty drivers etc. is often a
- trivial task. Users should be able to switch between the different
- tables and the 8-bit transparent normal mode. User defineable tables
- are always a nice possible extension.
-
- Users should know if the text they read has been converted from the
- original Latin 1 text. This avoids confusion if e.g. someone asks for
- sending him a 3<fraction 1/2>" disk [3╜"], which will be displayed
- after the conversion as 31/2" (= 15.25").
-
- I collected four tables based on information I received from many
- USENET readers from various countries in order to cope with the
- different needs of ISO Latin 1 users. First of all, a table with the
- real characters in the range 160 - 255 (0xa0 - 0xff):
-
-
- á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
- ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
- └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
- ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
- α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
- ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■
-
- Table 0 is a universal table that is expected to be suitable for many
- languages. The letters are simply the ASCII versions without the
- diacritics. The character '?' as a fallback substitution where no ASCII
- string is suitable is used as little as possible, as '?' carries no
- information and if we are pedantic, we have to replace nearly every
- Latin 1 character over 160 by question marks.
-
- ! c ? ? Y | ? " (c) a << - - (R) ?
- ? +/- 2 3 ' mu P . , 1 o >> 1/4 1/2 3/4 ?
- A A A A A A AE C E E E E I I I I
- D N O O O O O x O U U U U Y Th ss
- a a a a a a ae c e e e e i i i i
- d n o o o o o : o u u u u y th y
-
- Table 1 replaces Latin 1 characters only with single ASCII characters.
- This won't destroy the layout of textes designed to be printed with
- monospaced fonts, but the replacements are often not very satisfactory:
-
- ! c ? ? Y | ? " c a < - - R ?
- ? ? 2 3 ' u P . , 1 o > ? ? ? ?
- A A A A A A A C E E E E I I I I
- D N O O O O O x O U U U U Y T s
- a a a a a a a c e e e e i i i i
- d n o o o o o : o u u u u y t y
-
- In some languages, only removing the diacritics as in table 0 gives
- orthographically incorrect and unappropriate results. The following
- table 2 might be much more suitable that table 0 in Danish, Dutch,
- German, Norwegian and Swedish:
-
- ! c ? ? Y | ? " (c) a << - - (R) ?
- ? +/- 2 3 ' mu P . , 1 o >> 1/4 1/2 3/4 ?
- A A A A Ae Aa AE C E E E E I I I I
- D N O O O O Oe x O U U U Ue Y Th ss
- a a a a ae aa ae c e e e e i i i i
- d n o o o o oe : o u u u ue y th ij
-
- In some north european languages, any US-ASCII replacement for the
- relevant Latin 1 characters is unacceptable for many people. In these
- countries, national variants of 7-bit ISO 646 are still very popular.
- These use the codes of []{}\|~^`$ for national characters. Table 2 has
- been designed for Danish, Finnish, Norwegian and Swedish users of ISO
- 646 terminals:
-
- ! c ? $ Y | ? " (c) a << - - (R) ?
- ? +/- 2 3 ' mu P . , 1 o >> 1/4 1/2 3/4 ?
- A A A A [ ] [ C E @ E E I I I I
- D N O O O O \ x \ U U U ^ Y Th ss
- a a a a { } { c e ` e e i i i i
- d n o o o o | : | u u u ~ y th y
-
-
- For the convenience of C programmers, I included the code of these
- tables in this text. Just copy the following lines in your software:
-
- -----------------------------------------------------------------------
- /* Conversion tables for displaying the G1 set (0xa0-0xff) of
- ISO Latin 1 (ISO 8859-1) with 7-bit ASCII characters.
- Suggestions for improvement are welcome!
-
- Table Purpose
- 0 universal table for many languages
- 1 single-spacing universal table
- 2 table for Danish, Dutch, German, Norwegian and Swedish
- 3 table for Danish, Finnish, Norwegian and Swedish using ISO 646
-
- Markus Kuhn <mskuhn@immd4.informatik.uni-erlangen.de> */
-
- #define SUB "?" /* used if no reasonable ASCII string is possible */
- #define ISO_TABLES 4
-
- static unsigned char *iso2asc[ISO_TABLES][96] = {{
- " ","!","c",SUB,SUB,"Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
- SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
- "A","A","A","A","A","A","AE","C","E","E","E","E","I","I","I","I",
- "D","N","O","O","O","O","O","x","O","U","U","U","U","Y","Th","ss",
- "a","a","a","a","a","a","ae","c","e","e","e","e","i","i","i","i",
- "d","n","o","o","o","o","o",":","o","u","u","u","u","y","th","y"
- },{
- " ","!","c",SUB,SUB,"Y","|",SUB,"\"","c","a","<","-","-","R",SUB,
- SUB,SUB,"2","3","'","u","P",".",",","1","o",">",SUB,SUB,SUB,"?",
- "A","A","A","A","A","A","A","C","E","E","E","E","I","I","I","I",
- "D","N","O","O","O","O","O","x","O","U","U","U","U","Y","T","s",
- "a","a","a","a","a","a","a","c","e","e","e","e","i","i","i","i",
- "d","n","o","o","o","o","o",":","o","u","u","u","u","y","t","y"
- },{
- " ","!","c",SUB,SUB,"Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
- SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
- "A","A","A","A","Ae","Aa","AE","C","E","E","E","E","I","I","I","I",
- "D","N","O","O","O","O","Oe","x","O","U","U","U","Ue","Y","Th","ss",
- "a","a","a","a","ae","aa","ae","c","e","e","e","e","i","i","i","i",
- "d","n","o","o","o","o","oe",":","o","u","u","u","ue","y","th","ij"
- },{
- " ","!","c",SUB,"$","Y","|",SUB,"\"","(c)","a","<<","-","-","(R)",SUB,
- SUB,"+/-","2","3","'","mu","P",".",",","1","o",">>","1/4","1/2","3/4","?",
- "A","A","A","A","[","]","[","C","E","@","E","E","I","I","I","I",
- "D","N","O","O","O","O","\\","x","\\","U","U","U","^","Y","Th","ss",
- "a","a","a","a","{","}","{","c","e","`","e","e","i","i","i","i",
- "d","n","o","o","o","o","|",":","|","u","u","u","~","y","th","y"
- }};
- -----------------------------------------------------------------------
-
- One might perhaps replace the "?" in SUB with another code that will be
- displayed as a blinking question mark or something similar. Then the
- user will know that the software wants to tell him/her that it can't
- display this symbol and that it isn't a question mark. If your software
- runs on hardware that supports already another 8-bit characters set
- (e.g. IBM PC with code page 437, etc.) then it might be a much better
- idea to include only one single table that uses the supported symbols
- as well as possible and uses the strings suggested here only if no
- better alternative is available. (BTW: IBM code page 850 which is
- supported by MS-DOS and OS/2 contains ALL Latin 1 characters, but at
- other positions, in order to stay compatible with the old IBM PC
- character set.)
-
- The following string conversion routine uses these tables. It may
- easily be called before a text received from the network is sent to the
- terminal, if the user has selected one of the tables:
-
- -----------------------------------------------------------------------
- /*
- * Transform an 8-bit ISO Latin 1 string s1 in a 7-bit ASCII string s2
- * readable on old terminals using conversion table t.
- *
- * worst case: strlen(s2) == 3*strlen(s1)
- */
- void
- Latin1toASCII(s1, s2, t)
- unsigned char *s1, *s2;
- int t;
- {
- unsigned char *p, **tab;
-
- if (s1==NULL || s2==NULL) return;
-
- tab = iso2asc[t] - 0xa0;
- do {
- if (*s1 > 0x9f) {
- p = tab[*(s1++)];
- while (*p) *(s2++) = *(p++);
- } else {
- *(s2++) = *(s1++);
- }
- } while (*s1);
-
- return;
- }
- -----------------------------------------------------------------------
-
- Many users all over the world are looking forward to your next software
- release that will allow them to participate without pain in the world
- of 8-bit character communication even before they get modern hardware
- with ISO 8859-1 (or even better ISO 10646) character sets.
-
- Feel free to contact me or experts in USENET group comp.std.internat if
- you have any questions about modern character sets. Many thanks to
- everyone who helped me to improve these tables!
-
- Markus
-
- --
- Markus Kuhn, Computer Science student -=-=- University of Erlangen, Germany
- Internet: mskuhn@immd4.informatik.uni-erlangen.de | X.500 entry available
- ----- Anyone participating in the use of MS-DOS, Heroin or Cocaine is -----
- ---- simply not getting the most out of life possible. (Brian Downing) ----
-