> In other words, without Line Separator U+2028, there would be no canonical
> way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
> Why? Because the semantics of CR, LF, CRLF, and other control characters
> vary from platform to platform (e.g. Macintosh, UNIX, DOS).
>
> Furthermore, the conventions for separating paragraphs are also platform
> and application-specific. Thus the Paragraph Separator, U+2029.
Does this mean that new applications should refrain from using LF and CR
and use the two new control characters instead? How many Unicode
applications currently understand the Unicode line and paragraph
separators?
As for future Unicode apps, what about Unicode-supporting e-mail apps?
Will the upcoming Netscape Communicator (the most popular commercial
Unicode-capable e-mail client I can think of) send (and understand)
e-mail with the new markers (provided they're Unicode encoded, of course)?
(targeted towards the Netscape/Unicode group)
--
Adrian Havill <URL:http://www.threeweb.ad.jp/>
Engineering Division, System Planning & Production Section
14-May-97 6:31:04-GMT,2078;000000000001
Received: from malmo.trab.se (malmo.trab.se [131.115.48.10])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id CAA22840
for <fdc@watsun.cc.columbia.edu>; Wed, 14 May 1997 02:31:02 -0400 (EDT)
Received: from valinor.malmo.trab.se (valinor.malmo.trab.se [131.115.48.20]) by malmo.trab.se (8.7.5/TRAB-primary-2) with ESMTP id IAA24548; Wed, 14 May 1997 08:31:00 +0200 (MET DST)
Received: by valinor.malmo.trab.se (8.7.5/TRM-1-KLIENT); Wed, 14 May 1997 08:30:59 +0200 (MET DST) (MET)
> 11 11111 1 11111 PDF 00 0000 ... | Yiddish PDF English ...
>
> Now if the newline (NL in above) is indicated by a LS (\u2028), the
> bidi state is reset between the lines. If I now start the second line
> with RLE (so as to say I'm reestablishing an embedding level), I can no
> longer tell whether I have one embedded segment or two (with a 0-level
> space between, where the LS is). Could be an issue if I later reformat
> (reflow) this text (as I might want to do in an editor).
>
> As a matter of fact, if the second line (after LS) starts with a strong
> R2L character and I don't reissue RLE, won't the base level be set to 1?
> This would put the following English at level 2 (not intended as the
> English isn't embedded in the Yiddish here, but the other way around).
LS is defined as a block separator, so you are right. When you
insert an LS to split the lines, your application could insert
arbitrary additional codepoints such as RLE. What it does insert
(or not) is outside of the Unicode BIDI spec, which only describes
static behaviour (what has to happen when the insertions are done),
and not dynamic interactive behaviour (which can be a lot more
complex if you want it to follow user's expectations, and given
that static BIDI is already difficult, I hope you get the point :-).
But when you edit BIDI text, you really should work with
paragraph-oriented plain text, without additional LSs. Then
everything will run more or less smoothly. Reformatting (reflow)
is done automatically and correctly. In those cases where you
indeed insert LSs, they will in most cases not be in the middle
of text, but at some logical interruption point, without the
need for frequent reflow.
> These problems go away if I use any combinations of CR/LF to indicate
> newline.
This might be a solution for some very special cases. But in general,
for BIDI you should use paragraph-oriented plain text, with CR/LF/
CRLF/PS as paragraph separators. I'm pretty sure that when Microsoft
implements BIDI (or the way they already do it), they will treat
CR (what they use internally) as a block separator in the BIDI
algorithm.
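A minimal sketch for checking how a given implementation classifies these characters for bidi purposes. It assumes Python's standard unicodedata module, in which the class "B" marks a paragraph (block) separator. The values reflect whichever Unicode version the interpreter ships, so the result for LS may not match the 1997 definitions discussed above. The names CANDIDATES and bidi_classes are illustrative only.

```python
import unicodedata

# Break characters mentioned in the thread (names are illustrative).
CANDIDATES = {
    "CR": "\r",
    "LF": "\n",
    "LS": "\u2028",
    "PS": "\u2029",
}


def bidi_classes():
    """Return {name: bidi class} from the interpreter's Unicode database."""
    return {name: unicodedata.bidirectional(ch) for name, ch in CANDIDATES.items()}


if __name__ == "__main__":
    # "B" means the character ends a bidi paragraph in this database.
    for name, cls in bidi_classes().items():
        print(name, cls)
```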
Regards, Martin.
16-May-97 22:25:22-GMT,2878;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA15567
for <fdc@watsun.cc.columbia.edu>; Fri, 16 May 1997 18:25:21 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
id AA12695; Fri, 16 May 97 13:39:12 -0700
Message-Id: <9705162039.AA12695@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Uml-Sequence: 2634 (1997-05-16 20:38:15 GMT)
To: Multiple Recipients of <unicode@unicode.org>
Reply-To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
From: "Unicode Discussion" <unicode@unicode.org>
Date: Fri, 16 May 1997 13:38:13 -0700 (PDT)
Subject: Re: Line Separator character
On Wed, 14 May 1997, Adrian Havill wrote:
> Martin J. Duerst wrote:
> > Email has very strict restrictions on this. You can't send doublebyte
> > UTF-16 or UCS-2 in Email. CRLF always has to be present as a line
> > separator. Unicode in Email is possible with UTF-7 (and CRLF as line
> > separator) or UTF-8 + BASE64/QuotedPrintable (and CRLF...).
> > Please see RFC 2045/6/7 for this.
>
> I'm aware of this. Allow me to clarify: encode the Unicode line and
> paragraph separators in UTF-7 and transmit no CR and LFs. Some
> protocols, such as SMTP, have a line limit (998 octets in the case of
> SMTP).
SMTP email requires that line breaks be encoded as CRLF for all
things that are text (i.e. Content-Type: text/*). The user
(or the user agent) is also asked to limit line length to
something like 80 characters (actually 80 bytes).
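The constraints mentioned here (CRLF line breaks, and the 998-octet line limit quoted above) can be checked mechanically. The following is only a sketch: smtp_line_problems is a hypothetical helper, not part of any mail library, and it looks at nothing but line structure.

```python
def smtp_line_problems(wire: bytes, limit: int = 998):
    """List (line_number, problem) pairs for a message body as sent on the wire.

    Assumes lines are delimited by CRLF and that `limit` counts the octets
    of a line without its terminating CRLF.
    """
    problems = []
    for number, line in enumerate(wire.split(b"\r\n"), start=1):
        if len(line) > limit:
            problems.append((number, "too long"))
        if b"\r" in line or b"\n" in line:
            problems.append((number, "bare CR or LF"))
    return problems
```

Text that carries only UTF-7-encoded LS characters and no CRLF at all would appear to this check as a single line, and would fail it as soon as the text grows past the limit.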
> However, as the behavior of CR and LF is system dependent, an e-mail
> client could theoretically ignore CR LF, etc and go by the UTF-7 encoded
> Unicode line and paragraph breaks, when
CR and LF are system dependent, but in mail, it's always CRLF, and
mail user agents do the conversion.
> RFC2046 says '[i]t should not be necessary to add any line breaks to
> display "text/plain" correctly....'
That's because text/plain (and all of text/*) is already defined
to have these as CRLF, at 'short' intervals.
> So why not NOT use them and go with
> the Unicode ones?
Because that may (or actually will) break some mail software.
I know many people don't like that (I don't either), but some
things in Internet mail are braindead, and will stay braindead.
Too many influential people are too used to the way things are,
and too many people are afraid of some software failing to work.
Of course, what you can do is to have your local user agent
change from CRLF to whatever line breaking convention you
use locally, which might very well be the "true" Unicode codes.
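A sketch of that local conversion, assuming the local convention is LINE SEPARATOR (U+2028) and that the transport carries UTF-8 text with CRLF line breaks. The function names wire_to_local and local_to_wire are hypothetical. A real user agent would also have to handle transfer encodings and the choice between LS and PS.

```python
LS = "\u2028"  # LINE SEPARATOR, used here as the assumed local convention


def wire_to_local(wire: bytes) -> str:
    """Decode text received from the mail transport and replace CRLF with LS."""
    return wire.decode("utf-8").replace("\r\n", LS)


def local_to_wire(text: str) -> bytes:
    """Replace LS with CRLF and encode the text for the mail transport."""
    return text.replace(LS, "\r\n").encode("utf-8")
```

The software between the two clients sees only CRLF, so nothing in transit needs to know about the Unicode separators.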
> As there are few legacy Unicode-capable e-mail clients, is it not
> possible to push to get this functionality added now?
The problem is not the clients. The problem is all the software
that the mail passes from one client to the other.
Regards, Martin.
17-May-97 21:28:56-GMT,4627;000000000011
Received: from unicode.org (unicode.org [192.195.185.2])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA05910
for <fdc@watsun.cc.columbia.edu>; Sat, 17 May 1997 17:28:55 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
id AA15437; Sat, 17 May 97 14:09:06 -0700
Message-Id: <9705172109.AA15437@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Uml-Sequence: 2642 (1997-05-17 21:08:44 GMT)
To: Multiple Recipients of <unicode@unicode.org>
Reply-To: Edward Cherlin <cherlin@cauce.org>
From: "Unicode Discussion" <unicode@unicode.org>
Date: Sat, 17 May 1997 14:08:43 -0700 (PDT)
Subject: Re: Line Separator Character
"Martin J. Duerst" <mduerst@ifi.unizh.ch> wrote:
>On Fri, 16 May 1997, Pierre Lewis wrote:
>
>
>> Context: plain text unicode file.
>
>There are basically two models of plain text. The first is line-oriented,
>the second is paragraph-oriented. Email or program code is the traditional
>example of line-oriented plain text. Descriptive text as it appears in
>word processors, minus formatting, is the typical example of paragraph-
>oriented plain text.
>
>In traditional encoding (using CR/LF/CRLF) and in "official" Unicode
>encoding (using PS), the two models are made compatible by treating
>each line in the line-oriented plain text as a paragraph. On the other
>hand, the paragraph-oriented model can be reduced to the line-oriented
>model by splitting lines in a particular layout of the paragraph.
>This splitting is again done by paragraph separators (CR/LF/CRLF/PS),
>and not by LS.
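The reduction from the paragraph-oriented model to the line-oriented one can be sketched as follows. This assumes paragraphs are delimited by PS (U+2029) and uses Python's textwrap module as a stand-in for "a particular layout". The helper names and the default width are illustrative.

```python
import textwrap

PS = "\u2029"  # PARAGRAPH SEPARATOR


def paragraphs_to_lines(text, width=40):
    """Lay out each paragraph at a fixed width and return the resulting lines."""
    lines = []
    for paragraph in text.split(PS):
        # An empty paragraph still occupies one (empty) line.
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    return lines


def lines_to_paragraphs(lines):
    """Treat every line as its own paragraph, as the line-oriented model does."""
    return PS.join(lines)
```

The round trip is not the identity: once a paragraph has been split into lines, each line becomes a paragraph of its own, so the original paragraph boundaries are no longer recoverable.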
There are actually several other models for files of 7-bit or 8-bit
character codes, commonly, but misleadingly, known as ASCII text files.
The original model was control of a Teletype machine, where several control
characters called for physical movement of the mechanism. Many of the bad
habits used in text files are survivals of this model. Others, fortunately,
have died out. (I am thinking of some of the uses of control characters in
editors meant for hard copy terminals.)
CRLF was *required* to initiate a new line, but CR by itself was sometimes
used for overstriking (if BS was not available), including underlining and
composition of APL characters, and also for imitating typewriter
overstrikes such as c| for the cent sign and some accented letters such as
u" or e`. HT and FF were very commonly used, and some others, such as SI
and SO, less so, but each of these specified a mechanical action. SI and SO
allowed a fairly standard way to control some dual-script devices including
ASCII/Arabic, ASCII/Cyrillic, APL/ASCII, and other combinations.
Many devices used ASCII control characters for new purposes, so that an
ASCII character string could specify the hardware behavior needed for bold
facing and so on. The actual process of printing might call for translation
from a 'text file' to an ASCII command string file which would produce the
same printed image by other means. For example, a printer driver for a
bidirectional printer could save time by printing alternate lines in
reverse order, with LF and some spacing commands between lines.
We then had the glass Teletype, or dumb terminal, model, which might treat
CR and LF as on mechanical devices, or might treat them both as new line
characters, or might do something else. At the same time, 'text files'
could still be used to control electronic printers, with varying
interpretations of some of the control characters.
Now, on computers with GUIs, we have different systems that expect CR, or
LF, or CRLF, as the new line signal, and have other interpretations of
other control characters. System software vendors are going off in all
directions inventing new misinterpretations of Unicode characters and
constructing yet other file designs.
We want to have a uniform, portable definition of the meaning of a file of
16-bit character codes interpreted as Unicode, or "Unicode text file" for
short. At the same time, we have several uses for such files, where
different interpretations may be desired. If we want to do this right, I
think we have to find the appropriate organization for defining such file
formats and uses, and get down to some serious and at times difficult
standard making. The Unicode character code standard does not seem to be
the right place to do this.
--
Edward Cherlin Help outlaw Spam Everything should be made
Vice President http://www.cauce.org as simple as possible,
NewbieNet, Inc. 1000 members and counting __but no simpler__.
http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein
17-May-97 23:00:51-GMT,6375;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA21108
for <fdc@watsun.cc.columbia.edu>; Sat, 17 May 1997 19:00:50 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
id AA15658; Sat, 17 May 97 15:40:09 -0700
Message-Id: <9705172240.AA15658@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2643 (1997-05-17 22:39:47 GMT)
To: Multiple Recipients of <unicode@unicode.org>
Reply-To: Frank da Cruz <fdc@watsun.cc.columbia.edu>
From: "Unicode Discussion" <unicode@unicode.org>
Date: Sat, 17 May 1997 15:39:45 -0700 (PDT)
Subject: Re: Line Separator Character
> There are actually several other models for files of 7-bit or 8-bit
> character codes, commonly, but misleadingly, known as ASCII text files.
>
> The original model was control of a Teletype machine, where several control
> characters called for physical movement of the mechanism. Many of the bad
> habits used in text files are survivals of this model.
>
I wouldn't call them bad habits necessarily. The primary bone of contention
here is the distinction between LF and CR...
> CRLF was *required* to initiate a new line, but CR by itself was sometimes
> used for overstriking (if BS was not available), including underlining and
> composition ...
>
Right. And LF was used by itself to go down one row.
> We then had the glass Teletype, or dumb terminal, model, which might treat
> CR and LF as on mechanical devices, or might treat them both as new line
> characters...
>
Actually I think that practically all CRTs treat CR and LF just as the TTY
did. CR positions the cursor to the left of the current row, LF moves it
down one row.
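The behaviour described here, with CR returning to the left margin and LF moving down one row without changing the column, can be modelled with a small sketch. It is a toy model written for illustration, not the behaviour of any particular terminal. Each cell keeps every glyph struck there, so a bare CR followed by more text shows up as overstriking.

```python
def render_tty(data):
    """Return a list of rows; each row maps column -> list of glyphs struck there."""
    rows = [{}]
    row, col = 0, 0
    for ch in data:
        if ch == "\r":      # carriage return: back to the left margin, same row
            col = 0
        elif ch == "\n":    # line feed: down one row, column unchanged
            row += 1
            while len(rows) <= row:
                rows.append({})
        elif ch == "\b":    # backspace: one column to the left
            col = max(0, col - 1)
        else:               # printing character: strike it, then advance
            rows[row].setdefault(col, []).append(ch)
            col += 1
    return rows


def top_glyphs(rows):
    """Show only the last glyph struck in each cell, as a CRT would."""
    out = []
    for cells in rows:
        width = max(cells) + 1 if cells else 0
        out.append("".join(cells[c][-1] if c in cells else " " for c in range(width)))
    return out
```

In this model the string "ab\rX\ncd" leaves both "a" and "X" in the first cell of the first row, and "cd" begins in the second column of the next row because LF alone does not return the carriage.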
> Now, on computers with GUIs, we have different systems that expect CR, or
> LF, or CRLF, as the new line signal, and have other interpretations of
> other control characters.
>
Really the problem started when the UNIX designers decided that it was a good
idea to have a storage model that was different than the transmission model.
This allowed some space to be saved on disk, and it made text processing
software a bit easier to write. However, it complicated the tty driver by
requiring it to substitute CRLF for LF when displaying text files, which in
turn has led to all sorts of confusion about "raw" vs "cooked" mode, etc,
and the related distinction between NVT vs binary mode in Telnet protocol.
(It is a simplification to say that UNIX was the first disk operating system
to store textual files differently than it transmitted them, but it may have
been the first *stream-oriented* one to do so -- or at least the one we
remember.)
Thus CRLF has always been the line terminator in ASCII (in the broad sense of
"not EBCDIC") text transmission. Systems that chose to use different internal
representations have had the obligation to convert back and forth during
transmission.
It's interesting to speculate how different the world (of computing) might be
today if only a few arbitrary and perhaps whimsical decisions had been made
differently decades ago: if UNIX and several other popular platforms had used
CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward
slash" (/) rather than "backward slash" (\) as the directory separator... How
many person-eons of effort have gone into addressing the consequences of these
decisions...
> HT and FF were very commonly used...
>
(And still are...) Now there's an interesting point. Unicode has addressed
the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it
sometimes just as necessary to specify a hard page break as it is to specify a
hard line or paragraph break? I suppose there must be a boundary somewhere
between "Trust your rendering engine" and "Mother, Please! I'd rather do it
myself!" I don't have a copy handy, and I might be entirely wrong about this,
but isn't the Holy Koran a document that must be paginated in a specific way?
In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly
difficult for certain kinds of people to operate in the ways to which they
have become accustomed over the past decades in which plain text was "good
enough" save that one could not put lots of languages into it. For example,
today I can write a letter that spills over to one or more "second sheets" in
plain text and print it on a plain-text printer without a second thought,
using any software at all on any platform, embedding hard line, paragraph, and
page breaks in it, just as most of us still do with email (except for the page
breaks). No "templates", "wizards", "profiles", "preferences", or
"Buzzword-1.0 Compliance" involved. I can move this letter to practically any
other platform and it will still be perfectly legible and printable -- no
export or import or conversion or version skew to worry about. I think a lot
of people would be perfectly happy to do the same in a plain-text Unicode
world using plain-text Unicode terminals and printers, if there were such
things. But there's a bigger issue...
The idea that one must embed Unicode in a higher level wrapper (e.g. a
Microsoft Word document, or even HTML) to make it useful has a certain
frightening consequence: the loss of any expectancy of longevity for our new
breed of documents. These higher-level systems will be overwhelmingly
proprietary due to the vast amount of coding that must go into them, the
voracious nature of the marketplace, etc, and so formats will become obsolete
with ever-increasing frequency, and it will become ever harder to extract the
plain-text characters -- the substance -- from them. That which is perceived
at a critical moment in time to be worthy of preservation will be converted to
the new format, the rest discarded or left for decipherment by future
generations of information archaeologists. (If you don't believe this is a
problem, think about what is happening to our (physical) libraries all over
the world at this moment -- get ready to say goodbye forever to five millennia
of history that was not worth digitizing.) (And then to do it all over again
when the digital formats and media need conversion in another ten years.)
(And then again five years after that, etc...)
So let's do our part and make some effort to accommodate traditional
plain-text applications in Unicode, rather than discourage them :-)
- Crank (Oops, I mean Frank)
18-May-97 0:13:19-GMT,2045;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA29722
for <fdc@watsun.cc.columbia.edu>; Sat, 17 May 1997 20:13:18 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
id AA15880; Sat, 17 May 97 16:56:42 -0700
Message-Id: <9705172356.AA15880@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2644 (1997-05-17 23:56:16 GMT)
To: Multiple Recipients of <unicode@unicode.org>
Reply-To: Terry Allen <tallen@sonic.net>
From: "Unicode Discussion" <unicode@unicode.org>
Date: Sat, 17 May 1997 16:56:15 -0700 (PDT)
Subject: Re: Line Separator Character
Frank da Cruz asked:
>(And still are...) Now there's an interesting point. Unicode has addressed
>the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it
>sometimes just as necessary to specify a hard page break as it is to specify a
>hard line or paragraph break? I suppose there must be a boundary somewhere
>between "Trust your rendering engine" and "Mother, Please! I'd rather do it
>myself!" I don't have a copy handy, and I might be entirely wrong about this,
>but isn't the Holy Koran a document that must be paginated in a specific way?
It isn't. My Egyptian Qur'an is one continuous text flow; the heading
of a surah may even occur right at the bottom of a page. But there are
such documents; the example of legal documents was brought up recently
wrt SGML style sheets.
From an SGML point of view, I want to separate lines and paragraphs
in my SGML markup. That's how I'd expect to obtain longevity for the
text, not through LS and PS. CR and LF and SGML's difficulty in
dealing with them (now redressed partially in XML) are bad enough.
In SGML I can't see using LS or PS.
Regards (and thanks for an interesting discussion),
Terry Allen Electronic Publishing Consultant tallen[at]sonic.net
http://www.sonic.net/~tallen/
Davenport and DocBook: http://www.ora.com/davenport/index.html
T.A. at Passage Systems: terry.allen[at]passage.com
18-May-97 8:11:08-GMT,1439;000000000011
Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA07970
for <fdc@watsun.cc.columbia.edu>; Sun, 18 May 1997 04:11:06 -0400 (EDT)
Received: from [206.245.192.57] (ttyD0.mtshasta.snowcrest.net [206.245.192.32]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id BAA00515 for <fdc@watsun.cc.columbia.edu>; Sun, 18 May 1997 01:11:02 -0700 (PDT)
Subject: Unicode plain text (Was: Line Separator Character)
Cc: unicode@unicode.org, kenw@sybase.com
X-Sun-Charset: US-ASCII
Crank, er... Frank,
>> HT and FF were very commonly used...
>>
>(And still are...) Now there's an interesting point. Unicode has addressed
>the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it
>sometimes just as necessary to specify a hard page break as it is to specify a
>hard line or paragraph break?
You can still use U+000C FORM FEED in Unicode plain text, and a renderer that
knows about page breaks can do the "right thing", namely whatever it did with
^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to
be ambiguous enough in usage (unlike CR/LF) to require any separate encoding
in Unicode.
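A minimal sketch of a renderer-side treatment of U+000C, assuming the "right thing" is simply to start a new page at each FORM FEED. The helper name split_pages is illustrative.

```python
FF = "\f"  # U+000C FORM FEED


def split_pages(text):
    """Split plain text into pages at each FORM FEED character."""
    return text.split(FF)


if __name__ == "__main__":
    letter = "Dear Sir,\nfirst sheet\fsecond sheet\nYours, etc."
    for number, page in enumerate(split_pages(letter), start=1):
        print("--- page", number, "---")
        print(page)
    # Caveat: Python's str.splitlines() also treats FORM FEED as a line
    # boundary, so line-splitting code has to decide which reading it wants.
```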
> In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly
> difficult for certain kinds of people to operate in the ways to which they
> have become accustomed over the past decades in which plain text was "good
> enough" save that one could not put lots of languages into it.
The goal of Unicode plain text is to recapture that portability in the
encoding, but also allow you to put lots of languages into it. The "Use-A-GUI
thrust" of Unicode acknowledges the fact that rendering of complex scripts
(including the Latin script with generative use of combining marks) requires
logic that is much more amenable to implementation in a GUI framework than in
a terminal model. However, appropriate (and very large and useful) subsets of
Unicode *can* be implemented with simple rendering models. (Cf. Windows NT
until very recently. :-) )
> I can move this letter to practically any
> other platform and it will still be perfectly legible and printable -- no
> export or import or conversion or version skew to worry about. I think a lot
> of people would be perfectly happy to do the same in a plain-text Unicode
> world using plain-text Unicode terminals and printers, if there were such
> things.
That is exactly what Unicode plain text is all about. And, by the way,
Notepad on Windows NT was pretty close to being a "plain-text Unicode terminal".
> The idea that one must embed Unicode in a higher level wrapper (e.g. a
> Microsoft Word document, or even HTML) to make it useful has a certain
> frightening consequence: the loss of any expectancy of longevity for our new
> breed of documents.
There is absolutely nothing new about this. I was warning my linguistic
colleagues about the longevity of their documents when they started using
WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed
stable enough and was widely enough implemented to retain easy transmissibility
across the computer generations without the intervention of information
archaeologists. Well, 16-bit Unicode plain text is aimed at no less a
goal than being the universal wide-ASCII plain text of the 21st century.
Grumpy aside: This goal is not helped by people who treat Unicode as
a standards dumping ground for assigning numbers to everybody's favorite
collection of junk vaguely related to text, or who try to infiltrate
mechanisms (such as language tags) that do not belong in plain text.
> So let's do our part and make some effort to accommodate traditional
> plain-text applications in Unicode, rather than discourage them :-)
I agree completely. An excellent example of the appropriate place for
a Unicode plain-text editor would be a Java IDE. If someone writes
a good Unicode plain-text editor for such an application, it would
have wider applicability. (I know I often use the editors of C++
IDEs to create (ASCII) plain text when I don't want it all gummed up
as a Word or Frame document.)
Ed Cherlin commented:
> We want to have a uniform, portable definition of the meaning of a file of
> 16-bit character codes interpreted as Unicode, or "Unicode text file" for
> short. At the same time, we have several uses for such files, where
> different interpretations may be desired. If we want to do this right, I
> think we have to find the appropriate organization for defining such file
> formats and uses, and get down to some serious and at times difficult
> standard making. The Unicode character code standard does not seem to be
> the right place to do this.
I disagree about the last point. A Unicode plain text file consists of
a stream of Unicode characters (and nothing else), interpreted according
to the Unicode standard. It should be marked with an initial U+FEFF (though
technically that is optional). This much is already clear from the standard,
as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal,
unambiguous, plain text formatting consistent with the bidi algorithm.
The situation is complicated by the two possible byte orders (which is one
reason for the U+FEFF) and by the fact that the most widely implemented
variant, namely that in Windows NT, chose LSB order instead of MSB order.
But other than that, there is not much more to be said about a Unicode
plain text file. The usefulness of the concept lies in its simplicity.
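Reading such a file can be sketched as follows, assuming 16-bit code units and using the initial U+FEFF to decide the byte order. The function name and the fallback to MSB-first order when no U+FEFF is present are assumptions made for this illustration.

```python
import codecs


def decode_unicode_plain_text(raw, default_order="big"):
    """Decode a 16-bit Unicode plain text file, honouring an initial U+FEFF."""
    if raw.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF: MSB first
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE: LSB first
        return raw[2:].decode("utf-16-le")
    # No byte order mark: fall back to the caller's assumption.
    codec = "utf-16-be" if default_order == "big" else "utf-16-le"
    return raw.decode(codec)
```

Everything after the optional mark is just characters, including any LINE SEPARATOR or PARAGRAPH SEPARATOR, which the decoder passes through untouched.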
--Ken Whistler
20-May-97 20:29:52-GMT,4480;000000000011
Return-Path: <cherlin@cauce.org>
Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA02464
for <fdc@watsun.cc.columbia.edu>; Tue, 20 May 1997 16:29:41 -0400 (EDT)
Received: from [206.245.192.36] (ttyD23.mtshasta.snowcrest.net [206.245.192.67]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id NAA01464; Tue, 20 May 1997 13:29:30 -0700 (PDT)
> > > > My personal preference is for number 2. I kind of like Martin's proposal
> > > > for introducing a plain-text language tag using a control code, and I
> > > > think the existing control codes are fine.
> >
> > Good idea. Indeed the C1 area is not used in the Internet as far as I know.
> >
> There are still such things as terminals that use C1 control codes such as
> CSI, APC, OSC, etc (primarily VT220 and higher, which are the predominant
> types used by emulators such as Kermit, Xterm, DECterm, etc). Do we intend that
> Unicode and terminal-to-host communication will become mutually exclusive
> concepts?
Frank - I understand your concerns. But one way of looking at what we
need is some tagging format possibly used in ACAP and IMAP, which
MUST not leak to other places. And what you probably worry about
is the C1 area in terms of octets (which is already gone with UTF-8)
and not the C1 character space in Unicode, which turns up as two bytes
in UTF-8.
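The distinction between C1 octets and C1 characters is easy to see with a short sketch. CSI (U+009B) is used as the example, and the helper name c1_octets is illustrative.

```python
CSI = "\u009b"  # CONTROL SEQUENCE INTRODUCER, a C1 control character


def c1_octets(ch):
    """Return the octets for one character in ISO 8859-1 and in UTF-8."""
    return ch.encode("iso-8859-1"), ch.encode("utf-8")


if __name__ == "__main__":
    latin1, utf8 = c1_octets(CSI)
    # One octet in the C1 range for ISO 8859-1, two octets for UTF-8.
    print(latin1.hex(), utf8.hex())
```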
Regards, Martin.
8-Jun-97 8:27:08-GMT,2730;000000000001
Return-Path: <unicode@unicode.org>
Received: from mail-out1.apple.com (mail-out1.apple.com [17.254.0.52])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA06491
for <fdc@watsun.cc.columbia.edu>; Sun, 8 Jun 1997 04:27:07 -0400 (EDT)
Received: from unicode.org (unicode2.apple.com [17.254.3.212])
by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id BAA08438;
Sun, 8 Jun 1997 01:14:24 -0700
Received: by unicode.org (NX5.67g/NX3.0S)
id AA09568; Sun, 8 Jun 97 01:11:13 -0700
Message-Id: <9706080811.AA09568@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2866 (1997-06-08 08:10:50 GMT)
To: Multiple Recipients of <unicode@unicode.org>
Reply-To: "Pierre Lewis" <lew@nortel.ca>
From: "Unicode Discussion" <unicode@unicode.org>
Date: Sun, 8 Jun 1997 01:10:49 -0700 (PDT)
Subject: Re: Comments on <draft-ietf-acap-mlsf-00.txt>
Finally got around to reading the MLSF Internet Draft.
Couple of comments:
1) One thing really made me jump: the first sentence in the Abstract.
"While UTF-8 solves most internationalization (I18N) problems, ..."
That makes as much sense to me as saying that QuotedPrintable solves
most I18N problems for Western Europe. It's not QP which does that,
it's ISO 8859-1. QP is just one way to encode 8859-1 text so it can
pass most mail relays without corruption. But Base64 is another
way to do the same thing (which can make statistical sense for some
languages).
Similarly, it's not UTF-8 which solves the wider problem of
world-wide I18N, it's Unicode (and/or ISO 10646). The canonical
representation of Unicode is 16-bit quantities (UCS-2). UTF-8 is
nothing more than one of many possible transformations (UTF-7 is
another that's already defined: RFC 2152). If I understood right,
UTF-8 was created mainly to make Unicode coexist reasonably well
with existing OSs that use 8-bit characters, for example Unix.
Not that I agree with the proposal, but the MLSF Internet Draft
should make clear what the implications are of trying to put
language tags into UTF-8 (for example, assumption that UTF-8 becomes
the canonical representation of Unicode, loss of tagging when
converting to other CESs). I guess the pros and cons have been
discussed at length here.
2) It would have been nice to put a few examples of actual UTF-8 strings
with language tags (in hex of course) in the document.
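The layering argued for in point 1 can be illustrated with a short sketch: the coded character set (ISO 8859-1 or Unicode) determines which characters exist, while QuotedPrintable, Base64, UTF-7, UTF-8 and 16-bit forms are only ways of turning those characters into octets. The sample word and the helper names are illustrative, and no MLSF language tags are shown because their format is not given here.

```python
import base64
import quopri


def transfer_encodings(text):
    """Encode ISO 8859-1 text for mail transport in two interchangeable ways."""
    octets = text.encode("iso-8859-1")
    return {
        "quoted-printable": quopri.encodestring(octets),
        "base64": base64.b64encode(octets),
    }


def unicode_forms_hex(text):
    """Show the same Unicode text under several transformations, in hex."""
    return {
        "UTF-8": text.encode("utf-8").hex(),
        "UTF-7": text.encode("utf-7").hex(),
        "16-bit, MSB first": text.encode("utf-16-be").hex(),
    }


if __name__ == "__main__":
    sample = "Grüße"
    print(transfer_encodings(sample))
    print(unicode_forms_hex(sample))
```

Decoding any of these forms yields the same characters, which is the sense in which none of them is itself the solution to the internationalization problem.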
As to the fundamental issue of whether language tagging belongs in
plain-text Unicode, I must say I'm pretty neutral at this point. I
think they could be useful. But, as Frank was saying, if it's going
to take 10 years to converge to an acceptable solution, then it
doesn't belong in plain text, but at a higher level.
Pierre
9-Jun-97 3:10:12-GMT,1193;000000000001
Return-Path: <glenn@spyglass.com>
Received: from cam.spyglass.com (sapir.cam.spyglass.com [208.203.148.66])
by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA24496
for <fdc@watsun.cc.columbia.edu>; Sun, 8 Jun 1997 23:10:11 -0400 (EDT)
Received: from mykhe.cam.spyglass.com (shivacam-1.cam.spyglass.com [208.203.149.181]) by cam.spyglass.com (8.7.5/8.7.3) with SMTP id XAA00525 for <fdc@watsun.cc.columbia.edu>; Sun, 8 Jun 1997 23:10:22 -0400 (EDT)
> I know this 1.0 name field is not subject to the same rule of "no changes,
> ever" that applies to the regular Character Name field, but why should these
> names be changed at all?
Aliases, actually, from the Unicode point of view, not formal names.
And Kent explained why the aliases were updated.
>
> On this same topic, parenthesized abbreviations have been added to the 1.0
> names for U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE
> RETURN (CR), and U+0085 NEXT LINE (NEL). Does the addition of these
> abbreviations mean that they are now part of the official 1.0 name,
Nope.
> and if
> so, why? Other characters typically don't have abbreviations as part of
> their names, even if they are as meaningful and as commonly used as these,
> and again it is a change from the 1.0 name we have seen for a decade.
Off and on, I work at a project to backrev from UnicodeData-1.1.5.txt
to produce a Unicode 1.0 version of UnicodeData.txt, as it would
have been defined if such a data file had been defined at the
time. (It wasn't.) If I get around to posting that, then people can
use the Unicode name field itself as the documentation of what the
Unicode 1.0 name was!
In the meantime, if you want the old time religion for the Unicode 1.0
names, you can extract them from UnicodeData-2.0.14.txt (the version
officially released with Unicode 2.0), before the field was repurposed
for the Unicode 3.0 publication.
>
> Perhaps I've been checking the beta files a bit TOO carefully.
I suppose we should add a note to UnicodeData.html, clarifying the
special status of the Unicode 1.0 name field for the control
characters.
--Ken
>
> -Doug Ewell
> Fullerton, California
>
>
20-May-2002 11:44:37-GMT,2846;000000000001
Return-Path: <unicode-bounce@unicode.org>
Received: from unicode.org (unicode.org [209.235.17.55])
by dewberry.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id MAA05325
for <fdc@columbia.edu>; Mon, 20 May 2002 12:44:37 -0400 (EDT)
Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1])
by unicode.org (8.9.3/8.9.3) with ESMTP id MAA09865;
Mon, 20 May 2002 12:04:28 -0400
Received: with ECARTIS (v1.0.0; list unicode); Mon, 20 May 2002 12:04:28 -0400 (EDT)
Received: from mg01.austin.ibm.com (mg01.austin.ibm.com [192.35.232.18])
by unicode.org (8.9.3/8.9.3) with ESMTP id MAA09859
for <unicode@unicode.org>; Mon, 20 May 2002 12:04:27 -0400
Received: from austin.ibm.com (netmail.austin.ibm.com [9.3.7.137])
by mg01.austin.ibm.com (AIX4.3/8.9.3/8.9.3) with ESMTP id LAA16450
for <unicode@unicode.org>; Mon, 20 May 2002 11:05:41 -0500
Received: from popmail.austin.ibm.com (popmail.austin.ibm.com [9.53.247.178])
by austin.ibm.com (AIX4.3/8.9.3/8.9.3) with ESMTP id LAA46008
for <unicode@unicode.org>; Mon, 20 May 2002 11:06:14 -0500
Received: from jtcsv.com (markus2000.sanjose.ibm.com [9.43.222.33]) by popmail.austin.ibm.com (AIX4.3/8.9.3/8.7-client1.01) with ESMTP id LAA23698 for <unicode@unicode.org>; Mon, 20 May 2002 11:06:12 -0500
Message-ID: <3CE91F79.4000503@jtcsv.com>
Date: Mon, 20 May 2002 09:08:25 -0700
From: Markus Scherer <markus.scherer@jtcsv.com>
Organization: IBM
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2
X-Accept-Language: en,de,eo
MIME-Version: 1.0
To: unicode <unicode@unicode.org>
Subject: Re: Encoding of symbols, and a "lock"/"unlock" pre-proposal
Personally, I find it counter-productive to add a hodge-podge of dingbats and miscellaneous symbols to Unicode, or any coded character set.
They had practical uses when user interfaces and display systems could not handle icons and arbitrary images, but those times are long over.
Witness the demise of the DOS codepages with block graphics when graphical UIs became available.
In my personal opinion, the inclusion of such symbols diminishes the credibility of Unicode as a standard and of the UTC as following reasonable principles and guidelines.
markus
2-Jul-2002 9:15:30-GMT,4825;000000000001
Return-Path: <unicode-bounce@unicode.org>
Received: from unicode.org (unicode.org [209.235.17.55])
by marionberry.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id KAA17490
for <fdc@columbia.edu>; Tue, 2 Jul 2002 10:15:30 -0400 (EDT)
Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1])
by unicode.org (8.9.3/8.9.3) with ESMTP id HAA20142;
Tue, 2 Jul 2002 07:13:41 -0400
Received: with ECARTIS (v1.0.0; list unicode); Tue, 02 Jul 2002 07:13:40 -0400 (EDT)
Received: from BOBCAT.borware.com (bobcat.borware.com [213.88.207.165])
by unicode.org (8.9.3/8.9.3) with ESMTP id HAA20132
for <unicode@unicode.org>; Tue, 2 Jul 2002 07:13:31 -0400
Received: by BOBCAT.borware.com with Internet Mail Service (5.5.2655.55)
> Are you suggesting that IRIs should never appear in plain text,
And that's the crux of the problem :). Unicode is "plain text" insofar as it has a sequence of code points that describe behaviors. However, if I just spit out some "glyph" for each Unicode code point I encounter, say in a fixed-width DOS-type box, I'll have a mess, even for some Latin sequences.
In order for Unicode to display properly, the rendering engine must make some decisions. U+0308 has to be combined with the A before it to make Ä. Even if you force NFC, some scripts still require combining characters for correct display. And that doesn't even begin to touch complex script behavior... or BIDI.
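A minimal sketch of the NFC point, using Python's standard unicodedata module. The helper name combining_marks and the q + U+0308 example are illustrative choices, not part of the original message; q is used because Unicode has no precomposed q with diaeresis.

```python
import unicodedata


def combining_marks(text):
    """Return the characters in text that have a non-zero combining class."""
    return [ch for ch in text if unicodedata.combining(ch) != 0]


# A + U+0308 COMBINING DIAERESIS composes to the single code point U+00C4.
composed = unicodedata.normalize("NFC", "A\u0308")

# q + U+0308 has no precomposed form, so the combining mark survives NFC
# and the renderer still has to stack it on the base letter.
uncomposed = unicodedata.normalize("NFC", "q\u0308")

print(len(composed), combining_marks(composed))
print(len(uncomposed), combining_marks(uncomposed))
```

So normalization reduces the number of combining sequences a renderer sees, but does not eliminate them.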
So, even "plain text" has rules for display. The Unicode Bidi Algorithm supplies some of those rules, without which "plain text" BIDI would be a mess. Unfortunately the Bidi algorithm can't perfectly handle all cases, and IRIs are a case where its behavior isn't perfect. I'd suggest an addendum to the bidi algorithm to cover the IRI case. That would tweak the presentation engines so that when they see a "plain text" IRI, it gets displayed appropriately.
Consistent "Plain Text" display of an IRI is an unattainable holy grail, especially with more complex scripts. Proper display of Unicode requires a rendering engine.
That goes for source code too. If an editor doesn't render complex scripts in ways that are readable (to a speaker at least), then it's pretty pointless to use Unicode; it'd be better to %-encode everything.
So my question is "what is your definition of plain text?" I'd say anything that doesn't use extra presentation markup, but that allows the rendering engine to make reasonable sense of the display.
-Shawn
29-Jul-2010 4:32:05-GMT,4290;000000000001
Return-Path: <unicode-bounce@unicode.org>
Received: from lmtpproxyd (quorn-eth1.cc.columbia.edu [128.59.33.146])
by mozartwurst.cc.columbia.edu (Cyrus v2.3.16) with LMTPA;
Thu, 29 Jul 2010 00:39:39 -0400
X-Sieve: CMU Sieve 2.3
Received: from quorn.cc.columbia.edu ([unix socket])
by mail.columbia.edu (Cyrus v2.3.16) with LMTPA;
Thu, 29 Jul 2010 00:39:39 -0400
X-Sieve: CMU Sieve 2.3
Received: from calabash.cc.columbia.edu (calabash.cc.columbia.edu [128.59.28.168])
by quorn.cc.columbia.edu (8.13.1/8.13.1) with ESMTP id o6T4ddrf029200;
Thu, 29 Jul 2010 00:39:39 -0400
Received: from unicode.org (unicode.org [69.13.187.182])
by calabash.cc.columbia.edu (8.14.4/8.14.3) with ESMTP id o6T4dXq8015137;
Thu, 29 Jul 2010 00:39:38 -0400 (EDT)
Received: from sarasvati.unicode.org (localhost [127.0.0.1])
by unicode.org (8.12.11/8.12.11) with ESMTP id o6T4WGvb027378;
Wed, 28 Jul 2010 23:32:16 -0500
Received: with ECARTIS (v1.0.0; list unicode); Wed, 28 Jul 2010 23:32:16 -0500 (CDT)
Received: from smtpauth05.prod.mesa1.secureserver.net (smtpauth05.prod.mesa1.secureserver.net [64.202.165.99])
by unicode.org (8.12.11/8.12.11) with SMTP id o6T4WEQT027289
for <unicode@unicode.org>; Wed, 28 Jul 2010 23:32:14 -0500
> But to imply that because text always has a specific appearance, determining the underlying plain text is an artificial process that was imposed on us by computers seems wrong. We (meaning "readers of alphabetic scripts, at least Latin and Cyrillic") learn to recognize letters at an early age, but quickly run into additional glyphs we don't recognize, like certain cursive uppercase letters (especially G and Q) and the two-tier vs. one-tier lowercase a and g. Then we find out they are different forms of the same letter, and learn to read them the same, and that is the essence of "plain text": the underlying letters behind potentially differing glyphs.
>
Just to illustrate Doug's point, suppose someone hands you a hand-written letter and asks you to copy it. To what extent do you attempt to fully recreate the format of the original? Most likely, you'll simply copy the letters and punctuation. If the letter has some specific formatting (such as underlining), you may attempt to recreate that. By and large, however, there would be no effort to recreate the non-paragraphing line breaks and definitely not any effort to recreate the original letter shapes. Copying the letter in this fashion is certainly acceptable under almost all circumstances--indeed, in many cases it would be preferred over, say, a photocopy--and it strongly suggests the existence of some sort of Platonic "plain text" which is the essence of what was written.
=====
Siôn ap-Rhisiart
John H. Jenkins
jenkins@apple.com
14-Oct-2011 18:01:55-GMT,8498;000000000001
Return-Path: <unicode-bounce@unicode.org>
Received: from lmtpproxyd (taro-eth1.cc.columbia.edu [128.59.33.142])
by mozartwurst.cc.columbia.edu (Cyrus v2.3.16) with LMTPA;
Fri, 14 Oct 2011 14:14:27 -0400
X-Sieve: CMU Sieve 2.3
Received: from taro.cc.columbia.edu ([unix socket])
by mail.columbia.edu (Cyrus v2.3.16) with LMTPA;
Fri, 14 Oct 2011 14:14:27 -0400
X-Sieve: CMU Sieve 2.3
Received: from kumquat.cc.columbia.edu (kumquat.cc.columbia.edu [128.59.28.169])
by taro.cc.columbia.edu (8.13.8/8.13.8) with ESMTP id p9EIEMst007167;
Fri, 14 Oct 2011 14:14:27 -0400
Received: from unicode.org (unicode.org [216.97.88.9])
by kumquat.cc.columbia.edu (8.14.4/8.14.3) with ESMTP id p9EIEDnY015910;
Fri, 14 Oct 2011 14:14:19 -0400 (EDT)
Received: from sarasvati.unicode.org (sarasvati.unicode.org.local [127.0.0.1])
by unicode.org (8.14.4/8.14.4) with ESMTP id p9EI2IsK017644;
Fri, 14 Oct 2011 13:02:18 -0500
Received: with ECARTIS (v1.0.0; list unicode); Fri, 14 Oct 2011 13:02:18 -0500 (CDT)
Received: from fm200.sybase.com (fm200.sybase.com [192.138.151.122])
by unicode.org (8.14.4/8.14.4) with ESMTP id p9EI2HnE017631
for <unicode@unicode.org>; Fri, 14 Oct 2011 13:02:17 -0500
Received: from smtp1.sybase.com (sybgate.sybase.com [10.22.97.84])
by fm200.sybase.com with ESMTP id p9EI27d08064
for <unicode@unicode.org>; Fri, 14 Oct 2011 11:02:08 -0700 (PDT)
Received: from [10.22.85.211] (localhost [127.0.0.1])
by smtp1.sybase.com with ESMTP id p9EI1tr21326;
Fri, 14 Oct 2011 11:01:55 -0700 (PDT)
Message-ID: <4E987913.5020807@sybase.com>
Date: Fri, 14 Oct 2011 11:01:55 -0700
From: Ken Whistler <kenw@sybase.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1