From std-unix-request@uunet.uu.net Sun Sep 9 15:19:07 1990
Received: from cs.utexas.edu by uunet.uu.net (5.61/1.14) with SMTP
id AA01995; Sun, 9 Sep 90 15:19:07 -0400
Posted-Date: 9 Sep 90 04:39:29 GMT
Received: by cs.utexas.edu (5.64/1.76)
From: henry@zoo.toronto.edu (Henry Spencer)
Newsgroups: comp.std.unix
Subject: Re: ambiguous match with multiple-character collating elements
Message-Id: <501@usenix.ORG>
References: <487@usenix.ORG>
Sender: jsq@usenix.ORG
Organization: U of Toronto Zoology
X-Submissions: std-unix@uunet.uu.net
Date: 9 Sep 90 04:39:29 GMT
Reply-To: std-unix@uunet.uu.net
To: std-unix@uunet.uu.net
In article <487@usenix.ORG> karl@IMA.ISC.COM (Karl Heuer) writes:
>In an environment where the digraph "ch" collates as a single element, what
>happens if an attempt is made to match the subject string "chi" with the
>pattern "[c[.ch.]]i" or "[c[.ch.]]hi"? Is the implementation required to
>report a successful match in both cases? If so, it would seem necessary to
>use a nondeterministic finite automaton or equivalent, thus making simple
>regexp matching and filename globbing as complex as egrep pattern matching.
Looking at draft 10, I don't think there is much doubt that they both must
match, assuming those are legal regular expressions. (If "c" is not a
collating element or noncollating character, they're not.) If both "c"
and "ch" are valid collating elements, the bracket expression must be
prepared to match either.
The wording could stand improving.
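To make the requirement concrete, here is a minimal sketch in Python. It is an illustration only, not anything taken from draft 10: the pattern is modelled as a list of sets of collating elements, the set {"c", "ch"} stands in for the bracket expression "[c[.ch.]]", and match_here is a hypothetical name. The point it shows is that the matcher has to try every element of the bracket, not just the longest one.

```python
def match_here(pattern, text):
    """Anchored match of text against a list of sets of collating elements.

    Each set plays the role of one bracket expression; an ordinary
    character is a one-element set. This is an assumed toy model.
    """
    if not pattern:
        return text == ""
    first, rest = pattern[0], pattern[1:]
    # Try every collating element in the set and backtrack on failure.
    # Committing to the longest element only would lose the second match below.
    return any(
        text.startswith(elem) and match_here(rest, text[len(elem):])
        for elem in first
    )


BRACKET = {"c", "ch"}  # stands in for [c[.ch.]]

print(match_here([BRACKET, {"i"}], "chi"))         # bracket consumes "ch"
print(match_here([BRACKET, {"h"}, {"i"}], "chi"))  # bracket consumes "c"
```

Both calls succeed, but only because the matcher is willing to take "ch" in the first case and plain "c" in the second.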
As for the implementation aspects, yes, this is a pain. However, there
is basically no such thing as "simple" regexp matching. :-) The extra
complexity added by multicharacter collating elements, while annoying,
is not that big a deal. I think Karl is confused. *Any* non-trivial
regexp matching ends up using either nondeterministic or deterministic
automata, sometimes behind clever plastic disguises. The very simplest
forms, like globbing, sometimes can get away without having to compile
the regexp string into an internal form, by running a nondeterministic
automaton directly from the regexp string. That will get a bit harder
because of the greater complexity of 1003.2 regexps. However, there
is no way that even "simple" regexp matching (I assume Karl is thinking
of things like ed) is viable without a compilation step.
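As a rough illustration of running a nondeterministic automaton directly from the pattern string, the following Python sketch handles only "*" and "?" globs. The function name glob_match and the restriction to those two metacharacters are assumptions made for brevity; bracket expressions, and with them the collating-element problem, are deliberately left out. The automaton's states are simply positions in the uncompiled pattern, and the matcher carries the set of live positions along the name.

```python
def glob_match(pattern, name):
    """Match name against a glob containing only '*' and '?'.

    There is no compilation step: each NFA state is an index into pattern.
    """
    def closure(states):
        # A '*' may match nothing, so the position after it is live too.
        out = set()
        stack = list(states)
        while stack:
            i = stack.pop()
            if i in out:
                continue
            out.add(i)
            if i < len(pattern) and pattern[i] == "*":
                stack.append(i + 1)
        return out

    states = closure({0})
    for ch in name:
        nxt = set()
        for i in states:
            if i == len(pattern):
                continue  # pattern exhausted but input remains
            p = pattern[i]
            if p == "*":
                nxt.add(i)      # '*' absorbs ch and stays live
            elif p == "?" or p == ch:
                nxt.add(i + 1)  # single-character step
        states = closure(nxt)
        if not states:
            return False
    return len(pattern) in states


print(glob_match("*.c", "regexp.c"))
print(glob_match("a*b", "ac"))
```

Adding bracket expressions with multicharacter collating elements would mean a single state can consume one or two input characters, which is where the bookkeeping starts to get harder.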
Given that 1003.2 defines -- finally! -- library functions for regexp
work of various kinds, including globbing, the complexity will, in any
case, be localized to library functions. The programmer won't have to
worry about it unless he's actually writing those library functions.
(*That* won't be fun.)
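The caller's side of that bargain can be sketched as follows. This uses Python's standard re and fnmatch modules purely as stand-ins for library-provided matching; they are not the 1003.2 C interfaces, and the patterns avoid collating-element syntax, which this sketch does not rely on these modules supporting.

```python
import fnmatch
import re

# The application hands a pattern to the library and asks a yes/no question.
# How the library compiles and runs the automaton is not the caller's concern.
compiled = re.compile(r"[ch]+i")
print(bool(compiled.fullmatch("chi")))

# Globbing is likewise a single library call.
print(fnmatch.fnmatchcase("chi", "c?i"))
```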
--
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday| henry@zoo.toronto.edu utzoo!henry
Volume-Number: Volume 21, Number 95