- Path: sparky!uunet!ogicse!uwm.edu!zaphod.mps.ohio-state.edu!usc!news!netlabs!lwall
- From: lwall@netlabs.com (Larry Wall)
- Newsgroups: comp.lang.perl
- Subject: Re: Partial RegExp's
- Message-ID: <1992Nov12.003038.18617@netlabs.com>
- Date: 12 Nov 92 00:30:38 GMT
- Article-I.D.: netlabs.1992Nov12.003038.18617
- References: <BxGFIy.LuA@news.cso.uiuc.edu>
- Sender: news@netlabs.com
- Organization: NetLabs, Inc.
- Lines: 58
- Nntp-Posting-Host: scalpel.netlabs.com
-
- In article <BxGFIy.LuA@news.cso.uiuc.edu> chappell@math.uiuc.edu (Glenn Chappell) writes:
- : Here's a question I've been pondering for a few weeks:
- :
- : Inherent in the design of Perl seems to be the idea that you always have
- : enough space in memory for the entirety of any file you'd ever want to
- : work on. But what if you don't? How does one do wonderful things like
- : "split" and various pattern matches & such on files that are just too big?
- :
- : Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
- : neophyte-like-you-but-next-time-read-the-book-okay answer to this
- : question, I'd like to hear it. If not, an idea:
- :
- : Of course, the way you deal with huge files is to read them in chunks.
- : The problem with that is that you miss a pattern match that starts on
- : one chunk and ends on another.
- :
- : Currently, the result of an attempt at a pattern match gives one of two
- : responses: "Got a match" or "Didn't get a match". What if there were a
- : way to tell the pattern matcher that some patterns may extend off the
- : end of the currently available text, and we gave the matcher the ability
- : to give two other responses: "Got a partial match, and if you give me
- : more data, I may get a match" and "Got a match, which may turn into a
- : bigger match if you give me more data". The matcher would also return
- : the place at which the partial match began.
- :
- : Now, from what I know of pattern matching, it seems to me that this would
- : be an easy modification to do. The question, then, is whether it would
- : be worthwhile. So, does anyone think so? Or is it just me?
-
- The main problem is the semantics of *, + and {}. Currently these want
- to match the LONGEST possible string. Some patterns would have to
- slurp in the entire file anyway.
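
The longest-match point can be illustrated with a small sketch. It is written in Python rather than Perl, purely as a stand-in with similar greedy semantics; the sample string, the chunk size, and the helper name `longest_match` are all assumptions made for this example.

```python
import re

# Greedy '.*' wants the LONGEST string that still lets the pattern succeed.
pattern = re.compile(r"BEGIN.*END", re.DOTALL)

data = "BEGIN a END b END c"
first_chunk = data[:11]  # "BEGIN a END": all we would see after one read


def longest_match(text):
    """Return the text matched by the greedy pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


# With only the first chunk available, the match stops at the first END.
print(longest_match(first_chunk))  # BEGIN a END

# With the whole input, the same pattern stretches to the last END.
print(longest_match(data))         # BEGIN a END b END
```

The answer from the first chunk is a real match, but not the one the greedy operator would report on the full input, so a matcher of this kind cannot commit until it has seen all the data.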
-
- This is one of the reasons that Mark and I have been talking about
- variants of *, + and {} that would match the shortest string possible.
- Currently there are only two ways to do first-match rather than last-match:
- 1) have the thing you're looking for be the first thing in a pattern, or
- 2) carefully write the intermediate stuff to exclude the pattern you're
- looking for. Both of these approaches leave something to be desired.
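
As a rough illustration of the two workarounds, here is a sketch in Python (again standing in for Perl, using only greedy operators). The sample text and the function names are hypothetical and chosen only for this example.

```python
import re

text = 'x "one" y "two" z'


def greedy_quoted(s):
    """Plain greedy match: runs from the first quote to the LAST quote."""
    m = re.search(r'"(.*)"', s)
    return m.group(1) if m else None


def first_by_leading_pattern(s):
    """Option 1: make the delimiter the first thing in its own pattern,
    so each search stops at the first place it occurs."""
    quote = re.compile(r'"')
    opening = quote.search(s)
    if not opening:
        return None
    closing = quote.search(s, opening.end())
    if not closing:
        return None
    return s[opening.end():closing.start()]


def first_by_exclusion(s):
    """Option 2: write the intermediate stuff so it cannot contain
    the delimiter being looked for."""
    m = re.search(r'"([^"]*)"', s)
    return m.group(1) if m else None


print(greedy_quoted(text))             # one" y "two
print(first_by_leading_pattern(text))  # one
print(first_by_exclusion(text))        # one
```

Option 1 needs two separate searches, and option 2 only stays this simple while the delimiter is a single character, which is one way of seeing why both leave something to be desired.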
-
- A thing that might help with option 1 is if we attached the state of
- m//g searches to the searched string rather than to the pattern
- itself. This would let you switch from one pattern to another in mid
- search. One could then do tokenizing without s///, for instance. Of
- course, that doesn't help the slurp-in-more-file-if-you-run-out problem.
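
The idea of keeping the search position with the string rather than with any one pattern can be sketched as follows. This is Python, not Perl, and it is not a description of how m//g works; the token classes, the pattern table, and the `tokenize` helper are assumptions made for illustration. The position `pos` belongs to the scan of the text, and any pattern may be tried at it.

```python
import re

# Hypothetical token classes; each is a separate compiled pattern.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[-+*/=]")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Split text into (kind, value) pairs without substituting into it."""
    pos = 0  # search state kept alongside the string, not inside a pattern
    tokens = []
    while pos < len(text):
        for kind, pat in TOKEN_PATTERNS:
            m = pat.match(text, pos)  # try this pattern at the shared position
            if m:
                if kind != "SPACE":
                    tokens.append((kind, m.group(0)))
                pos = m.end()  # the next pattern resumes where this one stopped
                break
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens


print(tokenize("x = 42 + y"))
```

Because the text is never modified, switching patterns in mid search costs nothing more than passing the same offset to a different pattern.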
-
- I don't see any theoretical problems with regular expression operators
- that backtrack by attempting to match more rather than less. There's
- certainly no problem with oscillation, since any given operator always
- progresses in the same direction.
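
For a feel of what shortest-match operators would look like in use, Python's lazy quantifiers (`+?`, `*?`) behave this way: they start with the fewest repetitions and backtrack by taking more. The sketch below uses them only as a stand-in; the sample strings are assumptions, and nothing here is a claim about Perl's syntax.

```python
import re

greedy = re.compile(r"<(.+)>")   # starts long, backtracks by matching less
lazy = re.compile(r"<(.+?)>")    # starts short, backtracks by matching more

text = "<a><b>"
prefix = text[:3]  # "<a>": as if only the first chunk had been read

print(greedy.search(text).group(1))   # a><b
print(lazy.search(text).group(1))     # a

# In this example the lazy result on the prefix is already the final answer,
# while the greedy result changes once more text arrives.
print(lazy.search(prefix).group(1))    # a
print(greedy.search(prefix).group(1))  # a
```

In this small case the shortest match can be reported as soon as it is found, which is the property that makes such operators attractive for input that arrives in pieces.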
-
- As for the feature in question, it's just one of those things where you
- try to figure out the most general (and practical) way to do it, then
- think about whether you want to do it at all, and then maybe find time
- to do it. None of these is a trivial undertaking. Language design is
- not a haphazard activity, appearances notwithstanding. :-)
-
- Larry
-