NetNews Usenet Archive 1992 #26

Internet Message Format | 1992-11-11 | 3.5 KB

Path: sparky!uunet!ogicse!uwm.edu!zaphod.mps.ohio-state.edu!usc!news!netlabs!lwall
From: lwall@netlabs.com (Larry Wall)
Newsgroups: comp.lang.perl
Subject: Re: Partial RegExp's
Message-ID: <1992Nov12.003038.18617@netlabs.com>
Date: 12 Nov 92 00:30:38 GMT
Article-I.D.: netlabs.1992Nov12.003038.18617
References: <BxGFIy.LuA@news.cso.uiuc.edu>
Sender: news@netlabs.com
Organization: NetLabs, Inc.
Lines: 58
Nntp-Posting-Host: scalpel.netlabs.com

In article <BxGFIy.LuA@news.cso.uiuc.edu> chappell@math.uiuc.edu (Glenn Chappell) writes:
: Here's a question I've been pondering for a few weeks:
:
: Inherent in the design of Perl seems to be the idea that you always have
: enough space in memory for the entirety of any file you'd ever want to
: work on. But what if you don't? How does one do wonderful things like
: "split" and various pattern matches & such on files that are just too big?
:
: Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
: neophyte-like-you-but-next-time-read-the-book-okay answer to this
: question, I'd like to hear it. If not, an idea:
:
: Of course, the way you deal with huge files is to read them in chunks.
: The problem with that is that you miss a pattern match that starts on
: one chunk and ends on another.
:
: Currently, the result of an attempt at a pattern match gives one of two
: responses: "Got a match" or "Didn't get a match". What if there were a
: way to tell the pattern matcher that some patterns may extend off the
: end of the currently available text, and we gave the matcher the ability
: to give two other responses: "Got a partial match, and if you give me
: more data, I may get a match" and "Got a match, which may turn into a
: bigger match if you give me more data". The matcher would also return
: the place at which the partial match began.
:
: Now, from what I know of pattern matching, it seems to me that this would
: be an easy modification to do. The question, then, is whether it would
: be worthwhile.
:
: So, does anyone think so? Or is it just me?

The main problem is the semantics of *, + and {}. Currently these want
to match the LONGEST possible string. Some patterns would have to slurp
in the entire file anyway. This is one of the reasons that Mark and I
have been talking about variants of *, + and {} that would match the
shortest string possible.

Currently there are only two ways to do first-match rather than
last-match: 1) have the thing you're looking for be the first thing in
a pattern, or 2) carefully write the intermediate stuff to exclude the
pattern you're looking for. Both of these approaches leave something to
be desired.

A thing that might help with option 1 is if we attached the state of
m//g searches to the searched string rather than to the pattern itself.
This would let you switch from one pattern to another in mid search.
One could then do tokenizing without s///, for instance. Of course,
that doesn't help the slurp-in-more-file-if-you-run-out problem.

I don't see any theoretical problems with regular expression operators
that backtrack by attempting to match more rather than less. There's
certainly no problem with oscillation, since any given operator always
progresses in the same direction.

As for the feature in question, it's just one of those things where you
try to figure out the most general (and practical) way to do it, then
think about whether you want to do it at all, and then maybe find time
to do it. None of these is a trivial undertaking. Language design is
not a haphazard activity, appearances notwithstanding. :-)

Larry
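
The two ideas in the reply above (quantifiers that match the shortest
string rather than the longest, and search state that belongs to the
string rather than to one pattern) can be sketched roughly as follows.
This is only an illustration in Python, not Perl and not anything from
the original post: Python's re module happens to spell the
shortest-match variants as *?, +? and {m,n}?, and lets a compiled
pattern start matching at an explicit offset. The sample text, the
token patterns and the function names are all made up for the example.

```python
import re

# Hypothetical sample text with two candidate end markers.
TEXT = "<a>first</a> filler <a>second</a>"

def longest_match(text):
    # Greedy: .* takes as much as it can, so the match runs to the LAST </a>.
    return re.search(r"<a>(.*)</a>", text).group(1)

def shortest_match(text):
    # Non-greedy: .*? stops at the FIRST </a> that lets the pattern succeed.
    return re.search(r"<a>(.*?)</a>", text).group(1)

# Hypothetical token patterns, tried in order at the current position.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[-+*/=]")),
    ("SPACE", re.compile(r"\s+")),
]

def tokenize(text):
    # The position belongs to the scan of this string, not to any one
    # pattern, so a different pattern can resume where the last one stopped.
    pos = 0
    tokens = []
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m:
                if kind != "SPACE":
                    tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens
```

With the sample text, longest_match(TEXT) runs through to the second
closing tag while shortest_match(TEXT) stops at the first. Neither
function addresses the read-more-of-the-file-when-you-run-out problem;
both assume the whole string is already in memory.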