- Newsgroups: comp.lang.perl
- Path: sparky!uunet!spool.mu.edu!sol.ctr.columbia.edu!hamblin.math.byu.edu!news.byu.edu!news.mtholyoke.edu!jbotz
- From: jbotz@mtholyoke.edu (Jurgen Botz)
- Subject: Perl & manipulating large texts
- Message-ID: <BzF1Cn.M1t@mtholyoke.edu>
- Sender: news@mtholyoke.edu (USENET News System)
- Organization: Mount Holyoke College
- Date: Thu, 17 Dec 1992 18:14:47 GMT
- Lines: 40
-
- I was recently experimenting with parsing a very large (> 1MB) text
- file, and I could not work out how to handle the text efficiently.
- I'd like some input on this. Here are the parameters of the problem:
-
- - I can't just read line by line because I might have to backtrack.
-
- - I need to be able to address individual character positions, not
- just lines.
-
- - I need to be able to use regexps to match chunks of text from the
- beginning (possibly crossing line-boundaries) and then use the
- matched chunk for something and then repeat with the text beyond
- what was matched. (i.e. bite off chunks from the beginning.)
-
- A powerful idiom that makes the lexical part very easy is being able
- to use regular expressions and then get the text that matched from
- $& and the rest of the text for the next pass from $'. However,
- it seems to me that sucking a > 1MB file into a scalar and then
- iterating with something like...
-
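-     $* = 1;                      # Perl 4: let ^ and $ match at embedded newlines
-     while (/$pat/) {
-         $match = $&;             # the text that matched
-         $_ = $';                 # keep only the text after the match
-         &do_something_with($match);
-         $pat = &next_pat();
-     }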
-
- must be prohibitively inefficient. I guess essentially the problem is
- that I can't find a way to use a C-like idiom with pointers into
- the text. Maybe Perl 5 references will take care of this? I noticed
- that "substr" returns an L-value, but couldn't find any way to take
- advantage of that fact.
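
The "pointer into the text" idea above can be sketched without copying the remainder of the buffer on every pass. This is only an illustration and assumes a Perl 5 interpreter with pos(), the \G anchor, and the /gc match flags; the tokenize routine and its token classes (word, num, char) are hypothetical names made up for the example, not part of the original problem. The scan offset is kept on the scalar itself, each match is anchored at that offset with \G, and /c keeps the offset in place when a pattern fails so the next pattern can be tried from the same spot.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: the token classes and patterns are illustrative only.
# Returns a list of [type, text, offset] triples.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;                        # the scan offset is stored on the scalar
    while (pos($text) < length($text)) {
        my $start = pos($text);            # character position of this chunk
        if    ($text =~ /\G\s+/gc)            { next }   # whitespace, may cross newlines
        elsif ($text =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, ['word', $1, $start] }
        elsif ($text =~ /\G(\d+)/gc)          { push @tokens, ['num',  $1, $start] }
        elsif ($text =~ /\G(.)/gcs)           { push @tokens, ['char', $1, $start] }
    }
    return @tokens;
}

my $sample = "foo 42\n+bar";
for my $tok (tokenize($sample)) {
    printf "%-4s %-3s at offset %d\n", @$tok;
}
```

Because the offset is an ordinary number, backtracking amounts to saving pos($text) before a speculative match and assigning the saved value back to pos($text) if that path fails. The buffer is never shortened, so every offset stays valid as a character position in the original text.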
-
- --
- Jurgen Botz | Internet: JBotz@mtholyoke.edu
- Academic Systems Consultant | Bitnet: JBotz@mhc.bitnet
- Mount Holyoke College | Voice: (US) 413-538-2375 (daytime)
- South Hadley, MA, USA | Snail Mail: J. Botz, 01075-0629
-