NetNews Usenet Archive 1992 #30

Newsgroups: comp.lang.perl
Path: sparky!uunet!spool.mu.edu!sol.ctr.columbia.edu!hamblin.math.byu.edu!news.byu.edu!news.mtholyoke.edu!jbotz
From: jbotz@mtholyoke.edu (Jurgen Botz)
Subject: Perl & manipulating large texts
Message-ID: <BzF1Cn.M1t@mtholyoke.edu>
Sender: news@mtholyoke.edu (USENET News System)
Organization: Mount Holyoke College
Date: Thu, 17 Dec 1992 18:14:47 GMT
Lines: 40

I was recently playing with some parsing stuff in the process of trying
to parse a very large (> 1MB) text file, and I found myself unable to
determine how to deal with the text efficiently.  I'd like some input
on this... here are the parameters of the problem:

  - I can't just read line by line because I might have to backtrack.
  - I need to be able to address individual character positions, not
    just lines.
  - I need to be able to use regexps to match chunks of text from the
    beginning (possibly crossing line-boundaries) and then use the
    matched chunk for something and then repeat with the text beyond
    what was matched.  (i.e. bite off chunks from the beginning.)

A powerful idiom that makes the lexical part very easy is being able
to use regular expressions and then get the text that matched from $&
and the rest of the text for the next pass from $'.  However, it seems
to me that sucking a > 1MB file into a scalar and then iterating with
something like...

    $* = 1;
    while (/$pat/) {
        $match = $&;
        $_ = $';
        &do_something_with($match);
        $pat = &next_pat();
    }

must be prohibitively inefficient.

I guess essentially the problem is that I can't find a way to use a
C-like idiom with pointers into the text.  Maybe Perl 5 references will
take care of this?  I noticed that "substr" returns an L-value, but
couldn't find any way to take advantage of that fact.

-- 
Jurgen Botz                  | Internet: JBotz@mtholyoke.edu
Academic Systems Consultant  | Bitnet: JBotz@mhc.bitnet
Mount Holyoke College        | Voice: (US) 413-538-2375 (daytime)
South Hadley, MA, USA        | Snail Mail: J. Botz, 01075-0629
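
For reference, the "bite off chunks from the beginning" idiom described
in the post can be written out as a small self-contained script.  This
is only a sketch under assumptions: the pattern table and the tokenize()
routine are hypothetical stand-ins for the post's $pat, &next_pat() and
&do_something_with(), and the syntax is that of a current Perl 5 rather
than the Perl 4 of the time.  It also keeps a running character offset,
since the post asks for character positions and not just lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical token patterns, tried in order. Each is anchored at the
# start of the remaining text; \s+ can match across line boundaries.
my @patterns = (
    [ 'word',  qr/^[A-Za-z]+/ ],
    [ 'num',   qr/^[0-9]+/    ],
    [ 'space', qr/^\s+/       ],
);

# Returns a list of [name, matched text, character offset] triples.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    my $offset = 0;    # position of the remaining text within the original

    CHUNK: while (length $text) {
        for my $p (@patterns) {
            my ($name, $re) = @$p;
            if ($text =~ $re) {
                my $match = $&;    # the chunk that was bitten off
                $text = $';        # the rest; this copies the whole tail
                push @tokens, [ $name, $match, $offset ];
                $offset += length $match;
                next CHUNK;
            }
        }
        die "no pattern matches at offset $offset\n";
    }
    return @tokens;
}

my @tokens = tokenize("abc 123\nxyz");
print "$_->[0] at offset $_->[2]\n" for @tokens;
```

Note that every pass through the loop copies the remaining text when it
assigns from $', so for many small tokens the total copying grows
roughly with the square of the input size.  That is most likely the
inefficiency the post is worried about for a > 1MB file.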