- PERLRE(1) User Contributed Perl Documentation PERLRE(1)
-
-
- NAME
- perlre - Perl regular expressions
-
- DESCRIPTION
- This page describes the syntax of regular expressions in
- Perl. For a description of how to actually use regular
- expressions in matching operations, plus various examples
- of the same, see m// and s/// in the perlop manpage.
-
- The matching operations can have various modifiers, some
- of which relate to the interpretation of the regular
- expression inside. These are:
-
-     i   Do case-insensitive pattern matching.
-     m   Treat string as multiple lines.
-     s   Treat string as single line.
-     x   Extend your pattern's legibility with whitespace and comments.
-
- These are usually written as "the /x modifier", even
- though the delimiter in question might not actually be a
- slash. In fact, any of these modifiers may also be
- embedded within the regular expression itself using the
- new (?...) construct. See below.
-
- The /x modifier itself needs a little more explanation.
- It tells the regular expression parser to ignore
- whitespace that is not backslashed or within a character
- class. You can use this to break up your regular
- expression into (slightly) more readable parts. The #
- character is also treated as a metacharacter introducing a
- comment, just as in ordinary Perl code. Taken together,
- these features go a long way towards making Perl 5 a
- readable language. See the C comment deletion code in the
- perlop manpage.
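
As a rough illustration of /x, a pattern can be spread over several lines with a comment on each piece. This is only a sketch: the sample string and variable names are invented, and the strict and warnings pragmas assume a reasonably modern perl.

```perl
use strict;
use warnings;

# Hypothetical input string, made up for this sketch.
my $stamp = "Time: 12:34:56";

my ($hours, $minutes, $seconds) = $stamp =~ /
    Time: \s+    # the label; under x a bare space is ignored, so use \s
    (\d\d) :     # hours
    (\d\d) :     # minutes
    (\d\d)       # seconds
/x;

print "$hours h, $minutes m, $seconds s\n";
```

Keep the pattern delimiter out of such comments, since perl looks for the closing delimiter before it looks at the comment.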
-
- Regular Expressions
-
- The patterns used in pattern matching are regular
- expressions such as those supplied in the Version 8 regexp
- routines. (In fact, the routines are derived (distantly)
- from Henry Spencer's freely redistributable
- reimplementation of the V8 routines.) See the section on
- Version 8 Regular Expressions for details.
-
- In particular the following metacharacters have their
- standard egrep-ish meanings:
-
-     \    Quote the next metacharacter
-     ^    Match the beginning of the line
-     .    Match any character (except newline)
-     $    Match the end of the line (or before newline at the end)
-     |    Alternation
-     ()   Grouping
-     []   Character class
-
- 13/Feb/96 perl 5.002 with 1
-
- By default, the "^" character is guaranteed to match only
- at the beginning of the string, the "$" character only at
- the end (or before the newline at the end) and Perl does
- certain optimizations with the assumption that the string
- contains only one line. Embedded newlines will not be
- matched by "^" or "$". You may, however, wish to treat a
- string as a multi-line buffer, such that the "^" will
- match after any newline within the string, and "$" will
- match before any newline. At the cost of a little more
- overhead, you can do this by using the /m modifier on the
- pattern match operator. (Older programs did this by
- setting $*, but this practice is deprecated in Perl 5.)
-
- To facilitate multi-line substitutions, the "." character
- never matches a newline unless you use the /s modifier,
- which tells Perl to pretend the string is a single
- line--even if it isn't. The /s modifier also overrides
- the setting of $*, in case you have some (badly behaved)
- older code that sets it in another module.
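
The effect of /m and /s can be sketched as follows. The two-line sample string and the variable names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Made-up two-line buffer for illustration.
my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;
# With /m, ^ also matches after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;

print scalar(@plain), " match without /m, ", scalar(@multi), " with /m\n";
```

Here $no_s holds only "first line", while $with_s holds the whole buffer, newlines included.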
-
- The following standard quantifiers are recognized:
-
-     *       Match 0 or more times
-     +       Match 1 or more times
-     ?       Match 1 or 0 times
-     {n}     Match exactly n times
-     {n,}    Match at least n times
-     {n,m}   Match at least n but not more than m times
-
- (If a curly bracket occurs in any other context, it is
- treated as a regular character.) The "*" modifier is
- equivalent to {0,}, the "+" modifier to {1,}, and the "?"
- modifier to {0,1}. n and m are limited to integral values
- less than 65536.
-
- By default, a quantified subpattern is "greedy", that is,
- it will match as many times as possible (given a
- particular starting location) without causing the rest of
- the pattern to fail. All of the standard quantifiers
- behave this way. If you want a quantifier to match the
- minimum number of times possible, follow it with a "?".
- Note that the meanings don't change, just the
- "greediness":
-
-     *?      Match 0 or more times
-     +?      Match 1 or more times
-     ??      Match 0 or 1 time
-     {n}?    Match exactly n times
-     {n,}?   Match at least n times
-     {n,m}?  Match at least n but not more than m times
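
A small sketch of the difference, using an invented string of digits (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $digits = "1234567";   # made-up sample data

my ($greedy)  = $digits =~ /(\d{2,4})/;    # takes as many as allowed
my ($minimal) = $digits =~ /(\d{2,4}?)/;   # takes as few as allowed

print "greedy=$greedy minimal=$minimal\n";  # greedy=1234 minimal=12
```

Both patterns accept between two and four digits; only the preference between the shortest and longest acceptable match differs.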
-
- Since patterns are processed as double quoted strings, the
- following also work:
-
-     \t      tab
-     \n      newline
-     \r      return
-     \f      form feed
-     \a      alarm (bell)
-     \e      escape (think troff)
-     \033    octal char (think of a PDP-11)
-     \x1B    hex char
-     \c[     control char
-     \l      lowercase next char (think vi)
-     \u      uppercase next char (think vi)
-     \L      lowercase till \E (think vi)
-     \U      uppercase till \E (think vi)
-     \E      end case modification (think vi)
-     \Q      quote regexp metacharacters till \E
-
- In addition, Perl defines the following:
-
-     \w      Match a "word" character (alphanumeric plus "_")
-     \W      Match a non-word character
-     \s      Match a whitespace character
-     \S      Match a non-whitespace character
-     \d      Match a digit character
-     \D      Match a non-digit character
-
- Note that \w matches a single alphanumeric character, not
- a whole word. To match a word you'd need to say \w+. You
- may use \w, \W, \s, \S, \d and \D within character classes
- (though not as either end of a range).
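
For instance, here is a sketch that uses these classes on their own and inside a character class. The log-style line and the variable names are invented for the example.

```perl
use strict;
use warnings;

my $line = "user_42  logged in at 09:15";   # hypothetical sample line

my @words  = $line =~ /(\w+)/g;      # runs of word characters
my @fields = split /\s+/, $line;     # split on runs of whitespace
my ($time) = $line =~ /([\d:]+)$/;   # \d used inside a character class

print scalar(@words), " words, ", scalar(@fields), " fields, time $time\n";
```

Note that "09:15" counts as two words, because ":" is not a \w character, but as a single whitespace-delimited field.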
-
- Perl defines the following zero-width assertions:
-
-     \b      Match a word boundary
-     \B      Match a non-(word boundary)
-     \A      Match only at beginning of string
-     \Z      Match only at end of string (or before newline at the end)
-     \G      Match only where previous m//g left off
-
- A word boundary (\b) is defined as a spot between two
- characters that has a \w on one side of it and a \W on
- the other side of it (in either order), counting the
- imaginary characters off the beginning and end of the
- string as matching a \W. (Within character classes \b
- represents backspace rather than a word boundary.) The \A
- and \Z are just like "^" and "$" except that they won't
- match multiple times when the /m modifier is used, while
- "^" and "$" will match at every internal line boundary.
- To match the actual end of the string, not ignoring
- newline, you can use \Z(?!\n).
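
The following sketch counts matches to show how these assertions differ. The sample text and variable names are made up; the "= () =" construct is simply a common idiom for counting the matches of a /g pattern.

```perl
use strict;
use warnings;

my $text = "cat\nconcatenate cat\n";   # made-up sample text

my $any        = () = $text =~ /cat/g;       # every "cat", even inside a word
my $whole      = () = $text =~ /\bcat\b/g;   # only "cat" as a whole word
my $line_ends  = () = $text =~ /cat$/mg;     # "cat" at the end of any line
my $string_end = () = $text =~ /cat\Z/mg;    # \Z ignores /m: last line only
my $strict_end = ($text =~ /cat\Z(?!\n)/) ? 1 : 0;   # fails: a newline follows

print "$any $whole $line_ends $string_end $strict_end\n";   # 3 2 2 1 0
```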
-
- When the bracketing construct ( ... ) is used, \<digit>
- matches the digit'th substring. Outside of the pattern,
- always use "$" instead of "\" in front of the digit. (The
- \<digit> notation can on rare occasion work outside the
- current pattern, but this should not be relied upon. See
- the WARNING below.) The scope of $<digit> (and $`, $&,
- and $') extends to the end of the enclosing BLOCK or eval
- string, or to the next successful pattern match, whichever
- comes first. If you want to use parentheses to delimit a
- subpattern (e.g. a set of alternatives) without saving it
- as a subpattern, follow the "(" with "?:".
-
- You may have as many parentheses as you wish. If you have
- more than 9 substrings, the variables $10, $11, ... refer
- to the corresponding substring. Within the pattern, \10,
- \11, etc. refer back to substrings if there have been at
- least that many left parens before the backreference.
- Otherwise (for backward compatibility) \10 is the same as
- \010, a backspace, and \11 the same as \011, a tab. And
- so on. (\1 through \9 are always backreferences.)
-
- $+ returns whatever the last bracket match matched. $&
- returns the entire matched string. ($0 used to return the
- same thing, but not any more.) $` returns everything
- before the matched string. $' returns everything after
- the matched string. Examples:
-
-     s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
-
-     if (/Time: (..):(..):(..)/) {
-         $hours = $1;
-         $minutes = $2;
-         $seconds = $3;
-     }
-
- You will note that all backslashed metacharacters in Perl
- are alphanumeric, such as \b, \w, \n. Unlike some other
- regular expression languages, there are no backslashed
- symbols that aren't alphanumeric. So anything that looks
- like \\, \(, \), \<, \>, \{, or \} is always interpreted
- as a literal character, not a metacharacter. This makes
- it simple to quote a string that you want to use for a
- pattern but that you are afraid might contain
- metacharacters. Simply quote all the non-alphanumeric
- characters:
-
-     $pattern =~ s/(\W)/\\$1/g;
-
- You can also use the built-in quotemeta() function to do
- this. An even easier way to quote metacharacters right in
- the match operator is to say
-
-     /$unquoted\Q$quoted\E$unquoted/
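
To see why the quoting matters, here is a sketch with an invented search string that happens to contain metacharacters (all names and data are illustrative):

```perl
use strict;
use warnings;

my $literal  = "a.b*c";               # made-up string with metacharacters
my $haystack = "axbbbc and a.b*c";    # made-up text to search

my ($loose) = $haystack =~ /($literal)/;       # "." and "*" act as metacharacters
my ($exact) = $haystack =~ /(\Q$literal\E)/;   # every character taken literally
my $quoted  = quotemeta($literal);             # the same quoting as a function

print "loose=$loose exact=$exact quoted=$quoted\n";
```

The unquoted pattern matches "axbbbc", which is probably not what was intended; the quoted one finds the literal "a.b*c".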
-
- Perl 5 defines a consistent extension syntax for regular
- expressions. The syntax is a pair of parens with a
- question mark as the first thing within the parens (this
- was a syntax error in Perl 4). The character after the
- question mark gives the function of the extension.
- Several extensions are already supported:
-
- (?#text)    A comment. The text is ignored. If the /x
-             switch is used to enable whitespace formatting,
-             a simple # will suffice.
-
- (?:regexp)
-             This groups things like "()" but doesn't make
-             backreferences like "()" does. So
-
-                 split(/\b(?:a|b|c)\b/)
-
-             is like
-
-                 split(/\b(a|b|c)\b/)
-
-             but doesn't spit out extra fields.
-
- (?=regexp)
-             A zero-width positive lookahead assertion. For
-             example, /\w+(?=\t)/ matches a word followed by
-             a tab, without including the tab in $&.
-
- (?!regexp)
-             A zero-width negative lookahead assertion. For
-             example /foo(?!bar)/ matches any occurrence of
-             "foo" that isn't followed by "bar". Note
-             however that lookahead and lookbehind are NOT
-             the same thing. You cannot use this for
-             lookbehind: /(?!foo)bar/ will not find an
-             occurrence of "bar" that is preceded by
-             something which is not "foo". That's because
-             the (?!foo) is just saying that the next thing
-             cannot be "foo"--and it's not, it's a "bar", so
-             "foobar" will match. You would have to do
-             something like /(?!foo)...bar/ for that. We say
-             "like" because there's the case of your "bar"
-             not having three characters before it. You
-             could cover that this way:
-             /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still
-             easier just to say:
-
-                 if (/bar/ && $` !~ /foo$/)
-
-
- (?imsx)     One or more embedded pattern-match modifiers.
-             This is particularly useful for patterns that
-             are specified in a table somewhere, some of
-             which want to be case sensitive, and some of
-             which don't. The case insensitive ones merely
-             need to include (?i) at the front of the
-             pattern. For example:
-
-                 $pattern = "foobar";
-                 if ( /$pattern/i )
-
-                 # more flexible:
-
-                 $pattern = "(?i)foobar";
-                 if ( /$pattern/ )
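
The table-driven use of (?i) mentioned in the entry above might look roughly like this. The table contents, the input string and the variable names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical table: each entry decides for itself whether case matters.
my @table = ('(?i)foobar', 'FooBar', '(?i)baz');
my $input = "FOOBAR and Baz";   # made-up input

# No /i on the match operator; only entries carrying (?i) ignore case.
my @hits = grep { $input =~ /$_/ } @table;

print "matched: @hits\n";   # matched: (?i)foobar (?i)baz
```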
-
-
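
The lookahead pitfall described under (?!regexp) can be checked with a sketch like the one below. The sample words and array names are invented. The two-step test looks only at the first "bar" in each string.

```perl
use strict;
use warnings;

my (@naive, @guarded, @two_step);
for my $s (qw(foobar bazbar bar xbar)) {          # made-up sample words
    # Lookahead at "bar" itself: never sees what came before.
    push @naive,    $s if $s =~ /(?!foo)bar/;
    # Three characters that are not "foo", or fewer than three at the start.
    push @guarded,  $s if $s =~ /(?:(?!foo)...|^.{0,2})bar/;
    # Find "bar", then inspect the text before the match.
    push @two_step, $s if $s =~ /bar/ && $` !~ /foo$/;
}

print "naive:    @naive\n";      # foobar bazbar bar xbar
print "guarded:  @guarded\n";    # bazbar bar xbar
print "two_step: @two_step\n";   # bazbar bar xbar
```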
- The specific choice of question mark for this and the new
- minimal matching construct was because 1) question mark is
- pretty rare in older regular expressions, and 2) whenever
- you see one, you should stop and "question" exactly what
- is going on. That's psychology...
-
- Backtracking
-
- A fundamental feature of regular expression matching
- involves the notion called backtracking, which is used
- (when needed) by all regular expression quantifiers,
- namely *, *?, +, +?, {n,m}, and {n,m}?.
-
- For a regular expression to match, the entire regular
- expression must match, not just part of it. So if the
- beginning of a pattern containing a quantifier succeeds in
- a way that causes later parts in the pattern to fail, the
- matching engine backs up and recalculates the beginning
- part--that's why it's called backtracking.
-
- Here is an example of backtracking: Let's say you want to
- find the word following "foo" in the string "Food is on
- the foo table.":
-
-     $_ = "Food is on the foo table.";
-     if ( /\b(foo)\s+(\w+)/i ) {
-         print "$2 follows $1.\n";
-     }
-
- When the match runs, the first part of the regular
- expression (\b(foo)) finds a possible match right at the
- beginning of the string, and loads up $1 with "Foo".
- However, as soon as the matching engine sees that there's
- no whitespace following the "Foo" that it had saved in $1,
- it realizes its mistake and starts over again one
- character after where it had had the tentative match.
- This time it goes all the way until the next occurrence of
- "foo". The complete regular expression matches this time,
- and you get the expected output of "table follows foo."
-
- Sometimes minimal matching can help a lot. Imagine you'd
- like to match everything between "foo" and "bar".
- Initially, you write something like this:
-
-     $_ = "The food is under the bar in the barn.";
-     if ( /foo(.*)bar/ ) {
-         print "got <$1>\n";
-     }
-
- Which perhaps unexpectedly yields:
-
-     got <d is under the bar in the >
-
- That's because .* was greedy, so you get everything
- between the first "foo" and the last "bar". In this case,
- it's more effective to use minimal matching to make sure
- you get the text between a "foo" and the first "bar"
- thereafter.
-
-     if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
-     got <d is under the >
-
- Here's another example: let's say you'd like to match a
- number at the end of a string, and you also want to keep
- the preceding part of the match. So you write this:
-
-     $_ = "I have 2 numbers: 53147";
-     if ( /(.*)(\d*)/ ) {                          # Wrong!
-         print "Beginning is <$1>, number is <$2>.\n";
-     }
-
- That won't work at all, because .* was greedy and gobbled
- up the whole string. As \d* can match on an empty string
- the complete regular expression matched successfully.
-
-     Beginning is <I have 2 numbers: 53147>, number is <>.
-
- Here are some variants, most of which don't work:
-
-     $_ = "I have 2 numbers: 53147";
-     @pats = qw{
-         (.*)(\d*)
-         (.*)(\d+)
-         (.*?)(\d*)
-         (.*?)(\d+)
-         (.*)(\d+)$
-         (.*?)(\d+)$
-         (.*)\b(\d+)$
-         (.*\D)(\d+)$
-     };
-
-     for $pat (@pats) {
-         printf "%-12s ", $pat;
-         if ( /$pat/ ) {
-             print "<$1> <$2>\n";
-         } else {
-             print "FAIL\n";
-         }
-     }
-
- That will print out:
-
-     (.*)(\d*)    <I have 2 numbers: 53147> <>
-     (.*)(\d+)    <I have 2 numbers: 5314> <7>
-     (.*?)(\d*)   <> <>
-     (.*?)(\d+)   <I have > <2>
-     (.*)(\d+)$   <I have 2 numbers: 5314> <7>
-     (.*?)(\d+)$  <I have 2 numbers: > <53147>
-     (.*)\b(\d+)$ <I have 2 numbers: > <53147>
-     (.*\D)(\d+)$ <I have 2 numbers: > <53147>
-
- As you see, this can be a bit tricky. It's important to
- realize that a regular expression is merely a set of
- assertions that gives a definition of success. There may
- be 0, 1, or several different ways that the definition
- might succeed against a particular string. And if there
- are multiple ways it might succeed, you need to understand
- backtracking in order to know which variety of success you
- will achieve.
-
- When using lookahead assertions and negations, this can
- all get even trickier. Imagine you'd like to find a
- sequence of nondigits not followed by "123". You might
- try to write that as
-
-     $_ = "ABC123";
-     if ( /^\D*(?!123)/ ) {                # Wrong!
-         print "Yup, no 123 in $_\n";
-     }
-
- But that isn't going to match; at least, not the way
- you're hoping. It claims that there is no 123 in the
- string. Here's a clearer picture of why that pattern
- matches, contrary to popular expectations:
-
-     $x = 'ABC123' ;
-     $y = 'ABC445' ;
-
-     print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
-     print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
-
-     print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
-     print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
-
- This prints
-
-     2: got ABC
-     3: got AB
-     4: got ABC
-
- You might have expected test 3 to fail because it just
- seems to be a more general purpose version of test 1. The
- important difference between them is that test 3 contains
- a quantifier (\D*) and so can use backtracking, whereas
- test 1 cannot. What's happening is that you've asked
- "Is it true that at the start of $x, following 0 or more
- nondigits, you have something that's not 123?" If the
- pattern matcher had let \D* expand to "ABC", this would
- have caused the whole pattern to fail. The search engine
- will initially match \D* with "ABC". Then it will try to
- match (?!123) with "123" which, of course, fails. But
- because a quantifier (\D*) has been used in the regular
- expression, the search engine can backtrack and retry the
- match differently in the hope of matching the complete
- regular expression.
-
- Well now, the pattern really, really wants to succeed, so
- it uses the standard regexp backoff-and-retry and lets \D*
- expand to just "AB" this time. Now there's indeed
- something following "AB" that is not "123". It's in fact
- "C123", which suffices.
-
- We can deal with this by using both an assertion and a
- negation. We'll say that the first part in $1 must be
- followed by a digit, and in fact, it must also be followed
- by something that's not "123". Remember that the
- lookaheads are zero-width expressions--they only look, but
- don't consume any of the string in their match. So
- rewriting this way produces what you'd expect; that is,
- case 5 will fail, but case 6 succeeds:
-
-     print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
-     print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
-
-     6: got ABC
-
- In other words, the two zero-width assertions next to each
- other work like they're ANDed together, just as you'd use
- any builtin assertions: /^$/ matches only if you're at
- the beginning of the line AND the end of the line
- simultaneously. The deeper underlying truth is that
- juxtaposition in regular expressions always means AND,
- except when you write an explicit OR using the vertical
- bar. /ab/ means match "a" AND (then) match "b", although
- the attempted matches are made at different positions
- because "a" is not a zero-width assertion, but a one-width
- assertion.
-
- One warning: particularly complicated regular expressions
- can take exponential time to solve due to the immense
-
- number of possible ways they can use backtracking to try
- to match. For example, this will take a very long time to
- run:
-
-     /((a{0,5}){0,5}){0,5}/
-
- And if you used *'s instead of limiting it to 0 through 5
- matches, then it would take literally forever--or until
- you ran out of stack space.
-
- Version 8 Regular Expressions
-
- In case you're not familiar with the "regular" Version 8
- regexp routines, here are the pattern-matching rules not
- described above.
-
- Any single character matches itself, unless it is a
- metacharacter with a special meaning described here or
- above. You can cause characters which normally function
- as metacharacters to be interpreted literally by prefixing
- them with a "\" (e.g. "\." matches a ".", not any
- character; "\\" matches a "\"). A series of characters
- matches that series of characters in the target string, so
- the pattern blurfl would match "blurfl" in the target
- string.
-
- You can specify a character class, by enclosing a list of
- characters in [], which will match any one of the
- characters in the list. If the first character after the
- "[" is "^", the class matches any character not in the
- list. Within a list, the "-" character is used to specify
- a range, so that a-z represents all the characters between
- "a" and "z", inclusive.
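
A brief sketch of ranges and negated classes follows; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $s = "Route 66, exit 12-B";   # made-up sample string

my @lower  = $s =~ /([a-z]+)/g;           # runs of lowercase letters
my @others = $s =~ /([^A-Za-z0-9 ]+)/g;   # anything but letters, digits, space
my ($code) = $s =~ /([0-9]+-[A-Z])/;      # outside a class, "-" is literal

print "lower: @lower\n";     # lower: oute exit
print "others: @others\n";   # others: , -
print "code: $code\n";       # code: 12-B
```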
-
- Characters may be specified using a metacharacter syntax
- much like that used in C: "\n" matches a newline, "\t" a
- tab, "\r" a carriage return, "\f" a form feed, etc. More
- generally, \nnn, where nnn is a string of octal digits,
- matches the character whose ASCII value is nnn.
- Similarly, \xnn, where nn are hexadecimal digits, matches
- the character whose ASCII value is nn. The expression \cx
- matches the ASCII character control-x. Finally, the "."
- metacharacter matches any character except "\n" (unless
- you use /s).
-
- You can specify a series of alternatives for a pattern
- using "|" to separate them, so that fee|fie|foe will match
- any of "fee", "fie", or "foe" in the target string (as
- would f(e|i|o)e). Note that the first alternative
- includes everything from the last pattern delimiter ("(",
- "[", or the beginning of the pattern) up to the first "|",
- and the last alternative contains everything from the last
- "|" to the next pattern delimiter. For this reason, it's
- common practice to include alternatives in parentheses, to
- minimize confusion about where they start and end. Note
- however that "|" is interpreted as a literal within square
- brackets, so if you write [fee|fie|foe] you're really only
- matching [feio|].
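
The sketch below contrasts grouped alternatives, ungrouped alternatives next to anchors, and the character-class mistake. The word list and array names are invented for the example.

```perl
use strict;
use warnings;

my @words = ("fee", "fie", "foe", "fum", "f|e", "feeble");   # made-up words

# Grouped: the anchors apply to every alternative.
my @grouped   = grep { /^(?:fee|fie|foe)$/ } @words;
# Ungrouped: ^ belongs to "fee" only, and $ to "foe" only.
my @ungrouped = grep { /^fee|fie|foe$/ } @words;
# A class, not alternation: any mix of the characters f, e, i, o and |.
my @class     = grep { /^[fee|fie|foe]+$/ } @words;

print "grouped:   @grouped\n";     # fee fie foe
print "ungrouped: @ungrouped\n";   # fee fie foe feeble
print "class:     @class\n";       # fee fie foe f|e
```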
-
- Within a pattern, you may designate subpatterns for later
- reference by enclosing them in parentheses, and you may
- refer back to the nth subpattern later in the pattern
- using the metacharacter \n (where n is the subpattern's
- number). Subpatterns are numbered based on the left to
- right order of their opening parenthesis. Note that a
- backreference matches whatever actually matched the
- subpattern in the string being examined, not the rules for
- that subpattern. Therefore, (0|0x)\d*\s\1\d* will match
- "0x1234 0x4321", but not "0x1234 01234", since subpattern
- 1 actually matched "0x", even though the rule 0|0x could
- potentially match the leading 0 in the second number.
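
That behaviour can be confirmed with a short sketch. The variable names are illustrative, and qr// assumes a perl newer than the one this page was written for.

```perl
use strict;
use warnings;

# qr// is an assumption of this sketch; it needs perl 5.005 or later.
my $re = qr/(0|0x)\d*\s\1\d*/;

my ($prefix) = "0x1234 0x4321" =~ $re;          # \1 must repeat "0x"
my $mixed    = ("0x1234 01234" =~ $re) ? 1 : 0; # "0x" then "0": no match

print "prefix=$prefix mixed=$mixed\n";   # prefix=0x mixed=0
```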
-
- WARNING on \1 vs $1
-
- Some people get too used to writing things like
-
-     $pattern =~ s/(\W)/\\\1/g;
-
- This is grandfathered for the RHS of a substitute to avoid
- shocking the sed addicts, but it's a dirty habit to get
- into. That's because in PerlThink, the right-hand side of
- a s/// is a double-quoted string. \1 in the usual
- double-quoted string means a control-A. The customary Unix
- meaning of \1 is kludged in for s///. However, if you get
- into the habit of doing that, you get yourself into
- trouble if you then add an /e modifier.
-
-     s/(\d+)/ \1 + 1 /eg;
-
- Or if you try to do
-
-     s/(\d+)/\1000/;
-
- You can't disambiguate that by saying \{1}000, whereas you
- can fix it with ${1}000. Basically, the operation of
- interpolation should not be confused with the operation of
- matching a backreference. Certainly they mean two
- different things on the left side of the s///.
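
A minimal sketch of the $1 forms recommended above, using an invented sample string; each substitution works on its own copy of the string so the original is left alone.

```perl
use strict;
use warnings;

my $s = "7 apples";   # made-up sample

# ${1} keeps the capture separate from the digits that follow it.
(my $thousands = $s) =~ s/(\d+)/${1}000/;
# With /e the right-hand side is Perl code, so $1 is an ordinary variable.
(my $plus_one = $s) =~ s/(\d+)/$1 + 1/e;

print "$thousands\n";   # 7000 apples
print "$plus_one\n";    # 8 apples
```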