A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. If a string p matches A and another string q matches B, the string pq will match AB. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced below, or almost any textbook about compiler construction.
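The concatenation rule can be checked with a short sketch. The sub-patterns A and B and the strings p and q below are illustrative choices, not part of the module; re.fullmatch is used so that each string has to match its pattern in full.

```python
import re

# Hypothetical sub-patterns, chosen only for illustration.
A = r"ab+"    # 'a' followed by one or more 'b's
B = r"[0-9]"  # a single ASCII digit

p, q = "abb", "7"

# p matches A and q matches B on their own ...
p_matches_a = re.fullmatch(A, p) is not None
q_matches_b = re.fullmatch(B, q) is not None

# ... so the concatenated string matches the concatenated pattern.
pq_matches_ab = re.fullmatch(A + B, p + q) is not None

print(p_matches_a, q_matches_b, pq_matches_ab)
```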
A brief explanation of the format of regular expressions follows.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so 'last' matches the characters 'last'. (In the rest of this section, we'll write RE's in this special font, usually without quotes, and strings to be matched 'in single quotes'.)
Some characters, like | or (, are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
The special characters are:
.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
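As a small illustration of the DOTALL flag, the sketch below runs the same pattern over a two-line string with and without the flag. The sample text is made up for the example.

```python
import re

text = "line one\nline two"

# By default '.' stops at the newline, so the match covers the first line only.
default_match = re.match(r".*", text).group()

# With re.DOTALL, '.' also matches the newline, so the match covers everything.
dotall_match = re.match(r".*", text, re.DOTALL).group()

print(repr(default_match))
print(repr(dotall_match))
```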
^
(Caret.) Matches the start of the string, and in MULTILINE mode also immediately after each newline.
$
Matches the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'.
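The effect of MULTILINE on the two anchors can be sketched as follows. The three-line sample string is an assumption made for the example; re.findall is used only to count where each anchored pattern succeeds.

```python
import re

text = "foo\nfoobar\nfoo"

# Without MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r"^foo", text)

# With MULTILINE, ^ also matches immediately after each newline.
starts_multi = re.findall(r"^foo", text, re.MULTILINE)

# Without MULTILINE, $ matches only at the end of the string.
ends_default = re.findall(r"foo$", text)

# With MULTILINE, $ also matches before each newline; 'foobar' is
# skipped because 'foo' is not at the end of that line.
ends_multi = re.findall(r"foo$", text, re.MULTILINE)

print(len(starts_default), len(starts_multi), len(ends_default), len(ends_multi))
```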
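*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.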
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
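The three repetition qualifiers can be compared side by side. In this sketch the helper full_matches and the candidate list are hypothetical names introduced for the example; re.fullmatch is used so that a string counts only if the pattern covers all of it.

```python
import re

candidates = ["a", "ab", "abbb"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"ab*")      # zero or more 'b's
plus = full_matches(r"ab+")      # one or more 'b's
optional = full_matches(r"ab?")  # zero or one 'b'

print(star, plus, optional)
```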
*?, +?, ??
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
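The greedy and minimal forms of the expression discussed above behave like this when run against the same string:

```python
import re

text = "<H1>title</H1>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the first '>'.
minimal = re.match(r"<.*?>", text).group()

print(greedy)
print(minimal)
```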
\
Either escapes special characters (permitting you to match characters like *, ?, and so forth), or signals a special sequence; special sequences are discussed below.
If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be doubled. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.
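A short sketch of why raw strings are recommended: the pattern below is meant to match a literal backslash followed by 'section'. The sample text is an assumption for the example. Written as a raw string the pattern needs two backslashes; written as an ordinary string literal it needs four.

```python
import re

# The regular expression itself is \\section (an escaped backslash).
raw_pattern = r"\\section"      # raw string: what you type is what re sees
plain_pattern = "\\\\section"   # ordinary literal: every backslash doubled again

# Both literals produce the same pattern string.
same_pattern = raw_pattern == plain_pattern

# The text to search contains a single backslash character.
text = "see \\section for details"
found = re.search(raw_pattern, text).group()

print(same_pattern, found)
```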
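[]
Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. Special characters are not active inside sets. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; [a-z] will match any lowercase letter and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w or \S (defined below) are also acceptable inside a range. If you want to include a ] or a - inside a set, precede it with a backslash.
Characters not within a range can be matched by including a ^ as the first character of the set; ^ elsewhere will simply match the '^' character.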
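The set rules above can be exercised with a few calls to re.findall. The sample strings are made up for the example.

```python
import re

# Characters listed individually; '$' is not special inside a set.
listed = re.findall(r"[akm$]", "make $5")

# A range of characters.
lowercase = re.findall(r"[a-z]", "aBc1")

# A leading ^ complements the set.
not_lowercase = re.findall(r"[^a-z]", "aBc1")

# A ^ that is not first is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

# ']' and '-' are included by preceding them with a backslash.
escaped = re.findall(r"[\]\-]", "a-b]c")

print(listed, lowercase, not_lowercase, caret_literal, escaped)
```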
|
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. This can be used inside groups (see below) as well. To match a literal '|', use \|, or enclose it inside a character class, like [|].
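A brief sketch of alternation, both at the top level and inside a group, together with the two ways of matching a literal '|'. The words and strings used are illustrative only.

```python
import re

# Top-level alternation: either branch may match.
animals = re.findall(r"cat|dog", "cat, dog, bird")

# Alternation inside a group affects only the grouped part.
grey_ok = re.fullmatch(r"gr(a|e)y", "grey") is not None
gray_ok = re.fullmatch(r"gr(a|e)y", "gray") is not None

# Two ways to match a literal '|'.
escaped_bar = re.findall(r"\|", "a|b|c")
class_bar = re.findall(r"[|]", "a|b|c")

print(animals, grey_ok, gray_ok, escaped_bar, class_bar)
```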
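(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].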
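(?...)
This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines the meaning and further syntax of the construct. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. The currently supported extensions follow.
(?iLmsx)
(One or more letters from the set 'i', 'L', 'm', 's', 'x'.) The group matches the empty string; the letters set the corresponding flags (IGNORECASE, LOCALE, MULTILINE, DOTALL, VERBOSE) for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the compile function.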
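(?:...)
A non-grouping version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is also accessible via the symbolic group name name. Group names must be valid Python identifiers. A symbolic group is also a numbered group, just as if the group were not named.
For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in pattern text (e.g. (?P=id)) and replacement text (e.g. \g<id>).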
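The three ways of referring to the named group id can be sketched as follows. The input strings are assumptions made for the example; the pattern is the one given above.

```python
import re

# 1. By name through match object methods.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "total_count = 3")
name = m.group("id")
end = m.end("id")

# 2. By name inside the pattern: (?P=id) must repeat what the group matched.
repeated = re.match(r"(?P<id>[a-zA-Z_]\w*) (?P=id)", "spam spam") is not None
different = re.match(r"(?P<id>[a-zA-Z_]\w*) (?P=id)", "spam eggs") is not None

# 3. By name inside replacement text: \g<id> inserts what the group matched.
wrapped = re.sub(r"(?P<id>[a-zA-Z_]\w*)", r"<\g<id>>", "x y")

print(name, end, repeated, different, wrapped)
```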
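(?P=name)
Matches whatever text was matched by the earlier group named name.
(?#...)
A comment; the contents of the parentheses are simply ignored.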
(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
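Both lookahead forms can be tried on the example names used above. Note that in each successful case the matched text is only 'Isaac ', because the assertion itself consumes nothing.

```python
import re

# Positive lookahead: succeeds only when 'Asimov' comes next.
pos_hit = re.match(r"Isaac (?=Asimov)", "Isaac Asimov")
pos_miss = re.match(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: succeeds only when 'Asimov' does not come next.
neg_hit = re.match(r"Isaac (?!Asimov)", "Isaac Newton")
neg_miss = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")

print(repr(pos_hit.group()), pos_miss, repr(neg_hit.group()), neg_miss)
```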
The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number.
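The backreference example above can be verified directly. The dictionary name results is a hypothetical name introduced for this sketch.

```python
import re

pattern = r"(.+) \1"  # a group, a space, then the same text again

# Record whether each sample string is matched in full by the pattern.
results = {
    s: re.fullmatch(pattern, s) is not None
    for s in ["the the", "55 55", "the end"]
}

print(results)
```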
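\A
Matches only at the start of the string.
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric character. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.
\B
Matches the empty string, but only when it is not at the beginning or end of a word.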
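\d
Matches any decimal digit; this is equivalent to the set [0-9].
\D
Matches any non-digit character; this is equivalent to the set [^0-9].
\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].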
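\w
When the LOCALE flag is not specified, matches any alphanumeric character; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as letters for the current locale.
\W
When the LOCALE flag is not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as a letter for the current locale.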
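The character-class sequences can be tried on a short sample. This sketch assumes Python 3, where these classes are Unicode-aware by default for str patterns; passing re.ASCII restricts them to the ASCII sets listed above. The sample string is made up for the example.

```python
import re

sample = "id_42 = 7\t# ok"

# re.ASCII keeps \d, \w and \s to the ASCII sets shown above (Python 3 assumption).
digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)

# Characters that are neither word characters nor whitespace.
punctuation = re.findall(r"[^\w\s]", sample, re.ASCII)

print(digits, words, spaces, punctuation)
```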
\Z
Matches only at the end of the string.
\\
Matches a literal backslash.