The regex module uses Emacs-style regular expressions, while the re module uses Perl-style expressions. The largest difference between the two is that in Emacs-style syntax, the grouping and alternation metacharacters have a backslash in front of them. For example, groups are delimited by "\(" and "\)" and alternatives are separated by "\|"; "(" and ")" match the literal characters. This clutters moderately complicated expressions with lots of backslashes. Unfortunately, Python's string literals also use the backslash as an escape character, so it's frequently necessary to add backslashes in front of backslashes; "\\\\" is required to match a single "\", for example.
In Perl-style expressions, things are just the opposite; "\(" and "\)" match the literal characters "(" and ")", while "(" and ")" in a pattern denote grouping. This makes patterns neater, since you'll rarely need to match literal "(" and ")" characters but will often use grouping.
regex pattern:
\(\w+\|[0-9]+\)
re pattern:
(\w+|[0-9]+)
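As a rough illustration of the Perl-style rules, the sketch below uses the re module only; the sample strings and variable names are made up for this example. It shows unescaped parentheses acting as a group, escaped parentheses matching literal characters, and a raw string literal as one way to avoid doubling backslashes.

```python
import re

# Unescaped parentheses group; the group captures whichever alternative matched.
grouped = re.compile(r"(\w+|[0-9]+)")
word = grouped.match("spam eggs").group(1)   # sample input is an assumption

# Escaped parentheses match the literal "(" and ")" characters.
literal = re.compile(r"\((\w+)\)")
call = literal.search("f(spam)")

# A raw string avoids the doubled backslashes: both spellings below are the
# same two-character pattern, which matches one literal backslash.
same_pattern = "\\\\" == r"\\"
backslash = re.search(r"\\", "a\\b")

print(word, call.group(0), call.group(1), same_pattern, backslash.start())
```

Raw strings are optional; they only change how the pattern is spelled in Python source, not what the re module receives.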
The Perl syntax also has more character classes that allow simplifying some expressions. The regex module only supports "\w" to match alphanumeric characters, and "\W" to match non-alphanumeric characters. The re module adds "\d" and "\D" for digits and non-digits, and "\s" and "\S" for whitespace and non-whitespace characters.
regex pattern:
[0-9]+[ \t\n]+
re pattern:
\d+\s+
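The following sketch compares the explicit character sets with the shorthand classes on a small ASCII sample (the sample text is an assumption). On such input the two re patterns find the same matches; note that "\s" also covers whitespace characters beyond space, tab and newline, so the two are not identical in general.

```python
import re

explicit = re.compile(r"[0-9]+[ \t\n]+")   # spelled-out character sets
shorthand = re.compile(r"\d+\s+")          # the same idea with \d and \s

sample = "42 \tapples, 7\npears"           # hypothetical sample text

explicit_hits = explicit.findall(sample)
shorthand_hits = shorthand.findall(sample)

print(explicit_hits)
print(shorthand_hits)
```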
Regular expressions can get very complicated and difficult to understand. To make expressions clearer, the re.VERBOSE flag can be specified. This flag causes whitespace outside of a character class to be ignored, and a "#" symbol outside of a character class is treated as the start of a comment, extending to the end of the line. This means a pattern can be put inside a triple-quoted string, and then formatted for clarity.
re code:
import re

pat = re.compile(r"""
(?P<command>   # A command contains ...
  \w+)         # ... a word ...
\s+            # ... followed by whitespace ...
(?P<var>       # ... and a variable name
  (?!\d)       # Lookahead assertion: can't start with a digit
  \w+          # Match a word
)""", re.VERBOSE)
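To see how the named groups in such a verbose pattern can be used, here is a small sketch that repeats the pattern from the text and applies it to two made-up input strings. The inputs and the names good and bad are assumptions for illustration only.

```python
import re

# The pattern from the text, written as a raw string so the backslashes
# reach the re module unchanged.
pat = re.compile(r"""
(?P<command>   # A command contains ...
  \w+)         # ... a word ...
\s+            # ... followed by whitespace ...
(?P<var>       # ... and a variable name
  (?!\d)       # Lookahead assertion: can't start with a digit
  \w+          # Match a word
)""", re.VERBOSE)

good = pat.match("set count")     # hypothetical input
bad = pat.match("set 9lives")     # variable name starts with a digit

if good is not None:
    # Named groups are retrieved by the names given in (?P<name>...).
    print(good.group("command"), good.group("var"))
print(bad)
```

The second input fails because the lookahead rejects a digit at the start of the variable name, and no shorter match for the command leaves whitespace in the right place.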
If the re.VERBOSE flag seems a bit easy to overlook, off at the end of the statement, you can put (?x) inside the pattern; this has the same effect as specifying re.VERBOSE, but makes the flag setting part of the pattern. There are similar extensions to specify re.DOTALL, re.IGNORECASE, re.LOCALE, and re.MULTILINE: (?s), (?i), (?L), and (?m).
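A short sketch of the inline forms follows, with patterns and sample strings invented for this example. The inline flags are placed at the very start of each pattern, which current Python versions require for flags that apply to the whole expression. (?L) is left out because on current Python versions it applies only to bytes patterns.

```python
import re

# The same pattern compiled two ways: flags as an argument, and flags inline.
with_flags = re.compile(r"\d+ \s* [a-z]+  # number, then a word",
                        re.VERBOSE | re.IGNORECASE)
inline = re.compile(r"(?xi) \d+ \s* [a-z]+  # number, then a word")

text = "12 ABC"                                     # hypothetical input
flag_match = with_flags.match(text).group(0)
inline_match = inline.match(text).group(0)

# (?s) lets "." match a newline; without it the match fails.
dotall = re.match(r"(?s)a.b", "a\nb")
no_dotall = re.match(r"a.b", "a\nb")

# (?m) lets "^" match at the start of every line.
line_starts = re.findall(r"(?m)^\w+", "one\ntwo")

print(flag_match, inline_match, dotall is not None, no_dotall, line_starts)
```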
A module to automatically translate expressions from the old regex syntax to the new syntax has been written, and is available as reconvert.py in Python 1.5b2. (I haven't really looked at it yet.)