The top-level functions in the re module are quite similar to those in the regex module; there are a few new functions and some new optional arguments, but these can mostly be ignored when converting regex code.
re is now the only module required to access all the available
functionality. regsub has been swallowed by re, and is available as
the sub(), subn(), and split() functions. There's no equivalent of
regex_syntax; re supports only one syntax, and you can't change it.
If you want alternative regex syntaxes, you'll have to manually parse
the syntax and convert it to the basic Perl-like syntax; sorry!
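As a rough illustration of the three functions that replace regsub, the sketch below applies re.sub(), re.subn(), and re.split() to a made-up sample string. The variable names and the input are illustrative only and do not come from any existing program.

```python
import re

text = 'one  two   three'   # hypothetical sample input

# sub() returns the string with every match replaced.
collapsed = re.sub(' +', ' ', text)

# subn() returns the same result plus the number of replacements made.
collapsed_again, count = re.subn(' +', ' ', text)

# split() breaks the string apart wherever the pattern matches.
words = re.split(' +', text)

print(collapsed)
print(count)
print(words)
```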
The functions in the regex module return an integer giving the length
or position of any match, or -1 if no match was found. The subgroups
from the match are then available as attributes of the compiled
pattern object: regs, last, and so forth. This doesn't interact well
with threads, because two threads may use the same pattern object at
almost the same time; the results from the second thread's operation
will then stomp on the first thread's results.
To fix this problem, functions in the re module return a MatchObject
instance, or None if the match failed. Pattern objects now have no
attributes that change after the object is created. Code must
therefore be converted to store the MatchObject instance in a
variable, and check for None to determine if a match was found:
regex code:
pat = regex.compile('[0-9]+')
if pat.match(strvar) == -1:
    print('No match')
re code:
pat = re.compile('[0-9]+')
m = pat.match(strvar)
if m is None:
    print('No match')
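Because each result now lives in its own object, a single compiled pattern can be used for several matches without one result overwriting another. The following is a minimal sketch with made-up input strings, not code taken from any particular program:

```python
import re

pat = re.compile('[0-9]+')

# Each call returns an independent match object, or None on failure.
m1 = pat.match('123abc')
m2 = pat.match('4567xyz')
m3 = pat.match('no digits here')

# m1 is unaffected by the later calls made on the same pattern object.
first = m1.group(0)
second = m2.group(0)

if m3 is None:
    print('No match')
```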
The search() and match() functions have the same parameters in both
modules; the re module returns None or a MatchObject instance instead
of an integer, and adds an optional flags argument. Of course, the re
versions use the new regular expression syntax (see "Pattern
Differences", below).
regex code:
result = regex.match('\\w+', 'abc abc')
re code:
result = re.match('\\w+', 'abc abc')
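To make the difference between the two functions and the optional flags argument concrete, here is a small sketch. The sample strings are invented for the example.

```python
import re

# match() only succeeds if the pattern matches at the start of the string.
at_start = re.match(r'\w+', 'abc abc')

# search() scans forward through the string for the first match.
found = re.search(r'b+', 'aaa BBB')            # case-sensitive: no match
found_i = re.search(r'b+', 'aaa BBB', re.I)    # flags argument: ignore case

if found is None:
    print('No match without re.I')
print(found_i.group(0), found_i.start())
```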
Another thing that's disappeared is the translate argument to the
compile() function; in the regex module, a 256-character string can
be passed to indicate how characters should be translated before
matching them. This feature was often used to perform
case-insensitive matching, or to map the digits 0-9 to 0 to simplify
patterns that matched digits. With the re.IGNORECASE and re.LOCALE
flags, and special sequences such as "\d", the need for this
functionality is greatly reduced; since the on-the-fly translation
complicated the matching engine and made it slower, the feature was
dropped. If you still need it, you'll have to explicitly call
string.translate() on your target string before running your regular
expression on it.
regex code:
pat = regex.compile('[abc]', translation)
result = pat.match(str_var)
re code:
pat = re.compile('[abc]')
result = pat.match(string.translate(str_var, translation))
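In Python 3 the string.translate() function is no longer available; the str.translate() method, with a table built by str.maketrans(), fills the same role. The sketch below reproduces the digit-folding trick under that assumption. The target string is hypothetical.

```python
import re

# Map every digit to '0' so the pattern only needs to mention '0'.
translation = str.maketrans('0123456789', '0000000000')

pat = re.compile('0+')
str_var = '2024 apples'     # hypothetical target string

# Translate first, then match against the translated copy.
result = pat.match(str_var.translate(translation))
print(result.span())
```

Because the table maps one character to one character, the offsets reported by result.span() still apply to the original str_var.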
Some programs use a translation string to convert foreign characters
such as é or ø to characters in the range A-Za-z so they can be
matched by "\w". These programs can specify the re.LOCALE flag, which
causes "\w" to match the alphabetic characters defined in the current
locale.
The most common use of the translation string is to do a
case-insensitive match by passing regex.casefold. The re equivalent
is to pass re.IGNORECASE (or re.I, which is the same thing) as the
flags argument to re.compile().
regex code:
pat = regex.compile('[abc]', regex.casefold)
re code:
pat = re.compile('[abc]', re.IGNORECASE)
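A short sketch of the flag in use follows. The pattern is extended with "+" and the input string is invented, purely to make the effect visible.

```python
import re

# re.I is simply a shorter spelling of re.IGNORECASE.
pat = re.compile('[abc]+', re.IGNORECASE)
short = re.compile('[abc]+', re.I)

m = pat.match('AbCd')
m_short = short.match('AbCd')
print(m.group(0))
```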
regex.symcomp() is no longer required; named groups are always
available using the (?P<name>...) syntax.
regex code:
pat = regex.symcomp( "(<integer>[0-9]+)" )
re code:
pat = re.compile( "(?P<integer>[0-9]+)" )
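Once a group has a name, it can be retrieved from the match either by that name or by its number. A minimal sketch, using a made-up input string:

```python
import re

pat = re.compile('(?P<integer>[0-9]+)')
m = pat.search('order 66 shipped')   # hypothetical input

by_name = m.group('integer')   # look the group up by its symbolic name
by_number = m.group(1)         # the same group, by position
names = m.groupdict()          # all named groups as a dictionary
print(by_name)
```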
With the regex module, the group(args) method of pattern objects
takes zero or more arguments, either integers or strings containing
symbolic group names. group(args) returns a tuple containing the
corresponding groups from the last pattern match. With no arguments,
group() returns a tuple containing all the groups in the pattern. In
all these cases, if the tuple would only contain a single element,
just that single string is returned. This is inconsistent, but
convenient:
x = pat.group(1)
x, y = pat.group(1, 2)
The re module moves this method to MatchObject instances. The action
of group() with no arguments has been changed for consistency with
the start(), end(), and span() methods; they all assume group 0 as
the default. To get a tuple containing all the subgroups, use the
groups() method.
regex code:
substring = pattern.group(0)
re code:
substring = match.group(0)    # or, equivalently, match.group()
regex code:
substring = pattern.group()
re code:
substring = match.groups()
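The different ways of calling group() and groups() on a match object can be compared side by side. This is only a sketch; the pattern and the input string are invented for the example.

```python
import re

m = re.match(r'(\w+) (\w+)', 'hello world!')

whole = m.group()         # no argument: group 0, the entire match
first = m.group(1)        # a single subgroup, returned as a string
pair = m.group(1, 2)      # several subgroups, returned as a tuple
all_groups = m.groups()   # every subgroup, always as a tuple
span0 = m.span()          # like group(), defaults to group 0

print(whole)
print(all_groups)
```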