Python Re
re is the standard library module for regular expressions.
Usage
Match
Checks whether a pattern matches at the beginning of a string. If it does, a match object is returned. Otherwise None is returned.
A match object is a collection of match groups. The first (0th) match group is the entire match.
    from re import match

    listed_price = "$1000000"  # example input
    m = match(r"\$[1-9][0-9]*", listed_price)
    if m is not None:
        print("USD", m.group(0))
        # or: print(m.expand(r"USD \g<0>"))

Subpatterns (parenthesized groups) are included as subsequent match groups.

    from re import match

    listed_price = "$1000000"  # example input
    m = match(r"\$([1-9][0-9]*)", listed_price)
    if m is not None:
        print("USD", m.group(1))
        # or: print(m.expand(r"USD \1"))
A tuple of the subsequent match groups is also available from the groups() method.
    from re import match

    currency_plus_listed_price = "USD $1000000"  # example input
    m = match(r"([A-Z]{3}) (\$[1-9][0-9]*)", currency_plus_listed_price)
    if m is not None:
        current_listed_price_pair = m.groups()  # ('USD', '$1000000')
Search
Similar to match(), but not constrained to the beginning of the string: the first match found anywhere in the string is returned.
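As a minimal sketch of the difference, the example below runs match() and search() against the same string. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text: the price is not at the start of the string.
text = "Listed at $250 today"
pattern = r"\$[1-9][0-9]*"

# match() only tries the pattern at position 0, so it finds nothing here.
at_start = re.match(pattern, text)

# search() scans forward and returns the first match anywhere in the string.
anywhere = re.search(pattern, text)

print(at_start)           # None
print(anywhere.group(0))  # $250
print(anywhere.span())    # (10, 14)
```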
FullMatch
Similar to match(), but requires that the entire string match the pattern.
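A short sketch of how this differs from match(), using two made-up input strings: one that is exactly a price and one with trailing text.

```python
import re

pattern = r"\$[1-9][0-9]*"

# Hypothetical inputs for illustration.
exact = "$250"
trailing = "$250 or best offer"

# match() is satisfied by a matching prefix.
prefix_ok = re.match(pattern, trailing) is not None

# fullmatch() fails if any part of the string is left unmatched.
whole_trailing = re.fullmatch(pattern, trailing) is not None
whole_exact = re.fullmatch(pattern, exact) is not None

print(prefix_ok)       # True
print(whole_trailing)  # False
print(whole_exact)     # True
```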
FindAll
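Scans a string for a pattern and returns a list of all non-overlapping substrings that match. If the pattern contains subpatterns, the list holds the contents of the match groups instead.

    from re import findall

    m = findall(r"\$[1-9][0-9]*", "$1 $2 $3")
    # ['$1', '$2', '$3']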
Split
Splits a string at every match of a pattern. If the pattern includes a subpattern, the text matched by the subpattern is also included in the result.

    from re import split

    s = split('-', 'a-b-c')
    # ['a', 'b', 'c']

    s = split('-', '-a-b-c-')
    # ['', 'a', 'b', 'c', '']

    s = split('(-)', 'a-b-c')
    # ['a', '-', 'b', '-', 'c']
Sub
Returns a new string built by replacing all matches of a pattern with a replacement.

    import re

    s = re.sub("[\t ]+", ";", "a whitespace delimited string")
    # "a;whitespace;delimited;string"

    s = re.sub("[\t ]+", ";", "a whitespace delimited string", count=1)
    # "a;whitespace delimited string"
The replacement can include backreferences: \6 is replaced with the substring in match group 6. For this reason, backslashes in the replacement are handled specially by this function. 'Known' escape sequences (like \n) are processed and converted to the character they represent (a newline). 'Unknown' escape sequences consisting of a backslash and an ASCII letter raise an error. All others, such as \&, are left as-is.
The replacement can also be a callback function. It is called with the match object of each match as its argument, and is expected to return the replacement string.
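The escape handling described above can be sketched as follows. The input strings are made up for illustration, and the error case assumes Python 3.7 or later, where an unknown escape of an ASCII letter in the replacement raises re.error.

```python
import re

# Hypothetical input: dates written as DD/MM/YYYY.
text = "Due 31/12/2024, paid 05/01/2025"

# Backreferences: \3, \2 and \1 insert the third, second and first group.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", text)
print(iso)  # Due 2024-12-31, paid 2025-01-05

# Known escape: the two characters \n in the replacement become a newline.
newline = re.sub(",", r"\n", "a,b")
print(repr(newline))  # 'a\nb'

# Unknown escape that is not an ASCII letter: left as-is.
kept = re.sub(",", r"\&", "a,b")
print(kept)  # a\&b

# Unknown escape of an ASCII letter: rejected (assumes Python 3.7+).
try:
    re.sub(",", r"\q", "a,b")
    raised = False
except re.error:
    raised = True
print(raised)
```

Writing the replacement as a raw string keeps Python's own string escaping from consuming the backslashes before re.sub() sees them.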
    import re

    def redact_external_emails(m):
        if m.group(0).lower().endswith("example.com"):
            return m.group(0)
        else:
            return ''

    re.sub(r"[A-Za-z]+@[A-Za-z]+\.[A-Za-z]+", redact_external_emails, "[email protected] [email protected] [email protected]")
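As a runnable variant of the callback approach, the sketch below applies the same function to three made-up addresses. The addresses and the EMAIL constant are assumptions for illustration only, and the pattern is a deliberately simple letters-only one, not a general email matcher.

```python
import re

# Hypothetical, deliberately simple pattern: letters only, one dot in the domain.
EMAIL = r"[A-Za-z]+@[A-Za-z]+\.[A-Za-z]+"

def redact_external_emails(m):
    """Keep addresses ending in example.com; replace any other match with ''."""
    address = m.group(0)
    if address.lower().endswith("example.com"):
        return address
    return ""

# Made-up addresses for illustration only.
text = "alice@example.com bob@example.org carol@EXAMPLE.com"
result = re.sub(EMAIL, redact_external_emails, text)
print(repr(result))  # 'alice@example.com  carol@EXAMPLE.com'
```

The callback runs once per match, so the redacted address leaves its surrounding spaces behind; that is why two spaces remain in the middle of the result.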
SubN
Similar to sub(), but returns a tuple of the new string and a count of substitutions performed.
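A minimal sketch, reusing the whitespace example from sub(); the variable names are chosen here for illustration.

```python
import re

# subn() returns (new_string, number_of_substitutions_made).
new_text, count = re.subn(r"[\t ]+", ";", "a whitespace delimited string")

print(new_text)  # a;whitespace;delimited;string
print(count)     # 3
```

The count is handy for checking whether anything was replaced without comparing the old and new strings.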
Compile
The functions above take a pattern string as the first argument. The regular expression engine compiles that pattern internally and caches the result. If a pattern will be reused frequently, it can be more efficient to compile it once and reuse the compiled pattern directly.
The compile() function returns such a compiled pattern. It has methods mirroring the functions above.

    from re import compile

    p = compile(r"\$([1-9][0-9]*)")

    g = p.match("$1000000").groups()
    # ('1000000',)

    g = p.search("$1000000").groups()
    # ('1000000',)

    s = p.sub(r"\1 dollars", "$1000000")
    # '1000000 dollars'
Type Annotations
Match (available as re.Match) can be used to annotate a match object. Pattern (available as re.Pattern) can be used to annotate a compiled regular expression.
Both are generic over typing.AnyStr by default, but can be further constrained by annotating with Match[str], Match[bytes], Pattern[str], or Pattern[bytes].
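A small sketch of these annotations in use, assuming the classes are taken from the re module. The names PRICE, find_price and dollars are hypothetical. The `from __future__ import annotations` line keeps the subscripted forms from being evaluated at runtime, which matters on Python versions older than 3.9.

```python
from __future__ import annotations  # annotations stay unevaluated at runtime

import re
from typing import Optional

# A compiled pattern over str (not bytes).
PRICE: re.Pattern[str] = re.compile(r"\$([1-9][0-9]*)")

def find_price(text: str) -> Optional[re.Match[str]]:
    """Return the first price match in text, or None if there is none."""
    return PRICE.search(text)

def dollars(text: str) -> Optional[int]:
    """Return the dollar amount as an int, or None if no price is found."""
    m = find_price(text)
    return int(m.group(1)) if m is not None else None

print(dollars("Listed at $250"))  # 250
print(dollars("Not for sale"))    # None
```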
See also
Python re module documentation
Python Module of the Day article for re