Status: Draft / Needs Editing
The goal of this doc is to make it hard to say this with a straight face: "regular expression syntax is maybe a little ugly, but basically OK." It's not really OK.
Regexes are actually composed of two little languages: a way to specify
strings and a way to specify sets of characters. Inside a character
class, characters like ^ and - have special meanings. Outside a character
class, characters like . and * are metacharacters. In Python's verbose mode,
a space is insignificant outside a character class, but still significant
inside one.
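A minimal sketch of the verbose-mode rule, using Python's re module. The pattern and the variable names are only illustrative.

```python
import re

# In verbose mode the spaces between tokens are ignored, but the space
# inside the brackets is still a literal space.
pattern = re.compile(r"a b [ ] c", re.VERBOSE)

matches_with_space = pattern.fullmatch("ab c") is not None
matches_without_space = pattern.fullmatch("abc") is not None
print(matches_with_space, matches_without_space)  # True False
```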
(?:...) is a hard syntax to read for non-capturing groups.
^ and $ are arbitrary in what they stand for, and hard to remember.
(?P<foo>...) -- for named capturing groups and named backreferences, the
P is somewhat random.
Significant space makes regular expressions very hard to read. Even with
re.VERBOSE or Perl's /x, space within character classes is still
significant. Reading expressions like [^"'\\\[\]] is hard.
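To show what that character class actually says, here is a sketch that builds an equivalent class from an ordinary string with re.escape and compares the two on a sample. The names excluded, readable, terse and sample are hypothetical; this is one way to make the intent visible, not a recommended replacement syntax.

```python
import re

# The five characters the class excludes: " ' \ [ ]
excluded = "\"'\\[]"
readable = re.compile("[^" + re.escape(excluded) + "]")
terse = re.compile(r"""[^"'\\\[\]]""")

# Sample text: a"b'c\d[e]
sample = "a\"b'c\\d[e]"
kept_readable = readable.findall(sample)
kept_terse = terse.findall(sample)
print(kept_terse)  # ['a', 'b', 'c', 'd', 'e']
```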
\ is overloaded. It can:
introduce a character class (\D stands for any character that's not a digit).
introduce a zero-width assertion (\A stands for the beginning of the string).
It also means something to the programming language (in Python, this is disabled
with raw strings).
This creates some interesting quirks:
\0 and \01 both signify an octal character ... but
\1 is a backreference!
\b is a word boundary outside a character class, while it's a backspace
character inside one! (That is Python's behavior; re2 chooses to avoid this.)
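These quirks can be checked directly against Python's re module. The sketch below uses raw strings so that only the regex parser interprets the backslashes; the variable names are illustrative.

```python
import re

# Leading 0: an octal escape for the character U+0001.
octal = re.fullmatch(r"\01", "\x01")
# No leading 0: a reference back to group 1.
backref = re.fullmatch(r"(a)\1", "aa")
# Outside a character class, \b is a word boundary.
boundary = re.search(r"\bcat\b", "a cat sat")
# Inside a character class, \b is the backspace character.
backspace = re.fullmatch(r"[\b]", "\x08")

print(all(m is not None for m in (octal, backref, boundary, backspace)))  # True
```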
? is overloaded.
aa? matches one or two a's.
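A small sketch of the roles ? plays in Python's re module. Only the optional-quantifier case comes from the item above. The non-greedy modifier and the (?...) prefix are uses I know from the re module, added here on the assumption that they are the kind of overloading meant.

```python
import re

# Quantifier: the second "a" is optional, so one or two a's match.
one_a = re.fullmatch(r"aa?", "a") is not None
two_as = re.fullmatch(r"aa?", "aa") is not None
three_as = re.fullmatch(r"aa?", "aaa") is not None

# Modifier: after another quantifier, ? makes it non-greedy.
greedy = re.match(r"a+", "aaa").group()
lazy = re.match(r"a+?", "aaa").group()

# Prefix: (? opens the extension syntax, here a non-capturing group.
grouped = re.match(r"(?:ab)c", "abc")

print(one_a, two_as, three_as, greedy, lazy, grouped.groups())
```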
+ is overloaded.
The extension syntax encompasses unrelated concepts.
(?:...) is a non-capturing group
(?=...) is a lookahead assertion
(?P=name) is a named backreference
(?P<name>pat) is a named group
(?i) turns on a flag
(?#comment) is a comment
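Seen side by side in Python, the (?...) forms above do quite different jobs. This is an illustrative sketch; the patterns and variable names are made up for the example.

```python
import re

# Grouping without capturing: no groups are recorded.
non_capturing = re.match(r"(?:ab)+", "abab")
# Lookahead: "bar" must follow, but is not part of the match.
lookahead = re.match(r"foo(?=bar)", "foobar")
# Named group plus named backreference: the closing quote must equal the opening one.
named = re.match(r"(?P<quote>['\"]).*?(?P=quote)", "'hi' there")
# Inline flag: case-insensitive matching.
flagged = re.match(r"(?i)abc", "ABC")
# Comment: ignored by the matcher.
commented = re.match(r"a(?#this is ignored)b", "ab")

print(non_capturing.groups(), lookahead.group(), named.group("quote"))
```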
^ means 3 different things:
In multiline mode it actually stands for the start of a line too.
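A sketch of the meanings of ^ in Python's re module: start of the string, start of each line under re.MULTILINE, and negation at the front of a character class. Treating these as the three meanings intended above is my assumption.

```python
import re

text = "one\ntwo"

# By default ^ anchors at the start of the string only.
string_start = re.findall(r"^\w+", text)
# With re.MULTILINE it also anchors at the start of each line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
# At the front of a character class it negates the class.
negated = re.findall(r"[^o]", "foo")

print(string_start, line_starts, negated)  # ['one'] ['one', 'two'] ['f']
```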
- creates a range in a character class, except at the end, where it stands
for a hyphen. (NOTE: I guess \- is a hyphen too.)
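The three spellings can be compared directly in Python. This sketch also confirms the guess in the note: an escaped hyphen inside a class is a literal hyphen.

```python
import re

text = "abc-"

as_range = re.findall(r"[a-c]", text)   # range a through c
at_end = re.findall(r"[ac-]", text)     # trailing hyphen is literal
escaped = re.findall(r"[a\-c]", text)   # escaped hyphen is literal

print(as_range, at_end, escaped)
# ['a', 'b', 'c'] ['a', 'c', '-'] ['a', 'c', '-']
```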
Very subtle rules:
Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a \b \f \n \r \t \v \x \\
(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)
Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
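The octal-versus-group-reference rule quoted above can be sketched as follows. The example assumes a pattern with no groups, so a two-digit reference that is not an octal escape has nothing to refer to and fails to compile.

```python
import re

# Three octal digits: octal 011 is character 9, the tab.
three_digits = re.fullmatch(r"\011", "\t")
# First digit is 0: octal 07 is character 7, the bell.
leading_zero = re.fullmatch(r"\07", "\a")

# Two digits without a leading 0: treated as a reference to group 11,
# which does not exist in this pattern.
try:
    re.compile(r"\11")
    group_error = None
except re.error as exc:
    group_error = exc

print(three_digits is not None, leading_zero is not None, group_error is not None)
```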
Negation has 3 syntaxes:
[^a] -- negate a character class by adding
^ at the front.
\D -- negate a Perl-style character class by capitalizing.
(?!...) -- negative lookahead assertion uses !
In CRE, ~ is the only syntax for negation.
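The three regex negation spellings, sketched with Python's re module. The sample strings are arbitrary.

```python
import re

# ^ at the front of a class: anything except "a".
not_a = re.findall(r"[^a]", "abc")
# Capitalized Perl-style class: anything except a digit.
not_digit = re.findall(r"\D", "a1b2")
# ! in a lookahead: "foo" not followed by "bar".
not_followed = re.match(r"foo(?!bar)", "foobaz")
followed = re.match(r"foo(?!bar)", "foobar")

print(not_a, not_digit, not_followed is not None, followed is not None)
# ['b', 'c'] ['a', 'b'] True False
```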
Zero-width assertions have multiple syntax styles (a historical accident):
^ and $ are the traditional style
\A and \Z are closely related but look completely different. They look like character classes.
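A sketch of how closely related the two styles are in Python's re module, and where they differ. The sample text is made up; the behavior shown is Python's, and other engines may differ.

```python
import re

text = "a\nb\n"

# Under re.MULTILINE, ^ matches at each line start but \A only at the string start.
caret = re.findall(r"^b", text, re.MULTILINE)
backslash_a = re.findall(r"\Ab", text, re.MULTILINE)

# $ also matches just before a trailing newline; \Z only at the very end.
dollar = re.search(r"b$", text)
backslash_z = re.search(r"b\Z", text)

print(caret, backslash_a, dollar is not None, backslash_z is not None)
# ['b'] [] True False
```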
Another historical accident:
\1 is a numbered backreference, but
(?P=name) is a named backreference. (CRE is consistent here.)
On a related note, constructs have been added that conflate regular expressions with a backtracking implementation.
CRE tries to make this distinction by using different syntax for constructs defined by the traditional theory of automata and for constructs defined by the backtracking algorithm.
Again, the backslash is too heavily overloaded.
\g<0> means the whole match, \g<1> means group 1
\0 does NOT mean the whole match, but \1 means group 1.
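In replacement templates the asymmetry looks like this. The sketch uses re.sub with a one-group pattern; the bracket characters are only there to make the substitution visible.

```python
import re

whole_match = re.sub(r"(b)", r"[\g<0>]", "abc")  # \g<0>: the whole match
group_one_g = re.sub(r"(b)", r"[\g<1>]", "abc")  # \g<1>: group 1
group_one = re.sub(r"(b)", r"[\1]", "abc")       # \1: group 1
octal_nul = re.sub(r"(b)", r"[\0]", "abc")       # \0: an octal escape (NUL), not the match

print(whole_match, group_one_g, group_one, repr(octal_nul))
```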
TODO: mention match vs. search confusion.
Also TODO: document exceptions. Have to import another module to catch an exception.
sre.parse_template throws them.
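For the exceptions TODO, a sketch of a template error being raised and caught. It assumes current Python 3, where the exception is reachable as re.error without a second import; I can't confirm that this held for the Python version this note was written against.

```python
import re

# The template refers to group 2, but the pattern defines only one group.
try:
    re.sub(r"(b)", r"\2", "abc")
    template_error = None
except re.error as exc:
    template_error = exc

print(type(template_error).__name__, template_error)
```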
Last modified: 2013-01-29 10:42:48 -0800