What's Wrong with Regular Expression Syntax

Status: Draft / Needs Editing

The goal of this doc is to make it hard to say this with a straight face: "regular expression syntax is maybe a little ugly, but basically OK." It's not really OK.

Regexes are actually composed of two little languages: one for specifying strings and one for specifying sets of characters (character classes). Inside a character class, ^ and - have special meanings. Outside a character class, characters like . and * are metacharacters. In Python's verbose mode, a space is insignificant outside a character class but still significant inside one.
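A minimal sketch of the two little languages, using Python's re module. The variable names are only illustrative; the point is that the same character changes meaning depending on whether it sits inside or outside a character class.

```python
import re

# Outside a character class, '.' is a metacharacter: it matches any character.
dot_outside = re.fullmatch(r'.', 'a')
# Inside a character class, '.' and '*' are plain literals.
dot_inside = re.fullmatch(r'[.]', 'a')      # only a literal '.' would match
star_inside = re.fullmatch(r'[*]', '*')

# '^' is an anchor outside a class, but negation at the start of a class.
caret_outside = re.search(r'^b', 'ab')      # 'b' is not at the start
caret_inside = re.fullmatch(r'[^a]', 'b')   # any character except 'a'

# '-' is a literal outside a class, but a range operator inside one.
dash_outside = re.fullmatch(r'a-c', 'a-c')
dash_inside = re.fullmatch(r'[a-c]', 'b')

print(bool(dot_outside), bool(dot_inside), bool(star_inside))
print(bool(caret_outside), bool(caret_inside))
print(bool(dash_outside), bool(dash_inside))
```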

General Awkwardness

Significant space makes regular expressions very hard to read. Even with Python's re.VERBOSE or Perl's /x, space within character classes is still significant. Reading expressions like [^"'\\\[\]] is hard.
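To make the whitespace rule concrete, here is a small sketch with re.VERBOSE. The pattern names are made up for illustration; the last pattern is the hard-to-read class quoted above, compiled as is.

```python
import re

# In verbose mode, whitespace outside a class is ignored...
outside = re.compile(r'a b', re.VERBOSE)
# ...but whitespace inside a class is still a literal space.
inside = re.compile(r'a [ ] b', re.VERBOSE)

outside_matches_ab = outside.fullmatch('ab') is not None
outside_matches_a_b = outside.fullmatch('a b') is not None
inside_matches_a_b = inside.fullmatch('a b') is not None
inside_matches_ab = inside.fullmatch('ab') is not None

# The class from the text: any character except " ' \ [ ]
hard_class = re.compile(r'''[^"'\\\[\]]''')
accepts_letter = hard_class.fullmatch('a') is not None
accepts_bracket = hard_class.fullmatch('[') is not None
accepts_backslash = hard_class.fullmatch('\\') is not None

print(outside_matches_ab, outside_matches_a_b)
print(inside_matches_a_b, inside_matches_ab)
print(accepts_letter, accepts_bracket, accepts_backslash)
```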

Same syntax for different concepts

The backslash also means something to the host programming language (in Python, this is disabled with raw strings, like r'foo\\').

This creates some interesting quirks:
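One such quirk, sketched in Python. This example is not necessarily the one originally intended here; it only shows how the string-literal layer and the regex layer both claim the backslash.

```python
import re

# Without a raw string, Python itself turns '\b' into a backspace (0x08)
# before the regex parser ever sees it.
plain = '\bcat\b'
raw = r'\bcat\b'

plain_len = len(plain)   # backspace, c, a, t, backspace
raw_len = len(raw)       # the backslashes survive as characters

word_found_plain = re.search(plain, 'a cat sat') is not None
word_found_raw = re.search(raw, 'a cat sat') is not None

# Matching one literal backslash takes four backslashes without a raw
# string (two for Python, then two for the regex) and two with one.
one_backslash = 'a\\b'   # the 3-character string: a, backslash, b
four = re.search('\\\\', one_backslash) is not None
two_raw = re.search(r'\\', one_backslash) is not None

print(plain_len, raw_len, word_found_plain, word_found_raw, four, two_raw)
```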

Very subtle rules:

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a \b \f \n \r \t \v \x \\

(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.
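The rules quoted above are easier to believe when seen running. A short sketch, with illustrative variable names, of how the same backslash-plus-characters shape is read three different ways:

```python
import re

# \b: a word boundary outside a class, a backspace (0x08) inside one.
boundary = re.search(r'\bcat\b', 'a cat sat') is not None
backspace = re.fullmatch(r'[\b]', '\x08') is not None

# \1: a single nonzero digit is a group reference.
group_ref = re.fullmatch(r'(a)\1', 'aa') is not None

# \01: a leading 0 makes it an octal escape (character code 1).
octal_leading_zero = re.fullmatch(r'(a)\01', 'a\x01') is not None

# \101: three octal digits make it an octal escape (0o101 == ord('A')).
octal_three_digits = re.fullmatch(r'\101', 'A') is not None

print(boundary, backspace, group_ref, octal_leading_zero, octal_three_digits)
```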

Different syntax for the same concepts

Negation has 3 syntaxes:

In CRE, ~ is the only syntax for negation (~chars[], ~digit, or ~ASSERT).
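Assuming the three syntaxes meant here are the ones that line up with the CRE forms above (a negated character class, a negated shorthand class, and a negative assertion), they look like this in Python's re:

```python
import re

# 1. A leading '^' negates a character class.
not_vowel = re.fullmatch(r'[^aeiou]', 'x') is not None

# 2. Upper-casing the letter negates a shorthand class (\d vs. \D).
not_digit = re.fullmatch(r'\D', 'x') is not None
digit_rejected = re.fullmatch(r'\D', '7') is not None

# 3. '!' in place of '=' negates a lookahead assertion.
not_followed = re.search(r'foo(?!bar)', 'foobaz') is not None
followed = re.search(r'foo(?!bar)', 'foobar') is not None

print(not_vowel, not_digit, digit_rejected, not_followed, followed)
```

Three unrelated spellings (a prefix character, a change of case, and a punctuation swap) for what is conceptually one operator.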

Zero-width assertions have multiple syntax styles (a historical accident):
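For instance, Python's re accepts at least three spellings of zero-width assertions: bare punctuation, backslash plus a letter, and the (?...) group form. The sample text and names below are made up for illustration.

```python
import re

text = 'price: 42 USD'

# Style 1: bare punctuation.
caret = re.search(r'^price', text)
# Style 2: backslash plus a letter.
backslash_a = re.search(r'\Aprice', text)
word_boundary = re.search(r'\b42\b', text)
# Style 3: parenthesized (?...) groups.
lookahead = re.search(r'\d+(?= USD)', text)
lookbehind = re.search(r'(?<=price: )\d+', text)

# All of them consume no characters of their own.
empty_span = re.search(r'^', text).span()

print(caret.group(), backslash_a.group(), word_boundary.group())
print(lookahead.group(), lookbehind.group(), empty_span)
```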

Another historical accident:

On a related note, constructs have been added that conflate regular expressions with a backtracking implementation.

CRE tries to make this distinction by using one syntax for constructs defined by the traditional theory of automata and a different syntax for constructs defined by the backtracking algorithm.
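A few examples of what such constructs might be, sketched in Python. The choice of examples is mine, not necessarily the list the author had in mind: each result below depends on the order in which a backtracking matcher tries things, or goes beyond what a finite automaton can recognize.

```python
import re

# Ordered alternation: the first alternative that succeeds wins,
# even when a later one would match more of the input.
first_wins = re.match(r'a|ab', 'ab').group()
longer_first = re.match(r'ab|a', 'ab').group()

# Greedy vs. non-greedy repetition: both describe the same set of
# strings, but they differ in which match the search returns.
greedy = re.match(r'a*', 'aaa').group()
lazy = re.match(r'a*?', 'aaa').group()

# Backreferences: "the same run of a's on both sides of b" is not a
# regular language in the automata-theory sense.
doubled = re.fullmatch(r'(a+)b\1', 'aabaa') is not None
uneven = re.fullmatch(r'(a+)b\1', 'aabaaa') is not None

print(repr(first_wins), repr(longer_first), repr(greedy), repr(lazy))
print(doubled, uneven)
```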

Python substitution syntax quirks

Again, the backslash is too heavily overloaded.

\g<0> means the whole match, \g<1> means group 1

\0 does NOT mean the whole match, but \1 means group 1.
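A short sketch of the asymmetry, as observed with re.sub in CPython 3. In the replacement template, \0 is read as an octal escape, so it produces a NUL character rather than the whole match.

```python
import re

pattern = r'(\w+)'

whole_g0 = re.sub(pattern, r'<\g<0>>', 'hi')   # whole match
group_g1 = re.sub(pattern, r'<\g<1>>', 'hi')   # group 1
group_1 = re.sub(pattern, r'<\1>', 'hi')       # group 1 again
nul = re.sub(pattern, r'<\0>', 'hi')           # octal escape, not the match

# \g<1> is also what you need when a digit follows the reference.
followed_by_digit = re.sub(pattern, r'\g<1>0', 'hi')

print(whole_g0, group_g1, group_1, repr(nul), followed_by_digit)
```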

Extraneous syntax.

Python re API quirks

Already mentioned: the match vs. search confusion.
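For reference, the confusion in a nutshell: re.match is implicitly anchored at the start of the string, while re.search scans for the first position where the pattern matches.

```python
import re

# match() only tries position 0, so 'b' is not found in 'abc'.
match_result = re.match(r'b', 'abc')
# search() scans forward and finds 'b' at index 1.
search_result = re.search(r'b', 'abc')
# fullmatch() additionally requires the match to reach the end.
fullmatch_result = re.fullmatch(r'a', 'abc')

print(match_result, search_result.start(), fullmatch_result)
```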

Also TODO: document exceptions. You have to import another module to catch an exception.

sre.parse_template raises them.

Last modified: 2013-01-29 10:42:48 -0800