CRE Syntax

This is the reference for CRE syntax. It describes each construct and lists the corresponding traditional (Perl-like) syntax.

Annex, a Python library, supports only a subset of CRE syntax because it is a wrapper around Python's re module.

CREs (and traditional syntax) can be thought of in two parts: a language to specify sets of characters, and a language on top of that to specify sets of strings.

Specifying Characters

Single Metacharacters

These constructs stand for a set of characters.

CRE            Traditional  Definition                                               Notes
any            .            any character, possibly including newline (anyall=true)
chars[x y z]   [xyz]        character class                                          Insignificant space makes these more readable.
!chars[x y z]  [^xyz]       negated character class
digit          \d           Perl character class
!digit         \D           negated Perl character class
:alpha         [:alpha:]    ASCII character class
!:alpha        [:^alpha:]   negated ASCII character class
::Greek        \p{Greek}    Unicode character class
!::Greek       \P{Greek}    negated Unicode character class
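
To check the Traditional column against Python's re module (the engine Annex wraps), a minimal sketch follows. The CRE spellings appear only in comments, since this sketch does not assume anything about Annex's own API, and the variable names are illustrative.

```python
import re

# Traditional spellings, with the corresponding CRE form in a comment.
any_char = re.compile(r".")              # CRE: any
any_all = re.compile(r".", re.DOTALL)    # CRE: any, with anyall=true
vowels = re.compile(r"[aeiou]")          # CRE: chars[a e i o u]
non_vowels = re.compile(r"[^aeiou]")     # CRE: !chars[a e i o u]
non_digit = re.compile(r"\D")            # CRE: !digit

newline_default = any_char.match("\n")   # "." skips newline unless DOTALL is set
newline_dotall = any_all.match("\n")
vowel_hits = vowels.findall("regex")
first_non_vowel = non_vowels.search("aeiox").group()
first_non_digit = non_digit.search("2013-01").group()
```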

Character Literals

These constructs stand for exactly one character. They're all valid inside or outside a character class. There are no octal escapes.

CRE    Traditional  Definition          Notes
0x00   varies       Hex escape
&201c  varies       Unicode code point
&name  e.g. \n      Named character.    See the section below.
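
The Traditional spelling "varies" by engine. As one concrete case, the sketch below shows the forms Python's re accepts for the same three kinds of literal, inside and outside a character class. The sample text and variable names are made up for illustration.

```python
import re

nul = re.compile(r"\x00")             # CRE: 0x00
left_quote = re.compile(r"\u201c")    # CRE: &201c
newline = re.compile(r"\n")           # CRE: &newline

text = "a\x00b \u201cquoted\u201d\n"
nul_pos = nul.search(text).start()
quote_pos = left_quote.search(text).start()
newline_pos = newline.search(text).start()

# The same escapes are valid inside a character class.
low_chars = re.findall(r"[\x00-\x20]", text)
```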

Named Characters

This is a list of named characters. There are no equivalents of \b for backspace or \f for form feed, etc. Use hex or Unicode escapes instead.

CRE                                      Traditional    Definition             Notes
&space                                                  space character        Traditional regexes just use a literal space, or [ ]. Whitespace is never significant in CREs, so this is necessary to represent a space inside a character class.
&newline &cr &tab                        \n \r \t       whitespace characters  The backslash doesn't mean anything special in CRE. Use the default concatenation operator: 'a' &newline 'b' instead of a\nb.
&hyphen &bang &hash &lbracket &rbracket  \- ! \# \[ \]  The character named.   Only needed inside character classes. Otherwise use '-'.
&squote &dquote                          ' "            The character named.   Purely syntactic sugar. You can always use chars["] inside a char class, or '"' outside of one.
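
The reason for dropping \b as a backspace escape is easier to see in traditional syntax, where the same escape means two different things. The sketch below uses Python's re to show that overloading, plus the escaping a traditional character class needs for the characters CRE gives names to. Variable names are illustrative.

```python
import re

# Traditional syntax overloads \b: word boundary outside a class,
# backspace (0x08) inside one.
boundary_hits = re.findall(r"\bcat\b", "cat concat cat")
backspace_hits = re.findall(r"[\b]", "a\x08b")
hex_hits = re.findall(r"\x08", "a\x08b")    # unambiguous hex form

# Hyphen, right bracket, and space inside a traditional class
# (CRE: &hyphen, &rbracket, &space).
specials = re.compile(r"[\-\] ]")
special_hits = specials.findall("a-b]c d")
```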

Character Class Elements

These elements may appear inside a character class.

Named classes can be negated with the ! operator: !digit, !:alnum, !::Greek.

CRE  Traditional  Definition                   Notes
x    x            single character
A-Z  A-Z          character range (inclusive)  Ranges must be separated by space, e.g. chars[a-z A-Z] not chars[a-zA-Z]. Escapes are also allowed, e.g. chars[0x00 - 0x20].

In addition, all character literals, as well as named classes and their negations, may appear within a character class. For example: digit, !digit, :alnum, !:alnum, &space, 0x00, and &201c.
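
For comparison, here is how the same kinds of class elements look in Python's traditional syntax, where ranges are written back to back with no separating space. The CRE forms in the comments are assembled from the rules above and are meant as illustrations, not as tested Annex input.

```python
import re

letters = re.compile(r"[a-zA-Z]")         # CRE: chars[a-z A-Z]
control = re.compile(r"[\x00-\x20]")      # CRE: chars[0x00 - 0x20]
digit_or_space = re.compile(r"[\d ]")     # CRE (presumably): chars[digit &space]

letter_hits = letters.findall("a1 B2")
control_hits = control.findall("a\tb c")
digit_or_space_hits = digit_or_space.findall("a1 b2")
```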

Perl Named Classes

CRE          Traditional  Definition                         Notes
digit        \d           digits [0-9]
!digit       \D           not digits [^0-9]
whitespace   \s           whitespace [\t\n\f\r ]
!whitespace  \S           not whitespace [^\t\n\f\r ]
wordchar     \w           word characters [0-9A-Za-z_]       The name word was confusing, since it sounds like it stands for multiple characters, so we use the longer wordchar.
!wordchar    \W           not word characters [^0-9A-Za-z_]
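
The bracketed equivalents in the table are ASCII-only. In Python 3, \d, \s, and \w are Unicode-aware for str patterns unless re.ASCII is set, and Python's \s also includes \v, which the table's definition omits. The sketch below compares the escapes with the explicit classes on a sample string that avoids those differences, then shows one case where they diverge.

```python
import re

text = "id_42\tok\n"

digits = re.findall(r"\d", text, re.ASCII)            # CRE: digit
digits_explicit = re.findall(r"[0-9]", text)
words = re.findall(r"\w+", text, re.ASCII)            # CRE: wordchar+
words_explicit = re.findall(r"[0-9A-Za-z_]+", text)
spaces = re.findall(r"\s", text, re.ASCII)            # CRE: whitespace
spaces_explicit = re.findall(r"[\t\n\f\r ]", text)

# ARABIC-INDIC DIGIT THREE: a digit under Unicode matching, not under ASCII.
unicode_digit = re.findall(r"\d", "\u0663")
ascii_digit = re.findall(r"\d", "\u0663", re.ASCII)
```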

POSIX Named Classes

These are not supported in Annex.

CRE      Traditional  Definition                                                              Notes
:alnum   [:alnum:]    alphanumeric (== [0-9A-Za-z])
:alpha   [:alpha:]    alphabetic (== [A-Za-z])
:ascii   [:ascii:]    ASCII (== [\x00-\x7F])
:blank   [:blank:]    blank (== [\t ])
:cntrl   [:cntrl:]    control (== [\x00-\x1F\x7F])
:digit   [:digit:]    digits (== [0-9])
:graph   [:graph:]    graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
:lower   [:lower:]    lower case (== [a-z])
:print   [:print:]    printable (== [ -~] == [ [:graph:]])
:punct   [:punct:]    punctuation (== [!-/:-@[-`{-~])
:space   [:space:]    whitespace (== [\t\n\v\f\r ])
:upper   [:upper:]    upper case (== [A-Z])
:word    [:word:]     word characters (== [0-9A-Za-z_])
:xdigit  [:xdigit:]   hex digit (== [0-9A-Fa-f])
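
Python's re has no POSIX class syntax, so one workaround is to spell out the explicit ranges from the Definition column. The sketch below is a hypothetical helper, not part of Annex: POSIX_ASCII and members are names invented here, and only a few classes are included.

```python
import re
import string

# Explicit ASCII ranges copied from the Definition column above.
POSIX_ASCII = {
    "alnum": r"[0-9A-Za-z]",
    "alpha": r"[A-Za-z]",
    "blank": r"[\t ]",
    "digit": r"[0-9]",
    "lower": r"[a-z]",
    "punct": r"[!-/:-@\[-`{-~]",
    "upper": r"[A-Z]",
    "xdigit": r"[0-9A-Fa-f]",
}


def members(name):
    """Return the set of ASCII characters matched by the named class."""
    pattern = re.compile(POSIX_ASCII[name])
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}
```

Comparing members("punct") with string.punctuation, or members("xdigit") with string.hexdigits, is a quick way to confirm that a range was transcribed correctly.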

Specifying Strings

Literals

CRE         Traditional  Definition             Notes
'x' or "x"  x            The literal string x.  Each type of string can contain the opposite quote, e.g. '"' is a double quote and "'" is a single quote. There are no backslash escapes. You can use the concatenation operator to write strings with special characters, e.g. 'a' 0x00 'b' is like "a\0b" in C.
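
In traditional syntax a literal string is not quoted, so metacharacters in it must be escaped. A rough Python analogue of a quoted CRE literal is re.escape, as in this sketch. The example strings are made up.

```python
import re

# CRE: 'a' 0x00 'b'   (like the C string "a\0b")
nul_joined = re.compile(re.escape("a") + r"\x00" + re.escape("b"))

# CRE: '1+1=2' is a literal; in traditional syntax the "+" must be escaped.
literal = re.compile(re.escape("1+1=2"))
unescaped = re.compile("1+1=2")     # here "+" is a repetition operator

# CRE: '"' "'"   (each quote style holds the opposite quote)
quotes = re.compile(re.escape('"') + re.escape("'"))
```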

Composites

CRE            Traditional  Definition         Notes
x y            xy           x followed by y
either x or y  x|y          x or y (prefer x)  Prefix syntax makes long alternations easier for humans and computers to parse.
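
"Prefer x" means the first alternative that matches wins, even when a later one would match more text. A short sketch with Python's re shows the effect of the ordering:

```python
import re

# CRE: either 'a' or 'ab'   (traditional: a|ab)
first_wins = re.match(r"a|ab", "ab").group()

# CRE: either 'ab' or 'a'   (traditional: ab|a)
longer_first = re.match(r"ab|a", "ab").group()

# CRE: 'a' 'b'   (traditional: ab)
concatenated = re.fullmatch(r"ab", "ab")
```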

Repetitions

General repetitions look like x^(modifier repetition). They're only needed in the relatively rare cases of a ranged numbered repetition or possessive repetition. Otherwise, +, *, ?, ++, **, ??, and ^n suffice.

Possessive repetitions (traditionally the + suffix: *+, ++, ?+, ...) are written like the non-greedy repetitions, but with a P instead of an N. For example, x^(P *) or x^(P 2..3). These are not supported in Python.

CRE             Traditional  Definition                            Notes
x*              x*           zero or more x, prefer more
x+              x+           one or more x, prefer more
x?              x?           zero or one x, prefer one
x^(n..m)        x{n,m}       n or n+1 or ... or m x, prefer more
x^(n..)         x{n,}        n or more x, prefer more
x^n             x{n}         exactly n x
x** or x^(N *)  x*?          zero or more x, prefer fewer
x++ or x^(N +)  x+?          one or more x, prefer fewer
x?? or x^(N ?)  x??          zero or one x, prefer zero
x^(N n..m)      x{n,m}?      n or n+1 or ... or m x, prefer fewer
x^(N n..)       x{n,}?       n or more x, prefer fewer
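
The "prefer more" and "prefer fewer" wording can be checked directly against the traditional forms. The sketch below runs each one with Python's re on the same four-character input.

```python
import re

text = "aaaa"

greedy_star = re.match(r"a*", text).group()         # CRE: 'a'*
lazy_star = re.match(r"a*?", text).group()          # CRE: 'a'**
lazy_plus = re.match(r"a+?", text).group()          # CRE: 'a'++
lazy_optional = re.match(r"a??", text).group()      # CRE: 'a'??
greedy_range = re.match(r"a{2,3}", text).group()    # CRE: 'a'^(2..3)
lazy_range = re.match(r"a{2,3}?", text).group()     # CRE: 'a'^(N 2..3)
lazy_open = re.match(r"a{2,}?", text).group()       # CRE: 'a'^(N 2..)
exact = re.match(r"a{2}", text).group()             # CRE: 'a'^2
```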

Grouping

CRE                  Traditional                                 Definition                        Notes
{re}                 (re)                                        numbered capturing group
{re as name}         (?P<name>re) or (?<name>re) or (?'name're)  named & numbered capturing group  no standard traditional syntax
(re)                 (?:re)                                      non-capturing group
MyName = expression                                              Named subexpression.              Creates a pattern that can be referenced elsewhere. It does not create a capturing group. Naming convention is CapWords.
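
In Python, the traditional spelling of a named group is (?P<name>re). Traditional syntax has no named subexpressions, and the closest everyday substitute is building the pattern from string variables, as sketched below. The Digits variable and the sample inputs are illustrative.

```python
import re

# CRE (roughly): {digit^4 as year} '-' {digit^2} ('-' digit^2)?
date = re.compile(r"(?P<year>\d{4})-(\d{2})(?:-\d{2})?")
m = date.match("2013-01-27")

by_number = m.group(1)      # a named group is also numbered
by_name = m.group("year")
all_groups = m.groups()     # the non-capturing group adds nothing here

# Rough analogue of the named subexpression "Digits = digit+":
# a reusable fragment that creates no capturing group by itself.
Digits = r"\d+"
version = re.compile(rf"({Digits})\.({Digits})")
version_parts = version.match("3.11").groups()
```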

Zero Width Assertions

These constructs match empty space.

CRE          Traditional  Definition                                                        Notes
%begin       ^            at beginning of text or line (multiline=true)
%end         $            at end of text (like \z not \Z) or line (multiline=true)
%begin-text  \A           at beginning of text
%end-text    \z           at end of text
%boundary    \b           at word boundary (\w on one side and \W, \A, or \z on the other)
!%boundary   \B           not a word boundary
%begin-word  \<           left word boundary
%end-word    \>           right word boundary
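
The traditional spellings differ between engines, which is part of what the named assertions avoid. In Python's re, $ also matches just before a final newline, \Z plays the role of Perl's \z, and \< and \> do not exist. The sketch below shows those behaviors. The lookaround emulation of the left and right word boundaries is one common workaround, not something taken from Annex.

```python
import re

text = "one\ntwo"
starts_default = [m.start() for m in re.finditer(r"^", text)]
starts_multiline = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

trailing = "one\ntwo\n"
dollar = re.search(r"two$", trailing)        # matches before the final newline
strict_end = re.search(r"two\Z", trailing)   # only matches at the very end

words = "hi yo"
boundaries = [m.start() for m in re.finditer(r"\b", words)]
non_boundaries = [m.start() for m in re.finditer(r"\B", words)]

# Emulating \< and \> with a boundary plus a lookaround.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", words)]
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", words)]
```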

Backtracking Constructs

These constructs imply a backtracking implementation. They are identified by keywords in CAPS.

Notes:

CRE                   Traditional               Definition                              Notes
%ASSERT(re)           (?=re)                    lookahead assertion.
!%ASSERT(re)          (?!re)                    negative lookahead assertion.
%ASSERTLEFT(re)       (?<=re)                   lookbehind assertion.
!%ASSERTLEFT(re)      (?<!re)                   negative lookbehind assertion.
REF(1)                \1                        Backreference to captured group.
REF(name)             (?P=name)                 Backreference to named captured group.
IF foo THEN a ELSE b  (?(foo)yes|no)            Pattern conditional on backreference.
RECURSE(pattern)      ??{name} or (?n) or (?R)  Recurse into the pattern.               The pattern can be a group identified by name, number, or the entire pattern itself.
ATOMIC(...)           (?>...)                   Atomic grouping.                        Java and Ruby both support this.
%END-PREV             \G                        The end of the previous match.          Not supported by Python (or re2).
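
Several of these constructs have direct traditional equivalents in Python's re. The sketch below exercises the lookaround, backreference, and conditional forms. The sample strings are invented, and the CRE column is referenced only in comments.

```python
import re

# Lookahead and negative lookahead (CRE: %ASSERT, !%ASSERT).
before_colon = re.findall(r"\w+(?=:)", "key: value")
not_before_colon = re.findall(r"\b\w+\b(?!:)", "key: value")

# Lookbehind and negative lookbehind (CRE: %ASSERTLEFT, !%ASSERTLEFT).
after_dollar = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "cost $42, qty 7")

# Backreferences by number and by name (CRE: REF(1), REF(name)).
doubled = re.search(r"(\w)\1", "hello").group()
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()

# Conditional on a group: require ")" only if "(" was matched.
parens = re.compile(r"^(\()?\d+(?(1)\))$")
balanced = [s for s in ["12", "(12)", "(12", "12)"] if parens.match(s)]
```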

Other

Top Level Syntax

A CRE can be either an expression or a list of named expressions, one of which is Start. For example, this is a valid CRE:

digit+

So is this:

Start = digit+

and this:

D     = digit+
Start = D

Flags

Flag syntax is flags(x -y z) (set x and z, clear y).

CRE                           Traditional  Definition  Notes
flags(multiline unicode ...)  (?flags)     set flags.  In Python this is only valid at the start of the entire pattern. In CRE, the names for flags are whole words like multiline, not single letters.

List of flags:

CRE         Traditional  Definition                                                                                   Notes
ignorecase  i            case-insensitive (default false)
multiline   m            multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
anyall      s            let . match \n (default false)
ungreedy    U            ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)                        not available in Python.
unicode     u            Make character classes dependent on Unicode character properties database.                   Python's re.UNICODE flag.
debug                    Display debug information about compiled expression.                                         Python's re.DEBUG. (May be Python only?)
locale      L            Make \w, \W, \b, \B, \s and \S dependent on the current locale.                              Python's re.LOCALE. (May be Python only?)
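
For the three flags with single-letter traditional names that Python shares (i, m, s), the sketch below shows both the module-level constants and the inline (?flags) form in Python's re. It is meant as a cross-check of the Definition column, and the inputs are illustrative.

```python
import re

# CRE flag ignorecase (traditional: i)
ignorecase = re.fullmatch(r"abc", "ABC", re.IGNORECASE)

# CRE flag multiline (traditional: m): ^ and $ also match at line breaks.
lines = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
lines_default = re.findall(r"^\w+$", "one\ntwo")

# CRE flag anyall (traditional: s): "." also matches newline.
anyall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)
anyall_default = re.fullmatch(r"a.b", "a\nb")

# Inline form, placed at the start of the pattern.
inline = re.fullmatch(r"(?is)a.b", "A\nB")
```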

Comments

CRE                  Traditional  Definition  Notes
# until end of line  (?#text)     comment
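
Python's re supports both comment styles in traditional syntax: (?#text) anywhere, and # to end of line when re.VERBOSE is set. The sketch below writes the same pattern both ways.

```python
import re

# "#" comments require verbose mode, which also ignores most whitespace.
verbose = re.compile(
    r"""
    \d{4}   # year
    -
    \d{2}   # month
    """,
    re.VERBOSE,
)

# (?#text) comments work without any flag.
inline_comment = re.compile(r"\d{4}(?#year)-\d{2}(?#month)")

verbose_match = verbose.fullmatch("2013-01")
inline_match = inline_comment.fullmatch("2013-01")
```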

Reference

List of Reserved Words

A word without a punctuation prefix (such as the % in %begin or the : in :alnum) is assumed to be the name of a subexpression, except if it's one of these reserved words.

List of Punctuation Used

Unused: / \ ~ @ $ | ; ` < > , .

Grammar

Parser for CRE -- This is in TPE syntax. It's auto-generated from the source code, so it's the most precise documentation. TPE itself is rigorously specified, like PEG.

Links

Regular expression references for various languages:


Last modified: 2013-01-27 10:40:07 -0800