This chapter introduces the Posix regexp notation. This is not a formal or precise definition of Posix regexps -- it is an intuitive and hopefully expository description of them.
An Introduction to Regexps
In the simplest cases, a regexp is just a literal string that must match exactly. For example, the pattern:
    regexp
matches the string "regexp" and no others.
Some characters have a special meaning when they occur in a regexp. They aren't matched literally as in the previous example, but instead denote a more general pattern. For example, the character * indicates that the preceding element of a regexp may be repeated 0, 1, or more times. In the pattern:
    smooo*th
the * indicates that the preceding o can be repeated 0 or more times. So the pattern matches:
    smooth smoooth smooooth smoooooth ...
Suppose you want to write a pattern that literally matches a special character like * -- in other words, you don't want * to indicate a permissible repetition, but to match * literally. This is accomplished by quoting the special character with a backslash. The pattern:
    smoo\*th
matches the string:
    smoo*th
and no other strings.
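The two patterns above can be tried out with a short sketch. It uses Python's re module as a stand-in, which is an assumption of this example rather than anything the chapter prescribes: Python is not a Posix regexp engine, but * and \* mean the same thing there. The helper name matches is hypothetical, and re.fullmatch is used because the text speaks of a pattern matching a whole string.

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` is matched by `pattern`."""
    return re.fullmatch(pattern, text) is not None

# `*` lets the preceding "o" repeat 0 or more times.
print(matches(r"smooo*th", "smooth"))     # True
print(matches(r"smooo*th", "smooooth"))   # True
print(matches(r"smooo*th", "smoth"))      # False: at least two o's are required

# A backslash quotes `*`, so only a literal asterisk matches.
print(matches(r"smoo\*th", "smoo*th"))    # True
print(matches(r"smoo\*th", "smooth"))     # False
```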
In seven cases, the pattern is reversed -- a backslash makes the character special instead of making a special character normal. The characters +, ?, |, (, ), {, and } are normal, but the sequences \+, \?, \|, \(, \), \{, and \} are special (their meaning is described later).
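Many other regexp dialects, Python's among them, use the opposite convention for these seven characters. As a rough illustration of the reversal, the sketch below translates a pattern written in this chapter's notation into Python's. The helper bre_to_python and the name FLIPPED are hypothetical, and the translation is deliberately incomplete: it does not treat bracketed character sets specially.

```python
import re

# The seven characters whose meaning flips when preceded by a backslash.
FLIPPED = set("+?|(){}")

def bre_to_python(pattern):
    """Translate the backslash convention described above into Python's.

    Hypothetical helper: it handles only the seven flipped characters and
    does not treat bracketed character sets specially.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \+ is special here; Python spells the same operator as +.
            # Any other escaped pair (such as \*) is copied unchanged.
            out.append(nxt if nxt in FLIPPED else ch + nxt)
            i += 2
        else:
            # A bare + is an ordinary character here; Python needs \+ for that.
            out.append("\\" + ch if ch in FLIPPED else ch)
            i += 1
    return "".join(out)

print(bre_to_python(r"bana\(na\)\+"))   # bana(na)+
print(bre_to_python(r"1+1=2"))          # 1\+1=2
print(bool(re.fullmatch(bre_to_python(r"bana\(na\)\+"), "banana")))   # True
```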
The remaining sections of this chapter introduce and explain the various special characters that can occur in regexps.
Literal Regexps
A literal regexp is a string which contains no special characters. A literal regexp matches an identical string and no other strings. For example:
    literally
matches
    literally
and nothing else.
Generally, whitespace characters, numbers, and letters are not special. Some punctuation characters are special and some are not (the syntax summary at the end of this chapter makes a convenient reference for which characters are special and which aren't).
Character Sets
This section introduces the special characters . and [.
The character . matches any single character except the NUL character. For example:
    p.ck
matches
    pick pack puck pbck pcck p.ck ...
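A quick way to see the effect of . is to filter a word list. This sketch again borrows Python's re module as a stand-in (an assumption of the example). One difference to keep in mind: by default Python's . excludes the newline character rather than NUL.

```python
import re

words = ["pick", "pack", "p.ck", "pck", "peck!"]

# `.` stands for exactly one character, so only four-character words
# of the shape p?ck survive the filter.
hits = [w for w in words if re.fullmatch(r"p.ck", w)]
print(hits)   # ['pick', 'pack', 'p.ck']
```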
[ begins a character set. A character set is similar to . in that it matches not a single, literal character, but any of a set of characters. [ is different from . in that with [, you define the set of characters explicitly.
There are three basic forms a character set can take.
In the first form, the character set is spelled out:
    [<cset-spec>] -- every character in <cset-spec> is in the set.
In the second form, the character set is the negation of an explicitly spelled-out set:
    [^<cset-spec>] -- every character *not* in <cset-spec> is in the set.
A <cset-spec> is more or less an explicit enumeration of a set of characters. It can be written as a string of individual characters:
    [aeiou]
or as a range of characters:
    [0-9]
These two forms can be mixed:
    [A-Za-z0-9_$]
Note that special regexp characters (such as *) are not special within a character set. The character -, as illustrated above, is special, except, as illustrated below, when it is the first character mentioned.
This is a four-character set:
    [-+*/]
The third form of a character set makes use of a pre-defined "character class":
    [[:class-name:]] -- every character described by class-name is in the set.
The supported character classes are:
    alnum  - the set of alpha-numeric characters
    alpha  - the set of alphabetic characters
    blank  - tab and space
    cntrl  - the control characters
    digit  - decimal digits
    graph  - all printable characters except space
    lower  - lower case letters
    print  - the "printable" characters
    punct  - punctuation
    space  - whitespace characters
    upper  - upper case letters
    xdigit - hexadecimal digits
Finally, character class sets can also be inverted:
    [^[:space:]] - all non-whitespace characters
Character sets can be used in a regular expression anywhere a literal character can.
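The set forms above can be exercised one character at a time. The sketch below uses Python's re module as a stand-in (an assumption of the example). Python's re has no [[:class-name:]] form, so the hypothetical ASCII_CLASSES table spells out a few classes as they would be in the plain ASCII "C" locale. The helper names in_set and class_set are also hypothetical.

```python
import re
import string

def in_set(cset, ch):
    """Return True if the single character `ch` is matched by the set `cset`."""
    return re.fullmatch(cset, ch) is not None

print(in_set(r"[aeiou]", "e"))          # True
print(in_set(r"[0-9]", "7"))            # True
print(in_set(r"[A-Za-z0-9_$]", "$"))    # True
print(in_set(r"[^aeiou]", "e"))         # False: negated set
print(in_set(r"[-+*/]", "*"))           # True: * is not special in a set
print(in_set(r"[-+*/]", "-"))           # True: a leading - is literal

# Assumption: ASCII-only approximations of four of the named classes.
ASCII_CLASSES = {
    "digit": string.digits,
    "xdigit": string.hexdigits,
    "blank": " \t",
    "punct": string.punctuation,
}

def class_set(name, negate=False):
    """Build a Python character set standing in for [[:name:]] or [^[:name:]]."""
    body = "".join(re.escape(c) for c in ASCII_CLASSES[name])
    return "[" + ("^" if negate else "") + body + "]"

print(in_set(class_set("xdigit"), "F"))               # True
print(in_set(class_set("punct"), "^"))                # True
print(in_set(class_set("blank", negate=True), "x"))   # True
```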
Subexpressions
A subexpression is a regular expression enclosed in \( and \). A subexpression can be used anywhere a single character or character set can be used.
Subexpressions are useful for grouping regexp constructs. For example, the repeat operator, *, usually applies to just the preceding character. Recall that:
    smooo*th
matches
    smooth smoooth ...
Using a subexpression, we can apply * to a longer string:
    banan\(an\)*a
matches
    banana bananana banananana ...
Subexpressions also have a special meaning with regard to backreferences and substitutions (see Backreferences, Extractions and Substitutions).
Repeated Subexpressions
* is the repeat operator. It applies to the preceding character, character set, subexpression or backreference. It indicates that the preceding element can be matched 0 or more times:
    bana\(na\)*
matches
    bana banana bananana banananana ...
\+ is similar to * except that \+ requires the preceding element to be matched at least once. So while:
    bana\(na\)*
matches
    bana
the pattern
    bana\(na\)\+
does not. Both match
    banana bananana banananana ...
Thus, bana\(na\)\+ is short-hand for banana\(na\)*.
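The difference between the two repeat operators can be checked side by side. This sketch uses Python's re module as a stand-in (an assumption of the example). In Python the group and repeat operators are written without backslashes, so bana\(na\)* becomes bana(na)* and bana\(na\)\+ becomes bana(na)+.

```python
import re

star = re.compile(r"bana(na)*")   # 0 or more repetitions of "na"
plus = re.compile(r"bana(na)+")   # 1 or more repetitions of "na"

for word in ["bana", "banana", "bananana"]:
    print(word, bool(star.fullmatch(word)), bool(plus.fullmatch(word)))
# bana True False
# banana True True
# bananana True True

# The short-hand claim: bana(na)+ accepts the same strings as banana(na)*.
longhand = re.compile(r"banana(na)*")
samples = ["bana", "banana", "bananana", "banan"]
print(all(bool(plus.fullmatch(s)) == bool(longhand.fullmatch(s))
          for s in samples))   # True
```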
Optional Subexpressions
\? indicates that the preceding character, character set, or subexpression is optional. It is permitted to match, or to be skipped:
    CSNY\?
matches both
    CSN
and
    CSNY
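As a small check of the optional operator, the sketch below filters a few candidate strings. It uses Python's re module as a stand-in (an assumption of the example), where CSNY\? is written CSNY?.

```python
import re

optional = re.compile(r"CSNY?")   # the final Y may be present or absent

candidates = ["CSN", "CSNY", "CSNYY", "CS"]
accepted = [s for s in candidates if optional.fullmatch(s)]
print(accepted)   # ['CSN', 'CSNY']
```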
Counted Subexpressions
An interval expression, \{m,n\}, where m and n are non-negative integers with n >= m, applies to the preceding character, character set, subexpression or backreference. It indicates that the preceding element must match at least m times and may match as many as n times.
For example:
    c\([ad]\)\{1,4\}r
matches
    car cdr caar cdar ... caaar cdaar ... cadddr cddddr
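The interval can be checked exhaustively, since only a handful of strings qualify. This sketch uses Python's re module as a stand-in (an assumption of the example), where the interval is written {1,4} without backslashes. The pattern is given a trailing r so that whole strings such as car are matched.

```python
import re
from itertools import product

counted = re.compile(r"c([ad]){1,4}r")

# Every string with 1 to 4 letters from {a, d} between c and r:
# 2 + 4 + 8 + 16 candidates in total.
expected = ["c" + "".join(mid) + "r"
            for n in range(1, 5)
            for mid in product("ad", repeat=n)]

print(len(expected))                                  # 30
print(all(counted.fullmatch(s) for s in expected))    # True
print(bool(counted.fullmatch("cr")))                  # False: fewer than 1
print(bool(counted.fullmatch("caddddr")))             # False: more than 4
```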
Alternative Subexpressions
An alternative is written:
    regexp-1\|regexp-2\|regexp-3\|...
It matches anything matched by some regexp-n. For example:
    Crosby, Stills, \(and Nash\|Nash, and Young\)
matches
    Crosby, Stills, and Nash
and
    Crosby, Stills, Nash, and Young
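The alternative above can be tried with Python's re module as a stand-in (an assumption of the example), where \| and the grouping parentheses are written without backslashes.

```python
import re

lineup = re.compile(r"Crosby, Stills, (and Nash|Nash, and Young)")

print(bool(lineup.fullmatch("Crosby, Stills, and Nash")))          # True
print(bool(lineup.fullmatch("Crosby, Stills, Nash, and Young")))   # True
print(bool(lineup.fullmatch("Crosby, Stills, and Young")))         # False
```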
Backreferences, Extractions and Substitutions
A backreference is written \n, where n is some single digit other than 0. To be a valid backreference, there must be at least n parenthesized subexpressions in the pattern prior to the backreference.
A backreference matches a literal copy of whatever was matched by the corresponding subexpression. For example,
    \(.*\)-\1
matches:
    go-go ha-ha wakka-wakka ...
In some applications, subexpressions are used to extract substrings. For example, Emacs has the functions match-beginning and match-end, which report the positions of strings matched by subexpressions. These functions use the same numbering scheme for subexpressions as backreferences, with the additional rule that subexpression 0 is defined to be the whole regexp.
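Both ideas, backreferences and position extraction, can be sketched with Python's re module as a stand-in (an assumption of the example). Python keeps \1 for the backreference but writes the group without backslashes. Its Match.start(n) and Match.end(n) methods play a role comparable to the Emacs functions mentioned above, with group 0 again standing for the whole match.

```python
import re

echo = re.compile(r"(.*)-\1")   # \(.*\)-\1 in this chapter's notation

print(bool(echo.fullmatch("wakka-wakka")))   # True
print(bool(echo.fullmatch("go-went")))       # False: "went" is not a copy of "go"

m = echo.fullmatch("ha-ha")
print(m.group(1), m.start(1), m.end(1))   # ha 0 2
print(m.group(0), m.start(0), m.end(0))   # ha-ha 0 5
```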
In some applications, subexpressions are used in string substitution. This again uses the backreference numbering scheme. For example, this sed command:
    s/From:.*<\(.*\)>/To: \1/
first matches the line:
    From: Joe Schmoe <schmoe@uspringfield.edu>
When it does, subexpression 1 matches "schmoe@uspringfield.edu". The command replaces the matched line with "To: \1" after doing subexpression substitution on it to get:
    To: schmoe@uspringfield.edu
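The same substitution can be sketched outside sed. The example below uses Python's re.sub as a stand-in (an assumption of the example), with the group written without backslashes and count=1 mirroring a sed s command that has no g flag. The function name rewrite is hypothetical.

```python
import re

def rewrite(line):
    """Replace the first From: header match with a To: line built from group 1."""
    return re.sub(r"From:.*<(.*)>", r"To: \1", line, count=1)

print(rewrite("From: Joe Schmoe <schmoe@uspringfield.edu>"))
# To: schmoe@uspringfield.edu
```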
A Summary of Regexp Syntax
In summary, regexps can be:
    abcd -- matching a string literally
    . -- matching any single character except NUL
    [a-z_?], [^a-z_?], [[:alpha:]] and [^[:alpha:]] -- matching character sets
    \(subexp\) -- grouping an expression into a subexpression
    \n -- matching a copy of whatever was matched by the nth subexpression
The following special characters and sequences can be applied to a character, character set, subexpression, or backreference:
    * -- repeat the preceding element 0 or more times.
    \+ -- repeat the preceding element 1 or more times.
    \? -- match the preceding element 0 or 1 times.
    \{m,n\} -- match the preceding element at least m, and as many as n, times.
    regexp-1\|regexp-2\|... -- match any regexp-n.
A special character, like . or *, can be made into a literal character by prefixing it with \.
A special sequence, like \+ or \?, can be made into a literal character by dropping the \.
Ambiguous Patterns
Sometimes a regular expression appears to be ambiguous. For example, suppose we compare the pattern:
    begin\|beginning
to the string
    beginning
Either just the first 5 characters will match, or the whole string will match.
In every case like this, the longer match is preferred. The whole string will match.
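Not every regexp engine follows this rule. Python's re module, for instance, takes the first alternative that succeeds rather than the longest one. The sketch below shows that difference, then emulates the longest-match rule for a list of simple alternatives. The helper longest_alternative is hypothetical and is only an approximation: it compares Python's preferred match for each alternative and does not implement Posix matching in general.

```python
import re

def longest_alternative(alternatives, text):
    """Try each alternative at the start of `text` and keep the longest match."""
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m is not None and (best is None or len(m.group()) > len(best)):
            best = m.group()
    return best

# Python stops at the first alternative that matches.
print(re.match(r"begin|beginning", "beginning").group())          # begin

# The rule described above prefers the longer match.
print(longest_alternative(["begin", "beginning"], "beginning"))   # beginning
```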
Sometimes there is ambiguity not about how many characters to match, but where the subexpressions occur within the match. This can affect extraction functions like Emacs' match-beginning or rewrite functions like sed's s command. For example, consider matching the pattern:
    b\([^q]*\)\(ing\)\?
against the string
    beginning
One possibility is that the first subexpression matches "eginning" and the second is skipped. Another possibility is that the first subexpression matches "eginn" and the second matches "ing".
The rule is that, consistent with matching as many characters as possible, the length of lower-numbered subexpressions is maximized in preference to maximizing the length of later subexpressions.
In the case of the above example, the two possible matches are equal in overall length. Therefore, it comes down to maximizing the lower-numbered subexpression, \1. The correct answer is that \1 matches "eginning" and \2 is skipped.
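The outcome for this example can be observed with Python's re module, used here only as a stand-in (an assumption of the example). Python is a backtracking engine and does not implement the Posix subexpression rule in general, but on this particular pattern its greedy matching happens to give the same answer. The pattern is written in Python's notation, with unescaped parentheses and ?.

```python
import re

m = re.fullmatch(r"b([^q]*)(ing)?", "beginning")

print(m.group(1))   # eginning
print(m.group(2))   # None: the optional second subexpression was skipped
```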