The GNU C library supports the standard POSIX.2 interface. Programs using this interface should include the header file `rxposix.h'.
Call regcomp to compile a regular expression and prepare to match.
Call regexec to match the compiled pattern that you get from regcomp.
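There is a special data type for compiled regular expressions, regex_t. The one field of this structure that your programs should look at is re_nsub, which holds the number of parenthetical subexpressions in the regular expression that was compiled.
After you create a regex_t object, you can compile a regular expression into it by calling regcomp.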
regcomp ``compiles'' a regular expression into a data structure that you can use with regexec to match against a string. The compiled regular expression format is designed for efficient matching. regcomp stores it into *compiled.
The parameter pattern points to the regular expression to be compiled. When using regcomp, pattern must be 0-terminated. When using regncomp, pattern must be len characters long.
regncomp is not a standard function; strictly POSIX programs should avoid using it.
It's up to you to allocate an object of type regex_t and pass its address to regcomp.
Before freeing the object of type regex_t, you must pass it to regfree. Not doing so may cause subsequent calls to Rx functions to behave strangely.
The argument cflags lets you specify various options that control the syntax and semantics of regular expressions. See Flags for POSIX Regexps.
If you use the flag REG_NOSUB, then regcomp omits from the compiled regular expression the information necessary to record how subexpressions actually match. In this case, you might as well pass 0 for the matchptr and nmatch arguments when you call regexec.
If you don't use REG_NOSUB, then the compiled regular expression does have the capacity to record how subexpressions match. Also, regcomp tells you how many subexpressions pattern has, by storing the number in compiled->re_nsub. You can use that value to decide how long an array to allocate to hold information about subexpression matches.
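As a minimal sketch of sizing the array from re_nsub, the helper below compiles a pattern and allocates one regmatch_t per subexpression plus one for the whole match. The name alloc_matches is hypothetical, and the example assumes the interface is reached through <regex.h>, the header POSIX specifies; substitute `rxposix.h' when building against Rx.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: compile `pattern` as a basic regular expression and
   allocate an array large enough for regexec to describe every
   subexpression.  Returns NULL on failure.  On success the caller must
   free the array and pass *compiled to regfree. */
regmatch_t *alloc_matches(regex_t *compiled, const char *pattern,
                          size_t *nmatch)
{
    regmatch_t *matches;

    if (regcomp(compiled, pattern, 0) != 0)
        return NULL;

    /* Element 0 describes the entire match, so add one to re_nsub. */
    *nmatch = compiled->re_nsub + 1;
    matches = malloc(*nmatch * sizeof *matches);
    if (matches == NULL)
        regfree(compiled);
    return matches;
}
```

The resulting array and count can be passed directly as the matchptr and nmatch arguments of regexec.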
regcomp returns 0 if it succeeds in compiling the regular expression; otherwise, it returns a nonzero error code (see the table below). You can use regerror to produce an error message string describing the reason for a nonzero value; see Regexp Cleanup.
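The return-value convention might be used as in the following sketch, which reports a failed compilation through regerror. The helper name compile_or_report and the 128-byte message buffer are illustrative choices, and <regex.h> is assumed as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` and, if regcomp fails, print
   the message regerror produces for the error code.  Returns the
   value of regcomp unchanged (0 on success). */
int compile_or_report(regex_t *compiled, const char *pattern, int cflags)
{
    int err = regcomp(compiled, pattern, cflags);

    if (err != 0) {
        char msg[128];   /* arbitrary size; regerror truncates if needed */
        regerror(err, compiled, msg, sizeof msg);
        fprintf(stderr, "cannot compile \"%s\": %s\n", pattern, msg);
    }
    return err;
}
```

A caller only needs to call regfree when the function returns 0.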
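Here are the possible nonzero values that regcomp can return:
REG_BADBR
An interval expression between `\{' and `\}' was invalid.
REG_BADPAT
There was a syntax error in the regular expression.
REG_BADRPT
A repetition operator such as `?' or `*' appeared in a bad position, with no preceding subexpression to act on.
REG_ECOLLATE
The regular expression referred to an invalid collating element.
REG_ECTYPE
The regular expression referred to an invalid character class name.
REG_EESCAPE
The regular expression ended with `\'.
REG_ESUBREG
There was an invalid number in a `\digit' back-reference.
REG_EBRACK
There were unbalanced square brackets in the regular expression.
REG_EPAREN
An extended regular expression had unbalanced parentheses, or a basic regular expression had unbalanced `\(' and `\)'.
REG_EBRACE
The regular expression had unbalanced `\{' and `\}'.
REG_ERANGE
One of the endpoints in a range expression was invalid.
REG_ESPACE
regcomp ran out of memory.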
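These are the bit flags that you can use in the cflags argument when compiling a regular expression with regcomp.
REG_EXTENDED
Treat the pattern as an extended regular expression, rather than as a basic regular expression.
REG_ICASE
Ignore case when matching letters.
REG_NOSUB
Don't record how subexpressions match.
REG_NEWLINE
Treat a newline in string as dividing string into multiple lines, so that `$' can match before the newline and `^' can match after. Also, don't permit `.' or a nonmatching bracket list to match a newline. Otherwise, newline acts like any other ordinary character.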
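Since the cflags values are bit flags, they can be combined with bitwise OR. The sketch below, with the hypothetical name matches_icase, combines three of them for a case-insensitive yes-or-no test; it assumes <regex.h> as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `text` matches the extended regular
   expression `pattern` without regard to case, 0 if it does not match,
   and -1 if the pattern cannot be compiled. */
int matches_icase(const char *pattern, const char *text)
{
    regex_t compiled;
    int result;

    /* REG_NOSUB: only a yes/no answer is wanted, so no offsets. */
    if (regcomp(&compiled, pattern,
                REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    result = regexec(&compiled, text, 0, NULL, 0);
    regfree(&compiled);
    return result == 0 ? 1 : 0;
}
```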
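Once you have compiled a regular expression, you can match it against strings using regexec. A match anywhere inside the string counts as success, unless the regular expression contains anchor characters (`^' or `$').
regexec tries to match the compiled regular expression *compiled against string.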
regexec returns 0 if the regular expression matches; otherwise, it returns a nonzero value. See the table below for what nonzero values mean. You can use regerror to produce an error message string describing the reason for a nonzero value; see Regexp Cleanup.
The parameter string points to the text to search. When using regexec, string must be 0-terminated. When using regnexec, string must be len characters long.
regnexec is not a standard function; strictly POSIX programs should avoid using it.
The argument eflags is a word of bit flags that enable various options.
If you want to get information about what part of string actually matched the regular expression or its subexpressions, use the arguments matchptr and nmatch. Otherwise, pass 0 for nmatch, and NULL for matchptr. See Regexp Subexpressions.
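The function regexec accepts the following flags in the eflags argument:
REG_NOTBOL
Do not regard the beginning of the specified string as the beginning of a line; more generally, don't make any assumptions about what text might precede it.
REG_NOTEOL
Do not regard the end of the specified string as the end of a line; more generally, don't make any assumptions about what text might follow it.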
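One situation where REG_NOTBOL matters is a repeated search that restarts in the middle of a string. The following sketch counts successive matches and sets REG_NOTBOL on every call after the first, so that `^' cannot match at a restart point. The name count_matches is hypothetical, the loop deliberately stops at an empty match, and <regex.h> is assumed as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>

/* Hypothetical helper: count successive non-overlapping matches of the
   basic regular expression `pattern` in `text`.  Counting stops after
   an empty match so the loop always terminates.  Returns -1 if the
   pattern cannot be compiled. */
int count_matches(const char *pattern, const char *text)
{
    regex_t compiled;
    regmatch_t m;
    int count = 0;
    int eflags = 0;

    if (regcomp(&compiled, pattern, 0) != 0)
        return -1;

    while (regexec(&compiled, text, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so)
            break;               /* empty match: stop here */
        text += m.rm_eo;         /* resume just past this match */
        eflags = REG_NOTBOL;     /* text no longer starts a line */
    }

    regfree(&compiled);
    return count;
}
```

Without REG_NOTBOL, a pattern such as `^ab' would be counted once per restart rather than once for the real beginning of the string.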
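Here are the possible nonzero values that regexec can return:
REG_NOMATCH
The pattern didn't match the string. This isn't really an error.
REG_ESPACE
regexec ran out of memory.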
When regexec matches parenthetical subexpressions of pattern, it records which parts of string they match. It returns that information by storing the offsets into an array whose elements are structures of type regmatch_t. The first element of the array (index 0) records the part of the string that matched the entire regular expression. Each other element of the array records the beginning and end of the part that matched a single parenthetical subexpression.
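regmatch_t is the data type of the elements of the matchptr array that you pass to regexec. It contains two structure fields, as follows:
rm_so
The offset in string of the beginning of a substring. Add this value to string to get the address of that part.
rm_eo
The offset in string of the end of the substring, that is, of the first character after it.
regoff_t is an alias for another signed integer type. The fields of regmatch_t have type regoff_t.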
The regmatch_t elements correspond to subexpressions positionally; the element at index 1 records where the first subexpression matched, the element at index 2 records the second subexpression, and so on. The order of the subexpressions is the order in which they begin.
When you call regexec, you specify how long the matchptr array is, with the nmatch argument. This tells regexec how many elements to store. If the actual regular expression has more than nmatch subexpressions, then you won't get offset information about the rest of them. But this doesn't alter whether the pattern matches a particular string or not.
If you don't want regexec to return any information about where the subexpressions matched, you can either supply 0 for nmatch, or use the flag REG_NOSUB when you compile the pattern with regcomp.
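To illustrate how the offsets are used, the sketch below copies out the text matched by one subexpression. The name extract_group and the fixed limit of 10 array elements are assumptions made for the example, which also assumes <regex.h> as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: match the extended regular expression `pattern`
   against `string` and copy the part matched by subexpression `index`
   (0 means the entire match) into `out`.  Returns the number of
   characters copied, or -1 if there is no match, the subexpression was
   not used, or `out` is too small. */
int extract_group(const char *pattern, const char *string, size_t index,
                  char *out, size_t outsize)
{
    regex_t compiled;
    regmatch_t m[MAX_GROUPS];
    int len = -1;

    if (index >= MAX_GROUPS || regcomp(&compiled, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&compiled, string, MAX_GROUPS, m, 0) == 0
        && m[index].rm_so != -1) {
        size_t n = (size_t) (m[index].rm_eo - m[index].rm_so);
        if (n < outsize) {
            /* string + rm_so is the address of the matched part. */
            memcpy(out, string + m[index].rm_so, n);
            out[n] = '\0';
            len = (int) n;
        }
    }

    regfree(&compiled);
    return len;
}
```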
Complications in Subexpression Matching
Sometimes a subexpression matches a substring of no characters. This happens when `f\(o*\)' matches the string `fum'. (It really matches just the `f'.) In this case, both of the offsets identify the point in the string where the null substring was found. In this example, the offsets are both 1.
Sometimes the entire regular expression can match without using some of its subexpressions at all. For example, when `ba\(na\)*' matches the string `ba', the parenthetical subexpression is not used. When this happens, regexec stores -1 in both fields of the element for that subexpression.
Sometimes matching the entire regular expression can match a particular subexpression more than once. For example, when `ba\(na\)*' matches the string `bananana', the parenthetical subexpression matches three times. When this happens, regexec usually stores the offsets of the last part of the string that matched the subexpression. In the case of `bananana', these offsets are 6 and 8.
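The three cases described so far (an empty match, an unused subexpression, and a repeated subexpression) can be observed with a small sketch like this one. The helper name first_group_offsets is hypothetical, and <regex.h> is assumed as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>

/* Hypothetical helper: match the basic regular expression `pattern`
   against `string` and store the offsets recorded for the first
   parenthetical subexpression in *so and *eo.  Returns 0 on a match,
   or the nonzero value from regcomp or regexec otherwise. */
int first_group_offsets(const char *pattern, const char *string,
                        regoff_t *so, regoff_t *eo)
{
    regex_t compiled;
    regmatch_t m[2];   /* m[0]: entire match, m[1]: first subexpression */
    int err = regcomp(&compiled, pattern, 0);

    if (err != 0)
        return err;

    err = regexec(&compiled, string, 2, m, 0);
    if (err == 0) {
        *so = m[1].rm_so;
        *eo = m[1].rm_eo;
    }

    regfree(&compiled);
    return err;
}
```

Calling it with `f\(o*\)' and `fum' should report 1 and 1, with `ba\(na\)*' and `ba' should report -1 and -1, and with `ba\(na\)*' and `bananana' should report 6 and 8.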
But the last match is not always the one that is chosen. It's more accurate to say that the last opportunity to match is the one that takes precedence. What this means is that when one subexpression appears within another, then the results reported for the inner subexpression reflect whatever happened on the last match of the outer subexpression. For an example, consider `\(ba\(na\)*s \)*' matching the string `bananas bas '. The last time the inner expression actually matches is near the end of the first word. But it is considered again in the second word, and fails to match there. regexec reports nonuse of the ``na'' subexpression.
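Another place where this rule applies is when the regular expression contains an alternation inside a repeated subexpression. If the last repetition of the outer subexpression takes a branch that does not contain the ``na'' subexpression, then regexec reports nonuse of the ``na'' subexpression.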
POSIX Regexp Matching Cleanup
When you are finished using a compiled regular expression, you must free the storage it uses by calling regfree.
regfree frees all the storage that *compiled points to. This includes various internal fields of the regex_t structure that aren't documented in this manual.
regfree does not free the object *compiled itself.
You should always free the space in a regex_t structure with regfree before using the structure to compile another regular expression.
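A minimal sketch of reusing one regex_t object for several patterns follows, with regfree called between compilations. The function name count_matching_patterns is hypothetical, and <regex.h> is assumed as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return how many of the `n` basic regular
   expressions in `patterns` match `string`.  A single regex_t object
   is reused, so it is released with regfree before each new regcomp.
   Patterns that fail to compile are skipped. */
int count_matching_patterns(const char *const *patterns, size_t n,
                            const char *string)
{
    regex_t compiled;
    int hits = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (regcomp(&compiled, patterns[i], REG_NOSUB) != 0)
            continue;
        if (regexec(&compiled, string, 0, NULL, 0) == 0)
            hits++;
        regfree(&compiled);   /* free storage before compiling again */
    }
    return hits;
}
```

Note that regfree releases only the storage regcomp attached to the object; the regex_t variable itself lives on the stack here and needs no further cleanup.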
When regcomp or regexec reports an error, you can use the function regerror to turn it into an error message string.
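regerror produces an error message string for the error code errcode, and stores the string in length bytes of memory starting at buffer. For the compiled argument, supply the same compiled regular expression structure that regcomp or regexec was working with when it got the error. Alternatively, you can supply NULL for compiled; you will still get a meaningful error message, but it might not be as detailed.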
If the error message can't fit in length bytes (including a terminating null character), then regerror truncates it. The string that regerror stores is always null-terminated even if it has been truncated.
The return value of regerror is the minimum length needed to store the entire error message, including the terminating null character. If this is less than or equal to length, then the error message was not truncated, and you can use it. Otherwise, you should call regerror again with a larger buffer.
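The truncation check can be sketched as below for a caller that supplies its own fixed buffer. The helper name error_message_fits is hypothetical, and <regex.h> is assumed as the header (use `rxposix.h' with Rx).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: store the message for `errcode` in `buffer`
   (which holds `length` bytes) and return 1 if the whole message fit,
   or 0 if regerror had to truncate it.  `compiled` may be NULL. */
int error_message_fits(int errcode, const regex_t *compiled,
                       char *buffer, size_t length)
{
    /* regerror returns the size needed for the complete message,
       counting the terminating null character. */
    size_t needed = regerror(errcode, compiled, buffer, length);

    return needed <= length;
}
```

When the function returns 0, the buffer still holds a null-terminated prefix of the message, which may be good enough for a log line.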
Here is a function which uses regerror, but always dynamically allocates a buffer for the error message:
char *
get_regerror (int errcode, regex_t *compiled)
{
  /* A first call with no buffer returns the length needed. */
  size_t length = regerror (errcode, compiled, NULL, 0);
  char *buffer = xmalloc (length);
  (void) regerror (errcode, compiled, buffer, length);
  return buffer;
}