POSIX Entry Points

This section is excerpted from The GNU C Library reference manual by Sandra Loosemore with Richard M. Stallman, Roland McGrath, and Andrew Oram.

The GNU C library supports the standard POSIX.2 interface. Programs using this interface should include the header file `rxposix.h'.

POSIX Regular Expression Compilation

Before you can actually match a regular expression, you must compile it. This is not true compilation; it produces a special data structure, not machine instructions. But it is like ordinary compilation in that its purpose is to enable you to ``execute'' the pattern fast. (See Matching POSIX Regexps, for how to use the compiled regular expression for matching.)

There is a special data type for compiled regular expressions:

 - Data Type: regex_t
This type of object holds a compiled regular expression. It is actually a structure. It has just one field that your programs should look at:

re_nsub
This field holds the number of parenthetical subexpressions in the regular expression that was compiled.
There are several other fields, but we don't describe them here, because only the functions in the library should use them.

After you create a regex_t object, you can compile a regular expression into it by calling regcomp.

 - int regcomp (regex_t *compiled, const char *pattern, int cflags)
 - int regncomp (regex_t *compiled, const char *pattern, int len, int cflags)
The function regcomp ``compiles'' a regular expression into a data structure that you can use with regexec to match against a string. The compiled regular expression format is designed for efficient matching. regcomp stores it into *compiled.

The parameter pattern points to the regular expression to be compiled. When using regcomp, pattern must be null-terminated. When using regncomp, pattern must be len characters long.

regncomp is not a standard function; strictly POSIX programs should avoid using it.

It's up to you to allocate an object of type regex_t and pass its address to regcomp.

Before freeing an object of type regex_t, you must pass it to regfree. Not doing so may cause subsequent calls to Rx functions to behave strangely.

The argument cflags lets you specify various options that control the syntax and semantics of regular expressions. See Flags for POSIX Regexps.

If you use the flag REG_NOSUB, then regcomp omits from the compiled regular expression the information necessary to record how subexpressions actually match. In this case, you might as well pass 0 for the matchptr and nmatch arguments when you call regexec.

If you don't use REG_NOSUB, then the compiled regular expression does have the capacity to record how subexpressions match. Also, regcomp tells you how many subexpressions pattern has, by storing the number in compiled->re_nsub. You can use that value to decide how long an array to allocate to hold information about subexpression matches.
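
The compile step and the re_nsub field described above can be sketched as follows. This is only an illustration: it assumes the standard POSIX header <regex.h> (a program built against Rx would include `rxposix.h' instead), and the helper name count_subexpressions is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a basic regular expression and
   return its number of parenthetical subexpressions, or -1 if regcomp
   reports an error.  */
long
count_subexpressions (const char *pattern)
{
  regex_t compiled;             /* allocated by the caller, here on the stack */

  if (regcomp (&compiled, pattern, 0) != 0)
    return -1;

  long nsub = (long) compiled.re_nsub;
  regfree (&compiled);          /* release the library's internal storage */
  return nsub;
}
```

A caller that wants offsets for every subexpression would typically size its match array as this count plus one, since the first element describes the whole match.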

regcomp returns 0 if it succeeds in compiling the regular expression; otherwise, it returns a nonzero error code (see the table below). You can use regerror to produce an error message string describing the reason for a nonzero value; see Regexp Cleanup.

Here are the possible nonzero values that regcomp can return:

REG_BADBR
There was an invalid `\{...\}' construct in the regular expression. A valid `\{...\}' construct must contain either a single number, or two numbers in increasing order separated by a comma.

REG_BADPAT
There was a syntax error in the regular expression.

REG_BADRPT
A repetition operator such as `?' or `*' appeared in a bad position (with no preceding subexpression to act on).

REG_ECOLLATE
The regular expression referred to an invalid collating element (one not defined in the current locale for string collation).

REG_ECTYPE
The regular expression referred to an invalid character class name.

REG_EESCAPE
The regular expression ended with `\'.

REG_ESUBREG
There was an invalid number in the `\digit' construct.

REG_EBRACK
There were unbalanced square brackets in the regular expression.

REG_EPAREN
An extended regular expression had unbalanced parentheses, or a basic regular expression had unbalanced `\(' and `\)'.

REG_EBRACE
The regular expression had unbalanced `\{' and `\}'.

REG_ERANGE
One of the endpoints in a range expression was invalid.

REG_ESPACE
regcomp ran out of memory.
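
As a rough sketch of how a program might observe these codes, the hypothetical helper below simply hands back whatever regcomp returns. It assumes the standard <regex.h> header. Which code a particular malformed pattern produces can differ between implementations, so the sketch only relies on the distinction between zero and nonzero.

```c
#include <regex.h>

/* Hypothetical helper: try to compile PATTERN as a basic regular
   expression and return regcomp's result: 0 on success, otherwise one of
   the REG_* codes listed above.  */
int
try_compile (const char *pattern)
{
  regex_t compiled;
  int status = regcomp (&compiled, pattern, 0);

  if (status == 0)
    regfree (&compiled);        /* free the storage of a successful compile */
  return status;
}
```

A caller can compare the result against specific codes such as REG_EBRACK or REG_EPAREN when it wants to report a more precise diagnosis.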

Flags for POSIX Regular Expressions

These are the bit flags that you can use in the cflags operand when compiling a regular expression with regcomp.

REG_EXTENDED
Treat the pattern as an extended regular expression, rather than as a basic regular expression.

REG_ICASE
Ignore case when matching letters.

REG_NOSUB
Don't bother storing the contents of the matchptr array.

REG_NEWLINE
Treat a newline in string as dividing string into multiple lines, so that `$' can match before the newline and `^' can match after. Also, don't permit `.' to match a newline, and don't permit `[^...]' to match a newline.

Otherwise, newline acts like any other ordinary character.
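
The effect of these flags can be explored with a small sketch like the one below. It assumes the standard <regex.h> header, and matches_with_flags is a hypothetical helper, not part of the library. Flags are combined with bitwise OR.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if PATTERN, compiled with CFLAGS, matches
   somewhere in TEXT, 0 if regexec reports anything else, and -1 if the
   pattern fails to compile.  */
int
matches_with_flags (const char *pattern, int cflags, const char *text)
{
  regex_t compiled;

  /* REG_NOSUB is added because no subexpression offsets are requested.  */
  if (regcomp (&compiled, pattern, cflags | REG_NOSUB) != 0)
    return -1;

  int status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

For instance, matches_with_flags ("^two$", REG_NEWLINE, "one\ntwo\nthree") reports a match because `^' and `$' may match around the embedded newlines, while the same call with 0 for the flags does not.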

Matching a Compiled POSIX Regular Expression

Once you have compiled a regular expression, as described in POSIX Regexp Compilation, you can match it against strings using regexec. A match anywhere inside the string counts as success, unless the regular expression contains anchor characters (`^' or `$').

 - int regexec (regex_t *compiled, char *string, size_t nmatch, regmatch_t matchptr [], int eflags)
 - int regnexec (regex_t *compiled, char *string, int len, size_t nmatch, regmatch_t matchptr [], int eflags)
This function tries to match the compiled regular expression *compiled against string.

regexec returns 0 if the regular expression matches; otherwise, it returns a nonzero value. See the table below for what nonzero values mean. You can use regerror to produce an error message string describing the reason for a nonzero value; see Regexp Cleanup.

The parameter string points to the text to search. When using regexec, string must be null-terminated. When using regnexec, string must be len characters long.

regnexec is not a standard function; strictly POSIX programs should avoid using it.

The argument eflags is a word of bit flags that enable various options.

If you want to get information about what part of string actually matched the regular expression or its subexpressions, use the arguments matchptr and nmatch. Otherwise, pass 0 for nmatch, and NULL for matchptr. See Regexp Subexpressions.
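
When only a yes-or-no answer is needed, the call can look like the sketch below, which passes 0 for nmatch and NULL for matchptr. It assumes the standard <regex.h> header, and is_match is a hypothetical helper. Taking an already compiled expression lets one compilation serve many strings.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the already compiled expression
   matches anywhere in STRING, and 0 otherwise.  No offset information
   is requested, so nmatch is 0 and matchptr is NULL.  */
int
is_match (const regex_t *compiled, const char *string)
{
  return regexec (compiled, string, 0, NULL, 0) == 0;
}
```

A caller would compile the pattern once with regcomp, call is_match for each candidate string, and call regfree when finished.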

You must match the regular expression with the same set of current locales that were in effect when you compiled the regular expression.

The function regexec accepts the following flags in the eflags argument:

REG_NOTBOL
Do not regard the beginning of the specified string as the beginning of a line; more generally, don't make any assumptions about what text might precede it.

REG_NOTEOL
Do not regard the end of the specified string as the end of a line; more generally, don't make any assumptions about what text might follow it.

Here are the possible nonzero values that regexec can return:

REG_NOMATCH
The pattern didn't match the string. This isn't really an error.

REG_ESPACE
regexec ran out of memory.
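
The eflags options and the return values above can be seen together in a sketch such as this one. It assumes the standard <regex.h> header, and match_status is a hypothetical helper that passes the result of regcomp or regexec straight back to the caller.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a basic regular expression,
   match it against STRING with the given EFLAGS, and return the status:
   0 for a match, REG_NOMATCH for no match, or another nonzero code if
   regcomp or regexec reported an error.  */
int
match_status (const char *pattern, const char *string, int eflags)
{
  regex_t compiled;
  int status = regcomp (&compiled, pattern, REG_NOSUB);

  if (status != 0)
    return status;

  status = regexec (&compiled, string, 0, NULL, eflags);
  regfree (&compiled);
  return status;
}
```

With the pattern `^foo' and the string `foobar', passing 0 for eflags yields a match, while passing REG_NOTBOL yields REG_NOMATCH, because the start of the string is no longer treated as the start of a line.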

Match Results with Subexpressions

When regexec matches parenthetical subexpressions of pattern, it records which parts of string they match. It returns that information by storing the offsets into an array whose elements are structures of type regmatch_t. The first element of the array (index 0) records the part of the string that matched the entire regular expression. Each other element of the array records the beginning and end of the part that matched a single parenthetical subexpression.

 - Data Type: regmatch_t
This is the data type of the matchptr array that you pass to regexec. It contains two structure fields, as follows:

rm_so
The offset in string of the beginning of a substring. Add this value to string to get the address of that part.

rm_eo
The offset in string of the end of the substring.

 - Data Type: regoff_t
regoff_t is an alias for another signed integer type. The fields of regmatch_t have type regoff_t.

The regmatch_t elements correspond to subexpressions positionally: the element at index 1 records where the first subexpression matched, the element at index 2 records the second subexpression, and so on. The order of the subexpressions is the order in which they begin.

When you call regexec, you specify how long the matchptr array is, with the nmatch argument. This tells regexec how many elements to store. If the actual regular expression has more than nmatch subexpressions, then you won't get offset information about the rest of them. But this doesn't alter whether the pattern matches a particular string or not.

If you don't want regexec to return any information about where the subexpressions matched, you can either supply 0 for nmatch, or use the flag REG_NOSUB when you compile the pattern with regcomp.
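
One way to use the offsets is sketched below: the array is sized from re_nsub, and the matched text is copied out by adding rm_so to the string. The sketch assumes the standard <regex.h> header. The helper extract_subexpression is hypothetical, and its choice to return NULL for every kind of failure is a simplification.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match the basic regular expression PATTERN against
   STRING and return a newly allocated copy of the text matched by
   subexpression INDEX (0 means the whole match).  Returns NULL if the
   pattern is invalid, nothing matches, INDEX is out of range, the
   subexpression was not used, or memory runs out.  */
char *
extract_subexpression (const char *pattern, const char *string, size_t index)
{
  regex_t compiled;
  char *result = NULL;

  if (regcomp (&compiled, pattern, 0) != 0)
    return NULL;

  /* Element 0 covers the whole match, so the array needs re_nsub + 1
     elements to hold every subexpression.  */
  size_t nmatch = compiled.re_nsub + 1;
  regmatch_t *matchptr = malloc (nmatch * sizeof *matchptr);

  if (matchptr != NULL
      && index < nmatch
      && regexec (&compiled, string, nmatch, matchptr, 0) == 0
      && matchptr[index].rm_so != -1)
    {
      size_t length = (size_t) (matchptr[index].rm_eo
                                - matchptr[index].rm_so);
      result = malloc (length + 1);
      if (result != NULL)
        {
          /* string + rm_so is the address where the matched part begins.  */
          memcpy (result, string + matchptr[index].rm_so, length);
          result[length] = '\0';
        }
    }

  free (matchptr);
  regfree (&compiled);
  return result;
}
```

The caller owns the returned string and releases it with free.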

Complications in Subexpression Matching

Sometimes a subexpression matches a substring of no characters. This happens when `f\(o*\)' matches the string `fum'. (It really matches just the `f'.) In this case, both of the offsets identify the point in the string where the null substring was found. In this example, the offsets are both 1.

Sometimes the entire regular expression can match without using some of its subexpressions at all. For example, when `ba\(na\)*' matches the string `ba', the parenthetical subexpression is not used. When this happens, regexec stores -1 in both fields of the element for that subexpression.

Sometimes matching the entire regular expression can match a particular subexpression more than once. For example, when `ba\(na\)*' matches the string `bananana', the parenthetical subexpression matches three times. When this happens, regexec usually stores the offsets of the last part of the string that matched the subexpression. In the case of `bananana', these offsets are 6 and 8.
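
The three cases described so far can be checked with a sketch like the following, which reports the offsets recorded for the first subexpression. It assumes the standard <regex.h> header, and first_subexpression_offsets is a hypothetical helper.

```c
#include <regex.h>

/* Hypothetical helper: match PATTERN (a basic regular expression with one
   parenthetical subexpression) against STRING and store the offsets
   recorded for that subexpression in *START and *END.  Returns the
   status from regcomp or regexec; the outputs are set only when it is 0.  */
int
first_subexpression_offsets (const char *pattern, const char *string,
                             long *start, long *end)
{
  regex_t compiled;
  regmatch_t matchptr[2];       /* [0] whole match, [1] first subexpression */
  int status = regcomp (&compiled, pattern, 0);

  if (status != 0)
    return status;

  status = regexec (&compiled, string, 2, matchptr, 0);
  if (status == 0)
    {
      *start = (long) matchptr[1].rm_so;
      *end = (long) matchptr[1].rm_eo;
    }
  regfree (&compiled);
  return status;
}
```

Following the text above, `f\(o*\)' against `fum' should give 1 and 1, `ba\(na\)*' against `ba' should give -1 and -1, and `ba\(na\)*' against `bananana' should give 6 and 8.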

But the last match is not always the one that is chosen. It's more accurate to say that the last opportunity to match is the one that takes precedence. What this means is that when one subexpression appears within another, then the results reported for the inner subexpression reflect whatever happened on the last match of the outer subexpression. For an example, consider `\(ba\(na\)*s \)*' matching the string `bananas bas '. The last time the inner expression actually matches is near the end of the first word. But it is considered again in the second word, and fails to match there. regexec reports nonuse of the ``na'' subexpression.

Another place where this rule applies is when the regular expression `\(ba\(na\)*s \|nefer\(ti\)* \)*' matches `bananas nefertiti'. The ``na'' subexpression does match in the first word, but it doesn't match in the second word because the other alternative is used there. Once again, the second repetition of the outer subexpression overrides the first, and within that second repetition, the ``na'' subexpression is not used. So regexec reports nonuse of the ``na'' subexpression.

POSIX Regexp Matching Cleanup

When you are finished using a compiled regular expression, you must free the storage it uses by calling regfree.

 - void regfree (regex_t *compiled)
Calling regfree frees all the storage that *compiled points to. This includes various internal fields of the regex_t structure that aren't documented in this manual.

regfree does not free the object *compiled itself.

You should always free the space in a regex_t structure with regfree before using the structure to compile another regular expression.
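
A sketch of reusing one regex_t object for several patterns, with regfree called between compilations, might look like this. It assumes the standard <regex.h> header, and count_matching_patterns is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count how many of the COUNT basic regular
   expressions in PATTERNS match STRING, reusing one regex_t object.  */
size_t
count_matching_patterns (const char *const patterns[], size_t count,
                         const char *string)
{
  regex_t compiled;             /* one object, reused for every pattern */
  size_t matched = 0;

  for (size_t i = 0; i < count; i++)
    {
      if (regcomp (&compiled, patterns[i], REG_NOSUB) != 0)
        continue;               /* skip patterns that fail to compile */
      if (regexec (&compiled, string, 0, NULL, 0) == 0)
        matched++;
      /* Free the storage before the next regcomp reuses the structure.  */
      regfree (&compiled);
    }
  return matched;
}
```

The regex_t object itself lives on the stack here, so nothing beyond regfree is needed; an object obtained from malloc would still have to be released with free afterwards.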

When regcomp or regexec reports an error, you can use the function regerror to turn it into an error message string.

 - size_t regerror (int errcode, regex_t *compiled, char *buffer, size_t length)
This function produces an error message string for the error code errcode, and stores the string in length bytes of memory starting at buffer. For the compiled argument, supply the same compiled regular expression structure that regcomp or regexec was working with when it got the error. Alternatively, you can supply NULL for compiled; you will still get a meaningful error message, but it might not be as detailed.

If the error message can't fit in length bytes (including a terminating null character), then regerror truncates it. The string that regerror stores is always null-terminated even if it has been truncated.

The return value of regerror is the minimum length needed to store the entire error message, including the terminating null character. If this is no greater than length, then the error message was not truncated, and you can use it. Otherwise, you should call regerror again with a larger buffer.
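
For a caller that prefers a fixed-size buffer, the truncation check can be sketched as below. It assumes the standard <regex.h> header, where the value returned by regerror counts the terminating null character, and format_regerror is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: write the message for ERRCODE into BUFFER, which
   holds LENGTH bytes.  Returns 1 if the whole message fit and 0 if
   regerror had to truncate it.  */
int
format_regerror (int errcode, const regex_t *compiled,
                 char *buffer, size_t length)
{
  /* The size needed includes the terminating null character, so the
     message fit if that size does not exceed LENGTH.  */
  size_t needed = regerror (errcode, compiled, buffer, length);
  return needed <= length;
}
```

Even when the helper returns 0, the buffer still holds a null-terminated prefix of the message, which is often good enough for a log line.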

Here is a function which uses regerror, but always dynamically allocates a buffer for the error message:

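/* Return a newly allocated message for errcode; the caller frees it.
   xmalloc is assumed to be a malloc wrapper that exits on failure.  */
char *get_regerror (int errcode, regex_t *compiled)
{
  /* With a zero-length buffer, regerror only reports the size needed.  */
  size_t length = regerror (errcode, compiled, NULL, 0);
  char *buffer = xmalloc (length);
  (void) regerror (errcode, compiled, buffer, length);
  return buffer;
}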