.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.Nd regular-expression library
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fn regfree "regex_t *preg"
These routines implement
compiles an RE written as a string into an internal form,
matches that internal form against a string and reports results,
transforms error codes from either into human-readable messages,
frees any dynamically-allocated storage used by the internal form
declares two structure types,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
and a number of constants with names starting with
compiles the regular expression contained in the
subject to the flags in
and places the results in the
structure pointed to by
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
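.Pp
The effect of the case and newline flags described above can be
sketched as follows.
The helper
.Fn match_with
and the array
.Va two_lines
are hypothetical names used only for illustration, and the newline is
written as character code 10, which assumes an ASCII-compatible
character set.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` as an ERE with the extra
 * `cflags`, match it against `text`, and return the regexec() result
 * (0 on a match, REG_NOMATCH otherwise), or -1 if compilation fails.
 */
static int
match_with(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is wanted here. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc;
}

/* Two lines, "ab" and "cd", joined by a newline (code 10 in ASCII). */
static const char two_lines[] = { 'a', 'b', 10, 'c', 'd', 0 };
```

Under these assumptions,
.Ql HELLO
matches
.Ql hello
only when
.Dv REG_ICASE
is given, and
.Ql ^cd
matches
.Va two_lines
only when
.Dv REG_NEWLINE
is given.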
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
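.Pp
A minimal sketch of the compile, match, free cycle described above.
The helper name
.Fn re_matches
is hypothetical and not part of the library; it always compiles with
.Dv REG_EXTENDED
and
.Dv REG_NOSUB ,
which is an assumption made to keep the example short.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): returns 1 if the
 * extended RE `pattern` matches somewhere in `text`, 0 if it does not,
 * and -1 if the pattern fails to compile or regexec() reports an error.
 */
static int
re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is needed, not offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);		/* release storage held by the compiled RE */
	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```

A program that matches many strings against one pattern would normally
compile once and call
.Fn regexec
repeatedly, rather than recompiling on every call as this sketch does.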
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of the string is treated as the continuation
This means that the anchors
do not match before it; but see
This does not affect the behavior of newlines under
does not end a line, so the
anchor does not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
is considered the beginning of a line, such that
matches before it, and the beginning of a word if there is a word
character at this position, such that
the character at position
is treated as the continuation of a line, and if
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
match before the string.
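.Pp
The
.Dv REG_STARTEND
behavior described above can be sketched as a search over a slice of a
larger buffer.
The helper
.Fn search_slice
is a hypothetical name.
Because the flag is an extension, the sketch assumes it is visible as a
preprocessor macro and otherwise falls back to copying the slice; the
fallback is only an approximation, since it treats the start of the
slice as the start of the string for anchoring purposes.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: search bytes [start, end) of `buf` for the
 * compiled RE and report the match in `m`, with offsets measured from
 * the beginning of `buf`.  Returns the regexec() result.
 */
static int
search_slice(const regex_t *re, const char *buf, size_t start, size_t end,
    regmatch_t *m)
{
#ifdef REG_STARTEND
	/* pmatch[0] carries the slice bounds in and the match out. */
	m->rm_so = (regoff_t)start;
	m->rm_eo = (regoff_t)end;
	return regexec(re, buf, 1, m, REG_STARTEND);
#else
	/* Assumed fallback without the extension: copy the slice. */
	char *tmp;
	int rc;

	tmp = malloc(end - start + 1);
	if (tmp == NULL)
		return REG_ESPACE;
	memcpy(tmp, buf + start, end - start);
	tmp[end - start] = 0;
	rc = regexec(re, tmp, 1, m, 0);
	free(tmp);
	if (rc == 0) {
		/* Convert slice-relative offsets to buffer offsets. */
		m->rm_so += (regoff_t)start;
		m->rm_eo += (regoff_t)start;
	}
	return rc;
#endif
}
```

Calling the helper again with
.Fa start
set to the previous
.Va rm_eo
is one way to walk successive matches without copying the text.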
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
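.Pp
The offset reporting described above can be sketched with a pattern
that has two parenthesized subexpressions.
The helper
.Fn split_address
and its fixed pattern are hypothetical, chosen only so that entries 0,
1 and 2 of the array are all used.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match an ERE with two parenthesized
 * subexpressions against `text` and store the three resulting
 * regmatch_t entries in `out`.  Returns the regexec() result,
 * or -1 if the pattern does not compile.
 */
static int
split_address(const char *text, regmatch_t out[3])
{
	regex_t re;
	int rc;

	if (regcomp(&re, "([a-z]+)@([a-z]+)", REG_EXTENDED) != 0)
		return -1;
	/* re.re_nsub is 2 here, so entries 0 through 2 are meaningful. */
	rc = regexec(&re, text, 3, out, 0);
	regfree(&re);
	return rc;
}
```

For the text
.Ql "mail bob@example now" ,
entry 0 covers offsets 5 to 16, entry 1 covers 5 to 8, and entry 2
covers 9 to 16; each
.Va rm_eo
is the offset of the first character after the substring.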
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
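.Pp
The size-query behavior described above allows a two-step pattern:
ask for the required size with a zero-length buffer, then allocate and
fetch the message.
The helper
.Fn error_text
is a hypothetical name, and returning a heap-allocated string is a
design choice of this sketch rather than anything the library requires.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * `errcode`, sized by first asking regerror() how much room it needs.
 * The caller frees the result; NULL is returned if malloc() fails.
 */
static char *
error_text(int errcode, const regex_t *preg)
{
	size_t need;
	char *buf;

	/* With errbuf_size 0, errbuf is ignored; only the size comes back. */
	need = regerror(errcode, preg, NULL, 0);
	buf = malloc(need);
	if (buf != NULL)
		(void)regerror(errcode, preg, buf, need);
	return buf;
}
```

A caller would typically pass the non-zero value returned by a failed
.Fn regcomp
together with the same
.Vt regex_t ,
print the message, and then free it.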
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
cannot happen - you found a bug
invalid argument, e.g.\& negative-length string
illegal byte sequence (bad multibyte character)
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
Originally written by
Altered for inclusion in the
This is an alpha release with known defects.
Please report problems.
The back-reference code is subtle and doubts linger about its correctness
This will improve with later releases.
exceeding 0 is expensive;
exceeding 1 is worse.
is largely insensitive to RE complexity
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
There are suspected problems with response to obscure error conditions.
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
are legal REs because
a special character only in the presence of a previous unmatched
This cannot be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
Word-boundary matching does not work properly in multibyte locales.