.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"     The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)regex.3     8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.TH REGEX 3 "March 20, 1994"
.de ZR
.\" one other place knows this name:  the SEE ALSO section
.IR re_format (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int\ regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
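.PP
As a minimal sketch of the compile step described above,
the following C fragment compiles an extended RE and reads back
.IR re_nsub .
The helper name
.I count_groups
is hypothetical and is not part of this library;
REG_NOSUB is deliberately not used, since
.I re_nsub
would then be undefined.
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>

/*
 * count_groups: hypothetical helper, not a library function.
 * Compiles pattern as an extended RE and returns the number of
 * parenthesized subexpressions, or -1 if regcomp reports an error.
 */
long
count_groups(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* compilation failed */
    n = (long)re.re_nsub;       /* filled in by a successful regcomp */
    regfree(&re);               /* release the compiled form */
    return n;
}
```
.fi
.PP
Under these assumptions, a call such as count_groups("(a)(b(c))")
would be expected to return 3.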
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.I pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
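.PP
The effect of REG_NOTBOL and REG_NOTEOL on anchors can be illustrated
with a small sketch.
The helper name
.I matches
is hypothetical; it compiles with REG_NOSUB because only success or
failure is wanted, and passes the caller's
.I eflags
straight through to
.IR regexec .
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * matches: hypothetical helper, not a library function.
 * Returns 1 if the extended RE pattern matches text under eflags,
 * 0 if regexec does not report a match, -1 if the RE does not compile.
 */
int
matches(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    /* nmatch is 0, so the pmatch argument is ignored */
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc == 0;
}
```
.fi
.PP
Under these assumptions, matches("^abc", "abcdef", 0) reports a match,
while the same call with REG_NOTBOL does not, because `^' may no longer
match before the first character.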
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
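.PP
The offset conventions above can be sketched as follows.
The helper name
.I capture
and the array size NSUB are hypothetical choices made for illustration;
the fragment copies the text reported for one
.I pmatch
entry and treats an
.I rm_so
of \-1 as ``did not participate''.
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

enum { NSUB = 10 };     /* arbitrary size chosen for this sketch */

/*
 * capture: hypothetical helper, not a library function.
 * Copies the substring reported in pmatch[idx] into out.
 * Returns 0 on success; -1 if the RE does not compile, does not
 * match, the entry is unused, or out is too small.
 */
int
capture(const char *pattern, const char *text, size_t idx,
    char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[NSUB];
    int rc = -1;

    if (idx >= NSUB || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, NSUB, pm, 0) == 0 && pm[idx].rm_so != -1) {
        /* rm_eo is the offset just past the end of the substring */
        size_t len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

        if (len < outsize) {
            memcpy(out, text + pm[idx].rm_so, len);
            out[len] = 0;       /* terminating NUL */
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```
.fi
.PP
Under these assumptions, asking for entry 2 of `([a-z]+)=([0-9]+)'
against `key=42' would be expected to yield `42',
and entry 0 the whole of `key=42'.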
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
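.PP
A sketch of REG_STARTEND used for input only, with
.I nmatch
set to 0, might look like the following.
The helper name
.I match_range
is hypothetical, and because REG_STARTEND is an extension the fragment
assumes it is visible as a preprocessor macro and otherwise reports an
error.
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * match_range: hypothetical helper, not a library function.
 * Tests whether the extended RE pattern matches within the bytes
 * buf[start] .. buf[end - 1].  Returns 1 on a match, 0 on no match,
 * -1 on any error or if REG_STARTEND is not available.
 */
int
match_range(const char *pattern, const char *buf, size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];   /* required for input even though nmatch is 0 */
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    rc = regexec(&re, buf, 0, pm, REG_STARTEND);
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
#else
    (void)pattern;
    (void)buf;
    (void)start;
    (void)end;
    return -1;          /* extension not provided by this system */
#endif
}
```
.fi
.PP
Since no NUL is needed at the end offset,
a helper of this kind can search a slice of a larger buffer without
copying it.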
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
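.PP
Because the return value is the full size needed,
one plausible way to avoid truncated messages is to call
.I regerror
twice: once with an
.I errbuf_size
of 0 to learn the size, and once to fill a buffer of that size.
The helper name
.I compile_error
below is hypothetical.
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * compile_error: hypothetical helper, not a library function.
 * Returns a malloc'd message describing why pattern fails to compile
 * as an extended RE, or NULL if it compiles or memory runs out.
 * The caller frees the result.
 */
char *
compile_error(const char *pattern)
{
    regex_t re;
    char *msg;
    size_t need;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;            /* no error to report */
    }
    /* errbuf_size 0: errbuf is ignored, the size is still returned */
    need = regerror(rc, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(rc, &re, msg, need);
    return msg;
}
```
.fi
.PP
The exact wording of the message is left to the implementation,
so callers of such a helper should not compare it against fixed text.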
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH     regexec() failed to match
REG_BADPAT      invalid regular expression
REG_ECOLLATE    invalid collating element
REG_ECTYPE      invalid character class
REG_EESCAPE     \e applied to unescapable character
REG_ESUBREG     invalid backreference number
REG_EBRACK      brackets [ ] not balanced
REG_EPAREN      parentheses ( ) not balanced
REG_EBRACE      braces { } not balanced
REG_BADBR       invalid repetition count(s) in { }
REG_ERANGE      invalid character range in [ ]
REG_ESPACE      ran out of memory
REG_BADRPT      ?, *, or + operand invalid
REG_EMPTY       empty (sub)expression
REG_ASSERT      ``can't happen''\(emyou found a bug
REG_INVARG      invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Originally written by Henry Spencer.
Altered for inclusion in the
4.4BSD
distribution.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.