contrib/nvi/regex/regex.3

   1 .\"     $NetBSD: regex.3,v 1.1.1.2 2008/05/18 14:31:38 aymeric Exp $
   2 .\"
   3 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
   4 .\" Copyright (c) 1992, 1993, 1994
   5 .\"     The Regents of the University of California.  All rights reserved.
   6 .\"
   7 .\" This code is derived from software contributed to Berkeley by
   8 .\" Henry Spencer of the University of Toronto.
   9 .\"
  10 .\" Redistribution and use in source and binary forms, with or without
  11 .\" modification, are permitted provided that the following conditions
  12 .\" are met:
  13 .\" 1. Redistributions of source code must retain the above copyright
  14 .\"    notice, this list of conditions and the following disclaimer.
  15 .\" 2. Redistributions in binary form must reproduce the above copyright
  16 .\"    notice, this list of conditions and the following disclaimer in the
  17 .\"    documentation and/or other materials provided with the distribution.
  18 .\" 3. Neither the name of the University nor the names of its contributors
  19 .\"    may be used to endorse or promote products derived from this software
  20 .\"    without specific prior written permission.
  21 .\"
  22 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  23 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  24 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  25 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  26 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  27 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  28 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  29 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  30 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  31 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  32 .\" SUCH DAMAGE.
  33 .\"
  34 .\"     @(#)regex.3     8.2 (Berkeley) 3/16/94
  35 .\"
  36 .TH REGEX 3 "March 16, 1994"
  37 .de ZR
  38 .\" one other place knows this name:  the SEE ALSO section
  39 .IR re_format (7) \\$1
  40 ..
  41 .SH NAME
  42 regcomp, regexec, regerror, regfree \- regular-expression library
  43 .SH SYNOPSIS
  44 .ft B
  45 .\".na
  46 #include <sys/types.h>
  47 .br
  48 #include <regex.h>
  49 .HP 10
  50 int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
  51 .HP
  52 int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
  53 size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
  54 .HP
  55 size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
  56 char\ *errbuf, size_t\ errbuf_size);
  57 .HP
  58 void\ regfree(regex_t\ *preg);
  59 .\".ad
  60 .ft
  61 .SH DESCRIPTION
  62 These routines implement POSIX 1003.2 regular expressions (``RE''s);
  63 see
  64 .ZR .
  65 .I Regcomp
  66 compiles an RE written as a string into an internal form,
  67 .I regexec
  68 matches that internal form against a string and reports results,
  69 .I regerror
  70 transforms error codes from either into human-readable messages,
  71 and
  72 .I regfree
  73 frees any dynamically-allocated storage used by the internal form
  74 of an RE.
  75 .PP
  76 The header
  77 .I <regex.h>
  78 declares two structure types,
  79 .I regex_t
  80 and
  81 .IR regmatch_t ,
  82 the former for compiled internal forms and the latter for match reporting.
  83 It also declares the four functions,
  84 a type
  85 .IR regoff_t ,
  86 and a number of constants with names starting with ``REG_''.
  87 .PP
  88 .I Regcomp
  89 compiles the regular expression contained in the
  90 .I pattern
  91 string,
  92 subject to the flags in
  93 .IR cflags ,
  94 and places the results in the
  95 .I regex_t
  96 structure pointed to by
  97 .IR preg .
  98 .I Cflags
  99 is the bitwise OR of zero or more of the following flags:
 100 .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
 101 Compile modern (``extended'') REs,
 102 rather than the obsolete (``basic'') REs that
 103 are the default.
 104 .IP REG_BASIC
 105 This is a synonym for 0,
 106 provided as a counterpart to REG_EXTENDED to improve readability.
 107 .IP REG_NOSPEC
 108 Compile with recognition of all special characters turned off.
 109 All characters are thus considered ordinary,
 110 so the ``RE'' is a literal string.
 111 This is an extension,
 112 compatible with but not specified by POSIX 1003.2,
 113 and should be used with
 114 caution in software intended to be portable to other systems.
 115 REG_EXTENDED and REG_NOSPEC may not be used
 116 in the same call to
 117 .IR regcomp .
 118 .IP REG_ICASE
 119 Compile for matching that ignores upper/lower case distinctions.
 120 See
 121 .ZR .
 122 .IP REG_NOSUB
 123 Compile for matching that need only report success or failure,
 124 not what was matched.
 125 .IP REG_NEWLINE
 126 Compile for newline-sensitive matching.
 127 By default, newline is a completely ordinary character with no special
 128 meaning in either REs or strings.
 129 With this flag,
 130 `[^' bracket expressions and `.' never match newline,
 131 a `^' anchor matches the null string after any newline in the string
 132 in addition to its normal function,
 133 and the `$' anchor matches the null string before any newline in the
 134 string in addition to its normal function.
 135 .IP REG_PEND
 136 The regular expression ends,
 137 not at the first NUL,
 138 but just before the character pointed to by the
 139 .I re_endp
 140 member of the structure pointed to by
 141 .IR preg .
 142 The
 143 .I re_endp
 144 member is of type
 145 .IR const\ char\ * .
 146 This flag permits inclusion of NULs in the RE;
 147 they are considered ordinary characters.
 148 This is an extension,
 149 compatible with but not specified by POSIX 1003.2,
 150 and should be used with
 151 caution in software intended to be portable to other systems.
 152 .PP
 153 When successful,
 154 .I regcomp
 155 returns 0 and fills in the structure pointed to by
 156 .IR preg .
 157 One member of that structure
 158 (other than
 159 .IR re_endp )
 160 is publicized:
 161 .IR re_nsub ,
 162 of type
 163 .IR size_t ,
 164 contains the number of parenthesized subexpressions within the RE
 165 (except that the value of this member is undefined if the
 166 REG_NOSUB flag was used).
 167 If
 168 .I regcomp
 169 fails, it returns a non-zero error code;
 170 see DIAGNOSTICS.
 171 .PP
 172 .I Regexec
 173 matches the compiled RE pointed to by
 174 .I preg
 175 against the
 176 .IR string ,
 177 subject to the flags in
 178 .IR eflags ,
 179 and reports results using
 180 .IR nmatch ,
 181 .IR pmatch ,
 182 and the returned value.
 183 The RE must have been compiled by a previous invocation of
 184 .IR regcomp .
 185 The compiled form is not altered during execution of
 186 .IR regexec ,
 187 so a single compiled RE can be used simultaneously by multiple threads.
 188 .PP
 189 By default,
 190 the NUL-terminated string pointed to by
 191 .I string
 192 is considered to be the text of an entire line, minus any terminating
 193 newline.
 194 The
 195 .I eflags
 196 argument is the bitwise OR of zero or more of the following flags:
 197 .IP REG_NOTBOL \w'REG_STARTEND'u+2n
 198 The first character of
 199 the string
 200 is not the beginning of a line, so the `^' anchor should not match before it.
 201 This does not affect the behavior of newlines under REG_NEWLINE.
 202 .IP REG_NOTEOL
 203 The NUL terminating
 204 the string
 205 does not end a line, so the `$' anchor should not match before it.
 206 This does not affect the behavior of newlines under REG_NEWLINE.
 207 .IP REG_STARTEND
 208 The string is considered to start at
 209 \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
 210 and to have a terminating NUL located at
 211 \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
 212 (there need not actually be a NUL at that location),
 213 regardless of the value of
 214 .IR nmatch .
 215 See below for the definition of
  216 .I pmatch
 217 and
 218 .IR nmatch .
 219 This is an extension,
 220 compatible with but not specified by POSIX 1003.2,
 221 and should be used with
 222 caution in software intended to be portable to other systems.
 223 Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
 224 REG_STARTEND affects only the location of the string,
 225 not how it is matched.
 226 .PP
 227 See
 228 .ZR
 229 for a discussion of what is matched in situations where an RE or a
 230 portion thereof could match any of several substrings of
 231 .IR string .
 232 .PP
 233 Normally,
 234 .I regexec
 235 returns 0 for success and the non-zero code REG_NOMATCH for failure.
 236 Other non-zero error codes may be returned in exceptional situations;
 237 see DIAGNOSTICS.
 238 .PP
 239 If REG_NOSUB was specified in the compilation of the RE,
 240 or if
 241 .I nmatch
 242 is 0,
 243 .I regexec
 244 ignores the
 245 .I pmatch
 246 argument (but see below for the case where REG_STARTEND is specified).
 247 Otherwise,
 248 .I pmatch
 249 points to an array of
 250 .I nmatch
 251 structures of type
 252 .IR regmatch_t .
 253 Such a structure has at least the members
 254 .I rm_so
 255 and
 256 .IR rm_eo ,
 257 both of type
 258 .I regoff_t
 259 (a signed arithmetic type at least as large as an
 260 .I off_t
 261 and a
 262 .IR ssize_t ),
 263 containing respectively the offset of the first character of a substring
 264 and the offset of the first character after the end of the substring.
 265 Offsets are measured from the beginning of the
 266 .I string
 267 argument given to
 268 .IR regexec .
 269 An empty substring is denoted by equal offsets,
 270 both indicating the character following the empty substring.
 271 .PP
 272 The 0th member of the
 273 .I pmatch
 274 array is filled in to indicate what substring of
 275 .I string
 276 was matched by the entire RE.
 277 Remaining members report what substring was matched by parenthesized
 278 subexpressions within the RE;
 279 member
 280 .I i
 281 reports subexpression
 282 .IR i ,
 283 with subexpressions counted (starting at 1) by the order of their opening
 284 parentheses in the RE, left to right.
 285 Unused entries in the array\(emcorresponding either to subexpressions that
 286 did not participate in the match at all, or to subexpressions that do not
 287 exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
 288 .I rm_so
 289 and
 290 .I rm_eo
 291 set to \-1.
 292 If a subexpression participated in the match several times,
 293 the reported substring is the last one it matched.
  294 (Note, as a particular example, that when the RE `(b*)+' matches `bbb',
 295 the parenthesized subexpression matches each of the three `b's and then
 296 an infinite number of empty strings following the last `b',
 297 so the reported substring is one of the empties.)
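.PP
A short sketch of reading the \fIpmatch\fR array follows.
The helper \fIcapture\fR is hypothetical, and the fixed array of ten
entries is an arbitrary choice for the example; it copies the text of one
subexpression and treats an \fIrm_so\fR of \-1 as ``did not participate''.
.PP
.nf
```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: copies the text matched by subexpression idx
 * (0 means the whole match) into out, which holds outsz bytes.
 * Returns 0 on success and -1 if the pattern does not compile, nothing
 * matches, idx does not exist, the subexpression did not participate,
 * or out is too small.
 */
static int
capture(const char *pattern, const char *text, size_t idx,
    char *out, size_t outsz)
{
	regex_t re;
	regmatch_t pm[10];	/* pm[0] plus up to nine subexpressions */
	size_t len;
	int rc = -1;

	assert(pattern != NULL && text != NULL && out != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	if (idx < 10 && idx <= re.re_nsub &&
	    regexec(&re, text, 10, pm, 0) == 0 &&
	    pm[idx].rm_so != -1) {
		len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
		if (len < outsz) {
			memcpy(out, text + pm[idx].rm_so, len);
			out[len] = 0;	/* NUL-terminate the copy */
			rc = 0;
		}
	}
	regfree(&re);
	return rc;
}
```
.fi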
 298 .PP
 299 If REG_STARTEND is specified,
 300 .I pmatch
 301 must point to at least one
 302 .I regmatch_t
 303 (even if
 304 .I nmatch
 305 is 0 or REG_NOSUB was specified),
 306 to hold the input offsets for REG_STARTEND.
 307 Use for output is still entirely controlled by
 308 .IR nmatch ;
 309 if
 310 .I nmatch
 311 is 0 or REG_NOSUB was specified,
 312 the value of
 313 .IR pmatch [0]
 314 will not be changed by a successful
 315 .IR regexec .
 316 .PP
 317 .I Regerror
 318 maps a non-zero
 319 .I errcode
 320 from either
 321 .I regcomp
 322 or
 323 .I regexec
 324 to a human-readable, printable message.
 325 If
 326 .I preg
 327 is non-NULL,
 328 the error code should have arisen from use of
 329 the
 330 .I regex_t
 331 pointed to by
 332 .IR preg ,
 333 and if the error code came from
 334 .IR regcomp ,
 335 it should have been the result from the most recent
 336 .I regcomp
 337 using that
 338 .IR regex_t .
 339 .RI ( Regerror
 340 may be able to supply a more detailed message using information
 341 from the
 342 .IR regex_t .)
 343 .I Regerror
 344 places the NUL-terminated message into the buffer pointed to by
 345 .IR errbuf ,
 346 limiting the length (including the NUL) to at most
 347 .I errbuf_size
 348 bytes.
  349 If the whole message does not fit,
  350 it is truncated to what fits, and the result is still NUL-terminated.
 351 In any case,
 352 the returned value is the size of buffer needed to hold the whole
 353 message (including terminating NUL).
 354 If
 355 .I errbuf_size
 356 is 0,
 357 .I errbuf
 358 is ignored but the return value is still correct.
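.PP
Since the return value is the buffer size needed, a caller can ask for the
size first and then allocate exactly that much.
A minimal sketch, with the hypothetical helper name \fIerror_message\fR:
.PP
.nf
```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns a malloc()ed copy of the message for
 * errcode, or NULL if allocation fails.  The caller frees the result.
 * preg may be NULL.
 */
static char *
error_message(int errcode, const regex_t *preg)
{
	size_t need;
	char *buf;

	/* With errbuf_size 0 the buffer is ignored; only the size comes back. */
	need = regerror(errcode, preg, NULL, 0);
	if (need == 0)
		return NULL;
	buf = malloc(need);
	if (buf != NULL)
		(void)regerror(errcode, preg, buf, need);
	return buf;
}
```
.fi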
 359 .PP
 360 If the
 361 .I errcode
 362 given to
 363 .I regerror
 364 is first ORed with REG_ITOA,
 365 the ``message'' that results is the printable name of the error code,
 366 e.g. ``REG_NOMATCH'',
 367 rather than an explanation thereof.
 368 If
 369 .I errcode
 370 is REG_ATOI,
 371 then
 372 .I preg
 373 shall be non-NULL and the
 374 .I re_endp
 375 member of the structure it points to
 376 must point to the printable name of an error code;
 377 in this case, the result in
 378 .I errbuf
 379 is the decimal digits of
 380 the numeric value of the error code
 381 (0 if the name is not recognized).
 382 REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
 383 they are extensions,
 384 compatible with but not specified by POSIX 1003.2,
 385 and should be used with
 386 caution in software intended to be portable to other systems.
 387 Be warned also that they are considered experimental and changes are possible.
 388 .PP
 389 .I Regfree
 390 frees any dynamically-allocated storage associated with the compiled RE
 391 pointed to by
 392 .IR preg .
 393 The remaining
 394 .I regex_t
 395 is no longer a valid compiled RE
 396 and the effect of supplying it to
 397 .I regexec
 398 or
 399 .I regerror
 400 is undefined.
 401 .PP
 402 None of these functions references global variables except for tables
 403 of constants;
 404 all are safe for use from multiple threads if the arguments are safe.
 405 .SH IMPLEMENTATION CHOICES
  406 There are a number of decisions that 1003.2 leaves up to the implementor,
  407 either by explicitly saying ``undefined'' or because the constructs involved
  408 are forbidden by the RE grammar.
 409 This implementation treats them as follows.
 410 .PP
 411 See
 412 .ZR
 413 for a discussion of the definition of case-independent matching.
 414 .PP
 415 There is no particular limit on the length of REs,
 416 except insofar as memory is limited.
 417 Memory usage is approximately linear in RE size, and largely insensitive
 418 to RE complexity, except for bounded repetitions.
 419 See BUGS for one short RE using them
 420 that will run almost any system out of memory.
 421 .PP
 422 A backslashed character other than one specifically given a magic meaning
 423 by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
 424 is taken as an ordinary character.
 425 .PP
 426 Any unmatched [ is a REG_EBRACK error.
 427 .PP
 428 Equivalence classes cannot begin or end bracket-expression ranges.
 429 The endpoint of one range cannot begin another.
 430 .PP
 431 RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
 432 .PP
 433 A repetition operator (?, *, +, or bounds) cannot follow another
 434 repetition operator.
 435 A repetition operator cannot begin an expression or subexpression
 436 or follow `^' or `|'.
 437 .PP
 438 `|' cannot appear first or last in a (sub)expression or after another `|',
 439 i.e. an operand of `|' cannot be an empty subexpression.
 440 An empty parenthesized subexpression, `()', is legal and matches an
 441 empty (sub)string.
 442 An empty string is not a legal RE.
 443 .PP
 444 A `{' followed by a digit is considered the beginning of bounds for a
 445 bounded repetition, which must then follow the syntax for bounds.
 446 A `{' \fInot\fR followed by a digit is considered an ordinary character.
 447 .PP
 448 `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
 449 REs are anchors, not ordinary characters.
 450 .SH SEE ALSO
 451 grep(1), re_format(7)
 452 .PP
 453 POSIX 1003.2, sections 2.8 (Regular Expression Notation)
 454 and
 455 B.5 (C Binding for Regular Expression Matching).
 456 .SH DIAGNOSTICS
 457 Non-zero error codes from
 458 .I regcomp
 459 and
 460 .I regexec
 461 include the following:
 462 .PP
 463 .nf
 464 .ta \w'REG_ECOLLATE'u+3n
 465 REG_NOMATCH     regexec() failed to match
 466 REG_BADPAT      invalid regular expression
 467 REG_ECOLLATE    invalid collating element
 468 REG_ECTYPE      invalid character class
 469 REG_EESCAPE     \e applied to unescapable character
 470 REG_ESUBREG     invalid backreference number
 471 REG_EBRACK      brackets [ ] not balanced
 472 REG_EPAREN      parentheses ( ) not balanced
 473 REG_EBRACE      braces { } not balanced
 474 REG_BADBR       invalid repetition count(s) in { }
 475 REG_ERANGE      invalid character range in [ ]
 476 REG_ESPACE      ran out of memory
 477 REG_BADRPT      ?, *, or + operand invalid
 478 REG_EMPTY       empty (sub)expression
 479 REG_ASSERT      ``can't happen''\(emyou found a bug
 480 REG_INVARG      invalid argument, e.g. negative-length string
 481 .fi
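.PP
Where the REG_ITOA extension is not available, the symbolic names in the
table above can be recovered with an ordinary switch.
This is only a sketch: the helper name \fIerror_name\fR is hypothetical,
and the last three codes are guarded because they may not exist as macros
on other systems.
.PP
.nf
```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: maps an error code to its symbolic name.
 * Returns "unknown" for codes it does not recognize.
 */
static const char *
error_name(int errcode)
{
	switch (errcode) {
	case REG_NOMATCH:	return "REG_NOMATCH";
	case REG_BADPAT:	return "REG_BADPAT";
	case REG_ECOLLATE:	return "REG_ECOLLATE";
	case REG_ECTYPE:	return "REG_ECTYPE";
	case REG_EESCAPE:	return "REG_EESCAPE";
	case REG_ESUBREG:	return "REG_ESUBREG";
	case REG_EBRACK:	return "REG_EBRACK";
	case REG_EPAREN:	return "REG_EPAREN";
	case REG_EBRACE:	return "REG_EBRACE";
	case REG_BADBR:		return "REG_BADBR";
	case REG_ERANGE:	return "REG_ERANGE";
	case REG_ESPACE:	return "REG_ESPACE";
	case REG_BADRPT:	return "REG_BADRPT";
#ifdef REG_EMPTY
	case REG_EMPTY:		return "REG_EMPTY";
#endif
#ifdef REG_ASSERT
	case REG_ASSERT:	return "REG_ASSERT";
#endif
#ifdef REG_INVARG
	case REG_INVARG:	return "REG_INVARG";
#endif
	default:		return "unknown";
	}
}
```
.fi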
 482 .SH HISTORY
  483 Originally written by Henry Spencer at the University of Toronto.
 484 Altered for inclusion in the 4.4BSD distribution.
 485 .SH BUGS
 486 This is an alpha release with known defects.
 487 Please report problems.
 488 .PP
 489 There is one known functionality bug.
 490 The implementation of internationalization is incomplete:
 491 the locale is always assumed to be the default one of 1003.2,
 492 and only the collating elements etc. of that locale are available.
 493 .PP
 494 The back-reference code is subtle and doubts linger about its correctness
 495 in complex cases.
 496 .PP
 497 .I Regexec
 498 performance is poor.
 499 This will improve with later releases.
 500 .I Nmatch
 501 exceeding 0 is expensive;
 502 .I nmatch
 503 exceeding 1 is worse.
 504 .I Regexec
 505 is largely insensitive to RE complexity \fIexcept\fR that back
 506 references are massively expensive.
 507 RE length does matter; in particular, there is a strong speed bonus
 508 for keeping RE length under about 30 characters,
 509 with most special characters counting roughly double.
 510 .PP
 511 .I Regcomp
 512 implements bounded repetitions by macro expansion,
 513 which is costly in time and space if counts are large
 514 or bounded repetitions are nested.
 515 An RE like, say,
 516 `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
 517 will (eventually) run almost any existing machine out of swap space.
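.PP
To see why, a rough estimate helps.
Assuming that expansion replicates the operand of each bound about as many
times as its upper limit (an assumption about the expansion, not a figure
taken from the implementation), nesting multiplies the counts.
The hypothetical helper below does that arithmetic:
.PP
.nf
```c
/*
 * Hypothetical estimate: number of copies of the innermost operand
 * when bounds with upper limit max are nested depth levels deep,
 * assuming each level replicates its operand max times.
 * The result overflows for large inputs; this is only a sketch.
 */
static unsigned long long
expanded_copies(unsigned max, unsigned depth)
{
	unsigned long long n = 1;

	while (depth-- > 0)
		n *= max;
	return n;
}
```
.fi
.PP
For the RE above, an upper limit of 100 nested five deep gives on the order
of 100 to the fifth power, or ten thousand million, copies of `a' under
this assumption.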
 518 .PP
 519 There are suspected problems with response to obscure error conditions.
 520 Notably,
 521 certain kinds of internal overflow,
 522 produced only by truly enormous REs or by multiply nested bounded repetitions,
 523 are probably not handled well.
 524 .PP
 525 Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
 526 a special character only in the presence of a previous unmatched `('.
 527 This can't be fixed until the spec is fixed.
 528 .PP
 529 The standard's definition of back references is vague.
 530 For example, does
 531 `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
 532 Until the standard is clarified,
 533 behavior in such cases should not be relied on.
 534 .PP
 535 The implementation of word-boundary matching is a bit of a kludge,
 536 and bugs may lurk in combinations of word-boundary matching and anchoring.