lib/libcompat/regexp/regexp.3

   1 .\" Copyright (c) 1991, 1993
   2 .\"     The Regents of the University of California.  All rights reserved.
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .\" 4. Neither the name of the University nor the names of its contributors
  13 .\"    may be used to endorse or promote products derived from this software
  14 .\"    without specific prior written permission.
  15 .\"
  16 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  17 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  18 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  19 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  20 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  21 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  22 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  23 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  24 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  25 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  26 .\" SUCH DAMAGE.
  27 .\"
  28 .\"     @(#)regexp.3    8.1 (Berkeley) 6/4/93
  29 .\" $FreeBSD$
  30 .\"
  31 .Dd June 4, 1993
  32 .Dt REGEXP 3
  33 .Os
  34 .Sh NAME
  35 .Nm regcomp ,
  36 .Nm regexec ,
  37 .Nm regsub ,
  38 .Nm regerror
  39 .Nd regular expression handlers
  40 .Sh LIBRARY
  41 .Lb libcompat
  42 .Sh SYNOPSIS
  43 .In regexp.h
  44 .Ft regexp *
  45 .Fn regcomp "const char *exp"
  46 .Ft int
  47 .Fn regexec "const regexp *prog" "const char *string"
  48 .Ft void
  49 .Fn regsub "const regexp *prog" "const char *source" "char *dest"
  50 .Sh DESCRIPTION
  51 .Bf Sy
  52 This interface is made obsolete by
  53 .Xr regex 3 .
  54 .Ef
  55 .Pp
  56 The
  57 .Fn regcomp ,
  58 .Fn regexec ,
  59 .Fn regsub ,
  60 and
  61 .Fn regerror
  62 functions
  63 implement
  64 .Xr egrep 1 Ns -style
  65 regular expressions and supporting facilities.
  66 .Pp
  67 The
  68 .Fn regcomp
  69 function
  70 compiles a regular expression into a structure of type
  71 .Vt regexp ,
  72 and returns a pointer to it.
  73 The space has been allocated using
  74 .Xr malloc 3
  75 and may be released by
  76 .Xr free 3 .
  77 .Pp
  78 The
  79 .Fn regexec
  80 function
  81 matches a
  82 .Dv NUL Ns -terminated
  83 .Fa string
  84 against the compiled regular expression
  85 in
  86 .Fa prog .
  87 It returns 1 for success and 0 for failure, and adjusts the contents of
  88 .Fa prog Ns 's
  89 .Em startp
  90 and
  91 .Em endp
  92 (see below) accordingly.
  93 .Pp
  94 The members of a
  95 .Vt regexp
  96 structure include at least the following (not necessarily in order):
  97 .Bd -literal -offset indent
  98 char *startp[NSUBEXP];
  99 char *endp[NSUBEXP];
 100 .Ed
 101 .Pp
 102 where
 103 .Dv NSUBEXP
 104 is defined (as 10) in the header file.
 105 Once a successful
 106 .Fn regexec
 107 has been done using the
  108 .Vt regexp ,
 109 each
 110 .Em startp Ns - Em endp
 111 pair describes one substring
 112 within the
 113 .Fa string ,
 114 with the
 115 .Em startp
 116 pointing to the first character of the substring and
 117 the
 118 .Em endp
 119 pointing to the first character following the substring.
 120 The 0th substring is the substring of
 121 .Fa string
 122 that matched the whole
 123 regular expression.
 124 The others are those substrings that matched parenthesized expressions
 125 within the regular expression, with parenthesized expressions numbered
 126 in left-to-right order of their opening parentheses.
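.Pp
As a rough illustration of how a caller might read these pairs,
the following C sketch copies substring
.Em n
into a caller-supplied buffer.
It deliberately does not use the real
.Vt regexp
type: the names
.Vt demo_regexp ,
.Dv DEMO_NSUBEXP ,
and
.Fn demo_copy_submatch
are hypothetical stand-ins, and treating a
.Dv NULL
pointer as "no substring" is an assumption rather than something stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NSUBEXP 10  /* mirrors the documented NSUBEXP; hypothetical name */

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    const char *startp[DEMO_NSUBEXP];
    const char *endp[DEMO_NSUBEXP];
};

/*
 * Copy substring n into buf and NUL-terminate it.
 * Returns the substring length, or -1 if n is out of range, the pair is
 * unset (assumed to be NULL), or buf is too small.
 */
long
demo_copy_submatch(const struct demo_regexp *re, int n, char *buf,
    size_t bufsize)
{
    size_t len;

    assert(re != NULL && buf != NULL);
    if (n < 0 || n >= DEMO_NSUBEXP)
        return -1;
    if (re->startp[n] == NULL || re->endp[n] == NULL)
        return -1;

    /* endp points at the first character following the substring. */
    len = (size_t)(re->endp[n] - re->startp[n]);
    if (len + 1 > bufsize)
        return -1;

    memcpy(buf, re->startp[n], len);
    buf[len] = '\0';
    return (long)len;
}
```

Because
.Em endp
points one past the last matched character, the length is simply the pointer difference, and an empty match has
.Em startp
equal to
.Em endp .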
 127 .Pp
 128 The
 129 .Fn regsub
 130 function
 131 copies
 132 .Fa source
 133 to
 134 .Fa dest ,
 135 making substitutions according to the
 136 most recent
 137 .Fn regexec
 138 performed using
 139 .Fa prog .
 140 Each instance of `&' in
 141 .Fa source
 142 is replaced by the substring
 143 indicated by
  144 .Em startp Ns Bq 0
  145 and
  146 .Em endp Ns Bq 0 .
 147 Each instance of
 148 .Sq \e Ns Em n ,
 149 where
 150 .Em n
 151 is a digit, is replaced by
 152 the substring indicated by
 153 .Em startp Ns Bq Em n
 154 and
 155 .Em endp Ns Bq Em n .
 156 To get a literal `&' or
 157 .Sq \e Ns Em n
 158 into
 159 .Fa dest ,
 160 prefix it with `\e';
 161 to get a literal `\e' preceding `&' or
 162 .Sq \e Ns Em n ,
 163 prefix it with
 164 another `\e'.
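.Pp
The substitution rules above can be sketched in C as follows.
This is an illustrative reimplementation under stated assumptions, not the library's code:
.Vt demo_regexp ,
.Dv DEMO_NSUBEXP ,
and
.Fn demo_regsub
are hypothetical names, the
.Fa destsize
parameter is an addition that the documented
.Fn regsub
does not have, and unset substrings (assumed to be
.Dv NULL )
are assumed to expand to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NSUBEXP 10  /* mirrors the documented NSUBEXP; hypothetical name */

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    const char *startp[DEMO_NSUBEXP];
    const char *endp[DEMO_NSUBEXP];
};

/*
 * Copy source to dest, expanding '&' and backslash-digit sequences.
 * Returns 0 on success, or -1 if the result (with its NUL) would not fit.
 */
int
demo_regsub(const struct demo_regexp *re, const char *source, char *dest,
    size_t destsize)
{
    const char *src = source;
    size_t used = 0;
    char c;

    assert(re != NULL && source != NULL && dest != NULL);
    while ((c = *src++) != '\0') {
        int no = -1;  /* substring number, or -1 for an ordinary char */

        if (c == '&')
            no = 0;
        else if (c == '\\' && *src >= '0' && *src <= '9')
            no = *src++ - '0';

        if (no < 0) {
            /* A backslash before '\\' or '&' is dropped; the next
             * character is then copied literally. */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            if (used + 1 >= destsize)
                return -1;
            dest[used++] = c;
        } else if (re->startp[no] != NULL && re->endp[no] != NULL) {
            size_t len = (size_t)(re->endp[no] - re->startp[no]);

            if (used + len >= destsize)
                return -1;
            memcpy(dest + used, re->startp[no], len);
            used += len;
        }
    }
    if (used >= destsize)
        return -1;
    dest[used] = '\0';
    return 0;
}
```

The explicit size check is a design choice of the sketch only; the documented
.Fn regsub
takes no length for
.Fa dest ,
so a real caller has to size that buffer for the largest possible expansion.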
 165 .Pp
 166 The
 167 .Fn regerror
 168 function
 169 is called whenever an error is detected in
 170 .Fn regcomp ,
 171 .Fn regexec ,
 172 or
 173 .Fn regsub .
 174 The default
 175 .Fn regerror
 176 writes the string
 177 .Fa msg ,
 178 with a suitable indicator of origin,
 179 on the standard
 180 error output
 181 and invokes
 182 .Xr exit 3 .
 183 The
 184 .Fn regerror
 185 function
 186 can be replaced by the user if other actions are desirable.
 187 .Sh REGULAR EXPRESSION SYNTAX
 188 A regular expression is zero or more
 189 .Em branches ,
 190 separated by `|'.
 191 It matches anything that matches one of the branches.
 192 .Pp
 193 A branch is zero or more
 194 .Em pieces ,
 195 concatenated.
  196 It matches a match for the first piece, followed by a match for the second, and so on.
 197 .Pp
 198 A piece is an
 199 .Em atom
 200 possibly followed by `*', `+', or `?'.
 201 An atom followed by `*' matches a sequence of 0 or more matches of the atom.
 202 An atom followed by `+' matches a sequence of 1 or more matches of the atom.
 203 An atom followed by `?' matches a match of the atom, or the null string.
 204 .Pp
 205 An atom is a regular expression in parentheses (matching a match for the
 206 regular expression), a
 207 .Em range
 208 (see below), `.'
 209 (matching any single character), `^' (matching the null string at the
 210 beginning of the input string), `$' (matching the null string at the
 211 end of the input string), a `\e' followed by a single character (matching
 212 that character), or a single character with no other significance
 213 (matching that character).
 214 .Pp
 215 A
 216 .Em range
 217 is a sequence of characters enclosed in `[]'.
 218 It normally matches any single character from the sequence.
 219 If the sequence begins with `^',
 220 it matches any single character
 221 .Em not
 222 from the rest of the sequence.
 223 If two characters in the sequence are separated by `\-', this is shorthand
 224 for the full list of
 225 .Tn ASCII
 226 characters between them
 227 (e.g.\& `[0-9]' matches any decimal digit).
 228 To include a literal `]' in the sequence, make it the first character
 229 (following a possible `^').
 230 To include a literal `\-', make it the first or last character.
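.Pp
The range rules can be sketched as a small C predicate.
This is a simplified illustration, not the library's matcher:
.Fn demo_range_matches
is a hypothetical name, the caller is assumed to pass only the characters between the brackets together with their count, and a reversed span such as `z-a' is assumed to match nothing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * body: the characters between '[' and the closing ']', len bytes long.
 * Returns 1 if c is matched by the range, 0 otherwise.
 */
int
demo_range_matches(const char *body, size_t len, char c)
{
    unsigned char uc = (unsigned char)c;
    int negate = 0, found = 0;
    size_t i = 0;

    assert(body != NULL);

    /* A leading '^' inverts the sense of the whole range. */
    if (i < len && body[i] == '^') {
        negate = 1;
        i++;
    }
    while (i < len) {
        unsigned char lo = (unsigned char)body[i];

        /* '-' with a character on each side denotes a span; a '-'
         * that is first or last falls through as a literal. */
        if (i + 2 < len && body[i + 1] == '-') {
            unsigned char hi = (unsigned char)body[i + 2];

            if (lo <= uc && uc <= hi)
                found = 1;
            i += 3;
        } else {
            if (uc == lo)
                found = 1;
            i++;
        }
    }
    return negate ? !found : found;
}
```

Because the body is passed with an explicit length, a `]' placed first in the sequence needs no special handling here; locating the closing bracket is left to the (hypothetical) caller.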
 231 .Sh AMBIGUITY
 232 If a regular expression could match two different parts of the input string,
 233 it will match the one which begins earliest.
 234 If both begin in the same place but match different lengths, or match
 235 the same length in different ways, life gets messier, as follows.
 236 .Pp
 237 In general, the possibilities in a list of branches are considered in
 238 left-to-right order, the possibilities for `*', `+', and `?' are
 239 considered longest-first, nested constructs are considered from the
 240 outermost in, and concatenated constructs are considered leftmost-first.
 241 The match that will be chosen is the one that uses the earliest
 242 possibility in the first choice that has to be made.
 243 If there is more than one choice, the next will be made in the same manner
 244 (earliest possibility) subject to the decision on the first choice.
 245 And so forth.
 246 .Pp
 247 For example,
 248 .Sq Li (ab|a)b*c
 249 could match
 250 `abc' in one of two ways.
 251 The first choice is between `ab' and `a'; since `ab' is earlier, and does
 252 lead to a successful overall match, it is chosen.
 253 Since the `b' is already spoken for,
 254 the `b*' must match its last possibility\(emthe empty string\(emsince
 255 it must respect the earlier choice.
 256 .Pp
 257 In the particular case where no `|'s are present and there is only one
 258 `*', `+', or `?', the net effect is that the longest possible
 259 match will be chosen.
 260 So
 261 .Sq Li ab* ,
 262 presented with `xabbbby', will match `abbbb'.
 263 Note that if
  264 .Sq Li ab*
 265 is tried against `xabyabbbz', it
 266 will match `ab' just after `x', due to the begins-earliest rule.
 267 (In effect, the decision on where to start the match is the first choice
 268 to be made, hence subsequent choices must respect it even if this leads them
 269 to less-preferred alternatives.)
 270 .Sh RETURN VALUES
 271 The
 272 .Fn regcomp
 273 function
 274 returns
 275 .Dv NULL
 276 for a failure
 277 .Pf ( Fn regerror
 278 permitting),
 279 where failures are syntax errors, exceeding implementation limits,
 280 or applying `+' or `*' to a possibly-null operand.
 281 .Sh SEE ALSO
 282 .Xr ed 1 ,
 283 .Xr egrep 1 ,
 284 .Xr ex 1 ,
 285 .Xr expr 1 ,
 286 .Xr fgrep 1 ,
 287 .Xr grep 1 ,
 288 .Xr regex 3
 289 .Sh HISTORY
 290 Both code and manual page for
 291 .Fn regcomp ,
 292 .Fn regexec ,
 293 .Fn regsub ,
 294 and
 295 .Fn regerror
 296 were written at the University of Toronto
 297 and appeared in
 298 .Bx 4.3 tahoe .
 299 They are intended to be compatible with the Bell V8
 300 .Xr regexp 3 ,
 301 but are not derived from Bell code.
 302 .Sh BUGS
 303 Empty branches and empty regular expressions are not portable to V8.
 304 .Pp
 305 The restriction against
 306 applying `*' or `+' to a possibly-null operand is an artifact of the
 307 simplistic implementation.
 308 .Pp
  309 This implementation does not support
 310 .Xr egrep 1 Ns 's
 311 newline-separated branches;
 312 neither does the V8
 313 .Xr regexp 3 ,
 314 though.
 315 .Pp
 316 Due to emphasis on
 317 compactness and simplicity,
 318 it is not strikingly fast.
 319 It does give special attention to handling simple cases quickly.