   1 .\" $FreeBSD$
   2 .\"
   3 .TH FLEX 1 "May 21, 2013" "Version 2.5.37"
   4 .SH NAME
   5 flex, lex \- fast lexical analyzer generator
   6 .SH SYNOPSIS
   7 .B flex
   8 .B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
   9 .B [\-\-help \-\-version]
  10 .I [filename ...]
  11 .SH OVERVIEW
  12 This manual describes
  13 .I flex,
  14 a tool for generating programs that perform pattern-matching on text.
  15 The manual includes both tutorial and reference sections:
  16 .nf
  17
  18     Description
  19         a brief overview of the tool
  20
  21     Some Simple Examples
  22
  23     Format Of The Input File
  24
  25     Patterns
  26         the extended regular expressions used by flex
  27
  28     How The Input Is Matched
  29         the rules for determining what has been matched
  30
  31     Actions
  32         how to specify what to do when a pattern is matched
  33
  34     The Generated Scanner
  35         details regarding the scanner that flex produces;
  36         how to control the input source
  37
  38     Start Conditions
  39         introducing context into your scanners, and
  40         managing "mini-scanners"
  41
  42     Multiple Input Buffers
  43         how to manipulate multiple input sources; how to
  44         scan from strings instead of files
  45
  46     End-of-file Rules
  47         special rules for matching the end of the input
  48
  49     Miscellaneous Macros
  50         a summary of macros available to the actions
  51
  52     Values Available To The User
  53         a summary of values available to the actions
  54
  55     Interfacing With Yacc
  56         connecting flex scanners together with yacc parsers
  57
  58     Options
  59         flex command-line options, and the "%option"
  60         directive
  61
  62     Performance Considerations
  63         how to make your scanner go as fast as possible
  64
  65     Generating C++ Scanners
  66         the (experimental) facility for generating C++
  67         scanner classes
  68
  69     Incompatibilities With Lex And POSIX
  70         how flex differs from AT&T lex and the POSIX lex
  71         standard
  72
  73     Diagnostics
  74         those error messages produced by flex (or scanners
  75         it generates) whose meanings might not be apparent
  76
  77     Files
  78         files used by flex
  79
  80     Deficiencies / Bugs
  81         known problems with flex
  82
  83     See Also
  84         other documentation, related tools
  85
  86     Author
  87         includes contact information
  88
  89 .fi
  90 .SH DESCRIPTION
  91 .I flex
  92 is a tool for generating
  93 .I scanners:
  94 programs which recognize lexical patterns in text.
  95 .I flex
  96 reads
  97 the given input files, or its standard input if no file names are given,
  98 for a description of a scanner to generate.
  99 The description is in the form of pairs
 100 of regular expressions and C code, called
 101 .I rules.
 102 .I flex
 103 generates as output a C source file,
 104 .B lex.yy.c,
 105 which defines a routine
 106 .B yylex().
 107 This file is compiled and linked with the
 108 .B \-ll
 109 library to produce an executable.
 110 When the executable is run,
 111 it analyzes its input for occurrences
 112 of the regular expressions.
 113 Whenever it finds one, it executes
 114 the corresponding C code.
 115 .SH SOME SIMPLE EXAMPLES
First, some simple examples to give the flavor of how one uses
.I flex.
The following
.I flex
input specifies a scanner which, whenever it encounters the string
"username", replaces it with the user's login name:
 122 .nf
 123
 124     %%
 125     username    printf( "%s", getlogin() );
 126
 127 .fi
 128 By default, any text not matched by a
 129 .I flex
 130 scanner
 131 is copied to the output, so the net effect of this scanner is
 132 to copy its input file to its output with each occurrence
 133 of "username" expanded.
 134 In this input, there is just one rule.
 135 "username" is the
 136 .I pattern
 137 and the "printf" is the
 138 .I action.
 139 The "%%" marks the beginning of the rules.
 140 .PP
 141 Here's another simple example:
 142 .nf
 143
 144     %{
 145             int num_lines = 0, num_chars = 0;
 146     %}
 147
 148     %%
 149     \\n      ++num_lines; ++num_chars;
 150     .       ++num_chars;
 151
 152     %%
    int main( void )
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\\n",
                    num_lines, num_chars );
            return 0;
            }
 159
 160 .fi
 161 This scanner counts the number of characters and the number
 162 of lines in its input (it produces no output other than the
 163 final report on the counts).
 164 The first line
 165 declares two globals, "num_lines" and "num_chars", which are accessible
 166 both inside
 167 .B yylex()
 168 and in the
 169 .B main()
 170 routine declared after the second "%%".
 171 There are two rules, one
 172 which matches a newline ("\\n") and increments both the line count and
 173 the character count, and one which matches any character other than
 174 a newline (indicated by the "." regular expression).
 175 .PP
 176 A somewhat more complicated example:
 177 .nf
 178
 179     /* scanner for a toy Pascal-like language */
 180
 181     %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
 184     %}
 185
 186     DIGIT    [0-9]
 187     ID       [a-z][a-z0-9]*
 188
 189     %%
 190
 191     {DIGIT}+    {
 192                 printf( "An integer: %s (%d)\\n", yytext,
 193                         atoi( yytext ) );
 194                 }
 195
 196     {DIGIT}+"."{DIGIT}*        {
 197                 printf( "A float: %s (%g)\\n", yytext,
 198                         atof( yytext ) );
 199                 }
 200
 201     if|then|begin|end|procedure|function        {
 202                 printf( "A keyword: %s\\n", yytext );
 203                 }
 204
 205     {ID}        printf( "An identifier: %s\\n", yytext );
 206
 207     "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
 208
 209     "{"[^}\\n]*"}"     /* eat up one-line comments */
 210
 211     [ \\t\\n]+          /* eat up whitespace */
 212
 213     .           printf( "Unrecognized character: %s\\n", yytext );
 214
 215     %%
 216
    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        return 0;
        }
 229
 230 .fi
This is the beginning of a simple scanner for a language like
Pascal.
 233 It identifies different types of
 234 .I tokens
 235 and reports on what it has seen.
 236 .PP
 237 The details of this example will be explained in the following
 238 sections.
 239 .SH FORMAT OF THE INPUT FILE
 240 The
 241 .I flex
 242 input file consists of three sections, separated by a line with just
 243 .B %%
 244 in it:
 245 .nf
 246
 247     definitions
 248     %%
 249     rules
 250     %%
 251     user code
 252
 253 .fi
 254 The
 255 .I definitions
 256 section contains declarations of simple
 257 .I name
 258 definitions to simplify the scanner specification, and declarations of
 259 .I start conditions,
 260 which are explained in a later section.
 261 .PP
 262 Name definitions have the form:
 263 .nf
 264
 265     name definition
 266
 267 .fi
 268 The "name" is a word beginning with a letter or an underscore ('_')
 269 followed by zero or more letters, digits, '_', or '-' (dash).
 270 The definition is taken to begin at the first non-white-space character
 271 following the name and continuing to the end of the line.
 272 The definition can subsequently be referred to using "{name}", which
 273 will expand to "(definition)".
 274 For example,
 275 .nf
 276
 277     DIGIT    [0-9]
 278     ID       [a-z][a-z0-9]*
 279
 280 .fi
 281 defines "DIGIT" to be a regular expression which matches a
 282 single digit, and
 283 "ID" to be a regular expression which matches a letter
 284 followed by zero-or-more letters-or-digits.
 285 A subsequent reference to
 286 .nf
 287
 288     {DIGIT}+"."{DIGIT}*
 289
 290 .fi
 291 is identical to
 292 .nf
 293
 294     ([0-9])+"."([0-9])*
 295
 296 .fi
 297 and matches one-or-more digits followed by a '.' followed
 298 by zero-or-more digits.
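.PP
The parentheses added during expansion matter whenever an operator
follows the reference.
The following C sketch is not part of
.I flex;
the function name expand_one() and its fixed-size output buffer are
hypothetical, chosen only to illustrate the textual substitution of
"{name}" by "(definition)" described above.
.nf

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustration only (not flex source): copy `pattern` into `out`,
 * replacing every "{name}" with "(definition)".  Returns 0 on
 * success, or -1 if the result does not fit in `outsz` bytes.
 */
int expand_one(const char *pattern, const char *name,
               const char *definition, char *out, size_t outsz)
{
    size_t nlen = strlen(name);
    size_t dlen = strlen(definition);
    size_t o = 0;

    assert(nlen > 0);
    if (outsz == 0)
        return -1;
    while (*pattern != 0) {
        /* A reference is '{', the name, then '}'. */
        if (pattern[0] == '{' &&
            strncmp(pattern + 1, name, nlen) == 0 &&
            pattern[1 + nlen] == '}') {
            if (o + dlen + 2 >= outsz)
                return -1;
            out[o++] = '(';
            memcpy(out + o, definition, dlen);
            o += dlen;
            out[o++] = ')';
            pattern += nlen + 2;
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *pattern++;
        }
    }
    out[o] = 0;
    return 0;
}
```

.fi
.PP
With the name "DIGIT" and the definition "[0-9]", the pattern {DIGIT}+
becomes ([0-9])+, so the '+' applies to the whole definition rather than
to its last element.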
 299 .PP
 300 The
 301 .I rules
 302 section of the
 303 .I flex
 304 input contains a series of rules of the form:
 305 .nf
 306
 307     pattern   action
 308
 309 .fi
 310 where the pattern must be unindented and the action must begin
 311 on the same line.
 312 .PP
 313 See below for a further description of patterns and actions.
 314 .PP
 315 Finally, the user code section is simply copied to
 316 .B lex.yy.c
 317 verbatim.
 318 It is used for companion routines which call or are called
 319 by the scanner.
 320 The presence of this section is optional;
 321 if it is missing, the second
 322 .B %%
 323 in the input file may be skipped, too.
 324 .PP
 325 In the definitions and rules sections, any
 326 .I indented
 327 text or text enclosed in
 328 .B %{
 329 and
 330 .B %}
 331 is copied verbatim to the output (with the %{}'s removed).
 332 The %{}'s must appear unindented on lines by themselves.
 333 .PP
 334 In the rules section,
 335 any indented or %{} text appearing before the
 336 first rule may be used to declare variables
 337 which are local to the scanning routine and (after the declarations)
 338 code which is to be executed whenever the scanning routine is entered.
 339 Other indented or %{} text in the rule section is still copied to the output,
 340 but its meaning is not well-defined and it may well cause compile-time
 341 errors (this feature is present for
 342 .I POSIX
 343 compliance; see below for other such features).
 344 .PP
 345 In the definitions section (but not in the rules section),
 346 an unindented comment (i.e., a line
 347 beginning with "/*") is also copied verbatim to the output up
 348 to the next "*/".
 349 .SH PATTERNS
 350 The patterns in the input are written using an extended set of regular
 351 expressions.
 352 These are:
 353 .nf
 354
 355     x          match the character 'x'
 356     .          any character (byte) except newline
 357     [xyz]      a "character class"; in this case, the pattern
 358                  matches either an 'x', a 'y', or a 'z'
 359     [abj-oZ]   a "character class" with a range in it; matches
 360                  an 'a', a 'b', any letter from 'j' through 'o',
 361                  or a 'Z'
 362     [^A-Z]     a "negated character class", i.e., any character
 363                  but those in the class.  In this case, any
 364                  character EXCEPT an uppercase letter.
 365     [^A-Z\\n]   any character EXCEPT an uppercase letter or
 366                  a newline
 367     r*         zero or more r's, where r is any regular expression
 368     r+         one or more r's
 369     r?         zero or one r's (that is, "an optional r")
 370     r{2,5}     anywhere from two to five r's
 371     r{2,}      two or more r's
 372     r{4}       exactly 4 r's
 373     {name}     the expansion of the "name" definition
 374                (see above)
 375     "[xyz]\\"foo"
 376                the literal string: [xyz]"foo
 377     \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
 378                  then the ANSI-C interpretation of \\x.
 379                  Otherwise, a literal 'X' (used to escape
 380                  operators such as '*')
 381     \\0         a NUL character (ASCII code 0)
 382     \\123       the character with octal value 123
 383     \\x2a       the character with hexadecimal value 2a
 384     (r)        match an r; parentheses are used to override
 385                  precedence (see below)
 386
 387
 388     rs         the regular expression r followed by the
 389                  regular expression s; called "concatenation"
 390
 391
 392     r|s        either an r or an s
 393
 394
 395     r/s        an r but only if it is followed by an s.  The
 396                  text matched by s is included when determining
 397                  whether this rule is the "longest match",
 398                  but is then returned to the input before
 399                  the action is executed.  So the action only
 400                  sees the text matched by r.  This type
                 of pattern is called "trailing context".
 402                  (There are some combinations of r/s that flex
 403                  cannot match correctly; see notes in the
 404                  Deficiencies / Bugs section below regarding
 405                  "dangerous trailing context".)
 406     ^r         an r, but only at the beginning of a line (i.e.,
 407                  when just starting to scan, or right after a
 408                  newline has been scanned).
 409     r$         an r, but only at the end of a line (i.e., just
 410                  before a newline).  Equivalent to "r/\\n".
 411
 412                Note that flex's notion of "newline" is exactly
 413                whatever the C compiler used to compile flex
 414                interprets '\\n' as; in particular, on some DOS
 415                systems you must either filter out \\r's in the
 416                input yourself, or explicitly use r/\\r\\n for "r$".
 417
 418
 419     <s>r       an r, but only in start condition s (see
 420                  below for discussion of start conditions)
 421     <s1,s2,s3>r
 422                same, but in any of start conditions s1,
 423                  s2, or s3
 424     <*>r       an r in any start condition, even an exclusive one.
 425
 426
 427     <<EOF>>    an end-of-file
 428     <s1,s2><<EOF>>
 429                an end-of-file when in start condition s1 or s2
 430
 431 .fi
 432 Note that inside of a character class, all regular expression operators
 433 lose their special meaning except escape ('\\') and the character class
 434 operators, '-', ']', and, at the beginning of the class, '^'.
 435 .PP
 436 The regular expressions listed above are grouped according to
 437 precedence, from highest precedence at the top to lowest at the bottom.
 438 Those grouped together have equal precedence.
 439 For example,
 440 .nf
 441
 442     foo|bar*
 443
 444 .fi
 445 is the same as
 446 .nf
 447
 448     (foo)|(ba(r*))
 449
 450 .fi
 451 since the '*' operator has higher precedence than concatenation,
 452 and concatenation higher than alternation ('|').
 453 This pattern
 454 therefore matches
 455 .I either
 456 the string "foo"
 457 .I or
 458 the string "ba" followed by zero-or-more r's.
 459 To match "foo" or zero-or-more "bar"'s, use:
 460 .nf
 461
 462     foo|(bar)*
 463
 464 .fi
 465 and to match zero-or-more "foo"'s-or-"bar"'s:
 466 .nf
 467
 468     (foo|bar)*
 469
 470 .fi
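.PP
The grouping can be checked outside of
.I flex.
The sketch below uses the POSIX regcomp(3) interface rather than a
generated scanner, on the assumption that repetition, concatenation,
and alternation have the same relative precedence in POSIX extended
regular expressions as in
.I flex
patterns.
The helper name matches_whole() and the 256-byte buffer are
hypothetical.
.nf

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Illustration only: return 1 if the whole of `text` matches the
 * POSIX extended regular expression `pattern`, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile.
 */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);
    /* Anchor the pattern so that it must cover the entire text. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

.fi
.PP
Under that assumption, "foo|bar*" accepts "barrr" but rejects "barbar",
while "foo|(bar)*" accepts "barbar" and "(foo|bar)*" accepts "foobarfoo".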
 471 .PP
 472 In addition to characters and ranges of characters, character classes
 473 can also contain character class
 474 .I expressions.
 475 These are expressions enclosed inside
 476 .B [:
 477 and
 478 .B :]
 479 delimiters (which themselves must appear between the '[' and ']' of the
 480 character class; other elements may occur inside the character class, too).
 481 The valid expressions are:
 482 .nf
 483
 484     [:alnum:] [:alpha:] [:blank:]
 485     [:cntrl:] [:digit:] [:graph:]
 486     [:lower:] [:print:] [:punct:]
 487     [:space:] [:upper:] [:xdigit:]
 488
 489 .fi
 490 These expressions all designate a set of characters equivalent to
 491 the corresponding standard C
 492 .B isXXX
 493 function.
 494 For example,
 495 .B [:alnum:]
 496 designates those characters for which
 497 .B isalnum()
returns true, i.e., any alphabetic or numeric character.
 499 Some systems don't provide
 500 .B isblank(),
 501 so flex defines
 502 .B [:blank:]
 503 as a blank or a tab.
 504 .PP
 505 For example, the following character classes are all equivalent:
 506 .nf
 507
 508     [[:alnum:]]
 509     [[:alpha:][:digit:]]
 510     [[:alpha:]0-9]
 511     [a-zA-Z0-9]
 512
 513 .fi
 514 If your scanner is case-insensitive (the
 515 .B \-i
 516 flag), then
 517 .B [:upper:]
 518 and
 519 .B [:lower:]
 520 are equivalent to
 521 .B [:alpha:].
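.PP
The equivalence of the four character classes listed above can be
checked directly against the C library.
This is only a sketch: it assumes the default "C" locale and an ASCII
character set, and the function names are hypothetical.
.nf

```c
#include <assert.h>
#include <ctype.h>

/* [[:alnum:]] expressed directly through isalnum(). */
int in_alnum(int c)
{
    assert(c >= 0 && c <= 127);
    return isalnum(c) != 0;
}

/* [[:alpha:][:digit:]]: the union of two class expressions. */
int in_alpha_or_digit(int c)
{
    assert(c >= 0 && c <= 127);
    return isalpha(c) != 0 || isdigit(c) != 0;
}

/* [a-zA-Z0-9] written out as explicit ranges (ASCII assumed). */
int in_explicit_ranges(int c)
{
    assert(c >= 0 && c <= 127);
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Return 1 if all three formulations agree on every ASCII code. */
int classes_agree(void)
{
    int c;

    for (c = 0; c <= 127; c++) {
        if (in_alnum(c) != in_alpha_or_digit(c) ||
            in_alnum(c) != in_explicit_ranges(c))
            return 0;
    }
    return 1;
}
```

.fi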
 522 .PP
 523 Some notes on patterns:
 524 .IP -
 525 A negated character class such as the example "[^A-Z]"
 526 above
 527 .I will match a newline
 528 unless "\\n" (or an equivalent escape sequence) is one of the
 529 characters explicitly present in the negated character class
 530 (e.g., "[^A-Z\\n]").
 531 This is unlike how many other regular
 532 expression tools treat negated character classes, but unfortunately
 533 the inconsistency is historically entrenched.
 534 Matching newlines means that a pattern like [^"]* can match the entire
 535 input unless there's another quote in the input.
 536 .IP -
 537 A rule can have at most one instance of trailing context (the '/' operator
 538 or the '$' operator).
The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern and, like '/' and '$',
cannot be grouped inside parentheses.
 542 A '^' which does not occur at
 543 the beginning of a rule or a '$' which does not occur at the end of
 544 a rule loses its special properties and is treated as a normal character.
 545 .IP
 546 The following are illegal:
 547 .nf
 548
 549     foo/bar$
 550     <sc1>foo<sc2>bar
 551
 552 .fi
Note that the first of these can be written "foo/bar\\n".
 554 .IP
 555 The following will result in '$' or '^' being treated as a normal character:
 556 .nf
 557
 558     foo|(bar$)
 559     foo|^bar
 560
 561 .fi
 562 If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
 563 could be used (the special '|' action is explained below):
 564 .nf
 565
 566     foo      |
 567     bar$     /* action goes here */
 568
 569 .fi
 570 A similar trick will work for matching a foo or a
 571 bar-at-the-beginning-of-a-line.
 572 .SH HOW THE INPUT IS MATCHED
 573 When the generated scanner is run, it analyzes its input looking
 574 for strings which match any of its patterns.
 575 If it finds more than
 576 one match, it takes the one matching the most text (for trailing
 577 context rules, this includes the length of the trailing part, even
 578 though it will then be returned to the input).
 579 If it finds two
 580 or more matches of the same length, the
 581 rule listed first in the
 582 .I flex
 583 input file is chosen.
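.PP
The selection rule (longest match first, then earliest rule) can be
sketched in plain C.
This is not how
.I flex
is implemented: the sketch assumes each rule is a literal string
instead of a regular expression, and the name choose_rule() is
hypothetical.
.nf

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustration only: each "rule" is a literal string rather than a
 * regular expression.  Return the index of the rule chosen for the
 * start of `input`, or -1 if no rule matches; the matched length is
 * stored in `*len`.
 */
int choose_rule(const char *input, const char *const rules[],
                size_t nrules, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(input != NULL && len != NULL);
    for (i = 0; i < nrules; i++) {
        size_t n = strlen(rules[i]);

        if (n == 0 || strncmp(input, rules[i], n) != 0)
            continue;    /* rule does not match here */
        /*
         * Strictly longer matches win.  On a tie the earlier
         * rule is kept, since `>` never replaces an equal length.
         */
        if (n > best_len) {
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

.fi
.PP
Given the rules "i", "if", "iffy", "if" in that order, the input
"iffy then" selects the third rule with a length of 4, and the input
"if x" selects the second rule, not the fourth.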
 584 .PP
 585 Once the match is determined, the text corresponding to the match
 586 (called the
 587 .I token)
 588 is made available in the global character pointer
 589 .B yytext,
 590 and its length in the global integer
 591 .B yyleng.
 592 The
 593 .I action
 594 corresponding to the matched pattern is then executed (a more
 595 detailed description of actions follows), and then the remaining
 596 input is scanned for another match.
 597 .PP
 598 If no match is found, then the
 599 .I default rule
 600 is executed: the next character in the input is considered matched and
 601 copied to the standard output.
 602 Thus, the simplest legal
 603 .I flex
 604 input is:
 605 .nf
 606
 607     %%
 608
 609 .fi
 610 which generates a scanner that simply copies its input (one character
 611 at a time) to its output.
 612 .PP
 613 Note that
 614 .B yytext
 615 can be defined in two different ways: either as a character
 616 .I pointer
 617 or as a character
 618 .I array.
 619 You can control which definition
 620 .I flex
 621 uses by including one of the special directives
 622 .B %pointer
 623 or
 624 .B %array
 625 in the first (definitions) section of your flex input.
 626 The default is
 627 .B %pointer,
 628 unless you use the
.B \-l
 630 lex compatibility option, in which case
 631 .B yytext
 632 will be an array.
 633 The advantage of using
 634 .B %pointer
 635 is substantially faster scanning and no buffer overflow when matching
 636 very large tokens (unless you run out of dynamic memory).
 637 The disadvantage
 638 is that you are restricted in how your actions can modify
 639 .B yytext
 640 (see the next section), and calls to the
 641 .B unput()
function destroy the present contents of
 643 .B yytext,
 644 which can be a considerable porting headache when moving between different
 645 .I lex
 646 versions.
 647 .PP
 648 The advantage of
 649 .B %array
 650 is that you can then modify
 651 .B yytext
 652 to your heart's content, and calls to
 653 .B unput()
 654 do not destroy
 655 .B yytext
 656 (see below).
 657 Furthermore, existing
 658 .I lex
 659 programs sometimes access
 660 .B yytext
 661 externally using declarations of the form:
 662 .nf
 663     extern char yytext[];
 664 .fi
 665 This definition is erroneous when used with
 666 .B %pointer,
 667 but correct for
 668 .B %array.
 669 .PP
 670 .B %array
 671 defines
 672 .B yytext
 673 to be an array of
 674 .B YYLMAX
 675 characters, which defaults to a fairly large value.
 676 You can change
 677 the size by simply #define'ing
 678 .B YYLMAX
 679 to a different value in the first section of your
 680 .I flex
 681 input.
 682 As mentioned above, with
 683 .B %pointer
 684 yytext grows dynamically to accommodate large tokens.
 685 While this means your
 686 .B %pointer
 687 scanner can accommodate very large tokens (such as matching entire blocks
 688 of comments), bear in mind that each time the scanner must resize
 689 .B yytext
 690 it also must rescan the entire token from the beginning, so matching such
 691 tokens can prove slow.
 692 .B yytext
 693 presently does
 694 .I not
 695 dynamically grow if a call to
 696 .B unput()
 697 results in too much text being pushed back; instead, a run-time error results.
 698 .PP
 699 Also note that you cannot use
 700 .B %array
 701 with C++ scanner classes
 702 (the
 703 .B c++
 704 option; see below).
 705 .SH ACTIONS
 706 Each pattern in a rule has a corresponding action, which can be any
 707 arbitrary C statement.
 708 The pattern ends at the first non-escaped
 709 whitespace character; the remainder of the line is its action.
 710 If the
 711 action is empty, then when the pattern is matched the input token
 712 is simply discarded.
 713 For example, here is the specification for a program
 714 which deletes all occurrences of "zap me" from its input:
 715 .nf
 716
 717     %%
 718     "zap me"
 719
 720 .fi
 721 (It will copy all other characters in the input to the output since
 722 they will be matched by the default rule.)
 723 .PP
 724 Here is a program which compresses multiple blanks and tabs down to
 725 a single blank, and throws away whitespace found at the end of a line:
 726 .nf
 727
 728     %%
 729     [ \\t]+        putchar( ' ' );
 730     [ \\t]+$       /* ignore this token */
 731
 732 .fi
 733 .PP
If the action contains a '{', then the action extends to the balancing '}'
and may span multiple lines.
 736 .I flex
 737 knows about C strings and comments and won't be fooled by braces found
 738 within them, but also allows actions to begin with
 739 .B %{
 740 and will consider the action to be all the text up to the next
 741 .B %}
 742 (regardless of ordinary braces inside the action).
 743 .PP
 744 An action consisting solely of a vertical bar ('|') means "same as
 745 the action for the next rule."  See below for an illustration.
 746 .PP
 747 Actions can include arbitrary C code, including
 748 .B return
 749 statements to return a value to whatever routine called
 750 .B yylex().
 751 Each time
 752 .B yylex()
 753 is called it continues processing tokens from where it last left
 754 off until it either reaches
 755 the end of the file or executes a return.
 756 .PP
 757 Actions are free to modify
 758 .B yytext
 759 except for lengthening it (adding
 760 characters to its end--these will overwrite later characters in the
 761 input stream).
 762 This however does not apply when using
 763 .B %array
 764 (see above); in that case,
 765 .B yytext
 766 may be freely modified in any way.
 767 .PP
 768 Actions are free to modify
 769 .B yyleng
 770 except they should not do so if the action also includes use of
 771 .B yymore()
 772 (see below).
 773 .PP
 774 There are a number of special directives which can be included within
 775 an action:
 776 .IP -
 777 .B ECHO
 778 copies yytext to the scanner's output.
 779 .IP -
 780 .B BEGIN
 781 followed by the name of a start condition places the scanner in the
 782 corresponding start condition (see below).
 783 .IP -
 784 .B REJECT
 785 directs the scanner to proceed on to the "second best" rule which matched the
 786 input (or a prefix of the input).
 787 The rule is chosen as described
 788 above in "How the Input is Matched", and
 789 .B yytext
 790 and
 791 .B yyleng
are set up appropriately.
 793 It may either be one which matched as much text
 794 as the originally chosen rule but came later in the
 795 .I flex
 796 input file, or one which matched less text.
 797 For example, the following will both count the
 798 words in the input and call the routine special() whenever "frob" is seen:
 799 .nf
 800
 801             int word_count = 0;
 802     %%
 803
 804     frob        special(); REJECT;
 805     [^ \\t\\n]+   ++word_count;
 806
 807 .fi
 808 Without the
 809 .B REJECT,
 810 any "frob"'s in the input would not be counted as words, since the
 811 scanner normally executes only one action per token.
 812 Multiple
 813 .B REJECT's
 814 are allowed, each one finding the next best choice to the currently
 815 active rule.
 816 For example, when the following scanner scans the token
 817 "abcd", it will write "abcdabcaba" to the output:
 818 .nf
 819
 820     %%
 821     a        |
 822     ab       |
 823     abc      |
 824     abcd     ECHO; REJECT;
 825     .|\\n     /* eat up any unmatched character */
 826
 827 .fi
 828 (The first three rules share the fourth's action since they use
 829 the special '|' action.)
 830 .B REJECT
 831 is a particularly expensive feature in terms of scanner performance;
 832 if it is used in
 833 .I any
 834 of the scanner's actions it will slow down
 835 .I all
 836 of the scanner's matching.
 837 Furthermore,
 838 .B REJECT
 839 cannot be used with the
.I \-Cf
or
.I \-CF
 843 options (see below).
 844 .IP
 845 Note also that unlike the other special actions,
 846 .B REJECT
 847 is a
 848 .I branch;
 849 code immediately following it in the action will
 850 .I not
 851 be executed.
 852 .IP -
 853 .B yymore()
 854 tells the scanner that the next time it matches a rule, the corresponding
 855 token should be
 856 .I appended
 857 onto the current value of
 858 .B yytext
 859 rather than replacing it.
 860 For example, given the input "mega-kludge"
 861 the following will write "mega-mega-kludge" to the output:
 862 .nf
 863
 864     %%
 865     mega-    ECHO; yymore();
 866     kludge   ECHO;
 867
 868 .fi
 869 First "mega-" is matched and echoed to the output.
 870 Then "kludge"
 871 is matched, but the previous "mega-" is still hanging around at the
 872 beginning of
 873 .B yytext
 874 so the
 875 .B ECHO
 876 for the "kludge" rule will actually write "mega-kludge".
 877 .PP
 878 Two notes regarding use of
 879 .B yymore().
 880 First,
 881 .B yymore()
 882 depends on the value of
 883 .I yyleng
 884 correctly reflecting the size of the current token, so you must not
 885 modify
 886 .I yyleng
 887 if you are using
 888 .B yymore().
 889 Second, the presence of
 890 .B yymore()
 891 in the scanner's action entails a minor performance penalty in the
 892 scanner's matching speed.
 893 .IP -
 894 .B yyless(n)
 895 returns all but the first
 896 .I n
 897 characters of the current token back to the input stream, where they
 898 will be rescanned when the scanner looks for the next match.
 899 .B yytext
 900 and
 901 .B yyleng
 902 are adjusted appropriately (e.g.,
 903 .B yyleng
will now be equal to
.IR n ).
 907 For example, on the input "foobar" the following will write out
 908 "foobarbar":
 909 .nf
 910
 911     %%
 912     foobar    ECHO; yyless(3);
 913     [a-z]+    ECHO;
 914
 915 .fi
 916 An argument of 0 to
 917 .B yyless
 918 will cause the entire current input string to be scanned again.
 919 Unless you've
 920 changed how the scanner will subsequently process its input (using
 921 .B BEGIN,
 922 for example), this will result in an endless loop.
 923 .PP
 924 Note that
 925 .B yyless
 926 is a macro and can only be used in the flex input file, not from
 927 other source files.
 928 .IP -
 929 .B unput(c)
 930 puts the character
 931 .I c
 932 back onto the input stream.
 933 It will be the next character scanned.
 934 The following action will take the current token and cause it
 935 to be rescanned enclosed in parentheses.
 936 .nf
 937
 938     {
 939     int i;
 940     /* Copy yytext because unput() trashes yytext */
 941     char *yycopy = strdup( yytext );
 942     unput( ')' );
 943     for ( i = yyleng - 1; i >= 0; --i )
 944         unput( yycopy[i] );
 945     unput( '(' );
 946     free( yycopy );
 947     }
 948
 949 .fi
 950 Note that since each
 951 .B unput()
 952 puts the given character back at the
 953 .I beginning
 954 of the input stream, pushing back strings must be done back-to-front.
 955 .PP
 956 An important potential problem when using
 957 .B unput()
 958 is that if you are using
 959 .B %pointer
 960 (the default), a call to
 961 .B unput()
 962 .I destroys
 963 the contents of
 964 .I yytext,
 965 starting with its rightmost character and devouring one character to
 966 the left with each call.
 967 If you need the value of yytext preserved
 968 after a call to
 969 .B unput()
 970 (as in the above example),
 971 you must either first copy it elsewhere, or build your scanner using
 972 .B %array
 973 instead (see How The Input Is Matched).
 974 .PP
 975 Finally, note that you cannot put back
 976 .B EOF
 977 to attempt to mark the input stream with an end-of-file.
 978 .IP -
 979 .B input()
 980 reads the next character from the input stream.
 981 For example,
 982 the following is one way to eat up C comments:
 983 .nf
 984
 985     %%
 986     "/*"        {
 987                 int c;
 988
 989                 for ( ; ; )
 990                     {
 991                     while ( (c = input()) != '*' &&
 992                             c != EOF )
 993                         ;    /* eat up text of comment */
 994
 995                     if ( c == '*' )
 996                         {
 997                         while ( (c = input()) == '*' )
 998                             ;
 999                         if ( c == '/' )
1000                             break;    /* found the end */
1001                         }
1002
1003                     if ( c == EOF )
1004                         {
1005                         error( "EOF in comment" );
1006                         break;
1007                         }
1008                     }
1009                 }
1010
1011 .fi
1012 (Note that if the scanner is compiled using
1013 .B C++,
1014 then
1015 .B input()
1016 is instead referred to as
1017 .B yyinput(),
1018 in order to avoid a name clash with the
1019 .B C++
1020 stream by the name of
1021 .I input.)
1022 .IP -
1023 .B YY_FLUSH_BUFFER
1024 flushes the scanner's internal buffer
1025 so that the next time the scanner attempts to match a token, it will
1026 first refill the buffer using
1027 .B YY_INPUT
1028 (see The Generated Scanner, below).
1029 This action is a special case
1030 of the more general
1031 .B yy_flush_buffer()
1032 function, described below in the section Multiple Input Buffers.
1033 .IP -
1034 .B yyterminate()
1035 can be used in lieu of a return statement in an action.
1036 It terminates
1037 the scanner and returns a 0 to the scanner's caller, indicating "all done".
1038 By default,
1039 .B yyterminate()
1040 is also called when an end-of-file is encountered.
1041 It is a macro and may be redefined.
1042 .SH THE GENERATED SCANNER
1043 The output of
1044 .I flex
1045 is the file
1046 .B lex.yy.c,
1047 which contains the scanning routine
1048 .B yylex(),
1049 a number of tables used by it for matching tokens, and a number
1050 of auxiliary routines and macros.
1051 By default,
1052 .B yylex()
1053 is declared as follows:
1054 .nf
1055
1056     int yylex()
1057         {
1058         ... various definitions and the actions in here ...
1059         }
1060
1061 .fi
1062 (If your environment supports function prototypes, then it will
1063 be "int yylex( void )".)  This definition may be changed by defining
1064 the "YY_DECL" macro.
1065 For example, you could use:
1066 .nf
1067
1068     #define YY_DECL float lexscan( a, b ) float a, b;
1069
1070 .fi
1071 to give the scanning routine the name
1072 .I lexscan,
1073 returning a float, and taking two floats as arguments.
1074 Note that
1075 if you give arguments to the scanning routine using a
1076 K&R-style/non-prototyped function declaration, you must terminate
1077 the definition with a semi-colon (;).
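.PP
With a prototype-style definition no trailing semi-colon is needed.
The following is a minimal sketch in plain C of how such a
.B YY_DECL
expands; the function body is a hypothetical stand-in for the code that
.I flex
would generate, and adding the two arguments is purely illustrative:
.nf

```c
    /* Prototype-style counterpart of the K&R-style definition above. */
    #define YY_DECL float lexscan( float a, float b )

    /* Stand-in body (an assumption); flex emits the real scanner here. */
    YY_DECL
        {
        return a + b;
        }
```

.fi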
1078 .PP
1079 Whenever
1080 .B yylex()
1081 is called, it scans tokens from the global input file
1082 .I yyin
1083 (which defaults to stdin).
1084 It continues until it either reaches
1085 an end-of-file (at which point it returns the value 0) or
1086 one of its actions executes a
1087 .I return
1088 statement.
1089 .PP
1090 If the scanner reaches an end-of-file, subsequent calls are undefined
1091 unless either
1092 .I yyin
1093 is pointed at a new input file (in which case scanning continues from
1094 that file), or
1095 .B yyrestart()
1096 is called.
1097 .B yyrestart()
1098 takes one argument, a
1099 .B FILE *
1100 pointer (which can be nil, if you've set up
1101 .B YY_INPUT
1102 to scan from a source other than
1103 .I yyin),
1104 and initializes
1105 .I yyin
1106 for scanning from that file.
1107 Essentially there is no difference between
1108 just assigning
1109 .I yyin
to a new input file and using
1111 .B yyrestart()
1112 to do so; the latter is available for compatibility with previous versions
1113 of
1114 .I flex,
1115 and because it can be used to switch input files in the middle of scanning.
1116 It can also be used to throw away the current input buffer, by calling
1117 it with an argument of
1118 .I yyin;
but it is better to use
1120 .B YY_FLUSH_BUFFER
1121 (see above).
1122 Note that
1123 .B yyrestart()
1124 does
1125 .I not
1126 reset the start condition to
1127 .B INITIAL
1128 (see Start Conditions, below).
1129 .PP
1130 If
1131 .B yylex()
1132 stops scanning due to executing a
1133 .I return
1134 statement in one of the actions, the scanner may then be called again and it
1135 will resume scanning where it left off.
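.PP
This calling convention leads to a simple driver loop.
The sketch below is plain C, not generated code: the
.B yylex()
shown is a hypothetical stub that hands out three made-up token codes
and then 0, standing in for a real scanner, and
.I count_tokens()
is an illustrative name.
.nf

```c
    /* Stub scanner (an assumption): three tokens, then 0 for end-of-file. */
    static const int stub_tokens[] = { 258, 259, 258, 0 };
    static int stub_pos = 0;

    int yylex( void )
        {
        int tok = stub_tokens[stub_pos];

        if ( tok != 0 )
            ++stub_pos;    /* the next call resumes after this token */
        return tok;
        }

    /* Call yylex() until it reports end-of-file by returning 0. */
    int count_tokens( void )
        {
        int n = 0;

        while ( yylex() != 0 )
            ++n;
        return n;
        }
```

.fi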
1136 .PP
1137 By default (and for purposes of efficiency), the scanner uses
1138 block-reads rather than simple
1139 .I getc()
1140 calls to read characters from
1141 .I yyin.
1142 The nature of how it gets its input can be controlled by defining the
1143 .B YY_INPUT
1144 macro.
1145 YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".
1146 Its action is to place up to
1147 .I max_size
1148 characters in the character array
1149 .I buf
1150 and return in the integer variable
1151 .I result
1152 either the
1153 number of characters read or the constant YY_NULL (0 on Unix systems)
1154 to indicate EOF.
1155 The default YY_INPUT reads from the
1156 global file-pointer "yyin".
1157 .PP
1158 A sample definition of YY_INPUT (in the definitions
1159 section of the input file):
1160 .nf
1161
1162     %{
1163     #define YY_INPUT(buf,result,max_size) \\
1164         { \\
1165         int c = getchar(); \\
1166         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
1167         }
1168     %}
1169
1170 .fi
1171 This definition will change the input processing to occur
1172 one character at a time.
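.PP
.B YY_INPUT
need not read one character at a time.
As a further sketch, the following plain C hands out up to
.I max_size
characters per call from an in-memory string.
The names
.I input_text,
.I input_pos
and
.I read_chunk()
are hypothetical, and
.B YY_NULL
is defined here only so that the fragment stands alone; in a real scanner
.I flex
supplies it and the definitions belong in the definitions section.
.nf

```c
    #include <string.h>

    #ifndef YY_NULL
    #define YY_NULL 0    /* stand-in; flex normally defines this */
    #endif

    static const char *input_text = "hello world";    /* hypothetical source */
    static size_t input_pos = 0;

    /* Copy up to max_size pending characters into buf; YY_NULL when none remain. */
    static int read_chunk( char *buf, int max_size )
        {
        size_t left = strlen( input_text ) - input_pos;
        size_t n = left < (size_t) max_size ? left : (size_t) max_size;

        if ( n == 0 )
            return YY_NULL;
        memcpy( buf, input_text + input_pos, n );
        input_pos += n;
        return (int) n;
        }

    #define YY_INPUT(buf,result,max_size) ( (result) = read_chunk( (buf), (int) (max_size) ) )
```

.fi
Writing the macro as a single expression keeps it on one line,
so no line continuations are needed.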
1173 .PP
1174 When the scanner receives an end-of-file indication from YY_INPUT,
1175 it then checks the
1176 .B yywrap()
1177 function.
1178 If
1179 .B yywrap()
1180 returns false (zero), then it is assumed that the
1181 function has gone ahead and set up
1182 .I yyin
1183 to point to another input file, and scanning continues.
1184 If it returns
1185 true (non-zero), then the scanner terminates, returning 0 to its
1186 caller.
1187 Note that in either case, the start condition remains unchanged;
1188 it does
1189 .I not
1190 revert to
1191 .B INITIAL.
1192 .PP
1193 If you do not supply your own version of
1194 .B yywrap(),
1195 then you must either use
1196 .B %option noyywrap
1197 (in which case the scanner behaves as though
1198 .B yywrap()
1199 returned 1), or you must link with
1200 .B \-ll
1201 to obtain the default version of the routine, which always returns 1.
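.PP
A user-supplied
.B yywrap()
that steps through a list of input files might look like the sketch below.
It is plain C under stated assumptions: the list
.I next_file
and its initial empty value are hypothetical, and
.I yyin
is declared here only so that the fragment stands alone
(a real scanner gets it from
.I flex).
.nf

```c
    #include <stdio.h>

    FILE *yyin;    /* stand-in; a real scanner gets this from flex */

    /* Hypothetical NULL-terminated list of the files still to be scanned. */
    static const char *const no_files[] = { NULL };
    static const char *const *next_file = no_files;

    /* Return 0 after pointing yyin at the next readable file, 1 if none is left. */
    int yywrap( void )
        {
        while ( *next_file != NULL )
            {
            FILE *fp = fopen( *next_file++, "r" );

            if ( fp != NULL )
                {
                if ( yyin != NULL && yyin != stdin )
                    fclose( yyin );    /* done with the previous file */
                yyin = fp;
                return 0;    /* more input: scanning continues */
                }
            }
        return 1;    /* no more input: the scanner terminates */
        }
```

.fi
Files that cannot be opened are silently skipped in this sketch;
a real scanner would probably report them.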
1202 .PP
1203 Three routines are available for scanning from in-memory buffers rather
1204 than files:
1205 .B yy_scan_string(), yy_scan_bytes(),
1206 and
1207 .B yy_scan_buffer().
1208 See the discussion of them below in the section Multiple Input Buffers.
1209 .PP
1210 The scanner writes its
1211 .B ECHO
1212 output to the
1213 .I yyout
global (by default, stdout), which may be redefined by the user simply
1215 by assigning it to some other
1216 .B FILE
1217 pointer.
1218 .SH START CONDITIONS
1219 .I flex
1220 provides a mechanism for conditionally activating rules.
1221 Any rule
1222 whose pattern is prefixed with "<sc>" will only be active when
1223 the scanner is in the start condition named "sc".
1224 For example,
1225 .nf
1226
1227     <STRING>[^"]*        { /* eat up the string body ... */
1228                 ...
1229                 }
1230
1231 .fi
1232 will be active only when the scanner is in the "STRING" start
1233 condition, and
1234 .nf
1235
1236     <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
1237                 ...
1238                 }
1239
1240 .fi
1241 will be active only when the current start condition is
1242 either "INITIAL", "STRING", or "QUOTE".
1243 .PP
1244 Start conditions
1245 are declared in the definitions (first) section of the input
1246 using unindented lines beginning with either
1247 .B %s
1248 or
1249 .B %x
1250 followed by a list of names.
1251 The former declares
1252 .I inclusive
1253 start conditions, the latter
1254 .I exclusive
1255 start conditions.
1256 A start condition is activated using the
1257 .B BEGIN
1258 action.
1259 Until the next
1260 .B BEGIN
1261 action is executed, rules with the given start
1262 condition will be active and
1263 rules with other start conditions will be inactive.
1264 If the start condition is
1265 .I inclusive,
1266 then rules with no start conditions at all will also be active.
1267 If it is
1268 .I exclusive,
1269 then
1270 .I only
1271 rules qualified with the start condition will be active.
1272 A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
1274 .I flex
1275 input.
1276 Because of this,
1277 exclusive start conditions make it easy to specify "mini-scanners"
1278 which scan portions of the input that are syntactically different
1279 from the rest (e.g., comments).
1280 .PP
1281 If the distinction between inclusive and exclusive start conditions
1282 is still a little vague, here's a simple example illustrating the
1283 connection between the two.
1284 The set of rules:
1285 .nf
1286
1287     %s example
1288     %%
1289
1290     <example>foo   do_something();
1291
1292     bar            something_else();
1293
1294 .fi
1295 is equivalent to
1296 .nf
1297
1298     %x example
1299     %%
1300
1301     <example>foo   do_something();
1302
1303     <INITIAL,example>bar    something_else();
1304
1305 .fi
1306 Without the
1307 .B <INITIAL,example>
1308 qualifier, the
1309 .I bar
1310 pattern in the second example wouldn't be active (i.e., couldn't match)
1311 when in start condition
1312 .B example.
1313 If we just used
1314 .B <example>
1315 to qualify
1316 .I bar,
1317 though, then it would only be active in
1318 .B example
1319 and not in
1320 .B INITIAL,
1321 while in the first example it's active in both, because in the first
1322 example the
1323 .B example
1324 start condition is an
1325 .I inclusive
1326 .B (%s)
1327 start condition.
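.PP
The activation rule described above can be modeled by a small predicate.
This is an illustrative model in plain C, not code taken from
.I flex;
the function name and the bit-mask encoding of start conditions
are assumptions made for the sketch.
.nf

```c
    /*
     * Model of rule activation (illustrative only).
     * current        - number of the current start condition, 0 for INITIAL
     * exclusive_mask - bit i is set if condition i was declared with %x
     * rule_mask      - bit i is set if the rule is qualified with condition i;
     *                  0 means the rule has no start conditions at all
     */
    int rule_is_active( int current, unsigned exclusive_mask, unsigned rule_mask )
        {
        unsigned bit = 1u << current;

        if ( rule_mask == 0 )
            return ( exclusive_mask & bit ) == 0;    /* INITIAL or inclusive */
        return ( rule_mask & bit ) != 0;
        }
```

.fi
With INITIAL as condition 0 and "example" as condition 1, the unqualified
.I bar
rule is active in "example" when the exclusive mask is 0 (the
.B %s
case) and inactive when it is 2 (the
.B %x
case).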
1328 .PP
1329 Also note that the special start-condition specifier
1330 .B <*>
1331 matches every start condition.
Thus, the above example could also have been written:
1333 .nf
1334
1335     %x example
1336     %%
1337
1338     <example>foo   do_something();
1339
1340     <*>bar    something_else();
1341
1342 .fi
1343 .PP
1344 The default rule (to
1345 .B ECHO
1346 any unmatched character) remains active in start conditions.
1347 It
1348 is equivalent to:
1349 .nf
1350
1351     <*>.|\\n     ECHO;
1352
1353 .fi
1354 .PP
1355 .B BEGIN(0)
1356 returns to the original state where only the rules with
1357 no start conditions are active.
1358 This state can also be
1359 referred to as the start-condition "INITIAL", so
1360 .B BEGIN(INITIAL)
1361 is equivalent to
1362 .B BEGIN(0).
1363 (The parentheses around the start condition name are not required but
1364 are considered good style.)
1365 .PP
1366 .B BEGIN
1367 actions can also be given as indented code at the beginning
1368 of the rules section.
1369 For example, the following will cause
1370 the scanner to enter the "SPECIAL" start condition whenever
1371 .B yylex()
1372 is called and the global variable
1373 .I enter_special
1374 is true:
1375 .nf
1376
1377             int enter_special;
1378
1379     %x SPECIAL
1380     %%
1381             if ( enter_special )
1382                 BEGIN(SPECIAL);
1383
1384     <SPECIAL>blahblahblah
1385     ...more rules follow...
1386
1387 .fi
1388 .PP
1389 To illustrate the uses of start conditions,
1390 here is a scanner which provides two different interpretations
1391 of a string like "123.456".
1392 By default it will treat it as
1393 three tokens, the integer "123", a dot ('.'), and the integer "456".
1394 But if the string is preceded earlier in the line by the string
1395 "expect-floats"
1396 it will treat it as a single token, the floating-point number
1397 123.456:
1398 .nf
1399
1400     %{
1401     #include <math.h>
1402     %}
1403     %s expect
1404
1405     %%
1406     expect-floats        BEGIN(expect);
1407
1408     <expect>[0-9]+"."[0-9]+      {
1409                 printf( "found a float, = %f\\n",
1410                         atof( yytext ) );
1411                 }
1412     <expect>\\n           {
1413                 /* that's the end of the line, so
                 * we need another "expect-floats"
1415                  * before we'll recognize any more
1416                  * numbers
1417                  */
1418                 BEGIN(INITIAL);
1419                 }
1420
1421     [0-9]+      {
1422                 printf( "found an integer, = %d\\n",
1423                         atoi( yytext ) );
1424                 }
1425
1426     "."         printf( "found a dot\\n" );
1427
1428 .fi
1429 Here is a scanner which recognizes (and discards) C comments while
1430 maintaining a count of the current input line.
1431 .nf
1432
1433     %x comment
1434     %%
1435             int line_num = 1;
1436
1437     "/*"         BEGIN(comment);
1438
1439     <comment>[^*\\n]*        /* eat anything that's not a '*' */
1440     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
1441     <comment>\\n             ++line_num;
1442     <comment>"*"+"/"        BEGIN(INITIAL);
1443
1444 .fi
1445 This scanner goes to a bit of trouble to match as much
1446 text as possible with each rule.
1447 In general, when attempting to write
a high-speed scanner, try to match as much as possible in each rule, as
1449 it's a big win.
1450 .PP
Note that start condition names are really integer values and
1452 can be stored as such.
1453 Thus, the above could be extended in the
1454 following fashion:
1455 .nf
1456
1457     %x comment foo
1458     %%
1459             int line_num = 1;
1460             int comment_caller;
1461
1462     "/*"         {
1463                  comment_caller = INITIAL;
1464                  BEGIN(comment);
1465                  }
1466
1467     ...
1468
1469     <foo>"/*"    {
1470                  comment_caller = foo;
1471                  BEGIN(comment);
1472                  }
1473
1474     <comment>[^*\\n]*        /* eat anything that's not a '*' */
1475     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
1476     <comment>\\n             ++line_num;
1477     <comment>"*"+"/"        BEGIN(comment_caller);
1478
1479 .fi
1480 Furthermore, you can access the current start condition using
1481 the integer-valued
1482 .B YY_START
1483 macro.
1484 For example, the above assignments to
1485 .I comment_caller
1486 could instead be written
1487 .nf
1488
1489     comment_caller = YY_START;
1490
1491 .fi
1492 Flex provides
1493 .B YYSTATE
1494 as an alias for
1495 .B YY_START
1496 (since that is what's used by AT&T
1497 .I lex).
1498 .PP
1499 Note that start conditions do not have their own name-space; %s's and %x's
1500 declare names in the same fashion as #define's.
1501 .PP
1502 Finally, here's an example of how to match C-style quoted strings using
1503 exclusive start conditions, including expanded escape sequences (but
1504 not including checking for a string that's too long):
1505 .nf
1506
1507     %x str
1508
1509     %%
1510             char string_buf[MAX_STR_CONST];
1511             char *string_buf_ptr;
1512
1513
1514     \\"      string_buf_ptr = string_buf; BEGIN(str);
1515
1516     <str>\\"        { /* saw closing quote - all done */
1517             BEGIN(INITIAL);
1518             *string_buf_ptr = '\\0';
1519             /* return string constant token type and
1520              * value to parser
1521              */
1522             }
1523
1524     <str>\\n        {
1525             /* error - unterminated string constant */
1526             /* generate error message */
1527             }
1528
1529     <str>\\\\[0-7]{1,3} {
1530             /* octal escape sequence */
            unsigned int result;

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = result;
1539             }
1540
1541     <str>\\\\[0-9]+ {
1542             /* generate error - bad escape sequence; something
1543              * like '\\48' or '\\0777777'
1544              */
1545             }
1546
1547     <str>\\\\n  *string_buf_ptr++ = '\\n';
1548     <str>\\\\t  *string_buf_ptr++ = '\\t';
1549     <str>\\\\r  *string_buf_ptr++ = '\\r';
1550     <str>\\\\b  *string_buf_ptr++ = '\\b';
1551     <str>\\\\f  *string_buf_ptr++ = '\\f';
1552
1553     <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];
1554
1555     <str>[^\\\\\\n\\"]+        {
1556             char *yptr = yytext;
1557
1558             while ( *yptr )
1559                     *string_buf_ptr++ = *yptr++;
1560             }
1561
1562 .fi
1563 .PP
1564 Often, such as in some of the examples above, you wind up writing a
1565 whole bunch of rules all preceded by the same start condition(s).
1566 Flex makes this a little easier and cleaner by introducing a notion of
1567 start condition
1568 .I scope.
1569 A start condition scope is begun with:
1570 .nf
1571
1572     <SCs>{
1573
1574 .fi
1575 where
1576 .I SCs
1577 is a list of one or more start conditions.
1578 Inside the start condition
1579 scope, every rule automatically has the prefix
1580 .I <SCs>
1581 applied to it, until a
1582 .I '}'
1583 which matches the initial
1584 .I '{'.
1585 So, for example,
1586 .nf
1587
1588     <ESC>{
1589         "\\\\n"   return '\\n';
1590         "\\\\r"   return '\\r';
1591         "\\\\f"   return '\\f';
1592         "\\\\0"   return '\\0';
1593     }
1594
1595 .fi
1596 is equivalent to:
1597 .nf
1598
1599     <ESC>"\\\\n"  return '\\n';
1600     <ESC>"\\\\r"  return '\\r';
1601     <ESC>"\\\\f"  return '\\f';
1602     <ESC>"\\\\0"  return '\\0';
1603
1604 .fi
1605 Start condition scopes may be nested.
1606 .PP
1607 Three routines are available for manipulating stacks of start conditions:
1608 .TP
1609 .B void yy_push_state(int new_state)
1610 pushes the current start condition onto the top of the start condition
1611 stack and switches to
1612 .I new_state
1613 as though you had used
1614 .B BEGIN new_state
1615 (recall that start condition names are also integers).
1616 .TP
1617 .B void yy_pop_state()
1618 pops the top of the stack and switches to it via
1619 .B BEGIN.
1620 .TP
1621 .B int yy_top_state()
1622 returns the top of the stack without altering the stack's contents.
1623 .PP
1624 The start condition stack grows dynamically and so has no built-in
1625 size limitation.
1626 If memory is exhausted, program execution aborts.
1627 .PP
1628 To use start condition stacks, your scanner must include a
1629 .B %option stack
1630 directive (see Options below).
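.PP
The behavior of the three stack routines can be modeled in plain C
as follows.
This is a sketch of the semantics described above, not the code that
.I flex
generates; the names
.I sc_push(),
.I sc_pop(),
.I sc_top()
and
.I sc_current
are hypothetical, as are the initial allocation of 8 entries
and the handling of an empty stack.
.nf

```c
    #include <stdlib.h>

    static int sc_current = 0;      /* current start condition; 0 is INITIAL */
    static int *sc_stack = NULL;    /* saved start conditions */
    static size_t sc_depth = 0;     /* entries in use */
    static size_t sc_alloc = 0;     /* entries allocated */

    /* Save the current start condition, then switch to new_state. */
    void sc_push( int new_state )
        {
        if ( sc_depth == sc_alloc )
            {
            size_t want = sc_alloc ? sc_alloc * 2 : 8;
            int *grown = realloc( sc_stack, want * sizeof( int ) );

            if ( grown == NULL )
                abort();    /* memory exhausted */
            sc_stack = grown;
            sc_alloc = want;
            }
        sc_stack[sc_depth++] = sc_current;
        sc_current = new_state;
        }

    /* Switch back to the most recently saved start condition. */
    void sc_pop( void )
        {
        if ( sc_depth == 0 )
            abort();    /* nothing to pop; a choice made for this sketch */
        sc_current = sc_stack[--sc_depth];
        }

    /* Return the most recently saved start condition without popping it. */
    int sc_top( void )
        {
        return sc_depth ? sc_stack[sc_depth - 1] : 0;    /* 0 if empty */
        }
```

.fi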
1631 .SH MULTIPLE INPUT BUFFERS
1632 Some scanners (such as those which support "include" files)
1633 require reading from several input streams.
1634 As
1635 .I flex
1636 scanners do a large amount of buffering, one cannot control
1637 where the next input will be read from by simply writing a
1638 .B YY_INPUT
1639 which is sensitive to the scanning context.
1640 .B YY_INPUT
1641 is only called when the scanner reaches the end of its buffer, which
1642 may be a long time after scanning a statement such as an "include"
1643 which requires switching the input source.
1644 .PP
1645 To negotiate these sorts of problems,
1646 .I flex
1647 provides a mechanism for creating and switching between multiple
1648 input buffers.
1649 An input buffer is created by using:
1650 .nf
1651
1652     YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1653
1654 .fi
1655 which takes a
1656 .I FILE
1657 pointer and a size and creates a buffer associated with the given
1658 file and large enough to hold
1659 .I size
1660 characters (when in doubt, use
1661 .B YY_BUF_SIZE
1662 for the size).
1663 It returns a
1664 .B YY_BUFFER_STATE
1665 handle, which may then be passed to other routines (see below).
1666 The
1667 .B YY_BUFFER_STATE
1668 type is a pointer to an opaque
1669 .B struct yy_buffer_state
1670 structure, so you may safely initialize YY_BUFFER_STATE variables to
1671 .B ((YY_BUFFER_STATE) 0)
1672 if you wish, and also refer to the opaque structure in order to
1673 correctly declare input buffers in source files other than that
1674 of your scanner.
1675 Note that the
1676 .I FILE
1677 pointer in the call to
1678 .B yy_create_buffer
1679 is only used as the value of
1680 .I yyin
1681 seen by
1682 .B YY_INPUT;
1683 if you redefine
1684 .B YY_INPUT
1685 so it no longer uses
1686 .I yyin,
1687 then you can safely pass a nil
1688 .I FILE
1689 pointer to
1690 .B yy_create_buffer.
1691 You select a particular buffer to scan from using:
1692 .nf
1693
1694     void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1695
1696 .fi
This switches the scanner's input buffer so that subsequent tokens will
1698 come from
1699 .I new_buffer.
1700 Note that
1701 .B yy_switch_to_buffer()
1702 may be used by yywrap() to set things up for continued scanning, instead
1703 of opening a new file and pointing
1704 .I yyin
1705 at it.
1706 Note also that switching input sources via either
1707 .B yy_switch_to_buffer()
1708 or
1709 .B yywrap()
1710 does
1711 .I not
1712 change the start condition.
1713 .nf
1714
1715     void yy_delete_buffer( YY_BUFFER_STATE buffer )
1716
1717 .fi
1718 is used to reclaim the storage associated with a buffer.
.RB ( buffer
1721 can be nil, in which case the routine does nothing.)
1722 You can also clear the current contents of a buffer using:
1723 .nf
1724
1725     void yy_flush_buffer( YY_BUFFER_STATE buffer )
1726
1727 .fi
1728 This function discards the buffer's contents,
1729 so the next time the scanner attempts to match a token from the
1730 buffer, it will first fill the buffer anew using
1731 .B YY_INPUT.
1732 .PP
1733 .B yy_new_buffer()
1734 is an alias for
1735 .B yy_create_buffer(),
1736 provided for compatibility with the C++ use of
1737 .I new
1738 and
1739 .I delete
1740 for creating and destroying dynamic objects.
1741 .PP
1742 Finally, the
1743 .B YY_CURRENT_BUFFER
1744 macro returns a
1745 .B YY_BUFFER_STATE
1746 handle to the current buffer.
1747 .PP
1748 Here is an example of using these features for writing a scanner
1749 which expands include files (the
1750 .B <<EOF>>
1751 feature is discussed below):
1752 .nf
1753
1754     /* the "incl" state is used for picking up the name
1755      * of an include file
1756      */
1757     %x incl
1758
1759     %{
1760     #define MAX_INCLUDE_DEPTH 10
1761     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1762     int include_stack_ptr = 0;
1763     %}
1764
1765     %%
1766     include             BEGIN(incl);
1767
1768     [a-z]+              ECHO;
1769     [^a-z\\n]*\\n?        ECHO;
1770
1771     <incl>[ \\t]*      /* eat the whitespace */
1772     <incl>[^ \\t\\n]+   { /* got the include file name */
1773             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1774                 {
                fprintf( stderr, "Includes nested too deeply\\n" );
1776                 exit( 1 );
1777                 }
1778
1779             include_stack[include_stack_ptr++] =
1780                 YY_CURRENT_BUFFER;
1781
1782             yyin = fopen( yytext, "r" );
1783
1784             if ( ! yyin )
1785                 error( ... );
1786
1787             yy_switch_to_buffer(
1788                 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1789
1790             BEGIN(INITIAL);
1791             }
1792
1793     <<EOF>> {
1794             if ( --include_stack_ptr < 0 )
1795                 {
1796                 yyterminate();
1797                 }
1798
1799             else
1800                 {
1801                 yy_delete_buffer( YY_CURRENT_BUFFER );
1802                 yy_switch_to_buffer(
1803                      include_stack[include_stack_ptr] );
1804                 }
1805             }
1806
1807 .fi
1808 Three routines are available for setting up input buffers for
1809 scanning in-memory strings instead of files.
1810 All of them create
1811 a new input buffer for scanning the string, and return a corresponding
1812 .B YY_BUFFER_STATE
1813 handle (which you should delete with
1814 .B yy_delete_buffer()
1815 when done with it).
1816 They also switch to the new buffer using
1817 .B yy_switch_to_buffer(),
1818 so the next call to
1819 .B yylex()
1820 will start scanning the string.
1821 .TP
1822 .B yy_scan_string(const char *str)
1823 scans a NUL-terminated string.
1824 .TP
1825 .B yy_scan_bytes(const char *bytes, int len)
1826 scans
1827 .I len
1828 bytes (including possibly NUL's)
1829 starting at location
1830 .I bytes.
1831 .PP
1832 Note that both of these functions create and scan a
1833 .I copy
1834 of the string or bytes.
1835 (This may be desirable, since
1836 .B yylex()
1837 modifies the contents of the buffer it is scanning.)  You can avoid the
1838 copy by using:
1839 .TP
1840 .B yy_scan_buffer(char *base, yy_size_t size)
1841 which scans in place the buffer starting at
1842 .I base,
1843 consisting of
1844 .I size
1845 bytes, the last two bytes of which
1846 .I must
1847 be
1848 .B YY_END_OF_BUFFER_CHAR
1849 (ASCII NUL).
1850 These last two bytes are not scanned; thus, scanning
1851 consists of
1852 .B base[0]
1853 through
.B base[size-3],
1855 inclusive.
1856 .IP
1857 If you fail to set up
1858 .I base
1859 in this manner (i.e., forget the final two
1860 .B YY_END_OF_BUFFER_CHAR
1861 bytes), then
1862 .B yy_scan_buffer()
1863 returns a nil pointer instead of creating a new input buffer.
1864 .IP
1865 The type
1866 .B yy_size_t
1867 is an integral type to which you can cast an integer expression
1868 reflecting the size of the buffer.
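.PP
Preparing a buffer in the layout that
.B yy_scan_buffer()
expects could be sketched in plain C as follows.
The helper
.I make_scan_buffer()
is hypothetical, and
.B YY_END_OF_BUFFER_CHAR
is defined here only so that the fragment stands alone; in a real scanner
.I flex
supplies it.
.nf

```c
    #include <stdlib.h>
    #include <string.h>

    #ifndef YY_END_OF_BUFFER_CHAR
    #define YY_END_OF_BUFFER_CHAR 0    /* stand-in; flex normally defines this */
    #endif

    /*
     * Hypothetical helper: copy text into a fresh buffer that ends in the
     * two sentinel bytes and store the total size in *size_out.
     * Returns NULL if memory cannot be allocated.
     */
    char *make_scan_buffer( const char *text, size_t *size_out )
        {
        size_t len = strlen( text );
        char *base = malloc( len + 2 );

        if ( base == NULL )
            return NULL;
        memcpy( base, text, len );
        base[len] = YY_END_OF_BUFFER_CHAR;
        base[len + 1] = YY_END_OF_BUFFER_CHAR;
        *size_out = len + 2;
        return base;
        }
```

.fi
The returned pointer and size are meant to be passed to
.B yy_scan_buffer()
as
.I base
and
.I size.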
1869 .SH END-OF-FILE RULES
1870 The special rule "<<EOF>>" indicates
1871 actions which are to be taken when an end-of-file is
1872 encountered and yywrap() returns non-zero (i.e., indicates
1873 no further files to process).
1874 The action must finish
1875 by doing one of four things:
1876 .IP -
1877 assigning
1878 .I yyin
1879 to a new input file (in previous versions of flex, after doing the
1880 assignment you had to call the special action
1881 .B YY_NEW_FILE;
1882 this is no longer necessary);
1883 .IP -
1884 executing a
1885 .I return
1886 statement;
1887 .IP -
1888 executing the special
1889 .B yyterminate()
1890 action;
1891 .IP -
1892 or, switching to a new buffer using
1893 .B yy_switch_to_buffer()
1894 as shown in the example above.
1895 .PP
1896 <<EOF>> rules may not be used with other
1897 patterns; they may only be qualified with a list of start
1898 conditions.
1899 If an unqualified <<EOF>> rule is given, it
1900 applies to
1901 .I all
1902 start conditions which do not already have <<EOF>> actions.
1903 To
1904 specify an <<EOF>> rule for only the initial start condition, use
1905 .nf
1906
1907     <INITIAL><<EOF>>
1908
1909 .fi
1910 .PP
1911 These rules are useful for catching things like unclosed comments.
1912 An example:
1913 .nf
1914
1915     %x quote
1916     %%
1917
1918     ...other rules for dealing with quotes...
1919
1920     <quote><<EOF>>   {
1921              error( "unterminated quote" );
1922              yyterminate();
1923              }
1924     <<EOF>>  {
1925              if ( *++filelist )
1926                  yyin = fopen( *filelist, "r" );
1927              else
1928                 yyterminate();
1929              }
1930
1931 .fi
1932 .SH MISCELLANEOUS MACROS
1933 The macro
1934 .B YY_USER_ACTION
1935 can be defined to provide an action
1936 which is always executed prior to the matched rule's action.
1937 For example,
1938 it could be #define'd to call a routine to convert yytext to lower-case.
1939 When
1940 .B YY_USER_ACTION
1941 is invoked, the variable
1942 .I yy_act
1943 gives the number of the matched rule (rules are numbered starting with 1).
1944 Suppose you want to profile how often each of your rules is matched.
1945 The following would do the trick:
1946 .nf
1947
1948     #define YY_USER_ACTION ++ctr[yy_act]
1949
1950 .fi
1951 where
1952 .I ctr
1953 is an array to hold the counts for the different rules.
1954 Note that the macro
1955 .B YY_NUM_RULES
1956 gives the total number of rules (including the default rule, even if
1957 you use
1958 .B \-s),
1959 so a correct declaration for
1960 .I ctr
1961 is:
1962 .nf
1963
1964     int ctr[YY_NUM_RULES];
1965
1966 .fi
1967 .PP
1968 The macro
1969 .B YY_USER_INIT
1970 may be defined to provide an action which is always executed before
1971 the first scan (and before the scanner's internal initializations are done).
1972 For example, it could be used to call a routine to read
1973 in a data table or open a logging file.
1974 .PP
1975 The macro
1976 .B yy_set_interactive(is_interactive)
1977 can be used to control whether the current buffer is considered
1978 .I interactive.
1979 An interactive buffer is processed more slowly,
1980 but must be used when the scanner's input source is indeed
1981 interactive to avoid problems due to waiting to fill buffers
1982 (see the discussion of the
1983 .B \-I
1984 flag below).
1985 A non-zero value
1986 in the macro invocation marks the buffer as interactive, a zero
1987 value as non-interactive.
1988 Note that use of this macro overrides
.B %option interactive,
1990 .B %option always-interactive
1991 or
1992 .B %option never-interactive
1993 (see Options below).
1994 .B yy_set_interactive()
1995 must be invoked prior to beginning to scan the buffer that is
1996 (or is not) to be considered interactive.
1997 .PP
1998 The macro
1999 .B yy_set_bol(at_bol)
2000 can be used to control whether the current buffer's scanning
2001 context for the next token match is done as though at the
2002 beginning of a line.
2003 A non-zero macro argument makes rules anchored with
\&'^' active, while a zero argument makes '^' rules inactive.
2005 .PP
2006 The macro
2007 .B YY_AT_BOL()
2008 returns true if the next token scanned from the current buffer
2009 will have '^' rules active, false otherwise.
2010 .PP
2011 In the generated scanner, the actions are all gathered in one large
2012 switch statement and separated using
2013 .B YY_BREAK,
2014 which may be redefined.
2015 By default, it is simply a "break", to separate
2016 each rule's action from the following rule's.
2017 Redefining
2018 .B YY_BREAK
2019 allows, for example, C++ users to
2020 #define YY_BREAK to do nothing (while being very careful that every
rule ends with a "break" or a "return"!) to avoid
unreachable statement warnings in cases where a rule's action ends with
"return" and the following
.B YY_BREAK
can therefore never be reached.
2026 .SH VALUES AVAILABLE TO THE USER
2027 This section summarizes the various values available to the user
2028 in the rule actions.
2029 .IP -
2030 .B char *yytext
2031 holds the text of the current token.
2032 It may be modified but not lengthened
2033 (you cannot append characters to the end).
2034 .IP
2035 If the special directive
2036 .B %array
2037 appears in the first section of the scanner description, then
2038 .B yytext
2039 is instead declared
2040 .B char yytext[YYLMAX],
2041 where
2042 .B YYLMAX
2043 is a macro definition that you can redefine in the first section
2044 if you don't like the default value (generally 8KB).
2045 Using
2046 .B %array
2047 results in somewhat slower scanners, but the value of
2048 .B yytext
2049 becomes immune to calls to
2050 .I input()
2051 and
2052 .I unput(),
2053 which potentially destroy its value when
2054 .B yytext
2055 is a character pointer.
2056 The opposite of
2057 .B %array
2058 is
2059 .B %pointer,
2060 which is the default.
2061 .IP
2062 You cannot use
2063 .B %array
2064 when generating C++ scanner classes
2065 (the
2066 .B \-+
2067 flag).
2068 .IP -
2069 .B int yyleng
2070 holds the length of the current token.
2071 .IP -
2072 .B FILE *yyin
2073 is the file which by default
2074 .I flex
2075 reads from.
2076 It may be redefined but doing so only makes sense before
2077 scanning begins or after an EOF has been encountered.
2078 Changing it in the midst of scanning will have unexpected results since
2079 .I flex
2080 buffers its input; use
2081 .B yyrestart()
2082 instead.
2083 Once scanning terminates because an end-of-file
has been seen, you can point
2085 .I yyin
2086 at the new input file and then call the scanner again to continue scanning.
2087 .IP -
2088 .B void yyrestart( FILE *new_file )
2089 may be called to point
2090 .I yyin
2091 at the new input file.
2092 The switch-over to the new file is immediate
2093 (any previously buffered-up input is lost).
2094 Note that calling
2095 .B yyrestart()
2096 with
2097 .I yyin
2098 as an argument thus throws away the current input buffer and continues
2099 scanning the same input file.
2100 .IP -
2101 .B FILE *yyout
2102 is the file to which
2103 .B ECHO
2104 actions are done.
2105 It can be reassigned by the user.
2106 .IP -
2107 .B YY_CURRENT_BUFFER
2108 returns a
2109 .B YY_BUFFER_STATE
2110 handle to the current buffer.
2111 .IP -
2112 .B YY_START
2113 returns an integer value corresponding to the current start
2114 condition.
2115 You can subsequently use this value with
2116 .B BEGIN
2117 to return to that start condition.
2118 .SH INTERFACING WITH YACC
2119 One of the main uses of
2120 .I flex
2121 is as a companion to the
2122 .I yacc
2123 parser-generator.
2124 .I yacc
2125 parsers expect to call a routine named
2126 .B yylex()
2127 to find the next input token.
2128 The routine is supposed to
2129 return the type of the next token as well as putting any associated
2130 value in the global
2131 .B yylval.
2132 To use
2133 .I flex
2134 with
2135 .I yacc,
2136 one specifies the
2137 .B \-d
2138 option to
2139 .I yacc
2140 to instruct it to generate the file
2141 .B y.tab.h
2142 containing definitions of all the
2143 .B %tokens
2144 appearing in the
2145 .I yacc
2146 input.
2147 This file is then included in the
2148 .I flex
2149 scanner.
2150 For example, if one of the tokens is "TOK_NUMBER",
2151 part of the scanner might look like:
2152 .nf
2153
2154     %{
2155     #include "y.tab.h"
2156     %}
2157
2158     %%
2159
2160     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
2161
2162 .fi
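.PP
A matching
.I yacc
input and build sequence might look like the sketch below.
The file names
.I parse.y
and
.I scan.l,
the grammar itself, and the
.B \-ly
and
.B \-ll
library names are assumptions for illustration and vary between systems:
.nf

    /* parse.y */
    %{
    #include <stdio.h>
    %}
    %token TOK_NUMBER
    %%
    list : /* empty */
         | list TOK_NUMBER    { printf( "%d\\n", $2 ); }
         ;

    # build
    yacc -d parse.y           # writes y.tab.c and y.tab.h
    flex scan.l               # writes lex.yy.c
    cc y.tab.c lex.yy.c -ly -ll

.fi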
2163 .SH OPTIONS
2164 .I flex
2165 has the following options:
2166 .TP
2167 .B \-b, \-\-backup
2168 Generate backing-up information to
2169 .I lex.backup.
2170 This is a list of scanner states which require backing up
2171 and the input characters on which they do so.
2172 By adding rules one
2173 can remove backing-up states.
2174 If
2175 .I all
2176 backing-up states are eliminated and
2177 .B \-Cf
2178 or
2179 .B \-CF
2180 is used, the generated scanner will run faster (see the
2181 .B \-p
2182 flag).
2183 Only users who wish to squeeze every last cycle out of their
2184 scanners need worry about this option.
2185 (See the section on Performance Considerations below.)
2186 .TP
2187 .B \-c
2188 is a do-nothing, deprecated option included for POSIX compliance.
2189 .TP
2190 .B \-d, \-\-debug
2191 makes the generated scanner run in
2192 .I debug
2193 mode.
2194 Whenever a pattern is recognized and the global
2195 .B yy_flex_debug
2196 is non-zero (which is the default),
2197 the scanner will write to
2198 .I stderr
2199 a line of the form:
2200 .nf
2201
2202     --accepting rule at line 53 ("the matched text")
2203
2204 .fi
2205 The line number refers to the location of the rule in the file
2206 defining the scanner (i.e., the file that was fed to flex).
2207 Messages are also generated when the scanner backs up, accepts the
2208 default rule, reaches the end of its input buffer (or encounters
2209 a NUL; at this point, the two look the same as far as the scanner's concerned),
2210 or reaches an end-of-file.
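.IP
Because
.B yy_flex_debug
is a global variable, tracing can be switched on and off at run time.
A minimal sketch, assuming the scanner was generated with
.B \-d
and that tracing is wanted only when a hypothetical environment variable
.I SCAN_TRACE
is set:
.nf

    #include <stdlib.h>

    extern int yy_flex_debug;
    extern int yylex( void );

    int main( void )
        {
        /* -d turns tracing on by default; disable it unless asked */
        yy_flex_debug = getenv( "SCAN_TRACE" ) != NULL;

        while ( yylex() != 0 )
            ;

        return 0;
        }

.fi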
2211 .TP
2212 .B \-f, \-\-full
2213 specifies
2214 .I fast scanner.
2215 No table compression is done and stdio is bypassed.
2216 The result is large but fast.
2217 This option is equivalent to
2218 .B \-Cfr
2219 (see below).
2220 .TP
2221 .B \-h, \-\-help
2222 generates a "help" summary of
2223 .I flex's
2224 options to
2225 .I stdout
2226 and then exits.
2227 .B \-?
2228 and
2229 .B \-\-help
2230 are synonyms for
2231 .B \-h.
2232 .TP
2233 .B \-i, \-\-case-insensitive
2234 instructs
2235 .I flex
2236 to generate a
2237 .I case-insensitive
2238 scanner.
2239 The case of letters given in the
2240 .I flex
2241 input patterns will
2242 be ignored, and tokens in the input will be matched regardless of case.
2243 The matched text given in
2244 .I yytext
2245 will have the preserved case (i.e., it will not be folded).
2246 .TP
2247 .B \-l, \-\-lex\-compat
2248 turns on maximum compatibility with the original AT&T
2249 .I lex
2250 implementation.
2251 Note that this does not mean
2252 .I full
2253 compatibility.
2254 Use of this option costs a considerable amount of
2255 performance, and it cannot be used with the
2256 .B \-+, \-f, \-F, \-Cf,
2257 or
2258 .B \-CF
2259 options.
2260 For details on the compatibilities it provides, see the section
2261 "Incompatibilities With Lex And POSIX" below.
2262 This option also results
2263 in the name
2264 .B YY_FLEX_LEX_COMPAT
2265 being #define'd in the generated scanner.
2266 .TP
2267 .B \-n
2268 is another do-nothing, deprecated option included only for
2269 POSIX compliance.
2270 .TP
2271 .B \-p, \-\-perf\-report
2272 generates a performance report to stderr.
2273 The report consists of comments regarding features of the
2274 .I flex
2275 input file which will cause a serious loss of performance in the resulting
2276 scanner.
2277 If you give the flag twice, you will also get comments regarding
2278 features that lead to minor performance losses.
2279 .IP
2280 Note that the use of
2281 .B REJECT,
2282 .B %option yylineno,
2283 and variable trailing context (see the Deficiencies / Bugs section below)
2284 entails a substantial performance penalty; use of
2285 .I yymore(),
2286 the
2287 .B ^
2288 operator,
2289 and the
2290 .B \-I
2291 flag entail minor performance penalties.
2292 .TP
2293 .B \-s, \-\-no\-default
2294 causes the
2295 .I default rule
2296 (that unmatched scanner input is echoed to
2297 .I stdout)
2298 to be suppressed.
2299 If the scanner encounters input that does not
2300 match any of its rules, it aborts with an error.
2301 This option is
2302 useful for finding holes in a scanner's rule set.
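.IP
One way to close such a hole, shown here only as a sketch with a
hypothetical error message, is to end the rule set with an explicit
catch-all so that stray input is reported by your own code rather than
aborting the scanner:
.nf

    %%
    [a-z]+    /* ... rules for the tokens you expect ... */
    [ \\t\\n]+  /* skip whitespace */
    .         fprintf( stderr, "unexpected character: %s\\n", yytext );

.fi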
2303 .TP
2304 .B \-t, \-\-stdout
2305 instructs
2306 .I flex
2307 to write the scanner it generates to standard output instead
2308 of
2309 .B lex.yy.c.
2310 .TP
2311 .B \-v, \-\-verbose
2312 specifies that
2313 .I flex
2314 should write to
2315 .I stderr
2316 a summary of statistics regarding the scanner it generates.
2317 Most of the statistics are meaningless to the casual
2318 .I flex
2319 user, but the first line identifies the version of
2320 .I flex
2321 (same as reported by
2322 .B \-V),
2323 and the next line the flags used when generating the scanner, including
2324 those that are on by default.
2325 .TP
2326 .B \-w, \-\-nowarn
2327 suppresses warning messages.
2328 .TP
2329 .B \-B, \-\-batch
2330 instructs
2331 .I flex
2332 to generate a
2333 .I batch
2334 scanner, the opposite of
2335 .I interactive
2336 scanners generated by
2337 .B \-I
2338 (see below).
2339 In general, you use
2340 .B \-B
2341 when you are
2342 .I certain
2343 that your scanner will never be used interactively, and you want to
2344 squeeze a
2345 .I little
2346 more performance out of it.
2347 If your goal is instead to squeeze out a
2348 .I lot
2349 more performance, you should be using the
2350 .B \-Cf
2351 or
2352 .B \-CF
2353 options (discussed below), which turn on
2354 .B \-B
2355 automatically anyway.
2356 .TP
2357 .B \-F, \-\-fast
2358 specifies that the
2359 .I fast
2360 scanner table representation should be used (and stdio
2361 bypassed).
2362 This representation is about as fast as the full table representation
2363 .B (\-f),
2364 and for some sets of patterns will be considerably smaller (and for
2365 others, larger).
2366 In general, if the pattern set contains both "keywords"
2367 and a catch-all, "identifier" rule, such as in the set:
2368 .nf
2369
2370     "case"    return TOK_CASE;
2371     "switch"  return TOK_SWITCH;
2372     ...
2373     "default" return TOK_DEFAULT;
2374     [a-z]+    return TOK_ID;
2375
2376 .fi
2377 then you're better off using the full table representation.
2378 If only
2379 the "identifier" rule is present and you then use a hash table or some such
2380 to detect the keywords, you're better off using
2381 .B \-F.
2382 .IP
2383 This option is equivalent to
2384 .B \-CFr
2385 (see below).
2386 It cannot be used with
2387 .B \-+.
2388 .TP
2389 .B \-I, \-\-interactive
2390 instructs
2391 .I flex
2392 to generate an
2393 .I interactive
2394 scanner.
2395 An interactive scanner is one that only looks ahead to decide
2396 what token has been matched if it absolutely must.
2397 It turns out that
2398 always looking one extra character ahead, even if the scanner has already
2399 seen enough text to disambiguate the current token, is a bit faster than
2400 only looking ahead when necessary.
2401 But scanners that always look ahead
2402 give dreadful interactive performance; for example, when a user types
2403 a newline, it is not recognized as a newline token until they enter
2404 .I another
2405 token, which often means typing in another whole line.
2406 .IP
2407 .I Flex
2408 scanners default to
2409 .I interactive
2410 unless you use the
2411 .B \-Cf
2412 or
2413 .B \-CF
2414 table-compression options (see below).
2415 That's because if you're looking
2416 for high-performance you should be using one of these options, so if you
2417 didn't,
2418 .I flex
2419 assumes you'd rather trade off a bit of run-time performance for intuitive
2420 interactive behavior.
2421 Note also that you
2422 .I cannot
2423 use
2424 .B \-I
2425 in conjunction with
2426 .B \-Cf
2427 or
2428 .B \-CF.
2429 Thus, this option is not really needed; it is on by default for all those
2430 cases in which it is allowed.
2431 .IP
2432 Note that if
2433 .B isatty()
2434 returns false for the scanner input, flex will revert to batch mode, even if
2435 .B \-I
2436 was specified.
2437 To force interactive mode no matter what, use
2438 .B %option always-interactive
2439 (described later in this section).
2440 .IP
2441 You can force a scanner to
2442 .I not
2443 be interactive by using
2444 .B \-B
2445 (see above).
2446 .TP
2447 .B \-L, \-\-noline
2448 instructs
2449 .I flex
2450 not to generate
2451 .B #line
2452 directives.
2453 Without this option,
2454 .I flex
2455 peppers the generated scanner
2456 with #line directives so error messages in the actions will be correctly
2457 located with respect to either the original
2458 .I flex
2459 input file (if the errors are due to code in the input file), or
2460 .B lex.yy.c
2461 (if the errors are
2462 .I flex's
2463 fault -- you should report these sorts of errors to the email address
2464 given below).
2465 .TP
2466 .B \-T, \-\-trace
2467 makes
2468 .I flex
2469 run in
2470 .I trace
2471 mode.
2472 It will generate a lot of messages to
2473 .I stderr
2474 concerning
2475 the form of the input and the resultant non-deterministic and deterministic
2476 finite automata.
2477 This option is mostly for use in maintaining
2478 .I flex.
2479 .TP
2480 .B \-V, \-\-version
2481 prints the version number to
2482 .I stdout
2483 and exits.
2484 .B \-\-version
2485 is a synonym for
2486 .B \-V.
2487 .TP
2488 .B \-7, \-\-7bit
2489 instructs
2490 .I flex
2491 to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2492 characters in its input.
2493 The advantage of using
2494 .B \-7
2495 is that the scanner's tables can be up to half the size of those generated
2496 using the
2497 .B \-8
2498 option (see below).
2499 The disadvantage is that such scanners often hang
2500 or crash if their input contains an 8-bit character.
2501 .IP
2502 Note, however, that unless you generate your scanner using the
2503 .B \-Cf
2504 or
2505 .B \-CF
2506 table compression options, use of
2507 .B \-7
2508 will save only a small amount of table space, and make your scanner
2509 considerably less portable.
2510 .I Flex's
2511 default behavior is to generate an 8-bit scanner unless you use the
2512 .B \-Cf
2513 or
2514 .B \-CF,
2515 in which case
2516 .I flex
2517 defaults to generating 7-bit scanners unless your site was always
2518 configured to generate 8-bit scanners (as will often be the case
2519 with non-USA sites).
2520 You can tell whether flex generated a 7-bit
2521 or an 8-bit scanner by inspecting the flag summary in the
2522 .B \-v
2523 output as described above.
2524 .IP
2525 Note that if you use
2526 .B \-Cfe
2527 or
2528 .B \-CFe
2529 (those table compression options, but also using equivalence classes as
2530 discussed below), flex still defaults to generating an 8-bit
2531 scanner, since usually with these compression options full 8-bit tables
2532 are not much more expensive than 7-bit tables.
2533 .TP
2534 .B \-8, \-\-8bit
2535 instructs
2536 .I flex
2537 to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2538 characters.
2539 This flag is only needed for scanners generated using
2540 .B \-Cf
2541 or
2542 .B \-CF,
2543 as otherwise flex defaults to generating an 8-bit scanner anyway.
2544 .IP
2545 See the discussion of
2546 .B \-7
2547 above for flex's default behavior and the tradeoffs between 7-bit
2548 and 8-bit scanners.
2549 .TP
2550 .B \-+, \-\-c++
2551 specifies that you want flex to generate a C++
2552 scanner class.
2553 See the section on Generating C++ Scanners below for
2554 details.
2555 .TP
2556 .B \-C[aefFmr]
2557 controls the degree of table compression and, more generally, trade-offs
2558 between small scanners and fast scanners.
2559 .IP
2560 .B \-Ca, \-\-align
2561 ("align") instructs flex to trade off larger tables in the
2562 generated scanner for faster performance because the elements of
2563 the tables are better aligned for memory access and computation.
2564 On some
2565 RISC architectures, fetching and manipulating longwords is more efficient
2566 than with smaller-sized units such as shortwords.
2567 This option can
2568 double the size of the tables used by your scanner.
2569 .IP
2570 .B \-Ce, \-\-ecs
2571 directs
2572 .I flex
2573 to construct
2574 .I equivalence classes,
2575 i.e., sets of characters
2576 which have identical lexical properties (for example, if the only
2577 appearance of digits in the
2578 .I flex
2579 input is in the character class
2580 "[0-9]" then the digits '0', '1', ..., '9' will all be put
2581 in the same equivalence class).
2582 Equivalence classes usually give
2583 dramatic reductions in the final table/object file sizes (typically
2584 a factor of 2-5) and are pretty cheap performance-wise (one array
2585 look-up per character scanned).
2586 .IP
2587 .B \-Cf
2588 specifies that the
2589 .I full
2590 scanner tables should be generated -
2591 .I flex
2592 should not compress the
2593 tables by taking advantage of similar transition functions for
2594 different states.
2595 .IP
2596 .B \-CF
2597 specifies that the alternative fast scanner representation (described
2598 above under the
2599 .B \-F
2600 flag)
2601 should be used.
2602 This option cannot be used with
2603 .B \-+.
2604 .IP
2605 .B \-Cm, \-\-meta-ecs
2606 directs
2607 .I flex
2608 to construct
2609 .I meta-equivalence classes,
2610 which are sets of equivalence classes (or characters, if equivalence
2611 classes are not being used) that are commonly used together.
2612 Meta-equivalence
2613 classes are often a big win when using compressed tables, but they
2614 have a moderate performance impact (one or two "if" tests and one
2615 array look-up per character scanned).
2616 .IP
2617 .B \-Cr, \-\-read
2618 causes the generated scanner to
2619 .I bypass
2620 use of the standard I/O library (stdio) for input.
2621 Instead of calling
2622 .B fread()
2623 or
2624 .B getc(),
2625 the scanner will use the
2626 .B read()
2627 system call, resulting in a performance gain which varies from system
2628 to system, but in general is probably negligible unless you are also using
2629 .B \-Cf
2630 or
2631 .B \-CF.
2632 Using
2633 .B \-Cr
2634 can cause strange behavior if, for example, you read from
2635 .I yyin
2636 using stdio prior to calling the scanner (because the scanner will miss
2637 whatever text your previous reads left in the stdio input buffer).
2638 .IP
2639 .B \-Cr
2640 has no effect if you define
2641 .B YY_INPUT
2642 (see The Generated Scanner above).
2643 .IP
2644 A lone
2645 .B \-C
2646 specifies that the scanner tables should be compressed but neither
2647 equivalence classes nor meta-equivalence classes should be used.
2648 .IP
2649 The options
2650 .B \-Cf
2651 or
2652 .B \-CF
2653 and
2654 .B \-Cm
2655 do not make sense together - there is no opportunity for meta-equivalence
2656 classes if the table is not being compressed.
2657 Otherwise the options
2658 may be freely mixed, and are cumulative.
2659 .IP
2660 The default setting is
2661 .B \-Cem,
2662 which specifies that
2663 .I flex
2664 should generate equivalence classes
2665 and meta-equivalence classes.
2666 This setting provides the highest degree of table compression.
2667 You can obtain
2668 faster-executing scanners at the cost of larger tables, with
2669 the following generally being true:
2670 .nf
2671
2672     slowest & smallest
2673           -Cem
2674           -Cm
2675           -Ce
2676           -C
2677           -C{f,F}e
2678           -C{f,F}
2679           -C{f,F}a
2680     fastest & largest
2681
2682 .fi
2683 Note that scanners with the smallest tables are usually generated and
2684 compiled the quickest, so
2685 during development you will usually want to use the default, maximal
2686 compression.
2687 .IP
2688 .B \-Cfe
2689 is often a good compromise between speed and size for production
2690 scanners.
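.IP
For instance, a build might use different settings for development and
for release.
The file names below are hypothetical:
.nf

    flex scan.l                     # default -Cem: smallest tables
    flex -Cfe -oscan_fast.c scan.l  # release: full tables plus ecs

.fi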
2691 .TP
2692 .B \-ooutput, \-\-outputfile=FILE
2693 directs flex to write the scanner to the file
2694 .B output
2695 instead of
2696 .B lex.yy.c.
2697 If you combine
2698 .B \-o
2699 with the
2700 .B \-t
2701 option, then the scanner is written to
2702 .I stdout
2703 but its
2704 .B #line
2705 directives (see the
2706 .B \-L
2707 option above) refer to the file
2708 .B output.
2709 .TP
2710 .B \-Pprefix, \-\-prefix=STRING
2711 changes the default
2712 .I "yy"
2713 prefix used by
2714 .I flex
2715 for all globally-visible variable and function names to instead be
2716 .I prefix.
2717 For example,
2718 .B \-Pfoo
2719 changes the name of
2720 .B yytext
2721 to
2722 .B footext.
2723 It also changes the name of the default output file from
2724 .B lex.yy.c
2725 to
2726 .B lex.foo.c.
2727 Here are all of the names affected:
2728 .nf
2729
2730     yy_create_buffer
2731     yy_delete_buffer
2732     yy_flex_debug
2733     yy_init_buffer
2734     yy_flush_buffer
2735     yy_load_buffer_state
2736     yy_switch_to_buffer
2737     yyin
2738     yyleng
2739     yylex
2740     yylineno
2741     yyout
2742     yyrestart
2743     yytext
2744     yywrap
2745
2746 .fi
2747 (If you are using a C++ scanner, then only
2748 .B yywrap
2749 and
2750 .B yyFlexLexer
2751 are affected.)
2752 Within your scanner itself, you can still refer to the global variables
2753 and functions using either version of their name; but externally, they
2754 have the modified name.
2755 .IP
2756 This option lets you easily link together multiple
2757 .I flex
2758 programs into the same executable.
2759 Note, though, that using this option also renames
2760 .B yywrap(),
2761 so you now
2762 .I must
2763 either
2764 provide your own (appropriately-named) version of the routine for your
2765 scanner, or use
2766 .B %option noyywrap,
2767 as linking with
2768 .B \-ll
2769 no longer provides one for you by default.
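.IP
As a sketch, two scanners with the hypothetical prefixes
.I foo
and
.I bar
could be generated and driven from one program as follows
(both specifications are assumed to use
.B %option noyywrap,
and all file names are made up for illustration):
.nf

    flex -Pfoo -ofoo_scan.c foo.l
    flex -Pbar -obar_scan.c bar.l
    cc -o both main.c foo_scan.c bar_scan.c

    /* main.c */
    #include <stdio.h>

    extern FILE *fooin, *barin;
    extern int foolex( void );
    extern int barlex( void );

    int main( void )
        {
        fooin = fopen( "input.foo", "r" );
        barin = fopen( "input.bar", "r" );

        if ( ! fooin || ! barin )
            return 1;

        while ( foolex() != 0 )
            ;
        while ( barlex() != 0 )
            ;

        return 0;
        }

.fi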
2770 .TP
2771 .B \-Sskeleton_file, \-\-skel=FILE
2772 overrides the default skeleton file from which
2773 .I flex
2774 constructs its scanners.
2775 You'll never need this option unless you are doing
2776 .I flex
2777 maintenance or development.
2778 .TP
2779 .B \-X, \-\-posix\-compat
2780 maximal compatibility with POSIX lex.
2781 .TP
2782 .B \-\-yylineno
2783 track line count in yylineno.
2784 .TP
2785 .B \-\-yyclass=NAME
2786 name of C++ class.
2787 .TP
2788 .B \-\-header\-file=FILE
2789 create a C header file in addition to the scanner.
2790 .TP
2791 .B \-\-tables\-file[=FILE]
2792 write tables to FILE.
2793 .TP
2794 .B \-Dmacro[=defn]
2795 #define macro defn (default defn is '1').
2796 .TP
2797 .B \-R, \-\-reentrant
2798 generate a reentrant C scanner.
2799 .TP
2800 .B \-\-bison\-bridge
2801 scanner for bison pure parser.
2802 .TP
2803 .B \-\-bison\-locations
2804 include yylloc support.
2805 .TP
2806 .B \-\-stdinit
2807 initialize yyin/yyout to stdin/stdout.
2808 .TP
2809 .B \-\-noansi\-definitions
old\-style function definitions.
2810 .TP
2811 .B \-\-noansi\-prototypes
2812 empty parameter list in prototypes.
2813 .TP
2814 .B \-\-nounistd
2815 do not include <unistd.h>.
2816 .TP
2817 .B \-\-noFUNCTION
2818 do not generate a particular FUNCTION.
2819 .PP
2820 .I flex
2821 also provides a mechanism for controlling options within the
2822 scanner specification itself, rather than from the flex command-line.
2823 This is done by including
2824 .B %option
2825 directives in the first section of the scanner specification.
2826 You can specify multiple options with a single
2827 .B %option
2828 directive, and multiple directives in the first section of your flex input
2829 file.
2830 .PP
2831 Most options are given simply as names, optionally preceded by the
2832 word "no" (with no intervening whitespace) to negate their meaning.
2833 A number are equivalent to flex flags or their negation:
2834 .nf
2835
2836     7bit            -7 option
2837     8bit            -8 option
2838     align           -Ca option
2839     backup          -b option
2840     batch           -B option
2841     c++             -+ option
2842
2843     caseful or
2844     case-sensitive  opposite of -i (default)
2845
2846     case-insensitive or
2847     caseless        -i option
2848
2849     debug           -d option
2850     default         opposite of -s option
2851     ecs             -Ce option
2852     fast            -F option
2853     full            -f option
2854     interactive     -I option
2855     lex-compat      -l option
2856     meta-ecs        -Cm option
2857     perf-report     -p option
2858     read            -Cr option
2859     stdout          -t option
2860     verbose         -v option
2861     warn            opposite of -w option
2862                     (use "%option nowarn" for -w)
2863
2864     array           equivalent to "%array"
2865     pointer         equivalent to "%pointer" (default)
2866
2867 .fi
2868 Some
2869 .B %option's
2870 provide features otherwise not available:
2871 .TP
2872 .B always-interactive
2873 instructs flex to generate a scanner which always considers its input
2874 "interactive".
2875 Normally, on each new input file the scanner calls
2876 .B isatty()
2877 in an attempt to determine whether
2878 the scanner's input source is interactive and thus should be read a
2879 character at a time.
2880 When this option is used, however, then no
2881 such call is made.
2882 .TP
2883 .B main
2884 directs flex to provide a default
2885 .B main()
2886 program for the scanner, which simply calls
2887 .B yylex().
2888 This option implies
2889 .B noyywrap
2890 (see below).
2891 .TP
2892 .B never-interactive
2893 instructs flex to generate a scanner which never considers its input
2894 "interactive" (again, no call made to
2895 .B isatty()).
2896 This is the opposite of
2897 .B always-interactive.
2898 .TP
2899 .B stack
2900 enables the use of start condition stacks (see Start Conditions above).
2901 .TP
2902 .B stdinit
2903 if set (i.e.,
2904 .B %option stdinit)
2905 initializes
2906 .I yyin
2907 and
2908 .I yyout
2909 to
2910 .I stdin
2911 and
2912 .I stdout,
2913 instead of the default of
2914 .I nil.
2915 Some existing
2916 .I lex
2917 programs depend on this behavior, even though it is not compliant with
2918 ANSI C, which does not require
2919 .I stdin
2920 and
2921 .I stdout
2922 to be compile-time constant.
2923 .TP
2924 .B yylineno
2925 directs
2926 .I flex
2927 to generate a scanner that maintains the number of the current line
2928 read from its input in the global variable
2929 .B yylineno.
2930 This option is implied by
2931 .B %option lex-compat.
2932 .TP
2933 .B yywrap
2934 if unset (i.e.,
2935 .B %option noyywrap),
2936 makes the scanner not call
2937 .B yywrap()
2938 upon an end-of-file, but simply assume that there are no more
2939 files to scan (until the user points
2940 .I yyin
2941 at a new file and calls
2942 .B yylex()
2943 again).
2944 .PP
2945 .I flex
2946 scans your rule actions to determine whether you use the
2947 .B REJECT
2948 or
2949 .B yymore()
2950 features.
2951 The
2952 .B reject
2953 and
2954 .B yymore
2955 options are available to override its decision as to whether you use the
2956 options, either by setting them (e.g.,
2957 .B %option reject)
2958 to indicate the feature is indeed used, or
2959 unsetting them to indicate it actually is not used
2960 (e.g.,
2961 .B %option noyymore).
2962 .PP
2963 Three options take string-delimited values, offset with '=':
2964 .nf
2965
2966     %option outfile="ABC"
2967
2968 .fi
2969 is equivalent to
2970 .B -oABC,
2971 and
2972 .nf
2973
2974     %option prefix="XYZ"
2975
2976 .fi
2977 is equivalent to
2978 .B -PXYZ.
2979 Finally,
2980 .nf
2981
2982     %option yyclass="foo"
2983
2984 .fi
2985 only applies when generating a C++ scanner (
2986 .B \-+
2987 option).
2988 It informs
2989 .I flex
2990 that you have derived
2991 .B foo
2992 as a subclass of
2993 .B yyFlexLexer,
2994 so
2995 .I flex
2996 will place your actions in the member function
2997 .B foo::yylex()
2998 instead of
2999 .B yyFlexLexer::yylex().
3000 It also generates a
3001 .B yyFlexLexer::yylex()
3002 member function that emits a run-time error (by invoking
3003 .B yyFlexLexer::LexerError())
3004 if called.
3005 See Generating C++ Scanners, below, for additional information.
3006 .PP
3007 A number of options are available for lint purists who want to suppress
3008 the appearance of unneeded routines in the generated scanner.
3009 Each of the following, if unset
3010 (e.g.,
3011 .B %option nounput
3012 ), results in the corresponding routine not appearing in
3013 the generated scanner:
3014 .nf
3015
3016     input, unput
3017     yy_push_state, yy_pop_state, yy_top_state
3018     yy_scan_buffer, yy_scan_bytes, yy_scan_string
3019
3020 .fi
3021 (though
3022 .B yy_push_state()
3023 and friends won't appear anyway unless you use
3024 .B %option stack).
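.PP
Putting several of these directives together, the first section of a
scanner specification might look like the following sketch.
The output file name and the rules are hypothetical:
.nf

    %option main yylineno
    %option nounput noinput
    %option outfile="cfg_scan.c"
    %%
    \\n       /* the scanner updates yylineno itself */
    .+       printf( "line %d: %s\\n", yylineno, yytext );

.fi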
3025 .SH PERFORMANCE CONSIDERATIONS
3026 The main design goal of
3027 .I flex
3028 is that it generate high-performance scanners.
3029 It has been optimized
3030 for dealing well with large sets of rules.
3031 Aside from the effects on scanner speed of the table compression
3032 .B \-C
3033 options outlined above,
3034 there are a number of options/actions which degrade performance.
3035 These are, from most expensive to least:
3036 .nf
3037
3038     REJECT
3039     %option yylineno
3040     arbitrary trailing context
3041
3042     pattern sets that require backing up
3043     %array
3044     %option interactive
3045     %option always-interactive
3046
3047     '^' beginning-of-line operator
3048     yymore()
3049
3050 .fi
3051 with the first three all being quite expensive and the last two
3052 being quite cheap.
3053 Note also that
3054 .B unput()
3055 is implemented as a routine call that potentially does quite a bit of
3056 work, while
3057 .B yyless()
3058 is a quite-cheap macro; so if you are just putting back excess text you
3059 scanned, use
3060 .B yyless().
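.PP
For example, a rule that has matched more text than it wants can hand the
excess back with
.B yyless().
The token name below is hypothetical:
.nf

    %%
    foobar    {
              yyless( 3 );       /* keep "foo"; rescan "bar" */
              return TOK_FOO;
              }

.fi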
3061 .PP
3062 .B REJECT
3063 should be avoided at all costs when performance is important.
3064 It is a particularly expensive option.
3065 .PP
3066 Getting rid of backing up is messy and often may be an enormous
3067 amount of work for a complicated scanner.
3068 In principle, one begins by using the
3069 .B \-b
3070 flag to generate a
3071 .I lex.backup
3072 file.
3073 For example, on the input
3074 .nf
3075
3076     %%
3077     foo        return TOK_KEYWORD;
3078     foobar     return TOK_KEYWORD;
3079
3080 .fi
3081 the file looks like:
3082 .nf
3083
3084     State #6 is non-accepting -
3085      associated rule line numbers:
3086            2       3
3087      out-transitions: [ o ]
3088      jam-transitions: EOF [ \\001-n  p-\\177 ]
3089
3090     State #8 is non-accepting -
3091      associated rule line numbers:
3092            3
3093      out-transitions: [ a ]
3094      jam-transitions: EOF [ \\001-`  b-\\177 ]
3095
3096     State #9 is non-accepting -
3097      associated rule line numbers:
3098            3
3099      out-transitions: [ r ]
3100      jam-transitions: EOF [ \\001-q  s-\\177 ]
3101
3102     Compressed tables always back up.
3103
3104 .fi
3105 The first few lines tell us that there's a scanner state in
3106 which it can make a transition on an 'o' but not on any other
3107 character, and that in that state the currently scanned text does not match
3108 any rule.
3109 The state occurs when trying to match the rules found
3110 at lines 2 and 3 in the input file.
3111 If the scanner is in that state and then reads
3112 something other than an 'o', it will have to back up to find
3113 a rule which is matched.
3114 With a bit of headscratching one can see that this must be the
3115 state it's in when it has seen "fo".
3116 When this has happened,
3117 if anything other than another 'o' is seen, the scanner will
3118 have to back up to simply match the 'f' (by the default rule).
3119 .PP
3120 The comment regarding State #8 indicates there's a problem
3121 when "foob" has been scanned.
3122 Indeed, on any character other
3123 than an 'a', the scanner will have to back up to accept "foo".
3124 Similarly, the comment for State #9 concerns when "fooba" has
3125 been scanned and an 'r' does not follow.
3126 .PP
3127 The final comment reminds us that there's no point going to
3128 all the trouble of removing backing up from the rules unless
3129 we're using
3130 .B \-Cf
3131 or
3132 .B \-CF,
3133 since there's no performance gain doing so with compressed scanners.
3134 .PP
3135 The way to remove the backing up is to add "error" rules:
3136 .nf
3137
3138     %%
3139     foo         return TOK_KEYWORD;
3140     foobar      return TOK_KEYWORD;
3141
3142     fooba       |
3143     foob        |
3144     fo          {
3145                 /* false alarm, not really a keyword */
3146                 return TOK_ID;
3147                 }
3148
3149 .fi
3150 .PP
3151 Eliminating backing up among a list of keywords can also be
3152 done using a "catch-all" rule:
3153 .nf
3154
3155     %%
3156     foo         return TOK_KEYWORD;
3157     foobar      return TOK_KEYWORD;
3158
3159     [a-z]+      return TOK_ID;
3160
3161 .fi
3162 This is usually the best solution when appropriate.
3163 .PP
3164 Backing up messages tend to cascade.
3165 With a complicated set of rules it's not uncommon to get hundreds
3166 of messages.
3167 If one can decipher them, though, it often
3168 only takes a dozen or so rules to eliminate the backing up (though
3169 it's easy to make a mistake and have an error rule accidentally match
3170 a valid token).
3171 A possible future
3172 .I flex
3173 feature will be to automatically add rules to eliminate backing up.
3174 .PP
3175 It's important to keep in mind that you gain the benefits of eliminating
3176 backing up only if you eliminate
3177 .I every
3178 instance of backing up.
3179 Leaving just one means you gain nothing.
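.PP
A simple way to check progress, sketched here with a hypothetical file name,
is to regenerate the backup report after each change and confirm that it
no longer lists any non-accepting states:
.nf

    flex -b -Cf scan.l
    grep -c "is non-accepting" lex.backup    # aim for a count of 0

.fi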
3180 .PP
3181 .I Variable
3182 trailing context (where both the leading and trailing parts do not have
3183 a fixed length) entails almost the same performance loss as
3184 .B REJECT
3185 (i.e., substantial).
3186 So when possible a rule like:
3187 .nf
3188
3189     %%
3190     mouse|rat/(cat|dog)   run();
3191
3192 .fi
3193 is better written:
3194 .nf
3195
3196     %%
3197     mouse/cat|dog         run();
3198     rat/cat|dog           run();
3199
3200 .fi
3201 or as
3202 .nf
3203
3204     %%
3205     mouse|rat/cat         run();
3206     mouse|rat/dog         run();
3207
3208 .fi
3209 Note that here the special '|' action does
3210 .I not
3211 provide any savings, and can even make things worse (see
3212 Deficiencies / Bugs below).
3213 .LP
3214 Another area where the user can increase a scanner's performance
3215 (and one that's easier to implement) arises from the fact that
3216 the longer the tokens matched, the faster the scanner will run.
3217 This is because with long tokens the processing of most input
3218 characters takes place in the (short) inner scanning loop, and
3219 does not often have to go through the additional work of setting up
3220 the scanning environment (e.g.,
3221 .B yytext)
3222 for the action.
3223 Recall the scanner for C comments:
3224 .nf
3225
3226     %x comment
3227     %%
3228             int line_num = 1;
3229
3230     "/*"         BEGIN(comment);
3231
3232     <comment>[^*\\n]*
3233     <comment>"*"+[^*/\\n]*
3234     <comment>\\n             ++line_num;
3235     <comment>"*"+"/"        BEGIN(INITIAL);
3236
3237 .fi
3238 This could be sped up by writing it as:
3239 .nf
3240
3241     %x comment
3242     %%
3243             int line_num = 1;
3244
3245     "/*"         BEGIN(comment);
3246
3247     <comment>[^*\\n]*
3248     <comment>[^*\\n]*\\n      ++line_num;
3249     <comment>"*"+[^*/\\n]*
3250     <comment>"*"+[^*/\\n]*\\n ++line_num;
3251     <comment>"*"+"/"        BEGIN(INITIAL);
3252
3253 .fi
3254 Now instead of each newline requiring the processing of another
3255 action, recognizing the newlines is "distributed" over the other rules
3256 to keep the matched text as long as possible.
3257 Note that
3258 .I adding
3259 rules does
3260 .I not
3261 slow down the scanner!  The speed of the scanner is independent
3262 of the number of rules or (modulo the considerations given at the
3263 beginning of this section) how complicated the rules are with
3264 regard to operators such as '*' and '|'.
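.PP
The effect of longer tokens can be estimated without generating a scanner.
The following C++ sketch is not flex output: it only imitates longest-match
scanning with std::regex, using the two comment rule sets shown above, and
counts how many matches (and hence action dispatches) each rule set needs.
The names count_actions, per_line_rules and merged_rules are assumptions
made for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Rules active inside the <comment> start condition, first version:
// every newline is matched (and dispatched) on its own.
const std::vector<std::string> per_line_rules = {
    R"([^*\n]*)", R"(\*+[^*/\n]*)", R"(\n)", R"(\*+/)"
};

// Second version: newline recognition is folded into the longer rules.
const std::vector<std::string> merged_rules = {
    R"([^*\n]*)", R"([^*\n]*\n)", R"(\*+[^*/\n]*)", R"(\*+[^*/\n]*\n)",
    R"(\*+/)"
};

// Count how many matches a longest-match scanner needs to consume `text`.
// Each match stands for one trip out of the inner scanning loop.
std::size_t count_actions(const std::vector<std::string>& patterns,
                          const std::string& text)
{
    std::vector<std::regex> rules;
    for (const std::string& p : patterns)
        rules.emplace_back(p);

    std::size_t actions = 0;
    auto pos = text.cbegin();
    while (pos != text.cend()) {
        std::ptrdiff_t best = 0;
        for (const std::regex& re : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(pos, text.cend(), m, re,
                                  std::regex_constants::match_continuous)
                && m.length(0) > best)
                best = m.length(0);
        }
        if (best == 0)
            best = 1;   // stand-in for the default rule: one character
        pos += best;
        ++actions;
    }
    return actions;
}
```

For the comment body "one\ntwo\n*/", count_actions() reports five matches
with per_line_rules and three with merged_rules, which is the saving the
rewritten rules are after.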
3265 .PP
3266 A final example in speeding up a scanner: suppose you want to scan
3267 through a file containing identifiers and keywords, one per line
3268 and with no other extraneous characters, and recognize all the
3269 keywords.
3270 A natural first approach is:
3271 .nf
3272
3273     %%
3274     asm      |
3275     auto     |
3276     break    |
3277     ... etc ...
3278     volatile |
3279     while    /* it's a keyword */
3280
3281     .|\\n     /* it's not a keyword */
3282
3283 .fi
3284 To eliminate the backing up, introduce a catch-all rule:
3285 .nf
3286
3287     %%
3288     asm      |
3289     auto     |
3290     break    |
3291     ... etc ...
3292     volatile |
3293     while    /* it's a keyword */
3294
3295     [a-z]+   |
3296     .|\\n     /* it's not a keyword */
3297
3298 .fi
3299 Now, if it's guaranteed that there's exactly one word per line,
3300 then we can reduce the total number of matches by a half by
3301 merging in the recognition of newlines with that of the other
3302 tokens:
3303 .nf
3304
3305     %%
3306     asm\\n    |
3307     auto\\n   |
3308     break\\n  |
3309     ... etc ...
3310     volatile\\n |
3311     while\\n  /* it's a keyword */
3312
3313     [a-z]+\\n |
3314     .|\\n     /* it's not a keyword */
3315
3316 .fi
3317 One has to be careful here, as we have now reintroduced backing up
3318 into the scanner.
3319 In particular, while
3320 .I we
3321 know that there will never be any characters in the input stream
3322 other than letters or newlines,
3323 .I flex
3324 can't figure this out, and it will plan for possibly needing to back up
3325 when it has scanned a token like "auto" and then the next character
3326 is something other than a newline or a letter.
3327 Previously it would
3328 then just match the "auto" rule and be done, but now it has no "auto"
3329 rule, only an "auto\\n" rule.
3330 To eliminate the possibility of backing up,
3331 we could either duplicate all rules but without final newlines, or,
3332 since we never expect to encounter such an input and therefore don't care
3333 how it's classified, we can introduce one more catch-all rule, this
3334 one which doesn't include a newline:
3335 .nf
3336
3337     %%
3338     asm\\n    |
3339     auto\\n   |
3340     break\\n  |
3341     ... etc ...
3342     volatile\\n |
3343     while\\n  /* it's a keyword */
3344
3345     [a-z]+\\n |
3346     [a-z]+   |
3347     .|\\n     /* it's not a keyword */
3348
3349 .fi
3350 Compiled with
3351 .B \-Cf,
3352 this is about as fast as one can get a
3353 .I flex
3354 scanner to go for this particular problem.
3355 .PP
3356 A final note:
3357 .I flex
3358 is slow when matching NUL's, particularly when a token contains
3359 multiple NUL's.
3360 It's best to write rules which match
3361 .I short
3362 amounts of text if it's anticipated that the text will often include NUL's.
3363 .PP
3364 Another final note regarding performance: as mentioned above in the section
3365 How the Input is Matched, dynamically resizing
3366 .B yytext
3367 to accommodate huge tokens is a slow process because it presently requires that
3368 the (huge) token be rescanned from the beginning.
3369 Thus if performance is
3370 vital, you should attempt to match "large" quantities of text but not
3371 "huge" quantities, where the cutoff between the two is at about 8K
3372 characters/token.
3373 .SH GENERATING C++ SCANNERS
3374 .I flex
3375 provides two different ways to generate scanners for use with C++.
3376 The first way is to simply compile a scanner generated by
3377 .I flex
3378 using a C++ compiler instead of a C compiler.
3379 You should not encounter
3380 any compilation errors (please report any you find to the email address
3381 given in the Author section below).
3382 You can then use C++ code in your rule actions instead of C code.
3383 Note that the default input source for your scanner remains
3384 .I yyin,
3385 and default echoing is still done to
3386 .I yyout.
3387 Both of these remain
3388 .I FILE *
3389 variables and not C++
3390 .I streams.
3391 .PP
3392 You can also use
3393 .I flex
3394 to generate a C++ scanner class, using the
3395 .B \-+
3396 option (or, equivalently,
3397 .B %option c++),
3398 which is automatically specified if the name of the flex
3399 executable ends in a '+', such as
3400 .I flex++.
3401 When using this option, flex defaults to generating the scanner to the file
3402 .B lex.yy.cc
3403 instead of
3404 .B lex.yy.c.
3405 The generated scanner includes the header file
3406 .I FlexLexer.h,
3407 which defines the interface to two C++ classes.
3408 .PP
3409 The first class,
3410 .B FlexLexer,
3411 provides an abstract base class defining the general scanner class
3412 interface.
3413 It provides the following member functions:
3414 .TP
3415 .B const char* YYText()
3416 returns the text of the most recently matched token, the equivalent of
3417 .B yytext.
3418 .TP
3419 .B int YYLeng()
3420 returns the length of the most recently matched token, the equivalent of
3421 .B yyleng.
3422 .TP
3423 .B int lineno() const
3424 returns the current input line number
3425 (see
3426 .B %option yylineno),
3427 or
3428 .B 1
3429 if
3430 .B %option yylineno
3431 was not used.
3432 .TP
3433 .B void set_debug( int flag )
3434 sets the debugging flag for the scanner, equivalent to assigning to
3435 .B yy_flex_debug
3436 (see the Options section above).
3437 Note that you must build the scanner using
3438 .B %option debug
3439 to include debugging information in it.
3440 .TP
3441 .B int debug() const
3442 returns the current setting of the debugging flag.
3443 .PP
3444 Also provided are member functions equivalent to
3445 .B yy_switch_to_buffer(),
3446 .B yy_create_buffer()
3447 (though the first argument is an
3448 .B std::istream*
3449 object pointer and not a
3450 .B FILE*),
3451 .B yy_flush_buffer(),
3452 .B yy_delete_buffer(),
3453 and
3454 .B yyrestart()
3455 (again, the first argument is a
3456 .B std::istream*
3457 object pointer).
3458 .PP
3459 The second class defined in
3460 .I FlexLexer.h
3461 is
3462 .B yyFlexLexer,
3463 which is derived from
3464 .B FlexLexer.
3465 It defines the following additional member functions:
3466 .TP
3467 .B
3468 yyFlexLexer( std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0 )
3469 constructs a
3470 .B yyFlexLexer
3471 object using the given streams for input and output.
3472 If not specified, the streams default to
3473 .B cin
3474 and
3475 .B cout,
3476 respectively.
3477 .TP
3478 .B virtual int yylex()
3479 performs the same role as
3480 .B yylex()
3481 does for ordinary flex scanners: it scans the input stream, consuming
3482 tokens, until a rule's action returns a value.
3483 If you derive a subclass
3484 .B S
3485 from
3486 .B yyFlexLexer
3487 and want to access the member functions and variables of
3488 .B S
3489 inside
3490 .B yylex(),
3491 then you need to use
3492 .B %option yyclass="S"
3493 to inform
3494 .I flex
3495 that you will be using that subclass instead of
3496 .B yyFlexLexer.
3497 In this case, rather than generating
3498 .B yyFlexLexer::yylex(),
3499 .I flex
3500 generates
3501 .B S::yylex()
3502 (and also generates a dummy
3503 .B yyFlexLexer::yylex()
3504 that calls
3505 .B yyFlexLexer::LexerError()
3506 if called).
3507 .TP
3508 .B
3509 virtual void switch_streams(std::istream* new_in = 0,
3510 .B
3511 std::ostream* new_out = 0)
3512 reassigns
3513 .B yyin
3514 to
3515 .B new_in
3516 (if non-nil)
3517 and
3518 .B yyout
3519 to
3520 .B new_out
3521 (ditto), deleting the previous input buffer if
3522 .B yyin
3523 is reassigned.
3524 .TP
3525 .B
3526 int yylex( std::istream* new_in, std::ostream* new_out = 0 )
3527 first switches the input streams via
3528 .B switch_streams( new_in, new_out )
3529 and then returns the value of
3530 .B yylex().
3531 .PP
3532 In addition,
3533 .B yyFlexLexer
3534 defines the following protected virtual functions which you can redefine
3535 in derived classes to tailor the scanner:
3536 .TP
3537 .B
3538 virtual int LexerInput( char* buf, int max_size )
3539 reads up to
3540 .B max_size
3541 characters into
3542 .B buf
3543 and returns the number of characters read.
3544 To indicate end-of-input, return 0 characters.
3545 Note that "interactive" scanners (see the
3546 .B \-B
3547 and
3548 .B \-I
3549 flags) define the macro
3550 .B YY_INTERACTIVE.
3551 If you redefine
3552 .B LexerInput()
3553 and need to take different actions depending on whether or not
3554 the scanner might be scanning an interactive input source, you can
3555 test for the presence of this name via
3556 .B #ifdef.
3557 .TP
3558 .B
3559 virtual void LexerOutput( const char* buf, int size )
3560 writes out
3561 .B size
3562 characters from the buffer
3563 .B buf,
3564 which, while NUL-terminated, may also contain "internal" NUL's if
3565 the scanner's rules can match text with NUL's in them.
3566 .TP
3567 .B
3568 virtual void LexerError( const char* msg )
3569 reports a fatal error message.
3570 The default version of this function writes the message to the stream
3571 .B cerr
3572 and exits.
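.PP
The LexerInput() contract described above (fill the buffer with at most
max_size characters, return the count, return 0 at end of input) can be
sketched outside of any scanner class.
The code below is a free-standing illustration, not the implementation
found in the generated scanner; lexer_input_sketch and drain are
hypothetical names, and a real override would be a member of a class
derived from yyFlexLexer.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Copy up to max_size characters from `in` into `buf` and return how many
// were stored.  Returning 0 signals that the input is exhausted.
int lexer_input_sketch(std::istream& in, char* buf, int max_size)
{
    if (max_size <= 0)
        return 0;
    in.read(buf, max_size);
    return static_cast<int>(in.gcount());
}

// Example driver: pull `text` through the function in chunk_size pieces
// and return the total number of characters delivered.
std::size_t drain(const std::string& text, int chunk_size)
{
    std::istringstream in(text);
    std::string buf(chunk_size > 0 ? chunk_size : 1, '\0');
    std::size_t total = 0;
    int n;
    while ((n = lexer_input_sketch(in, &buf[0], chunk_size)) > 0)
        total += static_cast<std::size_t>(n);
    return total;
}
```

Reading block-wise like this suits batch input; a version meant for an
interactive source would presumably return as soon as some input is
available rather than wait for a full buffer.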
3573 .PP
3574 Note that a
3575 .B yyFlexLexer
3576 object contains its
3577 .I entire
3578 scanning state.
3579 Thus you can use such objects to create reentrant scanners.
3580 You can instantiate multiple instances of the same
3581 .B yyFlexLexer
3582 class, and you can also combine multiple C++ scanner classes together
3583 in the same program using the
3584 .B \-P
3585 option discussed above.
3586 .PP
3587 Finally, note that the
3588 .B %array
3589 feature is not available to C++ scanner classes; you must use
3590 .B %pointer
3591 (the default).
3592 .PP
3593 Here is an example of a simple C++ scanner:
3594 .nf
3595
3596         // An example of using the flex C++ scanner class.
3597
3598     %{
3599     int mylineno = 0;
3600     %}
3601
3602     string  \\"[^\\n"]+\\"
3603
3604     ws      [ \\t]+
3605
3606     alpha   [A-Za-z]
3607     dig     [0-9]
3608     name    ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
3609     num1    [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
3610     num2    [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
3611     number  {num1}|{num2}
3612
3613     %%
3614
3615     {ws}    /* skip blanks and tabs */
3616
3617     "/*"    {
3618             int c;
3619
3620             while((c = yyinput()) != 0)
3621                 {
3622                 if(c == '\\n')
3623                     ++mylineno;
3624
3625                 else if(c == '*')
3626                     {
3627                     if((c = yyinput()) == '/')
3628                         break;
3629                     else
3630                         unput(c);
3631                     }
3632                 }
3633             }
3634
3635     {number}  std::cout << "number " << YYText() << '\\n';
3636
3637     \\n        mylineno++;
3638
3639     {name}    std::cout << "name " << YYText() << '\\n';
3640
3641     {string}  std::cout << "string " << YYText() << '\\n';
3642
3643     %%
3644
3645     int main( int /* argc */, char** /* argv */ )
3646         {
3647         FlexLexer* lexer = new yyFlexLexer;
3648         while(lexer->yylex() != 0)
3649             ;
3650         return 0;
3651         }
3652 .fi
3653 If you want to create multiple (different) lexer classes, you use the
3654 .B \-P
3655 flag (or the
3656 .B prefix=
3657 option) to rename each
3658 .B yyFlexLexer
3659 to some other
3660 .B xxFlexLexer.
3661 You then can include
3662 .B <FlexLexer.h>
3663 in your other sources once per lexer class, first renaming
3664 .B yyFlexLexer
3665 as follows:
3666 .nf
3667
3668     #undef yyFlexLexer
3669     #define yyFlexLexer xxFlexLexer
3670     #include <FlexLexer.h>
3671
3672     #undef yyFlexLexer
3673     #define yyFlexLexer zzFlexLexer
3674     #include <FlexLexer.h>
3675
3676 .fi
3677 if, for example, you used
3678 .B %option prefix="xx"
3679 for one of your scanners and
3680 .B %option prefix="zz"
3681 for the other.
3682 .PP
3683 IMPORTANT: the present form of the scanning class is
3684 .I experimental
3685 and may change considerably between major releases.
3686 .SH INCOMPATIBILITIES WITH LEX AND POSIX
3687 .I flex
3688 is a rewrite of the AT&T Unix
3689 .I lex
3690 tool (the two implementations do not share any code, though),
3691 with some extensions and incompatibilities, both of which
3692 are of concern to those who wish to write scanners acceptable
3693 to either implementation.
3694 Flex is fully compliant with the POSIX
3695 .I lex
3696 specification, except that when using
3697 .B %pointer
3698 (the default), a call to
3699 .B unput()
3700 destroys the contents of
3701 .B yytext,
3702 which is counter to the POSIX specification.
3703 .PP
3704 In this section we discuss all of the known areas of incompatibility
3705 between flex, AT&T lex, and the POSIX specification.
3706 .PP
3707 .I flex's
3708 .B \-l
3709 option turns on maximum compatibility with the original AT&T
3710 .I lex
3711 implementation, at the cost of a major loss in the generated scanner's
3712 performance.
3713 We note below which incompatibilities can be overcome
3714 using the
3715 .B \-l
3716 option.
3717 .PP
3718 .I flex
3719 is fully compatible with
3720 .I lex
3721 with the following exceptions:
3722 .IP -
3723 The undocumented
3724 .I lex
3725 scanner internal variable
3726 .B yylineno
3727 is not supported unless
3728 .B \-l
3729 or
3730 .B %option yylineno
3731 is used.
3732 .IP
3733 .B yylineno
3734 should be maintained on a per-buffer basis, rather than a per-scanner
3735 (single global variable) basis.
3736 .IP
3737 .B yylineno
3738 is not part of the POSIX specification.
3739 .IP -
3740 The
3741 .B input()
3742 routine is not redefinable, though it may be called to read characters
3743 following whatever has been matched by a rule.
3744 If
3745 .B input()
3746 encounters an end-of-file the normal
3747 .B yywrap()
3748 processing is done.
3749 A ``real'' end-of-file is returned by
3750 .B input()
3751 as
3752 .I EOF.
3753 .IP
3754 Input is instead controlled by defining the
3755 .B YY_INPUT
3756 macro.
3757 .IP
3758 The
3759 .I flex
3760 restriction that
3761 .B input()
3762 cannot be redefined is in accordance with the POSIX specification,
3763 which simply does not specify any way of controlling the
3764 scanner's input other than by making an initial assignment to
3765 .I yyin.
3766 .IP -
3767 The
3768 .B unput()
3769 routine is not redefinable.
3770 This restriction is in accordance with POSIX.
3771 .IP -
3772 .I flex
3773 scanners are not as reentrant as
3774 .I lex
3775 scanners.
3776 In particular, if you have an interactive scanner and
3777 an interrupt handler which long-jumps out of the scanner, and
3778 the scanner is subsequently called again, you may get the following
3779 message:
3780 .nf
3781
3782     fatal flex scanner internal error--end of buffer missed
3783
3784 .fi
3785 To reenter the scanner, first use
3786 .nf
3787
3788     yyrestart( yyin );
3789
3790 .fi
3791 Note that this call will throw away any buffered input; usually this
3792 isn't a problem with an interactive scanner.
3793 .IP
3794 Also note that flex C++ scanner classes
3795 .I are
3796 reentrant, so if using C++ is an option for you, you should use
3797 them instead.
3798 See "Generating C++ Scanners" above for details.
3799 .IP -
3800 .B output()
3801 is not supported.
3802 Output from the
3803 .B ECHO
3804 macro is done to the file-pointer
3805 .I yyout
3806 (default
3807 .I stdout).
3808 .IP
3809 .B output()
3810 is not part of the POSIX specification.
3811 .IP -
3812 .I lex
3813 does not support exclusive start conditions (%x), though they
3814 are in the POSIX specification.
3815 .IP -
3816 When definitions are expanded,
3817 .I flex
3818 encloses them in parentheses.
3819 With lex, the following:
3820 .nf
3821
3822     NAME    [A-Z][A-Z0-9]*
3823     %%
3824     foo{NAME}?      printf( "Found it\\n" );
3825     %%
3826
3827 .fi
3828 will not match the string "foo" because when the macro
3829 is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
3830 and the precedence is such that the '?' is associated with
3831 "[A-Z0-9]*".
3832 With
3833 .I flex,
3834 the rule will be expanded to
3835 "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
3836 .IP
3837 Note that if the definition begins with
3838 .B ^
3839 or ends with
3840 .B $
3841 then it is
3842 .I not
3843 expanded with parentheses, to allow these operators to appear in
3844 definitions without losing their special meanings.
3845 But the
3846 .B <s>, /,
3847 and
3848 .B <<EOF>>
3849 operators cannot be used in a
3850 .I flex
3851 definition.
3852 .IP
3853 Using
3854 .B \-l
3855 results in the
3856 .I lex
3857 behavior of no parentheses around the definition.
3858 .IP
3859 The POSIX specification is that the definition be enclosed in parentheses.
3860 .IP -
3861 Some implementations of
3862 .I lex
3863 allow a rule's action to begin on a separate line, if the rule's pattern
3864 has trailing whitespace:
3865 .nf
3866
3867     %%
3868     foo|bar<space here>
3869       { foobar_action(); }
3870
3871 .fi
3872 .I flex
3873 does not support this feature.
3874 .IP -
3875 The
3876 .I lex
3877 .B %r
3878 (generate a Ratfor scanner) option is not supported.
3879 It is not part
3880 of the POSIX specification.
3881 .IP -
3882 After a call to
3883 .B unput(),
3884 .I yytext
3885 is undefined until the next token is matched, unless the scanner
3886 was built using
3887 .B %array.
3888 This is not the case with
3889 .I lex
3890 or the POSIX specification.
3891 The
3892 .B \-l
3893 option does away with this incompatibility.
3894 .IP -
3895 The precedence of the
3896 .B {}
3897 (numeric range) operator is different.
3898 .I lex
3899 interprets "abc{1,3}" as "match one, two, or
3900 three occurrences of 'abc'", whereas
3901 .I flex
3902 interprets it as "match 'ab'
3903 followed by one, two, or three occurrences of 'c'".
3904 The latter is in agreement with the POSIX specification.
3905 .IP -
3906 The precedence of the
3907 .B ^
3908 operator is different.
3909 .I lex
3910 interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
3911 or 'bar' anywhere", whereas
3912 .I flex
3913 interprets it as "match either 'foo' or 'bar' if they come at the beginning
3914 of a line".
3915 The latter is in agreement with the POSIX specification.
3916 .IP -
3917 The special table-size declarations such as
3918 .B %a
3919 supported by
3920 .I lex
3921 are not required by
3922 .I flex
3923 scanners;
3924 .I flex
3925 ignores them.
3926 .IP -
3927 The name
3928 .B FLEX_SCANNER
3929 is #define'd so scanners may be written for use with either
3930 .I flex
3931 or
3932 .I lex.
3933 Scanners also include
3934 .B YY_FLEX_MAJOR_VERSION
3935 and
3936 .B YY_FLEX_MINOR_VERSION
3937 indicating which version of
3938 .I flex
3939 generated the scanner
3940 (for example, for the 2.5 release, these defines would be 2 and 5
3941 respectively).
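.PP
The FLEX_SCANNER and version macros mentioned in the last item are
typically tested with the preprocessor.
The following is a minimal sketch of such a guard; scanner_origin is a
hypothetical helper, and the code is assumed to be placed in (or included
from) the generated scanner, since the macros are only defined there.

```cpp
#include <string>

// Report which tool generated the enclosing scanner.  FLEX_SCANNER,
// YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION are defined by flex in
// the scanners it generates; everywhere else the #else branch is taken.
std::string scanner_origin()
{
#ifdef FLEX_SCANNER
    return "flex " + std::to_string(YY_FLEX_MAJOR_VERSION) + "." +
           std::to_string(YY_FLEX_MINOR_VERSION);
#else
    return "not flex";
#endif
}
```

The same #ifdef pattern can be used to fence off flex-only features
(for example the ones listed below) in a specification that must also be
accepted by another lex implementation.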
3942 .PP
3943 The following
3944 .I flex
3945 features are not included in
3946 .I lex
3947 or the POSIX specification:
3948 .nf
3949
3950     C++ scanners
3951     %option
3952     start condition scopes
3953     start condition stacks
3954     interactive/non-interactive scanners
3955     yy_scan_string() and friends
3956     yyterminate()
3957     yy_set_interactive()
3958     yy_set_bol()
3959     YY_AT_BOL()
3960     <<EOF>>
3961     <*>
3962     YY_DECL
3963     YY_START
3964     YY_USER_ACTION
3965     YY_USER_INIT
3966     #line directives
3967     %{}'s around actions
3968     multiple actions on a line
3969
3970 .fi
3971 plus almost all of the flex flags.
3972 The last feature in the list refers to the fact that with
3973 .I flex
3974 you can put multiple actions on the same line, separated with
3975 semi-colons, while with
3976 .I lex,
3977 the following
3978 .nf
3979
3980     foo    handle_foo(); ++num_foos_seen;
3981
3982 .fi
3983 is (rather surprisingly) truncated to
3984 .nf
3985
3986     foo    handle_foo();
3987
3988 .fi
3989 .I flex
3990 does not truncate the action.
3991 Actions that are not enclosed in
3992 braces are simply terminated at the end of the line.
3993 .SH DIAGNOSTICS
3994 .I warning, rule cannot be matched
3995 indicates that the given rule
3996 cannot be matched because it follows other rules that will
3997 always match the same text as it.
3998 For example, in the following "foo" cannot be matched because it comes after
3999 an identifier "catch-all" rule:
4000 .nf
4001
4002     [a-z]+    got_identifier();
4003     foo       got_foo();
4004
4005 .fi
4006 Using
4007 .B REJECT
4008 in a scanner suppresses this warning.
4009 .PP
4010 .I warning,
4011 .B \-s
4012 .I
4013 option given but default rule can be matched
4014 means that it is possible (perhaps only in a particular start condition)
4015 that the default rule (match any single character) is the only one
4016 that will match a particular input.
4017 Since
4018 .B \-s
4019 was given, presumably this is not intended.
4020 .PP
4021 .I reject_used_but_not_detected undefined
4022 or
4023 .I yymore_used_but_not_detected undefined -
4024 These errors can occur at compile time.
4025 They indicate that the scanner uses
4026 .B REJECT
4027 or
4028 .B yymore()
4029 but that
4030 .I flex
4031 failed to notice the fact, meaning that
4032 .I flex
4033 scanned the first two sections looking for occurrences of these actions
4034 and failed to find any, but somehow you snuck some in (via a #include
4035 file, for example).
4036 Use
4037 .B %option reject
4038 or
4039 .B %option yymore
4040 to indicate to flex that you really do use these features.
4041 .PP
4042 .I flex scanner jammed -
4043 a scanner compiled with
4044 .B \-s
4045 has encountered an input string which wasn't matched by
4046 any of its rules.
4047 This error can also occur due to internal problems.
4048 .PP
4049 .I token too large, exceeds YYLMAX -
4050 your scanner uses
4051 .B %array
4052 and one of its rules matched a string longer than the
4053 .B YYLMAX
4054 constant (8K bytes by default).
4055 You can increase the value by
4056 #define'ing
4057 .B YYLMAX
4058 in the definitions section of your
4059 .I flex
4060 input.
4061 .PP
4062 .I scanner requires \-8 flag to
4063 .I use the character 'x' -
4064 Your scanner specification includes recognizing the 8-bit character
4065 .I 'x'
4066 and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
4067 because you used the
4068 .B \-Cf
4069 or
4070 .B \-CF
4071 table compression options.
4072 See the discussion of the
4073 .B \-7
4074 flag for details.
4075 .PP
4076 .I flex scanner push-back overflow -
4077 you used
4078 .B unput()
4079 to push back so much text that the scanner's buffer could not hold
4080 both the pushed-back text and the current token in
4081 .B yytext.
4082 Ideally the scanner should dynamically resize the buffer in this case, but at
4083 present it does not.
4084 .PP
4085 .I
4086 input buffer overflow, can't enlarge buffer because scanner uses REJECT -
4087 the scanner was working on matching an extremely large token and needed
4088 to expand the input buffer.
4089 This doesn't work with scanners that use
4090 .B
4091 REJECT.
4092 .PP
4093 .I
4094 fatal flex scanner internal error--end of buffer missed -
4095 This can occur in a scanner which is reentered after a long-jump
4096 has jumped out (or over) the scanner's activation frame.
4097 Before reentering the scanner, use:
4098 .nf
4099
4100     yyrestart( yyin );
4101
4102 .fi
4103 or, as noted above, switch to using the C++ scanner class.
4104 .PP
4105 .I too many start conditions in <> construct! -
4106 you listed more start conditions in a <> construct than exist (so
4107 you must have listed at least one of them twice).
4108 .SH FILES
4109 .TP
4110 .B \-ll
4111 library with which scanners must be linked.
4112 .TP
4113 .I lex.yy.c
4114 generated scanner (called
4115 .I lexyy.c
4116 on some systems).
4117 .TP
4118 .I lex.yy.cc
4119 generated C++ scanner class, when using
4120 .B \-+.
4121 .TP
4122 .I <FlexLexer.h>
4123 header file defining the C++ scanner base class,
4124 .B FlexLexer,
4125 and its derived class,
4126 .B yyFlexLexer.
4127 .TP
4128 .I flex.skl
4129 skeleton scanner.
4130 This file is only used when building flex, not when flex executes.
4131 .TP
4132 .I lex.backup
4133 backing-up information for
4134 .B \-b
4135 flag (called
4136 .I lex.bck
4137 on some systems).
4138 .SH DEFICIENCIES / BUGS
4139 Some trailing context
4140 patterns cannot be properly matched and generate
4141 warning messages ("dangerous trailing context").
4142 These are patterns where the ending of the
4143 first part of the rule matches the beginning of the second
4144 part, such as "zx*/xy*", where the 'x*' matches the 'x' at
4145 the beginning of the trailing context.
4146 (Note that the POSIX draft
4147 states that the text matched by such patterns is undefined.)
4148 .PP
4149 For some trailing context rules, parts which are actually fixed-length are
4150 not recognized as such, leading to the above mentioned performance loss.
4151 In particular, parts using '|' or {n} (such as "foo{3}") are always
4152 considered variable-length.
4153 .PP
4154 Combining trailing context with the special '|' action can result in
4155 .I fixed
4156 trailing context being turned into the more expensive
4157 .I variable
4158 trailing context.
4159 This happens, for example, with the following rules:
4160 .nf
4161
4162     %%
4163     abc      |
4164     xyz/def
4165
4166 .fi
4167 .PP
4168 Use of
4169 .B unput()
4170 invalidates yytext and yyleng, unless the
4171 .B %array
4172 directive
4173 or the
4174 .B \-l
4175 option has been used.
4176 .PP
4177 Pattern-matching of NUL's is substantially slower than matching other
4178 characters.
4179 .PP
4180 Dynamic resizing of the input buffer is slow, as it entails rescanning
4181 all the text matched so far by the current (generally huge) token.
4182 .PP
4183 Due to both buffering of input and read-ahead, you cannot intermix
4184 calls to <stdio.h> routines, such as
4185 .B getchar(),
4186 with
4187 .I flex
4188 rules and expect it to work.
4189 Call
4190 .B input()
4191 instead.
4192 .PP
4193 The total number of table entries listed by the
4194 .B \-v
4195 flag excludes the number of table entries needed to determine
4196 what rule has been matched.
4197 The number of entries is equal
4198 to the number of DFA states if the scanner does not use
4199 .B REJECT,
4200 and somewhat greater than the number of states if it does.
4201 .PP
4202 .B REJECT
4203 cannot be used with the
4204 .B \-f
4205 or
4206 .B \-F
4207 options.
4208 .PP
4209 The
4210 .I flex
4211 internal algorithms need documentation.
4212 .SH SEE ALSO
4213 lex(1), yacc(1), sed(1), awk(1).
4214 .PP
4215 John Levine, Tony Mason, and Doug Brown,
4216 .I Lex & Yacc,
4217 O'Reilly and Associates.
4218 Be sure to get the 2nd edition.
4219 .PP
4220 M. E. Lesk and E. Schmidt,
4221 .I LEX \- Lexical Analyzer Generator
4222 .PP
4223 Alfred Aho, Ravi Sethi and Jeffrey Ullman,
4224 .I Compilers: Principles, Techniques and Tools,
4225 Addison-Wesley (1986).
4226 Describes the pattern-matching techniques used by
4227 .I flex
4228 (deterministic finite automata).
4229 .SH AUTHOR
4230 Vern Paxson, with the help of many ideas and much inspiration from
4231 Van Jacobson.
4232 Original version by Jef Poskanzer.
4233 The fast table
4234 representation is a partial implementation of a design done by Van
4235 Jacobson.
4236 The implementation was done by Kevin Gong and Vern Paxson.
4237 .PP
4238 Thanks to the many
4239 .I flex
4240 beta-testers, feedbackers, and contributors, especially Francois Pinard,
4241 Casey Leedom,
4242 Robert Abramovitz,
4243 Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4244 Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4245 Karl Berry, Peter A. Bigot, Simon Blanchard,
4246 Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4247 Brian Clapper, J.T. Conklin,
4248 Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4249 Daniels, Chris G. Demetriou, Theo de Raadt,
4250 Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4251 Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4252 Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4253 Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4254 Jan Hajic, Charles Hemphill, NORO Hideo,
4255 Jarkko Hietaniemi, Scott Hofmann,
4256 Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4257 Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4258 Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4259 Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4260 Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4261 Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4262 David Loffredo, Mike Long,
4263 Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4264 Bengt Martensson, Chris Metcalf,
4265 Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4266 G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4267 Richard Ohnemus, Karsten Pahnke,
4268 Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
4269 Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4270 Frederic Raimbault, Pat Rankin, Rick Richardson,
4271 Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4272 Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4273 Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4274 Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strömquist,
4275 Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4276 Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4277 Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
4278 Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4279 and those whose names have slipped my marginal
4280 mail-archiving skills but whose contributions are appreciated all the
4281 same.
4282 .PP
4283 Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4284 John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4285 Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4286 distribution headaches.
4287 .PP
4288 Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
4289 Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
4290 Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
4291 Eric Hughes for support of multiple buffers.
4292 .PP
4293 This work was primarily done when I was with the Real Time Systems Group
4294 at the Lawrence Berkeley Laboratory in Berkeley, CA.
4295 Many thanks to all there for the support I received.
4296 .PP
4297 Send comments to vern@ee.lbl.gov.