1 .\" Copyright (C) Caldera International Inc. 2001-2002. All rights reserved.
3 .\" Redistribution and use in source and binary forms, with or without
4 .\" modification, are permitted provided that the following conditions are
7 .\" Redistributions of source code and documentation must retain the above
8 .\" copyright notice, this list of conditions and the following
11 .\" Redistributions in binary form must reproduce the above copyright
12 .\" notice, this list of conditions and the following disclaimer in the
13 .\" documentation and/or other materials provided with the distribution.
15 .\" All advertising materials mentioning features or use of this software
16 .\" must display the following acknowledgement:
18 .\" This product includes software developed or owned by Caldera
19 .\" International, Inc. Neither the name of Caldera International, Inc.
20 .\" nor the names of other contributors may be used to endorse or promote
21 .\" products derived from this software without specific prior written
24 .\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
25 .\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
26 .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
27 .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
28 .\" DISCLAIMED. IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
29 .\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
30 .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
31 .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
32 .\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
33 .\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
34 .\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
35 .\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
37 .\" @(#)lex.ms 8.1 (Berkeley) 6/8/93
Bell Laboratories, Murray Hill, NJ 07974.
.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
.\".TM 75-1274-15 39199 39199-11
Lex \- A Lexical Analyzer Generator
.AU ``MH 2C-569'' 6377
M. E. Lesk and E. Schmidt
NOTE: This document describes the historical Unix version of \fIlex\fP.
FreeBSD is supplied with \fIflex\fP\|, which is a compatible replacement.
See the extensive documentation in \fIflex(1)\fP\| for details.
Lex helps write programs whose control flow
is directed by instances of regular
expressions in the input stream.
It is well suited for editor-script type transformations and
for segmenting input in preparation for
a parsing routine.
.PP
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program
which reads an input stream, copying it to an output stream
and partitioning the input
into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed.
The recognition of the expressions
is performed by a deterministic finite automaton
generated by Lex.
The program fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.
.PP
The lexical analysis
programs written with Lex accept ambiguous specifications
and choose the longest
match possible at each input point.
If necessary, substantial lookahead
is performed on the input, but the
input stream will be backed up to the
end of the current partition, so that the user
has general freedom to manipulate it.
.PP
Lex can generate analyzers in either C or Ratfor, a language
which can be translated automatically to portable Fortran.
It is available on the PDP-11 UNIX, Honeywell GCOS,
and IBM OS systems.
This manual, however, will only discuss generating analyzers
in C on the UNIX system, which is the only supported
form of Lex under UNIX Version 7.
Lex is designed to simplify
interfacing with Yacc, for those
with access to this compiler-compiler system.
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem oriented specification
for character string matching,
and produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions.  At the boundaries
between strings program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
The user supplies the additional code
beyond expression matching
needed to complete his tasks, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general purpose programming language employed for the
user's program fragments.
Thus, a high level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This avoids forcing the user who wishes to use a string manipulation
language for input analysis to write processing programs in the same
and often inappropriate string handling language.
.PP
Lex is not a complete language, but rather a generator representing
a new language feature which can be added to
different programming languages, called ``host languages.''
Just as general purpose languages
can produce code to run on different computer hardware,
Lex can write code in different host languages.
The host language is used for the output code generated by Lex
and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages
are also provided.
This makes Lex adaptable to different environments and
different users.
Each application
may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local
implementations.
At present, the only supported host language is C,
although Fortran (in the form of Ratfor [2]) has been available
in the past.
Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
.PP
Lex turns the user's expressions and actions
(called
\fIsource\fR
in this memo) into the host general-purpose language;
the generated program is named
\fIyylex\fR.
The
\fIyylex\fR
program
will recognize expressions
in a stream
(called
\fIinput\fR
in this memo)
and perform the specified actions for each expression as it is detected.

Source \(-> Lex \(-> yylex

Input \(-> yylex \(-> Output
For a trivial example, consider a program to delete
from the input
all blanks or tabs at the ends of lines.

%%
[ \et]+$	;

is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
which matches one or more
instances of the characters blank or tab
(written \et for visibility, in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates ``one or more ...'';
and the $ indicates ``end of line,'' as in QED.
No action is specified,
so the program generated by Lex (yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule.
The finite automaton generated for this
source will scan for both rules at once,
observing at
the termination of the string of blanks or tabs
whether or not there is a newline character, and executing
the desired rule action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule all remaining strings of blanks or tabs.
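Taken together, the two rules just described can be sketched as the following source (reconstructed from the description above; the second action prints a single blank):

```lex
%%
[ \t]+$		;
[ \t]+		printf(" ");
```

The first rule, having no action, discards trailing blanks and tabs; everything not matched by either rule is copied through unchanged by the default action.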
Lex can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
Lex can also be used with a parser generator
to perform the lexical analysis phase; it is particularly
easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context free grammars,
but require a lower level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate.
When used as a preprocessor for a later parser generator,
Lex is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
The flow of control
in such a case (which might be the first half of a compiler,
for example) is shown in Figure 2.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by Lex.

Input \(-> yylex \(-> yyparse \(-> Parsed input

Yacc users
will realize that the name
\fIyylex\fR
is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies
interfacing.
.PP
Lex generates a deterministic finite automaton from the regular expressions
in the source [4].
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a Lex program
to recognize and partition an input stream is
proportional to the length of the input.
The number of Lex rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by Lex.
.PP
In the program written by Lex, the user's fragments
(representing the
\fIactions\fR
to be performed as each regular expression
is found)
are gathered
as cases of a switch.
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or
add subroutines outside this action routine.
.PP
Lex is not limited to source which can
be interpreted on the basis of one character
lookahead.
For example,
if there are two rules, one looking for
\fIab\fR
and another for
\fIabcdefg\fR,
and the input stream is
\fIabcdefh\fR,
Lex will recognize
\fIab\fR
and leave
the input pointer just before
\fIcd\fR.
Such backup is more costly
than the processing of simpler languages.
The general format of Lex source is:

{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines
are often omitted.  The second %%
is optional, but the first is required
to mark the beginning of the rules.
The absolute minimum Lex program is thus

%%

(no definitions, no rules) which translates into a program
which copies the input to the output unchanged.
.PP
In the outline of Lex programs shown above, the
\fIrules\fR
represent the user's control
decisions; they are a table, in which the left column
contains
\fIregular expressions\fR
and the right column contains
\fIactions\fR,
program fragments to be executed when the expressions
are recognized.
Thus an individual rule might appear

integer	printf("found keyword INT");

to look for the string
\fIinteger\fR
in the input stream and
print the message ``found keyword INT'' whenever it appears.
In this example the host procedural language is C and
the C library function
\fIprintf\fR
is used to print the string.
The end
of the expression is indicated by the first blank or tab character.
If the action is merely a single C expression,
it can just be given on the right side of the line; if it is
compound, or takes more than a line, it should be enclosed in
braces.
As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling.
Lex rules such as

colour	printf("color");
mechanise	printf("mechanize");
petrol	printf("gas");

would be a start.  These rules are not quite enough,
since the word \fIpetroleum\fR would become \fIgaseum\fR;
a way of dealing
with this will be described later.
Lex Regular Expressions.
.PP
The definitions of regular expressions are very similar to those
in QED [5].
A regular
expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding
characters in the strings being compared)
and operator characters (which specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters; thus the regular expression

integer

matches the string \fIinteger\fR wherever it appears
and the expression

a57D

looks for the string \fIa57D\fR.
.PP
\fIOperators.\fR
The operator characters are

" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape
must be used.
The quotation mark operator (")
indicates that whatever is contained between a pair of quotes
is to be taken as text characters.
Thus

xyz"++"

matches the string \fIxyz++\fR
when it appears.  Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression

"xyz++"

is the same as the one above.
Thus by quoting every non-alphanumeric character
being used as a text character, the user can avoid remembering
the list above of current
operator characters, and is safe should further extensions to Lex
lengthen the list.
.PP
An operator character may also be turned into a text character
by preceding it with \e as in

xyz\e+\e+

which
is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within [\|] (see below) must
be quoted.
Several normal C escapes with \e
are recognized: \en is newline, \et is tab, and \eb is backspace.
To enter \e itself, use \e\e.
Since newline is illegal in an expression, \en must be used;
it is not
required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
Character classes.
Classes of characters can be specified using the operator pair [\|].
The construction
\fI[abc]\fR
matches a
single character, which may be
\fIa\fR,
\fIb\fR,
or
\fIc\fR.
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are \e \(mi and ^.  The \(mi character
indicates ranges.  For example,

[a\(miz0\(mi9<>_]

indicates the character class containing all the lower case letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using \(mi between any pair of characters which are
not both upper case letters, both lower case letters, or both digits
is implementation dependent and will get a warning message.
(E.g., [0\-z] in ASCII is many more characters
than it is in EBCDIC).
If it is desired to include the
character \(mi in a character class, it should be first or
last; thus

[\(mi+0\(mi9]

matches all the digits and the two signs.
.PP
In character classes,
the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting string
is to be complemented with respect to the computer character set.
Thus

[^abc]

matches all characters except a, b, or c, including
all special or control characters; or

[^a\(mizA\(miZ]

is any character which is not a letter.
The \e character provides the usual escapes within
character class brackets.
.PP
Arbitrary character.
To match almost any character, the operator character
\fI.\fR
(period)
is the class of all characters except newline.
Escaping into octal is possible although non-portable:

[\e40\(mi\e176]

matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
Optional expressions.
The operator ?
indicates
an optional element of an expression.
Thus

ab?c

matches either \fIac\fR or \fIabc\fR.
.PP
Repeated expressions.
Repetitions of classes are indicated by the operators
\(** and +.
\fIa\(**\fR
is any number of consecutive
\fIa\fR
characters, including zero; while
\fIa+\fR
is one or more instances of
\fIa\fR.
As useful examples,

[a\(miz]+

is all strings of lower case letters.
And

[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**

indicates all alphanumeric strings with a leading
alphabetic character.
This is a typical expression for recognizing identifiers in
computer languages.
.PP
Alternation and Grouping.
The operator |
indicates alternation:

(ab\||\|cd)

matches either \fIab\fR or \fIcd\fR.
Note that parentheses are used for grouping, although
they are
not necessary on the outside level;

ab\||\|cd

would have sufficed.
Parentheses
can be used for more complex expressions:

(ab\||\|cd+)?(ef)\(**

matches such strings as
\fIabefef\fR, \fIefefef\fR, \fIcdef\fR, or \fIcddd\fR;
but not
\fIabc\fR, \fIabcd\fR, or \fIabcdef\fR.
Context sensitivity.
Lex will recognize a small amount of surrounding
context.  The two simplest operators for this are
^ and $.
If the first character of an expression is
^, then
the expression will only be matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of
^, complementation
of character classes, since that only applies within
the [] operators.
If the very last character is
$, then
the expression will only be matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the
/ operator character,
which indicates trailing context.
The expression

ab/cd

matches the string \fIab\fR,
but only if followed by
\fIcd\fR.
Thus

ab$

is the same as

ab/\en

Left context is handled in Lex by
\fIstart conditions\fR
as explained in section 10.  If a rule is only to be executed
when the Lex automaton interpreter is in start condition
\fIx\fR,
the rule should be prefixed by

<x>

using the angle bracket operator characters.
If we considered ``being at the beginning of a line'' to be
start condition
\fIONE\fR,
then the ^ operator
would be equivalent to

<ONE>

Start conditions are explained more fully later.
Repetitions and Definitions.
The operators {} specify
either repetitions (if they enclose numbers)
or
definition expansion (if they enclose a name).  For example

{digit}

looks for a predefined string named
\fIdigit\fR
and inserts it
at that point in the expression.
The definitions are given in the first part of the Lex
input, before the rules.
In contrast,

a{1,5}

looks for 1 to 5 occurrences of
\fIa\fR.
.PP
Finally, initial
%
is special, being the separator
for Lex source segments.
When an expression written as above is matched, Lex
executes the corresponding action.  This section describes
some features of Lex which aid in writing actions.  Note
that there is a default action, which
consists of copying the input to the output.  This
is performed on all strings not otherwise matched.  Thus
the Lex user who wishes to absorb the entire input, without
producing any output, must provide rules to match everything.
When Lex is being used with Yacc, this is the normal
situation.
One may consider that actions are what is done instead of
copying the input to the output; thus, in general,
a rule which merely copies can be omitted.
Also, a character combination
which is omitted from the rules
and which appears as input
is likely to be printed on the output, thus calling
attention to the gap in the rules.
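A pair of rules that matches every input character and does nothing with it will absorb the whole input; a minimal sketch (not from the original text):

```lex
%%
.	;
\n	;
```

The first rule covers every character except newline, and the second covers newline itself, so no input ever reaches the default copy action.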
One of the simplest things that can be done is to ignore
the input.  Specifying a C null statement, \fI;\fR as an action
causes this result.  A frequent rule is

[ \et\en]	;

which causes the three spacing characters (blank, tab, and newline)
to be ignored.
.PP
Another easy way to avoid writing actions is the action character
|, which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written

" "	|
"\et"	|
"\en"	;

with the same result, although in different style.
The quotes around \en and \et are not required.
In more complex actions, the user will
often want to know the actual text that matched some expression
like
\fI[a\(miz]+\fR.
Lex leaves this text in an external character
array named
\fIyytext\fR.
Thus, to print the name found,
a rule like

[a\-z]+	printf("%s", yytext);

will print the string in
\fIyytext\fR.
The C function
\fIprintf\fR
accepts a format argument and data to be printed;
in this case, the format is ``print string'' (% indicating
data conversion, and \fIs\fR
indicating string type),
and the data are the characters
in
\fIyytext\fR.
So this just places
the matched string
on the output.
This action is so common that
it may be written as ECHO:

[a\-z]+	ECHO;

is the same as the above.
Since the default action is just to
print the characters found, one might ask why
give a rule, like this one, which merely specifies
the default action?
Such rules are often required
to avoid matching some other rule
which is not desired.  For example, if there is a rule
which matches
\fIread\fR
it will normally match the instances of
\fIread\fR
contained in
\fIbread\fR
or
\fIreadjust\fR;
to avoid this, a rule of the form
\fI[a\-z]+\fR
is needed.
This is explained further below.
Sometimes it is more convenient to know the end of what
has been found; hence Lex also provides a count
\fIyyleng\fR
of the number of characters matched.
To count both the number
of words and the number of characters in words in the input, the user might write

[a\-zA\-Z]+	{words++; chars += yyleng;}

which accumulates in
\fIchars\fR
the number
of characters in the words recognized.
The last character in the string matched can
be accessed by

yytext[yyleng\-1]
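A complete specification built around this counting rule might look as follows; this is a sketch, not from the original text (the counter names, the discard rules, and the use of \fIyywrap\fP, described later in this document, are illustrative):

```lex
	int words = 0, chars = 0;
%%
[a-zA-Z]+	{words++; chars += yyleng;}
.	;
\n	;
%%
yywrap()
{
	printf("%d words, %d characters\n", words, chars);
	return(1);
}
```

The two discard rules absorb everything else, so that only the counts are printed, from yywrap, when end of file is reached.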
Occasionally, a Lex
action may decide that a rule has not recognized the correct
span of characters.
Two routines are provided to aid with this situation.
First,
\fIyymore()\fR
can be called to indicate that the next input expression recognized is to be
tacked on to the end of this input.  Normally,
the next input string would overwrite the current
entry in
\fIyytext\fR.
Second,
\fIyyless(n)\fR
may be called to indicate that not all the characters matched
by the currently successful expression are wanted right now.
The argument
\fIn\fR
indicates the number of characters
in
\fIyytext\fR
to be retained.
Further characters previously matched
are
returned to the input.  This provides the same sort of
lookahead offered by the / operator,
but in a different form.
\fIExample:\fR
Consider a language which defines
a string as a set of characters between quotation (") marks, and provides that
to include a " in a string it must be preceded by a \e.  The
regular expression which matches that is somewhat confusing,
so that it might be preferable to write

\e"[^"]*	{
	if (yytext[yyleng\-1] == \(fm\e\e\(fm)
		yymore();
	else
		... normal user processing
	}

which will, when faced with a string such as

\&"abc\e"def"

first match the five characters

\&"abc\e

then the call to
\fIyymore()\fR
will
cause the next part of the string,

def"

to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled ``normal processing''.
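For comparison, the ``somewhat confusing'' single regular expression mentioned above could be sketched as follows (this pattern is a reconstruction, not taken from the original text):

```lex
\"([^"\\\n]|\\.)*\"	{ /* an entire string, escaped quotes and all, matched at once */ }
```

Inside the parentheses, either an ordinary character or a backslash followed by any character may be repeated; the pattern therefore steps over \e" pairs instead of ending the match at them.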
The function
\fIyyless()\fR
might be used to reprocess
text in various circumstances.  Consider the C problem of distinguishing
the ambiguity of ``=\(mia''.
Suppose it is desired to treat this as ``=\(mi a''
but print a message.  A rule might be

=\(mi[a\(miz]	{
	printf("Op (=\(mi) ambiguous\en");
	yyless(yyleng\(mi1);
	... action for =\(mi ...
	}

which prints a message, returns the letter after the
operator to the input stream, and treats the operator as ``=\(mi''.
Alternatively it might be desired to treat this as ``= \(mia''.
To do this, just return the minus
sign as well as the letter to the input:

=\(mi[a\(miz]	{
	printf("Op (=\(mi) ambiguous\en");
	yyless(yyleng\(mi2);
	... action for = ...
	}

will perform the other interpretation.
Note that the expressions for the two cases might more easily
be written

=\(mi/[A\(miZa\(miz]

in the first case and

=/\(mi[A\(miZa\(miz]

in the second;
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The
possibility of ``=\(mi3'', however, makes

=\(mi/[^ \et\en]

a still better rule.
In addition to these routines, Lex also permits
access to the I/O routines Lex itself uses.
They are:
1)
\fIinput()\fR
which returns the next input character;
2)
\fIoutput(c)\fR
which writes the character
\fIc\fR
on the output; and
3)
\fIunput(c)\fR
which
pushes the character
\fIc\fR
back onto the input stream to be read later by
\fIinput()\fR.
.PP
By default these routines are provided as macro definitions,
but the user can override them and supply private versions.
These routines
define the relationship between external files and
internal characters, and must all be retained
or modified consistently.
They may be redefined, to
cause input or output to be transmitted to or from strange
places, including other programs or internal memory;
but the character set used must be consistent in all routines;
a value of zero returned by
\fIinput\fR
must mean end of file; and
the relationship between
\fIunput\fR
and
\fIinput\fR
must be retained
or the Lex lookahead will not work.
Lex does not look ahead at all if it does not have to,
but every rule ending in
+ \(** ? or $ or containing /
implies lookahead.
Lookahead is also necessary to match an expression that is a prefix
of another expression.
See below for a discussion of the character set used by Lex.
The standard Lex library imposes
a 100 character limit on backup.
Another Lex library routine that the user will sometimes want
to redefine is
\fIyywrap()\fR
which is called whenever Lex reaches an end-of-file.
If
\fIyywrap\fR
returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more
input to arrive
from a new source.
In this case, the user should provide
a
\fIyywrap\fR
which
arranges for new input and
returns 0.  This instructs Lex to continue processing.
The default
\fIyywrap\fR
always returns 1.
.PP
This routine is also a convenient place
to print tables, summaries, etc. at the end
of a program.  Note that it is not
possible to write a normal rule which recognizes
end-of-file; the only access to this condition is
through
\fIyywrap\fR.
In fact, unless a private version of
\fIinput()\fR
is supplied
a file containing nulls
cannot be handled,
since a value of 0 returned by
\fIinput\fR
is taken to be end-of-file.
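A yywrap along these lines might switch to a second input file; this is a sketch, not from the original text (the file name and the \fIyyin\fP stream variable are illustrative):

```lex
%%
.|\n	ECHO;
%%
yywrap()
{
	static int wrapped = 0;
	if (wrapped)
		return(1);		/* really done: normal wrapup */
	wrapped = 1;
	yyin = fopen("next.txt", "r");	/* hypothetical second source */
	return(yyin ? 0 : 1);		/* 0 tells Lex to keep reading */
}
```

Returning 0 after reopening the stream makes Lex continue scanning as if the two files were one input.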
Ambiguous Source Rules.
.PP
Lex can handle ambiguous specifications.
When more than one expression can match the
current input, Lex chooses as follows:
1)
The longest match is preferred.
2)
Among rules which matched the same number of characters,
the rule given first is preferred.
.PP
Thus, suppose the rules

integer	keyword action ...;
[a\-z]+	identifier action ...;

to be given in that order.  If the input is
\fIintegers\fR,
it is taken as an identifier, because
\fI[a\-z]+\fR
matches 8 characters while
\fIinteger\fR
matches only 7.
If the input is
\fIinteger\fR,
both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g. \fIint\fR\|) will
not match the expression
\fIinteger\fR
and so the identifier interpretation is used.
The principle of preferring the longest
match makes rules containing
expressions like
\fI.\(**\fR
dangerous.
For example,

\(fm.\(**\(fm

might seem a good way of recognizing
a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant
single quote.
Presented with the input

\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here

the above expression will match

\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm

which is probably not what was wanted.
A better rule is of the form

\&\(fm[^\(fm\en]\(**\(fm

which, on the above input, will stop
after
\fI\(fmfirst\(fm\fR.
The consequences
of errors like this are mitigated by the fact
that the
\fI.\fR
operator will not match newline.
Thus expressions like
\fI.\(**\fR
stop on the current line.
Don't try to defeat this with expressions like

[.\en]+

or equivalents;
the Lex generated program will try to read
the entire input file, causing
internal buffer overflows.
Note that Lex is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to
count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
Some Lex rules to do this might be

she	s++;
he	h++;
\en	|
\&.	;

where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
Remember that . does not include newline.
Since \fIshe\fR includes \fIhe\fR, Lex will normally
\fInot\fR
recognize
the instances of \fIhe\fR included in \fIshe\fR,
since once it has passed a \fIshe\fR those characters are gone.
Sometimes the user would like to override this choice.  The action
REJECT
means ``go do the next alternative.''
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants to count the included instances of \fIhe\fR:

she	{s++; REJECT;}
he	{h++; REJECT;}
\en	|
\&.	;

these rules are one way of changing the previous example
to do just that.
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted.  In this example, of course,
the user could note that \fIshe\fR includes \fIhe\fR but not
vice versa, and omit the REJECT action on \fIhe\fR;
in other cases, however, it
would not be possible a priori to tell
which input characters
were in both classes.
Consider the two rules

a[bc]+	{ ... ; REJECT;}
a[cd]+	{ ... ; REJECT;}

If the input is
\fIab\fR,
only the first rule matches,
and on
\fIad\fR
only the second matches.
The input string
\fIaccb\fR
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input
\fIaccd\fR
agrees with
the second rule for four characters and then the first
rule for three.
.PP
In general, REJECT is useful whenever
the purpose of Lex is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word
\fIthe\fR
is considered to contain
both
\fIth\fR
and
\fIhe\fR.
Assuming a two-dimensional array named
\fIdigram\fR
to be incremented, the appropriate
source is

%%
[a\-z][a\-z]	{
	digram[yytext[0]][yytext[1]]++;
	REJECT;
	}
\&.	;
\en	;

where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.
Lex Source Definitions.
.PP
Remember the format of the Lex
source:

{definitions}
%%
{rules}
%%
{user routines}

So far only the rules have been described.  The user needs
additional options,
though, to define variables for use in his program and for use
by Lex.
These can go either in the definitions section
or in the rules section.
.PP
Remember that Lex is turning the rules into a program.
Any source not intercepted by Lex is copied
into the generated program.  There are three classes
of such things.
1)
Any line which is not part of a Lex rule or action
which begins with a blank or tab is copied into
the Lex generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears immediately after the first
%%,
it appears in an appropriate place for declarations
in the function written by Lex which contains the actions.
This material must look like program fragments,
and should precede the first Lex rule.
As a side effect of the above, lines which begin with a blank
or tab, and which contain a comment,
are passed through to the generated program.
This can be used to include comments in either the Lex source or
the generated code.  The comments should follow the host
language convention.
2)
Anything included between lines containing
only %{ and %} is
copied out as above.  The delimiters are discarded.
This format permits entering text like preprocessor statements that
must begin in column 1,
or copying lines that do not look like programs.
3)
Anything after the third %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.
Definitions intended for Lex are given
before the first %% delimiter.  Any line in this section
not contained between %{ and %}, and beginning
in column 1, is assumed to define Lex substitution strings.
The format of such lines is

name	translation

and it
causes the string given as a translation to
be associated with the name.
The name and translation
must be separated by at least one blank or tab, and the name must begin with a letter.
The translation can then be called out
by the {name} syntax in a rule.
Using {D} for the digits and {E} for an exponent field,
for example, might abbreviate rules to recognize numbers:

{D}+	printf("integer");
{D}+"."{D}\(**({E})?	|
{D}\(**"."{D}+({E})?	|
{D}+{E}	printf("real");

Note the first two rules for real numbers;
both require a decimal point and contain
an optional exponent field,
but the first requires at least one digit before the
decimal point and the second requires at least one
digit after the decimal point.
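Definitions for {D} and {E} along the lines the text describes might read as follows (a sketch; the exact exponent definition is a reconstruction):

```lex
D	[0-9]
E	[DEde][-+]?{D}+
%%
{D}+			printf("integer");
{D}+"."{D}*({E})?	|
{D}*"."{D}+({E})?	|
{D}+{E}			printf("real");
```

Here {E} allows an upper or lower case exponent letter, an optional sign, and at least one digit.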
To correctly handle the problem
posed by a Fortran expression such as
\fI35.EQ.I\fR,
which does not contain a real number, a context-sensitive
rule such as

[0\-9]+/"."EQ	printf("integer");

could be used in addition to the normal rule for integers.
The definitions
section may also contain other commands, including the
selection of a host language, a character set table,
a list of start conditions, or adjustments to the default
size of arrays within Lex itself for larger source programs.
These possibilities
are discussed below under ``Summary of Source Format,''
section 12.
.PP
Usage.
.PP
There are two steps in
compiling a Lex source program.
First, the Lex source must be turned into a generated program
in the host general purpose language.
Then this program must be compiled and loaded, usually with
a library of Lex subroutines.
The generated program
is on a file named
\fIlex.yy.c\fR.
The I/O library is defined in terms of the C standard
library [6].
.PP
The C programs generated by Lex are slightly different
on OS/370, because the
OS compiler is less powerful than the UNIX or GCOS compilers,
and does less at compile time.
C programs generated on GCOS and UNIX are the same.
.PP
The library is accessed by the loader flag
\fI\-ll\fR.
So an appropriate set of commands is

lex source
cc lex.yy.c \-ll

The resulting program is placed on the usual file
\fIa.out\fR
for later execution.
To use Lex with Yacc see below.
Although the default Lex I/O routines use the C standard library,
the Lex automata themselves do not do so;
if private versions of
\fIinput\fR, \fIoutput\fR and \fIunput\fR
are given, the library can be avoided.
If you want to use Lex with Yacc, note that what Lex writes is a program
named yylex(),
the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library
calls this routine, but if Yacc is loaded, and its main
program is used, Yacc will call
yylex().
In this case each Lex rule should end with
return(token);
where the appropriate token value is returned.
An easy way to get access
to Yacc's names for tokens is to
compile the Lex output file as part of
the Yacc output file by placing the line
# include "lex.yy.c"
in the last section of Yacc input.
Supposing the grammar to be
named ``good'' and the lexical rules to be named ``better'',
the UNIX command sequence can just be:
yacc good
lex better
cc y.tab.c \-ly \-ll
The Yacc library (\-ly) should be loaded before the Lex library,
to obtain a main program which invokes the Yacc parser.
The generations of Lex and Yacc programs can be done in
either order.
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7.
Here is a suitable Lex source program:
%%
	int k;
[0-9]+	{
	k = atoi(yytext);
	if (k%7 == 0)
		printf("%d", k+3);
	else
		printf("%d",k);
	}
The rule [0\-9]+ recognizes strings of digits;
atoi
converts the digits to binary
and stores the result in
k.
The operator % (remainder) is used to check whether
k
is divisible by 7; if it is,
it is incremented by 3 as it is written out.
It may be objected that this program will alter such
input items as 49.63 or X7.
Furthermore, it increments the absolute value
of all negative numbers divisible by 7.
To avoid this, just add a few more rules after the active one,
as here:
%%
	int k;
-?[0-9]+	{
	k = atoi(yytext);
	printf("%d",
		k%7 == 0 ? k+3 : k);
	}
-?[0-9.]+	ECHO;
[A-Za-z][A-Za-z0-9]+	ECHO;
Numerical strings containing
a ``.'' or preceded by a letter will be picked up by
one of the last two rules, and not changed.
The if\-else
has been replaced by
a C conditional expression to save space;
the form a?b:c means ``if a then b else c''.
For an example of statistics gathering, here
is a program which histograms the lengths
of words, where a word is defined as a string of letters.
	int lengs[100];
%%
[a-z]+	lengs[yyleng]++;
.	|
\en	;
%%
yywrap()
{
int i;
printf("Length No. words\en");
for(i=0; i<100; i++)
	if (lengs[i] > 0)
		printf("%5d%10d\en",i,lengs[i]);
return(1);
}
This program
accumulates the histogram, while producing no output. At the end
of the input it prints the table.
The statement return(1);
indicates that Lex is to perform wrapup. If
yywrap
returns zero (false)
it implies that further input is available
and the program is to continue reading and processing.
Providing a yywrap that never
returns true causes an infinite loop.
As a larger example,
here are some parts of a program written by N. L. Schryer
to convert double precision Fortran to single precision Fortran.
Because Fortran does not distinguish upper and lower case letters,
this routine begins by defining a set of classes including
both cases of each letter:
a	[aA]
b	[bB]
c	[cC]
...
z	[zZ]
An additional class recognizes white space:
W	[ \et]\(**
The first rule changes
``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}	{
printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
}
Care is taken throughout this program to preserve the case
(upper or lower)
of the original program.
The conditional operator is used to
select the proper form of the keyword.
The next rule copies continuation card indications to
avoid confusing them with constants:
^"     "[^ 0]	ECHO;
In the regular expression, the quotes surround the
blanks.
It is interpreted as
``beginning of line, then five blanks, then
anything but blank or zero.''
Note the two different meanings of
\fI^\fR here.
There follow some rules to change double precision
constants to ordinary floating constants.
[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+	|
[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+	|
"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+	{
/\(** convert constants \(**/
for(p=yytext; \(**p != 0; p++)
	{
	if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
		\(**p=+ \(fme\(fm\- \(fmd\(fm;
	ECHO;
	}
After the floating point constant is recognized, it is
scanned by the for loop to find the letter d or D.
The program then adds
.I \(fme\(fm\-\(fmd\(fm ,
converting
it to the next letter of the alphabet.
The modified constant, now single-precision,
is written out again.
There follow a series of names which must be respelled to remove
their initial \fId\fR.
By using the array
yytext
the same action suffices for all the names (only a sample of
a rather long list is given here).
{d}{s}{i}{n}	|
{d}{c}{o}{s}	|
{d}{s}{q}{r}{t}	|
{d}{a}{t}{a}{n}	|
...
{d}{f}{l}{o}{a}{t}	printf("%s",yytext+1);
Another list of names must have initial \fId\fR changed to initial \fIa\fR:
{d}{l}{o}{g}	|
{d}{l}{o}{g}10	|
{d}{m}{i}{n}1	|
{d}{m}{a}{x}1	{
	yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
	ECHO;
	}
And one routine
must have initial \fId\fR changed to initial \fIr\fR:
{d}1{m}{a}{c}{h}	{yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
	ECHO;
	}
To avoid such names as \fIdsinx\fR being detected as instances
of \fIdsin\fR, some final rules pick up longer words as identifiers
and copy some surviving characters:
[A\-Za\-z][A\-Za\-z0\-9]\(**	|
[0\-9]+	|
\en	|
.	ECHO;
Note that this program is not complete; it
does not deal with the spacing problems in Fortran or
with the use of keywords as identifiers.
Left Context Sensitivity.
Sometimes
it is desirable to have several sets of lexical rules
to be applied at different times in the input.
For example, a compiler preprocessor might distinguish
preprocessor statements and analyze them differently
from ordinary statements.
This requires sensitivity
to prior context, and there are several ways of handling
such problems.
The \fI^\fR operator, for example, is a prior context operator,
recognizing immediately preceding left context just as \fI$\fR recognizes
immediately following right context.
Adjacent left context could be extended, to produce a facility similar to
that for adjacent right context, but it is unlikely
to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
This section describes three means of dealing
with different environments: a simple use of flags,
when only a few rules change from one environment to another,
the use of start conditions on rules,
and the possibility of making multiple lexical analyzers all run
together.
In each case, there are rules which recognize the need to change the
environment in which the
following input text is analyzed, and set some parameter
to reflect the change. This may be a flag explicitly tested by
the user's action code; such a flag is the simplest way of dealing
with the problem, since Lex is not involved at all.
It may be more convenient, however,
to have Lex remember the flags as initial conditions on the rules.
Any rule may be associated with a start condition. It will only
be recognized when Lex is in
that start condition.
The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments
are very dissimilar,
clarity may be best achieved by writing several distinct lexical
analyzers, and switching from one to another as desired.
Consider the following problem: copy the input to the output,
changing the word \fImagic\fR to \fIfirst\fR on every line which began
with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
which began with the letter \fIb\fR, and changing
\fImagic\fR to \fIthird\fR on every line which began
with the letter \fIc\fR. All other words and all other lines
are left unchanged.
These rules are so simple that the easiest way
to do this job is with a flag:
	int flag;
%%
^a	{flag = \(fma\(fm; ECHO;}
^b	{flag = \(fmb\(fm; ECHO;}
^c	{flag = \(fmc\(fm; ECHO;}
\en	{flag = 0 ; ECHO;}
magic	{
	switch (flag)
	{
	case \(fma\(fm: printf("first"); break;
	case \(fmb\(fm: printf("second"); break;
	case \(fmc\(fm: printf("third"); break;
	default: ECHO; break;
	}
	}
To handle the same problem with start conditions, each
start condition must be introduced to Lex in the definitions section
with a line reading
%Start	name1 name2 ...
where the conditions may be named in any order.
The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
The conditions may be referenced at the
head of a rule with the <> brackets:
<name1>expression
is a rule which is only recognized when Lex is in the
start condition \fIname1\fR.
To enter a start condition,
execute the action statement
BEGIN name1;
which changes the start condition to \fIname1\fR.
To resume the normal state,
BEGIN 0;
resets the initial condition
of the Lex automaton interpreter.
A rule may be active in several
start conditions:
<name1,name2,name3>
is a legal prefix. Any rule not beginning with the
<> prefix operator is always active.
The same example as before can be written:
%START AA BB CC
%%
^a	{ECHO; BEGIN AA;}
^b	{ECHO; BEGIN BB;}
^c	{ECHO; BEGIN CC;}
\en	{ECHO; BEGIN 0;}
<AA>magic	printf("first");
<BB>magic	printf("second");
<CC>magic	printf("third");
where the logic is exactly the same as in the previous
method of handling the problem, but Lex does the work
rather than the user's code.
The programs generated by Lex handle
character I/O only through the routines
input, output, and unput.
Thus the character representation
provided in these routines
is accepted by Lex and employed to return
values in
yytext.
For internal use
a character is represented as a small integer
which, if the standard library is used,
has a value equal to the integer value of the bit
pattern representing the character on the host computer.
Normally, the letter
a
is represented in the same form as the character constant
\(fma\(fm.
If this interpretation is changed, by providing I/O
routines which translate the characters,
Lex must be told about
it, by giving a translation table.
This table must be in the definitions section,
and must be bracketed by lines containing only
``%T''.
The table contains lines of the form
{integer} {character string}
which indicate the value associated with each character.
Thus the next example
%T
1	Aa
2	Bb
...
26	Zz
27	\en
28	+
29	\-
30	0
31	1
...
39	9
%T
Sample character table.
maps the lower and upper case letters together into the integers 1 through 26,
newline into 27, + and \- into 28 and 29, and the
digits into 30 through 39.
Note the escape for newline.
If a table is supplied, every character that is to appear either
in the rules or in any valid input must be included
in the table.
No character
may be assigned the number 0, and no character may be
assigned a bigger number than the size of the hardware character set.
Summary of Source Format.
The general form of a Lex source file is:
{definitions}
%%
{rules}
%%
{user subroutines}
The definitions section contains
a combination of
Definitions, in the form ``name space translation''.
Included code, in the form ``space code''.
Included code, in the form
%{
code
%}
Start conditions, given in the form
%S name1 name2 ...
Character set tables, in the form
%T
number space character-string
...
%T
Changes to internal array sizes, in the form
%\fIx\fR\0\0\fInnn\fR
where \fInnn\fR is a decimal integer representing an array size
and \fIx\fR selects the parameter as follows:
p	positions
n	states
e	tree nodes
a	transitions
k	packed character classes
o	output array size
Lines in the rules section have the form ``expression action''
where the action may be continued on succeeding
lines by using braces to delimit it.
Regular expressions in Lex use the following
operators:
x	the character "x".
"x"	an "x", even if x is an operator.
\ex	an "x", even if x is an operator.
[xy]	the character x or y.
[x\-z]	the characters x, y or z.
[^x]	any character but x.
\&.	any character but newline.
^x	an x at the beginning of a line.
<y>x	an x when Lex is in start condition y.
x$	an x at the end of a line.
x?	an optional x.
x\(**	0,1,2, ... instances of x.
x+	1,2,3, ... instances of x.
x|y	an x or a y.
(x)	an x.
x/y	an x but only if followed by y.
{xx}	the translation of xx from the
	definitions section.
x{m,n}	\fIm\fR through \fIn\fR occurrences of x.
There are pathological expressions which
produce exponential growth of the tables when
converted to deterministic machines;
fortunately, they are rare.
REJECT does not rescan the input; instead it remembers the results of the previous
scan. This means that if a rule with trailing context is found, and
REJECT executed, the user must not have used
unput
to change the characters forthcoming
from the input stream.
This is the only restriction on the user's ability to manipulate
the not-yet-processed input.
As should
be obvious from the above, the outside of Lex
is patterned
on Yacc and the inside on Aho's string matching routines.
Therefore, both S. C. Johnson and A. V. Aho
are really originators
of much of Lex,
as well as debuggers of it.
Many thanks are due to both.
The code of the current version of Lex was designed, written,
and debugged by Eric Schmidt.
.if 0 .SG MH-1274-MEL-unix
B. W. Kernighan and D. M. Ritchie,
The C Programming Language,
Prentice-Hall, N. J. (1978).
B. W. Kernighan,
Ratfor: A Preprocessor for a Rational Fortran,
Software \- Practice and Experience,
\fB5\fR, pp. 395-406 (1975).
S. C. Johnson,
Yacc: Yet Another Compiler Compiler,
Computing Science Technical Report No. 32,
1975, Bell Laboratories, Murray Hill, NJ 07974.
.if \n(tm (also TM 75-1273-6)
A. V. Aho and M. J. Corasick,
Efficient String Matching: An Aid to Bibliographic Search,
Comm. ACM \fB18\fR, pp. 333-340 (1975).
B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
QED Text Editor,
Computing Science Technical Report No. 5,
1972, Bell Laboratories, Murray Hill, NJ 07974.
D. M. Ritchie,
private communication.
M. E. Lesk,
The Portable C Library,
Computing Science Technical Report No. 31,
1975, Bell Laboratories, Murray Hill, NJ 07974.
.if \n(tm (also TM 75-1274-11)