1 .\" Copyright (C) Caldera International Inc. 2001-2002. All rights reserved.
3 .\" Redistribution and use in source and binary forms, with or without
4 .\" modification, are permitted provided that the following conditions are
7 .\" Redistributions of source code and documentation must retain the above
8 .\" copyright notice, this list of conditions and the following
11 .\" Redistributions in binary form must reproduce the above copyright
12 .\" notice, this list of conditions and the following disclaimer in the
13 .\" documentation and/or other materials provided with the distribution.
15 .\" All advertising materials mentioning features or use of this software
16 .\" must display the following acknowledgement:
18 .\" This product includes software developed or owned by Caldera
19 .\" International, Inc. Neither the name of Caldera International, Inc.
20 .\" nor the names of other contributors may be used to endorse or promote
21 .\" products derived from this software without specific prior written
24 .\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
25 .\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
26 .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
27 .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
28 .\" DISCLAIMED. IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
29 .\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
30 .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
31 .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
32 .\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
33 .\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
34 .\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
35 .\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
37 .\" @(#)lex.ms 8.1 (Berkeley) 6/8/93
Bell Laboratories, Murray Hill, NJ 07974.
.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
.\".TM 75-1274-15 39199 39199-11
Lex \- A Lexical Analyzer Generator
.AU ``MH 2C-569'' 6377
M. E. Lesk and E. Schmidt
NOTE: This document describes the historical Unix version of \fIlex\fP.
FreeBSD is supplied with \fIflex\fP\|, which is a compatible replacement.
See the extensive documentation in \fIflex(1)\fP\| for details.
Lex helps write programs whose control flow
is directed by instances of regular
expressions in the input stream.
It is well suited for editor-script type transformations and
for segmenting input in preparation for
a parsing routine.
.PP
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program
which reads an input stream, copying it to an output stream
and partitioning the input
into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed.
The recognition of the expressions
is performed by a deterministic finite automaton
generated by Lex.
The program fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.
.PP
The lexical analysis
programs written with Lex accept ambiguous specifications
and choose the longest
match possible at each input point.
If necessary, substantial lookahead
is performed on the input, but the
input stream will be backed up to the
end of the current partition, so that the user
has general freedom to manipulate it.
.PP
Lex can generate analyzers in either C or Ratfor, a language
which can be translated automatically to portable Fortran.
It is available on the PDP-11 UNIX, Honeywell GCOS,
and IBM OS systems.
This manual, however, will only discuss generating analyzers
in C on the UNIX system, which is the only supported
form of Lex under UNIX Version 7.
Lex is designed to simplify
interfacing with Yacc, for those
with access to this compiler-compiler system.
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem oriented specification
for character string matching,
and produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions.  At the boundaries
between strings program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
The user supplies the additional code
beyond expression matching
needed to complete his tasks, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general purpose programming language employed for the
user's program fragments.
Thus, a high level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This avoids forcing the user who wishes to use a string manipulation
language for input analysis to write processing programs in the same
and often inappropriate string handling language.
.PP
Lex is not a complete language, but rather a generator representing
a new language feature which can be added to
different programming languages, called ``host languages.''
Just as general purpose languages
can produce code to run on different computer hardware,
Lex can write code in different host languages.
The host language is used for the output code generated by Lex
and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages
are also provided.
This makes Lex adaptable to different environments and
different users.
Each application
may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local
implementations.
At present, the only supported host language is C,
although Fortran (in the form of Ratfor [2]) has been available
in the past.
Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
.PP
Lex turns the user's expressions and actions
(called
\fIsource\fR
in this memo) into the host general-purpose language;
the generated program is named
\fIyylex\fR.
The
\fIyylex\fR
program
will recognize expressions
in a stream
(called
\fIinput\fR
in this memo)
and perform the specified actions for each expression as it is detected.

Source \(-> Lex \(-> yylex

Input \(-> yylex \(-> Output
For a trivial example, consider a program to delete
from the input
all blanks or tabs at the ends of lines.

%%
[ \et]+$	;

is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
which matches one or more
instances of the characters blank or tab
(written \et for visibility, in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates ``one or more ...'';
and the $ indicates ``end of line,'' as in QED.
No action is specified,
so the program generated by Lex (yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule.
The finite automaton generated for this
source will scan for both rules at once,
observing at
the termination of the string of blanks or tabs
whether or not there is a newline character, and executing
the desired rule action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule all remaining strings of blanks or tabs.
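Taken together, the two rules just described can be sketched as the following source (reconstructed from the description above; the second action prints a single blank):

```lex
%%
[ \t]+$		;
[ \t]+		printf(" ");
```

The first rule, having no action, discards trailing blanks and tabs; everything not matched by either rule is copied through unchanged by the default action.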
Lex can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
Lex can also be used with a parser generator
to perform the lexical analysis phase; it is particularly
easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context free grammars,
but require a lower level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate.
When used as a preprocessor for a later parser generator,
Lex is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
The flow of control
in such a case (which might be the first half of a compiler,
for example) is shown in Figure 2.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by Lex.

Input \(-> yylex \(-> yyparse \(-> Parsed input

Yacc users
will realize that the name
\fIyylex\fR
is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies
interfacing.
.PP
Lex generates a deterministic finite automaton from the regular expressions
in the source [4].
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a Lex program
to recognize and partition an input stream is
proportional to the length of the input.
The number of Lex rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by Lex.
.PP
In the program written by Lex, the user's fragments
(representing the
\fIactions\fR
to be performed as each regular expression
is found)
are gathered
as cases of a switch.
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or
add subroutines outside this action routine.
.PP
Lex is not limited to source which can
be interpreted on the basis of one character
lookahead.
For example,
if there are two rules, one looking for
\fIab\fR
and another for
\fIabcdefg\fR,
and the input stream is
\fIabcdefh\fR,
Lex will recognize
\fIab\fR
and leave
the input pointer just before
\fIcd\fR.
Such backup is more costly
than the processing of simpler languages.
The general format of Lex source is:

{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines
are often omitted.  The second %%
is optional, but the first is required
to mark the beginning of the rules.
The absolute minimum Lex program is thus

%%

(no definitions, no rules) which translates into a program
which copies the input to the output unchanged.
.PP
In the outline of Lex programs shown above, the
\fIrules\fR
represent the user's control
decisions; they are a table, in which the left column
contains
\fIregular expressions\fR
and the right column contains
\fIactions\fR,
program fragments to be executed when the expressions
are recognized.
Thus an individual rule might appear

integer	printf("found keyword INT");

to look for the string
\fIinteger\fR
in the input stream and
print the message ``found keyword INT'' whenever it appears.
In this example the host procedural language is C and
the C library function
\fIprintf\fR
is used to print the string.
The end
of the expression is indicated by the first blank or tab character.
If the action is merely a single C expression,
it can just be given on the right side of the line; if it is
compound, or takes more than a line, it should be enclosed in
braces.
As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling.
Lex rules such as

colour	printf("color");
mechanise	printf("mechanize");
petrol	printf("gas");

would be a start.  These rules are not quite enough,
since the word \fIpetroleum\fR would become \fIgaseum\fR;
a way of dealing
with this will be described later.
Lex Regular Expressions.
.PP
The definitions of regular expressions are very similar to those
in QED [5].
A regular
expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding
characters in the strings being compared)
and operator characters (which specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters; thus the regular expression

integer

matches the string \fIinteger\fR wherever it appears
and the expression

a57D

looks for the string \fIa57D\fR.
.PP
\fIOperators.\fR
The operator characters are

" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape
must be used.
The quotation mark operator (")
indicates that whatever is contained between a pair of quotes
is to be taken as text characters.
Thus

xyz"++"

matches the string \fIxyz++\fR
when it appears.  Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression

"xyz++"

is the same as the one above.
Thus by quoting every non-alphanumeric character
being used as a text character, the user can avoid remembering
the list above of current
operator characters, and is safe should further extensions to Lex
lengthen the list.
.PP
An operator character may also be turned into a text character
by preceding it with \e as in

xyz\e+\e+

which
is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within [\|] (see below) must
be quoted.
Several normal C escapes with \e
are recognized: \en is newline, \et is tab, and \eb is backspace.
To enter \e itself, use \e\e.
Since newline is illegal in an expression, \en must be used;
it is not
required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
Character classes.
Classes of characters can be specified using the operator pair [\|].
The construction
\fI[abc]\fR
matches a
single character, which may be
\fIa\fR,
\fIb\fR,
or
\fIc\fR.
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are \e \(mi and ^.  The \(mi character
indicates ranges.  For example,

[a\(miz0\(mi9<>_]

indicates the character class containing all the lower case letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using \(mi between any pair of characters which are
not both upper case letters, both lower case letters, or both digits
is implementation dependent and will get a warning message.
(E.g., [0\-z] in ASCII is many more characters
than it is in EBCDIC).
If it is desired to include the
character \(mi in a character class, it should be first or
last; thus

[\(mi+0\(mi9]

matches all the digits and the two signs.
.PP
In character classes,
the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting string
is to be complemented with respect to the computer character set.
Thus

[^abc]

matches all characters except a, b, or c, including
all special or control characters; or

[^a\(mizA\(miZ]

is any character which is not a letter.
The \e character provides the usual escapes within
character class brackets.
.PP
Arbitrary character.
To match almost any character, the operator character
\fI.\fR
(period)
is the class of all characters except newline.
Escaping into octal is possible although non-portable:

[\e40\(mi\e176]

matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
Optional expressions.
The operator ?
indicates
an optional element of an expression.
Thus

ab?c

matches either \fIac\fR or \fIabc\fR.
.PP
Repeated expressions.
Repetitions of classes are indicated by the operators
\(** and +.
\fIa\(**\fR
is any number of consecutive
\fIa\fR
characters, including zero; while
\fIa+\fR
is one or more instances of
\fIa\fR.
As useful examples,

[a\(miz]+

is all strings of lower case letters.
And

[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**

indicates all alphanumeric strings with a leading
alphabetic character.
This is a typical expression for recognizing identifiers in
computer languages.
.PP
Alternation and Grouping.
The operator |
indicates alternation:

(ab\||\|cd)

matches either \fIab\fR or \fIcd\fR.
Note that parentheses are used for grouping, although
they are
not necessary on the outside level;

ab\||\|cd

would have sufficed.
Parentheses
can be used for more complex expressions:

(ab\||\|cd+)?(ef)\(**

matches such strings as
\fIabefef\fR, \fIefefef\fR, \fIcdef\fR, or \fIcddd\fR;
but not
\fIabc\fR, \fIabcd\fR, or \fIabcdef\fR.
Context sensitivity.
Lex will recognize a small amount of surrounding
context.  The two simplest operators for this are
^ and $.
If the first character of an expression is
^, then
the expression will only be matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of
^, complementation
of character classes, since that only applies within
the [] operators.
If the very last character is
$, then
the expression will only be matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the
/ operator character,
which indicates trailing context.
The expression

ab/cd

matches the string \fIab\fR,
but only if followed by
\fIcd\fR.
Thus

ab$

is the same as

ab/\en

Left context is handled in Lex by
\fIstart conditions\fR
as explained in section 10.  If a rule is only to be executed
when the Lex automaton interpreter is in start condition
\fIx\fR,
the rule should be prefixed by

<x>

using the angle bracket operator characters.
If we considered ``being at the beginning of a line'' to be
start condition
\fIONE\fR,
then the ^ operator
would be equivalent to

<ONE>

Start conditions are explained more fully later.
Repetitions and Definitions.
The operators {} specify
either repetitions (if they enclose numbers)
or
definition expansion (if they enclose a name).  For example

{digit}

looks for a predefined string named
\fIdigit\fR
and inserts it
at that point in the expression.
The definitions are given in the first part of the Lex
input, before the rules.
In contrast,

a{1,5}

looks for 1 to 5 occurrences of
\fIa\fR.
.PP
Finally, initial
%
is special, being the separator
for Lex source segments.
When an expression written as above is matched, Lex
executes the corresponding action.  This section describes
some features of Lex which aid in writing actions.  Note
that there is a default action, which
consists of copying the input to the output.  This
is performed on all strings not otherwise matched.  Thus
the Lex user who wishes to absorb the entire input, without
producing any output, must provide rules to match everything.
When Lex is being used with Yacc, this is the normal
situation.
One may consider that actions are what is done instead of
copying the input to the output; thus, in general,
a rule which merely copies can be omitted.
Also, a character combination
which is omitted from the rules
and which appears as input
is likely to be printed on the output, thus calling
attention to the gap in the rules.
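A pair of rules that matches every input character and does nothing with it will absorb the whole input; a minimal sketch (not from the original text):

```lex
%%
.	;
\n	;
```

The first rule covers every character except newline, and the second covers newline itself, so no input ever reaches the default copy action.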
One of the simplest things that can be done is to ignore
the input.  Specifying a C null statement, \fI;\fR as an action
causes this result.  A frequent rule is

[ \et\en]	;

which causes the three spacing characters (blank, tab, and newline)
to be ignored.
.PP
Another easy way to avoid writing actions is the action character
|, which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written

" "	|
"\et"	|
"\en"	;

with the same result, although in different style.
The quotes around \en and \et are not required.
In more complex actions, the user will
often want to know the actual text that matched some expression
like
\fI[a\(miz]+\fR.
Lex leaves this text in an external character
array named
\fIyytext\fR.
Thus, to print the name found,
a rule like

[a\-z]+	printf("%s", yytext);

will print the string in
\fIyytext\fR.
The C function
\fIprintf\fR
accepts a format argument and data to be printed;
in this case, the format is ``print string'' (% indicating
data conversion, and \fIs\fR
indicating string type),
and the data are the characters
in
\fIyytext\fR.
So this just places
the matched string
on the output.
This action is so common that
it may be written as ECHO:

[a\-z]+	ECHO;

is the same as the above.
Since the default action is just to
print the characters found, one might ask why
give a rule, like this one, which merely specifies
the default action?
Such rules are often required
to avoid matching some other rule
which is not desired.  For example, if there is a rule
which matches
\fIread\fR
it will normally match the instances of
\fIread\fR
contained in
\fIbread\fR
or
\fIreadjust\fR;
to avoid this, a rule of the form
\fI[a\-z]+\fR
is needed.
This is explained further below.
Sometimes it is more convenient to know the end of what
has been found; hence Lex also provides a count
\fIyyleng\fR
of the number of characters matched.
To count both the number
of words and the number of characters in words in the input, the user might write

[a\-zA\-Z]+	{words++; chars += yyleng;}

which accumulates in
\fIchars\fR
the number
of characters in the words recognized.
The last character in the string matched can
be accessed by

yytext[yyleng\-1]
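A complete specification built around this counting rule might look as follows; this is a sketch, not from the original text (the counter names, the discard rules, and the use of \fIyywrap\fP, described later in this document, are illustrative):

```lex
	int words = 0, chars = 0;
%%
[a-zA-Z]+	{words++; chars += yyleng;}
.	;
\n	;
%%
yywrap()
{
	printf("%d words, %d characters\n", words, chars);
	return(1);
}
```

The two discard rules absorb everything else, so that only the counts are printed, from yywrap, when end of file is reached.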
Occasionally, a Lex
action may decide that a rule has not recognized the correct
span of characters.
Two routines are provided to aid with this situation.
First,
\fIyymore()\fR
can be called to indicate that the next input expression recognized is to be
tacked on to the end of this input.  Normally,
the next input string would overwrite the current
entry in
\fIyytext\fR.
Second,
\fIyyless(n)\fR
may be called to indicate that not all the characters matched
by the currently successful expression are wanted right now.
The argument
\fIn\fR
indicates the number of characters
in
\fIyytext\fR
to be retained.
Further characters previously matched
are
returned to the input.  This provides the same sort of
lookahead offered by the / operator,
but in a different form.
\fIExample:\fR
Consider a language which defines
a string as a set of characters between quotation (") marks, and provides that
to include a " in a string it must be preceded by a \e.  The
regular expression which matches that is somewhat confusing,
so that it might be preferable to write

\e"[^"]*	{
	if (yytext[yyleng\-1] == \(fm\e\e\(fm)
		yymore();
	else
		... normal user processing
	}

which will, when faced with a string such as

\&"abc\e"def"

first match the five characters

\&"abc\e

then the call to
\fIyymore()\fR
will
cause the next part of the string,

def"

to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled ``normal processing''.
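For comparison, the ``somewhat confusing'' single regular expression mentioned above could be sketched as follows (this pattern is a reconstruction, not taken from the original text):

```lex
\"([^"\\\n]|\\.)*\"	{ /* an entire string, escaped quotes and all, matched at once */ }
```

Inside the parentheses, either an ordinary character or a backslash followed by any character may be repeated; the pattern therefore steps over \e" pairs instead of ending the match at them.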
The function
\fIyyless()\fR
might be used to reprocess
text in various circumstances.  Consider the C problem of distinguishing
the ambiguity of ``=\(mia''.
Suppose it is desired to treat this as ``=\(mi a''
but print a message.  A rule might be

=\(mi[a\(miz]	{
	printf("Op (=\(mi) ambiguous\en");
	yyless(yyleng\(mi1);
	... action for =\(mi ...
	}

which prints a message, returns the letter after the
operator to the input stream, and treats the operator as ``=\(mi''.
Alternatively it might be desired to treat this as ``= \(mia''.
To do this, just return the minus
sign as well as the letter to the input:

=\(mi[a\(miz]	{
	printf("Op (=\(mi) ambiguous\en");
	yyless(yyleng\(mi2);
	... action for = ...
	}

will perform the other interpretation.
Note that the expressions for the two cases might more easily
be written

=\(mi/[A\(miZa\(miz]

in the first case and

=/\(mi[A\(miZa\(miz]

in the second;
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The
possibility of ``=\(mi3'', however, makes

=\(mi/[^ \et\en]

a still better rule.
In addition to these routines, Lex also permits
access to the I/O routines Lex itself uses.
They are:
1)
\fIinput()\fR
which returns the next input character;
2)
\fIoutput(c)\fR
which writes the character
\fIc\fR
on the output; and
3)
\fIunput(c)\fR
which
pushes the character
\fIc\fR
back onto the input stream to be read later by
\fIinput()\fR.
.PP
By default these routines are provided as macro definitions,
but the user can override them and supply private versions.
These routines
define the relationship between external files and
internal characters, and must all be retained
or modified consistently.
They may be redefined, to
cause input or output to be transmitted to or from strange
places, including other programs or internal memory;
but the character set used must be consistent in all routines;
a value of zero returned by
\fIinput\fR
must mean end of file; and
the relationship between
\fIunput\fR
and
\fIinput\fR
must be retained
or the Lex lookahead will not work.
Lex does not look ahead at all if it does not have to,
but every rule ending in
+ \(** ? or $ or containing /
implies lookahead.
Lookahead is also necessary to match an expression that is a prefix
of another expression.
See below for a discussion of the character set used by Lex.
The standard Lex library imposes
a 100 character limit on backup.
Another Lex library routine that the user will sometimes want
to redefine is
\fIyywrap()\fR
which is called whenever Lex reaches an end-of-file.
If
\fIyywrap\fR
returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more
input to arrive
from a new source.
In this case, the user should provide
a
\fIyywrap\fR
which
arranges for new input and
returns 0.  This instructs Lex to continue processing.
The default
\fIyywrap\fR
always returns 1.
.PP
This routine is also a convenient place
to print tables, summaries, etc. at the end
of a program.  Note that it is not
possible to write a normal rule which recognizes
end-of-file; the only access to this condition is
through
\fIyywrap\fR.
In fact, unless a private version of
\fIinput()\fR
is supplied
a file containing nulls
cannot be handled,
since a value of 0 returned by
\fIinput\fR
is taken to be end-of-file.
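A yywrap along these lines might switch to a second input file; this is a sketch, not from the original text (the file name and the \fIyyin\fP stream variable are illustrative):

```lex
%%
.|\n	ECHO;
%%
yywrap()
{
	static int wrapped = 0;
	if (wrapped)
		return(1);		/* really done: normal wrapup */
	wrapped = 1;
	yyin = fopen("next.txt", "r");	/* hypothetical second source */
	return(yyin ? 0 : 1);		/* 0 tells Lex to keep reading */
}
```

Returning 0 after reopening the stream makes Lex continue scanning as if the two files were one input.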
Ambiguous Source Rules.
.PP
Lex can handle ambiguous specifications.
When more than one expression can match the
current input, Lex chooses as follows:
1)
The longest match is preferred.
2)
Among rules which matched the same number of characters,
the rule given first is preferred.
.PP
Thus, suppose the rules

integer	keyword action ...;
[a\-z]+	identifier action ...;

to be given in that order.  If the input is
\fIintegers\fR,
it is taken as an identifier, because
\fI[a\-z]+\fR
matches 8 characters while
\fIinteger\fR
matches only 7.
If the input is
\fIinteger\fR,
both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g. \fIint\fR\|) will
not match the expression
\fIinteger\fR
and so the identifier interpretation is used.
The principle of preferring the longest
match makes rules containing
expressions like
\fI.\(**\fR
dangerous.
For example,

\(fm.\(**\(fm

might seem a good way of recognizing
a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant
single quote.
Presented with the input

\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here

the above expression will match

\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm

which is probably not what was wanted.
A better rule is of the form

\&\(fm[^\(fm\en]\(**\(fm

which, on the above input, will stop
after
\fI\(fmfirst\(fm\fR.
The consequences
of errors like this are mitigated by the fact
that the
\fI.\fR
operator will not match newline.
Thus expressions like
\fI.\(**\fR
stop on the current line.
Don't try to defeat this with expressions like

[.\en]+

or equivalents;
the Lex generated program will try to read
the entire input file, causing
internal buffer overflows.
Note that Lex is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to
count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
Some Lex rules to do this might be

she	s++;
he	h++;
\en	|
\&.	;

where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
Remember that . does not include newline.
Since \fIshe\fR includes \fIhe\fR, Lex will normally
\fInot\fR
recognize
the instances of \fIhe\fR included in \fIshe\fR,
since once it has passed a \fIshe\fR those characters are gone.
Sometimes the user would like to override this choice.  The action
REJECT
means ``go do the next alternative.''
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants to count the included instances of \fIhe\fR:

she	{s++; REJECT;}
he	{h++; REJECT;}
\en	|
\&.	;

these rules are one way of changing the previous example
to do just that.
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted.  In this example, of course,
the user could note that \fIshe\fR includes \fIhe\fR but not
vice versa, and omit the REJECT action on \fIhe\fR;
in other cases, however, it
would not be possible a priori to tell
which input characters
were in both classes.
Consider the two rules

a[bc]+	{ ... ; REJECT;}
a[cd]+	{ ... ; REJECT;}

If the input is
\fIab\fR,
only the first rule matches,
and on
\fIad\fR
only the second matches.
The input string
\fIaccb\fR
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input
\fIaccd\fR
agrees with
the second rule for four characters and then the first
rule for three.
.PP
In general, REJECT is useful whenever
the purpose of Lex is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word
\fIthe\fR
is considered to contain
both
\fIth\fR
and
\fIhe\fR.
Assuming a two-dimensional array named
\fIdigram\fR
to be incremented, the appropriate
source is

%%
[a\-z][a\-z]	{
	digram[yytext[0]][yytext[1]]++;
	REJECT;
	}
\&.	;
\en	;

where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.
Lex Source Definitions.
.PP
Remember the format of the Lex
source:

{definitions}
%%
{rules}
%%
{user routines}

So far only the rules have been described.  The user needs
additional options,
though, to define variables for use in his program and for use
by Lex.
These can go either in the definitions section
or in the rules section.
.PP
Remember that Lex is turning the rules into a program.
Any source not intercepted by Lex is copied
into the generated program.  There are three classes
of such things.
1)
Any line which is not part of a Lex rule or action
which begins with a blank or tab is copied into
the Lex generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears immediately after the first
%%,
it appears in an appropriate place for declarations
in the function written by Lex which contains the actions.
This material must look like program fragments,
and should precede the first Lex rule.
As a side effect of the above, lines which begin with a blank
or tab, and which contain a comment,
are passed through to the generated program.
This can be used to include comments in either the Lex source or
the generated code.  The comments should follow the host
language convention.
2)
Anything included between lines containing
only %{ and %} is
copied out as above.  The delimiters are discarded.
This format permits entering text like preprocessor statements that
must begin in column 1,
or copying lines that do not look like programs.
3)
Anything after the third %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.
Definitions intended for Lex are given
before the first %% delimiter.  Any line in this section
not contained between %{ and %}, and beginning
in column 1, is assumed to define Lex substitution strings.
The format of such lines is

name	translation

and it
causes the string given as a translation to
be associated with the name.
The name and translation
must be separated by at least one blank or tab, and the name must begin with a letter.
The translation can then be called out
by the {name} syntax in a rule.
Using {D} for the digits and {E} for an exponent field,
for example, might abbreviate rules to recognize numbers:

{D}+	printf("integer");
{D}+"."{D}\(**({E})?	|
{D}\(**"."{D}+({E})?	|
{D}+{E}	printf("real");

Note the first two rules for real numbers;
both require a decimal point and contain
an optional exponent field,
but the first requires at least one digit before the
decimal point and the second requires at least one
digit after the decimal point.
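Definitions for {D} and {E} along the lines the text describes might read as follows (a sketch; the exact exponent definition is a reconstruction):

```lex
D	[0-9]
E	[DEde][-+]?{D}+
%%
{D}+			printf("integer");
{D}+"."{D}*({E})?	|
{D}*"."{D}+({E})?	|
{D}+{E}			printf("real");
```

Here {E} allows an upper or lower case exponent letter, an optional sign, and at least one digit.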
To correctly handle the problem
posed by a Fortran expression such as
\fI35.EQ.I\fR,
which does not contain a real number, a context-sensitive
rule such as

[0\-9]+/"."EQ	printf("integer");

could be used in addition to the normal rule for integers.
The definitions
section may also contain other commands, including the
selection of a host language, a character set table,
a list of start conditions, or adjustments to the default
size of arrays within Lex itself for larger source programs.
These possibilities
are discussed below under ``Summary of Source Format,''
section 12.
.PP
Usage.
.PP
There are two steps in
compiling a Lex source program.
First, the Lex source must be turned into a generated program
in the host general purpose language.
Then this program must be compiled and loaded, usually with
a library of Lex subroutines.
The generated program
is on a file named
\fIlex.yy.c\fR.
The I/O library is defined in terms of the C standard
library [6].
.PP
The C programs generated by Lex are slightly different
on OS/370, because the
OS compiler is less powerful than the UNIX or GCOS compilers,
and does less at compile time.
C programs generated on GCOS and UNIX are the same.
.PP
The library is accessed by the loader flag
\fI\-ll\fR.
So an appropriate set of commands is

lex source
cc lex.yy.c \-ll

The resulting program is placed on the usual file
\fIa.out\fR
for later execution.
To use Lex with Yacc see below.
Although the default Lex I/O routines use the C standard library,
the Lex automata themselves do not do so;
if private versions of
\fIinput\fR, \fIoutput\fR and \fIunput\fR
are given, the library can be avoided.
If you want to use Lex with Yacc, note that what Lex writes is a program
named yylex(),
the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library
calls this routine, but if Yacc is loaded, and its main
program is used, Yacc will call
yylex().
In this case each Lex rule should end with
return(token);
where the appropriate token value is returned.
An easy way to get access
to Yacc's names for tokens is to
compile the Lex output file as part of
the Yacc output file by placing the line
# include "lex.yy.c"
in the last section of Yacc input.
Supposing the grammar to be
named ``good'' and the lexical rules to be named ``better'',
the UNIX command sequence can just be:
yacc good
lex better
cc y.tab.c \-ly \-ll
The Yacc library (\-ly) should be loaded before the Lex library,
to obtain a main program which invokes the Yacc parser.
The generations of Lex and Yacc programs can be done in
either order.
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7.
Here is a suitable Lex source program:
%%
	int k;
[0-9]+	{
	k = atoi(yytext);
	if (k%7 == 0)
		printf("%d", k+3);
	else
		printf("%d",k);
	}
The rule [0\-9]+ recognizes strings of digits;
atoi
converts the digits to binary
and stores the result in
k.
The operator % (remainder) is used to check whether
k
is divisible by 7; if it is,
it is incremented by 3 as it is written out.
It may be objected that this program will alter such
input items as 49.63 or X7.
Furthermore, it increments the absolute value
of all negative numbers divisible by 7.
To avoid this, just add a few more rules after the active one,
as here:
%%
	int k;
-?[0-9]+	{
	k = atoi(yytext);
	printf("%d",
		k%7 == 0 ? k+3 : k);
	}
-?[0-9.]+	ECHO;
[A-Za-z][A-Za-z0-9]+	ECHO;
Numerical strings containing
a ``.'' or preceded by a letter will be picked up by
one of the last two rules, and not changed.
The if\-else
has been replaced by
a C conditional expression to save space;
the form a?b:c means ``if a then b else c''.
For an example of statistics gathering, here
is a program which histograms the lengths
of words, where a word is defined as a string of letters.
	int lengs[100];
%%
[a-z]+	lengs[yyleng]++;
.	|
\en	;
%%
yywrap()
{
int i;
printf("Length No. words\en");
for(i=0; i<100; i++)
	if (lengs[i] > 0)
		printf("%5d%10d\en",i,lengs[i]);
return(1);
}
This program
accumulates the histogram, while producing no output. At the end
of the input it prints the table.
The statement return(1);
indicates that Lex is to perform wrapup. If
yywrap
returns zero (false)
it implies that further input is available
and the program is to continue reading and processing.
Providing a yywrap that never
returns true causes an infinite loop.
As a larger example,
here are some parts of a program written by N. L. Schryer
to convert double precision Fortran to single precision Fortran.
Because Fortran does not distinguish upper and lower case letters,
this routine begins by defining a set of classes including
both cases of each letter:
a	[aA]
b	[bB]
c	[cC]
...
z	[zZ]
An additional class recognizes white space:
W	[ \et]\(**
The first rule changes
``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}	{
printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
}
Care is taken throughout this program to preserve the case
(upper or lower)
of the original program.
The conditional operator is used to
select the proper form of the keyword.
The next rule copies continuation card indications to
avoid confusing them with constants:
^"     "[^ 0]	ECHO;
In the regular expression, the quotes surround the
blanks.
It is interpreted as
``beginning of line, then five blanks, then
anything but blank or zero.''
Note the two different meanings of
\fI^\fR here.
There follow some rules to change double precision
constants to ordinary floating constants.
[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+	|
[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+	|
"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+	{
/\(** convert constants \(**/
for(p=yytext; \(**p != 0; p++)
	{
	if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
		\(**p=+ \(fme\(fm\- \(fmd\(fm;
	ECHO;
	}
After the floating point constant is recognized, it is
scanned by the for loop to find the letter d or D.
The program then adds
.I \(fme\(fm\-\(fmd\(fm ,
converting
it to the next letter of the alphabet.
The modified constant, now single-precision,
is written out again.
There follow a series of names which must be respelled to remove
their initial \fId\fR.
By using the array
yytext
the same action suffices for all the names (only a sample of
a rather long list is given here).
{d}{s}{i}{n}	|
{d}{c}{o}{s}	|
{d}{s}{q}{r}{t}	|
{d}{a}{t}{a}{n}	|
...
{d}{f}{l}{o}{a}{t}	printf("%s",yytext+1);
Another list of names must have initial \fId\fR changed to initial \fIa\fR:
{d}{l}{o}{g}	|
{d}{l}{o}{g}10	|
{d}{m}{i}{n}1	|
{d}{m}{a}{x}1	{
	yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
	ECHO;
	}
And one routine
must have initial \fId\fR changed to initial \fIr\fR:
{d}1{m}{a}{c}{h}	{yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
	ECHO;
	}
To avoid such names as \fIdsinx\fR being detected as instances
of \fIdsin\fR, some final rules pick up longer words as identifiers
and copy some surviving characters:
[A\-Za\-z][A\-Za\-z0\-9]\(**	|
[0\-9]+	|
\en	|
.	ECHO;
Note that this program is not complete; it
does not deal with the spacing problems in Fortran or
with the use of keywords as identifiers.
Left Context Sensitivity.
Sometimes
it is desirable to have several sets of lexical rules
to be applied at different times in the input.
For example, a compiler preprocessor might distinguish
preprocessor statements and analyze them differently
from ordinary statements.
This requires sensitivity
to prior context, and there are several ways of handling
such problems.
The \fI^\fR operator, for example, is a prior context operator,
recognizing immediately preceding left context just as \fI$\fR recognizes
immediately following right context.
Adjacent left context could be extended, to produce a facility similar to
that for adjacent right context, but it is unlikely
to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
This section describes three means of dealing
with different environments: a simple use of flags,
when only a few rules change from one environment to another,
the use of start conditions on rules,
and the possibility of making multiple lexical analyzers all run
together.
In each case, there are rules which recognize the need to change the
environment in which the
following input text is analyzed, and set some parameter
to reflect the change. This may be a flag explicitly tested by
the user's action code; such a flag is the simplest way of dealing
with the problem, since Lex is not involved at all.
It may be more convenient, however,
to have Lex remember the flags as initial conditions on the rules.
Any rule may be associated with a start condition. It will only
be recognized when Lex is in
that start condition.
The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments
are very dissimilar,
clarity may be best achieved by writing several distinct lexical
analyzers, and switching from one to another as desired.
Consider the following problem: copy the input to the output,
changing the word \fImagic\fR to \fIfirst\fR on every line which began
with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
which began with the letter \fIb\fR, and changing
\fImagic\fR to \fIthird\fR on every line which began
with the letter \fIc\fR. All other words and all other lines
are left unchanged.
These rules are so simple that the easiest way
to do this job is with a flag:
	int flag;
%%
^a	{flag = \(fma\(fm; ECHO;}
^b	{flag = \(fmb\(fm; ECHO;}
^c	{flag = \(fmc\(fm; ECHO;}
\en	{flag = 0 ; ECHO;}
magic	{
	switch (flag)
	{
	case \(fma\(fm: printf("first"); break;
	case \(fmb\(fm: printf("second"); break;
	case \(fmc\(fm: printf("third"); break;
	default: ECHO; break;
	}
	}
To handle the same problem with start conditions, each
start condition must be introduced to Lex in the definitions section
with a line reading
%Start	name1 name2 ...
where the conditions may be named in any order.
The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
The conditions may be referenced at the
head of a rule with the <> brackets:
<name1>expression
is a rule which is only recognized when Lex is in the
start condition \fIname1\fR.
To enter a start condition,
execute the action statement
BEGIN name1;
which changes the start condition to \fIname1\fR.
To resume the normal state,
BEGIN 0;
resets the initial condition
of the Lex automaton interpreter.
A rule may be active in several
start conditions:
<name1,name2,name3>
is a legal prefix. Any rule not beginning with the
<> prefix operator is always active.
The same example as before can be written:
%START AA BB CC
%%
^a	{ECHO; BEGIN AA;}
^b	{ECHO; BEGIN BB;}
^c	{ECHO; BEGIN CC;}
\en	{ECHO; BEGIN 0;}
<AA>magic	printf("first");
<BB>magic	printf("second");
<CC>magic	printf("third");
where the logic is exactly the same as in the previous
method of handling the problem, but Lex does the work
rather than the user's code.
The programs generated by Lex handle
character I/O only through the routines
input, output, and unput.
Thus the character representation
provided in these routines
is accepted by Lex and employed to return
values in
yytext.
For internal use
a character is represented as a small integer
which, if the standard library is used,
has a value equal to the integer value of the bit
pattern representing the character on the host computer.
Normally, the letter
a
is represented in the same form as the character constant
\(fma\(fm.
If this interpretation is changed, by providing I/O
routines which translate the characters,
Lex must be told about
it, by giving a translation table.
This table must be in the definitions section,
and must be bracketed by lines containing only
``%T''.
The table contains lines of the form
{integer} {character string}
which indicate the value associated with each character.
Thus the next example
%T
1	Aa
2	Bb
...
26	Zz
27	\en
28	+
29	\-
30	0
31	1
...
39	9
%T
Sample character table.
maps the lower and upper case letters together into the integers 1 through 26,
newline into 27, + and \- into 28 and 29, and the
digits into 30 through 39.
Note the escape for newline.
If a table is supplied, every character that is to appear either
in the rules or in any valid input must be included
in the table.
No character
may be assigned the number 0, and no character may be
assigned a bigger number than the size of the hardware character set.
Summary of Source Format.
The general form of a Lex source file is:
{definitions}
%%
{rules}
%%
{user subroutines}
The definitions section contains
a combination of
Definitions, in the form ``name space translation''.
Included code, in the form ``space code''.
Included code, in the form
%{
code
%}
Start conditions, given in the form
%S name1 name2 ...
Character set tables, in the form
%T
number space character-string
...
%T
Changes to internal array sizes, in the form
%\fIx\fR\0\0\fInnn\fR
where \fInnn\fR is a decimal integer representing an array size
and \fIx\fR selects the parameter as follows:
p	positions
n	states
e	tree nodes
a	transitions
k	packed character classes
o	output array size
Lines in the rules section have the form ``expression action''
where the action may be continued on succeeding
lines by using braces to delimit it.
Regular expressions in Lex use the following
operators:
x	the character "x".
"x"	an "x", even if x is an operator.
\ex	an "x", even if x is an operator.
[xy]	the character x or y.
[x\-z]	the characters x, y or z.
[^x]	any character but x.
\&.	any character but newline.
^x	an x at the beginning of a line.
<y>x	an x when Lex is in start condition y.
x$	an x at the end of a line.
x?	an optional x.
x\(**	0,1,2, ... instances of x.
x+	1,2,3, ... instances of x.
x|y	an x or a y.
(x)	an x.
x/y	an x but only if followed by y.
{xx}	the translation of xx from the
	definitions section.
x{m,n}	\fIm\fR through \fIn\fR occurrences of x.
There are pathological expressions which
produce exponential growth of the tables when
converted to deterministic machines;
fortunately, they are rare.
REJECT does not rescan the input; instead it remembers the results of the previous
scan. This means that if a rule with trailing context is found, and
REJECT executed, the user must not have used
unput
to change the characters forthcoming
from the input stream.
This is the only restriction on the user's ability to manipulate
the not-yet-processed input.
As should
be obvious from the above, the outside of Lex
is patterned
on Yacc and the inside on Aho's string matching routines.
Therefore, both S. C. Johnson and A. V. Aho
are really originators
of much of Lex,
as well as debuggers of it.
Many thanks are due to both.
The code of the current version of Lex was designed, written,
and debugged by Eric Schmidt.
.if 0 .SG MH-1274-MEL-unix
B. W. Kernighan and D. M. Ritchie,
The C Programming Language,
Prentice-Hall, N. J. (1978).
B. W. Kernighan,
Ratfor: A Preprocessor for a Rational Fortran,
Software \- Practice and Experience,
\fB5\fR, pp. 395-406 (1975).
S. C. Johnson,
Yacc: Yet Another Compiler Compiler,
Computing Science Technical Report No. 32,
1975, Bell Laboratories, Murray Hill, NJ 07974.
.if \n(tm (also TM 75-1273-6)
A. V. Aho and M. J. Corasick,
Efficient String Matching: An Aid to Bibliographic Search,
Comm. ACM \fB18\fR, pp. 333-340 (1975).
B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
QED Text Editor,
Computing Science Technical Report No. 5,
1972, Bell Laboratories, Murray Hill, NJ 07974.
D. M. Ritchie,
private communication.
M. E. Lesk,
The Portable C Library,
Computing Science Technical Report No. 31,
1975, Bell Laboratories, Murray Hill, NJ 07974.
.if \n(tm (also TM 75-1274-11)