A Lexical Analyzer Generator by M. E. Lesk and E. Schmidt (2024)

Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt

ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

            +-------+
  Source -> |  Lex  | -> yylex
            +-------+

            +-------+
  Input ->  | yylex | -> Output
            +-------+

         An overview of Lex

              Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

%%
[ \t]+$ ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

%%
[ \t]+$ ;
[ \t]+ printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.
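The combined effect of the two rules can be imitated by hand in C, which may help make clear what the generated analyzer is doing. The sketch below is only an illustration and is not the code Lex produces; the function name squeeze_blanks and its buffer interface are assumptions made for this example, and ``end of line'' is taken to mean ``followed by a newline character.''

```c
#include <stddef.h>

/* Hypothetical helper (not part of Lex): copies `in` to `out`,
 * deleting runs of blanks/tabs that are followed by a newline and
 * replacing every other run with a single blank.  `out` must be at
 * least as large as `in`, including the terminating NUL. */
void squeeze_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* Scan to the end of the run before deciding. */
            while (in[i] == ' ' || in[i] == '\t')
                i++;
            /* First rule: run followed by newline -> drop it.
             * Second rule: any other run -> one blank. */
            if (in[i] != '\n')
                out[o++] = ' ';
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
}
```

As in the Lex version, the decision between the two rules is deferred until the whole run of blanks and tabs has been seen.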

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

             lexical        grammar
              rules          rules
                |              |
                v              v
           +---------+    +---------+
           |   Lex   |    |  Yacc   |
           +---------+    +---------+
                |              |
                v              v
           +---------+    +---------+
  Input -> |  yylex  | -> | yyparse | -> Parsed input
           +---------+    +---------+

                  Lex with Yacc
                    Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.
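The idea of an interpreted automaton can be illustrated with a small table-driven recognizer in C. This is a sketch under stated assumptions, not the tables or the interpreter that Lex generates: the three-state machine for [0-9]+, the names next_state and match_digits, and the two-class input alphabet are all invented here.

```c
#include <stddef.h>

enum { START = 0, IN_NUM = 1, DEAD = 2, NSTATES = 3 };
enum { C_DIGIT = 0, C_OTHER = 1, NCLASSES = 2 };

/* Transition table for the expression [0-9]+ (illustrative only). */
static const int next_state[NSTATES][NCLASSES] = {
    /* START  */ { IN_NUM, DEAD },
    /* IN_NUM */ { IN_NUM, DEAD },
    /* DEAD   */ { DEAD,   DEAD }
};

/* Returns the length of the longest prefix of s matching [0-9]+,
 * or 0 if there is none.  Each character is examined once, so the
 * work is proportional to the number of characters scanned. */
size_t match_digits(const char *s)
{
    int state = START;
    size_t i = 0, last_accept = 0;

    while (s[i] != '\0' && state != DEAD) {
        int cls = (s[i] >= '0' && s[i] <= '9') ? C_DIGIT : C_OTHER;
        state = next_state[state][cls];
        i++;
        if (state == IN_NUM)        /* the only accepting state */
            last_accept = i;
    }
    return last_accept;
}
```

Adding rules to such a machine enlarges the table but leaves the inner loop unchanged, which is the sense in which speed does not depend on the number of rules.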

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd... Such backup is more costly than the processing of simpler languages.

2. Lex Source.

The general format of Lex source is:

{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

%%

(no definitions, no rules) which translates into a program which copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions (see section 3) and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear

integer printf("found keyword INT");

to look for the string integer in the input stream and print the message ``found keyword INT'' whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be described later.
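The petroleum problem is not peculiar to Lex; any substitution that ignores word boundaries behaves the same way. The following C sketch reproduces it outside Lex. The function replace_all is a hypothetical helper written for this illustration only.

```c
#include <string.h>

/* Hypothetical illustration: replaces every occurrence of `from` in
 * `in` with `to`, with no regard for word boundaries.  `out` must be
 * large enough to hold the result. */
void replace_all(const char *in, const char *from, const char *to,
                 char *out)
{
    size_t flen = strlen(from), tlen = strlen(to);

    while (*in != '\0') {
        if (flen > 0 && strncmp(in, from, flen) == 0) {
            memcpy(out, to, tlen);      /* emit the replacement */
            out += tlen;
            in += flen;                 /* skip the matched text */
        } else {
            *out++ = *in++;             /* default: copy one character */
        }
    }
    *out = '\0';
}
```

Applied to the text ``petroleum and petrol'' with petrol replaced by gas, this yields ``gaseum and gas''.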

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression

integer

matches the string integer wherever it appears and the expression

a57D

looks for the string a57D.

Operators. The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % <>

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus

xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression

"xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in

xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,

[a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it is desired to include the character - in a character class, it should be first or last; thus

[-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

[^abc]

matches all characters except a, b, or c, including all special or control characters; or

[^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes within character class brackets.
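One way to picture a character class is as a membership table with one entry per character. The C sketch below builds such a table from the text between the brackets. It is an illustration only and not Lex's internal representation; the type CharClass and the functions class_build and class_has are names invented here, and backslash escapes are deliberately left out.

```c
#include <string.h>

/* Hypothetical sketch: a character class as a 256-entry table. */
typedef struct { unsigned char member[256]; } CharClass;

/* Builds a class from a bracket body such as "a-z0-9<>_" or "^abc".
 * A leading '^' complements the class; '-' between two characters
 * denotes a range (in either order); a '-' that is first or last is
 * an ordinary member.  Backslash escapes are not handled here. */
void class_build(CharClass *cc, const char *body)
{
    int negate = 0;
    size_t i = 0, n = strlen(body);

    memset(cc->member, 0, sizeof cc->member);
    if (n > 0 && body[0] == '^') {
        negate = 1;
        i = 1;
    }
    while (i < n) {
        unsigned char lo = (unsigned char)body[i];
        unsigned char hi = lo;
        unsigned c;

        if (i + 2 < n && body[i + 1] == '-') {
            hi = (unsigned char)body[i + 2];    /* a range x-y */
            i += 3;
        } else {
            i += 1;                             /* a single character */
        }
        if (hi < lo) {                          /* accept either order */
            unsigned char t = lo; lo = hi; hi = t;
        }
        for (c = lo; c <= hi; c++)
            cc->member[c] = 1;
    }
    if (negate)
        for (i = 0; i < 256; i++)
            cc->member[i] = !cc->member[i];
}

/* Returns nonzero if ch belongs to the class. */
int class_has(const CharClass *cc, char ch)
{
    return cc->member[(unsigned char)ch];
}
```

Because ranges are expanded by character code, a range such as 0-z selects different characters in ASCII and in EBCDIC, which is the portability hazard noted above.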

Arbitrary character. To match almost any character, the operator character

.

is the class of all characters except newline. Escaping into octal is possible although non-portable:

[\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

ab?c

matches either ac or abc.

4. Lex Actions.

When an expression written as above is matched, Lex executes the corresponding action. This section describes some features of Lex which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted. Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

[ \t\n] ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

" " |
"\t" |
"\n" ;

with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

[a-z]+ printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case, the format is ``print string'' (% indicating data conversion, and s indicating string type), and the data are the characters in yytext. So this just places the matched string on the output. This action is so common that it may be written as ECHO:

[a-z]+ ECHO;

is the same as the above. Since the default action is just to print the characters found, one might ask why give a rule, like this one, which merely specifies the default action? Such rules are often required to avoid matching some other rule which is not desired. For example, if there is a rule which matches read it will normally match the instances of read contained in bread or readjust; to avoid this, a rule of the form [a-z]+ is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a count yyleng of the number of characters matched. To count both the number of words and the number of characters in words in the input, the user might write

[a-zA-Z]+ {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words recognized. The last character in the string matched can be accessed by

yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of lookahead offered by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

\"[^"]* {
if (yytext[yyleng-1] == '\\')
yymore();
else
... normal user processing
}

which will, when faced with a string such as "abc\"def" first match the five characters "abc\; then the call to yymore() will cause the next part of the string, "def, to be tacked on the end. Note that the final quote terminating the string should be picked up in the code labeled ``normal processing''.
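For comparison, the same kind of string can be recognized by a short hand-written scanner in C. This is a sketch, not a translation of the rule above: the function name match_string is invented here, and unlike the yymore() version it treats any backslash as escaping the following character, so that an escaped backslash before the closing quote is also handled.

```c
#include <stddef.h>

/* Hypothetical scanner: returns the length of the quoted string at
 * the start of s, including both quotation marks, or 0 if s does not
 * begin with a complete string.  A backslash makes the following
 * character ordinary, so \" does not end the string. */
size_t match_string(const char *s)
{
    size_t i;

    if (s[0] != '"')
        return 0;
    for (i = 1; s[i] != '\0'; i++) {
        if (s[i] == '\\' && s[i + 1] != '\0')
            i++;                    /* skip the escaped character */
        else if (s[i] == '"')
            return i + 1;           /* closing quote found */
    }
    return 0;                       /* unterminated string */
}
```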

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print a message. A rule might be

=-[a-zA-Z] {
printf("Op (=-) ambiguous\n");
yyless(yyleng-1);
... action for =- ...
}

which prints a message, returns the letter after the operator to the input stream, and treats the operator as ``=-''. Alternatively it might be desired to treat this as ``= -a''. To do this, just return the minus sign as well as the letter to the input:

=-[a-zA-Z] {
printf("Op (=-) ambiguous\n");
yyless(yyleng-2);
... action for = ...
}

will perform the other interpretation. Note that the expressions for the two cases might more easily be written

=-/[A-Za-z]

in the first case and

=/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to recognize the whole identifier to observe the ambiguity. The possibility of ``=-3'', however, makes

=-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.
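The contract among the three routines can be sketched in C with versions that read from, and write to, internal memory. This is an illustration under stated assumptions rather than the library's code: the names mem_open, mem_input, mem_unput, mem_output and mem_result are invented here (a real override would supply the names input, unput and output), and the buffer sizes are arbitrary apart from the pushback limit, which is chosen to mirror the 100 character figure above.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100    /* mirrors the 100 character backup limit */

static const char *src = "";            /* text still to be read        */
static char pushback[PUSHBACK_MAX];     /* characters given to unput    */
static int npush = 0;
static char sink[256];                  /* where output characters go   */
static size_t nsink = 0;

/* Starts reading from the given string (hypothetical helper). */
void mem_open(const char *text)
{
    src = text;
    npush = 0;
    nsink = 0;
    sink[0] = '\0';
}

/* Returns the next character, or 0 to mean end of file. */
int mem_input(void)
{
    if (npush > 0)
        return (unsigned char)pushback[--npush];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Pushes c back so that the next mem_input() returns it.
 * Returns 0 on success, -1 if the pushback buffer is full. */
int mem_unput(int c)
{
    if (npush >= PUSHBACK_MAX)
        return -1;
    pushback[npush++] = (char)c;
    return 0;
}

/* Appends c to the in-memory output; excess output is dropped. */
void mem_output(int c)
{
    if (nsink + 1 < sizeof sink) {
        sink[nsink++] = (char)c;
        sink[nsink] = '\0';
    }
}

/* Returns the output accumulated so far. */
const char *mem_result(void)
{
    return sink;
}
```

The pushback buffer is a stack, so characters returned by several calls to mem_unput are read back in the reverse order, which is what backing up over a rejected lookahead requires.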

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

integer keyword action ...;

[a-z]+ identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer and so the identifier interpretation is used.
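The two selection principles can be written out directly in C. The sketch below tries each rule in turn and keeps a candidate only if it is strictly longer than the best so far, so that ties go to the earlier rule. It is an illustration only: Lex evaluates all rules at once in a single automaton rather than one after another, and the names match_keyword, match_identifier and choose_rule are assumptions of this example. The test for lower case letters assumes they are contiguous in the character set, as in ASCII.

```c
#include <stddef.h>
#include <string.h>

/* Length of the keyword "integer" at the start of s, else 0. */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length of the longest [a-z]+ prefix of s, else 0. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Returns the index of the rule chosen for the start of s
 * (0 = keyword, 1 = identifier), or -1 if neither matches.
 * The matched length is stored in *len. */
int choose_rule(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_keyword, match_identifier };
    int best = -1;
    size_t i;

    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](s);
        /* Strictly longer wins; on a tie the earlier rule is kept. */
        if (n > *len) {
            *len = n;
            best = (int)i;
        }
    }
    return best;
}
```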

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example,

'.*'

might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

'first' quoted string here, 'second' here

the above expression will match

'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

'[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line. Don't try to defeat this with expressions like (.|\n)+ or equivalents; the Lex generated program will try to read the entire input file, causing internal buffer overflows.
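The difference between the two expressions can be seen by writing each as a small C function. This is a sketch for illustration; the names match_greedy_quote and match_tight_quote are invented here, and each function reports only the length of the match at the start of its argument.

```c
#include <stddef.h>

/* Length matched by '.*' at the start of s.  The longest match is
 * taken, so it runs to the LAST single quote on the current line.
 * Returns 0 if s does not match. */
size_t match_greedy_quote(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i + 1;           /* remember the latest closing quote */
    return last;
}

/* Length matched by '[^'\n]*' at the start of s.  The class excludes
 * the quote itself, so the match ends at the FIRST closing quote. */
size_t match_tight_quote(const char *s)
{
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            return i + 1;
    return 0;
}
```

On the input line shown above, the first function matches 36 characters, through the quote that closes 'second', while the second matches only the 7 characters of 'first'.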

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

she s++;
he h++;
\n |
. ;

where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;

these rules are one way of changing the previous example to do just that. After counting each expression, it is rejected; whenever appropriate, the other expression will then be counted. In this example, of course, the user could note that she includes he but not vice versa, and omit the REJECT action on he; in other cases, however, it would not be possible a priori to tell which input characters were in both classes.
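The two counting behaviours can be compared in plain C. The sketch below is an illustration, not a rendering of how REJECT is implemented; the function names count_partitioned and count_overlapping are assumptions of this example.

```c
#include <string.h>

/* Counts "she" and "he" the way the rules without REJECT do: the
 * input is partitioned, so an "he" inside "she" is consumed with it
 * and never counted separately. */
void count_partitioned(const char *s, int *she, int *he)
{
    *she = *he = 0;
    while (*s != '\0') {
        if (strncmp(s, "she", 3) == 0) {
            (*she)++;
            s += 3;
        } else if (strncmp(s, "he", 2) == 0) {
            (*he)++;
            s += 2;
        } else {
            s++;                    /* the . and \n rules: ignore */
        }
    }
}

/* Counts every occurrence starting at every position, overlapping
 * or not, which is the result the REJECT version aims for. */
void count_overlapping(const char *s, int *she, int *he)
{
    *she = *he = 0;
    for (; *s != '\0'; s++) {
        if (strncmp(s, "she", 3) == 0)
            (*she)++;
        if (strncmp(s, "he", 2) == 0)
            (*he)++;
    }
}
```

For the text ``she said he and she'' the first function reports two instances of she and one of he; the second reports two of she and three of he.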

Consider the two rules

a[bc]+ { ... ; REJECT;}

a[cd]+ { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram to be incremented, the appropriate source is

%%
[a-z][a-z] {
digram[yytext[0]][yytext[1]]++;
REJECT;
}
. ;
\n ;

where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at every other character.
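A stand-alone C function makes the intended result explicit: one count for each pair of adjacent lower case letters, with the pairs allowed to overlap. The function name count_digrams is hypothetical, and the table here is indexed by letter offset (c - 'a') rather than by raw character code as in the Lex source, which assumes the letters are contiguous in the character set, as in ASCII.

```c
/* Hypothetical stand-alone version of the digram count.  Every pair
 * of adjacent lower case letters in s is tallied, including pairs
 * that overlap.  The table is cleared first. */
void count_digrams(const char *s, int digram[26][26])
{
    int i, j;

    for (i = 0; i < 26; i++)
        for (j = 0; j < 26; j++)
            digram[i][j] = 0;

    if (s[0] == '\0')
        return;
    /* Advance one character at a time, not two, so that a pair is
     * examined beginning at every character. */
    for (; s[1] != '\0'; s++) {
        if (s[0] >= 'a' && s[0] <= 'z' && s[1] >= 'a' && s[1] <= 'z')
            digram[s[0] - 'a'][s[1] - 'a']++;
    }
}
```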

6. Lex Source Definitions.

Remember the format of the Lex source:

{definitions}
%%
{rules}
%%
{user routines}

So far only the rules have been described. The user needs additional options, though, to define variables for use in his program and for use by Lex. These can go either in the definitions section or in the rules section.

Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions. This material must look like program fragments, and should precede the first Lex rule. As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are passed through to the generated program. This can be used to include comments in either the Lex source or the generated code. The comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.

3) Anything after the second %% delimiter, regardless of formats, etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter. Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings. The format of such lines is

name translation

and it causes the string given as a translation to be associated with the name. The name and translation must be separated by at least one blank or tab, and the name must begin with a letter. The translation can then be called out by the {name} syntax in a rule. Using {D} for the digits and {E} for an exponent field, for example, might abbreviate rules to recognize numbers:

D [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E} printf("real");

Note the first two rules for real numbers; both require a decimal point and contain an optional exponent field, but the first requires at least one digit before the decimal point and the second requires at least one digit after the decimal point. To correctly handle the problem posed by a Fortran expression such as 35.EQ.I, which does not contain a real number, a context-sensitive rule such as

[0-9]+/"."EQ printf("integer");

could be used in addition to the normal rule for integers.

The definitions section may also contain other commands, including the selection of a host language, a character set table, a list of start conditions, or adjustments to the default size of arrays within Lex itself for larger source programs. These possibilities are discussed below under ``Summary of Source Format,'' section 12.

7. Usage.

There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library [6].

The C programs generated by Lex are slightly different on OS/370, because the OS compiler is less powerful than the UNIX or GCOS compilers, and does less at compile time. C programs generated on GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an appropriate set of commands is

lex source
cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later execution. To use Lex with Yacc see below. Although the default Lex I/O routines use the C standard library, the Lex automata themselves do not do so; if private versions of input, output and unput are given, the library can be avoided.

8. Lex and Yacc.

If you want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program on the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end with

return(token);

where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line

# include "lex.yy.c"

in the last section of Yacc input. Supposing the grammar to be named ``good'' and the lexical rules to be named ``better'' the UNIX command sequence can just be:

yacc good
lex better
cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generations of Lex and Yacc programs can be done in either order.

9. Examples.

As a trivial problem, consider copying an input file while adding 3 to every positive number divisible by 7. Here is a suitable Lex source program

%%
        int k;
[0-9]+  {
        k = atoi(yytext);
        if (k%7 == 0)
                printf("%d", k+3);
        else
                printf("%d", k);
        }

to do just that. The rule [0-9]+ recognizes strings of digits; atoi converts the digits to binary and stores the result in k. The operator % (remainder) is used to check whether k is divisible by 7; if it is, it is incremented by 3 as it is written out. It may be objected that this program will alter such input items as 49.63 or X7. Furthermore, it increments the absolute value of all negative numbers divisible by 7. To avoid this, just add a few more rules after the active one, as here:

%%
        int k;
-?[0-9]+        {
                k = atoi(yytext);
                printf("%d",
                    k%7 == 0 ? k+3 : k);
                }
-?[0-9.]+       ECHO;
[A-Za-z][A-Za-z0-9]+    ECHO;

Numerical strings containing a ``.'' or preceded by a letter will be picked up by one of the last two rules, and not changed. The if-else has been replaced by a C conditional expression to save space; the form a?b:c means ``if a then b else c''.
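The behaviour of the three rules, together with the default copying action, can be approximated by a hand-written C function. This is a sketch under stated assumptions and not the generated analyzer: the names span_int, span_dotted, span_name and bump_sevens are invented here, the longest match is found by trying each rule separately, and ties are resolved in favour of the first rule as described in section 5.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length matched by -?[0-9]+ at the start of s, else 0. */
static size_t span_int(const char *s)
{
    size_t sign = (s[0] == '-') ? 1 : 0, d = 0;

    while (isdigit((unsigned char)s[sign + d]))
        d++;
    return d ? sign + d : 0;
}

/* Length matched by -?[0-9.]+ at the start of s, else 0. */
static size_t span_dotted(const char *s)
{
    size_t sign = (s[0] == '-') ? 1 : 0, d = 0;

    while (isdigit((unsigned char)s[sign + d]) || s[sign + d] == '.')
        d++;
    return d ? sign + d : 0;
}

/* Length matched by [A-Za-z][A-Za-z0-9]+ at the start of s, else 0. */
static size_t span_name(const char *s)
{
    size_t n = 1;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n >= 2 ? n : 0;
}

/* Copies in to out, adding 3 to each integer divisible by 7.
 * out must be large enough to hold the result. */
void bump_sevens(const char *in, char *out)
{
    while (*in != '\0') {
        size_t a = span_int(in), b = span_dotted(in), c = span_name(in);

        if (a > 0 && a >= b && a >= c) {        /* first rule wins ties */
            int k = atoi(in);
            out += sprintf(out, "%d", k % 7 == 0 ? k + 3 : k);
            in += a;
        } else if (b > 0 || c > 0) {            /* the two ECHO rules */
            size_t n = b > c ? b : c;
            memcpy(out, in, n);
            out += n;
            in += n;
        } else {
            *out++ = *in++;                     /* default: copy */
        }
    }
    *out = '\0';
}
```

Given ``14 49.63 X7 -21 5'' this produces ``17 49.63 X7 -18 5'': the items 49.63 and X7 pass through unchanged because a longer rule claims them.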

For an example of statistics gathering, here is a program which histograms the lengths of words, where a word is defined as a string of letters.

        int lengs[100];
%%
[a-z]+  lengs[yyleng]++;
.       |
\n      ;
%%
yywrap()
{
        int i;
        printf("Length No. words\n");
        for (i = 0; i < 100; i++)
                if (lengs[i] > 0)
                        printf("%5d%10d\n", i, lengs[i]);
        return(1);
}

This program accumulates the histogram, while producing no output. At the end of the input it prints the table. The final statement return(1); indicates that Lex is to perform wrapup. If yywrap returns zero (false) it implies that further input is available and the program is to continue reading and processing. To provide a yywrap that never returns true causes an infinite loop.
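The accumulation step can also be written as an ordinary C function, which may be convenient for checking the Lex program against known input. This is a sketch; the name word_lengths and the constant MAXLEN are assumptions of this example, and the bounds check on the word length is an addition that the Lex source above does not make.

```c
#include <stddef.h>

#define MAXLEN 100

/* Tallies the lengths of maximal runs of lower case letters in s.
 * lengs must have MAXLEN entries and is cleared first.  Words of
 * MAXLEN or more letters are not counted; this check is an addition
 * made in this sketch. */
void word_lengths(const char *s, int lengs[MAXLEN])
{
    size_t i, n = 0;

    for (i = 0; i < MAXLEN; i++)
        lengs[i] = 0;

    for (;; s++) {
        if (*s >= 'a' && *s <= 'z') {
            n++;                        /* still inside a word */
        } else {
            if (n > 0 && n < MAXLEN)
                lengs[n]++;             /* a word has just ended */
            n = 0;
            if (*s == '\0')
                break;
        }
    }
}
```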

As a larger example, here are some parts of a program written by N. L. Schryer to convert double precision Fortran to single precision Fortran. Because Fortran does not distinguish upper and lower case letters, this routine begins by defining a set of classes including both cases of each letter:

a [aA]
b [bB]
c [cC]
...
z [zZ]

An additional class recognizes white space:

W [ \t]*

The first rule changes ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.

{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
        printf(yytext[0]=='d'? "real" : "REAL");
        }

Care is taken throughout this program to preserve the case (upper or lower) of the original program. The conditional operator is used to select the proper form of the keyword. The next rule copies continuation card indications to avoid confusing them with constants:

^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks. It is interpreted as ``beginning of line, then five blanks, then anything but blank or zero.'' Note the two different meanings of ^. There follow some rules to change double precision constants to ordinary floating constants.

[0-9]+{W}{d}{W}[+-]?{W}[0-9]+ |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+ |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+ {
        /* convert constants */
        for(p=yytext; *p != 0; p++)
                {
                if (*p == 'd' || *p == 'D')
                        *p += 'e' - 'd';
                }
        ECHO;
        }

After the floating point constant is recognized, it is scanned by the for loop to find the letter d or D. The program then adds 'e'-'d', which converts it to the next letter of the alphabet. The modified constant, now single-precision, is written out again. There follow a series of names which must be respelled to remove their initial d. By using the array yytext the same action suffices for all the names (only a sample of a rather long list is given here).

{d}{s}{i}{n} |
{d}{c}{o}{s} |
{d}{s}{q}{r}{t} |
{d}{a}{t}{a}{n} |
...
{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);

Another list of names must have initial d changed to initial a:

{d}{l}{o}{g} |
{d}{l}{o}{g}10 |
{d}{m}{i}{n}1 |
{d}{m}{a}{x}1 {
        yytext[0] += 'a' - 'd';
        ECHO;
        }

And one routine must have initial d changed to initial r:

{d}1{m}{a}{c}{h} {yytext[0] += 'r' - 'd';
        ECHO;
        }
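The character arithmetic in the preceding rules relies on the fixed distance between two letters being the same in upper and lower case, so one expression handles both. A small C sketch of the three operations follows; the function names are hypothetical and an ASCII-like character set is assumed, as it is by the rules themselves.

```c
#include <assert.h>
#include <string.h>

/* Change every exponent letter d or D in a constant to e or E.
   Adding 'e' - 'd' moves either case one letter forward. */
void to_single_constant(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s == 'd' || *s == 'D')
            *s += 'e' - 'd';
}

/* Replace an initial d or D by lower_target in the matching case,
   e.g. lower_target 'a' gives a or A, and 'r' gives r or R. */
void change_initial(char *name, char lower_target)
{
    assert(name != NULL && (name[0] == 'd' || name[0] == 'D'));
    name[0] += lower_target - 'd';
}

/* Drop the initial letter, as printf("%s", yytext+1) does. */
const char *drop_initial(const char *name)
{
    assert(name != NULL && strlen(name) > 0);
    return name + 1;
}
```

Applied to the samples in the text, these turn 1.5d+10 into 1.5e+10, dlog10 into alog10, D1MACH into R1MACH, and dsqrt into sqrt.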

To avoid such names as dsinx being detected as instances of dsin, some final rules pick up longer words as identifiers and copy some surviving characters:

[A-Za-z][A-Za-z0-9]* |
[0-9]+ |
\n |
. ECHO;

Note that this program is not complete; it does not deal with the spacing problems in Fortran or with the use of keywords as identifiers.

10. Left Context Sensitivity.

Sometimes it is desirable to have several sets of lexical rules to be applied at different times in the input. For example, a compiler preprocessor might distinguish preprocessor statements and analyze them differently from ordinary statements. This requires sensitivity to prior context, and there are several ways of handling such problems. The ^ operator, for example, is a prior context operator, recognizing immediately preceding left context just as $ recognizes immediately following right context. Adjacent left context could be extended, to produce a facility similar to that for adjacent right context, but it is unlikely to be as useful, since often the relevant left context appeared some time earlier, such as at the beginning of a line.

This section describes three means of dealing with different environments: a simple use of flags, when only a few rules change from one environment to another, the use of start conditions on rules, and the possibility of making multiple lexical analyzers all run together. In each case, there are rules which recognize the need to change the environment in which the following input text is analyzed, and set some parameter to reflect the change. This may be a flag explicitly tested by the user's action code; such a flag is the simplest way of dealing with the problem, since Lex is not involved at all. It may be more convenient, however, to have Lex remember the flags as initial conditions on the rules. Any rule may be associated with a start condition. It will only be recognized when Lex is in that start condition. The current start condition may be changed at any time. Finally, if the sets of rules for the different environments are very dissimilar, clarity may be best achieved by writing several distinct lexical analyzers, and switching from one to another as desired.

Consider the following problem: copy the input to the output, changing the word magic to first on every line which began with the letter a, changing magic to second on every line which began with the letter b, and changing magic to third on every line which began with the letter c. All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a flag:

        int flag;
%%
^a      {flag = 'a'; ECHO;}
^b      {flag = 'b'; ECHO;}
^c      {flag = 'c'; ECHO;}
\n      {flag =  0 ; ECHO;}
magic   {
        switch (flag)
        {
        case 'a': printf("first"); break;
        case 'b': printf("second"); break;
        case 'c': printf("third"); break;
        default: ECHO; break;
        }
        }

should be adequate.

To handle the same problem with start conditions, each start condition must be introduced to Lex in the definitions section with a line reading

%Start name1 name2 ...

where the conditions may be named in any order. The word Start may be abbreviated to s or S. The conditions may be referenced at the head of a rule with the <> brackets:

<name1>expression

is a rule which is only recognized when Lex is in the start condition name1. To enter a start condition, execute the action statement

BEGIN name1;

which changes the start condition to name1. To resume the normal state,

BEGIN 0;

resets the initial condition of the Lex automaton interpreter. A rule may be active in several start conditions:

<name1,name2,name3>

is a legal prefix. Any rule not beginning with the <> prefix operator is always active.

The same example as before can be written:

%START AA BB CC
%%
^a {ECHO; BEGIN AA;}
^b {ECHO; BEGIN BB;}
^c {ECHO; BEGIN CC;}
\n {ECHO; BEGIN 0;}
<AA>magic printf("first");
<BB>magic printf("second");
<CC>magic printf("third");

where the logic is exactly the same as in the previous method of handling the problem, but Lex does the work rather than the user's code.
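What both versions accomplish can be shown by a hand-written C sketch in which a single state variable plays the part of the flag or the start condition. It is an illustration only and not the automaton Lex builds; the enum names and the function replace_magic are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names mirroring "%START AA BB CC"; INITIAL is state 0. */
enum start { INITIAL, AA, BB, CC };

/* Copy `in` to `out`, replacing "magic" according to the first letter
   of the current line. Stops early if `out` would overflow. */
void replace_magic(const char *in, char *out, size_t outsize)
{
    enum start state = INITIAL;
    int at_line_start = 1;
    size_t used = 0;

    assert(outsize > 0);
    out[0] = '\0';
    while (*in != '\0') {
        const char *piece = in;     /* default: copy the input itself */
        size_t n = 1;               /* input characters consumed */
        size_t plen = 1;            /* output characters produced */

        if (strncmp(in, "magic", 5) == 0) {
            n = plen = 5;
            if (state == AA)
                piece = "first";
            else if (state == BB)
                piece = "second";
            else if (state == CC)
                piece = "third";
            if (piece != in)
                plen = strlen(piece);
        } else if (at_line_start && *in == 'a') {
            state = AA;             /* like: ^a {ECHO; BEGIN AA;} */
        } else if (at_line_start && *in == 'b') {
            state = BB;
        } else if (at_line_start && *in == 'c') {
            state = CC;
        } else if (*in == '\n') {
            state = INITIAL;        /* like: \n {ECHO; BEGIN 0;} */
        }
        if (plen >= outsize - used)
            break;
        memcpy(out + used, piece, plen);
        used += plen;
        out[used] = '\0';
        at_line_start = (in[n - 1] == '\n');
        in += n;
    }
}
```

As in the Lex rules, the state set at the start of a line stays in force until the newline, so every magic on that line is replaced, and a line beginning with any other letter is copied unchanged.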

11. Character Set.

The programs generated by Lex handle character I/O only through the routines input, output, and unput. Thus the character representation provided in these routines is accepted by Lex and employed to return values in yytext. For internal use a character is represented as a small integer which, if the standard library is used, has a value equal to the integer value of the bit pattern representing the character on the host computer. Normally, the letter a is represented in the same form as the character constant 'a'. If this interpretation is changed, by providing I/O routines which translate the characters, Lex must be told about it, by giving a translation table. This table must be in the definitions section, and must be bracketed by lines containing only ``%T''. The table contains lines of the form

{integer} {character string}

which indicate the value associated with each character. Thus the next example maps the lower and upper case letters together into the integers 1 through 26, newline into 27, + and - into 28 and 29, and the digits into 30 through 39. Note the escape for newline. If a table is supplied, every character that is to appear either in the rules or in any valid input must be included in the table. No character may be assigned the number 0, and no character may be assigned a bigger number than the size of the hardware character set.

%T
1 Aa
2 Bb
...
26 Zz
27 \n
28 +
29 -
30 0
31 1
...
39 9
%T

Sample character table.
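The mapping described by the sample table can be pictured as a lookup array indexed by the host character code. The C sketch below builds such an array; it is only an illustration of the mapping, not the form in which Lex stores it. The function name build_char_table is hypothetical, 8-bit characters are assumed, and the use of 0 to mark characters absent from the table is a choice made here, possible because no character may be assigned the number 0.

```c
#include <assert.h>
#include <string.h>

/* Fill table[] so that table[c] is the number the sample %T table
   assigns to character c, or 0 if c does not appear in the table. */
void build_char_table(unsigned char table[256])
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char digits[] = "0123456789";
    int i;

    assert(table != NULL);
    memset(table, 0, 256);
    for (i = 0; i < 26; i++) {
        /* Both cases of a letter share one number, 1 through 26. */
        table[(unsigned char)lower[i]] = (unsigned char)(i + 1);
        table[(unsigned char)upper[i]] = (unsigned char)(i + 1);
    }
    table[(unsigned char)'\n'] = 27;
    table[(unsigned char)'+'] = 28;
    table[(unsigned char)'-'] = 29;
    for (i = 0; i < 10; i++)
        table[(unsigned char)digits[i]] = (unsigned char)(30 + i);
}
```

Spelling out the alphabets as strings avoids assuming that the letters are contiguous in the host character set.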

12. Summary of Source Format.

The general form of a Lex source file is:

{definitions}
%%
{rules}
%%
{user subroutines}

The definitions section contains a combination of

1) Definitions, in the form ``name space translation''.

2) Included code, in the form ``space code''.

3) Included code, in the form

%{
code
%}

4) Start conditions, given in the form

%S name1 name2 ...

5) Character set tables, in the form

%T
number space character-string
...
%T

6) Changes to internal array sizes, in the form

%x nnn

where nnn is a decimal integer representing an array size and x selects the parameter as follows:

Letter Parameter
p positions
n states
e tree nodes
a transitions
k packed character classes
o output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

x        the character "x"
"x"      an "x", even if x is an operator.
\x       an "x", even if x is an operator.
[xy]     the character x or y.
[x-z]    the characters x, y or z.
[^x]     any character but x.
.        any character but newline.
^x       an x at the beginning of a line.
<y>x     an x when Lex is in start condition y.
x$       an x at the end of a line.
x?       an optional x.
x*       0,1,2, ... instances of x.
x+       1,2,3, ... instances of x.
x|y      an x or a y.
(x)      an x.
x/y      an x but only if followed by y.
{xx}     the translation of xx from the definitions section.
x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of the tables when converted to deterministic machines; fortunately, they are rare.

REJECT does not rescan the input; instead it remembers the results of the previous scan. This means that if a rule with trailing context is found, and REJECT executed, the user must not have used unput to change the characters forthcoming from the input stream. This is the only restriction on the user's ability to manipulate the not-yet-processed input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on Yacc and the inside on Aho's string matching routines. Therefore, both S. C. Johnson and A. V. Aho are really originators of much of Lex, as well as debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and debugged by Eric Schmidt.

15. References.

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, N. J. (1978).

2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran, Software - Practice and Experience, 5, pp. 395-406 (1975).

3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.

4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18, 333-340 (1975).

5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor, Computing Science Technical Report No. 5, 1972, Bell Laboratories, Murray Hill, NJ 07974.

6. D. M. Ritchie, private communication. See also M. E. Lesk, The Portable C Library, Computing Science Technical Report No. 31, Bell Laboratories, Murray Hill, NJ 07974.