
Chapter 2
Lexical Analysis
Dr. Frank Lee
2.1 File Processing and Tokens
• The lexical analysis phase of compilation does exactly what
you do unconsciously – it breaks the text file (program) into
smaller chunks
• These chunks are called tokens
• The lexical analysis phase of the compiler is often called
tokenization
• One brute-force approach, efficient at run time but complex
to write, is to read the input character by character,
checking whether the characters read so far follow the legal
sequence of some token
• Fig 2.2 (p7) shows partial code for this brute-force lexical
analyzer (see next slide)
• What we’d like is a set of tools that will allow us to easily
create and modify a lexical analyzer that has the same runtime efficiency as the brute-force method
• The first tool that we use to attack this problem is the
Deterministic Finite Automaton (DFA).
Partial code for a brute-force lexical analyzer
if ((c = nextchar()) == 'c') {
    if ((c = nextchar()) == 'l') {
        // code to handle the rest of either "class" or any identifier
        // that starts with "cl"
    } else if (c == 'a') {
        // code to handle the rest of either "case" or any identifier
        // that starts with "ca"
    } else {
        // code to handle any identifier that starts with "c"
    }
} else if ( c = ….. ) {
    // many more nested ifs for the rest of the lexical analyzer
}
2.2 Deterministic Finite Automata
2.2.1 Informal Definition of a DFA
• A DFA is a kind of specialized flowchart for text
processing
• “Deterministic” means the next step is completely
determined by the current state and the next symbol
• “Finite” means the number of states must be finite
• “Automata” means machines; “automaton” is the
singular form
2.2.1 Informal Definition of a DFA (cont.)
• A DFA is used to recognize strings
• When a string is fed into a DFA, if the DFA recognizes the
string, it accepts it; otherwise it rejects it
• A DFA is a collection of states and transitions
• Given the input string, the transitions tell us how to move
among the states
• One of the states is denoted as the initial state, and a
subset of the states are final states
• We start from the initial state, move from state to state via
the transitions, and check whether we are in a final state
once every character of the string has been read
• If we are, the string is accepted; otherwise the string
is rejected
• See Figures 2.3 to 2.6 (p8-9)
2.2.2 Formal Definition of a DFA
• A DFA is a quintuple M = (Q, ∑, δ, q0, F), where
– “quintuple” means the machine M has five parameters
– Q is a finite set of states
– ∑ is a finite set called the alphabet
– δ is a total function from Q × ∑ to Q known as the
transition function (a function that takes a state and a
symbol as inputs and returns a state)
– q0 ∈ Q is the start state, and
– F ⊆ Q is the set of final states (see p9)
• See page 10 for a detailed description of DFAs
2.2.3 Representing DFA
• Figure 2.7 (p11) defines a DFA that accepts all
strings over {a, b} that start with “a” and end with
“b”.
• Fig. 2.8 shows a DFA equivalent to the formal definition
of Fig. 2.7 (a Java sketch of this DFA follows)
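To make the formal definition concrete, here is a minimal table-driven Java sketch of the DFA of Fig. 2.7 (the class, state, and method names are our own, not from the text): δ is encoded as a two-dimensional array, and a trap state keeps the transition function total.

public class SimpleDFA {
    // DFA of Fig. 2.7: accepts strings over {a, b} that
    // start with "a" and end with "b"
    static final int START = 0, SEEN_A = 1, ACCEPT = 2, DEAD = 3;

    // delta[state][symbol]: column 0 is 'a', column 1 is 'b'
    static final int[][] delta = {
        { SEEN_A, DEAD   },   // from START: must begin with 'a'
        { SEEN_A, ACCEPT },   // from SEEN_A
        { SEEN_A, ACCEPT },   // from ACCEPT
        { DEAD,   DEAD   }    // DEAD is a trap state, keeping delta total
    };

    static boolean accepts(String s) {
        int state = START;                          // q0
        for (char c : s.toCharArray()) {
            if (c != 'a' && c != 'b') return false; // symbol not in the alphabet
            state = delta[state][c == 'a' ? 0 : 1];
        }
        return state == ACCEPT;                     // F = {ACCEPT}
    }

    public static void main(String[] args) {
        System.out.println(accepts("ab"));   // true
        System.out.println(accepts("aab"));  // true
        System.out.println(accepts("ba"));   // false
    }
}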
• DFA Example:
– Design a DFA for a newspaper vending machine if
each copy of the newspaper is 35 cents and the machine
accepts only nickels, dimes, and quarters.
– Note that the machine gives no change; for example,
if the customer drops in 40 cents, one copy of the
newspaper comes out and no change is returned.
2.2.4 Combining DFA
• Figure 2.9 (p11) shows a DFA that accepts the
strings “new” and “null” (a code sketch follows the diagram)
[Fig 2.9 state diagram: 0 –n→ 1; 1 –e→ 2 –w→ 3 (accepts “new”); 1 –u→ 4 –l→ 5 –l→ 6 (accepts “null”)]
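A switch-based recognizer is a common way to hand-code such a DFA; the sketch below (the method name is our own) follows the state numbering of Fig 2.9:

static boolean acceptsNewOrNull(String s) {
    int state = 0;                        // state 0 is the start state
    for (char c : s.toCharArray()) {
        switch (state) {
            case 0: state = (c == 'n') ? 1 : -1; break;
            case 1: state = (c == 'e') ? 2 : (c == 'u') ? 4 : -1; break;
            case 2: state = (c == 'w') ? 3 : -1; break;  // "new" complete
            case 4: state = (c == 'l') ? 5 : -1; break;
            case 5: state = (c == 'l') ? 6 : -1; break;  // "null" complete
            default: state = -1;          // no transitions out of states 3 and 6
        }
        if (state == -1) return false;    // dead: no matching transition
    }
    return state == 3 || state == 6;      // final states F = {3, 6}
}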
2.2.5 More Complicated DFA
• Fig. 2.10 (p12): a DFA that accepts all legal simpleJava
identifiers
• Fig. 2.11 (p13): results of running various strings through
the DFA of Fig. 2.10.
• Fig. 2.12 (p13): A DFA that accepts “new”, “null” and all
legal identifiers in simpleJava
• How would you design a DFA for a Coke vending machine
if each can of Coke is 75 cents and the machine accepts
only dimes and quarters? Note that this machine gives
no change.
• See Appendix A (p318) for simpleJava Reference
Manual
• See Appendix B (p332) for combination DFA in detail
2.3 Regular Expressions
• Regular expressions (REs) allow us to describe
sets of strings using 2 basic operators: │ (meaning
OR, or union) and * (meaning zero or more
repetitions)
• ε stands for the empty string
• Examples:
– (aab│bba) ⇒ {“aab”, “bba”}
– a(ab│bc) ⇒ {“aab”, “abc”}
– (a│b)(c│d) ⇒ {“ac”, “ad”, “bc”, “bd”}
– (ab)* ⇒ {“”, “ab”, “abab”, “ababab”, …}
– a(bc)*d ⇒ {“ad”, “abcd”, “abcbcd”, …}
– (ab│cd)* ⇒ {“”, “ab”, “cd”, “abab”, “abcd”, “cdab”, “cdcd”, …}
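These denotations can be sanity-checked with Java’s java.util.regex package, which writes | where the text writes │ (an illustrative aside, not part of the text):

import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // (ab)*: zero or more repetitions of "ab"
        System.out.println(Pattern.matches("(ab)*", ""));       // true
        System.out.println(Pattern.matches("(ab)*", "abab"));   // true
        System.out.println(Pattern.matches("(ab)*", "aba"));    // false
        // a(ab|bc): "a" followed by either "ab" or "bc"
        System.out.println(Pattern.matches("a(ab|bc)", "aab")); // true
        System.out.println(Pattern.matches("a(ab|bc)", "abc")); // true
    }
}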
2.3.1 Formal Languages and Strings
• See Definitions 1 to 8 (pp14-15)
– Def 1: Alphabet
– Def 2: String
– Def 3: String Length
– Def 4: Empty String
– Def 5: Formal Language
– Def 6: Concatenation (i.e. AND): x○y (= xy)
– Def 7: Language Concatenation: L○M (= LM)
– Def 8: Kleene Closure: the Kleene closure of a formal language
L is denoted L* = {ε} ∪ L ∪ L² ∪ L³ ∪ L⁴ ∪ …
– Examples (see Definition 8 on p15):
• If L = {a} then L* = {ε, a, aa, aaa, …}
• If L = {bc} then L* = {ε, bc, bcbc, bcbcbc, …}
• If L = {a, b} then L* = the set of all strings of “a”s and “b”s, including the empty string
• If L = {a, b, c} then L* = the set of all strings of “a”s, “b”s, and “c”s, including the empty string
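As a quick illustration of Definition 8 (our own example, not from the text), the strings of L* up to a bounded length can be enumerated in Java:

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class KleeneDemo {
    public static void main(String[] args) {
        List<String> L = List.of("a", "b");
        Set<String> closure = new LinkedHashSet<>();
        closure.add("");                     // the empty string is always in L*
        for (int i = 0; i < 3; i++) {        // adds L^1, L^2, L^3
            Set<String> next = new LinkedHashSet<>(closure);
            for (String s : closure)
                for (String w : L)
                    next.add(s + w);
            closure = next;
        }
        System.out.println(closure);         // [, a, b, aa, ab, ba, bb, aaa, ...]
    }
}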
2.3.2 Regular Expression Definition
• A regular expression is a way to describe
a formal language over an alphabet ∑
• For any regular expression α, we will use
L[α] to denote the formal language
described by α
• Regular expressions are defined
recursively on page 15
• See Table 2.2 (p16) for some examples
2.3.3 Regular Expression Shorthand
• (a○b) ⇒ (ab)
• ((a│b)│c) ⇒ a│b│c
• Is ab│c the same as (ab)│c or a(b│c)?
• Operator precedence is:
1. Kleene closure (*)
2. Concatenation (○) (means AND)
3. Union (│) (means OR)
• So ab│c = (ab)│c, and
ab*c│de = a(b*c)│(de)
2.3.3 Regular Expression Shorthand (cont.)
• Use square brackets to choose a single
character from a set
• For example:
– [abcd] = (a│b│c│d)
– [a-f] = (a│b│c│d│e│f)
– [a-c0-3] = (a│b│c│0│1│2│3)
– JavaCC uses ["a"-"c", "0"-"3"] to represent [a-c0-3]
– Use ^ as the first character to represent “not in the set”
• For example: [^a-zA-Z] stands for any character that is not a
letter
– JavaCC uses ~ to represent “not”
• For example: ~["a"-"z","A"-"Z"] means [^a-zA-Z]
2.3.3 Regular Expression Shorthand (cont.)
• For any regular expression α:
– α? is equivalent to (α│ε)
• For example: L[ab?c] = {abc, ac}
– α+ is equivalent to αα*
• For example: L[(ab)+c] = {abc, ababc, abababc, …}
• See Table 2.3 (p17) for some shorthand regular
expression examples
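Java’s regex engine supports the same ? and + shorthand, so these equivalences can be tried directly (an illustrative aside, not part of the text):

import java.util.regex.Pattern;

public class ShorthandDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("ab?c", "ac"));      // true: ? makes the b optional
        System.out.println(Pattern.matches("ab?c", "abc"));     // true
        System.out.println(Pattern.matches("(ab)+c", "ababc")); // true: + means one or more
        System.out.println(Pattern.matches("(ab)+c", "c"));     // false: at least one "ab" required
    }
}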
2.3.4 Intuitive Description of Regular Expressions
• Concatenation means “is followed by”
• │ is the same as “or”
• * means zero or more occurrences
• Examples:
– abc(d│e) (“abc” followed by “d” or “e”)
– (a│b)(c│d) (“a” or “b”, followed by “c” or “d”)
– (a│b)(a│b) (all two-character strings of “a”s and “b”s)
– a*b (zero or more “a”s followed by a “b”)
– (a│b)c* (“a” or “b”, followed by zero or more “c”s)
– (a│b)* (all strings that are made up of “a”s and “b”s)
– The regular expression for an identifier in simpleJava is
[a-zA-Z_][a-zA-Z_0-9]* ⇒ {a, A, _, aa, aA, a_, a0, Av7, …}
– See Table 2.4 (p18) for some examples
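The simpleJava identifier expression can likewise be tried with Java’s regex engine, which uses the same bracket and * notation (an illustrative aside):

public class IdentifierDemo {
    public static void main(String[] args) {
        String id = "[a-zA-Z_][a-zA-Z_0-9]*";      // simpleJava identifier pattern
        System.out.println("Av7".matches(id));     // true
        System.out.println("_tmp".matches(id));    // true: may start with an underscore
        System.out.println("9lives".matches(id));  // false: may not start with a digit
    }
}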
2.4 JavaCC – A Lexical Analyzer
and Parser Generator
• JavaCC takes as input a set of regular expressions
that describe tokens
• It creates a DFA that recognizes that set of
tokens, and
• then creates a Java class to implement the DFA
• Table 2.4 (page 18) shows some tokens and their
regular expressions
• JavaCC also creates a parser which will be
explored in Chapter 4
2.4.1 Structure of a JavaCC file
• Input files to JavaCC have the extension “.jj”
• A sample foo.jj file is shown on p18
• Several tokens can be defined in the same TOKEN block,
with the rules separated by a “|” (see Fig 2.13 on p19)
• By convention, token names are in all UPPERCASE.
• When JavaCC runs on the file foo.jj, JavaCC creates a class
fooTokenManager, which contains a method getNextToken().
• The getNextToken() method implements the DFA that
recognizes the tokens described by the regular expressions
• Every time getNextToken() is called, it finds the rule whose
regular expression matches the next sequence of
characters in the input file, and returns the token for that rule
• See the simple JavaCC input file simple.jj in Fig. 2.13 (p19)
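For orientation, a minimal .jj file in the spirit of simple.jj might look like the sketch below (the token names follow the text, but Fig. 2.13 may differ in details):

PARSER_BEGIN(simple)
public class simple { }
PARSER_END(simple)

TOKEN :
{
    < ELSE: "else" >
|   < IDENTIFIER: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9"])* >
}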
2.4.1 Structure of a JavaCC file (cont.)
• When more than one rule matches the input,
JavaCC uses the following strategy:
1. Always match the longest possible string
2. If two different rules match the same string, use the
rule that appears first in the .jj file
• For example, the string “else” matches both
the ELSE and IDENTIFIER rules, so
getNextToken() will return the ELSE token
(the ELSE rule appears first)
• But “else21” matches only the IDENTIFIER
rule
2.4.2 Tokens in JavaCC
• When we run JavaCC on the input file
simple.jj, JavaCC creates several files that
implement a lexical analyzer:
1. simpleTokenManager.java: implements
the token manager class
2. simpleConstants.java: an interface that
defines a set of constants (Fig 2.14, p21)
3. Token.java: describes the tokens returned
by getNextToken()
2.4.3 Dealing with Whitespaces –
SKIP rules
• If the input file contains “else ;”, the first
call to getNextToken() will return an ELSE
token
• We then use SKIP rules to skip the
white space between “else” and “;”
• See Fig 2.15 (p22): SKIP rules inserted into
the simple.jj of Fig 2.13
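A typical whitespace SKIP block (a sketch; the details in Fig 2.15 may differ) is:

SKIP :
{
    " "
|   "\t"
|   "\n"
|   "\r"
}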
2.4.4 Skipping Comments –
States in JavaCC
• For single-line comments, use the SKIP rule below:
SKIP :
{
    < "//" (~["\n"])* "\n" >
}
which matches text that starts with //, continues with any
characters except \n, and ends with \n
• For multiple-line comments, if the input file contains “/*”, then use
the following SKIP rule:
SKIP :
{
    < "/*" > : IN_COMMENT
}
to recognize the beginning /* of a multiple-line comment and enter
the IN_COMMENT state
2.4.4 Skipping Comments –
States in JavaCC (cont.)
• In the IN_COMMENT state, use the following SKIP
rule to skip the body of the comment:
<IN_COMMENT>
SKIP :
{
    < "*/" > : DEFAULT
|   < ~[] >
}
• On the end-of-comment symbol “*/”, switch from the
IN_COMMENT state back to the DEFAULT state; otherwise
continue to skip characters one at a time (note: ~[] matches
any single character, since it is the complement of the empty
character set [])
• See Fig 2.16 (p24)
2.4.4 Skipping Comments –
States in JavaCC (cont.)
• For example:
else /* Comment */
12/* Comment */else
• At the beginning, in the DEFAULT state, the ELSE token is returned
• The analyzer then reads /* and enters the IN_COMMENT state,
skipping each character after /* until it reads */; it skips */ and
switches back to the DEFAULT state
• Next, it reads “12” and returns an INTEGER_LITERAL token
• It then reads /* and enters the IN_COMMENT state again, skipping
everything after /* until it reads */; it skips */ and switches back
to the DEFAULT state
• At last, it reads “else” and returns another ELSE token
2.4.5 Adding Actions to SKIP and
TOKEN Definitions
• We need some way of keeping track of the
number of comments that have been skipped
• Fig 2.17 (p26) shows how: it inserts a
numcomments++; statement into both comment
SKIP rules (single-line and multi-line), so that the
token manager can report how many comments
were skipped
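Putting the pieces together, the counter is declared in a TOKEN_MGR_DECLS block and incremented in a lexical action, along these lines (a sketch of the idea in Fig 2.17, not the figure itself):

TOKEN_MGR_DECLS :
{
    public static int numcomments = 0;
}

SKIP :
{
    < "//" (~["\n"])* "\n" > { numcomments++; }
}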
2.4.6 Using the Generated
TokenManager in Java Code
• When JavaCC is run on the file foo.jj, it
generates the following files (p25):
– foo.java
– ParseException.java
– Token.java
– SimpleCharStream.java
– fooTokenManager.java
– fooConstants.java
– TokenMgrError.java
2.4.6 Using the Generated TokenManager
in Java Code (cont.)
• To create a Java program that uses the lexical
analyzer created by JavaCC, you need to
instantiate a variable of type fooTokenManager,
where foo is the name of your .jj file
• The constructor for fooTokenManager requires
an object of type SimpleCharStream
• The constructor for SimpleCharStream requires
a standard java.io.InputStream
• Thus, you can use the generated lexical analyzer
as shown on the next slide
2.4.6 Using the Generated TokenManager
in Java Code (cont., p26)
Token t;
fooTokenManager tm;
java.io.InputStream infile;
infile = new java.io.FileInputStream("filename");
tm = new fooTokenManager(new SimpleCharStream(infile));
t = tm.getNextToken();
while (t.kind != fooConstants.EOF)
{
    /* Process t */
    t = tm.getNextToken();
}
2.4.6 Using the Generated TokenManager
in Java Code (cont.)
• Any public variables declared in the
TOKEN_MGR_DECLS block (p18) can be accessed through
the token manager
• For instance, the main program in RemoveComments.java
(Fig. 2.18, p28) calls the lexical analyzer generated from
remComments.jj (Fig. 2.17, p26) to strip all of the
comments out of an input file
• Assuming that the files remComments.jj and
RemoveComments.java are in the same directory, the
comment stripper can be built with the following commands:
javacc remComments.jj
javac *.java
• The code can be tested with the command:
java RemoveComments <filename>
2.5 Creating a Lexical Analyzer
2.5.1 Project Definition
• Write a JavaCC file “sjava.jj” that creates a lexical
analyzer for simpleJava tokens
• A skeleton of sjava.jj can be found in Fig. 2.20 (p32) to
get you started
• Once you have built your lexical analyzer, it can be tested with
the Java class TokenTest in Fig. 2.21 (p33)
• Your lexical analyzer should be able to recognize the
following keywords and symbols (p30):
and class do else false for if true while
+ - * / [ ] { } ( ) . , ; == != < > <= >= = && || !
• It should also recognize integer literals and identifiers
• It should skip single-line and multi-line comments
• Refer to page 319 for definitions of tokens and comments
2.5.2 Project Difficulty
• The main difficulty is learning the syntax of JavaCC
• To minimize your frustration, add to the sjava.jj file a little at a
time, and constantly check it with JavaCC to find bugs
• Provided files:
– sjava.jj (Fig. 2.20, p32)
– TokenTest.java (Fig. 2.21, p33)
– test1.sjava to test4.sjava: sample test programs for
your lexical analyzer
• Once you have modified sjava.jj to implement a complete
lexical analyzer, you can build it with the
commands:
javacc sjava.jj
javac *.java
• You can test your lexical analyzer with the command:
java TokenTest <filename>
Chapter 2 Homework
• Due: 2/13/2013
• Part A:
• Design a DFA for a chewing gum vending machine if each pack
of chewing gum is 25 cents and the machine accepts only nickels,
dimes, and quarters.
• Note that the machine gives no change. For example, if the
customer drops in 30 cents, one pack of chewing gum will come
out and no change is returned.
• Part B:
• Do the following Exercises in Chapter 2:
1. (a), (c), (e), (g)
2. (a), (b), (c)
3. (a), (b), (c)
4. Give a DFA for each of the languages in question 1
(a), (c), (e), (g)
Project 1
• Due: 2/20/2013
• Project Title: A Lexical Analyzer for simpleJava tokens
• You need to write a JavaCC file “sjava.jj” that creates a
lexical analyzer for simpleJava tokens
• Study Chapter 2 and JavaCC thoroughly to figure out how
to design, implement, and test your lexical analyzer.
• Hand in & email me:
– Your sjava.jj program
– Your compilation results (outputs) from the javacc and javac
commands
– Your testing results (outputs) from the java TokenTest <filename>
command
– Your input files for testing the TokenTest program
– A README file describing the steps to run and test your programs