3. Phase 2 : Syntax Analysis Part I

Download Report

Transcript 3. Phase 2 : Syntax Analysis Part I

3. Phase 2 : Syntax Analysis Part I
• Parsing.
• The symbol table.
• The abstract syntax tree.
• Representing statements &
expressions.
• Recursive descent parsing.
• A bit of theory.
3.1
Parsing
• Given a grammar and a particular sequence of
symbols, either :
– Find how that sequence might be derived from
the sentence symbol by using the productions
of the grammar.
– Or, show that the sequence could not be
derived from the sentence symbol and
therefore is not a sentence.
• In programming language terms, for ‘sentence’
read ‘source program’ and for ‘sequence’ read
‘contents of the file being parsed’.
• The parser reads the file and checks that the
contents could be derived from the production
rules. If not it produces compile time error
messages.
3.2
Parsing II
• If the source program is syntactically correct then parsing
produces :
– Symbol table (ST) from the declarations.
– Abstract syntax tree (AST) from the statements.
• If the source program is syntactically incorrect then parsing
produces :
– Error messages.
• ST and AST data structures and error messages declared in
syner.h in phase 2 unit directory :
/usr/users/staff/aosc/cm049icp/phase2
3.3
The Symbol Table
• A data structure recording all the user
defined symbols (i.e. identifiers) declared in
the program.
• Either a variable sized array or a linked list.
– No variable sized arrays in C++ so
we’ve got to use a linked list.
 C++ does have dynamic arrays but
they are not variable sized.
• Each (user defined) identifier is stored in a
struct with elements for its name, type and
other things as desired.
• Used to determine what identifiers refer to
during syntax analysis and code
generation.
3.4
The Symbol Table II
• For a real language the ST is a sequence
of frames, one per subprogram.
• Frame : Conceptually, a sequence of
entries, one per variable / constant.
• C-- has no subprograms so the ST is
simply a sequence of entries.
• Example C-- source :
const string s = “Hello” ;
int n = 0 ;
bool b ;
{ ...
}
3.5
Symbol Table Entries
• Schematic diagram :
ident
type
constFlag
initialise
address
next
:
:
:
:
:
:
“b”
BOOLDATA
false
NULL
0
ident
type
constFlag
initialise
address
next
:
:
:
:
:
:
ident
type
constFlag
initialise
address
next
“s”
STRINGDATA
true
0
NULL
:
:
:
:
:
:
“n”
INTDATA
false
0
0
Stored in
reverse of
declaration
order.
“Hello”
3.6
C++ Data Structures For The Symbol Table
• In syner.h :
enum DataType { VOIDDATA,
BOOLDATA,
STRINGDATA, INTDATA } ;
struct SymTab
{ string
ident
DataType type
bool
constflag
Factor
*initialise
int
address
SymTab
*next
} ;
;
;
;
;
;
;
struct Factor
{ ...
// Later.
} ;
3.7
The Abstract Syntax Tree
• An abstract representation of the program.
i
f
• An n-ary tree.
• Example C-- source :
if (b > 6)
x = 0 ;
else y = x ;
>
b
=
6
x
=
0
y
x
• Of course it’s nowhere near as simple as this
diagram suggests.
3.8
C++ Data Structures For The Abstract Syntax Tree
• In syner.h :
enum StatType { NULLST, ASSIGNST,
IFST,
WHILEST,
CINST, COUTST } ;
struct AST
{ StatType tag
;
AssignSt *assignst ;
IfSt
*ifst
;
WhileSt *whilest ;
CinSt
*cinst
;
CoutSt
*coust
;
AST
*next
;
} ;
3.9
Schematic Diagram Of An Abstract Syntax Tree
• Example C-- program :
void main()
{ x = 23 + 4 ;
cout << x ;
}
tag
: ASSIGNST
next
:
assignst :
AST for
x = 23 + 4
Built in reverse order. Must
then be reversed.
Unused fields
set to NULL.
tag
next
coutst
: COUTST
: NULL
:
AST for
cout << x
3.10
Assignment Statements
• Example C-- source :
target
expr
x = y + z * 34 ;
• C++ data structure :
struct AssignSt
{ SymTab
*target ;
Expression *expr
;
} ;
ST entry
for x
AST for
y + z * 34
• The Expression data structure is quite complicated.
– Must handle arbitrary expressions.
– Later.
3.11
if Statements
• Example C-- source :
if (a > b)
{ cout << a ;
}
else
{ cout << b ;
} ;
The { and } are
obligatory in C--.
So is the ;.
• C++ data structure :
struct IfSt
{ Expression
AST
AST
int
int
} ;
*condition
*thenstats
*elsestats
elselabel
endlabel
;
;
;
;
;
elselabel and
endlabel are used in
code generation.
3.12
if Statements II
• Schematic diagram :
AST for
a > b
condition
thenstats
elsestats
elselabel
thenlabel
:
:
:
: 0
: 0
AST for
cout << a
AST for
cout << b
3.13
while Statements
• Example C-- source :
The { and } are
obligatory in C--.
So is the ;.
while (a > 0)
{ a = a - 1 ;
cout << a ;
} ;
• C++ data structure :
struct WhileSt
{ Expression *condition
AST
*stats
int
startlabel
int
endlabel
} ;
;
;
;
;
startlabel and
endlabel are used in
code generation.
3.14
while statements II
• Schematic diagram :
AST for
a > 0
condition
stats
startlabel
endlabel
:
:
: 0
: 0
AST for
a = a - 1 ;
cout << a ;
3.15
cin statements
• Example C-- source :
cin >> a ;
• C++ data structure :
a must be an int
variable in C--.
struct CinSt
{ SymTab *invar ;
} ;
• Schematic diagram :
invar
:
Symbol table
entry for a.
3.16
cout statements
• Example C-- source :
cout >> a ;
• C++ data structure :
struct CoutSt
{ SymTab *outvar ;
} ;
a must be an int
variable or
constant or a
string constant in
C--.
• Schematic diagram :
outvar
:
Symbol table
entry for a.
3.17
Expressions
• C-- syntax :
expression
::= basic_exp [ rel_op basic_exp ]
basic_exp
::= term { add_op basic_exp }
term
::= factor {mul_op term }
factor
::= literal
|
identifier
|
‘(‘ expression ‘)’ |
‘!’ factor
• Example expressions :
22 + 4 < b
z != y
a
!true
3.18
Expressions II
• In syner.h :
struct Expression
{ BasicExp *be1 ;
string
relOp ;
BasicExp *be2 ;
} ;
struct BasicExp
{ Term
*term ;
string
addOp ;
BasicExp *bexp ;
} ;
Unused fields
set to ““ or
NULL as
appropriate.
struct Term
{ Factor *fact ;
string mulOp ;
Term
*term ;
}
3.19
Factors
• In syner.h :
struct Factor
{ bool
literal
DataType
type
string
litBool
string
litString
int
litInt
SymTab
*ident
Expression *bexp
Factor
*nFactor
} ;
;
;
;
;
;
;
;
;
//
//
//
//
//
//
//
//
Tag field
Type field
Boolean literal
String literal
Integer literal
Identifier
Bracket expression
Negated factor
• Literal is short for literal constant.
• Note that C-- does not allow negative integer literals.
– Makes things a lot simpler.
• Unused fields are ignored.
– Set to ““ or 0 or NULL as appropriate.
3.20
Example Expressions
• Example C-- expression :
42
• Schematic diagram :
be1
:
relOp : ““
be2
: NULL
term :
addOp : ““
bexp : NULL
literal
type
litBool
litString
litInt
ident
bexp
nFactor
:
:
:
:
:
:
:
:
fact :
mulOp : ““
term : NULL
true
intData
““
““
42
NULL
NULL
NULL
3.21
Example Expressions II
* as from be1
on last slide.
• Example C-- expression :
42 < a
• Schematic diagram :
be1
: *
relOp : “<“
be2
:
term :
addOp : ““
bexp : NULL
literal
type
litBool
litString
litInt
ident
bexp
nFactor
:
:
:
:
:
:
:
:
fact :
mulOp : ““
term : NULL
false
intData
““
““
0
NULL
NULL
Symbol table
entry for a
3.22
Recursive Descent Parsing
• Notice that the AST is a recursive data structure.
– Definition : for recursion see recursion.
• This is necessary because C-- has a recursive
syntax :
– while and if statements may include other
statements.
– Expressions include basic expressions which
include terms which include factors which
include expressions.
• The simplest way to syntax analyse a recursive
syntax is via a method known as recursive
descent syntax analysis.
– Usually called recursive descent parsing
although technically parsing is syntax analysis
plus lexical analysis.
3.23
Recursive Descent Parsing II
• In a recursive descent parser we write a subprogram to
recognise each of the different syntactic constructs.
– One subprogram per production rule (more or less).
• Each analysis subprogram calls the lexAnal to get
tokens as required.
• Each analysis subprogram may call other functions to
recognise component syntactic constructs.
– synStatement will call synAssign, synIf,
synWhile etc.
– synIf and synWhile will call synStatement
recursively.
• Lots more on how to do recursive descent parsing in
the next lecture.
3.24
Chomsky Grammars
• Noam Chomsky is an American linguist (among other things).
• Chomsky’s theories about the syntactic structure of languages
are very popular with Computists.
– Much less popular with Linguists.
• Chomsky identified 4 types of grammar.
– Increasing levels of restrictions on the production rules.
• In a Chomsky type 0 grammar all production rules are of the form
A ::= alpha
where A is a non-empty string of non-terminal symbols and
alpha is a string of terminal and/or non-terminal symbols.
• Type 0 is the most general type of grammar.
– Often called free grammars.
– Natural language grammars.
3.25
Chomsky Grammars II
• A Chomsky type 1 grammar contains only
production rules of the form
beta A gamma ::= beta alpha gamma
where A is a single non-terminal symbol and
alpha, beta and gamma are strings of
terminal and/or non-terminal symbols.
– i.e. each production replaces a single nonterminal symbol in a particular context.
• Often called context-sensitive grammars.
3.26
Chomsky Grammars III
• A Chomsky type 2 grammar contains only
production rules of the form
A ::= alpha
where A is a single non-terminal symbol and alpha
is a string of terminal and/or non-terminal symbols.
• Often called context-free grammars.
• A Chomsky type 3 grammar contains only
production rules of the forms
A ::= a
and
A ::= bB
in which A and B are single non-terminal symbols
and a and b are single terminal symbols.
• Often called regular-expression grammars.
3.27
Programming Languages And Chomsky Grammars
• In programming languages the syntax of statements,
expressions etc. is defined by a type 2 grammar.
– Syntax analysis works on type 2 grammars.
• Type 3 grammars are used to define the micro-syntax
of lexical items such as identifiers, literals and strings.
– Lexical analysis works on type 3 grammars.
• Note that a type 3 grammar is also a type 2 grammar.
– A type 2 grammar is also a type 1 grammar.
– A type 1 grammar is also a type 0 grammar.
• In general type 0 grammars cannot be analysed by a
computer.
– The other 3 types can be analysed by a computer.
3.28
Backtracking
• Consider the following grammar :
S ::= ‘c’ A ‘d’
A ::= ‘a’ ‘b’ | ‘a’
• Suppose the input sequence is “cad”.
– Parser consumes ‘c’.
– Chooses first alternative for A and consumes ‘a’.
– Parser now looking for a ‘b’ which is not present.
– Parser must backtrack and try the second
alternative for A.
– Parser consumes ‘a’.
– Parser consumes ‘d’.
• Backtracking is inefficient and makes the parser more
complex to write.
3.29
Predictive Parsing
• It is reasonably straightforward to construct a grammar which
can be parsed without backtracking.
• Simplest type is the LL(1) grammar :
– An LL(1) grammar can be parsed by reading the input left
to right.
– At each step the leftmost derivation is produced.
 Only the leftmost non-terminal in the sequence is
replaced.
– Which production rule should be used at each step can be
uniquely determined from the tokens already consumed,
the current input token and a peek at the next input token.
 1 token lookahead.
• C-- has an LL(1) grammar.
• Ideal for recursive descent predictive parsing.
3.30
Summary
• Syntax analysis : determining whether or not a file
contains a syntactically correct program.
• Recursive descent parsing : One subprogram to
syntax analyse each (major) syntactic construct.
• Symbol table holds information about user defined
identifiers for use in syntax analysis and code
generation.
• Abstract syntax tree holds an abstract representation
of the statements for use in syntax analysis and code
generation.
• Chomsky grammars are used to define the syntax of
programming languages.
• C-- has an LL(1) Chomsky type 2 grammar.
3.31