02b_Parsers_and_Grammar

Download Report

Transcript 02b_Parsers_and_Grammar

Parsers and Grammar
1
Categories of Grammar Rules

Declarations or definitions.
AttributeDeclaration ::=
[ final ] [ static ] [ access ] datatype [ = expression ]
{ , datatype [ = expression ] } ;
access ::= ' public ' | ' protected ' | ' private '

Statements.
 assignment, if, for, while, do_while

Expressions,
such as the examples in these slides.

Structures such as statement blocks, methods, and
entire classes.
StatementBlock ::= '{' { Statement; } '}'
2
Parsing Algorithms (1)
Broadly divided into LL and LR.
 LL algorithms match input directly to left-side
symbols, then choose a right-side production that
matches the tokens. This is top-down parsing
 LR algorithms try to match tokens to the right-side
productions, then replace groups of tokens with the
left-side nonterminal. They continue until the
entire input has been "reduced" to the start symbol
 LALR (look-ahead LR) are a special case of LR;
they require a few restrictions to the LR case
 Reference: Sebesta, section 4.3 - 4.5.

3
Parsing Algorithms (2)

Look ahead:
 algorithms must look at next token(s) to decide
between alternate productions for current tokens
 LALR(1) means LALR with 1 token look-ahead
 LL(1) means LL with 1 token look-ahead

LL algorithms are simpler and easier to visualize.

LR algorithms are more powerful: can parse some
grammars that LL cannot, such as left recursion.

yacc, bison, and CUP generate LALR(1) parsers

Recursive-descent is a useful LL algorithm that "every
computer professional should know" [Louden].
4
Top-down (LL) Parsing Example
For the input: z = (2*x + 5)*y - 7;
tokens: ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
Grammar rules (as before):
assignment => ID = expression ;
expression => expression + term
| expression - term
| term
term
=>
term * factor
| term / factor
| factor
factor =>
( expression )
| ID
| NUMBER
5
Top-down Parsing Example (2)
The top-down parser tries to match input to left sides.
ID = ( NUMBER * ID + NUMBER )*
assignment
ID = expression
ID = expression
ID = term
ID = term
*
ID = factor
*
ID = ( expression
*
ID = ( expression
+ term ) *
ID = ( term
+ term ) *
ID = ( term * factor + term )*
ID = ( factor * ID + factor )*
ID = ( NUMBER * ID + NUMBER )*
ID = ( NUMBER * ID + NUMBER )*
ID = ( NUMBER * ID + NUMBER )*
ID
- NUMBER ;
factor
factor
factor
factor
factor
factor
factor
factor
ID
ID
-
term
term
term
term
term
term
term
term
term
term
factor
ID
;
;
;
;
;
;
;
;
;
;
;
;
6
Top-down Parsing Example (3)

Problem in example: we had to look ahead many
tokens in order to know which production to use.

This isn't necessary provided that we know the
grammar is parsable using LL (top-down) methods.

There are conditions on the grammar that we can test
to verify this. (see: The Parsing Problem)

Later we will study the recursive-descent algorithm
which does top-down parsing with minimal lookahead.
7
Bottoms-up (LR) Parsing Example (1)
tokens: ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
parser: ID
...
read (shift) first token
factor
...
reduce
factor =
...
shift
FAIL: Can't match any rules (reduce)
Action
Backtrack and try again
ID = ( NUMBER ...
shift
ID = ( factor ...
reduce
ID = ( term * ...
sh/reduce
ID = ( term * ID ...
shift
ID = ( term * factor ...
reduce
ID = ( term ...
reduce
ID = ( term + ...
shift
ID = ( expression + NUMBER ...
reduce/sh
ID = ( expression + factor ...
reduce
ID = ( expression + term ...
reduce
8
Bottoms-up Parsing Example (2)
tokens: ID = ( NUMBER * ID + NUMBER ) * ID -NUMBER;
input:
ID = ( expression
...
ID = ( expression ) ...
ID = factor
...
ID = factor *
...
ID = term *
ID
...
ID = term * factor ...
ID = term
...
ID = term ...
ID = expression ...
ID = expression - NUMBER ...
ID = expression - factor ...
ID = expression - term
...
ID = expression ;
assignment
Start Symbol
SUCCESS!!
reduce
shift
reduce
shift
reduce/sh
reduce
reduce
shift
reduce
shift
reduce
reduce
shift
reduce
9
Bottoms-up Parsing Example (3)
LR parsing processes the input stream from the Left
and tries to match the input to the Right side of a
production.
 When something matches, it reduces the expression to
a left side non-terminal symbol.
 Repeat the process until the entire input stream is
matched.
 This could potentially be an O(n3) task, but Knuth and
others devised a table-based algorithm that is O(n).

10
The Parsing Problem
11
The Parsing Problem

Top-down parsers must decide which production to
use based on the current symbol, and perhaps
"peeking" at the next symbol (or two...).

Predictive parser: a parser that bases its actions on
the next available token (called single symbol lookahead).

Two conditions are necessary: [see Louden, p. 108110]
12
The Parsing Problem (cont.)
Condition 1: the ability to choose between multiple
alternatives, such as: A  1 | 2 | ... | n
 define First() = set of all tokens that can be the
first token for any production cascade that
produces symbol 
 then a predictive parser can be used for rule A if:
First(1)  First(2) ...  First(n) is empty.
Condition 2: the ability of the parser to detect presence
of an optional element, such as A   [ b ].

Can the parser detect for certain when b is
present?
13
The Parsing Problem (cont.)
Example: list  expr [list].
How do we know that list isn't part of expr?

define Follow(  ) = set of all tokens that can follow
the non-terminal  some production. Use a
special symbol ($) to represent the end of input if 
can be the end of input.

Example: Follow( factor ) = { +, -, *, /, ), $ }
while Follow( term ) = { *, /, ), $ }

then a predictive parser can detect the presence of
optional symbol b if First( b )  Follow( b ) is
empty.
14
Review and Thought Questions
15
Lexics vs. Syntax vs. Semantics
Division between lexical and syntactic structure is not
fixed:
 number can be a token or defined by a grammar rule.
 Implementation can often decide:
 scanners are faster
 parsers are more flexible
 error checking of number format as regex is simpler
 Division between syntax and semantics is not fixed:



we could define separate rules for IntegerNumber and
FloatingPtNumber , IntegerTerm, FloatingPtTerm, ... in order to
specify which mixed-mode operations are allowed.
or specify as part of semantics
16
Numbers: Scan or Parse?
We can construct numbers from digits using the scanner
or parser. Which is easier / better ?

Scanner: Define numbers as tokens:
number : [-]\d+

Parser: grammar rules define numbers (digits are
tokens):
number => '-' unsignednumber | unsignednumber
unsignednumber => unsignednumber digit | digit
digit =>
0|1|2|3|4|5|6|7|8|9
17
Is Java 'Class' grammar context-free?

A class may have static and instance attributes.

An inner class or local class have same syntax as toplevel class, but:


may not contain static members (except static constants)

inner class may access outer class using OuterClass.this

local class cannot be "public"
Does this means the syntax for a class depends on
context?
18
Alternative operator notation

Some languages use prefix notation: operator comes
first
expr => + expr expr | * expr expr | NUMBER


Examples:
* + 2 3 4
means (2 + 3) * 4
+ 2 * 3 4
means
2 + (3 * 4)
Using prefix notation, we don't have to worry about
precedence of different operators in BNF rules !
19