Chapter 4 - Syntax - Kasetsart University

Download Report

Transcript Chapter 4 - Syntax - Kasetsart University

Parsing Tools
Introduction to Bison and Flex
1
Scanning/parsing tools
lex - original UNIX lexics generator (Lesk, 1975)

create a C function that will parse input according
to a set of regular expressions
yacc - "yet another compiler compiler" UNIX parser
(Johnson, 1975)

generate a C program for a parser from BNF rules
bison and flex ("fast lex") - more powerful, free
versions of yacc and lex, from GNU Software Fnd'n.
Jflex - generates Java code for a scanner
CUP - generates Java code for a parser
2
Bison Overview
Purpose: automatically write a parser program for a
grammar written in BNF.
Usage: you write a bison source file containing rules that
look like BNF.
Bison creates a C program that parses according to the
rules
term
factor
:
|
|
;
:
|
;
term '*' factor { $$ = $1 * $3; }
term '/' factor { $$ = $1 / $3; }
factor
{ $$ = $1; }
ID
NUMBER
{ $$ = valueof($1); }
{ $$ = $1; }
3
Bison Overview (2)
myparser.y
BNF rules and actions for
your grammar.
The programmer puts BNF rules and
token rules for the parser he wants in
a bison source file myparser.y
run bison to create a C program
(*.tab.c) containing a parser function.
> bison myparser.y
The programmer must also supply a
tokenizer named yylex( )
myparser.tab.c
yylex.c
parser source code
tokenizer function in C
> gcc -o myprog
myparser.tab.c yylex.c
myprog
executable program
4
Bison Overview (3)
In operation:
input file to be
parsed.
your main program calls
yyparse( ).
yyparse( ) calls yylex
when it wants a token.
yylex returns the type of
the token.
yylex puts the value of the
token in a global variable
named yylval
yylex( )
tokenizer returns the type
of the next token
yylval
yyparse( )
parser created by bison
parse tree or other
result
5
Bison source file
The file has 3 sections, separated by "%%" lines.
/* declarations go here */
%%
/* grammar rules go here */
%%
/* additional C code goes here */
Note: format for "yacc" is the same as for bison.
6
Bison source file with C declarations
You usually include C code in the declarations section.
%{
/* C declarations and #define statements go here */
#include <stdio.h>
#define YYSTYPE double
Declare that yylval will be
%}
a "double" type.
/* bison declarations go here */
%%
/* grammar rules go here */
%%
/* additional C code goes here */
7
Bison Example
Create a parser for this grammar:
expression => expression + term
| expression - term
| term
term
=> term * factor
| term / factor
| factor
factor =>
( expression )
| NUMBER
8
Bison/Yacc file for example (1)
Structure of Bison or Yacc input:
%{
/* C declarations and #DEFINE statements go here */
#include <stdio.h>
#define YYSTYPE double
%}
/* Bison/Yacc declarations go here */
%token NUMBER
/* define token type NUMBER */
%left '+' '-'
/* + and - are left associative */
%left '*' '/'
/* * and / are left associative */
%%
/* grammar rules go here */
%%
/* additional C code goes here */
9
Bison/Yacc example (2)
%%
input
line
expr
term
factor
/* Bison grammar rules */
: /* empty production to allow an
| input line
;
: expr '\n'
{ printf("Result is
: expr '+' term
{ $$ = $1 + $3;
| expr '-' term
{ $$ = $1 - $3;
| term
{ $$ = $1; }
;
: term '*' factor { $$ = $1 * $3;
| term '/' factor { $$ = $1 / $3;
| factor
{ $$ = $1; }
;
: '(' expr ')'
{ $$ = $2; }
| NUMBER
{ $$ = $1; }
;
empty input */
%f\n", $1); }
}
}
}
}
10
Bison/Yacc example (3)

$1, $2, ... represent the actual values of tokens or nonterminals (rules) that match the production.

$$ is the result.
rule
pattern to match
expr
: expr '+' term
| expr '-' term
| term
;
action
{ $$ = $1 + $3; }
{ $$ = $1 - $3; }
{ $$ = $1; }
Example:
if the input matches expr + term then set the result ($$)
equal to the sum of expr plus term ($1 + $3).
11
Bison/Yacc example (4)
Q: why can we write "$$ = $1 + $3" ?
A: because we declared "#define YYSTYPE
double", so all tokens and results are double.
rule
pattern to match
expr
: expr '+' term
| expr '-' term
| term
;
action
{ $$ = $1 + $3; }
{ $$ = $1 - $3; }
{ $$ = $1; }
12
Scanner function: yylex( )
You must supply a scanner named yylex.
 yylex returns the token TYPE (not token value).

int yylex( void ) {
int c = getchar();
/* read from stdin */
if (c < 0) return 0;
/* end of input
*/
if ( c == '+' || c == '-' ) return c;
/* for character tokens, TYPE = character itself */
if ( isdigit(c) ) {
yylval = c - '0';
/* yylval is a global var */
while( isdigit( c=getchar() ) )
yylval = 10*yylval + (c - '0');
if (c >= 0) ungetc(c,stdin);
return NUMBER;
/* token type is NUMBER */
}
...
}
13
Where is the token value?

The value of the token is stored in a global variable
named yylval.
int yylex( void ) {
int c = getchar();
/* read from stdin */
if (c < 0) return 0;
/* end of input
*/
if ( c == '+' || c == '-' ) return c;
/* read number and store as yylval */
if ( isdigit(c) ) {
yylval = c - '0';
/* get each digit */
while( isdigit( c=getchar() ) )
yylval = 10*yylval + (c - '0');
/* push next character back into input stream */
if (c >= 0) ungetc(c,stdin);
return NUMBER;
/* token type is NUMBER */
}
...
}
14
Useful C functions for yylex
int c = getchar( );
ungetc( c, stdin)
read next char from stdin
returns -1 at end of input
put character back in input
#include <ctype.h>
isdigit(c)
isalpha(c)
isalnum(c)
islower(c)
isupper(c)
isspace(c)
true if c contains a digit
true if c contains a letter
isdigit(c) || isalnum(c)
true if c is lowercase
guess?
space tab newline formfeed
scanf("%d", &num)
scanf("%lf", &dnum)
read an integer from input
read a double from input
15
Scanner function for double values

Suppose we want the data type of all tokens to be
double.

In Bison input:
#define YYSTYPE double.

This changes yylval to be a double!
%{
/* C declarations and #DEFINE statements go here */
#include <stdio.h>
#define YYSTYPE double
%}
%%
/* bison definitions go here */
16
Scanner function for double

Now yylex must know that yylval is "extern double".

Here is example of using scanf to parse numbers.
int yylex( void ) {
int c = getchar();
/* read from stdin */
if (c < 0) return 0;
/* end of the input*/
while ( c == ' ' || c == '\t' ) c = getchar( );
if ( isdigit(c) || c == '.' ) {
ungetc(c, stdin);
/* put c back into input */
scanf ("%lf", &yylval); /* get value using scanf */
return NUMBER;
/* return the token type */
}
return c; /* anything else... return char itself */
}
17
Other C functions: yyerror

Bison requires an error routine named yyerror.

yyerror is called by the parser when there is an error.

You can include yyerror( ) in your Bison source file.
/* display error message */
int yyerror( char *errmsg ) {
printf("%s\n", errmsg);
}
18
Other C functions: main
you need a write a main() function that starts the
parser.
 For a simple parser, main() calls yyparse().

/* main method to run the program */
int main( ) {
printf("Type some input. Enter ? for help.\n");
yyparse( );
}
19
Running Bison

Compile the example file simple.y
CMD> bison simple.y

Output is "simple.tab.c" it is C code for the parser.

To compile the parser by itself using gcc:
CMD> gcc -o simple.exe simple.tab.c

To run the program at the command prompt:
CMD> simple.exe
20
Simple example: definitions
File: simple.y
/* The Bison declarations section */
%{
/* C declarations and #DEFINE statements go here */
#include <math.h>
#define YYSTYPE double
%}
%token NUMBER
/* define token type for numbers */
%token '+' '-'
/* + and - are left associative */
No left/right associativity specified
21
Simple example: grammar rules
%%
input
line
expr
term
/* Simple grammar rules */
: /* allow empty input */
| input line
;
: expr '\n'
{ printf("answer: %d\n", $1); }
: expr '+' term
{ $$ = $1 + $3; }
| expr '-' term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
;
: NUMBER
{ $$ = $1; }
;
22
yyerror and main
%% /* extra C code */
/* display error message */
int yyerror( char *errmsg ) { printf("%s\n", errmsg); }
/* main */
int main() {
printf("type an expression:\n");
yyparse( );
}
23
Simple example: does it work?
cmd> bison simple.y
cmd> gcc -o calc.exe simple.tab.c
cmd> calc
Test the grammar rules:
3 + 5
3 - 4 - 5 + 6
How about this: -3
24
Simple example: exploring BNF
%%
input
line
expr
term
/* Simple grammar rules */
: /* empty input */
| input line
;
: expr '\n'
{ printf("answer: %d\n", $1); }
: expr '+' expr
{ $$ = $1 + $3; }
| expr '-' expr
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
;
: NUMBER
{ $$ = $1; }
| '-' NUMBER
{ $$ = -$2; }
;
%token NUMBER
%right '+' '-'
25
Exercise
Expand the grammar to include these operations:
4*5
2/3
10 + 3 * 4 – 1 / 2
2*(3+4)
multiplication
division
correct order of operations
grouping
26
Complete example: definitions
File: simple.y
/* The Bison declarations section */
%{
/* C declarations and #DEFINE statements go here */
#include <math.h>
#define YYSTYPE double
%}
%token NUMBER
/* define token type for numbers */
%left '+' '-'
/* + and - are left associative */
%left '*' '/'
/* * and / are left associative */
27
Complete example: grammar rules
%%
input
line
expr
term
factor
/* Bison grammar rules */
: /* allow empty input */
| input line
;
: expr '\n'
{ printf("Result is
: expr '+' term
{ $$ = $1 + $3;
| expr '-' term
{ $$ = $1 - $3;
| term
{ $$ = $1; }
;
: term '*' factor { $$ = $1 * $3;
| term '/' factor { $$ = $1 / $3;
| factor
{ $$ = $1; }
;
: '(' expr ')'
{ $$ = $2; }
| NUMBER
{ $$ = $1; }
| '-' NUMBER
{ $$ = -$2; }
;
%f\n", $1); }
}
}
}
}
28
Common Errors in Bison
1. Forgetting to quote literals: term '+' term
2. Not skipping space where space is allowed
3. Ambiguity in grammar rules
4. Forgetting ';' at the end of rule
%token NUMBER
%token + %% /* grammar rules */
expr
term
:
|
|
:
|
term + term
term - term
term
NUMBER
- NUMBER
{
{
{
{
{
$$
$$
$$
$$
$$
=
=
=
=
=
$1 + $3; }
$1 - $3; }
$1; }
$1; }
-$2; }
29
Shift / Reduce and operator order

Bison uses token look-ahead and a token stack. It
shifts tokens onto the stack until it can choose a rule to
reduce (replace) tokens with a non-terminal. Example:
expr ::=
|
|
|
|
term ::=

expr +
expr expr *
expr /
term
NUMBER
expr
expr
expr
expr
| ( expr )
Suppose the input read is 10 -
shift these onto stack because no matching rule yet.
 Suppose the next token is 2. What should Bison do?

30
Shift / Reduce and operator order
STACK:

10 -
CURRENT TOKEN:
2
This grammar is ambiguous. Bison could match "expr
- expr" or it could shift 2 onto the stack and look at the
next token... it maybe * such as: 10 - 2 * 3
This is called a "shift / reduce conflict".
 Unless you specify a disambiguating rule (next slide),
Bison chooses "shift" over "reduce".
 Meaning: if it's not clear how to resolve conflicts, wait.

INPUT
10
2
*
ACTION
shift
shift
shift
shift
STACK
10
10 10 - 2
10 - 2 *
INPUT
3
ACTION
reduce
reduce
done
STACK
10 - 6
4
31
The declarations section
The declarations section can contain Bison directives
and C directives.
 We already saw the use of %left, %right, etc.

%{
/* semantic value of tokens, as C datatype */
#define YYSTYPE double
#include <math.h>
/* why do we have to declare these? */
int yylex (void);
void yyerror (char const *);
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
32
Specifying operator order

In the declarations section you can write:
%noassoc NEG
/* unary minus sign: - 3 */
%left '+' '-'
%left '*' '/' '%'
%right '^'




'+' and '-' are left associative and have the same precedence.
'*', '/', and '%' left associative and same precedence; they
have higher precedence (later declarations are higher).
'^' is right associative and higher precedence than + - * / %
'NEG' has no associativity: "- - 3" is an error
33
Scanner function main points

Tokens have a TYPE and a VALUE.

The scanner (yylex) returns the TYPE of the next token

For one-character tokens like '+', '=', "(' the character
itself can be used as the type.

Define symbolic names for token types in Bison using
%token NAME

Return the VALUE of a token using the global variable
yylval.

By default, yylval is type "int". Change it using:
#define YYSTYPE double

Rather than parse numbers yourself, use scanf.
34
Scanner function main points (1)
%{
#include <ctype.h>
#include <math.h>
#define YYSTYPE double
%}
%token
%token
%left
%left
%%
line
expr
NUMBER
SQRT
'+' '-'
'*' '/'
Token values are double
Token for sqrt function
: expr '\n' { printf("%g\n", $$); }
;
: expr * expr { $$ = $1 + $3; }
| SQRT '(' expr ')' { $$ = sqrt( $2 ); }
...
/* more rules */
35
Scanner function main points (2)
%%
int yylex(void) {
int c = getchar( );
while ( c == ' ' || c == '\t' ) c = getchar( );
if ( isdigit(c) ) { ungetc(c,stdin);
scanf("%f", &yylval);
Token is a number
return NUMBER;
}
if ( isalpha(c) ) {
char *word = getword(c); /* get next word */
if ( strcmp(word,"sqrt") ) return SQRT;
...
Token is sqrt function
}
For a more general way of handling identifiers, see the Bison User's
Guide, "mfcalc" example.
36
Handling Multiple Data Types


For a more general grammar, the scanner should be able to return
different data types as the value (yylval).
In definitions section, define "%union" as union of all data types
that yylval (and hence $1, $2, ...) can have.
Define all the data types that
token values can have
For each token type, define the
data type of its value.
For tokens that represent their
own value, don't need to define
data type.
%union {
double number;
char* string;
}
%token <number> NUMBER
%token <string> IDENT
%type <number> expr
%type <number> term
%left '+' '-'
%left '*' '/'
37
Handling Multiple Data Types (2)
%%
int yylex(void) {
int c = getchar( );
while ( c == ' ' || c == '\t' ) c = getchar( );
if ( isdigit(c) ) { ungetc(c,stdin);
double x;
scanf("%f", &x);
token value is a double
yylval.number = x;
return NUMBER;
}
if ( isalpha(c) ) {
char *word = getword(c); /* get next word */
yylval.string = word;
token value is a string
return IDENT;
...
}
38
Using a Separate File for yylex
You can put the scanner (yylex) in a separate file.
 BUT, yylex needs values that are defined by Bison,
such as NUMBER, IDENT, yylval.

Solution: add the "%defines" option to your rules
 Bison will create a header file named
"simple.tab.h".

%defines
%{
#include <math.h>
#define YYSTYPE double
%}
token NUMBER
...
39
Using a Makefile

make is an automatic build facility. Just type "make"

Reads a Makefile of "rules" for how to make things
# Makefile for simple calculator
calc.exe: simple.tab.c yylex.o
gcc -o calc.exe simple.tab.c yylex.o -lm
# how to make simple.tab.c from simple.y
simple.tab.c: simple.y
bison simple.y
# yylex.o (obj. file) depends on simple.tab.h
yylex.o: yylex.c simple.tab.h
gcc –c yylex.c
40
Running Make
C:/calculator>
make
bison -d simple.y
gcc -c simple.tab.c
gcc -o simple.exe simple.tab.o yylex.c -lm
41
Advantages of Make

Makefile record all the dependencies in a project

make only builds the parts that are missing or out of
date

can manage complex projects with multiple Makefiles
in subdirectories
42
Introduction to flex
Flex is a program that automatically creates a scanner
in C, using rules for tokens as regular expressions.
 Format of the input file is like Bison.

%{
/* C definitions for scanner */
%}
flex definitions
%%
rules
%%
user code (extra C code)
43
Flex Example

Read console input and describes each token read.
%{
inline int yywrap(void) { return 1; };
%}
/* flex definitions */
DIGIT
[0-9]
LETTER
[A-Za-z]
%%
{LETTER}+
printf("Word:
%s\n", yytext); return 1;
-?{DIGIT}+ printf("Number: %s\n", yytext); return 2;
[[:punct:]] printf("Punct: %s\n", yytext); return 3;
\n
printf("End of line\n"); return 0;
%% /* main method calls yylex to tokenize input */
int main( ) { printf("Type something: ");
while( yylex() > 0 ) { };
}
44
Flex Example (2)

Run flex to create yylex.c (on Linux: lex.yy.c).
cmd>

Compile lex.yy.c and create myscanner.exe
cmd>

flex mylex.fl
gcc -o myscanner.exe yylex.c
Run the program:
cmd>
myscanner
Type something: hello, it's 9:00 o'clock.
word: hello
punct: ,
word: it
punct: '
45
Flex Explained: definitions
%{
/* C definitions and includes */
#include "myparser.tab.h"
/* this fixes an unknown symbol "yywrap" */
#ifdef yywrap
#undef yywrap
#endif
inline int yywrap(void) { return 0; };
%}
/* flex definitions */
DIGIT
[0-9]
LETTER
[A-Za-z]
%%
46
Flex Explained: Parsing Rules (1)

pattern is a regular expression or a literal value.

action is zero or more C statements to execute.

The current matched input is (char *) yytext

The length of matched input is yyleng
{LETTER}+
-?[0-9]+
pattern
printf("Found a word: %s", yytext);
printf("Found an int: %s", yytext);
action
yytext is a string containing the current token.
47
Flex Explained: Parsing Rules (2)

Keep matching input until a return statement.

Flex chooses the rule that produces longest match.

If there is a tie, choose the first rule that matches.

Use yylval (global variable) for value of token.
%{ /* Bison header file defines INT, IDENT...*/
#include "myparser.tab.h"
%}
%%
[0-9]+
{LETTER}+
"+"|"-"
"="
yylval.itype = atoi(yytext); return INT;
yylval.ctype = yytext; return IDENT;
yylval.ctype = yytext; return OP;
return ASSIGN;
48
Flex yywrap
Flex can read input from multiple input files.
 when flex reaches the end of a file, it calls yywrap()
 if yywrap() returns 0, it means that there is another
input file to read.
 if yywrap() returns 1, it means that there are no more
input files.
%{
inline int yywrap(void) { return 1; };
%}
/* same thing */
%option noyywrap
49
Using flex with bison
1.
2.
3.
4.
5.
Run "bison -d file.y" to create file.tab.h
#include "file.tab.h" in your flex source.
Set yylval to the token value.
Run flex.
Compile and link lex.yy.c and file.tab.c
%{
#include <math.h>
#include "file.tab.h"
%}
LETTER
[A-Za-z] /* same as [[:alpha:]] */
%%
-?[0-9]+
yylval.itype = atoi(yytext); return INT;
{LETTER}+
yylval.ctype = yytext; return IDENT;
"+"|"-"
yylval.ctype = yytext; return OP;
"="
return ASSIGN;
50
Using Flex and Bison
mygrammer.y
mylex.fl
BNF rules for your
grammar.
Regular expressions that
match and return tokens
> bison –d mygrammar.y
> flex mylex.fl
mygrammar.tab.c
lex.yy,c
parser source code
tokenizer function in C
> gcc -o myprog
mygrammar.tab.c lex.yy.c
myprog
executable program
51
Learning Bison, Flex, CUP

On Linux or most UNIX systems type:
info bison
Using "info": SPACEBAR = next page
ALT-V = previous page,

CTRL-C = quit
Documentation on the Web:
GNU Free Software Foundation
http://www.gnu.org/software/bison/bison.html
... or many tutorials online (search Google)
Similar documentation for flex.
 CUP manual: http://www2.cs.tum.edu/projects/cup


includes many easy examples!
52
Getting the Software

bison and flex are included with Linux distributions

MS Windows version of GNU programs
http://gnuwin32.sourceforge.net/
http://gnuwin32.sourceforge.net/packages.html


The packages include excellent user's guides with examples.
CUP:
http://www2.cs.tum.edu/projects/cup/


software (jar file) and excellent manual; links to resources
Jflex (lex that creates java code):
http://jflex.sourceforge.net/
53
GNU Tools and C/C++ compiler

Cygwin: www.cygwin.com

MinGW: www.mingw.org


free GNU toolkit.
Dev-C++

C++ integrated development environment

two downloads: with gcc or without gcc
54