Transcript 02_Syntax

Syntax
The Grammar of a Language
1
Topics to Know
- Difference between syntax and semantics.
- Four categories of languages.
- Regular expressions
  - used for pattern matching and extracting info
  - regular expressions are an essential part of many programming languages ... memorize them!
- BNF and EBNF
  - used to describe the grammar of computer languages
  - can be used to automatically generate a parser for a language

2
Syntax and Semantics

- The syntax of a language defines the valid symbols and grammar.
- Syntax defines the structure of a program, i.e., the form that each program unit and each statement must use.
- The semantics defines the meaning of the grammar elements.
- Lexical structure is the form of the lowest-level syntactic units (words or tokens) of a grammar.
3
Syntax and Semantics Compared

- Syntax: in Java, an assignment statement is:
    identifier = expression { operator expression } ;
- Semantics: an assignment statement must use compatible types, e.g.
    int n1, n2;
    n1 = 20*1024;   // OK, int_var = int_expression
    n2 = 3.50;      // illegal, incompatible types
- Lexical elements (tokens):  "n2" "=" "3.50" ";"
4
Syntax and Semantics Compared

- Syntax: the form of a while statement is:
    while ( boolean_expression ) statement ;
- Semantics: when a thread of execution encounters a while statement, the boolean expression is tested. If the expression evaluates to true, then the statement is executed and the process is repeated. If [not when] the expression evaluates to false, execution continues to the next statement. ...
5
How are they used?
Parts of a Compiler / Interpreter:

  Program Source Code
    -> Tokenizer (Lexical Analysis)      -> token stream
    -> Parser (Syntax Analysis)          -> parse tree
    -> Semantic Analysis                 -> intermediate code
    -> Optimization and Code Generation  -> object code
6
Scanning and Parsing
source file -> input stream -> Tokenizer -> tokens -> Parser -> parse tree

Example input:              sum = x1 + x2;
Tokenizer output (tokens):  sum   =   x1   +   x2   ;
Parser output (parse tree): an assignment with left side  sum
                            and right side the expression  x1 + x2
7
Scanners

- Recognize regular expressions.
- Implemented as finite automata (finite state machines).
- Typically contain a loop that cycles through characters, building tokens and associated values by repeated operations. (A sketch of such a loop follows below.)
- The scanner may be integrated as a function in the parser.
- The parser calls the scanner to get the next token.
8
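The character-cycling loop described above can be sketched directly in Java. This is only a rough illustration, not code from the slides: the class name TinyScanner, the token categories, and the single-character handling of operators are assumptions; a real scanner would also classify tokens and apply longest match to multi-character operators such as +=.

  import java.util.ArrayList;
  import java.util.List;

  /** A minimal hand-written scanner sketch: loop over characters,
      grouping them into identifier, number, and operator/separator tokens. */
  public class TinyScanner {
      public static List<String> tokenize(String input) {
          List<String> tokens = new ArrayList<>();
          int i = 0;
          while (i < input.length()) {
              char c = input.charAt(i);
              if (Character.isWhitespace(c)) {                 // skip white space
                  i++;
              } else if (Character.isLetter(c) || c == '_') {  // identifier (longest match)
                  int start = i;
                  while (i < input.length()
                          && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) i++;
                  tokens.add(input.substring(start, i));
              } else if (Character.isDigit(c)) {               // number (longest match)
                  int start = i;
                  while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                  tokens.add(input.substring(start, i));
              } else {                                         // single-char operator or separator
                  tokens.add(String.valueOf(c));
                  i++;
              }
          }
          return tokens;
      }

      public static void main(String[] args) {
          System.out.println(tokenize("sum = x1 + x2;"));  // [sum, =, x1, +, x2, ;]
      }
  }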
Parsers

- Recognize patterns defined by grammar rules.
- Implemented as pushdown automata.
- Convert a stream of tokens (supplied by the scanner) into a parse tree containing symbols defined in the grammar.
- Symbols are things like "assignment", "expression".
- Parsing is more difficult than scanning.
9
Formal Languages

Famed linguist Noam Chomsky introduced a formal classification of (human) languages. In terms of computing theory, his categories are:

  Hierarchy   Grammars            Languages                Minimal Automaton
  Type 0      unrestricted        Recursively enumerable   Turing machine
  --          --                  Recursive                Decider
  Type 1      context-sensitive   Context-sensitive        Linear-bounded automaton
  Type 2      context-free        Context-free             Pushdown automaton
  Type 3      regular             Regular                  Deterministic finite automaton

Each class in this hierarchy is a subset of the class above it.
A context-free grammar is a grammar where the syntax of each constituent is independent of the symbols that come before and after it.
10
Applying Formal Languages (1)
Tokenizer or Scanner (Lexical Analysis):
- The lexemes (tokens) of a computer language form a "regular grammar" (Type 3).
- Therefore, we can use the simplest grammar processor to make a tokenizer.
- Rules for tokens are defined using regular expressions. Examples:

    integer    ::= "[+-]?[0-9]+"                  (actually, this is too simple)
    integer    ::= "[+-]?(0[0-7]*|[1-9][0-9]*)"   (better)
    identifier ::= "[A-Za-z_][A-Za-z_0-9]*"

  (A quick check of these patterns in Java follows below.)
11
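A small, hedged check of the token patterns above using Java's String.matches; the class name IntegerTokenDemo is made up for illustration.

  public class IntegerTokenDemo {
      public static void main(String[] args) {
          String integer = "[+-]?(0[0-7]*|[1-9][0-9]*)";   // the "better" integer pattern above
          System.out.println("42".matches(integer));       // true
          System.out.println("-017".matches(integer));     // true  (an octal constant)
          System.out.println("09".matches(integer));       // false (9 is not an octal digit)
          System.out.println("_sum1".matches("[A-Za-z_][A-Za-z_0-9]*"));  // true: an identifier
      }
  }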
Applying Formal Languages (2)
Syntax Analysis
- The syntax of a computer language is a "Context-free Grammar" ... almost.
- We can use a "type 2" grammar processor as a parser.
- Rules are defined in Backus-Naur Form and Extended BNF. Example:

    expression ::= expression + term
                 | expression - term
                 | term
    term       ::= term * factor
                 | term / factor
                 | factor
    factor     ::= ( expression )
                 | NUMBER
12
Lexical Structure

- Lexemes are the smallest lexical units of a language, grouped according to syntactic usage. Some types of lexemes in computer languages are:
    identifiers:        x, println, _INIT, ArrayList
    numeric constants:  0, 10000, 2.98E+6
    operators:          =, +, -, ++, +=, *, /
    separators:         [ ] ; : . , ( )
    string literals:    "hello there"
- A token is a string representing the value of a lexeme.
- Lexemes are recognized by the first phase of a translator -- the scanner -- which deals directly with the input. The scanner separates the input into tokens.
- Scanners are also called lexers.

13
Tokens
Tokens are the strings of syntactic units.
 Example: what are the tokens in this statement?
result = (sum - average)/count;
Tokens:
  result    identifier
  =         assignment operator
  (         expression delimiter
  sum       identifier
  -         arithmetic operator
  average   identifier
  )         expression delimiter
  /         arithmetic operator
  count     identifier
  ;         semi-colon (statement delimiter)
14
C tokens

Lexical structure of C, defined in The C Programming
Language...
“There are six classes of tokens: identifiers, keywords, constants,
string literals, operators, and other separators. Blanks, horizontal
and vertical tabs, newlines, formfeeds, and comments as
described below (collectively, "white space") are ignored except
as they separate tokens. Some white space is required to
separate otherwise adjacent identifiers, keywords, and constants.
If the input stream has been separated into tokens up to a given
character, the next token is the longest string of characters that
could constitute a token.” [Kernighan and Ritchie, The C
Programming Language, 2nd Ed., pp. 191-192.]

C uses the principle of longest match (substring).
15
Principle of Longest Match
A token should be the longest string that satisfies a rule for lexemes.
- Example:  x2+=1.0
    tokens:  "x2" "+=" "1.0"
    NOT:     "x" "2" "+" "=" "1" "." "0"

- What are the tokens in these inputs?
    x = y+1;
    x = y+=1;                         // tokenizer cannot rely on optional spaces
    x = y == 1;
    if ( x++ = 1 ) getFirstValue( );  // probably an error

- The tokenizer should not use context or look ahead more than 1 character. (FORTRAN is an exception.)
16
Describing Lexical Structure

- In C, an identifier (i) begins with a letter or _, (ii) followed by any number of letters, '_', or digits.
- Compare these two examples: which way is simpler?

Rules for C identifiers using EBNF:
  identifier ::= ( letter | _ ) { letter | _ | digit }
  letter     ::= 'A' | 'B' | 'C' | 'D' | . . . | 'Z'
               | 'a' | 'b' | 'c' | 'd' | . . . | 'z'
  digit      ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Rule for C identifiers using a regular expression:
  identifier ::= [A-Za-z_][A-Za-z_0-9]*
  identifier ::= [A-Za-z_]\w*
17
Rules for Regular Expressions

Regular expressions:
  x            match an occurrence of x
  [abcd]       match any one of these characters
  [A-Z]        any one character from this range
  [A-Za-z_]    any letter or _
  x*  [a-z]*   0 or more occurrences of these
  x+  [a-z]+   1 or more occurrences of these
  x?  [a-z]?   exactly 0 or 1 occurrence
  x{5}         match exactly 5 times, same as xxxxx
  x{4,8}       match between 4 and 8 times
  .            (period) match any one character
  .*           match anything!
18
Pattern Matching in Java
java.util.regex.Matcher
  matches Strings using regular expressions (patterns);
  can be used to extract the thing that is matched
java.util.regex.Pattern
  defines objects for regular expressions
String.matches( regex )
  tests whether the string value matches a regular expression
String.split( regex [, maxelements ] )
  splits a string every place the regex is found
(A short example using these classes follows below.)
19
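A minimal sketch showing how these pieces fit together; the input strings and the class name RegexDemo are made up for illustration.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class RegexDemo {
      public static void main(String[] args) {
          // Compile the pattern once, then apply it to an input string.
          Pattern zip = Pattern.compile("\\d{5}");
          Matcher m = zip.matcher("Bangkok 10900, Thailand");
          if (m.find()) {
              System.out.println("zip code: " + m.group());       // prints "zip code: 10900"
          }

          // The convenience methods on String use the same regex engine.
          System.out.println("10900".matches("\\d{5}"));           // true
          System.out.println(String.join("|", "a,b,,c".split(","))); // a|b||c
      }
  }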
Using String matches( )
Java's String matches( ) method and the Matcher class:
  "hi".matches("[a-z]+")       true
  "you2".matches("[a-z]+")     false
  "9".matches("\\d")           true   (\d means [0-9])
  "4@com".matches(".+@.+")     true   (anything @ anything)
  "away".matches("a.*y")       true
  "ay".matches("a.*y")         true
  "ay".matches("a.+y")         false  (.+ requires at least 1 character)
  "123".matches("-?\\d+")      true
  "-123".matches("-?\\d+")     true
  "--12".matches("-?\\d+")     false
20
Using String split( )
Split a string into pieces any place white space is found:
  String s = "split me\tat    white space ";
  String [] word = s.split("\\s+");
  // word[0]="split", word[1]="me", word[2]="at", word[3]="white", word[4]="space"
21
Character classes and special chars
\n \t \r \f \e   newline, tab, return, formfeed, escape
\x4E             character with hexadecimal value 4E
\u1234           character with Unicode value 1234
[^abcd]          any character NOT (^) in this set; ^ must be the FIRST CHAR after "["

\d               any digit, same meaning as [0-9]
\D               any non-digit, same as [^0-9]
\s               any white space, [ \t\n\r\f\x08]
\S               any non-whitespace
\w               any word character, [a-zA-Z0-9_]
\W               any non-word character, [^a-zA-Z0-9_]

( ... )          pattern group
22
Examples using Character Classes

- Match a valid student ID for this class:
    4[6-9]\d{6}
- A C identifier: begins with a letter or underscore, followed by any number of letters, digits, or underscores:
    [A-Za-z_]\w*
- Hello in Thai:
    \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35
(A small Java check of the first two patterns follows below.)
23
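A small, hedged Java check of the first two patterns above (remember that backslashes must be doubled inside a Java string literal; the class name is illustrative).

  public class CharClassDemo {
      public static void main(String[] args) {
          System.out.println("47541234".matches("4[6-9]\\d{6}"));   // true: a valid student ID
          System.out.println("45541234".matches("4[6-9]\\d{6}"));   // false: second digit not 6-9
          System.out.println("_count1".matches("[A-Za-z_]\\w*"));   // true: a valid C identifier
          System.out.println("2fast".matches("[A-Za-z_]\\w*"));     // false: cannot start with a digit
      }
  }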
Positional Matching
^    match the beginning of a line
$    match the end of a line
\b   match a word boundary

Match a student ID at the beginning of the string, as a word by itself:
  Pattern p = Pattern.compile("^\\d{8}\\b.*");
  p.matcher("47541234 Joe Hacker").matches();  // match
  p.matcher("123456789 too long").matches();   // no match
  p.matcher("My ID is 48541234").matches();    // no match
24
Regular Expression for the Time
Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12-hour clock. Accept "am", "pm", "AM", "PM".
Examples:  3:38:09 am,   9:59:59 AM,   12:47:38 PM.
25
Regular Expression for the Time
Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12-hour clock. "am", "pm" can be uppercase or lowercase.
Example:  3:38:09 am,   12:47:38 PM.
- Don't allow nonsense like 33:82:61 am:
    1?[0-9]:[0-5][0-9]:[0-5][0-9] +[AaPp][Mm]
- Another way, using a group and repetition:
    1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]
- If the seconds are optional (8:33 am), then use:
    1?[0-9](:[0-5][0-9]){1,2} +[AaPp][Mm]
(A small Java test of the second pattern follows below.)
26
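A quick test of the second pattern above in Java; the sample strings and the class name TimePatternDemo are made up for illustration.

  import java.util.regex.Pattern;

  public class TimePatternDemo {
      // The group-and-repetition pattern from the slide.
      static final Pattern TIME = Pattern.compile("1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]");

      public static void main(String[] args) {
          String[] samples = { "3:38:09 am", "12:47:38 PM", "33:82:61 am" };
          for (String s : samples) {
              System.out.println(s + " -> " + TIME.matcher(s).matches());
          }
          // prints: 3:38:09 am -> true, 12:47:38 PM -> true, 33:82:61 am -> false
      }
  }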
More Matching in Java
Examples using \w, \d:
  "hello".matches("\\w+")     true
  "you2!".matches("\\w+")     false
  "10900".matches("\\d{5}")   true: match a zipcode

String "split":  String [ ] split( regular_expression )
  String s = "I like    java";
  String [] w = s.split("\\s+");
  returns:  w[0] = "I",  w[1] = "like",  w[2] = "java"
27
Pattern Extraction
Using a Matcher object, you can also find the position of a match, and extract the matched string.

  import java.util.regex.*;
  ...
  Pattern pattern = Pattern.compile( "4754\\d\\d\\d\\d" );
  System.out.println("enter input line to scan: ");
  String text = console.readLine( );
  Matcher matcher = pattern.matcher( text );
  while ( matcher.find() ) {               // find the next match
      String id = matcher.group( );        // get the matched string
      System.out.println("found " + id + " at index " + matcher.start());
  }
28
Groups and Pattern Extraction
( expression )   the parentheses define a group that you want to re-use or extract.
\n               refers to the n-th group matched using ( ).

What strings does this pattern match?  (try the sketch below)
  s.matches( "^(\\w+).*\\1$" )
29
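To try out the question above: roughly, the pattern matches strings that start with one or more word characters and end with that same captured sequence. A small sketch; the class name and sample strings are made up.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class GroupDemo {
      public static void main(String[] args) {
          // The group captures the leading word characters; \1 refers back to them.
          Pattern p = Pattern.compile("^(\\w+).*\\1$");
          for (String s : new String[]{ "hello big hello", "echo echo", "one two three" }) {
              Matcher m = p.matcher(s);
              if (m.matches()) {
                  System.out.println(s + " : starts and ends with \"" + m.group(1) + "\"");
              } else {
                  System.out.println(s + " : no match");
              }
          }
      }
  }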
NOT Regular Expressions
Don't confuse regular expressions with BNF / EBNF notation (regular expressions are not part of EBNF). Unfortunately, many sources use a hybrid of EBNF and regular expressions.
- In regular expressions, ( ... ) is used for grouping, not for a list of choices.

Wrong: bogus notation (looks more like EBNF):
  identifier ::= ([A-Z][a-z]_ )([A-Z][a-z][0-9]_)*
Here, ( ) is being used to mean "any one of these characters", as if (abc) meant "a" or "b" or "c".
30
Why Use Regular Expressions?

- Can be directly translated into source code for a tokenizer.
- Shorter than [E]BNF.
- Many applications and many languages use them.

How to Learn Regular Expressions
- Search the web -- many tutorials.
- Core Java, p. 698-702. Java regular expressions are not exactly the same as the syntax in C, Perl, or flex.
- The Java API for the "Pattern" class.
- A Perl or flex book -- they define the regex syntax for those languages.

31
Practice

Write a lexical description (using a regular expression)
for:
 base 10 constants (1234, -1234)
 octal constants (0377)
 hexadecimal constants (0x2FA84D, 0Xeeef)
 floating point constants, with optional exponent

Write a Java method to find and extract all the words
in a string; a word is a sequence of letters delimited by
a non-letter or the begin/end of a line in the string.

Write a Java method to remove /* ... */ comments from
a string.
32
Types of lexemes

Common Lexemes (classes of tokens)
identifiers: x, println, _INIT, ArrayList
numeric constants: 0, 10000, 2.98E+6
assignment operators: =, +=, -=, *=, /=, %=
arithmetic operators: *, /, +, -, %
boolean operators: &&, ||, ^, !
separators: [ ] ; : . , ( )
string literals: "hello there"
Defining many lexeme classes makes the syntactic grammar more precise.
- Reserved words: may be defined as a class, or simply treated as identifiers at the lexical level.

33
White space and comments
“Internal” tokens of the scanner that are matched and discarded.
- Typical white space: newlines, tabs, spaces
- Comments:
    /* … */ and // … \n   (C, C++, C#, Java)
    # … \n                (Perl, Unix shells)
    (* … *)               (Pascal, ML)
    ; … \n                (Scheme)
- Comments are generally not nested.
- Comments & white space are ignored except that they serve as separators of tokens.

34
FORTRAN is an exception

No reserved words:
  REAL IF, THEN
  IF (THEN .GT. 0) IF = THEN

The compiler ignores spaces (spaces are removed before tokenizing):
  SUM=0.0
  S U M = 0 . 0       (same as SUM=0.0)
  DO 99 I = 1,10      (loop: for i := 1 to 10)
  DO 99 I = 1.10      (assignment: DO99I = 1.10)

- This means that the parser must "look ahead" to identify the syntax.
- Lesson: don't remove white space before tokenizing.
35
Reserved words versus key words
Pascal: uses key words such as "integer", "real".
  var
     n: integer;      { "integer" has special meaning in this context }
     integer: real;   { no special meaning here, you can redefine it }
  begin
     integer := 0.5;

C: uses reserved words, such as "int", "float", "return". Reserved words may not be redefined in a program.
  int n;
  float int;          /* Illegal! "int" is reserved. */

Reserved words are easier than key words for a scanner to recognize, and easier for people to read.
36
Predefined identifiers

Predefined identifiers have special meanings, but can be
redefined (although they probably shouldn’t).

Examples of predefined identifiers in Java: String, Object, System, null
- in Java, you can define your own String or Object class

Predefined Identifiers are not Reserved Words

Reserved words cannot be used as the name of
anything (i.e., as an identifier) except itself.
37
Java "keywords" (reserved words)
abstract    assert      boolean     break       byte        case
catch       char        class       const       continue    default
do          double      else        enum        extends     final
finally     float       for         if          goto        implements
import      instanceof  int         interface   long        native
new         package     private     protected   public      return
short       static      strictfp    super       switch      synchronized
this        throw       throws      transient   try         void
volatile    while
The Java Language Specification calls these "key words".
38
Java reserved words (cont.)

The words const and goto are reserved, even though
they are not used in the Java language.

Why (do you think) Java reserves "goto" and "const" ?
39
Java reserved words (cont.)
foreach : many languages have a "foreach" statement. In C#:
double [ ] data = new double[100]; ...
foreach( double x in data ) { sum += x; }
Java 5.0 defines a new syntax of "for" to do this:
for(double x : data ) sum += x;
Q: Why did Java use "for" instead of defining a "foreach" ?
What is the disadvantage of defining "foreach(var in collection)"?
40
Java reserved words (cont.)

true and false aren't listed as "keywords" in the language spec. The spec calls them boolean literals (sect. 3.10.3). Similarly, null is the null literal (sect. 3.10.7).
In actuality they are reserved words! These examples prove it (the compiler gives an error message):

  /* Encapsulate a constant :-) */
  public static boolean true( )  { return true; }
  public static boolean false( ) { return false; }

  /* If true and false are mere constants, we should
     be allowed to redefine the names locally. */
  public void illogical( ) {
      boolean false = (1==1);   // false = true
      boolean true  = !false;   // true = false
  }
41
Categories of Grammar Rules

Declarations or definitions:
  AttributeDeclaration ::= [ final ] [ static ] [ access ] datatype identifier [ = expression ]
                           { , identifier [ = expression ] } ;
  access ::= 'public' | 'protected' | 'private'

Statements:
  assignment, if, for, while, do_while

Expressions, such as the examples in these slides.

Structures such as statement blocks, methods, and entire classes:
  StatementBlock ::= '{' { Statement; } '}'

(An illustrative Java declaration matching the attribute rule follows below.)
42
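As a concrete illustration of the AttributeDeclaration rule above (the class name Example is made up, and Java actually allows the modifiers in any order):

  public class Example {
      // [ final ] [ static ] [ access ] datatype identifier [ = expression ] { , identifier [ = expression ] } ;
      final static private int MAX = 100, MIN = 0;
  }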
Parsing Algorithms (1)
Broadly divided into LL and LR.
- LL algorithms match input directly to left-side symbols, then choose a right-side production that matches the tokens. This is top-down parsing.
- LR algorithms try to match tokens to the right-side productions, then replace groups of tokens with the left-side nonterminal. They continue until the entire input has been "reduced" to the start symbol.
- LALR (look-ahead LR) parsers are a special case of LR; they require a few restrictions to the LR case.
- Reference: Sebesta, sections 4.3 - 4.5.

43
Parsing Algorithms (2)

Look ahead:
 algorithms must look at next token(s) to decide
between alternate productions for current tokens
 LALR(1) means LALR with 1 token look-ahead
 LL(1) means LL with 1 token look-ahead

LL algorithms are simpler and easier to visualize.

LR algorithms are more powerful: can parse some
grammars that LL cannot, such as left recursion.

yacc, bison, and CUP generate LALR(1) parsers

Recursive-descent is a useful LL algorithm that "every
computer professional should know" [Louden].
44
Top-down Parsing Example
For the input:  z = (2*x + 5)*y - 7;
tokens:  ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

Grammar rules (as before):
  assignment => ID = expression ;
  expression => expression + term
              | expression - term
              | term
  term       => term * factor
              | term / factor
              | factor
  factor     => ( expression )
              | ID
              | NUMBER
45
Top-down Parsing Example (2)
The top-down parser tries to match input to left sides. In the example, the highlighted part (green on the original slide) is the part matched to the input so far.

Input:  ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

  assignment
  ID = expression ;
  ID = expression - term ;
  ID = term - term ;
  ID = term * factor - term ;
  ID = factor * factor - term ;
  ID = ( expression ) * factor - term ;
  ID = ( expression + term ) * factor - term ;
  ID = ( term + term ) * factor - term ;
  ID = ( term * factor + term ) * factor - term ;
  ID = ( factor * ID + factor ) * factor - term ;
  ID = ( NUMBER * ID + NUMBER ) * factor - term ;
  ID = ( NUMBER * ID + NUMBER ) * ID - factor ;
  ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
46
Top-down Parsing Example (3)

- Problem in the example: we had to look ahead many tokens in order to know which production to use.
- This isn't necessary, provided that we know the grammar is parsable using LL (top-down) methods.
- There are conditions on the grammar that we can test to verify this. (See: The Parsing Problem.)
- Later we will study the recursive-descent algorithm, which does top-down parsing with minimal look-ahead. (A sketch of a recursive-descent parser for the expression grammar follows below.)
47
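As a preview, here is a minimal recursive-descent sketch for the expression grammar, not taken from the slides; the class and method names are made up. The left-recursive rules are rewritten in iterative (EBNF) form, since LL methods cannot handle left recursion directly, and the sketch evaluates as it parses instead of building a parse tree.

  /** Recursive-descent sketch for:
   *    expression ::= term { (+|-) term }
   *    term       ::= factor { (*|/) factor }
   *    factor     ::= ( expression ) | NUMBER
   */
  public class ExprParser {
      private final String input;
      private int pos = 0;

      public ExprParser(String input) { this.input = input.replaceAll("\\s+", ""); }

      public int parse() {
          int value = expression();
          if (pos != input.length()) throw new RuntimeException("unexpected '" + input.charAt(pos) + "'");
          return value;
      }

      private int expression() {              // term { (+|-) term }
          int value = term();
          while (peek() == '+' || peek() == '-') {
              char op = next();
              value = (op == '+') ? value + term() : value - term();
          }
          return value;
      }

      private int term() {                    // factor { (*|/) factor }
          int value = factor();
          while (peek() == '*' || peek() == '/') {
              char op = next();
              value = (op == '*') ? value * factor() : value / factor();
          }
          return value;
      }

      private int factor() {                  // ( expression ) | NUMBER
          if (peek() == '(') {
              next();                         // consume '('
              int value = expression();
              if (next() != ')') throw new RuntimeException("expected ')'");
              return value;
          }
          int start = pos;
          while (Character.isDigit(peek())) pos++;
          if (start == pos) throw new RuntimeException("expected a number");
          return Integer.parseInt(input.substring(start, pos));
      }

      private char peek() { return pos < input.length() ? input.charAt(pos) : '\0'; }
      private char next() { return input.charAt(pos++); }

      public static void main(String[] args) {
          System.out.println(new ExprParser("(2*6 + 5)*3 - 7").parse());  // prints 44
      }
  }

Note that each choice (e.g., between '(' and a digit in factor) is made by peeking at a single character, which is exactly the 1-token look-ahead discussed next.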
The Parsing Problem
48
The Parsing Problem

- Top-down parsers must decide which production to use based on the current symbol, and perhaps "peeking" at the next symbol (or two...).
- Predictive parser: a parser that bases its actions on the next available token (called single-symbol look-ahead).
- Two conditions are necessary [see Louden, p. 108-110]:
- The first condition is the ability to choose between multiple alternatives, such as:  A → α1 | α2 | ... | αn
  - Define First(α) = the set of all tokens that can be the first token for any production cascade that produces the symbol α.
  - Then a predictive parser can be used for rule A if First(α1) ∩ First(α2) ∩ ... ∩ First(αn) is empty.
  (A tiny check of this condition for the factor rule follows below.)
49
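A toy illustration of the disjointness condition, with the First sets for the two alternatives of factor written out by hand (nothing here computes them from the grammar):

  import java.util.Collections;
  import java.util.Set;

  public class PredictiveCheck {
      public static void main(String[] args) {
          // First sets for the two alternatives of:  factor ::= ( expression ) | NUMBER
          Set<String> firstParen  = Set.of("(");
          Set<String> firstNumber = Set.of("NUMBER");

          // The alternatives can be chosen with 1-token look-ahead if their First sets are disjoint.
          System.out.println(Collections.disjoint(firstParen, firstNumber));  // true
      }
  }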
The Parsing Problem (cont.)


- The second condition is the ability of the parser to detect the presence of an optional element, such as A → α [ β ]. Can the parser detect for certain whether β is present?
  Example: list → expr [ list ].  How do we know that list isn't part of expr?
- Define Follow(A) = the set of all tokens that can follow the nonterminal A in some production. Use a special symbol ($) to represent the end of input if A can be the end of the input.
  Example: Follow( factor ) = { +, -, *, /, ), $ } while Follow( term ) = { *, /, ), $ }
- Then a predictive parser can detect the presence of the optional symbol β if First( β ) ∩ Follow( β ) is empty.
50
Review and Thought Questions
51
Lexics vs. Syntax vs. Semantics
Division between lexical and syntactic structure is not fixed:
- a number can be a token or defined by a grammar rule.
- Implementation can often decide:
  - scanners are faster
  - parsers are more flexible
  - error checking of number format as a regex is simpler
Division between syntax and semantics is not fixed:
- we could define separate rules for IntegerNumber and FloatingPtNumber, IntegerTerm, FloatingPtTerm, ... in order to specify which mixed-mode operations are allowed,
- or specify as part of semantics.
52
Numbers: Scan or Parse?
We can construct numbers from digits using the scanner or the parser. Which is easier / better?

- Scanner: define numbers as tokens:
    number ::= [-]?\d+
- Parser: grammar rules define numbers (digits are tokens):
    number         => '-' unsignednumber | unsignednumber
    unsignednumber => unsignednumber digit | digit
    digit          => 0|1|2|3|4|5|6|7|8|9
53
Is Java 'Class' grammar context-free?

- A class may have static and instance attributes.
- An inner class or local class has the same syntax as a top-level class, but:
  - may not contain static members (except static constants)
  - an inner class may access the outer class using OuterClass.this
  - a local class cannot be "public"
- Does this mean the syntax for a class depends on context?
54
Alternative operator notation

- Some languages use prefix notation: the operator comes first.
    expr => + expr expr | * expr expr | NUMBER
- Examples:
    * + 2 3 4   means (2 + 3) * 4
    + 2 * 3 4   means 2 + (3 * 4)
- Using prefix notation, we don't have to worry about the precedence of different operators in the BNF rules!
(A small evaluator sketch follows below.)
55
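A minimal sketch of a prefix-notation evaluator for the rule above, showing that no precedence handling is needed; the class name and use of java.util.Scanner for tokenizing are assumptions, not from the slides.

  import java.util.Scanner;

  /** Evaluate prefix notation (expr => + expr expr | * expr expr | NUMBER) by recursion. */
  public class PrefixEval {
      private final Scanner tokens;

      public PrefixEval(String input) { this.tokens = new Scanner(input); }

      public int expr() {
          String t = tokens.next();          // the operator (or number) always comes first
          switch (t) {
              case "+": return expr() + expr();
              case "*": return expr() * expr();
              default:  return Integer.parseInt(t);   // NUMBER
          }
      }

      public static void main(String[] args) {
          System.out.println(new PrefixEval("* + 2 3 4").expr());  // (2 + 3) * 4 = 20
          System.out.println(new PrefixEval("+ 2 * 3 4").expr());  // 2 + (3 * 4) = 14
      }
  }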