Transcript lec3

Lecture 3
Introduction to JLex: a
lexical analyzer generator for Java
The role of JLex
[Figure: Lscanner.lex → JLex → Lscanner.java → javac → Lscanner.class; at run time, input.L (char stream) → Lscanner.class → token stream (Ltokens…)]
JLex Specifications
user code
%% // must be at the beginning of a line
JLex directives
%% // must be at the beginning of a line
lexical rules
 Each spec file consists of 3 sections, separated by %%
» user code copied to output file
» directives include macro and state definitions, among
others.
» 3rd section contains the rules of lexical analysis, each
of which consists of three parts: an optional state list, a
regular expression, and an action.
The layout of the generated file
%userCode // from 1st section: package, import decls + utility classes
%public class %class [implements %implements] {
%internalCode // from %{ … %} directive
// 2 constructors
%public %class(InputStream is) [throws %initthrow] {
[%initCode] // from %init{ … %init} directive
…} // and %public %class(Reader is) throws … {…}
// main method for requesting the next token
%public %type %function() [throws %yylexthrow] { …
// if eof => return (%eofValue) …
}
// method to be called after eof is encountered
private void yy_do_eof()
[throws %eofthrow]
{
... %eofCode ...
}
}
JLex Directives
1 Internal Code to Lexical Analyzer Class
2 Initialization Code for Lexical Analyzer Class
3 End-of-File Code for Lexical Analyzer Class
4 Macro Definitions
5 State Declarations
6 Character Counting
7 Line Counting
8 Java CUP Compatibility
9 Lexical Analyzer Component Titles
10 Default Token Type: int
11 Default Token Type II: Wrapped Integer
12 YYEOF on End-of-File
13 Newlines and Operating System Compatibility
14 Character Sets
15 Character Format To and From File
16 Exceptions Generated by Lexical Actions
17 Specifying the Return Value on End-of-File
18 Specifying an interface to implement
19 Making the Generated Class Public
Directives for determining the names of
various components of the lexer.
 The name of the generated class (as well as the
file name)
%class className // default is Yylex
 The interface the lexer class would implement
%implements interfaceName
 The name and return type of the method to get
the next token
%function methodName // default is yylex
%type typeName // default is Yytoken
 Make the lexer class public
%public
Directives for position information
 Enabling the counting of character position
%char // private int yychar declared
 Enabling the counting of line information
%line // private int yyline declared
Notes:
1. yychar and yyline are zero-based.
2. yychar is used to record the position of the beginning of
the current token in the input stream.
3. yylength (always enabled) is used to record the length of
the text the current token consumes.
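The zero-based counting in the notes above can be sketched in plain Java. This is a hand-written illustration, not JLex output: the class PositionDemo is hypothetical, tokens are simply whitespace-separated words, and the local variables yyline, yychar and yylength only imitate the meaning the notes give to those names.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionDemo {
    // Returns "text@line:char+length" for every whitespace-separated word.
    static List<String> positions(String input) {
        List<String> out = new ArrayList<>();
        int yyline = 0;                        // zero-based line counter
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                if (c == '\n') yyline++;       // a newline starts the next line
                i++;
                continue;
            }
            int yychar = i;                    // zero-based offset of the token start
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            int yylength = i - yychar;         // number of characters consumed
            out.add(input.substring(yychar, i) + "@" + yyline + ":" + yychar + "+" + yylength);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String p : positions("ab cd\nefg")) System.out.println(p);
    }
}
```

Note that yychar here is an offset into the whole input, not a column within the line, which is how note 2 describes it.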
Java code to be put in various
parts of the generated file
 user code to be put outside the lexer class
[all text from 1st section] // before first %%
 user code to be put inside the lexer class
 user code to be put inside the constructors of the
lexer class
 user code to be put inside the body of the
yy_do_eof() method.
 value to be returned when eof is encountered.
User code to be put inside
the lexer class
 format:
%{ // must be at the beginning of a line
<internal code>
%} // must be at the beginning of a line
 Permits the declaration of variables and methods
inside the generated lexer class.
 Corresponds to the %internalCode region.
User Code to be put inside all
constructors of the lexer class
 format:
%init{ // must be at the beginning of a line
<initCode>
%init} // must be at the beginning of a line
 Corresponds to the %initCode region.
 Exceptions thrown should be declared by the
directive:
%initthrow{
Exception0 , …, ExceptionN
%initthrow} // corresponds to %initthrow region
Directives for Specifying the
input alphabet
%full
%unicode
 default alphabet is ASCII ( 0~127)
 %full => 0~255; %unicode => 0 ~65535.
%ignorecase
 upper case and lower case letters regarded as
the same.
Directives related to eof processing
 Specifying the Return Value on End-of-File
%eofval{
eofValue
%eofval}
 YYEOF on End-of-File
%yyeof
 Notes:
» Enables the declaration public final int YYEOF = -1; in
the lexer class.
» Implied by the directive %integer.
User Code to be executed when
end_of_file is encountered
 format:
%eof{ // must be at the beginning of a line
<eofCode>
%eof} // must be at the beginning of a line
 Corresponds to the %eofCode region.
 Exceptions thrown should be declared by the
directive:
%eofthrow{
Exception0 , …, ExceptionN
%eofthrow} // corresponds to %eofthrow region
Specifying the type of the returned
token
%type typeName
%integer // equivalent to %type int
%intwrap // equivalent to %type java.lang.Integer
Notes:
1. The default type is Yytoken (it needs to be declared
elsewhere, say, in user code).
2. null will be returned for the eof token if the returned
type is not primitive.
3. YYEOF (-1) will be returned for %integer.
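The two end-of-file conventions in the notes lead to two slightly different driver loops. The sketch below uses hypothetical stand-in lexers (IntLexer and ObjLexer are not generated by JLex; they just replay a fixed token list) to show the loop shape for each case.

```java
final class EofDemo {
    static final int YYEOF = -1;   // value the notes give for %integer

    // Hypothetical stand-in for a lexer built with %integer.
    static final class IntLexer {
        private final int[] tokens;
        private int next = 0;
        IntLexer(int... tokens) { this.tokens = tokens; }
        int yylex() { return next < tokens.length ? tokens[next++] : YYEOF; }
    }

    // Hypothetical stand-in for a lexer with a non-primitive token type.
    static final class ObjLexer {
        private final String[] tokens;
        private int next = 0;
        ObjLexer(String... tokens) { this.tokens = tokens; }
        String yylex() { return next < tokens.length ? tokens[next++] : null; }
    }

    // With an int token type, loop until YYEOF comes back.
    static int drain(IntLexer lexer) {
        int n = 0;
        while (lexer.yylex() != YYEOF) n++;
        return n;
    }

    // With an object token type, loop until null comes back.
    static int drain(ObjLexer lexer) {
        int n = 0;
        while (lexer.yylex() != null) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(drain(new IntLexer(3, 4, 5)));
        System.out.println(drain(new ObjLexer("a", "b")));
    }
}
```

A consequence of the int convention is that -1 cannot also be used as an ordinary token code.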
Java CUP Compatibility
%cup
 this directive makes the generated scanner
conform to the java_cup.runtime.Scanner
interface.
 has the same effect as the following three
directives:
%implements java_cup.runtime.Scanner
%function next_token
%type java_cup.runtime.Symbol
Newlines and Operating System
Compatibility
 A new line is represented differently in UNIX- and DOS-based
OSs.
unix => \n
dos => \r\n
 The directive %notunix causes the lexer to
recognize either \r or \n as a new line.
Exceptions Generated by Lexical
Actions
 Format:
%yylexthrow{
Exception0,…,ExceptionN
%yylexthrow}
 Notes:
1. Mapped to the %yylexthrow region.
2. These are the exceptions that may be thrown from within
the action code of lexical rules.
State Declarations
 Format:
%state state0,…, stateN
 Notes:
1. state0,…,stateN must be on the same line.
2. There can be more than one %state declaration.
3. State names should be valid identifiers.
4. Each stateK will be declared as an int constant
in the lexer class.
5. A special state YYINITIAL is implicitly declared
and the lexer begins its analysis in this state.
Macro Definitions
 used to name and define sets of strings for later
use of lexical rules.
 format:
MacroName = MacroDefinition
 Notes:
1. Each macro definition is contained on a single line
2. MacroName should be a valid id
(letter|_)(letter|digit|_)*
3. MacroDefinition should be a valid regular expression
to be defined later.
4. MacroDefinition may contain other macro expansions
in the form {otherMacroName}, but recursion is not
permitted.
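Macro expansion as described in the notes can be sketched as plain text substitution. The class MacroExpander below is hypothetical and makes two simplifying assumptions that are not taken from JLex: a macro may only refer to macros defined before it (which also rules out recursion), and each expansion is wrapped in parentheses so operator precedence is preserved. It does not try to skip {name} occurrences inside quoted strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MacroExpander {
    // Matches a macro reference such as {DIGIT}.
    private static final Pattern REF = Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");
    private final Map<String, String> macros = new LinkedHashMap<>();

    // Defines a macro; references to earlier macros are expanded right away.
    void define(String name, String definition) {
        macros.put(name, expand(definition));
    }

    // Replaces every {name} with the parenthesised definition of name.
    String expand(String regex) {
        Matcher m = REF.matcher(regex);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            out.append(regex, last, m.start()).append('(').append(body).append(')');
            last = m.end();
        }
        return out.append(regex.substring(last)).toString();
    }

    public static void main(String[] args) {
        MacroExpander x = new MacroExpander();
        x.define("ALPHA", "[A-Za-z]");
        x.define("DIGIT", "[0-9]");
        System.out.println(x.expand("{ALPHA}({ALPHA}|{DIGIT}|_)*"));
    }
}
```

Because a definition is expanded when it is stored, a self-reference such as R = a{R} fails immediately as an undefined macro in this sketch.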
Lexical Rules
 Format:
[<state1,…,stateN>] expression { actionCode }
 Notes:
1. All stateKs must have been declared by %state.
2. The rule is active only when the lexer is
in one of the states listed in the state list.
» If the state list is omitted, the rule is always active.
3. The intuitive meaning of the rule is as follows:
» if the lexer is in one of the states in the list and
the substring from the current position
matches the expression, then execute the
actionCode.
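The effect of a state list can be imitated by hand. The sketch below is not generated code; StateDemo is a hypothetical class with an implicit start state and one extra state, and each branch plays the role of one rule that is only tried when the current state is in its list.

```java
import java.util.ArrayList;
import java.util.List;

final class StateDemo {
    static final int YYINITIAL = 0;   // implicit start state
    static final int COMMENT = 1;     // as if declared with %state COMMENT

    // Collects words that appear outside comments (slash-star ... star-slash).
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        int state = YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            if (state == YYINITIAL && input.startsWith("/*", i)) {
                state = COMMENT;              // rule listed for YYINITIAL only
                i += 2;
            } else if (state == COMMENT && input.startsWith("*/", i)) {
                state = YYINITIAL;            // rule listed for COMMENT only
                i += 2;
            } else if (state == YYINITIAL && Character.isLetter(input.charAt(i))) {
                int start = i;                // word rule, inactive inside comments
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                out.add(input.substring(start, i));
            } else {
                i++;                          // rule without a state list: always active
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("ab /* cd */ ef"));
    }
}
```

The word "cd" is skipped because the word rule is not active while the sketch is in the COMMENT state.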
Conflict resolution
 What happens if more than one rule matches
strings from the input?
1. Choose the rule that matches the longest string.
2. If more than one rule matches strings of the
same length, then choose the rule that is given
first in the JLex specification.
 Therefore, rules appearing earlier in the
specification are given a higher priority by the
generated lexer.
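The two tie-breaking steps can be sketched with java.util.regex. This is an illustration only: JLex compiles its rules into a single automaton rather than trying each pattern in turn, and the rule names LT, LE, IF and ID below are made up. The patterns are simple enough that Java's lookingAt() returns the longest match for each of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatch {
    // Rules in specification order: name -> pattern.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LT", Pattern.compile("<"));
        RULES.put("LE", Pattern.compile("<="));
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[A-Za-z][A-Za-z0-9_]*"));
    }

    // Returns "RULE:lexeme" for the winning rule at position pos, or null.
    static String pick(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            // Strictly greater: on a tie the earlier rule keeps its win.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                best = rule.getKey() + ":" + input.substring(pos, m.end());
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pick("<=", 0));    // LE wins by length
        System.out.println(pick("if", 0));    // IF and ID tie; IF is listed first
        System.out.println(pick("iffy", 0));  // ID wins by length
    }
}
```

This is why keyword rules are normally written before the general identifier rule.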
Regular Expressions
 The default alphabet for JLex is the ASCII character set,
meaning character codes between 0 and 127
inclusive.
 Non-newline white space in expressions is not
allowed unless within double quotes " … " or
immediately after \.
 Metacharacters are chars with special meanings
in JLex regular expressions:
? * + | ( ) ^ $ . [ ] { } " \
 Other chars represent themselves.
Escape sequences for characters
\ddd     The character with octal number ddd.
\xdd     The character with hexadecimal number dd.
\udddd   The Unicode character with hexadecimal number dddd.
\b       Backspace
\n       Newline
\t       Tab
\f       Formfeed
\r       Carriage return
\^C      Control character (0~31: \^@, \^A, …, Z, [, \, ], ^, _)
\c       A backslash followed by any other character c matches itself. Ex: \\, \a, \B, \", \', etc.
$        Denotes the end of a line.
.        Matches any character except the newline; equivalent to [^\n].
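The numeric and control-character escapes in the table boil down to small conversions. The helper below is a hypothetical sketch (EscapeDemo is not part of JLex); it takes the digits or the letter that follow the backslash and returns the character code, assuming the 0~31 mapping listed for \^C.

```java
final class EscapeDemo {
    // Decodes the letter of \^C: '@' -> 0, 'A' -> 1, ..., '_' -> 31.
    static int control(char c) {
        if (c < '@' || c > '_') {
            throw new IllegalArgumentException("not a control name: " + c);
        }
        return c - '@';
    }

    // Decodes the digits of \ddd (octal).
    static int octal(String ddd) {
        return Integer.parseInt(ddd, 8);
    }

    // Decodes the digits of \xdd or \udddd (hexadecimal).
    static int hex(String digits) {
        return Integer.parseInt(digits, 16);
    }

    public static void main(String[] args) {
        // \012, \x0A and \^J all name the same character as \n.
        System.out.println(octal("012") == '\n');
        System.out.println(hex("0A") == '\n');
        System.out.println(control('J') == '\n');
    }
}
```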
More on regular expression
 "…aString…" denotes aString.
» Metacharacters in aString lose their meaning and
represent themselves.
» The sequence \" which represents " is the only
exception.
» Ex: "ab d\\\"" stands for ab d\\"
 {name} denote a macro expansion
 E1E2 : concatenation
 E1|E2: choice
 E+ or (E)+ : one or more repetitions of E,
 E* or (E)* : zero or more repetitions of E.
 E? or (E)? : zero or one repetitions of E.
 (E) : (..) is used for grouping.
More on regular expressions
 [...]
» Square brackets denote a class of characters
and match any one character enclosed in the
brackets.
 Substrings inside with special meaning:
» {name} : macro expansion
» a-b : range of characters from a to b.
» "String" means String with metachars losing
special meaning.
» \a means a where a is any character.
» [^Rest] means Σ – [Rest], where Σ is the alphabet.
More on regular expressions
 Ex:
» [a-z] matches a,b,…,z.
» [^0-9] matches any char but 0,1,…,9.
» [\"\\] matches " or \.
» ["a-z"] matches a, - and z.
» [-0-9] matches -,0,..,9.
» how about [\b\f"\r\t"] ?
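The first five examples can be checked against java.util.regex, keeping in mind that Java's class syntax is not the same as JLex's: Java has no quoted strings inside a class, so the JLex class ["a-z"] is written [a\-z] here. The translations are assumptions of this sketch, and the class CharClassDemo is hypothetical.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // True if the single character c belongs to the given Java regex class.
    static boolean in(String javaClass, char c) {
        return Pattern.matches(javaClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(in("[a-z]", 'q'));        // JLex [a-z]
        System.out.println(in("[^0-9]", '5'));       // JLex [^0-9]
        System.out.println(in("[\"\\\\]", '\\'));    // JLex [\"\\]
        System.out.println(in("[a\\-z]", 'q'));      // JLex ["a-z"]: only a, - and z
        System.out.println(in("[-0-9]", '-'));       // JLex [-0-9]
    }
}
```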
Lexical Actions
 format:
{ action }
 Notes: all curly braces contained in action that are not part of
strings or comments should be balanced.
 Actions and Recursion:
 If an action returns no value, the lexical
analyzer searches for the next match in the input
stream and returns the value associated with that
match.
 The lexical analyzer can be made to recur explicitly with a
call to yylex(), as in the following code fragment.
{ ...
return yylex();
... }
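The difference between an action that returns and one that does not can be imitated with a hand-written loop. SkipDemo below is a hypothetical stand-in, not JLex output: digits play the role of a rule whose action returns a token, and every other character plays the role of a rule with an empty action.

```java
final class SkipDemo {
    private final String input;
    private int pos = 0;

    SkipDemo(String input) { this.input = input; }

    // Returns the next run of digits, or null at end of input.
    String yylex() {
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isDigit(c)) {
                int start = pos;
                while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
                return input.substring(start, pos);   // action with a return
            }
            pos++;                                    // action without a return: keep scanning
        }
        return null;
    }

    public static void main(String[] args) {
        SkipDemo lexer = new SkipDemo("  12 x 7");
        for (String t = lexer.yylex(); t != null; t = lexer.yylex()) {
            System.out.println(t);
        }
    }
}
```

Replacing the bare pos++ branch with pos++; return yylex(); would give the explicitly recursive form; the caller sees the same tokens either way.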
More on lexical actions
 State transitions are made by the function call
yybegin(state);
 Available lexical methods / variables:
String yytext()    matched portion of the character input stream
int yylength()     length of yytext()
int yychar;
int yyline;
Performance
Size of Source File   JLex-Generated Lexer Execution Time   Hand-Written Lexer Execution Time
177 lines             0.42 seconds                          0.53 seconds
897 lines             0.98 seconds                          1.28 seconds
 The JLex lexical analyzer soundly outperformed
the hand-written lexer!!
Example
import java.lang.System;
class Sample {
    public static void main(String argv[]) throws java.io.IOException {
        Yylex yy = new Yylex(System.in);
        Yytoken t;
        while ((t = yy.yylex()) != null)
            System.out.println(t);
    }
}
class Utility {
    // NOTE: `assert` has been a reserved word since Java 1.4, so with a modern
    // compiler this method and its callers need a different name.
    public static void assert(boolean expr) {
        if (false == expr) { throw (new Error("Error: Assertion failed.")); }
    }
    private static final String errorMsg[] = {
        "Error: Unmatched end-of-comment punctuation.",
        "Error: Unmatched start-of-comment punctuation.",
        "Error: Unclosed string.",
        "Error: Illegal character." };
    public static final int E_ENDCOMMENT = 0;
    public static final int E_STARTCOMMENT = 1;
    public static final int E_UNCLOSEDSTR = 2;
    public static final int E_UNMATCHED = 3;
    public static void error(int code) { System.out.println(errorMsg[code]); }
}
class Yytoken {
    Yytoken(int index, String text, int line, int charBegin, int charEnd) {
        m_index = index;
        m_text = new String(text);
        m_line = line;
        m_charBegin = charBegin;
        m_charEnd = charEnd;
    }
    public int m_index;
    public String m_text;
    public int m_line;
    public int m_charBegin;
    public int m_charEnd;
    public String toString() {
        return "Token #" + m_index + ": " + m_text + " (line " + m_line + ")";
    }
}
%%
%{
private int comment_count = 0;
%}
%line
%char
%state COMMENT
ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT= (\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
%%
<YYINITIAL> "," { return (new Yytoken(0,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":" { return (new Yytoken(1,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ";" { return (new Yytoken(2,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "(" { return (new Yytoken(3,yytext(),yyline,yychar,yychar+1)); }
…
<YYINITIAL> "<>" { return (new Yytoken(15,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "<" { return (new Yytoken(16,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "<=" { return (new Yytoken(17,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "|" { return (new Yytoken(21,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":=" { return (new Yytoken(22,yytext(),yyline,yychar,yychar+2)); }
<YYINITIAL> {NONNEWLINE_WHITE_SPACE_CHAR}+ { }
<YYINITIAL,COMMENT> \n { }
<YYINITIAL> "/*" { yybegin(COMMENT);
comment_count = comment_count + 1; }
<COMMENT> "/*" { comment_count = comment_count + 1; }
<COMMENT> "*/" {
comment_count = comment_count - 1;
Utility.assert(comment_count >= 0);
if (comment_count == 0) {yybegin(YYINITIAL);}}
<COMMENT> {COMMENT_TEXT} { }
<YYINITIAL> \"{STRING_TEXT}\" {
String str = yytext().substring(1,yytext().length() - 1);
Utility.assert(str.length() == yytext().length() - 2);
return (new Yytoken(40,str,yyline,yychar,yychar + str.length())); }
<YYINITIAL> \"{STRING_TEXT} {
String str = yytext().substring(1,yytext().length());
Utility.error(Utility.E_UNCLOSEDSTR);
Utility.assert(str.length() == yytext().length() - 1);
return (new Yytoken(41,str,yyline,yychar,yychar + str.length()));}
<YYINITIAL> {DIGIT}+ {
return (new Yytoken(42,yytext(),yyline,yychar,yychar +
yytext().length()));}
<YYINITIAL> {ALPHA}({ALPHA}|{DIGIT}|_)* {
return (new Yytoken(43,yytext(),yyline,yychar,yychar +
yytext().length())); }
<YYINITIAL,COMMENT> . {
System.out.println("Illegal character: <" + yytext() + ">");
Utility.error(Utility.E_UNMATCHED);}