Python supplement on regular expressions and parser generators

Download Report

Transcript Python supplement on regular expressions and parser generators

Notes on Python Regular Expressions
and parser generators (by D. Parson)
• These are the Python supplements to the author’s
slides for Chapter 1 and Section 2.1.
• http://faculty.kutztown.edu/parson/spring2010/CS
C310Spring2010.html has a link to the author’s
slides, which are password protected by your K.U.
Windows login / password used to access your
student account.
Regular Expressions in Python
•
•
•
•
re module in the optional Python text.
http://docs.python.org/library/re.html
A RE is a pattern in the form of a string.
compile(pattern [, flags]) compiles an RE
expression into a finite automaton object.
• Return value can be used by other functions.
• Flags are for case, multiline, and meta-character options.
• search(pattern, string [, flags) searches string for
the first match of pattern.
• match(pattern, string [, flags) checks at string’s
beginning.
• Both return a MatchObject or None.
Regular Expressions in Python
• split(pattern, string [, maxsplit = 0]) splits string
into occurrences of pattern.
• Returns a list of strings
• sub(pattern, repl, string [, count = 0]) performs
substitutions of repl for pattern occurrences.
• String and sequence operations are related.
• http://docs.python.org/library/string.html
• >>> s = "abcde"
• >>> dir(s)
•
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__',
'__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
'__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode',
'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Python Regular Expression Examples
• >>> m1 = search('a+z*(b.d)', 'abcdefghi')
• >>> m1
• <_sre.SRE_Match object at 0x11c520>
• >>> m1.groups()
• ('bcd',)
• >>> m1.start()
• 0
• >>> m1.end()
•
•
•
•
•
4
>>> m1.start(0)
0
>>> m1.start(1)
1 # Group 0 is the entire match, 1 is the first parenthesized
subexpression, etc.
Learn the major Meta-characters!
•
•
•
•
•
•
•
•
•
•
•
Text – verbatim text
. – any character except newline
^ – matches start of the string (anchor)
$ – matches end of the string
* – Kleene start, 0 or more subpattern repetitions
+ – Kleene plus, 1 or more subpattern repetitions
? – optional, 0 or 1 subpattern occurrence
| – alternation, either left or right subpattern
() – group a subexpression inside parentheses
\ – escape a meta-character (make it normal)
[set of chars], [^set of chars not matched]
More Python RE Examples
• >>> m2 = search('a+z*(b.d)', 'Abcde')
• >>> m2
• >>> print m2
• None
•
•
•
•
•
•
>>> split(':', "abc:cd:e:f")
['abc', 'cd', 'e', 'f']
>>> split('[:]', "abc:cd:e:f")
['abc', 'cd', 'e', 'f']
>>> split('[^:]', "abc:cd:e:f")
['', '', '', ':', '', ':', ':', '']
More Python RE Examples (sub)
• >>> sub('a([^b]+)b', 'A\\1B', 'a123b45ab67a9b
aab')
• 'A123B45ab67A9B AaB'
• The parenthesized subexpression matches one or
more occurrences of anything except for b.
• The matched substring of the first parenthesized
subexpression is group 1.
• The replacement pattern \1 says “insert group 1 at
this point.”
• Effect is to re-insert characters between a and b.
Finite State Automata
• A regular expression compiler translates a regular
expression into a finite state automaton.
• This could be a linked data structure or code. It looks like a
graph of mapping steps needed for the regular expression.
• There are nondeterministic and deterministic flavors.
• (a|b)c+d is a simple example expression.
a
s1
start b
c
s2
c
c
ε
s3
s4
d
accept
Lookahead 1 types of parsers.
• LL(1) and LR(1) grammars require a parser to get
at most 1 look-ahead terminal from the scanner.
• LL(1) cannot handle left-recursive grammar
productions. It can handle other recursion.
• LR(1) and its variants can handle left, right and
nested recursion; left is the most efficient.
• A generated parser is essentially a deterministic
finite state automaton that uses a stack to keep
track of nested syntactic structures.
• This topic is covered exhaustively in compiler
design.
Parser generators in Python.
•
•
•
•
YAPPS2 is an LL(1) parser generator.
http://theory.stanford.edu/~amitp/yapps/
http://pypi.python.org/pypi/Yapps2
PLY is a Python LALR(1) (subset of LR(1))
equivalent to UNIX YACC and GNU Bison that
are used to generate compilers for C code.
• http://www.dabeaz.com/ply/
• Both generate Python executable parsers from
stylized Python code.