Python & Pattern Matching
with Regular Expressions (REs)
OPIM 101
File:PythonREs.ppt
1
Foresight
• Pattern matching
– Literal
– With metacharacters
• Regular expressions (REs)
• Using REs in Python
2
Consider: dir by Itself
D:\athomepc\day\idt>dir
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
.              <DIR>        01-01-02  8:16a .
..             <DIR>        01-01-02  8:16a ..
SPRING~1 PDF       180,072  01-01-02  8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02  8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02  8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02  8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02  8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02  9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02  9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02  9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02 11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02 11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>
3
Now: dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free
D:\athomepc\day\idt>
4
Now: dir with “*”
D:\athomepc\day\idt>dir *.doc
Volume in drive D has no label
Volume Serial Number is 3E4B-1609
Directory of D:\athomepc\day\idt
CASE1-~1 DOC        35,328  01-01-02  8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free
D:\athomepc\day\idt>
5
Literal vs. Pattern Searches
• dir myfile.doc
– Searches literally, for an exact match with
“myfile.doc”
• dir my*.doc
– Does a pattern search. Matches to any file
beginning with “my”, followed by 0 or more
characters of any kind, followed by “.doc”
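The two kinds of search can be tried from Python as well. The following is a minimal sketch, assuming a made-up list of file names; it uses the standard fnmatch module, whose "*" behaves like the one in dir, and it also shows a regular expression that should accept the same names. Note that fnmatchcase is case-sensitive, whereas dir on Windows is not.

```python
import fnmatch
import re

# Hypothetical file names, chosen only for illustration.
names = ["myfile.doc", "my_notes.doc", "my.doc", "yourfile.doc", "myfile.txt"]

# Literal search: only an exact match counts.
literal_hits = [n for n in names if n == "myfile.doc"]

# Pattern search: "*" stands for zero or more characters of any kind.
pattern_hits = [n for n in names if fnmatch.fnmatchcase(n, "my*.doc")]

# The same idea as a regular expression: ".*" plays the role of "*",
# and the period before "doc" is escaped so it is taken literally.
regex_hits = [n for n in names if re.fullmatch(r"my.*\.doc", n)]

print(literal_hits)   # ['myfile.doc']
print(pattern_hits)   # ['myfile.doc', 'my_notes.doc', 'my.doc']
print(regex_hits)     # ['myfile.doc', 'my_notes.doc', 'my.doc']
```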
6
MetaCharacters
• dir treats “*” as a metacharacter, a
character not taken literally, but as
instruction to match a certain kind of pattern
(here: anything)
• The dir metacharacter scheme is very useful
7
On Beyond *
• ...and also very primitive and limited
• A step up: grep in Unix & Linux; support
for RE searches in some text editors, e.g.,
TextPad (www.textpad.com)
• Regular expressions (REs) use a richer
language and larger set of metacharacters,
giving us a very powerful capability to
extract information (patterns) from text
8
Python’s RE Metacharacters
• Here’s the complete list:
. ^ $ * + ? { } [ ] \ | ( )
• No use memorizing. We’ll learn by
examples.
• A natural question: But what if I want to
search for a pattern that contains what
Python’s RE counts as metacharacters?
– Be just a little patient
9
Load Python’s re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie',teststring)
>>> match == None
0
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
>>>
10
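The session on this slide comes from an older Python, where a comparison prints 1 or 0. Here is a minimal sketch of the same literal search as a script for a current Python 3 interpreter, where the result is True or False and the usual test for "no match" is "is None". The variable names found, start and end are introduced here only for illustration.

```python
import re

teststring = "Television is public anomie number 1."

# re.search returns a match object, or None when nothing matches.
match = re.search("anomie", teststring)

found = match is not None
print(found)                        # True

if match:
    start, end = match.span()       # where the match begins and ends
    print(start, end)               # 21 27
    print(teststring[start:end])    # anomie
    print(match.group())            # anomie (the matched text itself)
```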
Now a Nonliteral Match
>>> match = re.search('Television',teststring)
>>> match == None
0
>>> match = re.search('television',teststring)
>>> match == None
1
>>> match = re.search('[tT]elevision',teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'
>>>
11
Square Bracket Notation: [...]
• “[tT]” means “any one of the characters
‘t’ or ‘T’.”
• [...] is called a character class
• Examples:
– [abc], [a-z], [A-Z]
– [^tT] not t and not T (a leading ^ negates the class; [^t^T] also works, but it excludes a literal ^ as well)
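A short sketch of these character classes at work on the same test string. It uses re.findall, which returns every non-overlapping match as a list; the variable names are illustrative only.

```python
import re

teststring = "Television is public anomie number 1."

abc_hits = re.findall(r"[abc]", teststring)      # any one of a, b, c
upper_hits = re.findall(r"[A-Z]", teststring)    # any one uppercase letter
lower_runs = re.findall(r"[a-z]+", teststring)   # runs of lowercase letters
first_not_t = re.search(r"[^tT]", teststring).group()  # first char that is not t or T

print(abc_hits)     # ['b', 'c', 'a', 'b']
print(upper_hits)   # ['T']
print(lower_runs)   # ['elevision', 'is', 'public', 'anomie', 'number']
print(first_not_t)  # e

# A ^ that is not first in the class is an ordinary character,
# so [^t^T] also refuses to match a literal caret.
caret_allowed = re.search(r"[^tT]", "^x").group()
caret_excluded = re.search(r"[^t^T]", "^x").group()
print(caret_allowed)    # ^
print(caret_excluded)   # x
```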
12
Not Example ^
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^t^T][a-z]+',teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
>>>
Note: + means “one or more of the previous”,
* means “zero or more”, and ? means “zero or one”
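To see the three quantifiers side by side, here is a small sketch that applies each one to the letter u in a few sample spellings. The sample words are made up for illustration, and re.fullmatch is used so that the whole word must fit the pattern.

```python
import re

# Hypothetical sample words; only the number of u's differs (plus one non-match).
samples = ["clr", "color", "colour", "colouur"]

results = {}
for pattern in (r"colou+r", r"colou*r", r"colou?r"):
    # Keep the words that the pattern matches from start to end.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# colou+r  ['colour', 'colouur']            one or more u
# colou*r  ['color', 'colour', 'colouur']   zero or more u
# colou?r  ['color', 'colour']              zero or one u
```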
13
'\s\w+\.' and '\s(\w+)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search(r'\s\w+\.',teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search(r'\s(\w+)\.',teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
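Parentheses mark groups, and the match object can hand back the text of each group directly, without slicing by hand. A minimal sketch on the same test string; the second pattern, with two groups, is an illustration added here and is not from the slides.

```python
import re

teststring = "Television is public anomie number 1."

match = re.search(r"\s(\w+)\.", teststring)
whole = match.group(0)      # the entire match, same as match.group()
inner = match.group(1)      # only what the parentheses captured
print(repr(whole))          # ' 1.'
print(inner)                # 1
print(match.groups())       # ('1',)

# Two groups: the last two words before the final period.
pair = re.search(r"(\w+)\s(\w+)\.", teststring)
print(pair.groups())        # ('number', '1')
print(pair.span(1))         # (28, 34)
print(pair.span(2))         # (35, 36)
```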
14
[.] == \.
• Inside [...] most metacharacters are taken
literally
– So, [.] == \.
• Note (again): [...] is called a character class
>>> match = re.search(r'\s(\w+)[.]',teststring)
>>> match.span()
(34, 37)
>>>
15
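This also answers the earlier question about searching for text that contains metacharacters. A minimal sketch, assuming a made-up price string: a backslash or a character class makes one metacharacter literal, and re.escape from the standard library does the same for a whole string.

```python
import re

# Hypothetical text, chosen because it contains the metacharacters $ . ( )
price_text = "Total: $3.50 (approx.)"

# Three ways to ask for a literal period.
for pattern in (r"\.", r"[.]", re.escape(".")):
    print(pattern, re.findall(pattern, price_text))   # each finds ['.', '.']

# An unescaped period matches any character (except a newline).
all_chars = re.findall(r".", price_text)
print(len(all_chars) == len(price_text))              # True

# Searching for a whole literal string that contains metacharacters.
literal = "$3.50"
unescaped = re.search(literal, price_text)            # $ means "end of string" here
escaped = re.search(re.escape(literal), price_text)
print(unescaped)          # None
print(escaped.span())     # (7, 12)
```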
Avoiding Greed ?
>>> newstring = '<div align="center">'
>>> newstring = newstring+'<i class="smaller">'
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+'</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01
>>> match = re.search('<.+>',newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.+?>',newstring)
>>> match.group()
'<div align="center">'
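By default + and * are greedy: they take as much text as they can while still allowing a match. Adding ? after them makes them stop as early as possible. A short sketch comparing the two on the same string, plus a negated character class, which is a common alternative way to get the same effect.

```python
import re

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

greedy = re.search(r"<.+>", newstring)    # runs on to the LAST > in the string
lazy = re.search(r"<.+?>", newstring)     # stops at the FIRST > it can

print(greedy.span())    # (0, 81)
print(lazy.span())      # (0, 20)
print(lazy.group())     # <div align="center">

# With the non-greedy form, findall picks out each tag separately.
tags = re.findall(r"<.+?>", newstring)
print(tags)
# ['<div align="center">', '<i class="smaller">', '</i>', '</div>', '<br>']

# Alternative: "anything except >" cannot run past the end of a tag.
tags_by_class = re.findall(r"<[^>]+>", newstring)
print(tags_by_class == tags)    # True
```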
16
More on Not Being Greedy
>>> match = re.search(r'<(\w).+?>(.+)</(\1)',newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).+?>([^<]+)</(\1)',newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
>>>
\1 is called a backreference. It refers to whatever text group 1 matched.
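Another common use of a backreference is spotting an accidentally doubled word. A minimal sketch, assuming a made-up sentence; \b marks a word boundary, so only whole words are compared.

```python
import re

# Hypothetical sentence containing two doubled words.
text = "This is is a test of the the backreference."

# (\w+) captures a word; \1 then demands the very same word again.
doubled = re.findall(r"\b(\w+)\s+\1\b", text)
print(doubled)      # ['is', 'the']

# In a replacement string, \1 inserts the captured word once.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)
print(fixed)        # This is a test of the backreference.
```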
17
Concluding
• REs are a very powerful tool, very often
very useful
• The language notation is compact and a bit
hard to read
• Practice, study the examples, don’t worry
about memorization.
18
Advice on Scripting
• Scripting, and programming in general, is a process
• Successful scripts don’t spring into existence whole
– Scripts are built in small increments
• Attend to:
– Decomposition
– Stories
– Testing
19
Advice on Scripting
• Decomposition
– Solve big problems by decomposing them into small
problems and solving them
• Stories
– Scripting/programming as a form of literature
– Use comments with code to tell a clear story about what
the code is or should be doing
• Testing
– Test everything, whole and part, often, and with varying inputs
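As one possible illustration of these three habits, here is a sketch that pulls a clock time out of an HTML fragment like the one used earlier. The function names strip_tags, find_time and time_from_html are hypothetical, invented for this example: the job is decomposed into two small steps, the docstrings and comments tell the story, and the assertions test the parts and the whole with varying inputs.

```python
import re

def strip_tags(html):
    """Step 1: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", html)

def find_time(text):
    """Step 2: return the first clock time such as '10:55 AM', or None."""
    match = re.search(r"\d{1,2}:\d{2} [AP]M", text)
    return match.group() if match else None

def time_from_html(html):
    """The whole job, told as a story of two small steps."""
    return find_time(strip_tags(html))

# Testing: each part, then the whole, with varying inputs.
assert strip_tags("<i>hi</i>") == "hi"
assert find_time("As of 10:55 AM") == "10:55 AM"
assert find_time("no time here") is None
snippet = '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>'
assert time_from_html(snippet) == "10:55 AM"
print(time_from_html(snippet))      # 10:55 AM
```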
20
Readings
• IDT book, chapter 8, “Text and Pattern
Processing”
• Further information (but beyond the scope
of 101)
– The Python online documentation on the re
module
– “Regular Expression HOWTO” by A.M.
Kuchling at http://py-howto.sourceforge.net/
and also at http://pyhowto.sourceforge.net/regex/regex.html
21