8-Lecture-IOandRegexx

Transcript 8-Lecture-IOandRegexx

File I/O and Regular Expressions
Sandy Brownlee
Outline
• Basic reading / writing of text files in Python
– Use a library for more complex formats!
– E.g. openpyxl, python-docx, pypdf2
• Regular Expressions (Regex)
– Appears in Python, but also many other contexts
– Introduction to basic operators and the Python
implementation
Text files
• Open file, get handle
• Step through the file
– Line by line (pointer moves as we read)
– (bytewise for binary files)
• Close file
– Releases locks and resources
• Be careful about:
– Windows / Unix format newlines
– Character encoding (ASCII is *so* 1980s)
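A minimal sketch of the two pitfalls above, assuming a throwaway file (the name demo.txt and its contents are made up for illustration). Passing the encoding explicitly avoids relying on the platform default, and text mode translates Windows newlines on reading:

```python
import os
import tempfile

# Hypothetical demo file: Windows-style line endings and a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("Zoë,C11\r\nBob,C12\r\n".encode("utf-8"))

# Text mode with an explicit encoding; by default "\r\n" is read back as "\n".
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

print(lines)  # ['Zoë,C11\n', 'Bob,C12\n']
```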
Reading files in Python
• f = open("data.txt", "r")
– Open file for reading ("w" = writing, "a" = append)
• s = f.readline()
– Read next line from the file, store in string “s”
• f.write(s + "\n")
– Write “s” to file, followed by newline character
(the file must have been opened with "w" or "a")
– print(s, file=f) achieves the same
(Python 2: print >> f, s)
• f.close()
Reading files (1)
f = open("data.txt", "r")
print(f)
line1 = f.readline()
line2 = f.readline()
line3 = f.readline()
print(line1)
print(line2 + " - " + line3)
f.close()
print("done.")
data.txt
Name,Room,Phone
Bob,C11,4445
Alice,C12,4443
Jeff,B14,4456
Jonathan,B16,4452
Susan,B19,4476
Betty,AA1,4599
Sean,AX2,4598
Wilma,AX3,4578
Jim,AX5,4590
Mary,C44,4140
Output:
<_io.TextIOWrapper name='data.txt' mode='r' encoding='cp1252'>
Name,Room,Phone

Bob,C11,4445
 - Alice,C12,4443

done.
Reading files (2)
• Pretty ugly, right?
• Use with… instead of open() & f.close():
– with open("data.txt") as f:
– This automatically closes the file after the block
• Use a loop to iterate over the file:
– for line in f:
• Strip those nasty newlines:
– line.rstrip()
Reading files (3)
with open('data.txt') as f:
    print(f)
    for line in f:
        print(line.rstrip())
print("done.")
data.txt
Name,Room,Phone
Bob,C11,4445
Alice,C12,4443
Jeff,B14,4456
Jonathan,B16,4452
Susan,B19,4476
Betty,AA1,4599
Sean,AX2,4598
Wilma,AX3,4578
Jim,AX5,4590
Mary,C44,4140
Output:
<_io.TextIOWrapper name='data.txt' mode='r' encoding='cp1252'>
Name,Room,Phone
Bob,C11,4445
Alice,C12,4443
…
Jim,AX5,4590
Mary,C44,4140
done.
Writing files
with open('output.txt', "w") as f:
    for i in range(1, 10):
        print("Line " + str(i), file=f)  # Python 3
        # print >> f, ("Line " + str(i))  # Python 2
• What do you expect to be in the file?
output.txt
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
CSVs
• Comma separated values – text file with rows and
columns, data separated by commas
Name,Room,Phone
"Lock,Alice",C12,4443
"Hanson,Jeff",B14,4456
"Holmes,Jonathan",B16,4452
• Could read each line and use split(",") to break
into lists, but this is quite easy to break!
– e.g. commas within quotes (like the names above)
• Better to use the Python csv library:
– csv.reader(file)
– csv.DictReader(file)
– csv.writer(file, dialect='excel')
– csv.DictWriter(file, fieldnames, dialect='excel')
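A short sketch of why split(",") breaks and how csv.DictReader copes, assuming the data is held in a string (raw) rather than a real file, with io.StringIO standing in for the open file handle:

```python
import csv
import io

# Two data rows adapted from the example above, plus the header line.
raw = 'Name,Room,Phone\n"Lock,Alice",C12,4443\n"Hanson,Jeff",B14,4456\n'

# Naive approach: the comma inside the quoted name splits it in two.
naive = raw.splitlines()[1].split(",")
print(naive)  # ['"Lock', 'Alice"', 'C12', '4443']

# csv.DictReader honours the quotes and keys each row by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["Name"], rows[0]["Room"])  # Lock,Alice C12
```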
Regular Expressions
• A regular expression (regex) provides a syntax
for matching patterns of characters in a string
• You have probably seen a simple version
("wildcards") for file names: *.txt
or searching in SQL: LIKE 'a%b'
• Regexes are FAR more powerful, as we shall see
Why Do we Need Them?
• Searching:
– Find all the email addresses in a file
– Find all the words that have a suffix "ing"
• Verification
– Check an email address matches the required
format
• Manipulation
– Remove certain characters
– Change a=1,b=2,c=3 to {"a":1,"b":2,"c":3}
Regex Example
Naive_MOEAD_unseeded_Dup5_att_1.txt
Naive_NSGAII_unseeded_Dup5_att_1.txt
Naive_MOEAD_Dup5_att_1.txt
Naive_NSGAII_Dup5_att_1.txt
Bilevel_MOEAD_unseeded_Dup5_att_15.txt
Bilevel_MOEAD_Dup5_att_15.txt
Bilevel_NSGAII_unseeded_Dup5_att_15.txt
…
Find ^([^.]+)\.txt, replace with: \1 = read.table("\1.txt")
Naive_MOEAD_unseeded_Dup5_att_1 = read.table("Naive_MOEAD_unseeded_Dup5_att_1.txt")
Naive_NSGAII_unseeded_Dup5_att_1 = read.table("Naive_NSGAII_unseeded_Dup5_att_1.txt")
Naive_MOEAD_Dup5_att_1 = read.table("Naive_MOEAD_Dup5_att_1.txt")
Naive_NSGAII_Dup5_att_1 = read.table("Naive_NSGAII_Dup5_att_1.txt")
Bilevel_MOEAD_unseeded_Dup5_att_15 = read.table("Bilevel_MOEAD_unseeded_Dup5_att_15.txt")
Bilevel_MOEAD_Dup5_att_15 = read.table("Bilevel_MOEAD_Dup5_att_15.txt")
Bilevel_NSGAII_unseeded_Dup5_att_15 = read.table("Bilevel_NSGAII_unseeded_Dup5_att_15.txt")
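The same find-and-replace can be sketched in Python with re.sub, assuming the file names are held in a list (names is a made-up variable). The dot before txt is escaped here so that it only matches a literal full stop:

```python
import re

names = [
    "Naive_MOEAD_unseeded_Dup5_att_1.txt",
    "Bilevel_MOEAD_Dup5_att_15.txt",
]

# Group 1 captures everything before the extension; \1 re-uses it twice.
pattern = re.compile(r"^([^.]+)\.txt")
lines = [pattern.sub(r'\1 = read.table("\1.txt")', name) for name in names]

for line in lines:
    print(line)
```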
Where are They Used?
• Unix has a search command called grep, which
allows you to search files from the command
line
• Most programming languages have regex
commands or libraries, notably:
– JavaScript (good for validating form entry)
– Python (for data wrangling)
– Java, C#, Perl, Ruby, PHP …
• Many databases support Regex search,
including MongoDB, MySQL …
• Common in text editors / IDEs (e.g. Eclipse)
I'm Sold - How Do I Use Them?
• We will use a simple text editing program
called EditPad (http://www.editpadlite.com)
• It has a regex search facility, so is good to
practice on
• A regular expression is a string of characters
that defines what patterns should be matched
Regex Characters
• Want to search for the word "cat"? The
regular expression is cat
• But if you want to do more, you need to use a
combination of the regex characters:
\ ^ $ . | ? * + ( ) [ ] { }
Examples
• Here are a few lines of text in EditPad
Examples
Cat
c.t
Dog\d
\D
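The four patterns above can be tried in Python as well. This is only a sketch: the sample text is made up for illustration, and re.findall is used to list every match.

```python
import re

text = "Cat cat cut Dog1 Dog"  # hypothetical sample text

literal = re.findall(r"Cat", text)    # exact, case-sensitive text
any_char = re.findall(r"c.t", text)   # . stands for any single character
digit = re.findall(r"Dog\d", text)    # Dog followed by one digit
non_digit = re.findall(r"\D", "a1b2") # every character that is not a digit

print(literal, any_char, digit, non_digit)
# ['Cat'] ['cat', 'cut'] ['Dog1'] ['a', 'b']
```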
Anchors
• ^ matches the start of a line
• $ matches the end of a line
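A small sketch of the anchors in Python, using a made-up two-line string. Note that Python's re module only treats ^ and $ as line anchors when re.MULTILINE is set; otherwise they refer to the start and end of the whole string.

```python
import re

text = "cat dog\ndog cat"  # hypothetical two-line sample

# Positions where "dog" starts a line / ends a line.
line_starts = [m.start() for m in re.finditer(r"^dog", text, re.MULTILINE)]
line_ends = [m.start() for m in re.finditer(r"dog$", text, re.MULTILINE)]

# Without the flag, the whole string neither starts nor ends with "dog".
whole_string = re.findall(r"^dog", text) + re.findall(r"dog$", text)

print(line_starts, line_ends, whole_string)  # [8] [4] []
```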
Counts
• {} brackets specify a count: \d{3} matches
exactly three digits, \d{2,4} between two and four
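A sketch of counts in Python, assuming a made-up string of numbers. The \b word boundaries are there so that only whole numbers of the right length are reported:

```python
import re

text = "4445 44 123456"  # hypothetical sample

exactly_four = re.findall(r"\b\d{4}\b", text)    # whole numbers of 4 digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)   # whole numbers of 2 to 4 digits
chunks_of_three = re.findall(r"\d{3}", text)     # any run of 3 digits, no boundaries

print(exactly_four)     # ['4445']
print(two_to_four)      # ['4445', '44']
print(chunks_of_three)  # ['444', '123', '456']
```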
Character Sets
• Use [] to signify a set of single characters
• [abc] finds all occurrences of a OR b OR c
• [0-5] finds all occurrences in a range
• [a-fA-F] finds all occurrences in multiple ranges
• Built-in sets include:
– \d finds digits [0-9]
– \D finds non-digits [^0-9]
– \s finds whitespace
– \S finds non-whitespace
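The sets above, sketched in Python against one made-up line of text:

```python
import re

text = "Bob C11 4445"  # hypothetical sample

low_digits = re.findall(r"[0-5]", text)      # a single range
hex_letters = re.findall(r"[a-fA-F]", text)  # two ranges in one set
digits = re.findall(r"\d", text)             # built-in set, same as [0-9]
spaces = re.findall(r"\s", text)             # built-in whitespace set

print(low_digits)   # ['1', '1', '4', '4', '4', '5']
print(hex_letters)  # ['B', 'b', 'C']
print(spaces)       # [' ', ' ']
```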
Alternation (OR)
• If you want to search members of a list of strings,
use |
• cat|dog searches for cat or dog
• Use word boundary \b to search
for full words:
\b(Cat|Dog)\b
• (a word boundary is the position between a
word character and a non-word character, or
the start / end of the string)
• Brackets group the "or" part
to mean:
wordstart(cat or dog)wordend
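A sketch of alternation with and without word boundaries, using a made-up sample. When the pattern contains a group, re.findall returns the text captured by the group:

```python
import re

text = "Cat Dog Catalog Dogma"  # hypothetical sample

# Without boundaries, the start of "Catalog" and "Dogma" also match.
anywhere = re.findall(r"Cat|Dog", text)

# With \b on both sides, only the full words match.
full_words = re.findall(r"\b(Cat|Dog)\b", text)

print(anywhere)    # ['Cat', 'Dog', 'Cat', 'Dog']
print(full_words)  # ['Cat', 'Dog']
```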
() Parentheses for Groups
• Use ( ... ) parentheses to group part of a
regular expression
• Same logic as with mathematical expressions:
(a+b)/c ≠ a+(b/c)
• c(\d{3}) ≠ (c\d){3}
• c(\d{3}) matches c123
• (c\d){3} matches c1c2c3
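The difference between the two groupings can be checked with re.fullmatch, which only succeeds if the whole string fits the pattern (a sketch; the variable names are made up):

```python
import re

digits_grouped = re.compile(r"c(\d{3})")  # c, then a group of three digits
pair_repeated = re.compile(r"(c\d){3}")   # the pair "c + digit", three times

a = digits_grouped.fullmatch("c123")
b = pair_repeated.fullmatch("c1c2c3")

print(a.group(1))  # 123
print(b.group(1))  # c3 (a repeated group keeps its last repetition)

# Swapping the inputs gives no match at all.
print(digits_grouped.fullmatch("c1c2c3"), pair_repeated.fullmatch("c123"))
# None None
```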
Repetition
• We have already seen {} for counting
• More general counters are:
* means zero or more
+ means one or more
? means zero or one
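A sketch of the three counters applied to the letter b, with made-up test words. re.fullmatch is used so that the whole word must fit the pattern:

```python
import re

words = ["ac", "abc", "abbc"]  # hypothetical test words

star = [w for w in words if re.fullmatch(r"ab*c", w)]      # zero or more b
plus = [w for w in words if re.fullmatch(r"ab+c", w)]      # one or more b
optional = [w for w in words if re.fullmatch(r"ab?c", w)]  # zero or one b

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['ac', 'abc']
```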
So how does it work?
• The parser starts on the left of the regex, and
on the left of the text, and works along
towards the right, eating characters as it goes
The quick brown fox jumps over the lazy dog
q.*o
• What do you expect to match in the text?
Greedy / non-greedy
• Repetitions like * and + are greedy
• The regex engine tries to match them as many
times as possible
• If later portions of the pattern don’t match,
the engine will back up and try again
• Non-greedy operators match as little as
possible: *? +?
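Applied to the quick brown fox sentence, the greedy and non-greedy forms of q.*o behave as sketched here:

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Greedy: .* runs to the end, then backs up to the last "o".
greedy = re.search(r"q.*o", text).group()

# Non-greedy: .*? stops at the first "o" it can.
lazy = re.search(r"q.*?o", text).group()

print(greedy)  # quick brown fox jumps over the lazy do
print(lazy)    # quick bro
```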
Negation
• Regular expressions are not naturally good at
"not equal" type matches
• [^abc] means “any one character that is not a, b
or c”, but there is no such shortcut for excluding
whole words
• Negative look-ahead is one way to achieve
negation
Look-ahead
• Exists in many regex implementations
• Look-ahead lets you match a string only if it
is (or is not) followed by something else
• Positive look-ahead uses (?=...)
• Negative look-ahead uses (?!...)
Finds things NOT followed by something
else: (?!dog) means “not followed by dog”
• "^((?!Dog).)*$" matches strings that
do not contain “Dog”: it means that, from
the start to the end, no character may be
the start of “Dog”
• Look-aheads don’t “eat” characters in the
way the other patterns do
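A sketch of both kinds of look-ahead in Python, with made-up sample strings. In the "does not contain" pattern the look-ahead is placed before the dot, so that a “Dog” at the very start of the string is rejected too:

```python
import re

text = "Dog Dogs"  # hypothetical sample

# Start positions of "Dog" that is / is not followed by "s".
followed = [m.start() for m in re.finditer(r"Dog(?=s)", text)]
not_followed = [m.start() for m in re.finditer(r"Dog(?!s)", text)]
print(followed, not_followed)  # [4] [0]

# Every character is consumed only if "Dog" does not begin at that position.
no_dog = re.compile(r"^((?!Dog).)*$")
candidates = ["Cat", "Dog", "My Dog", "Dig"]
without_dog = [s for s in candidates if no_dog.search(s)]
print(without_dog)  # ['Cat', 'Dig']
```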
Identifying Groups
• Replacements (slide 12 – “Regex Example”)
• Fancy searches – e.g. find matching pairs of
tags in HTML
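A sketch of the tag-pair search, assuming a made-up HTML fragment. The backreference \1 insists that the closing tag repeats whatever name group 1 captured. This is only an illustration of groups, not a substitute for a real HTML parser (nested tags and attributes are not handled):

```python
import re

html = "<b>bold</b> <i>mixed</b> <em>fine</em>"  # hypothetical fragment

# <name> ... </name>, where both names must be identical.
pair = re.compile(r"<(\w+)>.*?</\1>")
matches = [m.group() for m in pair.finditer(html)]

print(matches)  # ['<b>bold</b>', '<em>fine</em>']
```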
RegEx in Python
• import re
– re.search() – find a match anywhere in the string
– re.match() – only try to match at start of string
– re.findall() – find all matches and return as a list
– re.split() – like string.split() but a regex pattern
• Backslashes also mean something in Python
strings, so use raw strings for tidiness: r"\d+"
• Use flags to modify pattern matching:
– re.DOTALL: make . match any character, including newlines
– re.IGNORECASE: do case-insensitive matches
– re.MULTILINE: multi-line matching, affecting ^ and $
– re.VERBOSE: enable verbose REs, which can be formatted more neatly
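The four functions side by side, as a sketch on a made-up string, followed by one example of combining flags with |:

```python
import re

text = "Bob,C11;4445 Alice,C12;4443"  # hypothetical sample

first = re.search(r"\d+", text).group()  # first match anywhere
at_start = re.match(r"\d+", text)        # string starts with "B", so no match
numbers = re.findall(r"\d+", text)       # every match, as a list of strings
fields = re.split(r"[,; ]", text)        # split on any of three separators

print(first, at_start)  # 11 None
print(numbers)          # ['11', '4445', '12', '4443']
print(fields)           # ['Bob', 'C11', '4445', 'Alice', 'C12', '4443']

# Flags are combined with |: ignore case, and let ^ match at each line start.
flagged = re.findall(r"^alice", "Bob\nAlice", re.IGNORECASE | re.MULTILINE)
print(flagged)          # ['Alice']
```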
Python RegEx Examples (1)
import re
quickfox = "The quick brown fox jumps over the 2 lazy dogs"
result = re.search("b.", quickfox)
print("Result: " + str(result))
print("Group: " + result.group())
expr = re.compile("b.")
result = expr.search(quickfox)
print("Result2: " + str(result))
if expr.search(quickfox):
    print("Matched!")
Output:
Result: <_sre.SRE_Match object; span=(10, 12), match='br'>
Group: br
Result2: <_sre.SRE_Match object; span=(10, 12), match='br'>
Matched!
Python RegEx Examples (2)
import re
quickfox = "The quick brown fox jumps over\nthe 2 lazy dogs"
print("A: " + str(re.search("\\d....", quickfox)))
print("B: " + str(re.search(r"\d....", quickfox)))
print("C: " + str(re.search("T", quickfox)))
print("D: " + str(re.search("T", quickfox, re.IGNORECASE)))
print("E: " + str(re.search(".T", quickfox, re.IGNORECASE)))
print("F: " + \
      str(re.search(".T", quickfox, re.IGNORECASE | re.DOTALL)))
Output:
A: <_sre.SRE_Match object; span=(35, 40), match='2 laz'>
B: <_sre.SRE_Match object; span=(35, 40), match='2 laz'>
C: <_sre.SRE_Match object; span=(0, 1), match='T'>
D: <_sre.SRE_Match object; span=(0, 1), match='T'>
E: None
F: <_sre.SRE_Match object; span=(30, 32), match='\nt'>
This week’s lab
• Open a CSV file
• Print content of CSV to screen
• Do some quick checks on the data using regex
