Transcript Files

COSC 1306—COMPUTER
SCIENCE AND PROGRAMMING
PYTHON FUNCTIONS
Jehan-François Pâris
Module Overview
• We will learn how to read, create and modify files
– Pay special attention to pickled files
• They are very easy to use!
The file system
• Provides long term storage of information.
• Will store data in stable storage (disk)
• Cannot be RAM because:
– Dynamic RAM loses its contents when
powered off
– Static RAM is too expensive
– System crashes can corrupt contents of the
main memory
Overall organization
• Data managed by the file system are grouped in
user-defined data sets called files
• The file system must provide a mechanism for
naming these data
– Each file system has its own set of
conventions
– All modern operating systems use a
hierarchical directory structure
Windows solution
• Each device and each disk partition is identified
by a letter
– A: and B: were used by the floppy drives
– C: is the first disk partition of the hard drive
– If hard drive has no other disk partition,
D: denotes the DVD drive
• Each device and each disk partition has its own
hierarchy of folders
Windows solution
[Figure: three separate folder trees: C: (containing Users, Windows, and Program Files), a second disk D:, and a flash drive F:]
UNIX/LINUX organization
• Each device and disk partition has its own
directory tree
– Disk partitions are glued together through the
mount operation to form a single tree
• Typical user does not know where her files
are stored
UNIX/LINUX organization
[Figure: the root partition / holds directories usr and bin; the mount operation attaches the other partition at usr, so the second partition can be accessed as /usr]
Mac OS organization
• Similar to Windows
– Disk partitions are not merged
– Represented by separate icons on the
desktop
Accessing a file (I)
• Your Python programs are stored in a folder AKA
directory
– On my home PC it is
C:\Users\Jehan-Francois Paris\Documents\
Courses\1306\Python
• All files in that directory can be directly accessed
through their names
– "myfile.txt"
Accessing a file (II)
• Files in subdirectories can be accessed by
specifying first the subdirectory
– Windows style:
• "test\\sample.txt"
– Note the double backslash
– Linux/Unix/Mac OS X style:
• "test/sample.txt"
– Generally works for Windows
Why the double backslash?
• The backslash is an escape character in
Python
– Combines with its successor to represent
non-printable characters
• ‘\n’ represents a newline
• ‘\t’ represents a tab
– Must use ‘\\’ to represent a plain backslash
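The escape rules above can be checked with a short sketch. The variable names are illustrative, and the raw-string form (r"...") is an addition that the slides do not mention.

```python
# Two ways to write the same Windows-style relative path.
escaped = "test\\sample.txt"   # each \\ in the source is one backslash in the string
raw = r"test\sample.txt"       # raw string: backslashes are taken literally

print(escaped)                 # displays the path with a single backslash
print(escaped == raw)          # both spellings produce the same string
print(len("\n"), len("\\"))    # each escape sequence is one character
```

Raw strings are convenient for Windows paths, but they cannot end with a single backslash.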
Accessing a file (III)
• For other files, must use full pathname
– Windows Style:
• "C:\\Users\\Jehan-Francois Paris\\
Documents\\Courses\\1306\\Python\\
myfile.txt"
Accessing file contents
• Two step process:
– First we open the file
– Then we access its contents
• Read
• Write
• When we are done, we close the file.
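A minimal sketch of the open / access / close cycle, assuming a scratch file in a temporary directory (the path is hypothetical). It uses the with statement, which the slides do not cover: the file is closed automatically when the block ends.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# Open for writing; the file is closed at the end of the block.
with open(path, "w") as fh:
    fh.write("To be or not to be\n")

# Open again for reading; the file is closed even if an error occurs.
with open(path) as fh:
    contents = fh.read()

print(contents, end="")
print(fh.closed)   # the handle still exists but is closed
```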
What happens at open() time?
• The system verifies
– That you are an authorized user
– That you have the right permission
• Read permission
• Write permission
• Execute permission exists but doesn’t
apply
and returns a file handle /file descriptor
The file handle
• Gives the user
– Direct access to the file
• No directory lookups
– Authority to execute the file operations whose
permissions have been requested
Python open()
• open(name, mode = 'r', buffering = -1)
where
– name is name of file
– mode is permission requested
• Default is 'r' for read only
– buffering specifies the buffer size
• Use system default value (code -1)
The modes
• Can request
– 'r' for read-only
– 'w' for write-only
• Always overwrites the file
– 'a' for append
• Writes at the end
– 'r+' or 'a+' for updating (read + write/append)
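The difference between 'w' and 'a' is easy to see on a scratch file. This is a sketch under the assumption that a temporary directory is available; the file name modes.txt is made up for the illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical scratch file

with open(path, "w") as fh:      # 'w' creates the file (or empties an existing one)
    fh.write("first\n")
with open(path, "w") as fh:      # opening with 'w' again discards "first"
    fh.write("second\n")
with open(path, "a") as fh:      # 'a' keeps the contents and writes at the end
    fh.write("third\n")
with open(path, "r") as fh:
    result = fh.read()

print(result, end="")
```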
Examples
• f1 = open("myfile.txt")
same as
f1 = open("myfile.txt", "r")
• f2 = open("test\\sample.txt", "r")
• f3 = open("test/sample.txt", "r")
• f4 = open("C:\\Users\\Jehan-Francois Paris\\
Documents\\Courses\\1306\\Python\\myfile.txt")
Reading a file
• Three ways:
– Global reads
– Line by line
– Pickled files
Global reads
• fh.read()
– Returns whole contents of file specified by
file handle fh
– File contents are stored in a single string that
might be very large
Example
• f2 = open("test\\sample.txt", "r")
bigstring = f2.read()
print(bigstring)
f2.close() # not required
Output of example
• To be or not to be that is the question
Now is the winter of our discontent
– Exact contents of file ‘test\sample.txt’
Line-by-line reads
• for line in fh : # do not forget the colon
      # anything you want
  fh.close() # not required
Example
• f3 = open("test/sample.txt", "r")
  for line in f3 : # do not forget the colon
      print(line)
  f3.close() # not required
Output
• To be or not to be that is the question
Now is the winter of our discontent
– With one or more extra blank lines
Why?
• Each line ends with an end-of-line marker
• print(…) adds an extra end-of-line
Trying to remove blank lines
• print('----------------------------------------------------')
  f5 = open("test/sample.txt", "r")
  for line in f5 : # do not forget the colon
      print(line[:-1]) # remove last char
  f5.close() # not required
  print('-----------------------------------------------------')
The output
• ----------------------------------------------------
To be or not to be that is the question
Now is the winter of our disconten
-----------------------------------------------------
• The last line did not end with an EOL!
A smarter solution (I)
• Only remove the last character if it is an EOL
– if line[-1] == '\n' :
      print(line[:-1])
  else :
      print(line)
A smarter solution (II)
• print('----------------------------------------------------')
  fh = open("test/sample.txt", "r")
  for line in fh : # do not forget the colon
      if line[-1] == '\n' :
          print(line[:-1]) # remove last char
      else :
          print(line)
  print('-----------------------------------------------------')
  fh.close() # not required
It works!
• ----------------------------------------------------
To be or not to be that is the question
Now is the winter of our discontent
-----------------------------------------------------
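An alternative to testing the last character by hand is the string method rstrip. The sketch below assumes an in-memory io.StringIO object standing in for test/sample.txt, so it runs without the file.

```python
import io

# io.StringIO stands in for an open text file; the last line has no newline.
fh = io.StringIO("To be or not to be that is the question\n"
                 "Now is the winter of our discontent")

cleaned = []
for line in fh:
    cleaned.append(line.rstrip("\n"))   # strips a trailing newline only if present

for line in cleaned:
    print(line)
```

Another option is print(line, end=''), which keeps the line as it is and stops print from adding its own end-of-line.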
Making sense of file contents
• Most files contain more than one data item per
line
– COSC 713-743-3350
UHPD 713-743-3333
• Must split lines
– mystring.split(sepchar)
where sepchar is a separation character
• returns a list of items
Splitting strings
• >>> text = "Four score and seven years ago"
>>> text.split()
['Four', 'score', 'and', 'seven', 'years', 'ago']
• >>> record = "1,'Baker, Andy', 83, 89, 85"
>>> record.split(',')
['1', "'Baker", " Andy'", ' 83', ' 89', ' 85']
Not what we wanted!
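One way to get the expected fields is the standard csv module, which understands quoted items. This is a sketch, not part of the original slides; the quotechar and skipinitialspace settings are chosen to match this particular record.

```python
import csv

record = "1,'Baker, Andy', 83, 89, 85"

# quotechar="'" because this record quotes the name with single quotes;
# skipinitialspace=True drops the blank that follows each comma.
reader = csv.reader([record], quotechar="'", skipinitialspace=True)
fields = next(reader)
print(fields)
```

The name now comes back as the single item 'Baker, Andy' instead of being cut in two.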
Example
# how2split.py
print('----------------------------------------------------')
f5 = open("test/sample.txt", "r")
for line in f5 :
    words = line.split()
    for xxx in words :
        print(xxx)
f5.close() # not required
print('-----------------------------------------------------')
Output
• ----------------------------------------------------
To
be
…
of
our
discontent
-----------------------------------------------------
Other separators (I)
• Commas
– CSV Excel format
• Values are separated by commas
• Strings are stored without quotes
– Unless they contain a comma
• "Doe, Jane", freshman, 90, 90
– Quotes within strings are doubled
Other separators (II)
• Tabs( ‘\t’)
– Advantages:
• Your fields will appear nicely aligned
• Spaces, commas, … are not an issue
– Disadvantage:
• You do not see them
– They look like spaces
Why it is important
• When you must pick your file format, you should
decide how the data inside the file will be used:
– People will read them
– Other programs will use them
– Will be used by people and machines
An exercise
• Converting our output to CSV format
– Replacing tabs by commas
• Easy
– Will use string replace function
First attempt
• fh_in = open('grades.txt', 'r') # the 'r' is optional
buffer = fh_in.read()
newbuffer = buffer.replace('\t', ',')
fh_out = open('grades0.csv', 'w')
fh_out.write(newbuffer)
fh_in.close()
fh_out.close()
print('Done!')
The output
• Alice   90   90   90   90   90
Bob     85   85   85   85   85
Carol   75   75   75   75   75
becomes
• Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol,75,75,75,75,75
Dealing with commas (I)
• Work line by line
• For each line
– split input into fields using TAB as separator
– store fields into a list
• Alice   90   90   90   90   90
becomes
['Alice', '90', '90', '90', '90', '90']
Dealing with commas (II)
– Put within double quotes any entry containing
one or more commas
– Output list entries separated by commas
• ['"Baker, Alice"', 90, 90, 90, 90, 90]
becomes
"Baker, Alice",90,90,90,90,90
Dealing with commas (III)
• Our troubles are not over:
– Must store somewhere all lines until we are
done
– Store them in a list
Dealing with double quotes
• Before wrapping items that contain commas in double
quotes, replace
– Every double quote by a pair of double quotes
– 'Aguirre, "Lalo" Eduardo'
becomes
'Aguirre, ""Lalo"" Eduardo'
then
'"Aguirre, ""Lalo"" Eduardo"'
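The quoting rules above are the ones applied by the standard csv module, so its output is a handy reference when checking a hand-written converter. This sketch assumes an in-memory buffer instead of a real output file, and the two scores are made up.

```python
import csv
import io

buffer = io.StringIO()                        # in-memory stand-in for an output file
writer = csv.writer(buffer, lineterminator="\n")

# The name contains both a comma and double quotes; the scores are hypothetical.
writer.writerow(['Aguirre, "Lalo" Eduardo', 90, 85])

line = buffer.getvalue()
print(line, end="")
```

The name comes out wrapped in double quotes with its inner quotes doubled, while the plain numbers are written unquoted.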
General organization (I)
• linelist = [ ]
• for line in file
– itemlist = line.split(…)
– linestring = '' # empty string
– for each item in itemlist
• remove any trailing newline
• double all double quotes
• if item contains comma, wrap
• add to linestring
General organization (II)
• for line in file
–…
– for each item in itemlist
• double all double quotes
• if item contains comma, wrap
• add to linestring
– append linestring to linelist
General organization (III)
• for line in file
–…
– remove last comma of linestring
– add newline at end of linestring
– append linestring to linelist
• for linestring in linelist
– write linestring into output file
The program (I)
• # betterconvert2csv.py
  """ Convert tab-separated file to csv
  """
  fh = open('grades.txt','r') #input file
  linelist = [ ] # global data structure
  for line in fh : # outer loop
      itemlist = line.split('\t')
      # print(str(itemlist)) # just for debugging
      linestring = '' # start afresh
The program (II)
•
      for item in itemlist : #inner loop
          item = item.replace('"','""') # for quotes
          if item.endswith('\n') : # remove it; also safe for an empty item
              item = item[:-1]
          if ',' in item :
              # wrap item
              linestring += '"' + item +'"' + ','
          else :
              # just append
              linestring += item +','
      # end of inside for loop
The program (III)
•
      # must replace last comma by newline
      linestring = linestring[:-1] + '\n'
      linelist.append(linestring)
  # end of outside for loop
  fh.close()
  fhh = open('great.csv', 'w')
  for line in linelist :
      fhh.write(line)
  fhh.close()
Notes
• Most print statements used for debugging were
removed
– Space considerations
• Observe that the inner loop adds a comma after
each item
– Wanted to remove the last one
• Must also add a newline at end of each line
The input file
• One student per line, name followed by tab-separated scores:
Alice
Bob
Carol
Doe, Jane
Fulano, Eduardo "Lalo"
[Table: the score columns of the input file did not survive extraction]
The output file
• Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol ,75,75,75,75,75
"Doe, Jane",90,90 ,90 ,80 ,75
"Fulano, Eduardo ""Lalo""",90,90,90,90
Mistakes being made (I)
• Mixing lists and strings:
– Earlier draft of program declared
• linestring = [ ]
and did
• linestring.append(item)
– Outcome was
• ['Alice,', '90,', … ]
instead of
• 'Alice,90, …'
Mistakes being made (II)
• Forgetting to add a newline
– Output was a single line
• Doing the append inside the inner loop:
– Output was
• Alice,90
Alice,90,90
Alice,90,90,90
…
Mistakes being made (III)
• Forgetting that strings are immutable:
– Trying to do
• linestring[-1] = '\n'
instead of
• linestring = linestring[:-1] + '\n'
– Bigger issue:
• Do we have to remove the last comma?
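One possible answer to that question is the string method join, which puts the separator only between the items, so there is no last comma to remove. The sketch below compares it with the slide approach on a made-up item list.

```python
items = ["Alice", "90", "90", "90"]   # hypothetical, already-processed items

# Slide approach: add a comma after every item, then replace the last one.
linestring = ""
for item in items:
    linestring += item + ","
linestring = linestring[:-1] + "\n"

# Alternative: join() inserts the comma only between items.
joined = ",".join(items) + "\n"

print(joined, end="")
print(joined == linestring)
```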
Could we have done better? (I)
• Make the program more readable by
decomposing it into functions
– A function to process each line of input
• do_line(line)
– Input is a string ending with newline
– Output is a string in CSV format
– Should call a function processing
individual items
Could we have done better? (II)
– A function to process individual items
• do_item(item)
– Input is a string
– Returns a string
• With double quotes "doubled"
• Without a newline
• Within quotes if it contains a comma
The new program (I)
• def do_item(item) :
      item = item.replace('"','""')
      if item.endswith('\n') : # also safe for an empty item
          item = item[:-1]
      if ',' in item :
          item ='"' + item +'"'
      return item
The new program (II)
• def do_line(line) :
      itemlist = line.split('\t')
      linestring = '' # start afresh
      for item in itemlist :
          linestring += do_item(item) +','
      # replace last comma by newline
      linestring = linestring[:-1] + '\n'
      return linestring
The new program (III)
• fh = open('grades.txt','r')
  linelist = [ ]
  for line in fh :
      linelist.append(do_line(line))
  fh.close()
The new program (IV)
• fhh = open('great.csv', 'w')
  for line in linelist :
      fhh.write(line)
  fhh.close()
Why it is better
• Program is decomposed into small modules that
are much easier to understand
– Each fits on a PowerPoint slide
The break statement
• Makes the program exit the loop it is in
• In next example, we are looking for
first instance of a string in a file
– Can exit as soon as it is found
Example (I)
• searchstring= input('Enter search string:')
  found = False
  fh = open('grades.txt')
  for line in fh :
      if searchstring in line :
          print(line)
          found = True
          break
Example (II)
• if found == True :
      print("String %s was found" % searchstring)
  else :
      print("String %s NOT found " % searchstring)
Flags
• A variable like found
– That can either be True or False
– That is used in a condition for an if or a while
is often referred to as a flag
A dumb mistake
• Unlike C and its family of languages,
Python does not let you write
– if found = True
for
– if found == True
• There are still cases where we can make
mistakes!
Example
• >>> b = 5
>>> c = 8
>>> a = b = c
>>> a
8
• >>> a = b == c
>>> a
True
HANDLING EXCEPTIONS
When a wrong value is entered
• When user is prompted for
– number = int(input("Enter a number: "))
and enters
– a non-numerical string
a ValueError exception is raised and the
program terminates
• Python programs can catch such errors
The try… except pair (I)
• try:
      <statements being tried>
  except Exception as ex:
      <statements catching the exception>
• Observe
– the colons
– the indentation
The try… except pair (II)
• try:
      <statements being tried>
  except Exception as ex:
      <statements catching the exception>
• If an exception occurs while the program executes
the statements between the try and the except,
control is immediately transferred to the
statements after the except
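The transfer of control described above can be sketched with a small helper. The function parse_int is hypothetical, and it catches only ValueError rather than the general Exception used on the slide, so unrelated errors are not hidden.

```python
def parse_int(text):
    """Return int(text), or None when text is not a valid integer."""
    try:
        return int(text)          # may raise ValueError
    except ValueError as ex:      # control jumps here on bad input
        print("Not a number:", ex)
        return None

print(parse_int("42"))
print(parse_int("forty-two"))
```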
A better example
• done = False
  while not done :
      filename= input("Enter a file name: ")
      try :
          fh = open(filename)
          done = True
      except Exception as ex:
          print ('File %s does not exist' % filename)
  print(fh.read())
An Example (I)
• done = False
  while not done :
      try :
          number = int(input('Enter a number:'))
          done = True
      except Exception as ex:
          print ('You did not enter a number')
  print ("You entered %.2f." % number)
  input("Hit enter when done with program.")
A simpler solution
• done = False
  while not done :
      myinput = (input('Enter a number:'))
      if myinput.isdigit() :
          number = int(myinput)
          done = True
      else :
          print ('You did not enter a number')
  print ("You entered %.2f." % number)
  input("Hit enter when done with program.")
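One caveat with the isdigit() test is that it only accepts strings made entirely of digits, so a negative number is rejected even though int() would accept it. The sketch below shows the difference; is_int is a hypothetical helper, not something from the slides.

```python
# isdigit() is True only when every character is a digit.
for text in ["42", "-5", "3.5", ""]:
    print(repr(text), text.isdigit())

def is_int(text):
    """Hypothetical helper: let int() itself decide whether text is valid."""
    try:
        int(text)
        return True
    except ValueError:
        return False

print(is_int("-5"), is_int("3.5"))
```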
PICKLED FILES
Pickled files
• import pickle
– Provides a way to save complex data
structures in a file
– Sometimes said to provide a
serialized representation of Python objects
Basic primitives (I)
• dump(object,fh)
– appends a sequential representation of object
into file with file handle fh
– object is virtually any Python object
– fh is the handle of a file that must have been
opened in 'wb' mode
b is a special option that allows
writing or reading binary data
Basic primitives (II)
• target = load( filehandle)
– assigns to target next pickled object stored in
file filehandle
– target is virtually any Python object
– filehandle is the handle of a file that was
opened in 'rb' mode
Example (I)
• >>> mylist = [ 2, 'Apples', 5, 'Oranges']
• >>> mylist
[2, 'Apples', 5, 'Oranges']
• >>> fh = open('testfile', 'wb') # b is for BINARY
• >>> import pickle
• >>> pickle.dump(mylist, fh)
• >>> fh.close()
Example (II)
• >>> fhh = open('testfile', 'rb') # b is for BINARY
• >>> theirlist = pickle.load(fhh)
• >>> theirlist
[2, 'Apples', 5, 'Oranges']
• >>> theirlist == mylist
True
What was stored in testfile?
• Some binary data containing the strings 'Apples'
and 'Oranges'
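A quick way to look at the pickled bytes without opening a file is pickle.dumps, which returns the same serialized data as a bytes object. This is a sketch; the exact bytes depend on the Python version and pickle protocol in use.

```python
import pickle

mylist = [2, 'Apples', 5, 'Oranges']
data = pickle.dumps(mylist)        # same data that dump() would write to a file

print(type(data))
print(b'Apples' in data, b'Oranges' in data)   # the strings are visible in the bytes
print(pickle.loads(data) == mylist)            # loads() rebuilds an equal list
```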
Using ASCII format
• Can require a pickled representation of objects
that only contains printable characters
– Must specify protocol = 0
• Advantage:
– Easier to debug
• Disadvantage:
– Takes more space
Example
• import pickle
mydict = {'Alice': 22, 'Bob' : 27}
fh = open('asciifile.txt', 'wb') # MUST be 'wb'
pickle.dump(mydict, fh, protocol = 0)
fh.close()
fhh = open('asciifile.txt', 'rb')
theirdict = pickle.load(fhh)
print(mydict)
print(theirdict)
The output
• {'Bob': 27, 'Alice': 22}
{'Bob': 27, 'Alice': 22}
What is inside asciifile.txt?
• (dp0VBobp1L27LsVAlicep2L22Ls.
Dumping multiple objects (I)
• import pickle
  fh = open('asciifile.txt', 'wb')
  for k in range(3, 6) :
      mylist = [i for i in range(1,k)]
      print(mylist)
      pickle.dump(mylist, fh, protocol = 0)
  fh.close()
Dumping multiple objects (II)
• fhh = open('asciifile.txt', 'rb')
  lists = [ ] # initializing list of lists
  while 1 : # means forever
      try:
          lists.append(pickle.load(fhh))
      except EOFError :
          break
  fhh.close()
  print(lists)
Dumping multiple objects (III)
• Note the way we test for end-of-file (EOF)
– while 1 : # means forever
      try:
          lists.append(pickle.load(fhh))
      except EOFError :
          break
The output
• [1, 2]
[1, 2, 3]
[1, 2, 3, 4]
[[1, 2], [1, 2, 3], [1, 2, 3, 4]]
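The read-until-EOFError loop can be wrapped in a reusable generator. The sketch below assumes an in-memory io.BytesIO buffer in place of asciifile.txt and uses the default pickle protocol; load_all is a hypothetical helper name.

```python
import io
import pickle

def load_all(fh):
    """Yield every pickled object in fh until end-of-file (hypothetical helper)."""
    while True:
        try:
            yield pickle.load(fh)
        except EOFError:          # raised when no pickled object is left
            return

buffer = io.BytesIO()                      # in-memory stand-in for a 'wb' file
for k in range(3, 6):
    pickle.dump(list(range(1, k)), buffer)
buffer.seek(0)                             # rewind before reading back

lists = list(load_all(buffer))
print(lists)
```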
What is inside asciifile.txt?
• (lp0L1LaL2La.(lp0L1LaL2LaL3La.(lp0L1LaL2L
aL3LaL4La.
Practical considerations
• You rarely pick the format of your input files
– May have to do format conversion
• You often have to use specific formats for your
output files
– Often dictated by program that will use them
• Otherwise stick with pickled files!