print - Purdue CS Wiki/Application server
Announcements
All groups have been assigned
Homework:
By this evening email everyone in your group and set up
a meeting time to discuss project 4
Project 4 will be released tomorrow
You will have roughly 3 weeks to work on it
How do I work in a team?
Communication
Teams that do not communicate well do poorly on the
project
Understanding the assignment
Teams that sit down and go over the assignment together
do well
Battle plan
Outline the project in your own English text
Code together
Difficult parts of the project are best done together
Parsing Text
The vast majority of the information present on the
internet is in text form
Data, webpages, etc
We want to transform the data into a more usable form
Examples we have seen thus far:
Encoding of a matrix
Encoding of a tree
Project 3, changing text (encrypting and decrypting)
Example: Finding a nucleotide sequence
We can find DNA sequences of parasites on the
internet (typically in databases)
Problem: we want to know if a sequence of nucleotides
is in a particular parasite
We want to know not only “yes” or “no” but also which parasite
What the data looks like
>Schisto unique AA825099
gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga
gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg
>Schisto unique mancons0736
ttctcgctcacactagaagcaagacaatttacactattattattattatt
accattattattattattattactattattattattattactattattta
ctacgtcgctttttcactccctttattctcaaattgtgtatccttccttt
How are we going to do it?
First, we get the sequences in a big string.
Next, we find where the small subsequence is in the
big string.
From there, we need to work backwards until we find
“>” which is the beginning of the line with the
sequence name.
From there, we need to work forwards to the end of the
line. From “>” to the end of the line is the name of the
sequence
Yes, this is hard to get right.
Let's Review Some Python
string.find(sub) – returns the lowest index where the
substring sub is found, or -1 if it is not found
string.find(sub, start) – same as above, but searches only
the slice string[start:]; the index returned still counts
from the start of the whole string
string.find(sub, start, end) – same as above, but searches
only the slice string[start:end]
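A small sketch of the three forms of find(). The sample string is made up for illustration and is not from the lecture.

```python
text = "one fish two fish"   # hypothetical sample string

first = text.find("fish")           # lowest index of "fish" in the whole string
second = text.find("fish", 5)       # search text[5:], index still counts from 0
missing = text.find("cat")          # not present
windowed = text.find("fish", 5, 12) # search only text[5:12], which is "ish two"
at_start = text.find("one")         # a match at the very beginning

print(first, second, missing, windowed, at_start)

# 0 is a valid "found" index, so compare against -1
# instead of writing "if text.find(...):".
if text.find("one") != -1:
    print("found")
```

Note that a match at the start of the string gives 0, which Python treats as false in an if test, so the comparison with -1 matters.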
Let's Review Some Python
string.rfind(sub) – returns the highest index where the
substring sub is found, or -1 if it is not found
string.rfind(sub, start) – same as above, but searches only
the slice string[start:]; the index returned still counts
from the start of the whole string
string.rfind(sub, start, end) – same as above, but searches
only the slice string[start:end]
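A small sketch of rfind(), including the "search backwards from a known position" pattern. Both sample strings are made up for illustration.

```python
text = "one fish two fish"   # hypothetical sample string

last = text.rfind("fish")            # highest index of "fish"
earlier = text.rfind("fish", 0, 13)  # search only text[0:13], so the last match is excluded

# Searching backwards from a known position: find the nearest ">" before a hit.
data = ">first\nacgt\n>second\nttga\n"   # hypothetical two-record file contents
hit = data.find("ttga")
name_start = data.rfind(">", 0, hit)

print(last, earlier, hit, name_start)
```

Passing the position of the hit as the end argument is what turns rfind() into "the closest match before this point".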
Clicker Question: are these programs equivalent?
String = "two plus two is four"
1
String.find("two")
2
String.rfind("two")
A: yes
B: no
Let's solve the problem!
def findSequence(seq):
    sequencesFile = "parasites.txt"
    file = open(sequencesFile, "r")
    sequences = file.read()
    file.close()
    seqloc = sequences.find(seq)
    if seqloc != -1:
        # Now, find the ">" with the name of the sequence
        nameloc = sequences.rfind(">", 0, seqloc)  # using rfind() here!!
        endline = sequences.find("\n", nameloc)
        print("Found in", sequences[nameloc:endline])
    else:
        print("Not found")
Why -1?
If .find or .rfind don’t find something, they return -1
If they return 0 or more, then it’s the index of where the
search string is found.
Note: last week we saw the urllib module
It contains a method that lets you download a file from the
internet
How might you modify your program to first download the
file from the internet prior to opening it?
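One possible answer, as a sketch only. It assumes Python 3, where the download function lives in urllib.request, and it assumes the file is UTF-8 text. The names fetch_text and find_sequence_name are made up here, and the URL in the usage comment is a placeholder, not a real data source.

```python
import urllib.request

def fetch_text(url):
    """Download the resource at url and return it as one big string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")   # assumes UTF-8 text

def find_sequence_name(sequences, seq):
    """Return the '>' header line of the record containing seq, or None."""
    seqloc = sequences.find(seq)
    if seqloc == -1:
        return None
    nameloc = sequences.rfind(">", 0, seqloc)    # work backwards to the header
    endline = sequences.find("\n", nameloc)      # work forwards to the end of line
    return sequences[nameloc:endline]

# Hypothetical usage (placeholder URL):
# sequences = fetch_text("http://example.com/parasites.txt")
# print(find_sequence_name(sequences, "tagatgtcagattgagc"))
```

Separating the download from the search means the search can be tried on any string, whether it came from a file or from the internet.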
Running the program
>>> findSequence("tagatgtcagattgagcacgatgatcgattgacc")
Found in >Schisto unique AA825099
>>> findSequence("agtcactgtctggttgaaagtgaatgcttccaccgatt")
Found in >Schisto unique mancons0736
One More Note on Parsing
We saw how to read a file as a string or list of strings
We saw how to leverage how data was structured to
find specific information we were interested in
What if there are many pieces we want to extract?
Revisiting Split
String.split(delimiter) breaks the string String into parts,
separated by the delimiter
print("a b c d".split(" "))
Would print: ['a', 'b', 'c', 'd']
• Some quirky cases for string.split()
• Explained in pre lab 10
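A few cases of split() that commonly surprise people, as a sketch. Whether these are the same cases covered in pre lab 10 is an assumption.

```python
double_space = "a  b".split(" ")     # two spaces in a row give an empty string
no_argument = "a  b".split()         # no delimiter: runs of whitespace count as one
empty_with_delim = "".split(" ")     # empty string, explicit delimiter
empty_no_delim = "".split()          # empty string, no delimiter
mixed = "line one\nline two".split() # no delimiter also splits on newlines

print(double_space)
print(no_argument)
print(empty_with_delim)
print(empty_no_delim)
print(mixed)
```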
Why is this useful?
When reading in a file, we may have many interesting
data items on a given line (or in the file)
Example: Lab 10
How to glue everything together
Step 1) get some interesting data
Step 2) open the file
Step 3) read the data from the file, either as one large
string or a list of strings
Step 4) break this string (or list of strings) into the data
we want (rfind, find, split)
Abstract Example
Getting values from a text file
text = file.read()
lines = text.split('\n')          # list of strings
for element in lines:
    items = element.split(' ')    # list of strings
Concrete Example
foo = "bab cad eag"
elem = foo.split(" ")
for i in elem:
    print(i.split("a"))
['b', 'b']
['c', 'd']
['e', 'g']
CQ: How can I parse all the words in a file?
Assume we have read the file in as one big string (we
used file.read()) and the file contains no punctuation
A) first split on "\n" and for each element in the result,
we split on " "
B) only split on " "
Concrete Clicker Example
text.txt:
This is
a file
file = open("text.txt", "r")
content = file.read()
line = content.split("\n")
for i in line:
    print(i.split(" "))
Prints:
['This', 'is']
['a', 'file']
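A sketch comparing the two strategies on the same two-line content, using a string in place of the file so it runs on its own. The variable names are made up for illustration.

```python
content = "This is\na file"   # what file.read() would return for text.txt

# Strategy B: split on " " only. The newline stays glued between two words.
only_spaces = content.split(" ")

# Strategy A: split on "\n" first, then split each line on " ".
words = []
for line in content.split("\n"):
    for word in line.split(" "):
        if word != "":            # skip empty pieces from blank lines or double spaces
            words.append(word)

print(only_spaces)
print(words)
```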
Example: Get the temperature
The weather is always available on the Internet.
Can we write a function that takes the current
temperature out of a source like
http://www.ajc.com/weather or
http://www.weather.com?
The Internet is mostly text
Web pages are actually text in the format called HTML
(HyperText Markup Language)
HTML isn’t a programming language,
it’s an encoding language.
It defines a set of meanings for certain characters, but
one can’t program in it.
We can ignore the HTML meanings for now, and just
look at patterns in the text.
Where’s the temperature?
The word “temperature” doesn’t really show up.
But the temperature always follows the word “Currently”,
and always comes before the “<b>°</b>”
<td ><img src="/sharedlocal/weather/images/ps.gif" width="48" height="48" border="0"><font size=2><br></font><font size="-1" face="Arial, Helvetica, sans-serif"><b>Currently</b><br>
Partly sunny<br>
<font size="+2">54<b>°</b></font><font face="Arial, Helvetica, sans-serif" size="+1">F</font></font></td>
</tr>
We can use the same algorithm we’ve seen previously
Grab the content out of a file in a big string.
We’ve saved the HTML page previously.
We’ve seen how to grab it directly.
Find the starting indicator (“Currently”)
Find the ending indicator (“<b>°”)
Read the characters just before the ending indicator, back to the previous “>”
def findTemperature():
    weatherFile = "ajc-weather.html"
    file = open(weatherFile, "r")
    weather = file.read()
    file.close()
    # Find the Temperature
    curloc = weather.find("Currently")
    if curloc != -1:
        # Now, find the "<b>°" following the temp
        temploc = weather.find("<b>°", curloc)
        tempstart = weather.rfind(">", 0, temploc)
        print("Current temperature:", weather[tempstart+1:temploc])
    else:
        print("Can't find the temp")
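The same idea as a sketch that works on a string and returns the value, so it can be tried without a saved page. The name extract_temperature and the extra check on the ending indicator are additions for illustration. The sample string is a trimmed version of the HTML shown earlier.

```python
def extract_temperature(html):
    """Return the text between the last '>' and '<b>°' after 'Currently', or None."""
    curloc = html.find("Currently")
    if curloc == -1:
        return None
    temploc = html.find("<b>°", curloc)
    if temploc == -1:              # extra check: ending indicator missing
        return None
    tempstart = html.rfind(">", 0, temploc)
    return html[tempstart + 1:temploc]

# Trimmed sample of the page text
sample = '<b>Currently</b><br>Partly sunny<br><font size="+2">54<b>°</b></font>'
print(extract_temperature(sample))
print(extract_temperature("no weather here"))
```

The result is still a string; converting it with int() would be needed before doing arithmetic on it.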
Homework
Email your group members
Read through the project 4 description when it
becomes available
Announcements
Dictionaries in Python
Useful Analogy: an actual Dictionary!
English dictionaries provide an association between a
Word and a Definition
We use the Word to look up the Definition
Given a definition it would be very hard to look up the
word
Dictionaries in Python
Much like a dictionary for the English language, Python
dictionaries create an association between a key and a
value
Key corresponds to a Word in our analogy
Value corresponds to a Definition
Dictionary Syntax
A dictionary is a collection of elements
Each element is a key/value pair
key : value
Just like a list is defined by [ ], a dictionary is defined by
{ }
{'key1':value1, 'key2':value2, 'key3':value3}
Keys
A key can be any immutable type (we will consider two
types)
Strings and Integers
Much like the [index] is used to select out an element
from a list, for a dictionary we use [key]
A = {'key1':value1, 'key2':value2, 'key3':value3}
print(A['key2'])
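A runnable sketch of the syntax above with concrete values. The data is made up for illustration.

```python
# String keys
ages = {"alice": 30, "bob": 25}
# Integer keys
names = {1: "one", 2: "two"}

bob_age = ages["bob"]     # select by key, much like list[index]
second = names[2]

print(bob_age, second, len(ages))
```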
Example: Simple Phone Book
phoneBook = {'Luke' : '123 4567',
             'Dr. Martino' : '456 7890'}
names are keys, phone numbers are values
def lookup(key):
    return phoneBook[key]
lookup('Dr. Martino')
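Looking up a name that is not in the phone book with phoneBook[key] raises a KeyError. A sketch of two ways to handle that case. The name safe_lookup and the "unknown" default are made up for illustration.

```python
phoneBook = {"Luke": "123 4567", "Dr. Martino": "456 7890"}

def safe_lookup(key):
    # Check for the key first, so a missing name does not raise KeyError.
    if key in phoneBook:
        return phoneBook[key]
    return None

found = safe_lookup("Dr. Martino")
not_found = safe_lookup("Nobody")

# The built-in get() method does the same check and takes a default value.
with_default = phoneBook.get("Nobody", "unknown")

print(found, not_found, with_default)
```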
Clicker Question: are these programs equivalent?
1
A = ['mike', 'mary', 'marty']
print(A[1])
2
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A: yes
B: no
Clicker Question: are these programs equivalent?
1
A = ['mike', 'mary', 'marty']
print(A[1])
2
A = {1:'mary', 2:'marty', 0:'mike'}
print(A[1])
A: yes
B: no
Key Differences from Lists
Lists are ordered
Index is implicit based on the list ordering
Dictionaries are not indexed by position (since Python 3.7
they remember insertion order, but lookups never use it)
Keys are specified and do not depend on order
Lists are useful for storing ordered data, dictionaries
are useful for storing relational data
Motivating example from book: databases!
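A sketch of the difference in practice. The names are made up for illustration.

```python
as_list = ["mike", "mary", "marty"]
as_dict = {0: "mike", 1: "mary", 2: "marty"}
reordered = {1: "mary", 2: "marty", 0: "mike"}

# Order matters for list equality, but not for dictionary equality.
same_dicts = (as_dict == reordered)
same_lists = (["mike", "mary"] == ["mary", "mike"])

# Deleting from a list shifts every later index; deleting a key leaves other keys alone.
del as_list[0]
del as_dict[0]
list_at_1 = as_list[1]
dict_at_1 = as_dict[1]

print(same_dicts, same_lists, list_at_1, dict_at_1)
```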
Updating a Dictionary
Much like a list we can assign to a dictionary
Abstract:
dictionary[key] = newValue
Concrete Example:
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A[1] = 'alex'
print(A[1])
Adding to a Dictionary
Much like a list we can append to a dictionary
Abstract:
dictionary[newKey] = newValue
Concrete Example:
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A[3] = 'alex'
print(A)
{0: 'mike', 1: 'mary', 2: 'marty', 3: 'alex'}
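Updating and adding use the same syntax; what happens depends only on whether the key already exists. A sketch, with the value 'sam' made up for illustration:

```python
A = {0: "mike", 1: "mary", 2: "marty"}

A[1] = "alex"               # key 1 exists: the value is replaced
size_after_update = len(A)

A[3] = "sam"                # key 3 is new: an entry is added
size_after_add = len(A)

print(size_after_update, size_after_add, A)
```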
Clicker Question: What is the output of this code?
A = {0:'mike', 1:'mary', 2:'marty',
     'marty':2, 'mike':0, 'mary':1}
A[3] = 'mary'
A['mary'] = 5
A[2] = A[0] + A[1]
A: {'mike': 0, 'marty': 2, 3: 'mary', 'mary': 5, 2: 'mikemary',
1: 'mary', 0: 'mike'}
B: {'mike': 0, 'marty': 2, 'mary': 3, 'mary': 5, 2: 'mikemary',
1: 'mary', 0: 'mike'}
C: {'mike': 0, 'marty': 2, 'mary': 3, 'mary': 5, 2: 1,
1: 'mary', 0: 'mike'}
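Printing a Dictionary
A = {0:'mike', 1:'mary', 2:'marty'}
for k in A:
    print(k)
Prints: 0
1
2
(Python 3.7+ iterates in insertion order; older versions may
print the keys in a different order, such as 2, 1, 0)
A = {0:'mike', 1:'mary', 2:'marty'}
for k, v in A.items():
    print(k, ":", v)
Prints: 0 : mike
1 : mary
2 : marty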
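If the output should come out in key order regardless of how the dictionary was built, one option is to sort the keys first. A sketch, with the dictionary deliberately created out of order:

```python
A = {2: "marty", 0: "mike", 1: "mary"}

lines = []
for k in sorted(A):              # sorted() returns the keys as a sorted list
    lines.append(str(k) + " : " + A[k])

for line in lines:
    print(line)
```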
Project 4:
Frequency Analysis Intuition
We can leverage a dictionary to calculate the number
of times a particular letter occurs in a message
We can use characters as the keys
The number of times that character occurs is the value
Increment the value each time we see a character
Initially the value starts at 0
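A minimal sketch of the counting idea described above. It is not meant as the Project 4 solution; the function name letter_counts and the sample message are made up for illustration.

```python
def letter_counts(message):
    counts = {}
    for ch in message:
        if ch not in counts:
            counts[ch] = 0              # first time we see ch: start at 0
        counts[ch] = counts[ch] + 1     # increment each time we see ch
    return counts

counts = letter_counts("hello")
print(counts)
```

The "not in" check is needed because counts[ch] would raise a KeyError for a character that has not been seen yet.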
Some Additional Notation:
Pairs in Python
We can create pairs in Python
Example: pair = ('name', 3)
Such pairs are called tuples (see page 291)
Tuples support the [] for selecting their elements
Tuples are immutable (like strings)
Further reading (section 5.3):
http://docs.python.org/tutorial/datastructures.html#tuples-and-sequences
Tuples
We can think of a tuple as an immutable list
They do not support assignment
Example:
A = ('me', 5, 32, 'joe')
print(A[0])
print(A[3])
A[2] = 4    # <--- this raises a TypeError
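A sketch showing the error being caught, and one way to get a changed copy by building a new tuple instead of assigning into the old one:

```python
A = ("me", 5, 32, "joe")

caught = False
try:
    A[2] = 4                      # tuples do not support item assignment
except TypeError as err:
    caught = True
    print("error:", err)

# Build a new tuple from slices instead; A itself is unchanged.
B = A[:2] + (4,) + A[3:]
print(A)
print(B)
```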
Creating a dictionary from a list
Python provides the dict function to create a dictionary
out of a list of pairs
Example: dict([(0, 'mike'),(1, 'mary'),(2, 'marty')])
Why do I care?
We can leverage list creation shortcuts (list comprehensions)
to populate dictionaries!
Example: dict([(x, x**2) for x in range(10)])
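A sketch that runs both dict() examples and adds one more built the same way. The letters list is made up for illustration.

```python
names = dict([(0, "mike"), (1, "mary"), (2, "marty")])
squares = dict([(x, x**2) for x in range(10)])

# Same shortcut, mapping each letter to its position in a list.
letters = ["a", "b", "c"]
positions = dict([(letters[i], i) for i in range(len(letters))])

print(names[1], squares[3], positions["c"])
```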