print - Purdue CS Wiki/Application server

Announcements
 All groups have been assigned
 Homework:
 By this evening email everyone in your group and set up
a meeting time to discuss project 4
 Project 4 will be released tomorrow
 You will have roughly 3 weeks to work on it
How do I work in a team?
 Communication
 Teams that do not communicate well do poorly on the
project
 Understanding the assignment
 Teams that sit down and go over the assignment together
do well
 Battle plan
 Outline the project in your own English text
 Code together
 Difficult parts of the project are best done together
Parsing Text
 The vast majority of the information present on the
internet is in text form
 Data, webpages, etc
 We want to transform the data into a more usable form
 Examples we have seen thus far:
 Encoding of a matrix
 Encoding of a tree
 Project 3, changing text (encrypting and decrypting)
Example: Finding a nucleotide
sequence
 We can find DNA sequences of parasites on the
internet (typically in databases)
 Problem: we want to know if a sequence of nucleotides
is in a particular parasite
 We not only want to know “yes” or “no” but which parasite
What the data looks like
>Schisto unique AA825099
gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga
gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg
>Schisto unique mancons0736
ttctcgctcacactagaagcaagacaatttacactattattattattatt
accattattattattattattactattattattattattactattattta
ctacgtcgctttttcactccctttattctcaaattgtgtatccttccttt
How are we going to do it?
 First, we get the sequences in a big string.
 Next, we find where the small subsequence is in the
big string.
 From there, we need to work backwards until we find
“>” which is the beginning of the line with the
sequence name.
 From there, we need to work forwards to the end of the
line. From “>” to the end of the line is the name of the
sequence
 Yes, this is hard to get right.
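Let's Review Some Python
 string.find(sub) – returns the lowest index where the
substring sub is found, or -1 if it is not found
 string.find(sub, start) – same as above, except the search
is limited to the slice string[start:]
 string.find(sub, start, end) – same as above, except the
search is limited to the slice string[start:end]
 In every case the returned index counts from the start of
the whole string, not from the start of the slice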
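The three forms of find() can be tried on a short string. This is only a small sketch: the variable name message and its contents are made up for illustration.

```python
# Hypothetical example string, chosen only to illustrate find().
message = "the cat and the hat"

first = message.find("the")            # search the whole string
second = message.find("the", 1)        # search only message[1:]
limited = message.find("the", 1, 10)   # search only message[1:10]
missing = message.find("dog")          # the substring is absent

print(first, second, limited, missing)  # 0 12 -1 -1
```

Note that the second call reports 12, the position in the whole string, even though the search started at index 1.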
Let's Review Some Python
 string.rfind(sub) – returns the highest index where the
substring sub is found, or -1 if it is not found
 string.rfind(sub, start) – same as above, except the search
is limited to the slice string[start:]
 string.rfind(sub, start, end) – same as above, except the
search is limited to the slice string[start:end]
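A short sketch of how rfind() differs from find() when the substring occurs more than once. The string and variable names are invented for this example.

```python
# Hypothetical example string, chosen only to illustrate rfind().
message = "the cat and the hat"

lowest = message.find("at")             # first "at" is inside "cat"
highest = message.rfind("at")           # last "at" is inside "hat"
before_12 = message.rfind("the", 0, 12) # search only message[0:12]
missing = message.rfind("dog")          # the substring is absent

print(lowest, highest, before_12, missing)  # 5 17 0 -1
```

Limiting the end of the search, as in the third call, is the trick used later to look backwards from a known position.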
Clicker Question: are these
programs equivalent?
String = "two plus two is four"
1
String.find("two")
2
String.rfind("two")
A: yes
B: no
Let's solve the problem!
def findSequence(seq):
    sequencesFile = "parasites.txt"
    file = open(sequencesFile, "r")
    sequences = file.read()
    file.close()
    seqloc = sequences.find(seq)
    if seqloc != -1:
        # Now, find the ">" with the name of the sequence
        nameloc = sequences.rfind(">", 0, seqloc)  # using rfind() here!!
        endline = sequences.find("\n", nameloc)
        print("Found in", sequences[nameloc:endline])
    else:
        print("Not found")
Why -1?
 If .find or .rfind don’t find something, they return -1
 If they return 0 or more, then it’s the index of where the
search string is found.
 Note: last week we saw the urllib module
 It contains a function that lets you download a file from the
internet
 How might you modify your program to first download the
file from the internet prior to opening it?
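One possible answer, sketched for Python 3. The slides do not say which urllib function was shown, so the use of urllib.request.urlretrieve here is an assumption, and the function names download_file and read_downloaded are made up for this example.

```python
import urllib.request


def download_file(url, local_name):
    """Save whatever is at url into the local file local_name."""
    urllib.request.urlretrieve(url, local_name)
    return local_name


def read_downloaded(url, local_name="parasites.txt"):
    """Download url first, then read the saved copy as one big string."""
    download_file(url, local_name)
    with open(local_name, "r") as f:
        return f.read()
```

findSequence could then search the string returned by read_downloaded instead of opening a file that has to exist beforehand.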
Running the program
>>> findSequence("tagatgtcagattgagcacgatgatcgattgacc")
Found in >Schisto unique AA825099
>>> findSequence("agtcactgtctggttgaaagtgaatgcttccaccgatt")
Found in >Schisto unique mancons0736
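To experiment without a parasites.txt file, the same search can be run on a string. The sketch below is a variant, not the function from the slide: the name find_sequence_name is invented, it returns the name instead of printing it, and the sample text is just the two records from the "What the data looks like" slide.

```python
def find_sequence_name(sequences, seq):
    """Return the '>' name line of the record containing seq, or None."""
    seqloc = sequences.find(seq)
    if seqloc == -1:
        return None
    # Look backwards from the match for the start of the name line.
    nameloc = sequences.rfind(">", 0, seqloc)
    if nameloc == -1:
        return None
    # Look forwards from the ">" for the end of that line.
    endline = sequences.find("\n", nameloc)
    if endline == -1:
        endline = len(sequences)
    return sequences[nameloc:endline]


sample = (
    ">Schisto unique AA825099\n"
    "gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga\n"
    "gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg\n"
    ">Schisto unique mancons0736\n"
    "ttctcgctcacactagaagcaagacaatttacactattattattattatt\n"
    "accattattattattattattactattattattattattactattattta\n"
    "ctacgtcgctttttcactccctttattctcaaattgtgtatccttccttt\n"
)

print(find_sequence_name(sample, "tagatgtcagattgagcacgatgatcgattgacc"))
print(find_sequence_name(sample, "ctacgtcgctttttcac"))
print(find_sequence_name(sample, "zzzz"))
```

One thing to watch for: the sequence data is wrapped over several lines, so a subsequence that straddles a line break (for example "cgacgagatgcg" in the first record) is not found, because the newline character sits in the middle of it.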
One More Note on Parsing
 We saw how to read a file as a string or list of strings
 We saw how to leverage how data was structured to
find specific information we were interested in
 What if there are many pieces we want to extract?
Revisiting Split
 String.split(delimiter) breaks the string String into parts,
separated by the delimiter
 print("a b c d".split(" "))
Would print: ['a', 'b', 'c', 'd']
• Some quirky cases for string.split()
• Explained in pre lab 10
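The pre lab is not reproduced here, so which quirks it covers is an assumption. Two that commonly surprise people are repeated delimiters and empty strings, sketched below with made-up variable names.

```python
spaced = "a  b c"   # note the two spaces between a and b

with_delimiter = spaced.split(" ")   # every single space is a cut point
without_delimiter = spaced.split()   # runs of whitespace count as one cut
empty_with = "".split(" ")
empty_without = "".split()

print(with_delimiter)     # ['a', '', 'b', 'c']
print(without_delimiter)  # ['a', 'b', 'c']
print(empty_with)         # ['']
print(empty_without)      # []
```

With an explicit delimiter, two delimiters in a row produce an empty string in the result. Calling split() with no argument avoids that.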
Why is this useful?
 When reading in a file, we may have many interesting
data items on a given line (or in the file)
 Example: Lab 10
How to glue everything
together
 Step 1) get some interesting data
 Step 2) open the file
 Step 3) read the data from the file, either as one large
string or a list of strings
 Step 4) break this string (or list of strings) into the data
we want (rfind, find, split)
Abstract Example
 Getting values from a text file
text = file.read()
lines = text.split('\n')        # list of strings
for element in lines:
    items = element.split(' ')  # list of strings
Concrete Example
foo = "bab cad eag"
elem = foo.split(" ")
for i in elem:
    print(i.split("a"))
Prints:
['b', 'b']
['c', 'd']
['e', 'g']
CQ: How can I parse all the
words in a file?
 Assume we have read the file in as one big string (we
used file.read()) and the file contains no punctuation
 A) first split on “\n” and for each element in the result,
we split on “ “
 B) only split on “ “
Concrete Clicker Example
file = open("text.txt", "r")
content = file.read()
line = content.split("\n")
for i in line:
    print(i.split(" "))
text.txt contains:
This is
a file
Prints:
['This', 'is']
['a', 'file']
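A small sketch of what changes when the file ends with a newline, as most text files do. The string content stands in for the result of file.read(). Its value is an assumption based on the text.txt example, and the variable names are made up.

```python
# Stand-in for file.read() on text.txt, with a trailing newline.
content = "This is\na file\n"

# Splitting on "\n" leaves an empty string after the last newline.
by_lines = [line.split(" ") for line in content.split("\n")]

# splitlines() does not produce that trailing empty string.
clean_lines = [line.split() for line in content.splitlines()]

# split() with no argument treats spaces and newlines alike.
all_words = content.split()

print(by_lines)     # [['This', 'is'], ['a', 'file'], ['']]
print(clean_lines)  # [['This', 'is'], ['a', 'file']]
print(all_words)    # ['This', 'is', 'a', 'file']
```

Which form is best depends on whether the line structure matters or only the words do.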
Example: Get the temperature
 The weather is always available on the Internet.
 Can we write a function that takes the current
temperature out of a source like
 http://www.ajc.com/weather or
 http://www.weather.com?
The Internet is mostly text
 Web pages are actually text in the format called HTML
(HyperText Markup Language)
 HTML isn’t a programming language,
it’s an encoding language.
 It defines a set of meanings for certain characters, but
one can’t program in it.
 We can ignore the HTML meanings for now, and just
look at patterns in the text.
Where’s the temperature?
 The word “temperature” doesn’t really show up.
 But the temperature always follows the word “Currently”,
and always comes before the “<b>&deg;</b>”
<td ><img src="/sharedlocal/weather/images/ps.gif" width="48" height="48"
border="0"><font size=2><br></font><font size="-1"
face="Arial, Helvetica, sans-serif"><b>Currently</b><br>
Partly sunny<br>
<font size="+2">54<b>&deg;</b></font><font
face="Arial, Helvetica, sans-serif" size="+1">F</font></font></td>
</tr>
We can use the same algorithm we’ve
seen previously
 Grab the content out of a file in a big string.
 We’ve saved the HTML page previously.
 We’ve seen how to grab it directly.
 Find the starting indicator (“Currently”)
 Find the ending indicator (“<b>&deg;”)
 Read the characters just before the ending indicator, back to the nearest “>”
def findTemperature():
    weatherFile = "ajc-weather.html"
    file = open(weatherFile, "r")
    weather = file.read()
    file.close()
    # Find the Temperature
    curloc = weather.find("Currently")
    if curloc != -1:
        # Now, find the "<b>&deg;" following the temp
        temploc = weather.find("<b>&deg;", curloc)
        tempstart = weather.rfind(">", 0, temploc)
        print("Current temperature:", weather[tempstart+1:temploc])
    else:
        print("Can't find the temp")
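The same steps can be tried on a string without a saved HTML page. This is a sketch under assumptions: the name extract_temperature is invented, it returns a number instead of printing, and sample_html is a shortened piece of the page fragment shown earlier.

```python
def extract_temperature(weather):
    """Return the temperature as an int, or None if it cannot be found."""
    curloc = weather.find("Currently")
    if curloc == -1:
        return None
    # The ending indicator must come after "Currently".
    temploc = weather.find("<b>&deg;", curloc)
    if temploc == -1:
        return None
    # Work backwards to the ">" that closes the previous tag.
    tempstart = weather.rfind(">", 0, temploc)
    text = weather[tempstart + 1:temploc].strip()
    if text.lstrip("-").isdigit():
        return int(text)
    return None


sample_html = (
    '<b>Currently</b><br>\n'
    'Partly sunny<br>\n'
    '<font size="+2">54<b>&deg;</b></font>'
)

print(extract_temperature(sample_html))          # 54
print(extract_temperature("<p>no weather</p>"))  # None
```

Returning None when a marker is missing lets the caller decide what message to show.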
Homework
 Email your group members
 Read through the project 4 description when it
becomes available
Announcements
Dictionaries in Python
 Useful Analogy: an actual Dictionary!
 English dictionaries provide an association between a
Word and a Definition
 We use the Word to look up the Definition
 Given a definition it would be very hard to look up the
word
Dictionaries in Python
 Much like a dictionary for the English language, Python
dictionaries create an association between a key and a
value
 Key corresponds to a Word in our analogy
 Value corresponds to a Definition
Dictionary Syntax
 A dictionary is a collection of elements
 Each element is a key/value pair
key : value
 Just like a list is defined by [ ] a dictionary is defined by
{}
{'key1':value1, 'key2':value2, 'key3':value3}
Keys
 A key can be any immutable type (we will consider two
types)
 Strings and Integers
 Much like the [index] is used to select out an element
from a list, for a dictionary we use [key]
A = {'key1':value1, 'key2':value2, 'key3':value3}
print(A['key2'])
Example: Simple Phone Book
 phoneBook = {'Luke' : '123 4567',
              'Dr. Martino' : '456 7890'}
names are keys, phone numbers are values
def lookup(key):
    return phoneBook[key]
lookup('Dr. Martino')
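Looking up a key that is not in the dictionary raises a KeyError. The sketch below shows two common ways to handle that. The snake_case names are this example's own, and returning None for a missing name is a design choice, not something the slides prescribe.

```python
phone_book = {"Luke": "123 4567", "Dr. Martino": "456 7890"}


def lookup(name):
    """Return the number stored for name, or None if name is not a key."""
    if name in phone_book:   # "in" checks the keys, not the values
        return phone_book[name]
    return None


print(lookup("Dr. Martino"))                 # 456 7890
print(lookup("Nobody"))                      # None
print(phone_book.get("Nobody", "unknown"))   # unknown
```

The get() method does the same check in one step and lets you choose the fallback value.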
Clicker Question: are these
programs equivalent?
1
A = ['mike', 'mary', 'marty']
print(A[1])
2
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A: yes
B: no
Clicker Question: are these
programs equivalent?
1
A = ['mike', 'mary', 'marty']
print(A[1])
2
A = {1:'mary', 2:'marty', 0:'mike'}
print(A[1])
A: yes
B: no
Key Differences from Lists
 Lists are ordered
 Index is implicit based on the list ordering
 Dictionaries are unordered in the sense that values are
looked up by key, never by position
 Keys are specified and do not depend on order
 Lists are useful for storing ordered data, dictionaries
are useful for storing relational data
 Motivating example from book: databases!
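A short sketch of the difference, with made-up variable names: rearranging a list changes what each index means, while writing the same key/value pairs in a different order gives an equal dictionary.

```python
names_list = ["mike", "mary", "marty"]
reordered_list = ["mary", "marty", "mike"]

names_dict = {0: "mike", 1: "mary", 2: "marty"}
reordered_dict = {1: "mary", 2: "marty", 0: "mike"}

print(names_list == reordered_list)      # False
print(names_dict == reordered_dict)      # True
print(names_list[1], reordered_list[1])  # mary marty
print(names_dict[1], reordered_dict[1])  # mary mary
```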
Updating a Dictionary
 Much like a list we can assign to a dictionary
Abstract:
dictionary[key] = newValue
Concrete Example:
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A[1] = 'alex'
print(A[1])
Adding to a Dictionary
 Much like a list we can append to a dictionary
Abstract:
dictionary[newKey] = newValue
Concrete Example:
A = {0:'mike', 1:'mary', 2:'marty'}
print(A[1])
A[3] = 'alex'
print(A)
{0: 'mike', 1: 'mary', 2: 'marty', 3: 'alex'}
Clicker Question: What is the
output of this code?
A = {0:'mike', 1:'mary', 2:'marty',
     'marty':2, 'mike':0, 'mary':1}
A[3] = 'mary'
A['mary'] = 5
A[2] = A[0] + A[1]
print(A)
(Ignore the order in which the entries are listed.)
A: {'mike': 0, 'marty': 2, 3: 'mary', 'mary': 5, 2: 'mikemary',
1: 'mary', 0: 'mike'}
B: {'mike': 0, 'marty': 2, 'mary':3, 'mary': 5, 2: 'mikemary',
1: 'mary', 0: 'mike'}
C: {'mike': 0, 'marty': 2, 'mary':3, 'mary': 5, 2:1,
1: 'mary', 0: 'mike'}
Printing a Dictionary
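A = {0:'mike', 1:'mary', 2:'marty'}
for k in A:
    print(k)
Prints: 0
1
2
A = {0:'mike', 1:'mary', 2:'marty'}
for k, v in A.items():
    print(k, ":", v)
Prints: 0 : mike
1 : mary
2 : marty
(Python 3.7 and later loop over a dictionary in insertion order;
older versions made no promise about the order)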
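When the printing order matters, it is safer to ask for it explicitly than to rely on how the dictionary was built. A minimal sketch, assuming the keys can be compared with each other (here they are all integers):

```python
# Keys deliberately written out of order.
A = {2: "marty", 0: "mike", 1: "mary"}

# sorted(A) returns the keys as a sorted list.
ordered = [(k, A[k]) for k in sorted(A)]

for k, v in ordered:
    print(k, ":", v)
```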
Project 4:
Frequency Analysis Intuition
 We can leverage a dictionary to calculate the number
of times a particular letter occurs in a message
 We can use characters as the keys
 The number of times that character occurs is the value
 Increment the value each time we see a character
 Initially the value starts at 0
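The idea above can be sketched in a few lines. This is an illustration of the intuition only, not the required design for Project 4, and the function name count_letters is made up.

```python
def count_letters(message):
    """Return a dictionary mapping each character to how often it occurs."""
    counts = {}
    for ch in message:
        if ch not in counts:
            counts[ch] = 0              # first sighting: start at 0
        counts[ch] = counts[ch] + 1     # then increment
    return counts


print(count_letters("hello"))  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```

The check with "not in" is needed because reading counts[ch] for a character that has not been seen yet would raise a KeyError.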
Some Additional Notation:
Pairs in Python
 We can create pairs in python
 Example: pair = ('name', 3)
 Such pairs are called tuples (see page 291)
 Tuples support the [] for selecting their elements
 Tuples are immutable (like strings)
 Further reading (section 5.3):
 http://docs.python.org/tutorial/datastructures.html#tuples-and-sequences
Tuples
 We can think of tuples as an immutable list
 They do not support assignment
 Example:
A = ('me', 5, 32, 'joe')
print(A[0])
print(A[3])
A[2] = 4  # <--- this raises a TypeError
Creating a dictionary from a
list
 Python provides the dict function to create a dictionary
out of a list of pairs
Example: dict([(0, 'mike'), (1, 'mary'), (2, 'marty')])
 Why do I care?
 We can leverage list creation shortcuts to populate
dictionaries!
Example: dict([(x, x**2) for x in range(10)])
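Both examples can be run as written. The sketch below stores the results in variables so they can be inspected. The third dictionary, lengths, is a made-up extra to show that the same pattern works with string keys.

```python
pairs = [(0, "mike"), (1, "mary"), (2, "marty")]
names = dict(pairs)

squares = dict([(x, x**2) for x in range(10)])

# Hypothetical extra: map each name to its length.
lengths = dict([(name, len(name)) for name in ["mike", "mary", "marty"]])

print(names[1])     # mary
print(squares[9])   # 81
print(lengths)      # {'mike': 4, 'mary': 4, 'marty': 5}
```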