Python - BIOTEC - Biotechnology Center TU Dresden

Download Report

Transcript Python - BIOTEC - Biotechnology Center TU Dresden

Programming in Python
Michael Schroeder
Andreas Henschel
{ms, ah}@biotec.tu-dresden.de
1
Motivation
All these sequences are winged helix DNA binding domains. How can we group them into families?
mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpep
mktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlagg
kaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqee
mrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvptt
gqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkr
kgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkr
dvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygt
dikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaa
iderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklg
elqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqec
mntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrf
lfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlp
nypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaq
tdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkg
mskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyikn
aitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkrit
miimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndl
aqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwets
eekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnf
hmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqts
skvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglveksgdgliitgeaqdiisnaatlyaqenapellk
sprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglq
nrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrdit
lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhl
lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppv
egldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevp
iseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelr
kyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeiv
pmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysyla
gspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaist
eryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftps
fsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhr
qalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgn
vpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlv
irtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksad
gsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr
2
Motivation:
Let's rebuild SCOP families
• Given a SCOP superfamily and its sequences, how can
we divide it into families?
• First, we need dynamic programming to determine the
sequence similarity
• Then we do the following:
– For all pairs of sequences, call the sequence
similarity algorithm and record the similarity into a
distance matrix
– Next, run hierarchical clustering to cluster the
sequences.
3
Python for Bioinformatics
Lecture 1: Datatypes and Loops
Slides derived from
Ian Holmes
Department of Statistics
University of Oxford
4
Goals of this course
• Concepts of computer programming
• Rudimentary Python (widely-used
language)
• Introduction to Bioinformatics file formats
• Practical data-handling algorithms
• Exposure to Bioinformatics software
5
Literature/Material
• Textbook: Python in a Nutshell, Alex Martelli
• Textbook: Python Cookbook, Alex Martelli, David
Ascher (both published by O'Reilly)
• Python Course in Bioinformatics, K. Schuerer/C.
Letondal, Pasteur University (pdf)
• a lot of online material (see course homepage
http://www.biotec.tu-dresden.de/schroeder/group/
teaching/bioinfo2/python.html)
6
Style of this lecture
• The color scheme for programs, output and text
files:
The main program
The program output
Files are
shown in
yellow
The filename
goes here
• Interaction with the Python shell: very handy for
quick tests. Helps beginners to overcome
physiological barrier: Go ahead, try things out!
Prompt, (python
expects input here)
>>> (Python Expression)
(immediate Python result)
Press Enter
7
General principles of programming
• Make incremental changes
• Test everything you do
– use the Python shell for testing
expressions/functions interactively
– the edit-run-revise cycle
• Write so that others can read it
– (when possible, write with others)
• Think before you write
• Use a good text editor (emacs)
8
Python/Emacs IDE
9
Python: Motivation
• Well suited for scripting (better syntax than Perl)
• However, capable of Object Orientation
• Hence complex data types and large projects
feasible, reuse of code (BioPython)
• Universal language, Applications in and beyond
bioinformatics: Amber, ProHit, PyRat, PyMOL,
Gene2EST/Google, CGI, Zope
• Compatible with most software technologies:
GUI, MPI, OpenGL, Corba, RDB
• Test complicated expressions in python shell
10
Python basics
• Basic syntax of a Python program:
Lines
# Elementary Python program
beginning
print "Hello World"
with "#" are
comments,
and are ignored
by Python
Single or double quotes
enclose a "string literal"
print statement tells Python to print the following stuff to the screen
Hello World
11
Variables
• We can tell Python to "remember" a particular
value, using the assignment operator "=":
x = 3
print x
x = "ACGCGT"
print x
Binding site for yeast
transcription factor MCB
3
ACGCGT
• The x is referred to as a "scalar variable".
Variable names can contain alphabetic characters, numbers
(but not at the start of the name), and underscore symbols "_"
12
Variables and Objects
• Everything in Python is an object
• An object models a real-world entity
• objects possess methods (also called functions)
that are typically applied to the object, possibly
parameterized
• objects can also possess variables, that
describe their state
Object
.
Method or variable
• e.g. x.upper()is a parameter-less method,
that works on the string object x
13
Arithmetic operations…
• Basic operators are + - / * %
x = 14
y = 3
print "Sum: ", x + y
print "Product: ", x * y
print "Remainder: ", x % y
Could write
x *= 2
Could write
x += 1
x = 5
print
x = x
print
x = x
print
"x started as", x
* 2
"Then x was", x
+ 1
"Finally x was" ,x
Sum: 17
Product: 42
Remainder: 2
x started as 5
Then x was 10
Finally x was 11
14
… Or interactively
>>> x = 14
>>> y = 3
>>> x + y
17
>>> x * y
42
>>> x % y
2
>>> x = 5
>>> print "x started as", x
x started as 5
>>> x *= 2
>>> print "Then x was", x
Then x was 10
>>> x += 1
>>> print "Finally x was", x
Finally x was 11
>>>
• This way, you can
use Python as a
calculator
• Can also use
+= -= /= *=
15
String operations
• Concatenation
+ +=
a = "pan"
b = "cake"
a = a + b
print a
a = "soap"
b = "dish"
a += b
print a
pancake
soapdish
• Can find the length of a string using the
function len(x)
mcb = "ACGCGT"
print "Length of %s is "%mcb,
len(mcb)
Length of ACGCGT is 6
16
String formatting
• Strings can be formatted with place holders for inserted
strings (%s) and numbers (%d for digits and %f for floats)
• Use Operator % on strings: Formatted String % Insertion Tuple
>>> "aaaa%saaaa%saaa"%("gcgcgc","tttt")
'aaaagcgcgcaaaattttaaa'
>>> "A range written like this: (%d - %d)" % (2,5)
'A range written like this: (2 - 5)'
>>> "Or with preceeding 0's: (%03d - %04d)" % (2,5)
"Or with preceeding 0's: (002 - 0005)"
>>> "Rounding floats %.3f" % math.pi
'Rounding floats 3.142'
>>> "Space holders: _%-7s_ and _%7s_" %("left", "right")
'Space holders: _left
_ and _ right_'
17
More string operations
Convert to upper case
Convert to lower case
Convert the string to a list
Reverse the list
Join all list members
x = "A simple sentence"
print x
print x.upper()
print x.lower()
xl=list(x)
xl.reverse()
print "".join(xl)
x = x.replace("i", "a")
print x
print len(x)
Translate "i"'s into "a"'s
Calculate the length of the string
A simple
A SIMPLE
a simple
ecnetnes
A sample
17
sentence
SENTENCE
sentence
elpmis A
sentence
18
Concatenating DNA fragments
dna1 = "accacgt"
dna2 = "taggtct"
print dna1 + dna2
accacgttaggtct
"Transcribing" DNA to RNA
dna = "accACgttAGGTct"
rna = dna.lower().replace("t", "u")
print rna
Make it all
lower case
DNA string is a mixture
of upper & lower case
Replace "t" with "u"
accacguuaggucu
19
Conditional blocks
• The ability to execute an action contingent on
some condition is what distinguishes a computer
from a calculator. In Python, this looks like this:
Consistent, level-wise
indenting important
These indentations
tell Python which
piece of code
is contingent on
the condition.
if condition:
action
else:
alternative
x = 149
y = 100
if x > y:
print x,"is greater than",y
else:
print x,"is less than", y
149 is greater than 100
20
Conditional operators
"does not equal"
• Numeric: > >= < <= != ==
x = 5 * 4
y = 17 + 3
if x == y: print x, "equals", y
Note that the test
for "x equals y" is
x==y, not x=y
20 equals 20
• The same operators work on strings as
alphabetic comparisons
Shorthand syntax for
assigning more than
one variable at a time
(x, y) = ("Apple", "Banana")
if y > x: print y, "after", x
Banana after Apple
21
Logical operators
• Logical operators: and and or
x = 222
if x % 2 == 0 and x % 3 == 0:
print x, "is an even multiple of 3"
222 is an even multiple of 3
• The keyword not is used to negate what follows. Thus
not x < y means the same as x >= y
• The keyword False (or the value zero) is used to
represent falsehood, while True (or any non-zero value,
e.g. 1) represents truth. Thus:
if True:
if False:
if -99:
print "True is true"
print "False is true"
print "-99 is true"
True is true
-99 is true
22
Loops
• Here's how to print out the numbers 0 to 9:
The indented code x = 0
while x < 10:
is repeatedly
print x,
executed as long
x+=1
as the condition
x<10 remains
true
0 1 2 3 4 5 6 7 8 9
Equivalent to
x = x + 1
• This is a while loop.
The code is executed while the condition is true.
23
A common kind of loop
• Let's dissect the code of the while loop again:
Initialisation
Test for completion
Continuation
x = 0
while x < 10:
print x,
x+=1
• Alternatively, the for loop construct iterates
through a list
Iteration variable Generates a list
[0,1, …,9]
for x in range(10):
print x,
24
For loop features
• Loops can be used with all iteratable types, ie.:
lists, strings, tuples, iterators, sets, file handlers
>>> for nucleotide in "actgc":
...
print nucleotide,
a c t g c
• Stepsizes can be specified with the 3. argument
of the slice constructor (negative values for
iterating backwards)
>>>
...
0 7
>>>
...
c g
for number in range(50)[::7]:
print number,
14 21 28 35 42 49
for nucleotide in "actgc"[::-1]:
print nucleotide,
t c a
25
Reading Data from Files
• To read from a file, we can conveniently iterate through it
linewise with a for-loop and the open function. Internally
a filehandle is maintained during the loop.
for line in open("sequence.txt"):
print line,
The comma prevents
print's automatic newline
This code snippet opens a file called
"sequence.txt" in the in the current directory,
and iterates through it line by line
sequence.txt
>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
26
Python for Bioinformatics
Lecture 2: Sequences and Lists
27
Summary: scalars and loops
•
•
•
•
•
•
•
Assignment operator x = 5
Arithmetic operations y = x * 3
String operations s = "Concatenating " + "strings"
Conditional tests if y > 10: print s
Logical operators if y > 10 and not s == "": print s
for x in range(10): print x
Loops
Reading a file for line in open("sequence.txt"):
print line,
28
Pattern-matching
• A very sophisticated kind of logical test is
to ask whether a string contains a pattern
• e.g. does a yeast promoter sequence
contain the MCB binding site, ACGCGT?
name = "YBR007C"
dna="TAATAAAAAACGCGTTGTCG"
if "ACGCGT" in dna:
print name, "has MCB!"
The pattern for the MCB binding site
20 bases upstream of
the yeast gene YBR007C
The membership operator in
YBR007C has MCB!
29
FASTA format
• A format for storing multiple named sequences
in a single file
>CG11604
Name of sequence is
preceded by > symbol
NB sequences can
span multiple lines
• This file contains 3' UTRs
for Drosophila genes CG11604,
CG11455 and CG11488
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT
30
Call this file fly3utr.txt
Printing all sequence names in a
FASTA database
for line in open("fly3utr.txt"):
if line.startswith(">"):
print line,
>CG11604
>CG11455
>CG11488
31
Finding all sequence lengths
The rstrip statement length=0
name=""
trims the white space
for line in open("/home/bioinf/ah/tmp/sequence.txt"):
characters
line=line.rstrip()
off the right end.
if line.startswith(">"):
Try it without this and
if name and length:
see what happens –
print name, length
and if you can work
name=line[1:]
out why
length=0
else:
length+=len(line)
>CG11604
print name, length
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT
CG11604 58
CG11455 83
CG11488 69
32
Reverse complementing DNA
• A common operation due to double-helix
symmetry of DNA
Start by making string lower case
again. This is generally good practice
Replace 'a' with 't', 'c' with 'g',
'g' with 'c' and 't' with 'a'
Reverse the list
def revcomp(dna):
replaced=list(dna.lower().
replace("a","x").replace("t","a").
replace("x", "t").replace("g","x").
replace("c","g").replace("x", "c"))
replaced.reverse()
return "".join(replaced)
print revcomp("accACgttAGgtct ")
agacctaacgtggt
33
Lists
• A list is a list of variables
nucleotides = ['a', 'c', 'g', 't']
print "Nucleotides: ", nucleotides
Nucleotides: ['a', 'c', 'g', 't']
• We can think of this as a list with 4 entries
a c g t
element 0
element 1 element 2
element 3
the list is the
set of all four elements
Note that the element
indices start at zero.
34
List literals
• There are several, equally valid ways to
assign an entire array at once.
This is the most common: a commaseparated list, delimited by squared brackets
a = [1,2,3,4,5]
print "a = ",a
b = ['a','c','g','t']
print "b = ",b
c = range(1,6)
print "c = ",c
d = "a c g t".split()
print "d = ", d
a
b
c
d
=
=
=
=
[1,2,3,4,5]
['a','c','g','t']
[1,2,3,4,5]
['a','c','g','t']
35
Accessing lists
• To access list elements, use square brackets
e.g. x[0] means "element zero of list x"
x = ['a', 'c', 'g', 't']
i=2
print x[0], x[i], x[-1]
a g t
• Remember, element indices start at zero!
• Negative indices refer to elements counting from
the end e.g. x[-1] means "last element of list
x"
36
List operations
• You can sort and reverse lists...
x = ['a', 't', 'g', 'c']
print "x =",x
x.sort()
print "x =",x
x.reverse()
print "x =",x
x = a t g c
x = a c g t
x = t g c a
• You can read the entire contents of a file
into an array (each line of the file becomes
an element of the array)
seqfile = open,
"C:/sequence.txt"
x = <FILE>
37
Applying Methods to Objects
• Instances of lists, strings, etc. are objects with
built-in methods
• Explore available methods using dir:
>>> dir("hello")
['__add__', … ,'__str__', 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index',
'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']
>>> help("hello".count)
(…)Return the number of occurrences of substring sub in
string S[start:end]. Optional arguments start and end are
interpreted as in slice notation.
>>> "hello".count("l")
2
String object
List of
applicable
methods
Method
. (dot) applies method to object
38
List operations
Multiplying lists with *
pop removes the last
element of a list
append adds an element
to the end of a list
concatenating lists with +
or +=
Removing the first
occurrence of an element
Position of an
element
>>>
>>>
[1,
>>>
x=[1,0]*5
x
0, 1, 0, 1, 0, 1, 0, 1, 0]
while 0 in x: print x.pop(),
0 1
>>>
[1]
>>>
>>>
[1,
>>>
>>>
[1,
>>>
>>>
[1,
0 1 0 1 0 1 0
x
x.append(2)
x
2]
x+=x
x
2, 1, 2]
x.remove(2)
x
1, 2]
>>> x.index(2)
2
39
for loop revisited
• Finding the total of a list of numbers:
for statement
loops through each
entry in a list
val = [4, 19, 1, 100, 125, 10]
total = 0
for x in val:
total += x
print total
259
• Equivalent to:
val = [4, 19, 1, 100, 125, 10]
total = 0
for i in range(len(val)):
total += val[i]
print total
259
40
Modules
• Additional functionality, that is not part of
the core language, is in modules like
– sys (system)
– re (regular expressions)
– math (mathematics)
• Load modules with import
>>> import math
>>> help(math)
Help on built-in module math:
…
• You can write your own modules and
import them
41
The sys.argv list
• A special list is sys.argv
• This contains the command-line arguments if the
program is invoked at the command line
• It's a way for the user to pass information into
the program, if you don't have a graphical
interface with which to do this
import sys
print sys.argv
File args.py
ah@studipool1> python args.py abc 123
['args.py', 'abc', '123']
Output at command line
42
Converting a sequence into a list
Data types can be
converted. Here the
list function converts
a string into a list.
Triple quotes allow for
strings that stretch over
several lines
>>> dna="acgtcgaga"
>>> list(dna)
['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a']
>>> """You can also make
use of long
strings
and the
split
function""".split()
['You', 'can', 'also', 'make', 'use', 'of',
'long', 'strings', 'and', 'the', 'split',
'function']
• The underlying programming language C
treats all strings as lists
43
Taking a slice of a list
• The syntax x[i:j] returns a list containing
elements i,i+1,…,j-1 of list x
nucleotides = ['a', 'g', 'c', 't']
purines = nucleotides[0:2]
# nucleotides[:2] also works
pyrimidines = nucleotides[2:4]# nucleotides[2:] also works
print "Nucleotides:", nucleotides
print "Purines:", purines
print "Pyrimidines:", pyrimidines
Nucleotides: ['a', 'g', 'c', 't']
Purines: ['a', 'g']
Pyrimidines: ['c', 't']
44
Applying a function to a list
• The map command applies a function to every
element in an array
• Similar syntax to list: map(EXPR,LIST) applies
EXPR to every element in LIST
• EXPR can be arbitrary function, defined
elsewhere or lambda calculus expression
• Lambda calculus: provides "anonymous"
function, constructed with keyword lambda, a
set of parameters, and an expression with these
• Example: multiply every number by 3
>>> map(lambda x: x*3, [1,2,3])
[3, 6, 9]
45
Python for Bioinformatics
Lecture 3: Patterns and Functions
46
Review: pattern-matching
• The following code:
if "ACGCGT" in dna:
print "Found MCB binding site!"
prints the string "Found MCB binding site!" if the
pattern "ACGCGT" is present in the string
variable "sequence"
• We can replace the first occurrence of ACGCGT
with the string _MCB_ using the following
syntax:
dna.replace("ACGCGT","_MCB_", 1)
pattern
replacement
count
• We can replace all occurrences by omitting the
optional count argument dna.replace("ACGCGT","_MCB_")
47
Regular expressions
• Python provides a pattern-matching
engine
• Patterns are called regular expressions
• They are extremely powerful
• Often called "regexps" for short
• import module re
48
Motivation: N-glycosylation motif
• Common post-translational modification
• Attachment of a sugar group
• Occurs at asparagine residues with the
consensus sequence NX1X2, where
– X1 can be anything (but proline inhibits)
– X2 is serine or threonine
• Can we detect potential N-glycosylation
sites in a protein sequence?
49
Building regexps
• In general square brackets denote a set of
alternative possibilities
• Use - to match a range of characters: [AZ]
• . matches anything
• \s matches spaces or tabs
• \S is anything that's not a space or tab
• [^X] matches anything but X
50
Using Regular Expressions
• Compile a regular expression object (pattern)
using re.compile
• pattern has a number of methods (try
dir(pattern)), eg.
– match (in case of success returns a Match object,
otherwise None)
– search (scans through a string looking for a match)
– findall (returns a list of all matches)
>>> import re
>>> pattern = re.compile('[ACGT]')
>>> if pattern.match("A"): print "Matched"
Matched
>>> if pattern.match("a"): print "Matched"
>>>
successful match
unsuccessful, returns None
51
by default case sensitive
Matching alternative strings
• /(this|that)/ matches "this" or "that"
• ...and is equivalent to /th(is|at)/
case unsensitive search pattern
>>> pattern=re.compile("(this|that|other)", re.IGNORECASE)
>>> pattern.search("Will match THIS") ## success
<_sre.SRE_Match object at 0x00B52860>
>>> pattern.search("Will also match THat") ## success
<_sre.SRE_Match object at 0x00B528A0>
>>> pattern.search("Will not match ot-her") ## will return None
>>>
Python returns a description of the match object
52
Matching multiple characters
• x* matches zero or more x's
• x+ matches one or more x's
• x{n} matches n x's
• x{m,n} matches from m to n x's
Word and string boundaries
• ^ matches the start of a string
• $ matches the end of a string
• \b matches word boundaries
53
"Escaping" special characters
• \ is used to "escape" characters that
otherwise have meaning in a regexp
• so \[ matches the character "["
– if not escaped, "[" signifies the start of a list of
alternative characters, as in [ACGT]
• All special characters:
. ^ $ * + ? { [ ] \ | ( )
54
Substitutions/Match Retrieval
• regexp methods can be used without
compiling (less efficient but easier to use)
• Example re.sub (substitution):
>>> re.sub("(red|blue|green)", "color", "blue socks and red shoes")
'color socks and color shoes'
• Example re.findall:
>>> e,raw,frm,to = re.findall("\d+", \
"E-value: 4, \
Raw Bit Score: 165, \
Match position: 362-419")
>>> print e, raw, frm, to
4 165 362 419
matches one or more digits
The result, a list of 4 strings,
is assigned to 4 variables
\ allows multiple line commands
alternatively, construct multi-line
strings using triple quotes """ …"""
55
N-glycosylation site detector
>>> protein="""
MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI
IIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQ
HYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMF
KHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYM
DEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANAL
KPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI""".upper().replace("\n","")
>>> for match in re.finditer("N[^P][ST]", protein):
print match.group(), match.span()
NGS (26, 29)
NLT (214, 217)
NGT (250, 253)
re.finditer
provides an iterator
over match-objects
N[^P][ST]- the
main regular
expression
multi-line string,
upper case,
line breaks removed
match.group and match.span print the actual matched string and the position-tuple.
Altenatively, you can print gene[match.start():match.end()]
56
PROSITE and Pfam
PROSITE – a database of regular expressions
for protein families, domains and motifs
Pfam – a database of Hidden Markov
Models (HMMs) – equivalent to
probabilistic regular expressions
57
Another Example:
• Ferredoxins are a group of iron-sulfur
proteins which mediate electron transfer
• The share the motif C, then two residues,
C, then two residues, C, then three
residues, C, then either P,E, or G
• The 4 C's are 4Fe-4S ligands
• What is the corresponding Python
58
Another Example:
• Ferredoxins are a group of iron-sulfur
proteins which mediate electron transfer
• The share the motif C, then two residues,
C, then two residues, C, then three
residues, C, then either P,E, or G
• The 4 C's are 4Fe-4S ligands
• What is the corresponding Python code?
• C.{2}C.{2}C.{3}C[PEG]
59
Another Example:
60
Courtesy of Chris Bystroff
61
Courtesy of Chris Bystroff
Another Example:
62
Courtesy of Chris Bystroff
Another Example
• Regular expressions are useful to parse
text
• Example: extract information from Blast
output, such as
– species name
– E value
– Score
– ID
63
Another Example
BLASTP 2.2.6 [Apr-09-2003]
RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6
(Oculorhombin) (Aniridia, type II protein).
(422 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences 486,132,453 total letters
Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold
Sequences producing significant alignments:
Score
E
(bits) Value
gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a Paired box h...
781
0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo...
780
0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg...
778
0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus]
776
0.0
gi|7305369|ref|NP_038655.1| paired box gene 6 small eye Dickie's sm...
776
0.0
gi|383296|prf||1902328A PAX6 gene
775
0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b Paired box h...
775
0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus]
773
0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus]
770
0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis]
768
0.0
…
64
Functions
• Often, we can identify self-contained tasks
that occur in so many different places we
may want to separate their description
from the rest of our program.
• Code for such a task is called a function
NB: Python
• Examples of such tasks:
provides the
– finding the length of a sequence
– reverse complementing a sequence
– finding the mean of a list of numbers
function len(x)
to do this already
65
Maximum element of a list
• Function to find the largest entry in a list
def find_max(data):
max = data.pop()
for x in data:
if x > max:
max = x
return max
data = [1, 5, 1, 12, 3, 4, 6]
print "Data:", data
print "Maximum:", find_max(data)
Function declaration
Function body
Function result
Function call
Data: [1, 5, 1, 12, 3, 4, 6]
Maximum: 12
66
Reverse complement
The arguments follow the
from string import maketrans
function name in
parantheses
def revcomp(seq):
(in this case seq, the
translation = maketrans("agct", "tcga")
sequence to be revcomp'd)
comp = seq.translate(translation)
rcomp = comp[::-1] # reversing comp
return rcomp
By default, translation
and comp are local variables,
ie., they "live" only inside
dna = "cggcgt"
the surrounding function
rev = revcomp(dna)
print "Revcomp of %s is %s"%(dna, rev)
return announces
that the return value
of this function is
whatever's in
rcomp
string formatted with place holders
Revcomp of cggcgt is acgccg
67
revcomp goes OO
self refers to the current object, gives access to all its variables
from string import *
Class Constructor
saves input sequence
as object variable in
lower case
method calls
class DNA:
def __init__(self, sequence):
self.seq=sequence.lower()
def revcomp(self):
translation = maketrans("agct", "tcga")
comp = self.seq.translate(translation)
self.revcomp = comp[::-1]
def report(self):
print "Revcomp of %s is %s"%\
(self.seq,self.revcomp)
dna = DNA("accggcatg")# Creating a DNA object
dna.revcomp()
dna.report()
Useful to structure code :
add additional DNA sequence functionality
to this class, eg. a function that calculates GC-contents, translation to protein etc.
68
Mean
&
standard
deviation
Importing square
root function from
module math
Function
takes a list
of n numeric
arguments
Function
returns a
two-element
tuple: (mean,sd)
from math import sqrt
def mean_sd(data):
n = len(data)
sum = 0
sqSum = 0
for x in data:
sum += x
sqSum += x * x
mean = sum / n
variance = sqSum / n - mean * mean
sd = sqrt (variance)
return (mean, sd)
data = [1, 5, 1, 12, 3, 4, 6]
(mean, sd) = mean_sd (data)
print "Data:", data
print "Mean:", mean
print "Standard deviation:", sd
69
Including variables in patterns
• Function to find number of instances of a
given binding site in a sequence
finds rightmost position, where
pattern matches in text
no match
call recursively with
text to the left of
rightmost match,
count up one
def count_matches(pattern, text):
pos=text.rfind(pattern)
if pos==-1: return 0
else:
return count_matches(pattern, text[:pos])+1
print count_matches("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT")
3
NB: Built-in string method count also does the job
text="ACGCGTAAGTCGGCACGCGTACGCGT"
print text.count("ACGCGT")
70
Python for Bioinformatics
Lecture 4: Dictionaries
71
Data structures
• Suppose we have a file containing a table
of Drosophila gene names and cellular
compartments, one pair on each line:
Cyp12a5
MRG15
Cop
bor
Bx42
Mitochondrion
Nucleus
Golgi
Cytoplasm
Nucleus
Suppose this file is in "c:/genecomp.txt"
72
Reading a table of data
• We can split each
line into a 2-element list using the
split command.
• This breaks the
line at each space:
genes, comps= [], []
for line in open("C:/genecomp.txt"):
gene, comp = line.split()
genes.append(gene)
comps.append(comp)
print "Genes:", " - ".join(genes)
print "Compartments:", " ".join(comps)
Genes: Cyp12a5 - MRG15 – Cop - bor - Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus
• The opposite of split is join, which makes a string
from a list of strings
73
Finding an entry in a table
• The following code assumes that we've
already read in the table from the file:
import sys
geneToFind = sys.argv[1]
for i in range(len(genes)):
if genes[i]==geneToFind:
print "Gene:", genes[i]
print "Compartment:", comps[i]
sys.exit()
print "Couldn't find gene"
• Example:
sys.argv[1] = "Cop"
Searching for gene Cop
Gene: Cop
Compartment: Golgi
74
Binary search
• The previous algorithm is inefficient. If there are N
entries in the list, then on average we have to search
through ½N entries to find the one we want.
• For the full Drosophila genome, N=12,000. This is
painfully slow.
• An alternative is the Binary Search algorithm:
Start with a sorted list.
Compare the middle element
with the one we want. Pick the
half of the list that contains our
element.
Iterate this procedure to
"home in" on the right element.
This takes log2(N) steps. 75
Dictionaries (hashes)
• Implementing algorithms like binary search
is a common task in languages like C.
• Conveniently, Python provides a type of
array called a dictionary (also called a
hash) that does something similar for you.
• A dictionary is a set of key:value pairs (like
our gene:compartment table)
comp["Cop"] = "Golgi"
Squared brackets [] are used
to index a dictionary
76
keys and values
• keys returns the list of keys in the hash
– e.g. names, in the name2seq hash
• values returns the list of values
– e.g. sequences, in the name2seq hash
name2seq = read_FASTA ("C:/fly3utr.txt")
print "Sequence names: ",
" ".join(name2seq.keys())
print "Total length: ",
len("".join(name2seq.values()))
Sequence names: CG11488 CG11604 CG11455
Total length: 210
77
Getting familiar with hashes
>>> tlf={"Michael" : 40062, \
"Bingding" : 40064, "Andreas": 40063 }
>>> tlf.keys()
['Bingding', 'Andreas', 'Michael']
>>> tlf.values()
[40064, 40063, 40062]
>>> tlf["Michael"]
40062
>>> tlf.has_key("Lars")
False
>>> tlf["Lars"] = 40070
>>> tlf.has_key("Lars") # now its there
True
>>> for name in tlf.keys():
...
print name, tlf[name]
...
Lars 40070
Bingding 40064
Andreas 40063
Michael 40062
Creating an initial phone book
Asking for all keys
Asking for all values
Asking for a value, given a key
Checking whether a key is in the list
Inserting a single key:value pair
Looping through the dictionary
78
Reading a table using hashes
import sys
comps={}
for line in open("C:/genecomp.txt"):
gene, comp = line.split()
comps[gene] = comp
geneToFind=sys.argv[1]
print "Gene:", geneToFind
print "Compartment:", comp[geneToFind]
...with sys.argv[1] = "Cop" as before:
Gene: Cop
Compartment: Golgi
79
Reading a FASTA file into a hash
def read_fasta(filename):
name = None
name2seq = {}
for line in open(filename):
if line.startswith(">"):
if name:
name2seq[name]=seq
name=line[1:].rstrip()
seq=""
else:
seq+=line.rstrip()
name2seq[name]=seq
return name2seq
if name only evaluates to false, if it
is still None (when going over first line)
new name is derived from line from
second letter on, with new-line
character removed
Final entry, after loop
80
Formatted output of sequences
def print_seq(name, seq, width=50):
print ">"+name
i=0
while i<len(seq):
print seq[i : i+width]
i+=width
Default values,
assigned in parameter line,
placed rightmost
Here, width default is
50-column output
print_seq("Tata-box1", "TA"*55)
print_seq("Tata-box2", "TA"*55, 30)
>Tata-box1
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATATA
>Tata-box2
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATATATATATATA
TATATATATATATATATATA
81
Files of sequence names
• Easy way to specify a subset of a given
FASTA database
• Each line is the name of a sequence in a
given database
• e.g.
CG1167
CG685
CG1041
CG1043
82
Get named sequences
• Given a FASTA database and a "file of sequence
names", print every named sequence:
(fasta, fosn) = sys.argv[1:3]
name2seq = read_FASTA (fasta)
for name in open(fosn):
name = name.rstrip()
if name2seq.has_key[name]:
seq = name2seq[name]
print_seq (name, seq)
else:
print "Can't find sequence: %s."%name,
"Known sequences: ",
" ".join(name2seq.keys())
83
Common Set operations
• Two files of sequence names:
• What is the overlap/difference/union?
• Find eg. intersection using sets
from sets import Set
geneSet1 = Set([])
geneSet2 = Set([])
for line in open("C:/fosn1.txt"):
geneSet1.add(line.rstrip())
for line in open("C:/fosn2.txt"):
geneSet2.add(line.rstrip())
CG1167
CG685
CG1041
CG1043
CG215
CG1041
CG483
CG1167
CG1163
C:/fosn1.txt
C:/fosn2.txt
difference
AA
>>> geneSet1
Set(['CG1043', 'CG1041', 'CG1167', 'CG685'])
>>> geneSet1.intersection(geneSet2)
Set(['CG1041', 'CG1167'])
>>> geneSet2.difference(geneSet1)
Set(['CG483', 'CG215', 'CG1163'])
>>> geneSet1.union(geneSet2)
Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685',
'CG1163', 'CG215'])
intersection
AA
B
B
union
AA
B
84
More Set operations
• Since every element in a Set occurs only once,
sets can be used to reduce redundancy
>>> from sets import Set
>>> Set([1,2,3,1,3,3])
Set([1, 2, 3])
• A is a superset of B when A fully contains B
Test: A.issuperset(B)
• A is a subset of B when A is fully contained in B
Test: A.issubset(B)
>>> pqs=Set("1kim 1dan 1bob".split())
>>> pdb=Set("1bob 3mad 1dan 2bad 1kim".split())
>>> pqs.issubset(pdb)
True
85
The genetic code as a hash
aa = {'ttt':'F',
'ttc':'F',
'tta':'L',
'ttg':'L',
'tct':'S',
'tcc':'S',
'tca':'S',
'tcg':'S',
'tat':'Y',
'tac':'Y',
'taa':'!',
'tag':'!',
'tgt':'C',
'tgc':'C',
'tga':'!',
'tgg':'W',
'ctt':'L',
'ctc':'L',
'cta':'L',
'ctg':'L',
'cct':'P',
'ccc':'P',
'cca':'P',
'ccg':'P',
'cat':'H',
'cac':'H',
'caa':'Q',
'cag':'Q',
'cgt':'R',
'cgc':'R',
'cga':'R',
'cgg':'R',
'att':'I',
'atc':'I',
'ata':'I',
'atg':'M',
'act':'T',
'acc':'T',
'aca':'T',
'acg':'T',
'aat':'N',
'aac':'N',
'aaa':'K',
'aag':'K',
'agt':'S',
'agc':'S',
'aga':'R',
'agg':'R',
'gtt':'V',
'gtc':'V',
'gta':'V',
'gtg':'V',
'gct':'A',
'gcc':'A',
'gca':'A',
'gcg':'A',
'gat':'D',
'gac':'D',
'gaa':'E',
'gag':'E',
'ggt':'G',
'ggc':'G',
'gga':'G',
'ggg':'G' }
86
Translating: DNA to protein
def translate(dna):
length = len(dna)
if len(dna) % 3 != 0:
print "Warning: Length is not a multiple of 3!"
sys.exit()
protein = ""
i = 0
while i < length:
codon = dna[i:i+3]
if not aa.has_key(codon):
print "Codon %s is illegal"%codon
sys.exit()
protein += aa[codon]
i+=3
return protein
>>> translate("gatgacgaaagttgt")
'DDESC'
>>> translate("gatgacgaaagttgta")
Warning: Length is not a multiple of 3!
… (SystemExit)
>>> translate("gatgacgiaagttgt")
Codon gia is illegal
… (SystemExit)
87
Counting residue frequencies
def count_residues(seq):
freq={}
seq = seq.lower()
for i in range(len(seq)):
if freq.has_key(seq[i]):
freq[seq[i]]+=1
else: freq[seq[i]]=1
return freq
freq = count_residues("gatgacgaaagttgt")
for residue in freq.keys():
print residue,":", freq[residue]
g
a
c
t
:
:
:
:
5
5
1
4
88
Counting N-mer frequencies
def count_nmers(seq, n):
freq={}
seq = seq.lower()
for i in range(len(seq)-n+1):
nmer=seq[i : i+n]
if freq.has_key(nmer):
freq[nmer]+=1 # incr. according counter
else: freq[nmer]=1 # first occurence
return freq
freq = count_nmers("gatgacgaaagttgt", 2)
for residue in freq.keys():
print residue,":", freq[residue]
cg:
tt:
ga:
tg:
gt:
aa:
ac:
at:
ag:
1
1
3
2
1
2
1
1
1
89
N-mer frequencies for a whole file
from read_fasta import read_fasta
def count_nmers(seq, n, freq):
seq = seq.lower()
for i in range(len(seq)-n+1):
nmer=seq[i : i+n]
if freq.has_key(nmer):
freq[nmer]+=1
else: freq[nmer]=1
return freq
name2seq = read_fasta("z:/tmp/fly3utr.txt")
freq = {}
## count for each sequence
for seq in name2seq.values():
freq = count_nmers(seq, 2, freq)
## display statistics
for residue in freq.keys():
print residue,":", freq[residue]
We reuse a function we wrote earlier by
import ing it.The first is the filename (without
.py), the second the function name
ct
tc
tt
cg
ga
tg
gc
gt
aa
ac
gg
at
ca
ag
ta
cc
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
5
9
26
4
11
12
2
17
39
10
4
17
11
15
20
2
Note how we keep passing freq back into
the count_nmers function, to get cumulative
counts
Files and filehandles
This fh is the filehandle
•
•
•
•
•
•
•
•
Opening a file:
Closing a file:
Reading a line:
Reading an array:
Printing a line:
Read-only:
Write-only:
Test if file exists:
fh = open(filename)
fh.close()
data = fh.readline()
data = fh.read()
fh.write(line)
fh = open(filename, "r")
fh = open(filename, "w")
import os
if os.path.exists(filename):
print "filename exists!"
91
Database access from Python
# use the database package with all the DB relevant sub-routines
import MySQLdb
# import class that enables data acquirement as dictionaries
from MySQLdb.cursors import DictCursor
# Connection to database with access specification
conn = MySQLdb.connect(db="scop",
# name of database
host="myserver", # name of server
user="guest",
# username
passwd="guest") # password
# create access pointer that retrieves dictionaries
cursor = conn.cursor(DictCursor)
# send a query
cursor.execute("SELECT * FROM cla LIMIT 10")
# retrieve all rows as a list of dictionaries
data = cursor.fetchall()
# close connection
conn.close()
92
Local vs. global variables
def foo():
a=3
print a
Parameters are local
def foo(a):
print a
a=6
print a
foo()
print a
a=6
print a
foo(3)
print a
6
3
6
6
3
6
does not affect global a
def foo():
global a
a=3
print a
a=6
print a
foo()
print a
• Function variables and parameters
are by default local
• Unless you declare them to be global
6
3
3
93
does affect global a
References in Python
•
Lists, Dictionaries and other
Datatypes are usually referenced,
ie. when assigning a variable, no
data is copied:
[1,2,3,4] assigning b=a
•
a
•
•
b
b points to the same list as a
>>>
>>>
>>>
>>>
[1,
"Real copies" with copy
module
Don't worry about any referencing,
Python is doing the job! But be
aware when you want to copy
objects
>>>
>>>
>>>
>>>
[1,
a = [1,2,3,4]
b=a
b[2]=7
a
2, 7, 4]
from copy import copy
b = copy(a)
b[2]=3
a
2, 7, 4]
94
Matrices
>>> m = [[1, 2], [3, 4]]
• Easy solution: Lists of lists
>>> m[1]
in core Python
[3, 4]
>>> m[1][1]
• Access an element at
4
position (i,j) in a list of lists:
selecting from the i'th row the j'th element
• Disadvantages: Operations like
Addition/Multiplications on lists (of lists) would
be slow, need to be implemented
• Luckily: big library already available,
fast (since implemented in C),
rich functionality
95
Matrices with numarray
• Faster, more calculations
(reshaping, built-in matrix
operations) with external package
numarray
• various matrix creation methods with
numarray:
–
–
–
–
–
from list of lists
zeros/2
ones/2
identity/1
from a function
– etc.
• Convenient access of
multidimensional array elements
>>> from numarray import *
>>> m1 = array(m);m1
array([[1, 2],
[3, 4]])
>>> m1.getshape()
(2, 2)
>>> zeros((3,5))
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
>>> m2=array(arange(8))
>>> m2.setshape((2,4))
>>> m2
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
>>> m2[1,1]
5
96
Matrices with numarray
• You can select rows
and columns,
or even submatrices
(same "slicing" as with
lists)
• You can apply a scalar
operation like
– addition +
– multiplication *
– sine or cosine
to an array
>>> m1[:,1] # second column
array([2, 4])
>>> #arange produces one-dim. array
>>> m = arange(9, shape=(3,3));m
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> m[1:,1:]
array([[4, 5],
[7, 8]])
>>> m[1] + 3
array([6, 7, 8])
>>> m[1] * 3
array([ 9, 12, 15])
>>> m1[1] * 3
array([ 9, 12])
>>> sin(m1)
array([[ 0.84147098, 0.90929743],
[ 0.14112001, -0.7568025 ]])
97
More Math
• Remember the mean and standard deviation from Lecture 3?
• Reuse of existing packages makes live easier:
>>> data = array([1, 5, 1, 12, 3, 4, 6])
>>> data.mean()
4.5714285714285712
>>> data.stddev()
3.7796447300922722
• Or finding the maximum in a list becomes now:
>>> data[argmax(data)]
12
• numarray also provides functions for dot product, vector
calculations etc. >>> dot(array([1,2,3]), array([1,2,3]))
14
>>> array([1,2,3]) + array([4,5,6])
array([5, 7, 9])
98
Longest Common Subsequence
from numarray import *
seq1="ATCTGATC"
seq2="TGCATA"
len1 = len(seq1)
len2 = len(seq2)
def max3(a,b,c): return max( max(a,b) ,c)
#Create an array val of length len1+1 times len2+1
val=zeros((len1+1,len2+1))
for i in range(1,len1+1):
for j in range(1,len2+1):
if seq1[i-1]==seq2[j-1]:
val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1)
else:
val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1])
print val
lcs = val[len1,len2]
print "The longest common subsequence of %s and %s is %d (%f)"% \
99
(seq1, seq2, lcs, float(lcs) / max(len1,len2))
Longest Common Subsequence
Output
[[0 0 0 0 0]
[0 1 1 1 1]
[0 1 1 1 1]
[0 1 1 1 1]
[0 1 2 2 2]
[0 1 2 3 3]
[0 1 2 3 4]
[0 1 2 3 4]
[0 1 2 3 4]]
The longest common subsequence of
ATCTGATC and TGCATA is 4 (0.500000)
Result of print val
Final Result
100
Classes
• Define a class to store PDB residues. A residue has: a
name, a position in the sequence, and a list of atoms. An
atom has a name and coordinates. Define 2 methods:
add_residue and add_atom
class PDBStructure:
def add_residue(self, name, posseq):
residue = {'name': resname,
'posseq': posseq,
'atoms': []}
self._residues.append(residue)
return residue
def add_atom(self, residue, name, coord):
atom = {'residue': residue,
'name': name,
'coord': coord
}
residue['atoms'].append(atom)
return atom
101
Classes: Usage
struct = PDBStructure()
residue = struct.add_residue(name = "ILE", posseq = 1 )
struct.add_atom(residue, name = "N",
coord = (23.46800041, -8.01799965, -15.26200008))
struct.add_atom(residue, name = "CZ",
coord = (125.50499725, 4.50500011, -19.14800072))
residue = struct.add_residue(name = "LYS", posseq = 2 )
struct.add_atom(residue, name = "OE1",
coord = (126.12000275, -1.78199995, -15.04199982))
print struct.residues
[{'name': 'ILE', 'posseq': 1, 'atoms': [ \
{'name': 'N', 'coord': (23.468000409999998, \
-8.0179996500000001, -15.26200008)}, \
{'name': 'CZ', 'coord': (125.50499725, \
4.5050001100000001, -19.148000719999999)}]}, \
{'name': 'LYS', 'posseq': 2, 'atoms': [ \
{'name': 'OE1', 'coord': (126.12000275, \
-1.7819999500000001, -15.041999819999999)}]}]
102