Transcript Lecture 2

1
Bioinformatics Algorithms
Lecture 2
© Jeff Parker, 2015
That theory is worthless. It isn't even wrong!
- Wolfgang Pauli
2
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
3
Protein Folding
The structure that a protein adopts is vital to its chemistry
Its structure determines which of its amino acids are exposed
to carry out the protein’s function
Its structure determines what substrates it can react with
Its structure determines its interactions
4
Review Central Dogma
Difference in Amino Acids lie in side chain – 20 different kinds
Those with mostly Carbon and Hydrogen are hydrophobic
Polar side chains often have Oxygen and Nitrogen
Third class are those side chains that are charged at normal pH
Anfinsen
The Central dogma says Sequence specifies structure
C. B. Anfinsen worked on ribonuclease, which degrades RNA into
smaller components
Denature – to “unfold” a protein back to random coil configuration
-mercaptoethanol – breaks disulfide bonds
Urea or guanidine hydrochloride – denaturant
Can also denature by heating – adding energy
Anfinsen observed that when denatured, ribonuclease would no longer
function correctly, but would work again when allowed to refold
Anfinsen’s experiments
Denatured ribonuclease with urea – no longer worked
Then removed urea – Ribonuclease regained enzymatic activity
Evidence that it re-folded to native conformation
6
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
7
Review Amino Acid Chains
Difference is in side chain – 20 different kinds
Those with mostly Carbon and Hydrogen are hydrophobic
Polar side chains often have Oxygen and Nitrogen
Third class are those side chains that are charged at normal pH
8
Wide range in complexity
9
10
Where can we fold?
11
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
Ramachandran Plot
Not all pairs of angles are
possible. Some
configurations lead to
self intersections
So called steric clash
steric: Of or relating to
the spatial
arrangement of atoms
in a molecule.
Studied by
Gopalasamudram
Narayana
Ramachandra
12
13
Ramachandran Plot
14
Handedness Matters
Limonene is chiral, and has two forms
One tastes like lemon
The other tastes like orange
15
Resources
http://www.youtube.com/watch?v=1eSwDKZQpok
There are many nearby videos
http://www.youtube.com/watch?v=u9dhO0iCLww
Haight Ashbury describes Protein Synthesis
www.learner.org/courses/biology/units/proteo/images.html
Includes images and four short animated videos
https://www.youtube.com/watch?v=zm-3kovWpNQ
Ken Dill’s Ted Talk. Interesting Examples
www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Nice overview of folding. Simple animation
webhost.bridgew.edu/fgorga/proteins/
Tutorial w/ Jmol models, including alpha helix and beta sheet
16
Levels of Protein Structure
Primary Structure
The order of amino acids
Secondary Structure
Local shape – alpha helix and beta sheets
Tertiary Structure
Fully Folded Shape
Quaternary Structure
Combination of multiple components,
such as that in hemoglobin
Proposal: Figure out secondary shapes
Use that to compute tertiary structure
Determining Protein Structure
There are O(100,000) distinct proteins in the human proteome.
3D structures have been determined for 14,000 proteins, from
all organisms
Includes duplicates with different ligands bound, etc.
Coordinates are determined by X-ray crystallography
X-Ray Crystallography
~0.5mm
The crystal is a mosaic of millions of copies of the protein.
As much as 70% is solvent (water)!
May take months (and a “green” thumb) to grow.
X-Ray diffraction
Image is averaged over:
Space (many copies)
Time (of the diffraction
experiment)
Rosalind Franklin
20
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
21
pdb
22
PDB (header)
HEADER HORMONE
08-OCT-96 2HIU
TITLE NMR STRUCTURE OF HUMAN INSULIN IN 20% ACETIC ACID, ZINCTITLE 2 FREE, 10 STRUCTURES
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: INSULIN;
COMPND 3 CHAIN: A;
COMPND 4 MOL_ID: 2;
COMPND 5 MOLECULE: INSULIN;
COMPND 6 CHAIN: B
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 ORGANISM_COMMON: HUMAN;
SOURCE 4 ORGANISM_TAXID: 9606;
SOURCE 5 MOL_ID: 2;
23
PDB (contents)
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
1
2
3
4
5
6
7
8
9
10
N
CA
C
O
H1
H2
H3
HA2
HA3
N
GLY
GLY
GLY
GLY
GLY
GLY
GLY
GLY
GLY
ILE
A
A
A
A
A
A
A
A
A
A
1
1
1
1
1
1
1
1
1
2
-6.132
-4.686
-3.864
-3.324
-6.407
-6.697
-6.302
-4.370
-4.531
-3.761
6.735
6.753
6.149
6.855
5.776
7.020
7.398
7.772
6.170
4.849
1.016
1.376
0.235
-0.593
0.726
1.840
0.232
1.548
2.272
0.186
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
There is much more than positions
For details on format
www.wwpdb.org/docs.html
N
C
C
O
H
H
H
H
H
N
24
PDB Torsion Angles
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
SUBTOPIC: TORSION ANGLES
TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
(M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
STANDARD TABLE:
FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSICHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
M
1
1
1
RES
SER
CYS
CYS
CSSEQI
A
9
A 20
B
7
PSI
-156.44
-112.62
127.43
PHI
167.40
-53.53
-19.39
We will return to these…
25
Secondary Structure
Proteins begin folding during translation.
Proteins fold into the low energy conformation.
Molecular chaperones may help new proteins fold.
Hydrophobic residues are buried in an interior core
to form an α helix.
Alpha helices are found in sequences with
Ala, Leu, Met, Phe, Glu, Gln, Lys, Arg, His
Another common form is β sheets.
Beta sheets are found in sequences rich in
Tyr, Trp, Ile Val, Thr, Cys
26
Viewing Structure
Snapshots from PyMol on PDB file
You can download a student version of PyMol
Many other viewers exist
27
Alpha helix Discovery
In a 1935 paper, W. Astbury showed that when you stretch moist wool, there
were drastic changes visible in X-ray diffraction
Lead him to propose that the protein formed a helix, which he called a-form
He assumed an integral number of bases per turn
Many scientists worked on the problem, without success
Key breakthrough occurred in 1948 when Linus Pauling caught a cold and
stayed home. He wrote out a sequence of amino acids on a strip of paper,
and tried wrapping it around a pencil. After a few attempts, he came up
with a model with plausible hydrogen bonds.
One factor behind his 1954 Nobel Prize
28
Alpha helix
Not an integral number of nodes per turn
29
Alpha helix
Pauling
30
Beta Sheets
Astbury, once again, proposed the Beta Sheet, proposing a
hydrogen bond parallel or antiparallel extended β strands.
However, he didn’t know that the peptide bond was planar.
Pauling and Corey proposed a more accurate model in 1951.
31
Beta Sheets
32
Beta Sheets
Two forms
Parallel
Antiparallel
33
Sample Protein Structure
34
Remember our angles…
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
SUBTOPIC: TORSION ANGLES
TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
(M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
STANDARD TABLE:
FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSICHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
M
1
1
1
RES
SER
CYS
CYS
CSSEQI
A
9
A 20
B
7
PSI
-156.44
-112.62
127.43
PHI
167.40
-53.53
-19.39
Insulin Ramachandran plot
http://www.fos.su.se/~pdbdna/input_Raman.html
35
a-helix and -sheet
en.wikipedia.org/wiki/File:Ramachandran_plot_general_100K.jpg
36
37
Speed of Folding
Some proteins fold much faster than others
8 order of magnitude difference between fastest and
slowest
Some forms bind locally
Some bind remotely
38
Speed of Folding
Some proteins fold much faster than others
8 order of magnitude difference between fastest and
slowest
Some forms bind locally
a-helix
Some bind remotely
-sheet
Those that bind locally are more likely to form quickly
Search for "downhill folders"
39
Structure Prediction
Given a new protein, how can we find the structure? Three
major methods are used
Comparative modeling – look for a homologue
Fold recognition
Look for regions characteristic of folds
Ab intio
Simulate the attractions between parts of peptide
Difficult – high dimension, and hard to get accurate
models of all the forces
40
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
Secondary Structure Prediction
Goal: use primary structure to predict secondary structure
Use secondary structure to compute tertiary structure
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
Based on frequencies of occurrence of residues in helices and sheets
Count how many times amino acid has been observed in
alpha helix, beta sheet, or in turn (a, b, or turn in next slide)
How many times it has been seen in first, second, third and fourth
position in a turn f(i), f(i+1)…
Build a table of probabilities…
Secondary Structure Prediction
Goal: use primary structure to predict secondary structure
Use secondary structure to compute tertiary structure
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
Based on frequencies of occurrence of residues in helices and sheets
Count how many times amino acid has been observed in
alpha helix, beta sheet, or in turn (a, b, or turn in next slide)
How many times it has been seen in first, second, third and fourth
position in a turn f(i), f(i+1)…
Build a table of probabilities…
55% accuracy in prediction secondary structure
Chou-Fasman Parameters
Name
Alanine
Arginine
Aspartic Acid
Asparagine
Cysteine
Glutamic Acid
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
P(a)
142
98
101
67
70
151
111
57
100
108
121
114
145
113
57
77
83
108
69
106
P(b)
83
93
54
89
119
037
110
75
87
160
130
74
105
138
55
75
119
137
147
170
P(turn)
66
95
146
156
119
74
98
156
95
47
59
101
60
60
152
143
96
96
114
50
f(i)
0.06
0.070
0.147
0.161
0.149
0.056
0.074
0.102
0.140
0.043
0.061
0.055
0.068
0.059
0.102
0.120
0.086
0.077
0.082
0.062
f(i+1)
0.076
0.106
0.110
0.083
0.050
0.060
0.098
0.085
0.047
0.034
0.025
0.115
0.082
0.041
0.301
0.139
0.108
0.013
0.065
0.048
f(i+2)
0.035
0.099
0.179
0.191
0.117
0.077
0.037
0.190
0.093
0.013
0.036
0.072
0.014
0.065
0.034
0.125
0.065
0.064
0.114
0.028
f(i+3)
0.058
0.085
0.081
0.091
0.128
0.064
0.098
0.152
0.054
0.056
0.070
0.095
0.055
0.065
0.068
0.106
0.079
0.167
0.125
0.053
43
Chou-Fasman Parameters
Large integers
Name
Alanine
Arginine
Aspartic Acid
Asparagine
Cysteine
Glutamic Acid
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
P(a)
142
98
101
67
70
151
111
57
100
108
121
114
145
113
57
77
83
108
69
106
P(b)
83
93
54
89
119
037
110
75
87
160
130
74
105
138
55
75
119
137
147
170
P(turn)
66
95
146
156
119
74
98
156
95
47
59
101
60
60
152
143
96
96
114
50
Floats < 0.3
f(i)
0.06
0.070
0.147
0.161
0.149
0.056
0.074
0.102
0.140
0.043
0.061
0.055
0.068
0.059
0.102
0.120
0.086
0.077
0.082
0.062
f(i+1)
0.076
0.106
0.110
0.083
0.050
0.060
0.098
0.085
0.047
0.034
0.025
0.115
0.082
0.041
0.301
0.139
0.108
0.013
0.065
0.048
f(i+2)
0.035
0.099
0.179
0.191
0.117
0.077
0.037
0.190
0.093
0.013
0.036
0.072
0.014
0.065
0.034
0.125
0.065
0.064
0.114
0.028
f(i+3)
0.058
0.085
0.081
0.091
0.128
0.064
0.098
0.152
0.054
0.056
0.070
0.095
0.055
0.065
0.068
0.106
0.079
0.167
0.125
0.053
44
Chou-Fasman Algorithm
Identify a-helices
4 out of 6 contiguous amino acids that have P(a) > 100
Extend the region until 4 amino acids with P(a) < 100 found
Compute P(a) and P(b); If the region is > 5 residues and
P(a) > P(b) identify as a helix
Repeat for -sheets [use P(b)]
If an a and a  region overlap, the overlapping region is predicted
according to P(a) and P(b)
Many sequences can shift between helixes and sheets
45
Chou-Fasman, cont’d
Identify hairpin turns:
P(t) = f(i) of the residue  f(i+1) of the next residue  f(i+2) of the
following residue  f(i+3) of the residue at position (i+3)
Predict a hairpin turn starting at positions where:
P(t) > 0.000075
The average P(turn) for the four residues > 100
P(a) < P(turn) > P(b) for the four residues
More current algorithms do better than Chou-Fasman's 55%
Garnier-Osguthorpe-Robson: similar approach, uses Bayesian analysis
46
Chou-Fasman Example
CAENKLDHVRGPTCILFMTWYNDGP
CAENKL – Potential helix (!C and !N)
Residues with P(a) < 100: RNCGPSTY
Extend: When we reach RGPT, we must stop
CAENKLDHV: P(a) = 972, P(b) = 843
Declare alpha helix
Identifying a hairpin turn
VRGP: P(t) = 0.000085
Average P(turn) = 113.25
Avg P(a) = 79.5, Avg P(b) = 98.25
Why is this so hard?
Some sequences are ambiguous
May be a-helix in one protein or -sheet in another
Mezei M (1998) Chameleon sequences in the
PDB. Protein Eng 11, 411–414
We will see another example on the next slide
Rather than secondary shape dictating tertiary structure
Tertiary structure may dictate secondary structure
49
The normal protein has a
secondary structure
dominated by alpha
helices.
The abnormal secondary
structure is dominated by
beta sheets.
50
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
51
Puzzle – a glance ahead
What are the odds of rolling doubles with two
fair dice?
52
Puzzle – a glance ahead
What are the odds of rolling doubles with two
fair dice?
What are the odds of rolling doubles with two
fair dice if I tell you that the sum of the two
dice is odd? Even?
What are the odds of rolling doubles with two
fair dice if I tell you that one of the two dice
is even?
53
Puzzle – a glance ahead
What are the odds of rolling doubles with two
fair dice?
What are the odds of rolling doubles with two
fair dice if I tell you that the sum of the two
dice is even?
That is, I have rolled the dice, and I tell you
that the sum of the dice is even
What are the odds that the dice are equal?
54
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
55
What are the frequencies?
Let's count the frequency of letters in a DNA sequence
To do so we will use a dictionary
The basic dictionary operations are
Create an empty dictionary
symbolCounts = {}
Set a value in the dictionary
symbolCounts['A'] = 1
'A' is the key, 1 is the value
Access a value in the dictionary
print symbolCounts['G']
Iterate over the elements in a dictionary
for ch in symbolCounts:
print ch, symbolCounts[ch]
These examples are all drawn from next page
56
What are the frequencies?
Now to use our dictionary to count the frequencies of letters
# Frequency – count frequency of each letter
symbolCounts = {} # Create empty Dictionary
# Go over each letter in the sequence
for x in xrange(len(text)):
ch = text[x] # Get char from string
# Increment count
if (ch in symbolCounts):
symbolCounts[ch] = symbolCounts[ch] + 1
else:
symbolCounts[ch] = 1 # Add entry to dictnry
57
What are the frequencies?
Now to use our dictionary to count the frequencies of letters
# Frequency – count frequency of each letter
symbolCounts = {} # Create empty Dictionary
# Go over each letter in the sequence
for x in xrange(len(text)):
ch = text[x] # Get char from string
# Increment count
if (ch in symbolCounts):
symbolCounts[ch] = symbolCounts[ch] + 1
else:
symbolCounts[ch] = 1 # Add entry to dictnry
58
What are the frequencies?
Now to use our dictionary to count the frequencies of letters
# Frequency – count frequency of each letter
symbolCounts = {} # Create empty Dictionary
# Go over each letter in the sequence
for x in xrange(len(text)):
ch = text[x] # Get char from string
# Increment count
if (ch in symbolCounts):
symbolCounts[ch] = symbolCounts[ch] + 1
else:
symbolCounts[ch] = 1 # Add entry to dictnry
59
What are the frequencies?
Now to use our dictionary to count the frequencies of letters
# Frequency – count frequency of each letter
symbolCounts = {} # Create empty Dictionary
# Go over each letter in the sequence
for x in xrange(len(text)):
ch = text[x] # Get char from string
# Increment count
if (ch in symbolCounts):
symbolCounts[ch] = symbolCounts[ch] + 1
else:
symbolCounts[ch] = 1 # Add entry to dictnry
60
What are the frequencies?
Now to use our dictionary to count the frequencies of letters
# Frequency – count frequency of each letter
symbolCounts = {} # Create empty Dictionary
# Go over each letter in the sequence
for x in xrange(len(text)):
ch = text[x] # Get char from string
# Increment count
if (ch in symbolCounts):
symbolCounts[ch] = symbolCounts[ch] + 1
else:
symbolCounts[ch] = 1 # Add entry to dictnry
61
Dictionary Order
We can iterate over the dictionary
# Rough print
print symbolCounts
# Iterate over the set symbols
for ch in symbolCounts:
print ch, symbolCounts[ch]
$ python freq.py
ACGTCCC
Saw ACGTCCC
{'A': 1, 'C': 4,
'T': 1, 'G':1}
A
C
T
G
1
4
1
1
62
Dictionary Order
We can iterate over the dictionary
# Rough print
print symbolCounts
# Iterate over the set symbols
for ch in symbolCounts:
print ch, symbolCounts[ch]
$ python freq.py
ACGTCCC
Saw ACGTCCC
{'A': 1, 'C': 4,
'T': 1, 'G':1}
A
C
T
G
1
4
1
1
63
Control Order
Don’t depend on dictionary for an order
# Rough print
print symbolCounts
$ python freq.py
ACGTCCC
Saw ACGTCCC
# Iterate over the set symbols
for ch in symbolCounts:
print ch, symbolCounts[ch]
A
C
T
G
1
4
1
1
symbols = ['A', 'C', 'G', 'T']
A
C
G
T
1
4
1
1
# Iterate over the list of symbols
for ch in symbols:
print ch, dict[ch]
64
Postmortem
It would be useful to try these programs on different input
How do we do that?
65
Postmortem
It would be useful to try these programs on different input
How do we do that?
We can put the information on the command line
66
Postmortem
It would be useful to try these programs on different input
How do we do that?
We can put the information on the command line
We can put information into files – sequences are long!
67
Postmortem
It would be useful to try these programs on different input
How do we do that?
We can put the information on the command line
We can put information into files – sequences are long!
We still want to change the file name
68
Postmortem
It would be useful to try these programs on different input
How do we do that?
We can put the information on the command line
We can put information into files – sequences are long!
We still want to change the file name
We need command line parameters, as below
% python frequency.py data/Human22.fasta
69
Command Line Parameters
import sys
# Import a library
# argv – array of Command Line Parameters
if (len(sys.argv) < 2):
print "Usage: python", sys.argv[0], "<filename>"
else:
fileName = sys.argv[1]
% python frequency.py data/Human22.fasta
I will use the symbol '%' or '$' to represent the command line
70
Read from a file
import sys # Import a library
...
if (len(sys.argv) < 2):
print "Usage: python", sys.argv[0], "<filename>"
else:
fileName = sys.argv[1]
print fileName
# echo print
f = open(fileName, 'r')
# Open file to read
# First line in Fasta Format describes the file
text = f.readline()
print "Saw", text
% more data/Human22.fasta
> ref|NT_011520.12|:1-29755346 Homo sapiens chromosome 22
GTGTCTCATGCCTGTAATCCCAACACTT
TGGGAGCCTGAGGTAGAAGTTGAGGTCA...
71
Prepare file contents
fileName = sys.argv[1]
print fileName
f = open(fileName, 'r')
# echo print
# First line in Fasta Format describes the file
text = f.readline()
print "Saw", text
# Read the rest of the file in one gulp
#
# This program could process one line at a time
# Later problems prefer a single string
text = f.read()
text = text.replace("\n", "") # Remove <cr>
72
What are the frequencies?
% python frequency.py data/Human22.fasta
Saw >ref|NT_011520.12|:1-29755346 Homo sapiens
chromosome 22 genomic contig, ...
{'A': 7745441, … }
A 7745441
…
I've printed (part of) the dictionary
I've printed the count for one element
73
Translate DNA to AA
Input:
CTC AGC GTT ACC
74
Translate DNA to AA
Input:
CTC AGC GTT ACC
Output
Leu Ser Val Thr
75
Translate DNA to AA
High Level Plan
For each codon in the text
Translate the codon into an Amino Acid
Print the Amino Acid
76
Translate DNA to AA
High Level Plan
For each codon in the text
Translate the codon into an Amino Acid
Print the Amino Acid
Append the Amino Acid to our result
77
Translate DNA to AA
Mid Level Plan
Get filename: open file and read text
For each codon in the text
Lookup translation of Codon to AA
Append the Amino Acidto our result
Return the result
78
Top Level
# Turn DNA sequence to Amino Acid Sequence
# Jeff Parker
Jan 2010
#
import sys
# Command Line argument
if (len(sys.argv) < 2):
print "Usage: python”,sys.argv[0],"<filename>"
else:
fileName = sys.argv[1]
text = readFastaFile(fileName)
protein = translate(text)
print protein
79
Read in a file
# Remove \n, change to lower case
def prepare(text):
text = text.replace("\n", "")
text = text.lower()
return text
# Read a fasta file into a string
def readFastaFile(fileName):
f = open(fileName, 'r')
# First line in Fasta format is header
line = f.readline() # Read and ignore 1st line
text = f.read()
# Read the rest of the file
text = prepare(text)
return text
80
Translate DNA to AA
Mid Level Plan
Get filename: open file and read text
For each codon in the text
Lookup translation of Codon to AA
Append the Amino Acid to our result
Return the result
81
Main Loop
# Function to translate DNA string to an AA string
def translate(text):
result = ""
# Starting at 0, run until end, steps of 3
# CTC AGC GTT ACC
for i in xrange(0, len(text) + 1 - 3, 3):
codon = text[i:i+3]
if (codon in code):
result = result + code[codon]
return result
82
Translation Table
# Dictionary holding
code = {
'ttt': 'F', 'tct':
'ttc': 'F', 'tcc':
'tta': 'L', 'tca':
'ttg': 'L', 'tcg':
'ctt': 'L', 'cct':
'ctc': 'L', 'ccc':
'cta': 'L', 'cca':
'ctg': 'L', 'ccg':
'att': 'I', 'act':
'atc': 'I', 'acc':
'ata': 'I', 'aca':
...
}
Codon to Amino Acid map
'S',
'S',
'S',
'S',
'P',
'P',
'P',
'P',
'T',
'T',
'T',
'tat':
'tac':
'taa':
'tag':
'cat':
'cac':
'caa':
'cag':
'aat':
'aac':
'aaa':
'Y',
'Y',
'*',
'*',
'H',
'H',
'Q',
'Q',
'N',
'N',
'K',
'tgt':
'tgc':
'tga':
'tgg':
'cgt':
'cgc':
'cga':
'cgg':
'agt':
'agc':
'aga':
'C',
'C',
'*',
'W',
'R',
'R',
'R',
'R',
'S',
'S',
'R',
83
Translation Table
# Dictionary holding Codon to Amino Acid map
code = {
'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...
}
Building this table is the real work
84
Main Loop
# Function to translate DNA string to an AA string
def translate(text):
result = ""
# Starting at 0, run until end, steps of 3
# CTC AGC GTT ACC
for i in xrange(0, len(text) + 1 - 3, 3):
codon = text[i:i+3]
if (codon in code):
result = result + code[codon]
return result
85
Main Loop
# Function to translate DNA string to an AA string
def translate(text):
result = ""
# Starting at 0, run until end, steps of 3
# CTC AGC GTT ACC
for i in xrange(0, len(text) + 1 - 3, 3):
codon = text[i:i+3]
if (codon in code):
result = result + code[codon]
return result
What should we do if it is not?
86
Main Loop
# Function to translate DNA string to an AA string
def translate(text):
result = ""
# Starting at 0, run until end, steps of 3
# CTC AGC GTT ACC
for i in xrange(0, len(text) + 1 - 3, 3):
codon = text[i:i+3]
if (codon in code):
result = result + code[codon]
return result
How do we test this?
87
You cannot improve what you cannot measure – W. E. Deming
I had two versions of this: the one you have just seen, and a faster one that
used a Python idiom
From my Unix (Mac) system, I can say
$ time python dna2aa.py ../../data/EColi.fasta
real 0m0.507s
user 0m0.453s
sys
0m0.020s
The times for the faster version are below
$ time python fastdna2aa.py ../../data/EColi.fasta
real 0m0.484s
Not enough of a difference
user 0m0.424s
to justify the digression
sys
0m0.027s
We will see how to get more detail
88
Faster Version
def translate(text):
result = [ ]
# Starting at 0, run until end, steps of 3
for i in xrange(0, len(text) + 1 - 3, 3):
codon = text[i:i+3]
if (codon in code):
result.append(code[codon])
# Python idiom to convert list into string
protein = ''.join(result)
return protein
Not big enough difference
89
Outline
Protein Folding
Amino Acid Roundup
Ramachandran Plots
PDB
Chou Fasman
Probability
Python
Find Frequencies
Dictionary
Open Reading Frames
90
Open Reading Frames
In the next assignment, we will be looking for possible genes.
These should start with a Start, and end with a Stop.
Such a sequence is called an Open Reading Frame, or ORF
We look for ORFs that are longer than the expected length
91
Open Reading Frames
In the next assignment, we will be looking for possible genes.
These should start with a Start, and end with a Stop.
Such a sequence is called an Open Reading Frame, or ORF
We look for ORFs that are longer than the expected length
Don't report nested ORFs
ORF nested in a larger ORF
Report this ORF, not the other
92
Open Reading Frames
In the next assignment, we will be looking for possible genes.
These should start with a Start, and end with a Stop.
Such a sequence is called an Open Reading Frame, or ORF
We look for ORFs that are longer than the expected length
ORFs can be in any one of 3 Reading Frames. Report them all.
93
Open Reading Frames
In the next assignment, we will be looking for possible genes.
These should start with a Start, and end with a Stop.
Such a sequence is called an Open Reading Frame, or ORF
We look for ORFs that are longer than the expected length
ORF occur in either direction: for now, only need to look in forward
direction. Reversing the sequence is not difficult.
94
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
95
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
96
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
97
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
98
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
99
Checking Input
# orf_skel.py
Starting point for Open Reading
Frame Finder (ORF Finder)
#
# Usage:
#
% python orf_skel.py <filename> [<ORF length>]
...
fileName = sys.argv[1]
if (len(sys.argv) > 2):
# Read integer
try:
limit = int(sys.argv[2]) # Built-in func
except ValueError:
# try-except block
print ”Expecting int, found",
print sys.argv[2]
exit()
print "ORF must be", limit, "Base pairs long”
100
More than one error
# Sometimes you wish to handle different errors
for arg in sys.argv[1:]:
try:
f = open(arg, 'r')
except IOError:
print 'cannot open', arg
except Exception as e:
print “Some other error”, arg
# Read more about exceptions here:
# https://docs.python.org/2/tutorial/errors.html
101
Python Score Card
Dictionaries
Command Line Parameter
Conversion
Try-Catch Block