Transcript Document
Bioinformática 2007-I
Prof. Mirko Zimic
Lunes
-Alineamiento simple de secuencias (pairwise
alignment).
- Alineamiento local y global.
- Matrices de ‘score’
-Algoritmos de Programación Dinámica
-Dot Plot
Miércoles
Alineamiento simple de secuencias: Manejo de los
programas: Clustal, Macaw y servidores en línea
“Nada en Biología tiene sentido a
menos que se entienda en términos
de Evolución”
T. Dobzhansky
“Alinear” = “Comparar”
Sequence alignment
is similar to other
types of comparative
analysis
Involves scoring
similarities and
differences among a
group of related
entities
Finches of the Galápagos Islands observed by
Charles Darwin on the voyage of HMS Beagle
Homología
Homology Is the central concept for all of biology.
Whenever we say that a mammalian hormone is the
‘same’ hormone as a fish hormone, that a human gene
sequence is the ‘same’ as a sequence in a chimp or a
mouse, that a HOX gene is the ‘same’ in a mouse, a fruit
fly, a frog and a human - even when we argue that
discoveries about a worm, a fruit fly, a frog, a mouse, or a
chimp have relevance to the human condition - we have
made a bold and direct statement about homology. The
aggressive confidence of modern biomedical science
implies that we know what we are talking about.”
David B. Wake
Similitud ≠ Homología
1) 25% similarity ≥ 100 AAs is likely homology
2) Homology is an evolutionary statement which
means “descent from a common ancestor”
–common 3D structure
–usually common function
–all or nothing, cannot say "50%
homologous"
C O M PARAT I V E ANALY S I S
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change due to dna
replication errors,
mutations, damage,
or unequal crossingover.
GATTACCA
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATTACCA
T
GATCATCA
GATTGATCA
GAT ACCA
Substitution
insertion
deletion
C O M PARAT I V E ANALY S I S
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change
GATTACCA
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATCATCA
GATTACCA
GATTGATCA
GATACCA
Only extant sequences are known,
ancestral sequences are postulated.
C O M PARAT I V E ANALY S I S
Alignment algorithms
model evolutionary
processes
Derivation from a
common ancestor
through incremental
change. Mutations
that do not kill the
host may carry over
to the population.
Rarely are mutations
kept/rejected by
natural selection.
GATTACCA
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATCATCA
GATTACCA
GATTGATCA
GATACCA
The term homology implies a common
ancestry, which may be inferred from
observations of sequence similarity
Sequence Alignments
• Why align?
Can delineate sequence elements that are functionally
significant
Illuminates phylogenetic relationships
• Algorithms for sequence alignment
Dynamic programming
Dot-matrix
Word-based algorithms
Bayesian methods
What is Meant by Alignment?
Identical nucleotide sequences (trivial example)
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTCAGTGCTAGA
Score = 20
(20 1)
Imperfect match
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTGCTAGA
Score = 11
A better alignment
Score = 14
= 10 + 6 + 4(-0.5)
{
ATTCGGCATTCAGTGCTAGA
ATTCGGCATT----GCTAGA
Gap penalty
Beware of aligning apples and
oranges [and grapefruit]!
Parologous versus
orthologous;
genomic versus
cDNA;
mature versus
precursor.
Los alineamientos se pueden efectuar tanto en
secuencias de ADN como en secuencias de
proteínas…
Why Do We Want To Compare Sequences
wheat
?????
--DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
| | |||||||| || | |||
||| | |||| ||||
KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
EXTRAPOLATE
Homology?
??????
SwissProt
Why Does It Make Sense To Align
Sequences ?
-Evolution is our Real Tool.
-Nature is LAZY and Keeps re-using Stuff.
-Evolution is mostly DIVERGEANT
Same Sequence Same Ancestor
Why Does It Make Sense To Align
Sequences ?
Same
Sequence
Same
Origin
Same
Function
Same
3D Fold
Comparing Is Reconstructing Evolution
An Alignment is a STORY
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
Mutations
+
Selection
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
An Alignment is a STORY
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
Mutations
+
Selection
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
Insertion
Deletion
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation
Evolution is NOT Always Divergent…
AFGP with (ThrAlaAla)n
Similar To Trypsynogen
N
S
AFGP with (ThrAlaAla)n
NOT
Similar to Trypsinogen
SIMILAR Sequences
BUT
DIFFERENT origin
…But in MOST cases, you may assume it is.
How Do Sequences Evolve ?
CONSTRAINED Genome Positions Evolve SLOWLY
EVERY Protein Family Has its Own Level Of Constraint
Family
KS
KA
Histone3
Insulin
Interleukin I
a-Globin
Apolipoprot. AI
Interferon G
6.4
4.0
4.6
5.1
4.5
8.6
0
0.1
1.4
0.6
1.6
2.8
Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (80 Million years)
Ks Synonymous Mutations, Ka Non-Neutral.
How Do Sequences Evolve ?
The amino Acids Venn Diagram
To Make Things Worse, Every Residue has its Own
Personality
C
L V
I
Aliphatic
Aromatic
F
P
AG G
T C
D
Y HKE
W R
Small
S
N
Q
Hydrophobic
Polar
How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special Role
+
On the surface,
CHARGE MATTERS
OmpR, Cter Domain
In the core,
SIZE MATTERS
How Do Sequences Evolve ?
Accepted Mutations Depend on the Structure
Big -> Big
Small ->Small
NO DELETION
+
-
Charged -> Charged
Small <-> Big or Small
DELETIONS
How Can We Compare Sequences ?
To Compare Two Sequences, We need:
Their Structure
We Do Not
Have Them !!!
Their Function
How Can We Compare Sequences ?
We will Need To Replace Structural Information With
Sequence Information.
Same
Sequence
Same
Origin
Same
Function
Same
3D Fold
It CANNOT Work ALL THE TIME !!!
How Can We Compare Sequences ?
To Compare Sequences, We need to Compare Residues
We Need to Know How Much it COSTS to SUBSTITUTE
an Alanine into an Isoleucine
a Tryptophan into a Glycine
…
The table that contains the costs for all the possible
substitutions is called the SUBSTITUTION MATRIX
How to derive that matrix?
How Can We Compare Sequences ?
Making a Substitution Matrix
-Take 100 nice pairs of Protein Sequences,
easy to align (80% identical).
-Align them…
-Count each mutations in the alignments
-25 Tryptophans into phenylalanine
-30 Isoleucine into Leucine
…
-For each mutation, set the substitution score to the log odd ratio:
Log
Observed
Expected by chance
How Can We Compare Sequences ?
Making a Substitution Matrix
The Diagonal Indicates How
Conserved a residue tends to be.
W is VERY Conserved
Some Residues are Easier To
mutate into other similar
Cysteins that make disulfide
bridges and those that do not
get averaged
How Can We Compare Sequences ?
Using Substitution Matrix
Given two Sequences and a substitution Matrix,
We must Compute the CHEAPEST Alignment
Insertion
Deletion
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation
Scoring an Alignment
Most popular Subsitution Matrices
• PAM250
• Blosum62 (Most widely used)
Raw Score
TPEA
¦| |
APGA
Score =1 + 6 + 0 + 2 = 9
• Question: Is it possible to get such a good alignment
by chance only?
Insertions and Deletions
Gap Penalties
Gap Opening Penalty
Gap Extension Penalty
gap
Seq A GARFIELDTHE----CAT
|||||||||||
|||
Seq B GARFIELDTHELASTCAT
• Opening a gap is more expensive than extending
it
How Can We Compare Sequences ?
Limits of the substitution Matrices
They ignore non-local interactions and Assume that
identical residues are equal
ADKPKRPLSAYMLWLN
They assume evolution rate
to be constant
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
Mutations
+
Selection
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
How Can We Compare Sequences ?
Limits of the substitution Matrices
Substitution Matrices Cannot Work !!!
How Can We Compare Sequences ?
Limits of the substitution Matrices
I know… But at least, could I get some idea of
when they are likely to do all right
How Can We Compare Sequences ?
The Twilight Zone
Similar Sequence
Similar Structure
%Sequence Identity
Different Sequence
Structure ????
Same 3D Fold
30%
30
Twilight Zone
Length
100
How Can We Compare Sequences ?
The Twilight Zone
Substitution Matrices Work Reasonably Well on
Sequences that have more than 30 % identity over
more than 100 residues
Major Differences between PAM and BLOSUM
PAM
Built from global alignments
Built from small amout of Data
Counting is based on minimum
replacement or maximum parsimony
Perform better for finding global
alignments and remote homologs
Higher PAM series means more
divergence
BLOSUM
Built from local alignments
Built from vast amout of Data
Counting based on groups of
related sequences counted as one
Better for finding local
alignments
Lower BLOSUM series means
more divergence
How Can We Compare Sequences ?
Which Matrix Shall I use
PAM: Distant Proteins High Index (PAM 350)
BLOSUM: Distant Proteins Low Index (Blosum30)
Choosing The Right Matrix may be Tricky…
•GONNET 250> BLOSUM62>PAM 250.
•But This will depend on:
•The Family.
•The Program Used and Its Tuning.
•Insertions, Deletions?
HOW Can we Align Two
Sequences ?
Dot Matrices
Global Alignments
Local Alignment
Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)
Cost
GOP
GOP
GEP
GOP
GOP
L
Afine Gap Penalty
Parsimony:
Evolution takes the simplest path
(So We Think…)
Insertions and Deletions
Gap Penalties
Gap Opening Penalty
Gap Extension Penalty
gap
Seq A GARFIELDTHE----CAT
|||||||||||
|||
Seq B GARFIELDTHELASTCAT
• Opening a gap is more expensive than extending
it
Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)
-DYNAMIC PROGRAMMING
>Seq1
THEFATCAT
>Seq2
THEFASTCAT
THEFA-TCAT
THEFASTCAT
DYNAMIC
PROGRAMMING
Global Alignments
DYNAMIC PROGRAMMING
Brut Force Enumeration
F A S T
F A T
----FAT
FAST-----FATFAST----F-ATFAST---
(
2
(L1+l2)!
)
(L1)!*(L2)!
DYNAMIC PROGRAMMING
Dynamic Programming Example
Construct an
optimal of these
two sequences:
G A T A C T A
G A T T A C C A
Using these
scoring rules:
Match:
+1
Mismatch: -1
Gap:
-1
DYNAMIC PROGRAMMING
Arrange the
sequence
residues along a
two-dimensional
lattice
Vertices of the
lattice fall
between letters
G A T A C T A
G
A
T
T
A
C
C
A
DYNAMIC PROGRAMMING
The goal is to find
the optimal path
G A T A C T A
G
A
T
T
A
C
C
A
from here
to here
DYNAMIC PROGRAMMING
Each path
corresponds to a
unique alignment
G A T A C T A
G
A
T
T
A
C
C
A
Which one is
optimal?
DYNAMIC PROGRAMMING
The score for a
path is the sum of
its incremental
edges scores
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with A
Match = +1
DYNAMIC PROGRAMMING
The score for a
path is the sum of
its incremental
edges scores
G A T A C T A
G
A
T
T
A
C
C
A
A aligned with T
Mismatch = -1
DYNAMIC PROGRAMMING
The score for a
path is the sum of
its incremental
edges scores
G A T A C T A
G
A
T
T
A
C
C
A
T aligned with NULL
Gap = -1
NULL aligned with T
DYNAMIC PROGRAMMING
Incrementally
extend the path
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
+1
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
-2
+1
-2
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
-1
G A T A C T A
-1
-2
+1
-2
0
0
+2
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-1
+1
-2
0
-2
0
+2
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-1
+1
-2
0
-1
-2
0
+2
+1
-3
-1
+1
+3
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-1
+1
0
-1
-2
-3
-2
0
+2
+1
0
-1
-3
-1
+1
+3
+2
+1
-4
-2
0
+2
+2
+1
-5
-3
-1
+1
+3
+2
DYNAMIC PROGRAMMING
Incrementally
extend the path
Remember the
best sub-path
leading to each
point on the
lattice
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-6
-7
-1
+1
0
-1
-2
-3
-4
-5
-2
0
+2
+1
0
-1
-2
-3
-3
-1
+1
+3
+2
+1
0
-1
-4
-2
0
+2
+2
+1
+2
+1
-5
-3
-1
+1
+3
+2
+1
+3
-6
-4
-2
0
+2
+4
+3
+2
-7
-5
-3
-1
+1
+3
+3
+2
-8
-6
-4
-2
0
+2
+2
+4
DYNAMIC PROGRAMMING
Trace-back to get
optimal path and
alignment
G
A
T
T
A
C
C
A
0
G A T A C T A
-1
-2
-3
-4
-5
-6
-7
-1
+1
0
-1
-2
-3
-4
-5
-2
0
+2
+1
0
-1
-2
-3
-3
-1
+1
+3
+2
+1
0
-1
-4
-2
0
+2
+2
+1
+2
+1
-5
-3
-1
+1
+3
+2
+1
+3
-6
-4
-2
0
+2
+4
+3
+2
-7
-5
-3
-1
+1
+3
+3
+2
-8
-6
-4
-2
0
+2
+2
+4
DYNAMIC PROGRAMMING
Print out the
alignment
G A - T A CT A
G A T T A CC A
G A T A C T A
G
A
T
T
A
C
C
A
Global Alignments
DYNAMIC PROGRAMMING
Dynamic Programming (Needlman and Wunsch)
Match=1
MisMatch=-1
Gap=-1
F A S T
0
F -1
A -2
T -3
-1 -2 -3 -4
1
0
0
2
F A S T
F A S T
0
-1 -2 -3 -4
-1 1 0 -1 0
F
A -2
T -3
2
1
0
-1 -1
1
2
0
F A S T
F A - T
0
F
A
T
-1 -2 -3 -4
1
2
1
2
Local Alignments
GLOBAL Alignment
LOCAL Alignment
Smith And Waterman (SW)=LOCAL Alignment
Two different types of Alignment
Global Alignment
methods:
Local Alignment
methods:
Needleman & Wunch (J. Mol. Biol. (1970) 48,443-453 :
Problem of finding the best path. Revelation: Any partial subpath that ends at a point along the true optimal path must itself
be the optimal path leading to that point. This provides a
method to create a matrix of path “score”, the score of a path
leading to that point. Trace the optimal path from one end to
the other of the two sequences.
Smith & Waterman.(J. Mol. Biol. (1981), 147,195-197: Use
Needleman &Wunch, but report all non-overlapping paths,
starting at the highest scoring points in the path graph.
FASTP(Lipman &Pearson(1985),Science 227,1435-1441
BLAST (Altschul et al (1990),J. Mol. Bio. 215,408-410): don’t
report all overlapping paths, but only attempt to find paths if
there are words that are high-scoring. Speeds up considerably the
alignments.
Global vs. Local Alignment
Global alignment
High-scoring
subsequence
Gap
Local alignment
Global alignment: best overall alignment independent of whether
local high-scoring sequences are included
Local alignment: alignments involving high-scoring sequences take
precedence of global features
G L O BAL & L O CAL S I M I LAR I TY
Implementations of
dynamic
programming for
global and local
similarities
Optimal global
alignment
Optimal local
alignment
Needleman & Wunsch (1970)
Smith & Waterman (1981)
Sequences align
essentially from
end to end
Sequences align
only in small,
isolated regions
Filtering low complexity sequences
• Filters out short repeats and low complexity regions from the
query sequences before searching the database
• Filtering helps to obtain statistically significant results and reduce
the background noise resulting from matches with repeats and low
complexity regions
• The output shows which regions of the query sequence were
masked
Sequence Periodicities in Kinetoplast DNA
Marini et al. Proc. Natl. Acad. Sci. USA 79, 7664-7668 (1982)
Local Alignments
We now have a PairWise Comparison Algorithm,
We are ready to search Databases
Database Search
Q
SW
1.10e-20
10
1.10e-100
1.10e-2
1.10e-1
10
QUERRY
Comparison Engine
Database
3
1
3
6
1.10e-2
1
20
15
E-values
How many time do we expect such an
Alignment by chance?
13