property bar on wolf

Download Report

Transcript property bar on wolf

CS177 Lecture 4
Sequence Alignment
Tom Madej 10.03.05
Overview
•
•
•
•
•
•
•
•
The alignment problem.
Homology, divergence, convergence.
Dynamic programming overview.
Amino acid substitution matrices.
Fast sequence database searching: BLAST.
Multiple sequence alignment.
PSI-BLAST, position specific score matrices.
Databases of multiple sequence alignments.
What is a sequence alignment?
• A linear, one-to-one correspondence between some of
the symbols in one sequence with some of the symbols
in another sequence.
• May be DNA or protein sequences.
• A (sequence) alignment can also be derived from a
superposition of two protein structures, it is then
sometimes called a structure alignment.
Classwork
• In Exercise 1 we will see DNA-DNA alignments produced
by the program Basic Local Alignment Search Tool
(BLAST)
• In Exercise 2 we will see alignments of protein structures
produced by Vector Alignment Search Tool (VAST)
Exercise 1: BLAST this DNA sequence
against the Human Genome
> CCR5
ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGCCAAAAAATCA
ATGTGAAGCAAATCGCAGCCCGCCTCCTGCCTCCGCTCTACTCACTGGTGTTCATCTTTGGTTTTGTGGG
CAACATGCTGGTCATCCTCATCCTGATAAACTGCAAAAGGCTGAAGAGCATGACTGACATCTACCTGCTC
AACCTGGCCATCTCTGACCTGTTTTTCCTTCTTACTGTCCCCTTCTGGGCTCACTATGCTGCCGCCCAGT
GGGACTTTGGAAATACAATGTGTCAACTCTTGACAGGGCTCTATTTTATAGGCTTCTTCTCTGGAATCTT
CTTCATCATCCTCCTGACAATCGATAGGTACCTGGCTGTCGTCCATGCTGTGTTTGCTTTAAAAGCCAGG
ACGGTCACCTTTGGGGTGGTGACAAGTGTGATCACTTGGGTGGTGGCTGTGTTTGCGTCTCTCCCAGGAA
TCATCTTTACCAGATCTCAAAAAGAAGGTCTTCATTACACCTGCAGCTCTCATTTTCCATACAGTCAGTA
TCAATTCTGGAAGAATTTCCAGACATTAAAGATAGTCATCTTGGGGCTGGTCCTGCCGCTGCTTGTCATG
GTCATCTGCTACTCGGGAATCCTAAAAACTCTGCTTCGGTGTCGAAATGAGAAGAAGAGGCACAGGGCTG
TGAGGCTTATCTTCACCATCATGATTGTTTATTTTCTCTTCTGGGCTCCCTACAACATTGTCCTTCTCCT
GAACACCTTCCAGGAATTCTTTGGCCTGAATAATTGCAGTAGCTCTAACAGGTTGGACCAAGCTATGCAG
GTGACAGAGACTCTTGGGATGACGCACTGCTGCATCAACCCCATCATCTATGCCTTTGTCGGGGAGAAGT
TCAGAAACTACCTCTTAGTCTTCTTCCAAAAGCACATTGCCAAACGCTTCTGCAAATGCTGTTCTATTTT
CCAGCAAGAGGCTCCCGAGCGAGCAAGCTCAGTTTACACCCGATCCACTGGGGAGCAGGAAATATCTGTG
GGCTTGTGA
Exercise 1: cont.
• From NCBI home page follow the “BLAST” link
• Then from “Genomes” select “Human”
• Paste the sequence from the previous slide into the box
(note that it is in FASTA format)
• Click on “Begin Search”
• Wait for a short while then try “Format”
Alignment format examples
• Aligned (similar) regions are in rectangles.
Exercise 2
• From http://www.ncbi.nlm.nih.gov/Structure/ go to the
MMDB summary page for 1nqp.
• Click on the colored bar for chain “A”.
• Scroll down and select the VAST neighbor 1KR7 A.
• Click on “View 3D Structure” at the top to view the
structural superposition using Cn3D.
• The sequence alignment will be shown in a separate
window.
Divergence and Convergence
• Two proteins may be similar because of divergence from
a common ancestor (i.e. they are homologous).
• … or, perhaps two proteins may be similar because they
perform similar functions and are thereby constrained,
even though they arose independently (functional
convergence hypothesis, they are then called
analogous).
Divergence vs. Convergence
• Convergence can occur; e.g. there exist enzymes with
different overall structures but remarkably similar
arrangements of active site residues to carry out a
similar function.
• So how can one establish homology?
Homology
“… whenever statistically significant sequence or structural
similarity between proteins or protein domains is
observed, this is an indication of their divergent evolution
from a common ancestor or, in other words, evidence of
homology.”
E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function,
Kluwer 2003
Two arguments…
• We see a continuous range of sequence similarity.
Convergence is extremely unlikely for highly similar
protein families. It then appears implausible to invoke it
for less similar families.
• The same or very similar functions may be carried out by
proteins with very different structures (folds). Therefore,
functional constraints cannot force convergence of the
sequence or structure between proteins.
Alignments of hemoglobins: human-wolf
(83%); human-chicken (59%)
Alignments of hemoglobins: human-hagfish
(24%); human-worm (17%)
Sequence identity for VAST neighbors of
1NQP A (a globin)
Issues in sequence alignment…
• How do you quantify amino acid similarity?
• How can you handle gaps in the alignment?
• How do you define the overall similarity score?
• Can you compute an optimal alignment?
• Can you compute an alignment efficiently?
• Can you calculate statistical significance?
Global and local alignment
• Alignment programs may be modified, e.g. by scoring
parameters, to produce global or local alignments.
• Local alignments tend to be more useful, as it is highly
possible that a significant similarity may encompass only
a portion of one or both sequences being compared.
Dynamic Programming overview
• Computational algorithmic method devised in the 1940’s.
• An efficient method to find optimal solutions for certain
classes of problems.
• Particularly useful in bioinformatics for protein sequence
comparison, etc.
Characteristics of problems that can be
solved by DP…
• Optimal substructure: The optimal solution may be
constructed from optimal solutions to subproblems.
• Overlapping subproblems: To compute the optimal
solution we only need to consider a “small” number of
subproblems.
Example: Longest Common Subsequence
(LCS)
• Given two sequences of symbols, find a longest
subsequence that occurs in both sequences.
• seq1: abcabddcbaabb
• seq2: bdacbaabcbaa
• abcabddcbaabb
• bdacbaabcbaa
• Here is another subsequence of length 8: bcabcbaa
Optimal substructure of an LCS
• X = x1…xm, Y = y1…yn input sequences; let Z = z1…zk be
any LCS of X and Y
• If xm = yn then zk = xm = yn and Zk-1 is an LCS for Xm-1
and Yn-1
• If xm  yn and zk  xm then Z is an LCS for Xm-1 and Y
• If xm  yn and zk  yn then Z is an LCS for X and Yn-1
• (Note: Xm-1 = x1…xm-1, etc.)
Computing the length of an LCS
•
•
•
•
sij = length of an LCS for Xi and Yj
If xi = yj then sij = si-1,j-1 + 1
Otherwise sij = max(si,j-1, si-1,j)
This leads to a bookkeeping scheme where we calculate
the scores by filling in an mxn table. The score at
position (i, j) depends on the score at (i-1, j-1) if the
symbols xi and yj match, otherwise it depends on the
scores at positions (i-1, j) and (i, j-1).
• An LCS can be computed by “traceback”.
DP table initializations: fill in first row and
first column
DP computation: fill in the rest of the tables
row by row (or column by column)
Traceback to find an LCS: bab
Complexity of the DP algorithm
• Time O(nm); space O(nm) where n, m are the lengths of
the two sequences.
• Space complexity can be reduced to O(n) by not storing
the entries of dynamic programming table that are no
longer needed for the computation (keep current row and
the previous row only).
Using DP for protein sequence alignment
• In protein sequence alignments there are blocks of
similarity and dissimilarity (gaps). E.g. look back at the
globin alignments earlier in the lecture.
• The scoring is more complicated because there are
“degrees of similarity” among the 20 amino acids.
• To enforce the block structure “gap penalties” have to be
introduced.
DNA and proteins
• Much more sensitive comparison is possible between
protein sequences than DNA sequences.
• One reason is that the third codon in the genetic code is
highly redundant, and introduces noise into DNA
comparisons.
• Another reason is that the physico-chemical properties of
the amino acid residues provide information that is highly
relevant to comparison.
Note the genetic code is degenerate.
Molecular Cell Biology, Lodish et al.
Fig. 3-2.
Molecular Cell Biology, Lodish et. al
Fig. 3-2.
Amino acid substitution matrices
• Ab initio approaches; e.g. assign scores based on
number of mutations needed to transform one codon to
another, or on other physico-chemical measures of a.a.
similarity.
• Empirical, i.e. derive statistics about a.a. substitutions
from collections of alignments. (These work the best).
Computation of scores, empirical approach
sij = ln(qij/(pipj))/λ
• sij is the score for substituting amino acid type i by type j.
• qij is the observed frequency of substitutions of a.a. type i
by a.a. type j (in the training set).
• pi is the background frequency for a.a. type i in the
training set.
• λ is a positive constant.
Example (numbers are fictional)
•
•
•
•
•
•
•
N = total number of a.a.’s in training set = 1000
F occurs 15 times
Y occurs 25 times
(F, Y) substitutions occur 200 times
Np = total number of subst. pairs = 10000
Then: pF = 0.015, pY = 0.025, qFY = 0.02
ln(qFY/(pFpY))  ln(53)  4
Substitution matrices
• There are many different matrices available.
• The most commonly used are the BLOSUM or PAM
series.
• BLAST defaults to the BLOSUM62 matrix. The
BLOSUM matrices have been shown to be more
sensitive than the PAM matrices for database searches.
Gap penalties
• There is no suitable theory for gap penalties.
• The most common type of gap penalty is the affine gap
penalty: g = a + bx, where a is the gap opening penalty,
b is the gap extension penalty, and x is the number of
gapped-out residues.
• Typical values, e.g. a = 10 and b = 1 for BLAST.
• If the gap penalty has different opening and extension
costs, then the DP algorithm becomes a little more
complicated (cf. Chapter 3 in Mount).
Dynamic programming algorithm for computing the
score of the optimal alignment
For a sequence S = a1, a2, …, an let Sj = a1, a2, …, aj
Align(Si,S’j) = the score of the highest scoring alignment between S1i,S2j
S(ai, a’j)= similarity score between amino acids ai and a’j
given by a scoring matrix like PAM, BLOSUM
g – gap penalty
Align(Si-1,S’j-1)+ S(ai, a’j)
Align(Si,S’j-1) - g
Align(Si-1,S’j) - g
{
Align(Si,S’j)= max
Organizing the computation – dynamic programming table
Align
j
Align[i, j] =
Align(Si,S’j) = max
i
{
Align(Si-1,S’j-1)+ s(ai, a’j)
Align(Si-1,S’j) - g
Align(Si,S’j-1) - g
+s(ai,aj)
max
Example of DP computation with
g = 0; match = 1; mismatch = 0
Maximal Common Subsequence
initialization
A
A
T
G
C
T
T
A
A
C
C
A
T
T
G
C
G
C
G
C
A
T
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
0
1
2
2
2
2
2
2
2
0
1
2
2
3
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
+1
max
Example of DP computation with
g = 2; match = 2; mismatch = -1
Initialization (penalty for starting with a gap)
A
A
T
G
C
T
T
A
A
C
C
A
T
T
G
0
-2
-4
-6
-8
-2
2
0
-2
-4
-4
0
4
2
-6
-2
2
3
-8
-4
C
-10
G
-12
C
G
C
-14
-16
-18
A
-20
T
-22
-10
-12
-14
-16
-18
+2
-20
-22
-2
-2
max
The iterative algorithm
m = |S|; n = |S’|
for i  0 to m do A[i,0]- i * g
for j  0 to n do A[0,j] - j * g
for i  0 to m do
for j  0 to n
A[i,j]max (
A[i-1,j] – g
A[i-1,j-1] + s(i,j)
A[i,j-1] – g
)
return(A[m,n])
Exercise 3
• Copy and paste the sequences for 1nqpA and 1kr7A into
a notepad.
• Go to the web site:
http://pir.georgetown.edu/pirwww/search/textpsd.shtml
• Follow the “Search and Retrieval” link, then “Pairwise
alignment”.
• Copy in your sequences (use FASTA format and remove
numbers) and then run SSEARCH (Smith-Waterman
algorithm, a DP alignment method).
SSEARCH (pir.georgetown.edu)
BLAST (Basic Local Alignment Search Tool)
• Extremely fast, can be on the order of 50-100 times
faster than Smith-Waterman.
• Method of choice for database searches.
• Statistical theory for significance of results.
• Heuristic; does not guarantee optimal results.
• Many variants, e.g. PHI-, PSI-, RPS-BLAST.
Why database searches?
• Gene finding.
• Assigning likely function to a gene.
• Identifying regulatory elements.
• Understanding genome evolution.
• Assisting in sequence assembly.
• Finding relations between genes.
Issues in database searches
• Speed.
• Relevance of the search results (selectivity).
• Recovering all information of interest (sensitivity).
– The results depend on the search parameters, e.g. gap
penalty, scoring matrix.
– Sometimes searches with more than one matrix should be
performed.
Main idea of BLAST
• Homologous sequences are likely to contain a short high
scoring similarity region, a hit. Each hit gives a seed that
BLAST tries to extend on both sides.
Some BLAST terminology…
word – substring of a sequence
word pair – pair of words of the same length
score of a word pair – score of the gapless alignment of the two
words:
VALMR
V A K N S score = 4 + 4 – 2 – 2 - 1 = 3 (BLOSUM62)
HSP – high scoring word pair
Main steps of BLAST
• Parameters: w = length of a hit; T = min. score of a hit (for
proteins: w = 3, T = 13 (BLOSUM62)).
• Step 1: Given query sequence Q, compile the list of possible
words which form with words in Q high scoring word pairs.
• Step 2: Scan database for exact matching with the list of words
compiled in step 1.
• Step 3: Extend the hits from step 2.
• Step 4: Evaluate significance of extended hits from step 3.
Step 1: Find high scoring words
• For every word x of length w in Q make a list of words that
when aligned to x score at least T.
• Example: Let x = AIV then the score for AIA is 4+4+0
(dropped) and for AII 4+4+3 (taken).
• The number of words in the list depends on w and T, and
is usually much less than 203 (typically about 50).
Step 2 – Finding hits
• Scan database for exact matching with the list of
words compiled in step1 :
• Two techniques.
– Hash table.
– Keyword tree (there is a finite-automaton based method for
exact matching with a set of patterns represented as a tree).
Step 3: Extending hits
• Parameter: X (controlled by a user).
• Extend the hits in both ways along diagonal (ungapped
alignment) until score drops more than X relative to the
best score yet attained.
• Return the score of the highest scoring segment pair
(HSP).
extensions
E-values, P-values
• E-value, Expectation value; this is the expected number
of hits of at least the given score, that you would expect
by random chance for the search database.
• P-value, Probability value; this is the probability that a hit
would attain at least the given score, by random chance
for the search database.
• E-values are easier to interpret than P-values.
• If the E-value is small enough, e.g. no more than 0.10,
then it is essentially a P-value.
Karlin-Altschul statistics
• Expected number of HSPs with score ≥ S is:
E = KmNe –λS
• m = length of query sequence
• N = database size in residues
• K scales the search space size
• λ a scale for the scoring system
The bit score
• This formula “normalizes” a raw score:
S’ = (λS – ln K)/ln 2
• The E-value is then given by:
E = mN 2 –S’
• Allows direct comparison of E-values, even with differing
scoring matrices.
Karlin and Altschul provided a theory for computing statistical
significance
• Assumptions:
– The scoring matrix M must be such that the score for a random
alignment is negative.
– n, m (lengths of the aligned sequences) are large.
– The amino acid distribution p(x) is in the query sequence and the
data base is the same.
– Positive score is possible (i.e. M has at least one positive entry).
Score of high scoring sequence pairs follows extreme value
distribution
l – decay constant
u – value of the peak
normal
P(S<x) = exp (-e –l(x-m) ) thus:
P(S>=x) =1- exp (-e –l(x-m)) )
Extreme values
Refinement of the basic algorithm-the two hit method
• Observation: HSPs of interest are long and can
contain multiple hits a relatively short distance
apart.
• Central idea: Look for non-overlapping pairs of
hits that are of distance at most d on the same
diagonal.
• Benefits:
– Can reduce word size w from 3 two 2 without losing
sensitivity (actually sensitivity of two-hit BLAST is
higher).
– Since extending a hit requires a diagonal partner,
many fewer hits are extended, resulting in
increased speed.
Gapped BLAST
• Find two non-overlapping hits of score at least T and
distance at most d from one another.
• Invoke ungapped extension.
• If the HSP generated has normalized score at least Sg
bits then start gapped extension.
• Report resulting alignment if it has sufficiently
significant (very small) E-value.
Gapped BLAST statistics
• Theory does not work.
• Simulations indicate that for the best hits scores for local
alignments follow an extreme value distribution.
• Method approximates l and m to match experimental distribution l and m can be computed from the median and variation of the
experimental distribution.
• BLAST approach – simulate the distribution for set of scoring
matrices and a number of gap penalties. BLAST offers a choice of
parameters from this pre-computed set.
Motivation for multiple alignments
• Look at a multiple alignment of several of the VAST
structure neighbors of 1h1b chain A.
Multiple Sequence Alignment
• A multiple alignment of sequences from a protein family
makes the conserved features much more apparent.
• Much more difficult to compute than pairwise alignments.
• The most commonly used and newer programs use the
“progressive alignment strategy”.
• There are also important databases of multiple
alignments for protein families.
Progressive alignment
• Determine an (approximate) phylogenetic tree for the
sequences.
• Construct the multiple alignment by merging pairwise
alignments based on the phylogenetic tree, the most
closely related sequences first, etc.
• The idea is, if you are careless about the order and
merge distantly related sequences too soon in the
process, then errors in the alignment may occur and
propagate.
Multiple sequence alignment programs
• CLUSTALW, Thompson et al. NAR 1994.
• T-Coffee (Tree-based Consistency Objective Function for
alignment Evaluation), C. Notredame et al. JMB 2000.
• MUSCLE (Multiple sequence comparison by log
expectation), R. Edgar NAR 2004.
PSI-BLAST
• Position Specific Iterated BLAST
• As a first step runs a (regular) BLAST.
• Hits that cross the threshold are used to construct a
position specific score matrix (PSSM).
• A new search is done using the PSSM to find more
remotely related sequences.
• The last two steps are iterated until convergence.
PSSM (Position Specific Score Matrix)
• One column per residue in the query sequence.
• Per-column residue frequencies are computed so that
log-odds scores may be assigned to each residue type in
each column.
• There are difficulties; e.g. pseudo-counts are needed if
there are not a lot of sequences, the sequences must be
weighted to compensate for redundancy.
Two key advantages of PSSMs
• More sensitive scoring because of improved estimates of
probabilities for a.a.’s at specific positions.
• Describes the important motifs that occur in the protein
family and therefore enhances the selectivity.
Position Specific Substitution Rates
Weakly conserved serine
Active site serine
Position Specific Score Matrix (PSSM)
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
D
G
V
I
D
S
C
N
G
D
S
G
G
P
L
N
C
Q
A
A R N
0 -2 0
-2 -1 0
-1 1 -3
-3 3 -3
-2 -5 0
4 -4 -4
-4 -7 -6
-2 0 2
-2 -3 -3
-5 -5 -2
-2 -4 -2
-3 -6 -4
-3 -6 -4
-2 -6 -6
-4
-6 -7
Active
-1 -6 0
0 -4 -5
0 1 4
-1 -1 1
D C Q E G H I L K M F P
2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1
-2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2
-3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4
-4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5
8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5
-4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1
-7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7
Serine scored differently
-1 -6 7 0 -2 0 -6 -4 2 0 -2 -5
in-4these
-4 -4
-5 7 two
-4 -7 positions
-7 -5 -4 -4 -6
9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5
-4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4
-5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6
-5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6
-5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9
-7 -5nucleophile
-5 -6 -7 0 -1 6 -6 1 0 -6
site
-6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6
-5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4
2 -5 2 0 0 0 -4 -2 1 0 0 0
3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3
S
0
-2
0
-3
1
4
-4
-1
-3
-4
7
-4
-2
-4
-6
-2
-1
-1
0
T
-1
-1
-2
0
-3
3
-4
-3
-5
-4
-2
-5
-4
-4
-5
-1
0
-1
-2
W
-6
0
-6
-1
-7
-6
-5
-3
-6
-8
-6
-6
-6
-7
-5
-6
-5
-3
-2
Y
-4
-6
-4
-4
-5
-5
0
-4
-6
-7
-5
-7
-7
-7
-4
-1
0
-3
-2
V
-1
-5
-2
0
-6
-3
-4
-3
-6
-7
-5
-7
-7
-6
0
6
0
-4
-3
Exercise: using PSI-BLAST
• Go to the NCBI BLAST web site.
• Click on the link to PHI- and PSI- BLAST.
• Choose the search database to be “pdb”.
• Enter the sequence for 1h1bA, Human Neutrophil
Elastase.
• How many iterations does it take before 1dleB (Factor B
Serine Protease domain) shows up as a significant hit?
Modifying thresholds…
• On occasion, it can prove useful to modify (increase) the
inclusion threshold parameter.
• The user can also manually select hits to include in the
PSSM that do not meet the threshold, e.g. if one is
certain they are homologs to the query.
• Of course, one must always be extremely careful when
doing so!
HMMs
• Hidden Markov Models, a type of statistical model.
• Have numerous applications (including outside of
bioinformatics).
• HMMs were used to construct Pfam, a database of
multiple alignments for protein families (HMMer).
CDD Search
• Conserved Domain Database (CDD) at NCBI.
• Contains alignments from Smart, Pfam, COG, KOG, and
LOAD databases.
• Many multiple alignments are manually curated.
• PSSMs derived from multiple alignments may be
searched with RPS-BLAST (Reverse Position Specific
BLAST).
Multiple alignment of globins from CDD
Thank you to Dr. Teresa Przytycka for
slides about dynamic programming
and BLAST.