GST II: ---Title--- - Digital Biology Laboratory
Download
Report
Transcript GST II: ---Title--- - Digital Biology Laboratory
Pairwise Sequence
Comparison
Dong Xu
Computer Science Department
271C Life Sciences Center
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline
Introduction
Scoring
function
Dynamic
programming
Confidence Assessment
Heuristic
alignment
Introduction (1)
Biological sequence comparison
DNA-DNA, RNA-RNA
Protein-protein
DNA-protein
Sequence comparison is the in most
important and fundamental operation
bioinformatics
Key to understand evolution of an organism
or a gene
Introduction (2)
Applications in most bioinformatics
problems
Sequence assembly
Gene finding
Protein structure prediction
Phylogenic tree analysis
THE most popular tool: BLAST
Foundation of sequence database search
Evolutionary Basis of
Sequence Comparison
Ancestor
Orthologs
(inheritance)
Gene duplication
Paralogs
X
Random mutation
Evolutionary selection
Recombination
(fusion)
Y
75%X 25%Y Mixed Homology
An Example of
Sequence Comparison
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||
||
|| | | ||| | |||| |||||
||| |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
| | |
| |||||| |
|||| | || |
|
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG
AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||
|
||||||||| ||
|||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Alignement (1)
a correspondence between elements of
two sequences with order (topology) kept
FSEYTTHRGHR
: ::::: ::
FESYTTHRPHR
FESYTTHRGHR
:::::::: ::
FESYTTHRPHR
pairwise alignment: 2 sequences aligned
multiple alignment: alignment of 3 or more
sequences
Alignement (2)
Similar to ”longest common subsequence” (LCS)
problem for strings, (Robinson, 1938)
LCS: define a set of operations (e.g. substitution,
insertion or deletion) that transform the aligned
elements of one sequence into the corresponding
elements of the other and associate with each
operation a cost or a score.
Optimal alignment: the alignment that is associated
with the lowest cost (or highest score).
Between two sequences several optimal alignments
can be constructed with the same optimal score.
FSEY-THRGHR
: : ::: ::
FESYTTHRPHR
FSEYT-HRGHR
: :: :: ::
FESYTTHRPHR
Components of
Sequence Alignment
(2) An algorithm for
alignment;
(3) Confidence
assessment of alignment
result.
Indel
{
(1) Scoring function: a
measure of similarity
between elements
(nucleotides, amino
acids, gaps);
Insertion
Deletion
FDSK-THRGHR
:.: :: :::
FESYWTH-GHR
Match (:)
Mismatch
(substitution)
Lecture Outline
Introduction
Scoring
function
Dynamic
programming
Confidence Assessment
Heuristic
alignment
Edit Distance
(Hamming Distance)
Introduced by Levenshtein in 1966
Binary: match 1 / mismatch 0 (Identity Matrix)
Definition: Minimum number of edit
operations to transform one string to another
Can be used for DNA/RNA
Possible edit operations
Symbol insertion
Symbol deletion
Symbol substitution
Scoring Matrix
amino acid substitution matrices (20X20) account
for probability of one amino acid being substituted
for another:
frequency of substitution - genetic code
tolerance for changes - natural selection
penalize residues pairs with a low probability of
mutation in evolution and rewards pairs with a high
probability
empirically derived from observed amino acid
substitutions that occur between aligned residues in
homologous sequences
Physical Bases of
Mutation Matrix
Geometry nature
Physical nature
(charged or hydrophobic)
Chemical nature
Frequencies of amino acids
physical property matrices
PAM
The first substitution matrices derived by
Dayhoff et al. (1978)
PAM (point accepted mutation) distance:
Two sequences are defined to have
diverged by one PAM unit if they show in
average one accepted point mutation (i.e.
one amino acid change) per hundred
amino acids.
Derived from the pairwise alignment of
sequences less than 15% divergent.
BLOSUM
Block substitution matrices (Henikoff & Henikoff
1992)
Blocks: highly conserved regions in a set of
aligned protein sequences (local multiple
alignment)
Number of BLOSUM matrix (e.g. BLOSUM 62)
indicates the cutoff of percent identity that
defines the clusters - lower cutoffs allow more
diverse sequences
BLOSUM 62 Matrix
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
4
-1
-2
-2
0
-1
-1
0
-2
-1
-1
-1
-1
-2
-1
1
0
-3
-2
0
5
0
-2
-3
1
0
-2
0
-3
-2
2
-1
-3
-2
-1
-1
-3
-2
-3
6
1
-3
0
0
0
1
-3
-3
0
-2
-3
-2
1
0
-4
-2
-3
6
-3
0
2
-1
-1
-3
-4
-1
-3
-3
-1
0
-1
-4
-3
-3
9
-3
-4
-3
-3
-1
-1
-3
-1
-2
-3
-1
-1
-2
-2
-1
5
2
-2
0
-3
-2
1
0
-3
-1
0
-1
-2
-1
-2
5
-2
0
-3
-3
1
-2
-3
-1
0
-1
-3
-2
-2
6
-2
-4
-4
-2
-3
-3
-2
0
-2
-2
-3
-3
8
-3
-3
-1
-2
-1
-2
-1
-2
-2
2
-3
4
2
-3
1
0
-3
-2
-1
-3
-1
3
4
-2
2
0
-3
-2
-1
-2
-1
1
5
-1
-3
-1
0
-1
-3
-2
-2
5
0
-2
-1
-1
-1
-1
1
6
-4
-2
-2
1
3
-1
7
-1
-1
-4
-3
-2
4
1
-3
-2
-2
5
-2
-2
0
11
2
-3
7
-1
4
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
What Matrices to Use
Close homolog: high cutoffs for BLOSUM
(up to BLOSUM 90) or lower PAM values
BLAST default: BLOSUM 62
Remote homolog: lower cutoffs for
BLOSUM (down to BLOSUM 10) or high
PAM values (PAM 200 or PAM 250)
A best performer in structure prediction:
PAM 250
Gap Penalty Functions
Corresponding
to insertion/deletion
in evolution
Typically linear gap penalties
Easy to implement in algorithms
Satisfactory performance in alignment
Can
be derived from alignment
Known alignments
Performance-based (sequence comparison)
Affine Gap Penalty
Most
w(k)
commonly used model
= h + gk , k 1 ,with w(0) = 0.
h: gap opening penalty; g: gap extension penalty
h > g > 0 (e.g., for PAM250, 10.8 + 0.6k)
Non-linear form: h + g log (k)
FDS--THRGHR
:.: ::::::
FESYTTHRGHR
FDS-T-HRGHR
:.: : :::::
FESYTTHRGHR
Lecture Outline
Introduction
Scoring
function
Dynamic
programming
Confidence Assessment
Heuristic
alignment
Global vs. Local Alignment
Global alignment: the alignment of complete
sequences
Good for comparing members of same protein family
Needleman & Wunsch 1970 J Mol Biol 48:443
Local alignment: the alignment of segments of
sequences
ignore areas that show little similarity
Smith & Waterman 1981, J Mol Biol, 147:195
modified from Needelman-Wunsh algorithm
can be done with heuristics (FASTA and BLAST)
Dot Matrix and Alignment
Dot matrix:
Score between
cross-elements
path:
Mapping to
an alignment
AACG - GT A TGC
ATCG G GT - TGC
A AC G G T A T G C
A1 1
1
T
1 1
C
1
1
G
1
11
G
1
11
G
1 1
1
T
1
1
T
1 1
G
1 1
1
C
1
1
Dynamic Programming Steps
1. Assign scores between elements in dot matrix
2. For each cell in the dot matrix, check all
possible pathways back to the beginning of the
sequence (allowing insertions and deletions)
and give that cell the value of the maximum
scoring pathway
3. Construct an alignment (pathway) back from
the last cell in the dot matrix (or the highest
scoring) cell to give the highest scoring
alignment
Dynamic programming
Foundation: any partial subpath ending
at a point along the true optimal path must
itself be an optimal path leading up to that
point. So the optimal path can be found by
incremental extensions of optimal
subpaths
One of the most fundemental algorithms in
bioinfomatics
Applications: sequence comparison,
gene finding, mass-spec data analysis, ...
Needleman-Wunsch
Algorithm (1)
Global alignment
Elementary operations:
Single insertion/deletions (s(ai,-) or s(-,bj))
Substitution (s(ai,bj))
Easy case: h=0 for gap penalty (h+gk)
Needleman-Wunsch
Algorithm (2)
S00 = 0
i
Si 0 = s(ak ,-)
1 i l1
k =1
j
S0 j = s(-, bk )
1 j l2
k =1
Si -1, j + s(ai ,-)
Sij =max Si -1, j-1 + s (a i,bj) 1 i l1 , 1 j l2
Si , j -1 + s(-,bj )
The optimal score ending at i & j
Needleman-Wunsch
Algorithm (3)
Calculate S(i,j) in three ways:
•By adding a score s(ai,bj) to
the score diagonally upwards,
i.e. S(i-1,j-1);
i
A
j
S(i-1,j-1)
S(i,j-1)
A
S(i-1,j)
S(i,j)
A
T
C
•By adding a score s (-,bj) (which represents the
introduction of a gap into the alignment) to the score
vertically above (i.e. S(i,j-1);
•Or by adding s(ai,-) to the score horizontally to the left
(i.e. S(i-1,j)
A
Alignment Construction (1)
Initialization:
S(0,0) = 0
the outside row and
column are given
incrementally decreasing
values
0
A
-1
T
-2
C
-3
...
...
A
A
C
...
-1
-2
-3
...
-AT
(AAC)
(C)
Alignment Construction (2)
S(1,1) : one of three values:
(1) ai = bj, s = 1
S(i-1,j-1) + s(ai,bj)
= 0+1 = 1
(2) add s(-,bj) to S(i,j-1)
s(i,j-1) - s(-,bj) = -2
(3) add s(ai,-) to S(i-1,j)
s(i-1,j) - s(ai,-) = -2
choose highest 1 in the cell.
A
A
C
...
0
-1
-2
-3
...
A
-1
1
T
-2
C
-3
...
...
A(AC) A-(AC) -A(AC)
A(TC) -A(TC) A-(TC)
Alignment Construction (3)
For the next cell, as ai = bj
again, s(ai,bj) = 1 and the
three possible scores are:
i,j
i, j-1
i-1, j
-1 + 1 = 0
-2 - 1 = -3
1-1=0
Two degenerate paths!
(Max=3)
A
A
C
0
-1
-2
-3
A
-1
1
0
T
-2
C
-3
...
...
...
...
Alignment Construction (4)
For the next cell, as ai bj,
s(ai,bj) = 0. The three
possible scores are:
i,j
-2 + 0 = -2
i, j-1
-3 - 1 = -4
i-1, j
0 - 1 = -1
A
A
C
0
-1
-2
-3
A
-1
1
0
-1
T
-2
C
-3
...
...
...
...
Alignment Construction (5)
A
Trace back:
C
C
AC
TC
AAC
ATC
A
C
0
-1
-2
-3
A
-1
1
0
-1
T
-2
0
1
0
C
-3
-1
0
2
. . .
. . .
. . .
. . .
Mathematical Representation
Initialization
Si 0 = -i
S0 j = - j
Scoring
Length sequence 1
0 < i l1
0 < j l2
Length sequence 2
Si -1, j - 1
Sij = max Si -1, j -1 + s (ai , bj ) 0 < i l1 ,0 < j l2
Si , j -1 - 1
with
1 for ai = bj
s (ai , bj ) =
else
0
Computational Complexity of
Dynamic Programming
Computing time: O(nm), where n and m are
sequence lengths).
Retrieval time: O(Max (n,m))
[worst case: n+m; best case: Min(n,m)]
Required memory: O(nm).
Keeping in mind the computational complexity
while programming
Smith-Waterman Algorithm
S0,j = Si,0 = 0
for 0 i l1 and 0 j l2
Si-1,j + s (ai , - )
Sij = max
Set all values in
top row and left
column to 0
Si-1,j-1+ s (ai ,bj)
1 i l1 , 1 j l2
Si,j-1 + s ( - ,bj )
Set the value of Sij
to 0 if it would
otherwise be less than 0
0
Traceback from highest value of Sij, rather than from
bottom right corner. Stop at 0 rather than top left corner.
Smith-Waterman Algorithm
With Affine Gap Penalties
Dij = max
Boundary conditions:
S0,j = Si,0 = D0,j = Di,0 = I 0, j = Ii,0 = 0
Iij = max
for 0 i l1 and 0 j l2
Si-1,j - h - g
Di-1,j - g
Si ,j-1 - h – g
Ii,j-1 - g
Si-1,j -1 + s(ai ,bj )
D i-1,j -1 + s(ai ,bj )
Sij = max
I i-1,j -1 + s(ai ,bj )
0
for 1 i l1 and 1 j l2
Gap penalty: w(k) = h + gk , k 1
Alignment retrieval for SmithWaterman algorithm
GCC-UCG
GCCAUUG
A
A
U
G
C
C
A
U
U
G
A
C
G
G
C
0.0
0.0
0.0
0.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
0.0
0.0
A
1.0
1.0
0.0
0.0
0.0
0.7
2.0
0.7
0.3
0.0
1.0
0.0
0.7
0.0
G
0.0
0.7
0.8
1.0
0.0
0.0
0.7
1.7
0.3
1.3
0.0
0.7
1.0
1.7
C
0.0
0.0
0.3
0.3
2.0
1.0
0.3
0.3
1.3
0.0
1.0
1.0
0.3
0.7
C
0.0
0.0
0.0
0.0
1.3
3.0
1.7
1.3
1.0
1.0
0.3
2.0
0.7
0.3
U
0.0
0.0
0.0
0.0
0.3
1.7
2.7
2.7
2.3
1.0
0.7
0.7
1.7
0.3
C
0.0
0.0
0.0
0.7
1.0
1.3
1.3
2.3
2.3
2.0
0.7
1.7
0.3
1.3
G
0.0
0.0
0.0
1.0
0.3
1.0
1.0
1.0
2.0
3.3
2.0
1.7
2.7
1.3
C
0.0
0.0
0.0
0.0
2.0
1.3
0.7
0.7
0.7
2.0
3.0
3.0
1.7
2.3
U
0.0
0.0
1.0
0.0
0.7
1.7
1.0
1.7
1.7
1.7
1.7
2.7
2.7
1.3
U
0.0
0.0
1.0
0.7
0.3
0.3
1.3
2.0
2.7
1.3
1.3
1.3
2.3
2.3
A
1.0
1.0
0.0
0.7
0.3
0.0
1.3
1.0
1.7
2.3
2.3
1.0
1.0
2.0
G
0.0
0.7
0.7
1.0
0.3
0.0
0.0
1.0
1.0
2.7
2.0
2.0
2.0
2.0
Lecture Outline
Introduction
Scoring
function
Dynamic
programming
Confidence Assessment
Heuristic
alignment
Confidence Assessment
of Sequence Alignment
Why
confidence assessment is
needed
True
homology or alignment by
chance
Expected
probability by chance
Statistical
models
Why not to use sequence
identity as confidence measure
What to Compare
Real but non-homologous sequences
Real sequences that are shuffled to preserve
compositional properties
Sequences that are generated randomly
based upon a DNA or protein sequence model
(analytic statistical results)
P-value or E-value
The probability that a variate would assume a
value greater than or equal to the observed value
strictly by chance P(z>zo)
If the P-value found for an alignment is low
(<0.001) then alignment is probably biologically
meaningful.
Pre-compute the parameters based on a
statistical model
Extremal Value Distribution
The maximum scores of a large number of
alignments between random sequences of equal
length tends to have an extreme value distribution
P(S’<x)=exp[-exp(-x)]
P(S’>=x)= 1- exp[-exp(-x)]
Bit scores:
S’= (S-lnK)/ln2
and K: scaling factors
(depending on composition,
mutation matrix used etc.)
An Example
100000
alignments of
between
unrelated
proteins using
Pam250
The tail of high
SD scores
Other Issues
Gapless
alignment vs. gapped
alignment
Low
complexity regions
Over-estimate
or under-estimate of
confidence level
Scalability of Software
The trend of genetic
data growth
30 billion
in year 2005
Genomes: yeast, human, rice, mouse, fly...
Software must be (linearly or close to linearly)
scalable to large datasets.
Need for Heuristic Alignment
Time complexity for optimal alignment:
O(n2) , n -- sequence length
Given the current size of sequence
databases, use of optimal algorithms is
not practical for database search
Heuristic techniques: BLAST, FASTA,
MUMmer, PatternHunter...
20 min (optimal alignment, SSearch)
2min (FASTA) 20 sec (BLAST)
Ideas in Heuristics Search
Indexing
and filtering: Google search
Good alignment includes short identical, or
similar fragments
break entire string into substrings, index the
substrings
Search for matching short substrings and use
as seed for further analysis
extend to entire string and find the most
significant local alignment segment
FASTA (1)
Lipman
& Pearson, 1985, Science
227, 1435-1441
Four
steps
Step 1: Identify regions of the sequences
with the highest density of matches. In this
step exact matches of a given length (by
default 2 for proteins, 6 for nucleic acids)
are determined and regions (fragments of
diagonals) with a high number of matches
selected.
FASTA (2)
FASTA (3)
Step 2: Rescan 10 regions with the highest
density of identities using the BLOSUM50
matrix. Trim the ends of region to include only
those residues contributing to the highest
score. Each region is a partial alignment
without gaps.
Step 3: If there are several regions with
scores greater than a cut off value, try to join
these regions. A score for the joined initial
regions is calculated given a penalty for each
gap.
FASTA (4)
FASTA (5)
Step 4: Select the sequences in the database
with the highest score. For each of those
sequences construct a Smith-Waterman
optimal alignment considering only the
positions that lie in a band centred on the
best initial region found.
FASTA (6)
A-FTFWSYAIGL--PSSSIVSWKSCHVLHKVLRDGHPNVLHDCQRYRSNI
|
||||||
|| || ||||
|
|
|...
| :
AIPQFWSYAIERPLNSSWIVVWKSCITTHHLMVYGNERFIQYLAS-RNTL
Step 1
Step 2
Step 3 / Step 4
BLAST (1)
Basic Alignment Search Tool (Altschul et
al, 1990, J. Mol. Biol. 215, 403-410)
Uses word matching like FASTA
Similarity matching of words (3 aa’s, 11
bases)
does not require identical words.
If no words are similar, then no alignment
won’t find matches for very short sequences
BLAST (2)
Detects alignments with optimal maximal
segment pair (MSP) score. Gaps are not
allowed.
MSP: Highest scoring pair of segments of
identical length. A segment pair is locally
maximal if it cannot be improved by
extending or shortening the segments.
Homologous sequences will contain a
MSP with a high score; others will be
filtered out.
BLAST (3)
BLAST (4)
BLAST (5)
Does not handle gaps well
Genome Alignment by PatternHunter
(4 seconds)
Comparison with
Blastn and MegaBlast
Blastn
MegaBlast PatternHunter
E.coli vs H.inf
716s
5s/561M
14s/68M
Arabidopsis 2 vs 4 -21720s/1087M 498s/280M
Human 21 vs 22
--5250s/417M
Human (3G) vs Mouse
20 days
Secret in PatternHunter
Seeds (length of word used)
Blastn finds a match of length 11, then
extend from there. MegaBlast in order to
increase speed, increases this to 28.
Dilemma: increasing seed size speeds up
but loses sensitivity; decreasing seed size
gains sensitivity but loses speed.
Spaced Seed: PatternHunter looks for
matches of 11 nonconsecutive matches
and optimized such seeding scheme.
Super Seeds
ATTTCCGACGCGAGGGGACTTTCAGGAGAG
AGGGGACTTTC
11111111111
ATTTCCGACGCGAGGGGACTTTCAGGAGAG
GTGATGGAACAATCGAGA
101101101100110011
G*GA*GG*AC**TC**GA
Reading Assignments
Suggested reading:
Chapter 6 in “Pavel Pevzner: Computational
Molecular Biology - An Algorithmic Approach.
MIT Press, 2000.”
Optional reading:
http://www.people.virginia.edu/~wrp/papers/i
smb2000.pdf
Optional Assignment (1)
1.
What does 200 mean in PAM 200?
2.
What does 62 mean in BLOSUM 62?
3.
Prove Needleman-Wunsch algorithm produces
optimal global alignment.
4.
Prove the computational complexity of SmithWaterman algorithm.
5.
What is the relationship between the alignment
score and statistical significance of the alignment?
Optional Assignment (2)
Explain the difference between PAM40 and PAM250.
Why some elements in the matrix have different signs?
Optional Assignment (3)
Construct dynamic
program matrix using
edit distance and
PAM250 distance,
respectively.
1. Try different affine
gap penalties, h=g or
not
2. Try global and local
alignment.
M P R C L C T
M
P
C
L
W
C
Q
Project Assignment (1)
Develop a program that can perform optimal
global-local alignment for DNA
sequences:
1.
Fit a short sequence into a long sequence and output
ALL optimal alignments.
2.
No gap penalty when deleting terminal bases of the
long sequence, but with gap penalty for deleting any
base of the short sequence.
3.
Use edit distance (match 1; otherwise 0) with gap
penalty –1 – k (k is gap size).
Project Assignment (2)
4.
Use the FASTA format for input of each sequence
(see http://www.g2l.bio.unigoettingen.de/blast/fastades.html).
5.
Test on sample sequences, e.g.,
TTTGAGCCTCTGTTTGTGTGTAATTGAT-GTGCATGTGTGGG
|| |||||| || ||||
................TG-GTAATTAATCATGCAC.......
The above format can be used for your outputs.