GST II: ---Title--- - Digital Biology Laboratory

Download Report

Transcript GST II: ---Title--- - Digital Biology Laboratory

Pairwise Sequence
Comparison
Dong Xu
Computer Science Department
271C Life Sciences Center
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline
 Introduction
 Scoring
function
 Dynamic
programming
 Confidence Assessment
 Heuristic
alignment
Introduction (1)



Biological sequence comparison
 DNA-DNA, RNA-RNA
 Protein-protein
 DNA-protein
Sequence comparison is the in most
important and fundamental operation
bioinformatics
Key to understand evolution of an organism
or a gene
Introduction (2)

Applications in most bioinformatics
problems




Sequence assembly
Gene finding
Protein structure prediction
Phylogenic tree analysis

THE most popular tool: BLAST

Foundation of sequence database search
Evolutionary Basis of
Sequence Comparison
Ancestor
Orthologs
(inheritance)
Gene duplication
Paralogs
X
Random mutation
Evolutionary selection
Recombination
(fusion)
Y
75%X 25%Y Mixed Homology
An Example of
Sequence Comparison
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||
||
|| | | ||| | |||| |||||
||| |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
| | |
| |||||| |
|||| | || |
|
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG
AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||
|
||||||||| ||
|||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Alignement (1)

a correspondence between elements of
two sequences with order (topology) kept
FSEYTTHRGHR
: ::::: ::
FESYTTHRPHR
FESYTTHRGHR
:::::::: ::
FESYTTHRPHR

pairwise alignment: 2 sequences aligned

multiple alignment: alignment of 3 or more
sequences
Alignement (2)




Similar to ”longest common subsequence” (LCS)
problem for strings, (Robinson, 1938)
LCS: define a set of operations (e.g. substitution,
insertion or deletion) that transform the aligned
elements of one sequence into the corresponding
elements of the other and associate with each
operation a cost or a score.
Optimal alignment: the alignment that is associated
with the lowest cost (or highest score).
Between two sequences several optimal alignments
can be constructed with the same optimal score.
FSEY-THRGHR
: : ::: ::
FESYTTHRPHR
FSEYT-HRGHR
: :: :: ::
FESYTTHRPHR
Components of
Sequence Alignment
(2) An algorithm for
alignment;
(3) Confidence
assessment of alignment
result.
Indel
{
(1) Scoring function: a
measure of similarity
between elements
(nucleotides, amino
acids, gaps);
Insertion
Deletion
FDSK-THRGHR
:.: :: :::
FESYWTH-GHR
Match (:)
Mismatch
(substitution)
Lecture Outline
 Introduction
 Scoring
function
 Dynamic
programming
 Confidence Assessment
 Heuristic
alignment
Edit Distance
(Hamming Distance)
Introduced by Levenshtein in 1966
 Binary: match 1 / mismatch 0 (Identity Matrix)
 Definition: Minimum number of edit
operations to transform one string to another
 Can be used for DNA/RNA
 Possible edit operations

 Symbol insertion
 Symbol deletion
 Symbol substitution
Scoring Matrix
amino acid substitution matrices (20X20) account
for probability of one amino acid being substituted
for another:
frequency of substitution - genetic code
tolerance for changes - natural selection
penalize residues pairs with a low probability of
mutation in evolution and rewards pairs with a high
probability
empirically derived from observed amino acid
substitutions that occur between aligned residues in
homologous sequences
Physical Bases of
Mutation Matrix
Geometry nature
Physical nature
(charged or hydrophobic)
Chemical nature
Frequencies of amino acids
physical property matrices
PAM
The first substitution matrices derived by
Dayhoff et al. (1978)
 PAM (point accepted mutation) distance:
Two sequences are defined to have
diverged by one PAM unit if they show in
average one accepted point mutation (i.e.
one amino acid change) per hundred
amino acids.
 Derived from the pairwise alignment of
sequences less than 15% divergent.

BLOSUM

Block substitution matrices (Henikoff & Henikoff
1992)

Blocks: highly conserved regions in a set of
aligned protein sequences (local multiple
alignment)

Number of BLOSUM matrix (e.g. BLOSUM 62)
indicates the cutoff of percent identity that
defines the clusters - lower cutoffs allow more
diverse sequences
BLOSUM 62 Matrix
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
4
-1
-2
-2
0
-1
-1
0
-2
-1
-1
-1
-1
-2
-1
1
0
-3
-2
0
5
0
-2
-3
1
0
-2
0
-3
-2
2
-1
-3
-2
-1
-1
-3
-2
-3
6
1
-3
0
0
0
1
-3
-3
0
-2
-3
-2
1
0
-4
-2
-3
6
-3
0
2
-1
-1
-3
-4
-1
-3
-3
-1
0
-1
-4
-3
-3
9
-3
-4
-3
-3
-1
-1
-3
-1
-2
-3
-1
-1
-2
-2
-1
5
2
-2
0
-3
-2
1
0
-3
-1
0
-1
-2
-1
-2
5
-2
0
-3
-3
1
-2
-3
-1
0
-1
-3
-2
-2
6
-2
-4
-4
-2
-3
-3
-2
0
-2
-2
-3
-3
8
-3
-3
-1
-2
-1
-2
-1
-2
-2
2
-3
4
2
-3
1
0
-3
-2
-1
-3
-1
3
4
-2
2
0
-3
-2
-1
-2
-1
1
5
-1
-3
-1
0
-1
-3
-2
-2
5
0
-2
-1
-1
-1
-1
1
6
-4
-2
-2
1
3
-1
7
-1
-1
-4
-3
-2
4
1
-3
-2
-2
5
-2
-2
0
11
2
-3
7
-1
4
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
What Matrices to Use
Close homolog: high cutoffs for BLOSUM
(up to BLOSUM 90) or lower PAM values
BLAST default: BLOSUM 62
Remote homolog: lower cutoffs for
BLOSUM (down to BLOSUM 10) or high
PAM values (PAM 200 or PAM 250)
A best performer in structure prediction:
PAM 250
Gap Penalty Functions
 Corresponding
to insertion/deletion
in evolution
 Typically linear gap penalties
 Easy to implement in algorithms
 Satisfactory performance in alignment
 Can
be derived from alignment
 Known alignments
 Performance-based (sequence comparison)
Affine Gap Penalty
 Most
 w(k)
commonly used model
= h + gk , k  1 ,with w(0) = 0.
h: gap opening penalty; g: gap extension penalty
h > g > 0 (e.g., for PAM250, 10.8 + 0.6k)
Non-linear form: h + g log (k)
FDS--THRGHR
:.: ::::::
FESYTTHRGHR
FDS-T-HRGHR
:.: : :::::
FESYTTHRGHR
Lecture Outline
 Introduction
 Scoring
function
 Dynamic
programming
 Confidence Assessment
 Heuristic
alignment
Global vs. Local Alignment

Global alignment: the alignment of complete
sequences
 Good for comparing members of same protein family
 Needleman & Wunsch 1970 J Mol Biol 48:443

Local alignment: the alignment of segments of
sequences
 ignore areas that show little similarity
 Smith & Waterman 1981, J Mol Biol, 147:195
 modified from Needelman-Wunsh algorithm
 can be done with heuristics (FASTA and BLAST)
Dot Matrix and Alignment
Dot matrix:
Score between
cross-elements
path:
Mapping to
an alignment
AACG - GT A TGC
ATCG G GT - TGC
A AC G G T A T G C
A1 1
1
T
1 1
C
1
1
G
1
11
G
1
11
G
1 1
1
T
1
1
T
1 1
G
1 1
1
C
1
1
Dynamic Programming Steps
1. Assign scores between elements in dot matrix
2. For each cell in the dot matrix, check all
possible pathways back to the beginning of the
sequence (allowing insertions and deletions)
and give that cell the value of the maximum
scoring pathway
3. Construct an alignment (pathway) back from
the last cell in the dot matrix (or the highest
scoring) cell to give the highest scoring
alignment
Dynamic programming
Foundation: any partial subpath ending
at a point along the true optimal path must
itself be an optimal path leading up to that
point. So the optimal path can be found by
incremental extensions of optimal
subpaths
 One of the most fundemental algorithms in
bioinfomatics
 Applications: sequence comparison,
gene finding, mass-spec data analysis, ...

Needleman-Wunsch
Algorithm (1)

Global alignment

Elementary operations:
 Single insertion/deletions (s(ai,-) or s(-,bj))
 Substitution (s(ai,bj))

Easy case: h=0 for gap penalty (h+gk)
Needleman-Wunsch
Algorithm (2)
S00 = 0
i
Si 0 =  s(ak ,-)
1  i  l1
k =1
j
S0 j =  s(-, bk )
1  j  l2
k =1
Si -1, j + s(ai ,-)
Sij =max Si -1, j-1 + s (a i,bj) 1  i  l1 , 1 j  l2
Si , j -1 + s(-,bj )
The optimal score ending at i & j
Needleman-Wunsch
Algorithm (3)
Calculate S(i,j) in three ways:
•By adding a score s(ai,bj) to
the score diagonally upwards,
i.e. S(i-1,j-1);
i
A
j
S(i-1,j-1)
S(i,j-1)
A
S(i-1,j)
S(i,j)
A
T
C
•By adding a score s (-,bj) (which represents the
introduction of a gap into the alignment) to the score
vertically above (i.e. S(i,j-1);
•Or by adding s(ai,-) to the score horizontally to the left
(i.e. S(i-1,j)
A
Alignment Construction (1)
Initialization:
S(0,0) = 0
the outside row and
column are given
incrementally decreasing
values
0
A
-1
T
-2
C
-3
...
...
A
A
C
...
-1
-2
-3
...
-AT
(AAC)
(C)
Alignment Construction (2)
S(1,1) : one of three values:
(1) ai = bj, s = 1
S(i-1,j-1) + s(ai,bj)
= 0+1 = 1
(2) add s(-,bj) to S(i,j-1)
s(i,j-1) - s(-,bj) = -2
(3) add s(ai,-) to S(i-1,j)
 s(i-1,j) - s(ai,-) = -2
choose highest 1 in the cell.
A
A
C
...
0
-1
-2
-3
...
A
-1
1
T
-2
C
-3
...
...
A(AC) A-(AC) -A(AC)
A(TC) -A(TC) A-(TC)
Alignment Construction (3)
For the next cell, as ai = bj
again, s(ai,bj) = 1 and the
three possible scores are:
i,j
i, j-1
i-1, j
-1 + 1 = 0
-2 - 1 = -3
1-1=0
Two degenerate paths!
(Max=3)
A
A
C
0
-1
-2
-3
A
-1
1
0
T
-2
C
-3
...
...
...
...
Alignment Construction (4)
For the next cell, as ai  bj,
s(ai,bj) = 0. The three
possible scores are:
i,j
-2 + 0 = -2
i, j-1
-3 - 1 = -4
i-1, j
0 - 1 = -1
A
A
C
0
-1
-2
-3
A
-1
1
0
-1
T
-2
C
-3
...
...
...
...
Alignment Construction (5)
A
Trace back:
C
C
AC
TC
AAC
ATC
A
C
0
-1
-2
-3
A
-1
1
0
-1
T
-2
0
1
0
C
-3
-1
0
2
. . .
. . .
. . .
. . .
Mathematical Representation
Initialization
Si 0 = -i
S0 j = - j
Scoring
Length sequence 1
0 < i  l1
0 < j  l2
Length sequence 2
Si -1, j - 1

Sij = max  Si -1, j -1 + s (ai , bj ) 0 < i  l1 ,0 < j  l2

Si , j -1 - 1

with
1 for ai = bj

s (ai , bj ) = 
else
0
Computational Complexity of
Dynamic Programming

Computing time: O(nm), where n and m are
sequence lengths).

Retrieval time: O(Max (n,m))
[worst case: n+m; best case: Min(n,m)]

Required memory: O(nm).

Keeping in mind the computational complexity
while programming
Smith-Waterman Algorithm
S0,j = Si,0 = 0
for 0  i  l1 and 0  j  l2
Si-1,j + s (ai , - )
Sij = max
Set all values in
top row and left
column to 0
Si-1,j-1+ s (ai ,bj)
1  i  l1 , 1  j  l2
Si,j-1 + s ( - ,bj )
Set the value of Sij
to 0 if it would
otherwise be less than 0
0
Traceback from highest value of Sij, rather than from
bottom right corner. Stop at 0 rather than top left corner.
Smith-Waterman Algorithm
With Affine Gap Penalties
Dij = max
Boundary conditions:
S0,j = Si,0 = D0,j = Di,0 = I 0, j = Ii,0 = 0
Iij = max
for 0  i  l1 and 0  j  l2
Si-1,j - h - g
Di-1,j - g
Si ,j-1 - h – g
Ii,j-1 - g
Si-1,j -1 + s(ai ,bj )
D i-1,j -1 + s(ai ,bj )
Sij = max
I i-1,j -1 + s(ai ,bj )
0
for 1  i  l1 and 1  j  l2
Gap penalty: w(k) = h + gk , k  1
Alignment retrieval for SmithWaterman algorithm
GCC-UCG
GCCAUUG
A
A
U
G
C
C
A
U
U
G
A
C
G
G
C
0.0
0.0
0.0
0.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
0.0
0.0
A
1.0
1.0
0.0
0.0
0.0
0.7
2.0
0.7
0.3
0.0
1.0
0.0
0.7
0.0
G
0.0
0.7
0.8
1.0
0.0
0.0
0.7
1.7
0.3
1.3
0.0
0.7
1.0
1.7
C
0.0
0.0
0.3
0.3
2.0
1.0
0.3
0.3
1.3
0.0
1.0
1.0
0.3
0.7
C
0.0
0.0
0.0
0.0
1.3
3.0
1.7
1.3
1.0
1.0
0.3
2.0
0.7
0.3
U
0.0
0.0
0.0
0.0
0.3
1.7
2.7
2.7
2.3
1.0
0.7
0.7
1.7
0.3
C
0.0
0.0
0.0
0.7
1.0
1.3
1.3
2.3
2.3
2.0
0.7
1.7
0.3
1.3
G
0.0
0.0
0.0
1.0
0.3
1.0
1.0
1.0
2.0
3.3
2.0
1.7
2.7
1.3
C
0.0
0.0
0.0
0.0
2.0
1.3
0.7
0.7
0.7
2.0
3.0
3.0
1.7
2.3
U
0.0
0.0
1.0
0.0
0.7
1.7
1.0
1.7
1.7
1.7
1.7
2.7
2.7
1.3
U
0.0
0.0
1.0
0.7
0.3
0.3
1.3
2.0
2.7
1.3
1.3
1.3
2.3
2.3
A
1.0
1.0
0.0
0.7
0.3
0.0
1.3
1.0
1.7
2.3
2.3
1.0
1.0
2.0
G
0.0
0.7
0.7
1.0
0.3
0.0
0.0
1.0
1.0
2.7
2.0
2.0
2.0
2.0
Lecture Outline
 Introduction
 Scoring
function
 Dynamic
programming
 Confidence Assessment
 Heuristic
alignment
Confidence Assessment
of Sequence Alignment
 Why
confidence assessment is
needed
 True
homology or alignment by
chance
 Expected
probability by chance
 Statistical
models
Why not to use sequence
identity as confidence measure
What to Compare
Real but non-homologous sequences
Real sequences that are shuffled to preserve
compositional properties
Sequences that are generated randomly
based upon a DNA or protein sequence model
(analytic statistical results)
P-value or E-value
The probability that a variate would assume a
value greater than or equal to the observed value
strictly by chance P(z>zo)
If the P-value found for an alignment is low
(<0.001) then alignment is probably biologically
meaningful.
Pre-compute the parameters based on a
statistical model
Extremal Value Distribution
The maximum scores of a large number of
alignments between random sequences of equal
length tends to have an extreme value distribution
P(S’<x)=exp[-exp(-x)]
P(S’>=x)= 1- exp[-exp(-x)]
Bit scores:
S’= (S-lnK)/ln2
 and K: scaling factors
(depending on composition,
mutation matrix used etc.)
An Example
100000
alignments of
between
unrelated
proteins using
Pam250
The tail of high
SD scores
Other Issues
 Gapless
alignment vs. gapped
alignment
 Low
complexity regions
 Over-estimate
or under-estimate of
confidence level
Scalability of Software

The trend of genetic
data growth
30 billion
in year 2005


Genomes: yeast, human, rice, mouse, fly...
Software must be (linearly or close to linearly)
scalable to large datasets.
Need for Heuristic Alignment
Time complexity for optimal alignment:
O(n2) , n -- sequence length
 Given the current size of sequence
databases, use of optimal algorithms is
not practical for database search
 Heuristic techniques: BLAST, FASTA,
MUMmer, PatternHunter...
20 min (optimal alignment, SSearch)
 2min (FASTA)  20 sec (BLAST)

Ideas in Heuristics Search
 Indexing
and filtering: Google search
Good alignment includes short identical, or
similar fragments
break entire string into substrings, index the
substrings
Search for matching short substrings and use
as seed for further analysis
extend to entire string and find the most
significant local alignment segment
FASTA (1)
 Lipman
& Pearson, 1985, Science
227, 1435-1441
 Four
steps
Step 1: Identify regions of the sequences
with the highest density of matches. In this
step exact matches of a given length (by
default 2 for proteins, 6 for nucleic acids)
are determined and regions (fragments of
diagonals) with a high number of matches
selected.
FASTA (2)
FASTA (3)
Step 2: Rescan 10 regions with the highest
density of identities using the BLOSUM50
matrix. Trim the ends of region to include only
those residues contributing to the highest
score. Each region is a partial alignment
without gaps.
Step 3: If there are several regions with
scores greater than a cut off value, try to join
these regions. A score for the joined initial
regions is calculated given a penalty for each
gap.
FASTA (4)
FASTA (5)
Step 4: Select the sequences in the database
with the highest score. For each of those
sequences construct a Smith-Waterman
optimal alignment considering only the
positions that lie in a band centred on the
best initial region found.
FASTA (6)
A-FTFWSYAIGL--PSSSIVSWKSCHVLHKVLRDGHPNVLHDCQRYRSNI
|
||||||
|| || ||||
|
|
|...
| :
AIPQFWSYAIERPLNSSWIVVWKSCITTHHLMVYGNERFIQYLAS-RNTL
Step 1
Step 2
Step 3 / Step 4
BLAST (1)
Basic Alignment Search Tool (Altschul et
al, 1990, J. Mol. Biol. 215, 403-410)
 Uses word matching like FASTA
 Similarity matching of words (3 aa’s, 11
bases)

 does not require identical words.

If no words are similar, then no alignment
 won’t find matches for very short sequences
BLAST (2)
Detects alignments with optimal maximal
segment pair (MSP) score. Gaps are not
allowed.
 MSP: Highest scoring pair of segments of
identical length. A segment pair is locally
maximal if it cannot be improved by
extending or shortening the segments.
 Homologous sequences will contain a
MSP with a high score; others will be
filtered out.

BLAST (3)
BLAST (4)
BLAST (5)
Does not handle gaps well
Genome Alignment by PatternHunter
(4 seconds)
Comparison with
Blastn and MegaBlast
Blastn
MegaBlast PatternHunter
E.coli vs H.inf
716s
5s/561M
14s/68M
Arabidopsis 2 vs 4 -21720s/1087M 498s/280M
Human 21 vs 22
--5250s/417M
Human (3G) vs Mouse
20 days
Secret in PatternHunter
Seeds (length of word used)
 Blastn finds a match of length 11, then
extend from there. MegaBlast in order to
increase speed, increases this to 28.
 Dilemma: increasing seed size speeds up
but loses sensitivity; decreasing seed size
gains sensitivity but loses speed.
 Spaced Seed: PatternHunter looks for
matches of 11 nonconsecutive matches
and optimized such seeding scheme.

Super Seeds
ATTTCCGACGCGAGGGGACTTTCAGGAGAG
AGGGGACTTTC
11111111111
ATTTCCGACGCGAGGGGACTTTCAGGAGAG
GTGATGGAACAATCGAGA
101101101100110011
G*GA*GG*AC**TC**GA
Reading Assignments

Suggested reading:
 Chapter 6 in “Pavel Pevzner: Computational
Molecular Biology - An Algorithmic Approach.
MIT Press, 2000.”

Optional reading:
 http://www.people.virginia.edu/~wrp/papers/i
smb2000.pdf
Optional Assignment (1)
1.
What does 200 mean in PAM 200?
2.
What does 62 mean in BLOSUM 62?
3.
Prove Needleman-Wunsch algorithm produces
optimal global alignment.
4.
Prove the computational complexity of SmithWaterman algorithm.
5.
What is the relationship between the alignment
score and statistical significance of the alignment?
Optional Assignment (2)
Explain the difference between PAM40 and PAM250.
Why some elements in the matrix have different signs?
Optional Assignment (3)
Construct dynamic
program matrix using
edit distance and
PAM250 distance,
respectively.
1. Try different affine
gap penalties, h=g or
not
2. Try global and local
alignment.
M P R C L C T
M
P
C
L
W
C
Q
Project Assignment (1)
Develop a program that can perform optimal
global-local alignment for DNA
sequences:
1.
Fit a short sequence into a long sequence and output
ALL optimal alignments.
2.
No gap penalty when deleting terminal bases of the
long sequence, but with gap penalty for deleting any
base of the short sequence.
3.
Use edit distance (match 1; otherwise 0) with gap
penalty –1 – k (k is gap size).
Project Assignment (2)
4.
Use the FASTA format for input of each sequence
(see http://www.g2l.bio.unigoettingen.de/blast/fastades.html).
5.
Test on sample sequences, e.g.,
TTTGAGCCTCTGTTTGTGTGTAATTGAT-GTGCATGTGTGGG
|| |||||| || ||||
................TG-GTAATTAATCATGCAC.......
The above format can be used for your outputs.