Transcript Slides
CS 5263 Bioinformatics
Lecture 7: Heuristic Sequence
Alignment Algorithms (BLAST)
Roadmap
• Last lecture review
– Sequence alignment statistics
• Today
– Gene finding by alignment
– Heuristic alignment algorithms
• Basic Local Alignment Search Tools
Sequence Alignment Statistics
• Substitution matrices
– How is the BLOSUM matrix made
– How to make your own substitution matrix
– What’s the meaning of an arbitrary
substitution matrix
• Significance of sequence alignments
– P-value estimation for
• Global alignment scores
• Local alignment scores
What is a p-value?
• The probability of observing an effect as
strong or stronger than you observed,
given the null hypothesis. i.e., “How likely
is this effect to occur by chance?”
• Pr(x > S | null)
What is a null-hypothesis?
• A statistician’s way of characterizing
“chance.”
• Generally, a mathematical model of
randomness with respect to a particular
set of observations.
• The purpose of most statistical tests is to
determine whether the observed data can
be explained by the null hypothesis.
For sequence alignment
• Your null hypothesis is “the two sequences
are unrelated”
• Your alternative hypothesis is “the two
sequences are related”
• You obtained a score S
– how likely that you can obtain such a score if
the null hypothesis is true?
How to test that null-hypothesis?
• Randomly generate some pairs of
“unrelated” sequences
• See what alignment scores you may get
for those “unrelated” sequences
• Must keep other factors in mind
– Your random sequences must be as close as
possible to your true sequences
– Except that they are “unrelated” (i.e., not from
a common ancestor)
Possible ways to get unrelated
sequences
• Which is better?
– Randomly pick some sequences from a
database
– Randomly pick some sequences from a
database and truncate to the same length as
your real sequences
– Generate random sequences according to the
frequency that each letter is used by your real
sequences
– Randomly shuffle your sequences
Possible ways to get random
sequences
• Which is better?
– Randomly pick some sequences from a
database
– Randomly pick some sequences from a
database and truncate to the same length as
your real sequences
– Generate random sequences according to the
frequency that each letter is used by your real
sequences
– Randomly shuffle your sequences
• Random shuffling is what we do to
estimate p-values for global alignment
Mouse HEXA
Human HEXA
Score = 732
……………………………………………………
45
40
35
Number of Sequences
30
25
20
15
10
732
5
0
-200
-100
0
100
200
300
400
Alignment Score
500
600
700
800
Distribution of the alignment scores between mouse HEXA
and 200 randomly shuffled human HEXA sequences
P-value: less than 1/200 = 0.005
• Advantages
– Easy to implement
– You don’t need to know a lot of theories to do this
• Disadvantages
– Slow
– Cannot estimate small p-values
– If we had repeated 1,000 times, would we get a score
as high 732?
• Based on what I’ve already seen, I would guess probably no
• What about 1,000,000 times?
When theory exists
• It gets much better
• You don’t really need to go there to know
what’s there
• (I know roughly how many times you can
get a score as high as 732 if you repeat
your experiments a billion times…)
• That is what happened for local alignment
• For ungapped alignment between two
sequences of lengths M and N
E(S) = KMN exp[-S]
(Expected value, E-value. )
• K, depends on sequence composition and
substitution matrix
– Can be obtained either empirically or analytically
S
Px S 1 exp E ( S ) 1 exp KMNe
E ( S ) when P is small.
Distribution of alignment scores
for 1000 random sequence pairs
Extreme value distribution
400
350
Number of Sequences
300
250
200
150
100
50
0
Theory says my score distribution
should have this shape
9
10
11
12
13
14
15
Alignment Score
16
17
18
My experiment shows me that the
theory seems correct
Example
• You are aligning two sequences, each has
1000 bases
• m = 1, s = -1, d = -inf (ungapped alignment)
• You obtain a score 20
• Is this score significant?
•
•
•
•
•
= ln3 = 1.1
E(S) = K MN exp{- S}
E(20) = 0.1 * 1000 * 1000 * 3-20 = 3 x 10-5
P-value = 3 x 10-5 << 0.05
The alignment is significant
400
350
Number of Sequences
300
250
200
150
100
20
50
0
9
10
11
12
13
14
15
Alignment Score
16
17
18
Distribution of 1000 random sequence pairs
Multiple-testing problem
• You are searching a 1000-base sequence
against a database of 106 sequences (average
length 1000 bases)
• You get a score 20
• You are essentially comparing 1000 bases with
1000x106 = 109 bases (ignore edge effect)
• E(20) = 0.1 * 1000 * 109 * 3-20 = 30
• By chance we would expect to see 30 matches
• P-value = 1 – e-30 = 0.9999999999
• Not significant at all
A better way to understand p-value
• Your p-value is 0.99999
• You have very low confidence (<0.00001)
to say that the null hypothesis is wrong
• Is the null hypothesis true then (i.e., the
two sequences are unrelated)?
– You don’t know
– Your test was not designed to tell you that
In practice
• You search the sequence against a database of
106 sequences
• You get 35 matches
• You expect to get about 30 by chance
• It could be all 35 are real, or none, or some
• You already reduced your target from 106
sequences to 35 sequences
• Take all 35 sequences with caution. Look for
other evidences
Statistics for gapped local
alignment
• Theory not well developed
• Extreme value distribution works well
empirically
• Need to estimate K and empirically
Exercising FSA
• How do you make an FSA for the
Needleman-Wunsch algorithm?
Exercising FSA
• How do you make an FSA for the
Needleman-Wunsch algorithm?
(-, yj)/d
(xi,yj) /
Ix
(-, yj) / d
(xi,yj) /
(xi,-)/d
(-, yj) / d
F
(xi,-) / d
(xi,yj) /
Iy
(xi,-)/d
Simplify
(xi,yj) /
(xi,-) / d
F
(xi,yj) /
(-, yj) / d
(xi,-) / d
I
(-, yj) / d
Simplify more
(xi,yj) /
F(i-1, j-1) + (xi, yj)
F(i,j) = max F(i-1, j) + d
F(i, j-1) + d
F
(xi,-) / d
(-, yj) / d
A more difficult alignment problem
• (A gene finder indeed!)
• X is a genomic sequence (DNA)
– X encodes a gene
– May contain introns
• Y is an ORF from another species
– Contains only exons
• We want to compare X against Y
– Conservation is on the level of amino acids
DNA
intron
intron
Pre-mRNA
5’ UTR exon
exon 3’ UTR
exon
Splice
Mature mRNA
(mRNA)
Open reading
frame (ORF)
Start codon
Stop codon
• We have a predicted gene
• We know the positions of the start codon and
stop codon
• But we don’t know where are the splicing sites
– Not even the number of introns
intron
exon
Start
codon
intron
exon
exon
Stop
codon
1. Most splicing sites start at GT and end at AG
2. But there are lots of GT and AG in the sequence
3. Aligning to a orthologous gene with known ORF may help
us determine the splicing sites
• Orthologous genes: two genes evolved from the same
ancestor
• Coding region are likely conserved on amino acid level
• UUA, UUG encode the same amino acid
• So do UCA, UCU, UCG, UCC
GT…………AG
Mouse putative
gene
human ORF
The Genetic Code
Third
letter
If know where are the exons
• Easy
Mouse putative
gene
Remove introns
Mouse putative ORF
translate
Global alignment
translate
human ORF
Or directly align triplets
Mouse putative
gene
Remove introns
Mouse putative ORF
Global alignment
human ORF
Codon substitution scores
AAA
AAG
AAU
AAC
AAA
4
3
-1
AAG
3
4
AAU
-1
AAC
…
…
…
UCU
UCC
-1
-1
-1
-1
-1
-1
-1
-1
4
3
1
1
-1
-1
3
4
1
1
UCU
-1
-1
1
1
4
3
UCC
-1
-1
1
1
3
4
…
…
…
64 x 64 substitution matrix
FSA for aligning genomic DNA to
ORF
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) /
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
1. We don’t know exactly where are the splicing sites
2. Length of introns may not be a multiple of 3
- If convert the whole seq into triplets, may result in
ORF shift
17 bases?
Mouse putative
gene
human ORF
Model introns
1. Most splicing sites start at GT and end at AG
2. For simplicity, assume length of exon is a multiple of 3
• Not true in reality
• Only a little more work without this assumption
GT…………AG
Mouse putative
gene
126 nt
= 42 aa
human ORF
120 nt
= 40 aa
Aligning genomic DNA to ORF
Fixed cost to have an intron
Alignment with
Affine gap
penalty
FSA for aligning genomic DNA to
ORF
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(xi-2xi-1xi, yj-2yj-1yj) /
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
Considering only exons
FSA for aligning genomic DNA to
ORF
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
C
FSA for aligning genomic DNA to
ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(-, yi) / 0
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
C
FSA for aligning genomic DNA to
ORF
Continue in intron
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(-, yi) / 0
Start an intron
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
C
B
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
(-, AG) / s
Close an intron
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(-, yj) / 0
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
B
(-, AG) / s
A(i-3,j-3) + (xi-2xi-1xi, yj-2yj-1yj)
A(i, j) = max
B(i-3,j-3) + (xi-2xi-1xi, yj-2yj-1yj)
C(i, j-2) + s, if yj-1yj == ‘AG’
C
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(-, yj) / 0
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
B
(-, AG) / s
A(i, j-3) + d
A(i-3, j) + d
B(i, j) = max
B(i, j-3) + e
B(i-3, j) + e
C
(xi-2xi-1xi, yj-2yj-1yj) /
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e
(-, yj) / 0
(xi-2xi-1xi, yj-2yj-1yj) /
(-, GT) / s
A
(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d
B
(-, AG) / s
B(i, j-2) + s, if yj-1yj == ‘GT’
C(i, j) = max
C(i, j-1)
C
ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA
• There is a close relationship between dynamic
programming, FSA, regular expression, and
regular grammar
• Using FSA, you can design more complex
alignment algorithms
• If you can draw the state diagram for a problem,
it can be easily formulated into a DP problem
– In particular, Hidden Markov Models
– Will discuss more in a few weeks
Heuristic Local Aligners
BLAST and alike
State of biological databases
Sequenced Genomes:
Human
Mouse
Neurospora
Tetraodon
Drosophila
Rice
sea squirts
3109
2.7109
4107
3108
1.2108
1.0109
1.6108
Current rate of sequencing:
4 big labs 3 109 bp /year/lab
10s small labs
Yeast
Rat
Fugu fish
Mosquito
Worm
Arabidopsis
1.2107
2.6109
3.3108
2.8108
1.0108
1.2108
State of biological databases
Number of genes in these genomes:
Vertebrate: ~30,000
Insects:
~14,000
Worm:
~17,000
Fungi:
~6,000-10,000
Small organisms: 100s-1,000s
Each known or predicted gene has an associated protein
sequence
>1,000,000 known / predicted protein sequences
Some useful applications of
alignments
Given a newly discovered gene,
- Does it occur in other species?
- How fast does it evolve?
Assume we try Smith-Waterman:
Our
new
gene
104
The entire genomic database
1010 - 1011
May take several weeks!
Some useful applications of
alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other biological characteristics
Assume we try Smith-Waterman:
Our newly
sequenced
mammal
3109
The entire genomic database
1010 - 1011
> 1000 years ???
BLAST
• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
– The most widely used comp bio tool
– The most cited paper
• Which is better: long mediocre match or a few nearby,
short, strong matches with the same total score?
– score-wise, exactly equivalent
– biologically, later may be more interesting, & is common
– at least, if must miss some, rather miss the former
• BLAST is a heuristic emphasizing the later
– speed/sensitivity tradeoff: BLAST may miss former, but gains
greatly in speed
BLAST
Main idea:
1. Construct a dictionary of all the
words in the query
2. Initiate a local alignment for each
word match between query and
DB
query
DB
Running Time: O(MN)
However, orders of magnitude
faster than Smith-Waterman
BLAST Original Version
……
Dictionary:
All words of length k (~11 for DNA, 3
for proteins)
Alignment initiated between words of
alignment score T (typically T = k)
query
……
Alignment:
Ungapped extensions until score
below statistical threshold
scan
DB
Output:
All local alignments with score
> statistical threshold
query
BLAST Original Version
A C G A A G T A A G G T C C A G T
k = 4, T = 4
The matching word GGTC
initiates an alignment
Extension to the left and
right with no gaps until
alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
C C C T T C C T G G A T T G C G A
Example:
Gapped BLAST
Added features:
•
•
Pairs of words can
initiate alignment
Extensions with
gaps in a band
around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Example
Query: gattacaccccgattacaccccgattaca (29 letters)
[2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4
tacaccccgattacaccccga 24
||||||| |||||||||||||
125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17),
Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7|
Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
3891 tacacccagattacaccccga 3911
Example
Query: Human atoh enhancer, 179 letters
[1.5 min]
Result: 57 blast hits
1.
2.
3.
4.
5.
6.
7.
8.
gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...
gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...
gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...
gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...
gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...
gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...
gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...
355 1e-95
264 4e-68
256 9e-66
78 5e-12
54 7e-05
44 0.068
42 0.27
42 0.27
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517
Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),
Gaps = 2/177 (1%) Strand = Plus / Plus
Query:
3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query:
63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
|||||||||||||||||||||||||| ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query:
123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
Different types of BLAST
• blastn: search nucleic acid database
• blastp: search protein database
• blastx: you give a nucleic acid sequence,
search protein database
• Tblastn: you give a protein sequence,
search nucleic acid database
• tblastx: you give a nucleic database,
search nucleic acid database, implicitly
translate both into protein sequences
Variants of BLAST
MEGABLAST:
Optimized to align very similar sequences
Linear gap penalty
NCBI-BLAST:
WU-BLAST: (Wash Univ BLAST)
Optimized, added features
BLAT: Blast-Like Alignment Tool
BlastZ:
Optimized for aligning two genomes
PSI-BLAST:
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search; above steps iterated
Sensitive for weak homologies
Slower
Pattern hunter
• Instead of exact matches of consecutive
matches of k-mer, we can
• look for discontinuous matches
– My query sequence looks like:
• ACGTAGACTAGCAGTTAAG
– Search for sequences in database that match
• AXGXAGXCTAXC
• X stands for don’t care
• Seed: 101011011101
Pattern hunter
• A good seed may give you both a higher
sensitivity and higher specificity
• You may think 110110110110 is the best seed
• Because mutation in the third position of a codon
often doesn’t change the amino acid
• Best seed is actually
– 110100110010101111
• How to design such seed is an open problem
• May combine multiple random seeds