Biology 162: Computational Genetics

Download Report

Transcript Biology 162: Computational Genetics

Alignment to a database of
sequences
Biology 162 Computational
Genetics
Todd Vision
7 Sep 2004
Preview
• How to speed up pairwise alignment
• How to statistically evaluate alignments
and database matches
• How to use BLAST
• What database should you search?
Need for fast pairwise
alignment
• Even O(mn) is too slow or memory
intensive for some applications
– Shotgun assembly (Phrap)
– Database search
– Alignment of long sequences
History
• FASTA (1983)
– First practical tool for fast pairwise alignment
– Restricts alignment to path-graph “hot-spots”
• BLAST (1990) (and WU-BLAST)
–
–
–
–
Basic local alignment search tool
Refinement of ideas for locating “hot-spots”
Searching dB for “hot-spots” in sublinear time
Probability of alignment scores
• Gapped BLAST and FASTA3 (1998, 2000)
– The two programs have largely converged
– Can now both produce gapped alignments with
accurate statistics
Why is BLAST so fast?
• Exclusion of database sequences
unlikely to contain good alignments
• Preprocessing of database to enable
fast matching algorithms
• Filling out only a small part of the path
graph
Query
Subject
Flavor
Nucleotide
Nucleotide
BLASTN
Protein
Protein
BLASTP
Nucleotide
(translated)
Protein
Protein
BLASTX
Nucleotide
(translated)
Nucleotide
(translated)
TBLASTN
Nucleotide
(translated)
TBLASTX
How BLAST works
• Segment pair - a pair of equal length
substrings in query and dB
• Locally maximal segment pair (LMSP) ungapped score would decrease by
extending or shortening either end
• High scoring pair (HSP) - the LMSP of
highest score between the query and a single
DB sequence
• BLAST heuristically finds a gapped alignment
for one or more HSPs that has a score above
some threshold
How BLAST works
• Query broken into short overlapping
words (default 11nt or 3aa)
• Up to 50 high scoring word neighbors
(with minimum score T) determined for
each word in query
LRQ
LRQNMLAC
RQN
LRE
LRO
LRT
LRG
LRT
How BLAST works
• Preprocessed database searched for
exact matches to all high scoring words
– Generation of words and search for
matches are both linear in length of query
• A memory intensive data structure allows a fast
algorithm
– We will see how this can be achieved with
suffix trees later
How BLAST works
• Segment pairs on the same diagonal and
within A cells of each other are extended in
both directions
– Extension proceeds until the dropoff from the
highest score is too great
– If the score is sufficiently high, and no other
overlapping LMSP is higher, the highest scoring
extension is considered to be an HSP
– Extension is the time consuming step of BLAST
• Requiring two close segment pairs on the same diagonal
means that most dB sequences can be discarded before
the extension step
• HSPs are joined by a gapped alignment
Alignment probabilities
ELVIS
||| |
ELVYS
• Each HSP has an associated score (Ssij)
• We know the scores for every possible amino
acid pair i, j
• We can compute the frequency of each i, j in
a random alignment
• Why not simply compute the probability of
obtaining a score at least as good as S?
BLAST statistics
• Scores are inflated over a naive probability
model for two reasons
– We have optimized the alignment
– We are taking the MSP from that alignment and
not a random pair of aligned segments
– We are only reporting the highest scoring
alignments after searching a large database
• The expected number of matches that meet
or surpass our observed score is given by the
extreme value (or Gumbel) distribution
opt
< 20
22
24
26
28
30
32
34
36
38
40
42
44
46
48
50
52
54
56
58
60
62
64
66
68
70
72
74
76
78
80
82
84
86
88
90
92
94
96
98
100
102
104
106
108
110
112
114
116
118
>120
173
0
2
1
8
37
170
714
1809
3729
5555
6991
7264
6786
6059
4920
4204
3410
2957
2469
2280
1872
1450
1176
1010
817
668
501
431
342
315
214
161
116
116
85
47
55
32
28
22
10
8
10
6
6
4
3
2
1
67
0:==
one = represents 122 library sequences
0:
0:=
1:*
16:*
95:*
367:== *
996:====== *
2045:=============== *
3379:===========================*===
4713:======================================*=======
5761:===============================================*==========
6355:====================================================*=======
6473:=====================================================*==
6197:==================================================*
*
5655:=========================================
*
4972:===================================
*
4247:============================
*
3547:=========================
2912:===================== *
2359:===================*
1891:===============*
1504:============*
1189:=========*
935:=======*=
733:======*
573:====*=
446:===*=
347:==*=
270:==*
210:=*=
160:=*
127:=*
98:*
inset = represents 2 library sequences
76:*
59:*
:======================*=
46:*
:=================*==========
35:*
:=============*==
27:*
:==========*===
21:*
:=======*===
16:*
:===== *
13:*
:====*
10:*
:===*=
8:*
:==*
6:*
:==*
5:*
:=*
4:*
:=*
3:*
:*
2:*
:*
2:*
:*=================================
1:*
E()
Bit scores, E-values
• Raw score cannot be compared among
searches
• Bit score is a function of raw score, scoring
matrix and amino acid frequencies in
database.
– Can be compared among searches
– Requires parameters l and K
• Expect value (E) depends on length of query
(m) and subject (n)
lS  ln(K)
mn
– Cannot S
be compared among
, Esearches
 S  of different
ln(2)
2
dBs
Caution
• Preceding theory does not hold for
global alignments
Where do l and K come
from?
• qi is the frequency of residue i
• pij is the frequency of aligned residues i and j
with score sij
• l is the unique solution to
pij  qiq j e
lsij
p
ij
1
i, j
• K is related to the random walk of locally
maximal scores for a given scoring matrix
and amino acid frequency
How to interpret E
• E is the expected # of matches S’ ≥ S’obs
– Given the lengths of the two sequences and
assuming both are random
– Will be approximately Poisson under null
hypothesis
• E can be much smaller or much greater than
1
– When E << 1, it is approximately equal to the
probability that any match would have S’ ≥ S’obs
– There may be many matches with E << 1 if there
are multiple homologs in the dB
Additional considerations
• Edge effects
– Length of expected random alignment
must be substracted from length of
sequence and query
• Multiple comparisons
– When searching D such sequences for a
given E threshold, the expected frequency
of matches is E’ ~ DE
– BLAST actually reports E’
Gapped versus ungapped
alignments
• For gapped alignments
– l and K can be theoretically computed
• For ungapped alignments
– l and K must be simulated for each matrix and dB
• E values are approximate for gapped
alignment
– Assumes gaps are expensive/rare
• BLAST shows E value for the sum of HSP
scores, rather than that of the gapped
alignment itself
Calculating critical scores
• Two sequences of length 100
• What is S’crit for E ≤ 0.05?
 nm 
 nm 
ln  ln 
 nm 
 E 
 E 
  log2   
Scrit

 E 
ln(2)
0.69315
• Answer: 13.3 bits
Masking low-complexity
regions
• Alignment statistics require that symbols
occur randomly in strings
– Long substrings of one or a few symbols violate
this assumption
– Sequences are preprocessed to identify such lowcomplexity regions (LCR)
• LCR are masked
– Do not contribute to alignment or score
– Appear as X’s in BLAST output
• Tools include DUST (for DNA) and SEG
(protein)
• Other types of repeats may also be masked
by BLAST
Complexity


 L! 
1
Complexit y logN 

L
  f i!
 i 
where L=window length
fi is frequency of symbol i, i=1..n

Example: GGGG
L=4,
fG=4, fA= fC= fT=0,
4!=4*3*2*1=24, 0!=1
Complexity=(1/4) * log4(24/4*1*1*1) = 0
Which matches should you
care about?
• Beware of redundancy
• Search the right set of sequences and
no more
• Choosing a significance threshold
requires judgement
Are you searching the right
dB?
• Protein
–
–
–
–
–
–
nr (Non-redundant, sort of)
Swissprot (Annotated)
Pat (Patented)
PDB (3D structures in Protein Data Bank)
Yeast, E. coli, Drosophila, Human, etc.
Month
• Nucleotide
–
–
–
–
–
nt (nucleotide version of nr)
EST (Expressed sequence tags)
GSS (Genome survey sequence - low pass)
HTGS (High Throughput Genomic Sequence)
Chromosome (completed genomes,
chromosomes)
BLAST parameters
• -W wordsize [Integer] default = 11 nucleotides, 3 proteins
• -G Cost to open gap [Integer] default = 5 nucleotides, 11
proteins
• -E Cost to extend gap [Integer] default = 2 nucleotides, 1
proteins
• Dropoffs for blast extensions
• Scoring matrix for proteins
• -q Penalty for nucleotide mismatch [Integer] default = -3
• -r reward for nucleotide match [Integer] default = 1
• Masking of low-complexity and repeat sequences
• -e expect value [Real] default = 10
Making your own BLAST dB
• Start with a MultiFASTA file
• Formatdb program from NCBI converts
MultiFASTA file into BLAST dB
– Words are preprocessed
– K and l are calculated for allowed scoring
matrices
• MultiFASTA file can also be used a query to
do many searches against one database
• All-by-all search - same file as query and dB
Don’t use BLAST blindly
• When BLAST is not appropriate
– Perfect matches (faster methods available)
• Primer sequences (also too short)
• Assembly (more specialized tools exist)
– Really distant homology (pairwise alignment is not
sufficiently sensitive)
• Pay attention to program choice/parameters
when
– Aligning cDNA against a genome
– Aligning two very long sequences
– Aligning sequences with many repeats
Cautionary tale: BRCA1
• Initial BLAST search showed match to
granins with E = 0.00175
• Granins are typically found in endocrine
cell secretory vesicles
– Is a cancer protein excreted outside of the
cell?!
• Now known to be a spurious match
Summary
• To attain high speed when searching a dB, BLAST
– Sacrifices some sensitivity by only extending HSPs
– Preprocesses the database and holds it in memory
• Meaningful statistics are key to BLAST’s widespread
use
• Conversion of
– Raw score to bit score: depends on scoring matrix and
amino acid frequencies
– Bit score to E value: depends on database size
• BLAST is easy to use and versatile, which makes it
awfully tempting to misuse
Reading assignment
• Gibson & Muse, Chapter 2: Genome
Sequencing and Annotation, pgs. 63101