CAP5510 - Bioinformatics

Download Report

Transcript CAP5510 - Bioinformatics

CAP5510 – Bioinformatics
Database Searches for
Biological Sequences or
Imperfect Alignments
Tamer Kahveci
CISE Department
University of Florida
1
Goals
• Understand how major heuristic
methods for sequence comparison work
– FASTA
– BLAST
• Understand how search results are
evaluated
2
What is Database Search ?
• Find a particular (usually) short sequence in a
database of sequences (or one huge sequence).
• Problem is identical to local sequence alignment, but
on a much larger scale.
• We must also have some idea of the significance of a
database hit.
– Databases always return some kind of hit, how much
attention should be paid to the result?
• A similar problem is the global alignment of two large
sequences
• General idea: good alignments contain high scoring
regions.
3
Imperfect Alignment
• What is an imperfect alignment?
• Why imperfect alignment?
• The result may not be optimal.
• Finding optimal alignment is usually to
costly in terms of time and memory.
4
Database Search Methods
• Hash table based methods
– FASTA family
• FASTP, FASTA, TFASTA, FASTAX, FASTAY
– BLAST family
• BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ,
MegaBLAST, PsiBLAST, PhiBLAST
– Others
• FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
– Mummer, AVID, Reputer, MGA, QUASAR
5
History of sequence searching
•
•
•
•
1970:
1980:
1985:
1990:
NW
SW
FASTA
BLAST
6
Hash Table
7
Hash Table
• K-gram =
subsequence of
length K
• Ak entries
– A is alphabet
size
• Linear time
construction
• Constant lookup
time
8
FASTP
Lipman & Pearson, 1985
9
FASTP
• Three phase algorithm
1. Find short good matches using k-grams
1. K = 1 or 2
2. Find start and end positions for good
matches
3. Use DP to align good matches
10
FASTP: Phase 1 (1)
position 1 2 3 4 5 6 7 8 9 10 11
protein 1 n c s p t a . . . . .
protein 2 . . . . . a c s p r k
position in
offset
amino acid
protein A protein B
pos A - posB
----------------------------------------------------a
6
6
0
c
2
7
-5
k
11
n
1
p
4
9
-5
r
10
s
3
8
-5
t
5
----------------------------------------------------Note the common offset for the 3 amino acids c,s and p
A possible alignment can be quickly found :
protein 1 n c s p t a
| | |
protein 2 a c s p r k
11
FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m
to n-1
• Each offset is scored as
– # matches - #
mismatches
• Diagonals (offsets) with
large score show local
similarities
• How does it depend on
k?
12
FASTP: Phase 2
• 5 best diagonal runs
are found
• Rescore these 5
regions using
PAM250.
– Initial score
• Indels are not
considered yet
13
FASTP: Phase 3
• Sort the aligned regions in descending
score
• Optimize these alignments using
Needleman-Wunsch
• Report the results
14
FASTP - Discussion
• Results are not optimal. Why ?
• How does performance compare to SmithWaterman?
• What is the impact of k?
• How does this idea work for DNAs ?
– K = 4 or 6 for DNA
15
FASTA – Improvement Over
FASTP
Pearson 1995
16
FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5
17
FASTA (2)
• Phase 2.5
– Eliminate diagonals that score less than some given
threshold.
– Combine matches to find longer matches. It incurs join
penalty similar to gap penalty
18
FASTA Variations
• TFASTAX and TFASTAY: query protein
against a DNA library in all reading
frames
• FASTAX, FASTAY: DNA query in all
reading frames against protein database
19
BLAST
Altschul, Gish, Miller, Myers,
Lipman, 1990
20
BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search
Tool
• An approximation of Smith-Waterman
• Designed for database searches
– Short query sequence against long database
sequence or a database of many sequences
• Sacrifices search sensitivity for speed
21
BLAST Algorithm (1)
• Eliminate low complexity regions from
the query sequence.
– Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
– K = 3 for proteins
MCGPFILGTYC
CGP
MCG
22
BLAST Algorithm (2)
• For each k-gram find all
k-grams that align with
score at least cutoff T
using BLOSUM62
PQGMCGPFILGTYC
QGM
– 20k candidates
– ~50 on the average per kgram
– ~50n for the entire query
• Build hash table
PQG
PQG
PQG
PEG
PRG
PSG
PQA
18
15
14
13
12
T = 13
23
BLAST Algorithm (3)
• Sequentially scan the database and
locate each k-gram in the hash table
• Each match is a seed for an ungapped
alignment.
24
BLAST Algorithm (4)
• HSP (High Scoring Pair)
= A match between a
query word and the
database
• Find a “hit”: Two nonoverlapping HSP’s on a
diagonal within distance
A
• Extend the hit until the
score falls below a
threshold value, X
25
BLAST Algorithm (5)
• Keep only the extended matches that
have a score at least S.
• Determine the statistical significance
of the result
26
What is Statistical Significance?
•Two one-on-one
games, two scores.
13 : 15
•Which result is
more significant?
•Expected: maybe a
random result.
•Unexpected:
significant, may have
significant meanings.
13 : 15
27
Statistical Significance
• E-value: The expected number of matches
with score at least S
•
•
•
•
E = Kmne-lambda.S
m, n : sequence lengths
S : alignment score
K, lambda: normalization parameters
• P-value: The probability of having at least one
match with score at least S
• 1 – e-E
• The smaller these values are, the more
significant the result
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.ht
ml
28
BLAST - Analysis
• K (k-gram)
– Lower: more sensitive.
Slower.
• T (neighbor cutoff)
– Lower: Find distant
neighbors. Introduces
noise
• X (extension cutoff)
– Higher: lower chances of
getting into a local
minima. Slower.
29
Sample Query
• http://www.ncbi.nlm.nih.gov/BLAST/
Dhal_ecoli
IDRAMSAARGVFERGDWSLSSPAKRKAVLNKLADLMEAH
AEELALLETLDTGKPIRHSLRDDIPGAARAIRWYAEAIDK
VYGEVATTSSHELAMIVREPVGVIAAIVPWNFPLLLTCW
KLGPALAAGNSVILKPSEKSPLSAIRLAGLAKEAGLPDGV
LNVVTGFGHEAGQALSRHNDIDAIAFTGSTRTGKQLLKD
AGDSNMKRVWLEAGGKSANIVFADCPDLQQAASATAAG
IFYNQGQVCIAGTRLLLEESIADEFLALLKQQAQNWQP
GHPLDPATTMGTLIDCAHADSVHSFIREGESKGQLLLDG
RNAGLAAAIGPTIFVDVDPNASLSREEIFGPVLVVTRFTS
EEQALQLANDSQYGLGAAVWTRDLSRAHRMSRRLKAGS
VFVNNYNDGDMTVPFGGYKQSGNGRDKSLHALEKFTELK
TIWI
30
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood
search.
31
BLAST Variations
Program
Query
Target
Type
BLASTP
Protein
Protein
Gapped
BLASTN
Nucleic acid
Nucleic acid
Gapped
BLASTX
Nucleic acid
Protein
Gapped
TBLASTN
Protein
Nucleic acid
Gapped
TBLASTX
Protein
Nucleic acid
Gapped
32
Even More Variations
– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI,
WABA, GLASS
– Main differences are
• Seed choice (k, gapped seeds)
• Additional data structures
33
Suffix Trees
34
Suffix Tree
• Tree structure that contains all suffixes of the input sequence
•
•
•
•
•
•
•
•
•
TGAGTGCGA
GAGTGCGA
AGTGCGA
GTGCGA
TGCGA
GCGA
CGA
GA
A
35
Suffix Tree Example
36
Suffix Tree Analysis
• O(n) space and construction time
– 10n to 70n space usage reported
• O(m) search time for m-letter sequence
• Good for
– Small data
– Exact matches
37
Suffix Array
• 5 bytes per letter
• O(m log n) search
time
• Better space usage
• Slower search
38
Mummer
39
Other Sequence Comparison
Tools
• Reputer, MGA, AVID
• QUASAR (suffix array)
40