Database searching - Indiana University

Download Report

Transcript Database searching - Indiana University

I519 Introduction to Bioinformatics, Fall 2012
Sequence Database Searching
(Basic Tools and Advanced Methods)
What’s database searching
 Goal: find similar (homologous) sequences of a
query sequence in a sequence of database
 Input: query sequence & database
 Output: hits (pairwise alignments)
Basics of database searching
 Core: pairwise alignment algorithm
 Speed (fast sequence comparison)
 Relevance of the search results (statistical
tests)
 Recovering all information of interest
– The results depend of the search parameters
like gap penalty, scoring matrix.
– Sometimes searches with more than one
matrix should be preformed
 Specificity and sensitivity
The different strategies for
database searching
Query
Alignment
methods
Database
Sequence
BLAST
FASTA
SW
Sequence
MSA
Psi-Blast
Sequence
MSA
Profile
alignment
MSA
Sequence
hmmer
HMM
models
Faster
Higher
sensitivity
What program to use for searching?
 BLAST is fastest and easily accessed on the Web
– A suite; BLASTP, BLASTN, BLASTX
– lag behind the development of sequencing
techniques
 FASTA
– more sensitive for DNA-DNA comparisons
– FASTX and TFASTX can find similarities in
sequences with frameshifts
 Smith-Waterman is slower, but more sensitive
– known as a “rigorous” or “exhaustive” search
– SSEARCH in GCG and standalone FASTA
FASTA
 Derived from logic of the dot plot
– compute best diagonals from all frames of
alignment
 Word method looks for exact matches between
words in query and test sequence
–
–
–
–
hash tables (fast computer technique)
DNA words are usually 6 bases
protein words are 1 or 2 amino acids
only searches for diagonals in region of word
matches = faster searching
FASTA algorithm
Makes longest diagonal
 after all diagonals found, tries to join diagonals
by adding gaps
 computes alignments in regions of best
diagonals
FASTA alignments
FASTA results: histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences:
2,050 Symbols:
913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score obs
exp
(=)
(*)
< 20
0
0:
22
0
0:
24
3
0:=
26
2
0:=
28
5
0:==
30
11
3:*==
32
19
11:==*==
34
38
30:=======*==
36
58
61:===============*
38
79
100:====================
*
40
134
140:==================================*
42
167
171:==========================================*
44
205
189:===============================================*====
46
209
192:===============================================*=====
48
177
184:=============================================*
FASTA results: list
The best scores are:
init1 initn
SW:PPI1_HUMAN
Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph... 1854
SW:PPI1_RABIT
Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840
SW:PPI1_RAT
Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543
SW:PPI1_MOUSE
Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542
SW:PPI2_HUMAN
Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533
SPTREMBL_NEW:BAC25830
Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ... 1488
SP_TREMBL:Q8N5W1
Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477
SW:PPI2_RAT
Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482
opt
z-sc E(1018780)..
1854
1854
2249.3
1.8e-117
1840
1840
2232.4
1.6e-116
1543
1837
2228.7
2.5e-116
1542
1836
2227.5
2.9e-116
1533
1533
1861.0
7.7e-96
1488
1522
1847.6
4.2e-95
1477
1522
1847.6
4.3e-95
1482
1516
1840.4
1.1e-94
FASTA results: alignment
SCORES
Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374
(2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap
(83-957:151-1022)
60
70
80
90
100
110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
|| ||| | ||||| |
||| |||||
DMU09374
AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
130
140
150
160
170
180
120
130
140
150
160
170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
|||||||||
|| |||
|
| || ||| |
|| || ||||| ||
DMU09374
GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
190
200
210
220
230
240
180
190
200
210
220
230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
||| | ||||| ||
|||
||||
| || | |||||||| || ||| ||
DMU09374
AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
250
260
270
280
290
300
240
250
260
270
280
290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
||||||||||
||||| |
|||||| |||| |||
|| ||| || |
DMU09374
AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
310
320
330
340
350
360
FASTA on the web
 Many websites offer FASTA searches
– Various databases and various other services
– Be sure to use FASTA 3
 Each server has its limits
 Be aware that you are depending on the
kindness of strangers.
BLAST
 [BLAST= Basic Local Alignment Search Tool]
 The central idea of the BLAST algorithm is that a
statistically significant alignment is likely to
contain a high-scoring pair of aligned words.
 Uses word matching like FASTA
– Similarity matching of words (3 aa’s, 11 bases)
– Does not require identical words.
 If no words are similar, then no alignment
– won’t find matches for very short sequences
 Original BLAST does not handle gaps well; the
“gapped” blast is better
Word match
Extend hits
BLAST: word matching
MEAAVKEEISVEDEAVDKNI
MEA
EAA
Break query
AAV
AVK
into words:
VKE
KEE
EEI
EIS
ISV
Break database
...
Compare word lists
by Hashing
(allow near matches)
sequences
into words:
BLAST: extend hits
Use two word matches as anchors to build an alignment
between the query and a database sequence.
Gapped BLAST algorithm
 The NCBI’s BLAST website now both use
“gapped BLAST”
 This algorithm is more complex than the original
BLAST
 It requires two word matches close to each other
on a pair of sequences (i.e. with a gap) before it
creates an alignment allow gaps (using SmithWaterman algorithm constrained by a band width).
The region of the path graph
explored when seeded by the
alignment of a pair of residues
Ref: Gapped BLAST and PSI-BLAST:
a new generation of protein database
search programs. NAR 25(17):3389,
1997
Statistical tests
 Evaluate the probability of an event taking place
by chance.
 Very little is known about the random distribution
of optimal global alignment scores
 Statistics for the scores of local alignments, are
well understood (particularly true for local
alignments lacking gaps)
 P-value
– Randomized data
– Distribution under the same setup
BLAST statistics
 E-value is equivalent to standard P value (based on
Karlin-Altschul theorem)
– Karlin-Alschul equation is probably one of the most
recognized equation in bioinformatics.
– E-value represents the likelihood that the observed
alignment is due to chance alone
– E-value: the number of alignments expected by chance
(E) during a sequence database search. E = 1 indicates
that an alignment this good would happen by chance
with any random sequence searched against the same
database.
How to compute BLAST E-value
Raw score (S)
Bit score (S’)
m: the effective query length
n: the effective database length
E
P (the probability of getting at least one HSP)
Which BLAST?
Program Query
Database
Comparison
Common use
blastn
DNA
DNA
DNA level
Seek identical DNA
sequences and
splicing patterns
blastp
Protein
Protein
Protein level
Find homologous
proteins
blastx
DNA
Protein
Protein level
Analyze new DNA
to find genes and
seek homologous
proteins
tblastn
Protein
DNA
Protein level
Search for genes in
unannotated DNA
tblastx
DNA
DNA
Protein level
Discover gene
structure
BLAST+ package has different user interface
Which database?
 For blastp (blastx)
–
–
–
–
–
–
Non-redundant protein sequences (nr)
Reference proteins (refseq_protein)
Swissprot protein sequences (swissprot)
Patented protein sequences (pat)
Protein Data Bank proteins (pdb)
Environmental samples (env_nr)
 For blastn (tblastx)
–
–
–
–
Human genomic plus transcript (not for tblastx)
Mouse genomic plus transcript (not available for tblastx)
Nucleotide collection (nr/nt)
etc
BLAST is approximate
 BLAST makes similarity searches very quickly
because it takes shortcuts.
– looks for short, nearly identical “words” (11
bases)
 It also makes errors
– misses some important similarities
– makes many incorrect matches
• easily fooled by repeats or skewed composition
Interpretation of BLAST hits
 very low E values (e-100) are homologs
 moderate E values are related genes
 long list of gradually declining of E values
indicates a large gene family
 long regions of moderate similarity are more
significant than short regions of high identity
Is the hit with smallest E-value the
closest sequence to the query?
 Not necessarily
 Some people argue that more strict phylogeny
analysis is needed for further conclusion
– Why? Different evolutionary rates, etc
Biological relevance
 It is up to you, the biologist to scrutinize these
alignments and determine if they are significant.
 Were you looking for a short region of nearly
identical sequence or a larger region of general
similarity?
 Are the mismatches conservative ones?
 Are the matching regions important structural
components of the genes or just introns and
flanking regions?
20 tips to improve your BLAST searches
 “Don't Use the Default Parameters” (but what
parameters?)
 “Treat BLAST Searches as Scientific Experiments
 “View BLAST Reports Graphically” (depends!)
 “Be Skeptical of Hypothetical Proteins”
 “Look for Stop Codons and Frame-Shifts to find
Pseudo-Genes”
 “Parse BLAST Reports with Bioperl” (your own
python, or perl, program)
 “How to Lie with BLAST Statistics” (“Lies, dammed
lies, and statistics”)
 …
Ref: BLAST by Ian Korf; Mark Yandell; Joseph Bedell, 2003.
Borderline similarity
 What to do with matches with E values that are
not so impressive? (e.g., E values > 0.01,
smaller numbers are more significant)
 this is the “Twilight Zone”
 retest these sequences and look for related hits
(not just your original query sequence)
 similarity is (is not) transitive:
– if A~B and B~C, then A~C
Distant homology detection
 Use more data (MSA)
 Use advanced approaches (like
HMM)
PSI-BLAST
 Position Specific Iterated BLAST (PSIBLAST)
 Basic idea
– Use results from BLAST query to construct a
profile matrix (or PSSM)
– Search database with profile instead of query
sequence
 Iterate
 Position-specific scoring matrices are an
extension of substitution scoring matrices
PSSM--Position Specific Scoring Matrix
3.0
Match A to position 3
Match D to position 3
-1.0
Match D to position 2
distribution of amino acids + pseudo-counts
Representing a profile as a logo
 Logos are used to show the residue preferences
or conservation at particular positions
 Based on information theory
helix-turn-helix motif from the CAP family of homodimeric DNA binding proteins
http://weblogo.berkeley.edu/examples.html
Profile-profile alignment
(alignment of alignments)
 Inputs: two MSAs (profiles)
 Alignment of two MSAs (similar to pairwise
sequence alignment)
 Many existing programs
 Scoring functions vary
– Dot product (FFAS)
FFAS scoring function
Dot product of two amino acid “frequency”
vectors
Sequence identity scoring zones
 >25-30%: homology zone
 15-25%: twilight zone
 <15%: midnight zone
Weak sequence similarity detection is still not
solved!
And sequence similarity != structural similarity
(for proteins) != functional similarity
Readings
 Chapter 5 (Pairwise Sequence Alignment and
Database Searching)
 Box 5.1 – saving space by throwing away the
intermediate calculations
 “Time can be saved with a loss of rigor by not
calculating the whole matrix” (Figure 5.18)