Blast Search

BLAST
Lecture 3.1
1
BLAST
• Basic Local Alignment Search Tool
• Developed in 1990, with gapped BLAST and
PSI-BLAST following in 1997 (S. Altschul et al.)
• A heuristic method for performing local
alignments by searching for high-scoring
segment pairs (HSPs)
• First to use statistics to predict the significance
of initial matches, which saves on false leads
• Offers both sensitivity and speed
BLAST
• Looks for clusters of nearby or locally dense “similar
or homologous” k-tuples
• Uses “look-up” tables to shorten search time
• Uses larger “word size” than FASTA to accelerate the
search process
• Performs Local alignment only (it is not a Global alignment tool)
• Fastest and most frequently used sequence alignment
tool -- THE STANDARD
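The look-up table idea above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it indexes only exact words of the query, whereas protein BLAST also indexes similar "neighbourhood" words that score above the threshold T. The function names and the two short test sequences are made up for the example.

```python
from collections import defaultdict

def build_word_table(query, k=3):
    """Map every length-k word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the same k-word occurs."""
    table = build_word_table(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        # One dictionary look-up per subject word replaces a scan of the query.
        for i in table.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits

# Illustrative fragments only (assumption: taken from the MT0895 example).
query = "KIQIYGTGCANCQ"
subject = "MMKIQIYGTGCAN"
hits = find_word_hits(query, subject)
diagonals = {j - i for i, j in hits}
```

Hits that share a diagonal (the same value of subject position minus query position) are the "clusters of nearby k-tuples" that would be extended into an HSP.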
BLAST Access
• NCBI BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• Canadian Bioinformatics Resource BLAST
• http://cbr-rbc.nrc-cnrc.gc.ca/blast/
• European Bioinformatics Institute BLAST
• http://www.ebi.ac.uk/blastall/
• http://www.ebi.ac.uk/blast2/
Different Flavours of BLAST
• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank (DNA)
• BLASTX - 6-frame translated DNA query against protein DB
• TBLASTN - protein query against 6-frame translated GenBank
• TBLASTX - 6-frame translated DNA query against 6-frame translated GenBank
• PSI-BLAST - protein ‘profile’ query against protein DB
• PHI-BLAST - protein pattern against protein DB
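The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a short Python sketch. It assumes the standard genetic code and a hypothetical 12-base input; it is meant to show what the six frames are, not how BLAST generates them internally.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames

# Hypothetical input sequence, chosen only for illustration.
frames = six_frames("ATGAAAATTTAA")
```

Each of the six protein strings is then compared as if it were an ordinary protein sequence, which is why translated searches are slower than BLASTP or BLASTN.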
Other BLAST Services
• MEGABLAST - for comparison of large sets
of long DNA sequences
• RPS-BLAST - Conserved Domain Detection
• BLAST 2 Sequences - for performing pairwise
alignments for 2 chosen sequences
• Genomic BLAST - for alignments against
select human, microbial or malarial genomes
• VecScreen - for detecting cloning vector
contamination in sequenced data
Running NCBI BLAST
MT0895
• MMKIQIYGTGCANCQMLEKNAREAVKELG
IDAEFEKIKEMDQILEAGLTALPGLAVDG
ELKIMGRVASKEEIKKILS
Running NCBI BLAST
• Paste in sequence (FASTA format, raw
sequence or type in GI or accession number)
OR
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS
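The three accepted input styles (named FASTA header, bare ">" header, raw sequence) can be handled by one small parser. The sketch below is a minimal illustration in Python; `parse_fasta` is a hypothetical helper, not part of any BLAST tool.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA or raw text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header or "", "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header or "", "".join(chunks)))
    return records

named = parse_fasta(">Mysequence MT0895\n"
                    "KIQIYGTGCANCQMLEKNAREAVKELGIDAE\n"
                    "FEKIKEMDQILEAGLTALPGLAVDGELKIDS\n")
raw = parse_fasta("KIQIYGTGCANCQMLEKNAREAVKELGIDAE\n"
                  "FEKIKEMDQILEAGLTALPGLAVDGELKIDS\n")
```

Both calls yield the same sequence string; only the header differs (it is empty for raw input or a bare ">").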
Running NCBI BLAST
• Choose a range of interest in the sequence
“set subsequences” (not usually used)
• Select the database from pull-down menu
(usually choose nr = non-redundant)
• Keep CD Search “check box” on
• Leave “Options” unchanged (use defaults)
• Go to “Format” menu and adjust Number of
descriptions and alignments as desired
Running NCBI BLAST
Select Database
Conserved Domain Database
• Contains a collection of pre-identified
functional or structural domains
• Derived from Pfam and Smart databases
as well as other sources
• Uses Reverse Position Specific BLAST
(RPS-BLAST) to perform search
• Query sequence is compared to a PSSM
derived from each of the aligned domains
Running NCBI BLAST
Click BLAST!
Formatting Results
BLAST Format Options
BLAST Output
BLAST Parameters
• Identities - No. & % exact residue matches
• Positives - No. and % similar & ID matches
• Gaps - No. & % gaps introduced
• Score - Summed HSP score (S)
• Bit Score - a normalized score (S’)
• Expect (E) - Expected number of chance HSP alignments
• P - Probability of getting a score > X by chance
• T - Minimum word or k-tuple score (Threshold)
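The relationship between Score (S), Bit Score (S’), Expect (E) and P can be written out as a small Python sketch using the standard Karlin-Altschul formulas: S’ = (λS − ln K) / ln 2, E = m·n·2^(−S’), and P = 1 − e^(−E). The values of λ and K depend on the scoring matrix and gap penalties; the numbers used below, and the query and database lengths, are illustrative assumptions rather than values from the lecture.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalise a raw HSP score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, query_len, db_len):
    """Expected number of chance HSPs with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)

def p_value(e_value):
    """Probability of at least one chance HSP: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

# Illustrative parameters (assumptions): lambda and K for a gapped search,
# a 75-residue query and a database of 1e9 residues.
bits = bit_score(raw_score=150, lam=0.267, k=0.041)
e_val = expect_value(bits, query_len=75, db_len=10**9)
p_val = p_value(e_val)
```

For small E the two numbers are almost identical (P ≈ E), which is why BLAST reports only E.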
BLAST - Rules of Thumb
• Expect (E-value) is equal to the number of BLAST
alignments with a given Score that are expected to
be seen simply due to chance
• Don’t trust a BLAST alignment with an Expect score
> 0.01 (Grey zone is between 0.01 - 1)
• Expect and Score are related, but Expect contains
more information. Note that % Identities is more
useful than the bit Score
• Recall Doolittle’s Curve (%ID vs. Length, next slide)
%ID > 30 - numres/50
• If uncertain about a hit, perform a PSI-BLAST search
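The rules of thumb above are easy to encode. The Python sketch below applies the E-value thresholds (0.01 and 1) and the %ID cutoff 30 - numres/50 exactly as stated on the slide; the function names and category labels are invented for the example.

```python
def classify_expect(e_value):
    """Apply the slide's E-value thresholds to a single hit."""
    if e_value <= 0.01:
        return "likely significant"
    if e_value <= 1:
        return "grey zone"
    return "not trusted"

def identity_cutoff(num_residues):
    """Minimum % identity suggested by the rule %ID > 30 - numres/50."""
    return 30 - num_residues / 50

def passes_identity_rule(percent_id, num_residues):
    """True when the hit lies above the cutoff for its alignment length."""
    return percent_id > identity_cutoff(num_residues)

# Hypothetical hits: (E-value, % identity, aligned residues).
hits = [(1e-20, 45.0, 150), (0.5, 24.0, 100), (5.0, 20.0, 60)]
report = [(classify_expect(e), passes_identity_rule(pid, n)) for e, pid, n in hits]
```

A hit that lands in the grey zone or fails the identity rule is a candidate for the PSI-BLAST follow-up suggested in the last bullet.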
Doolittle’s Curve
[Figure: Evolutionary Distance vs. Percent Sequence Identity. Sequence Identity (%) is plotted against Number of Residues (0 to 400), with the “Twilight Zone” region marked.]
Getting the Most from
BLAST
BLAST Options
• Composition-based statistics (Yes)
• Sequence Complexity Filter (Yes)
• Expect (E) value (10)
• Word Size (3)
• Substitution or Scoring Matrix (Blosum62)
• Gap Insertion Penalty (11)
• Gap Extension Penalty (1)
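For readers who run BLAST locally rather than through the web form, the defaults listed above map roughly onto command-line flags of the standalone NCBI BLAST+ `blastp` program. The Python sketch below only builds the argument list and does not run anything. The file name `query.fasta` and database name `nr` are placeholders, and using `-comp_based_stats 1` for "Yes" is an assumption about which composition mode the slide refers to.

```python
def blastp_command(query_path, db_name):
    """Build a blastp argument list mirroring the web-form defaults."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-comp_based_stats", "1",   # composition-based statistics (assumed mode)
        "-seg", "yes",              # low-complexity filter
        "-evalue", "10",            # Expect threshold
        "-word_size", "3",          # word size
        "-matrix", "BLOSUM62",      # scoring matrix
        "-gapopen", "11",           # gap insertion (opening) penalty
        "-gapextend", "1",          # gap extension penalty
    ]

# Placeholder inputs (assumptions): a local FASTA file and a local nr database.
command = blastp_command("query.fasta", "nr")
command_line = " ".join(command)
```

The list form can be passed to `subprocess.run` if BLAST+ and a formatted database are installed; that step is left out here because it depends on the local setup.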
Composition Statistics
• Recent addition to BLAST algorithm
• Permits calculated E (Expect) values to
account for amino acid composition of
queries and database hits
• Improves accuracy and reduces false
positives
• Effectively conducts a different scoring
procedure for each sequence in database
LCRs (low-complexity regions)
• Watch out for…
– transmembrane or signal peptide regions
– coiled-coil regions
– short amino acid repeats (collagen, elastin)
– homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
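To give a feel for what a complexity filter does, the Python sketch below masks windows whose Shannon entropy is low. It is a simplified stand-in and not the SEG or DUST algorithm; the window length of 12 and the cutoff of 2.2 bits are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, cutoff=2.2):
    """Replace residues in low-entropy windows with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < cutoff:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)

# Hypothetical test sequences: a homopolymer and a diverse stretch.
masked_repeat = mask_low_complexity("AAAAAAAAAAAA")
masked_diverse = mask_low_complexity("ACDEFGHIKLMN")
```

Masked positions are ignored when seeding word hits, so repeats such as those in collagen do not generate large numbers of spurious matches.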
Scoring Matrices
• BLOSUM Matrices
– Developed by Henikoff & Henikoff (1992)
– BLOcks SUbstitution Matrix
– Derived from the BLOCKS database
• PAM Matrices
– Developed by Schwartz and Dayhoff (1978)
– Point Accepted Mutation
– Derived from manual alignments of closely
related proteins
How to Make Your Own Matrix
• Perform Alignment
ACDEFGH..
ACDEFGK..
AADEFGH..
GCDEFGH..
ACAEYGK..
ACAEFAH..
• Calculate Frequencies
f(A,A) = #A_obs / #A_exp
f(C,A): from the observed C/A count (#C/A_obs) and the expected counts #C_exp and #A_exp
• Fill Sub Matrix
[Figure: partially filled substitution matrix with rows and columns A, C, D, E, ...]
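The three steps (align, calculate frequencies, fill the matrix) can be sketched in Python using the six aligned sequences from the slide. The scoring used here is a BLOSUM-style log-odds ratio, log2(observed / expected), which is an assumption: the slide's exact frequency formula is not fully legible, so this may differ from it in detail.

```python
import math
from collections import Counter
from itertools import combinations

# The aligned sequences from the slide (trailing ".." omitted).
ALIGNMENT = ["ACDEFGH", "ACDEFGK", "AADEFGH", "GCDEFGH", "ACAEYGK", "ACAEFAH"]

def log_odds_scores(alignment):
    """Return {(a, b): log2(observed / expected)} for residue pairs seen in columns."""
    pair_counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Background residue frequencies, derived from the same pair counts.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = 2 * total_pairs

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        pa = residue_counts[a] / total_residues
        pb = residue_counts[b] / total_residues
        expected = pa * pa if a == b else 2 * pa * pb
        scores[(a, b)] = math.log2(observed / expected)
    return scores

scores = log_odds_scores(ALIGNMENT)
```

Pairs that never co-occur in a column receive no entry here; a real matrix would need pseudocounts or far more data to score them.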
PAM versus BLOSUM
PAM
• First useful scoring
matrix for protein
• Assumed a Markov
Model of evolution (i.e.
all sites equally mutable
and independent)
• Derived from small,
closely related proteins
with ~15% divergence
BLOSUM
• Much later entry to matrix
“sweepstakes”
• No evolutionary model is
assumed
• Built from PROSITE
derived sequence blocks
• Uses much larger, more
diverse set of protein
sequences (30% - 90% ID)
PAM versus BLOSUM
PAM
• Higher PAM numbers to
detect more remote
sequence similarities
• Lower PAM numbers to
detect high similarities
• 1 PAM ~ 1 million years
of divergence
• Errors in PAM 1 are
scaled 250X in PAM 250
BLOSUM
• Lower BLOSUM numbers
to detect more remote
sequence similarities
• Higher BLOSUM numbers
to detect high similarities
• Sensitive to structural
and functional substitution
• Errors in BLOSUM arise
from errors in alignment
PAM Matrices
• PAM 40 - prepared by multiplying PAM 1 by
itself a total of 40 times
– best for short alignments with high similarity
• PAM 120 - prepared by multiplying PAM 1 by
itself a total of 120 times
– best for general alignment
• PAM 250 - prepared by multiplying PAM 1 by
itself a total of 250 times
– best for detecting distant sequence similarity
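"Multiplying PAM 1 by itself" means raising a mutation probability matrix to a power. The Python sketch below shows this with a made-up three-letter alphabet; `TOY_PAM1` is an invented matrix, not Dayhoff's 20 x 20 PAM 1, and the published scoring matrices are log-odds values derived from the resulting probabilities, a step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, exponent):
    """Multiply m by itself `exponent` times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result

# Invented 3-letter "PAM 1": each residue is unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
toy_pam40 = mat_power(TOY_PAM1, 40)
toy_pam250 = mat_power(TOY_PAM1, 250)
```

The diagonal shrinks as the exponent grows, so higher PAM numbers model more divergent sequences, and any error in the starting matrix is carried through every multiplication.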
BLOSUM Matrices
• BLOSUM 90 - prepared from BLOCKS alignments in which
sequences with ≥90% sequence ID are clustered and counted as one
– best for short alignments with high similarity
• BLOSUM 62 - prepared from BLOCKS alignments in which
sequences with ≥62% sequence ID are clustered and counted as one
– best for general alignment (default)
• BLOSUM 30 - prepared from BLOCKS alignments in which
sequences with ≥30% sequence ID are clustered and counted as one
– best for detecting weak local alignments
Scraping the Bottom of
the Barrel with PSI-BLAST
PSI-BLAST Algorithm
• Perform initial alignment with BLAST using
BLOSUM 62 substitution matrix
• Construct a multiple alignment from matches
• Prepare position specific scoring matrix (PSSM)
• Use PSSM profile as the scoring matrix for a
second BLAST run against database
• Repeat the three preceding steps (multiple alignment,
PSSM, search) until convergence
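The "prepare position specific scoring matrix" step can be illustrated with a short Python sketch. It assumes an ungapped multiple alignment, a uniform background frequency and a simple pseudocount; PSI-BLAST itself uses sequence weighting and data-dependent pseudocounts, so treat this only as a picture of the idea. The three-sequence alignment is invented.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Sum the column-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Invented ungapped alignment of matches from a first search round.
pssm = build_pssm(["KIQ", "KIQ", "KLQ"])
score_match = score_with_pssm(pssm, "KIQ")
score_other = score_with_pssm(pssm, "WWW")
```

Because every column has its own scores, a conserved position penalises substitutions more heavily than a variable one, which is what makes the next search round more sensitive than a search with BLOSUM 62 alone.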
PSI-BLAST
Press Iterate!
PSI-BLAST
• For Protein Sequences ONLY
• Much more sensitive than BLAST
• Slower (iterative process)
• Often yields results that are as good as
many common threading methods
• SHOULD BE YOUR FIRST CHOICE IN
ANALYZING A NEW SEQUENCE
BLAST against PDB
Still Confused?
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Conclusions
• BLAST is the most important program in
bioinformatics (maybe all of biology)
• BLAST is based on sound statistical
principles (key to its speed and sensitivity)
• A basic understanding of its principles is
key for using/interpreting BLAST output
• Use BLASTN or MEGABLAST for DNA
• Use PSI-BLAST for protein searches