Database Searching BLAST
•Database Searching
•Sequence Alignment
•Scoring Matrices
•Significance of an alignment
•BLAST, algorithm
•BLAST, parameters
•BLAST, output
•Alignment significance in BLAST
©CMBI 2005
Database Searching
Identify similarities between
•novel query sequences, whose structures and functions are unknown and uncharacterized, and
•sequences in (public) databases, whose structures and functions have been elucidated.
N.B. The similarity might span the entire query sequence or just part of it!
Database searching (2)
The query sequence is compared/aligned with every
sequence in the database.
High-scoring database sequences are assumed to be
evolutionarily related to the query sequence.
If sequences are related by divergence from a common
ancestor, they are said to be homologous.
Sequence Alignment
The purpose of a sequence alignment is to line up, across any
number of sequences, all residues that were derived from the
same residue position in the ancestral gene or protein.
[Figure: sequences A and B]
gap = insertion or deletion
J.Leunissen©CMBI 2005
Scoring Matrix/Substitution Matrix
Used to score the quality of an alignment.
Contains scores for pairs of residues (amino acids or
nucleotides) in a sequence alignment.
For protein/protein comparisons:
a 20 x 20 matrix of similarity scores in which identical
amino acids and those of similar character (e.g. Ile,
Leu) give higher scores than those of different
character (e.g. Ile, Asp).
The matrix is symmetric.
Substitution Matrices
Not all amino acids are equal
Residues mutate more easily to similar ones
Residues at surface mutate more easily
Aromatics mutate preferably into aromatics
Mutations tend to favor some substitutions
Core tends to be hydrophobic
Selection tends to favor some substitutions
Cysteines are dangerous at the surface
Cysteines in bridges seldom mutate
PAM250 Matrix
Scoring example
The score of an alignment is the sum of the scores of all pairs of
residues in the alignment
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
scores:      1 12 12  6  2  5 -1  2  6  1  0
=> alignment score = 46
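The scoring example can be reproduced with a short Python sketch. The pair scores are only the values that appear in the example itself; the dictionary and function names are illustrative assumptions, not part of the original slides.

```python
# Pair scores copied from the slide's example (only the pairs needed here).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = alignment_score("TCCPSIVARSN", "SCCPSISARNT")
print(score)
```

A full implementation would load the complete 20 x 20 matrix instead of this partial table.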
Dayhoff Matrix (1)
Dayhoff's group created a scoring matrix from a dataset
of closely related protein sequences that could be aligned
unambiguously.
They then counted all mutations (and non-mutations) and
calculated the mutation frequencies.
With a bit of math, they converted these frequencies into the
famous Dayhoff matrix (also called the PAM matrix).
Dayhoff Matrix (2)
Given the frequency of Leu and Val in my sequences, do I see more
mutations of V → L than I would expect by chance?
Score of mutation a → b
= log (observed a → b mutation rate / a → b mutation rate expected by chance)
This is called a log odds score and can be negative, zero, or positive.
When using a log odds matrix, the total score of the alignment is given by
the sum of the scores for each aligned pair of residues.
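A minimal sketch of the log odds calculation follows. The rates are hypothetical numbers chosen only to show the three possible signs, and the choice of base 10 for the logarithm is an assumption; the slide does not specify a base.

```python
import math


def log_odds(observed_rate, expected_rate):
    """Log of observed over expected rate (base 10 is an assumption here)."""
    return math.log10(observed_rate / expected_rate)


# Hypothetical rates, not taken from any real dataset.
more_than_chance = log_odds(0.02, 0.01)   # seen more often than expected
as_chance = log_odds(0.01, 0.01)          # seen exactly as often as expected
less_than_chance = log_odds(0.005, 0.01)  # seen less often than expected
```

A positive score marks a substitution that occurs more often than chance predicts, a negative score one that occurs less often.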
Dayhoff Matrix (3)
This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM
(point accepted mutation) means there has been 1 point mutation per 100
residues.
Matrices for greater evolutionary distances are generated from PAM 1 by
multiplying its underlying mutation probability matrix repeatedly by itself.
PAM250:
– 2.5 mutations per residue.
– equivalent to 20% matches remaining between two sequences,
i.e. 80% of the amino acid positions are observed to have
changed (one or more times).
– is default in many analysis packages.
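The repeated multiplication can be illustrated with a toy example. This sketch assumes the multiplication is applied to a mutation probability matrix rather than to the log odds scores, and it uses an invented two-letter alphabet instead of the real 20 x 20 Dayhoff data; all names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy alphabet of two letters: each letter stays unchanged with probability
# 0.99 per PAM unit (1 accepted mutation per 100 residues). Not real data.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1 after the multiplication, while the probability of observing the original letter falls well below 0.99, which is why matches become rare at large PAM distances.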
BLOSUM Matrix
Limitation of the Dayhoff matrix:
Matrices based on the Dayhoff model of evolutionary rates are
derived from alignments of sequences that are at least 85%
identical; that might not be optimal for comparing more distantly
related sequences.
An alternative approach was developed by Henikoff and
Henikoff, using local multiple alignments of more distantly related
sequences.
BLOSUM Matrix (2)
The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on
the BLOCKS database.
The BLOCKS database utilizes the concept of blocks (ungapped
amino acid patterns) that act as signatures of a family of proteins.
Substitution frequencies for all pairs of amino acids were calculated
from these blocks and used to calculate a log odds BLOSUM matrix.
Different matrices are obtained by varying the identity threshold. For
example, BLOSUM80 was derived using blocks of 80% identity.
Which Matrix to use?
Close relationships (low PAM, high BLOSUM)
Distant relationships (high PAM, low BLOSUM)
More conserved    BLOSUM 80    PAM 20
                  BLOSUM 62    PAM 120
More variable     BLOSUM 45    PAM 250
Often used defaults are: PAM250, BLOSUM62
Significance of alignment (1)
When is an alignment statistically significant?
In other words:
How different is the alignment score found from the scores
obtained by aligning random sequences to the query sequence?
Or:
What is the probability that an alignment with this score could have
arisen by chance?
Significance of alignment (2)
Database size = 20 x 10^6 letters
peptide    #hits
A          1 x 10^6
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
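The hit counts in this table follow from a simple estimate: each additional residue divides the expected number of chance matches by 20. The sketch below assumes all 20 amino acids are equally frequent, which is a simplification; the names are illustrative.

```python
DATABASE_LETTERS = 20 * 10**6   # database size from the slide
ALPHABET_SIZE = 20              # amino acids, assumed equally frequent


def expected_hits(peptide):
    """Expected number of exact chance matches of a peptide in the database."""
    return DATABASE_LETTERS / ALPHABET_SIZE ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(peptide, expected_hits(peptide))
```

Once the expected count drops well below 1, an exact match is unlikely to be a coincidence.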
BLAST
Question: What database sequences are most similar to
(or contain the most similar regions to) my previously
uncharacterised sequence?
•BLAST finds the highest scoring locally optimal
alignments between a query sequence and a database.
•Very fast algorithm
•Can be used to search extremely large databases
•Sufficiently sensitive and selective for most purposes
•Robust – the default parameters can usually be used
BLAST – Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand
likely candidates.
Step 3: Do a real alignment between the query sequence
and those likely candidates. ‘Real alignment’ is a main topic
of this course.
Step 4: Present output to user.
BLAST Algorithm, Step 2
For a given word length w and a given score matrix:
Create a list of all words (w-mers) that score at least T
when compared to w-mers from the query.
Query sequence: LNKCKTPQGQRLVNQ
Word: PQG
Neighborhood words (threshold T=13):
P Q G 18
P E G 15
P R G 14
P K G 14
P N G 13
P D G 13
P M G 13
Below threshold:
P Q A 12
P Q N 12
etc.
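The word list step can be sketched in Python. The candidate scores below are copied from the slide rather than computed from a scoring matrix, so this only illustrates splitting the query into words and applying the threshold; the variable names are assumptions.

```python
QUERY = "LNKCKTPQGQRLVNQ"   # query sequence from the slide
W = 3                       # word length
T = 13                      # neighborhood score threshold from the slide

# All overlapping words of length W in the query.
query_words = [QUERY[i:i + W] for i in range(len(QUERY) - W + 1)]

# Scores of candidate words against the query word "PQG", as listed on the
# slide. A real implementation would compute these from a scoring matrix.
CANDIDATE_SCORES = {
    "PQG": 18, "PEG": 15, "PRG": 14, "PKG": 14,
    "PNG": 13, "PDG": 13, "PMG": 13, "PQA": 12, "PQN": 12,
}

# Keep the candidates that reach the threshold.
neighborhood = [word for word, s in CANDIDATE_SCORES.items() if s >= T]
print(query_words)
print(neighborhood)
```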
BLAST Algorithm, Step 2
•Each neighbourhood word gives all positions in the database
where it is found (hit list).
[Figure: each neighbourhood word, e.g. PMG, is looked up in the database]
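A hit list of this kind can be sketched with a dictionary that maps every database word to its positions, which is one plausible reading of the "hashing technology" mentioned earlier. The two database sequences are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict


def build_word_index(database, w):
    """Map every w-mer to the (sequence id, offset) positions where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


# Hypothetical two-sequence database, invented for this example.
TOY_DATABASE = {"seq1": "AAPMGTT", "seq2": "GPQGPEG"}
NEIGHBORHOOD = ["PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PMG"]

index = build_word_index(TOY_DATABASE, 3)
# Look up each neighbourhood word; words absent from the database give no hits.
hits = {word: index[word] for word in NEIGHBORHOOD if word in index}
print(hits)
```

Building the index once makes each word lookup a constant-time operation, regardless of the size of the database.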
BLAST Algorithm, Step 2
The program extends matching segments (seeds) in both
directions by adding residues. Residues will be added
until the incremental score drops below a threshold.
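The extension step can be sketched as an ungapped extension that stops once the running score has fallen too far below the best score seen so far. The match and mismatch scores (+5 and -4) and the drop-off value are toy assumptions, not BLAST's actual parameters, and the function names are hypothetical.

```python
def _extend(query, subject, q, s, step, x_drop, score):
    """Walk in one direction; return (best score gain, residues kept)."""
    best, running, kept, walked = 0, 0, 0, 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score(query[q], subject[s])
        walked += 1
        if running > best:
            best, kept = running, walked
        elif best - running > x_drop:
            break   # score dropped too far below the best: stop extending
        q += step
        s += step
    return best, kept


def extend_seed(query, subject, q_start, s_start, w, x_drop=8):
    """Extend a seed of length w in both directions without gaps.

    Returns (score, start, end) where query[start:end] is the extended segment.
    """
    def score(a, b):
        return 5 if a == b else -4   # toy scores, not a real matrix

    total = sum(score(query[q_start + i], subject[s_start + i]) for i in range(w))
    right, r_len = _extend(query, subject, q_start + w, s_start + w, 1, x_drop, score)
    left, l_len = _extend(query, subject, q_start - 1, s_start - 1, -1, x_drop, score)
    return total + left + right, q_start - l_len, q_start + w + r_len


result = extend_seed("AAPQGTTKK", "CCPQGTTWW", 2, 2, 3)
print(result)
```

In this example the seed PQG is extended to the right over the two matching T residues and is not extended to the left, where the residues differ.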
Basic BLAST Algorithms
Program    Query             Database
BLASTP     protein           protein
BLASTN     DNA               DNA
BLASTX     translated DNA    protein
TBLASTN    protein           translated DNA
TBLASTX    translated DNA    translated DNA
PSI-BLAST
Position-Specific Iterated BLAST
• Distant relationships are often best detected by motif
or profile searches rather than pair-wise comparisons
• PSI-BLAST first performs a BLAST search.
• PSI-BLAST uses the information from the significant
BLAST alignments returned to construct a position-specific
score matrix, which replaces the query
sequence for the next round of database searching.
• PSI-BLAST may be iterated until no new significant
alignments are found.
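The idea of a position-specific score matrix can be sketched as follows. This is a strong simplification: it assumes ungapped, equal-length aligned segments, a uniform background frequency, and a fixed pseudocount, none of which reflects how PSI-BLAST itself weights sequences. All names and the example segments are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one log odds score table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # uniform background, an assumption
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, seq):
    """Score a sequence of the same length against the matrix."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


# Hypothetical aligned segments standing in for significant hits.
ALIGNED_HITS = ["PQG", "PEG", "PQG", "PKG"]
pssm = build_pssm(ALIGNED_HITS)
print(score_with_pssm(pssm, "PQG"), score_with_pssm(pssm, "WWW"))
```

Unlike a general substitution matrix, the score for a residue here depends on the column, so conserved positions are weighted more heavily than variable ones.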
BLAST Input
Steps in running BLAST:
•Enter your query sequence (cut-and-paste)
•Select the database(s) you want to search
•Choose output parameters
•Choose alignment parameters (scoring matrix, filters, …)
Example query=
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN
BLAST Output (1)
BLAST Output (2)
A low probability indicates that a match is unlikely
to have arisen by chance.
A high score, or preferably clusters of high scores,
indicates a likely relationship.
BLAST Output (3)
Low scores with high probabilities suggest that matches
have arisen by chance.
Alignment Significance in BLAST
P-value (probability)
Relates the score for an alignment to the likelihood that it
arose by chance. The closer to zero, the greater the
confidence that the hit is real.
E-value (expect value)
The number of alignments with a score at least this high that
would be expected by chance in that database (e.g. if E=10,
10 matches with scores this high are expected to be found by chance).
A match will be reported if its E is below the threshold.
Lower E thresholds are more stringent, and report fewer
matches.
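The relationship between the two measures and the effect of the threshold can be sketched as follows. The conversion P = 1 - exp(-E) is the standard relation between the E-value and the probability of finding at least one chance hit with that score; the hit names and E-values are hypothetical.

```python
import math


def p_value_from_e(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit this good."""
    return 1.0 - math.exp(-e_value)


def report(hits, e_threshold=10.0):
    """Keep hits whose E-value is below the threshold, best (lowest E) first."""
    kept = [hit for hit in hits if hit[1] < e_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical hits as (name, E-value) pairs.
HITS = [("hitC", 4.0), ("hitA", 1e-50), ("hitD", 25.0), ("hitB", 0.002)]

print(report(HITS))          # permissive threshold of 10
print(report(HITS, 0.01))    # a more stringent threshold reports fewer hits
print(p_value_from_e(0.002), p_value_from_e(10.0))
```

For small values E and P are nearly identical, whereas for large E the probability approaches 1.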
BLAST Output (4)
BLAST Output (5)
BLAST Output (6)
Low complexity filter
Local implementation - Blast in MRS
MRS also contains a BLAST implementation. It is simpler,
offers fewer options, and covers fewer databases, but is
faster.
Blast in MRS
MRS Blast remembers all your queries from one
session, and stores them in a table. The one you are
running is in that table too. Multiple BLASTs can run at
one time.
[Screenshot: query table showing the status values "Still running" and "Ready"]
Blast hitlist in MRS
Blast hitlist expansion in MRS
Low complexity motifs visible
Routing
Routing to Clustal
Routing MRS to Blast