Step 3: Tools
Database Searching
•Database Searching
•Sequence Alignment
•Scoring Matrices
•Significance of an alignment
•BLAST, algorithm
•BLAST, parameters
•BLAST, output
•Alignment significance in BLAST
©CMBI 2003
Database Searching
Identify similarities between
novel query sequences
whose structures and functions are unknown and uncharacterized
and
sequences in (public) databases
whose structures and functions have been elucidated.
N.B. The similarity might span the entire query sequence or just part of it!
Database searching (2)
– The query sequence is compared/aligned with every
sequence in the database.
– High-scoring database sequences are assumed to be
evolutionarily related to the query sequence.
– If sequences are related by divergence from a common
ancestor, they are said to be homologous.
Sequence Alignment
The purpose of a sequence alignment is to line up, across any number
of sequences, all residues that were derived from the same residue
position in the ancestral gene or protein.
gap = insertion or deletion
J.Leunissen
Scoring Matrix/Substitution Matrix
– Used to score the quality of an alignment
– Contains scores for pairs of residues (amino acids or
nucleotides) in a sequence alignment
– For protein/protein comparisons:
a 20 x 20 matrix of similarity scores where identical amino
acids and those of similar character (e.g. Ile, Leu) give
higher scores compared to those of different character (e.g.
Ile, Asp).
– Symmetric
Substitution Matrices
Not all amino acids are equal
– Some are more easily substituted than others
– Some mutations occur more often
– Some substitutions are kept more often
Mutations tend to favor some substitutions
– Some amino acids have similar codons
– They are more likely to be interchanged by a DNA mutation
Selection tends to favor some substitutions
– Some amino acids have similar properties/structure
– They are more likely to be kept
PAM250 Matrix
Scoring example
Score of an alignment is the sum of the scores of all pairs of
residues in the alignment
sequence 1:  T   C   C   P   S   I   V   A   R   S   N
sequence 2:  S   C   C   P   S   I   S   A   R   N   T
score:       1  12  12   6   2   5  -1   2   6   1   0
=> alignment score = 46
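The sum can be checked with a few lines of Python. This is a minimal sketch: the table holds only the pair scores needed for this example, copied from the per-position scores on the slide (assumed to be PAM250 entries), and the names `PAIR_SCORES`, `pair_score` and `alignment_score` are illustrative.

```python
# Partial score table: only the pairs that occur in this example,
# taken from the per-position scores on the slide (assumed PAM250).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


total = alignment_score("TCCPSIVARSN", "SCCPSISARNT")
print(total)  # prints 46
```

A full implementation would load the complete 20 x 20 matrix and also handle gap penalties, which this example does not need.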
Dayhoff Matrix (1)
– Derived from how often different amino acids replace other amino
acids in evolution.
– Created from a dataset of closely similar protein sequences (less
than 15% amino acid difference). These could be unambiguously
aligned.
– A mutation probability matrix was derived where the entries
reflect the probabilities of a mutational event.
– This matrix is called PAM 1. An evolutionary distance of 1 PAM
(point accepted mutation) means there has been 1 point mutation
per 100 residues
Dayhoff Matrix (2)
Log odds matrix: logs of elements of PAM matrix.
Score of mutation A -> B =
log ( observed A -> B mutation rate / mutation rate expected from amino acid frequencies )
When using a log odds matrix, the total score of the alignment is
given by the sum of the scores for each aligned pair of residues.
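A small numeric sketch of the log odds idea. The rates below are made-up illustration values, not entries of a real PAM matrix, and the function name `log_odds` and the choice of base 10 are assumptions made for this example.

```python
import math


def log_odds(observed_rate, expected_rate, base=10):
    """Log of the ratio between observed and expected mutation rates."""
    return math.log(observed_rate / expected_rate, base)


# Hypothetical rates, for illustration only.
favoured = log_odds(0.02, 0.01)     # seen twice as often as expected
neutral = log_odds(0.01, 0.01)      # seen exactly as often as expected
disfavoured = log_odds(0.005, 0.01) # seen half as often as expected

print(round(favoured, 3), neutral, round(disfavoured, 3))

# Adding log odds scores corresponds to multiplying the odds ratios,
# which is why an alignment score is a simple sum over aligned pairs.
summed = favoured + disfavoured
product = math.log((0.02 / 0.01) * (0.005 / 0.01), 10)
```

Substitutions seen more often than expected get a positive score, those seen less often get a negative score.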
Dayhoff Matrix (3)
PAM 1 may be used to generate matrices for greater evolutionary
distances by multiplying it repeatedly by itself.
PAM250:
– 2.5 mutations per residue
– equivalent to 20% matches remaining between two
sequences, i.e. 80% of the amino acid positions are
observed to have changed.
– is default in many analysis packages.
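The repeated multiplication can be illustrated on a toy scale. The sketch below uses a hypothetical two-letter alphabet instead of the real 20 x 20 PAM 1 matrix, and the names `TOY_PAM1`, `mat_mul` and `mat_pow` are invented for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply a matrix by itself `power` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy mutation probability matrix for a hypothetical two-letter alphabet:
# each residue stays unchanged with probability 0.99 per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1 after 250 steps, while the diagonal (the probability of observing the same residue) has dropped well below its PAM 1 value. This mirrors how the fraction of matching positions decreases as the PAM distance grows.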
BLOSUM Matrix
Limitation of the Dayhoff matrix:
Matrices based on the Dayhoff model of evolutionary rates are of
limited value for distantly related sequences because their
substitution rates are derived from alignments of sequences that
are at least 85% identical.
An alternative approach was developed by Henikoff and
Henikoff, using local multiple alignments of more distantly related
sequences.
BLOSUM Matrix (2)
The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on
the BLOCKS database.
The BLOCKS database uses the concept of blocks (ungapped
amino acid patterns), which act as signatures of a family of proteins.
Substitution frequencies for all pairs of amino acids were calculated
from these blocks and used to derive a log odds BLOSUM matrix.
Different matrices are obtained by varying the identity threshold.
For example, the BLOSUM80 matrix was derived from blocks in which
sequences at least 80% identical were clustered and weighted as a
single sequence.
Which Matrix to use?
Close relationships: low PAM, high BLOSUM
Distant relationships: high PAM, low BLOSUM
Reasonable defaults: PAM250, BLOSUM62
J.Kissinger
Significance of alignment (1)
When is an alignment statistically significant?
In other words:
How different is the observed alignment score from the scores
obtained by aligning random sequences to the query sequence?
Or:
What is the probability that an alignment with this score could have
arisen by chance?
Significance of alignment (2)
Database size = 20 x 10^6 letters
peptide    #hits
A          1 x 10^6
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
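The numbers in this table can be reproduced with a short calculation. The sketch assumes that all 20 amino acids are equally frequent (a probability of 1/20 per position), which the slide does not state but which matches its figures. It also ignores edge effects at the ends of sequences. The name `expected_hits` is illustrative.

```python
DATABASE_LETTERS = 20 * 10**6  # database size from the slide
ALPHABET_SIZE = 20             # assumption: 20 equally frequent amino acids


def expected_hits(peptide):
    """Expected number of exact chance matches of a peptide in the database."""
    return DATABASE_LETTERS / ALPHABET_SIZE ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(f"{peptide:<8} {expected_hits(peptide):g}")
```

Each extra residue divides the expected number of chance hits by 20, so a match of only seven residues is already unlikely to occur by chance in a database of this size.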
BLAST –
Basic Local Alignment Search Tool
•Find the highest scoring locally optimal alignments between a
query sequence and a database.
•Very fast algorithm
•Can be used to search extremely large databases
(uses a pre-indexed database which contributes to its great
speed)
•Sufficiently sensitive and selective for most purposes
•Robust – the default parameters can usually be used
BLAST Algorithm, Step 1
•For a given word length w (usually 3 for proteins) and a given score matrix:
create a list of all words (w-mers) that score at least T when compared to w-mers from the query.
Query sequence: LNKCKTPQGQRLVNQ
Word: PQG
Neighborhood words:
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PMG 13
Below threshold (T=13):
PQA 12
PQN 12
etc.
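Step 1 can be sketched for the single query word PQG. The slide does not name the score matrix. The three rows below are the P, Q and G rows of BLOSUM62 as commonly published, chosen as an assumption because they reproduce the scores shown. The function names are illustrative, and a complete enumeration may return words that the slide's shortened list leaves out.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumed BLOSUM62 rows for the three residues of the query word,
# in the column order given by AMINO_ACIDS.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3,
          -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3,
          -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
          -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def score(query_residue, other_residue):
    """Substitution score between a query residue and any residue."""
    return ROWS[query_residue][AMINO_ACIDS.index(other_residue)]


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            hits["".join(candidate)] = total
    return hits


words = neighborhood("PQG", threshold=13)
for word, word_score in sorted(words.items(), key=lambda kv: -kv[1]):
    print(word, word_score)
```

In a real search this enumeration is repeated for every w-mer in the query, which needs the full matrix rather than three rows.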
BLAST Algorithm, Step 2
•Each neighborhood word gives all positions in the database
where it is found (hit list).
[Figure: the neighborhood words (PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PMG 13) are looked up in the database; PMG is shown matched to a database position.]
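The hit list lookup can be sketched with a word index. This is a simplified illustration: the database is a single made-up string rather than a collection of sequences, and `build_word_index` and `find_hits` are hypothetical names, not part of BLAST.

```python
from collections import defaultdict


def build_word_index(sequence, w=3):
    """Map every w-mer in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - w + 1):
        index[sequence[pos:pos + w]].append(pos)
    return index


def find_hits(neighborhood_words, index):
    """Return the positions of each neighborhood word that occurs in the index."""
    return {word: index[word] for word in neighborhood_words if word in index}


# Hypothetical database sequence, for illustration only.
database = "MKPMGAAPQGTTPMGW"
index = build_word_index(database)

neighborhood_words = ["PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PMG"]
hits = find_hits(neighborhood_words, index)
print(hits)  # prints {'PQG': [7], 'PMG': [2, 12]}
```

Building the index once and reusing it for every query is what makes the lookup cheap compared with scanning the database for each word.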
BLAST Algorithm, Step 3
•The program tries to extend matching segments (seeds) out in
both directions by adding pairs of residues. Residues are
added until the incremental score drops below a threshold.
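A minimal sketch of ungapped seed extension. Two assumptions are made here that the slide does not specify. The stopping rule is interpreted as "stop once the running score falls more than `x_drop` below the best score seen so far". The scoring is a made-up identity/mismatch scheme instead of a real substitution matrix. All names are illustrative.

```python
def simple_score(a, b):
    """Hypothetical scoring: +5 for identical residues, -4 otherwise."""
    return 5 if a == b else -4


def extend(query, subject, q, s, step, x_drop):
    """Extend in one direction (step = +1 or -1) from positions q and s.

    Returns the best score gain and the number of residue pairs kept.
    """
    best_gain = running = 0
    best_len = added = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += simple_score(query[q], subject[s])
        added += 1
        if running > best_gain:
            best_gain, best_len = running, added
        elif best_gain - running > x_drop:
            break  # score dropped too far below the best: stop extending
        q += step
        s += step
    return best_gain, best_len


def extend_seed(query, subject, q_pos, s_pos, w=3, x_drop=10):
    """Extend a seed of length w in both directions; return score and query segment."""
    seed = sum(simple_score(query[q_pos + i], subject[s_pos + i]) for i in range(w))
    left_gain, left_len = extend(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    right_gain, right_len = extend(query, subject, q_pos + w, s_pos + w, +1, x_drop)
    start, end = q_pos - left_len, q_pos + w + right_len
    return seed + left_gain + right_gain, query[start:end]


# The seed PQG starts at position 6 in both sequences; the subject is made up.
score, segment = extend_seed("LNKCKTPQGQRLVNQ", "WWKCKTPQGQRLVWW", 6, 6)
print(score, segment)  # prints 55 KCKTPQGQRLV
```

The extension keeps only the part up to the best score, so the mismatching ends of the two sequences are trimmed from the reported segment.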
Basic BLAST Algorithms
BLASTN - compares a nucleotide query to a nucleotide database
BLASTP - compares a protein query to a protein database
BLASTX - compares a nucleotide query sequence translated in all reading
frames against a protein sequence database
TBLASTN - compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading frames.
TBLASTX - compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide sequence
database.
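The "translated in all reading frames" step used by BLASTX, TBLASTN and TBLASTX can be sketched as a six-frame translation. This sketch assumes the standard genetic code and uppercase A, C, G, T input only (no ambiguity codes). The function names are illustrative and are not taken from any BLAST package.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered with T, C, A, G at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of the string ('*' = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate three frames on each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
print(frames["+1"], frames["-1"])  # prints MA GH
```

Each of the six protein strings would then be compared with the protein query or database in the same way as in BLASTP.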
PSI-BLAST
Position-Specific Iterated BLAST
– Distant relationships are often best detected by motif or profile
searches rather than pairwise comparisons
– PSI-BLAST first performs a gapped BLAST database search.
– The PSI-BLAST program uses the information from any significant
alignments returned to construct a position-specific score matrix,
which replaces the query sequence for the next round of database
searching.
– PSI-BLAST may be iterated until no new significant alignments
are found.
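The idea of a position-specific score matrix can be sketched as follows. This is a strong simplification and not the PSI-BLAST procedure itself: it assumes ungapped hits of equal length, a uniform background frequency of 1/20, and a plain pseudocount, and the aligned hits are made up. `build_pssm` and `score_with_pssm` are hypothetical names.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build one log odds score column per alignment position."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned_hits[0])):
        column = [seq[col] for seq in aligned_hits]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment by summing the position-specific scores."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical ungapped hits from a first search round.
hits = ["PQG", "PEG", "PQG", "PKG"]
pssm = build_pssm(hits)

print(round(score_with_pssm(pssm, "PQG"), 2))
print(round(score_with_pssm(pssm, "WWW"), 2))
```

Unlike a fixed substitution matrix, the score for a residue now depends on its position: P scores highly in the first column because every hit has P there.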
BLAST Input
Steps in running BLAST:
•Enter your query sequence (cut-and-paste)
•Select the database(s) you want to search
•Choose output parameters
•Choose alignment parameters (e.g. scoring matrix, filters, ...)
Example query=
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN
BLAST Output (1)
BLAST Output (2)
A low probability indicates that a match is unlikely to have arisen by chance.
A high score, or preferably clusters of high scores, indicates a likely relationship.
BLAST Output (3)
Low scores with high probabilities suggest that matches have arisen by chance.
Alignment Significance in BLAST
P-value (probability)
– relates the score returned for an alignment to the likelihood of its
having arisen by chance; in general, the closer the value is to
zero, the greater the confidence that the match is real.
E-value (expect value)
– the number of alignments with a given score that would be
expected to occur at random in the database that has been
searched (e.g. if E=10, 10 matches with scores this high are
expected to be found by chance).
– A match will only be reported if its E value falls below the
threshold set.
– Lower E thresholds are more stringent, and report fewer matches.
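The effect of the E threshold, and the link between E-value and P-value, can be sketched as follows. The conversion P = 1 - exp(-E) assumes that the number of chance hits follows a Poisson distribution. The hit names and E-values are made up, and `p_value_from_e` and `reported_hits` are illustrative names.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit, assuming a Poisson model."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for tiny E


def reported_hits(hits, e_threshold=10.0):
    """Keep hits whose E-value falls below the threshold, best first."""
    kept = [hit for hit in hits if hit[1] < e_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical (name, E-value) pairs, for illustration only.
hits = [("hit_B", 0.5), ("hit_C", 25.0), ("hit_A", 1e-50)]

lenient = [name for name, _ in reported_hits(hits, e_threshold=10.0)]
strict = [name for name, _ in reported_hits(hits, e_threshold=1e-3)]
print(lenient)  # prints ['hit_A', 'hit_B']
print(strict)   # prints ['hit_A']

for name, e_value in hits:
    print(name, e_value, p_value_from_e(e_value))
```

For small values E and P are almost identical, while for large E the P-value approaches 1.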
BLAST Output (4)
BLAST Output (5)
BLAST Output (6)