Transcript scores
Rationale for searching sequence
databases
May 11, 2004
Writing projects due May 25
Quiz #3 on Thurs., May 20
Learning objectives-Why do we search sequence
databases? Understand the Smith-Waterman
algorithm of local alignment and the concept of
backtracing. FASTA and BLAST programs. PSI-BLAST.
Workshop-Use of PSI-BLAST to determine
sequence similarities.
Homework-Due May 20
Why search sequence databases?
1. I have just sequenced a gene. What is known
about the gene I sequenced?
2. I have a unique sequence. Is there similarity to
another gene that has a known function?
3. I found a new gene in a lower organism. Is it
similar to a gene from another species?
4. I have decided to work on a new gene. The
people in the field will not give me the plasmid. I
need the complete cDNA sequence to perform
PCR.
Perfect Searches
First “hit” should be an exact match.
Next “hits” should contain all of the
genes that are related to your gene
(homologs)
Next “hits” should be similar but are
not homologs
How does one achieve the
“perfect search”?
Comparison Matrices (PAM vs. BLOSUM)
Database Search Algorithms
Databases
Search Parameters
Expect Value-change threshold for score
reporting
Translation-of DNA sequence into protein
Filtering-remove repeat sequences
Smith-Waterman Algorithm
Advances in Applied Mathematics, 2:482-489 (1981)
The Smith-Waterman algorithm is a local alignment tool used
to obtain sensitive pairwise similarity alignments. The Smith-Waterman
algorithm uses dynamic programming. Operating via a matrix,
the algorithm uses backtracing and tests alternative paths to
the highest scoring alignments, and selects the optimal path as
the highest ranked alignment. The sensitivity of the
Smith-Waterman algorithm makes it useful for finding local
areas of similarity between sequences that are too dissimilar to
align over their full length. The S-W algorithm uses a lot of computer memory.
BLAST and FASTA are other search algorithms that use some
aspects of S-W.
Smith-Waterman (cont. 1)
a. It searches for both full and partial sequence matches.
b. Assigns a score to each pair of amino acids
-uses similarity scores
-uses positive scores for related residues
-uses negative scores for dissimilar substitutions and gaps
c. Initializes edges of the matrix with zeros
d. As the scores are summed in the matrix, any sum below 0 is
recorded as a zero.
e. Begins backtracing at the maximum value found
anywhere in the matrix.
f. Continues the backtrace until the score falls to 0.
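Steps a-f can be sketched in a few lines of Python. This is a minimal sketch, not the published implementation: the scores (match = +5, mismatch = -4, gap = -3 per gapped position) are assumptions chosen for illustration and stand in for a real similarity matrix such as PAM or BLOSUM, and the function name smith_waterman is hypothetical.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-3):
    """Local alignment of a and b with assumed toy scores (not PAM/BLOSUM)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step c: the first row and first column stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Step d: any sum below zero is recorded as zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step e: begin the backtrace at the maximum value in the matrix.
    i, j = best_pos
    top, bottom = [], []
    # Step f: continue until the score falls to zero.
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("AWGHE", "AWHE")
print(score)   # 17 with the toy scores above
print(top)     # AWGHE
print(bottom)  # AW-HE
```

The toy scores give a different total than a real matrix would, but the zero floor and the backtrace from the maximum cell work the same way.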
Smith-Waterman (cont. 2)
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0   0
W    0   0   0   0   3   0  20  12   4   0   0   0
H    0  10   2   0   0   1  12  18  22  14   6   0
E    0   2  16   8   0   0   4  10  18  28  20   0
A    0   0   8  21  13   5   0   4  10  20  27   0
E    0   0   6  13  19  12   4   0   4  16  26   0
     0   0   0   0   0   0   0   0   0   0   0   0
Put zeros on borders. Assign initial scores based on a scoring matrix.
Calculate new scores based on adjacent cell scores.
If sum is less than zero or equal to zero, begin new scoring with next cell.
Smith-Waterman (cont. 3)
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0   0
W    0   0   0   0   3   0  20  12   4   0   0   0
H    0  10   2   0   0   1  12  18  22  14   6   0
E    0   2  16   8   0   0   4  10  18  28  20   0
A    0   0   8  21  13   5   0   4  10  20  27   0
E    0   0   6  13  19  12   4   0   4  16  26   0
     0   0   0   0   0   0   0   0   0   0   0   0
AWGHE
|| ||
AW-HE
Score=28
Begin backtrace at the maximum value found anywhere on the matrix.
Continue the backtrace until the score falls to zero.
Calculation of percent similarity
A    W    G    H    E
A    W    -    H    E
5    15   -5   10   6     Blosum45 SCORES
          -3              GAP EXT. PENALTY
5 + 15 - 5 - 3 + 10 + 6 = 28
% SIMILARITY = (NUMBER OF POS. SCORES / NUMBER OF AAs IN REGION) x 100
% OVERALL SIMILARITY = (NUMBER OF POS. SCORES / NUMBER OF TOTAL AAs IN REGION) x 100
% SIMILARITY = 4/5 x 100 = 80%
% OVERALL SIMILARITY = 4/5 x 100 = 80%
Similarity Score = 28
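The arithmetic on this slide can be checked with a short Python sketch. The per-column scores are the ones shown for the AWGHE / AW-HE alignment; charging both the -5 and the -3 to the single gapped position is an assumption made so that the total matches the reported score of 28. The function name summarize is hypothetical.

```python
# (query residue, database residue, score) for each alignment column.
# Assumption: the gap column is charged -5 plus the -3 gap extension penalty.
columns = [
    ("A", "A", 5),
    ("W", "W", 15),
    ("G", "-", -5 + -3),
    ("H", "H", 10),
    ("E", "E", 6),
]


def summarize(columns):
    """Return (similarity score, percent similarity) for aligned columns."""
    score = sum(s for _, _, s in columns)
    positives = sum(1 for _, _, s in columns if s > 0)
    percent = 100.0 * positives / len(columns)
    return score, percent


score, percent = summarize(columns)
print(score, percent)  # 28 80.0
```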
FASTA (Pearson and Lipman 1988)
This is a combination of word search and the Smith-Waterman algorithm.
The query sequence is divided into small words of
certain size.
The initial comparison of the query sequence to
the database is performed using these “words”.
If these “words” are located on the same diagonal
in an array, the region surrounding the diagonal
is analyzed further.
Search time is proportional only to the size of the
database, not to (database*query sequence).
The FASTA program uses hash tables.
These tables speed the process of word search.
Query Sequence    = TCTCTC
                    123456  (position number)
Database Sequence = TTCTCTC
                    1234567 (position number)
You choose to use word size = 4 for your
table (total number of words in your table is
4^4 = 256)

Sequence (total of 256)   Position w/in query   Position w/in DB   Offset (Q minus DB)
TCTC                      1,3                   2,4                -1 or -3 or 1
CTCT                      2                     3                  -1
TTCT                      (none)                1                  (none)
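The table above can be reproduced with a small Python sketch. It is an illustration of the word lookup and offset idea only, not FASTA's actual code; the function names word_positions and offsets are hypothetical, and positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each word of length k to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def offsets(query, db, k):
    """Count how often each offset (query position minus DB position) occurs."""
    query_table = word_positions(query, k)
    db_table = word_positions(db, k)
    counts = Counter()
    for word, query_hits in query_table.items():
        for q_pos in query_hits:
            for d_pos in db_table.get(word, []):
                counts[q_pos - d_pos] += 1
    return counts


counts = offsets("TCTCTC", "TTCTCTC", 4)
print(dict(word_positions("TCTCTC", 4)))  # {'TCTC': [1, 3], 'CTCT': [2]}
print(counts.most_common(1))              # [(-1, 3)]
```

The offset -1 occurs three times, so the words that share it lie on one diagonal; that is the region FASTA would go on to examine.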
FASTA Steps
1. Local regions of identity are found. Identical offset values in a
contiguous sequence fall on the same diagonal; different offset values
fall on different diagonals. Diagonals are extended.
2. Rescore the local regions using PAM or BLOSUM matrix.
3. Eliminate short diagonals below a cutoff score.
4. Create a gapped alignment in a narrow segment and then
perform S-W alignment.
Summary of FASTA steps
1. Analyzes database for identical matches that are contiguous
(between 5 and 10 amino acids in length (same offset values)).
2. Longest diagonals are scored again using the PAM matrix (or
other matrix). The best scores are saved as “init1” scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined. The score for this
joined region is “initn”. This score may be lower due to a
penalty for a gap.
5. A S-W dynamic programming alignment is performed around the
joined sequences to give an “opt” score.
Thus, the time-consuming S-W step is performed only on top
scoring sequences
The ktup value
•The ktup (for k-tuples) value stands for the length of the word
used to search for identity.
•For proteins a ktup value of 3 would give a hash table of 20^3
elements (8000 entries).
•The higher the ktup value the less likely you will get a match
unless it is identical (remember the dot plots).
•The lower the ktup value the more background you will have.
•The higher the ktup value the faster the analysis (fewer
diagonals).
The following rules typically apply when using FASTA:
ktup   analysis
1      proteins - distantly related
2      proteins - somewhat related (default)
3      DNA - default
FASTA Versions
FASTA-nucleotide or protein sequence searching
FASTx/FASTy-compares a translated DNA query sequence
to a protein sequence database (forward
or backward translation of the query)
tFASTx/tFASTy-compares protein query sequence to
DNA sequence database that has been
translated into three forward and three
reverse reading frames
FASTA Statistical Significance
A way of measuring the significance of a score considers the mean
of the random score distribution.
The difference between the similarity score for your single alignment
and the mean of the random score distribution is normalized by
the standard deviation of that random score
distribution. This is the Z-score.
Higher Z-scores are better because
the further the real score is from this mean (in standard deviation units)
the more significant it is.
FASTA Statistical Significance
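Z score for a single alignment:

\[
Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}
\]

\[
\text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2 - \dfrac{\left(\sum \text{scores}\right)^2}{\text{Total\#ofSequences}}}{\text{Total\#ofSequences}}}
\]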
[Figure: distribution of similarity scores, marking the mean similarity score of the complete database and the mean similarity score of related records]
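The Z-score calculation can be sketched in Python as follows. This is a minimal sketch that follows the slide's formula with the total number of sequences in the denominator; the database scores below are invented for illustration and the function name z_score is hypothetical.

```python
import math


def z_score(score, database_scores):
    """Z-score of one alignment score against all database scores."""
    n = len(database_scores)
    total = sum(database_scores)
    total_sq = sum(s * s for s in database_scores)
    mean = total / n
    # Standard deviation: sqrt((sum of squares - (sum)^2 / n) / n)
    std_dev = math.sqrt((total_sq - total * total / n) / n)
    return (score - mean) / std_dev


# Hypothetical similarity scores for a tiny "database" of five sequences.
database_scores = [10, 20, 30, 40, 50]
print(round(z_score(100, database_scores), 2))  # 4.95
```

A score equal to the database mean gives Z = 0; the further above the mean the score lies, the larger Z becomes.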
FASTA statistics (cont.)
Using the distribution of the z-scores in the database, the FastA
program can estimate the number of sequences that would
be expected to produce, purely by chance, a z-score greater than or
equal to the z-score obtained in the search.
This is reported as the E() value. This value is
the number of sequences you would expect to find with this score by
searching a database of random sequences.
Thus, as the z-score increases, the E() value decreases.
Evaluating the Results of FASTA
Best
SCORES
Init1: 2847 Initn: 2847 Opt: 2847
z-score: 2609.2 E(): 1.4e-138
Smith-Waterman score: 2847; 100.0% identity in 413 overlap
Good
SCORES
Init1: 719 Initn: 748 Opt: 793
z-score: 734.0 E(): 3.8e-34
Smith-Waterman score: 796; 41.3% identity in 378 overlap
Mediocre
SCORES
Init1: 249 Initn: 304 Opt: 260
z-score: 243.2 E(): 8.3e-07
Smith-Waterman score: 270; 35.0% identity in 183 overlap
BLAST
Basic Local Alignment Search Tool
Speed is achieved by:
Pre-indexing the database before the search
Parallel processing
Uses a hash table that contains
neighborhood words rather than just random
words.
Neighborhood words
The program declares a hit if the word taken from
the query sequence has a score >= T when a
scoring matrix is used.
This allows the word size W (similar to the
ktup value) to be kept high (for speed) without
sacrificing sensitivity.
If T is increased by the user, the number of
background hits is reduced and the program will
run faster.
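The effect of the threshold T can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the five-letter alphabet, the residue groups, and the scores (+4 identical, +2 same group, -3 otherwise) are a hypothetical stand-in for a real scoring matrix, and this is not how BLAST itself builds its word list.

```python
from itertools import product

# Hypothetical residue groups used only by this toy scoring scheme.
GROUPS = {"I": "hydrophobic", "L": "hydrophobic", "V": "hydrophobic",
          "D": "acidic", "E": "acidic"}


def toy_score(a, b):
    """Assumed scores: +4 identical, +2 same group, -3 otherwise."""
    if a == b:
        return 4
    if GROUPS[a] == GROUPS[b]:
        return 2
    return -3


def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against word."""
    alphabet = sorted(GROUPS)
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(toy_score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(neighborhood("LD", 6))  # ['ID', 'LD', 'LE', 'VD']
print(neighborhood("LD", 8))  # ['LD']
```

Raising T from 6 to 8 shrinks the neighborhood from four words to the exact word only, which is why a higher T means fewer background hits and a faster search.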
Comparison Matrices
In general, the BLOSUM series is thought to be superior to the
PAM series because it is derived from areas of conserved sequences.
It is important to vary the parameters when performing a sequence
comparison. Similarity scores for truly related sequences are
usually not sensitive to changes in scoring matrix and gap penalty.
Thus, if your “hits list” holds up after changing these parameters
you can be more sure that you are detecting similar sequences.
Which Program should one use?
Most researchers use methods for
determining local similarities:
Smith-Waterman (gold standard)
FASTA and BLAST: do not find every possible alignment
of query with database sequence. These
are used because they run faster than S-W.
What are the different BLAST
programs?
blastp
compares an amino acid query sequence against a protein sequence
database
blastn
compares a nucleotide query sequence against a nucleotide
sequence database
blastx
compares a nucleotide query sequence translated in all reading
frames against a protein sequence database
tblastn
compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames
tblastx
compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence
database. Please note that tblastx program cannot be used with the
nr database on the BLAST Web page.
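The five descriptions above amount to a lookup on three things: the query type, the database type, and the level at which the comparison is made. A minimal Python sketch of that lookup follows; the dictionary and function names are hypothetical and are not part of any BLAST distribution.

```python
# Key: (query type, database type, level at which sequences are compared).
BLAST_PROGRAMS = {
    ("protein", "protein", "protein"): "blastp",
    ("nucleotide", "nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein", "protein"): "blastx",      # query is translated
    ("protein", "nucleotide", "protein"): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", "protein"): "tblastx",  # both are translated
}


def choose_blast_program(query, database, compared_as):
    """Return the program name for this combination, per the list above."""
    return BLAST_PROGRAMS[(query, database, compared_as)]


print(choose_blast_program("nucleotide", "protein", "protein"))  # blastx
```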
When to use the correct program
Problem: Identify unknown protein

Program                      Explanation
BLASTP; FASTA3               General protein comparison. Use ktup=2 for speed; ktup=1 for sensitive search.
Smith-Waterman               Slower than FASTA3 and BLAST but provides maximum sensitivity.
TFASTX3; TFASTY3; TBLASTN    Use if homolog cannot be found in protein databases; approx. 33% slower.
Psi-BLAST                    Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses the matrix to find distantly related sequences.
When to use the correct program (cont. 1)
Problem                  Program                               Explanation
Identify new orthologs   TFASTX3; TFASTY3; TBLASTN; TBLASTX    Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences w/in the same species.
Identify EST sequence    FASTX3; FASTY3; BLASTX; TBLASTX       Always attempt to translate your sequence into protein prior to searching.
Identify DNA sequence    FASTA; BLASTN                         Nucleotide sequence comparison.
Choosing the database
Remember that the E value increases
approximately linearly with database size.
When searching for distant relationships always
use the smallest database likely to contain the
homolog of interest.
Thought problem: If the E-value one obtains for a
search is 12 in Swiss-PROT and the E-value one
obtains for the same search is 74 in PIR, how large is
PIR compared to Swiss-PROT?
74/12 = ~6, so PIR is approximately 6 times as large as Swiss-PROT.
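The thought problem is a one-line calculation. The Python sketch below assumes, as the slide does, that the E value grows linearly with database size for the same hit; the function name relative_database_size is hypothetical.

```python
def relative_database_size(e_value_small_db, e_value_large_db):
    """Size ratio of two databases, assuming E scales linearly with size."""
    return e_value_large_db / e_value_small_db


ratio = relative_database_size(12, 74)
print(round(ratio, 2))  # 6.17
```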
Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive
This is due to:
retrotransposons
ALU region
microsatellites
centromeric sequences, telomeric sequences
5’ Untranslated Region of ESTs
Example of ESTs with simple low complexity regions:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
Filtering Repetitive Sequences
(cont. 1)
Programs like BLAST have the option of
filtering out low-complexity regions.
Repetitive sequences increase the chance of
a match during a database search
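The idea of filtering can be illustrated with a deliberately simple Python sketch that masks runs of a repeated short unit with N. This is a simplistic stand-in and not the filter BLAST actually applies (BLAST uses dedicated low-complexity programs such as DUST for nucleotides and SEG for proteins). The function name, the unit length of 2, and the minimum of 5 copies are assumptions for illustration; the test sequence is the start of the T27311 example followed by a TC run of illustrative length.

```python
import re


def mask_simple_repeats(seq, unit_len=2, min_copies=5, mask_char="N"):
    """Replace runs of a repeated unit (e.g. TCTCTC...) with mask_char."""
    # One captured unit followed by at least (min_copies - 1) more copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


est = "GGGTGCAGGAATTCGGCACGAG" + "TC" * 13
masked = mask_simple_repeats(est)
print(masked)  # GGGTGCAGGAATTCGGCACGAG followed by 26 N characters
```

After masking, the TC run can no longer produce word matches against unrelated sequences that happen to contain the same repeat.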
PSI-BLAST
PSI-position specific iterative
A position specific scoring matrix (PSSM) is
constructed automatically from multiple HSPs of the
initial BLAST search. Normal E value is used.
This PSSM is used as the new scoring matrix for a
second BLAST search. Low E value is used
(E=.001).
Result-1) obtain distantly related sequences
2) find out the important residues that
provide function or structure.
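How a PSSM captures "important residues" can be sketched in Python. This is a toy illustration under stated assumptions, not PSI-BLAST's actual procedure: the aligned hit segments are invented, the background frequency is uniform (1/20), a flat pseudocount of 1 is used, and no sequence weighting is applied. The function names build_pssm and score_with_pssm are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds score per residue per column from equal-length aligned hits."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    n = len(aligned)
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / (
                n + pseudocount * len(alphabet))
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores for a segment of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical HSP segments from a first search round.
hits = ["AWHE", "AWHD", "AWKE", "SWHE"]
pssm = build_pssm(hits)
print(pssm[1]["W"] > pssm[1]["A"])  # True: W is conserved in column 2
print(score_with_pssm(pssm, "AWHE") > score_with_pssm(pssm, "GGGG"))  # True
```

Columns where one residue dominates get high scores for that residue, which is how the second search round rewards matches at conserved positions.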