Transcript Slide 1

Blast 2.0 Details
• The Filter Option:
– process of hiding regions of (nucleic acid or amino
acid) sequence having characteristics that frequently
lead to spurious high scores
– typically involves the removal of repeated or low
complexity regions
– The SEG program is used to mask or filter LCRs in
amino acid queries.
– The DUST program is used to mask or filter LCRs in
nucleic acid queries
– More than half of the proteins in the database contain
at least one low complexity region
SEG Filter Example
Default filtering option in BLAST 2.0 automatically converts low complexity
sequences into X's which can be seen in the query line of the alignments
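To make the masking idea concrete, here is a minimal sketch that replaces low-complexity windows with X's. It is not the SEG algorithm itself: it simply measures Shannon entropy in a sliding window, and the function names, `window`, and `threshold` values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'.

    window and threshold are illustrative assumptions, not tuned values.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)
```

A run of a single repeated residue has entropy 0 and is masked entirely, while a window of twelve distinct residues (entropy about 3.6 bits) is left untouched.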
PSI-Blast
• Position Specific Iterated BLAST
• an automated, easy-to-use version of a "profile"
search,
– a sensitive way to look for sequence homologues
• Intuition: substitution matrices should be specific
to a particular site, e.g. penalize an alanine → glycine
substitution more heavily in a helix
PSI-Blast: Outline
• Algorithm:
– First perform a gapped BLAST database search
– PSI-BLAST uses information from significant alignments to
construct a position-specific score matrix (PSSM),
– PSSM replaces the query sequence for the next round of
database searching.
– PSI-BLAST is iterated until no new significant alignments are
found.
• Details:
– Set initial thresholds high. Inspect each iteration's result for
suspicious sequences.
– Do several iterations (~5), or until no new sequences are found
– Even if only looking for a small set of sequences, make the initial
search very broad
• First, search NR with up to 5 iterations to build the PSSM
• Then use that PSSM to search the restricted set of sequences
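The iteration described in the outline can be sketched as a simple loop. This is only a schematic under stated assumptions: `search` and `build_model` are hypothetical callables standing in for the database search and the PSSM construction, and no real BLAST API is used.

```python
def iterate_search(query, search, build_model, max_iterations=5):
    """Repeat search -> rebuild model until no new hits appear.

    search(model)              -> iterable of significant hit identifiers (hypothetical)
    build_model(query, hits)   -> model used for the next round (hypothetical)
    """
    included = set()
    for _ in range(max_iterations):
        # Round 1 uses a model built from the query alone (no hits yet).
        model = build_model(query, included)
        hits = set(search(model))
        new_hits = hits - included
        if not new_hits:
            break  # converged: no new significant alignments
        included |= new_hits
    return included
```

In practice each round's `new_hits` should be inspected by eye before they are added, since the loop itself cannot tell homologues from false positives.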
PSI-Blast: Details
• To calculate the profile for position 108: only shaded regions are used
• To calculate the profile at position i, pseudo-counts are used
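As a rough illustration of why pseudo-counts matter, the sketch below turns one alignment column into log-odds scores. It assumes a simple background-weighted pseudo-count, which is a simplification and not the scheme PSI-BLAST actually uses; the function name, the toy four-letter alphabet, and the `pseudocount` weight are all assumptions.

```python
import math


def pssm_column(residues, background, pseudocount=1.0):
    """Log-odds scores (in bits) for one alignment column.

    residues    : string of residues observed in this column
    background  : dict residue -> background frequency (must sum to 1)
    pseudocount : total weight of the background prior (assumed value)
    """
    total = len(residues) + pseudocount
    scores = {}
    for aa, bg in background.items():
        # Observed count plus a share of the pseudo-count, so freq is never 0.
        freq = (residues.count(aa) + pseudocount * bg) / total
        scores[aa] = math.log2(freq / bg)
    return scores
```

Without the pseudo-count, any residue never observed in the column would get a frequency of 0 and a score of minus infinity; with it, unseen residues are penalized but still possible.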
PSI-BLAST Caveats
• Good:
– Increased ability to find distant homologues
– If the sequences used to construct PSSMs are all homologous,
the sensitivity at a given specificity improves significantly.
• Bad:
– If non-homologous sequences are included in the PSSMs, the
PSSMs are “corrupted.” They then pull in more non-homologous
sequences and perform worse than generic substitution matrices
• Advice:
– Take special care to prevent non-homologous sequences from being
included in the PSSM calculation.
• When in doubt, leave it out!
• Examine sequences with moderate similarity carefully.
– Be particularly cautious about matches to sequences with highly
biased amino acid content
Database Homology Search
• Homology search
– For genes/RNAs which do not encode proteins
• nucleotide searches are relatively inefficient at identifying highly diverged sequences
– For genes which encode proteins
• protein-protein searches are significantly better than nucleotide searches
– (two mRNA sequences might only be ~40% identical at the nucleotide
level, but could be 70% similar in the proteins they encode)
• Rules of thumb:
– 80% similarity implies same structure and function
– highly diverged homologs could have down to 25% similarity
– the "twilight zone" in the range of 20%: judgement about
significant similarity is quite difficult
– distantly related homologs may lack significant similarity
Database Homology Search
• E-values:
– the expected number of sequences in the database that would
achieve a given score by chance
– are more useful than raw scores, bit scores, or percentage identity
– An E-value of 0.001 is a standard threshold (unless the sequence is
biased, e.g. low complexity)
– E-values below 10^-50 are highly significant.
• Caveats with low E-values:
– while the evolutionary relationship is highly likely, it does not
necessarily imply identical function (multi-domain proteins)
– if the E-value is extremely low AND the alignment covers the full
length of both sequences, then they likely share a related function
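For reference, the standard relation between a bit score S' and its E-value is E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below applies it directly; the function names are assumptions, and real BLAST output uses effective (edge-corrected) lengths, which this ignores.

```python
def expect_value(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST itself uses effective lengths (assumption ignored here).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def is_significant(e_value, threshold=1e-3):
    """Apply an E-value cutoff; 0.001 is the standard threshold from the slide."""
    return e_value < threshold
```

Each additional bit of score halves the E-value, and searching a database twice as large doubles it, which is why E-values are easier to compare across searches than raw scores.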
Profiles
• Rather than identifying only the “consensus” (i.e.
most common) amino acid at a particular
location, we can assign a probability to each
amino acid in each position of the domain.
• Like a PSSM, but just for the domain.
Position    1      2      3
A          .1     .5     .25
C          .3     .1     .25
D          .2     .2     .25
E          .4     .2     .25
Applying a Profile
• Calculate score (probability of match) for a
profile at each position in a sequence by
multiplying individual probabilities.
• Use “Sliding window”:
Position    1      2      3
A          .1     .5     .25
C          .3     .1     .25
D          .2     .2     .25
E          .4     .2     .25
For sequence EACDC:
EAC = .4 * .5 * .25 = .05
ACD = .1 * .1 * .25 = .0025
CDC = .3 * .2 * .25 = .015
• Can transform probability to significance given
random distribution assumption
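The sliding-window calculation above can be sketched in a few lines. The profile values are the ones from the slide; the names `PROFILE` and `window_scores` are assumptions, and residues missing from a column are treated as probability 0.

```python
# Three-position profile over the toy alphabet A, C, D, E (values from the slide).
PROFILE = [
    {"A": 0.1, "C": 0.3, "D": 0.2, "E": 0.4},
    {"A": 0.5, "C": 0.1, "D": 0.2, "E": 0.2},
    {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25},
]


def window_scores(sequence, profile):
    """Slide the profile along the sequence; return (window, probability) pairs."""
    width = len(profile)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        probability = 1.0
        for column, residue in zip(profile, window):
            # Multiply the per-position probabilities of the observed residues.
            probability *= column.get(residue, 0.0)
        scores.append((window, probability))
    return scores


for window, probability in window_scores("EACDC", PROFILE):
    print(window, round(probability, 4))
```

For longer profiles the product quickly becomes very small, so summing log-probabilities is the usual alternative to multiplying raw values.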
Sequence Logos