PSI-BLAST is used for

Blastology and Open Source:
Needs and Deeds
Iddo Friedberg, Ph.D.
The Burnham Institute
February, 2003
Prologue
•BLAST – Basic Local Alignment Search
Tool: fast sequence similarity searching,
query vs. database (1990)
•Gapped BLAST – now we can use gaps in
the alignment (1996)
•PSI-BLAST – Position Specific Iterated
BLAST: iterated BLAST searches increase
sensitivity (1997). 7800 citations over 6 years
Blastology & Open Source:
Needs & Deeds
• How PSI-BLAST works
• Post PSI-BLAST processing possibilities
• PeCoP: conserved positions in profiles
– Information content
– Different measures and their purposes
– Implementation
• Bio* tools
• NCBI tools
• What do we still need?
PSI-BLAST 101
• Take a sequence
• Search for similar sequences in a full sequence database
• Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
FGLLRT-I-T-YMTN
-GLVRT-I---LGLE
-RLTRD-I---LGLY
FGLLRT-I---YMTQ
FGLLRT-I---FMTS
• Construct aa profile, and represent conservation in each position numerically:
A 029001100003200
C 000070000000000
.
.
Y 002000080202000
• Profile holds more information than a single sequence: use the profile to retrieve additional sequences
• Search using profile; new sequences in the multiple alignment
• Construct new profile:
A 027005101003200
.
.
Y 202000060202000
After several iterations of this procedure we have:
• Sequence information, inc. links to annotation
• Several sets of multiple alignments
• Profiles, derived by us or by PSI-BLAST
• Thresholding information (alignment statistics)
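The profile-construction step can be sketched roughly as follows. This is a minimal illustration that only counts residues per column of the example alignment; PSI-BLAST itself also applies sequence weighting and pseudocounts, which are left out, so the counts are not expected to match the profile rows shown on the slide. The name `column_counts` is ours and is not part of any library.

```python
from collections import Counter

# Example alignment from the slide ('-' marks a gap).
ALIGNMENT = [
    "MGLLTREIF--ILQQ",
    "FGLGRT-I-T-YMTN",
    "FGLLRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "-RLTRD-I---LGLY",
    "FGLLRT-I---YMTQ",
    "FGLLRT-I---FMTS",
]

def column_counts(alignment):
    """Return one Counter of residue counts per alignment column, ignoring gaps."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        Counter(seq[col] for seq in alignment if seq[col] != "-")
        for col in range(length)
    ]

profile = column_counts(ALIGNMENT)
print(profile[0])  # first column: mostly F, one M, gaps ignored
print(profile[2])  # third column: L in every sequence
```

Dividing each count by the column total gives the per-position frequencies that the conservation measures later in the talk operate on.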
Post-BLAST Information Flow
Typical Wetlab
PSI-BLAST
Sequence
Alignments
Annotations
Statistics
Locating homologs
Function Prediction
(if function unknown)
Enter Bioinformatics, Stage Left…
• Process many queries
• More sophisticated post-processing, e.g.
– Structure prediction
– Phylogenetics
– Function prediction: using annotation /
structural data / phylogenetic data
• “Unusual” searching:
– Need to change parameter default values
Post-BLAST Information Flow
Bioinformatics
PSI-BLAST
Annotations
Sequence
Alignments
Profiles
Statistics
Locating homologs
Function Prediction
(if function unknown)
Fold prediction
Homology Modeling
Tree building
PDB-BLAST: Sensitive Fold Recognition
(Li & Godzik)
PSI-BLAST: Large sequence Database (nr85)
PSI-BLAST: Structure Database (PDB)
Sequence
Alignments
Profiles
Statistics
Fold recognition
PSI-PRED
Secondary Structure Prediction (David Jones)
PSI-BLAST
Filtered database:
• No transmembrane segments
• No coiled-coils
Profiles
Windows of Length 15
1st Neural Network
2nd Neural Network
3-state Prediction
PSI-BLAST is used for:
• Distant homology detection
• Fold assignment
• Profile-profile comparison
• Domain identification
• Evolutionary Analysis (e.g. tree building)
• Sequence Annotation / function assignment
• Profile export to other programs
• Sequence clustering
• Structural genomics target selection
PSI-BLAST’s ability to do all of the above has been evaluated. So have
competing programs, which used PSI-BLAST as a standard for comparison.
Why Profiles?
• More informative than sequences
• More accurate than regexps (“motifs”)
• PSI-BLAST’s consecutive profiles enable
us to obtain an “evolutionary vista”
• PeCoP: illustrating the use of iterated
profiles to detect Persistently Conserved
Positions
PeCoP: locating important residues
(Friedberg & Margalit)
PSI-BLAST
Large sequence
Database (nr)
Sequence
Alignments
Statistics
Profiles
Find Conserved
Positions
Locate important
residues
What is a Conserved Position?
• A conserved position has
a high frequency of any
single amino-acid type in
the MSA column.
•Conservation is usually
measured by determining
the information content or
the relative entropy of a
position
Information Content I: Uncertainty
Uncertainty: the number of “yes / no” questions to verify a
state:
• Coin toss: 1 question. (“Is it heads?”)
• Nucleotide in a DNA sequence: 2 questions
(“Is it a purine?”) -> (“Is it an adenine?”)
• Uncertainty is measured in bits
• Maximum uncertainty: $\log_2(\text{number of possible states})$
Coin toss: $\log_2 2 = 1$ bit
DNA: $\log_2 4 = 2$ bits
Proteins: $\log_2 20 = 4.32$ bits
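As a quick check of the figures above, the maximum uncertainty for a set of equally likely states can be computed directly. The function name `max_uncertainty_bits` is an illustrative choice, not something from the talk.

```python
import math

def max_uncertainty_bits(n_states):
    """Maximum uncertainty, in bits, for n equally likely states."""
    return math.log2(n_states)

for name, n in [("coin toss", 2), ("DNA", 4), ("protein", 20)]:
    print(f"{name}: {max_uncertainty_bits(n):.2f} bits")
```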
Information Content II: Measuring
Positional Conservation
Information content is the reduction in uncertainty
• Uncertainty “before”: $\log_2 20 = 4.32$ bits
• Uncertainty “after” (i.e. when we know the MSA position makeup): $-\sum_{i=1}^{20} P_i \log_2 P_i$
• Uncertainty difference is therefore: $IC = \log_2 20 - \left(-\sum_{i=1}^{20} P_i \log_2 P_i\right)$
• Fully conserved position: $IC = 4.32 - 20 \times 0 = 4.32$
• Not conserved at all: $IC = 4.32 - \left(-20 \times \frac{1}{20} \log_2 \frac{1}{20}\right) = 4.32 - 4.32 = 0$
“The more conserved a position, the higher its information content”
Information Content II: Measuring
Positional Conservation
Worked example, for an MSA column containing D, D, D, E, G:
$P_D = 3/5 = 0.6$
$P_E = 1/5 = 0.2$
$P_G = 1/5 = 0.2$
Uncertainty “after”: $-\sum_{i=1}^{20} P_i \log_2 P_i \approx 1.37$ bits
Information content: $IC = \log_2 20 - \left[-\sum_{i=1}^{20} P_i \log_2 P_i\right] \approx 4.32 - 1.37 = 2.95$ bits
“The more conserved a position, the higher its information content”
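The calculation for a single MSA column can be sketched as below, using the D, D, D, E, G column as input. This is a minimal illustration that assumes the column is given as a gap-free string of one-letter residue codes and applies no sequence weighting; `information_content` is our own name.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """IC = log2(alphabet_size) - H, where H is the Shannon entropy of the column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Residue types absent from the column contribute 0 to the entropy.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

print(round(information_content("DDDEG"), 2))  # partly conserved column
print(round(information_content("DDDDD"), 2))  # fully conserved column
```

A column containing each of the 20 residue types exactly once gives an IC of essentially zero, which matches the "not conserved at all" case.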
Division by Prior Frequencies:
“Conserved” vs. “Distinct”
• A conserved position has a high frequency of any given amino-acid type in the MSA column.
• “High Frequency” meaning:
– 1) a high frequency in the column? or
– 2) a higher-than-expected frequency in the column?
• Higher-than-expected: based on the frequencies of residue types
in the “sequence universe”. (SwissProt).
Question: “How conserved is a position?”
Do not divide by priors. Use $IC = \log_2 20 - \left(-\sum_{i=1}^{20} P_i \log_2 P_i\right)$
Question: “How distinct is a position?”
Divide by priors. Use $IC = \sum_{i=1}^{20} P_i \log_2\left(\frac{P_i}{Q_i}\right)$, where $Q_i$ is the expected (prior) frequency of residue type $i$
When dividing by priors: relative entropy
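The relative-entropy variant can be sketched as follows. The prior frequencies here are placeholders: a uniform background, plus a deliberately skewed table in which W is rare and L is common. Neither is real SwissProt data; a real run would substitute database-derived frequencies. The names `relative_entropy`, `UNIFORM_PRIORS` and `SKEWED_PRIORS` are ours.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder priors (assumption): uniform background frequencies.
UNIFORM_PRIORS = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# Hypothetical skewed priors: W made rare, L made common, total still sums to 1.
SKEWED_PRIORS = dict(UNIFORM_PRIORS, W=0.01, L=0.09)

def relative_entropy(column, priors):
    """Sum over observed residue types of P_i * log2(P_i / Q_i)."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) / priors[aa])
        for aa, n in counts.items()
    )

print(round(relative_entropy("WWWWW", SKEWED_PRIORS), 2))  # rare residue conserved
print(round(relative_entropy("LLLLL", SKEWED_PRIORS), 2))  # common residue conserved
```

With uniform priors the measure reduces to the no-priors information content, so the two formulas only diverge when the background frequencies differ between residue types.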
20 Amino Acids… or Less?
• A conserved position has a high frequency of any
given amino-acid type in the MSA column.
• “Amino acid type” meaning:
– 1) There are 20 amino acid types
– 2) There are fewer, because they can be grouped
into similar physico-chemical types
20 Amino Acids… or Less?
Representative letter   Physico-chemical property   Included residue types
F                       Hydrophobic                 A, V, L, I, M, C
R                       Aromatic                    F, W, Y, H
O                       Polar                       S, T, N, Q
T                       Positive                    R, K
N                       Negative                    E, D
P                       Proline                     P
G                       Glycine                     G
20-letter alphabet: $IC = \log_2 20 - \left(-\sum_{i=1}^{20} P_i \log_2 P_i\right)$
7-letter alphabet: $IC = \log_2 7 - \left(-\sum_{i=1}^{7} P_i \log_2 P_i\right)$
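Applying the reduced alphabet amounts to mapping each residue to its group letter before computing the information content with 7 states instead of 20. The sketch below uses the grouping from the table; the function names and the gap-free string input are assumptions made for illustration.

```python
import math
from collections import Counter

# Seven-group alphabet from the table: representative letter -> member residues.
GROUPS = {
    "F": "AVLIMC",  # hydrophobic
    "R": "FWYH",    # aromatic
    "O": "STNQ",    # polar
    "T": "RK",      # positive
    "N": "ED",      # negative
    "P": "P",       # proline
    "G": "G",       # glycine
}
TO_GROUP = {aa: rep for rep, members in GROUPS.items() for aa in members}

def reduce_column(column):
    """Translate a column of residues into the 7-letter group alphabet."""
    return "".join(TO_GROUP[aa] for aa in column)

def information_content(column, alphabet_size):
    """IC = log2(alphabet_size) - Shannon entropy of the column."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

column = "DEDED"
print(reduce_column(column))                          # all residues fall in group N
print(round(information_content(column, 20), 2))      # 20-letter alphabet
print(round(information_content(reduce_column(column), 7), 2))  # 7-letter alphabet
```

A column that alternates between D and E is only partly conserved in the 20-letter alphabet but fully conserved in the reduced one, which is the point of grouping by physico-chemical type.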
IC: Remember This
• Information content == reduction in
uncertainty. Used for measuring positional
conservation
• “The more conserved a position, the
higher its information content”
• We can divide (or not) by expected prior
frequencies
• We can group (or not) the 20 amino acids
into a smaller alphabet
Possible Schemes for Calculating
Positional Conservation
                     Priors                                                            No Priors
20-letter Alphabet   $IC = \sum_{i=1}^{20} P_i \log_2\left(\frac{P_i}{Q_i}\right)$     $IC = \log_2 20 - \left(-\sum_{i=1}^{20} P_i \log_2 P_i\right)$
Reduced Alphabet     $IC = \sum_{i=1}^{7} P_i \log_2\left(\frac{P_i}{Q_i}\right)$      $IC = \log_2 7 - \left(-\sum_{i=1}^{7} P_i \log_2 P_i\right)$
Find Conserved Positions:
Set a Threshold
• Threshold is determined by normalizing the IC
distribution over a sequence to
mean == 0, SD == 1
• Then set a threshold
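The normalize-then-threshold step might look like the sketch below. The cutoff of 1.0 standard deviations and the use of the population standard deviation are assumptions chosen for illustration; the slide does not state which values PeCoP uses. The IC values in the example are made up.

```python
import statistics

def conserved_by_zscore(ic_values, threshold=1.0):
    """Return indices of positions whose z-scored IC is at or above the threshold."""
    mean = statistics.mean(ic_values)
    sd = statistics.pstdev(ic_values)
    if sd == 0:
        return []  # all positions identical: nothing stands out
    return [i for i, ic in enumerate(ic_values) if (ic - mean) / sd >= threshold]

# Made-up per-position IC values for a short sequence.
ic_values = [0.5, 0.7, 4.3, 0.6, 0.4, 3.9, 0.5]
print(conserved_by_zscore(ic_values))
```

Normalizing per sequence means the threshold is relative: a position is called conserved because it stands out from the rest of that profile, not because it exceeds a fixed number of bits.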
Find Conserved Positions:
Conservation over Profiles
• Determine conservation in a profile according to
one of the four schemes discussed
• But PSI-BLAST gives us several profiles
(nIterations -1)
• Therefore, a position is conserved if it retains
conservation through successive iterations.
• But retention does not have to be 100%
Retention Schemes
1. Majority vote: if a position is conserved in x out
of n iterations, it is considered conserved.
2. Persistent conservation: conservation in the
first & last iteration
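Both retention schemes can be sketched in a few lines, assuming each iteration's result is already available as a set of conserved position indices. The function names and the example sets are hypothetical.

```python
def majority_vote(conserved_per_iteration, min_votes):
    """Positions flagged as conserved in at least min_votes iterations."""
    votes = {}
    for positions in conserved_per_iteration:
        for pos in set(positions):
            votes[pos] = votes.get(pos, 0) + 1
    return {pos for pos, n in votes.items() if n >= min_votes}

def persistent(conserved_per_iteration):
    """Positions flagged as conserved in both the first and the last iteration."""
    return set(conserved_per_iteration[0]) & set(conserved_per_iteration[-1])

# Hypothetical conserved positions from four successive profiles.
iterations = [{3, 8, 12}, {3, 8, 20}, {3, 12, 20}, {3, 8, 20}]
print(sorted(majority_vote(iterations, min_votes=3)))
print(sorted(persistent(iterations)))
```

In this made-up example position 20 passes the majority vote but is not persistent, because it only appears once more distant sequences have entered the profile.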
Persistent Conservation
• Positions conserved in close family members may be
conserved due to evolutionary non-divergence, and not
solely due to a structural / functional role. Hence, a supply
of false positives.
• Positions conserved in distant family members may be
marked as such due to an observed drift from the original
sequence. False positives again, but for a different reason.
The intersection of the above two findings
minimizes both types of errors
PeCoP
Determine conservation according to the following
parameters:
1. Either one of the four IC schemes AND
2. Set a threshold AND
3. Choose a retention scheme
PeCoP Submission
PeCoP Results
Getting PSI-BLAST Profiles According to
Different Conservation Schemes
In ncbitools: ncbi/tools/posit.c
lines 1826 – 2689
#ifdef POSIT_DEBUG
// the code here is concerned with matrix output,
// and normally commented out
// play around with it…
#endif
Can NCBI provide this output by use of a command-line argument?
Why Not Parse PSI-BLAST Alignments?
Speed
• Slow, esp. when using a scripting language
• Not all alignments appear on output (default 250)
• Sequence weighting, profile construction, all already
provided for.
• NCBI keeps changing the format: programmer has to keep
changing the parser.
Why Parse PSI-BLAST Alignments?
Gain more information:
• Assign sequence weight and filtering parameters
according to specific needs
• Use annotation: inline or linked.
• Realign sequences, and construct own profile
• PSI-BLAST source code keeps changing
• As of v. 2.1.2: XML and (2.2.1) tabulated (no alignment)
output
Post-blast Information Flow
Bioinformatics
PSI-BLAST
Annotations
Sequence
Alignments
Profiles
Statistics
Locating homologues
Function Prediction
(if function unknown)
Fold prediction
Homology Modeling
Tree building
Post Blast Processing
Many modules, but:
• Most are application-specific.
• Some are web-resources only.
• Bad licenses, machine-specific, not written for
distribution purposes, etc.
Result: need to rewrite the same stuff over (and
over.. and over..).
Bio*.org Projects
• Collaborative projects aimed at providing programming
tools for bioinformatics under an open-source license
• Bio{Perl | Java | Python} : procedural
• Bio{CORBA | MOBY}: interface,
web access standardization
The Open Bioinformatics Foundation
Bio*.org and Post PSI-BLAST Processing
                      BioPerl                        BioJava                                    BioPython
PSI-BLAST parsers     Yes                            Yes                                        Yes
Filtering             Program (easy)                 Program (moderate)                         Program (easy)
Profiles              Yes, hard to create from PB    Yes, don’t know about creation from PB     Yes, hard to create from PB
Annotation handling   No                             No                                         No
Information content   No                             Yes                                        Yes
NCBI and Post Blast Processing
• Language: C/C++
• ASN.1 was around long before XML
• seqalign.asn
• Now (v. 2.1.1) there is also XML output format,
DTDs are there.
• Web APIs, for WWW-based PSI-BLAST runs
• Public domain, no license
What is Needed?
• Annotation handling. PB output has rudimentary
annotation only. The rest is served by links.
Transfer into MySQL?
• Translate parsed output into multiple sequence
alignment objects, and then into PSSMs
• Direct PB residue frequency output
• CORBA: do we need a format-aware object?
• Anything else you can think of………
Summary
• PSI-BLAST profiles have become the method-of-choice for “doing things” when a high
detection sensitivity is required
BUT…
• Profiles can and should be interpreted carefully
• Results should be interpreted carefully
• Do NOT write your own PSI-BLAST parser.
Please write something we need!
Further Reading
• http://www.ncbi.nlm.nih.gov
• http://open-bio.org
Books:
• Durbin R. et al. Biological Sequence Analysis. Cambridge University Press (Chapter 9)
Papers:
• http://www.ncbi.nlm.nih.gov/BLAST/blast_references.html
Blastology:
• W. Li, F. Pio, K. Pawlowski and A. Godzik: Saturated Blast: detecting distant homology using automated multiple intermediate sequence Blast search. Bioinformatics (2000) 16:1105-1110
• W. Li, L. Jaroszewski and A. Godzik: Clustering of highly homologous sequences from large sequence protein databases. Bioinformatics (2001) 17:282-283
• W. Li, L. Jaroszewski and A. Godzik: Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics (2002) 18:77-82
• W. Li and A. Godzik: Discovering new genes with advanced homology detection. Trends in Biotech (2002) 20:315-6
• I. Friedberg, T. Kaplan and H. Margalit: Evaluation of PSI-BLAST Alignment Accuracy in Comparison to Structural Alignments. Protein Science (2000) 9(11):2278-84
• I. Friedberg and H. Margalit: Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function. Protein Science (2002) 11(2):350-360
• I. Friedberg and H. Margalit: PeCoP: automatic determination of persistently conserved positions in protein families. Bioinformatics (2002) 18(9):1276-77
Conserved positions:
• Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol (1999) 291(1):177-96
• Reddy BV, Li WW, Shindyalov IN, Bourne PE: Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins (2001) 42(2):148-63
• Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol (2001) 307(5):1487-502
Thanks to..
• Hanah Margalit
• Adam Godzik
• Bio{java | perl | python}.org folks
• Jeff Bizzaro
http://bioinformatics.org/pecop
The End
Iddo Friedberg - Blastology
Check the Following when Running
PSI-BLAST for PBP:
• Number of sequences printed (if making own profile from
printed sequences).
• E-value inclusion threshold for next iteration (rec: 0.001).
• Low complexity masking?
• Substitution matrix used?
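One way to keep these settings explicit is to build the command line in code, so that every item on the checklist appears as a named parameter. The sketch below only assembles an argument list and does not run anything. It assumes the later BLAST+ `psiblast` program and its option names, which postdate this talk (the `blastpgp` program of the time used different one-letter options), so they should be checked against `psiblast -help` for the installed version. The file and database names are hypothetical.

```python
def psiblast_command(query, db, iterations=3, inclusion_ethresh=0.001,
                     matrix="BLOSUM62", mask_low_complexity=False,
                     num_alignments=250, pssm_out=None):
    """Assemble a psiblast argument list (BLAST+ option names assumed)."""
    cmd = [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(iterations),
        # E-value threshold for including a hit in the next iteration's profile.
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-matrix", matrix,
        # Low-complexity masking of the query with SEG.
        "-seg", "yes" if mask_low_complexity else "no",
        # Number of alignments printed; matters when building a profile from output.
        "-num_alignments", str(num_alignments),
    ]
    if pssm_out:
        cmd += ["-out_ascii_pssm", pssm_out]
    return cmd

# Hypothetical file and database names.
print(" ".join(psiblast_command("query.fasta", "nr", pssm_out="query.pssm")))
```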
PSI-BLAST 101 (contd.)
Exports:
• Multiple sequence alignments
• Annotation links
• Statistical data