Sequence Comparison –
Identification of remote
homologues
Amir Harel
Moran Yassour
Overview
Homologous proteins
Protein Sequence comparison
BLAST and its improvements
PSI-BLAST
Homologous Proteins
Proteins that share a common ancestor
are called homologous.
Common three dimensional folding
structure
Homologous Proteins
Homology refers to a similarity that
spans an entire folding domain.
The difficulty in defining homology
Why is homology important?
Prediction of a protein's properties
Classification of proteins into families
Evolutionary tree
How to identify homology?
Using sequence similarities
Aligning two proteins
Giving a score to the alignment
Global & Local Alignments
Global alignment –
alignment of the entire sequence
Local alignment –
alignment of a segment of the
sequence
How to score an alignment
Substitution Matrix – Sij = a value
proportional to the probability that
amino acid i mutated into amino acid j
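As a hedged illustration of how such a value can be derived: matrices like BLOSUM are commonly built as scaled log-odds scores, comparing how often a pair is observed in trusted alignments with how often it would occur by chance. The function name and the frequencies below are hypothetical, chosen only to make the arithmetic visible.

```python
import math

def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Scaled log-odds score for the amino-acid pair (i, j).

    q_ij : observed frequency of the aligned pair (hypothetical input)
    p_i, p_j : background frequencies of the two amino acids
    scale=2.0 gives half-bit units, the convention used by BLOSUM62.
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# A pair seen 4x more often than chance gets a positive score,
# a pair seen 4x less often than chance gets a negative one.
print(log_odds(0.01, 0.05, 0.05))      # over-represented pair
print(log_odds(0.000625, 0.05, 0.05))  # under-represented pair
```

Positive entries therefore mark substitutions that evolution tolerates, negative entries mark substitutions that are rarer than chance.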
Types of Substitution Matrices
PAM – comparison of closely related
sequences
BLOSUM – multiple alignments of
distantly related sequences
Substitution Matrices
Different matrices reflect different
evolutionary distances:
1 PAM represents the evolutionary distance
of 1 amino acid substitution per 100 amino
acids.
BLOSUM X: sequences more than X% identical
are clustered and counted as a single sequence
Gap costs
The most widely used Gap score is
-(a+bk) for a gap of length k.
Long gaps do not cost much more than
short ones since a single mutation may
cause a large gap.
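A minimal sketch of the affine gap score -(a+bk). The values a=11 and b=1 are illustrative assumptions, not parameters taken from these slides.

```python
def gap_score(k, a=11, b=1):
    """Score of a single gap of length k under the affine model -(a + b*k).

    a is the cost of opening a gap, b the cost per gapped position.
    Both defaults are assumed values for illustration only.
    """
    if k <= 0:
        return 0
    return -(a + b * k)

one_long_gap = gap_score(10)        # one gap of length 10
ten_short_gaps = 10 * gap_score(1)  # ten separate gaps of length 1
print(one_long_gap, ten_short_gaps)
```

With these values one gap of length 10 is penalised far less than ten separate single-residue gaps, which matches the idea that a single mutation event can create a long gap.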
Basic Sequence Comparison
Smith & Waterman (1981) – dynamic
programming of sequence comparison
[Figure: the m x n dynamic-programming matrix; filling it takes O(mn).]
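The O(mn) cost comes from filling every cell of the matrix. The sketch below is a simplified Smith-Waterman scorer: it assumes a plain match/mismatch score instead of a substitution matrix and a linear gap penalty instead of the affine one, so it illustrates the recurrence rather than reproducing the 1981 formulation exactly.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Simplifications (assumptions of this sketch): identity scoring
    and a linear gap penalty. Time and space are both O(len(a) * len(b)).
    """
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Only the shared segment contributes to the score; the unrelated flanks are ignored because a cell never drops below zero.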
Complexity issue
When DBs become larger, m grows
Time complexity
Space complexity
Intuition to Solution
Go over less than the whole matrix
Put the spotlight on segments that can be
a part of the best path and extend them.
The best path is close to a diagonal
[Figure: the m x n matrix with only a band near the diagonal explored; less than O(mn).]
Heuristic procedures
Heuristic: an algorithm that usually, but
not always, works, or that gives nearly
the right answer.
There is no guarantee of finding the best
match.
BLAST – Basic Local Alignment
Search Tool
BLAST first scans the DB for words that score at
least T when aligned with some word within the
query sequence; these are called hits. The scan is O(n).
Each hit is extended in both directions as long as
the score has not dropped too far below the best score so far.
BLAST
[Figure: matrix of word hits (x) and non-hits (-) between the query and a database sequence.]
A word about the parameter T
Small T:
greater sensitivity, more hits to expand
Large T:
lower sensitivity, fewer hits to expand
Gapped BLAST
The original BLAST was ungapped
Soon after came gapped BLAST
BLAST - Results
P value – The probability that an alignment
with score S or better occurs by chance.
E value – Expectation value. The number of
different alignments with scores S or better
that are expected to occur in this DB search
by chance.
A lower E value means a more significant score.
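The two values are linked. Under the usual assumption, not stated on the slide, that the number of chance alignments scoring at least S is Poisson distributed with mean E, the probability of seeing at least one is P = 1 - e^(-E). A minimal sketch:

```python
import math

def p_from_e(e_value):
    """P value implied by an E value, assuming a Poisson number of chance hits.

    P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)

for e in (1e-30, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small E the two numbers are nearly identical, which is why BLAST output usually reports only the E value.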
E-value and Homology
A non-significant score does not necessarily
imply non-homology.
E-value and Homology
Use it wisely
Choose your Substitution Matrix
Choose your DB
Example 1 – remote homology
Frequently, identification of a remote
homology will require several database
searches.
The glutathione transferase family
Remote homology
Testing the possibility that elongation factors share
homology with glutathione S-transferases:
There is a clear relationship between this elongation
factor and the class-theta glutathione transferases.
Example 2 - mapping
Three different families of G-protein
coupled receptors:
the R family (the largest)
the C/S family
the G receptor family
Finding links between families
Name        Description                                Score  E-value
OPSD_HUMAN  RHODOPSIN.                                 2347   0
OPSG_CHICK  GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO    1791   0
OPSG_HUMAN  GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO    1002   0
OPS1_DROME  OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE    527    3.10E-30
NK2R_MOUSE  SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ     435    1.10E-23
SSR5_HUMAN  SOMATOSTATIN RECEPTOR TYPE 5.              431    1.50E-23
TXKR_HUMAN  PUTATIVE TACHYKININ RECEPTOR.              419    3.50E-22
5H7_HUMAN   5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)    283    6.40E-14
CKR1_HUMAN  C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR     280    8.50E-14
ETBR_RAT    ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E  278    1.50E-13
AA2B_RAT    ADENOSINE A2B RECEPTOR.                    276    1.60E-13
MAS_MOUSE   MAS PROTO-ONCOGENE.                        133    0.007
PAFR_MACMU  PLATELET ACTIVATING FACTOR RECEPTOR (PA    130    0.007
OLF2_RAT    OLFACTORY RECEPTOR-LIKE PROTEIN F12.       135    0.009
MAS_RAT     MAS PROTO-ONCOGENE.                        131    0.01
CAR1_DICDI  CYCLIC AMP RECEPTOR 1                      130    0.01
OLF2_CHICK  OLFACTORY RECEPTOR-LIKE PROTEIN COR2.      129    0.02
CAR3_DICDI  CYCLIC AMP RECEPTOR 3.                     124    0.05
MAS_HUMAN   MAS PROTO-ONCOGENE.                        120    0.06
OLF1_CHICK  OLFACTORY RECEPTOR-LIKE PROTEIN COR1.      117    0.17
PER2_MOUSE  PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.     121    0.23
Finding links between families
Name        Description                                Family  Score  E-value
CAR1_DICDI  CYCLIC AMP RECEPTOR 1.                             2678   0
CAR3_DICDI  CYCLIC AMP RECEPTOR 3.                             1524   0
CAR2_DICDI  CYCLIC AMP RECEPTOR 2.                             1497   0
CALR_HUMAN  CALCITONIN RECEPTOR PRECURSOR (CT-R).      C/S     167    0.00042
IL8B_HUMAN  HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B     R       161    0.00073
CLRA_RAT    CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)   C/S     162    0.00087
CLRB_RAT    CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)   C/S     162    0.00095
DIHR_MANSE  DIURETIC HORMONE RECEPTOR (DH-R).          C/S     150    0.0045
CALR_PIG    CALCITONIN RECEPTOR PRECURSOR (CT-R).      C/S     145    0.012
GLR_RAT     GLUCAGON RECEPTOR PRECURSOR (GL-R).        C/S     145    0.012
IL8B_RABIT  HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B     R       141    0.016
RDC1_HUMAN  G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG    R       139    0.022
G10D_RAT    PROBABLE G PROTEIN-COUPLED RECEPTOR G10D   R       133    0.061
OPSD_HUMAN  RHODOPSIN.                                 R       130    0.085
VIPR_HUMAN  VASOACTIVE INTESTINAL POLYPEPTIDE RECEP    C/S     131    0.098
OPSD_SPHSP  OPSIN.                                     R       129    0.11
SCRC_RAT    SECRETIN RECEPTOR PRECURSOR.               C/S     129    0.13
IL8A_HUMAN  HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A     R       127    0.14
GLPR_RAT    GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO  C/S     143.1  0.16
AG2S_XENLA  TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2      R       126    0.16
Building a protein tree
Conclusions
Follow-up searches with high-scoring
sequences, whether related or unrelated,
are a very important tool.
Homology is a transitive relation…
BLAST – Pros & Cons
Pros:
It works
Cons:
The evaluation is statistical rather than
biological.
Convergent evolution
Weak but biologically relevant similarities
may be overlooked (PSI-BLAST improves on
this)
BLAST improvements
Running time improvements:
Two-hit method
Seed extension
PSI-BLAST
The two-hit method
The extension step accounts for more than
90% of BLAST’s execution time
Invoke an extension only when two non-overlapping hits are found
on the same diagonal, within a certain distance of one another
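A minimal sketch of the two-hit filter, assuming hits are given as (query position, subject position) pairs and that two hits count as a pair only when they lie on the same diagonal. The function name and the default window A=40 are assumptions of this sketch.

```python
from collections import defaultdict

def two_hit_seeds(hits, w=3, A=40):
    """Return the hits that should trigger an extension.

    hits : iterable of (query_pos, subject_pos) word hits of length w
    A    : maximum distance between two hits on the same diagonal
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)   # same diagonal <=> same s - q

    seeds = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        last = positions[0]
        for q in positions[1:]:
            distance = q - last
            if distance < w:
                continue               # overlaps the previous hit: ignore it
            if distance <= A:
                seeds.append((q, q + diag))  # second hit of a pair
            last = q
    return seeds

hits = [(0, 10), (1, 11), (5, 15), (50, 3)]
print(two_hit_seeds(hits))
```

The lone hit on its own diagonal never triggers an extension, which is where the time saving comes from.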
The two-hit method
[Figure: hit matrix illustrating the two-hit method: a first hit, a second hit, and the resulting two-hit extension.]
Seed Extension
PSI-BLAST
Evolutionary pressure
Needle in a haystack
PSI-BLAST comes to solve this problem
Evolution reveals itself
Giving more significance to the conserved
areas and ignoring the background noise
PSI-BLAST = Position Specific Iterated
BLAST, shifts our view to these areas using
the Position-Specific Score Matrix - PSSM
Position-Specific Score Matrix - PSSM
Pij is proportional to the probability of finding
amino acid i at position j in these
sequences
PSSM
Represents the distribution of the amino
acids in each position in a collection of
sequences
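A minimal sketch of turning a collection of aligned sequences into a PSSM. It assumes a uniform background frequency and a flat pseudocount; PSI-BLAST itself weights sequences and uses data-dependent pseudocounts, so this shows the idea rather than the actual procedure.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column.

    aligned : equal-length sequences; '-' marks a gap and is skipped.
    Assumptions: uniform background, flat pseudocount.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return pssm

def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_segment(pssm, "ACD"), score_segment(pssm, "WWW"))
```

Here the matrix is indexed as `pssm[j][i]`, position first and amino acid second. Conserved columns give large positive scores to the conserved residue and negative scores to everything else.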
Steps in the PSI-BLAST
Initiation:
Running gapped BLAST on the query, outputting a
collection of matching sequences
Iteration:
Constructing the PSSM based on the best
sequences in this collection
The PSSM is compared to the protein DB, again,
seeking alignments
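The initiation and iteration steps can be sketched as a loop. This toy version is heavily simplified and every name in it is hypothetical: it matches ungapped, fixed-length windows, includes sequences by a raw score threshold rather than an E-value, and uses a flat-pseudocount profile.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile(segments, pseudo=1.0):
    """Log-odds profile (uniform background) from equal-length segments."""
    columns = []
    for column in zip(*segments):
        counts = Counter(column)
        total = len(column) + pseudo * len(ALPHABET)
        columns.append({aa: math.log2((counts[aa] + pseudo) / total * len(ALPHABET))
                        for aa in ALPHABET})
    return columns

def best_window(pssm, seq):
    """Best-scoring ungapped placement of the profile on seq."""
    width = len(pssm)
    best = (-math.inf, "")
    for k in range(len(seq) - width + 1):
        segment = seq[k:k + width]
        s = sum(col[aa] for col, aa in zip(pssm, segment))
        best = max(best, (s, segment))
    return best

def iterate_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the included matches until nothing changes."""
    included = {query}
    for _ in range(max_rounds):
        pssm = profile(sorted(included))
        found = {query}
        for seq in database:
            s, segment = best_window(pssm, seq)
            if s >= threshold:
                found.add(segment)
        if found == included:      # converged: no new sequences
            break
        included = found
    return included

db = ["GGACDEFGG", "GGACDKFGG", "GGAWDKFGG", "WWWWWWWWW"]
print(sorted(iterate_search("ACDEF", db, threshold=3.0, max_rounds=1)))
print(sorted(iterate_search("ACDEF", db, threshold=3.0)))
```

In this toy database the segment AWDKF is too far from the query to pass the threshold in a single pass, but it is picked up once the profile has absorbed the intermediate sequence ACDKF.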
PSI-BLAST Example
We start with an uncharacterized
protein – MJ0414
When submitting the query we set the
E-value threshold to 0.01 (higher than usual)
Result of initial gapped BLAST
First iteration –
Iterating the search using the derived
profile uncovers DNA ligase II with
an E-value of 0.005
Second iteration –
Interpretation of the results
Including a high-scoring but unrelated protein
will shift the PSSM in its direction
E-values retrieved in later iterations should
not be taken as automatic proof of homology
Was the ligase the right choice?
PSI-BLAST Conclusions
Uncovers protein relationships missed by
single-pass database-search methods
Errors are easily amplified by iterations.
PSI-BLAST increases rather than removes the
need for expertise, because there is more to
interpret
Running time evaluation
Program           Normalized running time
Smith-Waterman    36
Original BLAST    1.0
Gapped BLAST      0.34
PSI-BLAST         0.87
Running time can be highly influenced by modifying
parameters
Future Improvements
Accepting PSSM as input from other
programs
Realignment – improve the alignment
before going over the DB
Automatic domain recognition
Summary
With BLAST, use multiple searches to learn as
much as possible
The improved BLAST programs are considerably
faster and significantly enhance the abilities
of DB search
For many queries PSI-BLAST can greatly
increase sensitivity to weak, but biologically
relevant sequence relationships
Questions time
Thank You
References
Pearson WR. (1997) Identifying distantly related protein
sequences. Comput Appl Biosci., 13, 325-332
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Res., 25, 3389-3402
Altschul SF, Koonin EV. (1998) Iterated profile searches with
PSI-BLAST – a tool for discovery in protein databases. Trends
Biochem Sci., 23, 444-447
Sites
http://www.ncbi.nlm.nih.gov/BLAST
http://www.cs.huji.ac.il/~cbio
http://www.people.virginia.edu/~wrp/
http://www-lmmb.ncifcrf.gov/
Appendix - Statistics
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
$$E = \frac{N}{2^{S'}}$$
$$N = nm$$
$$S' = \log_2 \frac{N}{E}$$
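A small numeric sketch of these relations: a raw score S is normalized to a bit score S' = (lambda*S - ln K) / ln 2, the expected number of chance hits is E = N / 2^S' with search space N = nm, and inverting gives S' = log2(N/E). The lambda and K values used in the example are illustrative assumptions; they depend on the scoring system in use.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, n, m):
    """E = N / 2**S' with search space N = n * m."""
    return n * m / 2 ** bits

def bits_for_e(e, n, m):
    """Bit score needed to reach a given E value: S' = log2(N / E)."""
    return math.log2(n * m / e)

# Illustrative parameters (assumed, not from the slides).
lam, K = 0.267, 0.041
s_prime = bit_score(100, lam, K)
print(s_prime, e_value(s_prime, 250, 50_000_000))

# Bits needed for E = 0.05 with a 250-residue query and a 50-million-residue DB.
print(bits_for_e(0.05, 250, 50_000_000))
```

Because the search space enters only through N, doubling the database size costs exactly one additional bit for the same E value.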