Sequence Comparison –
Identification of remote
homologues
Amir Harel
Moran Yassour
Overview
 Homologous proteins
 Protein Sequence comparison
 BLAST and its improvements
 PSI-BLAST
Homologous Proteins
 Proteins that share a common ancestor
are called homologous.
 Common three dimensional folding
structure
Homologous Proteins
 Homology refers to a similarity that
spans an entire folding domain.
 The difficulty in defining homology
Why is homology important?
 Prediction of protein’s properties
 Classification of proteins to families
 Evolution tree
How to identify homology?
 Using sequence similarities
 Aligning two proteins
 Giving a score to the alignment
Global & Local Alignments
 Global alignment –
alignment of the entire sequence
 Local alignment –
alignment of a segment of the
sequence
How to score an alignment
 Substitution Matrix – Sij = a value
proportional to the probability that
amino acid i mutated into amino acid j
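A minimal sketch of scoring an ungapped alignment with a substitution matrix. The scores in `TOY_MATRIX` are hypothetical values chosen for illustration; they are not taken from PAM or BLOSUM, and the function names are assumptions.

```python
# Toy substitution scores S_ij (hypothetical values, not from PAM or BLOSUM).
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "K"): -1, ("L", "K"): -2, ("I", "K"): -3,
}


def sub_score(a, b, matrix=TOY_MATRIX):
    """Look up S_ij symmetrically, so that S(a, b) == S(b, a)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum the substitution scores of two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(sub_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("ALK", "AIK"))  # A/A + L/I + K/K
```

Conservative substitutions such as L/I still contribute a positive score, which is what lets the alignment score reflect similarity rather than identity alone.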
Types of Substitution Matrices
 PAM – comparison of closely related
sequences
 BLOSUM – multiple alignments of
distantly related sequences
Substitution Matrices
 Different matrices reflect different
evolutionary distances:
  1 PAM represents the evolutionary distance
of 1 amino acid substitution per 100 amino
acids.
  BLOSUM X: all sequences with a similarity
higher than X% were clustered and counted as one
sequence.
Gap costs
 The most widely used Gap score is
-(a+bk) for a gap of length k.
 Long gaps do not cost much more than
short ones since a single mutation may
cause a large gap.
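The gap score above can be sketched as a small function. The default values a=10 and b=1 are assumptions for illustration only; the slide does not give specific values.

```python
def gap_score(k, a=10, b=1):
    """Score -(a + b*k) for a gap of length k.

    a is the cost of opening a gap and b the cost per gapped position.
    The defaults are illustrative assumptions, not values from the slides.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(a + b * k)


for k in (1, 2, 10):
    print(k, gap_score(k))
```

With these assumed values a gap of length 10 scores -20 while a gap of length 1 scores -11, so a gap ten times longer costs less than twice as much.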
Basic Sequence Comparison
 Smith & Waterman (1981) – dynamic
programming of sequence comparison
 O(mn) to fill an n × m matrix
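A minimal sketch of the Smith & Waterman dynamic programming recurrence, returning only the best local score. As simplifying assumptions, it uses a fixed match/mismatch score instead of a substitution matrix and a linear gap penalty instead of -(a+bk); the parameter values are illustrative.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, filling an (n+1) x (m+1) matrix in O(mn)."""
    n, m = len(query), len(target)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + pair,  # align query[i-1] with target[j-1]
                H[i - 1][j] + gap,       # gap in the target
                H[i][j - 1] + gap,       # gap in the query
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("GGTAYIAKGG", "MKTAYIAKQR"))
```

Every cell is visited once, which is where the O(mn) time and space cost comes from.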
Complexity issue
 When DBs become larger, m grows
 Time complexity
 Space complexity
Intuition to Solution
 Go over less than the whole matrix
 Put the spotlight on segments that can be
a part of the best path and extend them.
 The best path is close to a diagonal
 Less than O(mn) for the n × m matrix
Heuristic procedures
 Heuristic: an algorithm that usually, but
not always, works, or that gives nearly
the right answer.
 There is no guarantee of finding the best
match.
BLAST – Basic Local Alignment Search Tool
 BLAST first scans the DB for words that score at
least T when aligned with some word within the
query sequence; these are called hits. O(n)
 Each hit is extended in both directions as long as
the score hasn’t dropped too much.
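The two steps above can be sketched as follows. This is a simplified illustration, not BLAST itself: hits are exact word matches (real BLAST also accepts non-identical words scoring at least T), scoring is a fixed match/mismatch, and the names `find_hits`, `extend_hit` and the `x_drop` parameter are assumptions.

```python
from collections import defaultdict


def find_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word of length w matches."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):  # one pass over the subject
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def _best_gain(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Extend in one direction; stop once the score drops x_drop below its best."""
    gain = best = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        if gain > best:
            best = gain
        elif best - gain > x_drop:
            break
        qi += step
        sj += step
    return best


def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped score of the hit at (i, j) after extending in both directions."""
    seed = sum(match if query[i + k] == subject[j + k] else mismatch
               for k in range(w))
    right = _best_gain(query, subject, i + w, j + w, 1, match, mismatch, x_drop)
    left = _best_gain(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    return seed + left + right


query, subject = "MKTAYIAKQR", "GGTAYIAKGG"
hits = find_hits(query, subject)
print(hits, extend_hit(query, subject, *hits[0]))
```

Only cells near a hit are ever examined, which is how the heuristic avoids filling the whole matrix.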
BLAST
[Figure: matrix of word hits (x) between the query and a database sequence]
A word about the parameter T
 Small T:
greater sensitivity, more hits to expand
 Large T:
lower sensitivity, fewer hits to expand
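A small sketch of how T controls the number of words that count as a hit for one query word. The four-letter alphabet and the match/mismatch scores are assumptions made to keep the enumeration tiny; a real search would use the 20 amino acids and a substitution matrix.

```python
from itertools import product


def neighborhood(word, alphabet, t, match=2, mismatch=-1):
    """All words of the same length that score at least t against `word`."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch for a, b in zip(word, cand))
        if score >= t:
            result.append("".join(cand))
    return result


for t in (6, 3, 0):
    print(t, len(neighborhood("ACD", "ACDE", t)))
```

Lowering T admits words with more mismatches, so more database positions become hits that must be extended.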
Gapped BLAST
 The original BLAST was ungapped
 Soon after came gapped BLAST
BLAST - Results
 P value – The probability of an alignment
occurring with score S or better.
 E value – Expectation value. The number of
different alignments with scores S or better
that are expected to occur in this DB search
by chance.
 Lower E value –> more significant score.
E-value and Homology
 Non significant score does not necessarily
imply non-homology:
Use it wisely
 Choose your Substitution Matrix
 Choose your DB
Example 1 – remote homology
 Frequently, identification of a remote
homology will require several database
searches.
 The glutathione transferase family
Remote homology
 Testing the possibility that elongation factors share
homology with glutathione S-transferases:
 There is a clear relationship between this elongation
factor and the class-theta glutathione transferases.
Example 2 - mapping
 Three different families of G-protein
coupled receptors:
 the R family (the largest)
 the C/S family
 the G receptor family
Finding links between families
Name                                                  Score  E-value
OPSD_HUMAN RHODOPSIN.                                 2347   0
OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO    1791   0
OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO    1002   0
OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE    527    3.10E-30
NK2R_MOUSE SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ     435    1.10E-23
SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5.              431    1.50E-23
TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR.              419    3.50E-22
5H7_HUMAN 5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)     283    6.40E-14
CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR     280    8.50E-14
ETBR_RAT ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E    278    1.50E-13
AA2B_RAT ADENOSINE A2B RECEPTOR.                      276    1.60E-13
MAS_MOUSE MAS PROTO-ONCOGENE.                         133    0.007
PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA    130    0.007
OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12.         135    0.009
MAS_RAT MAS PROTO-ONCOGENE.                           131    0.01
CAR1_DICDI CYCLIC AMP RECEPTOR 1                      130    0.01
OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2.      129    0.02
CAR3_DICDI CYCLIC AMP RECEPTOR 3.                     124    0.05
MAS_HUMAN MAS PROTO-ONCOGENE.                         120    0.06
OLF1_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR1.      117    0.17
PER2_MOUSE PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.     121    0.23
Finding links between families
Name                                                  Family  Score  E-value
CAR1_DICDI CYCLIC AMP RECEPTOR 1.                             2678   0
CAR3_DICDI CYCLIC AMP RECEPTOR 3.                             1524   0
CAR2_DICDI CYCLIC AMP RECEPTOR 2.                             1497   0
CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R).      C/S     167    0.00042
IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B     R       161    0.00073
CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)     C/S     162    0.00087
CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)     C/S     162    0.00095
DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R).          C/S     150    0.0045
CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R).        C/S     145    0.012
GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R).           C/S     145    0.012
IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B     R       141    0.016
RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG    R       139    0.022
G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D     R       133    0.061
OPSD_HUMAN RHODOPSIN.                                 R       130    0.085
VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP    C/S     131    0.098
OPSD_SPHSP OPSIN.                                     R       129    0.11
SCRC_RAT SECRETIN RECEPTOR PRECURSOR.                 C/S     129    0.13
IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A     R       127    0.14
GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO    C/S     143.1  0.16
AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2      R       126    0.16
Building Proteins tree
Conclusions
 Searching with high-scoring sequences, related or
unrelated, is a very
important tool.
 Homology is a transitive relation…
BLAST – Pros & Cons
 Pros:
  It works
 Cons:
  Statistical evaluation rather than a biological
one.
  Convergent evolution
  Weak but biologically relevant similarities
may be overlooked (PSI-BLAST improves on this
issue)
BLAST improvements
 Running time improvements :
 Two-hit method
 Seed extension
 PSI-BLAST
The two-hit method
 The extension step accounts for more than
90% of BLAST’s execution time
 Invoke an extension only when two non-overlapping
hits are found within a certain
distance of one another
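A minimal sketch of the two-hit trigger. It assumes, following the published description of gapped BLAST, that the two hits must lie on the same diagonal (equal query position minus subject position). The function name and the default window of 40 are illustrative assumptions.

```python
def two_hit_triggers(hits, w=3, window=40):
    """Return the hits that should trigger an extension.

    hits: (query_pos, subject_pos) word hits of length w.
    A hit triggers an extension when an earlier, non-overlapping hit lies
    on the same diagonal at most `window` positions before it.
    """
    last_on_diag = {}  # diagonal -> subject position of the last kept hit
    triggers = []
    for i, j in sorted(hits, key=lambda h: h[1]):
        diag = i - j
        prev = last_on_diag.get(diag)
        if prev is not None:
            distance = j - prev
            if distance < w:
                continue  # overlaps the previous hit: ignore this one
            if distance <= window:
                triggers.append((i, j))
        last_on_diag[diag] = j
    return triggers


print(two_hit_triggers([(2, 2), (3, 3), (4, 4), (5, 5)]))
```

Isolated hits never reach the costly extension step, which is the source of the running time saving.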
The two-hit method
[Figure: matrix of word hits (x) with labels: first hit, second hit, two-hit extension]
Seed Extension
PSI-BLAST
 Evolutionary pressure
 Needle in a haystack
 PSI-BLAST is designed to solve this problem
Evolution reveals itself
 Giving more significance to the conserved
areas and ignoring the background noise
 PSI-BLAST = Position-Specific Iterated
BLAST, shifts our view to these areas using
the Position-Specific Score Matrix (PSSM)
Position-Specific Matrix - PSSM
 Pij = proportional to the probability of finding
the ith amino acid in the jth position in these
sequences
PSSM
 Represents the distribution of the amino
acids in each position in a collection of
sequences
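A minimal sketch of building a PSSM from a collection of aligned sequences. As assumptions for illustration, the sequences are ungapped and of equal length, the background distribution is uniform, a pseudocount of 1 is added to every amino acid, and scores are stored as log-odds in bits. Real PSI-BLAST also weights sequences and uses more elaborate pseudocounts.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment position."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        column = {}
        for aa in ALPHABET:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the PSSM, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(round(score_with_pssm(pssm, "ACD"), 2), round(score_with_pssm(pssm, "WWW"), 2))
```

Unlike a substitution matrix, the score of an amino acid here depends on the position, so conserved columns contribute more than variable ones.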
Steps in the PSI-BLAST
 Initiation:
 Running gapped BLAST on the query, outputting a
collection of matching sequences
 Iteration:
 Constructing the PSSM based on the best
sequences in this collection
 The PSSM is compared to the protein DB, again,
seeking alignments
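The initiation and iteration steps above can be sketched as a toy loop. This is not PSI-BLAST: matching is ungapped, inclusion uses a raw score threshold instead of an E-value, the background is uniform, and all names (`iterate_search`, `best_window`, the toy database) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Log-odds profile (uniform background) from equal-length segments."""
    total = len(segments) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(segments[0])):
        counts = Counter(seg[pos] for seg in segments)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total * len(ALPHABET))
                     for aa in ALPHABET})
    return pssm


def best_window(pssm, seq):
    """Best ungapped placement of the profile on seq: (score, segment)."""
    width = len(pssm)
    best = (float("-inf"), "")
    for start in range(len(seq) - width + 1):
        seg = seq[start:start + width]
        score = sum(col[aa] for col, aa in zip(pssm, seg))
        if score > best[0]:
            best = (score, seg)
    return best


def iterate_search(query, database, threshold, max_rounds=5):
    """Search, rebuild the profile from the matches, and repeat until stable."""
    included = {}                    # name -> matched segment
    pssm = build_pssm([query])       # round 1: profile from the query alone
    for _ in range(max_rounds):
        found = {}
        for name, seq in database.items():
            score, seg = best_window(pssm, seq)
            if score >= threshold:
                found[name] = seg
        if found == included:
            break                    # converged: no change in the match set
        included = found
        pssm = build_pssm([query] + list(included.values()))
    return sorted(included)


db = {"close": "GGACDEYGG", "remote": "GGACDQYGG", "unrelated": "GGGGGGGGG"}
print(iterate_search("ACDEF", db, 3.0, max_rounds=1))
print(iterate_search("ACDEF", db, 3.0))
```

In this toy database the sequence named "remote" falls below the threshold in the first pass and is only picked up once the profile has been rebuilt with "close". The same mechanism means that one wrongly included sequence would also pull the profile toward itself.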
PSI-BLAST Example
 We start with an uncharacterized
protein – MJ0414
 When submitting the query we set the
E-value threshold to 0.01 (higher than usual)
Result of initial gapped BLAST
First iteration –
 Iterating the search using the derived
profile uncovers DNA ligase II with
E-value of 0.005
Second iteration –
Interpretation of the results
 Including a high-scoring but unrelated protein will
shift the PSSM in its direction
 E-values retrieved in later iterations should
not be taken as automatic proof of homology
Was the ligase a right choice?
PSI-BLAST Conclusions
 Uncovers protein relationships missed by
single-pass database-search methods
 Errors are easily amplified by iterations.
 PSI-BLAST increases rather than removes the
need for expertise, because there is more to
interpret
Running time evaluation
                         Smith-Waterman  Original BLAST  Gapped BLAST  PSI-BLAST
Normalized running time  36              1.0             0.34          0.87
 Running time can be highly influenced by modifying
parameters
Future Improvements
 Accepting PSSM as input from other
programs
 Realignment – improve the alignment
before going over the DB
 Automatic domain recognition
Summary
 In BLAST, use multiple searches for maximum
knowledge
 The improved BLAST programs are considerably faster
and significantly enhance the abilities of DB
search
 For many queries, PSI-BLAST can greatly
increase sensitivity to weak but biologically
relevant sequence relationships
Questions time
Thank You
References
 Pearson WR. (1997) Identifying distantly related protein
sequences. Comput Appl Biosci., 13, 325-332
 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Res., 25, 3389-3402
 Altschul SF, Koonin EV. (1998) Iterated profile searches with
PSI-BLAST – a tool for discovery in protein databases. Trends
Biochem Sci., 23, 444-447
Sites
 http://www.ncbi.nlm.nih.gov/BLAST
 http://www.cs.huji.ac.il/~cbio
 http://www.people.virginia.edu/~wrp/
 http://www-lmmb.ncifcrf.gov/
Appendix - Statistics
\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]
\[ E = \frac{N}{2^{S'}} \]
\[ N = nm \]
\[ S' = \log_2 \frac{N}{E} \]
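A short sketch evaluating the appendix formulas, assuming the standard forms S' = (λS − ln K) / ln 2 and E = N / 2^S' with N = nm. The values of λ, K, n and m in the usage lines are illustrative assumptions, not figures from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, n, m):
    """Expected number of chance alignments: E = N / 2**S' with N = n * m."""
    return n * m / 2 ** bits


def bits_for_e(e, n, m):
    """Bit score needed to reach a given E-value: S' = log2(N / E)."""
    return math.log2(n * m / e)


# Illustrative values only: query length n, database length m, and
# scoring-system parameters lam and k are assumptions.
n, m = 250, 50_000_000
bits = bit_score(100, lam=0.267, k=0.041)
print(round(bits, 1), e_value(bits, n, m))
print(round(bits_for_e(0.01, n, m), 1))
```

Because N = nm grows with the database, the same raw score yields a larger E-value in a larger database, which is one reason the choice of DB matters.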