Introduction to Bioinformatics
Lecture 12: Iterative homology searching
and Protein Structure-Function
Relationships
Centre for Integrative Bioinformatics VU (IBIVU)
PSI (Position Specific Iterated)
BLAST
• basic idea
– use results from BLAST query to construct a
profile matrix
– search database with profile instead of query
sequence
• iterate
A Profile Matrix (Position Specific
Scoring Matrix – PSSM)
This is the same as a profile without position-specific gap penalties
PSI BLAST
• Searching with a Profile
• aligning profile matrix to a simple sequence
– like aligning two sequences
– except score for aligning a character with a matrix
position is given by the matrix itself
– not a substitution matrix
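A minimal sketch of scoring a sequence against a profile, assuming an ungapped alignment for brevity (PSI-BLAST itself performs gapped alignment). The toy profile, the default score for residues missing from a column, and the function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# A profile is a list of columns; each column maps residue -> score.
Profile = List[Dict[str, float]]

def score_window(profile: Profile, seq: str, start: int, default: float = -4.0) -> float:
    """Sum the column scores for seq[start:start + len(profile)].

    The score of a residue comes from the profile column it is aligned to,
    not from a substitution matrix. `default` is an assumed score for
    residues that a toy column does not list.
    """
    return sum(col.get(seq[start + j], default) for j, col in enumerate(profile))

def best_ungapped_hit(profile: Profile, seq: str) -> Tuple[int, float]:
    """Slide the profile along seq; return (start, score) of the best window."""
    width = len(profile)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the profile")
    scored = [(score_window(profile, seq, s), s) for s in range(len(seq) - width + 1)]
    # Highest score wins; ties go to the leftmost window.
    best_score, best_start = max(scored, key=lambda t: (t[0], -t[1]))
    return best_start, best_score

# Hypothetical three-column profile.
toy_profile: Profile = [
    {"C": 5.0, "S": 1.0},
    {"K": 3.0, "R": 2.0},
    {"V": 4.0, "I": 3.0},
]

start, score = best_ungapped_hit(toy_profile, "AACRIAA")
```

The same residue can score differently at different positions, which is what distinguishes a profile from a single exchange matrix.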
PSI BLAST:
Constructing the Profile Matrix
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
PSI BLAST:
Determining Profile Elements
• the value for a given element of the profile matrix
is given by:
    log( Pr(ai | column j) / Pr(ai) )
• where the probability of seeing amino acid ai in
column j is estimated as:
    Pr(ai | column j) = (α·fij + β·gij) / (α + β)
with fij the observed frequency and gij the pseudocount,
e.g. α = number of sequences in profile, β = 1
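A small sketch of how one profile column could be computed, assuming the usual pseudocount-weighted estimate Pr(ai | column j) = (α·fij + β·gij) / (α + β) followed by a log-odds score against the background frequency. The four-letter alphabet, the uniform background and the use of the background as pseudocount are simplifying assumptions; PSI-BLAST derives its pseudocounts and sequence weights in a more elaborate way.

```python
import math
from collections import Counter

def profile_column(column, background, pseudocount, alpha=None, beta=1.0):
    """Log-odds scores for one alignment column.

    column      -- residues observed in this column, one per sequence
    background  -- dict residue -> background probability Pr(a_i)
    pseudocount -- dict residue -> pseudocount g_ij (a probability here)
    alpha, beta -- weights of the observed frequency and the pseudocount;
                   alpha defaults to the number of sequences in the column
    """
    n = len(column)
    if n == 0:
        raise ValueError("empty column")
    if alpha is None:
        alpha = float(n)
    counts = Counter(column)
    scores = {}
    for aa, p_bg in background.items():
        f = counts[aa] / n                      # observed frequency f_ij
        p = (alpha * f + beta * pseudocount[aa]) / (alpha + beta)
        scores[aa] = math.log(p / p_bg)         # log-odds against background
    return scores

# Toy alphabet and uniform background (assumptions for illustration only).
ALPHABET = "ACDE"
uniform = {aa: 0.25 for aa in ALPHABET}
scores = profile_column("AAAC", background=uniform, pseudocount=uniform)
```

Residues that never occur in the column still receive a finite (negative) score because of the pseudocount, so a later search is not forced to reject them outright.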
PSI-BLAST iteration
[Figure: the query sequence Q is used in a gapped BLAST search; the database hits are turned into a PSSM (one row per amino acid A, C, D, ..., Y; one column per position Pi ... Px); the PSSM is then used in the next gapped BLAST search, and the procedure iterates.]
PSI-BLAST
• Query sequences are first scanned for the presence of
so-called low-complexity regions (Wootton and
Federhen, 1996), i.e. regions with a biased composition
likely to lead to spurious hits; these regions are excluded
from alignment.
• The program then initially operates on a single query
sequence by performing a gapped BLAST search.
• Then, the program takes the significant local alignments
(hits) found, constructs a multiple alignment (master-slave
alignment) and abstracts a position-specific
scoring matrix (PSSM) from this alignment.
• The database is rescanned in a subsequent round, using the
PSSM, to find more homologous sequences. Iteration
continues until the user decides to stop or the search has
converged (no new hits are found).
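The control flow of the steps above can be sketched as a loop that stops on convergence or after a maximum number of rounds. This is only a skeleton: `search` and `build_model` are hypothetical callables standing in for the gapped BLAST search and the PSSM construction, and the toy "database" below is a lookup table invented for illustration.

```python
def iterate_search(query, search, build_model, max_rounds=5):
    """Repeat search -> rebuild model until the hit set stops changing.

    Returns (hits, rounds_run, converged).
    """
    hits = frozenset()
    model = build_model(query, hits)          # round 1: the query alone
    for round_no in range(1, max_rounds + 1):
        new_hits = frozenset(search(model))
        if new_hits == hits:                  # nothing changed: converged
            return hits, round_no, True
        hits = new_hits
        model = build_model(query, hits)      # rebuild the model from the hits
    return hits, max_rounds, False

# Toy stand-ins: each entry lists what a model containing that member can find.
NEIGHBOURS = {"Q": {"a"}, "a": {"b"}, "b": set()}

def toy_build_model(query, hits):
    """A 'model' here is just the set of sequences it was built from."""
    return frozenset({query}) | hits

def toy_search(model):
    """Return everything reachable from any member of the model."""
    found = set()
    for member in model:
        found |= NEIGHBOURS.get(member, set())
    return found

hits, rounds, converged = iterate_search("Q", toy_search, toy_build_model)
```

In the toy run, sequence "b" is only found in the second round, after "a" has been added to the model; this mirrors how a profile can reach homologues that the single query sequence misses.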
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the
alignment. Each score links to the corresponding pairwise alignment between query
sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will
occur in the database by chance. The smaller the E Value, the more significant the
alignment. For example, the first alignment has a very low E value of e-117 meaning that a
sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in
other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's
Molecular Modeling DataBase.
‘X’ residues denote low-complexity sequence fragments that are ignored
PSI-BLAST output example
Alignment Bit Score
B = (λS – ln K) / ln 2
• S is the raw alignment score
• The bit score (‘bits’) B has a standard set of units
• The bit score B is calculated from the number of gaps and
substitutions associated with each aligned sequence. The higher the
score, the more significant the alignment
• λ and K are the statistical parameters of the scoring system
(BLOSUM62 in BLAST).
• See Altschul and Gish, 1996, for a collection of values for λ and K
over a set of widely used scoring matrices.
• Because bit scores are normalized with respect to the scoring
system, they can be used to compare alignment scores from
different searches based on different scoring schemes (a.a.
exchange matrices)
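A short sketch of the bit score conversion B = (λS – ln K) / ln 2. The values of λ and K below are placeholders chosen for illustration; the actual values depend on the exchange matrix and gap penalties and should be taken from published tables.

```python
import math

def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S into bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2.0)

# Placeholder statistical parameters (assumed, not authoritative).
LAM, K = 0.267, 0.041

bits = bit_score(100, LAM, K)
```

Because λ and K are folded into B, two searches run with different scoring schemes can be compared on their bit scores, whereas their raw scores cannot.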
Normalised sequence similarity
The p-value is defined as the probability of seeing at
least one unrelated score S greater than or equal to a
given score x in a database search over n sequences.
This probability follows the Poisson distribution
(Waterman and Vingron, 1994):
P(x, n) = 1 – e^(–n·P(S ≥ x)),
where n is the number of sequences in the database
Depending on x and n (fixed)
Normalised sequence similarity
Statistical significance
The E-value is defined as the expected number of
non-homologous sequences with score greater than or equal
to a score x in a database of n sequences:
E(x, n) = n·P(S ≥ x)
For example, if E-value = 0.01, then the expected
number of random hits with score S ≥ x is 0.01, which
means that such a hit is expected by chance only once
in 100 independent searches over the database.
If the E-value of a hit is 5, then five fortuitous hits with
S ≥ x are expected within a single database search, which
renders the hit not significant.
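The two definitions, E(x, n) = n·P(S ≥ x) and P(x, n) = 1 – e^(–n·P(S ≥ x)), can be put side by side in a few lines. The per-sequence probabilities and the database size used in the example are made-up numbers.

```python
import math

def e_value(p_single, n):
    """Expected number of chance hits: n * P(S >= x)."""
    return n * p_single

def p_value(p_single, n):
    """Probability of at least one chance hit: 1 - exp(-n * P(S >= x)).

    expm1 keeps precision when n * p_single is very small.
    """
    return -math.expm1(-n * p_single)

N_SEQUENCES = 100_000                         # hypothetical database size

e_small = e_value(1e-7, N_SEQUENCES)          # E-value of 0.01
p_small = p_value(1e-7, N_SEQUENCES)
e_large = e_value(5e-5, N_SEQUENCES)          # E-value of 5
p_large = p_value(5e-5, N_SEQUENCES)
```

For small values the P-value and the E-value are nearly identical; for an E-value of 5 the P-value is close to 1, in line with the statement that such a hit is not significant.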
A model for database searching
score probabilities
• Scores resulting from searching with a
query sequence against a database follow
the Extreme Value Distribution (EVD)
(Gumbel, 1955).
• Using the EVD, the raw alignment scores
are converted to a statistical score (E value)
that accounts for the database amino acid
composition and the scoring scheme (a.a.
exchange matrix)
Extreme Value Distribution
y = 1 – exp(–e^(–λ(x – μ)))
Probability density function for the extreme value
distribution resulting from parameter values μ = 0 and λ = 1,
where μ is the characteristic value and λ is the decay constant.
The expression y = 1 – exp(–e^(–x)) gives the corresponding
upper-tail probability P(S ≥ x).
Extreme Value Distribution (EVD)
[Figure: EVD approximation compared with real data]
You know that an optimal alignment of two sequences is selected out of many
suboptimal alignments, and that a database search is also about selecting the best
alignment(s). This fits well with the EVD, which has a right tail that falls off
more slowly than the left tail. Compared to using the normal distribution, when
using the EVD an alignment has to score further away from the expected mean
value to become a significant hit.
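The difference with the normal distribution can be checked numerically. The sketch below compares the upper-tail probability of an EVD (μ = 0, λ = 1) with that of a normal distribution of the same mean and standard deviation; it relies on the standard results that this EVD has mean μ + γ/λ (γ is the Euler-Mascheroni constant) and standard deviation π/(λ√6). The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329

def evd_tail(x, mu=0.0, lam=1.0):
    """P(S >= x) for the EVD: 1 - exp(-e^(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))

def normal_tail(x, mean, sd):
    """P(X >= x) for a normal distribution with the given mean and sd."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def compare_tails(z, mu=0.0, lam=1.0):
    """Tail probabilities z standard deviations above the mean, EVD vs normal."""
    mean = mu + EULER_GAMMA / lam
    sd = math.pi / (lam * math.sqrt(6.0))
    x = mean + z * sd
    return evd_tail(x, mu, lam), normal_tail(x, mean, sd)

evd_p, normal_p = compare_tails(3.0)
```

Three standard deviations above the mean, the EVD tail probability is roughly 0.012 against roughly 0.0013 for the normal distribution, so a score that looks significant under a normal model is considerably less so under the EVD.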
Extreme Value Distribution
The probability of a score S to be larger than or equal to a given
value x can be calculated following the EVD as:
E-value: P(S ≥ x) = 1 – exp(–e^(–λ(x – μ))),
where μ = (ln Kmn)/λ (here m is the query length and n the
total database length), and K a constant that can be
estimated from the background amino acid distribution
and scoring matrix (see Altschul and Gish, 1996, for a
collection of values for λ and K over a set of widely
used scoring matrices).
Extreme Value Distribution
Using the equation for μ (preceding slide), the
probability for the raw alignment score S becomes
P(S ≥ x) = 1 – exp(–Kmn·e^(–λx)).
In practice, the probability P(S ≥ x) is estimated using
the approximation 1 – exp(–e^(–x)) ≈ e^(–x), which is valid for
large values of x. This leads to a simplification of the
equation for P(S ≥ x):
P(S ≥ x) ≈ e^(–λ(x – μ)) = Kmn·e^(–λx).
The lower the probability (E value) for a given
threshold value x, the more significant the score S.
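A numerical sketch of the simplification, comparing the full expression 1 – exp(–Kmn·e^(–λx)) with the approximation Kmn·e^(–λx). The values of λ, K, m and n are placeholders picked for illustration only.

```python
import math

def evd_mu(K, m, n, lam):
    """Characteristic value mu = ln(K * m * n) / lam."""
    return math.log(K * m * n) / lam

def p_exact(x, K, m, n, lam):
    """P(S >= x) = 1 - exp(-K * m * n * e^(-lam * x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def p_approx(x, K, m, n, lam):
    """Large-x approximation: K * m * n * e^(-lam * x)."""
    return K * m * n * math.exp(-lam * x)

# Placeholder parameters (assumptions, not authoritative values).
K, LAM = 0.041, 0.267
M, N = 250, 10_000_000      # hypothetical query and database sizes

mu = evd_mu(K, M, N, LAM)
exact_high, approx_high = p_exact(150, K, M, N, LAM), p_approx(150, K, M, N, LAM)
exact_at_mu, approx_at_mu = p_exact(mu, K, M, N, LAM), p_approx(mu, K, M, N, LAM)
```

For a high score the two expressions agree to many digits. At x = μ the approximation gives 1 while the full expression gives 1 – 1/e, so the simplification should only be trusted in the tail, where significant hits are found anyway.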
Normalised sequence similarity
Statistical significance
• Database searching is commonly performed
using an E-value threshold between 0.1 and 0.001.
• Lower E-value thresholds decrease the number of false
positives in a database search, but increase
the number of false negatives, thereby
lowering the sensitivity of the search.
Words of Encouragement
• “There are three kinds of lies: lies, damned
lies, and statistics” – Benjamin Disraeli
• “Statistics in the hands of an engineer are
like a lamppost to a drunk – they’re used
more for support than illumination”
• “Then there is the man who drowned
crossing a stream with an average depth of
six inches.” – W.I.E. Gates
Protein structure-function
relationships
Protein function
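[Figure: levels of cellular organisation (Genome/DNA, Transcriptome/mRNA, Proteome, Metabolome, Physiome), annotated with classes of proteins at work between them: transcription factors, ribosomal proteins, chaperonins, enzymes]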
Protein function
Not all proteins are enzymes:
Crystallin: eye lens protein – needs to stay
stable and transparent for a lifetime (very little
turnover in the eye lens)
Protein function groups
• Catalysis (enzymes)
• Binding – transport (active/passive)
– Protein-DNA/RNA binding (e.g. histones, transcription
factors)
– Protein-protein interactions (e.g. antibody-lysozyme)
– Protein-fatty acid binding (e.g. apolipoproteins)
– Protein – small molecules (drug interaction, structure
decoding)
• Structural component (e.g. crystallin)
• Regulation
• Transcription regulation
• Signalling
• Immune system
• Motor proteins (actin/myosin)
What can happen to protein
function through evolution
Proteins can have multiple functions (and sometimes
many -- Ig).
Enzyme function is defined by specificity and
activity
Through evolution:
• Function and specificity can stay the same
• Function stays same but specificity changes
• Change to some similar function (e.g. somewhere
else in metabolic system)
• Change to completely new function
How to arrive at a given function
• Divergent evolution – homologous proteins
– proteins have the same structure and “sameish” function
• Convergent evolution – analogous proteins
– different structure but same function
• Question: can homologous proteins change
structure (and function)?
How to evolve
Important distinction:
• Orthologues: homologous proteins in different
species (all deriving from same ancestor)
• Paralogues: homologous proteins in same species
(internal gene duplication)
• In practice: to recognise orthology, the bi-directional
best hit criterion is used in conjunction with a database
search program (this is called an operational
definition)
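The bi-directional best hit criterion can be sketched as follows: two proteins from different species are called putative orthologues when each is the other's best-scoring hit. The score tables below are invented; in practice they would come from a database search program run in both directions.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]   # query -> {subject: score}

def best_hit(scores: Scores) -> Dict[str, str]:
    """For every query, return the subject with the highest score.

    Ties are resolved in favour of the first subject listed.
    """
    return {q: max(subj, key=subj.get) for q, subj in scores.items() if subj}

def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hit(a_vs_b)
    best_ba = best_hit(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical search scores between proteins of species A and species B.
a_vs_b = {"a1": {"b1": 200.0, "b2": 50.0}, "a2": {"b1": 120.0, "b2": 40.0}}
b_vs_a = {"b1": {"a1": 198.0, "a2": 110.0}, "b2": {"a2": 45.0, "a1": 30.0}}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here a2 has b1 as its best hit, but b1 prefers a1, so only the pair (a1, b1) is reported. The criterion is operational: it says nothing about the actual evolutionary history.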
How to evolve
By addition of domains (at either end of
protein sequence) – Lesk book page 108
Often through gene duplication followed by
divergence
Multi-domain proteins are result of gene
fusion
Protein structure evolution
Insertion/deletion of secondary structural
elements can ‘easily’ be done at loop sites
Flavodoxin fold
Flavodoxin family - TOPS diagrams
(Flores et al., 1994)
Protein structure evolution
Insertion/deletion of structural domains can
‘easily’ be done at loop sites
The basic functional unit of a
protein is the domain
A domain is a:
• Compact, semi-independent unit
(Richardson, 1981).
• Stable unit of a protein structure that can
fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary
module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
Delineating domains is essential for:
• Obtaining high resolution structures (x-ray, NMR)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary
structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of
a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
Structural domain organisation can be nasty…
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Complex protein functions are a
result of multiple domains
• An example is the so-called swivelling
domain in pyruvate phosphate dikinase
(Herzberg et al., 1996), which brings an
intermediate enzymatic product over about
45 Å from the active site of one domain to
that of another.
• This enhances the enzymatic activity
The DEATH Domain
http://www.mshri.on.ca/pawson
• Present in a variety of Eukaryotic
proteins involved with cell death.
• Six helices enclose a tightly
packed hydrophobic core.
• Some DEATH domains form
homotypic and heterotypic dimers.
Globin fold
α protein
myoglobin
PDB: 1MBN
β sandwich
β protein
immunoglobulin
PDB: 7FAB
TIM barrel
α / β protein
Triose phosphate IsoMerase
PDB: 1TIM
A fold in
α + β protein
ribonuclease A
PDB: 7RSA
The red balls represent waters that are ‘bound’ to the protein based on polar contacts
434 Cro protein complex (phage)
PDB: 3CRO
Zinc finger DNA recognition (Drosophila)
PDB: 2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
Zinc-finger DNA binding protein family
Characteristics of the family:
Function:
The DNA-binding motif is found as part of
transcription regulatory proteins.
Structure:
One of the most abundant DNA-binding motifs.
Proteins may contain more than one finger in a
single chain. For example, Transcription Factor
TF3A, the first zinc-finger protein discovered,
contains 9 C2H2 zinc-finger motifs (tandem
repeats). Each motif consists of 2 antiparallel
beta-strands followed by an alpha-helix. A
single zinc ion is tetrahedrally coordinated by
conserved histidine and cysteine residues,
stabilising the motif.
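The conserved cysteine and histidine residues make the motif easy to spot in sequence. A rough sketch, assuming a simplified C2H2 spacing pattern (C, 2-4 residues, C, 12 residues, H, 3-5 residues, H); this spacing is an illustrative assumption and looser or stricter patterns are in use. It is applied to the 2DRP fragment shown on the earlier slide.

```python
import re

# Simplified C2H2 spacing pattern (an assumption, not a curated motif definition).
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_c2h2(seq):
    """Return (0-based start, matched text) for each non-overlapping match.

    Spaces and dots used for layout in the slide text are removed first.
    """
    cleaned = seq.replace(" ", "").replace(".", "").upper()
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(cleaned)]

fragment = "..YRCKVCSRVY THISNFCRHY VTSH..."
matches = find_c2h2(fragment)
```

A pattern match of this kind only flags candidate fingers; whether a candidate actually folds around a zinc ion and binds DNA has to be established from structure or experiment.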
Zinc-finger DNA binding protein family
Characteristics of the family:
Binding:
Fingers bind to 3 base-pair subsites and specific
contacts are mediated by amino acids in positions
1, 2, 3 and 6 relative to the start of the alpha-helix.
Contacts mainly involve one strand of the DNA.
Where proteins contain multiple fingers, each
finger binds to adjacent subsites within a larger
DNA recognition site, thus allowing a relatively
simple motif to specifically bind to a wide range of
DNA sequences.
This means that the number and the type of zinc
fingers dictate the specificity of binding to DNA
Leucine zipper
(yeast)
PDB: 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...