Protein Sequence Analysis Structure & Function Prediction

Download Report

Transcript Protein Sequence Analysis Structure & Function Prediction

Function Prediction from Protein
Sequence
Basic definitions
• Primary structure
– the linear sequence of amino acids in a protein
• Secondary structure
– regions of local regularity
• i.e., a-helices, b-strands, -sheets & -turns
Definitions contd.
• Super-secondary structure
– the packing of secondary structure elements into stable units
• e.g., b-barrels, bab units, Greek keys, etc..
Definitions contd.
• Tertiary structure
– the overall chain fold that results from packing of secondary
structure elements
Definitions contd.
• Quaternary structure
– the arrangement of separate chains within a protein that has
more than one subunit
• e.g., haemoglobin
Definitions contd.
• Quinternary structure
– the arrangement of separate molecules, such as in proteinprotein or protein-nucleic acid interactions
Importance of sequence analysis
• Feb 2008 >67 million sequences available in Genbank
– & millions more (including ESTs) in proprietary dbs
– these #s will snowball with completion of more genomes
– so what?
• Locked up in sequences is a huge amount of
structural, functional & evolutionary info
– they're a highly valuable resource
• By contrast, the # of solved protein structures is
~45000
– a huge information deficit
PDB Feb 2007
The legacy of the genome projects
Sequence-structure deficit
800
700
600
500
400
300
200
100
1988
2002
Non-redundant growth of sequences during 1988-2002 (
corresponding growth in the number of structures (
) & the
).
Challenges for bioinformatics
• Spurred on by the seq/structure deficit, the challenges
– rationalise the mass of sequence data
– derive more efficient means of data storage
– design more incisive & reliable analysis tools
• The imperative - to convert sequence information into
biochemical & biophysical knowledge
– to decipher the structural, functional & evolutionary clues
encoded in the language of biological sequences
The Holy Grail of bioinformatics
...to be able to understand the words in a sequence sentence
that form a particular protein structure
Pattern recognition & prediction
• In investigating the meaning of sequences, two
distinct analytical approaches have emerged
– pattern recognition is used to detect similarity between
sequences & hence to infer related structures & functions
– ab initio prediction is used to deduce structure, & to infer
function, directly from sequence
• These methods are quite different!
– pattern recognition methods demand that some
characteristic has been seen before & housed in a db
– prediction methods remove the need for template dbs,
because deductions are made directly from sequence
Science fact & fiction
• Sequence pattern recognition is easier to
achieve, & is much more reliable, than fold
recognition
– which is ~50% reliable even in expert hands
• Prediction is still not possible
– & is unlikely to be so for decades to come (if ever)
• But, to debunk a popular myth, knowing
structure alone does not inherently tell us
function
The Twilight Zone
• Prediction methods don’t work because we don’t fully
understand the Folding Problem
– we can’t read the language sequences use to create their folds
• But, with sequence analysis techniques, we can try to find
similarities between new sequences & those in dbs
– whose structures & functions we hope have been elucidated
• This is straightforward at high levels of identity, but below
50% it is difficult to establish relationships reliably
• Analyses can be pursued with decreasing certainty
towards the Twilight Zone
– ~20% identity, where results may look plausible to the eye, but are
no longer statistically significant
Beyond the Twilight Zone
• To penetrate deeper into the Twilight Zone is the aim
of most analytical methods
– whether using single sequences, motifs, complex weighting
schemes or raw amino acid frequencies
• Each offers a different perspective, depending on the
type of information used in the search
– none gives the right answer
• It is good practice to devise an analysis protocol that
uses a variety of methods
– but don’t expect the impossible – no method is infallible!
Application areas of analysis tools
•
•
The scale indicates % identity
between aligned sequences
Alignment of 2 random seqs
can produce ~20% identity
– less than 20% does not
constitute a significant
alignment
– around this threshold is the
Twilight Zone, where
alignments may appear
plausible to the eye, but can’t
be proved by conventional
methods
Homology & analogy
• The term homology is confounded & abused!
– sequences are homologous if they are related by divergence
from a common ancestor
– analogy relates to the acquisition of common features from
unrelated ancestors via convergent evolution
• e.g., b-barrels occur in soluble & membrane proteins; enzymes
chymotrypsin & subtilisin share groups of catalytic residues,
with near identical spatial geometries, but no other similarities
The Midnight Zone
• In the genome era, prediction of function from sequence is of
more immediate value than is the prediction of structure
• However, between distantly-related proteins, structure is more
conserved than the underlying sequences
– thus, some relationships are only apparent at the structural level
• Such relationships can’t be detected by even the most sensitive
sequence comparison methods
– the region of identity where sequence comparisons fail completely
to detect structural similarity is the Midnight Zone – there is thus a
theoretical limit to the effectiveness of sequence analysis methods
Family (pattern) databases
• SWISS-PROT is emerging as a standard, &
most pattern dbs use it as their basis
PROSITE
PRINTS
(fingerprints)
Pfam
(HMMs)
Profiles
Blocks
(blocks)
eMOTIF
SWISS-PROT
SWISS-PROT/TrEMBL
SWISS-PROT/TrEMBL
Regular expressions (patterns)
Aligned motifs
Hidden Markov Models
SWISS-PROT
InterPro/PRINTS
Weight matrices (profiles)
Weighted motifs
Blocks/PRINTS
Permissive regular expressions
Why create pattern databases?
• Pattern dbs arise from the need to make more specific functional
diagnoses than are possible simply by searching
• They are built on the principle that homologous sequences may
be gathered together in multiple alignments, within which are
regions (motifs) that show little variation
– these motifs usually reflect some vital biological role in terms of
either structure or function
• Motifs are exploited in different ways to build diagnostic patterns
for protein families
– new sequences can be searched against dbs of such patterns to
see if they can be assigned to known families
– hence they offer a fast track to the inference of function
What's in a sequence?
Methods for family analysis
Single motif
methods
Fuzzy regex
(eMOTIF)
Exact regex
(PROSITE)
Full domain
alignment methods
Profiles
(PROFILE LIBRARY)
HMMs
(Pfam)
Identity matrices
(PRINTS)
Multiple motif
methods
Weight matrices
(Blocks)
PROSITE
• The first pattern db
– based on the idea that a protein family can be characterised
by a pattern of conserved residues within a single motif
• Sequence information in motifs is reduced to
consensus or regular expressions (regexs) & the
seed regex used to search SP
– results are inspected manually to achieve optimal results
• Some families can’t be characterised by single motifs
– here, additional regexs are created until an optimal set is
achieved that captures most or all of the family
– results are then manually annotated for inclusion in the db
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
Regular expressions (patterns)
• These are derived from single conserved regions in alignments
– they are minimal expressions, so sequence information is lost
– the more divergent the sequences used, the more fuzzy & poorly
discriminating the regex becomes
Alignment
GAVDFIALCDRYF
GPIDFVCFCERFY
GRVEFLNRCDRYY
Regex
G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
• Regexs do not tolerate similarity
– sequences either match or not, regardless of how similar they are
– matching is a binary ‘on-off’ event & frequently misses true matches
– single-motif methods are very hit-or-miss – how do you know if
you've encoded the ‘best’ region?
Regular expressions (rules)
• Regex patterns are most effective when applied to highlyconserved, family-specific motifs
• It is often possible to identify, shorter generic patterns within
sequences, characteristic of common functional sites
Functional site
N-glycosylation
Protein kinase C phosphorylation
Casein kinase II phosphorylation
Rule
N-{P}-[ST]-{P}
[ST]-X-[RK]
[ST]-X2-[DE]
• Such features result from convergence to a common property
– glycosylation sites, phosphorylation sites, etc.
• They cannot be used for family diagnosis & don’t discriminate
– they can only be used to suggest whether a certain functional site
might exist (which must then be tested by experiment)
– such patterns are normally termed rules
Residue groups for fuzzy regexs
• It is possible to assign residues to groups based on
various biochemical properties – e.g., charge & size
– using such groups theoretically ensures that resulting regexs
have sensible biochemical interpretations
small
small hydroxyl
basic
aromatic
aliphatic
acidic/amide
small/polar
Ala, Gly
Ser, Thr
His, Lys, Arg
Phe, Tyr, Trp
Val, Leu, Ile, Met
Asp, Glu, Asn, Gln
Ala, Gly, Ser, Thr, Pro
• This is more flexible than exact regex matching
Profiles & Pfam
• An alternative to motif-based methods exploits regions
between motifs, which also contain valuable information
– the full alignment effectively becomes the discriminator
• A complex scoring scheme allowing for substitutions &
INDELs is used to create family-specific profiles
• These profiles can be used to detect distant relationships, where only few residues are conserved
– this is the basis of the Profile library
• In an extension of this approach, alignments are
encoded as probabilistic models termed HMMs
– this is the basis of Pfam
Profiles
• Profiles are scoring tables derived from full domain alignments
–
–
–
–
these define which residues are allowed at given positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate insertions
the scoring system is intricate, & may include evolutionary weights,
results from structural studies, & data implicit in the alignment
– variable penalties are specified to weight against INDELs occurring
in core 2' structure elements
• Within a profile, the I & M fields contain position-specific scores
for insert & match positions
– in conserved regions, INDELs aren't totally forbidden, but are
strongly impeded by large penalties defined in the DEFAULT field
– these are superseded by more permissive values in gapped regions
– the inherent complexity of profiles renders them highly potent
discriminators, but they are time-consuming to derive
A HMM
L L
C
C
A
E
D
S
N
B
I1
L
W R
Y
D
E
G
I2
L
Y
W
I3
C
C
M1
M2
M3
M4
D1
D2
D3
D4
J
E
C
T
Fingerprints
• Fingerprints are groups of conserved (ungapped) motifs excised
from alignments & used for iterative db searching
– no weighting scheme is used
– searches depend only on residue frequencies
– resulting scoring matrices are thus sparse
• Each motif trawls the db independently
– search results are correlated to determine which sequences match
all the motifs & which match only partially
– no information is thrown away
• The iterative process refines the fingerprint & increases its power
– potency is gained from the mutual context of motif neighbours
– results are biologically more meaningful than those from single motifs
TM domain
loop region
TM domain
Composite pattern databases
•
To simplify sequence analysis,
the family databases are being
integrated to create a unified
annotation resource – InterPro
– release 4.0 contains 4691 entries
– a central annotation resource,
with pointers to its satellite dbs
– initial partners were PRINTS,
PROSITE, profiles & Pfam
– new partners include ProDom,
TIGRfam, SMART & hopefully
others (e.g., Blocks, MetaFam)
– lags behind its sources
– major role in fly & human
genome annotation