Exploring Protein Sequences
You want to learn everything possible about your own protein sequence.
Multiple sequence alignments of related sequences can build up
consensus sequences of known families, domains, motifs or sites.
Combining these predictions with primary biochemical data can provide
valuable insights into protein structure and function
Let’s make a quick tour through:
– Patterns and Motifs
– Domains and domain databases
Celia van Gelder
CMBI
Radboud University
June 2006
©CMBI 2005
Exploring Protein Sequences
Part 1:
– Patterns and Motifs
– Profiles
– Hydropathy Plots
– Transmembrane helices
– (Antigenic Prediction)
– Signal Peptides
– Repeats
– (Coiled Coils)
Part 2:
– Protein Domains
– Domain databases
Patterns and Motifs (1)
•In a multiple sequence alignment (MSA) islands of conservation
emerge
•These conserved regions (motifs, segments, blocks, features) are
typically around 10-20 aa in length
•They tend to correspond to the core structural or functional
elements of the protein
•Their conserved nature allows them to be used to diagnose family
membership
Patterns and Motifs (2)
•A motif (or pattern or signature) is a regular expression for what
residues can be present at any given position.
•Motifs can contain
- alternative residues
- flexible regions
C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
  CXXXCXGXPXXXXXC
  |   | | |     |
FGCAKLCAGFPLRRLPCFYG
Syntax: A-[BC]-X-D(2,5)-{EFG}-H
Means:
  A        A
  [BC]     B or C
  X        anything
  D(2,5)   2-5 D's
  {EFG}    not E, F or G
  H        H
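The syntax above maps almost one-to-one onto ordinary regular expressions. A minimal sketch in Python, assuming only the elements shown on this slide (single residues, x, [..], {..} and repeat counts); the function name prosite_to_regex is made up for illustration, and other parts of the full PROSITE syntax (such as sequence anchors) are not handled:

```python
import re

# One pattern element: a residue, x, [..] or {..}, optionally followed by (n) or (n,m)
ELEMENT = re.compile(r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core in ("x", "X"):
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # single residue or [alternatives]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
print(regex)                                  # C.{2,5}C.[GP].P.{2,5}C
hit = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(hit.start(), hit.group(0))              # 2 CAKLCAGFPLRRLPC
```

A regular expression either matches or it does not, which is exactly the "exact match or no match at all" behaviour of motifs.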
Patterns and Motifs (3)
•Motifs cannot contain
- mismatches: exact match or no match at all
- gaps
C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
  CXXCXGXPXXXXX-C
  | ?| | |     ?|
FGCA-CAGFPLRRLPKCFYG
J.Leunissen
PROSITE
• PROSITE - A Dictionary of Protein Sites and Patterns
• 1328 patterns and 577 profiles/matrices (dec 2005)
• For every pattern or profile there is documentation present (e.g.
PDOC00975)
- information on taxonomic occurrence
- domain architecture,
- function,
- 3D structure,
- main characteristics of the sequence
- some references.
PROSITE Pattern
•PROSITE patterns consist of an exact regular expression
•Matches to short patterns, such as post-translational modification sites,
occur frequently in proteins; the site itself may not actually be present
ID ASN_GLYCOSYLATION; PATTERN.
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.
•Notice also in the PROSITE record the number of false positives
and false negatives
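To see why such a short pattern produces many chance hits, it can be scanned over a sequence. A minimal sketch, assuming the N-{P}-[ST]-{P} pattern is written by hand as a Python regular expression; the sequence is a made-up toy example, not a real protein:

```python
import re

# N-{P}-[ST]-{P} as a regex; the lookahead lets overlapping sites be reported too
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return (1-based position, matched tetrapeptide) for every pattern match."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]

toy_sequence = "MNGSANPSNVTQ"  # hypothetical sequence for illustration only
print(find_n_glycosylation_sites(toy_sequence))  # [(2, 'NGSA'), (9, 'NVTQ')]
```

Every hit is only a candidate site: the pattern says nothing about whether the asparagine is actually glycosylated.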
PROSITE Pattern (2)
Profiles
•If regular expressions fail to define the motif properly we need a
profile.
•Profiles are specific representations that incorporate the entire
information of a multiple sequence alignment.
•A profile is a position-specific scoring scheme and holds for each
position in the sequence 20 scores for the 20 residue types, and
sometimes also two values for gap open and gap elongation.
•Profiles provide a sensitive means of detecting distant sequence
relationships
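The idea of 20 scores per position can be sketched as follows. This is a simplified illustration, not the method used by any particular profile database: it assumes an ungapped alignment, a uniform background frequency, a pseudocount of 1, and no gap penalties. The toy alignment and all function names are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of 20 log-odds scores per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all residues equally likely
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(len(alignment[0])):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, 0-based start)."""
    width = len(profile)
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

toy_alignment = ["CAGF", "CSGY", "CAGF", "CTGW"]  # hypothetical aligned motif
profile = build_profile(toy_alignment)
print(best_hit(profile, "MKLCAGFPLR"))
```

Unlike a regular expression, a segment with an unusual residue at one position still receives a score, so distant family members are penalised rather than rejected outright.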
Hydropathy plots
Hydropathy plots are designed to display the distribution of polar and
apolar residues along a protein sequence.
A positive value indicates local hydrophobicity and a negative value
suggests a water-exposed region on the face of a protein.
(Kyte-Doolittle scale)
Hydropathy plots are generally most useful in predicting transmembrane
segments, and N-terminal secretion signal sequences.
Hydrophobicity
Hydrophobicity is the most important characteristic of amino acids. It is the
hydrophobic effect that drives proteins towards folding.
Actually, it is all done by water. Water does not like hydrophobic surfaces.
When a protein folds, exposed hydrophobic side chains get buried, which
releases water from its sad duty of sitting against the hydrophobic surfaces of
these side chains.
Water is very happy in bulk water because there it has on average 3.6 H-bonds and about six degrees of freedom.
So, whenever we discuss protein structure, folding, and stability, it is all about the
entropy of water, and that is called the hydrophobic effect.
Hydropathy scales
Sliding Window Approach
Sum an amino acid property (e.g. hydrophobicity values) over a given
window and divide by the window length
Plot the value at the middle of the window
I L I K E I R
4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40  =>  5.40/7 = 0.77
Move to the next position in the sequence
L I K E I R Q
3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60  =>  -2.60/7 = -0.37
J. Leunissen
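The sliding window calculation above can be sketched in a few lines of Python. The values are the Kyte-Doolittle hydropathy scale; the function name and the choice to report 1-based centre positions are assumptions made for this example.

```python
# Kyte-Doolittle hydropathy values per residue
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Return (1-based centre position, mean hydropathy) for every full window."""
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((start + half + 1, mean))
    return profile

for centre, value in hydropathy_profile("ILIKEIRQ", window=7):
    print(centre, round(value, 2))
# 4 0.77
# 5 -0.37
```

Plotting the second value of each pair against the first gives the hydropathy plot.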
Hydropathy plot for rhodopsin
The window size can be changed. A small window produces "noisier" plots that more
accurately reflect highly local hydrophobicity.
A window of about 19 is generally optimal for recognizing the long hydrophobic
stretches that typify transmembrane segments.
Transmembrane Helices
Transmembrane proteins are integral membrane proteins that interact
extensively with the membrane lipids.
Nearly all known integral membrane proteins span the lipid bilayer
Hydropathy analysis can be used to locate possible transmembrane
segments
The main signal is a stretch of hydrophobic and helix-loving amino acids
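A very rough version of such a hydropathy-based scan is sketched below. It assumes the Kyte-Doolittle scale, a window of 19 residues, and a cut-off of 1.6 for the window average; the cut-off and the merging of overlapping windows are choices made for this example, and the prediction servers use considerably more refined methods. The test sequence is artificial.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def candidate_tm_segments(sequence, window=19, threshold=1.6):
    """Return merged (start, end) spans, 0-based and end-exclusive, of all
    windows whose mean hydropathy reaches the threshold."""
    segments = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean >= threshold:
            end = start + window
            if segments and start <= segments[-1][1]:
                # overlaps or touches the previous span: extend it
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments

# artificial sequence: polar flanks around a 21-residue hydrophobic core
toy = "DDDDDKKKKK" + "LIV" * 7 + "KKKKKDDDDD"
print(candidate_tm_segments(toy))
```

Because windows that only partly cover the hydrophobic core can still pass the cut-off, the reported span is usually somewhat longer than the core itself.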
Transmembrane Helices (2)
In an α-helix the rotation is 100 degrees per amino acid
The rise per amino acid is 1.5 Å
To span a membrane of 30 Å approx. 30/1.5 = 20 amino acids are
needed
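The same arithmetic, written out as a small Python sketch. The constants are the values from this slide; the function names are made up for the example.

```python
import math

ROTATION_PER_RESIDUE_DEG = 100.0   # rotation per amino acid
RISE_PER_RESIDUE_ANGSTROM = 1.5    # rise per amino acid along the helix axis

def residues_per_turn():
    """A full turn is 360 degrees."""
    return 360.0 / ROTATION_PER_RESIDUE_DEG

def residues_to_span(thickness_angstrom):
    """Smallest whole number of residues whose total rise covers the thickness."""
    return math.ceil(thickness_angstrom / RISE_PER_RESIDUE_ANGSTROM)

print(residues_per_turn())      # 3.6
print(residues_to_span(30.0))   # 20
```

This is why hydrophobic stretches of roughly 20 residues are the signal to look for, and why a window of about 19 works well in hydropathy plots.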
Transmembrane Helix Prediction Servers
1. KDD
2. Tmpred (database Tmbase)
3. DAS
4. TopPred II
5. TMHMM 2.0
6. MEMSAT 2
7. SOSUI
8. HMMTOP 2.0
Signal Peptides
Proteins have intrinsic signals that govern their transport and localization in the
cell (nucleus, ER, mitochondria, chloroplasts)
Specific amino acid sequences determine whether a protein will pass through a
membrane into a particular organelle, become integrated into the membrane, or
be exported out of the cell.
Signal Peptides (2)
The common structure of signal peptides from various proteins is
described as:
• a positively charged (N-terminal) n-region
• followed by a hydrophobic h-region (which can adopt an α-helical
conformation in a hydrophobic environment)
• and a neutral but polar c-region (cleavage region; the signal
sequence is cleaved off here after delivering the protein at the
right site).
The (-3, -1) rule states that the residues at positions –3 and –1 (relative to
the cleavage site) must be small and neutral for cleavage to occur
correctly.
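The (-3, -1) rule is easy to check for a proposed cleavage site. A minimal sketch: which residues count as "small and neutral" is an assumption here (A, G, S, C, T), the function name is made up, and the precursor sequence is a toy example. Real predictors such as SignalP also evaluate the n-, h- and c-regions.

```python
# Assumption for this example: residues treated as small and neutral
SMALL_NEUTRAL = set("AGSCT")

def obeys_minus3_minus1_rule(precursor, cleavage_index):
    """cleavage_index is the 0-based index of the first residue of the mature
    protein; positions -1 and -3 are counted back from there."""
    if cleavage_index < 3:
        return False  # not enough residues before the cleavage site
    minus1 = precursor[cleavage_index - 1]
    minus3 = precursor[cleavage_index - 3]
    return minus1 in SMALL_NEUTRAL and minus3 in SMALL_NEUTRAL

toy_precursor = "MKRLLLLALLLLAVASAEEP"  # hypothetical: n-region, h-region, ...ASA | EEP
print(obeys_minus3_minus1_rule(toy_precursor, 17))  # True  (-3 = A, -1 = A)
print(obeys_minus3_minus1_rule(toy_precursor, 4))   # False (-3 = K, -1 = L)
```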
Prediction of Signal Peptides
Prokaryotes and Eukaryotes:
SignalP 3.0
SPScan
SigCleave
PSORT
Eukaryotes:
SIGFIND
TargetP
Specific localization signals:
PredictNLS - Nuclear Localization Signals
ChloroP – Chloroplast transit peptides
NetNES – Nuclear Export Signals
Repeats in proteins
•Although they are usually found in non-coding genomic regions, repeating
sequences are also found within genes.
•Ranging from repeats of a single amino acid, through three residue short
tandem repeats (e.g. in collagen), to the repetition of homologous domains
of 100 or more residues.
•Duplicated sequence segments occur in 14 % of all proteins, but
eukaryotic proteins are three times more likely to have internal repeats
than prokaryotic proteins
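The simplest repeats, exact tandem copies of a short unit, can be found with a back-referencing regular expression. A minimal sketch: it only detects perfect copies, so diverged repeats and repeated domains need dedicated tools. The function name and both sequences are made up for the example.

```python
import re

def find_tandem_repeats(sequence, unit_length, min_copies):
    """Return (0-based start, repeat unit, number of copies) for each exact
    tandem repeat of a unit of the given length."""
    pattern = re.compile(r"(.{%d})\1{%d,}" % (unit_length, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // unit_length)
            for m in pattern.finditer(sequence)]

print(find_tandem_repeats("MAQQQQQQPGA", unit_length=1, min_copies=4))
# [(2, 'Q', 6)]
print(find_tandem_repeats("MKPEDPEDPEDLL", unit_length=3, min_copies=3))
# [(2, 'PED', 3)]
```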
Repeats, example 2
Prediction of Repeats
• Repsim (a database of simple repeats)
• Rep (searches a protein sequence for repeats)
• RADAR (Rapid Automatic Detection and Alignment of Repeats in
protein sequences)
• REPRO (de novo repeat detection in protein sequences)
• Other?
Definition of protein domains
• Group of residues with high contact density, number of contacts
within domains is higher than the number of contacts between
domains.
• A stable unit of protein structure that can fold autonomously
• A rigid body linked to other domains by flexible linkers
• A portion of the protein that can be active on its own if you remove
it from the rest of the protein.
Protein Domains
• Domains can be 25 to 500 residues long; most are less than 200
residues
• The average protein contains 2 or 3 domains
• The total number of different domain types is estimated at ~1000-3000
• The same or similar domains are found in different proteins.
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
“Nature is smart but lazy”
• Usually, each domain plays a specific role in the function of the
protein.
Linkers
Domain linkers link the protein domains together and have been found to
contain an amino acid signature that is distinct from the structurally
compact domains.
Average linker size 8-9 amino acids
Linkers are flexible and susceptible to protease attack.
Protein Domain Databases
Even though the structure of a domain is not always known it is still
possible to define the domain boundaries from sequence alone
Many of the common domains have already been defined in domain
databases
Advantages:
• Pre-annotated domains
• Easy interpretation of domain structure
Problem:
• Not trivial to define domain boundaries unambiguously
Protein Domains
http://ip30.eti.uva.nl/ember-demo/ch3
Domain databases (2)
Database          Generation   #entries
PfamA             manual       7503 families
PfamB             automatic    >140,000 families
Prints            manual       11,170 motifs
Prosite Profiles  manual       577 profiles
Blocks            automatic    28,337 blocks, 5733 groups
SMART             manual       667 HMMs
ProDom            automatic    501,917 domain families
PRINTS database
• Most protein families are characterised not by one, but by several
conserved motifs
• Fingerprints are groups of conserved motifs excised from sequence
alignments
• Taken together, they provide diagnostic family signatures. They are
the basis of the PRINTS database, and are stored in the form of
aligned motifs
• Input about protein families is done manually
• True members match all elements of the fingerprint in order; subfamily
members may match only part of the fingerprint
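The "all elements, in order" idea can be illustrated with a small sketch. It is a deliberate simplification: PRINTS stores aligned motifs, whereas this example uses regular expressions as stand-ins, and the motifs, sequences and function name are all made up.

```python
import re

def fingerprint_hits(sequence, motifs):
    """Search the motifs left to right; each motif must start after the end of
    the previous hit. Returns (motif index, 0-based start) for every motif found."""
    hits = []
    position = 0
    for index, motif in enumerate(motifs):
        match = re.compile(motif).search(sequence, position)
        if match:
            hits.append((index, match.start()))
            position = match.end()
    return hits

toy_fingerprint = ["C..C", "G[ST]P", "W.D"]  # hypothetical three-motif fingerprint

full = fingerprint_hits("AACAACAAGSPAAWADAA", toy_fingerprint)
partial = fingerprint_hits("AACAACAAAAWADAA", toy_fingerprint)
print(full)     # [(0, 2), (1, 8), (2, 13)]  all motifs, in order
print(partial)  # [(0, 2), (2, 10)]          second motif missing
```

A sequence matching every motif would be reported as a likely family member; a partial match, as in the second call, is the kind of result expected for a subfamily member or a more distant relative.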
PRINTS database
http://ip30.eti.uva.nl/ember-demo/ch3
PRINTS
ProDom: The Protein Domain Database
• ProDom is a comprehensive set of protein domain families
automatically generated
• Each entry provides a multiple sequence alignment of homologous
domains and a family consensus sequence.
• Current ProDom release:
ProDom 2004.1, June 2004, 501917 domain families
Pfam
Pfam (Protein families) is a large collection of multiple sequence
alignments and hidden Markov models covering many common protein
domains and families.
For each family in Pfam you can:
•Look at multiple alignments
•View the domain organisation of proteins
•Examine species distribution
•Follow links to other databases
•View known protein structures
Pfam
Two distinct parts:
– Pfam-A: manually curated entries (7503 families)
– Pfam-B: automatically generated clusters (>140,000; not covered by Pfam-A)
New:
iPfam is a resource that describes domain-domain interactions
that are observed in known structures
SMART
SMART - Simple Modular Architecture Research Tool
Domain families found in:
1) signalling
2) nuclear
3) extracellular
4) other
Current version 5.0: Number of SMART HMMs: 669
You can use SMART in two different modes: normal or genomic.
Bacteriorhodopsin
Human serine protease
Limitations of domain databases
• Patterns are not available for all protein families
• The multiple sequence alignment used to define a pattern may be
inaccurate if it was generated automatically
• A low number of sequences from different species can
result in inaccurate patterns
Integrating Pattern databases
InterPro - Integrated Documentation Resource of Protein Families,
Domains and Functional Sites.
InterPro is a database of protein families, domains and functional
sites in which identifiable features found in known proteins can be
applied to unknown protein sequences.
The aim is to provide a one-stop-shop for protein family diagnostics
InterPro
Member Databases
• Prosite (regular expressions and profiles)
• Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and
SUPERFAMILY (hidden Markov models, HMMs)
• PRINTS (groups of aligned, un-weighted motifs)
• ProDom (uses cluster analysis to group sequences)
Release 12.0 contains 12542 entries
Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site
Summary
• Many different protein signature databases exist (from small
patterns to alignments to complex HMMs)
• The databases have different strengths and weaknesses. Some
databases can be better for your sequence than others
• Therefore: best to combine methods, preferably in an integrated
database
• The quality of a database/server is best tested with a sequence
you know very well
• Always do control experiments: never trust a server