Transcript Pfam-A
Exploring Protein Sequences
Prediction methods exist for all kinds of motifs, signals etc. in newly
discovered protein sequences. These are based on either the
protein sequence itself or its comparison to protein families (a
multiple sequence alignment)
Combining these predictions with primary biochemical data can
provide valuable insights into protein structure and function
Let’s make a quick tour through:
– Patterns
– Domains and domain databases
– Signals in proteins
Celia van Gelder
CMBI
Radboud University
October 2007
©CMBI 2007
Exploring Protein Sequences
Part 1:
– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases
Part 2:
– Signals in Proteins:
   – Hydropathy Plots
   – Transmembrane helices
   – Signal Peptides
   – Repeats
   – Coiled Coils
Patterns
•Homologous sequences in multiple alignments show conserved
regions
•These conserved regions (patterns, motifs, segments, blocks,
features) are typically around 10-20 aa in length
•They usually reflect the structural and/or functional elements of the
protein
•New sequences can be searched against a library of patterns and
can then be assigned a function, or assigned to a family or sub-family.
Identifying patterns
Aligned sequence fragments:
CYDEGGIS
CYEDGGIS
CYEEGGIT
CYRGDGNT
Regular expression or pattern:
C-Y-X(2)-[DG]-G-X-[ST]
PROSITE Syntax:
A-[BC]-X-D(2,5)-{EFG}-H
Means:
A          A
[BC]       B or C
X          Anything
D(2,5)     2-5 D’s
{EFG}      Not E, F or G
H          H
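The PROSITE syntax maps almost directly onto regular expressions. The following is a minimal Python sketch, not the PROSITE implementation. It assumes the repeat on the slide (X2) is written X(2) as in full PROSITE notation, the function name prosite_to_regex is made up for this example, and the anchors < and > are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.split("-"):
        # An element is a residue, [alternatives] or {exclusions},
        # optionally followed by a repeat count such as (2) or (2,5).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                        # anything
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # none of these residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(core)
    return "".join(regex_parts)

regex = prosite_to_regex("C-Y-X(2)-[DG]-G-X-[ST]")
fragments = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]
print(regex, [bool(re.fullmatch(regex, f)) for f in fragments])
```

All four aligned fragments match the derived pattern. A fragment with a single substitution at a fixed position (for example CYDEGAIS) does not, which is the "exact match or no match at all" behaviour of patterns.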
Identifying patterns (2)
Patterns can contain:
- alternative residues
- flexible regions
Patterns cannot contain:
- mismatches (exact match or no match at all)
- gaps
PROSITE
–PROSITE - Database of protein domains, families and functional
sites
–1319 patterns and 748 profiles/matrices (Oct 2007)
–For every pattern or profile there is documentation present
–Sequence search and Keyword search possible
–http://www.expasy.ch/prosite/
PROSITE example
PROSITE Patterns
Some patterns, such as post-translational modification sites, are so short
that they match frequently in proteins; a match does not mean that the
site is actually present (modified) in the protein.
–ID ASN_GLYCOSYLATION; PATTERN.
–DE N-glycosylation site.
–PA N-{P}-[ST]-{P}.
You will get a warning:
Notice also in the PROSITE record the number of false positives and
false negatives
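How often such a short pattern hits can be seen by scanning a sequence with the N-glycosylation pattern N-{P}-[ST]-{P}. This is a minimal Python sketch: the example sequence is made up, the function name is hypothetical, and a lookahead is used so that overlapping hits are also reported.

```python
import re

# N-{P}-[ST]-{P} as a regular expression; the lookahead makes the match
# zero-width, so overlapping candidate sites are all found.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]

example = "MNGSANPTLNNTSK"  # made-up sequence, for illustration only
sites = find_n_glycosylation_sites(example)
print(sites)  # [(2, 'NGSA'), (10, 'NNTS'), (11, 'NTSK')]
```

Three candidate sites in fourteen residues: each is only a pattern match, not evidence that the asparagine is glycosylated.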
Identifying patterns – fingerprints
[Figure: Patterns 1 to 4 along the sequence, each described by a matrix; together they form a fingerprint or signature]
Databases: PRINTS, BLOCKS
Profiles
Many motifs cannot be easily defined using simple regular
expressions.
Such motifs can be defined using a profile, which is a numerical
representation of a MSA. For each position in the MSA, each of the
20 amino acids is given a score depending on how likely it is to
occur.
Profiles provide a sensitive means of detecting distant sequence
relationships.
The profile represents a specific pattern found for a set of proteins.
It is then used to search a target sequence for matches to the profile.
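As an illustration of the idea, a profile can be sketched in Python as one score table per alignment column. This is a toy version, not the PROSITE or Pfam method: it assumes an ungapped alignment, a uniform background frequency of 1/20, a pseudocount of 1 and log-odds scores, and the function names are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Log-odds score per column and residue, against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)          # assumption: uniform background
    profile = []
    for column in zip(*alignment):               # one tuple of residues per column
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_match(profile, target):
    """Slide the profile along the target; return (score, 0-based start) of the best window."""
    width = len(profile)
    scores = [
        (sum(col[aa] for col, aa in zip(profile, target[i:i + width])), i)
        for i in range(len(target) - width + 1)
    ]
    return max(scores)

alignment = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]
profile = build_profile(alignment)
score, start = best_match(profile, "MKLCYEDGAISAA")  # made-up target sequence
print(round(score, 2), start)
```

The target contains CYEDGAIS, which has an A where every aligned sequence has a G. A strict pattern would reject it, but the profile still scores this window highest, which is why profiles can detect more distant relationships.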
Identifying patterns – full domain alignment
[Figure: Patterns 1 to 4 plus the gaps and insertions between them; the fingerprint or signature becomes a position-specific matrix + gaps and insertions]
Databases:
Profiles (alignment manually corrected)
Pfam (automatically aligned)
Protein domains - definitions
• A group of residues with high contact density: the number of
contacts within a domain is higher than the number of
contacts between domains.
• A stable unit of protein structure that can fold autonomously
• A rigid body linked to other domains by flexible linkers
• A portion of the protein that can be active on its own if you
remove it from the rest of the protein.
Protein Domains
• Domains can be 25 to 500 amino acids long; most are less
than 200 amino acids
• The average protein contains 2 or 3 domains
• The same or similar domains are found in different proteins.
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
“Nature is smart but lazy”
• Usually, each domain plays a specific role in the function of
the protein.
Protein Domains - an alphabet of functional modules
14-3-3, ANK3, Death, DED, PH, PTB, ARM, EFH, SAM, BH1, C1, EH, EVH, SH2, C2, SH3, CARD, FYVE, PDZ, WD40, WW
From: Bioinformatics.ca
Domain Linkers
Domain linkers link the protein domains together and have been
found to contain an amino acid signature that is distinct from the
structurally compact domains.
Average linker size 8-9 amino acids
Linkers are flexible and susceptible to protease attack.
Amino acids like Pro, Ser, Gly, Thr (and less frequently Ala,
Asn and Asp) are often found in linker sequences.
Protein Domain Databases
Even though the structure of a domain is not always known it is still
possible to define the domain boundaries from sequence alone
Many of the common domains have already been defined in domain
databases
Advantages:
• Pre-annotated domains
• Easy interpretation of domain structure
Problem:
• Not trivial to define domain boundaries unambiguously
The challenge of family analysis
T. Attwood
Domain databases (December 2005)
Database           Generation   #entries
PfamA              manual       7503 families
PfamB              automatic    >140,000 families
Prints             manual       11,435 motifs, 1900 fingerprints
Prosite Profiles   manual       577 profiles
Blocks             automatic    28,337 blocks, 5733 groups
SMART              manual       667 HMMs
ProDom             automatic    501,917 domain families
PRINTS database
• Most protein families are characterised not by one motif, but by several
conserved motifs, so-called fingerprints.
• Use all fingerprints of a protein family to build a diagnostic signature
for this family
• Fingerprints are the basis of the PRINTS database, and are stored in the
form of aligned motifs
• Input about protein families is done manually
• True members match all elements of the fingerprint in order; subfamily
members may match part of the fingerprint
http://ip30.eti.uva.nl/ember-demo/ch3
PRINTS
BLOCKS database
Blocks are multiply aligned ungapped segments corresponding to the most
highly conserved regions of proteins.
The blocks for the BLOCKS database are made automatically.
To ensure complete coverage it is recommended that both the PRINTS
and the BLOCKS database be searched.
Pfam
Pfam (Protein families) is a large collection of multiple sequence
alignments and hidden Markov models covering many common protein
domains and families.
For each family in Pfam you can:
•Look at multiple alignments
•View the domain organisation of proteins
•Examine species distribution
•Follow links to other databases
•View known protein structures
Pfam
Pfam-A entries are manually curated - 9318 families (July 2007)
Pfam-B entries are automatically generated clusters –
>140,000 (not covered by Pfam-A)
iPfam is a resource that describes domain-domain interactions
that are observed in known structures - 3019 interactions
SMART
SMART - Simple Modular Architecture Research Tool
Specializes in:
1) signalling domains
2) nuclear domains
3) extracellular domains
Current version 5.0: Number of SMART HMMs: 669
Bacteriorhodopsin
Human serine protease
Structure Databases & Structural classification
PDB Brookhaven Databank
http://www.rcsb.org/pdb/
CDD – Conserved Domain Database
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
MSD – Macromolecular Structure Database
http://www.ebi.ac.uk/msd/index.html
CATH - Protein Structure Classification
http://www.biochem.ucl.ac.uk/bsm/cath/
SCOP - Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
Adapted from: Bioinformatics.ca
Limitations of domain databases
• Patterns not present for all families of proteins
• Multiple sequence alignment to define patterns could be
inaccurate due to an automatic alignment
• Low number of sequences from different species could
result in inaccurate patterns
Integrating Pattern databases
InterPro - Integrated Documentation Resource of Protein Families,
Domains and Functional Sites.
InterPro is a database of protein families, domains and functional
sites in which identifiable features found in known proteins can be
applied to unknown protein sequences.
The aim is to provide a one-stop-shop for protein family diagnostics
InterPro
Member Databases:
– Prosite (regular expressions and profiles)
– Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY
(hidden Markov Models - HMMs)
– PRINTS (groups of aligned, un-weighted motifs)
– ProDom (uses cluster analysis to group sequences)
Release 16.1: 14768 entries (Oct 2007)
Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site
Summary patterns & domains
• Many different protein signature databases exist (from small
patterns to alignments to complex HMMs)
• The databases have different strengths and weaknesses. Some
databases can be better for your sequence than others
• Therefore: best to combine methods, preferably in an integrated
database
• The quality of a database/server is best tested with a sequence
you know very well
• Always do control experiments: never trust a server
Exploring Protein Sequences
Part 1:
– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases
Part 2:
– Signals in Proteins:
   – Hydropathy Plots
   – Transmembrane helices
   – Signal Peptides
   – Repeats
   – Coiled Coils
Hydropathy plots
Hydropathy plots are designed to display the distribution of polar
and apolar residues along a protein sequence.
Hydrophobicity scales are based on experimental evidence
indicating hydrophobic/hydrophilic properties of each amino acid
Hydropathy plots are generally most useful in predicting
transmembrane segments and N-terminal secretion signal
sequences.
Hydropathy scales
A positive value indicates local
hydrophobicity and a negative value
suggests a water-exposed region on
the face of a protein.
(Kyte-Doolittle scale)
Sliding Window Approach
Sum the amino acid hydrophobicity values in a given window
Plot the average value in the middle of the window
I L I K E I R
4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40  =>  5.40/7 = 0.77
Move to the next position in the sequence
L I K E I R Q
3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60  =>  -2.60/7 = -0.37
The window size can be changed.
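The sliding-window calculation can be sketched in a few lines of Python. The values are the Kyte-Doolittle scale; the function name and the convention of reporting the 1-based centre position of each window are choices made for this example.

```python
# Kyte-Doolittle hydropathy values per amino acid
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Return (1-based centre position, average hydropathy) for each window."""
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        average = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((start + half + 1, round(average, 2)))
    return profile

print(hydropathy_profile("ILIKEIRQ"))  # [(4, 0.77), (5, -0.37)]
```

The two values reproduce the worked example: 0.77 for the window ILIKEIR and -0.37 for LIKEIRQ. Plotting the second element against the first gives the hydropathy plot.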
J. Leunissen
Hydrophobicity plot
[Figure: hydropathy score (about -4 to +3; positive = hydrophobic, interior residues; negative = hydrophilic, exterior residues) plotted along the protein sequence, from residue 1 (NH2) to about residue 301 (COOH)]
From: Bioinformatics.ca
Transmembrane Helices
Transmembrane proteins are integral membrane proteins that interact
extensively with the membrane lipids.
Nearly all known integral membrane proteins span the lipid bilayer
Hydropathy analysis can be used to locate possible transmembrane
segments
The main signal is a stretch of hydrophobic and helix-loving amino acids
A window of about 19 residues is generally optimal for recognizing the long
hydrophobic stretches that typify transmembrane helices.
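A crude transmembrane scan can be sketched in Python with a window of 19 and the Kyte-Doolittle scale. The cut-off of 1.6 for the window average is an assumption taken from common practice with this scale, the test sequence is made up, and the function name is hypothetical; real predictors use considerably more information.

```python
# Kyte-Doolittle hydropathy values per amino acid
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def predict_tm_segments(sequence, window=19, threshold=1.6):
    """Return 1-based (start, end) ranges covered by windows whose
    average hydropathy exceeds the threshold; overlapping windows are merged."""
    segments = []
    for start in range(len(sequence) - window + 1):
        average = sum(KD[aa] for aa in sequence[start:start + window]) / window
        if average > threshold:
            first, last = start + 1, start + window
            if segments and first <= segments[-1][1]:
                segments[-1] = (segments[-1][0], last)  # extend the previous hit
            else:
                segments.append((first, last))
    return segments

# Made-up sequence: 10 polar residues, 20 hydrophobic residues, 10 polar residues
toy = "KKDEKRSEDK" + "LLIVALLIVALLIVALLIVA" + "KRDESKKDEK"
print(predict_tm_segments(toy))  # [(6, 35)]
```

The reported range (6 to 35) is wider than the hydrophobic core (11 to 30), because every window that overlaps the core strongly enough is counted. The exact helix boundaries therefore need a closer look than a window average can give.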
Transmembrane Helices (2)
In an α-helix the rotation is 100 degrees per amino acid (3.6 residues per turn)
The rise per amino acid is 1.5 Å
To span a membrane of 30 Å, approx. 30/1.5 = 20 amino acids are
needed
Transmembrane Helix Prediction - Rhodopsin
Signal Peptides
Proteins have intrinsic signals that
govern their transport and
localization in the cell (nucleus, ER,
mitochondria, chloroplasts)
Specific amino acid sequences
determine whether a protein will
pass through a membrane into a
particular organelle, become
integrated into the membrane, or
be exported out of the cell.
Signal Peptides (2)
The common structure of signal peptides from various proteins is
described as:
• a positively charged (N-terminal) n-region
• followed by a hydrophobic h-region (which can adopt a helical conformation in a hydrophobic environment)
• and a neutral but polar c-region (cleavage region; the signal
sequence is cleaved off here after delivering the protein to
the right site).
Signal Peptides (3)
Feature: Eukaryotes / Gram-negative prokaryotes / Gram-positive prokaryotes
Total length (average): 22.6 aa / 25.1 aa / 32.0 aa
n-regions: only slightly Arg-rich / Lys+Arg-rich (both prokaryote groups)
h-regions: short, very hydrophobic / slightly longer, less hydrophobic / very long, less hydrophobic
c-regions: short, no pattern / short, Ser+Ala-rich / longer, Pro+Thr-rich
-3,-1 positions: small and neutral residues / almost exclusively Ala (both prokaryote groups)
+1 to +5 region: no pattern / rich in Ala, Asp/Glu, and Ser/Thr (both prokaryote groups)
Marlinda Hupkes 2004
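The preference for small, neutral residues at positions -3 and -1 can be turned into a very rough cleavage-site filter. This is a Python sketch only: the residue set AGSCT, the allowed signal peptide length of 15 to 35 residues and the function name are assumptions made for illustration, and real predictors also use the n-, h- and c-region properties.

```python
SMALL_NEUTRAL = set("AGSCT")  # assumption: residues counted as small and neutral

def candidate_cleavage_sites(sequence, min_len=15, max_len=35):
    """Return signal peptide lengths n for which cleavage after residue n
    leaves small neutral residues at positions -3 and -1."""
    sites = []
    for n in range(min_len, min(max_len, len(sequence) - 1) + 1):
        minus_1 = sequence[n - 1]   # last residue of the signal peptide
        minus_3 = sequence[n - 3]
        if minus_1 in SMALL_NEUTRAL and minus_3 in SMALL_NEUTRAL:
            sites.append(n)
    return sites

example = "MKKTAIAIAVALAGFATVAQAAPKDNT"  # illustrative sequence
print(candidate_cleavage_sites(example))  # [16, 19, 21]
```

Several positions pass the filter, so the (-3,-1) preference on its own is not enough to pick a cleavage site; it only narrows down the candidates.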
Repeats in proteins
• A repeat is any piece of protein sequence that appears multiple
times within a single protein
• Length of the repeat can vary from 1 (single amino acid repeat) up
to 240 amino acids
• Repeats are rarer in coding regions than in non-coding regions
• Repeats occur in 14 % of all proteins
• Eukaryotic proteins have three times more internal repeats than
prokaryotic proteins
• The three kingdoms of life have very few repeats in common
Repeats, examples
• Gln repeat in huntingtin (Huntington’s disease)
(CAG)n = a polyglutamine tract (polyQ)
Up to 35 repeats is not pathological; more than 35 repeats is pathological
• Bacterial transferase hexapeptide (three repeats)
• Leucine-rich repeats (LRRs), 20-29 aa motif
• WD-repeat
• Ankyrin-repeat
• etc.
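The simplest kind of repeat, a run of a single amino acid, is easy to detect. The Python sketch below measures the longest glutamine run and compares it with the threshold of 35 quoted above; the sequences are toy examples and the function name is made up.

```python
import re

PATHOLOGICAL_ABOVE = 35  # polyQ threshold quoted on the slide

def longest_run(sequence, residue="Q"):
    """Length of the longest uninterrupted run of one residue (0 if absent)."""
    runs = re.findall(f"{re.escape(residue)}+", sequence)
    return max((len(run) for run in runs), default=0)

normal = "MSK" + "Q" * 20 + "PPG"     # toy sequence with 20 Gln
expanded = "MSK" + "Q" * 40 + "PPG"   # toy sequence with 40 Gln
for name, seq in [("normal", normal), ("expanded", expanded)]:
    length = longest_run(seq)
    print(name, length, length > PATHOLOGICAL_ABOVE)
```

Longer repeat units such as leucine-rich or WD repeats are degenerate and cannot be found by exact matching like this; they need profile-based methods.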
Coiled-Coils
The coiled-coil is a ubiquitous protein motif that is often used to control
oligomerisation.
It is found in many types of proteins, including transcription factors, viral
fusion peptides, and certain tRNA synthetases.
Examples:
– Very long coils in tropomyosin and intermediate filaments
– GCN4 – gene regulation in yeast; leucine zipper
Coiled-Coils
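Left-handed spiral of right-handed helices
May be parallel (N-termini at the same end) or anti-parallel (N-termini at opposite ends)
[Figure: parallel and anti-parallel arrangement of two helices, with N- and C-termini marked]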
David Gossard
Coiled-Coils – Heptad repeat
Seven-residue pattern abcdefg in which the a and d residues (core
positions) are generally hydrophobic.
Residues at “d” and “a” form the hydrophobic core
Residues at “e” and “g” form ion pairs
[Figure: helical-wheel view of two helices showing the heptad positions a to g]
David Gossard
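The heptad idea can be sketched in Python by trying all seven possible registers and counting how often the a and d positions are hydrophobic. The hydrophobic residue set, the scoring by simple fraction and the function name are assumptions for this example, and the sequence is an idealised made-up repeat; real coiled-coil predictors use position-specific residue statistics.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumption: residues counted as hydrophobic

def best_heptad_register(sequence):
    """Try all 7 registers; return (fraction of hydrophobic a/d positions, register)."""
    best = (-1.0, "")
    for offset in range(7):
        register = "".join("abcdefg"[(i + offset) % 7] for i in range(len(sequence)))
        core = [aa for aa, pos in zip(sequence, register) if pos in "ad"]
        if not core:
            continue  # sequence too short to contain an a or d position
        fraction = sum(aa in HYDROPHOBIC for aa in core) / len(core)
        if fraction > best[0]:
            best = (fraction, register)
    return best

toy = "SG" + "LKELEDK" * 4   # made-up sequence: two extra residues, then 4 heptads
fraction, register = best_heptad_register(toy)
print(fraction)
print(toy)
print(register)
```

Printing the register under the sequence shows which residues fall on the core positions a and d; in this toy example every a and d position is a leucine.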
Assignment (see also paper version)
Make a report about the protein signal of your choice.
Questions which should be answered in this report are:
• Describe the protein signal you want to detect.
• Describe existing prediction method(s), their prediction quality and their
underlying theory.
• Describe the available webservers for detecting this protein signal, the
quality of their predictions, their pros and cons, and anything else you find
relevant.
• Give example output for a protein that is relevant to your protein signal and
explain this output.