Domain analysis, motifs and repeats
Classifying the protein universe
Synapse-Associated Protein 97
Ashwin Sivakumar
Wu et al, 2002. EMBO J 19:5740-5751
Domain Analysis and Protein Families
Introduction
What are protein families?
Protein families
Description & Definition
Motifs and Profiles
The modular architecture of proteins
Domain Properties and Classification
Protein Families
Protein families are defined by homology:
In a family, everyone is related to everyone
Everybody in a family shares a common ancestor:
[Figure: Protein family 1 and Protein family 2]
Homology versus Similarity
Homologous proteins share common ancestry and (usually) have similar 3D structures:
[Figure: structures of 1chg and 1sgt. Superfamily: trypsin-like serine proteases]
1chg and 1sgt: 31% identity, 43% similarity
Homology versus Similarity
But homologous proteins may not share sequence similarity:
[Figure: structures of 1chg and 1sgc. Superfamily: trypsin-like serine proteases]
1chg and 1sgc: 15% identity, 25% similarity
We cannot infer similarity from homology!
Homology versus Similarity
Similar sequences may not have structural similarity:
[Figure: structures of 2baa and 1chg]
1chg and 2baa: 30% similarity, 140/245 aa
We cannot assume homology from similarity!
Homology versus Similarity
Summary
– Sequences can be similar without being homologous
– Sequences can be homologous without being similar
[Diagram: Evolution / Homology, BLAST, Similarity, Families ??]
Description of a Protein Family
Let’s assume we know some members of a protein family
What is common to them all?
Multiple alignment!
Describing Sequences in a Protein Family
As a motif or rule:
describes essential features of the protein family
(catalytic residues, important structural residues)
As a profile:
describes variability in the family alignment
Techniques for searching sequence databases to
Some common strategies to uncover common
domains/motifs of biological significance that
categorize a protein into a family
• Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
• Profile - a probabilistic generalization that assigns, to every position of a segment, the probability that each of the 20 amino acids will occur there
• Consensus - the mathematical probability that a particular amino acid will be located at a given position
• Probabilistic pattern constructed from an MSA (multiple sequence alignment), with the opportunity to assign penalties for insertions and deletions
• PSSM (Position Specific Scoring Matrix)
– Represents the sequence profile in tabular form
– Columns of weights for every amino acid, corresponding to each column of an MSA
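A minimal sketch of how a PSSM could be derived from an ungapped multiple alignment. The uniform background frequency, the pseudocount value, the function names and the toy alignment are all assumptions made for illustration; real tools use observed background frequencies, sequence weighting and gap handling.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds weights per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score(pssm, segment):
    """Sum the column weights for a segment as long as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

# Hypothetical toy alignment (not real family data).
toy_alignment = ["DRYAI", "DRWAV", "ERYSI", "DRYTI"]
pssm = build_pssm(toy_alignment)
```

A segment that resembles the alignment, such as "DRYAI", scores higher with `score(pssm, ...)` than an unrelated one such as "GGGGG".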
HMMs
Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
• Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended)
• The more sequences there are, the better the models.
• One can generate a model (profile/PSSM), then search a database with it (e.g. PFAM)
Motif Description of a Protein Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
I
E W V
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
x = [AC-IK-NP-TVWY]
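The motif above translates almost directly into a regular expression in a programming language. The following is a sketch in Python, assuming that "x{10} – x{12}" on the slide means a spacer of 10 to 12 residues; the function name and the test sequence are made up for illustration.

```python
import re

# x = any of the 20 standard amino acids, exactly as defined on the slide.
X = "[AC-IK-NP-TVWY]"

# Assumption: "x{10} – x{12}" on the slide means "10 to 12 residues".
MOTIF = re.compile(
    f"C{X}{{13}}S{X}{{3}}[LI]{X}{{2}}I{X}{{2}}[DE]R[YW]{X}{{2}}[IV]{X}{{10,12}}W"
)

def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]

# Hypothetical sequence built to satisfy the motif; not a real protein.
core = ("C" + "A" * 13 + "S" + "A" * 3 + "L" + "AA" + "I" + "AA"
        + "DRY" + "AA" + "V" + "A" * 11 + "W")
example = "MK" + core + "GG"

hits = find_motif(example)                            # one hit, at offset 2
broken = find_motif(example.replace("DRY", "DKY"))    # conserved R mutated
```

Because the pattern is deterministic, a single substitution at a fixed position (here R to K in the DRY triplet) is enough to lose the match; this is the main weakness of motifs compared with profiles.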
Motif Description of a Protein Family
Database: PROSITE
“PROSITE is a database of protein families and domains. It is based on the
observation that, while there is a huge number of different proteins, most of
them can be grouped, on the basis of similarities in their sequences, into a
limited number of families. Proteins or protein domains belonging to a
particular family generally share functional attributes and are derived from
a common ancestor. It is apparent, when studying protein sequence
families, that some regions have been better conserved than others during
evolution. These regions are generally important for the function of a
protein and/or for the maintenance of its three-dimensional structure. By
analyzing the constant and variable properties of such groups of similar
sequences, it is possible to derive a signature for a protein family or
domain, which distinguishes its members from all other unrelated proteins.”
http://au.expasy.org/prosite/prosite_details.html
Automated Motif Discovery
Given a set of sequences:
GIBBS Sampler
http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
MEME
http://meme.sdsc.edu/meme/
PRATT
http://www.ebi.ac.uk/pratt
TEIRESIAS
http://cbcsrv.watson.ibm.com/Tspd.html
Automated Profile Generation
Any multiple alignment is a profile!
PSIBLAST
Algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
PSI-Blast
Start with a sequence and BLAST it,
align selected results to the query sequence,
estimate a profile from the MSA (this constructs the PSSM),
then search the database with the profile
Iterate until the process stabilizes
Focus here is on domains, not entire sequences
Greatly improves sensitivity
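The iteration can be illustrated with a toy model. This is not PSI-BLAST itself: it assumes ungapped matches, a uniform background, a raw score threshold instead of an E-value, and sequences that contain only the 20 standard amino acids. All names and sequences below are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(segments, pseudocount=0.5):
    """Log-odds PSSM from equal-length, ungapped segments."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*segments):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                        for aa in AMINO_ACIDS})
    return profile

def best_window(profile, sequence):
    """Best ungapped placement of the profile on a sequence: (score, segment)."""
    width = len(profile)
    best = (float("-inf"), "")
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        s = sum(col[aa] for col, aa in zip(profile, segment))
        if s > best[0]:
            best = (s, segment)
    return best

def iterative_search(query, database, threshold, max_iterations=10):
    """Search, rebuild the profile from the hits, repeat until the hit set is stable."""
    included = {}  # sequence name -> segment aligned to the profile
    for _ in range(max_iterations):
        profile = build_profile([query] + list(included.values()))
        hits = {}
        for name, seq in database.items():
            s, segment = best_window(profile, seq)
            if s >= threshold:  # threshold for inclusion in the profile
                hits[name] = segment
        if hits == included:    # process has stabilized
            break
        included = hits
    return included

# Hypothetical toy database.
database = {
    "rel1": "AACDRYVWGG",
    "rel2": "GGCERYIWAA",
    "distant": "AACERWVWGG",
    "unrelated": "GGGGGGGGGG",
}
first_pass = iterative_search("CDRYIW", database, threshold=6.0, max_iterations=1)
converged = iterative_search("CDRYIW", database, threshold=6.0)
```

In this toy setup the sequence named "distant" falls below the threshold when scored against the query alone, but is picked up once the profile has absorbed the variation seen in "rel1" and "rel2". That is the sensitivity gain the slide refers to.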
PSIBLAST
Position Specific Iterative Blast
[Figure: Query, Profile1, Profile2, ... after n iterations; threshold for inclusion in profile]
Benchmarking a motif/profile
You have a description of a protein family, and
you do a database search…
Are all hits truly members of your protein
family?
Benchmarking:
TP: true positive
FP: false positive
TN: true negative
FN: false negative
[Figure: search result versus dataset; legend: family member, not a family member, unknown]
Benchmarking a motif/profile
Precision / Selectivity
Precision = TP / (TP + FP)
Sensitivity / Recall
Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
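The two measures can be computed directly from the set of database hits and the set of known family members. A small sketch with made-up identifiers; sequences of unknown status are simply left out here, which is an assumption.

```python
def benchmark(hits, family_members):
    """Return (precision, sensitivity) for a search result."""
    hits, family_members = set(hits), set(family_members)
    tp = len(hits & family_members)   # found and truly in the family
    fp = len(hits - family_members)   # found but not in the family
    fn = len(family_members - hits)   # in the family but missed
    precision = tp / (tp + fp) if hits else 0.0
    sensitivity = tp / (tp + fn) if family_members else 0.0
    return precision, sensitivity

# Hypothetical example: 4 hits, 3 of them correct, 5 true members in total.
precision, sensitivity = benchmark(
    hits={"a", "b", "c", "x"},
    family_members={"a", "b", "c", "d", "e"},
)
```

Here precision is 3/4 and sensitivity is 3/5: loosening the search threshold would typically recover "d" and "e" at the cost of more false positives like "x".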
The Modular Architecture of Proteins
BLAST search of a multi-domain protein
Phosphoglycerate kinase
Triosephosphate isomerase
What are domains?
Functional - from experiments:
example: Decay Accelerating Factor (DAF) or CD55
Has six domains (units):
4x Sushi domain (complement regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
There is only so much we can conclude…
Classifying domains (to aid structure prediction: predict structural domains, molecular function of the domain)
Classifying complete sequences (predicting molecular function of proteins, large-scale annotation)
The majority of proteins are multi-domain proteins.
What are domains?
Structural - from structures:
MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILER
QTPDYVLGRIRAGVLEQGMVDLLREAGVDRRMA
RDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTV
YGQTEVTRDLMEAREACGATTVYQAAEVRLHDL
QGERPYVTFERDGERLRLDCDYIAGCDGFHGIS
RQSIPAERLKVFERVYPFGWLGLLADTPPVSHE
LIYANHPRGFALCSQRSATRSRYYVQVPLTEKV
EDWSDERFWTELKARLPAEVAEKLVTGPSLEKS
IAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAK
GLNLAASDVSTLYRLLLKAYREGRGELLERYSA
ICLRRIWKAERFSWWMTSVLHRFPDTDAFSQRI
QQTELEYYLGSEAGLATIAENYVGLPYEEIE
Are these domains?
Yes - structural domains!
1phh
M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.
What are domains?
Mobile – Sequence Domains:
Protein 1
Protein 2
Mobile module
Protein 3
Protein 4
Domains are...
...evolutionary building blocks:
Families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
With one or more of the following properties:
Globular
Independently foldable
Recurrence in different contexts
To be precise,
we say: “protein family”
we mean: “protein domain family”
Example: global alignment
Phthalate dioxygenase reductase (PDR_BURCE)
Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails!
Only aligns largest domain.
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
[Figure: domain architecture along the sequence, residue scale from 980 to 4391]
45 domains of 9 different types, according to Pfam
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
Categories of Domain Definitions
Sequence (continuous domains)
Curated: PFAM, SMART, PROSITE, PRINTS
Automatic: ADDA, DOMO, TRIBE-MCL, GENERAGE, SYSTERS, PROTOMAP
Structure (discontinuous domains)
Curated: SCOP, CATH
Automatic: DALI, PUU, DETEKTIVE, DOMAINPARSER 1 & 2, DIAL, STRUDL, DOMAK
Pfam: Protein family database
Families of HMM profiles built from hand-curated multiple alignments (Pfam A).
Pfam A covers 7973 protein families.
You can search your sequence against these profiles to determine its family membership.
Sequence Space Graph
Why we need to consider domains:
[Figure: graph legend: Sequence, Alignment]
Topology:
● 80% of all sequences in one giant component
● 10% in smaller groups
● 10% in singletons
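One way to see why whole-sequence clustering produces a giant component is to treat sequences as graph nodes and alignments as edges, then count connected components. The sketch below uses a small union-find; the sequence names and alignments are invented, with "multi" standing for a hypothetical two-domain protein.

```python
from collections import Counter

def connected_components(sequences, alignments):
    """Return component sizes, largest first, for a sequence/alignment graph."""
    parent = {s: s for s in sequences}

    def find(x):
        # Walk up to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in alignments:
        parent[find(a)] = find(b)  # merge the two components

    sizes = Counter(find(s) for s in sequences)
    return sorted(sizes.values(), reverse=True)

sequences = ["kin1", "kin2", "sh2a", "sh2b", "multi", "orphan"]
within_family = [("kin1", "kin2"), ("sh2a", "sh2b")]
# The two-domain protein aligns to members of both families.
via_multi = [("multi", "kin2"), ("multi", "sh2a")]

separate = connected_components(sequences, within_family)
merged = connected_components(sequences, within_family + via_multi)
```

Without the multi-domain protein the two families stay apart; with it, they collapse into one component although they share no domain. Working at the domain level avoids this kind of false merging.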
Automatic domain definitions
Rely on alignment information
Alignment information is unreliable:
Incomplete sequences (fragments)
Spurious alignments
Conserved motifs in mostly disordered regions
How to remove the noise?
UREA_CANEN: three domain protein
Distant relatives
Sequence Space Graph:
• Where to cut connections?
• What is real, what is noise?
• Precision vs Sensitivity…
ADDA
Holm Group in-house database!
http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb
Classification of non-redundant sequences
100% level: 1562243 sequences, 2697368 domains
40% level: 479740 sequences, 827925 domains
PFAM-A benchmark
Sensitivity: 87% (average unification in single cluster)
Selectivity: 98% (average purity of cluster)
Coverage: 100% (all known proteins) [ Pfam ~50% ]
Example: ABC transporter
PFAM
PRODOM
DOMO
ADDA
UniProt id: CFTR_BOVIN
Properties of domains
Most domains: size approx 75 – 200 residues
So, you have a sequence...
...look it up in an existing database
– SRS: http://srs.ebi.ac.uk
– INTERPRO: http://www.ebi.ac.uk/interpro
...search against existing family descriptions
– PFAM: http://www.sanger.ac.uk/Software/Pfam
– SMART: http://smart.embl-heidelberg.de
– PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS
– PROSITE: http://us.expasy.org/prosite
...look it up in ADDA
Manually Curated Protein Family Databases
PFAM (Hidden Markov Models)
– http://www.sanger.ac.uk/Software/Pfam
SMART (Hidden Markov Models)
– http://smart.embl-heidelberg.de
PROSITE (Regular Expressions, Profiles)
– http://au.expasy.org/prosite
PRINTS (combination of Profiles)
– http://bioinf.man.ac.uk/dbbrowser/PRINTS
Why a multiple alignment?
With a multiple alignment, we can
guess which residues are “important”
secondary structure prediction
transmembrane segments prediction
homology modelling
guide to wet-lab EXPERIMENTATION!
build a motif/profile and find more family members
build phylogenetic trees
Multiple Alignments are THE central
object in protein sequence analysis!
From sequence to function…
3-motif resource
The server seems to be down today!
Methylmalonyl CoA Decarboxylase pattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped on the
structure of 1DUB. The ball representation in pink shows the
potential ligands and their binding pockets. The balls in
blue represent the residues making up the motif on the