Identification of Protein domains

Identification of Protein Domains
Eden Dror
Menachem Schechter
Computational Biology Seminar 2004
Overview
• Introduction to protein domains.
– Classification of homologs.
• Representing a domain.
– PSSM
– HMM
• Internet resources
– Pfam
– SMART
– PROSITE
– InterPro
• Research example.
Protein domains
• A discrete portion of a protein assumed to
fold independently, and possessing its own
function.
• Mobile domain (“module”): a domain that can
be found associated with different domain
combinations in different proteins.
Protein domains
• The assumption: The domain is the
fundamental unit of protein structure and
function.
• Protein family – all proteins containing a
specific domain.
What can we learn from them?
• Common ancestors & homology information
of a set of proteins.
• Homology can imply properties of a protein,
such as function and localization.
• Therefore, domains can be used to classify a
new protein into a family and so infer its
function.
Classification of homologs
• Homology is not a sufficiently well-defined
term to describe the evolutionary
relationships between genes.
• Homologous genes can arise in two
major ways:
– Gene duplication (in the same species).
– Speciation (splitting of one species into two).
Classification of homologs
• Orthologs – Two genes from two different
species that derive from a single gene in the
last common ancestor of the species.
• Paralogs – Two genes that derive from a
single gene that was duplicated within a
genome.
Classification of homologs
[Figure: diagram labeling gene pairs as orthologs (ortho) or paralogs (para).]
Classification of homologs
• Inparalogs - paralogs that evolved by gene
duplication after the speciation event.
• Outparalogs - paralogs that evolved by gene
duplication before the speciation event.
Classification of homologs
[Figure: inparalogs (in-para) and outparalogs (out-para) when comparing human with worm.]
What can we learn from them?
• Orthologous proteins are evolutionary, and
typically functional, counterparts in different
species.
• Paralogous proteins are important for detecting
lineage-specific adaptations.
• Both of them can reveal information on a
specific species or a set of species.
Protein domains – summary
• By identifying domains we can:
– Infer functionality & localization of a protein.
– Learn about a specific species.
– Learn about a set of species as a group.
Domain representation
• Different methods to represent (model)
domains:
• Patterns (regular expressions).
• PSSM (Position specific score matrix).
• HMM (Hidden Markov model).
PSSM
• Position specific score matrix
• Score matrix representing the score for
having each amino acid in a given position in
a specific sequence.
• Based on the independent probabilities P(a|i)
of observing amino acid a in position i.
PSSM: Example
PSSM: Identifying a domain
• Given a sequence and a PSSM:
• Run over all positions.
• Score each sub-sequence according to the
matrix.
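The scan described above can be sketched in a few lines of Python. The matrix values, the reduced three-letter alphabet, and the names `PSSM`, `scan`, `best_hit` and `unknown_score` are all made up for illustration; a real PSSM would cover the 20 amino acids.

```python
# Hypothetical 3-position PSSM over a reduced alphabet; the scores are invented.
PSSM = [
    {"A": 2.0, "C": -1.0, "G": -1.0},
    {"A": -1.0, "C": 3.0, "G": -2.0},
    {"A": -1.0, "C": -2.0, "G": 2.5},
]


def scan(sequence, pssm, unknown_score=-4.0):
    """Score every window of len(pssm) residues; return (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position scores; residues missing from a column
        # get an assumed penalty.
        score = sum(col.get(res, unknown_score) for col, res in zip(pssm, window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pssm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])
```

In practice a window would be reported as a domain match only if its score exceeds some threshold; the choice of threshold is not covered here.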
HMM: Hidden Markov Model
• Markov model: a way of describing a process
that goes through a series of states.
• Each state has a probability of transitioning
to the other states.
• xi is a random variable representing the state at step i.
x1 → x2 → x3 → x4
HMM: Markov Model
• Example:
• States are in {0,1}
x1=0, x2=1, x3=1, x4=1
x1=1, x2=0, x3=0, x4=1
x1=0, x2=0, x3=0, x4=0
HMM: Markov Model
• Transition matrix:
$$A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}$$
$$a_{ij} = P(x_{k+1} = j \mid x_k = i)$$
x1 → x2 → x3 → x4
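As a small worked example, the probability of a state sequence under this transition matrix is the starting probability times one transition entry per step. The slides do not give a starting distribution, so the uniform `INITIAL` below is an assumption, as is the function name.

```python
# Transition matrix from the slide: A[i][j] = P(next state = j | current state = i).
A = [[0.6, 0.4],
     [0.2, 0.8]]

# Assumption: the slides give no initial distribution, so use a uniform one.
INITIAL = [0.5, 0.5]


def chain_probability(states, transition, initial):
    """Probability of a state sequence under a first-order Markov chain."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transition[prev][nxt]
    return prob


# Sequence x1=0, x2=1, x3=1, x4=1: 0.5 * 0.4 * 0.8 * 0.8
p = chain_probability([0, 1, 1, 1], A, INITIAL)
```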
HMM: Markov Model
• State transition example:
• States are the nucleotides A, T, G, C.
HMM: Hidden Markov Model
• Hidden Markov model:
• Each state x emits an output y, at a specific
probability.
• We only know the output (observations).
• Thus, the states are hidden.
x1 → x2 → x3 → x4
↓    ↓    ↓    ↓
y1   y2   y3   y4
HMM: Hidden Markov Model
• Example: states are in {0,1}, outputs are in {0,1}
States: x1=0, x2=1, x3=1, x4=1; outputs: y1=1, y2=1, y3=0, y4=0
States: x1=1, x2=0, x3=0, x4=1; outputs: y1=1, y2=0, y3=1, y4=0
HMM: Hidden Markov Model
• Emission matrix (rows indexed by state x, columns by output y):
$$B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix}$$
$$b_{ij} = P(y_k = j \mid x_k = i)$$
x1 → x2 → x3 → x4
↓    ↓    ↓    ↓
y1   y2   y3   y4
HMM: What can we do with it?
• Given (A, B):
• Probability of given states and outputs
$$P(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n) = P(x_1) P(y_1 \mid x_1) \cdot P(x_2 \mid x_1) P(y_2 \mid x_2) \cdots$$
• Probability of a given output sequence
$$P(y_1 y_2 \cdots y_n) = \sum_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n)$$
• Most likely sequence of states that generated
a given output sequence
$$\max_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n \mid y_1 y_2 \cdots y_n)$$
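The three quantities above can be sketched in Python using the transition and emission values from the earlier slides. The uniform starting distribution `PI` is an assumption (the slides do not state one), and the sum and maximum are computed with the standard forward and Viterbi recursions rather than by enumerating every state sequence.

```python
from itertools import product

A = [[0.6, 0.4], [0.2, 0.8]]      # A[i][j] = P(x_{k+1} = j | x_k = i)
B = [[0.1, 0.9], [0.85, 0.15]]    # B[i][j] = P(y_k = j | x_k = i)
PI = [0.5, 0.5]                   # assumption: uniform starting distribution
N_STATES = len(A)


def joint(states, outputs):
    """P(x_1..x_n, y_1..y_n) as a product of transition and emission terms."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p


def forward(outputs):
    """P(y_1..y_n): sum over all state paths, computed by the forward recursion."""
    alpha = [PI[s] * B[s][outputs[0]] for s in range(N_STATES)]
    for y in outputs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(N_STATES)) * B[s][y]
                 for s in range(N_STATES)]
    return sum(alpha)


def viterbi(outputs):
    """Most likely state path for the given outputs (Viterbi recursion)."""
    delta = [PI[s] * B[s][outputs[0]] for s in range(N_STATES)]
    backpointers = []
    for y in outputs[1:]:
        # For each state s, remember the best predecessor r.
        ptr = [max(range(N_STATES), key=lambda r: delta[r] * A[r][s])
               for s in range(N_STATES)]
        delta = [delta[ptr[s]] * A[ptr[s]][s] * B[s][y] for s in range(N_STATES)]
        backpointers.append(ptr)
    state = max(range(N_STATES), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


outputs = [1, 1, 0, 0]
# Brute-force check, feasible only for short sequences: sum over all 2^4 paths.
brute_force = sum(joint(xs, outputs) for xs in product(range(N_STATES), repeat=4))
```

The brute-force sum grows exponentially with sequence length, which is why the recursions are used in practice.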
HMM: What can we do with it?
• Learning:
• Given state and output sequences calculate
the most probable (A, B).
• Easy when the states are known.
• Otherwise: use a training algorithm.
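When the states are known, the most probable (A, B) can be estimated by counting transitions and emissions and normalizing each row. The sketch below does this for the two labelled example sequences from the earlier slide; the function name and the optional `pseudocount` parameter are assumptions for illustration. The case of unknown states (for example with the Baum-Welch algorithm) is not shown.

```python
def estimate(state_seqs, output_seqs, n_states, n_outputs, pseudocount=0.0):
    """Estimate transition and emission matrices by counting labelled data."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for xs, ys in zip(state_seqs, output_seqs):
        for x, y in zip(xs, ys):
            emit[x][y] += 1
        for prev, nxt in zip(xs, xs[1:]):
            trans[prev][nxt] += 1

    def normalize(rows):
        # Each row becomes a probability distribution. A row with no counts
        # would divide by zero, so use pseudocount > 0 on sparse data.
        return [[count / sum(row) for count in row] for row in rows]

    return normalize(trans), normalize(emit)


# The two labelled examples from the hidden Markov model slide.
states = [[0, 1, 1, 1], [1, 0, 0, 1]]
outputs = [[1, 1, 0, 0], [1, 0, 1, 0]]
A_hat, B_hat = estimate(states, outputs, n_states=2, n_outputs=2)
```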
HMM: Profile HMM
• Use HMM to represent sequence families.
• A particular type of HMM suited to modeling
multiple alignments.
• (Assume we have a multiple alignment).
HMM: Trivial profile HMM
• We begin with ungapped regions.
• Each position corresponds to a state.
• Transitions are of probability 1.
HMM: Trivial profile HMM
• Let ei(a) be the independent probability of
observing amino acid a in position i.
• The probability of a new sequence x,
according to the model:
$$P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)$$
HMM: Trivial profile HMM
• We can score the sequence x:
$$S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$
• Where q indicates the probability under a
random model.
HMM: Trivial profile HMM
• Consider the values $\log \frac{e_i(x_i)}{q_{x_i}}$
• They behave like elements in a score matrix.
• The trivial profile HMM is equivalent to a
PSSM.
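To make the equivalence concrete, the sketch below builds the log-odds values from an ungapped alignment and scores a sequence by summing them. The toy alignment, the uniform background q, and the pseudocount of 1 (which keeps unseen residues from scoring minus infinity) are all assumptions for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def log_odds_profile(alignment, background=None, pseudocount=1.0):
    """Return one dict per column mapping residue a to log(e_i(a) / q_a)."""
    if background is None:
        # Assumption: uniform background; real profiles use observed frequencies.
        background = {a: 1 / len(ALPHABET) for a in ALPHABET}
    profile = []
    for column in zip(*alignment):
        counts = {a: pseudocount for a in ALPHABET}
        for residue in column:
            counts[residue] += 1
        total = sum(counts.values())
        profile.append({a: math.log((counts[a] / total) / background[a])
                        for a in ALPHABET})
    return profile


def score(sequence, profile):
    """S = sum over positions of log(e_i(x_i) / q_{x_i})."""
    return sum(column[residue] for column, residue in zip(profile, sequence))


# Hypothetical ungapped alignment of three family members.
profile = log_odds_profile(["ACD", "ACE", "ACD"])
family_like = score("ACD", profile)
unrelated = score("WWW", profile)
```

A sequence resembling the alignment scores above zero and an unrelated one below zero, which is the behaviour expected of a score matrix.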
HMM: profile HMM
• Let’s untrivialize by allowing for gaps:
insertions and deletions.
• Start off with the PSSM HMM.
HMM: profile HMM
• Handling insertions:
• Introduce new states Ij, which match insertions
after position j.
• These states emit according to the random
(background) model.
HMM: profile HMM
• The score of a gap of length k:
$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$$
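The gap score above has the shape of an affine gap penalty: a fixed cost for entering and leaving the insert state, plus a per-residue cost for each extra inserted residue. A minimal sketch, with hypothetical transition probabilities and parameter names:

```python
import math


def insertion_score(k, a_match_insert, a_insert_insert, a_insert_match):
    """Log score of an insertion of length k >= 1 after match state j."""
    if k < 1:
        raise ValueError("insertion length must be at least 1")
    open_cost = math.log(a_match_insert) + math.log(a_insert_match)
    extend_cost = (k - 1) * math.log(a_insert_insert)
    return open_cost + extend_cost


# Hypothetical transition probabilities, chosen only for illustration.
scores = [insertion_score(k, 0.1, 0.5, 0.5) for k in (1, 2, 3)]
```

Each additional inserted residue changes the score by the same amount, log of the insert-to-insert transition probability.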
HMM: profile HMM
• Handling deletions:
• Introduce silent states Dj.
• These states do not emit.
HMM: profile HMM
• The complete profile HMM:
Internet resources
• Databases of protein families.
• Family information and identification.
• Considerations:
– Type of representation (pattern, PSSM, HMM).
– Choice of seed multiple alignment proteins.
– Quality control.
– Database features (links, annotations, views).
– Database specificity (organism, functions).
Pfam: Home
Pfam
• Protein families database of alignments and HMMs
• Uses profile-HMMs to represent families.
• For each family in Pfam you can:
– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures
Pfam: Databases
2 databases:
• Pfam-A – curated multiple alignments.
– Grows slowly.
– Quality controlled by experts.
• Pfam-B – automatic clustering (ProDom
derived).
– Complements Pfam-A.
– New sequences instantly incorporated.
– Unchecked: false positives, etc.
Pfam: Features
• Search by: Sequence, keyword, domain,
taxonomy.
• Browsing by family or genome.
• Evolutionary tree
Pfam: Construction
• Source of seed alignments:
– Pfam-B families.
– Published articles.
– 'Domain hunting' studies.
– Occasionally, entries from other databases
(e.g. MEROPS for peptidases).
Pfam: Domain information
Pfam: Domain organization
Pfam: Multiple alignment
Pfam: HMM logo
Pfam: Species distribution
Pfam: Genome comparison
PROSITE
• Database of protein families.
• Matching according to simple patterns or
PSSM profiles.
• Browsing all proteins of a specific family.
• The latest release contains 1696 protein families.
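Pattern matching of this kind maps naturally onto regular expressions. The sketch below translates a subset of PROSITE pattern syntax (single residues, x wildcards, [..] sets, {..} exclusions and (n) or (n,m) repeats) into a Python regex. The function name is hypothetical, anchors such as < and > are not handled, and the example pattern is written in the style of the C2H2 zinc finger signature and should be checked against the database entry before use.

```python
import re


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        # Single letters and [..] sets carry over unchanged.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# Illustrative pattern; not guaranteed to match the current PROSITE entry.
ZNF_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZNF_REGEX = re.compile(prosite_to_regex(ZNF_PATTERN))
```

A sequence can then be searched with `ZNF_REGEX.search(sequence)`, which returns a match object giving the location of the first hit, or `None`.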
PROSITE: Features
• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
• Specificity: true-pos/(true-pos + false-pos)
• Sensitivity: true-pos/(true-pos + false-neg)
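As a small worked example of the two ratios, reading "all-pos" as all reported matches (true positives plus false positives), with made-up counts and hypothetical function names:

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are real family members."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of real family members that the pattern or profile finds."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts for one pattern: 90 correct hits, 10 wrong hits,
# and 30 family members that were missed.
spec = specificity(true_pos=90, false_pos=10)   # 90 / 100
sens = sensitivity(true_pos=90, false_neg=30)   # 90 / 120
```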
PROSITE: Example
• Specificity of Zinc finger C2H2 type domain
SMART
• Simple Modular Architecture Research Tool
• Identification and annotation of genetically
mobile domains and the analysis of domain
architectures.
• SMART consists of a library of HMMs.
• Contains 665 HMMs to date.
SMART: Features
• Finding proteins that contain specific domains,
i.e. proteins of the same family
• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information
SMART: Domain selection example
Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
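A query of this form amounts to selecting proteins whose domain architecture contains every requested domain. A minimal sketch over a made-up annotation table (the protein names and architectures are hypothetical, and this is not SMART's own interface):

```python
# Hypothetical protein -> domain architecture annotations.
ARCHITECTURES = {
    "protA": ["TRANS", "TyrKc"],
    "protB": ["TyrKc"],
    "protC": ["TRANS", "SH2"],
}


def with_all_domains(architectures, required):
    """Names of proteins whose architecture contains every required domain."""
    required = set(required)
    return sorted(name for name, domains in architectures.items()
                  if required <= set(domains))


hits = with_all_domains(ARCHITECTURES, ["TyrKc", "TRANS"])
```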
InterPro
• InterPro combines 9 other databases, such as
SMART, Pfam, ProDom and more.
• Queries can use many different methods (as
the other databases use different methods).
• However, thresholds are predefined and
cannot be changed for those methods.
InterPro
• Provides more results, but can sometimes be
redundant.
• Coverage statistics:
• 93% of Swiss-Prot v42.5 –
128540 out of 138922 proteins
• 81% of TrEMBL v25.5 –
819966 out of 1013263 proteins
InterPro: Features
• Searching by Protein/DNA sequences
• Finding domains & homologs
• List of InterPro entries of type:
– Family
– Domain
– Repeat
– PTM (post-translational modification)
– Binding Site
– Active Site
– Keyword
InterPro: Example
• Kringle domain
Research Example: Introduction
• Goal: The systematic identification of novel
protein domain families.
• Using computational methods.
Research Example: Method
Derive set of 107 nuclear domains
Extract proteins
Extract unannotated regions
Cluster sequences
Take longest member
PSI-BLAST
Investigate homologous regions
Manual confirmation
Research Example: Results
• 28 New Domains identified:
• 15 domains in diverse contexts, in different
species.
• 3 species-specific domains.
• 7 domains with weak similarity to previously
described domains.
• 3 extension domains.
Predictions of Function
• On the basis of reports in literature and/or
occurrence with other identified domains,
functional features can be predicted for our
novel domain families.
• Examples:
– Chromatin binding
– Protein Interaction
– Predicted sub-cellular localization
Predictions of Function:
Chromatin-Binding example
• The novel domain CSZ is contained in protein
SPT6, which regulates transcription via
chromatin structure modification.
• SPT6 has a histone-binding capability,
experimentally confirmed.
• Other domains (S1, SH2) in SPT6 are unlikely
to bind histones or chromatin.
• Conclusion: CSZ has a predicted histone
binding function.
Predictions of Function:
Localization example
• Some of the novel domains are only found
within proteins from the initial set of nuclear
domains.
• This predicts that these domains have a
nuclear function.
• The other domains are likely to have roles in
both nucleus and cytoplasm.
Conclusion
• Domains are the functional units of proteins.
• Identifying a domain within a new protein may teach
us much about it.
• There are several types of models to represent
domains.
• These models can also be used to identify the
domain they represent.
• Many Internet databases are available to catalogue and
identify families.
• New domains can be identified with a protocol that starts from known ones.
Resources
• Pfam:
http://www.sanger.ac.uk/Software/Pfam/
• SMART:
http://smart.embl-heidelberg.de/
• PROSITE:
http://www.expasy.org/prosite/
• InterPro:
http://www.ebi.ac.uk/interpro/
The End