Transcript Slide 1
Ankita Sarangi
School of Informatics, IUB
Capstone Presentation,
May 11, 2009
Advisor : Yuzhen Ye
Sequence based approaches
Structure-based approaches
Motif-based approaches (sequence motifs, 3D motifs)
◦ Protein A has function X, and protein B is a homolog (ortholog) of protein A;
Hence B has function X
◦ Protein A has structure X, and X has specific structural features; Hence X’s
function sites are used to assign function to the Protein A
◦ A group of genes have function X and they all have motif Y; protein A has motif
Y; Hence protein A’s function might be related to X
“Guilt-by-association”
◦ Gene A has function X and gene B is often “associated” with gene A, B might
have function related to X
◦ Associations
Domain fusion, phylogenetic profiling, PPI, etc.
A protein domain is a part of protein
sequence and structure that can evolve,
function, and exist independently of the rest
of the protein chain.
◦ Each domain forms a compact three-dimensional
structure and often can be independently folded.
◦ Many proteins consist of several structural domains.
Among relevant sequence features of a
protein, domains occupy a key position. They
are sequential and structural motifs found
independently in different proteins, in
different combinations, and as such seem to
be the building blocks of proteins
However, it is also known that certain sets of
independent domains are frequently found
together, which may indicate functional
cooperation.
Supra- Domains : A supra-domain is defined as a domain
combination in a particular N-to-C-terminal orientation that occurs
in at least two different domain architectures in different proteins
with: (i) different types of domains at the N and C-terminal end of
the combination; or (ii) different types of domains at one end and no
domain at the other. `
A type of Supra-domain are ones whose activity is created at the
interface between the two domains of a protein
◦ (Ref: JMB, 2004, 336:809–823)
We may make mistakes if we do function prediction
based on individual domains
◦ We know proteins that have domain A and B have function
F, what about proteins having domain A or domain B only?
A survey of mis-annotation based on single
domains
◦ We are interested to know how serious this problem is in
the current annotation system
◦ There is no systematic survey on this so far
Function annotation using domain patterns
(domain combinations) instead of individual
domains
◦ Utilize the relationship of the predicted functions (as
shown in the GO directed acyclic graph of functions)
◦ Provide a web-tool and visualization of the predicted
functions and their relationship with domain patterns
SUPERFAMILY is a database of structural and
functional protein annotations for all
completely sequenced organisms.
The SUPERFAMILY web site and database
provides protein domain assignments, at the
SCOP 'superfamily' and 'family' levels, for the
predicted protein sequences in over 900
organisms
We made a local copy of this database
The GENE ONTOLOGY(GO) project is a
collaborative effort to address the need for
consistent descriptions of gene products in
different databases.
Consists of three structured, controlled
vocabularies (ontologies) that describe gene
products in terms of their associated
biological processes, cellular components and
molecular functions in a species-independent
manner.
We looked at several supra-domains listed in
this paper:
Supra-domains: Evolutionary
Units Larger than Single Protein Domains;
Voget etal.; J. Mol. Biol. (2004) 336, 809–823
Superfamily
Use the SCOP ID of the domains to obtain Gene Identifiers
associated with the supra-domains as well as their individual
domains
UniProt ID mapping file
(is a tab-delimited table, which includes mappings for 20
different sequence identifier types(example: ENSGALP000,
AN7518.2, Afu1g003, gi|41409236|ref|NP_962072.1|)and
gene_association.goa_uniprot
(GO assignments for the UniProt KnowledgeBase (UniProtKB))
To obtain Swiss prot ID, GO ID
Find Gene Ontology functions that are associated with
proteins which contain both the domains and the individual
domains
(SCOP ID - 63380)
(SCOP ID – 52343)
The N-terminal domain binds FAD and the C-terminal domain binds
NADPH. The FAD acts as an intermediate in electron transfer between
NADPH and substrate, and this domain combination is used by many
different enzymes
GO:0046872
GO:0050661
GO:0051536
GO:0006810
GO:0016491
GO:0016021
GO:0055114
GO:0004517
GO:0004783
GO:0005506
GO:0004497
GO:0010181
GO:0005488
GO:0003824
GO:0008152
GO:0020037
GO:0009055
GO:0050381
GO:0005737
GO:0050660
GO:0016020
120
100% IEA
80 % IEA
100
80
60
Riboflavin Synthase domain-like
40
Ferredoxin Reductase-like
Supra-domain
20
0
10 proteins with Supra domains annotated to
GO:0016491---- 2491proteins with Supra domains
3 proteins with Riboflavin Synthase domain-like
annotated to GO:0016491 --- 42 proteins with
Riboflavin Synthase domain-like
1 protein with reductase-like, C-terminal NADPlinked domain annotated to GO:0016491--- 47
protein with reductase-like, C-terminal NADP-linked
domain
Specific proteins searched and presence and absence
of the combined domain was confirmed along with
GO ID as well as annotation evidence which was
found to be Inferred Electronic Annotation
Supra –Domains: Riboflavin Synthase domain-like, Ferredoxin
reductase-like, C-terminal NADP-linked domain
Protein Name : Oxidoreductase FAD-binding domain protein
Gene Ontology : Biological Process: GO:005514 is_a child of
GO:0008152
molecular function: GO:0016491
PFAM domains: PF00970. FAD_binding
PF00175 NAD_binding
Evidence : IEA (Inferred Electronic Annotation)
Proteins: A4FHX1 , A1UCP3, A4T5V2, A3PWD0 ,Q1BCA1
Protein Name : Sulfide dehydrogenase (Flavoprotein) subunit
SudA sulfide dehydrogenase (Flavoprotein) subunit SudB
Gene Ontology: Biological Process: GO:005514 is_a child of
GO:0008152
molecular function: GO:0016491
PFAM Domains : PF00175. NAD_binding
PF07992. Pyr_redox (FAD_pyr_nucl-diS_OxRdtase.)
Evidence : IEA (Inferred Electronic Annotation)
Proteins: Q2J1U9, Q13CJ3, Q5PB24
Riboflavin Synthase domain-like
Protein Name : Putative uncharacterized
protein
Gene Ontology: Biological Process:
GO:005514 is_a child of GO:0008152
molecular function: GO:0016491
PFAM Domains : PF07992 - Pyr_redox
(Q0A5G3)
OR
PF08021. FAD_binding (A4FEM2, A1WVX7
)
Evidence : IEA (Inferred Electronic
Annotation)
Proteins: Q0A5G3, A4FEM2, A1WVX7
Ferredoxin reductase-like, C-terminal NADPlinked domain
Protein Name : Dihydroorotate dehydrogenase electron transfer
subunit, putative
Gene Ontology: Biological Process: GO:005514 is_a child of
GO:0008152
molecular function: GO:0016491
molecular function: GO:0016491
PFAM domains: PF00970. FAD_binding
PF00175. NAD_binding
Evidence : IEA (Inferred Electronic Annotation)
Proteins: A3CN91, Q73P17
Protein Name: Protein-P-II
uridylyltransferase
Gene Ontology: Biological Process:
GO:0008152 is a parent of GO:0006807
PFAM: PF01966 - NAD Binding
Evidence : IEA (Inferred Electronic
Annotation)
Protein: Q6MLQ2
Ref: http://www.uniprot.org/uniprot/
PreATP-grasp domain
(SCOP ID = 52440)
Glutathione synthetase ATP-binding
domain-like (SCOP ID = 56059)
Lots of different enzymes forming carbon–nitrogen bonds
have this combination of domains. Both domains contribute to
substrate binding and the active site, and the C-terminal
domain binds ATP as well as the other substrate;
GO:0006750
GO:0004086
GO:0005618
GO:0009374
GO:0003824
120
GO:0008152
GO:0009317
GO:0004075
GO:0008716
GO:0003989
GO:0016491
GO:0006633
GO:0006807
GO:0009252
GO:0016874
GO:0005737
GO:0004363
GO:0005524
75% IEA
100% IEA
100
80
Pre-ATP Grasp domain
60
Glutathione Synthetase ATP-Binding
40
Domain Like
Supra-Domain
20
0
Functional annotations were found to be shared
by proteins having the Supra-domains as well as
the single domains.
The percentage of proteins having Supradomains were much higher than single domains.
Since, both domains are required for the function
of the protein, the functions assigned to single
domain proteins may be said to be misannotated.
This study gave us motivation of developing a
computational tool for function annotation based
on domain combinations (domain patterns)
instead of individual domains
Utilize the relationship of the predicted
functions (as shown in the GO directed acrylic
graph of functions)
Provide a web-tool and visualization of the
predicted functions and their relationship
with domain patterns
Functional annotation term F (in this case a Gene
Ontology) and a domain set D. The probability that a
protein exhibiting D would possess F is modeled as
P(F|D)=P(D|F)P(F)/P(D)
(i.e., posterior probability of a function given a set of
domains D; P(D|F), P(F), and P(D) can be learned from
proteins with known functions)
Ref: Predicting protein function from domain content;
Forslund et al;Bioinformatics, Vol. 24 no. 15 2008,
pages 1681–1687
Gene Ontology database
gene_association.goa_uniprot
Swisspfam
For an input domain pattern (pfam domains):
All the Pfam pattern containing the given pattern are
extracted (e.g., if input domain pattern is A + B, all the
domain patterns that contain this domain pattern will be
considered, such as A + B + C, etc)
GO function associated with all the domain patterns are
extracted
Calculate the probability using P(F|D)=P(D|F)P(F)/P(D)
number of proteins that occurs with the domain pattern
possessing the function
If the percentage probabilities lie close to one another than
the parent GO function is found and a diagram depicting a
sum of the distance of the parent from the two children is
printed; otherwise the GO terms that have P(F|D) >= 0.9 *
Max{P(F|D)} are extracted
Summary graph providing all the GO functions associated
with the pattern search
A survey of annotation based on single
domains
Function annotation using domain patterns
(domain combinations) instead of individual
domains
To DO:
◦ Do a more thorough survey with the annotation
studies of single domains
◦ Define all the relationships between the GO ID’s in
the Summary Graph
◦ Refine and test the computational tool.
I would like to thank:
Dr. Yuzhen Ye
Faculty of the Department of Bioinformatics
Drs. Dalkilic, Kim, Hahn, Radivojac, Tang
Linda Hostetter and Rachel Lawmaster
Family and Friends