Project : Operon Prediction


Improving
Gene Function Prediction
Using Gene Neighborhoods
Kwangmin Choi
Bioinformatics Program
School of Informatics
Indiana University, Bloomington, IN
Introduction :
PLATCOM (A Platform for Computational Comparative Genomics)


PLATCOM is a system for the comparative analysis of multiple genomes.
PLATCOM consists of 3 components:

Databases of biological entities
  e.g. fna, faa, ptt, gbk…
Databases of relationships among entities
  e.g. genome-genome, protein-protein pairwise comparison
Mining tools over the databases
The web interface of the PLATCOM system is located at
http://biokdd.informatics.indiana.edu/kwchoi/platcom/
PLATCOM Web Interface
Frontpage of Genome Plot
Background :
What is an operon?

http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units
The operon structure was discovered in 1960 by two French biologists.
Jacob,F. and Monod,J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.

An operon is a group of genes that encode functionally linked proteins. The genes of an operon are :
  Adjacent (intergenic distance within 200-300 nt)
  On the same strand (+ or -)
  Co-expressed from one promoter.
Background :
How can operon structure be identified or predicted?

When a promoter and terminator are known :
  Gene clusters = Transcription Units
  Classical concept of operon
When a promoter is not known :
  Gene clusters = Directons
  Hypothetical operon candidates
  Depending on direction and proper intergenic distance (200-300 nt)
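The directon idea above can be sketched in a few lines of Python. This is a minimal illustration, not the PLATCOM Perl tool: the `Gene` record, the function name `find_directons`, and the toy coordinates are all assumptions made for the example. Consecutive genes are grouped when they lie on the same strand and the intergenic gap does not exceed a threshold (300 nt here, the upper end of the range quoted above).

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str    # hypothetical gene label
    start: int   # leftmost coordinate (nt)
    end: int     # rightmost coordinate (nt)
    strand: str  # "+" or "-"


def find_directons(genes: List[Gene], max_gap: int = 300) -> List[List[Gene]]:
    """Group consecutive same-strand genes whose intergenic gap is <= max_gap."""
    clusters: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            prev = clusters[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                clusters[-1].append(gene)
                continue
        clusters.append([gene])
    return clusters


# Toy input (invented coordinates, not taken from any genome).
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 450, 900, "+"),    # gap of 49 nt, same strand: joins "a"
    Gene("c", 1000, 1500, "-"),  # strand switch: starts a new cluster
    Gene("d", 2500, 3000, "-"),  # gap of 999 nt: starts a new cluster
]
directons = find_directons(genes)
cluster_names = [[g.name for g in cluster] for cluster in directons]
print(cluster_names)
```

Single-gene clusters are kept in the output; a caller interested only in operon candidates would filter for clusters with two or more genes.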
Computational methods have been developed to find
gene clusters in bacterial genomes.
PCBBH and PCH
R.Overbeek et al. PNAS, 1999, Vol.96, pp.2896-2901
PCBBH : Pair of Close Bidirectional Best Hits
BBH : Bidirectional Best Hits
PCH : Pair of Close Homologs
COG : Clusters of Orthologous Groups
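As a rough illustration of the BBH and PCBBH definitions, the sketch below takes a table of pairwise similarity scores between genes of genome A and genome B and reports bidirectional best hits, then pairs of BBHs whose genes are close in both genomes. The score table, the gene labels, and the `close_a` / `close_b` sets (which stand in for "close on the chromosome", e.g. same strand and within the intergenic distance limit) are hypothetical inputs; how closeness is decided is left to the caller.

```python
from typing import Dict, FrozenSet, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (gene in A, gene in B) -> similarity score
Pair = Tuple[str, str]


def best_hits(scores: Scores, forward: bool) -> Dict[str, str]:
    """Best-scoring partner for each gene of A (forward) or of B (reverse)."""
    best: Dict[str, Tuple[str, float]] = {}
    for (a, b), score in scores.items():
        query, hit = (a, b) if forward else (b, a)
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> Set[Pair]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_to_b = best_hits(scores, forward=True)
    b_to_a = best_hits(scores, forward=False)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


def pcbbh(bbh: Set[Pair],
          close_a: Set[FrozenSet[str]],
          close_b: Set[FrozenSet[str]]) -> Set[Tuple[Pair, Pair]]:
    """Pairs of BBHs whose genes are close in genome A and in genome B."""
    found: Set[Tuple[Pair, Pair]] = set()
    for xa, xb in bbh:
        for ya, yb in bbh:
            if (xa < ya
                    and frozenset((xa, ya)) in close_a
                    and frozenset((xb, yb)) in close_b):
                found.add(((xa, xb), (ya, yb)))
    return found


# Toy scores (invented): a3 hits b2, but b2 prefers a2, so a3 has no BBH.
scores = {
    ("a1", "b1"): 900.0, ("a1", "b2"): 200.0,
    ("a2", "b2"): 800.0, ("a3", "b2"): 300.0,
}
bbh = bidirectional_best_hits(scores)
pairs = pcbbh(bbh,
              close_a={frozenset(("a1", "a2"))},
              close_b={frozenset(("b1", "b2"))})
print(sorted(bbh), pairs)
```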
Background :
Über-operon

: P.Bork et al. Trends Biochem. Sci., Vol. 25, pp. 474-479
Über-operon : A set of genes with close functional and
regulatory contexts that tends to be conserved despite
numerous rearrangements.
This concept focuses on the functional themes of operons, not on
specific genes or gene order.
Background :
Why are gene clusters conserved?



Certain operons, particularly those that encode subunits of
multiprotein complexes (e.g. ribosomal proteins) are
conserved in phylogenetically distant bacterial genomes.
These gene clusters might have been conserved since the
last universal common ancestor. Why?
Selfish-operon hypothesis : Horizontal transfer of an entire
operon is favored by natural selection over transfer of
individual genes because co-expression and co-regulation
are preserved.
Background :
Problems in Operon Prediction.




Over 150 genomes have been fully sequenced to date, but
the biological functions of some genes are still unknown.
There are only a few promoter detection algorithms, and they are
not fully satisfactory.
In many cases, genomic data files do not provide full
information on genes and their products (e.g. gene name, COG, PID).
Operons tend to undergo multiple rearrangements during
evolution.
  As a result, gene order at a level above the operon is poorly conserved (e.g. genes
  involved in de novo purine synthesis).
Background :
Problems in Computational Algorithms to Predict Operons

Direct Signal Finding
  Transcription promoters (5’-end) and terminators (3’-end) are searched.
  Only effective for species whose transcription signals are well known, e.g. E.coli.
Experiment-based approach
  Combination of gene expression data, functional annotation
  and other experimental data.
Literature-based approach
  Primarily applicable to well studied genomes such as E.coli, because data files are
  incomplete for other genomes.
  In many cases, genomic data files do not provide full information on genes and
  their products (e.g. gene name, COG, PID).
Procedure

As a part of the PLATCOM project, an integrated whole-genome
analysis system was built on the BIOKDD server.
  A web interface for the all-to-all pairwise comparison DB and tools is also provided.
Several tools for multiple-genome analysis were written in
Perl, and gene neighborhoods were then reconstructed from
the clustering data.
  My gene clustering algorithm was used to compensate for the shortcomings of the
  literature-based approach.
Connected gene neighborhoods were analyzed to predict
gene function and functional coupling between clusters.
Materials/ Tools

Raw Data
  22 genomes were chosen for this study. (14 groups)
  Protein-Protein Pairwise Comparison Data
  PTT files from NCBI site
    e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt
Data Generated by Web Tools
  Gene Clustering Data (based on sequence homology)
    e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt
  Gene Clusters generated from PTT file (given intergenic distance)
    e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt
    e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm
E. coli database for reality check
  http://biocyc.org/
  http://ecocyc.org/
Genomes
http://www.infobiogen.fr/services/deambulum/english/genomes2a.html
Procedure
My Approach to Reconstruct Genomic Neighborhoods

The idea underlying this study is that
  Different genomes contain different, overlapping parts of evolutionarily and
  functionally connected gene neighborhoods.
  By generating a “Tiling Path”, the entire neighborhood can be reconstructed.
The genomic context of a well-known genome (e.g. E.coli) is used as
a contextual framework.
  Start by looking at this framework, then search for a group of similar gene
  neighborhoods in the target genomes.
  “Genomic context” means the pattern of a series of COGs. If a COG is not given, we can
  predict the function of an unknown gene based on my gene clustering data.
  We can also identify some “Hitchhikers”. “Hitchhikers” are inserted genes that
  originate from different contexts/themes.
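The tiling idea can be illustrated with a small greedy merge. This is only a sketch under simplifying assumptions, not the procedure used in the study or in the cited tiling-path paper: each neighborhood is reduced to a list of COG labels, the function name `tile_neighborhood` and the `min_shared` overlap threshold are hypothetical, and the labels "A" to "E", "X", "Y" are placeholders rather than real COGs. Starting from the reference neighborhood, any neighborhood that shares enough COGs with the growing set is merged in, and the scan repeats until nothing more can be added.

```python
from typing import Dict, List, Set


def tile_neighborhood(seed: List[str],
                      neighborhoods: Dict[str, List[str]],
                      min_shared: int = 2) -> Set[str]:
    """Grow a COG set by repeatedly merging neighborhoods that overlap it."""
    merged: Set[str] = set(seed)
    unused = dict(neighborhoods)
    changed = True
    while changed:
        changed = False
        for genome, cogs in list(unused.items()):
            if len(merged & set(cogs)) >= min_shared:
                merged |= set(cogs)   # this neighborhood extends the tiling
                del unused[genome]
                changed = True
    return merged


# Placeholder data: genome2 only overlaps after genome1 has been merged.
reference = ["A", "B", "C"]
targets = {
    "genome2": ["C", "D", "E"],
    "genome1": ["B", "C", "D"],
    "genome3": ["X", "Y"],       # unrelated neighborhood, never merged
}
tiled = tile_neighborhood(reference, targets)
print(sorted(tiled))
```

The sketch discards gene order and keeps only membership; genes that enter through a single weakly overlapping neighborhood are natural candidates to inspect as hitchhikers.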
Tiling Path
V.Koonin et al. Nucleic Acids Research, 2002, Vol.30, No.10, pp. 2212-2223
Gene Neighborhoods
Results

Case 1
 Relationship between Gene Order and Phylogenetic Distance
Case 2
 One theme : Typical Operon (rbs operon)
   Reconstruct gene neighborhoods
   Find missing components from the reconstructed gene clusters.
Case 3
 Two or more themes : Functional Coupling ?
   Find genomic hitchhikers
   Predict gene function of uncharacterized protein
   Predict functional coupling
Case 1 : Gene Order and Phylogenetic Distance

If the gene order of two genomes is well conserved, the
sequence of homologs should appear as a line on the
genome comparison diagonal plot.
What is the relationship between phylogenetic distance
and the conservation of gene order?
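A minimal sketch of how a "line" on the diagonal plot could be detected: each homolog pair is a point (gene index in genome 1, gene index in genome 2), and a conserved stretch is a run of points that advances by one gene in both genomes. The function name `collinear_runs`, the `min_len` cutoff, and the toy points are assumptions for illustration; the sketch only follows the forward diagonal and ignores inverted segments.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (gene index in genome 1, gene index in genome 2)


def collinear_runs(pairs: List[Point], min_len: int = 3) -> List[List[Point]]:
    """Return runs of homolog pairs that step by +1 in both genomes."""
    runs: List[List[Point]] = []
    current: List[Point] = []
    for point in sorted(pairs):
        if (current
                and point[0] == current[-1][0] + 1
                and point[1] == current[-1][1] + 1):
            current.append(point)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [point]
    if len(current) >= min_len:
        runs.append(current)
    return runs


# Toy points (invented): one run of three genes, the rest scattered.
pairs = [(1, 10), (2, 11), (3, 12), (4, 40), (7, 3), (8, 4)]
runs = collinear_runs(pairs)
print(runs)
```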
Phylogenetic Tree
V.Daubin et al. Genome Research, Vol 12, Issue 7, 1080-1090
Genome Comparison Diagonal Plot
: Phylogenetically-Distant Species (Z-score > 500)
Genome Comparison Diagonal Plot
: Phylogenetically-Close Species (Z-score > 1000)
Fragmented Gene Clusters
Case 1 : Conclusion

Gene order in phylogenetically-distant species is poorly
conserved.
But this observation does not mean that gene order is
conserved very well among phylogenetically-close
species.
  Even in the case of very close species (e.g. E.coli vs. H.influenzae), gene order is
  completely scattered.
In most cases, only a small number of genes are observed
as a short line or cluster, and we may consider such a cluster
a putative operon.
In the next step, this possibility is investigated in more depth.
Case 2 : Rbs Operon (Typical Operon)

Theme : Ribose transport across membrane

COG1869 D-ribose high-affinity transport system; membrane-associated protein
COG1129 ATP-binding component of D-ribose high-affinity transport system
COG1172 D-ribose high-affinity transport system
COG1879 D-ribose periplasmic binding protein
COG0524 ribokinase
COG1609 regulator for rbs operon
http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206
Case 2 : Rbs Operon
Z-score > 750, Intergenic Distance = 300 nt
Case 2 : Conclusion

All components are involved in ribose transport across the
bacterial cell membrane.
In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.
  10 out of 22 genomes have this operon system.
  Except in some cases, this gene order pattern is conserved very well.
So it is possible that there exists a kind of “General
Contextual Framework” of gene order.
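One simple way to test whether a genome carries this gene order pattern is to scan its series of COG numbers for the pattern, read in either direction (the operon may lie on either strand). The sketch below is an assumption-laden illustration: the function name `contains_pattern` and the toy genomes are hypothetical, and "xxxx" / "yyyy" are placeholders for unrelated flanking COGs. It requires an exact contiguous match, so it would not tolerate insertions or missing components.

```python
from typing import List

RBS_PATTERN = "1869-1129-1172-1879-0524-1609".split("-")


def contains_pattern(gene_cogs: List[str], pattern: List[str]) -> bool:
    """True if the pattern occurs contiguously, read in either direction."""
    n = len(pattern)
    reverse = pattern[::-1]
    return any(gene_cogs[i:i + n] in (pattern, reverse)
               for i in range(len(gene_cogs) - n + 1))


# Toy genomes (invented COG series, not real annotations).
same_order = ["xxxx"] + RBS_PATTERN + ["yyyy"]
opposite_strand = ["xxxx"] + RBS_PATTERN[::-1] + ["yyyy"]
rearranged = ["xxxx", "1129", "1869", "1172", "1879", "0524", "1609"]

results = [contains_pattern(g, RBS_PATTERN)
           for g in (same_order, opposite_strand, rearranged)]
print(results)
```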
Case 3 : Functional Coupling of 2 or more themes

Theme 1 : Transcription

COG0779 Uncharacterized Conserved Protein
COG0195 Transcription elongation factor
COG2740 Predicted nucleic-acid-binding protein (transcription termination?)

Theme 2 : Translation

COG1358 Ribosomal protein S17E
COG0532 Translation initiation factor 2 (GTPase)
COG1550 Uncharacterized Conserved Protein
COG0858 Ribosome-binding factor A
COG0184 Ribosomal protein S15P/S13E
COG0130 tRNA Pseudouridine synthase

Hitchhiker ?

COG0196 FAD Synthase (Hitchhiker?)
http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341

Case 3 : Functional Coupling
Z-score > 750, Intergenic Distance = 300 nt
Case 3 : Conclusion

Functional Coupling :
  In bacteria, transcription, translation and RNA modification/degradation are coupled,
  and the advantages of co-regulating the corresponding genes are obvious.
  COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription
  Elongation Factor), so it is likely to be a functional partner of COG0195.
Hitchhiker :
  The association of COG0196 (FAD synthase) is not as tight as the connections
  between the genes belonging to the themes.
Gene function prediction :
  The functions of 3 genes in the AE0004092 genome can be predicted by reading genomic
  context.
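The notion of a "tight" versus a loose association can be made concrete with a co-occurrence fraction: among the neighborhoods that contain one COG, how many also contain the other? The sketch below is an illustration only. The function name `cooccurrence` is hypothetical, and the four neighborhoods are invented toy input chosen to mimic the qualitative picture described above; they are not the results of this study.

```python
from typing import List, Set


def cooccurrence(neighborhoods: List[Set[str]], anchor: str, partner: str) -> float:
    """Fraction of neighborhoods containing `anchor` that also contain `partner`."""
    with_anchor = [n for n in neighborhoods if anchor in n]
    if not with_anchor:
        return 0.0
    return sum(partner in n for n in with_anchor) / len(with_anchor)


# Invented toy neighborhoods (one set of COG labels per genome).
toy = [
    {"COG0779", "COG0195", "COG2740", "COG0532"},
    {"COG0779", "COG0195", "COG0532", "COG0858"},
    {"COG0779", "COG0195", "COG0196"},
    {"COG0779", "COG0195", "COG0130"},
]
tight = cooccurrence(toy, "COG0195", "COG0779")
loose = cooccurrence(toy, "COG0195", "COG0196")
print(tight, loose)
```

A partner with a fraction near 1 behaves like a member of the theme, while a low fraction is what one would expect of a hitchhiker.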
Conclusion

The Genome Comparison Diagonal Plot visualizes the sequence
comparison of 2 genomes. It is a simple tool, but it gives a
strong intuitive picture of genome structure.
Conserved gene neighborhoods reconstructed from many
genomes by the Tiling Path Method can be used to predict the
functions of uncharacterized genes and functional coupling
between well-characterized genes in those genomes.
Ultimately, we can use these methods to reconstruct metabolic
and functional subsystems.
Acknowledgements

Haifeng Zhao
  Genome Pairwise Comparison DB
Scott Martin
  Server Management and Technical Support
Dr. Sun Kim
  Graduate Advisor and P.I.