Eukaryotic Genome Annotation
Lieven Sterck1, Stéphane Rombauts1, Jeffrey Fawcett1, Yao-Cheng Lin1, Steven Robbens1, Jan
Wuyts1, Francis Dierick1, Pierre Rouzé2 and Yves Van de Peer1
1 Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium
2 INRA-associated to Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium
E-mail: [email protected]
URL: http://bioinformatics.psb.ugent.be/
Gene prediction and genome annotation have always been among the main research topics of our group. Over the past years we have
demonstrated the strength of our annotation platform and built a reputation in the field of genome annotation through a number of
collaborative efforts to annotate newly sequenced plant genomes. Although we are still involved in several annotation projects for
higher plants, we are increasingly asked to produce automatic genome annotations for a broader diversity of eukaryotic genomes,
such as fungi and algae.
Introduction
Raw sequence data on its own is not useful to biologists. To be meaningful it has to be converted into biologically significant
knowledge: markers, genes, RNAs, protein sequences. Genome annotation is the first step toward acquiring this knowledge.
A thorough annotation must take into account:
• similarities with known sequences (proteins, ESTs, other genomes, …)
• region content analysis
• signal prediction software (ATG, splice sites)
• integrated prediction tools (GenScan, FgenesH, …)
• all available significant biological knowledge
→ Try to automate this as much as possible through the use of annotation platforms.
The EuGene Annotation Platform
• Coding IMM (interpolated Markov model)
• Intron IMM
• Intergenic IMM
• SpliceMachine (splice site prediction)
• each base of the genomic sequence is represented individually (nodes)
• weighting, removal and addition of edges according to available information
• shortest path in the graph = a possible gene structure
→ Based on all the available information, EuGene will output a prediction of maximal score, i.e. maximally consistent with the provided information.
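The graph formulation above can be illustrated with a small sketch. It is a deliberately simplified toy and not EuGene's actual graph: it assumes only three labels per base, and the names `best_structure`, `content_cost` and `toy_switch_cost`, as well as all cost values, are hypothetical. Costs play the role of edge weights (lower is better, e.g. negative log scores), and an infinite cost corresponds to a removed edge.

```python
import math

STATES = ("intergenic", "exon", "intron")

def best_structure(content_cost, switch_cost):
    """Shortest path through a layered graph with one node per (base, state).

    content_cost[i][s]: cost of labelling base i with state s.
    switch_cost(i, a, b): cost of changing from state a to state b
    just before base i; math.inf means the edge has been removed.
    """
    dist = {s: content_cost[0][s] for s in STATES}
    back = []  # back[i - 1][b] = best predecessor state of b at base i
    for i in range(1, len(content_cost)):
        new_dist, ptr = {}, {}
        for b in STATES:
            def reach(a):
                # Cost of arriving in state b at base i coming from state a.
                return dist[a] + (0.0 if a == b else switch_cost(i, a, b))
            best_a = min(STATES, key=reach)
            new_dist[b] = reach(best_a) + content_cost[i][b]
            ptr[b] = best_a
        dist = new_dist
        back.append(ptr)
    # Trace the cheapest path back from the last base.
    state = min(STATES, key=dist.get)
    total = dist[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return total, path

def toy_switch_cost(i, a, b):
    # Hypothetical rule: no direct edge between intergenic and intron.
    if {a, b} == {"intergenic", "intron"}:
        return math.inf
    return 1.0

# Toy evidence: bases 3 to 5 look coding, the rest look intergenic.
costs = [
    {"intergenic": 0.1 if label == "I" else 2.0,
     "exon": 0.1 if label == "E" else 2.0,
     "intron": 2.0}
    for label in "IIEEEI"
]
total, path = best_structure(costs, toy_switch_cost)
print(path)
```

In this toy run the cheapest path labels the first two bases intergenic, the next three exon and the last one intergenic. Adding a source of evidence then amounts to changing `content_cost` or `switch_cost`, which mirrors the weighting, removal and addition of edges described above.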
[Figure: the genomic sequence is analysed by intrinsic approaches (start sites, splice sites, content potential for coding, intron and intergenic) and extrinsic approaches (Blastx, Blastn, RepeatMasker); EuGene combines both into predicted genes (structural annotation), given as location strings such as complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001)).]
Schematic representation of the EuGene platform. Depicted above is the basic set-up of EuGene; this scheme can be modified
according to the genome that has to be annotated and the available data.
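The predicted gene structures in the figure are written as location strings that appear to follow the GenBank-style `join(...)` / `complement(...)` notation. As a minimal sketch of how such a string can be turned into exon coordinates, assuming only this simple subset of the notation, the hypothetical helper `parse_location` below returns the strand and a list of (start, end) pairs. The example string is the one legible in the figure.

```python
import re

def parse_location(loc):
    """Parse a simple GenBank-style location into (strand, exons).

    Handles only plain 'a..b' ranges and single positions, optionally
    wrapped in join(...) and/or complement(...). Fuzzy ends, remote
    references and other operators are not covered.
    """
    loc = "".join(loc.split())  # drop whitespace and line breaks
    strand = "+"
    if loc.startswith("complement(") and loc.endswith(")"):
        strand = "-"
        loc = loc[len("complement("):-1]
    if loc.startswith("join(") and loc.endswith(")"):
        loc = loc[len("join("):-1]
    exons = []
    for part in loc.split(","):
        match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", part)
        if match is None:
            raise ValueError(f"unsupported location element: {part!r}")
        start = int(match.group(1))
        # A single position is treated as a one-base range.
        end = int(match.group(2)) if match.group(2) else start
        exons.append((start, end))
    return strand, exons

strand, exons = parse_location(
    "complement(join(10164..10295,10349..10420,10467..10514,"
    "10566..10626,10681..10770,10823..10949,11001))"
)
print(strand, len(exons), exons[0])
```

For real annotation files a dedicated parser is preferable, since the full location syntax has more cases than this sketch handles.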
EuGene is developed by T. Schiex and co-workers (INRA-Toulouse, France) in cooperation with our group.
• EuGene can be specifically adapted to the particularities of newly sequenced genomes, which leads to higher-quality predictions
• exploits probabilistic models such as Markov models to discriminate coding from non-coding sequences
• integrates information from several signal (splice site, translation start, ...) prediction programs or other third-party software
• exploits the wealth of existing sequences (mRNA, 5'/3' EST pairs, proteins, homologous genomic sequences)
• integrates each source of information through small independent software components, called "plugins"
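To make the Markov-model bullet concrete, here is a minimal sketch of how such models can discriminate coding from non-coding sequence. It uses plain fixed-order Markov chains with add-one smoothing, which is a simplification and not EuGene's own models. The function names and the tiny training sequences are invented for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, k):
    """Count how often each base follows each k-mer context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def log_prob(seq, counts, k):
    """Log-probability of seq under the model, with add-one smoothing."""
    total = 0.0
    for i in range(k, len(seq)):
        context = counts.get(seq[i - k:i], {})
        num = context.get(seq[i], 0) + 1
        den = sum(context.values()) + len(ALPHABET)
        total += math.log(num / den)
    return total

def coding_log_odds(seq, coding, noncoding, k):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_prob(seq, coding, k) - log_prob(seq, noncoding, k)

K = 2  # model order, chosen arbitrarily for this toy example
# Invented training data: GC-rich "coding" and AT-rich "non-coding" strings.
coding_model = train_markov(["ATGGCGGCGGCGGCGGCC", "GCGGCCGCGGCGGAG"], K)
noncoding_model = train_markov(["ATATTTAAATATTTATAA", "TTTATATAAATTATAT"], K)

gc_score = coding_log_odds("GCGGCGGCG", coding_model, noncoding_model, K)
at_score = coding_log_odds("ATATTTATA", coding_model, noncoding_model, K)
print(gc_score > 0, at_score < 0)
```

In a real setting the models would be trained on curated coding, intron and intergenic sequences from the genome at hand, and the resulting scores would be one of several sources of evidence rather than a prediction on their own.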
References
1: Schiex T, Moisan A, and Rouzé P. (2001) EuGène: An Eukaryotic Gene Finder that combines several sources of evidence. Computational Biology, Eds. O. Gascuel and M-F. Sagot, LNCS 2066, pp. 111-125
2: Tuskan et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray ex Brayshaw). Science 313, 1596-1604
3: Derelle et al. (2006) Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc. Natl. Acad. Sci. USA 103, 11647-11652
This work is supported by the European Commission (QLRI-CT-2001-00006)