From Genome Sequencing to Biology in the Lab of Milk and
BeeBase - The Honey Bee Model
Organism Database
Chris Elsik
[email protected]
Outline
• BeeBase - what it is now
• How it works
• Future Plans
BeeBase
http://racerx00.tamu.edu/PHP/bee_search.php
• Predicted Gene and Homolog Search Page
• Genome Browser
• Comparative Map Viewer
• Protein Families Database with Bee, Fly and
Mosquito proteins
• The newest assembly (release 2.0)
http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2
Gbrowse
• A module of the Generic Model Organism
Database Project (GMOD), www.gmod.org
• A graphical viewer of features along a
reference sequence
• Based on MySQL and Perl
• The configuration file allows us to
– Change fonts, colors, text.
– Change overview – sequence scaffold,
contig, genetic map, karyotype.
– Define tracks.
– Modify track appearance.
Gbrowse Internals
• BioPerl Library - allows browser to run on top
of a variety of database management
systems and schemata
• Bio::Graphics module - used to graphically
render any type of nucleotide or protein
feature
• Bio::DB::GFF Database - uses a flat
coordinate system to represent genomic
features. Optimized for queries that retrieve
features by ID, type or region of genome
Our task is to generate GFF data
• GFF = generic feature format
• A standard format that aids data exchange
• Allows you to specify a substring of a
biological sequence
• The current version (2) uses terms from the
Sequence Ontology project
- A set of terms used to describe features on a
nucleotide or protein sequence. It encompasses
both "raw" features, such as nucleotide similarity
hits, and interpretations, such as gene models.
• For information on the specifications:
http://www.sanger.ac.uk/Software/formats/GFF/
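A GFF record is one line of tab-separated fields. The sketch below writes and reads such a line using the nine GFF version 2 columns (seqname, source, feature, start, end, score, strand, frame, group). The scaffold name, marker name and the `Target "Class:name" start end` group value are made-up illustrations, not BeeBase data.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    """One GFF2 record; coordinates are 1-based and inclusive."""
    seqname: str   # reference sequence, e.g. a scaffold
    source: str    # program or database that produced the feature
    feature: str   # feature type
    start: int
    end: int
    score: str     # "." when there is no score
    strand: str    # "+", "-" or "."
    frame: str     # "0", "1", "2" or "."
    group: str     # free-text grouping/attribute field (may be empty)


def format_gff(f: GffFeature) -> str:
    """Join the nine fields with tabs."""
    return "\t".join(str(v) for v in f)


def parse_gff(line: str) -> GffFeature:
    """Parse one GFF2 line; the ninth (group) column is optional."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 8:
        cols.append("")
    if len(cols) != 9:
        raise ValueError(f"expected 8 or 9 columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, group = cols
    return GffFeature(seqname, source, feature, int(start), int(end),
                      score, strand, frame, group)


# Hypothetical example record: a marker hit on a scaffold.
example = GffFeature("Group1.1", "BLASTN", "similarity", 1200, 1450,
                     "460", "-", ".", 'Target "Marker:A0001" 1 251')
example_line = format_gff(example)
```

`parse_gff(example_line)` returns a record equal to `example`, so the same two helpers can be used to check a file before loading it into the browser database.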
Computing Data for Tracks
• Markers
– Compare marker sequences to genome scaffolds
using BLASTN
– Use ePCR (primersearch) for markers with primers,
but no sequence
• ESTs
– Compare ESTs to genome scaffolds using fasta or
BLAT
– Use exonerate (http://www.ebi.ac.uk/~guy/exonerate/)
to predict exon/intron boundaries for each match
• Protein Homologs
– Compare protein sequences to genome scaffolds
using tfastx to identify matches
– Use exonerate to predict exon/intron boundaries for
each match
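Each of these comparisons ends with the same step: turning a match into a GFF line for the browser. A minimal sketch of that step is below, assuming BLASTN was run with the marker as query and the scaffolds as database, and that the output is the standard 12-column tabular format. The source, feature and class names are illustrative assumptions, not the values BeeBase uses.

```python
def blast_hit_to_gff(line, source="BLASTN", feature="similarity",
                     target_class="Marker"):
    """Convert one line of 12-column BLAST tabular output to a GFF2 line.

    Columns: query, subject, %identity, length, mismatches, gap opens,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    query, subject = cols[0], cols[1]
    qstart, qend, sstart, send = (int(c) for c in cols[6:10])
    bitscore = cols[11].strip()

    # BLASTN reports minus-strand hits with sstart > send;
    # GFF always wants start <= end plus an explicit strand.
    strand = "+" if sstart <= send else "-"
    start, end = min(sstart, send), max(sstart, send)

    group = f'Target "{target_class}:{query}" {qstart} {qend}'
    return "\t".join([subject, source, feature, str(start), str(end),
                      bitscore, strand, ".", group])


# Hypothetical hit: marker A0001 matches scaffold Group1.1 on the minus strand.
hit = "A0001\tGroup1.1\t98.80\t251\t3\t0\t1\t251\t1450\t1200\t1e-130\t460"
gff_line = blast_hit_to_gff(hit)
```

Matches refined with exonerate would need one line per exon rather than one line per hit; that case is not covered here.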
Annotating Tracks
• The most time-consuming task in computing tracks is
providing annotations for protein homologs.
• Annotations come from different sources and are in
different formats depending on protein dataset.
• We use UniProt for all homolog tracks in assembly 1.1
and 1.2 browsers.
• Assembly 2 uses proteome sets for Drosophila (FlyBase),
C. elegans (WormBase), Yeast (SGD), Mosquito
(Ensembl) and Human (Ensembl) to avoid redundancy
within proteomes.
– The fasta formatted sequences are not annotated (except yeast).
• The “other insect” track will come from UniProt.
– To identify which sequences are insect, we use taxon-id and
a locally installed NCBI taxonomy database.
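The insect test can be sketched as a walk up the taxonomy tree: follow parent links from a sequence's taxon-id until Insecta or the root is reached. The sketch assumes the parent links come from the `nodes.dmp` file of the NCBI taxonomy dump (fields separated by tab, pipe, tab). The sample lines are abbreviated so that each species points straight at a higher node; real lineages have many intermediate ranks.

```python
INSECTA_TAXID = 50557  # NCBI taxon-id for the class Insecta


def load_parents(nodes_dmp_lines):
    """Build {taxon-id: parent taxon-id} from nodes.dmp-style lines."""
    parents = {}
    for line in nodes_dmp_lines:
        fields = line.split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_descendant(taxid, ancestor, parents):
    """True if `ancestor` is `taxid` itself or lies on its path to the root."""
    seen = set()
    while taxid not in seen:
        if taxid == ancestor:
            return True
        seen.add(taxid)
        parent = parents.get(taxid)
        if parent is None or parent == taxid:  # unknown id, or root (its own parent)
            return False
        taxid = parent
    return False  # cycle guard; should not happen with a valid dump


# Abbreviated, illustrative lineage table (intermediate ranks omitted).
sample_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",         # root
    "50557\t|\t1\t|\tclass\t|\n",       # Insecta
    "7460\t|\t50557\t|\tspecies\t|\n",  # Apis mellifera
    "7227\t|\t50557\t|\tspecies\t|\n",  # Drosophila melanogaster
    "9606\t|\t1\t|\tspecies\t|\n",      # Homo sapiens
]
sample_parents = load_parents(sample_nodes)
```

With this table, `is_descendant(7227, INSECTA_TAXID, sample_parents)` is true and the same call for 9606 is false, so UniProt entries can be filtered on their taxon-id alone.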
CMAP
• CMap is a web-based tool that allows users
to view comparisons of genetic and physical
maps.
• The package also includes tools for curating
map data.
• Based on MySQL and Perl
• Consists of modules for data, logic (how maps
are laid out), and presentation.
• Our work is to modify the configuration file
and format data.
Future BeeBase Plans
• Redo protein families analysis after final gene
prediction set is released; add proteins from
additional model organisms (worm, yeast, mouse,
human)
• Phylogenetic analysis to identify orthologs
• Gene Ontology assignment
• Create gene pages for each gene, similar to FlyBase,
using the new “Turnkey gmod-web” module
More BeeBase Plans
• Curate literature for orthologs to provide an
entry into the BeeSpace conceptual
navigation system.
• Incorporate QTL viewer using Dave Adelson’s
QTL viewer software, which was developed
for cattle.
• Incorporate OpenGeneX gene expression
database and expression data from the
BeeSpace project.
Gene Ontology For Honey Bee
Gene Ontology Consortium
http://www.geneontology.org/
• “The goal of the Gene Ontology™ (GO) Consortium is to produce a controlled
vocabulary that can be applied to all organisms even as knowledge of gene and
protein roles in cells is accumulating and changing.”
• GO provides three structured networks of defined terms to describe gene product
attributes.
• Molecular Function Ontology: the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
• Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that
are accomplished by ordered assemblies of molecular functions
• Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and origin recognition complex
GO Evidence Codes
• IDA inferred from direct assay - Enzyme assays, In vitro reconstitution (e.g. transcription),
Immunofluorescence (for cellular component), Cell fractionation (for cellular component), Physical
interaction/binding assay
• IEP inferred from expression pattern - useful for biological process ontology
• IGI inferred from genetic interaction - "Traditional" genetic interactions such as suppressors,
synthetic lethals, etc., Functional complementation, Rescue experiments, Inference about one
gene drawn from the phenotype of a mutation in a different gene
• IMP inferred from mutant phenotype - Any gene mutation/knockout, Overexpression/ectopic
expression of wild-type or mutant genes, Anti-sense experiments, RNAi experiments, Specific
protein inhibitors
• IPI inferred from physical interaction - 2-hybrid interactions, Co-purification, Co-immunoprecipitation, Ion/protein binding experiments
• IEA inferred from electronic annotation
• ISS inferred from sequence or structural similarity
• IC inferred by curator; TAS traceable author statement; NAS non-traceable author statement; ND
no biological data available; NR not recorded
Applying GO to Honey Bee
• We must rely heavily on IEA (inferred from electronic
annotation - no curator) or ISS (inferred from
sequence similarity - inspected by curator)
• We must make the most reliable inferences possible based on orthology instead of homology
Background:
Evolution-based functional inference
and orthology
Evolution Allows us to Infer Function
• The most powerful method for inferring the function of a gene or
protein is a similarity search against a sequence database.
• Our ability to characterize biological properties of a protein using
sequence data alone stems from properties conserved through
evolutionary time.
• Homologous (evolutionarily related) proteins always share a
common 3-dimensional folding structure.
• They often contain common active sites or binding domains.
• They frequently share common functions.
• Predictions made using similar, but non-homologous proteins
are much less reliable.
Orthologs
• Homologs = genes that are evolutionarily related
• There are two kinds of homologs:
• Orthologs = genes in different species that have diverged from a
common gene in an ancestral species.
• Paralogs = genes that have diverged due to gene duplication.
• Orthologs are more likely than paralogs to have conserved
function.
• Orthologs cannot be identified using BLAST or FASTA sequence
comparison alone.
• Reliable ortholog identification requires phylogenetic methods.
Example Gene Tree (with plant genes)
[Figure: gene tree with leaves Rice-2b, Rice-2a, Maize-2, Wheat-2, Sorghum-2, Barley-1, Wheat-1, Maize-1, Sorghum-1 and Arabidopsis; labels on the tree mark paralogs and orthologs.]
The outgroup, Arabidopsis, is a dicot. The cereals are
monocots. Monocots and dicots diverged ~230
million years ago. Monocots diverged from
each other ~60 mya.
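Once a gene tree has its internal nodes labelled as speciation or duplication events, the ortholog/paralog call for any two genes follows from their most recent common node: speciation means orthologs, duplication means paralogs. The sketch below applies that rule to a small made-up tree. It is not the tree from the slide, and it assumes the event labels have already been worked out by a phylogenetic method.

```python
from itertools import product


def leaves(node):
    """All gene names under a node. A leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Map each pair of genes to 'ortholog' or 'paralog'.

    An internal node is (event, left, right) with event either
    'speciation' or 'duplication'. Genes on opposite sides of a node
    have that node as their most recent common ancestor.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    if event not in ("speciation", "duplication"):
        raise ValueError(f"unknown event: {event}")
    kind = "ortholog" if event == "speciation" else "paralog"
    for a, b in product(leaves(left), leaves(right)):
        out[frozenset((a, b))] = kind
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


# Hypothetical tree: a duplication in the cereal ancestor, then speciation.
toy_tree = ("speciation", "Arabidopsis",
            ("duplication",
             ("speciation", "Rice-1", "Maize-1"),
             ("speciation", "Rice-2", "Maize-2")))
toy_pairs = classify_pairs(toy_tree)
```

In this toy tree Rice-1 and Maize-1 are orthologs, while Rice-1 and Maize-2 are paralogs even though they come from different species, which is the case a best-hit sequence search cannot distinguish.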
Why shouldn’t we depend on
inferences based on paralogs?
• Paralogs emerge after a gene duplication.
• Possible fates of duplicated genes:
– Loss of function for one of the duplicates - lack of
selective pressure allows gene to mutate beyond
recognition
– Emergence of new functional paralogs - one duplicate
acquires a new function, so selection favors its
maintenance in the genome
– Sub-functionalization - both duplicates are required to
maintain the function of the original
Back to Gene Ontology for Honey Bee:
Proposed Evidence Codes within ISS
• ISS = inferred from sequence similarity (inspected by a curator)
• We can break this down into:
• Inferred from homology (lowest)
• Inferred from an ortholog in one species
• Inferred from orthologs in more than one species, all of which have the same
GO classification (highest).
– What if they don’t all have the same GO classification? Move up in
the directed acyclic graph to a point where GO classifications converge.
– This can be tricky since the graph is directed and acyclic, and each node can
have more than one parent.
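"Moving up to where classifications converge" can be sketched as finding the most specific terms that are ancestors of every ortholog's GO term. Because a term can have more than one parent, the answer may be more than one term. The term names and parent links below are invented for illustration and do not reproduce the real GO graph.

```python
def ancestors(term, parents):
    """The term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def convergence_terms(terms, parents):
    """Most specific terms shared by the ancestries of all given terms."""
    if not terms:
        raise ValueError("need at least one term")
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # Drop any shared term that is a proper ancestor of another shared term.
    return {t for t in common
            if not any(t != other and t in ancestors(other, parents)
                       for other in common)}


# Hypothetical mini-ontology: child term -> list of parent terms.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "nucleotide_binding": ["binding"],
    "atp_binding": ["nucleotide_binding"],
    "gtp_binding": ["nucleotide_binding"],
    "atpase": ["catalytic", "atp_binding"],    # two parents
    "helicase": ["catalytic", "atp_binding"],  # two parents
}
```

Here `convergence_terms(["atp_binding", "gtp_binding"], toy_parents)` gives the single term `nucleotide_binding`, while `convergence_terms(["atpase", "helicase"], toy_parents)` gives both `catalytic` and `atp_binding`, showing why multiple parents make the choice less straightforward.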
Some Ongoing Gene Ontology
Work in the Elsik Lab - Cattle
• Cattle EST Gene Family Database
• Cattle gene families were created using
assembled, translated ESTs grouped with
homologous human protein families.
• Database is searchable using GO for the
human proteins.
• The next step is phylogenetic analysis to
identify human/cattle orthologs.
Searching by Gene Ontology
Borrowing More From Cattle
• Bovine QTL Database - David Adelson,
TAMU
The Bovine QTL viewer Interface
Image showing all chromosomes
Image showing one chromosome
QTL Details
OpenGeneX
• Web-based access to database
• PostgreSQL
• Includes a client-side Java application as a curation
tool; it formats data in MAGE-ML
• Includes several statistical routines and data
analysis tools
– Uses R statistical analysis package (open
source)
Acknowledgements
• Elsik Lab
– Justin Reese
– Kyounghwa Bae
– Anand Venkatraman
– Shreyas Murthi
– Michael Dickens
– Juan Anzola
• Collaborators
– Bruce Schatz, Gene Robinson and the BeeSpace group, UIUC
– William Gelbart - FlyBase (Harvard University)
– Spencer Johnston (TAMU)
– Danny Weaver, Bee Power LP