The Ensembl Database - Washington University in St. Louis
Web Databases for Drosophila
Introduction to FlyBase and
Ensembl Database
Wilson Leung
6/06
Outline
Introduction to FlyBase
Introduction to Ensembl
Using web databases to assist annotation of
novel sequences
Introduction to FlyBase
Available at http://www.flybase.org
Introduction to FlyBase
FlyBase is primarily funded by the National
Institutes of Health
FlyBase consortium includes Drosophila
researchers and computer scientists at Harvard
University, Indiana University, and University of
Cambridge, plus scientists worldwide
In addition to the main site at www.flybase.org,
there are also many mirror sites
What is FlyBase?
It is a comprehensive database of genetic and
molecular data for many Drosophila species:
Information on genes and mutant alleles
Expression and function of gene products
Genetic, cytological, molecular map information
Data from Berkeley Drosophila Genome Project
Data from European Drosophila Genome Project
Introduction to Ensembl
Available at http://www.ensembl.org
What is Ensembl?
Ensembl is a joint project between the
European Bioinformatics Institute (EBI) and the
Wellcome Trust Sanger Institute
Ensembl seeks to develop an automated
system for the production and maintenance of
annotations on eukaryotic genomes
These annotations should also be easily
accessible to researchers
What is Ensembl?
While originally developed for eukaryotes, the
Ensembl system has also been used to
analyze prokaryotic genomes
EBI Genome Review (archaea and bacteria)
Most recent version is v38 (Apr 2006)
Genomes available include human, chimp, mouse,
dog, C. elegans, fruit fly, honey bee, mosquito
among others
Ensembl Gene Annotation System
All Ensembl gene
predictions are based on
experimental evidence
Predictions are based on
the manually curated
UniProt/Swiss-Prot and RefSeq
databases
UTRs are annotated
only if they are supported
by EMBL mRNA records
Val Curwen et al. The Ensembl Automatic
Gene Annotation System. Genome Res., May
2004; 14: 942-950.
Using Web Databases for Annotation
List of available species in the
FlyBase BLAST service to use
in a search for sequences
homologous to your query
Exon View in Ensembl: used
to obtain sequence of a gene,
exon-by-exon
Using Web Databases for Annotation
Motivations for using FlyBase
Learn the biological functions of the gene of interest
Use FlyBase BLAST service to detect sequence
homology to Drosophila species or species related to
Drosophila
Motivations for using Ensembl
Obtain records of gene from multiple databases
Obtain coding sequence of each exon of a gene
Walkthrough
A typical use of web databases is to identify the
putative homolog of a D. melanogaster gene
We have a novel 20 kb sequence from D. erecta
Using RepeatMasker, we masked all Drosophila-specific repeats from the sequence
Using blastx, we searched this sequence against the
Swiss-Prot database
The blastx results indicate that our sequence is similar to the
Paired-box protein (Pax-6) in D. melanogaster
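The RepeatMasker step above can be pictured with a small sketch. It only illustrates hard masking (replacing repeat bases with N so they cannot seed spurious matches); it is not RepeatMasker's own algorithm, and the function name, toy sequence, and repeat coordinates are made up for illustration.

```python
def hard_mask(sequence, repeat_intervals, mask_char="N"):
    """Replace each 1-based, inclusive (start, end) interval with mask_char."""
    masked = list(sequence)
    for start, end in repeat_intervals:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"interval {start}-{end} is outside the sequence")
        for i in range(start - 1, end):
            masked[i] = mask_char
    return "".join(masked)


# Hypothetical toy data: a 20-base sequence with a repeat at bases 9-16.
toy_sequence = "ACGTACGT" + "T" * 8 + "ACGT"
toy_repeats = [(9, 16)]
print(hard_mask(toy_sequence, toy_repeats))
```

The masked sequence keeps its original length, so coordinates reported by later BLAST searches still refer to positions in the unmasked sequence.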
Function of Pax-6
Clicking on the accession number of the first hit in the
blastx output shows that Pax-6 is also known as
eyeless
We can learn more about eyeless using the FlyBase
web site at http://flybase.org
Type in eyeless in the search field, then click on the hit
“ey” (#17)
Function of Pax-6
This brings up the gene report for eyeless in D.
melanogaster
We find that eyeless is important for brain and eye
development
It is expressed in embryo, larva, and adult
Phenotypic changes in mutants include changes in the
antenna, arista, and eye of the fruit fly
Finding Homologs in Other Species
Click on the BLAST button to access the BLAST service
Search our masked sequence against D. melanogaster,
D. yakuba, D. mojavensis, D. virilis genome assemblies
using blastn
Most of the species, other than D. melanogaster, are
unannotated.
Nonetheless, this is useful for finding putative orthologs
and for discovering regulatory regions using multiple
sequence alignments
Using the Ensembl Database
Navigate to Ensembl at http://www.ensembl.org
Click on “Drosophila melanogaster” to access the data
specific for this species
In the search box, type in the name “eyeless” then click
“Go”
We find only one match - CG1464 (the eyeless protein)
Transcripts of eyeless
There are four different isoforms of eyeless in D.
melanogaster
We would typically annotate the most “comprehensive” isoform
• In this case, isoform D
The Fruitfly GeneView provides a general overview of
the gene structure and function of eyeless
Links to FlyBase, RefSeq, Swiss-Prot, EMBL records of
eyeless are also available on this page.
Obtaining Transcript Sequence
Click on “Exon Info” for the transcript CG1464-RD
This brings us to the exon report for this transcript
9 exons, 3024 bp, 898 residues
The sequence is shown with each exon in its own block.
Sequence is color-coded:
Purple = UTRs
Black = Coding DNA sequences (CDS)
Blue = intronic sequences
Green = upstream or downstream sequences
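The totals on the exon report (3024 bp, 898 residues) can be sanity-checked with a little arithmetic. The sketch below assumes that 3024 bp is the spliced transcript length including UTRs and that the stop codon is counted as part of the coding sequence; neither assumption is stated on the report, and the function name is hypothetical.

```python
def split_transcript_length(transcript_bp, residues, count_stop_codon=True):
    """Return (coding_bp, utr_bp) for a spliced transcript.

    Assumes each residue uses one 3-base codon and, optionally,
    that one stop codon is part of the coding sequence.
    """
    codons = residues + (1 if count_stop_codon else 0)
    coding_bp = codons * 3
    utr_bp = transcript_bp - coding_bp
    if utr_bp < 0:
        raise ValueError("coding sequence is longer than the transcript")
    return coding_bp, utr_bp


coding_bp, utr_bp = split_transcript_length(3024, 898)
print(coding_bp, utr_bp)  # 2697 327
```

Under these assumptions roughly 2.7 kb of CG1464-RD is coding and about 0.3 kb is UTR, which is consistent with purple UTR blocks appearing in the exon view.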
Obtaining Peptide Sequence
Click on the link “Protein Information” to obtain the
peptide sequence of CG1464-RD
This brings us to the protein report for this transcript
“Protein Family” section shows that there are six gene members
in this species
Clicking on the link brings up the Family view - allows
visualization of multiple sequence alignments of members of
this family
The peptide sequence has the following color-code:
Black/Blue = Alternating text color for exons
Red = Residue overlaps a splice site
Green = Synonymous SNP
Yellow = Non-synonymous SNP
Next Step
Annotate the exact boundaries of each exon in our D.
erecta sequence based on sequence homology to D.
melanogaster eyeless gene
Use exon-by-exon BLAST search with BLAST 2 Sequences
(bl2seq)
Questions?
Walk-through example
Determining Exon Boundaries
Use bl2seq to determine exon boundaries of the
putative ortholog in our D. erecta sequence
Go to www.ncbi.nlm.nih.gov/blast/ and select bl2seq
Copy the D. erecta sequence and paste it into the Sequence
1 box. Copy the first exon of D. melanogaster eyeless and paste it into
the Sequence 2 box.
Change program to tblastx. Click “BLAST”
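tblastx compares the two nucleotide sequences at the protein level by translating both in all six reading frames. The sketch below shows only that translation step, using the standard genetic code; it is not the tblastx alignment algorithm, and the function names and toy sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical toy sequence, not taken from eyeless.
for frame, peptide in six_frame_translations("ATGGCCTAA").items():
    print(frame, peptide)
```

Comparing translations rather than nucleotides tolerates synonymous changes between species, which is one reason a translated search can be more sensitive than blastn for coding exons.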
Determining Exon Boundaries
We find that the first exon corresponds to bases 19307-19414 in our sequence
We can repeat the previous steps to locate the other
exons in our sequence
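BLAST reports positions as 1-based, inclusive coordinates, so a little care is needed when turning a match into a subsequence. The sketch below takes the first-exon match as bases 19307 through 19414 and uses a placeholder string in place of the real D. erecta sequence; the helper names are hypothetical.

```python
def exon_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def exon_slice(start, end):
    """Convert 1-based, inclusive coordinates to a Python slice."""
    return slice(start - 1, end)


# Placeholder for the 20 kb D. erecta sequence (not real data).
genomic = "N" * 20000

first_exon = genomic[exon_slice(19307, 19414)]
print(exon_length(19307, 19414), len(first_exon))  # 108 108
```

Repeating this for each exon gives a list of coordinate pairs that can be checked for order and overlap before the exact splice sites are refined.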