Ensembl
Genome Repository
Main Data Repositories
• Ensembl - BLAST or BLAT
• UCSC - BLAT
• NCBI (Entrez) - BLAST
• Ensembl, NCBI, and UCSC use the same
human genome assembly that is
generated by NCBI
Ensembl
• Provide automatic annotation of
sequenced genomes
• Integrate with biological data
• Make available from web
– Genome Browser
– Web interface
– BioMart
– Direct database access and Perl API
Outline
• Where the data comes from
• Questions that can be answered
Ensembl Genomes
Genome Annotation
• Identify elements on the genome
• Attach biological information to the
elements
• Automatic annotation and curation
Vega/Havana
Annotation
• Addition of positional, functional, regulatory and
evolutionary datasets to a raw assembled
genome.
• Genes, exon-intron boundaries, protein
products, miRNAs, alternative splicing,
transcriptional start sites, expression, orthologs,
paralogs, repeats, structural features, syntenic
relationships, ChIP-chip data ...
• Based on experimental data and computational
predictions.
Genebuild
• Align species-specific proteins to the genome to
create CDS models (targeted build)
• Align proteins from closely related species to
locate additional CDS models (similarity build)
• Add UTRs using cDNA/EST evidence and ditag
data
• Cluster transcripts into genes
• Classify transcripts
• Name genes
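
The "cluster transcripts into genes" step can be illustrated with a toy sketch. It assumes a simplified rule that is not stated on the slide: transcripts on the same strand that share at least one overlapping exon belong to the same gene. The data layout and the function names (exons_overlap, cluster_transcripts) are hypothetical and are not part of the Ensembl pipeline.

```python
from typing import List, Tuple

Exon = Tuple[int, int]                    # (start, end), inclusive coordinates
Transcript = Tuple[str, str, List[Exon]]  # (name, strand, exons)


def exons_overlap(a: List[Exon], b: List[Exon]) -> bool:
    """Return True if any exon in a overlaps any exon in b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster_transcripts(transcripts: List[Transcript]) -> List[List[str]]:
    """Group transcripts into gene-like clusters (assumed rule: same strand
    and at least one overlapping exon)."""
    clusters: List[List[Transcript]] = []
    for t in transcripts:
        linked, rest = [], []
        for cluster in clusters:
            touches = any(
                m[1] == t[1] and exons_overlap(m[2], t[2]) for m in cluster
            )
            (linked if touches else rest).append(cluster)
        # Fuse every cluster the new transcript touches into one cluster.
        merged = [t] + [m for cluster in linked for m in cluster]
        clusters = rest + [merged]
    return [sorted(m[0] for m in cluster) for cluster in clusters]


example = [
    ("t1", "+", [(100, 200), (300, 400)]),
    ("t2", "+", [(350, 450)]),    # overlaps the second exon of t1
    ("t3", "-", [(100, 200)]),    # same position as t1, opposite strand
    ("t4", "+", [(1000, 1100)]),  # no overlap with anything
]
genes = cluster_transcripts(example)
```

Here t1 and t2 end up in one cluster, while t3 and t4 each form their own.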
Human/Mouse Genebuild
• Additional steps are applied that are not included
in the standard Ensembl build.
• For both species, transcripts from the
Consensus Coding Sequence (CCDS) set are
imported directly and not altered by the
genebuild process.
• In addition, where manual curation is available
for a transcript, the Ensembl and HAVANA
transcript models are compared.
• The Ensembl and HAVANA models are merged
when they agree on the same coding sequence
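
The merge rule above can be sketched as a simple comparison. This is a toy illustration only: it assumes each model is reduced to a list of coding-exon coordinates, and the function name merge_if_same_cds and the returned dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end) of one coding exon


def merge_if_same_cds(
    ensembl_cds: List[Exon], havana_cds: List[Exon]
) -> Optional[Dict[str, object]]:
    """Return a merged model if both models have identical coding exons,
    otherwise None (the two models would then be kept separate)."""
    if sorted(ensembl_cds) == sorted(havana_cds):
        # Hypothetical label for a model supported by both sources.
        return {"source": "merged", "cds": sorted(ensembl_cds)}
    return None


same = merge_if_same_cds([(100, 200), (300, 400)], [(300, 400), (100, 200)])
different = merge_if_same_cds([(100, 200)], [(100, 210)])
```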
Ensembl Identifiers
• ENS_Species_Type_00000_ID
• Species: blank for human; for all other species, a
three-letter code (e.g. MUS for mouse)
• Type: G (gene), T (transcript), P (protein)
• ID: six-digit number, preceded by the five zeros (11 digits in total)
• ENSMUST00000118022
• ENSMUSP00000113891
• ENSMUSG00000021944
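
The identifier layout above can be checked with a short parser. This is a minimal sketch that assumes exactly the scheme described on this slide (an optional three-letter species code, a type letter G, T or P, and 11 digits). The names STABLE_ID, StableId and parse_stable_id are hypothetical, and identifiers that use other type letters or longer species codes are not handled.

```python
import re
from typing import NamedTuple

# Assumed pattern: "ENS", optional three-letter species code,
# one type letter (G, T or P), then 11 digits.
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTP])(\d{11})$")

TYPE_NAMES = {"G": "gene", "T": "transcript", "P": "protein"}


class StableId(NamedTuple):
    species_code: str  # empty string for human
    feature_type: str  # "gene", "transcript" or "protein"
    number: int        # numeric part with the zero padding removed


def parse_stable_id(stable_id: str) -> StableId:
    """Split an identifier into species code, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised identifier: {stable_id!r}")
    species, type_code, digits = match.groups()
    return StableId(species or "", TYPE_NAMES[type_code], int(digits))


mouse_transcript = parse_stable_id("ENSMUST00000118022")
```

For the mouse examples on the slide, the species code comes back as MUS and the type letter is mapped to transcript, protein or gene respectively.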
Ensembl Organization
• Views are organized into four classes
– Gene
– Transcript
– Location (Genome Browser)
– Variation
Questions
• Are there splice variants?
• How do I find orthologs and paralogs?
• Are there variations in the genomic sequence?
• How can I download different parts of the mRNA
sequence?
• What protein domains exist?
• Gene Ontology
• Can I download sets of data (DNA, cDNA, protein) for a
species?
• BioMart question
Resources
• Ensembl Tutorials
http://www.ensembl.org/info/website/tutorials/index.html
• Ensembl 2009 Nucleic Acids Research
PMID: 19033362
• Bert Overduin, Ph.D. Ensembl
http://www.ebi.ac.uk/~bert/workshops/london_080509/browser_london_080509.pdf