Ensembl Introduction


Investigating Genomes with Ensembl
Drs. Bert Overduin and Giulietta Spudich
Overview of the day
• Introduction and website walk-through
• Hands-on exercises (the browser)
Tea/Coffee
• Introduction to BioMart
• Hands-on exercises (BioMart)
Lunch
• Determining the gene set
• Hands-on exercises (gene set)
Tea/Coffee
• Variations presentation and hands-on
Introducing…
• Genome browsing: a comparison
• Consensus genes
• Ensembl annotation and software
• How to find help
Sequencing the genome
[Figure: features annotated on a genome: gene, conserved sequence, SNP, DNase I sensitive site, histone modification]
What can we learn about genomes?
• Within one genome: regulatory elements, gene order, chromatin structure…
• Through comparative studies: evolution, conserved regions, rearrangements…
Gene quality and prediction.
Genome Browsers Today
• Ensembl Genome browser
http://www.ensembl.org
• NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser
http://genome.ucsc.edu
[Screenshots: Ensembl Genome Browser, NCBI Map Viewer, UCSC Genome Browser]
What Distinguishes Ensembl from the UCSC and NCBI Browsers?
• The gene set. Automatic annotation based on mRNA and protein information.
• Programmatic access via the Perl API (open source)
• BioMart
• Integration with other databases (DAS)
• Comparative analysis (gene trees)
Challenges of genome browsers
• Increasing sequence information: 198,879,188,987 nt (Aug 2007)
Challenges of genome browsers
• Increasing annotation: ENCODE
• Pilot project completed in 2007, covering 1% of the human genome
• Promoter elements were discovered on both sides of the transcription start site
To meet a challenge…
Ensembl’s AIM: To provide annotation for the biological
community that is freely available and of high quality
• Started in 1999
• Joint project between EBI and Sanger
• Funded primarily by the Wellcome Trust,
additional funding by EMBL, NIH-NIAID, EU,
BBSRC and MRC
• Team of ca. 40 people, led by Ewan Birney
(EBI) and Tim Hubbard (Sanger)
The Ensembl gene set
• All Ensembl genes start from a known
protein or mRNA
[Diagram: mRNAs and proteins, together with the sequence assembly, give the Ensembl gene set]
• An initial alignment of protein and mRNA to the genome
begins the ‘Genebuild’.
Have you heard of…
• Ensembl – strives for best possible gene set
www.ensembl.org
• Havana (VEGA) – same goal
http://vega.sanger.ac.uk
• HGNC – a unique name and symbol for every
gene in human
http://www.genenames.org/
• UniProt – focus on proteins, and functional
information
www.uniprot.org
Ensembl vs Havana annotation
All genes at once (Ensembl Genebuild):
• Quick, keeps current
• Consistent annotation
• Can apply rules to more species
Gene by gene (Havana/VEGA):
• Flexible, can deal with inconsistencies
• Consult publications as well as databases
• ‘Out of the Ordinary’ biology
• However… slow, expensive
Merging sets
• Havana transcripts are incorporated into
Ensembl
• UniProt proteins are aligned to the
genome in the Ensembl genebuild
• UniProt imports Ensembl peptides for
human
• HGNC moved to Hinxton… coordination
Consensus across genome browsers: the CCDS set
http://www.ensembl.org/info/about/docs/ccds.html
• A protein is deposited into the ‘Consensus CDS protein set’ (CCDS set) if NCBI, UCSC, Havana and Ensembl have determined the same sequence.
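The agreement rule on this slide can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the function name `consensus_cds`, the labels `cds_a` and `cds_b`, and the short sequences are all made up for illustration, and the slides do not describe how the real CCDS comparison is carried out.

```python
def consensus_cds(annotations):
    """Return the CDS entries for which every group reports the same sequence.

    `annotations` maps a group name to a dict of {cds_label: sequence}.
    This is a hypothetical data layout, not a real CCDS file format.
    """
    groups = list(annotations.values())
    if not groups:
        return {}
    # Only labels annotated by every group can be part of the consensus.
    shared = set(groups[0])
    for group in groups[1:]:
        shared &= set(group)
    first = groups[0]
    # Keep a label only if all groups agree on its sequence.
    return {
        label: first[label]
        for label in sorted(shared)
        if all(group[label] == first[label] for group in groups)
    }


# Made-up toy data: the four groups agree on cds_a but not on cds_b.
toy_annotations = {
    "NCBI": {"cds_a": "ATGGCCTAA", "cds_b": "ATGAAATAG"},
    "UCSC": {"cds_a": "ATGGCCTAA", "cds_b": "ATGAAATAG"},
    "Havana": {"cds_a": "ATGGCCTAA", "cds_b": "ATGAAGTAG"},
    "Ensembl": {"cds_a": "ATGGCCTAA", "cds_b": "ATGAAATAG"},
}

print(consensus_cds(toy_annotations))
```

In this toy input only `cds_a` would enter the consensus set, because Havana's `cds_b` differs from the others by one base.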
More about Ensembl…
• Genome browsing: a comparison
• Consensus genes
• Ensembl annotation and software
• How to find help
Ensembl Genes – biological basis
All Ensembl gene predictions are based on
proteins and mRNAs in:
• UniProt/Swiss-Prot (manually curated)
• UniProt/TrEMBL
• NCBI RefSeq (manually curated)
[Diagram: protein/mRNA aligned to the sequence assembly gives Ensembl genes]
Genes and Transcripts in Ensembl
• Ensembl known genes or transcripts
• Ensembl novel genes or transcripts
• Ensembl EST genes or transcripts
Non-Ensembl genes:
• Imports for yeast, C. elegans, fly, mosquito, Takifugu and Tetraodon
Names in Ensembl
• ENSG###  Ensembl Gene ID
• ENST###  Ensembl Transcript ID
• ENSP###  Ensembl Peptide ID
• ENSE###  Ensembl Exon ID
• For species other than human, a species code is inserted after ENS:
MUS (Mus musculus) for mouse: ENSMUSG###
DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
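As a rough illustration of this naming scheme, the Python sketch below splits an ID into its parts. It rests on assumptions that go beyond the slide: species codes are taken to be exactly three uppercase letters (as in MUS and DAR), the numeric part is taken to be one or more digits (the slide only shows ###), and the function name `parse_stable_id` and the example digits are made up.

```python
import re

# Feature-type letters from the slide: G = gene, T = transcript,
# P = peptide, E = exon.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Assumption: an optional three-letter species code follows "ENS",
# then one feature-type letter, then digits.
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d+)$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style ID into species code, feature type and number.

    Returns None when the string does not match the assumed pattern.
    A species code of None means no code is present, i.e. human.
    """
    match = STABLE_ID.match(stable_id)
    if match is None:
        return None
    species_code, type_letter, number = match.groups()
    return {
        "species_code": species_code,
        "feature_type": FEATURE_TYPES[type_letter],
        "number": number,
    }


# Example IDs with made-up digits.
for example in ("ENSG00000000001", "ENSMUSG00000000001", "ENSDART00000000001"):
    print(example, parse_stable_id(example))
```

Making the species code optional in the pattern lets one expression cover both the human form (ENSG###) and the other species (ENSMUSG###, ENSDARG###).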
Gene Structure in Ensembl
[Figure: calmodulin gene models in chicken and human; panel labels: "No UTRs", "UTRs annotated"]
What annotation is available?
• Gene/transcript/peptide models (coding and non-coding (ncRNAs))
• IDs in other databases
• Mapped cDNAs, peptides, microarray probes, BAC clones etc.
• Cytogenetic bands, markers, repeats etc.
• Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
• Variation data: Single Nucleotide Polymorphisms (SNPs)
• Regulatory data: “best guess” set of regulatory elements from ENCODE
• Data from external sources (DAS)
Specific data sources
• Microarrays (Affymetrix, Illumina, Agilent)
• GO (Gene Ontology: functional classes)
http://www.geneontology.org/
• OMIM (human diseases and phenotypes)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
• Identifiers in Entrez, UniProt, RefSeq, etc.
• PDB, MSD (structural databases)
http://www.rcsb.org/pdb/
http://www.ebi.ac.uk/msd/
• InterPro (collection of protein data: sequences, motifs, structures)
http://www.ebi.ac.uk/interpro/
How is this information organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
(Perl API, FTP site)
• BioMart ‘DataMining tool’
Ensembl – Open Source
• Data and software freely available
• More than 50 installs worldwide
• Academia and industry
• Local or available via the web
• Mirrors with Ensembl data, e.g.
http://ensembl.genome.tugraz.at/index.html
http://ensembl.genomics.org.cn/
or user projects with own data
Powered by Ensembl
Help and Information
• Use our helpdesk!
[email protected]
• View our help pages!
(the ‘using Ensembl’ link)
• View our animated tutorials
http://www.ensembl.org/common/Workshops_Online
• Mailing lists:
[email protected]
• Come visit our blog!
http://ensembl.blogspot.com/
Ensembl Team