Transcript ppt_I

How to access genomic information using Ensembl
Damian Smedley and Xosé Fernández
Ensembl Project
European Bioinformatics Institute
Cambridge, UK
November 2004
Schedule
Today
Introduction to the Ensembl system
Hands-on examples to introduce the system
Evaluating genes and transcripts
Variation in Ensembl (SNPs, haplotypes)
Tomorrow
Data mining with EnsMart
Comparative genomics and proteomics in Ensembl
BioMart
Advanced topics (Upload your own data, DAS)
Our goal
Assembly
From 325,109 initial contigs
to 26,720 overlapping clones
Other ordering data
non-redundant, “virtual contig” view
Mapping and Sequencing the human genome
BACs (bacterial artificial chromosomes), avg size 150 kb
pUCs, avg size 2-4 kb
Diagram labels: fragment, map, sequence, assembly, draft, finished BAC, WGS
Shizuya et al 1992; Dib et al 1996; Deloukas et al 1998; Osoegawa et al 2001
Bentley et al 2001; Bruls et al 2001; McPherson et al 2001; Montgomery et al 2001; Tilford et al 2001
Status of the human sequence
finished (red/orange): ~96% (99.999% accurate)
30-40% repetitive elements (e.g. Alpha satellite, Alu repeats)
All known genes correctly identified (99.74%)
heterochromatin (grey): ~4%
Assembled draft sequence totals 2.85 Gb
Human genome: Current status
• 22,287 ‘gene loci’ defined, consisting of 19,599 protein-coding genes in the human genome and 2,188 additional DNA segments ‘predicted’ to be protein-coding genes
– 1,183 genes ‘were born’ in the last 60-100 My
– ~30 genes ‘died’ in a similar time period
Finishing the euchromatic sequence of the human genome, Nature 431:931-45 (2004)
Ensembl - project aims
• funded to provide metazoan genomes to the world
• aims to provide the world’s best automated genome annotation
• a leading group for human and mouse analysis
• all software, data and results freely available
Ensembl - project background
• group split between EBI and Sanger
• mainly Wellcome Trust funded
• largest dedicated compute in biology in Europe
• developer community > 100 people, including companies
Ensembl – Open source
Freely-available
Community development.
– >51 Ensembl installs worldwide.
– Both public and commercial, e.g. Gramene (CSHL), Fugu-sg (IMCB), Ciona-sg (Temasek)
Ensembl
Diagram labels: Analysis DB, Final DB, Supporting Databases, SNP, Manual Annotation, CPU
Genome browsing
Why present the whole genome?
• Explore what is in a chromosome region
• See features in and around a specific gene
• Search & retrieve across the whole genome
• Investigate genome organization
• Compare to other genomes
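Exploring a chromosome region, or the neighbourhood of a gene, comes down to an interval-overlap question: which features intersect a given range? The following is a minimal Python sketch of that idea, assuming 1-based inclusive coordinates. The `Feature` type, the `features_in_region` helper and the sample records are hypothetical illustrations, not part of Ensembl or its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A named feature on a chromosome (hypothetical illustration type)."""
    name: str
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def features_in_region(features, chromosome, start, end):
    """Return the features on `chromosome` that overlap [start, end]."""
    return [
        f for f in features
        if f.chromosome == chromosome and f.start <= end and f.end >= start
    ]


# Made-up records for illustration only; not real annotation.
EXAMPLE_FEATURES = [
    Feature("geneA", "20", 100, 500),
    Feature("geneB", "20", 450, 900),
    Feature("markerC", "20", 1200, 1200),
    Feature("geneD", "6", 100, 500),
]

hits = features_in_region(EXAMPLE_FEATURES, "20", 480, 1000)
print([f.name for f in hits])  # ['geneA', 'geneB']
```

A linear scan like this is only reasonable for small lists; a genome-scale browser would presumably rely on indexed database queries instead.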
Genome browsers
• Ensembl – public site + installable system: http://www.ensembl.org
• UCSC Human Genome Browser: http://genome.ucsc.edu
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview
Introduction to the Ensembl web site
Ensembl …
• takes genomic sequence assemblies (human build 34, mouse, rat, Fugu, mosquito)
• adds annotation and links (automated process)
• presents all the data on a web site
Annotation: genes
Known genes
• where?
• genomic structure?
• transcript(s)?
• protein(s)?
• orthologues?
• attach useful links
Novel genes
• how to predict? (require evidence)
• transcript(s)?
• protein(s)?
• orthologues?
• attach useful links
Annotation: other features
• markers and SNPs
• cytogenetic bands
• repeated sequences
• ESTs & other sequence records (where do they show sequence similarity?)
• regions homologous to other species
How to get started …
• Species homepage
• Site map
• MapView
• Text search
• BLAST
• SSAHA
• DiseaseView
Homepage
Site map
MapView
AnchorView
BLAST and SSAHA
Regions, maps and markers
ContigView
CytoView
SyntenyView
MultiContigView
MarkerView
SNPView
Ensembl ContigView
ContigView close-up
Customising & short cuts
Evidence
Transcripts: red & black (Ensembl predictions), blue (Vega)
Pop-up menu
ContigView - Chromosome 20 close-up
Forward strand
Manual annotation via Vega
Reverse strand
Ensembl predictions
Ensembl EST-based predictions
Other chromosomes with manual annotation from http://vega.sanger.ac.uk: 6, 7, 9, 10, 13, 14, 20, 22, X
CytoView
GeneSNPView
MarkerView
SNPView
SyntenyView
MultiContigView
Genes & gene products
GeneView
TransView
ExonView
ProteinView
FamilyView
DomainView
GOView
DiseaseView
Ensembl
GeneView
TransView
ExonView
ProteinView
FamilyView
GOView
DiseaseView
Data retrieval
EnsMart
ExportView
Data sets on FTP site
MySQL queries of databases
Perl API access to databases
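As a rough illustration of the "MySQL queries of databases" route, the sketch below runs a region query against a small relational table. It uses an in-memory SQLite database as a stand-in for a MySQL server so that it runs anywhere. The table layout (`gene`, `seq_region`), the column names and the `DEMO…` identifiers are simplified assumptions made for this example, so check the schema documentation for the release in use before writing real queries.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in database with made-up gene records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE seq_region (
            seq_region_id INTEGER PRIMARY KEY,
            name TEXT
        );
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            stable_id TEXT,
            seq_region_id INTEGER,
            seq_region_start INTEGER,
            seq_region_end INTEGER
        );
        INSERT INTO seq_region VALUES (1, '20');
        INSERT INTO seq_region VALUES (2, 'X');
        INSERT INTO gene VALUES (1, 'DEMO0001', 1, 1000, 5000);
        INSERT INTO gene VALUES (2, 'DEMO0002', 1, 8000, 9000);
        INSERT INTO gene VALUES (3, 'DEMO0003', 2, 1000, 5000);
        """
    )
    return conn


def genes_in_region(conn, chromosome, start, end):
    """Return stable IDs of genes overlapping chromosome:start-end."""
    rows = conn.execute(
        """
        SELECT g.stable_id
        FROM gene AS g
        JOIN seq_region AS s ON s.seq_region_id = g.seq_region_id
        WHERE s.name = ?
          AND g.seq_region_start <= ?
          AND g.seq_region_end >= ?
        ORDER BY g.seq_region_start
        """,
        (chromosome, end, start),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(genes_in_region(conn, "20", 4000, 8500))  # ['DEMO0001', 'DEMO0002']
```

The query parameters are bound with placeholders rather than pasted into the SQL string, which is the safer habit when region coordinates come from user input.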
ExportView
EnsMart
Mouse differences
• Genomic sequence assembly based on whole genome shotgun, with finished ‘stitched’ BACs
• BACs are shown in CytoView (FPC map), but for most no sequence is available
Mouse CytoView
Help!
• click for context-sensitive help pages
• access other documentation via the generic home page
• email the helpdesk
HelpDesk / Suggestions
Thanks
Ensembl Team
Ensembl Team
November 2004
Database Schema and Core API
Arne Stabenau
Yuan Chen
Ian Longden
Craig Melsopp
Glenn Proctor
Daniel Ríos
Guy Slater
Project Leader
Ewan Birney (EBI)
Tim Hubbard (Sanger)
Distributed Annotation System
Andreas Kähäri
Vega Web Team
Patrick Meidl
Steve Trevanion
User Support
Xosé Mª Fernández
Michael Schuster
Comparative Genomics
Abel Ureta-Vidal
Javier Herrero Sánchez
Jessica Severin
Cara Woodwark
Ensembl Web Team
James Stalker
Fiona Cunningham
James Smith
Analysis and Annotation Pipeline
Val Curwen
Steve Searle
Dan Andrews
Mario Caccamo
Laura Clarke
Martin Hammond
Jan-Hinnerk Vogel
Kevin Howe
Vivek Iyer
Kerstin Jekosch
Felix Kokocinski
Simon White
EnsMart & BioMart
Arek Kasprzyk
Damian Keefe
Darin London
Damian Smedley