Browsing Genomes with Ensembl

Xosé Mª Fernández
European Bioinformatics Institute
March 2007
Outline of talk
• Overview of Ensembl
– Ensembl - Project
– Exploring genomes
– Gene annotation
• Making genomes useful
• Beyond Ensembl
Ensembl - Project
• Joint project
– EMBL – European Bioinformatics Institute (EBI)
– Wellcome Trust Sanger Institute
• Produce accurate, automatic genome annotation
• Focused on selected eukaryotic genomes
• Integrate external (distributed) biological data
• Presentation of the analysis to all via the Web at http://www.ensembl.org
• Open distribution of the analysis to the community
• Development of open, collaborative software (databases and APIs)
Beyond classical ab initio gene prediction
• Ensembl automatic gene prediction relies on homology ‘supporting evidence’ to avoid overprediction.
• Classical ab initio gene prediction (e.g. GENSCAN) relies partly on global statistics of protein-coding potential, which the cell itself does not use.
• Genes are just a series of short signals
– Transcription start site
– Translation start site
– 5’ & 3’ intron splicing signals
– Termination signals
• Short signal sequences are difficult to recognise over background noise in large genomes
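The point about short signals and background noise can be made concrete with a little arithmetic. The sketch below is not part of the Ensembl pipeline; it assumes a deliberately crude model (independent, equally likely bases) and a hypothetical helper name, and estimates how often a fixed short motif would turn up purely by chance.

```python
def expected_chance_matches(motif_length: int, sequence_length: int) -> float:
    """Expected number of exact matches to one fixed motif in random sequence.

    Assumes independent, equally likely bases (A, C, G, T). Real genomes do
    not satisfy this, so treat the result as a rough order of magnitude.
    """
    # Number of positions at which a motif of this length could start.
    positions = max(sequence_length - motif_length + 1, 0)
    # Each position matches a fixed motif with probability (1/4) ** length.
    return positions / 4 ** motif_length


if __name__ == "__main__":
    genome_size = 3_000_000_000  # rough size of the human genome, in bases
    for label, length in [("3-base signal", 3), ("6-base signal", 6), ("12-base signal", 12)]:
        hits = expected_chance_matches(length, genome_size)
        print(f"{label}: about {hits:,.0f} chance matches")
```

Under this model even a 12-base signal is expected to occur by chance many times in a genome of this size, which is why signal-only prediction tends to overpredict and why supporting evidence helps.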
Ensembl v43
DAS Registry
http://www.dasregistry.org
DAS
Pre! and Archive! sites
http://pre.ensembl.org
http://www.ensembl.org
http://archive.ensembl.org
Open source, open standards
• Object model
– standard interface makes it easy for others to build custom applications on top of Ensembl data
• Open discussion of design ([email protected])
• Most major pharma companies and many academic groups are represented on the mailing list, and the code is being actively developed externally
• Ensembl installed locally
– in both industry & academia
Ensembl – Open source
APIs
• Used to retrieve data from and to store data in Ensembl databases
• Ensembl Perl API
– Written in object-oriented Perl
– Foundation for the Ensembl Pipeline and the Ensembl Web interface
Making genomes useful
• Interpretation
– Where are the interesting parts of the genome?
– What do they do?
– How are they related to elements in other genomes?
• Access
– for bench biologists
– for non-programming mid-scale groups
– for good programming groups
Access… bench biologists
• Mainly via the web
• Web site designed for biologists who do not program and are not especially genome-aware
– Simple things to find are simple to find
– Graphical displays and overviews
– Consistency of layout, colour and text
Ensembl
[Diagram labels: Analysis DB, Final DB, Supporting Databases, SNP, Manual Annotation, CPU]
Genome browsing: why present the whole genome?
• Explore what is in a chromosome region
• See features in and around a specific gene
• Search & retrieve across the whole genome
• Investigate genome organization
• Compare to other genomes
Introduction to the Ensembl web site
Ensembl…
• takes genomic sequence assemblies
– human build 36, mouse, rat, mosquito…
• adds annotation and links
– automated process
• presents all the data on a web site
Basic Genome Annotation
• Genes
– Genomic location
– Gene model structures
• Exons
• Introns
• UTRs
– Transcript(s)
• Pseudogenes
• Non-coding RNA
– Protein(s)
– Links to other sources of information
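The hierarchy on this slide (a gene with a genomic location, one or more transcripts, and exons separated by introns) maps naturally onto a small data structure. The following is a minimal sketch only: the class names, fields, and coordinates are hypothetical and are not the Ensembl object model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Exon:
    start: int  # 1-based, inclusive genomic coordinates (an assumption here)
    end: int


@dataclass
class Transcript:
    name: str
    exons: List[Exon] = field(default_factory=list)

    def introns(self) -> List[Tuple[int, int]]:
        """Return the gaps between consecutive exons as (start, end) pairs."""
        ordered = sorted(self.exons, key=lambda exon: exon.start)
        return [
            (left.end + 1, right.start - 1)
            for left, right in zip(ordered, ordered[1:])
            if right.start > left.end + 1
        ]


@dataclass
class Gene:
    name: str
    chromosome: str
    strand: int  # +1 or -1
    transcripts: List[Transcript] = field(default_factory=list)

    def location(self) -> Tuple[str, int, int]:
        """Genomic span covering every exon of every transcript.

        Raises ValueError if the gene has no exons at all.
        """
        exons = [exon for t in self.transcripts for exon in t.exons]
        return (self.chromosome, min(e.start for e in exons), max(e.end for e in exons))


# Made-up coordinates, for illustration only.
toy_gene = Gene(
    name="toy_gene",
    chromosome="1",
    strand=1,
    transcripts=[
        Transcript("toy_transcript_1", [Exon(100, 200), Exon(300, 400), Exon(500, 600)]),
        Transcript("toy_transcript_2", [Exon(100, 200), Exon(500, 650)]),
    ],
)
```

Introns are derived from the exon list rather than stored, so two transcripts of the same gene can share exons yet have different intron structures, as the two toy transcripts do.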
Advanced Genome Annotation
• Cytogenetic bands
• Polymorphic markers
– Sequence Tagged Sites (STS)
• Genetic variation
– Single Nucleotide Polymorphisms (SNPs)
– Deletion-Insertion Polymorphisms (DIPs)
– Short Tandem Repeats (STRs)
• Repetitive sequences
• Expressed Sequence Tags (ESTs)
• cDNAs or mRNAs from related species
• Regions of sequence homology
How to get started…
• Species homepage
• MapView
• Text search
• BLAST
• SSAHA
Homepage
MapView
BLAST and SSAHA
See BLAST hit on genome
Regions, maps and markers
ContigView
CytoView
SyntenyView
MultiContigView
MarkerView
SNPView
GeneSNPView
Ensembl ContigView
ContigView close-up
• Transcripts: red & black (Ensembl predictions), blue (Vega) & gold (HAVANA, only in human)
• Pop-up menu
ContigView - Navigation
Click and drag mouse to select region
CytoView
GeneSNPView
SNPView
MarkerView
MultiContigView
Genes & gene products
GeneView
TransView
ExonView
ProteinView
FamilyView
GOView
Ensembl GeneView
TransView
ExonView
ProteinView
FamilyView
GOView
Data retrieval
BioMart
ExportView
Data sets on ftp site
MySQL queries of databases
Perl API access to databases
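To give a feel for what a direct database query for a chromosome region looks like, here is a minimal sketch. It is an illustration under stated assumptions, not Ensembl's schema or API: it uses Python's built-in SQLite in place of MySQL so that it runs on its own, and the table `toy_gene`, its columns, and every row in it are made up.

```python
import sqlite3

# Hypothetical, simplified stand-in table; the real Ensembl schema differs.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE toy_gene (name TEXT, chromosome TEXT, start_pos INTEGER, end_pos INTEGER)"
)
connection.executemany(
    "INSERT INTO toy_gene VALUES (?, ?, ?, ?)",
    [
        ("toy_a", "1", 1_000, 5_000),
        ("toy_b", "1", 4_500, 9_000),
        ("toy_c", "1", 20_000, 25_000),
        ("toy_d", "2", 1_000, 5_000),
    ],
)


def genes_in_region(conn, chromosome, start, end):
    """Names of genes overlapping chromosome:start-end (inclusive coordinates)."""
    # Two intervals overlap when each one starts before the other ends.
    rows = conn.execute(
        "SELECT name FROM toy_gene "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(genes_in_region(connection, "1", 4_000, 4_800))
```

The same overlap condition would apply whichever route is used to reach the data; only the table and column names would change.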
ExportView
Help!
• context-sensitive help pages - click
• access other documentation via generic home page
• email the helpdesk
July 2006
Ensembl Team
Ensembl Team
March 2007
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haider
Distributed Annotation System (DAS): Eugene Kulesha
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Web Team: James Smith, Bethan Pritchard, Fiona Cunningham, Anne Parker, Stephen Rice, Steve Trevanion (VEGA), Matt Wood
Comparative Genomics: Abel Ureta-Vidal, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Javier Herrero Sánchez, Albert Vilella
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Zebrafish Annotation: Kerstin Howe, Mario Caccamo, Tina Eyre, Ian Sealy
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino, Dace Ruklisa
Training...
Somewhere near you