Introduction - ILRI Research Computing

Browsing Genes and Genomes
with Ensembl
Bert Overduin
Ensembl User Support
EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, UK
1 of 42
Course Schedule
Introduction
Website walk-through
Coffee
Exercises
BioMart
Lunch
Exercises
GeneBuild
Tea
Variations / Compara
Exercises
2 of 42
Ensembl Workshops
3 of 42
4 of 42
EMBL-EBI
Hinxton, Cambridge
Wellcome Trust Genome Campus
Hinxton, Cambridge
5 of 42
© John Freebrey (www.thedigitaldarkcloth.com)
6 of 42
Cambridge
7 of 42
© Sean T. McHugh (www.cambridgeincolour.com)
A Bit of History
Sequenced genomes
• 1995  Haemophilus influenzae
• 1996  Yeast
• 1998  C. elegans
• 1999  Fruit fly
• 2000  Arabidopsis
• 2001  Human (draft)
• 2002  Mouse
• 2004  Human (“finished”)
Genome sizes: 1.8 Mb, 12 Mb, 100 Mb, 125 Mb, 115 Mb, 2.6 Gb, 3 Gb
8 of 42
A Bit of History
http://www.genomesonline.org/
9 of 42
Annotation
Wikipedia:
Genome annotation is the process of attaching biological
information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called
Gene Finding, and
2. attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer
analysis, as opposed to manual annotation which involves
human expertise. Ideally, these approaches co-exist and
complement each other in the same annotation pipeline.
10 of 42
Ensembl - Goals
• Provide automatic annotation of
genomic sequence
• Integrate other biological data
• Make data available to all via the
web
11 of 42
Ensembl - Organisation
• Joint project between European Bioinformatics
Institute (EMBL-EBI) and Wellcome Trust Sanger
Institute
• Started in 1999 for the Human Genome Project
• Funded primarily by the Wellcome Trust, additional
funding by EMBL, EU, NIH-NIAID, BBSRC and
MRC
• Team of ca. 40 people, led by Ewan Birney (EBI)
and Tim Hubbard (Sanger)
• Uses the largest dedicated computer system in
biology in Europe
12 of 42
Genome Browsers
• Ensembl Genome browser
http://www.ensembl.org
• NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser
http://genome.ucsc.edu
13 of 42
NCBI Map Viewer
14 of 42
UCSC Genome Browser
15 of 42
Ensembl Genome Browser
16 of 42
What Distinguishes Ensembl from
the UCSC and NCBI Browsers?
• Automatic annotation for those
species for which no manually curated
gene set exists
• Direct database access and
programmatic access via the Perl API
• Not only the data, but also the
software source code is open source
17 of 42
Caveats
• While genome browsers can be very
useful tools, they do not provide the
definitive answer to every question!
• Data is fluid
18 of 42
Which Species Are Available?
• 36 chordates, ranging from mammals to
‘primitive’ chordates (Ciona intestinalis and
Ciona savignyi)
• 3 key eukaryote model organisms:
fruitfly (Drosophila melanogaster)
nematode (Caenorhabditis elegans)
yeast (Saccharomyces cerevisiae)
• 2 insect pathogen vectors:
malaria mosquito (Anopheles gambiae)
yellow fever / dengue mosquito (Aedes
aegypti)
19 of 42
Species in Ensembl
[Figure: phylogenetic tree of vertebrate groups (mammals, birds, reptiles, amphibians, fishes, agnathans) and non-vertebrates, drawn against a geological timescale in MYBP, from the Cambrian (570) to the Tertiary (65)]
20 of 42
More Species to Come ….
Oikopleura
Gorilla
Zebrafinch
Orangutan
Marmoset
Amphioxus
Acorn worm
Hyrax
Megabat
Dolphin
Tarsier
Kangaroo rat
Chinese pangolin
Two toed sloth
Llama
Flying lemur
21 of 42
Which Data Are Available?
• Genomic sequence
• Gene/transcript/peptide models
• External references
• Mapped cDNAs, peptides, microarray probes, BAC clones etc.
• Other features of the genome:
cytogenetic bands, markers, repeats etc.
• Comparative data:
orthologues and paralogues, protein families, whole
genome alignments, syntenic regions
• Variation data:
SNPs
• Regulatory data:
“best guess” set of regulatory elements
• Data from external sources (DAS)
22 of 42
Gene/Transcript/Peptide Models
• Manual annotation
For parts of genomes:
human, dog, mouse, zebrafish (“Vega genes”)
For complete genomes:
fruitfly (FlyBase), C. elegans (WormBase), yeast (SGD)
• Automatic predictions (“Ensembl genes”)
• EST predictions
• Ab initio predictions (GENSCAN, SNAP)
23 of 42
Biological Evidence
All Ensembl gene predictions are based on
experimental evidence:
• UniProt/Swiss-Prot
A manually curated database and therefore of highest
accuracy
• NCBI RefSeq
A partially manually curated database
• UniProt/TrEMBL
Automatically annotated translations of EMBL coding
sequence (CDS) features
• EMBL / GenBank / DDBJ
Primary nucleotide sequence repository
24 of 42
The Ensembl Genebuild
Genome assembly + Experimental evidence + Computer programs → Ensembl Genes
25 of 42
Ensembl Identifiers
• ENSG###  Ensembl Gene ID
• ENST###  Ensembl Transcript ID
• ENSP###  Ensembl Peptide ID
• ENSE###  Ensembl Exon ID
• ENSF###  Ensembl Family ID
• ENSR###  Ensembl Regulatory Feature ID
• For species other than human, a species code is inserted after the ENS prefix:
MUS for mouse (Mus musculus): ENSMUSG###,
DAR for zebrafish (Danio rerio): ENSDARG###, etc.
• For imported genes Ensembl uses the original identifiers
26 of 42
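The identifier scheme on the previous slide can be illustrated with a small parser. This is a minimal sketch, not part of Ensembl's own software: the function name `parse_ensembl_id`, the assumption that species codes are exactly three uppercase letters (as in MUS and DAR), and the assumption that the numeric part is a plain run of digits are all illustrative choices.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "peptide",
    "E": "exon",
    "F": "family",
    "R": "regulatory feature",
}

# Assumption: an optional three-letter species code (absent for human),
# one feature-type letter, then one or more digits.
_ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?([GTPEFR])(\d+)$")


class EnsemblId(NamedTuple):
    species_code: Optional[str]  # None when no species code is present (human)
    feature_type: str
    number: str


def parse_ensembl_id(identifier: str) -> EnsemblId:
    """Split an Ensembl-style stable ID into species code, type and number."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not an Ensembl-style identifier: {identifier!r}")
    species_code, type_letter, number = match.groups()
    return EnsemblId(species_code, FEATURE_TYPES[type_letter], number)
```

For example, `parse_ensembl_id("ENSMUSG00000000001")` yields the species code `MUS` and the feature type `gene`. Imported identifiers (FlyBase, WormBase, SGD) do not follow this scheme and raise `ValueError` here.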
Access to Genome Annotation
• Release web site: http://www.ensembl.org/
• Pre-Release: http://pre.ensembl.org/
• Archive: http://archive.ensembl.org
• BioMart: http://www.ensembl.org/Multi/martview
• Downloads: ftp://ftp.ensembl.org/
• MySQL interface: ensembldb.ensembl.org
• Perl API: http://www.ensembl.org/info/software/
27 of 42
Pre! and Archive! Sites
28 of 42
BioMart Data Mining Tool
29 of 42
Downloads
ftp://ftp.ensembl.org/pub
http://www.ensembl.org/info/data/download.html
FASTA files: plain sequence
• DNA (assembly masked and unmasked)
• cDNA (Ensembl and ab initio predictions)
• Peptides (Ensembl and ab initio predictions)
• RNA (non-coding RNA predictions)
Flatfiles: annotated 1Mb slices
• EMBL format
• GenBank format
MySQL: database table dumps
30 of 42
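Once a FASTA file has been downloaded, it can be read with a few lines of code. The sketch below is a generic FASTA reader and makes no assumption about the layout of Ensembl header lines; the function name `read_fasta` is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it works on an open file handle, for instance `with open(path) as handle: records = list(read_fasta(handle))`, where `path` points at an uncompressed download.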
MySQL
SQL = Structured Query Language
Needed:
• MySQL client program
http://www.mysql.com
• Ability to write MySQL queries
• Knowledge of database schema
31 of 42
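As a rough illustration of the MySQL route, the sketch below only builds the command line for the standard `mysql` client and does not connect to anything. The host name comes from the slides; the `anonymous` user name and the `homo_sapiens_core_%` database name pattern used in the usage note are assumptions that should be checked against the current Ensembl documentation, and `build_mysql_command` is a hypothetical helper.

```python
import shlex
from typing import List

HOST = "ensembldb.ensembl.org"  # public database server named on the slides
USER = "anonymous"              # assumption: read-only account without password


def build_mysql_command(sql: str, host: str = HOST, user: str = USER,
                        database: str = "") -> List[str]:
    """Return the argument list for running one SQL statement with the mysql client."""
    command = ["mysql", f"--host={host}", f"--user={user}", f"--execute={sql}"]
    if database:
        command.append(database)  # optional database name goes last
    return command


if __name__ == "__main__":
    # Print a copy-and-paste shell command instead of executing it.
    print(shlex.join(build_mysql_command("SHOW DATABASES LIKE 'homo_sapiens_core_%'")))
```

If the MySQL client is installed, the returned list can be passed to `subprocess.run`. Writing useful queries still requires knowledge of the database schema, as the slide notes.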
Perl API
API = Application Programming Interface
Needed:
• BioPerl modules
• Ensembl modules
• Ability to code in Perl
For more information (installation instructions,
tutorials, documentation etc.):
http://www.ensembl.org/info/software/index.html
32 of 42
Ensembl BLAST
WU-BLAST 2.0:
• search against assemblies, Ensembl predictions or ab
initio predictions
BLAT and SSAHA2:
• BLAT: BLAST-Like Alignment Tool
• SSAHA2: Sequence Search and Alignment by Hashing Algorithm
• very fast
• search against assemblies for (almost) exact DNA-DNA
matches
Search against one or multiple species
Search max. 30 sequences simultaneously
33 of 42
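The limit of 30 sequences per search means that larger query sets have to be submitted in several rounds. A minimal sketch of such batching follows; the function name `batch_queries` is hypothetical and the code does not submit anything to Ensembl.

```python
from typing import Iterator, List, Sequence

MAX_QUERIES = 30  # maximum number of sequences per search, as stated on the slide


def batch_queries(sequences: Sequence[str],
                  size: int = MAX_QUERIES) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` query sequences."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(sequences), size):
        yield list(sequences[start:start + size])
```

With 65 query sequences this produces three batches of 30, 30 and 5 sequences.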
Ensembl Accounts
• Personalise Ensembl by saving
bookmarks, view configurations
and homepage preferences in a
user account
• Share bookmarks and
configurations by setting up
groups
Please note that all Ensembl data remains free access. It is
not necessary to register in order to gain access to Ensembl
data!
34 of 42
Website Statistics
On average 1,000,000 page impressions / week
Top 3 species:
Top 3 countries:
35 of 42
Ensembl – Open Source
• Data and software freely available
• More than 50 installs worldwide
• Academia and industry
• Local or available via the web
• Mirrors with Ensembl data, e.g.
http://ensembl.genome.tugraz.at/index.html
or user projects with own data
36 of 42
Powered by Ensembl
37 of 42
What If I Need Help?
• Helpdesk:
[email protected]
• Workshops on use of the browser or the API
• Mailing lists:
[email protected]
[email protected]
• ‘Geek for a week’ program
• Animated tutorials
http://www.ensembl.org/common/Workshops_Online
38 of 42
Ensembl Team
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl
BioMart: Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley
Distributed Annotation System (DAS): Eugene Kulesha, Andy Jenkinson
Outreach & QC: Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich
Web Team: James Smith, Fiona Cunningham, Anne Parker, Bethan Pritchard, Stephen Rice, Steve Trevanion
Comparative Genomics: Javier Herrero, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Albert Vilella
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Felix Kokocinski, Jan-Hinnerck Vogel, Simon White
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Zebrafish Annotation: Kerstin Howe, Tina Eyre, Ian Sealy
Vectorbase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Dace Ruklisa, Daniel Zerbino
39 of 42
40 of 42
Ensembl Team on the river Cam, 2006
41 of 42
Ewan Birney
Q U E S T I O N S
A N S W E R S
42 of 42