Information Encoding in Biological Molecules: DNA and

http://creativecommons.org/licenses/by-sa/2.0/
Lecture/Lab 7.3
Ensembl
Database and Web Browser
Erin Pleasance
Canada’s Michael Smith Genome Sciences Centre, Vancouver
www.ensembl.org
What is Ensembl?
• Joint project of EBI and Sanger
• Automated annotation of eukaryotic genomes
• Open source software
• Relational database system
• Web interface
“The main aim of this campaign is to encourage
scientists across the world - in academia,
pharmaceutical companies, and the biotechnology and
computer industries - to use this free information.”
- Dr. Mike Dexter, Director of the Wellcome Trust
Ensembl components
Data (search tools):
• Chromosomes (ChromoView, KaryoView, CytoView, MapView)
• Diseases (DiseaseView)
• SNPs and Haplotypes (SNPView, GeneSNPView, HaploView, LDView)
• Functions (GOView)
• Sequence Similarity (BLAST, SSAHA)
• Genes (GeneView, TransView, ExonView, ProtView)
• Families (DomainView, FamilyView)
• Genome Sequence (ContigView)
• Markers (MarkerView)
• Comparative Genomics (ContigView, MultiContigView, SyntenyView, GeneView)
• Text (TextView)
• Other Annotations
• Anything (EnsMart)
Species in Ensembl
• Focus on vertebrates
• No fungi/plants
• Arabidopsis genome browser based on Ensembl at http://atensembl.arabidopsis.info/
Vertebrates
• Mammals: Human, Chimp, Mouse, Rat, Dog, Cow, Opossum
• Fish: Zebrafish, Fugu Pufferfish, Tetraodon Pufferfish
• Other: Chicken, Frog
Invertebrates
• Insects: Fruitfly, Mosquito, Honeybee
• Other: Nematode
Ensembl Gene Annotation
• “Basis for initial analysis and publication of
most vertebrate genomes”
• Genome assembly from NCBI
• Gene build system
– Targeted gene builds predict known genes
– Similarity gene builds predict novel genes
Curwen et al, Genome Res 14: 942-950, 2004
Targeted gene build
• Align known proteins with pmatch and BLAST
• Incorporate aligned cDNA sequences to find
splice sites, UTRs with genewise
[Figure: ContigView of best in genome gene with associated evidence. Labels: known gene (p53), UTRs predicted, proteins aligned, UniGene clusters aligned, cDNAs aligned]
Similarity gene build
• Identify novel exons ab initio using Genscan
• Confirm exons by BLAST to known proteins,
mRNAs, UniGene clusters
[Figure: ContigView of homology gene with associated evidence. Labels: novel gene, UniGene clusters aligned, proteins aligned, Genscan predictions]
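The confirmation step in the similarity gene build can be pictured as an interval-overlap filter. The following is a minimal sketch only, not the Ensembl pipeline: it assumes exons and evidence alignments are already reduced to (start, end) coordinates on one strand of one sequence, and the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

# (start, end) in 1-based inclusive genome coordinates (an assumption of this sketch)
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirm_exons(predicted: List[Interval], evidence: List[Interval]) -> List[Interval]:
    """Keep only predicted exons overlapped by at least one evidence alignment."""
    return [exon for exon in predicted
            if any(overlaps(exon, hit) for hit in evidence)]


# Hypothetical coordinates, for illustration only.
genscan_exons = [(100, 200), (500, 650), (900, 1000)]
blast_hits = [(150, 210), (880, 950)]

confirmed = confirm_exons(genscan_exons, blast_hits)
print(confirmed)
```

Here the middle exon is dropped because no protein, mRNA or UniGene hit supports it. A real build would also weigh alignment quality and strand, which this sketch ignores.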
Ensembl Gene Annotation
• Resulting “Ensembl genes” are highly accurate with
low false positive rates
• Ensembl human gene identifiers are 95% stable
between builds
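The slide does not say how identifier stability is measured. One plausible reading, used here purely as an assumption, is the fraction of gene identifiers from the previous build that are still present in the new build. The identifiers below are made-up placeholders, not real Ensembl IDs.

```python
from typing import Iterable


def id_stability(old_ids: Iterable[str], new_ids: Iterable[str]) -> float:
    """Fraction of identifiers in the old build that survive into the new build."""
    old, new = set(old_ids), set(new_ids)
    if not old:
        raise ValueError("old build has no identifiers")
    return len(old & new) / len(old)


# Placeholder identifiers: 20 in the old build, 19 of them retained.
old_build = [f"G{i:04d}" for i in range(20)]
new_build = old_build[:19] + ["G9999"]

stability = id_stability(old_build, new_build)
print(f"{stability:.0%} of identifiers retained")
```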
Manually curated genes: VEGA
• Some chromosomes contain manually curated genes from VEGA database
• Otter database/server allows integration of automatic and manual annotations (e.g. from Apollo)
[Figure label: VEGA gene]
Ensembl EST genes
• ESTs not accurate enough to produce
Ensembl genes, but important especially for
identifying alternative transcripts
• ESTs aligned to genome and merged to
create an independent set of “EST genes”
[Figure labels: known gene, EST genes, UniGene clusters aligned]
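The "aligned to genome and merged" step can be illustrated with plain interval merging. This is a rough sketch under strong assumptions: each EST alignment is treated as a single (start, end) span on one strand, and overlapping spans are collapsed into one cluster. The actual EST gene build works on exon structures, which this toy version does not model. All names and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive


def merge_alignments(alignments: List[Interval]) -> List[Interval]:
    """Merge overlapping alignment spans into non-overlapping clusters."""
    merged: List[Interval] = []
    for start, end in sorted(alignments):
        if merged and start <= merged[-1][1]:
            # Overlaps the current cluster: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: start a new cluster.
            merged.append((start, end))
    return merged


# Hypothetical EST alignment spans, for illustration only.
est_alignments = [(120, 300), (250, 420), (1000, 1200), (400, 450), (1150, 1300)]
est_clusters = merge_alignments(est_alignments)
print(est_clusters)
```

The five spans collapse into two clusters, which is the sense in which merged ESTs form an independent gene set.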
Pseudogenes
• Processed pseudogenes are identified in the annotation (lack of introns, frameshifts, presence of a multi-exon version elsewhere in the genome, etc.)
[Figure label: pseudogene]
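The criteria listed above can be combined into a simple flagging rule. The sketch below is an illustration only: how Ensembl actually weighs these signals is not stated on the slide, so the rule (intronless, plus at least one further sign) and every name in the code are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneModel:
    name: str
    exon_count: int
    has_frameshift: bool
    multi_exon_homolog_elsewhere: bool


def looks_like_processed_pseudogene(model: GeneModel) -> bool:
    """Flag intronless models that show a frameshift or have a multi-exon copy elsewhere."""
    intronless = model.exon_count == 1
    return intronless and (model.has_frameshift or model.multi_exon_homolog_elsewhere)


# Hypothetical gene models, for illustration only.
models: List[GeneModel] = [
    GeneModel("modelA", exon_count=1, has_frameshift=True, multi_exon_homolog_elsewhere=True),
    GeneModel("modelB", exon_count=5, has_frameshift=False, multi_exon_homolog_elsewhere=False),
    GeneModel("modelC", exon_count=1, has_frameshift=False, multi_exon_homolog_elsewhere=False),
]

flagged = [m.name for m in models if looks_like_processed_pseudogene(m)]
print(flagged)
```

Requiring a second sign keeps genuine single-exon genes (modelC) from being flagged on intron count alone.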
Noncoding RNA Genes
• Genes with no ORFs that are functional (tRNAs,
rRNAs, miRNAs …)
• 7220 annotations from Sean Eddy and Tom Jones
[Figure labels: miRNAs, coding gene]
Example 1: Exploring Caspase-3
• Aim to demonstrate basic browsing and views
• Caspase-3 is a gene involved in apoptosis
(cell suicide)
• We will look at:
– Gene annotation
– SNPs
– Orthologs and genome alignments
– Alternative transcripts and EST genes
Example 1: Exploring Caspase-3
http://www.ensembl.org
Go to human homepage
Species-specific homepage
[Screenshot callouts: site map; statistics of current release]
Finding the tool/view: Site Map
Click Back to species-specific homepage
Text Search: Gene, caspase-3
GeneView
[Screenshot callouts: links to ContigView, ExportView, SNPView, and to TransView, ProteinView, ExonView of transcript]
GeneView
Orthologs predicted by sequence similarity and synteny
GeneDAS: Get data from external sources
GeneView
On the same page, information provided for each transcript individually
Links to external databases
GeneView
GeneSNPView
Other SNP/Haplotype tools
• SNPView
• ProteinView (protein sequence with SNP markup)
• LDView: View linkage disequilibrium (only limited regions)
• HaploView: View haplotypes (only limited regions)
Click Back to GeneView
ContigView
[Screenshot callouts: chromosome and bands; sequence contigs]
ContigView: Detailed View
[Screenshot callouts:]
• See other tracks, options in menus
• Gene annotations
• Genscan predictions
• Targeted gene predictions (2 alternative transcripts)
• EST genes
• Other tracks: aligned sequences etc.
ContigView
MultiContigView
[Screenshot callouts: DNA sequence homology; rat ortholog]
Other Comparative Genomics Tools
• Saw gene orthology, DNA homology
• Other view is SyntenyView
• Also access comparative genomics through EnsMart
Data Mining with EnsMart
• Allows very fast, cross-data source querying
• Search for genes (features, sequences, etc.) or
SNPs based on
– Position; function; domains; similarity; expression;
etc.
• Accessible from Ensembl website (MartView) as
well as stand-alone
• Extremely powerful for data mining
Example 2: EnsMart
• A new disease locus has been mapped
between markers D21S1991 and D21S171. It
may be that the gene involved has already
been identified as having a role in another
disease. What candidates are in this region?
Example 2: EnsMart
• EnsMart is based on BioMart
• http://www.ensembl.org/Multi/martview
OR
• http://www.ebi.ac.uk/BioMart/martview
EnsMart: Choosing your dataset
EnsMart: Filtering
[Screenshot: region filter set to chromosome 21, between markers D21S1991 and D21S171]
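Conceptually, the region filter keeps every gene on chromosome 21 that overlaps the interval bounded by the two markers. The sketch below shows that logic offline and is not EnsMart itself: the gene names, gene coordinates and marker positions are placeholders invented for illustration, since the slides do not give the real coordinates of D21S1991 and D21S171.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int


def genes_between(genes: List[Gene], chromosome: str, left: int, right: int) -> List[Gene]:
    """Return genes on the chromosome that overlap the interval between two positions."""
    lo, hi = min(left, right), max(left, right)
    return [g for g in genes
            if g.chromosome == chromosome and g.end >= lo and g.start <= hi]


# Placeholder marker positions (NOT the real coordinates of these markers).
marker_positions = {"D21S1991": 10_000, "D21S171": 20_000}

# Hypothetical gene table, for illustration only.
gene_table = [
    Gene("geneA", "21", 1_000, 5_000),
    Gene("geneB", "21", 12_000, 18_000),
    Gene("geneC", "21", 40_000, 45_000),
    Gene("geneD", "7", 12_500, 13_000),
    Gene("geneE", "21", 19_000, 25_000),
]

candidates = genes_between(gene_table, "21",
                           marker_positions["D21S1991"],
                           marker_positions["D21S171"])
print([g.name for g in candidates])
```

Genes that only partly overlap the interval (geneE) are kept, on the assumption that a candidate search should err on the side of inclusion.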
EnsMart: Output
Note you can output different types of information, e.g. sequences
EnsMart: Output
Sequence Similarity Searching
• Use SSAHA for exact matches (fast)
• Use BLAST for more distant similarity (slow)
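Why hash-based lookup is fast for exact matches can be seen in a toy k-mer index. This is a simplified stand-in and not SSAHA's actual algorithm: it indexes every overlapping k-mer of a subject sequence once, then answers each query with a dictionary lookup followed by a direct string comparison. Function names and sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(subject: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the subject to the 0-based positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return dict(index)


def find_exact_matches(query: str, subject: str,
                       index: Dict[str, List[int]], k: int) -> List[int]:
    """Return 0-based subject positions where the whole query matches exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    # Look up the query's first k-mer, then verify the full query at each candidate.
    return [pos for pos in index.get(query[:k], [])
            if subject.startswith(query, pos)]


# Toy sequences, for illustration only.
subject_seq = "ACGTACGTTAGCACGT"
kmer_size = 3
kmer_index = build_kmer_index(subject_seq, kmer_size)

print(find_exact_matches("ACGT", subject_seq, kmer_index, kmer_size))
print(find_exact_matches("TAGC", subject_seq, kmer_index, kmer_size))
```

The index is built once and reused, so each query costs a lookup plus a few comparisons. A query with mismatches finds nothing here, which is where an alignment tool such as BLAST is the better choice.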
Finding anything else: Help
DAS: Getting your Own Data in Ensembl
• DAS (Distributed Annotation System)
– Anyone can load data into Ensembl and allow others to view it in the same view (e.g. ContigView) as other Ensembl annotations
– Some built-in DAS sources
• http://www.ensembl.org/Docs/ldas.html
Other Ways to Access Ensembl
• MySQL database directly accessible
• APIs for Perl and Java
• Other software
– Apollo Java genome annotation viewer/editor
– Sockeye Java viewer
• You can get your own local version of Ensembl: software and data freely available
[Screenshot: Sockeye]
For more information
• Publications (listed at http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblPublications.html)
– Ensembl Special: Genome Research May 2004
– Ensembl updates: NAR Jan. 2002-2005
– EnsMart: Kasprzyk et al, Genome Res Jan. 2004
• Documentation on how to download software and database:
– http://www.ensembl.org/Docs/
Exercises
• Homologues of human genes are often present in Fugu rubripes in
more condensed form (with shorter introns). Is this true for the gene
PTEN, a tumor suppressor often mutated in advanced cancers?
– Try MultiContigView; can you think of another way to get this
information as well?
• The microRNA bantam regulates the Drosophila (fruitfly) gene hid by
binding the 3’ UTR. Hid is involved in apoptosis, and it is possible
that binding sites for bantam could be found in the 3’ UTR of other
apoptosis genes as well. Obtain the 3’ UTR sequence of all
Drosophila genes known to be involved in apoptosis.
– Hint: in EnsMart, filter on the GO term for apoptosis (GO:0006915) with evidence code TAS
• The file “PCR_product.txt” contains the sequence of a PCR product
amplified from a mouse cDNA library. What gene does the product
correspond to? Does it contain the complete coding sequence of
that gene?
– Would it be better to use BLAST or SSAHA?