Transcript enstour_

A guided tour of Ensembl
• This quick tour will give you an outline view of what
Ensembl is all about.
• You will learn:
– Why we need Ensembl
– What is in the Ensembl database
– Who is behind it
– What it can tell you about the genome
– How to interpret the data on the Ensembl database web pages
Background
• The Human Genome Project (HGP) has produced the first “draft”
sequence of the human genome. This sequence is not a finished
product: it contains errors and will need much work before it can be
considered truly accurate. However, it will provide scientists with
their first overall view of the sequence of the human genome.
• Producing this draft sequence is much like assembling a huge
jigsaw puzzle. Millions of short pieces of DNA must be fitted
together to form an overall sequence of the complete genome.
• To make it more complicated, the “pieces” come from all over the
world, produced by the international collaboration of sequencing
laboratories that comprise the HGP consortium.
• As the DNA sequence is produced it is released into the public
domain by placing it in publicly accessible databases such as EMBL
and GenBank.
The Jigsaw Puzzle Genome
Modern DNA sequencing technology can only determine accurate
sequences of short stretches of DNA (less than 1000 base pairs).
Since the human genome is in excess of 3 billion base pairs long,
it has had to be sequenced in many small pieces that must be
reassembled afterwards. The pieces are reassembled by comparing
the sequences of their ends to find overlaps, which can be used to
join them together.
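The overlap idea described above can be illustrated with a small toy example. This is only a minimal sketch, assuming error-free fragments and a simple greedy strategy; it is not the assembly method used by the HGP or by Ensembl. The function names, the minimum overlap of 3 bases and the example fragments are all made up for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest end of `left` that matches the start of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly join the two pieces that share the largest end overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best = (0, -1, -1)  # (overlap size, index of left piece, index of right piece)
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more: the pieces stay separate
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces


# Hypothetical fragments that overlap by four bases at their ends.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Real fragments contain sequencing errors and repeated regions, so genuine assembly software has to tolerate mismatches and ambiguous overlaps, which this sketch ignores.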
What is Ensembl?
• Ensembl is a joint project between EMBL-EBI and the Sanger Centre
to develop a system which automatically tracks all the sequenced
pieces of the human genome, attempts to assemble them into large
single stretches and then analyses the assembled DNA to find genes
and other features of interest to biologists and medical researchers.
• Ensembl:
– Is “fed” raw DNA sequence taken from the public DNA databases
– Puts it into a large tracking database (the “Ensembl” database)
– Joins the sequences into their proper place in the genome
– Automatically finds genes and other features in the sequence
– Presents the results on the internet for everyone to see, for free.
[Diagram labels: World DNA data; Sanger Centre; Computation; Analysis Pipeline; Ensembl Database; Map; SNP; WWW]
Why do we need Ensembl?
• Keeping track of the thousands of individual pieces of DNA
making up the human genome jigsaw puzzle is very difficult.
As the sequence is refined and mistakes are corrected in
sequencing labs around the world, the sequence of the pieces
changes. It is vitally important to keep track of these changes
accurately so that a consistent “big picture” is maintained.
• This task is extremely difficult to do manually and would
require many people. Automatic tracking via a system
such as Ensembl is quicker, cheaper and more accurate.
What’s in the Ensembl database?
• All of the human genome DNA that is currently available in
the public domain.
• Collectively, the features identified on the DNA sequence by
Ensembl are called “annotation” and mostly comprise:
– Genes. These fall into 3 general classes:
• genes that are known already from other experiments
• genes that are predicted by the Ensembl system
– Other interesting features of the DNA such as:
• SNPs (single nucleotide polymorphisms)
• Repeats (regions of simple repetitive DNA sequence)
• Regions highly similar to other sequences in the public databases
(also called “homologies”).
How does Ensembl predict genes?
• Ensembl uses specialized gene finding software called
“Genscan” to predict the location of gene sequences. The
software studies DNA sequences and identifies DNA regions
that look like they may be genes.
• These “candidate” gene sequences are then compared to the
sequence of all known genes in the public databases. If
matches are found then this provides “supporting evidence”
suggesting the predictions are accurate to some degree.
• The predicted genes are stored in the database so that they
can be retrieved later.
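The comparison step described above can be pictured with a toy example. This is not Genscan, and it is not how Ensembl performs its database comparison; it is a minimal sketch that assumes shared short subsequences (k-mers) as a stand-in for a real similarity search. The function names, the value of k, the threshold and the sequences are all hypothetical.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """All distinct subsequences of length k found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def supporting_evidence(candidate: str, known_genes: dict[str, str],
                        k: int = 4, threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (name, shared fraction) for known sequences that resemble the candidate."""
    candidate_kmers = kmers(candidate, k)
    if not candidate_kmers:
        return []  # candidate is shorter than k, so nothing can be compared
    hits = []
    for name, sequence in known_genes.items():
        shared = len(candidate_kmers & kmers(sequence, k)) / len(candidate_kmers)
        if shared >= threshold:
            hits.append((name, shared))
    # Strongest evidence first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical "known gene" sequences and a hypothetical candidate prediction.
known = {"geneA": "ATGGCGTACC", "geneB": "TTTTTTTTTT"}
print(supporting_evidence("ATGGCGTACC", known))
```

A candidate with no hits is not necessarily wrong; it simply has no supporting evidence among the known sequences it was compared against.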
Ensembl Naming Conventions
• Keeping stable names for “things”, such as genes, in
databases is very important. This allows scientists in
different labs around the world to be confident that they are
all referring to the same “thing”.
• Ensembl goes to great lengths to try to maintain stable
names for genes and other features in the genome. This is a
very difficult task because Ensembl is an environment where
DNA sequence is continuously changing and being
improved. Changes to the underlying DNA sequence may
cause new genes to be created, deleted, altered or merged
with one another. Wherever possible the names are
maintained when the DNA sequence is revised. Ensembl
keeps a “version” number for many things so changes can
be tracked over time.
• Ensembl identifiers look like:
– “ENSG00000XXXX” for genes
– “ENST00000XXXX” for gene transcripts
– etc.
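The identifier patterns above lend themselves to a simple check in software. The following is a minimal sketch, assuming that an identifier is "ENS", then one letter for the feature type (G or T), then digits, optionally followed by a ".version" suffix. The suffix form, the function name and the example identifiers are assumptions made for illustration, not something stated in this tour.

```python
import re

# Assumed layout: "ENS" + feature-type letter + digits + optional ".version".
STABLE_ID = re.compile(r"^ENS(?P<kind>[GT])(?P<number>\d+)(?:\.(?P<version>\d+))?$")
KINDS = {"G": "gene", "T": "transcript"}


def parse_stable_id(identifier: str):
    """Split an identifier into its feature type, number and optional version."""
    match = STABLE_ID.match(identifier)
    if match is None:
        return None  # does not look like a gene or transcript identifier
    version = match.group("version")
    return {
        "kind": KINDS[match.group("kind")],
        "number": int(match.group("number")),
        "version": int(version) if version is not None else None,
    }


# Made-up identifiers, used only to exercise the pattern.
print(parse_stable_id("ENSG00000012345"))
print(parse_stable_id("ENST00000012345.2"))
```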
The Ensembl Website
• The Ensembl website is at: http://www.ensembl.org/
• It provides a quick and easy way to browse the contents of
the Ensembl database or find specific items of interest.
• There are a number of main entry points into the Ensembl
database:
– DNA similarity searches (“BLAST” searches). This is useful if you already
have a DNA or protein sequence and you want to see if anything similar
exists in the Ensembl database.
– Browse from the chromosome level all the way down to the DNA sequence
level.
– Ensembl identifier search. If you already have an ID number you can
search for it directly.
– Known gene names.
– OMIM diseases.
– Free text search of OMIM, SWISSPROT and InterPro annotation.
Browsing Ensembl
From the Ensembl home page, click on the picture of a chromosome
you are interested in.
Browsing Chromosome Maps
Feature Density Plots
The chromosome view shows a picture of the chromosome and graphical
representations of features on the chromosome. Click anywhere on the
image to see a magnified view of that region.
Browsing Contig Displays
In addition to sequence displays a map of DNA
fragments is shown giving the location of genes.
Location on the chromosome
Each display is a magnified view of the red window in the display above.
1Mb overview of the region
Landmark map markers
Gene positions are shown under the map
Use these buttons to move and resize your view
Use these menus to reconfigure your view and access advanced features.
Adjacent contigs are shown in alternating blue
Using Contig Overview
The region of interest is the area surrounded by the
red window. The ContigView Overview display always
shows 1Mb around this region. Clicking anywhere
within this display will re-center the view on the point clicked.
Using Contig Detailed View (1)
Holding your mouse over features in the detailed view
will pop up a menu through which you can access
detailed information about those features.
EMBL annotation
Homologies to other known sequences
Mouse trace alignments
Simple sequence repeats
Clone tiling path
Known Ensembl transcript
Sequence length
Features on forward DNA strand
Features on reverse DNA strand
“Unstranded” features
Using Contig Detailed View (2)
Menus at the top of the detailed
display control the features which
are displayed.
You may also export the region in a
variety of formats or view the region
using other genome browsers.
Adding External Data to ContigView via DAS
DAS provides a system for adding user-defined data to Ensembl
displays. An external server serves features which may be
layered onto the Ensembl ContigView.
External Sources
1. Access the “DAS sources” menu.
2. Enter your DAS server and add your sources
3. Manage your existing sources
4. Refresh your ContigView Display
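To give a feel for what a DAS server is asked when features are layered onto a display, the sketch below builds a features request URL. It assumes the request form of the DAS 1.x protocol (a data source name followed by "features" and a "segment" parameter giving chromosome, start and end); treat that form as an assumption to check against the DAS specification. The data source name "example_source" and the function name are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode


def das_features_url(server: str, source: str, chromosome: str,
                     start: int, end: int) -> str:
    """Build a features request for one region (assumed DAS 1.x style)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Keep ":" and "," readable in the segment value instead of percent-encoding them.
    query = urlencode({"segment": f"{chromosome}:{start},{end}"}, safe=":,")
    return f"{server.rstrip('/')}/{source}/features?{query}"


# "example_source" is a placeholder, not a real data source name.
print(das_features_url("http://servlet.sanger.ac.uk/das/", "example_source", "1", 1000, 2000))
```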
Interpreting Marker Information
View marker sequence
View chromosome map
Database where markers are stored
Marker flanking sequences
Gene Views
Clicking on a gene displays detailed
information about gene structure
Predicted properties
Gene structure
Transcript context
Transcript cDNA sequence
Supporting evidence leading to the prediction of this gene
Supporting Evidence For Genes
Evidence supporting the prediction of a gene is ordered
according to its reliability. Reliable data is shown in green.
Lower reliability evidence is shown in grey.
Supporting data source
Supporting data ID
Summary of data
Diagrammatic representation of which part of the gene prediction
this evidence supports.
Further Information
The Ensembl Project:
http://www.ensembl.org/
Ensembl Trace Server:
http://trace.ensembl.org/
Ensembl Distributed Annotation Server:
http://servlet.sanger.ac.uk/das/
Human Genome Central Resources:
http://www.ensembl.org/genome/central/