Transcript Ensembl

The Ensembl Gene set
The “Genebuild”
21 April 2008
Outline
• The GeneBuild (determining the Ensembl gene set)
• What does it mean for the scientist?
• ‘Annotation pipeline’ vs ‘manual curation’
• Pseudogenes
• ncRNAs
• The CCDS project
2 of 32
Introduction
What is available?
I) Sequence Assemblies from genome
sequencing efforts
3 of 32
Gene Sequencing: the Assembly
This approach generates clones, unlike newer sequencing methods
http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
4 of 32
Clones Available
Human:
(Tilepath, used in the assembly)
Ciona intestinalis
Shotgun assembly
5 of 32
ContigView: Clones and Contigs
Contigs
Clones
(Plate/well numbers)
Ensembl
Transcripts
6 of 32
Task:
View the tilepath clone in ContigView
for the region containing the human
BRCA2 gene.
Hint: Start with a search for the BRCA2
gene.
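The same search can also be done programmatically. The sketch below is an assumption that goes beyond the slides: it uses the current Ensembl REST service (`rest.ensembl.org`, endpoint `lookup/symbol`), which the slides do not mention, and the helper names `lookup_url` and `lookup_gene` are hypothetical. It only looks up the gene's location; viewing the tilepath clones is still done in the browser.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ensembl REST server (not mentioned on the slides).
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the REST URL that looks up a gene by its symbol."""
    path = f"/lookup/symbol/{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}"
    return f"{REST_SERVER}{path}?content-type=application/json"


def lookup_gene(symbol, species="homo_sapiens"):
    """Fetch the gene record as a dict (requires network access)."""
    with urllib.request.urlopen(lookup_url(symbol, species)) as response:
        return json.load(response)


print(lookup_url("BRCA2"))
```

With network access, `lookup_gene("BRCA2")` should return a record that includes the chromosome and the start and end coordinates of the gene, which can then be used to open the region in the browser.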
7 of 32
The Ensembl Geneset
How does Ensembl use mRNA and
protein information along with
the sequence assembly to define
distinct genes on the genome?
Protein
Sequence Assembly
Ensembl Geneset
8 of 32
Once the Assembly is Imported…
Proteins/mRNAs are aligned.
These have been submitted to
databases such as:
UniProt (manually curated) and
RefSeq (partially manually curated)
9 of 32
The Biological Evidence
All Ensembl gene predictions are based on
experimental evidence:
UniProt/Swiss-Prot
A manually curated database and therefore of
highest accuracy
NCBI RefSeq
A partially manually curated database
UniProt/TrEMBL
Automatically annotated translations of EMBL
coding sequence (CDS) features
EMBL / GenBank / DDBJ
Primary nucleotide sequence repository
10 of 32
Database Relationship
[Diagram: individual labs' submissions; EMBL-Bank, DDBJ and GenBank; NCBI RefSeq; UniProt (Swiss-Prot and TrEMBL)]
11 of 32
Genebuild
[Diagram: EMBL-Bank, GenBank and DDBJ; sequence (assembly); proteins (e.g. Swiss-Prot); mRNA; ESTs and EST genes; manual annotation (HAVANA); Ensembl]
12 of 32
Why do I want to know?…
Ensembl genes may be based on
multiple protein/mRNAs
What is an Ensembl gene based on?
13 of 32
Task
Look at the evidence for the
human EPO gene.
What was this gene based on?
Hint: Go to Exon Information from
the GeneView page
14 of 32
EPO gene supporting evidence
15 of 32
Species-Specific GeneBuilds
Pan troglodytes genes are built by
projection from human genes.
Zebrafish has many gene
duplications.
Homo sapiens genes must have
protein evidence, not just mRNA.
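The human rule above can be pictured as a simple filter. This is a minimal sketch and not the Ensembl pipeline itself; the `GeneModel` class, its fields and the model names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    """Hypothetical gene model with the evidence types that support it."""
    name: str
    evidence: set = field(default_factory=set)  # e.g. {"protein", "mRNA"}


def passes_human_rule(model):
    """Keep a human gene model only if it has protein evidence."""
    return "protein" in model.evidence


models = [
    GeneModel("model_A", {"protein", "mRNA"}),  # protein and mRNA support
    GeneModel("model_B", {"mRNA"}),             # mRNA support only
]
kept = [m.name for m in models if passes_human_rule(m)]
print(kept)
```

Here `model_B` is dropped because mRNA alone is not enough under the human rule.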
16 of 32
Task
When was the chimpanzee (Pan
troglodytes) Genebuild
performed?
Can you find information as to
how genes were annotated?
Hint: Look on the chimpanzee
index page
17 of 32
External Gene Set: VEGA/Havana
Human, zebrafish, mouse and dog
Havana transcripts in blue or
gold…
What are Havana transcripts?
18 of 32
Havana and Ensembl match
When Havana (manual curation) and Ensembl (automatic methods) predict
the same transcript, base pair for base pair, the two transcripts are merged and
coloured gold.
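The merge rule can be sketched as a set intersection. This is an illustration under stated assumptions, not Ensembl's merge code: each transcript is represented only by its exon coordinates on the same assembly and strand, so identical coordinates stand in for "base pair for base pair", and the function name `merge_identical` and the coordinates are hypothetical.

```python
from typing import List, Tuple

# A transcript is a tuple of (start, end) exon coordinates (hypothetical model).
Exons = Tuple[Tuple[int, int], ...]


def merge_identical(havana: List[Exons], ensembl: List[Exons]):
    """Split transcripts into merged ("gold"), Havana-only and Ensembl-only sets."""
    havana_set, ensembl_set = set(havana), set(ensembl)
    merged = havana_set & ensembl_set  # identical in both sets
    return merged, havana_set - merged, ensembl_set - merged


havana = [((100, 200), (300, 400)), ((100, 200), (300, 450))]
ensembl = [((100, 200), (300, 400)), ((120, 200), (300, 400))]

merged, havana_only, ensembl_only = merge_identical(havana, ensembl)
print("gold (merged):", merged)
print("Havana only:", havana_only)
print("Ensembl only:", ensembl_only)
```

Only the transcript with exactly matching exons is merged; a difference of a single coordinate keeps the two predictions separate.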
20 of 32
Manually-curated gene sets in
Ensembl
Vega (Havana)
Homo sapiens, Danio rerio,
Mus musculus and Canis familiaris
WormBase
Caenorhabditis elegans
FlyBase
Drosophila melanogaster
SGD
Saccharomyces cerevisiae
21 of 32
What Can Go Wrong?
I) A gap in the assembly
BLAST hit (Swiss-Prot entry)
Gene might not be found in Ensembl
II) Fused genes
Gene might be associated with two names
23 of 32
Outline
The genome sequence
The Genebuild
‘manual curation’ by Havana
Other: EST gene set
Pseudogenes
ncRNAs
24 of 32
Expressed Sequence Tags vs
‘cDNA’
ESTs are annotated separately. Why?
• mRNA and cDNA used in the GeneBuild:
Sequenced to a high standard, often complete.
• EST: lower-quality sequence.
‘One-shot’ sequencing of cDNA from the 5’ and 3’ ends
creates the EST sequence.
ESTs are only 500-800 nucleotides long.
Low-quality fragments: sequence error of ~2%.
BUT ESTs provide useful expression information:
• Discovery of new genes, especially in diseased organisms
• Tissue type
• Timing/developmental stage
• Sampling of more transcripts and variants
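To put the quality figures in perspective, the short calculation below multiplies the EST length range by the ~2% error rate quoted above. It is plain arithmetic on the slide's numbers; the function name `expected_errors` is hypothetical.

```python
ERROR_RATE = 0.02  # ~2% sequence error, as quoted on the slide


def expected_errors(length, rate=ERROR_RATE):
    """Expected number of erroneous bases in a read of the given length."""
    return length * rate


for length in (500, 800):  # EST length range, as quoted on the slide
    print(f"{length} nt EST: about {expected_errors(length):.0f} erroneous bases expected")
```

So a typical EST can be expected to carry roughly 10 to 16 base errors, which is one reason ESTs are kept out of the main GeneBuild.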
25 of 32
Where Can I See This EST Geneset?
ContigView
Choose EST
genes
EST track
26 of 32
Pseudogenes: ‘False’ Genes
Processed: produced by reverse transcription of an mRNA (with its AAAAAA tail)
and re-integration into the genome
Unprocessed: produced by gene duplication and rearrangement
27 of 32
ncRNAs (non coding RNAs)
What types are in Ensembl?
tRNA (transfer RNA)
rRNA (ribosomal RNA)
scRNA (small cytoplasmic)
snRNA (small nuclear)
snoRNA (small nucleolar)
miRNA (microRNA)
28 of 32
ncRNAs (2 types)
I) RNA with low sequence homology can be
identified through conserved secondary
structure (search the genome using
Rfam patterns)
II) RNA with high sequence conservation (miRNA):
BLAST alignment, then
‘RNAfold’ applied to make sure
sequences can fold (hairpin)
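The hairpin check in II) can be illustrated with a toy test for a stem-loop. This is a deliberately simplified sketch and not what RNAfold does (RNAfold predicts a minimum free energy structure); the function name, the minimum stem and loop sizes, and the choice to allow G-U wobble pairs are all assumptions made for illustration.

```python
# Watson-Crick pairs plus G-U wobble pairs (allowing wobble is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_hairpin(seq, min_stem=4, min_loop=3):
    """Return True if seq contains a stem of at least min_stem consecutive
    base pairs enclosing a loop of at least min_loop unpaired bases."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA letters
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            k = 0  # number of stacked pairs found so far, working inward
            while (i + k) < (j - k) - min_loop and (seq[i + k], seq[j - k]) in PAIRS:
                k += 1
                if k >= min_stem:
                    return True
    return False


print(can_form_hairpin("GGGGAAAACCCC"))  # GGGG can pair with CCCC around an AAAA loop
print(can_form_hairpin("AAAAAAAAAAAA"))  # no complementary bases at all
```

A candidate miRNA that fails even a check like this could not form the hairpin precursor, which is the intuition behind applying a folding step after the BLAST alignment.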
29 of 32
ncRNAs… where can I see them?
Find them in ContigView:
or use BioMart.
30 of 32
Summary – Ensembl Genes
• All Ensembl genes are based on biological evidence
(protein and mRNA).
• One Ensembl gene may come from proteins and
mRNAs in various databases.
• Havana (manually curated) genes are incorporated
into the Ensembl gene set, and merged for human.
• The CCDS set strives for consensus coding
sequences across databases.
• Pseudogenes and ncRNAs are annotated, along with a
separate EST gene set.
31 of 32
For more on GeneBuild:
Help and Documentation
(About Ensembl)
http://www.ensembl.org/info/about/docs/genome_annotation.html
32 of 32