Transcript Comparison between Human and Mouse genomes

Alternative Splicing from ESTs
Eduardo Eyras
Bioinformatics UPF – February 2004
Intro
ESTs
Prediction of Alternative Splicing from ESTs
[Figure: double-stranded DNA (5’ to 3’) with exons and introns → Transcription → pre-mRNA → Splicing → Mature mRNA (5’ CAP, poly-A tail) → Translation → Peptide]
[Figure: the same gene (exons and introns) → Transcription → pre-mRNA → Different Splicing → Mature mRNA (5’ CAP, poly-A tail) → Translation → Different Peptide]
Alt splicing as a mechanism of gene regulation
Functional domains can be added/subtracted → protein diversity
Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
It can modify the activity of transcription factors, affecting the expression of genes
It is observed in nearly all metazoans
Estimated to occur in 30%-40% of human genes
Forms of alternative splicing
Exon skipping / inclusion
Alternative 3’ splice site
Alternative 5’ splice site
Mutually exclusive exons
Intron retention
Constitutive exon
Alternatively spliced exons
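As an illustration of the first form listed above, exon skipping can be spotted by comparing the exon coordinates of two transcripts of the same gene. This is only a minimal sketch, not the method used in this work: the function names and the coordinate convention (sorted lists of (start, end) exons on the forward strand) are assumptions, and the coordinates are made up.

```python
def introns(exons):
    """Return the introns implied by a sorted list of (start, end) exons."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def skipped_exons(long_form, short_form):
    """Return the internal exons of long_form that short_form skips.

    An internal exon counts as skipped when short_form has a single intron
    that joins the exon before it directly to the exon after it.
    """
    short_introns = set(introns(short_form))
    skipped = []
    for i in range(1, len(long_form) - 1):
        bridging_intron = (long_form[i - 1][1], long_form[i + 1][0])
        if bridging_intron in short_introns:
            skipped.append(long_form[i])
    return skipped


# Hypothetical coordinates, for illustration only.
inclusion_form = [(100, 200), (300, 350), (500, 600)]
skipping_form = [(100, 200), (500, 600)]
print(skipped_exons(inclusion_form, skipping_form))
```

The other forms (alternative 5’ or 3’ splice sites, intron retention, mutually exclusive exons) would need their own coordinate tests.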
How to study alternative splicing?
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
It may contain coding and/or non-coding regions
ESTs
Cells from a specific organ, tissue or developmental stage
→ mRNA extraction (poly-A tail at the 3’ end)
→ Add oligo-dT primer, which pairs with the poly-A tail
→ Reverse transcriptase: RNA/DNA hybrid
→ Ribonuclease H and DNA polymerase: the RNA strand is replaced by a second DNA strand
→ Double stranded cDNA
ESTs
Clone the double stranded cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads from the ends of each clone: 5’ EST and 3’ EST
Alternative Splicing from ESTs
[Figure: Genomic → Primary transcript → Splicing → Splice variants → cDNA clones → EST sequences (5’ and 3’)]
Alternative Splicing from ESTs
ESTs can also provide information about potential
alternative splicing when aligned to the genome (and
when aligned to mRNA data)
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs
(DB searches)
Known gene
Similar to known gene
Contaminant
Novel gene
ESTs provide expression data
eVOC Ontologies
http://www.sanbi.ac.za/evoc/
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are: digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample. Examples are: normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are: pooled, pooled donor and pooled tissue.
Linking the expression vocabulary to gene
annotations
ESTs
Genes
Normalized vs. non-normalized libraries
The downside of ESTs
Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
Random sampling: the more ESTs we sequence, the fewer new useful sequences we will get
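The diminishing return of random sampling can be made concrete with a small calculation. The sketch below assumes that each EST is drawn independently and that a gene with relative expression p is hit with probability p per EST, so the expected number of distinct genes seen after n ESTs is the sum of 1 - (1 - p)^n over genes. The expression profile is hypothetical and the function name is ours.

```python
def expected_genes_seen(frequencies, n_ests):
    """Expected number of distinct genes hit by at least one of n_ests ESTs.

    frequencies: relative expression of each gene (should sum to 1).
    Assumes independent draws, which is a simplification of a real library.
    """
    return sum(1.0 - (1.0 - p) ** n_ests for p in frequencies)


# Hypothetical library: 10 highly expressed genes take 90% of the transcripts,
# 990 rarely expressed genes share the remaining 10%.
frequencies = [0.09] * 10 + [0.1 / 990] * 990

for n in (1_000, 10_000, 100_000):
    print(n, round(expected_genes_seen(frequencies, n), 1))
```

Under these assumptions each additional batch of ESTs adds fewer new genes than the previous one, which is the motivation for normalized libraries.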
Gene Hunting
Sequencing of the Human
Genome (HGP)
EST Sequencing
Origin of the ESTs
Science. 1991 Jun 21;252(5013):1651-6
Complementary DNA sequencing: expressed sequence tags and human
genome project.
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H,
Merril CR, Wu A, Olde B, Moreno RF, et al.
Section of Receptor Biochemistry and Molecular Biology, National Institute of
Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.
Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain
complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in
the discovery of new human genes, mapping of the human genome, and identification of coding regions in
genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant
similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin,
Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to
chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA
characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of
complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological
research fields.
EST-sequencing explosion
→ non-exclusivity (1992)
Merck and WashU (1994)
→ public ESTs
→ GenBank
→ dbEST
dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism
Organism                                   Entries
Homo sapiens (human)                       5,472,005
Mus musculus + domesticus (mouse)          4,056,481
Rattus sp. (rat)                             583,841
Triticum aestivum (wheat)                    549,926
Ciona intestinalis                           492,511
Gallus gallus (chicken)                      460,385
Danio rerio (zebrafish)                      450,652
Zea mays (maize)                             391,417
Xenopus laevis (African clawed frog)         359,901
…
EST lengths
Human EST length distribution (dbEST, Sep. 2003)
~450 bp
Recover the mRNA from the ESTs
What is an EST cluster?
A cluster is a set of fragmented EST data (plus mRNA data if known),
consolidated according to sequence similarity
Clusters are indexed by gene such that all expressed data concerning
a single gene is in a single index class, and each index class contains
the information for only one gene.
(Burke, Davison, Hide, Genome Research 1999).
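To make the definition above concrete, here is a minimal single-linkage clustering sketch using union-find. It is not the procedure used by UniGene, TIGR or StackDB: it assumes the pairs of similar sequences were already found by some alignment step, and all identifiers are hypothetical.

```python
def cluster_ests(sequence_ids, similar_pairs):
    """Group sequences so that any two linked by a chain of similar pairs
    end up in the same cluster (single linkage)."""
    parent = {seq: seq for seq in sequence_ids}

    def find(seq):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for seq in sequence_ids:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(clusters.values())


# Hypothetical input: four ESTs and one known mRNA.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
pairs = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster_ests(ids, pairs))
```

Single linkage is the simplest choice and also the most prone to joining unrelated genes through chimeric or repeat-containing ESTs, which is why pre-processing matters.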
EST pre-processing
Vector
Repeats
Mitochondrial
Xenocontaminants
EST Clustering
UniGene (NCBI)
www.ncbi.nlm.nih.gov/UniGene
TIGR Human Gene Index
(The Institute for Genomic Research)
www.tigr.org
StackDB
(South African Bioinformatics Institute)
www.sanbi.ac.za
UniGene
Species               UniGene Entries
Homo sapiens          118,517
Mus musculus           82,482
Rattus norvegicus      43,942
Sus scrofa             20,426
Gallus gallus          11,970
Xenopus laevis         21,734
Xenopus tropicalis     17,102
…
ESTs and the Genome
ESTs aligned to the genome
Some advantages:
• It defines the location of exons and introns
• We can verify the splice sites of introns (e.g. GT-AG)
→ hence also check the correct strand of spliced ESTs
• It helps prevent chimeras
• It can avoid putting together ESTs from paralogous genes
• We can prevent including pseudogenes in our analysis
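The splice-site check in the list above also gives the strand: a GT-AG intron of a gene on the reverse strand reads CT-AC on the forward genomic sequence. A minimal sketch, assuming 0-based, end-exclusive intron coordinates on the forward strand and testing only the canonical GT-AG pair; the function names and the toy sequence are ours.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_strand(genome, intron_start, intron_end):
    """Infer the strand of a spliced alignment from one intron.

    Returns "+" for GT...AG on the forward strand, "-" for GT...AG on the
    reverse strand, and None when neither reading is canonical.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        return None
    if intron.startswith("GT") and intron.endswith("AG"):
        return "+"
    flipped = reverse_complement(intron)
    if flipped.startswith("GT") and flipped.endswith("AG"):
        return "-"
    return None


# Toy sequence: exon, GT...AG intron, exon.
genome = "ACGACG" + "GTAAGTCCCTTTTCAG" + "GATTACA"
print(intron_strand(genome, 6, 22))
print(intron_strand(reverse_complement(genome), 7, 23))
```

An alignment whose introns give no consistent answer is a candidate for rejection or for closer inspection.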
Aligning ESTs to the Genome
Many ESTs → fast programs, fast computers
Nearly exact matches: Coverage >= 97%, Percent_id >= 97%
Splice sites: GT-AG, AT-AC, GC-AG
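The two thresholds above translate into a simple filter. The sketch below assumes that coverage is the percentage of EST bases included in the alignment and that percent identity is the percentage of aligned bases matching the genome; the slide does not spell these definitions out, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EstAlignment:
    est_id: str
    est_length: int      # length of the EST in bp
    aligned_bases: int   # EST bases included in the alignment
    matching_bases: int  # aligned bases identical to the genome

    @property
    def coverage(self):
        return 100.0 * self.aligned_bases / self.est_length

    @property
    def percent_id(self):
        if self.aligned_bases == 0:
            return 0.0
        return 100.0 * self.matching_bases / self.aligned_bases


def passes_thresholds(aln, min_coverage=97.0, min_percent_id=97.0):
    """Keep only nearly exact matches."""
    return aln.coverage >= min_coverage and aln.percent_id >= min_percent_id


# Hypothetical alignments.
alignments = [
    EstAlignment("est_a", 450, 445, 440),  # passes both thresholds
    EstAlignment("est_b", 450, 400, 398),  # coverage too low
    EstAlignment("est_c", 450, 448, 420),  # identity too low
]
kept = [aln.est_id for aln in alignments if passes_thresholds(aln)]
print(kept)
```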
Aligning ESTs to the Genome
Extra pre-processing of ESTs:
Clip poly-A tails
Clip 20 bp from either end
Keep the best alignment in the genome
Remove potential processed pseudogenes
Give preference to ESTs that are spliced
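The clipping step can be sketched as follows. The 20 bp end clip comes from the slide; the minimum tail length of 10 and the handling of a leading poly-T run (as seen in 3’ reads) are assumptions made for this example, and the function name is ours.

```python
import re


def clip_est(sequence, min_tail=10, end_clip=20):
    """Remove a poly-A tail (or a leading poly-T run) and then clip
    end_clip bases from either end. Returns "" if nothing is left."""
    seq = sequence.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)  # trailing poly-A
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)  # leading poly-T
    if len(seq) <= 2 * end_clip:
        return ""
    return seq[end_clip:len(seq) - end_clip]


# Toy read: 20 bp flank, 10 bp core, 20 bp flank, 15 bp poly-A tail.
read = "C" * 20 + "ACGTACGTAC" + "G" * 20 + "A" * 15
print(clip_est(read))
```

Clipping the ends discards the lowest-quality part of a single-pass read before the alignment thresholds are applied.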
Human ESTGenes
Genomic length distribution of aligned human ESTs
~400 bp
Tail up to ~800 kb
The Problem
ESTs
Genome
What are the transcripts represented in this
set of mapped ESTs?
Predict Transcripts from ESTs
ESTs
Transcript predictions
Merge ESTs according to splicing structure compatibility
Representation
Any two ESTs in a Genomic Cluster may represent the same splicing (redundant) or not
The redundancy relation is a graph with two kinds of edges: Extension and Inclusion
[Figure: examples of extension and inclusion among ESTs x, y and z]
Sort by the smallest coordinate ascending and by the largest coordinate descending
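A simplified sketch of the sorting and of the two relations follows. It is not the ClusterMerge implementation: it uses a strict compatibility test (identical introns in the common region, no mismatches allowed), and the function names, relation labels and coordinates are assumptions. After sorting, a compatible later EST either lies within the span of the earlier one (inclusion) or reaches beyond its end (extension).

```python
def introns(exons):
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def span(exons):
    return exons[0][0], exons[-1][1]


def sort_key(exons):
    start, end = span(exons)
    return (start, -end)  # smallest coordinate ascending, largest descending


def compatible(x, y):
    """Strict test: x and y overlap and show the same introns in the
    region they have in common."""
    start = max(span(x)[0], span(y)[0])
    end = min(span(x)[1], span(y)[1])
    if start >= end:
        return False

    def introns_in_common_region(exons):
        return {i for i in introns(exons) if i[0] < end and i[1] > start}

    return introns_in_common_region(x) == introns_in_common_region(y)


def relation(x, y):
    """Relation of y to x, where x sorts before y."""
    if not compatible(x, y):
        return None
    return "inclusion" if span(y)[1] <= span(x)[1] else "extension"


# Hypothetical ESTs as lists of (start, end) exons.
ests = {
    "x": [(100, 200), (300, 400)],
    "y": [(150, 200), (300, 350)],
    "z": [(350, 400), (500, 600)],
    "w": [(100, 200), (250, 420)],
}
order = sorted(ests, key=lambda name: sort_key(ests[name]))
print(order)
print(relation(ests["x"], ests["y"]), relation(ests["x"], ests["z"]))
```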
Criteria of merging
Allow edge-exon mismatches
Allow internal mismatches
Allow intron mismatches
Transitivity
[Figure: transitivity of the extension and inclusion relations among ESTs x, y, z and w]
This reduces the number of comparisons needed
ClusterMerge graph
Each node defines an inclusion sub-tree
Extensions form acyclic graphs
[Figure: example graphs of ESTs x, y, z and w]
Recovering the Solution
Mergeable sets of ESTs can be recovered as special paths in the graph
Root: does not extend any node
Leaf: is not extended by any node and is the root of an inclusion tree
Any set of ESTs in a path from a root to a leaf is mergeable
Add the inclusion tree attached to each node in the path
[Figure: example graph with nodes 1 to 9, with the root and the leaves marked]
Lists produced:
(1,2,3,4,5,6,7,8)
(1,2,3,4,5,6,7,9)
This representation minimizes the necessary comparisons between ESTs
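The recovery step can be sketched as a walk over root-to-leaf paths. The graph below is hypothetical: the edges of the example figure are not recoverable from the text, so the topology was chosen only so that it yields the same two lists. The dictionary layout and function names are also assumptions.

```python
def inclusion_tree(node, includes):
    """The node plus everything in the inclusion tree hanging from it."""
    members = [node]
    for child in includes.get(node, []):
        members.extend(inclusion_tree(child, includes))
    return members


def root_to_leaf_paths(extended_by):
    """extended_by[n] lists the nodes that extend n. Roots extend no node;
    leaves are extended by no node."""
    extenders = {n for targets in extended_by.values() for n in targets}
    roots = [n for n in extended_by if n not in extenders]

    def walk(node, path):
        path = path + [node]
        if not extended_by[node]:
            yield path
        for nxt in extended_by[node]:
            yield from walk(nxt, path)

    for root in roots:
        yield from walk(root, [])


def mergeable_lists(extended_by, includes):
    lists = []
    for path in root_to_leaf_paths(extended_by):
        members = []
        for node in path:
            members.extend(inclusion_tree(node, includes))
        lists.append(sorted(members))
    return lists


# Hypothetical graph: extension chain 1-2-4-6 branching to 8 and 9,
# with ESTs 3, 5 and 7 included in 2, 4 and 6.
extended_by = {1: [2], 2: [4], 4: [6], 6: [8, 9], 8: [], 9: []}
includes = {2: [3], 4: [5], 6: [7]}
print(mergeable_lists(extended_by, includes))
```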
How to build the graph
Mutual Recursion
Inclusion => go up in the tree
Search graph (leaves)
Recursion search along extension branch
Search sub-graph
Example
[Figure sequence: EST 7 is placed into a graph that already holds ESTs 1 to 6. The search starts at the leaves, tests inclusion and extension along the way, and skips nodes tagged as visited.]
Possible sub-trees beyond 1 or 3 remain unseen!
The representation minimizes the necessary comparisons
Deriving the transcripts from the lists
Internal Splice Sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
Splice Sites: are set to the most common coordinate
5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most
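These rules can be sketched for the simplest case. The example assumes that every EST in the list shows the same number of exons on the forward strand, which a real mergeable list need not satisfy; the function names and coordinates are hypothetical.

```python
from collections import Counter


def most_common_coordinate(coordinates):
    return Counter(coordinates).most_common(1)[0][0]


def derive_transcript(ests):
    """ests: list of exon lists, each a sorted list of (start, end).

    Internal splice sites take the most common coordinate; the outer ends
    of the first and last exon take the most extended coordinate and never
    vote for a splice site.
    """
    n_exons = len(ests[0])
    if any(len(est) != n_exons for est in ests):
        raise ValueError("this sketch needs ESTs with the same number of exons")
    transcript = []
    for i in range(n_exons):
        starts = [est[i][0] for est in ests]
        ends = [est[i][1] for est in ests]
        start = min(starts) if i == 0 else most_common_coordinate(starts)
        end = max(ends) if i == n_exons - 1 else most_common_coordinate(ends)
        transcript.append((start, end))
    return transcript


# Hypothetical mergeable list of three ESTs.
ests = [
    [(105, 200), (300, 400), (500, 580)],
    [(100, 200), (300, 400), (500, 590)],
    [(110, 201), (300, 400), (500, 600)],
]
print(derive_transcript(ests))
```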
Single exon transcripts
Reject resulting single exon transcripts when using ESTs
Annotation with ESTs
ESTs aligned to the genome can provide information about
UTRs and alternative splicing
Annotation with ESTs
EST-Transcripts at www.ensembl.org
Annotation with ESTs
Results for Human and Mouse
Human EST-genes (assembly ncbi33):
38,581 Genes
122,247 Transcripts (42% with full CDS)
Mouse EST-genes (assembly ncbi30):
32,848 Genes
103,664 Transcripts (36% with full CDS)
How many transcripts are conserved?
Is Alternative Splicing conserved?
EST-transcript pairs
42,625 transcript pairs (in 18,242 gene pairs)
Gene pairs:
78% with one transcript pair conserved
22% with more than one transcript pair conserved
For 22% of the gene pairs some form of alt. splicing is conserved
Conservation of Alt. Splicing
Take gene-pairs with more than one transcript-pair
\[ \%\,\text{conservation} = \frac{\sum (\text{number of paired transcripts} - 1)}{\sum (\text{number of transcripts} - 1)} \]
∑ = sum over genes in a gene pair with more than one variant (subtract the ‘main’ transcript form)
19% of alt. variants in human are conserved in mouse
32% of alt. variants in mouse are conserved in human
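The conservation measure can be computed per species as sketched below. The gene-pair records are invented for illustration, and the sketch assumes a one-to-one pairing, so that the number of paired transcripts equals the number of transcript pairs in both species. The numerator is then shared and the two percentages differ only through each species' own transcript counts.

```python
def percent_conservation(gene_pairs, species):
    """Percentage of alternative variants of `species` that are conserved.

    Only gene pairs with more than one transcript pair are counted, and one
    ('main') transcript form is subtracted from each count.
    """
    numerator = 0
    denominator = 0
    for pair in gene_pairs:
        if pair["transcript_pairs"] > 1:
            numerator += pair["transcript_pairs"] - 1
            denominator += pair[species + "_transcripts"] - 1
    return 100.0 * numerator / denominator if denominator else 0.0


# Hypothetical gene pairs; the last one has a single transcript pair
# and is therefore ignored.
gene_pairs = [
    {"human_transcripts": 5, "mouse_transcripts": 3, "transcript_pairs": 2},
    {"human_transcripts": 4, "mouse_transcripts": 4, "transcript_pairs": 3},
    {"human_transcripts": 3, "mouse_transcripts": 2, "transcript_pairs": 1},
]
print(round(percent_conservation(gene_pairs, "human"), 1))
print(round(percent_conservation(gene_pairs, "mouse"), 1))
```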
How many predicted ‘novel’ genes
are validated by Human-Mouse comparison?
Novel genes
ESTGenes
Not in Ensembl
Human ESTGenes validated by
comparison to mouse
13,174
18,242
24,201
ESTGenes with at least one complete ORF
Novel genes
ESTGenes not in Ensembl
validated by comparison to mouse
984
With a complete ORF
THE END