GA1_3_Chacko - The International Conference on


Alternative Splicing and Disease: an overview
Shoba Ranganathan
Professor and Chair – Bioinformatics
Dept. of Chemistry and Biomolecular Sciences & ARC CoE in Bioinformatics
Macquarie University
Sydney, Australia
([email protected])
Adjunct Professor
Dept. of Biochemistry
Yong Loo Lin School of Medicine
National University of Singapore, Singapore
([email protected])
Visiting scientist @ Institute for Infocomm Research (I2R), Singapore
Outline of the talk
• Background
• Determining gene architecture
• Graph theory in AS
• Whole genome analysis results
• AS and disease
Unexpectedly low number of genes in the human genome
C. elegans: 19,000 genes
Drosophila: 14,000 genes
Human: ~22,000-25,000 genes
• How can the genome of Drosophila contain fewer genes than the undoubtedly simpler organism C. elegans?
• This raises the possibility of expanded diversity leading to biological complexity
Image sources: www.utexas.edu, www.sih.m.u-tokyo.ac.jp, http://pub.tv2.no/multimedia/na/archive/
Sources of Biological complexity
With a limited number of genes:
• Enhanced regulation of genes and pathways
• Post-translational modifications
• Alternative splicing
A Genomic View
[Figure: Spliceosomal splicing (Maniatis & Tasic, 2002). Labels: mRNA sequences assembled from segments a and b; protein diversity.]
Alternative splicing
• Splicing is a regulated process that removes the non-coding sequence from transcripts to produce mRNA (Bernot, 2004).
• Contradicts a classical assumption of molecular biology:
  • One gene – one protein
Why AS?
• Protein diversity (Neverov et al., 2005).
• Form of spatial and temporal regulation (Lopez, 1995)
• Errors in splicing lead to diseases (Orengo & Cooper, 2007)
• Drug discovery (Levanon & Sorek, 2003)
Usual way of studying AS
1. One gene at a time – tedious for genomes
2. Collect intron-exon structures for all isoforms
3. Try to analyze them … again one isoform at a time and then gene by gene.
4. Unsuitable for genes with large numbers of transcripts.
Why use bioinformatics?
• Most research into alternative splicing is limited to a few genes (reductionist approach)
• Bioinformatics overcomes this by facilitating a systems biology approach:
  • Information can be obtained for all genes in a genome
  • This can be done for many genomes, allowing for comparative genomics
Where is the splicing?
• Information on the intron-exon (noncoding/coding) arrangement of a gene is essential.
• Aligning mRNA/EST sequences to their corresponding genomic sequences gives the arrangement of exons in a gene (MGAlign, Ranganathan et al., 2003; MGAlignIt, Lee et al., 2003).
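To make the idea concrete, here is a minimal sketch of how exon and intron coordinates fall out of an mRNA-to-genome alignment. It is not MGAlign's algorithm: it assumes the aligner has already reported the genomic span of each aligned block, and the coordinates below are made up for illustration.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (genomic_start, genomic_end), 1-based inclusive


def gene_structure(blocks: List[Block]) -> Tuple[List[Block], List[Block]]:
    """Return (exons, introns) from the genomic spans of aligned mRNA blocks."""
    exons = sorted(blocks)
    introns = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start <= prev_end:
            raise ValueError("aligned blocks overlap")
        if next_start - prev_end > 1:
            # The unaligned genomic gap between two exons is an intron.
            introns.append((prev_end + 1, next_start - 1))
    return exons, introns


# Hypothetical alignment blocks, including a short 9 bp internal exon.
blocks = [(2500, 2700), (101, 250), (1001, 1009)]
exons, introns = gene_structure(blocks)
print("exons:  ", exons)
print("introns:", introns)
```

Each aligned block is read as an exon and each gap between consecutive blocks as an intron, which is the arrangement the alignment step is meant to recover.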
MGAlignIt (Lee et al., 2003)
• Fast heuristic approach and highly accurate
• Capitalizes on the fact that the mRNA sequence constitutes a very small percentage of the genomic sequence
MGAlign’s “biological” alignment strategy
MGAlignIt web service: http://origin.bic.nus.edu.sg/mgalign
Benchmarking
• Dataset: human Chr 22 from the Sanger Centre (Collins et al., 2003)
• 936 annotated mRNA (5176 exons)
• 48 Mbp long human Chr 22 genomic sequence

Program  Predicted  Correct  Wrong  Missing  Misaligned  True       % of total     Time     False
         exons      exons    exons  exons    exons       positives  correct exons  (hours)  negatives
MGAlign  5175       5134     0      1        41          99.19%     99.21%         4.35     0.02%
Sim4     5157       5080     3      22       74          98.15%     98.51%         55.07    0.43%
Spidey   6535       4673     1783   424      79          71.51%     90.28%         15.30    8.19%
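As a quick arithmetic check on the benchmark figures, the sketch below recomputes the false-negative percentages. It assumes (the slide does not say so) that a false negative is a missing exon and that the denominator is the 5176 annotated exons; under that assumption the numbers match the slide.

```python
ANNOTATED_EXONS = 5176  # annotated exons on human Chr 22 (from the dataset description)

# Missing exons per program, as reported in the benchmark.
missing_exons = {"MGAlign": 1, "Sim4": 22, "Spidey": 424}


def false_negative_percent(missing: int, annotated: int = ANNOTATED_EXONS) -> float:
    """Assumed definition: missing exons as a percentage of annotated exons."""
    return round(100 * missing / annotated, 2)


for program, missing in missing_exons.items():
    print(f"{program}: {false_negative_percent(missing)}%")
```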
Some successes
• Short internal exons (exon 2: 9 bp & exon 9: 21 bp)
• Short terminal exons (exon 1: 15 bp)
MGAlign performance
• More savings in computer time with longer gDNA sequences
• Based on 41 randomly chosen genomic fragments
[Figure: Time (seconds) against length of genomic sequence (Mbp) for MGAlign, sim4 and Spidey]
Problem: Königsberg bridges (1700s)
[Figure: the town drawn as a graph. Labels: Node; Path or Edge]
The residents of Königsberg, Prussia (now Kaliningrad, Russia), wondered if it was possible to take a walking tour of the town that crossed each of the seven bridges over the Pregel river exactly once.
Leonhard Euler, 1736 (father of graph theory)
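Euler's answer can be checked in a few lines. A connected graph has a walk that uses every edge exactly once only if it has zero or two nodes of odd degree. The sketch below encodes the four land masses and seven bridges (the node names are labels chosen here for illustration) and counts the odd-degree nodes.

```python
from collections import Counter

# Four land masses (nodes) and the seven bridges (edges) of Königsberg.
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]


def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """Euler's condition for a connected graph: 0 or 2 odd-degree nodes."""
    return len(odd_degree_nodes(edges)) in (0, 2)


print(odd_degree_nodes(bridges))  # all four land masses have odd degree
print(has_euler_walk(bridges))    # so no such walking tour exists
```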
Graph theory for AS
1. First used for AS by Heber et al. (2002).
2. Each independent segment represented as a node, connected by arrows.
3. “Node” here is not necessarily based on introns and exons: simply a common contiguous segment of the gene.
4. Example shown: human ADSL (adenylosuccinate lyase) gene
Our splicing graph approach
• A biologist’s viewpoint: each exon should be a node and each intron, an edge (connection).
• Automatic generation of AS clusters from gene structure.
• Identifying each distinct reference exon and its associated variants.
• Simple rules for classifying alternative splicing events, and a visualization system for studying all variants from a single gene.
• Single-line diagram: the experimentalist's way of analysing alternative splicing
Making the splicing graph
Usual classification of AS events (Leipzig et al., 2004)
Representing splice variants of the same gene as a splicing graph
Normal representation of transcripts: human hyaluronidase HYAL1 gene, ENSG00000114378 (an early version), www.ebi.ac.uk/asd
Splicing Graph representation of the same gene
[Figure labels: Intron retention; Exon skipping; Alternative Termination site]
Transcripts are shown as exon numbers:
5+2+3+9; 6+3+9; 1+7+3+4; 1+8+3+4; 1+2+4; 1+3+4.
Single-line Splice Diagram
Patterns using the above exon numbers are shown as:
5+2+3+9; 6+3+9; 1+7+3+4; 1+8+3+4; 1+2+4; 1+3+4.
• A Digraph or DAG (Directed Acyclic Graph)
• Graphs for which every unilateral orientation is traceable
• Experimentalist’s way of Alternative Splicing analysis (for a gene of interest with all transcripts) for validating splice junctions
• Intron retention is clearly visible
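The exon-number notation used on this slide maps directly onto a graph. The sketch below is a minimal illustration, not the authors' implementation: it turns the six transcripts into a splicing graph with exons as nodes and introns (exon-to-exon junctions) as directed edges, then uses the standard-library `graphlib` module (Python 3.9+) to confirm that the result is acyclic.

```python
from graphlib import TopologicalSorter

# Transcripts written as exon numbers, as on the slide.
transcripts = ["5+2+3+9", "6+3+9", "1+7+3+4", "1+8+3+4", "1+2+4", "1+3+4"]


def splicing_graph(transcripts):
    """Map each exon (node) to the set of exons it is spliced to (edges)."""
    graph = {}
    for transcript in transcripts:
        exons = [int(e) for e in transcript.split("+")]
        for exon in exons:
            graph.setdefault(exon, set())
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)
    return graph


graph = splicing_graph(transcripts)
edges = sorted((a, b) for a, successors in graph.items() for b in successors)

# static_order() raises graphlib.CycleError if the graph contains a cycle,
# so completing without error confirms the graph is a DAG.
order = list(TopologicalSorter(graph).static_order())

print(len(graph), "exons,", len(edges), "junctions")
print(edges)
```

Six transcripts collapse into one graph of nine nodes, and junctions shared between transcripts (for example 3 to 4) are stored only once.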
Our extended classification
Automatic rule-based classification
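The talk's actual rules are not given in the slide text. As a hedged illustration of what one rule in a rule-based classifier could look like, the sketch below flags a candidate skipped (cassette) exon b whenever the graph contains the junction a to c as well as the junctions a to b and b to c. The rule and the function name are assumptions made for this example.

```python
def skipped_exon_candidates(transcripts):
    """Return (a, b, c) triples where exon b lies on a path a->b->c
    and a direct junction a->c also exists (b can be skipped)."""
    edges = set()
    for transcript in transcripts:
        exons = [int(e) for e in transcript.split("+")]
        edges.update(zip(exons, exons[1:]))

    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    events = set()
    for a, c in edges:
        for b in successors[a]:
            if b != c and (b, c) in edges:
                events.add((a, b, c))
    return sorted(events)


# Exon-number transcripts from the HYAL1 example.
example = ["5+2+3+9", "6+3+9", "1+7+3+4", "1+8+3+4", "1+2+4", "1+3+4"]
for a, b, c in skipped_exon_candidates(example):
    print(f"exon {b} can be skipped between exon {a} and exon {c}")
```

Because the rule works on the merged graph, it can combine junctions seen in different transcripts, so each candidate would still need checking against the individual transcripts.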
Where to make your splicing graphs
AS Databases (Of men and mice)
ASAP II (Kim et al., 2007): comparative and evolutionary studies; 17 genomes
EC Gene (Lee et al., 2007): provides functional annotation for AS genes; 9 genomes
ASTD (previously ASD) (Thanaraj et al., 2004): genome wide analysis; human, mouse and rat
ASTALAVISTA (Foissac et al., 2007): visual summary of the AS landscape; mainly for human genome
• Does not provide sufficient information for multi-gene comparison to understand the phenomenon of AS.
Genome-wide AS analysis:
“I” said the fly…
Homology
• Similarity between biological sequences due to shared ancestry
• Orthology
  • Homologous sequences are orthologous if separated by a speciation event
  • The divergent copies of a single gene in the resulting species are orthologous genes.
  • At least 25-30% similarity at the protein level
Gene Ontology
• Provides a controlled vocabulary to describe gene and gene product attributes in organisms.
• Three organizing principles
  • Cellular component: a component of a cell, e.g. nucleus
  • Biological process: series of events accomplished by one or more ordered assemblies, e.g. signal transduction
  • Molecular function: describes activities, e.g. catalytic activity
AS genes in Bovine genome
• Part of bovine annotation project
• 16560 human genes, 15986 mouse genes, 4567 bovine genes
• Data extracted from ASTD and Ensembl (Hubbard et al., 2002)
• Orthologous genes found using Biomart from Ensembl
• Gene Ontology using Blast2GO (Conesa et al., 2005)
Percentage of AS genes and orthologous spliced genes in bovine, human and mouse

Genome  Total genes  Genes with multiple transcripts  % of AS
Bovine  21755        4567                             21%
Human   24573        16715                            68%
Mouse   28931        16491                            57%

Genome  Total genes  Orthologous gene set  % of AS
Bovine  21755        3504                  16%
Human   24573        3835                  16%
Mouse   28931        3774                  13%

• Orthologous genes were analysed in order to reduce bias in the data.
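The percentages on this slide can be reproduced from the counts. The sketch below assumes that both "% of AS" figures are taken relative to the total number of genes in each genome; the slide does not state the denominator, but this assumption matches every reported value.

```python
# (total genes, genes with multiple transcripts, orthologous gene set)
counts = {
    "Bovine": (21755, 4567, 3504),
    "Human": (24573, 16715, 3835),
    "Mouse": (28931, 16491, 3774),
}


def percent(part: int, total: int) -> int:
    """Percentage rounded to the nearest whole number, as on the slide."""
    return round(100 * part / total)


for genome, (total, multi_transcript, orthologous) in counts.items():
    print(
        f"{genome}: {percent(multi_transcript, total)}% AS genes, "
        f"{percent(orthologous, total)}% in the orthologous set"
    )
```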
Gene Level AS Analysis of orthologous subset
Percentage of genes = (No. of genes having one event / Sum of the genes for all the events) × 100
• A smaller percentage of bovine genes show AS events compared to human.
[Figure: bar chart of % of alternative splicing genes (0-100) for bovine, human and mouse, by event type: Mutually Exclusive, Intron Retention, Cassette Exons, Alt Donor Site, Alt Acceptor Site, Alt Term Exon, Alt Trans Term Site, Alt Init Exon, Alt Trans Start Site]
AS Event Analysis of the orthologous subset
% Events = (No. of times an event occurs / Sum of occurrences of all events) × 100
• % of AS events in bovine similar to human
  • implies that more splice variants are obtained from fewer bovine genes.
[Figure: bar chart of % of AS events (0-30) for bovine, human and mouse, by event type: Mutually Exclusive, Intron Retention, Cassette Exons, Alt Donor Site, Alt Acceptor Site, Alt Term Exon, Alt Trans Term Site, Alt Init Exon, Alt Trans Start Site]
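Both the gene-level and the event-level percentages are the same normalisation applied to different counts: each event type's count divided by the sum over all event types, times 100. The sketch below shows this with made-up counts; none of the numbers come from the talk.

```python
def percentages(counts):
    """Express each event's count as a percentage of the sum over all events."""
    total = sum(counts.values())
    return {event: 100 * n / total for event, n in counts.items()}


# Hypothetical number of genes showing each event type (gene-level analysis).
genes_per_event = {"Cassette Exons": 50, "Intron Retention": 30, "Alt Donor Site": 20}

# Hypothetical number of occurrences of each event type (event-level analysis).
event_occurrences = {"Cassette Exons": 120, "Intron Retention": 45, "Alt Donor Site": 35}

print(percentages(genes_per_event))
print(percentages(event_occurrences))
```

In this toy example cassette exons account for 50% of the genes but 60% of the events. That is the kind of gap between gene-level and event-level views that the two charts are designed to expose.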
Gene Ontology analysis
Gene Ontology using Blast2GO (Conesa et al., 2005)
• 2458 (out of 4567) AS genes have GO annotations in Ensembl
• 1716 AS genes can be further annotated
[Figure: bar chart of annotated AS genes: Ensembl 2458; our analysis 4174; total 4567]
Implications for disease
• Diagnostics from early recognition of splice variants associated with disease, based on nucleotide detection
• Treatment options using siRNA
• Aberrant splicing in survival of motor neuron 1 gene (SMN1) in spinal muscular atrophy (Cartegni and Krainer 2002)
• Suppressing anti-apoptotic AS variant of Bcl-x pre-mRNA in prostate and breast cancer cells (Mercatante et al. 2001)
• Correcting CFTR mis-splicing (Friedman et al. 1999)
Many diseases are caused by AS
[Figure label: Myotonic dystrophy]
Why study farm animals?
• Provide valuable insights into gene function and into genetic and environmental influences on animal production and human diseases (Roberts et al., 2009).
• Because of their size and relatively long intervals between generations, domestic species are widely used to unravel the mechanisms involved in programming the development of an embryo and fetus, resulting in adult onset of diseases (King et al., 2007; Padmanabhan et al., 2007).
• Mapping human disease genes to bovine orthologous genes is an excellent way of carrying out analytical work and verifying the suitability of the cow as a model organism.
Mapping human disease genes to bovine genome
• 94 human disease genes were extracted from NCBI Genes and Disease database to analyse which of these genes were alternatively spliced in human and bovine genomes.
• AS analysis was conducted on 66 spliced genes.
• 17 orthologous spliced genes were observed in bovine.
Human disease genes: Conservation of cassette exons in bovine orthologous genes
• Cassette exons occur in 38 human disease genes and 14 orthologous bovine genes.

Number of cassette exons in 38 AS human disease genes:       120
Exons present and constitutive in bovine orthologous gene:   90
Exons present and regulated in bovine orthologous gene:      7
Exons absent in bovine orthologous gene:                     23
Human disease genes: Cassette exons present and regulated in bovine orthologous genes
• 3 genes with cassette exons in human were present and regulated in bovine.

Disease                  Gene name
Colon Cancer             MLH1
Spinal muscular atrophy  SMN1
Tangier disease          ABC1

Cassette exon column, in slide order: Exon9, Exon10, Exon6, Exon5, Exon32, Exon2, Exon10
Human disease genes: Intron retention present and constitutive in bovine orthologues
• Intron retention (IR) occurs in nine human genes; in five of these, IR was present and constitutive in bovine.

Disease                             Gene name  Intron retention
Glaucoma                            GLC1A      Exon1
Spinocerebellar ataxia              SCA1       Exon9
Polycystic kidney disease           PKD1       Exon23
                                               Exon15
Autoimmune polyglandular syndrome   APS1       Exon10
Wilson’s disease                    ATP7B      Exon2
Protein domain analysis of the orthologous disease gene set
• Carried out Pfam domain search on 8 human disease genes to identify the effects of alternative splicing on the functional protein domains.
• Genes responsible for spinal muscular atrophy and colon cancer are spliced in bovine, resulting in probable disruption of structure and function.
• 4 disease genes (glaucoma, Tangier, spinal muscular atrophy and colon cancer) had all the domains from their human counterparts conserved in bovine.
Conclusion
Our results provide a window of opportunity
for more in-depth analysis over a larger
dataset, where the cow can serve as a
model organism for many more human
diseases.
Acknowledgements
• PhD students at the
  • National University of Singapore (Bernett T.K. Lee)
  • Macquarie University (Durgaprasad Bollina and Elsa Chacko)
• Colleagues and A/Prof. Tin Wee Tan, NUS
• All of you
Invitation to attend InCoB2009
International Conference in Bioinformatics (incob.apbionet.org)
Singapore, 7-11 Sept. at Matrix, Biopolis
Keynote: Nobel Laureate Robert Huber
7 Sept: Tutorials and Bioinformatics Education workshop (WEBCB)
8 Sept: Clinical Bioinformatics (CBAS) and SYMBIO (Students) Symposia
9-11 Sept: Scientific Meeting