Analysis of Exon Arrays
Slides provided by Dr. Yi Xing
Outline
– Design of exon arrays
– Background correction
– Probe selection, expression index
computation
– Evaluation of gene level index
– Exon level analysis
– Conclusion
1. Basic design of Exon Array
3’ Arrays                           Exon Arrays
1 gene --- 1 or 2 probesets         1 gene --- many probesets
Probes from 600 bps near 3’ end     Probes from each putative exon
Probeset has 11 PM, 11 MM probes    Probeset has 4 PM probes
54,000 probesets                    1.4 million probesets, 6 M features
Average 16 probes per RefSeq gene   Average 147 probes per RefSeq gene
Exon Array Probesets Classified
by Annotational Confidence
• Core probesets target exons
supported by RefSeq mRNAs.
• Extended probesets target
exons supported by ESTs or
partial mRNAs.
• Full probesets target exons
supported purely by
computational predictions.
[Pie chart: Core 21%, Extended 38%, Full 41%]
2. Background modeling: predict nonspecific hybridization from probe
sequence
• Wu and Irizarry (2005) use probe effect
modeling to obtain more accurate expression
index on 3’ arrays
• Johnson et al (2006) use probe effect modeling
to detect ChIP peaks for Tiling arrays
• Kapur et al (2007) use probe effect modeling to
correct background for Exon array
Background modeling in Exon Arrays
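• $\log B_i = \alpha\, n_{iT} + \sum_{j=1}^{25} \sum_{k \in \{A,C,G\}} \beta_{jk} I_{ijk} + \sum_{k \in \{A,C,G\}} \gamma_k n_{ik}^2 + \varepsilon_i$
– $n_{ik}$: number of nucleotides $k$ in probe $i$; $I_{ijk} = 1$ if position $j$ of probe $i$ is nucleotide $k$, else 0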
• Estimate parameters from either
– Background probes (n = 37,687)
– Full probes (n = 400,000)
• test on a different array (with single scaling constant)
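A minimal sketch of how one 25-mer probe could be turned into the regressors of the background model above. It assumes the index ranges shown on the MADS slide later in the deck (positions j = 1..25, nucleotides k in {A, C, G}). The names `probe_features` and `predict_log_background` are illustrative and are not taken from GeneBASE or MADS; the coefficients would come from a separate least-squares fit on background or full probes, which is not shown here.

```python
from collections import Counter

PROBE_LEN = 25
BASES = "ACG"  # T has no indicator terms here, so it acts as the reference nucleotide


def probe_features(seq):
    """Return [n_T, I_jk for 25 positions x {A,C,G}, n_k^2 for k in {A,C,G}]."""
    seq = seq.upper()
    if len(seq) != PROBE_LEN or set(seq) - set("ACGT"):
        raise ValueError("expected a 25-mer over A, C, G, T")
    counts = Counter(seq)
    features = [float(counts["T"])]                      # n_iT
    for base_at_j in seq:                                # I_ijk, position by position
        features.extend(1.0 if base_at_j == k else 0.0 for k in BASES)
    features.extend(float(counts[k]) ** 2 for k in BASES)  # n_ik^2
    return features


def predict_log_background(seq, coefficients):
    """Predicted log background: dot product of the features with fitted coefficients."""
    x = probe_features(seq)
    if len(coefficients) != len(x):
        raise ValueError("need one coefficient per feature (79 in this encoding)")
    return sum(c * v for c, v in zip(coefficients, x))
```

With this encoding each probe has 1 + 75 + 3 = 79 regressors; an intercept, if wanted, would be an extra constant column added at fitting time.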
Train on Background Probes, Test on Background Probes R²
Train/Test    Cerebellum   Heart   Liver
Cerebellum    -            0.64    0.67
Heart         0.64         -       0.65
Liver         0.66         0.64    -
Train on Full Probes, Test on Background Probes R²
Train/Test    Cerebellum   Heart   Liver
Cerebellum    -            0.61    0.63
Heart         0.61         -       0.63
Liver         0.64         0.63    -
• Full probes useful for modeling background
Promoter array may be used to train exon array background
Array          Stem cell exon array R²   mPromoter R²
H9-38-3B       0.60644                   0.632071
H9-38-3C       0.597005                  0.623045
H9-38-3CM_8    0.589289                  0.603118
H9-38-7B       0.580949                  0.596331
H9-39-7B       0.542581                  0.555235
H9-41-7B       0.603742                  0.631153
H9-43-3B       0.612422                  0.634044
H9-43-7B       0.594246                  0.61426
Preliminary conclusions
• Background correction based on background
probe effect modeling can greatly reduce
background noise
• Model parameters are similar for different ChIP-DNA samples, or for different RNA samples, but
not across DNA and RNA.
• The data may be rich enough to support learning
of more complex models with even better
predictive power.
3. Probe selection and expression
index computation
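Gene-level visualization: Heatmap of Intensities
[Figure: heatmap of probe intensities for HLA_DMB (major histocompatibility complex, class II, DM beta); axes: probes by samples; core probes marked]
Heatmap of Pairwise Correlations
[Figure: heatmap of pairwise probe correlations for HLA_DMB; axes: probes by probes]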
First observations
• Heatmap of correlations is a useful
complement to heatmap of intensities
• Core probes have higher intensity than
extended and full probes
Probe selection for gene-level expression
• Most full and extended probes are not suitable for
estimating gene-level expression
– Probes may target false exon predictions
• Even some core probes may not be suitable
– Bad probes that have low affinity or cross-hybridize
– Probes targeting differentially spliced exons
• Probe selection
– Selecting a suitably large subset of good probes targeting
constitutively spliced regions of the gene
– Use only the selected probes to estimate gene expression
Heatmap of CD44 core probes (Ordered By Genomic Locations)
[Figure: probe regions along the gene are labeled constitutive, alternatively spliced, constitutive]
[Figure: ataxin 2-binding protein 1]
These examples motivated our
Probe Selection Strategy
• Probe selection procedure (on core probes)
– Hierarchical clustering of the probe intensities across 11 tissues
(33 samples), and cut the tree at various heights (0.1,0.2,…1.0).
– Choose a height cutoff to strike a balance between the size of the
largest sub-group and the correlation within the sub-group.
– Iteratively remove probes if they do not correlate well with current
expression index
– At least 11 core probes need to be chosen.
– If the total number of core probes is less than 11 for the entire
transcript cluster, we skip probe selection.
(Xing Y, Kapur K, Wong WH. PLoS ONE. 2006 20;1:e88)
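The clustering step of the procedure above can be sketched as follows. This is a simplified illustration, not the GeneBASE code: it uses distance = 1 - Pearson correlation with average linkage, stops merging once the closest pair of clusters is farther apart than the chosen height, and returns the largest sub-group. The iterative removal of poorly correlated probes and the 11-probe minimum are left out, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def largest_cluster(profiles, height):
    """profiles[p] is the intensity of probe p across samples.

    Average-linkage clustering on distance 1 - corr, cut at `height`.
    Returns the sorted probe indices of the largest resulting cluster.
    """
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ia, ib = min(combinations(range(len(clusters)), 2),
                     key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[ia], clusters[ib]) > height:
            break  # all remaining merges lie above the cut
        merged = clusters[ia] + clusters[ib]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (ia, ib)]
        clusters.append(merged)
    return sorted(max(clusters, key=len))
```

Running `largest_cluster` at each height in 0.1, 0.2, ..., 1.0 gives the sub-group sizes needed to pick a cutoff. The loop recomputes linkages from scratch, which is fine for the probes of one gene but is not tuned for speed.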
Hierarchical Clustering of CD44 Core Probes (distance = 1 - corr, average linkage)
[Figure: dendrogram; h=0.1: 44 (42%) probes]
Computation of gene level expression index
1. Background correction
2. Normalization (linear scaling or none)
3. Probe selection
4. Computation of overall gene expression indexes (dChip type model)
5. Gene level quantile normalization (optional)
GeneBASE: Gene-level Background Adjusted Selected probe Expression
Download: http://biogibbs.stanford.edu/~kkapur/GeneBASE/
Xing, Kapur, Wong, PLoS ONE, 1:e88, 2006
Kapur, Xing, Wong, Genome Biology, 8:R82, 2007
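For readers unfamiliar with a "dChip type model", here is a rough sketch of the multiplicative model it refers to, y_ij = theta_i * phi_j + error, where theta_i is the expression index of sample i and phi_j the affinity of probe j, with the probe affinities scaled so that their squares sum to the number of probes. The sketch fits it by plain alternating least squares and omits the outlier handling of the real dChip software; the function name is hypothetical, and the input is assumed to be background-corrected intensities of the selected probes with at least one non-zero value.

```python
def fit_multiplicative_model(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y[i][j]: intensity of selected probe j in sample i.
    Returns (theta, phi) with sum(phi_j^2) == number of probes.
    """
    n_samples, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_samples
    for _ in range(n_iter):
        # Given phi, each theta_i is a one-parameter least-squares fit.
        phi_sq = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / phi_sq
                 for i in range(n_samples)]
        # Given theta, each phi_j is fitted the same way.
        theta_sq = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_samples)) / theta_sq
               for j in range(n_probes)]
        # Rescale so sum(phi^2) equals the probe count; the products are unchanged.
        scale = (n_probes / sum(p * p for p in phi)) ** 0.5
        phi = [p * scale for p in phi]
        theta = [t / scale for t in theta]
    return theta, phi
```

The returned `theta` values play the role of the gene-level expression indexes in step 4 of the pipeline.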
In most cases selection does not affect fold changes
Sometimes, selection changes the fold change significantly
spectrin, beta, non-erythrocytic 4 (SPTBN4)
BetaIV spectrins are essential for membrane stability and the molecular organization of nodes of Ranvier along neuronal axons
4. Evaluations of gene level index
1st evaluation: tissue fold change
[Figure: fold-change of liver over muscle, in 438 genes with high fold-change in 3’ expression array data; panels: before selection, after selection, with zoom-in]
Probe selection allows more sensitive detection of fold-changes
[Figure: FC of muscle over liver, in 500 genes detected to be overexpressed in muscle over liver by 3’ array; panels: before selection, after selection, with zoom-in]
2nd evaluation: Presence/Absence calls
• Use SAGE data to construct gold-standard
• Presence in tissue if 100 tags per million
• Absence if no tags in given tissue but >100 tpm
in at least one other tissue
• Exon array A/P calls: use sum of z-scores for
core probes (z-score is computed based on
background model)
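One way the z-score call could look, as a hedged sketch: each core probe gets a z-score by comparing its log intensity with the log background predicted from its sequence, and the scores are summed over the gene's core probes. The residual standard deviation, the call threshold, and all function names are assumptions for illustration; the slides do not give them.

```python
from math import log


def probe_z_scores(intensities, predicted_log_bg, residual_sd):
    """z_i = (log PM_i - predicted log background_i) / residual SD of the model."""
    return [(log(pm) - mu) / residual_sd
            for pm, mu in zip(intensities, predicted_log_bg)]


def presence_score(intensities, predicted_log_bg, residual_sd):
    """Sum of z-scores over the core probes of one gene."""
    return sum(probe_z_scores(intensities, predicted_log_bg, residual_sd))


def call_present(score, threshold):
    """Call the gene present when the summed z-score exceeds a chosen threshold."""
    return score > threshold
```

Sweeping `threshold` over its range and comparing the calls with a SAGE-based gold standard is what would trace out an ROC curve.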
[Figure: ROC curves for (a) Cerebellum, (b) Heart, (c) Kidney]
ROC curves show that background correction improves A/P calls.
Red: Exon, Z-score call
Blue: Exon, Affy call
Brown: 3’ Affy call, max probeset
Purple: 3’ Affy call, min probeset
3rd evaluation: Cross-species conservation
• 3’ and Exon array data for six adult tissues in
both human and mouse
• Expression computed for about 10,000
human-mouse ortholog pairs
Similarity of gene expression profiles in six human tissues and six corresponding mouse tissues.
For each ortholog pair we calculated the Pearson correlation coefficient (PCC) of expression indexes across six tissues (solid line). We also permuted ortholog relationships and calculated the PCC for random human-mouse gene pairs (dashed line).
[Figure panels: 3’ arrays; Exon arrays]
(Xing Y, Ouyang Z, Kapur K, Scott MP, Wong WH. Mol Biol Evol. April 2007)
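The comparison described in the caption above can be sketched in a few lines. This is an illustrative outline only, assuming the human and mouse expression indexes are stored as parallel lists with one six-tissue profile per ortholog pair; the function names and the fixed random seed are not from the paper.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ortholog_correlations(human, mouse):
    """PCC across tissues for each true ortholog pair (the solid line)."""
    return [pearson(h, m) for h, m in zip(human, mouse)]


def permuted_correlations(human, mouse, seed=0):
    """PCC after shuffling which mouse gene is paired with each human gene
    (the dashed line, a background of random human-mouse pairs)."""
    rng = random.Random(seed)
    shuffled = list(mouse)
    rng.shuffle(shuffled)
    return [pearson(h, m) for h, m in zip(human, shuffled)]
```

Plotting the two lists as histograms or densities would show how far the true ortholog pairs are shifted toward high correlation relative to random pairs.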
Exon arrays also reveal conservation of absolute abundance of transcripts in individual tissues!
[Figure panels: 3’ arrays scatter plot; 3’ arrays correlations; Exon arrays scatter plot; Exon arrays correlations]
4th evaluation: q-PCR
On log scale, exon array fold change estimate is
correlated with qPCR fold change (corr = 0.9)
[Figure: scatter plot of exon array fold change against qPCR fold change, log scale, both axes from -2 to 2]
5. Issues in exon level analysis
Challenges
• The experimental validation rates in several published
exon array studies are highly variable.
– Gardina et al. BMC Genomics 7:325, 21%
– Kwan et al. Genome Res 17:1210, 45%
– Hung et al. RNA 14:284, 22%-56%
– Clark et al. Genome Biol 8:R64, 84%
• Most exons are targeted by no more than four probes.
No probes for splice junctions.
• Noise in observed probe intensities (due to background,
cross-hybridization) can make the inferred splicing
pattern unreliable.
MADS: Microarray Analysis
of Differential Splicing
1. Correction for background (nonspecific hybridization)
$\log PM_i = \alpha\, n_{iT} + \sum_{j=1}^{25} \sum_{k \in \{A,C,G\}} \beta_{jk} I_{ijk} + \sum_{k \in \{A,C,G\}} \gamma_k n_{ik}^2 + \varepsilon_i$
2. Probe selection and expression index calculation
3. Correction for cross-hybridization
4. Detection of differential splicing
Splicing Index = Corrected Probe Intensity / Estimated Gene Expression Level
References:
1. Kapur, Xing, Wong, Genome Biology, 8:R82, 2007
2. Xing, Kapur, Wong, PLoS ONE, 1:e88, 2006
3. Xing et al., RNA, 2008, 14(8): 1470-1479
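As a small worked illustration of the splicing index (corrected probe intensity divided by estimated gene expression level), the sketch below computes it per probe and compares an exon between two conditions on the log2 scale. The averaging over probes and the difference between conditions are simplifying assumptions made here for illustration; they are not the statistical test used by MADS, and the function names are hypothetical.

```python
from math import log2


def splicing_index(corrected_probe_intensity, gene_expression):
    """Probe intensity (after background and cross-hyb correction) relative to
    the gene-level expression estimate of the same sample."""
    if gene_expression <= 0:
        raise ValueError("gene expression estimate must be positive")
    return corrected_probe_intensity / gene_expression


def mean_log2_si(probe_intensities, gene_expression):
    """Average log2 splicing index over the probes of one exon (intensities > 0)."""
    values = [log2(splicing_index(p, gene_expression)) for p in probe_intensities]
    return sum(values) / len(values)


def delta_log2_si(control_probes, control_gene, treated_probes, treated_gene):
    """Change in mean log2 splicing index, treated minus control.
    Positive values suggest higher exon inclusion in the treated sample."""
    return (mean_log2_si(treated_probes, treated_gene)
            - mean_log2_si(control_probes, control_gene))
```

For example, if an exon's four probes double in corrected intensity while the gene-level estimate stays the same, `delta_log2_si` is 1.0.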
Analysis of “gold-standard” alternative splicing
data via PTB knockdown experiments
• Our “gold-standard”: a list of exons
with pre-determined
inclusion/exclusion profiles in
response to PTB depletion (Boutz P,
et al. Genes Dev. 2007,
21(13):1636-52.)
• We used shRNA to knock down
PTB, generated Exon array data,
and analyzed data on “gold-standard” exons.
• MADS detected all exons with
large changes (>25%) in transcript
inclusion levels, and offered
improvement over Affymetrix’s
analysis procedure.
Collaboration with Douglas Black (UCLA)
MADS sensitivity correlates with the
magnitude of change in exon inclusion
levels of “gold-standard exons”
Xing et al., RNA, 2008, 14(8): 1470-1479
Exon array detection of novel PTB-dependent splicing events
[Figure: control vs. shRNA knockdown of splicing repressor PTB; detection of alternative 3’-UTR and Poly-A sites of Ncam1]
30 differentially spliced exons were
tested; 27 were validated.
Validation rate: 27/30=90%
Cross-Hybridization
• Probes are designed to hybridize to their
target transcripts
• Often probes have 0,1,2,3 base pair
mismatches to non-target transcripts
• Cross-hyb seriously complicates exon-level analysis.
Mapping mismatches to probes
• 6,000,000 probes
• Each 25bp long
• 3,000,000,000bp genome sequence
• For 1-bp mismatch, a naïve search needs O(6M x 3G x 25) ~ years of CPU time
• Fast matching algorithm (by Hui Jiang) makes this feasible in hours
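To make the cost estimate concrete: 6 x 10^6 probes x 3 x 10^9 positions x 25 bases is 4.5 x 10^17 base comparisons, which at an assumed 10^9 comparisons per second is about 4.5 x 10^8 seconds, or roughly 14 years. The slides do not describe Hui Jiang's algorithm, so the sketch below shows a different, textbook technique for the same problem: a pigeonhole seed filter. If a probe matches a genome window with at most one mismatch, then either its left half or its right half matches exactly, so only windows sharing an exact half with some probe need a full comparison. All names are illustrative, probes are assumed to have equal length, and only the forward strand is scanned.

```python
from collections import defaultdict


def build_seed_index(probes, left_len):
    """Index probe ids by their exact left part and exact right part."""
    left_index, right_index = defaultdict(list), defaultdict(list)
    for pid, probe in enumerate(probes):
        left_index[probe[:left_len]].append(pid)
        right_index[probe[left_len:]].append(pid)
    return left_index, right_index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(probes, genome):
    """Return sorted (probe_id, genome_position, mismatches) with <= 1 mismatch."""
    probe_len = len(probes[0])
    left_len = probe_len // 2
    left_index, right_index = build_seed_index(probes, left_len)
    hits = set()
    for pos in range(len(genome) - probe_len + 1):
        window = genome[pos:pos + probe_len]
        # Pigeonhole filter: a hit must share an exact half with the window.
        candidates = set(left_index.get(window[:left_len], ()))
        candidates |= set(right_index.get(window[left_len:], ()))
        for pid in candidates:
            d = hamming(probes[pid], window)
            if d <= 1:
                hits.add((pid, pos, d))
    return sorted(hits)
```

Each genome window costs two dictionary lookups plus a full comparison only for the few candidate probes, instead of a comparison against every probe.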
Distribution of Number of Cross-hyb Transcripts (% of probes)
Full Probes
Mismatches   0 Trans.   1 Trans.   2 Trans.   3 Trans.   ≥ 4 Trans.
0 bp         99.52      0.40       0.05       0.01       0.02
0-1 bp       99.21      0.62       0.10       0.03       0.03
0-2 bp       98.90      0.84       0.15       0.05       0.06
0-3 bp       97.49      1.98       0.29       0.10       0.13
0-4 bp       88.25      9.67       1.36       0.35       0.36
Core Probes
Mismatches   0 Trans.   1 Trans.   2 Trans.   3 Trans.   ≥ 4 Trans.
0 bp         97.05      2.14       0.40       0.18       0.23
0-1 bp       96.14      2.60       0.59       0.27       0.40
0-2 bp       95.59      2.79       0.69       0.33       0.60
0-3 bp       92.36      5.06       1.12       0.48       0.98
0-4 bp       80.50      13.37      3.10       1.09       1.93
Correction of sequence-specific cross-hybridization to off-target transcripts
[Figure: PAN3; panels: estimated expression levels of off-target transcripts of EEF1A1; intensities of four probes of the target exon of PAN3]
Conclusion
• Gene level index is accurate and reflects absolute abundance
• We show that sequence-specific modeling of microarray noise
(background and cross-hybridization) improves the precision of exon-level analysis of exon array data.
• Overall, our data demonstrate that the exon array design is an effective
approach to study gene expression and differential splicing.
• Development of future “probe rich” exon arrays, with increased probe
density on exons and inclusion of splice junction probes, will offer
more powerful tools for global or targeted analysis of alternative
splicing.