Motif-directed Network Component
Analysis for Regulatory Network Inference
Chen Wang, Lily Chen, Yue Wang, (Jason) Jianhua Xuan*
Virginia Tech, USA
Po Zhao, Eric Hoffman
Children’s National Medical Center, USA
Robert Clarke
Georgetown University Medical Center, USA
Aug. 29, 2007
InCoB 2007
Outline
• Background & Motivation
• Proposed Approach
– Motif-directed network component analysis
(mNCA)
– Stability analysis
• Experimental Results
– Muscle regeneration
• Conclusion & Discussion
Background & Motivation
• High-throughput biological data (e.g.,
microarray data, proteomic data, etc.)
provide us a great opportunity to study
genome systems.
– Identify gene modules, interactions and
pathways.
• Gene regulatory network modeling
– Clustering or biclustering
– Decomposition
• The whole gene population is regulated
by a few key transcription factors (TFs).
• TFs and their interactions can form a
skeleton of the regulatory networks.
Background
• However, decomposition methods relying on
microarray data alone often make their results
difficult to interpret biologically.
– Independent Component Analysis (ICA), and
– Non-negative Matrix Factorization (NMF).
• Network Component Analysis (NCA) – An
integrative approach
– Microarray gene expression data
– Protein binding data (i.e., ChIP-on-chip data) – network
connections (topology)
• Available in yeast model system
Motivation
• Limitations of NCA:
– ChIP-on-chip data are often not available for species like mouse and
human;
– When different data sources are integrated, the consistency is often
not guaranteed;
– ChIP-on-chip data come from biological experiments, which might
contain false-positives leading to incorrect network inference.
• Proposed solution - motif-directed network
component analysis (mNCA)
– Motif information derived from DNA sequence for initial network
topology.
– With the awareness of false-positives in motif information, stability
analysis procedures shall be developed to combat the inconsistency
between motif information and microarray data.
Motivation - Pathway Building
• Emery Dreifuss Muscular Dystrophy (EDMD)
Bakay, M, et al., Brain (129), 2006
Network Component Analysis (NCA)
[Figure: NCA network diagram with three layers: TF, connection, mRNA]
Mathematical Formulation of NCA
• A linear model:
$$E_{N \times M} = A_{N \times L}\, T_{L \times M}, \quad \text{s.t. } A \in Z_0$$
A: the connection strengths
T: transcription factor activities (TFAs)
• Criterion to infer TFAs and regulation relationships according to both expression and topology:
$$\min \| E_{N \times M} - A_{N \times L}\, T_{L \times M} \|^2, \quad \text{s.t. } A \in Z_0$$
• Example: CHRNG = a1 MYOD1 + a2 MYOG
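The criterion above can be illustrated with a small alternating least-squares sketch. This is not the authors' implementation: the function name `nca_als`, the iteration count, and the random initialization are assumptions made for illustration, and `Z0` is represented here as a boolean matrix marking which TF-gene connections are allowed.

```python
import numpy as np

def nca_als(E, Z0, A_init=None, n_iter=50, seed=0):
    """Alternating least squares for E ~ A @ T with A restricted to the support Z0.

    E      : (N, M) expression matrix (genes x samples)
    Z0     : (N, L) boolean topology; A[g, s] may be nonzero only where Z0[g, s] is True
    A_init : optional (N, L) starting strengths (e.g. motif scores); random if None
    """
    rng = np.random.default_rng(seed)
    Z0 = np.asarray(Z0, dtype=bool)
    N, L = Z0.shape
    if A_init is None:
        A_init = rng.standard_normal((N, L))
    A = np.where(Z0, A_init, 0.0)  # entries outside the topology stay at zero
    for _ in range(n_iter):
        # Step 1: fix A and solve for T by ordinary least squares.
        T, *_ = np.linalg.lstsq(A, E, rcond=None)
        # Step 2: fix T and refit each gene's row of A using only its allowed TFs.
        for g in range(N):
            idx = np.flatnonzero(Z0[g])
            if idx.size:
                coef, *_ = np.linalg.lstsq(T[idx].T, E[g], rcond=None)
                A[g, idx] = coef
    return A, T
```

Note that A and T are determined only up to a scaling of each TF (multiplying a column of A and dividing the matching row of T by the same factor leaves A T unchanged), so estimated TFAs are best compared by their shape over time rather than by magnitude.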
Illustration of NCA
[Figure: microarray data E = regulation strength A × transcription factor activities (TFAs) T, shown as three plots; horizontal axis: gene]
mNCA - Motif Information
• Transcription Factors (TFs)
– Proteins that bind to the promoter regions of genes
– Activate or inhibit gene expression.
• Motif (DNA sequence motif)
– Common pattern in binding sites for a TF
– Short sequences (5-25 bp)
– Up to 1000 bp (or farther) from the gene
– Inexactly repeating patterns
[Figure: binding sites for a TF in Gene 1 through Gene 5]
Motif Representation
• Consensus sequence
MyoD (M00001): SRACAGGTGKYG
• Position Weight Matrices (PWMs)
MyoD (M00001):
• Sequence Logo:
– graphical depiction of a profile
– conservation of elements in a motif
MyoD (M00001):
Motif Identification
• Input:
– Promoter region of a gene g (2000bp upstream)
– Muscle specific binding site s
• Match™ search algorithm
– Minimize false positives
[Kel, A.E., et al., Nucleic Acids Res, 2003.
31(13): p. 3576-9.]
• Output:
– Initial connection strength: motif score $A^0_{gs}$
$A^0_{gs}$: average of the matrix similarity and core similarity scores
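As a minimal sketch of the output step, the initial connection strength can be computed by averaging the two similarity scores for each gene/binding-site pair. The function name and the convention that a missing match is stored as NaN and mapped to 0 are assumptions for illustration, not part of the Match™ tool.

```python
import numpy as np

def initial_connection_strength(matrix_sim, core_sim):
    """Average the matrix similarity and core similarity scores per (gene, site) pair.

    Assumption: pairs with no reported match are stored as NaN and get strength 0.
    """
    matrix_sim = np.asarray(matrix_sim, dtype=float)
    core_sim = np.asarray(core_sim, dtype=float)
    a0 = (matrix_sim + core_sim) / 2.0
    return np.nan_to_num(a0, nan=0.0)
```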
Stability Analysis for mNCA
• The information sources:
– mRNA Microarray data (specific but noisy)
– motif information (general & with false positives)
• The questions we want to answer:
– What TFs play a relevant role in the experiment?
– What genes are regulated by a particular TF? (downstream
targets)
• Stability analysis: if small perturbations are applied,
– A bad TFA estimate tends to be altered easily, or even
destroyed;
– A good TFA estimate tends to keep its activity pattern
throughout the perturbation.
Testing Stability by Perturbations
• Method 1: Thresholding the motif score
– A TF-gene connection is deleted if the motif score is
below some cut-off threshold. By setting different cut-off
thresholds, we can change the number of connections,
hence, change the network topology accordingly.
• Method 2: Deleting/inserting connections
– TF-gene connections are altered randomly, either by
deleting the existing connections or inserting new
connections with some small percentage (e.g., 10%).
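The two perturbation methods can be sketched as follows. The function names are hypothetical, and for Method 2 the sketch assumes that the number of altered entries per TF equals the given fraction of that TF's existing connections, with each chosen entry flipped (an existing connection is deleted, a missing one is inserted).

```python
import numpy as np

def threshold_topology(A0, cutoff):
    """Method 1: keep a TF-gene connection only if its motif score reaches the cutoff."""
    return np.asarray(A0) >= cutoff

def flip_connections(Z, fraction=0.10, rng=None):
    """Method 2: for each TF (column), randomly delete or insert connections.

    Assumption: the number of changes per TF is `fraction` of its existing connections.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = np.array(Z, dtype=bool)  # copy, so the caller's topology is untouched
    n_genes, n_tfs = Z.shape
    for s in range(n_tfs):
        n_changes = int(round(fraction * Z[:, s].sum()))
        genes = rng.choice(n_genes, size=n_changes, replace=False)
        Z[genes, s] = ~Z[genes, s]  # existing edge is deleted, missing edge is inserted
    return Z
```

Sweeping `cutoff` over a range of values, or calling `flip_connections` repeatedly with different random states, yields a family of perturbed topologies on which the TFAs can be re-estimated.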
Understanding of Stability Analysis
• Obtain the confidence measure of an estimate by comparing it before and after perturbation:
– e.g., absolute correlation coefficient 0.92: highly confident
– e.g., absolute correlation coefficient 0.52: less confident
Stability Measurement
• Stability measurements from perturbations:
$$\text{stability measurements of the } j\text{-th TFA} = \{\, |\mathrm{correlation}[\mathrm{TFA}_j(i), \mathrm{TFA}_j(k)]| \;:\; i \neq k \,\}$$
[Figure: boxplot of the stability measurements, marking the 25% quantile, median, and 75% quantile]
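A minimal sketch of this measurement, assuming the TFA estimates from R perturbation runs are stacked into one array of shape (R, L, M) for L TFs and M samples. The function names and the array layout are illustrative choices, not taken from the slides.

```python
import numpy as np

def stability_measurements(tfa_runs):
    """Absolute correlation between every pair of runs (i != k), per TF.

    tfa_runs : (R, L, M) array of TFA estimates
    returns  : (L, R*(R-1)/2) array of absolute correlation coefficients
    """
    tfa_runs = np.asarray(tfa_runs, dtype=float)
    R, L, _ = tfa_runs.shape
    i, k = np.triu_indices(R, k=1)  # every unordered pair of runs with i < k
    out = np.empty((L, i.size))
    for j in range(L):
        corr = np.corrcoef(tfa_runs[:, j, :])  # (R, R) correlations between runs
        out[j] = np.abs(corr[i, k])
    return out

def boxplot_summary(measurements):
    """25% quantile, median, and 75% quantile per TF, as drawn in a boxplot."""
    return np.percentile(measurements, [25, 50, 75], axis=1).T
```

Taking the absolute value makes the measure insensitive to a sign flip of an estimated TFA between runs, which the linear model cannot distinguish from a sign flip of the corresponding connection strengths.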
Experimental Results
• Dataset Description:
Staged skeletal muscle degeneration/regeneration was
induced by injection of cardiotoxin (CTX). Over a time range
of up to 40 days, 27 time points were sampled, with two
duplicate mice profiled at each time point.
[Timeline of sampled time points (day): …, 0.5, 1, 2, 3, 4, 5, …, 10, 11, 12, 13, 14, 16, 20, 30, 40]
The time-course microarray data set was acquired with
Affymetrix’s Murine Genome U74v2 Set in an expression
profiling study at Children’s National Medical Center (CNMC).
We obtained expression measurements of 7570 probesets in
each sample.
Muscle Related TFs
• 24 muscle-related TF binding sites from TRANSFAC:
YY1, MEF-2, E2A, SRF, Tal-1alpha:E47, NF-Y, alpha-CP1, USF, USF2,
Tal-1beta:E47, NKX25, Nkx2-5, TATA, TBP, GATA, GATA-4, Sp1, Hand1:E47,
Ebox, myogenin, TBX5, E47, MyoD, E12
Muscle Related TFs
• Some muscle related TF binding sites from TRANSFAC:
Stability Analysis (Method I)
• Thresholding the motif score:
– The threshold of motif score was set from low to high, making the connection number
vary gradually from 12,000 to 18,000, which results in more than 30% topology
alterations.
[Figure: stability measurement (0 to 1) against transcription factor index (1 to 24). TF labels: E12, E47, GATA, GATA-4, TBP, SRF, MyoD, TBX5, TATA, Nkx2-5, NKX25, E2A, Ebox, myogenin, USF2, USF, MEF-2, Hand1:E47, Sp1, NF-Y, alpha-CP1, Tal-1beta:E47, Tal-1alpha:E47, YY1. YY1, MyoD, and myogenin are highlighted.]
Stability Analysis (Method II)
• Deleting or inserting connections:
– For each transcription factor, 10% of connections were altered randomly regardless of
the motif score, by deleting existing connections or inserting new connections to test
the stability of TFA estimates.
[Figure: stability measurement (0 to 1) against transcription factor index (1 to 24). TF labels: E12, E47, GATA, GATA-4, TBP, SRF, MyoD, TBX5, TATA, Nkx2-5, E2A, NKX25, myogenin, Ebox, USF2, USF, MEF-2, Hand1:E47, Sp1, NF-Y, Tal-1beta:E47, alpha-CP1, YY1, Tal-1alpha:E47. YY1, MyoD, and myogenin are highlighted.]
Stable TFA Estimates
• The most stable TFA - YY1:
– Observed expression shows almost no change;
– Estimated TFA is muscle regeneration related.
[Figure: two time-course panels, YY1’s gene expression (probe id: 98767_at; log expression ratio) and estimated YY1’s TFA (log TFA ratio), each plotted against time (days)]
YY1’s TFA Estimate
• The difference between YY1’s mRNA level
and protein level is supported by biological
experiments.
Walowitz, JL, et al., “Proteolytic Regulation of the Zinc Finger Transcription Factor
YY1, a Repressor of Muscle-restricted Gene Expression,” J Biol Chem, Vol.
273, Issue 12, 6656-6661, March 20, 1998.
[Figure: YY1 expression level and YY1 protein level]
YY1 – A Repressor in Muscle Regeneration
• Underlying regulation mechanism:
[Figure: estimated YY1’s TFA (log TFA ratio) and Calpain II’s gene expression (probe id: 101040_at; log expression ratio), each plotted against time (days), with a diagram linking Calpain II, YY1, and YY1 targets]
Stable TFA estimates
• Some other stable TFAs: myogenin & MyoD
[Figure: for myogenin and MyoD (probe ids as listed on the slide: 103053_at, 102986_at), expression (log expression ratio) and estimated TFA (log TFA ratio), each plotted against time (days)]
Identifying TF’s Downstream Targets
• Stability Analysis:
– Similarly, we can test the stability of regulation
strength A under small perturbations, and hence rank
the most likely targets of a specific TF.
• Ranking downstream targets by frequency count
(confidence measure):
– Perform multiple independent perturbations by
deleting a connection with some probability.
– Count how many times a TF-gene regulation
strength is in the top rank group (defined by some
preset threshold), based on its regulation strength A.
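The frequency-count ranking can be sketched as below. The callable `estimate_A` is a hypothetical stand-in for the mNCA estimation step (it takes the expression data and a perturbed topology and returns the regulation strengths), and ranking by absolute strength is an assumption so that repressive connections are not penalized.

```python
import numpy as np

def target_frequency(E, Z0, tf, estimate_A, n_runs=1000, p_delete=0.3, top=100, seed=0):
    """Count how often each gene's strength for one TF lands in the top group.

    E          : (N, M) expression matrix
    Z0         : (N, L) boolean initial topology
    tf         : column index of the TF of interest
    estimate_A : hypothetical callable (E, Z) -> (N, L) regulation strengths
    """
    rng = np.random.default_rng(seed)
    Z0 = np.asarray(Z0, dtype=bool)
    counts = np.zeros(Z0.shape[0], dtype=int)
    for _ in range(n_runs):
        # Delete each existing connection independently with probability p_delete.
        Z = Z0 & (rng.random(Z0.shape) >= p_delete)
        A = estimate_A(E, Z)
        strength = np.abs(A[:, tf])
        ranked = np.argsort(-strength)[:top]
        ranked = ranked[strength[ranked] > 0]  # genes without a connection are never counted
        counts[ranked] += 1
    return counts
```

Sorting genes by the returned counts in decreasing order gives the ranked list of likely downstream targets; genes that stay in the top group across most perturbations are the most confident ones.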
Stability Analysis of MyoD’s Targets
• MyoD’s downstream targets ranking:
– 1000 independent perturbations are carried out.
– Each connection is deleted with a probability (e.g.,
0.3).
– The top ranking threshold is set to 100 in this case: if a gene’s regulation
strength by MyoD is in the top 100, the gene is counted once.
[Figure: frequency count (0 to 700) against sorted downstream targets' index (0 to 500)]
MyoD’s Downstream Targets
• MyoD’s downstream genes from Ingenuity Pathway Analysis:
Top 100 genes: 16 genes directly related to MyoD, and several
key muscle regeneration TFs: MYC, MYOG, and MEF2C
YY1’s Downstream Targets
• YY1’s downstream genes from Ingenuity Pathway Analysis:
Conclusions
• A new computational approach, namely motif-directed network component analysis (mNCA), has
been developed to integrate motif information and
microarray data for regulatory network inference.
– Motif information has been utilized to derive the initial
topology information for mNCA.
– With the awareness of many false-positives in motif
information, stability analysis procedures have been
developed to extract stable TFAs and TFs’ downstream
targets.
• The experimental results have demonstrated that
mNCA can help reveal key regulators in muscle
regeneration.
Future Work – New Hypothesis & Validation
• Integrative approaches to pathway building
[Figure: pathway diagram with nodes CYBB, MCM5, ………, RRM1, myogenin, MyoD, MYL4, TNNC1, MYBPH, …, DES, c-Myc, YY1, Calpain II, PAX2, DYS, ….; the legend distinguishes interactions from database and knowledge from interactions derived from computational methods]
Acknowledgement
• NIH Grants:
– NS2925-13A, CA 096483 & CA109872
• DoD/CDMRP Grant
– BC030280
Thank you very much!