Graph Regularized Dual Lasso for
Robust eQTL Mapping
Wei Cheng¹, Xiang Zhang², Zhishan Guo¹, Yu Shi³, Wei Wang⁴
¹University of North Carolina at Chapel Hill,
²Case Western Reserve University,
³University of Science and Technology of China,
⁴University of California, Los Angeles
Speaker: Wei Cheng
The 22nd Annual International Conference on Intelligent Systems for Molecular Biology (ISMB’14)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
eQTL (Expression QTL)
• Goal: Identify genomic locations where
genotype significantly affects gene expression.
Statistical Test
• Partition individuals into groups according to the genotype of a SNP
• Perform a statistical test (t-test, ANOVA)
• Repeat for each SNP
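The per-SNP test above can be sketched as follows. This is a minimal illustration, assuming a 0/1 genotype coding and using Welch's t statistic; the function name `snp_t_statistics` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def snp_t_statistics(X, z):
    """Welch t statistic per SNP for a single gene.

    X: (K, D) genotype matrix, one row per SNP (0/1 coding is an assumption).
    z: (D,) expression levels of one gene for the same D individuals.
    Returns a length-K array; NaN where a genotype group has fewer than 2 members.
    """
    X = np.asarray(X)
    z = np.asarray(z, dtype=float)
    stats = np.full(X.shape[0], np.nan)
    for k, genotype in enumerate(X):
        # Partition individuals by the genotype of SNP k.
        g0, g1 = z[genotype == 0], z[genotype == 1]
        if len(g0) < 2 or len(g1) < 2:
            continue
        # Difference in group means divided by its standard error.
        se = np.sqrt(g0.var(ddof=1) / len(g0) + g1.var(ddof=1) / len(g1))
        if se > 0:
            stats[k] = (g1.mean() - g0.mean()) / se
    return stats

# Toy data: SNP 0 separates low from high expression, SNP 1 does not.
X = np.array([[0, 0, 0, 1, 1, 1],
              [0, 1, 0, 1, 0, 1]])
z = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
t = snp_t_statistics(X, z)
```

A large absolute statistic for a SNP suggests that its genotype groups differ in expression; a p-value would additionally require the t distribution, which is omitted here.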
[Figure: gene expression level (axis 0 to 12) by genotype group for SNP1, shown alongside the SNP matrix (X) and the gene expression level matrix (Z) across individuals.]
Lasso-based feature selection
• X: the SNP matrix (each row is one SNP)
• Z: the gene expression matrix (each row is one gene expression level)
• Objective:
$$\min_{W}\ \frac{1}{2}\|Z - WX\|_F^2 + \eta\|W\|_1$$
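One common way to minimize an objective of this form is proximal gradient descent (ISTA) with soft-thresholding. The sketch below is only an illustration under that assumption; the slides do not say which solver is used, and `lasso_ista`, the weight name `eta`, and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def lasso_objective(W, X, Z, eta):
    """0.5 * ||Z - W X||_F^2 + eta * ||W||_1."""
    return 0.5 * np.linalg.norm(Z - W @ X, "fro") ** 2 + eta * np.abs(W).sum()

def lasso_ista(X, Z, eta, n_iter=500):
    """X: (K, D) SNPs by individuals, Z: (N, D) genes by individuals.
    Returns W of shape (N, K)."""
    W = np.zeros((Z.shape[0], X.shape[0]))
    # Step size 1/Lip, where Lip = sigma_max(X)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = (W @ X - Z) @ X.T          # gradient of the squared-error term
        W = soft_threshold(W - step * grad, step * eta)
    return W

# Toy data: only SNP 4 affects gene 0.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 100)).astype(float)
W_true = np.zeros((3, 20))
W_true[0, 4] = 2.0
Z = W_true @ X + 0.1 * rng.normal(size=(3, 100))
W_hat = lasso_ista(X, Z, eta=5.0)
```

With this step size each iteration cannot increase the objective, so the returned `W_hat` scores no worse than the all-zero starting point.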
Incorporating prior knowledge
• SNPs (and genes) usually are not independent
• The interplay among SNPs and the interplay among genes can be represented as networks and used as prior knowledge
  - Prior knowledge: genetic interaction network, PPI network, gene co-expression network, etc.
• E.g., group lasso, multi-task, SIOL, MTLasso2G.
Limitations of current methods
• A clustering step is usually needed to obtain the grouping information.
• They do not take into consideration the incompleteness of the prior knowledge or the noise in it.
  - E.g., PPI networks may contain many false interactions and miss true interactions.
• Other prior knowledge, such as location and gene pathway information, is not considered.
Motivation
• Examples of prior knowledge: the genetic interaction network S among SNPs, and gene-gene interactions represented by a PPI network (or gene co-expression network) G. W is the matrix of regression coefficients to be learned.
GD-Lasso: Graph-regularized
Dual Lasso
• Objective:
$$
\begin{aligned}
\min_{W,\,L,\,S \ge 0,\,G \ge 0}\ & \frac{1}{2}\|Z - WX - L\|_F^2 + \eta\|W\|_1 + \lambda\|L\|_* \\
& + \alpha\,\mathrm{tr}\big(W(D_S - S)W^T\big) + \beta\,\mathrm{tr}\big(W^T(D_G - G)W\big) \\
& + \gamma\|S - S_0\|_F^2 + \rho\|G - G_0\|_F^2
\end{aligned}
$$
  - First line: Lasso objective considering confounding factors (L); $\|L\|_*$ is the nuclear norm, which controls L to be low-rank.
  - Second line: the graph regularizer.
  - Third line: the fitting constraint for prior knowledge.
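To make the three groups of terms concrete, the objective can be evaluated numerically as in the sketch below. The weight names `eta`, `lam`, `alpha`, `beta`, `gamma`, `rho`, the helper names, and the toy matrices are illustrative assumptions; this only evaluates the objective and is not the authors' optimizer.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian D_A - A, with D_A the diagonal degree matrix of A."""
    return np.diag(A.sum(axis=1)) - A

def gd_lasso_objective(W, L, S, G, X, Z, S0, G0,
                       eta, lam, alpha, beta, gamma, rho):
    """W: (N, K), L: (N, D), S: (K, K) over SNPs, G: (N, N) over genes."""
    fit = 0.5 * np.linalg.norm(Z - W @ X - L, "fro") ** 2
    sparsity = eta * np.abs(W).sum()
    low_rank = lam * np.linalg.norm(L, "nuc")             # nuclear norm of L
    graph = (alpha * np.trace(W @ laplacian(S) @ W.T)     # smoothness over SNP graph
             + beta * np.trace(W.T @ laplacian(G) @ W))   # smoothness over gene graph
    prior = (gamma * np.linalg.norm(S - S0, "fro") ** 2
             + rho * np.linalg.norm(G - G0, "fro") ** 2)
    return fit + sparsity + low_rank + graph + prior

rng = np.random.default_rng(0)
K, N, D = 4, 3, 6
X = rng.integers(0, 2, size=(K, D)).astype(float)
Z = rng.normal(size=(N, D))
W = rng.normal(size=(N, K))
L = np.zeros((N, D))
S0 = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
G0 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
value = gd_lasso_objective(W, L, S0, G0, X, Z, S0, G0,
                           eta=1.0, lam=1.0, alpha=0.5, beta=0.5, gamma=1.0, rho=1.0)

# For symmetric S: tr(W (D_S - S) W^T) = 1/2 * sum_ij S_ij * ||w_{*i} - w_{*j}||^2,
# i.e. the regularizer penalizes differing coefficient columns of connected SNPs.
pairwise = 0.5 * sum(S0[i, j] * np.sum((W[:, i] - W[:, j]) ** 2)
                     for i in range(K) for j in range(K))
```

The identity in the last comment is why the trace terms act as graph regularizers: connected SNPs (or genes) are encouraged to have similar coefficient vectors.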
GGD-Lasso: Generalized
Graph-regularized Dual Lasso
• Further incorporating location and pathway
information.
• Objective:
$$
\begin{aligned}
\min_{W,\,L,\,S \ge 0,\,G \ge 0}\ & \frac{1}{2}\|Z - WX - L\|_F^2 + \eta\|W\|_1 + \lambda\|L\|_* \\
& + \alpha\sum_{i,j} D(w_{*i}, w_{*j})\,S_{i,j} + \beta\sum_{i,j} D(w_{i*}, w_{j*})\,G_{i,j}
\end{aligned}
$$
D(·, ·) is a nonnegative distance measure.
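As an illustration of the generalized penalty, the sketch below computes the two weighted sums for an arbitrary distance function, reading w_{*i} as the i-th column of W (one per SNP) and w_{i*} as the i-th row (one per gene). The function names, the two example distances, and the default weights are assumptions made for this example.

```python
import numpy as np

def generalized_graph_penalty(W, S, G, dist, alpha=1.0, beta=1.0):
    """alpha * sum_ij D(w_{*i}, w_{*j}) S_ij + beta * sum_ij D(w_{i*}, w_{j*}) G_ij."""
    N, K = W.shape
    # Columns of W are compared over the SNP graph S.
    snp_term = sum(S[i, j] * dist(W[:, i], W[:, j])
                   for i in range(K) for j in range(K))
    # Rows of W are compared over the gene graph G.
    gene_term = sum(G[i, j] * dist(W[i, :], W[j, :])
                    for i in range(N) for j in range(N))
    return alpha * snp_term + beta * gene_term

def squared_euclidean(u, v):
    return float(np.sum((u - v) ** 2))

def manhattan(u, v):
    return float(np.sum(np.abs(u - v)))

W = np.array([[1.0, 2.0],
              [3.0, 5.0]])
S = np.array([[0.0, 1.0], [1.0, 0.0]])
G = np.array([[0.0, 1.0], [1.0, 0.0]])
p_sq = generalized_graph_penalty(W, S, G, squared_euclidean)   # 36.0
p_l1 = generalized_graph_penalty(W, S, G, manhattan)           # 16.0
```

With the squared Euclidean distance and symmetric graphs, each sum equals twice the corresponding trace-form Laplacian penalty, so that special case matches a graph regularizer up to a constant factor.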
GGD-Lasso: Optimization
• Executes the following two steps iteratively until the termination condition is met:
  1) update W while fixing S and G;
  2) update S and G according to W, while decreasing
     $\sum_{i,j} D(w_{*i}, w_{*j})\,S_{i,j}$ and $\sum_{i,j} D(w_{i*}, w_{j*})\,G_{i,j}$.
We can maintain a fixed number of edges in S and G. E.g., to update G, we can swap edge (i', j') and edge (i, j) when
$D(w_{i*}, w_{j*}) < D(w_{i'*}, w_{j'*})$
• Further integrate location and pathway information
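The edge-swapping idea for updating G can be sketched as below. This is a greedy illustration under assumptions: G is a symmetric 0/1 adjacency matrix, an existing edge (i', j') is replaced by a currently absent edge (i, j) only when the latter has the smaller distance, and the names `swap_edges` and `max_swaps` are hypothetical. The slides do not specify the exact selection rule.

```python
import numpy as np
from itertools import combinations

def squared_euclidean(u, v):
    return float(np.sum((u - v) ** 2))

def swap_edges(G, W, dist, max_swaps=1):
    """Replace high-distance edges of G by low-distance non-edges.

    G: (N, N) symmetric 0/1 adjacency over genes; W: (N, K), row i is w_{i*}.
    The number of edges is kept fixed. Returns a modified copy of G.
    """
    G = G.copy()
    pairs = list(combinations(range(G.shape[0]), 2))
    d = {p: dist(W[p[0]], W[p[1]]) for p in pairs}
    for _ in range(max_swaps):
        edges = [p for p in pairs if G[p] > 0]
        non_edges = [p for p in pairs if G[p] == 0]
        if not edges or not non_edges:
            break
        worst = max(edges, key=d.get)       # existing edge (i', j') with largest distance
        best = min(non_edges, key=d.get)    # absent edge (i, j) with smallest distance
        if d[best] >= d[worst]:
            break                           # no swap would decrease the weighted sum
        G[worst] = G[worst[::-1]] = 0
        G[best] = G[best[::-1]] = 1
    return G

# Genes 0 and 1 have similar coefficients, gene 2 is far from both,
# yet the prior network only links genes 0 and 2.
W = np.array([[1.0, 0.0], [1.0, 0.1], [-5.0, 3.0]])
G = np.zeros((3, 3))
G[0, 2] = G[2, 0] = 1
G_new = swap_edges(G, W, squared_euclidean)
```

Here the prior edge between genes 0 and 2 is exchanged for an edge between genes 0 and 1, which lowers the sum of D(w_{i*}, w_{j*}) G_{i,j} while the edge count stays the same.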
Experimental Study: simulation
• 10 gene expression profiles are generated by
$$Z_{j*} = W_{j*}X + \Xi_{j*} + E_{j*},$$
where $\Xi_{j*} \sim N(0, \tau\Sigma)$, $E_{j*} \sim N(0, \sigma^2 I)$, $\Sigma = MM^T$, and $M_{ij} \sim N(0, 1)$.
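A generator for the simulation model above might look like the following sketch: a genetic signal plus a low-rank confounding term plus Gaussian noise. The default values of `tau`, `sigma`, and `n_hidden`, and the toy X and W, are placeholders chosen for illustration and are not taken from the slides.

```python
import numpy as np

def simulate_expression(X, W, tau=0.1, sigma=1.0, n_hidden=10, seed=0):
    """Draw Z = W X + Xi + E, one row per gene expression profile.

    X: (K, D) SNP matrix, W: (N, K) true coefficients.
    Each row of Xi has covariance tau * M M^T; entries of E have variance sigma^2.
    """
    rng = np.random.default_rng(seed)
    N, D = W.shape[0], X.shape[1]
    M = rng.normal(size=(D, n_hidden))           # M_ij ~ N(0, 1)
    # Sampling sqrt(tau) * M u with u ~ N(0, I) gives covariance tau * M M^T.
    Xi = np.sqrt(tau) * (rng.normal(size=(N, n_hidden)) @ M.T)
    E = sigma * rng.normal(size=(N, D))
    return W @ X + Xi + E

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(30, 50)).astype(float)   # 30 SNPs, 50 individuals
W = np.zeros((10, 30))                                 # 10 expression profiles
W[0, 3] = 1.5
Z = simulate_expression(X, W)
```

Because the confounding term is built from only `n_hidden` factors, it is low-rank, which is the structure the nuclear-norm term on L is meant to absorb.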
Experimental Study: simulation
The ROC curve. The black solid line denotes what random guessing would have
achieved.
Experimental Study: simulation
AUCs of Lasso, LORS, G-Lasso and GD-Lasso. In each panel, we vary the percentage of noise in the prior networks S0 and G0.
Experimental Study: Yeast
• Yeast eQTL dataset
  - 112 yeast segregants generated from a cross of two inbred strains, BY and RM;
  - SNP markers with a percentage of NAs larger than 0.1 are removed (the incomplete SNPs are imputed), markers with the same genotypes are merged, and genes with missing values are dropped;
  - this yields 1017 SNP markers and 4474 expression profiles.
• Genetic interaction network and PPI network (S and G)
Experimental Study: Yeast
• cis-enrichment analysis
  - (1) a one-tailed Mann-Whitney test on each SNP for cis hypotheses;
  - (2) a paired Wilcoxon signed-rank test on the p-values obtained from (1).
• trans-enrichment:
  - Similar strategy: genes regulated by transcription factors (TF) are used as trans-acting signals.
Experimental Study: Yeast
Pairwise comparison of different models using cis-enrichment and
trans-enrichment analysis
Experimental Study: Yeast
Summary of the top-15 hotspots detected by GGD-Lasso. Hotspot (12) in bold cannot be detected by
G-Lasso. Hotspot (6) in italic cannot be detected by SIOL. Hotspot (3) in teletype cannot be detected
by LORS.
Experimental Study: Yeast
Hotspots detected by different methods
Conclusion
• In this paper…
  - We propose novel and robust graph-regularized regression models that take into account the prior networks of SNPs and genes simultaneously.
  - Exploiting the duality between the learned coefficients and the incomplete prior networks enables a more robust model.
  - We also generalize our model to integrate other types of information, such as location and gene pathway information.
Thank You !
Questions?
Travel funding to ISMB 2014 was
generously provided by DOE