Contrast pattern aided regression
Pattern Aided Regression Modeling
& Pattern Aided Problem Solving
Guozhu Dong
Professor, PhD
Prediction is difficult, even when
it is about the past.
Please cite this paper on CPXR:
Guozhu Dong & Vahid Taslimitehrani.
Pattern-Aided Regression Modeling and
Prediction Model Analysis.
To appear in IEEE TKDE.
Data Mining Research Lab
CSE & Kno.e.sis Center
Wright State University
Overview
Introduction
Pattern aided regression modeling: PXR
Contrast pattern aided regression algorithm: CPXR
Diverse predictor-response relationships
CPXR(Log): Traumatic brain injury (TBI) and heart failure (HF)
outcome prediction
Potential applications of the CPXR methodology
Other pattern aided problem solving results/apps
new regression model type
Most are contrast pattern aided results
Some use patterns only; some use something extra
Concluding remarks
Prediction is difficult
"Prediction is difficult, especially if it is about the future."
Niels Bohr, Nobel laureate in Physics (also cited as a Danish proverb)
"Those who have knowledge, don't predict. Those who predict, don't have knowledge."
Lao Tzu, 6th Century BC Chinese Philosopher
"Prediction is difficult, even when it is about the past."
Guozhu Dong
Guozhu Dong: Pattern Aided Regression
Modeling
Preliminaries on prediction
using regression
Training dataset: {(x_i, y_i) | 1 <= i <= n}
x_i: vector of predictor variables
y_i: value of response variable
Regression model evaluation
LR: Linear regression
Teaser 1: Performance of contrast
pattern aided regression (CPXR)
CPXR:
highest accuracy in 41 out of 50 datasets
Average RMSE reduction (relative to LR) of 42% in 50 datasets,
much higher than that of best competing method
CPXR achieved 60+% RMSE reduction in 10 out of the 50 datasets.
CPXR is better than LR in all 50 datasets.
RMSE reduction = [RMSE(LR) - RMSE(M)] / RMSE(LR)
LR: Linear regression
GBM: Gradient Boosting (generalization of AdaBoost)
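The RMSE reduction measure defined on this slide can be sketched in a few lines. This is a minimal illustration with invented toy numbers; the function names `rmse` and `rmse_reduction` are hypothetical, not taken from the CPXR paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def rmse_reduction(y_true, pred_lr, pred_m):
    """[RMSE(LR) - RMSE(M)] / RMSE(LR): share of LR's error removed by model M."""
    base = rmse(y_true, pred_lr)
    return (base - rmse(y_true, pred_m)) / base

# Toy numbers, invented for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
lr_pred = [2.0, 3.0, 4.0, 5.0]   # every LR prediction is off by 1.0
m_pred = [1.5, 2.5, 3.5, 4.5]    # every M prediction is off by 0.5
reduction = rmse_reduction(y, lr_pred, m_pred)
```

A positive value means M is more accurate than LR; a value of 0.42 would correspond to the 42% average reduction quoted above.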
Teaser 2: AUC of ROC curves for
CPXR(Log) vs other methods on HF and TBI
[Figure: ROC curves (classification results) for CPXR(Log), Log Reg, and SVM; x-axis: false positive rate; panels: HF (with Mayo researchers) and TBI]
AUC_CPXR(Log) = 0.93
AUC_Log Reg = 0.81
AUC_SVM = 0.59
CPXR is good because
PXR can effectively capture diverse predictor-response relationships to find good PXR models
Definition: The data (for a given application) contains
diverse predictor-response relationships if it contains
different subgroups whose best-fit local models are
highly different [Dong+Taslimitehrani TKDE15]
Diverse predictor-response relationships are the
main reason why the best state-of-the-art regression
methods often perform poorly
Preliminaries
Patterns & contrast patterns
Pattern: condition describing set of objects
EG: age <= 35 & rank = full professor
It describes all full professors with age <=35.
A pattern describes a region of data in a low dimensional
subspace of high dimensional data
Contrast patterns: conditions that distinguish objects in
different classes/conditions
Contrast patterns are useful:
They are strongly associated with issues of importance
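To make the notion of a pattern concrete, here is a minimal sketch that treats a pattern as a conjunction of single-attribute conditions and checks which objects satisfy it. The records, attribute names, and the helper `matches` are hypothetical and chosen only to mirror the "age <= 35 & rank = full professor" example.

```python
def matches(pattern, obj):
    """True if the object satisfies every (attribute, test) condition of the pattern."""
    return all(test(obj[attr]) for attr, test in pattern)

# The example pattern from the slide: age <= 35 & rank = full professor.
young_full_prof = [
    ("age", lambda a: a <= 35),
    ("rank", lambda r: r == "full professor"),
]

# Invented records; the extra "dept" attribute plays no role in the pattern,
# which is why a pattern describes a region in a low dimensional subspace.
people = [
    {"name": "A", "age": 33, "rank": "full professor", "dept": "CS"},
    {"name": "B", "age": 52, "rank": "full professor", "dept": "EE"},
    {"name": "C", "age": 29, "rank": "assistant professor", "dept": "CS"},
]
matched = [p["name"] for p in people if matches(young_full_prof, p)]
```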
Contrast patterns thru example
• CP: A1=b & A3=e
• It matches all C1 objects.
• It matches no C2 objects.
• Its mds={t1,t2}
Generally: A pattern is CP if it matches many more objects in
one class than in other classes (aka emerging patterns)
mds(P): the set of objects matching P.
An equivalence class: A set of patterns with same mds (having
same behavior).
Pick one as representative.
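The definitions on this slide (matching dataset, contrast between classes, equivalence classes) can be sketched as follows. The slide's own data table did not survive extraction, so the four-row table below is a hypothetical stand-in chosen so that A1=b & A3=e behaves as described; the helper names are also assumptions.

```python
from collections import defaultdict

# Hypothetical toy table, not the one shown on the original slide.
data = {
    "t1": {"A1": "b", "A3": "e", "A4": "g", "cls": "C1"},
    "t2": {"A1": "b", "A3": "e", "A4": "g", "cls": "C1"},
    "t3": {"A1": "a", "A3": "e", "A4": "h", "cls": "C2"},
    "t4": {"A1": "b", "A3": "f", "A4": "h", "cls": "C2"},
}

def mds(pattern):
    """Matching dataset: ids of objects satisfying every attribute=value item."""
    return frozenset(t for t, row in data.items()
                     if all(row[a] == v for a, v in pattern.items()))

def support(pattern, cls):
    """Fraction of the objects of class `cls` that match the pattern."""
    members = [t for t, row in data.items() if row["cls"] == cls]
    matched = mds(pattern)
    return sum(t in matched for t in members) / len(members)

cp = {"A1": "b", "A3": "e"}
cp_mds = mds(cp)
supp_c1, supp_c2 = support(cp, "C1"), support(cp, "C2")

# Patterns with the same mds fall into one equivalence class.
candidates = [cp, {"A4": "g"}, {"A1": "b"}]
classes = defaultdict(list)
for p in candidates:
    classes[mds(p)].append(p)
```

Here A1=b & A3=e and A4=g share the same mds, so either one could serve as the representative of their equivalence class.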
Why contrast pattern based approaches
are successful & have great potential?
Contrasting is meaningful
Contrasting -> risk indicator -> advantages in survival/wellbeing
Contrasting is built into human (animal) instincts
Focus on contrast patterns -> focus on important issues
High-dimensional (3-7 dim) contrast patterns capture important
novel multi-variable interactions related to goals
Opportunity: Humans often use low dimensional CPs
(brain’s computing power is low? lack of data?)
They use independence assumptions for high dimensional apps
WE WANT TO DO BETTER! WE CAN DO BETTER!!!
Pattern aided regression model
A PXR model is represented by a tuple PM = ((P1, f1, w1), ..., (Pk, fk, wk), fd)
Each Pi is a pattern
Each fi is a local regression model for data satisfying Pi,
fd is a default local regression model
Each wi is weight for Pi
The regression function of PM is defined by
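The formula for the regression function is not preserved in this transcript. A minimal sketch, assuming the function is the weight-normalized average of the local models whose patterns match x, with the default model fd used when no pattern matches; the model below (patterns, coefficients, weights) is hypothetical.

```python
def pxr_predict(model, x):
    """model = (triples, f_default), where triples is a list of (pattern, f, w)."""
    triples, f_default = model
    hits = [(f, w) for pattern, f, w in triples if pattern(x)]
    if not hits:
        return f_default(x)                 # no pattern matches: use fd
    total_w = sum(w for _, w in hits)
    return sum(w * f(x) for f, w in hits) / total_w

# Hypothetical two-pattern model over predictors u, v, z (assumed form).
model = (
    [
        (lambda x: x["u"] <= 1.0, lambda x: 2.0 * x["v"] + 1.0, 0.75),   # (P1, f1, w1)
        (lambda x: x["z"] > 5.0, lambda x: -1.0 * x["v"] + 4.0, 0.25),   # (P2, f2, w2)
    ],
    lambda x: 0.5 * x["v"],                                              # fd
)

both = pxr_predict(model, {"u": 0.5, "v": 2.0, "z": 6.0})   # matches P1 and P2
none = pxr_predict(model, {"u": 3.0, "v": 2.0, "z": 1.0})   # matches neither
```

Because matching datasets of different patterns may overlap, the weights decide how much each matching local model contributes.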
A pictorial illustration of a simple PXR model
Different patterns can involve different sets of variables
[describing data regions in different subspaces]
Matching datasets of different patterns can overlap
An example PXR model
u, v, z: predictor variables; y: response variable
((P1, f1, w1), (P2, f2, w2), fd) gives a PXR model
Discussion
PXR is a strict generalization of Piecewise
Linear Regression (PLR)
PLR can be viewed as trying to model diverse predictor-response (PR)
relationships, but it is limited in modeling capabilities
and computing algorithms [and its authors did not recognize DPR]
Often a PXR model uses few patterns (e.g. 7)
Local regression models
model type can be complex or simple
we often use simple ones such as linear or
piecewise linear models
Diversity of predictor-response
relationships
Different pattern-model pairs emphasize
different sets of variables
Different pattern-model pairs use highly
different regression functions
Each pattern-model pair captures a highly
distinct kind of behavior
Diverse predictor-response relationships
may be neutralized at the global level
DPR in TBI (traumatic brain injury)
How CPXR builds PXR models
Training data D for regression
-> Baseline regression model f0
-> D = LE U SE (LE: Large Error, SE: Small Error)
-> Mine CPs: representative CPs of LE & SE
-> Local regression models for CPs
-> PXR model
How CPXR builds PXR models: (D, f0) ->
(LE, SE) -> Ps and fs -> PXR
Starting with training dataset
Build a baseline regression model f0 (or use given f0)
Split data into LE (large error) and SE (small error),
based on f0’s prediction error
Mine CPs; remove some CPs
Build corresponding local regression models for
remaining CPs, and select patterns to construct PXR
Many technical details, including variable binning,
splitting data, search objectives, baseline model type,
local model type …
Control overfitting: don’t use P if |mds(P)| <= #vars
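Two of the steps above can be sketched directly: splitting D into LE and SE from the baseline model's errors, and the overfitting guard on |mds(P)|. The slide does not state the split criterion, so the rule below (LE takes the largest-error points until they cover a fraction `rho` of the total absolute error) is an assumption, as are the function names and the toy data.

```python
def split_le_se(xs, ys, f0, rho=0.5):
    """Assumed rule: LE = largest-error points covering `rho` of total absolute error."""
    errs = [abs(y - f0(x)) for x, y in zip(xs, ys)]
    order = sorted(range(len(xs)), key=lambda i: errs[i], reverse=True)
    total, covered, le = sum(errs), 0.0, []
    for i in order:
        if covered >= rho * total:
            break
        le.append(i)
        covered += errs[i]
    se = sorted(set(range(len(xs))) - set(le))
    return sorted(le), se

def usable(pattern_mds, num_vars):
    """Overfitting guard from the slide: don't use P if |mds(P)| <= #vars."""
    return len(pattern_mds) > num_vars

# Toy data with a given baseline f0(x) = x; only the last point is badly predicted.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 1.0, 2.0, 3.0, 4.0, 8.0]
le, se = split_le_se(xs, ys, lambda x: x)
```

Contrast patterns would then be mined with LE and SE as the two classes, and a local model fitted only for patterns that pass `usable`.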
Summary of empirical results
CPXR is highly accurate for building regression
models
Outperforms other regression methods, often by big
margins
• On accuracy
• On overfitting
• On sensitivity to noise
Experiments indicate: Diverse predictor-response relationships
occur often in real life, for data with >=3 dimensions
We used 50 real datasets, and 20+ synthetic ones
Previous regression methods
Linear regression (LR): uses a linear function
Piecewise linear regression (PLR): splits one
variable into intervals, uses a different linear
function for each interval
Support vector regression (SVR): SVM like, but
minimizing prediction error
Bayesian additive regression trees (BART):
ensemble of (hundreds of) decision trees
Neural networks, Gradient Boosting …
Interpretability is low
Experiments used 50 real datasets
used in previous regression studies
6 example datasets
CPXR achieved large RMSE reduction
(accuracy improvement) consistently
CPXR:
highest accuracy in 41 out of 50 datasets (4 competitors)
Average RMSE reduction (relative to LR) of 42% in 50 datasets,
much higher than that of best competing method
CPXR achieved 60+% RMSE reduction in 10 out of the 50 datasets.
CPXR is better than LR in all 50 datasets.
RMSE reduction = [RMSE(LR) - RMSE(M)] / RMSE(LR)
GBM: Gradient Boosting (generalization of AdaBoost)
We also tried other competitors but they are not competitive
CPXR is not better than other methods on random data
Box plot of RMSE reduction
CPXR(LP)’s median > Q3 of PLR, LR, BART
CPXR(LP)’s Q1 > median of PLR, LR, BART
CPXR(LL) is a variant of CPXR
Evaluation on overfitting
CPXR is more accurate than PLR, LR and BART on testing
data; its model complexity is fairly low.
CPXR has smaller relative accuracy drop (from training to
test)
Evaluation on sensitivity to noise
Build PXR on clean training data
Compare accuracies on
• training data
• noise-added test data
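The slides do not say how noise was added to the test data. A minimal sketch of one plausible scheme, assumed here: zero-mean Gaussian noise whose standard deviation is a chosen fraction (`level`) of the variable's own standard deviation. The function name and the data are hypothetical.

```python
import random
import statistics

def add_noise(values, level, rng):
    """Add zero-mean Gaussian noise with standard deviation level * std(values)."""
    sigma = level * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sigma) for v in values]

rng = random.Random(0)                     # fixed seed so the run is repeatable
clean_test = [1.0, 2.0, 3.0, 4.0, 5.0]
same = add_noise(clean_test, 0.0, rng)     # level 0 leaves the data unchanged
noisy = add_noise(clean_test, 0.1, rng)    # 10% noise level
```

A model built on clean training data would then be scored once on the training data and once on `noisy`, and the two accuracies compared.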
CPXR’s outperformance vs degree
of diversity of PR-Relationships
PIP: Positive impact pattern (local model making big improvement)
Diff = ratio of largest coefficients of pairs of local models
High: CPXR has large RMSE reduction; Low: CPXR has low RMSE reduction
Diverse Predictor-Response Relationships
May Neutralize Each Other at High Level
Example: We considered a set S1 of 4 variables, S2=S1 + 2
variables, on soil water content data
For LR, S2 does not give improvement over S1
Ditto for PLR, SVR, BART
For CPXR, PXR model on S2 gives 20% RMSE
improvement over PXR model on S1
The new variables are involved in most of the diverse PR
relationships (the patterns in the PXR model)
These relationships somehow cancelled each other’s
effect at the whole dataset level, so they are missed by LR etc.
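The cancellation effect can be reproduced on synthetic data. In this minimal sketch (invented numbers, not the soil water content data), two subgroups have opposite slopes, so a single global linear fit sees no relationship even though each local fit is perfect.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def slope_of(points):
    return fit_line([p[0] for p in points], [p[1] for p in points])[0]

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
group_a = [(x, 3.0 * x) for x in xs]     # subgroup A: y = 3x
group_b = [(x, -3.0 * x) for x in xs]    # subgroup B: y = -3x

slope_a = slope_of(group_a)              # local model for A
slope_b = slope_of(group_b)              # local model for B
slope_all = slope_of(group_a + group_b)  # global model over both subgroups
```

The global slope is zero, so a global method would judge the variable useless, while a pattern that separates A from B recovers both relationships.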
Other apps of CPXR, besides
building accurate prediction models
Analysis on a given prediction model
On what kinds of data does it make large prediction errors?
How to correct those prediction errors?
Do important models in science and medicine
have systematic mistakes?
Analysis on comparing two given prediction models,
w.r.t. their differences
Discovering policy errors, niche opportunities, …
Discovering true importance of variables (medicine)
Discovering intricate multi-variable interactions
……
CPXR for logistic regression modeling and
results on outcome prediction for HF/TBI
The PXR-CPXR approach is not limited to linear
regression.
We adapted it for logistic regression to get CPXR(Log)
We used CPXR(Log) for outcome prediction for
traumatic brain injury patients & heart failure patients.
CPXR(Log) is much more accurate than standard
logistic regression and SVM
CPXR(Log) also identifies important variables that are
considered unimportant by standard logistic regression
Results on TBI
AUC of ROC curves for CPXR(Log)
and other methods (on HF and TBI)
[Figure: ROC curves for CPXR(Log), Log Reg, and SVM; x-axis: false positive rate; panels: HF and TBI]
AUC_CPXR(Log) = 0.93
AUC_Log Reg = 0.81
AUC_SVM = 0.59
Details on CPXR(Log) Model for TBI
Work on Heart Failure Patient Risk Prediction
Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam
Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart
Failure Risk Prediction Model using CPXR(Log). Submitted to journal
Mayo’s EHR Data:
Patient’s demographic data -- age, gender, race and ethnicity.
Lab results -- cholesterol, sodium, hemoglobin and lymphocytes, and EF.
Medications -- Angiotensin Converting Enzyme (ACE) inhibitors,
Angiotensin Receptor Blockers (ARBs), β-adrenoceptor antagonists (β-blockers), Statins, and Calcium Channel Blockers (CCBs).
26 major chronic conditions (co-morbidities).
Many variables; many are important/needed
Finished CPXR-based projects/papers
up to March 2015
Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam
Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart Failure
Risk Prediction Model using CPXR(Log). Submitted to journal
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov A.
Pachepsky. Measurement Scale Effect on Prediction of Soil Water
Retention Curve and Saturated Hydraulic Conductivity. Submitted.
Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic
Regression Method and Clinical Prognostic Modeling Results Using the
Method on Traumatic Brain Injury. In Proceedings of IEEE International
Conference on BioInformatics and BioEngineering (BIBE) 2014
Building accurate loan default risk models for a private company. 2014-2015.
Looking for collaborators
Summary of strengths of
PXR/CPXR
More accurate than state-of-the-art methods
Philosophy: Using patterns to identify data groups
where given model makes large prediction errors
that can be corrected systematically
Using different pattern-model pairs to model diverse
predictor-response variable relationships
The approach is better suited to high dimensional
data than other methods
Offering insights into mistakes in business strategies
…
I have focused on PXR/CPXR. I will now
discuss other pattern aided problem
solving methods and applications
Other methods:
Contrast pattern aided clustering
Contrast pattern based classification and improvement
of traditional methods
Contrast pattern aided gene ranking for complex
diseases
Contrast pattern aided outlier detection
Applications: Diagnosis of diseases, study of complex
diseases, blog analysis, compound selection for drug
design, crime environ analysis, apartment rental price
prediction, activity recognition, …
We published a book on
CDM in 2012; 3 out of 6
parts on applications
1. Preliminaries and Measures on Contrasts
2. Contrast Mining Algorithms
3. Mining Generalized Contrasts
4. Contrast Mining for Classification & Clustering
5. Contrast Mining for Bioinformatics & Chemoinformatics
6. Contrast Mining for Special Application Domains
44 contributing authors, from about a dozen countries; not comprehensive
Methods used by many scientists.
My recent results in this area
1. CAEP-style classification: discriminative power aggregation of emerging patterns
2. Outlier detection / intrusion detection: almost model free; using discriminative pattern length
3. Clustering quality evaluation using patterns (quality, abundance, diversity): no distance function
4. CP based clustering and cluster description: no distance function needed
5. Interaction based gene/SNP ranking for complex diseases
6. Contrast pattern aided regression: effectively handling diverse predictor-response relationships
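The first item in the list above, CAEP-style classification, can be illustrated with a small sketch. It assumes the usual CAEP aggregation, where each emerging pattern contained in an instance contributes gr/(gr+1) times its support in its own class (gr being the growth rate); the per-class score normalization that CAEP also applies is left out, and the patterns and supports below are invented.

```python
def strength(supp_home, supp_other):
    """gr/(gr+1) * supp_home with gr = supp_home/supp_other, safe when supp_other = 0."""
    return supp_home * supp_home / (supp_home + supp_other)

# Hypothetical emerging patterns per class:
# (itemset, support in own class, support in the other class)
eps = {
    "pos": [(frozenset({"a", "b"}), 0.6, 0.1), (frozenset({"c"}), 0.5, 0.25)],
    "neg": [(frozenset({"d"}), 0.8, 0.2), (frozenset({"b", "e"}), 0.4, 0.0)],
}

def score(instance, cls):
    """Aggregate the strengths of all patterns of `cls` contained in the instance."""
    return sum(strength(s_home, s_other)
               for items, s_home, s_other in eps[cls] if items <= instance)

def classify(instance):
    """Pick the class with the highest aggregated score."""
    return max(eps, key=lambda cls: score(instance, cls))

x = {"a", "b", "c"}
label = classify(x)
```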
Key Challenges for Pattern Aided
Problem Solving
For each problem to solve, our general approach is
to use a selected pattern set to help reach our goals.
Q:
(1) What kinds of pattern sets?
(2) How to use the patterns in the set?
(3) How to efficiently search for desired pattern sets?
We need effective techniques
There are millions of (contrast) patterns
The search space is huge
Using CPs is more efficient
CPCQ Clustering Quality Index
• CPCQ Rationale: A high-quality clustering, capturing natural concepts in data, should have many diversified high-quality contrast patterns (CPs) contrasting its clusters.
• A CP characterizes its home cluster and discriminates its home cluster against other clusters.
• Home cluster of a CP: the cluster where it has highest frequency among all clusters
• Think of a cluster as a class.
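The home-cluster definition above translates into a short computation. This is a minimal sketch on invented itemset data; the helper names `frequency` and `home_cluster` are hypothetical.

```python
def frequency(pattern, cluster):
    """Fraction of tuples in the cluster that contain every item of the pattern."""
    return sum(pattern <= t for t in cluster) / len(cluster)

def home_cluster(pattern, clusters):
    """Name of the cluster in which the pattern has the highest frequency."""
    return max(clusters, key=lambda name: frequency(pattern, clusters[name]))

# Two toy clusters, each a list of tuples represented as item sets.
clusters = {
    "C1": [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}],
    "C2": [{"c", "d"}, {"a", "b", "d"}, {"d", "e"}],
}
p = {"a", "b"}
home = home_cluster(p, clusters)
```

Treating each cluster as a class, the gap between a pattern's frequency in its home cluster and elsewhere indicates how well it contrasts the clusters.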
CPC Algorithm
Contrast Pattern Based Clustering – aimed to maximize CPCQ
1. Select Seed CPs
2. Assign Patterns to CP Group G1 of Clusters, Using MPQ
3. Assign Patterns as CPs of Clusters, Using Tuple Overlap
4. Assign Tuples to Clusters, Using Tuple Overlap
[Figure: items-by-tuples diagram showing seed sets S1, S2, pattern sets PS(C1), PS(C2), and clusters C1, C2]
Clustering data using CPC into groups,
each having succinct informative group descriptions
• EG: Given a collection of texts/blogs (collected at ASU).
• We cluster the blogs into four groups
• each group is associated with a small set of patterns
• the patterns clearly indicate what the groups are about.
CAEP: Semi-supervised chemical compound
screening with few training samples
A special feature is ECP (an adaptation of CAEP)’s
ability to accurately classify molecules on the basis of
very small training sets containing only a few (e.g. 3
per class) compounds.
This feature is highly relevant for virtual compound
screening when very few experimental hits are
available as templates.
Reference: Jens Auer et al. Simulation of sequential screening
experiments using emerging chemical patterns. Medicinal
Chemistry, 4(1):80–90, 2008. [from its abstract]
IBIG vs IG & FC for Colon Cancer
Traditional gene ranking methods:
Fold change & entropy: rank genes by considering impact of one gene at a time.
Are they suitable for complex diseases?
Genes lowly ranked by FC could be highly ranked by IBIG
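For comparison, single-gene ranking by fold change (FC) can be sketched as below. The expression values, gene names, and class labels are invented, and ranking by absolute log2 fold change is one common convention assumed here; IBIG itself is not shown because the slides do not define it.

```python
import math

def fold_change_rank(expr, labels, case="tumor", control="normal"):
    """Rank genes by |log2(mean case expression / mean control expression)|."""
    scores = {}
    for gene, values in expr.items():
        case_vals = [v for v, l in zip(values, labels) if l == case]
        ctrl_vals = [v for v, l in zip(values, labels) if l == control]
        fc = (sum(case_vals) / len(case_vals)) / (sum(ctrl_vals) / len(ctrl_vals))
        scores[gene] = abs(math.log2(fc))
    return sorted(scores, key=scores.get, reverse=True)

labels = ["tumor", "tumor", "normal", "normal"]
expr = {
    "g1": [8.0, 8.0, 2.0, 2.0],   # 4-fold up in tumor
    "g2": [3.0, 3.0, 3.0, 3.0],   # unchanged
    "g3": [1.0, 1.0, 2.0, 2.0],   # 2-fold down in tumor
}
ranking = fold_change_rank(expr, labels)
```

Each gene is scored in isolation, so a gene that matters only in combination with others (like g2 might) ends up at the bottom of the ranking.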
Quote from our Contrast Data Mining book
… the most important contribution of contrast mining
will come when we no longer need … simplifying
approaches to handle … challenge of high
dimensional data, when we have developed the
methodology to systematically analyze, and
accurately use, sets of multi-feature contrast
patterns …. … contrast mining has made useful
progress …. Success … will have a large impact on
… handling of intrinsically complex processes, such
as complex diseases whose behaviors are
influenced by the interaction of multiple … factors.
Potentials (1): Develop New
Pattern Aided Methods
Selecting sets of patterns can help solve existing
challenging problems in a much better way
Identify such problems, work on them …
Using sets of patterns can lead to systematic ways of
handling multiple multi-variable interactions
Contrast mining, pattern aided problem solving, &
pattern aided data analytics, have potential to help
effectively handle challenges of high dimensional data
Potentials (2): Use Current Pattern
Aided Methods to Solve Problems
Our developed methods can help perform regression
more accurately, characterize & correct errors of given
models, for vital applications in science, medicine, &
economics
Improving scientific models
Changing what we believe in?
Our developed clustering, classification, outlier
detection, gene ranking, multiple multi-variable
interaction mining/selection methods can help solve
challenging problems and offer new insights
www.cs.wright.edu/~gdong
Questions
Next step – wish to find collaborators
To work on high impact prediction modeling problems
March 28, 2016