Transcript ICDM10Ozone
Forecasting Skewed Biased Stochastic Ozone Days:
Analyses and Solutions

Wei Fan, Kun Zhang, and Xiaojing Yuan
Presenter: Prof. Longbing Cao
What is the business problem, and what is its broader area?

Problem: ozone pollution day detection
Ground ozone formation is a sophisticated chemical and physical process that is
"stochastic" in nature.
Ozone levels above certain thresholds are harmful to human health and
our daily life.
Two standards: 8-hour peak and 1-hour peak.
8-hour average > 80 ppb (parts per billion)
1-hour average > 120 ppb
Ozone days occur roughly 5 to 15 days per year.
Broader area: Environmental Pollution Detection and Protection
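To make the two standards concrete, here is a minimal sketch in Python, assuming hourly readings arrive as a pandas Series with a DatetimeIndex; the helper name label_ozone_days and the pandas-based approach are illustrative assumptions, not part of the original system.

    import pandas as pd

    def label_ozone_days(hourly_ppb: pd.Series) -> pd.DataFrame:
        """Label each day as an ozone day under the 1-hour and 8-hour standards.

        hourly_ppb: hourly ozone readings (ppb) indexed by timestamp.
        """
        # Peak 8-hour running mean within each day vs. the 80 ppb standard.
        eight_hr_peak = hourly_ppb.rolling(8).mean().resample("D").max()
        # Peak single-hour reading within each day vs. the 120 ppb standard.
        one_hr_peak = hourly_ppb.resample("D").max()
        return pd.DataFrame({
            "ozone_day_8hr": eight_hr_peak > 80.0,
            "ozone_day_1hr": one_hr_peak > 120.0,
        })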
Drawbacks of alternative approaches
Simulation: consumes high computational power; customized for a
particular location, so solutions are not portable to different places.
Physical model approach: hard to come up with good equations
when there are many parameters, and the equations change from place to place.
What are the research challenges that
cannot be handled by the state-of-the-art?
The dataset is sparse, skewed, stochastic, biased, and
streaming at the same time:
High dimensional
Very few positives
Under similar conditions, sometimes an ozone day happens and
sometimes it doesn't
P(x) differs between training and testing:
training data come from the past, yet we predict the future
The physical model is not well understood and cannot
be customized easily from location to location
What is the main idea of your approach?

Non-parametric models are easier to use when the "physical or
generative mechanism" is unknown.
Reliable "conditional probability" estimation under "skewed,
biased, high-dimensional, possibly irrelevant features".
Estimate a "decision threshold" to predict on the unknown
distribution of the future.
Random Decision Tree:
Super fast implementation
Formal analysis:
Bound analysis
MSE reduction
Bias and variance reduction
P(y|x) order-correctness proof
A CV based procedure for decision threshold selection

[Figure: 10-fold cross-validation on the training set. For each fold, the
algorithm is trained on the other folds and outputs estimated probability
values P(y = "ozone day"|x, θ) for the held-out days. The per-day estimates
and true labels (e.g., 7/1/98: 0.1316, Normal; 7/2/98: 0.6245, Ozone;
7/3/98: 0.5944, Ozone) are concatenated into a "probability-true label" file,
from which a precision-recall plot is drawn and the decision threshold VE is
selected.]

Decision threshold when P(x) is different and P(y|x) is non-deterministic

[Figure: training distribution vs. testing distribution. The threshold VE is
read off the cross-validated precision-recall plot rather than fitted to the
training distribution directly.]
Random Decision Tree

[Figure: an RDT is grown by choosing the splitting feature at each node at
random. Example with features B1: {0,1}, B2: {0,1}, B3: continuous.
B1 is chosen randomly at the root (test B1 == 0?); below it, B2 is chosen
randomly (test B2 == 0?) and B3 is chosen randomly with a random threshold
(tests B3 < 0.3? and B3 < 0.6?).]

RDT vs Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: super fast
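As a concrete illustration of the construction just described, here is a minimal sketch in Python, assuming uniformly random feature and threshold choices and binary labels; RandomTree and rdt_predict_proba are illustrative names, not the authors' implementation (which also handles categorical features, as in the B1/B2 example above).

    import numpy as np

    class RandomTree:
        """One random tree: random split features and thresholds; class
        frequencies kept at the leaves (the "node statistics")."""

        def __init__(self, max_depth=10, min_leaf=2, rng=None):
            self.max_depth, self.min_leaf = max_depth, min_leaf
            self.rng = rng or np.random.default_rng()

        def fit(self, X, y):
            self.root = self._grow(X, y, 0)
            return self

        def _grow(self, X, y, depth):
            if depth >= self.max_depth or len(y) < self.min_leaf:
                return ("leaf", y.mean())          # P(y=1 | leaf)
            f = self.rng.integers(X.shape[1])      # feature chosen at random
            t = self.rng.uniform(X[:, f].min(), X[:, f].max())  # random threshold
            left = X[:, f] < t
            if left.all() or (~left).all():
                return ("leaf", y.mean())
            return ("node", f, t,
                    self._grow(X[left], y[left], depth + 1),
                    self._grow(X[~left], y[~left], depth + 1))

        def predict_proba(self, X):
            return np.array([self._walk(self.root, x) for x in X])

        def _walk(self, node, x):
            if node[0] == "leaf":
                return node[1]
            _, f, t, left, right = node
            return self._walk(left if x[f] < t else right, x)

    def rdt_predict_proba(X_train, y_train, X_test, n_trees=30, seed=0):
        """Average the member trees' probability estimates (not their votes)."""
        rng = np.random.default_rng(seed)
        trees = [RandomTree(rng=rng).fit(X_train, y_train) for _ in range(n_trees)]
        return np.mean([t.predict_proba(X_test) for t in trees], axis=0)

Note the contrast with Random Forest in the list above: no bootstrap, no information gain, and averaged probabilities rather than votes.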
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
What is the main advantage of your approach, and how do you evaluate it?

Fast and reliable.
Compared against:
State-of-the-art data mining algorithms:
Decision tree
Naïve Bayes (NB)
Logistic Regression
SVM (linear and RBF kernels)
Boosted NB and boosted decision tree
Bagging
Random Forest
Physical equation-based model
Evaluated in an actual streaming environment on a daily basis.
What impact has been made, in particular on the real-world business?
From 4-year studies on actual data, the
proposed data mining approach consistently
outperforms the physical model-based method.
Can your approach be widely expanded to other areas, and how easy would it be?
Other known applications using the proposed approach:
Fraud Detection
Manufacturing Process Control
Congestion Prediction
Marketing
Social Tagging
The proposed method is general enough and
doesn't need any tuning or re-configuration.
Data Mining Challenges
Application: a more accurate solution to predict "ozone
days" than physical models
Interesting and difficult data mining problem:
High dimensionality, and some features could be irrelevant:
72 continuous features, only 10 verified by scientists to be relevant
Skewed class distribution: either 2% or 5% "ozone days",
depending on the "ozone day criterion" (1-hr average
peak or 8-hr average peak)
Streaming: data collected in the "past" train a model to
predict the "future"
"Feature sample selection bias": hard to find many days
in the training data that are very similar to a day in the
future
Stochastic true model: given measurable information,
sometimes the target event happens and sometimes it doesn't
Key Solution Highlights
The physical model is not known;
we do not know well which factors are really contributing.
Non-parametric models are easier to use
when the "physical or generative mechanism" is unknown.
Reliable conditional probability estimation
under "skewed, high-dimensional, possibly irrelevant features".
Estimate a decision threshold to predict on the
unknown distribution of the future.
Seriousness of Ozone Problem
Ground ozone formation is a sophisticated chemical
and physical process that is "stochastic" in nature.
Ozone levels above certain thresholds are harmful
to human health and our daily life.
Drawbacks of current ozone forecasting systems
Traditional simulation systems:
consume high computational power;
customized for a particular location, so solutions are not
portable to different places.
Regression-based methods,
e.g. regression trees, parametric regression
equations, and ANNs:
limited prediction performance.
Physical models: hard to come up with good
equations when there are many parameters,
and the equations change from place to place.
Challenges as a Data Mining Problem

1. Rather skewed and relatively sparse distribution:
3500+ examples collected over a 10-year period;
72 continuous features with missing values;
huge instance space: if the features were binary and uncorrelated,
2^72 is an astronomical number;
2% and 5% true positive ozone days for the 1-hour
and 8-hour peak criteria respectively.

2. Many factors contribute to ozone pollution; some
we know and some we do not know well.
It is suspected that the true model for ozone
days is stochastic in nature:
given all relevant features Xr,
P(Y = "ozone day"|Xr) < 1,
so predictive mistakes are inevitable.

3. A large number of unverified physical features:
only about 10 out of 72 features are verified to be relevant;
no information on the relevancy of the other 62 features.
For a stochastic problem, given irrelevant features Xir,
where X = (Xr, Xir),
P(Y|X) = P(Y|Xr) only if the data is exhaustive.
Irrelevant features may introduce overfitting and change the
probability distribution represented in the data:
P(Y = "ozone day"|Xr, Xir) → 1 and
P(Y = "normal day"|Xr, Xir) → 0.

4. "Feature sample selection bias":
given 72 continuous features, it is hard to find many
days in the training data that are very similar to a
day in the future.
Given these, two closely-related challenges:
1. How to train an accurate model
2. How to effectively use the model to predict the
future with a different and yet unknown distribution

[Figure: training distribution vs. testing distribution, with labeled regions
1, 2, 3 of positives (+) and negatives (-) shifting between the two.]
Addressing Challenges

Skewed and stochastic distribution → probability distribution estimation
Compromise between precision and recall → decision threshold determination

Parametric methods:
• Logistic Regression
• Naïve Bayes
• Kernel Methods
• Linear Regression
• RBF
• Gaussian mixture models
Highly accurate if the data is indeed generated from the model you use!
But what if you don't know which model to choose, or use the wrong one?

Non-parametric methods:
• Decision Trees
• RIPPER rule learner
• CBA: association rules
• clustering-based methods
• ……
Use a family of "free-form" functions to "match the data" through
optimization of some given "preference criteria".
Appropriate when the free-form function family and the preference
criteria are appropriate.
Reliable probability estimation under irrelevant features

Recall that due to irrelevant features:
P(Y = "ozone day"|Xr, Xir) → 1 and
P(Y = "normal day"|Xr, Xir) → 0

Solution: construct multiple models and average their predictions.
P("ozone"|Xr): true probability
P("ozone"|Xr, Xir, θ): probability estimated by model θ
MSE_SingleModel: difference between "true" and "estimated"
MSE_Average: difference between "true" and "average of many models"
Formally, one can show that MSE_Average ≤ MSE_SingleModel.
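The inequality is the standard variance decomposition of the quadratic loss. Writing P for the true probability and P_θ for one model's estimate, and treating the ensemble average as an approximation of E_θ[P_θ], a one-line derivation:

    E_θ[(P − P_θ)^2] = (P − E_θ[P_θ])^2 + Var_θ(P_θ) ≥ (P − E_θ[P_θ])^2

So the averaged model's squared error never exceeds the expected squared error of a single model drawn at random.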
A CV based procedure for decision threshold selection
(prediction with feature sample selection bias)

1. Split the training set into 10 folds.
2. For each of the 10 folds in turn, run the algorithm on the other 9
folds and record the estimated probability values P(y = "ozone day"|x, θ)
for the held-out days.
3. Concatenate the estimates and true labels from all folds into a
"probability-true label" file, for example:

7/1/98   0.1316   Normal
7/2/98   0.6245   Ozone
7/3/98   0.5944   Ozone
………

4. Draw the precision-recall plot from this file and select the decision
threshold VE on it.
Addressing Data Mining Challenges

Prediction with feature sample selection bias:
future prediction is based on the decision threshold selected.
Train the model θ on the whole training set.
Classification on future days:
if P(Y = "ozone day"|X, θ) ≥ VE, predict "ozone day".
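A minimal sketch of this CV-then-threshold procedure, assuming a scikit-learn-style classifier with predict_proba and a target recall for picking VE (e.g., 0.6, as in the system results later); select_threshold_cv is an illustrative name, not from the paper.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_recall_curve

    def select_threshold_cv(model, X, y, target_recall=0.6, n_folds=10, seed=0):
        """Estimate P(y=1|x) by cross-validation, then pick the decision
        threshold VE from the resulting precision-recall curve."""
        probs = np.zeros(len(y), dtype=float)
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train_idx, test_idx in cv.split(X, y):
            m = model.fit(X[train_idx], y[train_idx])
            probs[test_idx] = m.predict_proba(X[test_idx])[:, 1]
        precision, recall, thresholds = precision_recall_curve(y, probs)
        # precision_recall_curve returns recall in decreasing order; take the
        # highest threshold that still achieves the target recall.
        ok = recall[:-1] >= target_recall     # thresholds has one fewer entry
        ve = thresholds[ok].max() if ok.any() else thresholds.min()
        return ve

    # Usage: retrain on all past data, then predict "ozone day" for a future
    # day x whenever P(y = "ozone day"|x, theta) >= VE.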
Probabilistic Tree Models

Single tree estimators:
C4.5 (Quinlan'93)
C4.5Up, C4.5P
C4.4 (Provost'03)
Ensembles:
RDT (Fan et al'03): member trees trained randomly; probabilities averaged
Bagging Probabilistic Tree (Breiman'96): bootstrap samples; probabilities
computed per member tree (C4.5 or C4.4)
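A sketch of the bagging variant in scikit-learn terms, using unpruned trees as a stand-in for C4.4-style members (C4.4 additionally applies Laplace smoothing at the leaves, which this sketch omits); bagged_tree_proba is an illustrative name.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def bagged_tree_proba(X_train, y_train, X_test, n_trees=30, seed=0):
        """Bagging probabilistic trees: bootstrap the data, train one
        (unpruned) tree per replicate, and average the leaf probabilities."""
        rng = np.random.RandomState(seed)
        preds = []
        for _ in range(n_trees):
            Xb, yb = resample(X_train, y_train, random_state=rng)
            tree = DecisionTreeClassifier()    # unpruned by default
            tree.fit(Xb, yb)
            preds.append(tree.predict_proba(X_test)[:, 1])
        return np.mean(preds, axis=0)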
Illustration of RDT

[Figure: the same random tree construction shown earlier. Features
B1: {0,1}, B2: {0,1}, B3: continuous; at each node the splitting feature is
chosen at random, and continuous features get a random threshold
(B1 == 0?, B2 == 0?, B3 < 0.3?, B3 < 0.6?).]

RDT vs Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: super fast
Optimal Decision Boundary
from Tony Liu's thesis (supervised by Kai Ming Ting)

[Figure: decision boundaries on a synthetic target distribution, comparing
the target distribution, SVM with RBF kernel (1 day to train), a single
decision tree (5 sec to train), RDT (5 sec), and SVM with linear kernel
(overnight to train).]
Hidden Variable
Limitation of GUIDE
Need to decide grouping variables and
independent variables, which is a non-trivial task.
If all variables are categorical, GUIDE
becomes a single CART regression tree.
Strong assumptions and a greedy search can
sometimes lead to very unexpected results.
Baseline
Forecasting Parametric Model
O3 = Upwind + EmFactor × (Tmax − Tb) × SRd / (WSa^0.1 × WSp^0.5)
in which,
• O3 - Local ozone peak prediction
• Upwind - Upwind ozone background level
• EmFactor - Precursor emissions related factor
• Tmax - Maximum temperature in degrees F
• Tb - Base temperature where net ozone production begins (50 F)
• SRd - Solar radiation total for the day
• WSa - Wind speed near sunrise (using 09-12 UTC forecast mode)
• WSp - Wind speed mid-day (using 15-21 UTC forecast mode)
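A direct transcription of the formula as reconstructed above; since the exact functional form is uncertain from the source formatting, treat this as a sketch of that reconstruction rather than the operational model.

    def parametric_ozone_forecast(upwind, em_factor, t_max, sr_d,
                                  ws_a, ws_p, t_b=50.0):
        """Baseline parametric ozone forecast, as reconstructed above.

        t_b: base temperature (F) where net ozone production begins.
        """
        return upwind + em_factor * (t_max - t_b) * sr_d / (ws_a**0.1 * ws_p**0.5)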
Model evaluation criteria

Precision and recall:
at the same recall level, Ma is preferred
over Mb if the precision of Ma is
consistently higher than that of Mb.
Coverage under the PR curve, analogous to AUC.

[Figure: precision-recall curves for two models Ma and Mb.]
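One way to compute this "coverage under the PR curve" over a recall window is sketched below; trapezoidal integration and the function name pr_coverage are assumptions for illustration, since the paper's exact integration scheme is not shown here.

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def pr_coverage(y_true, y_score, recall_lo=0.4, recall_hi=0.6):
        """Area under the precision-recall curve within a recall window."""
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        # Sort by recall (precision_recall_curve returns it in decreasing order).
        order = np.argsort(recall)
        r, p = recall[order], precision[order]
        # Interpolate precision on a grid inside the window and integrate.
        grid = np.linspace(recall_lo, recall_hi, 101)
        return np.trapz(np.interp(grid, r, p), grid)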
Some Coverage Results

[Bar chart: coverage under the PR curve for the 8-hour criterion over
recall = [0.4, 0.6] (y-axis 0 to 0.09), comparing BC4.4, RDT, C4.4, and the
parametric baseline (Para).]
System Results

Annual test:
1. Previous years' data for training
2. Next year for testing
3. Repeated 6 times using 7 years of data
8-hour criterion: thresholds selected at recall = 0.6
1-hour criterion: thresholds selected at recall = 0.6

[Bar charts: recall and precision (y-axis 0 to 0.7) of BC4.4, RDT, C4.4,
and Para.]

1. C4.4 is the best among single trees.
2. BC4.4 and RDT are the best among tree ensembles.
3. BC4.4 and RDT are more accurate than the baseline Para.
4. BC4.4 and RDT show "less surprise" than a single tree.
[Figure: SVM, 1-hr criterion, CV results]
[Figure: AdaBoost, 1-hr criterion, CV results]
Intuition

The true distribution P(y|X) is never known.
Is it an elephant?
No single random tree is a random guess of
P(y|X): its structure is random, but not its "node statistics".
Each tree looks at the elephant from a different angle.
Every tree is consistent with the training data.
Each tree is quite strong.
Expected Error Reduction

Quadratic loss:
for probability estimation: (P(y|X) − P(y|X, θ))^2
for regression problems: (y − f(x))^2
Theorem 1:
the "expected quadratic loss" of RDT is less than that of any
single model chosen at random.
Bias and Variance Reduction
Summary
When a physical model is hard to build, data mining is one of the top
choices.
Procedures to formulate the task as a data mining problem:
How to collect data
Analysis of the combination of technical challenges:
skewed problem, stochastic problem,
many features, a lot of irrelevant features,
sample selection bias, feature sample selection bias
Process to search for the most suitable solutions:
Model averaging of probability estimators can effectively
approximate the true probability
A CV based guide for decision threshold determination for stochastic
problems under sample selection bias
Random Decision Tree (Fan et al'03)
ICDM'06 Best Application Award
ICDM'08 Data Mining Contest Championship
Thank you!
Questions?