Transcript PowerPoint

Forecasting Skewed Biased Stochastic Ozone Days:
Analyses and Solutions

Kun Zhang, Wei Fan, Xiaojing Yuan,
Ian Davidson, and Xiangshang Li

[Figure: precision-recall curves of two models, Ma and Mb, with decision threshold VE; x-axis Recall, y-axis Precision, both from 0.0 to 1.0]
What this Paper Offers


• Application: a more accurate solution (higher recall and precision) for predicting “ozone days”
• An interesting and difficult data mining problem:
  - High dimensionality, with possibly irrelevant features: 72 continuous features, of which 10 are verified by scientists to be relevant
  - Skewed class distribution: either 2% or 5% “ozone days”, depending on the “ozone day criteria” (1-hr peak or 8-hr peak)
  - Streaming: data collected in the “past” is used to train a model that predicts the “future”
  - “Feature sample selection bias”: it is hard to find many days in the training data that are very similar to a day in the future
  - Stochastic true model: given the measurable information, the target event sometimes happens and sometimes does not
Key Solution Highlights



• Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
• Reliable estimation of conditional probabilities under “skewed, high-dimensional, possibly irrelevant features”, …
• Estimate a decision threshold to predict under the unknown distribution of the future.
Seriousness of Ozone Problem


• Ground-level ozone formation is a complex chemical and physical process that is “stochastic” in nature.
• Ozone levels above certain thresholds are harmful to human health and daily life.
Drawbacks of current ozone
forecasting systems

• Traditional simulation systems
  - Consume high computational power
  - Customized for a particular location, so solutions are not portable to different places
• Regression-based methods
  - E.g., regression trees, parametric regression equations, and ANN
  - Limited prediction performance
Ozone Level Prediction:
Problems we are facing

Daily summary maps of two datasets from
Texas Commission on Environmental Quality
(TCEQ)
Challenges as a Data Mining Problem
1. Rather skewed and relatively sparse distribution
  - 2500+ examples over 7 years (1998-2004)
  - 72 continuous features with missing values
  - Huge instance space: even if the features were binary and uncorrelated, 2^72 would be an astronomical number
  - 2% and 5% true positive ozone days for the 1-hour and 8-hour peak criteria, respectively
2. The true model for ozone days is stochastic in nature.
  - Given all relevant features Xr, P(Y = “ozone day” | Xr) < 1
  - Predictive mistakes are inevitable
3. A large number of irrelevant features
  - Only about 10 out of 72 features are verified to be relevant; there is no information on the relevancy of the other 62 features
  - For a stochastic problem, given irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive.
  - Otherwise, irrelevant features may introduce overfitting and change the probability distribution represented in the data: P(Y = “ozone day” | Xr, Xir) and P(Y = “normal day” | Xr, Xir) can be driven to 1 or 0.
4. “Feature sample selection bias”
  - Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future
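To put the “huge instance space” and the difficulty of finding similar days into numbers, here is a small arithmetic sketch. It follows the slide's thought experiment of treating all 72 features as binary and uncorrelated; the round figure of 2500 examples stands in for the “2500+” quoted above.

```python
# Illustrative arithmetic only: treats each of the 72 features as binary,
# as in the slide's thought experiment.
N_FEATURES = 72
N_EXAMPLES = 2500  # "2500+ examples" on the slide; 2500 is used as a round figure

instance_space = 2 ** N_FEATURES
coverage = N_EXAMPLES / instance_space

print(f"distinct binary feature vectors: {instance_space:,}")
print(f"fraction covered by the data:    {coverage:.3e}")
```

Even under this simplification the training data covers a vanishing fraction of the space, so an exact or near match for a future day is unlikely.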
Given these, 2 closely-related challenges
1. How to train an accurate model
2. How to effectively use a model to predict the future with a different and yet unknown distribution
[Figure: Training Distribution vs. Testing Distribution]
Addressing Challenges

Skewed and stochastic distribution
• Probability distribution estimation
  - Parametric methods
    · List of methods: Logistic Regression, Naïve Bayes, Kernel Methods, Linear Regression, RBF, Gaussian mixture models
    · Highly accurate if the data is indeed generated from the model you use!
    · But what if you don’t know which to choose, or use the wrong one?
  - Non-parametric methods
    · List of methods: Decision Trees, RIPPER rule learner, CBA (association rule), clustering-based methods, ……
    · Use a family of “free-form” functions to “match the data” given some “preference criteria”
    · Suitable when the free-form function is appropriate and the preference criteria are appropriate
• Decision threshold determination through optimization of some given criteria
  - Compromise between precision and recall

[Figure: precision-recall curves of Ma and Mb with decision threshold VE; x-axis Recall, y-axis Precision, both from 0.0 to 1.0]
Reliable probability estimation
under irrelevant features
Recall that, due to irrelevant features, P(Y = “ozone day” | Xr, Xir) and P(Y = “normal day” | Xr, Xir) can be driven to 1 or 0.
• Construct multiple models
• Average their predictions
  - P(“ozone” | xr): true probability
  - P(“ozone” | Xr, Xir, θ): estimated probability by model θ
  - MSE_SingleModel: difference between “true” and “estimated”
  - MSE_Average: difference between “true” and “average of many models”
• Formally show that MSE_Average ≤ MSE_SingleModel
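The inequality can be checked numerically with a small simulation. This is only a sketch under stated assumptions: the “true” probabilities are synthetic, each “model” is a hypothetical stand-in that adds Gaussian noise to the truth, and the single-model MSE is taken as the mean MSE over the individual models. For every example the square of a mean never exceeds the mean of the squares, so the averaged predictor cannot do worse than that mean.

```python
import random

random.seed(0)

N_EXAMPLES = 200
N_MODELS = 30

# Hypothetical "true" probabilities P("ozone" | x_r), one per example.
true_p = [random.random() for _ in range(N_EXAMPLES)]

def noisy_estimate(p):
    """Stand-in for one model's estimate: the truth plus noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, p + random.gauss(0.0, 0.3)))

# estimates[k][i] is model k's estimate for example i.
estimates = [[noisy_estimate(p) for p in true_p] for _ in range(N_MODELS)]

def mse(pred, truth):
    """Mean squared difference between predicted and true probabilities."""
    return sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(truth)

# Expected MSE of a single model, estimated as the mean over the individual models.
mse_single = sum(mse(est, true_p) for est in estimates) / N_MODELS

# MSE of the averaged prediction.
averaged = [sum(est[i] for est in estimates) / N_MODELS for i in range(N_EXAMPLES)]
mse_average = mse(averaged, true_p)

print(f"mean single-model MSE: {mse_single:.4f}")
print(f"averaged-model MSE:    {mse_average:.4f}")
```

The gap between the two numbers grows with the disagreement among the models, which is why averaging helps most when individual estimates are noisy.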

A CV based procedure for decision threshold selection
(Prediction with feature sample selection bias)

Training Set → 10-fold cross-validation (1 fold, 2 fold, ….., 10 fold) → Algorithm → estimated probability values

The estimated probability values from all folds are collected into a “Probability-TrueLabel” file:

Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/2/98    0.6245                Ozone
7/3/98    0.5944                Ozone
………

“Probability-TrueLabel” file → PrecRec plot → decision threshold VE

[Figure: precision-recall curves of Ma and Mb with decision threshold VE; x-axis Recall, y-axis Precision, both from 0.0 to 1.0; Training Distribution and Testing Distribution panels]
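The procedure can be sketched roughly as follows. Everything here is an illustrative assumption rather than the authors' implementation: the data is synthetic, `fit_binned` is a hypothetical stand-in for the probability estimator, and the preference criterion (the highest threshold whose cross-validated recall reaches a target) is just one possible choice.

```python
import random

def fit_binned(train, n_bins=5):
    """Stand-in probability estimator (an assumption, not the paper's model):
    per-bin frequency of the positive class on one feature in [0, 1)."""
    pos = [0] * n_bins
    tot = [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        tot[b] += 1
        pos[b] += y
    prior = sum(pos) / max(1, sum(tot))
    return [pos[b] / tot[b] if tot[b] else prior for b in range(n_bins)]

def predict_proba(model, x):
    return model[min(int(x * len(model)), len(model) - 1)]

def cv_probabilities(data, k=10):
    """Return pooled (estimated probability, true label) pairs from k-fold CV."""
    pooled = []
    for fold in range(k):
        train = [d for i, d in enumerate(data) if i % k != fold]
        held_out = [d for i, d in enumerate(data) if i % k == fold]
        model = fit_binned(train)
        pooled += [(predict_proba(model, x), y) for x, y in held_out]
    return pooled

def select_threshold(pooled, target_recall=0.6):
    """Highest threshold whose recall on the pooled CV estimates reaches the target."""
    n_pos = sum(y for _, y in pooled)
    for t in sorted({p for p, _ in pooled}, reverse=True):
        tp = sum(1 for p, y in pooled if p >= t and y == 1)
        if n_pos and tp / n_pos >= target_recall:
            return t
    return 0.0

random.seed(1)
# Synthetic skewed, stochastic data: P(y = 1 | x) = x ** 4 (about 20% positives).
data = []
for _ in range(1000):
    x = random.random()
    data.append((x, 1 if random.random() < x ** 4 else 0))

pooled = cv_probabilities(data, k=10)
ve = select_threshold(pooled, target_recall=0.6)
print("selected decision threshold V_E:", round(ve, 4))
```

Because every probability in `pooled` comes from a model that never saw that example, the selected threshold reflects how the estimator behaves on unseen days rather than on its own training data.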
Addressing Data Mining Challenges

Prediction with feature sample selection bias
• Future prediction based on the decision threshold selected
  - Whole Training Set → model θ
  - Classification on future days: if P(Y = “ozonedays” | X, θ) ≥ VE, predict “ozonedays”
Probabilistic Tree Models
• Single tree estimators
  - C4.5 (Quinlan’93)
  - C4.5Up, C4.5P
  - C4.4 (Provost’03)
• Ensembles
  - RDT (Fan et al’03)
    · Member tree trained randomly
    · Average probability
  - Bagging Probabilistic Tree (Breiman’96)
    · Bootstrap
    · Compute probability
    · Member tree: C4.5, C4.4

RDT: Random Decision Tree (Fan et al’03)
• “Encoding data” in trees.
• At each node, an un-used feature is chosen randomly
  - A discrete feature is un-used if it has never been chosen previously on a given decision path starting from the root to the current node.
  - A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
• Stop when one of the following happens:
  - A node becomes too small (<= 3 examples).
  - Or the total height of the tree exceeds some limits.
• Different from Random Forest:
  1. Original Data vs. Bootstrap
  2. Random pick vs. Random Subset + info gain
  3. Probability Averaging vs. Voting
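The random decision tree rules described on these slides can be sketched as below. This is a simplified illustration, not the authors' code: drawing a continuous threshold from an observed value, the dictionary-based node layout, and the defaults `n_trees=30` and `max_depth=6` are all assumptions made for the example, and the data is synthetic.

```python
import random

MIN_NODE_SIZE = 3  # stop when a node holds <= 3 examples, as on the slide

def build_tree(rows, labels, is_discrete, max_depth, rng, depth=0, used=frozenset()):
    """Grow one random tree. No information gain is used: the split feature
    (and the threshold, for continuous features) is picked at random."""
    node = {"pos": sum(labels), "n": len(labels)}
    # Discrete features already used on this path are excluded; continuous ones never are.
    candidates = [f for f in range(len(is_discrete)) if f not in used]
    if len(rows) <= MIN_NODE_SIZE or depth >= max_depth or not candidates:
        return node
    f = rng.choice(candidates)
    node["feature"] = f
    if is_discrete[f]:
        node["children"] = {}
        for value in set(r[f] for r in rows):
            idx = [i for i, r in enumerate(rows) if r[f] == value]
            node["children"][value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx],
                is_discrete, max_depth, rng, depth + 1, used | {f})
    else:
        # Simplification (assumption): draw the threshold from an observed value.
        node["threshold"] = rng.choice(rows)[f]
        left = [i for i, r in enumerate(rows) if r[f] <= node["threshold"]]
        right = [i for i, r in enumerate(rows) if r[f] > node["threshold"]]
        if not left or not right:
            return {"pos": node["pos"], "n": node["n"]}  # degenerate split: keep as leaf
        node["children"] = {
            side: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                             is_discrete, max_depth, rng, depth + 1, used)
            for side, idx in (("le", left), ("gt", right))}
    return node

def tree_proba(node, row):
    """Positive-class frequency at the deepest node the row reaches."""
    while "children" in node:
        f = node["feature"]
        if "threshold" in node:
            key = "le" if row[f] <= node["threshold"] else "gt"
        else:
            key = row[f]
        child = node["children"].get(key)
        if child is None:  # unseen discrete value: fall back to this node
            break
        node = child
    return node["pos"] / node["n"]

def rdt_fit(rows, labels, is_discrete, n_trees=30, max_depth=6, seed=0):
    rng = random.Random(seed)
    # Every tree sees the original data (no bootstrap), unlike Random Forest.
    return [build_tree(rows, labels, is_discrete, max_depth, rng) for _ in range(n_trees)]

def rdt_proba(trees, row):
    """Average the member trees' probabilities (no voting)."""
    return sum(tree_proba(t, row) for t in trees) / len(trees)

rng = random.Random(42)
rows, labels = [], []
for _ in range(400):
    x0 = rng.random()                 # relevant continuous feature
    x1 = rng.random()                 # irrelevant continuous feature
    x2 = rng.choice(["a", "b", "c"])  # irrelevant discrete feature
    rows.append([x0, x1, x2])
    labels.append(1 if rng.random() < x0 ** 2 else 0)  # stochastic label

trees = rdt_fit(rows, labels, is_discrete=[False, False, True])
p_low = rdt_proba(trees, [0.1, 0.5, "a"])
p_high = rdt_proba(trees, [0.95, 0.5, "a"])
print(f"P(y=1 | x0=0.10) ~ {p_low:.2f}, P(y=1 | x0=0.95) ~ {p_high:.2f}")
```

The three contrasts with Random Forest listed on the slide show up directly: each tree is grown on the original data, the split feature is a random pick with no gain computation, and the ensemble output is an averaged probability rather than a vote.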
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
Baseline Forecasting Parametric Model
O3 = Upwind + EmFactor × (Tmax − Tb) × SRd / (WSa × 0.1 + WSp × 0.5 + 1)
in which,
• O3 - Local ozone peak prediction
• Upwind - Upwind ozone background level
• EmFactor - Precursor emissions related factor
• Tmax - Maximum temperature in degrees F
• Tb - Base temperature where net ozone production begins (50 F)
• SRd - Solar radiation total for the day
• WSa - Wind speed near sunrise (using 09-12 UTC forecast mode)
• WSp - Wind speed mid-day (using 15-21 UTC forecast mode)
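As a worked example of how the baseline combines these quantities, the sketch below evaluates it for one made-up day. It assumes the equation reads O3 = Upwind + EmFactor × (Tmax − Tb) × SRd / (WSa × 0.1 + WSp × 0.5 + 1); the operators are a reconstruction, since they did not survive in the slide text. The function name, argument names, and input values are hypothetical.

```python
def baseline_ozone_peak(upwind, em_factor, t_max, sr_d, ws_a, ws_p, t_b=50.0):
    """Parametric baseline with assumed operators:
    O3 = Upwind + EmFactor * (Tmax - Tb) * SRd / (WSa * 0.1 + WSp * 0.5 + 1)."""
    return upwind + em_factor * (t_max - t_b) * sr_d / (ws_a * 0.1 + ws_p * 0.5 + 1.0)

# Made-up inputs, chosen only so the arithmetic is easy to follow:
# numerator 0.5 * (90 - 50) * 2 = 40, denominator 1 + 2 + 1 = 4, so O3 = 40 + 10.
o3 = baseline_ozone_peak(upwind=40.0, em_factor=0.5, t_max=90.0,
                         sr_d=2.0, ws_a=10.0, ws_p=4.0)
print(o3)
```

Under this reading, higher wind speeds enlarge the denominator and lower the predicted peak, while temperature above the 50 F base and solar radiation raise it.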
Model evaluation criteria
Precision and Recall

At the same recall level, Ma is preferred
over Mb if the precision of Ma is
consistently higher than that of Mb
Coverage under PR curve, like AUC
[Figure: precision-recall curves of Ma and Mb; x-axis Recall, y-axis Precision, both from 0.0 to 1.0]
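The comparison of Ma and Mb by coverage under the PR curve can be sketched as follows. The scores and labels are made up, ties between scores are ignored, and the trapezoidal rule is one assumed way of measuring the area; the slides do not say how coverage was computed.

```python
def pr_curve(scores, labels):
    """(recall, precision) points obtained by lowering the threshold one example at a time."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    points, tp = [], 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / n_pos, tp / k))
    return points

def coverage(points, lo=0.0, hi=1.0):
    """Trapezoidal area under the PR points whose recall lies in [lo, hi]
    (no interpolation at the range boundaries)."""
    pts = [(r, p) for r, p in points if lo <= r <= hi]
    return sum((r2 - r1) * (p1 + p2) / 2 for (r1, p1), (r2, p2) in zip(pts, pts[1:]))

# Made-up labels and scores: Ma ranks every positive above every negative, Mb does not.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores_a = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4]
scores_b = [0.9, 0.8, 0.3, 0.7, 0.6, 0.1, 0.2, 0.4]

cov_a = coverage(pr_curve(scores_a, labels))
cov_b = coverage(pr_curve(scores_b, labels))
print(f"coverage Ma = {cov_a:.3f}, coverage Mb = {cov_b:.3f}")
```

The `lo` and `hi` arguments restrict the comparison to a recall range of interest, which matters when only part of the curve corresponds to usable operating points.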
Some Coverage Results

[Figure: bar chart of coverage under the PR curve for the 8-hour criterion, recall = [0.4, 0.6]; models BC4.4, RDT, Para, and C4.4; y-axis from 0 to 0.09]
Some “Action” Results

Annual test
1. Previous years’ data for training
2. Next year for testing
3. Repeated 6 times using 7 years of data

• 8-hour: thresholds selected at the recall = 0.6
• 1-hour: thresholds selected at the recall = 0.6

[Figure: bar charts of Recall and Precision for BC4.4, RDT, C4.4, and Para under each criterion]
1. C4.4 best among single trees
2. BC4.4 and RDT best among tree ensembles
1. BC4.4 and RDT more accurate than baseline Para
2. BC4.4 and RDT “less surprise” than single tree
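The annual test protocol behind these results can be sketched as a walk-forward split. One assumption is made loudly here: “previous years’ data” is read as all earlier years (an expanding window), which the slides do not state explicitly. The function name `annual_splits` is hypothetical.

```python
YEARS = list(range(1998, 2005))  # 1998-2004, the 7 years covered by the data

def annual_splits(years):
    """Yield (training_years, test_year): train on all earlier years, test on the next."""
    for i in range(1, len(years)):
        yield years[:i], years[i]

splits = list(annual_splits(YEARS))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

With 7 years this produces 6 train/test rounds, matching the “repeated 6 times” on the slide, and no round ever trains on data from the year it is tested on.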
Summary




• Procedures to formulate ozone forecasting as a data mining problem
• Analysis of the combination of technical challenges
• Process to search for the most suitable solutions
• Model averaging of probability estimators can effectively approximate the true probability under:
  - A lot of irrelevant features
  - Feature sample selection bias
• A CV based guide for decision threshold determination for stochastic problems under sample selection bias
Choosing the Appropriate PET
come to our other talk 10:30 RM 402
[Flowchart: choosing a probability estimation tree (PET). Given a dataset, estimate signal-noise separability through RDT or BPET and read the AUC score. AUC < 0.9 indicates low signal-noise separability; AUC >= 0.9 indicates high signal-noise separability. Each branch then asks “ensemble or single trees?” and which metric matters (AUC, MSE, ErrorRate). On the ensemble branch that considers feature types and value characteristics: continuous features, or categorical features with a large number of values → RDT (BPET); categorical features with limited values → BPET. Other leaves shown: RDT, CFT, and C4.5 or C4.4.]
Thank you!
Questions?