ICDM10Ozone-5minsver2
Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Wei Fan, Kun Zhang, and Xiaojing Yuan
Presenter: Prof. Longbin Cao
[Figure: precision-recall plot (Recall vs. Precision, both axes from 0.0 to 1.0) with labels Ma, Mb, and VE]
What is the business problem, and what are the broad-based areas?
Problem: ozone pollution day detection
Ground-level ozone formation is a sophisticated chemical and physical process and is
“stochastic” in nature.
Ozone level above some threshold is rather harmful to human health and
our daily life.
8-hour peak and 1-hour peak standards.
8-hour average > 80 ppb (parts per billion)
1-hour average > 120 ppb
It happens from 5 to 15 days per year.
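The two standards above can be checked mechanically from one day of readings. The following is a minimal sketch, assuming the input is a list of hourly average ozone concentrations in parts per billion and that the 8-hour peak is the largest running 8-hour mean within the day; the function names are hypothetical and not from the original work.

```python
from typing import Dict, Sequence

ONE_HOUR_LIMIT_PPB = 120.0    # 1-hour peak standard quoted on the slide
EIGHT_HOUR_LIMIT_PPB = 80.0   # 8-hour peak standard quoted on the slide


def peak_running_mean(values: Sequence[float], window: int) -> float:
    """Largest mean over any `window` consecutive readings."""
    if len(values) < window:
        raise ValueError("need at least `window` readings")
    return max(
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    )


def is_ozone_day(hourly_ppb: Sequence[float]) -> Dict[str, bool]:
    """Flag a day under each standard (assumed interpretation, see lead-in)."""
    one_hour_peak = max(hourly_ppb)
    eight_hour_peak = peak_running_mean(hourly_ppb, 8)
    return {
        "one_hour": one_hour_peak > ONE_HOUR_LIMIT_PPB,
        "eight_hour": eight_hour_peak > EIGHT_HOUR_LIMIT_PPB,
    }
```

A day can exceed one standard without exceeding the other, which is why the sketch reports the two flags separately.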
Broad area: Environmental Pollution Detection and Protection
Drawback of alternative approaches
Simulation: consumes high computational power; customized for a
particular location, so solutions are not portable to different places
Physical model approach: hard to come up with good equations
when there are many parameters, and the equations change from place to place
What are the research challenges that
cannot be handled by the state-of-the-art?
Dataset is sparse, skewed, stochastic, biased and
streaming at the same time.
High dimensional
Very few positives
Under similar conditions: sometimes it happens and
sometimes it doesn’t
P(x) difference between training and testing
Training data from past, predicting the future
Physical model is not well understood and cannot
be customized easily from location to location
What is the main idea of your
approach?
Non-parametric models are easier to use when “physical or
generative mechanism” is unknown.
Reliable “conditional probabilities” estimation under “skewed,
biased, high-dimensional, possibly irrelevant features”
Estimate “decision threshold” to predict on the unknown
distribution of the future
Random Decision Tree
Super fast implementation
Formal Analysis:
Bound analysis
MSE reduction
Bias and bias reduction
P(y|x) order correctness proof
A CV-based procedure for decision threshold selection
[Figure: Training Set -> Algorithm -> 10-fold cross-validation (fold 1, fold 2, ..., fold 10), each fold producing estimated probability values P(y=“ozoneday”|x,θ) -> “Probability-TrueLabel” file -> PrecRec plot (Recall vs. Precision, labels Ma, Mb) -> decision threshold VE]
“Probability-TrueLabel” file:
Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/2/98    0.6245                Ozone
7/3/98    0.5944                Ozone
………
Decision Threshold when P(x) is different
and P(y|x) is non-deterministic
[Figure: label distribution of positive (+) and negative (-) examples in regions 1, 2, and 3 under the training distribution and the testing distribution]
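The CV-based threshold selection can be sketched as follows: collect out-of-fold (probability, true label) pairs, compute precision and recall at each candidate threshold, and pick one. This is only an illustration under assumptions. The `fit` callable is hypothetical, folds are assigned by index modulo k, and the selection rule (highest precision among thresholds that reach a minimum recall) is an assumed stand-in, since the slides do not state how the point on the precision-recall plot is chosen.

```python
from typing import Callable, List, Sequence, Tuple

# (estimated P(y="ozoneday"|x), true label: 1 = ozone day, 0 = normal day)
Pair = Tuple[float, int]


def cross_val_pairs(X: Sequence, y: Sequence[int],
                    fit: Callable, k: int = 10) -> List[Pair]:
    """Collect out-of-fold (probability, true label) pairs from k-fold CV.

    `fit(X_train, y_train)` is a hypothetical callable returning a function
    that maps one example x to an estimated probability.
    """
    pairs: List[Pair] = []
    n = len(X)
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        pairs.extend((predict(X[i]), y[i]) for i in test_idx)
    return pairs


def precision_recall(pairs: Sequence[Pair], threshold: float) -> Tuple[float, float]:
    """Precision and recall when predicting "ozone" for p >= threshold."""
    tp = sum(1 for p, label in pairs if p >= threshold and label == 1)
    fp = sum(1 for p, label in pairs if p >= threshold and label == 0)
    fn = sum(1 for p, label in pairs if p < threshold and label == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_threshold(pairs: Sequence[Pair], min_recall: float = 0.6):
    """Assumed rule: most precise threshold among those reaching `min_recall`."""
    best = None
    for t in sorted({p for p, _ in pairs}):
        precision, recall = precision_recall(pairs, t)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    if best is None:
        raise ValueError("no threshold reaches the requested recall")
    return best


# The three rows shown in the "Probability-TrueLabel" file on the slide.
rows = [(0.1316, 0), (0.6245, 1), (0.5944, 1)]
threshold, precision, recall = select_threshold(rows)
```

On those three rows the sketch selects 0.5944, the lowest threshold that still separates the two ozone days from the normal day.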
Random Decision Tree
[Figure: example tree over features B1: {0,1}, B2: {0,1}, B3: continuous. B1 is chosen randomly (test: B1 == 0?); B2 is chosen randomly (test: B2 == 0?); B3 is chosen randomly with random threshold 0.3 (test: B3 < 0.3?) and chosen again with random threshold 0.6 (test: B3 < 0.6?). Branches are labeled Y / N.]
RDT vs Random Forest
1. Original Data vs Bootstrap
2. Random pick vs. Random Subset + info gain
3. Probability Averaging vs. Voting
4. RDT: superfast
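The ideas named on this slide (features picked at random, random thresholds for continuous features, every tree trained on the original data, and probability averaging across trees) can be sketched as below. This is an illustration, not the authors' implementation. The depth limit, the assumption that continuous features are scaled to [0, 1], the rule that a binary feature is used at most once per path, and the 0.5 fallback are all choices made for the sketch.

```python
import random


class Node:
    def __init__(self):
        self.feature = None    # index of the feature tested here (None for a leaf)
        self.threshold = None  # random cut point, set only for continuous features
        self.left = None       # branch taken when the test is true
        self.right = None      # branch taken when the test is false
        self.counts = [0, 0]   # [negatives, positives] seen at this node


def build(kinds, depth, rng, used=frozenset()):
    """Grow one tree structure without looking at any labels.

    kinds[i] is "binary" or "continuous" (continuous assumed scaled to [0, 1]).
    """
    node = Node()
    candidates = [i for i, kind in enumerate(kinds)
                  if kind == "continuous" or i not in used]
    if depth == 0 or not candidates:
        return node
    node.feature = rng.choice(candidates)       # random pick, no info gain
    if kinds[node.feature] == "continuous":
        node.threshold = rng.random()           # random threshold
        child_used = used                       # may be picked again deeper
    else:
        child_used = used | {node.feature}
    node.left = build(kinds, depth - 1, rng, child_used)
    node.right = build(kinds, depth - 1, rng, child_used)
    return node


def goes_left(node, x):
    if node.threshold is None:
        return x[node.feature] == 0
    return x[node.feature] < node.threshold


def update(node, x, y):
    """Pass one training example down the tree, counting its label on the way."""
    while True:
        node.counts[y] += 1
        if node.feature is None:
            return
        node = node.left if goes_left(node, x) else node.right


def tree_probability(node, x):
    """P(y=1|x) from the deepest non-empty node on the path of x."""
    best = node.counts
    while node.feature is not None:
        node = node.left if goes_left(node, x) else node.right
        if sum(node.counts) == 0:
            break
        best = node.counts
    total = sum(best)
    return best[1] / total if total else 0.5


def train_rdt(X, y, kinds, n_trees=10, depth=3, seed=0):
    rng = random.Random(seed)
    trees = [build(kinds, depth, rng) for _ in range(n_trees)]
    for tree in trees:                  # original data, no bootstrap sampling
        for xi, yi in zip(X, y):
            update(tree, xi, yi)
    return trees


def predict_proba(trees, x):
    """Average the per-tree probabilities instead of voting."""
    return sum(tree_probability(t, x) for t in trees) / len(trees)
```

Because the tree structure never depends on the labels, building the trees is cheap; training reduces to one pass of the data through each tree to fill in the counts.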
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
What is the main advantage of your
approach, and how do you evaluate it?
Fast and Reliable
Compare with
State-of-the-art data mining algorithms:
Decision tree
NB
Logistic Regression
SVM (linear and RBF kernel)
Boosted NB and Decision Tree
Bagging
Random Forest
Physical Equation-based Model
Evaluated in an actual streaming environment on a daily basis
What impact has been made, in particular
in changing the real-world business?
In 4-year studies on actual data, the
proposed data mining approach consistently
outperforms the physical model-based method
Can your approach be widely expanded to
other areas, and how easy would it be?
Other known applications using the proposed
approach
Fraud Detection
Manufacturing Process Control
Congestion Prediction
Marketing
Social Tagging
The proposed method is general enough and
doesn’t need any tuning or re-configuration