Transcript: Final Presentation

An Evaluation of a Commercial Data Mining Suite: Oracle Data Mining
Presented by Emily Davis
Supervisor: John Ebden
Oracle Data Mining: An Investigation
Emily Davis

Investigating the data mining tools and software available with Oracle9i.

Using Oracle Data Mining and JDeveloper (Java API) to run the algorithms in the data mining suite (Naive Bayes and Adaptive Bayes Network) on sample data.

An evaluation of results using confusion matrices, lift charts and error rates; a comparison of the effectiveness of the different algorithms.

Tools: Oracle Data Mining, DM4J and JDeveloper
Supervisor: John Ebden
Contact: [email protected]
Visit: http://www.cs.ru.ac.za/research/students/g01D1801/
Example confusion matrix for Model A:

                 Model Accept   Model Reject
Actual Accept         600             25
Actual Reject          75            300
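As a minimal sketch of how such a matrix is read (plain Java, not the Oracle Data Mining API), using the counts above:

    public class ConfusionMatrixExample {
        public static void main(String[] args) {
            // Counts from the confusion matrix for Model A above.
            int acceptAccept = 600; // actual Accept, predicted Accept
            int acceptReject = 25;  // actual Accept, predicted Reject (false negative)
            int rejectAccept = 75;  // actual Reject, predicted Accept (false positive)
            int rejectReject = 300; // actual Reject, predicted Reject

            int total = acceptAccept + acceptReject + rejectAccept + rejectReject;
            double accuracy  = (acceptAccept + rejectReject) / (double) total;
            double errorRate = (acceptReject + rejectAccept) / (double) total;

            System.out.printf("Accuracy:   %.1f%%%n", accuracy * 100);  // 90.0%
            System.out.printf("Error rate: %.1f%%%n", errorRate * 100); // 10.0%
        }
    }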
Problem Statement

To determine how Oracle provides data mining functionality:
- Ease of use
- Data preparation
- Model building
- Model testing
- Applying models to new data
Problem Statement

To determine whether the algorithms used would find a pattern in a data set:
- What happened when the models were applied to a new data set

To determine which algorithm built the most effective model, and under what circumstances
Problem Statement

To determine how models are tested, and whether this indicates how they will perform when applied to new data

To determine how the data affected the model building and how the test data affected the model testing
Methodology

Two classification algorithms selected:
- Naïve Bayes
- Adaptive Bayes Network

Both produce predictions, which could then be compared.
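For background on the first algorithm, here is a minimal sketch of Naïve Bayes for categorical attributes in plain Java (an illustration of the general technique, not Oracle's implementation; the class and method names are invented for the example):

    import java.util.*;

    // Illustrative Naive Bayes: count how often each (attribute, value)
    // pair occurs with each class, then predict the class with the highest
    // log-probability, using Laplace smoothing for unseen values.
    public class TinyNaiveBayes {
        private final Map<String, Integer> classCounts = new HashMap<>();
        private final Map<String, Integer> valueCounts = new HashMap<>(); // "class|attr|value"
        private int total = 0;

        public void train(String[] values, String label) {
            classCounts.merge(label, 1, Integer::sum);
            for (int i = 0; i < values.length; i++)
                valueCounts.merge(label + "|" + i + "|" + values[i], 1, Integer::sum);
            total++;
        }

        public String predict(String[] values) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Integer> c : classCounts.entrySet()) {
                double score = Math.log(c.getValue() / (double) total); // log P(class)
                for (int i = 0; i < values.length; i++) {
                    int n = valueCounts.getOrDefault(c.getKey() + "|" + i + "|" + values[i], 0);
                    score += Math.log((n + 1.0) / (c.getValue() + 2.0)); // log P(value|class)
                }
                if (score > bestScore) { bestScore = score; best = c.getKey(); }
            }
            return best;
        }
    }

Trained on rows of binned weather attributes with RAIN as the label, predict() returns "yes" or "no" for an unseen row; the Adaptive Bayes Network is a separate Oracle algorithm that also reports human-readable rules (see later slides).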
Methodology

Weather data from http://www.ru.ac.za/weather/

Data recorded includes:
- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees, 360 = North, 90 = East)
- Wind Speed (MPH)
- High Wind Speed (MPH)
- Solar Radiation (Watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature)
Data

Rainfall reading removed and replaced with a "yes" or "no" depending on whether rainfall was recorded.
- This variable, RAIN, was chosen as the target variable.

Two data sets put into tables in the database:
- WEATHER_BUILD: 2601 records; used to create the build and test data with the Transformation Split wizard
- WEATHER_APPLY: 290 records; used to validate the models
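A sketch in plain Java of the target derivation described above (the record type and field names are invented for illustration; they are not the actual WEATHER_BUILD schema):

    // Hypothetical shape of one weather reading; field names are assumptions.
    record WeatherReading(double temperatureF, double humidityPct,
                          double barometerInHg, double windDirectionDeg,
                          double windSpeedMph, double rainfallInches) {}

    // RAIN target: "yes" if any rainfall was recorded for the reading.
    static String rainTarget(WeatherReading r) {
        return r.rainfallInches() > 0.0 ? "yes" : "no";
    }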
Building and Testing the Models

- The Priors technique
- Training and tuning the models
- The models built
- Testing results
Data Preparation Techniques: Priors

[Figure: "Histogram for: RAIN" plotting bin count against bin range for the "yes" and "no" bins, showing the skewed target distribution in the build data.]

Priors

[Figure: "Histogram for: RAIN" before and after stratified sampling; after sampling, the "yes" and "no" bins hold equal counts.]
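A sketch of the stratified sampling idea behind the Priors technique, in plain Java (an assumed implementation of the concept, not ODM's transformation wizard): downsample the majority class so both RAIN values are equally represented in the build data.

    import java.util.*;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class StratifiedSample {
        // Keep an equal number of rows per target value by downsampling
        // every class to the size of the smallest one.
        static <T> List<T> stratify(List<T> rows, Function<T, String> target, Random rnd) {
            Map<String, List<T>> byClass =
                rows.stream().collect(Collectors.groupingBy(target));
            int smallest = byClass.values().stream().mapToInt(List::size).min().orElse(0);
            List<T> sample = new ArrayList<>();
            for (List<T> group : byClass.values()) {
                Collections.shuffle(group, rnd);           // random subset of each class
                sample.addAll(group.subList(0, smallest));
            }
            Collections.shuffle(sample, rnd);
            return sample;
        }
    }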
Training and Tuning the Models

              Predicted No   Predicted Yes
Actual No          384             34
Actual Yes         141             74
Training and Tuning the Models

Viable to introduce a weighting of 3 against false negatives:
- Makes a false negative prediction 3 times as costly as a false positive
- The algorithm attempts to minimise costs
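A sketch of how such a cost matrix biases the decision (assumed logic illustrating the idea): with a false negative three times as costly as a false positive, minimising expected cost means predicting "yes" whenever P(yes) exceeds 0.25 rather than the usual 0.5.

    // Choose the prediction with the lower expected cost.
    static String costSensitivePrediction(double pYes) {
        double costOfPredictingNo  = 3.0 * pYes;         // expected cost of a false negative
        double costOfPredictingYes = 1.0 * (1.0 - pYes); // expected cost of a false positive
        return costOfPredictingNo > costOfPredictingYes ? "yes" : "no"; // "yes" iff pYes > 0.25
    }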
The Models

8 models in total, 4 using each algorithm:
- One using default settings
- One using the Priors technique
- One using weighting
- One using Priors and weighting
Testing the Models

- Tested on the test data set created from the WEATHER_BUILD data set
- Confusion matrices indicate the accuracy of the models
Testing Results

[Figure: bar chart of test accuracy (0%–90%) for the Naïve Bayes and Adaptive Bayes Network models under each combination of settings: weighting + priors, weighting + no priors, no weighting + priors, no weighting + no priors.]
Applying the Models to New Data

Models were applied to the new data in WEATHER_APPLY.

Extracts showing 2 predictions in the actual results:

THE_TIME   Prediction   Probability
       1   no           0.9999
     138   yes          0.6711

THE_TIME   Prediction   Cost of incorrect prediction
       1   no           0
     138   yes          0.3288
Attribute Influence on Predictions

Adaptive Bayes Network provides rules along with predictions:
- Rules in if...then format
- Rules showed the attributes with the most influence were:
  - Wind Chill
  - Wind Direction
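For illustration, one such if...then rule could be evaluated as below (the thresholds and bins are invented for the example; they are not taken from the actual models):

    // Hypothetical rule of the kind the Adaptive Bayes Network reports:
    // IF wind chill is low AND wind direction is easterly THEN RAIN = yes.
    static String applyRule(double windChillF, double windDirectionDeg) {
        if (windChillF < 40.0 && windDirectionDeg >= 45.0 && windDirectionDeg <= 135.0) {
            return "yes"; // rule fires: predict rain
        }
        return "no";
    }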
Results of Applying Models to New Data

[Figure: "Model Results" bar chart of accuracy (0%–80%) on the new data for the Naïve Bayes and Adaptive Bayes Network models under each combination of settings: weighting + priors, weighting + no priors, no weighting + priors, no weighting + no priors.]
Comparing Accuracy

[Figure: the "Testing Results" and "Model Results" charts shown side by side, comparing each model's accuracy on the test data with its accuracy on the new data.]
Observations

- Algorithms found a pattern in the weather data
- Most effective model: Adaptive Bayes Network algorithm using weighting
- Accuracy of Naïve Bayes models improves dramatically if weighting and Priors are used
- Significant difference between accuracy during testing of the models and accuracy when applied to new data
Conclusions

Oracle Data Mining provides easy-to-use wizards that support all aspects of the data mining process.

Algorithms found a pattern in the weather data:
- Best case: the Adaptive Bayes Network model predicted 73.1% of RAIN outcomes correctly
Conclusions

Adaptive Bayes Network algorithm produced the most effective model: accuracy of 73.1% when applied to new data.
- Tuned using a weighting of 3 against false negatives

Most effective model using Naïve Bayes: accuracy of 63.79%.
- Uses a weighting of 3 against false negatives and the Priors technique
Conclusions

- Accuracy during testing does not always indicate the performance of a model on new data
- Test accuracy is inflated if the target attribute distribution in the build and test data sets is similar
- This shows the need to test a model on a variety of data sets
Questions