Stevens Institute of Technology

Knowledge Discovery in Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo
Profound Questions
• What basic properties make up the formula for a good wine?
– Wine making is believed to be an art. But is there a formula for a quality wine?
– The provider of the data set published a paper, “Modeling wine preferences by data mining”. How do my results compare with the paper’s?
Procedure
• Follow a data mining process
• Use SAS and SAS Enterprise Miner to execute
the process
• The SAS Enterprise Miner tool is modeled on SEMMA, the data mining process defined by SAS Institute: Sample, Explore, Modify, Model, Assess
• SEMMA is similar to the CRISP-DM process
Sample
• 1,599 records
• Set up a data partition
– Training 40%
– Validation 30%
– Test 30%
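A 40/30/30 partition like the one above can be approximated outside Enterprise Miner with a DATA step. This is only a sketch: the seed and the output data set names are illustrative assumptions, and the Data Partition node may stratify on the target, which this simple random split does not do.

```sas
/* Sketch: simple random 40/30/30 split of the red wine data.        */
/* Assumptions: seed 637 and the three output names are illustrative */
/* and do not come from the original project.                        */
data wino.train wino.validate wino.test;
  set wino.red;
  if _n_ = 1 then call streaminit(637);  /* fix the random stream once */
  u = rand('uniform');                   /* uniform draw in (0,1)      */
  if u < 0.40 then output wino.train;          /* about 40% */
  else if u < 0.70 then output wino.validate;  /* about 30% */
  else output wino.test;                       /* about 30% */
  drop u;
run;
```

Because each record is assigned independently, the resulting sizes are close to, but not exactly, 40/30/30 of the 1,599 records.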
Explore: Data Background
• Data source
– UCI Machine Learning Repository.
• Wine Quality Data Set.
– There are red and white wine data sets. I focused on the red wine set only.
– There are 11 input variables and one target variable.
» fixed acidity
» volatile acidity
» citric acid
» residual sugar
» chlorides
» free sulfur dioxide
» total sulfur dioxide
» density
» pH
» sulphates
» alcohol
– Target variable (based on sensory data): quality (score between 0 and 10)
Explore: Target=Quality
• Quality
– Assessors rated each wine on a scale of 0-10. The ratings actually observed range from 3 to 8.
– An ordinal target
Explore: Inputs
• Correlation Analysis
– Some correlation, but not enough to discard
inputs
ods graphics on;
ods select MatrixPlot;
/* Scatter plot matrix with histograms on the diagonal */
proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
  var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
run;
Explore: Correlation Graphs
Explore: Chi2 Statistics of Inputs
Explore: Worth of Inputs
Explore: Worth Graph
• The Worth tracks closely with the Chi-square statistic
Modify
• At this stage, no modifications are done
Model: Selection
• Because I want to list the important elements of what is considered a quality wine, I chose a Decision Tree
• Configuration
– The Splitting Rule is Entropy
– Maximum Branch is set to 5
• This approximates a C4.5-style algorithm (entropy-based, multiway splits)
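For reference, an entropy splitting rule can be written as follows. The notation is an assumption introduced here: S is the set of records in a node, p_k is the share of records in S with quality level k, and S_1, ..., S_B are the child nodes of a candidate split, with B at most 5 under the Maximum Branch setting. Enterprise Miner's exact computation may differ in detail.

```latex
H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k ,
\qquad
\Delta H = H(S) - \sum_{j=1}^{B} \frac{|S_j|}{|S|}\, H(S_j),
\quad B \le 5
```

The split with the largest entropy reduction is preferred. C4.5 proper goes one step further and normalizes this gain by the split information (the gain ratio), so the configuration above is C4.5-like rather than identical.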
Assess: Initial Results
• The result is a bushy tree, too intricate for a simple recommendation.
– Over 20 leaf nodes.
Modify: Target
• Change the target so that it becomes binary.
• New variable in the model called isGood. Any rating over 6 (that is, 7 or higher) is categorized as isGood.
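– SAS Code:
/* Derive the binary target: 1 when quality is 7 or higher, else 0 */
data wino.xx;
  set wino.red;
  if (quality > 6) then isgood = 1;
  else isgood = 0;
run;
/* List the new data set to verify the derived column */
proc print data=wino.xx;
  title 'xx';
run;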
Explore: Target = isGood
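One way to inspect the new target is a frequency table. The sketch below assumes the wino.xx data set created in the Modify: Target step; it is not taken from the original project.

```sas
/* Sketch: check the class balance of the derived binary target. */
proc freq data=wino.xx;
  tables isgood;                 /* share of good vs. not-good wines */
  tables quality*isgood
         / norow nocol nopercent;  /* confirm the cutoff: 7 and 8 map to 1 */
run;
```

The cross-tabulation makes it easy to confirm that only quality levels above 6 land in isgood = 1.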
Model Strategy for isGood
• Model with a Decision Tree in the hope of getting more descriptive results.
• Also model with a Neural Network to aid in assessment and for comparison
Model: Decision Tree
• ProbF splitting criterion at a significance level of 0.2
• Maximum Branch size = 5
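Outside Enterprise Miner, a comparable tree could be sketched with PROC HPSPLIT. Several things here are assumptions rather than the project's actual setup: it presumes a SAS/STAT release that includes PROC HPSPLIT, it treats isgood as a class target and therefore swaps in a chi-square growth criterion instead of ProbF, and the seed and validation fraction are illustrative.

```sas
/* Sketch only: not the Enterprise Miner Decision Tree node itself.   */
/* Assumptions: PROC HPSPLIT is available; chi-square growth stands   */
/* in for the ProbF criterion; seed and validation share are made up. */
proc hpsplit data=wino.xx maxbranch=5;
  class isgood;                      /* binary target as a class variable */
  model isgood = alcohol chlorides citric_acid density fixed_acidity
                 free_sulfur_dioxide pH residual_sugar sulphates
                 total_sulfur_dioxide volatile_acidity;
  grow chisquare;                    /* p-value based split search        */
  prune costcomplexity;              /* prune back using validation data  */
  partition fraction(validate=0.3 seed=637);
run;
```

The resulting tree will not necessarily match the one shown in the slides, since the split criterion, pruning, and partition all differ from the Enterprise Miner flow.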
Assess: Decision Tree Results
• Much simpler Tree
Assess: Decision Tree Results 2
• Leaf Statistics
Assess: Variable Importance
Variable Name         Number of        Number of        Importance   Validation   Ratio of Validation
                      Splitting Rules  Surrogate Rules               Importance   to Training Importance
alcohol               1                0                1            1            1
density               0                1                0.77055175   0.77055175   1
volatile_acidity      0                1                0.728868987  0.728868987  1
sulphates             1                0                0.671675628  0.477710505  0.711222032
fixed_acidity         0                1                0.553719729  0.393817671  0.711222032
citric_acid           0                1                0.549750361  0.390994569  0.711222032
free_sulfur_dioxide   0                0                0            0            NaN
pH                    0                0                0            0            NaN
chlorides             0                0                0            0            NaN
total_sulfur_dioxide  0                0                0            0            NaN
residual_sugar        0                0                0            0            NaN

Event Classification Table
Data Role  Target  False Negative  True Negative  False Positive  True Positive
TRAIN      isgood  53              539            14              34
VALIDATE   isgood  43              403            12              21
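The counts in the Event Classification Table can be turned into the usual summary rates. The sketch below re-enters the counts by hand; it assumes that the value 21 is the validation true-positive count, and the data set and variable names are illustrative.

```sas
/* Sketch: accuracy and related rates from the classification counts. */
/* Assumption: 21 is the VALIDATE true-positive count.                */
data work.tree_metrics;
  length role $8;
  input role $ fn tn fp tp;
  n           = fn + tn + fp + tp;   /* records in the partition      */
  accuracy    = (tp + tn) / n;       /* share classified correctly    */
  sensitivity = tp / (tp + fn);      /* good wines found              */
  specificity = tn / (tn + fp);      /* not-good wines found          */
  ppv         = tp / (tp + fp);      /* precision of isgood=1 calls   */
  datalines;
TRAIN 53 539 14 34
VALIDATE 43 403 12 21
;
run;

proc print data=work.tree_metrics noobs;
  format accuracy sensitivity specificity ppv 6.3;
run;
```

With these counts, validation accuracy works out to about 0.885 (424 of 479), while sensitivity is only about 0.328 (21 of 64): the tree classifies most wines correctly largely because most wines are not rated good.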
Model: Neural Network
• Positive – better at predicting
• Negative – hard to interpret the model
• Configured with 3 Hidden Nodes
Modify: Input Variables to NN
• Because of the complexity of the NN, it is
recommended to prune variables prior to
running the network.
Modify: R2 Filter
Variable Name         Role      Measurement Level  Reasons for Rejection
alcohol               INPUT     INTERVAL
chlorides             INPUT     INTERVAL
citric_acid           REJECTED  INTERVAL           Varsel:Small R-square value
density               INPUT     INTERVAL
fixed_acidity         INPUT     INTERVAL
free_sulfur_dioxide   INPUT     INTERVAL
pH                    REJECTED  INTERVAL           Varsel:Small R-square value
residual_sugar        REJECTED  INTERVAL           Varsel:Small R-square value
sulphates             INPUT     INTERVAL
total_sulfur_dioxide  REJECTED  INTERVAL           Varsel:Small R-square value
volatile_acidity      INPUT     INTERVAL
Model: NN
• Specify 3 Hidden Units in the Hidden Layer
Assess: NN Results
• Hard to interpret results to formulate a recipe
The NEURAL Procedure
Optimization Results
Parameter Estimates
N  Parameter                Estimate    Gradient Objective Function
1  alcohol_H11               3.679818   -0.001411
2  chlorides_H11             0.520190   -0.000479
3  density_H11              -2.171623    0.000883
4  fixed_acidity_H11        -0.055929    0.000179
5  free_sulfur_dioxide_H11   0.403412    0.000139
6  sulphates_H11            -4.954290   -0.000224
7  volatile_acidity_H11      2.686209    0.000205
8  alcohol_H12              -0.313005    0.001209
9  chlorides_H12             0.200973    0.000759
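To see why these estimates are hard to read as a recipe, it helps to write out how a network with 3 hidden units combines them. The form below is an assumption: it presumes tanh hidden units, a logistic output, and standardized inputs x_i, and the symbols b_j, c, and v_j are notation introduced here. An estimate such as alcohol_H11 plays the role of w_{ij} for input alcohol and the first hidden unit.

```latex
h_j = \tanh\Big( b_j + \sum_{i} w_{ij}\, x_i \Big), \quad j = 1, 2, 3,
\qquad
\hat{p}(\text{isGood} = 1) = \frac{1}{1 + \exp\!\Big( -\big( c + \sum_{j=1}^{3} v_j\, h_j \big) \Big)}
```

Each input reaches the prediction through all three hidden units at once, so no single weight states the effect of, say, alcohol on quality.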
Assess: Comparative Results
• Receiver Operating Characteristic (ROC) chart for NN vs Decision Tree
Assess: Comparative Results
• Cumulative Lift for NN vs Decision Tree
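As a reminder of what the chart measures, cumulative lift at depth d can be written as below. The notation is introduced here for illustration: wines are sorted by predicted probability of isGood, n_d is the number of wines in the top d percent, TP_d is the number of truly good wines among them, P is the total number of good wines, and N is the total number of wines.

```latex
\mathrm{Lift}(d) = \frac{TP_d / n_d}{P / N}
```

A lift of 1 means the model does no better than picking wines at random, and the largest possible value at small depths is N / P.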
Assess: Comparison with Reference Paper
• The paper used RMiner (a data mining library for R)
• Support Vector Machine (SVM) and Neural Network models were used
• The authors applied techniques to extract the relative importance of variables
• They attempted to predict every quality level
• They noted the importance of alcohol and sulphates. “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”
Assess: Paper Variable Importance
Overall Project in SAS EM
References
• UCI Machine Learning Repository, Wine Quality Data Set
http://archive.ics.uci.edu/ml/datasets/Wine+Quality
• P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
http://www3.dsi.uminho.pt/pcortez/wine5.pdf