
Final Project: Mining Mushroom World
Agenda
• Motivation and Background
• Determine the Data Set
• 10 DM Methodology steps
• Conclusion
Motivation and Background
• To distinguish edible mushrooms from poisonous ones by how they look
• To know whether a mushroom can be eaten, so as to survive in the wild
• To survive outside the computer world
Determine the Data Set (1/2)
Source of data: UCI Machine Learning Repository, Mushrooms Database
• From the Audubon Society Field Guide
• Documentation: complete, but missing statistical information
• Mushrooms described in terms of physical characteristics
• Classification: poisonous or edible
• All attributes are nominal-valued
• Large database: 8124 instances (2480 missing values for the attribute "stalk-root")
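The data file ships as comma-separated rows of single-letter codes with no header, so loading it means supplying the column names from the documentation. A minimal sketch, using the 23 column names (class label plus 22 attributes) from the UCI documentation; the two sample rows below are illustrative stand-ins for lines of `agaricus-lepiota.data`:

```python
import csv
import io

# Column names from the UCI mushroom documentation: class label first,
# then the 22 nominal attributes in file order.
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Two illustrative rows in the file's format ('?' would mark a missing value).
SAMPLE = (
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
)

# Parse each row into a dict keyed by attribute name.
rows = [dict(zip(COLUMNS, r)) for r in csv.reader(io.StringIO(SAMPLE))]
print(rows[0]["class"], rows[0]["odor"])
```

Reading the real file is the same, with `open("agaricus-lepiota.data")` in place of the `StringIO` sample.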
Determine the Data Set (2/2)
1. Past Usage
• Schlimmer, J. S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
• Iba, W., Wogulis, J., & Langley, P. (1988). ICML, 73-79.
2. No other mushroom data sets
10 DM Methodology steps
Step 1. Translate the Business Problem into a Data Mining Problem
a. Data Mining Goal: separate edible mushrooms from poisonous ones
b. How will the Results be Used: increase the survival rate
c. How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, NeuralNet
10 DM Methodology steps
Step 2. Select Appropriate Data
a. Data Source
– The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
– Jeff Schlimmer donated these data on April 27th, 1987
b. Volumes of Data
– Total 8124 instances
– 4208 (51.8%) edible; 3916 (48.2%) poisonous
– 2480 (30.5%) missing in attribute "stalk-root"
10 DM Methodology steps
Step 2. Select Appropriate Data
c. How Many Variables: 22 attributes
– cap-shape, cap-color, odor, population, habitat, and so on
d. How Much History is Required: no seasonality
*As long as we can eat them when we see them
10 DM Methodology steps
Step 3. Get to Know the Data
a. Examine Distributions: use Weka to visualize all 22 attributes with histograms
b. Class: edible=e, poisonous=p
Step 3. Get to Know the Data
a. Examine Distributions: there are 2 types of histograms
b. First: all kinds of values appear
c. (Attribute 21) population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
Step 3. Get to Know the Data
1. Examine Distributions: there are 2 types of histograms
– Second: only some kinds of values appear
– (Attribute 7) gill-spacing: close=c, crowded=w, distant=d
Step 3. Get to Know the Data
1. Examine Distributions: there are exceptions
– Exception 1: missing values in an attribute
– (Attribute 11) stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
– 2480 of the 8124 instances have a missing value for this attribute
Step 3. Get to Know the Data
1. Examine Distributions: there are exceptions
– Exception 2: an indistinguishable attribute (only one of its values appears in the data)
– (Attribute 16) veil-type: partial=p, universal=u
Step 3. Get to Know the Data
2. Compare Values with Descriptions
– no unexpected values except for missing values
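The deck inspects distributions in Weka's visualization panel. As a minimal text stand-in for one of those histograms, counting an attribute's values shows the same information; the gill-spacing values below are an illustrative sample, not the real data:

```python
from collections import Counter

# Illustrative gill-spacing values for a handful of instances. In the real
# data this attribute takes only c (close) and w (crowded); d (distant)
# never appears, which is the deck's "only some kinds of values" case.
gill_spacing = ["c", "c", "w", "c", "w", "c", "c", "w", "c", "c"]

counts = Counter(gill_spacing)
for value, n in counts.most_common():
    # Print a crude text histogram bar per value.
    print(f"{value}: {'#' * n} ({n})")
```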
10 DM Methodology steps
Step 4. Create a Model Set
– Creating a Balanced Sample: 75% (6093 instances) as training data, 25% (2031) as test data
– RapidMiner's cross-validation function: k−1 folds as training, 1 fold as test
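The 75/25 split and the k-fold scheme above can be sketched in plain Python (RapidMiner's cross-validation operator does this internally; this is only an illustration of the index bookkeeping, with a hypothetical seed):

```python
import random

def train_test_split_indices(n, test_frac=0.25, seed=42):
    """Shuffle instance indices and split off test_frac of them as test data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]

# With 8124 instances, 75/25 gives the deck's 6093 / 2031 split.
train, test = train_test_split_indices(8124)
print(len(train), len(test))  # 6093 2031

def kfold(n, k=10):
    """Yield (train, test) index lists: k-1 folds train, 1 fold test."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_fold = folds[i]
        train_folds = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_folds, test_fold
```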
10 DM Methodology steps
Step 5. Fix Problems with the Data
– Dealing with Missing Values: the attribute "stalk-root" has 2480 missing values
– Replace all missing values with the "average" stalk-root value; for a nominal attribute this is the most frequent value (the mode)
– We replaced '?' with the value 'b'
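Mode imputation for a nominal attribute can be sketched as below; the short stalk-root sample is illustrative, chosen so that 'b' (bulbous) is the most frequent value, matching the deck's replacement:

```python
from collections import Counter

def impute_mode(values, missing="?"):
    """Replace missing nominal values with the most frequent observed value."""
    mode = Counter(v for v in values if v != missing).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

# Illustrative stalk-root values; '?' marks a missing entry.
stalk_root = ["b", "c", "b", "?", "e", "b", "?", "r"]
print(impute_mode(stalk_root))  # ['b', 'c', 'b', 'b', 'e', 'b', 'b', 'r']
```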
10 DM Methodology steps
Step 6. Transform Data to Bring Information to the Surface
– All attributes are nominal, so no numerical transformation is needed in this step
10 DM Methodology steps
Step 7. Build Model
1. Decision Tree
Performance
– Accuracy: 99.11%
– Lift: 189.81%

             True p    True e    Class precision
Pred. p        961         0          100.00%
Pred. e         18      1052           98.32%
Class recall 98.16%   100.00%
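The accuracy, precision, and recall figures follow directly from the confusion-matrix counts; recomputing them from the decision tree's matrix above reproduces the reported numbers:

```python
# Decision tree confusion matrix counts:
#              true p   true e
# pred p        961        0
# pred e         18      1052
pp_p, pp_e = 961, 0     # predicted poisonous
pe_p, pe_e = 18, 1052   # predicted edible

total = pp_p + pp_e + pe_p + pe_e            # 2031 test instances
accuracy = (pp_p + pe_e) / total             # correct / total
precision_p = pp_p / (pp_p + pp_e)           # of those predicted p, truly p
precision_e = pe_e / (pe_p + pe_e)           # of those predicted e, truly e
recall_p = pp_p / (pp_p + pe_p)              # of the true p, caught as p
recall_e = pe_e / (pp_e + pe_e)              # of the true e, caught as e

print(f"accuracy={accuracy:.2%} precision_e={precision_e:.2%} "
      f"recall_p={recall_p:.2%}")
```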
10 DM Methodology steps
Step 7. Build Model
2. Naïve Bayes
Performance
– Accuracy: 95.77%
– Lift: 179.79%

             True p    True e    Class precision
Pred. p        902         9           99.01%
Pred. e         77      1043           93.12%
Class recall 92.13%    99.14%
10 DM Methodology steps
Step 7. Build Model
3. Ripper
Performance
– Accuracy: 100%
– Lift: 193.06%

             True p    True e    Class precision
Pred. p        979         0          100.00%
Pred. e          0      1052          100.00%
Class recall 100.00%  100.00%
10 DM Methodology steps
Step 7. Build Model
4. NeuralNet
Performance
– Accuracy: 91.04%
– Lift: 179.35%

             True p    True e    Class precision
Pred. p        907       110           89.18%
Pred. e         72       942           92.90%
Class recall 92.65%    89.54%
10 DM Methodology steps
Step 8. Assess Models
– Accuracy: Ripper and the Decision Tree perform better
[Accuracy chart: Decision Tree 99.11, Naïve Bayes 95.77, Ripper 100, Neural Net 91.04]
10 DM Methodology steps
Step 8. Assess Models
– Lift (to compare the performance of different classification models): Ripper and the Decision Tree have higher lifts
[Lift chart: Decision Tree 189.81, Naïve Bayes 179.79, Ripper 193.06, Neural Net 179.35]
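The deck does not define how lift is computed. One reading consistent with all four reported values is class precision for "edible" divided by the prior proportion of edible instances (4208 of 8124); the sketch below is that assumed formula, using the precision counts from each model's confusion matrix:

```python
# Prior proportion of edible mushrooms in the full data set.
base_rate_e = 4208 / 8124

# Class precision for "edible" from each model's confusion matrix:
# true-e-predicted-e / all-predicted-e.
precision_e = {
    "Decision Tree": 1052 / (18 + 1052),
    "Naive Bayes":   1043 / (77 + 1043),
    "Ripper":        1052 / (0 + 1052),
    "NeuralNet":      942 / (72 + 942),
}

# Assumed definition: lift = precision / base rate.
for model, p in precision_e.items():
    print(f"{model}: lift = {p / base_rate_e:.2%}")
```

Under this reading, all four computed lifts round to the values in the chart above.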
10 DM Methodology steps
Step 9. Deploy Models
– We haven't gone out to find real mushrooms yet
Step 10. Assess Results
Conclusion and questions
– Ripper and the decision tree may be better models for nominal data
– Open question: how does RapidMiner separate training data from test data?