CS 490 Sample Project
Mining the Mushroom Data Set
Kirk Scott
1
2
Yellow Morels
3
Black Morels
4
• This set of overheads begins with the
contents of the project check-off sheet
• After that an example project is given
5
CS 490 Data Mining Project
Check-Off Sheet
• Student's name: _______
• 1. Meets requirements for formatting.
(No pts.) [ ]
• 2. Oral presentation given. (No pts.) [ ]
• 3. Attendance at Other Students'
Presentations. Partial points for partial
attendance. 20 pts.____
6
I. Background Information on the
Problem Domain and the Data Set
7
• Name of Data Set: _______
• I.A. Random Information Drawn from the
Online Data Files Posted with the Data
Set. 3 pts.___
• I.B. Contents of the Data File. 3 pts.___
• I.C. Summary of Background Information.
3 pts.___
• I.D. Screen Shot of Open File. 3 pts.___
8
II. Applications of Data Mining
Algorithms to the Data Set
9
II. Case 1. This Needs to Be a Classification Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
10
II. Case 2. This Needs to Be a Clustering Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
11
II. Case 3. This Needs to Be an Association Mining Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
12
II. Case 4. Any Kind of Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
13
II. Case 5. Any Kind of Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
14
II. Case 6. Any Kind of Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
15
II. Case 7. Any Kind of Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
16
II. Case 8. Any Kind of Algorithm
• Name of Algorithm: _______
• i. Output Results. 3 pts.___
• ii. Explanation of Item. 2 pts.___
• iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
17
III. Choosing the Best Algorithm
Among the Results
18
• III.A. Random Babbling. 6 pts.___
• III.B. An Application of the Paired t-test. 6 pts.___
• Total out of 100 points possible: _____
19
Example Project
• The point of this sample project is to
illustrate what you should produce for
your project.
• In addition to the content of the project, information given in italics provides instructions, commentary, or background information.
20
• Needless to say, your project should
simply contain all of the necessary
content.
• You don't have to provide italicized
commentary.
21
I. Background Information on the
Problem Domain and the Data Set
• If you are working with your own data
set you will have to produce this
documentation entirely yourself.
• If you are working with a downloaded
data set, you can use whatever
information comes with the data set.
• You may paraphrase that information,
rearrange it, do anything to it to help
make your presentation clear.
22
• You don't have to follow academic
practice and try to document or
footnote what you did when presenting
the information.
• The goal is simply adaptation for clear
and complete presentation.
• What I'm trying to say is this: There
will be no penalty for "plagiarism".
23
• What I would like you to avoid is simply
copying and pasting, leading to a mass
of information that is not relevant or
helpful to the reader (the teacher—who
will be making the grades) in
understanding what you were doing.
• Reorganize and edit as necessary in
order to make it clear.
24
• Finally, include a screen shot of the
explorer view of the data set after
you've opened the file containing it.
• Already here you have a choice of what
exactly to show and you need to write
some text explaining what the screen
shot displays.
25
I.A. Random Information Drawn from the
Online Data Files Posted with the Data
Set
• This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).
• Each species is identified as definitely
edible, definitely poisonous, or of unknown
edibility and not recommended.
• This latter class was combined with the poisonous one.
26
• The Guide clearly states that there is no
simple rule for determining the edibility of
a mushroom; no rule like ''leaflets three, let
it be'' for Poisonous Oak and Ivy.
27
• Number of Instances: 8124
• Number of Attributes: 22 (all nominally
valued)
• Attribute Information: (classes: edible=e, poisonous=p)
28
• 1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
• 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
• 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
29
• 4. bruises?: bruises=t,no=f
• 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
• 6. gill-attachment: attached=a,descending=d,free=f,notched=n
• 7. gill-spacing: close=c,crowded=w,distant=d
30
• 8. gill-size: broad=b,narrow=n
• 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
• 10. stalk-shape: enlarging=e,tapering=t
• 11. stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
31
• 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
• 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
• 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
32
• 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
• 16. veil-type: partial=p,universal=u
• 17. veil-color: brown=n,orange=o,white=w,yellow=y
• 18. ring-number: none=n,one=o,two=t
33
• 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
• 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
34
• 21. population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
• 22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
35
• Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11.
• Class Distribution:
• -- edible: 4208 (51.8%)
• -- poisonous: 3916 (48.2%)
• -- total: 8124 instances
36
• Logical rules for the mushroom data sets.
• This is information derived by researchers
who have already worked with the data
set.
• Logical rules given below seem to be the
simplest possible for the mushroom
dataset and therefore should be treated as
benchmark results.
37
• Disjunctive rules for poisonous
mushrooms, from most general to most
specific:
• P_1)
odor=NOT(almond.OR.anise.OR.none)
• 120 poisonous cases missed, 98.52%
accuracy
• P_2) spore-print-color=green
• 48 cases missed, 99.41% accuracy
38
• P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown)
• 8 cases missed, 99.90% accuracy
• P_4) habitat=leaves.AND.cap-color=white
• 100% accuracy
• Rule P_4) may also be
• P_4') population=clustered.AND.cap_color=white
39
• These rules involve 6 attributes (out of 22).
Rules for edible mushrooms are obtained
as negation of the rules given above, for
example the rule:
• odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green
• gives 48 errors, or 99.41% accuracy on
the whole dataset.
40
• Several slightly more complex variations
on these rules exist, involving other
attributes, such as gill_size, gill_spacing,
stalk_surface_above_ring, but the rules
given above are the simplest we have
found.
41
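• A minimal sketch of checking rule P_1 against the raw data file is given below. It is purely illustrative: the file name (agaricus-lepiota.data) and the column positions (class in column 0, odor in column 5) are assumptions based on the record layout shown in section I.B, and the project itself only requires the Weka Explorer.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative sketch: count how many poisonous records rule P_1 covers or misses.
// Assumes class (e/p) in column 0 and odor in column 5 of each comma-separated record.
public class RuleP1Check {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("agaricus-lepiota.data"));
        int covered = 0, missed = 0;
        for (String line : lines) {
            if (line.isBlank()) continue;
            String[] f = line.split(",");
            boolean poisonous = f[0].equals("p");
            // P_1: odor is NOT almond (a), anise (l), or none (n) => predict poisonous
            boolean ruleFires = !(f[5].equals("a") || f[5].equals("l") || f[5].equals("n"));
            if (poisonous && ruleFires) covered++;
            if (poisonous && !ruleFires) missed++;
        }
        System.out.println("P_1 covers " + covered + " poisonous cases, misses " + missed);
    }
}

• If the quoted benchmark is right, the count of missed poisonous cases should come out at 120.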
I.B. Contents of the Data File
• Here is a snippet of five records from the data file:
• p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
• e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
• p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
• e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
42
• Incidentally, the data file contents also
exist in expanded form.
• Here is a record from that file:
• EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
43
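• As a side note, the same data can be loaded programmatically through the Weka API. The sketch below assumes the data has been converted to an ARFF file named mushroom.arff with the edible/poisonous class as the first attribute; that file name and attribute position are assumptions, not part of the project requirements.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: load the mushroom data into Weka and set the class attribute.
public class LoadMushrooms {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mushroom.arff"); // assumed file name
        data.setClassIndex(0);                             // assumed: class is the first attribute
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Class:      " + data.classAttribute().name());
    }
}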
• Section I.C should be written by you.
You should summarize the information
given above, which is largely copy and
paste, in a brief, well-organized
paragraph that you write yourself and
which conveys the basics in a concise
way.
44
• The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they need to know in order to keep reading the rest of your write-up and have some idea of what is going on.
45
I.C. Summary of Background
Information
• The problem domain is the classification of
mushrooms as either poisonous/inedible
or non-poisonous/edible.
• There are 8124 instances in the data set
consisting of 22 nominal attributes apiece.
• Roughly half of the instances are
poisonous and half are non-poisonous.
46
• There are 2480 cases of missing attribute
values, all on the same attribute.
• As is to be expected with non-original data
sets, this set has already been extensively
studied.
• Other researchers have provided sets of rules they have derived, which serve as benchmarks when considering the results of applying further data mining algorithms to the data set.
47
I.D. Screen Shot of Open File
• ***What this shows:
• The cap-shape attribute is chosen out of
the list on the left.
• Its different values are given in the table in
the upper right.
• In the lower right, the Edible attribute is
selected from a (hidden) drop down list.
48
• The graph shows the proportion of edible
and inedible mushrooms among the
instances containing different values of
cap-shape.
49
50
II. Applications of Data Mining
Algorithms to the Data Set
• The overall requirement is that you use
the Weka explorer and run up to 8
different data mining algorithms on
your data set.
• Here is a preview of what is involved:
51
• i. You will get full credit for all 8 cases
if among the 8 there is at least one each
of classification, clustering, and
association rule mining.
• In order to make it clear that this has
been done, the first case should be a
classification, the second case should
be a clustering, and the third case
should be an application of association rule mining.
52
• The grading check-off sheet will reflect
this requirement.
• All remaining cases can be of your choice, given in any order you want.
53
• ii. You will have to either copy a screen
shot or copy certain information out of
the Weka explorer interface and paste it
into your report.
• What you need to capture for the different kinds of cases is simply illustrated below.
• I won't try to list it all out here.
54
• At every point, ask yourself this
question:
• "Was it immediately apparent to me
what I was looking at and what it
meant?“
• If the answer to that question was no,
you should include explanatory
remarks with whatever you chose to
show from Weka.
55
• For consistency's sake in these cases
you can label your remarks "***What
this shows:".
56
• iii. The most obvious kind of results
that you would reproduce would be the
percent correct and percent incorrect
classification for a classification
scheme, for example.
• In addition to this, the Weka output
would include things like a confusion
matrix, the Kappa statistic, and so on.
57
• For each case that you examine, you
will be expected to highlight one aspect
of the output and to provide your own
brief, written explanation of it.
• Note that this is an "educational"
aspect of this project.
58
• On the job, the expectation would be
that you as a user knew what it all
meant.
• Here, as a student, the goal is to show
that you know what it all meant.
59
• iv. Finally, there is an additional aspect
of Weka that you should use and
illustrate.
• I will not try to describe it in detail here.
You will see examples in the content
below.
60
• In short, for the different algorithms, if
you right click on the results, you will
be given options to create graphs,
charts, and various other kinds of
output.
• For each case that you cover you
should take one of these options.
• Again, there is an educational, as opposed to practical, aspect to this.
61
• For the purposes of this project, just
cycle through the different options that
are available to show that you are
familiar with them.
• For each one, provide a sentence or
two making it clear that you know what
this additional output means.
62
II. Case 1. This Needs to Be a
Classification Algorithm
• Name of Algorithm: J48
63
i. Output Results
• ***What this shows:
• This shows the classifier tree generated by
the J48 algorithm.
64
65
• ***What this shows:
• This gives the analysis of the output of the
algorithm.
• The most notable thing that should jump
out at you is that this is a "perfect" tree.
• The output shows 100% correct
classification and no misclassification.
66
67
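• For readers who prefer code to screen shots, the Explorer run above corresponds roughly to the Weka API sketch below. The file name and class position are the same assumptions as before; the project itself only asks for the Explorer output.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: J48 with 10-fold cross-validation, printing the summary and confusion matrix.
public class Case1J48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mushroom.arff"); // assumed file name
        data.setClassIndex(0);                             // assumed class position

        J48 tree = new J48();                              // default pruned tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());        // % correct, Kappa, error measures
        System.out.println(eval.toMatrixString());         // confusion matrix
    }
}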
ii. Explanation of Item
• There is no need to repeat the screen
shot.
• For this item I have chosen the confusion
matrix.
• It is very easy to understand.
• It shows 0 false positives and 0 false
negatives.
68
• It is interesting to note that you need to
know the values for the attributes in the
data file to be sure which number
represents TP and which represents TN.
• Referring back to the earlier screen shot, the same is true for the bars.
• What do the blue and red parts of the bars
represent, edible or inedible?
69
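• For reference, Weka prints a two-class confusion matrix in roughly the layout below. The counts shown assume the 100% correct J48 run and the 4208/3916 class split quoted in section I.A; which of rows a and b is the edible class depends on the order of the class values in the file, which is exactly the point made above.

=== Confusion Matrix ===

    a    b   <-- classified as
 4208    0 |  a = e
    0 3916 |  b = p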
iii. Graphical or Other Special
Purpose Additional Output
• ***What this shows:
• Going back to the previous screen shot, if you right-click on the item highlighted in blue (the results of running J48 on the data set), you get several options.
• One of them is "Visualize tree".
• This screen shot shows the result of taking
that option.
70
71
II. Case 2. This Needs to Be a
Clustering Algorithm
• Name of Algorithm: SimpleKMeans
72
i. Output Results
• ***What this shows:
• This shows the results of the
SimpleKMeans clustering algorithm with
the edible/inedible attribute ignored.
• The results compare the
clusters/classifications with the ignored
attribute.
• The algorithm finds 2 clusters based on
the remaining attributes.
73
74
ii. Explanation of Item
• At the bottom of the screen shot there is
an item, "Incorrectly clustered instances".
• 37.6% of the clustered instances don't fall
into the desired edible/inedible category.
• The algorithm finds 2 clusters, but these 2
clusters don't agree with the 2
classifications of the attribute that was
ignored.
75
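• A rough Weka API equivalent of this run is sketched below: the class attribute is filtered out before clustering and then used afterwards for the classes-to-clusters comparison. The file name and class position are assumptions as before, and my understanding is that ClusterEvaluation does the classes-to-clusters comparison whenever the supplied data has a class attribute set.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Sketch: SimpleKMeans with the class attribute ignored, compared to edible/poisonous.
public class Case2KMeans {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mushroom.arff"); // assumed file name
        data.setClassIndex(0);                             // assumed class position

        Remove remove = new Remove();                      // strip the class for training
        remove.setAttributeIndices("1");                   // 1-based index of the class
        remove.setInputFormat(data);
        Instances trainData = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);
        km.buildClusterer(trainData);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);                      // data still carries the class
        System.out.println(eval.clusterResultsToString());
    }
}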
iii. Graphical or Other Special
Purpose Additional Output
• ***What this shows:
• Going back to the previous screen shot, if you right-click on the item highlighted in blue (the results of running SimpleKMeans on the data set), you get several options.
• One of them is "Visualize cluster
assignments".
76
• This screen shot shows the result of taking
that option.
• Since it isn't possible to visualize the
clusters in n-dimensional space, the
screen provides the option of picking
which individual attribute to visualize.
77
• This screen shows the instances in order
by number along the x-axis.
• The y-axis shows the cluster placements
for the different values for the cap-shape
attribute.
• The drop down box allows you to change
what the axes represent.
78
79
II. Case 3. This Needs to Be an Association Mining Algorithm
• Name of Algorithm: Apriori
80
i. Output Results
• ***What this shows:
• This shows the results of the Apriori
association rule mining algorithm.
81
82
ii. Explanation of Item
• Various relevant parameters are shown on
the screen shot.
• The system defaults to a minimum support
level of .95 and a minimum confidence
level of .9.
• The system lists the 10 best rules found.
• The first 9 have confidence levels of 1.
• On the one hand, this is good.
83
• From a practical point of view, what this
tends to suggest is that the data are
effectively redundant.
• Just to take the first rule for example, if
you know the color of the veil, you know
the type of the veil.
• The 10th rule provides an interesting
reverse insight into this.
• It tells you that if you know the type, you
only know the color with .98 confidence.
84
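• For reference, the two measures quoted above have the standard association rule definitions:

\[
\mathrm{support}(X \Rightarrow Y) = \frac{|\{\text{instances containing both } X \text{ and } Y\}|}{|\{\text{all instances}\}|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{support}(X)}
\]

• A confidence of 1 therefore means that every instance matching the left-hand side of the rule also matches the right-hand side.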
iii. Graphical or Other Special
Purpose Additional Output
• There don't appear to be any other output
options for association rules.
• There is no standard visualization for them
so nothing is included for this point.
85
II. Case 4. Any Kind of
Algorithm
• Name of Algorithm: ADTree
86
i. Output Results
• ***What this shows.
• These are the results of running the
ADTree classification algorithm.
• I haven't bothered to scroll up and show
the ASCII representation of the tree.
• Instead, I've just shown the critical output
at the bottom.
87
88
ii. Explanation of Item
• There are two items I'd like to highlight:
• a. Notice that this tree generation
algorithm didn't get 100% classified
correctly.
• If I'm reading the data correctly, there were
8 false positives on the attribute of
interest, which is named Edible.
• This is not good.
89
• False negatives deprive you of a tasty
gustatory and culinary experience.
• False positives deprive you of your health
or your life.
• I point this out in contrast to the J48
results given above.
90
• b. Notice that the time taken to build the
model was .73 seconds.
• This is about 10 times slower than J48, but
I'm mainly interested in comparing with the
following algorithm.
91
iii. Graphical or Other Special
Purpose Additional Output
• ***What this shows:
• This is the visualization of the tree.
• There are other graphical options, but they
are difficult to interpret for the mushroom
data set, so this is given for comparison
with the J48 tree.
92
93
II. Case 5. Any Kind of
Algorithm
• Name of Algorithm: BFTree
94
i. Output Results
• ***What this shows:
• This shows the results of using the BFTree
classification algorithm.
95
96
ii. Explanation of Item
• This algorithm also doesn't give a tree that
classifies with 100% accuracy.
• It gives the same kind of error as the
ADTree, although there are 3 fewer.
97
• The additional item I'd like to highlight is
that the time taken to build the model was
12.42 seconds.
• As a matter of fact, that information came
out first and then additional, significant
amounts of time were taken to run through
each fold of the data.
• This was quite time consuming compared
to the other trees produced so far.
98
iii. Graphical or Other Special
Purpose Additional Output
• ***What this shows:
• This screen shot is the result of taking the
"Visualize classifier errors" option on the
results of the algorithm.
• I believe what this screen illustrates is a decision point in the tree on the cap-surface attribute.
99
• In one of the cases, symbolized by the
blue rectangle, an incorrect classification
is made on this basis while 7 other
instances classify correctly based on this
attribute.
100
101
II. Case 6. Any Kind of
Algorithm
• Name of Algorithm: Naïve Bayes
102
i. Output Results
• ***What this shows:
• This screen shot shows the bottom of the
output for the Naïve Bayes classification
algorithm.
• The upper part of the output shows
conditional probability counts for all of the
attributes in the data.
103
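• For reference, those conditional probability counts feed the standard Naïve Bayes rule, which treats the 22 attributes as conditionally independent given the class:

\[
P(\text{class} \mid a_1, \dots, a_{22}) \;\propto\; P(\text{class}) \prod_{i=1}^{22} P(a_i \mid \text{class})
\]

• The class (edible or poisonous) with the larger value is the predicted one.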
• If the cost of an error weren't so high, this algorithm by itself would do OK.
• Its time cost is only .03 seconds and it achieves 95.8% correct classification.
104
105
ii. Explanation of Item
• I'm running out of items to highlight which
are particularly meaningful for the example
in question.
• Notice that the output includes the Mean
absolute error, the Root mean squared
error, the Relative absolute error, and the
Root relative squared error.
106
• These differ in magnitude because of the
way they're calculated, but they are all
indicators of the same general thing.
• As pointed out in the book, when comparing two different data mining approaches, if you compare the same measure for both, you will tend to have a valid comparison regardless of which of the measures you use.
107
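• For predicted values p_i, actual values a_i, and mean actual value over n test instances, the standard definitions are:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}
\]
\[
\mathrm{RAE} = \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|},
\qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2}}
\]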
iii. Graphical or Other Special
Purpose Additional Output
• ***What this shows:
• Two graphical output screen shots are
given below.
• They show a cost-benefit analysis.
• Such an analysis is more appropriate to
something like direct mailing, but it is
possible to illustrate something by
changing one of the parameters in the
display.
108
• Both screen shots show a threshold curve
and a cost-benefit curve where the button
to minimize cost/benefit has been clicked.
• In the first screen shot the costs of FP and
FN are equal, at 1.
• In the second, the cost of a false positive
has been raised to 1,000.
• Notice how the shape of the curve
changes.
109
• Roughly speaking, I would interpret the
second screenshot to mean that you have
effectively no costs as long as you are
correctly predicting TP, but your cost rises
linearly with the increasing probability of
FP predictions later in the data set.
110
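• Roughly speaking (this is my reading, not something Weka states explicitly), the quantity being minimized is the total expected cost of the predictions at a given threshold:

\[
\text{cost} = C_{FP} \cdot N_{FP} + C_{FN} \cdot N_{FN}
\]

• With C_FP = C_FN = 1 this is just the total number of errors; raising C_FP to 1,000 makes any false positive dominate, which is why the shape of the curve changes so much.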
111
112
II. Case 7. Any Kind of
Algorithm
• Name of Algorithm: BayesNet
113
i. Output Results
• ***What this shows.
• This screen shot shows the results of the
BayesNet classification algorithm.
114
115
ii. Explanation of Item
• This is not a new item to explain, but it is
an observation related to the values and
results previously obtained.
• The association rule mining algorithm
seemed to suggest that there were heavy
dependencies among some of the
attributes in the data set.
• BayesNet is supposed to take these into
account, while Naïve Bayes does not.
116
• However, when you compare the rate of
correctly classified instances, here you get
96.2% vs. 95.8% for Naïve Bayes.
• It seems fair to ask what difference it really
made to include the dependencies in the
analysis.
117
iii. Graphical or Other Special
Purpose Additional Output
• What this shows:
• This shows the result of taking the
"Visualize cost curve" option on the results
of the data mining.
• Honestly, I've about reached the limit of
what I understand without further
research.
• I present this here without further
explanation.
118
• This is one of the reasons I advertise this
sample project write-up as an example of
a B, rather than an A effort.
• Everything that has been asked for is
included, but in this point, for example, the
explanation isn't complete.
• It sure is pretty though…
119
120
II. Case 8. Any Kind of
Algorithm
• Name of Algorithm: RIDOR
121
i. Output Results
• ***What this shows:
• This screen shows the results of applying
the RIDOR algorithm to the data set.
• RIDOR was the technique based on rules
and exceptions.
• Look at the top of the output.
• Here you see clearly that the default
classification is edible, with exceptions
listed underneath.
122
• Philosophically, this goes against my point
of view on mushrooms.
• The logical default should be inedible, but
there are more edible mushrooms in the
data set than inedible.
• So it goes.
123
124
ii. Explanation of Item
• The last set of items that appears in these output screens is the Precision, Recall, F-Measure, and ROC values.
• This is probably not the best example for illustrating what they mean.
• It's apparent that things like recall would be better suited to document retrieval, for example.
125
• Maybe the best illustration that they don't
really apply is that they are all 1 or .999.
• On the other hand, maybe that's realistic
for a classification scheme that gives
99.95% correct results.
126
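• For reference, in terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
\text{F-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]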
iii. Graphical or Other Special
Purpose Additional Output
• Once again, the fact that this is a "B"
example rather than an "A" example
comes into play.
• I'm not showing a new bit of graphical
output.
• I'm showing the cost curve, like for the
previous data mining algorithm.
127
• The main reason for choosing to show it
again is that this picture looks so much like
the simple picture in the text that they
used to illustrate some of the cost
concepts graphically.
128
129
III. Choosing the Best Algorithm
Among the Results
• Depending on the problem domain and
your level of ambition, you might
compare algorithms on the basis of lift
charts, cost curves, and so on.
• For simple classification, the tools will
give results showing the percent
classified correctly and the percent
classified incorrectly.
130
• It would be natural to simply choose the one with the highest percent classified correctly.
• However, this is not good enough for
credit on this item.
• I have chosen to illustrate what you
need to do with a simple basic
example.
131
• I consider the two classification
algorithms that gave the highest
percent classified correctly.
• I then apply the paired t-test to see whether or not there is actually a statistically significant difference between them.
132
• If there is, that's the correct basis for
preferring one over the other.
• For the purposes of illustration, I do
this by hand and explain what I'm
doing.
• You may find tools that allow you to
make a valid comparison of results.
• That's OK, as long as you explain.
133
• The point simply is that it's not
sufficient to just list a bunch of
percents and pick the highest one.
• Illustrate the use of some advanced
technique, whether involving concepts
like lift charts or cost curves or
statistics.
• You may also have noticed that Weka tells you the run time for doing an analysis.
134
• When making a decision about which
algorithm is the best, at a minimum
take into account an advanced
comparison of the two apparent best,
and you may want to make an
observation about the apparent
complexity or time cost of the
algorithms.
135
III.A. Random Babbling
• The concept of "Cost of classification"
seems relevant to this example.
• It takes a human expert to tell if a
mushroom is poisonous.
• If you're not an expert, you can tell by
eating a mushroom and seeing what
happens.
136
• The cost of finding out that the mushroom
is poisonous is about as high as it gets.
• I guess if you're truly dedicated, you'd be
willing to die for science.
• Directly related to this is the cost of a
misclassification.
• It seems to be on the infinite side…
137
• The J48 tree approach, given first, even
though it's apparently been pruned, still
classifies 100% correctly.
• This seems to be at odds with claims made at various points that you don't want a perfect classifier because it will tend to be overtrained.
138
• On the other hand, since the cost of a
misclassification is so high, maybe it would
be best to bias the training.
• Lots of false "It's poisonous" results would
be desirable.
• I remember learning this rule from my
parents:
• Don't eat any wild mushrooms.
139
• It's also interesting to compare with the
commentary provided at the beginning.
• "Experts" who have examined the data
wanted to get a minimal rule set.
• They apparently considered that a
success.
• But they were willing to live with errors.
• I'm not sure living with errors is consistent
with this data set.
140
III.B. An Application of the
Paired t-test
• Pick any two of your results above,
identify them and the success rate
values they gave, and compare them
using the paired t-test.
• Give a statistically valid statement that
tells whether or not the two cases
you're comparing are significantly
different.
141
• What is shown is my attempt to
interpret and apply what the book says
about the paired t-test.
• I do not claim that I have necessarily
done this correctly.
• Students who have recently taken
statistics may reach different
conclusions about how this is done.
142
• However, I have gone through the
motions.
• To get credit for this section, you
should do the same, whether following
my example or following your own
understanding.
143
• I have chosen to compare the percent of
correct classifications by Naïve Bayes
(NB) and BayesNet (BN) given above.
144
• Taken from Weka results:
• NB sample mean = 95.8272%
• NB root mean squared error = .1757
• Squaring the value above:
• NB mean squared error = .03087049
145
• Taken from Weka results:
• BN sample mean = 96.2211%
• BN root mean squared error = .1639
• Squaring the value above:
• BN mean squared error = .02686321
146
• This is my estimate of the standard deviation of the t statistic, where the divisor is 10 because I opted for the default 10-fold cross-validation in Weka:
• Estimate of paired root mean squared error (EPRMSE)
• = square root((NB mean squared error / 10) + (BN mean squared error / 10))
• = .075982695
147
• t statistic
• = |NB sample mean – BN sample mean| / EPRMSE
• = 5.184
148
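• Putting the arithmetic from the preceding slides in one place:

\[
s = \sqrt{\frac{0.03087049}{10} + \frac{0.02686321}{10}} \approx 0.07598,
\qquad
t = \frac{|95.8272 - 96.2211|}{0.07598} \approx 5.18
\]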
• The book says this is a two-tailed test.
• For a 99% confidence interval I want to
use a threshold of .5%.
• The book's table gives a value of 3.25.
149
• The computed value, 5.184, is greater than the table value of 3.25.
• This means you reject the null hypothesis
that the means of the two distributions are
the same.
150
• In other words, you conclude that there is
a statistically significant difference
between the percent of correct
classifications resulting from the Naïve
Bayes and the Bayesian Network
algorithms on the mushroom data.
151
The End
152