Data Mining - Computer Science Intranet

Download Report

Transcript Data Mining - Computer Science Intranet

COMP527:
Data Mining
COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2009
Classification: Rules
February 10, 2009
Slide 1
COMP527:
Data Mining
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Classification: Rules
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
February 10, 2009
Slide 2
COMP527:
Data Mining
Today's Topics
Introduction
Rule Sets vs Rule Lists
Constructing Rules-based Classifiers
1R
PRISM
Reduced Error Pruning
RIPPER
Rules with Exceptions
Classification: Rules
February 10, 2009
Slide 3
COMP527:
Data Mining
Rules-Based Classifiers
Idea: Learn a set of rules from the data. Apply those rules to
determine the class of the new instance.
For example:
R1. If blood-type=Warm and lay-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lay-eggs=False then Mammal
Hawk: flies=True, blood-type=Warm, lay-eggs=True, class=???
R1 is True, so the classifier predicts that Hawk = Bird.
Yay!
Classification: Rules
February 10, 2009
Slide 4
COMP527:
Data Mining
Rules-Based Classifiers
A rule r covers an instance x if the attributes of the instance satisfy
the condition of the rule.
The coverage of a rule is the percentage of records that satisfy the
condition.
The accuracy of a rule is the percentage of covered records that
satisfy the condition and the conclusion.
For example, a rule might cover 10/50 records (coverage 20%) of
which 8 are correct (accuracy 80%).
Classification: Rules
February 10, 2009
Slide 5
COMP527:
Data Mining
Rule Set vs Rule List
Rules can either be grouped as a set or an ordered list.
Set:
The rules make independent predictions.
Every record is covered by 0..1 rules (hopefully 1!)
R1. If flies=True and lays-eggs=True and lives-in-water=False then
Bird
R2. If flies=False and lives-in-water=True and lays-eggs=True then
Fish
R3. If blood-type=Warm and lays-eggs=False then Mammal
R4. If blood-type=Warm and lays-eggs=True then Reptile
Doesn’t matter which order we evaluate these rules.
Classification: Rules
February 10, 2009
Slide 6
COMP527:
Data Mining
Rule Set vs Rule List
List:
The rules make dependent predictions.
Every record is covered by 0..* rules (hopefully 1..*!)
R1. If flies=True and lays-eggs=True then Bird
R2. If blood-type=Warm and lays-eggs=False then Mammal
R3. If lives-in-water=True then Fish
R4. If lays-eggs=True then Reptile
Does matter which order we evaluate these rules.
If all records are covered by at least one rule, then rule set or list is
considered Exhaustive.
Classification: Rules
February 10, 2009
Slide 7
COMP527:
Data Mining
Constructing Rules-Based Classifiers
Covering approach: At each stage, a rule is found that covers
some instances.
“Separate and Conquer” -- Choose rule that identifies many
instances, separate them out, repeat.
But first a very very simple classifier called “1R”.
1R because the rules all test one particular attribute.
Classification: Rules
February 10, 2009
Slide 8
COMP527:
Data Mining
1R Classifier
Idea: Construct one rule for each attribute/value combination
predicting the most common class for that combination.
Example Data:
Outlook
sunny
sunny
overcast
rainy
rainy
rainy
overcast
sunny
sunny
rainy
sunny
overcast
overcast
rainy
Temperature
hot
hot
hot
mild
cool
cool
cool
mild
cool
mild
mild
mild
hot
mild
Humidity
high
high
high
high
normal
normal
normal
high
normal
normal
normal
high
normal
high
Classification: Rules
Windy
false
true
false
false
false
true
true
false
false
false
true
true
false
true
February 10, 2009
Play?
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
Slide 9
COMP527:
Data Mining
1R Classifier
Rules generated:
Attribute
Outlook
Temperature
Humidity
Windy
Rules
sunny » no
overcast » yes
rainy » yes
hot » no
mild » yes
cool » yes
high » no
normal » yes
false » yes
true » no
Errors
2/5
0/4
2/5
2/4 (random)
2/6
1/4
3/7
1/7
2/8
3/6 (random)
Total Errors
4/14
5/14
4/14
5/14
Now choose the attribute with the fewest errors. Randomly decide
on Outlook. So 1R will simply use the outlook attribute to predict
the class for new instances.
Classification: Rules
February 10, 2009
Slide 10
COMP527:
Data Mining
1R Algorithm
foreach attribute,
foreach value of that attribute,
find class distribution for attr/value
conc = most frequent class
make rule: attribute=value -> conc
calculate error rate of ruleset
select ruleset with lowest error rate
Almost not worth wasting a slide on, really!
Classification: Rules
February 10, 2009
Slide 11
COMP527:
Data Mining
PRISM Classifier
Instead of always looking to the full data set, after constructing
each rule we could remove the instances that the rule covers
before looking for a new rule.
Start with a high coverage rule, then increase its accuracy by
adding more conditions to it.
Want to maximise the accuracy of each rule: maximise the ratio of
positive instances/covered instances.
Finished adding conditions when p/t = 1, or no more instances to
look at
Classification: Rules
February 10, 2009
Slide 12
COMP527:
Data Mining
PRISM Classifier
Following Witten (pg 6, 108+)
If X then recommendation=hard
Find highest coverage ratio condition for X:
Age = Young
Age = Pre-presbyopic
Age = Presbyopic
Prescription = Myope
Prescription = Hypermetrope
Astigmatism = no
Astigmatism = yes
Tear-Production = Reduced
Tear-Production = Normal
2/8
1/8
1/8
3/12
1/12
0/12
4/12
0/12
4/12
Select astigmatism = yes
(arbitrarily over Tear-Production = Normal)
Classification: Rules
February 10, 2009
Slide 13
COMP527:
Data Mining
PRISM Classifier
This covers:
Age
Spectacle prescription
Astigmatism
Young
Young
Young
Young
Pre-presbyopic
Pre-presbyopic
Pre-presbyopic
Pre-presbyopic
Presbyopic
Presbyopic
Presbyopic
Presbyopic
Myope
Myope
Hypermetrope
Hypermetrope
Myope
Myope
Hypermetrope
Hypermetrope
Myope
Myope
Hypermetrope
Hypermetrope
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Tear production
rate
Reduced
Normal
Reduced
Normal
Reduced
Normal
Reduced
Normal
Reduced
Normal
Reduced
Normal
Recommended
lenses
None
Hard
None
hard
None
Hard
None
None
None
Hard
None
None
Now need to add another condition to make it more accurate.
If astigmatism = yes and X then recommendation = hard
Classification: Rules
February 10, 2009
Slide 14
COMP527:
Data Mining
PRISM Classifier
Best condition is Tear-Production = normal (4/6)
New rule: astigmatism=yes and tear-production = normal
But still some inaccuracy...
Age=Young (2/2) or Prescription = Myope (3/3) both have 100%
ratio in remaining instances. Choose the greater coverage.
If astigmatism = yes and tear-production = normal and
prescription = myope then recommendation = hard
Repeat the process, removing the instances covered by this rule.
Then repeat for all classes.
Classification: Rules
February 10, 2009
Slide 15
COMP527:
Data Mining
PRISM Classifier
Try with the other example data set. If X then play=yes
Outlook
sunny
sunny
overcast
rainy
rainy
rainy
overcast
sunny
sunny
rainy
sunny
overcast
overcast
rainy
Temperature
hot
hot
hot
mild
cool
cool
cool
mild
cool
mild
mild
mild
hot
mild
Humidity
high
high
high
high
normal
normal
normal
high
normal
normal
normal
high
normal
high
Windy
false
true
false
false
false
true
true
false
false
false
true
true
false
true
Play?
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
Outlook=overcast is (4/4) Already perfect, remove instances and
look again.
Classification: Rules
February 10, 2009
Slide 16
COMP527:
Data Mining
PRISM Classifier
With reduced dataset, if X then play=yes
Outlook
sunny
sunny
rainy
rainy
rainy
sunny
sunny
rainy
sunny
rainy
Temperature
hot
hot
mild
cool
cool
mild
cool
mild
mild
mild
Humidity
high
high
high
normal
normal
high
normal
normal
normal
high
Windy
false
true
false
false
true
false
false
false
true
true
Play?
no
no
yes
yes
no
no
yes
yes
yes
no
Sunny (2/5) rainy (3/5) hot (0/2) mild (3/5) cool (2/3) high (1/5) normal (4/5) false (4/6) true
(¼)
Select humidity=normal (4/5) and look for another rule as not
perfect
Classification: Rules
February 10, 2009
Slide 17
COMP527:
Data Mining
PRISM Classifier
If humidity=normal and X then play=yes
Outlook
rainy
rainy
sunny
rainy
sunny
Temperature
cool
cool
cool
mild
mild
Humidity
normal
normal
normal
normal
normal
Windy
false
true
false
false
true
Play?
yes
no
yes
yes
yes
If we could use 'and-not' we could have:
and-not (temperature=cool and windy=true)
But instead:
rainy (2/3), sunny (2/2), cool (2/3), mild (2/2), false(3/3), true
(1/2)
So we select windy=false to maximise t and add that to the rule.
Classification: Rules
February 10, 2009
Slide 18
COMP527:
Data Mining
PRISM Algorithm
for each class C
initialise E to the complete instance set
while E contains instances with class C
create empty rule R if X then C
until R is perfect (or no more attributes)
for each attribute A not in R, and each value v,
consider adding A=v to R
select A and v to maximise accuracy of p/t
add A=v to R
remove instances covered by R from E
Classification: Rules
February 10, 2009
Slide 19
COMP527:
Data Mining
Issues with PRISM
Overfitting. As we saw, we had 4/5 and then lost one positive
instance to lose the one negative instance. But with more
examples maybe it was 199/200 and we'd need to lose 40
positive to remove it... that's crazy.
Measure 2: Information Gain
p * (log(p/t) - log (P/T))
p/t as before (positive over total)
P and T are positive and total before new condition.
Emphasises large number of positive examples. Use this in PRISM
in place of maximising p/t.
Classification: Rules
February 10, 2009
Slide 20
COMP527:
Data Mining
Over-Fitting Avoidance
We could split the training set into a growing set and a pruning set.
Grow rules out using the first, and then try to cut the rules back with
the pruning set.
Two strategies:
Reduced-error pruning: Build full rule set then prune rules
Incremental reduced-error pruning: Simplify rules as built
Can re-split the data after each rule. Let's look at this one.
Classification: Rules
February 10, 2009
Slide 21
COMP527:
Data Mining
Incremental Reduced Error Pruning
initialise E to instance set
until E is empty:
split E into Grow and Prune (ratio 2:1)
for each class C in Grow
generate best rule for C
using Prune:
calc worth(R) and worth(R-finalCondition)
while worth(R-) > worth(R), prune rule
from rules for different classes, select
largest worth(R)
remove instances covered by rule
Classification: Rules
February 10, 2009
Slide 22
COMP527:
Data Mining
Rule Worth?
How can we generate the worth of the rules? (Witten 203)



( p + (N -n)) / T
 true positive + true negative / total number of instances
 positive + totalNegative - negativesCovered / totalInstances
 p=2000, t=3000 --> 1000 + N / T
 p=1000, t=1001
--> 999 + N / T
p/t
 Same problem as before p=1,t=1 vs p=1000,t=1001
Simple but intuitive algorithm for worth?
Classification: Rules
February 10, 2009
Slide 23
COMP527:
Data Mining
Issue with Grow/Prune Splitting
Say we have 1000 examples, and we split 2:1 for train/test
(666,334), then 2:1 for grow/prune (444,222) ... we're building
our rules on less than half of our data!
Depending on the dataset, classes may be absent from the training
set, or the distributions may be very wrong, or any number of
other statistical problems with random sampling to this degree.
Ameliorated in Incremental as re-split often. But might still want to
perform the algorithm several times and pick the best.
Classification: Rules
February 10, 2009
Slide 24
COMP527:
Data Mining
RIPPER
Rules-based classifier from industry.
If 2 classes, then learn rules for one and default the other
If more than 2 classes, start with smallest until you have 2.
Information Gain to grow rules
Measure for pruning: p-n / p + n
covered in pruning set)
(positive/negative examples
Uses 'Description Length' metric -- Ockham's Razor says that the
simplest solution is the best, so here the simplest rule set is the
best. (Not going into how to calculate this)
Classification: Rules
February 10, 2009
Slide 25
COMP527:
Data Mining
RIPPER Algorithm
Repeated Incremental Pruning to Produce Error Reduction
split E into Grow/Prune
BUILD:
repeat until no examples, or DL of ruleset >minDL(rulesets)+64, or error >50%
GROW: add conditions until rule is 100% by IG
PRUNE: prune last to first while worth metric W increases
for each rule R, for each class C:
split E into Grow/Prune
remove all instances from Prune covered by other rules
GROW and PRUNE two competing rules:
R1 is new rule built from scratch
R2 is generated by adding conditions to R
prune using worth metric A for reduced dataset
replace R by R, R1 or R2 with smallest DL
if uncovered instances of C, return to BUILD to make more rules
calculate DL for ruleset and ruleset with each rule omitted, delete any rule that
increases the DL
remove instances covered by rules generated
DL = Description Length, Metric W= p+1/t+2, Metric A=p+N-n/T
Classification: Rules
February 10, 2009
Slide 26
COMP527:
Data Mining
Rules with Exceptions
If we get more data after a ruleset has been generated, it might be
useful to add exceptions to rules.
If X then class1 unless Y then class2
Consider our humidity rule:
if humidity=normal then play=yes
unless temperature=cool and windy=true then play = no
Exceptions developed with the Induct system, called 'ripple-down
rules'
Classification: Rules
February 10, 2009
Slide 27
COMP527:
Data Mining
Further Reading

Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4

Dunham Section 4.6

Han, Section 6.5

Berry and Browne, Chapter 8
Classification: Rules
February 10, 2009
Slide 28