COMP527: Data Mining
Classification: Bayes
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
February 17, 2009
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Statistical Modeling
Bayes Rule
Naïve Bayes
Fixes to Naïve Bayes
Document classification
Bayesian Networks
Structure
Learning
Bayes Rule
The probability of hypothesis H, given evidence E:
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[H] = A Priori probability of H (before evidence seen)
Pr[H|E] = A Posteriori probability of H (after evidence seen)
We want to use this in a classification system, so our goal is to find
the most probable hypothesis (class) given the evidence (test
instance).
Example
Meningitis causes a stiff neck 50% of the time.
Meningitis occurs 1/50,000, stiff necks occur 1/20.
Probability of Meningitis, given that the patient has a stiff neck:
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN]
         = (0.5 * 1/50,000) / (1/20)
         = 0.0002
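As a quick check, the arithmetic can be reproduced in a few lines of Python (a minimal sketch; the variable names are mine, not from the slides):

```python
# Bayes rule: Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN]
pr_sn_given_m = 0.5       # meningitis causes a stiff neck 50% of the time
pr_m = 1 / 50_000         # prior probability of meningitis
pr_sn = 1 / 20            # prior probability of a stiff neck

print(pr_sn_given_m * pr_m / pr_sn)   # 0.0002
```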
Bayes Rule
Our evidence E is made up of different attributes A[1..n], so:
Pr[H|E] = Pr[A1|H] * Pr[A2|H] * ... * Pr[An|H] * Pr[H] / Pr[E]
So we need to work out the probability of the individual attributes
per class. Easy...
Outlook=Sunny appears twice among the 9 yes instances, so Pr[Outlook=Sunny | yes] = 2/9.
We can work these out for all of our training instances...
Weather Probabilities
Counts and relative frequencies for each attribute value, per class (play = yes / no):

Outlook:      Sunny     2, 3   (2/9, 3/5)
              Overcast  4, 0   (4/9, 0/5)
              Rainy     3, 2   (3/9, 2/5)
Temperature:  Hot       2, 2   (2/9, 2/5)
              Mild      4, 2   (4/9, 2/5)
              Cool      3, 1   (3/9, 1/5)
Humidity:     High      3, 4   (3/9, 4/5)
              Normal    6, 1   (6/9, 1/5)
Windy:        False     6, 2   (6/9, 2/5)
              True      3, 3   (3/9, 3/5)
Play:         Yes = 9 (9/14),  No = 5 (5/14)
Given a test instance (sunny, cool, high, true)
play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no:  3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
So we'd predict play=no for that particular instance.
Weather Probabilities
play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no:  3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
These are likelihoods, not probabilities; we need to normalise them.
Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
This normalisation is where the Pr[E] denominator of Bayes's rule disappears.
Nice. Surely there's more to it than this... ?
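A minimal Python sketch of the whole calculation, hard-coding the relative frequencies from the table above (the dictionary layout and names are illustrative, not from the slides):

```python
# Pr[attribute value | class] and class priors, read straight off the weather table
probs = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9, "prior": 9/14},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5, "prior": 5/14},
}
instance = ["sunny", "cool", "high", "true"]

# Likelihood per class: product of the attribute probabilities times the prior
likelihood = {}
for cls, p in probs.items():
    value = p["prior"]
    for attr_value in instance:
        value *= p[attr_value]
    likelihood[cls] = value

# Normalise so the values sum to 1 -- this is where Pr[E] disappears
total = sum(likelihood.values())
for cls, value in likelihood.items():
    print(cls, round(value, 4), round(value / total, 3))
# yes 0.0053 0.205
# no  0.0206 0.795
```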
Naïve Bayes
Issue: It's only valid to multiply probabilities when the events are
independent of each other. It is “naïve” to assume independence
between attributes in datasets, hence the name.
Eg: The probability of Liverpool winning a football match is not
independent of the probabilities for each member of the team
scoring a goal.
But even given that, Naïve Bayes is still very effective in practice,
especially if we can eliminate redundant attributes before
processing.
Naïve Bayes
Issue: If an attribute value does not co-occur with a class value, then the
probability generated for it will be 0.
Eg: Given outlook=overcast, the probability of play=no is 0/5. The other
attributes will be ignored as the final result will be multiplied by 0.
This is bad for our 4 attribute set, but horrific for (say) a 1000 attribute
set. You can easily imagine a case where the likelihood for all classes
is 0.
Eg: 'Viagra' is always spam, 'data mining' is never spam. An email with
both will be 0 for spam=yes and 0 for spam=no ... probability will be
undefined ... uh oh!
Laplace Estimator
The trivial solution is, of course, to adjust the probabilities so that you never get 0s. We add 1 to the numerator and 3 to the denominator (one for each of the three outlook values) to compensate.
So we end up with 1/8 instead of 0/5.
No reason to use 3; we could add 2 and 6. No reason to split equally, either... we could add weight to some attributes by giving them a larger share:
(a+3)/(n_a+6) * (b+2)/(n_b+6) * (c+1)/(n_c+6)
However, how to assign these is unclear.
For reasonable training sets, simply initialise counts to 1 rather than 0.
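A small sketch of the Laplace-smoothed estimate described above (the function name and signature are mine):

```python
def laplace_estimate(count, class_total, n_values, k=1):
    """Add k to the count for each value and k * n_values to the class total,
    so no conditional probability can ever be exactly zero."""
    return (count + k) / (class_total + k * n_values)

# outlook=overcast never occurs with play=no, so the raw estimate would be 0/5
print(laplace_estimate(0, 5, n_values=3))   # 0.125, i.e. 1/8 instead of 0/5
```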
Missing Values
Naïve Bayes deals well with missing values:
Training: ignore the missing value when counting for that attribute/class combination, but still use the instance for the attributes whose values are known.
Classification: ignore the attribute in the calculation; the difference will be normalised away during the final step anyway.
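In code, the classification side of this is just a matter of skipping unknown values; a sketch, assuming per-class probability tables like the earlier example and None marking a missing attribute:

```python
def class_likelihood(instance, class_probs, prior):
    """Multiply the class prior by Pr[value | class] for each known attribute,
    ignoring attributes whose value is missing (None)."""
    value = prior
    for attr_value in instance:
        if attr_value is None:        # missing value: leave this attribute out
            continue
        value *= class_probs[attr_value]
    return value

# e.g. the outlook of the test instance is unknown
print(class_likelihood([None, "cool", "high", "true"],
                       {"cool": 3/9, "high": 3/9, "true": 3/9}, 9/14))
```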
Numeric Values
Naïve Bayes does not deal well with numeric values without some help.
The probability of it being exactly 65 degrees is zero.
We could discretize the attribute, but instead we'll calculate the mean and
standard deviation and use a density function to predict the probability.
mean: sum(values) / count(values)
variance: sum(square(value - mean)) / (count(values) - 1)
standard deviation: square root of variance
Mean for temperature is 73, Std. Deviation is 6.2 (computed over the play=yes instances)
Numeric Values
Density function:
f(x) = ( 1 / (sqrt(2*pi) * sigma) ) * exp( -(x - mean)^2 / (2 * sigma^2) )
Unless you've a maths background, just plug the numbers in...
For a temperature of 66, for example, this gives a likelihood of 0.034.
Then we continue with this number as before.
This assumes a reasonably normal distribution. Often not the case.
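A sketch of the density calculation in Python; plugging in a temperature of 66 with the mean and standard deviation above reproduces the 0.034 likelihood:

```python
from math import sqrt, pi, exp

def gaussian_density(x, mean, stdev):
    """Normal density, used in place of Pr[attribute = x | class] for numeric attributes."""
    return (1 / (sqrt(2 * pi) * stdev)) * exp(-((x - mean) ** 2) / (2 * stdev ** 2))

print(round(gaussian_density(66, mean=73, stdev=6.2), 3))   # 0.034
```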
Document Classification
The Bayesian model is often used to classify documents as it deals
well with a huge number of attributes simultaneously. (eg
boolean occurrence of words within the text)
But we may know how many times the word occurs.
This leads to Multinomial Naive Bayes.
Assumptions:
1. Probability of a word occurring in a document is independent
of its location within the document.
2. The document length is not related to the class.
Document Classification
Pr[E|H] = N! * product( p_i^n_i / n_i! )
N   = number of words in the document
p_i = relative frequency of word i in documents of class H
n_i = number of occurrences of word i in the document
So, if A has 75% and B has 25% relative frequency in class H:
Pr[“A A A”|H] = 3! * (0.75^3 / 3!) * (0.25^0 / 0!)
              = 27/64
              = 0.422
Pr[“A A A B B”|H] = 5! * (0.75^3 / 3!) * (0.25^2 / 2!)
                  = 0.264
Document Classification
Pr[E|H] = N! * product( p_i^n_i / n_i! )
We don't need to work out all the factorials, as they'll normalise out
at the end.
We still end up with extremely small numbers, as real vocabularies are much larger than two words. To avoid underflow, we can sum the logarithms of the probabilities instead of multiplying them.
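A minimal sketch of the multinomial likelihood computed in log space, dropping the factorials as suggested (the two-word vocabulary and names are just the example above):

```python
from math import log, exp
from collections import Counter

def log_likelihood(words, word_probs):
    """Sum n_i * log(p_i) over the words in the document; the factorial terms
    are left out because they normalise out across classes."""
    counts = Counter(words)
    return sum(n * log(word_probs[w]) for w, n in counts.items())

word_probs_h = {"A": 0.75, "B": 0.25}
ll = log_likelihood(["A", "A", "A", "B", "B"], word_probs_h)
print(exp(ll))   # ~0.0264; multiplying by 5!/(3!*2!) = 10 gives the 0.264 above
```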
Bayesian Networks
Back to the attribute independence assumption. Can we get rid of
it?
Yes, with a Bayesian Network.
Each attribute has a node in a Directed Acyclic Graph (DAG).
Each node has a table giving the probability of each of its values, conditioned on the values of the nodes whose edges point into it.
The following examples will hopefully be enlightening...
Simple Network
play:
  yes   no
  .633  .367

outlook:
  play | sunny  overcast  rainy
  yes  | .238   .429      .333
  no   | .538   .077      .385

temperature:
  play | hot   mild  cold
  yes  | .238  .429  .333
  no   | .385  .385  .231

windy:
  play | false  true
  yes  | .350   .650
  no   | .583   .417

humidity:
  play | high  normal
  yes  | .350  .650
  no   | .750  .250
Less Simple Network
play:
  yes   no
  .633  .367

outlook:
  play | sunny  overcast  rainy
  yes  | .238   .429      .333
  no   | .538   .077      .385

temperature:
  play  outlook  | hot   mild  cold
  yes   sunny    | .238  .429  .333
  yes   overcast | .385  .385  .231
  yes   rainy    | .111  .556  .333
  no    sunny    | .556  .333  .111
  no    overcast | .333  .333  .333
  no    rainy    | .143  .429  .429

windy:
  play  outlook  | false  true
  yes   sunny    | .500   .500
  yes   overcast | .500   .500
  yes   rainy    | .125   .875
  no    sunny    | .375   .625
  no    overcast | .500   .500
  no    rainy    | .833   .167

humidity:
  play  temp | high  normal
  yes   hot  | .500  .500
  yes   mild | .500  .500
  yes   cool | .125  .875
  no    hot  | .833  .167
  no    mild | .833  .167
  no    cool | .250  .750
Bayesian Networks
To use the network, simply step through each node and multiply together the table entries that match the instance's attribute values.
Or, more likely, sum the logarithms as with the multinomial case.
Then, as before, normalise them to sum to 1.
This works because the links between the nodes determine the
probability distribution at the node.
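A sketch of that stepping process, encoding a few of the tables from the 'Less Simple Network' slide as Python dictionaries (the encoding is mine, not from the slides):

```python
# Each node: (list of parent attributes, table mapping (parent values..., own value) -> Pr)
network = {
    "play":    ([], {("yes",): 0.633, ("no",): 0.367}),
    "outlook": (["play"], {("yes", "sunny"): 0.238, ("yes", "overcast"): 0.429,
                           ("yes", "rainy"): 0.333, ("no", "sunny"): 0.538,
                           ("no", "overcast"): 0.077, ("no", "rainy"): 0.385}),
    "windy":   (["play", "outlook"], {("yes", "sunny", "false"): 0.500,
                                      ("yes", "sunny", "true"): 0.500,
                                      ("no", "rainy", "false"): 0.833,
                                      ("no", "rainy", "true"): 0.167}),   # subset shown
}

def network_likelihood(instance, network):
    """Multiply, over every node, the table entry that matches the instance's
    values for that node and its parents."""
    value = 1.0
    for node, (parents, table) in network.items():
        key = tuple(instance[p] for p in parents) + (instance[node],)
        value *= table[key]
    return value

print(network_likelihood({"play": "yes", "outlook": "sunny", "windy": "true"}, network))
```

In practice you would work in log space and normalise over the class values, exactly as described above.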
Using it seems straightforward. So all that remains is to find out the best
network structure to use. Given a large number of attributes, there's a
LARGE number of possible networks...
Training Bayesian Networks
We need two components:
- A way to evaluate a network based on the data.
  As always, we need a measure of 'goodness' that does not overfit (overfitting in this case = too many edges), so we need a penalty for the complexity of the network.
- A way to search through the space of possible networks.
  As we know the nodes, we need to find where the edges in the graph are: which nodes connect to which other nodes?
Training Bayesian Networks
Following the Minimum Description Length ideal, networks with lots
of edges will be more complex, and hence likely to over-fit.
We could add a penalty for each cell in the nodes' tables.
AIC:  -LL + K
MDL:  -LL + (K/2) * log(N)
LL is the total log-likelihood of the network given the training set, i.e. the sum of the logs of the probabilities the network assigns to each instance in the data set.
K is the number of cells in the nodes' tables, minus the number of cells in the last row (which can be calculated as 1 minus the sum of the other cells in the row).
N is the number of instances in the data.
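Both scores are trivial to compute once LL, K and N are known; a small sketch (names and numbers are mine, purely illustrative):

```python
from math import log

def aic_score(ll, k):
    """AIC: negative log-likelihood plus one unit of penalty per parameter counted in K."""
    return -ll + k

def mdl_score(ll, k, n):
    """MDL: negative log-likelihood plus (K/2) * log(N) for N training instances."""
    return -ll + (k / 2) * log(n)

# Lower is better; MDL penalises extra table cells more heavily as the data set grows.
print(aic_score(-120.0, k=18), mdl_score(-120.0, k=18, n=14))   # illustrative numbers
```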
Network Training: K2
K2:
  for each node (taken in a fixed order):
    for each node earlier in the order:
      tentatively add it as a parent and calculate the network's worth
      keep it only if the worth improves; stop adding parents when it doesn't
(Use MDL or AIC to determine worth)
The results of K2 depend on the initial order chosen to process the nodes in.
Run it several times with different orders and select the best network.
It can help to ensure that the class attribute comes first, so that it can link to all other nodes (not a requirement).
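A rough Python sketch of this greedy search, under simplifying assumptions: it scores a node by its log-likelihood minus a per-parent penalty rather than the full MDL/AIC score, data is a list of dicts, and all names are illustrative:

```python
from collections import defaultdict
from math import log

def node_log_likelihood(data, node, parents):
    """Log-likelihood contribution of one node for a candidate parent set,
    using maximum-likelihood estimates of the conditional probabilities."""
    joint, marginal = defaultdict(int), defaultdict(int)
    for row in data:                            # row: dict of attribute -> value
        pv = tuple(row[p] for p in parents)
        joint[(pv, row[node])] += 1
        marginal[pv] += 1
    return sum(c * log(c / marginal[pv]) for (pv, _), c in joint.items())

def k2_search(data, order, penalty=1.0):
    """For each node (in the given order), greedily add whichever earlier node
    most improves the penalised score; stop when no addition improves it."""
    parents = {node: [] for node in order}
    for i, node in enumerate(order):
        best = node_log_likelihood(data, node, [])
        while True:
            candidates = [c for c in order[:i] if c not in parents[node]]
            scored = [(node_log_likelihood(data, node, parents[node] + [c])
                       - penalty * (len(parents[node]) + 1), c) for c in candidates]
            if not scored or max(scored)[0] <= best:
                break                            # no improvement: stop adding parents
            best, chosen = max(scored)
            parents[node].append(chosen)
    return parents
```

As the slide notes, the result depends on the node order, so in practice you would run it with several different orderings (e.g. with the class attribute first) and keep the best-scoring structure.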
Other Structures
TAN: Tree Augmented Naive Bayes.
In Naive Bayes the class attribute is the only parent of each node. TAN starts from that structure and considers adding a second parent to each node.
Bayesian Multinet:
Build a separate network for each class and combine the values.
Further Reading
Witten 4.2, 6.7
Han 6.4
Dunham 4.2
Devijver and Kittler, Pattern Recognition: A Statistical Approach,
Chapter 2
Berry and Browne, Chapter 2