Data Classification by
Statistical Methods
Wei-Min Shen
Information Sciences Institute
University of Southern California
2/5/98
UCLA Data Mining Short Course (2)
Outline
• Basics of Probability Theory
• Naive Bayesian Classifier
• Boosting Algorithms
• Bayesian Networks
Bayesian Probability Theory
• What is probability?
– Subjective vs objective
• Simple and principled
– P(A|C) + P(~A|C) = 1
– P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC)
• Is there anything else?
  – P(A∨B|C) = 1 - P(~A~B|C) = …
Certainty and Impossibility
• If A is certain given C, then P(A|C) = 1
  – Assume B does not contradict C; then P(AB|C) = P(B|C) and P(A|BC) = P(A|C),
    so P(B|C) = P(A|C)P(B|C) by law 2, and hence P(A|C) = 1
• If A is impossible given C, then P(A|C) = 0
  – By law 1
The Product and Sum Laws
• P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC)
  – If A, B are independent, P(AB|C) = P(A|C)P(B|C)
  – If A ⇒ B, then P(B|AC) = 1 and P(A|C) ≤ P(B|C)
    • P(AB|C) = P(A|C)P(B|AC) = P(A|C)*1, or
    • P(AB|C) = min(P(A|C), P(B|C)), the "fuzzy" rule
• P(A∨B|C) = P(A|C) + P(B|C) - P(AB|C)
  – If A ⇒ B, then P(AB|C) = P(A|C) and P(A|C) ≤ P(B|C)
    • P(A∨B|C) = P(A|C) + P(B|C) - P(A|C) = P(B|C)
    • P(A∨B|C) = max(P(A|C), P(B|C)), the other "fuzzy" rule
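Below is a small numeric illustration of these laws (a sketch of mine, not from the slides): it encodes a hypothetical joint distribution over two Boolean propositions A and B, with the background knowledge C left implicit, and checks the sum law and the implication case.

    # Minimal numeric illustration of the sum and product laws on a small
    # hypothetical joint distribution over Boolean propositions A and B.
    joint = {(True, True): 0.30, (True, False): 0.00,   # P(A, ~B) = 0, so A => B
             (False, True): 0.30, (False, False): 0.40}

    def prob(event):
        """Probability of the set of (a, b) outcomes selected by event(a, b)."""
        return sum(p for (a, b), p in joint.items() if event(a, b))

    p_a, p_b = prob(lambda a, b: a), prob(lambda a, b: b)
    p_ab = prob(lambda a, b: a and b)
    p_a_or_b = prob(lambda a, b: a or b)

    # Sum law: P(A or B) = P(A) + P(B) - P(AB) = 1 - P(~A ~B)
    assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-9
    assert abs(p_a_or_b - (1 - prob(lambda a, b: (not a) and (not b)))) < 1e-9

    # Because A => B here: P(B|A) = 1, P(AB) = P(A) = min(P(A), P(B)),
    # and P(A or B) = P(B) = max(P(A), P(B)), matching the bullets above.
    assert abs(p_ab / p_a - 1.0) < 1e-9
    assert abs(p_ab - min(p_a, p_b)) < 1e-9 and abs(p_a_or_b - max(p_a, p_b)) < 1e-9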
Generalization of Classic Logic
• P(A|C)P(B|AC) = P(B|C)P(A|BC)
• Quantitative deductive inference
  – If A ⇒ B and A, then B
  – If A ⇒ B and ~B, then ~A
  – If A ⇒ B and B, then "A becomes more plausible" (abduction)
• Quantitative inductive inference
  – If A ⇒ B and ~A, then "B becomes less plausible"
  – If A ⇒ "B becomes more plausible" and B, then "A becomes more plausible"
Suitable for Mining
• P(A|BC) = P(A|C)P(B|AC) / P(B|C)
  – P(A|BC): your new knowledge of the theory A after experiment B, given the
    background knowledge C
  – P(A|C): what you know about A without B
  – P(B|AC): the probability of B if your current understanding of A were correct
  – P(B|C): the probability of observing B anyway
  – This is recursive: your new understanding of A becomes a part of C, which is
    then used to form a newer understanding of A (a small numeric sketch follows
    below)
• Caution: the recipe of “Rabbit Stew”
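As a hypothetical sketch of this recursion, the snippet below applies the rule repeatedly; the likelihood numbers are made up, and the only point is that each posterior becomes the prior for the next experiment.

    # Hypothetical sketch of recursive Bayesian updating: the belief in a theory
    # A is revised after each experiment B, and the posterior becomes part of
    # the background C for the next experiment.  All numbers are made up.
    def update(prior_a, p_b_given_a, p_b_given_not_a):
        """Return P(A|BC) = P(A|C) P(B|AC) / P(B|C)."""
        p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)   # P(B|C)
        return p_b_given_a * prior_a / p_b

    p_theory = 0.5                        # P(A|C): initial belief in the theory
    experiments = [(0.8, 0.3),            # each pair: (P(B|AC), P(B|~AC))
                   (0.7, 0.4),
                   (0.9, 0.2)]

    for p_b_a, p_b_not_a in experiments:
        p_theory = update(p_theory, p_b_a, p_b_not_a)
        print(f"belief in A after this experiment: {p_theory:.3f}")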
Concept Learning with
Probability Theory
• Given
– X the current information about the learning task
– Ai: “Ci is the target concept”
– Dj: “the instance j is in the target concept”
• Here is how
– P(Ai|DjX) = P(Ai|X)P(Dj|AiX) / P(Dj|X)
Naive Bayesian Classifier
• Let A1,…,Ak be attributes, [a1, …, ak] an example,
C a class to be predicted, then the optimal
prediction is class value c such that
P(C=c | A1=a1 … Ak=ak)
is maximal.
• By Bayes' rule, this equals
  P(A1=a1 … Ak=ak | C=c) * P(C=c) / P(A1=a1 … Ak=ak)
Naïve Bayesian Classifier
• P(C=c) is easy to estimate from training data
• P(A1=a1 … Ak=ak ) is irrelevant (same for all c)
• Compute P(A1=a1 … Ak=ak |C=c)
– assume the attributes are independent given the class (naïve)
– P(A1=a1|C=c) P(A2=a2|C=c) … P(Ak=ak|C=c)
– where each term can be estimated as
– P(Aj=aj|C=c) = count(Aj=aj ∧ C=c) / count(C=c)
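A minimal sketch of this estimate-and-multiply procedure (counting-based estimates with no smoothing; the function names are mine, and the PlayTennis data on the next slide can be passed in directly as (attributes, label) pairs):

    # Minimal naive Bayesian classifier: estimate P(C=c) and P(Aj=aj|C=c) by
    # counting, then predict the class that maximizes the product of the terms.
    from collections import Counter, defaultdict

    def train(examples):
        """examples: list of (attribute_tuple, class_label) pairs."""
        class_counts = Counter(c for _, c in examples)
        value_counts = defaultdict(Counter)     # (attribute index, class) -> value counts
        for attrs, c in examples:
            for j, a in enumerate(attrs):
                value_counts[(j, c)][a] += 1
        return class_counts, value_counts

    def predict(model, attrs):
        class_counts, value_counts = model
        n = sum(class_counts.values())
        def score(c):
            s = class_counts[c] / n                              # P(C=c)
            for j, a in enumerate(attrs):
                s *= value_counts[(j, c)][a] / class_counts[c]   # P(Aj=aj|C=c)
            return s
        return max(class_counts, key=score)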
Tennis Example Revisited
• [Day, Outlook, Temp, Humidity, Wind, PlayTennis]
– (D1   Sunny     Hot    High     Weak     No)
– (D2   Sunny     Hot    High     Strong   No)
– (D3   Overcast  Hot    High     Weak     Yes)
– (D4   Rain      Mild   High     Weak     Yes)
– (D5   Rain      Cool   Normal   Weak     Yes)
– (D6   Rain      Cool   Normal   Strong   No)
– (D7   Overcast  Cool   Normal   Strong   Yes)
– (D8   Sunny     Mild   High     Weak     No)
– (D9   Sunny     Cool   Normal   Weak     Yes)
– (D10  Rain      Mild   Normal   Weak     Yes)
– (D11  Sunny     Mild   Normal   Strong   Yes)
– (D12  Overcast  Mild   High     Strong   Yes)
– (D13  Overcast  Hot    Normal   Weak     Yes)
– (D14  Rain      Mild   High     Strong   No)
Estimated Probabilities
• P(Play=Yes | [Sunny, Hot, High, Weak]) = ?
• P(Play=No | [Sunny, Hot, High, Weak]) = ?   (winner)
  – P(Play=Yes) = 9/14
  – P(Play=No) = 5/14
  – P(Outlook=Sunny | Play=Yes) = 2/9
  – P(Outlook=Sunny | Play=No) = 3/5
  – P(Temp=Hot | Play=Yes) = 2/9
  – P(Temp=Hot | Play=No) = 2/5
  – P(Humidity=High | Play=Yes) = 3/9
  – P(Humidity=High | Play=No) = 4/5
  – P(Wind=Weak | Play=Yes) = 6/9
  – P(Wind=Weak | Play=No) = 2/5
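Multiplying out the two unnormalized scores confirms the winner; the shared denominator P([Sunny, Hot, High, Weak]) is omitted because it is the same for both classes.

    # Unnormalized naive Bayes scores for the query [Sunny, Hot, High, Weak],
    # using the estimates listed above.
    score_yes = (9/14) * (2/9) * (2/9) * (3/9) * (6/9)   # about 0.0071
    score_no  = (5/14) * (3/5) * (2/5) * (4/5) * (2/5)   # about 0.0274
    print("prediction:", "Yes" if score_yes > score_no else "No")   # prints No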
Boosting
• To learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
• For t =1 to T times:
– Learn the classifier Ht on the weighted training
examples
– Increase the weights of the training examples misclassified by Ht
The AdaBoost Algorithm
(Freund and Schapire 1995)
– Let xi ∈ X be an example and yi its class
– Assign equal weight to all N examples: w(1)i = 1/N
– For t = 1 through T do:
  • Given the weights w(t)i, obtain a hypothesis H(t): X → [0, 1].
  • Let the error of H(t) be ε(t) = Σi w(t)i |yi - H(t)(xi)|.
  • Let β(t) = ε(t) / (1 - ε(t)) and let w(t+1)i = w(t)i (β(t))^(1 - |yi - H(t)(xi)|).
  • Normalize the w(t+1)i to sum to 1.0.
– The final combination hypothesis H is:
  • H(x) = 1 / [1 + (Πt β(t))^(2r(x) - 1)], where
  • r(x) = [Σt (log 1/β(t)) H(t)(x)] / [Σt log 1/β(t)].
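A compact sketch of this loop, under a few assumptions of mine: examples are (x, y) pairs with y in {0, 1}, and the weak learner is supplied as a function learn(examples, weights) that returns a hypothesis h with h(x) in [0, 1].

    import math

    def adaboost(examples, learn, rounds):
        """Sketch of the AdaBoost loop above; returns the combined hypothesis H."""
        n = len(examples)
        w = [1.0 / n] * n                                   # w(1)_i = 1/N
        hypotheses, betas = [], []
        for _ in range(rounds):
            h = learn(examples, w)                          # weak hypothesis H(t)
            losses = [abs(y - h(x)) for x, y in examples]
            error = sum(wi * li for wi, li in zip(w, losses))        # epsilon(t)
            if error <= 0.0 or error >= 0.5:                # degenerate round: stop
                break
            beta = error / (1.0 - error)                             # beta(t)
            w = [wi * beta ** (1.0 - li) for wi, li in zip(w, losses)]
            total = sum(w)
            w = [wi / total for wi in w]                    # renormalize to sum to 1
            hypotheses.append(h)
            betas.append(beta)

        def combined(x):
            if not betas:                                   # no usable weak hypotheses
                return 0.5
            z = sum(math.log(1.0 / b) for b in betas)
            r = sum(math.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas)) / z
            return 1.0 / (1.0 + math.prod(betas) ** (2.0 * r - 1.0))
        return combined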
Boosted Naïve Bayesian Algorithm
(Elkan 97)
• Let w(t)i be the weight of example i at time t; then
  H(t)(x) = H(t)([a1,…,ak]) = argmaxc { P(t)(A1=a1|C=c) … P(t)(Ak=ak|C=c) P(t)(C=c) }
  where
  – P(t)(Aj=aj|C=c) = Σg w(t)g / Σh w(t)h, with g ranging over the examples for
    which Aj=aj and C=c, and h over the examples for which C=c
  – P(t)(C=c) = Σl w(t)l, with l ranging over the examples for which C=c
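A sketch of these weighted estimates (the names are mine; each training example is a triple (attribute_tuple, class, weight), with weights taken from the current boosting round):

    from collections import defaultdict

    # Weighted naive Bayes estimates for one boosting round: the counts used on
    # the earlier slides are simply replaced by sums of example weights.
    def weighted_model(examples):
        """examples: list of (attrs, c, weight) triples."""
        class_w = defaultdict(float)           # sum of weights of examples with C=c
        value_w = defaultdict(float)           # sum of weights with Aj=aj and C=c
        for attrs, c, w in examples:
            class_w[c] += w
            for j, a in enumerate(attrs):
                value_w[(j, a, c)] += w
        return class_w, value_w

    def classify(model, attrs):
        class_w, value_w = model
        def score(c):
            s = class_w[c]                                   # P(t)(C=c)
            for j, a in enumerate(attrs):
                s *= value_w[(j, a, c)] / class_w[c]         # P(t)(Aj=aj|C=c)
            return s
        return max(class_w, key=score)                       # argmax over classes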
Tennis Example Revisited
• [Day, Outlook, Temp, Humidity, Wind, PlayTennis], with the boosted weight of
  each example
  – (D1   Sunny     Hot    High     Weak     No)    1/28
  – (D2   Sunny     Hot    High     Strong   No)    3/28
  – (D3   Overcast  Hot    High     Weak     Yes)   1/28
  – (D4   Rain      Mild   High     Weak     Yes)   3/28
  – (D5   Rain      Cool   Normal   Weak     Yes)   1/28
  – (D6   Rain      Cool   Normal   Strong   No)    3/28
  – (D7   Overcast  Cool   Normal   Strong   Yes)   1/28
  – (D8   Sunny     Mild   High     Weak     No)    3/28
  – (D9   Sunny     Cool   Normal   Weak     Yes)   1/28
  – (D10  Rain      Mild   Normal   Weak     Yes)   3/28
  – (D11  Sunny     Mild   Normal   Strong   Yes)   1/28
  – (D12  Overcast  Mild   High     Strong   Yes)   3/28
  – (D13  Overcast  Hot    Normal   Weak     Yes)   1/28
  – (D14  Rain      Mild   High     Strong   No)    3/28
Estimate Probabilities Based on
Weighted Examples
• P(Play=Yes | [Sunny, Hot, High, Weak]) = ?
• P(Play=No | [Sunny, Hot, High, Weak]) = ?
– P(Play=Yes) = 15/28
– P(Play=No) = 13/28
– P(Outlook=Sunny | Play=Yes) = 2/15
– P(Outlook=Sunny | Play=No) = 7/13
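A quick recomputation of these numbers from the weighted table on the previous slide:

    # Outlook, PlayTennis and boosted weight for D1..D14, as in the table above.
    days = [("Sunny", "No", 1/28), ("Sunny", "No", 3/28), ("Overcast", "Yes", 1/28),
            ("Rain", "Yes", 3/28), ("Rain", "Yes", 1/28), ("Rain", "No", 3/28),
            ("Overcast", "Yes", 1/28), ("Sunny", "No", 3/28), ("Sunny", "Yes", 1/28),
            ("Rain", "Yes", 3/28), ("Sunny", "Yes", 1/28), ("Overcast", "Yes", 3/28),
            ("Overcast", "Yes", 1/28), ("Rain", "No", 3/28)]

    p_yes = sum(w for _, play, w in days if play == "Yes")                    # 15/28
    p_sunny_given_yes = sum(w for outlook, play, w in days
                            if outlook == "Sunny" and play == "Yes") / p_yes  # 2/15
    print(f"P(Play=Yes) = {p_yes:.4f}, P(Sunny|Yes) = {p_sunny_given_yes:.4f}")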
How Good Is the Boosted Naive Bayesian Classifier?
• Very Efficient:
– Suppose there are e examples over f attributes, each with v values; then the
  algorithm's complexity is O(ef), independent of v.
– Learning a decision tree without pruning requires O(ef²) time.
• Accuracy: won first place in the 1997 KDD Cup
• “Remarkably successful in practice, and no
uniformly better learning method is known.”
(Elkan97)
Bayesian Belief Networks
• Concise representation of conditional probabilities used in
Bayesian inference
• Plays the role of the model in data mining
– Can encode effects of actions
– Patterns may be probabilistic
• Can be learned
• Powerful representation language
– probabilistic propositional logic
– no quantifiers
– relations encoded by probabilistic correlation
Bayes Nets
• Model supports several forms of inference
– Diagnostic: from effects to causes
• If meningitis and whiplash both cause a stiff neck, and we observe a stiff
  neck, which cause is more likely?
• P(Men|Stiff) vs. P(Whip|Stiff)
• Useful in learning action models
• Useful in identifying hidden state
– Causal: from causes to effects
• What are the likely effects of contracting meningitis?
• P(Stiff|Men)
• Useful in predicting effects of actions
– Mixed: combination of the above.
More Probability Theory
• Pr(Raining|C)
– In logic, "Raining" is a Boolean variable
– In probability, "Raining" is a special case of a random variable (a binomial
  variable with probability p)
– General case: a random variable can take on a set of values, e.g.
  Pr(Precipitation = 0.5 in)
– We associate a probability density function (pdf) with the random variable;
  this defines all probabilities over that set of values
– f(Toothache) = { True --> 0.05, False --> 0.95 }
– f(Age) = Gaussian(45, 20)
Joint Distribution
• probability density function of a set of variables
– f(Toothache, Cavity):

                      Cavity    Not Cavity
    Toothache          .04         .01
    Not Toothache      .06         .89

    P(Cavity) = .1        P(~Cavity) = .9
• joint distribution expresses relationship between variables
– e.g., shoe size vs. reading level
• From joint, can derive all other probabilities
– marginal probabilities : P(A), P(B)
– conditional (from product rule): P(B|A) = P(AB)/P(A)
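A minimal sketch of deriving marginal and conditional probabilities from the joint above:

    # The joint distribution f(Toothache, Cavity) from the table above.
    joint = {(True, True): 0.04, (True, False): 0.01,
             (False, True): 0.06, (False, False): 0.89}

    # Marginals: sum out the other variable.
    p_cavity = sum(p for (_, cavity), p in joint.items() if cavity)       # 0.1
    p_ache = sum(p for (ache, _), p in joint.items() if ache)             # 0.05

    # Conditional, from the product rule: P(Cavity|Toothache) = P(both)/P(Toothache)
    p_cavity_given_ache = joint[(True, True)] / p_ache                    # 0.8
    print(round(p_cavity, 3), round(p_ache, 3), round(p_cavity_given_ache, 3))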
Example
    Cavity   Ache   Probe Catch   Probability
    T        T      T             0.03
    T        T      F             0.01
    T        F      T             0.045
    T        F      F             0.015
    F        T      T             0.0025
    F        T      F             0.0075
    F        F      T             0.2225
    F        F      F             0.6675
P(Probe) = 0.3, P(Cavity) = 0.1, P(Ache) = 0.05
P(Cavity | Probe and Ache) = ?
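The question can be answered directly from the joint, and the same code checks the conditional independence used on the next two slides:

    # The full joint over (Cavity, Ache, Probe) from the table above.
    joint = {(True, True, True): 0.03,    (True, True, False): 0.01,
             (True, False, True): 0.045,  (True, False, False): 0.015,
             (False, True, True): 0.0025, (False, True, False): 0.0075,
             (False, False, True): 0.2225, (False, False, False): 0.6675}

    def prob(event):
        return sum(p for outcome, p in joint.items() if event(*outcome))

    # P(Cavity | Probe and Ache) = P(Cavity, Ache, Probe) / P(Ache, Probe)
    posterior = prob(lambda c, a, pr: c and a and pr) / prob(lambda c, a, pr: a and pr)
    print(f"P(Cavity | Probe, Ache) = {posterior:.3f}")       # about 0.92

    # Probe and Ache are conditionally independent given Cavity:
    p_c = prob(lambda c, a, pr: c)
    lhs = prob(lambda c, a, pr: c and a and pr) / p_c
    rhs = (prob(lambda c, a, pr: c and a) / p_c) * (prob(lambda c, a, pr: c and pr) / p_c)
    assert abs(lhs - rhs) < 1e-9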
Conditional Independence
• Joint distribution can be big and cumbersome
– n Boolean variables --> 2^n entries
• Often don’t need the whole table!
• Definitions: A and B are independent if:
– Pr(AB) = P(A)P(B)
– Pr(A|B) = P(A) --> follows from product rule
• A and B are conditionally independent given C if:
– Pr(AB|C) = P(A|C)P(B|C), or Pr(A|BC) = Pr(A|C)
• Can use this to simplify joint distribution function
Example
• Toothache example
– To do inference might need all 8 conditional probabilities
– Probe and Ache are conditionally independent given Cavity
P(Probe | Cavity,Ache) = P(Probe | Cavity)
P(Ache | Cavity,Probe) = P(Ache | Cavity)
P(Ache,Probe | Cavity) = P(Ache | Cavity)P(Probe | Cavity)
• Can get away with fewer conditional probabilities
Intuition
• There is a relationship between conditional independence
and causality
– If A causes B and C
• There is a correlation between B and C
• This correlation is broken given A
– B and C are conditionally independent given A
– Shoe size and reading level are independent given Age since age is
the “direct cause” of both
• Important because
– people find it hard to recognize conditional independence
– but find it easy to state direct causes
Bayes Belief Networks
• Simple representation that “encodes” many conditional
independencies
– Each node is a random variable
– A has directed link to B if A is “direct cause” of B
• In statistical terms: Knowing all parents of B makes B conditionally
independent of everything else
– Each node has a conditional probability table that quantifies the
effects parents have on the node
– The graph has no directed cycles.
• The joint distribution is the following product:
– f(X1,…,Xn) = Πi P(Xi | Parents(Xi))
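A minimal sketch of that product for a tiny hypothetical network Cloudy → Rain → WetGrass; the structure and CPT numbers are made up for illustration.

    # Each node: (tuple of parents, CPT mapping parent values to P(node=True | parents)).
    network = {
        "Cloudy":   ((),          {(): 0.5}),
        "Rain":     (("Cloudy",), {(True,): 0.8, (False,): 0.2}),
        "WetGrass": (("Rain",),   {(True,): 0.9, (False,): 0.1}),
    }

    def joint(assignment):
        """P(X1,...,Xn) as the product of P(Xi | Parents(Xi))."""
        p = 1.0
        for node, (parents, cpt) in network.items():
            p_true = cpt[tuple(assignment[parent] for parent in parents)]
            p *= p_true if assignment[node] else 1.0 - p_true
        return p

    print(joint({"Cloudy": True, "Rain": True, "WetGrass": False}))   # 0.5 * 0.8 * 0.1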
Example
[Network diagram: Fraud, Age, and Sex are root nodes; Buys Gas has the single
parent Fraud; Buys Jewelry has parents Fraud, Age, and Sex.]
Conditional Independencies
P(A|F)=P(A), P(S|FA)=P(S), P(G|FAS)=P(G|F), P(J|FASG)=P(J|FAS)
P(fasgj) = P(f)P(asgj|f)
= P(f)P(a|f)P(sgj|af)
= P(f)P(a|f)P(s|af)P(gj|saf)
= P(f)P(a|f)P(s|af)P(g|saf)P(j|gsaf) ;; use conditional independencies
= P(f) P(a) P(s) P(g|f) P(j|saf)
Inference
• Inference in Bayes network:
– fix the value of a subset of variables (evidence variables)
– compute the posterior distribution of the remaining variables (the query
  variables)
– can do this for any arbitrary subset of variables
• Several algorithms have been proposed
• Easy if variables discrete and no undirected cycles
• NP-hard if there are undirected cycles
– in this case heuristic techniques are often effective
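The simplest exact method is inference by enumeration over the factored joint: fix the evidence, sum over all completions of the hidden variables, and normalize. The sketch below uses the same network encoding as the earlier Cloudy/Rain/WetGrass example and is exponential in the number of hidden variables, so it is practical only for small discrete networks.

    from itertools import product

    def enumerate_posterior(network, query, evidence):
        """Posterior P(query | evidence) for Boolean nodes, by brute-force enumeration."""
        hidden = [n for n in network if n != query and n not in evidence]
        scores = {}
        for q_val in (True, False):
            total = 0.0
            for values in product((True, False), repeat=len(hidden)):
                a = {**evidence, query: q_val, **dict(zip(hidden, values))}
                p = 1.0
                for node, (parents, cpt) in network.items():
                    p_true = cpt[tuple(a[parent] for parent in parents)]
                    p *= p_true if a[node] else 1.0 - p_true
                total += p
            scores[q_val] = total
        z = scores[True] + scores[False]
        return {v: s / z for v, s in scores.items()}

    # Diagnostic query on a two-node net Rain -> WetGrass (made-up numbers):
    net = {"Rain":     ((),        {(): 0.2}),
           "WetGrass": (("Rain",), {(True,): 0.9, (False,): 0.1})}
    print(enumerate_posterior(net, "Rain", {"WetGrass": True}))   # Rain is more likely now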