Probabilistic and Bayesian Analytics


Transcript: Probabilistic and Bayesian Analytics

Slide 1
Probabilistic and Bayesian
Analytics
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
[email protected]
Copyright © 2001, Andrew W. Moore
Probability
• The world is a very uncertain place
• 30 years of Artificial Intelligence and Database
research danced around this fact
• And then a few AI researchers decided to use
some ideas from the eighteenth century
2
What we’re going to do
• We will review the fundamentals of probability.
• It’s really going to be worth it
• You’ll see examples of probabilistic analytics in
action:
• Inference,
• Anomaly detection, and
• Bayes Classifiers
3
Discrete Random Variables
• A is a Boolean-valued random variable if A denotes
an event, and there is some degree of uncertainty
as to whether A occurs.
• Examples
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have influenza
4
Probabilities
• We write P(A) as “the probability that A is true”
• We could at this point spend 2 hours on the
philosophy of this.
• We’ll spend slightly less...
5
Sample Space
Definition 1.
The set, S, of all possible outcomes of a particular
experiment is called the sample space for the experiment
The elements of the
sample space are called
outcomes.
6
Sample Spaces
Sample space of a coin flip:
S = {H, T}
7
Sample Spaces
Sample space of a die roll:
S = {1, 2, 3, 4, 5, 6}
8
Sample Spaces
Sample space of three die rolls?
S = {111,112,113,…,
…,664,665,666}
9
Sample Spaces
Sample space of a single draw from a
deck of cards:
S={As,Ac,Ah,Ad,2s,2c,2h,…
…,Ks,Kc,Kd,Kh}
10
So Far…
Definition: The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,…,Ks,Kc,Kd,Kh}
Definition: An outcome is an element of the sample space. Example: 2c
11
Events
Definition 2.
An event is any subset of S (including S itself)
12
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Jack”
13
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Hearts”
14
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Red and Face”
15
Definitions
Definition: The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,…,Ks,Kc,Kd,Kh}
Definition: An outcome is a single point in the sample space. Example: 2c
Definition: An event is a set of outcomes from the sample space. Example: {2h,2c,2s,2d}
16
17
Events
Definition 3.
Two events A and B are mutually exclusive if A^B = Ø.
(Example: the events "hearts", "clubs", "spades", and "diamonds" are mutually exclusive.)
Definition 4.
If A1, A2, … are mutually exclusive and A1 ∪ A2 ∪ … = S,
then the collection A1, A2, … forms a partition of S.
Probability
Definition 5.
Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
* This definition of the domain of this function is not 100% sufficient, but it's close enough for our purposes… (I'm sparing you Borel Fields)
18
Definitions
Definition: The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,…,Ks,Kc,Kd,Kh}
Definition: An outcome is a single point in the sample space. Example: 4c
Definition: An event is a set of one or more outcomes. Example: Card is "Red"
Definition: P(E) maps event E to a real number and satisfies the axioms of probability. Example: P(Red) = 0.50, P(Black) = 0.50
19
Misconception
• The relative area of the events determines their probability
  • …in a Venn diagram it does, but not in general.
  • However, the "area equals probability" rule is guaranteed to result in axiom-satisfying probability functions.
• We often assume, for example, that the probability of "heads" is equal to "tails" in the absence of other information…
• But this is totally outside the axioms!
20
Creating a Valid P()
• One convenient way to create an axiom-satisfying probability function:
1. Assign a probability to each outcome in S
2. Make sure they sum to one
3. Declare that P(A) equals the sum of the probabilities of the outcomes in event A
21
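To make the recipe above concrete, here is a minimal Python sketch (the outcome set and its probabilities are invented for illustration): assign a probability to each outcome, check that they sum to one, and compute P(A) as the sum over the outcomes in A.

```python
# A sketch of the "Creating a Valid P()" recipe.
# The sample space and probabilities below are made up for illustration.

outcome_probs = {          # step 1: assign a probability to each outcome in S
    "heads": 0.5,
    "tails": 0.5,
}

assert abs(sum(outcome_probs.values()) - 1.0) < 1e-9   # step 2: they must sum to one

def P(event):
    """Step 3: P(A) is the sum of the probabilities of the outcomes in event A."""
    return sum(outcome_probs[o] for o in event)

print(P({"heads"}))            # 0.5
print(P({"heads", "tails"}))   # P(S) = 1.0
```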
Everyday Example
Assume you are a doctor.
This is the sample space of
“patients you might see on
any given day”.
Outcomes
Non-smoker, female, diabetic,
headache, good insurance, etc…
Smoker, male, herniated disk,
back pain, mildly schizophrenic,
delinquent medical bills, etc…
22
Everyday Example
Number of elements in the
“patient space”:
100 jillion
Are these patients equally likely to occur?
Again, generally not. Let’s assume for the
moment that they are, though.
…which roughly means “area equals probability”
23
Everyday Example
Event: Patient has Flu
F
Size of set “F”:
2 jillion
(Exactly 2 jillion of the points in
the sample space have flu.)
Size of “patient space”:
100 jillion
P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
24
Everyday Example
F
P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
From now on, the subscript on P() will
be omitted…
25
These Axioms are Not to be
Trifled With
• There have been attempts to do different methodologies for uncertainty
  • Fuzzy Logic
  • Three-valued logic
  • Dempster-Shafer
  • Non-monotonic reasoning
• But the axioms of probability are the only system
with this property:
If you gamble using them you can't be unfairly exploited by an opponent
using some other system [de Finetti 1931]
26
Theorems from the Axioms
Axioms
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
Theorem.
If P is a probability function and A is an event in S, then
P(~A) = 1 – P(A)
Proof:
(1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1
(2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A)
Combining (1) and (2) gives the result
27
Multivalued Random
Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on
exactly one value out of {A1,A2, ... Ak}, and
• The events {A1,A2,…,Ak} partition S, so
P(Ai, Aj) = 0 if i ≠ j
P(A1 ∨ A2 ∨ … ∨ Ak) = 1
28
Elementary Probability in
Pictures
P(~A) + P(A) = 1
[Venn diagram: the events A and ~A]
29
Elementary Probability in
Pictures
P(B) = P(B, A) + P(B, ~A)
[Venn diagram: event B overlapping A and ~A]
30
Elementary Probability in
Pictures
Σ_{j=1}^{k} P(Aj) = 1
[Venn diagram: the events A1, A2, A3 partition the sample space]
31
Elementary Probability in
Pictures
P(B) = Σ_{j=1}^{k} P(B, Aj)
[Venn diagram: B overlaps each of A1, A2, A3]
Useful!
32
Conditional Probability
Assume once more that you
are a doctor.
Again, this is the sample
space of “patients you might
see on any given day”.
33
Conditional Probability
Event: Flu
P(F) = 0.02
F
34
Conditional Probability
Event: Headache
H
F
P(H) = 0.10
35
Conditional Probability
P(F) = 0.02
P(H) = 0.10
H
F
…we still need to
specify the interaction
between flu and
headache…
Define
P(H|F) = Fraction of F’s outcomes which are also in H
36
Conditional Probability
[Venn diagram of H and F with region probabilities 0.89, 0.09, 0.01, 0.01]
H = “headache”
F = “flu”
P(F) = 0.02
P(H) = 0.10
P(H|F) = 0.50
37
Conditional Probability
P(H|F) = Fraction of flu worlds in
which patient has a headache
[Venn diagram of H and F with region probabilities 0.89, 0.09, 0.01, 0.01]
H = "headache"
F = "flu"
P(H|F) = (#worlds with flu and headache) / (#worlds with flu)
= (size of "H and F" region) / (size of "F" region)
= P(H, F) / P(F)
38
Conditional Probability
39
Definition.
If A and B are events in S, and P(B) > 0, then the
conditional probability of A given B, written P(A|B), is
P(A | B) = P(A, B) / P(B)
The Chain Rule
A simple rearrangement of the above equation yields
P(A, B) = P(A | B) P(B)
Main Bayes Net concept!
Probabilistic Inference
H = “Have a headache”
F = “Coming down with
Flu”
H
F
P(H) = 0.10
P(F) = 0.02
P(H|F) = 0.50
One day you wake up with a headache. You think: “Drat!
50% of flus are associated with headaches so I must have a
50-50 chance of coming down with flu”
Is this reasoning good?
40
Probabilistic Inference
41
H = “Have a headache”
F = “Coming down with
Flu”
H
F
P(H) = 0.10
P(F) = 0.02
P(H|F) = 0.50
P(F | H) = P(F, H) / P(H) = P(H | F) P(F) / P(H) = (0.50)(0.02) / 0.1 = 0.10
What we just did…
P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
42
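As a quick sanity check of this slide's arithmetic, here is a minimal Python sketch of Bayes rule applied to the headache/flu numbers above (the variable names are my own):

```python
# Bayes rule on the headache/flu example from the slides.
p_H = 0.10          # P(headache)
p_F = 0.02          # P(flu)
p_H_given_F = 0.50  # P(headache | flu)

# P(F | H) = P(H | F) P(F) / P(H)
p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)  # 0.1
```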
More General Forms of Bayes
Rule
P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | ~A) P(~A)]
P(A | B, C) = P(B | A, C) P(A, C) / P(B, C)
43
More General Forms of Bayes
Rule
P(Ai | B) = P(B | Ai) P(Ai) / Σ_{k=1}^{nA} P(B | Ak) P(Ak)
44
45
Independence
Definition.
Two events, A and B, are statistically independent if
P(A, B) = P(A) P(B)
which is equivalent to
P(A | B) = P(A)
Important for
Bayes Nets
Representing P(A,B,C)
• How can we represent the function P(A)?
• P(A,B)?
• P(A,B,C)?
46
The Joint
Probability Table
Recipe for making a joint distribution
of M variables:
1. Make a truth table listing all
combinations of values of your
variables (if there are M boolean
variables then the table will have
2^M rows).
2. For each combination of values,
say how probable it is.
3. If you subscribe to the axioms of
probability, those numbers must
sum to 1.
47
Example: P(A, B, C)
A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
[Venn diagram of events A, B, C with region probabilities matching the table]
48
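A minimal Python sketch of this joint probability table, stored as a dictionary keyed by (A, B, C) value tuples (the representation is my choice, not something prescribed by the slides):

```python
# The example joint P(A, B, C) from the slide, as a lookup table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # the rows must sum to 1
```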
Using the
Joint
Once you have the JPT you can ask for the probability of any logical expression
P(E) = Σ_{rows matching E} P(row)
…what is P(Poor, Male)?
49
Using the
Joint
P(Poor, Male) = 0.4654
P(E) = Σ_{rows matching E} P(row)
…what is P(Poor)?
50
Using the
Joint
P(Poor) = 0.7604
P(E) = Σ_{rows matching E} P(row)
…what is P(Poor|Male)?
51
Inference
with the
Joint
P(E1 | E2) = P(E1, E2) / P(E2) = [Σ_{rows matching E1 and E2} P(row)] / [Σ_{rows matching E2} P(row)]
52
Inference
with the
Joint
P(E1 | E2) = P(E1, E2) / P(E2) = [Σ_{rows matching E1 and E2} P(row)] / [Σ_{rows matching E2} P(row)]
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
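Here is a small Python sketch of "inference with the joint": sum the matching rows of a joint table to get P(E1 | E2). It reuses the toy P(A, B, C) table from earlier rather than the census joint, since that table's entries are not listed on the slides.

```python
# Inference with a joint probability table: P(E1 | E2).
# Events are given as predicates over a row (a tuple of attribute values).

joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of P(row) over rows matching E."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1, E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

# Example: P(A=1 | C=1) from the toy table.
print(cond_prob(lambda r: r[0] == 1, lambda r: r[2] == 1))
```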
Inference is a big deal
• I’ve got this evidence. What’s the chance that this
conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse
is already asleep?
• There’s a thriving set of industries growing based around
Bayesian Inference. Highlights are: Medicine, Pharma,
Help Desk Support, Engine Fault Diagnosis
53
Where do Joint Distributions
come from?
54
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some
algebra
Example: Suppose you knew
P(A) = 0.5
P(B|A) = 0.2
P(B|~A) = 0.1
P(C|A,B) = 0.1
P(C|A,~B) = 0.8
P(C|~A,B) = 0.3
P(C|~A,~B) = 0.1
Then you can automatically
compute the JPT using the
chain rule
P(A,B,C) = P(A) P(B|A) P(C|A,B)
Bayes Nets are a
systematic way to
do this.
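A Python sketch of "Idea Two": building the full joint from the simpler probabilistic facts on this slide via the chain rule P(A,B,C) = P(A) P(B|A) P(C|A,B).

```python
# Build the joint P(A, B, C) from P(A), P(B|A), P(C|A,B) using the chain rule.
p_A = {True: 0.5, False: 0.5}                       # P(A) = 0.5
p_B_given_A = {True: 0.2, False: 0.1}               # P(B | A), P(B | ~A)
p_C_given_AB = {(True, True): 0.1, (True, False): 0.8,
                (False, True): 0.3, (False, False): 0.1}   # P(C | A, B)

joint = {}
for a in (True, False):
    for b in (True, False):
        for c in (True, False):
            pb = p_B_given_A[a] if b else 1 - p_B_given_A[a]
            pc = p_C_given_AB[(a, b)] if c else 1 - p_C_given_AB[(a, b)]
            joint[(a, b, c)] = p_A[a] * pb * pc     # chain rule

assert abs(sum(joint.values()) - 1.0) < 1e-9
```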
Where do Joint Distributions
come from?
• Idea Three: Learn them from data!
Prepare to witness an impressive learning algorithm….
55
56
Learning a JPT
Build a Joint Probability table for your attributes in which the probabilities are unspecified. Then fill in each row with
P̂(row) = (# records matching row) / (total number of records)

Empty table to fill in:
A  B  C  Prob
0  0  0  ?
0  0  1  ?
0  1  0  ?
0  1  1  ?
1  0  0  ?
1  0  1  ?
1  1  0  ?
1  1  1  ?

Learned table:
A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

(The entry for A=1, B=1, C=0 is the fraction of all records in which A and B are True but C is False.)
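A minimal Python sketch of this learning algorithm: count how many records match each row and divide by the total (the toy dataset here is invented).

```python
from collections import Counter

# Toy dataset: each record is a tuple of attribute values (A, B, C).
records = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]

# P-hat(row) = (# records matching row) / (total number of records)
counts = Counter(records)
jpt = {row: n / len(records) for row, n in counts.items()}

print(jpt[(1, 1, 0)])  # fraction of records with A=1, B=1, C=0  ->  2/6
```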
Example of Learning a JPT
• This JPT was
obtained by
learning from
three attributes
in the UCI
“Adult” Census
Database
[Kohavi 1995]
57
Where are we?
• We have recalled the fundamentals of probability
• We have become content with what JPTs are and
how to use them
• And we even know how to learn JPTs from data.
58
Density Estimation
• Our Joint Probability Table (JPT) learner is our first
example of something called Density Estimation
• A Density Estimator learns a mapping from a set of
attributes to a Probability
Input Attributes → Density Estimator → Probability
59
Evaluating a density
estimator
• Given a record x, a density estimator M can tell you how likely the record is:
P̂(x | M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is:
(Under the assumption that all records were independently generated from the probability function)
P̂(dataset | M) = P̂(x1, x2, …, xR | M) = Π_{k=1}^{R} P̂(xk | M)
60
A small dataset: Miles Per
Gallon
192 Training Set Records
[Table: each record has attributes mpg (good/bad), modelyear (70to74/75to78/79to83), and maker (america/asia/europe)]
From the UCI repository (thanks to Ross Quinlan)
61
A small dataset: Miles Per
Gallon
192 Training Set Records
[Same mpg / modelyear / maker table as the previous slide]
62
A small dataset: Miles Per
Gallon
192 Training Set Records
[Same mpg / modelyear / maker table as the previous slide]
P̂(dataset | M) = P̂(x1, x2, …, xR | M) = Π_{k=1}^{R} P̂(xk | M) ≈ 3.4 × 10^-203
63
Log Probabilities
Since probabilities of datasets get so
small we usually use log probabilities
log P̂(dataset | M) = log Π_{k=1}^{R} P̂(xk | M) = Σ_{k=1}^{R} log P̂(xk | M)
64
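A small Python sketch of evaluating a density estimator on a dataset, both as a raw product of probabilities and as a sum of log probabilities (the density estimator here is just a made-up JPT over a toy dataset):

```python
import math

# A toy "density estimator": a JPT over two Boolean attributes.
jpt = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
dataset = [(0, 0), (1, 1), (0, 1), (0, 0)]

# P(dataset | M) = product of P(x_k | M)  -- underflows for large datasets
likelihood = math.prod(jpt[x] for x in dataset)

# log P(dataset | M) = sum of log P(x_k | M)  -- numerically safe
log_likelihood = sum(math.log(jpt[x]) for x in dataset)

print(likelihood, log_likelihood)
```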
A small dataset: Miles Per
Gallon
192 Training Set Records
[Same mpg / modelyear / maker table as the previous slide]
log P̂(dataset | M) = log Π_{k=1}^{R} P̂(xk | M) = Σ_{k=1}^{R} log P̂(xk | M) = -466.19
65
Summary: The Good News
The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2)
Automatic Doctor, Recommender, etc
• Can do anomaly detection
spot suspicious / incorrect records
(e.g., credit card fraud)
• Can do Bayesian classification
Predict the class of a record
(e.g., predict cancerous/not-cancerous)
66
Summary: The Bad News
• Density estimation with JPTs is trivial, mindless
and dangerous
67
Using a test set
An independent test set with 196 cars has a much worse log
likelihood than it had on the training set
(actually it’s a billion quintillion quintillion quintillion quintillion
times less likely)
….Density estimators can overfit. And the JPT estimator is the
overfittiest of them all!
68
Overfitting Density
Estimators
If this ever happens, it means
there are certain combinations
that we learn are “impossible”
69
Using a test set
The only reason that our test set didn’t score -infinity is that
Andrew’s code is hard-wired to always predict a probability of at
least one in 10^20
We need Density Estimators that are less prone
to overfitting
70
Is there a better way?
The problem with the JPT is that it just mirrors
the training data.
In fact, it is just another way of storing the data: we could
reconstruct the original dataset perfectly from it!
We need to represent the probability function
with fewer parameters…
71
72
Aside:
Bayes Nets
Bayes Nets
• What are they?
• Bayesian nets are a framework for representing and analyzing
models involving uncertainty
• What are they used for?
• Intelligent decision aids, data fusion, 3-E feature recognition,
intelligent diagnostic aids, automated free text understanding,
data mining
• How are they different from other knowledge
representation and probabilistic analysis
tools?
• Uncertainty is handled in a mathematically rigorous yet efficient
and simple way
73
Bayes Net Concepts
1.Chain Rule
P(A,B) = P(A) P(B|A)
2.Conditional Independence
P(A|B,C) = P(A|B)
74
A Simple Bayes Net
• Let’s assume that we already have P(Mpg,Horse)
P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48
How would you rewrite this using the Chain rule?
75
76
Review: Chain Rule
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48

P(Mpg):
P(good) = 0.4
P(bad)  = 0.6

P(Horse | Mpg):
P(low | good)  = 0.89
P(low | bad)   = 0.21
P(high | good) = 0.11
P(high | bad)  = 0.79
77
Review: Chain Rule
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(good, low)  = P(good) * P(low | good)  = 0.4 * 0.89
P(good, high) = P(good) * P(high | good) = 0.4 * 0.11
P(bad, low)   = P(bad)  * P(low | bad)   = 0.6 * 0.21
P(bad, high)  = P(bad)  * P(high | bad)  = 0.6 * 0.79

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
[Bayes net: Mpg → Horse]
78
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
[Bayes net: Mpg → Horse]
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.90, P(low | bad) = 0.21, P(high | good) = 0.10, P(high | bad) = 0.79
79
How to Make a Bayes Net
• Each node is a probability function
• Each arc denotes conditional dependence
[Bayes net: Mpg → Horse]
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.90, P(low | bad) = 0.21, P(high | good) = 0.10, P(high | bad) = 0.79
80
How to Make a Bayes Net
So, what have we
accomplished thus far?
Nothing;
we’ve just “Bayes Net-ified” the
P(Mpg, Horse) JPT using the
Chain rule.
…the real excitement starts
when we wield conditional
independence
[Bayes net: Mpg → Horse, with CPTs P(Mpg) and P(Horse|Mpg)]
81
How to Make a Bayes Net
82
Before we continue, we need a worthier opponent
than puny P(Mpg, Horse)…
We’ll use P(Mpg, Horse, Accel):
* Note: I made these up…
P(Mpg, Horse, Accel):
P(good, low, slow)  = 0.37
P(good, low, fast)  = 0.01
P(good, high, slow) = 0.02
P(good, high, fast) = 0.00
P(bad, low, slow)   = 0.10
P(bad, low, fast)   = 0.12
P(bad, high, slow)  = 0.16
P(bad, high, fast)  = 0.22
How to Make a Bayes Net
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
Note:
Obviously, we could have written this 3!=6 different ways…
P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
= P(M) * P(A|M) * P(H|M,A)
= P(H) * P(M|H) * P(A|H,M)
= P(H) * P(A|H) * P(M|H,A)
=…
=…
83
How to Make a Bayes Net
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
[Bayes net: Mpg → Horse; Mpg and Horse → Accel]
84
How to Make a Bayes Net
[Bayes net: Mpg → Horse; Mpg and Horse → Accel, with CPTs P(Mpg), P(Horse|Mpg), P(Accel|Mpg,Horse)]
85
How to Make a Bayes Net
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.90, P(low | bad) = 0.21, P(high | good) = 0.10, P(high | bad) = 0.79
P(Accel | Mpg, Horse):
P(slow | good, low)  = 0.97
P(slow | good, high) = 0.15
P(slow | bad, low)   = 0.90
P(slow | bad, high)  = 0.05
P(fast | good, low)  = 0.03
P(fast | good, high) = 0.85
P(fast | bad, low)   = 0.10
P(fast | bad, high)  = 0.95
* Note: I made these up too…
86
How to Make a Bayes Net
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Mpg, Horse): P(slow | good, low) = 0.97, P(slow | good, high) = 0.15, P(slow | bad, low) = 0.90, P(slow | bad, high) = 0.05, P(fast | good, low) = 0.03, P(fast | good, high) = 0.85, P(fast | bad, low) = 0.10, P(fast | bad, high) = 0.95

A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)
87
How to Make a Bayes Net
[Bayes net: Mpg → Horse → Accel]
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.22, P(slow | high) = 0.64, P(fast | low) = 0.78, P(fast | high) = 0.36
88
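To make the factorization concrete, here is a Python sketch that rebuilds the full joint P(Mpg, Horse, Accel) from the three CPTs above using P(Mpg) P(Horse|Mpg) P(Accel|Horse):

```python
# CPTs from the Mpg -> Horse -> Accel Bayes net above.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {("low", "good"): 0.89, ("low", "bad"): 0.21,
                     ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel_given_horse = {("slow", "low"): 0.22, ("slow", "high"): 0.64,
                       ("fast", "low"): 0.78, ("fast", "high"): 0.36}

# Joint via the factorization P(M, H, A) = P(M) P(H|M) P(A|H).
joint = {}
for m in p_mpg:
    for h in ("low", "high"):
        for a in ("slow", "fast"):
            joint[(m, h, a)] = (p_mpg[m]
                                * p_horse_given_mpg[(h, m)]
                                * p_accel_given_horse[(a, h)])

assert abs(sum(joint.values()) - 1.0) < 1e-9
print(joint[("good", "low", "slow")])  # 0.4 * 0.89 * 0.22
```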
How to Make a Bayes Net
Thank you, domain expert!
Now I only need to learn 5 parameters instead of 7 from my data!
My parameter estimates will be more accurate as a result!

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.22, P(slow | high) = 0.64, P(fast | low) = 0.78, P(fast | high) = 0.36
89
Independence
“The Acceleration does not depend on the Mpg once I know the
Horsepower.”
This can be specified very simply:
P(Accel Mpg, Horse) = P(Accel | Horse)
This is a powerful statement!
It required extra domain knowledge. A different kind of knowledge than
numerical probabilities. It needed an understanding of causation.
90
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented
directed acyclic graph, represented by the pair ⟨V, E⟩ where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops
of any length are allowed.
Each vertex in V contains the following information:
• A Conditional Probability Table (CPT) indicating how
this variable’s probabilities depend on all possible
combinations of parental values.
91
Bayes Nets Summary
• Bayes nets are a factorization of the full JPT which
uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick,
cheap lookups of probabilities)
92
The good news
We can do inference.
We can compute any conditional probability:
P(Some variable | Some other variable values)
P(E1 | E2) = P(E1 ∧ E2) / P(E2)
           = [Σ_{joint entries matching E1 and E2} P(joint entry)] / [Σ_{joint entries matching E2} P(joint entry)]
93
The good news
We can do inference.
We can compute any conditional probability:
P(Some variable | Some other variable values)
P(E1 | E2) = P(E1 ∧ E2) / P(E2)
           = [Σ_{joint entries matching E1 and E2} P(joint entry)] / [Σ_{joint entries matching E2} P(joint entry)]
Suppose you have m binary-valued variables in your Bayes
Net and expression E2 mentions k variables.
How much work is the above computation?
94
The sad, bad news
Doing inference "JPT-style" by enumerating all matching entries in the joint is expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.
• So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:
General querying of Bayes nets is NP-complete.
95
Case Study I
Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT
Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
  • 8 hours to determine variables.
  • 35 hours for net topology.
  • 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
96
Bayes Net Info
GUI Packages:
• Genie -- Free
• Netica -- $$
• Hugin -- $$
Non-GUI Packages:
• All of the above have APIs
• BNT for MATLAB
• AUTON code (learning extremely large networks of tens
of thousands of nodes)
97
98
Bayes Nets and
Machine Learning
Machine Learning Tasks
Evidence e1, Missing Variables e2 → Inference Engine → P(e2 | e1)
Data point x → Anomaly Detector → P(x)
Data point x → Classifier → P(C | x)
99
What is an Anomaly?
• An irregularity that cannot be explained by
simple domain models and knowledge
• Anomaly detection only needs to learn from
examples of “normal” system behavior.
• Classification, on the other hand, would need examples
labeled “normal” and “not-normal”
100
Anomaly Detection in
Practice
• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or
attacks.
• Looking for suspicious activity in bank transactions
• Detecting unusual eBay selling/buying behavior.
101
102
JPT
Anomaly Detector
• Suppose we have
the following model:
P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48
• We're trying to detect anomalous cars.
• If the next example we see is <good,high>, how anomalous is it?
103
JPT
Anomaly Detector
How likely is <good,high>?
P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48
likelihood(good, high) = P(good, high) = 0.04
Could not be easier! Just look up the entry in the JPT!
Smaller numbers are more anomalous in that the model is more surprised to see them.
104
Bayes Net
Anomaly Detector
How likely is <good,high>?
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.90, P(low | bad) = 0.21, P(high | good) = 0.10, P(high | bad) = 0.79
likelihood(good, high) = P(good, high) = P(good) P(high | good) = 0.04
105
Bayes Net
Anomaly Detector
How likely is <good,high>?
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.90, P(low | bad) = 0.21, P(high | good) = 0.10, P(high | bad) = 0.79
likelihood(good, high) = P(good, high) = P(good) P(high | good) = 0.04
Again, trivial! We need to do one tiny lookup for each variable in the network!
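A tiny Python sketch of the Bayes-net anomaly detector above: the likelihood of a record is just a product of one CPT lookup per variable.

```python
# Likelihood of a record under the Mpg -> Horse net: one lookup per variable.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {("low", "good"): 0.90, ("low", "bad"): 0.21,
                     ("high", "good"): 0.10, ("high", "bad"): 0.79}

def likelihood(mpg, horse):
    return p_mpg[mpg] * p_horse_given_mpg[(horse, mpg)]

print(likelihood("good", "high"))  # 0.4 * 0.10 = 0.04  -> fairly anomalous
```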
Machine Learning Tasks
Evidence e1, Missing Variables E2 → Inference Engine → P(E2 | e1)
Data point x → Anomaly Detector → P(x)
Data point x → Classifier → P(C | x)
106
Bayes Classifiers
• A formidable and sworn enemy of decision trees
107
Bayes Classifiers in 1 Slide
Bayes classifiers just do inference.
That’s it.
The “algorithm”
1. Learn P(class,X)
2. For a given input x, infer P(class|x)
3. Choose the class with the highest probability
108
109
JPT
Bayes Classifier
• Suppose we have the following model:
P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48
• We're trying to classify cars as Mpg = "good" or "bad"
• If the next example we see is Horse = "low", how do we classify it?
110
JPT
Bayes Classifier
How do we classify <Horse=low>?
P(Mpg, Horse):
P(good, low)  = 0.36
P(good, high) = 0.04
P(bad, low)   = 0.12
P(bad, high)  = 0.48
P(good | low) = P(good, low) / P(low)
             = P(good, low) / [P(good, low) + P(bad, low)]
             = 0.36 / (0.36 + 0.12) = 0.75
The P(good | low) = 0.75, so we classify the example as "good"
111
Bayes Net Classifier
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89
• We're trying to classify cars as Mpg = "good" or "bad"
• If the next example we see is <Horse=low, Accel=fast>, how do we classify it?
Bayes Net Bayes Classifier
112
Suppose we get a <Horse=low, Accel=fast> example?
P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast) = 0.0178 / P(low, fast)
                    = 0.0178 / [P(good, low, fast) + P(bad, low, fast)]
                    ≈ 0.75
Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89
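A minimal Python sketch of this Bayes-net classification: compute P(Mpg, Horse=low, Accel=fast) for each class via the factorization, then normalize.

```python
# Classify Mpg from <Horse=low, Accel=fast> using the Mpg -> Horse -> Accel net.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {("low", "good"): 0.89, ("low", "bad"): 0.21,
                     ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel_given_horse = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
                       ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    return (p_mpg[mpg] * p_horse_given_mpg[(horse, mpg)]
            * p_accel_given_horse[(accel, horse)])

scores = {mpg: joint(mpg, "low", "fast") for mpg in ("good", "bad")}
total = sum(scores.values())
posteriors = {mpg: s / total for mpg, s in scores.items()}
print(posteriors)                            # P(good | low, fast) is about 0.74
print(max(posteriors, key=posteriors.get))   # -> "good"
```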
Bayes Net Bayes Classifier
113
The P(good | low, fast) = 0.75, so we classify the example as "good".
…but that seems somehow familiar…
Wasn't that the same answer as P(Mpg=good | Horse=low)?
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low | good) = 0.89, P(low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89
Bayes Classifiers
• OK, so classification can be posed as inference
• In fact, virtually all machine learning tasks are a
form of inference
• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model Learning: P(Model | dataset)
• Feature Selection: P(Model | dataset)
114
The Naïve Bayes Classifier
ASSUMPTION:
all the attributes are conditionally independent
given the class variable
115
The Naïve Bayes Advantage
At least 256 parameters!
You better have the data
to support them…
A mere 25 parameters!
(the CPTs are tiny because the attribute
nodes only have one parent.)
116
What is the Probability Function
of the Naïve Bayes?
P(Mpg,Cylinders,Weight,Maker,…)
=
P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
117
What is the Probability Function
of the Naïve Bayes?
P(class, x) = P(class) Π_i P(xi | class)
This is another great feature of Bayes
Nets; you can graphically see your
model assumptions
118
Bayes
Classifier
Results:
“MPG”: 392
records
The Classifier
learned by
“Naive BC”
119
120
Bayes
Classifier
Results:
“MPG”: 40
records
More Facts About Bayes
Classifiers
• Many other density estimators can be slotted in
• Density estimation can be performed with real-valued inputs
• Bayes Classifiers can be built with real-valued inputs
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally
discriminative---they merely try to honestly model what’s going on
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with
the magic words “Dirichlet Prior”) can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes
cheerfully!
121
Summary
• Axioms of Probability
• Bayes nets are created by
• chain rule
• conditional independence
• Bayes Nets can do
• Inference
• Anomaly Detection
• Classification
122
123
Using Bayes Rule to Gamble
$1.00
The “Win” envelope has a
dollar and four beads in
it
The “Lose” envelope
has three beads and
no money
Trivial question: someone draws an envelope at random and offers to
sell it to you. How much should you pay?
124
Using Bayes Rule to Gamble
$1.00
The “Win” envelope has a
dollar and four beads in
it
The “Lose” envelope
has three beads and
no money
Interesting question: before deciding, you are allowed to see one bead
drawn from the envelope.
Suppose it’s black: How much should you pay?
Suppose it’s red: How much should you pay?
125
Calculation
…
126
$1.00
Probability Model Uses
Evidence e1, Missing Variables E2 → Inference Engine → P(E2 | e1)
Data point x → Anomaly Detector → P(x)
Data point x → Classifier → P(C | x)
How do we evaluate a particular density
estimator?
127
Probability Models
• Full Prob. Table: No assumptions; Overfitting-prone; Scales horribly
• Bayes Nets: Carefully chosen assumptions; Overfitting and scaling properties depend on assumptions
• Naïve Prob.: Strong assumptions; Overfitting-resistant; Scales incredibly well
128
What you should know
• Probability
• Fundamentals of Probability and Bayes Rule
• What’s a Joint Distribution
• How to do inference (i.e. P(E1|E2)) once you have a JD
• Density Estimation
• What is DE and what is it good for
• How to learn a Joint DE
• How to learn a naïve DE
129
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y=vi
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y=vi records.
130
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y=vi
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi)
131
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y=vi
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi)
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) come along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y=vi) most likely
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v)
Is this a good idea?
132
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y=vi
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi)
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) come along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y=vi) most likely
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v)
This is a Maximum Likelihood classifier.
It can get silly if some Ys are very unlikely.
Is this a good idea?
133
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm
• Break dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y=vi
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y=vi records.
• Mi estimates P(X1, X2, … Xm | Y=vi)
• Much Better Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) come along to be evaluated, predict the value of Y that makes P(Y=vi | X1, X2, … Xm) most likely
Y_predict = argmax_v P(Y = v | X1 = u1 … Xm = um)
Is this a good idea?
134
Terminology
• MLE (Maximum Likelihood Estimator):
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v)
• MAP (Maximum A-Posteriori Estimator):
Y_predict = argmax_v P(Y = v | X1 = u1 … Xm = um)
135
Getting what we need
Y_predict = argmax_v P(Y = v | X1 = u1 … Xm = um)
136
Getting a posterior probability
P(Y = v | X1 = u1 … Xm = um)
= P(X1 = u1 … Xm = um | Y = v) P(Y = v) / P(X1 = u1 … Xm = um)
= P(X1 = u1 … Xm = um | Y = v) P(Y = v) / Σ_{j=1}^{nY} P(X1 = u1 … Xm = um | Y = vj) P(Y = vj)
137
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value Y.
2. This gives P(X1, X2, … Xm | Y=vi ).
3. Estimate P(Y=vi) as the fraction of records with Y=vi.
4. For a new prediction:
Y_predict = argmax_v P(Y = v | X1 = u1 … Xm = um)
          = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
138
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value Y.
2. This gives P(X1, X2, … Xm | Y=vi ).
3. Estimate P(Y=vi) as the fraction of records with Y=vi.
4. For a new prediction:
Y_predict = argmax_v P(Y = v | X1 = u1 … Xm = um)
          = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
We can use our favorite Density Estimator here. Right now we have two options:
• Joint Density Estimator
• Naïve Density Estimator
139
Joint Density Bayes Classifier
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
In the case of the joint Bayes Classifier this
degenerates to a very simple rule:
Ypredict = the most common value of Y among records
in which X1 = u1, X2 = u2, …. Xm = um.
Note that if no records have the exact set of inputs X1
= u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi )
= 0 for all values of Y.
In that case we just have to guess Y’s value
140
Naïve Bayes Classifier
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
In the case of the naive Bayes Classifier this can be simplified:
Y_predict = argmax_v P(Y = v) Π_{j=1}^{m} P(Xj = uj | Y = v)
141
Naïve Bayes Classifier
Y_predict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
In the case of the naive Bayes Classifier this can be simplified:
Y_predict = argmax_v P(Y = v) Π_{j=1}^{m} P(Xj = uj | Y = v)
Technical Hint:
If you have 10,000 input attributes that product will underflow in floating point math. You should use logs:
Y_predict = argmax_v [ log P(Y = v) + Σ_{j=1}^{m} log P(Xj = uj | Y = v) ]
142
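A compact Python sketch of a naive Bayes classifier using the log form above. The toy records, attribute layout, and smoothing-free counting are my own illustrative choices.

```python
import math
from collections import Counter, defaultdict

# Toy records: (X1, X2, ..., Xm, Y) with the class Y as the last field.
records = [("low", "fast", "good"), ("low", "slow", "good"),
           ("high", "fast", "bad"), ("high", "slow", "bad"),
           ("low", "fast", "bad")]

# Learn P(Y=v) and P(Xj=u | Y=v) by counting.
class_counts = Counter(r[-1] for r in records)
attr_counts = defaultdict(Counter)          # key (j, class) -> Counter of values
for r in records:
    for j, u in enumerate(r[:-1]):
        attr_counts[(j, r[-1])][u] += 1

def predict(x):
    best, best_score = None, -math.inf
    for v, cv in class_counts.items():
        score = math.log(cv / len(records))             # log P(Y=v)
        for j, u in enumerate(x):
            p = attr_counts[(j, v)][u] / cv             # P(Xj=u | Y=v), no smoothing
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best, best_score = v, score
    return best

print(predict(("low", "fast")))  # -> "good"
```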
What you should know
• Bayes Classifiers
• How to build one
• How to predict with a BC
• Contrast between naïve and joint BCs
143
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only
exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values
to the variables. And we can do it in time linear with the number of
nodes.
• So we can also compute answers to any questions.
P(S) = 0.3
P(M) = 0.6
P(L|M^S) = 0.05, P(L|M^~S) = 0.1, P(L|~M^S) = 0.1, P(L|~M^~S) = 0.2
P(R|M) = 0.3, P(R|~M) = 0.6
P(T|L) = 0.3, P(T|~L) = 0.8
[Bayes net: S and M are parents of L; M is the parent of R; L is the parent of T]
E.G. What could we do to compute P(R | T, ~S)?
144
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return P(R ^ T ^ ~S) / [P(R ^ T ^ ~S) + P(~R ^ T ^ ~S)]
(Same bullets and Bayes net as the previous slide.)
E.G. What could we do to compute P(R | T, ~S)?
145
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)  (sum of all the rows in the Joint that match R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)  (sum of all the rows in the Joint that match ~R ^ T ^ ~S)
Step 3: Return P(R ^ T ^ ~S) / [P(R ^ T ^ ~S) + P(~R ^ T ^ ~S)]
(Same bullets and Bayes net as the previous slide.)
E.G. What could we do to compute P(R | T, ~S)?
146
147
Where are we now?
Step 1: Compute P(R ^ T ^ ~S)  (sum of all the rows in the Joint that match R ^ T ^ ~S; 4 joint computes)
Step 2: Compute P(~R ^ T ^ ~S)  (sum of all the rows in the Joint that match ~R ^ T ^ ~S; 4 joint computes)
Step 3: Return P(R ^ T ^ ~S) / [P(R ^ T ^ ~S) + P(~R ^ T ^ ~S)]
Each of these obtained by the "computing a joint probability entry" method of the earlier slides.
(Same bullets and Bayes net as the previous slide.)
E.G. What could we do to compute P(R | T, ~S)?
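As a sketch of this enumeration approach, here is Python code that computes P(R | T, ~S) from the CPTs of the S/M/L/R/T net above by summing matching joint entries (the value it prints follows from those CPTs; it is not stated on the slides):

```python
from itertools import product

# CPTs from the S/M/L/R/T net above.
p_S, p_M = 0.3, 0.6
p_L = {(True, True): 0.05, (True, False): 0.1,   # P(L | M, S)
       (False, True): 0.1, (False, False): 0.2}
p_R = {True: 0.3, False: 0.6}                    # P(R | M)
p_T = {True: 0.3, False: 0.8}                    # P(T | L)

def joint(s, m, l, r, t):
    """One joint entry via the net's factorization."""
    pr = (p_S if s else 1 - p_S) * (p_M if m else 1 - p_M)
    pr *= p_L[(m, s)] if l else 1 - p_L[(m, s)]
    pr *= p_R[m] if r else 1 - p_R[m]
    pr *= p_T[l] if t else 1 - p_T[l]
    return pr

def prob(fixed):
    """Sum joint entries consistent with the fixed assignments."""
    total = 0.0
    for s, m, l, r, t in product((True, False), repeat=5):
        assign = {"S": s, "M": m, "L": l, "R": r, "T": t}
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(s, m, l, r, t)
    return total

# P(R | T, ~S) = P(R, T, ~S) / P(T, ~S)
print(prob({"R": True, "T": True, "S": False}) / prob({"T": True, "S": False}))
```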
Independence
We’ve stated:
P(M) = 0.6
P(S) = 0.3
P(S  M) = P(S)
148
From these statements, we can
derive the full joint pdf.
S
M
T
T
T
F
F
T
F
F
Prob
And since we now have the joint pdf, we can make
any queries we like.
Classic Machine Learning Tasks
Evidence e1, Missing Variables E2 → Inference Engine → P(E2 | e1)
Data point x → Anomaly Detector → P(x)
Data point x → Classifier → P(C | x)
149
Anomaly Detection on 1 page
150
A note about independence
• Assume A and B are Boolean Random Variables.
Then
“A and B are independent”
if and only if
P(A|B) = P(A)
• "A and B are independent" is often notated as A ⊥ B
151
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(A^B) = P(A) P(B)
• Assume P(A|B) = P(A)
• Then P(B|A) = P(B)
152
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(~A|B) = P(~A)
• Assume P(A|B) = P(A)
• Then P(A|~B) = P(A)
153
Multivalued Independence
For multivalued Random Variables A and B,
A ⊥ B
if and only if
∀u,v: P(A = u | B = v) = P(A = u)
from which you can then prove things like…
∀u,v: P(A = u ∧ B = v) = P(A = u) P(B = v)
∀u,v: P(B = v | A = u) = P(B = v)
154
Back to Naïve Density Estimation
• Let x[i] denote the i'th field of record x.
• Naïve DE assumes x[i] is independent of {x[1],x[2],..x[i-1], x[i+1],…x[M]}
• Example:
  • Suppose that each record is generated by randomly shaking a green die and a red die
  • Dataset 1: A = red value, B = green value
  • Dataset 2: A = red value, B = sum of values
  • Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
155
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What
is P(A^~B^C^~D)?
156
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What
is P(A^~B^C^~D)?
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
157
Naïve Distribution General
Case
• Suppose x[1], x[2], … x[M] are independently distributed.
P(x[1] = u1, x[2] = u2, …, x[M] = uM) = Π_{k=1}^{M} P(x[k] = uk)
• So if we have a Naïve Distribution we can
construct any row of the implied Joint Distribution
on demand.
• So we can do any inference
• But how do we learn a Naïve Density Estimator?
158
Learning a Naïve Density
Estimator
P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)
Another trivial learning algorithm!
159
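A Python sketch of this trivial learner plus the product-of-marginals lookup from the previous slide (the toy records are invented):

```python
from collections import Counter

# Toy dataset: each record is a tuple of M attribute values.
records = [(1, 0, 1), (0, 0, 1), (1, 1, 1), (1, 0, 0)]
M, R = len(records[0]), len(records)

# Learn the Naive DE: P(x[i] = u) = (# records with x[i] = u) / (total records)
marginals = [Counter(rec[i] for rec in records) for i in range(M)]

def p_marginal(i, u):
    return marginals[i][u] / R

def p_row(row):
    """Naive assumption: a joint row is the product of the marginals."""
    prod = 1.0
    for i, u in enumerate(row):
        prod *= p_marginal(i, u)
    return prod

print(p_row((1, 0, 1)))  # P(x[0]=1) * P(x[1]=0) * P(x[2]=1)
```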
Contrast
Joint DE: Can model anything. Naïve DE: Can model only very boring distributions.
Joint DE: No problem to model "C is a noisy copy of A". Naïve DE: Outside Naïve's scope.
Joint DE: Given 100 records and more than 6 Boolean attributes, will screw up badly. Naïve DE: Given 100 records and 10,000 multivalued attributes, will be fine.
160
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
A tiny part of
the DE
learned by
“Joint”
The DE
learned by
“Naive”
161
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
A tiny part of
the DE
learned by
“Joint”
The DE
learned by
“Naive”
162
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE
learned by
“Joint”
The DE
learned by
“Naive”
163
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE
learned by
“Joint”
The DE
learned by
“Naive”
164
“Weight vs. MPG”: The best that Naïve can
do
The DE
learned by
“Joint”
The DE
learned by
“Naive”
165
166