2013-01-16-prob-tour.. - Carnegie Mellon School of Computer Science


A Quick Overview of
Probability
William W. Cohen
Machine Learning 10-605
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL
2001)
Task: distinguish pairs of easily-confused words (“affect”
vs “effect”) in context
Twelve years later….
• Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a
corpus of 1M English books
• approx 80B words
• 5-grams: 30Gb compressed, 250-300Gb
uncompressed
• Each 5-gram contains frequency distribution over
years
– Wrote code to compute
• Pr(A,B,C,D,E|C=affect or C=effect)
• Pr(any subset of A,…,E|any other fixed values of
A,…,E with C=affect V effect)
[Some material pilfered from http://www.cs.cmu.edu/~awm/tutorials]
Probabilistic and Bayesian
Analytics
Note to other teachers and users of these
slides. Andrew would be delighted if
you found this source material useful in
giving your own lectures. Feel free to use
these slides verbatim, or to modify them
to fit your own needs. PowerPoint
originals are available. If you make use
of a significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully
received.
Copyright © Andrew W. Moore
Andrew W. Moore
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Tuesday’s Lecture - Review
• Intro
– Who, Where, When - administrivia
– Why – motivations
– What/How – assignments, grading, …
• Review - How to count and what to count
– Big-O and Omega notation, example, …
– Costs of i/o vs computation
• What sort of computations do we want to do in
(large-scale) machine learning programs?
– Probability
Probability - what you need to
really, really know
Probability - what you need to
really, really know
• Probabilities are cool
The Problem of Induction
• David Hume (1711-1776) pointed out:
  1. Empirically, induction seems to work.
  2. Statement (1) is an application of induction.
• This stumped people for about 200 years
1. Of the Different Species of Philosophy
2. Of the Origin of Ideas
3. Of the Association of Ideas
4. Sceptical Doubts Concerning the Operations of the Understanding
5. Sceptical Solution of These Doubts
6. Of Probability
7. Of the Idea of Necessary Connexion
8. Of Liberty and Necessity
9. Of the Reason of Animals
10. Of Miracles
11. Of A Particular Providence and of A Future State
12. Of the Academical Or Sceptical Philosophy
A Second Problem of Induction
• A black crow seems to support the hypothesis “all crows are black”:
  ∀x CROW(x) ⇒ BLACK(x)
• A pink highlighter equivalently supports the hypothesis “all non-black things are non-crows”:
  ∀x ~BLACK(x) ⇒ ~CROW(x)
• Thus, a pink highlighter supports the hypothesis “all crows are black”.
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
Discrete Random Variables
• A is a Boolean-valued random variable if
– A denotes an event,
– there is uncertainty as to whether A occurs.
• Examples
  – A = The US president in 2023 will be male
  – A = You wake up tomorrow with a headache
  – A = You have Ebola
  – A = the 1,000,000,000,000th digit of π is 7
• Define P(A) as “the fraction of possible worlds in which A is true”
– We’re assuming all possible worlds are equally probable
Discrete Random Variables
• A is a Boolean-valued random variable if
– A denotes an event,
a possible outcome of an “experiment”
– there is uncertainty as to whether A occurs.
the experiment is not deterministic
• Define P(A) as “the fraction of experiments in which A is true”
– We’re assuming all possible outcomes are equiprobable
• Examples
– You roll two 6-sided dice (the experiment) and get doubles (A=doubles, the
outcome)
– I pick two students in the class (the experiment) and they have the same
birthday (A=same birthday, the outcome)
Visualizing A
Event space of all
possible worlds
Its area is 1
Worlds in which
A is true
Worlds in which A is False
P(A) = Area of
reddish oval
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• There is One True Way to talk about
uncertainty: the Axioms of Probability
The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
“Dice”
“Experiments”
Events, random
variables, ….,
probabilities
(This is Andrew’s joke)
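A minimal sketch (not from the slides) of the “possible worlds” view of the axioms above: a finite, equiprobable sample space in Python, here the two-fair-dice experiment, with the addition axiom checked numerically.

```python
# Sketch: probabilities as fractions of equally likely "worlds" (two fair d6).
from itertools import product

worlds = list(product(range(1, 7), repeat=2))         # all 36 equally likely outcomes

def P(event):
    """P(A) = fraction of possible worlds in which A is true."""
    return sum(1 for w in worlds if event(w)) / len(worlds)

def doubles(w):
    return w[0] == w[1]

def seven(w):
    return w[0] + w[1] == 7

assert 0 <= P(doubles) <= 1                           # 0 <= P(A) <= 1
assert P(lambda w: True) == 1                         # P(True) = 1
assert P(lambda w: False) == 0                        # P(False) = 0
# P(A or B) = P(A) + P(B) - P(A and B)
lhs = P(lambda w: doubles(w) or seven(w))
rhs = P(doubles) + P(seven) - P(lambda w: doubles(w) and seven(w))
assert abs(lhs - rhs) < 1e-12
print(P(doubles), P(seven))                           # 0.1666..., 0.1666...
```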
These Axioms are Not to be Trifled With
• There have been many many other approaches to
understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, nonmonotonic reasoning, …
• 25 years ago people in AI argued about these; now they
mostly don’t
– Any scheme for combining uncertain information, uncertain
“beliefs”, etc,… really should obey these axioms
– If you gamble based on “uncertain beliefs”, then [you can be
exploited by an opponent]  [your uncertainty formalism violates
the axioms] - de Finetti 1931 (the “Dutch book argument”)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any
smaller than 0
And a zero area would
mean no world could ever
have A true
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any
bigger than 1
And an area of 1 would
mean all worlds will have A
true
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of events A and B]
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram: P(A or B) is the combined area of A and B; P(A and B) is their overlap]
Simple addition and subtraction
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
⇒ P(not A) = P(~A) = 1 - P(A)
Proof: P(A or ~A) = 1 and P(A and ~A) = 0, so
P(A or ~A) = P(A) + P(~A) - P(A and ~A)
1 = P(A) + P(~A) - 0
Elementary Probability in Pictures
• P(~A) + P(A) = 1
A
~A
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
⇒ P(A) = P(A ^ B) + P(A ^ ~B)
A = A and (B or ~B) = (A and B) or (A and ~B)
P(A) = P(A and B) + P(A and ~B) - P((A and B) and (A and ~B))
P(A) = P(A and B) + P(A and ~B) - P(A and A and B and ~B)
     = P(A and B) + P(A and ~B), since (A and B and ~B) can never be true
Elementary Probability in Pictures
• P(A) = P(A ^ B) + P(A ^ ~B)
A^B
A ^ ~B
B
~B
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence
Independent Events
• Definition: two events A and B are
independent if Pr(A and B)=Pr(A)*Pr(B).
• Intuition: outcome of A has no effect on the
outcome of B (and vice versa).
– We need to assume the different rolls are
independent to solve the problem.
– You frequently need to assume the
independence of something to solve any
learning problem.
Some practical problems
• You’re the DM in a D&D game.
• Joe brings his own d20 and throws 4 critical hits in a row to start off
  – DM = dungeon master
  – D20 = 20-sided die
  – “Critical hit” = 19 or 20
• What are the odds of that happening with a fair die?
• Ci = critical hit on trial i, i = 1,2,3,4
• P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4 (a quick numeric check follows below)
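A small numeric check of this (my own sketch, not part of the slides): the exact product under independence, plus a Monte Carlo estimate.

```python
import random

p_crit = 2 / 20                       # P(19 or 20 on a fair d20)
print(p_crit ** 4)                    # independence: (1/10)^4 = 0.0001

# Simulation: how often do 4 rolls of a fair d20 all come up 19 or 20?
trials = 1_000_000
streaks = sum(
    all(random.randint(1, 20) >= 19 for _ in range(4)) for _ in range(trials)
)
print(streaks / trials)               # ~0.0001
```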
Multivalued Discrete Random Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on
exactly one value out of {v1,v2, .. vk}
– Example: V={aaliyah, aardvark, …., zymurge, zynga}
– Example: V={aaliyah_aardvark, …, zynga_zymgurgy}
• Thus…
P(A = vi and A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
Terms: Binomials and Multinomials
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on
exactly one value out of {v1,v2, .. vk}
– Example: V={aaliyah, aardvark, …., zymurge,
zynga}
– Example: V={aaliyah_aardvark, …,
zynga_zymgurgy}
• The distribution Pr(A) is a multinomial
• For k=2 the distribution is a binomial
More about Multivalued Random Variables
• Using the axioms of probability and assuming that A obeys…
P(A = vi and A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
• It’s easy to prove that
P(A = v1 or A = v2 or … or A = vi) = Σ j=1..i P(A = vj)
• And thus we can prove
Σ j=1..k P(A = vj) = 1
Elementary Probability in Pictures
Σ j=1..k P(A = vj) = 1
[Diagram: the event space partitioned into regions A=1, A=2, A=3, A=4, A=5]
Elementary Probability in Pictures
Σ j=1..k P(A = vj) = 1
[Diagram: the event space partitioned into regions A=aaliyah, A=aardvark, …, A=zynga]
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
A practical problem
• I have lots of standard d20 dice and lots of loaded dice, all identical in appearance.
• A loaded die gives a 19 or 20 (“critical hit”) half the time.
• In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the dice. Then I roll, and either get a critical hit (B) or not (~B).
• Can I mix the dice together so that P(B) is anything I want, say P(B) = 0.137?
P(B) = P(B and A) + P(B and ~A)      (“mixture model”)
     = 0.1*λ + 0.5*(1 - λ) = 0.137
λ = (0.5 - 0.137)/0.4 = 0.9075   (checked numerically in the sketch below)
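A short sketch (mine, not the slides’) that redoes the mixture-model algebra above and back-substitutes to verify it.

```python
# P(B) = P(B|fair)*lam + P(B|loaded)*(1 - lam); solve for lam so that P(B) = 0.137.
p_fair, p_loaded, target = 0.1, 0.5, 0.137

lam = (p_loaded - target) / (p_loaded - p_fair)    # (0.5 - 0.137) / 0.4
print(lam)                                         # 0.9075
print(p_fair * lam + p_loaded * (1 - lam))         # 0.137, as required
```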
Another picture for this problem
It’s more convenient to say
• “if you’ve picked a fair die then …” i.e. Pr(critical hit|fair die)=0.1
• “if you’ve picked the loaded die then….” Pr(critical hit|loaded die)=0.5
A (fair die)
A and B
~A (loaded)
~A and B
Conditional probability:
Pr(B|A) = P(B^A)/P(A)
P(B|A)
P(B|~A)
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Some practical problems
“marginalizing out” A
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it.
Let A=d20 picked is fair and B=roll 19 or 20 with that die. What
is P(B)?
P(B) = P(B|A) P(A) + P(B|~A) P(~A)
= 0.1*0.75 + 0.5*0.25 = 0.2
P(B) = P(B|A)P(A) + P(B|~A)P(~A)
[Diagram: event space split into A (fair die) and ~A (loaded), with B overlapping both; labeled P(A), P(~A), P(B|A), P(B|~A)]
Some practical problems
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it.
Let A=d20 picked is fair and B=roll 19 or 20 with that die.
• Suppose B happens (e.g., I roll a 20). What is the chance the
die I rolled is fair? i.e. what is P(A|B) ?
P(A|B) = ?
P(A and B) = P(A|B) * P(B)
P(A and B) = P(B|A) * P(A)
So P(A|B) * P(B) = P(B|A) * P(A), and therefore
P(A|B) = P(B|A) * P(A) / P(B)
[Diagram: event space split into A (fair die) and ~A (loaded), showing the regions A and B, ~A and B, with P(A), P(~A), P(B|A), P(B|~A)]
Bayes’ rule:
P(A|B) = P(B|A) * P(A) / P(B)      (P(A|B) is the posterior, P(A) the prior)
P(B|A) = P(A|B) * P(B) / P(A)
Bayes, Thomas (1763) An essay towards
solving a problem in the doctrine of
chances. Philosophical Transactions of the
Royal Society of London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances,
but necessary to be solved in order to a sure foundation for all our
reasonings concerning past facts, and what is likely to be hereafter….
necessary to be considered by any that would give a clear account of the
strength of analogical or inductive reasoning…
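A short sketch (not from the slides) putting the last two slides together for the 3-fair-dice, 1-loaded-die experiment: marginalize out A to get P(B), then apply Bayes’ rule to get P(A|B).

```python
p_A      = 3 / 4       # prior: the picked die is fair
p_B_A    = 0.1         # P(critical hit | fair d20)
p_B_notA = 0.5         # P(critical hit | loaded d20)

# Marginalizing out A:  P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_B = p_B_A * p_A + p_B_notA * (1 - p_A)
print(p_B)             # 0.2

# Bayes' rule:  P(A|B) = P(B|A)P(A) / P(B)
print(p_B_A * p_A / p_B)   # 0.375: a critical hit drops belief in "fair" from 0.75 to 0.375
```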
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
Some practical problems
• Joe throws 4 critical hits in a row, is Joe cheating?
• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625, P(B|~A) = 0.0001
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|~A) P(~A)]
       = 0.0625 * P(A) / [0.0625 * P(A) + 0.0001 * (1 - P(A))]
(Evaluated for several priors in the sketch below.)
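A sketch (not from the slides) of the formula above, showing how the posterior P(A|B) moves with different guesses for the prior P(A).

```python
p_B_A, p_B_notA = 0.5 ** 4, 0.1 ** 4     # 0.0625 and 0.0001

def posterior(prior):
    """P(A|B) via Bayes' rule, with P(B) expanded by marginalization."""
    return p_B_A * prior / (p_B_A * prior + p_B_notA * (1 - prior))

for prior in (0.5, 0.1, 0.01, 0.0001):
    print(prior, round(posterior(prior), 4))
# 0.5    -> 0.9984
# 0.1    -> 0.9858
# 0.01   -> 0.8633
# 0.0001 -> 0.0588
```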
What’s the experiment and outcome here?
• Outcome A: Joe is cheating
• Experiment:
– Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
– Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, n of whom cheat with probability p>0.
– I have no idea, but I don’t like his looks.
Call it P(A)=0.1
Remember: Don’t Mess with The Axioms
• A subjective belief can be treated, mathematically, like a
probability
– Use those axioms!
• There have been many many other approaches to
understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, nonmonotonic reasoning, …
• 25 years ago people in AI argued about these; now they
mostly don’t
– Any scheme for combining uncertain information, uncertain
“beliefs”, etc,… really should obey these axioms
– If you gamble based on “uncertain beliefs”, then [you can be
exploited by an opponent]  [your uncertainty formalism violates
the axioms] - de Finetti 1931 (the “Dutch book argument”)
Some practical problems
• Joe throws 4 critical hits in a row, is Joe cheating?
• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625, P(B|~A) = 0.0001
P(A|B) = P(B|A) P(A) / P(B)
P(~A|B) = P(B|~A) P(~A) / P(B)
P(A|B) / P(~A|B) = [P(B|A) / P(B|~A)] * [P(A) / P(~A)] = (0.0625 / 0.0001) * P(A)/P(~A) = 625 * P(A)/P(~A)
Moral: with enough
evidence the prior P(A)
doesn’t really matter.
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come
with any specs. How can I find out how it behaves?
[Histogram: Frequency of each Face Shown (1..20) over the 20 rolls]
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
One solution
I bought a loaded d20 on EBay…but it didn’t come
with any specs. How can I find out how it behaves?
[Histogram: Frequency of each Face Shown (1..20) over the 20 rolls]
MLE = maximum likelihood estimate:
P(1)=0, P(2)=0, P(3)=0, P(4)=0.1, …, P(19)=0.25, P(20)=0.2
But: Do I really think it’s impossible to roll a 1,2 or 3?
Would you bet your house on it?
A better solution
I bought a loaded d20 on EBay…but it didn’t come
with any specs. How can I find out how it behaves?
[Histogram: Frequency of each Face Shown (1..20) over the 20 rolls]
0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
A better solution
I bought a loaded d20 on EBay…but it didn’t come
with any specs. How can I find out how it behaves?
[Histogram: Frequency of each Face Shown (1..20), with one imagined roll of each face added]
P(1)=1/40, P(2)=1/40, P(3)=1/40, P(4)=(2+1)/40, …, P(19)=(5+1)/40, P(20)=(4+1)/40=1/8
P̂r(i) = (C(i) + 1) / (C(ANY) + C(IMAGINED))
0.25 vs. 0.125 – really
different! Maybe I should
“imagine” less data?
A better solution?
Q: What if I used m imagined rolls, with a probability of q=1/20 of rolling any i?
P̂r(i) = (C(i) + 1) / (C(ANY) + C(IMAGINED))
P̂r(i) = (C(i) + mq) / (C(ANY) + m)
I can use this formula with m>20, or
even with m<20 … say with m=1
A better solution
Q: What if I used m imagined rolls, with a probability of q=1/20 of rolling any i?
P̂r(i) = (C(i) + 1) / (C(ANY) + C(IMAGINED))
P̂r(i) = (C(i) + mq) / (C(ANY) + m)
If m>>C(ANY) then your imagination q rules
If m<<C(ANY) then your data rules BUT you
never ever ever end up with Pr(i)=0
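A sketch (mine) of the smoothed estimator above, using the counts implied by the example rolls (C(4)=2, C(19)=5, C(20)=4 out of 20) and q = 1/20, for a few choices of m. m = 0 gives the MLE; m = 20 gives the “imagine one roll of each face” estimate.

```python
counts = {4: 2, 19: 5, 20: 4}     # faces not listed were never rolled: C(i) = 0
total = 20                        # C(ANY)
q = 1 / 20

def pr_hat(i, m):
    """Pr_hat(i) = (C(i) + m*q) / (C(ANY) + m)."""
    return (counts.get(i, 0) + m * q) / (total + m)

for m in (0, 1, 20):
    print(m, round(pr_hat(1, m), 4), round(pr_hat(20, m), 4))
# m=0  : P(1)=0.0     P(20)=0.2      (MLE)
# m=1  : P(1)=0.0024  P(20)=0.1929   (data mostly rules, but nothing is ever 0)
# m=20 : P(1)=0.025   P(20)=0.125    (the add-one estimate from the previous slide)
```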
Terminology – more later
This is called a uniform Dirichlet prior
C(i), C(ANY) are sufficient statistics
P̂r(i) = (C(i) + mq) / (C(ANY) + m)
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
Some practical problems
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two of the three d6 uniformly at random (A = which pair) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)
Some practical problems
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two of the three d6 uniformly at random (A = which pair) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
[6×6 grid of (Roll 1, Roll 2) outcomes: D marks doubles (the diagonal), 7 marks rolls summing to seven (the anti-diagonal)]
A brute-force solution
A    Roll 1   Roll 2   P                  Comment
FL   1        1        1/3 * 1/6 * 1/2    doubles
FL   1        2        1/3 * 1/6 * 1/10
…    …        …        …
FL   1        6        …                  seven
FL   2        1        …
…    …        …        …
FL   6        6        …                  doubles
HL   1        1        …                  doubles
HL   1        2        …
…    …        …        …
HF   1        1        …                  doubles
…    …        …        …

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk.
With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
• P(doubles)
• P(seven or eleven)
• P(total is higher than 5)
• ….
(A brute-force sketch in code follows below.)
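A hypothetical brute-force sketch of this (not the course’s code): enumerate the full joint over (which pair of dice, Roll 1, Roll 2) and sum the matching rows. It assumes, as the table above implies, that each loaded die puts probability 0.1 on every face other than its favored one.

```python
from itertools import product

fair = {f: 1 / 6 for f in range(1, 7)}
high = {f: 0.5 if f == 6 else 0.1 for f in range(1, 7)}    # loaded high
low  = {f: 0.5 if f == 1 else 0.1 for f in range(1, 7)}    # loaded low
pairs = {"HL": (high, low), "HF": (high, fair), "FL": (fair, low)}

joint = {}                                   # (pair, roll1, roll2) -> probability
for name, (d1, d2) in pairs.items():
    for r1, r2 in product(range(1, 7), repeat=2):
        joint[(name, r1, r2)] = (1 / 3) * d1[r1] * d2[r2]

def P(event):
    """Sum P(row) over the rows of the joint that match the event."""
    return sum(p for row, p in joint.items() if event(row))

print(P(lambda row: row[1] == row[2]))       # P(doubles) ~ 0.158
print(P(lambda row: row[1] + row[2] == 7))   # P(seven)   ~ 0.211, so seven is more likely
```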
The Joint Distribution
Example: Boolean variables A,
B, C
Recipe for making a joint distribution of
M variables:
The Joint Distribution
Example: Boolean variables A,
B, C
Recipe for making a joint distribution of
M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

A  B  C
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1
The Joint Distribution
Example: Boolean variables A,
B, C
Recipe for making a joint distribution of
M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
The Joint Distribution
Example: Boolean variables A,
B, C
Recipe for making a joint distribution of
M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

[Venn diagram of A, B, C labeled with the region probabilities from the table]
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:
P(E) = Σ (rows matching E) P(row)
Abstract: Predict whether income exceeds $50K/yr based on census
data. Also known as "Census Income" dataset. [Kohavi, 1996]
Number of Instances: 48,842
Number of Attributes: 14 (in UCI’s copy of dataset); 3 (here)
Using the Joint
P(Poor Male) = 0.4654
P(E) = Σ (rows matching E) P(row)
Using the Joint
P(Poor) = 0.7604
P(E) = Σ (rows matching E) P(row)
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
Inference with the Joint
P(E1|E2) = P(E1 ^ E2) / P(E2) = [Σ (rows matching E1 and E2) P(row)] / [Σ (rows matching E2) P(row)]
Inference with the Joint
P(E1|E2) = P(E1 ^ E2) / P(E2) = [Σ (rows matching E1 and E2) P(row)] / [Σ (rows matching E2) P(row)]
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
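A sketch (not from the slides) using the Boolean A, B, C joint table shown earlier: P(E) by summing matching rows, and P(E1|E2) as a ratio of two such sums.

```python
joint = {                                   # (a, b, c) -> probability, from the slide's table
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def P(event):
    """P(E) = sum of P(row) over rows matching E."""
    return sum(p for row, p in joint.items() if event(row))

def P_cond(e1, e2):
    """P(E1|E2) = P(E1 and E2) / P(E2)."""
    return P(lambda r: e1(r) and e2(r)) / P(e2)

def A(r): return r[0] == 1
def B(r): return r[1] == 1

print(P(A))           # 0.5
print(P_cond(A, B))   # P(A|B) = 0.35 / 0.5 = 0.7
```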
Estimating the joint distribution
• Collect some data points
• Estimate the probability P(E1=e1 ^ … ^ En=en) as
#(that row appears)/#(any row appears)
• ….
Gender   Hours   Wealth
g1       h1      w1
g2       h2      w2
…        …       …
gN       hN      wN
Estimating the joint distribution
• For each combination of values r:
  – Total = C[r] = 0
  Complexity? O(2^d), where d = #attributes (all binary)
• For each data row ri:
  – C[ri] ++
  – Total ++
  Complexity? O(n), where n = total size of input data

Gender   Hours   Wealth
g1       h1      w1
g2       h2      w2
…        …       …
gN       hN      wN

P̂(ri) = C[ri] / Total, where e.g. ri is “female, 40.5+, poor”
Estimating the joint distribution
• For each combination of values r:
  – Total = C[r] = 0
  Complexity? O(∏ i=1..d ki), where ki = arity of attribute i
• For each data row ri:
  – C[ri] ++
  – Total ++
  Complexity? O(n), where n = total size of input data

Gender   Hours   Wealth
g1       h1      w1
g2       h2      w2
…        …       …
gN       hN      wN
Estimating the joint distribution
• For each data row ri:
  – If ri is not in hash tables C, Total: insert C[ri] = 0
  – C[ri] ++
  – Total ++
  Complexity? O(m), where m = size of the model, and O(n), where n = total size of input data

Gender   Hours   Wealth
g1       h1      w1
g2       h2      w2
…        …       …
gN       hN      wN

(A minimal code sketch follows below.)
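A minimal sketch (mine) of the hash-table version: only rows that actually occur get a counter, so space is O(m) in the number of distinct rows and the pass over the data is O(n). The example rows are made up.

```python
from collections import Counter

data = [                                       # hypothetical (gender, hours, wealth) rows
    ("female", "40.5+", "poor"), ("male", "<40.5", "rich"),
    ("female", "40.5+", "poor"), ("male", "40.5+", "poor"),
]

C = Counter()                                  # sparse: keys exist only for observed rows
total = 0
for row in data:                               # one pass over the data
    C[row] += 1
    total += 1

def p_hat(row):
    return C[row] / total

print(p_hat(("female", "40.5+", "poor")))      # 0.5
```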
Another example….
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL
2001)
Task: distinguish pairs of easily-confused words (“affect”
vs “effect”) in context
An experiment
• Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a corpus of 1M
English books
• approx 80B words
• 5-grams: 30Gb compressed, 250-300Gb uncompressed
• Each 5-gram contains frequency distribution over years
– Extract all 5-grams from books published before 2000
that contain ‘effect’ or ‘affect’ in middle position
• about 20 “disk hours”
• approx 100M occurrences
• approx 50k distinct n-grams --- not big
– Wrote code to compute
• Pr(A,B,C,D,E|C=affect or C=effect)
• Pr(any subset of A,…,E|any other subset,C=affect V effect)
Some of the Joint Distribution
A      B        C        D      E           p
is     the      effect   of     the         0.00036
is     the      effect   of     a           0.00034
.      The      effect   of     this        0.00034
to     this     effect   :      “           0.00034
be     the      effect   of     the         …
…      …        …        …      …           …
not    the      effect   of     any         0.00024
…      …        …        …      …           …
does   not      affect   the    general     0.00020
does   not      affect   the    question    0.00020
any    manner   affect   the    principle   0.00018
…      …        …        …      …           …
Another experiment
• Extracted all affect/effect 5-grams from the old
(small) Reuters corpus
– about 20k documents
– about 723 n-grams, 661 distinct
– Financial news, not novels or textbooks
• Tried to predict center word with:
– Pr(C|A=a,B=b,D=d,E=e)
– then P(C|A,B,D,C=effect V affect)
– then P(C|B,D, C=effect V affect)
– then P(C|B, C=effect V affect)
– then P(C, C=effect V affect)
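A hypothetical sketch of the back-off sequence listed above (not the actual code used for the experiment): estimate P(C | context) from 5-gram counts, dropping context words until some pattern has been seen. The `counts` structure and names are assumptions for illustration.

```python
from collections import Counter

def predict_center(a, b, d, e, counts):
    """counts maps 5-grams (a, b, c, d, e) -> frequency; returns (word, P-hat)."""
    patterns = [                                              # most specific context first
        lambda g: (g[0], g[1], g[3], g[4]) == (a, b, d, e),   # P(C|A,B,D,E)
        lambda g: (g[0], g[1], g[3]) == (a, b, d),            # P(C|A,B,D)
        lambda g: (g[1], g[3]) == (b, d),                     # P(C|B,D)
        lambda g: g[1] == b,                                  # P(C|B)
        lambda g: True,                                       # P(C)
    ]
    for match in patterns:
        tally = Counter()
        for gram, n in counts.items():
            if gram[2] in ("affect", "effect") and match(gram):
                tally[gram[2]] += n
        if tally:                                  # this pattern appears in the corpus: use it
            word, n = tally.most_common(1)[0]
            return word, n / sum(tally.values())
    return None, 0.0
```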
EXAMPLES
• “The cumulative _ of the” → effect (1.0)
• “Go into _ on January” → effect (1.0)
• “From cumulative _ of accounting” not present
  – Nor is “From cumulative _ of _”
  – But “_ cumulative _ of _” → effect (1.0)
• “Would not _ Finance Minister” not present
  – But “_ not _ _ _” → affect (0.9625)
Performance summary
Pattern        Used   Errors
P(C|A,B,D,E)   101    1
P(C|A,B,D)     157    6
P(C|B,D)       163    13
P(C|B)         244    78
P(C)           58     31
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
Density Estimation
• Our Joint Distribution learner is our first
example of something called Density
Estimation
• A Density Estimator learns a mapping from a set of attribute values to a Probability
Input Attributes → Density Estimator → Probability
Density Estimation
• Compare it against the two other major
kinds of models:
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
Input Attributes → Density Estimator → Probability
Input Attributes → Regressor → Prediction of real-valued output
Density Estimation → Classification
Input Attributes x → Classifier → Prediction of categorical output (one of y1, …, yk)
Input Attributes x, Class → Density Estimator → P̂(x,y)

To classify x:
1. Use your estimator to compute P̂(x,y1), …, P̂(x,yk)
2. Return the class y* with the highest predicted probability
Ideally this is correct with probability P̂(x,y*) / (P̂(x,y1) + … + P̂(x,yk))
Binary case: predict POS if P̂(x) > 0.5
(Sketched in code below.)
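A small sketch (not Andrew’s) of the recipe above: wrap any density estimator P̂(x, y) in an argmax over classes. The toy estimator and its numbers are made up.

```python
def classify(x, classes, p_hat):
    """Return (y*, normalized confidence) using the density estimate p_hat(x, y)."""
    scores = {y: p_hat(x, y) for y in classes}
    y_star = max(scores, key=scores.get)
    return y_star, scores[y_star] / sum(scores.values())

# Toy usage with a hand-built estimator (hypothetical numbers):
def toy_p_hat(x, y):
    return {("will", "effect"): 0.03, ("will", "affect"): 0.01}.get((x, y), 1e-6)

print(classify("will", ["affect", "effect"], toy_p_hat))   # ('effect', 0.75)
```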
Classification vs Density Estimation
Classification
Density Estimation
Classification vs density estimation
How do you evaluate?
Classification
• Estimate Pr(error)
• Given n test examples and k errors, the error rate is Pr(error) ≈ k/n.
Density Estimation
• Given n test examples x1,y1, …, xn,yn, compute
  -(1/n) Σi log P̂(Y=yi | xi)
• This is the average number of bits needed to encode the true labels using a code based on your density estimator. (An optimal code will use -log P(Y=y) bits for event Y=y.)
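A one-line sketch (mine) of that score: the average number of bits per test example, given the model’s estimates P̂(Y=yi|xi) for the true labels.

```python
import math

def avg_bits(p_true_labels):
    """p_true_labels[i] = P_hat(Y = y_i | x_i) for each test example."""
    return -sum(math.log2(p) for p in p_true_labels) / len(p_true_labels)

print(avg_bits([0.9, 0.8, 0.5, 0.99]))   # ~0.37 bits per example
```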
Bayes Classifiers
• If we can do inference over Pr(X1,…,Xd,Y)…
• … in particular compute Pr(X1,…,Xd|Y) and Pr(Y).
– And then we can use Bayes’ rule to compute
Pr(Y | X1,…,Xd) = Pr(X1,…,Xd | Y) Pr(Y) / Pr(X1,…,Xd)
Summary: The Bad News
• Density estimation by directly learning the
joint is trivial, mindless and dangerous
Andrew’s joke
• Density estimation by directly learning the joint is
hopeless unless you have some combination of
• Very few attributes
• Attributes with low “arity”
• Lots and lots of data
• Otherwise you can’t estimate all the row frequencies
Back to the demo
• Extracted all affect/effect 5-grams from the old (small)
Reuters corpus
– about 20k documents
– about 723 n-grams, 661 distinct
– Financial news, not novels or textbooks
• Tried to predict center word with:
– Pr(C|A=a,B=b,D=d,E=e)
– Of 723 combinations of a,b,d,e, only 101 (about 14%)
appear in the n-gram corpus.
– Of those there is only one error.
Performance …
Pattern        Used   Errors
P(C|A,B,D,E)   101    1
P(C|A,B,D)     157    6
P(C|B,D)       163    13
P(C|B)         244    78
P(C)           58     31
• Is this good performance?
• Do other brute-force estimates of joint probabilities
have the same problem?
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
• Naïve Bayes density estimators and classifiers
Naïve Density Estimation
The problem with the Joint Estimator is that it just
mirrors the training data.
We need something which generalizes more usefully.
The naïve model generalizes strongly:
Assume that each attribute is distributed
independently of any of the other attributes.
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently
distributed. What is P(A ^ ~B ^ C ^ ~D)?
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily
compute any row of the joint distribution.
• Suppose A, B, C and D are independently
distributed. What is P(A^~B^C^~D)?
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
Naïve Distribution General Case
• Suppose X1,X2,…,Xd are independently distributed.
Pr(X1 = x1, …, Xd = xd) = Pr(X1 = x1) × … × Pr(Xd = xd)
• So if we have a Naïve Distribution we can
construct any row of the implied Joint Distribution
on demand.
• How do we learn this?
Learning a Naïve Density
Estimator
MLE:             P̂(Xi = xi) = (# records with Xi = xi) / (# records)
Dirichlet (MAP): P̂(Xi = xi) = (# records with Xi = xi + mq) / (# records + m)
Another trivial learning algorithm!
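A sketch (not the course’s code) of such a learner: estimate each attribute’s distribution independently with the MAP/Dirichlet smoothing above, then multiply the per-attribute estimates to get any row of the implied joint. Using q = 1/(number of values seen for the attribute) is an assumption made here for simplicity.

```python
from collections import Counter

class NaiveDensityEstimator:
    def __init__(self, m=1.0):
        self.m = m                                  # weight of the imagined data

    def fit(self, rows):
        self.n = len(rows)
        self.counts = [Counter(col) for col in zip(*rows)]   # one Counter per attribute
        return self

    def p_attr(self, i, value):
        q = 1 / len(self.counts[i])                 # assumed uniform prior over seen values
        return (self.counts[i][value] + self.m * q) / (self.n + self.m)

    def p_row(self, row):                           # naive: multiply independent estimates
        p = 1.0
        for i, value in enumerate(row):
            p *= self.p_attr(i, value)
        return p

rows = [("will", "not", "affect"), ("the", "same", "effect"), ("the", "the", "effect")]
est = NaiveDensityEstimator().fit(rows)
print(est.p_row(("the", "not", "effect")))          # product of three smoothed estimates
```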
Is this an interesting learning algorithm?
• For n-grams, what is P̂(C=effect | A=will)?
• In joint: P̂(C=effect | A=will) = 0.38
• In naïve: P̂(C=effect | A=will) = P̂(C=effect) = #[C=effect]/#totalNgrams = 0.94 (!)
• What is P̂(C=effect | B=no)?
• In joint: P̂(C=effect | B=no) = 0.999
• In naïve: P̂(C=effect | B=no) = P̂(C=effect) = 0.94
No.
Ended here….
Probability - what you need to
really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
• Naïve Bayes density estimators and classifiers
• Conditional independence…more on this next week!