CS 2750: Machine Learning
Probability Review
Prof. Adriana Kovashka
University of Pittsburgh
February 29, 2016
Plan for today and next two classes
• Probability review
• Density estimation
• Naïve Bayes and Bayesian Belief Networks
Procedural View
• Training Stage:
– Raw Data x → (Feature Extraction) → Training Data {(x, y)} → (Learning) → f
• Testing Stage:
– Raw Data x → (Feature Extraction) → Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
3
Statistical Estimation View
• Probabilities to the rescue:
– x and y are random variables
– D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y)
• IID: Independent Identically Distributed
– Both training & testing data sampled IID from P(X,Y)
– Learn on training set
– Have some hope of generalizing to test set
(C) Dhruv Batra
4
Probability
• A is a non-deterministic event
– Can think of A as a boolean-valued variable
• Examples
– A = your next patient has cancer
– A = Rafael Nadal wins US Open 2016
(C) Dhruv Batra
5
Interpreting Probabilities
• What does P(A) mean?
• Frequentist View
– lim as N→∞ of #(A is true) / N
– limiting frequency of a repeating non-deterministic event
• Bayesian View
– P(A) is your “belief” about A
• Market Design View
– P(A) tells you how much you would bet
(C) Dhruv Batra
6
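To illustrate the frequentist view, here is a minimal sketch (not from the slides, with a made-up ground-truth probability) estimating P(A) as the limiting frequency of a repeated event:

```python
# Minimal sketch (not from the slides): the frequentist view of P(A)
# as the limiting frequency #(A is true) / N of a repeated event.
import random

random.seed(0)
p_true = 0.3                      # hypothetical ground-truth P(A)
for N in [100, 10_000, 1_000_000]:
    count_A = sum(random.random() < p_true for _ in range(N))
    print(N, count_A / N)         # the ratio approaches 0.3 as N grows
```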
Axioms of Probability Theory
• All probabilities between 0 and 1
0 ≤ P(A) ≤ 1
• True proposition has probability 1, false has
probability 0.
P(true) = 1
P(false) = 0
• The probability of a disjunction is:
P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
[Venn diagram: A, B, and their overlap A ∧ B]
Slide credit: Ray Mooney
7
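A quick numeric check of the disjunction axiom (a minimal sketch, not from the slides), enumerating a made-up set of equally likely worlds:

```python
# Minimal sketch (not from the slides): check P(A or B) = P(A) + P(B) - P(A and B)
# by enumerating four equally likely "worlds" over the truth values of A and B.
from itertools import product

worlds = list(product([True, False], repeat=2))
p_world = 1.0 / len(worlds)

P_A = sum(p_world for a, b in worlds if a)
P_B = sum(p_world for a, b in worlds if b)
P_A_and_B = sum(p_world for a, b in worlds if a and b)
P_A_or_B = sum(p_world for a, b in worlds if a or b)

assert abs(P_A_or_B - (P_A + P_B - P_A_and_B)) < 1e-12
print(P_A, P_B, P_A_and_B, P_A_or_B)  # 0.5 0.5 0.25 0.75
```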
Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(false) = 0
• P(true) = 1
• P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
[Venn diagram, “Visualizing A”: the event space of all possible worlds has area 1; worlds in which A is true form an oval whose area is P(A); the remaining worlds are those in which A is false]
(C) Dhruv Batra
Image Credit: Andrew Moore
8
Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(false) = 0
• P(true) = 1
• P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.
(C) Dhruv Batra
Image Credit: Andrew Moore
9
Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(false) = 0
• P(true) = 1
• P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
The area of A can't get any bigger than 1, and an area of 1 would mean all worlds will have A true.
(C) Dhruv Batra
Image Credit: Andrew Moore
10
Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(false) = 0
• P(true) = 1
• P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
[Venn diagram: A and B overlap; P(A ∨ B) is the combined area and P(A ∧ B) is the overlap. Simple addition and subtraction.]
(C) Dhruv Batra
Image Credit: Andrew Moore
11
Joint Distribution
• The joint probability distribution for a set of random variables
X1,…,Xn gives the probability of every combination of values: P(X1,…,Xn).
If all variables are discrete with v values each, this is an n-dimensional
array with v^n entries, and all v^n entries must sum to 1.
positive:
         circle   square
red      0.20     0.02
blue     0.02     0.01

negative:
         circle   square
red      0.05     0.30
blue     0.20     0.20
• The probability of all possible conjunctions (assignments of values
to some subset of variables) can be calculated by summing the
appropriate subset of values from the joint distribution.
P(red ∧ circle) = 0.20 + 0.05 = 0.25
P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
• Therefore, all conditional probabilities can also be calculated.
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
Slide credit: Ray Mooney
12
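To make the table concrete, here is a minimal Python sketch (not part of the original slides) that recovers the marginals and the conditional above by summing entries of the joint:

```python
# Minimal sketch (not from the slides): marginals and conditionals
# from the joint table P(class, color, shape) given above.
joint = {
    ("positive", "red",  "circle"): 0.20,
    ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02,
    ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05,
    ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20,
    ("negative", "blue", "square"): 0.20,
}

# Marginals: sum out the variables we don't care about.
p_red_circle = sum(p for (c, col, sh), p in joint.items()
                   if col == "red" and sh == "circle")               # 0.25
p_red = sum(p for (c, col, sh), p in joint.items() if col == "red")  # 0.57

# Conditional: ratio of joint to marginal.
p_pos_red_circle = joint[("positive", "red", "circle")]              # 0.20
p_pos_given_red_circle = p_pos_red_circle / p_red_circle             # 0.80
print(p_red_circle, p_red, p_pos_given_red_circle)
```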
Marginal Distributions
[Figure: marginalizing a joint distribution]
Sum rule: p(x) = Σ_y p(x, y)
(C) Dhruv Batra
Slide Credit: Erik Sudderth
13
Conditional Probabilities
• P(A | B) = In worlds where B is true, fraction where A is true
P(A | B) = P(A ∧ B) / P(B)
• Example
– H: “Have a headache”
– F: “Coming down with flu”
P(H) = 1/10
P(F) = 1/40
P(H | F) = 1/2
[Venn diagram: worlds where F holds overlap with worlds where H holds]
Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.
(C) Dhruv Batra
14
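A quick numeric check of the headache/flu example (a minimal sketch, not from the slides): from P(H), P(F), and P(H | F) we can recover the joint P(H ∧ F) and the reverse conditional P(F | H).

```python
# Minimal sketch (not from the slides): headache/flu numbers.
p_H = 1 / 10         # P(Have a headache)
p_F = 1 / 40         # P(Coming down with flu)
p_H_given_F = 1 / 2  # P(H | F)

# Product rule: P(H and F) = P(H | F) * P(F)
p_H_and_F = p_H_given_F * p_F       # 1/80
# Reverse conditional: P(F | H) = P(H and F) / P(H)
p_F_given_H = p_H_and_F / p_H       # 1/8
print(p_H_and_F, p_F_given_H)       # 0.0125 0.125
```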
Conditional Probabilities
• P(Y=y | X=x)
• What do you believe about Y=y, if I tell you X=x?
• P(Rafael Nadal wins US Open 2016)?
• What if I tell you:
– He has won the US Open twice
– Novak Djokovic is ranked 1; just won Australian Open
(C) Dhruv Batra
15
Conditional Distributions
Product rule: p(x, y) = p(y | x) p(x) = p(x | y) p(y)
(C) Dhruv Batra
Slide Credit: Erik Sudderth
16
Conditional Probabilities
Figures from Bishop
17
Chain rule
• Generalized product rule:
P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(Xn | X1, …, Xn−1)
• Example:
P(A4, A3, A2, A1) = P(A4 | A3, A2, A1) P(A3 | A2, A1) P(A2 | A1) P(A1)
Equations from Wikipedia
18
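As a sanity check (a minimal sketch, not from the slides, with made-up probabilities), the chain-rule factorization can be verified numerically on a small discrete joint:

```python
# Minimal sketch (not from the slides): verify the chain rule
# P(a, b, c) = P(a) P(b | a) P(c | a, b) on a made-up discrete joint.
from itertools import product

vals = [0, 1]
weights = [0.10, 0.05, 0.15, 0.10, 0.20, 0.05, 0.25, 0.10]  # sums to 1
joint = {abc: w for abc, w in zip(product(vals, repeat=3), weights)}

def marg(**fixed):
    """Marginal probability of the fixed assignments, summing out the rest."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

a, b, c = 1, 0, 1
lhs = joint[(a, b, c)]
rhs = (marg(a=a)                              # P(a)
       * marg(a=a, b=b) / marg(a=a)           # P(b | a)
       * joint[(a, b, c)] / marg(a=a, b=b))   # P(c | a, b)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```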
Independence
• A and B are independent iff:
P(A | B) = P(A)
P(B | A) = P(B)
These two constraints are logically equivalent
• Therefore, if A and B are independent:
P(A | B) = P(A ∧ B) / P(B) = P(A)
P(A ∧ B) = P(A) P(B)
Slide credit: Ray Mooney
19
Independence
• Marginal: P satisfies (X ⊥ Y) if and only if
– P(X=x, Y=y) = P(X=x) P(Y=y), ∀x ∈ Val(X), ∀y ∈ Val(Y)
• Conditional: P satisfies (X ⊥ Y | Z) if and only if
– P(X, Y | Z) = P(X | Z) P(Y | Z), ∀x ∈ Val(X), ∀y ∈ Val(Y), ∀z ∈ Val(Z)
(C) Dhruv Batra
20
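A minimal sketch (not from the slides) of testing marginal independence numerically on a discrete joint table:

```python
# Minimal sketch (not from the slides): check whether X and Y are
# independent in a discrete joint P(X, Y) by comparing the joint to
# the product of its marginals.
def is_independent(joint, tol=1e-9):
    """joint: dict mapping (x, y) -> probability, summing to 1."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) < tol
               for x in xs for y in ys)

# Independent example: the joint is an outer product of its marginals.
indep = {(x, y): px * py
         for x, px in [("a", 0.3), ("b", 0.7)]
         for y, py in [(0, 0.6), (1, 0.4)]}
print(is_independent(indep))   # True

# Dependent example: the color/shape marginal from the joint-distribution slide.
dep = {("red", "circle"): 0.25, ("red", "square"): 0.32,
       ("blue", "circle"): 0.22, ("blue", "square"): 0.21}
print(is_independent(dep))     # False
```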
Independent Random Variables
(C) Dhruv Batra
Slide Credit: Erik Sudderth
21
Other Concepts
• Expectation: E[f] = Σ_x p(x) f(x) (or ∫ p(x) f(x) dx for continuous x)
• Variance: var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
• Covariance: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
Equations from Bishop
22
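A minimal sketch (not from the slides) computing these quantities for a small made-up discrete joint over (x, y):

```python
# Minimal sketch (not from the slides): expectation, variance, and
# covariance for a small made-up discrete joint over (x, y).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

E_x = sum(p * x for (x, y), p in joint.items())
E_y = sum(p * y for (x, y), p in joint.items())
var_x = sum(p * (x - E_x) ** 2 for (x, y), p in joint.items())
cov_xy = sum(p * (x - E_x) * (y - E_y) for (x, y), p in joint.items())

print(E_x, E_y, var_x, cov_xy)  # 0.7 0.6 0.21 -0.02 (up to floating point)
```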
Entropy
H(X) = − Σ_x p(x) log p(x): the expected information (bits, for log base 2) needed to encode a sample of X; largest for a uniform distribution, zero for a deterministic one.
(C) Dhruv Batra
Slide Credit: Sam Roweis
24
KL-Divergence / Relative Entropy
KL(p || q) = Σ_x p(x) log (p(x) / q(x)): a non-negative, asymmetric measure of how much q differs from p; zero if and only if p = q.
(C) Dhruv Batra
Slide Credit: Sam Roweis
25
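A minimal sketch (not from the slides) computing entropy and KL divergence for discrete distributions represented as probability lists:

```python
# Minimal sketch (not from the slides): entropy and KL divergence
# for discrete distributions given as lists of probabilities.
import math

def entropy(p):
    """H(p) = -sum p_i log2 p_i, in bits; 0 * log 0 treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))                 # 2.0 bits
print(entropy(skewed))                  # about 1.36 bits
print(kl_divergence(skewed, uniform))   # positive
print(kl_divergence(uniform, uniform))  # 0.0
```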
Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)

Simple proof from definition of conditional probability:
P(H | E) = P(H ∧ E) / P(E)   (Def. cond. prob.)
P(E | H) = P(H ∧ E) / P(H)   (Def. cond. prob.)
Therefore: P(H ∧ E) = P(E | H) P(H) = P(H | E) P(E)
QED: P(H | E) = P(E | H) P(H) / P(E)
26
Adapted from Ray Mooney
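To make the theorem concrete, here is a minimal sketch (not from the slides, with hypothetical numbers) of the classic diagnostic-test update for the "next patient has cancer" event mentioned earlier:

```python
# Minimal sketch (not from the slides); all numbers are hypothetical.
# H = patient has cancer, E = test comes back positive.
p_H = 0.01             # prior P(H)
p_E_given_H = 0.9      # test sensitivity P(E | H)
p_E_given_notH = 0.05  # false positive rate P(E | not H)

# Total probability: P(E) = P(E | H) P(H) + P(E | not H) P(not H)
p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)

# Bayes theorem: P(H | E) = P(E | H) P(H) / P(E)
p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)     # about 0.15: a positive test raises but does not settle the question
```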
Probabilistic Classification
• Let Y be the random variable for the class which takes values
{y1,y2,…ym}.
• Let X be the random variable describing an instance consisting
of a vector of values for n features <X1,X2…Xn>, let xk be a
possible value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk) for i=1…m
• However, given no other assumptions, this requires a table
giving the probability of each category for each possible instance
in the instance space, which is impossible to accurately estimate
from a reasonably-sized training set.
– Assuming Y and all Xi are binary, we need 2^n entries to specify
P(Y=pos | X=xk) for each of the 2^n possible xk’s, since
P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk)
– Compared to 2^(n+1) – 1 entries for the joint distribution P(Y,X1,X2…Xn)
27
Slide credit: Ray Mooney
Bayesian Categorization
• Determine the category of xk by determining for each yi the posterior
P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
where P(Y=yi) is the prior and P(X=xk | Y=yi) is the likelihood.
• P(X=xk) can be determined since categories are complete and disjoint.
P(X=xk) = Σ_{i=1}^m P(Y=yi) P(X=xk | Y=yi)
Σ_{i=1}^m P(Y=yi | X=xk) = Σ_{i=1}^m P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1
28
Adapted from Ray Mooney
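A minimal sketch (not from the slides), with hypothetical priors and likelihoods, of the posterior computation above:

```python
# Minimal sketch (not from the slides): Bayesian categorization with
# hypothetical priors P(Y=yi) and likelihoods P(X=xk | Y=yi).
priors = {"pos": 0.4, "neg": 0.6}
likelihood_xk = {"pos": 0.05, "neg": 0.01}   # P(X=xk | Y=yi) for one instance xk

# Evidence: P(X=xk) = sum_i P(Y=yi) P(X=xk | Y=yi)
p_xk = sum(priors[y] * likelihood_xk[y] for y in priors)

# Posterior for each class; these sum to 1 over the classes.
posterior = {y: priors[y] * likelihood_xk[y] / p_xk for y in priors}
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)   # "pos" is the more probable class here
```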
Bayesian Categorization (cont.)
• Need to know:
– Priors: P(Y=yi)
– Conditionals (likelihood): P(X=xk | Y=yi)
• P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in yi then P(Y=yi) = ni / |D|
• Too many possible instances (e.g. 2^n for binary
features) to estimate all P(X=xk | Y=yi).
• Need to make some sort of independence
assumptions about the features to make learning
tractable (more details later).
29
Adapted from Ray Mooney
Likelihood / Prior / Posterior
• A hypothesis is denoted as h; it is one
member of the hypothesis space H
• A set of training examples is denoted as D,
a collection of (x, y) pairs for training
• Pr(h) – the prior probability of the
hypothesis – without observing any training
data, what’s the probability that h is the
target function we want?
30
Slide content from Rebecca Hwa
Likelihood / Prior / Posterior
• Pr(D) – the prior probability of the
observed data – chance of getting the
particular set of training examples D
• Pr(h|D) – the posterior probability of h –
what is the probability that h is the target
given that we’ve observed D?
• Pr(D|h) – the probability of getting D if h
were true (a.k.a. likelihood of the data)
• Pr(h|D) = Pr(D|h)Pr(h)/Pr(D)
31
Slide content from Rebecca Hwa
MAP vs MLE Estimation
• Maximum-a-posteriori (MAP) estimation:
– hMAP = argmaxh Pr(h|D)
= argmaxh Pr(D|h)Pr(h)/Pr(D)
= argmaxh Pr(D|h)Pr(h)
• Maximum likelihood estimation (MLE):
– hML = argmaxh Pr(D|h)
32
Slide content from Rebecca Hwa
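A minimal sketch (not from the slides) contrasting MLE and MAP for the bias of a coin with a Beta prior (hypothetical data and prior; the standard closed-form MAP estimate for a Beta prior is assumed):

```python
# Minimal sketch (not from the slides): MLE vs MAP for the bias theta
# of a coin, with a Beta(a, b) prior (hypothetical numbers).
heads, tails = 7, 3          # observed data D
a, b = 2.0, 2.0              # Beta prior pseudo-counts (plays the role of Pr(h))

# MLE: argmax_theta Pr(D | theta) = heads / (heads + tails)
theta_mle = heads / (heads + tails)

# MAP: argmax_theta Pr(D | theta) Pr(theta)
#      = (heads + a - 1) / (heads + tails + a + b - 2) for a Beta prior
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)

print(theta_mle, theta_map)  # 0.7 vs about 0.67: the prior pulls the estimate toward 0.5
```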