Transcript Document
Bayesian Learning, Cont’d
Administrivia
• Various homework bugs:
• Due: Oct 12 (Tues) not 9 (Sat)
• Problem 3 should read:
• (duh)
• (some) info on naive Bayes in Sec. 4.3 of
Administrivia
• Another bug in last time’s lecture:
• Multivariate Gaussian should look like:
f(x) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
5 minutes of math...
• Joint probabilities
• Given d different random vars, X_1, X_2, ..., X_d
• The “joint” probability of them taking on the simultaneous values x_1, x_2, ..., x_d is given by Pr[X_1 = x_1, X_2 = x_2, ..., X_d = x_d]
• Or, for shorthand, Pr[x_1, x_2, ..., x_d]
• Closely related to the “joint PDF”, f(x_1, x_2, ..., x_d)
5 minutes of math...
• Independence:
• Two random variables are statistically independent iff: f(x, y) = f(x) f(y)
• Or, equivalently (usually for discrete RVs): Pr[X = x, Y = y] = Pr[X = x] Pr[Y = y]
• For multivariate RVs: f(x_1, ..., x_d) = f(x_1) f(x_2) ⋯ f(x_d)
Exercise
• Suppose you’re given the PDF:
• Where z is a normalizing constant.
• What must z be to make this a legitimate
PDF?
• Are the two random variables independent? Why or why not?
• What about the PDF:
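Neither PDF survived in the transcript, but a hedged stand-in shows how both questions go. Assume (purely for illustration) the density f(x, y) = z · x · y on 0 ≤ x, y ≤ 1:

∫₀¹ ∫₀¹ z x y dx dy = z · (1/2) · (1/2) = z/4 = 1, so z = 4 makes f a legitimate PDF.
f(x, y) = 4xy = (2x)(2y) = f(x) · f(y), so x and y are independent here, because the joint factors into its marginals. A density that does not factor this way (e.g., one containing an x + y term) would not describe independent variables.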
Parameterizing PDFs
• Given training data, [X, Y], w/ discrete labels Y
• Break data out into sets X_1 = {x : y = y_1}, X_2 = {x : y = y_2}, etc.
• Want to come up with models f(X; Θ_1), f(X; Θ_2), etc., one per class
• Suppose the individual f()s are Gaussian, need
the params μ and σ
• How do you get the params?
• Now, what if the f()s are something really funky you’ve never seen before in your life, with some unfamiliar parameter vector Θ?
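For the Gaussian case, a minimal sketch (assumed variable names, not code from the lecture) is just per-class sample statistics:

import numpy as np

def fit_gaussian_per_class(X, Y):
    """Estimate (mu, sigma) of a univariate Gaussian for each class label in Y."""
    params = {}
    for label in np.unique(Y):
        X_c = X[Y == label]                       # subset of data belonging to this class
        params[label] = (X_c.mean(), X_c.std())   # sample mean and (ML) standard deviation
    return params

The harder question on the slide, what to do when f() is something unfamiliar, is what the maximum likelihood machinery on the next slides addresses.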
Maximum likelihood
• Principle of maximum likelihood:
• Pick the parameters that make the data as
probable (or, in general “likely”) as possible
• Regard the probability function as a func of two variables, data and parameters: L(X; Θ) = f(X; Θ)
• Function L is the “likelihood function”
• Want to pick the Θ that maximizes L
Example
• Consider the exponential PDF: f(x; τ) = (1/τ) exp(−x/τ), for x ≥ 0
• Can think of this as either a function of x or τ
Exponential as a fn of x
Exponential as a fn of τ
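A small sketch of the two views (assuming the parameterization f(x; τ) = (1/τ) exp(−x/τ); not code from the slides):

import numpy as np

def exp_pdf(x, tau):
    """Exponential density f(x; tau) = (1/tau) * exp(-x / tau) for x >= 0."""
    return (1.0 / tau) * np.exp(-x / tau)

# Fixed tau, varying x: a probability density over possible data values.
xs = np.linspace(0.0, 10.0, 200)
density_curve = exp_pdf(xs, tau=2.0)

# Fixed (observed) x, varying tau: a likelihood function over the parameter.
taus = np.linspace(0.1, 10.0, 200)
likelihood_curve = exp_pdf(3.0, taus)

The first curve integrates to 1 over x; the second generally does not integrate to 1 over τ, which is why it is called a likelihood rather than a density.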
Max likelihood params
• So, for a fixed set of data, X, want the parameter Θ that maximizes L
• Hold X constant, optimize over Θ
• How?
• More important: f() is usually a function of a
single data point (possibly vector), but L is a
func. of a set of data
• How do you extend f() to set of data?
IID Samples
• In supervised learning, we usually assume that
data points are sampled independently and
from the same distribution
• IID assumption: data are independent and
identically distributed
• ⇒ joint PDF can be written as product of individual (marginal) PDFs: f(x_1, ..., x_n; Θ) = f(x_1; Θ) f(x_2; Θ) ⋯ f(x_n; Θ)
The max likelihood recipe
• Start with IID data
• Assume model for individual data point, f(X;Θ)
• Construct joint likelihood function (PDF): L(X; Θ) = ∏_i f(x_i; Θ)
• Find the params Θ that maximize L
• (If you’re lucky): Differentiate L w.r.t. Θ, set
=0 and solve
• Repeat for each class
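As a concrete run through the recipe (a sketch, not the lecture’s code, and assuming the exponential model f(x; τ) = (1/τ) exp(−x/τ)):

import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(tau, data):
    """-log L(X; tau) for L(X; tau) = prod_i (1/tau) exp(-x_i / tau)."""
    return len(data) * np.log(tau) + np.sum(data) / tau

# Step 1: IID data (made-up numbers), assumed to all come from one class.
data = np.array([0.8, 2.3, 1.1, 0.4, 3.0])

# Steps 2-4: model f(x; tau), joint likelihood L, maximize over tau > 0
# (minimizing -log L is equivalent and numerically nicer).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0),
                         args=(data,), method="bounded")
tau_ml = result.x

# The "if you're lucky" route gives a closed form here: tau_hat = sample mean.
assert abs(tau_ml - data.mean()) < 1e-3

You would then repeat this fit once per class, using only that class’s data.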
Exercise
• Find the maximum likelihood estimator of μ for the univariate Gaussian: f(x; μ, σ) = (1/(σ √(2π))) exp(−(x − μ)² / (2σ²))
• Find the maximum likelihood estimator of β for the degenerate gamma distribution:
• Hint: consider the log of the likelihood fns in
both cases
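For the Gaussian part, the log-likelihood route sketches out as follows (a standard derivation, not reproduced from the slides):

log L(X; μ, σ) = Σ_i log f(x_i; μ, σ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_i (x_i − μ)²
∂ log L / ∂μ = (1/σ²) Σ_i (x_i − μ) = 0  ⇒  μ̂ = (1/n) Σ_i x_i

i.e., the maximum likelihood estimator of μ is the sample mean.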
Putting the parts together
[Diagram: complete training data [X, Y]]
5 minutes of math...
• Marginal probabilities
• If you have a joint PDF, f(x_1, x_2)
• ... and want to know about the probability of just one RV (regardless of what happens to the others)
• Marginal PDF of x_1 or x_2: f(x_1) = ∫ f(x_1, x_2) dx_2 and f(x_2) = ∫ f(x_1, x_2) dx_1
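The same idea in a tiny discrete sketch (made-up numbers, summing instead of integrating):

import numpy as np

# Hypothetical joint distribution Pr[X1 = i, X2 = j] as a table:
# rows index values of X1, columns index values of X2.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Marginalize by summing out the other variable.
p_x1 = joint.sum(axis=1)   # Pr[X1 = i] = Σ_j Pr[X1 = i, X2 = j]  -> [0.3, 0.7]
p_x2 = joint.sum(axis=0)   # Pr[X2 = j] = Σ_i Pr[X1 = i, X2 = j]  -> [0.4, 0.6]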
5 minutes of math...
• Conditional probabilities
• Suppose you have a joint PDF, f(H,W)
• Now you get to see one of the values, e.g.,
H=“183cm”
• What’s your probability estimate of W, given this new knowledge?
• Conditional PDF: f(W | H) = f(H, W) / f(H)
Everything’s random...
• Basic Bayesian viewpoint:
• Treat (almost) everything as a random variable
• Data/independent var: X vector
• Class/dependent var: Y
• Parameters: Θ
• E.g., mean, variance, correlations,
multinomial params, etc.
• Use Bayes’ Rule to assess probabilities of
classes
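A minimal end-to-end sketch of that viewpoint (assumed univariate Gaussian class models and made-up numbers, not from the lecture):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density f(x; mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Per-class parameters Θ (e.g., fit by maximum likelihood) and class priors Pr[Y].
params = {"y1": (0.0, 1.0), "y2": (3.0, 1.5)}   # hypothetical (mu, sigma) per class
priors = {"y1": 0.6, "y2": 0.4}

x_new = 1.7   # a new data point to classify

# Bayes' rule: Pr[Y = c | x] is proportional to f(x | Y = c) * Pr[Y = c]
unnormalized = {c: gaussian_pdf(x_new, *params[c]) * priors[c] for c in params}
evidence = sum(unnormalized.values())             # f(x): the normalizing constant
posterior = {c: v / evidence for c, v in unnormalized.items()}
predicted_class = max(posterior, key=posterior.get)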