Transcript: Lecture 2
Machine Learning
Generative vs. Discriminative
Classifiers
Eric Xing
Lecture 2, August 12, 2010
Reading:
Generative and Discriminative
classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative:
Modeling the joint distribution
of all data
Discriminative:
Modeling only points
at the boundary
Generative vs. Discriminative
Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
Assume some functional form for P(X|Y), P(Y)
This is a ‘generative’ model of the data!
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X= x)
Discriminative classifiers (e.g., logistic regression)
Directly assume some functional form for P(Y|X)
This is a ‘discriminative’ model of the data!
(Graphical-model figures over the nodes Y_n and X_n for the generative and the discriminative classifier.)
Estimate parameters of P(Y|X) directly from training data
Suppose you know the following
…
Class-specific Dist.: P(X|Y)
p(X \mid Y = 1) = p_1(X; \mu_1, \Sigma_1)
p(X \mid Y = 2) = p_2(X; \mu_2, \Sigma_2)
Class prior (i.e., "weight"): P(Y)
This is a generative model of the data!
Bayes classifier: predict the class k that maximizes P(Y = k \mid X = x) \propto P(X = x \mid Y = k)\, P(Y = k)
Optimal classification
Theorem: Bayes classifier is optimal!
That is, its expected classification error is no larger than that of any other classifier.
How to learn a Bayes classifier?
Recall density estimation: we need to estimate P(X | Y = k) and P(Y = k) for all k
Gaussian Discriminative Analysis
learning f: X → Y, where
X is a vector of real-valued features, X_n = \langle X_{n,1}, \ldots, X_{n,m} \rangle
Y is an indicator vector
What does that imply about the form of P(Y|X)?
The joint probability of a datum and its label is:
p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \sigma)
= \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma^2}\,(x_n - \mu_k)^2 \right\}
Given a datum xn, we predict its label using the conditional probability of the label
given the datum:
p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k\, \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \right\}}{\sum_{k'} \pi_{k'}\, \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \right\}}
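As a concrete illustration (not from the lecture), the following minimal Python sketch evaluates this posterior for hypothetical one-dimensional parameters with a shared variance, exactly as in the formula above:

import numpy as np

# Hypothetical GDA parameters: K = 3 classes, shared variance (illustrative values only)
pi = np.array([0.5, 0.3, 0.2])      # class priors pi_k
mu = np.array([-1.0, 0.0, 2.0])     # class means mu_k
sigma2 = 1.0                        # shared variance sigma^2

def gda_posterior(x):
    # p(y^k = 1 | x) via Bayes rule with Gaussian class-conditionals
    lik = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    joint = pi * lik                # pi_k * p(x | y^k = 1)
    return joint / joint.sum()      # normalize over k

print(gda_posterior(0.5))           # posterior over the 3 classes at x = 0.5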
Conditional Independence
X is conditionally independent of Y given Z, if the probability
distribution governing X is independent of the value of Y, given
the value of Z
Which we often write as P(X \mid Y, Z) = P(X \mid Z)
e.g.,
Equivalent to: P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)
Naïve Bayes Classifier
When X is multivariate-Gaussian vector:
The joint probability of a datum and its label is:
p(\mathbf{x}_n, y_n^k = 1 \mid \mu, \Sigma) = p(y_n^k = 1)\, p(\mathbf{x}_n \mid y_n^k = 1, \mu, \Sigma)
= \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\mathbf{x}_n - \vec{\mu}_k) \right\}
The naïve Bayes simplification
(Graphical model: Y_n with children X_{n,1}, X_{n,2}, …, X_{n,m}.)
p(\mathbf{x}_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_{n,j} \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j})
= \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma_{k,j}^2} (x_{n,j} - \mu_{k,j})^2 \right\}
More generally:
p(\mathbf{x}_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_{n,j} \mid y_n, \eta)
Where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density
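To make the factored model concrete, here is a small Gaussian naïve Bayes sketch (my own illustration, assuming continuous features and plain MLE estimates of the per-class, per-feature means and variances):

import numpy as np

def fit_gaussian_nb(X, y, K):
    # MLE of class priors and per-class, per-feature means and variances
    pi = np.array([(y == k).mean() for k in range(K)])
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
    var = np.array([X[y == k].var(axis=0) for k in range(K)])
    return pi, mu, var

def predict_proba(X, pi, mu, var):
    # p(y = k | x) with conditionally independent Gaussian features
    # log p(x | y = k) = sum_j log N(x_j | mu_kj, var_kj)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                      + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)
    log_joint = np.log(pi)[None, :] + log_lik
    log_joint -= log_joint.max(axis=1, keepdims=True)   # subtract max for numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)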
The predictive distribution
Understanding the predictive distribution
p(y_n^k = 1 \mid x_n, \pi, \mu, \sigma) = \frac{p(y_n^k = 1, x_n \mid \pi, \mu, \sigma)}{p(x_n \mid \pi, \mu, \sigma)}
Under naïve Bayes assumption:
p(y_n^k = 1 \mid x_n, \pi, \mu, \sigma) = \frac{\pi_k\, N(x_n \mid \mu_k, \sigma_k)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \sigma_{k'})} \qquad (*)
= \frac{\pi_k \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_{k,j}^2} (x_n^j - \mu_{k,j})^2 + \log \sigma_{k,j} \right) + C \right\}}{\sum_{k'} \pi_{k'} \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_{k',j}^2} (x_n^j - \mu_{k',j})^2 + \log \sigma_{k',j} \right) + C \right\}} \qquad (**)
For two classes (i.e., K = 2), and when the two classes have the same
variance, ** turns out to be a logistic function:
p(y_n^1 = 1 \mid x_n)
= \frac{1}{1 + \dfrac{\pi_2 \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_j^2} (x_n^j - \mu_2^j)^2 + \log \sigma_j \right) + C \right\}}{\pi_1 \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_j^2} (x_n^j - \mu_1^j)^2 + \log \sigma_j \right) + C \right\}}}
= \frac{1}{1 + \exp\!\left\{ -\sum_j x_n^j \frac{\mu_1^j - \mu_2^j}{\sigma_j^2} + \sum_j \frac{[\mu_1^j]^2 - [\mu_2^j]^2}{2\sigma_j^2} - \log \frac{\pi_1}{1 - \pi_1} \right\}}
= \frac{1}{1 + e^{-\theta^T x_n}}
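A quick numerical sanity check of this reduction (hypothetical two-class, shared-variance parameters of my own choosing): the Bayes-rule posterior and the logistic function of the implied linear score give the same number.

import numpy as np

pi1, pi2 = 0.6, 0.4                          # class priors
mu1 = np.array([1.0, -0.5, 2.0])             # class-1 means
mu2 = np.array([0.0, 0.5, 1.0])              # class-2 means
sig2 = np.array([1.0, 2.0, 0.5])             # shared variances sigma_j^2
x = np.array([0.3, 0.1, 1.5])

# Direct Bayes-rule posterior under the naive Bayes model
lik = lambda mu: np.exp(-(x - mu) ** 2 / (2 * sig2)).prod() / np.sqrt(2 * np.pi * sig2).prod()
post1 = pi1 * lik(mu1) / (pi1 * lik(mu1) + pi2 * lik(mu2))

# Sigmoid of the implied linear function theta^T x (plus intercept theta0)
theta = (mu1 - mu2) / sig2
theta0 = -((mu1 ** 2 - mu2 ** 2) / (2 * sig2)).sum() + np.log(pi1 / pi2)
post1_sigmoid = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(post1, post1_sigmoid)                  # the two values agree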
The decision boundary
The predictive distribution
p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\!\left( -\sum_{j=1}^{M} \theta_j x_n^j - \theta_0 \right)} = \frac{1}{1 + e^{-\theta^T x_n}}
The Bayes decision rule:
\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \frac{\dfrac{1}{1 + e^{-\theta^T x_n}}}{\dfrac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n
For multiple classes (i.e., K > 2), * corresponds to a softmax function:
p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}
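A minimal softmax sketch matching this expression, with arbitrary illustrative weight vectors θ_k (one row of Theta per class):

import numpy as np

def softmax_posterior(x, Theta):
    # p(y^k = 1 | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x)
    scores = Theta @ x
    scores -= scores.max()          # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

Theta = np.array([[0.5, -1.0], [0.0, 0.3], [-0.2, 0.8]])   # 3 classes, 2 features
print(softmax_posterior(np.array([1.0, 2.0]), Theta))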
Bayesian estimation
Generative vs. Discriminative
Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
Assume some functional form for P(X|Y), P(Y)
This is a ‘generative’ model of the data!
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X= x)
Discriminative classifiers:
Directly assume some functional form for P(Y|X)
This is a ‘discriminative’ model of the data!
Estimate parameters of P(Y|X) directly from training data
Linear Regression
The data:
(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)
Both nodes are observed:
X is an input vector
Y is a response vector
(Graphical-model plate over i = 1, …, N with nodes X_i and Y_i.)
(we first consider y as a generic
continuous response vector, then
we consider the special case of
classification where y is a discrete
indicator)
A regression scheme can be
used to model p(y|x) directly,
rather than p(x,y)
Linear Regression
Assume that Y (target) is a linear function of X (features):
e.g., \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2
let's assume a vacuous "feature" X0=1 (this is the intercept term, why?), and
define the feature vector to be:
then we have the following general representation of the linear function: \hat{y}(\mathbf{x}) = \theta^T \mathbf{x} = \sum_{j=0}^{m} \theta_j x_j
Our goal is to pick the optimal \theta. How?
We seek the \theta that minimizes the following cost function:
J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( \hat{y}_i(\mathbf{x}_i) - y_i \right)^2
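For reference, a small sketch (on synthetic data of my own) that minimizes J(θ) directly with an off-the-shelf least-squares solver; the closed-form and iterative solutions discussed next recover the same θ:

import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, m))])   # prepend the vacuous feature x0 = 1
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizes (1/2) sum_i (x_i^T theta - y_i)^2
print(theta_hat)                                   # close to theta_true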
The Least-Mean-Square (LMS)
method
Consider a gradient descent algorithm:
\theta_j^{t+1} = \theta_j^{t} - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
Now we have the following descent rule:
\theta_j^{t+1} = \theta_j^{t} + \alpha \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^T \theta^{t} \right) x_i^j
For a single training point, we have:
\theta_j^{t+1} = \theta_j^{t} + \alpha \left( y_i - \mathbf{x}_i^T \theta^{t} \right) x_i^j
This is known as the LMS update rule, or the Widrow-Hoff learning rule
This is actually a "stochastic", "coordinate" descent algorithm
This can be used as an on-line algorithm
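A sketch of the LMS rule run as a stochastic, on-line algorithm (the step size α and the synthetic data are my own illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
N, m = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, m))])
y = X @ np.array([0.5, 1.5, -2.0]) + 0.1 * rng.normal(size=N)

theta = np.zeros(m + 1)
alpha = 0.01                                   # learning rate
for epoch in range(50):
    for i in rng.permutation(N):               # visit training points in random order
        theta += alpha * (y[i] - X[i] @ theta) * X[i]   # LMS / Widrow-Hoff update
print(theta)                                   # approaches [0.5, 1.5, -2.0]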
Probabilistic Interpretation of
LMS
Let us assume that the target variable and the inputs are
related by the equation:
y_i = \theta^T \mathbf{x}_i + \varepsilon_i
where ε is an error term of unmodeled effects or random noise
Now assume that ε follows a Gaussian N(0,σ), then we have:
p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right\}
By independence assumption:
L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{\!n} \exp\!\left\{ -\frac{\sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right\}
Probabilistic Interpretation of
LMS, cont.
Hence the log-likelihood is:
l(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2
Do you recognize the last term?
Yes, it is: J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2
Thus, under the independence assumption, LMS is equivalent to MLE of θ!
Classification and logistic
regression
The logistic function
Logistic regression (sigmoid
classifier)
The conditional distribution: a Bernoulli
p(y \mid x) = \mu(x)^y \left( 1 - \mu(x) \right)^{1 - y}
where \mu is a logistic function
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We can use the brute-force gradient method as in linear regression
But we can also apply generic results by observing that p(y|x) is an exponential-family distribution, more specifically a generalized linear model (see future lectures …)
Training Logistic Regression:
MCLE
Estimate parameters θ = ⟨θ_0, θ_1, …, θ_m⟩ to maximize the conditional likelihood of the training data
Training data D = \{ \langle x_i, y_i \rangle \}_{i=1}^{N}
Data likelihood = \prod_i P(x_i, y_i \mid \theta)
Data conditional likelihood = \prod_i P(y_i \mid x_i, \theta)
Expressing Conditional Log
Likelihood
Recall the logistic function: \mu(x) = \frac{1}{1 + e^{-\theta^T x}}
and conditional likelihood: l(\theta) = \sum_i \left[ y_i \log \mu(x_i) + (1 - y_i) \log\!\left( 1 - \mu(x_i) \right) \right]
Maximizing Conditional Log
Likelihood
The objective: maximize the conditional log-likelihood l(θ) above
Good news: l(θ) is a concave function of θ
Bad news: no closed-form solution to maximize l(θ)
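Since there is no closed form, a simple gradient-ascent sketch is one option (my own illustration; it uses the standard conditional-likelihood gradient Σ_i (y_i − μ(x_i)) x_i, and the data and step size are synthetic):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = (rng.random(N) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)

theta = np.zeros(3)
alpha = 0.5                                  # step size
for t in range(2000):
    mu = sigmoid(X @ theta)
    theta += alpha * X.T @ (y - mu) / N      # ascend the conditional log-likelihood
print(theta)                                 # roughly recovers [-0.5, 2.0, -1.0]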
Newton's method
Finding a zero of a function f: iterate x^{t+1} = x^{t} - \frac{f(x^{t})}{f'(x^{t})}
Newton's method (cont'd)
To maximize the conditional likelihood l(θ):
since l is concave, we need to find θ* where l′(θ*) = 0!
So we can perform the following iteration: \theta^{t+1} = \theta^{t} - \frac{l'(\theta^{t})}{l''(\theta^{t})}
The Newton-Raphson method
In LR, θ is vector-valued, so we need the following generalization:
\theta^{t+1} = \theta^{t} - H^{-1} \nabla_\theta l(\theta^{t})
∇_θ is the gradient operator over the function
H is known as the Hessian of the function
Iteratively reweighted least squares
(IRLS)
Recall that in the least-squares estimate for linear regression, we have:
\theta = (X^T X)^{-1} X^T \vec{y}
which can also be derived from Newton-Raphson
Now for logistic regression:
\theta^{t+1} = \theta^{t} + (X^T W X)^{-1} X^T (\vec{y} - \vec{\mu}), \qquad W = \mathrm{diag}\!\left( \mu_i (1 - \mu_i) \right)
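A minimal sketch of this Newton/IRLS update (my own illustration, assuming the standard logistic-regression gradient X^T(y − μ) and Hessian −X^T W X with W = diag(μ_i(1 − μ_i))):

import numpy as np

def irls_logistic(X, y, iters=10):
    # Newton-Raphson (IRLS) for logistic regression
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        W = mu * (1.0 - mu)                    # diagonal of the weight matrix
        grad = X.T @ (y - mu)
        H = X.T @ (X * W[:, None])             # X^T W X
        theta += np.linalg.solve(H, grad)      # theta <- theta + (X^T W X)^{-1} X^T (y - mu)
    return theta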
Logistic regression: practical
issues
NR (IRLS) takes O(N + d³) per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations
Quasi-Newton methods, which approximate the Hessian, work faster
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice
Stochastic gradient descent can also be used if N is large; cf. the perceptron rule:
\theta^{t+1} = \theta^{t} + \alpha \left( y_i - \mu(\mathbf{x}_i; \theta^{t}) \right) \mathbf{x}_i
Generative vs. Discriminative
Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
Assume some functional form for P(X|Y), P(Y)
This is a ‘generative’ model of the data!
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X= x)
Discriminative classifiers:
Directly assume some functional form for P(Y|X)
This is a ‘discriminative’ model of the data!
Estimate parameters of P(Y|X) directly from training data
Naïve Bayes vs Logistic
Regression
Consider Y boolean, X continuous, X=<X1 ... Xm>
Number of parameters to estimate:
NB: the class prior plus, for each class and each of the m features, a mean μ_{k,j} and a variance σ_{k,j}² (4m + 1 parameters for boolean Y), which enter the posterior
p(y^k = 1 \mid x) = \frac{\pi_k \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_{k,j}^2} (x_j - \mu_{k,j})^2 + \log \sigma_{k,j} \right) + C \right\}}{\sum_{k'} \pi_{k'} \exp\!\left\{ -\sum_j \left( \frac{1}{2\sigma_{k',j}^2} (x_j - \mu_{k',j})^2 + \log \sigma_{k',j} \right) + C \right\}} \qquad (**)
LR: the m + 1 weights θ in
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
Estimation method:
NB parameter estimates are uncoupled
LR parameter estimates are coupled
Naïve Bayes vs Logistic
Regression
Asymptotic comparison (# training examples → infinity)
when model assumptions correct
NB, LR produce identical classifiers
when model assumptions incorrect
LR is less biased – does not assume conditional independence
therefore expected to outperform NB
Naïve Bayes vs Logistic
Regression
Non-asymptotic analysis (see [Ng & Jordan, 2002] )
convergence rate of parameter estimates – how many training
examples needed to assure good estimates?
NB order log m (where m = # of attributes in X)
LR order m
NB converges more quickly to its (perhaps less helpful)
asymptotic estimates
Rate of convergence: logistic
regression
Let hDis,n be logistic regression trained on n examples in m
dimensions. Then with high probability:
Implication: if we want the error of h_{Dis,n} to be within some small constant ε_0 of its asymptotic error, it suffices to pick order m examples
That is, it converges to its asymptotic classifier in order m examples
The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of m-dimensional linear separators is m
Rate of convergence: naïve
Bayes parameters
Let any ε_1, δ > 0, and any n ≥ 0 be fixed.
Assume that for some fixed ρ_0 > 0, we have that ρ_0 ≤ p(y = 1) ≤ 1 − ρ_0.
Let n = O((1/ε_1²) log(m/δ)).
Then with probability at least 1 − δ, after n examples:
1. For discrete input, the estimated P(x_i = b | y) is within ε_1 of its asymptotic value, for all i and b
2. For continuous inputs, the estimated mean and variance of x_i given y are within ε_1 of their asymptotic values, for all i and b
Some experiments from UCI data
sets
Case study
Dataset
20 News Groups (20 classes)
61,188 words, 18,774 documents
Experiment:
Solve only a two-class subset: 1 vs 2.
1,768 instances, 61,188 features.
Use dimensionality reduction on the data (SVD).
Use 90% as training set, 10% as test set.
Test prediction error used as accuracy measure.
Generalization error (1)
Versus training size
• 30 features.
• A fixed test set
• Training-set size varied from 10% to 100% of the available training data
Generalization error (2)
Versus model size
• Number of
dimensions of the
data varied from 5
to 50 in steps of 5
• The features were
chosen in
decreasing order of
their singular
values
• 90% versus 10%
split on training and
test
Summary
Naïve Bayes classifier
What's the assumption?
Why do we use it?
How do we learn it?
Logistic regression
Functional form follows from Naïve Bayes assumptions
For Gaussian Naïve Bayes, assuming the class-conditional variances do not depend on the class
For discrete-valued Naïve Bayes too
But training procedure picks parameters without the conditional independence
assumption
Gradient ascent/descent
– General approach when closed-form solutions unavailable
Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff
Appendix
Parameter Learning from iid Data
Goal: estimate distribution parameters from a dataset of N
independent, identically distributed (iid), fully observed,
training cases
D = {x1, . . . , xN}
Maximum likelihood estimation (MLE)
1. One of the most common estimators
2. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data:
L(\theta) = P(x_1, x_2, \ldots, x_N; \theta) = P(x_1; \theta)\, P(x_2; \theta) \cdots P(x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)
3. Pick the setting of parameters most likely to have generated the data we saw:
\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)
Example: Bernoulli model
Data:
We observed N iid coin tosses: D = {1, 0, 1, …, 0}
Representation:
Binary r.v.: x_n \in \{0, 1\}
Model:
How to write the likelihood of a single observation xi ?
P(x) = \begin{cases} 1 - \theta & \text{for } x = 0 \\ \theta & \text{for } x = 1 \end{cases} \quad\Longleftrightarrow\quad P(x) = \theta^{x} (1 - \theta)^{1 - x}
P(x_i) = \theta^{x_i} (1 - \theta)^{1 - x_i}
The likelihood of dataset D = \{x_1, \ldots, x_N\}:
P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{N} x_i} (1 - \theta)^{N - \sum_{i=1}^{N} x_i} = \theta^{\#\text{heads}} (1 - \theta)^{\#\text{tails}}
Maximum Likelihood Estimation
Objective function:
l(\theta; D) = \log P(D \mid \theta) = \log \theta^{n_h} (1 - \theta)^{n_t} = n_h \log \theta + (N - n_h) \log(1 - \theta)
We need to maximize this w.r.t. θ.
Take the derivative w.r.t. θ and set it to zero:
\frac{\partial l}{\partial \theta} = \frac{n_h}{\theta} - \frac{N - n_h}{1 - \theta} = 0
\Longrightarrow \hat\theta_{MLE} = \frac{n_h}{N}, \quad \text{or equivalently} \quad \hat\theta_{MLE} = \frac{1}{N} \sum_i x_i
(frequency as sample mean)
Sufficient statistics
The counts n_h and n_t, where n_h = \sum_i x_i, are sufficient statistics of the data D
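A short numerical check of this estimator on simulated coin tosses (the true θ = 0.3 below is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
x = (rng.random(1000) < 0.3).astype(int)   # 1000 Bernoulli(0.3) tosses
theta_mle = x.mean()                       # n_h / N, the sample frequency of heads
print(theta_mle)                           # close to 0.3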
Overfitting
Recall that for Bernoulli Distribution, we have
\hat\theta_{ML}^{head} = \frac{n^{head}}{n^{head} + n^{tail}}
What if we tossed too few times, so that we saw zero heads?
We would have \hat\theta_{ML}^{head} = 0, and we would predict that the probability of seeing a head next is zero!!!
The rescue: "smoothing"
Where n' is know as the pseudo- (imaginary) count
head
ML
n head n '
head
n
n tail n '
But can we make this more formal?
Bayesian Parameter Estimation
Treat the distribution parameters θ also as a random variable
The a posteriori distribution of θ after seeing the data is:
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
This is Bayes' rule: posterior = likelihood × prior / marginal likelihood
The prior p(.) encodes our prior knowledge about the domain
Frequentist Parameter Estimation
Two people with different priors p(θ) will end up with different estimates p(θ|D).
Frequentists dislike this “subjectivity”.
Frequentists think of the parameter as a fixed, unknown
constant, not a random variable.
Hence they have to come up with different "objective"
estimators (ways of computing from data), instead of using
Bayes’ rule.
These estimators have different properties, such as being “unbiased”, “minimum
variance”, etc.
The maximum likelihood estimator is one such estimator.
Discussion
θ or p(θ)? This is the problem!
Bayesians know it
Bayesian estimation for Bernoulli
Beta distribution:
P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}
When x is discrete: \Gamma(x + 1) = x\,\Gamma(x) = x!
Posterior distribution of :
P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \;\propto\; \theta^{n_h} (1 - \theta)^{n_t} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}
Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior
α and β are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts)
Bayesian estimation for
Bernoulli, con'd
Posterior distribution of :
P( | x1 ,..., xN )
p( x1 ,..., xN | ) p( )
nh (1 ) nt 1 (1 ) b 1 nh 1 (1 ) nt b 1
p( x1 ,..., xN )
Maximum a posteriori (MAP) estimation:
\hat\theta_{MAP} = \arg\max_\theta \log P(\theta \mid x_1, \ldots, x_N)
(The Beta parameters can be understood as pseudo-counts.)
Posterior mean estimation:
\hat\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}
Prior strength: A = α + β
A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts
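A small sketch comparing the MLE, the MAP estimate, and the posterior mean under a Beta(α, β) prior (the counts and hyperparameters below are illustrative):

n_h, n_t = 2, 8            # observed heads and tails
alpha, beta = 1.0, 1.0     # hyperparameters (pseudo-counts); prior strength A = alpha + beta
N = n_h + n_t

theta_mle = n_h / N
theta_map = (n_h + alpha - 1) / (N + alpha + beta - 2)    # mode of Beta(n_h + alpha, n_t + beta)
theta_mean = (n_h + alpha) / (N + alpha + beta)           # posterior mean
print(theta_mle, theta_map, theta_mean)                   # 0.2, 0.2, 0.25 for these values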
Effect of Prior Strength
Suppose we have a uniform prior (α/A = β/A = 1/2), and we observe \vec{n} = (n_h = 2, n_t = 8)
Weak prior, A = 2. Posterior prediction:
p(x = h \mid n_h = 2, n_t = 8, A = 2) = \frac{1 + 2}{2 + 10} = 0.25
Strong prior, A = 20. Posterior prediction:
p(x = h \mid n_h = 2, n_t = 8, A = 20) = \frac{10 + 2}{20 + 10} = 0.40
However, if we have enough data, it washes away the prior.
E.g., \vec{n} = (n_h = 200, n_t = 800). Then the estimates under the weak and strong prior are \frac{1 + 200}{2 + 1000} and \frac{10 + 200}{20 + 1000}, respectively, both of which are close to 0.2
Example 2: Gaussian density
Data:
We observed N iid real samples:
D={-0.1, 10, 1, -5.2, …, 3}
Model:
P(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
Log likelihood:
l(\theta; D) = \log P(D \mid \theta) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2
MLE: take derivative and set to zero:
\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_n (x_n - \mu) = 0 \quad\Longrightarrow\quad \hat\mu_{MLE} = \frac{1}{N} \sum_n x_n
\frac{\partial l}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_n (x_n - \mu)^2 = 0 \quad\Longrightarrow\quad \hat\sigma^2_{MLE} = \frac{1}{N} \sum_n (x_n - \hat\mu_{MLE})^2
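A direct check of these two estimators on simulated data (the true μ = 2.0 and σ = 1.5 are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=10000)

mu_mle = x.mean()                     # (1/N) sum_n x_n
var_mle = ((x - mu_mle) ** 2).mean()  # (1/N) sum_n (x_n - mu_MLE)^2  (note: 1/N, not 1/(N-1))
print(mu_mle, var_mle)                # close to 2.0 and 1.5^2 = 2.25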
MLE for a multivariate-Gaussian
It can be shown that the MLE for µ and Σ is
\hat\mu_{MLE} = \frac{1}{N} \sum_n \mathbf{x}_n
\hat\Sigma_{MLE} = \frac{1}{N} \sum_n (\mathbf{x}_n - \hat\mu_{ML})(\mathbf{x}_n - \hat\mu_{ML})^T = \frac{1}{N} S
where the scatter matrix is
S = \sum_n (\mathbf{x}_n - \hat\mu_{ML})(\mathbf{x}_n - \hat\mu_{ML})^T = \sum_n \mathbf{x}_n \mathbf{x}_n^T - N\, \hat\mu_{ML} \hat\mu_{ML}^T
with the data matrix X = [\,\mathbf{x}_1^T;\ \mathbf{x}_2^T;\ \ldots;\ \mathbf{x}_N^T\,] (one row per case) and \mathbf{x}_n = (x_n^1, \ldots, x_n^K)^T
The sufficient statistics are \sum_n \mathbf{x}_n and \sum_n \mathbf{x}_n \mathbf{x}_n^T.
Note that X^T X = \sum_n \mathbf{x}_n \mathbf{x}_n^T may not be full rank (e.g., if N < D), in which case \Sigma_{ML} is not invertible
Bayesian estimation
Normal Prior:
P(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\!\left\{ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right\}
Joint probability:
P(\mathbf{x}, \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} \cdot \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\!\left\{ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right\}
Posterior:
P(\mu \mid \mathbf{x}) = \frac{1}{(2\pi\tilde\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(\mu - \tilde\mu)^2}{2\tilde\sigma^2} \right\}
where
\tilde\mu = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \tilde\sigma^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1}
and \bar{x} = \frac{1}{N} \sum_n x_n is the sample mean
Bayesian estimation: unknown µ, known σ
\tilde\mu = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \frac{1}{\tilde\sigma^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}
The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels.
The precision of the posterior, 1/\tilde\sigma^2, is the precision of the prior, 1/\sigma_0^2, plus one contribution of data precision 1/\sigma^2 for each observed data point.
Sequentially updating the mean: \mu^* = 0.8 (unknown), (\sigma^2)^* = 0.1 (known)
Effect of a single data point:
\mu_1 = \mu_0 + (x - \mu_0)\, \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} = x - (x - \mu_0)\, \frac{\sigma^2}{\sigma_0^2 + \sigma^2}
Uninformative (vague/flat) prior, \sigma_0^2 \to \infty: the posterior reduces to the MLE, \tilde\mu \to \bar{x} and \tilde\sigma^2 \to \sigma^2/N
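A sketch of this sequential posterior update, using the same setting as the figure (μ* = 0.8 unknown, σ² = 0.1 known; the prior hyperparameters are my own illustrative choices):

import numpy as np

rng = np.random.default_rng(5)
sigma2 = 0.1                          # known observation variance
mu_true = 0.8                         # unknown mean being estimated
mu_post, var_post = 0.0, 1.0          # prior N(mu_0 = 0, sigma_0^2 = 1)

for x in rng.normal(mu_true, np.sqrt(sigma2), size=20):
    # treat the current posterior as the prior and fold in one observation
    prec = 1.0 / var_post + 1.0 / sigma2
    mu_post = (mu_post / var_post + x / sigma2) / prec
    var_post = 1.0 / prec
print(mu_post, var_post)              # posterior mean approaches 0.8, variance shrinks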