Transcript: Lecture 2

Machine Learning
Generative versus Discriminative Classifiers
Eric Xing
Lecture 2, August 12, 2010
Reading:
© Eric Xing @ CMU, 2006-2010
Generative and Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

- Generative: modeling the joint distribution of all data
- Discriminative: modeling only points at the boundary
Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y); this is a 'generative' model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)

- Discriminative classifiers (e.g., logistic regression):
  - Directly assume some functional form for P(Y|X); this is a 'discriminative' model of the data!
  - Estimate parameters of P(Y|X) directly from training data

[Figure: graphical models, Yn → Xn for the generative model and Xn → Yn for the discriminative model]
Suppose you know the following …

- Class-specific distributions: P(X|Y)
  p(X | Y = 1) = p_1(X; μ_1, Σ_1)
  p(X | Y = 2) = p_2(X; μ_2, Σ_2)

- Class prior (i.e., "weight"): P(Y)

- This is a generative model of the data!

- Bayes classifier: predict the class with the largest posterior, arg max_k P(Y = k | X = x)
Optimal classification

- Theorem: the Bayes classifier is optimal! That is, its error satisfies
  P(error | h_Bayes) ≤ P(error | h) for any other classifier h

- How to learn a Bayes classifier?
  Recall density estimation: we need to estimate P(X | Y = k) and P(Y = k) for all k
Gaussian Discriminative Analysis

- Learning f: X → Y, where
  - X is a vector of real-valued features, X_n = <X_n,1 … X_n,m>
  - Y is an indicator vector
- What does that imply about the form of P(Y|X)?

- The joint probability of a datum and its label is:
  p(x_n, y_n^k = 1 | μ, σ) = p(y_n^k = 1) p(x_n | y_n^k = 1, μ, σ)
                           = π_k (2πσ²)^{-1/2} exp( -(x_n - μ_k)² / (2σ²) )

- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
  p(y_n^k = 1 | x_n, μ, σ) = π_k (2πσ²)^{-1/2} exp( -(x_n - μ_k)² / (2σ²) )
                             / Σ_{k'} π_{k'} (2πσ²)^{-1/2} exp( -(x_n - μ_{k'})² / (2σ²) )
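Below is a minimal numeric sketch (my own, not from the slides) of this predictive rule for scalar features with a shared variance; the function name and the toy priors, means, and test point are illustrative assumptions.

```python
# Gaussian discriminative-analysis posterior for 1-D features, shared variance sigma^2.
import numpy as np

def gda_posterior(x, pis, mus, sigma):
    """Return p(y = k | x) for scalar x, class priors pis, class means mus, shared std sigma."""
    # unnormalized scores: pi_k * N(x; mu_k, sigma^2)
    scores = pis * np.exp(-(x - mus) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return scores / scores.sum()  # Bayes rule: normalize over classes k

# example: two classes with equal priors and means -1 and +2
print(gda_posterior(x=0.5, pis=np.array([0.5, 0.5]), mus=np.array([-1.0, 2.0]), sigma=1.0))
```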
Conditional Independence

- X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z
- Which we often write as: P(X | Y, Z) = P(X | Z)
- e.g.,
- Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
Naïve Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:
  p(x_n, y_n^k = 1 | μ, Σ) = p(y_n^k = 1) p(x_n | y_n^k = 1, μ, Σ)
                           = π_k |2πΣ|^{-1/2} exp( -½ (x_n - μ_k)^T Σ^{-1} (x_n - μ_k) )

- The naïve Bayes simplification (conditionally independent features):
  p(x_n, y_n^k = 1 | μ, σ) = p(y_n^k = 1) ∏_j p(x_{n,j} | y_n^k = 1, μ_{k,j}, σ_{k,j})
                           = π_k ∏_j (2πσ_{k,j}²)^{-1/2} exp( -(x_{n,j} - μ_{k,j})² / (2σ_{k,j}²) )

  [Figure: graphical model Yn → Xn,1, Xn,2, …, Xn,m]

- More generally:
  p(x_n, y_n | η, π) = p(y_n | π) ∏_{j=1}^m p(x_{n,j} | y_n, η)
  where p(· | ·) is an arbitrary conditional (discrete or continuous) 1-D density
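A minimal sketch (mine, not the lecture's code) of this factorized Gaussian naïve Bayes model; the parameter shapes, function names, and toy numbers are assumptions for illustration.

```python
# Gaussian naive Bayes: per-class, per-feature 1-D Gaussians plus a class prior.
import numpy as np

def gnb_log_joint(x, log_pi, mu, sigma):
    """log p(x, y = k) for every class k.
    x: (m,) feature vector; log_pi: (K,); mu, sigma: (K, m) per-class parameters."""
    log_lik = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return log_pi + log_lik.sum(axis=1)  # sum over features j (conditional independence)

def gnb_predict(x, log_pi, mu, sigma):
    joint = gnb_log_joint(x, log_pi, mu, sigma)
    post = np.exp(joint - joint.max())   # subtract max for numerical stability
    return post / post.sum()             # p(y = k | x) by Bayes rule

# toy usage with K = 2 classes and m = 3 features (made-up numbers)
log_pi = np.log([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
sigma = np.ones((2, 3))
print(gnb_predict(np.array([0.2, 0.9, 0.4]), log_pi, mu, sigma))
```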
The predictive distribution

- Understanding the predictive distribution:
  p(y_n^k = 1 | x_n, π, μ, σ) = p(y_n^k = 1, x_n | π, μ, σ) / p(x_n | μ, σ)          (*)

- Under the naïve Bayes assumption:
  p(y_n^k = 1 | x_n, π, μ, σ)
    = π_k ∏_j N(x_{n,j} | μ_{k,j}, σ_{k,j}) / Σ_{k'} π_{k'} ∏_j N(x_{n,j} | μ_{k',j}, σ_{k',j})
    = π_k exp{ -Σ_j [ (x_{n,j} - μ_{k,j})² / (2σ_{k,j}²) + log σ_{k,j} + C ] }
      / Σ_{k'} π_{k'} exp{ -Σ_j [ (x_{n,j} - μ_{k',j})² / (2σ_{k',j}²) + log σ_{k',j} + C ] }      (**)

- For two classes (i.e., K = 2), and when the two classes have the same variance, ** turns out to be a logistic function:
  p(y_n^1 = 1 | x_n)
    = 1 / ( 1 + [ π_2 exp{ -Σ_j ( (x_{n,j} - μ_{2,j})² / (2σ_j²) + log σ_j + C ) } ]
              / [ π_1 exp{ -Σ_j ( (x_{n,j} - μ_{1,j})² / (2σ_j²) + log σ_j + C ) } ] )
    = 1 / ( 1 + exp( -Σ_j x_{n,j} (μ_{1,j} - μ_{2,j}) / σ_j² + Σ_j (μ_{1,j}² - μ_{2,j}²) / (2σ_j²) + log((1 - π_1)/π_1) ) )
    = 1 / ( 1 + e^{-θ^T x_n} )
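As a sanity check, here is a small numerical sketch (my own, with made-up parameters) verifying the claim above: with two classes and shared per-feature variances, the naïve Bayes posterior equals a logistic function of a linear score θ·x plus an intercept.

```python
# Check: two-class Gaussian naive Bayes posterior == sigmoid(theta . x + theta0).
import numpy as np

rng = np.random.default_rng(0)
m = 4
pi1, pi2 = 0.3, 0.7
mu1, mu2 = rng.normal(size=m), rng.normal(size=m)
sigma2 = rng.uniform(0.5, 2.0, size=m)          # variances shared across the two classes
x = rng.normal(size=m)

# Posterior via Bayes rule on the two Gaussian class-conditionals
def log_lik(x, mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

num = pi1 * np.exp(log_lik(x, mu1))
den = num + pi2 * np.exp(log_lik(x, mu2))
posterior = num / den

# The same quantity as a sigmoid of a linear function of x
theta = (mu1 - mu2) / sigma2
theta0 = -np.sum((mu1 ** 2 - mu2 ** 2) / (2 * sigma2)) + np.log(pi1 / pi2)
sigmoid = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(posterior, sigmoid)   # the two numbers agree
```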
The decision boundary

- The predictive distribution:
  p(y_n^1 = 1 | x_n) = 1 / ( 1 + exp( -Σ_{j=1}^M θ_j x_{n,j} - θ_0 ) ) = 1 / ( 1 + e^{-θ^T x_n} )

- The Bayes decision rule:
  ln [ p(y_n^1 = 1 | x_n) / p(y_n^2 = 1 | x_n) ]
    = ln [ ( 1 / (1 + e^{-θ^T x_n}) ) / ( e^{-θ^T x_n} / (1 + e^{-θ^T x_n}) ) ]
    = θ^T x_n

- For multiple classes (i.e., K > 2), * corresponds to a softmax function:
  p(y_n^k = 1 | x_n) = e^{θ_k^T x_n} / Σ_j e^{θ_j^T x_n}
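A minimal sketch (not from the lecture) of the softmax predictive distribution and the resulting decision rule; the weight matrix and test point below are made-up.

```python
# Softmax posterior over K classes and the corresponding decision rule.
import numpy as np

def softmax_predict(x, Theta):
    """Theta: (K, d) matrix whose rows are the class weight vectors theta_k."""
    scores = Theta @ x                       # theta_k^T x for every class k
    p = np.exp(scores - scores.max())        # subtract max for numerical stability
    p /= p.sum()
    return p, int(np.argmax(scores))         # posterior and the Bayes decision

Theta = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 1.0]])   # K = 3 classes, d = 2
print(softmax_predict(np.array([0.4, 1.2]), Theta))
```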
Bayesian estimation
Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y); this is a 'generative' model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)

- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X); this is a 'discriminative' model of the data!
  - Estimate parameters of P(Y|X) directly from training data

[Figure: graphical models, Yi → Xi for the generative model and Xi → Yi for the discriminative model]
Linear Regression

- The data: (x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)

- Both nodes are observed:
  - X is an input vector
  - Y is a response vector
  (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)

- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)

[Figure: graphical model Xi → Yi, plate over i = 1, …, N]
Linear Regression

- Assume that Y (target) is a linear function of X (features):
  - e.g.: ŷ = θ_0 + θ_1 x_1 + θ_2 x_2
  - Let's assume a vacuous "feature" X_0 = 1 (this is the intercept term; why?), and define the feature vector to be x = <X_0, X_1, …, X_m>
  - Then we have the following general representation of the linear function:
    ŷ(x) = θ^T x = Σ_{j=0}^m θ_j x_j

- Our goal is to pick the optimal θ. How?

- We seek θ that minimizes the following cost function:
  J(θ) = ½ Σ_{i=1}^n ( ŷ_i(x_i) - y_i )²
The Least-Mean-Square (LMS) method

- Consider a gradient descent algorithm:
  θ_j^{t+1} = θ_j^t - α ∂J(θ)/∂θ_j |_{θ^t}

- Now we have the following descent rule:
  θ_j^{t+1} = θ_j^t + α Σ_{i=1}^n ( y_i - x_i^T θ^t ) x_{i,j}

- For a single training point, we have:
  θ_j^{t+1} = θ_j^t + α ( y_i - x_i^T θ^t ) x_{i,j}

- This is known as the LMS update rule, or the Widrow-Hoff learning rule
- This is actually a "stochastic", "coordinate" descent algorithm
- This can be used as an on-line algorithm
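A minimal sketch (my own, not the lecture's code) of the single-point LMS / Widrow-Hoff update above, applied over repeated stochastic passes through the data; the toy data, step size, and epoch count are illustrative assumptions.

```python
# LMS / Widrow-Hoff: stochastic per-example updates of theta.
import numpy as np

def lms(X, y, alpha=0.01, epochs=50):
    """X: (n, d) design matrix (include a column of ones for the intercept); y: (n,)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            residual = y[i] - X[i] @ theta       # (y_i - x_i^T theta)
            theta += alpha * residual * X[i]     # LMS update for all coordinates j at once
    return theta

# toy usage: y ~ 1 + 2*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)
print(lms(X, y))   # close to [1.0, 2.0]
```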
Probabilistic Interpretation of LMS

- Let us assume that the target variable and the inputs are related by the equation:
  y_i = θ^T x_i + ε_i
  where ε is an error term of unmodeled effects or random noise

- Now assume that ε follows a Gaussian N(0, σ); then we have:
  p(y_i | x_i; θ) = ( 1 / (√(2π) σ) ) exp( -(y_i - θ^T x_i)² / (2σ²) )

- By the independence assumption:
  L(θ) = ∏_{i=1}^n p(y_i | x_i; θ) = ( 1 / (√(2π) σ) )^n exp( -Σ_{i=1}^n (y_i - θ^T x_i)² / (2σ²) )
Probabilistic Interpretation of LMS, cont.

- Hence the log-likelihood is:
  l(θ) = n log( 1 / (√(2π) σ) ) - (1/σ²) · ½ Σ_{i=1}^n ( y_i - θ^T x_i )²

- Do you recognize the last term? Yes, it is:
  J(θ) = ½ Σ_{i=1}^n ( x_i^T θ - y_i )²

- Thus, under the independence assumption, LMS is equivalent to MLE of θ!
Classification and logistic regression
The logistic function
Logistic regression (sigmoid classifier)

- The conditional distribution: a Bernoulli
  p(y | x) = μ(x)^y (1 - μ(x))^{1-y}
  where μ is a logistic function
  μ(x) = 1 / ( 1 + e^{-θ^T x} )

- We can use the brute-force gradient method as in linear regression
- But we can also apply generic laws by observing that p(y|x) is an exponential-family function, more specifically, a generalized linear model (see future lectures …)
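A minimal sketch (mine) of the brute-force gradient approach mentioned above: gradient ascent on the Bernoulli conditional log-likelihood. The data, step size, and iteration count are made-up.

```python
# Logistic regression by gradient ascent on the conditional log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, alpha=0.1, iters=2000):
    """X: (n, d) with an intercept column of ones; y: (n,) in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = sigmoid(X @ theta)          # mu(x_i) for all i
        grad = X.T @ (y - mu)            # gradient of the conditional log-likelihood
        theta += alpha * grad / len(y)
    return theta

# toy usage: y generated from a logistic model with coefficients [-0.5, 2.0]
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
X = np.column_stack([np.ones(200), x])
y = (X @ np.array([-0.5, 2.0]) + rng.logistic(size=200) > 0).astype(float)
print(logreg_fit(X, y))   # roughly recovers [-0.5, 2.0]
```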
Training Logistic Regression: MCLE

- Estimate parameters θ = <θ_0, θ_1, …, θ_m> to maximize the conditional likelihood of training data

- Training data: D = { (x_1, y_1), …, (x_N, y_N) }

- Data likelihood = ∏_i P(x_i, y_i | θ)

- Data conditional likelihood = ∏_i P(y_i | x_i, θ)
Expressing Conditional Log Likelihood

- Recall the logistic function: μ(x) = P(y = 1 | x, θ) = 1 / ( 1 + e^{-θ^T x} )

- and the conditional log-likelihood:
  l(θ) = Σ_i [ y_i log P(y_i = 1 | x_i, θ) + (1 - y_i) log P(y_i = 0 | x_i, θ) ]
Maximizing Conditional Log Likelihood

- The objective: maximize the conditional log-likelihood l(θ) above

- Good news: l(θ) is a concave function of θ

- Bad news: no closed-form solution to maximize l(θ)
Newton's method

- Finding a zero of a function
Newton's method (cont'd)

- To maximize the conditional likelihood l(θ):
  since l(θ) is concave, we need to find θ* where l'(θ*) = 0!

- So we can perform the following iteration:
  θ^{t+1} = θ^t - l'(θ^t) / l''(θ^t)
The Newton-Raphson method

- In LR, θ is vector-valued, thus we need the following generalization:
  θ^{t+1} = θ^t - H^{-1} ∇_θ l(θ)

- ∇ is the gradient operator over the function
- H is known as the Hessian of the function
Iteratively Reweighted Least Squares (IRLS)

- Recall the least-squares estimate in linear regression:
  θ = (X^T X)^{-1} X^T y
  which can also be derived from Newton-Raphson

- Now for logistic regression:
  θ^{t+1} = (X^T W X)^{-1} X^T W z,   where W = diag( μ_i (1 - μ_i) ) and z = X θ^t + W^{-1} (y - μ)
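Here is a compact sketch (my own, following the standard IRLS derivation rather than the lecture's exact notation) of Newton-Raphson for logistic regression, where each step solves a weighted least-squares problem; the toy data are made-up.

```python
# IRLS / Newton-Raphson for logistic regression.
import numpy as np

def irls(X, y, iters=10):
    """X: (n, d) with an intercept column; y: (n,) in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        W = mu * (1.0 - mu)                                  # diagonal of the weight matrix
        z = X @ theta + (y - mu) / np.clip(W, 1e-9, None)    # working response
        # Solve (X^T W X) theta = X^T W z
        XtW = X.T * W
        theta = np.linalg.solve(XtW @ X, XtW @ z)
    return theta

# toy usage: data drawn from a logistic model with coefficients [0, 1.5, -2.0]
rng = np.random.default_rng(2)
x = rng.normal(size=(300, 2))
X = np.column_stack([np.ones(300), x])
y = (x @ np.array([1.5, -2.0]) + rng.logistic(size=300) > 0).astype(float)
print(irls(X, y))
```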
Logistic regression: practical issues

- NR (IRLS) takes O(N + d³) per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations

- Quasi-Newton methods, which approximate the Hessian, work faster

- Conjugate gradient takes O(Nd) per iteration, and usually works best in practice

- Stochastic gradient descent can also be used if N is large (cf. the perceptron rule)
Generative vs. Discriminative Classifiers

Goal: Wish to learn f: X → Y, e.g., P(Y|X)

- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y); this is a 'generative' model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)

- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X); this is a 'discriminative' model of the data!
  - Estimate parameters of P(Y|X) directly from training data

[Figure: graphical models, Yi → Xi for the generative model and Xi → Yi for the discriminative model]
Naïve Bayes vs. Logistic Regression

- Consider Y boolean, X continuous, X = <X_1 … X_m>

- Number of parameters to estimate:
  - NB: 4m + 1 (a mean and a variance per class per feature, plus the class prior), with posterior
    p(y^k = 1 | x) = π_k exp{ -Σ_j [ (x_j - μ_{k,j})² / (2σ_{k,j}²) + log σ_{k,j} + C ] }
                     / Σ_{k'} π_{k'} exp{ -Σ_j [ (x_j - μ_{k',j})² / (2σ_{k',j}²) + log σ_{k',j} + C ] }      (**)
  - LR: m + 1, with
    μ(x) = 1 / ( 1 + e^{-θ^T x} )

- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled
Naïve Bayes vs. Logistic Regression

- Asymptotic comparison (# training examples → infinity)

- When model assumptions are correct:
  - NB and LR produce identical classifiers

- When model assumptions are incorrect:
  - LR is less biased: it does not assume conditional independence
  - Therefore LR is expected to outperform NB
Naïve Bayes vs. Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002])

- Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
  - NB: order log m (where m = # of attributes in X)
  - LR: order m

- NB converges more quickly to its (perhaps less helpful) asymptotic estimates
Rate of convergence: logistic regression

- Let h_{Dis,n} be logistic regression trained on n examples in m dimensions. Then with high probability:

- Implication: if we want the error of h_{Dis,n} to be within some small constant ε_0 of its asymptotic error, it suffices to pick order m examples
  - LR converges to its asymptotic classifier in order m examples

- The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of m-dimensional linear separators is m
Rate of convergence: naïve Bayes parameters

- Let any ε_1, δ > 0, and any n ≥ 0 be fixed. Assume that for some fixed r_0 > 0, we have that:

- Let

- Then with probability at least 1 - δ, after n examples:
  1. For discrete input, for all i and b
  2. For continuous inputs, for all i and b
Some experiments from UCI data sets
Case study

- Dataset:
  - 20 Newsgroups (20 classes)
  - 61,118 words, 18,774 documents

- Experiment:
  - Solve only a two-class subset: class 1 vs. class 2
  - 1,768 instances, 61,188 features
  - Use dimensionality reduction on the data (SVD)
  - Use 90% as the training set, 10% as the test set
  - Test prediction error used as the accuracy measure
Generalization error (1)

- Versus training size
  - 30 features
  - A fixed test set
  - Training set varied from 10% to 100% of the training data

[Figure: generalization error versus training-set size]
Generalization error (2)

- Versus model size
  - Number of dimensions of the data varied from 5 to 50 in steps of 5
  - The features were chosen in decreasing order of their singular values
  - 90% versus 10% split on training and test

[Figure: generalization error versus model size]
Summary

- Naïve Bayes classifier
  - What is the assumption?
  - Why do we use it?
  - How do we learn it?

- Logistic regression
  - Functional form follows from the naïve Bayes assumptions
    - For Gaussian naïve Bayes assuming variance σ_{k,j} = σ_j (the same for all classes)
    - For discrete-valued naïve Bayes too
  - But the training procedure picks parameters without the conditional-independence assumption

- Gradient ascent/descent
  - General approach when closed-form solutions are unavailable

- Generative vs. discriminative classifiers
  - Bias vs. variance tradeoff
Appendix
Parameter Learning from iid Data

- Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases
  D = {x_1, . . . , x_N}

- Maximum likelihood estimation (MLE)
  1. One of the most common estimators
  2. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data:
     L(θ) = P(x_1, x_2, …, x_N; θ)
          = P(x_1; θ) P(x_2; θ) ⋯ P(x_N; θ)
          = ∏_{i=1}^N P(x_i; θ)
  3. Pick the setting of parameters most likely to have generated the data we saw:
     θ* = arg max_θ L(θ) = arg max_θ log L(θ)
Example: Bernoulli model

- Data:
  - We observed N iid coin tosses: D = {1, 0, 1, …, 0}

- Representation:
  - Binary r.v.: x_n ∈ {0, 1}

- Model:
  P(x) = 1 - θ  for x = 0,   θ  for x = 1
  i.e., P(x) = θ^x (1 - θ)^{1-x}

- How to write the likelihood of a single observation x_i?
  P(x_i) = θ^{x_i} (1 - θ)^{1-x_i}

- The likelihood of dataset D = {x_1, …, x_N}:
  P(x_1, x_2, …, x_N | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 - θ)^{1-x_i}
                          = θ^{Σ_i x_i} (1 - θ)^{Σ_i (1 - x_i)} = θ^{#heads} (1 - θ)^{#tails}
Maximum Likelihood Estimation

- Objective function:
  l(θ; D) = log P(D | θ) = log [ θ^{n_h} (1 - θ)^{n_t} ] = n_h log θ + (N - n_h) log(1 - θ)

- We need to maximize this w.r.t. θ

- Take the derivative w.r.t. θ and set it to zero:
  ∂l/∂θ = n_h/θ - (N - n_h)/(1 - θ) = 0
  ⇒ θ_MLE = n_h / N,   or equivalently   θ_MLE = (1/N) Σ_i x_i   (frequency as sample mean)

- Sufficient statistics
  - The count n_h, where n_h = Σ_i x_i, is a sufficient statistic of the data D
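A tiny sketch (my own, with made-up tosses) of the Bernoulli MLE above: the estimate is simply the fraction of heads, i.e. the sample mean of the 0/1 observations.

```python
# Bernoulli MLE: theta_hat = n_h / N, i.e. the sample mean of the data.
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 0, 1])    # made-up coin tosses
theta_mle = D.mean()                       # n_h / N
print(theta_mle)                           # 0.625

# sanity check: this maximizes n_h*log(t) + n_t*log(1 - t) over a grid of t
ts = np.linspace(0.01, 0.99, 99)
loglik = D.sum() * np.log(ts) + (len(D) - D.sum()) * np.log(1 - ts)
print(ts[np.argmax(loglik)])               # close to the MLE above
```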
Overfitting

- Recall that for the Bernoulli distribution, we have:
  θ_ML^{head} = n^{head} / ( n^{head} + n^{tail} )

- What if we tossed too few times, so that we saw zero heads?
  We would have θ_ML^{head} = 0, and we would predict that the probability of seeing a head next is zero!!!

- The rescue: "smoothing"
  θ_ML^{head} = ( n^{head} + n' ) / ( n^{head} + n^{tail} + n' )
  where n' is known as the pseudo (imaginary) count

- But can we make this more formal?
Bayesian Parameter Estimation

- Treat the distribution parameters θ also as a random variable

- The a posteriori distribution of θ after seeing the data is:
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
  This is Bayes rule:
  posterior = likelihood × prior / marginal likelihood

- The prior p(·) encodes our prior knowledge about the domain
Frequentist Parameter Estimation

- Two people with different priors p(θ) will end up with different estimates p(θ | D).

- Frequentists dislike this "subjectivity".

- Frequentists think of the parameter as a fixed, unknown constant, not a random variable.

- Hence they have to come up with different "objective" estimators (ways of computing from data), instead of using Bayes' rule.
  - These estimators have different properties, such as being "unbiased", "minimum variance", etc.
  - The maximum likelihood estimator is one such estimator.
Discussion

- θ or p(θ), this is the problem!

- Bayesians know it
Bayesian estimation for Bernoulli

- Beta distribution:
  P(θ; α, β) = Γ(α + β) / ( Γ(α) Γ(β) ) · θ^{α-1} (1 - θ)^{β-1} = B(α, β)^{-1} θ^{α-1} (1 - θ)^{β-1}
  When x is a discrete integer: Γ(x + 1) = x Γ(x) = x!

- Posterior distribution of θ:
  P(θ | x_1, …, x_N) = p(x_1, …, x_N | θ) p(θ) / p(x_1, …, x_N)
                     ∝ θ^{n_h} (1 - θ)^{n_t} · θ^{α-1} (1 - θ)^{β-1} = θ^{n_h + α - 1} (1 - θ)^{n_t + β - 1}

- Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior

- α and β are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts)
Bayesian estimation for Bernoulli, cont'd

- Posterior distribution of θ:
  P(θ | x_1, …, x_N) = p(x_1, …, x_N | θ) p(θ) / p(x_1, …, x_N)
                     ∝ θ^{n_h} (1 - θ)^{n_t} · θ^{α-1} (1 - θ)^{β-1} = θ^{n_h + α - 1} (1 - θ)^{n_t + β - 1}

- Maximum a posteriori (MAP) estimation:
  θ_MAP = arg max_θ log P(θ | x_1, …, x_N)
  (the Beta parameters can be understood as pseudo-counts)

- Posterior mean estimation:
  θ_Bayes = ∫ θ p(θ | D) dθ = C ∫ θ · θ^{n_h + α - 1} (1 - θ)^{n_t + β - 1} dθ = ( n_h + α ) / ( N + α + β )

- Prior strength: A = α + β
  - A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts
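A small sketch (my own) of the Beta-Bernoulli update above: starting from a Beta(α, β) prior, the posterior after n_h heads and n_t tails is Beta(α + n_h, β + n_t), so the posterior mean is (n_h + α) / (N + α + β). The function name and the example priors are my own choices, matched to the numbers on the next slide.

```python
# Beta-Bernoulli conjugate update: posterior mean and mode.
def beta_bernoulli_posterior(n_h, n_t, alpha, beta):
    a_post, b_post = alpha + n_h, beta + n_t
    posterior_mean = a_post / (a_post + b_post)
    # MAP exists for a_post, b_post > 1: (a_post - 1) / (a_post + b_post - 2)
    posterior_mode = (a_post - 1) / (a_post + b_post - 2) if a_post > 1 and b_post > 1 else None
    return posterior_mean, posterior_mode

# weak prior A = 2 (alpha = beta = 1) vs. strong prior A = 20 (alpha = beta = 10),
# with the data n_h = 2, n_t = 8
print(beta_bernoulli_posterior(2, 8, 1, 1))    # mean 0.25
print(beta_bernoulli_posterior(2, 8, 10, 10))  # mean 0.40
```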
Effect of Prior Strength

- Suppose we have a uniform prior (α = β = 1/2), and we observe n = (n_h = 2, n_t = 8)

- Weak prior, A = 2. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α' = β' = 1) = (1 + 2) / (2 + 10) = 0.25

- Strong prior, A = 20. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α' = β' = 10) = (10 + 2) / (20 + 10) = 0.40

- However, if we have enough data, it washes away the prior. E.g., n = (n_h = 200, n_t = 800): the estimates under the weak and strong priors are 201/1002 and 210/1020 respectively, both of which are close to 0.2
Example 2: Gaussian density

- Data:
  - We observed N iid real samples: D = {-0.1, 10, 1, -5.2, …, 3}

- Model:
  P(x) = (2πσ²)^{-1/2} exp( -(x - μ)² / (2σ²) )

- Log likelihood:
  l(θ; D) = log P(D | θ) = -(N/2) log(2πσ²) - (1/(2σ²)) Σ_{n=1}^N (x_n - μ)²

- MLE: take derivatives and set to zero:
  ∂l/∂μ = (1/σ²) Σ_n (x_n - μ) = 0                      ⇒ μ_MLE = (1/N) Σ_n x_n
  ∂l/∂σ² = -N/(2σ²) + (1/(2σ⁴)) Σ_n (x_n - μ)² = 0      ⇒ σ²_MLE = (1/N) Σ_n (x_n - μ_ML)²
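A minimal sketch (mine) of the Gaussian MLE above, using the sample values listed on the slide as a toy dataset: the estimates are the sample mean and the 1/N sample variance.

```python
# Gaussian MLE: sample mean and (biased, 1/N) sample variance.
import numpy as np

D = np.array([-0.1, 10.0, 1.0, -5.2, 3.0])    # toy data
mu_mle = D.mean()                              # (1/N) sum_n x_n
var_mle = np.mean((D - mu_mle) ** 2)           # (1/N) sum_n (x_n - mu)^2, not the 1/(N-1) estimator
print(mu_mle, var_mle)
```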
MLE for a multivariate Gaussian

- It can be shown that the MLE for µ and Σ is:
  μ_MLE = (1/N) Σ_n x_n
  Σ_MLE = (1/N) Σ_n (x_n - μ_ML)(x_n - μ_ML)^T = (1/N) S

  where the scatter matrix is
  S = Σ_n (x_n - μ_ML)(x_n - μ_ML)^T = Σ_n x_n x_n^T - N μ_ML μ_ML^T

  and X is the design matrix whose rows are x_1^T, x_2^T, …, x_N^T, with x_n = (x_n^1, x_n^2, …, x_n^K)^T

- The sufficient statistics are Σ_n x_n and Σ_n x_n x_n^T

- Note that X^T X = Σ_n x_n x_n^T may not be full rank (e.g., if N < D), in which case Σ_ML is not invertible
Bayesian estimation

- Normal prior:
  P(μ) = (2πσ_0²)^{-1/2} exp( -(μ - μ_0)² / (2σ_0²) )

- Joint probability:
  P(x, μ) = (2πσ²)^{-N/2} exp( -(1/(2σ²)) Σ_{n=1}^N (x_n - μ)² ) × (2πσ_0²)^{-1/2} exp( -(μ - μ_0)² / (2σ_0²) )

- Posterior:
  P(μ | x) = (2πσ̃²)^{-1/2} exp( -(μ - μ̃)² / (2σ̃²) )
  where
  μ̃ = ( (N/σ²) / (N/σ² + 1/σ_0²) ) x̄ + ( (1/σ_0²) / (N/σ² + 1/σ_0²) ) μ_0,
  σ̃² = ( N/σ² + 1/σ_0² )^{-1},
  and x̄ is the sample mean
Bayesian estimation: unknown µ, known σ
1 /  02
N / 2
N 
x
0 ,
2
2
2
2
N / 1 /0
N / 1 /0

N
1 
~ 2   2  2 
0 

1

The posterior mean is a convex combination of the prior and the MLE, with
weights proportional to the relative noise levels.

The precision of the posterior 1/σ2N is the precision of the prior 1/σ20 plus one
contribution of data precision 1/σ2 for each observed data point.
Sequentially updating the mean

µ∗ = 0.8 (unknown), (σ2)∗ = 0.1 (known)

Effect of single data point
 02
 02
1  0  ( x  0 ) 2
 x  ( x  0 ) 2
   02
   02

Uninformative (vague/ flat) prior, σ20 →∞
 N  0
© Eric Xing @ CMU, 2006-2010
55