
CS546: Machine Learning and Natural Language
Probabilistic Classification
Feb 24, 26 2009
1
2: Generative Model




Model the problem of text correction as that of generating
correct sentences.
Goal: learn a model of the language; use it to predict.
PARADIGM
Learn a probability distribution over all sentences
 In practice: make assumptions on the distribution’s type
Use it to estimate which sentence is more likely.
Pr(I saw the girl it the park) <> Pr(I saw the girl in the park)
[In the same paradigm we sometimes learn a conditional probability distribution]
 In practice: a decision policy depends on the assumptions
2
Before: Error Driven Learning
Discriminative Learning




Consider a distribution D over the space X × Y
X – the instance space; Y – the set of labels (e.g., ±1)
Given a sample {(x_i, y_i)}_{i=1}^m and a loss function L(h(x), y),
find h ∈ H that minimizes Σ_{i=1}^m L(h(x_i), y_i)
L can be:
 - 0-1 loss: L(h(x), y) = 1 if h(x) ≠ y, and 0 otherwise
 - squared (L2) loss: L(h(x), y) = (h(x) - y)²
 - exponential loss: L(h(x), y) = exp{-y h(x)}

Find an algorithm that minimizes average loss; then, we know
that things will be okay (as a function of H).
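A minimal sketch (not from the lecture) of empirical risk minimization with the 0-1 loss above, over a tiny made-up hypothesis class of one-dimensional thresholds; the data and the threshold grid are invented for illustration.

```python
import numpy as np

# Empirical risk minimization with 0-1 loss over a small finite hypothesis class H
# of threshold functions h_t(x) = sign(x - t).
def zero_one_loss(y_pred, y_true):
    return (y_pred != y_true).mean()

# Toy 1-D data; labels in {-1, +1}
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([-1, -1, -1, +1, +1, +1])

thresholds = np.linspace(0, 1, 21)                      # the hypothesis class H
risks = [zero_one_loss(np.sign(x - t), y) for t in thresholds]
best_t = thresholds[int(np.argmin(risks))]
print(f"best threshold {best_t:.2f}, empirical 0-1 risk {min(risks):.2f}")
```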
3
Bayesian Decision Theory




Goal: find the best hypothesis from some space H of
hypotheses, given the observed data D.
Define best to be: most probable hypothesis in H
In order to do that, we need to assume a probability
distribution over the class H.
In addition, we need to know something about the relation
between the data observed and the hypotheses (E.g., a coin
problem.)

As we will see, we will be Bayesian about other things, e.g., the
parameters of the model
4
Basics of Bayesian Learning




P(h) - the prior probability of a hypothesis h (prior over H)
Reflects background knowledge; before data is observed. If no
information - uniform distribution.
P(D) - The probability that this sample of the Data is
observed. (No knowledge of the hypothesis)
P(D|h): The probability of observing the sample D, given that
the hypothesis h holds
P(h|D): The posterior probability of h. The probability h
holds, given that D has been observed.
5
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)


P(h|D) increases with P(h) and with P(D|h)
P(h|D) decreases with P(D)
6
Learning Scenario
P(h | D) = P(D | h) P(h) / P(D)


The learner considers a set of candidate hypotheses H
(models), and attempts to find the most probable one h H,
given the observed data.
Such maximally probable hypothesis is called maximum a
posteriori hypothesis (MAP); Bayes theorem is used to
compute it:
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
7
Learning Scenario (2)
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h)

We may assume that a priori, hypotheses are equally probable
P(h_i) = P(h_j), for all h_i, h_j ∈ H

We get the Maximum Likelihood hypothesis:
h_ML = argmax_{h∈H} P(D | h)

Here we just look for the hypothesis that best explains the
data
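As a hedged illustration of the coin problem mentioned earlier, the sketch below compares the ML and MAP hypotheses over a grid of candidate biases; the prior and the data are invented for the example.

```python
import numpy as np

# MAP vs. ML for a coin: H = a grid of candidate biases theta, D = observed flips (1 = heads).
D = np.array([1, 1, 1, 0, 1, 1])                 # 5 heads, 1 tail
thetas = np.linspace(0.01, 0.99, 99)             # hypothesis space H

likelihood = thetas**D.sum() * (1 - thetas)**(len(D) - D.sum())   # P(D | h)
prior = np.exp(-((thetas - 0.5)**2) / (2 * 0.1**2))               # prior favoring a fair coin
prior /= prior.sum()                                              # P(h)

h_ML  = thetas[np.argmax(likelihood)]            # argmax_h P(D | h)
h_MAP = thetas[np.argmax(likelihood * prior)]    # argmax_h P(D | h) P(h)
print(f"ML estimate {h_ML:.2f}, MAP estimate {h_MAP:.2f}")   # the prior pulls MAP toward 0.5
```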
8
Bayes Optimal Classifier







How should we use the general formalism?
What should H be?
H can be a collection of functions. Given the training data, choose an
optimal function. Then, given new data, evaluate the selected function on
it.
H can be a collection of possible predictions. Given the data, try to directly
choose the optimal prediction.
H can be a collection of (conditional) probability distributions.
Could be different!
Specific examples we will discuss:



Naive Bayes: a maximum likelihood based algorithm;
Max Entropy: seemingly, a different selection criterion;
Hidden Markov Models
9
Bayesian Classifier
f: X → V, V a finite set of values
 - Instances x ∈ X can be described as a collection of features: x = (x_1, x_2, ..., x_n), x_i ∈ {0,1}
 - Given an example, assign it the most probable value in V:
v_MAP = argmax_{v_j∈V} P(v_j | x) = argmax_{v_j∈V} P(v_j | x_1, x_2, ..., x_n)
10
Bayesian Classifier
• f: X → V, V a finite set of values
• Instances x ∈ X can be described as a collection of features: x = (x_1, x_2, ..., x_n), x_i ∈ {0,1}
• Given an example, assign it the most probable value in V:
  v_MAP = argmax_{v_j∈V} P(v_j | x) = argmax_{v_j∈V} P(v_j | x_1, x_2, ..., x_n)
• Bayes Rule:
  v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j) / P(x_1, x_2, ..., x_n)
        = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
• Notational convention: P(y) means P(Y=y)
11
Bayesian Classifier
v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
• Given training data we can estimate the two terms.
• Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
• However, it is not feasible to estimate P(x_1, x_2, ..., x_n | v_j): we would have to estimate, for each target value, the probability of every instance (most of which will never occur).
• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.
12
Naive Bayes
v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
P(x_1, x_2, ..., x_n | v_j)
 = P(x_1 | x_2, ..., x_n, v_j) P(x_2, ..., x_n | v_j)
 = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) P(x_3, ..., x_n | v_j)
 = .......
 = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) ... P(x_n | v_j)
• Assumption: feature values are independent given the target value
 = Π_{i=1}^n P(x_i | v_j)
13
Naive Bayes
v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
• Assumption: feature values are independent given the target value
  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = Π_{i=1}^n P(x_i = b_i | v = v_j)
• Generative model:
  • First choose a value v_j ∈ V, according to P(v_j)
  • For each v_j: choose x_1, x_2, ..., x_n, according to P(x_k | v_j)
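A hedged sketch of this generative story, with illustrative (made-up) parameters.

```python
import numpy as np

# Sample examples from the naive Bayes generative model: first v, then each x_k given v.
rng = np.random.default_rng(0)
P_v = {0: 0.4, 1: 0.6}                          # P(v_j)
P_x_given_v = {0: [0.1, 0.7, 0.3],              # P(x_k = 1 | v = 0), k = 1..3
               1: [0.8, 0.2, 0.9]}              # P(x_k = 1 | v = 1)

def sample_example():
    v = rng.choice([0, 1], p=[P_v[0], P_v[1]])                        # choose v_j according to P(v_j)
    x = (rng.random(3) < np.array(P_x_given_v[v])).astype(int)        # each x_k independently given v_j
    return x, v

print([sample_example() for _ in range(3)])
```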
14
Naive Bayes
v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
• Assumption: feature values are independent given the target value
  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = Π_{i=1}^n P(x_i = b_i | v = v_j)
• Learning method: Estimate n|V| parameters and use them to compute the new value. (How to estimate them?)
15
Naive Bayes
v_MAP = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
• Assumption: feature values are independent given the target value
  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = Π_{i=1}^n P(x_i = b_i | v = v_j)
• Learning method: Estimate n|V| parameters and use them to compute the new value.
• This is learning without search. Given a collection of training examples,
you just compute the best hypothesis (given the assumptions)
• This is learning without trying to achieve consistency or even
approximate consistency.
• Why does it work?
16
Conditional Independence
• Notice that the features values are conditionally independent,
given the target value, and are not required to be independent.
• Example:
f(x,y)=xy over the product distribution defined by
p(x=0)=p(x=1)=1/2 and p(y=0)=p(y=1)=1/2
The distribution is defined so that x and y are independent:
p(x,y) = p(x)p(y) (Interpretation - for every value of x and y)
• But, given that f(x,y)=0:
p(x=1|f=0) = p(y=1|f=0) = 1/3
p(x=1,y=1 | f=0) = 0
so x and y are not conditionally independent.
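A quick check of the example above; the enumeration below is mine, not from the slides.

```python
from itertools import product

# x, y uniform and independent, f(x, y) = x * y; conditioned on f = 0 they are no longer independent.
pairs = list(product([0, 1], repeat=2))                   # each (x, y) has probability 1/4
given_f0 = [(x, y) for x, y in pairs if x * y == 0]       # condition on f = 0

p_x1 = sum(1 for x, y in given_f0 if x == 1) / len(given_f0)                 # = 1/3
p_y1 = sum(1 for x, y in given_f0 if y == 1) / len(given_f0)                 # = 1/3
p_x1_y1 = sum(1 for x, y in given_f0 if x == 1 and y == 1) / len(given_f0)   # = 0
print(p_x1, p_y1, p_x1_y1)   # (1/3)*(1/3) != 0, so x and y are not conditionally independent given f
```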
17
Conditional Independence
• The other direction also does not hold.
x and y can be conditionally independent but not independent.
f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0
f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1
and assume, say,
that p(f=0) = p(f=1)=1/2
Given the value of f, x and y are independent.
• What about unconditional independence ?
18
Conditional Independence
• The other direction also does not hold.
x and y can be conditionally independent but not independent.
f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0
f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1
and assume, say,
that p(f=0) = p(f=1)=1/2
Given the value of f, x and y are independent.
• What about unconditional independence ?
p(x=1) = p(x=1|f=0)p(f=0)+p(x=1|f=1)p(f=1) = 0.5+0=0.5
p(y=1) = p(y=1|f=0)p(f=0)+p(y=1|f=1)p(f=1) = 0.5+0=0.5
But,
p(x=1, y=1)=p(x=1,y=1|f=0)p(f=0)+p(x=1,y=1|f=1)p(f=1) = 0
so x and y are not independent.
19
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
20
Estimating Probabilities
v_NB = argmax_{v∈{yes,no}} P(v) Π_i P(x_i = observation_i | v)
• How do we estimate P(observation | v)?
21
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
• Compute P(PlayTennis = yes) and P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers)
• Compute P(temp = h/mild/cool | PlayTennis = yes/no) (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers)
22
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
• Compute P(PlayTennis = yes) and P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers)
• Compute P(temp = h/mild/cool | PlayTennis = yes/no) (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers)
•Given a new instance:
(Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
• Predict: PlayTennis= ?
23
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64
P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9
P(temp = cool | yes) = 3/9
P(humidity = hi |yes) = 3/9
P(wind = strong | yes) = 3/9
P(outlook = sunny|no)= 3/5
P(temp = cool | no) = 1/5
P(humidity = hi |no) = 4/5
P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053
P(no|…..) ~ 0.0206
24
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64
P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9
P(temp = cool | yes) = 3/9
P(humidity = hi |yes) = 3/9
P(wind = strong | yes) = 3/9
P(outlook = sunny|no)= 3/5
P(temp = cool | no) = 1/5
P(humidity = hi |no) = 4/5
P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053
P(no|…..) ~ 0.0206
What if we were asked about Outlook = Overcast?
25
Example
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64
P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9
P(temp = cool | yes) = 3/9
P(humidity = hi |yes) = 3/9
P(wind = strong | yes) = 3/9
P(outlook = sunny|no)= 3/5
P(temp = cool | no) = 1/5
P(humidity = hi |no) = 4/5
P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053
P(no|…..) ~ 0.0206
P(no|instance) = 0.0206/(0.0053+0.0206)=0.795
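A hedged sketch that mirrors the worked example: it estimates the counts from the PlayTennis table above and scores the query instance (the helper names are mine).

```python
from collections import Counter, defaultdict

data = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
labels = Counter(row[-1] for row in data)                 # counts for P(PlayTennis = v)
feature_counts = defaultdict(Counter)                     # counts of (feature index, value) per label
for *features, label in data:
    for i, value in enumerate(features):
        feature_counts[label][(i, value)] += 1

def score(instance, label):
    p = labels[label] / len(data)                         # P(v)
    for i, value in enumerate(instance):
        p *= feature_counts[label][(i, value)] / labels[label]   # P(x_i | v)
    return p

query = ("Sunny", "Cool", "High", "Strong")
for v in ("Yes", "No"):
    print(v, round(score(query, v), 4))   # ~0.0053 for Yes, ~0.0206 for No, as in the slide
```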
26
Naive Bayes: Two Classes
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
• Notice that the naïve Bayes method seems to be giving a method
for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
  [P(v_j = 1) Π_{i=1}^n P(x_i | v_j = 1)] / [P(v_j = 0) Π_{i=1}^n P(x_i | v_j = 0)] > 1
27
Naive Bayes: Two Classes
v_NB = argmax_{v_j∈V} P(v_j) Π_i P(x_i | v_j)
• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
  [P(v_j = 1) Π_{i=1}^n P(x_i | v_j = 1)] / [P(v_j = 0) Π_{i=1}^n P(x_i | v_j = 0)] > 1
• Denote: p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0). Then the condition becomes:
  [P(v_j = 1) Π_{i=1}^n p_i^{x_i} (1 - p_i)^{1-x_i}] / [P(v_j = 0) Π_{i=1}^n q_i^{x_i} (1 - q_i)^{1-x_i}] > 1
28
Naive Bayes: Two Classes
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
  [P(v_j = 1) Π_{i=1}^n p_i^{x_i} (1 - p_i)^{1-x_i}] / [P(v_j = 0) Π_{i=1}^n q_i^{x_i} (1 - q_i)^{1-x_i}]
  = [P(v_j = 1) Π_{i=1}^n (1 - p_i) (p_i/(1 - p_i))^{x_i}] / [P(v_j = 0) Π_{i=1}^n (1 - q_i) (q_i/(1 - q_i))^{x_i}] > 1
29
Naïve Bayes: Two Classes
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
  [P(v_j = 1) Π_{i=1}^n p_i^{x_i} (1 - p_i)^{1-x_i}] / [P(v_j = 0) Π_{i=1}^n q_i^{x_i} (1 - q_i)^{1-x_i}]
  = [P(v_j = 1) Π_{i=1}^n (1 - p_i) (p_i/(1 - p_i))^{x_i}] / [P(v_j = 0) Π_{i=1}^n (1 - q_i) (q_i/(1 - q_i))^{x_i}] > 1
• Take the logarithm; we predict v = 1 iff:
  log [P(v_j = 1)/P(v_j = 0)] + Σ_i log [(1 - p_i)/(1 - q_i)] + Σ_i (log [p_i/(1 - p_i)] - log [q_i/(1 - q_i)]) x_i > 0
30
Naïve Bayes: Two Classes
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
  [P(v_j = 1) Π_{i=1}^n p_i^{x_i} (1 - p_i)^{1-x_i}] / [P(v_j = 0) Π_{i=1}^n q_i^{x_i} (1 - q_i)^{1-x_i}]
  = [P(v_j = 1) Π_{i=1}^n (1 - p_i) (p_i/(1 - p_i))^{x_i}] / [P(v_j = 0) Π_{i=1}^n (1 - q_i) (q_i/(1 - q_i))^{x_i}] > 1
• Take the logarithm; we predict v = 1 iff:
  log [P(v_j = 1)/P(v_j = 0)] + Σ_i log [(1 - p_i)/(1 - q_i)] + Σ_i (log [p_i/(1 - p_i)] - log [q_i/(1 - q_i)]) x_i > 0
• We get that the optimal Bayes behavior is given by a linear separator with
  w_i = log [p_i/(1 - p_i)] - log [q_i/(1 - q_i)] = log [p_i (1 - q_i) / (q_i (1 - p_i))]
  If p_i = q_i then w_i = 0 and the feature is irrelevant.
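A hedged sketch of this linear-separator view with made-up parameters p_i, q_i and a uniform prior; it checks that the log of the probability ratio equals w·x + b.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
p = rng.uniform(0.05, 0.95, n)      # p_i = P(x_i = 1 | v = 1)
q = rng.uniform(0.05, 0.95, n)      # q_i = P(x_i = 1 | v = 0)
prior1, prior0 = 0.5, 0.5

w = np.log(p / (1 - p)) - np.log(q / (1 - q))                   # feature weights w_i
b = np.log(prior1 / prior0) + np.log((1 - p) / (1 - q)).sum()   # bias term

x = rng.integers(0, 2, n)
ratio = (prior1 * np.prod(p**x * (1 - p)**(1 - x))) / (prior0 * np.prod(q**x * (1 - q)**(1 - x)))
print(np.log(ratio), w @ x + b)     # the two scores agree; predict v = 1 iff this is > 0
```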
31
Why does it work?
• We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.
• The linear form of the classifiers provides some hints.
32
Naïve Bayes: Two Classes
•In the case of two classes we have that:
log [P(v_j = 1 | x) / P(v_j = 0 | x)] = Σ_i w_i x_i + b
33
Naïve Bayes: Two Classes
•In the case of two classes we have that:
  log [P(v_j = 1 | x) / P(v_j = 0 | x)] = Σ_i w_i x_i + b     (1)
• But since P(v_j = 1 | x) = 1 - P(v_j = 0 | x)     (2)
• We get (plug (2) into (1); some algebra):
  P(v_j = 1 | x) = 1 / (1 + exp(-(Σ_i w_i x_i + b)))
•Which is simply the logistic (sigmoid) function used in the
neural network representation.
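A tiny numerical check of this identity (the value 0.73 is arbitrary).

```python
import numpy as np

# If log [P(v=1|x) / P(v=0|x)] = z, then P(v=1|x) = 1 / (1 + exp(-z)).
p1 = 0.73                      # any posterior value
z = np.log(p1 / (1 - p1))      # the log-odds (the linear score w.x + b)
print(1 / (1 + np.exp(-z)))    # recovers 0.73
```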
34
Another look at Naive Bayes
[Figure: graphical model for naive Bayes — a label node l with feature nodes Φ_1, Φ_2, ..., Φ_n as its children. The graphical model encodes the NB independence assumption in its edge structure (siblings are independent given their parent).]
c[Φ_0, l] = log P̂r_D[Φ_0, l] = log |{(x, j) : j = l}| / |S|
c[Φ, l] = log P̂r_D[Φ, l] / P̂r_D[Φ_0, l] = log |{(x, j) : Φ(x) = 1 ∧ j = l}| / |{(x, j) : j = l}|
prediction(x) = argmax_{l∈{0,1}} Σ_{i=0}^n c[x_i, l] Φ_i(x)   (a Linear Statistical Queries model)
35
Hidden Markov Model (HMM)

HMM is a probabilistic generative model



It models how an observed sequence is generated
Let’s call each position in a sequence a time step
At each time step, there are two variables


Current state (hidden)
Observation
36
HMM




[Figure: HMM as a chain — hidden states s1, s2, ..., s6, each emitting an observation o1, ..., o6.]
Elements:
 - Initial state probability P(s1)
 - Transition probability P(st|st-1)
 - Observation probability P(ot|st)
As before, the graphical model is an encoding of the independence assumptions.
Consider POS tagging: O – words, S – POS tags
37
HMM for Shallow Parsing

States:


{B, I, O}
Observations:

Actual words and/or part-of-speech tags
s1=B
o1
Mr.
s2=I
s3=O
o2
o3
Brown blamed
s4=B
s5=I
s6=O
o4
Mr.
o5
Bob
o6
for
38
HMM for Shallow Parsing
[Figure: the same B/I/O state sequence over "Mr. Brown blamed Mr. Bob for" as above.]
Initial state probability: P(s1=B), P(s1=I), P(s1=O)
Transition probability: P(st=B|st-1=B), P(st=I|st-1=B), P(st=O|st-1=B), P(st=B|st-1=I), P(st=I|st-1=I), P(st=O|st-1=I), ...
Observation probability: P(ot='Mr.'|st=B), P(ot='Brown'|st=B), ..., P(ot='Mr.'|st=I), P(ot='Brown'|st=I), ...
Given a sentence, we can ask: what is the most likely state sequence?
39
Finding most likely state sequence in HMM (1)
P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
 = P(o_k | o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
 = P(o_k | s_k) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
 = P(o_k | s_k) · P(s_k | s_{k-1}, ..., s_1, o_{k-1}, ..., o_1) · P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
 = P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
 = P(o_k | s_k) · [ Π_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
40
Finding most likely state sequence in HMM (2)
argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
 = argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
 = argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
 = argmax_{s_k, ..., s_1} P(o_k | s_k) · [ Π_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
41
Finding most likely state sequence in HMM (3)
max_{s_k, ..., s_1} P(o_k | s_k) · [ Π_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
 = max_{s_k} P(o_k | s_k) · max_{s_{k-1}, ..., s_1} [ Π_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)    (the first factor is a function of s_k only)
 = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}, ..., s_1} [ Π_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
 = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
42
Finding most likely state sequence in HMM (4)
max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ]
 · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ...
 · max_{s_2} [ P(s_3 | s_2) · P(o_2 | s_2) ]
 · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)

Viterbi’s Algorithm

Dynamic Programming
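A hedged Viterbi sketch implementing the recursion above on a toy, made-up HMM (function and variable names are mine).

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """pi[s]: P(s1=s), A[s,s']: P(s_{t+1}=s'|s_t=s), B[s,o]: P(o_t=o|s_t=s)."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))       # best log-score of any path ending in each state
    back = np.zeros((T, n_states), int)   # backpointers for recovering the argmax path
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)          # add transition, max over previous state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                           # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 states (think B/I/O), 4 observation symbols.
pi = np.array([0.6, 0.2, 0.2])
A = np.array([[0.3, 0.5, 0.2], [0.2, 0.5, 0.3], [0.4, 0.2, 0.4]])
B = np.array([[0.5, 0.2, 0.2, 0.1], [0.1, 0.4, 0.4, 0.1], [0.2, 0.2, 0.2, 0.4]])
print(viterbi(pi, A, B, obs=[0, 1, 2, 3, 0]))
```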
43
Learning the Model

Estimate:
 - Initial state probability P(s1)
 - Transition probability P(st|st-1)
 - Observation probability P(ot|st)
Unsupervised Learning (states are not observed):
 - EM Algorithm
Supervised Learning (states are observed; more common):
 - ML estimate of the above terms directly from the data
Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
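For the supervised case, a hedged sketch of the ML estimates obtained by counting over toy tagged sequences (the data and helper names are invented).

```python
from collections import Counter, defaultdict

# B/I/O tags as in the shallow-parsing example.
tagged = [
    [("Mr.", "B"), ("Brown", "I"), ("blamed", "O"), ("Mr.", "B"), ("Bob", "I"), ("for", "O")],
    [("Mr.", "B"), ("Smith", "I"), ("left", "O")],
]
init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
for sent in tagged:
    init[sent[0][1]] += 1                                  # first tag of each sentence
    for (w, t), (_, t_next) in zip(sent, sent[1:]):
        trans[t][t_next] += 1                              # tag bigrams
    for w, t in sent:
        emit[t][w] += 1                                    # (tag, word) pairs

P_init = {t: c / sum(init.values()) for t, c in init.items()}                                  # P(s1)
P_trans = {t: {t2: c / sum(cs.values()) for t2, c in cs.items()} for t, cs in trans.items()}   # P(st|st-1)
P_emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()} for t, cs in emit.items()}       # P(ot|st)
print(P_init, P_trans["B"], P_emit["B"])
```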
44
Another view of Markov Models
Input: x = ((w_1 : t_1), ..., (w_{i-1} : t_{i-1}), (w_i : ?), (w_{i+1} : t_{i+1}), ...)
[Figure: states t_{i-1}, t_i, t_{i+1} ∈ T over observations w_{i-1}, w_i, w_{i+1} ∈ W.]
Assumptions: Pr(w_i | t_i) = Pr(w_i | x)
             Pr(t_{i+1} | t_i) = Pr(t_{i+1} | t_i, t_{i-1}, ..., t_1)
Prediction: predict the t ∈ T that maximizes Pr(t | t_{i-1}) · Pr(w_i | t) · Pr(t_{i+1} | t)
45
Another View of Markov Models
Input: x = ((w_1 : t_1), ..., (w_{i-1} : t_{i-1}), (w_i : ?), (w_{i+1} : t_{i+1}), ...)
[Figure: same states-over-observations picture as on the previous slide.]
As for NB: features are pairs and singletons of t's and w's:
c[t_1 t_2, t_1] = log P̂r_D[t_1, t_2] / P̂r_D[t_1] = log |{(t, t') : t = t_1 ∧ t' = t_2}| / |{(t, t') : t = t_1}|
c[w t, t] = log P̂r_D[w, t];  otherwise c[Φ, t] = 0
prediction(x) = argmax_{t∈T} Σ_i c[x_i, t] Φ_i(x)   (only 3 active features)
This can be extended to an argmax that maximizes the prediction of
the whole state sequence and computed, as before, via Viterbi.
46
Learning with Probabilistic Classifiers


Learning Theory
Err_S(h) = |{x ∈ S : h(x) ≠ l}| / |S|
We showed that probabilistic predictions can be viewed as predictions via
Linear Statistical Queries Models.

The low expressivity explains Generalization+Robustness

Is that all?


It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?
In General, No. (Unless it corresponds to some probabilistic assumptions
that hold).
47
Learning Protocol


LSQ hypotheses are computed directly, w/o assumptions on
the underlying distribution:
- Choose features
- Compute coefficients
Is there a reason to believe that an LSQ hypothesis
minimizes the empirical error on the sample?
In general, no.
(Unless it corresponds to some probabilistic assumptions that
hold).

48
Learning Protocol: Practice


LSQ hypotheses are computed directly:
- Choose features
- Compute coefficients
If hypothesis does not fit the training data - Augment set of features
(Forget your original assumption)
49
Example: probabilistic classifiers
If the hypothesis does not fit the training data - augment the set of features (forget your original assumptions)
[Figure: the naive Bayes graphical model (label l with features Φ_1, ..., Φ_n) alongside the Markov-model view (states t_{i-1}, t_i, t_{i+1} ∈ T over words w_{i-1}, w_i, w_{i+1} ∈ W).]
Features are pairs and singletons of t‘s, w’s
Additional features are included
50
Robustness of Probabilistic Predictors


Why is it relatively easy to fit the data?
Consider all distributions with the same marginals Pr(Φ_i | l)
(E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.)
(Garg & Roth, ECML'01): In most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution).
51
Summary: Probabilistic Modeling

Classifiers derived from probability density estimation models were viewed as LSQ hypotheses:
  Σ_Φ c_Φ Φ(x), with features of the form Φ(x) = x_{i_1} x_{i_2} ... x_{i_k}
Probabilistic assumptions:
 + guide feature selection,
 - but do not allow the use of more general features.
52
A Unified Approach

Most methods blow up original feature space.
X = (x_1, x_2, x_3, ..., x_k) → (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_n(x)),  n >> k
And make predictions using a linear representation over the new feature space:
  argmax_j Σ_i c_{ij} Φ_i(x)
Note: Methods do not have to actually do that; But: they
produce same decision as a hypothesis that does that. (Roth
98; 99,00)
53
A Unified Approach

Most methods blow up original feature space.
X = (x_1, x_2, x_3, ..., x_k) → (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_n(x)),  n >> k
And make predictions using a linear representation over the new feature space:
  argmax_j Σ_i c_{ij} Φ_i(x)
 - Probabilistic methods
 - Rule-based methods (TBL; decision lists; exponentially decreasing weights)
 - Linear representations (SNoW; Perceptron; SVM; Boosting)
 - Memory-based methods (subset features)
54
A Unified Approach

Most methods blow up original feature space.
X = (x_1, x_2, x_3, ..., x_k) → (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_n(x)),  n >> k
And make predictions using a linear representation over the new feature space:
  argmax_j Σ_i c_{ij} Φ_i(x)
Q 1: How are weights determined?
Q 2: How is the new feature-space determined?
Implications? Restrictions?
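A hedged sketch of this unified view with a made-up feature map Φ and random weights c_ij.

```python
import numpy as np

def phi(x):
    """Blow up x = (x1, x2, x3) into a larger feature space: bias, singletons, pairwise conjunctions."""
    x1, x2, x3 = x
    return np.array([1, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

n_classes, n_features = 3, 7
C = np.random.default_rng(0).normal(size=(n_classes, n_features))   # c_ij, one row per class j

x = (1, 0, 1)
print(int(np.argmax(C @ phi(x))))    # predicted class j = argmax_j sum_i c_ij * Phi_i(x)
```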
55
What’s Next?

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  - Yes.
(2) If probabilistic hypotheses are actually like other linear functions, can we actually train them similarly (that is, discriminatively)?
  - Yes.
  - Single classification: logistic regression / Max Entropy
  - HMM
(3) General sequential processes
  - First, multi-class classification (what's the relation?)
56
Conditional models: Introduction


Starting from the early 2000s, conditional and discriminative models became very popular in the NLP community
Why?




They often give better accuracy than generative models
It is often easier to integrate complex features in these models
Generalization results favor these methods
Does it make sense to distinguish probabilistic conditional models from discriminative classifiers?



(I.e., models that output conditional probabilities vs. methods trained to minimize the expected error)
Usually not (Recall Zhang’s paper presented by Sarah)
They often can be regarded as both (either optimizing a bound on
error or estimating parameters of some conditional model)
57
Joint vs Conditional (Probabilistic View)


Joint (or generative) models model the joint probability P(x, y) of both input and output
Conditional models model P(y | x):
 - there is no need to make assumptions about the form of P(x)
Joint examples:
 - n-gram language models, Naive Bayes, HMM, Probabilistic Context Free Grammars (PCFGs) in parsing
Conditional examples:
 - logistic regression, MaxEnt, perceptron, SNoW, etc.
Many of the following slides are based on / taken from the Klein & Manning tutorial at ACL 03
58
Joint vs Conditional (Probabilistic View)

Bayesian Networks for conditional and generative models,
recall:




circles (nodes) correspond to variables
each node is dependent on its parents in the graph
a table of probabilities corresponds to each node in the graph
you can think of each node as a classifier which predicts its value given
values of its parents
59
Example: Word Sense Disambiguation


We can use the same features (i.e. edges in the graphical
model are the same but direction may be different) and get
very different results
Word Sense Disambiguation (supervised): the same features
(i.e. the same class of functions) but different estimation
method (Klein & Manning 02):
60
Joint vs. Conditional

However, this is not always the case; it depends on:



How wrong the assumptions in the generative model are
Amount of the training data (we plan to discuss Ng & Jordan 03
paper about it)
Other reasons why we may want to use generative models



Sometimes it is easier to integrate non-labeled data in a generative model (semi-supervised learning)
Often easier to estimate parameters (to train)
....
61
Maximum Entropy Models

Known by different names:


Max Ent , CRF (local vs. global normalization), Logistic Regression,
Log-linear models
Plan:





Recall feature based models
Max-entropy motivation for the model
ML motivation (logistic regression)
Learning (parameter estimation)
Bound minimization view
62
Features: Predicates on variables
63
Example Tasks
64
Conditional vs. Joint

Joint: given data {c, d} we want to construct model P(c,d)






Maximize likelihood
In the simplest case we just compute counts, normalize, and you have a model
In fact it is not so simple: smoothing, etc.
The counts are not always trivial: what if you believe that there is some unknown variable h, so that P(c,d) = Σ_h P(h, c, d)?
Not trivial counts any more...
Conditional:



Maximize conditional likelihood (probabilistic view)
Harder to do
Closer to minimization of error
65
Feature-Based Classifier
66
Feature-Based Classifier
67
Maximizing Conditional Likelihood


Given the model and the data, find the parameters that maximize the conditional likelihood L(λ) = Π_i P(c_i | d_i, λ)
For the log-linear case we consider, this means finding λ to maximize:
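A hedged sketch of conditional maximum-likelihood training for a binary log-linear (logistic) model by gradient ascent, on synthetic data (all names and values are mine).

```python
import numpy as np

# Maximize sum_i log P(c_i | d_i, lambda) by gradient ascent (toy features, no regularization).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # feature vectors f(c, d)
true_w = np.array([1.5, -2.0, 0.5])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.random(200)).astype(float)   # sampled labels

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))                  # P(c = 1 | d, lambda)
    grad = X.T @ (y - p) / len(y)                 # empirical feature expectation minus model expectation
    w += lr * grad
print(w)                                          # roughly recovers the generating weights
```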
68
MaxEnt model
69
Learning
70
Learning
71
Learning
72
Learning
73
Relation to Max Ent Principle

Recall we discussed MaxEnt Principle before:



We knew the marginals, and now we want to fill in the table:
We argued that we want a distribution that is as uniform as possible but has the same feature expectations as we observed:
So, we want the MaxEnt distribution
74
Relation to Max Ent Principle




It is possible to show (see additional notes on the lectures
page) that the MaxEnt distribution has log-linear form
And we already observed that (conditional) ML training corresponds to finding parameters such that the empirical feature expectations match the expectations given by the model
So, they are the same!
Important: a log-linear model trained to maximize
likelihood is equivalent to MaxEnt model with constraints
on the expectations for the same set of features
75
Relation to Max Ent Principle
76
Relation to Max Ent Principle
77
Examples
78
Examples
79
Properties of MaxEnt

Unlike Naive Bayes there is no double counting:
 - even a frequent feature may have a small weight if it co-occurs with some other feature(s)
What if our counts are not reliable?
 - our prior belief can be that the function is not spiky (the weights are small):
 - min_λ f(λ) = ||λ||² / 2 - log P(C | D, λ)
80
MaxEnt vs. Naive Bayes
81
MaxEnt vs. Naive Bayes
82
Summary




Today we reviewed max-ent framework
Considered both ML and MaxEnt motivation
Discussed HMMs (you can imagine conditional models
instead of HMMs in a similar way)
Next time we will probably start talking about the term
project

You may want to start reading about Semantic Role Labeling and
Dependency Parsing
83