
Hidden Markov Models &
Conditional Random Fields
Eran Segal
Weizmann Institute
Representing Time


Add the time dimension to variables: X → X(t)
Assumptions


Time can be discretized into interesting points t1,t2,...tn
P(X^{(1)}, \dots, X^{(n)}) = \prod_{t=1}^{n} P(X^{(t)} \mid X^{(1)}, \dots, X^{(t-1)})
Markov assumption holds
P(X^{(1)}, \dots, X^{(n)}) = \prod_{t=1}^{n} P(X^{(t)} \mid X^{(t-1)})
Distribution is stationary (time invariant or homogeneous)

\forall i, j: \; P(X^{(i)} \mid X^{(i-1)}) = P(X^{(j)} \mid X^{(j-1)})
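
As a minimal sketch (not from the slides), the factorization under the Markov and stationarity assumptions can be evaluated with a single transition matrix reused at every time step; the two-state distributions below are illustrative assumptions.

    import numpy as np

    # Illustrative 2-state chain: X(t) takes values in {0, 1}.
    initial = np.array([0.6, 0.4])             # P(X(1))
    transition = np.array([[0.7, 0.3],         # P(X(t) = j | X(t-1) = i)
                           [0.2, 0.8]])        # same matrix for every t (stationarity)

    def joint_probability(states):
        """P(X(1),...,X(n)) = P(X(1)) * prod_{t=2..n} P(X(t) | X(t-1))."""
        p = initial[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= transition[prev, cur]
        return p

    print(joint_probability([0, 0, 1, 1]))     # 0.6 * 0.7 * 0.3 * 0.8
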
Hidden Markov Model



Single (hidden) state variable
Single (observed) observation variable
Transition probability P(Y’|Y) assumed to be sparse


Usually encoded by a state transition graph
Observation probability P(X|Y)
[Figure: template transition model G over (Y, Y', X'), initial model G0 over (Y0, X0), and the unrolled network Y0 → Y1 → Y2 → Y3 with observations X1, X2, X3]
Hidden Markov Model



Single (hidden) state variable
Single (observed) observation variable
Transition probability P(Y’|Y) assumed to be sparse


Usually encoded by a state transition graph
Observation probability P(X|Y)
[Figure: state transition graph over states y1, y2, y3, y4]

State transition representation, P(Y'|Y):

         y1     y2     y3     y4
  y1     0.2    0.8    0      0
  y2     0      0      1      0
  y3     0.4    0      0      0.6
  y4     0      0.5    0      0.5
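
A minimal sketch (not part of the slides) of how the sparse transition table above can be encoded and used to walk the state transition graph; the sequence length and random seed are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    states = ["y1", "y2", "y3", "y4"]
    # Sparse transition matrix P(Y'|Y) from the table above (rows sum to 1).
    P_trans = np.array([[0.2, 0.8, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0],
                        [0.4, 0.0, 0.0, 0.6],
                        [0.0, 0.5, 0.0, 0.5]])

    def sample_state_path(start=0, length=6):
        """Random walk on the state transition graph; zero entries are never taken."""
        path = [start]
        for _ in range(length - 1):
            path.append(rng.choice(len(states), p=P_trans[path[-1]]))
        return [states[i] for i in path]

    print(sample_state_path())
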
Hidden Markov Models

Generative models



Assign a joint probability to paired observation and label sequences
Parameters are trained to maximize the joint likelihood of the training data
To make inference tractable, there are typically no
long-range dependencies
[Figure: unrolled HMM Y0 → Y1 → Y2 → Y3 with observations X1, X2, X3]
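
As a hedged illustration of the generative view, the sketch below computes the joint log-probability an HMM assigns to one paired label/observation sequence; the initial, transition, and emission tables are made-up values, not taken from the slides.

    import numpy as np

    # Illustrative HMM with 2 hidden states and 2 observation symbols.
    pi = np.array([0.5, 0.5])                  # P(Y0)
    A = np.array([[0.9, 0.1],                  # P(Y_t | Y_{t-1})
                  [0.3, 0.7]])
    B = np.array([[0.8, 0.2],                  # P(X_t | Y_t)
                  [0.1, 0.9]])

    def joint_log_likelihood(y, x):
        """log P(y, x) = log P(y0) + sum_t [log P(y_t|y_{t-1}) + log P(x_t|y_t)]."""
        ll = np.log(pi[y[0]])
        for t in range(1, len(y)):
            ll += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t - 1]])
        return ll

    # Labels Y0..Y3 and observations X1..X3, as in the unrolled network above.
    print(joint_log_likelihood([0, 0, 1, 1], [0, 1, 1]))

Maximum-likelihood training would adjust pi, A, and B to make this joint quantity as large as possible on the training pairs.
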
Conditional Models



Specifies the probability of possible label sequences
given the observations, P(Y|X)
Does not “waste” parameters on modeling P(X|Y)
Key advantage:



Distribution over Y can depend on non-independent
features without modeling feature dependencies
Transition probabilities can depend on past and future observations
Two representations


Maximum Entropy Markov Models (MEMMs)
Conditional Random Fields (CRFs)
Max Entropy Markov Models



Models the probability over the next state given the
previous state and the observations
Training is by iterative scaling in the maximum
entropy framework
Weakness: label bias problem
[Figure: unrolled networks. HMM: chain Y0 → Y1 → Y2 → Y3 with edges Y_t → X_t; MEMM: chain Y0 → Y1 → Y2 → Y3 with edges X_t → Y_t]
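
A minimal sketch (assumed, not from the slides) of the MEMM's per-state conditional P(Y_t | Y_{t-1}, X_t) as a log-linear (maximum-entropy) model with local normalization; the feature map and weights are illustrative, and training by iterative scaling is not shown.

    import numpy as np

    N_STATES = 3
    N_FEATURES = 5
    rng = np.random.default_rng(0)
    W = rng.normal(size=(N_STATES, N_FEATURES))    # one weight vector per next state

    def features(prev_state, observation):
        """Illustrative binary features of (previous state, current observation)."""
        f = np.zeros(N_FEATURES)
        f[prev_state] = 1.0                        # which state we came from
        f[N_STATES + (observation % 2)] = 1.0      # a toy property of the observation
        return f

    def next_state_distribution(prev_state, observation):
        """MEMM: P(Y_t | Y_{t-1}=prev, X_t=obs), normalized per source state."""
        scores = W @ features(prev_state, observation)
        scores -= scores.max()                     # numerical stability
        p = np.exp(scores)
        return p / p.sum()                         # local ("per-state") normalization

    print(next_state_distribution(prev_state=1, observation=4))

The per-state normalization in the last line is exactly what produces the label bias problem discussed next.
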
Label-Bias Problem of MEMMs

Transitions from a state compete only with each other




Transition scores are conditional probabilities of next states
given current state and observation
This implies a “conservation of score mass”, whereby the mass of a state is distributed among its next states
Observations do not affect the total mass of next states
This biases the model toward states with fewer outgoing transitions


States with a single outgoing transition ignore observations
Unlike in HMMs, observations after a branch point in a path are effectively ignored
Label-Bias Problem: Example


A model for distinguishing ‘rob’ from ‘rib’
Suppose we get an input sequence ‘rib’




In the first step, ‘r’ matches both possible states, so they are equally likely
Next, ‘i’ is observed, but since both y1 and y4 have only one outgoing transition, they both give probability 1 to their next state
This does not happen in HMMs
Note: if one word was more frequent in training, it will always win
[Figure: state transition graph P(Y'|Y). Path y0 -r-> y1 -i-> y2 -b-> y3 spells ‘rib’; path y0 -r-> y4 -o-> y5 -b-> y3 spells ‘rob’]
State transition representation, P(Y'|Y):

         y0     y1     y2     y3     y4     y5
  y0     0      0.5    0      0      0.5    0
  y1     0      0      1      0      0      0
  y2     0      0      0      1      0      0
  y3     0      0      0      1      0      0
  y4     0      0      0      0      0      1
  y5     0      0      0      1      0      0
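
The sketch below (an illustration, with the probabilities read off the table above) reproduces the argument numerically: under per-state normalization the observations never change a state's total outgoing mass, so for the input ‘rib’ both paths end up with the same score of 0.5, and the decision is made entirely at the first branch.

    # MEMM-style path scores for the rob/rib example.
    # States with a single outgoing transition assign it probability 1,
    # whatever the observation is; y0 splits 0.5/0.5 on 'r'.
    def p_next(current, nxt, obs):
        if current == "y0":
            return 0.5 if nxt in ("y1", "y4") else 0.0
        single = {"y1": "y2", "y2": "y3", "y4": "y5", "y5": "y3"}
        return 1.0 if single.get(current) == nxt else 0.0

    def path_score(path, observations):
        score = 1.0
        for (cur, nxt), obs in zip(zip(path, path[1:]), observations):
            score *= p_next(cur, nxt, obs)
        return score

    rib_path = ["y0", "y1", "y2", "y3"]     # spells r-i-b
    rob_path = ["y0", "y4", "y5", "y3"]     # spells r-o-b
    obs = ["r", "i", "b"]                   # the input sequence is 'rib'
    print(path_score(rib_path, obs), path_score(rob_path, obs))   # both 0.5
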
Conditional Random Fields


Advantages of MEMMs without the label bias problem
Key difference


MEMMs use a per-state model for the conditional probability of the next state given the current state
CRFs have a single model for the joint probability of the
entire sequence of labels given the observations


Thus, weights of different features at different states can trade off
against each other
CRF training


Maximum likelihood or MAP
Loss function is convex, guaranteeing convergence to global
optimum
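
A minimal sketch (not from the slides) of maximum-likelihood CRF training on a toy chain: the negative log-likelihood is convex in the weights, and its gradient is the model's expected feature counts minus the observed ones; here the partition function is computed by brute-force enumeration, and the feature map and data are illustrative.

    import itertools
    import numpy as np

    N_STATES = 2

    def features(y, x):
        """Illustrative global feature vector: transition counts + state-observation counts."""
        f = np.zeros(N_STATES * N_STATES + N_STATES * 2)
        for a, b in zip(y, y[1:]):                       # edge features
            f[a * N_STATES + b] += 1
        for yt, xt in zip(y, x):                         # vertex features
            f[N_STATES * N_STATES + yt * 2 + xt] += 1
        return f

    def neg_log_likelihood_and_grad(w, y_obs, x):
        """Convex objective -log P_w(y_obs | x); gradient = E_model[f] - f(y_obs, x)."""
        all_seqs = list(itertools.product(range(N_STATES), repeat=len(x)))
        F = np.array([features(y, x) for y in all_seqs])
        scores = F @ w
        logZ = np.logaddexp.reduce(scores)
        expected_f = np.exp(scores - logZ) @ F
        f_obs = features(y_obs, x)
        return logZ - w @ f_obs, expected_f - f_obs

    x, y = [0, 1, 1], (0, 1, 1)                          # one toy training pair
    w = np.zeros(N_STATES * N_STATES + N_STATES * 2)
    for _ in range(100):                                 # plain gradient descent
        loss, grad = neg_log_likelihood_and_grad(w, y, x)
        w -= 0.5 * grad
    print(loss)                                          # decreases toward the global optimum
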
Conditional Random Fields


Let G = (V, E) be a graph with vertices V and edges E, such that Y = (Y_v)_{v \in V}
Then (X, Y) is a CRF if the random variables Y_v obey the Markov property with respect to the graph:

P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)

where w \sim v means that w and v are neighbors in G

Model only P(Y|X)

[Figure: linear-chain CRF over Y0, Y1, Y2, Y3 conditioned on observations X1, X2, X3]
Conditional Random Fields

Joint probability distribution for trees over Y

Cliques (and thus potentials) are the edges and vertices


p (y | x) exp  k f k (e, y[e],x)  k gk (v, y[v],x)
e E,k

v V ,k
x are the observed variables
y are the state variables
y[S] denotes the components of y associated with the vertices in S
f_k is an edge feature with weight λ_k
g_k is a vertex feature with weight μ_k
Note that features can depend on all of the variables in x

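A small sketch (illustrative weights, brute-force normalization) that evaluates the formula above for a linear chain, with indicator edge features weighted by λ and indicator vertex features weighted by μ.

    import itertools
    import numpy as np

    N_STATES, T = 2, 3                           # vertices Y1..Y3, edges (Y1,Y2), (Y2,Y3)
    lam = np.array([[0.5, -0.2], [0.1, 0.8]])    # lambda for edge feature f_{y_u, y_v}
    mu = np.array([[1.0, -1.0], [-1.0, 1.0]])    # mu for vertex feature g_{y_v, x_v}

    def score(y, x):
        """sum_{e,k} lambda_k f_k(e, y[e], x) + sum_{v,k} mu_k g_k(v, y[v], x)."""
        s = sum(lam[y[t], y[t + 1]] for t in range(T - 1))   # edge potentials
        s += sum(mu[y[t], x[t]] for t in range(T))           # vertex potentials
        return s

    def p_y_given_x(y, x):
        """p(y|x) = exp(score(y, x)) / Z(x), with Z(x) by enumeration over all label sequences."""
        Z = sum(np.exp(score(yp, x))
                for yp in itertools.product(range(N_STATES), repeat=T))
        return np.exp(score(y, x)) / Z

    print(p_y_given_x((0, 1, 1), [0, 1, 1]))
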
Representing HMMs with CRFs

An HMM can be represented by the following CRF


p (y | x) exp  k f k (e, y[e],x)  k gk (v, y[v],x)
e E,k

v V ,k




where
f y',y ( u,v , y[ u,v ], x)   (y u , y') (y v , y)
gy,x (v, y[v],x)
  (y v , y) (x v , x)
Note that we defined one feature for each state pair (y’,y)
and one feature for each state-observation pair (y,x)
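
As a hedged illustration of this construction, the sketch below sets λ_{y',y} = log P(y|y') and μ_{y,x} = log P(x|y) for the indicator features just defined; with a uniform initial state (an extra simplifying assumption) the resulting CRF conditional matches the HMM posterior P(y|x) computed by direct enumeration. The HMM parameters themselves are made-up values.

    import itertools
    import numpy as np

    A = np.array([[0.9, 0.1], [0.3, 0.7]])       # illustrative HMM transitions P(y_v | y_u)
    B = np.array([[0.8, 0.2], [0.1, 0.9]])       # illustrative HMM emissions  P(x_v | y_v)
    lam, mu = np.log(A), np.log(B)               # lambda_{y',y} = log P(y|y'), mu_{y,x} = log P(x|y)

    def crf_posterior(y, x):
        """p(y|x) from the CRF with indicator features f_{y',y} and g_{y,x}."""
        def score(ys):
            return (sum(lam[a, b] for a, b in zip(ys, ys[1:]))
                    + sum(mu[s, o] for s, o in zip(ys, x)))
        Z = sum(np.exp(score(ys)) for ys in itertools.product(range(2), repeat=len(x)))
        return np.exp(score(y)) / Z

    def hmm_posterior(y, x):
        """P(y|x) from the HMM (uniform initial state), by enumeration."""
        def joint(ys):
            p = 0.5 * B[ys[0], x[0]]
            for t in range(1, len(ys)):
                p *= A[ys[t - 1], ys[t]] * B[ys[t], x[t]]
            return p
        return joint(y) / sum(joint(ys) for ys in itertools.product(range(2), repeat=len(x)))

    x, y = [0, 1, 1], (0, 1, 1)
    print(crf_posterior(y, x), hmm_posterior(y, x))    # the two posteriors agree
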
Parameter Estimation
Applications
Conditional Random Fields

Joint probability distribution for trees over Y

Cliques (and thus potentials) are the edges and vertices


p (y | x) exp  k f k (e, y[e],x)  k gk (v, y[v],x)
e E,k

v V ,k



p (y | x)  exp  k f k (e, y[e],x)
e E,k
