Hidden Markov Models &
Conditional Random Fields
Eran Segal
Weizmann Institute
Representing Time
Add the time dimension to variables: X → X(t)
Assumptions
Time can be discretized into interesting points t1,t2,...tn
P(X) = \prod_{t=1}^{n} P\big(X^{(t)} \mid X^{(1)}, \ldots, X^{(t-1)}\big)
Markov assumption holds
P(X) = \prod_{t=1}^{n} P\big(X^{(t)} \mid X^{(t-1)}\big)
Distribution is stationary (time invariant or homogeneous)
\forall i, j:\; P\big(X^{(i)} \mid X^{(i-1)}\big) = P\big(X^{(j)} \mid X^{(j-1)}\big)
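As a concrete illustration of the factorization and stationarity assumptions above, here is a minimal Python sketch; the two-state chain, the state names, and all numbers are invented for illustration and are not from the slides.

import numpy as np

# Hypothetical two-state stationary Markov chain.
states = {"a": 0, "b": 1}
initial = np.array([0.6, 0.4])          # P(X(1))
transition = np.array([[0.7, 0.3],      # P(X(t) | X(t-1)); same matrix at every t (stationarity)
                       [0.2, 0.8]])

def sequence_probability(sequence):
    """P(X) = P(X(1)) * prod_{t=2..n} P(X(t) | X(t-1)) under the Markov assumption."""
    idx = [states[s] for s in sequence]
    prob = initial[idx[0]]
    for prev, curr in zip(idx, idx[1:]):
        prob *= transition[prev, curr]
    return prob

print(sequence_probability("aab"))      # 0.6 * 0.7 * 0.3 = 0.126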
Hidden Markov Model
Single (hidden) state variable
Single (observed) observation variable
Transition probability P(Y’|Y) assumed to be sparse
Usually encoded by a state transition graph
Observation probability P(X|Y)
[Figure: initial network G0 (Y0 → X0) and transition network G (Y → Y', Y' → X'); unrolled network Y0 → Y1 → Y2 → Y3 with observations X1, X2, X3]
Hidden Markov Model
[Figure: state transition graph over y1, y2, y3, y4, encoding P(Y'|Y)]
State transition representation
        y1    y2    y3    y4
  y1    0.2   0.8   0     0
  y2    0     0     1     0
  y3    0.4   0     0     0.6
  y4    0     0.5   0     0.5
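The transition table can be encoded directly as a row-stochastic matrix; the sketch below (the uniform initial distribution is an added assumption, not from the slides) samples a state path from it.

import numpy as np

rng = np.random.default_rng(0)
state_names = ["y1", "y2", "y3", "y4"]
P_trans = np.array([[0.2, 0.8, 0.0, 0.0],    # P(Y'|Y): rows are current states, columns next states
                    [0.0, 0.0, 1.0, 0.0],
                    [0.4, 0.0, 0.0, 0.6],
                    [0.0, 0.5, 0.0, 0.5]])

def sample_path(length, initial=None):
    initial = np.full(4, 0.25) if initial is None else initial  # assumed uniform start
    y = rng.choice(4, p=initial)
    path = [y]
    for _ in range(length - 1):
        y = rng.choice(4, p=P_trans[y])      # next state depends only on the current state
        path.append(y)
    return [state_names[i] for i in path]

print(sample_path(6))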
Hidden Markov Models
Generative models
Assign a joint probability to paired label and observation sequences
Parameters are trained to maximize the joint likelihood of the training data
To make inference tractable, there are typically no
long-range dependencies
[Figure: unrolled HMM network Y0 → Y1 → Y2 → Y3 with observations X1, X2, X3]
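Because the HMM is generative, it scores a paired state and observation sequence by the joint probability. A minimal sketch follows; the transition matrix repeats the table above, while the two-symbol emission model and the uniform initial distribution are invented assumptions.

import numpy as np

P_init = np.full(4, 0.25)                    # assumed uniform initial state distribution
P_trans = np.array([[0.2, 0.8, 0.0, 0.0],    # P(Y'|Y) from the table above
                    [0.0, 0.0, 1.0, 0.0],
                    [0.4, 0.0, 0.0, 0.6],
                    [0.0, 0.5, 0.0, 0.5]])
P_emit = np.array([[0.9, 0.1],               # P(X|Y) over an invented two-symbol alphabet {0, 1}
                   [0.2, 0.8],
                   [0.5, 0.5],
                   [0.3, 0.7]])

def joint_log_likelihood(y, x):
    """log P(x, y) = log P(y_1) + sum_t log P(y_t | y_{t-1}) + sum_t log P(x_t | y_t)."""
    ll = np.log(P_init[y[0]]) + np.log(P_emit[y[0], x[0]])
    for t in range(1, len(y)):
        ll += np.log(P_trans[y[t - 1], y[t]]) + np.log(P_emit[y[t], x[t]])
    return ll

print(joint_log_likelihood(y=[0, 1, 2], x=[0, 1, 0]))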
Conditional Models
Specifies the probability of possible label sequences
given the observations, P(Y|X)
Does not “waste” parameters on modeling P(X|Y)
Key advantage:
Distribution over Y can depend on non-independent
features without modeling feature dependencies
Transition probabilities can depend on past and future observations
Two representations
Maximum Entropy Markov Models (MEMMs)
Conditional Random Fields (CRFs)
Max Entropy Markov Models
Models the probability over the next state given the
previous state and the observations
Training is by iterative scaling in the maximum
entropy framework
Weakness: label bias problem
[Figure: unrolled HMM (each Yt generates Xt) vs. unrolled MEMM (each Yt is conditioned on Xt), both over the chain Y0 → Y1 → Y2 → Y3 with X1, X2, X3]
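In a MEMM each state has a maximum-entropy model over its successors; below is a minimal sketch of that local softmax P(y_t | y_{t-1}, x_t). The feature function, weights, and sizes are all illustrative assumptions, and training by iterative scaling is not shown.

import numpy as np

N_STATES, N_FEATURES = 3, 4
weights = np.random.default_rng(1).normal(size=(N_STATES, N_FEATURES))  # untrained toy weights

def features(prev_state, observation):
    # Toy feature vector; a real MEMM uses rich, possibly overlapping features of the observation.
    return np.array([1.0, float(prev_state), observation, observation * prev_state])

def next_state_distribution(prev_state, observation):
    scores = weights @ features(prev_state, observation)   # one score per candidate next state
    scores -= scores.max()                                  # numerical stability
    p = np.exp(scores)
    return p / p.sum()                                      # normalized over next states only

print(next_state_distribution(prev_state=1, observation=0.7))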
Label-Bias Problem of MEMMs
Transitions from a state compete only with each other
Transition scores are conditional probabilities of next states
given current state and observation
This implies a “conservation of score mass” whereby mass
of a state is distributed among next states
Observations do not affect the total mass of next states
This biases the model toward states with fewer outgoing transitions
States with a single outgoing transition effectively ignore their observation
Unlike in HMMs, observations after a branch point in a path are
effectively ignored
Label-Bias Problem: Example
A model for distinguishing ‘rob’ from ‘rib’
Suppose we get an input sequence ‘rib’
In the first step, 'r' matches the transition out of y0 on both paths, so both are equally likely
Next, 'i' is observed, but since y1 and y4 each have only one
outgoing transition, they both give probability 1 to their next state
Does not happen in HMMs
Note: if one word is more frequent in the training data, it will always win
[Figure: state transition graph: y0 →r→ y1 →i→ y2 →b→ y3 (spelling 'rib') and y0 →r→ y4 →o→ y5 →b→ y3 (spelling 'rob')]
State transition representation
        y0    y1    y2    y3    y4    y5
  y0    0     0.5   0     0     0.5   0
  y1    0     0     1     0     0     0
  y2    0     0     0     1     0     0
  y3    0     0     0     1     0     0
  y4    0     0     0     0     0     1
  y5    0     0     0     1     0     0
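Plugging in the numbers makes the problem visible: in a MEMM every factor P(y_t | y_{t-1}, x_t) is normalized over the successors of y_{t-1}, so for y1 and y4 (single successor) the factor is 1 regardless of the observation. A minimal sketch of the two path scores on input 'rib', with the factors written out by hand from the table above:

def memm_path_score(factors):
    # Product of per-step conditionals P(y_t | y_{t-1}, x_t).
    score = 1.0
    for f in factors:
        score *= f
    return score

# Input 'rib'. Both paths leave y0 with probability 0.5 on 'r'; after that,
# every state on either path has exactly one outgoing transition, so each
# remaining factor is 1 no matter what is observed.
rib_path = [0.5, 1.0, 1.0]   # y0 -> y1 -> y2 -> y3, intended word 'rib'
rob_path = [0.5, 1.0, 1.0]   # y0 -> y4 -> y5 -> y3, intended word 'rob'

print(memm_path_score(rib_path), memm_path_score(rob_path))   # identical scores

So the MEMM cannot prefer 'rib' here; if training had made the first transition favor one word, that word would win on every input, which is exactly the point of the note above.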
Conditional Random Fields
Advantages of MEMMs without the label bias problem
Key difference
MEMMs use a per-state model for the conditional probability of the
next state given the current state and the observation
CRFs have a single model for the joint probability of the
entire sequence of labels given the observations
Thus, weights of different features at different states can trade off
against each other
CRF training
Maximum likelihood or MAP
Loss function is convex, guaranteeing convergence to global
optimum
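For reference, a sketch of the convex training objective being referred to (standard for log-linear CRFs; F_k and G_k below denote an edge or vertex feature summed over the graph, anticipating the feature notation introduced on the next slide):

\ell(\theta) = \sum_i \Big[ \sum_k \lambda_k F_k(y^{(i)}, x^{(i)}) + \sum_k \mu_k G_k(y^{(i)}, x^{(i)}) - \log Z_\theta(x^{(i)}) \Big]

\frac{\partial \ell}{\partial \lambda_k} = \sum_i \Big[ F_k(y^{(i)}, x^{(i)}) - \mathbb{E}_{p_\theta(y \mid x^{(i)})} F_k(y, x^{(i)}) \Big]

The gradient is empirical minus expected feature counts; maximizing \ell (or adding a Gaussian prior for MAP) is a concave problem, hence the convergence to the global optimum.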
Conditional Random Fields
Let G = (V, E) be a graph with vertices V and edges E,
such that Y = (Y_v)_{v \in V}
Then (X, Y) is a CRF if the random variables Y_v obey
the Markov property with respect to the graph:
P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)
where w \sim v means that w and v are neighbors in G
Model only P(Y|X)
[Figure: linear-chain CRF over Y0 – Y1 – Y2 – Y3 conditioned on observations X1, X2, X3]
Conditional Random Fields
When the graph over Y is a tree, its cliques (and thus the potentials) are its edges and vertices, and the conditional distribution factorizes as
p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\,k} \lambda_k f_k(e, y[e], x) + \sum_{v \in V,\,k} \mu_k g_k(v, y[v], x)\Big)
x are the observed variables
y are the state variables
y[S] denotes the components of y associated with the vertices in S
fk is an edge feature with weight λk
gk is a vertex feature with weight μk
Note that the features can depend on all of the variables in x
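To make the factorization concrete, here is a minimal sketch of a linear-chain CRF that computes p(y|x) by brute-force enumeration of Z(x). The feature functions, weights, and tiny alphabet are all illustrative assumptions; real implementations compute Z(x) with the forward algorithm instead.

from itertools import product
import math

N_LABELS = 2
lam = {(yp, yc): 0.5 * (yp == yc) for yp in range(N_LABELS) for yc in range(N_LABELS)}  # edge weights
mu = {(y, xv): (1.0 if y == xv else -1.0) for y in range(N_LABELS) for xv in (0, 1)}    # vertex weights

def score(y, x):
    """Unnormalized log score: edge features over consecutive labels plus vertex features."""
    s = sum(lam[(y[t - 1], y[t])] for t in range(1, len(y)))
    s += sum(mu[(y[t], x[t])] for t in range(len(y)))
    return s

def p_y_given_x(y, x):
    Z = sum(math.exp(score(yy, x)) for yy in product(range(N_LABELS), repeat=len(x)))
    return math.exp(score(y, x)) / Z

x = [0, 1, 1]
print(p_y_given_x([0, 1, 1], x))   # the labeling that matches x gets the highest probability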
Representing HMMs with CRFs
An HMM can be represented by the following CRF
p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\,k} \lambda_k f_k(e, y[e], x) + \sum_{v \in V,\,k} \mu_k g_k(v, y[v], x)\Big)
where
f_{y',y}\big(\langle u, v\rangle, y[\langle u, v\rangle], x\big) = \delta(y_u, y')\,\delta(y_v, y)
g_{y,x}\big(v, y[v], x\big) = \delta(y_v, y)\,\delta(x_v, x)
Note that we defined one feature for each state pair (y’,y)
and one feature for each state-observation pair (y,x)
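One standard choice of weights that makes this CRF reproduce the HMM's conditional distribution (a sketch; the slides do not spell the weights out) is to use log transition and log emission probabilities:

\lambda_{y',y} = \log P(Y_v = y \mid Y_u = y'), \qquad \mu_{y,x} = \log P(X_v = x \mid Y_v = y)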
Parameter Estimation
Applications
Conditional Random Fields
When the graph over Y is a tree, its cliques (and thus the potentials) are its edges and vertices, and the conditional distribution factorizes as
p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\,k} \lambda_k f_k(e, y[e], x) + \sum_{v \in V,\,k} \mu_k g_k(v, y[v], x)\Big)

Folding the vertex features into the edge features (a standard simplification for parameter estimation) gives the form

p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\,k} \lambda_k f_k(e, y[e], x)\Big)