Conditional Random Field (CRF)
Presenter: Kuang-Jui Hsu
Date: 2011/6/16 (Thu.)
Outline
1. Introduction
2. Graphical Models
2-1 Definitions
2-2 Application of Graphical Models
2-3 Discriminative and Generative Models
3. Linear-Chain Conditional Random Fields
3-1 From HMMs to CRFs
3-2 Parameter Estimation
3-3 Inference
Introduction
Relational data:
Statistical dependencies exist between the entities
Each entity often has a rich set of features that can aid classification
Graphical models are a natural formalism for exploiting the
dependence structure among the entities
We can use two different types of models to model it
Markov random field (MRF)
Conditional random field (CRF)
Two Types of Models
MRF
The traditional type of model
Modeled by the joint distribution p(y, x), where y is the data we want
to predict and x is the data we can observe, also called the features
It requires modeling p(x), but p(x) is hard to model when
using rich features, which often involve complex dependences
Ignoring the features can reduce performance
CRF
Modeled by the conditional distribution p(y|x)
Specifies the probability of the label data given the observation data, so
there is no need to model the distribution p(x), even with rich features
This solves the problem of using an MRF
Introduction
Divided into two parts
Present a tutorial on current training and inference techniques for
CRFs
Discuss the important special case of linear-chain CRFs
Generalize these to arbitrary graphical structures
Present an example of applying a general CRF to a practical
relational learning problem
Outline
2. Graphical Models
2-1 Definitions
2-2 Application of Graphical Models
2-3 Discriminative and Generative Models
Definitions
Probability distribution over sets of random variables:
V = X ∪ Y
X: a set of input variables we assume are observed
Y: a set of output variables we wish to predict
An assignment to X is denoted by x, an assignment to a set A ⊂ X
by x_A, and similarly for Y
Indicator function: 𝟏{x = x′} = 1 if x = x′, and 0 otherwise
Definitions
A graphical model is a family of probability distributions
that factorize according to an underlying graph.
There are two types of graphical models
Undirected graphical models, represented by factors
Directed graphical models, based on the Bayesian concept
Undirected Graphical Model
Given a collection of subsets A ⊂ V, define an undirected
graphical model of the form:
p(x, y) = (1/Z) ∏_A Ψ_A(x_A, y_A)
for any choice of factors F = {Ψ_A}, where Ψ_A : 𝒱^n → ℝ^+
Z: a normalization factor, also called the partition function,
defined as
Z = Σ_{x,y} ∏_A Ψ_A(x_A, y_A)
Computing Z is difficult, but much work exists on
how to approximate it
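The factorization and the partition function can be sketched in a few lines of Python. This is a minimal toy model, not anything from the slides: the two binary variables and the factor tables below are made-up values chosen only to illustrate how Z normalizes the product of factors.

```python
import itertools

# A minimal undirected model over two binary variables x and y with two
# factors. The factor tables are made-up values for illustration only.
psi_x = {0: 1.0, 1: 2.0}                       # Psi_1(x)
psi_xy = {(0, 0): 3.0, (0, 1): 1.0,
          (1, 0): 1.0, (1, 1): 2.0}            # Psi_2(x, y)

def unnormalized(x, y):
    # Product of all factors for one assignment.
    return psi_x[x] * psi_xy[(x, y)]

# Partition function Z: the factor product summed over every assignment.
Z = sum(unnormalized(x, y) for x, y in itertools.product((0, 1), repeat=2))

def p(x, y):
    return unnormalized(x, y) / Z

# The normalized values form a proper distribution.
assert abs(sum(p(x, y) for x, y in itertools.product((0, 1), repeat=2)) - 1.0) < 1e-12
```

Note how Z already requires a sum over every joint assignment, which is exactly why computing it exactly becomes hard for large models.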
Undirected graphical models
Present the factorization by a factor graph:
p(x, y) = (1/Z) ∏_A Ψ_A(x_A, y_A)
A bipartite graph G = (V, F, E)
Factor node Ψ_A ∈ F
Variable node v_s ∈ V
Undirected graphical models
Assume that each Ψ function has the form:
Ψ_A(x_A, y_A) = exp( Σ_k θ_{Ak} f_{Ak}(x_A, y_A) )
for some real-valued parameters θ_{Ak} and some set of feature
functions {f_{Ak}}
Directed graphical models
Also known as a Bayesian network
Based on a directed graph G = (V, E)
Factorized as:
p(y, x) = ∏_{v∈V} p(v|π(v))
π(v): the parent of the variable v
Generative model
Generative model: a directed graphical model that
satisfies the condition that no x ∈ X can be a parent of an
output y ∈ Y
A generative model is one that directly describes how the
outputs probabilistically “generate” the inputs
Outline
2. Graphical Models
2-1 Definitions
2-2 Application of Graphical Models
2-3 Discriminative and Generative Models
Application of Graphical Models
Devote special attention to the hidden Markov
model (HMM), because of its close relation to the linear-
chain CRF
Classification
Introduce two types of classifiers
Naïve Bayes classifier: based on a joint probability model
Logistic regression: based on a conditional probability model
Naïve Bayes Classifier
Predicting a single class variable y given a vector of features
x = (x_1, x_2, …, x_K)
Assume that all the features are independent
The resulting classifier is called the naïve Bayes classifier
Based on a joint probability model of the form:
p(y, x) = p(y) ∏_{k=1}^K p(x_k|y)
Proof
p(y, x) = p(y) ∏_{k=1}^K p(x_k|y)
Proof:
p(y, x) = p(y, x_1, x_2, …, x_K)
       = p(y) p(x_1, x_2, …, x_K|y)
       = p(y) p(x_1|y) p(x_2, …, x_K|y, x_1)
       = p(y) p(x_1|y) p(x_2|y, x_1) … p(x_K|y, x_1, x_2, …, x_{K−1})
       = p(y) p(x_1|y) p(x_2|y) … p(x_K|y)    (all the features are independent given y)
       = p(y) ∏_{k=1}^K p(x_k|y)
Naïve Bayes Classifier
Written as a factor graph:
p(x, y) = (1/Z) ∏_A Ψ_A(x_A, y_A)
p(y, x) = p(y) ∏_{k=1}^K p(x_k|y)
by defining Ψ(y) = p(y) and
Ψ_k(y, x_k) = p(x_k|y)
However, the independence assumption is not practical, because the
features are not always independent
Ex. Classify a ball using the following features:
color and weight vs. size and weight
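The joint model p(y, x) = p(y) ∏_k p(x_k|y) is easy to sketch directly. The probability tables below are invented toy numbers for a binary class and two binary features, not data from the talk; classification simply picks the y with the largest joint, since p(x) is the same for every class.

```python
# Naive Bayes joint model p(y, x) = p(y) * prod_k p(x_k | y),
# with made-up tables for a binary class and two binary features.
p_y = {0: 0.6, 1: 0.4}
p_x_given_y = [
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},  # p(x1 | y)
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}},  # p(x2 | y)
]

def joint(y, x):
    # p(y) times the product of the per-feature conditionals.
    prob = p_y[y]
    for k, xk in enumerate(x):
        prob *= p_x_given_y[k][y][xk]
    return prob

def classify(x):
    # argmax_y p(y, x), which equals argmax_y p(y | x).
    return max(p_y, key=lambda y: joint(y, x))
```

Because every conditional table is normalized, the joint sums to one over all (y, x), which is a quick sanity check on the factorization.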
Logistic Regression
In log form, naïve Bayes gives log p(y, x) = log p(y) + Σ_{k=1}^K log p(x_k|y)
Assumption: the log probability, log p(y|x), of each
class is a linear function of x, plus a normalization constant:
log p(y|x) = −log Z(x) + λ_y + Σ_{j=1}^K λ_{y,j} x_j
The conditional distribution:
p(y|x) = (1/Z(x)) exp( λ_y + Σ_{j=1}^K λ_{y,j} x_j )
Z(x) = Σ_y exp( λ_y + Σ_{j=1}^K λ_{y,j} x_j ) is a normalizing constant
λ_y: a bias weight that acts as log p(y) does in naïve Bayes
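The conditional distribution above is just a softmax over linear scores. A minimal sketch, with two classes, two features, and arbitrary made-up weights standing in for λ_y and λ_{y,j}:

```python
import math

# Toy logistic regression with two classes and two features; the weights
# lam_bias (the bias lambda_y) and lam (lambda_{y,j}) are assumed values.
lam_bias = {0: 0.1, 1: -0.2}
lam = {0: [1.0, -0.5], 1: [-1.0, 0.5]}

def score(y, x):
    # lambda_y + sum_j lambda_{y,j} * x_j
    return lam_bias[y] + sum(w * xj for w, xj in zip(lam[y], x))

def p_y_given_x(x):
    # Normalizing constant Z(x) sums the exponentiated scores over classes.
    Zx = sum(math.exp(score(y, x)) for y in lam_bias)
    return {y: math.exp(score(y, x)) / Zx for y in lam_bias}
```

Note that Z(x) depends on the input x but is a sum over classes only, which is what makes the conditional model tractable even with rich features.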
Different Notation
Using a set of feature functions
Defined for the feature weights: f_{y′,j}(y, x) = 𝟏{y′ = y} x_j
Defined for the bias weights: f_{y′}(y, x) = 𝟏{y′ = y}
f_k indexes each feature function f_{y′,j}
λ_k indexes the corresponding weight λ_{y′,j}
p(y|x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y, x) )
Different Notation
p(y|x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y, x) )
p(y|x) = (1/Z(x)) exp( λ_y + Σ_{j=1}^K λ_{y,j} x_j )
The two notations agree, where
λ_1 f_1(y, x) = λ_{y,1} x_1 + λ_y/K
λ_2 f_2(y, x) = λ_{y,2} x_2 + λ_y/K
.
.
.
λ_K f_K(y, x) = λ_{y,K} x_K + λ_y/K
Sequence Models
Discussing the simplest form of dependency, in which the
output variables are arranged in a sequence
Use the hidden Markov model (HMM)
An HMM models a sequence of observations X = {x_t}_{t=1}^T
by assuming there is an underlying sequence of states
Y = {y_t}_{t=1}^T drawn from a finite state set S
Sequence Models
An HMM makes two independence assumptions:
First, each state y_t is independent of all its ancestors y_1, y_2, …, y_{t−2}
given its previous state y_{t−1}
Second, each observation variable x_t depends only on the current state y_t
With these assumptions, specify an HMM using three
probability distributions:
The distribution p(y_1) over initial states
The transition distribution p(y_t|y_{t−1})
The observation distribution p(x_t|y_t)
The form of the HMM:
p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Initial state distribution: p(y_1) = p(y_1|y_0)
Proof
p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Proof:
p(y, x) = p(y) p(x|y)
       = p(y_1, y_2, …, y_T) p(x_1, x_2, …, x_T|y)
       = p(y_1) p(y_2, …, y_T|y_1) p(x_1|y) p(x_2, …, x_T|y, x_1)
       = p(y_1) p(y_2|y_1) … p(y_T|y_{T−1}, y_{T−2}, …, y_1)
         p(x_1|y) p(x_2|y, x_1) … p(x_T|y, x_{T−1}, x_{T−2}, …, x_1)
       = p(y_1) p(y_2|y_1) … p(y_T|y_{T−1})    (each state y_t is independent of all its
         ancestors y_1, y_2, …, y_{t−2} given its previous state y_{t−1})
         p(x_1|y_1) p(x_2|y_2) … p(x_T|y_T)    (each observation variable x_t depends
         only on the current state y_t)
       = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
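The factored joint above translates directly into code. The two-state, two-symbol HMM below is a toy example with invented probabilities, with p(y_1) playing the role of p(y_1|y_0):

```python
# HMM joint p(y, x) = prod_t p(y_t | y_{t-1}) p(x_t | y_t),
# using p(y_1) for the first factor. Toy numbers, chosen for illustration.
init = {0: 0.6, 1: 0.4}                                    # p(y_1)
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}         # p(y_t | y_{t-1})
emit = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.2, 'b': 0.8}}  # p(x_t | y_t)

def hmm_joint(states, obs):
    # First factor uses the initial-state distribution.
    prob = init[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(states)):
        prob *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return prob
```

For example, hmm_joint([0], ['a']) is just p(y_1=0) p(x_1='a'|y_1=0) = 0.6 × 0.9.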
Outline
2. Graphical Models
2-1 Definitions
2-2 Application of Graphical Models
2-3 Discriminative and Generative Models
Discriminative and Generative Models
Naïve Bayes is generative, meaning that it is based on a
model of the joint distribution p(y, x)
Logistic regression is discriminative, meaning that it is
based on a model of the conditional distribution p(y|x)
The main difference is that the conditional distribution
does not include the distribution p(x)
To include interdependent features in a generative model,
two choices are available:
Enhancing the model to represent dependencies among inputs
Making simplifying independence assumptions
Discriminative and Generative Models
If the two models are defined over the same hypothesis space, each
model can be converted into the other
Interpret logistic regression generatively as
p(y, x) = exp( Σ_k λ_k f_k(y, x) ) / Σ_{y′,x′} exp( Σ_k λ_k f_k(y′, x′) )
Naïve Bayes and logistic regression form a generative-
discriminative pair
The principal advantage of discriminative modeling is that
it is better suited to including rich, overlapping features.
Outline
3. Linear-Chain Conditional Random Fields
3-1 From HMMs to CRFs
3-2 Parameter Estimation
3-3 Inference
From HMMs to CRFs
Begin by considering the conditional distribution p(y|x) that
follows from the joint distribution p(y, x) of an HMM
Key point: with a particular choice of feature functions
Sequence HMM joint distribution:
p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Rewriting it generally:
p(y, x) = (1/Z) exp( Σ_t Σ_{i,j∈S} λ_ij 𝟏{y_t = i} 𝟏{y_{t−1} = j}
                   + Σ_t Σ_{i∈S} Σ_{o∈O} μ_oi 𝟏{y_t = i} 𝟏{x_t = o} )
Reviewing The HMM
An HMM makes two independence assumptions:
First, each state y_t is independent of all its ancestors y_1, y_2, …, y_{t−2}
given its previous state y_{t−1}
Second, each observation variable x_t depends only on the current state y_t
With these assumptions, specify an HMM using three
probability distributions:
The distribution p(y_1) over initial states
The transition distribution p(y_t|y_{t−1})
The observation distribution p(x_t|y_t)
The form of the HMM:
p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Initial state distribution: p(y_1) = p(y_1|y_0)
From HMMs to CRFs
HMM joint distribution: p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Rewriting it generally:
p(y, x) = (1/Z) exp( Σ_t Σ_{i,j∈S} λ_ij 𝟏{y_t = i} 𝟏{y_{t−1} = j}
                   + Σ_t Σ_{i∈S} Σ_{o∈O} μ_oi 𝟏{y_t = i} 𝟏{x_t = o} )
S: the set of states
λ_ij = log p(y_t = i|y_{t−1} = j)
O: the set of observations
μ_oi = log p(x_t = o|y_t = i)
Easy to express with feature functions
HMM by Using Feature Functions
Each feature function has the form f_k(y_t, y_{t−1}, x_t)
f_ij(y, y′, x) = 𝟏{y = i} 𝟏{y′ = j} for each transition (i, j)
f_io(y, y′, x) = 𝟏{y = i} 𝟏{x = o} for each state-observation
pair (i, o)
Write an HMM as:
p(y, x) = (1/Z) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
Compare
p(y, x) = (1/Z) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
p(y, x) = (1/Z) exp( Σ_t Σ_{i,j∈S} λ_ij 𝟏{y_t = i} 𝟏{y_{t−1} = j}
                   + Σ_t Σ_{i∈S} Σ_{o∈O} μ_oi 𝟏{y_t = i} 𝟏{x_t = o} )
Let
λ_k = λ_ij  for k = 1 ~ T|S|²
λ_k = μ_oi  for k = T|S|² + 1 ~ T|S|² + T|S||O|
f_k(y_t, y_{t−1}, x_t) = 𝟏{y_t = i} 𝟏{y_{t−1} = j}  for k = 1 ~ T|S|²
f_k(y_t, y_{t−1}, x_t) = 𝟏{y_t = i} 𝟏{x_t = o}     for k = T|S|² + 1 ~ T|S|² + T|S||O|
Linear-Chain CRF
By the definition of the conditional distribution:
p(y|x) = p(y, x) / p(x) = p(y, x) / Σ_{y′} p(y′, x)
Use the discussion of the joint distribution:
p(y, x) = (1/Z) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
so that
p(y|x) = exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
       / Σ_{y′} exp( Σ_{k=1}^K λ_k f_k(y′_t, y′_{t−1}, x_t) )
Similarly, obtain the distribution p(y) from the joint by p(y) = Σ_{x′} p(y, x′)
Linear-Chain CRF
Let the parameter x_t of the feature function
f_k(y_t, y_{t−1}, x_t) be the vector x_t
This leads to the general definition of a linear-chain CRF
Definition 1.1
Y, X: random vectors
Λ = {λ_k} ∈ ℝ^K: a parameter vector
{f_k(y, y′, x_t)}_{k=1}^K: a set of real-valued feature functions
Linear-chain conditional random field:
p(y|x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
Z(x) = Σ_y exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
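Definition 1.1 can be checked by brute force on a tiny problem: score every label sequence, exponentiate, and normalize. The two feature functions and weights below are arbitrary assumptions for illustration; brute-force enumeration of Z(x) is only feasible here because the sequence is short.

```python
import math
from itertools import product

# Linear-chain CRF: p(y|x) = exp(sum over t, k of lam_k f_k) / Z(x),
# with Z(x) summed over all label sequences by brute force (tiny toy only).
STATES = (0, 1)
LAMBDAS = [0.5, 1.2]  # assumed weights

def features(y_t, y_prev, x_t):
    # f_1: one transition indicator, f_2: one state-observation indicator.
    return [1.0 if (y_prev, y_t) == (0, 1) else 0.0,
            1.0 if (y_t, x_t) == (1, 'a') else 0.0]

def score(y_seq, x_seq):
    total, y_prev = 0.0, None  # None acts as a dummy start state
    for y_t, x_t in zip(y_seq, x_seq):
        total += sum(l * f for l, f in zip(LAMBDAS, features(y_t, y_prev, x_t)))
        y_prev = y_t
    return total

def p_y_given_x(y_seq, x_seq):
    Zx = sum(math.exp(score(y, x_seq))
             for y in product(STATES, repeat=len(x_seq)))
    return math.exp(score(y_seq, x_seq)) / Zx
```

The probabilities over all label sequences sum to one by construction, since Z(x) is exactly that sum of exponentiated scores.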
Linear-Chain CRF
HMM-like CRF:
p(y|x) = exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
       / Σ_{y′} exp( Σ_{k=1}^K λ_k f_k(y′_t, y′_{t−1}, x_t) )
Linear-chain CRF:
p(y|x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
Linear-Chain CRF
Allow the score of the transition (i, j) to depend on the
current observation vector, by adding a feature
𝟏{y_t = j} 𝟏{y_{t−1} = i} 𝟏{x_t = o}
p(y|x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
However, the normalization constant sums over all
possible state sequences, an exponentially large number of
terms.
It can be computed efficiently by forward-backward, as
explained later
Outline
3. Linear-Chain Conditional Random Fields
3-1 From HMMs to CRFs
3-2 Parameter Estimation
3-3 Inference
Parameter Estimation
Discuss how to estimate the parameters Λ = {λ_k}
Given iid training data 𝒟 = {x^(i), y^(i)}_{i=1}^N,
where x^(i) = {x_1^(i), x_2^(i), …, x_T^(i)} is a sequence of inputs
and y^(i) = {y_1^(i), y_2^(i), …, y_T^(i)} is a sequence of the desired
predictions
Performed by penalized maximum likelihood
Because we model the conditional distribution, the
conditional log likelihood is appropriate:
ℓ(θ) = Σ_{i=1}^N log p(y^(i)|x^(i))
Parameter Estimation
After substituting the CRF model into the likelihood:
ℓ(θ) = Σ_{i=1}^N Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(y_t^(i), y_{t−1}^(i), x_t^(i)) − Σ_{i=1}^N log Z(x^(i))
As a measure to avoid overfitting, use regularization, which is a
penalty on weight vectors whose norm is too large
A common choice of penalty is based on the Euclidean norm of
θ and on a regularization parameter 1/2σ²
Regularized log likelihood:
ℓ(θ) = Σ_{i=1}^N Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(y_t^(i), y_{t−1}^(i), x_t^(i)) − Σ_{i=1}^N log Z(x^(i)) − Σ_{k=1}^K λ_k²/(2σ²)
Parameter Estimation
The function ℓ(θ) cannot be maximized in closed form.
The partial derivative:
∂ℓ/∂λ_k = Σ_{i=1}^N Σ_{t=1}^T f_k(y_t^(i), y_{t−1}^(i), x_t^(i))
        − Σ_{i=1}^N Σ_{t=1}^T Σ_{y,y′} f_k(y, y′, x_t^(i)) p(y, y′|x^(i)) − λ_k/σ²
First term: the expected value of f_k under the empirical
distribution p̃(y, x) = (1/N) Σ_{i=1}^N 𝟏{y = y^(i)} 𝟏{x = x^(i)}
Second term: arises from the derivative of log Z(x), the
expectation of f_k under the model distribution p(y|x; θ) p̃(x)
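The "empirical expectation minus model expectation" structure of the gradient can be verified numerically on a tiny chain. This sketch drops the regularizer, uses one training sequence, and computes the model expectation by brute-force enumeration rather than forward-backward; the features and weights are assumed toy values. The finite-difference check at the end confirms the analytic gradient.

```python
import math
from itertools import product

# Gradient of the (unregularized) conditional log likelihood, one sequence:
# dL/dlam_k = sum_t f_k(...) - E_{p(y|x)}[ sum_t f_k ].  Brute-force toy.
STATES = (0, 1)

def feats(y_t, y_prev, x_t):
    return [1.0 if (y_prev, y_t) == (1, 1) else 0.0,
            1.0 if (y_t, x_t) == (0, 'b') else 0.0]

def seq_feats(y_seq, x_seq):
    # Feature counts summed over the whole sequence.
    totals, y_prev = [0.0, 0.0], None
    for y_t, x_t in zip(y_seq, x_seq):
        for k, f in enumerate(feats(y_t, y_prev, x_t)):
            totals[k] += f
        y_prev = y_t
    return totals

def log_lik(lams, y_seq, x_seq):
    def s(y):
        return sum(l * f for l, f in zip(lams, seq_feats(y, x_seq)))
    Zx = sum(math.exp(s(y)) for y in product(STATES, repeat=len(x_seq)))
    return s(y_seq) - math.log(Zx)

def gradient(lams, y_seq, x_seq):
    ys = list(product(STATES, repeat=len(x_seq)))
    weights = [math.exp(sum(l * f for l, f in zip(lams, seq_feats(y, x_seq))))
               for y in ys]
    Zx = sum(weights)
    model_exp = [0.0, 0.0]
    for y, w in zip(ys, weights):
        fy = seq_feats(y, x_seq)
        for k in range(2):
            model_exp[k] += (w / Zx) * fy[k]
    emp = seq_feats(y_seq, x_seq)  # empirical feature counts
    return [emp[k] - model_exp[k] for k in range(2)]

# Finite-difference check of the analytic gradient.
lams = [0.3, -0.7]
y_obs, x_obs = (1, 1, 0), ('a', 'b', 'b')
g = gradient(lams, y_obs, x_obs)
eps = 1e-6
for k in range(2):
    bumped = list(lams)
    bumped[k] += eps
    num = (log_lik(bumped, y_obs, x_obs) - log_lik(lams, y_obs, x_obs)) / eps
    assert abs(num - g[k]) < 1e-4
```

Adding the −λ_k/σ² regularization term to each gradient component is a one-line change once this core is in place.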
Optimize ℓ(θ)
The function ℓ(θ) is concave, which follows from the convexity
of functions of the form g(x) = log Σ_i exp(x_i)
Every local optimum is also a global optimum in concave
functions
Adding regularization ensures ℓ is strictly concave, which
implies that it has exactly one global optimum
The simplest approach to optimizing ℓ is steepest ascent along the
gradient.
Newton’s method converges much faster because it takes into
account the curvature of the likelihood, but it requires computing the
Hessian
Optimize ℓ(θ)
Quasi-Newton methods: BFGS [Bertsekas, 1999]
Limited-memory version of BFGS, due to Byrd et al. [1994]
When such second-order methods are used, gradient-based
optimization is much faster than the original approaches based on
iterative scaling in Lafferty et al. [2001], as shown
experimentally by several authors [Sha and Pereira, 2003,
Wallach, 2002, Malouf, 2002, Minka, 2003]
Computational cost:
p(y_t, y_{t−1}|x): O(TM²), where M is the number of states of each y_t
Total computational cost: O(TM²NG), where N is the number of training
examples and G is the number of gradient computations
Outline
3. Linear-Chain Conditional Random Fields
3-1 From HMMs to CRFs
3-2 Parameter Estimation
3-3 Inference
Inference
Two common inference problems for CRFs:
During training, computing the gradient requires the marginal
distribution for each edge, p(y_t, y_{t−1}|x), and computing the
likelihood requires Z(x)
To label an unseen instance, we compute the most likely labeling
y* = argmax_y p(y|x)
In linear-chain CRFs, these inference tasks can be performed
efficiently and exactly by the dynamic-programming algorithms for
HMMs
Here, review the HMM algorithms, and extend them to linear-chain
CRFs
Introduce Notations
HMM: p(y, x) = ∏_{t=1}^T p(y_t|y_{t−1}) p(x_t|y_t)
Viewed as a factor graph: p(x, y) = (1/Z) ∏_t Ψ_t(y_t, y_{t−1}, x_t)
Define the factors and normalization constant as:
Z = 1
Ψ_t(y_t = j, y_{t−1} = i, x_t = x) ≝ p(y_t = j|y_{t−1} = i) p(x_t = x|y_t = j)
If viewed as a weighted finite state machine, Ψ_t(j, i, x) is the
weight on the transition from state i to state j, when the current
observation is x
HMM Forward Algorithm
Used to compute the probability p(x) of the observations
First, using the distributive law:
p(x) = Σ_y p(x, y)
     = Σ_y ∏_t Ψ_t(y_t, y_{t−1}, x_t)
     = Σ_{y_1,y_2,…,y_T} ∏_t Ψ_t(y_t, y_{t−1}, x_t)
     = Σ_{y_T} Σ_{y_{T−1}} Ψ_T(y_T, y_{T−1}, x_T) Σ_{y_{T−2}} Ψ_{T−1}(y_{T−1}, y_{T−2}, x_{T−1}) …
HMM Forward Algorithm
p(x) = Σ_{y_T} Σ_{y_{T−1}} Ψ_T(y_T, y_{T−1}, x_T) Σ_{y_{T−2}} Ψ_{T−1}(y_{T−1}, y_{T−2}, x_{T−1}) …
Each of the intermediate sums is reused many times, and we can
save an exponential amount of work by caching the inner sums
Forward variables α_t:
Each is a vector of size M (the number of states)
Store the intermediate sums
HMM Forward Algorithm
p(x) = Σ_{y_T} Σ_{y_{T−1}} Ψ_T(y_T, y_{T−1}, x_T) Σ_{y_{T−2}} Ψ_{T−1}(y_{T−1}, y_{T−2}, x_{T−1}) …
Defined as:
α_t(j) ≝ p(x_{1…t}, y_t = j)
       = Σ_{y_{1…t−1}} Ψ_t(j, y_{t−1}, x_t) ∏_{t′=1}^{t−1} Ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})
Computed by the recursion:
α_t(j) = Σ_{i∈S} Ψ_t(j, i, x_t) α_{t−1}(i)
Initialization: α_1(j) = Ψ_1(j, y_0, x_1)
Finally, p(x) = Σ_{y_T} α_T(y_T)
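The forward recursion is short in code. The sketch below uses the same toy HMM tables as earlier (invented numbers), folds Ψ_t(j, i, x) = p(j|i) p(x|j) into the update, and checks the result against the exponential brute-force sum it replaces:

```python
from itertools import product

# Forward recursion alpha_t(j) = sum_i Psi_t(j, i, x_t) * alpha_{t-1}(i),
# verified against the brute-force sum over all state sequences.
STATES = (0, 1)
init = {0: 0.6, 1: 0.4}
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.2, 'b': 0.8}}

def forward(obs):
    alpha = {j: init[j] * emit[j][obs[0]] for j in STATES}  # alpha_1
    for x in obs[1:]:
        alpha = {j: sum(trans[i][j] * emit[j][x] * alpha[i] for i in STATES)
                 for j in STATES}
    return sum(alpha.values())  # p(x) = sum over y_T of alpha_T(y_T)

def brute_force(obs):
    # Exponential-time reference: enumerate every state sequence.
    total = 0.0
    for ys in product(STATES, repeat=len(obs)):
        p = init[ys[0]] * emit[ys[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[ys[t - 1]][ys[t]] * emit[ys[t]][obs[t]]
        total += p
    return total

obs = ('a', 'b', 'a', 'b')
assert abs(forward(obs) - brute_force(obs)) < 1e-12
```

The recursion does O(TM²) work, against O(M^T) for the brute-force sum, which is the "exponential amount of work" the caching saves.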
HMM Backward Algorithm
p(x) = Σ_{y_T} Σ_{y_{T−1}} Ψ_T(y_T, y_{T−1}, x_T) Σ_{y_{T−2}} Ψ_{T−1}(y_{T−1}, y_{T−2}, x_{T−1}) …
Backward variables:
β_t(i) ≝ p(x_{t+1…T}|y_t = i)
       = Σ_{y_{t+1…T}} ∏_{t′=t+1}^T Ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})
Recursion:
β_t(i) = Σ_{j∈S} Ψ_{t+1}(j, i, x_{t+1}) β_{t+1}(j)
Initialization: β_T(i) = 1
HMM Forward-Backward Algorithm
Applying the distributive law:
p(y_{t−1}, y_t|x)
∝ Ψ_t(y_t, y_{t−1}, x_t) ( Σ_{y_{1…t−2}} ∏_{t′=1}^{t−1} Ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′}) )
                         ( Σ_{y_{t+1…T}} ∏_{t′=t+1}^T Ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′}) )
In terms of the forward and backward variables:
p(y_{t−1}, y_t|x) ∝ α_{t−1}(y_{t−1}) Ψ_t(y_t, y_{t−1}, x_t) β_t(y_t)
Viterbi Algorithm
Finally, compute the globally most probable assignment:
y* = argmax_y p(y|x)
This yields the Viterbi recursion if all the summations are
replaced by maximizations:
δ_t(j) = max_{i∈S} Ψ_t(j, i, x_t) δ_{t−1}(i)
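Swapping the sum for a max, plus back-pointers to recover the argmax path, gives Viterbi. Same toy HMM tables as above (invented numbers); since Ψ_t(j, i, x) = p(j|i) p(x|j) and the emission term does not depend on i, the argmax over i can ignore it:

```python
# Viterbi: delta_t(j) = max_i Psi_t(j, i, x_t) * delta_{t-1}(i),
# with back-pointers to recover the most probable state sequence.
STATES = (0, 1)
init = {0: 0.6, 1: 0.4}
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.2, 'b': 0.8}}

def viterbi(obs):
    delta = {j: init[j] * emit[j][obs[0]] for j in STATES}  # delta_1
    back = []
    for x in obs[1:]:
        new_delta, ptr = {}, {}
        for j in STATES:
            # emit[j][x] is constant in i, so it does not affect the argmax.
            best_i = max(STATES, key=lambda i: trans[i][j] * delta[i])
            ptr[j] = best_i
            new_delta[j] = trans[best_i][j] * emit[j][x] * delta[best_i]
        back.append(ptr)
        delta = new_delta
    # Trace back from the best final state.
    last = max(STATES, key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Observing 'a' twice favors state 0 (which emits 'a' with probability 0.9), while 'b' twice favors state 1.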
Forward-Backward Algorithm For CRF
The forward recursion, the backward recursion, and
the Viterbi recursion for a CRF are the same as for an HMM
Use the recursions to compute Z(x):
Z(x) = Σ_y exp( Σ_{k=1}^K λ_k f_k(y_t, y_{t−1}, x_t) )
Define Ψ_t(y_t, y_{t−1}, x_t) = exp( Σ_k λ_k f_k(y_t, y_{t−1}, x_t) ); then, just as
p(x) = Σ_y ∏_t Ψ_t(y_t, y_{t−1}, x_t),
Z(x) is computed by the same forward recursion
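Putting the pieces together: define Ψ_t from the features, run the same forward recursion as the HMM, and Z(x) falls out. The features and weights are the same assumed toy values used earlier; the brute-force enumeration confirms the recursion.

```python
import math
from itertools import product

# For a linear-chain CRF, Psi_t(j, i, x_t) = exp(sum_k lam_k f_k(j, i, x_t));
# then Z(x) = sum_y prod_t Psi_t, computable with the HMM forward recursion.
STATES = (0, 1)
LAMBDAS = [0.5, 1.2]  # assumed weights

def feats(y_t, y_prev, x_t):
    return [1.0 if (y_prev, y_t) == (0, 1) else 0.0,
            1.0 if (y_t, x_t) == (1, 'a') else 0.0]

def psi(y_t, y_prev, x_t):
    return math.exp(sum(l * f for l, f in zip(LAMBDAS, feats(y_t, y_prev, x_t))))

def Z_forward(x_seq):
    alpha = {j: psi(j, None, x_seq[0]) for j in STATES}  # dummy start state
    for x in x_seq[1:]:
        alpha = {j: sum(psi(j, i, x) * alpha[i] for i in STATES)
                 for j in STATES}
    return sum(alpha.values())

def Z_brute(x_seq):
    # Exponential-time reference: enumerate every label sequence.
    total = 0.0
    for ys in product(STATES, repeat=len(x_seq)):
        prob, y_prev = 1.0, None
        for y_t, x_t in zip(ys, x_seq):
            prob *= psi(y_t, y_prev, x_t)
            y_prev = y_t
        total += prob
    return total

x = ('a', 'b', 'a')
assert abs(Z_forward(x) - Z_brute(x)) < 1e-9
```

The only change from the HMM version is that the factors Ψ_t no longer need to be probabilities; the exponentiated feature scores are unnormalized, and Z(x) absorbs the difference.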