Lecture slides - CSE, IIT Bombay
NLP-AI Seminar
Graphical Models
for
Segmenting and Labeling Sequence
Data
Manoj Kumar Chinnakotla
Outline
• Introduction
• Directed Graphical Models
– Hidden Markov Models (HMMs)
– Maximum Entropy Markov Models (MEMMs)
• Label Bias Problem
• Undirected Graphical Models
– Conditional Random Fields (CRFs)
• Summary
The Task
• Labeling
– Given sequence data, mark appropriate tags for
each data item
• Segmentation
– Given sequence data, segment it into non-overlapping groups such that related entities are
in the same group
Applications
• Computational Linguistics
– POS Tagging
– Information Extraction
– Syntactic Disambiguation
• Computational Biology
– DNA and Protein Sequence Alignment
– Sequence homologue searching
– Protein Secondary Structure Prediction
Example : POS Tagging
Directed Graphical Models
• Hidden Markov models (HMMs)
– Assign a joint probability to paired observation and
label sequences
– The parameters are trained to maximize the joint likelihood
of the training examples
Hidden Markov Models (HMMs)
• Generative Model - Models the joint distribution
P ( w, t )
• Generation Process
– Probabilistic Finite State Machine
– Set of states – correspond to tags
– Alphabet – set of words
– Transition Probability – P(t_i | t_{i-1})
– State Probability – P(w_i | t_i)
HMMs (Contd..)
• For a given word/tag sequence pair
P(w, t) = Π_i P(t_i | t_{i-1}) * P(w_i | t_i)
• Why Hidden?
– Sequence of tags which generated word sequence not visible
• Why Markov?
– Based on the Markov assumption: the current tag depends only on
the previous ‘n’ tags
– Solves the “sparsity problem”
• Training – Learning the transition and emission
probabilities from data
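A minimal sketch of the generative story above, assuming a tiny hypothetical tagged corpus: transition and emission probabilities are estimated by relative counts, and the joint probability P(w, t) is the product of the per-position terms. The corpus, tag set, and start pseudo-tag are invented for illustration, not taken from the slides.

```python
from collections import defaultdict

# Toy tagged corpus (hypothetical); each sentence is a list of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

# Training: estimate transition P(t_i | t_{i-1}) and emission P(w_i | t_i) by relative counts.
trans, emit, tag_count = defaultdict(int), defaultdict(int), defaultdict(int)
for sent in corpus:
    prev = "<s>"                     # hypothetical sentence-start pseudo-tag
    tag_count[prev] += 1
    for word, tag in sent:
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

def p_trans(prev, tag):
    return trans[(prev, tag)] / tag_count[prev]

def p_emit(tag, word):
    return emit[(tag, word)] / tag_count[tag]

# Joint probability P(w, t) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)
def joint(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_trans(prev, t) * p_emit(t, w)
        prev = t
    return p

print(joint(["the", "dog", "barks"], ["DT", "NN", "VBZ"]))
```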
HMMs Tagging Process
• Given a string of words w, choose tag sequence
t* such that
t* = argmax_t P(w, t)
• Computationally expensive - Need to evaluate
all possible tag sequences!
– For ‘n’ possible tags and ‘m’ positions – O(n^m) sequences
• Viterbi Algorithm
– Used to find the optimal tag sequence t*
– Efficient dynamic programming based algorithm
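A minimal sketch of the Viterbi dynamic program for t* = argmax_t P(w, t), assuming hypothetical start, transition, and emission tables; unseen events fall back to a small constant rather than a proper smoothing scheme.

```python
# Minimal Viterbi sketch: finds t* = argmax_t P(w, t) under a first-order HMM.
# All probability tables below are hypothetical toy values, not from the slides.
tags = ["DT", "NN", "VBZ"]
start = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VBZ"): 0.8}              # unlisted pairs fall back to 1e-6
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.4, ("VBZ", "barks"): 0.3}

def viterbi(words):
    # delta[t] = best score of any tag sequence ending in tag t; back stores argmax pointers
    delta = {t: start[t] * emit.get((t, words[0]), 1e-6) for t in tags}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * trans.get((p, t), 1e-6))
            new_delta[t] = delta[best_prev] * trans.get((best_prev, t), 1e-6) * emit.get((t, w), 1e-6)
            pointers[t] = best_prev
        delta, back = new_delta, back + [pointers]
    # Backtrack from the best final tag to recover the optimal tag sequence
    last = max(tags, key=lambda t: delta[t])
    seq = [last]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq)), delta[last]

print(viterbi(["the", "dog", "barks"]))
```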
Disadvantages of HMMs
• Need to enumerate all possible observation
sequences
• Not possible to represent multiple interacting
features
• Difficult to model long-range dependencies of the
observations
• Very strict independence assumptions on the
observations
Maximum Entropy Markov Models
(MEMMs)
• Conditional Exponential Models
– Assumes observation sequence given (need not model)
– Trains the model to maximize the conditional likelihood
P(Y|X)
MEMMs (Contd..)
• For a new data sequence x, the label sequence y
which maximizes P(y | x, Θ) is assigned (Θ is the parameter set)
• Arbitrary non-independent features on observation
sequence possible
• Conditional models are known to perform better than
generative models
• Performs Per-State Normalization
– Total mass which arrives at a state must be distributed
among all possible successor states
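A minimal sketch of per-state normalization, assuming hypothetical indicator features and weights: the distribution over successor states is a local softmax of feature scores conditioned on the current state and observation.

```python
import math

# Per-state normalization sketch: P(s' | s, x) = exp(w . f(s, s', x)) / Z(s, x),
# where Z(s, x) sums only over the successor states of s.
# All features and weights below are hypothetical.
states = ["NN", "VB", "DT"]

def features(prev_state, state, obs):
    # Binary indicator features over (previous state, candidate state, current word)
    return {f"prev={prev_state}&cur={state}": 1.0,
            f"cur={state}&word={obs}": 1.0}

weights = {"prev=DT&cur=NN": 1.2, "cur=NN&word=dog": 0.8, "prev=DT&cur=VB": -0.5}

def p_next(prev_state, obs):
    scores = {s: sum(weights.get(k, 0.0) * v for k, v in features(prev_state, s, obs).items())
              for s in states}
    z = sum(math.exp(v) for v in scores.values())        # local (per-state) partition function
    return {s: math.exp(v) / z for s, v in scores.items()}

print(p_next("DT", "dog"))
```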
Label Bias Problem
• Bias towards states with fewer outgoing
transitions
• Due to per-state normalization
• An Example MEMM
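A small numeric illustration (with made-up scores) of why per-state normalization causes the bias: a state with a single outgoing transition assigns it probability 1 no matter how strongly the observation argues against it, whereas a state with several successors spreads its mass according to the observation.

```python
import math

# Toy illustration of label bias (hypothetical scores): state A has two successors,
# state B has only one, so B's locally normalized distribution ignores the observation.
def local_softmax(scores):
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

# The observation strongly disfavours the transition out of B, yet it still gets probability 1.0
print(local_softmax({"A->C": 2.0, "A->D": -1.0}))   # mass spread over two successors
print(local_softmax({"B->E": -5.0}))                # single successor: always 1.0
```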
Undirected Graphical Models
Random Fields
Conditional Random Fields (CRFs)
• Conditional Exponential Model like MEMM
• Has all the advantages of MEMMs without label
bias problem
– MEMM uses per-state exponential model for the
conditional probabilities of next states given the current
state
– CRF has a single exponential model for the joint
probability of the entire sequence of labels given the
observation sequence
• Allow some transitions “vote” more strongly than
others depending on the corresponding
observations
Definition of CRFs
CRF Distribution Function
p(y | x) = (1 / Z(x)) * exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )
Where:
V = Set of Label Random Variables
f_k and g_k = Features
g_k = State Feature
f_k = Edge Feature
θ = (λ_1, λ_2, …, λ_n; μ_1, μ_2, …, μ_n); the λ_k and μ_k
are parameters to be estimated
y|_e = Set of Components of y defined by edge e
y|_v = Set of Components of y defined by vertex v
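A minimal sketch of the globally normalized distribution above for a linear-chain CRF, with hypothetical edge and vertex features and a brute-force Z(x) that enumerates all label sequences (feasible only at toy scale; real implementations use dynamic programming over the chain).

```python
import math
from itertools import product

# Globally normalized linear-chain CRF sketch (brute-force Z(x); toy-sized only).
# Edge features f_k score adjacent label pairs; vertex features g_k score a label
# together with the observation at that position. All parameters are hypothetical.
labels = ["NN", "VB"]

def score(y, x, lam, mu):
    s = 0.0
    for i in range(1, len(y)):                       # edge features f_k(e, y|_e, x)
        s += lam.get((y[i - 1], y[i]), 0.0)
    for i, yi in enumerate(y):                       # vertex features g_k(v, y|_v, x)
        s += mu.get((yi, x[i]), 0.0)
    return s

def p_crf(y, x, lam, mu):
    # p(y | x) = exp(score(y, x)) / Z(x), with Z(x) summing over all label sequences
    z = sum(math.exp(score(yp, x, lam, mu)) for yp in product(labels, repeat=len(x)))
    return math.exp(score(y, x, lam, mu)) / z

# Hypothetical parameters and a two-word observation sequence
lam = {("NN", "VB"): 1.0}
mu = {("NN", "dog"): 1.5, ("VB", "barks"): 1.0}
print(p_crf(("NN", "VB"), ("dog", "barks"), lam, mu))
```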
CRF Training
CRF Training (Contd..)
• Condition for maximum likelihood
– The expected feature count computed using the model equals the
empirical feature count from the training data
• Closed form solution for parameters not
possible
• Iterative algorithms employed - Improve log
likelihood in successive iterations
• Examples
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
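A minimal sketch of the maximum-likelihood condition stated above, assuming a single hypothetical feature and brute-force expectations: the gradient of the conditional log likelihood is the empirical count minus the model-expected count, so the two are equal when the gradient is zero. Plain gradient ascent stands in here for the GIS/IIS algorithms named in the slide.

```python
import math
from itertools import product

# At the optimum, the model-expected count of each feature equals its empirical count,
# so the gradient below is zero. Brute-force, toy-sized, single hypothetical feature f(y, x).
labels = [0, 1]

def feature(y, x):
    # Fires once for every position labelled 1 over the word "a"
    return sum(1.0 for yi, xi in zip(y, x) if yi == 1 and xi == "a")

def log_likelihood_grad(data, w):
    grad = 0.0
    for x, y in data:
        grad += feature(y, x)                                        # empirical count
        seqs = list(product(labels, repeat=len(x)))
        z = sum(math.exp(w * feature(yp, x)) for yp in seqs)
        grad -= sum(math.exp(w * feature(yp, x)) / z * feature(yp, x)
                    for yp in seqs)                                  # expected count under the model
    return grad

data = [(("a", "b"), (1, 0)), (("a", "a"), (1, 0))]   # two hypothetical training pairs
w = 0.0
for _ in range(50):                                    # plain gradient ascent on the log likelihood
    w += 0.5 * log_likelihood_grad(data, w)
print(w, log_likelihood_grad(data, w))                 # gradient approaches zero as counts match
```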
Graphical Comparison
HMMs, MEMMs, CRFs
POS Tagging Results
Summary
• HMMs
– Directed, Generative graphical models
– Cannot be used to model overlapping features on
observations
• MEMMs
– Directed, Conditional Models
– Can model overlapping features on observations
– Suffer from label bias problem due to per-state
normalization
• CRFs
– Undirected, Conditional Models
– Avoid the label bias problem
– Efficient training possible
Thanks!
Acknowledgements
Some slides in this presentation are from
Rongkun Shen’s (Oregon State Univ)
Presentation on CRFs