Homotopy-based Semi-Supervised Hidden Markov Models for

Download Report

Transcript Homotopy-based Semi-Supervised Hidden Markov Models for

Homotopy-based SemiSupervised Hidden Markov
Models for Sequence Labeling
Gholamreza Haffari
Anoop Sarkar
Presenter: Milan Tofiloski
Natural Language Lab
Simon Fraser university
1
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
2
Maximum Likelihood Principle
• Parameter setting for the joint probability of
input-output which maximizes probability of
the given data:
• L : labeled data
• U : unlabeled data
3
Deficiency of MLE
• Usually |U| >> |L|, then
• Which means the relationship of input-output
is ignored when estimating the parameters !
– MLE focuses on modeling the input distribution P(x)
– But we are interested in modeling the joint distribution
P(x,y)
4
Remedy for the Deficiency
• Balance the effect of lab and unlab data:
• Find
which maximally take
advantage of lab and unlab data
•
MLE
5
An experiment with HMM
Lower is Better
MLE Performance
• MLE can hurt the performance
• Balancing lab and unlab data related terms is beneficial
6
Our Contributions
1. Introducing a principled way to
choose  for HMM in sequence
labeling (tagging) tasks
2. Introducing an efficient dynamic
programming algorithm to compute
second order statistics in HMM
7
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
8
Task
• Field segmentation in information extraction
• 13 tag fields: AUTHOR, TITLE, …
EDITOR
EDITOR
A
.
TITLE
Models
for
-
EDITOR
Elmagarmid
TITLE
PUB
EDITOR
TITLE
,
EDITOR
EDITOR
editor
TITLE
.
TITLE
Transaction
TITLE
Advanced Database Applications
PUB
Kaufmann
PUB
DATE
DATE
,
1992
.
TITLE
,
PUB
Morgan
9
Experimental Setup
• Use an HMM with 13 states
– Freeze the transition (state->state) probabilities to what
has been observed in the lab data
– Use the Homotopy method to just learn the emission
(state->alphabet) probabilities
– Do add- smoothing for the initial values of emission
and transition probabilities
• Data statistics:
– Average seq. length : 36.7
– Average number of segments in a seq: 5.4
– Size of Lab/Unlab data is 300/700
10
Baselines
• Held-out: put aside part of the lab data as
a held-out set, and use it t choose 
• Oracle: choose  based on test data using
per position accuracy
• Supervised: forgetting about unlab data,
and just using lab data
11
Homotopy vs Baselines
Higher is Better
Even very small values of  can be useful.
In homotopy =.004, and in supervised  = 0
• Sequence of most probable states decoding
See paper for more results
12
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
13
Path of Solutions
• Look at  as  changes from 0 to 1
• Choose the best  based on the path





Discontinuity

Bifurcation
14
EMfor HMM
•
Let
•
To find best parameter values  which (locally) maximizes the
objective function for a fixed :
be a state->state or state->observation event in our HMM
Repeat until convergence
EM()
15
Fixed Points of EM
• Useful fact
• At the fixed points
, then
• This is similar to using Homotopy for root finding
– Same numerical techniques should be applicable here
16
Homotopy for Root Finding
• To find a root of G()
– start from a root of a simple problem F()
– trace the roots of intermediate problems while
morphing F to G
• To find  which satisfy the above:
– Set the derivative to zero: gives differential equation
– Numerically solve the resulting differential eqn.
17
Solving the Differential Eqn
Jaccobian of EM1
M
.
v=0
Repeat until
–
Update
in a proper direction parallel to v=Kernel(M)
– Update M
18
Jaccobian of EM1
See the paper for details
• So, we need to compute the covariance
matrix of the events
• The entry in the row
and column
of
the covariance matrix is
Challenging for HMM
19
Forward-Backward
Expected Quadratic Counts for HMM
• Dynamic programming algorithm to efficiently
compute
• Pre-compute a table Zx for each sequence
…
k2
k1
… xi
xi+1
…
xj
…
…
• Having table Zx, the EQC can be computed efficiently
– The time complexity is
where K is the number of states
20
in the HMM (see paper for more details)
How to Choose  based on Path
• monotone: the first point at which the
monotonocity of  changes
• MaxEnt: choose  for which the model has
maximum entropy on the unlab data
• minEig: when solving the diff eqn, consider
the minimum singular value of the matrix M.
Across rounds, choose  for which the
minimum singular value is the smallest
21
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
22
Varying the Size of Unlab Data
• The three Homotopy-based methods outperform EM
• maxEnt outperforms minEig and monotone
• minEig and monotone have similar performances
Size of the labeled data: 100
23
Picked  Values
24
Picked  Values
• EM gives higher weight to
unlabeled data compared to
Homotopy-based method
•  selected by
− maxEnt are much smaller than those
selected by minEig and monotone
− minEig and monotone are close
25
Conclusion and Future Work
• Using EM can hurt performance in the
case |L| << |U|
• Proposed a method to alleviate this
problem for HMMs for seq. labeling tasks
• To speed up the method
– Using sampling to find approximation to
covariance matrix
– Using faster methods in recovering the
solution path, e.g. predictor-corrector
26
Questions?
27
Is Oracle outperformed by Homotopy?
• No!
- The performance measure used in selecting 
in oracle method may be different from that
used in comparing homotopy and oracle
- The decoding alg used in oracle may be
different from that used in comparing
homotopy and oracle
28
Why not set
?
• This adhoc way of setting  has two
drawbacks:
- It still may hurt the performance. The proper  may be
much smaller than that.
- In some situations, the right choice of  may be a big
value. Setting
is very conservative and
dose not fully take advantage of the available
unlabeled data.
29
Homotopy vs Baselines
Our method
(see the paper
for more results)
Higher is Better
– Viterbi Decoding: most probable seq of states decoding
– SMS Decoding: seq of most probable states decoding 30