CRF-OPT: An Efficient High Quality Conditional Random Field Solver
Download
Report
Transcript CRF-OPT: An Efficient High Quality Conditional Random Field Solver
Constrained Optimization for
Validation-Guided Conditional
Random Field Learning
Presented by: Qiang Yang, HKUST
Minmin Chen, Yixin Chen, Michael Brent , Aaron Tenney
Washington University in St. Louis
4/10/2016
KDD 2009
1
Conditional Random Fields
Conditional Random Fields (Lafferty, McCallum& Pereira 2001)
Probabilistic model for sequential data segmentation and
labeling
Labeling
Sequence
1
1
1
2
2
A
T
Observation
Sequence
G
C
G
FEATURE
4/10/2016
KDD 2009
2
Applications
Natural Language Processing
Lafferty J., McCallum A.& Pereira F. 2001
Sha F. & Pereira F. 2003
Sarawagi S. & Cohen W. W. 2005
Computer Vision
Sminchisescu C., Kanaujia A. & Metaxas D. 2006
Vishwanathan S. V. N. et al. 2006
Bioinformatics
Culotta A., Kulp D. & McCallum A. 2005
Gross, S. S. et al. 2007
4/10/2016
KDD 2009
3
Challenges and Related Work
Overfitting
Likelihood training Overfitting
Model flexibility Large feature set Overfitting
Related Work
Regularization (Sha 2003, Vail 2007)
Smoothing methods (Newman 1977, Rosenfeld 1996)
Regularization with a Gaussian prior works as well as or better
than smoothing methods, but the regularization parameter is
hard to tune
4/10/2016
KDD 2009
4
Motivation of Proposed Work
Cross-validation is often used to estimate the accuracy
of a classifier and to select models. Generally, the
performance on the validation set is strongly correlated
to the performance of the trained model on the unseen
data
Constraints prescribing the performance of the trained
model on the validation set are used to guide the
learning process and avoid the model from fitting freely
and tightly to the training data
4/10/2016
KDD 2009
5
Single Training Multiple Validation (STMV)
Framework
STMV Framework
Small data set
Training
V1
V2
Testing
V3
Large data set
Training
4/10/2016
Testing
KDD 2009
V1
V2
V3
6
Constrained Formulation
Original objective : maximize the
log likelihood of the labeling
sequence given the observation
sequence of the training data
Where
Score of a specific labeling sequence
y for an observation sequence x
Score of the most likely sequence
found by Viterbi under current
model for validation sequence v(j)
Score of real labeling sequence for
validation sequence v(j)
Constraints prescribing the difference of scores to ensure the model take
consideration of the performance on the validation sets
4/10/2016
KDD 2009
7
Extended Saddle Point (ESP) Theory
Extended Saddle Point Theory (Wah & Chen 2005)
Introduces a necessary and sufficient condition on the
Constrained Local Minimum (CLM) of a constrained nonlinear
programming problem in a continuous, discrete, or mixed
space
Offer several salient advantages over previous constraint
handling theory
• Does not require the constraints to be differentiable or in closed form
• Satisfied over an extended region of penalty values
• Necessary and sufficient
4/10/2016
KDD 2009
8
Extended Saddle Point (ESP) Search
Algorithm
Extended Saddle Point Search Algorithm
Transform the constrained formulation into a
penalty form
where
Outer loop updates the extended penalty values
Inner loop minimizes
Challenges
Efficient calculation of the gradient of the
penalty function
• The first term of the constraint is determined by the most likely sequence
found by the Viterbi Algorithm.
• Change of the parameter W can result in a very different sequence
non-differentiable
4/10/2016
KDD 2009
9
Approximation of the discontinuous
Gradient
Highest prob. Of reaching
this state at time t
f(s1, s2, X)
4/10/2016
KDD 2009
10
Experimental Results : Gene Prediction
Find out the protein coding regions and associated
components
Apply our STMV framework and ESP search algorithm
on CONTRAST, a state-of-the-art gene predictor
Data set:
Fruit fly genome, 27,463 genes, evenly divided into 4 sets
Feature set includes around 33,000 features
Compare performance to original CONTRAST, and
CONTRAST with regularization
Gene, exon, nucleotide level sensitivity and specificity
4/10/2016
KDD 2009
11
Experimental Results : Gene Prediction(cont)
The performance of original CRF, regularized CRF(CRFr) and constrained
CRF(CRFc) on CONTRAST
4/10/2016
KDD 2009
12
Experimental Result: Stock Price Prediction
Predict if tomorrow’s stock price will raise or fall
comparing to today’s, based on historical data;
Preprocessing techniques to smooth out the noise in
the raw data
Date set:
1741 stocks from NASDAQ and NYSE
Each contains stock prices from 2002-02-01 to 2007-09-12
Feature set includes 2,000 features
Predict on day T+1:
Training Sequence : day 1 ~ T
Validation Sequence: day T-V+1 ~ T (V = 100)
4/10/2016
KDD 2009
13
Experimental Result: Stock Price
Prediction(cont)
4/10/2016
KDD 2009
14
Conclusion
A Single Training Multiple Validation (STMV) framework
Integrate validation into the training process by modeling the
validation quality as constraints in the problem formulation
Effectively avoid overfitting of CRF models
Approximation Scheme
Efficiently approximate the discontinuous gradient of our
constrained formulation
Extended Saddle Point (ESP) search algorithm
Robustly find a constrained local minimum for our constrained
formulation
4/10/2016
KDD 2009
15
Questions?
[email protected]
4/10/2016
KDD 2009
16