Training dependency parsers by jointly optimizing multiple objectives

Training dependency parsers by
jointly optimizing multiple objectives
Keith Hall, Ryan McDonald, Jason Katz-Brown, Michael Ringgaard
Evaluation
• Intrinsic
– How well does system replicate gold annotations?
– Precision/recall/F1, accuracy, BLEU, ROUGE, etc.
• Extrinsic
– How useful is system for some downstream task?
• High performance on one doesn’t necessarily
mean high performance on the other
• Can be hard to evaluate extrinsically
Dependency Parsing
• Given a sentence, label the dependencies
• Example dependency parse (figure from nltk.org)
• Output is useful for downstream tasks like
machine translation
– Also of interest to NLP researchers
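As a rough illustration (not taken from the paper), the output of a dependency parser can be represented by giving each token a head index and a dependency label; the sentence and labels below are made up for the example.

```python
# Hypothetical example of dependency parser output.
# Head index 0 denotes the artificial ROOT token; indices are 1-based.
sentence = ["I", "saw", "the", "dog"]
heads    = [2, 0, 4, 2]      # "saw" attaches to ROOT; "I" and "dog" attach to "saw"
labels   = ["nsubj", "root", "det", "dobj"]
```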
Overview of paper
• Optimize parser for two metrics
– Intrinsic evaluation
– Downstream task (here, a reranker in a machine translation system)
• Algorithm to do this
• Experiments
Perceptron Algorithm
• Takes: set of labeled training examples; loss
function
• For each example, predicts output, updates
model if the output is incorrect
– Rewards features that fire in the gold-standard output
– Penalizes those that fire in the predicted output
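A minimal sketch of this update, assuming sparse feature vectors stored as dicts and hypothetical decode/features routines standing in for a real parser's inference and feature extraction:

```python
def perceptron_train(examples, decode, features, epochs=1):
    """Structured perceptron training loop (sketch)."""
    weights = {}
    for _ in range(epochs):
        for sentence, gold in examples:
            predicted = decode(sentence, weights)   # model's current best output
            if predicted != gold:
                # Reward features that fire in the gold-standard analysis...
                for f, v in features(sentence, gold).items():
                    weights[f] = weights.get(f, 0.0) + v
                # ...and penalize those that fire in the predicted output.
                for f, v in features(sentence, predicted).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights
```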
Augmented Loss Perceptron Algorithm
• Similar to the perceptron, except it takes: multiple loss functions; multiple datasets (one for each loss function); a scheduler to weight the loss functions
• Perceptron is an instance of ALP with one loss
function, one dataset, and a trivial scheduler
• Will look at ALP with 2 loss functions
• Can use extrinsic evaluator as loss function
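A rough sketch of the augmented-loss training loop, under these assumptions: one dataset per loss function, a scheduler given as a simple list of loss indices to cycle through, and a hypothetical update routine that performs one perceptron-style update for the chosen example under the chosen loss.

```python
import itertools

def augmented_loss_train(datasets, losses, schedule, update, rounds=1000):
    """Augmented-loss perceptron loop (sketch).

    datasets -- one iterable of examples per loss function
    losses   -- the loss functions themselves
    schedule -- list of loss indices, e.g. [0, 0, 1] applies loss 0 twice
                as often as loss 1 (a simple way to weight the losses)
    update   -- hypothetical routine doing one perceptron-style update
    """
    weights = {}
    streams = [itertools.cycle(ds) for ds in datasets]
    scheduler = itertools.cycle(schedule)
    for _ in range(rounds):
        i = next(scheduler)          # scheduler picks which loss/dataset to use
        example = next(streams[i])
        update(weights, example, losses[i])
    return weights
```

The plain perceptron above is the special case with one dataset, one loss function, and the trivial schedule [0].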
Reranker loss function
• Takes the k-best output from the parser
• Assigns a cost to each parse
• Takes the lowest-cost parse to be the "correct" parse
• If the 1-best parse is lowest cost, does nothing
• Otherwise updates parameters based on the correct parse
• Standard loss function is instance of this in
which the cost is always lowest for 1-best
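A sketch of this reranker-style update, assuming kbest is the parser's ranked candidate list, cost is an extrinsic cost function, and features is the same hypothetical feature extractor as in the perceptron sketch above:

```python
def reranker_update(weights, sentence, kbest, cost, features):
    """Reranker-style perceptron update over a k-best list (sketch)."""
    best = min(kbest, key=cost)      # lowest-cost candidate = "correct" parse
    one_best = kbest[0]              # k-best list is assumed sorted best-first
    if best is one_best:
        return                       # 1-best already has lowest cost: no update
    # Otherwise update toward the lowest-cost parse, away from the 1-best.
    for f, v in features(sentence, best).items():
        weights[f] = weights.get(f, 0.0) + v
    for f, v in features(sentence, one_best).items():
        weights[f] = weights.get(f, 0.0) - v
```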
Experiment 1
• English to Japanese MT system, specifically word
reordering step
– Given a parse, reorder the English sentence into
Japanese word order
• Transition-based and graph-based dependency
parsers
• 17,260 manually annotated word reorderings
– 10,930 training, 6,338 test
– These are cheaper to produce than dependency
parses
Experiment 1
• 2nd loss function based on METEOR
– score = 1 - (#chunks - 1) / (#unigrams matched - 1)
– cost = 1 - score
• Matched unigrams are those that appear in both the reference and the hypothesis
• Chunks are groups of matched unigrams that are adjacent in both the reference and the hypothesis
• Vary weights of primary and secondary loss
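A small sketch of this METEOR-based cost, assuming the chunk and matched-unigram counts have already been computed for a hypothesis reordering against the reference word order:

```python
def reordering_cost(num_chunks, num_matched_unigrams):
    """METEOR-style reordering cost (sketch): cost = 1 - score."""
    if num_matched_unigrams <= 1:
        return 1.0                   # degenerate case: nothing useful matched
    score = 1.0 - (num_chunks - 1) / (num_matched_unigrams - 1)
    return 1.0 - score
```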
Experiment 1
• As ratio of extrinsic loss : intrinsic loss
increases, performance on reordering task
improves
• Transition-based parser

Intrinsic : Extrinsic   % Correctly Reordered   Reordering Score
1 : 0                   35.29                   76.49
1 : 0.5                 38.71                   78.19
1 : 1                   39.02                   78.39
1 : 2                   39.58                   78.67
Experiment 2
• Semi-supervised adaptation: Penn Treebank
(PTB) to Question Treebank (QTB)
• A PTB-trained parser performs poorly on the QTB
• A QTB-trained parser does much better on the QTB
• Ask annotators a simple question about QTB
sentences
– What is the main verb?
– ROOT usually attaches to main verb
• Use answers and PTB to adapt to QTB
Experiment 2
• Augmented loss data set: QTB data with ROOT
attached to main verb
– No other labels on QTB data
• Loss function: 0 if ROOT dependency correct,
1 otherwise
• Secondary loss function looks at k-best,
chooses highest ranked parse with correct
ROOT dependency
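A sketch of this ROOT-only loss and of picking the update target from the k-best list, where root_of is a hypothetical accessor returning the token index that the ROOT arc attaches to in a candidate parse:

```python
def root_cost(parse, gold_root, root_of):
    """0 if the ROOT dependency is correct, 1 otherwise."""
    return 0 if root_of(parse) == gold_root else 1

def select_root_target(kbest, gold_root, root_of):
    """Highest-ranked candidate in the k-best list with the correct ROOT."""
    for parse in kbest:              # kbest is assumed sorted best-first
        if root_of(parse) == gold_root:
            return parse
    return None                      # no candidate attaches ROOT correctly
```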
Experiment 2
• Results for transition parser

Setup        LAS     UAS     ROOT-F1
PTB          67.97   73.52   47.60
QTB          84.59   89.59   91.06
Aug. loss    76.27   86.42   83.41
• Huge improvement with data that is very
cheap to collect
– Cheaper to get Turkers to annotate main verbs
than grad students to manually parse sentences
Experiment 3
• Improving accuracy on labeled and unlabeled
dependency parsing (all intrinsic)
• Use labeled attachment score as primary loss
function
• Secondary loss function weights lengths of
incorrect and correct arcs
– One version uses labeled arcs, the other unlabeled
• Idea is to have model account for arc length
– Parsers tend to do poorly on long dependencies
(McDonald and Nivre, 2007)
Experiment 3
• Weighted Arc Length Score (ALS): sum of the lengths of all correct arcs divided by the sum of the lengths of all arcs
• In the unlabeled version only the head attachment needs to match
• In the labeled version the arc label must match too
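A sketch of ALS under some representation assumptions: each parse is a list of (dependent, head, label) arcs, the length of an arc is the surface distance between head and dependent, and the denominator is taken over the gold arcs (the exact choice of denominator is an assumption here).

```python
def arc_length_score(predicted, gold, labeled=True):
    """Weighted arc length score (sketch); parses are lists of (dep, head, label)."""
    pred = {(d, h, l) if labeled else (d, h) for d, h, l in predicted}
    correct_len, total_len = 0, 0
    for d, h, l in gold:
        length = abs(h - d)          # surface distance between head and dependent
        total_len += length
        key = (d, h, l) if labeled else (d, h)
        if key in pred:              # arc recovered (with label if labeled=True)
            correct_len += length
    return correct_len / total_len if total_len else 0.0
```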
Experiment 3
• Results with transition parser
Setup                 LAS     UAS     ALS
Baseline              88.64   91.64   82.96
Unlabeled aug. loss   88.74   91.91   83.65
Labeled aug. loss     88.84   91.91   83.46
• Small improvement, likely due to the fact that ALS is similar to LAS and UAS
Conclusions
• Possible to train tools for particular downstream tasks
– Might not want to use the same parses for MT as for
information extraction
• Can leverage cheap(er) data to improve task
performance
– Japanese translations/word orderings for MT
– Main verb identification instead of dependency parses for
domain adaptation
• Not necessarily easy to define the task or a good
extrinsic evaluation metric
– MT quality reduced to a word reordering score
– METEOR-based metric