DKRLM:
Discriminant Knowledge-Rich
Language Modeling for Machine
Translation
Alon Lavie
“Visionary Talk”
LTI Faculty Retreat
May 4, 2007
Background: Search-based MT
• All state-of-the-art MT approaches work within
a general search-based paradigm
– Translation Models “propose” pieces of translation for
various sub-sentential segments
– Decoder puts these pieces together into complete
translation hypotheses and searches for the best
scoring hypothesis
• (Target) Language Modeling is the most
dominant source of information in scoring
alternative translation hypotheses
The Problem
• Most MT systems use standard statistical LMs that come
from speech recognition (SR), usually “as is”
– SRI-LM toolkit, CMU/CU LM, SALM toolkit
– Until recently, usually trigram models
• The Problem: these LMs are not good at discriminating
between good and bad translations!
• How do we know?
– Oracle experiments on n-best lists of MT output
consistently show that far better translations are “hiding”
in the n-best lists but are not being selected by our MT
systems
– Also true of our MEMT system… which led me to start
thinking about this problem!
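An oracle experiment of this kind can be sketched as follows. This is a minimal illustration, not the setup used in the talk: the n-best list, the `lm_score` values and the per-sentence `quality` scores (standing in for a metric computed against references) are all made up.

```python
# Toy n-best list; the texts and scores below are invented for illustration.
nbest = [
    {"text": "boy the ate apple red the", "lm_score": -2.1, "quality": 0.20},
    {"text": "the boy ate the red apple", "lm_score": -2.4, "quality": 0.90},
    {"text": "the boy eat red apple",     "lm_score": -2.6, "quality": 0.55},
]

def select(hypotheses, key):
    """Pick the hypothesis with the highest value under the given key."""
    return max(hypotheses, key=lambda hyp: hyp[key])

system_choice = select(nbest, "lm_score")  # what the system would output
oracle_choice = select(nbest, "quality")   # best hypothesis present in the list

# The oracle gap measures how much quality is "hiding" in the n-best list.
oracle_gap = oracle_choice["quality"] - system_choice["quality"]
```

A large gap across a test set suggests the scoring function, rather than the hypothesis space, is the bottleneck.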
The Problem
• Why do standard statistical LMs not work well for MT?
– MT hypotheses are very different from SR hypotheses
• Speech: mostly correct word-order, confusable homonyms
• MT: garbled syntax and word-order, wrong choices for some
translated words
– MT violates some basic underlying assumptions of
statistical LMs:
• Indirect Discrimination: better translations should have better
LM scores, but LMs are not trained to directly discriminate
between good and bad translations!
• Fundamental Probability Estimation Problems: Backoff
“Smoothing” for unseen n-grams is based on an assumption
of training data sparsity, but the majority of n-grams in MT
hypotheses have not been seen because they are not
grammatical (they really should have a zero probability!)
The New Idea
• Rather than attempting to model the probabilities of
unseen n-grams, we look at the problem differently:
– Extract instances of lexical, syntactic and semantic
features from each translation hypothesis
– Determine whether these instances have been “seen
before” (at least once) in a large monolingual corpus
• The Conjecture: more grammatical MT hypotheses are
likely to contain higher proportions of feature instances
that have been seen in a corpus of grammatical
sentences.
• Goals:
– Find the set of features that provides the best
discrimination between good and bad translations
– Learn how to combine these into a LM-like function for
scoring alternative MT hypotheses
Outline
• Knowledge-Rich Features
• Preliminary Experiments:
– Compare feature occurrence statistics for MT
hypotheses versus human-produced (reference)
translations
– Compare ranking of MT and “human” systems
according to statistical LMs versus a function based
on long n-gram occurrence statistics
– Compare n-grams and n-chains as features for
binary classification “human versus MT”
• Research Challenges
• New Connections with IR
Knowledge-Rich Features
• Lexical Features:
– “long” n-gram sequences (4 words and up)
• Syntactic/Semantic Features:
– POS n-grams
– Head-word Chains
– Specific types of dependencies:
• Verbs and their dependents
• Nouns and their dependents
• “long-range” dependencies
– Content word co-occurrence statistics
• Mixtures of Lexical and Syntactic Features:
– Abstracted versions of word n-gram sequences, where
words are replaced by POS tags or Named-entity tags
Head-Word Chains (n-chains)
Example sentence (dependency tree figure): The boy ate the red apple
• Head-word Chains are chains of syntactic dependency
links (from dependents to their heads)
• Bi-chains: [the→boy] [boy→ate] [the→apple]
[red→apple] [apple→ate]
• Tri-chains: [the→boy→ate] [the→apple→ate]
[red→apple→ate]
• Four-chains: none (for this example)!
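A minimal sketch of how such chains could be enumerated, assuming the dependency parse is available as a list of head indices (a hypothetical representation chosen for illustration, with -1 marking the root):

```python
def n_chains(words, heads, n):
    """Return all head-word chains of exactly n words.

    words: list of tokens.
    heads: heads[i] is the index of the head of words[i], or -1 for the root.
    Each chain starts at a dependent and follows links up toward the root.
    """
    chains = []
    for start in range(len(words)):
        chain, node = [words[start]], start
        while len(chain) < n and heads[node] != -1:
            node = heads[node]
            chain.append(words[node])
        if len(chain) == n:
            chains.append(tuple(chain))
    return chains

# "The boy ate the red apple" with "ate" as the root.
words = ["the", "boy", "ate", "the", "red", "apple"]
heads = [1, 2, -1, 5, 5, 2]

bi_chains = n_chains(words, heads, 2)
tri_chains = n_chains(words, heads, 3)
four_chains = n_chains(words, heads, 4)
```

On this parse the function yields five bi-chains, three tri-chains and no four-chains, since no word is more than two links away from the root.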
Specific Types of Dependencies
• Some types of syntactic dependencies may be
more important than others for MT
• Consider specific types of dependencies that
are most important for syntactic and semantic
structure:
– Dependencies involving content words
– Long-distance dependencies
– Verb/argument dependencies: focus only on the bi-chains where the head is the verb: [boy→ate] and
[apple→ate]
– Noun/modifier dependencies: focus only on the bi-chains where the noun is the head: [the→boy]
[the→apple] [red→apple]
Feature Occurrence Statistics
for MT Hypotheses
• The general Idea: determine the fraction of feature
instances that have been observed to occur in a large
human-produced corpus
• For n-grams:
– Extract all n-gram sequences of order n from the
hypothesis
– Look-up whether each n-gram instance occurs in the
corpus
– Calculate fractions of “found” n-grams for each order n
• For n-chains:
– Parse the MT hypothesis (into dependency structure)
– Look-up whether each n-chain instance occurs in a
database of n-chains extracted from the large corpus
– Calculate fractions of “found” n-chains for each order n
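The n-gram procedure can be sketched in a few lines. Here an in-memory set of n-grams stands in for the corpus index (the talk uses a suffix array, which this sketch does not reproduce), and the toy corpus and hypothesis are invented.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hit_fractions(hypothesis, corpus_sentences, max_order):
    """Fraction of the hypothesis n-grams seen in the corpus, per order n."""
    seen = {n: set() for n in range(1, max_order + 1)}
    for sentence in corpus_sentences:
        for n in seen:
            seen[n].update(ngrams(sentence, n))
    fractions = {}
    for n in range(1, max_order + 1):
        grams = ngrams(hypothesis, n)
        if grams:  # skip orders longer than the hypothesis
            fractions[n] = sum(g in seen[n] for g in grams) / len(grams)
    return fractions

corpus = [["the", "boy", "ate", "the", "red", "apple"]]
hypothesis = ["the", "boy", "ate", "red", "the", "apple"]
fractions = hit_fractions(hypothesis, corpus, max_order=3)
```

The scrambled hypothesis keeps a perfect unigram fraction while the bigram and trigram fractions drop. The same function applies to n-chains if `ngrams` is replaced by a chain extractor run over parsed text.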
Content-word Co-occurrence
Statistics
• Content-word co-occurrences: (unordered) pairs of content
words (nouns, verbs, adjectives, adverbs) that co-occur in the
same sentence
• Restricted version: subset of co-occurrences that are in a
direct syntactic dependency within the sentence (subset of bi-chains)
• Idea:
– Learn co-occurrence pair strengths from large monolingual
corpora using statistical measures: DICE, t-score, chi-square,
likelihood ratio
– Use average co-occurrence pair strength as a feature for scoring
MT hypotheses
– Weak way of capturing the syntax/semantics within sentences
• Preliminary experiments show that these features are
somewhat effective in discriminating between MT output and
human references
• Thanks Ben Han! [MT Lab Project, 2005]
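As one possible instance, the sketch below uses the Dice coefficient, 2·c(x,y) / (c(x) + c(y)), with counts taken as the number of sentences containing a word or pair. It assumes sentences have already been reduced to their content words; the toy data is invented.

```python
from collections import Counter
from itertools import combinations

def dice_table(sentences):
    """Dice strength for every unordered content-word pair in the corpus."""
    word_count, pair_count = Counter(), Counter()
    for sentence in sentences:
        types = sorted(set(sentence))  # sorted so each pair has one canonical key
        word_count.update(types)
        pair_count.update(combinations(types, 2))
    return {
        pair: 2 * count / (word_count[pair[0]] + word_count[pair[1]])
        for pair, count in pair_count.items()
    }

def average_strength(content_words, table):
    """Average pair strength over all pairs in a hypothesis (unseen pairs count 0)."""
    pairs = list(combinations(sorted(set(content_words)), 2))
    if not pairs:
        return 0.0
    return sum(table.get(pair, 0.0) for pair in pairs) / len(pairs)

corpus = [["boy", "ate", "apple"], ["boy", "ate", "bread"], ["girl", "read", "book"]]
table = dice_table(corpus)
plausible = average_strength(["boy", "ate"], table)
implausible = average_strength(["boy", "book"], table)
```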
Preliminary Experiments I
• Goal: compare n-gram occurrence statistics for MT hypotheses
versus human-produced (reference) translations
• Setup:
– Data: NIST Arabic-to-English MT-Eval 2003 (about 1000
sentences)
– Output from three strong MT systems and four reference
translations
– Used Suffix-Array LM toolkit [Zhang and Vogel 2006] modified to
return for each string call the length of the longest suffix of the
string that occurs in the corpus
– SALM used to index a subset of 600 million words from the
Gigaword corpus
– Searched for all n-gram sequences of length eight extracted from
the translation
• Thanks to Greg Hanneman!
Preliminary Experiments I
Order     MT Translations   Reference Translations   Ref/MT Ratio   Margin
8-grams   2.1%              2.9%                     1.38           +38%
7-grams   4.9%              6.4%                     1.31           +31%
6-grams   11.4%             14.1%                    1.24           +24%
5-grams   25.2%             29.1%                    1.15           +15%
4-grams   48.4%             52.2%                    1.08           +8%
3-grams   75.9%             77.7%                    1.02           +2%
2-grams   94.8%             94.4%                    0.995          -0.5%
1-grams   99.3%             98.2%                    0.989          -1.1%
Preliminary Experiments II
• Goal: Compare ranking of MT and “human” systems according
to statistical LMs versus a function based on long n-gram
occurrence statistics
• Same data setup as in the first experiment
• Calculate sentence scores as average per-word LM score
• System score is average over all its sentence scores
• Score each system with three different LMs:
– SRI-LM trigram LM trained on 260 million words
– SALM suffix-array LM trained on 600 million words
– A new function that assigns exponentially more weight to longer
n-gram “hits”:
1 n ( ord ( i ) 8)
score   3
n i 1
Preliminary Experiments II
System        SRI-LM trigram LM    SALM 8-gram LM     Occurrence-based Exp score
              Score     Rank       Score     Rank     Score       Rank
Ref ahe       -2.23     1          -5.59     1        0.01059     1
Ref ahi       -2.28     4          -5.87     4        0.00957     2
Ref ahd       -2.31     5          -5.99     5        0.00926     3
Ref ahg       -2.33     6          -6.04     7        0.00914     4
MT system 1   -2.27     3          -5.77     3        0.00895     5
MT system 2   -2.24     2          -5.75     2        0.00855     6
MT system 3   -2.39     7          -6.01     6        0.00719     7
Preliminary Experiments III
• Goal: Directly discriminate between MT and human
translations using a binary SVM classifier trained on n-gram
versus n-chain occurrence statistics
• Setup:
– Data: NIST Chinese-to-English MT-Eval 2003 (919 sentences)
– Four MT system outputs and four human reference translations
– N-chain database created using SALM by extracting all n-chains
from a dependency-parsed version of the English Europarl corpus
(600K sentences)
– Train SVM classifier on 400 sentences from two MT systems and
two human “systems”
– Test classification accuracy on 200 unseen test sentences from the
same MT and human systems
– Features for SVM: n-gram “hit” fractions (all n) vs. n-chain
fractions
• Thanks to Vamshi Ambati
Preliminary Experiments III
• Results:
– Experiment 1:
• N-gram classifier: 49% accuracy
• N-chain classifier: 69% accuracy
– Experiment 2:
• N-gram classifier: 52% accuracy
• N-chain classifier: 63% accuracy
• Observations:
– Mixing both n-gram and n-chains did not improve
classification accuracy
– Features include both high and low-order instances
(did not try with only high-order ones)
– N-chain database is from different domain than test
data, and not a very large corpus
Preliminary Conclusions
• Statistical LMs do not discriminate well
between MT hypotheses and human reference
translations → also poor in discriminating
between good and bad MT hypotheses
• Long n-grams and n-chains occurrence
statistics differ significantly between MT
hypotheses and human reference translations
• Can potentially be useful as discriminant
features for identifying better (more
grammatical and fluent) translations
Research Challenges
• Develop Infrastructure for Computing with Knowledge-Rich
Features
– Scale up to querying against much larger monolingual corpora
(terabytes and up)
– Parsing and annotation of such vast corpora
• Explore more complex features
• Finding the set of features that are most discriminant
• Develop Methodologies for training LM-like discriminant
scoring functions:
– SVM and/or other classifiers on MT versus human
– SVM and/or other classifiers on MT versus MT “Oracle”
– Direct regression against human judgments
– Parameter optimization for maximizing automatic MT metric
scores (BLEU, METEOR, etc.)
• “Incremental” features that can be used during decoding
versus full set of features for n-best list reranking
New Connections with IR
• The “occurrence-based” formulation of the LM
problem transforms it from a counting and
estimation problem to an IR-like querying
problem:
– To be effective, we think this may require querying
against extremely large volumes of monolingual text,
and structured versions of such text → can we do
this against local snapshots of the entire web?
– SALM suffix-array infrastructure can currently handle
up to about the size of the Gigaword corpus (within
16GB memory)
– Can IR engines such as LEMUR/Indri be adapted to
the task?
New Connections with IR
• Challenges this type of task imposes on IR
(insights from Jamie Callan):
– The larger issue: IR search engines as query
interfaces to vast collections of structured text:
• Building an index suitable for very fast “n-gram”
lookups that satisfy certain properties.
• The n-gram sequences might be a mix of surface
features and derived features based on text
annotations, e.g., $PersonName, or POS=N
– Specific Challenges:
• How to build such indexes for fast access?
• What does the query language look like?
• How to deal with memory/disk vs. speed tradeoff
issues?
• Can we get LTI students to do this kind of research?
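One naive way to support lookups over mixed surface and derived features is to index every n-gram under all of its word/tag abstractions, as sketched below. This is only an illustration of the memory-versus-speed tradeoff: the index grows by a factor of 2^n per n-gram, and the `POS=` key format simply mirrors the notation above. Function and variable names are hypothetical.

```python
from itertools import product

def index_mixed_ngrams(tagged_sentences, n):
    """Index each n-gram under every mix of surface words and POS=tag symbols.

    tagged_sentences: lists of (word, tag) pairs from an assumed tagger.
    Returns a set supporting constant-time membership queries.
    """
    index = set()
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            # Each position may be queried by its word or by its tag.
            options = [(word, "POS=" + tag) for word, tag in window]
            index.update(product(*options))
    return index

corpus = [[("the", "DET"), ("boy", "N"), ("ate", "V")]]
index = index_mixed_ngrams(corpus, n=2)

found_mixed = ("the", "POS=N") in index      # surface word followed by any noun
found_tags = ("POS=N", "POS=V") in index     # purely abstracted bigram
not_found = ("a", "POS=N") in index
```

Queries are fast set lookups, but the exponential blow-up makes this impractical for long n-grams or several annotation layers, which is where a purpose-built index would be needed.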
Final Words…
• Novel and exciting new research direction → there are
at least one or two PhD theses hiding in here…
• Submitted as a grant proposal to NSF last December
(jointly with Rebecca Hwa from Pitt)
• Influences: Some of these ideas were influenced by
Jaime’s CBMT work, and by Rebecca’s work on using
syntactic features for automatic MT evaluation metrics
• Acknowledgments:
– Thanks to Joy Zhang and Stephan Vogel for making the
SALM toolkit available to us
– Thanks to Rebecca Hwa and to my students Ben Han, Greg
Hanneman and Vamshi Ambati for preliminary work on
these ideas.