
Parsing Across Languages
CMSC 35100
Natural Language Processing
February 2, 2006
Roadmap
• Motivation:
– Resources & constraints
• Bootstrapping parsers by projection
– Leveraging resource rich languages
• Inversion Transduction Grammars
– Parsing jointly with different word order
• Conclusions
Two Languages are Better Than One
• Exploiting other languages
– Resources:
• Data-driven techniques need (lots of) annotated data
• Leverage data and annotations from a resource-rich
language to support a resource-poor language
– Constraints:
• Languages encode information differently
– Syntax, semantics
• Ambiguity in one language not necessarily in other
Classic Example
• Word Sense Disambiguation
– Dagan & Itai (1992), Diab (2004), etc
• Observation:
– Polysemous word in one language
• Different translation correspondence in other
• Approach:
– Given translated sentence pair w/ambiguous word
• Assign the sense associated with the unambiguous target translation (sketched below)
• Also POS tagging, shallow parsing, gender
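A toy sketch of this idea in Python; the sense inventory keyed by Spanish translations and the example word pair are illustrative assumptions, not taken from the cited papers:

```python
# Toy sketch of sense assignment via translation correspondence; the
# sense table below is an illustrative assumption.

SENSE_BY_TRANSLATION = {
    ("bank", "banco"): "bank (financial institution)",
    ("bank", "orilla"): "bank (river side)",
}

def disambiguate(ambiguous_word, aligned_translation):
    """Return the sense implied by the aligned, unambiguous translation."""
    return SENSE_BY_TRANSLATION.get((ambiguous_word, aligned_translation), "unknown")

# A word alignment linking English "bank" to Spanish "orilla" picks the sense.
print(disambiguate("bank", "orilla"))    # -> bank (river side)
```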
Roadmap
• Bootstrapping by parse projection
– Resource rich, resource poor languages
– Dependency projection
• Direct correspondence
• Post-projection transformation
– Learning from projected parses
• Spanish, Chinese
• Conclusions
Motivation: Resources
• Parsers trained on large treebanks do well
• Goal:
– Train parser for a language with no treebank
• Few languages have treebanks, and most are small
• Problem: Resource bottleneck
• Question:
– Leverage resources from resource-rich language?
• English has treebanks, parsers, etc
Approach
• Available resources:
– English treebank, parsers, parallel text
– Statistical machine translation system
• Train English parser on treebank
• Align parallel text
• Parse English side of parallel text
• “Project” parse structure to FL using alignments
• Train FL parser on (filtered) set of projected parses (pipeline sketched below)
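A sketch of the pipeline's control flow; the heavy components (trainer, aligner, parser, projector, filter) are passed in as callables, and their names and signatures are illustrative assumptions rather than any particular toolkit's API:

```python
# Control flow of the bootstrapping-by-projection pipeline. The parser
# trainer, word aligner, projection, and filter are supplied as callables;
# their names/signatures here are assumptions for illustration only.

def bootstrap_fl_parser(en_treebank, bitext, train, align, parse, project, keep):
    """bitext: list of (english_sentence, foreign_sentence) pairs."""
    en_parser = train(en_treebank)                 # 1. train English parser on treebank
    projected_treebank = []
    for en_sent, fl_sent in bitext:
        links = align(en_sent, fl_sent)            # 2. word-align the sentence pair
        en_tree = parse(en_parser, en_sent)        # 3. parse the English side
        fl_tree = project(en_tree, links, fl_sent) # 4. project structure via alignment
        if keep(fl_tree, links):                   # 5. discard noisy projections
            projected_treebank.append(fl_tree)
    return train(projected_treebank)               # 6. train the FL parser
```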
Key Questions
• Is projection remotely plausible?
– Can complex structure be mapped across?
• If projection is noisy, is it still useful?
– Can one filter parses to minimize noise?
– Can large noisy treebanks compete with small
clean sets?
Parse Representation
• Dependency parses
• Advantages
– Rich syntactic representation
– Models long-distance dependencies locally
• Subj, obj, dat associated w/verb
– Regardless of intervening structure
– Syntactic dependencies capture semantics
• More useful than bracketing alone
• Identifies actor, action, etc
– Insensitive to word order
• Separates surface precedence from syntactic dominance (see the example below)
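As an example, a minimal dependency representation in Python; the sentence, relation labels, and dataclass layout are illustrative, not a standard format:

```python
# A minimal dependency-parse representation: each token records its head
# and relation, so subj/obj attach directly to the verb regardless of
# intervening material.

from dataclasses import dataclass

@dataclass
class Token:
    idx: int     # position in the sentence (0-based)
    form: str
    head: int    # index of the governing token, -1 for the root
    rel: str     # dependency relation label

# "The committee yesterday approved the proposal"
sent = [
    Token(0, "The",       1, "det"),
    Token(1, "committee", 3, "subj"),
    Token(2, "yesterday", 3, "tmp"),
    Token(3, "approved", -1, "root"),
    Token(4, "the",       5, "det"),
    Token(5, "proposal",  3, "obj"),
]

# subj and obj link directly to the verb even though "yesterday" intervenes.
print([(t.form, sent[t.head].form if t.head >= 0 else "ROOT", t.rel) for t in sent])
```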
Syntactic Projection
• Direct Correspondence Assumption:
• Given sentences E, F that are literal translations with syntactic structures
Tree-E and Tree-F, if words x-E and y-E of Tree-E are aligned with x-F and y-F of Tree-F,
• then if relation R(x-E, y-E) holds in Tree-E, R(x-F, y-F) holds in Tree-F
• Essentially assumes a homomorphism between the two syntactic structures
– Implicit in many models
• Allows a simple projection procedure driven by the word alignment (sketched below)
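A minimal sketch of that procedure for the simplest case, a one-to-one word alignment over dependency triples; real projection must also handle unaligned and many-to-one words, which this sketch simply drops:

```python
# Sketch of dependency projection under the Direct Correspondence
# Assumption, for one-to-one alignments only; unaligned and many-to-one
# words (handled by the full method) are dropped here.

def project_dependencies(en_deps, alignment):
    """en_deps: set of (head_idx, dep_idx, label) over English tokens.
    alignment: dict mapping English token index -> foreign token index."""
    fl_deps = set()
    for head, dep, label in en_deps:
        if head in alignment and dep in alignment:   # both ends are aligned
            fl_deps.add((alignment[head], alignment[dep], label))
    return fl_deps

# English "John eats apples" -> Spanish "Juan come manzanas"
en_deps = {(1, 0, "subj"), (1, 2, "obj")}            # eats -> John, eats -> apples
alignment = {0: 0, 1: 1, 2: 2}
print(project_dependencies(en_deps, alignment))      # same relations over Spanish indices
```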
Initial Assessment
• Ideal data, pure projection
– Manually constructed alignments
• English-Spanish, English-Chinese
– Manually constructed parses
• English
• Project parse structure onto FL sentence
• Results:
– Errorful: Eng-Spanish: 37%, Eng-Ch: 38%
• Unlabeled dependency F-score
Post-Projection Transformation
• Error analysis:
– Projection from English alone is not sufficient
– Monolingual information also needed
• E.g. Chinese aspectual markers, Spanish clitics
– Commonly follow verbs, no English correlate
– Only English relations are projected, so these words go unattached
• Solution:
– Small set of hand-coded correction rules
• Restricted to closed-class words, POS tags, lexical categories
– E.g., an aspectual marker modifies the verb to its left (rule sketched below)
• Results: Eng-Spanish: 70%, Eng-Ch: 67%
– On small ideal dataset
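A sketch of one such correction rule; the POS tag names ("ASP", "VV") and the rule body are illustrative, not the authors' actual rule set:

```python
# Sketch of one hand-coded post-projection rule: attach an unattached
# aspect marker to the nearest verb on its left.

def attach_aspect_markers(pos_tags, deps):
    """pos_tags: list of POS tags, one per token.
    deps: set of (head_idx, dep_idx, label) triples. Returns an augmented set."""
    attached = {dep for _, dep, _ in deps}
    new_deps = set(deps)
    for i, tag in enumerate(pos_tags):
        if tag == "ASP" and i not in attached:       # unattached aspect marker
            for j in range(i - 1, -1, -1):           # scan leftward for a verb
                if pos_tags[j] == "VV":
                    new_deps.add((j, i, "asp"))
                    break
    return new_deps

# "他 吃 了 苹果" (he / eat / ASP / apple): the marker 了 attaches to the verb 吃.
tags = ["PN", "VV", "ASP", "NN"]
deps = {(1, 0, "subj"), (1, 3, "obj")}               # relations projected from English
print(attach_aspect_markers(tags, deps))             # adds (1, 2, 'asp')
```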
Back to Reality:
Training on Noisy Data
• Training data construction
– Collins CFG parser trained on the Penn Treebank
• Converted to dependency parses (conversion sketched below)
– Alignment of bilingual corpus using IBM models
– Projection
– Post-projection transformation
– Parse filtering
• Train dependency parser on projections
• Evaluate parses from dep. parser
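A minimal sketch of the conversion-to-dependency step using head-percolation rules; the rule table and tree encoding below are toy assumptions, not the actual Collins/Penn head rules:

```python
# Sketch of constituency-to-dependency conversion via head-percolation rules.

HEAD_RULES = {                 # for each phrase label, preferred head-child labels
    "S": ["VP"],
    "VP": ["VBD", "VBZ", "VB"],
    "NP": ["NN", "NNS"],
}

def find_head(label, children):
    """children: list of (child_label, head_token_index) pairs."""
    for preferred in HEAD_RULES.get(label, []):
        for child_label, head_idx in children:
            if child_label == preferred:
                return head_idx
    return children[-1][1]                    # default: rightmost child's head

def to_dependencies(tree, deps):
    """tree: (label, children) where a leaf is (pos_tag, token_index).
    Fills deps with (head, dependent) pairs; returns the tree's head index."""
    label, children = tree
    if isinstance(children, int):             # leaf: POS tag over a token index
        return children
    child_heads = [(c[0], to_dependencies(c, deps)) for c in children]
    head = find_head(label, child_heads)
    for _, h in child_heads:
        if h != head:
            deps.append((head, h))            # non-head children depend on the head
    return head

# (S (NP (NN 0)) (VP (VBD 1) (NP (NN 2))))  ~  "John saw Mary"
tree = ("S", [("NP", [("NN", 0)]), ("VP", [("VBD", 1), ("NP", [("NN", 2)])])])
deps = []
to_dependencies(tree, deps)
print(sorted(deps))   # [(1, 0), (1, 2)] -> saw -> John, saw -> Mary
```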
Experiment I: Spanish
• Corpus: 100,000 parallel English-Spanish sentence pairs
– Bible, FBIS, UN Parallel Corpus
– 200 sentences: 50% dev, 50% test
• Gold standard: COTS (commercial off-the-shelf) parser, manual conversion
• Baselines:
– Every word modifies the word to its left, with/without post-projection transformation
• Filtering:
– Exclude a pair if >20% of English words are unaligned to Spanish, >30% in the
reverse direction, or more than 4 Spanish words align to one English word
(filter sketched below)
– ~1/5 of the sentences are retained
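A sketch of this filter; the thresholds (20%, 30%, 4) follow the slide, while the alignment representation is an assumption:

```python
# Sketch of the alignment-based sentence-pair filter.

from collections import Counter

def keep_pair(en_len, es_len, links):
    """links: set of (english_index, spanish_index) alignment links."""
    aligned_en = {e for e, _ in links}
    aligned_es = {s for _, s in links}
    if (en_len - len(aligned_en)) / en_len > 0.20:   # too many unaligned English words
        return False
    if (es_len - len(aligned_es)) / es_len > 0.30:   # too many unaligned Spanish words
        return False
    fanout = Counter(e for e, _ in links)            # Spanish words per English word
    if fanout and max(fanout.values()) > 4:          # one English word spread too widely
        return False
    return True

print(keep_pair(5, 6, {(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)}))   # True
```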
Results
• Baseline: 33.8%
• Stat Parser (corpus, no filtering, 98K sent): 67.3%
• Stat Parser (corpus, w/ filtering, 20K sent): 72.1%
• COTS parser: 69.2%
A Harder Test: English-Chinese
• Corpus: 240,000 FBIS sentences
• Challenge: Alignment error: 41.8 vs. 24.4
• Dev/Test: Chinese treebank (< 40 words)
• Additional filters:
– Unattached words, crossed dependencies, missing POS (crossing check sketched below)
– ~50K sentences remain after filtering
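A sketch of the crossed-dependency (projectivity) check, assuming simple (head, dependent) index pairs:

```python
# Reject a projected parse if any two dependency arcs cross
# (i.e., the projected tree is non-projective).

def has_crossing(deps):
    """deps: iterable of (head_idx, dep_idx) pairs over token positions."""
    arcs = [tuple(sorted(d)) for d in deps]
    for i, (a1, b1) in enumerate(arcs):
        for a2, b2 in arcs[i + 1:]:
            # two arcs cross if exactly one endpoint of one arc lies
            # strictly inside the span of the other
            if (a1 < a2 < b1 < b2) or (a2 < a1 < b2 < b1):
                return True
    return False

print(has_crossing([(1, 0), (1, 3), (2, 4)]))   # True: arcs (1,3) and (2,4) cross
```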
Results
• Baseline: 35.1%
• Stat Parser (FBIS, w/ filtering, 50K sent): 53.9%
• Stat Parser (Chinese Treebank, 10K sent): 64.3%
• Comparable to training on a 2K-sentence Chinese treebank
Conclusion
• Cross-language dependency projection
– Constructs target language treebank
– Noisy, but still useful for bootstrapping
– Otherwise, building a 20K-40K sentence treebank takes 4-7 years
• Combination of structural projection
– With language-specific transformations
• Combine w/clean micro-treebank?
Inversion Transduction Grammars
• Motivation:
– Joint analysis of translation parallel content
• ITGs
– Definition
– Overview of Parsing Algorithm
– Adding probabilities
– Bracketing
• Conclusions
Motivation
• Rich analysis of parallel sentences
– Joint lexical constraints support
• Bracketing (shallow analysis)
• Word alignment
• Segmentation
– Across two languages simultaneously
• Useful for machine translation, segmenting
– Enables parsing of less resource-rich language
Approach
• Goal: Joint structural analysis
– Trivial with transductive grammar, IF
• Two languages have SAME syntactic structure
• Problem:
– Differences in word order, esp. head, spec
• Insight:
– Usually comparable constituents, reordered
• Approach: ITGs
– Allow reversals of constituent order
– Productions are “straight” (same order) or “inverted” (reversed order)
Grammars, Probability & Parsing
• All ITG grammars can be written in normal form
– Analogous to CNF
– Each production rewrites as two non-terminals, either straight or inverted
• Or as a terminal pair (either terminal possibly empty)
• Stochastic form
– Associates probability with each rewrite
• Epsilon rules get small probability
• Parsing:
– Extension of probabilistic CYK to handle inversion and transduction
(bracketing version sketched below)
– O(N^3 V^3 T^3), where N = # nonterminals and V, T = lengths of the two inputs
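A Viterbi sketch of the bracketing special case (single nonterminal, toy rule and lexical probabilities, epsilon rules omitted); it illustrates the straight/inverted recursion over bispans rather than reproducing Wu's full algorithm:

```python
# Viterbi sketch of bracketing with a single-nonterminal stochastic ITG,
# following the shape of the dynamic program over bispans: each bispan is
# built either "straight" (same order in both languages) or "inverted"
# (reversed order). Probabilities are toy values; epsilon rules are omitted.

import math
from functools import lru_cache

def itg_bracket(e_sent, f_sent, lex_prob,
                p_straight=0.4, p_inverted=0.4, p_lex=0.2):
    """Return (log-probability, bracketing) of the best joint analysis.
    lex_prob(e_word, f_word) -> translation probability (> 0)."""

    @lru_cache(maxsize=None)
    def best(s, t, u, v):
        """Best analysis of English span [s, t) paired with foreign span [u, v)."""
        if t - s == 1 and v - u == 1:                  # single aligned word pair
            return (math.log(p_lex * lex_prob(e_sent[s], f_sent[u])),
                    (e_sent[s], f_sent[u]))
        best_score, best_tree = float("-inf"), None
        for S in range(s + 1, t):                      # split the English span
            for U in range(u + 1, v):                  # split the foreign span
                # straight: left English piece pairs with the left foreign piece
                l, r = best(s, S, u, U), best(S, t, U, v)
                score = math.log(p_straight) + l[0] + r[0]
                if score > best_score:
                    best_score, best_tree = score, ("[]", l[1], r[1])
                # inverted: left English piece pairs with the *right* foreign piece
                l, r = best(s, S, U, v), best(S, t, u, U)
                score = math.log(p_inverted) + l[0] + r[0]
                if score > best_score:
                    best_score, best_tree = score, ("<>", l[1], r[1])
        return best_score, best_tree

    return best(0, len(e_sent), 0, len(f_sent))

# Toy lexicon for "John eats apples" / "Juan manzanas come": the object and
# verb are swapped, so the best analysis inverts that constituent.
table = {("John", "Juan"): 0.9, ("eats", "come"): 0.9, ("apples", "manzanas"): 0.9}
score, tree = itg_bracket(["John", "eats", "apples"],
                          ["Juan", "manzanas", "come"],
                          lambda e, f: table.get((e, f), 1e-4))
print(tree)   # ('[]', ('John', 'Juan'), ('<>', ('eats', 'come'), ('apples', 'manzanas')))
```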
Bracketing
• Shallow parsing
– Groups constituents – without labels
• Simplified parsing algorithm
– Lexical probabilities from automatic induction
– Bias for left/right attachment
Results
• Achieves 78-80% bracket accuracy
• Most errors attributable to lexicon
– 16% error
• Potential improvement for better lexicon
Discussion
• Exploiting multilingual constraints and
multilingual resources
– Overcome gaps in linguistic annotations
• Shared perspective:
– Direct match between languages at some level of
representation: constituents, dependencies
• Shared issues:
– Requirements for monolingual knowledge
• Attachment bias, non-shared categories, etc