Parallel Reverse Treebanks for
the Discovery of MorphoSyntactic Markings
Lori Levin
Robert Frederking
Alison Alvarez
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Jeff Good
Department of Linguistics
Max Planck Institute for Evolutionary Anthropology
Reverse Treebank (RTB)
• What?
– Create the syntactic structures first
– Then add sentences
• Why?
– To elicit data from speakers of less commonly taught languages:
• Decide what meaning we want to elicit
• Represent the meaning in a feature structure
• Add an English or Spanish sentence (plus context notes) to express the meaning
• Ask the informant to translate it
Bengali Example
srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable) (np-specificity specific)
         (np-biological-gender bio-gender-n/a) (np-animacy anim-inanimate)
         (np-person person-third) (np-function fn-actor)
         (np-general-type common-noun-type) (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a) (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a) (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a) (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a) (c-causation-directness directness-n/a)
 (c-source source-neutral) (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral) (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive) (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral) (c-event-modality event-modality-none)
 (c-function fn-main-clause) (c-minor-type minor-n/a)
 (c-copula-type copula-n/a) (c-v-absolute-tense past)
 (c-power-relationship power-peer) (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
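The feature structures use a Lisp-style parenthesized notation. As a minimal sketch (not AVENUE code), a small Python reader for this notation might look like this; the function names are ours:

import re

def tokenize(text):
    # Parentheses and whitespace-separated symbols are the only tokens.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    # Read one s-expression from the front of the token list.
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return node
    return tok  # a bare symbol such as np-number or num-sg

fs = parse(tokenize("((np-number num-sg) (np-person person-third))"))
# fs == [['np-number', 'num-sg'], ['np-person', 'person-third']]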
Outline
Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
AVENUE Machine Translation System
A transfer rule contains:
• Type information
• Synchronous context-free rules
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
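To make the rule concrete, here is a toy rendering of its application in Python. The alignment pairs X1::Y1, X1::Y3, X2::Y4, X3::Y2 and the word orders come from the rule above; the three-word lexicon, the function name, and the direct dictionary lookup are invented for illustration, and the agreement constraints (which the real system checks by unification) are omitted:

# The English DET is aligned to both Hebrew determiner slots, so "the"
# surfaces twice, giving the DET N DET ADJ order of ha-ish ha-zaqen.
LEXICON = {"the": "ha-", "old": "zaqen", "man": "ish"}
ALIGNMENTS = [(1, 1), (1, 3), (2, 4), (3, 2)]  # (source slot, target slot)

def apply_np_rule(src_words):
    # src_words fills [DET ADJ N]; the output fills [DET N DET ADJ].
    tgt = [None] * 4
    for x, y in ALIGNMENTS:
        tgt[y - 1] = LEXICON[src_words[x - 1]]
    return tgt

print(apply_np_rule(["the", "old", "man"]))  # ['ha-', 'ish', 'ha-', 'zaqen']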
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
AVENUE
• Rules can be written by hand or learned
automatically.
• Hybrid
– Rule-based transfer
– Statistical decoder
– Multi-engine combinations with SMT and EBMT
AVENUE systems
(Small and experimental, but tested on unseen data)
• Hebrew-to-English
– Alon Lavie, Shuly Wintner, Katharina Probst
– Hand-written and automatically learned
– Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
– Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
– Automatically learned
– Performs better than SMT when training data is limited to 50K words
AVENUE systems
(Small and experimental, but tested on unseen data)
• English-to-Spanish
– Ariadna Font Llitjos
– Hand-written, automatically corrected
• Mapudungun-to-Spanish
– Roberto Aranovich and Christian Monson
– Hand-written
• Dutch-to-English
– Simon Zwarts
– Hand-written
Elicitation
• Get data from someone who is
– Bilingual
– Literate
– Not experienced with linguistics
English-Hindi Example
Elicitation Tool: Erik Peterson
English-Chinese Example
English-Arabic Example
Elicitation
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell
srcsent: Tú estás cayendo
tgtsent: eymi petu ütrünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
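Each elicitation item is a small record of key: value lines. A minimal sketch of a Python reader for this format, assuming records are separated by blank lines (the code and function names are ours, not the project's):

import re

def parse_items(text):
    # Split blank-line-separated records into dictionaries.
    items = []
    for block in text.strip().split("\n\n"):
        item = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            item[key.strip()] = value.strip()
        items.append(item)
    return items

def parse_alignment(spec):
    # "((1,1),(2 3,2 3))" -> [([1], [1]), ([2, 3], [2, 3])]
    return [([int(i) for i in src.split()], [int(i) for i in tgt.split()])
            for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", spec)]

For the second item above, parse_alignment recovers that Spanish words 2-3 ("estás cayendo") align to Mapudungun words 2-3 ("petu ütrünagimi").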
Outline
• Background
– The AVENUE Machine Translation System
Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
Size of RTB
• Around 3200 sentences
• 20K words
Languages
• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
– Thai
– Bengali
• Plans to translate into:
– Seven “strategic” languages per year for five years.
• As one small part of a language pack (BLARK) for each language.
Languages
• Feature structures are being reverse annotated in Spanish at New Mexico State University (Helmreich and Cowie)
– Plans to translate into Guarani
• Reverse annotation into Portuguese in Brazil (Marcello Modesto)
– Plans to translate into Karitiana
• 200 speakers
• Plans to translate into Inupiaq (Kaplan and MacLean)
Previous Elicitation Work
• Pilot corpus
– Around 900 sentences
– No feature structures
• Mapudungun
– Two partial translations
• Quechua
– Three translations
• Aymara
– Seven translations
• Hebrew
• Hindi
– Several translations
• Dutch
Sample: clause level

Example sentences:
• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.

Meanings covered:
• Tense, aspect, transitivity
• Questions, causation and permission
• Interaction of lexical and grammatical aspect
• Volitionality
• Embedded clauses and sequence of tense
• Relative clauses
• Phase aspect
Sample: noun phrase level

Example sentences:
• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.

Meanings covered:
• Temporal and locative meanings
• Quantifiers
• Numbers
• Combinations of different types of modifiers
– My book: possession, definiteness
– A book of mine: possession, indefiniteness
Example
srcsent: The large bus to the post office broke down.
((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable) (np-specificity specific)
         (np-biological-gender bio-gender-n/a) (np-animacy anim-inanimate)
         (np-person person-third) (np-function fn-actor)
         (np-general-type common-noun-type) (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a) (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a) (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a) (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a) (c-causation-directness directness-n/a)
 (c-source source-neutral) (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral) (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive) (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral) (c-event-modality event-modality-none)
 (c-function fn-main-clause) (c-minor-type minor-n/a)
 (c-copula-type copula-n/a) (c-v-absolute-tense past)
 (c-power-relationship power-peer) (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
Grammatical meanings vs syntactic categories
• Features and values are based on a collection of grammatical meanings
– Many of which are similar to the grammatemes of the Prague Treebanks
Grammatical Meanings
YES
• Semantic Roles
• Identifiability
• Specificity
• Time
– Before, after, or during time of speech
• Modality
NO
• Case
• Voice
• Determiners
• Auxiliary verbs
Grammatical Meanings
YES
• How is identifiability expressed?
– Determiner
– Word order
– Optional case marker
– Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?
NO
• How are English determiners translated?
– The boy cried.
– The lion is a fierce beast.
– I ate a sandwich.
– He is a soldier.
• Il est soldat. (French: no article before the predicate noun)
Argument Roles
• Actor
– Roughly, deep subject
• Undergoer
– Roughly, deep object
• Predicate and predicatee
– The woman is the manager.
• Recipient
– I gave a book to the students.
• Beneficiary
– I made a phone call for Sam.
Why not subject and object?
• Languages use their voice systems for different
purposes.
• Mapudungun obligatorily uses an inverse marked verb
when third person acts on first or second person.
– Verb agrees with undergoer
– Undergoer exhibits other subjecthood properties
– Actor may be object.
• Yes: How are actor and undergoer encoded in
combination with other semantic features like adversity
(Japanese) and person (Mapudungun)?
• No: How is English voice translated into another
language?
Argument Roles
• Accompaniment
– With someone
– With pleasure
• Material
– (out) of wood
• About 20 more roles
– From the Lingua checklist; Comrie & Smith (1977)
– Many also found in tectogrammatical representations
• Around 80 locative relations
– From Lingua checklist
• Many temporal relations
Noun Phrase Features
• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles
– Accompaniment, material, location, time, etc.
• Type
– Proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
– Not used yet because of limited context in the elicitation tool.
Clause level features
• Tense
• Aspect
– Lexical, grammatical, phase
• Type
– Declarative, open-q, yes-no-q
• Function
– Main, argument, adjunct, relative
• Source
– Hearsay, first-hand, sensory, assumed
• Assertedness
– Asserted, presupposed, wanted
• Modality
– Permission, obligation
– Internal, external
Other clause types
(Constructions)
• Causative
– Make/let/have someone do something
• Predication
– May be expressed with or without an overt copula.
• Existential
– There is a problem.
• Impersonal
– One doesn’t smoke in restaurants in the US.
• Lament
– If only I had read the paper.
• Conditional
• Comparative
• Etc.
Outline
• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
Tools for RTB creation
• Future work
– Evaluation
– Navigation
Tools for RTB Creation
• Change the inventory of grammatical meanings
• Make new RTBs for other purposes
The Process
Feature Specification: a list of semantic features and values (clause level, noun phrase, tense & aspect, modality, …)
→ Feature Maps: which combinations of features and values are of interest
→ Feature Structure Sets
→ Reverse Annotated Feature Structure Sets: add English sentences
→ The Corpus (Mar 1, 2006)
→ Sampling → Smaller Corpus
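A minimal sketch of the reverse-annotation and sampling steps in Python, assuming corpus items shaped like the elicitation records shown earlier (all names here are ours):

import random

def reverse_annotate(fs, srcsent, context=""):
    # Attach an English sentence (and optional context note) to a feature
    # structure, yielding one corpus item ready for an informant to translate.
    return {"srcsent": srcsent, "context": context, "tgtsent": "", "fs": fs}

def sample_corpus(items, k, seed=0):
    # Draw a smaller corpus when elicitation time with informants is scarce.
    return random.Random(seed).sample(items, k)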
Feature Specification
• XML Schema
• XSLT Script
• Human readable form
– Feature: Causer intentionality
• Values: intentional, unintentional
– Feature: Causee control
• Values: in control, not in control
– Feature: Causee volitionality
• Values: willing, unwilling
– Feature: Causation type
• Values: direct, indirect
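A sketch of how such a specification might look in XML, and of an XSLT-style rendering into the human-readable form above; the element names are guesses for illustration, not the actual schema:

import xml.etree.ElementTree as ET

# A guessed XML shape for two feature entries; the real schema may differ.
SPEC_XML = """
<feature-specification>
  <feature name="causer-intentionality">
    <value>intentional</value>
    <value>unintentional</value>
  </feature>
  <feature name="causee-control">
    <value>in-control</value>
    <value>not-in-control</value>
  </feature>
</feature-specification>
"""

def human_readable(xml_text):
    # Render the spec the way the slide shows it (Feature: ... / Values: ...).
    root = ET.fromstring(xml_text)
    for feat in root.findall("feature"):
        values = ", ".join(v.text for v in feat.findall("value"))
        print(f"Feature: {feat.get('name')}")
        print(f"  Values: {values}")

human_readable(SPEC_XML)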
Feature Combination
• Person and number interact with tense in many fusional languages.
• In English, tense interacts with questions:
– Will you go?
Feature Combination Template
((predicatee ((np-general-type pronoun-type common-noun-type)
              (np-person person-first person-second person-third)
              (np-number num-sg num-pl)
              (np-biological-gender bio-gender-male bio-gender-female)))
 {[(predicate ((np-general-type common-noun-type)
               (np-person person-third)))
   (c-copula-type role)]
  [(predicate ((adj-general-type quality-type)
               (c-copula-type attributive)))]
  [(predicate ((np-general-type common-noun-type)
               (np-person person-third)
               (c-copula-type identity)))]}
 (c-secondary-type secondary-copula) (c-polarity #all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))
Summarizes 288 feature structures, which are automatically generated.
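A minimal sketch of the expansion step in Python, treating each multi-valued feature as a set of alternatives and the {[...] [...]} block as a disjunction over predicate types; this toy representation is ours, not the project's template language:

from itertools import product

# Toy template: three multi-valued features and the three predicate branches
# from the slide, each reduced here to a single distinguishing feature.
FEATURES = {
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-pl"],
    "c-v-absolute-tense": ["past", "present", "future"],
}
BRANCHES = [{"c-copula-type": "role"},
            {"c-copula-type": "attributive"},
            {"c-copula-type": "identity"}]

def expand(features, branches):
    # Yield one fully specified feature structure per combination.
    names = list(features)
    for combo in product(*(features[n] for n in names)):
        for branch in branches:
            yield {**dict(zip(names, combo)), **branch}

print(sum(1 for _ in expand(FEATURES, BRANCHES)))  # 3 * 2 * 3 * 3 = 54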
Annotation Tool
• Feature structure viewer
– Various views of the feature structure
• Omit features whose value is not-applicable
• Group related features together
– Aspect
– Causation
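A minimal sketch of the first view in Python, assuming flat feature structures whose inapplicable values end in -n/a, as in the examples above (the function name is ours):

def compact_view(fs):
    # Hide features whose value is not applicable, so annotators see only
    # the meanings this sentence actually realizes.
    return {f: v for f, v in fs.items() if not v.endswith("-n/a")}

fs = {"c-v-absolute-tense": "past",
      "c-question-gap": "gap-n/a",
      "c-polarity": "polarity-positive"}
print(compact_view(fs))  # drops c-question-gap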
Outline
• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
Future work
– Evaluation
– Navigation
Evaluation
• Current funding has not covered evaluation of the RTB.
– Except for informal observations as it was translated into several languages.
• Does it elicit the meanings it was intended to elicit?
– Informal observation: usually
• Is it useful for machine translation?
Hard Problems
• Reverse annotating meanings that are not grammaticalized in English.
– Evidentiality:
• He stole the bread.
• Context: Translate this as if you do not have first-hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”
Hard Problems
• Reverse annotating things that can be said in several ways in English.
– Impersonals:
• One doesn’t smoke here.
• You don’t smoke here.
• They don’t smoke here.
• Credit cards aren’t accepted.
– Problem in the Reflex corpus because space was limited.
Navigation
• Currently, feature combinations are specified by a human.
• Plan to work in active learning mode:
– Build seed RTB
– Translate some data
– Do some learning
– Identify most valuable pieces of information to get next
– Generate an RTB for those pieces of information
– Translate more
– Learn more
– Generate more, etc.
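A minimal sketch of that loop in Python; the informant, rule learner, and value estimate are all passed in as placeholder callables, since none of these components is specified here:

def active_elicitation(pool, seed, translate, learn, value, rounds=3, batch=50):
    # pool: candidate feature structures; seed: the initial RTB.
    rtb, bitext, model = list(seed), [], None
    for _ in range(rounds):
        bitext += [(fs, translate(fs)) for fs in rtb]  # elicit translations
        model = learn(bitext)                          # learn transfer rules
        # Build the next mini-RTB from the meanings judged most valuable
        # to the learner in its current state.
        rtb = sorted(pool, key=lambda fs: value(model, fs),
                     reverse=True)[:batch]
    return model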