Interlingual Annotation of Multilingual Text
Corpora (IAMTC)
Project Overview for ITIC
November 13, 2003
Carnegie Mellon University
Lori Levin, Teruko Mitamura, Simon Fung
Principal investigators and senior personnel
•Bonnie Dorr, University of Maryland
•Nizar Habash, University of Maryland and Columbia
•Stephen Helmreich, NMSU
•Eduard Hovy, USC
•David Farwell, NMSU
•Lori Levin, CMU
•Keith Miller, MITRE
•Teruko Mitamura, CMU
•Owen Rambow, Columbia University
•Florence Reeder, MITRE
Wiki
• http://sparky.umiacs.umd.edu:8000/IAMTC/IAMTC.wiki
• Corpora
• Documents and manuals
• Discussion
Goals of IAMTC
• A practical interlingua for unrestricted text
– Based on resolving mismatches between languages
and between multiple English translations
– Feasible human coding
• Speed
• Inter-coder agreement
– Feasible parsing and generation
– Usable by many research communities
• MT, extraction, etc.
• Corpus-based, rule-based, etc.
• Multiple levels of representation:
– Syntactic dependency structure
– Language-specific predicate argument structure
– Interlingua (with resolution of some mismatches)
Products of IAMTC
• A coding manual for the interlingua
• A multilingual tagged corpus
– 25 original texts in each of six languages: French,
Spanish, Japanese, Korean, Arabic, Hindi
– Three English translations of each text
• An evaluation metric for the interlingua
Representations
• IL0: Language-specific dependency syntax
• IL1: Language-specific dependency structure
with:
– Labeling of nodes using ontology
– Labeling of arcs with semantic role names
• IL2: Interlingua
– Neutralize: support verbs; some multi-word
expressions and non-literal language; some lexical
converses (buy-sell); some text planning differences;
conflational mismatches, head-switching mismatches,
etc.
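The three levels can be pictured as successive enrichments of one dependency tree. The sketch below is illustrative only: the class `Node`, its field names, the concept symbol `MONEY`, and the role name `THEME` are assumptions made for this example, not the project's IL1 markup format or actual ontology symbols. `TRANSFER-MONEY` is borrowed from the merging slides later in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a dependency tree (field names are hypothetical)."""
    word: str                       # surface word or phrase, present at IL0
    concept: Optional[str] = None   # ontology label on the node, added at IL1
    role: Optional[str] = None      # semantic role on the arc from the head, added at IL1
    children: List["Node"] = field(default_factory=list)


def is_il1(node: Node, is_root: bool = True) -> bool:
    """True if every node has a concept and every non-root arc has a role."""
    if node.concept is None:
        return False
    if not is_root and node.role is None:
        return False
    return all(is_il1(child, is_root=False) for child in node.children)


# IL0: bare dependency structure, no semantic labels yet.
il0 = Node("granted", children=[Node("more than 27 million dollars")])

# IL1: same structure with (illustrative) node and arc labels.
il1 = Node(
    "granted",
    concept="TRANSFER-MONEY",
    children=[Node("more than 27 million dollars", concept="MONEY", role="THEME")],
)

print(is_il1(il0))  # False
print(is_il1(il1))  # True
```

IL2 would then rewrite such trees so that structurally different IL1s for the same content converge; that step is not modeled here.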
Examples
(from Nizar Habash)
• http://www.umiacs.umd.edu/~habash/artb_004.idg.5.IL.1
– The minister, who has his own website, also
said: "I want Dubai to be the best place in the
world for state-of-the-art technology
companies."
• http://www.umiacs.umd.edu/~habash/artb_004.idg.5.IL.2
– The minister who has a personal website on
the internet, further said that he wanted Dubai
to become the best place in the world for the
advanced (hitech) technological companies.
Example 1
• Original English:
– In its first five years of operation, PRODEM financed loans to
over 13,300 microentrepreneurs, 77 per cent of whom were
women, disbursing over $27 million in loans averaging $273.
• Original French:
– Au bout de cinq ans, le programme avait consenti plus de 27
millions de dollars de prets d'un montant moyen de 273
dollars, a plus de 13 300 entrepreneurs, dont 77% de femmes
....
• English Translation from French:
– At the end of five years, the program had granted more than 27
million dollars in loans with an average amount of 273 dollars,
to more than 13 300 entrepreneurs, of which 77% were
women,....
Example 1
• Original English:
– financed
• loans
• to over 13,300 microentrepreneurs,
– disbursing
• over $27 million
– in loans
• Original French:
– consenti
• plus de 27 millions de dollars
– de prets
• a plus de 13 300 entrepreneurs,
• English Translation from French:
– granted
• more than 27 million dollars
– in loans
• to more than 13 300 entrepreneurs
Example 2
• Original English:
– Its network of eighteen independent organizations
in Latin America has lent …..
• Original French:
– le reseau regroupe dix-huit organisations
independantes qui ont debourse …..
• English Translation from French:
– the network comprises eighteen independent
organizations which have disbursed …..
Example 2
• Original English:
– has lent
• Its network
– of eighteen independent organizations
• …..
• Original French:
– regroupe
• le reseau
– dix-huit organisations independantes
» ont debourse ……
• English Translation from French:
– comprises
• the network
• eighteen independent organizations
– have disbursed ……
Interlingua Merging
• Language-faithful interlinguas
• Original English:
– financed
• loans
• to over 13,300 microentrepreneurs
– disbursing
• over $27 million
– in loans
• Original French:
– consenti
• plus de 27 millions de dollars
– de prets
• a plus de 13 300 entrepreneurs
• English Translation from French:
– granted
• more than 27 million dollars
– in loans
• to more than 13 300 entrepreneurs
• Merged Interlingua
– TRANSFER-MONEY
• over $27 million
• to over 13,300 microentrepreneurs
– SOME-RELATION
• over $27 million
• loans
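One way to picture the merging step for Example 1 is as a lookup from language-specific head predicates to a shared interlingua concept. This is a minimal sketch under stated assumptions: the table `PREDICATE_TO_CONCEPT` and the function `merged_concepts` are hypothetical, and only the concept name `TRANSFER-MONEY` comes from the slide. Real merging would also have to align the arguments, which this sketch ignores.

```python
from typing import Dict, List, Set

# Hypothetical table: language-specific head predicate -> interlingua concept.
# Only the concept name TRANSFER-MONEY is taken from the slide.
PREDICATE_TO_CONCEPT: Dict[str, str] = {
    "financed": "TRANSFER-MONEY",
    "disbursing": "TRANSFER-MONEY",
    "consenti": "TRANSFER-MONEY",
    "granted": "TRANSFER-MONEY",
}


def merged_concepts(predicates: List[str]) -> Set[str]:
    """Map each predicate to its concept; unknown predicates are kept unchanged."""
    return {PREDICATE_TO_CONCEPT.get(p, p) for p in predicates}


# Head predicates of the three language-faithful analyses of Example 1.
analyses = {
    "Original English": ["financed", "disbursing"],
    "Original French": ["consenti"],
    "English Translation from French": ["granted"],
}

concepts = {name: merged_concepts(preds) for name, preds in analyses.items()}

# The analyses merge if they all reduce to the same set of concepts.
all_agree = len({frozenset(c) for c in concepts.values()}) == 1
print(all_agree)  # True
```

Note that the two English verbs (financed, disbursing) collapse into a single concept, which mirrors how the merged representation on the slide has one TRANSFER-MONEY node.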
Interlingua Merging
• Original English:
– has lent
• Its network
– of eighteen independent organizations
• Original French:
– regroupe
• le reseau
– dix-huit organisations independantes
» ont debourse
• English Translation from French:
– comprises
• the network
• eighteen independent organizations
– have disbursed
• Merged Interlingua
– HAS-AS-PART
• the network
• eighteen independent organizations
– TRANSFER-MONEY
• the network
• …..
Example 3
• Original English:
– Three of the most advanced institutions in the ACCION network
started their programmes as non-profit organizations and
have, in the last five years, converted into
• Original French:
– Trois des institutions les plus performantes rattachees a
ACCION International qui etaient au depart des organisations
a but nonlucratif sont devenues ces cinq dernieres annees
• English Translation from French:
– Three of the most successful institutions connected to ACCION
International, which were non-profit organizations in the
beginning, have become, in these last five years,
Example 3
• Original English:
– Started
• their programmes
• Institutions
– as non-profit organizations
– Converted
• Institutions
• …..
• Original French:
– sont devenues
• Institutions
– relative-clause: etaient au depart
» institutions
• ……
• English Translation from French:
– Have become
• Institutions
– Relative-clause; Were …in the beginning
» institutions
• ……
Meetings and Workshops
• Meetings:
– September 2003: New Orleans
– November 2003: CMU
– January 18 and 19, 2004: ISI
• Workshops:
– September 2003: MT Summit
– May 2004: Plan for a panel in the workshop
organized by Adam Meyers
– July 2004: Plan to propose ACL workshop
Timeline
• November 10 to December 1:
– Assembly of ENGLISH tools and knowledge sources
• Tools committee: Hovy, Rambow, Miller
• Omega ontology, ISI
• LCS verb lexicon (connect to Omega via PropBank)
• LDA (Lightweight Dependency Analyzer, Srinivas Bangalore)
• Graph tool from Prague
• New annotation tool (Dependency tree, Omega, Lexicon)
– Draft of coding manual for IL1:
• Annotation Committee: Rambow, Mitamura, Levin, Dorr, Habash, Helmreich
• Ontology symbols – Hovy
• IL0 – dependency structure – Rambow
• IL1 markup format – Rambow and Habash
• Semantic roles – Dorr, Habash, Mitamura, Levin
• Nouns and compounds – Mitamura
• Adverbs and adjectives – Helmreich
• Prepositions – Miller
• Named entities – Reeder
• Modification vs Predication – Habash
– Annotator training Phase 1:
• All annotators will tag the same English text
– Assembly of corpora:
• Data committee: Mitamura and ??
• Five foreign language original texts in each language
• Three English translations of each text
Annotation Procedure (English)
• Run LDA parser
• Use tree editing tool to convert syntactic
dependency parse into IL1
– Correct parsing errors
– Choose symbols from the ontology as node
labels
– For verbs:
• look the verb up in the lexicon to get a list of
semantic role names
• Match phrases to roles
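The verb step of this procedure (lexicon lookup, then matching phrases to roles) can be sketched as below. Everything named here is an assumption for illustration: the one-entry `VERB_LEXICON`, the role names `AGENT`, `THEME`, and `GOAL`, and the helper functions are hypothetical and do not reflect the actual LCS verb lexicon or the annotation tool. The phrase-to-role matching itself is done by the human annotator; the code only checks the result against the lexicon.

```python
from typing import Dict, List

# Hypothetical verb lexicon: verb lemma -> list of semantic role names.
VERB_LEXICON: Dict[str, List[str]] = {
    "grant": ["AGENT", "THEME", "GOAL"],
}


def roles_for(verb: str) -> List[str]:
    """Look the verb up in the lexicon; an unknown verb yields no roles."""
    return VERB_LEXICON.get(verb, [])


def check_assignment(verb: str, assignment: Dict[str, str]) -> List[str]:
    """Return the role names in `assignment` that the lexicon does not list for `verb`."""
    allowed = set(roles_for(verb))
    return [role for role in assignment.values() if role not in allowed]


# An annotator's phrase-to-role choices for "the program had granted ...".
assignment = {
    "the program": "AGENT",
    "more than 27 million dollars": "THEME",
    "to more than 13 300 entrepreneurs": "GOAL",
}

print(check_assignment("grant", assignment))  # []
```

An empty list means every role the annotator chose is licensed by the lexicon entry; a non-empty list would flag roles to revisit.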
Timeline
• December 1 to January 19:
• Annotation development cycle:
– Procedure committee: Hovy, Farwell, Mitamura
– For each week, for each language:
• Pick a text and two English translations of the text
• Annotator 1: Annotate the original and two English translations
• Annotator 2: Annotate the two English translations and one
English translation from another site.
– Each week:
• Conference call on Friday at 1:00 pm Eastern Time
• Revise annotation manuals as necessary
• Development of inter-coder agreement metric
– Evaluation committee: Reeder and Habash, leaders
• Proposal for IL2 based on comparison of IL1s for
different translations of the same text
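The inter-coder agreement metric listed above was still to be developed, so the slides do not define it. As a point of reference only, the sketch below computes Cohen's kappa, a standard chance-corrected agreement measure for two annotators labeling the same items. It is a stand-in, not the project's metric, and the role labels in the example data are made up.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up role labels from two annotators for the same four arcs.
annotator_1 = ["AGENT", "THEME", "GOAL", "THEME"]
annotator_2 = ["AGENT", "THEME", "THEME", "THEME"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.556
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 7/16, and kappa is 5/9. A tree-structured representation such as IL1 would need a decision about which units (nodes, arcs, or both) count as items before any such measure applies.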
Timeline
• January 19-February 23
– Development of foreign language analysis tools
– Large inter-coder agreement evaluation (IL1)
– Small inter-coder agreement evaluation of IL2
• March 1: Mid-year report
• March 1, 2004 to September 2004
– Annotation of full corpus:
• 25 original texts in each of the six languages (French,
Spanish, Hindi, Korean, Arabic, Japanese)
• 3 translations of each text into English
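The size of the full corpus follows directly from the figures above. A short worked calculation, assuming every original text receives exactly three English translations (the constant names are chosen for this example):

```python
# Figures taken from the slide; constant names are illustrative.
LANGUAGES = ["French", "Spanish", "Hindi", "Korean", "Arabic", "Japanese"]
TEXTS_PER_LANGUAGE = 25
TRANSLATIONS_PER_TEXT = 3

originals = len(LANGUAGES) * TEXTS_PER_LANGUAGE   # 6 * 25
translations = originals * TRANSLATIONS_PER_TEXT  # 150 * 3
total_documents = originals + translations

print(originals, translations, total_documents)  # 150 450 600
```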
Plans for year 2
• Argument-taking predicates other than verbs
• Additional tools for automatic construction of IL1
and IL2
• More comprehensive set of divergences resolved
in IL2
• Additional annotation topics:
– Coreference
– Scope
– Tense and aspect
– Etc.
• Larger annotated corpus
– Suitable for corpus-based methods and machine
learning