Overview Presentation by Martha Palmer


Multilingual PennTools that capture parses and predicate-argument structures,
and their use in Applications
Martha Palmer, Aravind Joshi, Mitch Marcus,
Mark Liberman, Fernando Pereira
University of Pennsylvania
March 11, 2002
TIDES SITE VISIT
Outline
• Overview
– Objectives, resource development, applications
• Supervised Training of Individual Components
– parsers
– semantic taggers
• Training with labeled and unlabeled data
– co-training
– active learning (annotation tools)
Objectives
• Resources ($200K)
– Chinese TreeBank II
– Parallel Korean/English TreeBanks
– PropBank
– Multilingual Annotation Tool (Tom Morton, Nianwen Xue, Jeremy Lacivita)
• NYU, MITRE, LDC
Objectives (cont)
• PennTools ($300K)
– Morphological Analyzers (at LDC)
– Major decrease in parser development time and
parser running time (Dan Bikel, Carlos Prolo,
Anoop Sarkar)
– Automatic Predicate Argument Tagging
(Dan Gildea)
– Word Sense Disambiguation, English &
Chinese (Hoa Dang)
Chinese TreeBank II
Fu-dong Chiou, Nianwen Xue
• Cost of CTB I, 100K words : $270K
• Additional 40K, (20k, 20K)
– speedup given automatic parses? doubled
– compare HK, Sinorama, People’s Daily
• 2002 - 360K words, $100K
– Chiang’s parser doubles annotation speed
– 96K words bracketed as of March 8, 2002
– 110K Xinhua news, 200K other newswire, 50K DLI
corpus
– release of original 100K + 150K planned for June
English Translation: CTB I
TIDES
• Beijing E-C Translation LTD
• 12 week estimate, actual 15 weeks, Nov
• 100K words, around $10K (.06 per char)
• 3rd pass for error correction
– taking longer than expected
– 40K/100K done
Chinese PropBank - DOD
• Proposal stage, 2 yrs, 275K a year
• Year One (Just got funded)
– Develop lexicon guidelines, 2600 verbs
– Tag 100K CTB
• Year Two
– Extend guidelines, up to 5 or 6000 verbs
– Tag additional 400K CTB II
• Spinoff – Chinese lexicon
Richer CTB Annotations
TIDES ($25K)
• Coreference Tagging (Susan Converse)
– Draft guidelines
– 100K words tagged
• Sense tagging (Hoa Dang)
Korean/English Parallel TreeBank
Chunghye Han, Narae Han, Allen Lee
(CoGenTex/Penn/Systran: ARL MT Project)
• Defense Language Institute data
– 50K word corpus of military messages
– Same corpus available in Chinese
• Guidelines for POS tagging, bracketing
• http://www.cis.upenn.edu/~xtag/koreantag/index.html
• Companion Transfer Lexicon, 4000 entries
READY TO RELEASE
English PropBank
Paul Kingsbury, Scott Cotton
• 1M words of Treebank
• New semantic augmentations
– Predicate-argument relations for verbs,
– label arguments (arg0, arg1, arg2)
– First subtask, 300K word financial subcorpus
• Spin-off: English lexical resource
– 3500+ verbs
English PropBank – Current Status
• Frames files
– 787 verb lemmas (includes phrasal variants - 932)
– 363/ VerbNet semi-automatic expansions (subtask/PB)
• First subtask: 300K financial subcorpus
• 22,595 unique predicates annotated out of 29K (80%)
– 6K+ remaining (7 weeks, 2000@week, first pass)
• 1040 verb lemmas out of 1700+ (59%)
– 700 remaining (3.5 months, 200@month)
• PropBank, (including some of Brown?)
– 34,437 predicates annotated out of 118K, (29%)
– 1040 verb lemmas out of 3500, (29%)
Summary of Resources
Project | 2002 Funds | Status | Completion Date
Chinese Treebank II | $100K | 146K/400K words (100K@4 mo) | Dec ’02
English translation of Chinese Treebank I (2001 $10K) | $4K | 3rd pass, 40K/100K | July ’02
Richer Chinese Treebank (coref/wsd) | $25K | 1st pass pronouns 100K/100K; 30/100 verbs | Dec ’02
Korean/English parallel Treebank (2001 $100K) | $50K | 50K English/50K Korean | March ’02
English PropBank (2001 $250K) | $250K | 240K/300K finan.; 300K/1M WSJ | June ’02; Dec ’02; Dec ’02
Objectives (cont)
• Applications: ($200K) + ($150K)
Relation Extraction and MT
– Initial experiments with MUC 7
– Korean/English MT system wrap-up
– Plans for investigating statistical MT
approaches
Information Extraction with TAG
(Libin Shen, Anoop Sarkar, and Jinying Chen)
• Template Relation (TR) task of the 7th Message
Understanding Conference
• F-measure of 78% on sentence-level relations,
comparable to the best results in MUC-7
• Convert IE into a discriminative problem
– Syntactic Analysis with Supertagger [Joshi 1994] and
Lightweight Dependency Analyzer [Srinivas 1997]
– Machine Learning with Boosting algorithm [Schapire
2000]
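The F-measure reported above is the harmonic mean of precision and recall over extracted relations. The slide gives only the combined score, so the sketch below uses made-up counts purely to show the computation. The function name `precision_recall_f` and the counts in the example are assumptions, not figures from the MUC-7 experiments.

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Return (precision, recall, F_beta) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts: 6 correct relations, 2 spurious, 4 missed.
p, r, f = precision_recall_f(tp=6, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

With `beta=1.0` precision and recall are weighted equally. The slide does not say which weighting was used for the 78% figure.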
Korean/English ARL MT System:
New Parser Evaluation
Treebank trained – Anoop Sarkar
Dependency Evaluation: 75.7% on test, 97.58% training
 | off-the-shelf parser | Treebank parser, 1st pass | w/ improved collocations & markers | Corrected DSyntS
OK | 32% | 35% | 51% | 82%
Fixable | 64% | 65% | 45% | 16%
Bad | 5% | 4% | 2% | 4%
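The slide does not say how the dependency evaluation score was computed. One common definition is unlabeled attachment accuracy: the fraction of words whose predicted head matches the gold head. The sketch below assumes that definition. The function name `dependency_accuracy` and the head-index encoding (0 for the root) are illustrative choices, not taken from the ARL MT system.

```python
from typing import Sequence


def dependency_accuracy(
    gold_heads: Sequence[Sequence[int]],
    pred_heads: Sequence[Sequence[int]],
) -> float:
    """Fraction of words whose predicted head index equals the gold head index.

    Each sentence is a sequence of head indices, one per word (0 = root).
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted corpora differ in sentence count")
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    if total == 0:
        raise ValueError("no words to evaluate")
    return correct / total


# Toy example (invented): one three-word sentence with one wrong attachment.
gold = [[2, 0, 2]]
pred = [[2, 0, 1]]
print(dependency_accuracy(gold, pred))
```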
Statistical Approaches to MT
(Dan Gildea, Yuan Ding, Owen Rambow)
• Tree-based alignment:
– use one or both sets of trees from parallel treebanks to
constrain alignments,
– compare with unstructured alignments (IBM models).
• Word-sense disambiguation:
– apply maximum entropy model of word sense disambiguation to
translation selection.
• Monolingual corpora:
– translation selection based on dependency statistics
from monolingual corpora.
• Statistical generation:
– PropBank as underlying representation for statistical
generation (JHU summer workshop).
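The unstructured baseline mentioned above, the IBM models, learns word translation probabilities from sentence pairs without any tree constraints. As a rough illustration, here is a minimal sketch of IBM Model 1 trained with EM. The toy German/English corpus, the function name `train_ibm1`, and the iteration count are assumptions made for the example and are not part of the project described in the slides.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    A NULL token is added to every target sentence.
    """
    null = "<NULL>"
    src_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for fs, es in pairs:
            es_null = [null] + list(es)
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Invented toy corpus, for illustration only.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm1(corpus)
print(t[("das", "the")] > t[("haus", "the")])
```

A tree-based alignment model would restrict which word pairs may align according to the parse structure, whereas this sketch lets every source word align to any target word in the sentence.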