Penn
English and Chinese PropBanks
Martha Palmer
University of Pennsylvania
with Olga Babko-Malaya, Nianwen Xue,
and Ben Snyder
April 14, 2005
Semantic Representation Meeting
University of Maryland
1
What is a PropBank?
• A PropBank is a corpus annotated with the predicate-argument structure of the verbs:
• English PropBank: www.cis.upenn.edu/~ace (LDC, 3/’04)
  Kingsbury and Palmer 2002, Palmer, Gildea, Kingsbury, 2005
  • Wall Street Journal, 1M words, 120K+ predicate instances
  • Brown, 14K predicate instances
• Chinese PropBank: www.cis.upenn.edu/~chinese/cpb
  Xue and Palmer 2003, Xue 2004
  • Xinhua (250K words – almost done),
  • Sinorama (250K words – estimated 2007)
• Nominalized verbs for English = NomBank/NYU
• Chinese NomBank?
2
Capturing “neutral” semantic roles
• Boyan broke [Arg1 the LCD-projector].
  break (agent(Boyan), patient(LCD-projector))
• [Arg1 The windows] were broken by the hurricane.
• [Arg1 The vase] broke into pieces when it toppled over.
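The three sentences above can be restated as a small Python sketch. The surface-position column is added here only for illustration and is not part of the slide's annotation. It shows that the thing broken keeps the label Arg1 whether it surfaces as an object or as a subject.

```python
# Each tuple: (sentence, text labeled Arg1 on the slide, surface position of that text).
# The surface-position strings are illustrative additions, not PropBank labels.
EXAMPLES = [
    ("Boyan broke the LCD-projector.", "the LCD-projector", "object"),
    ("The windows were broken by the hurricane.", "The windows", "passive subject"),
    ("The vase broke into pieces when it toppled over.", "The vase", "intransitive subject"),
]

def arg1_positions(examples):
    """Collect the distinct surface positions in which Arg1 appears."""
    return {position for _, _, position in examples}

for sentence, arg1, position in EXAMPLES:
    print(f"Arg1 = {arg1!r:22} ({position})")
```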
3
Frames File example: give
< 4000 Frames for PropBank
Roles:
Arg0: giver
Arg1: thing given
Arg2: entity given to
Example: double object
The executives gave the chefs a standing ovation.
Arg0: The executives
REL: gave
Arg2: the chefs
Arg1: a standing ovation
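As a rough illustration of how a frame file entry and an annotated instance fit together, here is a minimal Python sketch. The dictionary layout and the `gloss` helper are assumptions made for this example and do not reflect the actual frame file format.

```python
# Roleset for "give" as listed on the slide (layout is hypothetical).
GIVE_ROLES = {
    "Arg0": "giver",
    "Arg1": "thing given",
    "Arg2": "entity given to",
}

# The double-object example, in surface order.
ANNOTATION = [
    ("Arg0", "The executives"),
    ("REL", "gave"),
    ("Arg2", "the chefs"),
    ("Arg1", "a standing ovation"),
]

def gloss(annotation, roles):
    """Pair each labeled span with the role description from the roleset."""
    glossed = []
    for label, text in annotation:
        if label == "REL":
            continue  # the predicate itself carries no role description
        if label not in roles:
            raise KeyError(f"{label} is not defined in this roleset")
        glossed.append((text, roles[label]))
    return glossed

for text, description in gloss(ANNOTATION, GIVE_ROLES):
    print(f"{text}: {description}")
```

Note that in the double-object construction Arg2 precedes Arg1 in surface order. The numbering follows the roleset, not word order.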
4
Frames File example: give
w/ Thematic Role Labels
Roles:
Arg0: giver
Arg1: thing given
Arg2: entity given to
Example: double object
The executives gave the chefs a standing ovation.
Arg0 (Agent): The executives
REL: gave
Arg2 (Recipient): the chefs
Arg1 (Theme): a standing ovation
VerbNet – based on Levin classes
5
PropBank Exercise Ex.
• [He]-Arg1Theme [will]-MOD [probably]-MOD be [extradited]-rel [to the U.S.]-DIR [for trial under an extradition treaty President Virgilio Barco has revived]-PRP.
• He will probably be extradited to the U.S. for trial under [an extradition treaty]-Arg1Theme [President Virgilio Barco]-Arg0Agent has [revived]-rel.
6
A Chinese Treebank Sentence
国会/Congress 最近/recently 通过/pass 了/ASP 银行法/banking law
“The Congress passed the banking law recently.”
(IP (NP-SBJ (NN 国会/Congress))
    (VP (ADVP (ADV 最近/recently))
        (VP (VV 通过/pass)
            (AS 了/ASP)
            (NP-OBJ (NN 银行法/banking law)))))
7
The Same Sentence, PropBanked
(IP (NP-SBJ arg0 (NN 国会))
    (VP argM (ADVP (ADV 最近))
        (VP f2 (VV 通过)
            (AS 了)
            arg1 (NP-OBJ (NN 银行法)))))
Predicate: 通过 (pass), frameset f2
arg0: 国会 (congress)
argM: 最近
arg1: 银行法 (law)
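To make the relation between the bracketed tree and the predicate-argument summary concrete, here is a small Python sketch that reads the labels off the tree. It assumes, as an interpretation of the slide's notation, that a bare label (arg0, arg1, argM, f2) annotates the constituent that immediately follows it. The parser and helper names are hypothetical.

```python
import re

TREE = """(IP (NP-SBJ arg0 (NN 国会))
    (VP argM (ADVP (ADV 最近))
        (VP f2 (VV 通过)
            (AS 了)
            arg1 (NP-OBJ (NN 银行法)))))"""

TOKEN = re.compile(r"\(|\)|[^\s()]+")
ROLE = re.compile(r"^(arg(\d|M)|f\d+)$")  # assumed label shapes

def parse(tokens, pos=0):
    """Parse one bracketed constituent starting at tokens[pos] == '('."""
    assert tokens[pos] == "("
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1

def words(node):
    """Return the words under a node, skipping annotation labels."""
    if isinstance(node, str):
        return [] if ROLE.match(node) else [node]
    return [w for child in node[1] for w in words(child)]

def roles(node, found=None):
    """Map each label to the text of the constituent that follows it."""
    found = {} if found is None else found
    if isinstance(node, str):
        return found
    children = node[1]
    for i, child in enumerate(children):
        if isinstance(child, str) and ROLE.match(child) and i + 1 < len(children):
            found[child] = "".join(words(children[i + 1]))
        roles(child, found)
    return found

tree, _ = parse(TOKEN.findall(TREE))
print(roles(tree))
```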
8
Annotation procedure
• PTB II – Extract all sentences of a verb
• Create Frame File for that verb (Paul Kingsbury)
  (3400+ lemmas, 4700 framesets, 120K predicates)
• 1st pass: Automatic tagging (Joseph Rosenzweig)
• 2nd pass: Double blind hand correction by verb
  Inter-annotator agreement 84% (87% Arg#’s)
• 3rd pass: Adjudication (Olga Babko-Malaya)
• 4th pass: Train automatic semantic role labellers
  (Dan Gildea, Sameer Pradhan, Nianwen Xue, Szuting Yi, …)
  CoNLL-04 shared task, 2004, 2005, …
9
Propbank Kappa Statistics
                 P(A)   P(E)   Kappa
Role identify    .99    .89    .93
Role classify    .95    .27    .93
combined         .99    .88    .91
Role identification: classifying tree nodes as argument vs. non-argument
Role classification: classifying arguments as Arg1 vs. Arg2 vs. ArgM-LOC vs. etc.
Kappa = (P(A) - P(E)) / (1 - P(E))
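The kappa statistic is observed agreement above chance, divided by the maximum possible agreement above chance, so both the numerator and the denominator are differences. A minimal Python sketch follows. The function name is chosen here for illustration.

```python
def kappa(p_agree, p_expected):
    """Agreement above chance, scaled by the maximum achievable above chance."""
    if p_expected >= 1.0:
        raise ValueError("expected agreement must be below 1")
    return (p_agree - p_expected) / (1.0 - p_expected)

# Role classification row: P(A) = .95, P(E) = .27
print(round(kappa(0.95, 0.27), 2))  # 0.93
```

The table entries are rounded to two decimals, so recomputing from them may not reproduce every reported kappa exactly.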
10
Throughput
• Framing: approximately 80-100 verbs/week
• Annotation: approximately 70 instances/hour
• Solomonization: approximately 100 instances per hour
• 100K words (last summer)
  ~4 months (hardly any new frame files)
  4-6 part-time annotators (100 hrs a week),
  half-time programmer,
  half-time project manager,
  half-time adjudicator, frame file creator
11
Applications
• IE – slot filling
• Question Answering:
  What do lobsters like to eat?
  Answer is NOT people!
• Machine Translation
  Reconciling event descriptions across languages - See Parallel Prop II
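The lobster question can be sketched in a few lines of Python. The two propositions below are hypothetical examples, not corpus data, and the helper names are invented for this illustration. The sketch shows that a role-aware lookup returns only what lobsters eat, while a role-blind lookup also returns what eats lobsters.

```python
# Hypothetical propositions, as a semantic role labeller might output them.
PROPOSITIONS = [
    {"rel": "eat", "Arg0": "people", "Arg1": "lobsters"},
    {"rel": "eat", "Arg0": "lobsters", "Arg1": "clams"},
]

def what_does_x_eat(eater, propositions):
    """Role-aware lookup: return Arg1 of 'eat' only where Arg0 is the eater."""
    return [p["Arg1"] for p in propositions
            if p["rel"] == "eat" and p.get("Arg0") == eater]

def cooccurs_with(word, propositions):
    """Role-blind baseline: anything mentioned alongside `word` with 'eat'."""
    hits = set()
    for p in propositions:
        fillers = {value for key, value in p.items() if key != "rel"}
        if p["rel"] == "eat" and word in fillers:
            hits |= fillers - {word}
    return hits

print(what_does_x_eat("lobsters", PROPOSITIONS))
print(sorted(cooccurs_with("lobsters", PROPOSITIONS)))
```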
12
Word Senses in PropBank
• Orders to ignore word sense not feasible for 700+ verbs
  • Mary left the room
  • Mary left her daughter-in-law her pearls in her will
Frameset leave.01 "move away from":
  Arg0: entity leaving
  Arg1: place left
Frameset leave.02 "give":
  Arg0: giver
  Arg1: thing given
  Arg2: beneficiary
How do these relate to traditional word senses in WordNet?
13
Overlap between Senseval2 Groups and Framesets – 95%
[Figure: WordNet senses of the verb "develop", clustered as WN1 WN2 | WN3 WN4 | WN6 WN7 WN8 | WN11 WN12 WN13 | WN19 | WN5 WN9 WN10 | WN14 | WN20, arranged within Frameset1 and Frameset2.]
14
Sense Hierarchy
(Palmer, et al, SNLU04 - NAACL04)
• PropBank Framesets – ITA >90%
  coarse grained distinctions
  20 Senseval2 verbs w/ > 1 Frameset
  Maxent WSD system, 73.5% baseline, 90% accuracy
• Sense Groups (Senseval-2) – ITA 82%
  Intermediate level (includes Levin classes) – 69%
  Tagging w/ groups: ITA 89%, 200 per hour
• WordNet – ITA 71%
  fine grained distinctions, 60.2%
15
PropBank II – English/Chinese (100K)
We still need relations between events and entities:
• Event IDs with event coreference
• Selective sense tagging
  • Tagging nominalizations w/ WordNet sense
  • Grouped WN senses - selected verbs and nouns
• Nominal Coreference
  • not names
• Clausal Discourse connectives – selected subset
Level of representation that reconciles many surface differences between the languages
16
Event IDs – Parallel Prop II (1)
• Aspectual verbs do not receive event IDs:
今年/this year 中国/China 继续/continue 发挥/play 其/it 在/at 支持/support 外商/foreign business 投资/investment 企业/enterprise 方面/aspect 的/DE 主/main 渠道/channel 作用/role
“This year, the Bank of China will continue to play the main role in supporting foreign-invested businesses.”
17
Event IDs – Parallel Prop II (2)
• Nominalized verbs do:
  • He will probably be extradited to the US for trial.
    done as part of sense-tagging
    (all 7 WN senses for “trial” are events.)
  • 随着/with 中国/China 经济/economy 的/DE 不断/continued 发展/development…
    “With the continued development of China’s economy…”
The same events may be described by verbs in English and nouns in Chinese, or vice versa.
Event IDs help to abstract away from POS tag.
18
Event reference – Parallel Prop II
• Pronouns (overt or covert) that refer to events:
[This] is gonna be a word of mouth kind of thing.
这些/these 成果/achievements 被/BEI 企业/enterprise 用/apply (e15) 到/to 生产/production 上/on 点石成金/spin gold from straw, *pro*-e15 大大/greatly 提高/improve 了/le 中国/China 镍/nickel 工业/industry 的/DE 生产/production 水平/level 。
“These achievements have been applied (e15) to production by enterprises to spin gold from straw, which-e15 greatly improved the production level of China’s nickel industry.”
• Prerequisites:
  • pronoun classification
  • free trace annotation
19
Chinese PB II: Sense tagging
• Much lower polysemy than English
  • Avg of 3.5 (Chinese) vs. 16.7 (English)
    Dang, Chia, Chiou, Palmer, COLING-02
• More than 2 Framesets
  • 62/4865 (250K) Ch vs. 294/3635 (1M) English
• Mapping Grouped English senses to Chinese
  (English tagging - 93 verbs/168 nouns, 5000+ instances)
  • Selected 12 polysemous English words (7 verbs/5 nouns)
  • For 9 (6 verbs/3 nouns), grouped English senses map to unique Chinese translation sets (synonyms)
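The two frameset fractions are easier to compare as percentages. The short Python sketch below reads each fraction as verbs with multiple framesets over all verbs framed. That reading is an interpretation of the slide, not something it states.

```python
def share(multi_frameset_verbs, total_verbs):
    """Fraction of verbs that carry more than one frameset (assumed reading)."""
    return multi_frameset_verbs / total_verbs

chinese = share(62, 4865)   # 250K-word Chinese corpus
english = share(294, 3635)  # 1M-word English corpus
print(f"Chinese: {chinese:.1%}  English: {english:.1%}")
# prints "Chinese: 1.3%  English: 8.1%"
```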
20
Mapping of Grouped Sense Tags
to Chinese
raise – translations by group
increase: 提高 / ti2gao1
lift, elevate, orient upwards: 仰 / yang3
collect, levy: 募集 / mu4ji2, 筹措 / chou2cuo4, 筹... / chou2…
invoke, elicit, set off: 提 / ti4
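The idea that grouped senses map to unique translation sets can be checked mechanically. The Python sketch below encodes the groups for "raise" from the slide, keeping the elided entry verbatim as "筹...", and tests that no translation belongs to two groups. The function names are hypothetical.

```python
from itertools import combinations

# Sense groups of "raise" and their Chinese translation sets, as on the slide.
RAISE_GROUPS = {
    "increase": {"提高"},
    "lift, elevate, orient upwards": {"仰"},
    "collect, levy": {"募集", "筹措", "筹..."},  # last entry is elided on the slide
    "invoke, elicit, set off": {"提"},
}

def maps_uniquely(groups):
    """True if no translation is shared by two sense groups."""
    return all(a.isdisjoint(b) for a, b in combinations(groups.values(), 2))

def group_for(translation, groups):
    """Recover the sense group from a translation, if it is unambiguous."""
    matches = [name for name, words in groups.items() if translation in words]
    return matches[0] if len(matches) == 1 else None

print(maps_uniquely(RAISE_GROUPS))
print(group_for("筹措", RAISE_GROUPS))
```

When the sets are disjoint, as they are here, the Chinese translation alone is enough to pick out the English sense group.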
21
Discourse connectives:
The Penn Discourse TreeBank
• WSJ corpus (~1M words, ~2400 texts)
  http://www.cis.upenn.edu/~pdtb
  Miltsakaki, Prasad, Joshi and Webber, LREC-04, NAACL-04 Frontiers
  Prasad, Miltsakaki, Joshi and Webber, ACL-04 Discourse Annotation
• Chinese: 10 explicit discourse connectives that include subordinating conjunctions, coordinating conjunctions, and discourse adverbials.
• Argument determination, sense disambiguation
[arg1 学校/school 不/not 教/teach 理财/finance management], [conn 结果/as a result] [arg2 报章/newspaper 上/on 的/DE 各/all 种/kind 专栏/column 就/then 成为/become 信息/information 的/DE 主要/main 来源/source]。
“The school does not teach finance management. As a result, the different kinds of columns become the main source of information.”
22
Summary of English PropBanks
Olga Babko-Malaya, Ben Snyder
Genre
Words
Wall Street Journal*
Frames
Frameset
Files
Tags
1000K
< 4000
700+
100K
<1500
250K
< 6000
Released
Prop2
March, 04
(Penn TreeBank II)
English Translation of
Dec, 04
Aug, 05
Dec, 04
Dec, 05
Chinese TreeBank *
Xinhua News
200
DOD funding
Sinorama
(100K)
150K
< 4000
July, 05
250K
<2000
Dec, 06
NSF-ITR funding
Sinorama, English corpus
NSF-ITR funding
*DOD funding
23
Annotation of free traces
• Free traces – traces which are not linked to an antecedent in PropBank
  • Arbitrary
    Legislation to lift the debt ceiling is ensnarled in the fight over [*]-ARB cutting capital-gains taxes
  • Event
    The department proposed requiring (e4) stronger roofs for light trucks and minivans, [*]-e4 beginning with 1992 models
  • Imperative
    All right, [*]-IMP shoot.
• 1K instances of free traces in a 100K corpus
24
Classification of pronouns
• 'referring'
  [John Smith] arrived yesterday. [He] said that...
• 'bound'
  [Many companies] raised [their] payouts by more than 10%
• 'event'
  [This] is gonna be a word of mouth kind of thing.
• 'generic'
  I like [books]. [They] make me smile.
25
Mapping of Grouped Sense Tags
to Chinese
Zhejiang|浙江zhe4jiang1 will|将jiang1 raise|提高ti2gao1 the level|水平shui3ping2 of|的de opening up|开放kai1fang4 to|对dui4 the outside world|外wai4. (浙江将提高对外开放的水平。)
I|我wo3 raised|仰yang3 my|我的wo3de head|头tou2 in expectation|期望qi1wang4. (我仰头望去。)
…, raising|筹措chou2cuo4 funds|资金zi1jin1 of|的de 15 billion|150亿yi1bai3wu3shi2yi4 yuan|元yuan2 (…筹措资金150亿元。)
The meeting|会议hui4yi4 passed|通过tong1guo4 the “decision regarding motions”|议案yi4an4 raised|提ti4 by 32 NPC|人大ren2da4
26
representatives|代表dai4biao3 (会议通过了32名