Simple Features for Chinese Word Sense Disambiguation
Hoa Trang Dang, Ching-yi Chia, Martha Palmer, FuDong Chiou
Computer and Information Science
University of Pennsylvania
{htd, chingyc, mpalmer, chioufd}@unagi.cis.upenn.edu
Overview

Maximum entropy WSD feature types

English Senseval2 verbs

Chinese
– Penn Chinese Treebank
– People’s Daily News
English Senseval2 verbs

Primarily Penn Treebank WSJ corpus
WordNet 1.7 sense inventory
29 verbs
15.6 senses/verb in corpus
baseline (most frequent sense) 40%
best system performance 60%
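
The most-frequent-sense baseline quoted above can be sketched as follows. The sense labels are hypothetical placeholders, not items from the Senseval2 data, and the function name is an assumption made for illustration.

```python
from collections import Counter

def most_frequent_sense_baseline(train_senses, test_senses):
    """Accuracy obtained by always predicting the most frequent training sense."""
    most_common_sense, _ = Counter(train_senses).most_common(1)[0]
    correct = sum(1 for s in test_senses if s == most_common_sense)
    return most_common_sense, correct / len(test_senses)

# Hypothetical sense labels for one verb (illustration only).
train = ["call.1", "call.1", "call.3", "call.1", "call.2"]
test = ["call.1", "call.2", "call.1", "call.3", "call.1"]
sense, accuracy = most_frequent_sense_baseline(train, test)
print(sense, accuracy)
```

The baseline ignores context entirely, so it gives a floor against which the feature-based results can be read.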
Local Collocational Features (English)

Collocational features for w
– word w
– pos of w
– pos of words at positions +1, -1 relative to w
– words at positions -2, -1, +1, +2 relative to w
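
A minimal sketch of how the collocational features listed above might be read off a tokenized, POS-tagged sentence. The function name, the feature keys, and the `<PAD>` symbol for positions outside the sentence are assumptions, not details given on the slide.

```python
def collocational_features(tokens, pos_tags, i):
    """Collocational features for the target word at index i."""
    def at(seq, j):
        # Positions outside the sentence get a padding symbol (an assumption).
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    feats = {"word": tokens[i], "pos": pos_tags[i]}
    for offset in (-1, 1):            # pos of words at positions -1, +1
        feats[f"pos{offset:+d}"] = at(pos_tags, i + offset)
    for offset in (-2, -1, 1, 2):     # words at positions -2, -1, +1, +2
        feats[f"word{offset:+d}"] = at(tokens, i + offset)
    return feats

# Hypothetical example sentence with Penn Treebank style tags.
tokens = ["He", "called", "the", "meeting", "off"]
pos_tags = ["PRP", "VBD", "DT", "NN", "RP"]
feats = collocational_features(tokens, pos_tags, 1)
print(feats)
```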
Local Syntactic Features (English)

Syntactic features
– whether or not the sentence is passive
– whether there is a subject, direct object, indirect object, or clausal complement
– the words (if any) in the positions of subject, direct object, indirect object, particle, prepositional complement (and its object)
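
One way the syntactic features above could be derived, assuming the parser output has already been reduced to a simple dictionary of grammatical roles for the target verb. That dictionary format and all role names are hypothetical; the slides do not describe how the parses are represented.

```python
def syntactic_features(parse):
    """Map a simplified clause description to syntactic features for the verb."""
    feats = {"passive": bool(parse.get("passive", False))}
    # Presence flags for the argument types listed on the slide.
    for role in ("subject", "direct_object", "indirect_object", "clausal_complement"):
        feats[f"has_{role}"] = role in parse
    # Words filling each position, when that position is present.
    for role in ("subject", "direct_object", "indirect_object", "particle",
                 "prep_complement", "prep_object"):
        if role in parse:
            feats[f"{role}_word"] = parse[role]
    return feats

# Hypothetical clause: "He called the meeting off."
parse = {"passive": False, "subject": "he", "direct_object": "meeting", "particle": "off"}
feats = syntactic_features(parse)
print(feats)
```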
Local Semantic Features (English)

Semantic features
– a Named Entity tag (PERSON, ORGANIZATION, LOCATION) for proper nouns
– WordNet synsets and hypernyms for the nouns
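
The semantic features above can be illustrated with a toy hypernym table standing in for WordNet. The table, the feature-name format, and the choice to treat each noun as having a single sense are all simplifying assumptions; real WordNet nouns usually have several synsets.

```python
# Toy hypernym links standing in for WordNet (hypothetical, heavily simplified).
HYPERNYM = {
    "meeting": "gathering",
    "gathering": "social_group",
    "social_group": "group",
}

def hypernym_chain(synset, hypernym=HYPERNYM):
    """Return the synset followed by its ancestors in the toy hierarchy."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def semantic_features(role, noun, named_entity_tag=None):
    """Named Entity tag for proper nouns, synset plus hypernyms otherwise."""
    if named_entity_tag is not None:
        return {f"{role}_ne={named_entity_tag}"}
    return {f"{role}_syn={s}" for s in hypernym_chain(noun)}

obj_feats = semantic_features("obj", "meeting")
subj_feats = semantic_features("subj", "IBM", named_entity_tag="ORGANIZATION")
print(sorted(obj_feats), sorted(subj_feats))
```

Including hypernyms lets the classifier generalize from a specific noun to its broader class, which is the apparent motivation for this feature type.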
Overall Accuracy of System (English)

Feature Type                                Accuracy
Collocation                                 48.3
Collocation + Syntax                        53.9
Collocation + Syntax + Semantics            59.0
Collocation + Topic                         52.9
Collocation + Syntax + Topic                54.2
Collocation + Syntax + Semantics + Topic    60.2
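
The feature-type combinations above all feed one maximum entropy classifier. A minimal sketch of that model form follows, with binary features and one weight per (feature, sense) pair. The slides do not say which estimation procedure was used; plain stochastic gradient ascent is swapped in here for brevity, and the training examples and sense names are invented for illustration.

```python
import math
from collections import defaultdict

def sense_probabilities(weights, feats, senses):
    """Maxent form: p(sense | context) is proportional to exp(sum of active weights)."""
    scores = {s: sum(weights[(f, s)] for f in feats) for s in senses}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def train(data, senses, epochs=200, rate=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in data:
            probs = sense_probabilities(weights, feats, senses)
            for s in senses:
                # Gradient per feature: observed value minus expected value.
                step = rate * ((1.0 if s == gold else 0.0) - probs[s])
                for f in feats:
                    weights[(f, s)] += step
    return weights

# Hypothetical training contexts: sets of active features with a gold sense.
data = [
    ({"word+1=off", "has_obj"}, "cancel"),
    ({"word+1=up", "has_obj"}, "phone"),
    ({"word+1=off", "passive"}, "cancel"),
    ({"word+1=up"}, "phone"),
]
senses = ["cancel", "phone"]
weights = train(data, senses)
probs = sense_probabilities(weights, {"word+1=off"}, senses)
best = max(probs, key=probs.get)
print(best, probs)
```

Adding a feature type under this setup only means adding more strings to each context's feature set, which is why the combinations in the table can be compared directly.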
Data Preparation (Chinese)

Penn Chinese Treebank (100K words)
CETA (Chinese-English Translation Assistance) Dictionary
28 words (multiple verb senses, possibly other pos)
3.5 senses/word in corpus
Baseline (most frequent sense) 77%
Local Collocational Features (Chinese)

Collocational Features:
– word
– pos
– word-2, word-1, word+1, word+2
– pos-1, pos+1
– followsVerb
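
The slide does not define followsVerb. A plausible reading, assumed here, is a binary feature that fires when the word immediately before the target is tagged as a verb. The sketch uses the four verb tags of the Penn Chinese Treebank tagset; the example sentence and its tagging are hypothetical.

```python
# Verb tags in the Penn Chinese Treebank tagset.
CTB_VERB_TAGS = {"VV", "VA", "VC", "VE"}

def follows_verb(pos_tags, i):
    """True if the word immediately before position i is tagged as a verb."""
    return i > 0 and pos_tags[i - 1] in CTB_VERB_TAGS

# Hypothetical segmented and tagged sentence: 他/PN 开始/VV 学习/VV 中文/NN
tokens = ["他", "开始", "学习", "中文"]
pos_tags = ["PN", "VV", "VV", "NN"]
flags = [follows_verb(pos_tags, i) for i in range(len(tokens))]
print(flags)
```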
Local Syntactic Features (Chinese)

Syntactic Features:
– hassubj
– subj
– hasobj
– obj-p
– obj
– hasinobj
– Comp-VP
– VPComp
– Comp-IP
– hasprd
Local Semantic Features (Chinese)

Semantic Features (for verbs only):
Generated by assigning a HowNet noun category to each subject and object
– subjsem
– objsem
Overall Accuracy of Maximum Entropy System (CTB)

Feature Type                                Accuracy    Std Dev
Collocation (no pos)                        86.8        1.0
Collocation                                 93.4        0.5
Collocation + Syntax                        94.4        0.4
Collocation + Syntax + Semantics            94.4        0.6
Collocation + Topic                         90.3        1.0
Collocation + Syntax + Topic                92.7        0.9
Collocation + Syntax + Semantics + Topic    92.8        0.8
Baseline                                    76.7
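
The slides do not say how the accuracy and standard deviation figures were obtained. One common setup, assumed here, is to repeat the train/test split several times and report the mean and sample standard deviation of the per-run accuracies. The per-run numbers below are invented purely to show the arithmetic and are not results from the study.

```python
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Invented per-run accuracies (percent), for illustration only.
run_accuracies = [93.0, 94.0, 93.5, 93.0, 93.5]
mean_acc, std_acc = summarize(run_accuracies)
print(round(mean_acc, 1), round(std_acc, 2))
```

If the population standard deviation were intended instead, `statistics.pstdev` would be the corresponding call.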
Data Preparation (PDN)

People’s Daily News (PDN)
– Five words with low accuracy and low counts in CTB were subsequently sense-tagged in PDN (1M words).
– About 200 sentences/word from PDN.
– 8.2 senses/verb in corpus
– Baseline (most frequent sense) 58%
– Automatic segmentation, pos-tagging, parsing
Overall Accuracy of Maximum Entropy System (PDN)

Feature Type                                Accuracy    Std Dev
Collocation (no pos)                        72.3        2.2
Collocation                                 70.3        2.9
Collocation + Syntax                        71.7        3.9
Collocation + Syntax + Semantics            71.7        4.2
Collocation + Topic                         73.3        3.2
Collocation + Syntax + Topic                72.6        2.9
Collocation + Syntax + Semantics + Topic    73.0        3.4
Baseline                                    57.6
Conclusion

Types of features that are important for
English and Chinese are different.
– Parse information is useful for English
WSD.
– Lexical collocational information may be
sufficient for Chinese.
Chinese word sense disambiguation is addressed at the segmentation level.