Semantic Annotation – Week 3

Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham,
Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub,
Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
July 30th, 2003
JHU WORKSHOP - 2003
Our Hypotheses
● A transformation of a corpus that replaces words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling (see the factorization sketched after this list)
● Semantic category information will also help improve machine translation
● An initially noun-centric approach will allow bootstrapping to other syntactic categories
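As background for the first hypothesis, the standard class-based n-gram factorization (shown here for illustration, not as the workshop's exact model) makes the benefit concrete: mapping each word w to a coarse category c(w) lets a bigram be estimated as

```latex
% c(w) denotes the coarse semantic category assigned to word w
P(w_i \mid w_{i-1}) \approx P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
```

Since there are far fewer categories than words, the category-to-category term is estimated from many more observations per parameter, which is where the relief from data sparseness comes from.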
An Example
● Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
● Humans aboard space_vehicle dodge satellite timeref. (a toy sketch of this replacement follows)
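A minimal sketch of this kind of replacement, using a toy lexicon in place of the LDOCE-derived classes and the GATE-based named entity / noun phrase recognition actually used in the workshop:

```python
# Toy sketch of replacing words/phrases with coarse semantic categories.
# The lexicon below is purely illustrative; the workshop used LDOCE classes
# plus GATE-based named entity / noun phrase recognition instead.
TOY_LEXICON = {
    ("space", "shuttle", "endeavor"): "space_vehicle",
    ("astronauts",): "Humans",
    ("satellite",): "satellite",
    ("friday",): "timeref",
}
MAX_PHRASE_LEN = max(len(key) for key in TOY_LEXICON)

def replace_with_categories(tokens):
    """Greedy longest-match replacement of known words/phrases by their category."""
    out, i = [], 0
    while i < len(tokens):
        for span in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + span])
            if key in TOY_LEXICON:
                out.append(TOY_LEXICON[key])
                i += span
                break
        else:  # no phrase matched: keep the original token
            out.append(tokens[i])
            i += 1
    return out

sentence = ("Astronauts aboard the space shuttle Endeavor were forced to "
            "dodge a derelict Air Force satellite Friday").split()
print(" ".join(replace_with_categories(sentence)))
# Humans aboard the space_vehicle were forced to dodge a derelict Air Force satellite timeref
```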
Our Progress – Preparing the Data – Pre-Workshop
● Identify a tag set
● Create a human-annotated corpus
● Create a doubly annotated corpus
● Process all data for named entity and noun phrase recognition using GATE tools
● Develop algorithms for mapping target categories to WordNet synsets to support the tag set assessment
The Semantic Classes for Annotators
● A subset of the classes available in the electronic version of the Longman Dictionary of Contemporary English (LDOCE)
● Rationale:
  - The number of semantic classes is small
  - The classes are somewhat reliable, since a team of lexicographers used them to code noun senses, adjective preferences, and verb preferences
Semantic Classes
[Figure: tree of the LDOCE semantic classes used as target classes, together with the annotated evidence. Labeled nodes: Abstract (T), Concrete (C), Animate (Q), Inanimate (I), Plant (P), Animal (A), Human (H), Liquid (L), Gas (G), Solid (S), Organic (5), PhysQuant (4), Movable (N), Non-movable (J); the codes B, D, F, and M also appear in the diagram. Legend: Target Classes, Annotated Evidence.]
More Categories
● U: Collective
● K: Male
● R: Female
● W: Not animate
● X: Not concrete or animal
● Z: Unmarked
We allowed annotators to choose “none of the above” (? in the slides that follow)
Our Progress – Data Preparation
● Assess the annotation format, define uniform descriptions for irregular phenomena, and normalize them
● Determine the distribution of the tag set in the training corpus
● Analyze inter-annotator agreement
● Determine a reliable set of tags – T
● Parse all training data
Doubly Annotated Data
● Instances (headwords): 10,960
● 8,950 instances without question marks
● 8,446 of those are marked the same
● Inter-annotator agreement is 94% (83% including question marks); see the check after this list
● Recall – these are non-named-entity noun phrases
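A quick check of how the headline figures follow from these counts (the percentages also reappear in the chart on the next slide):

```python
# Recomputing the agreement figures from the counts on this slide.
total = 10_960          # doubly annotated instances (headwords)
no_qmark = 8_950        # instances where neither annotator used "?"
agree_no_qmark = 8_446  # of those, marked identically

print(f"agreement w/o '?': {agree_no_qmark / no_qmark:.1%}")              # 94.4%
print(f"agree w/o '?' over all instances: {agree_no_qmark / total:.2%}")  # 77.06%
print(f"disagree w/o '?' over all instances: {(no_qmark - agree_no_qmark) / total:.2%}")  # 4.60%
```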
Distribution of Doubly Annotated Data
[Pie chart over all 10,960 doubly annotated instances: agree without “?” 77.06%, agree with “?” 5.79%, disagree with “?” 12.55%, disagree without “?” 4.60%]
Agreement of doubly marked instances
[Pie chart over the instances marked without “?”: agree 94%, disagree 6%]
Inter-annotator agreement – for each category
[Bar chart: per-category inter-annotator agreement on a 0–1 scale for the categories A, C, G, H, I, J, K, L, N, P, Q, R, S, 4, 5, T, U, W, X]
Category distribution among agreed part
[Pie chart: distribution of the categories over the instances on which the annotators agreed; the chart carries the label “inter-annotator agreement” and a 69% figure. Categories shown: W, 4, K, A, Q, U, G, N, I, J, P, S, C, H, L, T, X, R, 5]
A few statistics on the human annotated data
● Total annotated: 262,230 instances (48,175 with “?”)
● 214,055 with a category; of those: Z 0.5%, W and X 0.5%, 4 and 5 1.6%
Our progress – baselines
● Determine baselines for automatic tagging of noun phrases
● Baselines for tagging observed words in new contexts (new instances of known words)
● Baselines for tagging unobserved words:
  - Unseen words – not in the training material but in the dictionary
  - Novel words – not in the training material nor in the dictionary/WordNet
Overlap of dictionary and head nouns (in the BNC)
● 85% of NPs covered
● only 33% of the vocabulary (both in LDOCE and in WordNet) occurs in the NPs covered
Preparation of the test environment
● Selected the blind portion of the human-annotated data for late evaluation
● Divided the remaining corpus into training and held-out portions (a sketch of such a split follows this list):
  - Random division of files
  - Unambiguous words for training – ambiguous words for testing
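A minimal sketch of a random file-level split of the kind described above; the file names, held-out fraction, and seed are placeholders, not the workshop's actual settings:

```python
import random

# Hypothetical file-level split; the held-out fraction, seed, and file names
# are placeholders (the slide does not give them).
def split_files(file_names, held_out_fraction=0.1, seed=0):
    """Randomly divide annotated files into training and held-out portions."""
    files = sorted(file_names)                      # fix an order, then shuffle reproducibly
    random.Random(seed).shuffle(files)
    n_held_out = int(len(files) * held_out_fraction)
    return files[n_held_out:], files[:n_held_out]   # (training, held_out)

training, held_out = split_files(f"annotated_{i:03d}.txt" for i in range(100))
print(len(training), len(held_out))  # 90 10
```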
Baselines using only (target) words
Error Rate | Unseen words marked with | Method     | Valid training instances | Blame
15.1%      | the first class          | MaxEntropy | count ≥ 3                | Klaus
12.6%      | most frequent class      | MaxEntropy | count ≥ 3                | Jerry
16%        | most frequent class      | VFI        | all                      | Fabio
13%        | most frequent class      | NaiveBayes | all                      | Fabio
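A minimal sketch of the "most frequent class" per-word baseline behind the table above, assuming the training data comes as (headword, category) pairs; the MaxEntropy/VFI/NaiveBayes rows use trained classifiers instead of this lookup, and the "count ≥ 3" reading of the third column is an assumption:

```python
from collections import Counter, defaultdict

def train_most_frequent(pairs, min_count=3):
    """pairs: (headword, semantic_category) training instances.
    Keeps a per-word prediction only for words seen at least `min_count` times,
    echoing the 'count >= 3' rows above (my reading of that column)."""
    pairs = list(pairs)
    by_word = defaultdict(Counter)
    for word, cat in pairs:
        by_word[word.lower()][cat] += 1
    per_word = {w: c.most_common(1)[0][0]
                for w, c in by_word.items() if sum(c.values()) >= min_count}
    fallback = Counter(cat for _, cat in pairs).most_common(1)[0][0]
    return per_word, fallback

def tag(word, per_word, fallback):
    """Unseen or low-count words fall back to the most frequent class overall."""
    return per_word.get(word.lower(), fallback)

# Tiny illustrative run (toy data, not the workshop corpus):
model, default_cat = train_most_frequent(
    [("satellite", "N"), ("satellite", "N"), ("satellite", "C"),
     ("water", "L"), ("idea", "T"), ("idea", "T"), ("idea", "T")])
print(tag("satellite", model, default_cat))  # N
print(tag("gizmo", model, default_cat))      # T (most frequent class overall)
```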
Baselines using only (target) words and preceding adjectives
Error Rate | Unseen words marked with | Method     | Valid training instances | Blame
13%        | most frequent class      | MaxEntropy | count ≥ 3                | Jerry
13.2%      | most frequent class      | MaxEntropy | all                      | Jerry
12.7%      | most frequent class      | MaxEntropy | count ≥ 3                | Jerry
Baselines using multiple knowledge sources
● Experiments in Sheffield
● Unambiguous tagger (assigns only the available semantic categories)
● Bag-of-words tagger (IR-inspired); a sketch follows this list
  - window size: 50 words
  - nouns and verbs
● Frequency-based tagger (assigns the most frequent semantic category)
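A rough sketch of an IR-inspired bag-of-words tagger of the kind listed above: each semantic category is represented by the bag of context words seen around its training instances, and a test instance gets the category whose bag best overlaps its own 50-word window. The scoring and normalization details are assumptions, not the Sheffield implementation:

```python
from collections import Counter

WINDOW = 50  # context words on each side, as on the slide

def context_window(tokens, i, size=WINDOW):
    """Context words within `size` tokens of position i; restricting to nouns
    and verbs (as on the slide) is assumed to happen upstream."""
    return [t.lower() for t in tokens[max(0, i - size):i + size + 1]]

def train_bags(instances):
    """instances: (tokens, head_index, category) triples.
    Builds one bag of context words per semantic category."""
    bags = {}
    for tokens, i, cat in instances:
        bags.setdefault(cat, Counter()).update(context_window(tokens, i))
    return bags

def tag(tokens, i, bags):
    """Assign the category whose training bag best overlaps the test context."""
    ctx = Counter(context_window(tokens, i))
    def score(bag):
        return sum(bag[w] * n for w, n in ctx.items()) / (sum(bag.values()) or 1)
    return max(bags, key=lambda cat: score(bags[cat]))
```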
Baselines using multiple knowledge sources (cont’d)
● Frequency-based tagger: 16-18% error rate
● Bag-of-words tagger: 17% error rate
● Combined architecture: 14.5-15% error rate
Bootstrapping to Unseen Words
● Problem: automatically identify the semantic class of words in LDOCE whose behavior was not observed in the training data
● Basic idea: we use the unambiguous words (unambiguous with respect to our semantic tag set) to learn contexts for tagging unseen words; a sketch of the data-collection step follows
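A minimal sketch of that data-collection step, assuming a hypothetical lexicon interface that maps a lemma to the set of semantic codes it can take; occurrences of single-code lemmas become automatically labeled training contexts:

```python
# Hypothetical interface: `lexicon` maps a lemma to the set of LDOCE-style
# semantic codes it can take; a lemma is "unambiguous" if that set has size 1.
def collect_bootstrap_instances(tagged_corpus, lexicon):
    """tagged_corpus: iterable of (lemma, context) noun-phrase head occurrences.
    Returns (context, category) training pairs taken from unambiguous lemmas only."""
    training = []
    for lemma, context in tagged_corpus:
        codes = lexicon.get(lemma, set())
        if len(codes) == 1:                      # unambiguous w.r.t. the tag set
            training.append((context, next(iter(codes))))
    return training

# Toy example (illustrative lexicon, not LDOCE):
lexicon = {"water": {"L"}, "satellite": {"N", "C"}, "idea": {"T"}}
corpus = [("water", "drank the cold water"),
          ("satellite", "launched the satellite"),
          ("idea", "a good idea")]
print(collect_bootstrap_instances(corpus, lexicon))
# [('drank the cold water', 'L'), ('a good idea', 'T')]
```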
Bootstrapping: statistics
6,656 different unambiguous lemmas in the (visible) human-tagged corpus
...these contribute 166,249 instances of data
...134,777 of these instances were considered correct by the annotators
Observation: unambiguous words can be used in the corpus in an “unforeseen” way
Bootstrapping baselines
● Test instances (instances of ambiguous words): 62,853

Method | % correctly labelled instances
Assigning the most frequent semantic tag (i.e. Abstract) | 52%
One previous word (adjective, noun, or verb), Naive Bayes classifier | 45% (reliable tagged instances); 44.3% (all instances)
One previous and one following word (adjective, noun, or verb), Naive Bayes classifier | 46.8% (reliable tagged instances); 44.5% (all instances)
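A minimal sketch of the one-previous-word Naive Bayes baseline from the table, with add-one smoothing and lower-casing as assumptions (the actual features were the POS-filtered previous adjective/noun/verb):

```python
import math
from collections import Counter, defaultdict

class PrevWordNaiveBayes:
    """Naive Bayes with a single feature: the previous adjective/noun/verb."""

    def fit(self, pairs):
        """pairs: (previous_word, semantic_category) training instances."""
        pairs = list(pairs)
        self.cat_counts = Counter(cat for _, cat in pairs)
        self.total = sum(self.cat_counts.values())
        self.feat_counts = defaultdict(Counter)
        for prev, cat in pairs:
            self.feat_counts[cat][prev.lower()] += 1
        self.vocab = {prev.lower() for prev, _ in pairs}
        return self

    def predict(self, prev_word):
        prev = prev_word.lower()
        def log_score(cat):
            prior = math.log(self.cat_counts[cat] / self.total)
            # add-one smoothing over the previous-word vocabulary (an assumption)
            likelihood = math.log((self.feat_counts[cat][prev] + 1) /
                                  (sum(self.feat_counts[cat].values()) + len(self.vocab) + 1))
            return prior + likelihood
        return max(self.cat_counts, key=log_score)

# Toy run (not the workshop data):
nb = PrevWordNaiveBayes().fit([("orbiting", "N"), ("orbiting", "N"),
                               ("abstract", "T"), ("good", "T"), ("good", "T")])
print(nb.predict("orbiting"))  # N
print(nb.predict("unknown"))   # T (falls back towards the more frequent category)
```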
Metrics for Intrinsic Evaluation
● Need to take into account the hierarchical structure of the target semantic categories
● Two fuzzy measures, based on (a sketch of the edge-distance variant follows this list):
  - dominance between categories
  - edge distance in the category tree/graph
● Results w.r.t. inter-annotator agreement are almost identical to exact match
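A minimal sketch of an edge-distance-based fuzzy score of the kind mentioned above: credit decays with the number of edges between the predicted and the gold category in the class tree. The tree fragment and the decay function below are assumptions for illustration, not the workshop's actual measure:

```python
# Toy fragment of the class tree from the "Semantic Classes" slide
# (child -> parent); purely illustrative.
PARENT = {"Q": "C", "I": "C", "A": "Q", "H": "Q", "P": "Q",
          "S": "I", "L": "I", "G": "I", "N": "S", "J": "S"}

def path_to_root(cat):
    path = [cat]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def edge_distance(a, b):
    """Number of tree edges between two categories, via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(c for c in pa if c in pb)
    return pa.index(common) + pb.index(common)

def fuzzy_score(predicted, gold):
    """1 for an exact match, decaying with tree distance (the decay is an assumption)."""
    return 1.0 / (1.0 + edge_distance(predicted, gold))

print(fuzzy_score("H", "H"))  # 1.0    exact match
print(fuzzy_score("H", "A"))  # ~0.33  Human vs Animal: two edges via Animate (Q)
print(fuzzy_score("H", "G"))  # 0.2    Human vs Gas: four edges via Concrete (C)
```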
What’s next
● Investigate the respective contribution of (independent) features
● Incorporate syntactic information
● Refine some coarse categories:
  - using subject codes
  - using genus terms
  - re-mapping via WordNet
What’s next (cont’d)
● Reduce the number of features/values via external resources:
  - lexical vs. semantic models of the context
  - use selectional preferences
● Concentrate on complex cases (e.g. unseen words)
● Prepare test data for extrinsic evaluation (MT)