Multi-Document Person Name Resolution

Multi-Document Person Name Resolution
Michael Ben Fleischman (MIT), Eduard Hovy (USC)
From the Proceedings of the ACL-42 Reference Resolution Workshop, 2004
Abstract
• Determine whether two instances with the same name from different documents refer to the same individual
• Use a Maximum Entropy model to give the probability that two names refer to the same individual
• Apply a modified agglomerative clustering technique to partition the instances
Related Work
• Mann and Yarowsky (2003)
– Bag of words, biographic information
– Clustering task
– Output two clusters
• Bagga and Baldwin (1998)
– Within-document coreference resolution
– Text surrounding each reference chain
– Vector space model
Data
• 2 million concept-instance pairs from the ACL dataset
– e.g. (president of the United States, William Jefferson Clinton)
• They randomly selected 2675 pairs.
• Each selected pair was matched with another concept-instance pair that had an identical instance name but a different concept name (see the sketch after this list).
• 1875 pairs for training (84%)
• 400 pairs for development (85.5%)
• 400 pairs for testing (83.5%)
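As an illustration of how such matched pairs could be constructed, here is a minimal Python sketch; the function name, the toy data, and the pairing logic are hypothetical and not taken from the paper.

from itertools import combinations
from collections import defaultdict

# Hypothetical illustration: group concept-instance tuples by the instance
# (person name) and pair up tuples that share the name but differ in concept.
def build_candidate_pairs(dataset):
    """dataset: iterable of (concept, instance) tuples."""
    by_name = defaultdict(list)
    for concept, instance in dataset:
        by_name[instance].append(concept)

    pairs = []
    for name, concepts in by_name.items():
        for c1, c2 in combinations(set(concepts), 2):
            pairs.append((name, c1, c2))  # same name, different concepts
    return pairs

sample = [
    ("president of the United States", "William Jefferson Clinton"),
    ("Arkansas governor", "William Jefferson Clinton"),
    ("saxophonist", "William Jefferson Clinton"),
]
print(build_candidate_pairs(sample))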
Features (1)
• Name features
• Web features
• Overlap features
• Semantic features
• Estimated statistics features
Features (2)
• Name:
– Name-Common: frequency of the name in census data
– Name-Fame: frequency of the name in the ACL dataset
– Web-Fame: number of hits in Google
• Web (computed from Google hit counts; see the sketch after this list):
– Web Intersection: the # of hits for the query (name + head1 + head2)
– Web Difference: abs((name + head1) – (name + head2))
– Web Ratio: (name + head1 + head2) / ((name + head1) + (name + head2))
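A minimal Python sketch of how these hit-count features might be computed, assuming a hit-count lookup function passed in by the caller; the lookup, the smoothing constant, and the toy counts are illustrative assumptions, not the authors' code.

def web_features(name, head1, head2, hit_count):
    """Sketch of the Web features on this slide.

    hit_count: a callable mapping a query string to a number of search
    engine hits (hypothetical stand-in for the Google counts used in 2004).
    """
    both = hit_count(f"{name} {head1} {head2}")
    with1 = hit_count(f"{name} {head1}")
    with2 = hit_count(f"{name} {head2}")
    return {
        "web_fame": hit_count(name),
        "web_intersection": both,
        "web_difference": abs(with1 - with2),
        # small constant guards against division by zero
        "web_ratio": both / (with1 + with2 + 1e-9),
    }

# Toy usage with canned counts standing in for real queries.
counts = {
    "Paul Simon": 1_000_000,
    "Paul Simon singer": 200_000,
    "Paul Simon senator": 50_000,
    "Paul Simon singer senator": 500,
}
print(web_features("Paul Simon", "singer", "senator", counts.get))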
Features (3)
• Overlap features (see the sketch after this list):
– Sentence-Count: # of words common to the contexts of both instances
– Sentence-TF: as above, but weighted by term frequency
– Concept-Count
– Concept-TF
• Semantic features:
– They employ five metrics from the literature that use WordNet to determine a semantic distance
– JCN, HSO, LCH, Lin, Res
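A minimal Python sketch of the sentence-overlap features, assuming each instance comes with its surrounding context as a string; the TF weighting shown here (counting frequencies within the two contexts) is one plausible reading of the slide, not necessarily the authors' exact formula.

from collections import Counter

def overlap_features(context1, context2):
    """Word-overlap features between the contexts of the two instances."""
    words1 = context1.lower().split()
    words2 = context2.lower().split()
    common = set(words1) & set(words2)

    # Sentence-Count: raw number of shared word types.
    sentence_count = len(common)

    # Sentence-TF: shared words weighted by their term frequency in the
    # two contexts (one plausible reading of "weighted by term frequency").
    tf = Counter(words1) + Counter(words2)
    sentence_tf = sum(tf[w] for w in common)

    return {"sentence_count": sentence_count, "sentence_tf": sentence_tf}

print(overlap_features(
    "the senator spoke in Washington about the bill",
    "the senator from Ohio spoke to reporters",
))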
Features (4)
• Estimated statistics features (see the sketch after this list):
– Concept information is very useful (e.g. politician, lawyer).
– p(i1 = i2 | i1 -> A, i2 -> B)
– p(i1 -> A, i2 -> B | i1 = i2)
– p(i1 -> A | i2 -> B) + p(i2 -> B | i1 -> A)
– p(i1 -> A, i2 -> B) / p(i1 -> A) + p(i2 -> B)
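One way such statistics can be estimated is by counting over labeled training pairs; the sketch below illustrates the idea for the first quantity, p(i1 = i2 | i1 -> A, i2 -> B), with hypothetical data structures rather than the authors' implementation.

from collections import Counter

# Hypothetical illustration: estimate p(i1 = i2 | i1 -> A, i2 -> B) from
# labeled pairs of the form (concept_a, concept_b, same_referent).
def estimate_same_given_concepts(training_pairs):
    pair_counts = Counter()
    same_counts = Counter()
    for concept_a, concept_b, same in training_pairs:
        key = tuple(sorted((concept_a, concept_b)))
        pair_counts[key] += 1
        if same:
            same_counts[key] += 1

    def p_same(concept_a, concept_b):
        key = tuple(sorted((concept_a, concept_b)))
        if pair_counts[key] == 0:
            return 0.0  # unseen concept pair; a real system would smooth here
        return same_counts[key] / pair_counts[key]

    return p_same

p_same = estimate_same_given_concepts([
    ("politician", "lawyer", True),
    ("politician", "lawyer", True),
    ("politician", "saxophonist", False),
])
print(p_same("politician", "lawyer"))       # 1.0
print(p_same("politician", "saxophonist"))  # 0.0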
Maximum Entropy Model
• They model the probability of two instances having the same referent, given a vector of features x, using a maximum entropy (log-linear) formulation; see the equation sketched below.
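Written out, a standard binary maximum entropy (log-linear) formulation of that probability is the following, where the f_i are feature functions over the feature vector x and the lambda_i are learned weights; the paper's exact notation may differ.

P(\mathrm{same} \mid \mathbf{x}) =
  \frac{\exp\!\big(\sum_i \lambda_i \, f_i(\mathbf{x}, \mathrm{same})\big)}
       {\sum_{y \in \{\mathrm{same},\,\mathrm{diff}\}} \exp\!\big(\sum_i \lambda_i \, f_i(\mathbf{x}, y)\big)}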
Experimental Results
Clustering (1)
• Use an agglomerative clustering algorithm to cluster the pairs that share an identical instance name (see the sketch after this list)
• Build a fully connected graph G with vertex set D
• Label each edge in G with a score (the Maximum Entropy model's probability that the two nodes corefer)
• While the edge with the maximum score in G exceeds a threshold, merge the two nodes connected by that edge
• For each node in the graph, merge its two edges connecting it to the newly merged node, and assign the new edge a score equal to the average of the two old edge scores
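A minimal Python sketch of this average-link merging procedure, assuming the pairwise scores from the classifier are given up front; the data structures, function name, and threshold value are illustrative assumptions.

def cluster(items, pair_scores, threshold=0.5):
    """Average-link agglomerative clustering over a fully connected graph.

    items: list of hashable ids (one per concept-instance pair)
    pair_scores: dict mapping frozenset({a, b}) -> model probability
    """
    clusters = {i: frozenset([item]) for i, item in enumerate(items)}
    edges = {
        frozenset({i, j}): pair_scores[frozenset({items[i], items[j]})]
        for i in range(len(items)) for j in range(i + 1, len(items))
    }

    while edges:
        best, score = max(edges.items(), key=lambda kv: kv[1])
        if score <= threshold:
            break
        a, b = tuple(best)
        new = max(clusters) + 1
        clusters[new] = clusters.pop(a) | clusters.pop(b)

        # Re-wire remaining nodes: the edge to the merged node gets the
        # average of the scores of the two old edges.
        for c in list(clusters):
            if c == new:
                continue
            s1 = edges.pop(frozenset({a, c}))
            s2 = edges.pop(frozenset({b, c}))
            edges[frozenset({new, c})] = (s1 + s2) / 2.0
        edges.pop(best)

    return list(clusters.values())

items = ["doc1#E.Sullivan", "doc2#E.Sullivan", "doc3#E.Sullivan"]
scores = {
    frozenset({"doc1#E.Sullivan", "doc2#E.Sullivan"}): 0.9,
    frozenset({"doc1#E.Sullivan", "doc3#E.Sullivan"}): 0.2,
    frozenset({"doc2#E.Sullivan", "doc3#E.Sullivan"}): 0.1,
}
print(cluster(items, scores))  # two referents: one covering doc1/doc2, one for doc3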
Clustering (2)
• The final output of this algorithm is a new graph in which each node represents a single referent associated with a set of concept-instance pairs.
• The algorithm is free to choose a different number of referents for each instance name.
Experimental Design
• They randomly selected a set of 31 instance names from the ACL dataset, 11 of which referred to multiple individuals and 20 of which had only a single referent.
• They use a simple clustering accuracy as their performance metric (a pairwise version is sketched after this list).
• They compare their algorithm with two baseline systems.
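The transcript does not spell out the accuracy metric; the sketch below shows one common pairwise formulation of clustering accuracy, purely as an illustration and not necessarily the authors' exact metric.

from itertools import combinations

def pairwise_accuracy(predicted, gold):
    """predicted, gold: dicts mapping each instance id to a cluster label."""
    items = sorted(gold)
    agree = total = 0
    for a, b in combinations(items, 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        agree += (same_pred == same_gold)
        total += 1
    return agree / total if total else 1.0

gold = {"d1": "clinton_politician", "d2": "clinton_politician", "d3": "clinton_other"}
pred = {"d1": 0, "d2": 1, "d3": 2}  # system split every document into its own referent
print(pairwise_accuracy(pred, gold))  # 2 of 3 pairs agree -> 0.666...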
Results
Conclusion
• They have presented a two-step methodology: a Maximum Entropy model scores pairs of instances, and a modified agglomerative clustering step partitions them into referents.
• The algorithm allows the number of referents to be set dynamically for each instance name, and outperforms two baseline methods.