Transcript: ARDA Symposium

Discriminative Training of Clustering Functions:
Theory and Experiments with Entity Identification
Xin Li & Dan Roth
University of Illinois, Urbana-Champaign
Page 1
Outline

- Clustering
  - Current approaches
  - Some problems
- Making Clustering a Learning problem
  - Supervised Discriminative Clustering Framework
- The Reference Problem:
  - Entity Identification within & across documents
Page 2
The Reference Problem
Kennedy
Document 1: The Justice Department has officially ended its inquiry into the assassinations
of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to
support conspiracy theories, according to department documents. The House Assassinations
Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a
conspiracy involving a second gunman, a finding that broke from the Warren Commission's
belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee
Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy
confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I
do not speak for my church on public matters, and the church does not speak for me.''
Document 3: David Kennedy was born in Leicester, England in 1959. … Kennedy co-edited The New Poetry (Bloodaxe
Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Page 3
Entity Identification in Text

- Goal: Given names, within or across documents, identify the real-world entities behind them.
- Problem Definition: Given a set of names and their semantic types ([People], [Locations],
  [Organizations]), partition them into groups that refer to different entities.
- Approaches:
  - A generative model [Li, Morie, Roth, NAACL'04]
  - A discriminative approach [Li, Morie, Roth, AAAI'04]
  - Other work [on citations, and more: Milche et al.; Bilenko et al., ...]
- Intuitively, a discriminative approach requires
  - using some similarity measure between names, followed by
  - clustering into clusters that represent entities.
Page 4
Clustering

An optimization procedure that takes:
- A collection of data elements
- A distance (similarity) measure on the space of data elements
- A partition algorithm

and attempts to optimize some quality measure with respect to the given distance metric.
Page 5
Clustering: Example (k = 4)
[figure]
Page 6
Example: K-means Clustering

An optimization problem:
- Data: X = {x_1, x_2, ...};  cluster names: C = {1, 2, 3, ..., K}
- The Euclidean distance: d(x_1, x_2) = [(x_1 - x_2)^T (x_1 - x_2)]^{1/2}
- Find a mapping f: X -> C that minimizes
      Σ_j Σ_{x ∈ C_j} d(x, μ_j)^2
  where μ_j = (1/|C_j|) Σ_{x ∈ C_j} x is the mean of the elements in the j-th cluster.
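As an illustration (not from the slides), a minimal NumPy sketch of this objective; the array layout of X, the random initialization, and the fixed iteration count are assumptions made for the example:

    import numpy as np

    def kmeans(X, K, n_iters=20, seed=0):
        """Alternately assign points to the nearest mean and recompute the means,
        locally minimizing sum_j sum_{x in C_j} d(x, mu_j)^2 for Euclidean d."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]   # initial means: random data points
        for _ in range(n_iters):
            # Assignment step: the mapping f: X -> C sends each x to its nearest mean.
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            f = dists.argmin(axis=1)
            # Update step: mu_j becomes the mean of the elements currently in cluster j.
            mu = np.array([X[f == j].mean(axis=0) if np.any(f == j) else mu[j]
                           for j in range(K)])
        return f, mu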
Page 7
Many NLP Applications

- Class-based language models: group similar words together based on their semantics
  (Dagan et al. 99; Lee et al.; Pantel and Lin, 2002).
- Document categorization and topic identification (Karypis, Han 99, 02).
- Co-reference resolution: build coreference chains of noun phrases (Cardie, Wagstaff 99).

In all cases: a fixed distance metric, tuned for the application and the data.
(And the algorithm?)
Page 8
Clustering: Metric and Algorithm

There is no 'universal' distance metric that is appropriate for all clustering algorithms.

    d_1(x, x') = [(f_1 - f_1')^2 + (f_2 - f_2')^2]^{1/2}
    d_2(x, x') = |(f_1 + f_2) - (f_1' + f_2')|

[figure panels: (a) Single-Linkage with Euclidean; (b) K-Means with Euclidean; (c) K-Means with a Linear Metric]

How do we make sure we have an appropriate metric, one that reflects the task and the designer's intentions?
Page 9
Additional information in Clustering
Page 10
Traditional Clustering Framework

- Typically unsupervised; no learning.
- More recently: work on metric learning with supervision
  [Bilenko & Mooney 03, 04; Xing et al. 03; Schultz & Joachims 03; Bach & Jordan 03]
  - Learning a metric, then clustering
  - Learning while clustering (algorithm specific)

Pipeline: unlabeled data set S -> [distance metric d + clustering algorithm A] -> a partition function h(S) = A_d(S) -> partition h(S)

Example: K-means with X = {x_1, x_2, ...}, C = {c_1, c_2, ..., c_k} and the Euclidean distance
    d(x, x') = [(x - x')^T (x - x')]^{1/2}
Page 11
Supervised Discriminative Clustering (SDC)

- Incorporates supervision directly into the metric training process;
- Training is driven by the true clustering error, computed via the chosen data partition algorithm.

Training stage: labeled data set S -> supervised learner -> [distance metric d + clustering algorithm A] -> a partition function h(S) = A_d(S)
  Goal: h* = argmin_h err_S(h, p)

Application stage: unlabeled data set S' -> h -> partition h(S')
Page 12
Elements of SDC: Partition Function and Error

- Goal: A partition function h maps a set of data points S to a partition h(S) of S
  (the outcome of a clustering algorithm). [Note the difference from multi-class classification.]
- The partition function h is a function of the parameterized distance metric
      d(x_1, x_2) = Σ_i w_i |x_i^1 - x_i^2|
- Error: Given a labeled data set S, the correct partition p(S) = {(x_i, c_i)}_1^m, and a fixed
  clustering algorithm A, the training process attempts to find d*, minimizing the clustering error:
      d* = argmin_d err_S(h, p),  where h(S) = A_d(S).

[figure: optimal (given) partition vs. learned partition]
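A one-line sketch of the parameterized metric above (illustrative; the elements' feature vectors are assumed to be numeric arrays):

    import numpy as np

    def weighted_l1(w, x1, x2):
        """d(x1, x2) = sum_i w_i * |x_i^1 - x_i^2| over the elements' feature vectors."""
        return float(np.dot(w, np.abs(np.asarray(x1) - np.asarray(x2))))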
Page 13
A Supervised Clustering Error

    err_S(h, p) = 1/|S|^2 Σ_{i,j} [ d(x_i, x_j) · A_ij + (D - d(x_i, x_j)) · B_ij ]

(as opposed to a quality function that depends only on the distance)

Two types of errors in the pairwise prediction (x_i, x_j) -> 'together' or 'apart':
    False negative: A_ij = I[ p(x_i) = p(x_j) and h(x_i) ≠ h(x_j) ]
    False positive: B_ij = I[ p(x_i) ≠ p(x_j) and h(x_i) = h(x_j) ]
    D = max_{i,j} d(x_i, x_j)

(See the paper for a comparison with other error functions.)
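A direct sketch of this error (illustrative; it assumes a precomputed pairwise distance matrix and integer cluster labels for p and h):

    import numpy as np

    def clustering_error(dist, p_labels, h_labels):
        """err_S(h, p) = 1/|S|^2 * sum_ij [ d_ij * A_ij + (D - d_ij) * B_ij ]."""
        p = np.asarray(p_labels)
        h = np.asarray(h_labels)
        same_p = p[:, None] == p[None, :]
        same_h = h[:, None] == h[None, :]
        A = same_p & ~same_h          # false negatives: together under p, apart under h
        B = ~same_p & same_h          # false positives: apart under p, together under h
        D = dist.max()
        return float((dist * A + (D - dist) * B).sum()) / len(p) ** 2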
Page 14
Training the distance function

A gradient-descent based algorithm over the labeled set S:
  1. Initialize the distance metric d.
  2. Cluster S using the algorithm h = A_d.
  3. Evaluate err_S(h, p).
  4. Update d, and repeat from step 2.
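A schematic version of this loop (an illustration, not the paper's exact update rule): it reuses the clustering_error sketch above, uses a finite-difference gradient in place of an analytic one, and assumes hypothetical helpers cluster_with(w, S) and pairwise_dist(w, S) that run the fixed clustering algorithm and compute the weighted distances:

    import numpy as np

    def train_metric(S, p, cluster_with, pairwise_dist, n_features,
                     lr=0.1, n_rounds=50, eps=1e-3):
        def err(w):
            h = cluster_with(w, S)                       # cluster S with the current metric
            return clustering_error(pairwise_dist(w, S), p, h)

        w = np.ones(n_features)                          # initialize the distance metric d
        for _ in range(n_rounds):
            # Estimate the gradient of the clustering error w.r.t. the metric weights.
            grad = np.array([(err(w + eps * e) - err(w)) / eps for e in np.eye(len(w))])
            w -= lr * grad                               # small step that most reduces the error
        return w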
Page 15
Training the distance function

The gradient-descent algorithm learns a metric iteratively by adjusting the parameter vector w
by a small amount in the direction that would most reduce the error.
Page 16
Entity Identification in Text

- Goal: Given names, within or across documents, identify the real-world entities behind them.
- Problem Definition: Given a set of names and their semantic types ([People], [Locations],
  [Organizations]), partition them into groups that refer to different entities.
- Approaches:
  - A generative model [Li, Morie, Roth, NAACL'04]
  - A discriminative approach [Li, Morie, Roth, AAAI'04]
Page 17
Parameterized Distance Metrics for Name Matching

    John F. Kennedy  ?  President Kennedy

Feature extraction: Φ(John F. Kennedy, President Kennedy) = (φ_1, φ_2, ...)

1. Fixed distance: a distance (similarity) metric d for names, e.g.
   - d(John F. Kennedy, President Kennedy) ≈ 0.6
   - d(Chicago Cubs, Cubs) ≈ 0.6
   - d(United States, USA) ≈ 0.7
2. A learned distance function, parameterized as a linear function over the features (kernelized):
       d(John F. Kennedy, President Kennedy) = Σ_i w_i φ_i
   Make it a pairwise classifier:
       h(John F. Kennedy, President Kennedy) = 'together'  iff  Σ_i w_i φ_i ≤ 0.5

The distance function can be trained separately, to optimize partition quality, or via SDC, to
minimize the clustering error (via gradient descent).
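A small sketch of these two uses of the learned linear function (illustrative; extract_features is a hypothetical placeholder for the relational feature extraction described next):

    import numpy as np

    def name_distance(w, phi):
        """d(n1, n2) = sum_i w_i * phi_i over the pair's feature vector phi."""
        return float(np.dot(w, phi))

    def pairwise_decision(w, phi, threshold=0.5):
        """'together' iff the learned linear score is at most the threshold."""
        return "together" if name_distance(w, phi) <= threshold else "apart"

    # phi = extract_features("John F. Kennedy", "President Kennedy")   # hypothetical helper
    # pairwise_decision(w, phi)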
Page 18
Features

Relational features are extracted from a pair of strings, taking into account relative positions
of tokens, substring relations, etc.
Page 19
Experimental Setting

- Names of people, locations and organizations, e.g.
  - John F. Kennedy, Bush, George W. Bush
  - U.S.A., United States, and America
  - University of Illinois, U. of I., IBM, International Business Machines
- 300 randomly picked New York Times news articles.
- 8,600 names, annotated by a named entity tagger and manually verified.
- Training sets contain names labeled with their global entity:
  - John F. Kennedy -> Kennedy1
  - President Kennedy -> Kennedy1
  - David Kennedy -> Kennedy2
- Data is available from http://l2r.cs.uiuc.edu/~cogcomp/
Page 20
Gain from Metric Learning while Clustering

- SoftTFIDF (Cohen et al.): a fixed metric.
- LMR (Li, Morie, Roth, AAAI'04): a metric learned via a pairwise classifier; relational features
  extracted from pairs of strings; feedback from pairwise labels.
- SDC: trains a linear weighted distance metric for the single-link clustering algorithm with
  labeled pairs of 600 names.
Page 21
Influence of Data Size
Page 22
Different Clustering Algorithms

The difference across clustering algorithms is not as significant as the difference obtained
from learning a good metric via SDC.
Page 23
Summary

- A framework for metric learning for clustering that is guided by global supervision, with
  clustering as part of the feedback loop.
- A parameterized distance metric is learned in a way that depends on the specific clustering
  algorithm used.
- Significant improvement shown on the Reference Problem: Entity Identification across documents.
Page 24
Intuition behind SDC
[figure; labels: d, K = 16]
Page 25
Relational Features

[figure: token-level alignment between two names (John | Kennedy vs. John | Kennedy | Davis),
with token positions 1, 2, 3]

Relational features do not depend on the specific tokens in the two names, but on some
abstraction over the tokens, e.g.:
- Honorific Equal: Mr., Mrs., President, Prof.
- Nickname: Thomas, Tom
- Edit Distance
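A toy sketch of features in this spirit (an illustration only; the honorific and nickname lists and the use of Levenshtein distance are assumptions, not the paper's feature set):

    HONORIFICS = {"mr.", "mrs.", "prof.", "president"}         # illustrative list
    NICKNAMES = {("thomas", "tom"), ("robert", "bob")}         # illustrative pairs

    def edit_distance(a, b):
        """Standard Levenshtein distance between two tokens."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def relational_features(name1, name2):
        """Features defined over abstractions of the tokens, not the tokens themselves."""
        t1, t2 = name1.lower().split(), name2.lower().split()
        return {
            "honorific_equal": t1[0] in HONORIFICS and t1[0] == t2[0],
            "nickname": any({t1[0], t2[0]} == {a, b} for a, b in NICKNAMES),
            "last_token_edit_distance": edit_distance(t1[-1], t2[-1]),
        }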
Page 26
Toward Concept-Based Text Understanding and Mining

- How to employ transitivity between names? (Michael Jordan; Michael; Jordan)
- Clustering: splitting a set of names.
  - Distance metrics: edit distance, SoftTFIDF, Jaro-Winkler
  - Clustering algorithms: Single-Link, Complete-Link, K-means, graph cut
Page 27
Outline

- Clustering
  - Current approaches
  - Some problems
- Making Clustering a Learning problem
  - Supervised Discriminative Clustering Framework
- The Reference Problem:
  - Entity Identification within & across documents
Page 28
Entity Identification in Text

- Goal: Given names, within or across documents, identify the real-world entities behind them.
- Problem Definition: Given a set of names and their semantic types ([People], [Locations],
  [Organizations]), partition them into groups that refer to different entities.
- Approaches:
  - A generative model [Li, Morie, Roth, NAACL'04]
  - A discriminative approach [Li, Morie, Roth, AAAI'04]
- Challenge: there are millions of entities in the world, but in training we can only see the
  names of a limited number of entities.
Page 29