Distributional Clustering of
English Words
Fernando Pereira - AT&T Bell Laboratories
Naftali Tishby - Dept. of Computer Science, Hebrew University
Lillian Lee - Dept. of Computer Science, Cornell University
Presenter - Juan Ramos, Dept. of Computer Science, Rutgers
University, [email protected]
Overview
• Purpose: evaluate a method for clustering
words according to their distribution in
particular syntactic contexts.
• Methodology: find the lowest-distortion set
of word clusters to build models of word
co-occurrence.
Applications
• Scientific POV: lexical acquisition of words
• Practical POV: classification addresses data
sparseness in grammar models.
• Address clusters in large corpus of
documents
Definitions
• Context: function of given word in its
sentence.
– E.g.: a noun as a direct object
• Sense class: hidden model describing
word association tendencies
– Mix of cluster and cluster probability given a
word
• Cluster: probabilistic concept of a sense
class
Problem Setting
• Restrict problem to verbs (V) and nouns
(N) in main verb-direct object relationship
• f (v, n) = frequencies of occurrence of
verb, noun pairs
– Text must be pre-formatted to fit specifications
• For a given noun n, the conditional distribution is
p(v|n) = f(v,n) / sum(v', f(v',n))
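As a concrete illustration, here is a minimal Python sketch (not from the paper's own tooling) of estimating p(v|n) from verb-noun pair frequencies; the toy pairs are invented for the example:

```python
# Estimate p(v|n) = f(v,n) / sum(v', f(v',n)) from co-occurrence counts.
from collections import Counter, defaultdict

pairs = [("fire", "gun"), ("fire", "employee"), ("fire", "gun")]  # toy data

f = Counter(pairs)                  # f[(v, n)] = frequency of the pair
totals = defaultdict(int)           # per-noun totals: sum(v', f(v', n))
for (v, n), count in f.items():
    totals[n] += count

def p_v_given_n(v, n):
    """Conditional distribution of verbs for a given noun n."""
    return f[(v, n)] / totals[n] if totals[n] else 0.0

print(p_v_given_n("fire", "gun"))   # 1.0 in this toy corpus
```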
Problem Setting cont.
• Goal: create a set C of clusters and
membership probabilities p(c|n).
• Each c in C is associated with a cluster
centroid p_c, a distribution over V
– p_c(v) is a weighted average of p(v|n) over
the nouns n in the cluster.
Distributional Similarity
• Given two distributions p, q, the KL distance is
D(p || q) = sum(x, p(x) * log(p(x)/q(x)))
– D(p || q) = 0 implies p = q
– Small D(p || q) implies the two distributions
are likely instances of the same centroid
• D(p || q) measures the information lost by
using q (e.g. a cluster centroid p_c) in place
of p.
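A minimal sketch of this KL distance, assuming distributions are represented as Python dicts over the same event space:

```python
# KL distance D(p || q) = sum(x, p(x) * log(p(x)/q(x))).
import math

def kl_divergence(p, q):
    """Conventions: 0 * log(0/q) = 0; D is infinite if q(x) = 0
    somewhere that p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q))   # > 0; it is 0 only when p == q
```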
Theoretical Foundation
• Given unstructured sets V and N, training
data X of independent verb-noun pairs
• Problem: learn the joint distribution of the
pairs given X
• Not quite unsupervised, not quite
supervised
– No internal structure in pairs
– Learn underlying distribution
Distributional Clustering
• Approximately decompose p(v|n) into p’(v|n)
= sum(c in C, p(c|n) * p(v|c))
– p(c|n) = membership probability of n in c
– p(v|c) = probability of v given the centroid
for c
• Assuming the marginals p(n) and p’(n)
coincide, p’(n,v) = sum(c in C, p(c) * p(n|c) * p(v|c))
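A toy sketch of this decomposition; the cluster parameters below are illustrative placeholders, not values fitted by the paper's procedure:

```python
# p'(v|n) = sum(c in C, p(c|n) * p(v|c)) with two made-up clusters.
clusters = ["c1", "c2"]
p_c_given_n = {("c1", "gun"): 0.8, ("c2", "gun"): 0.2}    # memberships p(c|n)
p_v_given_c = {("fire", "c1"): 0.6, ("fire", "c2"): 0.1}  # centroids p(v|c)

def p_model(v, n):
    """Model conditional probability of verb v given noun n."""
    return sum(p_c_given_n[(c, n)] * p_v_given_c[(v, c)] for c in clusters)

print(p_model("fire", "gun"))   # 0.8*0.6 + 0.2*0.1 = 0.50
```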
Maximum Likelihood Cluster
Centroids
• Centroids are chosen to maximize the
goodness of fit between the data and p’(n,v)
• For a sequence of pairs S, the model log
probability is
l(S) = sum((n,v) in S, log p’(n,v))
– Maximize with respect to p(n|c) and p(v|c)
– Setting the variation of l(S) to zero yields
the maximum-likelihood conditions on the
centroids
Maximum Entropy Cluster
Membership
• Assume independence between variations
of p(n|c) and p(v|c).
– Bayes inverses p(n|c) can be computed from
the memberships p(c|n)
– The p(v|c) that maximize l(S) also minimize
the average distortion between the cluster
model and the data
Entropy Cluster Membership cont.
• Average cluster distortion:
<D> = sum(n in N, sum(c in C, p(c|n) * D(p_n || p_c)))
• Membership entropy:
H = -sum(n in N, sum(c in C, p(c|n) * log p(c|n)))
Entropy Cluster Membership cont.
• Class and membership distributions:
p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n)
p(n|c) = exp(-beta * D(p_n || p_c)) / Z(c)
– Z(c) and Z(n) are normalization sums
• Substituting these into l(S) reduces the
log-likelihood to a function of the centroids
p(v|c) alone
• At a maximum, the variation of l(S) vanishes
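A sketch of this maximum-entropy membership rule: p(c|n) is a Boltzmann distribution in the KL distortion, normalized by Z(n). The kl helper restates the earlier KL distance; the centroids and p_n below are toy values:

```python
# p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n)
import math

def kl(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def memberships(p_n, centroids, beta):
    weights = {c: math.exp(-beta * kl(p_n, p_c))
               for c, p_c in centroids.items()}
    z_n = sum(weights.values())               # normalization sum Z(n)
    return {c: w / z_n for c, w in weights.items()}

centroids = {"c1": {"a": 0.9, "b": 0.1}, "c2": {"a": 0.5, "b": 0.5}}
p_n = {"a": 0.8, "b": 0.2}
print(memberships(p_n, centroids, beta=2.0))  # soft assignment of n
```

Raising beta sharpens the assignment toward the nearest centroid; as beta approaches 0, the membership becomes uniform.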
KL Distortion
• Minimize the KL distortion by setting the
variation of the KL distances to zero
– The resulting centroid is a weighted average
of the noun distributions:
p(v|c) = sum(n, p(n|c) * p(v|n))
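A sketch of this centroid update with invented toy inputs; the p(n|c) weights are assumed given (they come from Bayes-inverting the memberships):

```python
# p(v|c) = sum(n, p(n|c) * p(v|n)): weighted average of noun distributions.
def centroid(p_n_given_c, p_v_given_n, verbs):
    return {v: sum(w * p_v_given_n[n].get(v, 0.0)
                   for n, w in p_n_given_c.items())
            for v in verbs}

p_n_given_c = {"gun": 0.7, "missile": 0.3}    # toy p(n|c) for one cluster
p_v_given_n = {"gun": {"fire": 1.0},
               "missile": {"fire": 0.5, "launch": 0.5}}
print(centroid(p_n_given_c, p_v_given_n, ["fire", "launch"]))
# {'fire': 0.85, 'launch': 0.15}
```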
Free Energy Function
• Combined minimum distortion and maximum
entropy is equivalent to minimizing the free
energy F = <D> - H/beta
• F determines <D> and H through its partial
derivatives:
<D> = d(beta*F)/d(beta), H = beta^2 * dF/d(beta)
• The minimum of F balances the disordering
tendency of entropy maximization against the
ordering tendency of distortion minimization
Hierarchical Clustering
• The number of clusters is determined through
a sequence of increases of beta.
– Higher beta implies more local influence of
each noun on the definition of the centroids.
• Start with low beta and a single c in C
– Search for the lowest beta that splits c into
two or more leaf clusters
– Repeat until |C| reaches the desired size, as
sketched below.
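A compressed, runnable sketch of this annealing search. The twin-perturbation split test (duplicate each centroid with small noise, re-estimate, then merge twins that stayed together) is a common deterministic-annealing heuristic assumed here, not necessarily the paper's exact criterion:

```python
import math, random

def kl(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def average(dists):
    vs = set().union(*dists)
    return {v: sum(d.get(v, 0.0) for d in dists) / len(dists) for v in vs}

def reestimate(centroids, dists, beta, iters=25):
    """Alternate membership (p(c|n)) and centroid updates at fixed beta."""
    vs = set().union(*dists)
    for _ in range(iters):
        member = []
        for p_n in dists:
            d = [kl(p_n, c) for c in centroids]
            m0 = min(d)
            w = [math.exp(-beta * (x - m0)) for x in d]  # stable softmax
            z = sum(w)                                   # rescaled Z(n)
            member.append([x / z for x in w])
        centroids = []
        for j in range(len(member[0])):
            tot = sum(m[j] for m in member) or 1e-12
            centroids.append({v: sum(m[j] * d.get(v, 0.0)
                                     for m, d in zip(member, dists)) / tot
                              for v in vs})
    return centroids

def anneal(dists, max_clusters, beta=1.0, rate=1.5):
    centroids = [average(dists)]          # start with a single cluster
    while len(centroids) < max_clusters and beta < 1e6:
        beta *= rate                      # raise beta until a split occurs
        twins = [{v: p * random.uniform(0.98, 1.02) for v, p in c.items()}
                 for c in centroids for _ in (0, 1)]
        refit = reestimate(twins, dists, beta)
        kept = []                         # merge twins that did not drift apart
        for c in refit:
            if all(kl(c, k) > 1e-3 for k in kept):
                kept.append(c)
        centroids = kept[:max_clusters]
    return centroids

# Toy noun distributions p(v|n): two 'weapon-like' and two 'employee-like'
dists = [{"shoot": 0.9, "aim": 0.1}, {"shoot": 0.8, "aim": 0.2},
         {"hire": 0.7, "promote": 0.3}, {"hire": 0.6, "promote": 0.4}]
print(len(anneal(dists, max_clusters=2)))   # expect 2 leaf clusters
```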
Experimental Results
• Classify 64 nouns appearing as direct
objects of the verb ‘fire’ in 1988 Associated
Press documents, where |V| = 2147.
• The four words most similar to each cluster
centroid, and their KL distances, are shown
for the first splits.
– Split 1: cluster of ‘fire’ as discharging weapons
vs. cluster of ‘fire’ as releasing employees
– Split 2: weapons as projectiles vs. weapons
as guns.
Clustering on Verb ‘fire’
Evaluation
Evaluation cont.
Conclusions
• Clustering is efficient, informative, and
returns good predictions
• Future work
– Make clustering method more rigorous
– Introduce human judgment, i.e. a more
supervised approach
– Extend model to other word relationships