Untitled Slide - University of Haifa

Unsupervised Methods
1
Association Measures
• Association between items: assoc(x,y)
– term-term, term-document, term-category, …
• Simple measure: freq(x,y), log(freq(x,y))+1
• Based on contingency table
2
Mutual Information
• The item corresponding to x,y in the Mutual Information for X,Y:
  MI(x,y) = log [ P(x,y) / (P(x) P(y)) ] = log [ P(x|y) / P(x) ] = log [ P(y|x) / P(y) ]
  (see the numeric sketch below)
• Disadvantage: the MI value is inflated for
low freq(x,y)
• Examples: results for two NLP articles
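A minimal sketch (not from the slides) of estimating pointwise MI from raw corpus counts; the words and numbers are invented for illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log P(x,y) / (P(x) P(y)),
    estimated from raw co-occurrence and marginal counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts: the pair occurs 30 times, x occurs 1,000 times and y 500 times
# in 1,000,000 co-occurrence windows.
print(pmi(30, 1000, 500, 1_000_000))   # ~4.09 (natural log)
```

Note how the low-frequency inflation mentioned above shows up: a pair seen only once whose members are also rare gets a very high score.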
3
Log-Likelihood Ratio Test
• Comparing the likelihood of the data given two
competing hypotheses (Dunning, 1993)
• Does not depend heavily on assumptions of
normality, can be applied to small samples
• Used to test if p(x|y) = p(x|~y) = p(x), by
comparing it to the general case (inequality)
• High log-likelihood score indicates that the
data is much less likely if assuming equality
4
Log-Likelihood (cont.)
• Likelihood function:  H(p1, p2, ...; k1, k2, ...) = H(ω; k)
• The likelihood ratio:
  λ = max_{ω ∈ Ω0} H(ω; k) / max_{ω ∈ Ω} H(ω; k)
• −2 log λ is asymptotically χ² distributed
• High −2 log λ: the data is less likely given Ω0
5
Log-Likelihood for Bigrams
• From the contingency table:
  p1 = p(x|y)  = a / (a + c) = k1 / n1
  p2 = p(x|~y) = b / (b + d) = k2 / n2
  p  = p(x)    = (a + b) / (a + b + c + d) = (k1 + k2) / (n1 + n2)
6
Log-Likelihood for Binomial
  H(p; n, k) = C(n, k) · p^k · (1 − p)^(n − k)        (C(n, k) = binomial coefficient)
  H(p1, p2; n1, k1, n2, k2) = H(p1; n1, k1) · H(p2; n2, k2)
  λ = max_p H(p, p; n1, k1, n2, k2) / max_{p1, p2} H(p1, p2; n1, k1, n2, k2)
• Maximum obtained for:
  p1 = k1 / n1 ;  p2 = k2 / n2 ;  p = (k1 + k2) / (n1 + n2)
  (see the code sketch below)
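A minimal sketch of the bigram log-likelihood ratio, assuming the usual cell layout (a = count(x,y), b = count(x,~y), c = count(~x,y), d = count(~x,~y)), which matches the p1, p2, p definitions above. Constant factors that cancel in the ratio are dropped, and the counts in the example are invented.

```python
import math

def log_binom(p, n, k):
    """Log of the binomial likelihood p^k (1 - p)^(n - k); the C(n, k)
    factor is omitted because it cancels in the ratio."""
    if p in (0.0, 1.0):
        return 0.0 if k in (0, n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(a, b, c, d):
    """-2 log(lambda) for the contingency table
    a = count(x,y), b = count(x,~y), c = count(~x,y), d = count(~x,~y)."""
    k1, n1 = a, a + c            # x inside y-contexts
    k2, n2 = b, b + d            # x inside non-y-contexts
    p1, p2 = k1 / n1, k2 / n2    # separate estimates (inequality hypothesis)
    p = (k1 + k2) / (n1 + n2)    # single estimate (equality hypothesis)
    return -2 * (log_binom(p, n1, k1) + log_binom(p, n2, k2)
                 - log_binom(p1, n1, k1) - log_binom(p2, n2, k2))

# Invented counts for a candidate bigram:
print(log_likelihood_ratio(a=20, b=180, c=80, d=99720))
```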
7
Measuring Term Topicality
• For query relevance ranking: Inverse Document
Frequency
• For term extraction:
  – Frequency
  – Frequency ratio for specialized vs. general corpus
  – Entropy of term co-occurrence distribution
  – Burstiness (see the sketch below):
    • Entropy of distribution (frequency) in documents
    • Proportion of topical documents for term (freq > 1) within
      all documents containing term (Katz, 1996)
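A minimal sketch of the Katz-style burstiness proportion from the last item above; the data structures and toy documents are assumptions made for illustration.

```python
from collections import Counter

def burstiness(doc_term_counts, term):
    """Proportion of 'topical' documents (term frequency > 1) among all
    documents that contain the term at all (cf. Katz, 1996).
    doc_term_counts: one Counter of term frequencies per document."""
    freqs = [counts[term] for counts in doc_term_counts if counts[term] > 0]
    if not freqs:
        return 0.0
    return sum(1 for f in freqs if f > 1) / len(freqs)

# Toy collection: "laser" is topical in one of the two documents containing it.
docs = [Counter({"laser": 5, "beam": 2}),
        Counter({"laser": 1}),
        Counter({"beam": 1})]
print(burstiness(docs, "laser"))   # 0.5
```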
8
Similarity Measures
• Cosine:
  sim(u,v) = Σ_att log freq(u,att) · log freq(v,att)
             / [ sqrt( Σ_att (log freq(u,att))² ) · sqrt( Σ_att (log freq(v,att))² ) ]
• Min/Max:
  sim(u,v) = Σ_att min( I(u,att), I(v,att) ) / Σ_att max( I(u,att), I(v,att) )
• KL to Average:
  A(u,v) = Σ_att [ P(att|u) · log( 2 P(att|u) / (P(att|u) + P(att|v)) )
                 + P(att|v) · log( 2 P(att|v) / (P(att|u) + P(att|v)) ) ]
  (see the code sketch below)
9
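A minimal sketch of the three measures on the slide above, over sparse attribute vectors; the dictionary representation is my assumption, and the association weights (log frequency, I(u,att), or P(att|u)) are supplied by the caller.

```python
import math

def cosine(u, v):
    """u, v: dicts att -> association weight (e.g. log frequency)."""
    num = sum(u[a] * v[a] for a in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return num / norm if norm else 0.0

def min_max(u, v):
    """Ratio of summed minima to summed maxima of the two weight vectors."""
    atts = u.keys() | v.keys()
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

def kl_to_average(pu, pv):
    """pu, pv: dicts att -> P(att|word), each summing to 1.
    Sum of the two KL divergences to the average distribution."""
    total = 0.0
    for a in pu.keys() | pv.keys():
        p, q = pu.get(a, 0.0), pv.get(a, 0.0)
        m = (p + q) / 2
        if p:
            total += p * math.log(p / m)
        if q:
            total += q * math.log(q / m)
    return total
```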
A Unifying Schema of Similarity
(with Erez Lotan)
• A general schema encoding most measures
• Identifies explicitly the important factors
that determine (word) similarity
• Provides the basis for:
– a general and efficient similarity computation
procedure
– evaluating and comparing alternative measures
and components
10
Mapping to Unified Similarity Scheme
[Figure: example sparse vectors for u and v – per attribute, count(u,att) and
count(v,att) together with the derived assoc(u,att) and assoc(v,att); shared
attributes are combined as joint(assoc(u,att), assoc(v,att))]
  W(u) = Σ_att g( assoc(u,att) )
  W(v) = Σ_att g( assoc(v,att) )
  SJ(u,v) = Σ_{att ∈ Both(u,v)} joint( assoc(u,att), assoc(v,att) )
  sim(u,v) = f_SIM(u,v) = SJ(u,v) / norm( SJ(u,v), W(u), W(v) )
11
Association and Joint Association
• assoc(u,att): quantify association strength
– mutual information, weighted log frequency,
conditional probability (orthogonal to scheme)
• joint(assoc(u,att),assoc(v,att)): quantify the
“similarity” of the two associations
– ratio, difference, min, product
• SJ(u,v) = Σ_{att ∈ Both(u,v)} joint( assoc(u,att), assoc(v,att) )
  Both(u,v) = { att | freq(u,att) > 0 , freq(v,att) > 0 }
12
Normalization
• Global weight of a word vector:
Justu    att
W u    g assoc u, att 
att Just u 
– For cosine:
W u  
frequ,att  0 
 assoc u, att
2
att  Justu 
• Normalization factor:
Norm_Factoru,v   normSJ u,v ,W u ,W v 
– For cosine:
13
Norm _ Factoru, v  W u W v
The General Similarity Scheme
simu,v  
SJ u,v 
SJ u, v 

Norm_Facto r u,v  normSJ u, v ,W u ,W v 
where
SJ (u, v) 
 joint assoc u, att ,assoc v,att 
att Bothu , v 
- For example - cosine :
SJ u,v 
simu,v  
W u   W v 
14
Min/Max Measures
  sim(u,v) = Σ_att min( assoc(u,att), assoc(v,att) ) / Σ_att max( assoc(u,att), assoc(v,att) )
• May be viewed as:
  joint(·,·) = min( assoc(u,att), assoc(v,att) ) / max( assoc(u,att), assoc(v,att) )
  weight(att) = max( assoc(u,att), assoc(v,att) )
15
Associations Used with Min/Max
• Log-frequency and Global Entropy Weight
(Grefenstette, 1994):
  assoc(u,att) = log( freq(u,att) + 1 ) · Gew(att)
  Gew(att) = 1 + ( 1 / log nrels ) · Σ_v P(v|att) · log P(v|att)     ∈ [0,1]
• Mutual information (Dagan et al., 1993/5):
  assoc(u,att) = log [ P(u,att) / (P(u) P(att)) ] = log [ P(att|u) / P(att) ] = log [ P(u|att) / P(u) ]
16
Cosine Measure
  cos(u,v) = Σ_att assoc(u,att) · assoc(v,att)
             / [ sqrt( Σ_att assoc(u,att)² ) · sqrt( Σ_att assoc(v,att)² ) ]
• Used for word similarity (Ruge, 1992) with:
  assoc(u,att) = ln( freq(u,att) )
• Popular for document ranking (vector space):
  assoc(doc, term) = tf · idf
  tf = freq(doc, term) / max_t freq(doc, t)
  idf = log( max_t docfreq(t) / docfreq(term) ) + 1
  (see the sketch below)
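A minimal sketch of the tf·idf association above; it follows the reconstruction of the tf and idf definitions on this slide (one common variant), and the toy collection statistics are invented.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, docfreq, max_docfreq):
    """Build a tf·idf attribute vector for one document.
    docfreq: term -> number of documents containing it;
    max_docfreq: the largest document frequency in the collection."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    vec = {}
    for term, freq in counts.items():
        tf = freq / max_freq
        idf = math.log(max_docfreq / docfreq[term]) + 1
        vec[term] = tf * idf
    return vec

# Invented statistics for a tiny three-document collection:
docfreq = {"insurance": 2, "law": 3, "policy": 1}
print(tfidf_vector(["insurance", "law", "law"], docfreq, max_docfreq=3))
```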
17
Methodological Benefits
Joint work with Erez Lotan (Dagan 2000 and in preparation)
• Uniform understanding of similarity measure structure
• Modular evaluation/comparison of measure components
• Modular implementation architecture, easy experimentation
by “plugging” alternative measure combinations
18
Empirical Evaluation
• Thesaurus for query expansion (e.g. “insurance laws”):
Similar words for law:

  Word           Similarity   Judgment
  regulation     0.050242     +
  rule           0.048414     +
  legislation    0.038251     +
  guideline      0.035041     +
  commission     0.034499     +
  bill           0.033414     +
  budget         0.031043     +
  regulator      0.031006     -
  code           0.030998
  circumstance   0.030534

• Precision and comparative Recall at each point in the list
19
Comparing Measure Combinations
[Plots: precision and recall for the compared measure combinations]
• Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20
  points); stable over association measures
20
Effect of Co-occurrence Type on
Semantic Similarity
21
Computational Benefits
• Efficient implementation through sparse matrix indexing
  – By computing over common attributes only (Both(u,v))
[Figure: sparse word-by-attribute matrix – words v1 … vj … vm indexed by
attributes att1 … atti … attn, from which the similarity results for u are
accumulated]
• Complexity reduced by “sparseness” factor –
  #non-zero cells / total #cells
  – Two orders of magnitude in corpus data
  (see the sketch below)
22
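A minimal sketch of the sparse computation described above: an inverted attribute index lets SJ scores accumulate only for words that share attributes with the target. The data layout and toy vectors are assumptions for illustration.

```python
from collections import defaultdict

def sj_scores(target, vectors, joint=min):
    """Accumulate un-normalized SJ(target, v) for every word v sharing at
    least one attribute with target, via an attribute -> words index.
    vectors: dict word -> {att: assoc weight}."""
    index = defaultdict(list)                 # att -> [(word, assoc)]
    for word, atts in vectors.items():
        for att, a in atts.items():
            index[att].append((word, a))

    scores = defaultdict(float)
    for att, a_target in vectors[target].items():
        for word, a_word in index[att]:       # only words with this attribute
            if word != target:
                scores[word] += joint(a_target, a_word)
    return dict(scores)                       # normalize afterwards as needed

vectors = {"folder": {"print": 2.1, "open": 1.4},
           "file": {"print": 1.8, "open": 1.1, "save": 0.9},
           "cupboard": {"open": 0.7}}
print(sj_scores("folder", vectors))   # {'file': 2.9..., 'cupboard': 0.7}
```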
General Scheme - Conclusions
• A general mathematical scheme
• Identifies the important factors for measuring
similarity
• Efficient general procedure based on scheme
• Empirical comparison of different measure
components (measure structure and assoc)
• Successful application in an Internet crawler
for thesaurus construction (small corpora)
23
Clustering Methods
• Input: A set of objects (words, documents)
• Output: A set of clusters (sets of elements)
• Based on a criterion for the quality of a class,
which guides cluster split/merge/modification
– a distance function between objects/classes
– a global quality function
24
Clustering Types
• Soft / Hard
• Hierarchical / Flat
• Top-down / bottom-up
• Predefined number of clusters or not
• Input:
  – all point-to-point distances
  – original vector representation for points,
    computing needed distances during clustering
25
Applications of Clustering
• Word clustering
– Constructing a hierarchical thesaurus
– Compactness and generalization in word
cooccurrence modeling (will be discussed later)
• Document clustering
– Browsing of document collections and search
query output
– Assistance in defining a set of supervised
categories
26
Hierarchical Agglomerative
Clustering Methods (HACM)
1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best scoring merge
4. Compute the merge score between the new cluster
and all other clusters
5. If more than one cluster remains, return to 3
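A minimal sketch of the HACM loop above, with single link as one possible merge score; other merge scores from the next slide (complete link, group average, …) would be passed in the same way. Everything here is an illustrative assumption, and the sketch simply recomputes all pair scores each round rather than caching them as in step 4.

```python
def hacm(points, distance, merge_score):
    """Start from singleton clusters and repeatedly perform the best-scoring
    merge (lowest score) until one cluster remains. Returns the merges made."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: merge_score(clusters[ab[0]], clusters[ab[1]],
                                              distance))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def single_link(c1, c2, distance):
    """Merge score: distance between the two nearest points of the clusters."""
    return min(distance(a, b) for a in c1 for b in c2)

print(hacm([0.0, 0.1, 1.0, 1.2],
           distance=lambda a, b: abs(a - b),
           merge_score=single_link))
```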
27
Types of Merge Score
• Minimal distance between the two candidates for
the merge. Alternatives for cluster distance:
  – Single link: distance between two nearest points
  – Complete link: distance between two furthest points
  – Group average: average pairwise distance for all points
  – Centroid: distance between the two cluster centroids
• Based on the “quality” of the merged class:
– Ward’s method: minimal increase in total within-group
sum of squares (average squared distance to centroid)
• Based on a global criterion (in Brown et al., 1992:
minimal reduction in average mutual information)
28
Unsupervised Statistics and
Generalizations for Classification
• Many supervised methods use cooccurrence
statistics as features or probability estimates
– eat a {peach,beach}
– fire a missile vs. fire the prime minister
• Sparse data problem: if alternative
cooccurrences never occurred, how to
estimate their probabilities, or their relative
“strength” as features?
29
Application: Semantic Disambiguation
Anaphora resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995)
The terrorist pulled the grenade from his pocket and
threw it at the policeman          (it → ?)
Traditional AI-style approach:
manually encoded semantic preferences/constraints
[Diagram: hand-built hierarchies – Weapon > Bombs > grenade and
Actions > Cause_movement > throw, drop – linked by an <object – verb> preference]
30
Statistical Approach
[Figure: a “semantic” judgment derived from corpus (text collection) counts]
  <verb–object: throw-grenade>   20 times
  <verb–object: throw-pocket>     1 time
• Semantic confidence combined with syntactic preferences
  ⇒ it → grenade
• “Language modeling” for disambiguation
31
What about sense disambiguation?
(for translation)
I bought soap bars     – sense1 (‘chafisa’) or sense2 (‘sorag’)?
I bought window bars   – sense1 (‘chafisa’) or sense2 (‘sorag’)?
Corpus (text collection) counts per sense:
  Sense1:
    <noun-noun: soap-bar>        20 times
    <noun-noun: chocolate-bar>   15 times
  Sense2:
    <noun-noun: window-bar>      17 times
    <noun-noun: iron-bar>        22 times
• “Hidden” senses – supervised labeling required?
32
Solution: Mapping to Another Language
English(-English)-Hebrew Dictionary:
bar1 → ‘chafisa’
bar2 → ‘sorag’
soap → ‘sabon’
window → ‘chalon’
Map ambiguous constructs to the second language (all possibilities), and count
them in a Hebrew corpus:
<noun-noun: soap-bar>
  1 <noun-noun: ‘chafisat-sabon’>   20 times
  2 <noun-noun: ‘sorag-sabon’>       0 times
<noun-noun: window-bar>
  1 <noun-noun: ‘chafisat-chalon’>   0 times
  2 <noun-noun: ‘sorag-chalon’>     15 times
• Exploiting the difference in ambiguities
• Principle – intersecting redundancies (see the sketch below)
(Dagan and Itai 1994)
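A bare-bones sketch of the count-comparison idea above; the full model adds the statistical confidence bound on the next slide, and the counts below are the slide's illustrative numbers.

```python
def choose_translation(alternatives, target_counts):
    """Pick the target-language rendering of an ambiguous construct by
    comparing counts of the candidate target constructs in a target corpus."""
    return max(alternatives, key=lambda alt: target_counts.get(alt, 0))

# Counts as in the example above (gathered from a Hebrew corpus):
hebrew_counts = {"chafisat-sabon": 20, "sorag-sabon": 0,
                 "chafisat-chalon": 0, "sorag-chalon": 15}
print(choose_translation(["chafisat-sabon", "sorag-sabon"], hebrew_counts))
print(choose_translation(["chafisat-chalon", "sorag-chalon"], hebrew_counts))
# -> chafisat-sabon (soap bar), sorag-chalon (window bar)
```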
33
Selection Model Highlights
• Multinomial model, under certain linguistic assumptions
• Selection “confidence” – a lower bound for the log odds-ratio:
  Conf(i) = ln( p_i / p_j ) ≥ ln( n_i / n_j ) − Z_(1−α) · sqrt( 1/n_i + 1/n_j )
  (see the sketch below)
• Overlapping ambiguous constructs are resolved through
constraint propagation, by decreasing confidence order.
• Results (Hebrew→English):
Coverage: ~70% Precision within coverage: ~90%
– ~20% improvement over choosing most frequent translation
(the common baseline)
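A minimal sketch based on the reconstructed bound above (treat the exact form as an assumption); n_i and n_j are the corpus counts of the two competing alternatives, and z is the normal quantile for the chosen confidence level.

```python
import math

def selection_confidence(n_i, n_j, z=1.645):
    """Lower bound on ln(p_i / p_j) from the counts of the two leading
    alternatives; z = 1.645 corresponds to a one-sided 95% level."""
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

# throw-grenade (20) vs. throw-pocket (1), as in the earlier example:
print(selection_confidence(20, 1))   # ~1.3, so the selection is made
```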
34
Data Sparseness and Similarity
<verb–object: ‘hidpis-tikiya’>  →  ?
  <verb–object: print-folder>         0 times
  <verb–object: print-file_cabinet>   0 times
• Standard approach: “back-off” to single term frequency
• Similarity-based inference:
[Diagram: words similar to “folder” – file, directory, record, … – and words
similar to “file_cabinet” – cupboard, closet, suitcase, … – each checked as the
object of “print” in <verb-object> pairs]
35
Computing Distributional Similarity
[Diagram: verb contexts shared by “folder” and its similar word “file” –
print, erase, open, retrieve, browse, save, …]
• Association between word u (“folder”) and its “attributes”
(context words/features) is based on mutual information:
I u, att  log2

 Patt, u  
  P att u  


 PattPu  
  log2 
 Patt 




• Similarity between u and v (weighted Jaccard, [0,1]):
  sim(u,v) = Σ_att min( I(u,att), I(v,att) ) / Σ_att max( I(u,att), I(v,att) )
36
Disambiguation Algorithm
Selection of preferred alternative:
• Hypothesized similarity-based frequency derived from
average association for similar words
(incorporating single term frequency)
• Comparing hypothesized frequencies
[Diagram (as on slide 35): words similar to “folder” and to “file_cabinet”,
each checked as the object of “print” in <verb-object> pairs]
37
Computation and Evaluation
• Heuristic search used to speed computation of k most similar
words
• Results (Hebrew→English):
• 15% coverage increase, while decreasing precision by 2%
• Accuracy 15% better than back-off to single word
frequency
(Dagan, Marcus and Markovitch 1995)
38
Probabilistic Framework - Smoothing
• Counts are obtained from a sample of the probability space
• Maximum Likelihood Estimate proportional to sample counts:
  – MLE estimate: 0 probability for unobserved events
• Smoothing discounts observed events, leaving probability
  “mass” to unobserved events:
  – discounted estimate for observed events
  – positive estimate for unobserved events
39
Smoothing Conditional Attribute
Probability
• Good-Turing smoothing scheme – discount & redistribute:
  P(att|u) = Pd(att|u)                                   if count(att,u) > 0
           = norm(u) · 1 / |{att : count(att,u) = 0}|    if count(att,u) = 0
• Katz's seminal back-off scheme (speech language modeling):
  P(att|u) = norm(u) · P(att)                            if count(att,u) = 0
• Similarity-based smoothing (Dagan, Lee, Pereira 1999):
  P(att|u) = norm(u) · P_SIM(att|u)                      if count(att,u) = 0
  where
  P_SIM(att|u) = Σ_{u′} [ f_SIM(u,u′) / Σ_{u″} f_SIM(u,u″) ] · P(att|u′)
  (see the sketch below)
40
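A minimal sketch of the P_SIM estimate above; the neighbor lists, similarity function, and probabilities are invented placeholders.

```python
def p_sim(att, u, p_cond, f_sim, neighbors):
    """Similarity-weighted average of P(att|u') over words u' similar to u,
    with weights f_SIM(u,u') normalized to sum to 1.
    p_cond[u'] is a dict att -> P(att|u'); neighbors[u] lists the u'."""
    total = sum(f_sim(u, w) for w in neighbors[u])
    return sum(f_sim(u, w) / total * p_cond[w].get(att, 0.0)
               for w in neighbors[u])

# Invented numbers: estimating P(print|folder) from "file" and "directory".
p_cond = {"file": {"print": 0.12, "open": 0.30},
          "directory": {"print": 0.05, "open": 0.40}}
neighbors = {"folder": ["file", "directory"]}
sims = {("folder", "file"): 0.8, ("folder", "directory"): 0.5}
print(p_sim("print", "folder", p_cond, lambda a, b: sims[(a, b)], neighbors))
```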
Similarity/Distance Functions for
Probability Distributions
• Jensen-Shannon divergence (KL-distance to the average):
  A(u,v) = ½ [ D_KL( P_u ‖ (P_u + P_v)/2 ) + D_KL( P_v ‖ (P_u + P_v)/2 ) ]
  where ((P_u + P_v)/2)(att) = ½ ( P_u(att) + P_v(att) )
  Information loss by approximating u and v by their average
  f_SIM^A(u,v) = exp( −β · A(u,v) )
  β controls the relative influence of close vs. remote neighbors
• L1 norm:
  L(u,v) = Σ_att | P(att|u) − P(att|v) |
  f_SIM^L(u,v) = ( 2 − L(u,v) )^β
  (see the sketch below)
41
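A minimal sketch of the two similarity functions on the slide above, over attribute distributions represented as dicts; the distribution format and default β are assumptions.

```python
import math

def jensen_shannon(pu, pv):
    """A(u,v): average KL divergence of P_u and P_v to their mean.
    pu, pv: dicts att -> probability, each summing to 1."""
    total = 0.0
    for att in pu.keys() | pv.keys():
        p, q = pu.get(att, 0.0), pv.get(att, 0.0)
        m = (p + q) / 2
        if p:
            total += 0.5 * p * math.log(p / m)
        if q:
            total += 0.5 * q * math.log(q / m)
    return total

def sim_A(pu, pv, beta=1.0):
    """f_SIM^A(u,v) = exp(-beta * A(u,v)); larger beta favors close neighbors."""
    return math.exp(-beta * jensen_shannon(pu, pv))

def sim_L(pu, pv, beta=1.0):
    """f_SIM^L(u,v) = (2 - L1(u,v))^beta for the L1 distance of the two
    conditional attribute distributions."""
    l1 = sum(abs(pu.get(a, 0.0) - pv.get(a, 0.0)) for a in pu.keys() | pv.keys())
    return (2 - l1) ** beta
```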
Sample Results
• Most similar words to “guy”:
  Measure   Closest Words
  A         guy kid thing lot man mother doctor friend boy son
  L         guy kid lot thing man doctor girl rest son bit
  PC        role people fire guy man year lot today way part
  Typical common verb contexts: see get give tell take …
  PC: an earlier attempt for similarity-based smoothing
• Several smoothing experiments (A performed best):
  – Language modeling for speech (hunt bears / pears)
  – Perplexity (predicting test corpus likelihood)
  – Data recovery task (similar to sense disambiguation)
  – Insensitive to exact value of β
42
Class-Based Generalization
• Obtain a cooccurrence-based clustering of words
and model a word cooccurrence by word-class or
class-class cooccurrence
• Brown et al., CL 1992: Mutual information
clustering; class-based model interpolated with an n-gram model
• Pereira, Tishby, Lee, ACL 1993: soft, top-down
distributional clustering for bigram modeling
• Similarity/class-based methods: general
effectiveness yet to be shown
43
Conclusions
• (Relatively) simple models cover a wide
range of applications
• Usefulness in (hybrid) systems: automatic
processing and knowledge acquisition
44
Discussion
45