Distributional Clustering of Words for Text Classification


Distributional Clustering of
Words for Text Classification
L. Douglas Baker
(Carnegie Mellon University)
Andrew Kachites McCallum
(Justsystem Pittsburgh Research Center)
Presentation by:
Thomas Walsh
(Rutgers University)
Clustering
• Define what it means for words to be
“similar”.
• “Collapse” the word space by grouping
similar words in “clusters”.
• Key Idea for Distributional Clustering:
– Class probabilities given the words in a labeled
document collection P(C|w) provide rules for
correlating words to classifications.
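As an illustration of estimating these class-given-word probabilities (a minimal sketch of my own, not code from the paper; documents are assumed to arrive as (tokens, label) pairs):

```python
from collections import defaultdict

def class_given_word(labeled_docs):
    """Estimate P(C | w) for each word from (tokens, label) pairs.

    Returns {word: {class_label: probability}}, where the probability is the
    fraction of the word's occurrences that fall in documents of that class.
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[word][label]
    for tokens, label in labeled_docs:
        for w in tokens:
            counts[w][label] += 1
    p_c_given_w = {}
    for w, by_label in counts.items():
        total = sum(by_label.values())
        p_c_given_w[w] = {c: n / total for c, n in by_label.items()}
    return p_c_given_w

docs = [(["nasa", "orbit", "launch"], "sci.space"),
        (["puck", "goal", "orbit"], "rec.sport.hockey")]
print(class_given_word(docs)["orbit"])   # {'sci.space': 0.5, 'rec.sport.hockey': 0.5}
```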
Voting
• Can be understood by a voting model:
• Each word in a document casts a weighted
vote for classification.
• Words that normally vote similarly can be
clustered together and vote with the average
of their weighted votes without negatively
impacting performance.
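For intuition (an illustrative toy example, not from the slides): if two words cast nearly identical per-class log-probability votes, replacing both with their average barely changes a document's total score:

```python
# Hypothetical per-class log-probability "votes" for two similar words.
votes = {"boat": {"rec.sport": -2.1, "sci.space": -5.0},
         "ship": {"rec.sport": -2.3, "sci.space": -4.8}}

# After clustering, every occurrence of either word casts the average vote.
avg_vote = {c: (votes["boat"][c] + votes["ship"][c]) / 2 for c in votes["boat"]}
print(avg_vote)   # {'rec.sport': -2.2, 'sci.space': -4.9}: class ranking unchanged
```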
Benefits of Word Clustering
• Useful semantic word clusters
– Automatically generates a “thesaurus”
• Higher classification accuracy
– Sort of; we’ll discuss this in the results section
• Smaller classification models
– Size reductions as dramatic as 50,000 features → 50 clusters
Benefits of Smaller Models
• Easier to compute – with the constantly increasing
amount of available text, reducing the memory
footprint is critical.
• Memory-constrained devices like PDAs could
then use text classification algorithms to organize
documents.
• More complex algorithms can be applied that
would be infeasible in 50,000 dimensions.
The Framework
• Start with training data consisting of:
– A set of classes C = {c1, c2, …, cm}
– A set of documents D = {d1, …, dn}
– Each document carries a class label
Mixture Models
• f(xi | θ) = Σk πk h(xi | λk)
• The mixture weights πk sum to 1.
• h is a distribution function for x (such as a
Gaussian), with λk as its parameters ((μ, Σ) in
the Gaussian case).
• Thus θ = (π1 … πK, λ1 … λK)
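As a concrete illustration of this definition (my own sketch, not from the paper), evaluating a one-dimensional Gaussian mixture density:

```python
import math

def gaussian_mixture_pdf(x, weights, means, stds):
    """f(x | θ) = Σk πk N(x; μk, σk) for a one-dimensional Gaussian mixture."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights πk must sum to 1"
    density = 0.0
    for pi_k, mu_k, sigma_k in zip(weights, means, stds):
        normal = math.exp(-0.5 * ((x - mu_k) / sigma_k) ** 2) / (sigma_k * math.sqrt(2 * math.pi))
        density += pi_k * normal
    return density

print(gaussian_mixture_pdf(0.5, weights=[0.3, 0.7], means=[0.0, 1.0], stds=[1.0, 0.5]))
```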
What is θ in this case?
• Assumption: one-to-one correspondence
between the mixture model components and
the classes.
• The class priors are contained in the vector
θ0:
– number of documents in each class / total
number of documents
What is θ in this case?
• The remaining entries of θ fall into disjoint
sets; the jth set contains the probability of
each word wt in the vocabulary V given the
class cj.
• N(wt, di) is the number of times word wt
appears in document di.
• P(cj | di) ∈ {0, 1}, since each training
document carries exactly one class label.
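A minimal sketch of estimating these parameters from the labeled training data (my own illustration; the add-one smoothing is an assumption, not something stated on this slide):

```python
from collections import Counter, defaultdict

def estimate_theta(labeled_docs, vocab):
    """Estimate class priors P(cj | θ) and word probabilities P(wt | cj; θ)
    from (tokens, label) pairs, using add-one (Laplace) smoothing."""
    # Priors: number of documents in each class / total number of documents.
    class_counts = Counter(label for _, label in labeled_docs)
    priors = {c: n / len(labeled_docs) for c, n in class_counts.items()}

    # Word counts per class: word_counts[cj][wt] = Σi N(wt, di) over docs in cj.
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)

    word_probs = {}
    for c in priors:
        total = sum(word_counts[c].values()) + len(vocab)
        word_probs[c] = {w: (word_counts[c][w] + 1) / total for w in vocab}
    return priors, word_probs
```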
Prob. of a given Document in the
Model
• The mixture model can be used to produce
documents with probability:
• Just the sum of the probability of generating
this document in the model over each class.
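(The formula itself did not survive in the transcript; from the mixture-model definition above it should be the sum, over the classes, of the class prior times the class-conditional document probability:)

P(di | θ) = Σj P(cj | θ) P(di | cj; θ)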
Documents as Collections of
Words
• Treat each document as an ordered
collection of word events.
• dik = the word in document di at position k.
• In general, each word depends on the
preceding words.
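(Written out, before any independence assumption, the chain rule gives:)

P(di | cj; θ) = Πk P(dik | cj; di1, …, di(k−1); θ)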
Apply Naïve Bayes Assumption
• Assume each word is independent of both
context and position.
• Where dik = wt
• Update formulas (2) and (1):
– (2) P(di | cj; θ) = Πk P(wt | cj; θ)
– (1) P(di | θ) = Σj P(cj | θ) Πk P(wt | cj; θ)
Incorporate Expanded Formulae
for θ
• We can calculate the model parameters θ
from the training data.
• Now we wish to calculate P(cj | di; θ), the
probability that document di belongs to
class cj.
Final Equation
• Combining (1) and (2) with Bayes’ rule:
P(cj | di; θ) = [ P(cj | θ) Πk P(dik | cj; θ) ] / [ Σr P(cr | θ) Πk P(dik | cr; θ) ]
• Numerator: the class prior times (2), the product of the probabilities of
each word in the document assuming class cj.
• Denominator: the sum over all classes cr of the class prior times the
product of the word probabilities assuming class cr.
• The value of cj that maximizes this quantity is the class assigned to the
document.
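A minimal sketch of this decision rule in code (my own illustration, reusing the priors and word_probs sketched earlier and working in log space to avoid underflow; the denominator can be ignored because it is the same for every class):

```python
import math

def classify(tokens, priors, word_probs):
    """Return argmax over cj of P(cj | θ) Πk P(dik | cj; θ), computed in log space."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in word_probs[c]:        # skip out-of-vocabulary words
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```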
Shortcomings of the Framework
• In real world data (documents) there isn’t
actually an underlying mixture model and
the independence assumption doesn’t
actually hold.
• But empirical evidence and some theoretical
work (Domingos and Pazzani, 1997) indicate
that the damage from this is negligible.
What about clustering?
• So assuming the Framework holds… how
does clustering fit into all this?
How Does Clustering affect
probabilities?
• The cluster’s class distribution is a weighted average:
the fraction of the cluster’s occurrences contributed by
wt times P(C | wt), plus the fraction contributed by ws
times P(C | ws).
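(Written out for a two-word cluster, this weighted average is:)

P(C | wt ∨ ws) = [ P(wt) / (P(wt) + P(ws)) ] · P(C | wt) + [ P(ws) / (P(wt) + P(ws)) ] · P(C | ws)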
Vs. other forms of learning
• Measures similarity based on the property it
is trying to estimate (the classes)
– Makes the supervision in the training data
really important.
• Clustering is based on the similarity of the
class variable distributions
• Key Idea: Clustering preserves the “shape”
of the class distributions.
Kullback-Leibler Divergence
• Measures the difference between class
distributions.
• D(P(C | wt) || P(C | ws)) = Σj P(cj | wt) log( P(cj | wt) / P(cj | ws) )
• If P(cj | wt) = P(cj | ws) for every class, each term
contributes log(1) = 0, so the divergence is 0.
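A small sketch of this computation (illustrative; it assumes both words' class distributions cover the same classes with nonzero probability):

```python
import math

def kl_divergence(p_c_given_wt, p_c_given_ws):
    """D( P(C|wt) || P(C|ws) ) = Σj P(cj|wt) log( P(cj|wt) / P(cj|ws) )."""
    return sum(p * math.log(p / p_c_given_ws[c])
               for c, p in p_c_given_wt.items() if p > 0)

print(kl_divergence({"sci": 0.8, "rec": 0.2}, {"sci": 0.6, "rec": 0.4}))  # > 0, and not symmetric
print(kl_divergence({"sci": 0.8, "rec": 0.2}, {"sci": 0.8, "rec": 0.2}))  # 0.0 for identical distributions
```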
Problems with K-L Divergence
• Not symmetric
• Denominator can be 0 if ws does not appear
in any documents of class cj.
K-L Divergence from the Mean
• Each word’s share of the cluster’s occurrences, times the K-L
divergence of that word’s class distribution from the cluster’s
merged (mean) distribution, summed over the cluster’s words.
• New and improved: uses a weighted average
instead of a simple mean.
• Justification: this fits clustering, because the members’
independent distributions are combined into the cluster’s statistics.
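(For a two-word cluster this reads, following the description above:)

[ P(wt) / (P(wt) + P(ws)) ] · D( P(C | wt) || P(C | wt ∨ ws) ) + [ P(ws) / (P(wt) + P(ws)) ] · D( P(C | ws) || P(C | wt ∨ ws) )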
Minimizing Error in Naïve Bayes
Scores
• Assuming uniform class priors allows us to
drop P(cj | θ) and the whole denominator
from (6).
• Performing a little algebra then yields a
cross-entropy term.
• So the error can be measured as the difference
in cross entropy caused by clustering.
Minimizing that difference leads to equation
(9), so clustering by this criterion minimizes
the error.
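(The cross-entropy expression on the slide is not reproduced in the transcript; for reference, the cross entropy between two class distributions is conventionally H(P(C | wt), P(C | ws)) = − Σj P(cj | wt) log P(cj | ws), though the exact quantity on the slide may differ.)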
The Clustering Algorithm
• Comparing the similarity of all possible word clusters
would be O(V²).
• Instead, a number M is set as the total number of
desired clusters
– More supervision
• M clusters initialized with the M words with the
highest mutual information to the class variable
• Properties: Greedy, scales efficiently
Algorithm
[Pseudocode slide for the greedy clustering of the class distributions P(C | wt); not reproduced in the transcript.]
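A sketch of a greedy procedure in the spirit of the description above (my own reconstruction, not the authors' exact pseudocode): keep M clusters, fold in one word at a time, and merge the two clusters whose class distributions are most similar.

```python
def cluster_words(words_by_mutual_info, M, divergence):
    """Greedy word-clustering sketch.

    words_by_mutual_info : words sorted by decreasing mutual information with the class variable
    M                    : desired number of clusters
    divergence           : function(cluster_a, cluster_b) -> dissimilarity of their class
                           distributions (e.g. the weighted K-L divergence from the mean)
    """
    # Initialize M singleton clusters with the highest-scoring words.
    clusters = [[w] for w in words_by_mutual_info[:M]]
    for w in words_by_mutual_info[M:]:
        clusters.append([w])                        # temporarily M + 1 clusters
        # Merge the pair of clusters with the most similar class distributions.
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda pair: divergence(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                             # back down to M clusters
    return clusters
```

This keeps the procedure greedy: each word is placed once, and the per-step cost depends on the number of clusters M rather than on the full vocabulary size.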
Related Work
• ChiMerge / Chi2
– Use distributional clustering to discretize numeric features
• Class-based clustering
– Uses amount that mutual information is reduced to
determine when to cluster
– Not effective in text classification
• Feature Selection by Mutual Information
– cannot capture dependencies between words
• Markov-blanket-based Feature Selection
– Also attempts to preserve the shape of P(C | wt)
• Latent Semantic Indexing
– Unsupervised, using PCA
The Experiment :
Competitors to Distributional
Clustering
• Clustering with LSI
• Information Gain Based Feature Selection
• Mutual-Information Feature Selection
• Feature selection involves cutting out
redundant words entirely
• Clustering instead combines these redundancies
The Experiment: Testbeds
• 20 Newsgroups
– 20,000 articles from 20 Usenet groups (approx. 62,000
words)
• ModApte “Reuters-21578”
– 9,603 training docs, 3,299 testing docs, 135 topics
(approx. 16,000 words)
• Yahoo! Science (July 1997)
– 6,294 pages in 41 classes (approx. 44,000 words)
– Very noisy data
20 Newsgroups Results
• Averaged over 5-20 trials
• Computational constraints forced Markov blanket
to a smaller data set (second graph)
• LSI uses only 1/3 training ratio
20 Newsgroups Analysis
• Distributional Clustering achieves 82.1% accuracy at
50 features, almost as good as having the full
vocabulary.
• More accurate than all non-clustering approaches
• LSI did not add any improvement to clustering
(claim: because it is unsupervised)
• On the smaller data set, D.C. reaches 80% accuracy
far more quickly than the others, in some cases doubling
their performance for small numbers of features.
• Claim: Clustering outperforms Feature selection
because it conserves information rather than
discarding it.
Speed in 20-Newsgroups Test
• Distributional Clustering: 7.5 minutes
• LSI: 23 minutes
• Markov Blanket: 10 hours
• Mutual information feature selection (???): 30 seconds
Reuters-21578 Results
• D.C. outperforms others for small numbers of
features
• Information-Gain based feature selection does better
for larger feature sets.
• In this data set, documents can have multiple labels.
Yahoo! Results
• Feature selection performs almost as well or better
in these cases
• Claim: The data is so noisy that it is actually
beneficial to “lose data” via feature selection.
Performance Summary
• Only a slight loss in accuracy despite
the reduction in feature space
• Preserves “redundant” information better
than feature selection.
• The improvement is not as drastic with
noisy data.
Improvements on Earlier D.C.
Work
• Does not show much improvement on
sparse data because the performance
measure is related to the data distribution
– D.C. preserves class distributions, even if these
are poor estimates to begin with.
• Thus this whole method relies on accurate
values for P(C | wi)
Future Work
• Improve D.C.’s handling of sparse data
(ensure good estimates of P(C | wi))
• Find ways to combine feature selection and
D.C. to utilize the strengths of both (perhaps
increase performance on noisy data sets?)
Some Thoughts
• Extremely supervised
• Needs to be retrained when new documents
come in
• In a document with many topics, does Naïve
Bayes (each word independent of context) make
sense?
• Didn’t work well in noisy data
• How can we ensure proper θ values?