A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining
Farial Shahnaz
Topics
• Introduction
• Algorithm
• Performance
• Observation
• Conclusion and Future Work
Introduction
Basic Concepts
• Text Mining: detection of trends or patterns in text data
• Clustering: grouping or classifying documents based on similarity of content
Clustering
• Manual vs. automated
• Supervised vs. unsupervised
• Hierarchical vs. partitional
Clustering
• Objective: automated, unsupervised, partitional clustering of text data (documents)
• Method: Nonnegative Matrix Factorization (NMF)
Vector Space Model of Text Data
• Documents represented as m-dimensional vectors
– m: number of terms in the dictionary
– vector component: importance of the corresponding term
• Document collection represented as a term-by-document matrix
Term-by-Document Matrix
• Terms in the dictionary, m = 9: (a, brown, dog, fox, jumped, lazy, over, quick, the)
• Document 1: a quick brown fox
• Document 2: jumped over the lazy dog
Term-by-Document Matrix

Term     D1  D2
a         1   0
brown     1   0
dog       0   1
fox       1   0
jumped    0   1
lazy      0   1
over      0   1
quick     1   0
the       0   1
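As an illustration (not part of the original slides), a minimal NumPy sketch of how this toy matrix could be assembled, assuming raw term counts as the importance weights:

import numpy as np

# Dictionary of m = 9 terms, in the order listed on the slide.
terms = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
docs = ["a quick brown fox", "jumped over the lazy dog"]

# Term-by-document matrix V (m terms x n documents). Raw counts are
# used here; weightings such as tf-idf would plug in the same way.
V = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for token in doc.split():
        V[terms.index(token), j] += 1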
Clustering Method: NMF
• Low-rank approximation of large sparse matrices
• Preserves data nonnegativity
• Introduces the concept of parts-based representation (Lee and Seung, Nature, 1999)
Other Methods
• Other rank-reduction methods:
– Principal Component Analysis (PCA)
– Vector Quantization (VQ)
• Produce basis vectors with negative entries
• Additive and subtractive combinations of basis vectors yield the original document vectors
NMF
• Produces nonnegative basis vectors
• Additive combinations of basis vectors yield the original document vectors
Term-by-Document Matrix (all entries nonnegative)
NMF
• Basis vectors interpreted as semantic features or topics
• Documents clustered on the basis of shared features
NMF
• Demonstrated by Xu et al. (2003):
– Outperforms Singular Value Decomposition (SVD)
– Comparable to graph-partitioning methods
Algorithm
NMF: Definition
Given:
• S: document collection
• Vmxn: term-by-document matrix
• m: number of terms in the dictionary
• n: number of documents in S
NMF: Definition
NMF is defined as:
• A low-rank approximation of Vmxn with respect to some metric
• Factor V into the product WH
– Wmxk: contains the basis vectors
– Hkxn: contains the linear combinations (encodings)
– k: selected number of topics or basis vectors, with k << min(m, n)
NMF: Common Approach
• Minimize the objective function

min ||V − WH||²_F subject to W ≥ 0, H ≥ 0
NMF: Existing Methods
Multiplicative Method (MM) [Lee and Seung]
• Based on multiplicative update rules
• ||V − WH|| is monotonically nonincreasing, and constant iff W and H are at a stationary point
• A version of the Gradient Descent (GD) optimization scheme
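For concreteness (an illustration, not from the slides), a minimal NumPy sketch of the multiplicative update rules for the Frobenius-norm objective; the eps guard against division by zero is an implementation detail:

import numpy as np

def nmf_mm(V, k, iters=200, eps=1e-9, seed=0):
    # Multiplicative Method: alternately rescale H and W so that
    # ||V - WH||_F is monotonically nonincreasing.
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H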
NMF: Existing Methods
Sparse Encoding [Hoyer]
• Based on a study of neural networks
• Enforces statistical sparsity of H
– Minimizes the sum of the nonzero entries of H (see the sketch below)
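A hedged sketch of this idea: adding an L1-style penalty λ on the entries of H to the Frobenius objective yields a multiplicative update with λ in the denominator, which shrinks H toward zero (the exact update in Hoyer's paper may differ in detail):

import numpy as np

def sparse_H_update(W, H, V, lam=0.01, eps=1e-9):
    # L1-penalized multiplicative update for H: the lam term in the
    # denominator shrinks coefficients toward zero, promoting sparsity.
    return H * (W.T @ V) / (W.T @ W @ H + lam + eps)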
NMF: Existing Methods
Sparse Encoding [Mu, Plemmons, and Santago]
• Similar to Hoyer's method
• Enforces statistical sparsity of H using a regularization parameter
– Minimizes the number of nonzeros in H
NMF: Proposed Algorithm
Hybrid method, called GD-CLS:
• W approximated using the Multiplicative Method
• H calculated using a Constrained Least Squares (CLS) model as the metric
– Penalizes the number of nonzeros in H
– Similar to the method of Mu, Plemmons, and Santago
GD-CLS
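This slide presents the algorithm itself; what follows is a sketch of the hybrid scheme described above, assuming random initialization and a fixed iteration count (both assumptions), with each iteration taking one multiplicative step for W and then solving a regularized least-squares problem for H:

import numpy as np

def gd_cls(V, k, lam=0.01, iters=100, eps=1e-9, seed=0):
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Gradient-descent (multiplicative) step for W.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # CLS step for H: solve the regularized normal equations
        # (W^T W + lam*I) H = W^T V, where lam penalizes dense
        # solutions, then enforce nonnegativity on H.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        H[H < 0] = 0.0
    return W, H

On the toy matrix above, gd_cls(V, k=2) should recover two basis vectors, one dominated by the terms of each document.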
Performance
Text Collections Used
• Two benchmark topic-detection text collections:
– Reuters: collection of documents on assorted topics
– TDT2: transcripts from news media
Accuracy Metric
• Defined by:

AC = ( Σ_{i=1..n} δ(d_i) ) / n

• d_i: document number i
• δ(d_i) = 1 if the computed and true topic labels of d_i match
• δ(d_i) = 0 otherwise

Parameters tested:
• k = 2, 4, 6, 8, 10, 15, 20
• λ = 0.1, 0.01, 0.001
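As an illustration (assuming cluster indices have already been mapped to topic labels, a step the slides do not show), documents can be labeled by their dominant basis vector and scored against the true labels:

import numpy as np

def cluster_labels(H):
    # Assign each document d_i the topic of its dominant basis vector,
    # i.e. the row of H with the largest coefficient in column i.
    return H.argmax(axis=0)

def accuracy(predicted, true_labels):
    # AC = (1/n) * Σ δ(d_i), where δ(d_i) = 1 on a label match.
    predicted = np.asarray(predicted)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted == true_labels))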
Results for Reuters
Results for TDT2
Observations
Observations: AC
• AC is inversely proportional to k
• The nature of the collection affects AC
– Reuters: earn, interest, cocoa
– TDT2: Asian economic crisis, Oprah lawsuit
Observations: λ parameter
• AC declines as λ increases (the effect is most pronounced for homogeneous text collections)
• CPU time declines as λ increases
Observations: Cluster size
• Imbalance in cluster sizes has an adverse effect on AC
Conclusion & Future Work
GD-CLS can be used to cluster text data effectively. Further development involves:
• Smart updating
• Use in bioinformatics
• Developing a user interface
• Converting the implementation to C++