JFC's notes for lecture 14
HCC class
lecture 14 comments
John Canny
3/9/05
Administrivia
Clustering: LSA again
The input is a matrix. Rows represent text blocks
(sentences, paragraphs, or documents).
Columns are distinct terms.
Matrix elements are term counts (optionally scaled by tf-idf weights).
The idea is to factor this matrix as M = A D B:
[Diagram: the text-blocks × terms matrix M factors as M = A D B, where A is text blocks × themes, D is a diagonal matrix of theme strengths, and B is themes × terms.]
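To make this concrete, here is a minimal LSA sketch in NumPy; the toy counts, term names, and choice of k are illustrative, not from the lecture:

import numpy as np

# Toy corpus: 4 text blocks x 5 terms (raw counts).
terms = ["cat", "mat", "stock", "price", "market"]
M = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])

k = 2  # number of themes to keep

# SVD factors M = U @ diag(s) @ Vt; truncating to the k largest
# singular values gives the LSA factorization M ~ A D B.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :k]          # text blocks x themes
D = np.diag(s[:k])    # theme strengths
B = Vt[:k, :]         # themes x terms

print(np.round(A @ D @ B, 2))  # rank-k approximation of M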
LSA again
A encodes the representation of each text block in a
space of themes.
B encodes each theme with term weights. It can be
used to explicitly describe the theme.
[The factorization diagram M = A D B is repeated here.]
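For example, continuing the SVD sketch above, a theme can be described explicitly by printing its top-weighted terms from B:

# Continuing the earlier sketch: each row of B weights the terms,
# so the top-weighted terms describe that theme.
for t in range(B.shape[0]):
    top = np.argsort(-np.abs(B[t]))[:3]
    print(f"theme {t}:", [terms[i] for i in top])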
LSA limitations
LSA has a few assumptions that don't make much sense:
– If documents really do comprise different “themes”, there shouldn't be negative weights in the LSA matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Non-negative Matrix Factorization
NMF deals with the non-negativity and orthogonality problems (the first and third limitations above), but still uses Gaussian statistics (the second).
LSA again
The consequences are:
– LSA themes are not meaningful beyond the first few (the ones with the strongest singular values).
– LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).
NMF
The corresponding properties:
– NMF components track themes well (up to 30 or more).
– The NMF components can be used directly as topic markers, so (unlike with LSA) the choice of space is important.
NMF
NMF is an umbrella term for several algorithms.
The one in this paper uses least squares to match the original term matrix, i.e., it minimizes
$\|M - AB\|^2$
Another natural metric is the Kullback-Leibler (KL) divergence. The KL divergence between two probability distributions p and q is
$\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$
Another natural version of NMF uses KL-divergence
between M and its approximation as A B.
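As a concrete sketch, the least-squares variant can be implemented with the standard multiplicative updates of Lee and Seung; the matrix sizes, seed, and iteration count below are illustrative choices, and the KL variant would swap in a different pair of updates:

import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9):
    """Nonnegative A (blocks x k) and B (k x terms) that locally
    minimize ||M - A B||^2, via multiplicative updates."""
    rng = np.random.default_rng(0)
    m, n = M.shape
    A = rng.random((m, k))
    B = rng.random((k, n))
    for _ in range(iters):
        # Each update is nonincreasing in ||M - A B||^2 and keeps
        # entries nonnegative (eps guards against division by zero).
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B

M = np.random.default_rng(1).random((6, 8))  # toy nonnegative matrix
A, B = nmf_ls(M, k=2)
print(np.linalg.norm(M - A @ B))  # residual of the local fit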
NMF
KL-divergence is usually a more accurate way to
compare probability distributions.
However, in clustering applications, the quality of fit to
the probability distribution is secondary to the quality of
the clusters.
KL-divergence NMF performs well for smoothing
(extrapolation) tasks, but not as well as least-squares
for clustering.
The reasons are not entirely clear, but it may simply be
an artifact of the basic NMF recurrences, which find only
locally-optimal matches.
A Simpler Text Summarizer
A simpler text summarizer based on inter-sentence
analysis did as well as any of the custom systems on the
DUC-2002 dataset (Document Understanding
Conference).
This algorithm, called “TextRank”, was based on an analysis of the similarity graph between sentences in the text.
A Simpler Text Summarizer
Vertices in the graph represent sentences; edge weights are the similarities between sentences:
[Figure: similarity graph over sentences S1–S7.]
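A minimal sketch of building this graph, using the word-overlap similarity from the TextRank paper (shared words, normalized by log sentence lengths); the naive tokenization and toy sentences are illustrative:

import math

def similarity(s1, s2):
    """Word overlap normalized by log sentence lengths, as in the
    TextRank paper; naive whitespace tokenization."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    norm = math.log(len(w1)) + math.log(len(w2))
    return len(w1 & w2) / norm if norm > 0 else 0.0

sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply in early trading",
]
n = len(sentences)
# Dense edge-weight matrix: W[i][j] = similarity of sentences i, j.
W = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
      for j in range(n)] for i in range(n)]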
TextRank
TextRank computes vertex strength using a variant of
Google’s Pagerank. It gives the probability of being at a
vertex during a long random walk on the graph.
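Here is a minimal sketch of that scoring step, using the weighted PageRank recurrence from the TextRank paper with the usual damping factor d = 0.85; the toy weight matrix stands in for a sentence-similarity matrix like the one above:

def textrank(W, d=0.85, iters=50):
    """Score vertices of a weighted graph W (list of lists) using
    the weighted PageRank recurrence from the TextRank paper."""
    n = len(W)
    scores = [1.0] * n
    for _ in range(iters):
        # Each vertex inherits score from its neighbors, weighted by
        # the fraction of each neighbor's outgoing weight aimed at it.
        scores = [
            (1 - d) + d * sum(
                W[j][i] / sum(W[j]) * scores[j]
                for j in range(n)
                if W[j][i] > 0
            )
            for i in range(n)
        ]
    return scores

# Toy weight matrix (e.g. the W built in the previous sketch).
W = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
print(textrank(W))  # the highest-scoring vertices form the summary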
TextRank
The highest-ranked vertices comprise the summary.
TextRank achieved the same summary performance as
the best single-sentence summarizers at DUC-2002.
(TextRank appeared in ACL 2004)
Discussion Topics
T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.