JFC's notes for lecture 14


HCC class
lecture 14 comments
John Canny
3/9/05
Administrivia
Clustering: LSA again




The input is a matrix:
– Rows represent text blocks (sentences, paragraphs, or documents)
– Columns are distinct terms
– Matrix elements are term counts (× tf-idf weight)
The idea is to “factor” this matrix into A D B:
[Diagram: the text-blocks × terms matrix M factored as A (text blocks × themes) times D (diagonal theme strengths) times B (themes × terms).]
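A minimal numpy sketch of this factorization via a truncated SVD; the toy matrix and the number of themes k are made-up values for illustration.

```python
import numpy as np

# Toy term matrix: rows = text blocks, columns = terms,
# entries = term counts (optionally scaled by tf-idf weights).
M = np.array([[2., 1., 0., 0.],
              [1., 3., 0., 1.],
              [0., 0., 4., 2.],
              [0., 1., 3., 3.]])

k = 2                                   # number of themes to keep
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :k]                            # text blocks x themes
D = np.diag(s[:k])                      # diagonal theme strengths
B = Vt[:k, :]                           # themes x terms

M_approx = A @ D @ B                    # rank-k approximation of M
```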
LSA again


A encodes the representation of each text block in a
space of themes.
B encodes each theme with term weights. It can be
used to explicitly describe the theme.
[Same M = A D B factorization diagram as above.]
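Continuing the sketch above, the themes can be read off from B: the largest-magnitude weights in each row of B name that theme, and each row of A gives a text block's mix of themes. The vocabulary list here is an assumption for illustration.

```python
# Assumed vocabulary for the toy matrix above (one name per column).
terms = ["user", "interface", "graph", "cluster"]
for i, theme in enumerate(B):
    # SVD weights can be negative, so rank terms by magnitude.
    top = np.argsort(-np.abs(theme))[:2]
    print(f"theme {i}:", [terms[j] for j in top])
```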
LSA limitations

LSA has a few assumptions that don’t make much
sense:
– If documents really do comprise different “themes”
there shouldn’t be negative weights in the LSA
matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B
matrices. Why should they be?
Non-negative Matrix Factorization

NMF deals with non-negativity and orthogonality, but still uses Gaussian statistics:
– If documents really do comprise different “themes”
there shouldn’t be negative weights in the LSA
matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B
matrices. Why should they be?
LSA again

The consequences are:
– LSA themes are not meaningful beyond the first few (the ones with the strongest singular values).
– LSA is largely insensitive to the choice of semantic
space (most 300-dim spaces will do).
NMF

The corresponding properties:
– NMF components track themes well (up to 30 or
more).
– The NMF components can be used directly as topic markers, so the choice of space is important.
NMF


NMF is an umbrella term for several algorithms.
The one in this paper uses least squares to match the original term matrix, i.e. it minimizes
Σ (M – A B)²   (summed over all entries of the matrix)

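A minimal sketch of least-squares NMF using the standard multiplicative updates of Lee and Seung; the rank k, iteration count, and random initialization are arbitrary choices, and this is a generic version rather than necessarily the exact variant in the paper.

```python
import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9):
    # Least-squares NMF: find non-negative A (blocks x themes) and
    # B (themes x terms) minimizing sum((M - A B)^2), via the standard
    # multiplicative updates; converges only to a local optimum.
    n, m = M.shape
    rng = np.random.default_rng(0)
    A = rng.random((n, k))
    B = rng.random((k, m))
    for _ in range(iters):
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B
```

Each update keeps A and B non-negative and does not increase the squared error, but it only finds a locally optimal factorization.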
Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is
Σ p log(p/q)

Another natural version of NMF uses KL-divergence
between M and its approximation as A B.
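For comparison, a sketch of the KL divergence itself and of KL-divergence NMF with the corresponding multiplicative updates; again a generic textbook version, with arbitrary rank and iteration count.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL divergence: sum of p * log(p / q), with eps guarding log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def nmf_kl(M, k, iters=200, eps=1e-9):
    # KL-divergence NMF with multiplicative updates (Lee & Seung style).
    n, m = M.shape
    rng = np.random.default_rng(0)
    A, B = rng.random((n, k)), rng.random((k, m))
    for _ in range(iters):
        AB = A @ B + eps
        B *= (A.T @ (M / AB)) / (A.sum(axis=0)[:, None] + eps)
        AB = A @ B + eps
        A *= ((M / AB) @ B.T) / (B.sum(axis=1)[None, :] + eps)
    return A, B
```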
NMF




KL-divergence is usually a more accurate way to
compare probability distributions.
However, in clustering applications, the quality of fit to
the probability distribution is secondary to the quality of
the clusters.
KL-divergence NMF performs well for smoothing
(extrapolation) tasks, but not as well as least-squares
for clustering.
The reasons are not entirely clear, but it may simply be
an artifact of the basic NMF recurrences, which find only
locally-optimal matches.
A Simpler Text Summarizer

A simpler text summarizer based on inter-sentence
analysis did as well as any of the custom systems on the
DUC-2002 dataset (Document Understanding
Conference).

This algorithm, called “TextRank”, is based on analyzing the graph of similarities between the sentences of the text.
A Simpler Text Summarizer

Vertices in the graph represent sentences; edge weights are the similarity between sentences:
[Diagram: similarity graph over sentence vertices S1–S7 with weighted edges.]
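A rough sketch of building such a graph; the sentences are invented, and the similarity used here is word overlap normalized by the log sentence lengths, close to the measure described in the TextRank paper but not claimed to be identical.

```python
import math
from itertools import combinations

# Invented example sentences (stand-ins for the sentences of a document).
sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell sharply today",
]

def similarity(s1, s2):
    # Word overlap normalized by log sentence lengths (TextRank-style).
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

# Weighted undirected graph: edges[(i, j)] = similarity of sentences i and j.
edges = {(i, j): similarity(a, b)
         for (i, a), (j, b) in combinations(enumerate(sentences), 2)}
```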
TextRank

TextRank computes vertex strength using a variant of Google’s PageRank. It gives the probability of being at a vertex during a long random walk on the graph.
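A sketch of the weighted PageRank iteration on the edges graph built in the previous snippet; the damping factor 0.85 and the iteration count are conventional choices, not values taken from the lecture.

```python
# Weighted PageRank ("vertex strength") on the sentence graph built above.
def textrank(n, edges, d=0.85, iters=50):
    # Symmetric weight matrix from the undirected weighted edges.
    w = [[0.0] * n for _ in range(n)]
    for (i, j), wij in edges.items():
        w[i][j] = w[j][i] = wij
    out = [sum(row) for row in w]          # total edge weight at each vertex
    score = [1.0] * n
    for _ in range(iters):
        score = [(1 - d) + d * sum(w[j][i] / out[j] * score[j]
                                   for j in range(n)
                                   if w[j][i] > 0 and out[j] > 0)
                 for i in range(n)]
    return score

scores = textrank(len(sentences), edges)
# Keep the top-ranked sentences as the extractive summary.
summary = [sentences[i] for i in sorted(range(len(sentences)),
                                        key=lambda i: -scores[i])[:2]]
```

Sorting the vertices by score and keeping the top few yields the extractive summary.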
TextRank

The highest-ranked vertices comprise the summary.

TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002.
(TextRank appeared in ACL 2004)
Discussion Topics
T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (BOW or graphical models) of texts. Discuss what information these representations capture and why they might be effective.