LSA - University of Victoria

Download Report

Transcript LSA - University of Victoria

Latent Semantic Analysis
An Example
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
q: dies, dagger
Which document should be returned and how the ranking should be?
Eigenvectors and Eigenvalues
• Let A be an n × n matrix.
• If x is an n-dimensional vector, then the matrix-vector product
Ax
is well-defined, and the result is again an n-dimensional vector.
• In general, multiplication by a matrix changes the direction of a non-zero
vector x, unless the vector is special and we have that
Ax =  x
for some scalar .
Matrix Decomposition
• Let S be the matrix with eigenvectors of A as columns.
• Let  be the diagonal matrix with the eigenvalues of A on the
diagonal.
• Then
A = SS-1
• If A is symmetric then we have S-1=ST
A = SST
Singular Value Decomposition
• Let A be an m × n matrix with entries being real numbers and m > n.
• Consider the n × n square matrix B = ATA.
– B is symmetric
– it has been shown that the eigenvalues of such (ATA) matrices are non-negative.
– Since they are non-negative we can write them in decreasing order as squares of
non-negative real numbers: 12 > . . . > n2
• For some index r (possibly n) the first r numbers are positive whereas the rest
are zero.
• S1 = [x1, . . . , xr]
• y1=(1/1)Ax1 ... yr=(1/r)Axr
• S2 = [y1, ..., yr]
• We can show that
A = S2  S 1 T
•  is diagonal and the values along the diagonal are 1, . . . , n which are
called singular values.
• If we denote S2 by S and S1 by U we have A = S  UT
Example
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
q: dies, dagger
Document-term matrix
Latent Concepts
• Latent Semantic Indexing (LSI) is a method for discovering
hidden concepts in document data.
• Each document and term (word) is then expressed as a vector
with elements corresponding to these concepts.
– Each element in a vector gives the degree of participation of the
document or term in the corresponding concept.
• Goal is not to describe the concepts verbally, but to be able to
represent the documents and terms in a unified way for exposing
– document-document,
– document-term, and
– term-term similarities
which are otherwise hidden…
Matrix 
Matrix A can be written: A = SUT
Let's "neglect" the last three singular values of  as being too "small"...
Also, just keep
two columns from S obtaining S2 and
two rows from UT obtaining U2T
Matrix A is approximated as: A2 = S2U2T
In general: Ak = SkUkT where a good value for k is determined empirically.
Matrices 2, S2, U2
Representing Documents, Terms, and
Queries
• Represent documents by the column vectors of U2T
• Represent terms by the row vectors S2
• Represent queries by the centroid vector of their terms
Geometry