Latent Concepts and the
Number of Orthogonal Factors
in Latent Semantic Analysis
Georges Dupret
[email protected]
Abstract


We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword.
By examining retrieval precision, we find that lower-ranked dimensions identify related terms, while higher-ranked dimensions discriminate between synonyms.
Introduction


The task of retrieving the documents relevant to a user query in a large text database is complicated by the fact that different authors use different words to express the same ideas or concepts.
Methods related to Latent Semantic Analysis interpret the variability associated with the expression of a concept as noise, and use linear algebra techniques to separate the underlying concept from this variable noise.
Introduction




In LSA, the singular value decomposition (SVD) is used to decompose a term-by-document matrix into a set of orthogonal factors.
Retaining a large number of factors yields an approximation close to the original term-by-document matrix, but it also retains too much noise. On the other hand, if too many factors are discarded, the information loss is too large.
The objective is to identify the optimal number of orthogonal factors.
Bag-of-words representation


Each document is replaced by a vector of its attributes, which are usually the keywords present in the document.
This representation can be used to retrieve the documents relevant to a user query: a vector representation is derived from the query in the same way as for regular documents, and is then compared with the database using a suitable measure of distance or similarity.
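
As an illustration, here is a minimal Python sketch of this representation and retrieval step. The toy documents, the raw term counts, and the choice of cosine similarity are our assumptions for illustration, not details from the talk:

    import numpy as np

    docs = ["the cat chased the mouse",
            "the mouse ate the cheese",
            "cats are mammals"]
    vocab = sorted({w for d in docs for w in d.split()})

    def to_vector(text):
        # Count occurrences of each vocabulary keyword in the text.
        counts = {w: 0 for w in vocab}
        for w in text.split():
            if w in counts:
                counts[w] += 1
        return np.array([counts[w] for w in vocab], dtype=float)

    A = np.stack([to_vector(d) for d in docs])   # D x N document-by-attribute matrix
    q = to_vector("cat mouse")                   # the query, represented the same way

    # Rank documents by cosine similarity to the query.
    sims = (A @ q) / (np.linalg.norm(A, axis=1) * np.linalg.norm(q) + 1e-12)
    print(docs[int(np.argmax(sims))])            # most similar document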
Latent Semantic Analysis



LSI is one of the few methods that successfully overcome the vocabulary noise problem, because it takes synonymy and polysemy into account.
Not accounting for synonymy leads to underestimating the similarity between related documents, while not accounting for polysemy leads to spurious similarities.
The idea behind LSI is to reduce the dimension of the IR problem by projecting the D documents by N attributes matrix A onto an adequate subspace of lower dimension.
Latent Semantic Analysis

SVD of the D × N matrix A:

A = U Δ V^T    (1)

U, V: orthogonal matrices
Δ: a diagonal matrix with elements σ1, …, σp, where p = min(D, N) and σ1 ≥ σ2 ≥ … ≥ σp

The closest matrix A(k) of rank k < rank(A) is obtained by setting σi = 0 for i > k:

A(k) = U(k) Δ(k) V(k)^T    (2)
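
A sketch of Eqs. (1)-(2) with numpy; the random matrix is a stand-in for an actual term-by-document matrix:

    import numpy as np

    D, N, k = 6, 5, 2
    A = np.random.default_rng(0).random((D, N))        # stand-in for the D x N matrix

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # Eq. (1): A = U Δ V^T, σ's in descending order
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # Eq. (2): set σi = 0 for i > k

    print(np.linalg.norm(A - Ak))                      # Frobenius error of the rank-k approximation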
Latent Semantic Analysis


Documents are then compared in a k-dimensional subspace derived from A(k). Projecting the original document representations gives

A^T U(k) Δ(k)^-1 = V(k)    (3)

and the same operation on the query vector Q gives

Q(k) = Q^T U(k) Δ(k)^-1    (4)

The closest document to the query Q is then identified by a dissimilarity function dk(·, ·) in the reduced space:

argmin_d dk(Q(k), V(k)_d)    (5)

where V(k)_d is the row of V(k) representing document d.
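
A sketch of the projection and ranking of Eqs. (3)-(5); the random data and the use of 1 minus cosine similarity as the dissimilarity dk are our assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.random((8, 5))                        # toy matrix; columns play the role of documents
    Q = rng.random(8)                             # a query over the same 8 keywords
    k = 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Dk_inv = U[:, :k], np.diag(1.0 / s[:k])

    Vk = A.T @ Uk @ Dk_inv                        # Eq. (3): one row per document
    Qk = Q @ Uk @ Dk_inv                          # Eq. (4)

    # Eq. (5): dk taken here as 1 - cosine similarity.
    d = 1 - (Vk @ Qk) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(Qk) + 1e-12)
    print(int(np.argmin(d)))                      # index of the closest document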
Covariance Method


Advantage: the method is able to handle databases of several hundreds of thousands of documents.
If D is the number of documents, A_d the vector representing the d-th document, and Ā the mean of these vectors, the covariance matrix is written

C = (1/D) Σ_{d=1..D} (A_d − Ā)(A_d − Ā)^T    (6)

This matrix being symmetric, its singular value decomposition can be written

C = V Δ V^T    (7)
Covariance Method

Reducing Δ to the k most significant singular values, we can project the keyword space onto a k-dimensional subspace:

(A|Q) → (A|Q) V(k) Δ(k) = (A(k)|Q(k))    (8)
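
A sketch of the covariance method, Eqs. (6)-(8), again on random stand-in data; since C is symmetric, numpy's eigendecomposition serves for Eq. (7), and the Δ(k) scaling of Eq. (8) is kept as written in the transcript:

    import numpy as np

    rng = np.random.default_rng(2)
    D, N, k = 20, 10, 4
    A = rng.random((D, N))                        # D documents x N keywords

    Abar = A.mean(axis=0)
    C = (A - Abar).T @ (A - Abar) / D             # Eq. (6): N x N covariance matrix

    sigma, V = np.linalg.eigh(C)                  # Eq. (7): C = V Δ V^T (eigh returns ascending order)
    order = np.argsort(sigma)[::-1]
    sigma, V = sigma[order], V[:, order]

    Ak = A @ V[:, :k] @ np.diag(sigma[:k])        # Eq. (8), applied to documents; a query Q is projected alike
    print(Ak.shape)                               # (D, k)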
Embedded Concepts




Projecting the covariance matrix onto a subspace of lower dimension implies a loss of information.
This loss can be interpreted as the merging of keyword meanings into a more general concept.
ex: "cat" and "mouse" → "mammal" → "animal"
How many singular values are necessary for a keyword to be correctly distinguished from all the others in the dictionary?
And what is the definition of "correctly distinguished"?
Correlation Method


The correlation matrix S of A is defined from the covariance matrix C:

S_{αβ} = C_{αβ} / √(C_{αα} C_{ββ})    (9)

Using the correlation rather than the covariance matrix results in a different weighting of the correlated keywords; the justification of the model remains otherwise identical.
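
Eq. (9) in code; the toy covariance matrix below is built from random data purely for illustration:

    import numpy as np

    def correlation_from_covariance(C):
        # Eq. (9): S_ab = C_ab / sqrt(C_aa * C_bb)
        d = np.sqrt(np.diag(C))
        return C / np.outer(d, d)

    X = np.random.default_rng(3).random((20, 10))
    C = np.cov(X, rowvar=False)                   # 10 x 10 covariance matrix
    S = correlation_from_covariance(C)            # diag(S) is all ones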
Keyword Validity

A property of the SVD of the symmetric matrix S:

S = Σ_{i=1..N} σi vi vi^T    (10)

where vi denotes the i-th column of V. The rank-k approximation S(k) of S can then be written

S(k) = Σ_{i=1..k} σi vi vi^T    (11)

with k ≤ N and S(N) = S.
Keyword Validity

We make the following argument:
the rank-k approximation of the correlation matrix correctly represents a given keyword only if this keyword is more correlated to itself than to any other attribute.

For a given keyword α this condition is written

S(k)_{αα} > S(k)_{αβ} for all β ≠ α    (12)

A keyword is said to be valid at rank k if k − 1 is the largest rank for which Eq. (12) is not verified; k is then the validity rank of the keyword.
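
A minimal sketch of the validity-rank computation of Eqs. (10)-(12); the incremental accumulation of rank-1 terms is our implementation choice, not a detail from the talk:

    import numpy as np

    def validity_rank(S, alpha):
        # Returns the validity rank of keyword index alpha for correlation matrix S.
        sigma, V = np.linalg.eigh(S)              # Eq. (10): S = sum_i sigma_i v_i v_i^T
        order = np.argsort(sigma)[::-1]           # eigh returns ascending order; sort descending
        sigma, V = sigma[order], V[:, order]
        N = S.shape[0]
        Sk = np.zeros_like(S)
        last_fail = 0
        for k in range(1, N + 1):
            v = V[:, k - 1]
            Sk += sigma[k - 1] * np.outer(v, v)   # Eq. (11): S(k)
            others = np.delete(Sk[alpha], alpha)
            if Sk[alpha, alpha] <= others.max():  # Eq. (12) not verified at rank k
                last_fail = k
        return last_fail + 1                      # k - 1 = largest failing rank

    # Example usage, with S a correlation matrix as built in the previous sketch:
    # print(validity_rank(S, 0))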
Experiments


Data: REUTERS (21,578 articles) and TREC5 (131,896 articles, from which 1,822,531 "documents" were generated) databases.
Pre-processing (a sketch follows below):
1. stem words using the Porter algorithm;
2. remove keywords whose document frequency lies above or below two user-specified thresholds;
3. map documents to vectors using TFIDF weighting.
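
A sketch of this pipeline with common Python tooling (NLTK's Porter stemmer and scikit-learn's TFIDF vectorizer); the talk does not give the actual thresholds or implementation, so min_df and max_df below are placeholders:

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Porter-stem every whitespace-separated token.
        return [stemmer.stem(tok) for tok in text.split()]

    vectorizer = TfidfVectorizer(
        tokenizer=stem_tokens,   # Porter stemming
        min_df=2,                # lower document-frequency threshold (placeholder)
        max_df=0.9,              # upper document-frequency threshold (placeholder)
    )
    docs = ["markets rally in africa", "networks and markets", "african networks expand"]
    A = vectorizer.fit_transform(docs)   # sparse TFIDF document-by-keyword matrix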
First Experiment
The claim: a given keyword is correctly represented by a rank-k approximation of the correlation matrix if k is at least equal to the validity rank of the keyword.
Experiment method:
1. Select a keyword, for example africa, and extract all the documents containing it.
2. Produce a new copy of these documents, in which we replace the selected keyword by a new one. ex: replace africa by afrique (French)
First Experiment



3. Add these new documents to the original database, and the keyword afrique to the vocabulary.
4. Compute the correlation matrix and the SVD of this new, extended database. Note that afrique and africa are perfect synonyms.
5. Project the original database onto the new subspace and issue a query for afrique. We hope to retrieve the documents containing africa first. (A sketch of the duplication step follows.)
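
A sketch of the document-duplication step on a toy corpus; the projection and query steps reuse the machinery sketched above:

    def inject_synonym(docs, keyword, synonym):
        # Copy every document containing `keyword`, replacing it with `synonym`.
        originals = [d for d in docs if keyword in d.split()]
        copies = [" ".join(synonym if w == keyword else w for w in d.split())
                  for d in originals]
        return docs + copies     # the vocabulary gains `synonym`, a perfect synonym of `keyword`

    docs = ["africa markets rally", "rain in africa", "european markets"]
    extended = inject_synonym(docs, "africa", "afrique")
    # Next: recompute the correlation matrix and SVD on `extended`, project the
    # original database, query for "afrique", and check that documents containing
    # "africa" are retrieved first.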
First Experiment
Figure 1: Keyword africa is replaced by afrique. Curves corresponding to ranks 450 and 500 start with null precision and remain under the curves of lower validity ranks.
First Experiment
Table 1: Keyword characteristics
First Experiment
Note: the drop in precision after the validity rank is reached is small, but still present. Here again, the validity rank is precisely defined: the drop in precision is observed as soon as the rank reaches 400.
First Experiment
Figure 3: Keyword network is replaced by reseau. Precision is 100% until the validity rank and deteriorates drastically beyond it.
First Experiment


This experiment shows the relation between the "concept" associated with afrique (or any other keyword) and the actual keyword.
For low-rank approximations S(k), adding orthogonal factors helps identify the "concept" common to both afrique and africa, while orthogonal factors beyond the validity rank help distinguish between the keyword and its synonym.
Second Experiment
Figure 4: Ratios R and hit = N/G for keyword afrique.
Validity rank is 428.
Third Experiment
Figure 5: vocabulary of 1201
keywords in REUTERS db.
Figure 6: vocabulary of 2486
keywords in TREC db.
Conclusion




We examined the dependence of the latent semantic structure on the number of orthogonal factors, in the context of the Correlation Method.
We analyzed the claim that LSA provides a method to take account of synonymy.
We proposed a method to determine the number of orthogonal factors for which a given keyword best represents an associated concept.
Further directions might include the extension to multiple-keyword queries.