Comparison of information retrieval techniques: Latent
semantic indexing (LSI) and Concept indexing (CI)
Jasminka Dobša
Faculty of Organization and Informatics, Varaždin
Outline
• Information retrieval in vector space model (VSM) or bag
of words representation
• Techniques for conceptual indexing
– Latent semantic indexing
– Concept indexing
• Comparison: Academic example
• Experiment
• Further work
Information retrieval in VSM 1/3
• Task of information retrieval: to extract the documents from a document collection that are relevant to a user query
• In the VSM documents are represented in a high-dimensional space
• The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000–5000 in my experiments)
• The VSM is implemented by forming a term-document matrix
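As an illustration of this step (not part of the original slides), the sketch below builds a small term-document matrix with scikit-learn; the library choice and the use of CountVectorizer are assumptions for the example only, and the sample texts are three of the book titles used later.

# Minimal sketch, assuming scikit-learn is available: build an m x n
# term-document matrix A (rows = terms, columns = documents).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Survey of text mining: clustering, classification, and retrieval",
    "Matrix analysis and applied linear algebra",
    "Information retrieval: data structures & algorithms",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)   # n x m matrix (documents x terms)
A = X.T.toarray()                         # transpose to the m x n convention used here
terms = vectorizer.get_feature_names_out()
print(A.shape)                            # (number of terms, number of documents)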
Information retrieval in VSM 2/3
• The term-document matrix is an m×n matrix where m is the number of terms and n is the number of documents
• row of the term-document matrix = term
• column of the term-document matrix = document
Figure 1. Term-document matrix

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}

with columns corresponding to the documents d_1, d_2, …, d_n and rows to the terms t_1, t_2, …, t_m.
Information retrieval in VSM 3/3
• A query has the same shape as a document (an m-dimensional vector)
• The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors

\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
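A minimal sketch of this ranking step in Python (an illustration, not code from the presentation; the function name cosine_scores is assumed):

import numpy as np

def cosine_scores(A, q):
    # A: m x n term-document matrix, q: m-dimensional query vector.
    # Returns cos(a_j, q) = a_j^T q / (||a_j||_2 * ||q||_2) for every document.
    doc_norms = np.linalg.norm(A, axis=0)
    denom = doc_norms * np.linalg.norm(q)
    denom[denom == 0] = 1.0               # guard against empty documents/query
    return (A.T @ q) / denom

# Usage: ranking = np.argsort(-cosine_scores(A, q))  # highest-scoring documents first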
Retrieval performance evaluation
• Measures for evaluation:
– Recall
– Precision
– Average precision
• Recall
\mathrm{recall}_i = \frac{r_i}{r_n}
• Precision
\mathrm{precision}_i = \frac{r_i}{i}
• r_i is the number of relevant documents among the i highest ranked documents
• r_n is the total number of relevant documents in the collection
• Average precision – the average of the precision values at distinct levels of recall
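A short sketch of these measures (illustrative only; the helper names are assumptions): given a ranked list of document ids and the set of relevant ids, it computes precision and recall at rank i and averages the precision values at the ranks where relevant documents occur, which is one common way to realize the "distinct levels of recall" above.

def precision_recall_at(ranking, relevant, i):
    r_i = sum(1 for d in ranking[:i] if d in relevant)   # relevant docs among top i
    return r_i / i, r_i / len(relevant)                  # precision_i, recall_i

def average_precision(ranking, relevant):
    # Average of precision values taken at each rank that retrieves a relevant
    # document, i.e. at each new level of recall.
    hits, precisions = 0, []
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0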
Techniques for conceptual indexing
• In the term-matching method the similarity between a query and a document is tested lexically
• Polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to efficient information retrieval
• Here we compare two techniques for conceptual indexing based on projection of the document vectors (in the least-squares sense) onto a lower-dimensional vector space
– Latent semantic indexing (LSI)
– Concept indexing (CI)
Latent semantic indexing
• Introduced in 1990; improved in 1995
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
• M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
• Based on spectral analysis of term-document matrix
Latent semantic indexing
• For every m×n matrix A there is a singular value decomposition (SVD)
A = U \Sigma V^T
where
U is an orthogonal m×m matrix whose columns are the left singular vectors of A
\Sigma is a diagonal matrix whose diagonal holds the singular values of A in descending order
V is an orthogonal n×n matrix whose columns are the right singular vectors of A
Latent semantic indexing
• For LSI the truncated SVD is used
A_k = U_k \Sigma_k V_k^T
where
U_k is the m×k matrix whose columns are the first k left singular vectors of A
\Sigma_k is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A
V_k is the n×k matrix whose columns are the first k right singular vectors of A
• Rows of U_k = terms
• Rows of V_k = documents
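A minimal NumPy sketch of the truncated SVD used by LSI (illustrative; np.linalg.svd is standard, the helper name is an assumption):

import numpy as np

def truncated_svd(A, k):
    # Full SVD, then keep the k leading singular triplets: A_k = U_k S_k V_k^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
    return Uk, Sk, Vk

# Rows of Uk represent terms and rows of Vk represent documents in the
# k-dimensional LSI space; Uk @ Sk @ Vk.T is the rank-k approximation A_k.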
(Truncated) SVD
Latent semantic indexing
• Using the truncated SVD we include only the first k independent linear components of A (singular vectors and values)
• Documents are projected, in the least-squares sense, onto the space spanned by the first k singular vectors of A (the LSI space)
• The first k components capture the major associational structure in the term-document matrix and throw out the noise
• Minor differences in terminology used in documents are ignored
• Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based
• Documents which contain synonyms are closer in the LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in the LSI space than in the original space
Concept indexing (CI)
• Indexing using concept decomposition (CD) instead of the SVD used in LSI
• Concept decomposition was introduced in 2001
• I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
Concept decomposition
• First step: clustering of the documents of the term-document matrix A into k groups
• Clustering algorithms:
– Spherical k-means algorithm
– Fuzzy k-means algorithm
• The spherical k-means algorithm is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm
• Centroids of the groups = concept vectors
• The concept matrix is the matrix whose columns are the centroids of the groups
C_k = [\, c_1 \; c_2 \; \cdots \; c_k \,]
where c_j is the centroid of the j-th group
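A hedged sketch of this clustering step: ordinary k-means on unit-normalized document vectors is used here as a stand-in for the spherical k-means algorithm (an assumption, not the algorithm's exact implementation); the centroids of the groups become the columns of the concept matrix.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def concept_matrix(A, k, random_state=0):
    docs = normalize(A.T)                       # n x m, unit-norm document vectors
    labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(docs)
    # Concept vector c_j = centroid of group j; stack them as columns of C_k.
    C = np.column_stack([docs[labels == j].mean(axis=0) for j in range(k)])
    return C                                    # m x k concept matrix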
Concept decomposition
• Second step: calculating the concept decomposition
• The concept decomposition D_k of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors
D_k = C_k Z
where Z is the solution of the least squares problem
Z = (C_k^T C_k)^{-1} C_k^T A
• Rows of C_k = terms
• Columns of Z = documents
Comparison: Academic example
• Collection of 15 documents (titles of books)
– 9 from the field of data mining
– 5 from the field of linear algebra
– 1 a combination of these fields (application of linear algebra for data mining)
• A list of terms was formed
1) from words contained in at least two documents
2) words on the stop list were removed
3) stemming was performed
• To the term-document matrix we apply
– truncated SVD (k=2)
– concept decomposition (k=2)
Documents 1/2
D1 – Survey of text mining: clustering, classification, and retrieval
D2 – Automatic text processing: the transformation analysis and retrieval of information by computer
D3 – Elementary linear algebra: A matrix approach
D4 – Matrix algebra & its applications statistics and econometrics
D5 – Effective databases for text & document management
D6 – Matrices, vector spaces, and information retrieval
D7 – Matrix analysis and applied linear algebra
D8 – Topological vector spaces and algebras
Documents 2/2
D9 – Information retrieval: data structures & algorithms
D10 – Vector spaces and algebras for chemistry and physics
D11 – Classification, clustering and data analysis
D12 – Clustering of large data sets
D13 – Clustering algorithms
D14 – Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15 – Data mining and knowledge discovery
Terms
Data mining terms: Text, Mining, Clustering, Classification, Retrieval, Information, Document, Data
Linear algebra terms: Linear, Algebra, Matrix, Vector, Space
Neutral terms: Analysis, Application, Algorithm
Projection of terms by SVD
[Figure: 2-D projection of terms by the truncated SVD; legend: data mining terms, linear algebra terms.]
Projection of terms by CD
[Figure: 2-D projection of terms by the concept decomposition; legend: data mining terms, linear algebra terms.]
Queries
• Q1: Data mining
– Relevant documents: all data mining documents
• Q2: Using linear algebra for data mining
– Relevant document: D6
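The slides do not spell out how the queries are mapped into the reduced spaces; a common convention (stated here as an assumption) is to fold the query into the LSI space as q̂ = Σ_k^{-1} U_k^T q, and into the CI space by the same least-squares projection used for the documents, then rank by cosine similarity in the k-dimensional space.

import numpy as np

def fold_in_lsi(q, Uk, Sk):
    # q_hat = Sk^{-1} Uk^T q : k-dimensional representation of the query.
    return np.linalg.solve(Sk, Uk.T @ q)

def fold_in_ci(q, Ck):
    # Least-squares projection of the query onto the concept vectors.
    z, *_ = np.linalg.lstsq(Ck, q, rcond=None)
    return z

# Documents live in the same spaces (rows of Vk for LSI, columns of Z for CI),
# so retrieval reduces to cosine similarity between k-dimensional vectors.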
Projection of documents by SVD
[Figure: 2-D projection of documents by the truncated SVD, together with queries Q1 and Q2; legend: data mining documents, linear algebra documents, document D6, queries. Annotations mark the relevant documents for query Q1 and for query Q2.]
Projection of documents by CD
[Figure: 2-D projection of documents by the concept decomposition, together with queries Q1 and Q2; legend: data mining documents, linear algebra documents, document D6, queries. Annotations mark the relevant documents for query Q1 and for query Q2.]
Results of information retrieval (Q1)
Term matching method          Latent semantic indexing      Concept indexing
score     document            score     document            score     document
1,4142    D15                 0,6141    D1                  0,6622    D1
0,7071    D12                 0,5480    D11                 0,6057    D14
0,5774    D14                 0,5465    D12                 0,5953    D12
0,5000    D9                  0,4809    D9                  0,5763    D11
0,5000    D11                 0,4644    D15                 0,5619    D15
0,4472    D1                  0,4301    D2                  0,4727    D5
0,0000    D2                  0,4127    D14                 0,3379    D13
0,0000    D3                  0,3858    D13                 0,2690    D2
0,0000    D4                  0,3165    D5                  0,2615    D9
0,0000    D5                  0,1585    D6                  0,0016    D7
0,0000    D6                  0,0013    D7                  -0,0063   D6
0,0000    D7                  -0,0631   D8                  -0,0462   D3
0,0000    D8                  -0,0631   D10                 -0,0468   D4
0,0000    D10                 -0,0712   D4                  -0,0473   D8
0,0000    D13                 -0,0712   D3                  -0,0473   D10
Results of information retrieval (Q2)
Term matching method          Latent semantic indexing      Concept indexing
score     document            score     document            score     document
1,4142    D15                 0,6737    D6                  0,7105    D1
1,1547    D3                  0,6472    D7                  0,6204    D11
0,8944    D7                  0,6100    D8                  0,5936    D12
0,7071    D12                 0,6100    D10                 0,5517    D2
0,5774    D4                  0,5924    D3                  0,5488    D14
0,5774    D8                  0,5924    D4                  0,5452    D6
0,5774    D10                 0,5789    D10                 0,5441    D9
0,5774    D14                 0,5404    D2                  0,5280    D7
0,5000    D9                  0,5268    D11                 0,5256    D15
0,5000    D11                 0,5236    D9                  0,4797    D8
0,4472    D1                  0,4656    D12                 0,4797    D10
0,0000    D2                  0,3936    D15                 0,4693    D3
0,0000    D5                  0,3560    D14                 0,4693    D4
0,0000    D6                  0,3320    D13                 0,4459    D5
0,0000    D13                 0,2800    D5                  0,4202    D13
Collections
• MEDLINE
– 1033 documents
– 30 queries
– relevance judgements
• CRANFIELD
– 1400 documents
– 225 queries
– relevance judgements
Test A
• Comparison of the errors of approximation of the term-document matrix by
1) the rank-k SVD: \| A - U_k \Sigma_k V_k^T \|_F
2) the rank-k CD: \| A - C_k Z \|_F
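A short sketch of this comparison (illustrative; the function name is assumed), computing both Frobenius-norm errors from the factors produced earlier:

import numpy as np

def approximation_errors(A, Uk, Sk, Vk, Ck, Z):
    err_svd = np.linalg.norm(A - Uk @ Sk @ Vk.T, "fro")   # rank-k SVD error
    err_cd = np.linalg.norm(A - Ck @ Z, "fro")            # rank-k CD error
    return err_svd, err_cd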
MEDLINE - errors of approximation
[Figure: MEDLINE – error of approximation versus rank of approximation, for k-means, fuzzy k-means and SVD.]
CRANFIELD - errors of approximation
[Figure: CRANFIELD – error of approximation versus rank of approximation, for k-means, fuzzy k-means and SVD.]
Test B
• Average inner product between concept vectors cj,
j=1,2,…,k
\frac{2}{k(k-1)} \sum_{j=1}^{k} \sum_{l=j+1}^{k} c_j^T c_l
• Comparison of average inner product for
– Concept vectors obtained by spherical k-means
algorithm
– Concept vectors obtained by fuzzy k-means algorithm
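A minimal sketch of this measure (illustrative; the function name is assumed): the average of the pairwise inner products c_j^T c_l over all pairs j < l; values close to zero indicate nearly orthogonal concept vectors.

import numpy as np

def average_inner_product(C):
    k = C.shape[1]
    G = C.T @ C                                    # Gram matrix of the concept vectors
    pair_sum = (G.sum() - np.trace(G)) / 2.0       # sum of c_j^T c_l over pairs j < l
    return 2.0 * pair_sum / (k * (k - 1))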
MEDLINE – average inner product
[Figure: MEDLINE – average inner product between concept vectors versus rank of approximation, for k-means and fuzzy k-means.]
CRANFIELD – average inner product
[Figure: CRANFIELD – average inner product between concept vectors versus rank of approximation, for k-means and fuzzy k-means.]
Test C
• Comparison of mean average precision of
information retrieval and precision-recall
plots
• Mean average precision for term-matching
method:
– MEDLINE : 43,54
– CRANFIELD : 20,89
MEDLINE – mean average precision
[Figure: MEDLINE – mean average precision versus rank of approximation, for k-means, fuzzy k-means and SVD.]
CRANFIELD – mean average precision
[Figure: CRANFIELD – mean average precision versus rank of approximation (k), for k-means, fuzzy k-means and SVD.]
MEDLINE – precision-recall plot
[Figure: MEDLINE – precision-recall plot for k-means, fuzzy k-means, SVD and term matching.]
CRANFIELD – precision-recall plot
[Figure: CRANFIELD – precision-recall plot for k-means, fuzzy k-means, SVD and term matching.]
Test D
• Correlation between mean average precision (MAP) and clustering quality
• Measure of cluster quality – the generalized within-groups sum of squared errors function J_fuzz
J_{\mathrm{fuzz}} = \sum_{i=1}^{k} \sum_{j=1}^{n} \mu_{ij}^{\,b} \, \|a_j - c_i\|^2
• a_j, j=1,2,…,n are the document vectors
• c_i, i=1,2,…,k are the concept vectors
• \mu_{ij} is the fuzzy membership degree of document a_j in the group whose concept vector is c_i
• b ≥ 1 is the weight exponent
MEDLINE - Correlation (clustering quality
and MAP)
• 46 observations for rank of approximation k ∈ [1, 100]
• Correlation between mean average precision and J_fuzz: r = -0,968198 (p << 0,01)
• Correlation between rank of approximation and mean average precision: r = 0,70247 (p << 0,01)
• Correlation between rank of approximation and J_fuzz: r = -0,831071 (p << 0,01)
CRANFIELD - Correlation (clustering quality and MAP)
• 46 observations for rank of approximation k ∈ [1, 100]
• Correlation between mean average precision and J_fuzz: r = -0,988293 (p << 0,01)
• Correlation between rank of approximation and mean average precision: r = 0,914489 (p << 0,01)
• Correlation between rank of approximation and J_fuzz: r = -0,904415 (p << 0,01)
Regression line: clustering quality and MAP
(MEDLINE)
[Figure: MEDLINE – regression line of mean average precision on the clustering quality measure J_W.]
Regression line: clustering quality and MAP
(CRANFIELD)
[Figure: CRANFIELD – regression line of mean average precision on the clustering quality measure J_W.]
Conclusion 1/3
• By the SVD approximation the term-document matrix is projected onto the first k left singular vectors, which form an orthogonal basis for the LSI space
• By the CD approximation the term-document matrix is projected onto the k centroids of the groups (concept vectors)
• The concept vectors form the basis for the CI space; they tend to orthogonality as k increases
• Concept vectors obtained by the fuzzy k-means algorithm tend to orthogonality faster than those obtained by the spherical k-means algorithm
• CI using CD by the fuzzy k-means algorithm gives higher MAP of information retrieval than LSI on both collections we have used
Conclusion 2/3
• CI using CD by the spherical k-means algorithm gives lower (but comparable) MAP of information retrieval than LSI on both collections we have used
• According to the MAP results, k=75 for the MEDLINE collection and k=200 for the CRANFIELD collection are good choices of the rank of approximation
• With LSI and CI documents are represented by smaller matrices:
– For the MEDLINE collection the term-document matrix is stored in a 5940×1033 matrix; the approximations of the documents are stored in a 75×1033 matrix
– For the CRANFIELD collection the term-document matrix is stored in a 4758×1400 matrix; the approximations of the documents are stored in a 200×1400 matrix
Conclusion 3/3
• LSI and CI work better on the MEDLINE collection
• When evaluated for different ranks of approximation, MAP is more stable for LSI than for CI
• There is a high correlation between MAP and clustering quality
Further work
1) To apply CI to the problem of classification in a supervised setting
2) To propose solutions to the problem of adding new documents to the collection for the CI method
• Adding new documents to the collection requires recomputation of the SVD or the CD
• This is computationally inefficient
• Two approximation methods have been developed for adding new documents to the collection for the LSI method
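For context on the last point (an aside, not from the slides): the simpler of the approximation methods for LSI, usually called folding-in, projects a new document vector d onto the existing LSI space without recomputing the SVD. A hedged sketch:

import numpy as np

def fold_in_document(d, Uk, Sk):
    # d_hat = Sk^{-1} Uk^T d : k-dimensional representation of the new document.
    # Accuracy degrades as many documents are folded in, which is why updating
    # or recomputation schemes are studied for both LSI and CI.
    return np.linalg.solve(Sk, Uk.T @ d)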