Transcript Lecture 3

Vector Space Model
Rong Jin
Basic Issues in a Retrieval Model

[Figure: three questions around a retrieval model]
- How to represent text objects?
- What similarity function should be used?
- How to refine the query according to users' feedback?
Basic Issues in IR

- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to utilize users' feedback to enhance retrieval performance?
IR: Formal Formulation

- Vocabulary $V = \{w_1, w_2, \ldots, w_n\}$ of the language
- Query $q = q_1, \ldots, q_m$, where $q_i \in V$
- Collection $C = \{d_1, \ldots, d_k\}$
  - Document $d_i = (d_{i1}, \ldots, d_{i m_i})$, where $d_{ij} \in V$
- Set of relevant documents $R(q) \subseteq C$
  - Generally unknown and user-dependent
  - The query is a "hint" on which documents are in $R(q)$
- Task = compute $R'(q)$, an "approximate $R(q)$"
Computing R(q)

Strategy 1: Document selection
- Classification function $f(d, q) \in \{0, 1\}$
  - Outputs 1 for relevance, 0 for irrelevance
- $R(q)$ is determined as the set $\{d \in C \mid f(d, q) = 1\}$
- The system must decide whether a document is relevant or not ("absolute relevance")
- Example: Boolean retrieval
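As a toy illustration (not from the slides), document selection with a Boolean AND query can be sketched in Python; the collection and query here are made up:

```python
# Hypothetical sketch of Strategy 1 (document selection): a Boolean AND
# query acts as the classification function f(d, q) in {0, 1}.

def f(doc_words: set, query_words: set) -> int:
    """Return 1 if the document contains every query word, else 0."""
    return int(query_words <= doc_words)

collection = {
    "d1": {"java", "starbucks"},
    "d2": {"starbucks", "microsoft"},
    "d3": {"java", "microsoft"},
}
query = {"java", "microsoft"}

# R(q) = {d in C | f(d, q) = 1}
R = {name for name, words in collection.items() if f(words, query) == 1}
print(R)  # {'d3'}
```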
Document Selection Approach

[Figure: the true relevant set R(q) vs. the set selected by classifier C(q); relevant (+) and non-relevant (-) documents fall on either side of the decision boundary]
Computing R(q)

Strategy 2: Document ranking
- Similarity function $f(d, q) \in \mathbb{R}$
  - Outputs a similarity between document d and query q
- Cutoff $\theta$
  - The minimum similarity for a document to be considered relevant to the query
- $R(q)$ is determined as the set $\{d \in C \mid f(d, q) > \theta\}$
- The system must decide whether one document is more likely to be relevant than another ("relative relevance")
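A matching sketch (again hypothetical) of the ranking strategy, with an arbitrary toy similarity function and cutoff $\theta$:

```python
# Hypothetical sketch of Strategy 2 (document ranking): score every
# document with a similarity function f(d, q), sort, and cut off at theta.

def f(doc_words: set, query_words: set) -> float:
    """Toy similarity: fraction of query words the document contains."""
    return len(doc_words & query_words) / len(query_words)

collection = {
    "d1": {"java", "starbucks"},
    "d2": {"starbucks", "microsoft"},
    "d3": {"java", "microsoft"},
}
query = {"java", "microsoft"}
theta = 0.4

ranked = sorted(collection, key=lambda d: f(collection[d], query), reverse=True)
R = [d for d in ranked if f(collection[d], query) > theta]
print(R)  # ['d3', 'd1', 'd2']
```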
Document Selection vs. Ranking

[Figure: the true R(q) of relevant (+) and non-relevant (-) documents, next to the output of document ranking]

Doc ranking f(d,q)=?
  0.98  d1  +
  0.95  d2  +
  0.83  d3  -
  0.80  d4  +
  0.76  d5  -
  0.56  d6  -
  0.34  d7  -
  0.21  d8  +
  0.21  d9  -

R'(q) = the documents ranked above the cutoff
Document Selection vs. Ranking

[Figure: side-by-side comparison over the same true R(q). Doc selection f(d,q)=? makes a hard 1/0 split, so R'(q) includes some non-relevant documents and misses some relevant ones. Doc ranking f(d,q)=? produces the ranked list above (0.98 d1 ... 0.21 d9), and R'(q) depends on how far down the list the user goes]
Ranking is often preferred

- A similarity function is more general than a classification function
- The classifier is unlikely to be accurate
  - Ambiguous information needs, short queries
  - Relevance is a subjective concept
- Absolute relevance vs. relative relevance
Probability Ranking Principle

As stated by Cooper:

"If a reference retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

Ranking documents by probability of usefulness maximizes the utility of IR systems.
Vector Space Model

- Any text object can be represented by a term vector
  - Examples: documents, queries, sentences, ...
  - A query is viewed as a short document
- Similarity is determined by the relationship between two vectors
  - e.g., the cosine of the angle between the vectors, or the distance between vectors
- The SMART system
  - Developed at Cornell University, 1960-1999
  - Still widely used
Vector Space Model: illustration

         Java   Starbucks   Microsoft
D1        1        1           0
D2        0        1           1
D3        1        0           1
D4        1        1           1
Query     1        0.1         1
Vector Space Model: illustration

[Figure: documents D1-D4 and the query plotted as vectors in the 3-D term space with axes Java, Starbucks, and Microsoft; which document vector is closest to the query?]
Vector Space Model: Similarity

- Represent both documents and queries by word-histogram vectors
  - n: the number of unique words
  - A query $q = (q_1, q_2, \ldots, q_n)$
    - $q_i$: occurrence of the i-th word in the query
  - A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
    - $d_{k,i}$: occurrence of the i-th word in the document
- Similarity of a query q to a document $d_k$

[Figure: vectors q and $d_k$]
Some Background in Linear Algebra

- Dot product (scalar product)

  $q \cdot d_k = q_1 d_{k,1} + q_2 d_{k,2} + \ldots + q_n d_{k,n}$

- Example:

  $q = [1, 2, 5],\ d_k = [4, 1, 0]$
  $q \cdot d_k = 1 \cdot 4 + 2 \cdot 1 + 5 \cdot 0 = 6$

  $q = [1, 2, 5],\ d_k = [1, 3, 4]$
  $q \cdot d_k = 1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4 = 27$

- Measure the similarity by the dot product $q \cdot d_k$
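The two dot products above can be checked with NumPy (a quick sketch, not part of the slides):

```python
import numpy as np

q = np.array([1, 2, 5])
d1 = np.array([4, 1, 0])
d2 = np.array([1, 3, 4])

# Dot product: q . d_k = q1*dk1 + q2*dk2 + ... + qn*dkn
print(np.dot(q, d1))  # 6
print(np.dot(q, d2))  # 27
```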
Some Background in Linear Algebra

- Length of a vector

  $|q| = \sqrt{q_1^2 + q_2^2 + \ldots + q_n^2}, \quad |d_k| = \sqrt{d_{k,1}^2 + d_{k,2}^2 + \ldots + d_{k,n}^2}$

- Angle between two vectors

  $\cos(\theta(q, d_k)) = \frac{q \cdot d_k}{|q|\,|d_k|} = \frac{q_1 d_{k,1} + q_2 d_{k,2} + \ldots + q_n d_{k,n}}{\sqrt{q_1^2 + q_2^2 + \ldots + q_n^2}\,\sqrt{d_{k,1}^2 + d_{k,2}^2 + \ldots + d_{k,n}^2}}$

[Figure: the angle $\theta(q, d_k)$ between vectors q and $d_k$]
Some Background in Linear Algebra

- Example:

  $q = [1, 2, 5],\ d_k = [4, 1, 0]$
  $\cos(\theta(q, d_k)) = \frac{1 \cdot 4 + 2 \cdot 1 + 5 \cdot 0}{\sqrt{1^2 + 2^2 + 5^2}\,\sqrt{4^2 + 1^2 + 0^2}} \approx 0.27$

  $q = [1, 2, 5],\ d_k = [1, 3, 4]$
  $\cos(\theta(q, d_k)) = \frac{1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4}{\sqrt{1^2 + 2^2 + 5^2}\,\sqrt{1^2 + 3^2 + 4^2}} \approx 0.97$

- Measure similarity by the angle between vectors
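And the same examples as cosine similarities, sketched in NumPy:

```python
import numpy as np

def cosine(q, d):
    # cos(theta) = (q . d) / (|q| * |d|)
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

q = np.array([1, 2, 5])
print(round(cosine(q, np.array([4, 1, 0])), 2))  # 0.27
print(round(cosine(q, np.array([1, 3, 4])), 2))  # 0.97
```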
Vector Space Model: Similarity

- Given
  - A query $q = (q_1, q_2, \ldots, q_n)$
    - $q_i$: occurrence of the i-th word in the query
  - A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
    - $d_{k,i}$: occurrence of the i-th word in the document
- Similarity of a query q to a document $d_k$

  $sim(q, d_k) = q_1 d_{k,1} + q_2 d_{k,2} + \ldots + q_n d_{k,n} = q \cdot d_k = |q|\,|d_k| \cos(\theta(q, d_k))$

  $sim'(q, d_k) = \cos(\theta(q, d_k)) = \frac{q \cdot d_k}{|q|\,|d_k|} = \frac{q_1 d_{k,1} + q_2 d_{k,2} + \ldots + q_n d_{k,n}}{\sqrt{q_1^2 + q_2^2 + \ldots + q_n^2}\,\sqrt{d_{k,1}^2 + d_{k,2}^2 + \ldots + d_{k,n}^2}}$

[Figure: vectors q and $d_k$ with angle $\theta(q, d_k)$]
Vector Space Model: Similarity

$q = [1, 2, 5],\ d_k = [0, 0, 8]$
$q \cdot d_k = 1 \cdot 0 + 2 \cdot 0 + 5 \cdot 8 = 40$

$q = [1, 2, 5],\ d_k = [1, 3, 4]$
$q \cdot d_k = 1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4 = 27$
Vector Space Model: Similarity

$q = [1, 2, 5],\ d_k = [0, 0, 8]$
$\cos(\theta(q, d_k)) = \frac{1 \cdot 0 + 2 \cdot 0 + 5 \cdot 8}{\sqrt{1^2 + 2^2 + 5^2}\,\sqrt{0^2 + 0^2 + 8^2}} \approx 0.913$

$q = [1, 2, 5],\ d_k = [1, 3, 4]$
$\cos(\theta(q, d_k)) = \frac{1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4}{\sqrt{1^2 + 2^2 + 5^2}\,\sqrt{1^2 + 3^2 + 4^2}} \approx 0.97$

Note that although $d_k = [0, 0, 8]$ has the larger dot product, $d_k = [1, 3, 4]$ has the larger cosine similarity.
Term Weighting

$sim(q, d_k) = q_1 d_{k,1} + q_2 d_{k,2} + \ldots + q_n d_{k,n}$

$sim(q, d_k) = q_1 d_{k,1} w_{k,1} + q_2 d_{k,2} w_{k,2} + \ldots + q_n d_{k,n} w_{k,n}$

- $w_{k,i}$: the importance of the i-th word for document $d_k$
- Why weighting?
  - Some query terms carry more information
- TF.IDF weighting
  - TF (Term Frequency) = within-document frequency
  - IDF (Inverse Document Frequency)
  - TF normalization: avoid the bias toward long documents
TF Weighting

- A term is important if it occurs frequently in a document
- Formulas (term frequency normalization):
  - $f(t, d)$: number of occurrences of word t in document d
  - Maximum frequency normalization:

    $TF(t, d) = 0.5 + 0.5\,\frac{f(t, d)}{MaxFreq(d)}$
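A minimal sketch of the maximum frequency normalization, assuming simple whitespace tokenization (the tokenizer and sample document are made up):

```python
from collections import Counter

def tf_maxnorm(term: str, doc: str) -> float:
    """Maximum frequency normalization: TF = 0.5 + 0.5 * f(t,d) / MaxFreq(d)."""
    counts = Counter(doc.lower().split())
    max_freq = max(counts.values())
    return 0.5 + 0.5 * counts[term] / max_freq

doc = "java starbucks java microsoft java"
print(tf_maxnorm("java", doc))       # 1.0 (the most frequent term)
print(tf_maxnorm("microsoft", doc))  # 0.5 + 0.5*(1/3) ≈ 0.67
```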
TF Weighting

- A term is important if it occurs frequently in a document
- Formulas (term frequency normalization):
  - $f(t, d)$: number of occurrences of word t in document d
  - "Okapi/BM25 TF":

    $TF(t, d) = \frac{k\,f(t, d)}{f(t, d) + k\left(1 - b + b\,\frac{doclen(d)}{avg\_doclen}\right)}$

  - $doclen(d)$: the length of document d
  - $avg\_doclen$: the average document length
  - $k, b$: predefined constants
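A sketch of the Okapi/BM25 TF formula above; the constants k=1.2 and b=0.75 are common defaults, not values fixed by the slides:

```python
def tf_bm25(f_td: float, doclen: float, avg_doclen: float,
            k: float = 1.2, b: float = 0.75) -> float:
    """Okapi/BM25 TF: k*f / (f + k*(1 - b + b*doclen/avg_doclen)).
    k and b are assumed defaults, not prescribed by the lecture."""
    return k * f_td / (f_td + k * (1 - b + b * doclen / avg_doclen))

# Same raw count, but the longer document gets a smaller normalized TF.
print(tf_bm25(f_td=3, doclen=100, avg_doclen=100))  # ≈ 0.857
print(tf_bm25(f_td=3, doclen=300, avg_doclen=100))  # 0.6
```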
TF Normalization

- Why?
  - Document length variation
  - "Repeated occurrences" are less informative than the "first occurrence"
- Two views of document length
  - A doc is long because it uses more words
  - A doc is long because it has more content
- Generally penalize long docs, but avoid over-penalizing (pivoted normalization)
TF Normalization

[Figure: normalized TF plotted against raw TF for the "pivoted normalization"]

$TF(t, d) = \frac{k\,f(t, d)}{f(t, d) + k\left(1 - b + b\,\frac{doclen(d)}{avg\_doclen}\right)}$
IDF Weighting

- A term is discriminative if it occurs in only a few documents
- Formula:

  $IDF(t) = 1 + \log(n/m)$

  - n: total number of docs
  - m: number of docs containing term t (document frequency)
- Can be interpreted as mutual information
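A sketch of the IDF formula above (the toy corpus is made up; it assumes the term occurs in at least one document, since m = 0 would divide by zero):

```python
import math

def idf(term: str, docs: list) -> float:
    """IDF(t) = 1 + log(n / m), n = total docs, m = docs containing t."""
    n = len(docs)
    m = sum(1 for d in docs if term in d)  # assumes m >= 1
    return 1 + math.log(n / m)

docs = [{"java", "coffee"}, {"coffee"}, {"coffee"},
        {"java", "coffee", "espresso"}]
print(idf("coffee", docs))    # 1 + log(4/4) = 1.0  (common term, low IDF)
print(idf("espresso", docs))  # 1 + log(4/1) ≈ 2.39 (rare term, high IDF)
```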
TF-IDF Weighting

- TF-IDF weighting:
  - The importance of a term t to a document d:
    weight(t, d) = TF(t, d) * IDF(t)
  - Frequent in doc → high TF → high weight
  - Rare in collection → high IDF → high weight

$sim(q, d_k) = q_1 d_{k,1} w_{k,1} + q_2 d_{k,2} w_{k,2} + \ldots + q_n d_{k,n} w_{k,n}$

Both $q_i$ and $d_{k,i}$ are binary values, i.e., presence or absence of a word in the query and the document.
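Putting the pieces together, a hypothetical scoring sketch that combines the max-frequency TF and the IDF formula from the previous slides with binary query/document indicators; the tokenization and corpus are made up, and the particular TF variant chosen is one of several the lecture presents:

```python
import math
from collections import Counter

def tfidf_sim(query: list, doc: list, docs: list) -> float:
    """sim(q, d) = sum over query terms of TF(t,d) * IDF(t),
    with binary q_i / d_{k,i} (presence/absence) as on the slide."""
    counts = Counter(doc)
    max_freq = max(counts.values())
    score = 0.0
    for t in set(query):
        if counts[t] == 0:
            continue                              # d_{k,i} = 0
        tf = 0.5 + 0.5 * counts[t] / max_freq     # max-frequency TF
        m = sum(1 for d in docs if t in d)
        idf = 1 + math.log(len(docs) / m)         # IDF(t) = 1 + log(n/m)
        score += tf * idf
    return score

docs = [["java", "java", "starbucks"], ["starbucks", "microsoft"],
        ["java", "microsoft", "microsoft"]]
q = ["java", "microsoft"]
for d in docs:
    print(round(tfidf_sim(q, d, docs), 3))  # the third doc scores highest
```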
Problems with Vector Space Model

- Still limited to word-based matching
  - A document will never be retrieved if it does not contain any query word
- How to modify the vector space model?
Choice of Bases

[Figure (animation over five slides): query Q and documents D, D1 plotted in the term space with axes Java, Starbucks, and Microsoft; the bases are rotated so that D and Q are re-expressed as D' and Q' in the new bases]
Choosing Bases for VSM

- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a vector in the concept space, i.e., a mixture of concepts

       c1  c2  c3  c4  c5  m1  m2  m3  m4
  A1    1   1   1   1   1   0   0   0   0
  A2    0   0   0   0   0   1   1   1   1
Choosing Bases for VSM

- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a mixture of concepts
- How to define/select 'basic concepts'?
  - In the VS model, each term is viewed as an independent concept
Basic: Matrix Multiplication

[Figure (two slides): illustration of matrix multiplication]
Linear Algebra Basic: Eigen Analysis

- Eigenvectors (for a square $m \times m$ matrix S):

  $S v = \lambda v$

  where v is a (right) eigenvector and $\lambda$ is the corresponding eigenvalue
- Example: see the next slide
Linear Algebra Basic: Eigen Analysis

$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$

The first eigenvalue: $\lambda_1 = 3, \quad v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$

The second eigenvalue: $\lambda_2 = 1, \quad v_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$
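The example can be verified with NumPy (a sketch; np.linalg.eigh is suited to symmetric matrices, and eigenvectors are determined only up to sign):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
print(eigvals)        # [1. 3.]
print(eigvecs[:, 1])  # eigenvector for lambda=3, up to sign:
                      # [0.7071 0.7071] = [1/sqrt(2), 1/sqrt(2)]
```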
Linear Algebra Basic: Eigen Decomposition

$\lambda_1 = 3,\ \lambda_2 = 1, \quad v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix},\ v_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$

$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \underbrace{\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{U} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{U^T}$
Linear Algebra Basic: Eigen Decomposition

$S = U \Lambda U^T$ (the decomposition shown above)

- This is generally true for a symmetric square matrix
- Columns of U are eigenvectors of S
- Diagonal elements of $\Lambda$ are eigenvalues of S
Singular Value Decomposition

For an $m \times n$ matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

$A = U \Sigma V^T$

- U is $m \times m$, $\Sigma$ is $m \times n$, V is $n \times n$
- The columns of U are left singular vectors
- The columns of V are right singular vectors
- $\Sigma$ is a diagonal matrix with the singular values
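A NumPy sketch of the factorization on a made-up 3x2 matrix (chosen so that $A^T A$ equals the matrix S from the eigen example, making the singular values $\sqrt{3}$ and 1):

```python
import numpy as np

# A toy 3x2 term-document-style matrix; A^T A = [[2,1],[1,2]].
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)           # full SVD: A = U @ Sigma @ V^T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)  # embed singular values in an m x n matrix

print(np.allclose(A, U @ Sigma @ Vt))  # True: the factorization reconstructs A
print(s)                               # [1.732 1.0] = [sqrt(3), 1], decreasing
```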
Singular Value Decomposition

[Figure (three slides): illustration of SVD dimensions and sparseness]
Low Rank Approximation

- Approximate the matrix with the largest singular values and singular vectors

[Figure (three slides): keeping only the top singular values and singular vectors of the SVD]
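A sketch of rank-m approximation via SVD, using the document-term table from the earlier illustration as a toy matrix:

```python
import numpy as np

def low_rank(A: np.ndarray, m: int) -> np.ndarray:
    """Best rank-m approximation: keep the m largest singular values/vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]

# Rows D1-D4, columns Java / Starbucks / Microsoft (from the VSM illustration).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

A2 = low_rank(A, 2)
print(np.linalg.matrix_rank(A2))            # 2
print(np.round(np.linalg.norm(A - A2), 3))  # reconstruction error (Frobenius)
```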
Latent Semantic Indexing (LSI)

Computation: use singular value decomposition (SVD) with the first m largest singular values and singular vectors, where m is the number of concepts.

[Figure: the SVD factors labeled as concepts; U holds the representation of concepts in the term space, V the representation of concepts in the document space]
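A minimal LSI sketch following the stated computation; the term-document orientation of X and the scaling of the concept coordinates by the singular values are conventions I have assumed, not fixed by the slide:

```python
import numpy as np

def lsi(X: np.ndarray, m: int):
    """Map terms and documents into an m-dimensional concept space via SVD.
    Rows of X are terms, columns are documents (an assumed convention)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_rep = U[:, :m] * s[:m]       # terms in concept space
    doc_rep = Vt[:m, :].T * s[:m]     # documents in concept space
    return term_rep, doc_rep

X = np.array([[1.0, 0.0, 1.0, 1.0],   # term "java" across 4 docs
              [1.0, 1.0, 0.0, 1.0],   # term "starbucks"
              [0.0, 1.0, 1.0, 1.0]])  # term "microsoft"
terms, docs = lsi(X, m=2)
print(docs.shape)  # (4, 2): each document as a 2-d concept vector
```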
Finding “Good Concepts”

[Figure: illustration of finding good concepts]
SVD: Example: m=2

[Figure (four slides): a term-document matrix X and its rank-2 SVD approximation, keeping the first m=2 singular values]

$\Sigma_2 = \begin{pmatrix} 3.34 & 0 \\ 0 & 2.54 \end{pmatrix}$

$2.54 / 3.34 \approx 0.76$
SVD: Orthogonality

- The left singular vectors are orthogonal: $u_1 \cdot u_2 = 0$
- The right singular vectors are orthogonal: $v_1 \cdot v_2 = 0$

[Figure: the singular vectors $u_1, u_2$ and $v_1, v_2$ for the example with $\Sigma_2 = \mathrm{diag}(3.34, 2.54)$]
SVD: Properties

- X: rank(X) = 9
- X': rank(X') = 2 (the rank-2 approximation with $\Sigma_2 = \mathrm{diag}(3.34, 2.54)$)
- rank(S): the maximum number of row or column vectors of matrix S that are linearly independent
- SVD produces the best low-rank approximation
SVD: Visualization

[Figure (two slides): documents plotted in the original space and in the SVD-reduced space]

- SVD tries to preserve the Euclidean distance between document vectors