Matrix Decomposition and Graphical Model


An Introduction To Matrix
Decomposition and Graphical Model
Lei Zhang/Lead Researcher
Microsoft Research Asia
2012-04-17
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for “long query” retrieval
– Modeling Threaded Discussions
What Is Matrix Decomposition?
• We wish to decompose the matrix A by writing it as a product
of two or more matrices:
An×m = Bn×kCk×m
• Viewing A, B, C as collections of columns:
– An×m = (a1, a2, …, am), each ai is an n-dim data sample
– Bn×k = (b1, b2, …, bk), each bj is an n-dim basis vector, and the space B consists of
k bases.
– Ck×m = (c1, c2, …, cm), each ci holds the k-dim coordinates of ai projected
onto the space B
Why Do We Need Matrix Decomposition?
• Given one data sample:
a1 = Bn×kc1
(a11, a12, …, a1n)T = (b1, b2, …, bk) (c11, c12, …, c1k)T
• Another data sample:
a2 = Bn×kc2
• The m-th data sample:
am = Bn×kcm
• Together (m data samples):
(a1, a2, …, am) = Bn×k (c1, c2, …, cm)
An×m = Bn×kCk×m
Why Do We Need Matrix Decomposition?
(a1, a2, …, am) = Bn×k (c1, c2, …, cm)
An×m = Bn×kCk×m
• We wish to find a new set of bases B to represent the data
samples A; in the new space, A becomes C.
• In general, B captures the common features in A, while C
carries specific characteristics of the original samples.
• In PCA: B is eigenvectors
• In SVD: B is right (column) eigenvectors
• In LDA: B is discriminant directions
• In NMF: B is local features
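To make the A ≈ BC idea concrete, here is a minimal numpy sketch (not from the slides): it builds a rank-k data matrix, obtains k basis vectors B via a truncated SVD, and recovers the coordinate matrix C. All variable names and sizes are illustrative.

```python
import numpy as np

# A: n x m data matrix, each of the m columns is an n-dim sample.
rng = np.random.default_rng(0)
n, m, k = 100, 50, 5
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))  # rank-k data

# One way to obtain a rank-k decomposition A ≈ B C is a truncated SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k]                      # n x k: k basis vectors (columns)
C = np.diag(s[:k]) @ Vt[:k, :]    # k x m: coordinates of each sample in basis B

print(np.allclose(A, B @ C))      # True: A is exactly rank k here
```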
PRINCIPAL COMPONENT ANALYSIS
Definition – Eigenvalue & Eigenvector
Given an m × m matrix C, for any λ and w ≠ 0, if
Cw = λw
then λ is called an eigenvalue, and w is called an eigenvector of C.
Definition – Principal Component Analysis
– Principal Component Analysis (PCA)
– Karhunen-Loève transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data
samples
• Each row is a data vector, each column represents a variable
• A is centered: the estimated mean is subtracted from each
column, so each column has zero mean.
• Covariance matrix C (m × m): C = AᵀA
Principal Component Analysis
• C can be decomposed as follows:
C = UΛUᵀ
• Λ is a diagonal matrix diag(λ1, λ2, …, λm), each λi is an eigenvalue
• U is an orthogonal matrix, each column is an eigenvector:
UᵀU = I ⇒ U⁻¹ = Uᵀ
Maximizing Variance
• The objective of the rotation transformation is to find the
maximal variance
• Projection of data along w is Aw.
• Variance: σw² = (Aw)ᵀ(Aw) = wᵀAᵀAw = wᵀCw,
where C = AᵀA is the covariance matrix of the data (A is
centered!)
• Task: maximize the variance subject to the constraint wᵀw = 1.
Optimization Problem
• Maximize the Lagrangian
J(w) = wᵀCw − λ(wᵀw − 1)
where λ is the Lagrange multiplier
• Differentiating with respect to w yields
Cw − λw = 0
• Eigenvalue equation: Cw = λw, where C = AᵀA.
• Once the first principal component is found, we continue in the same
fashion to look for the next one, which is orthogonal to (all) the principal
component(s) already found.
Property: Data Decomposition
• PCA can be treated as data decomposition
a = UUᵀa
= (u1, u2, …, un)(u1, u2, …, un)ᵀ a
= (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)ᵀ
= (u1, u2, …, un)(b1, b2, …, bn)ᵀ
= Σi bi·ui
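A small numpy sketch of the PCA steps described above, assuming a centered row-wise data matrix A: eigendecompose the covariance C = AᵀA and verify the data decomposition a = Σi ⟨ui, a⟩ ui. The data and dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))          # 200 samples (rows) x 4 variables
A = A - A.mean(axis=0)                     # center each column

C = A.T @ A                                # covariance (up to a 1/(n-1) factor)
lam, U = np.linalg.eigh(C)                 # C = U diag(lam) U^T, U orthogonal
order = np.argsort(lam)[::-1]              # sort eigenvalues in decreasing order
lam, U = lam[order], U[:, order]

# Data decomposition of one (centered) sample: a = sum_i <u_i, a> u_i
a = A[0]
b = U.T @ a                                # coordinates b_i = <u_i, a>
print(np.allclose(a, U @ b))               # True: a = U U^T a

# Keeping only the first principal component gives the best 1-D projection
a_1d = U[:, :1] @ (U[:, :1].T @ a)
print(a_1d)
```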
Face Recognition – Eigenface
• Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces,
CVPR 1991 (Citation: 2654)
• The eigenface approach
– images are points in a vector space
– use PCA to reduce dimensionality
– face space
– compare projections onto face space to recognize faces
PageRank – Power Iteration
• Link matrix Q: Qij = 1/Nj if page j links to page i, and 0 otherwise
• Column j has nonzero elements in positions corresponding to the
outlinks of j (Nj in total)
• Row i has nonzero elements in positions corresponding to the
inlinks Ii.
Column-Stochastic & Irreducible
• Column-stochastic: every column of the link matrix sums to one,
i.e. each column is a probability distribution over outgoing links
• Irreducible: a random-jump (teleportation) term is added so that every
page can reach every other page, e.g. A = αQ + (1 − α)(1/n)eeᵀ
Iterative PageRank Calculation
• For k = 1, 2, …:
r(k) = A r(k−1), normalized so that ‖r(k)‖₁ = 1
• Equivalently, solve Ar = r (λ = 1, A is a Markov chain transition matrix)
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r0 in terms of the
eigenvectors: r0 = c1u1 + c2u2 + … + cnun
• Then A^k r0 = c1λ1^k u1 + c2λ2^k u2 + …; since λ1 = 1 > |λ2| ≥ …, the
iteration converges to the dominant eigenvector u1.
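A minimal sketch of the power iteration for PageRank on a made-up four-page graph. The link structure, the damping factor α = 0.85, and the teleportation term are illustrative choices, not values from the slides.

```python
import numpy as np

# Column-stochastic link matrix Q: Q[i, j] = 1/N_j if page j links to page i.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # toy web graph (made up)
n = 4
Q = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        Q[i, j] = 1.0 / len(outs)

alpha = 0.85                                        # damping / random-jump weight
A = alpha * Q + (1 - alpha) * np.ones((n, n)) / n   # irreducible, column-stochastic

r = np.ones(n) / n                             # initial approximation r0
for _ in range(100):                           # power iteration: r_k = A r_{k-1}
    r_new = A @ r
    r_new /= r_new.sum()                       # keep it a probability vector
    if np.linalg.norm(r_new - r, 1) < 1e-12:
        break
    r = r_new

print(r)   # dominant eigenvector (eigenvalue 1): the PageRank scores
```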
SINGULAR VALUE DECOMPOSITION
SVD - Definition
• Any m × n matrix A, with m ≥ n, can be factorized as
A = UΣVᵀ
• where U (m × n) and V (n × n) have orthonormal columns and
Σ = diag(σ1, σ2, …, σn).
Singular Values And Singular Vectors
• The diagonal elements σj of Σ are the singular values of the
matrix A.
• The columns of U and V are the left singular vectors and
right singular vectors, respectively.
• Equivalent (outer-product) form of SVD:
A = Σj σj uj vjᵀ
Matrix approximation
• Theorem: Let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk =
diag(σ1, σ2, …, σk), and define Ak = UkΣkVkᵀ.
• Then
‖A − Ak‖₂ = min{‖A − B‖₂ : rank(B) ≤ k} = σk+1
• It means that the best approximation of rank k for the matrix
A is Ak = UkΣkVkᵀ.
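A short numpy sketch of the rank-k approximation: Ak is built from the first k singular triplets, and the 2-norm error equals σk+1, as the theorem states. The random matrix and the choice of k are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation A_k

# Eckart-Young: the 2-norm error of the best rank-k approximation is sigma_{k+1}
print(np.linalg.norm(A - A_k, 2), s[k])       # the two numbers agree
```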
SVD and PCA
• We can write:
AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣ²Vᵀ
• Remember that in PCA, we treat A as a row matrix (rows are samples)
• V is just the eigenvectors of AᵀA
– Each column in V is an eigenvector of the row matrix A
– we use V to approximate a row in A
• Equivalently, we can write:
AAᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣ²Uᵀ
• U is just the eigenvectors of AAᵀ
– Each column in U is an eigenvector of the column matrix A
– We use U to approximate a column in A
Example - LSI
• Build a term-by-document matrix A
• Compute the SVD of A:
A = UΣVᵀ
• Approximate A by
Ak = Uk(ΣkVkᵀ) = UkDk
– Uk: orthogonal basis that we use to approximate all the documents
– Dk: column j holds the coordinates of document j in the new basis
– Dk is the projection of A onto the subspace spanned by Uk.
SVD and PCA
• For symmetric A, SVD is closely related to PCA
• PCA: A = UΛUT
– U and Λ are eigenvectors and eigenvalues.
• SVD: A = UΛVᵀ
– U holds the left (column) eigenvectors
– V holds the right (row) eigenvectors
– Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference of A in PCA and SVD
– SVD: A is directly the data, e.g. term-by-document matrix
– PCA: A is covariance matrix, A=XTX, each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation/ preprocessing:
– Indexing: collecting terms
– Use stop list: eliminate ”meaningless” words
– Stemming
2. Construct the term-by-document matrix; use sparse matrix
storage.
3. Query matching: distance measures.
4. Data compression by low rank approximation: SVD
5. Ranking and relevance feedback.
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic
structure in the data.
• E.g. car and automobile occur in similar documents, as do
cows and sheep.
• This structure can be enhanced by projecting the data (the
term-by-document matrix and the queries) onto a lower
dimensional space using SVD.
Similarity Measures
• Term to term
AAᵀ = UΣ²Uᵀ = (UΣ)(UΣ)ᵀ
UΣ are the coordinates of A (rows) projected into the space V
• Document to document
AᵀA = VΣ²Vᵀ = (VΣ)(VΣ)ᵀ
VΣ are the coordinates of A (columns) projected into the space U
Similarity Measures
• Term to document
A = UΣVᵀ = (UΣ^½)(VΣ^½)ᵀ
UΣ^½ are the coordinates of A (rows) projected into the space V
VΣ^½ are the coordinates of A (columns) projected into the space U
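A small sketch of these similarity measures on a made-up term-by-document matrix: terms are projected as UΣ, documents as VΣ, and documents are then compared by cosine similarity in the latent space. The matrix values and the choice of k are illustrative.

```python
import numpy as np

# Toy term-by-document matrix A (terms = rows, documents = columns); made up.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

term_coords = Uk @ Sk          # rows of A projected: A A^T = (U S)(U S)^T
doc_coords  = Vk @ Sk          # columns of A projected: A^T A = (V S)(V S)^T

# Document-to-document similarity in the latent space (cosine):
d = doc_coords
cos = d @ d.T / (np.linalg.norm(d, axis=1, keepdims=True) *
                 np.linalg.norm(d, axis=1))
print(np.round(cos, 2))
```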
HITS (Hyperlink Induced Topic Search)
• Idea: Web includes two flavors of prominent pages:
– authorities contain high-quality information,
– hubs are comprehensive lists of links to authorities
• A page is a good authority, if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs and good hubs
point to good authorities.
(Diagram: hub pages on one side pointing to authority pages on the other.)
Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing
ai = Σ{j : j→i} hj,   hi = Σ{j : i→j} aj
• Define the adjacency matrix L of the directed web graph:
Lij = 1 if page i links to page j, and 0 otherwise
• Now
a = Lᵀh,   h = La
HITS and SVD
• L: rows are outlinks, columns are inlinks.
• a will be the dominant eigenvector of the authority matrix LTL
• h will be the dominant eigenvector of the hub matrix LLT
• In fact, a and h are the first right and left singular vectors of L, respectively!!
• We are in fact running SVD on the adjacency matrix.
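A minimal sketch of HITS on a toy directed graph (adjacency matrix made up): alternating a ← Lᵀh, h ← La with normalization converges to the dominant eigenvectors of LᵀL and LLᵀ, which match the first singular vectors of L up to sign.

```python
import numpy as np

# Adjacency matrix L of a toy directed graph: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)

n = L.shape[0]
a = np.ones(n)          # authority scores
h = np.ones(n)          # hub scores
for _ in range(100):    # a <- L^T h, h <- L a, then normalize
    a = L.T @ h
    h = L @ a
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# a and h are the dominant eigenvectors of L^T L and L L^T,
# i.e. the first right and left singular vectors of L.
U, s, Vt = np.linalg.svd(L)
print(np.round(a, 3), np.round(np.abs(Vt[0]), 3))   # same up to sign
print(np.round(h, 3), np.round(np.abs(U[:, 0]), 3))
```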
HITS vs PageRank
• PageRank may be computed once, HITS is computed per
query.
• HITS takes query into account, PageRank doesn’t.
• PageRank has no concept of hubs
• HITS is sensitive to local topology: insertion or deletion of a
small number of nodes may change the scores a lot.
• PageRank is more stable, because of its random jump step.
NMF – NON-NEGATIVE MATRIX
FACTORIZATION
Definition
• Given a nonnegative matrix Vn×m, find non-negative matrix
factors Wn×k and Hk×m such that
Vn×m≈Wn×kHk×m
• V: column matrix, each column is an n-dimensional data sample
• W: each of the k columns wi is one basis vector
• H: coordinates of V projected onto the space spanned by W; for one sample,
vj ≈ Wn×khj
Motivation
• Non-negativity is natural in many applications...
• Probability is also non-negative
• Additive model to capture local structure
Multiplicative Update Algorithm
• Cost function → Euclidean distance: minimize ‖V − WH‖² subject to W, H ≥ 0
• Multiplicative update (Lee & Seung):
Haμ ← Haμ (WᵀV)aμ / (WᵀWH)aμ,   Wia ← Wia (VHᵀ)ia / (WHHᵀ)ia
Multiplicative Update Algorithm
• Cost function → divergence:
D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
– Reduces to the Kullback-Leibler divergence when Σij Aij = Σij Bij = 1
– A and B can then be regarded as normalized probability distributions.
• Multiplicative update:
Haμ ← Haμ [Σi Wia Viμ/(WH)iμ] / Σk Wka,   Wia ← Wia [Σμ Haμ Viμ/(WH)iμ] / Σν Haν
• pLSA is NMF with the KL divergence
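A short numpy sketch of the multiplicative updates for the Euclidean cost ‖V − WH‖²; the data, sizes, iteration count, and the small epsilon guard are illustrative implementation choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 20, 30, 4
V = rng.random((n, m))            # nonnegative data, one sample per column

W = rng.random((n, k))            # nonnegative basis
H = rng.random((k, m))            # nonnegative coordinates
eps = 1e-12

for it in range(500):
    # Lee & Seung multiplicative updates for the Euclidean cost ||V - WH||^2
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(V - W @ H))  # reconstruction error decreases monotonically
```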
NMF vs PCA
• n = 2429 faces, m = 19x19 pixels
• Positive values are illustrated with black pixels and negative
values with red pixels
• NMF → parts-based representation
• PCA → holistic representation
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix
factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative
matrix factorization. Nature 401, 788-791 (1999).
Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007,
University of Helsinki (highly recommended)
Outline
• Basic concepts
– Likelihood, i.i.d.
– ML, MAP and Bayesian Inference
– Expectation-Maximization
– Mixture Gaussian parameter estimation
• pLSA
– Motivation
– Derivation & Geometry properties
– Applications
• LDA
– Motivation – why add a hyperparameter
– Dirichlet Distribution
– Variational EM
– Relations with other topic models
– Incorporating category information
• Summary
Not Included
• General graphical model theories
• Markov random field (belief propagation)
• Detailed derivation of LDA
BASIC CONCEPTS
What Is Machine Learning?
Data
• Let x = (x1, x2, …, xD)ᵀ denote a data point, and D = {x(1), x(2), …, x(N)} a
data set. D is sometimes associated with desired outputs y1, y2, ….
Predictions
• We are generally interested in predicting something based on the observed
data set.
• Given D what can we say about x(N+1)?
Model
• To make predictions, we need to make some assumptions. We can often
express these assumptions in the form of a model, with some parameters, θ
• Given data D, we learn the model parameters θ, from which we can predict
new data points.
• The model can often be expressed as a probability distribution over data
points
Likelihood Function
• Given a set of parameter values, probability density function
(PDF) will show that some data are more probable than other
data.
• Inversely, given the observed data and a model of interest,
Likelihood function is defined as:
L(θ) = fθ(x|θ) = p(x|θ)
• That is, likelihood function L(θ) will show that some
parameters are more likely to have produced the data
Maximum Likelihood (ML)
• Maximum likelihood will find the best model parameters that
make the data “most likely” generated from this model.
• Suppose we are given n data samples (x1, x2, …, xn); the likelihood is
L(θ) = p(x1, x2, …, xn | θ)
• Maximum likelihood finds the θ that maximizes L(θ): θML = argmaxθ L(θ)
• Predictive distribution: p(x | θML)
I.I.D. – Independent, Identically Distributed
• I.I.D. means
p(x1, x2, …, xn | θ) = p(x1|θ) p(x2|θ) ⋯ p(xn|θ)
• The problem is considerably simplified as:
L(θ) = Πi p(xi|θ)
• Usually, the log likelihood is used:
l(θ) = log L(θ) = Σi log p(xi|θ)
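A minimal sketch of i.i.d. maximum likelihood for a 1-D Gaussian: the log likelihood is a sum of per-sample terms, and the ML estimates (sample mean and variance) score at least as high as any other parameter setting. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. samples (made up)

def log_likelihood(x, mu, sigma2):
    # i.i.d.: log p(x_1..x_n | theta) = sum_i log p(x_i | theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))

# ML estimates for a Gaussian (closed form): sample mean and (biased) variance
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()

print(mu_ml, sigma2_ml)
print(log_likelihood(x, mu_ml, sigma2_ml) >= log_likelihood(x, 0.0, 1.0))  # True
```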
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge
(Introduction to Machine Learning, Lectures 1-2 Slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note,
2005-2008
EXPECTATION MAXIMIZATION
Why We Need EM?
• The Expectation–Maximization (EM) algorithm is a method for
ML learning of parameters in latent variable models.
• Why do we need latent variables?
– To describe complex models → Gaussian Mixture Model
– To discover the intrinsic structure inside a data set → topic
models, such as pLSA and LDA.
More General
• Data set: D = {x1, …, xN}
• Likelihood: p(D|θ) = Πn p(xn|θ), where each p(xn|θ) = Σz p(xn, z|θ)
sums (or integrates) over the latent variables z
• Goal: learn maximum likelihood (ML) parameter values
• The maximum likelihood procedure finds parameters θ such
that: θML = argmaxθ p(D|θ)
• Because of the integral (or sum) over latent variables, the
likelihood can be a very complicated, and hard to optimize,
function.
The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent variable
model likelihood. It starts from arbitrary values of the
parameters, and iterates two steps:
• E step: Fill in values of latent variables according to posterior
given data.
• M step: Maximize likelihood as if latent variables were not
hidden.
• Decomposes difficult problems into series of tractable steps.
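As a concrete instance of these two steps, here is a sketch of EM for a two-component 1-D Gaussian mixture (data and initialization made up): the E step fills in posterior responsibilities for the latent component assignments, and the M step re-estimates weights, means, and variances in closed form.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])  # toy data

K = 2
pi = np.ones(K) / K                     # mixing weights
mu = rng.standard_normal(K)             # component means
var = np.ones(K)                        # component variances

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(100):
    # E step: posterior responsibility of each component for each point
    r = pi * gauss(x[:, None], mu, var)          # N x K
    r /= r.sum(axis=1, keepdims=True)

    # M step: maximize the expected complete-data log likelihood
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)
```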
Jensen’s Inequality
• For a concave function f (such as log): f(E[x]) ≥ E[f(x)]
Lower Bounding the Log Likelihood
• Observed data D = {yn}; Latent variables X = {xn}; Parameters θ.
• Goal: Maximize the log likelihood (i.e. ML learning) wrt θ:
L(θ) = log p(D|θ) = log ΣX p(D, X|θ)
• Any distribution, q(X), over the hidden variables can be used to obtain
a lower bound on the log likelihood using Jensen’s inequality:
L(θ) = log ΣX q(X) p(D, X|θ)/q(X) ≥ ΣX q(X) log [p(D, X|θ)/q(X)]
= ⟨log p(D, X|θ)⟩q + H[q] =: F(q, θ)
• where H[q] is the entropy of q(X).
The E and M Steps of EM
• The lower bound on the log likelihood is given by:
F(q, θ) = ΣX q(X) log [p(D, X|θ)/q(X)]
• EM alternates between:
• E step: optimize F(q, θ) wrt the distribution over hidden variables,
holding the parameters fixed:
q(k)(X) = argmaxq F(q(X), θ(k−1))
• M step: maximize F(q, θ) wrt the parameters, holding the hidden
distribution fixed:
θ(k) = argmaxθ F(q(k)(X), θ)
The E Step
• E step: for fixed θ,
F(q, θ) = L(θ) − KL(q(X) ‖ p(X|D, θ))
• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves
that bound when KL(q(X) ‖ p(X|D, θ)) = 0.
• So, the E step simply sets q(k)(X) = p(X|D, θ(k−1)).
The M Step
• M step: maximize F(q, θ) wrt the parameters, holding the hidden
distribution q fixed:
θ(k) = argmaxθ F(q(k), θ) = argmaxθ ⟨log p(D, X|θ)⟩q(k)
• The second equality comes from the fact that the entropy of q(X) does not
depend directly on θ.
• The specific form of the M step depends on the model. Often the
maximum wrt θ can be found analytically.
EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
L(θ(k−1)) = F(q(k), θ(k−1)) ≤ F(q(k), θ(k)) ≤ L(θ(k))
• The E step brings F(q, θ) up to equal the likelihood L(θ) ("touch").
• The M-step maximizes F(q, θ) wrt θ ("lift").
• F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the nonnegativity of KL.
Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge
(Unsupervised learning, Lecture 5 Slides)
• Christopher M. Bishop (2006) Pattern Recognition and Machine Learning.
Springer
WHY DO WE NEED GRAPHICAL
MODEL?
Why Do We Need Graphical Models?
• Cons
– A graphical model can become complex, even with only a few nodes (circles)…
– We have to make many assumptions.
• Pros
– We do need probability to explain our world. But joint probability is
hard to compute.
– Graphical model can help us analyze and understand our problems.
– Graphs are an intuitive way of representing and visualizing the
relationships between many variables.
– With a graphical model, we can decouple joint probability to
conditional probabilities, which are usually easier.
Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG Model / Bayesian network corresponds to a factorization of
the joint probability distribution:
p(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D)
• In general:
p(X1, …, XN) = Πi p(Xi | Xpa(i))
• where pa(i) are the parents of node i.
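A tiny sketch of this factorization for binary variables: the joint of the five-node example above is computed as the product of the local conditionals, and summing it over all configurations gives 1. The conditional probability tables are made up.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)

# Made-up conditional probability tables for binary variables (values for "=1").
pA = 0.3
pB = 0.6
pC = rng.random((2, 2))        # p(C=1 | A, B)
pD = rng.random((2, 2))        # p(D=1 | B, C)
pE = rng.random((2, 2))        # p(E=1 | C, D)

def bern(p1, v):               # p(v) for a Bernoulli with p(=1) = p1
    return p1 if v == 1 else 1 - p1

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
    return (bern(pA, a) * bern(pB, b) * bern(pC[a, b], c)
            * bern(pD[b, c], d) * bern(pE[c, d], e))

# The factorized joint is a proper distribution: it sums to 1.
total = sum(joint(*v) for v in itertools.product([0, 1], repeat=5))
print(round(total, 10))        # 1.0
```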
Directed Graphs for Statistical Models:
Plate Notation
• A data set of N points generated from a Gaussian:
PLSA – PROBABILISTIC LATENT
SEMANTIC ANALYSIS
Latent Semantic Indexing (LSI) Review
• For natural language queries, simple term matching does not
work effectively
– Ambiguous terms
– Queries for the same information need vary with personal style
• Latent semantic indexing
– Creates this ‘latent semantic space’ (hidden meaning)
• LSI puts documents together even if they don’t have common
words, if the docs share frequently co-occurring terms
• Disadvantages:
– Statistical foundation is missing
pLSA – Probabilistic Latent Semantic Analysis
• Automated Document Indexing and Information retrieval
• Identification of Latent Classes using an Expectation
Maximization (EM) Algorithm
• Shown to solve
– Polysemy
• Java could mean “coffee” and also the programming language Java
• Cricket is a “game” and also an “insect”
– Synonymy
• “computer”, “pc”, “desktop” all could mean the same
• Has a better statistical foundation than LSA
pLSA
(Plate diagram: for each of the M documents d, each of its Nd words w is
generated by first choosing a hidden topic z; the unrolled view shows
variables z1, …, zN and words w1, …, wN.)
z1, …, zN are hidden variables, zi ∈ [1, K].
K is the number of latent topics.
pLSA
(Diagram: documents d1, d2, …, dM unrolled, each with its own hidden
topics z and observed words w.)
p(w|z=1), p(w|z=2), …, p(w|z=K) are shared across all documents.
Likelihood:
L = Πd Πw p(d, w)^n(d,w),   with p(d, w) = p(d) Σz p(z|d) p(w|z)
Joint Probability vs Likelihood
• Joint probability (over observed and hidden variables):
p(d, z, w) = p(d) p(z|d) p(w|z)
• Likelihood (only for observed variables):
L = Πd Πw p(d, w)^n(d,w),   p(d, w) = p(d) Σz p(z|d) p(w|z)
• p(d) is assumed to be uniform
Document Decomposition
• Each document can be decomposed as:
p(w|d) = Σz p(w|z) p(z|d)
• This is similar to the matrix decomposition, if we consider
each discrete distribution as a vector:
p(w|d) = ZV×K p(z|d), where the columns of Z are the topic distributions p(w|z)
• With many documents, we hope to find latent topics as a
common basis.
pLSA – Objective Function
• pLSA tries to maximize the log likelihood:
L = Σd Σw n(d, w) log p(d, w) = Σd Σw n(d, w) log [p(d) Σz p(w|z) p(z|d)]
• Due to the summation over z inside the log, we have to resort to
EM.
EM Steps
• E-Step
– Expectation of the likelihood function is calculated with the current
parameter values
• M-Step
– Update the parameters with the calculated posterior probabilities
– Find the parameters that maximize the likelihood function
Lower Bounding the Log Likelihood
EM Steps
• The E-Step:
p(z|d, w) = p(w|z) p(z|d) / Σz' p(w|z') p(z'|d)
• The M-Step:
p(w|z) ∝ Σd n(d, w) p(z|d, w),   p(z|d) ∝ Σw n(d, w) p(z|d, w)
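A compact numpy sketch of these pLSA EM updates on a made-up term-by-document count matrix; the array shapes (V terms × K topics × D documents) and the small epsilon guards are implementation choices.

```python
import numpy as np

rng = np.random.default_rng(7)
V, D, K = 6, 4, 2                      # vocabulary size, documents, topics
n = rng.integers(0, 5, size=(V, D))    # toy term-by-document counts n(w, d)

p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)

for it in range(100):
    # E step: p(z|d,w) proportional to p(w|z) p(z|d)
    post = p_w_z[:, :, None] * p_z_d[None, :, :]          # V x K x D
    post /= post.sum(axis=1, keepdims=True) + 1e-12

    # M step: p(w|z) ∝ Σ_d n(w,d) p(z|d,w),  p(z|d) ∝ Σ_w n(w,d) p(z|d,w)
    weighted = n[:, None, :] * post                       # V x K x D
    p_w_z = weighted.sum(axis=2); p_w_z /= p_w_z.sum(axis=0) + 1e-12
    p_z_d = weighted.sum(axis=0); p_z_d /= p_z_d.sum(axis=0) + 1e-12

print(np.round(p_z_d, 3))              # topic mixture of each document
```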
Latent Subspace
pLSA vs LSA
• LSA and PLSA perform dimensionality reduction
– In LSA, by keeping only K singular values
– In PLSA, by having K aspects
• Comparison to SVD
– U Matrix related to P(z|d) (doc to aspect)
– V Matrix related to P(w|z) (aspect to term)
– Σ matrix related to P(z) (aspect strength)
pLSA vs LSA
• The main difference is the way the approximation is done
• PLSA generates a model (aspect model) and maximizes its
predictive power
• Selecting the proper value of K is heuristic in LSA
• Model selection in statistics can determine optimal K in PLSA
Applications
• Text mining / topic discovering
• Scene Classification
Text Mining
Scene Classification
Classification Result
Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of
Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA.
Proceedings of the European Conference on Computer Vision (ECCV), 2006.
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T.
Discovering Object Categories in Image Collections. MIT AI Lab Memo
AIM-2005-005, February 2005.
LDA – LATENT DIRICHLET
ALLOCATION
Problems in pLSA
• pLSA provides no probabilistic model at the document level.
Each doc has its own topic mixture proportion.
• The number of parameters in the model grows linearly with M
(the number of documents in the training set).
Problems in pLSA
• There is no constraint for distributions p(z|di).
(Diagram: each document di has its own unconstrained topic mixture
p(z|di), with its own hidden topics z and words w.)
• This easily leads to serious over-fitting.
Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each
document are assumed to follow some distribution.
• Requirement for such a distribution:
– The samples (mixture proportions) generated from it are K-tuples of
non-negative numbers that sum to one, i.e. parameters of
multinomials.
– Easy to optimize.
• Dirichlet Distribution is one of such distributions.
• The space of all of these multinomials has a nice geometric
interpretation as a (k-1)-simplex.
Dirichlet Distribution
• Definition:
p(x1, x2, …, xk | α1, α2, …, αk) = [Γ(Σi=1..k αi) / Πi=1..k Γ(αi)] Πi=1..k xi^(αi − 1)
s.t. xi ≥ 0, Σi=1..k xi = 1
• The density is zero outside this open (K − 1)-dimensional
simplex.
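A small sketch that draws samples from a Dirichlet and evaluates the density defined above; α = (6, 2, 2) matches one of the example settings shown next, everything else is illustrative.

```python
import numpy as np
from math import gamma

alpha = np.array([6.0, 2.0, 2.0])       # parameters (one of the example settings)

def dirichlet_pdf(x, alpha):
    # p(x | alpha) = Gamma(sum alpha_i) / prod Gamma(alpha_i) * prod x_i^(alpha_i - 1)
    norm = gamma(alpha.sum()) / np.prod([gamma(a) for a in alpha])
    return norm * np.prod(x ** (alpha - 1))

rng = np.random.default_rng(8)
samples = rng.dirichlet(alpha, size=5)  # each sample is a point on the 2-simplex
print(samples.sum(axis=1))              # all 1.0: non-negative numbers summing to one
print(dirichlet_pdf(samples[0], alpha))
```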
Example Dirichlet Distributions (K=3)
• Various parameter vectors α:
(6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6)
Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σi=1..k αi:
α0 = 0.1,  α0 = 1,  α0 = 10
The LDA Model
(Unrolled graphical model: the corpus-level parameters α and β are shared
across documents; each document has its own θ, which generates its topics
z1 … z4 and words w1 … w4.)
• For each document,
• Choose θ ~ Dirichlet(α)
• For each of the N words wn:
– Choose a topic zn ~ Multinomial(θ)
– Choose a word wn from p(wn|zn, β), a multinomial probability
conditioned on the topic zn.
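A minimal sketch of this generative process with made-up dimensions (K topics, V words, a few words per document); α and β here are random placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(9)
K, V, N_words, M_docs = 3, 8, 10, 2            # topics, vocab size, words/doc, docs

alpha = np.full(K, 0.5)                        # Dirichlet parameter (made up)
beta = rng.dirichlet(np.ones(V), size=K)       # K topic-word distributions p(w|z)

for d in range(M_docs):
    theta = rng.dirichlet(alpha)               # theta ~ Dirichlet(alpha)
    words = []
    for n in range(N_words):
        z = rng.choice(K, p=theta)             # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])           # w_n ~ p(w | z_n, beta)
        words.append(w)
    print(f"doc {d}: theta={np.round(theta, 2)}, words={words}")
```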
Joint Probability
• Given parameters α and β:
p(θ, z, w | α, β) = p(θ|α) Πn=1..N p(zn|θ) p(wn|zn, β)
• where p(zn|θ) is simply θi for the unique topic i with zn^i = 1.
Likelihood
• Joint probability:
p(θ, z, w | α, β) = p(θ|α) Πn p(zn|θ) p(wn|zn, β)
• Marginal distribution of a document:
p(w | α, β) = ∫ p(θ|α) Πn Σzn p(zn|θ) p(wn|zn, β) dθ
• Likelihood over all the documents:
L(α, β) = Πd p(wd | α, β)
Inference
• The log likelihood is a sum over the individual documents.
• As in EM, Jensen’s inequality gives a lower bound.
Inference
• In E-Step, we need to compute the posterior distribution of
the hidden variables.
• Unfortunately, this distribution is intractable to compute in
general.
• We have to resort to a variational approach.
Variational Inference
• In variational inference, we consider a simplified graphical
model with variational parameters γ, φ and minimize the KL
divergence between the variational and posterior
distributions.
Variational Inference
• The difference between the lower bound and the likelihood is
the KL divergence.
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ
is equivalent to minimizing the KL divergence.
VBEM vs EM
• Only different in the E-Step.
• In standard EM, q(X) is directly set to p(X|D,θ), and let KL=0.
• In VBEM, it is intractable to compute p(X|D,θ). Instead, it
approximates p(X|D,θ) by a variational distribution q(X) obtained by
minimizing KL(q(X) ‖ p(X|D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).
Parameter Estimation
• Given a corpus of documents, we would like to find the
parameters α and β which maximize the likelihood of the
observed data.
• Strategy (Variational EM):
• Lower bound log p(w|α, β) by a function L(γ, φ; α, β)
• Repeat until convergence:
– E: Maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ.
– M: Maximize the bound with respect to the model parameters α and β.
Parameter Estimation
• E-Step: variational inference – repeat until convergence:
φ: φni ∝ βi,wn exp(Ψ(γi))
γ: γi = αi + Σn φni
• M-Step: parameter estimation
β: βij ∝ Σd Σn φdni wdn^j
α: can be updated using the Newton-Raphson method.
Topic Examples in a 100-topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus.
Classification (50-topic LDA + SVM)
• Reuters-21578 dataset – contains 8000 documents and
15,818 words.
(a) EARN vs. NOT EARN.
(b) GRAIN vs. NOT GRAIN.
Problems in LDA
• Dirichlet Distribution is helpful to avoid over-fitting. But the
assumption might be too strong.
(Unrolled graphical model again: a shared α and β, with each document’s θ
generating its topics z1 … z4 and words w1 … w4.)
A Bayesian Hierarchical Model for Learning
Natural Scene Categories
• Incorporating category information
(Plate diagram: variables θ, π, z, x, β; the plates range over the Nd
patches of each of the M images.)
Codebook
• 174 local image patches
• Detection: evenly sampled grid, random sampling, saliency detector,
Lowe’s DoG detector
• Representation: normalized 11×11 gray values, 128-dim SIFT
Topic Distribution in Different Categories
Topic Hierarchical Clustering
More Topic Models
• Dynamic topic models, ICML 2006
• Correlated Topic Model, NIPS 2005
• Hierarchical Dirichlet Process, Journal of the American
Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …
Are you really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E.
Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation,
Journal of Machine Learning Research (JMLR), 2003
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian
Inference. PhD thesis, Gatsby Unit, University College London, 2003.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning
Natural Scene Categories. CVPR, 2005.
Outline
• Matrix Decomposition
– PCA, SVD, NMF
– LDA, ICA, Sparse Coding, etc.
• Graphical Model
– Basic concepts in probabilistic machine learning
– EM
– pLSA
– LDA
• Two Applications
– Document decomposition for “long query” retrieval, ICCV 2009
– Modeling Threaded Discussions, SIGIR 2009
Large-Scale Indexing for “Long Query”
Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang,
Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009
The Long Query Problem
• If a query contains 1000 keywords:
– Need to access 1000 inverted lists
– The intersection of 1000 inverted lists may be empty
– The union of 1000 inverted lists may be the whole corpus
• Dimension reduction?
(Figure: Img1 is originally a sparse term vector of dimension 1 million,
e.g. (Term1, Term2, Term3, Term4, …, TermN) = (1, 2, 0, 0, …, 2); topic
projection maps it to a dense 200-dimensional feature vector,
e.g. (f1, f2, …, fM) = (0.2, 0.1, …, 0.03).)
Key Idea:
Dimension Reduction + Residual Error Preservation
p = Xw + ε
(Figure: an image = a low-dimensional projection plus a sparse residual of
a few words (~10 words).)
• p: original TF-IDF vector in the vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error
Orthogonal Decomposition
p = Xw + ε
• In block form:
(p1, p2, …, pW)ᵀ = (X1, X2, …, Xk) (w1, …, wk)ᵀ + (ε1, ε2, …, εW)ᵀ
– w: low-dimensional representation
– X1, …, Xk: base vectors (columns of the W × k matrix X)
– ε: residual
(Figure: an image = a low-dimensional projection plus a few residual words
(~10 words).)
• Similarity between two decomposed vectors:
⟨p, q⟩ = wpᵀwq + εpᵀεq
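A short numpy sketch of the orthogonal decomposition p = Xw + ε: with an orthonormal basis X, the residual is orthogonal to the subspace, so ⟨p, q⟩ splits exactly into wpᵀwq + εpᵀεq. The basis (from a random QR) and the vector sizes are illustrative; how X is actually learned in the paper is not shown here.

```python
import numpy as np

rng = np.random.default_rng(10)
W_dim, k = 1000, 20                            # vocabulary size, latent dimension

# Orthonormal projection basis X (columns X1..Xk); here taken from a random QR.
X, _ = np.linalg.qr(rng.standard_normal((W_dim, k)))

def decompose(p):
    w = X.T @ p              # low-dimensional representation
    eps = p - X @ w          # residual error, orthogonal to the columns of X
    return w, eps

p = rng.random(W_dim)        # stand-ins for TF-IDF vectors
q = rng.random(W_dim)
wp, ep = decompose(p)
wq, eq = decompose(q)

# Because eps is orthogonal to range(X), the inner product splits exactly:
print(np.dot(p, q), np.dot(wp, wq) + np.dot(ep, eq))   # the two numbers agree
```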
A Probabilistic Implementation
C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a
Probabilistic Topic Model, NIPS 2006
x is a switch variable. It controls whether a word is generated from:
• a topic-specific distribution
• a document-specific distribution
• a background distribution
p(w|d) = p(x=0|d) Σk=1..K p(w|z=k) p(z=k|d)
       + p(x=1|d) p(w|Ψ, d)
       + p(x=2|d) p(w|Ω)
where Ψ is the document-specific word distribution and Ω the background distribution.
Search (Online)
(Figure: a query is decomposed in the same way into a low-dimensional
vector plus a few residual words. The low-dimensional part is matched
against an LSH index over the document metadata (Doc 1 … Doc N) to retrieve
candidate documents, which are then re-ranked using the residual words.
Index for 10M images: 4.6 GB; search speed: < 100 ms.)
Search Example
Query Image
Search Example
Query Image
Simultaneously Modeling Semantics and
Structure of Threaded Discussions: A Sparse
Coding Approach and Its Applications
Chen LIN, Jiang-Ming YANG, Rui CAI,
Xin-jing WANG, Wei WANG, Lei ZHANG
SIGIR 2009
Semantic & structure
• Semantics: topics
• Structure: who replies to whom
Optimize them together
• Model the semantics
• Model the structure
Reply reconstruction
• Document similarity
• Topic similarity
• Structure similarity
Baselines
• NP – reply to the nearest post
• RR – reply to the root
• DS – document similarity
• LDA – Latent Dirichlet Allocation; project documents into topic space
• SWB – Special Words topic model with Background distribution; project
documents into topic and junk-topic space
Evaluation
Method   Slashdot All Posts   Slashdot Good Posts   Apple All Posts   Apple Good Posts
NP       0.021                0.012                 0.289             0.239
RR       0.183                0.319                 0.269             0.474
DS       0.463                0.643                 0.409             0.628
LDA      0.465                0.644                 0.410             0.648
SWB      0.463                0.644                 0.410             0.641
SMSS     0.524                0.737                 0.517             0.772
Expert finding
• Pipeline: reply reconstruction → network construction → expert finding
• Methods: HITS, PageRank, …
Baselines
• LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR 2006;
achieves stable performance in the expert-finding task using a language model
• PageRank – benchmark nodal ranking method
• HITS – finds hub nodes and authority nodes
• EABIF – Personalized Recommendation Driven by Information Flow, SIGIR 2006;
finds the most influential nodes
Evaluation
• Bayesian estimate
Method           MRR     MAP     P@10
LM               0.821   0.698   0.800
EABIF (ori.)     0.674   0.362   0.243
EABIF (rec.)     0.742   0.318   0.281
PageRank (ori.)  0.675   0.377   0.263
PageRank (rec.)  0.743   0.321   0.266
HITS (ori.)      0.906   0.832   0.900
HITS (rec.)      0.938   0.822   0.906
Summary
• Matrix and probability are fundamental mathematics in
information retrieval and computer vision.
– Matrix decomposition – a good practice to learn matrix.
– Graphical model – a good practice to learn probability.
• A graphical model is a good tool for analyzing and understanding problems.
• The essence of decomposition is to discover a set of mid-level
features to describe the original documents/images.
• Graphical models are more adaptable to various applications than matrix
decomposition.