Machine Learning for Information Retrieval
Rong Jin, Michigan State University
Yi Zhang, University of California Santa Cruz
1
Outline
Introduction to information retrieval, statistical inference
and machine learning
Supervised learning and its application to IR
Semi-supervised learning and its application to IR
Emerging research directions
2
Roadmap of Information Retrieval
[Figure: roadmap of information retrieval. Retrieval applications (search, filtering, summarization, visualization, categorization, extraction) support information access; mining/learning applications (clustering, data analysis) support knowledge acquisition; both are built on data.]
Why Machine Learning is Important?
3
Text Categorization
4
Text Categorization
Open Directory Project: the largest human-edited directory of the Web
Manual classification
Over 4 million sites and 590 K categories
Need to automate the process
5
Document Clustering
6
Question Answering
Classify question; identify answers; match questions and answers
7
Image Retrieval
Image segmentation by data clustering
8
Image Retrieval by Key Points
[Figure: local key points in an image quantized into visual words b1, b2, b3, ...]
Key features → visual words, obtained by data clustering
9
Image Retrieval by Text Query
Automatically annotate images with textual words
Retrieve images with textual queries
Key technique: classification
Each keyword corresponds to a different category
10
Information Extraction
Web page (free-style text) → Relational DB with fields: Title = J2EE Developer, Length = 4 month, Salary = …, Location, Reference
Structure prediction by Hidden Markov Models and Markov Random Fields
11
Citation/Link Analysis
12
Recommender Systems
13
Recommender Systems
User 1:   ?   5   3   4   2
User 2:   4   1   5   ?   5
User 3:   5   ?   4   2   5
User 4:   1   5   3   5   ?
Sparse data problem: a lot of missing values
14
Recommender System
[Figure: users grouped into classes (User Class I, II) and movies grouped into types (Movie Type I, II, III); each (user class, movie type) cell holds a rating distribution such as p(4)=1/4, p(5)=3/4 or p(1)=1/2, p(2)=1/2 or p(4)=1/2, p(5)=1/2]
Fill in the sparse data by data clustering
15
One More Reason for ML
$1,000,000 award (the Netflix Prize)
16
Review of Basic Prob. Concepts
Probability Pr(A): "the fraction of possible worlds in which A is true"
Examples:
A = Your paper will be accepted by SIGIR 2008
A = It rains in Singapore
A = A document contains the word "IR"
[Figure: event space of all possible worlds; the total area is 1, and Pr(A) is the area of the region where A is true]
17
Conditional Probability
SIGIR2008 = "a document contains the phrase SIGIR 2008"
SINGAPORE = "a document contains the word Singapore"
P(SINGAPORE) = 0.000001
P(SIGIR2008) = 0.00000001
P(SINGAPORE|SIGIR2008) = 1/2
"Singapore" is rare and "SIGIR 2008" is rarer, but if you have a document containing "SIGIR 2008", there is a 50-50 chance you will find the word "Singapore" in it
18
Conditional Prob.
Definition: $\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)}$
Chain rule: $\Pr(A, B) = \Pr(B)\,\Pr(A \mid B)$
[Figure: Venn diagram of the regions where A is true and B is true]
19
Conditional Prob.
Definition: $\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)}$
Independent variables: $\Pr(A \mid B) = \Pr(A)$
Chain rule: $\Pr(A, B) = \Pr(B)\,\Pr(A \mid B)$, so under independence $\Pr(A, B) = \Pr(B)\,\Pr(A)$
[Figure: Venn diagram of the regions where A is true and B is true]
20
Conditional Prob.
Definition: $\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)}$
Independence: $\Pr(A \mid B) = \Pr(A)$
Chain rule: $\Pr(A, B) = \Pr(B)\,\Pr(A \mid B)$, so under independence $\Pr(A, B) = \Pr(B)\,\Pr(A)$
Marginal probability: $\Pr(B) = \sum_{j=1}^{k} \Pr(B, A = a_j)$
[Figure: Venn diagram of the regions where A is true and B is true]
21
Bayes’ Rule
Posterior ∝ Prior × Likelihood: $\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$
H: hypothesis; E: evidence
Information: Pr(E|H)
Inference: Pr(H|E)
22
Bayes’ Rule
Posterior ∝ Prior × Likelihood: $\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$
R: it rains; W: the grass is wet
Information: Pr(W|R)
Inference: Pr(R|W)
Pr(W|R):
        R     ¬R
W      0.7   0.4
¬W     0.3   0.6
23
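As a minimal numeric illustration of this inference, the short sketch below applies Bayes' rule to the rain/wet-grass table above. The conditional table Pr(W|R) comes from the slide; the prior Pr(R) = 0.3 is an assumed value chosen only to make the computation concrete, since the slide does not give one.

```python
# Bayes' rule for the rain/wet-grass example.
p_w_given_r = 0.7      # Pr(W | R)      (from the slide)
p_w_given_not_r = 0.4  # Pr(W | not R)  (from the slide)
p_r = 0.3              # Pr(R): assumed prior, not given on the slide

# Pr(R | W) = Pr(W | R) Pr(R) / Pr(W), with Pr(W) obtained by marginalization
p_w = p_w_given_r * p_r + p_w_given_not_r * (1 - p_r)
p_r_given_w = p_w_given_r * p_r / p_w

print(f"Pr(W) = {p_w:.3f}, Pr(R | W) = {p_r_given_w:.3f}")   # 0.490, 0.429
```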
Statistical Inference
Posterior ∝ Prior × Likelihood: $\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$
Learning stage: a parametric model for Pr(E|H)
Inference stage: for a given observation E
Compute Pr(H|E) for each hypothesis H
Choose the hypothesis with the largest Pr(H|E)
24
Example: Language Model (LM) for IR
Evidence E: the query q = 'Singapore SIGIR'
Hypotheses H: the documents d1, …, d1000, with prior Pr(H)
Learning: estimate statistics (a language model) for each document
Likelihood Pr(E|H): the probability p(q | d) of the query under the document's language model
Inference: rank documents by Pr(H|E)
25
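A minimal sketch of this query-likelihood scoring is given below: each document is scored by log p(q | d) under a unigram language model with Jelinek-Mercer smoothing against the collection model. The toy documents and the smoothing weight are illustrative only, not data from the tutorial.

```python
# Query-likelihood language model ranking with Jelinek-Mercer smoothing (sketch).
import math
from collections import Counter

docs = {
    "d1": "sigir 2008 will be held in singapore".split(),
    "d2": "machine learning for information retrieval tutorial".split(),
}
collection = Counter(w for d in docs.values() for w in d)
coll_len = sum(collection.values())

def score(query, doc, lam=0.8):
    """log p(q | d) under a smoothed unigram model: lam*p(w|d) + (1-lam)*p(w|collection)."""
    tf, dlen, s = Counter(doc), len(doc), 0.0
    for w in query.split():
        p = lam * tf[w] / dlen + (1 - lam) * collection[w] / coll_len
        s += math.log(p) if p > 0 else float("-inf")
    return s

query = "singapore sigir"
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)   # d1 should rank above d2 for this query
```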
Probability Distributions
Binomial distributions
Beta distribution
Multinomial distributions
Dirichlet distribution
Gaussian distributions
Laplacian distribution
Language models
Smoothing LM
Sparse solution
L1 regularizer
26
Outline
Introduction to information retrieval, statistical inference and
machine learning
Supervised learning and its application to IR
Semi-supervised learning and its application to IR
Emerging research directions
27
Supervised Learning: Basic Setting
Given training data: {(x1,y1), (x2,y2)…(xN,yN)}
Learning: infer a function f(X) from the training data
Inference: predict future outcomes y=f(x) given x
$f(x) = ax - b$
Regression: continuous y
[Figure: scatter of (x, y) points with the fitted line f(x) = ax − b]
28
Supervised Learning: Basic Setting
Given training data: {(x1,y1), (x2,y2)…(xN,yN)}
Learning: infer a function f(X) from the training data
Inference: predict future outcomes y=f(x) given x
$x = (x_1, x_2)$
Decision boundary: $w^\top x - b = 0$
$f(x) = \mathrm{sign}(w^\top x - b)$
Classification: discrete y
[Figure: 2-D points with labels y = +1 and y = −1 separated by the line $w^\top x - b = 0$]
29
Examples
Text categorization
Input x: word histogram
Output y: document categories (e.g., 1 for "domestic economics", 2 for "politics", 3 for "sports", and 4 for "others")
Question answering: classify question types
Input x: a parse tree of a question
Output y: question types (e.g., when, where, …)
30
K Nearest-Neighbor (KNN) Classifiers
Unknown record:
– Compute the distance to the training documents
– Identify the k nearest neighbors
– Determine the class of the unknown point from the class labels of its nearest neighbors
31
Based on Tan, Steinbach, Kumar
K Nearest-Neighbor (KNN) Classifiers
Compute distance between two points
Euclidean distance, cosine distance, Kullback-Leibler
distance, Bregman distance, …
Learning distance function from data (Distance learning)
Determine the class
Majority vote, or weighted majority vote
Bregman distance:
generated by a
convex function
32
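A minimal KNN classifier with a majority vote is sketched below. Cosine distance is used since the slides mention it for documents; the tiny 2-D data set is purely illustrative.

```python
# k-nearest-neighbor classification by majority vote (sketch).
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x from the majority vote of its k nearest neighbors."""
    # cosine distance = 1 - cosine similarity
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(1 - sims)[:k]            # indices of the k most similar points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y_train = np.array([+1, +1, -1, -1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.3]), k=3))   # expected: 1
```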
K Nearest-Neighbor (KNN) Classifiers
Decide K (# of nearest neighbors)
Bias-variance tradeoff
Cross validation (or leave-one-out)
[Figure: decision regions on the training and validation datasets for k = 1 vs. k = 4]
33
K Nearest-Neighbor (KNN) Classifiers
Curse of dimensionality
Many attributes are irrelevant
High dimensionality → less informative distances
[Figure: distribution of squared pairwise distances for 1000 random data points in 1000 dimensions]
34
KNN for Collaborative Filtering
Collaborative filtering: will user u like item b?
Assumption: users who have similar tastes are likely to have similar preferences on items
Make filtering decisions for one user based on the feedback from other users that are similar to this user
35
KNN for Collaborative Filtering
User 1:   1   5   3   4   3
[Table: ratings of Users 2 and 3 on the same five items, with one missing rating ("?") to be predicted]
36
KNN for Collaborative Filtering
User 1:   1   5   3   4   3
[Table: same ratings; the missing rating is now filled in as 5, predicted from the most similar users]
A similarity measure of user interests can also be learned
37
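A minimal sketch of user-based collaborative filtering with k nearest neighbors follows: the missing rating is predicted as a similarity-weighted average of the ratings given by the most similar users. The small rating matrix is made up for illustration; 0 marks a missing rating.

```python
# User-based KNN collaborative filtering (sketch).
import numpy as np

R = np.array([[1., 5., 3., 4., 3.],
              [4., 2., 1., 0., 5.],   # second user has not rated item 3 (0 = missing)
              [3., 2., 5., 5., 4.]])

def predict(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated the item."""
    rated = [u for u in range(R.shape[0]) if u != user and R[u, item] > 0]
    def sim(u, v):
        # cosine similarity over co-rated items only
        mask = (R[u] > 0) & (R[v] > 0)
        a, b = R[u, mask], R[v, mask]
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    neighbors = sorted(rated, key=lambda u: sim(user, u), reverse=True)[:k]
    weights = np.array([sim(user, u) for u in neighbors])
    ratings = np.array([R[u, item] for u in neighbors])
    return float(weights @ ratings / (weights.sum() + 1e-12))

print(predict(R, user=1, item=3))   # weighted average of the neighbors' ratings
```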
Paradigm for Supervised Learning
Gather training data
Determine the input features (i.e., what is x?)
e.g., for text categorization: bag of words
Feature engineering is very, very important
Determine the functional form f(x): linear or nonlinear (what is the functional form for KNN?)
Determine the learning algorithm: probabilistic or non-probabilistic
Learn the optimal parameters (optimization, cross validation)
Test on a test set
38
Bayesian Learning
Posterior ∝ Prior × Likelihood: $\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$ (Bayes' rule)
Hypothesis space: $\mathcal{H} = \{Y_1, Y_2, \ldots\}$
$Y^* = \arg\max_{Y \in \mathcal{H}} \Pr(Y \mid X) = \arg\max_{Y \in \mathcal{H}} \Pr(Y)\,\Pr(X \mid Y)$
MAP learning: Maximum A Posteriori
39
Bayesian Learning
Posterior ∝ Prior × Likelihood: $\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$ (Bayes' rule)
Hypothesis space: $\mathcal{H} = \{Y_1, Y_2, \ldots\}$
$Y^* = \arg\max_{Y \in \mathcal{H}} \Pr(Y \mid X) = \arg\max_{Y \in \mathcal{H}} \Pr(Y)\,\Pr(X \mid Y)$
MLE learning: Maximum Likelihood Estimation (keep only the likelihood term $\Pr(X \mid Y)$)
40
Bayesian Learning: Conjugate Prior
Hypothesis space: $\mathcal{H} = \{Y_1, Y_2, \ldots\}$
$Y^* = \arg\max_{Y \in \mathcal{H}} \Pr(Y \mid X) = \arg\max_{Y \in \mathcal{H}} \Pr(Y)\,\Pr(X \mid Y)$
Conjugate prior: the posterior Pr(Y|X) has the same form as the prior Pr(Y)
e.g., the Dirichlet distribution is the conjugate prior of the multinomial distribution (widely used in language models)
41
Example: Text Categorization
$Y^* = \arg\max_{Y \in \mathcal{H}} \Pr(Y)\,\Pr(X \mid Y)$
Web page: professor or student?
What is Y? What is the feature X?
How to estimate Pr(Y = Student) or Pr(Y = Prof.)? How to estimate Pr(w|Y)?
Counting!
1. Counting = MLE
2. Counting + pseudo-counts = MAP
42
Naïve Bayes
Vocabulary $[w_1, w_2, \ldots, w_V]$; a document is represented by its word counts $X = (x_1, x_2, \ldots, x_V)$
From the word probabilities $\Pr(w \mid Y)$ to the document likelihood $\Pr(X \mid Y)$:
$\Pr(X \mid Y) \approx [\Pr(w_1 \mid Y)]^{x_1} \cdots [\Pr(w_V \mid Y)]^{x_V}$
$f(X) = \log \frac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = \log \frac{\Pr(Y = P)}{\Pr(Y = S)} + x_1 \log \frac{\Pr(w_1 \mid Y = P)}{\Pr(w_1 \mid Y = S)} + \ldots + x_V \log \frac{\Pr(w_V \mid Y = P)}{\Pr(w_V \mid Y = S)}$
The first term is a constant (threshold); the log ratios are the weights for the words
43
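A minimal multinomial Naive Bayes sketch for the two-class (Prof. vs. Student) example is given below: Pr(w|Y) is estimated by counting with pseudo-counts (the "counting + pseudo = MAP" estimate above) and classification uses the log-odds f(X). The tiny vocabulary and counts are illustrative only.

```python
# Multinomial Naive Bayes with pseudo-counts and log-odds classification (sketch).
import numpy as np

vocab = ["advisor", "student", "research", "teaching"]
# word-count vectors X and labels y (+1 = Prof., -1 = Student)
X = np.array([[3, 1, 4, 2],
              [2, 0, 3, 3],
              [0, 4, 1, 0],
              [1, 3, 2, 0]], dtype=float)
y = np.array([+1, +1, -1, -1])

def train(X, y, pseudo=1.0):
    priors, word_probs = {}, {}
    for c in (+1, -1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        counts = Xc.sum(axis=0) + pseudo           # counting + pseudo-counts
        word_probs[c] = counts / counts.sum()      # Pr(w | Y = c)
    return priors, word_probs

def log_odds(x, priors, word_probs):
    """f(X) = log Pr(+1)/Pr(-1) + sum_w x_w log Pr(w|+1)/Pr(w|-1)."""
    return (np.log(priors[+1] / priors[-1])
            + float(x @ (np.log(word_probs[+1]) - np.log(word_probs[-1]))))

priors, word_probs = train(X, y)
x_new = np.array([1, 0, 2, 1], dtype=float)
print("Prof." if log_odds(x_new, priors, word_probs) > 0 else "Student")
```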
Naïve Bayes: A Linear Classifier
$f(x) = \mathrm{sign}(w^\top x - b)$
[Figure: linearly separated classes y = +1 and y = −1 in the (x1, x2) plane]
$f(X) = \log \frac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = \log \frac{\Pr(Y = P)}{\Pr(Y = S)} + x_1 \log \frac{\Pr(w_1 \mid Y = P)}{\Pr(w_1 \mid Y = S)} + \ldots + x_V \log \frac{\Pr(w_V \mid Y = P)}{\Pr(w_V \mid Y = S)}$
Logistic regression: directly model f(X), i.e., Pr(Y|X)
44
Logistic Regression (LR)
$\log \frac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = \log \frac{\Pr(Y = P)}{\Pr(Y = S)} + x_1 \log \frac{\Pr(w_1 \mid Y = P)}{\Pr(w_1 \mid Y = S)} + \ldots + x_V \log \frac{\Pr(w_V \mid Y = P)}{\Pr(w_V \mid Y = S)}$
Model the log-odds directly as a linear function:
$\log \frac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = b + t_1 x_1 + \ldots + t_V x_V$
$t_1, \ldots, t_V$ are unknown weights learned from data by maximum likelihood estimation (MLE)
$\Pr(y = \pm 1 \mid X) = \frac{1}{1 + \exp[-y(t_1 x_1 + \ldots + t_V x_V + b)]}$
45
Logistic Regression (LR)
Learning parameters: $b, t_1, \ldots, t_V$
Maximum Likelihood Estimation (MLE):
$(\vec{t}^{\,*}, b^*) = \arg\max_{\vec{t}, b} \sum_{i=1}^{N} \log \Pr(y_i \mid X_i; \vec{t}, b)$
46
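A minimal sketch of fitting (t, b) by maximizing this log-likelihood with plain gradient ascent is shown below; a production system would use a better optimizer and the regularized (MAP) objective on the next slide. The toy data are illustrative.

```python
# Logistic regression fit by gradient ascent on the log-likelihood (sketch).
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Return (t, b) maximizing sum_i log Pr(y_i | X_i; t, b)."""
    n, d = X.shape
    t, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margin = y * (X @ t + b)                          # y_i (t^T x_i + b)
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-margin)))     # d log-lik / d margin
        t += lr * (X.T @ g) / n                           # gradient ascent step
        b += lr * g.mean()
    return t, b

X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([+1, +1, -1, -1])
t, b = fit_logistic(X, y)
print(np.sign(X @ t + b))   # should reproduce the training labels [ 1.  1. -1. -1.]
```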
Logistic Regression (LR)
Learning parameters: $b, t_1, \ldots, t_V$
$(\vec{t}^{\,*}, b^*) = \arg\max_{\vec{t}, b} \sum_{i=1}^{N} \log \Pr(y_i \mid X_i; \vec{t}, b) + \log \Pr(\vec{t}\,)$
Why not just fit the word weights by MLE? Overfitting → worse performance
Maximum Likelihood Estimation → Maximum A Posteriori (add a prior over the weights)
47
Learning Logistic Regression
$(\vec{t}^{\,*}, b^*) = \arg\min_{\vec{t}, b} \sum_{i=1}^{N} -\log \Pr(y_i \mid X_i; \vec{t}, b)$
Loss function: the mismatch between y and f(X)
$\Pr(y = \pm 1 \mid X) = \frac{1}{1 + \exp[-y\,f(X)]}$
Other loss functions are possible
[Figure: loss as a function of f(X)]
48
Logistic Regression (LR)
Closely related to Maximum Entropy (ME): logistic regression is the dual of maximum entropy
Advantages of LR:
Bayesian approach; convenient for incorporating prior knowledge
Useful for semi-supervised learning, transfer learning, …
49
Comparison of Classifiers
                      Macro F1   Micro F1
KNN                    0.8557     0.5975
Naive Bayes            0.8009     0.4737
Logistic Regression    0.8748     0.6084
From Li and Yang, SIGIR'03
50
Comparison of Classifiers
Logistic Regression:
1. Models Pr(Y|X)
2. Models the decision boundary
3. NB is a special case of LR
1. Requires a numerical solution
2. Needs a large number of training examples; slow convergence
Naive Bayes:
1. Models Pr(X|Y) & Pr(Y)
2. Models the input patterns (X)
1. Simple (counting) solution
2. Works with a small number of training examples; fast convergence
[Figure: two-class data in the (x1, x2) plane]
51
Comparison of Classifiers
Discriminative Model:
1. Models Pr(Y|X)
2. Models the decision boundary
3. Broader model assumptions
1. Requires a numerical solution
2. Needs a large number of training examples; slow convergence
Generative Model:
1. Models Pr(X|Y) & Pr(Y)
2. Models the input patterns (X)
1. Simple solution
2. Works with a small number of training examples; fast convergence
Rule of thumb — use a discriminative model if:
1. Enough training examples
2. Enough computational power
3. Classification accuracy is important
Use a generative model if:
1. Lack of training examples
2. Lack of computational power
3. Training time is more important
4. You need a quick test
52
Comparison of Classifiers
What about KNN?
Discriminative Model:
1. Models Pr(Y|X)
2. Models the decision boundary
3. Broader model assumptions
1. Requires a numerical solution
2. Needs a large number of training examples; slow convergence
Generative Model:
1. Models Pr(X|Y) & Pr(Y)
2. Models the input patterns (X)
1. Simple solution
2. Works with a small number of training examples; fast convergence
53
Other Discriminative Classifiers
Decision tree
Aggregation of decision
rules via a tree
Easy interpretation
54
Other Discriminative Classifiers
Decision tree
Aggregation of decision rules via a tree
Easy interpretation
Support vector machine
A maximum-margin classifier
Among the best text classifiers
[Figure: maximum-margin boundary between classes y = +1 and y = −1 in the (x1, x2) plane]
55
Comparison of Classifiers
                         Macro F1   Micro F1
KNN                       0.8557     0.5975
Naive Bayes               0.8009     0.4737
Logistic Regression       0.8748     0.6084
Support Vector Machine    0.8857     0.5975
From Li and Yang, SIGIR'03
56
Ensemble Learning
Generate multiple classifiers
Classification by (weighted) majority votes
Bagging & Boosting
Train each classifier on a different sample of the training data
[Figure: dataset D is sampled into D1, D2, …, Dk, producing classifiers h1, h2, …, hk]
57
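A minimal bagging sketch follows: one classifier is trained per bootstrap sample and the ensemble predicts by majority vote. A nearest-centroid rule stands in for the base learner purely for brevity (the slides use decision trees); the sampled Gaussian data are illustrative.

```python
# Bagging: bootstrap samples + majority vote (sketch).
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid(X, y):
    """Return a classifier h(x) that predicts the label of the closest class centroid."""
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    def h(x):
        return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
    return h

def bagging(X, y, n_classifiers=25):
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample D_k
        models.append(nearest_centroid(X[idx], y[idx]))
    def predict(x):
        votes = [h(x) for h in models]
        return max(set(votes), key=votes.count)      # majority vote
    return predict

X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
predict = bagging(X, y)
print(predict(np.array([2.8, 3.1])))   # expected: 1
```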
Ensemble Learning
Bias-variance tradeoff
Reduce variance (bagging) and bias (boosting)
[Figure: errors of 50 individual decision trees vs. their majority vote, decomposed into error caused by variance and error caused by bias]
58
Multi-Class Classification
More than 2 classes; multiple labels may be assigned to each example
Approaches: one against all; ECOC coding
One against all: one binary classifier $f_1(X), \ldots, f_K(X)$ per class
[Table: examples X1, X2, …, XN with a 0/1 indicator for each class c1, c2, …, cK]
59
Multi-Class Classification
More than 2 classes; multiple labels may be assigned to each example
Approaches: one against all; ECOC coding
ECOC coding: each class $c_k$ gets a binary code; each code bit is predicted by one binary classifier $f_1(X), \ldots, f_M(X)$ (M = number of coding bits)
[Figure/table: classes c1, c2, …, cK with their code bits, and examples X1, X2, …, XN with the corresponding 0/1 codes]
60
Multi-Class Classification
More than 2 classes; multiple labels may be assigned to each example
Approaches: one against all; ECOC coding; transfer learning
One binary classifier $f_1(X), \ldots, f_K(X)$ per class
[Table: examples X1, X2, …, XN with a 0/1 indicator for each class c1, c2, …, cK]
61
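A minimal one-against-all sketch is given below: one binary classifier f_k(X) per class, with prediction by the largest score. The base learner is the logistic-regression fit sketched earlier; any binary classifier would do, and the 2-D data are illustrative.

```python
# One-against-all multi-class classification (sketch).
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=2000):
    t, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-y * (X @ t + b))))
        t += lr * (X.T @ g) / len(X)
        b += lr * g.mean()
    return lambda Z: Z @ t + b            # real-valued score f_k(X)

def one_vs_all(X, labels, classes):
    scorers = []
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)   # class c against all the rest
        scorers.append(fit_logistic(X, y))
    def predict(Z):
        scores = np.column_stack([f(Z) for f in scorers])
        return np.array(classes)[scores.argmax(axis=1)]
    return predict

X = np.array([[2., 0.], [1.5, .5], [0., 2.], [.5, 1.5], [3., 3.], [2.5, 2.8]])
labels = np.array([0, 0, 1, 1, 2, 2])
predict = one_vs_all(X, labels, classes=[0, 1, 2])
print(predict(X))   # expected: [0 0 1 1 2 2] on the training points
```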
Beyond Vector Inputs
Sequences: gene sequence classification
Trees: question type classification
Graphs: character recognition
62
Beyond Vector Inputs: Kernel
Kernel function k(x1, x2)
Assess the similarity between two objects x1, x2
Don’t have to represent objects by vectors
63
Beyond Vector Inputs: Kernel
Kernel function k(x1, x2)
Assess the similarity between two objects x1, x2
Don't have to represent objects by vectors
Vector representation via the kernel function:
Given training examples $x_1, \ldots, x_N$, represent any example $x$ by the vector $[k(x_1, x), k(x_2, x), \ldots, k(x_N, x)]$
Related to the representer theorem
64
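A minimal sketch of this kernel-based vector representation follows: with an RBF kernel k(x1, x2), any example x is mapped to the N-dimensional vector of similarities to the training examples, to which an ordinary vector-space method can then be applied. The RBF kernel and its bandwidth are illustrative choices.

```python
# Representing an example by its kernel similarities to the training set (sketch).
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_representation(X_train, x, gamma=0.5):
    """Represent x by [k(x_1, x), k(x_2, x), ..., k(x_N, x)]."""
    return np.array([rbf_kernel(xi, x, gamma) for xi in X_train])

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.9, 0.1])
print(kernel_representation(X_train, x))   # a 3-dimensional similarity vector
```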
Beyond Vector Inputs
Sequences: string kernels
Trees: tree kernels
Graphs: graph kernels
65
Kernel for Nonlinear Classifiers
66
Words are associated with Kernels
Reproducing Kernel Hilbert Space (RKHS)
Mercer’s conditions
Vector representation
Good kernels
Representer theorem
Kernel learning (e.g., multiple kernel
learning)
67
Sequence Prediction
Part-of-speech tagging:
[He]/PRP [reckons]/VBZ [the]/DT [current]/JJ [account]/NN [deficit]/NN
But all the taggings are related:
$\Pr(\text{NN} \mid \text{account}) \;\rightarrow\; \Pr(\text{NN} \mid \text{account}, \text{tag-for-current})$
Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Maximum Margin Markov Networks (M3N)
68
Outline
Introduction to information retrieval, statistical inference and
machine learning
Supervised learning and its application to IR
Semi-supervised learning and its application to IR
Emerging research directions
69
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
70
Spectrum of Learning Problems
71
What is Semi-supervised Learning
Learning from a mixture of labeled and unlabeled examples
Labeled data: $L = \{(x_1, y_1), \ldots, (x_{n_l}, y_{n_l})\}$
Unlabeled data: $U = \{x_1, \ldots, x_{n_u}\}$
Total number of examples: $N = n_l + n_u$
Goal: learn $f(x): X \rightarrow Y$
72
Why Semi-supervised Learning?
Labeling is expensive and difficult
Labeling is unreliable
Ex. Segmentation applications
Need for multiple experts
Unlabeled examples
Easy to obtain in large numbers
Ex. Web pages, text documents, etc.
73
Semi-supervised Learning Problems
Classification
Transductive – predict labels of unlabeled data
Inductive – learn a classification function
Clustering (constrained clustering)
Ranking (semi-supervised ranking)
Almost every learning problem has a semi-supervised counterpart.
74
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
75
Why Unlabeled Could be Helpful
Clustering assumption
Unlabeled data help decide the decision boundary $f(X) = 0$
Manifold assumption
Unlabeled data help decide the decision function $f(X)$
76
Clustering Assumption
?
77
Clustering Assumption
Points with the same label are connected through high-density regions, thereby defining a cluster
Clusters are separated by low-density regions
This suggests a simple algorithm for semi-supervised learning
[Figure: unlabeled points marked "?" falling inside the clusters]
78
Manifold Assumption
Graph representation
Vertex: a training example (labeled or unlabeled)
Edge: connects similar examples
Regularize the classification function f(x): if $x_1$ and $x_2$ are connected, then $|f(x_1) - f(x_2)|$ should be small
[Figure: graph over labeled and unlabeled examples, with labeled examples $x_1$ and $x_2$ marked]
79
Manifold Assumption
Graph representation
Vertex: training example
(labeled and unlabeled)
Edge: similar examples
Manifold assumption
Data lies on a low-dimensional manifold
Classification function f(x) should “follow” the
data manifold
80
Statistical View
Generative model for classification:
$\Pr(X, Y \mid \theta, \eta) = \Pr(X \mid Y; \theta)\,\Pr(Y \mid \eta)$
[Graphical model: Y generates X, with parameter θ for Pr(X|Y)]
81
Statistical View
Generative model for classification:
$\Pr(X, Y \mid \theta, \eta) = \Pr(X \mid Y; \theta)\,\Pr(Y \mid \eta)$
Unlabeled data help estimate $\Pr(X \mid Y; \theta)$ → clustering assumption
[Graphical model: Y generates X, with parameter θ for Pr(X|Y)]
82
Statistical View
Discriminative model for classification:
$\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X; \theta)$
[Graphical model: X generates Y, with parameter μ for Pr(X) and θ for Pr(Y|X)]
83
Statistical View
Discriminative model for classification:
$\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X; \theta)$
Unlabeled data help regularize θ via a prior $\Pr(\theta \mid X)$ → manifold assumption
[Graphical model: X generates Y, with parameter μ for Pr(X) and θ for Pr(Y|X)]
84
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
85
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
86
Label Propagation: Key Idea
A decision boundary
based on the labeled
examples is unable to
take into account the
layout of the data points
How to incorporate the
data distribution into the
prediction of class labels?
87
Label Propagation: Key Idea
Connect the data points
that are close to each
other
88
Label Propagation: Key Idea
Connect the data points
that are close to each
other
Propagate the class labels
over the connected graph
89
Label Propagation: Key Idea
Connect the data
points that are close
to each other
Propagate the class
labels over the
connected graph
Different from the
K Nearest Neighbor
90
Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$:
$W_{i,j} = \begin{cases} 1 & x_i \text{ and } x_j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$
Similarity matrix $W \in \mathbb{R}_+^{N \times N}$:
$W_{i,j}$: similarity between $x_i$ and $x_j$
Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j \neq i} W_{i,j}$
91
Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$:
$W_{i,j} = \begin{cases} 1 & x_i \text{ and } x_j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$
Similarity matrix $W \in \mathbb{R}_+^{N \times N}$:
$W_{i,j}$: similarity between $x_i$ and $x_j$
Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j \neq i} W_{i,j}$
92
Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$
Label information:
$\mathbf{y}_l = (y_1, y_2, \ldots, y_{n_l}) \in \{-1, +1\}^{n_l}$
$\mathbf{y}_u = (y_1, y_2, \ldots, y_{n_u}) \in \{-1, +1\}^{n_u}$
93
Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$
Label information: $\mathbf{y}_l = (y_1, y_2, \ldots, y_{n_l}) \in \{-1, +1\}^{n_l}$
$\mathbf{y} = (\mathbf{y}_l, \mathbf{y}_u)$
94
Label Propagation
Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$:
$\hat{y}_i = \begin{cases} \pm 1 & x_i \text{ is labeled} \\ 0 & x_i \text{ is unlabeled} \end{cases}$
Predicted class assignments:
First predict the confidence scores $f \in \mathbb{R}^N$
Then predict the class assignments $y \in \{-1, +1\}^N$:
$y_i = \begin{cases} +1 & f_i > 0 \\ -1 & f_i \le 0 \end{cases}$
95
Label Propagation
Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$:
$\hat{y}_i = \begin{cases} \pm 1 & x_i \text{ is labeled} \\ 0 & x_i \text{ is unlabeled} \end{cases}$
Predicted class assignments:
First predict the confidence scores $f = (f_1, \ldots, f_N)$
Then predict the class assignments $y \in \{-1, +1\}^N$:
$y_i = \begin{cases} +1 & f_i > 0 \\ -1 & f_i \le 0 \end{cases}$
96
Label Propagation (II)
One round of propagation:
$f_i = \begin{cases} \hat{y}_i & x_i \text{ is labeled} \\ \alpha \sum_{j=1}^{N} W_{i,j}\,\hat{y}_j & \text{otherwise} \end{cases}$
α: weight for each propagation step
This is essentially weighted KNN
In matrix form: $f^{(1)} = \hat{y} + \alpha W \hat{y}$
97
Label Propagation (II)
Two rounds of propagation:
$f^{(2)} = f^{(1)} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$
How to generalize to any number of iterations?
$f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$
98
Label Propagation (II)
Two rounds of propagation:
$f^{(2)} = f^{(1)} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$
Result for any number of iterations:
$f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$
99
Label Propagation (II)
Two rounds of propagation:
$f^{(2)} = f^{(1)} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$
Result for an infinite number of iterations:
$f^{(\infty)} = \hat{y} + \sum_{i=1}^{\infty} \alpha^i W^i \hat{y}$
100
Label Propagation (II)
Two rounds of propagation:
$f^{(2)} = f^{(1)} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$
Result for an infinite number of iterations (matrix inverse):
$f^{(\infty)} = (I - \alpha W)^{-1} \hat{y}$
Normalized similarity matrix: $\bar{W} = D^{-1/2} W D^{-1/2}$
101
Local and Global Consistency
[Zhou et al., NIPS 03]
Local consistency:
Like KNN
Global consistency:
Beyond KNN
102
Summary:
Construct a graph using pairwise similarities
Propagate class labels along the graph: $f = (I - \alpha W)^{-1}\hat{y}$
Key parameters:
α: the decay of the propagation
W: the similarity matrix
Computational complexity:
Matrix inverse: O(N^3)
Cholesky decomposition
103
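A minimal sketch of this closed-form label propagation is shown below: it uses the normalized similarity matrix and solves f = (I − αW)⁻¹ŷ on a small chain graph with two labeled end points. The graph and α are illustrative.

```python
# Label propagation in closed form, f = (I - alpha * W_norm)^{-1} y_hat (sketch).
import numpy as np

W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
y_hat = np.array([+1., 0., 0., -1.])      # node 0 labeled +1, node 3 labeled -1

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
W_norm = D_inv_sqrt @ W @ D_inv_sqrt      # normalized similarity matrix

alpha = 0.9                               # decay of the propagation
f = np.linalg.solve(np.eye(len(W)) - alpha * W_norm, y_hat)
print(np.sign(f))                         # expected: [ 1.  1. -1. -1.]
```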
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for the unlabeled data) or inductive (learn a classification function)?
104
Application: Text Classification
[Zhou et al., NIPS 03]
Dataset: 20-newsgroups (autos, motorcycles, baseball, and hockey under rec)
Pre-processing: stemming, removal of stopwords & rare words, skip headers
#Docs: 3970, #Words: 8014
[Figure: test accuracy of label propagation vs. SVM and KNN]
105
Application: Image Retrieval
[Wang et al., ACM MM 2004]
5,000 images; relevance feedback on the top 20 ranked images
Classification problem: relevant or not? f(x): degree of relevance
Learning the relevance function f(x):
Supervised learning: SVM
Semi-supervised learning: label propagation
[Figure: retrieval performance of label propagation vs. SVM]
106
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partition based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
107
Graph Partition
Classification as graph partitioning
Search for a classification boundary
Consistent with labeled examples
Partition with small graph cut
Graph Cut = 2
Graph Cut = 1
108
Graph Partitioning
Classification as graph partitioning
Search for a classification boundary
Consistent with labeled examples
Partition with small graph cut
Graph Cut = 1
109
Min-cuts for semi-supervised learning
[Blum and Chawla, ICML 2001]
Additional nodes: V+ (source) and V− (sink)
Infinite weights connect the labeled examples to the source/sink
High computational cost
[Figure: graph with source V+ and sink V−; graph cut = 1]
110
Harmonic Function [Zhu et al., ICML 2003]
Weight matrix W: $w_{i,j} \ge 0$ is the similarity between $x_i$ and $x_j$
Membership vector $f = (f_1, \ldots, f_N)$:
$f_i = \begin{cases} +1 & x_i \in A \\ -1 & x_i \in B \end{cases}$
[Figure: graph partitioned into cluster A (+1 nodes) and cluster B (−1 nodes)]
111
Harmonic Function (cont’d)
Graph cut C(f):
$C(f) = \frac{1}{4}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top (D - W) f = \frac{1}{4} f^\top L f$
Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$; diagonal element $d_i = \sum_{j \neq i} W_{i,j}$
[Figure: partition into A (+1) and B (−1) with the corresponding graph cut]
112
Harmonic Function (cont'd)
Graph cut C(f):
$C(f) = \frac{1}{4}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top (D - W) f = \frac{1}{4} f^\top L f$
Graph Laplacian L = D − W:
Encodes the pairwise relationships among the data points
Captures the manifold geometry of the data
[Figure: partition into A (+1) and B (−1)]
113
Harmonic Function
$\min_{f \in \{-1,+1\}^N} \; C(f) = \frac{1}{4} f^\top L f \quad \text{s.t. } f_i = y_i,\; 1 \le i \le n_l$
Objective: consistency with the graph structure; constraints: consistency with the labeled data
Challenge: discrete space → combinatorial optimization
[Figure: partition into A (+1) and B (−1)]
114
Harmonic Function
$\min_{f \in \{-1,+1\}^N} \; C(f) = \frac{1}{4} f^\top L f \quad \text{s.t. } f_i = y_i,\; 1 \le i \le n_l$
Relaxation: $\{-1, +1\}$ → continuous real numbers:
$\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4} f^\top L f \quad \text{s.t. } f_i = y_i,\; 1 \le i \le n_l$
Then convert the continuous f back to binary labels
[Figure: partition into A (+1) and B (−1)]
115
Harmonic Function
$\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4} f^\top L f \quad \text{s.t. } f_i = y_i,\; 1 \le i \le n_l$
Partition L and f by labeled/unlabeled examples:
$L = \begin{pmatrix} L_{l,l} & L_{l,u} \\ L_{u,l} & L_{u,u} \end{pmatrix}, \quad f = (f_l, f_u)$
Closed-form solution: $f_u = -L_{u,u}^{-1} L_{u,l}\, y_l$
116
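A minimal sketch of this harmonic-function solution follows: build the Laplacian L = D − W, partition it by labeled/unlabeled nodes, and solve the linear system for f_u. The small graph is illustrative.

```python
# Harmonic function: f_u = -L_uu^{-1} L_ul y_l on a small graph (sketch).
import numpy as np

W = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 1., 0., 0.],
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 0., 1.],
              [0., 0., 0., 1., 0.]])
labeled = [0, 4]                     # indices of the labeled nodes
unlabeled = [1, 2, 3]
y_l = np.array([+1., -1.])           # node 0 -> +1, node 4 -> -1

L = np.diag(W.sum(axis=1)) - W       # graph Laplacian L = D - W
L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]

f_u = np.linalg.solve(L_uu, -L_ul @ y_l)
print(np.sign(f_u))                  # expected: [ 1.  1. -1.]
```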
Harmonic Function
Local propagation: $f_u = -L_{u,u}^{-1} L_{u,l}\, y_l$
117
Harmonic Function
Local propagation vs. global propagation: $f_u = -L_{u,u}^{-1} L_{u,l}\, y_l$
Sound familiar?
118
Spectral Graph Transducer [Joachims, 2003]
$\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4} f^\top L f + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2$
Soften the hard constraints $f_i = y_i,\; 1 \le i \le n_l$
119
Spectral Graph Transducer [Joachims, 2003]
$\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4} f^\top L f + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2$
Solved as a constrained eigenvector problem:
$\min_{f \in \mathbb{R}^N} \; \frac{1}{4} f^\top L f + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2 \quad \text{s.t. } \sum_{i=1}^{N} f_i^2 = N$
120
Manifold Regularization [Belkin, 2006]
$\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4} f^\top L f + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2 \quad \text{s.t. } \sum_{i=1}^{N} f_i^2 = N$
The second term is a loss function for misclassification; the constraint regularizes the norm of the classifier
121
Manifold Regularization [Belkin, 2006]
$\min_{f \in \mathbb{R}^N} \; \frac{1}{4} f^\top L f + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2 \quad \text{s.t. } \sum_{i=1}^{N} f_i^2 = N$
Manifold regularization: replace the squared error by a general loss $\ell(f(x_i), y_i)$ and regularize in an RKHS:
$\min_{f} \; f^\top L f + \alpha \sum_{i=1}^{n_l} \ell(f(x_i), y_i) + \gamma \|f\|^2_{\mathcal{H}_K}$
122
Summary
Construct a graph using pairwise similarities
Key quantity: the graph Laplacian, which captures the geometry of the graph
The decision boundary is consistent with both the graph structure and the labeled examples
Parameters: α, γ, and the similarity measure
[Figure: graph partitioned into A (+1) and B (−1)]
123
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for the unlabeled data) or inductive (learn a classification function)?
124
Application: Text Classification
Dataset: 20-newsgroups (autos, motorcycles, baseball, and hockey under rec)
Pre-processing: stemming, removal of stopwords & rare words, skip headers
#Docs: 3970, #Words: 8014
[Figure: test accuracy of the harmonic-function approach vs. SVM, KNN, and label propagation]
125
Application: Text Classification
PRBEP: precision-recall break-even point.
126
Application: Text Classification
Improvement in PRBEP by SGT
127
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised data clustering
128
Transductive SVM
Support vector machine
Classification margin
Maximum classification
margin
Decision boundary given a
small number of labeled
examples
129
Transductive SVM
Decision boundary given a
small number of labeled
examples
How to change decision
boundary given both
labeled and unlabeled
examples ?
130
Transductive SVM
Decision boundary given a
small number of labeled
examples
Move the decision boundary to a region with low local density
131
Transductive SVM
$\omega(X, y; f)$: classification margin; f(x): classification function
Supervised learning: $f^* = \arg\max_{f \in \mathcal{H}_K} \omega(X, y; f)$
Semi-supervised learning: optimize over both f(x) and $y_u$
132
Transductive SVM
$\omega(X, y; f)$: classification margin; f(x): classification function
Supervised learning: $f^* = \arg\max_{f \in \mathcal{H}_K} \omega(X, y; f)$
Semi-supervised learning: optimize over both f(x) and $y_u$
133
Transductive SVM
$\omega(X, y; f)$: classification margin; f(x): classification function
Supervised learning: $f^* = \arg\max_{f \in \mathcal{H}_K} \omega(X, y; f)$
Semi-supervised learning: optimize over both f(x) and $y_u$:
$f^* = \arg\max_{f \in \mathcal{H}_K,\; y_u \in \{-1,+1\}^{n_u}} \omega(X, y_l, y_u; f)$
134
Transductive SVM
Decision boundary given a small number of labeled examples
Move the decision boundary to a region with low local density
Classification results
How do we formulate this idea?
135
Transductive SVM: Formulation
Original SVM:
$\{w^*, b^*\} = \arg\min_{w, b} \; w^\top w$
s.t. $y_i (w^\top x_i + b) \ge 1$ for the labeled examples $i = 1, \ldots, n$
Transductive SVM (one binary variable for the label of each unlabeled example):
$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}} \; \arg\min_{w, b} \; w^\top w$
s.t. $y_i (w^\top x_i + b) \ge 1$ for the labeled examples $i = 1, \ldots, n$
and $y_{n+j} (w^\top x_{n+j} + b) \ge 1$ for the unlabeled examples $j = 1, \ldots, m$
136
Computational Issue
$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}} \; \arg\min_{w, b} \; w^\top w + \sum_{i=1}^{n} \varepsilon_i + \sum_{j=1}^{m} \varepsilon'_j$
s.t. $y_i (w^\top x_i + b) \ge 1 - \varepsilon_i$ for the labeled examples
and $y_{n+j} (w^\top x_{n+j} + b) \ge 1 - \varepsilon'_j$ for the unlabeled examples
No longer a convex optimization problem.
Alternating optimization
137
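A simplified sketch of this alternating optimization is given below: (1) fix the labels of the unlabeled points and train a margin classifier, then (2) fix the classifier and re-assign the unlabeled labels from its predictions, and repeat. This is not the exact TSVM algorithm from the literature (e.g., Joachims' pairwise label swapping with a class-balance constraint); it only illustrates the alternating structure, and the hinge-loss subgradient trainer and toy data are assumptions made for brevity.

```python
# Alternating optimization for a transductive linear classifier (simplified sketch).
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=300):
    """Hinge-loss + L2 regularization, trained by subgradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                              # points violating the margin
        w -= lr * (lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(X))
        b += lr * y[viol].sum() / len(X)
    return w, b

def tsvm_alternating(X_l, y_l, X_u, rounds=5):
    w, b = train_linear_svm(X_l, y_l)                   # start from the labeled data only
    for _ in range(rounds):
        y_u = np.where(X_u @ w + b >= 0, 1.0, -1.0)     # fix f, assign labels y_u
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u])
        w, b = train_linear_svm(X, y)                   # fix y_u, retrain the classifier
    return w, b

rng = np.random.default_rng(1)
X_l = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_l = np.array([-1.0, 1.0])
X_u = np.vstack([rng.normal([-2, 0], 0.3, (20, 2)), rng.normal([2, 0], 0.3, (20, 2))])
w, b = tsvm_alternating(X_l, y_l, X_u)
print(w, b)
```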
Summary
Based on maximum margin principle
Classification margin is decided by
Labeled examples
Class labels assigned to unlabeled data
High computational cost
Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
138
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for the unlabeled data) or inductive (learn a classification function)?
139
Text Classification by TSVM
10 categories from the
Reuters collection
3299 test documents
1000 informative words
selected by MI criterion
140
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
141
Co-training [Blum & Mitchell, 1998]
Classify web pages into a category for students and a category for professors
Two views of web pages
Content
“I am currently the second year Ph.D. student …”
Hyperlinks
“My advisor is …”
“Students: …”
142
Co-training for Semi-Supervised Learning
143
Co-training for Semi-Supervised Learning
It is easier to
classify this web
page using
hyperlinks
It is easy to
classify the type of
this web page
based on its
content
144
Co-training
Two representations for each web page
Content representation: (doctoral, student, computer, university, …)
Hyperlink representation: inlinks (e.g., Prof. Cheng), outlinks (e.g., Prof. Cheng)
145
Co-training
Train a content-based classifier
146
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
147
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Train a hyperlink-based classifier
148
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Train a hyperlink-based classifier
Label the unlabeled examples that are
confidently classified
149
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Train a hyperlink-based classifier
Label the unlabeled examples that are
confidently classified
150
Co-training
Assume two views of the objects
Key idea: two sufficient representations; augment the training examples of one view by exploiting the classifier of the other view
Extension to multiple views
Problem: how to find equivalent views
151
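A minimal co-training sketch following the loop on the previous slides: two classifiers, one per view, each label the unlabeled examples they are most confident about and hand them to the shared labeled pool. A nearest-centroid classifier stands in for the base learner, and the two "views" here are just two synthetic feature sets; both are assumptions made for brevity.

```python
# Co-training with two views and confidence-based self-labeling (sketch).
import numpy as np

def centroid_classifier(X, y):
    """Nearest-centroid classifier; returns a function giving (labels, confidences)."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    def predict(Z):
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        order = np.argsort(d, axis=1)
        rows = np.arange(len(Z))
        conf = d[rows, order[:, 1]] - d[rows, order[:, 0]]   # margin between centroids
        return classes[order[:, 0]], conf
    return predict

def co_training(X1, X2, y, rounds=5, per_round=2):
    y = y.copy()                               # -1 marks an unlabeled example
    for _ in range(rounds):
        for X_self in (X1, X2):                # alternate between the two views
            lab, unlab = np.where(y >= 0)[0], np.where(y < 0)[0]
            if len(unlab) == 0:
                return y
            predict = centroid_classifier(X_self[lab], y[lab])
            pred, conf = predict(X_self[unlab])
            # add the most confidently classified unlabeled examples to the labeled set
            for j in np.argsort(-conf)[:per_round]:
                y[unlab[j]] = pred[j]
    return y

rng = np.random.default_rng(0)
X1 = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])  # "content" view
X2 = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])  # "hyperlink" view
y = np.full(20, -1)
y[0], y[10] = 0, 1                             # one labeled example per class
print(co_training(X1, X2, y))
```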
A Few Words about Active Learning
Active learning
Select the most informative examples
In contrast to passive learning
Key question: which examples are
informative
Uncertainty principle: most informative example
is the one that is most uncertain to classify
Measure classification uncertainty
152
A Few Words about Active Learning
Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty = the degree of disagreement among them
SVM-based approach: classification uncertainty = distance to the decision boundary
Simple but very effective approaches
153
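A minimal sketch of the SVM-style uncertainty criterion follows: query the unlabeled examples closest to the current decision boundary, i.e., with the smallest |f(x)|. The linear scorer (w, b) is assumed to have been trained on the labeled pool already; the random pool is illustrative.

```python
# Uncertainty sampling: pick unlabeled points closest to the decision boundary (sketch).
import numpy as np

def most_uncertain(X_unlabeled, w, b, n_queries=3):
    """Return indices of the unlabeled examples with the smallest |f(x)|."""
    scores = np.abs(X_unlabeled @ w + b)      # |f(x)|, a proxy for distance to the boundary
    return np.argsort(scores)[:n_queries]

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(0, 2, (100, 2))
w, b = np.array([1.0, -1.0]), 0.0             # assumed: a previously trained linear classifier
print(most_uncertain(X_unlabeled, w, b))      # candidate examples to send for labeling
```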
Topics of Semi-supervised Learning
Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Semi-supervised clustering algorithms
154
Semi-supervised Clustering
Clustering data into two clusters
155
Semi-supervised Clustering
Must link
cannot link
Clustering data into two clusters
Side information:
Must links vs. cannot links
156
Semi-supervised Clustering
Also called constrained clustering
Two types of approaches
Restricted data partitions
Distance metric learning approaches
157
Restricted Data Partition
Require data partitions to be consistent with the given links
Links as hard constraints: e.g., constrained K-means (Wagstaff et al., 2001)
Links as soft constraints: e.g., Metric Pairwise Constrained K-means (Basu et al., 2004)
158
Restricted Data Partition
Hard constraints
Cluster memberships must obey the link constraints
must link
Yes
cannot link
159
Restricted Data Partition
Hard constraints
Cluster memberships must obey the link constraints
must link
Yes
cannot link
160
Restricted Data Partition
Hard constraints
Cluster memberships must obey the link constraints
must link
No
cannot link
161
Restricted Data Partition
Soft constraints
Penalize data clustering if it violates some links
must link
Penalty = 0
cannot link
162
Restricted Data Partition
Soft constraints
Penalize the clustering if it violates some links
must link
Penalty = 0
cannot link
163
Restricted Data Partition
Soft constraints
Penalize the clustering if it violates some links
must link
Penalty = 1
cannot link
164
Distance Metric Learning
Learn a distance metric from the pairwise links:
Enlarge the distance for a cannot-link
Shrink the distance for a must-link
Then apply K-means with pairwise distances measured by the learned metric
[Figure: data transformed by the learned distance metric, with must-link and cannot-link pairs marked]
165
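A toy sketch of this two-step recipe follows: (1) learn a diagonal feature weighting that shrinks must-link distances and stretches cannot-link distances, then (2) run ordinary k-means in the rescaled space. The diagonal heuristic and the data are only illustrations, not the metric-learning formulations cited in the references.

```python
# Toy distance-metric learning (diagonal weighting) followed by k-means (sketch).
import numpy as np

def learn_diagonal_metric(X, must, cannot):
    """Per-dimension weights: large where cannot-links differ, small where must-links differ."""
    d_must = np.mean([(X[i] - X[j]) ** 2 for i, j in must], axis=0)
    d_cannot = np.mean([(X[i] - X[j]) ** 2 for i, j in cannot], axis=0)
    return np.sqrt(d_cannot / (d_must + 1e-6))

def kmeans(X, k=2, iters=20, seed=0):
    """Plain Lloyd's k-means returning the cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.vstack([X[assign == c].mean(axis=0) if np.any(assign == c)
                             else centers[c] for c in range(k)])
    return assign

rng = np.random.default_rng(0)
# two clusters that differ mainly in the first dimension; the second dimension is noise
X = np.vstack([rng.normal([0, 0], [0.5, 3.0], (20, 2)),
               rng.normal([3, 0], [0.5, 3.0], (20, 2))])
must = [(0, 1), (20, 21)]             # pairs known to belong together
cannot = [(0, 20), (1, 21)]           # pairs known to be in different clusters
w = learn_diagonal_metric(X, must, cannot)
print(kmeans(X * w))                  # k-means on the rescaled data
```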
Example of Distance Metric Learning
2D data projection using Euclidean
distance metric
2D data projection using learned
distance metric
Solid lines: must links
dotted lines: cannot links
166
BoostCluster [Liu, Jin & Jain, 2007]
General framework for semi-supervised clustering
Improves any given unsupervised clustering algorithm with
pairwise constraints
Key challenges
How to influence an arbitrary clustering algorithm by side
information?
Encode constraints into data representation
How to take into account the performance of underlying clustering
algorithm?
Iteratively improve the clustering performance
167
BoostCluster
[Pipeline diagram: Data and Pairwise Constraints → Kernel Matrix → New Data Representation → Clustering Algorithm → Clustering Results; after the final iteration, the Clustering Algorithm produces the Final Results]
Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
168
BoostCluster
[Pipeline diagram, as above]
Find the best data representation that encodes the unsatisfied pairwise constraints
169
BoostCluster
[Pipeline diagram, as above]
Obtain the clustering results given the new data representation
170
BoostCluster
[Pipeline diagram, as above]
Update the kernel with the clustering results
171
BoostCluster
[Pipeline diagram, as above]
Run the procedure iteratively
172
BoostCluster
[Pipeline diagram, as above]
Compute the final clustering result
173
Summary
Clustering data under given pairwise constraints (must-links vs. cannot-links)
Two types of approaches:
Restricted data partitions (either soft or hard)
Distance metric learning
Question: how to acquire the links/constraints?
Manual assignment
Derived from side information: hyperlinks, citations, user logs, etc.
May be noisy and unreliable
174
Application: Document Clustering
[Basu et al., 2004]
300 docs from topics
(atheism, baseball, space)
of 20-newsgroups
3251 unique words after
removal of stopwords and
rare words and stemming
Evaluation metric: Normalized Mutual Information (NMI)
KMeans-x-x: different
variants of constrained
clustering algs.
175
Outline
Introduction to information retrieval, statistical inference and
machine learning
Supervised learning and its application to text classification,
adaptive filtering, collaborative filtering and ranking
Semi-supervised learning and its application to text
classification
Emerging research directions
176
Efficient Learning
In IR, we have massive amounts of data
But most learning algorithms are relatively slow: it is difficult to handle millions of documents
How to improve scalability?
Sampling: use only part of the data
Stochastic optimization: update the model with one example at a time (related to online learning)
More interestingly, more examples may mean more efficient training (Srebro, ICML 2008)
177
Kernel Learning
Kernels play a central role in machine learning
Kernel functions can be learned from data: kernel alignment, multiple kernel learning, nonparametric kernel learning, …
Kernel learning is suitable for IR
Similarity measure is key to IR
Kernel learning allows us to identify the optimal
similarity measure automatically
178
Transfer Learning
Different document categories are correlated
We should be able to borrow information from one class for the training of another class
Key question: what to transfer between
classes?
Representation, model priors, similarity
measure …
179
Active Learning IR Applications
Relevance feedback (text retrieval or image
retrieval)
Text classification
Adaptive information filtering
Collaborative filtering
Query Rewriting
180
Discriminative Language Models
Language models have been shown to be effective for information retrieval
But most language models are generative, and thus miss the discriminative power
Key difficulty for discriminative language models: no output labels!
Side information
Mixture of generative and discriminative models
181
References
A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification.
In AAAI-98 Workshop on Learning for Text Categorization, 1998
Tong Zhang and Frank J. Oles, Text Categorization Based on Regularized Linear Classification
Methods, Journal of Information Retrieval, 2001
F. Li and Y. Yang. A loss function analysis for classification methods in text categorization, The
Twentieth International Conference on Machine Learning (ICML'03)
Chengxiang Zhai and John Lafferty, A study of smoothing methods for language models
applied to information retrieval, ACM Trans. Inf. System, 2004
A. Blum and T. Mitchell, Combining Labeled and Unlabeled Data with Co-training, COLT
1998
D. Blei and M. Jordan, Variational methods for the Dirichlet process, ICML 2004
T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Mach. Learn.,
42(1-2), 2001
D. Blei, A. Ng and M. Jordan, Latent Dirichlet allocation, NIPS*2002
R. Jin, C. Ding, and F. Kang, A Probabilistic Approach for Optimizing Spectral Clustering,
NIPS*2005
D. Zhou, B. Scholkopf, and T. Hofmann, Semi-supervised learning on directed graphs,
NIPS*2005.
X. Zhu, Z. Ghahramani, and J. D. Lafferty, Semi-supervised learning using Gaussian fields and
harmonic functions. ICML 2003.
T. Joachims, Transductive Learning via Spectral Graph Partitioning, ICML 2003
182
References
Andrew McCallum and Kamal Nigam, Employing EM in Pool-Based Active Learning for
Text Classification, Proceedings of the International Conference on Machine Learning, 1998
David A. Cohn and Zoubin Ghahramani and Michael I. Jordan, Active Learning with
Statistical Models, Journal of Artificial Intelligence Research, 1996
S. Tong and E. Chang. Support vector machine active learning for image retrieval. In ACM
Multimedia, 2001
Xuehua Shen and ChengXiang Zhai, Active feedback in ad hoc information retrieval, SIGIR
'05
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear
predictors. Information and Computation, 1997.
X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li. Multi-Model Similarity Propagation and its Application
for Web Image Retrieval, ACM Multimedia, 2004
M. Belkin and P. Niyogi and V. Sindhwani, Manifold Regularization, Technical Report, Univ.
of Chicago, 2006
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with
background knowledge. In ICML '01, 2001.
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised
clustering. In SIGKDD '04, 2004.
183
References
Xiaofei He, Benjamin Rey, Wei Vivian Zhang, Rosie Jones, Query Rewriting using Active Learning
for Sponsored Search, SIGIR07
Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on bayesian
active learning. In Proceedings of 20th International Conf. on Machine Learning, 2003.
Z. Xu and R. Akella. A bayesian logistic regression model for active relevance feedback (SIGIR08)
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. ICML 2000
M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking.
Machine learning, 2004
J. Rocchio. Relevance feedback in information retrieval, In The Smart System: experiments in
automatic document processing. Prentice Hall, 1971.
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the fifth annual
workshop on Computational learning theory, 1992
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee
algorithm. Machine Learning, 28(2-3):133–168, 1997
D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine
learning, 1994.
Robert M. Bell and Yehuda Koren, Lessons from the Netflix Prize Challenge, KDD Explorations 2008
Tie-Yan Liu, Tutorial: Learning to rank
Soumen Chakrabarti, Learning to Rank in Vector Spaces and Social Networks, www 2007
184
Thank You
God, it is finally over !
185