Transcript ppt version

Feature/Kernel Learning
in Semi-supervised scenarios
Kevin Duh
[email protected]
UW 2007 Workshop on SSL for
Language Processing
Agenda
1. Intro: Feature Learning in SSL
   1. Assumptions
   2. General algorithm/setup
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Semi-supervised learning (SSL):
Three general assumptions
• How can unlabeled data help?
1. Bootstrap assumption:
– Automatic labels on unlabeled data have sufficient
accuracy
– Can be incorporated into training set
2. Low Density Separation assumption:
– Classifier should not cut through a high density
region because clusters of data likely come from the
same class
– Unlabeled data can help identify high/low density
regions
3. Change of Representation assumption
Low Density Assumption
[Figure: scatter of labeled (+ / -) and unlabeled (o) samples with two candidate decision boundaries]
– Green line cuts through a high density region
– Black line cuts through a low density region
Change of Representation Assumption
• Basic learning process:
1. Define a feature representation of data
2. Map each sample into a feature vector
3. Apply a vector classifier learning algorithm
• What if the original feature set is bad?
– Then learning algos will have a hard time
• Ex: High correlation among features
• Ex: Irrelevant features
– Different learning algos deal with this differently
• Assumption: Large amounts of unlabeled data
can help us learn a better feature representation
Two stage learning process
• Stage 1: Feature learning
1. Define an original feature representation
2. Using unlabeled data, learn a “better” feature representation
• Stage 2: Supervised training
1. Map each labeled training sample into the new feature vector
2. Apply a vector classifier learning algo
(Basic learning process)
labeled Data → Feature Vector → Classifier
(2-stage learning process)
unlabeled Data → Original Feature Vector → New Feature Vector
labeled Data → New Feature Vector → Classifier
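A minimal sketch of the 2-stage process using scikit-learn (my own illustration: TruncatedSVD as the stage-1 feature learner and logistic regression as the stage-2 classifier are just one possible choice, and the toy data is random):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy data: rows are documents, columns are unigram counts.
# X_unlabeled is large and has no labels; X_labeled/y_labeled is small.
rng = np.random.RandomState(0)
X_unlabeled = rng.poisson(1.0, size=(1000, 50)).astype(float)
X_labeled = rng.poisson(1.0, size=(20, 50)).astype(float)
y_labeled = rng.randint(0, 2, size=20)

# Stage 1: feature learning on unlabeled data
# (learn a linear transform y = A x, here via truncated SVD).
feature_learner = TruncatedSVD(n_components=10, random_state=0)
feature_learner.fit(X_unlabeled)

# Stage 2: supervised training on the transformed labeled data.
Z_labeled = feature_learner.transform(X_labeled)
clf = LogisticRegression().fit(Z_labeled, y_labeled)

# New samples are mapped through the same learned transform.
Z_new = feature_learner.transform(rng.poisson(1.0, size=(5, 50)).astype(float))
print(clf.predict(Z_new))
```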
Philosophical reflection
• Machine learning is not magic
– Human knowledge is encoded in feature representation
– Machine & statistics only learn the relative weight/importance of
each feature on data
• Two-stage learning is not magic
– Merely extends machine/statistical analysis to the feature
“engineering” stage
– Human knowledge still needed in:
• Original feature representation
• Deciding how original feature transforms into new feature (Here is
the research frontier!)
• Designing features and learning weights on features are
really the same thing – it’s just division of labor
between human and machine.
Wait, how is this different from …
• Feature selection/extraction?
– This approach applies traditional feature selection to a larger,
unlabeled dataset.
– In particular, we apply unsupervised feature selection methods
(semi-supervised feature selection is an unexplored area)
• Supervised learning?
– Supervised learning is a critical component
• We can use all we know in feature selection &
supervised learning here:
– These components aren’t new: the new thing here is the semi-supervised perspective.
– So far there is little work on this in language processing
(compared to auto-labeling): lots of research opportunities!!!
Research questions
• Given lots of unlabeled data and an original feature
representation, how can we transform it into better, new
features?
1. How to define “better”? (objective)
– Ideally, “better” = “leads to higher classification accuracy”, but
this also depends on the downstream classifier algo
– Is there a surrogate metric for “better” that does not require
labels?
2. How to model the feature transformation? (modeling)
– Mostly we use a linear transform: y = Ax (see the sketch below)
• x = original feature vector; y = new feature vector
• A = transform matrix to learn
3. Given the objective and model, how to find A? (algo)
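A tiny numpy sketch of the linear-transform model y = Ax (the matrix A below is random, purely to show the shapes involved; the whole point of the algorithms that follow is how to learn A from unlabeled data):

```python
import numpy as np

rng = np.random.RandomState(0)

d_orig, d_new = 8, 3          # original and new feature dimensionalities
x = rng.poisson(1.0, size=d_orig).astype(float)   # original feature vector (e.g. unigram counts)
A = rng.randn(d_new, d_orig)  # transform matrix: this is what feature learning must find

y = A @ x                     # new, lower-dimensional feature vector
print(x.shape, "->", y.shape)
```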
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
1. Correlated features
2. Irrelevant features
3. Feature Set is not sufficiently expressive
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Running Example:
Topic classification
• Binary classification of documents by topics: Politics vs.
Sports
• Setup:
– Data: Each document is vectorized so that each element
represents a unigram and its count
• Doc1: “The President met the Russian President”
• Vocab: [The President met Russian talks strategy Czech missile]
• Vector1: [2 2 1 1 0 0 0 0]
– Classifier: Some supervised training algorithm
• We will discuss 3 cases where original feature set may
be unsuitable for training
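A minimal sketch of this vectorization step (plain Python; lower-casing so that “The” and “the” both count toward the first element is an assumption made to match the slide’s count of 2):

```python
from collections import Counter

vocab = ["the", "president", "met", "russian", "talks", "strategy", "czech", "missile"]
doc1 = "The President met the Russian President"

counts = Counter(doc1.lower().split())
vector1 = [counts[w] for w in vocab]
print(vector1)   # [2, 2, 1, 1, 0, 0, 0, 0]
```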
1. Highly correlated features
• Unigrams (features) in “Politics”:
– Republican, conservative, GOP
– White, House, President
– Iowa, primary
– Israel, Hamas, Gaza
• Unigrams in “Sports”:
– game, ballgame
– strike-out, baseball
– NBA, Sonics
– Tiger, Woods, golf
• OTHER EXAMPLES?
• Highly correlated features represent redundant information:
– A document with both “Republican” and “conservative” isn’t necessarily
more about politics than a document with just “Republican.”
– Some classifiers are hurt by highly correlated (redundant) features through
“double-counting”, e.g. Naïve Bayes & other generative models
2. Irrelevant features
• Unigrams in both “Politics” & “Sports”:
– the, a, but, if, while (stop words)
– win, lose (topic neutral words)
– Bush (politician or sports player)
• Very rare unigrams in either topic
– Wazowski, Pitcairn (named entities)
– Brittany Speers (typos, noisy features)
• OTHER EXAMPLES?
• Irrelevant features unnecessarily expand the
“search space”. They should get zero weight, but the
classifier may not assign it.
3. Feature Set is not
sufficiently expressive
• Original feature set:
– 3 features: [leaders challenge group]
– Politics or Sports?
• “The leaders of the terrorist group issued a challenge”
• “It’s a challenge for the cheer leaders to take a group picture”
• Different topics have similar (same) feature vectors
• A more expressive (discriminative) feature set:
– 5 features: [leader challenge group cheer terrorist]
– Bigrams: [“the leaders” “cheer leaders” …]
• Vector-based linear classifiers simply don’t work
– Data is “inseparable” by the classifier
• Non-linear classifiers (e.g. kernel methods, decision trees, neural
networks) automatically learn combinations of features
– But, there is still no substitute for a human expert thinking of features
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Dealing with correlated and/or
irrelevant features
• Feature clustering methods:
– Idea: cluster features so that similarly distributed
features are mapped together
– Algorithms:
• Word clustering (using the Brown algo, spectral clustering, k-means…; see the sketch below)
• Latent Semantic Analysis
• Principal Components Analysis (PCA):
– Idea: Map original feature vector into directions of
high variance
• Words with similar counts across all documents should be
deleted from feature set
• Words with largely varying counts *might* be a discriminating
feature
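A minimal sketch of feature clustering with k-means (an illustration, not the Brown algorithm: each word is represented by its row of document counts, so words with similar distributions land in the same cluster, which then replaces them as a single feature):

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = words, columns = documents (counts); correlated words have similar rows.
X = np.array([
    [5, 5, 0, 0, 10],   # "president"
    [4, 3, 0, 1, 7],    # "republican"
    [0, 0, 7, 6, 3],    # "baseball"
], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # correlated words ("president", "republican") share a cluster id

# New document features: total count per word-cluster instead of per word.
doc = np.array([2, 1, 0], dtype=float)           # counts of the 3 words in one document
new_features = [doc[labels == c].sum() for c in range(2)]
print(new_features)
```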
Latent Semantic Analysis (LSA)
• Currently, a document is represented by a set of
terms (words).
• LSA: represent a document by a set of “latent
semantic” variables, which are coarser-grained
than terms.
– A kind of “feature-clustering”
– Latent semantics could cluster together synonyms or
similar words, but in general it may be difficult to say
“what” the latent semantics correspond to.
– Strengths: data-driven, scalable algorithm (with good
results in natural language processing and
information retrieval!)
LSA Algorithm
1. Represent data as a term-document matrix
2. Perform singular value decomposition (SVD)
3. Map documents to new feature vectors using the
singular vectors with the largest singular values
• Term-document matrix:
– t terms, d documents. Matrix is size (t x d)
– Each column is a document (represented by how
many times each term occurs in that doc)
– Each row is a term (represented by how many times
it occurs in various documents)
Term-document matrix: X

        doc1  doc2  doc3  doc4  doc5
term1    5     5     0     0    10
term2    4     3     0     1     7
term3    0     0     7     6     3
• Distance between 2 documents: dot product between column
vectors
– d1’ * d2 = 5*5+4*3+0*0 = 37
– d1’ * d3 = 5*0+4*0+0*7 = 0
– X’ * X is the matrix of document distances
• Distance between 2 terms: dot product between row vectors
– t1 * t2’ = 5*4+5*3+0*0+0*1+10*7 = 105
– t1 * t3’ = 5*0+5*0+0*7+0*6+10*3=30
– X * X’ is the matrix of term distances
Singular Value Decomposition
(SVD)
• Eigendecomposition:
– A is a square (symmetric) matrix
– Av = ev
• (v is an eigenvector, e is an eigenvalue)
– A = VEV’
• (E is the diagonal matrix of eigenvalues)
• (V is the matrix of eigenvectors; V*V’ = V’*V = I)
• SVD:
– X is (t x d) rectangular matrix
– Xd = st
• (s is singular value, t & d are left/right singular vectors)
– X = TSD’
• (S is t x d matrix with diagonal singular values and others 0)
• (T is t x t, D is d x d matrix; T*T’=I; D*D’=I)
SVD on term-document matrix
• X = T*S*D’ = (t x t)* (t x d) * (d x d)
(t x t) factor T:
  -0.78  0.25  0.56
  -0.55  0.11 -0.82
  -0.27 -0.96  0.05

(t x d) factor S (diagonal of singular values):
  15.34  0     0     0  0
   0     9.10  0     0  0
   0     0     0.85  0  0

(d x d) factor:
  -0.40  0.18 -0.52 -0.36 -0.62
  -0.36  0.17  0.42  0.66 -0.45
  -0.12 -0.73  0.45 -0.37 -0.30
  -0.14 -0.62 -0.56  0.51  0.08
  -0.81  0.04  0.09 -0.15  0.54
• Low-rank approx X ~ (t x k)*(k x k)*(k x d)
First k = 2 columns of T (t x k):
  -0.78  0.25
  -0.55  0.11
  -0.27 -0.96

Top k x k block of S:
  15.34  0
   0     9.10

(k x d) factor:
  -0.40  0.18 -0.52 -0.36 -0.62
  -0.36  0.17  0.42  0.66 -0.45
• X’*X = (D S’ T’)(T S D’) = D (S’S) D’
• New document vector representation: d’*T*inv(S)
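A small numpy sketch of the same LSA computation (variable names are mine; numpy may flip the signs of singular vectors relative to the slide):

```python
import numpy as np

# Term-document matrix X: rows = terms, columns = documents.
X = np.array([[5, 5, 0, 0, 10],
              [4, 3, 0, 1, 7],
              [0, 0, 7, 6, 3]], dtype=float)

# Full SVD: X = T * diag(s) * Dt, with Dt = D'.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 2))          # singular values, roughly [15.34, 9.10, 0.85]

# Rank-k (k=2) LSA representation of the documents:
k = 2
doc_vectors = Dt[:k, :].T * s[:k]   # each row is a document in latent space
print(np.round(doc_vectors, 2))

# Folding in a new document with term counts x: x' * T_k * inv(S_k)
x_new = np.array([3.0, 2.0, 0.0])
new_doc = x_new @ T[:, :k] / s[:k]
print(np.round(new_doc, 2))
```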
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
   1. Links between features and kernels
   2. Algo2: Neighborhood kernels
   3. Algo3: Gaussian Mixture Model kernels
5. Related work
Philosophical reflection: Distances
• (Almost) Everything in machine learning is about
distances:
– A new sample is labeled positive because it is “closer”
in distance to positive training examples
• How to define distance?
– Conventional way: dot product of feature vectors
– That’s why feature representation is important
• What if we directly learn the distance?
– d(x1,x2): inputs are two samples, output is a number
– Argument: it doesn’t matter what features we define;
all that matters is whether positive samples are close
in distance to other positive samples, etc.
What is a kernel?
• Kernel is a distance function:
– Dot product in feature space
• x1= [a b]; x2 = [c d] , d(x1,x2)=x1’*x2 = ac+bd
– Polynomial kernel:
• Map features into higher-order polynomials
• x1 = [aa bb ab ba]; x2 = [cc dd cd dc]
• d(x1,x2) = aacc + bbdd + 2abcd
• Kernel trick: polynomial kernel = (ac+bd)^2
• When we define a kernel, we are implicitly
defining a feature space
– Recall in SVMs, kernels allow classification in spaces
non-linear w.r.t. the original features
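A quick numerical check of the kernel trick (a minimal sketch; the explicit degree-2 feature map and the toy vectors are illustrative choices):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: [a b] -> [aa, bb, ab, ba]."""
    a, b = v
    return np.array([a * a, b * b, a * b, b * a])

x1 = np.array([2.0, 3.0])   # [a b]
x2 = np.array([1.0, 4.0])   # [c d]

explicit = phi(x1) @ phi(x2)          # dot product in the expanded feature space
trick = (x1 @ x2) ** 2                # polynomial kernel: (ac + bd)^2
print(explicit, trick)                # both equal 196.0
```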
Kernel Learning in
Semi-supervised scenarios
1. Learn a kernel function k(x1,x2) from
unlabeled+labeled data
2. Use this kernel when training a
supervised classifier from labeled data
(2-stage learning process)
unlabeled Data + labeled Data + Kernel → Algo → New Kernel
labeled Data + New Kernel → Kernel-based Classifier
Neighborhood kernel (1/3)
• Idea: Distance between x1, x2 is based on
distances between their neighbors
• Below are samples in feature space
– Would you say the distances d(x1,x2) and
d(x2,x7) are equivalent?
– Or does it seem like x2 should be closer to x1
than x3?
[Figure: scatter of samples x1–x11 in feature space]
Neighborhood kernel (2/3)
• Analogy: “I am close to you if you are in my clique
of close friends.”
– Neighborhood kernel k(x1,x2): avg distance among
neighbors of x1, x2
• Eg. x1= [1]; x2 = [6], x3=[5], x4=[2]
– Original distance: x2-x1 = 6-1 = 5
– Neighbor distance = 4
• closest neighbor to x1 is x4; closest neighbor to x2 is x3
• [(x2-x1) + (x3-x1) + (x2-x4) + (x3-x4)]/4 = [5+4+4+3]/4 = 4
Neighborhood kernel (3/3)
• Mathematical form (averaging a base kernel/distance over neighbor sets,
as in the example above):
– K_nbhd(x1,x2) = [1 / (|N(x1)|*|N(x2)|)] * Σ_{u in N(x1)} Σ_{v in N(x2)} K_base(u,v)
– N(x) = x together with its nearest neighbors
• Design questions:
– What’s the base distance? What’s the distance used
for finding close neighbors?
– How many neighbors to use?
• Important: When do you expect neighborhood
kernels to work? What assumptions does this
method employ?
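A minimal sketch of a neighborhood kernel (my own illustrative implementation with a dot-product base kernel and Euclidean nearest neighbors; the cluster kernels of Weston et al. differ in details):

```python
import numpy as np

def neighborhood_kernel(i, j, X, n_neighbors=1):
    """K(x_i, x_j): average base kernel over the neighbor sets of x_i and x_j.

    N(x) = {x} plus its n_neighbors nearest points in the (unlabeled) pool X.
    Base kernel here is the plain dot product (an illustrative choice).
    """
    def neighbor_set(idx):
        dists = np.linalg.norm(X - X[idx], axis=1)
        order = np.argsort(dists)                 # idx itself comes first (distance 0)
        return X[order[:n_neighbors + 1]]

    Ni, Nj = neighbor_set(i), neighbor_set(j)
    return np.mean(Ni @ Nj.T)                     # average over all pairs (u, v)

# Pool of (mostly unlabeled) samples; compare with the worked example on slide 2/3.
X = np.array([[1.0], [6.0], [5.0], [2.0]])        # x1, x2, x3, x4
print(neighborhood_kernel(0, 1, X))               # kernel between x1 and x2
```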
Gaussian Mixture Model Kernels
1. Perform a (soft) clustering of your labeled + unlabeled
dataset (by fitting a Gaussian mixture model)
[Figure: scatter of samples x1–x11 in feature space, grouped into soft clusters]
2. The distance between two samples is defined by
whether they are in the same cluster:
K(x,y) = prob[x in cluster1]*prob[y in cluster1]
+ prob[x in cluster2]*prob[y in cluster2]
+ prob[x in cluster3]*prob[y in cluster3] + …
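A minimal sketch of a GMM kernel with scikit-learn (toy 2-D data and 2 mixture components are my own choices; the cited KernelBoost work is more elaborate):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Labeled + unlabeled samples pooled together (2-D toy data, two blobs).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5.0])

# Step 1: soft clustering with a Gaussian mixture model.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Step 2: K(x, y) = sum_c P[x in cluster c] * P[y in cluster c].
P = gmm.predict_proba(X)        # shape (n_samples, n_components)
K = P @ P.T                     # kernel matrix over all samples

print(np.round(K[0, 1], 3))     # same cluster -> close to 1
print(np.round(K[0, 60], 3))    # different clusters -> close to 0
```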
Seeing relationships
• We’ve presented various methods, but at some
level they are all the same…
• Gaussian mixture kernel & neighborhood
kernel?
• Gaussian mixture kernel & feature clustering?
• Can you define a feature learning version of
neighborhood kernels?
• Can you define a kernel learning version of
LSA?
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Related Work: Feature Learning
• Look into machine learning & signal processing literature for
unsupervised feature selection / extraction
• Discussed here:
– Two-stage learning:
• C. Oliveira, F. Cozman, and I. Cohen. “Splitting the unsupervised and
supervised components of semi-supervised learning.” In ICML 2005
Workshop on Learning with Partially Classified Training Data, 2005.
– Latent Semantic Analysis:
• S. Deerwester et al., “Indexing by Latent Semantic Analysis”, Journal of Soc.
For Information Sciences (41), 1990.
• T. Hofmann, “Probabilistic Latent Semantic Analysis”, Uncertainty in AI, 1999
– Feature clustering:
• W. Li and A. McCallum. “Semi-supervised sequence modeling with syntactic
topic models”. AAAI 2005
• Z. Niu, D.-H. Ji, and C. L. Tan. A semi-supervised feature clustering
algorithm with application to word sense disambiguation. HLT 2005
• Semi-supervised feature selection is a new area to explore:
• Z. Zhao and H. Liu. “Spectral Feature Selection for Supervised and
Unsupervised Learning”, ICML 2007
Related Work: Kernel learning
• Beware that many kernel learning algorithms are
transductive (don’t apply to samples not originally in your
training data):
• N. Cristianini et al., “On kernel target alignment”, NIPS 2002
• G. Lanckriet et al., “Learning a kernel matrix with semi-definite
programming,” ICML 2002
• C. Ong, A. Smola, R. Williamson, “Learning the kernel with
hyperkernels”, Journal of Machine Learning Research (6), 2005
• Discussed here:
– Neighborhood kernel and others:
• J. Weston et al., “Semi-supervised protein classification using
cluster kernels”, Bioinformatics 2005 21(15)
– Gaussian mixture kernel and KernelBoost:
• T. Hertz, A. Bar-Hillel and D. Weinshall. “Learning a Kernel Function
for Classification with Small Training Samples.” ICML 2006
• Other keywords: “distance learning”, “distance metric
learning”
• E. Xing, M. Jordan, S. Russell, “Distance metric learning with
applications to clustering with side-information”, NIPS 2002
Related work: SSL
• Good survey article:
• Xiaojin Zhu. “Semi-supervised learning literature survey.” Wisconsin CS
TechReport
• Graph-based SSL methods are popular. Some can be related to kernel
learning
• SSL on sequence models (of particular importance to NLP)
• J. Lafferty, X. Zhu and Y. Liu. “Kernel Conditional Random Fields:
Representation and Clique Selection.” ICML 2004
• Y. Altun, D. McAllester, and M. Belkin. “Maximum margin semi-supervised
learning for structured variables”, NIPS 2005
• U. Brefeld and T. Scheffer. “Semi-supervised learning for structured output
variables,” ICML 2006
• Other promising methods:
– Fisher kernels and features from generative models:
• T. Jaakkola & D. Haussler. “Exploiting generative models in discriminative
classifiers.” NIPS 1998
• A. Holub, M. Welling, and P. Perona. “Exploiting unlabelled data for hybrid
object classification.” NIPS 2005 Workshop on Inter-class transfer
• A. Fraser and D. Marcu. “Semi-supervised word alignment.” ACL 2006
– Multi-task learning formulation:
• R. Ando and T. Zhang. A High-Performance Semi-Supervised Learning
Method for Text Chunking. ACL 2006
• J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural
correspondence learning. EMNLP 2006
Summary
• Feature learning and kernel learning are two sides of the
same coin
• Feature/kernel learning in SSL consists of a 2-stage
approach:
– (1) learn better feature/kernel with unlabeled data
– (2) learn traditional classifier on labeled data
• 3 Cases of inadequate feature representation:
correlated, irrelevant, not expressive
• Particular algorithms presented:
– LSA, Neighborhood kernels, GMM kernels
• Many related areas, many algorithms, but underlying
assumptions may be very similar