Transcript slides

Learning Feature Representation with K-means
Adam Coates and Andrew Y. Ng
Shuai Zhao, 20170208
1
Background
 If we have a small dataset, we don’t have enough data to train the many parameters of a neural network.
 If we have a large dataset, we don’t have enough time to train those many parameters.
Learning Feature Representation with K-Means instead of Sparse Coding
2
Questions
 Speaking of feature representation, we have sparse coding and PCA. K-means is known for clustering. How can we use K-means for feature representation?
[Figure panels: Before, Between, After]
Pictures are from the Internet; if any infringes copyright, it will be deleted immediately.
3
K-Means Clustering Introduction
 Unsupervised Learning
 The goal of K-means clustering is to find K centroids that minimize the distance between each data point and its nearest centroid (a minimal sketch follows after this slide).
 Visualization http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
4
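The slide above links to an interactive demo; as a rough companion, here is a minimal NumPy sketch of a Lloyd-style K-means loop (toy data, k, and iteration count are arbitrary choices for illustration, not the demo's code):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from random points
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Toy data: two well-separated blobs in R^2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)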
[Slides 5–9: images only, no text]
K-Means Optimization Function
 Each data point is an n-dimensional vector $x^{(i)} \in \mathbb{R}^n$, $i = 1, \dots, m$, so there are m data points.
 The optimization function is $J(c, \mu) = \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert_2^2$, where $\mu_{c^{(i)}}$ is the nearest corresponding centroid.
 The K centroids can be combined into a matrix D ($D \in \mathbb{R}^{n \times k}$), and $s^{(i)}$ is a “code vector” associated with the input $x^{(i)}$. The optimization function can then be rewritten as
$J = \sum_{i=1}^{m} \lVert x^{(i)} - D s^{(i)} \rVert_2^2$
subject to $\lVert s^{(i)} \rVert_0 \le 1 \;\forall i$ (each $s^{(i)}$ has at most one non-zero entry)
and $\lVert D^{(j)} \rVert_2 = 1 \;\forall j$ ($D^{(j)}$ is the jth column of D).
[Figure: D (n×k) times s^(i) (k×1) approximates x^(i) (n×1); a small numeric sketch follows after this slide.]
10
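As a sanity check of the rewritten objective, here is a small NumPy sketch (my own illustration, not code from the paper) that builds a code vector with at most one non-zero entry for a column-normalized D and evaluates J:

import numpy as np

def kmeans_code(D, x):
    """Code vector s with at most one non-zero entry minimizing ||x - D s||^2.
    Assumes the columns of D are unit-norm, so the best column is the one
    with the largest |D^(j)^T x|, and the best coefficient is that projection."""
    proj = D.T @ x                       # projections of x onto every column of D
    j = np.argmax(np.abs(proj))          # pick the single best column
    s = np.zeros(D.shape[1])
    s[j] = proj[j]
    return s

rng = np.random.default_rng(0)
n, k, m = 4, 3, 10
D = rng.normal(size=(n, k))
D /= np.linalg.norm(D, axis=0)           # column-normalize: ||D^(j)||_2 = 1
X = rng.normal(size=(m, n))

S = np.array([kmeans_code(D, x) for x in X])
J = sum(np.linalg.norm(x - D @ s) ** 2 for x, s in zip(X, S))
print("J =", J)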
μ_{c^(i)} and D s^(i)
 $J(c, \mu) = \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert_2^2$ and $J = \sum_{i=1}^{m} \lVert x^{(i)} - D s^{(i)} \rVert_2^2$ express the same objective: choosing the nearest centroid $\mu_{c^{(1)}}$ corresponds to a code vector $s^{(1)}$ (k×1) that selects and scales a single column of D (n×k).
[Figure: worked numeric example with a 3×2 dictionary D, a code vector s^(1), the selected centroid μ_{c^(1)}, and the resulting J; a comparable example follows below.]
11
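The slide's exact numbers do not survive in this transcript, so the tiny example below uses assumed values to show the same point: D s^(1) simply selects and scales one column of D.

import numpy as np

# Illustrative values only; the slide's exact numbers are garbled in this transcript.
D = np.array([[0.8, 0.6],
              [0.1, 0.1],
              [0.6, 0.8]])      # n = 3, k = 2, columns roughly unit-length
s1 = np.array([0.0, 4.0])       # at most one non-zero entry: pick column 2, scale by 4
print(D @ s1)                   # [2.4, 0.4, 3.2], i.e. 4 * D[:, 1]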
Continue…
 Each column of D is a centroid, and there are k columns. D is column-normalized.
 Each column of D is a basis vector; for three-dimensional data, the three basis vectors can be visualized as three directions in that space.
 Together, the centroids span a new (scaled) representation space.
12
What is Sparse Coding
 Sparse coding was inspired by mammalian visual neurons and was first published by Olshausen and Field in Nature in 1996.
 In sparse coding, the goal is to represent each input vector 𝑥 (𝑖) as a sparse linear combination of basis vectors.
 Typically k > n, which is called “over-complete”.
 Comparing 𝑥 (𝑖) with its code 𝑠 (𝑖): 𝑥 (𝑖) is denser and lower-dimensional, while 𝑠 (𝑖) is sparse and higher-dimensional.
13
Sparse Coding and K-Means
 Sparse coding can be seen as K-means with two modifications (see the sketch after this slide):
1. Add a penalty that encourages 𝑠 (𝑖) to be sparse.
2. Allow 𝑠 (𝑖) to have more than one non-zero entry, enabling a much more accurate representation of each 𝑥 (𝑖) while still requiring each 𝑠 (𝑖) to be simple.
14
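Putting the two modifications together gives the familiar sparse coding objective; the sketch below uses a standard L1 penalty with an arbitrary weight lam, which may differ from the exact penalty in the paper:

import numpy as np

def sparse_coding_objective(D, S, X, lam=0.1):
    """J = sum_i ||x^(i) - D s^(i)||_2^2 + lam * ||s^(i)||_1.
    Unlike K-means, s^(i) may have many non-zero entries; the L1 term
    encourages most of them to stay at zero."""
    recon = sum(np.linalg.norm(x - D @ s) ** 2 for x, s in zip(X, S))
    penalty = lam * sum(np.abs(s).sum() for s in S)
    return recon + penalty

rng = np.random.default_rng(0)
D = rng.normal(size=(4, 8)); D /= np.linalg.norm(D, axis=0)   # over-complete: k > n
X = rng.normal(size=(5, 4))
S = rng.normal(size=(5, 8))                                   # codes (not yet sparse here)
print(sparse_coding_objective(D, S, X))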
Sparse Coding
 Sparse coding is widely used in image processing and text mining. A crucial step in object recognition is selecting proper features. Images are composed of a very large number of pixels, so we need sparse representations of natural images.
 Digit recognition
 Traffic Sign recognition
 Text Mining
15
PCA vs Sparse Coding
 PCA: k < n; decreases dimensions.
 Sparse coding: k > n; increases dimensions.
(A rough shape comparison follows below.)
16
https://cseweb.ucsd.edu/~dasgupta/254-deep/brian.pdf
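A rough NumPy sketch of the dimensionality contrast (all shapes and sizes are illustrative assumptions): PCA compresses each x^(i) into k < n dense coordinates, while sparse coding uses an over-complete dictionary with k > n columns:

import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 200
X = rng.normal(size=(m, n))
Xc = X - X.mean(axis=0)

# PCA: k < n -- project onto the top-k principal directions (dense, low-dimensional codes).
k_pca = 4
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
codes_pca = Xc @ Vt[:k_pca].T                 # shape (200, 4)

# Sparse coding: k > n -- an over-complete dictionary; codes are high-dimensional but mostly zero.
k_sc = 64
D = rng.normal(size=(n, k_sc))
D /= np.linalg.norm(D, axis=0)                # shape (16, 64); finding the sparse codes needs an L1 solver

print(codes_pca.shape, D.shape)               # (200, 4) codes vs a (16, 64) dictionary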
Summary
 D is an n × k matrix.
 In K-Means, each column is a centroid.
 In sparse coding for text mining, each column is a word; that is why D is called a dictionary.
 In PCA, each column is a latent factor (not a latent class); that is why we don’t want a large k.
Item Response Theory: how to measure real feelings through many categorical variables.
17
The advantages of K-Means over Sparse Coding
 Speed and scalability. Sparse coding requires solving a convex optimization problem for every 𝑠 (𝑖) repeatedly, which makes the learning procedure very expensive to deploy at a large scale.
Recall that $\lVert D^{(j)} \rVert_2 = 1$, so finding the best single-entry code for $x^{(i)}$ only requires the projections $D^{(j)\top} x^{(i)}$ and an argmax, i.e., it is fast to find which centroid is the nearest to $x^{(i)}$.
 This can be done very quickly, and solving for D given s is also easy, so we can train very large dictionaries rapidly by alternating optimization of D and s (see the sketch below).
 The second reason is that K-means does not have any parameters other than k.
18
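A hedged sketch of why the alternating optimization is cheap (function names and sizes are my own, not the paper's): the coding step for all m points is one matrix multiply plus an argmax, and the dictionary step is a least-squares solve followed by column re-normalization:

import numpy as np

def assign_codes(D, X):
    """Coding step for all m points at once: one matrix multiply plus an argmax.
    Assumes unit-norm columns of D; returns an (m, k) matrix S with at most
    one non-zero entry per row."""
    P = X @ D                                  # (m, k) projections D^(j)^T x^(i)
    j = np.abs(P).argmax(axis=1)               # best-matching centroid for each point
    S = np.zeros_like(P)
    rows = np.arange(len(X))
    S[rows, j] = P[rows, j]
    return S

def update_dictionary(X, S):
    """Dictionary step: least-squares solve for D given S, then re-normalize columns."""
    B, *_ = np.linalg.lstsq(S, X, rcond=None)  # solves S @ B ~= X, B is (k, n)
    D = B.T
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0                    # leave unused (all-zero) columns alone
    return D / norms

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
D = rng.normal(size=(16, 32)); D /= np.linalg.norm(D, axis=0)
for _ in range(10):                            # alternating optimization of D and s
    S = assign_codes(D, X)
    D = update_dictionary(X, S)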
Experiment: K-Means vs Sparse Coding
 16 × 16 pixel image data (i.e., 𝑥 (𝑖) ∈ R^256)
 Three tweaks of K-Means (a sketch of the first two follows after this list):
1) Normalize inputs
2) Whiten inputs
3) Use damped updates of the centroids
19
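A possible sketch of the first two tweaks, per-example normalization and ZCA-style whitening; the epsilon constants are placeholders, not the paper's values:

import numpy as np

def normalize_examples(X, eps=1e-5):
    """Per-example normalization: zero mean and (roughly) unit variance per patch."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt(X.var(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-2):
    """ZCA-style whitening: rotate to the eigenbasis, rescale, rotate back."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return Xc @ W

# e.g. X of shape (m, 256) for 16 x 16 patches; the third tweak (damped updates)
# would simply mix each new centroid with its previous value.
X = np.random.default_rng(0).normal(size=(1000, 256))
Xw = zca_whiten(normalize_examples(X))
print(Xw.shape)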
Results
 K-means performs well enough if the data dimensionality is not too high, but its ability to discover sparse directions in the data depends heavily on the dimensionality of the input and the quantity of data.
 In particular, as the dimensionality increases, we need to increase the data size to get clean results.
20
Connection with Deep Learning
 Sparse Coding vs Sparse Autoencoder
21
Shallow versus Deep Networks
Dense layer (fully connected layer): like a Cartesian join in a relational DB.
Each input node is connected to each output node (a minimal sketch follows below).
22
Picture from:
http://neuralnetworksanddeeplearning.com/chap5.html
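A minimal sketch of a dense layer as a single matrix multiply plus a bias (sizes chosen arbitrarily):

import numpy as np

def dense(X, W, b):
    """Fully connected layer: every input unit feeds every output unit."""
    return X @ W + b                     # (m, n_in) @ (n_in, n_out) + (n_out,)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))              # 4 examples, 3 input units
W = rng.normal(size=(3, 5)) * 0.1        # 3 x 5 = 15 weights, one per connection
b = np.zeros(5)
print(dense(X, W, b).shape)              # (4, 5)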
Non-Linearity between layers
Why do we need a nonlinear activation?
23
Minh Hoai Nguyen, Machine Learning Spring 16, Stony Brook University
Why We Need an Activation Function
Reason: a composition of linear functions is still a linear function; we want to introduce nonlinearity into the network (see the check below).
24
http://blog.csdn.net/hyman_yx/article/details/51789186
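A quick numeric check of that claim: two stacked linear layers with no activation in between collapse into one linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

two_layers = W2 @ (W1 @ x)                 # two linear layers, no activation in between
one_layer = (W2 @ W1) @ x                  # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True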
Some Examples of Activation Functions
 An activation function acts like a gate that optionally lets information through (minimal definitions are sketched below).
logistic (sigmoid) function
tanh function
ReLU function
25
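Minimal NumPy definitions of the three activation functions listed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                    # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)            # passes positive values, gates the rest to 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))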
Autoencoder
[Figure: autoencoder with an encoder, a bottleneck “code” layer, and a decoder.]
The autoencoder has the same problem as sparse coding (a toy sketch follows below).
http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation
http://stats.stackexchange.com/questions/118199/what-are-the-differences-between-sparsecoding-and-autoencoder
26
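A toy sketch of a one-hidden-layer autoencoder objective (sizes and the L1 sparsity term are my own simplifications; sparse autoencoders are often written with a KL-divergence penalty instead):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, lam=0.1):
    """Forward pass of a one-hidden-layer autoencoder plus a sparsity penalty
    on the code layer (an L1 term here, purely for brevity)."""
    H = sigmoid(X @ W1 + b1)             # encoder: the "code"
    Xhat = H @ W2 + b2                   # decoder: reconstruction of the input
    recon = np.mean(np.sum((X - Xhat) ** 2, axis=1))
    return recon + lam * np.mean(np.abs(H))

rng = np.random.default_rng(0)
n, k, m = 8, 16, 32                      # over-complete code layer, as in sparse coding
X = rng.normal(size=(m, n))
W1, b1 = rng.normal(size=(n, k)) * 0.1, np.zeros(k)
W2, b2 = rng.normal(size=(k, n)) * 0.1, np.zeros(n)
print(sparse_autoencoder_loss(X, W1, b1, W2, b2))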
27