linear-classifier - UCSB Computer Science


Machine Learning
CS 165B
Spring 2012
1
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
Midterm on Wednesday
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Bayesian Networks
• Clustering
• Computational learning theory
2
Midterm Wednesday May 2
• Topics: everything through today's lecture
• Content
– (40%) Short questions
– (20%) Concept learning and hypothesis spaces
– (20%) Decision trees
– (20%) Artificial Neural Networks
• Practice midterm will be posted today
• Can bring one regular 2-sided sheet & calculator
3
Background on Probability & Statistics
• Random variable, sample space, event (union, intersection)
• Probability distribution
– Discrete (pmf)
– Continuous (pdf)
– Cumulative (cdf)
• Conditional probability
– Bayes rule
– Example: toss 3 coins; let C be the count of heads and M = 1 iff all coins match. Compute P(C ≥ 2 | M = 0).
• Independence of random variables
– Are C and M independent? (See the enumeration sketch below.)
• Choose which of two envelopes contains the higher number
– You are allowed to peek at one of them
4
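The coin example above can be checked by brute-force enumeration. Below is a small Python sketch (my own illustration, not part of the original slides) that computes P(C ≥ 2 | M = 0) and tests whether C and M are independent.

```python
from itertools import product

outcomes = list(product([0, 1], repeat=3))        # 8 equally likely outcomes, 1 = heads
C = lambda o: sum(o)                              # count of heads
M = lambda o: 1 if len(set(o)) == 1 else 0        # 1 iff all three coins match

# P(C >= 2 | M = 0): condition on the outcomes where the coins do not all match.
m0 = [o for o in outcomes if M(o) == 0]
print(sum(C(o) >= 2 for o in m0) / len(m0))       # 3/6 = 0.5

# Independence check: C and M are independent iff P(C=c, M=m) = P(C=c) P(M=m) for all c, m.
n = len(outcomes)
independent = all(
    abs(sum(C(o) == c and M(o) == m for o in outcomes) / n
        - (sum(C(o) == c for o in outcomes) / n) * (sum(M(o) == m for o in outcomes) / n)) < 1e-12
    for c in range(4) for m in (0, 1))
print(independent)          # False: e.g. P(C=0, M=0) = 0 but P(C=0) * P(M=0) = 3/32
```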
Background on Probability & Statistics
• Common distributions
– Bernoulli
– Uniform
– Binomial
– Gaussian (Normal)
– Poisson
• Expected value, variance, standard deviation
5
Approaches to classification
• Discriminant functions:
– Learn the boundary between classes.
• Infer conditional class probabilities:
– Choose the most probable class
p(class = Ck | x)
What kind of classifier is logistic regression?
6
Discriminant Functions
• They can be arbitrary functions of x, such as:
– Nearest neighbor
– Decision tree
– Linear functions: g(x) = w^T x + b
– Nonlinear functions
• Sometimes, transform the data and then learn a linear function
7
High-dimensional data
Gene expression
Face images
Handwritten digits
8
Why feature reduction?
• Most machine learning and data mining techniques may not be
effective for high-dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as the dimension
increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of
disease may be small.
9
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or
3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
10
Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
11
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
• Supervised
– Linear Discriminant Analysis (LDA)
12
Principal Component Analysis (PCA)
• Summarization of data with many variables by a
smaller set of derived (synthetic, composite)
variables
• PCA based on SVD
– So, look at SVD first
13
Singular Value Decomposition (SVD)
• Intuition: find the axis that shows the
greatest variation, and project all points to
this axis
[Figure: data points in the (f1, f2) plane; e1 is the axis of greatest variation, e2 is orthogonal to it]
14
SVD: mathematical formulation
• Let A be an m x n real matrix of m n-dimensional points
• SVD decomposition: A = U Λ V^T
– U (m x m) is orthogonal: U^T U = I
– V (n x n) is orthogonal: V^T V = I
– Λ (m x n) has r positive non-zero singular values in descending order on its diagonal
• Columns of U are the orthogonal eigenvectors of AA^T (called the left singular vectors of A)
– AA^T = (U Λ V^T)(U Λ V^T)^T = U Λ Λ^T U^T = U Λ^2 U^T
• Columns of V are the orthogonal eigenvectors of A^T A (called the right singular vectors of A)
– A^T A = (U Λ V^T)^T (U Λ V^T) = V Λ^T Λ V^T = V Λ^2 V^T
• Λ contains the square roots of the eigenvalues of AA^T (or A^T A)
– These are called the singular values (positive real)
– r is the rank of A, AA^T, and A^T A
• U defines the column space of A, V the row space.
15
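As a quick sanity check of these properties, here is a short numpy sketch (an added illustration, not from the slides) that verifies the decomposition, the orthogonality of U and V, and the eigenvalue relationship on a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                        # m x n matrix of m points in n dims

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # A = U Lambda V^T
Lam = np.zeros_like(A)
Lam[:len(s), :len(s)] = np.diag(s)                 # Lambda is m x n with s on the diagonal

assert np.allclose(A, U @ Lam @ Vt)                        # decomposition holds
assert np.allclose(U.T @ U, np.eye(U.shape[0]))            # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))         # V is orthogonal

# Eigenvalues of A^T A are the squared singular values (same holds for A A^T).
assert np.allclose(np.linalg.eigvalsh(A.T @ A)[::-1], s**2)
```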
SVD - example
16
SVD - example
• A = U Λ V^T (only the r = 2 non-zero singular values and the corresponding singular vectors are shown)

A (7 x 5):
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U (7 x 2, left singular vectors u1, u2 as columns):
0.18 0
0.36 0
0.18 0
0.90 0
0    0.53
0    0.80
0    0.27

Λ (2 x 2, singular values on the diagonal):
9.64 0
0    5.29

V^T (2 x 5, right singular vectors v1, v2 as rows):
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71
17
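The decomposition above can be reproduced with numpy. The sketch below is an added illustration (signs of the singular vectors may be flipped, since they are only determined up to sign); it recovers the singular values 9.64 and 5.29 and the rank r = 2.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))        # [9.64 5.29 0. 0. 0.]  ->  rank r = 2
print(np.round(Vt[:2], 2))   # v1, v2 as rows (possibly with opposite signs)
```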
SVD - example
• A = U Λ V^T (the same decomposition as on the previous slide)
• The first singular value, 9.64, measures the variance (‘spread’) on the v1 axis
18
Dimensionality reduction
The example decomposition A = U Λ V^T from the previous slides, repeated as the starting point.
19
Dimensionality reduction
• Set the smallest singular values to zero:
A = U Λ V^T with Λ = diag(9.64, 5.29); the smaller singular value, 5.29, is the one to be zeroed.
20
Dimensionality reduction
A ≈ U Λ' V^T with Λ' = diag(9.64, 0): the smaller singular value has been set to zero.
21
Dimensionality reduction
Dropping the zeroed singular value, together with the corresponding column of U and row of V^T, leaves a rank-1 product:

A ≈ u1 λ1 v1^T, with u1 = (0.18, 0.36, 0.18, 0.90, 0, 0, 0)^T, λ1 = 9.64, v1^T = (0.58, 0.58, 0.58, 0, 0)
23
Dimensionality reduction
The resulting rank-1 approximation of A:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
24
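The rank-1 approximation on this slide can be reproduced by keeping only the largest singular value, as in the following added numpy sketch.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                            # keep only the largest singular value
A1 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A1, 2))
# The first four rows of A are reproduced exactly; the last three rows become all
# zeros, matching the approximate matrix on the slide.
```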
Dimensionality reduction
‘spectral decomposition’ of the matrix:
A = U Λ V^T (the example decomposition written out in full once more)
25
Dimensionality reduction
‘spectral decomposition’ of the matrix:
A = [u1 u2] x diag(λ1, λ2) x [v1^T ; v2^T]
26
Dimensionality reduction
‘spectral decomposition’ of the matrix:
A (m x n) = λ1 u1 v1^T + λ2 u2 v2^T + ...
27
Dimensionality reduction
‘spectral decomposition’ of the matrix:
A (m x n) = λ1 u1 v1^T + λ2 u2 v2^T + ...   (r terms)
Each ui is m x 1 and each vi^T is 1 x n.
28
Dimensionality reduction
approximation / dim. reduction: keep only the first few terms (how many?)

A (m x n) ≈ λ1 u1 v1^T + λ2 u2 v2^T + ...,   assuming λ1 ≥ λ2 ≥ ...
29
Dimensionality reduction
A heuristic: keep 80-90% of the ‘energy’ (= the sum of squares of the λi’s)

A (m x n) ≈ λ1 u1 v1^T + λ2 u2 v2^T + ...,   assuming λ1 ≥ λ2 ≥ ...
30
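A small added sketch of this energy heuristic: pick the smallest k whose leading singular values retain the requested fraction of Σ λi².

```python
import numpy as np

def choose_k(singular_values, energy_fraction=0.9):
    """Smallest k whose leading singular values keep the requested energy."""
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, energy_fraction) + 1)

s = np.array([9.64, 5.29, 0.0, 0.0, 0.0])   # singular values from the example
print(choose_k(s, 0.9))                      # 2 (9.64^2 alone is only ~77% of the energy)
```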
Dimensionality reduction
• Matrix V in the SVD decomposition (A = UΛV^T) is used to transform the data.
• AV (= UΛ) defines the transformed dataset.
• For a new data element x, xV defines the transformed data.
• Keeping the first k (k < n) dimensions amounts to keeping only the first k columns of V.
31
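An added numpy sketch of this transformation: AV equals UΛ, and keeping the first k columns of V gives the reduced representation. The new point x below is hypothetical.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

assert np.allclose(A @ V, U @ np.diag(s))       # AV = U Lambda, the transformed dataset

k = 2
A_reduced = A @ V[:, :k]                        # each row is now a k-dimensional point
x_new = np.array([1.0, 1.0, 1.0, 0.0, 0.0])     # a new data element (hypothetical)
x_reduced = x_new @ V[:, :k]                    # transformed the same way
```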
Optimality of SVD
• Let A = U Λ V^T = Σi λi ui vi^T
• The Frobenius norm of an m x n matrix M is ||M||F = sqrt(Σi,j M[i,j]^2), which equals sqrt(Σi λi^2), the root sum of squares of its singular values
• Let Ak be the above summation truncated to the k largest singular values.
• Theorem [Eckart and Young]: among all m x n matrices B of rank at most k,
||A − Ak||F ≤ ||A − B||F   and   ||A − Ak||2 ≤ ||A − B||2
• “Residual” variation is information in A that is not retained. Balancing act between
– clarity of representation, ease of understanding
– oversimplification: loss of important or relevant information.
32
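As an added illustration of the Eckart-Young property (not part of the slides), the truncated SVD Ak of a random matrix is compared against some other rank-k matrix B.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(7, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation

# Some other rank-k matrix B, built from random factors.
B = rng.normal(size=(7, k)) @ rng.normal(size=(k, 5))

# The truncated SVD never has larger Frobenius error than B.
assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro')
```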
Principal Components Analysis (PCA)
• Translate the dataset to the center by subtracting the means; let matrix A be the result.
• Compute the covariance matrix A^T A.
• Project the dataset along a subset of the eigenvectors of A^T A.
• Matrix V in the SVD decomposition of A contains these eigenvectors.
• Also known as the Karhunen-Loève (K-L) transform.
33
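Putting the steps above together, here is a minimal added PCA sketch: center the data, take the SVD of the centered matrix, and project onto the top-k right singular vectors.

```python
import numpy as np

def pca(X, k):
    """Return the k-dimensional PCA projection of the rows of X and the components."""
    A = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    components = Vt[:k]                         # top-k eigenvectors of A^T A
    return A @ components.T, components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Z, components = pca(X, 2)                       # Z: 100 points in 2 dimensions
```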
Principal Component Analysis (PCA)
• Takes a data matrix of m objects by n variables, which
may be correlated, and summarizes it by uncorrelated
axes (principal components or principal axes) that are
linear combinations of the original n variables
• The first k components display as much as possible of
the variation among objects.
34
2D Example of PCA
[Scatter plot of Variable X2 against Variable X1; the centroid (marked +) is at X̄1 = 8.35, X̄2 = 4.91; the variances are V1 = 6.67 and V2 = 6.24, and the covariance is C1,2 = 3.42]
35
Configuration is Centered
[The same scatter plot after centering: Variable X1 and Variable X2 now have zero means]
• each variable is adjusted to a mean of zero (by
subtracting the mean from each value).
36
Principal Components are Computed
[The centered data re-plotted against the principal component axes PC 1 and PC 2]
• PC 1 has the highest possible variance (9.88)
• PC 2 has a variance of 3.03
• PC 1 and PC 2 have zero covariance.
37
[The centered scatter plot of Variable X1 and Variable X2 with the principal axes PC 1 and PC 2 overlaid]
• Each principal axis is a linear combination of the original two
variables
38
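The following added sketch mimics this 2D example on synthetic data drawn with roughly the same means, variances, and covariance as the slides (the actual dataset is not given), so the PC variances come out near 9.88 and 3.03 only up to sampling noise, and the PCs are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[6.67, 3.42],
                [3.42, 6.24]])                   # variances/covariance like the example
X = rng.multivariate_normal(mean=[8.35, 4.91], cov=cov, size=500)

A = X - X.mean(axis=0)                           # center the configuration
eigvals, eigvecs = np.linalg.eigh(np.cov(A, rowvar=False))
PC = A @ eigvecs[:, ::-1]                        # project, largest-variance axis first

print(np.round(np.var(PC, axis=0, ddof=1), 2))   # PC variances (eigenvalues, ~9.88 and ~3.03)
print(np.round(np.cov(PC, rowvar=False), 2))     # off-diagonal ~0: the PCs are uncorrelated
```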
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
• Supervised
– Linear Discriminant Analysis (LDA)
39
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Bayesian Networks
• Clustering
• Computational learning theory
40
Midterm analysis
• Grade distribution
• Solution to ANN problem
• Makeup problem on Wednesday
– 20 minutes
– 15 points
– Bring a calculator
41
Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the
data down to 1-D.
– So choose the projection that gives the best separation of
the classes. What do we mean by “best separation”?
• An obvious direction to choose is the direction of the line
joining the class means.
– But if the main direction of variance in each class is not
orthogonal to this line, this will not give good separation
(see the next figure).
• Fisher’s method chooses the direction that maximizes the ratio
of between class variance to within class variance.
– This is the direction in which the projected points contain
the most information about class membership (under
Gaussian assumptions)
42
Fisher’s linear discriminant
When projected onto the
line joining the class means,
the classes are not well
separated.
Fisher chooses a direction that
makes the projected classes much
tighter, even though their projected
means are less far apart.
43
Fisher’s linear discriminant (derivation)
Find the best direction w for accurate classification.
A measure of the separation between the projected points is the difference of the sample means.
If mi is the d-dimensional sample mean from Di, given by mi = (1/ni) Σ_{x in Di} x,
and m̃i is the sample mean of the projected points Yi, given by m̃i = (1/ni) Σ_{y in Yi} y = w^T mi,
then the difference of the projected sample means is |m̃1 − m̃2| = |w^T (m1 − m2)|.
44
Fisher’s linear discriminant (derivation)
Define the scatter for the projection: s̃i^2 = Σ_{y in Yi} (y − m̃i)^2.
Choose w in order to maximize J(w) = |m̃1 − m̃2|^2 / (s̃1^2 + s̃2^2);
the denominator s̃1^2 + s̃2^2 is called the total within-class scatter.
Define scatter matrices Si (i = 1, 2) and SW by Si = Σ_{x in Di} (x − mi)(x − mi)^T and SW = S1 + S2.
45
Fisher’s linear discriminant (derivation)
We obtain
s̃i^2 = Σ_{x in Di} (w^T x − w^T mi)^2 = Σ_{x in Di} w^T (x − mi)(x − mi)^T w = w^T Si w,
so that s̃1^2 + s̃2^2 = w^T SW w. Similarly, (m̃1 − m̃2)^2 = (w^T m1 − w^T m2)^2 = w^T SB w,
46
Fisher’s linear discriminant (derivation)
where SB = (m1 − m2)(m1 − m2)^T is the between-class scatter matrix.
In terms of SB and SW, J(w) can be written as
J(w) = (w^T SB w) / (w^T SW w).
Note that SB and SW are symmetric.
47
Fisher’s linear discriminant (derivation)
To find the w that maximizes J(w), find J′(w) and set it to zero.
NOTE: both w^T SW w and w^T SB w are scalars, K1 and K2.
J′(w) = [2 SB w (w^T SW w) − 2 SW w (w^T SB w)] / (w^T SW w)^2
Since we only want to find the direction of w, the magnitude is not important:
2 SB w (w^T SW w) − 2 SW w (w^T SB w) = 0
SB w · K1 = SW w · K2
SB w = λ SW w
48
Fisher’s linear discriminant (derivation)
A vector w that maximizes J(w) must satisfy SB w = λ SW w (a generalized eigenvalue problem).
In the case that SW is nonsingular, SW^−1 SB w = λ w.
Since SB = (m1 − m2)(m1 − m2)^T, SB w = (m1 − m2)(m1 − m2)^T w.
Since (m1 − m2)^T w is a scalar, SB w = c (m1 − m2) for some scalar c.
Therefore, the direction of w is given by SW^−1 (m1 − m2).
49
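An added sketch of this result on synthetic two-class data (the data and parameters are hypothetical): compute the class means and within-class scatter, then take w proportional to SW^−1 (m1 − m2).

```python
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=100)   # class 1 samples
D2 = rng.multivariate_normal([4, 1], [[3, 2], [2, 3]], size=100)   # class 2 samples

m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
S1 = (D1 - m1).T @ (D1 - m1)                 # within-class scatter of class 1
S2 = (D2 - m2).T @ (D2 - m2)                 # within-class scatter of class 2
Sw = S1 + S2

w = np.linalg.solve(Sw, m1 - m2)             # Fisher direction S_W^{-1} (m1 - m2)
w /= np.linalg.norm(w)

# Project each class onto w; the projected means should be well separated
# relative to the spread of the projections.
y1, y2 = D1 @ w, D2 @ w
print(round(abs(y1.mean() - y2.mean()), 2), round(y1.std() + y2.std(), 2))
```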
Linear discriminant
g(x) = w^T x + b = Σ_{j=1}^{d} w_j x_j + b
• Advantages:
– Simple: O(d) space/computation
– Knowledge extraction: weighted sum of attributes;
positive/negative weights, magnitudes (credit scoring)
50
Non-linear models
• Quadratic discriminant: g(x) = x^T W x + w^T x + b
• Higher-order (product) terms:
z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, z5 = x1 x2
• Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
g(x) = Σ_{j=1}^{k} w_j φ_j(x)
51
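A short added sketch of the z-space mapping above, with hypothetical weights: a linear discriminant in z corresponds to a quadratic discriminant in the original x.

```python
import numpy as np

def phi(x):
    """Map a 2-D point x to the 5-D z-space of the slide."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# With (hypothetical) weights w and bias b learned in z-space,
# g(x) = w^T phi(x) + b is a nonlinear (quadratic) function of the original x.
w = np.array([1.0, -2.0, 0.5, 0.5, -1.0])
b = -1.0
g = lambda x: w @ phi(x) + b
print(g(np.array([1.0, 2.0])))               # -3.5
```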
Linear model: two classes
C1 if g x   0
choose 
C2 otherwise
52
Geometry of classification
g(x) = w^T x + b = Σ_{j=1}^{d} w_j x_j + b   (with w0 = b)
• w is orthogonal to the decision surface.
• D = distance of the decision surface from the origin.
Consider any point x on the decision surface; then w^T x + b = 0, so
D = w^T x / ||w|| = −b / ||w||
• d(x) = distance of x from the decision surface.
Write x = xp + d(x) w/||w||, where xp is the projection of x onto the surface. Then
w^T x + b = w^T xp + d(x) w^T w/||w|| + b
g(x) = (w^T xp + b) + d(x) ||w|| = d(x) ||w||
d(x) = g(x) / ||w|| = w^T x / ||w|| − D
53
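A small added sketch of this geometry, with hypothetical w and b: the signed distance of x from the decision surface is g(x)/||w||, and the surface lies at (signed) distance −b/||w|| from the origin.

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
g = lambda x: w @ x + b

norm_w = np.linalg.norm(w)                 # ||w|| = 5
D = -b / norm_w                            # distance of the surface from the origin = 1
x = np.array([2.0, 2.0])
d_x = g(x) / norm_w                        # signed distance of x from the surface

# Check: moving x back by d(x) along w/||w|| lands on the surface (g ~ 0).
x_p = x - d_x * w / norm_w
assert abs(g(x_p)) < 1e-12
print(D, d_x)                              # 1.0 and (3*2 + 4*2 - 5)/5 = 1.8
```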