ICS 278: Data Mining
Lecture 5: Low-Dimensional Representations of
High-Dimensional Data
Today’s lecture
• Extend project proposal deadline to Monday, 8am: questions?
• Outline of today’s lecture:
  – “Orphan” slides from earlier lectures
  – Dimension reduction methods
    • Motivation
    • Variable selection methods
    • Linear projection techniques
    • Non-linear embedding methods
Notation Reminder
• n objects, each with p measurements
• Data vector for the ith object:
  $x(i) = (x_1(i), x_2(i), \ldots, x_p(i))$
• Data matrix
  – $x_j(i)$ is the entry in the ith row, jth column
  – columns -> variables
  – rows -> data points
• Can define distances/similarities
  – between rows (data vectors i)
  – between columns (variables j)
Distance
• n objects with p measurements:
  $x = (x_1, x_2, \ldots, x_p)$
  $y = (y_1, y_2, \ldots, y_p)$
• Most common distance metric is Euclidean distance:
  $d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}$
• Makes sense in the case where the different measurements are commensurate; each variable is measured in the same units.
• If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
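To make the formula concrete, here is a minimal Python/numpy sketch (not part of the original slides):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two p-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# e.g., euclidean([0.0, 0.0], [3.0, 4.0]) -> 5.0
```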
Dependence among Variables
• Covariance and correlation measure linear dependence
(distance between variables, not objects)
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:
  $\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)\left( y(i) - \bar{y} \right)$
• The covariance is a measure of how X and Y vary together.
  – it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
Correlation coefficient
• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Linear correlation coefficient is defined as:
$\rho(X, Y) = \frac{\sum_{i=1}^{n} \left( x(i) - \bar{x} \right)\left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)^2 \; \sum_{i=1}^{n} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$
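A minimal numpy sketch of the covariance and correlation formulas above (not from the slides; note that numpy’s built-in np.cov and np.corrcoef default to an n−1 denominator, while the covariance formula above uses 1/n):

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the 1/n convention used above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Linear (Pearson) correlation coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
```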
Sample Correlation Matrix
[Figure: sample correlation matrix (color scale from -1 to +1) for data on characteristics of Boston suburbs; variables include business acreage, nitrous oxide, average # rooms, median house value, and percentage of large residential lots.]
Mahalanobis distance (between objects)
$d_{MH}(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}$
  – (x − y) is the vector difference in p-dimensional space
  – $\Sigma^{-1}$ is the inverse covariance matrix
  – the result evaluates to a scalar distance
Benefits:
1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features
Cost:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features.
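A small numpy sketch of the Mahalanobis distance; here the covariance matrix is assumed to be estimated from a data matrix X with rows = objects, and a linear solve is used instead of an explicit matrix inverse:

```python
import numpy as np

def mahalanobis(x, y, X):
    """Mahalanobis distance between x and y, with covariance estimated from X (n x p)."""
    Sigma = np.cov(X, rowvar=False)                  # p x p sample covariance
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))
```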
Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic
-> all dimensions have equal variance
-> MH distance reduces to Euclidean distance
Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic
-> dimensions do not have equal variance
-> MH distance reduces to weighted Euclidean distance with weights = inverse variance
Example 2 of Mahalanobis distance
The two outer blue points will have the same MH distance to the center blue point.
Distances between Binary Vectors
• Matching coefficient
  Counts over the p variables, comparing item i with item j:
             j=1    j=0
    i=1      n11    n10
    i=0      n01    n00
  (e.g., n01 = number of variables where item j = 1 and item i = 0)
  matching coefficient = (n11 + n00) / (n11 + n10 + n01 + n00)
• Jaccard coefficient (e.g., for sparse vectors, non-symmetric)
  Jaccard coefficient = n11 / (n11 + n10 + n01)
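A short sketch computing both coefficients from the counts defined above (numpy assumed):

```python
import numpy as np

def binary_similarities(xi, xj):
    """Matching and Jaccard coefficients for two binary vectors."""
    xi, xj = np.asarray(xi, dtype=bool), np.asarray(xj, dtype=bool)
    n11 = int(np.sum(xi & xj))       # both 1
    n10 = int(np.sum(xi & ~xj))      # i = 1, j = 0
    n01 = int(np.sum(~xi & xj))      # i = 0, j = 1
    n00 = int(np.sum(~xi & ~xj))     # both 0
    matching = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) > 0 else 0.0
    return matching, jaccard
```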
Other distance metrics
• Categorical variables
  – Number of matches divided by number of dimensions
• Distances between strings of different lengths
  – e.g., “Patrick J. Smyth” and “Padhraic Smyth”
  – Edit distance (a sketch follows below)
• Distances between images and waveforms
  – Shift-invariant, scale-invariant
  – i.e., d(x, y) = min_{a,b} d( (ax + b), y )
• More generally, kernel methods
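For the string-distance bullet above, a standard dynamic-programming (Levenshtein) edit distance sketch in plain Python (illustrative, not from the slides):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[m][n]

# e.g., edit_distance("Patrick J. Smyth", "Padhraic Smyth")
```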
Transforming Data
• Duality between form of the data and the model
– Useful to bring data onto a “natural scale”
– Some variables are very skewed, e.g., income
• Common transforms: square root, reciprocal, logarithm, raising to a
power
– Often very useful when dealing with skewed real-world data
• Logit: transforms values in (0, 1) to the real line:
  $\mathrm{logit}(p) = \log \frac{p}{1 - p}$
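A tiny sketch of the logit and its inverse (numpy assumed):

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))       # maps (0, 1) to the real line

def inverse_logit(z):
    return 1.0 / (1.0 + np.exp(-z))    # maps the real line back to (0, 1)
```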
Data Quality
• Individual measurements
  – Random noise in individual measurements
    • Variance (precision)
    • Bias
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don’t want to answer
      – Patients who are too ill for a certain test
Data Quality
• Collections of measurements
  – Ideal case = random sample from population of interest
  – Real case = often a biased sample of some sort
  – Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
• Examples of non-randomly sampled data
  – Medical study where subjects are all students
  – Geographic dependencies
  – Temporal dependencies
  – Stratified samples
    • E.g., 50% healthy, 50% ill
  – Hidden systematic effects
    • E.g., market basket data the weekend of a large sale in the store
    • E.g., Web log data during finals week
Classifier technology and the illusion of progress
(abstract for a workshop on the State-of-the-Art in Supervised Classification, May 2006)
Professor David J. Hand, Imperial College, London
Supervised classification methods are widely used in data mining. Highly sophisticated
methods have been developed, using the full power of recent advances in computation.
However, these advances have largely taken place within the context of a classical
paradigm, in which construction of the classification rule is based on a ‘design sample’
of data randomly sampled from unknown but well defined distributions of the classes. In
this paper, I argue that this paradigm fails to take account of other sources of
uncertainty in the classification problem, and that these other sources lead to
uncertainties which often swamp those arising from the classical ones of estimation and
prediction. Several examples of such sources are given, including imprecision in the
definitions of the classes, sample selectivity bias, population drift, and use of
inappropriate optimisation criteria when fitting the model. Furthermore, it is argued,
there are both theoretical arguments and practical evidence supporting the assertion
that the marginal gains of increasing classifier complexity can often be minimal. In brief,
the advances in classification technology are typically much less than is often claimed.
Dimension Reduction Methods
(reading: 3.6 to 3.8 in the text)
Dimension Reduction methods
• Dimension reduction
  – From p-dimensional x to d-dimensional x’, d < p
• Techniques
  – Variable selection:
    • use an algorithm to find individual variables in x that are relevant to the problem and discard the rest
    • e.g., stepwise logistic regression
  – Linear projections
    • Project data to a lower-dimensional space
    • E.g., principal components
  – Non-linear embedding
    • Use a non-linear mapping to “embed” data in a lower-dimensional space
    • E.g., multidimensional scaling
Dimension Reduction: why is it useful?
• In general, dimension reduction incurs a loss of information about x
  – So why do this?
• If dimensionality p is very large (e.g., 1000’s), representing the data in a lower-dimensional space may make learning more reliable
  – e.g., clustering example
    • 100-dimensional data
    • but cluster structure is only present in 2 of the dimensions, the others are just noise
    • if the other 98 dimensions are just noise (relative to cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
• Dimension reduction can also provide interpretation/insight
  – e.g., for 2d visualization purposes
• Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal
Variable Selection Methods
• p variables; we would like to use a smaller subset in our model
  – e.g., in classification, do kNN in d-space rather than p-space
  – e.g., for logistic regression, use d inputs rather than p
• Problem:
  – Number of subsets of p variables is O(2^p)
  – Exhaustive search is impossible except for very small p
  – Typically the search problem is NP-hard
• Common solution:
  – Local systematic search (e.g., add/delete variables one at a time) to locally maximize a score function (i.e., hill-climbing); a sketch of greedy forward selection follows below
  – e.g., add a variable, build new model, generate new score, etc.
  – Can often work well, but can get trapped in local maxima/minima
  – Can also be computationally intensive (depends on model)
• Note: some techniques such as decision tree predictors automatically perform dimension reduction as part of the learning algorithm.
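A hedged sketch of the local search described above, as greedy forward selection; `fit_and_score` is a hypothetical user-supplied function (e.g., cross-validated accuracy of a model built on the selected columns), not something defined in the lecture. X is assumed to be an n x p numpy array:

```python
def forward_selection(X, y, fit_and_score, max_vars=None):
    """Greedy hill-climbing: add one variable at a time while the score improves."""
    p = X.shape[1]
    selected, best_score = [], float("-inf")
    while len(selected) < (max_vars or p):
        candidates = [j for j in range(p) if j not in selected]
        # score every one-variable extension of the current subset
        scored = [(fit_and_score(X[:, selected + [j]], y), j) for j in candidates]
        score, j = max(scored)
        if score <= best_score:       # no local improvement -> stop (local maximum)
            break
        selected.append(j)
        best_score = score
    return selected, best_score
```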
Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume $a^T a = 1$ (i.e., unit norm)
$a^T x = \sum_j a_j x_j$
  = projection of x onto vector a; gives the distance of the projected x along a
e.g.,
  $a^T$ = [1 0]         -> projection along 1st dimension
  $a^T$ = [0 1]         -> projection along 2nd dimension
  $a^T$ = [0.71 0.71]   -> projection along the diagonal
Example of projection from 2d to 1d
[Figure: 2d scatterplot with axes x1 and x2, showing the direction of the weight vector a onto which the data are projected.]
Projections to more than 1 dimension
• Multidimensional projections:
  – e.g., x is 4-dimensional
  – $a_1^T$ = [ 0.71  0.71  0     0    ]
  – $a_2^T$ = [ 0     0     0.71  0.71 ]
• $A^T x$ -> coordinates of x in the 2d space spanned by the columns of A, where A = [ a1  a2 ]
  -> a linear transformation from 4d to 2d space
Principal Components Analysis (PCA)
X = p x n data matrix: columns = p-dim data vectors
Let a = weight vector, also dimension p
Assume aT a = 1 (i.e., unit norm)
aT X = projection of each column x onto vector a,
= vector of distances of projected x vectors along a
PCA: find vector a such that var(aT X ) is maximized
i.e., find linear projection with maximal variance
More generally:
ATX = d x n data matrix with x vectors projected to d-dimensional space,
where size(A) = p x d
PCA: find d orthogonal columns of A such that variance in the
d-dimensional projected space is maximized, d < p
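A minimal numpy sketch of PCA via the singular value decomposition; note the assumption here that the data matrix has rows = data vectors (the transpose of the p x n convention above):

```python
import numpy as np

def pca(X, d):
    """X: (n, p) data matrix with rows = data vectors. Returns scores, components, variances."""
    Xc = X - X.mean(axis=0)                    # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:d]                        # d orthonormal projection directions (d, p)
    scores = Xc @ components.T                 # coordinates in the d-dim projected space
    explained_var = (S ** 2) / (X.shape[0] - 1)
    return scores, components, explained_var[:d]
```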
PCA Example
[Figure: 2d scatterplot (axes x1, x2) showing the direction of the 1st principal component vector, the highest-variance projection.]
PCA Example
[Figure: the same scatterplot showing both the 1st principal component direction (highest-variance projection) and the orthogonal 2nd principal component direction.]
How do we compute the principal components?
• See class notes
• See also page 78 in the text
Notes on PCA
• Basis representation and approximation error
• Scree diagrams
• Computational complexity of computing PCA
– Equivalent to solving set of linear equations, matrix inversion,
singular value decomposition, etc
• Scales in general as O(np^2 + p^3)
• Many numerical tricks possible, e.g., for sparse X matrices, for
finding only the first k eigenvectors, etc
• In MATLAB can use eig.m or svd.m (also note sparse versions)
More notes on PCA
• Links between PCA and multivariate Gaussian density
• Caveat: PCA can destroy information about differences between
groups for clustering or classification
• PCA for other data types
– Images, e.g., eigenfaces
– Text, e.g., “latent semantic indexing” (LSI)
Basis images (eigenimages) of faces
20 face images
First 16 Eigenimages
First 4 eigenimages
Reconstruction of first image with 8 eigenimages
[Figure: reconstructed image shown alongside the original image.]
Reconstruction of first image with 8 eigenimages
Weights = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6
Reconstructed image = weighted sum of 8 images on left
Reconstruction of 7th image with eigenimages
[Figure: reconstructed image shown alongside the original image.]
Reconstruction of 7th image with 8 eigenimages
Weights = -13.7 12.9 1.6 4.4 3.0 0.9 1.6 -6.3
Weights for Image 1 = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6
Reconstructed image = weighted sum of 8 images on left
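A sketch of the reconstruction being illustrated: project a face onto the first k eigenimages and form the weighted sum. Here `mean_face` and `eigenimages` are hypothetical arrays assumed to come from a PCA of the face images (numpy assumed):

```python
import numpy as np

def reconstruct(face, mean_face, eigenimages, k=8):
    """face, mean_face: flattened (h*w,) image vectors;
    eigenimages: (m, h*w) array with orthonormal rows (m >= k)."""
    weights = eigenimages[:k] @ (face - mean_face)   # projection weights onto the first k basis images
    return mean_face + weights @ eigenimages[:k]     # weighted sum of the k eigenimages
```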
Reconstructing Image 6 with 16 eigenimages
Multidimensional Scaling (MDS)
• Say we have data in the form of an N x N matrix of dissimilarities
  – 0’s on the diagonal
  – Symmetric
  – Could either be given data in this form, or create such a dissimilarity matrix from our data vectors
• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences
• MDS:
  – Find k-dimensional coordinates for each of the N objects such that Euclidean distances in the “embedded” space match the set of dissimilarities as closely as possible
Multidimensional Scaling (MDS)
• MDS score function (“stress”):
  $S = \frac{\sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2}{\sum_{i,j} d(i,j)^2}$
  where $\delta(i,j)$ are the original dissimilarities and $d(i,j)$ is the Euclidean distance in the “embedded” k-dimensional space
• N points embedded in k dimensions -> Nk locations or parameters
  – To find the Nk locations: solve an optimization problem -> minimize the stress function S
• Often used for visualization, e.g., k = 2, 3
MDS Optimization
$S = \frac{\sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2}{\sum_{i,j} d(i,j)^2}$
• Optimization problem:
  – S is a function of the Nk parameters
    • find the set of N k-dimensional positions that minimize S
    • Note: 3 parameters are redundant: location (2) and rotation (1)
• If the original dissimilarities are Euclidean
  – -> linear algebra solution (equivalent to principal components)
• Non-Euclidean (more typical)
  – Local iterative hill-climbing, e.g., move each point to decrease S, repeat (a gradient-based sketch follows below)
  – Non-trivial optimization, can have local minima, etc.
  – Initialization: either random or heuristic (e.g., by PCA)
  – Complexity is O(N^2 k) per iteration (iteration = move all points locally)
  – See Faloutsos and Lin (1995) for FastMap: an O(Nk) approximation for large N
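A hedged sketch of the iterative approach referenced above: plain gradient descent on the unnormalized stress Σ_{i,j} (d(i,j) − δ(i,j))², moving all points each iteration; the step size and iteration count are illustrative only:

```python
import numpy as np

def mds_gradient(delta, k=2, n_iter=500, lr=0.01, seed=0):
    """delta: (N, N) symmetric dissimilarity matrix with zero diagonal.
    Returns an (N, k) embedding found by gradient descent on raw stress."""
    N = delta.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(N, k))        # random initialization (PCA is another option)
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]      # pairwise differences (N, N, k)
        d = np.linalg.norm(diff, axis=-1)         # embedded distances
        np.fill_diagonal(d, 1.0)                  # avoid divide-by-zero on the diagonal
        grad_coef = (d - delta) / d
        np.fill_diagonal(grad_coef, 0.0)
        grad = 4 * np.sum(grad_coef[:, :, None] * diff, axis=1)   # dS_raw/dY
        Y -= lr * grad                            # move all points downhill
    return Y
```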
MDS example: input distance data
            Chicago  Raleigh  Boston  Seattle   S.F.  Austin  Orlando
Chicago           0
Raleigh         641        0
Boston          851      608       0
Seattle        1733     2363    2488        0
S.F.           1855     2406    2696      684      0
Austin          972     1167    1691     1764   1495       0
Orlando         994      520    1105     2565   2458    1015        0
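Since these particular dissimilarities are approximately Euclidean, the linear-algebra route mentioned above applies directly. A sketch of classical (Torgerson) MDS on this table (numpy assumed); the resulting 2d coordinates recover the cities’ relative geography up to rotation and reflection:

```python
import numpy as np

cities = ["Chicago", "Raleigh", "Boston", "Seattle", "S.F.", "Austin", "Orlando"]
D = np.array([
    [   0,  641,  851, 1733, 1855,  972,  994],
    [ 641,    0,  608, 2363, 2406, 1167,  520],
    [ 851,  608,    0, 2488, 2696, 1691, 1105],
    [1733, 2363, 2488,    0,  684, 1764, 2565],
    [1855, 2406, 2696,  684,    0, 1495, 2458],
    [ 972, 1167, 1691, 1764, 1495,    0, 1015],
    [ 994,  520, 1105, 2565, 2458, 1015,    0],
], dtype=float)

N = D.shape[0]
J = np.eye(N) - np.ones((N, N)) / N           # centering matrix
B = -0.5 * J @ (D ** 2) @ J                   # double-centered squared distances
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]            # two largest eigenvalues
coords = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
for name, (x, y) in zip(cities, coords):
    print(f"{name:8s} {x:8.1f} {y:8.1f}")
```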
Result of MDS
MDS for protein sequences
[Figure: MDS similarity matrix (note cluster structure) and the corresponding 2d MDS embedding for 226 protein sequences of the Globin family (from Klock & Buhmann, 1997).]
MDS from human judgements of similarity
MDS: Example data
MDS: 2d embedding of face images
Other embedding techniques
• Many other algorithms for non-linear embedding
• Some of the better-known examples:
  – Self-organizing maps (Kohonen)
    • Neural-network-inspired algorithm for 2d embedding
  – ISOMAP, Local linear embedding
    • Find low-dimensional coordinates that preserve local distances
    • Ignore global distances
Example of Local Linear Embedding (LLE)
(Roweis and Saul, Science, 2000)
Note how points that are far away on the 3d manifold (e.g., red and blue) in “manifold distance” would be mapped as being close together by MDS or PCA, but are kept “far apart” by LLE. LLE emphasizes local relationships.
LLE Algorithm
• N points in dimension p; wish to reduce to dimension d, d < p
• Step 1:
  – Select the K nearest neighbors of each point in the training data
  – Represent each point X as a weighted linear combination of its K neighbor points
  – Find the best K weights for each of the X vectors (least-squares fitting)
• Step 2:
  – Fix the weights from Step 1
  – For each p-dim vector X, find a d-dimensional vector Y that is closest to its reconstructed approximation based on its d-dim neighbors and weights
  – Reduces to another linear algebra/eigenvalue problem, O(N^3) complexity (a minimal sketch of both steps follows below)
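A minimal numpy sketch of the two steps above (assumptions: dense arrays, modest N, brute-force neighbor search; the regularization constant is illustrative and not from the lecture):

```python
import numpy as np

def lle(X, n_neighbors=10, d=2, reg=1e-3):
    """X: (N, p) data matrix. Returns an (N, d) embedding."""
    N = X.shape[0]
    # Step 1a: find the K nearest neighbors of each point (brute force).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]

    # Step 1b: solve for reconstruction weights by least squares.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                       # centered neighbor matrix (K, p)
        G = Z @ Z.T                                      # local Gram matrix (K, K)
        G += reg * np.trace(G) * np.eye(n_neighbors)     # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()                 # weights constrained to sum to 1

    # Step 2: fix the weights, find low-dim coordinates Y minimizing
    # sum_i || y_i - sum_j W_ij y_j ||^2  ->  bottom eigenvectors of M = (I-W)^T (I-W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]                           # discard the constant eigenvector
```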
LLE applied to text data
LLE applied to a set of face images
Local Linear Embedding example
ISOMAP
Tenenbaum, de Silva, Langford, Science, 2000
Similar to LLE in concept: preserves local distances
Computational strategy is different: measures distance in
original space via geodesic paths (distance “on manifold”)
Algorithm involves finding shortest paths between points and then embedding
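A hedged sketch of this strategy: build a K-nearest-neighbor graph, compute geodesic (shortest-path) distances with scipy, then embed them with classical MDS. The neighborhood graph is assumed to be connected:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, d=2):
    """X: (N, p) data matrix. Returns an (N, d) embedding."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # build the K-nearest-neighbor graph (edge weights = original distances)
    G = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(N):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.maximum(G, G.T)                                 # symmetrize the graph
    geo = shortest_path(csr_matrix(G), directed=False)     # geodesic (shortest-path) distances
    # classical MDS on the geodesic distance matrix
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (geo ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```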
Examples of ISOMAP embeddings in 2d
ISOMAP: morphing examples
Summary on Dimension Reduction
• Can be used for defining a new set of (lower-dimensional) variables for modeling, or for visualization/insight
• 3 general approaches:
  – Variable selection (only select “relevant” variables)
  – Linear projections (e.g., PCA)
  – Non-linear projections (e.g., MDS, LLE, ISOMAP)
• Can be used with text, images, etc. by representing such data as very high-dimensional vectors
• MATLAB implementations of all of these techniques are available on the Web
• These techniques can be useful, but like any high-powered tool they are not a solution to everything
  – real-world data sets often do not produce the types of elegant 2d embeddings that one often sees in research papers on these topics