lecture 3 - DKE Personal & Projects Websites
DATA MINING
from data to information
Ronald Westra
Dep. Mathematics
Knowledge Engineering
Maastricht University
PART 2
Exploratory Data
Analysis
VISUALISING AND
EXPLORING DATA-SPACE
Data Mining Lecture II
[Chapter 3 from Principles of Data Mining by Hand, Mannila, Smyth]
LECTURE 3: Visualising and Exploring Data-Space
Readings:
• Chapter 3 from Principles of Data Mining by Hand, Mannila, Smyth.
3.1 Obtaining insight into the structure of the data space
1. How are the data distributed over the space?
2. Are there separate and disconnected parts?
3. Is there a model?
4. data-driven hypothesis testing
5. Starting point: use the strong perceptual powers of humans
3.2 Tools to represent a variable
1. mean, variance, standard deviation, skewness
2. plot
3. moving-average plot
4. histogram, kernel histogram
Box plots
Overprinting
Contour plot
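The moving-average plot in item 3 is just a sliding window of means; a minimal sketch (window size and toy data are assumed for illustration):

```python
def moving_average(xs, w):
    """Simple moving average with window size w over a list of numbers."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]

data = [1, 2, 3, 4, 5, 6]
print(moving_average(data, 3))  # -> [2.0, 3.0, 4.0, 5.0]
```

Plotting the smoothed series against the index gives the moving-average plot; larger windows suppress more of the local variation.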
3.3 Tools for representing two variables
1. scatter plot
2. moving-average plots
3.4 Tools for representing multiple variables
1. all or a selection of scatter plots
2. idem for moving-average plots
3. ‘trellis’ or other parameterised plots
4. icons: star icons, Chernoff’s faces
[Figures: Chernoff’s faces, star plots, parallel coordinates]
3.5 PCA: Principal Component Analysis
3.6 MDS: Multidimensional Scaling
DIMENSION REDUCTION
3.5 PCA: Principal Component Analysis
With sub-scatter plots we already noticed that the best projections were those that maximize the spread of the set of data points, i.e. projections in the direction of maximum variance. This idea is worked out in the approach of Principal Component Analysis.
Principal component analysis (PCA) is a vector space transform
often used to reduce multidimensional data sets to lower
dimensions for analysis.
Depending on the field of application, it is also named the discrete
Karhunen-Loève transform (KLT), the Hotelling transform or
proper orthogonal decomposition (POD).
PCA is now mostly used as a tool in exploratory data analysis and for making predictive models. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores and contributions.
PCA is the simplest of the true eigenvector-based multivariate
analyses. Often, its operation can be thought of as revealing the
internal structure of the data in a way which best explains the
variance in the data.
If a multivariate dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA supplies
the user with a lower-dimensional picture, a "shadow" of this
object when viewed from its (in some sense) most informative
viewpoint.
PCA is closely related to factor analysis.
Consider a multivariate set in data space: a set with normal distributions in multiple dimensions. Observe that the spatial extent appears different in each dimension. Also observe that in this case the set is almost one-dimensional. Can we project the set so that the spatial extent in one dimension is optimal?
Data X: n rows of p fields; the data vectors are the rows of X.
STEP 1: Subtract the average value from the dataset X: mean-centered data.
The spatial extent of this cloud of points can be measured by the variance in the dataset X; the variances are entries in the covariance matrix V = XᵀX.
The projection of the dataset X in a direction a is: y = Xa.
The spatial extent in direction a is the variance in the projected dataset y, i.e. the variance σa² = yᵀy = (Xa)ᵀ(Xa) = aᵀXᵀXa = aᵀVa.
We now want to maximize this extent σa² over all possible vectors a (why?).
STEP 2: Maximize σa² = aᵀVa over all possible vectors a.
This is unbounded, just like maximizing x² over x, therefore we restrict the size of vector a to 1: aᵀa − 1 = 0.
So we have:
maximize: aᵀVa subject to: aᵀa − 1 = 0
This can be solved with the Lagrange-multiplier method:
maximize: f(x) subject to: g(x) = 0 → d/dx{ f(x) − λ g(x) } = 0
For our case this means:
d/da{ aᵀVa − λ (aᵀa − 1) } = 0
→ 2Va − 2λa = 0
→ Va = λa
This means that we are looking for the eigenvectors and eigenvalues of the covariance matrix V = XᵀX.
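The two steps above (mean-center, then eigendecompose V = XᵀX) can be sketched with NumPy; the toy data matrix is an assumption for illustration:

```python
import numpy as np

# Toy data: n = 4 points in p = 2 dimensions, stretched roughly along y = x.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])

# STEP 1: mean-center each attribute.
Xc = X - X.mean(axis=0)

# STEP 2: eigendecomposition of V = Xc^T Xc.
V = Xc.T @ Xc
eigvals, eigvecs = np.linalg.eigh(V)   # eigh: ascending order for symmetric V
order = np.argsort(eigvals)[::-1]      # reorder eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# First principal component: the unit vector a that maximizes a^T V a.
a = eigvecs[:, 0]
y = Xc @ a                             # projection of the data onto a
print(eigvals)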
So, the underlying idea is: suppose you have a high-dimensional normally-distributed data set. This will take the shape of a high-dimensional ellipsoid. An ellipsoid is structured from its centre by orthogonal vectors with different radii. The largest radii have the strongest influence on the shape of the ellipsoid. The ellipsoid is described by the covariance matrix of the set of data points. The axes are defined by the orthogonal eigenvectors (from the centre – the centroid – of the set); the radii are defined by the associated eigenvalues.
So determine the eigenvalues and order them in decreasing size:
{λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λN}
The first n ordered eigenvectors thus ‘explain’ the following fraction of the data:
(λ1 + λ2 + … + λn) / (λ1 + λ2 + … + λN)
[Figure: mean-centered data cloud with principal axis 1 and principal axis 2 drawn through the mean]
STEP 3: Plot the ordered eigenvalues versus the index number and inspect where a ‘shoulder’ occurs: this determines the number of eigenvalues you take into account. This is a so-called ‘scree plot’.
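Once the eigenvalues are ordered, the ‘explained’ fraction behind the scree inspection is a cumulative sum; the eigenvalues below are assumed purely for illustration:

```python
import numpy as np

eigvals = np.array([5.0, 3.0, 1.0, 0.5, 0.5])  # assumed, already ordered

# Fraction of total variance explained by the first n components.
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained)  # -> [0.5  0.8  0.9  0.95 1.  ]
```

Here the ‘shoulder’ after the second component suggests keeping n = 2, which already explains 80% of the variance.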
For n points of p components, O(np² + p³) operations are required. Use LU-decomposition, etcetera.
Many benefits: considerable data reduction, necessary for computational techniques like ‘Fisher discriminant analysis’ and ‘clustering’. This works very well in practice.
PCA is closely related to and often confused with Factor
Analysis:
Factor Analysis is the explanation of p-dimensional data
by a smaller number of m < p factors.
EXAMPLE of PCA
Astronomical application: PCs for elliptical galaxies. Rotating to the PC in BT – Σ space improves the Faber-Jackson relation as a distance indicator (Dressler et al. 1987).
Astronomical application: eigenspectra (KL transform) (Connolly et al. 1995).
[Figure: the first four principal components, 1 PC – 4 PC]
3.6 Multi-Dimensional Scaling [MDS]
1. Same purpose: represent a high-dimensional data set.
2. In the case of MDS not by projections, but by reconstruction from the distance table. The computed points are represented in a Euclidean sub-space – preferably a 2D plane.
3. MDS performs better than PCA in the case of strongly curved sets.
The purpose of multidimensional scaling (MDS) is to provide a visual representation of the pattern of proximities (i.e., similarities or distances) among a set of objects.
INPUT: distances dist[Ai, Aj], where A is some class of objects
OUTPUT: positions X[Ai], where X is a D-dimensional vector
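One classical way to turn the distance table into positions is Torgerson's double-centering construction (classical MDS); this sketch is illustrative, not the only MDS algorithm, and the toy distance matrix is an assumption:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Reconstruct coordinates from a distance matrix D (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * L               # positions X[Ai]

# Distances between three collinear points at 0, 3 and 5 (assumed input).
D = np.array([[0.0, 3.0, 5.0],
              [3.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
X = classical_mds(D, dims=1)
print(X[:, 0])  # recovered 1-D coordinates (up to sign and translation)
```

Because the input distances here are exactly Euclidean, the pairwise distances of the recovered points reproduce D; for non-Euclidean proximities only an approximation is possible.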
How many dimensions? Inspect the scree plot.
Multidimensional scaling example: Dutch dialects (Nederlandse dialekten)
3.7 Kohonen’s Self-Organizing Map (SOM) and Sammon mapping
1. Same purpose: DIMENSION REDUCTION: represent a high-dimensional set in a smaller sub-space, e.g. a 2D plane.
2. SOM gives better results than Sammon mapping, but is strongly sensitive to initial values.
3. This is close to clustering!
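A minimal sketch of the SOM idea for scalar data; the grid size, learning-rate schedule, neighbourhood function, and toy data are all assumptions for illustration, not from the slides:

```python
import random

def train_som(data, n_units=4, iters=200, seed=0):
    """Minimal 1-D self-organizing map for scalar data (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initial values: note the point above that SOM results
    # are strongly sensitive to initialisation.
    w = [rng.random() for _ in range(n_units)]
    for t in range(iters):
        x = rng.choice(data)
        lr = 0.5 * (1.0 - t / iters)   # decaying learning rate
        # Best-matching unit (BMU): the unit closest to the sample.
        bmu = min(range(n_units), key=lambda i: abs(w[i] - x))
        for i in range(n_units):
            # Neighbourhood: full update for the BMU, half strength
            # for its direct neighbours on the 1-D grid.
            h = 1.0 if i == bmu else (0.5 if abs(i - bmu) == 1 else 0.0)
            w[i] += lr * h * (x - w[i])
    return w

# Two well-separated clusters; units should settle near 0 and near 10,
# which is why SOM training is close to clustering.
data = [0.0, 0.1, 0.2, 9.9, 10.0, 10.1]
print(sorted(train_som(data)))
```

The neighbourhood coupling is what distinguishes SOM from plain k-means-style clustering: units adjacent on the grid are dragged along with the winner, so the map preserves topology.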
[Figures: Kohonen’s Self-Organizing Map (SOM); Sammon mapping]
All information on the math part of the course is at:
http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm