
DATA MINING
from data to information
Ronald Westra
Dep. Mathematics
Knowledge Engineering
Maastricht University
PART 1
Introduction
All information on the mathematics part of the course is available at:
http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm
Data mining - a definition
"Data mining is the process of
exploration and analysis, by
automatic or semi-automatic
means, of large quantities of data
in order to discover meaningful
patterns and results."
(Berry & Linoff, 1997, 2000)
DATA MINING
Course Description:
In this course the student is made familiar with the main topics in Data Mining and with its important role in current Computer Science. We will mainly focus on algorithms, methods, and techniques for the representation and analysis of data and information.
DATA MINING
Course Objectives:
To get a broad understanding of data mining and knowledge
discovery in databases.
To understand major research issues and techniques in this new area and to conduct research in it.
To be able to apply data mining tools to practical problems.
LECTURE 1: Introduction
1. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases:
http://www.kdnuggets.com/gpspubs/aimag-kdd-overview1996-Fayyad.pdf
2. Hand, D., Mannila, H., Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA
MORE INFORMATION ON: ELEUM
and:
http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm
Hand, D., Mannila, H., Smyth, P. (2001),
Principles of Data Mining,
MIT Press, Cambridge, MA, USA
+ MORE INFORMATION ON:
ELEUM or DAM-website
LECTURE 1: Introduction
What is Data Mining?
• data → information → knowledge
• patterns, structures, models
The use of Data Mining
• increasingly larger databases, TB (terabytes) in size
• N data points with K components (fields) per data point
• not accessible for fast manual inspection
• incomplete, noisy, or poorly designed data
• different numerical formats, alphanumerical and semantic fields
• necessity to automate the analysis
LECTURE 1: Introduction
Applications
• astronomical databases
• marketing/investment
• telecommunication
• industrial
• biomedical/genetics
LECTURE 1: Introduction
Historical Context
• in mathematical statistics, 'data mining' carries a negative connotation
• danger of overfitting and erroneous generalisation
LECTURE 1: Introduction
Data Mining Subdisciplines
• Databases
• Statistics
• Knowledge Based Systems
• High-performance computing
• Data visualization
• Pattern recognition
• Machine learning
LECTURE 1: Introduction
Data Mining methods
• clustering
• classification (off- & on-line)
• (auto-)regression
• visualisation techniques: optimal projections and PCA (principal component analysis)
• discriminant analysis
• decomposition
• parametric modelling
• non-parametric modelling
LECTURE 1: Introduction
Data Mining essentials
• model representation
• model evaluation
• search/optimisation
Data Mining algorithms
• Decision trees/rules
• Nonlinear regression and classification
• Example-based methods
• AI tools: NN, GA, ...
LECTURE 1: Introduction
Data Mining and Mathematical Statistics
• when Statistics and when DM?
• is DM a sort of Mathematical Statistics?
Data Mining and AI
• AI is instrumental in finding knowledge in large chunks of data
Mathematical Principles in Data Mining
Part I: Exploring Data Space
* Understanding and Visualizing Data Space
Provide tools to understand the basic structure in databases. This is done by
probing and analysing metric structure in data-space, comprehensively
visualizing data, and analysing global data structure by e.g. Principal
Components Analysis and Multidimensional Scaling.
* Data Analysis and Uncertainty
Show the fundamental role of uncertainty in Data Mining. Understand the
difference between uncertainty originating from statistical variation in the
sensing process, and from imprecision in the semantic modelling. Provide
frameworks and tools for modelling uncertainty: especially the frequentist and
subjective/conditional frameworks.
Mathematical Principles in Data Mining
PART II: Finding Structure in Data Space
* Data Mining Algorithms & Scoring Functions
Provide a measure for fitting models and patterns to data. This enables the
selection between competing models. Data Mining Algorithms are discussed in
the parallel course.
* Searching for Models and Patterns in Data Space
Describe the computational methods used for model and pattern-fitting in data
mining algorithms. Most emphasis is on search and optimisation methods. This
is required to find the best fit between the model or pattern and the data.
Special attention is devoted to parameter estimation under missing data using
the maximum likelihood EM-algorithm.
Mathematical Principles in Data Mining
PART III: Mathematical Modelling of Data Space
* Descriptive Models for Data Space
Present descriptive models in the context of Data Mining. Describe specific
techniques and algorithms for fitting descriptive models to data. Main emphasis
here is on probabilistic models.
* Clustering in Data Space
Discuss the role of data clustering within Data Mining. Show the relation of clustering to classification and search. Present a variety of paradigms for clustering data.
EXAMPLES
* Astronomical Databases
* Phylogenetic trees from DNA-analysis
Example 1: Phylogenetic Trees
The last decade has witnessed a major and historical leap in biology and all related disciplines. The date of this event can be set almost exactly to November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about 3.3×10⁹ base pairs (bp) of code, constituting all of the approximately 35,000 human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database, consisting of a single string of 3.3×10⁹ characters from the set {C,G,A,T}.
Example 1: Phylogenetic Trees
This data constitutes the human ‘source code’. From this data – in principle – all ‘hardware’ characteristics, such as physiological and psychological features, can be deduced. In this block we will concentrate on another aspect that is hidden in this information: phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked that ‘everything makes sense in the light of evolution, nothing makes sense without the light of evolution’. This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with varying degrees of relatedness, we can reconstruct this evolutionary history. For instance, consider a species that lived at a certain time in Earth's history. It will be marked by a set of genes, each with a specific code (or rather, a statistical variation around the average).
Example 1: Phylogenetic Trees
If this species is for some reason distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This will result in the origin of sub-species, and eventually new species. Comparing the genomes of the new species will shed light on the evolutionary history, in that we can: draw a phylogenetic tree of the sub-species leading back to the ‘founder’ species; given the rate of mutation, estimate how long ago the founder species lived; and reconstruct the most probable genome of the founder species.
Example 2: data mining in astronomy
DATA AS SETS OF
MEASUREMENTS AND
OBSERVATIONS
Data Mining Lecture II
[Chapter 2 from Principles of Data Mining by
Hand, Mannila, Smyth]
LECTURE 2: DATA AS SETS OF
MEASUREMENTS AND
OBSERVATIONS
Readings:
• Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth.
2.1 Types of Data
2.2 Sampling
1. (re)sampling
2. oversampling/undersampling, sampling artefacts
3. Bootstrap and Jack-Knife methods
2.3 Measures for Similarity and Difference
1. Phenomenological
2. Dissimilarity coefficient
3. Metric in Data Space based on a distance measure
Types of data
Sampling:
– the process of collecting new (empirical) data
Resampling:
– selecting data from a larger, already existing collection
Sampling
– Oversampling
– Undersampling
– Sampling artefacts (aliasing, Nyquist frequency)
Sampling artefacts (aliasing, Nyquist frequency): Moiré fringes
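The slides only name the effect; as a minimal sketch (not from the course material), the snippet below shows how sampling a 9 Hz sine at 10 Hz, i.e. below the Nyquist rate, makes it indistinguishable from a 1 Hz sine, the numerical analogue of the moiré fringes above. The frequencies and duration are arbitrary illustrations.

import numpy as np

f_signal = 9.0    # true frequency (Hz)
f_sample = 10.0   # sampling rate (Hz); the Nyquist criterion would require > 18 Hz
t = np.arange(0.0, 2.0, 1.0 / f_sample)   # sample instants over two seconds

samples_true  = np.sin(2 * np.pi * f_signal * t)
samples_alias = np.sin(2 * np.pi * (f_sample - f_signal) * t)   # the 1 Hz alias

# The undersampled 9 Hz signal is indistinguishable (up to sign) from a 1 Hz sine:
print(np.allclose(samples_true, -samples_alias))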
Resampling
Resampling is any of a variety of methods for doing
one of the following:
– Estimating the precision of sample statistics (medians,
variances, percentiles) by using subsets of available data
(= jackknife) or drawing randomly with replacement from a
set of data points (= bootstrapping)
– Exchanging labels on data points when performing
significance tests (permutation test, also called exact test,
randomization test, or re-randomization test)
– Validating models by using random subsets (bootstrap,
cross validation)
Bootstrap & Jack-Knife methods
Using inferential statistics to account for randomness and uncertainty in the observations. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).
Bootstrap method
bootstrapping is a method for estimating the sampling distribution
of an estimator by resampling with replacement from the original
sample.
"Bootstrap" means that resampling one available sample gives
rise to many others, reminiscent of pulling yourself up by your
bootstraps.
cross-validation: verify replicability of results
Jackknife: detect outliers
Bootstrap: inferential statistics
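A minimal sketch of the bootstrap idea described above (not part of the original slides; the exponential sample and the choice of the median are arbitrary illustrations): estimate the standard error of a statistic by resampling with replacement from the one available sample.

import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100)    # one observed sample (illustrative)

n_boot = 2000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # draw a resample of the same size, with replacement, from the observed sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[b] = np.median(resample)

print("observed median:", np.median(sample))
print("bootstrap estimate of the standard error of the median:", boot_medians.std(ddof=1))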
2.3 Measures for Similarity and Dissimilarity
1. Phenomenological
2. Dissimilarity coefficient
3. Metric in Data Space based on a distance measure
2.4 Distance Measure and Metric
1. Euclidean distance
2. Metric
3. Commensurability
4. Normalisation
5. Weighted distances
6. Sample covariance
7. Sample covariance correlation coefficient
8. Mahalanobis distance
9. Normalised distance and cluster separation (see supplementary text)
10. Generalised Minkowski
2.4 Distance Measure and Metric
1. Euclidean distance
2.4 Distance Measure and Metric
2. Generalized p-norm: the Minkowski metric
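The formulas on the Minkowski-metric slides did not survive the transcript; for reference, the standard Minkowski p-norm distance between two d-dimensional vectors x and y is

d_p(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}

with p = 2 giving the Euclidean distance and p = 1 the city-block (Manhattan) distance.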
Generalized Minkowski Metric
In the data space, a structure is already present. The structure is represented by the correlations and is given by the covariance matrix G. The generalized Minkowski norm of a vector x is then:

\|\mathbf{x}\|^2 = \mathbf{x}^T G\,\mathbf{x}
2.4 Distance Measure and Metric
8. Mahalanobis distance
The Mahalanobis distance is a distance measure introduced by P. C.
Mahalanobis in 1936.
It is based on correlations between variables by which different patterns can be
identified and analysed. It is a useful way of determining similarity of an
unknown sample set to a known one.
It differs from Euclidean distance in that it takes into account the correlations of
the data set.
2.4 Distance Measure and Metric
8. Mahalanobis distance
The Mahalanobis distance of a multivariate vector x from a group of values with mean μ and covariance matrix Σ is defined as:

D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}
2.4 Distance Measure and Metric
8. Mahalanobis distance
The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ:

d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})}
2.4 Distance Measure and Metric
8.
Mahalanobis distance
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, it is called the normalized Euclidean distance:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2 / \sigma_i^2}

where σ_i is the standard deviation of x_i over the sample set.
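A minimal sketch (not from the book) of the definitions above, comparing the Mahalanobis and Euclidean distances of a point from the mean of a correlated sample; the synthetic two-dimensional data and the test point are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(1)
data = rng.multivariate_normal(mean=[0.0, 0.0],
                               cov=[[4.0, 1.5],
                                    [1.5, 1.0]], size=500)   # correlated sample

mu = data.mean(axis=0)                    # sample mean
sigma = np.cov(data, rowvar=False)        # sample covariance matrix
sigma_inv = np.linalg.inv(sigma)

x = np.array([2.0, 1.0])                  # the point whose distance we want
diff = x - mu
d_mahalanobis = np.sqrt(diff @ sigma_inv @ diff)
d_euclidean = np.linalg.norm(diff)

print("Mahalanobis distance:", d_mahalanobis)
print("Euclidean distance:  ", d_euclidean)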
2.5 Distortions in Data Sets
1. outliers
2. variance
3. sampling effects
2.6 Pre-processing data with mathematical transformations
2.7 Data Quality
• Data quality of individual measurements [GIGO]
• Data quality of data collections
Part II. Exploratory Data Analysis
VISUALISING AND
EXPLORING DATA-SPACE
Data Mining Lecture III
[Chapter 3 from Principles of Data Mining by
Hand, Mannila, Smyth]
LECTURE 3: Visualising and Exploring Data-Space
Readings:
• Chapter 3 from Principles of Data Mining by Hand, Mannila, Smyth.
3.1 Obtaining insight into the structure of Data Space
1. distribution over the space
2. Are there separate and disconnected parts?
3. Is there a model?
4. data-driven hypothesis testing
5. Starting point: use the strong perceptual powers of humans
LECTURE 3: Visualising and Exploring Data-Space
3.2 Tools to represent a single variable
1. mean, variance, standard deviation, skewness
2. plot
3. moving-average plot
4. histogram, kernel estimate
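As a small illustration (not from the book), the univariate summaries in this list can be computed directly; the synthetic gamma sample and the bin count are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=1.5, size=1000)    # a skewed sample

mean = x.mean()
var = x.var(ddof=1)
std = x.std(ddof=1)
skewness = np.mean(((x - mean) / std) ** 3)       # sample skewness

counts, edges = np.histogram(x, bins=20)          # histogram: counts per bin
print(f"mean={mean:.3f}  variance={var:.3f}  std={std:.3f}  skewness={skewness:.3f}")
print("histogram counts:", counts)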
LECTURE 3: Visualising and Exploring Data-Space
3.3 Tools for representing two variables
1. scatter plot
2. moving-average plots
LECTURE 3: Visualising and Exploring Data-Space
3.4 Tools for representing multiple variables
1. all or a selection of scatter plots
2. idem moving-average plots
3. ‘trellis’ or other parameterised plots
4. icons: star icons, Chernoff’s faces
3.5 PCA: Principal Component Analysis
3.6 MDS: Multidimensional Scaling
DIMENSION REDUCTION
3.5 PCA: Principal Component Analysis
1. With sub-scatter plots we already noticed that the best projections were those that resulted in the optimal spreading of the set of points. This is the direction of the largest variance. This idea is now worked out.
3.5 PCA: Principal Component Analysis
2. Underlying idea: suppose you have a high-dimensional, normally distributed data set. This will take the shape of a high-dimensional ellipsoid. An ellipsoid is structured from its centre by orthogonal vectors with different radii. The largest radii have the strongest influence on the shape of the ellipsoid. The ellipsoid is described by the covariance matrix of the set of data points. The axes are defined by the orthogonal eigenvectors (from the centre – the centroid – of the set), and the radii are defined by the associated eigenvalues.
So determine the eigenvalues and order them by decreasing size:

\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \dots \ge \lambda_N

The first n ordered eigenvectors thus ‘explain’ the following fraction of the data:

\sum_{i=1}^{n} \lambda_i \Big/ \sum_{i=1}^{N} \lambda_i
3.5 PCA: Principal Component Analysis
[figures: data cloud with its MEAN and principal axes 1 and 2]
3.5 PCA: Principal Component Analysis
3. Plot the ordered eigenvalues versus the index number and inspect where a ‘shoulder’ occurs: this determines the number of eigenvalues you take into account. This is a so-called ‘scree plot’.
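A minimal sketch (not from the slides) of the scree-plot idea: the eigenvalues of the sample covariance matrix are ordered, the cumulative fraction of variance they explain is computed as in the formula above, and a textual bar chart stands in for the plot. The synthetic data set with two dominant directions is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(2)
# 200 points in 5 dimensions whose variance lives mostly in two directions
latent = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
data = latent + 0.1 * rng.normal(size=(200, 5))

cov = np.cov(data, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]    # eigenvalues in decreasing order
explained = np.cumsum(eigvals) / eigvals.sum()      # fraction explained by the first n

for i, (lam, frac) in enumerate(zip(eigvals, explained), start=1):
    bar = "#" * int(40 * lam / eigvals[0])          # crude textual scree plot
    print(f"lambda_{i} = {lam:8.4f}   cumulative fraction = {frac:.3f}   {bar}")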
3.5 PCA: Principal Component Analysis
4. Another derivation is by maximization of the variance of the projected data (see book). This leads to an eigenvalue problem for the covariance matrix, and hence to the same solution as described above.
3.5 PCA: Principal Component Analysis
5. For n points of p components, O(np² + p³) operations are required. Use LU decomposition, etcetera.
3.5 PCA: Principal Component Analysis
6. Many benefits: considerable data reduction, necessary for computational techniques like Fisher discriminant analysis and clustering. This works very well in practice.
3.5 PCA: Principal Component Analysis
7. NB: Factor Analysis is often confused with PCA but is different: it explains p-dimensional data by a smaller number of m < p factors.
Principal component analysis (PCA) is a technique that is useful for the
compression and classification of data. The purpose is to reduce the
dimensionality of a data set (sample) by finding a new set of variables,
smaller than the original set of variables, that nonetheless retains most
of the sample's information.
By information we mean the variation present in the sample,
given by the correlations between the original variables. The new
variables, called principal components (PCs), are uncorrelated, and are
ordered by the fraction of the total information each retains.
overview
• geometric picture of PCs
• algebraic definition and derivation of PCs
• usage of PCA
• astronomical application
Geometric picture of principal components (PCs)
A sample of n observations in the 2-D space
Goal: to account for the variation in a sample
in as few variables as possible, to some accuracy
Geometric picture of principal components (PCs)
• the 1st PC is a minimum-distance fit to a line in the sample space
• the 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Algebraic definition of PCs
Given a sample of n observations on a vector of p variables, \mathbf{x} = (x_1, \dots, x_p)^T, define the first principal component of the sample by the linear transformation

z_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{i=1}^{p} a_{1i} x_i

where the vector \mathbf{a}_1 = (a_{11}, \dots, a_{1p})^T is chosen such that \mathrm{var}(z_1) is maximum.
Algebraic definition of PCs
Likewise, define the kth PC of the sample by the linear transformation

z_k = \mathbf{a}_k^T \mathbf{x}, \qquad k = 1, \dots, p

where the vector \mathbf{a}_k is chosen such that \mathrm{var}(z_k) is maximum, subject to \mathbf{a}_k^T \mathbf{a}_k = 1 and to \mathrm{cov}(z_k, z_l) = 0 for l < k.
Algebraic derivation of coefficient vectors
To find \mathbf{a}_1, first note that

\mathrm{var}(z_1) = \mathbf{a}_1^T S \mathbf{a}_1

where S is the covariance matrix for the variables \mathbf{x}. So maximize \mathbf{a}_1^T S \mathbf{a}_1 subject to \mathbf{a}_1^T \mathbf{a}_1 = 1. Let λ be a Lagrange multiplier; then maximize

\mathbf{a}_1^T S \mathbf{a}_1 - \lambda (\mathbf{a}_1^T \mathbf{a}_1 - 1)

By differentiating with respect to \mathbf{a}_1 we obtain S \mathbf{a}_1 = \lambda \mathbf{a}_1, therefore \mathbf{a}_1 is an eigenvector of S corresponding to eigenvalue λ.
Algebraic derivation of \lambda_1
We have maximized \mathrm{var}(z_1) = \mathbf{a}_1^T S \mathbf{a}_1 = \lambda. So \lambda_1 is the largest eigenvalue of S. The first PC z_1 retains the greatest amount of variation in the sample.
Algebraic derivation of coefficient vectors
To find the next coefficient vector \mathbf{a}_2, maximize \mathbf{a}_2^T S \mathbf{a}_2 subject to \mathbf{a}_2^T \mathbf{a}_2 = 1 and to \mathrm{cov}(z_2, z_1) = 0. First note that

\mathrm{cov}(z_2, z_1) = \mathbf{a}_1^T S \mathbf{a}_2

then let λ and φ be Lagrange multipliers, and maximize

\mathbf{a}_2^T S \mathbf{a}_2 - \lambda (\mathbf{a}_2^T \mathbf{a}_2 - 1) - \phi\, \mathbf{a}_2^T \mathbf{a}_1

We find that \mathbf{a}_2 is also an eigenvector of S, whose eigenvalue \lambda_2 is the second largest.
In general
• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC z_k retains the kth greatest fraction of the variation in the sample.
Algebraic formulation of PCA
Given a sample of n observations on a vector of p variables \mathbf{x}, define a vector of p PCs

\mathbf{z} = A^T \mathbf{x}

where A is an orthogonal p x p matrix whose kth column \mathbf{a}_k is the kth eigenvector of S. Then

\Lambda = A^T S A

is the covariance matrix of the PCs, being diagonal with elements \lambda_1, \dots, \lambda_p.
usage of PCA: Probability distribution for sample PCs
If
(i) the n observations of x in the sample are independent, and
(ii) x is drawn from an underlying population that follows a p-variate normal (Gaussian) distribution with known covariance matrix Σ,
then the sample covariance matrix follows a Wishart distribution;
else utilize a bootstrap approximation.
usage of PCA: Probability distribution for sample PCs
If
(i) the sample covariance matrix follows a Wishart distribution, and
(ii) the population eigenvalues are all distinct,
then the following results hold asymptotically (as the sample size grows):
• all the sample eigenvalues are independent of all the sample eigenvectors
• the sample eigenvalues and sample eigenvectors are jointly normally distributed
• (expressions for their asymptotic means and covariances exist; a tilde denotes a population quantity)
usage of PCA: Inference about population PCs
If x follows a p-variate normal distribution, then analytic expressions exist* for MLEs of the population eigenvalues and eigenvectors, for confidence intervals for them, and for hypothesis testing about them; else bootstrap and jackknife approximations exist.
*see references, esp. Jolliffe
usage of PCA: Practical computation of PCs
In general it is useful to define standardized variables by dividing each variable by its sample standard deviation. If the variables are each measured about their sample mean, then the covariance matrix of the standardized variables will be equal to the correlation matrix of the original variables, and the PCs will be dimensionless.
usage of PCA: Practical computation of PCs
Given a sample of n observations on a vector of p variables (each measured about its sample mean), compute the covariance matrix

S = \frac{1}{n} X^T X

where X is the n x p matrix whose ith row is the ith observation. Then compute the n x p matrix

Z = X A

whose ith row is the PC score vector for the ith observation.
usage of PCA: Practical computation of PCs
Write

\mathbf{x}_i = A \mathbf{z}_i = \sum_{k=1}^{p} z_{ik}\, \mathbf{a}_k

to decompose each observation into PCs.
usage of PCA: Data compression
Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs:

\mathbf{x}_i \approx \sum_{k=1}^{m} z_{ik}\, \mathbf{a}_k
usage of PCA: Data compression
Reduce the dimensionality of the data from p to m < p by approximating

X \approx Z_m A_m^T

where Z_m is the n x m portion of Z and A_m is the p x m portion of A.
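A minimal sketch (not from the slides) of the computation just described: centre the data, take the eigenvectors of the covariance matrix as the columns of A, form the score matrix Z = XA, and reconstruct from the first m PCs. The synthetic data and the choice m = 2 are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # n = 100 observations, p = 4
X = X - X.mean(axis=0)                                    # measure about the sample mean

S = (X.T @ X) / X.shape[0]                                # covariance matrix
eigvals, A = np.linalg.eigh(S)                            # A orthogonal, columns = eigenvectors
order = np.argsort(eigvals)[::-1]                         # decreasing eigenvalue order
eigvals, A = eigvals[order], A[:, order]

Z = X @ A                                                 # n x p matrix of PC scores

m = 2                                                     # keep the first m < p PCs
X_approx = Z[:, :m] @ A[:, :m].T                          # rank-m reconstruction

rel_err = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print("relative reconstruction error with", m, "PCs:", rel_err)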
astronomical application: PCs for elliptical galaxies
Rotating to PC in BT – Σ space improves Faber-Jackson relation
as a distance indicator
Dressler, et al. 1987
astronomical application: Eigenspectra (KL transform)
Connolly, et al. 1995
references
Connolly, and Szalay, et al., “Spectral Classification of Galaxies: An Orthogonal Approach”, AJ, 110, 1071-1082, 1995.
Dressler, et al., “Spectroscopy and Photometry of Elliptical Galaxies. I. A New Distance Estimator”, ApJ, 313, 42-58, 1987.
Efstathiou, G., and Fall, S.M., “Multivariate analysis of elliptical galaxies”, MNRAS, 206, 453-464, 1984.
Johnston, D.E., et al., “SDSS J0903+5028: A New Gravitational Lens”, AJ, 126, 2281-2290, 2003.
Jolliffe, Ian T., 2002, Principal Component Analysis (Springer-Verlag New York, Secaucus, NJ).
Lupton, R., 1993, Statistics In Theory and Practice (Princeton University Press, Princeton, NJ).
Murtagh, F., and Heck, A., Multivariate Data Analysis (D. Reidel Publishing Company, Dordrecht, Holland).
Yip, C.W., and Szalay, A.S., et al., “Distributions of Galaxy Spectral Types in the SDSS”, AJ, 128, 585-609, 2004.
3.5 PCA: Principal Component Analysis
[figure: reconstructions using 1, 2, 3, and 4 principal components]
3.6 Multidimensional Scaling [MDS]
1. Same purpose: represent a high-dimensional data set.
2. In the case of MDS not by projections, but by reconstruction from the distance table. The computed points are represented in a Euclidean sub-space – preferably a 2D plane.
3. MDS performs better than PCA in the case of strongly curved sets.
3.6 Multidimensional Scaling
The purpose of multidimensional scaling (MDS) is to provide a visual representation of the pattern of proximities (i.e., similarities or distances) among a set of objects.
INPUT: distances dist[Ai,Aj] where A is some class of objects
OUTPUT: positions X[Ai] where X is a D-dimensional vector
3.6 Multidimensional Scaling
How many dimensions? Inspect the scree plot.
Multidimensional Scaling: Dutch dialects
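A minimal sketch (not from the slides) of classical (metric) MDS: starting from a table of pairwise distances, double-centring recovers coordinates in a low-dimensional Euclidean plane. The random ten-point configuration is only an illustration.

import numpy as np

rng = np.random.default_rng(4)
true_points = rng.normal(size=(10, 2))                    # hidden configuration
D = np.linalg.norm(true_points[:, None, :] - true_points[None, :, :], axis=-1)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                       # centring matrix
B = -0.5 * J @ (D ** 2) @ J                               # double-centred squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                                     # embed in a 2-D plane
X = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0))  # recovered coordinates

# the reconstructed distances should reproduce the input distance table
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print("largest distance error:", np.abs(D - D_rec).max())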
3.6 Kohonen’s Self Organizing Map (SOM) and
Sammon mapping
1. Same purpose: DIMENSION REDUCTION: represent a high-dimensional set in a smaller sub-space, e.g. a 2D plane.
2. SOM gives better results than Sammon mapping, but is strongly sensitive to initial values.
3. This is close to clustering!
3.6 Kohonen’s Self Organizing Map (SOM)
Sammon mapping
1. Clustering versus Classification
• classification: assign a pre-determined label to a sample
• clustering: derive the relevant labels for classification from the structure in a given dataset
• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity (a minimal sketch follows below)
• Objectives: 1. segmentation of space; 2. finding natural subclasses
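A minimal clustering sketch (not from the slides), using k-means as one concrete paradigm: labels are derived from the structure of the data rather than given in advance, and each iteration tightens intra-cluster similarity. The two synthetic blobs and the choice k = 2 are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(5)
# two "natural subclasses" in the plane
data = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
                  rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))])

k = 2
centres = data[rng.choice(len(data), size=k, replace=False)]   # random initial centres
for _ in range(20):
    # assign every point to its nearest centre (intra-cluster similarity)
    dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    # move each centre to the mean of its cluster (keep it if the cluster is empty)
    centres = np.array([data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                        for j in range(k)])

print("cluster sizes:", np.bincount(labels, minlength=k))
print("cluster centres:\n", centres)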
The End