Presentation material from Sami Äyrämö


UNIVERSITY OF JYVÄSKYLÄ
DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY
University of Joensuu, 13.02.2007
Sami Äyrämö
• The need for scalable and robust KDD/DM methods is driven by the constant growth of digital data storage:
  • The capability to measure and store data from different objects has increased (industrial processes, human beings, markets, …)
  • Collecting data is cheap!
• The consequence is that essential information may be lost in the oceans of data
  – The growth of knowledge lags behind the growth of digital data sets
  • Lots of data, high dimensions, missing values, outliers, …
  – Scalable and robust data analysis methods are needed
• NB! Data owners should bear in mind that they have to collect and digitize the most important data
  – Not all collected digital data is important
  – All important data may not be collected
Modified from: U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, Vol. 39, No. 11, November 1996.
• Enhanced understanding of the target data leads to success
• E-commerce
• Web mining
• Marketing research
• Process industry (e.g., paper machines)
• Gene databases
• Social sciences
• Software quality analysis
• Document databases
• Image databases
• Radar applications
• Peace science
• Telecommunication
• Archeology
• and many others…
[Figure: sensor, control, and quality measurements from a process are collected over time as temporal events into the data matrix
$$D = \begin{pmatrix} d_{11} & \cdots & d_{1m} \\ \vdots & & \vdots \\ d_{t1} & \cdots & d_{tj} & \cdots & d_{tm} \\ \vdots & & \vdots \\ d_{n1} & \cdots & d_{nm} \end{pmatrix}.$$]
[Figure: the roles around the process data matrix $D$ (operator, lab technician, customer, manager) exchanging process data, quality data, control data, and feedback.]
• Clustering is the process of grouping data into clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters
• The core method of data mining
155 48
178 60

160 90

?
78

D
195 ?

250 71
 :
:

170  60
E ...120
V K ... ? 
H E ...136

? ? ...180
S E ... ? 

R ? ...117

: : :

? E ...171
S
Refine,compess,simplify,…
• Application domain → relevant variables → type of variables → distance measure
  • Non-informative variables, e.g., completely missing, constant, and duplicate variables
• Outliers and noise (robustness)
• Missing data (consistency)
• Scaling of data (equal weighting of variables)
• Number of clusters (may be ambiguous)
• Choice of clustering method (underlying assumptions)
• Non-uniqueness of results (the solution depends on the initial values)
• Large data sets (scalability)
• Cluster presentation (dimension reduction)
• Interpretation (domain knowledge)
[Figure: data matrix $D$ of objects (persons) × features (height, weight, eye color, eyeglasses, …, blood pressure). The row (250, 71, …) is marked as an outlier, and "?" entries are marked as missing values.]
[Figures: initialization and robust clustering examples, annotated "Bad?", "But this is the global minimum?", "Better?", "Useful?", and "Role of K?"]
• Clustering methods do not necessarily inherit the robustness of their robust components!
• Roughly speaking, there are two types of missing values:
  1. The true underlying value exists, but for one reason or another it is not entered into the data set
     • E.g., a human mistake or a measurement system failure
  2. A true value may not exist in the real world
     • An unemployed person cannot have an employer
     • "Does not exist" or "does not know" may be the most important information (cf. the Finnish parliamentary election 2007)
• Hence, missing data may contain the most important piece of information ("nonignorable variables")
• Anyhow, missing data causes many challenges for cluster analysis
  • For instance, the dissimilarity matrix may be impossible to compute: what is a suitable distance measure for the pair of vectors (NaN, 3.2, NaN) and (1.7, NaN, 4.0)? (See the sketch below.)
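A short sketch makes the problem concrete: the pair above shares no commonly observed component, so any available-case distance is undefined. This is illustrative code, not from the slides; the function name and the None convention for an undefined distance are my assumptions.

```python
import numpy as np

def available_case_distance(a, b):
    """l2 distance computed only over the components observed in both vectors."""
    both = ~np.isnan(a) & ~np.isnan(b)   # components available in both vectors
    if not both.any():
        return None                      # no overlap: the distance is undefined
    return np.linalg.norm(a[both] - b[both])

print(available_case_distance(np.array([np.nan, 3.2, np.nan]),
                              np.array([1.7, np.nan, 4.0])))  # None
```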
• Missing completely at random (MCAR): the probability that the jth component of a vector xi ∈ ℝp is missing is independent of any other known or missing value of xi
• Any missing data strategy can be applied without producing significant bias in the data
• The easiest type of missing data from the statistical point of view
UNIVERSITY OF JYVÄSKYLÄ
•
•
•
Probability that jth component
of vector xi∈ℝp is missing
may depend on the observed
components of xi
The observed values of (xi)j
is a random sample within
subclasses defined by the
values of (xi)j, i≠k
Some knowledge about the
unavailable data values
DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY
x > 5 → y is missing
13/02/2007
UNIVERSITY OF JYVÄSKYLÄ
•
•
•
•
Probability that jth component
of vector xi∈ℝp is missing
depend on its value
E.g., value of a measurement
is out of some permissible
range and might therefore be
censored or replaced by an
empty value (in this case the
mechanism of missing data is
understood)
Some knowledge about the
unavailable data values
No general method exists for
this kind of missing data
mechanism
•
DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY
x exists only if x > 0
Missing data value is actually an extreme outlier (any value
between -∞ and ∞)
• Fill in the missing values manually
  • Time consuming (esp. in DM applications)
• Treat unknown values as a special category
• Replace by a constant
  • Simple, but not usually recommended
• Use only complete cases in the analysis
  • A lot of information might be lost
  • It may be more reasonable to discard a variable with a large fraction of missing values than to discard a large number of objects
• Available case strategy (see the sketch below)
  • Uses all available data in the computation
  • The statistical efficiency of the basic estimates is proportional to the fraction of missing data
  • Implemented easily by a projector technique where all computation is projected onto the available vector components (more details later…)
  • Extremely simple for the end-user
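A minimal sketch of the projector idea, under the assumption that P_i = diag(p_i) simply zeroes out the missing components, so statistics accumulate only over the observed entries (the function name is illustrative):

```python
import numpy as np

def projected_mean(X):
    """Available-case sample mean: each coordinate is averaged over the
    rows in which it is observed (P_i = diag(p_i) masks missing entries)."""
    P = ~np.isnan(X)                    # p_i: 1 where observed, 0 where missing
    Xz = np.where(P, X, 0.0)            # P_i x_i: missing entries contribute 0
    counts = P.sum(axis=0)              # number of observations per coordinate
    return Xz.sum(axis=0) / np.maximum(counts, 1)

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0, np.nan],
              [np.nan, 6.0, 9.0]])
print(projected_mean(X))                # [1.5, 5.0, 6.0]
```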
• Imputation
  • Replace missing values with reasonable estimates (mean, median, mode, kNN) based on the available data values
  • A simple approach
  • Mean imputation works when the sample is drawn from a unimodal normal distribution and the data is missing at random
  • Reduces the inherent variability of the data (illustrated below)
  • Outliers may seriously distort mean imputation (the median might be better)
  • Simple imputation methods are not generally suitable for data clustering
    • The potential class structure is not considered by simple imputation methods
    • Shrinking the within-cluster variations may disturb cluster validation
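An illustrative sketch (not from the slides) of column-wise median imputation; the printout shows the reduced variability that the bullet above warns about:

```python
import numpy as np

def median_impute(X):
    """Replace NaNs column-wise with the median of the observed values."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmedian(col)  # robust to outliers, unlike the mean
    return X

X = np.array([[155.0, 48.0], [178.0, 60.0], [np.nan, 78.0], [195.0, np.nan]])
Xi = median_impute(X)
print(np.nanvar(X[:, 0]), np.var(Xi[:, 0]))    # the variance shrinks after imputation
```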
• Hot deck imputation
  • Replaces the missing values with values based on estimated missing data distributions – the missing data distribution is estimated from the available values
  • A sample EM-like hot deck algorithm (sketched in code below):
    1. Cluster the data using the complete cases
    2. Assign the incomplete cases to the nearest clusters (using an appropriate distance measure)
    3. Impute the missing values using statistical within-cluster summaries
    4. Cluster the whole data set (both complete and imputed cases)
    5. If the clustering changed, repeat from Step 3
• Cold deck imputation uses another, distinct data set for substituting the missing values
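A sketch of the EM-like hot deck algorithm above, assuming the within-cluster summary is the mean and that the initial prototypes come from clustering the complete cases (Step 1, not shown here):

```python
import numpy as np

def em_hot_deck(X, centers, max_iter=20):
    """EM-like hot deck imputation following the five steps above."""
    X, miss = X.copy(), np.isnan(X)
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every case to the nearest prototype;
        # nan_to_num lets still-missing components drop out of the distance.
        d = np.stack([np.linalg.norm(np.nan_to_num(X - c), axis=1)
                      for c in centers], axis=1)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                               # Step 5: clustering unchanged
        labels = new_labels
        # Step 3: impute missing entries from within-cluster summaries.
        for k in range(len(centers)):
            members = np.where(labels == k)[0]
            if members.size:
                centers[k] = np.nanmean(X[members], axis=0)
                for i in members:
                    X[i, miss[i]] = centers[k, miss[i]]
    return X, labels
```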
• Nearest-neighbor imputation (sketched below)
  • Identifies the candidates that are most similar to the object with missing data (a candidate has to be complete in the required variables) and substitutes the missing values with the candidates' values
  • An appropriate distance measure is needed
    • E.g., the l2- or l∞-norm on the available variables (presumes standardization of the variables)
    • An upper limit d0 may be given for the distance in order to avoid the occurrence of strange objects (if the distance from the incomplete object to its closest complete-case neighbor is greater than d0, imputation is omitted)
• Prediction-based imputation
  • Uses classification (nominal variables) and regression (continuous variables) models based on the other variables to predict the missing values
• Multiple imputation
  • A collection of imputed data sets is produced and analyzed using standard methods
  • The final result is obtained by combining the characterizations of the imputed data sets (computationally expensive)
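A minimal nearest-neighbor imputation sketch with the distance cap d0 described above, using the l2-norm on the available variables (names and conventions are illustrative):

```python
import numpy as np

def nn_impute(x, complete, d0=np.inf):
    """Impute x's missing entries from its nearest complete-case neighbor;
    if the nearest neighbor is farther than d0, imputation is omitted."""
    x = x.copy()
    obs = ~np.isnan(x)
    dists = np.linalg.norm(complete[:, obs] - x[obs], axis=1)
    j = dists.argmin()
    if dists[j] <= d0:
        x[~obs] = complete[j, ~obs]     # substitute from the candidate
    return x

complete = np.array([[155.0, 48.0], [178.0, 60.0], [170.0, 60.0]])
print(nn_impute(np.array([np.nan, 59.0]), complete, d0=5.0))  # -> [178., 59.]
```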
• Data presentation
• Choice of objects
• Choice of variables
• What to cluster: objects or variables?
• Normalization/weighting of variables
• Choice of (dis)similarity measures!
• Choice of clustering criterion (objective function)
• Choice of missing data strategy
• Algorithms and computer implementation (and their reliability, e.g., convergence)
• Number of clusters
• Interpretation of results
• Data sets produced by industrial processes are often sparse and contain outliers that are caused, e.g., by human errors and measurement failures (e.g., laboratory measurements)
• Traditional clustering methods (e.g., k-means, hierarchical methods) are not appropriate for comprehensive cluster analysis by themselves because of their sensitivity to such defects
• Robust methods complement the traditional methods because of their
  • tolerance to incomplete, sparse and dirty data
  • usability (less manual trimming of data by the end user)

[Figure: a time-by-quality-attributes data matrix $D$ with NaN entries (missing values) and a possible outlier (e.g., the entry 401.0).]
$$\min_{\{c_k\}_{k=1}^{K},\,I} J_q\bigl(I,\{c_k\}_{k=1}^{K}\bigr), \quad\text{for } J_q\bigl(I,\{c_k\}_{k=1}^{K}\bigr) = \sum_{i=1}^{n} \bigl\|P_i\,(x_i - c_{(I)_i})\bigr\|_q^{\alpha},$$

where $(I)_i \in \{1,\dots,K\}$, $c_k \in \mathbb{R}^p$, $P_i = \operatorname{diag}(p_i)$, and $x_i \in \mathbb{R}^p$.

• α = 2, q = 2 → K-means
• α = 1, q = 2 → K-spatialmedians
• α = 1, q = 1 → K-coordinatewise medians

• K-prototype algorithm (a sketch in code follows below):
  1. Initialize the cluster prototypes (HOW?)
  2. Assign each data point to the closest cluster prototype (DISTANCE MEASURE?)
  3. Compute the new estimates for the cluster prototypes (ESTIMATOR?)
  4. Termination: stop if the termination criteria are satisfied (usually no changes in I)

T. Kärkkäinen and S. Äyrämö, Robust clustering methods for erroneous and incomplete data, In Proc. of the 5th Conference on Data Mining, September 2004.
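A compact sketch of the K-prototype loop, instantiated here with the coordinatewise median as the estimator (α = 1, q = 1) and with NaNs projected out of the distances; the random initialization and the l1 assignment distance are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def k_prototypes(X, K, max_iter=100, seed=0):
    """K-prototype loop: assign to the closest prototype, re-estimate, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: random initialization (the slide leaves "HOW?" open).
    C = np.nan_to_num(X[rng.choice(len(X), K, replace=False)])
    I = None
    for _ in range(max_iter):
        # Step 2: l1 distances over the available components only.
        d = np.stack([np.nansum(np.abs(X - c), axis=1) for c in C], axis=1)
        new_I = d.argmin(axis=1)
        if I is not None and np.array_equal(new_I, I):
            break                                # Step 4: no changes in I
        I = new_I
        for k in range(K):                       # Step 3: coordinatewise medians
            if (I == k).any():
                C[k] = np.nanmedian(X[I == k], axis=0)
    return C, I
```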
• "Robustness signifies insensitivity to small deviations from the assumptions", P. Huber, Robust Statistics, 1981
  • A small deviation means either gross errors (outliers) in a minor part of the data or small errors (noise) in a large part of the data
• Statistical methods and tools are mainly based on classical statistics (mean, variance)
  • Classical statistics are based on the assumption that the data originate from a normal distribution
  • This often makes the results useless when the normality assumption is not satisfied
  • The classical statistics are computationally straightforward
  • Robust statistics are more intractable from the computational point of view
• Breakdown point: "the smallest fraction of contamination that can cause the estimator T to take values arbitrarily far from T(X)"
  • The BP of the sample mean is 1/n; for the coordinatewise and spatial medians it is 1/2 (see the illustration below)
• Influence function
• Gross-error sensitivity
• Local-shift sensitivity
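A tiny numeric illustration (not from the slides) of the breakdown points quoted above: a single gross error, a fraction 1/n of the data, carries the mean arbitrarily far but barely moves the median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_bad = x.copy()
x_bad[0] = 1e9                          # a single gross error (fraction 1/n)
print(np.mean(x), np.mean(x_bad))       # 3.0 vs ~2e8: the mean breaks down
print(np.median(x), np.median(x_bad))   # 3.0 vs 4.0: the median barely moves
```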
• The trimmed mean
  • Attractive robust alternative in one or two dimensions
  • Requires parameters (distance or fraction of data)
  • Translation and scale equivariant
• The spatial median
  • Attractive robust alternative in high dimensions
  • Computationally challenging – no explicit solution exists for the optimization problem
  • Orthogonally equivariant – presumes equal weighting for the variables
• The (coordinatewise) median
  • Difficult in high dimensions
  • Insensitive to the scaling of variables
• Many others: the Oja median, …
• High efficiency: nearly equal efficiency with maximum likelihood estimators under ideal parametric models.
• Qualitative robustness: estimates are influenced only slightly by small deviations from the assumed model.
• Quantitative robustness: estimates are protected against large amounts of contamination or single gross errors (high breakdown point).
• Local-shift sensitivity: smooth reaction to rounding and grouping.
• Rejection point: separation between outliers and the bulk of the data.
• Fisher consistency: estimation of the right quantity (for parametric models).
• Affine equivariance: the solution should be independent of the scales of the variables (multivariate cases).
• Computational practicality: the solution should be obtainable in a practical amount of computing time, even in high dimensions and/or with large amounts of data.
• Infinite local-shift sensitivity at the median of the distribution:

[Figures: 3D sample and 4D sample.]
• The coordinatewise median does not always lie in the convex hull of the data in ℝp with p ≥ 3 (checked numerically below):

$$\text{data: } \{e_i\}_{i=1}^{N},\; e_i \in \mathbb{R}^p \;(e_i \text{ a unit vector}) \;\Rightarrow\; \text{median} = 0, \text{ not contained in the convex hull for } N \geq 3.$$

• Not orthogonally equivariant – sensitive to rotations of the data
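A quick numeric check of the unit-vector example (illustrative):

```python
import numpy as np

E = np.eye(3)              # data: unit vectors e_1, e_2, e_3 in R^3
m = np.median(E, axis=0)
print(m)                   # [0. 0. 0.]
# Every point in the convex hull of {e_i} has coordinates summing to 1,
# so the coordinatewise median (the origin) lies outside the hull.
print(m.sum())             # 0.0 != 1
```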
• In statistics: a robust estimate of location (an M-estimate). A.k.a. the Fermat-Weber problem or the single-facility location problem (used, e.g., in transportation cost minimization and networking problems)
• Good statistical properties on multivariate samples
  – More robust than the sample mean (breakdown point 50%)
  – An inherently multivariate estimator on continuous variables
  – The coordinatewise sample median is sensitive to so-called "inliers" and works better with discrete variables
  – Rotation invariant
  – In the univariate case the spatial median coincides with the coordinatewise median
  – Its statistical efficiency approaches that of the sample mean as p approaches infinity
• Treated as a nonsmooth convex optimization problem
  – A unique solution exists when the data are not collinear
  – SOR-Weiszfeld type of algorithm with treatment for missing data

Kärkkäinen, T. and Äyrämö, S., On Computation of Spatial Median for Robust Data Mining, In Proceedings of EUROGEN 2005, September 2005, Germany.
Valkonen, T., Convergence of a SOR-Weiszfeld type algorithm for incomplete data sets, Numerical Functional Analysis and Optimization, 27 (7-8), Dec. 2006.
$$\min_{u \in \mathbb{R}^p} J_q(u), \quad\text{for } J_q(u) = \frac{1}{N} \sum_{i=1}^{N} \|u - x_i\|_q.$$

The spatial median (q = 2) satisfies

$$0 \in \partial J_2(u) = \sum_{i=1}^{N} \alpha_i, \quad\text{with } (\alpha_i)_j = \frac{(u - x_i)_j}{\|u - x_i\|_2} \text{ for } \|u - x_i\|_2 \neq 0, \qquad \|\alpha_i\|_2 \leq 1 \text{ for } \|u - x_i\|_2 = 0.$$

• Nonsmooth
• Convex
• A unique solution exists if the data are not collinear
• In the univariate case the spatial median coincides with the coordinatewise median
The modified gradient:

$$\nabla J(u) = \sum_{i=1}^{N} \alpha_i, \quad\text{with } (\alpha_i)_j = \frac{(P_i (u - x_i))_j}{\max(\|P_i (u - x_i)\|_2,\, \varepsilon)}, \text{ for all } x_i.$$

The ε-approximating problem:

$$\min_{u \in \mathbb{R}^p} J_\varepsilon(u), \quad\text{for } J_\varepsilon(u) = \sum_{i=1}^{N} \sqrt{\|P_i (u - x_i)\|_2^2 + \varepsilon},$$

$$\nabla J_\varepsilon(u) = \sum_{i=1}^{N} \alpha_i, \quad\text{where } (\alpha_i)_j = \frac{(P_i (u - x_i))_j}{\sqrt{\|P_i (u - x_i)\|_2^2 + \varepsilon}}, \text{ for all } x_i.$$
• Classical direct-search optimization methods (e.g., Nelder-Mead)
  – Accurate, since not based on gradients
  – Scale poorly w.r.t. the number of dimensions
  – Require a large amount of computing resources
• Gradient-based methods
  – Presume differentiability
  – Inaccurate
  – Require a lot of resources to compute
• Iterative methods (e.g., Weiszfeld and its modifications)
  – Fast
  – Good scalability to high dimensions
• The ε-approximation problem reformulation:

$$\min_{u \in \mathbb{R}^p} J_\varepsilon(u), \quad\text{for } J_\varepsilon(u) = \sum_{i=1}^{N} \sqrt{\|P_i (u - x_i)\|_2^2 + \varepsilon}.$$

• The basic iteration is based on the first-order necessary condition for a stationary point of the cost function:

$$\sum_{i=1}^{N} \frac{P_i (u - x_i)}{\sqrt{\|P_i (u - x_i)\|_2^2 + \varepsilon}} = 0.$$
• Acceleration using an SOR (successive over-relaxation) type of step-size factor:

$$u^{k+1} = u^k + \omega (v - u^k), \quad\text{where } \omega \in [0, 2].$$

• On incomplete data sets the operations can be performed correspondingly by projecting all computation onto the available values! (See the sketch below.)
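A compact sketch of an SOR-accelerated Weiszfeld iteration on incomplete data, assuming the ε-smoothed weights from the previous slides; the value of ω, the tolerance, and the available-case-mean initialization are illustrative choices:

```python
import numpy as np

def sor_weiszfeld(X, omega=1.5, eps=1e-6, tol=1e-8, max_iter=500):
    """Spatial median of an incomplete data set; P_i is realized as a 0/1
    mask that projects all computation onto the observed components."""
    P = ~np.isnan(X)                         # projectors onto available values
    Xz = np.where(P, X, 0.0)
    u = Xz.sum(axis=0) / P.sum(axis=0)       # start from the available-case mean
    for _ in range(max_iter):
        r = np.where(P, u - X, 0.0)          # P_i (u - x_i)
        w = 1.0 / np.sqrt((r ** 2).sum(axis=1) + eps)
        # Weiszfeld step: solve the stationarity condition for v componentwise.
        v = (w[:, None] * Xz).sum(axis=0) / (w[:, None] * P).sum(axis=0)
        u_new = u + omega * (v - u)          # SOR over-relaxation, omega in [0, 2]
        if np.linalg.norm(u_new - u) < tol:
            return u_new
        u = u_new
    return u

X = np.array([[0.0, 0.0], [1.0, np.nan], [0.0, 1.0], [100.0, 100.0]])
print(sor_weiszfeld(X))   # stays near the bulk of the data despite the outlier
```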
• In this variant the nonsmooth data points are removed from the computations.
• The basic iteration is based on the first-order necessary condition for a stationary point of the cost function:

$$\sum_{i:\, x_i \notin B(u, r)} \frac{P_i (u - x_i)}{\|P_i (u - x_i)\|_2} = 0.$$
• The time complexity of one SOR/ASSOR iteration corresponds to two/three cost function evaluations (O(n))
• The number of iterations of the SOR-based methods is remarkably small compared to the number of function evaluations of the classical optimization methods
• CG with the ε-approximation problem formulation gives inaccurate solutions
• Needed to avoid locally optimal solutions
• Random
  – Practical only for small data sets
• Distance optimization
  – Maximizes between-cluster distances
  – Sensitive towards outliers
• Density estimation
  – Tries to detect dense regions of the data by clustering several subsamples
  – Easily generalized to robust clustering
  – Trimming
• Silhouettes
• L1-data-depth based robust methods
• Trimmed silhouettes
  – Concentrate on the tight cluster cores and between-cluster distances
  – Trim away the most distant 50% of the cluster points (with respect to the spatial median)
  – Compute the average distance a(Ck) from the remaining most central points to the closest cluster prototype (measures the tightness of the cluster core)
  – Similarly, compute the average distance b(Ck) from the most central points to the second-closest cluster prototype (measures the between-cluster distance)
• The trimmed silhouette width for cluster k is computed as

$$s_r(C_k) = \frac{b(C_k) - a(C_k)}{\max\{a(C_k),\, b(C_k)\}}.$$

• The robust silhouette coefficient is defined as (a sketch follows)

$$s_r(K) = \frac{1}{K} \sum_{i=1}^{K} s_r(C_i).$$
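An illustrative sketch of the trimmed silhouette computation; for brevity it trims with respect to the given prototypes rather than recomputing spatial medians, which is an assumption on my part (K ≥ 2 is assumed):

```python
import numpy as np

def trimmed_silhouette(X, labels, prototypes):
    """Robust silhouette coefficient from the trimmed silhouette widths above."""
    K = len(prototypes)
    widths = []
    for k in range(K):
        pts = X[labels == k]
        d_own = np.linalg.norm(pts - prototypes[k], axis=1)
        core = pts[np.argsort(d_own)[: max(1, len(pts) // 2)]]  # trim distant 50%
        d = np.stack([np.linalg.norm(core - c, axis=1) for c in prototypes])
        a = d[k].mean()                                  # tightness of the core
        b = min(d[j].mean() for j in range(K) if j != k) # second-closest prototype
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))
```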