Transcript slides

Fair Use Agreement
This agreement covers the use of all slides on this
CD-ROM. Please read it carefully.
• You may freely use these slides for teaching, if:
  • You send me an email telling me the class number/university in advance.
  • My name and email address appear on the first slide (if you are using all or most
    of the slides), or on each slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if:
  • You send me an email telling me the conference name in advance.
  • My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report/
  conference paper/thesis/journal etc.). If you wish to do this, email me first; it is
  highly likely I will grant you permission.
(c) Eamonn Keogh, [email protected]
Towards Parameter Free Data Mining
Stefano Lonardi, Chotirat Ann Ratanamahatana, and Eamonn Keogh
Computer Science & Engineering Department, University of California Riverside
Outline of Talk
• Why parameter laden data mining algorithms
are troublesome…
• A parameter free/lite approach
• Empirical evaluation
• Parameter free/lite clustering
• Parameter free/lite anomaly detection
• Parameter free/lite classification
• Conclusions
Parameters
Anecdotal observations
• A data mining paper I once reviewed had 14
parameters…
• More than 5 parameters seems the norm for most
papers…
• Some algorithms have only 4 parameters, but the
algorithm is tested on synthetic data, created by an
algorithm with 3 or 4 parameters…
Is a surfeit of parameters a bad thing?
Parameter laden algorithms are often
hard to implement
By definition, parameter laden algorithms tend to
be hard to implement. But don’t we want people to
use our work?
This fact (combined with general laziness)
probably accounts for the dearth of strawman
comparisons in most papers. See Keogh &
Kasetty, SIGKDD 2002.
Parameter laden algorithms make it
hard to reproduce others' results
We re-implement many other papers (63 for this
paper alone). Yet we are rarely able to reproduce
others' results, even when using the same datasets.
One reason is that we typically don’t know how
someone set the 5, 6, 7, 8, 9 parameters used in the
original paper.
Are irreproducible results scientifically meaningful?
Parameter laden algorithms make it
hard not to overfit
Overfitting is a problem in data mining. If we have 5 to 10
parameters to set, avoiding overfitting is nearly
impossible.
This makes it hard to evaluate the contribution of a paper.
So you got 95% accuracy on the “bull/bear” classification
problem. Did you…
• Spend 2 minutes adjusting the parameters (I am somewhat impressed)
• Spend 2 weeks adjusting the parameters (I am not impressed)
Yet often we don’t know which!
(See Pedro Domingos POE papers)
Parameter laden algorithms are nearly
useless to the non-specialist
If we ask a Cardiologist / CEO / Lawyer / School
Administrator to use an algorithm with 10
parameters, then the best we can hope for is that
they give up in frustration!
The Central Claim of this Paper
A very simple (12 lines of code) parameter
free algorithm can outperform many
(most/all?) other algorithms, for many data
mining tasks.
Quick Review
The Kolmogorov complexity K(x) of a string x
is defined as the length of the shortest program
capable of producing x on a universal computer
— such as a Turing machine.
The conditional Kolmogorov complexity K(x|y)
of x to y is defined as the length of the shortest
program that computes x when y is given as an
auxiliary input to the program.
The function K(xy) is the length of the shortest
program that outputs y concatenated to x.
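For reference, here is a compact restatement of the three definitions above (the
notation, with U a fixed universal Turing machine and |p| the length of a program p,
is ours and adds nothing beyond the text):

    K(x)    = min{ |p| : U(p) = x }
    K(x|y)  = min{ |p| : U(p, y) = x }
    K(xy)   = min{ |p| : U(p) = xy }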
The Similarity Metric
Ming Li et al.

This is the optimal similarity measure, but intractable…

    d_k(x, y) = ( K(x|y) + K(y|x) ) / K(xy)

This is a cheap approximation…

    d_c(x, y) = ( C(x|y) + C(y|x) ) / C(xy)

C(x) means the size of x compressed by an off-the-shelf
compression algorithm like WinZip, StuffIt, etc.
Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi: The similarity metric. SODA 2003: 863-872
The (very) Minor Extension
proposed here
This is the optimal similarity measure, but intractable…

    d_k(x, y) = ( K(x|y) + K(y|x) ) / K(xy)

This is a cheap approximation…

    d_c(x, y) = ( C(x|y) + C(y|x) ) / C(xy)

This is a cheaper approximation…

    CDM(x, y) = C(xy) / ( C(x) + C(y) )

Note we flipped the numerator and denominator to go from similarity to dissimilarity.
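A worked example of how to read the measure (the byte counts below are invented
purely for illustration): if x and y each compress to about 1,000 bytes on their
own, and their concatenation xy compresses to only 1,100 bytes because the
compressor reuses structure learned from x while encoding y, then
CDM(x, y) = 1100 / 2000 = 0.55, a low dissimilarity. If instead xy compresses to
about 2,000 bytes, then CDM(x, y) ≈ 1.0, indicating the compressor found no
structure shared between x and y.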
Compression Based Distance Measure
    CDM(x, y) = C(xy) / ( C(x) + C(y) )
function dist = CDM(A, B)

save A.txt A -ASCII                  % Save variable A as A.txt
zip('A.zip', 'A.txt');               % Compress A.txt
A_file = dir('A.zip');               % Get file information

save B.txt B -ASCII                  % Save variable B as B.txt
zip('B.zip', 'B.txt');               % Compress B.txt
B_file = dir('B.zip');               % Get file information

A_n_B = [A; B];                      % Concatenate A and B
save A_n_B.txt A_n_B -ASCII          % Save A_n_B.txt
zip('A_n_B.zip', 'A_n_B.txt');       % Compress A_n_B.txt
A_n_B_file = dir('A_n_B.zip');       % Get file information

% Return CDM dissimilarity
dist = A_n_B_file.bytes / (A_file.bytes + B_file.bytes);
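A minimal usage sketch (the example signals below are our own invention; note that,
as the later slides argue, real-valued series should first be discretized, e.g.
SAXified, before the compressor can exploit their structure):

x = sin(linspace(0, 20*pi, 1000))';          % two related example signals
y = sin(linspace(0, 20*pi, 1000) + 0.5)';
d_related   = CDM(x, y)                      % intended to be smaller for related sequences
d_unrelated = CDM(x, randn(1000, 1))         % and closer to 1 for unrelated ones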
But wait! Isn't the choice of compression
algorithm itself a parameter?
No! We are trying to
approximate the optimal
Kolmogorov compression.
So the best compression
algorithm to use is the
one that gives the
smallest file size.
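A hedged sketch of that idea in MATLAB, trying the two compressors that ship with
MATLAB (zip and gzip) and keeping the smaller result; the helper name and the choice
of candidate compressors are our assumptions, not something prescribed by the talk:

function bytes = best_compressed_size(X)
% Size of X under the best of several off-the-shelf compressors (a sketch).
save tmp.txt X -ASCII                                 % write X as ASCII text
zip('tmp.zip', 'tmp.txt');    z = dir('tmp.zip');     % candidate 1: zip
gzip('tmp.txt');              g = dir('tmp.txt.gz');  % candidate 2: gzip
bytes = min(z.bytes, g.bytes);                        % keep whichever is smaller

A CDM variant could call this helper in place of the fixed zip call in the CDM
function above.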
The clustering achieved by CDM on 16,300 symbols from the
mitochondrial DNA of 12 primates, and one “outlier” species.
[Dendrogram: Ring-Tailed Lemur, Malayan Flying Lemur, Capuchin, Gibbon, Sumatran
Orangutan, Orangutan, Gorilla, Human, Pygmy Chimpanzee, Chimpanzee, Barbary Ape,
and Baboon, with the Oyster as the outlier, grouped into prosimians,
cercopithecoids, catarrhines, anthropoids, hominoids, greater apes, pongines, and
panines. This is the correct clustering for these species; one taxonomy level is
marked as controversial. Sang-Hee Lee (noted physical anthropologist) is pictured.]
The clustering achieved on the text from various Yahoo portals (January 15, 2004).
The smallest webpage had 1,615 characters, excluding white spaces.

[Dendrogram over the portals for: USA, Germany, Sweden, Norway, Denmark, Italy,
Catalan, Brazil, Spain, Mexico, and Argentina.]
The clustering achieved on the text from the first fifty chapters of Genesis.

[Dendrogram over translations in: Maori, English, Latin, Italian, Dutch, French,
German, Norwegian, and Danish.]
But DNA strings and
natural language text are
inherently discrete. Much
of data mining is on real-valued
data, like images, video,
and time series.
Yeah, what
about time
series?
What about Time Series? Part I
In the example below, 1 and 3 are very similar ECGs...
[Excerpts of the raw values: series 1 begins 0.138125, 0.048750, 0.103750, …;
series 2 begins 0.512500, 0.500000, 0.500000, …; series 3 begins 0.495615,
0.496042, 0.496531, …]
What about Time Series? Part II
[Figure: a time series of about 120 points, discretized segment by segment into
the SAX symbols a, b, and c.]
SAXIFY!
In the SAX
representation, every
bit has about the same
importance…
[The three series from the previous slide after SAXification; for example,
series 1 (0.138125, 0.048750, 0.103750, …) becomes the string baabccbc.]
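A minimal sketch of a SAX-style discretization in MATLAB, assuming an alphabet of
size three with the standard Gaussian breakpoints -0.43 and 0.43; the function name,
the fixed alphabet, and the handling of leftover points are our simplifications, not
the exact SAX implementation behind these slides:

function word = saxify(ts, word_len)
% Discretize a real-valued series into a short symbolic string (a sketch).
ts  = (ts(:) - mean(ts)) / std(ts);               % z-normalize
n   = word_len * floor(numel(ts) / word_len);     % drop any leftover points
paa = mean(reshape(ts(1:n), [], word_len), 1);    % piecewise aggregate (segment means)
idx = ones(size(paa));                            % 'a' for segments below -0.43
idx(paa > -0.43) = 2;                             % 'b' for segments between the breakpoints
idx(paa >  0.43) = 3;                             % 'c' for segments above 0.43
symbols = 'abc';
word = symbols(idx);                              % e.g. a word like 'baabccbc'

The resulting strings can then be written to text files and compared with the CDM
function from the earlier slide.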
How well does CDM time series clustering work?
• We compared to 51 other measures (we re-implemented
every time series measure from SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SSDM,
PKDD and PAKDD conferences in the last decade).
• We optimized the parameters of the above
• We tested on diverse problems
• Homogeneous datasets: all ECGs, etc.
• Heterogeneous datasets: A mixed bag of 18 diverse datasets
• We used a formal metric to score the clusterings
(but here we will just show a few pictures)
CDM does best (by a huge margin) on the heterogeneous data problems.

[Dendrogram leaves: Reel 2: Tension; Reel 2: Angular speed; Koski ECG: Fast 1/2;
Koski ECG: Slow 1/2; Dryer hot gas exhaust; Dryer fuel flow rate; Ocean 1/2;
Evaporator: vapor flow; Evaporator: feed flow; Furnace: cooling input; Furnace:
heating input; Great Lakes (Ontario); Great Lakes (Erie); Buoy Sensor: East
Salinity; Buoy Sensor: North Salinity; Sunspots: 1749 to 1869; Sunspots: 1869 to
1990; Exchange Rate: German Mark; Exchange Rate: Swiss Franc; Foetal ECG thoracic;
Foetal ECG abdominal; Balloon1; Balloon2 (lagged); Power Demand: Jan-March and
April-June (Dutch); Power Demand: Jan-March and April-June (Italian).]
CDM does best (by a huge margin) on the homogeneous data problems.

[Figure: two dendrograms over 20 ECG excerpts drawn from the MIT-BIH Noise Stress
Test Database (nstdb, record 118e6), the Long Term ST Database (ltstdb, record
20021), and the BIDMC Congestive Heart Failure Database (chfdb, records chf02 and
chf15). One panel shows the second best clustering (of 51 approaches); the other
shows CDM's clustering.]
OK, you can cluster stuff,
what about other data
mining problems, like
classification, motif
discovery, anomaly
detection?
Yes, what about
anomaly
detection?
Parameter-Free Anomaly Detection
The connection between the notion of
similarity and anomalies is inherent in
the English idiom.
When confronted with an anomalous
object or occurrence, people usually
exclaim "I have never seen anything like
it!" or "It is like nothing you have ever
seen".
A Parameter-Free Anomaly Detection Algorithm
function loc_of_anomaly = kolmogorov_anomaly(data)
loc_of_anomaly = 1;
while size(data,1) > 2
    if CDM(data(1:floor(end/2),:), data) < CDM(data(ceil(end/2):end,:), data)
        loc_of_anomaly = loc_of_anomaly + size(data,1) / 2;
        data = data(ceil(end/2):end,:);
    else
        data = data(1:floor(end/2),:);
    end
end
…
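A minimal usage sketch (the file name is hypothetical, and in practice a real-valued
series would first be SAXified as on the earlier slides):

ecg = load('ecg_excerpt.txt');              % hypothetical file: one column of readings
loc = kolmogorov_anomaly(ecg);              % approximate index where the anomaly starts
fprintf('Anomaly near sample %d\n', round(loc));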
A problem from SIGKDD 2003…

[Figure: panels A-E correspond to the five approaches listed below, applied to a
time series of roughly 1,200 points.]
A) Our Approach
B) Support Vector Machine (SVM) based approach (6 parameters).
C) Immunology (IMM) inspired approach (5 parameters).
D) Association Rule (AR) based approach (5 parameters).
E) TSA-tree Wavelet based approach (3 parameters).
For each experiment we spent one hour of CPU time and one hour of human time trying
to find the best parameters, and reported only the best results.
What happens when we change the anomaly?
A) Our Approach
B) Support Vector Machine (SVM) based approach (6 parameters).
C) Immunology (IMM) inspired approach (5 parameters).
D) Association Rule (AR) based approach (5 parameters).
E) TSA-tree Wavelet based approach (3 parameters).
[Figure: the same five approaches (A-E) applied to the modified time series of
roughly 1,200 points.]
A small excerpt from dataset 118e06 from the MIT-BIH
Noise Stress Test Database. The full dataset is 21,600 data
points long. Here, we show only a subsection containing
the two most interesting events detected by our algorithm
[The excerpt shown spans roughly samples 2,000 to 5,000 of the recording.]
The gray markers are independent annotations by a
cardiologist indicating Premature Ventricular Contractions
Aerospace Corp Problems
The results of using our algorithm on various datasets from the Aerospace Corp collection. The
bolder the line, the stronger the anomaly. Note that because of the way we plotted these there is a
tendency to locate the beginning of the anomaly as opposed to the most anomalous part.
[Figure: four Aerospace Corp time series, labeled L-1v, L-1t, L-1n, and L-1u, each
roughly 1,000 points long, plotted with the detected anomalies marked by bolder
line segments.]
[Figure: a time series of about 600 points tracking an actor's hand, annotated:
hand resting at side; hand above holster; actor misses holster; briefly swings gun
at target, but does not aim; aiming at target; laughing and flailing hand.]
The parameter-free
algorithm generalizes to
multi-dimensional time
series, without changing a
single line of code!
What about
classification?
Yes, what about
classification?
Parameter-Free
(Structural) Time
Series Classification
Error rates (lower is better):

                   Euclidean    DTW        CDM
ECG: signal 1      42.25 %      16.25 %     6.25 %
ECG: signal 2      47.50 %      11.25 %     7.50 %
Gun: 2 classes      5.00 %       0.00 %     0.00 %
Gun: 4 classes     37.50 %      12.50 %     5.00 %
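A hedged sketch of how such numbers could be produced with CDM, using leave-one-out
one-nearest-neighbour classification; the function name and the 1-NN evaluation
scheme are our assumptions, since the talk does not spell out the classifier:

function err = cdm_1nn_error(series, labels)
% series: cell array of numeric column vectors; labels: vector of class labels.
n = numel(series);
wrong = 0;
for i = 1:n
    best_d = inf;  best_j = 0;
    for j = [1:i-1, i+1:n]                     % leave series{i} out
        d = CDM(series{i}, series{j});         % CDM from the earlier slide
        if d < best_d
            best_d = d;  best_j = j;
        end
    end
    wrong = wrong + (labels(best_j) ~= labels(i));
end
err = wrong / n;                               % leave-one-out error rate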
What are the limitations
of parameter-free
algorithms?
The data cannot be
arbitrarily small, but
the amazing clustering
results shown here only
used 1,000 data points
per object!
Conclusions
Parameter laden algorithms are bad!
• They are often hard to implement
• They make it hard to reproduce others' results
• They make it difficult to judge the significance of a contribution
• They make it hard not to overfit
• They are next to useless to the non-specialist
• Parameter free/lite algorithms are good!
• They require little coding effort
• They make it trivial for others to reproduce your results
• They make it nearly impossible to overfit
• They have a real chance of being used by the non-specialist
• You should use a parameter free/lite approach
before doing anything more complex!
Questions?
All datasets and code used in this talk can be found at
www.cs.ucr.edu/~eamonn/TSDMA/index.html