Some notes on VizTree

Download Report

Transcript Some notes on VizTree

Fair Use Agreement
This agreement covers the use of all slides on this
CD-Rom, please read carefully.
• You may freely use these slides for teaching, if
• You send me an email telling me the class number/ university in advance.
• My name and email address appears on the first slide (if you are using all or most of the slides), or on each
slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if
• You send me an email telling me the conference name in advance.
• My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report/
conference paper/ thesis/ journal etc). If you wish to do this, email me first, it is
highly likely I will grant you permission.
(c) Eamonn Keogh, [email protected]
08/25/2004
KDD ‘04
1
Visually Mining and Monitoring
Massive Time Series
Jessica Lin*
Eamonn Keogh
Stefano Lonardi
Jeffrey Lankford
Donna Nystrom
08/25/2004
KDD ‘04
(UC Riverside)
(The Aerospace Corp)
2
Motivation
• Before the launch of any unmanned space vehicle, a
critical “go/no go” decision must be made.
– Data from past launches is available to assist in the decisionmaking.
– Streaming telemetry must be constantly monitored to detect
any potential problems.
• A single framework is needed to perform these two
tasks.
– Existing tools inadequate for such tasks.
08/25/2004
KDD ‘04
3
Introduction
• We introduce VizTree
– Mining archival data
• Pattern discovery
– repeated pattern discovery (motif discovery),
– anomaly detection,
– query-by-content
– Monitoring incoming streaming data
• Why visualization?
– human eye is often advocated as the ultimate datamining tool
– User-interaction allows visual data exploration and
hypotheses testing
08/25/2004
KDD ‘04
4
Outline
•
•
•
•
Introduction
Related Works
VizTree Motivation
VizTree Implementation
– Time Series Discretization
• Experimental Evaluation
• Diff Tree
• Discussion/Conclusion
08/25/2004
KDD ‘04
5
Related Work 1:
Time Series Spirals
400
One year of power demand data
300
400
•
•
Spiral Axis = serial attributes are
encoded as line thickness
Radii = periodic attributes
200
300
200
100
100
0
0
F
-100
-100
-200
-200
-300
Carlis & Konstan. UIST-98
-300
-400
-400
-400
-400
-300
De
-300
-200
-200
-100
-100
0
100
0
200
100
300
400
200
300
Independently rediscovered by
Weber, Alexa & Müller InfoVis-01
But dates back to 1888!
08/25/2004
KDD ‘04
6
Related Work 2:
TimeSearcher
Comments
• Simple and intuitive
• Highly dynamic
exploration
• Query power may be
limited and simplistic
• Limited scalability
Hochheiser, and Shneiderman
08/25/2004
KDD ‘04
7
Related Work 3 – Calendar-based
The cluster and calendar-based visualization on employee working hours data. It
shows six clusters, representing different working-day patterns.
08/25/2004
KDD ‘04
8
Motivation of VizTree
010110010111100110100100001000
101001101101011100001010101110
1111100011011011011111101001100
100100011010001111001101101000
101111000101101001101100110100
000010011000100111000001110100
1100101100001010010
08/25/2004
10001000101001000101010100001
010100010101110111101011010010
11101001010100111010101010010
10010101011101010100101010101
101010100101100101110111101000
111000010100001001110101000111
00001010101100101110101
Here are two sets of bit
strings. Which set is
generated by a human and
which one is generated by
a computer?
KDD ‘04
9
VizTree
010110010111100110100100001000101
001101101011100001010101110111110
001101101101111110100110010010001
101000111100110110100010111100010
110100110110011010000001001100010
011100000111010011001011000010100
10
0
1
10001000101001000101010100001010
100010101110111101011010010111010
010101001110101010100101001010101
110101010010101010110101010010110
010111011110100011100001010000100
111010100011100001010101100101110
101
0
1
0
0
1
1
Lets put the sequences into a depth limited tree, such
that the frequencies of all triplets are encoded in the
thickness
08/25/2004 of branches…
KDD ‘04
10
“humans usually try to fake randomness by alternating patterns”
VizTree
The “trick” on the previous slide
only works for discrete data, but
time series are real valued.
Details 2
Zoom in
But we can
SAX up a time
series to make
it discrete!
Overview
VisTree
• Convert the time series to SAX
• Push the data in a depth-limited
suffix tree
• Encode the frequencies as the
line
thickness
08/25/2004
KDD ‘04
Overview, zoom
& filter, details
on demand
Details 1
11
SAX
Symbolic Aggregate
ApproXimation
baabccbc
08/25/2004
KDD ‘04
12
How do we obtain SAX?
C
C
0
60
80
100
120
a
a
b
b
c
0
08/25/2004
40
a
First convert the time
series to PAA
representation, then
convert the PAA to
symbols
It take linear time
20
20
KDD ‘04
b
c
40
60
80
100
120
bccbaaba
13
Visual Comparison
3
2
1
0
-1
-2
-3
a
b
c
d
e
f
DFT
PLA
Haar
APCA
A raw time series of length 128 is transformed into the
word “aaaaaabbbccdeffdcbbdcdefffffdccbb.”
– We can use more symbols to represent the time series since each
08/25/2004
‘04
symbol requires fewer bits than KDD
real-numbers
(float, double)
15
Subsequence Matching/
Motif Dicovery
Ben Shneiderman
Zoom in
Overview, zoom
& filter, details
on demand
This example demonstrates subsequence matching and motif discovery.
We want to find a U-shaped pattern, so we’d try something that starts high,
descends, and then ascends again. Clicking on “abdb” shows such
patterns.
Motif Discovery
Clicking on “abxx” shows this repeated patterns
08/25/2004
KDD ‘04
17
Anomaly Detection 1
08/25/2004
Clicking on the branch “acxx”
shows the anomalous heartbeat
KDD ‘04
18
Anomaly Detection 2
Clicking on “bab” shows the anomalous week (Christmas). Instead of a
08/25/2004
KDD ‘04
normal 5-working-day week, it has 3-working
day during Christmas.
19
Diff Tree
DiffTree
• Convert the two time
series to SAX
• Push the data in a depthlimited suffix tree
• Encode the difference of
frequencies as the line
thickness
• Encode the significance
of difference as the line
color intensity
• Rank the surprising
patterns
08/25/2004
Blue lines - pattern is more common in A
GreenKDD
lines‘04- pattern is more common in B
Red lines – surprising patterns
20
Diff Tree 2
08/25/2004
KDD ‘04
21
Scalability
• The pixel space of the tree is determined solely
by the number of segments and alphabet size.
– Constant and independent of the size of time series
– Size of the dataset plays a role in memory space,
since each node in the tree stores the offsets of its
subsequences. However, SAX allows efficient
numerosity reduction to reduce the number of
subsequences being included into the tree
• large amounts of dimensionality reduction do not
greatly affect the accuracy of our results (for the
power dataset, the dimensionality is reduced
from 672 to 3, a compression ratio of 224-to-1).
08/25/2004
KDD ‘04
22
Conclusion
• We propose VizTree, a novel time series
visualization tool for pattern discovery.
– Frequently occurring patterns
– Anomaly detection
– Query-by-content
• A single framework that allows both mining
of the archival data and monitoring of
streaming data.
• Highly scalable.
08/25/2004
KDD ‘04
23
Future Work
• Researchers from other sectors of the industry
can greatly benefit from our system as well.
– it could potentially be used for indexing and editing
video sequences.
• Problems that can be indirectly solved:
– Subsequence Clustering
– Time Series Rule Discovery
• While we mainly focus on the “mining” aspect in
this paper, we will extend VizTree to accept
online streaming data for monitoring purposes.
08/25/2004
KDD ‘04
24