VizTree
Huyen Dao and Chris Ackermann
Introductory example
These are two random bit sequences. One sequence is generated
by a computer and the other one by humans.
010110010111100110100100001000101
001101101011100001010101110111110
001101101101111110100110010010001
101000111100110110100010111100010
110100110110011010000001001100010
011100000111010011001011000010100
10

10001000101001000101010100001010
100010101110111101011010010111010
010101001110101010100101001010101
110101010010101010110101010010110
010111011110100011100001010000100
111010100011100001010101100101110
101
Which is which?
Introductory example
01011001011110011010010000100010100110110101110000101010111
01111100011011011011111101001100100100011010001111001101101
00010111100010110100110110011010000001001100010011100000111
01001100101100001010010

10001000101001000101010100001010100010101110111101011010010
11101001010100111010101010010100101010111010101001010101011
01010100101100101110111101000111000010100001001110101000111
00001010101100101110101
Not really random!
Subjects tried to create
randomness by alternating.
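One rough way to check for over-alternation is to measure how often adjacent bits differ. For independent fair bits the expected rate is 0.5, so a rate well above that would be consistent with deliberate alternating. The sketch below is only an illustration; `alternation_rate` is a name introduced here, and the two strings are the first lines of the sequences shown above.

```python
def alternation_rate(bits: str) -> float:
    """Return the fraction of adjacent bit pairs whose two bits differ."""
    if len(bits) < 2:
        raise ValueError("need at least two bits")
    flips = sum(a != b for a, b in zip(bits, bits[1:]))
    return flips / (len(bits) - 1)


# First line of each sequence from the slide above.
seq_1 = "01011001011110011010010000100010100110110101110000101010111"
seq_2 = "10001000101001000101010100001010100010101110111101011010010"

for name, seq in (("sequence 1", seq_1), ("sequence 2", seq_2)):
    print(name, round(alternation_rate(seq), 2))
```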
What does VizTree do?
• Analysis of time series data.
• Illustrates motifs and anomalies with
‘subsequence trees’
Figure: subsequence tree over binary data; length of subsequence = 3
Creating a Subsequence Tree
0 1 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 …
First window (length 3): 0 1 0
Creating a Subsequence Tree 2
0 1 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 …
Second window (length 3): 1 0 1
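The sliding-window step can be sketched as a frequency count: every window of length 3 is one path through the tree, and the count of a window corresponds to how often that path is taken. This is a minimal illustration, not VizTree's implementation; `window_counts` is a name introduced here, and only the visible part of the sequence is used.

```python
from collections import Counter


def window_counts(bits: str, length: int = 3) -> Counter:
    """Count every window of the given length, sliding one position at a time."""
    return Counter(bits[i:i + length] for i in range(len(bits) - length + 1))


# Visible part of the sequence shown on the slide.
bits = "0101100101111001101"
counts = window_counts(bits, 3)

# Each pattern is a root-to-leaf path; its count would set the branch thickness.
for pattern, n in counts.most_common():
    print(pattern, n)
```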
Discretizing
• Only discrete data can be visualized.
• Most data is continuous and needs to be
converted.
• Several steps to convert continuous data
into a tree structure:
1. PAA
2. SAX
PAA
A. Piecewise aggregate approximation (PAA) of the time
series:
1. Divide the time series into n segments of equal
length
2. Assign each segment a coefficient = average of the values in
that segment
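The two PAA steps can be sketched in a few lines. This is a minimal version under one assumption: when the series length is not divisible by n, segment boundaries are placed by integer division, so segment lengths may differ by one. The function name `paa` is introduced here for illustration.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: one mean value per segment."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    coefficients = []
    for i in range(n_segments):
        # Integer boundaries; segments are equal when n is divisible by n_segments.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        segment = series[start:end]
        coefficients.append(sum(segment) / len(segment))
    return coefficients


print(paa([1, 2, 3, 4, 5, 6], 3))
```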
SAX
B. Create an alphabet on the distribution space of the time series:
a. Divide the range into x regions such that a segment has equal
probability of falling into any one
b. Assign symbols to the regions from top to bottom
C. Assign each segment of the PAA a symbol based on the region in which
it resides.
Time series becomes a string: ‘bcbab’
Tree of continuous data
Instead of Boolean values, the
branches of the tree represent the
symbols:
- the top branch represents a
- the bottom branch
represents the last letter
A larger alphabet means more
branches.
window size = 3
# of symbols = 3
Alphabet size = 2
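A tree like this can be sketched as a nested dictionary in which each node stores how many words pass through it. The layout and the names `build_tree` and `branch_count` are illustrative assumptions, not VizTree's internal data structure, and the words are made up.

```python
def build_tree(words):
    """Build a symbol tree; each node counts the words passing through it."""
    root = {"count": 0, "children": {}}
    for word in words:
        node = root
        node["count"] += 1
        for symbol in word:
            node = node["children"].setdefault(
                symbol, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def branch_count(tree, prefix):
    """Return how many words start with the given prefix (0 if none)."""
    node = tree
    for symbol in prefix:
        node = node["children"].get(symbol)
        if node is None:
            return 0
    return node["count"]


# Made-up words over the alphabet {a, b}, three symbols each.
words = ["aba", "abb", "aba", "bab"]
tree = build_tree(words)
print(branch_count(tree, "ab"), branch_count(tree, "aba"))
```

With an alphabet of size k and w symbols per word, the full tree has up to k^w leaves, which is why a larger alphabet means more branches.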
Sliding window length
• Specifies the time frame of the pattern that
is being matched.
An appropriate length can be determined by using the ruler.
Figure: example windows of length = 12 and length = 24 on the time axis (0 to 24)
# of symbols per window
• Specifies how many discrete segments (symbols) fit into
the given time window
• Depends on the sliding window size and the frequency
of value changes
Figure: a window of length = 24 with its symbol string
Alphabet size
• Larger alphabet:
– Discrete representation is more fine-grained.
– Tree becomes harder to read.
Figure: time series (time axis 0 to 24) with regions labelled a, b, c and the resulting symbol string
Parameters
• Length of the sliding window
– For focusing on certain intervals
• # of symbols per window
– The size of the pattern being analyzed
• Alphabet size
– The number of discrete values.
Time Series Data Mining Tasks
1. Subsequence matching
2. Time series motif discovery
3. Anomaly Detection
Advanced settings
• Cull trivial matches:
– Consecutive strings that are the same: ‘dcb’, ‘dcb’
– Consecutive strings where no pair of corresponding symbols is more than
one symbol apart: ‘dcb’, ‘cba’
• Chunking instead of actually sliding the window
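Both settings can be sketched briefly. The slides do not say which string a candidate is compared against, so this sketch assumes each word is compared with the last word that was kept. All function names are introduced here for illustration.

```python
def is_trivial_match(previous: str, current: str) -> bool:
    """True if the words are identical or differ by at most one symbol
    step at every position (for example 'dcb' and 'cba')."""
    if len(previous) != len(current):
        return False
    return all(abs(ord(p) - ord(c)) <= 1 for p, c in zip(previous, current))


def cull_trivial(words):
    """Drop words that trivially match the last kept word (assumed rule)."""
    kept = []
    for word in words:
        if kept and is_trivial_match(kept[-1], word):
            continue
        kept.append(word)
    return kept


def chunk(symbols: str, window: int):
    """Chunking: step by a whole window instead of sliding by one symbol."""
    return [symbols[i:i + window]
            for i in range(0, len(symbols) - window + 1, window)]


print(cull_trivial(["dcb", "dcb", "cba", "aad"]))
print(chunk("abcabca", 3))
```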
VizTree and Data Mining Tasks
Subsequence Matching
• The exact pattern need not be
known for a query: a concise
description of the pattern
is enough.
• Selecting a branch shows
all matching subsequences
and highlights their
occurrences in the time
series.
VizTree and Data Mining Tasks
Time Series Motif Discovery
• Motif: “previously unknown,
frequently occurring patterns”
• Discovery is simple: frequently
occurring patterns appear as thick
branches
• Traditional motif discovery
algorithms are slow
• VizTree builds frequency into the
visualization, so motifs can be
found quickly
• Highlights where motifs
occur
Lin et al. 2005
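Reading motifs off thick branches amounts to ranking words by frequency and remembering where each one occurs. The sketch below illustrates that idea on a made-up symbol string; it counts overlapping windows without culling trivial matches, and `find_motifs` is a name introduced here.

```python
from collections import defaultdict


def find_motifs(symbols: str, length: int, top: int = 3):
    """Return the most frequent words with the positions where they start."""
    positions = defaultdict(list)
    for i in range(len(symbols) - length + 1):
        positions[symbols[i:i + length]].append(i)
    ranked = sorted(positions.items(),
                    key=lambda item: len(item[1]), reverse=True)
    return ranked[:top]


# Made-up symbol string; the start positions are what would be highlighted.
for word, starts in find_motifs("abcabcabbabc", 3, top=2):
    print(word, starts)
```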
VizTree and Data Mining Tasks
Anomaly Discovery
• Simple cases:
observing very thin
branches in
subsequence trees.
• More complex cases:
Diff Trees.
• Thick branches of
vivid green or blue
indicate anomalies in the
second time series.
Lin et al. 2005
Diff Tree
• Contains an analysis of two time series, A and B
• Shows the frequency of patterns in B in relation to their
frequency in A
• Two values used in creation:
– Support: whether a pattern is overrepresented (occurs more
frequently) or underrepresented (occurs less
frequently) in B
– Confidence: how prevalent the pattern is in A
– Support => Thickness of branches
– Confidence => Color intensity of branches
• Also: surprisingness, which ranks the most anomalous
patterns
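The slides give no formulas for these values, so the sketch below does not reproduce VizTree's support, confidence, or surprisingness. It only illustrates the underlying comparison under a stated assumption: each pattern is scored by the difference between its relative frequency in B and in A, and patterns are ranked by the size of that difference. The function names and the two strings are made up.

```python
from collections import Counter


def relative_freq(symbols: str, length: int) -> dict:
    """Relative frequency of each word of the given length."""
    counts = Counter(symbols[i:i + length]
                     for i in range(len(symbols) - length + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def diff_scores(a: str, b: str, length: int):
    """Illustrative score: frequency in B minus frequency in A.
    Positive means overrepresented in B, negative underrepresented."""
    fa, fb = relative_freq(a, length), relative_freq(b, length)
    scores = {w: fb.get(w, 0.0) - fa.get(w, 0.0) for w in set(fa) | set(fb)}
    # Largest absolute difference first, as a stand-in for a ranking.
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


for word, score in diff_scores("ababab", "abbbbb", 2):
    print(word, round(score, 2))
```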
What is great about VizTree?
• Simple graphical representation:
– Straightforward
– Powerful: Can show lots of different subsequences in
a simple tree structure
– Simple and easy to understand description of
subsequences through strings.
• Quick analysis
– The subsequence trees and diff trees render quickly
– Since the relevant information is encoded in the tree, motifs
and anomalies can be spotted quickly
Weaknesses
• It is difficult to find the right combination of parameters
– An idea would be to superimpose the effect of parameters on
original graph (discrete values, sliding window length etc.)
• Zooming is rather inconvenient
– This could be solved by using another zooming technique, such
as fish-eye.
• Usability could be improved
– It would be informative to see how the alphabet is defined over the
dataset.
– The subtree view does not indicate where in the main tree it is, so
it is easy to lose track
– The time series scales are not adjustable, so it can be hard to
tell where subsequences lie in terms of time
– Nodes are hard to select