Apweb220PPt - NEC Laboratories America

Download Report

Transcript Apweb220PPt - NEC Laboratories America

Effective Variation Management for
Pseudo Periodical Streams
Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan
Miao, Dongqing Yang, Xinbiao Zhou
School of EECS
Peking University
2015/7/17
ACM SIGMOD 2007
Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion
•2015/7/17
ACM SIGMOD 2007
2/45
Pseudo Periodical Stream
Pseudo Periodical Stream
Data seems to repeat in a certain period
Tiny variation exists between different periods
Common in the domain of medical, seismology
Typical stream variations: gradual evolutions rather
than burst changes
•2015/7/17
ACM SIGMOD 2007
3/45
An Example of Pseudo Periodical Stream
The respiratory data repeats about every 3.2 seconds
Reflects the evolution of the patient’s illness during five hours
•2015/7/17
ACM SIGMOD 2007
4/45
Variation Management on Data Stream
Data streams are widely applied in many domains
Stock market analysis
Road traffic control
Medical signal processing
Online variation management -- an important task
When did the variation occur? (Detect variations)
What is the variation ? / How does it change? (Describe
variations)
Why it turns to change in this way ? (Help understanding
variations )
•2015/7/17
ACM SIGMOD 2007
5/45
Major Technical Challenges
Value Type
Traditional Algorithms: Discrete values (enumerative) or
Time series (equidistant intervals)
Data stream: consecutive real number with variable sampling
frequencies
Training Sets or Models
Several training sets or predefined models
Data stream evolves and the models may not work soon
On the contrary, the system is required to generate such
models as output
•2015/7/17
ACM SIGMOD 2007
6/45
Major Technical Challenges II
Variation Type
Not only on abnormal values and distribution
The structure in a period (shape)
Noises: unpredictable, random
In many applications, the variations are monitored
manually
Our contribution: proposing a new method named
Pattern Growth Graph (PGG) to detect and store
variations over pseudo periodical streams
•2015/7/17
ACM SIGMOD 2007
7/45
Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion
•2015/7/17
ACM SIGMOD 2007
8/45
Data Stream Management Systems
Data stream work can be loosely classified in two
categories: DSMS and Online Data Mining
Data Stream Management Systems (DSMS)
Such as STREAM, Aurora, TelegraphCQ…
Mainly focus on completing predefined SQL queries
Not try to find the data features, or to monitor the variations
•2015/7/17
ACM SIGMOD 2007
9/45
Online data mining
Variation management is an important part of online
data mining
Three classes according to the algorithms
Symbolic Approaches
Mathematic Transformation
Predefined Models
Symbolic Approaches: Tarzan and SAX
Space: Put the entire time series/data stream in memory
Precision is not good for SAX
•2015/7/17
ACM SIGMOD 2007
10/45
Mathematic Transformation
Mathematic Transformation: Discrete Wavelet
Transform (DWT) and Fast Fourier Transform (FFT)
Require the data length fixed, as well as the sampling
frequency (equidistant intervals)
Haar wavelet transform can only perform on 2n data items,
e.g, the data length must be 1024 or 2048
Predefined Models: Using Zigzag to detect events in
financial streams (SIGMOD 04)
Too domain specific
Users can not provide such models in advance – actually
they would like them as the output
•2015/7/17
ACM SIGMOD 2007
11/45
Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion
•2015/7/17
ACM SIGMOD 2007
12/45
Task Specification by Respiration Stream
Respriation
1400
1200
A
B
C D
E
F
G
H
1000
800
600
400
200
0 0
Time (S)
3
6
9
12
1
801 the stream
1601
2401
Variation : Online
detect
variation
in 3201
one pass
Wave: The smallest unit concerned is not a single point, but
values in a certain period represented as a wave
Alarms: F is actually the noise caused by body movements
Summary: A summary with acceptable error bound is very
helpful
•2015/7/17
ACM SIGMOD 2007
13/45
System Framework
Send
Alarms
6
5
4
3
2
1
0
OutPut
Pattern
Evolutions
Pattern Growth Graph
Data Stream
Wave
Online
Update
Stream
View
Splitting
Unmatched
6
5
4
3
2
1
0
Wave-Pattern
Matching
Partially
Matched
Full
Matched
Wave Stream
New Pattern
Grow Pattern
Increase
Frequency
Online Variation Management
•2015/7/17
ACM SIGMOD 2007
14/45
Wave Splitting I
Variation: the difference from old data
Detected by comparing the old data and coming stream
Waste too much resources if comparing at each coming
item
Just comparing at each wave -- much more efficient
How to divide the stream according to the data features?
•2015/7/17
ACM SIGMOD 2007
15/45
Wave Splitting II
Fixed length window will accumulate error
Observation: The waves start and end at valley points that are smaller
than a certain value
•2015/7/17
ACM SIGMOD 2007
16/45
Upper Bound of Valley Points
User define
Update with the average
value of past valley points
•2015/7/17
N
U b   ( Vi ) / N
i 1
ACM SIGMOD 2007
17/45
Valley Sections
Valley Section: Approximate flat section represents the time
interval between two events
It is also worth to study as one part of the wave
Take the last point of the section as the cut point
•2015/7/17
ACM SIGMOD 2007
18/45
Two Problems in Online Matching I
Problem 1: The data stream’s sampling frequency is
usually high (>100Hz), waves should be simplified
Problem 2: How to compare two waves with different
time lengths, and may not have data at same time point?
A: {(10,0.5), (20, 1.0), (25, 1.3), …(90, 50.5)} 22 data items
B: {(11,0.5), (25, 1.2), (30, 1.7) … (87, 50)} 20 data items
•2015/7/17
ACM SIGMOD 2007
19/45
Two Problems in Online Matching II
Solution 1: Piecewise Liner Representation
Make Problem 2 more difficult: patterns are simplified
as segments, how to compare segments and points?
•2015/7/17
ACM SIGMOD 2007
20/45
Wave-pattern Matching
ECG
Time (s)
0
0.5
1.0
Time (s)
1.5
0
0.5
1.0
1.5
In real applications, two sequences are assumed to match if their
paths roughly coincide
PLR segments record paths of old data
Testing whether the incoming stream items are on the paths
The intensity of variations can be determined by the number of
matching items
•2015/7/17
ACM SIGMOD 2007
21/45
Record the Patterns
Observation: Many patterns just have few partial
segments changed
Most stream variations are gradual evolutions rather
than burst mutations
Recording by a simple list not only ignores their
relationship but also causes storage redundancy
Utilize the similarity among patterns and reuse the
unchanged parts
Pattern Growth Graph (PGG) is designed to store
patterns and the variation history
•2015/7/17
ACM SIGMOD 2007
22/45
Pattern Growth Graph
Implemented as bi-directional linked list
Only generate new segments on the un-matched data
New patterns seems to grow from the old one
1.5
Wave
Pattern 1
Pattern 2
2
1
3
1
4
5 6
1'
7
3'
2'
4'
8
0.5
0
1
51
1'
Pattern 2
(Growth Pattern)
Pattern 1
( Base Pattern)
•2015/7/17
101
151
2'
3'
4'
End
Start
End
1
2
3
4
5
6
7
8
ACM SIGMOD 2007
23/45
Construct Full Wave-pattern
New Problem: Wave-Pattern matching needs full pattern to
compare, while PGG only stores the new parts
Fortunately we can construct the full pattern by propagating the
pointers
Pattern 3
1"
1'
Pattern 2
End
2"
2'
3'
Collision
Pattern 1
Start
End
1
2
3
4
模式
5
6
7
8
9
左1
右1
左2
右2
Step 0
1”
2”
1’
2’
9
End
Step 1
1’ 1” 2’
9 2”
1
3’
8
\
Step 2
1 1’ 1” 2’3’
8 9 2”
Start
Final
1 1’ 1” 2’3’
8 9 2”
\
•2015/7/17
8 ( Collision! ) 7
\
\
ACM SIGMOD 2007
\
\
24/45
Problems for PGG size
Waves in data stream: N PGG size: k
Time complexity of PGG based matching algorithm is
O (k*n)
In the worst case, each incoming wave introduces a
new pattern: overall time cost is O (n2)
When PGG becomes larger, the algorithm is timeconsuming
PGG is not allowed to take “forgetting functions”
Hard to delete in PGG
Some uncommon patterns may have higher domain
significance
•2015/7/17
ACM SIGMOD 2007
25/45
Rank the Patterns
Observation: The most frequent pattern and its similar
patterns have the highest possibility to match the
incoming wave
Matching probability factor
The patterns with smaller probability are not deleted,
but have lower priority to be compared
When one pattern get a match, system not only
increase its own rank, also its “families”
•2015/7/17
ACM SIGMOD 2007
26/45
Reconstruct the Stream View with PGG
Queries on traditional DSMS
predefined, hard to conduct when data items passed by
Answer “the patient's ECG in the past five hours”
Record all patterns’ occurrence time in PGG
Reconstruct the stream view with PGG patterns
Only consumes about 4% storage space of the original
stream, but can provide an approximate stream view
within 5% relative error bound
•2015/7/17
ACM SIGMOD 2007
27/45
Track Pattern Evolution
To answer “Why will it change in this way ?”
User selects an interesting pattern, PGG can track the
source of it
•2015/7/17
ACM SIGMOD 2007
28/45
False Alarm
A successful system needs to reduce the false alarms introduced
by noises
The major problem: noises are caused by many sources, they
have various styles and are hard to be modeled
•2015/7/17
ACM SIGMOD 2007
29/45
Noise Reorganization
A short cut: considering the pattern’s evolution
history
Some strategies to reduce false alarms on medical
stream:
Unusual values in growth patterns: the patients’ condition
has been exacerbated -- Warning
New pattern, it matches successive waves: the underlying
pathology mechanism might have some fundamental
changes -- Warning
A series of new patterns and they all un-match the
previous/following waves -- suspected as noises
•2015/7/17
ACM SIGMOD 2007
30/45
System Framework
Send
Alarms
6
5
4
3
2
1
0
OutPut
Pattern
Evolutions
Pattern Growth Graph
Data Stream
Wave
Online
Update
Stream
View
Splitting
Unmatched
6
5
4
3
2
1
0
Wave-Pattern
Matching
Partially
Matched
Full
Matched
Wave Stream
New Pattern
Grow Pattern
Increase
Frequency
Online Variation Management
•2015/7/17
ACM SIGMOD 2007
31/45
Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion
•2015/7/17
ACM SIGMOD 2007
32/45
Experimental Setup
Data Set
Medical streams: Six real pathology signals including ECG,
respiration... (over 25,000,000 data points)
Earthquake waves: The pacific earthquake wave data from
the NGA project. (100,000 data points)
Sunspot data: All the sunspot records between the year 1850
and 2001 (55,000 data points)
Environment: Intel Pentium 4 3.0GHz CPU with 1GB
RAM, Windows XP Professional, JDK 1.5.0…
•2015/7/17
ACM SIGMOD 2007
33/45
Effect of Rank Function
At the beginning, the effect is insignificant.
After three million data points, the naive algorithm’s
performance decreases rapidly
In the end, the rank algorithm outperforms by about 300%
•2015/7/17
ACM SIGMOD 2007
34/45
Reconstruct the Stream View
ECG data stream (more than 10M data items) can be represented
with only 420 patterns
The amazing compressing result is achieved due to two factors
The PLR simplify can reduce the size of patterns to about 20%
PGG further reduces it to about 3.31% by compressing the repeating and
similar patterns (Patterns only need 0.3%, the rest 3% stores the
occurrence time of the patterns)
•2015/7/17
ACM SIGMOD 2007
35/45
Compared with Other Methods
Compared PGG with SAX (symbolic approaches), Discrete
Haar Wavelet Transformation (mathematic transformation) and
Zigzag (predefined models)
The processing efficiency is average 60K—70K items/sec
Much higher than real application needs
•2015/7/17
ACM SIGMOD 2007
36/45
Variation Detection & Noise Recognition
Two important measurements:
Sensitivity (High Positive Rate): The algorithm send alarms
at meaningful variations
Selectivity (Low Negative Rate): The algorithm does not
send false alarms on noises
The two measurements are conflict
Increasing sensitivity to find more variations will inevitably
cause more false alarms
In a medical environment, sensitivity is much more
important -- missing a meaningful variation may cost the
patient’s life
•2015/7/17
ACM SIGMOD 2007
37/45
Best Results of Sensitivity on Respiration Stream
Zigzag sends false alarm at almost every noise section
DWT and SAX nearly cannot distinguish real variations from
noises
•2015/7/17
ACM SIGMOD 2007
38/45
Results of Noise Recognition on Other Stream
For other stream, we take precision as the main measurement
PGG performs accurately and stably
Zigzag is volatile with different datasets:
Good on three blood pressure signals (ABP, CVP and ICP, meaningful
variations are outliners)
Poorly on PLETH (meaningful variations are of inner structures)
•2015/7/17
ACM SIGMOD 2007
39/45
Discussion
Zigzag: focuses on extreme data points, strongly
influenced by outliers
SAX: good at finding in a long period using frequency
statistics -- more suitable for time series
DWT: only effective for signals with strict periods
With the effective data structure, PGG discovers and
records as much features of the data stream as possible
The recorded information helps distinguish between
meaningful variations and noises
•2015/7/17
ACM SIGMOD 2007
40/45
Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion
•2015/7/17
ACM SIGMOD 2007
41/45
Conclusion
Streams are split as waves and represented by PLR
patterns
Detect variations by online wave-pattern matching
Pattern Growth Graph stores the variation history
Reconstruct the stream view with high accuracy
Effectively distinguish meaningful variations from
noises
•2015/7/17
ACM SIGMOD 2007
42/45
The System Interface
•2015/7/17
ACM SIGMOD 2007
43/45
Future Work
Extend PGG to multiple streams
Implement the PGG method in other application
domains such as weather forecasting and financial
analysis
Combine with other methods, like Zigzag…
•2015/7/17
ACM SIGMOD 2007
44/45
Thank You Very Much!
Please give me questions and
suggestions
•2015/7/17
ACM SIGMOD 2007
45/45