Mining Data Streams with Periodically changing Distributions
Download
Report
Transcript Mining Data Streams with Periodically changing Distributions
Mining Data Streams with
Periodically changing Distributions
Yingying Tao, Tamer Ozsu CIKM’09
Supervisor Dr Koh
Speaker Nonhlanhla Shongwe
April 26, 2010
Preview
Introduction
Challenge
Method
DMM framework
Distance Function Selection
Experiments
Conclusion
Introduction
Mining stream for knowledge mining such as
Clustering
Classification
Frequent patterns discovery , has become important
Important characteristic of unbounded data stream is that the
underlying distributions can show important changes over time,
leading to dynamic data streams.
Challenge
The problem of mining dynamic data streams
Balance accuracy with efficiency – highly accurate mining
techniques are generally computationally expensive
Some question to ask: for dynamic data streams
Is the distribution changes entirely random and unpredictable
Is it possible for the distribution changes to follow certain
patterns?
Method
Propose a method for mining dynamic data streams
Where important observed distributions patterns are stored
And compared new detected changes with these patterns
Method
Two streams with the same distribution
Mining results such as
List of all frequent items / itemset for frequent patterns discovery
Set of clusters / classes for clustering and classification should be the
same
If distribution change is detected and a match is found
Possible to skip the re-mining process
Directly output the mining results for the archived distribution
This is called the match-and-reuse strategy
Method
Issues to be resolved
Pattern selection
Selecting and storing important distributions that have a high
probability of occurring in the future
Pattern representation
Storing each pattern succinctly
Matching
Efficient procedure for rapid data streams with high accuracy
DMM framework
DMM framework stands for Detect, Match and Mine
Consist of four sequences
Choosing representative set
Change detection
Pattern matching
Stream mining
All processes are independent
DMM framework
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Window model
Two windows on Stream S
Time-based
Defines the time intervals
Denoted by Wt
Called observation window
Implemented as a tumbling window
Moves forward at each clock tick
Window model cont’s
Count-based
Contains a sub stream with fixed number of elements
Denoted by Wr
Called reference window
The size of the reference window (|Wr|) and time intervals of
Wt are predefined values
DMM framework
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Generate reference window (choosing
representative set)
Wr stores a set of data elements that represents a current
distribution of S
The size needs to be small due to memory limitations
Inaccurate results if we use a small data set to represent a
distribution if the distribution is complicated
Due to this problem, use a dynamic reference window (Wr)
Generate reference window (choosing
representative set)
Merge and select process
Dynamic reference window (Wr)
Merge Wt and Wr to get a larger substream |Wr|+|Wt|
Select |Wr| elements from the merged window (Wr +Wt) and
replace the stream in Wr by the new set
Merge and select process is triggered every time Wt tumbles
Generate reference window (choosing
representative set)
Generate reference window (choosing
representative set)
Selecting representative set: Two-step sampling approach
First-step sampling approach
Estimate the density function of Wr + Wt
K= kernel function
h= smoothing parameter (bandwidth)
si= data element in Wr + Wt
Generate reference window (choosing
representative set)
Selecting representative set: Two-step sampling approach
K is set to (Standard Gaussian function mean = 0 variance = 1)
Then the density function
h=value between 0 or 1
Generate reference window (choosing
representative set)
Selecting representative set: Two-step sampling approach
With the density function we are able to estimate the “shape” of
the current distribution
X-axis is the = value of the data s (v(s)) in Wr +Wt
Y-axis is the probability (p(v)) for all data values
Generate reference window (choosing
representative set)
Selecting representative set: Two-step sampling approach
Second-step sampling approach
Generate reference window (choosing
representative set)
Selecting representative set: Two-step sampling approach
Second-step sampling approach
First calculate the start and end values for each partition
DMM framework
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Change detection
Online change detection technique that is not restricted to
specific stream processing application
Wt tumbles, the change detection procedure is triggered
Compare the distributions of substreams in both Wr and Wt
windows
If the distance is greater than the predefined maximum matching
distance, then a distribution change is flagged
DMM framework
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Distribution matching (Pattern matching)
We use the appropriate distance measure to check their
similarity
If a match is found, then the persevered mining results are
outputted
The maximum predefined maximum matching is important
Smaller implies a higher accuracy
Larger increases the possibility of a new distribution to match a
pattern in the preserved set
DMM framework
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Choosing important distribution
Use heuristic rules
Distribution that have occurred in the stream for more times are
more important
The longer a distribution lasts in the streams lifespan the more
important it is
Distribution that has mining results with higher accuracy is more
important that a distributions with less accurate mining results
Distance Function Selection
Dynamic Time Wrapping (DTW), Longest Common Subsequence
(LCSS), Edit Distance on Real Sequence (EDR) and Relativized
Discrepancy (RD)
A proper distance function that can be used with DMM
Efficient , with the ability to stretching
Experiments
Change detection
Kernel density approach (KD)
Distance function-based approach(DF)
Experiments
Distribution matching evaluation
Data from Tropical Atmosphere Ocean
Sea surface temperatures.
12 218 streams each with a length of 962
Experiments
Efficiency with and without DMM
Adopt a popular decision tree-based clustering technique VFDT to
cluster the temperatures
Best decision tree generators for dynamic data streams
Time is reduced by 31.3%
Conclusion
Introduced a DMM framework to mine dynamic data streams
Window model
Generate reference window (choosing representative set)
Change detection
Distribution matching (Pattern matching)
Choosing important distribution
Experiments that showing DMM performs better
Thank you for your attention