Progress on Continuous Data Stream Processing

Download Report

Transcript Progress on Continuous Data Stream Processing

Continuous Data Stream
Processing
MAKE Lab
Date: 2006/03/07
Post-Excellence Project
Subproject 6
Continuous Data Stream Processing
Music Virtual Channel
Peer search
engine
Clustering
Profile
engine
database
Internet
Cluster
coordinator
Interface
Channel
monitor
V.C.
player
Profile
monitor
Favorite
channel
1
…
Music
channel
simulator
Filtering
XML
Filtering
engine
2
…
V.C.
player
Cluster
monitor
MusicXML
Music
database
metadata
N
Music collections
2
Continuous Data Stream Processing
Research Directions
Sequence Query Matching
Temporal Query Processing
Episode Query Matching
Filtering
Spatial Query Processing
Range Search
KNN Search
Aggregate Query Processing
Streaming
Data
Management
Top-K Search
Frequent Tree Pattern Mining
Closed Tree Pattern Mining
Frequent Itemset Mining
(sliding window)
Frequent Itemset Mining
(landmark model)
Mining
3
Hash-based synopsis with
memory consideration for mining
frequent itemsets over data
streams
Continuous Data Stream Processing
Landmark model
5
Continuous Data Stream Processing
Lossy Counting
Step 1: Divide the stream into ‘buckets’
bucket 1
bucket 2
bucket 3
bucket-size = 1/ε
ε = 10% of support s
6
Continuous Data Stream Processing
Lossy Counting in Action
Frequency
Counts
+
Empty
First Bucket
At bucket boundary, decrement all counters by 1
7
Continuous Data Stream Processing
Lossy Counting continued ...
Frequency
Counts
+
Next Bucket
At bucket boundary, decrement all counters by 1
Output:
Elements with counter values exceeding sN – εN
8
Continuous Data Stream Processing
Drawbacks of Lossy Counting
1
Keep all items
with frequency > 
Applied to mine
frequent itemsets, the
space may
exponentially increase
s
ε
Lossy-Counting
0
9
Continuous Data Stream Processing
hCount
m
……, 9, ……
h1(9) mod m
h2(9) mod m
h3(9) mod m
h4(9) mod m
h
1
0
1
1
1
2
1
0
0
1
2
2
1
2
1
0
1
1
1
0
1
1
1
2
For each item, hash the item into buckets, choose the
minimum count and return the item if its minimum
count ≥ sN
10
Continuous Data Stream Processing
hash-based
+1
N
+1
N
1
 Transaction {1, 2, 3}
 Subsets of {1, 2, 3}:
N
N
…
…
Total_
Access
Itemset Surplus_
Estimate
True_
Count
Nlast_access
{1}
{2}
{3}
{1, 2}
{1, 3}
{2, 3}
{1, 2, 3}
×
○
○
×
×
○
×
How to compute the Surplus_Estimate?
11
Continuous Data Stream Processing
Compute the Surplus_Estimate for an Itemset
Two variables
 n: number of different itemsets in the
bucket but not in the list
 c: sensible counts to be divided between
itemsets which are not in the list
 If c = [3, 5], n = [3, ?] →Surplus_Estimate
= 3, (3, 1, 1)
 Surplus_Estimate --, until
(Surplus_Estimate) / Nlast_acces < minSup
12
Continuous Data Stream Processing
Determine c and n
{2, 3, 5}, N = 4, minSup = 0.4
{2} is hashed into the bucket
Boundary of c: 4-(2+SE) ≤ c ≤ 4-2
 4  (2  0) 
Boundary of n: 
2n
 0.4  3  1 
c = 2, n = 2 → (1, 1) →Surplus_Estimate = 1
4
Total_
Access
5
3
4
{1}
2
0
Itemset
Surplus_
Nlast_access
True_
Estimate
Count
{2}
1
1
13
Monitoring Constrained
k-Nearest Neighbor over
Moving Objects with Different
Values
Continuous Data Stream Processing
Motivation (Cont.)
Example:
Consider that an user wants to find the k places to buy new
shoes where the costs are the lowest.
Cost = Price($) + Traffic Cost($)
2-NN Query
$100
$200
2
$400
1
400+100*1=500
3
$90
5
200+100*2=400
100+100*3=400
90+100*5=590
15
Continuous Data Stream Processing
Motivation
Objects with different values in spatial database.
 find the k places to buy something where the costs are the lowest.
Cost = Price($) + Traffic Cost($)
 Taxi driver wants to find the k places to gain the most profits.
Profit = Gain($) - Traffic Cost($)
 Taxi driver wants to find the k places to gain the most profits.
Profit = Gain($) / Time = Gain($) / Time
 Virtual Channel
age * profile distance
listen hours / profile distance
 Market Survey
$consumption (or income , age…) / profile distance
16
Continuous Data Stream Processing
Challenges
Efficiency
 Search space reduction
 Query processing enhancement
Effectiveness
 Previous result reuse
17
Continuous Data Stream Processing
Framework
Initialization
Step1
Find k-candidates to restrict the search region.
q
Step2
Run Pruning Ring on the remaining candidates to determine
actual answer.
Handling updates
-Incrementally update positions or values for objects and queries
-Computation is necessary only for affected query
18
Querying Episodes over
Event Stream
Continuous Data Stream Processing
Motivation
Knowledge Discovery from Telecommunication
Network Alarm Databases [ICDE96]
 If an alarm of type A occurs, then an alarm of type B occurs within 30
seconds with probability 0.8
 If alarms of types A and B occurs within 5 seconds, then a alarm of
type C occurs within 60 seconds with probability 0.7
 If an alarm of type A precedes an alarm of type B, and C precedes D,
all within 15 seconds, then E will follow within 4 minutes with
probability 0.6
5 seconds
B
A
15 seconds
A
A
B
C
D
20
Continuous Data Stream Processing
Challenges
Efficiency
 Index impaction
 Partial result sharing
 Load shedding
21
Continuous Data Stream Processing
Framework
D
Q1
Q2
Q3
5
3
7
B
C
p1
C
p2
D
p3
p5
A
C
B
D
A
D
p5
p4
p4
p1
p1, p5
B
p5 p
1
p4
 Joining events B and C: B
C
D
p2
p3
C
 Q3 is composed of p5 and p4
22
Continuous Data Stream Processing
3
5
D
B
7
C
C
C
A
D
B
D
E. I.: -1
E. I.: 6
A
EQueue
P1
(time)
PQueue
TLink
EQueue
P2
A
TLink
p5 p
1
TLink
D
EQueue
p4
M.Q.: Q2
PQueue
E. I.: 2
p1
P3
B
EQueue
E. I.: 4
D
p5
E. I.: 2
C
(St, Et)
E. I.: 2
E. I.: 5
B
M.Q.: Q1
p4
M.Q.: Q2
PQueue
E. I.: 6
C
D
p2
p3
P4
M.Q.: Q3
PQueue
TLink
E. I.: 6
P5
M.Q.: Q3
PQueue
23