Model Maintenance in Dynamic Environments

Download Report

Transcript Model Maintenance in Dynamic Environments

Model Maintenance in
Dynamic Environments
Venkatesh Ganti
(Joint work with Raghu Ramakrishnan,
Johannes Gehrke, Mong Li Lee)
Mining Environment
•
•
Data Mining
……
……
Data
Warehouse
Data repository for analysis
•
OLAP
•
•
DEMON
Data mining models
• Frequent itemsets
• Decision trees
• Clusters
• …
OLAP
• Aggregate queries
Repository updated regularly
Query workloads change
2
Two Parts of this Talk
1. Model Maintenance: Maintaining models
under systematic data evolution [ICDE 00]
2. Tuning samples: Maintaining samples for
approximate query answering with respect
to changing query workloads [VLDB00]
DEMON
3
Systematic Block Evolution
•
•
Data warehouses are updated with blocks of new data
Block: a set of tuples appended simultaneously to the data
warehouse
D
d
…
D+d
Result: a sequence of database snapshots
DEMON
4
Model Maintenance: Objective
•
•
•
Allow selection of interesting timevarying subsets to be modeled
Low response time to get the updated
model
Interesting classes of models
•
•
•
DEMON
Frequent itemsets (LITS)
Clusters
Decision trees (DT)
5
Subset selection: Data Span
•
Span of interest
•
•
Everything until now—Unrestricted window
Recently collected—Most recent window
• Unrestricted Window (UW)
• Model the entire database
DEMON
D1 D2 D3
M(D1+D2+D3)
D1 D2 D3 D4
M(D1+D2+D3+D4)
6
Data Span (contd.)
•
Most Recent Window (MRW) of size w
E.g., model data collected in the last 3
days
Sliding Windows
Models
•
DEMON
D1 D2 D3
M(D1+D2+D3)
D1 D2 D3 D4
M(D2+D3+D4)
D1
M(D3+D4+D5)
D2 D3
D4 D5
7
Block Selection Sequence
•
•
•
Maintain models on data collected on
alternate days within the last 4 weeks
Require fine granular selection
Block selection sequence (BSS)
•
A 0/1 sequence: a bit for each block in the
data span
•
•
DEMON
1--the block is selected for modeling
0--the block is not selected for modeling
8
BSS: UW
•
A sequence of 0/1 bits, one for each block
in the entire database
• E.g., select all blocks collected on alternate days
1
D1
DEMON
0 1
0 1
D2 D3 D4 D5
9
BSS: MRW
Two types of BSS w.r.t. MRW
•
1.
2.
Window-independent
Window-relative
Model data collected on Mondays within the
last 4 weeks
•
•
BSS: (1000000)*
1 0…0
1
D1 D2-D7 D8
M(D1+D8)
DEMON
0…0 1 ...
D9-D14
1 0…0
1
D1 D2-D7 D8
0…0 1 ...
D9-D14 D15
M(D1+D8+D15)
10
BSS: MRW (contd.)
•
•
•
Window-relative BSS
Model all data collected on alternate days
from the start in a window of size 3
BSS: 101
D1 D2 D3
[1
0
1]
D1 D2 D3
[1
0
D4
1]
D1 D2 D3 D4
[1
D5
0
1]
Here, each successive subset is disjoint from its predecessor
DEMON
11
Model Maintenance: Enumeration
LITS
Clustering
DT
UW:BSS
MRW:BSS
Includes both window-independent and
window-relative block selection sequences.
DEMON
12
Model Maintenance: Algorithms
LITS
UW:BSS
Clustering
DT
GEMM(A)
MRW:BSS
GEMM: GEneric Model Maintenance Algorithm for any
class of models that has an incremental maintenance
algorithm A under tuple insertions
DEMON
13
Maintenance under Insertions
•
Algorithm A
•
•
•
Such algorithms exist for
•
•
•
•
DEMON
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D+d) = A(D, d, M(D))
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle
deletions!
14
GEMM
•
Input
•
•
•
•
Output
•
DEMON
Data span (and window size for MRW)
BSS
A model-update algorithm A under tuple
insertions (deletions not required)
An efficient model maintenance algorithm
15
GEMM:MRW
•
Assume BSS is a sequence of 1’s and w=3
T3
D1 D2 D3
M(D1+D2+D3)
T4
D1 D2 D3 D4
M(D2+D3+D4)
T5
D1 D2 D3 D4 D5
M(D3+D4+D5)
• We already know parts of future windows
DEMON
16
GEMM: MRW (contd.)
Idea: Start building models for future windows
E.g.,: At T3, we maintain models on
<D1 + D2 + D3> (model required for window at T3)
<D2 + D3>
(partial model for window at T4)
<D3>
(partial model for window at T5)
Models at T3
M<D1 + D2 + D3>
M<D2 + D3>
M<D3>
DEMON
Models at T4
Immediate
Offline
M<D2 + D3 + D4> (for window at T2)
M<D3 + D4>
(for window at T3)
M<D4>
(for window at T4)
17
GEMM: Arbitrary BSS
1 0 1 0 1 ...
DD11 D2 D3 D4 D5
T3: Model on <1.D1 + 0.D2 + 1.D3>
T4: D4 is appended
Model on <0.D2 + 1.D3 + 0.D4>
T5: D5 is appended
Model on <1.D3 + 0.D4 + 1.D5>
Idea: We still know parts of future windows and the
corresponding BSS for each of them
E.g.,: At T3, we maintain models on
<1.D1 + 0.D2 + 1.D3> (model required at T3)
<0.D2 + 1.D3>
identical
<1.D3>
DEMON
18
GEMM: Resource Requirements
•
Response time to new model
•
Updating one model with the new block
•
•
•
Depends on the incremental algorithm
Space requirements
•
•
DEMON
Other updates offline
At most w models
Space required for a model is orders of
magnitude less than that for data!
19
Maintenance under Insertions
•
Algorithm A
•
•
•
Such algorithms exist for
•
•
•
•
DEMON
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D+d) = A(D, d, M(D))
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle
deletions!
20
Frequent Itemset Models
•
•
Set of customer transactions
Frequent itemset: a set of items purchased
together by “many” customers
TID a b c
1
2
3
DEMON
1 0 1
1 1 1
0 1 0
Minimum frequency threshold = 50%
{b}, {c}, {a,c} are frequent itemsets
21
Incremental Algorithm
[FAAM97,TBAR97]
D1 D2 D3
•
Input
•
•
•
•
Old dataset
Old set of frequent itemsets
New block D4
Steps
•
•
Detect if new itemsets become frequent
Count frequencies of a small number of itemsets
•
•
DEMON
D4
Current algorithms scan (D1+D2+D3) completely
Update model…
22
ECUT—New Counting Algorithm
•
•
Transformed data representation
Within each block Di
•
item x: sorted list of transaction identifiers
containing “x”—TID-list(x)
TID a b c
1
2
3
DEMON
1 0 1
0 1 1
0 1 0
TID-list(a) = {1}
TID-list(b) = {2,3}
TID-list(c) = {1,2}
Count({a,b}) = |TID-list(a) intersection TID-list(b)|
23
Experimental Comparison
Time to Update Model
ECUT
SCAN-TO-COUNT
50
45
Time (in seconds)
40
35
30
25
20
15
10
5
0
10
25
50
75
100
150
200
400
New Block Size (in 1000's)
DEMON
24
Comparing Count Times
Counting Times
400
350
Time (in Sec)
300
250
4M:ECUT
200
4M:SCAN
150
100
2M:ECUT
2M:SCAN
50
4M:ECUT+
2M:ECUT+
0
5
DEMON
15
25
35
45
55
65
75
85
95
105
#Itemsets
115
125
135
145
155
165
175
185
25
Summary of the first part
LITS
UW:BSS
Clustering
DT
GEMM
MRW:BSS
DEMON
Maintenance algorithms under tuple insertions
•Frequent itemsets
•ECUT, ECUT+
•Clusters
•BIRCH
•Decision Trees
•BOAT
26
Second Part of this Talk
1. Model Maintenance: Maintaining models
under systematic data evolution [ICDE 00]
2. Maintaining samples with respect to
changing query workloads [VLDB00]
DEMON
27
Random Samples for AQUA
Agg. query Q
Typical AQUA
approach
Exact answer
R
Approx. answer
Uniform
Random
sample
•
•
•
DEMON
S(R)
All tuples in R are assumed to be equally
important while drawing S(R)
In practice, queries exhibit locality
Consequence: S(R) wastes precious real estate
28
Problem
•
Given
•
•
•
•
DEMON
Relation R
Workload W: Q1,…,Qn
Goal: Dynamically tune “random sample of
R” w.r.t. W
Model to be maintained: a simple random
sample
29
ICICLES
•
•
R(Q): set of tuples in R required to answer Q
Random sample of R U R(Q1) U … U R(Qn)
•
Tuples required often are more likely to be in SW(R)
R
R(Q1)
Uniform
Random
Sample
SW(R)
ICICLE
DEMON
R(Qn)
30
Mail Order Dataset
AVG queries on a mail order dataset
Icicle
Icicle-complete
Static sample
0.3
Relative Error
0.25
0.2
0.15
0.1
0.05
0
1
2 3
4
5 6
7
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#Queries
DEMON
31
Conclusions and Future Work
Static dataset Dynamic dataset
Workload
indifferent
Workload
sensitive
DEMON
Most DM
literature
ICICLES
+
??
DEMON
+
Related Effort
??
32
•
DEMON
Questions?
33