Algorithm B (Example)
Download
Report
Transcript Algorithm B (Example)
Mining of Frequent Patterns
from Sensor Data
Presented by: Ivy Tong Suk Man
Supervisor: Dr. B C M Kao
20 August, 2003
1
Outline
Outline of the Presentation
Motivation
Problem Definition
Algorithm
• Apriori with data transformation
• Interval-List Apriori
Experimental Results
Conclusion
2
Motivation
Continuous items
reflect values from an entity that changes continuously in the
external environment.
Update Change of state of the real entity
E.g. temperature reading data
• Initial temperature: 25ºC at t=0s
• Sequence of updates: <timestamp, new_temp>
<1s, 27ºC>, <5s, 28ºC>, <10s, 26ºC>, <14s,..> …
• t=0s to 1s, 25ºC
25ºC 27ºC
28ºC
t=1s to 5s, 27ºC
t=5s to 10s, 28ºC
0 1
5
26ºC
t
10
What is the average temperature from t=0s to 10s?
• Ans: (25x1+27x4+28x5)/10 = 27.3ºC
3
Motivation
Time is a component in some applications
E.g. stock price quotes, network traffic data
“Sensors” are used to monitor some conditions, for
example:
Prices of stocks: by getting quotations from a finance website
Weather: measuring temperature, humidity, air pressure, wind,
etc.
We want to find correlations of the readings among a
set of sensors
Goal: To mine association rules from sensor data
4
Challenges
How different is it from mining association rules from
market basket data?
Time component
When searching for association rules in market basket data,
time field is usually ignored as there is no temporal correlation
between the transactions
Streaming data
Data arrives continuously, possibly infinitely, and in large
volume
5
Notations
We have a set of sensors R = {r1,r2,…,rm}
Each sensor ri has a set of numerical states Vi
Assume binary states for all sensors
Vi = {0,1} i s.t. ri R
Dataset D: a sequence of updates of sensor state in the
form of <ts, ri, vi> where ri R, vi Vi
ts : timestamp of the update
ri: sensor to be updated
vi: new value of the state of ri
For sensors with binary states
• update in form of <ts, ri> as the new state can be inferred by toggling
the old state
6
Example
R={A,B,C,D,E,F}
Initial states: all off
D:
<1,A>
<2,B>
<4,D>
<5,A>
<6,E>
<7,F>
<8,E>
<10,A>
<11,F>
<13,C>
A
t
0
1
5
10
B
t
2
C
t
13
D
t
4
t
E
6
8
F
t
7
11
7
More Notations
An association rule is a rule, satisfying certain
support and confidence restrictions, in the form
XY
where XR,
YR
and
X R
,Y
R XY=
and X Y
8
More Notations
Association rule X Y has confidence c,
In c % of the time when the sensors in X are
ON (with state = 1), the sensors in Y are ON
Association rule X Y has support s,
In s% of the total length of history, the sensors
in X and Y are ON
9
More Notations
TLS(X) denote Total LifeSpan of X
Total length of time that the sensors in X are ON
T – total length of history
Sup(X) = TLS(X)/T
Conf(X Y) = Sup(X U Y) / Sup(X)
Example:
T = 15s
TLS(A)=9, TLS(AB)=8
Sup(A) = 9/15 = 60%
Sup(AB) =8/15 = 53%
Conf(A->B) = 8/9 = 89%
A
t
0 1
5
B
10
t
2
10
Algorithm A
Transform & Apriori
Transform the sequence of updates to the form of
market basket data
At each point of update
• take a snapshot of the states of all sensors
• Output all sensors with state=on as a transaction
• Attach
Weight(transaction)
= Lifespan(this update)
= timestamp(next update) – timestamp(this update)
11
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
Weight
C
t
13
D
t
4
E
t
6
8
F
t
7
11
12
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
1
A
Weight
timestamp=1
C
t
13
D
t
4
t
E
6
8
F
t
7
11
timestamp=1
13
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
Weight
1
A
2-1=1 timestamp=1
2
A,B
timestamp=2
C
t
13
D
t
4
t
E
6
8
F
t
7
11
timestamp=2
14
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
Weight
1
A
1
2
A,B
4-2=2 timestamp=2
3
A,B,D
C
t
13
D
t
4
E
timestamp=4
t
6
8
F
t
7
11
timestamp=4
15
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
Weight
1
A
1
2
A,B
2
3
A,B,D
1
4
B,D
1
5
B,D,E
1
6
B,D,E,F
1
7
B,D,F
2
8
A,B,D,F
1
9
A,B,D
2
10
A,B,C,D
15-13=2 timestamp=13
C
t
13
D
t
4
E
t
6
8
F
t
7
11
End of history = 15s
16
Algorithm A - Example
Initial states: all off
D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
A
t
0 1
5
10
B
t
2
Transformed database D’:
TID
Items
Weight
1
A
1
2
A,B
2
3
A,B,D
1
4
B,D
1
5
B,D,E
1
6
B,D,E,F
1
7
B,D,F
2
8
A,B,D,F
1
9
A,B,D
2
10
A,B,C,D
2
C
t
13
D
t
4
E
t
6
8
F
t
7
11
17
Algorithm A
Apply Apriori on the transformed dataset
D’
TID
sup( X ) Weight ( X ) / Weight
1
X T ,T D '
Drawbacks:
A lot of redundancy
Adjacent transactions may be very
similar, differed by the one sensor with
state update
Items
Weight
A
1
2
A,B
2
3
A,B,D
1
4
B,D
1
5
B,D,E
1
6
B,D,E,F
1
7
B,D,F
2
8
A,B,D,F
1
9
A,B,D
2
10
A,B,C,D
2
18
Algorithm B
Interval-List Apriori
Uses an “interval-list” format
<X, interval1, interval2, interval3, … >
where intervali is the interval in which all sensors in X are on.
TLS(X) = (intervali.h – intervali.l)
Example:
A
t
0 1
<A, [1,5), [10,15)>
5
10
TLS(A) = (5-1) + (15-10) = 9
19
Algorithm B
Step 1:
For each ri R,
build a list of interval in which ri is ON by
scanning the sequence of updates
Calculate the TLS of each ri
If TLS(ri) min_sup, put ri into L1
20
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, empty>
<B, empty>
<C, empty>
<D, empty>
<E, empty>
<F, empty>
21
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, [1,?)>
<B, empty>
<C, empty>
<D, empty>
<E, empty>
<F, empty>
22
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, [1,?)>
<B, [2,?)>
<C, empty>
<D, empty>
<E, empty>
<F, empty>
23
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, [1,5)>
<B, [2,?)>
<C, empty>
<D, [4,?)>
<E, empty>
<F, empty>
24
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, [1,5),[10,?)>
<B, [2,?)>
<C, [13,?)>
<D, [4,?)>
<E, [6,8)>
<F, [7,11)>
25
Algorithm B – Example
Initial states: all off
D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
<A, [1,5),[10,15)>
<B, [2,15)>
<C, [13,15)>
<D, [4,15)>
<E, [6,8)>
<F, [7,11)>
End of history T =15s
26
Algorithm B
Step 2:
Find all larger frequent sensor-sets
Similar to Apriori Frequent Itemst Property
Any subset of a frequent sensor-set must be frequent.
Method:
Generate candidates of size i+1 from frequent sensor-sets of
size i.
Approach used: join to obtain sensor-sets of size i+1 if two
size-i frequent sensor-sets agree on i-1
May also prune candidates who have subsets that are not
large.
Count the support by merging (intersection of) the interval lists
of the two size-i frequent sensor-sets
If sup min_sup, put into Li+1
Repeat the process until the candidate set is empty
27
Algorithm B
Example:
<A, [1,5), [10,15)>
<B, [2,15)>
<AB, [2,5),[10,15)>
A
t
0 1
5
10
B
t
2
T=15
28
Algorithm B (Example)
A
C
B
D
[1,5)
[2,15)
[13,15)
[10,15)
LS:13
LS:2
E
F
[4,15)
[6,8)
[7,11)
LS:11
LS:2
LS:4
LS:9
AB
AD
[2,5)
[4,5)
[10,15)
[10,15)
LS:8
LS:6
AF
BD
BF
[10,11)
[4,15)
[7,11)
LS:1
LS:11
LS:4
ABD
[4,5)
[10,15)
Min support count: 3
LS:6
29
Algorithm B – Candidate Generation
When generating a candidate sensor-set C of size i
from two size i-1 sensor-sets LA and LB (subsets of C),
we also construct the interval list of C by intersecting
the interval lists of LA and LB.
Joining the two interval lists (of length m and n) is a key
step in our algorithm
Use simple linear scan requires O(m+n) time
There are i different size i-1 subset of C
which two to pick?
30
Algorithm B – Candidate Generation
Method 1:
Choose two lists with fewest no of intervals
=>Store no of intervals for each sensor-set
Method 2:
Choose two lists with smallest count (TLS)
Intuitively shorter lifespan implies fewer intervals
Easier to implement
• Have the lifespan when checking if the sensor-set is
frequent
31
Experiments
Data generation
Stimulate data generated by a set of n binary
sensors
Make use of a standard market basket data
With n sensors, each of which can be either on or off
=>2n possible combination of sensor states
Assign a probability to each of the combinations
32
Experiments – Data Gen
How to assign the probabilities?
Let N be the no of occurrences of the transaction in the market
basket that contains exactly only the sensors that are ON
• E.g. Consider R={A,B,C,D,E,F}
• Suppose we want to assign prob to the sensor state AC (only A
and C are ON)
• N is no of transactions that contain exactly only A and C
Assign prob = N/|D|, where |D| is the size of the market basket
dataset
Note: Need sufficiently large market basket data
• transactions that occur very infrequently will not be given ZERO
probability
33
Experiments – Data Gen
Generating sensor set data
Choose the initial state (at t=0s)
• Randomly
• According to the probabilities assigned
• Pick the combination with highest probability assigned
=> first sensor set states
34
Experiment – Data Gen
What is the next set of sensor-set states?
For simplicity, in our model, only one sensor can be updated at
a time
For any two adjacent updates, the sensor-set states at the two
time instants are differed by only one sensor
=> change only one sensor state
=> n possible combinations by toggling each of the n sensor
states
We normalize the probabilities of the n combinations by their
sum
Pick the next set of sensor-set states according to the
normalized probabilities
Inter-arrival time of updates: exponential distribution
35
Experiments
Market Basket Dataset
8,000,000 transactions
100 items
number of maximal potentially large itemsets = 2000
average transaction length: 10
average length of maximal large itemsets: 4
length of the maximal large itemsets: 11
minimum support: 0.05%
length of the maximal large itemsets: ?
Algorithms:
Apriori: cached mode
IL-apriori:
• (a) random-join (IL-apriori)
• (b) join-by-smallest lifespan (IL-apriori-S)
• (c) join-by-fewest-no-of-intervals (IL-apriori-C)
36
Experiments - Results
Mining Tim e
Runtime (sec.)
500
450
400
350
cache-apriori
300
250
200
IL-apriori
IL-apriori-C
IL-apriori-S
150
100
50
0
0
1
2
3
4
5
Support threshold (%)
Performance of algorithms (larger support):
All IL-apriori algorithms outperform cache apriori
37
Experiments - Results
Mining Tim e
950
Runtime (sec.)
850
750
cache-apriori
650
IL-apriori
550
IL-apriori-C
450
IL-apriori-S
350
250
150
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Support Threshold (%)
Performance (lower support):
More candidates => IL-apriori: Expensive to join interval lists
38
Experiments - Results
Mining Tim e
950
Runtime (sec.)
850
750
cache-apriori
650
IL-apriori
550
IL-apriori-C
450
IL-apriori-S
350
250
150
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Support Threshold (%)
More long frequent sensor-sets
Apriori has to match the candidates by search through the DB
IL-apriori-C and IL-apriori-S reduce a lot of time in joining the lists
39
Experiments - Results
Memory
1900
Mem ory usage (MB)
1700
1500
cache-apriori
1300
IL-apriori
1100
IL-apriori-C
900
IL-apriori-S
700
500
300
0
1
2
3
4
5
Support threshold (%)
Amounts of memory usage - peak memory usage
Cache apriori - store the whole database
IL-apriori – store a lot of interval lists when no of candidates is growing large
40
Experiments – Results Experiments - Results
Mining Tim e
(min_sup = 0.02%)
Runtime (sec.)
140
120
100
cache-apriori
80
IL-apriori
60
IL-apriori-C
40
IL-apriori-S
20
0
1
2
3
4
5
6
7
8
9
10
11
12
i-th pass
Apriori is faster in the first 3 passes
Running time for IL-apriori drops sharply after
Apriori has to scan over the whole database
IL-apriori (C/S) needs to join relatively short interval-lists in later passes
41
Experiments - Results
(min_sup = 0.02%)
Memory usage (MB)
Mem ory
2000
1800
1600
1400
1200
1000
800
600
400
200
0
cache-apriori
IL-apriori
IL-apriori-C
IL-apriori-S
No. of Frequent Sensor-sets
1
2
3
4
5
6
7
8
9
10
11
12
10000
8000
i-th pass
6000
4000
2000
0
Memory requirement for IL-apriori is a lot higher when
there are more frequent sensor-set interval lists to join
1
2
3
4
5
6
7
8
9
10
11
12
i-th pass
42
Experiments - Results
(min_sup = 0.05%)
Mining Tim e
380
Runtime (sec.)
330
280
cache-apriori
230
IL-apriori
180
IL-apriori-C
IL-apriori-S
130
80
30
1000
2000
3000
4000
5000
6000
7000
8000
Number of transactions (K)
Runtime for all algorithms increases linearly with total number of transactions
43
Experiments - Results
(min_sup = 0.05%)
Mem ory
Memory usage (MB)
1600
1400
1200
cache-apriori
1000
IL-apriori
800
IL-apriori-C
600
IL-apriori-S
400
200
0
1000
2000
3000
4000
5000
6000
7000
8000
Num ber of transactions (K)
Memory required by all algorithms increases as no of transactions increases.
Rate of increase in IL-apriori is faster
44
Conclusions
Interval-list method to mine sensor data is described
Two interval list joining strategies are quite effective in
reducing running time
Memory requirement is quite high
Future Work
Other methods for joining interval-lists
• Trade-off between time and space
Extending to the streaming case
• Consider approaches other than Lossy Counting Algorithms
(Manku, and R. Motwani, VLDB’02)
45
Q&A
46