
Mining of Frequent Patterns
from Sensor Data
Presented by: Ivy Tong Suk Man
Supervisor: Dr. B C M Kao
20 August, 2003
Outline
 Motivation
 Problem Definition
 Algorithm
• Apriori with data transformation
• Interval-List Apriori
 Experimental Results
 Conclusion
Motivation
 Continuous items
 reflect values from an entity that changes continuously in the
external environment.
 Update → change of state of the real entity
 E.g. temperature reading data
• Initial temperature: 25ºC at t=0s
• Sequence of updates: <timestamp, new_temp>
<1s, 27ºC>, <5s, 28ºC>, <10s, 26ºC>, <14s,..> …
• t=0s to 1s: 25ºC
• t=1s to 5s: 27ºC
• t=5s to 10s: 28ºC
[Figure: timeline of the temperature reading; 26ºC from t=10s]
 What is the average temperature from t=0s to 10s?
• Ans: (25×1 + 27×4 + 28×5)/10 = 27.3ºC
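This time-weighted average can be computed by replaying the updates. Below is a minimal sketch; the function and argument names are illustrative, not from the presentation:

```python
# Time-weighted average of a piecewise-constant reading over [0, t_end).

def time_weighted_average(initial, updates, t_end):
    """updates: list of (timestamp, new_value) pairs, sorted by timestamp."""
    total, prev_t, value = 0.0, 0.0, initial
    for t, new_value in updates:
        if t >= t_end:
            break
        total += value * (t - prev_t)   # value held over [prev_t, t)
        prev_t, value = t, new_value
    total += value * (t_end - prev_t)   # final segment up to t_end
    return total / t_end

# Reproduces the example: (25*1 + 27*4 + 28*5) / 10 = 27.3
print(time_weighted_average(25, [(1, 27), (5, 28), (10, 26)], 10))
```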
Motivation
 Time is a component in some applications
 E.g. stock price quotes, network traffic data
 “Sensors” are used to monitor some conditions, for
example:
 Prices of stocks: by getting quotations from a finance website
 Weather: measuring temperature, humidity, air pressure, wind,
etc.
 We want to find correlations among the readings of a
set of sensors
 Goal: To mine association rules from sensor data
Challenges
 How different is it from mining association rules from
market basket data?
 Time component
When searching for association rules in market basket data, the
time field is usually ignored, as there is no temporal correlation
between transactions
 Streaming data
Data arrives continuously, possibly infinitely, and in large
volume
Notations
 We have a set of sensors R = {r1,r2,…,rm}
 Each sensor ri has a set of numerical states Vi
 Assume binary states for all sensors
 Vi = {0,1} ∀i s.t. ri ∈ R
 Dataset D: a sequence of updates of sensor states in the
form of <ts, ri, vi> where ri ∈ R, vi ∈ Vi
 ts: timestamp of the update
 ri: sensor to be updated
 vi: new value of the state of ri
 For sensors with binary states
• update is in the form <ts, ri>, as the new state can be inferred by
toggling the old state
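As a minimal sketch (names are illustrative), the binary update stream can be replayed by toggling states:

```python
# Hedged sketch of the binary-sensor update format <ts, ri>: the new
# state is inferred by toggling the old one.

def replay(updates, sensors):
    """updates: list of (ts, sensor) pairs, sorted by ts; all sensors start OFF."""
    states = {r: 0 for r in sensors}
    for ts, r in updates:
        states[r] ^= 1               # toggle ON <-> OFF
        yield ts, dict(states)       # snapshot after this update

for ts, snap in replay([(1, "A"), (2, "B"), (4, "D"), (5, "A")], "ABCDEF"):
    print(ts, {r for r, v in snap.items() if v})
# 1 {'A'}   2 {'A','B'}   4 {'A','B','D'}   5 {'B','D'}
```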
Example
 R={A,B,C,D,E,F}
 Initial states: all off
 D:
<1,A>
<2,B>
<4,D>
<5,A>
<6,E>
<7,F>
<8,E>
<10,A>
<11,F>
<13,C>
[Figure: state timelines for sensors A-F; e.g. A is ON during [1,5) and from t=10, E during [6,8), F during [7,11)]
More Notations
 An association rule is a rule, satisfying certain
support and confidence restrictions, in the form
X ⇒ Y
where X ⊆ R, Y ⊆ R, X ∩ Y = ∅, and X, Y ≠ ∅
More Notations
 Association rule X ⇒ Y has confidence c:
in c% of the time when the sensors in X are
ON (with state = 1), the sensors in Y are also ON
 Association rule X ⇒ Y has support s:
in s% of the total length of history, the sensors
in X and Y are all ON
More Notations
 TLS(X) denotes the Total LifeSpan of X
 Total length of time during which all the sensors in X are ON
 T – total length of history
 Sup(X) = TLS(X)/T
 Conf(X ⇒ Y) = Sup(X ∪ Y) / Sup(X)
 Example:
T = 15s
TLS(A) = 9, TLS(AB) = 8
Sup(A) = 9/15 = 60%
Sup(AB) = 8/15 = 53%
Conf(A ⇒ B) = 8/9 = 89%
[Figure: timelines of sensors A and B over t = 0 to 15]
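A quick numeric check of these definitions against the example (variable names are illustrative):

```python
# Support and confidence from total lifespans, per the definitions above.
T = 15                           # total length of history
TLS = {"A": 9, "AB": 8}          # lifespans from the example

sup_A   = TLS["A"] / T           # 9/15 = 60%
sup_AB  = TLS["AB"] / T          # 8/15 = 53%
conf_AB = TLS["AB"] / TLS["A"]   # Conf(A => B) = Sup(AB)/Sup(A) = 8/9 = 89%
```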
Algorithm A
 Transform & Apriori
 Transform the sequence of updates into the form of
market basket data
 At each update:
• take a snapshot of the states of all sensors
• output all sensors with state = ON as a transaction
• attach Weight(transaction) = Lifespan(this update)
= timestamp(next update) − timestamp(this update)
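A hedged sketch of this transformation, reusing the (ts, sensor) update format from the earlier replay sketch (all names are assumptions):

```python
# Algorithm A's transform step: replay the update stream and, at each
# update, emit the previous snapshot as a weighted transaction.

def transform(updates, sensors, t_end):
    """updates: sorted (ts, sensor) pairs for binary sensors, all OFF at t=0.
    Returns a list of (items, weight) transactions."""
    states = {r: 0 for r in sensors}
    transactions = []
    prev_ts = None
    for ts, r in updates:
        if prev_ts is not None:
            items = frozenset(s for s, v in states.items() if v)
            transactions.append((items, ts - prev_ts))   # weight = lifespan
        states[r] ^= 1
        prev_ts = ts
    # the last snapshot lives until the end of history
    items = frozenset(s for s, v in states.items() if v)
    transactions.append((items, t_end - prev_ts))
    return transactions
```

Applied to the example D with t_end = 15, this reproduces the ten weighted transactions built up on the following slides.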
Algorithm A - Example
 Initial states: all off
 D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
 Transformed database D' (empty so far):
TID   Items   Weight
[Figure: state timelines for sensors A-F]
Algorithm A - Example
 Initial states: all off
 D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
 At timestamp = 1 (update <1,A>): the snapshot {A} becomes
transaction 1; its weight is not known until the next update
 Transformed database D':
TID   Items   Weight
1     A       ?
[Figure: state timelines for sensors A-F]
Algorithm A - Example
 Initial states: all off
 D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
 At timestamp = 2 (update <2,B>): transaction 1 gets weight
2-1 = 1; the snapshot {A,B} becomes transaction 2
 Transformed database D':
TID   Items   Weight
1     A       2-1 = 1
2     A,B     ?
[Figure: state timelines for sensors A-F]
Algorithm A - Example
 Initial states: all off
 D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
 At timestamp = 4 (update <4,D>): transaction 2 gets weight
4-2 = 2; the snapshot {A,B,D} becomes transaction 3
 Transformed database D':
TID   Items   Weight
1     A       1
2     A,B     4-2 = 2
3     A,B,D   ?
[Figure: state timelines for sensors A-F]
Algorithm A - Example
 Initial states: all off
 D: <1,A>,<2,B>,<4,D>,<5,A>,
<6,E>,<7,F>,<8,E>,<10,A>,
<11,F>,<13,C>
 Continuing through the last update <13,C>; the end of history is
15s, so the last transaction gets weight 15-13 = 2
 Transformed database D':
TID   Items     Weight
1     A         1
2     A,B       2
3     A,B,D     1
4     B,D       1
5     B,D,E     1
6     B,D,E,F   1
7     B,D,F     2
8     A,B,D,F   1
9     A,B,D     2
10    A,B,C,D   2
[Figure: state timelines for sensors A-F]
Algorithm A
 Apply Apriori on the transformed dataset D' (the table above)
 sup(X) = Σ_{X ⊆ T, T ∈ D'} Weight(T) / Σ_{T ∈ D'} Weight(T)
 Drawbacks:
 A lot of redundancy
 Adjacent transactions may be very similar, differing only in
the one sensor whose state was updated
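A minimal sketch of this weighted support, assuming the (items, weight) list produced by the transform sketch above (note that D' as built there omits the initial all-OFF period, so the weights sum over t ∈ [1,15] rather than the full history):

```python
# Weighted support over the transformed database D':
# sup(X) = sum of weights of transactions containing X / sum of all weights.

def weighted_support(itemset, transactions):
    """transactions: list of (frozenset_of_items, weight) pairs."""
    total = sum(w for _, w in transactions)
    covered = sum(w for items, w in transactions if itemset <= items)
    return covered / total
```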
Algorithm B
 Interval-List Apriori
 Uses an “interval-list” format:
<X, interval1, interval2, interval3, …>
where each intervali is a maximal interval in which all sensors in X
are ON.
 TLS(X) = Σi (intervali.h − intervali.l)
 Example: <A, [1,5), [10,15)>
TLS(A) = (5-1) + (15-10) = 9
[Figure: timeline of sensor A, ON during [1,5) and [10,15)]
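As a minimal sketch, intervals can be kept as sorted (lo, hi) pairs; TLS is then a one-liner (names are illustrative):

```python
# Interval-list representation: a sensor-set maps to a sorted list of
# half-open [lo, hi) intervals during which all of its sensors are ON.

def tls(intervals):
    """Total LifeSpan: the summed length of all intervals."""
    return sum(hi - lo for lo, hi in intervals)

A = [(1, 5), (10, 15)]
print(tls(A))   # (5-1) + (15-10) = 9, as in the example above
```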
Algorithm B
 Step 1:
For each ri ∈ R,
build a list of the intervals in which ri is ON by
scanning the sequence of updates
 Calculate the TLS of each ri
 If TLS(ri) ≥ min_sup, put ri into L1
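A hedged sketch of this scan for binary sensors, under the same (ts, sensor) update format as before (names are assumptions):

```python
# Step 1 sketch: one scan over the updates builds, per sensor, the list of
# [on, off) intervals; intervals still open at the end are closed at T.

def build_interval_lists(updates, sensors, T):
    lists = {r: [] for r in sensors}
    open_at = {}                        # sensor -> ts when it turned ON
    for ts, r in updates:
        if r in open_at:                # ON -> OFF: close the interval
            lists[r].append((open_at.pop(r), ts))
        else:                           # OFF -> ON: open a new interval
            open_at[r] = ts
    for r, start in open_at.items():    # still ON at the end of history
        lists[r].append((start, T))
    return lists
```

On the example D with T = 15, this yields exactly the lists built step by step on the next slides.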
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 Before any update is processed:
 <A, empty>
 <B, empty>
 <C, empty>
 <D, empty>
 <E, empty>
 <F, empty>
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 After <1,A>:
 <A, [1,?)>
 <B, empty>
 <C, empty>
 <D, empty>
 <E, empty>
 <F, empty>
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 After <2,B>:
 <A, [1,?)>
 <B, [2,?)>
 <C, empty>
 <D, empty>
 <E, empty>
 <F, empty>
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 After <4,D> and <5,A>:
 <A, [1,5)>
 <B, [2,?)>
 <C, empty>
 <D, [4,?)>
 <E, empty>
 <F, empty>
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 After the remaining updates, through <13,C>:
 <A, [1,5),[10,?)>
 <B, [2,?)>
 <C, [13,?)>
 <D, [4,?)>
 <E, [6,8)>
 <F, [7,11)>
Algorithm B – Example
 Initial states: all off
 D:
<1,A>,<2,B>,<4,D>,<5,A>, <6,E>,<7,F>,<8,E>,<10,A>,<11,F>,<13,C>
 Intervals still open at the end of history (T = 15s) are closed at 15:
 <A, [1,5),[10,15)>
 <B, [2,15)>
 <C, [13,15)>
 <D, [4,15)>
 <E, [6,8)>
 <F, [7,11)>
Algorithm B
 Step 2:
 Find all larger frequent sensor-sets
 Similar to the Apriori Frequent Itemset Property
 Any subset of a frequent sensor-set must be frequent.
 Method:
 Generate candidates of size i+1 from frequent sensor-sets of
size i.
 Approach used: join two size-i frequent sensor-sets that agree
on i-1 sensors to obtain a candidate of size i+1
 May also prune candidates that have subsets that are not
large.
 Count the support by merging (intersecting) the interval lists
of the two size-i frequent sensor-sets (a linear-merge sketch
follows the example below)
 If sup ≥ min_sup, put into Li+1
 Repeat the process until the candidate set is empty
Algorithm B
 Example:
 <A, [1,5), [10,15)>
 <B, [2,15)>
 <AB, [2,5),[10,15)>
[Figure: timelines of A and B over t = 0 to 15; T = 15]
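A minimal sketch of this interval-list join as a linear merge over two sorted lists (names are illustrative):

```python
# Intersect two sorted interval lists with a linear merge; this is the
# join used to form AB from A and B.

def intersect(xs, ys):
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:
            out.append((lo, hi))      # overlapping part is ON for both
        if xs[i][1] < ys[j][1]:       # advance the list that ends first
            i += 1
        else:
            j += 1
    return out

print(intersect([(1, 5), (10, 15)], [(2, 15)]))   # [(2, 5), (10, 15)]
```

Each step advances the list whose current interval ends first, so the merge takes O(m+n) time, as noted in the candidate-generation discussion that follows.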
Algorithm B (Example)
 Min support count: 3
 Size-1 sensor-sets:
A:   [1,5), [10,15)    LS: 9
B:   [2,15)            LS: 13
C:   [13,15)           LS: 2   (not frequent)
D:   [4,15)            LS: 11
E:   [6,8)             LS: 2   (not frequent)
F:   [7,11)            LS: 4
 Size-2 sensor-sets:
AB:  [2,5), [10,15)    LS: 8
AD:  [4,5), [10,15)    LS: 6
AF:  [10,11)           LS: 1   (not frequent)
BD:  [4,15)            LS: 11
BF:  [7,11)            LS: 4
 Size-3 sensor-sets:
ABD: [4,5), [10,15)    LS: 6
Algorithm B – Candidate Generation
 When generating a candidate sensor-set C of size i
from two size-(i-1) sensor-sets LA and LB (subsets of C),
we also construct the interval list of C by intersecting
the interval lists of LA and LB.
 Joining two interval lists (of lengths m and n) is a key
step in our algorithm
 A simple linear scan requires O(m+n) time
 There are i different subsets of C of size i-1:
which two should we pick?
Algorithm B – Candidate Generation
 Method 1:
 Choose the two lists with the fewest number of intervals
 => Store the number of intervals for each sensor-set
 Method 2:
 Choose the two lists with the smallest count (TLS)
 Intuitively, a shorter lifespan implies fewer intervals
 Easier to implement
• The lifespan is already at hand from checking whether the
sensor-set is frequent
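Both heuristics reduce to sorting the candidate's frequent subsets by a cheap key; a minimal sketch (all names are assumptions):

```python
# Pair selection for the join: Method 1 minimizes the interval count,
# Method 2 minimizes the lifespan (TLS), which is already known.

def pick_by_fewest_intervals(subsets, ilist):    # Method 1
    return sorted(subsets, key=lambda s: len(ilist[s]))[:2]

def pick_by_smallest_lifespan(subsets, tls):     # Method 2
    return sorted(subsets, key=lambda s: tls[s])[:2]
```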
Experiments
 Data generation
 Simulate data generated by a set of n binary
sensors
 Make use of a standard market basket dataset
 With n sensors, each of which can be either on or off
=> 2^n possible combinations of sensor states
 Assign a probability to each of the combinations
Experiments – Data Gen
 How to assign the probabilities?
 Let N be the number of occurrences of the market basket
transaction that contains exactly the sensors that are ON
• E.g. consider R={A,B,C,D,E,F}
• Suppose we want to assign a probability to the sensor state AC
(only A and C are ON)
• N is the number of transactions that contain exactly A and C
 Assign prob = N/|D|, where |D| is the size of the market basket
dataset
 Note: we need a sufficiently large market basket dataset, so that
combinations that occur very infrequently will not be given ZERO
probability
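A hedged sketch of this assignment (names are illustrative); combinations that never occur in the basket data end up with probability zero, which is why a large dataset is needed:

```python
# Estimate P(combination) as the fraction of basket transactions whose
# item set is exactly that combination of ON sensors.
from collections import Counter

def state_probabilities(basket):
    """basket: list of transactions, each given as a frozenset of items."""
    counts = Counter(basket)
    total = len(basket)
    return {state: n / total for state, n in counts.items()}
```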
Experiments – Data Gen
 Generating sensor-set data
 Choose the initial state (at t=0s):
• randomly, according to the probabilities assigned, or
• pick the combination with the highest assigned probability
=> the first set of sensor states
Experiments – Data Gen
 What is the next set of sensor states?
 For simplicity, in our model, only one sensor can be updated at
a time
 For any two adjacent updates, the sensor states at the two
time instants differ by only one sensor
=> change only one sensor state
=> n possible combinations, obtained by toggling each of the n
sensor states
 We normalize the probabilities of the n combinations by their
sum
 Pick the next set of sensor states according to the
normalized probabilities
 Inter-arrival time of updates: exponential distribution
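A hedged sketch of one generation step under these rules (function and parameter names are assumptions):

```python
# One step of the generator: toggle exactly one sensor, chosen with
# probability proportional to the assigned probability of the resulting
# combination; the time to the next update is exponentially distributed.
import random

def next_state(state, sensors, prob, rate=1.0):
    """state: frozenset of ON sensors; prob: combination -> probability.
    Assumes at least one reachable combination has nonzero probability."""
    candidates = [state ^ {r} for r in sensors]         # toggle one sensor
    weights = [prob.get(c, 0.0) for c in candidates]
    new_state = random.choices(candidates, weights)[0]  # normalized by sum
    dt = random.expovariate(rate)                       # exponential gap
    return new_state, dt
```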
Experiments
 Market Basket Dataset
 8,000,000 transactions
 100 items
 number of maximal potentially large itemsets = 2000
 average transaction length: 10
 average length of maximal large itemsets: 4
 length of the maximal large itemsets: 11
 minimum support: 0.05%
 Algorithms:
 Apriori: cached mode
 IL-apriori:
• (a) random join (IL-apriori)
• (b) join by smallest lifespan (IL-apriori-S)
• (c) join by fewest number of intervals (IL-apriori-C)
Experiments - Results
[Chart: Mining Time: runtime (sec.) vs. support threshold (0-5%) for
cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Performance of the algorithms (larger support):
 All IL-apriori variants outperform cache-apriori
Experiments - Results
[Chart: Mining Time: runtime (sec.) vs. support threshold (0.01-0.07%) for
cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Performance (lower support):
 More candidates => joining the interval lists becomes expensive for IL-apriori
Experiments - Results
[Chart: same mining-time plot, support threshold 0.01-0.07%]
 More long frequent sensor-sets at low support:
 Apriori has to match the candidates by searching through the DB
 IL-apriori-C and IL-apriori-S save a lot of time in joining the lists
Experiments - Results
[Chart: Memory: peak memory usage (MB) vs. support threshold (0-5%) for
cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 The figures reported are peak memory usage
 Cache-apriori stores the whole database
 IL-apriori stores a lot of interval lists when the number of candidates grows large
Experiments - Results
(min_sup = 0.02%)
[Chart: Mining Time: runtime (sec.) in each pass (passes 1-12) for
cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Apriori is faster in the first 3 passes
 Running time for IL-apriori drops sharply in the later passes
 Apriori has to scan the whole database in every pass
 IL-apriori (C/S) only joins relatively short interval lists in the later passes
Experiments - Results
(min_sup = 0.02%)
[Charts: memory usage (MB) per pass and number of frequent sensor-sets per
pass (passes 1-12) for cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Memory requirement for IL-apriori is a lot higher when
there are more frequent sensor-set interval lists to join
Experiments - Results
(min_sup = 0.05%)
[Chart: Mining Time: runtime (sec.) vs. number of transactions (1,000K-8,000K)
for cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Runtime of all algorithms increases linearly with the total number of transactions
Experiments - Results
(min_sup = 0.05%)
[Chart: Memory: memory usage (MB) vs. number of transactions (1,000K-8,000K)
for cache-apriori, IL-apriori, IL-apriori-C and IL-apriori-S]
 Memory required by all algorithms increases as the number of transactions increases
 The rate of increase is faster for IL-apriori
Conclusions
 An interval-list method for mining frequent patterns from sensor data is described
 The two interval-list joining strategies are quite effective in
reducing running time
 Memory requirement is quite high
 Future Work
 Other methods for joining interval-lists
• Trade-off between time and space
 Extending to the streaming case
• Consider approaches other than the Lossy Counting algorithm
(Manku and Motwani, VLDB ’02)
Q&A