Mining frequency counts from sensor set data

Loo Kin Kong
25th June 2003
Outline

- Motivation
- Sensor set data
- Finding frequency counts of itemsets from sensor set data
- Future work
Stock quotes

Closing prices of some HK stocks:

Date    0005    0023    0511    2388
09/6    95.00   15.30   28.10   7.90
10/6    94.00   15.45   27.80   7.85
11/6    93.50   15.60   27.90   7.85
12/6    95.00   15.50   27.80   7.75
13/6    95.25   15.55   27.70   7.90
16/6    95.25   15.30   27.85   7.95
17/6    97.00   15.55   27.60   8.00
18/6    97.00   15.60   26.45   8.10
19/6    96.50   15.55   27.20   8.05
20/6    96.00   15.55   28.00   8.10
23/6    94.25   15.30   27.70   7.95
Stock quotes

[Chart: intra-day stock price of TVB (0511) on 23rd June 2003. Source: quamnet.com]
Motivation

- Fluctuation of the price of a stock may be related to that of another stock or to other conditions
- Online analysis tools can help to give more insight into such variations
- The case of the stock market can be generalized: we use "sensors" to monitor some conditions, for example:
  - We monitor the prices of stocks by getting quotations from a finance website
  - We monitor the weather by measuring temperature, humidity, air pressure, wind, etc.
Sensors

Properties of a sensor include:

- A sensor reports values, either spontaneously or by request, reflecting the state of the condition being monitored
- Once a sensor reports a value, the value remains valid until the sensor reports again
- The lifespan of a value is defined as the length of time during which the value is valid
- The value reported must be one of the possible states of the condition
- The set of all possible states of a sensor is its state set

[Timeline: a sensor reporting a value s at each of the times t1, t2, ..., t6]
Sensor set data

- A set of sensors (say, n of them) is called a sensor set
- At any time, we can obtain an n-tuple, composed of the values of the n sensors, attached with a time stamp:
  <t, (v1, v2, ..., vn)>
  where t is the time when the n-tuple is obtained and vx is the value of the x-th sensor
- If the n sensors have the same state set, we call the sensor set homogeneous
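As a concrete illustration, a sensor-set reading can be represented as a timestamped n-tuple; the Python sketch below (hypothetical names `SensorTuple` and `is_homogeneous`, not from the slides) also checks the homogeneity condition:

```python
from typing import NamedTuple, Tuple

class SensorTuple(NamedTuple):
    """One reading of an n-sensor set: a timestamp plus one value per sensor."""
    t: float                  # time when the n-tuple is obtained
    values: Tuple[str, ...]   # values (v1, ..., vn), one per sensor

def is_homogeneous(state_sets):
    """A sensor set is homogeneous if all n sensors share the same state set."""
    state_sets = [frozenset(s) for s in state_sets]
    return all(s == state_sets[0] for s in state_sets)
```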
Mining association rules from sensor set data

- An association rule is a rule, satisfying certain support and confidence restrictions, of the form
  X → Y
  where X and Y are two disjoint itemsets
- We redefine the support to reflect the time factor in sensor set data:
  supp(X) = lifespan(X) / length of history
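The redefined support can be computed directly from a timestamped stream, since each reading stays valid until the next one arrives. A minimal sketch (hypothetical helper name `supp`; readings are assumed sorted by time):

```python
def supp(X, readings, history_end):
    """Time-based support: supp(X) = lifespan(X) / length of history.
    readings: list of (t_i, itemset_i) sorted by time; each itemset
    stays valid until the next reading arrives (its lifespan)."""
    X = set(X)
    times = [t for t, _ in readings] + [history_end]
    lifespan = sum(times[i + 1] - times[i]
                   for i, (_, items) in enumerate(readings)
                   if X <= set(items))
    return lifespan / (history_end - readings[0][0])
```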
Transformations of sensor-set data

- The n-tuples <t, (v1, v2, ..., vn)> need transformation for finding frequent itemsets
- Transformation 1:
  - Each (zx, sy) pair, where zx is a sensor and sy a state of zx, is treated as an item in traditional association rule mining
  - Hence, the i-th n-tuple is transformed as
    <t(i+1) - ti, {(z1, v1), (z2, v2), ..., (zn, vn)}>
    where ti is the timestamp of the i-th n-tuple
  - Thus, association rules of the form
    {(z1, v1), (z2, v2), ..., (zn, vn)} → {(zx, vx)}
    can be obtained
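Transformation 1 can be sketched as follows (hypothetical function name `transform1`; sensors are identified by their 1-based position in the tuple):

```python
def transform1(tuples):
    """Transformation 1: each (sensor, value) pair (z_x, v_x) becomes an
    item. Input: list of (t_i, (v1, ..., vn)) sorted by time. Output:
    (lifespan, itemset) pairs with lifespan = t_{i+1} - t_i; the final
    n-tuple is dropped because its lifespan is not yet known."""
    out = []
    for (t_i, values), (t_next, _) in zip(tuples, tuples[1:]):
        items = {(x + 1, v) for x, v in enumerate(values)}  # (z_x, v_x)
        out.append((t_next - t_i, items))
    return out
```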
Transformations of sensor-set data

- Transformation 2:
  - Assuming a homogeneous sensor set, each s in the state set is treated as an item in traditional association rule mining
  - The i-th n-tuple is transformed as
    <t(i+1) - ti, {(e1, s1), (e2, s2), ..., (em, sm)}>
    where ti is the timestamp of the i-th n-tuple and ex is a boolean value showing whether the state sx exists in the tuple
  - Thus, association rules of the form
    {s1, s2, ..., sj} → {sk}
    can be obtained
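Transformation 2 is the same pass over the stream, but keyed by state rather than by (sensor, value) pair. A sketch (hypothetical name `transform2`), which keeps a state in the transformed itemset iff some sensor currently reports it:

```python
def transform2(tuples, state_set):
    """Transformation 2, for a homogeneous sensor set: each state s in
    the shared state set becomes an item, included iff some sensor
    currently reports it (the boolean e_x above). As in Transformation 1,
    the final n-tuple is dropped since its lifespan is unknown."""
    out = []
    for (t_i, values), (t_next, _) in zip(tuples, tuples[1:]):
        out.append((t_next - t_i, {s for s in state_set if s in values}))
    return out
```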
The Lossy Counting (LC) Algorithm for items

- The user specifies the support threshold s and error tolerance ε
- Transactions of single items are conceptually kept in buckets of size ⌈1/ε⌉
- At the end of each bucket, counts smaller than the error tolerance are discarded
- Counts of items, kept in a data structure D, are stored in the form (e, f, Δ), where
  - e is the item
  - f is the frequency of e since the entry was inserted into D
  - Δ is the maximum possible count of e before the entry was added to D
The Lossy Counting (LC) Algorithm for items

1.  D ← ∅; N ← 0
2.  w ← ⌈1/ε⌉; b ← 1
3.  e ← next transaction; N ← N + 1
4.  if (e, f, Δ) exists in D do
5.      f ← f + 1
6.  else do
7.      insert (e, 1, b-1) into D
8.  endif
9.  if N mod w = 0 do
10.     prune(D, b); b ← b + 1
11. endif
12. goto 3

D: the set of all counts
N: current length of the stream
e: transaction (a single item)
w: bucket width
b: current bucket id
The Lossy Counting (LC) Algorithm for items

1.  function prune(D, b)
2.  for each entry (e, f, Δ) in D do
3.      if f + Δ ≤ b do
4.          remove the entry from D
5.      endif
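The pseudocode above translates directly into a short one-pass implementation. A sketch in Python (hypothetical function name `lossy_count`; pruning is done inline rather than as a separate function):

```python
from math import ceil

def lossy_count(stream, eps):
    """One-pass Lossy Counting for single items.
    Keeps entries (f, delta) per item; bucket width w = ceil(1/eps).
    Any item with true count >= eps*N survives, and each kept count
    underestimates the true count by at most eps*N."""
    D = {}               # e -> [f, delta]
    w = ceil(1 / eps)    # bucket width (line 2 of the pseudocode)
    b = 1                # current bucket id
    N = 0                # current length of the stream
    for e in stream:
        N += 1
        if e in D:
            D[e][0] += 1                       # f <- f + 1
        else:
            D[e] = [1, b - 1]                  # insert (e, 1, b-1)
        if N % w == 0:                         # end of a bucket:
            D = {k: fd for k, fd in D.items()  # prune entries with
                 if fd[0] + fd[1] > b}         # f + delta <= b
            b += 1
    return D, N
```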
The Lossy Counting (LC) Algorithm for itemsets

- Transactions are kept in buckets
- Multiple (say m) buckets are processed at a time; the value of m depends on the amount of memory available
- For each transaction E, essentially every subset of E is enumerated and treated as if it were an item in the LC algorithm for items
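The subset enumeration at the heart of this step can be sketched as below (hypothetical helper name `all_subsets`; note the output size is exponential in |E|, which is why the itemset variant batches m buckets at a time):

```python
from itertools import combinations

def all_subsets(E):
    """Enumerate every non-empty subset of a transaction E; in the LC
    algorithm for itemsets each such subset is counted as if it were
    a single item."""
    E = sorted(E)
    for r in range(1, len(E) + 1):
        for sub in combinations(E, r):
            yield frozenset(sub)
```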
Extending the LC Algorithm for sensor-set data

We can extend the LC Algorithm for finding approximate frequency counts of itemsets for sensor-set data:

- Instead of using a fixed-size bucket, whose size is determined by ε, we can use a bucket which can hold an arbitrary number of transactions
- During the i-th bucket, when a count is inserted into D, we set
  Δ = ε · T1,i-1
  where Ti,j denotes the total time elapsed from bucket i up to bucket j
- At the end of the i-th bucket, we prune D by removing the counts such that
  Δ + f ≤ ε · T1,i
Extending the LC Algorithm for sensor-set data

1.  D ← ∅; N ← 0
2.  w ← (user-defined value); b ← 1
3.  E ← next transaction; N ← N + 1
4.  foreach subset e of E
5.      if (e, f, Δ) exists in D do
6.          f ← f + 1
7.      else do
8.          insert (e, 1, ε · T1,b-1) into D
9.      endif
10. if N mod w = 0 do
11.     prune(D, ε · T1,b); b ← b + 1
12. endif
13. goto 3

D: the set of all counts
N: current length of the stream
E: transaction (an itemset)
w: bucket width
b: current bucket id
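Putting the pieces together, here is a sketch of the extended algorithm operating on the (lifespan, itemset) transactions produced by the transformations. Names are hypothetical, and one assumption is worth flagging: the pseudocode above increments f by 1, but since the redefined support is time-based, this sketch accumulates the transaction's lifespan into f so that f estimates lifespan(e), consistent with the pruning threshold ε · T:

```python
from itertools import combinations

def lossy_count_ssd(transactions, eps, w):
    """Sketch of the extended LC algorithm for sensor-set data.
    transactions: iterable of (lifespan, itemset) pairs.
    w: user-chosen bucket width (transactions per bucket).
    Deltas and pruning thresholds use elapsed time eps*T instead of
    bucket ids; f accumulates lifespan (an assumption, see lead-in)."""
    D = {}        # frozenset -> [f, delta]
    N = 0         # number of transactions seen
    T = 0.0       # time elapsed since bucket 1, i.e. T_{1,b}
    T_prev = 0.0  # time elapsed up to the previous bucket, T_{1,b-1}
    for lifespan, E in transactions:
        N += 1
        T += lifespan
        E = sorted(E)
        for r in range(1, len(E) + 1):           # every non-empty subset of E
            for sub in combinations(E, r):
                key = frozenset(sub)
                if key in D:
                    D[key][0] += lifespan        # time-weighted count
                else:
                    D[key] = [lifespan, eps * T_prev]  # delta = eps*T_{1,b-1}
        if N % w == 0:                           # end of bucket b:
            D = {k: fd for k, fd in D.items()    # prune entries with
                 if fd[0] + fd[1] > eps * T}     # delta + f <= eps*T_{1,b}
            T_prev = T
    return D, T
```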
Observations

- The choice of w can affect the efficiency of the algorithm:
  - A small w may cause the pruning procedure to be invoked too frequently
  - A big w may cause too many transactions to be kept in memory
  - It may be possible to derive a good w w.r.t. the mean lifespan of the transactions
- If the lifespans of the transactions are short, we potentially need to prune D frequently
- The difference between adjacent transactions may be small
Future work

- Evaluate the efficiency of the LC Algorithm for sensor-set data
- Investigate how to exploit the observation that adjacent transactions may be very similar
Q&A