James Cheng, Yiping Ke, Wilfred Ng

Download Report

Transcript James Cheng, Yiping Ke, Wilfred Ng

LOGO
Maintaining Frequent
Itemsets over High-Speed
Data Streams
James Cheng, Yiping Ke, Wilfred Ng
(PAKDD 2006)
Advisor:Dr. Koh Jia-Ling
Speaker:Tu Yi-Lang
Date:2007.10.30
1
Introduction
Frequent itemset (FI) mining is fundamental to
many important data mining tasks.
Recently, the increasing prominence of data
streams has led to the study of online mining of FIs.
Due to the constraints on both memory
consumption and processing efficiency.
 seek to approximate FIs over streams.
2
Introduction
Existing approximation techniques for mining FIs
are mainly false-positive.
Using an error parameter,ε, to control the
accuracy ( quality ) of the mining result.
 A smallε:more accurate mining result but
enormously larger number of itemsets to be
maintained.
 A bigε:degrade the mining accuracys.
3
Introduction
This paper proposes a false-negative approach
to mine FIs over high-speed data streams.
The method here places greater importance on
recent data by adopting a sliding window model.
We consider as a relaxed minimum support
threshold and propose to progressively increase
the value of for an itemset as it is kept longer in
a window.
4
Introduction
Compare:False-positive vs. False-negative
False-positive
FIs
False-negative
Mined
FIs
Mined
5
Preliminaries
I = {x1, x2, . . . , xm} be a set of items.
A transaction, X, is an itemset.
A transaction data stream is a continuous
sequence of transactions.
A window( a time interval ) in the stream is a set
of successive time units.
共有j – i + 1個time units
6
Preliminaries
A sliding window in the stream is a window that
slides forward for every time unit.
 tτ denotes the current time unit.
 w is called the size of the window.
共w個時間單位
 current window is W = < tτ−w+1, . . . , tτ > .
trans(T) as the set of transactions that arrive on
the stream in a time interval T.
7
Preliminaries
|trans(T)| as the number of transactions in trans(T).
The support of an itemset X over T, sup(X, T).
Minimum Support Threshold (MST), σ (0 ≤ σ ≤ 1).
X is a frequent itemset (FI) over T if sup(X, T) ≥
σ|trans(T)|.
8
Preliminaries
w = 2 , (w:the size of the window)
If the minimum support required is 3 (σ = 3/5 in
both windows, σ * |tans(T)| = 3/5 * 5 = 3 )
The set of FIs over W1 and W2 are {b, c, bc}
and {b, c, d, bd}.
W1
W2
9
A Progressively Increasing MST
Function
Considering = rσ as a relaxed MST, where r (0 ≤
r ≤ 1) is the relaxation rate, to mine the set of FIs
over each time unit t in the sliding window.
Since all itemsets whose support is less than
rσ|trans(t)| are discarded.
10
A Progressively Increasing MST
Function
a time unit
a time interval
11
A Progressively Increasing MST
Function
τ-(τ-w+1)+1=w
k:在size為w的時間間隔中取某k段時間;也就是與現在時間tτ相距k時間單位
的時間間隔。ex:k=2, T2=<tτ-1, tτ>。
12
Mining FIs over a Sliding Window
Using a prefix tree to keep the semi-FIs.
A node in the prefix tree represents an itemset,
X, and has three fields:
 item which is the last item of X
 uid(X):ID of the time unit, tuid(X):X在哪一個時間單位
被插入prefix tree中
 sup(X) which is the computed support of X since tuid(X)
13
Algorithm 1 (MineSW)
14
Algorithm 1 (MineSW)
15
Algorithm 1 (MineSW)
16
Experimental Evaluation
On Sun Ultra-SPARC III with 900MHz CPU and
4GB RAM, and two types of data streams:
 t10i4 and t15i6
Compare MineSW algorithm with Lossy
Counting algorithm(ε = rσ in LCSW).
When r increases from 0.1 to 1, the precision of
LCSW drops from 98% to around 10%, while the
recall of MineSW only drops from 99% to around
90%.
17
Experimental Evaluation
(ε = rσ in LCSW)左圖ε= 0.1σ,
前面有提到ε越小越accuracy,
ε00.5=0.1*0.05 < ε0.1=0.1*0.1,
18
(ε = rσ in LCSW)左圖σ=0.1%
前面有提到ε越小越accuracy,
ε0.1=0.1*σ < ε0.2=0.2*σ,
So Preccision0.1 > Preccision0.2
19
Experimental Evaluation
20
Conclusions
Proposing a progressively increasing minimum
support function, which allows us to increaseεat
the expense of only slightly degraded accuracy.
The MineSW algorithm is significantly faster and
consumes less memory than existing algorithms,
while attains the same level of accuracy.
21