James Cheng, Yiping Ke, Wilfred Ng
Download
Report
Transcript James Cheng, Yiping Ke, Wilfred Ng
LOGO
Maintaining Frequent
Itemsets over High-Speed
Data Streams
James Cheng, Yiping Ke, Wilfred Ng
(PAKDD 2006)
Advisor:Dr. Koh Jia-Ling
Speaker:Tu Yi-Lang
Date:2007.10.30
1
Introduction
Frequent itemset (FI) mining is fundamental to
many important data mining tasks.
Recently, the increasing prominence of data
streams has led to the study of online mining of FIs.
Due to the constraints on both memory
consumption and processing efficiency.
seek to approximate FIs over streams.
2
Introduction
Existing approximation techniques for mining FIs
are mainly false-positive.
Using an error parameter,ε, to control the
accuracy ( quality ) of the mining result.
A smallε:more accurate mining result but
enormously larger number of itemsets to be
maintained.
A bigε:degrade the mining accuracys.
3
Introduction
This paper proposes a false-negative approach
to mine FIs over high-speed data streams.
The method here places greater importance on
recent data by adopting a sliding window model.
We consider as a relaxed minimum support
threshold and propose to progressively increase
the value of for an itemset as it is kept longer in
a window.
4
Introduction
Compare:False-positive vs. False-negative
False-positive
FIs
False-negative
Mined
FIs
Mined
5
Preliminaries
I = {x1, x2, . . . , xm} be a set of items.
A transaction, X, is an itemset.
A transaction data stream is a continuous
sequence of transactions.
A window( a time interval ) in the stream is a set
of successive time units.
共有j – i + 1個time units
6
Preliminaries
A sliding window in the stream is a window that
slides forward for every time unit.
tτ denotes the current time unit.
w is called the size of the window.
共w個時間單位
current window is W = < tτ−w+1, . . . , tτ > .
trans(T) as the set of transactions that arrive on
the stream in a time interval T.
7
Preliminaries
|trans(T)| as the number of transactions in trans(T).
The support of an itemset X over T, sup(X, T).
Minimum Support Threshold (MST), σ (0 ≤ σ ≤ 1).
X is a frequent itemset (FI) over T if sup(X, T) ≥
σ|trans(T)|.
8
Preliminaries
w = 2 , (w:the size of the window)
If the minimum support required is 3 (σ = 3/5 in
both windows, σ * |tans(T)| = 3/5 * 5 = 3 )
The set of FIs over W1 and W2 are {b, c, bc}
and {b, c, d, bd}.
W1
W2
9
A Progressively Increasing MST
Function
Considering = rσ as a relaxed MST, where r (0 ≤
r ≤ 1) is the relaxation rate, to mine the set of FIs
over each time unit t in the sliding window.
Since all itemsets whose support is less than
rσ|trans(t)| are discarded.
10
A Progressively Increasing MST
Function
a time unit
a time interval
11
A Progressively Increasing MST
Function
τ-(τ-w+1)+1=w
k:在size為w的時間間隔中取某k段時間;也就是與現在時間tτ相距k時間單位
的時間間隔。ex:k=2, T2=<tτ-1, tτ>。
12
Mining FIs over a Sliding Window
Using a prefix tree to keep the semi-FIs.
A node in the prefix tree represents an itemset,
X, and has three fields:
item which is the last item of X
uid(X):ID of the time unit, tuid(X):X在哪一個時間單位
被插入prefix tree中
sup(X) which is the computed support of X since tuid(X)
13
Algorithm 1 (MineSW)
14
Algorithm 1 (MineSW)
15
Algorithm 1 (MineSW)
16
Experimental Evaluation
On Sun Ultra-SPARC III with 900MHz CPU and
4GB RAM, and two types of data streams:
t10i4 and t15i6
Compare MineSW algorithm with Lossy
Counting algorithm(ε = rσ in LCSW).
When r increases from 0.1 to 1, the precision of
LCSW drops from 98% to around 10%, while the
recall of MineSW only drops from 99% to around
90%.
17
Experimental Evaluation
(ε = rσ in LCSW)左圖ε= 0.1σ,
前面有提到ε越小越accuracy,
ε00.5=0.1*0.05 < ε0.1=0.1*0.1,
18
(ε = rσ in LCSW)左圖σ=0.1%
前面有提到ε越小越accuracy,
ε0.1=0.1*σ < ε0.2=0.2*σ,
So Preccision0.1 > Preccision0.2
19
Experimental Evaluation
20
Conclusions
Proposing a progressively increasing minimum
support function, which allows us to increaseεat
the expense of only slightly degraded accuracy.
The MineSW algorithm is significantly faster and
consumes less memory than existing algorithms,
while attains the same level of accuracy.
21