Transcript Document

Sequential PAttern Mining
using A Bitmap Representation
Jay Ayres, Johannes Gehrke
Tomi Yiu, Jason Flannick
ACM SIGKDD 2002
Outline
Introduction
The SPAM Algorithm
Data Representation
Experimental Evaluation
Conclusion
Introduction
Sequential Patterns Mining :依據顧客購買商
品的交易資料庫來進行分析,用來找出顧客購買
商品行為是否存在商品間先後順序關係。
Input
I={a, b, c, d}
CID
TID
Itemset
1
1
{a, b, d}
1
3
{b, c, d}
1
6
{b, c, d}
2
2
{b}
2
4
{a, b, c}
3
5
{a, b}
3
7
{b, c, d}
Dataset sorted by CID and TID
Output
({b}, {d, c}, {c})
Introduction
(Frequent)
Sa:A Sequence.
Example:({a}, {c})
supD(Sa):The support of Sa in D.
Example:Sa=({a},{b, c}), we know supD(Sa)=2
Given a support threshold minSup, a Sequence Sa is
called a frequent sequential pattern on D
if supD(Sa)  minSup.
Example:minSup=2
Sa is deemed frequent.
CID
Sequence
1
({a, b, d}, {b, c, d}, {b, c, d})
2
({b}, {a, b, c})
3
({a, b}, {b, c, d})
Sequence for each customer
The SPAM Algorithm
(Lexicographic Tree for Sequences)
Consider all sequences arranged in a sequence
tree with the following structure:
The root of the tree is labeled with .
{}
Sequence
I={a, b}
a
Sequence
a, a
a, a, a
a, a, b
a, b
a, (a, b)
(a, b)
Figure 1:
The Lexicographic Sequence Tree
The SPAM Algorithm
(Lexicographic Tree for Sequences)
Each sequence in the sequence tree can be
considered as either a sequence-extended
sequence or an itemset-extended sequence.
{}
I={a, b}
a
a, a
a, a, a
a, a, b
a, b
a, (a, b)
(a, b)
Figure 1:
The Lexicographic Sequence Tree
The SPAM Algorithm
(Lexicographic Tree for Sequences)
For example: Sa= ({a, b, c}, {a, b})
sequence-extended sequence of Sa.
itemset-extended sequence of Sa.
({a, b, c}, {a, b})
S-Step
({a, b, c}, {a, b}, {d})
=S-Step
=I-Step
({a, b, c}, {a, b})
I-Step
………
({a, b, c}, {a, b, d})
The SPAM Algorithm
(Lexicographic Tree for Sequences)
Sn, the set of candidate items that are
considered for a possible S-step extensions of
node n (abbreviated s-extensions).
Example : S({a})={a, b, c, d}
a
CID
Sequence
1
({a, b, d}, {b, c, d}, {b, c, d})
2
({b}, {a, b, c})
3
({a, b}, {b, c, d})
Sequence for each customer
S-Step
a, a
a, b
a, c
a, d
The SPAM Algorithm
(Lexicographic Tree for Sequences)
In, which identifies the set of candidate
items that are considered for a possible Istep extensions (abbreviated, i-extensions).
Example : I({a})={b, c, d}
a
I-Step
CID
Sequence
1
({a, b, d}, {b, c, d}, {b, c, d})
2
({b}, {a, b, c})
3
({a, b}, {b, c, d})
Sequence for each customer
(a, b)
(a, c)
(a, d)
The SPAM Algorithm
(Depth First Tree Traversal)
A standard depth-first manner.
I={a, b}
supD(S)  minSup,
store S and repeat DFS.
{}
a
supD(S) < minSup,
not need to repeat DFS.
a, a
a, b
(a, b)
A huge search space.
a, a, a a, a, b a, (a, b)
Figure 1: The Lexicographic Sequence Tree
The SPAM Algorithm
(Pruning)
S-step Pruning
Suppose we are at node ({a}) in the tree and suppose that S({a})
= {a, b, c, d}, I({a}) = {b, c, d}.
Suppose that ({a}, {c}) and ({a}, {d}) are not frequent.
By Apriori principle;
({a},{a},{c}) ({a},{b},{c}) ({a},{a,c}) ({a},{b,c})({a},{a},{d})
({a},{b},{d}) ({a},{a,d}) ({a},{b,d})are not frequent
Hence, when we are at node ({a}, {a}) or ({a}, {b}), we do not
have to perform I-step or S-step using items c and d, i.e.
S({a},{a}) = S({a},{b}) = {a, b}
I({a},{a}) = {b}, and I({a}, {b}) =  .
a
a,a
a,b
a,c
a,d
a,a,a
{a,b} {a,c} {a,d}
a,{b,d}
a,{b,c}
a,a,b
a,{a,d}
a,a,c
a,a,d
a,{a,c}
a,{a,b}
a,b,d
a,b,a
a,b,b
a,b,c
The SPAM Algorithm
(Pruning)
I-step Pruning (Example)
Let us consider the same node ({a}) described in the
previous section. The possible itemset-extended
sequences are ({a, b}), ({a, c}), and ({a, d}). If ({a,
c}) is not frequent, then ({a, b, c}) must also not be
frequent by the Apriori principle.
Hence, I({a, b}) = {d}, S({a,b}) = {a, b}, and
S({a, d}) = {a, b}.
a
{a,b}
{a,d}
{a,c}
{a,b,d}
{a,d},a
{a,b},a
{a,b},b
{a,b,c}
{a,b},d
{a,b},c
…………
{a,d},b
Data Representation
(Data Structure)
Our algorithm uses a vertical bitmap representation of
the data.
A vertical bitmap is created for each item in the dataset,
and each bitmap has a bit corresponding to each
transaction in the dataset.
If item i appears in transaction j, then the bit
corresponding to transaction j of the bitmap for item i is
set to one; otherwise, the bit is set to zero.
Data Representation
(Data Structure)
If the size of a sequence is between 2k + 1 and 2k+1, we
consider it as a 2K+1-bit sequence.
(the size of a sequence =3), k=1, 21+1-bit=4-bit
CID
TID
Itemset
1
1
{a, b, d}
1
3
{b, c, d}
1
6
{b, c, d}
2
2
{b}
2
4
{a, b, c}
3
5
{a, b}
3
7
{b, c, d}
Dataset sorted by CID and TID
CID
TID
{a}
{b}
{c}
{d}
1
1
1
-
1
3
6
-
1
0
0
0
1
1
1
0
0
1
1
0
1
1
1
0
2
2
-
2
4
-
0
1
0
0
1
1
0
0
0
1
0
0
0
0
0
0
3
3
-
5
7
-
1
0
0
0
1
1
0
0
0
1
0
0
0
1
0
0
Data Representation
(Candidate Generation) S-step Process
We first generate a bitmap from B(sa) such that all
bits less than or equal to k are set to zero, and all bits
after k are set to one.
We call this bitmap a transformed bitmap.
We then AND the transformed bitmap with the item
bitmap.
The resultant bitmap has the properties described
above and it is exactly the bitmap for the generated
sequence.
Data Representation
(Candidate Generation) S-step Process
({a})
({a})s
{b}
({a},{b})
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
0
0
1
1
1
1
0
0
0
1
0
0
1
0
0
0
S-step
process
0
1
1
1
&
result
0
0
0
0
1
1
0
0
0
1
0
0
CID
Sequence
1
({a, b, d}, {b, c, d}, {b, c, d})
2
({b}, {a, b, c})
3
({a, b}, {b, c, d})
Data Representation
(Candidate Generation) I-step Process
({a},{b})
{d}
({a},{b, d})
0
1
1
0
1
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
1
0
0
&
0
1
0
0
result
0
0
0
0
CID
Sequence
1
({a, b, d}, {b, c, d}, {b, c, d})
2
({b}, {a, b, c})
0
1
0
0
3
({a, b}, {b, c, d})
Experimental Evaluation
(ComparisonWith SPADE and PrefixSpan)
We compared the three algorithms on several small,
medium, and large datasets for various minimum
support values.
This set of tests shows that SPAM outperforms SPADE
by about a factor of 2.5 on small datasets and better than
an order of magnitude for reasonably large datasets.
PrefixSpan outperforms SPAM slightly on very small
datasets, but on large datasets SPAM outperforms
PrefixSpan by over an order of magnitude.
Conclusion
We presented an algorithm to quickly find all
frequent sequences in a list of transactions.
The algorithm utilizes a depth-first traversal of the
search space combined with a vertical bitmap
representation to store each sequence.
Experimental results demonstrated that our
algorithm outperforms SPADE and PrefixSpan on
large datasets