Mining Sequential Patterns
Download
Report
Transcript Mining Sequential Patterns
Mining Sequential
Patterns
Rakesh Agrawal
Ramakrishnan Srikant
Proc. of the Int’l Conference on Data
Engineering (ICDE) March 1995
Presenter: Phil Schlosser
Topics
► Introduction
► Problem
Statement
► Finding Sequential Patterns (The Algorithm)
► Sequence Phase (AprioriAll, AprioriSome,
DynamicSome
► Conclusion (Performance/FE
Questions/Future work).
Introduction
► Bar
code data allows us to store massive
amounts of sales data (basket data).
► We would like to extract sequential buying
patterns.
► E.g. Video rental: Star Wars; Empire
Strikes Back; Return of the Jedi.
► Can include multiple elements.
► E.g. Sheet and pillow case; comforter;
drapes.
Problem Statement
► Given
a database of transactions:
► Each transaction:
- Customer ID
- Transaction Time
- Set of items purchased
* No two transactions have same customer ID and
transaction time
* Don’t consider quantity of items purchased.
Item was either purchased or not purchased
Problem Statement (terms)
Itemset: non-empty set of items. Each item set mapped to
an integer.
► Sequence: Ordered list of itemsets.
► Customer-Sequence: List of customer transactions ordered
by increasing transaction time.
► (A customer supports a sequence if the sequence is
contained in the customer-sequence.)
► Support for a sequence: Fraction of total customers that
support a sequence.
► Maximal Sequence: A sequence that is not contained in
any other sequence.
► Large Sequence: Sequence that meets minisup.
►
Example
Customer
ID
Transaction
Time
Items
Bought
1
June 25 '93
30
1
{ (30) (90) }
1
June 30 '93
90
2
{ (10 20) (30) (40 60 70) }
2
June 10 '93
10,20
3
{ (30) (50) (70) }
2
June 15 '93
30
2
June 20 '93
40,60,70
4
{ (30) (40 70) (90) }
3
June 25 '93
30,50,70
5
{ (90) }
4
June 25 '93
30
4
June 30 '93
40,70
Maximal sequences with support > 25%
4
July 25 '93
90
{ (30) (90) }
5
June 12 '93
90
{ (30) (40 70) }
Customer
ID
Customer Sequence
Note: Use Minisup of 25%
{ (10 20) (30) } Does not have minisup (Only supported by Cust. 2)
{ (30) }, { (70) }, { (30) (40) } … are not maximal.
The Algorithm
► Sort
Phase
► Litemset Phase
► Transformation Phase
► Sequence Phase (The Biggie)
► Maximal Phase
Sort Phase
► Sort
the database:
► Customer ID as the
major key.
► Transaction-Time as
the minor key.
Customer
ID
Transaction
Time
Items
Bought
1
June 25 '93
30
1
June 30 '93
90
2
June 10 '93
10,20
2
June 15 '93
30
2
June 20 '93
40,60,70
3
June 25 '93
30,50,70
4
June 25 '93
30
4
June 30 '93
40,70
4
July 25 '93
90
5
June 12 '93
90
Litemset Phase
► Litemset
(Large
Itemset):
► Supported by fraction
of customers greater
than minisup.
Customer
ID
Transaction
Time
Items
Bought
1
June 25 '93
30
1
June 30 '93
90
2
June 10 '93
10,20
2
June 15 '93
30
Large Itemsets
Mapped To
2
June 20 '93
40,60,70
(30)
1
3
June 25 '93
30,50,70
(40)
2
4
June 25 '93
30
(70)
3
4
June 30 '93
40,70
(40 70)
4
4
July 25 '93
90
(90)
5
5
June 12 '93
90
Transformation Phase
►
►
Replace each transaction with all litemsets contained in the
transaction.
Transactions with no litemsets are dropped. (Still considered for
support counts)
Original Customer
Sequence
Transformed Custumer Sequence
After Mapping
1
{ (30) (90) }
<{(30)} {(90)}>
<{1} {5}>
2
{ (10 20) (30) (40 60 70) }
<{(30)} {(40),(70),(40 70)}>
<{1} {2,3,4}>
3
{ (30) (50) (70) }
<{(30),(70)}>
<{1,3}>
4
{ (30) (40 70) (90) }
<{(30)} {(40),(70),(40 70)} {(90)}>
<{1} {2,3,4} {5}>
5
{ (90) }
<{(90)}>
<{5}>
Customer ID
Note: (10 20) dropped because of lack of support.
(40 60 70) replaced with set of litemsets {(40),(70),(40 70)} (60 does not
have minisup)
Sequence Phase
► Finds
Large Sequences:
► AprioriAll
► AprioriSome
► DynamicSome
► Stay Tuned….
Maximal Phase
► Find
maximal sequences among large
sequences.
► k-sequence: sequence of length k
► S set of all large sequences
►
for (k=n; k>1; k--) do
foreach k-sequence sk do
Delete from S all subsequences of sk
Authors claim data-structures and an algorithm exist
to do this efficiently. (hash trees)
Sequence Phase
► Sequence
phase finds the large sequences.
► Two families of algorithms:
*****CountSome*****
*****CountAll*****
Final Exam Question #2
► There
were two types of algorithms presented to
find sequential patterns, CountSome and CountAll.
What was the main difference between the two
algorithms?
► CountAll (AprioriAll) is careful with respect to
minimum support, careless with respect to
maximality.
CountSome (AprioriSome) is careful with respect
to maximality, careless with respect to minimum
support.
AprioriAll
L1 = {large 1-sequences}
for (k = 2; Lk-1 ≠ {}; k++) do
begin
Ck = New candidates generated from Lk-1
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c.
Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in UkLk
Notation:
Lk: Set of all large k-sequences
Ck: Set of candidate k-sequences
AprioriAll (Example)
Customer Sequences
<{1 5} {2} {3} {4} >
L1
1-Seq
L2
Support
2-Seq
L3
Support
3-Seq
Support
<1>
4
<1 2>
2
<1 2 3>
2
<2>
2
<1 3>
4
<1 2 4>
2
<{1} {2} {3} {4}>
<3>
4
<1 4>
3
<1 3 4>
3
<{1} {3} {5}>
<4>
4
<1 5>
3
<1 3 5>
2
<{4} {5}>
<5>
4
<2 3>
2
<2 3 4>
2
<2 4>
2
<3 4 5>
1
<3 4>
3
<3 5>
2
<4 5>
2
<2 5>
0
<{1} {3} {4} {3 5}>
L4
4-Seq
<1 2 3 4>
Support
2
Answer: <1 2 3 4>, <1 3 5>, <4 5>
Minisup = 25%
AprioriSome (Forward Phase)
L1 = {large 1-sequences}
C1 = L1
last = 1
for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do
begin
if (Lk-1 known) then
Ck = New candidates generated from Lk-1
else
Ck = New candidates generated from Ck-1
if (k==next(last)) then begin // (next k to count?)
foreach customer-sequence c in the database do
Increment the count of all candidates in Ck
that are contained in c.
Lk = Candidates in Ck with minimum support.
last = k;
end
end
AprioriSome (Backward Phase)
for (k--; k>=1; k--) do
if (Lk not found in forward phase) then begin
Delete all sequences in Ck contained in
some Li i>k;
foreach customer-sequence c in DT do
Increment the count of all candidates in Ck
that are contained in c
Lk = Candidates in Ck with minimum support
end
else // lk already known
Delete all sequences in Ck contained in
some Li i>k;
Answer = UkLk //(Maximal Phase not Needed)
Notation: DT; Transformed database
DynamicSome (Just the basics)
► Similar
to AprioriSome
► AprioriSome generates Ck from Ck-1
► DynamicSome generates Ck “on the fly”
Final Exam Question #1
► What
was the greatest hardware concern
regarding the algorithms contained in the
paper?
► Main memory capacity. When there is little
main memory, or many potentially large
sequences, the benefits of AprioriSome
vanish.
Final Exam Question #3
► How
did the two best sequence mining
algorithms (AprioriAll and AprioriSome)
perform compared with each other? Take
into consideration memory, speed, and
usefulness of the data.
Final Exam Question #3
► Memory:
In terms of main memory usage, AprioriAll
is better.
In terms of secondary storage access,
AprioriSome is better.
Final Exam Question #3
► Speed:
With sufficient memory, as minimum
support decreases the difference between
AprioriAll and AprioriSome increases.
(AprioriSome is better.)
More large sequences not maximal are
generated.
Final Exam Question #3
► Usefulness
of the data:
For the problem of finding maximal large
sequences, the answer is “Precisely the same.”.
However, AprioriAll finds all large sequences, while
AprioriSome discards some large sequences that
aren’t maximal. AprioriAll, then, generates more
“useful” data.
“The user may want to know the ratio of the
number of people who bought the first k + 1 items
in a sequence to the number of people who
bought the first k items.”