Market Basket Analysis - University of Windsor

Download Report

Transcript Market Basket Analysis - University of Windsor

60-520:Seminar
Market Basket Analysis:
An Approach to
Association Rule Mining in Multiple Store Environment.
_______________________________________
Presented By
Robyat Hossain
University of Windsor
February 10, 2006
Market Basket Analysis:
An Approach to
Association Rule Mining in Multiple Store Environment.
______________________________________________________________
Agenda

Introduction

General concept

Discussion on proposed ideas

Conclusion
Market Basket Analysis:
An Approach to
Association Rule Mining in Multiple Store Environment.
______________________________________________________________

Introduction


MBA
Purpose
Peoples attitude

Benefit

Inside of store environment
layout changes

Outside of store
product bundling
direct marketing

Outside of realm of marketing
stock inventory
operation strategis
Market Basket Analysis:
An Approach to
Association Rule Mining in Multiple Store Environment.
___________________________________________________

General Concept
 The
approach
 Methods, Measures, Results.
 Literature overview on Traditional Association
rules.
Market Basket Analysis
General Concept: methods
_____________________________

Method:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
Frozen
Pizza
Milk
Cola
Potato
Chips
Pretzel
s
Frozen Pizza
2
1
2
0
0
Milk
1
3
1
1
1
Cola
2
1
3
0
1
Potato Chips
0
1
0
1
0
Pretzels
0
1
1
0
2
Hints that frozen pizza and cola may sell well
together, and should be placed side-by-side in the
convenience store..
Results:
we could derive the association rules:
If a customer purchases Frozen Pizza, then they will probably purchase Cola.
If a customer purchases Cola, then they will probably purchase Frozen Pizza.
Market Basket Analysis
General Concept: Measures
_____________________________

Measures:

Suport
Suport = (containing the item combination) /( total number of record.)
Let the rule Is "If a customer purchases Cola, then they will purchase Frozen Pizza"
The support for this
= 2 (number of transaction that include both Cola and Frozen Pizza is) / 5(total records )
= 40%.

Confidence:
Confidence of a rule = the support for the combination / the support for the condition.
For the rule "If a customer purchases Milk, then they will purchase Potato Chips"
confidence = support for the combination (Potato Chips + Milk) is 20%/
condition (Milk) is 60%,
=33%
Suport and confidence are used to select the association rules.
support for the
Market Basket Analysis
Discussion on proposed ideas
_______________________________________________________
 Problem Definition
 Assumptions
 Four Definitions ( Definition 1,2,3,4)
 Algorithm
 General concepts developing the algorithm
 Discussion on several key steps
 Performance Evaluation
 Data generation
 Performance measure
 Simulation study

Market Basket Analysis
Proposed ideas :Problem Definition
_____________________________________________________
Assumptions:
D = a database , contains transactional records from multiple stores over time period T
Let I={11, 12,. . ., Ir } be the set of product items in D
where Ik (1 ≤ k ≤ r ) is the identifier for the kth item.
Let X be a set of items in I. We refer X as a k itemset if |X| = k.
W (, D)  {s s  D  X  s} denote the set of transactions S(subset of I) in D, contain itemset X.
Let {T1, T2,. . ., Tm} be the set of mutually disjoint time intervals (periods)
Let P={P1, P2,. . ., Pq} set of stores, where Pj (1 ≤ j≤ q) denotes the jth store in the store chain
Definition 1. The support of X, denoted by sup(X, D), is the fraction of transactions containing X in
database D; i.e., sup(X, D) = |W(X, D)|/|D|. For a specified support threshold σs, X is a
frequent itemset if sup(X, D)≥σs.
Definition 2. Let X be an itemset in I with context VX, and DV X the subset of transactions in D
whose timestamps t and store identifiers p satisfy VX. We define the relative support of X w.r.t
the context VX, denoted by rel _ sup( X , DVX ), as | W ( X , DVX ) | / | DVX |
For a given threshold for relative support σr, if a frequent itemset X satisfies rel _ sup( X , DV ,)   c
we call X a relative-frequent (RF) itemset.
X
Market Basket Analysis
Problem Definition
_____________________________________________________
Let SK  P and RK  T be the sets of the stores and times that item Ik is sold,
V Ik = S k X R k as the context of item Ik;(combination of stores & time where sold)
the context of itemset X , consists of two items Ik and Ik’
denoted by VX = V Ik ∩ V Ik’ .(common stores & time shared by items in x)
Definition 3. Consider two itemsets X and Y. The relative support of X
with respect to the context V , denoted by rel _ sup( X , DV )
is defined
as | W ( X , DV ) | / | DV | . The confidence of rule X => Y, denoted by conf
(X => Y), is defined as rel _ sup( X  Y , DV ) / rel _ sup( X , DV )
X Y
X Y
X Y
X Y
X Y
X Y
Definition 4. Let Z be an RF itemset, where Z = XuY, X  I and Y  I \ X.
Given a confidence threshold σc, if conf(X => Y)≥ σc, we call X => Y a
store-chain (SC) association rule, and Vxuy as the context of the rule.
Market Basket Analysis
Algorithm
_____________________________________________________
Market Basket Analysis
Algorithm: General concepts
___________________________

We use RFk ,Fk, Ck to denote the set of all relative-frequent k-itemsets ,
frequent k-itemsets ,candidate k-itemsets.

As the first step of the algorithm, we build a table, called the PT table,

In the first phase, we scan the database for the first time and build a twodimensional table, called the TS table. The entries denoted by TS(Ti, Pj) ,
records the number of transactions that occur at store Pj in period Ti.

In the kth phase of the algorithm, we first derive Ck , and, then, generate Fk
by evaluating their supports, through scanning the database and removing all
infrequent itemsets and also generate RFk from Fk.

Using this TS table and the PT table for a given itemset X, we can determine
the | D |.
VX
Market Basket Analysis
Algorithm :Key steps
_______________________________

Methods of




Building the PT(Store-Time) table
Building of TS (transaction)table
Finding of RFk (Relative –frequent itemset)
Generating the store-chain association rule
Market Basket Analysis
Algorithm: Building the PT table
_____________________________________________________



Hold time & store information for each item
PT table from bit matrices for individual items
PT table for Item I1 I2 I3
Bit Matrices for Items I1 I2 I3
Bit matrix for Item I1
I1
I2
I3
P1
4
0
13
P2
125
3
46
P3
6
14
13
P4
136
36
14
P5
125
12
0
P6
5
15
126
Bit matrix for Item I2
Bit matrix for Item I3
T1
T2
T3
T4
T5
T6
P1
1
1
1
1
1
1
0
P2
1
1
0
0
0
1
0
P3
0
0
0
1
1
1
0
P4
1
1
0
1
1
0
0
P5
0
1
1
1
0
0
P6
0
0
T1
T2
T3
T4
T5
T6
P1
1
1
1
0
0
0
P2
0
1
1
1
0
P3
1
1
1
1
P4
0
0
1
P5
0
1
P6
1
1
T1
T2
T3
T4
T5
T6
P1
0
0
1
1
1
1
0
P2
1
1
1
0
0
1
1
1
P3
0
0
1
1
1
1
0
0
1
P4
0
0
0
1
1
1
1
1
1
1
P5
1
1
1
1
1
1
0
0
1
1
P6
0
1
1
1
1
0
Market Basket Analysis
Algorithm: Building the PT table
_____________________________________________________
The method to compute the jth row of PT table for Itemset X
PT table for Item I1 I2
PT table for Itemset
Itemset X(I1,I2)
I1
I2
P1
4
0
P1
P2
125
3
P2
P3
6
14
P3
P4
136
36
P4
P5
125
12
P5
P6
5
15
P6
123
Market Basket Analysis
Algorithm: Building the TS table
_____________________________________________________

The TS table



At first phase of the algorithm , build
the TS table,
The entries denoted by TS(Ti, Pj) ,
records the number of transactions
that occur at store Pj in period Ti.
Using the TS and PT tables for
itemset X, we can determine the
value | D | by summing all the
values in the entries of the TS table
The process of constructing the table
is described in lines 2 through 4 in
algorithm
VX
T1
T2
T3
T4
T5
T6
P1
23
23
5
42
23
56
P2
93
42
12
39
47
23
P3
43
41
8
70
43
59
P4
21
32
43
34
10
21
P5
32
42
30
64
34
32
P6
45
12
16
90
12
65
An example of TS table
Market Basket Analysis
Algorithm: Computing RFk ,Fk
_____________________________________________________
Here is an example of the computation process:

Suppose ,15 periods( T1 to T15), #of transactions are 19, 17, 14, 25, 20, 17, 15, 27,
21, 20, 22, 18, 25, 21, and 19,

selling periods of product A ,T1 to T10, and involved in 60 transactions
selling periods of product B ,T6 to T15, and involved in 80 transactions
50 transactions containing both A and B, and sold in periods T6 to T10.



To compute supports and relative supports for itemsets {A}, {B}, and {A, B},
| W ({ A}, DV ) || W ({ A}, D) |
= 60, | W ({B}, DV ) || W ({B}, D) | = 80, and
| W ({ A, B}, DV ) || W ({ A, B}, D) | = 50 and |D| = 300,
sup({A}, D) = 60/300 = 0.2, sup({B}, D) = 80/300 = 0.267, and sup({A, B}, D) = 50/ 300
= 0.167.
{ A}
{ B}
{ A, B }


Bases for relative support are | DV | = 195, | DV |=205, and | DV |= 100,
rel _ sup({ A}, DV =
) 60/195 = 0.308,
rel _ sup({ B}, DV ) = 80/205 = 0.39,
rel _ sup({ B}, DV )= 50/100 = 0.5
Suppose we set rs at 0.1 and rr at 0.35. Then, {A}, {B}, and {A, B} are frequent.
{A} is not relative-frequent, but {B} and {A, B} are relative-frequent.
{B }
{ A}



{ A}
{B }
{B }
{ A B }
Market Basket Analysis
Performance Evaluation: Data
______________________________________________

Data generation
Dij is number of transaction of store i & period j
Ni is number of products in store i
Market Basket Analysis
Performance Evaluation: Measures
_____________________________

Performance measures
Three measures (errors type A,B,C) for assessing the magnitudes of
the deviations in support, confidence, and the number of association
rules when we use the traditional association rules for the store-chain
data.
Type A error measures the relative difference in the support levels
= rel _ sup( X , DV )  sup( X , D) / rel _ sup( X , DV )
=(.03 - .02 )/.03
= 33.33%
X
X
Type B is used to compare the difference in confidence levels
= conf(X=>Y) - conf V( X=>Y ))/conf (X=>Y), where conf V(X=>Y) is the rule confidence computed by
the traditional methods.
The type C error is used to compare the relative difference in the numbers of rules
generated by the two methods.
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________
Simulation Study




The effect of number of stores & periods
The effect of stores size
The effect of different product replacement rate
Summarization
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________
The table Data_Set_Table used in different case for simulation study
Data_Set_Table:
Data set #of stores #of periods Range of store sizes Product replacement rate
1
2
3
4
5
6
7
8
9
5
10
50
50
50
50
50
50
50
5
10
50
50
50
50
50
50
50
50–100
50–100
50–100
10–100
50–100
90–100
50–100
50–100
50–100
0.001
0.001
0.001
0.001
0.001
0.001
0.001
0.005
0.010
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________
Effect analysis for # of stores &
periods on type A,B,C error
based on data sets 1,2,3 in Dataset_Table
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________

Effect analysis for store size
based on data sets 4,5,6 in
Data-set_Table
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________

Effect analysis for product
replacement ratio based on data
sets 7,8,9 in Data-set_Table
Market Basket Analysis
Performance Evaluation: Simulation
_____________________________

The traditional association rules may not be
able to extract all important purchasing
patterns for a multistore chain, when there are
large numbers of stores and periods, a large
variation in store sizes, and high product
replacement ratios.

we find that the proposed algorithm requires
larger process times, but the differences are
not substantial. This result is reasonable,
because the proposed algorithm requires one
more scan of the data than does the Apriori
algorithm, and also requires additional basic
operations in each phase of the algorithm.
Market Basket Analysis
Conclusion
_____________________________





Traditional Associations rule mining was introduced in 1993 by Agrawal et
al. fail in multi store environment.
To overcome this, store-chain association rule is proposed where store and
period are involved.
A simulation is used to compare the proposed and traditional methods. we
have seen that algorithm is computationally efficient.
The outcomes would be used for general or local marketing strategies.
Moreover for product procurement , inventory, distribution strategies for
store-chain.
This is the promising research area in data mining ,lot of work is left to do
like generating store-chain association rules accurately and efficiently, to
extend this algorithm for multiple levels in distributed environment.
environment
Market Basket Analysis
Refferences
_____________________________________






[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, Proceedings of
the 20th VLDB Conference, Santiago, Chile, 1994, pp. 478–499.
[2] J.M. Ale, G.H. Rossi, An approach to discovering temporal association rules,
Proceedings of the 2000 ACM Symposium on Applied Computing (Vol. 1), Villa Olmo,
Como, Italy, 2000, pp. 294– 300.
[3]yen-Liang Chen, Kwei tang, Ren-jie Shen and Ya-Han Hu, “market basket analysis
in a multiple store environment” Decision support System, 40(2)(2005) 339-354
[4] H. Lu, L. Feng, J. Han, Beyond intra-transaction association analysis: mining multidimensional inter-transaction association rules, ACM Transactions on Information
Systems 18 (4) (2000) 423– 454.
[5] J.F. Roddick, M. Spiliopoilou, A survey of temporal knowledge discovery paradigms
and methods, IEEE Transactions on Knowledge and Data Engineering 14 (2002) 750–
767.
[6] E. Clementini, P.D. Felice, K. Koperski, Mining multiplelevel spatial association rules
for objects with a broad boundary, Data and Knowledge
Engineering 34 (3) (2000) 251– 270.