Transcript ThesisPPT

Association Rules Mining in
Distributed Environments
By: Shamila Mafazi
Supervised by: Dr. Abrar Haider
Acknowledgment


First I would like to thank my supervisor, Dr. Abrar Haider
for his valuable support in this thesis.
I specially like to extend my thank to Dr Jiuyong Li who
introduced me to the Data Mining world.
Introduction

Data Mining refers to extraction or mining knowledge
from large amount of data (Han & Kamber 2006).

Data Mining Techniques include: Association rules
mining, clustering, classification, prediction and so on.

Association Rules Mining retrieves relation and correlation
between the data in a database.
Research Question

How can a more efficient method for discovering
association rules in a distributed environment be created?
Methodology

To address the research question, this research has
reviewed literature relating to the following areas:






Data mining in the centralised environments.
Data mining in the distributed environments.
Association rules mining in the distributed/centralised
environments.
Comparing the existing algorithms in the distributed/centralised
environments and studying the advantages and disadvantages of
them.
Developing a concise representation particularly, distributed
deduction rules.
Designing the new algorithm based on DTFIM .
Thesis Contribution

The proposed algorithm by this research intends to resolve
marketing problem in distributed environments.
Additionally, it aims to profile needs of customers and
their preferences in the transaction oriented systems such
as, credit cards. One of the most significant problems of
market basket data is dealing with large number of
candidate itemsets. Retrieving interesting and meaningful
patterns from these candidates is extremely difficult. Thus,
this research, presents an algorithm which reduces the
number of candidate sets and simplifies the process to
produce interesting customer preferences and patterns.
The Importance of Association Rules

Better management of inventory.

Better arrangement of the shelves.

Enhancement in sale of items.

These rules can reveal the effect of continuing or
discontinuing the sale of an item on the sale of other items.
Two Common Measures of Rules
Interestingness

The discovered rules are interesting if they satisfy the
minimum support and confidence thresholds.

Support of an itemset is the percentage of the transactions
which contain that itemset.

Confidence is the probability of occuring (such as buying)
items together.
Association Rules Mining (ARM)

ARM in centralised environments refers to mining
association rules from an integrated database





Apriori
AprioriTid
AprioriHybrid
FIM





Sampling
Partitioning Algorithm
DIC algorithm
FP Growth algorithm
NDI algorithm
ARM in distributed environments refers to mining
association rules from distributed databases.


CD
FDM


ODAM
DTFIM
Apriori Algorithm (Agrawal & Srikant 1994)
Minimum support thershold=2
1-itemsets
2-itemsets
3-itemsets
(Kantardzic 2003, p.168)
Centralised ARM Algorithm
 Frequent Itemsets Mining (FIM) algorithm (Bodon 2004):
 Uses Trie data
structure
 Fast construction and
information retrieval
Supp(a)=9
Supp(a,b)=6
 Memory efficient
Trie data structure (Ansari et al. 2008)
Centralised ARM Algorithm

Non derivable itemsets (NDI) (Calder & Geothals 2002)

Deduction rules divide the itemsets of a data base into derivable and
non derivable itemsets.

The deduction rules set tight bounds on the support of an itemset.

The least of the upper bounds and the greatest of the lower bounds are
the tight bounds for an itemset.

Derivable itemsets have equal upper and lower bounds.

Derivable itemsets do not represent any new information, not being
introduced by their subsets.

The non derivable itemsets are considered as a concise representation
of the frequent itemsets.
Deduction Rules:
An example of a DB transactions (Calders & Geothals 2007)
__
supp (abc) = supp(a) –supp(ab) – supp(ac) + supp(abc)
0 ≥ supp(a) –supp(ab) – supp(ac) + supp(abc)
supp(abc) ≥ supp(ab) + supp(ac) – supp(a)
Summary of Deduction Rules

If |I\X| is odd then (odd-itemsets)
|I\J| +1
Supp (I) ≤ ∑ (-1)
supp(J)
X⊆ J⊂I

If |I\X| is even then (even-itemsets)
|I\J| +1
Supp (I) ≥ ∑ (-1)
X ⊆ J ⊂I
supp(J)
The Importance of Distributed Data Mining



Distributed databases.
Transferring data within sites is extremely costly and time
consuming.
Due to security issues and data ownership, transferring the
local data is not permitted
Distributed ARM Algorithms
 Distributed Trie-based Frequent Itemset Mining (DTFIM)
(Ansari et al. 2008)
 Trie based data structure.
 Memory efficient
 Fast searching
 The more skewed data bases, the algorithm acts more
efficient.
My Contribution


Creating a concise representation by applying distributed
deduction rules.
Applying the distributed deduction rules inside the DTFIM
algorithm
Distributed Deduction Rules
 If |I/X| is odd then
|I\J| +1
Supp (I) ≤ ∑
∑ Suppi (J))
((-1)
1 ≤i ≤ n
X⊆ J⊂I
 If |I/X| is even then
|I\J| +1
Supp (I) ≥ ∑
X⊆ J⊂I
((-1)
∑ Suppi (J))
1 ≤i ≤ n
The Proposed Algorithm (cont.)

Inputs:
DBi (i=1,…,n): the databases at each site Si.
iterationDepth: number of iterations
minSup: the support threshold

Output: The set of all globally large itemsets L.

Method: Execution of the following program fragment (for the
k-th iteration) at the participating sites.
The Proposed Algorithm (cont.)
k:=1;
while k ≤ iterationDepth do
{ if k=1 then
TR i(1) := findLocalCandidate (DB i,0,1);
else
{ candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));
if DL(k-1) ≠ 0 then
dFrequent (DL(k-1), NDL(k-1), DL(k));
TR i(k) := findLocalCandidate (DBi, CG(k), k); }
if CG(k) ≠ 0 then // if the CG(k) is not empty
TRi(k-1) :=findNDFrequent (DBi, CG(k), k);
passLocalCandidate (TRi(k));
GLi(k) := getGlobalFrequent (); // globally large k-itemsets
updateLocalCandidates (TRi(k), GLi(k)); // prunes the local candidates which are not globally large
NDL(k) := ∪ⁿ i=1 GLi(k) ;
k:=k+1; }
L(k) := NDL(k) ∪ DL (k) ;
return L(k);
Candidate Set Generating Procedure
procedure candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))
for all Z ∈TR i(k-1) do
{ compute the [l,u] bounds of Z
if Z.sup=Z.l or Z.sup=Z.u then
{ Prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1) ;
if Z.sup=Z.l then
Z.sup=Z.l;
else
Z.sup=Z.u; }
pCG(k) =∪ⁿ i=1 CG i(k) =∪ⁿ i=1 aprioriGen(NDLi(k-1)); //FDM candidate itemsets generator
for all Y ∈ pCG(k) do
{
compute [l,u] bounds on support of Y
if l≠u then
{ Y.l=l;
Y.u=u;
Insert Y into CG(k) ; }
else
{
If u≥ minSup then
{
Insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1) ; Y.sup=u }
} } }
end procedure
Derivable Frequent Itemsets Procedure
Procedure dFrequent (DL(k-1), NDL(k-1), DL(k))
DCG(k) := aprioriGen2(DL(k-1), NDL(k-1)); // FDM apriori candidate generator.
for all Z ∈ DCG(k) do
{
compute Z.sup; //compute the s support of Z
if Z.sup ≥ minSup then
Insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1) ;
}
end procedure
Explanation of the Proposed Algorithm
Developing the local 1-itemsets vectors:



The local DBs are scanned by their local sites independently.
the 1-itemsets are determined and their local support counts are
stored in the local vectors.
Global 1-itemsets:



The support counts are exchanged within the sites
The globally large 1-itemsets are determined.
Initialising the local Tries:


Each local site initialise their local Trie based on Global 1-itemsets
Explanation of the Proposed Algorithm

Production of the global frequent k-itemsets (k≥2):
 The local candidate 2-itemsets are generated based on the local Tries
 the local candidate 2-itemsets are stored in a two dimensional array.

Applying the deduction rules:
 The deduction rules are applied on the local candidate 2-itemsets by each site.
 The derivable itemsets are deleted from the list of the non-derivable local
candidate 2-itemsets.

Global large 2-itmsests:
 The support counts of non-derivable locally frequent 2-itemsets are exchanged
within the sites.
 The global support count for 2-itemsets are determined.

Updating the local Tries:

Local sites update their Trie by inserting the global large 2-itmsets.
Conclusion
This research resolves the problem of market basket data in
distributed environments and presents an algorithm which
reduces the number of candidate sets and simplifies the
process to produce interesting customer preferences and
patterns.
References


Ansari, E, Dastghaibifard, GH, Keshtkaran, M, Kaabi, H
2008,‘Distributed Frequent Itemset Mining using Trie Data Structure’,
IAENG International Journal of Computer Science, vol. 35, no. 3, pp.
377-381.
Calders, T& Goethals, B 2007, ‘Non Derivable Itemsets mining’, data
mining Knowledge Discovery in Databases, vol. 14, pp. 171-206.

Han, J, Kamber, M 2006, Data mining: concepts and techniques, Diane
Cerra, United States of America.

Kantardzic, M 2003, Data Mining Concepts, Models, Methods, and
Algorithms, A John Wiley & Sons, INC, United State of America.
Thank you