MiningAssociationRul..

Transcript MiningAssociationRul..

Mining Association Rules
Mohamed G. Elfeky
1
Introduction
Data mining is the discovery of knowledge
and useful information from the large
amounts of data stored in databases.
Association Rules: describing association
relationships among the attributes in the
set of relevant data.
2
Rules
Body ==> Consequent [ Support , Confidence ]
Body: represents the examined data.
Consequent: represents a discovered property
for the examined data.
Support: represents the percentage of the
records satisfying the body or the consequent.
Confidence: represents the percentage of the
records satisfying both the body and the
consequent to those satisfying only the body.
3
Association Rules Examples
Basket Data
Tea ^ Milk ==> Sugar [0.3 , 0.9]
Relational Data
x.diagnosis = Heart ^ x.sex = Male ==> x.age > 50 [0.4 , 0.7]
Object-Oriented Data
s.hobbies = { sport , art } ==> s.age() = Young [0.5 , 0.8]
4
Topics of Discussion
Formal Statement of the Problem
Different Algorithms





AIS
SETM
Apriori
AprioriTid
AprioriHybrid
Performance Analysis
5
Formal Statement of the Problem
I = { i1 , i2 , … , im } is a set of items
D is a set of transactions T
Each transaction T is a set of items (subset of I)
TID is a unique identifier that is associated with
each transaction
The problem is to generate all association rules
that have support and confidence greater than
the user-specified minimum support and
minimum confidence
6
Problem Decomposition
The problem can be decomposed into two subproblems:
1. Find all sets of items (itemsets) that have
support (number of transactions) greater than
the minimum support (large itemsets).
2. Use the large itemsets to generate the desired
rules.
For each large itemset l, find all non-empty subsets,
and for each subset a generate a rule a ==> (l-a) if
its confidence is greater than the minimum
confidence.
7
General Algorithm
1. In the first pass, the support of each individual
item is counted, and the large ones are
determined
2. In each subsequent pass, the large itemsets
determined in the previous pass is used to
generate new itemsets called candidate
itemsets.
3. The support of each candidate itemset is
counted, and the large ones are determined.
4. This process continues until no new large
itemsets are found.
8
AIS Algorithm
Candidate itemsets are generated and counted on-the-
1.
2.
fly as the database is scanned.
For each transaction, it is determined which of the
large itemsets of the previous pass are contained in
this transaction.
New candidate itemsets are generated by extending
these large itemsets with other items in this
transaction.
The disadvantage is that this results in unnecessarily
generating and counting too many candidate itemsets
that turn out to be small.
9
Example
Database
L1
C2
Itemset Support
Itemset Support
TID
Items
100
134
{1}
2
{1 3}*
2
200
235
{2}
3
{1 4}
1
300
1235
{3}
3
{3 4}
1
400
25
{5}
3
{2 3}*
2
{2 5}*
3
{3 5}*
2
{1 2}
1
{1 5}
1
C3
Itemset Support
{1 3 4}
1
{2 3 5}*
2
{1 3 5}
1
10
SETM Algorithm
Candidate itemsets are generated on-the-fly as the
database is scanned, but counted at the end of the pass.
1. New candidate itemsets are generated the same way as
in AIS algorithm, but the TID of the generating
transaction is saved with the candidate itemset in a
sequential structure.
2. At the end of the pass, the support count of candidate
itemsets is determined by aggregating this sequential
structure
It has the same disadvantage of the AIS algorithm.
Another disadvantage is that for each candidate itemset,
there are as many entries as its support value.
11
Example
Database
L1
C2
Itemset Support
Itemset TID
TID
Items
100
134
{1}
2
200
235
{2}
3
300
1235
{3}
3
400
25
{5}
3
{1 3}
100
{1 4}
100
{3 4}
100
{2 3}
200
{2 5}
200
{3 5}
200
{1 2}
300
Itemset TID
{1 3}
300
{1 3 4}
100
{1 5}
300
{2 3 5}
200
{2 3}
300
{1 3 5}
300
{2 5}
300
{3 5}
300
{2 3 5}
300
{2 5}
400
C3
12
Apriori Algorithm
Candidate itemsets are generated using only the
large itemsets of the previous pass without
considering the transactions in the database.
1.The large itemset of the previous pass is joined
with itself to generate all itemsets whose size is
higher by 1.
2.Each generated itemset, that has a subset which
is not large, is deleted. The remaining itemsets
are the candidate ones.
13
Example
Database
L1
C2
Itemset Support
Itemset Support
TID
Items
100
134
{1}
2
{1 2}
1
200
235
{2}
3
{1 3}*
2
300
1235
{3}
3
{1 5}
1
400
25
{5}
3
{2 3}*
2
{2 5}*
3
{3 5}*
2
C3
Itemset Support
{2 3 5}*
2
{1 2 3}
{1 3 5}
{2 3 5}
14
AprioriTid Algorithm
The database is not used at all for counting the support
of candidate itemsets after the first pass.
1. The candidate itemsets are generated the same way as
in Apriori algorithm.
2. Another set C’ is generated of which each member has
the TID of each transaction and the large itemsets
present in this transaction. This set is used to count the
support of each candidate itemset.
The advantage is that the number of entries in C’ may
be smaller than the number of transactions in the
database, especially in the later passes.
15
Example
Database
C2
Itemset Support
L1
{1 2}
1
Itemset Support
{1 3}*
2
TID
Items
100
134
{1}
2
{1 5}
1
200
235
{2}
3
{2 3}*
2
300
1235
{3}
3
{2 5}*
3
400
25
{5}
3
{3 5}*
2
C’2
C’3
C3
200
{2 3 5}
100
{1 3}
300
{2 3 5}
200
{2 3}, {2 5}, {3 5}
300
{1 2}, {1 3}, {1 5},
{2 3}, {2 5}, {3 5}
400
{2 5}
Itemset Support
{2 3 5}*
2
16
Performance Analysis
17
AprioriHybrid Algorithm
Performance Analysis shows that:
1. Apriori does better than AprioriTid in the
earlier passes.
2. AprioriTid does better than Apriori in the
later passes.
Hence, a hybrid algorithm can be designed
that uses Apriori in the initial passes and
switches to AprioriTid when it expects that
the set C’ will fit in memory.
18

MiningAssociationRul..

Transcript MiningAssociationRul..

Directory