Chapter 10 - VCU DMB Lab.


Chapter 10
ASSOCIATION RULES
Cios / Pedrycz / Swiniarski / Kurgan
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
Outline
• Introduction
• Association Rules and Transactional Data
– Basic Concepts
• Mining Single-Dimensional, Single-Level Boolean Association Rules
– Naïve Algorithm
– Apriori Algorithm
– Generating Association Rules from Frequent Itemsets
– Improving Efficiency of the Apriori Algorithm
– Finding Interesting Association Rules
Introduction
Association rule mining is, after clustering, another key unsupervised data mining method. It finds interesting associations (relationships, dependencies) among items in large data sets.
Introduction
ARs are used to describe associations or correlations
among a set of items in transaction databases,
relational databases, and data warehouses
– applications:
• basket data analysis, cross-marketing, catalog design,
clustering, data preprocessing, genomics, etc.
– rule format: LHS → RHS [support, confidence]
Examples:
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) AND takes(x, Advanced Data Analysis) → level(x, PhD) [1%, 75%]
Rules
Body ==> Consequent [ Support , Confidence ]
• Body: represents the examined data.
• Consequent: represents a discovered property for the examined data.
• Support: represents the percentage of the records satisfying both the body and the consequent.
• Confidence: represents the percentage of the records satisfying both the body and the consequent relative to those satisfying the body.
Introduction
What can be learned using association rules?
• how does demographic information affect what the customer buys?
• in this shopping basket the customer bought tomatoes, carrots, bananas, bread, eggs, soup, milk, etc.
• is bread usually bought with milk?
• does a specific milk brand make any difference?
• where should we place the tomatoes in the store to maximize their sales?
• is bread bought when both milk and eggs are bought together?
Introduction
ARs can be derived when data describe events that occur at the same time or in close proximity.
They can be:
– useful, when containing high-quality, actionable information
• diapers → beer
– trivial, when they are valid and supported by data, but useless since they describe well-known facts
• milk → eggs
– inexplicable, when they concern valid and new facts, but cannot be utilized
• grocery store → milk is sold as often as bread
Introduction
Some of the common applications:
– planning store layouts
• we can place products that have a strong purchasing relationship close
together OR
• we place such products far apart to increase traffic to purchase other
items
– planning bundling products and offering coupons
• knowing that buys(x, diapers) → buys(x, beers), discounts are not offered on both beer and diapers at the same time
– we discount one to increase sales and make money on the other
– designing direct marketing campaigns
• mailing a camcorder promotion to people who bought a VCR works best when it comes approximately two to three months after the VCR purchase
– other applications will be discussed later…
Introduction
How do we derive ARs:
– techniques are based on probability and statistics
– the process consists of 4 steps
1. prepare input data
2. choose items of interest… (itemset)
3. compute probabilities, joint probabilities, etc…
4. find (the most probable) association rules…
Transactional Data
Data should be provided in transactional form
– each record consists of transaction ID and information
about all items that constitute the transaction
TID  | Transaction (basket)
1000 | Beer, Diapers, Eggs
…    | …
(TID is the transaction ID; each basket is a subset of all available items)
Transactional Data
First look at the transactional data
– example
TID  | Transaction (basket)
1000 | Apples, Celery, Diapers
2000 | Beer, Celery, Eggs
3000 | Apples, Beer, Celery, Eggs
4000 | Beer, Eggs
– rules (just by looking at the data)
• Beer → Eggs
• Apples → Celery
Transactional Data
• Nominal data can also be transformed into transactions
(attributes: Bar, Fri/Sat, Hungry, Rain, Estimate; target attribute: Wait?)

Example | Bar | Fri/Sat | Hungry | Rain | Estimate | Wait?
e1      | No  | No      | Yes    | No   | 0-10     | Yes
…       | …   | …       | …      | …    | …        | …

– transformed into transactional format:

TID | Transaction (basket)
1   | Bar=No, Fri/Sat=No, Hungry=Yes, Rain=No, Estimate=0-10, Wait?=Yes
…   | …
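A minimal Python sketch of this transformation, using the attribute names from the example above (the function name is illustrative):

def record_to_transaction(record):
    """Turn a nominal record (attribute -> value) into a set of
    'attribute=value' items, i.e., one transactional basket."""
    return {f"{attr}={val}" for attr, val in record.items()}

# record e1 from the table above
e1 = {"Bar": "No", "Fri/Sat": "No", "Hungry": "Yes",
      "Rain": "No", "Estimate": "0-10", "Wait?": "Yes"}

print(record_to_transaction(e1))
# {'Bar=No', 'Fri/Sat=No', 'Hungry=Yes', 'Rain=No', 'Estimate=0-10', 'Wait?=Yes'}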
Association Rules
What do we want to do?
– given a database of transactions, where each transaction is a list of items (e.g., purchased by a customer in a visit)
– find all rules that correlate the presence of one set of items with that of another set of items
• how many items do we consider in each set?
• how do we define correlation?
– how strong should the correlation be?
• how do we extract useful rules?
Let us answer these questions
Basic Definitions
itemset: I = {i1, i2, ..., im}
transaction: T ⊆ I
association rule: A → B, where A ⊂ I, B ⊂ I, A ∩ B = ∅
dataset (set of transactions): D
Basic Definitions
How to measure interestingness of the rules?
– each rule has two assigned measures
• support, which indicates the frequency of the occurring pattern
– defined as the ratio of the # of transactions containing A and B to the total # of transactions

s(A → B) = ||{T ∈ D | A ∪ B ⊆ T}|| / ||D||

• confidence, which denotes the strength of implication in the rule
– defined as the ratio of the # of transactions containing A and B to the # of transactions containing A

c(A → B) = ||{T ∈ D | A ∪ B ⊆ T}|| / ||{T ∈ D | A ⊆ T}||
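A small Python sketch of these two measures, assuming each transaction is represented as a set of items; the function names are illustrative, and the sample data is the four-transaction basket used later in the chapter:

def support(A, B, D):
    """Fraction of transactions in D containing all items of A and B."""
    both = A | B
    return sum(1 for T in D if both <= T) / len(D)

def confidence(A, B, D):
    """Fraction of transactions containing A that also contain B."""
    both = A | B
    n_A = sum(1 for T in D if A <= T)
    n_AB = sum(1 for T in D if both <= T)
    return n_AB / n_A if n_A else 0.0

D = [{"Apples", "Celery", "Diapers"},
     {"Beer", "Celery", "Eggs"},
     {"Apples", "Beer", "Celery", "Eggs"},
     {"Beer", "Eggs"}]
print(support({"Beer"}, {"Eggs"}, D))     # 0.75
print(confidence({"Beer"}, {"Eggs"}, D))  # 1.0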
Basic Definitions
How to measure interestingness of the rules?
– example: diapers → beer
• support: s(A → B) = ||{T ∈ D | A ∪ B ⊆ T}|| / ||D||
• confidence: c(A → B) = ||{T ∈ D | A ∪ B ⊆ T}|| / ||{T ∈ D | A ⊆ T}||
(Venn diagram: customers who buy diapers, customers who buy beer, and their overlap — the customers who buy both)
Basic Definitions
How to measure interestingness of the rules?
• let's use a more complex rule and explain the concepts in terms of probabilities
– find all rules A and B → C with minimum confidence and support
» support is the probability that a transaction contains {A, B, C}
» confidence is the conditional probability that a transaction having {A, B} also contains C

A and B → C (support is 25%, confidence is 100%)

TID | Transaction
1   | A, B, C
2   | A, C
3   | A, D
4   | B, E, F

if we decide to have a minimum support of 50% and minimum confidence of 50%, we generate two rules from the data:
A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)
Basic Definitions
TID  | Transaction (basket)
1000 | Apples, Celery, Diapers
2000 | Beer, Celery, Eggs
3000 | Apples, Beer, Celery, Eggs
4000 | Beer, Eggs

What is I?
What is T for TID=2000?
What is support(Beer → Eggs)?
What is confidence(Beer → Eggs)?
Basic Definitions
TID  | Transaction (basket)
1000 | Apples, Celery, Diapers
2000 | Beer, Celery, Eggs
3000 | Apples, Beer, Celery, Eggs
4000 | Beer, Eggs

What is I?
Apples, Beer, Celery, Diapers, Eggs (m, the # of items, = 5)
What is T for TID=2000?
Beer, Celery, Eggs
What is support(Beer → Eggs)?
3/4 = 75%
What is confidence(Beer → Eggs)?
3/3 = 100%
Basic Definitions
Frequent itemsets
– itemset is any set of items
– k-itemset is an itemset containing k items
– frequent itemset is an itemset that satisfies a minimum
support level
– the problem arises when we try to analyze a dataset that contains m items
• how many itemsets are there?
– 2^m possible itemsets, which is many if m is large
Strong Association Rules
Given an itemset, it is easy to generate association rules
Example itemset = {Beer, Diapers}:
• ∅ → Beer, Diapers
• Beer → Diapers
• Diapers → Beer
• Diapers, Beer → ∅
– we are interested only in strong rules
• those which satisfy minimum support and minimum confidence
– these two are user-defined parameters
Strong Association Rules
Given itemsets, it is easy to generate association rules
– example itemsets:
{Beer, Diapers} with support 40%
{Beer, Diapers, Eggs} with support 20%
– rule
IF customer buys Beer and Diapers THEN the probability that (s)he buys Eggs is 50% (= 20% / 40%) can be inferred
• this provides a simple descriptive pattern
Association Rules
How to generate ARs?
– two basic steps
1. Find all frequent itemsets
– those that satisfy minimum support
2. Find all strong association rules
– generate association rules from frequent itemsets
– select and keep rules that satisfy minimum confidence
Naïve Algorithm
How do we generate frequent itemsets (step 1)?
– naïve algorithm
n = |D|
for each subset s of I
{
counter = 0
for each transaction T in D
{
if s is a subset of T
counter = counter + 1
}
if minimum support ≤ counter / n
add s to frequent itemsets
}
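A direct Python rendering of this naïve enumeration (a sketch only; itemsets are frozensets and minimum support is a fraction, as in the pseudocode above):

from itertools import combinations

def naive_frequent_itemsets(D, I, min_support):
    """Enumerate every non-empty subset s of I and count how many
    transactions in D contain it; keep those meeting min_support."""
    n = len(D)
    frequent = {}
    for k in range(1, len(I) + 1):
        for s in combinations(sorted(I), k):
            s = frozenset(s)
            counter = sum(1 for T in D if s <= T)
            if counter / n >= min_support:
                frequent[s] = counter
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
I = set().union(*D)
print(naive_frequent_itemsets(D, I, 0.5))
# {A}:2, {B}:3, {C}:3, {E}:3, {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2, {B,C,E}:2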
Frequent Itemsets
Does the naïve algorithm work well?
• we have 2^m subsets of I
• we have to scan n transactions for each subset
– thus we perform O(2^m · n) tests
– the O(2^m · n) complexity shows that the algorithm grows exponentially with the number of items m
• Thus we have to use some other approach!
Frequent Itemsets
Frequent itemsets satisfy the apriori property
– if A is not a frequent itemset, then any superset of A is not a frequent itemset
• we will use this property to speed up the computations
Proof:
n is the # of transactions
suppose A is a subset of i transactions
if A′ ⊇ A, then A′ is a subset of i′ ≤ i transactions
thus, if i/n < minimum support, so is i′/n
Frequent Itemsets
Using the apriori property
– candidate k-itemsets are built from frequent (k-1)-itemsets
• find all frequent 1-itemsets
• extend (k-1)-itemsets to candidate k-itemsets
• prune candidate itemsets that do not meet the minimum
support requirement
Apriori Algorithm
Improved algorithm to generate frequent itemsets
L1 = {frequent 1-itemsets}
for (k=2; L(k-1) is not empty; k++)
{
Ck is generated as k-itemset candidate from L(k-1)
for each transaction T in D
{
Ct=subset(Ck,T)
// k-itemsets that are subsets of T
for each k-itemset c in Ct
c.count++;
}
Lk = {c in Ck such that c.count/n ≥ minimum support}
}
the frequent itemsets are the union of the Lk
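A compact Python sketch of the same level-wise idea (illustrative, not the book's code): candidates are produced by joining frequent (k-1)-itemsets and pruned with the apriori property before scanning D; min_support is a fraction of |D|:

from itertools import combinations

def apriori(D, min_support):
    """Level-wise search for frequent itemsets."""
    n = len(D)
    min_count = min_support * n
    # L1: frequent 1-itemsets
    items = sorted(set().union(*D))
    counts = {frozenset([i]): sum(1 for T in D if i in T) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)
    k = 2
    while L:
        # join frequent (k-1)-itemsets that share k-2 items
        prev = sorted(L, key=lambda s: sorted(s))
        Ck = set()
        for a, b in combinations(prev, 2):
            u = a | b
            if len(u) == k:
                # prune: every (k-1)-subset of the candidate must be frequent
                if all(frozenset(sub) in L for sub in combinations(u, k - 1)):
                    Ck.add(u)
        # count the surviving candidates by scanning D
        counts = {c: sum(1 for T in D if c <= T) for c in Ck}
        L = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L)
        k += 1
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(D, 0.5))
# includes {B, C, E} with count 2, as in the worked example that follows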
Frequent Itemsets
L1 = {frequent 1-itemsets}
for (k=2; L(k-1) is not empty; k++)
{
Ck is generated as k-itemset candidates from L(k-1)
for each transaction T in D
{
Ct=subset(Ck,T)
// candidates that are subsets of T
for each candidate c in Ct
c.count++;
}
Lk = {c in Ck such that c.count/n ≥ minimum support}
}
the frequent itemsets are the union of the Lk
Apriori approach reduces number of considered itemsets (#
of scans of D)
– how do we generate k-itemset candidates?
1. for each item i that is not in a given frequent (k-1)-itemset, but is in some other frequent (k-1)-itemset in L(k-1)
– add i to the (k-1)-itemset to create a k-itemset candidate
– remove duplicates
» example
frequent 1-itemsets {A}, {B}, {C}
candidate 2-itemsets {A, B}, {A, C}, {B, A}, {B, C}, {C, A}, {C, B}
eliminate duplicates {A, B}, {A, C}, {B, C}
2. joining together frequent (k-1)-itemsets
– if two frequent (k-1)-itemsets have (k-2) items in common, create a k-itemset candidate by adding the two differing items to the (k-2) common items
» example
{A, B, C} joined with {A, B, D} gives {A, B, C, D}
Association Rules
1. Find all frequent itemsets
– those that satisfy minimum support
2. Find all strong association rules
– generate association rules from frequent itemsets
– select and keep rules that satisfy minimum confidence
1. The frequent set is computed iteratively
– 1st iteration
• large 1-candidate itemsets are found by scanning D
– kth iteration
• large k-candidate itemsets are generated by applying apriori-based generation to large (k-1)-itemsets
• the apriori rule generates only those k-itemsets whose every (k-1)-itemset subset is frequent (above the threshold)
2. Generating rules
– for each frequent itemset X, output all rules Y → (X – Y) if s(X) / s(Y) > minimum confidence
• Y is a subset of X
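A sketch of step 2 in Python, reusing the apriori function and data D from the earlier sketch; `frequent` is assumed to map each frequent itemset (a frozenset) to its support count:

from itertools import combinations

def generate_rules(frequent, min_confidence):
    """For each frequent itemset X and each non-empty proper subset Y,
    emit Y -> (X - Y) when s(X) / s(Y) meets min_confidence."""
    rules = []
    for X, sX in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for Y in combinations(X, r):
                Y = frozenset(Y)
                conf = sX / frequent[Y]   # s(X) / s(Y); Y is frequent by the apriori property
                if conf >= min_confidence:
                    rules.append((set(Y), set(X - Y), conf))
    return rules

freq = apriori(D, 0.5)            # from the earlier sketch
for lhs, rhs, conf in generate_rules(freq, 0.6):
    print(lhs, "->", rhs, f"confidence={conf:.0%}")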
Example
We will generate association rules from the transactional data
given below:
– minimum support = 50%
• the minimum # of transactions required to satisfy minimum support is 4 × 50% = 2
– minimum confidence = 60%

TID  | Transaction
1000 | Apples, Celery, Diapers
2000 | Beer, Celery, Eggs
3000 | Apples, Beer, Celery, Eggs
4000 | Beer, Eggs

abbreviated (A = Apples, B = Beer, C = Celery, D = Diapers, E = Eggs):

TID  | Transaction
1000 | A, C, D
2000 | B, C, E
3000 | A, B, C, E
4000 | B, E
Example
TID  | Transaction
1000 | A, C, D
2000 | B, C, E
3000 | A, B, C, E
4000 | B, E

generate candidate 1-itemsets (C1):
itemset | support count
A       | 2
B       | 3
C       | 3
D       | 1
E       | 3

generate frequent 1-itemsets (L1) — delete candidates below minimum support:
itemset | support count
A       | 2
B       | 3
C       | 3
E       | 3

generate candidate 2-itemsets (C2):
itemset | support count
A, B    | 1
A, C    | 2
A, E    | 1
B, C    | 2
B, E    | 3
C, E    | 2

generate frequent 2-itemsets (L2):
itemset | support count
A, C    | 2
B, C    | 2
B, E    | 3
C, E    | 2

generate candidate 3-itemsets (C3) — we do not use 3-itemsets containing (A, B) or (A, E), since they are below minimum support:
itemset | support count
B, C, E | 2

generate frequent 3-itemsets (L3):
itemset | support count
B, C, E | 2
Example
TID  | Transaction
1000 | A, C, D
2000 | B, C, E
3000 | A, B, C, E
4000 | B, E

Finally we derive association rules
• from the generated frequent 3-itemset {B, C, E} with s = 50%
– we need to satisfy the minimum confidence of 60%

B and C → E   with support = 50% and confidence = 100%
B and E → C   with support = 50% and confidence = 66.7%
C and E → B   with support = 50% and confidence = 100%
B → C and E   with support = 50% and confidence = 66.7%
C → B and E   with support = 50% and confidence = 66.7%
E → B and C   with support = 50% and confidence = 66.7%
Improving Efficiency of Apriori
The Apriori algorithm can be modified to improve its
efficiency (computational complexity) by:
– hashing
– removal of transactions that do not contain frequent
itemsets
– sampling of the data
– partitioning of the data
– mining frequent itemsets without generation of
candidate itemsets
Improving efficiency: Hashing
• Hashing
– is used to reduce the size of the candidate k-itemsets,
i.e., itemsets that are generated from frequent itemsets
from iteration k-1, Ck, for k>1
• for instance, when scanning D to generate L1 from the candidate 1-itemsets in C1, we can at the same time generate all 2-itemsets for each transaction, hash (map) them into the buckets of a hash table structure, and increase the corresponding bucket counts
• a 2-itemset whose corresponding bucket count is below the support threshold cannot be frequent, and thus we can remove it from the candidate set C2. In this way we reduce the number of candidate 2-itemsets that must be examined to obtain L2
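A minimal sketch of this bucket-counting idea in Python (the hash function, number of buckets, and helper names are illustrative assumptions, not the book's):

from itertools import combinations

def hash_pair(pair, num_buckets=8):
    """Illustrative hash for a 2-itemset given as a sorted pair of items."""
    return hash(pair) % num_buckets

def bucket_counts(D, num_buckets=8):
    """While scanning D (e.g., during the count of 1-itemsets),
    hash every 2-itemset of each transaction into a bucket."""
    counts = [0] * num_buckets
    for T in D:
        for pair in combinations(sorted(T), 2):
            counts[hash_pair(pair, num_buckets)] += 1
    return counts

def prune_C2(C2, counts, min_count, num_buckets=8):
    """Keep only candidate 2-itemsets whose bucket count can still
    reach the minimum support count."""
    return [c for c in C2
            if counts[hash_pair(tuple(sorted(c)), num_buckets)] >= min_count]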
Improving efficiency: Hashing
• To add itemset, start at root and go down until a
leaf is reached
• At interior node at depth d, decide which branch
to follow by applying hash function to the dth
item of the itemset
• When the # of itemsets in a leaf node exceeds some threshold, convert the leaf node to an interior node
Improving efficiency: Hashing
• To find candidates contained in given
transaction, t
– Hash on every item in t at the root node
• to ensure that itemsets that start with an item not in t are
ignored
– At an interior node reached by hashing on item i of the transaction, hash on each item that comes after i in t
Improving efficiency: Hashing
Let C3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
Ct = subset(C3, {1 0 1 1 0})   (the transaction is a bit vector over items 1–5, i.e., t contains items 1, 3, 4)
First build hash-tree with candidate itemsets
Then determine which of those itemsets are actually
in the given transaction
Hash-Tree
t = {1 0 1 1 0}, i.e., t contains items 1, 3, and 4

• At the root (d=1), hash on the items in t: 1, 3, 4. Item 2 is ignored since it is not in t, so the subtree holding {2 3 4} is never visited; hashing on 3 and 4 returns nothing since no candidate itemset starts with 3 or 4.
• At d=2, hash on the next item in t: 3; the leaf holding {1 2 3} and {1 2 4} is ignored since 2 is not in t, while the leaf holding {1 3 4} and {1 3 5} is reached.
• Check which of the itemsets in the reached leaves are contained in t; those that are get added to Ct as the output of the subset function.

Ct = {1 3 4}

Manually verify that of the 5 itemsets in C3 only {1 3 4} is actually present in the transaction.
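A compact Python sketch of such a hash tree (the leaf size, number of buckets, and item hash are illustrative choices; items are assumed to be integers as in the example above):

NUM_BUCKETS = 3      # illustrative
MAX_LEAF_SIZE = 3    # illustrative

class Node:
    def __init__(self):
        self.children = {}   # bucket -> Node (interior nodes)
        self.itemsets = []   # candidate itemsets (leaf nodes)
        self.is_leaf = True

def hash_item(item):
    return item % NUM_BUCKETS          # assumes integer items

def insert(node, itemset, depth=0):
    """Insert a candidate itemset (a sorted tuple) into the hash tree."""
    if node.is_leaf:
        node.itemsets.append(itemset)
        if len(node.itemsets) > MAX_LEAF_SIZE and depth < len(itemset):
            node.is_leaf = False         # convert the leaf to an interior node
            stored, node.itemsets = node.itemsets, []
            for s in stored:
                _descend(node, s, depth)
    else:
        _descend(node, itemset, depth)

def _descend(node, itemset, depth):
    bucket = hash_item(itemset[depth])   # hash on the depth-th item
    child = node.children.setdefault(bucket, Node())
    insert(child, itemset, depth + 1)

def subset(node, t, start=0, found=None):
    """Return the candidates stored in the tree that are contained in t."""
    if found is None:
        found = set()
    if node.is_leaf:
        found.update(s for s in node.itemsets if set(s) <= set(t))
        return found
    for i in range(start, len(t)):       # hash on each remaining item of t
        bucket = hash_item(t[i])
        if bucket in node.children:
            subset(node.children[bucket], t, i + 1, found)
    return found

C3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
root = Node()
for c in C3:
    insert(root, c)
print(subset(root, (1, 3, 4)))           # {(1, 3, 4)}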
Direct Hashing and Pruning (DHP)
Improving Efficiency of Apriori
• Removal of transactions that do not contain
frequent itemsets
– we remove transactions that do not contain frequent itemsets
– in general, if a transaction does not contain any frequent k-itemsets, it cannot contain any frequent (k+1)-itemsets, and thus can be removed from the computation of any frequent t-itemsets, where t > k
Improving Efficiency of Apriori
• Sampling of the data
– we generate association rules based on a sampled
subset of transactions in D
• a randomly selected subset S of D is used to search for the
frequent itemsets
• generation of frequent itemsets from S is more efficient
(faster) but some of the rules that would have been
generated from D may be missing, and some rules
generated from S may not be present in D
Improving Efficiency of Apriori
• Partitioning of the data
– partitioning generates frequent itemsets by finding frequent itemsets in subsets (partitions) of D
Improving Efficiency of Apriori
• Mining frequent itemsets without generation of
candidate itemsets
– one of the main limiting aspects of Apriori is that it can generate a very large number of candidate itemsets
• for instance, for 10,000 1-itemsets, the Apriori algorithm generates approximately 10,000,000 candidate 2-itemsets
– another limiting aspect is that Apriori may need to repeatedly scan the data set D
– to address these issues, a divide-and-conquer method, which decomposes the overall problem into a set of smaller tasks, is used
• the method, referred to as frequent-pattern growth (FP-growth), compresses the set of frequent (individual) items from D into a frequent-pattern tree (FP-tree)
General Algorithm to be improved
1. In the first pass, the support of each individual item is counted, and the large (frequent) ones are determined.
2. In each subsequent pass, the frequent itemsets determined in the previous pass are used to generate new itemsets called candidate itemsets.
3. The support of each candidate itemset is counted, and the frequent ones are determined.
4. This process continues until no new large itemsets are found.
AIS Algorithm
• Candidate itemsets are generated and counted on-the-fly as the database is scanned.
1. For each transaction, it is determined which of the frequent itemsets of the previous pass are contained in this transaction.
2. New candidate itemsets are generated by extending these frequent itemsets with other items in this transaction.
• The disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small (infrequent).
Example
Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2:
Itemset | Support
{1 3}*  | 2
{1 4}   | 1
{3 4}   | 1
{2 3}*  | 2
{2 5}*  | 3
{3 5}*  | 2
{1 2}   | 1
{1 5}   | 1

C3:
Itemset  | Support
{1 3 4}  | 1
{2 3 5}* | 2
{1 3 5}  | 1

(* marks the frequent itemsets)
SETM Algorithm
• Candidate itemsets are generated on-the-fly as the
database is scanned, but counted at the end of the
pass.
1. New candidate itemsets are generated the same way
as in AIS algorithm, but the TID of the generating
transaction is saved with the candidate itemset in a
sequential structure.
2. At the end of the pass, the support count of
candidate itemsets is determined by aggregating this
sequential structure
• It has the same disadvantage as the AIS algorithm.
• Another disadvantage is that for each candidate itemset, there are as many entries as its support value.
Example
Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2 (candidate itemset, TID of the generating transaction):
Itemset | TID
{1 3}   | 100
{1 4}   | 100
{3 4}   | 100
{2 3}   | 200
{2 5}   | 200
{3 5}   | 200
{1 2}   | 300
{1 3}   | 300
{1 5}   | 300
{2 3}   | 300
{2 5}   | 300
{3 5}   | 300
{2 5}   | 400

C3:
Itemset  | TID
{1 3 4}  | 100
{2 3 5}  | 200
{1 3 5}  | 300
{2 3 5}  | 300
Apriori Algorithm
• Candidate itemsets are generated using only
the frequent itemsets of the previous pass
without considering the transactions in the
database.
1. The set of frequent itemsets from the previous pass is joined with itself to generate all itemsets whose size is larger by 1.
2. Each generated itemset that has a subset which is not frequent is deleted. The remaining itemsets are the candidate ones.
Example
Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2:
Itemset | Support
{1 2}   | 1
{1 3}*  | 2
{1 5}   | 1
{2 3}*  | 2
{2 5}*  | 3
{3 5}*  | 2

C3 (the join generates {1 2 3}, {1 3 5}, {2 3 5}; only {2 3 5} survives the infrequent-subset check):
Itemset  | Support
{2 3 5}* | 2
AprioriTid Algorithm
• The database is not used at all for counting the
support of candidate itemsets after the first pass.
1. The candidate itemsets are generated the same way
as in Apriori algorithm.
2. Another set C′ is generated, each member of which has the TID of a transaction together with the candidate itemsets present in that transaction. This set is used to count the support of each candidate itemset.
• The advantage is that the number of entries in C’
may be smaller than the number of transactions in
the database, especially in the later passes.
Example
Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2:
Itemset | Support
{1 2}   | 1
{1 3}*  | 2
{1 5}   | 1
{2 3}*  | 2
{2 5}*  | 3
{3 5}*  | 2

C′2:
TID | Candidate 2-itemsets in the transaction
100 | {1 3}
200 | {2 3}, {2 5}, {3 5}
300 | {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400 | {2 5}

C′3:
TID | Candidate 3-itemsets in the transaction
200 | {2 3 5}
300 | {2 3 5}

C3:
Itemset  | Support
{2 3 5}* | 2
Performance Analysis
AprioriHybrid Algorithm
Performance Analysis shows that:
1. Apriori does better than AprioriTid in the
earlier passes.
2. AprioriTid does better than Apriori in the
later passes.
• Hence, a hybrid algorithm can be
designed that uses Apriori in the initial
passes and switches to AprioriTid when
it expects that the set C’ will fit in
memory.
RARM and FOLDARM
• Both techniques use a data structure called SOTrieIT (Support-Ordered Trie Itemset)
– Trie: position in the tree defines the key it is associated with
– Support-ordered: most frequent on left-most branch
• This trie-like tree structure stores the support counts
of all 1-itemsets and 2-itemsets in the database.
– This structure is then sorted according to the support counts
of each node in descending order
• Unlike RARM, FOLDARM allows transactions and
unique transactional items to be added and removed
without the need to destroy the current SOTrieIT and
rebuild it.
FP-Growth Algorithm
Finding Interesting Association Rules
Depending on the minimum support and confidence values, the user may generate a large number of rules to analyze and assess
How to filter out the rules that are potentially the most interesting?
– whether a rule is interesting (or not) can be evaluated either objectively or subjectively
• the ultimate, but subjective, user evaluation cannot be quantified or anticipated; it differs from user to user
• that is why objective interestingness measures, based on the statistical information present in D, are used
Finding Interesting Association Rules
The subjective evaluation of association rules often boils down to
checking if a given rule is unexpected (i.e., surprises the user)
and actionable (i.e., the user can use it for something useful)
– useful, when they provide high-quality, actionable information
e.g. diapers → beers
– trivial, when they are valid and supported by data, but useless since they confirm well-known facts
e.g. milk → bread
– inexplicable, when they contain valid new facts but cannot be utilized
e.g. grocery_store → milk_is_sold_as_often_as_bread
Finding Interesting Association Rules
In most cases, confidence and support values
associated with each rule are used as the objective
measure to select the most interesting rules
– rules that have these values higher with respect to other
rules are preferred
– although this simple approach works in many cases, we will
show that sometimes rules with high confidence and
support may be uninteresting or even misleading
Finding Interesting Association Rules
Objective interestingness measures
– example
• let us assume that the transactional data contains milk and bread as the frequent items
• 2,000 transactions were recorded
– in 1,200 of them customers bought milk
– in 1,650 of them customers bought bread
– in 900 of them customers bought both milk and bread

          | milk | not milk | total
bread     | 900  | 750      | 1650
not bread | 300  | 50       | 350
total     | 1200 | 800      | 2000
Finding Interesting Association Rules
Objective interestingness measures
– Example
• given the minimum support threshold of 40% and minimum confidence threshold of 70%, the rule "milk → bread [45%, 75%]" would be generated
• on the other hand, due to its low support and confidence values, the rule "milk → not bread [15%, 25%]" would not be generated
• the latter rule is by far more "accurate", while the first may be misleading
Finding Interesting Association Rules
Objective interestingness measures
– example
• the "milk → bread [45%, 75%]" rule
– the probability of buying bread is 82.5%, while the confidence of milk → bread is lower and equals 75%
– bread and milk are negatively associated, i.e., buying one decreases the likelihood of buying the other
– obviously, using this rule would not be a wise decision

          | milk | not milk | total
bread     | 900  | 750      | 1650
not bread | 300  | 50       | 350
total     | 1200 | 800      | 2000
Finding Interesting Association Rules
Objective interestingness measures
– an alternative approach to evaluate the interestingness of association rules is to use measures based on correlation
– for an A → B rule, the itemset A is independent of the occurrence of the itemset B if P(A ∪ B) = P(A)P(B). Otherwise, itemsets A and B are dependent and correlated as events
– the correlation measure (also referred to as lift and interest) between itemsets A and B is defined as

corr(A, B) = P(A ∪ B) / ( P(A) P(B) )
Finding Interesting Association Rules
Objective interestingness measures
– correlation measure
• if the correlation value is less than 1, then the occurrence of A is negatively correlated with (inhibits) the occurrence of B
• if the value is greater than 1, then A and B are positively correlated, which means that the occurrence of one promotes the occurrence of the other
• if the correlation equals 1, then A and B are independent, i.e., there is no correlation between these itemsets
– the correlation value for the rule milk → bread is
0.45 / (0.6 × 0.825) = 0.45 / 0.495 = 0.91
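The same computation in a few lines of Python, using the counts from the contingency table above (variable names are illustrative):

n = 2000
p_milk = 1200 / n          # 0.60
p_bread = 1650 / n         # 0.825
p_both = 900 / n           # 0.45

lift = p_both / (p_milk * p_bread)
print(round(lift, 2))      # 0.91 -> milk and bread are negatively correlated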
Other measures for association rules
Objective interestingness measures
– All-confidence
– Collective strength
– Coverage: measures how often a rule X → Y is applicable in a database (similar to support)
– Conviction: the frequency at which the rule makes an incorrect prediction
– Succinctness: characterized by clear, precise expression in few words
– Leverage: measures the difference between how often X and Y appear together in the data set and how often they would be expected to appear together if X and Y were statistically independent
– Lift: measures the ratio of how often X and Y appear together in the data set to how often they would be expected to appear together if X and Y were statistically independent (the correlation measure above)
Applications
• Web Personalization
– personalization and recommendation systems for
browsing web pages
• e.g. Amazon’s recommended books
Mobasher, et al., Effective Personalization Based on Association Rule Discovery from Web Usage Data, ACM Workshop on Web Information and Data Management, 2001
• Genomic Data
– mining to find associations in genomic data
Satou, K., et al., Finding Association Rules on Heterogeneous Genome Data, Proceedings of the Pacific Symposium on Biocomputing, pp. 397-408, 1997
Genomic Data
Summary of results
– 182,388 association rules were generated with
• minimum support = 5%
• minimum confidence = 65%
– post-processing of the generated rules selected rules with a maximum support of 30
• the final number of generated rules was 381
– they were corroborated by biological data
» common substructures in serine endopeptidases were found
References
R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216, 1993
R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proceedings of the 20th VLDB Conference, pp. 487-499, 1994
J. Han and M. Kamber, Data Mining, Morgan Kaufmann, 2001
J. S. Park, M.-S. Chen, and P. S. Yu, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules, IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, September/October 1997
• Slides 5, 46 – 56 from Mohamed G. Elfeky, Purdue University
• Slide 58, Florian Verhein, School of Information Technologies, The University of Sydney, Australia
References
• A. Das, W.-K. Ng, and Y.-K. Woon, Rapid Association Rule Mining, CIKM '01: Proceedings of the Tenth International Conference on Information and Knowledge Management, Oct 2001
• Y.-K. Woon, W.-K. Ng, and A. Das, Fast Online Dynamic Association Rule Mining, Proceedings of the Second International Conference on Web Information Systems Engineering, 2001