Association Rule Mining
Zhenjiang Lin
Group Presentation
April 10, 2007
Overview
- Associations
- Market Basket Analysis
- Basic Concepts
- Frequent Itemsets
- Generating Frequent Itemsets
  - Apriori
  - FP-Growth
- Applications
Association Rule Learners
- Discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions).
- Discover rules, such as implications or correlations, that relate co-occurring elements.
- Answer questions such as "If a customer purchases product A, how likely is he to purchase product B?" and "What products will a customer buy if he buys products C and D?"
- Reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.
- Also known as "market basket analysis".
Associations
- Rules expressing relationships between items
- Example: cereal, milk => fruit
  - "People who bought cereal and milk also bought fruit."
  - Stores might want to offer specials on milk and cereal to get people to buy more fruit.
Market Basket Analysis
Analyze tables of transactions:

Person  Basket
A       Chips, Salsa, Cookies, Crackers, Coke, Beer
B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C       Chips, Salsa, Frozen Pizza, Frozen Cake
D       Lettuce, Spinach, Milk, Butter

Can we hypothesize?
- Chips => Salsa
- Lettuce => Spinach
Market Baskets
In general, the data consists of:
- TID: a transaction ID
- Basket: a subset of items
Basic Concepts
- Set of items: $I = \{i_1, i_2, \ldots, i_m\}$
- Transaction: $T \subseteq I$
- Association rule: $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$
- $D$: the set of transactions (i.e., our data)
Measuring Interesting Rules
- Support: the ratio of the number of transactions containing both A and B to the total number of transactions
  $$s(A \Rightarrow B) = \frac{\|\{T \in D \mid A \cup B \subseteq T\}\|}{\|D\|}$$
- Confidence: the ratio of the number of transactions containing both A and B to the number of transactions containing A
  $$c(A \Rightarrow B) = \frac{\|\{T \in D \mid A \cup B \subseteq T\}\|}{\|\{T \in D \mid A \subseteq T\}\|}$$
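As a quick illustration, here is a minimal Python sketch of these two definitions over a toy set of transactions (the data and helper names are made up for illustration):

from fractions import Fraction

# Toy transaction database D: each transaction is a set of items.
transactions = [
    {"Chips", "Salsa", "Cookies"},
    {"Chips", "Salsa"},
    {"Lettuce", "Spinach"},
    {"Chips", "Beer"},
]

def support(A, B, D):
    # s(A => B): fraction of transactions containing A union B
    both = A | B
    return Fraction(sum(both <= T for T in D), len(D))

def confidence(A, B, D):
    # c(A => B): transactions containing A union B over those containing A
    both = A | B
    return Fraction(sum(both <= T for T in D), sum(A <= T for T in D))

print(support({"Chips"}, {"Salsa"}, transactions))     # 1/2
print(confidence({"Chips"}, {"Salsa"}, transactions))  # 2/3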
Measuring Interesting Rules
Rules are included or excluded based on two metrics:
- Minimum support level: how frequently all of the items in a rule appear together in transactions
- Minimum confidence level: how frequently the left-hand side of a rule implies the right-hand side
Market Basket Analysis
Person  Basket
A       Chips, Salsa, Cookies, Crackers, Coke, Beer
B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C       Chips, Salsa, Frozen Pizza, Frozen Cake
D       Lettuce, Spinach, Milk, Butter, Chips

- What is I?
- What is T for person B?
- What is s(Chips => Salsa)?
- What is c(Chips => Salsa)?
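(For reference: I is the set of all items appearing in any basket, and T for person B is {Lettuce, Spinach, Oranges, Celery, Apples, Grapes}. Chips and Salsa appear together in baskets A and C, while Chips appears in A, C, and D, so s(Chips => Salsa) = 2/4 = 50% and c(Chips => Salsa) = 2/3 ≈ 67%.)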
Frequent Itemsets
- itemset: any set of items
- k-itemset: an itemset containing k items
- frequent itemset: an itemset that satisfies a minimum support level
- If I contains m items, how many itemsets are there?
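(There are $2^m$ possible itemsets, including the empty set; see the analysis of the naïve algorithm below.)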
Strong Association Rules
Given an itemset, it is easy to generate association rules:
- For the itemset {Chips, Salsa}:
  - => Chips, Salsa
  - Chips => Salsa
  - Salsa => Chips
  - Chips, Salsa =>
- Strong rules are interesting
  - Generally defined as the rules satisfying both minimum support and minimum confidence
Association Rule Mining
Two basic steps:
- Find all frequent itemsets
  - i.e., the itemsets satisfying minimum support
- Find all strong association rules (see the sketch after this list)
  - Generate candidate rules from the frequent itemsets
  - Keep the rules satisfying minimum confidence
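A minimal Python sketch of the second step, assuming the first step has already produced a map from frequent itemsets to their support counts (the counts below are illustrative):

from itertools import combinations

# supp: frequent itemset -> support count (illustrative values)
supp = {
    frozenset({"Chips"}): 3,
    frozenset({"Salsa"}): 2,
    frozenset({"Chips", "Salsa"}): 2,
}

def strong_rules(supp, min_conf):
    # Emit every rule A => B with nonempty A and B that meets min_conf.
    for itemset, count in supp.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / supp[lhs]   # c(A => B) = s(A u B) / s(A)
                if conf >= min_conf:
                    yield set(lhs), set(itemset - lhs), conf

for A, B, conf in strong_rules(supp, 0.6):
    print(A, "=>", B, f"(confidence {conf:.2f})")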
Generating Frequent Itemsets
Naïve algorithm:

n <- |D|
for each subset s of I do
    l <- 0
    for each transaction T in D do
        if s is a subset of T then
            l <- l + 1
    if minimum support <= l/n then
        add s to frequent subsets
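A runnable Python rendering of the pseudocode, kept deliberately literal (names and example data are for illustration only; it is practical only for very small I):

from itertools import combinations

def naive_frequent_itemsets(D, I, min_sup):
    n = len(D)
    frequent = {}
    # Enumerate every nonempty subset s of I.
    for k in range(1, len(I) + 1):
        for s in map(frozenset, combinations(sorted(I), k)):
            l = sum(s <= T for T in D)   # transactions containing s
            if l / n >= min_sup:
                frequent[s] = l
    return frequent

D = [{"Chips", "Salsa"}, {"Chips"}, {"Lettuce", "Spinach"}]
print(naive_frequent_itemsets(D, set().union(*D), 1/3))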
Generating Frequent Itemsets
Analysis of the naïve algorithm:
- $2^m$ subsets of $I$
- Scan $n$ transactions for each subset
- $O(2^m n)$ tests of whether $s$ is a subset of $T$
- Growth is exponential in the number of items!

Can we do better?
Generating Frequent Itemsets
Frequent itemsets satisfy the apriori property:
- If $A$ is not a frequent itemset, then no superset of $A$ is a frequent itemset.
- Proof: Let $n$ be the number of transactions, and suppose $A$ is a subset of $l$ transactions. If $A' \supseteq A$, then $A'$ is a subset of $l' \le l$ transactions. Thus, if $l/n$ is below the minimum support, so is $l'/n$.
Generating Frequent Itemsets
Central idea: build candidate k-itemsets from frequent (k-1)-itemsets.
Approach:
- Find all frequent 1-itemsets
- Extend the frequent (k-1)-itemsets to candidate k-itemsets
- Prune the candidate itemsets that do not meet the minimum support
Generating Frequent Itemsets
(Basic Apriori)
L1 = {frequent 1-itemsets}
for (k = 2; L(k-1) is not empty; k++) {
    Ck = generate k-itemset candidates from L(k-1)
    for each transaction t in D {
        // The candidates that are subsets of t
        Ct = subset(Ck, t)
        for each candidate c in Ct {
            c.count++;
        }
    }
    Lk = {c in Ck | c.count >= min_sup}
}
The frequent itemsets are the union of the Lk.
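A compact Python sketch of the same loop, using an absolute minimum-support count and the standard join-and-prune candidate generation (an illustration of the idea, not a tuned implementation):

from itertools import combinations

def apriori(D, min_sup):
    items = set().union(*D)
    # L1: frequent 1-itemsets
    Lk = {frozenset({i}) for i in items
          if sum(i in T for T in D) >= min_sup}
    frequent, k = set(Lk), 2
    while Lk:
        # Join step: unions of two frequent (k-1)-itemsets with k items.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (apriori property): every (k-1)-subset must be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One scan of D counts every surviving candidate.
        Lk = {c for c in Ck if sum(c <= T for T in D) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent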
FP Growth
(Han, Pei, Yin 2000)
- One problematic aspect of Apriori is candidate generation
  - A source of exponential growth
- Another approach is to use a divide-and-conquer strategy
- Idea: compress the database into a frequent-pattern tree representing the frequent items
FP Growth (Tree construction)
Initially, scan the database for frequent 1-itemsets
- Place the resulting set in a list L in descending order of frequency (support)

Construct the FP-tree:
- Create a root node labeled null
- Scan the database
  - Process the items in each transaction in L order
  - From the root, add nodes in the order in which the items appear in the transactions
  - Link nodes representing the same item across different branches
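A minimal Python sketch of this two-scan construction. Nodes are plain dicts, and the header table keeps the node-links used later during mining (all names are illustrative):

from collections import defaultdict

def build_fp_tree(D, min_count):
    # First scan: count items and keep the frequent ones, in L order.
    freq = defaultdict(int)
    for T in D:
        for item in T:
            freq[item] += 1
    L = [i for i in sorted(freq, key=freq.get, reverse=True)
         if freq[i] >= min_count]
    root = {"item": None, "count": 0, "children": {}, "parent": None}
    header = defaultdict(list)   # item -> node-links
    # Second scan: insert each transaction's frequent items in L order.
    for T in D:
        node = root
        for item in (i for i in L if i in T):
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "count": 0,
                         "children": {}, "parent": node}
                node["children"][item] = child
                header[item].append(child)   # link same-item nodes
            child["count"] += 1
            node = child
    return root, header, L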
Frequent 1-itemsets
TID  Items
1    I1, I2, I5
2    I2, I4
3    I2, I3, I6
4    I1, I2, I4
5    I1, I3
6    I2, I3
7    I1, I3
8    I1, I2, I3, I5
9    I1, I2, I3

With a minimum support of 20% (a frequency of 2):
- Frequent 1-itemsets: I1, I2, I3, I4, I5
- Construct the list L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}
Build FP-Tree
- Create the root node, labeled null
- Scan the database
  - Transaction 1: I1, I2, I5 (L order: I2, I1, I5)
  - Process the transaction, adding nodes in item order and labeling each with its item and count
  - Maintain a header table of node-links for each item

After transaction 1 the tree is the single path null -> (I2,1) -> (I1,1) -> (I5,1).
Build FP-Tree
Processing continues through the remaining transactions. After transaction 2 (I2, I4), the I2 node's count rises to 2 and a new branch is added beneath it:

null
- (I2,2)
  - (I1,1)
    - (I5,1)
  - (I4,1)
Mining the FP-tree

Start at the last item in the header table:
- Find all paths containing the item
  - Follow the node-links
- Identify the conditional patterns
  - Patterns in those paths with the required frequency
- Build the conditional FP-tree C
- Append the item to all paths in C, generating frequent patterns
- Mine C recursively (appending the item)
- Remove the item from the table and the tree
Mining the FP-Tree
The complete FP-tree for the nine transactions, with header table counts I2:7, I1:6, I3:6, I4:2, I5:2:

null
- (I2,7)
  - (I1,4)
    - (I5,1)
    - (I4,1)
    - (I3,2)
      - (I5,1)
  - (I4,1)
  - (I3,2)
- (I1,2)
  - (I3,2)

Mining I5 (the last item in the table):
- Prefix paths: (I2 I1, 1) and (I2 I1 I3, 1)
- Conditional path: (I2 I1, 2), since I3 falls below the minimum frequency of 2
- Conditional FP-tree: null -> (I2,2) -> (I1,2)
- Appending I5 yields the frequent pattern (I2 I1 I5, 2)
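A small sketch of the first mining step, reusing the header table from the construction sketch above: follow an item's node-links and walk the parent pointers to collect its prefix paths. On the nine-transaction example, prefix_paths("I5", header) yields the two paths shown above.

def prefix_paths(item, header):
    # Collect (path, count) pairs for every node labeled with `item`.
    paths = []
    for node in header[item]:            # follow the node-links
        path, parent = [], node["parent"]
        while parent is not None and parent["item"] is not None:
            path.append(parent["item"])  # climb toward the root
            parent = parent["parent"]
        if path:
            paths.append((path[::-1], node["count"]))
    return paths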
Applications
- Web Personalization
- Genomic Data
Web Personalization
"Effective Personalization Based on Association Rule Discovery from Web Usage Data," Mobasher et al., ACM Workshop on Web Information and Data Management, 2001.
- Personalization and recommendation systems
  - e.g., Amazon.com's recommended books
Data Preprocessing
- Identify the set of pageviews $P = \{p_1, \ldots, p_n\}$
  - Which files result in a single browser display? (complicated by frames, images, etc.)
- Identify the transactions $T = \{t_1, \ldots, t_m\}$
  - Built from session IDs or cookies
Data Preprocessing
A transaction $t$ consists of pageview/weight pairs:
$t = \{(p_1^t, w(p_1^t)), \ldots, (p_l^t, w(p_l^t))\}$
where $w$ is a weight associated with the pageview:
- Could be binary (e.g., purchase vs. non-purchase)
- Could be related to the amount of time spent on the page
Data Preprocessing
- The paper considered only pageviews in a transaction with weight $w(p) = 1$
- The ordering of pageviews did not matter
Recommendation Engine
- Has to run online
  - i.e., it must be fast
  - Frequent itemsets are generated ahead of time and stored in a graph data structure for efficient searching
- Maintains a history of the user's current session (a minimal sketch follows)
  - Sets a window size w (e.g., 3)
  - Given pageviews A, B, C, the window is {A, B, C}
  - If the user then visits D, the window becomes {B, C, D}
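The session window behaves like a fixed-size queue; a minimal sketch:

from collections import deque

# Keep only the w most recent pageviews; older ones fall out
# automatically as the user navigates.
w = 3
window = deque(maxlen=w)
for pageview in ["A", "B", "C", "D"]:
    window.append(pageview)
print(set(window))   # {'B', 'C', 'D'}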
Genomic Data
"Finding Association Rules on Heterogeneous Genome Data," Satou et al.
- Combined data from PDB, SWISS-PROT, and PROSITE

Protein  sequence  sequence  structure
name     feature1  feature2  feature1   function1  function2
name1    1         0         1          0          1
name2    0         0         1          1          0
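Each row of this binary feature matrix maps naturally to a transaction whose items are the features set to 1, which is what lets standard itemset mining run on this data. A minimal sketch (column names paraphrased from the table):

# name1's row as a transaction of its present features.
row = {"seq_feature1": 1, "seq_feature2": 0, "struct_feature1": 1,
       "function1": 0, "function2": 1}
transaction = {feature for feature, present in row.items() if present}
print(transaction)   # {'seq_feature1', 'struct_feature1', 'function2'}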
Genomic Data
- After mining, 182,388 association rules were generated (minimum support = 5, minimum confidence = 65%)
- The results were post-processed with a maximum support of 30
  - Itemsets appearing too frequently aren't interesting
- This reduced the output to 381 rules (a sketch of the filter follows)
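A minimal sketch of the post-processing filter, with illustrative rules (not the paper's data):

# Each rule carries (lhs, rhs, support_count, confidence).
rules = [
    ({"seq_feature1"}, {"function2"}, 7, 0.80),
    ({"struct_feature1"}, {"function1"}, 42, 0.90),  # too frequent
]
kept = [r for r in rules if r[2] <= 30]   # drop overly common itemsets
print(len(kept))   # 1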
Genomic Data
- The rules generated were corroborated by biological background data
  - Found common substructures in serine endopeptidases
- The rules were not well distributed over protein families
  - There is still some work to be done on the data preprocessing stage
Association Rule Summary
- Association rule mining is a fundamental tool in data mining
- Several algorithms exist:
  - Apriori: uses a provable mathematical property to improve performance
  - FP-Growth: avoids candidate generation by using an effective data structure
  - Correlation rules: evaluate interestingness based on statistics
  - Query flocks: generalize the approach for query optimization (incorporation into database systems)
Association Rule Summary
Several extensions exist:
- Hierarchical attributes, i.e., multilevel/multidimensional rules (e.g., year -> month -> week -> day, or computer -> luggable -> handheld -> palm)
- Numerical attributes
- Constraint-based rules