Association Rule Mining I

Download Report

Transcript Association Rule Mining I

ACCTG 6910
Building Enterprise &
Business Intelligence Systems
(e.bis)
Association Rule Mining
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
1
Main Expectations
•
•
•
•
•
•
Knowledge pattern in focus
Definitions and examples
A basic method
How to tune the method
Decision support applications
When to use association rule mining
• Reading – T2, pp. 225 - 236
2
Association
• Under a given condition,
a set of objects  (implies) another set of objects
Examples
• Retail items purchased together
• Services subscribed by the same customer
• Web pages a user access in a session
• Courses taken by the same student
• Medications prescribed by a doctor for a patient visit
• Genes that are expressed at the same level
3
Decision Support Applications
•
•
•
•
•
•
Customer relationship management
Retail merchandise placement
Online retail catalog design
Website link re-organization
Fraud detection
Gene analysis for cancer prevention
4
Preliminary
• Set Theory
– A set is a collection of objects.
E.g., set A = {3,5} and set B= {1,3,5}
– Elements of a set are the objects belong to it.
E.g., 3 {3,5}, 3{1,3,5}, 3 A and 3 B
– Set X is a subset of set Y if any element in X belongs
to Y, denoted as X Y. E.g., A B or {3,5} {1,3,5}
5
Preliminary
• Two properties of set
– An element in a set is counted only once
E.g., {1,3,5} = {1,3,3,5}
– There is no order of elements in a set
E.g., {3,1,5} = {1,3,5}
6
Association Rules
Given: A database of transactions
Example of transactions:
•a customer’s visit to a grocery store
•an online purchase at a virtual store such as ‘Amazon.com’
Format of transactions:
date transaction IDcustomer ID Item
1/1/99 001
001
egg
1/1/99 001
001
milk
7
Association Rules
Find: patterns in the form of association rules
Association rules : correlate the presence of one set of items (X)
with the presence of another set of items (Y), denoted as X  Y
Example : {purchase egg,milk}  {bread}
How to measure correlations in association rules?
8
Association Rules
Itemset: a set of items, ex. {egg, milk}
Size of Itemset: number of items in that itemset.
The ratio of the number of transactions that purchases all items
in an itemset to the total number of transactions is called the
support of the itemset.
9
Association Rules
Example:
TID
CID
Item
Price
Date
101
101
101
102
102
103
103
103
201
201
201
201
201
202
202
202
Computer
MS Office
MCSE Book
Hard disk
MCSE Book
Computer
Hard disk
MCSE Book
1500
300
100
500
100
1500
500
100
1/4/99
1/4/99
1/4/99
1/8/99
1/8/99
1/21/99
1/2199
1/2199
10
Association Rules
In this example:
The support of the 2-itemset {Computer,Hard disk}
is 1/3=33.3%.
What is the support of 1-itemset {Computer}?
11
Association Rules
Two important metrics for association rules:
If two itemsets X and Y co-exist in a transaction database,
the association rule XY holds with
supports s which is the ratio of
the # of transactions purchasing both X and Y
to (÷) the total # of transactions
confidence c which is the ratio of
the # of transactions purchasing both X and Y
to (÷) the # of transactions purchasing X only.12
Association Rules
Association rule: {Computer} {Hard disk}
Support: 1/3=33.3%
Confidence: 1/2=50%
How about {Computer} {MCSE book}
{Computer, MCSE book}  {Hard disk}???
13
Association Rule Mining
Association rule mining: find all association rules with support
no less than user-specified minimum support and confidence no
less than user-specified minimum confidence in a database
• For small problems, the process
of mining association rules is not that complex.
• How about a transaction database with 1billion transactions
and 1million different items?
• An efficient algorithm is needed!
14
Association Rules
Two Steps in Association rule mining:
1. Find all large or frequent itemsets that have support above
user-specified minimum support.
2. For each large itemset L, find all association rules in the
form of a(L-a) where a and (L-a) are non-empty subsets
of L.
Example: find all association rules in the example with minimum
support 60% and minimum confidence 80%.
15
Association Rule Mining
Step 2 is trivial compared to step 1:
•Exponential search space
•Size of transaction database
16
Apriori Algorithm
• Apriori is an efficient algorithm to discover all
large itemsets from a huge database with
large number of items.
• Apriori is developed by two researchers from
IBM Almaden Research Lab.
17
Apriori Algorithm
• Apriori algorithm is based on Apriori property.
• Apriori property is that any subset of a large
itemset must be large.
18
Apriori Algorithm
• Step 1: Scan DB one time to find all large 1itemsets.
• Step 2: Generate candidate K-itemsets from
large (k-1)-itemsets.
• Step 3: Find all large k-itemsets from
candidate k-itemsets by scanning DB once
• Go back to step 2 and stop until no cadidate
itemsets can be generated.
19
Apriori Algorithm
• Step 2
– Candidate k-itemsets are k-itemsets that could be
large.
– Why generate candidate k-itemsets only from
large (k-1) itemsets?
– How to generate?
• Step 2-1: Join: Two large (k-1)-itemsets, L1 amd L2, that
are joinable must satisfy the following conditions:
– L1(1)=L2(1) and L1(2)=L2(2) and …. L1(K-2)=L2(K-2)
– L1(K-1)<L2(K-1)
• Step 2-2: Prune: prune itemsets generated in step 2-1
that have subset not large.
20
Apriori Algorithm
Transaction ID
Items
100
1,3,4,6
200
2,3,5,7
300
1,2,3,5,8
400
2,5,9,10
500
1,4
Minimum support =40%
Minimum confidence =70%
21
Association Rule Mining
Tid
items
100
1, 3, 4, 6
200
2, 3, 5, 7
300
1, 2, 3, 5, 8
400
2, 5, 9, 10
Large 1-itemset:
500
1, 4
{1}
support=3/5=60%
{2}
support=3/5=60%
{3}
support=3/5=60%
{4}
support=2/5=40%
{5}
support=3/5=60%
Minimum Support: 40%
22
Association Rule Mining
Large 1-itemset:
{1}
support=3/5=60%
{2}
support=3/5=60%
{3}
support=3/5=60%
{4}
support=2/5=40%
{5}
support=3/5=60%
Candidate 2-itemset:
{1, 2}
{1, 3}
{1, 4}
{2, 3}
{2, 4}
{2, 5}
{3, 4}
{3, 5}
{4, 5}
{1, 5}
23
Association Rule Mining
Candidate 2-itemset:
{1, 2}
{1, 3}
{1, 4}
{2, 3}
{2, 4}
{2, 5}
{3, 4}
{3, 5}
{1, 5}
{4, 5}
Large 2-itemset:
{1, 3}
support=2/5=40%
{1, 4}
support=2/5=40%
{2, 3}
support=2/5=40%
{2, 5}
support=3/5=60%
{3, 5}
support=2/5=40%
24
Association Rule Mining
Large 2-itemset:
{1, 3}
support=2/5=40%
{1, 4}
support=2/5=40%
{2, 3}
support=2/5=40%
{2, 5}
support=3/5=60%
{3, 5}
support=2/5=40%
Candidate 3-itemset:
{1, 3, 4}
{2, 3, 5}
25
Association Rule Mining
Candidate 3-itemset:
{1, 3, 4}
{2, 3, 5}
Large 3-itemset:
{2, 3, 5} support=2/5=40%
Candidate 4-itemset:
No candidate 4-itemset. Stop.
26