Mining Association Rules between Sets of Items in Large Databases


Mining Association Rules between Sets of Items in Large Databases
presented by Zhuang Wang
Outline
• Introduction
• Formal Model
• Apriori Algorithm
• Experiments
• Summary
Introduction
• Association rule:
- Association rules identify items that co-occur frequently within a dataset consisting of multiple independent selections of items (such as purchase transactions), and express those co-occurrences as rules.
• Applications:
- Association-finding algorithms answer questions such as "if a customer purchases product A, how likely is he to purchase product B?" and "what products will a customer buy if he buys products C and D?" (market basket analysis)
Formal Model
• Let I = {I_1, I_2, ..., I_n} be a set of items.
Let T be a database of transactions.
Each transaction t in T is represented as a subset of I.
Let X be a subset of I.
• Support and Confidence:
By an association rule, we mean an implication of the form X ⇒ I_k, where X is a set of some items in I, and I_k is a single item in I that is not present in X.
support: the probability that a transaction contains both X and I_k, i.e. P(X ∪ {I_k}).
confidence: the conditional probability that a transaction containing X also contains I_k, i.e. P(I_k | X).
Support and Confidence - Example
Transaction ID | Items Bought
1              | A, B, C
2              | A, C
3              | A, D
4              | B, E, F
• Let minimum support be 50% and minimum confidence 50%; we have
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
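To make the two measures concrete, here is a minimal Python sketch (an addition, not part of the original slides) that computes support and confidence for the two rules above, using the transactions from the table:

```python
transactions = [
    {"A", "B", "C"},   # transaction 1
    {"A", "C"},        # transaction 2
    {"A", "D"},        # transaction 3
    {"B", "E", "F"},   # transaction 4
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, transactions))        # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))   # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))   # 1.0   -> C => A has 100% confidence
```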
Apriori Algorithm
• Finds all itemsets that occur in at least a minimum fraction (the minimum support) of the transactions.
• Uses a "bottom up" approach, where frequent itemsets (the sets of items that satisfy minimum support) are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• The algorithm terminates when no further successful extensions are found.
• From each large itemset, it then generates rules that use items from the large itemset (see the sketch after this list).
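The following is a minimal Python sketch of this level-wise loop and of the rule-generation step, written from the description above rather than from the paper's pseudocode; the function names `apriori` and `rules` are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Bottom-up frequent-itemset mining: extend frequent itemsets one
    item at a time (candidate generation) and test the candidates
    against the data, stopping when no extension survives."""
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those meeting minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}

    # L1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent

def rules(all_frequent, min_confidence):
    """Generate rules X => I_k with a single-item consequent (the form
    used in the formal model) from each frequent itemset."""
    found = []
    for itemset, cnt in all_frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            conf = cnt / all_frequent[antecedent]  # subsets are always frequent
            if conf >= min_confidence:
                found.append((set(antecedent), item, conf))
    return found
```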
Find Frequent Itemsets - Example
Database D (minimum support = 2 transactions, i.e. 50%):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 with counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3: {2 3 5}
Scan D → L3: {2 3 5}: 2
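Running the `apriori` sketch from the previous slide on Database D reproduces this trace (minimum support 50%, i.e. 2 of 4 transactions):

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, cnt in sorted(apriori(D, 0.5).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
# [1] 2, [2] 3, [3] 3, [5] 3                (L1)
# [1, 3] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2    (L2)
# [2, 3, 5] 2                               (L3)
```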
Experiments
• We experimented with the rule mining algorithm using sales data obtained from a large retailing company.
• There are a total of 46,873 customer transactions in this data. Each transaction contains the department numbers from which a customer bought an item in a visit.
• There are a total of 63 departments. The algorithm finds whether there is an association between departments in customer purchasing behavior.
• The following rules were found for a minimum support of 1% and a minimum confidence of 50%; each rule is annotated with (confidence%, support%).
• [Tires] ⇒ [Automotive Services] (98.80, 5.79)
• [Auto Accessories], [Tires] ⇒ [Automotive Services] (98.29, 1.47)
• [Auto Accessories] ⇒ [Automotive Services] (79.51, 11.81)
• [Automotive Services] ⇒ [Auto Accessories] (71.60, 11.81)
• [Home Laundry Appliances] ⇒ [Maintenance Agreement Sales] (66.55, 1.25)
• [Children's Hardlines] ⇒ [Infants and Children's wear] (66.15, 4.24)
• [Men's Furnishing] ⇒ [Men's Sportswear] (54.86, 5.21)
Summary
• Apriori, while historically significant, suffers from a number of inefficiencies and trade-offs, which have spawned other algorithms.
• Hash trees: store candidate itemsets in a hash tree, with itemsets at the leaves and hash tables at the internal nodes.
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (see the sketch after this list).
• Sampling: mine on a subset of the given data; this needs a lower support threshold plus a method to determine completeness.
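As an illustration of the partitioning idea referenced above, here is a short sketch (my own, not an algorithm from the slides) that reuses the `apriori` function from earlier: locally frequent itemsets from each partition form the global candidate set, which one final scan then verifies.

```python
def partitioned_frequent(transactions, min_support, n_parts=2):
    # Phase 1: mine each partition locally. Any globally frequent itemset
    # must be locally frequent in at least one partition, so the union of
    # the local results is a superset of the true answer.
    size = -(-len(transactions) // n_parts)  # ceiling division
    candidates = set()
    for i in range(0, len(transactions), size):
        candidates |= set(apriori(transactions[i:i + size], min_support))

    # Phase 2: one scan of the full database to count and filter candidates.
    n = len(transactions)
    result = {}
    for c in candidates:
        cnt = sum(1 for t in transactions if c <= t)
        if cnt / n >= min_support:
            result[c] = cnt
    return result
```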
References
• R. Agrawal, T. Imielinski, A. Swami: "Mining Association Rules between Sets of Items in Large Databases", Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Washington D.C., May 1993, pp. 207-216.
• http://knight.cis.temple.edu/~vasilis/Courses/CIS664/
• http://en.wikipedia.org/wiki/Apriori_algorithm