Association Rules
Download
Report
Transcript Association Rules
IT 433
Data Warehousing and
Data Mining
Association Rules
Assist.Prof.Songül Albayrak
Yıldız Technical University
Computer Engineering Department
[email protected]
www.yildiz.edu.tr/~sbayrak
Association Rules
Association rules are on of the major
techniques of data mining and it is perhaps
the most common form of local-pattern
discovery in unsupervised learning systems.
It is a form of data mining that most closely
resembles the process that most people think about
when they try to understand the data mining
process; namely, “mining” for gold through a vast
database.
The gold in this case would be a rule that is
interesting, that tells you something about your
database that you didn’t already know and probably
weren’t explicitly articulate.
Market Basket Analysis
A market basket is a collection of items
purchased by a custumer in a single
transaction, which is a well defined business
activity.
One common analysis run against a
transactions database is to find sets of items,
or itemsets, that appear together in many
transactions.
Market Basket Analysis
continuing
A business can use knowledge of these
patterns to improve the placemant of these
items in the store or the layout of mail-order
catalog pages and web pages.
Basic Concepts: Frequent Patterns
An itemset containing i items is called an iitemset.
The percentage of transactions that contain
an itemset is called the itemset’s support.
For an itemset to be interesting, its support
must be higher than a user-specified
minimum. Such itemsets are said to be
frequent.
Basic Concepts: Frequent Patterns
Let’s try to describe the problem more
formally and develop its mathematical model.
From a database of sales transactions, we
want to discover the important associations
among items such that a presence of some
items in a transaction will imply the presence
of the other items in the same transactions.
Basic Concepts: Frequent Patterns
Let I={i1,i2,...,im} be a set of literals, called items.
Let DB be a set of transactions, where each
transaction T is a set of items such that TI.
Note that the quantities of the items bought in a
transaction are not considered, meaning that each
item is a binary variable indicating whether an item
was bought or not.
Basic Concepts: Frequent Patterns
An example of the model for such a transaction
database is given as follow;
Transaction-id
Items bought
10
A, C, D
20
B, C, E
30
A, B, C, E
40
B, E
Basic Concepts: Frequent Patterns
Let X be a set of items. A transaction T is said
to contain X if and only if X T
An association rule implies the form
XY, where X I, Y I and XY=.
The rule XY holds in the transaction set DB
with confidence c if c% of the transaction in D
that contain X also contain Y.
Basic Concepts: Frequent Patterns
The rule XY holds in the transaction set DB
with confidence c if c% of the transaction in D
that contain X also contain Y.
The rule XY has support s in the transaction
set D if s% of the transaction in DB that contain
XY.
Support (XY) =P(XY)
Confidence(XY)=P(Y|X)
Basic Concepts: Strong Rules
Confidence denote strength of implication and support indicates
the frequency of the patterns occuring in the rule.
It is often desireable to play attention to only those rules that may
have a reasonable large support.
Such rules with high confidence and strong support are referred
to as strong rules.
The task of mining association rules is essentially to discover
strong association rules in large databases.
Mining Association Rules
The problem of mining association rules may
be decomposed into two phases:
Discover the large itemsets, i.e. the sets of items
that have transaction supports above
predetermined minimum threshold.
Use the large itemsets to generate the
association rules for the database that have
confidence c above a predetermined mining
threshold.
ALGORITHM APRIORI
The algorithm Apriori computes the frequent
itemsets in the database through several
iterations.
Iteration i computes all frequent i-itemsets
(itemsets with i elements)
Each iteration has two steps:
Candidate generation
Candidate counting and selection
ALGORITHM APRIORI
In the first phase of the first iteration, the generated
set of candidate itemsets contains all 1-itemsets (i.e.
All items in the database)
In the counting phase, the algorithm counts their
support searching again through the whole
database. Finally, only 1-itemsets (items) with s
above required threshold will be selected as
frequent.
Thus, after the first iteration, all frequent 1-itemsets
will be known.
The Apriori Algorithm—An Example
Database TDB
Supmin = 2
Tid
Items
10
20
30
A, C, D
B, C, E
A, B, C, E
40
B, E
C1
1st
Itemset
{A, C}
{B, C}
{B, E}
{C, E}
C3
sup
2
2
3
2
Itemset
{B, C, E}
sup
2
3
{C}
3
{D}
{E}
1
3
scan
C2
L2
Itemset
{A}
{B}
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
3rd scan
sup
1
2
1
2
3
2
L3
L1
Itemset
{A}
{B}
sup
2
3
{C}
3
{E}
3
C2
2nd scan
Itemset
{B, C, E}
sup
2
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}