Week 9, Part 1


Mining Associations in Large Databases
Monash University
Semester 1, March 2006
Associations or Basket Analysis
A huge amount of data is stored electronically in all retail
outlets because of the barcoding of goods sold. It is natural
to try to extract useful information from these mountains of
data.
A conceptually simple yet interesting example is to find
association rules from these large databases.
Associations or Basket Analysis
• Association rules mining or market basket analysis
searches for interesting customer habits by looking at
associations.
• The classical example is the one where a store was
reported to have discovered that people buying nappies
tend also to buy beer
• Applications in marketing, store layout, customer
segmentation, medicine, finance, etc.
Basics
• Given a set of transactions {T}, each containing a
subset of items from an item set {i1, i2, …, im}, the task
is to discover association relationships or correlations
among the items.
• Want to find a group of items that tend to occur
together.
• The association rules are often written as X => Y,
meaning that whenever X appears Y also tends to
appear. X and Y may be single items or sets of items
(the same item not appearing in both).
Basics
Although we are only considering basket analysis,
the technique can be useful in other applications.
We assume that we have data from a supermarket
like Woolworths, which may have several thousand
items and many millions of transactions a week
(gigabytes of data each week). Note that the
quantities of items bought in a transaction are
ignored.
Terminology
• We assume that we have a set of transactions, each
transaction being a list of items (e.g. books)
• Suppose X and Y appear together in only 1% of the
transactions, but whenever X appears there is an 80%
chance that Y also appears
• The 1% presence of X and Y together is called the
support (or prevalence) of the rule and 80% is called
the confidence (or predictability) of the rule
• These are measures of interestingness of the rule
Terminology
• The support for X=>Y is the probability of both X
and Y appearing together, that is P(X U Y)
• The confidence of X=>Y is the conditional
probability of Y appearing given that X appears, that
is:
P(Y | X) = P(X U Y) / P(X)
• Confidence denotes the strength of the association.
Support indicates the frequency of the pattern. A
minimum support is necessary if an association is
going to be of some business value.
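A minimal sketch of these two measures in Python, assuming each transaction is represented as a set of item labels (the function names and the tiny example data are illustrative only):

```python
# Support and confidence of a rule X => Y over a list of transactions,
# each transaction represented as a set of items.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Tiny illustrative data set.
transactions = [{"X", "Z"}, {"X", "Y"}, {"Y", "W"}, {"X", "Y", "W"}]
print(support({"X", "Y"}, transactions))       # support of X => Y: 0.5
print(confidence({"X"}, {"Y"}, transactions))  # confidence of X => Y: ~0.67
```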
The task
Want to find all associations which have at least p%
support with at least q% confidence such that
– all rules satisfying user constraints are found
– the rules are found efficiently from large databases
– the rules found are actionable
An Example
Consider a furniture and appliances store that has
100,000 sales records. Let 5,000 records contain
lounge suites (5%), and let 10,000 records contain
TVs (10%). The support for lounge suites is
therefore 5% and for TVs 10%.
If the two items were independent, we would expect
the 5,000 records containing lounge suites to contain
10% TVs (i.e. 500). If the number of TV sales in the
5,000 records is in fact 2,500, then we have a lift of 5
and a confidence of 50%.
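The same numbers worked through as a short calculation (a sketch only; the counts come from the slide above):

```python
# Lift and confidence for the lounge => TV example on this slide.
n_total, n_lounge, n_tv, n_both = 100_000, 5_000, 10_000, 2_500

expected_both = n_lounge * (n_tv / n_total)       # 500 TVs expected if independent
confidence = n_both / n_lounge                    # 0.5, i.e. 50%
lift = (n_both / n_total) / ((n_lounge / n_total) * (n_tv / n_total))   # 5.0
print(expected_both, confidence, lift)
```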
Question
With 100,000 records, 5,000 records containing
lounge suites (5%), 10,000 containing TVs
(10%), and the number of TV sales in the 5,000
records containing lounge suites being 2,500, what
are the support and the confidence of the following
two rules?
lounge => TV
and TV => lounge
Associations
It is worth noting:
• The only rules of interest are with very high or
very low lift (why?)
• Items that appear on most transactions are of
little interest (why?)
• Similar items should be combined to reduce the
number of total items (why?)
The Apriori Algorithm
To find such associations, a simple two-step approach
may be used:
Step 1 - discover all frequent itemsets that have
support above the required minimum support
Step 2 - use the frequent itemsets to generate the
association rules that have a high enough
confidence level
Is this a reasonable algorithm? Why?
The Apriori Algorithm
The algorithm works as follows
• Scan all transactions and find all frequent items
that have transaction support above x%. Let these
be L1.
• Build item pairs from L1. This is the candidate set
C2. Scan all transactions and find all frequent pairs
in C2. Let this be L2.
• General rule - build sets of k items from Lk-1. This
is set Ck. Scan all transactions and find all
frequent sets in Ck. Let this be Lk. (A sketch of this
level-wise loop follows below.)
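A minimal sketch of the level-wise loop in Python, assuming transactions are sets of items and min_support is a fraction; the candidate step here simply unions pairs of frequent (k-1)-itemsets, a looser version of the join-and-prune rule discussed later in these slides:

```python
def apriori(transactions, min_support):
    """Level-wise search: compute L1, then repeatedly build candidates Ck
    from L(k-1) and keep the frequent ones as Lk until no set survives."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in sorted(items) if is_frequent(frozenset([i]))]
    frequent = list(current)              # accumulates L1, L2, L3, ...
    k = 2
    while current:
        # Candidate Ck: unions of two frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if is_frequent(c)]
        frequent.extend(current)
        k += 1
    return frequent

demo = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"B"}]
print(apriori(demo, 0.5))   # frequent 1- and 2-itemsets of this tiny demo
```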
The Apriori Algorithm
Consider an example with the following set of
transactions
TID    Items bought
001    B, M, T, Y
002    B, M
003    T, S, P
004    A, B, C, D
005    A, B
006    T, Y, E
007    A, B, M
008    B, C, D, T, P
009    D, T, S
010    A, B, M
The Apriori Algorithm
Assume that we wish to find associations with at least 30%
support and 60% confidence. The list of frequent items is first
computed. Only the following items qualify as frequent, since
they appear in at least 30% of the transactions. This is set
L1.
Item    Frequency
A       4
B       7
D       3
M       4
T       5
These five items form 10 pairs as shown on the next slide.
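A small sketch of this first scan, using the ten transactions above represented as Python sets:

```python
from collections import Counter

# The ten example transactions from the previous slide.
transactions = [
    {"B", "M", "T", "Y"}, {"B", "M"}, {"T", "S", "P"}, {"A", "B", "C", "D"},
    {"A", "B"}, {"T", "Y", "E"}, {"A", "B", "M"}, {"B", "C", "D", "T", "P"},
    {"D", "T", "S"}, {"A", "B", "M"},
]

min_count = 0.3 * len(transactions)    # 30% support = at least 3 transactions
counts = Counter(item for t in transactions for item in t)
L1 = sorted(item for item, c in counts.items() if c >= min_count)
print(L1)   # ['A', 'B', 'D', 'M', 'T']
```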
The Pairs
The following ten pairs are worth investigating since they
are the only ones that could be frequent (why?):
{A, B}
{A, D}
{A, M}
{A, T}
{B, D}
{B, M}
{B, T}
{D, M}
{D, T}
{M, T}
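The candidate pairs are simply the 2-item combinations of L1; a one-line sketch:

```python
from itertools import combinations

L1 = ["A", "B", "D", "M", "T"]
C2 = [frozenset(p) for p in combinations(L1, 2)]   # the ten pairs listed above
print(len(C2))   # 10
```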
The Frequent Pairs
The frequency of the pairs is as follows:
Pair      Frequency
{A, B}    4
{A, D}    1
{A, M}    2
{A, T}    0
{B, D}    2
{B, M}    4
{B, T}    2
{D, M}    0
{D, T}    2
{M, T}    1
So, the only frequent pairs are {A, B} and {B, M}.
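Continuing the earlier sketches (reusing the transactions list, min_count and C2 defined above), the second scan counts each candidate pair and keeps those meeting the 30% threshold:

```python
# Count the transactions containing each candidate pair, then filter.
pair_counts = {pair: sum(1 for t in transactions if pair <= t) for pair in C2}
L2 = [pair for pair, n in pair_counts.items() if n >= min_count]
print([sorted(p) for p in L2])   # [['A', 'B'], ['B', 'M']]
```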
Confidence
These two pairs have more than 30% support. No
other pair can have 30% support or more. Why?
What about their confidence level? A => B has
confidence level of 100%, B => A has confidence
level of about 57% (4/7). B => M also has
confidence level of approx 57%, M => B 100%.
Therefore only A => B and M => B are acceptable.
Are we finished? Is there any point going further?
Why?
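A short sketch of these confidence checks, again reusing the transactions list from the earlier sketch:

```python
def conf(lhs, rhs, transactions):
    """confidence(lhs => rhs) = count(lhs and rhs together) / count(lhs)."""
    both = sum(1 for t in transactions if lhs | rhs <= t)
    return both / sum(1 for t in transactions if lhs <= t)

print(conf({"A"}, {"B"}, transactions))   # 1.00  -> A => B accepted (>= 60%)
print(conf({"B"}, {"A"}, transactions))   # ~0.57 -> B => A rejected
print(conf({"B"}, {"M"}, transactions))   # ~0.57 -> B => M rejected
print(conf({"M"}, {"B"}, transactions))   # 1.00  -> M => B accepted
```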
Example
The frequent item pairs (that is L2) are:
Pair      Frequency
{A, B}    4
{B, M}    4
These pairs are now used to generate a set of three
items (i.e. C3). In this simple example only one such
set is possible which is {A, B, M}. The frequency of
this set is only 2 which is below 30% support and
therefore this set of three items does not qualify. So
L3 is empty.
Questions
Firstly, how do we compute C3 from L2 or more
generally Ci from Li-1?
Secondly, if there were several members of the set
C3, how would we derive L3 by pruning C3?
Thirdly, let us suppose the set {A, B, M} did have
30% support or higher. What do we then do? What
does it tell us? How do we derive association rules?
Answers
To compute C3 from L2, or more generally Ci from
Li-1, we join members of Li-1 to other members if
the first (i-2) items are common.
When there are several members of the set C3,
or Ci in general, we look at each member of the
set and consider each subset. For this member
to be frequent, every subset must be frequent.
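A sketch of this join-and-prune step in Python; the example call uses a hypothetical L2 (not the lecture's) so that a candidate actually survives:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Build Ck from L(k-1): join two (k-1)-itemsets whose first k-2 items
    agree, then prune any candidate with an infrequent (k-1)-subset."""
    prev = {frozenset(s) for s in L_prev}
    members = [sorted(s) for s in prev]
    joined = set()
    for a in members:
        for b in members:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:   # join step
                joined.add(frozenset(a) | frozenset(b))
    # Prune step: every (k-1)-subset of a candidate must itself be in L(k-1).
    return [c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

# Hypothetical L2 in which {A, B, M} survives both steps:
print(apriori_gen([{"A", "B"}, {"A", "M"}, {"B", "M"}], 3))   # one candidate: {A, B, M}
# With the lecture's actual L2, {A, B} and {B, M}, nothing survives,
# which is consistent with L3 being empty.
```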
Example
If the set {A, B, M} did have 30% support or
higher, we can then find association rules from it
that satisfy both the conditions of support and
confidence.
Example
We first generate all nonempty subsets of {A, B,
M} and use each of them on the LHS with the
remaining symbols on the RHS. For example, the
nonempty proper subsets of {A, B, M} are A, B, M,
AB, AM, BM. The possible rules therefore are
A => BM, B => AM, M => AB, AB => M, AM => B
and BM => A.
Answers
To test the confidence of the possible rules A => BM,
B => AM, M => AB, AB => M, AM => B or BM => A,
we proceed as follows. We know that
confidence(A=>B) = P(B|A) = P(A U B)/ P(A)
Therefore the confidence of A => B can be computed by
taking the ratio of the support for A and B together
to the support for A by itself. The confidence
of all these rules can thus be computed.
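A sketch of this rule-generation step, reusing the transactions list from the earlier sketch; the call for {A, B, M} is hypothetical, since that set did not actually reach 30% support:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf):
    """Generate X => Y rules from one itemset, keeping those whose
    confidence = count(whole itemset) / count(X) is at least min_conf."""
    items = frozenset(itemset)
    whole = sum(1 for t in transactions if items <= t)
    rules = []
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(sorted(items), r)):
            lhs_count = sum(1 for t in transactions if lhs <= t)
            if lhs_count and whole / lhs_count >= min_conf:
                rules.append((sorted(lhs), sorted(items - lhs)))
    return rules

print(rules_from_itemset({"A", "B", "M"}, transactions, 0.6))   # rules meeting 60% confidence
```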
Efficiency
Consider an example of a supermarket database
which might have several thousand items, of which
1000 are frequent, and several million
transactions.
Which part of the Apriori algorithm will be expensive
to compute? Why?
Efficiency
The algorithm used to construct the candidate set for large
itemsets is crucial to the performance of the Apriori
algorithm. The larger the candidate set, the higher the
processing cost for discovering the large itemsets.
Given that the early candidate sets are very large, the initial
iterations dominate the cost.
In a supermarket database with about 1000 frequent
items, there will be almost half a million candidate pairs
(1000 × 999 / 2) that need to be counted.
Efficiency
Generally the number of frequent pairs among these
candidate pairs will be small and therefore (why?) the
number of three-itemsets will be small.
Therefore, it is the generation of the large 2-itemsets
that is the key to improving the performance of the
Apriori algorithm.
Improving the Apriori Algorithm
Many techniques for improving the efficiency:
•Pruning (already mentioned)
•Hashing based technique
•Transaction reduction
•Partitioning
•Sampling
•Dynamic itemset counting
Pruning
Pruning can reduce the size of the candidate set Ck.
We want to transform Ck into the set of frequent
itemsets Lk. To reduce the work of checking, we
may use the fact that every subset of a frequent
itemset must also be frequent, so any candidate in
Ck with an infrequent subset can be discarded
without counting it.
Hashing
The direct hashing and pruning (DHP) algorithm
attempts to generate large itemsets efficiently and
reduces the transaction database size.
Transaction Reduction
As discussed earlier, any transaction that does
not contain any frequent k-itemsets cannot
contain any frequent (k+1)-itemsets and such a
transaction may be marked or removed.
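A one-line sketch of this reduction, reusing the transactions list and L2 from the earlier sketches:

```python
# Drop any transaction containing no frequent 2-itemset; it cannot
# contribute to any frequent 3-itemset.
transactions = [t for t in transactions if any(pair <= t for pair in L2)]
print(len(transactions))   # 6 of the original 10 transactions remain
```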
Partitioning
The set of transactions may be divided into a
number of disjoint subsets. Then each partition is
searched for frequent itemsets using support s
times the number of transactions in the subset.
These frequent itemsets are called local frequent
itemsets.
How can information about local itemsets be used
in finding frequent itemsets of the global set of
transactions?
Sampling
A sample may be obtained from the overall set of
transactions and the sample is searched for
frequent itemsets using support s times the number
of transactions in the sample. These frequent
itemsets are called sample frequent itemsets.
How can information about sample itemsets be
used in finding frequent itemsets of the global set of
transactions?
Problem with Association Rules Algorithms
• Users are overwhelmed by the number of rules
identified - how can the rules be reduced to those that
are relevant to the user's needs?
• The Apriori algorithm assumes sparsity, since the number
of items on each record is quite small.
• Some applications produce dense data which may
also have
• many frequently occurring items
• strong correlations
• many items on each record
Problem with Association Rules Algorithms
Also consider:
AB => C (90% confidence)
and A => C (92% confidence)
Clearly the first rule is of no use. We should only
look for more complex rules if they are better than
simple rules.
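A minimal sketch of such a check, assuming rule confidences have already been collected into a dictionary keyed by (LHS, RHS) pairs (a hypothetical structure, not something defined in the lecture):

```python
from itertools import combinations

def is_redundant(lhs, rhs, conf_lookup):
    """True if some simpler rule X' => rhs, with X' a proper subset of lhs,
    is at least as confident as lhs => rhs."""
    c = conf_lookup[(frozenset(lhs), frozenset(rhs))]
    return any(
        conf_lookup.get((frozenset(s), frozenset(rhs)), 0.0) >= c
        for r in range(1, len(lhs))
        for s in combinations(sorted(lhs), r)
    )

confidences = {
    (frozenset({"A", "B"}), frozenset({"C"})): 0.90,
    (frozenset({"A"}), frozenset({"C"})): 0.92,
}
print(is_redundant({"A", "B"}, {"C"}, confidences))   # True: AB => C adds nothing over A => C
```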