Data Mining Association Rules

Download Report

Transcript Data Mining Association Rules

Data Mining Association
Rules
Yao Meng
Hongli Li
91.574 Database II
Fall 2002
Outline

Overview






Apriori
AprioriTid
DIC
Data Structure
Experiment Environment
Experiment Result and Analysis
Overview – Apriori Algorithm
Generate Large
1-itemset
Merge and prune
to Generate
candidate of
next size
large itemset
Yes
Has
Candidate?
No
End
Yes
Go through
whole DB to
Count Support
> Min
Support?
Overview – AprioriTid
Generate Large
1-itemset
Merge and prune
to Generate
candidate of
next size
large itemset
Yes
Has
Candidate?
No
End
Yes
Go through
DB_Sub to
Count Support
and generate
new DB_Sub
> Min
Support?
Overview – DIC






Read M transaction
Increment those itemset
that are current counting
If all the child of a itemset
turned to large, begin to
counting this itemset
If an itemset has been
counted through all the
transaction, remove it from
the current counting list
If at the end of the DB, go to
the first step
Stop if no itemset are need
to counting
Hypothesis of Performance Analysis
Given a memory size
 AprioriTid generally has better performance
than Apriori due to I/O saving
 DIC has better performance than Apriori in
fairly homogenenous data environment.
 DIC performance should approach that of
Apriori while M approaches number of total
transaction.
Experiment Environment

Data Sets


IBM Synthetic Dataset Generation Code for
Association Rules
Enviroments


Operating System: Microsoft Windows XP
Professional
Computer



Intel Pentium III processor 550MHz
RAM 384 MB
Source code written in Java
Data Structure

Apriori and DIC




Candidate Itemset
stored in a hash-tree
Each internal node is
are hashtables
The leaves stored the
candidate itemset
AprioriTid

Use array to keep
candidates
Size vs. Execution Time
Number of Items = 8
Avg transaction length = 5
M = 500
Support Threshold
Size = 16410 transaction
Average Length per transaction = 5
Number of Items = 8
M = 500
DIC – Different M value
Size = 12291 transaction
Average Length per transaction = 5
Number of Items = 8
DIC – “Non-Homogeneous” Dataset
Size = 6000 transaction
M = 500
Number of Items = 8
Conclusions

AprioriTid is the best in our experiment



Apriori and DIC are very similar



I/O saving
AprioriTid use small Data structure
Apriori is Special Case of DIC
They use same data structure
DIC


Sensitive to data
M affects performance
Reference
1.
2.
3.
4.
5.
6.
7.
Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining Association Rules between
Sets of Items in Large Database. Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, 1993
Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules.
Proc. 20th Int. Conf. Very Large Data Bases, VLDB, page 487-499. 1994
Ashok Savasere, Edward Omiecinski, Shamkant Navathe. An Efficient Algorithm for
Mining Association Rules in Large Databases. Proc. of the 21st VLDB Conf., pp. 432444, 1995.
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur. Dynamic Itemset
Counting and Implication Rules for Market Basket Data. SIGMOD 1997, Proceedings
ACM SIGMOD International Conference on Management of Data. Tucson, Arizona,
USA. 1997.
J. Hipp, U. Güntzer, G. Nakhaeizadeh. Mining Association Rules: Deriving a Superior
Algorithm by Analysing Today's Approaches. Proceedings of the 4th European
Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '00), Lyon,
France. 2000.
Jochen Hipp, Ulrich Güntzer, Gholamreza Nakhaeizadeh. Algorithms for Association
Rule Mining – A General Survey and Comparison. SIGKDD Explorations. 2(1): 58-64.
2000.
R. Srikant, R. Agrawal. Mining Generalized Association Rule. In Proc. of the VLDB
Conference, September 1995