DataMining and Association Rules by Dongyi Jia


Introduction to Data Mining and Association Rules
cs157 Spring 2009
Instructor: Dr. Sin-Min Lee
Student: Dongyi Jia
What is data mining?
● The automated extraction of hidden predictive information from databases.
● Allows users to analyze large databases to solve business decision problems.
● An extension of statistics, with a few artificial intelligence and machine learning twists thrown in.
● Attempts to discover rules and patterns from data.
Data Mining - On What Kind of Data?
● In principle, data mining should be applicable to any kind of information repository:
  ● relational databases
  ● data warehouses
  ● transactional and advanced databases
  ● flat files
  ● World Wide Web
Data Mining Functionalities - What Kinds of Patterns Can Be Mined?
● Association Analysis
● Classification and Prediction
● Cluster Analysis
● Evolution Analysis
Applications of data mining
● Some applications require prediction: for example, when a person applies for a credit card, the credit-card company wants to predict whether the person is a good credit risk.
● Others look for associations: for example, if a customer buys a book, an online bookstore may suggest other associated books.
Association Rule Discovery
● Task: discovering association rules among items in a transaction database.
● How are association rules mined from large databases?
1. Find all frequent itemsets: each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence.
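The two steps above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an efficient miner such as Apriori; the function names (`frequent_itemsets`, `strong_rules`), the toy transactions, and the threshold values are all assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support_count):
    """Step 1: return {itemset: count} for every itemset whose count
    reaches min_support_count (brute-force enumeration of all subsets)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_support_count:
                frequent[candidate] = count
    return frequent

def strong_rules(frequent, n_transactions, min_confidence):
    """Step 2: split each frequent itemset into antecedent => consequent
    and keep the rules whose confidence reaches min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is itself frequent,
                # so its count is already in the dictionary.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent,
                                  count / n_transactions, confidence))
    return rules

# Hypothetical toy data, for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
]
frequent = frequent_itemsets(transactions, min_support_count=2)
rules = strong_rules(frequent, len(transactions), min_confidence=0.6)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "=>", sorted(consequent),
          f"sup={support:.2f} conf={confidence:.2f}")
```

Enumerating every subset is exponential in the number of items, so this only works for tiny examples; it is meant to show the separation between the two steps rather than a practical method.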
Association Rules (cont.)
● Retail shops are often interested in associations between items that people buy.
● Someone who buys bread is quite likely also to buy milk.
  association rule: bread => milk
● A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
  association rule: DSC => OSC
Association Rules (cont.)
Two numbers:
● Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
● Confidence: a measure of how often the consequent is true when the antecedent is true.
Association Rules (cont.)
● Let I = {i1, i2, …, im} be the total set of items.
  D is a set of transactions.
  d is one transaction, consisting of a set of items, with d ⊆ I.
● Association rule:
  X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
  support = (# of transactions containing X ∪ Y) / (# of transactions in D)
  confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
Example
● Example of transaction data:
  1. CD player, music CD, music book
  2. CD player, music CD
  3. music CD, music book
  4. CD player
● I = {CD player, music CD, music book}
● # of transactions in D = 4
● # of transactions containing both CD player and music CD = 2
● # of transactions containing CD player = 3
● CD player => music CD (sup = 2/4, conf = 2/3)
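The arithmetic in this example can be checked with a short Python sketch. The helper names (`count_containing`, `support`, `confidence`) are assumptions for illustration; the data are the four transactions from the example, with the items written as "CD player", "music CD", and "music book".

```python
from fractions import Fraction

# The four transactions from the example.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return Fraction(count_containing(x | y, transactions), len(transactions))

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return Fraction(count_containing(x | y, transactions),
                    count_containing(x, transactions))

x, y = {"CD player"}, {"music CD"}
sup = support(x, y, transactions)
conf = confidence(x, y, transactions)
print(f"sup = {float(sup):.2f}, conf = {float(conf):.2f}")
# prints: sup = 0.50, conf = 0.67
```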
Association Rules (cont.)
● Rule support and confidence reflect the usefulness and certainty of discovered rules.
● A support of 50% for the association rule means that a CD player and a music CD are purchased together in 50% of all the transactions under analysis.
● A confidence of 67% means that 67% of the customers who purchased a CD player also bought a music CD.
Strong Association Rule
● The user sets support and confidence thresholds.
● Rules above the support threshold have LARGE support.
● Rules above the confidence threshold have HIGH confidence.
● Rules satisfying both are said to be STRONG.
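As a rough illustration of these definitions, the sketch below labels a rule given its support and confidence. The function name `classify_rule` and the threshold values are hypothetical. The slides do not say whether a value exactly equal to a threshold counts, so the sketch assumes it does (`>=`).

```python
def classify_rule(support, confidence, min_support, min_confidence):
    """Return the labels that apply to a rule under user-set thresholds."""
    labels = []
    large = support >= min_support        # assumption: threshold is inclusive
    high = confidence >= min_confidence   # assumption: threshold is inclusive
    if large:
        labels.append("LARGE support")
    if high:
        labels.append("HIGH confidence")
    if large and high:
        labels.append("STRONG")
    return labels

# Rule from the example: CD player => music CD (sup = 2/4, conf = 2/3).
# The thresholds below are made-up values for illustration.
labels = classify_rule(2 / 4, 2 / 3, min_support=0.4, min_confidence=0.6)
print(labels)  # ['LARGE support', 'HIGH confidence', 'STRONG']

# Raising the confidence threshold leaves only the support label.
weak = classify_rule(2 / 4, 2 / 3, min_support=0.4, min_confidence=0.8)
print(weak)    # ['LARGE support']
```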
References
● Professor Lee’s lectures
  http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
● Rui Zhao, SJSU
  http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
● Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
Thank you!