DataMining and Association Rules by Dongyi Jia
Introduction
of Data Mining and
Association Rules
cs157 Spring 2009
Instructor: Dr. Sin-Min Lee
Student: Dongyi Jia
What is data mining?
The automated extraction of hidden
predictive information from databases.
Allows users to analyze large databases
to solve business decision problems.
An extension of statistics, with a few
artificial intelligence and machine
learning twists thrown in.
Attempts to discover rules and patterns
from data.
Data Mining - On What Kind of
Data
In principle, data mining should be
applicable to any kind of information
repository:
● relational databases
● data warehouses
● transactional and advanced databases
● flat files
● World Wide Web
Data Mining Functionalities - What
Kinds of Patterns Can Be Mined?
Association Analysis
Classification and Prediction
Cluster Analysis
Evolution Analysis
Applications of data mining
Require some sort of Prediction:
for example: when a person applies for a
credit card, the credit-card company
wants to predict if the person is a good
credit risk.
Looks for Associations:
for example: if a customer buys a book,
an on-line bookstore may suggest other
associated books.
Association Rule Discovery
Task: Discovering association rules
among items in a transaction database.
How are association rules mined from
a large database?
1. Find all frequent itemsets: each of these
itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the
frequent itemsets: these rules must satisfy
minimum support and minimum confidence.
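The two steps above can be sketched in a few lines of Python. This is a brute-force illustration that enumerates every candidate itemset, not an optimized algorithm such as Apriori, so it only suits tiny inputs. The function name `mine_rules`, the sample baskets, and the threshold values are assumptions made up for this example.

```python
from itertools import combinations


def mine_rules(transactions, min_support_count, min_confidence):
    """Brute-force sketch of the two mining steps (hypothetical helper)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    def count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: keep every itemset that meets the minimum support count.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            c = count(frozenset(combo))
            if c >= min_support_count:
                frequent[frozenset(combo)] = c

    # Step 2: split each frequent itemset into X => Y and keep the rules
    # whose confidence meets the minimum confidence.
    rules = []
    for itemset, c in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its count is already stored.
                conf = c / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((sorted(lhs), sorted(itemset - lhs),
                                  c / len(transactions), conf))
    return frequent, rules


# Made-up sample data and thresholds, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
frequent, rules = mine_rules(baskets, min_support_count=2, min_confidence=0.6)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} => {rhs}  sup={sup:.2f}  conf={conf:.2f}")
```

Counting every possible itemset grows exponentially with the number of items, which is why practical miners prune candidates instead of enumerating them all.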
Association Rules (cont.)
Retail shops are often interested in
associations between items that people
buy.
Someone who buys bread is quite likely also to
buy milk.
association rule: bread => milk
A person who bought the book Database
System Concepts is quite likely also to buy the
book Operating System Concepts.
association rule: DSC => OSC
Association Rules (cont.)
Two numbers:
Support: a measure of what fraction of
the population satisfies both the
antecedent and the consequent of the
rule.
Confidence: a measure of how often
the consequent is true when the
antecedent is true.
Association Rules (cont.)
Let I = {i1, i2, …, im} be the total set of items
D is a set of transactions
d is one transaction, consisting of a set of items,
d ⊆ I
Association rule:
X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
support = (# of transactions containing X ∪ Y) / |D|
confidence = (# of transactions containing X ∪ Y) /
(# of transactions containing X)
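The two formulas translate almost directly into code. The sketch below assumes each transaction is a Python set of item names; the function names `support` and `confidence` and the sample data are illustrative choices, not part of the original slides.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = set(x) | set(y)
    return sum(1 for t in transactions if both <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x = set(x)
    both = x | set(y)
    with_x = sum(1 for t in transactions if x <= set(t))
    with_both = sum(1 for t in transactions if both <= set(t))
    # Assumption: a rule whose antecedent never occurs gets confidence 0.
    return with_both / with_x if with_x else 0.0


# Made-up transactions over the items a, b, c.
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"c"}]
sup_ab = support(D, {"a"}, {"b"})
conf_ab = confidence(D, {"a"}, {"b"})
print(sup_ab, conf_ab)
```

Both functions count transactions containing X ∪ Y; only the denominator differs, which is why confidence is never smaller than support.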
Example
Example of transaction data:
1. CD player, music CD, music book
2. CD player, music CD
3. music CD, music book
4. CD player
I = {CD player, music CD, music book}
|D| = 4
# of transactions containing both CD player
and music CD = 2
# of transactions containing CD player = 3
CD player => music CD (sup = 2/4, conf = 2/3)
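The counts in this example can be checked mechanically. The short script below encodes the four transactions as Python sets and uses `fractions.Fraction` so the results stay exact; the variable names are illustrative.

```python
from fractions import Fraction

# The four transactions from the example.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

x, y = {"CD player"}, {"music CD"}

# Count transactions containing both X and Y, and those containing X.
n_both = sum(1 for t in transactions if (x | y) <= t)
n_x = sum(1 for t in transactions if x <= t)

sup = Fraction(n_both, len(transactions))
conf = Fraction(n_both, n_x)

print(f"sup  = {n_both}/{len(transactions)} = {float(sup):.2f}")
print(f"conf = {n_both}/{n_x} = {float(conf):.2f}")
```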
Association Rules (cont.)
Rule support and confidence reflect the
usefulness and certainty of discovered rules.
A support of 50% for the association rule means
that in 50% of all the transactions under analysis,
a CD player and a music CD are purchased
together.
A confidence of 67% means that 67% of the
customers who purchased a CD player also
bought a music CD.
Strong Association Rule
User sets support and confidence
thresholds.
Rules above support threshold have
LARGE support.
Rules above confidence threshold have
HIGH confidence.
Rules satisfying both are said to be
STRONG.
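Checking whether a rule is strong is a simple filter once support and confidence are known. In the sketch below, the `Rule` class, the `is_strong` function, and the threshold values (50% support, 60% confidence) are assumptions chosen for illustration. The rule values are computed from the CD player example, and a rule that exactly meets a threshold is treated as passing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: str
    consequent: str
    support: float
    confidence: float


def is_strong(rule, min_support, min_confidence):
    """A rule is strong when it meets both user-set thresholds."""
    return rule.support >= min_support and rule.confidence >= min_confidence


# Support and confidence values taken from the CD player transactions.
rules = [
    Rule("CD player", "music CD", 2 / 4, 2 / 3),
    Rule("CD player", "music book", 1 / 4, 1 / 3),
    Rule("music book", "music CD", 2 / 4, 2 / 2),
]

# Hypothetical thresholds set by the user.
strong = [r for r in rules if is_strong(r, min_support=0.5, min_confidence=0.6)]
for r in strong:
    print(f"{r.antecedent} => {r.consequent}")
```

Raising either threshold shrinks the set of strong rules, so the thresholds control how many rules the user has to review.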
References
Professor Lee’s lectures
http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
Rui Zhao, SJSU
http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
Jiawei Han, Micheline Kamber
Data Mining Concepts and Techniques
Morgan Kaufmann Publishers
Thank you!