Data Mining by Rui Zhao

Data Mining
-Association Rules
By Rui Zhao
What is data mining?
• The automated extraction of hidden predictive information from databases.
• Allows users to analyze large databases to solve business decision problems.
• An extension of statistics, with a few artificial intelligence and machine learning twists thrown in.
• Attempts to discover rules and patterns from data.
Example
Consider a catalog retailer who needs to decide who should receive information about a new product. The information operated on by the data mining process is contained in a historical database of previous interactions with customers and the features associated with those customers, such as age, zip code, and their responses. The data mining software uses this historical information to build a model of customer behavior that can predict which customers are likely to respond to the new product. Using this information, a marketing manager can select only the customers who are most likely to respond. The operational business software can then feed the results of the decision to the appropriate touch point systems (call centers, direct mail, web servers, email systems, etc.) so that the right customers receive the right offers.
Applications of data mining
• Some require a form of prediction:
  for example, when a person applies for a credit card, the credit-card company wants to predict whether the person is a good credit risk.
• Others look for associations:
  for example, if a customer buys a book, an on-line bookstore may suggest other associated books.
Association Rule Discovery
• Task: discovering association rules among items in a transaction database.
• An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B
• In general: A1, A2, … => B
Association Rules (cont.)
• Retail shops are often interested in associations between items that people buy.
  - Someone who buys bread is quite likely also to buy milk.
    association rule: bread => milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
    association rule: DSC => OSC
Association Rules (cont.)
• Two numbers:
  - Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - Confidence: a measure of how often the consequent is true when the antecedent is true.
Association Rules (cont.)
• Used to find all rules in basket data
• Basket data is also called transaction data
• Analyze how items purchased by customers in a shop are related
• Discover all rules that have:
  - support greater than minsup, specified by the user
  - confidence greater than minconf, specified by the user
Association Rules (cont.)
• Let I = {i1, i2, …, im} be the total set of items
• D is a set of transactions
• d is one transaction, consisting of a set of items, d ⊆ I
• Association rule:
  - X => Y where X ⊂ I, Y ⊂ I and X ∩ Y = ∅
  - support = (# of transactions containing X ∪ Y) / |D|
  - confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
Example
• Example of transaction data:
  1. CD player, music CD, music book
  2. CD player, music CD
  3. music CD, music book
  4. CD player
• I = {CD player, music CD, music book}
• |D| = 4
• # of transactions containing both CD player and music CD = 2
• # of transactions containing CD player = 3
• CD player => music CD (sup = 2/4, conf = 2/3)
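The arithmetic on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function names (`count_containing`, `support`, `confidence`) are illustrative choices, not part of the original slides, and each transaction is modelled as a plain Python set.

```python
from fractions import Fraction

# The four transactions from the example slide, each modelled as a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return Fraction(count_containing(x | y, transactions), len(transactions))

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Raises ZeroDivisionError if no transaction contains X.
    """
    return Fraction(count_containing(x | y, transactions),
                    count_containing(x, transactions))

x, y = {"CD player"}, {"music CD"}
print(support(x, y, transactions))     # 1/2 (the slide's 2/4, reduced)
print(confidence(x, y, transactions))  # 2/3
```

Exact fractions are used instead of floats so the output can be compared directly with the numbers on the slide.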
Strong Association Rule
• The user sets support and confidence thresholds (e.g. at least 100 relations, 80% confidence).
• Rules above the support threshold have LARGE support.
• Rules above the confidence threshold have HIGH confidence.
• Rules satisfying both are said to be STRONG.
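The three labels can be expressed as a small check. This sketch makes several assumptions: the function name `classify_rule` is hypothetical, support is given as an absolute count to match the "at least 100" example, and "at least" is read as greater than or equal to. The default thresholds are simply the example values from the slide.

```python
def classify_rule(support_count, confidence,
                  min_support_count=100, min_confidence=0.80):
    """Label a rule as having large support, high confidence, or both (strong).

    Thresholds default to the slide's example values; ">=" is an assumed
    reading of "at least".
    """
    large = support_count >= min_support_count
    high = confidence >= min_confidence
    return {"large_support": large, "high_confidence": high,
            "strong": large and high}

# A rule seen in 150 records with 85% confidence passes both thresholds.
print(classify_rule(150, 0.85))
# A rule seen in 150 records with 50% confidence has large support only.
print(classify_rule(150, 0.50))
```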
Association rules
• How are association rules mined from large databases?
• Two-step process:
  1. find all frequent item sets
  2. generate strong association rules from frequent item sets
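The two steps can be sketched in Python as follows. This is a brute-force illustration, not an optimized algorithm such as Apriori: it enumerates candidate item sets level by level and stops once a level has no frequent set. The function names, the sample baskets, and the use of ">=" for the thresholds are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: return {itemset: support} for every itemset with support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            count = sum(1 for t in transactions if cand <= t)
            if count / n >= minsup:
                result[cand] = count / n
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so no larger one can be frequent
    return result

def strong_rules(frequent, minconf):
    """Step 2: split each frequent itemset into X => Y, keep rules with conf >= minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                y = itemset - x
                # x is guaranteed to be in `frequent`: every subset of a
                # frequent itemset is itself frequent.
                conf = sup / frequent[x]
                if conf >= minconf:
                    rules.append((x, y, sup, conf))
    return rules

# Hypothetical basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
frequent = frequent_itemsets(baskets, minsup=0.5)
for x, y, sup, conf in strong_rules(frequent, minconf=0.6):
    print(sorted(x), "=>", sorted(y), f"sup={sup:.2f} conf={conf:.2f}")
```

Step 2 never rescans the transactions: the supports computed in step 1 are enough to derive every confidence value.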
Classification vs. Association
• Classification
  - mines a small set of rules existing in the data to form a classifier or predictor
  - has a target attribute
  - the dataset is in the form of a relational table
• Association
  - the dataset is transaction data
  - has no fixed target
  - the target can be fixed, so association rules can also be used for classification:
    A=a, B=b => Class = yes
    A=c => Class = no
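One way to picture "fixing the target" is to turn each row of a relational table into a transaction of attribute=value items and only consider rules whose consequent is a Class item. The sketch below assumes a small made-up table; the rows, the helper names `to_transaction` and `rule_stats`, and the convention of returning 0.0 confidence for an unmatched antecedent are all hypothetical.

```python
# Hypothetical relational table; "Class" is the fixed target attribute.
rows = [
    {"A": "a", "B": "b", "Class": "yes"},
    {"A": "a", "B": "b", "Class": "yes"},
    {"A": "c", "B": "b", "Class": "no"},
    {"A": "a", "B": "d", "Class": "no"},
]

def to_transaction(row):
    """Encode one table row as a set of 'attribute=value' items."""
    return {f"{attr}={value}" for attr, value in row.items()}

transactions = [to_transaction(r) for r in rows]

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent."""
    both = sum(1 for t in transactions if antecedent | {consequent} <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    confidence = both / ante if ante else 0.0
    return both / len(transactions), confidence

print(rule_stats({"A=a", "B=b"}, "Class=yes", transactions))  # (0.5, 1.0)
print(rule_stats({"A=c"}, "Class=no", transactions))          # (0.25, 1.0)
```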
References
• Professor Lee’s lectures
  - http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
• Web sites
  - http://www.thearling.com/
  - pizza.unbsj.ca/~owen/backup/courses/OLAP2004/dm.pdf