Data Mining by Yanhua


Data Mining
CS157B Fall 04
Professor Lee
By Yanhua Xue
Overview

What is Data Mining?
Why do we need Data Mining?
Major tasks of Data Mining
Here is a problem

You are a marketing manager for a brokerage
company
– Problem: Churn is too high
Turnover (after the six-month introductory period ends) is 40%
– Customers receive incentives (average cost: $160)
when an account is opened
– Giving new incentives to everyone who might leave is
very expensive
– Bringing back a customer after they leave is both difficult
and costly
A solution

One month before the introductory period ends,
predict which customers will leave
– If you want to keep a customer who is predicted to
churn, offer them something based on their predicted
value
Customers who are not predicted to churn need no attention
– If you don’t want to keep the customer, do nothing
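
The policy on this slide can be sketched as a small decision function. This is only an illustration: the function name, the `offer_fraction` parameter, and the example values are assumptions that do not come from the slides, and the churn prediction itself is taken as an input rather than computed.

```python
def retention_action(predicted_to_churn, want_to_keep, predicted_value,
                     offer_fraction=0.25):
    """Return (action, offer_amount) for one customer.

    offer_fraction is a hypothetical knob: the share of the customer's
    predicted value that may be spent on a retention offer.
    """
    if not predicted_to_churn:
        # Customers not predicted to churn need no attention.
        return ("no action", 0.0)
    if not want_to_keep:
        # Predicted to churn, but not a customer we want to keep.
        return ("no action", 0.0)
    # Predicted to churn and worth keeping: size the offer by predicted value.
    return ("offer", predicted_value * offer_fraction)


print(retention_action(True, True, 2000.0))
print(retention_action(False, True, 2000.0))
print(retention_action(True, False, 50.0))
```

Only the first call produces an offer; the other two customers are left alone, which is where the savings over "incentives for everyone" come from.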
Data Mining Definition

The automatic discovery of relationships in
typically large databases and, in some instances, the
use of the discovered results to predict
relationships.

An essential process where intelligent methods are
applied in order to extract data patterns.

Data mining lets you be proactive
– Prospective rather than Retrospective
Why Mine Data?
Commercial Viewpoint…

Lots of data is being collected and
warehoused.
Computing has become affordable.
Competitive pressure is strong
– Provide better, customized services for an edge.
– Information is becoming a product in its own
right.
Why Mine Data?
Scientific Viewpoint…

Data collected and stored at enormous speeds
– Remote sensor on a satellite
– Telescope scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data
Data mining for data reduction
– Cataloging, classifying, segmenting data
– Helps scientists in Hypothesis Formation
Major Data Mining Tasks

Classification: Predicting an item class
Association Rule Discovery: descriptive
Clustering: descriptive, finding groups of items
Sequential Pattern Discovery: descriptive
Deviation Detection: predictive, finding changes
Forecasting: predicting a parameter value
Description: describing a group
Link analysis: finding relationships and
associations
Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class.

Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to build the
model and the test set used to validate it.
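
As a minimal sketch of the train/test idea, the snippet below divides six illustrative records into a training set and a test set, builds a very simple model on the training set, and measures accuracy on both. The records, the split point, and the "majority class per Smoke value" model are assumptions chosen for illustration; they stand in for whatever classifier is actually used.

```python
from collections import Counter, defaultdict

# Illustrative records: (age, smoke, risk); "risk" is the class attribute.
records = [
    (20, "No", "Low"),
    (25, "Yes", "High"),
    (44, "Yes", "High"),
    (18, "No", "Low"),
    (55, "No", "High"),
    (35, "No", "Low"),
]

# Divide the data set: first four records for training, last two for testing.
train, test = records[:4], records[4:]


def fit_majority_by_smoke(rows):
    """Learn, for each value of 'smoke', the most common class in rows."""
    counts = defaultdict(Counter)
    for _age, smoke, risk in rows:
        counts[smoke][risk] += 1
    return {smoke: c.most_common(1)[0][0] for smoke, c in counts.items()}


def accuracy(model, rows):
    """Fraction of rows whose class the model predicts correctly."""
    correct = sum(1 for _age, smoke, risk in rows if model.get(smoke) == risk)
    return correct / len(rows)


model = fit_majority_by_smoke(train)
print(model)
print(accuracy(model, train), accuracy(model, test))
```

On this data the model is perfect on the training set but gets one of the two test records wrong, which is why accuracy is measured on records that were held out from model building.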
Classification: Application

Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
customers likely to buy a new cell-phone product.
– Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier
model.
Classification (Cont’d)

A sample table

Age   Smoke   Risk
20    No      Low
25    Yes     High
44    Yes     High
18    No      Low
55    No      High
35    No      Low

To identify the risk of a group of insurance applicants.
The classes here are:
Risk = Low
Risk = High
Classification (Cont’d)

The following techniques could be used:
– Decision tree
– Naïve Bayesian classifiers
– Association rules
– Neural networks
– etc.
Decision Tree

A widely used technique for classification.
Each leaf node of the tree has an associated class.
Each internal node has a predicate (or more
generally, a function) associated with it.
To classify a new instance, we start at the root and
traverse the tree to reach a leaf; at an internal node
we evaluate the predicate (or function) on the data
instance to find which child to go to.
The result is equivalent to a series of nested if/then rules.
Decision Tree

Insurance Risk

Smoke?
– Yes: Risk = High
– No: Age?
  – 0-35: Risk = Low
  – 36-100: Risk = High

Age   Smoke   Risk
20    No      Low
25    Yes     High
44    Yes     High
18    No      Low
55    No      High
35    No      Low
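
The insurance-risk tree can be written directly as nested if/then rules. The sketch below assumes the tree splits on Smoke at the root and on Age (0-35 vs. 36-100) under the No branch, which is the reading consistent with the six sample records; the function name is hypothetical.

```python
def classify_risk(age, smoke):
    """Walk the tree from the root to a leaf and return the leaf's class."""
    if smoke == "Yes":      # root predicate: does the applicant smoke?
        return "High"
    else:
        if age <= 35:       # internal node: age in 0-35
            return "Low"
        else:               # age in 36-100
            return "High"


# The six records from the sample table: (age, smoke, risk).
sample = [
    (20, "No", "Low"),
    (25, "Yes", "High"),
    (44, "Yes", "High"),
    (18, "No", "Low"),
    (55, "No", "High"),
    (35, "No", "Low"),
]

for age, smoke, risk in sample:
    print(age, smoke, classify_risk(age, smoke), "expected", risk)
```

Every record in the table reaches a leaf whose class matches its Risk value, so this tree classifies the sample data without error.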
Benefits of Decision Tree

Understandable
Relatively fast
Easy to translate to SQL queries
Associations

I = {i1, i2, …, im}: a set of literals, called
items.
Transaction d: a set of items such that d ⊆ I
Database D: a set of transactions
A transaction d contains X, a set of some
items in I, if X ⊆ d.
An association rule is an implication of the
form X ⇒ Y, where X, Y ⊆ I.

Association Rule

Used to find all rules in basket data
Basket data is also called transaction data
Analyze how items purchased by customers in a
shop are related
Discover all rules that have:
– support greater than minsup specified by the user
– confidence greater than minconf specified by the user

Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
Association Rule

Let I = {i1, i2, …, im} be the total set of items
D is a set of transactions
d is one transaction consisting of a set of items
– d ⊆ I

Association rule:
– X ⇒ Y where X ⊆ I, Y ⊆ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) /
(# of transactions containing X)
Association Rule

Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player

I = {CD player, music CD, music book}
|D| = 4
# of transactions containing both CD player and music CD = 2
# of transactions containing CD player = 3
CD player ⇒ music CD (sup = 2/4, conf = 2/3)
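
The support and confidence figures in this example can be checked with a few lines of Python. This is a sketch, with helper names chosen for illustration; transactions are modeled as sets, and `Fraction` is used so the results print as exact ratios.

```python
from fractions import Fraction

# The four example transactions, each modeled as a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Share of all transactions that contain both X and Y."""
    return Fraction(count_containing(x | y, transactions), len(transactions))


def confidence(x, y, transactions):
    """Among transactions containing X, the share that also contain Y."""
    return Fraction(count_containing(x | y, transactions),
                    count_containing(x, transactions))


x, y = {"CD player"}, {"music CD"}
print(support(x, y, transactions))     # 1/2, i.e. 2/4
print(confidence(x, y, transactions))  # 2/3
```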
Association Rule

How are association rules mined from large
databases?
Two-step process:
– find all frequent item sets
– generate strong association rules from frequent
item sets
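
The two steps can be sketched on the four example transactions as follows. This is a brute-force illustration that tries every candidate item set, which only works for tiny data; it is not the algorithm a real miner would use on a large database. The thresholds `MINSUP` and `MINCONF` are hypothetical values picked for the example.

```python
from itertools import combinations

transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]
MINSUP = 0.4   # hypothetical user-specified minimum support
MINCONF = 0.6  # hypothetical user-specified minimum confidence


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Step 1: find all frequent item sets (support greater than MINSUP).
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(frozenset(candidate))
        if s > MINSUP:
            frequent[frozenset(candidate)] = s

# Step 2: generate strong rules X => Y from each frequent item set.
rules = []
for itemset, sup in frequent.items():
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            x = frozenset(lhs)
            y = itemset - x
            # x is a subset of a frequent set, so it is frequent too.
            conf = sup / frequent[x]
            if conf > MINCONF:
                rules.append((sorted(x), sorted(y), sup, conf))

for x, y, sup, conf in rules:
    print(x, "=>", y, "sup=%.2f conf=%.2f" % (sup, conf))
```

Step 2 never rescans the transactions: each rule's confidence is computed from supports already stored in step 1, which is why the frequent item sets are found first.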
Classification vs. Association

Classification
– mines a small set of rules existing in the data to form
a classifier or predictor
– has a target attribute
– datasets are in the form of a relational table

Association
– datasets are transaction data
– has no fixed target
– the target can be fixed, so association rules can also be used for classification
– A=a, B=b ⇒ Class = yes
– A=c ⇒ Class = no
Clustering Definition

Given a set of data points, each having a set
of attributes, and a similarity measure
among them, find clusters such that
– Data points in one cluster are more similar to
one another.
– Data points in separate clusters are less similar
to one another.
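
The definition does not name an algorithm. As one possible illustration, the sketch below uses k-means with Euclidean distance as the (dis)similarity measure; the technique, the points, and the starting centroids are all assumptions made for the example. It then compares the average distance within clusters to the average distance between clusters.

```python
import math
from itertools import combinations

# Hypothetical 2-D data points (two attributes per point).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return clusters


def average_distance(pairs):
    """Average Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


clusters = k_means(points, centroids=[points[0], points[3]])
within = average_distance(
    (p, q) for c in clusters for p, q in combinations(c, 2))
between = average_distance(
    (p, q) for p in clusters[0] for q in clusters[1])
print(clusters)
print(within, between)
```

A good clustering in the sense of the definition shows a small within-cluster distance relative to the between-cluster distance, as it does for these two well-separated groups.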
Clustering Application

Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a distinct
marketing mix.

Approach:
– Collect different attributes of customers based on their
geographical and lifestyle related information
– Find clusters of similar customers.
– Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
References

Professor Lee’s lectures
– http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html

Website
– http://www.thearling.com/dmintro/dmintro.pdf