Transcript Data Mining
Data Mining
Adrian Tuhtan
004757481
CS157A
Section 1
Overview
Introduction
Explanation of Data Mining Techniques
Advantages
Applications
Privacy
Data Mining
What is Data Mining?
“The process of semi-automatically analyzing large
databases to find useful patterns” (Silberschatz)
KDD – “Knowledge Discovery in Databases” (3)
“Attempts to discover rules and patterns from data”
Discover Rules → Make Predictions
Areas of Use
Internet – Discover needs of customers
Economics – Predict stock prices
Science – Predict environmental change
Medicine – Match patients with similar problems → cure
Example of Data Mining
A credit card company wants to discover information about
its clients from its databases. It wants to find:
Clients who respond to promotions in “Junk Mail”
Clients who are likely to switch to a competitor
Clients who are likely not to pay
Services that clients use, so it can promote services affiliated
with the credit card company
Anything else that may help the company provide or promote
services that help its clients and ultimately make more
money.
Data Mining & Data Warehousing
Data Warehouse: “is a repository (or archive) of
information gathered from multiple sources, stored under
a unified schema, at a single site.” (Silberschatz)
Collect data → Store in single repository
Allows for easier query development as a single repository
can be queried.
Data Mining:
Analyzing databases or Data Warehouses to discover
patterns about the data to gain knowledge.
Knowledge is power.
Discovery of Knowledge
Data Mining Techniques
Classification
Clustering
Regression
Association Rules
Classification
Classification: Given a set of items that belong to several classes,
and given past instances (training instances) with their
associated classes, classification is the process of predicting the
class of a new item.
The goal is therefore to identify the class to which the new item
belongs.
Example: A bank wants to classify its Home Loan Customers
into groups according to their response to bank advertisements.
The bank might use the classifications “Responds Rarely,
Responds Sometimes, Responds Frequently”.
The bank will then attempt to find rules about the customers
that respond Frequently and Sometimes.
The rules could be used to predict needs of potential customers.
Technique for Classification
Decision-Tree Classifiers
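[Figure: decision tree. The root node tests Job, with branches for
Engineer, Carpenter and Doctor. Each branch then tests Income and
ends in a Good or Bad leaf. The income tests shown are
<30K Bad / >50K Good, <40K Bad / >90K Good and <50K Bad / >100K Good.]
Predicting credit risk of a person with the jobs specified.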
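The decision tree on this slide can be read as nested tests: first on job, then on income. The sketch below is only an illustration. The pairing of each income threshold with a particular job is an assumption, since the slide layout does not make it certain, and the function name credit_risk and the "Unknown" result for incomes between the two thresholds are hypothetical additions.

```python
# Hypothetical pairing of jobs with (bad_below, good_above) income thresholds.
# The threshold values come from the slide; which job they belong to is assumed.
RULES = {
    "Engineer": (30_000, 50_000),
    "Carpenter": (40_000, 90_000),
    "Doctor": (50_000, 100_000),
}


def credit_risk(job, income):
    """Walk the tree: split on job first, then on income."""
    if job not in RULES:
        return "Unknown"  # the tree has no branch for this job
    bad_below, good_above = RULES[job]
    if income < bad_below:
        return "Bad"
    if income > good_above:
        return "Good"
    return "Unknown"  # the slide gives no label for the middle band


print(credit_risk("Engineer", 60_000))
print(credit_risk("Carpenter", 35_000))
```

In practice a decision-tree classifier learns such tests from the training instances rather than having them written by hand.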
Clustering
“Clustering algorithms find groups of items that are
similar. … It divides a data set so that records with
similar content are in the same group, and groups are
as different as possible from each other. ” (2)
Example: Insurance company could use clustering to
group clients by their age, location and types of
insurance purchased.
The categories are unspecified and this is referred to
as ‘unsupervised learning’
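As a rough illustration of the insurance example, the sketch below groups clients by age alone. It uses a simple agglomerative (bottom-up hierarchical) approach with single-linkage distance, which is one possible choice and not a method the slides prescribe. The function name cluster_ages and the sample ages are made up for the example.

```python
def cluster_ages(ages, num_clusters):
    """Agglomerative clustering on one dimension.

    Start with one cluster per client, then repeatedly merge the two
    closest clusters until only num_clusters remain.
    """
    clusters = [[age] for age in sorted(ages)]
    while len(clusters) > num_clusters:
        # Clusters are sorted, so the closest pair is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


ages = [22, 25, 27, 41, 44, 68, 70]  # hypothetical client ages
print(cluster_ages(ages, 3))
```

No class labels are supplied. The groups emerge from the data, which is what makes this unsupervised.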
Clustering
Group Data into Clusters
Similar data is grouped in the same cluster
Dissimilar data is grouped in different clusters
How is this achieved ?
K-Nearest Neighbor
A classification method that classifies a point by
calculating the distances between the point and points in
the training data set. Then it assigns the point to the
class that is most common among its k-nearest
neighbors (where k is an integer).(2)
Hierarchical
Group data into a tree (hierarchy) of clusters
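The k-nearest-neighbor description above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote. The function name knn_classify and the training points (age, income in thousands) are invented for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify point by majority vote among its k nearest training points.

    training is a list of (features, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (age, income in thousands) -> credit risk
training = [
    ((25, 20), "Bad"), ((30, 25), "Bad"), ((35, 30), "Bad"),
    ((45, 80), "Good"), ((50, 95), "Good"), ((55, 110), "Good"),
]
print(knn_classify((48, 90), training, k=3))
```

Because the method needs labeled training data, it is a classification technique, as the definition above says.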
Regression
“Regression deals with the prediction of a value, rather
than a class.” (1, P747)
Example: Find out if there is a relationship between
smoking and cancer-related illness in patients.
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, ..., an such that
Y = a0 + a1X1 + a2X2 + … + anXn
Linear Regression
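For the simplest case of a single input X1, the coefficients a0 and a1 can be estimated with ordinary least squares. The sketch below assumes that method. The function name fit_line and the sample data are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares estimate of a0 and a1 in Y = a0 + a1*X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of X and Y divided by variance of X gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


xs = [1, 2, 3, 4, 5]   # made-up inputs
ys = [3, 5, 7, 9, 11]  # made-up observations, exactly Y = 1 + 2*X1
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

With several inputs X1 ... Xn the same idea applies, but the coefficients are usually found with a linear-algebra routine instead of this closed form.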
Regression
Example graph: [Figure: line of best fit through data points (curve fitting)]
Association Rules
“An association algorithm creates rules that describe how
often events have occurred together.” (2)
Example: When a customer buys a hammer, then 90%
of the time they will buy nails.
Association Rules
Support: “is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule”(1, p748)
Example:
Hotdog buns and hotdog sausages are bought together in a
large fraction of all purchases. = High support
Hotdog buns and hangers are bought together in 0.005% of
all purchases. = Low support
Rules with high support are worth careful attention
E.g. Hotdog sausages should be placed near hotdog buns
in supermarkets if there is also high confidence.
Association Rules
Confidence: “is a measure of how often the consequent is
true when the antecedent is true.” (1, p748)
Example:
90% of Hotdog bun purchases are accompanied by hotdog
sausages.
High confidence is meaningful because it lets us derive rules.
Hotdog bun ⇒ Hotdog sausage
Two rules may have different confidence levels yet have
the same support.
E.g. Hotdog sausage ⇒ Hotdog bun may have a
much lower confidence than Hotdog bun ⇒ Hotdog
sausage, yet both have the same support.
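Support and confidence can be computed directly from a list of purchases. The sketch below follows the two quoted definitions. The function names and the five sample transactions are assumptions made for illustration.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent holds."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical purchases, one set of items per transaction
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage", "ketchup"},
    {"sausage"},
    {"sausage", "ketchup"},
    {"bun"},
]

print(support(transactions, {"bun", "sausage"}))
print(confidence(transactions, {"bun"}, {"sausage"}))
print(confidence(transactions, {"sausage"}, {"bun"}))
```

Both rules share one support value because support counts transactions containing both items, while the two confidences differ because each divides by a different antecedent count.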
Advantages of Data Mining
Provides new knowledge from existing data
Public databases
Government sources
Company Databases
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
Bigger profits
More efficient service
Uses of Data Mining
Sales/Marketing
Diversify target market
Identify clients' needs to increase response rates
Risk Assessment
Identify customers that pose high credit risk
Fraud Detection
Identify people misusing the system. E.g. People who have
two Social Security Numbers
Customer Care
Identify customers likely to change providers
Identify customer needs
Applications of Data Mining
[Figure: chart of data mining applications. Source: IDC 1998 (4)]
Privacy Concerns
Effective Data Mining requires large sources of data
To achieve a wide spectrum of data, link multiple data
sources
Linking sources can be problematic for privacy.
For example, suppose the following histories of a customer were
linked:
Shopping History
Credit History
Bank History
Employment History
The user's life story could then be pieced together from the
collected data
References
1. Silberschatz, Korth, Sudarshan, “Database System Concepts”, 5th Edition, McGraw-Hill, 2005
2. http://www.twocrows.com/glossary.htm, “Two Crows, Data Mining Glossary”
3. http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf