Data Mining Outline
Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Algorithm, Visualization, and Other Disciplines)
Data Mining Outline
Introduction
Classification
Clustering
Association Rules
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a database
Fit data to a model: descriptive or predictive
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
But it isn’t Magic
You must know what you are looking for
You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description
Behavior
Classification Clustering
Associations
Link Analysis
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
KDD Process
© Prentice Hall
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format.
Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results
to user in meaningful manner.
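The five KDD steps above can be read as a pipeline in which each stage feeds the next. The sketch below is only an illustration under stated assumptions: the records, the function names, and the "high spender" threshold are all hypothetical, and the mining step is a trivial stand-in for a real algorithm.

```python
# Hypothetical raw records from two sources with different formats.
source_a = [{"name": "Ann", "spend": "120.50"}, {"name": "Bob", "spend": None}]
source_b = [("Cid", 87.0), ("Dee", 310.25)]

def select(a, b):
    """Selection: obtain data from the various sources."""
    return list(a) + list(b)

def preprocess(records):
    """Preprocessing: cleanse data by dropping records with missing values."""
    cleaned = []
    for r in records:
        values = r.values() if isinstance(r, dict) else r
        if all(v is not None for v in values):
            cleaned.append(r)
    return cleaned

def transform(records):
    """Transformation: convert every record to a common (name, float) format."""
    out = []
    for r in records:
        name, spend = (r["name"], r["spend"]) if isinstance(r, dict) else r
        out.append((name, float(spend)))
    return out

def mine(records, threshold=100.0):
    """Data mining: a stand-in step that flags spenders above a threshold."""
    return [name for name, spend in records if spend > threshold]

def interpret(names):
    """Interpretation/evaluation: present the result in a readable form."""
    return "High spenders: " + ", ".join(names)

report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
print(report)
```

Keeping each step as its own function mirrors the slide: any one stage can be swapped out (for example, a real classifier in place of `mine`) without touching the others.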
Data Mining Outline
Introduction
Classification – Assign data to a predefined class
– Decision Trees
– Neural Networks
– Distance Based
Clustering
Association Rules
The classification problem can now be expressed as:
Given a training database, predict the class label of a previously unseen instance.
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
Previously unseen instance = 11
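One way to label the unseen instance is a distance-based classifier, which the outline lists as a classification approach. The sketch below uses k-nearest neighbors with Euclidean distance on the ten labeled insects; the choice of k = 3 and the function name `knn_predict` are assumptions made for illustration, not something the slides specify.

```python
from collections import Counter
from math import dist

# (abdomen length, antennae length, class) for insects 1 to 10.
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]

def knn_predict(abdomen, antennae, k=3):
    """Return the majority class among the k nearest training insects."""
    query = (abdomen, antennae)
    neighbors = sorted(TRAINING, key=lambda row: dist(query, row[:2]))[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0]

prediction = knn_predict(5.1, 7.0)  # insect 11
print(prediction)  # Katydid
```

With k = 3 the nearest neighbors of insect 11 are insects 7, 5, and 1, so two of the three votes go to Katydid.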
Classification Process (1): Model Construction
Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Classification Algorithms
Classifier (Model)
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
Classifier
Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
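The two-step process can be traced in a few lines: encode the learned rule, score it on the testing data, then apply it to the unseen tuple. This is a minimal sketch; the function name `predict_tenured` and the variable names are illustrative choices, and the rule is taken verbatim from the slide rather than learned by the code.

```python
def predict_tenured(rank, years):
    """The classifier from the slide: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label).
TESTING = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    predict_tenured(rank, years) == actual for _, rank, years, actual in TESTING
)
accuracy = correct / len(TESTING)
print(accuracy)  # 0.75

# Unseen data: (Jeff, Professor, 4)
jeff = predict_tenured("Professor", 4)
print(jeff)  # yes
```

The rule misclassifies Merlisa (7 years, but not tenured), which is why accuracy is measured on testing data rather than on the tuples the rule was built from.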
Training Dataset
This follows an example from Quinlan’s ID3
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
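ID3 chooses the attribute to split on by information gain, the reduction in entropy of the class label. The sketch below computes the gain of each attribute on the 14 training tuples; the helper names `entropy` and `information_gain` are illustrative, and this shows only the root-split selection, not a full tree builder.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit_rating")
# Each row: age, income, student, credit_rating, buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the class label minus the weighted entropy after splitting on col."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
print(best)  # age
```

Age gives the largest gain (about 0.247 bits), which is consistent with a tree whose root tests age.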
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
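A decision tree of this kind is equivalent to nested conditionals, which makes it easy to check by hand. The following sketch encodes the "buys_computer" tree as a function (the function name and the string encodings of the attribute values are assumptions for illustration) and spot-checks it against a few tuples from the training dataset.

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree: test age first, then student or credit rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # Remaining branch: age > 40, decided by credit rating.
    return "yes" if credit_rating == "fair" else "no"

# A few training tuples: (age, student, credit_rating, buys_computer).
SAMPLES = [
    ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "excellent", "yes"),
    ("31…40", "no", "excellent", "yes"),
    (">40", "yes", "excellent", "no"),
    (">40", "no", "fair", "yes"),
]

matches = [buys_computer(a, s, c) == label for a, s, c, label in SAMPLES]
print(all(matches))  # True
```

Income never appears in the function because the tree does not test it.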
Neural Network Example
(Figure: a neural network with the tuple as input and the class as output)
Data Mining Outline
Introduction
Classification
Clustering – Place data into groups
– Hierarchical
– K-Means
– Partitional
Association Rules
Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
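K-Means, listed in the outline, illustrates clustering without class labels: it alternates between assigning points to the nearest centroid and moving each centroid to the mean of its cluster. The sketch below is a minimal version under stated assumptions: the six 2-D points are hypothetical, the starting centroids are picked by hand, and the number of clusters still has to be supplied by the caller.

```python
from math import dist

def k_means(points, centroids, max_iter=100):
    """Plain k-means: repeat assignment and centroid update until stable."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(POINTS, centroids=[POINTS[0], POINTS[3]])
print([len(c) for c in clusters])  # [3, 3]
```

The algorithm returns groups but says nothing about what they mean; interpreting the clusters is left to the analyst.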
Data Mining Outline
Introduction
Classification
Clustering
Association Rules – Find relationships between data
– Apriori
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
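Support is the fraction of transactions that contain every item of an itemset. The transaction table behind the 60% figure does not appear in this transcript, so the five transactions below are an assumption, chosen so that the support of {Bread, PeanutButter} comes out to the stated 60%. The `confidence` helper is likewise an illustrative addition.

```python
# Assumed transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in TRANSACTIONS)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"Bread", "PeanutButter"}))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75
```

Under these assumed transactions, the rule Bread ⇒ PeanutButter holds in 3 of the 4 transactions that contain Bread.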
Association Rules Example (cont’d)
AR & Market Baskets
Determine items often purchased together (market basket data)
Determine optimal placement of items on the store floor
Determine items for sales and/or specials
Increase sales of items
www.amazon.com
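Finding items that are often purchased together is the job of Apriori, named in the outline. The sketch below is a compact level-wise version under stated assumptions: the baskets and the 50% minimum support are hypothetical, and the code only finds frequent itemsets, without going on to generate rules.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep the size-k candidates that meet the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: build size k+1 candidates from the frequent size-k sets.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every size k-1 subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical market baskets.
BASKETS = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
]

frequent = apriori(BASKETS, min_support=0.5)
pairs = sorted(sorted(s) for s in frequent if len(s) == 2)
print(pairs)  # [['bread', 'butter'], ['bread', 'milk']]
```

The prune step is what keeps the search manageable: {milk, bread, butter} is never counted because its subset {milk, butter} is already known to be infrequent.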
Summary
Data Mining is a fast-growing area with many applications.
Data Mining algorithms are usually computationally expensive.
Data Mining tools may be difficult to use effectively.