Data Mining Outline


Data Mining: Confluence of Multiple Disciplines
• Database Systems
• Machine Learning
• Algorithm
• Statistics
• Visualization
• Other Disciplines
Data Mining Outline
Introduction
Classification
Clustering
Association Rules
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated
information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a
database
Fit data to a model: descriptive or
predictive
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
But it isn’t Magic
You must know what you are
looking for
You must know how to look for it
Suppose you knew that a specific cave had
gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description
Behavior
Classification Clustering
Associations
Link Analysis
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
KDD Process
© Prentice Hall
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results
to user in meaningful manner.
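The five KDD steps can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources, the record layout, and the use of a per-customer total as a stand-in for the mining step are all invented for the example and do not come from the slides.

```python
# Hypothetical raw data from two sources; field names and values are invented.
SOURCE_A = [{"customer": "c1", "amount": "12.50"}, {"customer": "c2", "amount": None}]
SOURCE_B = [{"customer": "c1", "amount": "7.50"}, {"customer": "c3", "amount": "3.00"}]


def selection(*sources):
    """Selection: obtain data from various sources."""
    return [record for source in sources for record in source]


def preprocessing(records):
    """Preprocessing: cleanse data (here, drop records with a missing amount)."""
    return [r for r in records if r["amount"] is not None]


def transformation(records):
    """Transformation: convert to a common format (amount parsed as a float)."""
    return [(r["customer"], float(r["amount"])) for r in records]


def data_mining(rows):
    """Data mining stand-in: total spend per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def interpretation(totals):
    """Interpretation/evaluation: present results in a readable form."""
    return [f"{customer}: {total:.2f}" for customer, total in sorted(totals.items())]


report = interpretation(
    data_mining(transformation(preprocessing(selection(SOURCE_A, SOURCE_B))))
)
print("\n".join(report))
```

Each function consumes the previous step's output, which mirrors the left-to-right flow of the KDD process; a real mining step would replace the simple total.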
Data Mining Outline
Introduction
Classification – Assign data to a
predefined class
– Decision Trees
– Neural Networks
– Distance Based
Clustering
Association Rules
The classification problem can now be expressed as:
Given a training database, predict the class label of a
previously unseen instance.
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
previously unseen instance = 11
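One way to label the unseen insect is a distance-based classifier, one of the approaches listed in the outline. The sketch below uses k-nearest neighbours with Euclidean distance. The choice of k = 3 and the function name `knn_predict` are assumptions, and the pairing of measurements to insects assumes the table values are read column by column (all abdomen lengths, then all antennae lengths).

```python
import math
from collections import Counter

# (abdomen length, antennae length, class) for insects 1-10,
# assuming the flattened table is read column-wise.
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_predict(abdomen, antennae, k=3):
    """Return the majority class among the k nearest training insects."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.dist((abdomen, antennae), (row[0], row[1])),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


prediction = knn_predict(5.1, 7.0)  # insect 11, the previously unseen instance
print(prediction)
```

Under these assumptions the three nearest neighbours of insect 11 are insects 7, 5 and 1, so the sketch labels it a Katydid.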
Classification Process (1):
Model Construction
Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Classification Algorithms
Classifier (Model)
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
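The learned rule can be written as a small function and checked against the training data. This is a sketch: the function name `classify` is hypothetical, and returning 'no' when the rule does not fire is an assumption, since the slide states only the 'yes' case.

```python
# Training data from the slide: (name, rank, years, tenured).
TRAINING = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]


def classify(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes' (else 'no', assumed)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Names of training tuples the rule gets wrong.
misclassified = [
    name for name, rank, years, tenured in TRAINING
    if classify(rank, years) != tenured
]
print(misclassified)
```

The list comes out empty: the rule reproduces every label in the training data, which is what model construction aims for.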
Classification Process (2): Use
the Model in Prediction
Classifier
Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
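Applying the model to the testing data gives an accuracy estimate before it is used on unseen tuples. The sketch below restates the rule from model construction (with an assumed 'no' default) and scores it; the names `classify`, `accuracy` and `jeff` are illustrative.

```python
# Testing data from the slide: (name, rank, years, tenured).
TESTING = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """Rule from model construction; the 'no' default is an assumption."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


correct = sum(
    classify(rank, years) == tenured for _, rank, years, tenured in TESTING
)
accuracy = correct / len(TESTING)

# Unseen tuple from the slide: (Jeff, Professor, 4).
jeff = classify("Professor", 4)
print(f"accuracy = {accuracy:.2f}, Jeff tenured? {jeff}")
```

The rule misclassifies Merlisa (7 years but not tenured), so it scores 3 of 4 on the testing data, and it predicts 'yes' for Jeff because his rank is Professor.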
Training Dataset
This follows an example from Quinlan’s ID3.
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
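ID3 chooses the attribute to split on by information gain, the reduction in entropy of the class label. The sketch below computes the gain of each attribute on the training dataset. The helper names `entropy` and `information_gain` are hypothetical, and the rows assume the flattened columns line up in order.

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Rows: (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class label minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

On this data age has the largest gain (about 0.25 bits), so it is the attribute tested at the root of the tree.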
Output: A Decision Tree for
“buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
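A decision tree is equivalent to nested if-tests, one per internal node. The sketch below encodes the tree for buys_computer; the function signature is hypothetical, and the leaf labels follow a reading of the slide's layout that is consistent with the training dataset.

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree; income is never tested by this tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # Remaining branch: age > 40
    return "yes" if credit_rating == "fair" else "no"


# A few rows from the training dataset: (age, student, credit_rating, label).
SAMPLE = [
    ("<=30", "no", "fair", "no"),
    ("31…40", "no", "fair", "yes"),
    (">40", "yes", "excellent", "no"),
    ("<=30", "yes", "fair", "yes"),
]
all_match = all(buys_computer(a, s, c) == label for a, s, c, label in SAMPLE)
print(all_match)
```

Each root-to-leaf path is also an IF-THEN rule, for example: IF age <= 30 AND student = yes THEN buys_computer = yes.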
Neural Network Example
[Figure: neural network with tuple input and output]
Data Mining Outline
Introduction
Classification
Clustering – Place data into groups
– Hierarchical
– K-Means
– Partitional
Association Rules
Clustering Examples
Segment customer database based on
similar buying patterns.
Group houses in a town into
neighborhoods based on similar
features.
Identify new plant species
Identify similar Web usage patterns
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
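K-Means, named in the outline, shows what unsupervised grouping looks like in practice: no class labels are given, and the algorithm only needs a number of clusters and starting centroids. The sketch below is a plain-Python version; the points are hypothetical, and seeding with the first two points is an assumption made to keep the run deterministic.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # New centroid = mean of the cluster; an empty cluster keeps its old one.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters


# Hypothetical, unlabeled 2-D points; the first two serve as initial centroids.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (0.5, 1.5), (8.5, 9.0)]
centroids, clusters = k_means(POINTS, centroids=POINTS[:2])
print(centroids)
print(clusters)
```

The algorithm returns groups but not their meaning: deciding what each cluster represents, and whether two clusters was the right number, is left to the analyst.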
Data Mining Outline
Introduction
Classification
Clustering
Association Rules – Find
relationships between data
– Apriori
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
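The support of an itemset is the fraction of transactions that contain all of its items. The transaction table itself is not in this transcript, so the sketch below uses five hypothetical transactions over I, chosen so that {Bread, PeanutButter} appears in three of them; the function name `support` is also illustrative.

```python
# Hypothetical transactions over I; chosen for illustration, not from the slide.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


s = support({"Bread", "PeanutButter"}, TRANSACTIONS)
print(f"{s:.0%}")
```

With these transactions the pair occurs in 3 of 5 baskets, which gives the 60% quoted on the slide.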
Association Rules Ex
(cont’d)
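The outline names Apriori as the algorithm for finding frequent itemsets. A minimal sketch of its level-wise search follows: frequent k-itemsets are built only from frequent (k-1)-itemsets, because any superset of an infrequent itemset is itself infrequent. The transactions and the 40% minimum support are assumptions for illustration, not values from the slides.

```python
from itertools import combinations

# Hypothetical transactions; not taken from the slide.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must already be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


frequent = apriori(TRANSACTIONS, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```

Jelly falls below the threshold at the first level, so no candidate containing it is ever counted; that pruning is what keeps the search smaller than enumerating every subset of I.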
AR & Market Baskets
Determine items often purchased
together (Marketbasket Data)
Determine optimal placement of items on
store floor
Determine items for sales and/or
specials
Increase sales of items
www.amazon.com
Summary
Data Mining is a fast growing area with
many applications.
Data Mining algorithms are usually
computationally expensive.
Data Mining tools may be difficult to use
effectively.