Transcript Document

Advanced Database Topics
Data Mining
Copyright © Ellis Cohen 2002-2005
These slides are licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them,
please see http://www.openlineconsult.com/db
1
Topics
Types of Data Mining
Data Mining Activities
Datasets
Estimation: Classification & Approximation
Decision Trees
Approximation (Regression)
Raw Data Modeling with k-Nearest Neighbor
Opaque Modeling with Feed-Forward Neural Networks
Clustering
Partition-Driven Clustering
Density-Driven Clustering
Subspace Clustering
Targeted Clustering & Association Rules
Market Basket Analysis
Scoring Market Basket Analysis
Scoring & Data Preparation
Copyright © Ellis Cohen 2002-2005
2
Goals of Data Mining
Find patterns and relationships
among objects represented by
data points in a dataset
Novel / Significant
People who buy notebooks buy pencils vs.
Men who buy diapers at night also buy beer
Understandable / Useful
Tradeoff between accuracy and simplicity
Causality vs Chance
All data has random variations, which can show
up as spurious patterns and relationships
Copyright © Ellis Cohen 2002-2005
3
Types of
Data Mining
Copyright © Ellis Cohen 2002-2005
4
Types of Data Mining Covered
Estimation
(Classification & Approximation)
Classify/categorize objects, or
approximate some value associated
with objects based on their features
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Market Basket Analysis
Find collections of items which frequently
occur together (in the same basket) &
formulate the cause
Copyright © Ellis Cohen 2002-2005
5
Objects & Features
custid   age   income   …
3043     23    5000     …
3174     44    6190     …
…        …     …        …

Each row is an Object: a Customer (represented by a Data Point)
The columns are Features (also called Variables)
The entire collection of data is called a dataset
Copyright © Ellis Cohen 2002-2005
6
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Scatter plot: Monthly Income vs Age; each point is a customer marked x, o, or #. Draw regions in which one category predominates – an x region, an o region, and a # region. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Copyright © Ellis Cohen 2002-2005
7
Approximation
Approximate some (continuous) target
value of an object based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarCost -- amt spent by customer on a car
Objects: customers
Find a function
f(income,age) that
gives a good estimate
of the actual value
[Scatter plot: Monthly Income vs Age; each point 'o' is a customer.]
When a new customer comes along, you can
apply f to their income & age to estimate how
much they're likely to spend on a car
Copyright © Ellis Cohen 2002-2005
8
Applications of Estimation
Sales
Estimate how much a customer is likely to spend
Determine amount of effort/money to put into
capturing a potential customer
Credit
Decide whether to approve a credit application
Determine someone's credit limit
Evaluate whether a credit card is stolen
Medicine
Determine whether a patient has a disease based
on symptoms and test results
Assess which protocol should be used to treat an
illness
Copyright © Ellis Cohen 2002-2005
9
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Objects:
Corvette
Buyers
[Scatter plot: features Monthly Income vs Age; a tight group of 'x' points. Identify the cluster, e.g. by center & radius.]
Copyright © Ellis Cohen 2002-2005
10
Applications of Clustering
Marketing
Target advertising for each cluster;
use media whose demographics match cluster
Astronomy
Find galactic clusters;
explore large scale structure of the universe
Epidemiology
Find things in common among people with the
same illness (esp location)
Government Policy & Planning
Identify regions with similar features (e.g.
economic, industrial, geographic) for policies
and planning (e.g. land use, economic
assistance, office locations, bus routes)
Copyright © Ellis Cohen 2002-2005
11
Market Basket Analysis
Organize dataset into baskets.
Find groups of items which frequently
occur together in baskets
Date        Customer  Item
11-Feb-99   Joe       Diapers
11-Feb-99   Joe       Formula
11-Feb-99   Joe       Beer
…
11-Feb-99   Simha     Pretzels
11-Feb-99   Simha     Beer
…
13-Feb-99   Sasha     Diapers
13-Feb-99   Sasha     Beer

Basket: Daily Shopping by a Customer
Item: Product purchased
Diapers and beer occur together frequently. Rules capture causality:
Beer ⇒ Diapers? NO! People who buy beer are not more likely to buy diapers
Diapers ⇒ Beer? YES! People who buy diapers are more likely to buy beer (esp men at night)
Copyright © Ellis Cohen 2002-2005
12
Applications of
Market Basket Analysis
Marketing
Baskets: Daily Shopping
Items: Products Purchased
Controlling Customer Traversal in Stores
Coupons
Recommendations (e.g. Amazon)
Semantic Nets
Baskets: Documents
Items: Words/Phrases
Use for Semantic Search
Plagiarism Detection
Baskets: Sentences
Items: Documents
Copyright © Ellis Cohen 2002-2005
13
Data Mining Approaches
Deterministic vs Heuristic approaches:
Clustering: K-Means, Agglomerative, CLIQUE
Estimation: Regression, Bayesian Networks, K-Nearest Neighbor, Decision Trees, Neural Nets, Genetic Algorithms
Mkt Basket Analysis: Apriori
Deterministic approaches produce the "best" possible model or prediction;
heuristic approaches produce a reasonably good model or prediction.
Copyright © Ellis Cohen 2002-2005
14
Data Mining
Activities
Copyright © Ellis Cohen 2002-2005
15
Data Mining Activities Diagram
[Diagram: Sample Data → Discovery/Modelling → Model. The Model drives an Analyzer over the sample data (Forensics → Anomalies) and a Detector/Predictor over Live Data (Detection, Prediction → Predictions).]
Copyright © Ellis Cohen 2002-2005
16
Data Mining Activities
Discovery/Modeling
Using an existing sample dataset to develop a model
which characterizes or describes the data
Forensics
Finding anomalous data points in an existing sample
dataset – those that do not match the discovered
model, and determining the cause (which may
involve another round of Discovery/Modeling)
Prediction
Using the discovered model to predict an unknown
feature value of a new live data point
Detection
Detect new live data points which are anomalous –
those that do not match the discovered model, and
determining the cause (more Discovery/Modeling)
Copyright © Ellis Cohen 2002-2005
17
Applications of
Modeling & Prediction
Clustering
Model to decide on a targeted advertising
program
Predict whether a web user is in a target group
for a product, and if so, show them the ad
prepared for that group.
Market Basket Analysis
Use the model to decide on store layout, sales &
promotions
Use Predictions to deliver Personalized Coupons
at checkout
Classification/Approximation
Model to decide marketing program for less-focused mass media advertising
Predict how an individual potential customer will
behave, and personalize sales approach to that
one customer
Copyright © Ellis Cohen 2002-2005
18
Data Mining Planning
Data, Mining Type & Activity Selection
What data do you want to mine and how do you
want to mine it?
Data Cleaning & Transformation
Does the data need to be prepared so the mining
will work correctly?
Evaluation
How will you be able to tell whether the results
are good?
Visualization & Scoring
Mining Type & Activity Details
What approach will you use to implement the
mining type & activity, and with what control
parameters?
Copyright © Ellis Cohen 2002-2005
19
Integrated DW/DM Process
[Diagram: Data Sources → ETL → Data Warehouse → More ETL → Data Mining Store. The Data Warehouse feeds OLAP & Visualization; the Data Mining Store feeds Data Mining, which may be • Episodic • Strategic • Continuous]
Copyright © Ellis Cohen 2002-2005
20
Datasets
Copyright © Ellis Cohen 2002-2005
21
Dataset
A set of data points
(that represent objects)
each of which has multiple variables
(that represent the object's features)
From a Data Warehouse Perspective
Dataset
Derived from some selected subset of a fact table
• Either an original fact table, or
• An aggregated fact table derived from it
Variables
– Measure in the fact table
– Attribute of one of the dimensions
– Derived from the measures & attributes
# of variables is the data set's dimensionality
(not to be confused with dimensions of the
original fact table)
Copyright © Ellis Cohen 2002-2005
22
Variables
• Quantitative [numeric]
– Continuous [float]
– Discrete [int]
• Categorical [enumerated]
– Ordinal [ordered]
– Nominal [unordered,
possibly hierarchical]
• Structured
– Text, Image, etc. (ignore these)
Quantitative variables
Have well defined distance (v2 - v1)
May have well-defined ratio (v2/v1)
Unlike integer categorical variables
[e.g. 1:doctor, 2:nurse, 3:other]
Copyright © Ellis Cohen 2002-2005
23
Types of Data Mining Revisited
Clustering
Find groups of data points that are
similar to one another in some way
Approximation
Estimate the value of a quantitative target
variable based on the values of other variables
Classification
Classify (a.k.a. categorize) a data point
(estimate the value of a categorical target
variable) based on the values of other variables
Market Basket Analysis
After dividing data points into baskets,
find sets of values of the item variable
that frequently occur together in a basket
Copyright © Ellis Cohen 2002-2005
24
Example Dataset Definition
Basic Fact Table
Sales( price, units )
Dimensions
Customer, Product, Date, Location
Aggregate, Filter, Select & Transform
AGGREGATE
sum(price * units) as total,
total / sum(units) as avgprice
BY Customer.custid, Product.subcatid
WHERE (year = 2002)
VARIABLES on next page
Actually specify this using a DMQL (Data
Mining Query Language) -- e.g. PMML
(Predictive Model Markup Language)
Copyright © Ellis Cohen 2002-2005
25
Dataset Variables
total CONTINUOUS
avgprice CONTINUOUS
product.subcatid NOMINAL
product.catid NOMINAL, PARENT OF subcatid
customer.custid ORDINAL (order of date joined)
customer.gender NOMINAL
customer.age DISCRETE
customer.married NOMINAL
customer.numDepend DISCRETE
customer.income CONTINUOUS
customer.familyIncome CONTINUOUS
customer.vacationSpend CONTINUOUS
customer.monthlyRent CONTINUOUS
customer.monthlyMortgage CONTINUOUS
monthlyHousing CONTINUOUS =
monthlyRent + monthlyMortgage
allowance CONTINUOUS = CalcAllowance(
familyIncome, married, numDepend, monthlyHousing )
famAllowance CONTINUOUS = CalcFamilyAllowance ( … )
nearestStoreDistance = CalcNearestStoreDist(
customer.x, customer.y )
Copyright © Ellis Cohen 2002-2005
26
Exploratory vs Targeted Data Mining
Exploratory Data Mining
How are the variables in a dataset related
to one another?
Targeted Data Mining
How are the target variables in a dataset
related to the other variables (sometimes
called the predictor variables)?
The target variables (e.g. type of car
bought) are
• Known for the sample dataset
• Unknown (at least initially) for live datasets
Estimation: Always Targeted
Clustering: Exploratory or Targeted
Market Basket Analysis: Generally Exploratory
Copyright © Ellis Cohen 2002-2005
27
Estimation:
Classification &
Approximation
Copyright © Ellis Cohen 2002-2005
28
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Scatter plot: Monthly Income vs Age; each point is a customer marked x, o, or #. Draw regions in which one category predominates – an x region, an o region, and a # region. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Copyright © Ellis Cohen 2002-2005
29
Approximation
Approximate some (continuous) target
value of an object based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarCost -- amt spent by customer on a car
Objects: customers
Find a function
f(income,age) that
gives a good estimate
of the actual value
[Scatter plot: Monthly Income vs Age; each point 'o' is a customer.]
When a new customer comes along, you can
apply f to their income & age to estimate how
much they're likely to spend on a car
Copyright © Ellis Cohen 2002-2005
30
Estimation Activities
[Activities diagram: Sample Data → Modeling → Model; Analyzer → Anomalies (Forensics); Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
Modeling
Come up with a way of estimating target value of data items (only
known for sampled data, not live data) based on other features
Forensics
Understand why some data items have values which significantly
differ from the estimated value
Prediction
Estimate the (unknown) target value of live data items based on
the known features
Detection
When the live data's unknown target value becomes known, find
items whose target value doesn't match its estimated value
Copyright © Ellis Cohen 2002-2005
31
Model Characteristics
[Activities diagram: Sample Data → Modelling → Model; Analyzer → Anomalies (Forensics); Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
Characteristics of Models:
• Transparent (Descriptive): Understandable
• Opaque (Predictive): Not particularly
understandable; its sole purpose is to drive a
predictor/detector/analyzer
• Raw Data: The retained sample data is provided
directly to the predictor/detector/analyzer
Copyright © Ellis Cohen 2002-2005
32
Training & Testing
[Diagram: the Sample Set is divided into a Training Set and several Testing Sets]
Use the training set to build the model
Use testing sets to tweak and validate the model
Copyright © Ellis Cohen 2002-2005
33
Estimation Approaches
Models (for Classification and Approximation):
Raw Data: K Nearest Neighbor
Transparent: Bayesian Networks, Decision Trees (classification); Linear Regression, Regression Trees (approximation)
Opaque: Feed-Forward Neural Networks, Genetic Algorithms
Copyright © Ellis Cohen 2002-2005
34
Decision
Trees
Copyright © Ellis Cohen 2002-2005
35
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Scatter plot: Monthly Income vs Age; each point is a customer marked x, o, or #. Draw regions in which one category predominates – an x region, an o region, and a # region. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Copyright © Ellis Cohen 2002-2005
36
Motivating Decision Trees
[A sequence of x/o/# scatter plots (income vs age), steps 1 and 2, showing how successive axis-parallel splits progressively separate the categories into regions where one category predominates.]
Copyright © Ellis Cohen 2002-2005
37
Decision Tree Construction
[The x/o/# scatter plot (income vs age) alongside the corresponding decision tree: the root splits on income < 3000; further nodes split on income < 6000, age < 35, age < 45, age < 52, age < 54, …; leaves are labeled with the predominant category (#, o, or x).]
Copyright © Ellis Cohen 2002-2005
38
Decision Tree Algorithm
Given a set of variables to
consider, in each iteration the
algorithm
– decides which variable to split
– where (at what value) to split it
in order to find the best possible
separation of tuples
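For illustration only (not from the original slides), here is a minimal Python sketch of one such iteration: it scores every candidate (variable, value) split by weighted Gini impurity and returns the best one. The function names and the choice of Gini are assumptions; real tools also support the other splitting criteria listed later.

import numpy as np

def gini(labels):
    # Gini impurity of a set of category labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # X: (n_points, n_vars) predictor values; y: category labels.
    # Try every variable and every observed value as a binary split point,
    # and keep the split with the lowest weighted impurity.
    best_var, best_val, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            left, right = y[X[:, j] < v], y[X[:, j] >= v]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_var, best_val, best_score = j, v, score
    return best_var, best_val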
Copyright © Ellis Cohen 2002-2005
39
Decision Tree
Construction Issues
Splitting
Which variable to split & where
Gini, Twoing, Entropy, Chi-Square, Gain, …
Binary or n-ary splits,
esp for categorical variables
Use multiple approaches and pick best
Biased Splitting
Non-symmetric misclassification costs
safety misclassification (e.g. O rings)
Category Importance
fraud detection
Penalties on Variables
acquisition cost, questionable values
Case weights
different distributions of sample & live data
Copyright © Ellis Cohen 2002-2005
40
Linear Combination Splitting
[Two x/o/# scatter plots (income vs age): in the second, the data is separated by the oblique split income < 40*age + 400 rather than by axis-parallel splits.]
Copyright © Ellis Cohen 2002-2005
41
Overfitting
[The x/o/# scatter plot with an overgrown decision tree that keeps splitting (income < 3000, income < 6000, income < 4500, age < 52, age < 59, age < 61, …), fitting random variations in the sample data.]
Prevent overfitting by
• Stopping
Need criteria; danger of stopping too early
• Pruning
Build full tree; cut back when testing
Copyright © Ellis Cohen 2002-2005
42
Classification Rule Extraction
[The x/o/# scatter plot (income vs age), partitioned into the decision-tree regions.]
Extract a rule for each region
(25 ≤ age ≤ 35) ∧ (income < 3000) ⇒ CarType = 'midrange' (i.e. 'o')
Support: 7/94 ≈ 7.4%
Confidence: 6/7 ≈ 86%
Support( A ) = # of objects satisfying A / # of total objects:
Does the region determined by A have enough pts to matter?
Confidence( A ⇒ B ) = # satisfying A & B / # satisfying A:
How confident are we that a point in the region determined
by A also satisfies B?
Copyright © Ellis Cohen 2002-2005
43
Classification Ruleset
(income < 3000) & (age < 35) ⇒ '#'
(income < 3000) & (35 <= age < 45) ⇒ 'o'
(income < 3000) & (45 <= age) ⇒ '#'
(3000 <= income < 4500) & (age < 20) ⇒ 'o'
…
((income < 3000) & (35 <= age < 45)) |
((3000 <= income < 4500) & (age < 20)) |
… ⇒ 'o'
(income < 3000) &
((age < 35) | (45 <= age)) … ⇒ '#'
Copyright © Ellis Cohen 2002-2005
44
Using Decision Trees
[Activities diagram: Sample Data → Modeling → Model; Analyzer → Anomalies (Forensics); Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
The Decision Tree
(or the extracted Classification Rules)
are a Transparent (understandable) Model.
How is the Model used for
Forensics, Prediction & Detection?
Copyright © Ellis Cohen 2002-2005
45
Approximation
(also called Regression)
Copyright © Ellis Cohen 2002-2005
46
Approximation
Approximate some (continuous) target
value of an object based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarCost -- amt spent by customer on a car
Objects: customers
Find a function
f(income,age) that
gives a good estimate
of the actual value
[Scatter plot: Monthly Income vs Age; each point 'o' is a customer.]
When a new customer comes along, you can
apply f to their income & age to estimate how
much they're likely to spend on a car
Copyright © Ellis Cohen 2002-2005
47
1D Piecewise Linear Regression
[Figure: '$ spent on x' plotted against income, fit by successively more piecewise linear segments.]
Copyright © Ellis Cohen 2002-2005
48
Piecewise Multivariate
Linear Regression
If there are n predictor dimensions
Use multivariate linear regression to fit
the best possible n-dimensional hyperplane
Find the hyper-plane that has the largest
error, and "split" it into 2 (or more)
hyper-planes
Continue until the error is within desired
limits.
Use testing sets to prune in case of
overfitting
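A rough sketch of the idea above, under assumptions of my own: numpy least-squares fitting, and a simplified splitting rule that cuts a poorly-fit region at the median of its widest predictor rather than searching for the best cut.

import numpy as np

def fit_hyperplane(X, y):
    # Least-squares hyper-plane y ≈ X·w + b; returns coefficients and squared error
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = float(np.sum((A @ coef - y) ** 2))
    return coef, err

def piecewise_fit(X, y, max_err=1.0, min_pts=20):
    # Fit one hyper-plane; if its error is too large, split the region
    # and refit each half, until the error is within the desired limit.
    coef, err = fit_hyperplane(X, y)
    if err <= max_err or len(X) < 2 * min_pts:
        return [(coef, err)]
    j = int(np.argmax(X.max(axis=0) - X.min(axis=0)))   # widest predictor
    left = X[:, j] <= np.median(X[:, j])
    if left.all() or not left.any():                     # degenerate split; stop
        return [(coef, err)]
    return (piecewise_fit(X[left], y[left], max_err, min_pts) +
            piecewise_fit(X[~left], y[~left], max_err, min_pts))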
Copyright © Ellis Cohen 2002-2005
49
Raw Data
Estimation Modeling
with
k Nearest Neighbor
Copyright © Ellis Cohen 2002-2005
50
Raw Data Model
[Activities diagram: Sample Data → Modeling → Model; Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
Ordinarily, the sample data is used to build a
model (e.g. a decision tree, a ruleset, an
approximation function) that is used to predict
target variable values for live data
In raw data models, the model is the sample data
(or some subset of it). The predictor uses the
raw data directly to predict the target variable
values of live data tuples.
Copyright © Ellis Cohen 2002-2005
51
K Nearest Neighbor
for Approximation
k=5
[Scatter plot of 'o' sample points with a new point '*'; its 5 nearest neighbors are used for the estimate.]
Estimate value of a new data point as
average value of nearest k points
Using an average weighted by distance
generalizes linear interpolation
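A minimal sketch of the estimator just described, assuming numpy, a Euclidean distance metric, and inverse-distance weighting (all illustrative assumptions):

import numpy as np

def knn_estimate(sample_X, sample_y, new_pt, k=5):
    # Distances from the new point to every retained sample point
    d = np.sqrt(((sample_X - new_pt) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]              # indices of the k nearest points
    w = 1.0 / (d[nearest] + 1e-9)            # closer neighbors get more weight
    return float(np.sum(w * sample_y[nearest]) / np.sum(w))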
Copyright © Ellis Cohen 2002-2005
52
K Nearest Neighbor
for Classification
k=5
[The x/o/# scatter plot with a new point '*'; the categories of its 5 nearest neighbors determine its predicted category.]
Classify value of a new data pt by choosing
category among k nearest neighbors
which has the most weight
(possibly taking distance into account)
Copyright © Ellis Cohen 2002-2005
53
k Nearest Neighbor Scalability
Too expensive when large # of points
Pick a number of candidate representative
subsample sets instead from a training set
Use test sets to determine the quality of each
representative subsample set.
Also use test sets to determine the best value for k
Choose the
• representative subsample set and
• value of k
with the smallest misclassification rate
Problematic when too many variables
Best (even good) distance metric is hard to
determine
Best distance metric may vary depending upon
region
Copyright © Ellis Cohen 2002-2005
54
Opaque
Estimation Modeling
with Feed-Forward
Neural Networks
Copyright © Ellis Cohen 2002-2005
55
Opaque Data Model
[Activities diagram: Sample Data → Modeling → Model; Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
In a transparent data model -- a decision tree, a
ruleset, or (possibly) an approximation function,
the model can be examined and provides an
understandable description & explanation of the
patterns in the sample data.
An opaque data model does not provide an
understandable description of the patterns in the
sample data. However, it can be used to
configure a "machine" that predicts the target
variable value of live data.
Copyright © Ellis Cohen 2002-2005
56
Neural Net Prediction Engine
[Diagram: input nodes (the predictor variables) feed one or more layers of hidden nodes, which feed output nodes (the target variables). Use weights to combine a node's inputs to produce its output; the node weights are the Model.]
This is an opaque model: good for prediction, but not for explanation
Copyright © Ellis Cohen 2002-2005
57
Propagation Rule
[Diagram: outputs o1..o4 of upstream nodes, with weights w1..w4, feed a node's input 'in']
in = o1*w1 + o2*w2 + o3*w3 + o4*w4
If wi is positive, the effect is "excitatory"
If wi is negative, the effect is "inhibitory"
Copyright © Ellis Cohen 2002-2005
58
Activation
[Diagram: the node's combined input 'in' passes through an activation function to produce its output 'out']
Activation Function: out = f(in)
Copyright © Ellis Cohen 2002-2005
59
Activation Functions
Linear ⇒ Linear Regression
Step ⇒ Piecewise Linear Regression
Sigmoid ⇒ Approximate Any Function
(better than piecewise linear
regression w same # of nodes)
out = 1 / (1 + e^(-in))
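As an illustrative sketch (not from the slides), a forward pass through one hidden layer with the sigmoid activation; the weight matrices are the model, and bias inputs are omitted for brevity:

import numpy as np

def sigmoid(x):
    # out = 1 / (1 + e^(-in))
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, W_hidden, W_out):
    # Propagation rule: each node's input is the weighted sum of its inputs;
    # the activation function turns that input into the node's output.
    hidden = sigmoid(x @ W_hidden)     # hidden-layer outputs
    return sigmoid(hidden @ W_out)     # output-layer predictions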
Copyright © Ellis Cohen 2002-2005
60
Learning & Back Propagation
The big problem is:
How do you set the weights?
Pick initial weights
Repeatedly
• Provide a data pt from the training set as
input
• Compare the predicted response(s) with
the actual response(s)
• Use the difference to adjust the weights
of the output node
• Use the adjustment in weights of a node
to adjust the weights of the nodes that
feed into it (back propagation)
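A hedged sketch of a single training step for a one-hidden-layer network with sigmoid activations and squared error; numpy, the helper names, and the learning rate are assumptions for illustration, with bias terms omitted:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, W1, W2, rate=0.1):
    h = sigmoid(x @ W1)                     # forward pass
    y = sigmoid(h @ W2)
    d_out = (y - target) * y * (1 - y)      # compare predicted vs actual response
    d_hid = (d_out @ W2.T) * h * (1 - h)    # back-propagate the adjustment
    W2 -= rate * np.outer(h, d_out)         # adjust output-node weights
    W1 -= rate * np.outer(x, d_hid)         # adjust the nodes that feed into them
    return W1, W2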
Copyright © Ellis Cohen 2002-2005
61
Momentum
If you adjust the weights too fast
The weights can oscillate instead of
converging
If you adjust the weights too slowly
The weights converge too slowly
In general, neural nets converge
slowly
Can be helped by using other
techniques to select the data points
in the training set & the order in
which they are presented
Copyright © Ellis Cohen 2002-2005
62
Neural Nets for Classification
[The network has three output nodes: Luxury, Midrange, Cheap]
If a sample data point has a luxury target value ⇒
adjust weights towards Luxury=1, Midrange=0, Cheap=0
If a sample data point has a midrange target value ⇒
adjust weights towards Luxury=0, Midrange=1, Cheap=0
If a sample data point has a cheap target value ⇒
adjust weights towards Luxury=0, Midrange=0, Cheap=1
Copyright © Ellis Cohen 2002-2005
63
Clustering
Copyright © Ellis Cohen 2002-2005
64
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Objects:
Corvette
Buyers
[Scatter plot: features Monthly Income vs Age; a tight group of 'x' points. Identify the cluster, e.g. by center & radius.]
Copyright © Ellis Cohen 2002-2005
65
Clustering Activities
[Activities diagram: Sample Data → Modeling → Model; Analyzer → Anomalies (Forensics); Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
Modeling
Description of each cluster
• Cluster boundaries
• For compact clusters: subspaces + centroid + radius
Forensics
Identify and explain outliers (points not in a cluster)
Detection/Prediction
Does live data cluster in the same way?
Copyright © Ellis Cohen 2002-2005
66
Modeling Clusters
Bounds-Based
Cluster 1:
[age: 21 ± 2, monthlyIncome: 5000 ± 1000]
Cluster 2:
[age: 54 ± 3, monthlyIncome: 6400 ± 1200]
Centroid/Radius-Based
Cluster 1:
centroid: [age: 21, monthlyIncome: 5000],
radius: .12
Cluster 2:
centroid: [age: 54, monthlyIncome: 6400],
radius: .15
Centroid/Distance-Based approach implies that
• clusters are circular (too strong)
• we need a uniform distance metric (needed anyway)
Copyright © Ellis Cohen 2002-2005
67
Distance Metric
Clustering requires a distance metric
Given 2 data points, pt1, pt2
Compute distance d( pt1, pt2 )
Distance in a single dimension
Easy for quantitative variables (v2-v1)
Harder for categorical variables
Hardest for structured variables
(e.g. similarity metrics for text, images)
Distance over multiple dimensions
More than just Pythagoras …
Copyright © Ellis Cohen 2002-2005
68
Categorical Variable Distance
Ordinal Variables [ordered]
v2 - v1 doesn't work
Use lookup table or function f(v1,v2)
Nominal Variables [unordered]
– Non-hierarchical [e.g. gender]
d(v1,v2) = 0, if v1=v2
1, otherwise
– Hierarchical
Use distance based upon hierarchy
d(p1,p2) [p1 and p2 are prodid] = for example
0, if p1 = p2, else
.4, if Prodtyp(p1) = Prodtyp(p2), else
.7, if Subcat(p1) = Subcat(p2), else
.9, if Category(p1) = Category(p2)
1, otherwise
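For example, the hierarchy-based distance above could be coded as the following sketch; the lookup tables prodtyp, subcat and category (mapping a prodid to its level value) are hypothetical names:

def product_distance(p1, p2, prodtyp, subcat, category):
    # Distance between two product ids based on how low in the
    # product hierarchy they first share a value
    if p1 == p2:
        return 0.0
    if prodtyp[p1] == prodtyp[p2]:
        return 0.4
    if subcat[p1] == subcat[p2]:
        return 0.7
    if category[p1] == category[p2]:
        return 0.9
    return 1.0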
Copyright © Ellis Cohen 2002-2005
69
Multidimensional Distance
x = (x1, x2, …, xn)
y = (y1, y2, …, yn)
Euclidean Distance
d(x,y) = sqrt( Σi (xi - yi)² )
What if dimensions aren't commensurate?
Scale all dimensions
– Use weights based upon importance, or
– so values lie between 0 and 1, or
– d(x,y) = sqrt( Σi ((xi - yi)/σi)² )
σi is the standard deviation for the ith dimension
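A small sketch of the scaled Euclidean distance, assuming numpy; sigma would typically be the per-dimension standard deviations computed once over the whole dataset:

import numpy as np

def scaled_distance(x, y, sigma):
    # d(x,y) = sqrt( Σi ((xi - yi)/σi)² )
    return float(np.sqrt(np.sum(((x - y) / sigma) ** 2)))

# e.g. sigma = dataset.std(axis=0)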
Copyright © Ellis Cohen 2002-2005
70
Types of Clustering
Partition-Driven
(primarily for O.R.)
Partition data points
Score: based on compactness
Either
• Every pt is in a cluster
• Minimize # of pt which are not
Density-Driven
(primarily for Data Mining)
Discovering dense collections of data
points
Find all clusters which have minimum
size & density
No requirement to include outliers
Copyright © Ellis Cohen 2002-2005
71
Partition-Driven
Clustering
Copyright © Ellis Cohen 2002-2005
72
Partition-Driven Clustering
Assume fixed # of clusters
deal with variable # of clusters later
Different objectives and constraints
• Every pt in some cluster
Warehouse Problem: k warehouses, each
delivering to nearby customers
Place warehouse at center of assigned clusters
• Every pt in some cluster, per-cluster caps
Require every warehouse serve same # of
customers, or deliver same # of total units
• Not every pt needs to be in a cluster,
exclusion penalty
No delivery to faraway customers; the delivery
cost would be more than the cost of attracting a
new customer
Copyright © Ellis Cohen 2002-2005
73
K-Means Approach
Every pt must be in some partition,
no per-cluster caps
Decide on k points that will be
the initial center of each of the k clusters
Repeat
Assign each pt to the nearest cluster center
Recalculate the centers
Until initial center points are same
as final center points
For faster convergence
(esp for 1st & probably 2nd iteration)
After assigning a point to a cluster,
immediately recalculate that cluster's center
Problem: Finds Local Minimum, not Global Minimum
Use different starting points, solve each, take best
overall result (one with min compactness score)
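A compact sketch of the loop above (assumptions: numpy, Euclidean distance, random initial centers); a real run would restart from several different starting points and keep the result with the smallest compactness score:

import numpy as np

def k_means(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # initial centers
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                    # assign each pt to nearest center
        new_centers = np.array([points[assign == i].mean(axis=0)
                                if np.any(assign == i) else centers[i]
                                for i in range(k)])  # recalculate the centers
        if np.allclose(new_centers, centers):        # centers stopped moving
            break
        centers = new_centers
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    score = float(np.sum((points - centers[assign]) ** 2))   # compactness score
    return centers, assign, score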
Copyright © Ellis Cohen 2002-2005
74
Per-Cluster Caps
Decide on k points that will be
the initial center of each of the k clusters
Repeat
Assign the data points to
minimize the score and satisfy the caps!
Recalculate the centers
Until initial center points are same
as final center points
Assigning the data points
– Use integer programming (exact but slow)
– Use greedy method, inexact but fast
Suppose scorei(pt) is the score if the pt is
assigned to cluster i (small score is best)
Hash each pt into a bucket based on
max(scorei(pt)) - min(scorei(pt))
Starting with the bucket w biggest values, assign
each pt in the bucket to the cluster which
minimizes scorei(pt) and still has room
Copyright © Ellis Cohen 2002-2005
75
Varying k
What if the best value for k isn't known?
Could just put each pt in a separate cluster
If score function is based on distance of pt from
cluster center, total score would be zero
Balance with cost of having too many partitions
Two components of total score:
Partitioning Score + Complexity Score
Partitioning Score: based on compactness
Complexity Score: based on k
Overall Algorithm
Choose likely k
Use Cluster Algorithm with k-1, k, and k+1 =>
Move k in direction of smallest score (or use
larger window to avoid local minima)
Done when cost for k is
smaller than cost for k+1 and k-1
Copyright © Ellis Cohen 2002-2005
76
Density-Driven
Clustering
Copyright © Ellis Cohen 2002-2005
77
Density-Driven Clustering
Goal: Find interesting clusters
Discover dense collections of data points
Find all clusters which have minimum
size & density
No requirement to include outliers
Kinds of Density-Driven Clustering
Fixed Space Clustering
Pick a set of dimensions, and find all the
clusters in those dimensions
Subspace Clustering
A cluster in some set of dimensions need not
form a cluster in a subset or superset of
those dimensions
Given a set of dimensions, find all the clusters
in all possible subsets of those dimensions
Copyright © Ellis Cohen 2002-2005
78
Fixed-Space Clustering Approaches
Partition-Based
Find partition with best k, but using a scoring not
hurt by outliers, then eliminate outliers
Division-Based
Start with one partition containing entire dataset
Iteratively separate non-uniformly dense partitions
into subpartitions
Agglomeration-Based
Divide data set into large # of small partitions (either
use partition-based clustering, or each initial
partition has exactly one data point)
Iteratively combine together dense nearby partitions
Grid-Based
Agglomeration-Based, where initial partitions are
cells of an n-dimensional grid
Copyright © Ellis Cohen 2002-2005
79
Partitioning for
Density-Driven Clustering
Use partitioning approach
e.g. Variable k-Means
Use different score
not dominated by outliers
e.g. instead of d², use S curve
-- d² / ( m² + d² ) [fixed m]
Throw outliers away
After finding clusters
Then possibly re-cluster
Copyright © Ellis Cohen 2002-2005
80
Misclustering
[Scatter plot of x points illustrating a partition with the right k but the wrong clusters.]
What if you have
the right k, but the
wrong clusters?
Use Subclustering
Use Agglomeration
Subclustering
Pick k' initial pts dispersed in cluster (& vary k')
Cluster around those. Compare results.
In general (but not in this example), adjusting
clusters (by subclustering + agglomeration)
will change k
Leads to more complex algorithm:
Intersperses finding k with adjusting clusters
Copyright © Ellis Cohen 2002-2005
81
Pure Agglomerative Clustering
Initially put each pt in a separate cluster
Repeat
Find two nearest clusters & merge them
Until no clusters are sufficiently close
Throw away clusters which are too small
(treat as outliers)
Needs a notion of inter-cluster distance
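An illustrative sketch of the loop above using centroid inter-cluster distance (other metrics are discussed on the next slide); numpy and the max_dist / min_size thresholds are assumptions:

import numpy as np

def agglomerate(points, max_dist, min_size=3):
    clusters = [[i] for i in range(len(points))]        # one cluster per point
    while len(clusters) > 1:
        cents = np.array([points[c].mean(axis=0) for c in clusters])
        d = np.linalg.norm(cents[:, None] - cents[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(d.argmin(), d.shape)     # two nearest clusters
        if d[i, j] > max_dist:                           # none sufficiently close
            break
        clusters[i] += clusters[j]                       # merge them
        del clusters[j]
    # clusters that stayed too small are treated as outliers
    return [c for c in clusters if len(c) >= min_size]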
Copyright © Ellis Cohen 2002-2005
82
Inter-Cluster Distance Metrics
Closest Neighbor
Find two points, one in each cluster,
which are nearest together
Can lead to chaining: long narrow clusters. This
may be acceptable, even desirable
Variant: increases compactness
k Closest: average of k nearest neighbors
k might depend upon size of cluster
Farthest Neighbor
Find two points, one in each cluster,
which are farthest apart
Leads to very compact clusters
Centroid Distance
Distance between cluster centroid
Keeps reasonably compact
Fastest approach
Note: May also want to scale the metric based upon
density and size of the two clusters
Copyright © Ellis Cohen 2002-2005
83
Agglomeration & Scalability
Performance of agglomerative
clustering doesn't scale well
especially expensive if
pts don't all fit into main memory
Do in 2 steps:
1. Divide into k clusters [pick a large k]
Use k-means
(where cost not dominated by outliers)
2. Use Agglomerative Clustering
starting with those clusters
Use centroid inter-cluster distance metric
Copyright © Ellis Cohen 2002-2005
84
Grid-based Clustering
Divide the space into a grid, with each of the
n dimensions divided into d divisions
(adjust for non-continuous variables)
Control parameters
d: number of parts each
dimension is divided into
p: min # of pts in a cell for it
to be an interesting cluster
64 3D cells
n = 3, d = 4
1. Divide each space into cells
2. Mark the cells which have at
least p points
3. Agglomerate the marked cells
4. (Optional) Use the
agglomerated cell clusters as the
starting point for another
clustering approach.
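Steps 1 and 2 could look like the sketch below (numpy and the parameter defaults are assumptions); agglomerating neighboring marked cells would follow:

import numpy as np
from collections import Counter

def dense_cells(points, d=4, p=5):
    lo, hi = points.min(axis=0), points.max(axis=0)
    # divide each dimension into d divisions and index each point's cell
    idx = np.floor((points - lo) / (hi - lo + 1e-9) * d).astype(int)
    counts = Counter(map(tuple, idx))
    # mark the cells which have at least p points
    return {cell for cell, n in counts.items() if n >= p}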
Copyright © Ellis Cohen 2002-2005
85
Subspace
Clustering
Copyright © Ellis Cohen 2002-2005
86
Clustering & Subspaces
Not all clusters involve all variables
Cluster 1: [subcatid: Convenience, age: 16 ± 2,
married: FALSE]
Cluster 2: [subcatid: Convenience, allowance: 500 ± 75,
nearestStoreDistance: .1 ± .1]
Cluster 3: [subcatid: Sofas, familyIncome: 6000 ± 1000,
vacationSpend: 2000 ± 400]
The variables involved in each cluster
form a subspace
Cluster 1: [ subcatid, age, married ]
Cluster 2: [ subcatid, allowance,
nearestStoreDistance ]
Cluster 3: [ subcatid, familyIncome,
vacationSpend ]
Fixed space clustering only works if you KNOW, in advance, which
group of dimensions you should examine for clusters.
But often, you don't know. What do you do?
Copyright © Ellis Cohen 2002-2005
87
Subspace Clustering
Consider every possible subspace,
and find clusters in each subspace
Grid-based implementation
Find all the dense cells in each single
dimension separately
To find all dense cells in an (n+1)-dimension
subspace, combine dense cells in all n-dimension
subspaces with dense cells in each single dimension
(this is pretty efficient)
In each subspace with dense cells,
agglomerate to find clusters
Copyright © Ellis Cohen 2002-2005
88
Counting Pts In Subspaces
1 3D space
{age, income, housing}
64 3D cells
n = 3, d = 4
3 2D subspaces
{age, income}
{age, housing}
{income, housing}
3 1D subspaces
{age}
{income}
{housing}
[Figure: a 3D grid over Age, Income, Housing]
To count # pts in a 2D cell, aggregate counts in the corresponding 3D cells
Copyright © Ellis Cohen 2002-2005
89
Partial Projection-Based Merging
[Figure: a 10×10 2D grid (age vs income) of x points, together with its two 1D projections (10 cells each).]
Clusters in one space might merge in some subspaces
Copyright © Ellis Cohen 2002-2005
90
Total Projection-Based Merging
[Figure: the same kind of 2D grid of x points, projected onto age and onto income.]
Clusters in one space might merge in all subspaces
Copyright © Ellis Cohen 2002-2005
91
Total Projection-Based Hiding
Clusters in one space might even be hidden in all
subspaces
[Figure: x points whose 2D cluster structure disappears in every 1D projection.]
You cannot just look in lower dimension spaces to find clusters in higher dimension spaces
Copyright © Ellis Cohen 2002-2005
92
Superspace Dispersion
Clusters in a subspace may be dispersed in a
superspace
[Figure: a 10×10 2D grid of x points and its 1D projections (10 cells each); points clustered in a 1D subspace are dispersed in 2D.]
You cannot just look in higher dimension spaces to find clusters in lower dimension spaces
Copyright © Ellis Cohen 2002-2005
93
Subspaces & Cells
64 3D cells; n = 3, d = 4
(3 choose 3) 3D subspaces with 4³ cells each:
1 subspace, 64 cells each, 64 total cells
(3 choose 2) 2D subspaces with 4² cells each:
3 subspaces, 16 cells each, 48 total cells
(3 choose 1) 1D subspaces with 4 cells each:
3 subspaces, 4 cells each, 12 total cells
Copyright © Ellis Cohen 2002-2005
94
Smaller High Dimensional Cells
n = 20, d = 10
(20 choose 1) 1D subspaces with 10 cells each:
20 subspaces, 10 cells each, 200 total cells
(20 choose 2) 2D subspaces with 10² cells each:
190 subspaces, 100 cells each, 19,000 total cells
(20 choose 3) 3D subspaces with 10³ cells each:
1140 subspaces, 1000 cells each, 1,140,000 total cells
~1 million subspaces total
Copyright © Ellis Cohen 2002-2005
95
Monotonicity
[Figure: the 64-cell 3D grid (n = 3, d = 4) over Age, Income, Housing]
If a cell in k+1 dimensions contains p points,
then every k-dimensional cell which aggregates it
must also contain p points.
So: to find the 3D cells with p points, you only need to
consider cells all of whose 2D aggregates contain p points.
e.g. a cell in the subspace {age, income, housing} only
contains p points if the corresponding cells in
{age, income}, {age, housing} and {housing, income} all
contain p points
Copyright © Ellis Cohen 2002-2005
96
Subspace Clustering Algorithm
In every subspace, find all cluster cells
(i.e. dense cells with at least p points)
– Find all 1D cells with at least p points
– Use monotonicity to efficiently find
all 2D cells with at least p points
– Continue using monotonicity to efficiently find all
cells with at least p points in all subspaces
– Discard cells in low dimensional subspaces which
are not sufficiently dense
For each subspace which has cluster cells
– Agglomerate the marked cells
– (Optional) Use the agglomerated cell clusters as
the starting point for another clustering approach
Copyright © Ellis Cohen 2002-2005
97
Density Requirements
Suppose we have 10,000 data points
and 20 dimensions
d: 10 (divide each dimension into 10ths)
p: 50 (need 50 data pts in a cell)
τ: 4 (cell must be at least 4 times as dense as avg)

# dim   # cells   avg # pts   req # pts per cell   req # pts per cell
                  per cell    to satisfy τ         to satisfy τ & p
1       10        1000        4000                 4000
2       100       100         400                  400
3       1000      10          40                   50
4       10000     1           4                    50
5       100000    .1          1                    50
Copyright © Ellis Cohen 2002-2005
98
Targeted Clustering &
Association Rules
Copyright © Ellis Cohen 2002-2005
99
Exploratory vs Targeted Clustering
Exploratory Clustering
Find clusters involving an arbitrary
set of variables
Targeted Clustering
Find clusters among a set of
variables which include the target
variable (possibly restricted to a
particular value)
Copyright © Ellis Cohen 2002-2005
100
Single Value Targeted Clustering
Suppose
• Our sample dataset consists of car
buyers
• we want to find clusters of car
buyers who bought luxury cars
⇒
1. Restrict the sample dataset to just
those tuples where CarType =
"luxury"
2. Use clustering among this
restricted dataset
Copyright © Ellis Cohen 2002-2005
101
Multiple Value Targeted Clustering
Suppose
• Our sample dataset consists of car
buyers
• we want to find clusters of car
buyers who bought the various
categories of cars
⇒
Use Single Value Targeted Clustering
for CarType = "luxury", then again
for CarType = "midrange", then
again for CarType = "cheap"
Copyright © Ellis Cohen 2002-2005
102
Clustering vs Classification
Can't we use Multiple Value Targeted
Clustering to do Classification?
Find the clusters where
CarType = "luxury",
CarType = "midrange",
and CarType = "cheap"
Use the clusters as the model and to
predict the value of live data.
WILL THIS WORK OR NOT?
Copyright © Ellis Cohen 2002-2005
103
Clustering Does NOT Classify
Clusters do not cover the space
Clusters only identify dense regions of objects.
The bulk of the space that a decision tree would
assign to cheap car buyers probably does NOT
hold dense clusters of them, so would not be
included in the clustered model
Clusters for different target values may
overlap
[Scatter plot: an 'o cluster' and an 'x cluster' whose regions overlap.]
Copyright © Ellis Cohen 2002-2005
104
Association Rules
The region corresponding to a cluster
may include other data points whose
target values differ from the targeted
cluster value
[Scatter plot (Monthly Income vs Age): an 'o' cluster region that also contains a few 'x' points.]
Cluster:
[age: 26 ± 3, monthlyIncome: 3000 ± 1000]
Pct of data pts within the cluster region with target value
'midrange' (i.e. with symbol 'o') is the same as
the CONFIDENCE of the association rule
(23 ≤ age ≤ 29) ∧ (2000 ≤ monthlyIncome ≤ 4000)
⇒ carType = 'midrange'
Copyright © Ellis Cohen 2002-2005
105
Cluster Confidence
Depending upon the data mining
application, confidence may or
may not matter.
Problem: Come up with
applications where the
confidence of clusters
– must be > 80%
– must be > 30%
– doesn't matter
Copyright © Ellis Cohen 2002-2005
106
Requiring Confident Clusters
Suppose we are interested in
regions where enough data
points are clustered together
(i.e. with good support), but
where a minimum confidence
w.r.t. a targeted value is
required.
Is there any alternative to
simply discarding clustered
regions with low confidence?
Copyright © Ellis Cohen 2002-2005
107
Split Clusters
[The 'o' cluster region from the previous slide, containing some 'x' points.]
Use decision-tree style splitting of
the data points within the cluster
to best separate the 'o' valued
points from the non-'o' valued
points
If the 'o' region has too few data
points (too little support), tough
luck
If it has adequate support &
confidence, done!
If it has adequate support, but has
inadequate confidence, split it
again …
(If the non-'o' regions are large enough,
they can also potentially be split to
find 'o' subregions)
Copyright © Ellis Cohen 2002-2005
108
Quantitative Target Variables
If a target variable is
quantitative, especially if it is
continuous, how can targeted
clustering be done?
For example, how do you find
clusters of car buyers who
spent approximately the same
amount on a car?
Copyright © Ellis Cohen 2002-2005
109
Quantitative Targeted Clustering
How do you find clusters of car
buyers who spent approximately
the same amount on a car?
Do standard clustering,
just require that the variables
used always INCLUDE the
target variable!
Copyright © Ellis Cohen 2002-2005
110
Market Basket
Analysis
Copyright © Ellis Cohen 2002-2005
111
Market Basket Analysis
Organize dataset into baskets.
Find groups of items which frequently
occur together in baskets
Date        Customer  Item
11-Feb-99   Joe       Diapers
11-Feb-99   Joe       Formula
11-Feb-99   Joe       Beer
…
11-Feb-99   Simha     Pretzels
11-Feb-99   Simha     Beer
…
13-Feb-99   Sasha     Diapers
13-Feb-99   Sasha     Beer

Basket: Daily Shopping by a Customer
Item: Product purchased
Diapers and beer occur together frequently. Rules capture causality:
Beer ⇒ Diapers? NO! People who buy beer are not more likely to buy diapers
Diapers ⇒ Beer? YES! People who buy diapers are more likely to buy beer (esp men at night)
Copyright © Ellis Cohen 2002-2005
112
Market Basket Activities
[Activities diagram: Sample Data → Modeling → Model; Analyzer → Anomalies (Forensics); Model + Live Data → Detector/Predictor → Detection, Prediction → Predictions]
Modeling
Identify good sets of rules (w statistics)
Forensics
Understand why other groups of items are NOT related
Detection/Prediction
Does live data follow same rules with same statistics
Copyright © Ellis Cohen 2002-2005
113
Baskets
In order to use market basket
analysis, you must first divide the
dataset into baskets.
Baskets are specified as a group of
variables (possibly derived). The
actual baskets are obtained by
grouping the dataset by these
variables (e.g. date/customer).
The first step of market basket
analysis is deciding which variables
define the baskets.
Copyright © Ellis Cohen 2002-2005
114
Items
Market Basket analysis looks for
groups of items which frequently
appear together in a basket.
An item is determined by a variable
(or set of variables). Each different
value for that variable (or
variables) determines a different
item (e.g. productPurchased).
The second step of market basket
analysis is determining which
variable(s) are used to identify the
items
Copyright © Ellis Cohen 2002-2005
115
Market Basket Discovery
1) Find frequent itemsets
2 items that appear together frequently
are interesting
3 items that appear together frequently
are really interesting
{ charcoal, chicken, bbq sauce }
4 or more items, really really interesting
2) Find rules that characterize
causality
Diapers ⇒ Beer, but not Beer ⇒ Diapers
Think in terms of which do you give a
coupon for.
Copyright © Ellis Cohen 2002-2005
116
Apriori Algorithm
Find all itemsets which have at least n items
Use Apriori or Monotonicity principle:
If a set of items S is frequent (i.e. appears in at least
n baskets), then every subset of S is frequent.
Call L1, the items which appear in at least n baskets
Consider all combos of 2 items {A,B}, both from L1
Call L2, those which appear in at least n baskets
Consider all combos of 3 items {A,B,C}, where
{A,B} and {A,C}, and {B,C} are in L2
Call L3, those which appear in at least n baskets
Consider all combos of 4 items {A,B,C,D}, where
{A,B,C} and {A,B,D}, {A,C,D} and {B,C,D} are in L3
Call L4, those which appear in at least n baskets
The frequent itemsets are L1 + L2 + …
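A small in-memory sketch of the level-by-level procedure above (baskets as Python sets; the helper names are illustrative, and a database implementation follows on the next slide):

from itertools import combinations

def apriori(baskets, n):
    # baskets: list of sets of items; returns {frequent itemset: basket count}
    def count(cands):
        return {c: sum(1 for b in baskets if c <= b) for c in cands}

    items = {i for b in baskets for i in b}
    level = {s: c for s, c in count({frozenset([i]) for i in items}).items()
             if c >= n}                                   # L1
    frequent, k = {}, 1
    while level:
        frequent.update(level)
        # candidate (k+1)-itemsets all of whose k-item subsets are frequent
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        level = {s: cnt for s, cnt in count(cands).items() if cnt >= n}   # L(k+1)
        k += 1
    return frequent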
Copyright © Ellis Cohen 2002-2005
117
DB Implementation of Apriori
One scan through DB to get frequent items
CREATE TABLE HotItems AS
SELECT item FROM Purchases
GROUP BY item HAVING count(*) >= n
Another scan through DB looking for pair
itemsets (repeat for size n itemsets)
WITH HotPurchases AS
(SELECT * FROM
Purchases NATURAL JOIN HotItems)
SELECT P1.item, P2.item
FROM HotPurchases P1, HotPurchases P2
WHERE P1.basket = P2.basket
AND P1.item < P2.item
GROUP BY P1.item, P2.item
HAVING count(*) >= n
Copyright © Ellis Cohen 2002-2005
118
Apriori Scalability
Obtaining frequent pair itemsets:
If HotItems can be kept in memory allowing
rapid lookup (sorted list or hashtable)
FP (frequent pair) itemsets can be obtained in
one linear pass through the DB.
Obtaining frequent size n itemsets:
1. Use a separate linear scan through the DB up to n. Slow.
2. On second scan, don't just count pairs;
instead build a memory-based FP-Tree.
Can be used to find all frequent itemsets of any size.
But, we often only care about frequent pair
itemsets.
Copyright © Ellis Cohen 2002-2005
119
Clustering & Market Basket Analysis
Market Basket Analysis is a form of clustering
• Turn each basket into a single LARGE data item.
– Each LARGE data item has a separate boolean
variable for each possible item that can be in
a basket. For example
– Beer, diapers, etc. are separate variables
– A LARGE data item's beer value is TRUE if the
basket it came from had a beer
• In the original dataset, we look for
k-element itemsets which appear in p or more
baskets (using apriori)
This is equivalent to using the LARGE item
dataset and
– using the subspace clustering algorithm to
look for k-dimensional cells with p or more
points
Copyright © Ellis Cohen 2002-2005
120
Scoring
Market Basket
Analysis
Copyright © Ellis Cohen 2002-2005
121
Support
Support( S ) =
# of baskets containing S /
# of total baskets
[Venn diagram: 4000 total baskets; 1000 contain Beer, 2000 contain Cereal, 40 contain both]
Support { Beer } = 1000/4000 = 25%
Support { Cereal } = 2000/4000 = 50%
Support { Beer, Cereal } = 40/4000 = 1%
Support: How significant is this itemset
In a supermarket, anything over .1% might be
significant
Given the # of total baskets, the minimum
interesting support determines n for the Apriori
algorithm
Copyright © Ellis Cohen 2002-2005
122
Confidence
Confidence( A ⇒ B ) =
Support( A & B ) / Support( A ) =
# of baskets containing A & B /
# of baskets containing A
[Venn diagram: 4000 total baskets; 1000 contain Beer, 2000 contain Cereal, 40 contain both]
Confidence( Beer ⇒ Cereal ) = 40/1000 = 4%
Confidence( Cereal ⇒ Beer ) = 40/2000 = 2%
Confidence( A ⇒ B ): If a basket has A,
how likely is it that the basket also will have B
(i.e. how confident are we that A predicts B)
If this is low (say, less than 30%), it is not very
interesting, since the two items don't correlate
Copyright © Ellis Cohen 2002-2005
123
High Support & Confidence
[Venn diagram: 4000 total baskets; 1000 contain Beer, 2000 contain Milk, 400 contain both]
Support { Beer } = 1000/4000 = 25%
Support { Milk } = 2000/4000 = 50%
Support { Beer, Milk } = 400/4000 = 10% WOW!
Confidence( Milk ⇒ Beer ) = 400/2000 = 20%
Confidence( Beer ⇒ Milk ) = 400/1000 = 40%
High Confidence, so potentially interesting
BUT 40% < 50%, the pct who buy milk anyway
So giving milk coupons to beer buyers is probably
not the most useful thing to do
Copyright © Ellis Cohen 2002-2005
124
Lift
Lift( A ⇒ B ) =
Confidence( A ⇒ B ) / Support( B )
= Lift( B ⇒ A )
[Venn diagram: 4000 total baskets; 1000 contain Beer, 2000 contain Milk, 400 contain both]
Support { Beer } = 1000/4000 = 25%
Support { Milk } = 2000/4000 = 50%
Support { Beer, Milk } = 400/4000 = 10% WOW!
Confidence( Milk ⇒ Beer ) = 400/2000 = 20%
Confidence( Beer ⇒ Milk ) = 400/1000 = 40% OK!
Lift( A ⇒ B ): How much does A help B
Lift( Beer ⇒ Milk ) = 40% / 50% = .8
If lift < 1, then it doesn't help at all!
Copyright © Ellis Cohen 2002-2005
125
Good Lift
Lift( A ⇒ B ) =
Confidence( A ⇒ B ) / Support( B )
= Lift( B ⇒ A )
[Venn diagram: 4000 total baskets; 1000 contain Beer, 200 contain Diapers, 80 contain both]
Support { Beer } = 1000/4000 = 25%
Support { Diapers } = 200/4000 = 5%
Support { Beer, Diapers } = 80/4000 = 2% OK!
Confidence( Beer ⇒ Diapers ) = 80/1000 = 8%
Confidence( Diapers ⇒ Beer ) = 80/200 = 40% OK!
Lift( Diapers ⇒ Beer ) = 40% / 25% = 1.6
Note: Lift can be useful in clustering situations as well
Copyright © Ellis Cohen 2002-2005
126
Support, Confidence & Lift
A ⇒ B
Support( A & B )
How important is the rule: What percent of baskets
have both A & B?
Confidence( A ⇒ B )
How likely is it that a basket which contains A also
contains B. In general, should be at least 35%.
Lift( A ⇒ B )
If we know that a basket contains A, how much surer
are we that the basket contains B than if we didn't
know what else was in the basket. Must be > 1;
probably should be at least 1.3.
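The three scores can be computed directly from basket counts; a sketch (baskets as Python sets; names illustrative):

def rule_scores(baskets, A, B):
    # Support, Confidence and Lift of the rule A ⇒ B
    total = len(baskets)
    n_a  = sum(1 for b in baskets if A <= b)
    n_b  = sum(1 for b in baskets if B <= b)
    n_ab = sum(1 for b in baskets if A <= b and B <= b)
    support    = n_ab / total
    confidence = n_ab / n_a                   # how likely B is, given A
    lift       = confidence / (n_b / total)   # how much A helps B
    return support, confidence, lift

# With the Good Lift numbers above (4000 baskets, 1000 Beer, 200 Diapers, 80 both):
# support = 2%, confidence(Diapers ⇒ Beer) = 40%, lift = 1.6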
Copyright © Ellis Cohen 2002-2005
127
Hierarchical Categories
Do apriori with values at each category level
Whole Wheat Bread ⇒ Skim Milk,
but not
Bread ⇒ Milk
or vice versa!
For scalability,
can initially only include higher level categories,
then split itemsets with high levels of support
Copyright © Ellis Cohen 2002-2005
128
Rules for Larger Itemsets
For { A, B, C, D }
Consider
A, B, C ⇒ D
A, B, D ⇒ C
A, C, D ⇒ B
B, C, D ⇒ A
Diapers, ChildrensTylenol ⇒ Beer
may have less support than
Diapers ⇒ Beer
but may well have higher confidence
and higher lift
Copyright © Ellis Cohen 2002-2005
129
Incorporating Other Variables
Diapers,
gender: male,
time: [8pm:1am]
⇒ Beer
will also have less support than
Diapers ⇒ Beer
But will almost certainly have
higher confidence & lift
Remember, this is just
subspace clustering with more variables
Copyright © Ellis Cohen 2002-2005
130
Scoring &
Data Preparation
Copyright © Ellis Cohen 2002-2005
131
Model Scoring
Descriptive Scoring
How good is the resulting model with respect to
the sample data?
E.g. estimation: does the model function do a
good job of actually estimating the target
variable of the sample data?
E.g. clustering: Do the clusters overlap? Are
there unidentified subclusters?
Predictive Scoring
How good is the resulting model with respect to
live data?
E.g. estimation: Does the model function do a
good job of actually estimating the target
variable of live data?
May not know right away, or ever …
Copyright © Ellis Cohen 2002-2005
132
Causes of Scoring Failure
Descriptive Scoring Failures
Noisy, approximate or erroneous data
Curse of Dimensionality: too many variables
Variables inadequate for finding good patterns
• Clustering: Limited clustering with given
variables
• Classification: Limited separation relative to
given variable
• Mkt Basket: Baskets are too coarse or too fine
Approach doesn't match sample data
Predictive Scoring Failures
Environment Change
• Calls for continuous mining & trend analysis
Overfitting
• Model does not just identify the persistent
trends embodied in data
• Model also captures idiosyncratic random
variations of the raw data
Copyright © Ellis Cohen 2002-2005
133
Data Cleaning
Missing Small Amounts of Data
Not a problem for many approaches
Use estimation to fill in missing values
Noisy or Approximate Data
Use smoothing
Lose ability to discover some real patterns &
relationships
Avoids overfitting & discovery of non-causal
relationships
Erroneous Data
Forensic activity to discover outliers
Distinguish (if possible) between anomalous and
erroneous outliers; remove/fix erroneous ones
Copyright © Ellis Cohen 2002-2005
134
Curse of Dimensionality
Too many dimensions
– can cause scoring failures
– can make data mining
significantly more expensive
Choose a subset of the possible
dimensions
– For targeted data mining: choose
the ones which contribute most
to the value of the target variable
Copyright © Ellis Cohen 2002-2005
135
Attribute Importance
[Scatter plot 'Auto Sales': Monthly Income vs Age; the size of each 'o' is proportional to the price paid.]
Suppose you wanted to generate an estimator for
the price a new customer will pay for a car.
Your data mining tool can define an estimator that
only works given a single variable (unlikely, but bear
with me …). Would you prefer to estimate price as
f1(age) or f2(monthlyIncome)?
Copyright © Ellis Cohen 2002-2005
136
Feature Selection
Cumulative attribute importance:
monthlyIncome    42%
vacationSpend    64%
numDepends       72%
dateOfBirth      81%
gender           90%
…                99%
…                100%
age              100%
Feature selection can
itself be considered a
type of data mining
Orders predictor
variables in terms of
the amount to which
they contribute to
values of target
variables
Users can do data
mining with just
attributes which cover
some % of importance
Can lose patterns, esp
cluster-related
Extend feature selection
to take changes of
density into account.
Copyright © Ellis Cohen 2002-2005
137
Principal Components Analysis
Combination of
– Feature Selection
– Transformation (Linear Combination)
Defines a new set of variables
Each is linear combination of predictor variables
e.g. v1 = .05*allowance + .01*vacationSpend
Replace the original predictor variables
Goal
Choose variable definitions to get the best
possible effect from feature selection
A small number of the new variables almost
completely determines the values of the target
variables
Pro: fewer dimensions
Con: less understandable system
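A minimal numpy sketch of the transformation (illustrative only; real tools choose k so that the retained components explain most of the variance):

import numpy as np

def principal_components(X, k):
    Xc = X - X.mean(axis=0)                       # center each predictor variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # each row: one linear combination
    return Xc @ components.T, components          # new variables + their definitions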
Copyright © Ellis Cohen 2002-2005
138
Goals of Data Mining
Find patterns and relationships
among objects represented by
data points in a dataset
Novel / Significant
People who buy notebooks buy pencils vs.
Men who buy diapers at night also buy beer
Understandable / Useful
Tradeoff between accuracy and simplicity
Causality vs Chance
All data has random variations, which can show
up as spurious patterns and relationships
Copyright © Ellis Cohen 2002-2005
139