Transcript a priori

Clustering Algorithms
Minimize distance, but to centers of groups
5-2
Clustering
• First need to identify clusters
– Can be done automatically
– Often clusters determined by problem
• Then simple matter to measure distance
from new observation to each cluster
– Use same measures as with memory-based
reasoning
McGraw-Hill/Irwin
©2007 The McGraw-Hill Companies, Inc. All rights reserved
5-3
Partitioning
• Define new categorical variables
– Divide data into fixed number (k) of regions
– K-means clustering
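The partitioning idea above can be sketched in a few lines of Python. This is a minimal illustration of the usual k-means loop (assign each point to its nearest center, then recompute the centers), not the procedure of any particular software package. The function names, the sample points, and the naive "first k points" initialization are all assumptions made for the example; real implementations usually pick starting centers at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    # Assumption: start from the first k points (simple and deterministic).
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical, already-standardized observations forming two obvious groups.
points = [(0.10, 0.10), (0.90, 0.90), (0.15, 0.20),
          (0.85, 0.80), (0.20, 0.10), (0.95, 0.85)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The resulting label for each observation is the new categorical variable the slide refers to.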
5-4
Clustering Uses
• Segment customers
– Find profitability of each, treat accordingly
• Star classification:
– Red giants, white dwarfs, normal
– Brightness & temperature used to classify
• U.S. Army
– Identify sizes needed for female soldiers
– (males – one size fits all)
5-5
Tires
• Segment customers into product
categories
– High end (they would buy Michelins)
– Intermediate & Low
• Standardize data (as in memory-based
reasoning)
5-6
Raw Tire Data
BRAND          INCOME      AGE OF CAR
Michelin       $182,200    5 months
Michelin       $171,200    3 years
Goodyear       $28,800     7 years
Goodyear       $37,800     6 years
Goodyear       $42,200     5 years
Goodyear       $55,600     4 years
Goodyear       $51,200     9 years
Goodyear       $173,400    7 years
Opie’s tires   $13,400     3 years
Opie’s tires   $68,800     6 years
5-7
Standardize
• INCOME
– MIN(1, INCOME/200000)
• AGE OF CAR
– IF AGE OF CAR < 12 months: 1
– ELSE: MAX(0, MIN(1, (8 - Years)/7)), floored at 0 because the 9-year-old car scores 0 in the standardized data
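The two rules can be written out as small Python functions. This is a sketch under stated assumptions: the function names are made up for illustration, and the floor at 0 for cars older than 8 years is inferred from the standardized training data (where the 9-year-old car scores 0) rather than stated in the rule itself.

```python
def standardize_income(income):
    """Scale income to 0..1, capping at $200,000."""
    return min(1.0, income / 200_000)


def standardize_car_age(age_years):
    """Score car age on 0..1: newer cars score higher."""
    if age_years < 1:  # less than 12 months old
        return 1.0
    # Assumption: negative scores (cars older than 8 years) are floored at 0.
    return max(0.0, min(1.0, (8 - age_years) / 7))


# Raw values taken from the tire data: $182,200 with a 5-month-old car,
# and $51,200 with a 9-year-old car.
print(round(standardize_income(182_200), 3), standardize_car_age(5 / 12))
print(round(standardize_income(51_200), 3), standardize_car_age(9))
```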
5-8
Sort Data by Outcome
BRAND          INCOME        AGE OF CAR
Michelin       High income   Bought this year
Michelin       High income   Bought 1-3 yrs ago
Goodyear       Low income    Bought 4+ yrs ago
Goodyear       Low income    Bought 4+ yrs ago
Goodyear       Low income    Bought 4+ yrs ago
Goodyear       Avg income    Bought 1-3 yrs ago
Goodyear       Avg income    Bought 4+ yrs ago
Goodyear       High income   Bought 4+ yrs ago
Opie’s tires   Low income    Bought 1-3 yrs ago
Opie’s tires   Avg income    Bought 4+ yrs ago
5-9
Standardized Training Data
BRAND          INCOME   AGE OF CAR
Michelin       0.911    1.000
Michelin       0.856    0.714
Goodyear       0.144    0.143
Goodyear       0.189    0.286
Goodyear       0.211    0.429
Goodyear       0.278    0.571
Goodyear       0.256    0.000
Goodyear       0.867    0.143
Opie’s tires   0.067    0.714
Opie’s tires   0.344    0.286
5-10
Identify Cluster Means
(could use median, mode)
BRAND          INCOME   CAR AGE
Michelin       0.884    0.857
Goodyear       0.324    0.262
Opie’s tires   0.206    0.500
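The cluster means are simply the per-brand averages of the standardized training data. A minimal sketch of that calculation follows; the variable and function names are illustrative, and the rows are the standardized values from the previous slide.

```python
from collections import defaultdict

# (brand, standardized income, standardized car age)
training = [
    ("Michelin", 0.911, 1.000), ("Michelin", 0.856, 0.714),
    ("Goodyear", 0.144, 0.143), ("Goodyear", 0.189, 0.286),
    ("Goodyear", 0.211, 0.429), ("Goodyear", 0.278, 0.571),
    ("Goodyear", 0.256, 0.000), ("Goodyear", 0.867, 0.143),
    ("Opie's tires", 0.067, 0.714), ("Opie's tires", 0.344, 0.286),
]


def cluster_means(rows):
    """Average each variable within each brand (the cluster centroid)."""
    groups = defaultdict(list)
    for brand, income, age in rows:
        groups[brand].append((income, age))
    return {
        brand: tuple(sum(column) / len(column) for column in zip(*members))
        for brand, members in groups.items()
    }


means = cluster_means(training)
for brand, (income, age) in means.items():
    print(f"{brand:13s} {income:.3f} {age:.3f}")
```

Swapping the average for a median would give the alternative the slide mentions in parentheses.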
5-11
New Case #1
• From new data (could be a test set or new observations to classify), compute the squared distance to each centroid:
– Michelin: 0.840
– Goodyear: 0.025
– Opie’s tires: 0.047
• So minimum distance is to Goodyear
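Classifying a new case amounts to computing its squared distance to each centroid and picking the smallest. The sketch below uses the cluster means from the earlier slide. The new observation is hypothetical, because the slides do not give the standardized values of the new cases, so the distances it prints are not the ones listed above.

```python
# Cluster means (income, car age) as given on the "Identify Cluster Means" slide.
centroids = {
    "Michelin": (0.884, 0.857),
    "Goodyear": (0.324, 0.262),
    "Opie's tires": (0.206, 0.500),
}


def squared_distances(case, centers):
    """Squared Euclidean distance from one case to every centroid."""
    return {
        name: sum((x - c) ** 2 for x, c in zip(case, center))
        for name, center in centers.items()
    }


# Hypothetical standardized observation: (income, car age).
new_case = (0.20, 0.30)
distances = squared_distances(new_case, centroids)
nearest = min(distances, key=distances.get)

for name, d in distances.items():
    print(f"{name:13s} {d:.3f}")
print("Assigned to:", nearest)
```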
5-12
New Case #2
• Squared distance to each centroid:
– Michelin: 0.634
– Goodyear: 0.255
– Opie’s tires: 0.057
• So minimum distance is to Opie’s
5-13
Software Methods
• Hierarchical clustering
– Number of clusters unspecified a priori
– Two-step is a form of hierarchical clustering
• K-means clustering
• Self-organizing maps
– Neural network
• Hybrids combine methods
5-14
Application: Credit Cards
• Credit scoring critical
• Use past applicants; develop model to
predict payback
– Look for indicators providing early warning of
trouble
5-15
British Credit Card Company
• Monthly account status – over 90
thousand customers, one year operations
• Outcome variable STATE: cumulative
months of missed payments (integer)
– Some errors & missing data (eliminated
observations)
– Biased sample of 10 thousand observations
– Required initial STATE of 0
5-16
British Credit Card Company
• Compared clustering approaches with pattern
detection method
• Used medians rather than centroids
– More stable
– Partitioned data
• Clustering useful for general profile behavior
• Pattern search method sought local clusters
– Unable to partition entire data set
– Identified a few groups with unusual behavior
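The claim that medians are more stable than centroids can be illustrated with a small calculation. The credit card data are not available here, so this sketch borrows the standardized Goodyear incomes from the tire example, where one high-income customer (0.867) sits far from the rest. Treating that customer as the outlier is an assumption made for the illustration.

```python
from statistics import mean, median

# Standardized Goodyear incomes from the tire example.
goodyear_income = [0.144, 0.189, 0.211, 0.278, 0.256, 0.867]

# Assumption: treat the single high-income customer as an outlier.
without_outlier = [v for v in goodyear_income if v < 0.8]

mean_shift = abs(mean(goodyear_income) - mean(without_outlier))
median_shift = abs(median(goodyear_income) - median(without_outlier))

print(f"mean moves by   {mean_shift:.3f}")
print(f"median moves by {median_shift:.3f}")
```

Removing one extreme value moves the mean several times farther than it moves the median, which is the sense in which a median-based cluster center is more stable.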
5-17
Insurance Claim Application
• Large data warehouse of financial
transactions & claims
• Customer retention very important
– Recent heavy growth in policies
– Decreased profitability
• Used clustering to analyze claim patterns
– Wanted hidden trends & patterns
5-18
Insurance Claim Mining
• Undirected knowledge discovery
– Cluster analysis to identify risk categories
• Data for 1996-1998
– Quarterly data
– Claims for prior 12 months
– Contribution to profit of each policy
– Over 100,000 samples
– Heavy growth in young people with expensive automobiles
– Transformed data to normalize, remove outliers
5-19
Insurance Claim Mining
• Number of clusters
– Too few clusters give no discrimination; the best number here was 50
– Used k-means algorithm to minimize squared error
• Identified a few clusters with high claims frequency and unprofitability
• Compared 1998 data with 1996 data to find
trends
• Developed model to predict new policy holder
performance
– Used for pricing
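The squared error that k-means minimizes is the sum, over all observations, of the squared distance to the center of the cluster each one belongs to. The sketch below computes it for a few groupings of the same points. The data and function names are hypothetical and only meant to show why too few clusters give little discrimination.

```python
def centroid(members):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(column) / len(members) for column in zip(*members))


def sum_squared_error(clusters):
    """Total squared distance from each point to its own cluster's centroid."""
    total = 0.0
    for members in clusters:
        center = centroid(members)
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Hypothetical standardized observations forming two visible groups.
low = [(0.10, 0.10), (0.20, 0.10), (0.15, 0.20)]
high = [(0.90, 0.90), (0.80, 0.85), (0.85, 0.95)]

sse_one_cluster = sum_squared_error([low + high])
sse_two_clusters = sum_squared_error([low, high])
sse_singletons = sum_squared_error([[p] for p in low + high])

print(f"k=1: {sse_one_cluster:.3f}")
print(f"k=2: {sse_two_clusters:.3f}")
print(f"k=6: {sse_singletons:.3f}")
```

The error always falls as the number of clusters grows and reaches zero when every point is its own cluster, so squared error alone cannot choose the number of clusters; it has to be weighed against how useful the resulting groups are.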
5-20
Computational Constraints
• Each cluster should have adequate sample size
• Since cluster averages are used, cluster analysis is not as sensitive to disproportionate cluster sizes as matching is
• The more variables you have, the greater the
computational complexity
– The curse of dimensionality
– (it won’t run in a reasonable time if you have too
many variables)