Customer Profiling and Algorithms


MKT 700
Business Intelligence and Decision Models
Algorithms and Customer Profiling (1)
Classification and Prediction

- Classification: Unsupervised Learning
- Prediction: Supervised Learning

SPSS Direct Marketing

                         Classification             Predictive
Unsupervised Learning    RFM, Cluster analysis,     NA
                         Postal Code Responses
Supervised Learning      Customer Profiling         Propensity to buy
SPSS Analysis

                         Classification             Predictive
Unsupervised Learning    Hierarchical Cluster,      NA
                         Two-Step Cluster,
                         K-Means Cluster
Supervised Learning      Classification Trees       Linear Regression,
                         (CHAID, CART)              Logistic Regression,
                                                    Artificial Neural Nets
Major Algorithms

                         Classification             Predictive
Unsupervised Learning    Euclidean Distance,        NA
                         Log Likelihood
Supervised Learning      Chi-square Statistics,     Log Likelihood,
                         Log Likelihood,            F-Statistics (ANOVA)
                         GINI Impurity Index,
                         F-Statistics (ANOVA)

Nominal: Chi-square, Log Likelihood
Continuous: F-Statistics, Log Likelihood
Euclidean Distance for Continuous Variables

- Pythagorean distance: $d = \sqrt{a^2 + b^2}$
- Euclidean space: $d = \sqrt{a^2 + b^2 + c^2}$
- Euclidean distance: $d = \left[\sum_i d_i^2\right]^{1/2}$ (cluster analysis with continuous variables)
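As a quick illustration, here is a minimal Python sketch of this distance computation; the two vectors are made-up standardized values for two hypothetical customers:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two hypothetical customers described by standardized continuous variables
a = [1.2, 0.4, -0.3]
b = [0.8, 1.0, 0.5]
print(euclidean_distance(a, b))  # ~1.077
```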
Pearson's Chi-Square

Contingency Table

        North   South   East   West   Tot.
Yes     68      75      57     79     279
No      32      45      33     31     141
Tot.    100     120     90     110    420
Observed and Theoretical Frequencies
(theoretical frequencies in parentheses)

        North      South      East       West       Tot.
Yes     68 (66)    75 (80)    57 (60)    79 (73)    279 (66%)
No      32 (34)    45 (40)    33 (30)    31 (37)    141 (34%)
Tot.    100        120        90         110        420
Chi-Square: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

Obs.   fo   fe   fo - fe   (fo - fe)^2   (fo - fe)^2 / fe
1,1    68   66      2          4          .0606
1,2    75   80     -5         25          .3125
1,3    57   60     -3          9          .1500
1,4    79   73      6         36          .4932
2,1    32   34     -2          4          .1176
2,2    45   40      5         25          .6250
2,3    33   30      3          9          .3000
2,4    31   37     -6         36          .9730

χ² = 3.032
Statistical Inference

- DF = (4 columns - 1) × (2 rows - 1) = 3
- Computed χ² = 3.032
- Critical values: 6.251 at p = .10 and 7.815 at p = .05
- Since 3.032 < 6.251, the differences across regions are not statistically significant.
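The same test can be reproduced with SciPy. Note that SciPy keeps the unrounded expected frequencies, so its statistic differs slightly from the hand calculation above, which rounds them to integers:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = Yes/No, columns = North/South/East/West
observed = np.array([[68, 75, 57, 79],
                     [32, 45, 33, 31]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), dof, round(p, 3))  # ~2.761, 3, p ~0.43 -> not significant
print(expected.round(1))                 # unrounded theoretical frequencies
```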
Log Likelihood Chi-Square

Log Likelihood

- Based on probability distributions rather than contingency (frequency) tables.
- Applicable to both categorical and continuous variables, unlike chi-square, which requires variables to be discretized.
Contingency Table (Observed Frequencies)

        Cluster 1   Cluster 2   Total
Male    10          30          40

Contingency Table (Expected Frequencies)

        Cluster 1   Cluster 2   Total
Male    20          20          40
Chi-Square: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

Obs.   fo   fe   fo - fe   (fo - fe)^2   (fo - fe)^2 / fe
1,1    10   20    -10         100          5.00
1,2    30   20     10         100          5.00

χ² = 10.00
p < 0.05; DF = 1; critical value = 3.84
Log Likelihood Distance & Probability

Male        O    E    O/E            Ln(O/E)   O × Ln(O/E)
Cluster 1   10   20   10/20 = .50    -.6931    10 × -.6931 = -6.93
Cluster 2   30   20   30/20 = 1.50    .4055    30 × .4055 = 12.164

2 Σ O × Ln(O/E) = 2 × (-6.93 + 12.164) = 10.46

p < 0.05; critical value = 3.84
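Both statistics for this small example can be verified in a few lines of Python (standard library only):

```python
import math

observed = [10, 30]
expected = [20, 20]

# Pearson chi-square: sum of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Likelihood-ratio (log likelihood) statistic: 2 * sum(O * ln(O/E))
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print(chi2)  # 10.0
print(g2)    # ~10.46; both exceed the 3.84 critical value (DF = 1, p = .05)
```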
Variance, ANOVA, and F-Statistics

F-Statistics

- For metric or continuous variables
- Compares explained (in the model) and unexplained (error) variances

Variance

- SS is the Sum of Squares
- DF = N - 1
- VAR = SS / DF
- SD = √VAR
VALUE   MEAN   DIFFERENCE SQUARED
20      43.6   557
34      43.6   92.16
34      43.6   92.16
38      43.6   31.36
38      43.6   31.36
40      43.6   12.96
41      43.6   6.76
41      43.6   6.76
41      43.6   6.76
42      43.6   2.56
43      43.6   0.36
47      43.6   11.56
47      43.6   11.56
48      43.6   19.36
49      43.6   29.16
49      43.6   29.16
55      43.6   130
55      43.6   130
55      43.6   130
55      43.6   130

COUNT = 20; MEAN = 43.6
SS = 1461; DF = 19
VAR = SS/DF = 76.88; SD = √VAR = 8.768
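A short Python sketch reproduces the COUNT, MEAN, SS, VAR, and SD figures above (the slide rounds SS to 1461):

```python
values = [20, 34, 34, 38, 38, 40, 41, 41, 41, 42,
          43, 47, 47, 48, 49, 49, 55, 55, 55, 55]

n = len(values)                            # COUNT = 20
mean = sum(values) / n                     # MEAN = 43.6
ss = sum((x - mean) ** 2 for x in values)  # SS ~ 1460.8 (slide: 1461)
var = ss / (n - 1)                         # VAR = SS/DF ~ 76.88
sd = var ** 0.5                            # SD ~ 8.768
print(n, mean, round(ss, 1), round(var, 2), round(sd, 3))
```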
ANOVA

- Two groups: t-test
- Three or more groups: are the errors (discrepancies between observations and the overall mean) explained by group membership or by some other (random) effect?
Oneway ANOVA

Group 1: 6, 5, 4, 5, 4, 6, 5, 4  (mean = 4.875)
Group 2: 8, 9, 7, 8, 9, 7, 8, 9  (mean = 8.125)
Group 3: 3, 2, 1, 3, 2, 1, 3, 2  (mean = 2.125)
Grand mean = 5.042

Squared deviations from the group means, (X - group mean)²:

Group 1: 1.266, 0.016, 0.766, 0.016, 0.766, 1.266, 0.016, 0.766  (sum = 4.875)
Group 2: 0.016, 0.766, 1.266, 0.016, 0.766, 1.266, 0.016, 0.766  (sum = 4.875)
Group 3: 0.766, 0.016, 1.266, 0.766, 0.016, 1.266, 0.766, 0.016  (sum = 4.875)

SS Within = 4.875 + 4.875 + 4.875 = 14.625

Squared deviations from the grand mean, (X - 5.042)²:

Group 1: 0.918, 0.002, 1.085, 0.002, 1.085, 0.918, 0.002, 1.085
Group 2: 8.752, 15.668, 3.835, 8.752, 15.668, 3.835, 8.752, 15.668
Group 3: 4.168, 9.252, 16.335, 4.168, 9.252, 16.335, 4.168, 9.252

Total SS = 158.958
F = Mean SS(Between) / Mean SS(Within)

Source            SS        DF            Mean SS
Between Groups    144.333   3 - 1 = 2     72.167
Within Groups     14.625    24 - 3 = 21   0.696
Total             158.958   24 - 1 = 23   6.911

SS Between = Total SS - SS Within = 158.958 - 14.625 = 144.333
F = 72.167 / 0.696 = 103.624; p-value < .05
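The same decomposition can be computed directly from the three groups in Python:

```python
g1 = [6, 5, 4, 5, 4, 6, 5, 4]
g2 = [8, 9, 7, 8, 9, 7, 8, 9]
g3 = [3, 2, 1, 3, 2, 1, 3, 2]
groups = [g1, g2, g3]

all_x = [x for g in groups for x in g]
grand_mean = sum(all_x) / len(all_x)  # 5.042

# Within-group SS: deviations from each group's own mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)  # 14.625
# Total SS: deviations from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_x)                    # 158.958
ss_between = ss_total - ss_within                                       # 144.333

df_between = len(groups) - 1          # 2
df_within = len(all_x) - len(groups)  # 21
f = (ss_between / df_between) / (ss_within / df_within)
print(round(f, 3))  # 103.624
```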
ONEWAY (Excel or SPSS)

Anova: Single Factor

SUMMARY

Groups    Count   Sum   Average   Variance
Group 1   8       39    4.875     0.696
Group 2   8       65    8.125     0.696
Group 3   8       17    2.125     0.696

ANOVA

Source of Variation   SS        df   MS       F         P-value     F crit
Between Groups        144.333   2    72.167   103.624   1.318E-11   3.467
Within Groups         14.625    21   0.696
Total                 158.958   23
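SciPy's one-way ANOVA reproduces this output directly:

```python
from scipy.stats import f_oneway

g1 = [6, 5, 4, 5, 4, 6, 5, 4]
g2 = [8, 9, 7, 8, 9, 7, 8, 9]
g3 = [3, 2, 1, 3, 2, 1, 3, 2]

f, p = f_oneway(g1, g2, g3)
print(f, p)  # F ~ 103.624, p ~ 1.3e-11, matching the table above
```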
Profiling

Customer Profiling: Documenting or Describing

- Who is likely to buy or not respond?
- Who is likely to buy what product or service?
- Who is in danger of lapsing?
CHAID or CART

CHAID: Chi-Square Automatic Interaction Detector

- Based on chi-square
- All variables discretized
- Dependent variable: nominal

CART: Classification and Regression Tree

- Variables can be discrete or continuous
- Based on GINI or F-test
- Dependent variable: nominal or continuous
Use of Decision Trees

- Classify observations from a target binary or nominal variable → Segmentation
- Predictive response analysis from a target numerical variable → Behaviour
- Decision support rules → Processing
Decision Tree Example: dmdata.sav
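SPSS builds CHAID/CART trees on this file natively. As a rough equivalent outside SPSS, here is a minimal scikit-learn sketch of a CART-style tree; it assumes dmdata.sav has been exported to CSV, and the column names are hypothetical:

```python
# Sketch only: scikit-learn grows CART-style (Gini) trees; it has no CHAID.
# Assumes dmdata.sav was exported to dmdata.csv; column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("dmdata.csv")
X = pd.get_dummies(df[["Age", "Gender", "Income"]])  # encode categorical predictors
y = df["Response"]                                   # nominal target: responded or not

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # readable split rules
```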
Underlying Theory: χ²
CHAID Algorithm

Selecting Variables

- Example: Regions (4), Gender (3, including Missing), Age (6, including Missing)
- For each variable, collapse categories to maximize the chi-square test of independence (sketched below). Ex: Region (N, S, E, W, *) → (WSE, N*)
- Select the most significant variable
- Go to the next branch … and the next level
- Stop growing if the estimated χ² < the theoretical χ²
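SciPy and scikit-learn do not ship a CHAID implementation, so the category-collapsing step is sketched below as a hypothetical illustration, reusing the regional Yes/No counts from the chi-square example above:

```python
# Hypothetical sketch of CHAID's merging step: repeatedly merge the pair of
# predictor categories whose 2x2 table with the target is least significant.
import numpy as np
from scipy.stats import chi2_contingency

def collapse_categories(counts, alpha=0.05):
    """counts: dict mapping category -> [responders, non-responders]."""
    cats = {k: np.array(v) for k, v in counts.items()}
    while len(cats) > 2:
        pairs = [(a, b) for a in cats for b in cats if a < b]
        # The pair that is least significantly different is the merge candidate
        a, b = max(pairs, key=lambda p: chi2_contingency([cats[p[0]], cats[p[1]]])[1])
        if chi2_contingency([cats[a], cats[b]])[1] < alpha:
            break  # every remaining pair differs significantly; stop merging
        cats[a + b] = cats.pop(a) + cats.pop(b)
    return cats

# Region example from the slide, with the regional Yes/No counts from above
print(collapse_categories({"N": [68, 32], "S": [75, 45],
                           "E": [57, 33], "W": [79, 31]}))
```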
CART (Nominal Target)

Nominal targets: GINI (Impurity Reduction or Entropy)

- Squared probability of node membership
- Gini = 0 when targets are perfectly classified
- Gini Index = $1 - \sum_i p_i^2$

Example:
- Prob: Bus = 0.4, Car = 0.3, Train = 0.3
- Gini = 1 - (0.4² + 0.3² + 0.3²) = 0.660
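The index is a one-liner in Python:

```python
def gini_index(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p ** 2 for p in probs)

print(gini_index([0.4, 0.3, 0.3]))  # 0.66, as in the Bus/Car/Train example
print(gini_index([1.0, 0.0, 0.0]))  # 0.0 -> a perfectly classified node
```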
CART (Metric Target)

Continuous variables: Variance Reduction (F-test)
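A minimal sketch of the variance-reduction criterion, with illustrative target values that are not from the slides:

```python
# Sketch of CART's split criterion for a metric target: choose the split
# that most reduces the sum of squared deviations of y around the node mean.
def sum_of_squares(y):
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)

def variance_reduction(y_parent, y_left, y_right):
    return sum_of_squares(y_parent) - (sum_of_squares(y_left) + sum_of_squares(y_right))

# Illustrative target values split into two candidate child nodes
parent = [6, 5, 4, 8, 9, 7]
print(variance_reduction(parent, [6, 5, 4], [8, 9, 7]))  # 13.5; larger = better split
```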
Comparative Advantages (from Wikipedia)

- Simple to understand and interpret
- Requires little data preparation
- Able to handle both numerical and categorical data
- Uses a white-box model easily explained by Boolean logic
- Possible to validate a model using statistical tests
- Robust