1 - 國立雲林科技大學
Download
Report
Transcript 1 - 國立雲林科技大學
國立雲林科技大學
National Yunlin University of Science and Technology
Applying Data Mining Technique to
Direct Marketing
Advisor : Dr. Hsu
Student : Sheng-Hsuan Wang
Department of Information Management
National Yunlin University of Science and Technology
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
Background
The Generalized SOM
Experiments
Conclusions
2
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
Firms with the huge amount of complex
marketing data on hand, need to further
analysis and expect to make more profits.
Clustering, a technique of data mining, is
especially suitable for segmenting data.
However, firm’s database usually consist of
mixed data (numeric and categorical data).
3
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
We utilize a new visualized clustering
algorithm, the generalized self-organizing map
(GSOM), to segment customer data for direct
marketing.
─
Unlike conventional SOM, the GSOM can reasonably
express the relatively distance of categorical values.
Then, we apply GSOM to direct marketing
would generate more profits.
4
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (1/5)
Marketing practices have shifted to customeroriented from traditional mass marketing.
Firms usually perform market segmentation
and devise different marketing strategies for
different segments.
5
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (2/5)
Data mining means a process of nontrivial
extraction of implicit, previously unknown and
potentially useful information from a huge
amount of data.
Cluster analysis can assist marketers in
identifying clusters of customers with similar
characteristics.
6
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (3/5)
The self-organizing map (SOM) network,
proposed by Kohonen, is an useful visualized
tool in data mining
─
─
Dimensionality reduction & Information visualization
Preserve the original topological relationship
7
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (4/5)
The approach of the SOM in handling
categorical data
─
It uses binary encoding that transforms categorical values
to a set of binary values.
8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (5/5)
In this paper, we propose an extended SOM,
named generalized SOM (GSOM), to overcome
the drawback in handling categorical data
─
We construct the concept hierarchies for each categorical
attributes.
9
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background (1/2)
Self-organizing map, SOM
─
─
Find the winner (BMU) by (1)
Update the winner and neighborhood by (2)
v arg min || x(t ) wi (t ) || , i {1,..., M } (1)
i
wi (t 1) wi (t ) (t ) hvi (t ) [ x(t ) wi (t )] (2)
|| rv ri || 2
(3)
hvi (t ) exp
2
2 (t )
10
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background (2/2)
Problems of the conventional SOM
D(Coke, Pepsi) = D(Coke, Mocca) = D(Pepsi, Mocca)
11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
The Generalized SOM
We use concept hierarchies to help calculate
the distances of categorical values
─
─
An input pattern and the GSOM vector are mapped to
their associated concept hierarchies.
The distance between the input pattern and the GSOM
vector is calculated by measuring the aggregated
distance of mapping points in the hierarchies.
ID
Drink
1
Coke
2
Pepsi
3 Mocca
Input pattern
Any
Juice
Coffee
Carbonated
mq
Orange Apple Latte Mocca Coke Pepsi
12
x
SOM network
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Concept hierarchies (1/3)
General concepts
0
1
1
2
1
1
1
1
1
1
1
1
Specific concepts
D(Coke, Pepsi) < D(Coke, Mocca) = D(Pepsi, Mocca)
13
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Concept hierarchies (2/3)
ID
Drink
1
Coke
2
Pepsi
3
Mocca
Input pattern
Any
Juice
Coffee
mq=(Pepsi, 1.7)
Carbonated
mq
Orange Apple Latte Mocca Coke Pepsi
x
SOM network
A point X=(NX, dX)
NX: an anchor (leaf node) of point X
dX: a positive offset (distance) from X to root
Example: x=(Coke, 2.0); mq=(Pepsi, 1.7)
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Concept hierarchies (3/3)
Any
0
1
2
Juice
Coffee
duplication
Carbonated
mq
red blue
dx dmq
Orange Apple Latte Mocca Coke Pepsi
x
| X Y | d X dY 2 d LA (4)
d LA min( d X , dY , d LCA ( N X , NY ) ) (5)
Example: x=(Coke, 2.0); mq=(Pepsi, 1.7)
|x – mq | = 2 + 1.7 – 2×1 = 1.7
15
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments
Experiment dataset
─
─
Synthetic dataset consists of 6 groups of two
categorical attributes, Department and Drink.
Real dataset Adult from the UCI repository
With 48,842 patterns of 15 attributes.
8 categorical attributes, 6 numerical attributes, and 1 class
attribute Salary.
76% of the patterns have the value of ≤50K.
16
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments
Parameters were set according to the
suggestion in the software package SOM_PAK.
─
─
Categorical values are transformed to binary values
when we train the SOM.
While mixed data are used directly when we train the
GSOM. Each link weight of concept hierarchies is set
to 1.
17
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Synthetic dataset (1/2)
Group No
Department
Drink
Data Count
Common Pattern
1
EE
Coke
20
2
CE
Pepsi
10
Engineering College &
Carbonated Drinks
3
MIS
Latte
20
4
MBA
Mocca
10
5
VC
Orange
20
6
SD
Apple
10
Management College &
Coffee Drinks
Design College &
Juice Drinks
Any
Any
Engineering
EE
Management
CE MIS
Juice
Design
MBA VC
SD
Department
Orange
Coffee
Apple Latte
Carbonated
Mocca Coke
Pepsi
Drink
18
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Synthetic dataset (2/2)
─
An 8×8 SOM network is used for the training. After
900 training iterations, the trained maps of SOM and
GSOM under the same parameters are shown in below.
Binary SOM
GSOM
19
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Real dataset (1/3)
We randomly draw 10,000 patterns which have
75.76% of ≤50K, similar to the Salary
distribution of the original Adult dataset
─
─
Three categorical attributes, Marital-status,
Relationship, and Education.
Four numeric attributes, Capital-gain, Capital-loss, Age,
and Hours-per-week.
20
Intelligent Database Systems Lab
Concept hierarchies for the categorical attributes are
constructed as shown in below.
─
Advanced
College
HighSchool
Junior
Real dataset (2/3)
21
Intelligent Database Systems Lab
Doctorate
Prof-school
Masters
Bachelors
Assoc-acdm
Assoc-voc
Some-college
HS-grad
12th
11th
10th
9th
7th-8th
5th-6th
1st-4th
Preschool
Married-civ-spouse
Married-AF-spouse
Married-spouse-absent
Widowed
Divorced
Separated
Never-married
Husband
Wife
Own-child
Other-relative
Not-in-family
Unmarried
Little
Education
Marital-status
Relationship
Couple
Single
ANY
ANY
ANY
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Real dataset (3/3)
─
A 15×15 SOM network is used for the training. After
50,000 iterations, the trained maps of SOM and GSOM
under the same parameters are shown in below.
Binary SOM
GSOM
22
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Distributions of Salary attribute in each cluster
Group
No
7
6
3
1
2
4
5
All
Data
Count
2,626
1,950
1,753
1,045
1,133
740
734
9,981
No. of
>50K
1,287
794
216
59
48
12
8
2,424
No. of
≤50K
1,339
1,156
1,537
986
1,085
728
726
7,557
Ratio of
>50K
49.01%
40.72%
12.32%
5.65%
4.24%
1.62%
1.09%
24.29%
Ratio of
≤50K
50.99%
59.28%
87.68%
94.35%
95.76%
98.38%
98.91%
75.71%
23
Intelligent Database Systems Lab
Application to Direct Marketing (1/2)
After we utilize the GSOM to perform data
clustering, this segmented dataset can be
further applied to catalog marketing.
Suppose that
─
─
─
The cost of mailing a catalog is $2.
The customers whose salaries are over 50K, we make
an average profit of $10 per person.
Otherwise, we make an average profit of $1 per person.
24
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Application to Direct Marketing (2/2)
Expected N(>50K)×10+N(≤50K)×1-[N(>50K)+N(≤50K)]×2
profits
17,500
$14,344
15,000
12,500
10,000
7,500
$7,505
5,000
Trained
2,500
Random
0
0
2,626
4,576 6,329 7,374 8,507
Number of patterns
9,247 9,981
25
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Conclusions
In this paper, we propose a data clustering method
─
─
─
The GSOM extends the conventional SOM and overcomes
its drawback in handling categorical data by utilizing
concept hierarchies.
The experimental results confirmed that the GSOM can
better reveal the cluster structure of data than the
conventional SOM does.
We can make more profits by the marketing based on the
segmentation results of the GSOM than by the marketing
to the customers randomly drawn from the customer
database.
26
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Q &A
27
Intelligent Database Systems Lab