E³ Global Fusion


Lecture 3
MARK2039
Winter 2006
George Brown College
Wednesday 9-12
Recap
• What are the four stages of data mining, and who are the stakeholders?
• Data mining measures and metrics
– Mean
– Median
– Mode
– Standard Deviation
• Why are the above statistics important in evaluating numbers?
Boire Filler Group
Recap
• Is the average (mean) appropriate for deriving insight about the behaviour of a group, segment or sample?
• Why do we need to look at how numbers vary?
• What are some of the measures used to assess variation?
Recap
Distribution A    Distribution B
350               700
500               725
750               750
1000              775
1150              800
Two distributions are shown above. What do they mean, and how would you interpret the results? Both distributions have the same median and mean.
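As a quick check on the two distributions above, the following minimal sketch computes the centre and the spread of each one. The helper name `summarize` is illustrative only and is not part of the lecture.

```python
from statistics import mean, median, stdev

dist_a = [350, 500, 750, 1000, 1150]
dist_b = [700, 725, 750, 775, 800]

def summarize(values):
    """Return centre and spread measures for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),               # sample standard deviation
        "range": max(values) - min(values),
    }

summary_a = summarize(dist_a)
summary_b = summarize(dist_b)
print("A:", summary_a)
print("B:", summary_b)
```

Both summaries share the same mean and median, so only the spread measures separate the two groups.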
Recap
Distribution A    Distribution B
3                 3
4                 4
5                 5
6                 6
7                 7
8                 1000
What is the problem here?
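One way to explore the question is to compare the mean and the median of each distribution. This is a small sketch using the values from the slide.

```python
from statistics import mean, median

dist_a = [3, 4, 5, 6, 7, 8]
dist_b = [3, 4, 5, 6, 7, 1000]

mean_a, median_a = mean(dist_a), median(dist_a)
mean_b, median_b = mean(dist_b), median(dist_b)

# A single extreme value moves the mean a long way but leaves the median alone.
print(f"A: mean={mean_a:.2f} median={median_a}")
print(f"B: mean={mean_b:.2f} median={median_b}")
```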
Recap
• Consider the following two distributions ...
         Distribution A    Distribution B
         0                 4,500
         1,000             4,600
         2,000             4,700
         3,000             4,800
         4,000             4,900
         5,000             5,000
         6,000             5,100
         7,000             5,200
         8,000             5,300
         9,000             5,400
         10,000            5,500
Mean     5,000             5,000
Stdev    3,316.62          331.66
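The figures in the table can be reproduced with the sample standard deviation (dividing by N - 1). The sketch below assumes that this is the formula behind the slide, which the numbers are consistent with.

```python
from statistics import mean, stdev

dist_a = list(range(0, 11000, 1000))     # 0, 1,000, ..., 10,000
dist_b = list(range(4500, 5600, 100))    # 4,500, 4,600, ..., 5,500

mean_a, stdev_a = mean(dist_a), stdev(dist_a)
mean_b, stdev_b = mean(dist_b), stdev(dist_b)

print(f"A: mean={mean_a:,} stdev={stdev_a:,.2f}")
print(f"B: mean={mean_b:,} stdev={stdev_b:,.2f}")
```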
Recap
• For a binomial distribution, such as response, we must use a different formula:
  Stdev = sq. root of ((p * q) / N)
  where p is the response rate, q = 1 - p, and N is the sample size.
Responder        1
Non-responder    0
Non-responder    0
Responder        1
Non-responder    0
Responder        1
Non-responder    0
Non-responder    0
Non-responder    0
Non-responder    0
Mean             0.300
Stdev            0.145
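The mean and standard deviation in the table follow directly from the formula. This is a minimal sketch, and the variable names are illustrative.

```python
from math import sqrt

# 1 = responder, 0 = non-responder, in the order listed on the slide
responses = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

n = len(responses)
p = sum(responses) / n        # response rate (the mean of the 1/0 values)
q = 1 - p                     # non-response rate
std_dev = sqrt(p * q / n)     # binomial standard deviation of the rate

print(f"Mean:  {p:.3f}")
print(f"Stdev: {std_dev:.3f}")
```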
Recap
• What are indexes?
• Give me some examples.
• Why are they important in the marketing world?
• What is the most common one used in the marketing world?
Lift
• Lift represents a relative comparison between two numbers. It is a type of index. How is it normally used?
• Typically, it represents the value for a particular group divided by the average (X1/average).
• Example:
                Response Rate
Target Group    2%
Average         1.50%
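Applying X1/average to the example gives the lift of the target group. The function name `lift` is an illustrative choice.

```python
def lift(group_rate, average_rate):
    """Relative performance of a group versus the average (X1 / average)."""
    return group_rate / average_rate

target_lift = lift(0.02, 0.015)
print(f"Lift: {target_lift:.2f}")
```

A lift above 1 means the target group responds better than average; here the target group responds at about one and a third times the average rate.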
Recap-Lift
• Use relative measures and not absolutes
• The notion of “lift” should be the marketer’s key
determinant of success
Example
              Campaign 1         Campaign 2
Strategy 1    3% Resp. Rate      23% Resp. Rate
Strategy 2    1.5% Resp. Rate    21.5% Resp. Rate
Difference    1.5% Resp. Rate    1.5% Resp. Rate
What is the key learning here?
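To compare the absolute and relative views of the example, the sketch below computes both the difference and the lift of Strategy 1 over Strategy 2 for each campaign. Treating Strategy 2 as the baseline is an assumption made for illustration.

```python
# (Strategy 1 rate, Strategy 2 rate) per campaign, from the table
campaigns = {
    "Campaign 1": (0.03, 0.015),
    "Campaign 2": (0.23, 0.215),
}

results = {}
for name, (strategy_1, strategy_2) in campaigns.items():
    results[name] = {
        "difference": strategy_1 - strategy_2,   # absolute measure
        "lift": strategy_1 / strategy_2,         # relative measure
    }

for name, r in results.items():
    print(f"{name}: difference={r['difference']:.3f} lift={r['lift']:.2f}")
```

The absolute difference is the same in both campaigns, while the lift is not.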
Assignment 2
1. Answer the following questions on the table listed below:
          Col. A    Col. B    Col. C
          240       4000      300
          250       4000      300
          220       3000      300
          250       2000      400
          240       1000      100
          240       5000      150
          240       3000      150
          260       3000      400
          50        2000      500
          235       3000      2000
• Calculate the following averages and medians for each column of numbers
mean      222.5     3000      460
median    240       3000      300
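The answers can be verified with the standard library. This is a small sketch, and the dictionary layout is an illustrative choice.

```python
from statistics import mean, median

columns = {
    "A": [240, 250, 220, 250, 240, 240, 240, 260, 50, 235],
    "B": [4000, 4000, 3000, 2000, 1000, 5000, 3000, 3000, 2000, 3000],
    "C": [300, 300, 300, 400, 100, 150, 150, 400, 500, 2000],
}

answers = {
    name: {"mean": mean(values), "median": median(values)}
    for name, values in columns.items()
}

for name, a in answers.items():
    print(f"Col. {name}: mean={a['mean']} median={a['median']}")
```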
Assignment 2
• What kind of distributions are Col. A and Col. C, and what metric would be best used to communicate them to business users?
Skewed, asymmetric, or non-normal. The median is the key measure.
• Which column would be most reliable in estimating results for a larger population, and why?
Col. A, as its std. deviation is the smallest, which allows our range around the mean to be much tighter.
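The comparison of spreads can be checked as follows. This sketch uses the sample standard deviation; the choice of sample rather than population formula is an assumption, and it does not change the ranking.

```python
from statistics import stdev

columns = {
    "A": [240, 250, 220, 250, 240, 240, 240, 260, 50, 235],
    "B": [4000, 4000, 3000, 2000, 1000, 5000, 3000, 3000, 2000, 3000],
    "C": [300, 300, 300, 400, 100, 150, 150, 400, 500, 2000],
}

spread = {name: stdev(values) for name, values in columns.items()}
tightest = min(spread, key=spread.get)   # column with the smallest std. deviation

for name, value in spread.items():
    print(f"Col. {name}: stdev={value:.1f}")
print("Smallest std. deviation: Col.", tightest)
```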
Assignment 2
2 marks
2. The median height of 65 inches is the same for two classes. Yet the average in one class is 65 inches vs. 70 inches in the other class. What is causing this difference?
An outlier (a very tall person) is pulling the mean of one class up to 70 inches.
Assignment 2
• Calculate the index values for each variable for Customer A.
• Why are indexes useful in database marketing?
Spending: .5
Tenure: .5
Income: 1.2
Indexes are useful as relative measures: they compare a value to the average and make it possible to rank-order or prioritize records.
Evaluating test results
• In database marketing, marketers are
constantly asked what to conclude from
their testing results.
• For instance, are the results of one strategy significantly different from those of another strategy?
• Let’s take a look at some examples.
Evaluating Marketing Test
• Two groups of cells have been tested for
different communication strategies.
Results are as follows. What would you
conclude?
Strategy    Sample Size    Response Rate
A           10000          2.30%
B           5000           2%
Evaluating Marketing Test
• To determine this, you need to do statistical testing, which essentially comprises three factors:
– Confidence level that you want
– Actual standard deviation based on the smaller sample size
– Response rate or performance rate
• For our purposes, we will use a 95% confidence interval, which essentially translates into 2 standard deviations around the mean.
Evaluating Marketing Test
• Calculate the following confidence intervals at 95%
– 1% with a std. deviation of .1%
– 2% with a std. deviation of .05%
– 5% with a std. deviation of .5%
– 5% with a std. deviation of .3%
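The four intervals above can be worked with the rule of thumb of 2 standard deviations around the mean. This is a minimal sketch; values are kept in percent, and the function name `interval_95` is illustrative.

```python
def interval_95(rate, std_dev):
    """Approximate 95% interval: rate plus or minus 2 standard deviations."""
    return rate - 2 * std_dev, rate + 2 * std_dev

# (rate in %, std. deviation in %) for the four exercises
cases = [(1.0, 0.1), (2.0, 0.05), (5.0, 0.5), (5.0, 0.3)]
intervals = [interval_95(rate, sd) for rate, sd in cases]

for (rate, sd), (low, high) in zip(cases, intervals):
    print(f"{rate}% with std. deviation {sd}%: {low:.2f}% to {high:.2f}%")
```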
• Let’s get back to the problem
Evaluating Marketing Test
• Two groups of cells have been tested for
different communication strategies.
Results are as follows. What would you
conclude?
Strategy    Sample Size    Response Rate
A           10000          2.30%
B           5000           2%
Evaluating Marketing Test
• Calculate the standard deviation first, using the sample with the smaller size (Strategy B).
– Std. deviation = sq. root of ((p * q) / N)
– Sq. root of ((.02 x .98) / 5000) = .00198
– 95% confidence interval =
• .02 + 2 * .00198 and .02 - 2 * .00198
• .01604 <= .02 <= .02396
– Based on this result, what can you conclude about Strategy A versus Strategy B?
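The calculation above can be sketched in a few lines. The function name `confidence_interval` and the default multiplier `z=2` are illustrative choices that follow the 2-standard-deviation rule used in this lecture.

```python
from math import sqrt

def confidence_interval(p, n, z=2):
    """Interval of z standard deviations around a response rate p with sample size n."""
    std_dev = sqrt(p * (1 - p) / n)
    return p - z * std_dev, p + z * std_dev

# Strategy B has the smaller sample, so the interval is built around it
low, high = confidence_interval(0.02, 5000)

rate_a = 0.023
a_inside = low <= rate_a <= high

print(f"Interval for B: {low:.5f} to {high:.5f}")
print("Strategy A's rate falls inside the interval:", a_inside)
```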
Evaluating Marketing Test
Results
• Two other groups of cells have been
tested for different communication
strategies. Results are as follows. What
would you conclude?
Strategy    Sample Size    Response Rate
A           1000           5.00%
B           2000           3%
Suppose A becomes 3.3%. What would you conclude?
Evaluating Marketing Test
• Calculate the standard deviation first, using the sample with the smaller size (Strategy A).
– Std. deviation = sq. root of ((p * q) / N)
– Sq. root of ((.05 x .95) / 1000) = .00689
– 95% confidence interval =
• .05 + 2 * .00689 and .05 - 2 * .00689
• .03622 <= .05 <= .06378
– Based on this result, what can you conclude about Strategy A versus Strategy B?
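Both versions of this example (A at 5% and A at 3.3%, each against B at 3%) can be run through the same check. This is a sketch under the lecture's rule of 2 standard deviations around the smaller sample; the function name `rates_differ` is hypothetical.

```python
from math import sqrt

def rates_differ(p_small, n_small, p_other, z=2):
    """True if p_other lies outside z standard deviations of p_small.

    p_small and n_small describe the cell with the smaller sample size.
    """
    std_dev = sqrt(p_small * (1 - p_small) / n_small)
    low, high = p_small - z * std_dev, p_small + z * std_dev
    return not (low <= p_other <= high)

original_case = rates_differ(0.05, 1000, 0.03)    # A = 5.00%, B = 3%
revised_case = rates_differ(0.033, 1000, 0.03)    # A = 3.3%,  B = 3%

print("A at 5.0% vs B at 3%: different?", original_case)
print("A at 3.3% vs B at 3%: different?", revised_case)
```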
Evaluating Marketing Test
Results
• Two other groups of cells have been
tested for different communication
strategies. Results are as follows. What
would you conclude?
Strategy    Sample Size    Response Rate
A           1000           5.00%
B           2000           4.0%
Suppose B becomes 4.0%. What would you conclude?
Evaluating Marketing Test Results
• Calculate the standard deviation first, using the sample with the smaller size (Strategy A).
– Std. deviation = sq. root of ((p * q) / N)
– Sq. root of ((.05 x .95) / 1000) = .00689
– 95% confidence interval =
• .05 + 2 * .00689 and .05 - 2 * .00689
• .03622 <= .05 <= .06378
– Based on this result, what can you conclude about Strategy A versus Strategy B?
Evaluating Marketing Test Results
• Having done several of these tests, what will cause your confidence range to narrow?
– Larger sample size
– Smaller response rates
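Both effects can be seen by computing the width of the interval for a few combinations. The sample sizes and rates below are illustrative values, not figures from the lecture. Note that a smaller response rate narrows the range only while the rate is below 50%, which is the usual case for response data.

```python
from math import sqrt

def interval_width(p, n, z=2):
    """Full width of the interval of z standard deviations around rate p."""
    return 2 * z * sqrt(p * (1 - p) / n)

# Hypothetical grid: two response rates, three sample sizes
widths = {
    (p, n): interval_width(p, n)
    for p in (0.02, 0.05)
    for n in (1000, 5000, 10000)
}

for (p, n), width in widths.items():
    print(f"rate={p:.0%} n={n}: width={width:.4f}")
```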
Data
Review of Data
Types Of Data/Format
• Character-Level Data
• Numeric Data
• Date
• Give me some examples
• In data mining, what do we have to do with all data before building a solution?
Data Format Examples
• Gender
• Income
• Spending
• Birthdate
• Customer type
• How would you use gender, customer type, and birthdate in a data mining exercise?
Data Transformation
• Gender Variable
– Male=1, non-male=0
– Female=1, non-female=0
– What happens to missing values here?
• Customer Type Variable
– Gold member=1, non-gold member=0
– Platinum member=1, non-platinum member=0
– Etc.
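The binary transformations above can be sketched as follows. The record layout, the helper `binary_flags` and the sample records are hypothetical. In this sketch a missing value receives 0 on every flag, which is one possible treatment and not necessarily the intended answer to the question on the slide.

```python
def binary_flags(value, categories):
    """One 1/0 flag per category; a missing (None) value gets 0 everywhere."""
    return {f"is_{c.lower()}": int(value == c) for c in categories}

# Hypothetical customer records
records = [
    {"gender": "Male", "customer_type": "Gold"},
    {"gender": "Female", "customer_type": "Platinum"},
    {"gender": None, "customer_type": None},      # missing values
]

encoded = []
for record in records:
    row = {}
    row.update(binary_flags(record["gender"], ["Male", "Female"]))
    row.update(binary_flags(record["customer_type"], ["Gold", "Platinum"]))
    encoded.append(row)

for row in encoded:
    print(row)
```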
Data Transformation
• Birthdate
– Convert birthdate to age
– Extract birth year from birthdate field and subtract from current year (e.g., 2005 - 1954)
• Date of last Spending Activity
– Create recency of last spend
– Create tenure variable
– How would this be done?
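One possible way to derive these variables is sketched below. All dates are hypothetical, and the sketch assumes that tenure is measured from a first-activity date, which the slide does not specify.

```python
from datetime import date

as_of = date(2005, 12, 31)            # hypothetical reference ("current") date

# Hypothetical customer fields
birthdate = date(1954, 6, 15)
first_activity = date(2001, 12, 31)   # assumed starting point for tenure
last_spend = date(2005, 10, 2)

age = as_of.year - birthdate.year                  # year subtraction, as on the slide
recency_days = (as_of - last_spend).days           # days since the last spend
tenure_years = as_of.year - first_activity.year    # years since first activity

print(f"age={age} recency_days={recency_days} tenure_years={tenure_years}")
```

Subtracting years only is the approach described on the slide; it can be off by one for customers whose birthday has not yet occurred in the reference year.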
Data
• Discrete vs. index vs. continuous
• Discrete
– Yes/No
– On/Off
• Convert above type data to 1,0 type
scenario
Data
• Index Type Data
Customer Type    Average Spend
Regular          100
Gold             200
Platinum         300
Average          125

List Source      Average Spend
A                200
B                400
C                600
Average          400
We could convert each customer type to a binary value. But what would be a more valuable way to convert or transform this variable?
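One reading of the slide title is that each category can be replaced by its index: its average spend divided by the overall average. The sketch below works under that assumption; the function name `to_index` is illustrative.

```python
def to_index(category_averages, overall_average):
    """Replace each category with its average spend relative to the overall average."""
    return {cat: avg / overall_average for cat, avg in category_averages.items()}

customer_type_index = to_index(
    {"Regular": 100, "Gold": 200, "Platinum": 300}, overall_average=125
)
list_source_index = to_index({"A": 200, "B": 400, "C": 600}, overall_average=400)

print(customer_type_index)
print(list_source_index)
```

Unlike a set of 1/0 flags, a single index variable keeps both the order and the relative size of the categories.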
Data
• Continuous data
– What are some examples?
• What does it mean when we say that data is continuous?
Data Type
• Looking at data as we have in the last
number of slides, we can create what we
call data categories:
– Nominal
– Ordinal
– Interval
Data Categories
• Nominal variables are variables where the
values do not represent any real order or
magnitude of value.
• Examples:
– Gender
– Product Category
– Promotion Category
Data Categories
• Ordinal Variables represent fields where
the values have some order
• Good examples are:
– index-type variables
– Model rank
– Etc.
Data Categories
• Interval Variables represent fields where
the actual values indicate order but also
magnitude.
– Income
– Spend
– Model Score
• What data category is the most granular?
• Which category might you typically expect
to be more powerful in a data mining
exercise?
Data Usefulness
• When is Data Useful?
– Few missing values
– Variable does not consist primarily of one value
– Non-numeric data does not consist of too many values which cannot be properly grouped into more meaningful categories
Examples-Analytical Perspective
Variable          # of records    Data field format    # of unique values    # of missing values
Income            100000          numeric              50000                 2000
Customer Type     100000          character            4                     10000
Gender            100000          character            2                     50000
Household Size    100000          numeric              7                     90000
Product Type      100000          character            3000                  5000
Customer Name     100000          character            100000                0
Postal Code       100000          character            50000                 0
What fields are useful and why?
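To support the discussion, the table can be turned into two simple ratios per variable: the share of records that are missing and the number of unique values relative to the number of records. This is a sketch only; it reports the ratios and leaves the judgement of usefulness to the reader.

```python
# (variable, records, format, unique values, missing values) from the table
fields = [
    ("Income", 100000, "numeric", 50000, 2000),
    ("Customer Type", 100000, "character", 4, 10000),
    ("Gender", 100000, "character", 2, 50000),
    ("Household Size", 100000, "numeric", 7, 90000),
    ("Product Type", 100000, "character", 3000, 5000),
    ("Customer Name", 100000, "character", 100000, 0),
    ("Postal Code", 100000, "character", 50000, 0),
]

profile = {}
for name, records, fmt, unique, missing in fields:
    profile[name] = {
        "format": fmt,
        "missing_rate": missing / records,   # share of records with no value
        "unique_ratio": unique / records,    # 1.0 means every record is different
    }

for name, p in profile.items():
    print(f"{name:<15} missing={p['missing_rate']:.0%} unique_ratio={p['unique_ratio']:.4f}")
```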
Examples
Closer look at income
Income         % of records
<25000         25%
25000-50000    25%
50000-75000    25%
75000+         23%
Missing        2%

Closer look at gender
Gender     % of records
Male       23%
Female     27%
Missing    50%
Examples
• Closer Look at Customer Type
Customer Type    % of records
Gold             5%
Bronze           40%
Silver           30%
Platinum         15%
Missing          10%

Closer look at Product Type
Product Type    % of records    Cum. % of records
A001            0.07%           0.07%
B001            0.08%           0.15%
C003            0.06%           0.21%
A010            0.06%           0.27%
….
missing         5%              99.92%
Z004            0.08%           100%
Examples
Variables                                 # of records    Data Field Format    # of unique values    # of missing values
1st 3 digits of postal code               100000          character            ?                     100000
household size                            100000          numeric              ?                     100000
Credit score                              100000          numeric              ?                     100000
mortgage account                          100000          character            ?                     100000
Product code                              100000          character            ?                     100000
Median Income of Postal Code of record    100000          numeric              ?                     100000
• What variables would be useful here?
• What would be the number of unique values?
Examples
Variables                                 # of records    Data Field Format    # of unique values    # of missing values
1st 3 digits of postal code               100000          character            100000                0
household size                            100000          numeric              100000                0
Credit score                              100000          numeric              100000                0
mortgage account                          100000          character            100000                0
Product code                              100000          character            100000                0
Median Income of Postal Code of record    100000          numeric              100000                0
• What variables would be useful here?
Examples-Marketing Perspective
• A mortgage company is conducting a
campaign to its high value customers. One of
the key characteristics of value is high income
which is self-reported at time of application.
Income          % of records
< 30000         5%
30000-60000     5%
60000-80000     20%
80000-100000    10%
100000+         10%
missing         50%
As a marketer, how will you use this information and what do you need to consider?
Examples-Marketing Perspective
• An insurance company is marketing an
insurance product to people over the age of
60. Listed below is a report indicating the
distribution of age.
Age        % of records
<30        5%
30-40      10%
40-55      15%
55-65      10%
65+        10%
missing    50%
As a marketer, how will you use this information?
Examples-Marketing Perspective
• A retail company has over 1000 product SKUs. After investigation, it has been determined that the 1st digit represents a broader product category. You have been asked to design the product layout for all stores.
Product SKU    % of records    Cum. % of records
A000003        0.03%           0.03%
A000004        0.02%           0.05%
B000005        0.03%           0.08%
B000006        0.04%           0.12%
….
Z999999        0.02%           100%
As a marketer, how will you use this information?
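One way to act on the category finding is to roll the SKUs up by their first character. The sketch below uses only the four fully listed rows at the top of the table, so the totals cover those rows and nothing else.

```python
from collections import defaultdict

# (SKU, % of records) for the four fully listed rows at the top of the table
sku_share = [
    ("A000003", 0.03),
    ("A000004", 0.02),
    ("B000005", 0.03),
    ("B000006", 0.04),
]

category_share = defaultdict(float)
for sku, pct in sku_share:
    category = sku[0]                  # first character = broader product category
    category_share[category] += pct

for category, pct in sorted(category_share.items()):
    print(f"Category {category}: {pct:.2f}% of records")
```

Grouping this way reduces over 1000 SKUs to a small number of categories that are easier to report on and plan around.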
Examples-Marketing Perspective
Gender     % of records
Male       10%
Female     12%
Missing    88%

Income     % of records
0-20K      5%
20K-40K    4%
40K-60K    7%
60K-80K    6%
80K+       5%
missing    73%
What can be done here, if anything, and what else can we consider in terms of using gender and
Examples-Marketing Perspective
• You have postal code information for each customer. You are asked to design customer reports by province. How would you do this?
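Assuming the postal codes are Canadian, the first letter of the code identifies the province or territory, so a lookup on that letter is one way to build the report. The sample postal codes below are hypothetical, and the letter X is shared by the Northwest Territories and Nunavut.

```python
from collections import Counter

# First letter of a Canadian postal code -> province or territory
PROVINCE_BY_FIRST_LETTER = {
    "A": "NL", "B": "NS", "C": "PE", "E": "NB",
    "G": "QC", "H": "QC", "J": "QC",
    "K": "ON", "L": "ON", "M": "ON", "N": "ON", "P": "ON",
    "R": "MB", "S": "SK", "T": "AB", "V": "BC",
    "X": "NT/NU", "Y": "YT",
}

def province_of(postal_code):
    """Return the province code for a postal code, or 'Unknown' if it cannot be mapped."""
    if not postal_code:
        return "Unknown"
    first_letter = postal_code.strip().upper()[:1]
    return PROVINCE_BY_FIRST_LETTER.get(first_letter, "Unknown")

# Hypothetical customer postal codes
customers = ["M5T 2T9", "H3A 0G4", "v6b 1a1", None]
counts = Counter(province_of(code) for code in customers)
print(counts)
```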
Examples-Data Mining Perspective
• You have the following variables and values
– Gender: 'M': Male
          'F': Female
– Age: 'B': <20M
       'F': 20M-40M
       'R': 40M-60M
       'S': 60M-80M
       'T': 80M-100M
       'Z': 100M+
• What must be done here?
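One possible answer is that the character codes have to be transformed before they can be used: gender into 1/0 flags and the banded codes into an ordered rank. The sketch below illustrates that idea with hypothetical records. The band labels are copied as they appear on the slide, and because the code 'F' appears under both variables, each variable needs its own lookup.

```python
# Band codes in ascending order, with labels exactly as shown on the slide
AGE_BAND_LABEL = {
    "B": "<20M", "F": "20M-40M", "R": "40M-60M",
    "S": "60M-80M", "T": "80M-100M", "Z": "100M+",
}
AGE_BAND_RANK = {code: rank for rank, code in enumerate(AGE_BAND_LABEL, start=1)}

def transform(record):
    """Turn raw character codes into numeric variables; missing codes become None or 0."""
    return {
        "is_male": int(record.get("gender") == "M"),
        "is_female": int(record.get("gender") == "F"),
        "age_band_rank": AGE_BAND_RANK.get(record.get("age")),
    }

# Hypothetical records
records = [
    {"gender": "F", "age": "F"},
    {"gender": "M", "age": "Z"},
    {"gender": None, "age": None},
]
transformed = [transform(r) for r in records]
for row in transformed:
    print(row)
```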