Client`s Logo/Name

Download Report

Transcript Client`s Logo/Name

Lecture 4
MARK2039
Winter 2006
George Brown College
Wednesday 9-12
Assignment 31) You are given the marketing results of three different initiatives. What would you
conclude and why for each initiative listed below? Remember you need to base
your conclusion using a 95% confidence interval.
Initiative 1:
Strategy 1
Strategy 2
Response Rate
5.00%
4.50%
Quantity
30000
25000
.13%-Std.dev.
4.24%<=4.50%<=4.76%
would do strategy 1 as it is stat. sign.and therefore different from strategy 2.
2
Assignment 3
1) continued
Initiative 2
Response Rate
6.00%
3.00%
Strategy 3
Strategy 4
Quantity
100
200
Std.dev.=2.37%
1.26%<=6%<=10.74%
Would do either strategy as there is no stat. diff. between strategies.
Initiative 3
Strategy 4
Strategy 5
Response Rate
45%
40%
Quantity
15000
10000
Std.dev.=.489%
39.02%<= 40%<=40.98%
would do strategy 4 as it is stat. different and significant from strategy 5.
3
Assignment 3
2) Listed below is a table containing a number of variables. Explain the reason why
each variable is useful or not useful in a future analysis.
Variables
# of records
Data Field
Format
# of unique
values
# of missing
values
1st 3 digits of postal
code
household size
Credit score
mortgage account
Product code
100000
100000
100000
100000
100000
character
numeric
numeric
character
character
1000
10
90000
2
5000
0
20000
95000
0
0
Median Income of
Postal Code of record
100000
numeric
20000
0
4
Assignment 3
2)continued
1st 3 digits are not useful. Character-level data will be converted to binary
variables(1000) and at this level data is too granular and too sparse for any real
impact in a data mining exercise.
Household size-useful as % of missing values is below 50%.
Credit score: not useful as % of missing values at 95%.
Mortgage account: useful-0 missing values and data can be meaningfully grouped
for this character type data.
Product code: not useful as data is character and will need to be converted to
binary variables(5000) . These binary variables will be too granular to find any
meaningful patterns.
Median Income of Postal Code: useful-data granularity high but ok as variable is
numeric and % of missing values is low- 0%.
5
Recap-
Data,Data,Data-What Phase of the Data Mining Process are we in?
• Data Formats?
– Examples?
• Data Transformations? What do we mean here?
Examples?
• In all data mining projects, what must the final data values be?
6
Recap• Data,Data,Data-What Phase of the Data Mining Process are we in?
• Data Types? What are they?
• What is discrete vs. index vs. continuous
and how do they relate to Data Type.
– Birthdate
-Gender
– Product category
-Model rank
– Spending percentile
-Income
– Promotion Date
-Model Score
7
Recap
•Let’s take a look at postal code
Postal Code
M5A1G5
V6A1A1
…..
H4B2E5
% of Records
0.0020%
0.0035%
0.0029%
•How would you use the info here. Create
binary variables for every postal code
value. Is there another better way to
group?
8
Types of Data
• Nominal
• Ordinal
• Interval
• Nominal is basically a yes/no variable or variable with outcomes that have
no order or sense of magnitude to the numbers
– Derived variables are coded as 0,1.
– Give me an example of this and how you would create a nominal
variable for data mining?. Assume you are analyzing response rate
trends for 10000 customers?
9
Types of Data
• Ordinal
– There is order to the values of the variable
– Give me examples of this and what it would like in a data mining
exercise. Assume you are analyzing response rate trends for 10000
customers?
• Interval
– There is a sense of magnitude between two values.
– Give me examples of this and what it would like in a data mining
exercise. Assume you are analyzing response rate trends for 10000
customers?
• How does ordinal and interval differ. Explain it within the context of a data
mining exercise where we analyze response?
10
Data Usefulness
• When is Data Useful?
– Few Missing values
– Variable does not consist primarily of one value
– Non-Numeric Data consists of only a few values which can be
properly grouped into more meaningful categories
11
Examples-Analytical Perspective
Variable
Income
Customer Type
Gender
Household Size
Product Type
Customer Name
Postal Code
# of records
100000
100000
100000
100000
100000
100000
100000
Data Field
format
numeric
character
character
numeric
character
character
character
# of Unique
values
50000
4
2
7
3000
100000
50000
# of
missing
values
2000
10000
50000
90000
5000
0
0
What fields are useful and why?
12
Examples
Closer look at income
Income
% of Records
<25000
25000-50000
50000-75000
75000+
Missing
25%
25%
25%
23%
2%
Closer look at gender
Gender
Male
Female
Missing
% of records
23%
27%
50%
Create a data mining
response rate trend
with each variable
For both variables,
demonstrate how no
response rate might exist.
13
Examples
• Closer Look at Customer Type
Customer Type
Gold
Bronze
Silver
Platinum
Missing
% of records
5%
40%
30%
15%
10%
Create a data mining
response rate trend
with each variable
Closer look at Product Type
Product Type
A001
B001
C003
A010
….
missing
Z004
% of records
0.07%
0.08%
0.06%
0.06%
Cum. % of records
0.07%
0.15%
0.21%
0.27%
5%
0.08%
99.92%
100%
For both variables,
demonstrate how no
response rate trend
might exist.
14
Examples
Variables
1st 3 digits of postal
code
household size
Credit score
mortgage account
Product code
Median Income of
Postal Code of record
# of records
Data Field
Format
# of unique
values
# of missing
values
100000
100000
100000
100000
100000
character
numeric
numeric
character
character
?
?
?
?
?
100000
100000
100000
100000
100000
100000
numeric
?
100000
•What variables would be useful here
•What would be the number of unique values
•What would some of these look like in a data
mining response rate analysis exercise?
15
Examples
Variables
1st 3 digits of postal
code
household size
Credit score
mortgage account
Product code
Median Income of
Postal Code of record
# of records
Data Field
Format
# of unique
values
# of missing
values
100000
100000
100000
100000
100000
character
numeric
numeric
character
character
100000
100000
100000
100000
100000
0
0
0
0
0
100000
numeric
100000
0
•What variables would be useful here
•What would some of these look like in a data
mining response rate analysis exercise?
16
Examples-Marketing Perspective
• A mortgage company is conducting a campaign to its high value customers.
One of the key characteristics of value is high income which is self-reported
at time of application.
Income
< 30000
30000-60000
60000-80000
80000-100000
100000+
missing
% of records
5%
5%
20%
10%
10%
50%
As a marketer, how will you use this information and what do you need to
consider?
What might the results be if you applied this learning to a marketing
campaign.
17
Examples-Marketing Perspective
• An insurance company is marketing an insurance product to people over
the age of 60. Listed below is a report indicating the distribution of age.
•
Age
<30
30-40
40-55
55-65
65+
missing
% of records
5%
10%
15%
10%
10%
50%
As a marketer, how will you use this information?
What might the results be if you applied this learning to a marketing
campaign.
18
Examples-Marketing Perspective
• An retail company has over 1000 product SKU’s. After investigation, it has been
determined that the 1st digit represents a broader product category. You have been
asked to design the product layout for all stores.
Product SKU
A000003
A000004
B000005
B000006
….
Z999999
% of records
0.03%
0.02%
0.03%
0.04%
Cum. % of records
0.03%
0.05%
0.08%
0.12%
0.02%
100%
As a marketer, how will you use this information?
19
Examples-Marketing Perspective
•
Gender
Male
Female
Missing
Income
0-20K
20K-40K
40K-60K
60K-80K
80K+
missing
% of records
10%
12%
88%
% of records
5%
4%
7%
6%
5%
73%
What can be done here, if anything and what else can we
consider in terms of using gender and income information?
What might it look like in a data mining exercise?
20
Examples-Marketing Perspective
• You have postal code information for each customer. You are asked
to design customer reports by province.How would you do this?
• What would this look like in a data mining response rate analysis
exercise?
21
Examples-Data Mining Perspective
• You have the following variables and values
– Gender:
’M’:Male
‘F’:Female
– Income ‘B’: <20
‘F’: 20-40
–
‘R’:40-60
‘S’:60-80
‘T’:80-100
‘Z’: 100+
• What must be done here?
• What would this look like in a data mining response rate analysis exercise?
22
Concepts
• Operational Database
– Customer DB
– Transactional DB
• Data Warehouse
• Data Mart
• Analysis Flat file vs. OLAP
• External Data Overlays
– Postal Code Overlays
– Survey/Registration Data
23
Databases
• Operational Databases vs. Data Warehouses vs. Data Marts vs.
Analytical File
• Operational data consists of information from the source systems
– Customer File
– Transaction System
– Finance System
– Operations
– Human Resources
– Etc.
In practice, what do you think an operations database is really
dealing with?
24
Databases
• Data warehouse
– Pulls elements and fields from each source system
– May summarize/organize or aggregate information with each
system to present the information in a more meaningful way?
– Warehouse can comprise information from disparate areas of
company
– What do we mean by this?
25
Databases
• Data mart
– Can in many cases be very similar to data warehouses in the
way that information is summarized and aggregrated
– Pulls elements and fields from each source system
– May summarize/organize or aggregate information with each
system to present the information in a more meaningful way?
– Usually is focussed solely towards one functional area of the
company
• Marketing data Mart
• Let’s think of some information that might be contained in a data
mart?
26
Databases
Customer
Transaction
Finance
..
Etc.
Data Warehouse
Data MartMarketing
Data MartFinance
Data MartEtc.
27
Database Structure
• In Database Design, most databases are relational
– Creates a key which becomes a database index
– This index or key becomes the link between different files
Customer
Customer
Transaction
Promotion
Customer ID is the link between all the tables
Why do we need to think about the notion relational?
28
Database Structure
• Relational DB
– Database indexes allow very quick processing of data when
joining and merging files together
• The key in all database design is to create a database that optimizes
processing of all information.
• In database design, you want the right data to be stored which is
useful from a data mining perspective
• From a marketing standpoint, can you think of some examples?
Why is this important from a data mining standpoint?
29
Database Structure
• Other approaches used in speeding up database processing
– Inverted flat files
• This technology allows each field to be indexed
• Very common amongst the leading-edge DB suppliers today.
• Is much faster at processing data than traditional relational DB
technology
• Again, why is this relevant from a data mining perspective?
30
Databases
• Analytical File
• For most data mining applications, your analytic file needs to be in
the format of one record per customer with all known attributes
• Generally, the database is not in that format.
• ECTL – extraction, clean, transform, load – is the
process/methodology for preparing data for data mining
 Typically a flat file used for analysis
 What do think is the most important concept for data mining?
 Databases or Analytical File
 How do they work together?
31
Databases
File 1
-Cust ID
-Income
-Age
-Household Size
File 2
-Cust ID
-Trans. Type
-Trans Date
-Trans Amt
32
Databases
• In building databases, the notion of continuity management is
important
• In the context of household or customers on a database, continuity
management is the process by which you are able to track customers
through events in time.
• Why is this important?
33
Analytical file
• All data mining algorithms want their input in tabular form – rows &
columns as in a spreadsheet or database table
Typically, if we saw data like this, what typically needs to be
done? Assume reference number is the customer I.D. What does
continuity mean here?
34
What the Data Should Look Like
• A customer “snapshot” = a single row
Each row represents
the customer and
whatever might be useful
for data mining
35
What the Data Should Look Like
• The columns
– Contain data that describe aspects of the customer (e.g., sales $ and
quantity for each of product A, B, C)
– Contain the results of calculations referred to as derived variables (e.g.,
total sales $)
Cust Id
123
123
123
456
456
Date Of
Purchase
jan 4/2006
dec.6/2006
mar.4/2006
apr.6/2006
feb.6/2006
# of Months since
last purchase
4
5
2
1
3
Derived variables are
Total Price in 1st chart
and # of months since
last purchase in 2nd
chart
36
Sourcing the Data from External Data Sources
Typical Data Sources - External
•
Geo-demographic information
– Statistics Canada (aggregated level data)
• Census data
• Taxfiler data
•
•
Geo-demographic Cluster Codes
– Generation 5 – Mosaic
– Equifax -Psyte
Survey Data
– ICOM
37
Sourcing the Data (Extraction)
Census data
Data collected every 5 years.
Enumeration Area level.
~ 250 households on average.
~ 440 households in large urban areas.
~125 households in rural areas.
~ 50,000 EA’s in Canada
Can be converted to postal code level and appended to your file.
Type of data
-immigration/ethnicity/language patterns
-occupation
-education
-income/gender/age/employment
-religion
38
Sourcing the Data (Extraction)
Taxfiler data.
Data collected every year.
Postal walk level.
~ 450 households on average.
~ 26,000 Postal Walks in Canada.
·Contains data from previous year tax returns.
· Income by source and type. Employment, investment.
· RRSP contributions and room. Etc.
Can also be appended to your files at postal code level.
39
Sourcing the Data (Extraction)
Geo-Demographic cluster codes.
• Uses Stats Can data in most cases plus other external
data overlays to determine postal code cluster groups
– Quebec farm families
– Young and Struggling
– Empty Nesters
– Upper Income Family-Oriented
• Equifax
– High credit risk
– Medium credit Risk
– Etc.
40
Sourcing the Data-Stats Can Type Table
Postal Area
Area 1
Area 1
Area 1
Area 2
Area 2
Area 3
Area 3
Area 3
Median Income Avg. Age
42000
40
42000
40
42000
40
50000
35
50000
35
37000
43
37000
43
37000
43
Avg.
Household
Size
2
2
2
1
1
3
3
3
% French
10.00%
10.00%
10.00%
85.00%
85.00%
5.00%
5.00%
5.00%
41
Sourcing the Data (Extraction)
Typical Data Sources - External
• Business to Business “Firmographics”
– SIC, Number of Employees, Revenue etc.
– Sources:
• D&B
• CBI / InfoCanada
• Scott’s
Company
XYZ
….
Employee Size
1-4
…
Industry
Classification
retail
…
Sales Size
<1 million
..
Yrs In
business
10
..
42
Sourcing the Data (Extraction)
Typical Data Sources - Survey
• Attitudinal- Needs, preferences, social values, opinions
• Behavioral- Buying habits, lifestyle, brand usage
 For most data mining projects, we want to assign a value to all
customers; therefore the information used must be available for all
customers
– survey-based information generally cannot be used as it typically
can only be applied a small portion of the database
43
Sourcing the Data (Extraction)
Typical Data Sources - Survey
• ICOM
– Surveys to approx. 10MM Canadians
– Fully updated every 2 years
– Contains attitude behaviour and purchase behaviours across all
industry sectors
• What do you think the value is here?
44
Examples
• A marketer wants to target high risk cancels
for a retention campaign for a Telco. Information is contained in legacy database systems
containing a customer file, transaction file, and call detail file. As a marketer and analyst,
answer the following requirements
– 5 Key Data fields from above files that should be created in analytical
exercise
– Create a diagram or schema of how this data would be linked into an
analytical file
– What resources would you need and why?
• People
• Software
45
Examples
• How would the previous example change if the information was available in a
data mart or warehouse
46
Examples
•
•
•
A university is conducting a fund raising campaign to its alumni(100000
members). On its database, it has the following information:
– Age of alumnus
– Year graduated
– Degree and specialization
– Donation value
– Current Address
It has also collected information from a survey. 10% of members have
responded to the survey with the following %’s of members answering the
following information:
– Current Occupation-5%
– Current Income-8%
– Why they give?-7%
– How much they give
As a marketer and analyst, how would you use the information to conduct a
campaign to its high value donors
47
Examples
•
A computer company collects information from all customers who purchase a
new product. This new product information is collected through a product
registration form which the customer fills in at point of purchase. This information
relates to the following:
– Product preferences,Income,household size and hobbies
All customer tombstone information as well as purchase information related to
products bought has been summarized and stored onto a data mart.
As a marketer and analyst, how would you use the information to develop a
cross-sell campaign.
48
Examples
• A credit card company has 100000 customers containing tombstone
information and detailed transactional information on their database.
50000 customers have email addresses. 10% of 50000 customers have
responded to a survey in which 5% have indicated that they consider
themselves loyal customers. Web activity of these loyal customers
indicate that many of them have clicked on travel-related packages.
• As a marketer and analyst, how would you use this information to sell
travel-related insurance.
49