Transcript Slide 1

Analysis of Reliance Home Comfort
(RHC) Survey Data (fragment)
Objectives
• Show the potential of Business Intelligence Solutions in
the development and analysis of a complex survey study
• Illustrate the fruitful synergy of statistical and data
mining approaches in survey data analysis
• Formulate important new business questions that can
be answered only within the data mining modeling
paradigm
Brief description of 2008 Reliance Home
Comfort (RHC) Brand & Ad Tracking Study
• The study evaluates client awareness of, and ability to recognize, the
7 most popular providers of home comfort products and services in Canada:
– Reliance Home Comfort
– Direct Energy
– Lennox
– Carrier
– Air One
– Sears
– Home Depot
• The household phone survey was conducted by agents who asked
customers to identify at most 3 of those 7 companies. Therefore,
the number of recognized companies could be between 0 and 3.
• The questionnaire contained about 300 questions, but because of its
hierarchical structure, the average time to complete
the survey was approximately 15 minutes.
• Example questions:
– When you think of COMPANIES that provide ESSENTIAL HOME
COMFORT products and services, which company comes to mind
FIRST?
– Have you seen or heard any advertising from any companies that
provide ESSENTIAL HOME COMFORT products and services in the
past 3 months?
Executive Summary: BI Solutions
• Business Intelligence Solutions (BIS) is a well-established statistical/data
mining/GIS company that conducts business in the USA and Canada.
• We specialize in complex, unstructured business problems for data-rich
firms. Our multidisciplinary team includes professionals in applied statistics,
data mining, GIS, and software application development.
• Our employees include professionals with PhD degrees in diverse
quantitative fields: Applied Statistics, Data Mining/Machine Learning,
Operations Research, and Differential Equations.
• The team members are authors of more than 100 published papers on
diverse applications of data mining and other quantitative fields to market
research, customer relationship management, pilot study design, etc.
• BIS has access to the best statistical, visualization, data mining, and GIS
software on the world market.
• The essence of our approach is to understand and analyze our client’s
business problem and corresponding data through the prism of dissimilar
statistical/data mining models. As a result, we are always able to produce
the best possible model/results and help our clients in the most effective
and scientifically sound way.
Exploratory Data Analysis (EDA) and Data
Complexity
Example of Data Transformation
Exploratory Data Analysis (EDA) and data preprocessing are vital steps in any data analysis project.
[Figure: frequency chart of Original First Response (Q9). Original data, 22 categories]
[Figure: frequency chart of Modified First Response (Q9). Transformed data, 9 categories]
Categories MorEnergy, Prestige Home Comfort,
and Roy Inch & Sons have no variance and
contribute no useful information to the analysis.
Therefore, these categories should be aggregated.
This example demonstrates the necessity of these preliminary steps:
the constructed variable Modified First Response turns out to be
much more predictable than the original First Response (Q9).
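The aggregation of sparse categories described on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the response counts, the `min_count` threshold, and the `"Other"` label are all hypothetical.

```python
from collections import Counter

def aggregate_rare_categories(responses, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(responses)
    return [r if counts[r] >= min_count else other_label for r in responses]

# Hypothetical first-response answers (NOT the RHC survey data).
responses = (
    ["Reliance Home Comfort"] * 11
    + ["Direct Energy"] * 30
    + ["Sears"] * 8
    + ["MorEnergy", "Prestige Home Comfort", "Roy Inch & Sons"]
)

# min_count=5 is an arbitrary threshold chosen for this illustration.
modified = aggregate_rare_categories(responses, min_count=5)

print(Counter(responses))  # six categories, three of them with one response each
print(Counter(modified))   # the three sparse categories are merged into "Other"
```

The threshold is a modeling choice: a value that is too high hides real differences between companies, and a value that is too low leaves categories with too few responses to analyze.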
Modified First Response (Q9) by Region (Q1)
The company that comes to mind FIRST (Q9) differs significantly
across regions (Q1); the p-value of the chi-square test is 0.0003.
[Figure: Binary Q9 by Region]
For Sudbury/Thunder Bay residents, Reliance Home Comfort
comes to mind FIRST 6 times more often than for Hamilton residents.
Contrasting RHC with the aggregated other companies (Other), we can note that Other
has a practically uniform distribution across regions. Therefore, the advertising/marketing of RHC
in Burlington, Hamilton, and Oakville should be improved.
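The chi-square test of independence used for this comparison can be sketched as follows. The contingency table is hypothetical (the slide does not list the actual counts), and the function name is an assumption made for the example. The sketch computes only the test statistic and the degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of region and first response.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are regions, columns are (RHC, Other).
table = [
    [30, 70],  # region A
    [10, 90],  # region B
    [5, 95],   # region C
]

stat, dof = chi_square_statistic(table)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

The p-value is the upper tail probability of the chi-square distribution with `dof` degrees of freedom at `stat`. If SciPy is available, `scipy.stats.chi2_contingency` returns the statistic, p-value, degrees of freedom, and expected counts in one call.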
The 5-point scale statements (questions Q74a - Q74f) should be analyzed
separately for interviewees who heard about the company by word of
mouth and for those who did not.
Spearman correlation is a non-parametric (distribution-free) measure of the relationship between two variables.
Q70a: How the interviewee heard about the company:
Word of mouth / Recommendation = Yes
Just 2 of 15 pairs of questions
have a non-significant correlation
Different correlation structure
10 of 15 pairs of questions
have a non-significant correlation
Q70a: How the interviewee heard about the company:
Word of mouth / Recommendation = No
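The six statements Q74a - Q74f form 6 × 5 / 2 = 15 distinct pairs, which is where the count of 15 comes from. A minimal sketch of the pairwise Spearman calculation is given below. The answers are hypothetical, only three statements are shown, and the significance test applied in the study is not reproduced. In practice the same calculation would be run once for the Yes group and once for the No group.

```python
from itertools import combinations

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data.
    Undefined (division by zero) if either variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical 5-point answers for three of the six statements.
answers = {
    "Q74a": [5, 4, 4, 2, 1, 3, 5, 2],
    "Q74b": [4, 4, 5, 1, 2, 3, 5, 1],
    "Q74c": [1, 3, 2, 5, 4, 3, 1, 5],
}

for a, b in combinations(answers, 2):
    print(a, b, round(spearman(answers[a], answers[b]), 3))
```

Average ranks are used for ties because 5-point scale answers are heavily tied, and the shortcut formula based on squared rank differences is exact only when there are no ties.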
Exploratory Data Analysis summary
• RHC survey data analysis requires sophisticated
approaches due to the high complexity of the data.
• The complexity can be characterized by:
– High dimensionality (about 300 attributes/questions)
– Uncharacterizable non-linearities
– Hierarchy among attributes
– Presence of differently scaled attributes (numeric,
binary, and nominal)
– Vast majority of attributes are nominal
– Large percentage of categorical attributes with huge
numbers of categories and non-uniform frequency
distributions
– Large percentage of missing values for some
attributes/predictors
Data Mining Application
(Decision Tree and TreeNet) to Survey
Data Analysis
Fragment of Decision Tree for Binary First Response (Q9)
Binary First Response: RHC, or OTHER
Example of an “If-Then” scenario that
can be answered by a Decision Tree:
If all interviewees gave the
highest score to the quality of RHC
products and services, how would the
probability of First Response = RHC
change?
Providing high-quality products
and services is a strong predictor
of Binary First Response: the probability
of First Response = RHC nearly doubles (a relative increase of about 91%),
from 0.11 for the whole sample to 0.21
for interviewees who experienced good quality.
RHC has a weak
association with
low quality of
products and services.
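The comparison read off the tree is a subgroup probability set against the overall probability. The sketch below illustrates that arithmetic only. The respondent counts are invented so that the two shares match the 0.11 and 0.21 quoted on the slide; they are not the survey data, and the field names are hypothetical.

```python
def share_rhc(rows):
    """Share of rows whose first response is RHC."""
    return sum(1 for r in rows if r["first_response"] == "RHC") / len(rows)

# Invented respondents: quality score (1-5) and binary first response.
sample = (
    [{"quality": 5, "first_response": "RHC"}] * 21
    + [{"quality": 5, "first_response": "OTHER"}] * 79
    + [{"quality": 2, "first_response": "RHC"}] * 1
    + [{"quality": 2, "first_response": "OTHER"}] * 99
)

overall = share_rhc(sample)
top_quality = share_rhc([r for r in sample if r["quality"] == 5])
relative_change = (top_quality - overall) / overall

print(f"{overall:.2f} -> {top_quality:.2f} ({relative_change:.0%} relative increase)")
```

A move from 0.11 to 0.21 is an increase of 10 percentage points, or about 91% in relative terms.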
TreeNet: Intro
• Stochastic Gradient Boosting, the method behind TreeNet, was introduced in 1999
by Stanford University Professor Jerome Friedman. It is one of the
most flexible and powerful data mining tools.
• Salford Systems, a California-based data mining software
development company (http://www.salford-systems.com),
implemented and commercialized this invention as the TreeNet
product in 2003. It was the first stochastic gradient boosting
tool in the world data mining industry.
• Extensive research has shown that TreeNet models are
among the most accurate of any known modeling techniques.
• A TreeNet model is a non-parametric, non-linear regression model that
can be described as a linear combination of a large number of
small trees.
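The idea of a model built as a linear combination of many small trees can be illustrated with a toy version of stochastic gradient boosting. The sketch below is not TreeNet's implementation. It assumes squared-error loss, a single numeric predictor, one-split trees (stumps), and hypothetical parameter names (`n_trees`, `rate`, `subsample`).

```python
import math
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # largest x excluded so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # no split possible: constant prediction
        mean = sum(residuals) / len(residuals)
        return (math.inf, mean, mean)
    return best[1:]

def fit_boosting(xs, ys, n_trees=100, rate=0.1, subsample=0.5, seed=0):
    """Each round fits a stump to current residuals on a random subsample."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        t, l, r = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append((t, l, r))
        # Shrinkage: add only a fraction (rate) of the new tree's prediction.
        preds = [p + rate * (l if x <= t else r) for p, x in zip(preds, xs)]
    return base, rate, stumps

def predict(model, x):
    """Prediction = base value + rate * sum of all stump outputs."""
    base, rate, stumps = model
    return base + rate * sum(l if x <= t else r for t, l, r in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic non-linear data for illustration only.
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

baseline = (sum(ys) / len(ys), 0.1, [])  # mean-only model with no trees
model = fit_boosting(xs, ys)
print(f"MSE of mean-only model: {mse(baseline, xs, ys):.4f}")
print(f"MSE after boosting:     {mse(model, xs, ys):.4f}")
```

The word "stochastic" refers to the random subsample drawn in each round, and the small `rate` is what makes the final model a sum of many weak trees rather than a few strong ones.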
Drivers of Modified First Response (Q9)
• The most important predictor of the First
Response (Q9) is Q75a (Age of interviewee)
• Q1 (Region) and Q78 (Income) are examples of predictors
with modest impact on First Response (Q9)
• Q8 (Gender) is an example of a predictor that has no impact
on First Response (Q9)
Predictor importance for the probability of
Modified First Response = RHC
Misclassification Rate: TreeNet model for
Modified First Response (Q9) prediction
Cost Matrix
Cost of correct classification equals 0,
and cost of incorrect classification equals 1.
Prediction Accuracy (learning data: 60%)
The Percent Error is the smallest for
Reliance Home Comfort (best accuracy):
Pct Error = 0.00.
The Percent Error is the largest for
Union Energy (worst accuracy):
Pct Error = 27.59.
On average, the prediction error
for Modified First Response across all 9
categories is 15.79%.
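With a cost of 0 for a correct classification and 1 for an incorrect one, the percent error of a class is the share of its cases that were misclassified. The sketch below shows that calculation on hypothetical labels and predictions. It reports the overall error rate; the slide does not state whether its average is the overall rate or the mean of the per-class rates.

```python
from collections import defaultdict

def percent_error_by_class(actual, predicted):
    """Percent of cases in each actual class that were misclassified (0/1 cost)."""
    totals, errors = defaultdict(int), defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        errors[a] += 0 if a == p else 1  # cost 0 if correct, 1 if incorrect
    return {c: 100.0 * errors[c] / totals[c] for c in totals}

# Hypothetical labels and predictions (not the study's results).
actual    = ["RHC", "RHC", "Sears", "Sears", "Sears", "Sears", "Lennox", "Lennox"]
predicted = ["RHC", "RHC", "Sears", "RHC", "Sears", "Sears", "Sears", "Lennox"]

per_class = percent_error_by_class(actual, predicted)
overall = 100.0 * sum(a != p for a, p in zip(actual, predicted)) / len(actual)

for name, pct in per_class.items():
    print(f"{name}: Pct Error = {pct:.2f}")
print(f"Overall: Pct Error = {overall:.2f}")
```

A non-uniform cost matrix would replace the 0/1 increment with the cost of predicting `p` when the actual class is `a`.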
TreeNet model: Impact of “You mentioned that you are familiar with RHC” (Q15)
on the probability of Binary First Response (Q9) = RHC, controlling for all other
predictors
Using the TreeNet model, it is possible to answer diverse “If-Then” business questions. For example, if the share of
“Telemarketing” responses increased by 10%, how would the probability of First Response = RHC change?
The highest positive impact on the
Probability of First Response (Q9) = RHC
The highest negative impact on the
Probability of First Response (Q9)= RHC
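One common way to measure the impact of a single predictor while controlling for the others is a partial dependence calculation: set that predictor to a chosen value for every respondent, keep all other answers as observed, and average the model's predictions. The slides do not say exactly how the figure was produced, so the sketch below is an assumption. `toy_model` is a hypothetical stand-in for a fitted model, and its effect sizes and the answer categories other than “Telemarketing” are invented.

```python
def partial_dependence(model, rows, feature, value):
    """Average prediction when `feature` is set to `value` for every row,
    leaving all other predictors at their observed values."""
    modified = [{**row, feature: value} for row in rows]
    return sum(model(row) for row in modified) / len(modified)

def toy_model(row):
    """Hypothetical stand-in for a fitted model: returns P(First Response = RHC).
    The effect sizes are arbitrary and chosen only for illustration."""
    p = 0.10
    if row["q15"] == "Telemarketing":
        p -= 0.05
    if row["age"] >= 50:
        p += 0.08
    return p

# Hypothetical respondents.
rows = [
    {"q15": "Telemarketing", "age": 35},
    {"q15": "Bill insert", "age": 62},
    {"q15": "TV ad", "age": 48},
    {"q15": "Telemarketing", "age": 71},
]

pd_telemarketing = partial_dependence(toy_model, rows, "q15", "Telemarketing")
pd_tv_ad = partial_dependence(toy_model, rows, "q15", "TV ad")
print(f"Telemarketing: {pd_telemarketing:.3f}")
print(f"TV ad:         {pd_tv_ad:.3f}")
```

Comparing the averaged predictions across answer categories shows which answers push the probability up and which push it down, with the other predictors held at their observed values.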
TreeNet summary
• The TreeNet algorithm has about 20 different options that can be
controlled by the researcher.
• Using the default options did not produce a good model.
• Determining the best set of options (the optimal model) is time
consuming and requires experience and expertise.
• TreeNet is an appropriate tool for the analysis of complex survey
data.
• TreeNet is a perfect tool for
– Prediction and Scoring
– Estimation of a probability of an event of interest
– Identification of predictor importance and drivers
– “If - Then” scenario analysis
Conclusion
• Typical survey data analysis questions are:
– Segmenting respondents
– Identifying drivers of a question of interest
– Relationships between different survey questions
– Predictability of the answer to a question under consideration
– Diverse “If-Then” scenarios
– Combining primary and secondary data to answer unique
business questions
• The essence of our approach is to understand and analyze our
client’s business problem and corresponding data through the prism
of dissimilar statistical/data mining models.
• The synergy of data mining and traditional statistics makes it possible to extract
the maximum useful information from complex survey data.
• As a result, we are always able to produce the best possible model/results
and help our clients in the most effective and scientifically
sound way.