SAS Homework - Temple Fox MIS


SAS HOMEWORK 4 REVIEW
CLUSTERING AND SEGMENTATION
MIS2502
Data Analytics
• Using AAEM.DUNGAREE Data Set
• Explore data set: SALESTOT and STOREID
• Assign the ID role to STOREID
• Set the SALESTOT role to Rejected
• Add a Cluster node (Explore)
• In Properties select Internal Standardization => Standardization
• Run and Evaluate
• Change Properties Segment Max to 6
• Run and Evaluate
• Add a Segment Profile node (Assess)
• Run and Evaluate
Set Up
• Retail: looking for patterns in sales of types of jeans by store
Data Source - Edit Variables
Data Source – Explore
Note scale
Add Cluster Node, Standardize
Segments, Automatic
note root mean square std deviation
Change Number of Clusters to 6
Segments, Max 6
note root mean square std deviation
Segment Profile Node
Segment Profiles
red outline is the overall distribution
Questions
How do the SALESTOT and STOREID distributions differ from the other variables' distributions (look at the histograms of each one)?
Assign STOREID a model role of ID and SALESTOT a model role of Rejected.
Make sure that the remaining variables have the Input model role and the Interval measurement level. Based on the variable
descriptions on page 1 and your answer to part
Why do you think that the variable SALESTOT should be rejected?
Add a Cluster node to the diagram workspace and connect it to the Input Data node.
Select the Cluster node and select Internal Standardization => Standardization.
Why is it important to standardize your inputs? (hint: look at the range of the scales on the X axis of the histograms)
Run the diagram from the Cluster node and examine the results.
How many clusters are created?
What might be a problem with having so many clusters?
What is the highest root mean squared standard deviation among the clusters?
Two hints:
Look at the Mean Statistics window.
The root mean squared standard deviation means basically the same thing as the sum of squares error.
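The second hint can be made concrete with a small sketch. It assumes the root mean squared standard deviation of a cluster is sqrt(SSE / (number of variables * (n - 1))), which is one common definition; the exact formula SAS Enterprise Miner reports is not stated in these slides. The function names and the two toy clusters are hypothetical.

```python
from math import sqrt


def cluster_sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(dims)]
    return sum((p[j] - centroid[j]) ** 2 for p in points for j in range(dims))


def cluster_rmsstd(points):
    """Root mean squared standard deviation of one cluster.

    Assumed definition: sqrt(SSE / (dims * (n - 1))), i.e. the square root
    of the average per-variable sample variance within the cluster.
    """
    n = len(points)
    dims = len(points[0])
    return sqrt(cluster_sse(points) / (dims * (n - 1)))


# Two hypothetical clusters with the same shape; "loose" is twice as spread out.
tight = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

for name, pts in (("tight", tight), ("loose", loose)):
    print(name, "SSE =", cluster_sse(pts), "RMSSTD =", round(cluster_rmsstd(pts), 3))
```

Under this assumption both numbers move together: the cluster whose stores sit farther from the centroid has the larger SSE and the larger root mean squared standard deviation, so a high value signals low cohesion.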
Distribution of STOREID
Distribution of SALESTOT
• It does tell you that a handful of stores are selling well below average
• These 2 variables aren't useful for the product mix analysis.
Why Standardize?
• Note the difference in the range of numbers on the x axis
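A minimal sketch of what standardizing does, assuming a z-score transform (subtract the mean, divide by the standard deviation). The sales figures and the variable names below are hypothetical and are not taken from AAEM.DUNGAREE.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical store-level unit sales for two jeans types on very different scales.
original = [1200, 1500, 900, 1800]
fashion = [12, 15, 9, 18]

z_original = standardize(original)
z_fashion = standardize(fashion)

# Raw ranges differ by a factor of 100, so distances between stores would be
# dominated by the first variable if the inputs were left unscaled.
print(max(original) - min(original), max(fashion) - min(fashion))
# After standardizing, both variables are on the same scale.
print([round(z, 3) for z in z_original])
print([round(z, 3) for z in z_fashion])
```

Because clustering relies on distances between observations, a variable with a wider numeric range would otherwise carry more weight than the others simply because of its units.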
Segment Profile Node
Reading a Histogram
1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red), you can make some observations about what makes this segment different with regard to Original Jeans sold.
2) Note that there are 8 ranges of standardized sales volumes on the x axis for the overall distribution (red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.
3) Note that for ranges 3, 4, and 5, the overall distribution (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23%, and 31% respectively). You get this by reading the Y axis.
4) Now look at the specific segment distribution (blue). For this segment, approximately 86% of the stores sell within volume ranges 3 and 4.
5) Conclusion: Overall, this segment has more stores selling Original Jeans in lower volume ranges than the overall average. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
Look at the distribution in total, and then at the individual bars. For this distribution you would say that the stores in this segment sell fewer Original Jeans than average, and in a narrower range with less variability (not part of the question). You can say this because the segment distribution is to the left of, and 'tighter' than, the overall distribution.
Segment Profiles
red outline is the overall distribution
Original
In Class
Answer the questions about this output:
1. How many distinct customer groups (segments) are there?
2. How are the customers in cluster 1 different from those in cluster 2?
3. What aspect of the customer data most differentiates cluster 1 from cluster 3?
4. Which cluster has the highest cohesion? In practical terms, what does that mean?
In Class – Evaluating Clustering Output
5. Is the root mean squared standard deviation of these clusters higher or lower than it was in the three cluster scenario? Why?
6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?
7. Which scenario (#1 or #2) has higher cohesion among its clusters?
8. Which scenario (#1 or #2) has higher separation between its clusters?
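Cohesion and separation can be illustrated with a toy comparison. The sketch below assumes cohesion is measured by the within-cluster sum of squared errors (lower means more cohesive) and separation by the distance from each cluster centroid to the nearest other centroid. The points, the two groupings, and the function names are hypothetical and do not reproduce the in-class output.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def total_sse(clusters):
    """Cohesion: sum of squared distances from every point to its own centroid."""
    return sum(dist(p, centroid(c)) ** 2 for c in clusters for p in c)


def nearest_cluster_distances(clusters):
    """Separation: for each cluster, distance to the closest other centroid."""
    centers = [centroid(c) for c in clusters]
    return [
        min(dist(c, other) for k, other in enumerate(centers) if k != i)
        for i, c in enumerate(centers)
    ]


# The same eight hypothetical points grouped two different ways.
pts = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 0), (10, 1), (14, 0), (14, 1)]
two_clusters = [pts[:4], pts[4:]]
four_clusters = [pts[0:2], pts[2:4], pts[4:6], pts[6:8]]

for name, clusters in (("2 clusters", two_clusters), ("4 clusters", four_clusters)):
    print(name, "SSE =", total_sse(clusters),
          "nearest =", nearest_cluster_distances(clusters))
```

On this toy data, splitting into more clusters lowers the total SSE (each cluster is tighter) but also shrinks the distance to the nearest cluster, which is the trade-off between cohesion and separation that questions 5 through 8 ask about.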