Review of Probability and Statistics
Download
Report
Transcript Review of Probability and Statistics
Statistical Review
Measures of Central Location
SALARY.XLS
Lists starting salaries for 190 graduates from an
undergraduate school of business.
The data is in the range named Salary on a sheet
called Data.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Mean
We calculate the mean salary by entering the formula
“=AVERAGE(Salary)” in cell B6 of the Excel
Functions worksheet.
The mean salary is $29,762.
The mean in this example is a “representative”
measure because the distribution of salaries is nearly
symmetric.
The mean can be misleading due to skewness.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Median
The median is the “middle” observation when the
data are listed from smallest to largest.
If there is an odd number of observations, the median
is the middle observation.
If there is an even number of observations, we take
the median to be the average of the two middle
observations.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Median -- continued
We calculate the median salary in Example 3.1 by
entering the formula “=MEDIAN(Salary)” in cell B7 of
the Excel Functions worksheet.
The median in this example is $29,850.
In this case, the mean and the median values are
nearly the same because the distribution is
approximately symmetric.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Median -- continued
If the salary distribution were skewed (for example, a
few graduates received abnormally large salaries),
the mean would be biased upward while the median
would not be affected by the unusual values.
Thus, it is better to use the median in characterizing
the center of a distribution when that distribution is
skewed.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Mode
The mode is the most frequently occurring value.
If the values are essentially continuous, as with the
salaries in Example 3.1, then the mode is essentially
irrelevant. There is typically no single value that
occurs more than once.
Thus, the mode is not likely to provide much
information.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Mode
SHOES.XLS
This file lists shoe sizes purchased at a shoe store.
We seek to find the best-selling shoe size at this
store.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Mode
The mode is the most frequently occurring value.
If the values are essentially continuous, then the
mode is usually not relevant. There is typically no
single value that occurs more than once.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Why is the mode relevant here?
Shoe sizes come in discrete increments, rather than
a continuum; so it makes sense to find the mode, the
size that is requested most often, in this example.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Computing the Mode in EXCEL
The mode can be found in Excel by entering the
formula “=MODE(Range of Data)”.
Applying this formula in the present example, we find
that size 11 is the most frequently purchased shoe
size.
This is also apparent form the histogram on the next
slide.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Distribution of Shoe Sizes
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Measures of Variability: Variance and
Standard Deviation
OTIS4.XLS
Suppose that Otis Elevator is going to stop
manufacturing elevator rails. Instead, it is going to
buy them from an outside supplier.
Otis would like each rail to have a diameter of 1 inch.
The company has obtained samples of ten elevator
rails from each supplier. They are listed in columns A
and B of this Excel file.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Which should Otis prefer?
Observe that the mean, median, and mode are all
exactly 1 inch for each of the two suppliers.
Based on these measures, the two suppliers are
equally good and right on the mark. However, we
when we consider measures of variability, supplier 1
is somewhat better than supplier 2. Why?
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Explanation
The reason is that supplier 2’s rails exhibit more
variability about the mean than do supplier 1’s rails.
If we want rails to have a diameter of 1 inch, then
more variability around the mean is very undesirable!
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Variance
The most commonly used measures of variability are
the variance and standard deviation.
The variance is essentially the average of the
squared deviations from the mean.
We say “essentially” because there are two versions
of the variance: the population variance and the
sample variance.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More on the Variance
The variance tends to increase when there is more
variability around the mean.
Indeed, large deviations from the mean contribute
heavily to the variance because they are squared.
One consequence of this is that the variance is
expressed in squared units (squared dollars, for
example) rather than original units.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Standard Deviation
A more intuitive measure of variability is the standard
deviation.
The standard deviation is defined to be the square
root of the variance.
Thus, the standard deviation is measured in original
units, such as dollars, and it is much easier to
interpret.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Computing Variance and
Standard Deviation in Excel
Excel has built-in functions for computing these
measures of variability.
The sample variances and standard deviations of the
rail diameters from the suppliers in the present
example can be found by entering the following
formulas: “=VAR(Supplier1)” in cell E8 and
“=STDEV(Supplier1)” in cell E9.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Computing Variances & Standard
Deviations -- continued
Of course, enter similar formulas for supplier 2 in
cells F8 and F9.
As we mentioned earlier, it is difficult to interpret the
variances numerically because they are expressed in
squared inches, not inches.
All we can say is that the variance from supplier 2 is
considerably larger than the variance from supplier 1.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Interpretation of the Standard
Deviation
The standard deviations, on the other hand, are
expressed in inches. The standard deviation for
supplier 1 is approximately 0.012 inch, and supplier
2’s standard deviation is approximately three times
this large.
This is quite a disparity. Hence, Otis will prefer to buy
rails from supplier 1.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Interpretation of the Standard
Deviation: Rules of Thumb
DOW.XLS
This file contains monthly closing prices for the Dow
Jones Index from January 1947 through January
1993.
The monthly returns from the index are also shown
starting with February 1947. Each return is the
monthly percentage change (expressed) as a
decimal) in the index.
How well do the rules of thumb work for these data?
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Rules of Thumb
Many data sets follow “rules of thumb”.
Approximately 68% of the observations are within
one standard deviation of the mean.
Approximately 95% of the observations are within two
standard deviations of the mean.
Approximately 99.7% - almost all - of the
observations are within three standard deviations of
the mean.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Index Time Series Plot
A time series plot of the index show that the index
has been increasingly fairly steadily over the period.
Whenever a series indicates a clear trend such as
the index does, most of the measures we have been
discussing are less relevant.
For example, the mean of the index for this period
has at most historical interest. We are probably more
interested in predicting the future of the Dow, and the
historical mean has little relevance for predicting the
future.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Time Series Plot of Dow Closing
Index
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Time Series Plot of Dow Returns
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Return Time Series Plot
A time series plot of the returns show no obvious
trend over the period.
The measures we have been discussing are relevant
in discussing the series of returns, which fluctuate
around a stable mean.
We first calculate the mean and standard deviation of
the returns by using the Excel functions AVERAGE
and STDEV in cells B4 and B5. See the table on the
next slide.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Rules of Thumb for Dow Jones
Data
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Returns -- continued
The average return is 0.59% and the standard
deviation of about 3.37%.
Therefore, the rules of thumb (if they apply) imply, for
example, that about 2/3 of all returns are within the
interval 0.59% + 3.37%, that is from -2.78% to 3.95%.
In order to determine if the rules of thumb apply to
these returns, we can use a frequency table.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Creating the Frequency Table
We first enter the upper limits of the suitable
categories in the range A8:A15.
Any categories can be chosen but it is convenient to
choose categories in which each breakpoint is one
standard deviation higher than the previous one with
the open-ended categories on either end are “more
than 3 standard deviations from the mean”.
Next we use the FREQUENCY function to fill in
column C. “=FREQUENCY(Returns,Bins)”
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Frequency Table continued
Finally, we use the frequencies in column C to
calculate the actual percentage of return within k
standard deviations of the mean for k=1, k=2 and k=3
and we compare these with the percentages from the
rules of thumb.
The agreement between these percentages is not
perfect - there are a few more observations within
one standard deviation of the mean than the rule of
thumb predicts - but in general the rules of thumb
work quite well.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Obtaining Summary Measures
with Add-Ins
SALARY.XLS
Lists starting salaries for 190 graduates from an
undergraduate school of business.
The data is in the range named Salary on a sheet
called Data.
We need to find a set of useful summary measures
for the salaries.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Summary Statistics
To find the summary statistics of a set of data we can
use the Stat-Pro Add-In or Excel’s Analysis ToolPak.
In this example we use the Stat-Pro Add-In .
Begin by placing the cursor anywhere within the data
range. Then select StatPro/Summary Stats/OneVariable Summary Stats menu item.
Select all variables you want to summarize, and
select the summary measures you want to find from
the Available Summary Measures dialog box shown
on the next slide.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Available Summary Measures
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
About the Measures
Four measures are selected by default. These are
mean, median, Standard Deviation and Count. You
can override these.
A typical output appears here.
It includes many of the
measures we have discussed
plus a few more.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
About the Measures -- continued
The mean absolute deviation is similar to the
variance except that it is an average of the absolute
(note squared) deviations from the mean.
The kurtosis and skewness indicate the relative
peakedness of the distribution and its skewness.
By clicking on any of the cells containing the
measures (Column B), you’ll see that StatPro
provides the formulas for the outputs. (Analysis
ToolPak does not do so.)
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
About the Measures -- continued
The effect of this is that if any of the data changes the
summary measures we produced change
automatically.
All output is formatted as “numerical” to three decimal
places by default. You can reformat them in a more
appropriate manner if you would like.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Measures of Association: Covariance
and Correlation
EXPENSES.XLS
A survey questions members of 100 households
about their spending habits.
The data in this file represent the salary, expense for
cultural activities, expense for sports-related
activities, and the expense for dining-out for each
household over the past year.
Do these variables appear to be related linearly?
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Covariance and Correlation
When we need to summarize the relationship
between two variables we can use the measures
covariance and correlation. We summarize the type
of behavior observed in a scatterplot.
Each measures the strength (and direction) of a
linear relationship between two numerical variables.
The relationship is “strong” if the points in a
scatterplot cluster tightly around some straight line. If
this line rises form left to right then the relationship is
“positive”. If it falls from left to right then the
relationship is “negative”.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Determining Linear
Relationships
Scatterplots of each variable versus each other would
provide the answer to the question but six
scatterplots would be required, one for each pair.
To get a quick indication of possible linear
relationships we can use Stat-Proto obtain a table of
correlations and/or covariances.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Table of Correlations and
Covariances
To get the table, place the cursor anywhere in the
data set and use the StatPro/Summary
Stats/Correlations, Covariances menu item and
proceed in the obvious way.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Relationships
The only relationships that stand out are the positive
relationships between salary and cultural expenses
and between salary and dining expenses.
The negative relationships are between cultural and
sports-related expenses.
To confirm these graphically we show scatterplots of
Salary versus Culture and Culture versus Sports
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot Indicating Positive
Relationship
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot Indicating Negative
Relationship
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Correlation and Covariance
Properties
In general, the following properties are evident from
the Table of correlations and covariances.
– The correlation between a variable and itself is 1.
– The correlation between X and Y is the same as the
correlation between Y and X. Therefore, it is sufficient to list
the correlations below (or above) the diagonal in the table.
(The same is true for the covariances).
– The covariance between a variable and itself is the variance
of the variable. We indicate this in the heading of the
covariance table.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Correlation and Covariance
Properties -- continued
– It is difficult to interpret the magnitudes of covariances.
These depend on the fact that the data are measured in
dollars rather than, say, thousands of dollars. It is such
easier to interpret the magnitudes of the correlations
because they are scaled to be between -1 and +1.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Describing Data Sets with Boxplots
DOW.XLS
This file lists the monthly returns on the Dow from
February 1947 through January 1993.
Use a boxplot to summarize the distribution of these
returns.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Boxplots
A boxplot is a very useful graphical method for
summarizing data.
Boxplots can be used in two ways: either to describe
a single variable in a data set or to compare two (or
more) variables.
Excel has no boxplot option, but we included this
option in the StatPro add-in.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Creating Boxplots
Place the cursor anywhere within the data set, use
the StatPro/Charts/BoxPlots(s) menu item and
proceed in the obvious way.
Eventually two sheets will be added to your
workbook. One has a the boxplot chart, while the
other contains summary measures used to form the
boxplot.
The following slides show the chart and the summary
measure information.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Boxplot Chart
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Boxplot Summary Measures
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Keys to Understanding Boxplots
The right and left of the box are at the third and first
quartiles. Therefore, the length of the box equals the
interquartile range (IQR), and the box itself
represents the middle 50% of the observations. The
height of the box has no significance.
The vertical line inside the box indicates the location
of the median. The point inside the box indicates the
location of the mean.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Keys to Understanding Boxplots
-- continued
Horizontal lines are drawn from each side of the box.
They extend to the most extreme observations tat are
no farther than 1.5 IQRs from the box. They are
useful for indicating variability and skewness.
Observations farther than 1.5 IQRs from the box are
shown as individual points. If they are between 1.5
IRQs and 3 IQRs from the box, they are call mild
outliers and are hollow. Otherwise, they are called
extreme outliers and are solid.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Dow Returns Boxplot
The boxplot for this example summarizes the
distribution of the returns.
It implies that the Dow returns are approximately
symmetric on each side of the median, although the
mean is a bit below the median.
Also there are a few mild outliers but no extreme
outliers.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Describing Data Sets with Boxplots
ACTORS.XLS
Recall that the salaries of famous actors and
actresses are listed in this file.
Use side-by-side boxplots to compare the salaries of
male and female actors and actresses.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Side-by-Side Boxplots
Boxplots are probably most useful for comparing two
populations graphically. This is done using side-byside boxplots.
The data setup for this type of “comparison” can be in
one of two forms: stacked or unstacked.
Data are stacked if there is a “code” variable that
designates which category each observation is in,
and there is a single “measurement” variable that
contains the data for both categories.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Side-by-Side Boxplots -continued
The data are unstacked if there is a separate data
column for each category.
In this example the data are stacked because Gender
designates the gender associated with each
observation and Salary is the single measurement
variable.
If the data were unstacked, for example, actors’
salaries would be in one column and actresses’
salaries would be in another.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Creating a Side-by-Side Boxplot
Since the data are stacked the following must be
steps must be followed to create the boxplots.
– Place cursor within the data set and select the
StatPro/Charts/Boxplot(s) menu item.
– In the dialog box that opens check the “stacked” option.
– Then choose Gender as the code variable and Salary as the
measurement variable.
The resulting data in stacked form and the side-byside boxplot appears.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Actor Data in Stacked Form
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Side-by-Side Boxplot Chart
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Reading the Boxplot
It is clear the the female salary box is considerably to
the left of the male salary box, although both have
about the same IQR.
Each boxplot has three indications that the salary
distributions are skewed to the right:
– the means are larger than the medians
– the medians are closer to the left sides of the boxes than to
the right sides
– the horizontal lines extend farther to the right than to the left
of the boxes. However, there are no outliers.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Applying the Tools
Background Information
The Spring Mills Company produces and distributes a
wide variety of manufactured goods. Due to its
variety, it has a a large number of customers.
Spring Mills classifies these customers as small,
medium and large, depending on the volume of
business each does with them.
Recently they have noticed a problem with accounts
receivable. They are not getting paid by their
customers in as timely a manner as they would like.
This obviously costs them money.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
RECEIVE.XLS
Spring Mills has gathered data on 280 customer
accounts.
For each of these accounts the data set lists three
variables:
– Size, the size of the customer (coded 1 for small, 2 for
medium, 3 for large)
– Days, the number of days since the customer was billed
– Amount, the amount the customer owes
What information can we obtain from this data?
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis
It is best to start by getting a good sense of the data.
To do this we:
– Calculate several summary measures for the Days and
Amount
– Create a histogram of Amount
– Create a scatterplot of Amount versus Days
From these we determine:
– a positive skewness in the Amount variable (The mean is
considerably larger than the median and the standard
deviation of Amount is quite large).
– The scatterplot suggests some suspicious behavior, with two
distinct groups of points.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
The next step is to see whether the different
customer sizes have any effect on either Days,
Amount, or the relationship between Days and
Amount.
To do this, it is useful to unstack the Days and
Amount variables - that is to create a new Days and
Amounts variable for each group of customer sizes.
For example, the Days and Amount variables for
customers of size 1 are named Days1 and Amount1.
This can be accomplished by using StatPro’s
Unstack procedure but copying and pasting also
work. 3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
Once unstacked we need to calculate summary
measures and a variety of charts on these unstacked
variables.
The charts include:
– Histograms of Amount for each size customer
– Boxplots of days owed by different size customers
– Boxplots of amounts owed by different size customers
– Scatterplots of Amount versus Days for each size customer
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Summary Measures for
Combined Data
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Histogram of All Amounts Owed
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount versus
Days for All Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Summary Measures Broken
Down by Size
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Histogram for Small Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Histogram of Amount for
Medium Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Histogram of Amount for Large
Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Boxplots of Days Owed by
Different Size Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Boxplots of Amounts Owed by
Different Size Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount versus
Days for Small Companies
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount versus
Days for Medium Companies
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount versus
Days for Large Companies
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
There is obviously a lot going on here and it is
evident form the charts. We point out the following:
– there are considerably fewer large customers than small or
medium customers.
– the large customers tend to owe considerably more than
small or medium customers.
– the small customers do not tend to be as long overdue as
the small and medium customers.
– there is no relationship between Days and Amount for the
small customers, but there is a definite positive relationship
between these variables for the medium and large
customers.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis
We have done the obvious but here is still much
more we can do.
For example, suppose Spring Mills wants a
breakdown of customers who owe at least $500.
We first create a new variable called “Large?” next to
the original variables that equals 1 for all amounts
greater than $500 and equals 0 otherwise.
We do this by entering a formula in D6 and copying
down.The next slide shows a sample of data and the
formula to be used.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis - -continued
We enter the formula
=IF(C6>$B$3,1,0)
in cell D6 and copy it down.
We can then use a pivot table to create a count of
the number of 1’s in this new variable for each value
of the Size variable.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Pivot Tables for Counts of
Customers Who Owe More than
$500
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
We created the pivot table twice, once showing the
counts as percentages of each column, and once
showing them as percentages of each row.
One table shows that 73% of all customers with
amounts less than $500 are small customers.The
other table shows that 45% of all medium-sized
customers owe more than $500.
This type of analysis is often referred to as “slicing
and dicing the data”. They are based on the same
counts but portray them in different ways. Neither
way is better; they both provide useful information.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
Finally, we investigate the amount of interest Spring
Mills is losing by the delays in its customers’
payments.
We assume that the company can make 12% annual
interest on excess cash. Then we can create a Lost
variable for each customer size that indicates the
amount of interest Spring Mills loses on each
customer group.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis
The formula entered in cell C10 to calculate Lost1 is
=B10*A10*$C$7/365. This is the amount owed by the
number of days owed multiplied by the interest rate,
divided by the number of days in a year. Then we
copy this formula down and to the columns to
calculate Lost2 and Lost3.
Then we calculate the sums of these amounts in row
5.
The next slide shows a sample of data at this point.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
Although Spring Mills is losing more per customer
from the large customers, it is losing more total from
the medium-sized customers - because there are
more of them.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
This is shown graphically by a pie chart of the sums
in row 5.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Findings
If Spring Mills really wants to decrease receivables, it
might want to target the medium-sized customer
group, from which it is losing the most interest.
Or it could target the large customers because they
owe the most on average.
The most appropriate action depends on the cost and
effectiveness of targeting any particular customer
group. However, the analysis presented here gives
the company a much better picture of what’s currently
going on.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Applying the Tools
Background Information
The R&P Supermarket is open 24 hours a day, 7
days a week. Lately it has been receiving a lot of
complaints from customers about excessive waiting
in line for checking out.
R&P has decided to investigate this situation by
gathering data on arrivals, departures, and line
lengths at the checkout stations.
It has collected data in half-hour increments for an
entire week - 336 observations - starting at 8 am on
Monday and ending at 8 am on the following Monday.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
CHECKOUT.XLS
This file includes the data collected on the following
variables:
– InitialWaiting, the number waiting or being checked out at
the beginning of a half-hour period
– Arrivals, the number of arrivals to the checkout stations
during a period
– Departures, the number of checkout stations open during a
period
– Checkers, the number of checkout stations open during a
period
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
CHECKOUT.XLS -- continued
The data set also includes time variables:
– Day, day of week
– StartTime, clock time at the beginning of each half-hour
period
– TimeInterval, a descriptive term for the time of day such as
Lunch rush for 11:30 a.m. to 1:30 p.m.
Finally the data set includes a calculated variable:
– EndWaiting, the number waiting or being checked out at the
end of a half-hour period This variable for any time period
equals InitialWaiting plus Arrivals minus Departures; it also
equals InitialWaiting for the next period.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
The Data
Supermarket checkout efficiency
Day
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
Mon
StartTime
8:00 AM
8:30 AM
9:00 AM
9:30 AM
10:00 AM
10:30 AM
11:00 AM
11:30 AM
12:00 PM
12:30 PM
1:00 PM
1:30 PM
2:00 PM
2:30 PM
3:00 PM
3:30 PM
4:00 PM
4:30 PM
TimeInterval InitialWaiting
Morning rush
2
Morning rush
1
Morning
8
Morning
7
Morning
5
Morning
2
Morning
7
Lunch rush
1
Lunch rush
6
Lunch rush
14
Lunch rush
8
Afternoon
9
Afternoon
3
Afternoon
6
Afternoon
5
Afternoon
7
Afternoon
2
Afternoon rush
4
Arrivals Departures
21
22
25
18
27
28
21
23
20
23
36
31
30
36
34
29
56
48
58
64
53
52
30
36
34
31
36
37
30
28
29
34
35
33
32
25
EndWaiting Checkers TotalCustomers
1
3
23
8
3
26
7
3
35
5
3
28
2
5
25
7
5
38
1
5
37
6
5
35
14
7
62
8
7
72
9
7
61
3
5
39
6
5
37
5
5
42
7
5
35
2
5
36
4
5
37
11
5
36
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis Information
The manager of R&P wants to analyze the data to
discover any trends, particularly in the pattern of
arrivals throughout the day or across the entire week.
Also the store currently uses a “seat-of-the-pants”
approach to opening and closing checkout stations
each half hour. The manager would like to see how
well the current approach is working.
Of course, she would love to know the “best” strategy
for opening and closing checkout stations - but this is
beyond her (and our) capabilities at this point
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis
Obviously time plays a crucial role in this data, so a
good place to start is to create one or more time
series plots.
The time series plot of InitialWaiting and Arrivals
Variables shown on the next slide shows that:
– Fridays and Saturdays are the busiest days
– the time pattern of arrivals is somewhat different - more
spread out - during the weekends than during the weekdays
– there are fairly regular peak arrival periods during the
weekdays
– the number waiting is sometimes as large as 10 or 20, and
the largest of these tend to be around peak arrival times
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Time Series Plot of Initial
Waiting and Arrivals Variables
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
A similar time series plot shown below shows Arrivals
and Departures.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
On this plot it is difficult to separate the two time
series; they are practically on top of each other.
Perhaps this is not so bad because this tells us that
the store is checking out customers approximately as
quickly as they are arriving.
A somewhat more efficient way to understand the
time series behavior is to use pivot tables.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis - continued
A pivot table for Average InitialWaiting by Hour of Day
is generated. This table can be seen on the next
slide.
To create this table, we:
– drag the InitialWaiting variable to the Data area, express it as
an average
– drag StartTime variable to the Row area
– drag the Day variable to the Page area
Finally, we use the data in the pivot table to create a
time series plot.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Average Initial Waiting by Hour
of Day
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
Note how the variable in the Page area works.
Depending on which day we select in the page area
the corresponding table and time series plot changes.
Similarly, a pivot table can be created, accompanied
by a column chart. Average Arrivals by TimeInterval
of Day can be analyzed with a pivot table by dragging
Arrival to the Data area, expressing it as an average,
dragging the TimeInterval variable to the Row area,
and dragging the day variable to the Page area. This
table is is on the next slide. You can check for the
patterns on each day.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Average Arrivals by Time
Interval of Days
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis
The manager of R&P is ultimately interested in
whether the “right” number of checkout stations are
available throughout the day.
We can create two scatterplots to provide some
evidence. The plots follow this slide.
The first of these would be a scatterplot of Checkers
versus TotalCustomers. The TotalCustomers variable
is calculated as the sum of InitialWaiting and the
Arrivals to measure the total amount of work
presented to the checkout stations in any half-hour
period.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Checkers versus
Total Customers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of End Waiting versus
Checkers
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
From the scatterplot we can see an obvious positive
relationship between these two variables.
Evidently management is reacting as they should - it
is opening more checkout stations when there is
more traffic.
The second scatterplot shows EndWaiting versus
Checkers. There is again a definite upward trend.
Periods when more checkout stations are open tend
to be associated with periods where more customers
still remain in the checkout process.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis - continued
Presumably, management is reacting with more open
checkout stations in busy periods, but it is not
reacting strongly enough.
Just fiddling with the numbers in the Checker column
will not solve the manager's problems.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Conclusions
Two problems exist. First, there is a trade-off between
the “cost” of having customers wait in line and the
cost of paying extra checkout people. This is a
difficult tradeoff for any supermarket manager.
Second, the number of departures is clearly related
to the number of checkout stations open. Therefore it
doesn’t make sense to change the numbers in the
Checkers column without changing the numbers in
the Departures (and hence InitialWaiting and
EndWaiting) column in an appropriate way. This is
not an easy problem.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Applying the Tools
Background Information
The HyTex Company is a direct marketer of
stereophonic equipment, personal computers and
other electronic products.
HyTex advertises entirely by mailing catalogs to its
customers, and all of its orders are taken over the
telephone.
The company spends a great deal of money on its
catalog mailings, and it wants to be sure that this is
paying off in sales.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
CATALOGS.XLS
This file contains the data that has been collected on
1000 customers at the end of the current year.
For each customer it has data on the following
variables:
– Age: coded as 1 for 30 or younger, 2 for 31 to 55, 3 for 56
and older.
– Gender: coded as 1 for males, 2 for females
– OwnHome: coded as 1 if customer owns a home, 2
otherwise
– Married: coded as 1 if customer is currently married, 2
otherwise
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
CATALOGS.XLS -- continued
– Close: coded as 1 if customers lives reasonably close to a
shopping area that sells similar merchandise, 2 otherwise
– Salary: combined annual salary of customer and spouse (if
any)
– Children: number of children living with customer
– History: coded as “NA” if customer had no dealings with the
company before this year, 1 if customer was a low spending
customer last year, 2 if medium-spending, 3 if high-spending
– Catalogs: Number of catalogs sent to the customer this year
– AmountSpent: Total amount of purchases made by the
customer this year
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis Desired
HyTex wants to analyze these data carefully to
understand its customers better.
They want to see whether they are sending the
catalogs to the right customers.
Each customer either receives 6, 12, 18, or 24
catalogs through the mail. Currently which customer
receives which amount is not thought out carefully.
They want to know if the current distribution of
catalogs is effective. Is there room for improvement?
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis
HyTex is obviously interested in the AmountSpent
variable. Therefore, it makes sense to create
scatterplots of AmountSpent versus selected
“explanatory” variables.
First we create the scatterplot shown on the next
slide showing AmountSpent versus Salary.
– It is clear that customers with higher salaries tend to spend
more, although the variability in amounts spent increases
significantly as salary increases.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount Spent
versus Salary
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Analysis -- continued
Second, we create a scatterplot of AmountSpent
versus Catalogs
– This plot shows that there is some tendency toward higher
spending among customers who receive more catalogs.
– But do the catalogs cause more spending, or are more
catalogs sent to customers who would tend to spend more
anyway? There is no way to answer this with this data.
Next is the scatterplot of AmountSpent versus
Children
– This plot shows the interesting tendency of customers with
more children to spend less.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount Spent
versus Catalogs
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Scatterplot of Amount Spent
versus Children
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis
Pivot tables and accompanying charts are very useful
in this type of situation. We can begin by using this
technique to get a better understanding of the
demographics of the customers.
The first pivot table and chart shows the percentage
of an age group who own homes. Using the pivot
table we can see how these percentages change for
women, unmarried men and so on.
– Specifically small percentages of the younger people own
their own hoe, regardless of marital status or gender.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Percent of Home Owners versus
Age, Married and Gender
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
The second pivot table shows the percentage of each
age group who are married, for any combination of
the Gender and OwnHome variables.
– For example, we can check that the married/unmarried split
is quite different for women who don’t own a home than for
male home owners.
The third pivot table shows the average Salary
broken down by Age and Gender, with page variables
for OwnHome and Married.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Percent of Married versus Age,
OwnHome and Gender
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Average Salary versus Age,
Gender, Married and OwnHome
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis --continued
– The shape of the resulting charts is practically the same for
any combination of the page variables. However, the heights
of the bars change appreciably. For example, salaries are
higher for the married home owners than for unmarried
customers who are not home owners.
Another pivot table and chart break the data down in
another way. Each column in the pivot table shows
the percentages in the various History categories for
a particular number of children. Each of these
columns corresponds to one of the bars in the
“stacked” bar chart. We have also used Close as a
page variable.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Percentages in History Categories
versus Children and Close
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
Two interesting points emerge from this plot.
– First, customers with more children tend to be more heavily
represented in the low-spending History category.
– Second, you can check by changing the setting of the Close
variable from 1 to 2, the percentage of high spenders among
customers who live far from electronic stores is much higher
than for those who live close to such stores.
The next pivot table provide insight into how HyTex
determined its catalog mailing distributions. Each row
shows a percentage of a particular History category
that were sent 6,12,18 or 24 catalogs.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Catalog Distribution versus
History
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
The company’s distribution policy is still unclear but
we can see that it definitely sends more catalogs to
high spending customers and fewer to low spending
customers.
Finally the last pivot table shows AmountSpent
versus History and Catalogs, with a variety of
demographic variables in the page area. There are
so many possible combinations so it can be difficult
to discover all the existing patterns
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
Average Amount Spent versus
History, Catalogs and
Demographic Variables
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11
More Analysis -- continued
One things stands out loud and clear from the graph:
the more customers receive, the more they tend to
spend.
Also, if they were large spenders last year, they tend
to be large spenders this year.
In a pivot table with this many combinations there will
almost certainly be some combinations with no
observations.
3.2 | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11