Applied Quantitative Methods III. MBA course Montenegro


Applied Quantitative Methods
MBA course Montenegro
Peter Balogh
PhD
6. Measures of dispersion
• In the previous part of the presentation we
considered several measures of the typical, or
average value.
• The mean is widely regarded as the most important
descriptive statistic.
• When references are made to the average time or
the average weight or the average cost it is
generally the mean that has been calculated.
• Knowledge of the mean, the median and the mode
will increase our understanding of the data but will
not provide a sufficient understanding of the
differences in the data.
6. Measures of dispersion
• In many applications it is the differences that are of
particular interest to us.
• In market research, for example, we are interested
not only in the typical values but also in whether
opinions or behaviours are fairly consistent or vary
considerably.
• A niche market is defined by difference.
• Quality control, whether in the manufacturing or
the service sector, is concerned with difference
from the expected.
6. Measures of dispersion
• In this part I introduce ways of measuring this
variability, or dispersion, and then consider ways of
comparing different distributions.
• Measures of dispersion can be absolute
(considering only one set of data at a time and
giving an answer in the original units e.g. £'s,
minutes, years), or relative (giving the answer as a
percentage or proportion and allowing direct
comparison between distributions).
6.1 The standard deviation
• The standard deviation is the most widely used measure of
dispersion, since it is directly related to the mean.
• If you choose the mean as the most appropriate measure of
central location, then the standard deviation would be the
natural choice for a measure of dispersion.
• Unlike the mean, the standard deviation is not so well
known and does not have the same intuitive meaning.
• The standard deviation measures differences from the mean
- a larger value indicating a larger measure of overall
variation.
• The standard deviation will also be in the same units as the
mean (£s, minutes, years) and a change of units (e.g. from
£s to dollars, or metres to centimetres) will change the
value.
6.1 The standard deviation
• The application of computer packages will generally
make the determination of the standard deviation a
relatively straightforward procedure, but it is worth
checking what version of the formula is being used (the
divisor can be n or n - 1).
• I will continue to follow the practice of showing the
calculations by hand, as you may still need to do them.
• Such calculations do have the additional advantage of
showing how the standard deviation is related to the
mean.
• The standard deviation is particularly important in the
development of statistical theory, since most statistical
theory is based on distributions described by their
mean and standard deviation.
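As a quick way of checking which divisor a package uses, the sketch below compares the two versions in Python's standard `statistics` module. The data values are made up for illustration and do not come from the slides.

```python
import statistics

# Illustrative values only (not from the slides); their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Divisor n: treats the data as the whole population.
sd_n = statistics.pstdev(data)

# Divisor n - 1: the usual choice when the data is a sample.
sd_n_minus_1 = statistics.stdev(data)

print(f"divisor n:     {sd_n:.3f}")
print(f"divisor n - 1: {sd_n_minus_1:.3f}")
```

The n - 1 version is always slightly larger, and the gap shrinks as the number of observations grows.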
6.1.1 Untabulated data
• We have already seen how to calculate the mean from simple data.
• We will need this calculation of the mean before we calculate the standard
deviation.
• We can again use the first 10 observations on the number of cars entering a
car park in 10-minute intervals:
10 22 31 9 24 27 29 9 23 12
• The mean of this data is 19.6 cars.
• The differences about the mean are shown diagrammatically in Figure 6.2.
• To the left of the mean the differences are negative and to the right of the
mean the differences are positive.
• It can be seen, for example, that the observation 9 is 10.6 units below the
mean, a deviation of -10.6.
• The sum of these differences is zero - check this by adding all the
deviations.
• This summing of deviations to zero illustrates the physical interpretation of
the mean as being the centre of gravity with the observations as a number
of 'weights in balance'.
6.1.1 Untabulated data
• To calculate the standard deviation we follow six steps:
– Compute the mean: $\bar{x}$
– Calculate the differences from the mean: $(x - \bar{x})$
– Square these differences: $(x - \bar{x})^2$
– Sum the squared differences: $\sum (x - \bar{x})^2$
– Average the squared differences to find variance: $\frac{\sum (x - \bar{x})^2}{n}$
– Square root variance to find standard deviation: $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
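The six steps can be followed one at a time in a short Python sketch, using the 10 car park observations quoted above. The divisor n is assumed, and the variable names are illustrative.

```python
import math

# First 10 observations of cars entering the car park (from the text).
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]
n = len(cars)

mean = sum(cars) / n                      # step 1: the mean
deviations = [x - mean for x in cars]     # step 2: differences from the mean
squared = [d ** 2 for d in deviations]    # step 3: square the differences
sum_squared = sum(squared)                # step 4: sum the squared differences
variance = sum_squared / n                # step 5: average them (divisor n)
std_dev = math.sqrt(variance)             # step 6: square root of the variance

print(f"mean = {mean:.1f}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```

With these observations the deviations sum to zero (apart from rounding), the variance is 68.44 and the standard deviation is about 8.27 cars.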




6.1.2 Tabulated discrete data
• Table 6.2, showing the number of working days lost
by employees in the last quarter, typifies the
tabulation of discrete data.
6.1.2 Tabulated discrete data
• We need to allow for the fact that 410 employees
lost no days, 430 lost one day and so on by
including frequency in our calculations.
• In this example there are 1440 employees in total
and we need to include 1440 squared differences.
• The formula for the standard deviation becomes
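Assuming the divisor n used in Section 6.1.1, and writing f for the frequency with which each value x occurs (so that n = Σf), the frequency-weighted form can be written as:

```latex
\bar{x} = \frac{\sum f x}{\sum f},
\qquad
\sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}
```

Each squared difference is counted f times, which is how all 1440 squared differences enter the calculation.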
6.1.3 Tabulated (grouped) data
• When data is presented as a grouped frequency
distribution we must determine whether it is
discrete or continuous (as this will affect the way
we view the range of values) and determine the
mid-points.
• Once the mid-points have been determined we
proceed as before using mid-point values for x and
frequencies, as shown in Table 6.4.
• The approach shown clearly illustrates how the
standard deviation summarizes differences, but
would be extremely tedious to perform by hand.
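The mid-point approach is easy to hand over to a computer. The sketch below uses a small hypothetical grouped table (it is not Table 6.4) and assumes the divisor n.

```python
import math

# Hypothetical grouped continuous data, NOT Table 6.4:
# (lower boundary, upper boundary, frequency)
groups = [(0, 10, 4), (10, 20, 10), (20, 30, 6)]

# The mid-point of each group stands in for every value in that group.
midpoints = [(low + high) / 2 for low, high, _ in groups]
freqs = [f for _, _, f in groups]
n = sum(freqs)

mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / n
std_dev = math.sqrt(variance)

print(f"n = {n}, mean = {mean}, sd = {std_dev}")
```

For this made-up table the mean is 16 and the standard deviation is 7. Both are estimates, because the mid-points only approximate the values inside each group.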
6.1.3 Tabulated (grouped) data
• Some algebraic manipulation of the formula given in Section 6.1.2
will provide a simplified formula that is easier to work with for
both calculations by hand and the construction of spreadsheets.
• The simplified formula is usually presented as follows:
• The formula does lose its intuitive appeal but is easier to use.
• Formulae of this kind can be presented in a variety of ways. Using a
formula presented in different ways should not be a problem.
What you do need to be sure about are the stages required in the
calculations (e.g. what columns to add) and the assumptions being
made (e.g. is n or (n - 1) being used as the divisor?). The use of this
simplified formula is illustrated in Table 6.5.
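A common way of writing the simplified formula, assuming the divisor n = Σf, is sketched below. It follows from expanding the squared differences, since Σf(x − x̄)² = Σfx² − n x̄².

```latex
\sqrt{\frac{\sum f x^2}{\sum f} - \left( \frac{\sum f x}{\sum f} \right)^2}
```

In a hand calculation or spreadsheet this needs only three column totals: f, fx and fx².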
6.1.4 The variance
• The variance is the squared value of the standard
deviation, and therefore is calculated easily once
the standard deviation is known.
• It is sometimes used as a descriptive measure of
dispersion or variability rather than the standard
deviation, but its importance lies in more advanced
statistical theory.
• As we will see, you can add the variances of independent
variables but you cannot add their standard deviations.
• Variance is mentioned here for completeness.
6.2 Other measures of dispersion
• While the standard deviation is the most widely
used measure of dispersion, it is not the only one.
• As we saw when looking at measures of location,
different measures (mean, median and mode) are
appropriate for different situations and the same is
true for measures of dispersion.
• Furthermore, some of the measures of dispersion
are specifically linked to certain measures of
location and it would not make sense to mix and
match the statistics.
6.2.1 The range
• The range is the most easily understood measure of
dispersion as it is the difference between the highest
and lowest values.
• If we were again concerned with the 10 observations:
10 22 31 9 24 27 29 9 23 12
the range would be 22 cars (31 - 9).
• It is, however, a rather crude measure of spread, being
dependent on the two most extreme observations.
• It is also highly unstable as new data is added.
• If this measure is to be used, it may well be better to
quote the highest and lowest figure, rather than the
difference.
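A short sketch of the range for the same 10 observations, which also shows how a single new extreme value changes it. The extra observation of 60 is made up for illustration.

```python
# First 10 observations of cars entering the car park (from the text).
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]

lowest, highest = min(cars), max(cars)
data_range = highest - lowest
print(f"lowest = {lowest}, highest = {highest}, range = {data_range}")

# One additional extreme observation (hypothetical) changes the range sharply.
with_extra = cars + [60]
new_range = max(with_extra) - min(with_extra)
print(f"range after adding 60: {new_range}")
```

Quoting `lowest` and `highest` alongside the difference keeps the information that the range alone throws away.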
6.2.1 The range
• The range has, however, found a number of specialist
applications, particularly in quality control (range
charts).
• When dealing with data presented as a frequency
distribution we will not always know exactly the highest
and lowest values, only the group they lie in.
• If the groups are open-ended (e.g. 60 and more), then
any values used will merely be based on assumptions
that we have made about the widths of the groups.
• In such cases there seems little point in quoting either
the range or the extreme values.
6.2.2 The quartile deviation
• If we are able to quote a half-way value, the median,
then we can also quote quarter-way values, the
quartiles.
• These are order statistics like the median and can be
determined in the same way.
• With untabulated data or tabulated discrete data it will
merely be a case of counting through the ordered data
set until we are a quarter of the way through and three
quarters of the way through and noting the values; this
will give the first quartile and third quartile,
respectively.
• When working with tabulated continuous data, further
calculations are necessary.
• Consider for example the data given in Table 6.6 (see
Table 5.6 for the determination of the median).
6.2.2 The quartile deviation
• The lower quartile (referred to as Q1), will
correspond to the value one-quarter of the way
through the data, the 11th ordered value:
• and the upper quartile (referred to as Q3) to the
value three-quarters of the way through the data,
the 33rd ordered value:
The graphical method
• To estimate any of the order statistics graphically,
we plot cumulative frequency against the value to
which it refers, as shown in Figure 6.4.
• The value of the lower quartile is £12 and the value
of the upper quartile is £25 (to an accuracy of the
nearest £1 which the scale of this graph allows).
Calculation of the quartiles
• We can adapt the median formula (see Section
5.1.3) as follows:
$l + i \left( \frac{O - F}{f} \right)$
• where O is the order value of interest, l is the lower
boundary of the corresponding group, i is the width of
this group, F is the cumulative frequency up to this
group, and f is the frequency in this group.
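The calculation can be sketched as a small Python function, assuming the interpolation takes the usual form l + i(O − F)/f. The values of F and f in the example call are placeholders chosen for illustration and are not taken from Table 6.6.

```python
def order_value(O, l, i, F, f):
    """Estimate the value at ordered position O by linear interpolation.

    O: order value of interest, l: lower boundary of the group,
    i: group width, F: cumulative frequency up to the group,
    f: frequency within the group.
    """
    return l + i * (O - F) / f


# Group '£10 but under £15' and the 11th ordered value, as in the text.
# F=8 and f=6 are HYPOTHETICAL placeholders, not the figures in Table 6.6.
q1 = order_value(O=11, l=10, i=5, F=8, f=6)
print(f"estimated value = £{q1:.2f}")
```

The same function serves for the median, either quartile or any percentile. Only the order value O and the group it falls in change.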
Calculation of the quartiles
• The lower quartile will lie in the group '£10 but
under £15' and can be calculated thus:
• The upper quartile will lie in the group '£20 but
under £30' and can be calculated thus:
Calculation of the quartiles
• The quartile range is the difference between the
quartiles:
• and the quartile deviation (or semi-interquartile
range) is the average difference:
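In symbols, with Q1 and Q3 denoting the lower and upper quartiles, the two measures can be written as:

```latex
\text{quartile range} = Q_3 - Q_1,
\qquad
\text{quartile deviation} = \frac{Q_3 - Q_1}{2}
```

Using the graphical estimates of £12 and £25 read from Figure 6.4, this gives a quartile range of about £13 and a quartile deviation of about £6.50. The calculated quartiles will give slightly different figures.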
Calculation of the quartiles
• As with the range, the quartile deviation may be
misleading.
• If the majority of the data is towards the lower end of
the range, for example, then the third quartile will be
considerably further above the median than the first
quartile is below it, and when we average the
difference of the two numbers we will disguise this
difference.
• This is likely to be the case with a country's personal
income distribution.
• In such circumstances, it would be preferable to quote
the actual values of the two quartiles, rather than the
quartile deviation.
6.2.3 Percentiles
• The formula given in Section 6.2.2 for an order value, O, can
be used to find the value at any position in a grouped
frequency distribution of continuous data.
• For data sets that are not skewed to one side or the other, the
statistics we have calculated so far will usually be sufficient,
but heavily skewed data sets will need further statistics to
fully describe them.
• Examples would include some income distributions, wealth
distributions and times taken to complete a complex task.
• In such cases, we may want to use the 95th percentile, i.e. the
value below which 95% of the data lies.
• Any other value between 1 and 99 could also be calculated.
• An example of such a calculation is shown in Table 6.7.
6.2.3 Percentiles
• For this wealth distribution, the first quartile and
the median are both zero.
• The third quartile is £4347.83.
• None of these statistics adequately describes the
distribution.
• To calculate the 95th percentile, we find 95% of the
total frequency, here
0.95 x 26 700 = 25365
and this is the item whose value we require.
6.2.3 Percentiles
• It will be in the group labelled 'under £100 000'
which has a frequency of 800 and a width of
50 000 (i.e. 100000 - 50 000).
• Using the formula, we have:
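Assuming the order-value formula from Section 6.2.2 has the form l + i(O − F)/f, the substitution can be sketched with the figures quoted above. F, the cumulative frequency up to £50 000, has to be read from Table 6.7 and is left as a symbol here.

```latex
P_{95} = 50\,000 + 50\,000 \times \frac{25\,365 - F}{800}
```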
6.2.4 Back to raw data
• So far this chapter has taken us from individual numbers (raw
data) through ordered data to grouped data, looking at the
methods used to find the measures of dispersion.
• The previous chapter did the same for measures of location.
• However, the idea of grouping the data developed when
calculation had to be done by hand, or at least using slide rules and calculators.
• It was the only practical method when large amounts of data
were being analysed.
• Now we have computers and suitable software, which can
deal with huge amounts of data very quickly and easily,
without having to make assumptions about an even spread of
data within each group, or guessing what the highest or
lowest value was.
6.2.4 Back to raw data
• Add to this that most data starts life as individual
bits of raw data, and you can see that most of the
descriptive statistics we have been discussing can
be found very easily, provided someone has
recorded them electronically.
• An example using Excel is shown as Figure 6.5.
• An example of the output from SPSS is shown as
Figure 6.6.
• If you are trying to describe secondary data for
which you only have tabulated data, then, of
course, you have to go back to the methods we
have been discussing.
6.3 Relative measures of dispersion
• All of the measures of dispersion described earlier
in this chapter have dealt with a single set of data.
• In practice, it is often important to compare two or
more sets of data, maybe from different areas, or
data collected at different times.
• In Part 4 we look at formal methods of comparing
the difference between sample observations, but
the measures described in this section will enable
some initial comparisons to be made.
• The advantage of using relative measures is that
they do not depend on the units of measurement of
the data.
6.3.1 Coefficient of variation
• This measure calculates the standard deviation
from a set of observations as a percentage of the
arithmetic mean:
$\frac{\text{standard deviation}}{\text{mean}} \times 100\%$
• Thus the higher the result, the more variability
there is in the set of observations.
6.3.1 Coefficient of variation
• If, for example, we collected data on personal
incomes for two different years, and the results
showed a coefficient of variation of 89.4% for the
first year, and 94.2% for the second year, then we
could say that the amount of dispersion in personal
income data had increased between the two years.
• Even if there has been a high level of inflation
between the two years, this will not affect the
coefficient of variation, although it will have meant
that the average and standard deviation for the
second year are much higher, in absolute terms,
than the first year.
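The point about inflation can be checked with a small sketch. The incomes below are made up, and inflation is assumed to raise every income by the same proportion, which is the case in which the coefficient of variation is left exactly unchanged.

```python
import statistics


def coefficient_of_variation(data):
    """Standard deviation (divisor n) as a percentage of the mean."""
    return statistics.pstdev(data) / statistics.mean(data) * 100


# Made-up incomes for illustration only.
incomes = [12_000, 15_000, 18_000, 22_000, 60_000]

# Assumption: every income rises by the same 50%.
inflated = [x * 1.5 for x in incomes]

cv_before = coefficient_of_variation(incomes)
cv_after = coefficient_of_variation(inflated)

print(f"sd before = {statistics.pstdev(incomes):.0f}, "
      f"sd after = {statistics.pstdev(inflated):.0f}")
print(f"CV before = {cv_before:.1f}%, CV after = {cv_after:.1f}%")
```

The mean and the standard deviation both rise by 50%, so their ratio does not move.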
6.3.2 Coefficient of skewness
• Skewness of a set of data relates to the shape of the
histogram which could be drawn from the data.
• The type of skewness present in the data can be
described by just looking at the histogram, but it is
also possible to calculate a measure of skewness so
that different sets of data can be compared.
• Three basic histogram shapes are shown in Figure
6.7, and a formula for calculating skewness is shown
below.
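One widely used measure is Pearson's coefficient of skewness, 3 × (mean − median) / standard deviation. The sketch below assumes that formula, which may differ from the one intended on the slide, and applies it to the car park observations from Section 6.1.1.

```python
import statistics


def pearson_skewness(data):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)  # divisor n, as in Section 6.1.1
    return 3 * (mean - median) / sd


# First 10 observations of cars entering the car park (from the text).
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]
skew = pearson_skewness(cars)
print(f"skewness = {skew:.2f}")
```

A positive result indicates a long tail to the right, a negative result a long tail to the left, and a value near zero a roughly symmetrical distribution. For the car data the mean (19.6) lies below the median (22.5), so the coefficient is negative.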
6.3.2 Coefficient of skewness
• A typical example of the use of the coefficient of
skewness is in the analysis of income data.
• If the coefficient is calculated for gross income
before tax, then the coefficient gives a large positive
result since the majority of income earners receive
relatively low incomes, while a small proportion of
income earners receive high incomes.
• When the coefficient is calculated for the same
group of earners using their after tax income, then,
although a positive result is still obtained, its size
has decreased.
6.3.2 Coefficient of skewness
• These results are typical of a progressive tax system,
such as that in the UK.
• Using such calculations it is possible to show that
the distribution of personal incomes in the UK has
changed over time.
• A discussion of whether or not this change in the
distribution of personal incomes is good or bad will
depend on your economic and political views; the
statistics highlight that the change has occurred.
6.4 Variability in sample data
• We would expect the results of a survey to identify
differences in opinions, income and a range of other
factors.
• The extent of these differences can be summarized
by an appropriate measure of dispersion (standard
deviation, quartile deviation, range).
• Market researchers, in particular, seek to explain
differences in attitudes and actions of distinct
groups within a population.
• It is known, for example, that the propensity to buy
frozen foods varies between different groups of
people.
6.4 Variability in sample data
• As a producer of frozen foods you might be
particularly interested in those most likely to buy
your products.
• Supermarkets of the same size can have very
different turnover figures and a manager of a
supermarket may wish to identify those factors
most likely to explain the differences in turnover.
• A number of clustering algorithms have been
developed in recent years that seek to explain
differences in sample data.
6.4 Variability in sample data
• As an example, consider the following
algorithm or procedure that seeks to explain
the differences in the selling prices of houses:
• 1 Calculate the mean and a measure of
dispersion for all the observations in your
sample. In this example we could calculate the
average price and the range of prices (Figure
6.8).
6.4 Variability in sample data
• It can be seen from the range that there is
considerable variability in price relative to the
average price.
• Usually the standard deviation would be
preferred to the range as a measure of dispersion
for this type of data.
• 2 Decide which factors explain most of the
difference (range) in price, for example, location,
house-type, number of bedrooms.
• If location is considered particularly important,
we can divide the sample on that basis and
calculate the chosen descriptive statistics (Figure
6.9).
6.4 Variability in sample data
• In this case we have chosen to segment the sample by
location, areas X and Y.
• The smaller range within the two new groups indicates
that there is less variability of house prices within
areas.
• We could have divided the sample by some other factor
and compared the reduction in the range.
• 3 Divide the new groups and again calculate the
descriptive statistics. We could divide the sample a
second time on the basis of house-type (Figure 6.10).
• 4 The procedure can be continued in many ways with
many splitting criteria.
• A more sophisticated version of this procedure is
known as the automatic interaction detection (AID)
technique.
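The splitting procedure can be sketched in a few lines of Python. The house records below are invented for illustration and are not the data behind Figures 6.8 to 6.10. The helper names `summarise` and `split_by` are likewise hypothetical.

```python
from collections import defaultdict

# Invented records: (price in £000s, location, house type).
houses = [
    (95, "X", "flat"), (110, "X", "flat"),
    (150, "X", "detached"), (170, "X", "detached"),
    (210, "Y", "flat"), (230, "Y", "flat"),
    (290, "Y", "detached"), (320, "Y", "detached"),
]


def summarise(prices):
    """Return (mean, range) for a list of prices."""
    return sum(prices) / len(prices), max(prices) - min(prices)


def split_by(records, key_index):
    """Group the prices by the field at position key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[0])
    return dict(groups)


# Step 1: describe the whole sample.
overall = summarise([h[0] for h in houses])

# Step 2: split on one factor (location) and describe each group.
by_location = {k: summarise(v) for k, v in split_by(houses, 1).items()}

print("all houses:", overall)
for area, stats in by_location.items():
    print(f"area {area}:", stats)
```

In this made-up sample the range falls from 225 overall to 75 and 110 within areas X and Y, so location accounts for much of the variability. Calling `split_by(houses, 2)` instead would test house type as the splitting factor.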
Case 2: using measures of difference and performance
• Managers are likely to meet a number of measures
of difference and increasingly also various measures
of performance (benchmarking, for instance, has
become an important management tool, where
targets are determined using the performance of
the 'best' organizations on certain measures).
• Managers need to be able to respond to this type of
information with insight and confidence.
Case 2: using measures of difference and performance
• It is important for managers to clarify what these
measures mean in business terms and what the
underlying assumptions are.
• In the same way that you don't need to be an
accountant to use accounting information, you don't
need to be a statistician to use statistical information.
• Managers should look for a business understanding in
the information they are given and develop responses
that allow their organization to interpret and apply
such information.
• Knowing the assumptions will reveal some of the
thinking of those that devised them.
• Management is a process that involves a judgement as
to what is appropriate and when.