Numerical Descriptive Measures


Chapter 4: Numerical Descriptive Techniques
Introduction
 Recall Chapter 2, where we used graphical techniques to
describe data:
 While this histogram provides some new insight, other
interesting questions (e.g. what is the class average? what
is the mark spread?) go unanswered.
2007 Department of Accounting Information, Statistics (I), lecture slides
Numerical Descriptive Techniques
• Measures of Central Location: Mean, Median, Mode
• Measures of Variability: Range, Standard Deviation, Variance, Coefficient of Variation
• Measures of Relative Standing: Percentiles, Quartiles
• Measures of Linear Relationship: Covariance, Correlation, Least Squares Line
4.1 Measures of Central Location
• Usually, we focus our attention on two types of measures when describing population characteristics:
  - Central location (e.g. average)
  - Variability or spread
• The measure of central location reflects the locations of all the actual data points.
• How?
  - With one data point, clearly the central location is at the point itself.
  - With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
  - But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
• This is the most popular and useful measure of central location:
$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$
Notation
• When referring to the number of observations in a population, we use the uppercase letter N.
• When referring to the number of observations in a sample, we use the lowercase letter n.
• The arithmetic mean for a population is denoted with the Greek letter “mu”: $\mu$ (population mean).
• The arithmetic mean for a sample is denoted with an “x-bar”: $\bar{x}$ (sample mean).
Statistics is a pattern language

Measure    Population    Sample
Size       N             n
Mean       $\mu$         $\bar{x}$

The Arithmetic Mean
• Sample mean, where n is the sample size:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Population mean, where N is the population size:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
The Arithmetic Mean
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$$
• Example 4.2
Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$$
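As a quick check of Example 4.1, the definition of the mean can be sketched in a few lines of Python. The function name `arithmetic_mean` is an illustrative choice, not something taken from the slides.

```python
# Internet hours reported by the 10 adults in Example 4.1
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(hours)
print(sample_mean)  # 11.0
```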
The Arithmetic Mean
• Additional Example
When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees was recorded as follows:

Number of children per family    0    1    2    3
Number of families               3    4    7    2    (16 employees in total)

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$
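The frequency-table calculation above can be sketched as follows: each value is weighted by how often it occurs. Storing the table as a dictionary named `frequencies` is an assumption made for illustration.

```python
# Frequency table: number of children -> number of families
frequencies = {0: 3, 1: 4, 2: 7, 3: 2}

# Total number of observations is the sum of the frequencies
n = sum(frequencies.values())

# Each value contributes (value x frequency) to the overall sum
total = sum(value * count for value, count in frequencies.items())

mean_children = total / n
print(n, mean_children)  # 16 1.5
```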
The Arithmetic Mean
 …is appropriate for describing measurement data,
e.g. heights of people, marks of student papers, etc.
 …is seriously affected by extreme values called
“outliers”. E.g. as soon as a billionaire moves into a
neighborhood, the average household income
increases beyond what it was previously!
The Median
• The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle observations, (8 + 9)/2 = 8.5.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle observation, 8.
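The even and odd cases of Example 4.3 can be sketched in Python as below. The helper `median` is written out by hand here only to make the two cases visible; it is an illustration rather than a prescribed implementation.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


ten_adults = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
nine_adults = [t for t in ten_adults if t != 33]  # drop the longest time

median_even = median(ten_adults)   # 8.5
median_odd = median(nine_adults)   # 8
print(median_even, median_odd)
```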
The Mode
• The mode of a set of observations is the value that occurs most frequently.
• A set of data may have one mode (or modal class), or two or more modes.
• For large data sets the modal class is much more relevant than a single-value mode.
[Figure: histogram with the modal class marked.]
The Mode
 Example 4.5
Find the mode for the data in Example 4.1. Here are the data
again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution



All observations except “0” occur once. There are two “0”s.
Thus, the mode is zero.
Is this a good measure of central location?
The value “0” does not reside at the center of this set
(compare with the mean = 11.0 and the median = 8.5).
The Mode

Additional example
 The manager of a men’s store observes the waist size (in
inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34,
30, 34, 32, 40.
 The mode of this data set is 34 in.
This information seems to be valuable (for example, for the design of a new display in the store), much more than “the median is 33.5 in.”
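The waist-size example can be checked with Python's standard `statistics` module. This is a minimal sketch; the second data set (`two_peaks`) is made up purely to show what a multi-modal result looks like.

```python
from statistics import median, multimode

# Waist sizes (in inches) of trousers sold yesterday
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

modes = multimode(waist_sizes)       # list of all most-frequent values
waist_median = median(waist_sizes)   # average of the two middle values

# Hypothetical data with two equally frequent values
two_peaks = multimode([1, 1, 2, 2, 3])

print(modes, waist_median, two_peaks)  # [34] 33.5 [1, 2]
```

Returning a list of modes makes the bimodal case explicit instead of silently reporting only one value.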
Measures of Central Location
• The mode is useful for all data types, though it is mainly used for nominal data.
Note: Sample and population modes are computed the same way.
=MODE(range) in Excel
• Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one.
• You will have to use other techniques (e.g. a histogram) to determine if your data is bimodal, trimodal, etc.
The Mean, Median and Mode
 Additional example
A professor of statistics wants to report the results of a midterm exam,
taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.
• The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.
• The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his mark relative to other students in the class.
• The mode must be used when data are nominal. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”) and a negatively skewed distribution (“skewed to the left”), each with the positions of the mode, median and mean marked.]
Mean, Median, Mode
 If data are symmetric, the mean, median, and mode
will be approximately the same.
 If data are multimodal, report the mean, median
and/or mode for each subgroup.
 If data are skewed, report the median.
Mean, Median, & Modes for Ordinal & Nominal Data
 For ordinal and nominal data the calculation of the
mean is not valid.
 Median is appropriate for ordinal data.
 For nominal data, a mode calculation is useful for
determining highest frequency but not “central
location”.
The Geometric Mean
• This is a measure of the average growth rate.
• Let $R_i$ denote the rate of return in period i (i = 1, 2, …, n). The geometric mean of the returns $R_1, R_2, \dots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
The Geometric Mean
• For the given series of rates of return, the nth period return is calculated by:
$$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$$
• If the rate of return was $R_g$ in every period, the nth period return would be calculated by:
$$(1 + R_g)^n$$
• $R_g$ is selected such that the two are equal, $(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, so
$$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$$
Finance Example
• Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50% from $2,000 back to the original $1,000 in the second year. What is your average return?
• Using the arithmetic mean, we have
$$\frac{100\% + (-50\%)}{2} = 25\%$$
• Earning 25% in each of the two years would leave us with $1,000 × (1.25)^2 = $1,562.50 at the end of our investment, not $1,000.
• Solving for the geometric mean yields a rate of 0%, which is correct:
$$R_g = \sqrt[n]{\prod_{i=1}^{n} (1 + R_i)} - 1 = \sqrt{(1 + 1)(1 - 0.5)} - 1 = 0$$
The upper case Greek letter “Pi” represents a product of terms.
The Geometric Mean
• Additional Example
A firm’s sales were $1,000,000 three years ago. Sales have grown annually by 20%, 10%, -5%. Find the geometric mean rate of growth in sales.
• Solution
Since Rg is the geometric mean,
$$(1 + R_g)^3 = (1 + 0.2)(1 + 0.1)(1 - 0.05) = 1.2540$$
Thus,
$$R_g = \sqrt[3]{(1 + 0.2)(1 + 0.1)(1 - 0.05)} - 1 = 0.0784, \text{ or } 7.84\%.$$
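Both geometric-mean examples can be reproduced with a short Python sketch. The function name `geometric_mean_return` is an illustrative choice; returns are entered as decimals (1.00 for 100%, -0.50 for a 50% loss).

```python
import math


def geometric_mean_return(returns):
    """Constant per-period rate giving the same terminal wealth as the actual returns."""
    growth = math.prod(1 + r for r in returns)  # (1 + R1)(1 + R2)...(1 + Rn)
    return growth ** (1 / len(returns)) - 1


# Finance example: +100% in year 1, then -50% in year 2
finance_rg = geometric_mean_return([1.00, -0.50])

# Sales example: growth of 20%, 10% and -5%
sales_rg = geometric_mean_return([0.20, 0.10, -0.05])

print(finance_rg)          # 0.0
print(round(sales_rg, 4))  # 0.0784
```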
Measures of Central Location: Summary
• Compute the Mean to describe the central location of a single set of interval data.
• Compute the Median to describe the central location of a single set of interval or ordinal data.
• Compute the Mode to describe a single set of nominal data.
• Compute the Geometric Mean to describe a single set of interval data based on growth rates.
4.2 Measures of variability
 Measures of central location fail to tell the whole story
about the distribution.
 A question of interest still remains unanswered:
How much are the observations spread out
around the mean value?
4.2 Measures of variability
Observe two hypothetical data sets:
• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.
Measures of Variability
 Measures of central location fail to tell the whole
story about the distribution; that is, how much are the
observations spread out around the mean value?
For example, two sets of class
grades are shown. The mean
(=50) is the same in each case…
But, the red class has greater
variability than the blue class.
The Range
• The range of a set of observations is the difference between the largest and smallest observations.
• But how do all the observations spread out between the smallest and the largest observation? The range cannot assist in answering this question.
Range
 The range is the simplest measure of variability, calculated
as:
 Range = Largest observation – Smallest observation
 E.g.

Data: {4, 4, 4, 4, 50}          Range = 46
Data: {4, 8, 15, 24, 39, 50}    Range = 46
The range is the same in both cases, but the data sets have very different distributions…
Range
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• We therefore need a measure of variability that incorporates all the data and not just two observations. Hence…
Variance
• Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures.
• Population variance is denoted by $\sigma^2$ (lower case Greek letter “sigma” squared).
• Sample variance is denoted by $s^2$ (lower case “s” squared).
Statistics is a pattern language

Measure     Population    Sample
Size        N             n
Mean        $\mu$         $\bar{x}$
Variance    $\sigma^2$    $s^2$
The Variance
• This measure reflects the dispersion of all the observations.
• The variance of a population of size N, $x_1, x_2, \dots, x_N$, whose mean is $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• The variance of a sample of n observations $x_1, x_2, \dots, x_n$ whose mean is $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The Variance
• Example 4.7
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
• Solution
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2\right] = 33.2 \text{ jobs}^2$$
The Variance – Shortcut method
$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{6 - 1}\left[\left(17^2 + 15^2 + \dots + 13^2\right) - \frac{(17 + 15 + \dots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$$
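The definition and the shortcut formula should give the same sample variance for the data of Example 4.7. A minimal Python sketch, with variable names chosen here for illustration:

```python
# Number of jobs the six students applied for (Example 4.7)
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)
mean = sum(jobs) / n

# Definition: sum of squared deviations from the mean, divided by n - 1
var_definition = sum((x - mean) ** 2 for x in jobs) / (n - 1)

# Shortcut: uses only the sum of squares and the square of the sum
var_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(mean, var_definition, var_shortcut)  # 14.0 33.2 33.2
```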
Why not use the sum of deviations?
Consider two small populations:
• A: 8, 9, 10, 11, 12
• B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
• A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
• B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0
The sum of deviations is zero for both populations and is therefore not a good measure of dispersion.
The Variance
Let us calculate the variance of the two populations:
$$\sigma_A^2 = \frac{(8 - 10)^2 + (9 - 10)^2 + (10 - 10)^2 + (11 - 10)^2 + (12 - 10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4 - 10)^2 + (7 - 10)^2 + (10 - 10)^2 + (13 - 10)^2 + (16 - 10)^2}{5} = 18$$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
The Variance
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
[Figure: dot plots of data set A (ten observations at the values 1 and 3, mean 2) and data set B (two observations, 1 and 5, mean 3).]
Let us calculate the sum of squared deviations for both data sets.
$$\text{Sum}_A = (1 - 2)^2 + \dots + (1 - 2)^2 + (3 - 2)^2 + \dots + (3 - 2)^2 = 10$$
$$\text{Sum}_B = (1 - 3)^2 + (5 - 3)^2 = 8$$
Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:
$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$
$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$
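The argument for dividing by the number of observations can be sketched in Python. The exact make-up of data set A is an assumption: the slides only imply ten observations at the values 1 and 3 with mean 2, so five of each is used here.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations from the mean (not divided by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def population_variance(values):
    """Sum of squared deviations on a per-observation basis."""
    return sum_squared_deviations(values) / len(values)


# Assumed composition of set A: five observations at 1 and five at 3
set_a = [1] * 5 + [3] * 5
set_b = [1, 5]

print(sum_squared_deviations(set_a), sum_squared_deviations(set_b))  # 10.0 8.0
print(population_variance(set_a), population_variance(set_b))        # 1.0 4.0
```

The raw sums rank A above B only because A has more observations; the variances rank B above A, which matches the picture.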
Standard Deviation
• The standard deviation of a set of observations is the square root of the variance.
• Sample standard deviation: $s = \sqrt{s^2}$
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Statistics is a pattern language

Measure               Population    Sample
Size                  N             n
Mean                  $\mu$         $\bar{x}$
Variance              $\sigma^2$    $s^2$
Standard Deviation    $\sigma$      $s$
Standard Deviation
 Example 4.8



To examine the consistency of shots for a new innovative
golf club, a golfer was asked to hit 150 shots, 75 with a
currently used (7-iron) club, and 75 with the new club.
The distances were recorded.
Which 7-iron is more consistent?
Standard Deviation
• Example 4.8 – solution
Excel printout, from the “Descriptive Statistics” sub-menu:

                      Current      Innovation
Mean                  150.5467     150.1467
Standard Error        0.668815     0.357011
Median                151          150
Mode                  150          149
Standard Deviation    5.792104     3.091808
Sample Variance       33.54847     9.559279
Kurtosis              0.12674      -0.88542
Skewness              -0.42989     0.177338
Range                 28           12
Minimum               134          144
Maximum               162          156
Sum                   11291        11261
Count                 75           75

The innovation club is more consistent, and because the means are close, is considered a better club.
Standard Deviation
• Additional Example
Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
Standard Deviation
• Solution
Let us use the Excel printout that is run from the “Descriptive Statistics” sub-menu.

                      Fund A    Fund B
Mean                  16        12
Standard Error        5.295     3.152
Median                14.6      11.75
Mode                  #N/A      #N/A
Standard Deviation    16.74     9.969
Sample Variance       280.3     99.37
Kurtosis              -1.34     -0.46
Skewness              0.217     0.107
Range                 49.1      30.6
Minimum               -6.2      -2.8
Maximum               42.9      27.8
Sum                   160       120
Count                 10        10

Fund A should be considered riskier because its standard deviation is larger.
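The comparison of the two funds does not require Excel. The sketch below uses Python's standard `statistics` module, whose `stdev` and `variance` functions divide by n - 1, the same sample convention as the printout.

```python
from statistics import mean, stdev, variance

# Rates of return (%) over the past 10 years, as listed in the example
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

sd_a, sd_b = stdev(fund_a), stdev(fund_b)
var_b = variance(fund_b)

print(f"Fund A: mean={mean(fund_a):.2f}, sd={sd_a:.2f}")
print(f"Fund B: mean={mean(fund_b):.2f}, sd={sd_b:.2f}, variance={var_b:.2f}")

# The fund with the larger standard deviation is treated as the riskier one
riskier = "Fund A" if sd_a > sd_b else "Fund B"
print(riskier)
```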
Interpreting Standard Deviation
• The standard deviation can be used to
  - compare the variability of several distributions
  - make a statement about the general shape of a distribution.
• The empirical rule: If a sample of observations has a bell-shaped distribution, the interval
  - $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
  - $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
  - $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements
The Empirical Rule
Approximately 68% of all observations fall
within one standard deviation of the mean.
Approximately 95% of all observations fall
within two standard deviations of the mean.
Approximately 99.7% of all observations fall
within three standard deviations of the mean.
Interpreting Standard Deviation
 Example 4.9
A statistics practitioner wants to describe the way
returns on investment are distributed.



The mean return = 10%
The standard deviation of the return = 8%
The histogram is bell shaped.
Interpreting Standard Deviation
Example 4.9 – solution
• The empirical rule can be applied (bell shaped histogram).
• Describing the return distribution:
  - Approximately 68% of the returns lie between 2% and 18%: [10 – 1(8), 10 + 1(8)]
  - Approximately 95% of the returns lie between -6% and 26%: [10 – 2(8), 10 + 2(8)]
  - Approximately 99.7% of the returns lie between -14% and 34%: [10 – 3(8), 10 + 3(8)]
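The three intervals of Example 4.9 follow mechanically from the mean and standard deviation. A small Python sketch, in which the function name `empirical_rule_intervals` is made up for illustration and the result is only meaningful for bell-shaped data:

```python
def empirical_rule_intervals(mean, sd):
    """Map k -> (lower, upper, approximate % of observations) for k = 1, 2, 3."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}


# Example 4.9: mean return 10%, standard deviation 8%
intervals = empirical_rule_intervals(10, 8)
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of returns lie between {low}% and {high}%")
```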
Chebysheff’s Theorem
• A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped).
• The proportion of observations in any sample that lie within k standard deviations of the mean is at least:
$$1 - \frac{1}{k^2} \quad \text{for } k > 1$$
• For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to the Empirical Rule’s approximation (95%).
Chebysheff’s Theorem
• The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for k > 1.
• This theorem is valid for any set of measurements (sample, population) of any shape!

k    Interval                            Chebysheff                    Empirical Rule
1    $(\bar{x} - s, \bar{x} + s)$        at least 0% (1 - 1/1^2)       approximately 68%
2    $(\bar{x} - 2s, \bar{x} + 2s)$      at least 75% (1 - 1/2^2)      approximately 95%
3    $(\bar{x} - 3s, \bar{x} + 3s)$      at least 89% (1 - 1/3^2)      approximately 99.7%
Chebysheff’s Theorem
• Example 4.10
The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
• Solution
  - At least 75% of the salaries lie between $22,000 and $34,000: [28000 – 2(3000), 28000 + 2(3000)]
  - At least 88.9% of the salaries lie between $19,000 and $37,000: [28000 – 3(3000), 28000 + 3(3000)]
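Example 4.10 can be reproduced with a short Python sketch of the bound 1 - 1/k². The helper names `chebysheff_bound` and `k_sd_interval` are illustrative, not part of the slides.

```python
def chebysheff_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def k_sd_interval(mean, sd, k):
    """Interval reaching k standard deviations either side of the mean."""
    return (mean - k * sd, mean + k * sd)


# Example 4.10: mean salary $28,000, standard deviation $3,000
for k in (2, 3):
    low, high = k_sd_interval(28_000, 3_000, k)
    print(f"at least {chebysheff_bound(k):.1%} of salaries lie between {low} and {high}")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the histogram, which is why it applies to the positively skewed salary data.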
Coefficient of Variation
• The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean, that is:
• Population coefficient of variation: $CV = \sigma / \mu$
• Sample coefficient of variation: $cv = s / \bar{x}$
Statistics is a pattern language

Measure                     Population    Sample
Size                        N             n
Mean                        $\mu$         $\bar{x}$
Variance                    $\sigma^2$    $s^2$
Standard Deviation          $\sigma$      $s$
Coefficient of Variation    CV            cv
Coefficient of Variation
 This coefficient provides a
proportionate measure of variation, e.g.
 A standard deviation of 10 may be perceived as
large when the mean value is 100, but only
moderately large when the mean value is 500.
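The point about proportionate variation can be made concrete with a two-line calculation. In this Python sketch the function name is an illustrative choice and the figures are the ones quoted above.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return sd / mean


# The same standard deviation of 10 against two different means
cv_mean_100 = coefficient_of_variation(10, 100)  # 0.1, i.e. 10% of the mean
cv_mean_500 = coefficient_of_variation(10, 500)  # 0.02, i.e. 2% of the mean
print(cv_mean_100, cv_mean_500)
```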
Measures of Variability
 If data are symmetric, with no serious outliers, use
range and standard deviation.
 If comparing variation across two data sets, use
coefficient of variation.
 The measures of variability introduced in this section
can be used only for interval data.
4.3 Measures of Relative Standing and Box Plots
• Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.
• Percentile
  - The pth percentile of a set of measurements is the value for which p percent of the observations are less than that value and (100-p) percent of all the observations are greater than that value.
  - Example: Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.
Quartiles
• We have special names for the 25th, 50th, and 75th percentiles, namely quartiles.
• The first or lower quartile is labeled Q1 = 25th percentile.
• The second quartile, Q2 = 50th percentile (which is also the median).
• The third or upper quartile, Q3 = 75th percentile.
• We can also convert percentiles into quintiles (fifths) and deciles (tenths).
Commonly Used Percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
• Note: If your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam. It means that 80% of your peers scored lower than you on the exam; it’s about your position relative to others.
Quartiles
 Example
Find the quartiles of the following set of
measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4,
10, 21, 5, 8
Quartiles: Solution
Sort the observations (15 observations):
2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
The first quartile
• At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
• At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right hand side.
• The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
Location of Percentiles
• Find the location of any percentile using the formula
$$L_P = (n + 1)\frac{P}{100}$$
where $L_P$ is the location of the Pth percentile.
• Example 4.11
Calculate the 25th, 50th, and 75th percentiles of the data in Example 4.1.
Location of Percentiles
• Example 4.11 – solution
After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
$$L_{25} = (10 + 1)\frac{25}{100} = 2.75$$
The 25th percentile is three quarters of the distance between the second observation (0) and the third observation (5), that is 0 + .75(5 – 0) = 3.75.
$$L_{50} = (10 + 1)\frac{50}{100} = 5.5$$
The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.
$$L_{75} = (10 + 1)\frac{75}{100} = 8.25$$
The 75th percentile is one quarter of the distance between the eighth and ninth observations, that is 14 + .25(22 – 14) = 16.
• Please remember…
Lp determines the position in the data set where the percentile value lies, not the value of the percentile itself. Here position 2.75 corresponds to the value 3.75, and position 8.25 corresponds to the value 16:
0 0 | 5 7 8 9 12 14 | 22 33
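The location formula and the interpolation step of Example 4.11 can be sketched in Python as follows. The function name `percentile` and the clamping of locations that fall outside the data are assumptions added for illustration; the slides do not discuss that edge case.

```python
def percentile(values, p):
    """Pth percentile using the location L_P = (n + 1) * P / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # a position in the data, not a value
    lower = int(location)          # whole part: observation just below
    fraction = location - lower    # how far to move toward the next observation
    if lower < 1:                  # assumed handling: clamp to the smallest value
        return ordered[0]
    if lower >= n:                 # assumed handling: clamp to the largest value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
q1, q2, q3 = (percentile(hours, p) for p in (25, 50, 75))
print(q1, q2, q3)  # 3.75 8.5 16.0
```

Other software may use a different location convention, so quartiles reported elsewhere can differ slightly from these values.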
Quartiles and Variability
• Quartiles can provide an idea about the shape of a histogram.
[Figure: positions of Q1, Q2 and Q3 in a positively skewed histogram and in a negatively skewed histogram.]
Interquartile Range
• The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:
Interquartile range = Q3 – Q1
• This is a measure of the spread of the middle 50% of the observations.
• A large value indicates a large spread of the observations.
Box Plot
• This is a pictorial display that provides the main descriptive measures of the data set:
  - L: the largest observation
  - Q3: the upper quartile
  - Q2: the median
  - Q1: the lower quartile
  - S: the smallest observation
[Figure: box plot running from S to L, with a box from Q1 to Q3 divided at Q2 and a whisker on each side; each whisker extends at most 1.5(Q3 – Q1) from the box.]
Box Plot
• Example 4.14 (Xm02-01)
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, …
Smallest = 0
Q1 = 9.275
Median = 26.905
Q3 = 84.9425
Largest = 119.63
IQR = 75.6675
Outliers = ()
Left hand boundary = 9.275 – 1.5(IQR) = -104.226
Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
[Figure: box plot of the bills with the box from 9.275 to 84.9425, the median at 26.905, whiskers reaching 0 and 119.63, and boundaries at -104.226 and 198.4438.]
No outliers are found.
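The boundary calculation used to flag outliers can be sketched in Python. The helper name `outlier_boundaries` is illustrative; the quartiles are taken as given from the example rather than recomputed from the raw bills, which are not listed in full.

```python
def outlier_boundaries(q1, q3):
    """Left and right boundaries: 1.5 interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Example 4.14: quartiles of the telephone bills
left, right = outlier_boundaries(9.275, 84.9425)
print(round(left, 3), round(right, 4))

# Smallest and largest bills both fall inside the boundaries: no outliers
smallest, largest = 0, 119.63
has_outliers = smallest < left or largest > right
print(has_outliers)  # False
```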
Box Plot
• Additional Example - GMAT scores
Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS).
GMAT: 512, 531, 461, 515, …
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)
Left hand boundary = 512 – 1.5(IQR) = 417.5
Right hand boundary = 575 + 1.5(IQR) = 669.5
[Figure: box plot of the GMAT scores with the box from 512 to 575, the median at 537, the smallest score at 449, and outliers beyond 669.5 up to 788.]
Box Plot
GMAT - continued
[Figure: box plot with 25% of the scores between 449 and Q1 = 512, 50% between Q1 = 512 and Q3 = 575 (Q2 = 537), and 25% above Q3; the right whisker ends at 669.5.]
Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.
• The histogram is positively skewed.
Box Plot
• Example 4.15 (Xm04-15)
A study was organized to compare the quality of service in 5 drive-through restaurants. Interpret the results.
• Example 4.15 – solution
[Figure: Minitab box plots of service times (horizontal axis C6, from 100 to 300) for 1 = Popeyes, 2 = Wendy’s, 3 = McDonalds, 4 = Hardee’s, 5 = Jack in the Box (vertical axis C7). Annotations mark one distribution of times as symmetric and another as positively skewed.]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the shortest and most consistent.
4.4 Measures of Linear Relationship
• We now present two numerical measures of linear relationship that provide information as to the strength and direction of a linear relationship between two variables (if one exists).
• They are the covariance and the coefficient of correlation.
• Covariance: is there any pattern to the way two variables move together?
• Coefficient of correlation: how strong is the linear relationship between two variables?
Covariance
• Population covariance:
$$COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
• Sample covariance:
$$cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.
Covariance
• In much the same way there was a “shortcut” for calculating sample variance without having to calculate the sample mean, there is also a shortcut for calculating sample covariance without having to first calculate the mean:
$$cov(x, y) = \frac{1}{n - 1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$
Statistics is a pattern language

Measure                     Population       Sample
Size                        N                n
Mean                        $\mu$            $\bar{x}$
Variance                    $\sigma^2$       $s^2$
Standard Deviation          $\sigma$         $s$
Coefficient of Variation    CV               cv
Covariance                  $\sigma_{xy}$    $s_{xy}$
Covariance
• Compare the following three sets ($\bar{x} = 5$ and $\bar{y} = 20$ in each):

Set 1
x_i    y_i    (x – x̄)    (y – ȳ)    (x – x̄)(y – ȳ)
2      13     -3         -7         21
6      20     1          0          0
7      27     2          7          14
Cov(x,y) = 17.5

Set 2
x_i    y_i    (x – x̄)    (y – ȳ)    (x – x̄)(y – ȳ)
2      27     -3         7          -21
6      20     1          0          0
7      13     2          -7         -14
Cov(x,y) = -17.5

Set 3
x_i    y_i    (x – x̄)    (y – ȳ)    (x – x̄)(y – ȳ)
2      20     -3         0          0
6      27     1          7          7
7      13     2          -7         -14
Cov(x,y) = -3.5
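The three covariances can be verified with a short Python sketch of the sample covariance formula. The function name `sample_covariance` is an illustrative choice.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
cov_set1 = sample_covariance(x, [13, 20, 27])  # Y rises with X
cov_set2 = sample_covariance(x, [27, 20, 13])  # Y falls as X rises
cov_set3 = sample_covariance(x, [20, 27, 13])  # no particular pattern
print(cov_set1, cov_set2, cov_set3)  # 17.5 -17.5 -3.5
```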
Covariance Illustrated
• Consider the following three sets of data (textbook §4.5)
In each set, the values of X are the same, and the values of Y are the same; the only thing that’s changed is the order of the Y’s.
In set #1, as X increases so does Y; Sxy is large & positive
In set #2, as X increases, Y decreases; Sxy is large & negative
In set #3, as X increases, Y doesn’t move in any particular way; Sxy is “small”
Covariance (Generally speaking)
• When two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number.
• When two variables move in opposite directions, the covariance is a large negative number.
• When there is no particular pattern, the covariance is a small number (close to zero).
Covariance
[Figure: the (X, Y) plane divided into four quadrants by the lines x = μ_X and y = μ_Y.]

Quadrant    (x – μ_X)    (y – μ_Y)    Contribution to COV(X, Y)
I           > 0          > 0          > 0
II          < 0          > 0          < 0
III         < 0          < 0          > 0
IV          > 0          < 0          < 0

[Figure: three scatter diagrams illustrating COV(X, Y) > 0, COV(X, Y) < 0 and COV(X, Y) ≈ 0.]
The coefficient of correlation
• Population coefficient of correlation (Greek letter “rho”):
$$\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$$
• Sample coefficient of correlation:
$$r = \frac{cov(X, Y)}{s_x s_y}$$
This coefficient answers the question: How strong is the association between X and Y?
Statistics is a pattern language

Measure                       Population       Sample
Size                          N                n
Mean                          $\mu$            $\bar{x}$
Variance                      $\sigma^2$       $s^2$
Standard Deviation            $\sigma$         $s$
Coefficient of Variation      CV               cv
Covariance                    $\sigma_{xy}$    $s_{xy}$
Coefficient of Correlation    $\rho$           $r$
Coefficient of Correlation
The advantage of the coefficient of correlation over
covariance is that it has fixed range from -1 to +1, thus:
If the two variables are very strongly positively related, the
coefficient value is close to +1 (strong positive linear
relationship).
If the two variables are very strongly negatively related, the
coefficient value is close to -1 (strong negative linear
relationship).
No straight line relationship is indicated by a coefficient
close to zero.
Coefficient of Correlation
[Figure: scale for ρ or r running from -1 to +1.]
• ρ or r = +1: strong positive linear relationship, COV(X,Y) > 0
• ρ or r = 0: no linear relationship, COV(X,Y) = 0
• ρ or r = -1: strong negative linear relationship, COV(X,Y) < 0
The coefficient of correlation and the covariance
Example 4.16
 Compute the covariance and the coefficient of
correlation to measure how GMAT scores and
GPA in an MBA program are related to one
another.
 Solution

We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y
The coefficient of correlation and the covariance
Example 4.16

Student    x        y        x^2          y^2       xy
1          599      9.6      358801       92.16     5750.4
2          689      8.8      474721       77.44     6063.2
3          584      7.4      341056       54.76     4321.6
4          631      10       398161       100       6310
…
11         593      8.8      351649       77.44     5218.4
12         683      8        466489       64        5464
Total      7,587    106.4    4,817,755    957.2     67,559.2

Shortcut Formulas
$$cov(x, y) = \frac{1}{n - 1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$
$$s^2 = \frac{1}{n - 1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$$

cov(x,y) = (1/(12-1))[67,559.2 - (7587)(106.4)/12] = 26.16
Sx = {(1/(12-1))[4,817,755 - (7587)^2/12]}^.5 = 43.56
Sy = 1.12 (calculated in the same way as Sx)
r = cov(x,y)/(Sx Sy) = 26.16/((43.56)(1.12)) = .5362
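The shortcut formulas need only the column totals of the table, so Example 4.16 can be reproduced without the six student rows that are not shown. A Python sketch with variable names chosen for illustration:

```python
import math

# Column totals from the Example 4.16 table (n = 12 students)
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4_817_755, 957.2, 67_559.2

# Shortcut formulas: no deviations from the mean are needed
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

s_x, s_y = math.sqrt(var_x), math.sqrt(var_y)
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Working with unrounded intermediate values gives r close to .5365; the hand calculation with Sx and Sy rounded to two decimals gives .5362.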
The coefficient of correlation and the covariance
Example 4.16 – Excel
• Use the Covariance option in Data Analysis.
• If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.
• Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix
Population values    GPA      GMAT
GPA                  1.15
GMAT                 23.98    1739.52

Multiplying each entry by 12/(12-1) gives:

Sample values        GPA      GMAT
GPA                  1.25
GMAT                 26.16    1897.66
The coefficient of correlation and the covariance
Example 4.16 – Excel
• Interpretation
  - The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
  - The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.
Least Squares Method
Recall, the slope-intercept equation for a line is expressed in these terms:
y = mx + b
Where:
m is the slope of the line
b is the y-intercept.
If we’ve determined there is a linear relationship between two variables with covariance and the coefficient of correlation, can we determine a linear function of the relationship?
The Least Squares Method
• …produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimized. This line is represented by the equation:
$$\hat{y} = b_0 + b_1 x$$
$b_0$ (“b” naught) is the y-intercept, $b_1$ is the slope, and $\hat{y}$ (“y” hat) is the value of y determined by the line.
The Least Squares Method
[Figure: scatter diagram of Y against X with candidate lines; the errors are the vertical distances between the points and a line.]
Different lines generate different errors, thus different sums of squares of errors.
There is a line that minimizes the sum of squared errors.
The Least Squares Method
• We are seeking a line that best fits the data when two variables are (presumably) related to one another.
• We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized:
$$\text{Minimize} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual y value of point i and $\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i calculated from the equation.
The Least Squares Method
The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squares of errors are calculated from the data:
$$b_1 = \frac{cov(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
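The Least Squares Method
• Example 4.17
Find the least squares line for Example 4.16 (Xm04-16.xls)
$$b_1 = \frac{cov(x, y)}{s_x^2} = \frac{26.16}{1897.66} = .0138$$
$$\bar{x} = \frac{\sum x_i}{n} = \frac{7,587}{12} = 632.25$$
$$\bar{y} = \frac{\sum y_i}{n} = \frac{106.4}{12} = 8.87$$
$$b_0 = \bar{y} - b_1 \bar{x} = 8.87 - (.0138)(632.25) = .145$$
[Figure: scatter diagram of y against x (x axis from 500 to 800) with the fitted line y = 0.1496 + 0.0138x; the intercept differs slightly from .145 because the hand calculation uses rounded values.]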
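The least squares coefficients of Example 4.17 can be computed from the same column totals used in Example 4.16. This Python sketch uses illustrative variable names and a hypothetical helper `predict`; only the totals themselves come from the slides.

```python
# Column totals from Example 4.16 (x = GMAT, y = GPA, n = 12)
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_xy = 4_817_755, 67_559.2

x_bar, y_bar = sum_x / n, sum_y / n
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)

b1 = cov_xy / var_x        # slope
b0 = y_bar - b1 * x_bar    # y-intercept


def predict(x):
    """Value of y-hat on the fitted line for a given x (hypothetical helper)."""
    return b0 + b1 * x


print(round(b1, 4), round(b0, 4))
print(round(predict(600), 2))
```

Keeping full precision until the end reproduces the intercept shown on the scatter diagram (about 0.1496) rather than the rounded hand result of .145.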