Chapter 4
Numerical Descriptive Techniques
4.2 Measures of Central Location
Usually, we focus our attention on two types of
measures when describing population
characteristics:
Central location (e.g. average)
Variability or spread
The measure of central location
reflects the locations of all the actual
data points.
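Statistical measures used to describe the characteristics of data:
1. Central location
2. Variability
Measuring central location
A measure of central location indicates the center of the data distribution, or the common tendency of the data. Three measures are mainly used:
1. Mean
2. Median
3. Mode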
How?
With one data point, clearly the central location is at the point itself.
With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
This is the most popular and useful measure of
central location
Mean = (Sum of the observations) / (Number of observations)
The Arithmetic Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.
The Arithmetic Mean
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
• Example 4.2
Suppose the telephone bills of Example 2.1 represent
the population of measurements. The population mean is
$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$
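As a quick check of Example 4.1, the mean can be computed directly from its definition. This is a minimal sketch; the function name `arithmetic_mean` and the variable names are illustrative and not part of the slides.

```python
# Reported Internet hours for the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_hours = arithmetic_mean(hours)
print(mean_hours)  # 11.0
```

The same function applies to a population such as the 200 telephone bills of Example 4.2; only the interpretation of the result (a population mean rather than a sample mean) changes.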
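Mean (arithmetic mean)
1. The sum of all the observations divided by the number of observations.
2. The arithmetic mean is the balancing point of the data.
3. Advantage: it uses every one of the data values.
Disadvantage: it is easily affected by extreme values.
Example:
Chairman Kuo: “Ms. Lin (the accountant), please work out the average monthly salary of all the employees in our company and let me know. Thank you!” Ms. Lin answered with a smile: “One moment, please, I will work it out.” (Half an hour later) Ms. Lin: “Sir, the average monthly salary in our company is NT$35,660.”
Chairman Kuo: “Very good. Businesses are so hard to run these days, and our company still pays this well. That is quite something. Work hard, everyone, and the company will not treat you unfairly!”
Ms. Lin kept smiling, but thought to herself: “Like hell. The one who should work hard is you, and the only person the company has not treated unfairly is you.”
An average monthly salary of NT$35,660 does not sound bad for a small company. So why is Ms. Lin unhappy? She has worked as the accountant for 3 years, yet her salary is only NT$22,500. It turns out that the salaries of the company's fifteen employees are as follows:
14,500; 15,000; 16,000; 16,500; 17,000; 17,900; 18,500; 19,000; 21,000; 22,500; 25,000; 30,000; 35,000; 250,000 (Chairman Kuo)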
The Median(中位數)
The Median of a set of observations is the value that
falls in the middle when the observations are arranged
in order of magnitude.
Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33
The median is the average of the two middle observations: (8 + 9)/2 = 8.5.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22
The median is the middle (fifth) observation: 8.
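Median
After the collected data are arranged in order, the value in the middle of the sequence is the median.
(1) N is odd: the median is the value in position (N+1)/2 of the sequence.
(2) N is even: take the average of the two middle values.
At least half (50%) of the observations are greater than or equal to the median, and at least half (50%) are less than or equal to it.
The median is not affected by extreme values, but it is not easy to use for statistical inference.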
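The odd and even cases described above can be sketched in a few lines. This is an illustrative implementation, not taken from the slides; the function name `median` and the variable names are assumptions.

```python
def median(values):
    """Middle value of the sorted data; with an even count, the
    average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in 1-based terms is index n // 2.
        return ordered[mid]
    # Even n: average the two values around the center.
    return (ordered[mid - 1] + ordered[mid]) / 2


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]        # Example 4.1 data
median_ten = median(hours)                         # 10 observations
median_nine = median([h for h in hours if h != 33])  # drop the longest time

print(median_ten)   # 8.5
print(median_nine)  # 8
```

Dropping the extreme value 33 moves the median only from 8.5 to 8, which illustrates why the median is described as insensitive to extreme values.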
The Mode(眾數)
The Mode of a set of observations is the value that
occurs most frequently.
A set of data may have one mode (or modal class), or two
or more modes.
The modal class
For large data sets
the modal class is
much more relevant
than a single-value
mode.
The Mode
Example 4.5
Find the mode for the data in Example 4.1. Here are the
data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus,
the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set
(compare with the mean = 11.0 and the median = 8.5).
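Mode
The value that occurs most often among the observations in the data.
The mode is not affected by extreme values; there may be several modes or none; and it is insensitive to changes in the number or the values of the observations.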
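A small sketch of finding the mode by counting frequencies follows. The function name `modes` is illustrative; it returns a list because a data set may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]  # Example 4.1 data
print(modes(hours))            # [0]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```

When every value occurs exactly once, this sketch returns all the values; in that situation it is usually more natural to say that the data have no mode.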
Relationship among Mean, Median, and Mode
If a distribution is symmetrical, the mean, median and mode coincide.
If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
A positively skewed distribution (“skewed to the right”): Mode < Median < Mean
A negatively skewed distribution (“skewed to the left”): Mean < Median < Mode
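Comparing and choosing among the measures of central location:
1. Nominal (categorical) scale: mode
2. Ordinal scale: mode, median
3. Interval scale: mean, median, and mode can all be used
4. When a single measure cannot describe the data clearly or cannot distinguish between data sets, several measures can be used together.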
4.3 Measures of variability
Measures of central location fail to tell the whole story
about the distribution.
A question of interest still remains unanswered:
How much are the observations spread out
around the mean value?
Observe two hypothetical data sets:
Small variability: the average value provides a good representation of the observations in the data set.
Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.
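The mean, median, and mode describe the central location of the data. If two data sets have the same central location, how can we compare them?
Answer: we can go on to compare how much the two data sets differ in their degree of dispersion.
Comparing dispersion is sometimes more important than comparing central location (the mean).
Dispersion, or variability, is calculated around the mean, the median, or the mode as the center; usually the mean is used to measure how dispersed the observations are.
Measures of dispersion
1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation (CV)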
The range
The range of a set of observations is the difference
between the largest and smallest observations.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
But, how do all the observations spread out? The range cannot assist in answering this question.
[Figure: a line from the smallest observation to the largest observation, labeled Range, with question marks between the two end points]
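Range
1. R = maximum value - minimum value
2. It measures the overall dispersion by the difference between the two ends of the data.
3. In general, the larger R is, the greater the dispersion. However, it considers only the largest and smallest observations and ignores all the others, so it cannot accurately reflect or describe the data as a whole.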
The Variance
This measure reflects the dispersion of all the
observations
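The variance of a population of size N, $x_1, x_2, \dots, x_N$, whose mean is $\mu$, is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample of n observations $x_1, x_2, \dots, x_n$ whose mean is $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$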
Why not use the sum of deviations?
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero for both populations and therefore is not a good measure of dispersion.
The Variance
Let us calculate the variance of the two populations
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
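The contrast between the sum of deviations and the variance for populations A and B can be reproduced with a short sketch. The function names are illustrative, and the populations are the five-value sets A = 8, 9, 10, 11, 12 and B = 4, 7, 10, 13, 16 used in the variance calculation.

```python
def sum_of_deviations(values):
    """Sum of (x - mean); this is zero for any data set."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


A = [8, 9, 10, 11, 12]
B = [4, 7, 10, 13, 16]

print(sum_of_deviations(A), sum_of_deviations(B))      # 0.0 0.0
print(population_variance(A), population_variance(B))  # 2.0 18.0
```

The sums of deviations cannot distinguish A from B, while the variances (2 and 18) rank B as the more dispersed population.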
Why is the variance defined as
the average squared deviation?
Why not use the sum of squared
deviations as a measure of
variation instead?
After all, the sum of squared
deviations increases in
magnitude when the variation
of a data set increases!!
The Variance
Let us calculate the sum of squared deviations for both data sets.
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
[Figure: data set A plotted on an axis marked 1, 2, 3; data set B plotted on an axis marked 1, 3, 5]
The Variance
$Sum_A = (1-2)^2 + \dots + (1-2)^2 + (3-2)^2 + \dots + (3-2)^2 = 10$
$Sum_B = (1-3)^2 + (5-3)^2 = 8$
$Sum_A > Sum_B$. This is inconsistent with the observation that set B is more dispersed.
The Variance
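However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked.
Each squared deviation in A equals 1, so $Sum_A = 10$ corresponds to N = 10 observations.
$\sigma_A^2 = Sum_A / N = 10/10 = 1$
$\sigma_B^2 = Sum_B / N = 8/2 = 4$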
The Variance
Example 4.7
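The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17-14)^2 + (15-14)^2 + \dots + (13-14)^2\right] = 33.2$ jobs$^2$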
The Variance – Shortcut method
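$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
$= \frac{1}{6-1}\left[\left(17^2 + 15^2 + \dots + 13^2\right) - \frac{(17 + 15 + \dots + 13)^2}{6}\right] = 33.2$ jobs$^2$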
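The definition and the shortcut method should give the same sample variance for Example 4.7. The sketch below checks this; the function names are illustrative and not from the slides.

```python
jobs = [17, 15, 23, 7, 9, 13]  # Example 4.7 data


def sample_variance(values):
    """Definition: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut: (sum of squares - (sum)^2 / n) divided by n - 1."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


print(round(sample_variance(jobs), 4))           # 33.2
print(round(sample_variance_shortcut(jobs), 4))  # 33.2
```

The shortcut needs only the running totals of x and x squared, which is convenient for hand calculation, although the definitional form is generally less prone to rounding problems when the values are large.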
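Variance
1. The variance is always ≥ 0; if it is zero, all the observed values are identical.
2. It is well suited to statistical inference.
3. The unit of the variance is the square of the unit of the observations; this squared unit has no direct meaning and is hard to interpret.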
Standard Deviation (SD)
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Standard Deviation
Example 4.8
To examine the consistency of shots for a new
innovative golf club, a golfer was asked to hit 150
shots, 75 with a currently used (7-iron) club, and 75
with the new club.
The distances were recorded.
Which 7-iron is more consistent?
Standard Deviation
Example 4.8 – solution
Excel printout, from the “Descriptive Statistics” submenu.

Statistic            Current     Innovation
Mean                 150.5467    150.1467
Standard Error       0.668815    0.357011
Median               151         150
Mode                 150         149
Standard Deviation   5.792104    3.091808
Sample Variance      33.54847    9.559279
Kurtosis             0.12674     -0.88542
Skewness             -0.42989    0.177338
Range                28          12
Minimum              134         144
Maximum              162         156
Sum                  11291       11261
Count                75          75

The innovation club is more consistent and, because the means are close, is considered a better club.
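Standard deviation
1. The standard deviation is the square root of the variance. Because the variance is expressed in squared units that are hard to interpret, taking its square root removes this drawback; the result is called the standard deviation.
2. The unit of the standard deviation is the same as that of the original data.
3. The variance and the standard deviation are good measures of the dispersion of data and are the most commonly used.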
Interpreting Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
$(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
$(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
$(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements
Interpreting Standard Deviation
Example 4.9
A statistics practitioner wants to describe the
way returns on investment are distributed.
The mean return = 10%
The standard deviation of the return = 8%
The histogram is bell shaped.
Interpreting Standard Deviation
Example 4.9 – solution
The empirical rule can be applied (bell shaped histogram)
Describing the return distribution
Approximately 68% of the returns lie between 2% and 18%
[10 – 1(8), 10 + 1(8)]
Approximately 95% of the returns lie between -6% and 26%
[10 – 2(8), 10 + 2(8)]
Approximately 99.7% of the returns lie between -14% and 34%
[10 – 3(8), 10 + 3(8)]
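The empirical rule
If the distribution of the data is normal (mound-shaped) or bell-shaped:
1. About 68% of the data fall within one standard deviation of the mean.
2. About 95% of the data fall within two standard deviations of the mean.
3. About 99.7% of the data fall within three standard deviations of the mean.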
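The intervals of Example 4.9 follow mechanically from the mean and the standard deviation. The sketch below builds them; the function name `empirical_rule_intervals` is an illustrative choice, and the percentages are the approximate coverages stated by the empirical rule, which applies only to mound-shaped data.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower, upper, approximate % coverage)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct)
            for k, pct in coverage.items()}


# Example 4.9: mean return 10%, standard deviation 8%.
intervals = empirical_rule_intervals(10, 8)
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of returns lie between {low}% and {high}%")
```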
The Coefficient of Variation (CV)
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
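The coefficient of variation (CV) as a measure of relative dispersion
CV = standard deviation / mean
Dividing the standard deviation by the mean expresses variation in relative terms.
Measures of dispersion such as the range, the variance, and the standard deviation can only measure the absolute dispersion of the data.
When the dispersion of two data sets is compared, the variance and the standard deviation are affected by differences in the size of the means and by differences in the units of measurement.
Suppose that the operating revenue of Company A in year 83 has a mean of 3,371 and a standard deviation of 383 (both in units of 10,000 dollars). Its coefficient of variation is
$CV = \frac{383}{3371} = 0.1136$
The operating revenue of Company B in year 83 has a mean of 6,000 and a standard deviation of 400 (same units).
Comparing the relative dispersion of their revenue, which company is more stable?
Although the standard deviation of Company B's revenue is larger, its mean revenue of 6,000 is much larger than Company A's, and the two companies clearly differ in scale. To compare the relative dispersion of their revenue, we must therefore use the coefficient of variation. The coefficient of variation of B is 400/6000 = 0.0667, which is smaller than that of A. Hence the revenue of Company B is relatively less dispersed: over the 12 months of year 83 its revenue was more stable and varied less than that of Company A.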
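The comparison of the two companies can be reproduced in a few lines. This is a sketch with illustrative names; the means and standard deviations are the figures quoted above (in units of 10,000 dollars).

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a fraction of the mean."""
    return sd / mean


cv_a = coefficient_of_variation(383, 3371)  # Company A
cv_b = coefficient_of_variation(400, 6000)  # Company B

print(round(cv_a, 4))  # 0.1136
print(round(cv_b, 4))  # 0.0667
print("B is relatively more stable" if cv_b < cv_a else "A is relatively more stable")
```

Because the coefficient of variation is a ratio, it is unit free, which is what makes the comparison between companies of different size meaningful.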
4.4 Measures of Relative Standing
and Box Plots
Percentile
The pth percentile of a set of measurements is the
value for which
• p percent of the observations are less than that value
• (100 - p) percent of all the observations are greater than
that value.
Example
• Suppose your score is the 60th percentile of an SAT test.
Then 60% of all the scores lie below your score and 40% lie above it.
Quartiles
Commonly used percentiles
First (lower) decile = 10th percentile
First (lower) quartile, Q1 = 25th percentile
Second (middle) quartile, Q2 = 50th percentile
Third quartile, Q3 = 75th percentile
Ninth (upper) decile = 90th percentile
Quartiles
Example
Find the quartiles of the following set of
measurements: 7, 18, 12, 17, 29, 18, 4, 27, 30, 2,
4, 10, 21, 5, 8
Quartiles
Solution
Sort the observations
2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30
The first quartile (15 observations)
At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right hand side.
The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
Location of Percentiles
Find the location of any percentile using the formula
$L_P = (n + 1)\frac{P}{100}$
where $L_P$ is the location of the Pth percentile
Example 4.11
Calculate the 25th, 50th, and 75th percentile of the data in
Example 4.1
Location of Percentiles
Example 4.11 – solution
After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
$L_{25} = (10 + 1)\frac{25}{100} = 2.75$
The 2.75th location lies three quarters of the way from the second observation (0) to the third observation (5), which translates to the value 0 + (.75)(5 – 0) = 3.75.
Location of Percentiles
Example 4.11 – solution continued
$L_{50} = (10 + 1)\frac{50}{100} = 5.5$
The 50th percentile is halfway between the fifth
and sixth observations (in the middle between 8
and 9), that is 8.5.
Location of Percentiles
Example 4.11 – solution continued
$L_{75} = (10 + 1)\frac{75}{100} = 8.25$
The 75th percentile is one quarter of the distance between the eighth observation (14) and the ninth observation (22), that is 14 + .25(22 – 14) = 16.
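The three percentiles of Example 4.11 can be reproduced with the location formula $L_P = (n + 1)P/100$ and linear interpolation between neighboring observations. This is a sketch with an illustrative function name; the clamping to the smallest or largest observation when the location falls outside 1..n is an assumption added here, not something the slides specify.

```python
def percentile(values, p):
    """P-th percentile via the location L_P = (n + 1) * P / 100,
    interpolating linearly between neighboring sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # 1-based, possibly fractional
    lower = int(location)          # whole part of the location
    fraction = location - lower    # fractional part
    if lower < 1:                  # assumption: clamp at the smallest value
        return ordered[0]
    if lower >= n:                 # assumption: clamp at the largest value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]  # Example 4.1 data
print(percentile(hours, 25))  # 3.75
print(percentile(hours, 50))  # 8.5
print(percentile(hours, 75))  # 16.0
```

Statistical software may use a different location convention, so its quartiles can differ slightly from those produced by this formula.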
Quartiles and Variability
Quartiles can provide an idea about the shape of
a histogram
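Positively skewed histogram: Q1 and Q2 lie close together, and Q3 lies farther away to the right.
Negatively skewed histogram: Q2 and Q3 lie close together, and Q1 lies farther away to the left.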
Interquartile Range
This is a measure of the spread of the middle
50% of the observations
A large value indicates a large spread of the
observations.
Interquartile range = Q3 – Q1
Box Plot
This is a pictorial display that provides the main
descriptive measures of the data set:
• L - the largest observation
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest observation
[Figure: a box from Q1 to Q3 with a line at Q2; a whisker extends from each end of the box toward S and L, reaching at most 1.5(Q3 – Q1) beyond the box]
Box Plot
Example 4.14 (Xm02-01)
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, ...
Smallest = 0
Q1 = 9.275
Median = 26.905
Q3 = 84.9425
Largest = 119.63
IQR = 75.6675
Outliers = ()
Left hand boundary = 9.275 – 1.5(IQR) = -104.226
Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
No outliers are found: all the observations, from 0 to 119.63, lie between the two boundaries.
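The boundaries used to flag outliers in Example 4.14 can be computed from the two quartiles alone. The sketch below uses an illustrative function name, `outlier_fences`, and the quartiles reported for the telephone bills.

```python
def outlier_fences(q1, q3):
    """Return (left, right) boundaries at 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


left, right = outlier_fences(9.275, 84.9425)
print(round(left, 3), round(right, 4))

# The smallest (0) and largest (119.63) bills both fall inside the
# boundaries, so no observation is flagged as an outlier.
smallest, largest = 0, 119.63
has_outliers = smallest < left or largest > right
print(has_outliers)  # False
```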
Box Plot
Additional Example - GMAT scores
Create a box plot for the data regarding the GMAT scores of
200 applicants (see GMAT.XLS)
GMAT: 512, 531, 461, 515, ...
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)
Left hand boundary = 512 – 1.5(IQR) = 417.5
Right hand boundary = 575 + 1.5(IQR) = 669.5
Box Plot
GMAT - continued
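[Figure: box plot with a whisker from 449 to Q1 = 512 (25% of the scores), a box from Q1 = 512 through Q2 = 537 to Q3 = 575 (50%), and a whisker from 575 to 669.5 (25%)]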
Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than
537.
• About half the scores lie between 512 and 575.
• About a quarter lies below 512 and a quarter above 575.
The histogram is positively skewed: the distance from Q2 to Q3 (38) is larger than the distance from Q1 to Q2 (25), and the upper whisker is longer than the lower one.
Box Plot
Example 4.15 (Xm04-15)
A study was organized to compare the quality of
service in 5 drive through restaurants.
Interpret the results
Example 4.15 – solution
Minitab box plot
[Figure: box plots of service time (C6, axis from 100 to 300) for each restaurant (C7): 1 Popeyes, 2 Wendy’s, 3 McDonalds, 4 Hardee’s, 5 Jack in the Box]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the shortest and most consistent.
• Some of the service time distributions are symmetric; others are positively skewed.
4.5 Measures of Linear Relationship
The covariance and the coefficient of correlation
are used to measure the direction and strength
of the linear relationship between two variables.
Covariance - is there any pattern to the way two
variables move together?
Coefficient of correlation - how strong is the linear
relationship between two variables?
Covariance
Population covariance: $COV(X,Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
N is the population size.
Sample covariance: $cov(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y).
n is the sample size.
Covariance
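Compare the following three sets

Set 1 ($\bar{x} = 5$, $\bar{y} = 20$)
$x_i$   $y_i$   $(x_i - \bar{x})$   $(y_i - \bar{y})$   $(x_i - \bar{x})(y_i - \bar{y})$
2       13      -3                  -7                  21
6       20      1                   0                   0
7       27      2                   7                   14
cov(x,y) = 17.5

Set 2 ($\bar{x} = 5$, $\bar{y} = 20$)
$x_i$   $y_i$   $(x_i - \bar{x})$   $(y_i - \bar{y})$   $(x_i - \bar{x})(y_i - \bar{y})$
2       27      -3                  7                   -21
6       20      1                   0                   0
7       13      2                   -7                  -14
cov(x,y) = -17.5

Set 3 ($\bar{x} = 5$, $\bar{y} = 20$)
$x_i$   $y_i$
2       20
6       27
7       13
cov(x,y) = -3.5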
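The three covariances can be verified with a direct implementation of the sample covariance formula. The function name `sample_covariance` and the variable names are illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
cov_set1 = sample_covariance(x, [13, 20, 27])  # y rises with x
cov_set2 = sample_covariance(x, [27, 20, 13])  # y falls as x rises
cov_set3 = sample_covariance(x, [20, 27, 13])  # weaker negative pattern

print(cov_set1, cov_set2, cov_set3)  # 17.5 -17.5 -3.5
```

The same y values appear in all three sets; only their pairing with x changes, which is what moves the covariance from strongly positive to strongly negative to close to zero.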
Covariance
If the two variables move in the same direction,
(both increase or both decrease), the covariance
is a large positive number.
If the two variables move in opposite directions,
(one increases when the other one decreases),
the covariance is a large negative number.
If the two variables are unrelated, the covariance
will be close to zero.
The coefficient of correlation
Population coefficient of correlation: $\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{cov(x,y)}{s_x s_y}$
This coefficient answers the question: How strong is the association between X and Y?
The coefficient of correlation
$\rho$ or $r$ =
+1: strong positive linear relationship (COV(X,Y) > 0)
0: no linear relationship (COV(X,Y) = 0)
-1: strong negative linear relationship (COV(X,Y) < 0)
The coefficient of correlation
If the two variables are very strongly positively
related, the coefficient value is close to +1
(strong positive linear relationship).
If the two variables are very strongly negatively
related, the coefficient value is close to -1 (strong
negative linear relationship).
No straight line relationship is indicated by a
coefficient close to zero.
The coefficient of correlation and the
covariance – Example 4.16
Compute the covariance and the coefficient of
correlation to measure how GMAT scores and
GPA in an MBA program are related to one
another.
Solution
We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y
The coefficient of correlation and the
covariance – Example 4.16
Student   x       y       $x^2$       $y^2$    xy
1         599     9.6     358801      92.16    5750.4
2         689     8.8     474721      77.44    6063.2
3         584     7.4     341056      54.76    4321.6
...
10        631     10      398161      100      6310
11        593     8.8     351649      77.44    5218.4
12        683     8       466489      64       5464
Total     7,587   106.4   4,817,755   957.2    67,559.2

$cov(x,y) = \frac{1}{12-1}\left[67{,}559.2 - \frac{(7587)(106.4)}{12}\right] = 26.16$
$s_x = \left\{\frac{1}{12-1}\left[4{,}817{,}755 - \frac{(7587)^2}{12}\right]\right\}^{.5} = 43.56$
$s_y$ is computed in a similar way to $s_x$: $s_y = 1.12$
$r = \frac{cov(x,y)}{s_x s_y} = \frac{26.16}{(43.56)(1.12)} = .5362$
Shortcut Formulas
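$cov(x,y) = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$
$s_x^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$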
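Because the shortcut formulas need only column totals, the covariance and the coefficient of correlation of Example 4.16 can be recomputed from the Total row of its table, even though the individual rows are not all listed. The sketch below does this; the variable names are illustrative.

```python
import math

# Totals from the Example 4.16 table (n = 12 students).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4817755, 957.2, 67559.2

# Shortcut formulas for the sample covariance and sample variances.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Coefficient of correlation from unrounded intermediate values.
r = cov_xy / math.sqrt(var_x * var_y)

print(round(cov_xy, 2))            # 26.16
print(round(math.sqrt(var_x), 2))  # 43.56
print(round(math.sqrt(var_y), 2))  # 1.12
print(round(r, 4))
```

Carrying the unrounded intermediate values gives r of about .5365; dividing the rounded figures 26.16, 43.56 and 1.12 by hand gives .5362, so small differences in the last digits are a rounding effect.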
The coefficient of correlation and the
covariance – Example 4.16 – Excel
Use the Covariance option in Data Analysis.
If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values.
Use the Correlation option to produce the correlation matrix.
Variance-Covariance Matrix

Population values
          GPA      GMAT
GPA       1.15
GMAT      23.98    1739.52

Sample values (population values × 12/(12-1))
          GPA      GMAT
GPA       1.25
GMAT      26.16    1897.66
The coefficient of correlation and the
covariance – Example 4.16 – Excel
Interpretation
The covariance (26.16) indicates that GMAT score
and performance in the MBA program are positively
related.
The coefficient of correlation (.5365) indicates that
there is a moderately strong positive linear
relationship between GMAT and MBA GPA.
The Least Squares Method
We are seeking a line that best fits the data when two
variables are (presumably) related to one another.
We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized:
Minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $y_i$ is the actual y value of point i, and $\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i calculated from the equation of the line.
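The Least Squares Method
[Figure: scatter of points in the X-Y plane with candidate lines; the vertical distances from the points to a line are the errors]
Different lines generate different errors, and thus different sums of squared errors.
There is a line that minimizes the sum of squared errors.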
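The Least Squares Method
The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.
$b_1 = \frac{cov(x,y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
where $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$ and $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$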
The Least Squares Method
Example 4.17
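Find the least squares line for Example 4.16 (Xm04-16.xls)
$b_1 = \frac{cov(x,y)}{s_x^2} = \frac{26.16}{1897.66} = .0138$
$\bar{x} = \frac{\sum x_i}{n} = \frac{7{,}587}{12} = 632.25$
$\bar{y} = \frac{\sum y_i}{n} = \frac{106.4}{12} = 8.87$
$b_0 = \bar{y} - b_1 \bar{x} = 8.87 - (.0138)(632.25) = .145$
[Figure: Scatter Diagram of y against x (x axis from 500 to 800, y axis from 6 to 12) with the fitted line y = 0.1496 + 0.0138x]
The intercept on the diagram (0.1496) differs slightly from .145 because the hand calculation uses rounded values of $b_1$ and $\bar{y}$.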
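The least squares coefficients of Example 4.17 can be recomputed from the column totals of Example 4.16, since $b_1 = cov(x,y)/s_x^2$ and the factor 1/(n - 1) cancels. The sketch below is illustrative; the variable names and the `predict` helper are assumptions, not part of the slides.

```python
# Totals from the Example 4.16 table (n = 12 students).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_xy = 4817755, 67559.2

x_bar = sum_x / n
y_bar = sum_y / n

# Slope: ratio of the summed cross-deviations to the summed squared
# x-deviations, both obtained through the shortcut formulas.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
# Intercept: the fitted line passes through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar


def predict(x):
    """Fitted y value on the least squares line for a given x."""
    return b0 + b1 * x


print(round(b1, 4), round(b0, 4))
```

With unrounded intermediate values the intercept comes out near 0.1496, matching the line shown on the scatter diagram, while the slope rounds to .0138.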