sample mean - s3.amazonaws.com

Describing Distributions
• Measures of location
• Measures of spread
• Unusual observations
• Robust measures
• Shape of the distribution
Measures of centre
•Mean
•Median
Median
• The median M is the midpoint of a distribution, the number such that half the
observations are smaller and the other half are larger.
• Procedure: order the data, then count in to the middle value. If n is even,
average the two middle values.
Median - examples
39 41 38 42 340
n=5
Order:
38 39 41 42 340
Median is 3rd number in list - 41
39 41 38 42
n=4
Order:
38 39 41 42
Median is average of 2nd and 3rd numbers in list - 40
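The procedure above (order the data, then take the middle value or the average of the two middle values) can be sketched in a few lines of Python. The function name `median_of` is chosen here for illustration only.

```python
def median_of(values):
    """Middle ordered value, or the average of the two middle values."""
    ordered = sorted(values)           # step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd n: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two

print(median_of([39, 41, 38, 42, 340]))  # 41
print(median_of([39, 41, 38, 42]))       # 40.0
```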
Mean
• If n observations are denoted by x1, x2, x3, …, xn,
their sample mean is
$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right)$$
Mean - examples
• Sample mean of 39, 41, 38, 42 is
(39 + 41 + 38 + 42)/4 = 40
Median = 40
• Sample mean of 39, 41, 38, 42, 340 is
(39 + 41 + 38 + 42+340)/5 = 100
Median = 41
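As a quick check of the two examples, the following sketch uses Python's standard `statistics` module to compute the mean and median with and without the value 340. The variable names are illustrative.

```python
from statistics import mean, median

without_outlier = [39, 41, 38, 42]
with_outlier = [39, 41, 38, 42, 340]

# The single large value moves the mean from 40 to 100,
# while the median only moves from 40 to 41.
for data in (without_outlier, with_outlier):
    print(f"data={data}  mean={mean(data)}  median={median(data)}")
```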
Measures of spread
• Sample variance and sample standard
deviation
• First and third quartiles and interquartile
range
• Range
Quartiles
Q1: First Quartile
Median of the observations less than the median
Q3: Third Quartile
Median of observations greater than the median
Example of Quartiles
Data:
1 2 3 4 5 6 7 8 9 10 11 12 13 140 200
Median =
Q1 =
Q3 =
Interquartile Range (IQR)
IQR = Q3 – Q1
Previous example:
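A minimal sketch of the quartile rule stated above, assuming that "less than the median" and "greater than the median" refer to positions in the ordered list, so the middle observation is left out of both halves when n is odd. The helper name `quartiles` is hypothetical, and other textbooks and software use slightly different quartile rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                       # size of each half; needs n >= 2
    q1 = median(ordered[:half])         # observations below the median position
    q3 = median(ordered[n - half:])     # observations above the median position
    return q1, median(ordered), q3

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 140, 200]
q1, m, q3 = quartiles(data)
print("Median =", m, " Q1 =", q1, " Q3 =", q3, " IQR =", q3 - q1)
```

Note that the two large values 140 and 200 have no effect on the IQR here.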
Sample Variance
• The sample variance is essentially the mean squared deviation
from the mean, with divisor n − 1 rather than n. The sample variance of n
observations x1, x2, x3, …, xn is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
Sample standard deviation
• The sample standard deviation s is the
square root of the variance
• It has the same unit of measurement as the
original observations
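The two definitions can be written directly as code. This is a plain-Python sketch with illustrative function names. The standard library's `statistics.variance` and `statistics.stdev` use the same n − 1 divisor.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

for data in ([38, 39, 41, 42], [38, 39, 41, 42, 340]):
    print(data, round(sample_variance(data), 2), round(sample_std(data), 2))
```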
Variance and standard deviation
Data    Deviation from mean    Square of deviation
38      -2                     4
39      -1                     1
41       1                     1
42       2                     4
Sum                            10
Mean = 40
Variance = 10/(4 − 1) = 3.33
Std dev = 1.83
Variance and standard deviation
Data    Deviation from mean    Square of deviation
38      -62                    3844
39      -61                    3721
41      -59                    3481
42      -58                    3364
340     240                    57600
Sum                            72010
Mean = 100
Variance = 72010/(5 − 1) = 18002.50
Std dev = 134.17
Example: Find the sample mean and sample standard deviation for the second set of
rainfall measurements.
xi      xi − x̄    (xi − x̄)²
31      -3         9
35       1         1
36       2         4
30      -4         16
37       3         9
35       1         1
Sum      0         40
x̄ = 34
s² = 40/(6 − 1) = 8, s = 2.8
As we suspected, the variability of the second set of rainfall figures is lower than that of
the first.
The 5-number Summary
Minimum, Q1, Median, Q3, Maximum
Represent this graphically by a boxplot
Boxplot
• A central box spans the quartiles
• A line in the box marks the median
• Observations more than 1.5 × IQR outside
the central box are plotted individually as
possible outliers
• Lines extend from the box out to the
smallest and largest observations that are
not suspected outliers
Example of a Boxplot
[Figure: boxplot of Pulse1 from the Pulse data; vertical axis from 50 to 100]
Outliers on boxplot
• inner fences: Q1 − 1.5 × IQR and Q3 + 1.5 × IQR
• outer fences: Q1 − 3 × IQR and Q3 + 3 × IQR
• Plot values between the inner and outer fences as * (possible outliers)
• and values outside the outer fences with a different symbol (probable outliers)
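The fence rules can be sketched as two small functions. The names `fences` and `classify` are illustrative, and the quartiles are passed in rather than computed, so the sketch does not depend on any particular quartile rule. The usage lines take Q1 = 13 and Q3 = 35, the values quoted in the student example of this document.

```python
def fences(q1, q3):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(x, q1, q3):
    """Label one observation relative to the inner and outer fences."""
    (inner_low, inner_high), (outer_low, outer_high) = fences(q1, q3)
    if x < outer_low or x > outer_high:
        return "probable outlier"
    if x < inner_low or x > inner_high:
        return "possible outlier"
    return "not an outlier"

print(fences(13, 35))        # inner fences are -20 and 68
print(classify(42, 13, 35))  # largest value in the student example
```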
Robust (resistant) statistic
• Outlier is a value outside the usual range
• Robust (resistant) statistic is not much
affected by outliers
• Unaffected: median, quartiles, IQR
• Affected: mean
• Most affected: standard deviation
Example: a week's travelling times on the Met
• TIMES:  36 29 29 35 184 30 34 31 30 34
• TIMES2: 36 29 29 35 40 30 34 31 30 34
Example: a week's travelling times on the Met
Variable    Mean    Median    Std dev    Q1       Q3
TIMES       47.2    32.5      48.1       29.8     35.2
TIMES2      32.8    32.50     3.61       29.75    35.25
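The summary table can be reproduced with Python's standard `statistics` module. The two lists below are reconstructed from the slide's flattened data, so their order is an assumption, though the summaries do not depend on it. The default "exclusive" method of `statistics.quantiles` interpolates between ordered values and matches the table's Q1 and Q3, which suggests the table came from software using a similar interpolation rule rather than the median-of-each-half rule.

```python
import statistics as st

# Reconstructed data: identical except for one value (184 versus 40).
times = [36, 29, 29, 35, 184, 30, 34, 31, 30, 34]
times2 = [36, 29, 29, 35, 40, 30, 34, 31, 30, 34]

def summary(data):
    """Mean, median, sample std dev and interpolated quartiles."""
    q1, _, q3 = st.quantiles(data, n=4)   # default method="exclusive"
    return {
        "mean": st.mean(data),
        "median": st.median(data),
        "std": st.stdev(data),
        "q1": q1,
        "q3": q3,
    }

for name, data in (("TIMES", times), ("TIMES2", times2)):
    print(name, {k: round(v, 2) for k, v in summary(data).items()})
```

Replacing 184 by 40 leaves the median and quartiles unchanged, shifts the mean from 47.2 to 32.8, and cuts the standard deviation from about 48 to about 3.6.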
The Pulse Data
Students in an introductory statistics course participated
in a simple experiment. Each student recorded his or her
height, weight, gender, smoking preference, usual
activity level, and resting pulse. Then they all flipped
coins, and those whose coins came up heads ran in place
for one minute. Then the entire class recorded their
pulses once more.
Column   Name       Description
A        Pulse1     First pulse rate
B        Pulse2     Second pulse rate
C        Ran        1 = ran in place, 2 = did not run in place
D        Smokes     1 = smokes regularly, 2 = does not smoke regularly
E        Sex        1 = male, 2 = female
F        Height     Height in inches
G        Weight     Weight in pounds
H        Activity   Usual level of physical activity: 1 = slight, 2 = moderate, 3 = a lot
Comparing distributions with boxplots
What is the difference between males and females?
First, find the five-number summary for Pulse1 for males and females separately, by first
sorting the Pulse1 data into sex order, recalling that 1=male 2=female.
Sex           Min    Q1    M     Q3    Max
1 = male      48     63    70    75    92
2 = female    58     66    78    86    100
Using these five-number summaries, we can easily construct side-by-side comparative
boxplots.
[Figure: side-by-side boxplots of Pulse1 by Sex (1 = male, 2 = female); vertical axis from 50 to 100]
Note how easy it is to see the higher median value for females and the greater spread.
Let’s compare the pulse rates of those who ran and those who didn’t. This time we use
the Pulse2 measurements and sort them according to the value of Ran.
[Figure: side-by-side boxplots of Pulse2 by Ran (1 = ran in place, 2 = did not run in place); vertical axis from 50 to 140]
Note: Not only is the median higher among those who ran, but the spread is also much
greater. Why?
STUDENT EXAMPLES
Example 1: Produce a boxplot for the following data:
27, 4, 13, 12, 35, 19, 33, 26, 35, 3, 41, 31, 42
Q1 = 13, M = 27, Q3 = 35
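Outliers: 1.5 × IQR = 1.5 × 22 = 33, so the fences are 13 − 33 = −20 and 35 + 33 = 68.
Outliers are values outside [−20, 68]; none of the observations fall outside this interval.
[Figure: boxplot of the data (C9); vertical axis from 0 to 40]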
What should one do about outliers?
Example: Newcomb’s measurements of the passage time of light.
What variable is being measured? Newcomb measured how long light took to travel from
his laboratory on the Potomac River to a mirror at the base of the Washington Monument
and back, a total distance of about 7400 metres. Newcomb computed the speed of light
from the travel time.
What are the units of measurement? Newcomb’s first measurement of the passage time
of light was 0.000024828 seconds, so his unit of measurement was the second.
How are the data recorded? The entries in the table look nothing like 0.000024828. Such
numbers are awkward to write and to do arithmetic with. We therefore move the decimal
point nine places to the right, giving 24828, and then record only the deviation from
24800. The table entry 28 is short for the original 0.000024828, and the entry -2 stands
for 0.000024798. This is called coding the data.
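The coding step can be expressed as a pair of functions. This is a sketch with illustrative names (`encode`, `decode`, `OFFSET`, `SCALE`), and rounding is used in `encode` to absorb floating-point error.

```python
import math

OFFSET = 24800   # value subtracted after shifting the decimal point
SCALE = 1e9      # moving the decimal point nine places to the right

def encode(seconds):
    """Turn a passage time in seconds into a coded table entry."""
    return round(seconds * SCALE) - OFFSET

def decode(entry):
    """Recover the passage time in seconds from a coded table entry."""
    return (OFFSET + entry) / SCALE

print(encode(0.000024828))  # 28
print(encode(0.000024798))  # -2
print(decode(28))           # passage time in seconds
```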
Time (coded)
-44   23   25   27   28   31   36
 -2   23   25   27   29   32   36
 16   23   25   27   29   32   36
 16   24   26   27   29   32   37
 19   24   26   28   29   32   39
 20   24   26   28   29   32   40
 21   24   26   28   30   33
 21   24   26   28   30   33
 22   25   27   28   30   34
 22   25   27   28   31   36
[Figure: boxplot of Time (coded); vertical axis from -50 to 40]
Newcomb decided to keep the -2 in his estimate of the speed of light but to remove the
-44. In fact, both values are outliers by the definitions above. Further, we expect
“errors” to be symmetric, and they are once both values are excluded.
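The claim that both values are outliers can be checked against the 1.5 × IQR rule. The sketch below assumes the 66 coded values as listed in the table and computes quartiles as medians of the lower and upper halves of the ordered data. Other quartile rules may give slightly different fences.

```python
from statistics import median

# Newcomb's 66 coded passage times, in increasing order.
times = [
    -44, -2, 16, 16, 19, 20, 21, 21, 22, 22,
    23, 23, 23, 24, 24, 24, 24, 24, 25, 25,
    25, 25, 25, 26, 26, 26, 26, 26, 27, 27,
    27, 27, 27, 27, 28, 28, 28, 28, 28, 28,
    28, 29, 29, 29, 29, 29, 30, 30, 30, 31,
    31, 32, 32, 32, 32, 32, 33, 33, 34, 36,
    36, 36, 36, 37, 39, 40,
]

ordered = sorted(times)
half = len(ordered) // 2
q1 = median(ordered[:half])                  # median of the lower half
q3 = median(ordered[len(ordered) - half:])   # median of the upper half
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences

outliers = [x for x in ordered if x < low or x > high]
print("Q1 =", q1, "Q3 =", q3, "fences =", (low, high))
print("outliers:", outliers)
```

Under these assumptions the inner fences are 13.5 and 41.5, so -44 and -2 are the only values flagged.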
Linear transformations of a variable
Example: The mean maximum daily temperature in March is 25 °C with standard
deviation 3 °C.
°F = 32 + 1.8 × °C
Hence the mean maximum daily temperature in March is 32 + 1.8 × 25 = 77 °F with
standard deviation 1.8 × 3 = 5.4 °F.
Linear transformations of a variable
Example: Distances and costs for taxis
Flagfall $2 and $1.50 per km
Cost = 2 + 1.5 × Distance
Distances (km)   8     10    12    13      15
Costs ($)        14    17    20    21.5    24.5
Mean distance 11.6 km
Mean cost $(2 + 1.5 × 11.6) = $19.40
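The taxi example can be checked numerically. The sketch below transforms each distance, then compares the summaries of the costs with what the linear rule predicts from the summaries of the distances. The variable names are illustrative.

```python
import math
import statistics as st

distances = [8, 10, 12, 13, 15]          # km
a, b = 2, 1.5                            # flagfall ($) and rate ($ per km)
costs = [a + b * d for d in distances]   # cost of each trip

# The mean and median shift and scale; the standard deviation only scales.
print("costs:", costs)
print("mean cost:", st.mean(costs), "vs", a + b * st.mean(distances))
print("median cost:", st.median(costs), "vs", a + b * st.median(distances))
print("std dev of cost:", st.stdev(costs), "vs", abs(b) * st.stdev(distances))
```

The flagfall a changes the mean and the median but has no effect on the standard deviation, which depends only on |b|.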
Linear transformation of a variable
If y = a + bx then
• $\bar{y} = a + b\bar{x}$
• $s_y = |b|\,s_x$
• $M_y = a + bM_x$ (M is the median)
• If b > 0, $(Q_1)_y = a + b(Q_1)_x$ and $(Q_3)_y = a + b(Q_3)_x$