Transcript week3
The empirical (68-95-99.7) rule
• With a bell shaped distribution,
about 68% of the data fall within a distance of 1 standard
deviation from the mean.
95% fall within 2 standard deviations of the mean.
99.7% fall within 3 standard deviations of the mean.
• What if the distribution is not bell-shaped?
There is another rule, named Chebyshev's Rule, that tells us
that there must be at least 75% of the data within 2 standard
deviations of the mean, regardless of the shape, and at least
89% within 3 standard deviations.
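The 75% and 89% figures follow from Chebyshev's bound, which guarantees that at least 1 − 1/k² of the data lie within k standard deviations of the mean for any k > 1. A minimal sketch of that arithmetic (the function name `chebyshev_bound` is ours, chosen for illustration):

```python
def chebyshev_bound(k: float) -> float:
    """Lower bound on the proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_bound(k):.1%} of the data")
```

The bound is much weaker than the 95% and 99.7% of the empirical rule because it makes no assumption about the shape of the distribution.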
Linear transformations
• A linear transformation changes the original value x into a
new variable xnew.
• xnew is given by an equation of the form
$x_{new} = a + bx$
• Example 1.19 on page 54 in IPS.
(i) A distance x measured in km can be expressed in
miles as follows: $x_{new} = 0.62x$.
(ii) A temperature x measured in degrees Fahrenheit can be
converted to degrees Celsius by
$x_{new} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x$
Effect of a Linear Transformation
• Multiplying each observation in a data set by a number b
multiplies both the measures of center (mean, median, and
trimmed means) by b and the measures of spread (range,
standard deviation and IQR) by |b|, the absolute
value of b.
• Adding the same number a to each observation in a data
set adds a to measures of center, quartiles and percentiles
but does not change the measures of spread.
• Linear transformations do NOT change the overall shape
of a distribution.
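These effects can be checked numerically. The sketch below applies the Fahrenheit-to-Celsius transformation to a small set of temperatures (hypothetical values, made up for illustration) and compares summaries before and after; the helper `summarize` is our own name.

```python
import statistics

def summarize(data):
    """Return (mean, median, standard deviation, range) for a list of numbers."""
    return (
        statistics.mean(data),
        statistics.median(data),
        statistics.stdev(data),
        max(data) - min(data),
    )

# Hypothetical temperatures in degrees Fahrenheit, made up for illustration.
x = [50.0, 59.0, 68.0, 77.0, 95.0]
a, b = -160 / 9, 5 / 9  # Fahrenheit to Celsius: xnew = a + b*x
x_new = [a + b * value for value in x]

labels = ("mean", "median", "stdev", "range")
for label, old, new in zip(labels, summarize(x), summarize(x_new)):
    print(f"{label:>6}: x = {old:8.3f}   xnew = {new:8.3f}")
```

The mean and median of xnew equal a + b times the originals, while the standard deviation and range are multiplied by |b| only; the shift a has no effect on spread.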
Measure   x      xnew = a + bx
Mean      x̄      a + b x̄
Median    M      a + bM
Mode      Mode   a + b Mode
Range     R      |b| R
IQR       IQR    |b| IQR
Stdev     s      |b| s
Example 1
• A sample of 20 employees of a company was taken and
their salaries were recorded. Suppose each employee
receives a $300 raise in salary for the next year.
State whether the following statements are true or false.
a) The IQR of the salaries will
i. be unchanged
ii. increase by $300
iii. be multiplied by $300
b) The mean of the salaries will
i. be unchanged
ii. increase by $300
iii. be multiplied by $300
Nonlinear transformations
• A very common nonlinear transformation in statistics is the
logarithm transformation.
• Recall: ln x = log_e x, where e ≈ 2.7183 is the base of the
natural logarithm.
• If measurements on a variable x have a right-skewed
distribution, the distribution of ln x is often closer to
symmetric.
• If measurements on a variable x have a left-skewed
distribution, the distribution of ln x will be even more left
skewed.
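A small simulation illustrates the effect of the log transformation on right-skewed data. This is a sketch under stated assumptions: the data are simulated from a lognormal distribution (not the sales data of the next example), and `skewness` is our own helper computing the average cubed z-score.

```python
import math
import random

def skewness(data):
    """Sample skewness: the average cubed z-score (positive means right skewed)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / n)
    return sum(((v - mean) / sd) ** 3 for v in data) / n

random.seed(1)  # fixed seed so the run is repeatable
# Simulated right-skewed data; all values are positive, so ln is defined.
x = [random.lognormvariate(0, 1) for _ in range(2000)]
log_x = [math.log(v) for v in x]

print(f"skewness of x:     {skewness(x):.2f}")
print(f"skewness of ln(x): {skewness(log_x):.2f}")
```

The skewness of the raw values is clearly positive, while the skewness of the logged values is close to 0, which is what a roughly symmetric histogram would show.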
Example 2 - Nonlinear transformations
[Figure: histogram for the sales data (Frequency against Sales, 0 to 9000) and histogram for ln(sales) (Frequency against ln(sales), 0 to 10).]
Density curves
• Using software, clever algorithms can describe a distribution
in a way that is not feasible by hand, by fitting a smooth curve
to the data in addition to or instead of a histogram. The curves
used are called density curves.
• It is easier to work with a smooth curve, because a histogram
depends on the choice of classes.
• Density curve
A density curve is a curve that
– is always on or above the horizontal axis.
– has area exactly 1 underneath it.
• A density curve describes the overall pattern of a distribution.
• The area under the curve and above any range of values is
the relative frequency (proportion) of all observations that
fall in that range of values.
• Example: The curve below shows the density curve for
scores in an exam and the area of the shaded region is the
proportion of students who score between 60 and 80.
Median and mean of Density Curve
• The median of a distribution described by a density curve
is the point that divides the area under the curve in half.
• A mode of a distribution described by a density curve is a
peak point of the curve, the location where the curve is
highest.
• Quartiles of a distribution can be roughly located by
dividing the area under the curve into quarters as
accurately as possible by eye.
Normal distributions
• An important class of density curves are the symmetric
unimodal bell-shaped curves known as normal curves. They
describe normal distributions.
• All normal distributions have the same overall shape.
• The exact density curve for a particular normal distribution is
specified by giving its mean μ and its standard deviation σ.
• The mean μ is located at the center of the symmetric curve and
is the same as the median and the mode.
• Changing μ without changing σ moves the normal curve
along the horizontal axis without changing its spread.
• The standard deviation σ controls the spread of a normal
curve.
• There are other symmetric bell-shaped density curves that
are not normal, e.g. the t distributions.
• The normal density curves are specified by a particular
function. The height of a normal density curve at any point
x is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
• Notation: A normal distribution with mean μ and standard
deviation σ is denoted by N(μ, σ).
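As a rough check, the height of a normal density curve can be evaluated directly: 1/(σ√(2π)) times e raised to −½((x − μ)/σ)². The sketch below does this for the standard normal curve and compares the result with `statistics.NormalDist.pdf` from the Python standard library; the function name `normal_density` is ours.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Height at the centre and one standard deviation away, for N(0, 1).
for x in (0.0, 1.0):
    print(x, normal_density(x, 0.0, 1.0), NormalDist(0.0, 1.0).pdf(x))
```

The curve is highest at the mean and its height there, 1/(σ√(2π)), shrinks as σ grows, which is why a larger standard deviation gives a flatter, more spread-out curve.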
The 68-95-99.7 rule
In the normal distribution with mean μ and standard deviation σ,
Approx. 68% of the observations fall within σ of the mean μ.
Approx. 95% of the observations fall within 2σ of the mean μ.
Approx. 99.7% of the observations fall within 3σ of the mean μ.
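The three percentages are rounded areas under the normal curve. A short sketch using `statistics.NormalDist` from the Python standard library shows the more precise values; the helper name `within_k` is ours.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution, N(0, 1)

def within_k(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k(k):.4f}")
```

Because standardizing maps any N(μ, σ) to N(0, 1), the same proportions hold for every normal distribution.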
Example 1.23 on p72 in IPS
• The distribution of heights of women aged 18-24 is
approximately N(64.5, 2.5), that is, normal with mean μ = 64.5
inches and standard deviation σ = 2.5 inches.
• The 68-95-99.7 rule says that the middle 95% (approx.) of
women are between 64.5 − 5 and 64.5 + 5 inches tall.
The other 5% have heights outside the range from 59.5 to 69.5
inches, and, by symmetry, 2.5% of the women are taller than
69.5 inches.
• Exercise:
1) The middle 68% (approx.) of women are between ____to ___
inches tall.
2) ___% of the women are taller than 66.75.
3) ___% of the women are taller than 72.
Standardizing and z-scores
• If x is an observation from a distribution that has mean μ and
standard deviation σ, the standardized value of
x is given by
$$z = \frac{x - \mu}{\sigma}$$
• A standardized value is often called a z-score.
• A z-score tells us how many standard deviations the original
observation falls away from the mean of the distribution.
• Standardizing is a linear transformation that transforms the data
into the standard scale of z-scores. Therefore, standardizing does
not change the shape of a distribution, but it changes the mean
to 0 and the standard deviation to 1.
Example 1.24 on p73 in IPS
• The heights of women are approximately normal with mean
μ = 64.5 inches and standard deviation σ = 2.5 inches.
• The standardized height is
$$z = \frac{\text{height} - 64.5}{2.5}$$
• The standardized value (z-score) of height 68 inches is
$$z = \frac{68 - 64.5}{2.5} = 1.4$$
or 1.4 std. dev. above the mean.
• A woman 60 inches tall has standardized height
$$z = \frac{60 - 64.5}{2.5} = -1.8$$
or 1.8 std. dev. below the mean.
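The two standardized heights can be reproduced with a one-line function. A minimal sketch; the name `z_score` is ours.

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Heights of women, approximately N(64.5, 2.5) as in the example.
for height in (68, 60):
    z = z_score(height, mu=64.5, sigma=2.5)
    side = "above" if z > 0 else "below"
    print(f"height {height}: z = {z:.1f} ({abs(z):.1f} std. dev. {side} the mean)")
```

The sign of z carries the direction: positive for observations above the mean, negative for those below.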
The Standard Normal distribution
• The standard normal distribution is the normal distribution
N(0, 1), that is, the mean μ = 0 and the standard deviation σ = 1.
• If a random variable X has normal distribution N(μ, σ), then the
standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has the standard normal distribution.
• Areas under a normal curve represent proportions of
observations from that normal distribution.
• There is no simple formula for areas under a normal curve.
Calculations use either software or a table of areas. The table
and most software calculate one kind of area: cumulative
proportions. A cumulative proportion is the proportion of
observations in a distribution that fall at or below a given value
and is also the area under the curve to the left of a given value.
The standard normal tables
• Table A gives cumulative proportions for the standard
normal distribution. The table entry for each value z is the
area under the curve to the left of z, the notation used is
P( Z ≤ z).
e.g. P( Z ≤ 1.4 ) = 0.9192
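Software gives the same cumulative proportions as the table. As a sketch, Python's standard library provides `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value; the wrapper name `cumulative_proportion` is ours.

```python
from statistics import NormalDist

def cumulative_proportion(z):
    """P(Z <= z) for the standard normal distribution N(0, 1)."""
    return NormalDist().cdf(z)

print(round(cumulative_proportion(1.4), 4))   # table entry for z = 1.4
print(round(cumulative_proportion(-1.4), 4))  # by symmetry, 1 - P(Z <= 1.4)
```

Rounded to four decimals, the first value agrees with the table entry 0.9192, and the second equals 1 − 0.9192 = 0.0808.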
Standard Normal Distribution
z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
0.7  .7580  .7611  .7642  .7673  .7703  .7734  .7764  .7794  .7823  .7852
0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
The table shows the area to the left of z under the standard normal curve.
For a negative number, −z:
Area below (−z) = Area above (z) = 1 – Area below (z)
The standard normal tables - Example
• What proportion of the observations of a N(0,1)
distribution takes values
a) less than z = 1.4 ?
b) greater than z = 1.4 ?
c) greater than z = -1.96 ?
d) between z = 0.43 and z = 2.15 ?
Properties of Normal distribution
• If a random variable Z has a N(0,1) distribution then P(Z = z) = 0.
The area under the curve above any single point is 0.
• The area between any two points a and b (a < b) under the
standard normal curve is given by
P(a ≤ Z ≤ b) = P(Z ≤ b) – P(Z ≤ a)
• As mentioned earlier, if a random variable X has a N(μ, σ)
distribution, then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has a standard normal distribution and any calculations about X
can be done using the following rules:
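• P(X = k) = 0 for all k.
$$P(X \le a) = P\left(Z \le \frac{a - \mu}{\sigma}\right)$$
$$P(X \ge b) = 1 - P\left(Z \le \frac{b - \mu}{\sigma}\right)$$
$$P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)$$
• The solution to the equation P(X ≤ k) = p is
k = μ + σ z_p
where z_p is the value z from the standard normal table that
has area (and cumulative proportion) p below it, i.e. z_p is the
pth percentile of the standard normal distribution.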
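These rules can be tried out in software. The sketch below reuses the heights distribution N(64.5, 2.5) from Example 1.24; the cut-offs 60 and 68 and the proportion p = 0.95 are chosen only for illustration, and `statistics.NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5
heights = NormalDist(mu, sigma)  # X ~ N(64.5, 2.5)
Z = NormalDist()                 # standard normal, N(0, 1)

# Forward calculations: standardize, then use cumulative proportions.
p_below_68 = Z.cdf((68 - mu) / sigma)                         # P(X <= 68)
p_above_68 = 1 - p_below_68                                   # P(X >= 68)
p_between = Z.cdf((68 - mu) / sigma) - Z.cdf((60 - mu) / sigma)  # P(60 <= X <= 68)

# Backward calculation: find k with P(X <= k) = p.
p = 0.95
z_p = Z.inv_cdf(p)      # value with cumulative proportion p under N(0, 1)
k = mu + sigma * z_p

print(round(p_below_68, 4), round(p_above_68, 4), round(p_between, 4))
print(round(z_p, 3), round(k, 2))
```

Working directly with `heights.cdf(68)` or `heights.inv_cdf(0.95)` gives the same answers; standardizing by hand is shown here because it mirrors the table-based method.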
Questions
1. The marks of STA221 students have a N(65, 15) distribution. Find
the proportion of students having marks
(a) less than 50.
(b) greater than 80.
(c) between 50 and 80.
2. Example 1.30 on page 79 in IPS:
Scores on the SAT verbal test follow approximately the
N(505, 110) distribution. How high must a student score in
order to place in the top 10% of all students taking the SAT?
3. The time it takes to complete a stat220 term test is normally
distributed with mean 100 minutes and standard deviation 14
minutes. How much time should be allowed if we wish to
ensure that at least 9 out of 10 students (on average) can
complete it? (final exam Dec. 2001)
4. General Motors of Canada has a deal: ‘an oil filter and lube job
in 25 minutes or the next one free’. Suppose that you worked for
GM and knew that the time needed to provide these services was
approximately normal with mean 15 minutes and std. dev. 2.5
minutes. How many minutes would you have recommended to
put in the ad above if it was decided that about 5 free services for
100 customers was reasonable?
5. In a survey of patients of a rehabilitation hospital the mean length
of stay in the hospital was 12 weeks with a std. dev. of 1 week.
The distribution was approximately normal.
a) Out of 100 patients how many would you expect to stay longer
than 13 weeks?
b) What is the percentile rank of a stay of 11.3 weeks?
c) What percentage of patients would you expect to be in longer
than 12 weeks?
d) What is the length of stay at the 90th percentile?
e) What is the median length of stay?
Normal quantile plots and their use
• A histogram or stemplot can reveal distinctly nonnormal
features of a distribution.
• If the stemplot or histogram appears roughly symmetric
and unimodal, we use another graph, the normal quantile
plot, as a better way of judging the adequacy of a normal
model.
• Any normal distribution produces a straight line on the
plot.
• Use of normal quantile plots:
– If the points on a normal quantile plot lie close to a
straight line, the plot indicates that the data are
approximately normal.
– Systematic deviations from a straight line indicate a
nonnormal distribution.
– Outliers appear as points that are far away from the
overall pattern of the plot.
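The horizontal coordinates of a normal quantile plot are normal scores: the z-values expected at each rank if the data were normal. The sketch below computes them under an assumption about the plotting position, (i − 3/8)/(n + 1/4) for the i-th smallest value, which is one common convention; statistical packages may use a slightly different formula. The function name `normal_scores` and the data values are ours, made up for illustration.

```python
from statistics import NormalDist

def normal_scores(data):
    """Normal score of each observation, returned in the original data order.

    Assumes the plotting position (i - 3/8) / (n + 1/4) for the i-th smallest
    value; ties are ranked in order of appearance rather than averaged.
    """
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = NormalDist().inv_cdf((rank - 0.375) / (n + 0.25))
    return scores

# Hypothetical measurements, made up for illustration.
data = [498.2, 471.5, 512.9, 505.0, 489.3, 530.1, 501.7, 494.8]
for value, score in sorted(zip(data, normal_scores(data))):
    print(f"{value:7.1f}  {score:6.3f}")
```

Plotting each value against its normal score gives the normal quantile plot; points close to a straight line suggest that a normal model is adequate.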
• Histogram, the nscores plot and the normal quantile plot
for data generated from a normal distribution (N(500, 20)).
[Figure: histogram of value (Frequency against value, 460 to 540), plot of value against nscores, and Normal Probability Plot for value (Percent against Data), with ML estimates Mean: 500.343, StDev: 17.4618.]
• Histogram, the nscores plots and the normal quantile plot
for data generated from a right skewed distribution
[Figure: histogram of value (Frequency against value, 0 to 10), plots of value against nscores and of nscores against value, and Normal Probability Plot for value (Percent against Data), with ML estimates Mean: 2.64938, StDev: 2.17848.]
• Histogram, the nscores plots and the normal quantile plot
for data generated from a left skewed distribution
[Figure: histogram of value (Frequency against value, 0.25 to 1.05), plots of value against nscores and of nscores against value, and Normal Probability Plot for value (Percent against Data), with ML estimates Mean: 0.8102, StDev: 0.161648.]
• Histogram, the nscores plots and the normal quantile plot
for data generated from a uniform distribution (0,5)
[Figure: histogram of value (Frequency against value, 0.0 to 5.0), plots of value against nscores and of nscores against value, and Normal Probability Plot for value (Percent against Data), with ML estimates Mean: 2.21603, StDev: 1.46678.]
Question
(similar to Q5 Term test Oct, 2000)
Below are 4 normal probability (quantile) plots and 4
histograms produced by MINITAB for some data sets.
The histograms are not in the same order as normal scores
plots.
Match the histograms with the nscores plots.
[Figure: four normal scores plots (data against nscores) and four histograms (Frequency against data).]
Looking at data - relationships
• Two variables measured on the same individuals are associated
if some values of one variable tend to occur more often with
some values of the second variable than with other values of that
variable.
• When examining the relationship between two or more
variables, we should first think about the following questions:
– What individuals do the data describe?
– What variables are present? How are they measured?
– Which variables are quantitative and which are categorical?
– Is the purpose of the study simply to explore the nature of
the relationship, or do we hope to show that one variable can
explain variation in the other?
Response and explanatory variables
• A response variable measures an outcome of a study. An
explanatory variable explains or causes changes in the
response variable.
• Explanatory variables are often called independent variables
and response variables are called dependent variables. The
idea behind this is that response variables depend on
explanatory variables.
• We usually call the explanatory variable x and the response
variable y.
Scatterplot
• A scatterplot shows the relationship between two
quantitative variables measured on the same individuals.
• Each individual in the data appears as a point in the plot
fixed by the values of both variables for that individual.
• Always plot the explanatory variable, if there is one, on the
horizontal axis (the x axis) of a scatterplot.
• Examining and interpreting Scatterplots
– Look for overall pattern and striking deviations from
that pattern.
– The overall pattern of a scatterplot can be described by
the form, direction and strength of the relationship.
– An important kind of deviation is an outlier, an
individual value that falls outside the overall pattern.
Example
• There is some evidence that drinking moderate amounts of wine
helps prevent heart attacks. A data set contains information on
yearly wine consumption (litres per person) and yearly deaths
from heart disease (deaths per 100,000 people) in 19 developed
nations. Answer the following questions.
• What is the explanatory variable?
• What is the response variable?
• Examine the scatterplot below.
[Figure: scatterplot of heart disease deaths (0 to 300) against wine consumption (0 to 9).]
• Interpretation of the scatterplot
– The pattern is fairly linear with a negative slope. No outliers.
– The direction of the association is negative. This means that
higher levels of wine consumption are associated with lower
death rates.
– This does not mean there is a causal effect. There could be
lurking variables. For example, higher wine consumption could
be linked to higher income, which would allow better medical
care.
• MINITAB command for scatterplot
Graph > Plot
Categorical variables in scatterplots
• To add a categorical variable to a scatterplot, use a different
colour or symbol for each category.
• The scatterplot below shows the relationship between the
world record times for 10,000m run and the year for both men
and women.
[Figure: scatterplot of world record time (seconds, 1600 to 2300) against year (1900 to 2000), with plotting symbols F and M for women and men.]
Categorical explanatory variables
• Scatterplots display the association between two quantitative
variables.
• To display a relationship between a categorical explanatory
variable and a quantitative response variable, make a side-by-side
comparison of the distributions of the response for each
category.
• A back-to-back stemplot compares two distributions.
• Side-by-side boxplots compare any number of distributions.
Example
We want to investigate the association between how much
education a person has and his/her income.
Education appears as a categorical variable:
1 = did not reach high school,
2 = some high school but no high school diploma,
up to
6 = postgraduate degree.
Order the categories and make side-by-side boxplots for the
income.
• The side-by-side boxplots show a strong positive association
between education and earnings.
Correlation
• A scatterplot displays the form, direction and strength of the
relationship between two quantitative variables.
• Correlation (denoted by r) measures the direction and
strength of the linear relationship between two quantitative
variables.
• Suppose that we have data on variables x and y for n
individuals. The correlation r between x and y is given by
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{(n-1)\,s_x s_y}$$
Example
• Family income and annual savings in thousands of dollars for a sample
of eight families are given below (C5 = C3 × C4).
savings  income  C3        C4        C5
1        36      -1.42887  -1.45101  2.07331
2        39      -1.02062  -1.03144  1.05271
2        42      -0.61237  -0.61187  0.37469
5        45      -0.20412  -0.19230  0.03925
5        48       0.20412   0.22727  0.04639
6        51       0.61237   0.64684  0.39611
7        54       1.02062   1.06641  1.08840
8        56       1.42887   1.34612  1.92343
Sum of C5 = 6.99429
• r = 6.99429/7 = 0.999185
• MINITAB command: Stat > Basic Statistics > Correlation
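The last step of the calculation, r as the sum of products of the two standardized columns divided by n − 1, can be reproduced directly. This sketch takes the C3 and C4 columns as printed in the example (five decimals) rather than recomputing them, so the result agrees with the quoted r only up to rounding.

```python
# Columns C3 and C4 as printed in the example.
c3 = [-1.42887, -1.02062, -0.61237, -0.20412, 0.20412, 0.61237, 1.02062, 1.42887]
c4 = [-1.45101, -1.03144, -0.61187, -0.19230, 0.22727, 0.64684, 1.06641, 1.34612]

n = len(c3)
c5 = [u * v for u, v in zip(c3, c4)]  # products of paired values
r = sum(c5) / (n - 1)

print(f"sum of C5 = {sum(c5):.5f}")
print(f"r = {r:.5f}")
```

Each product is positive because every pair of values has the same sign, which is why the sum is close to its maximum possible value of n − 1 = 7.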
Properties of correlation
• Correlation requires both variables to be quantitative and makes no
use of the distinction between explanatory and response variables.
• Because r uses standardized values of observations, it does not
depend on the units of measurement of x and y. Correlation r has no
unit of measurement.
• Positive r indicates positive association between the variables and
negative r indicates negative association.
• Correlation measures the strength of only the linear relationship
between two variables; it does not describe curved relationships!
• r is always a number between –1 and 1.
Values of r near 0 indicate a weak linear relationship.
The strength of the linear relationship increases as r moves
away from 0. Values of r close to –1 or 1 indicate that the
points lie close to a straight line.
r is not resistant. r is strongly affected by a few outliers.
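Several of these properties can be checked numerically. The sketch below uses small hypothetical data sets (made up for illustration) and our own helper `correlation`; it shows that r is unchanged by a change of units or by swapping x and y, and that a perfectly curved relationship can still give r = 0.

```python
import math

def correlation(x, y):
    """Correlation r between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = correlation(x, y)
r_new_units = correlation([0.62 * v for v in x], [32 + 1.8 * v for v in y])
r_swapped = correlation(y, x)

# y = x^2 exactly, yet the linear association is zero.
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(r, 4), round(r_new_units, 4), round(r_swapped, 4), round(r_curved, 4))
```

The first three values are identical, and the last is 0 even though y is completely determined by x, which is why a scatterplot should always accompany r.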
Question from Term test, summer 99
• MINITAB analyses of math and verbal SAT scores are given below.
Variable    N     Mean     Median   TrMean   StDev    SE Mean
Verbal      200   595.65   586.00   595.57   73.21    5.18
Math        200   649.53   649.00   650.37   66.35    4.69
GPA         200   2.6300   2.6000   2.6439   0.5803   0.0410
Variable    Minimum   Maximum
Verbal      361.00    780.00
Math        441.00    800.00
GPA         0.3000    3.9000
Stem-and-leaf of Verbal   N = 200
Leaf Unit = 10
   1   3 6
   4   4 034
  19   4 566888888889999
  52   5 000000122222222333333333444444444
 (56)  5 55555555555556666666777777777777778888888888888889999999
  92   6 00000000011111111222222333333333444444444444444
  45   6 555555666666666778888888889999
  15   7 0011112244
   5   7 55568
Stem-and-leaf of Math   N = 200
Leaf Unit = 10
   1   4 4
   3   4 79
  12   5 001222234
  38   5 55555666677777778888889999
 (63)  6 000000000000001111111111112222222222222222333333333344444444444
  99   6 555555555666666666666667777777777788888889999999
  51   7 000000000011111111111112222222333334444
  12   7 5566777789
   2   8 00
[Figure: histograms of the Math and Verbal scores (Frequency against score, 400 to 800).]
a) Find the 25th percentile, 75th percentile and the IQR of the math
SAT scores.
b) You were one of the students of this study and your math SAT
score was 532. What is your z-score and percentile standing?
c) If the math SAT scores were in fact left (negatively) skewed, but
the mean was still 650, what could you say about the percentile
standing of someone who obtains a score of 650?
d) What is the class width
i) of the histogram for verbal SAT scores?
ii) of the stemplot of the verbal SAT scores?
e) Describe both the verbal and math score distributions and
compare one with the other.
g) Give a rough sketch of how a normal probability plot would
look if the verbal scores were
i. Right (positively) skewed
ii. Uniform in shape
h) For verbal scores, aside from running through the data and
tallying, can you determine the approx. percentage of scores
which fall between 523 and 668? If so give the percentage.
Question (Term Test May 98)
• Descriptive statistics of scores of 3 groups of students are given
below.
Variable   Group   N    Mean    Median   TrMean   StDev
Post1      B       22   6.682   6.500    6.650    2.767
           D       22   9.773   10.000   9.800    2.724
           S       22   7.773   7.000    7.750    3.927
• Using the information above estimate the following in some
reasonable way. State any assumptions that you have to make.
(a) The 90th percentile of the post1 scores using method B.
(b) The proportion of post1 scores that would be 7 or higher for
those using method D.