Lecture 3 - people.stat.sfu.ca

Statistics 270 - Lecture 3
• Last class: types of quantitative variables, histograms, measures of
center, percentiles, and measures of spread (we will finish these
today)
• Will have completed Chapter 1
• Assignment #1: Chapter 1, questions: 6, 20b, 26, 36b-d, 48, 60
• Some suggested problems:
• Chapter 1: 1, 5, 13 or 14 (DO histogram), 19, 26, 29, 33
[Figure: three histograms of Exam 1 scores; x-axis: Exam 1 (60 to 100), y-axis: Count]
Measures of Spread (cont.)
• 5 number summary often reported:
• Min, Q1, Q2 (Median), Q3, and Max
• Summarizes both center and spread
• What proportion of data lie between Q1 and Q3?
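As a rough illustration of the five-number summary, here is a minimal Python sketch. The exam scores are made up for the example, and the helper name `five_number_summary` is hypothetical. Quartile conventions differ between textbooks and software, so the values of Q1 and Q3 may not match a hand calculation exactly; this sketch assumes the "inclusive" method of the standard library's `statistics.quantiles`.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumption: quartiles use the 'inclusive' interpolation method;
    other conventions can give slightly different Q1 and Q3.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


# Hypothetical exam scores, invented for illustration only.
scores = [61, 65, 68, 70, 72, 75, 78, 80, 83, 85, 90]

low, q1, med, q3, high = five_number_summary(scores)
iqr = q3 - q1  # interquartile range: spread of the middle of the data

# Share of observations falling between Q1 and Q3 (roughly one half).
inside = sum(q1 <= x <= q3 for x in scores) / len(scores)

print("Min, Q1, Median, Q3, Max:", low, q1, med, q3, high)
print("IQR:", iqr)
print("Proportion between Q1 and Q3:", round(inside, 2))
```

With small samples the proportion between Q1 and Q3 is only approximately one half, because the quartiles are interpolated between observations.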
Box-Plot
• Displays 5-number summary graphically
• Box drawn spanning quartiles
• Line drawn in box for median
• Lines extend from box to max. and min. values
• Some programs draw whiskers only to 1.5*IQR above and below the quartiles
[Figure: box-plot of undergrad exam scores; axis: Exam Scores, 60 to 90]
• Can compare distributions using side-by-side box-plots
• What can you see from the plot?
[Figure: side-by-side box-plots of Exam Scores for Undergrads and Grads]
Other Common Measure of Spread: Sample Variance
• Sample variance of n observations:
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
• Can be viewed as roughly the average squared deviation of
observations from the sample mean
• Units are in squared units of data
Sample Standard Deviation
• Sample standard deviation of n observations:
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
• Can be viewed as roughly the average deviation of observations
from the sample mean
• Has same units as data
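The two definitions above can be sketched in a few lines of Python. The data values are invented for illustration (they are not the Muzzle Velocity data), and the function names `sample_variance` and `sample_std` are hypothetical. The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they serve as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(xs))


# Hypothetical observations, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("variance:", sample_variance(data))
print("std dev: ", sample_std(data))

# Cross-check against the standard library (also divides by n - 1).
print("library variance:", statistics.variance(data))
print("library std dev: ", statistics.stdev(data))
```

Dividing by n - 1 rather than n is why the variance is only "roughly" the average squared deviation.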
Exercise
• Compute the sample standard deviation and variance for the Muzzle
Velocity Example
• Variance and standard deviation are most useful when the measure of
center is ______
• As observations become more spread out, s : increases or
decreases?
• Both measures sensitive to outliers
• 5 number summary is better than the mean and standard deviation
for describing (i) skewed distributions; (ii) distributions with
outliers
Population and Samples
• Important to distinguish between the population and a sample from
the population
• A sample consisting of the entire population is called a ______
• What is the difference between the population mean and the
sample mean?
• The population variance (or std. deviation) and that of the
sample?
• Population median and sample median?
Empirical Rule for Bell-Shaped Distributions
• Approximately
• 68% of the data lie in the interval $\bar{x} \pm s$
• 95% of the data lie in the interval $\bar{x} \pm 2s$
• 99.7% of the data lie in the interval $\bar{x} \pm 3s$
• Can use these to help determine range of typical values or to
identify potential outliers
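A small Python sketch of how the empirical rule might be applied: it builds the three intervals from a mean and standard deviation, and checks what fraction of a data set actually falls inside each one. The function names and the numbers used are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def empirical_intervals(mean, s):
    """Return {k: (mean - k*s, mean + k*s)} for k = 1, 2, 3.

    For bell-shaped data these hold roughly 68%, 95% and 99.7%
    of the observations, respectively.
    """
    return {k: (mean - k * s, mean + k * s) for k in (1, 2, 3)}


def fraction_within(data, k):
    """Fraction of observations within k sample std. deviations of the mean."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return sum(abs(x - xbar) <= k * s for x in data) / n


# Hypothetical summary values: mean 100, standard deviation 15.
for k, (lo, hi) in empirical_intervals(100, 15).items():
    print(f"within {k} s: {lo} to {hi}")

# Hypothetical data; a tiny sample will not match 68/95/99.7 closely.
sample = [1, 2, 3, 4, 5]
print(fraction_within(sample, 1), fraction_within(sample, 2))
```

An observation outside the 3s interval would be a candidate outlier under this rule, provided the distribution really is close to bell-shaped.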
Example…Putting this all together
• A geyser is a hot spring that becomes unstable and erupts hot gases into the air.
Perhaps the most famous of these is Wyoming's Old Faithful Geyser.
• Visitors to Yellowstone Park most often visit Old Faithful to see it erupt.
Consequently, it is of great interest to be able to predict the time until the next
eruption.
Example…Putting this all together
• Consider a sample of 222 interval times between eruptions (Weisberg, 1985). The
first few lines of the available data are:

Day of Study   Length of Eruption (Minutes)   Interval Between Eruptions (Minutes)
1              4.4                            78
1              3.9                            74
1              4.0                            68
1              4.0                            76
...            ...                            ...

• Goal: Help predict the interval between eruptions
• Consider a variety of plots that may shed some light upon the nature of the
intervals between eruptions
Example…Putting this all together
• Goal: Help predict the interval between eruptions
• Consider a histogram to shed some light upon the nature of the intervals between
eruptions
[Figure: Histogram of Old Faithful Eruption Intervals; x-axis: Eruption Intervals (Minutes), 40 to 90; y-axis: Frequency, 0 to 40]
Example…Putting this all together
Summary Statistics:

Minimum              42
1st Quartile         60
Median               75
Mean                 71.01
3rd Quartile         81
Maximum              95
Standard Deviation   12.80

[Figure: Boxplot of Old Faithful Eruption Intervals; axis: Eruption Intervals (Minutes), 40 to 90]
Example…Putting this all together
• What does the box-plot show?
• Is a box-plot useful at showing the main features of these data?
• What does the empirical rule tell us about 95% of the data? Is this useful?
• We will come back to this in a minute…
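As a quick check on the empirical-rule question, the sketch below plugs the reported summary statistics (mean 71.01, standard deviation 12.80, minimum 42, maximum 95) into the 2s interval. Only the slide's summary values are used; the variable names are illustrative.

```python
# Summary statistics reported for the 222 Old Faithful intervals (minutes).
mean, s = 71.01, 12.80
minimum, maximum = 42, 95

# Empirical-rule interval meant to hold about 95% of bell-shaped data.
low, high = mean - 2 * s, mean + 2 * s
print(f"mean +/- 2s: {low:.2f} to {high:.2f} minutes")

# How much of the observed range (min to max) does that interval cover?
covered = (min(high, maximum) - max(low, minimum)) / (maximum - minimum)
print(f"share of observed range covered: {covered:.0%}")
```

The interval runs from about 45 to 97 minutes, which is nearly the whole observed range. That suggests the rule says little here, and since it assumes a bell shape, it is worth comparing against the histogram before relying on it.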
Scatter-Plots
• Help assess whether there is a relationship between 2 continuous variables
• Data are paired: (x1, y1), (x2, y2), ..., (xn, yn)
• Plot X versus Y
• If there is no natural pairing…probably not a good idea!
• What sort of relationships might we see?
[Figure: scatterplot; x-axis: Duration, 2 to 5; y-axis: Eruption Intervals (Minutes), 40 to 90]
Example…Putting this all together
• What does this plot reveal?
[Figure: Scatterplot of Eruption Interval vs Duration; x-axis: Duration (Minutes), 2 to 5; y-axis: Eruption Intervals (Minutes), 40 to 90]
Example…Putting this all together
Summary Statistics:

                         Eruption Less     Eruption Greater Than
                         Than 3 Minutes    or Equal to 3 Minutes
Minimum                  42                53
1st Quartile             60                74
Median                   51                78
Mean                     54.46             78.16
3rd Quartile             58                82.5
Maximum                  78                95
Standard Deviation       6.30              6.89
Number of Observations   67                155

[Figure: two histograms of Eruption Intervals (Minutes), one for each group; y-axis: Frequency]
Example…Putting this all together
• Suppose an eruption of 2.5 minutes had just taken place. What would you
estimate the length of the next interval to be?
• Suppose an eruption of 3.5 minutes had just taken place. What would you
estimate the length of the next interval to be?
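One simple way to approach these two questions is to predict the mean interval of whichever group the last eruption falls into. The sketch below does that using the group means and standard deviations from the summary statistics. The function name `predict_interval` and the choice of the group mean as the estimate are assumptions for illustration; the group median would be an equally reasonable choice.

```python
# Group summaries from the slide (interval in minutes),
# split by whether the previous eruption lasted under 3 minutes.
GROUP_MEAN = {"short": 54.46, "long": 78.16}
GROUP_SD = {"short": 6.30, "long": 6.89}


def predict_interval(duration_minutes, cutoff=3.0):
    """Return (estimated interval, group std. deviation) in minutes.

    Assumption: the estimate is simply the mean interval of the group
    (duration < cutoff, or duration >= cutoff) the eruption belongs to.
    """
    group = "short" if duration_minutes < cutoff else "long"
    return GROUP_MEAN[group], GROUP_SD[group]


for duration in (2.5, 3.5):
    estimate, sd = predict_interval(duration)
    print(f"after a {duration}-minute eruption: about {estimate} minutes "
          f"(group std. dev. {sd})")
```

Each group's standard deviation is roughly half of the overall 12.80, so conditioning on the duration of the previous eruption gives a noticeably tighter estimate than the overall mean of 71.01.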