Introduction to Probability and Statistics Eleventh Edition

Definitions
• A variable is a characteristic that
changes or varies over time and/or for
different individuals or objects under
consideration.
• Examples: Hair color, white blood cell
count, time to failure of a computer
component.
Copyright ©2003 Brooks/Cole
A division of Thomson Learning, Inc.
Definitions
• An experimental unit is the
individual or object on which a
variable is measured.
• A set of measurements, called data,
can be either a sample or a
population.
Definitions
• A population is the collection of all items we
are interested in.
• A sample is the subset of the population that we
observe.
Types of Variables
• Qualitative
• Quantitative
– Discrete
– Continuous
Types of Variables
•Qualitative variables measure a quality
or characteristic on each experimental
unit.
•Examples:
•Hair color (black, brown, blonde…)
•Make of car (Dodge, Honda, Ford…)
•Gender (male, female)
•State of birth (California, Arizona, …)
Types of Variables
•Quantitative variables measure a
numerical quantity on each experimental
unit.
– Discrete if it can assume only a
finite or countable number of values.
– Continuous if it can assume the
infinitely many values corresponding
to the points on a line interval.
Examples
• For each orange tree in a grove, the number
of oranges is measured.
– Quantitative discrete
• For a particular day, the number of cars
entering a college campus is measured.
– Quantitative discrete
• Time until a light bulb burns out
– Quantitative continuous
2.1 Describing Qualitative Data
• Use a data distribution to describe:
– What values of the variable have
been measured
– How often each value has occurred
• “How often” can be measured 3 ways:
– Frequency
– Relative frequency = Frequency/n
– Percent = 100 × Relative frequency
Example
• A bag of M&M®s contains 25 candies:
• Raw Data: the 25 individual candies (shown as images on the slide)
• Statistical Table:
Color    Tally      Frequency   Relative Frequency   Percent
Red      mmmmm      5           5/25 = .20           20%
Blue     mmm        3           3/25 = .12           12%
Green    mm         2           2/25 = .08           8%
Orange   mmm        3           3/25 = .12           12%
Brown    mmmmmmmm   8           8/25 = .32           32%
Yellow   mmmm       4           4/25 = .16           16%
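The three ways of measuring "how often" can be sketched in a few lines of Python. This is only an illustration: the list of colors is rebuilt from the frequencies in the M&M example, and the helper name `frequency_table` is made up for this sketch.

```python
from collections import Counter

# Colors of the 25 candies, rebuilt from the frequencies in the example.
candies = (["Red"] * 5 + ["Blue"] * 3 + ["Green"] * 2
           + ["Orange"] * 3 + ["Brown"] * 8 + ["Yellow"] * 4)


def frequency_table(values):
    """Map each value to (frequency, relative frequency, percent)."""
    n = len(values)
    counts = Counter(values)
    return {v: (f, f / n, 100 * f / n) for v, f in counts.items()}


table = frequency_table(candies)
for color, (freq, rel, pct) in table.items():
    print(f"{color:<8}{freq:>3}{rel:>8.2f}{pct:>7.0f}%")
```

The relative frequencies always add up to 1 (and the percents to 100), which is a quick check on any hand-built table.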
Graphs
Bar Chart
Pie Chart
2.2 Describing Quantitative Data
• Dot plot
• Stem and leaf plot
• Histogram
Dotplots
• The simplest graph for quantitative data
• Plots the measurements as points on a
horizontal axis, stacking the points that
duplicate existing points.
• Example: The set 4, 5, 5, 7, 6
(Figure: dotplot on a horizontal axis from 4 to 7, with one point each at 4, 6, and 7 and two stacked points at 5.)
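A dotplot is simple enough to draw as text. The sketch below assumes whole-number measurements and uses a made-up helper name, `text_dotplot`; it stacks one `*` per measurement above a horizontal axis.

```python
from collections import Counter


def text_dotplot(data):
    """Return text rows: stacked dots above a horizontal axis (integer data assumed)."""
    counts = Counter(data)
    axis = range(min(data), max(data) + 1)
    width = len(str(max(data))) + 1  # column width, including one space
    rows = []
    # Draw from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        cells = ("*" if counts[v] >= level else " " for v in axis)
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    rows.append("".join(str(v).ljust(width) for v in axis).rstrip())
    return rows


rows = text_dotplot([4, 5, 5, 7, 6])
print("\n".join(rows))
```

For the set 4, 5, 5, 7, 6 the only stacked point is the repeated 5.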
Stem and Leaf plot
The ages of the CEOs of 30 top-ranked small
companies in America in 1993:
33 38 40 43 43 44 45 45 46 46 47 47 47
48 48 50 50 51 52 53 55 55 56 57 57 58
60 61 63 69
3|38
4|0334556677788
5|00123556778
6|0139
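A stem and leaf plot for two-digit data splits each value into a tens digit (the stem) and a ones digit (the leaf). The following is a minimal sketch using the 30 listed ages; the function name `stem_and_leaf` is illustrative only.

```python
from collections import defaultdict

# The 30 CEO ages listed in the example.
ages = [33, 38, 40, 43, 43, 44, 45, 45, 46, 46, 47, 47, 47,
        48, 48, 50, 50, 51, 52, 53, 55, 55, 56, 57, 57, 58,
        60, 61, 63, 69]


def stem_and_leaf(values):
    """Return one 'stem|leaves' string per stem, for two-digit whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit, ones digit
        leaves[stem].append(str(leaf))
    return [f"{stem}|{''.join(leaves[stem])}" for stem in sorted(leaves)]


plot = stem_and_leaf(ages)
print("\n".join(plot))
```

Each row has one leaf per measurement, so the leaf counts across all rows should add up to n = 30.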
Relative Frequency Histograms
• A relative frequency histogram for a
quantitative data set is a bar graph in which
the height of the bar shows “how often”
(measured as a proportion or relative
frequency) measurements fall in a particular
class or subinterval.
Create intervals
Stack and draw bars
Relative Frequency Histograms
• Divide the range of the data into 5-12
subintervals of equal length.
• Calculate the approximate width of the
subinterval as Range/number of subintervals.
• Round the approximate width up to a
convenient value.
• Use the method of left inclusion, including the
left endpoint, but not the right in your tally.
• Create a statistical table including the
subintervals, their frequencies and relative
frequencies.
Relative Frequency Histograms
• Draw the relative frequency histogram,
plotting the subintervals on the horizontal
axis and the relative frequencies on the
vertical axis.
• The height of the bar represents
– The proportion of measurements falling in
that class or subinterval.
– The probability that a single measurement,
drawn at random from the set, will belong to
that class or subinterval.
Example
The ages of 50 tenured faculty at a
state university.
34 42 34 43 48 31 59 50 70 36
34 30 63 48 66 43 52 43 40 32
52 26 59 44 35 58 36 58 50 37
43 53 43 52 44 62 49 34 48 53
39 45 41 35 36 62 34 38 28 53
We choose to use 6 intervals.
Minimum class width = (70 – 26)/6 = 7.33
Convenient class width = 8
Use 6 classes of length 8, starting at 25.
Age          Tally              Frequency   Relative Frequency   Percent
25 to < 33   |||||              5           5/50 = .10           10%
33 to < 41   ||||| ||||| ||||   14          14/50 = .28          28%
41 to < 49   ||||| ||||| |||    13          13/50 = .26          26%
49 to < 57   ||||| ||||         9           9/50 = .18           18%
57 to < 65   ||||| ||           7           7/50 = .14           14%
65 to < 73   ||                 2           2/50 = .04           4%
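The class-building steps for the faculty ages can be sketched as follows. The sketch assumes whole-number data and takes "round up to a convenient value" to mean rounding up to the next whole number; the function name `relative_frequency_table` is made up for this illustration.

```python
import math

# The 50 faculty ages from the example.
ages = [34, 42, 34, 43, 48, 31, 59, 50, 70, 36,
        34, 30, 63, 48, 66, 43, 52, 43, 40, 32,
        52, 26, 59, 44, 35, 58, 36, 58, 50, 37,
        43, 53, 43, 52, 44, 62, 49, 34, 48, 53,
        39, 45, 41, 35, 36, 62, 34, 38, 28, 53]


def relative_frequency_table(data, num_classes, start):
    """Return rows of (left, right, frequency, relative frequency)."""
    # Approximate width = range / number of classes, rounded up (assumption:
    # the next whole number counts as a "convenient" value).
    width = math.ceil((max(data) - min(data)) / num_classes)
    counts = [0] * num_classes
    for x in data:
        # Left inclusion: class i covers start + i*width <= x < start + (i+1)*width.
        counts[(x - start) // width] += 1
    n = len(data)
    return [(start + i * width, start + (i + 1) * width, c, c / n)
            for i, c in enumerate(counts)]


table = relative_frequency_table(ages, num_classes=6, start=25)
for left, right, freq, rel in table:
    print(f"{left} to < {right}: {freq:>2}  {rel:.2f}")
```

The bar heights of the relative frequency histogram are the last value in each row.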
2.4 Numerical Measures of Center
Symmetric: Mean = Median
Skewed right: Mean > Median
Skewed left: Mean < Median
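A quick numerical check of these three relations. The data sets below are hypothetical, chosen only to give one symmetric, one right-skewed, and one left-skewed example.

```python
from statistics import mean, median

# Hypothetical data sets, chosen only to illustrate the three shapes.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail toward large values
skewed_left = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image of the above

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(f"{name:<13} mean = {mean(data):6.2f}   median = {median(data):6.2f}")
```

The single extreme value pulls the mean toward the tail, while the median barely moves.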
2.6: Interpreting the Standard Deviation
• Chebyshev’s Rule
• The Empirical Rule
Both tell us something about where
the data will be relative to the mean.
Chebyshev’s Theorem
Given a number k greater than or equal to 1 and a
set of n measurements, at least 1 – 1/k² of the
measurements will lie within k standard deviations of
the mean.
• Can be used for either a sample (x̄ and s) or a population (μ
and σ). Valid for any data set.
Important results:
If k = 2, at least 1 – 1/2² = 3/4 = 75% of the measurements are
within 2 standard deviations of the mean.
If k = 3, at least 1 – 1/3² = 8/9 ≈ 89% of the measurements are
within 3 standard deviations of the mean.
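The bound is easy to evaluate and to compare against an actual data set. This is a sketch only: the function names and the small skewed data set are assumptions made for illustration.

```python
from statistics import mean, stdev


def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("Chebyshev's Theorem is stated for k >= 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of the data within k sample standard deviations of the mean."""
    x_bar, s = mean(data), stdev(data)
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)


# Hypothetical, strongly skewed data: the bound still holds.
data = [1, 2, 2, 3, 3, 3, 4, 20]
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(data, k))
```

The observed fraction is usually well above the bound, because the theorem has to cover every possible data set.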
Using Measures of Center and Spread: The Empirical Rule
Given a distribution of measurements
that is approximately mound-shaped:
The interval μ ± σ contains approximately 68% of
the measurements.
The interval μ ± 2σ contains approximately 95%
of the measurements.
The interval μ ± 3σ contains approximately 99.7%
of the measurements.
Empirical Rule Example
• Hummingbirds beat their wings
in flight an average of 55 times
per second.
• Assume the standard deviation
is 10, and that the distribution
is symmetrical and mounded.
– Approximately what
percentage of hummingbirds
beat their wings between 45
and 65 times per second?
– Between 55 and 65?
– Less than 45?
Empirical Rule Example
– Between 45 and 65: Since 45 and 65 are exactly one standard deviation below and above the mean, the empirical rule says that about 68% of the hummingbirds will be in this range.
Empirical Rule Example
– Between 55 and 65: This range of numbers is from the mean to one standard deviation above it, or one-half of the range in the previous question. So, about one-half of 68%, or 34%, of the hummingbirds will be in this range.
Empirical Rule Example
– Less than 45: Half of the entire data set lies above the mean, and ~34% lie between 45 and 55 (between one standard deviation below the mean and the mean), so ~84% (~34% + 50%) are above 45, which means ~16% are below 45.
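The three hummingbird answers can be checked numerically. The sketch below makes a stronger assumption than the slides do: it treats the "symmetrical and mounded" distribution as exactly normal, using `NormalDist` from Python's standard library.

```python
from statistics import NormalDist

# Assumption: model the mound-shaped distribution as exactly normal,
# with mean 55 and standard deviation 10 wingbeats per second.
wingbeats = NormalDist(mu=55, sigma=10)

between_45_65 = wingbeats.cdf(65) - wingbeats.cdf(45)
between_55_65 = wingbeats.cdf(65) - wingbeats.cdf(55)
below_45 = wingbeats.cdf(45)

print(f"between 45 and 65: {between_45_65:.1%}")
print(f"between 55 and 65: {between_55_65:.1%}")
print(f"less than 45:      {below_45:.1%}")
```

Under this assumption the values land close to the rounded 68%, 34%, and 16% from the empirical rule.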
Empirical Rule
• Since ~95% of all the measurements will be within 2 standard deviations of the mean, only ~5% will be more than 2 standard deviations from the mean.
• About half of this 5% will be far below the mean, leaving only about 2.5% of the measurements at least 2 standard deviations above the mean.
2.7: Numerical Measures of Relative Standing
• Percentiles: for any (large) set of n
measurements (arranged in ascending
or descending order), the pth percentile
is a number such that p% of the
measurements fall below that number
and (100 – p)% fall above it.
• k-th quartile: k quarters of the measurements lie below it.
Percentiles
• Finding percentiles is similar to finding the
median – the median is the 50th percentile.
– If you are in the 50th percentile for the
GRE, half of the test-takers scored better
and half scored worse than you.
– If you are in the 75th percentile, you
scored better than three-quarters of the
test-takers.
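As a rough illustration, quartiles and "percent below" can be computed with Python's standard library. The data are the 30 CEO ages from the stem and leaf example; `percent_below` is a made-up helper. Software packages interpolate quartiles in slightly different ways, so the first and third quartiles may differ a little from a hand calculation.

```python
from statistics import median, quantiles

# The 30 CEO ages from the stem and leaf example.
ages = [33, 38, 40, 43, 43, 44, 45, 45, 46, 46, 47, 47, 47,
        48, 48, 50, 50, 51, 52, 53, 55, 55, 56, 57, 57, 58,
        60, 61, 63, 69]


def percent_below(data, value):
    """Percent of the measurements that fall strictly below value."""
    return 100 * sum(x < value for x in data) / len(data)


# Three cut points that split the data into quarters.
q1, q2, q3 = quantiles(ages, n=4)
print("quartiles:", q1, q2, q3)
print("median:", median(ages))
print("percent below 50:", percent_below(ages, 50))
```

The second quartile is the median, that is, the 50th percentile.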
Z-scores
• The z-score tells us how many standard deviations above or below the mean a particular measurement is.
• Sample z-score: z = (x – x̄)/s
• Population z-score: z = (x – μ)/σ
Z-Scores
• Z scores are related to the empirical rule:
For a perfectly symmetrical and mound-shaped
distribution,
– ~68% will have z-scores between -1 and 1
– ~95% will have z-scores between -2 and 2
– ~99.7% will have z-scores between -3 and 3
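A z-score is a one-line calculation. The sketch below uses illustrative function names and, for the sample version, a small hypothetical data set; the first two calls reuse the hummingbird figures (mean 55, standard deviation 10).

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - center) / spread


def sample_z_scores(data):
    """z-scores of every measurement, using the sample mean and sample standard deviation."""
    x_bar, s = mean(data), stdev(data)
    return [z_score(x, x_bar, s) for x in data]


print(z_score(65, 55, 10))   # one standard deviation above the mean
print(z_score(45, 55, 10))   # one standard deviation below the mean
print(sample_z_scores([2, 4, 4, 6, 9]))  # hypothetical sample
```

Within a single data set the z-scores always average to zero, since they measure distance from that set's own mean.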
2.8: Methods for Determining
Outliers
• An outlier is a measurement that is
unusually large or small relative to the
other values.
• Three possible causes:
– Observation, recording or data entry
error
– Item is from a different population
– A rare, chance event
Box plot
• The box plot is a graph representing
information about certain percentiles
for a data set and can be used to
identify outliers
BoxPlot: Wins by Team at the 2007 MLB All-Star Break
(Figure: box plot on a horizontal axis running from 30 to 55, with the Minimum Value, Lower Quartile (QL), Median, Upper Quartile (QU), and Maximum Value labeled.)
BoxPlot: Wins by Team at the 2007 MLB All-Star Break
Interquartile Range (IQR) = QU – QL
(Figure: the same box plot, on a horizontal axis running from 30 to 55, with the IQR marked.)
BoxPlot: Wins by Team at the 2007 MLB All-Star Break
(One team had its total wins for 2006 recorded)
Inner Fence at QU + 1.5(IQR)
Outer Fence at QU + 3(IQR)
(Figure: box plot on a horizontal axis running from 20 to 110, with the inner and outer fences marked.)
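The fence calculation can be sketched as below. The win totals are hypothetical (the slides do not list the team-by-team data), the function names are illustrative, and the lower fences at QL – 1.5(IQR) and QL – 3(IQR) are the usual mirror images of the upper fences shown on the slide. Quartile values depend on the interpolation method the software uses.

```python
from statistics import quantiles


def fences(data):
    """Return (lower, upper) pairs for the inner and outer fences."""
    ql, _, qu = quantiles(data, n=4)
    iqr = qu - ql
    return {"inner": (ql - 1.5 * iqr, qu + 1.5 * iqr),
            "outer": (ql - 3 * iqr, qu + 3 * iqr)}


def beyond_inner_fences(data):
    """Measurements falling outside the inner fences (suspected outliers)."""
    low, high = fences(data)["inner"]
    return [x for x in data if x < low or x > high]


# Hypothetical win totals; 104 stands in for a full-season total recorded by mistake.
wins = [36, 38, 39, 40, 42, 44, 45, 47, 48, 49, 52, 104]
limits = fences(wins)
suspects = beyond_inner_fences(wins)
print(limits)
print(suspects)
```

A value beyond the outer fence, like the 104 here, is an even stronger outlier candidate than one that only passes the inner fence.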
Outliers and Z-scores
• Outliers and z-scores
– The chance that a z-score is
between -3 and +3 is over 99%.
– Any measurement with |z| > 3 is
considered an outlier.
#Wins    Mean    Sample Variance   Sample Standard Deviation   Minimum   Maximum
n = 30   45.68   146.69            12.11                       25        104
• Outliers and z-scores
Here are the descriptive statistics for the games won at the All-Star break, except one team had its total wins for 2006 recorded.
That team, with 104 wins recorded, had a z-score of (104 – 45.68)/12.11 = 4.82. That’s a very unlikely result, which isn’t surprising given what we know about the observation.
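The |z| > 3 rule applied to the summary statistics above, as a small sketch. The function name `check_outlier` is an assumption made for illustration; the mean, standard deviation, minimum, and maximum come from the table.

```python
def check_outlier(x, center, spread, threshold=3.0):
    """Return the z-score of x and whether |z| exceeds the threshold."""
    z = (x - center) / spread
    return z, abs(z) > threshold


# Mean and sample standard deviation from the descriptive statistics.
z_max, max_flagged = check_outlier(104, 45.68, 12.11)
z_min, min_flagged = check_outlier(25, 45.68, 12.11)

print(f"maximum: z = {z_max:.2f}, outlier: {max_flagged}")
print(f"minimum: z = {z_min:.2f}, outlier: {min_flagged}")
```

Only the maximum is flagged; the minimum of 25 is less than two standard deviations below the mean.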
2.9: Graphing Bivariate Relationships
• Scattergram (or scatterplot) shows the relationship
between two quantitative variables
(Figure: Positive Relationship. Scatterplot of Imports against Gross domestic product.)
(Figure: Negative Relationship. Scatterplot of Present Value (i = 10%) against Year.)
• If there is no linear relationship between the variables, the scatterplot may look like a cloud, a horizontal line or a more complex curve.
Source: Quantitative Environmental Learning Project
http://www.seattlecentral.org/qelp/index.html