Ch 6A Random Sampling & Data Descriptions
Chapter 6 - Random Sampling and Data Description
Experience the joy of dealing with large quantities of data
Chapter 6A
This Week in Prob/Stat
Today’s Discussion
+ Bonus Material
Descriptive Statistics
Distributions
Histogram
Cumulative frequency distribution
Frequency distribution (continuous data)
Measures of Central Tendency (location)
Mean
Median
Mode
Measures of Variability (dispersion)
Variance (standard deviation)
Range
Quartiles
Coefficient of Variation
Measures of skewness
Measures of Kurtosis
Histograms
Data are placed into class intervals, cells, or bins (synonyms).
Continuous data: number of bins ~ sqrt(nobs), or use Sturges' rule.
Histogram shows the relative frequency of the
sample observations in each class.
Histogram ~ probability density (or mass) function
By summing counts in the succession of bins you can
construct a cumulative frequency plot.
Cumulative frequency plot ~ empirical distribution
function ~ cumulative distribution function
A Discrete Example
Raw data: the number of
accident claims received per
day over the last 50 days by
the Nofrills Insurance Co.
Bin   Frequency   Cumulative %
0     8           16.00%
1     13          42.00%
2     11          64.00%
3     8           80.00%
4     6           92.00%
5     2           96.00%
6     1           98.00%
7     1           100.00%
week   1   2   3   4   5   6   7   8   9   10
Mon    4   0   5   1   3   2   1   7   3   2
Tues   3   3   2   1   4   0   4   2   3   1
Wed    1   0   0   0   2   0   4   2   1   1
Thur   1   0   2   1   3   1   5   3   4   2
Fri    4   1   0   1   3   2   2   6   1   2
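As a rough sketch, the frequency and cumulative-percentage columns can be reproduced from the 50 raw daily counts with a few lines of Python. The variable names are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import accumulate

# Daily claim counts for weeks 1-10, as listed in the raw-data table.
claims = {
    "Mon":  [4, 0, 5, 1, 3, 2, 1, 7, 3, 2],
    "Tues": [3, 3, 2, 1, 4, 0, 4, 2, 3, 1],
    "Wed":  [1, 0, 0, 0, 2, 0, 4, 2, 1, 1],
    "Thur": [1, 0, 2, 1, 3, 1, 5, 3, 4, 2],
    "Fri":  [4, 1, 0, 1, 3, 2, 2, 6, 1, 2],
}
data = [x for day in claims.values() for x in day]

counts = Counter(data)
bins = list(range(max(data) + 1))      # one bin per integer value 0..7
freq = [counts[b] for b in bins]       # absolute frequency in each bin
# Running total of the frequencies, expressed as a percentage of n.
cum_pct = [100 * c / len(data) for c in accumulate(freq)]

print("Bin  Frequency  Cumulative %")
for b, f, c in zip(bins, freq, cum_pct):
    print(f"{b:>3}  {f:>9}  {c:>11.2f}%")
```

Dividing each frequency by 50 gives the relative frequencies plotted in the histogram.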
[Figure: histogram of relative frequency (0 to 0.3) versus number of claims (0 to 7)]
A Discrete Empirical Cumulative Frequency Distribution
Number of claims   Cumulative Frequency
0 ≤ x < 1          16%
1 ≤ x < 2          42%
2 ≤ x < 3          64%
3 ≤ x < 4          80%
4 ≤ x < 5          92%
5 ≤ x < 6          96%
6 ≤ x < 7          98%
7 ≤ x < ∞          100%
A Discrete Empirical Cumulative Frequency Distribution Graph
[Figure: cumulative % (0% to 100%) versus number of claims (0 to 8)]
A Continuous Data Example
Raw Data: Time to repair or replace in hours a failed transformer by the Dayton Power and Light Company
Industry standard is 2.5 hours
2.2   1.7   2.4   2.5   2.9   4.4   5.0
1.8   3.2   2.2   1.9   4.0   5.0   5.6
2.5   4.5   3.7   4.3   4.3   2.4   3.6
2.5   1.9   2.7   4.5   3.3   3.9   2.7
3.0   3.9   3.3   2.0   1.6   1.6   2.9
data collection: 35 repairs performed between 01/01/07 and 06/30/07
Sturges’ rule for grouping data
k = 1 + 3.3 log10(n)
where k = number of classes,
n = sample size.
⌊x⌋ = integer part of x
For example (the tabulated values correspond to rounding to the nearest integer),
n         35   50   500   5000
k         6    7    10    13
sqrt(n)   6    7    22    71
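A minimal sketch of the two bin-count rules follows. It assumes the result is rounded to the nearest integer, which is what reproduces the tabulated values; the function names are made up for illustration.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes from k = 1 + 3.3 log10(n), rounded to the nearest integer."""
    return round(1 + 3.3 * math.log10(n))

def sqrt_classes(n: int) -> int:
    """Number of classes from the square-root rule, rounded to the nearest integer."""
    return round(math.sqrt(n))

for n in (35, 50, 500, 5000):
    print(n, sturges_classes(n), sqrt_classes(n))
```

The two rules agree for small samples and diverge quickly as n grows, since the logarithm grows far more slowly than the square root.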
A Histogram
Data was generated from
a lognormal distribution
Bin          frequency
x <= 1       0
1<x<=2       0.2
2<x<=3       0.34286
3<x<=4       0.22857
4<x<=5       0.17143
5<x          0.05714
[Figure: histogram of relative frequency (0 to 0.4) by bin; x-axis: Transformer repair times in hours]
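The binning step can be sketched as below, using the 35 repair times as printed (one decimal place). With these rounded values both 5.0 readings land in the 4<x<=5 bin, so the last two counts come out as 7 and 1 rather than the 6 and 2 implied by the table; presumably the original data carried more decimals, but that is an assumption.

```python
from bisect import bisect_left
from itertools import accumulate

repair_hours = [
    2.2, 1.7, 2.4, 2.5, 2.9, 4.4, 5.0, 1.8, 3.2, 2.2, 1.9, 4.0,
    5.0, 5.6, 2.5, 4.5, 3.7, 4.3, 4.3, 2.4, 3.6, 2.5, 1.9, 2.7,
    4.5, 3.3, 3.9, 2.7, 3.0, 3.9, 3.3, 2.0, 1.6, 1.6, 2.9,
]

# Upper edges of the closed-on-the-right bins x<=1, 1<x<=2, ..., 4<x<=5.
# Anything above the last edge falls into the open bin 5<x.
edges = [1, 2, 3, 4, 5]
counts = [0] * (len(edges) + 1)
for x in repair_hours:
    # bisect_left returns the index of the first edge that is >= x.
    counts[bisect_left(edges, x)] += 1

n = len(repair_hours)
rel_freq = [c / n for c in counts]        # histogram heights
cum_freq = list(accumulate(rel_freq))     # ogive heights at the bin edges

labels = ["x<=1", "1<x<=2", "2<x<=3", "3<x<=4", "4<x<=5", "5<x"]
for label, f, c in zip(labels, rel_freq, cum_freq):
    print(f"{label:<8} {f:.5f} {c:.5f}")
```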
Frequency Polygon
[Figure: frequency polygon; relative frequency (0 to 0.4) versus repair time in hours (0 to 8)]
Cumulative Frequency Distribution - ogive
[Figure: ogive; cumulative frequency (axis 0% to 120%) versus repair times (0 to 7)]
Measures of Central
Tendency – i.e. averages
Seeking the middle ground
Types of Data
nominal (also categorical or discrete) (e.g. group employees by job type)
only comparisons are equality and inequality.
no "less than" or "greater than" relations among the classifying names
no operations such as addition or subtraction
ordinal (e.g. rank colleges based on surveys and interviews)
the numbers assigned to objects represent the rank order (1st, 2nd, 3rd etc.) of the entities measured.
comparisons of greater and less can be made, in addition to equality and inequality.
interval (e.g. temperature in °C or °F, IQ measurements)
have all the features of ordinal measurements,
equal differences between measurements represent equivalent intervals.
operations such as addition and subtraction are therefore meaningful.
ratio (e.g. group travel times into intervals)
have all the features of interval measurements
operations such as multiplication and division are therefore meaningful.
The zero value on a ratio scale is non-arbitrary
6-1 Numerical Summaries
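Definition: Sample Mean
If the n observations in a sample are x_1, x_2, ..., x_n, the sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$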
Characteristics of the mean
most widely known and used average
an artificial concept since it may not coincide with any actual
value
affected by every value of every item
therefore uses all the information available in the sample
highly influenced by extreme values
can be computed directly from the raw data
e.g. does not need to be sorted as does the median
requires interval or ratio data
lends itself better to algebraic analysis than other measures
of central tendency
has some desirable statistical properties
answers the question, "if all the quantities had the same
value, what would that value have to be in order to achieve
the same total?"
Example 6-1
Figure 6-1 The sample mean as a balance point for a
system of weights.
Population Mean
For a finite population with N measurements, the mean is
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
The sample mean is a reasonable estimate of the population mean.
Sample Median
Median is a measure of central tendency such that half of the
values in a sample are below it and half are above it.
If the number of observations is even, then average the two
central values.
The sample median is less influenced by ‘outliers’ than the sample mean.
• Not affected by the size of extreme values
• Affected by the number, but not the value, of extremes
Widely used in skewed distributions where the mean would be distorted by extreme values
• e.g. economic data
Can be used where the data are ranked but not measured quantitatively
Unreliable if the data do not cluster at the center of the distribution
Order Statistics
Define
X(1) = Min {X1, X2, …, Xn}
X(2) = 2nd smallest {X1, X2, …, Xn}
X(i) = ith smallest {X1, X2, …, Xn}
X(n) = Max {X1, X2, …, Xn}
Therefore X(1) ≤ X(2) ≤ X(3) ≤ … ≤ X(n)
$$X_{med} = \begin{cases} X_{(k)} & \text{if } n = 2k - 1 \text{ is odd} \\ \dfrac{X_{(k)} + X_{(k+1)}}{2} & \text{if } n = 2k \text{ is even} \end{cases}$$
Median Repair Time
Raw Data: Time to repair or replace in hours a failed transformer by the Dayton Power and Light Company
Industry standard is 2.5 hours
2.2   1.7   2.4   2.5   2.9   4.4   5.0
1.8   3.2   2.2   1.9   4.0   5.0   5.6
2.5   4.5   3.7   4.3   4.3   2.4   3.6
2.5   1.9   2.7   4.5   3.3   3.9   2.7
3.0   3.9   3.3   2.0   1.6   1.6   2.9
sort data
Observation number   Sorted values
1-7                  1.6   1.6   1.7   1.8   1.9   1.9   2.0
8-14                 2.2   2.2   2.4   2.4   2.5   2.5   2.5
15-21                2.7   2.7   2.9   2.9   3.0   3.2   3.3
22-28                3.3   3.6   3.7   3.9   3.9   4.0   4.3
29-35                4.3   4.4   4.5   4.5   5.0   5.0   5.6
With n = 35 = 2(18) - 1, the median is the 18th ordered observation: 2.9 hours.
Median of an even number of
observations
observation   raw data value   sorted value
1             27.40            1.18
2             9.08             9.08
3             165.29           9.87
4             214.85           15.01
5             98.70            27.40
6             76.07            48.74
7             9.87             49.86
8             77.96            59.79
9             15.01            76.07
10            49.86            77.96
11            1.18             98.70
12            188.07           165.29
13            317.26           188.07
14            59.79            214.85
15            384.63           317.26
16            48.74            384.63
average the middle two observations: (59.79 + 76.07) / 2 = 67.93
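The order-statistic definition of the median translates almost directly into code. The sketch below is one possible implementation (the function name is illustrative), applied to the 35 repair times for the odd case and to the 16 values above for the even case.

```python
def order_statistic_median(values):
    """Median via order statistics.

    X(k) when n = 2k - 1 is odd, and (X(k) + X(k+1)) / 2 when n = 2k is even.
    """
    x = sorted(values)           # x[i - 1] holds the order statistic X(i)
    n = len(x)
    k = (n + 1) // 2
    if n % 2 == 1:               # odd case, n = 2k - 1
        return x[k - 1]
    return (x[k - 1] + x[k]) / 2 # even case, n = 2k

repair_hours = [
    2.2, 1.7, 2.4, 2.5, 2.9, 4.4, 5.0, 1.8, 3.2, 2.2, 1.9, 4.0,
    5.0, 5.6, 2.5, 4.5, 3.7, 4.3, 4.3, 2.4, 3.6, 2.5, 1.9, 2.7,
    4.5, 3.3, 3.9, 2.7, 3.0, 3.9, 3.3, 2.0, 1.6, 1.6, 2.9,
]
sixteen_values = [
    27.40, 9.08, 165.29, 214.85, 98.70, 76.07, 9.87, 77.96,
    15.01, 49.86, 1.18, 188.07, 317.26, 59.79, 384.63, 48.74,
]

print(order_statistic_median(repair_hours))    # odd n: the 18th ordered value
print(order_statistic_median(sixteen_values))  # even n: mean of 8th and 9th
```

Python's standard library offers `statistics.median`, which follows the same convention.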
Good use of the median
Constructively Yours is a small privately owned and operated business that specializes in small residential construction and remodeling projects. In addition to the owner-president, the company employs 8 other workers.
position        annual salary
receptionist    $22,050
worker 1        $28,175
worker 2        $29,500
worker 3        $31,450
worker 4        $32,800
salesperson 1   $34,150
salesperson 2   $38,000
job foreman     $43,200
President       $230,000
mean            $54,369
median          $32,800
Warning: The above salary information is confidential and proprietary
and should not be disclosed beyond its use in the classroom.
Is the Median Representative?
The Makit Company is a small job shop that primarily employs machine operators and engineers.
position      annual salary
clerk         $18,400
Machinist 1   $28,175
Machinist 2   $29,500
Machinist 3   $31,450
Machinist 4   $32,800
Machinist 5   $34,150
Machinist 6   $34,200
Machinist 7   $35,500
Machinist 8   $36,100
Engineer 1    $68,500
Engineer 2    $78,230
Engineer 3    $85,400
Engineer 4    $90,100
mean          $46,347
median        $34,200
Warning: The above salary information is confidential and proprietary
and should not be disclosed beyond its use in the classroom.
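The two salary examples can be checked with Python's `statistics` module. This is only a sketch; the list names are illustrative. In the first company a single extreme salary pulls the mean well above the median. In the second the salaries fall into two clusters, so the median describes a typical machinist rather than the workforce as a whole.

```python
from statistics import mean, median

# Annual salaries in dollars, copied from the two tables.
constructively_yours = [
    22050, 28175, 29500, 31450, 32800, 34150, 38000, 43200, 230000,
]
makit = [
    18400, 28175, 29500, 31450, 32800, 34150, 34200, 35500, 36100,
    68500, 78230, 85400, 90100,
]

for name, salaries in (("Constructively Yours", constructively_yours),
                       ("Makit", makit)):
    print(f"{name}: mean = ${mean(salaries):,.0f}, "
          f"median = ${median(salaries):,.0f}")
```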
Mode
The most frequent value assumed by a random variable or
occurring in a sample.
The term is applied both to probability distributions and to
collections of data.
The mode is not necessarily unique, since the same maximum
frequency may be attained at different values. The worst case
is given by the uniform distributions in which all values are
equally likely.
For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7,
12, 12, 17] is 6.
The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value, i.e. the value that is most likely to be sampled.
The mode of a continuous probability distribution is the value
x at which its probability density function attains its maximum
value.
Not affected by extreme values
Can be computed from nominal data
Example – Sample Mode
Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.
week   1   2   3   4   5   6   7   8   9   10
Mon    4   0   5   1   3   2   1   7   3   2
Tues   3   3   2   1   4   0   4   2   3   1
Wed    1   0   0   0   2   0   4   2   1   1
Thur   1   0   2   1   3   1   5   3   4   2
Fri    4   1   0   1   3   2   2   6   1   2
Bin   Frequency
0     8
1     13
2     11
3     8
4     6
5     2
6     1
7     1
Mode = 1
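A short sketch of finding the mode, both from a frequency table and from raw values. It uses `statistics.multimode` (Python 3.8+), which returns every value that attains the maximum frequency and so covers the case where the mode is not unique.

```python
from statistics import multimode

# Frequency table from the claims example: bin -> frequency.
freq = {0: 8, 1: 13, 2: 11, 3: 8, 4: 6, 5: 2, 6: 1, 7: 1}
top = max(freq.values())
modes_from_table = sorted(b for b, f in freq.items() if f == top)
print(modes_from_table)

# The sample quoted in the definition of the mode.
sample = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
print(multimode(sample))

# A hypothetical sample with two modes, to show non-uniqueness.
print(multimode([1, 1, 2, 2, 3]))
```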
geometric mean
$$\text{geometric mean} = \sqrt[n]{x_1 x_2 \cdots x_n}$$
Used to determine "average factors"
The geometric mean is smaller than or equal to the arithmetic mean
the two means are equal if and only if all members of the data set are equal
allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between
If a stock rose 10% in the first yr, 20% in the second yr and fell 15% in the third yr, then compute the geometric mean of the factors 1.10, 1.20 and 0.85 as (1.10 × 1.20 × 0.85)^(1/3) = 1.0391... and conclude that the stock rose 3.91 percent per year, on average.
answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same product?"
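The stock example can be verified numerically. The sketch below computes the geometric mean directly as the n-th root of the product and compares it with `statistics.geometric_mean` (Python 3.8+); the variable names are illustrative.

```python
import math
from statistics import geometric_mean, mean

# Yearly growth factors: +10%, +20%, -15%.
factors = [1.10, 1.20, 0.85]

g = math.prod(factors) ** (1 / len(factors))   # n-th root of the product
print(f"geometric mean factor: {g:.4f}")
print(f"average yearly growth: {100 * (g - 1):.2f}%")

# Growing by g every year gives the same three-year product.
print(math.isclose(g ** 3, math.prod(factors)))

# The geometric mean does not exceed the arithmetic mean.
print(g <= mean(factors))
```

The arithmetic mean of the factors is 1.05, which would overstate the growth: 1.05 cubed is about 1.158, while the actual three-year factor is 1.122.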
harmonic mean
is appropriate for situations when the average of rates is desired
$$\text{harmonic mean} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
If for half the distance of a trip you travel at 40 mph and for the other half of the distance you travel at 60 mph, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48 mph; that is, the total amount of time for the trip is the same as if you had traveled the entire trip at 48 mph.
If you had traveled for half the time at one speed and the other half at another, the arithmetic mean, in this case 50 mph, would provide the correct average.
In finance, used to calculate the average cost of shares purchased over a period of time.
An investor purchases $1000 worth of stock every month for three months. If the spot prices at execution time are $8, $9, and $10, then the average price the investor paid is $8.926 per share.
However, if the investor purchased 1000 shares per month, the arithmetic mean would be used.
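Both harmonic-mean examples can be checked from first principles. The sketch below assumes a trip with two equal-length halves and a fixed monthly budget; the variable names are illustrative.

```python
from statistics import harmonic_mean, mean

# Trip: equal distances at 40 mph and 60 mph.
speeds = [40, 60]
half_distance = 120                                  # miles, any value works
total_time = sum(half_distance / v for v in speeds)  # hours
avg_speed = 2 * half_distance / total_time           # total distance / total time
print(avg_speed, harmonic_mean(speeds))              # both about 48 mph
print(mean(speeds))                                  # 50 mph, the equal-time case

# Investing: a fixed $1000 per month at prices $8, $9, $10.
prices = [8, 9, 10]
budget = 1000
shares = sum(budget / p for p in prices)             # shares bought each month
avg_cost = budget * len(prices) / shares             # dollars per share
print(f"{avg_cost:.3f}", f"{harmonic_mean(prices):.3f}")
```

In both cases the quantity held fixed (distance, dollars) sits in the numerator of the rate, which is the situation where the harmonic mean is the right average.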
midrange and beyond
$$\text{midrange} = \frac{x_{min} + x_{max}}{2}$$
It is highly sensitive to outliers and ignores all but two data points; therefore it is rarely used in statistical analysis.
While the mean of a set of values minimizes the sum of squares of deviations and the median minimizes the average absolute deviation, the midrange minimizes the maximum deviation.
For a given set of positive data, harmonic mean ≤ geometric mean ≤ arithmetic mean: the harmonic mean is the least of the three, the arithmetic mean is the greatest, and the geometric mean lies in between. The three coincide only when all the values are equal.
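A small numerical check of the midrange and of the ordering of the three means, using the share prices $8, $9, $10 from the harmonic-mean example. The `midrange` helper is illustrative and not a standard-library function.

```python
from statistics import geometric_mean, harmonic_mean, mean

def midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2

prices = [8, 9, 10]
hm = harmonic_mean(prices)
gm = geometric_mean(prices)
am = mean(prices)
print(f"harmonic {hm:.4f} <= geometric {gm:.4f} <= arithmetic {am:.4f}")
print("midrange:", midrange(prices))

# The midrange minimizes the largest absolute deviation from the data.
max_dev = lambda c: max(abs(x - c) for x in prices)
print(max_dev(midrange(prices)), max_dev(am + 0.5))
```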
Measures of
Dispersion
The search for variability
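Definition: Sample Variance
If x_1, x_2, ..., x_n is a sample of n observations, the sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The sample standard deviation, s, is the positive square root of the sample variance.
How Does the Sample Variance Measure Variability?
Figure 6-2 How the sample variance measures variability through the deviations $x_i - \bar{x}$.
Example 6-2
Table 6-1
Computational Form of s^2
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
Population Variance
When the population is finite and consists of N values, we may define the population variance as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The sample variance is a reasonable estimate of the population variance.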
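For reference, the computational form of s^2 follows from expanding the squared deviations and using $\sum x_i = n\bar{x}$. A short derivation, with all sums running over i = 1, ..., n:

```latex
\begin{align*}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2
   = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n},
\end{align*}
\qquad\text{so}\qquad
s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}.
```

The two forms are algebraically identical. The computational form needs only the running totals of x and x^2, although it can lose precision in floating-point arithmetic when the mean is large relative to the spread.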
Homing in on the Sample Range
Example measures
Raw Data: Time to repair or replace in hours a failed transformer by the Dayton Power and Light Company
2.2   1.7   2.4   2.5   2.9   4.4   5.0
1.8   3.2   2.2   1.9   4.0   5.0   5.6
2.5   4.5   3.7   4.3   4.3   2.4   3.6
2.5   1.9   2.7   4.5   3.3   3.9   2.7
3.0   3.9   3.3   2.0   1.6   1.6   2.9
min       1.6
max       5.6
mean      3.1
median    2.9
std dev   1.10
range     4.0 = 5.6 – 1.6
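These summary measures can be reproduced with Python's `statistics` module. The sketch assumes the sample (n - 1) form of the standard deviation, which is what `statistics.stdev` computes.

```python
from statistics import mean, median, stdev

repair_hours = [
    2.2, 1.7, 2.4, 2.5, 2.9, 4.4, 5.0, 1.8, 3.2, 2.2, 1.9, 4.0,
    5.0, 5.6, 2.5, 4.5, 3.7, 4.3, 4.3, 2.4, 3.6, 2.5, 1.9, 2.7,
    4.5, 3.3, 3.9, 2.7, 3.0, 3.9, 3.3, 2.0, 1.6, 1.6, 2.9,
]

summary = {
    "min": min(repair_hours),
    "max": max(repair_hours),
    "mean": mean(repair_hours),
    "median": median(repair_hours),
    "std dev": stdev(repair_hours),                  # divides by n - 1
    "range": max(repair_hours) - min(repair_hours),
}
for name, value in summary.items():
    print(f"{name:<8} {value:.2f}")
```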
Quartiles
A quartile is any of the three values which divide the sorted data
set into four equal parts, so that each part represents 1/4th of the
sampled population.
Thus:
first quartile (designated Q1) = lower quartile = cuts off lowest
25% of data = 25th percentile
second quartile (designated Q2) = median = cuts data set in half
= 50th percentile
third quartile (designated Q3) = upper quartile = cuts off highest
25% of data, or lowest 75% = 75th percentile
The difference between the upper and lower quartiles is called
the interquartile range.
More Data Features
When an ordered set of data is divided into four equal parts, the division
points are called quartiles.
The first or lower quartile, q1 , is a value that has approximately one-fourth
(25%) of the observations below it and approximately 75% of the observations
above.
The second quartile, q2, has approximately one-half (50%) of the observations
below its value. The second quartile is exactly equal to the median.
The third or upper quartile, q3, has approximately three-fourths (75%) of the
observations below its value. As in the case of the median, the quartiles may
not be unique.
6-2 Example of Data Features
The compressive strength data in Table 6-2 contain n = 80 observations. Minitab software calculates the first and third quartiles as the (n + 1)/4 and 3(n + 1)/4 ordered observations and interpolates as needed.
For example, (80 + 1)/4 = 20.25 and 3(80 + 1)/4 = 60.75.
Therefore, Minitab interpolates between the 20th and 21st ordered observations to obtain q1 = 143.50 and between the 60th and 61st observations to obtain q3 = 181.00.
Data Features
• The interquartile range is the difference between the upper
and lower quartiles, and it is sometimes used as a measure of
variability.
• In general, the 100kth percentile is a data value such that
approximately 100k% of the observations are at or below this
value and approximately 100(1 - k)% of them are above it.
Examples in Variability
Professor Higgins has experienced considerable variability in
his driving time from home to the University. Help the good
professor find a measure of his variability.
driving times in minutes
observation   value   sorted value
1             44.0    21.8
2             41.7    23.3
3             34.6    26.0
4             44.8    27.9
5             21.8    28.9
6             26.0    31.4
7             28.9    32.5
8             27.9    34.6
9             38.9    37.0
10            37.0    38.9
11            23.3    38.9
12            32.5    41.7
13            45.9    42.4
14            38.9    44.0
15            42.4    44.8
16            31.4    45.9
Quartiles: Q1, Q2, Q3
variance              62.1
std dev               7.88
range                 24.1
interquartile range   13.8
true median           35.8
mean                  35.0
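The sketch below recomputes the dispersion measures for the driving times and applies the (n + 1)/4 interpolation rule described in the Minitab example. The helper `quantile_n_plus_1` is illustrative. Two caveats: the printed data are rounded to one decimal, so the variance and standard deviation come out close to, but not exactly at, the slide's values; and quartile conventions differ between packages, so the interquartile range from this rule need not match the 13.8 shown.

```python
from statistics import mean, median, stdev, variance

driving_minutes = [
    44.0, 41.7, 34.6, 44.8, 21.8, 26.0, 28.9, 27.9,
    38.9, 37.0, 23.3, 32.5, 45.9, 38.9, 42.4, 31.4,
]

def quantile_n_plus_1(values, p):
    """Interpolate at ordered position (n + 1) * p, counting from 1."""
    x = sorted(values)
    n = len(x)
    pos = (n + 1) * p
    k = int(pos)             # observation number just below the position
    frac = pos - k           # fractional distance towards observation k + 1
    if k < 1:
        return x[0]
    if k >= n:
        return x[-1]
    return x[k - 1] + frac * (x[k] - x[k - 1])

q1, q2, q3 = (quantile_n_plus_1(driving_minutes, p) for p in (0.25, 0.50, 0.75))
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}, IQR = {q3 - q1:.3f}")
print(f"variance = {variance(driving_minutes):.2f}")
print(f"std dev  = {stdev(driving_minutes):.2f}")
print(f"range    = {max(driving_minutes) - min(driving_minutes):.1f}")
print(f"median   = {median(driving_minutes):.1f}")
print(f"mean     = {mean(driving_minutes):.1f}")
```

`statistics.quantiles(data, n=4)` uses the same (n + 1) positions with its default `method="exclusive"`.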
Coefficient of Variation
Where should the Vary A. Schun
Company direct its efforts to
reduce the variability in its
production lead-time?
CV = 100 · s / X̄
Data source          mean    std dev   CV
gear assembly
  Jeff               1.65    0.088     5.33
  Jerry              1.73    0.075     4.34
Housing assembly
  Judy               4.23    1.02      24.11
  Jared              5.67    0.99      17.46
  Julie              4.78    0.85      17.78
Final Assembly
  Jane               34.56   2.45      7.09
  Jim                37.58   2.05      5.46
  John               32.1    2.11      6.57
unit production times in minutes
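The CV column can be recomputed from the means and standard deviations in the table. The sketch below uses an illustrative data structure and ranks the workers by relative variability.

```python
# worker -> (stage, mean, std dev); unit production times in minutes.
production = {
    "Jeff":  ("gear assembly",    1.65,  0.088),
    "Jerry": ("gear assembly",    1.73,  0.075),
    "Judy":  ("Housing assembly", 4.23,  1.02),
    "Jared": ("Housing assembly", 5.67,  0.99),
    "Julie": ("Housing assembly", 4.78,  0.85),
    "Jane":  ("Final Assembly",   34.56, 2.45),
    "Jim":   ("Final Assembly",   37.58, 2.05),
    "John":  ("Final Assembly",   32.1,  2.11),
}

def coefficient_of_variation(mean, std_dev):
    """CV = 100 * s / x-bar, the standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv = {w: coefficient_of_variation(m, s) for w, (_, m, s) in production.items()}

# Largest relative variability first.
for worker in sorted(cv, key=cv.get, reverse=True):
    print(f"{worker:<6} {production[worker][0]:<17} CV = {cv[worker]:.2f}")
```

Final assembly has the largest standard deviations in absolute terms, but relative to the mean the housing assembly times vary the most, which is the comparison the CV is designed to make.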
Real Bonus Material
Descriptive Statistics for the
Overachieving Student
Skewness and Kurtosis
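$$M_j = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^j, \qquad j = 1, 2, 3, 4$$
Moments about the mean. For example, variance is the second.
$$\hat{\beta}_1 = \frac{M_3}{M_2^{3/2}} = \frac{M_3}{\hat{\sigma}^3}, \qquad \hat{\beta}_2 = \frac{M_4}{M_2^{2}} = \frac{M_4}{\hat{\sigma}^4}$$
β1 is Skewness – third moment about the mean
β2 is Kurtosis – the fourth moment about the mean
Note how a power of the sample variance is used to ‘standardize’ the β1 and β2 estimates.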
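A minimal sketch of the moment-based estimates: central moments with divisor n, then the third and fourth moments standardized by powers of the second. The function names are illustrative. The small symmetric list is a hypothetical example; the repair times are the transformer data used earlier.

```python
def central_moment(x, j):
    """j-th moment about the mean, M_j, with divisor n."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** j for xi in x) / n

def skewness(x):
    """Third central moment standardized by M_2 ** (3/2)."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    """Fourth central moment standardized by M_2 ** 2 (a normal distribution gives 3)."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2

symmetric = [1, 2, 3, 4, 5]   # hypothetical symmetric sample
repair_hours = [
    2.2, 1.7, 2.4, 2.5, 2.9, 4.4, 5.0, 1.8, 3.2, 2.2, 1.9, 4.0,
    5.0, 5.6, 2.5, 4.5, 3.7, 4.3, 4.3, 2.4, 3.6, 2.5, 1.9, 2.7,
    4.5, 3.3, 3.9, 2.7, 3.0, 3.9, 3.3, 2.0, 1.6, 1.6, 2.9,
]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(repair_hours), kurtosis(repair_hours))
```

The repair times give a positive skewness, consistent with a longer right tail and with the mean (3.1) lying above the median (2.9).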
Skewness
Measures the direction and degree of departure from symmetry
If the distribution is perfectly symmetrical, the measure of skewness will be zero. Examples:
Normal distribution
uniform (rectangular) distribution
If the distribution is asymmetrical (i.e. skewed), the tail of the
distribution will extend in the direction of the positive (negative)
numbers if the measure of skewness is positive (negative)
Both distributions have the same expectation and variance.
The one on the left is positively skewed.
The one on the right is negatively skewed.
Kurtosis
The extent of peakedness in the distribution
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution (kurtosis = 3, mesokurtic).
Data with high kurtosis (above 3, i.e. positive excess kurtosis) tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
Data with low kurtosis (below 3, i.e. negative excess kurtosis) tend to have a flat top near the mean. The uniform distribution (kurtosis 1.8) is a standard example.
Higher kurtosis means more of the variance is due to infrequent extreme
deviations, as opposed to frequent modestly-sized deviations.
If a random variable’s kurtosis is greater than 3, it is said to be leptokurtic.
If its kurtosis is less than 3, it is said to be platykurtic.
The distribution on the right has higher kurtosis than the one on the left.
It is more peaked at the center, and it has fatter tails.