Transcript Power Point
Univariate Statistics
LIR 832
Class #2
September 15, 2008
Topics for Next Four Lectures
Fundamental Problem in Statistics: Learning about
populations from samples
Describing Data Compactly:
– How we might describe data, why compactness matters.
– Measures of Central Tendency (what are they, when to use them)
– Measures of dispersion
Probability Distributions:
– As samples are hopefully random draws from populations, we
need to understand the likelihood of drawing samples. This leads
us to a review of some basic probability distributions.
Inference from Samples to Populations:
– Sampling Distributions and the Central Limit Theorem
– Estimation
– Hypothesis Testing
Basic Issues in Statistics
Populations and Samples: Generally wish
to know about populations
– What is a population?
– How do we count a population?
– What types of populations would you be
concerned with in your professional life?
Basic Issues in Statistics
Use of Samples to Learn about Populations
What is a sample?
– representative sample
– random sample
– convenience sample
Why use samples rather than populations?
–
–
–
–
less time consuming to collect
less expensive to collect
often more accurate than census
population may not exist at the time data is collected
Basic Issues in Statistics
Samples are affected by randomness, two
samples drawn from a population are unlikely to
be identical (sampling variability) and neither is
an exact reproduction of the population.
– What is meant by random?
An event is random if, despite knowing all of the possible
outcomes in advance, we are not able to exactly predict a
particular outcome.
experiment: M&M issue
Samples are, to some degree, random
– different samples produce different estimates
– sample mean may be different than (but close to) population
mean
Basic Issues in Statistics
Since sample is not an exact reproduction of
the population, we need to allow for
sampling variability in using samples to
tell us about populations.
What types of HR/IR issues might involve
the use of samples?
Example: Training Program
We are interested in a training program
which is supposed to improve productivity.
It would be very expensive to implement
throughout a firm, particularly if it doesn’t
work. Instead, we set up an experiment in
which we try the program on a sample of
employees at a single location (this may be
called a pilot program).
Example: Training Program
We experiment with a pilot program and find that
productivity rose by 2% Our problem in using the
pilot (sample): if we replicate the pilot throughout
the firm is it reasonable to believe that:
– we will get a 2% boost in productivity, or
– This could this just be the result of getting a “good”
sample (Folks who happened to respond favorably to
the program).
Our core problem in using samples is
distinguishing between systematic effects of
programs and chance outcomes
Example: Turnover
You run human resources for a large low wage
manufacturing plant. Your firm has established
that there should be a 2.5% turnover rate per
month and has a policy that turnover rates above
2.5% are evidence of ineffective human resource
programs.
You have kept your turnover rate at 2.4% per
month for the last year and one half. Last month
and this month that rate has shot up to 3.2%. Is
this evidence that you are not doing your job as a
human resource manager?
What is Data?
Answer: The numeric representation of the characteristics of an
individual, object or experiment.
You, as an individual, provide multi-dimensional data:
– Quantitative:
Age
Height
Your pre-class views on LIR 832
– Qualitative:
Gender
Educational Attainment
Occupation
– Economic Data:
Income
Debt
Expected Income on graduating from this program.
What is Data?
You are multi-dimensional. We may be very
interested in the relationship between your
characteristics:
– Gender and Educational Attainment with expected
income
– Blood pressure with age
Data can also be collected on plants,
establishments and firms as well as by political or
geographic entity.
– Firm: Revenue, Operating Costs, Debt to Equity Ratio,
Number of Employees, Number of Locations,
Distribution of Occupations, Presence of Program to
Encourage Diversity in the Labor Force
What is Data?
We are typically interested in variables, data
which vary across ‘individuals’ or ‘units of
observation.
– age, gender, income vary across individuals in this
class
– revenue, profit (and so much more) varies across
divisions of General Motors
– Some characteristics do not vary in a given data set.
For example, all of you will have the same level of
educational attainment, a masters degree, when you
graduate. Such characteristics are called constants. In
contrast, level of education will vary in a survey of
MSU graduates
What is Data?
Some of the difference in individual outcomes in
variables is systematic (predictable), some is
random.
– There is systematic difference in earnings by gender &
education
– Still, if we take women graduates of this program with
three years of post graduation experience there will be
considerable differences in annual earnings. Much of
this is random in the sense that we could not predict it
in advance.
Types of Data
Qualitative or nominal data:
– The numeric values indicate qualitative states but does
not indicate rank or any arithmetic relationship.
northeast = 1/midwest = 2/south = 3/west = 4
White =1/Black =2/Asian-Pacific Islander = 3
– The numeric values designate a qualitative outcome but
they could be changed around without any loss of
information. E.G.:
northeast = 20/midwest = 100/south = -3/west = 41,298
White =1/Black =20/Asian-Pacific Islander = 3000
Types of Data
Ordinal or ranked data:
– Likert scale: (subjective)
2 = strongly dislike/ -1 = dislike/ 0 = like/ 1 = strongly like
– Income: (objective)
0 - income under $4,999 per year
1 - income from $5,000 to $9,999
2 - income from $10,000 to $14,999
– The values of the data designate a ranking or order and
we can tell if the outcomes are =, > or < but we cannot
perform arithmetic operations.
Types of Data
Cardinal Data (also called interval or ratio
data): we can apply arithmetic operations to
this data (=, >, < +, -, x, ÷). There can be
values of 0 as well as negative values.
weekly earnings
age
number of absences last week
Summarizing Data
The Problem of Compactly Describing
Populations
– Ages of students in the class (33 observations
on age from 2004):
25
27
22
22
27
24
37
21
22
24
21
31
26
39
23
23
27
22
25
22
25
32
23
51
23
25
23
22
37
27
22
25
23
Summarizing Data
Possible first step to summarize: Calculate a
range.
– Example: Class Age Data
Range: 30 (from 21 to 51)
– Example: Frito-Lay HR Metrics
Range: 27.64% (from 1.00% to 28.64%)
Summarizing Data
Another possible solution: Histogram
– One option: Create table of absolute and relative
frequencies.
Age Distribution of LIR 832 on Wednesday August 29, 2005
AGE
20 to 24
25 to 29
30 to 34
35 to 39
40 to 44
45 to 49
50 to 54
Absolute Frequency
17
10
2
3
0
0
1
Relative Frequency
.515 (= 17/33)
.303 (= 10/33)
.061 (= 2/33)
.091 (= 3/33)
0
0
.030 (= 1/33)
Summarizing Data
Another option: Graph histogram
Summarizing Data
Graph of Frito-Lay HR Metrics:
Measures of Central Tendency
Measures of central tendency: (Mode, Median &
Mean)
– Extremely compact, a single measure is used to
represent useful information about possibly large
bodies of data.
– Examples:
Mean unemployment rate
Average turnover (number or rate)
Average earnings
Health care program most commonly chosen by employees.
Class Age Data (Ordered)
21
21
22
22
22
22
22
22
22
23
23
23
23
23
23
24
24
25
25
25
25
25
26
27
27
27
27
31
32
37
37
39
51
Measures of Central Tendency
Mode - the most common or frequent
outcome
– In the class age data the mode is:_________
– Not necessarily unique
– Can be computed for all types of data
– Important method in advanced statistical
analysis (maximum likelihood estimation)
Measures of Central Tendency
Median - the middle (or central) observation if
observations are ranked from largest to smallest
– In the class age data the median is:______
– Median is unique
– Can be computed for cardinal and ordinal data
– Uses only the central observation
– Insensitive to outliers.
– Median (w/ 51 year old) = median (w/out) = 24
Measures of Central Tendency
Mean - average observation accounting for
value of observations. It is the sum of all
observations divided by the number of
observations.
– Mean - sum of all observations divided by the
number of observations
Measures of Central Tendency
– Class Age: Population Mean for 33 Students: μage =
–
–
–
–
–
–
868/33 = 26.3
All cardinal data has a mean (but ordinal and qualitative
doesn’t)
Uses all of the data in the calculation.
Only one mean
Represents the balance point of the data
Sensitive to outliers.
Remove our 51 year old from the data: μ = 25.8
Measures of Central Tendency
Q: When Should We Use a Mode, a Median or a
Mean?
A: Our goal is to give the reader an idea of the
“typical” outcome
– Mode: lovely when there are a few possible outcomes
and you want to indicate which is most popular.
Which medical plan among five is chosen most frequently by
our employees?
Not so good when you have 10,000 outcomes and there are
few duplicates. The mode doesn’t really tell you very much
(see our class age and Frito-Lay HRM data).
Model approaches lead fairly quickly to pie chart presentations
as these quickly summarize information on matters such as
“portion participating”.
Measures of Central Tendency
Mean: good because is summarizes so much data but can
be sensitive to outliers and can average across important
distinctions.
– Consider two samples of annual income: one with Bill Gates, one
without Bill Gates.
– Consider the following data on five employees blood pressure.
–
–
–
–
Normal
80
83
75
High
150
170
– What is the average blood pressure among our employees? If 120
is a cut off between normal and high blood pressure, does the mean
help us to understand health issues among employees in our firm?
Measures of Central Tendency
Median: good because it is invariant to
outliers, but it doesn’t use data very
efficiently. Unless the data is skewed, the
mean uses a lot more data efficiently than
medians and we can develop measures of
dispersion for means.
Measures of Central Tendency
Geometric or Harmonic Means:
– A type of mean that we won’t use very much in this
course
– Suppose we have a problem in which we are calculating
the return to an investment over time in which the
return is
year 1
year 2
year 3
5%
10%
12%
So your equation for calculating your return on
principle would be: principle*(1.05)*(1.10)*(1.12)
Measures of Central Tendency*
What is the mean rate of return? If we used our arithmetic
mean, we would get: (5% + 10% + 12%)/3 = 9%.
– This is wrong because it doesn’t allow for compounding. The
correct method of calculating the mean is
73
11.10
1.141.089599
15 2.0
.05_11.11
.1011.12
.121.313
1.2936
This is pretty close to the arithmetic mean, but if we
averaged: 10%, 11%, 12%, 13%, 14%, 15%, 100% interest
rates:
– arithmetic mean: 23.7%
– Geometric mean: 26.2%=
The geometric mean is the root of the number of values
you multiply to get your final result
Dispersion
A story of labor relations specialists and TV
reporters:
While the average may be the same, obviously the
two jobs are not the same in terms of pay. So we
need to have a statistic that can tell us whether the
numbers are close together, as with HR managers,
or spread out, as with TV Broadcasters.
Dispersion
Another example. Average Temperatures
in East Lansing and Mercury are similar
65○F vs 63.7○F. However, the dark side of
mercury is close to absolute zero Kelvin and
the sun side is around 350○F. The mean
temperature in E.L. conveys much more
useful information than does the mean for
mercury.
Dispersion
Dispersion: What is it, how do we measure it?
– Dispersion is very important if we are using samples to
learn about populations. Randomness in sampling
results in sample means being dispersed around the
population mean. As a result, we don’t believe that
sample statistics exactly reproduce population
characteristics. If we are going to use samples, we
need to figure out how to handle this dispersion
(sampling variability).
Most people are comfortable with measures of
central tendency, but not with dispersion.
Witness Human Resource Dashboards.
Dispersion
Issue:
– How close is a typical observation to the mean?
– How much information does our mean contain?
Dispersion
Variance and Standard Deviation:
– Dispersion is more about distance than
direction. Move to using a distance measure
– Need a measure which is
A measure of distance (don’t want sign)
In the same units as the underlying data
Mathematically tractable
Dispersion: Population
Alternative Formula: Variance
Standard Deviation
All Very Nice, so what do we do with Standard
Deviation?
– We could work with the empirical rule or Chebychev’s
(or is it Tschebychev’s) inequality
– This would give us a taste of what we will learn when
we work with the normal distribution and the Central
Limit Theorem. Since we are moving fast, we will put
these aside and wait until we get there next week.
– For the moment, we have a couple measures of
dispersion, we don’t really know what they mean.
Probability Distributions
Return to our fundamental problem: using samples
to learn about unobserved populations
Samples are random (hopefully) draws from
populations, we need to understand the likelihood
of drawing a particular samples.
– Our understanding of randomness is based on
probability theory. Probability theory is used to
understand events where all possible outcomes are
known in advance, but where it is not possible to
predict the actual outcome with certainty.
Drawing a particular hand in a poker game
How many employees will be absent tomorrow
The profitability of your department in the next quarter.
Probability: Terminology
Experiment: An experiment: an action whose outcome
cannot be known with certainty in advance
Event: the outcome of an experiment. It cannot be
predicted exactly in advance, but we can predict the
distribution of the outcomes for large numbers of trials.
Flip a coin three times and count the number of heads
The number of employees absent on a given day.
The consequences of ...
The number of heads on three flips
The number of employees absent
A probability is the likelihood that a particular event, or
set of events, will occur in the future.
A probability distribution is a list of all the outcomes
(events) of an experiment and their probabilities.
More on Probability
Distributions
A listing of events and the likelihood that those
events will occur.
Flip a coin three times and count the number of
heads:
–
–
–
–
–
Heads
0
1
2
3
Note:
Probability:
1/8
3/8
3/8
1/8
0 ≤P ≤1
Probability Distribution:
Three Coins
0.4
0.35
Probability
0.3
0.25
0.2
0.15
0.1
0.05
0
0
1
2
Number of Heads
3
Example: Charting Absences
Over 30-Day Period
Example: Charting Absences
Over 30-Day Period
Example: Charting Absences
Over 30-Day Period
Probability Distributions
Problem: Our plant has 1,500 employees. We
want to learn about their views on a variety of
issues. The survey a consultant has designed takes
30 minutes and that is too long a survey to give to
the workforce of the entire plant. Instead, we are
going to sample 5%, 1/20th, of the plant
workforce.
– We would like the sample to be representative and for
every employee to have an equal chance of being
chosen to participate in the sample. How can we do
this?
Probability Distributions:
Uniform
The probability distribution in which all
individuals are equally likely to be chosen is
a uniform distribution
– Every outcome is equally likely
– Rolling a fair die
– Commonly used in lotteries and to draw simple
random samples
Probability Distributions:
Uniform
Probability Distributions
With integer (discrete) data, the formula for the
probability of any one person being drawn is:
Why add 1?
– Consider a problem in which we have a distribution of
the hourly earnings of employees (who earn between 5
and10 per hour). If we just calculate the number of
points as : 10- 5 we get 5 or 1/5 probability; But there
are 6 points: 5 6 7 8 9 10 as we include the bottom
point. So we need to add 1 back in.
Probability Distributions:
Uniform
Probability Distributions
Return to our problem using statistical software:
(USE MINITAB IN CLASSROOM).
– We have a list of all students enrolled in the SLIR
currently.
– Use MINITAB to randomly assign a number between 0
and 100 (or 0 and 1) at random. Each number has an
equal likelihood of being assigned.
– Now chose an interval that contains 5% of all the
numbers between 0 and 100. For example, 0 - 5 or 60 65. This should actually be done in advance.
– The employees in this group are your chosen 5%.
Congratulations, you have now chosen a random
5% sample!
Probability Distributions:
Poisson
Staffing a Call Center: How Many Employees When the
Number of Calls Coming in a Random?
– Example: We oversee a benefits call center. Call center personnel
indicate that a typical call takes 50 minutes to field. Personnel also
get a 10 minute break every hour to be able to collect their
thoughts, get to a bathroom, and so on. So our typical employee
can handle one call per hour. Our records indicate that the call
center averages 20 calls per hour.
– We are concerned about the quality of the service provided by the
call center. Quality has many dimensions, but a critical dimension
is whether employees are able to get through when they call. How
many employees do we need to be certain we will not turn away
more than 1 in 10 calls? More than 1 in 20 calls?
Probability Distributions:
Poisson
Poisson - developed to predict the number
of events occurring at a particular time or in
a particular space:
– Cars arriving at a toll booth
– Dents in a one meter square of sheet metal
– Deaths due to kicks in the head by a horse in a
Prussian cavalry division
– Number of calls arriving at a call center in an
hour (related to manning)
Probability Distributions:
Poisson
Some Mathematical Argot
e, the natural base, is 2.71828182845905. So, e5 = 2.718281828459055 =
148.41
x! = x*(x-1)*(x-2)*(x-3).....3*2*1
5! = 5*4*3*2*1
Probability Distributions:
Poisson
Learning to Use the Poisson Formula with a
Simple Problem:
– Average 3 calls per hour, calculate the probability of up
to 10 calls
You calculate p(3), p(4) & p(5) (Divide the
classroom)
Probability Distributions:
Poisson
Now calculate the probability of no more
than 1, 2, 3 ..... phone calls (This is
referred to as a cumulative probability)
– P(1) = .1493
– P(1 or 2 ) = .1493 + .2240
– P(1 or 2 or 3) = ?
Call Center Example
Probability Distributions:
Binomial
A Problem of Discrimination, or Is It?
– XYZ firm is going through a layoff at a plant. Thirty-eight percent
of employees are age 40 or older. The firm downsizes and lays off
100 employees. 47 of those employees are age 40 or older. Does
this firm have a problem with the ADEA?
Legal standard: A layoff provides prima facia evidence of
discrimination if the observed outcome, an excessive
proportion of older workers being laid off, has less than a
5% probability of happening by chance.
An excess number of older workers have being laid off.
What is the likelihood that 47 percent of the laid off
workers would be 40 or older absent some intent to reduce
the proportion of older workers in the labor force?
Probability Distributions:
Binomial
Binomial:We can model our layoff as a
binomial event, you are either laid off or not
laid off.
– underlying a binomial is an event with only two
outcomes.
– typically interested the likelihood of a set of multiple
outcomes.
– Proto-typical example: The likelihood of getting X
heads on flipping a coin Y times. Consider the
likelihood of flipping a coin 3 times. What is the
probability distribution for a coin which has a 75%
chance of coming up heads?
Probability Distributions:
Binomial
Probability Distributions:
Binomial
Example: You are at a firm in which 38% of
employees are age 40 or older. The firm
downsizes and lays off 100 employees. 47 of
those employees are age 40 or older. Does this
firm have a problem with the ADEA?
– Probabilistic issue: each layoff decision is a yes/no
decision similar to a coin toss. Formally, we are asking
the equivalent of the following problem. Consider a
coin which turns up heads 38% of the time (not a fair
coin). What is the likelihood on 100 flips that we will
get 47 heads?
Probability Distributions:
Binomial
Probability Distributions:
Hazard
Survival or Hazard Function – Exponential
Distribution: (Time to failure)
– originally developed to predict the expected length of
–
–
–
–
life of a lightbulb
How long an employee will be absent from work with
an injury.
How long a machine will run prior to needing to be
adjusted.
How long an employee will remain with a firm or in a
particular job.
Because this involves time, which is a continuous
variable, we turn to this a little later in our discussion of
univariate statistics.
Probability Distributions
To this point we have worked with random
variables in which outcomes were discrete
(integer) and which each integer outcome
was associated with a probability.
– Our binomial problem: What is the likelihood
that out of a group of fifty employees, eight
will resign in the next month?
Probability Distributions:
Continuous Variables
Many of the numbers we are concerned with are
continuous
– turnover rates
– time to resignation or time itself
– earnings (too many to treat as discrete)
With continuous numbers, we can find a number
between any two other numbers. For example, 1.5
falls between 1 and 2 while 1.27 falls between 1
and 1.5 (as do many other numbers).
Probability Distributions:
Continuous Variables
Example: Consider a problem in which we have a
distribution of the hourly earnings of employees
(who earn between 5 and10 per hour). To this
point, we have been working with integers, its as
if we assumed that people earned $5, $6, $7, $8,
$9, or $10. If wages were uniformly distributed,
then 1/6 employees would be at each dollar
amount. So if we asked what proportion of
employees earned less than $6, it would be 1/6.
Probability Distributions:
Continuous Variables
Probability Distributions:
Continuous Variables
Now consider the more realistic situation, in
which employees can earn any amount
between $5 and $6 – lets still assume that
earnings are uniformly distributed
– How might we calculate the probability that the
wage falls between $5.00 and $6.00?
– P(5.00 x 6.00) = 1/5th of the area between 5
and 10 = .20 or 20%
Probability Distributions:
Continuous Variables
We are now working with areas rather than with
points.
– Rather than use probabilities P(X) where we talk about
the P(5.5) or P(6), we use:
– Points no longer have any probability value (weight)
P(5.50) = 0 = p(6.0) = P(X= x for any x)
– This occurs because, there are an infinity (∞)of
numbers between 5 and 7. If we divide 1, the value of
5.5 by infiinity, the outcome is zero.
1/∞ = 0
So, with continuous numbers, no single point has
any value. Instead, we need to think about area.
Probability Distributions:
Continuous Variables
Normal Distribution
Normal Distribution
Why are we learning about the normal?
– The “bell curve” was first noticed when 18th century
astronomers noticed that errors in their predictions of
the positions of plants and other heavenly bodies tended
to cluster symmetrically around the mean. A graph of
the errors had a bell shape.
– In the 19th century a Belgian astronomer observed that
this “law of error” also applied to many human
phenomena such as the chest sizes of more than 5,000
Scottish soldiers (the mean was 40")
Normal Distribution
As a matter of statistics, the bell curve is
assured to arise whenever some variable,
such as human height, is determined by lots
of little causes such as genes, health and
diet (New Yorker, page 86, January 24,
2005).
Normal Distribution
Good description of natural and social
phenomena.
– Natural phenomena
neck and arm lengths
Heights
– Social phenomena
grades in a large class
weekly earnings (log of)
– Samples taken from large populations
Central Limit Theorem tells us that the means of samples
of 30 or more observations are normally distributed
around the population mean.
Normal Distribution
Normal Distribution
Normal Distribution
Characteristics of the Normal
– Symmetric around its mean
mean = media = mode
concentrated close to the mean:
– 68% of observations are ± σ (within one standard
deviation) of the mean
– 95.44% of observations are within 2 standard deviations of
the mean.
– Suddenly, we are using Standard Deviations for
a purpose. More to follow.
Normal Distribution
Normal Distribution
If x (or y or w or any random variable) is normal,
then we write it out in shorthand as
– x ~ N(μ, σ2)
– Where μ is the population mean and σ2 is the
population variance
So if we have a normal with mean 3 and variance
4, we can write that normal out as (using w as the
designation of our random variable)
– W ~ N(3, 4)
Normal Distribution
Normal Distribution
The Z distribution:
– Fortunately we have tables for a particular
normal distribution
– z ~ N(0,1) normally distributed mean 0
variance 1
Normal Distribution
Pass Out Z-Table Here.
Normal Distribution
Using the Table for Z values:
– P(z ≥1.5) where z~N(0,1)?
– Translate:
what is the probability of observing a z at least 1.5 standard
deviations above the mean (>= 1.5) if the distribution of z is
normal 0,1?
If we chose 100 random values from a standard normal
distribution, what proportion would be greater than or equal to
1.5?
Use of the Table: Find 1.5 on the vertical scale, find 0.00 on
the horizontal scale outcome.
– P(z ≥1.50) = .0668 or 6.68% so the body of the table
provides the probability of an event
Normal Distribution
Normal Distribution
Normal Distribution
Normal Distribution
A problem: The scores for an employment
exam are normally distributed with a mean
of 500 and a variance of 5625. Your boss
only wants to hire individuals scoring more
than 650 on this exam. What is the
likelihood of observing a score of more than
650?
Note: the random variable GRE ~ N(500,5625) and
has standard deviation of 75
Normal Distribution
What is the likelihood of observing a score
of more than 650?
– Pr(GRE ≥ 650)
– Pr((GRE - μ)/σ ≥(650 - 500)/75))
– Pr(z ≥(150/75))
– Pr(z ≥2) = .0228 or 2.28%
Normal Distribution
If our boss wants us to hire 10 employees
with scores of 650 or more, how many
exams must we give to be reasonably sure
of getting ten qualified prospects?
Sampling Distributions /
Central Limit Theorem
Our statistical issue: If we wish to use samples to
say useful things about unobserved populations,
then we need to know the relationship between
populations and the samples they produce
or
What is the relationship between the
characteristics of a population and the mean of a
sample drawn from that population?
Formally: if x comes from a population with
mean μ and variance σ2, how is distributed?
Sampling Distributions /
Central Limit Theorem
Samples and Populations
Clarifying our thinking about samples:
– How many samples are there in a population?
Population of 10 and we take a sample of 3. There
are
10C3 = 10!/3!*7! (Aka the combination of ten
things taken three at a time)
– Where 3! = 3*2*1
– 7! = 7*6*5*4*3*2*1
Samples = 120 samples
Samples and Populations
Samples and Populations
Example of the number of samples which
can be obtained from a population:
– Population of 5: sample size 3:
– We have five individuals age 22, 24, 26, 28,
and 30. Take samples of 3.
What is the mean and variance of the population?
How many samples of three are there?
What is the distribution of mean ages among these
samples of three?
Samples and Populations
What is the mean and variance of the population?
– μ = 26
– σ2 = 8
How many samples of three are there.?
– 5C3
= 5!/3!*2! = 10
What is the distribution of mean ages among
these samples of three?
Samples and Populations
Samples and Populations
Samples and Populations
Samples and Populations
Samples and Populations
There are many samples which may be
drawn from a given population. We
typically draw a single sample, but if we
drew systematically, we could draw a very
large number of samples. Hence we can
talk about the distribution of x-bar.
Sample Means are close to μ, but they are
not necessarily equal to it.
Samples and Populations
Different samples produce different means.
Samples do not exactly reproduce population
characteristics. This is known as sampling
variability. Some texts refer to this as sampling
error but this is a misleading term.
The variance of the sample means is much smaller
than the variance of the population data.
The single sample is a source of profound
confusion. Because we draw but one, it doesn’t
look random, we don’t see the sampling variability
and don’t always understand why the sample is
not the ‘correct’ value of the population.
Samples and Populations
Central Limit Theorem
Central Limit Theorem
This result is without reference to the
distribution of the underlying population.
– Why is this important? Because if we don’t know the
mean and variance of the population, we probably don’t
know the distribution. But with the CLT, we can still
do a lot in discussing the relationship of the sample to
the population.
Central Limit Theorem
As n becomes large, variance around μ becomes small.
This is sensible since observations which are far above the
mean tend to get averaged in with observations which are
far below the mean.
The precision of the estimate is not a function of the
size of the population, but of the size of the sample. A
given size of sample will give the same precision whether
we are sampling Ingham county or China (well, it’s a little
more complex than that, but the point is generally correct).
Central Limit Theorem
Central Limit Theorem
Central Limit Theorem
INSERT FIGURE 24 HERE.
Central Limit Theorem
Central Limit Theorem
Central Limit Theorem
Central Limit Theorem
We are interested in whether a new IT system in a
plant has accelerated the improvement in
productivity in a plant. Suppose that over a long
period we have found that productivity improved
at an annualized rates of 3.0% per month with
annualized variance of 1.5%. We look at the
twelve months following installation of the new IT
and find a growth rate of 3.5% annually. How
likely are we to get this higher rate of growth if
the plant is really on a 3% growth path. Growth
rates are normally distributed.
Samples and Populations
Samples and Populations
Issue: Because of sampling variability we know
that our point estimate is unlikely to be exact. We
would like to develop an estimator which allows
for sampling variability.
The problem with estimation: Shooting in a
gallery in which we never see if we hit the target.
Need to establish the properties of estimators
theoretically
Samples and Populations
Three properties which we like in an estimator:
– unbiased: the estimator does not systematically over
or undershoot the population parameter of interest
systematically.
– consistent: as the number of observations increases, the
variability of our estimator decreases.
Example: the variance of the sample mean falls at rate n
– efficient: for a given sample size, we prefer that our
estimator have the smallest possible variance.
Samples and Populations
Unbiased: The sampling distribution is centered
around the population mean.
– A biased and unbiased estimator: We are measuring the
number of units which flow by a point in a 10 second
period. We have calibrated one instrument correctly.
Unfortunately, the second instrument is mis-calibrated,
and adds 1 to each count. The number of units which
flow by the point is random but, as a sample, is
normally distributed. We get the following chart.
Samples and Populations
Samples and Populations
Samples & Populations
Samples & Populations
Samples and Populations
Need to be a little careful about using the term bias
– Example: For whatever reasons, we take a sample of
men to estimate their absence behavior. This sample is
likely a biased estimator of the absence behavior of the
full population because male and female absence
behavior is likely very different. It is an unbiased
sample if we only use it to learn about male behavior.
Whether this sample is biased or not depends on the
population for which we use it. In the second instance,
we acknowledge the limits of the sample (and so limit its
usefulness) but the sample will be accurate (unbiased).
Samples and Populations
In the instance of the male sample, we at
least know that it is a biased estimator of
overall absence behavior, even if we don’t
know the direction of the bias. All to often,
researchers don’t realize that they are using
a biased estimator.
Example: Send a survey out to 3,000
trucking firms, get 177 responses back
Samples and Populations
Consistent: As the number of observations
increases, our estimator becomes more
precise.
– We know that the mean is a consistent
estimator, is the median a consistent estimator?
Samples and Populations
Samples and Populations
Samples and Populations