Transcript Week 3

Review of Basic Statistical Concepts
Farideh Dehkordi-Vakil
Inferential Statistics

Introduction to Inference



The purpose of inference is to draw conclusions from data.
Conclusions must take into account the natural variability in the data,
so formal inference relies on probability to describe chance
variation.
We will go over the two most prominent types of formal statistical
inference:



Confidence Intervals for estimating the value of a population
parameter.
Tests of significance, which assess the evidence for a claim.
Both types of inference are based on the sampling distribution of
statistics.



Since both methods of formal inference are based
on sampling distributions, they require a probability
model for the data.
The model is most secure and inference is most
reliable when the data are produced by a properly
randomized design.
When we use statistical inference we assume that
the data come from a randomly selected sample or
a randomized experiment.




A market research firm interviews a random
sample of 2500 adults. Results: 66% find shopping
for clothes frustrating and time-consuming.
That is the truth about the 2500 people in the
sample.
What is the truth about the almost 210 million
American adults who make up the population?
Since the sample was chosen at random, it is
reasonable to think that these 2500 people
represent the entire population pretty well.



Therefore, the market researchers turn the fact that
66% of the sample find shopping frustrating into an
estimate that about 66% of all adults feel this way.
Using a fact about a sample to estimate the truth
about the whole population is called statistical
inference.
To think about inference, we must keep straight
whether a number describes a sample or a
population.

Parameters and Statistics

A parameter is a number that describes the population.


A parameter is a fixed number, but in practice we do not know
its value.
A statistic is a number that describes a sample.


The value of a statistic is known when we have taken a
sample, but it can change from sample to sample.
We often use a statistic to estimate an unknown parameter.




Changing consumer attitudes towards shopping
are of great interest to retailers and makers of
consumer goods.
One trend of concern to marketers is that fewer
people enjoy shopping than in the past.
A market research firm conducts an annual survey
of consumer attitudes.
The population is all U.S. residents aged 18 and
over.
Example: Consumer attitude towards
shopping



A recent survey asked a nationwide random
sample of 2500 adults if they agreed or disagreed
that “I like buying new clothes, but shopping is
often frustrating and time-consuming.”
Of the respondents, 1650 said they agreed.
The proportion of the sample who agreed that
clothes shopping is often frustrating is:
$$\hat{P} = \frac{1650}{2500} = 0.66 = 66\%$$
Example: Consumer attitude towards
shopping



The number P̂ = .66 is a statistic.
The corresponding parameter is the
proportion (call it P) of all adult U.S.
residents who would have said “agree” if
asked the same question.
We don’t know the value of the parameter P, so
we use P̂ as its estimate.




If the marketing firm took a second random
sample of 2500 adults, the new sample would
have different people in it.
It is almost certain that there would not be exactly
1650 positive responses.
That is, the value of P̂ will vary from sample to
sample.
Random samples eliminate bias from the act of
choosing a sample, but they can still be wrong
because of the variability that results when we
choose at random.



The first advantage of choosing at random is that
it eliminates bias.
The second advantage is that if we take lots of
random samples of the same size from the same
population, the variation from sample to sample
will follow a predictable pattern.
All statistical inference is based on one idea: to
see how trustworthy a procedure is, ask what
would happen if we repeated it many times.



Suppose that exactly 60% of adults find shopping
for clothes frustrating and time-consuming.
That is, the truth about the population is that
P = 0.6.
What if we select an SRS of size 100 from this
population and use the sample proportion P̂ to
estimate the unknown value of the population
proportion P?

To answer this question:




Take a large number of samples of size 100
from this population.
Calculate the sample proportion for each
sample.
Make a histogram of the values of P̂.
Examine the distribution displayed in the
histogram for shape, center, and spread, as
well as outliers or other deviations.



The results of many SRSs have a regular pattern.
Here we draw 1000 SRSs of size 100 from the same population.
The histogram shows the distribution of the 1000 sample proportions.
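
The repeated-sampling experiment described here can be imitated on a computer. The sketch below is only an illustration: it assumes P = 0.6 as in the text, treats each response as an independent yes/no draw (reasonable when the population is much larger than the sample), and uses an arbitrary random seed. The names sample_proportion, p_hats, center, and spread are made up for this example.

```python
import random
import statistics

random.seed(3)  # arbitrary seed, only so the run is repeatable

P_TRUE = 0.6        # population proportion assumed in the text
SAMPLE_SIZE = 100   # size of each SRS
NUM_SAMPLES = 1000  # number of repeated samples

def sample_proportion(p, n):
    """Return p-hat for one sample of n independent yes/no responses."""
    agree = sum(1 for _ in range(n) if random.random() < p)
    return agree / n

p_hats = [sample_proportion(P_TRUE, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)
print(f"mean of the p-hat values:    {center:.3f}")
print(f"std dev of the p-hat values: {spread:.3f}")

# Text histogram: bins of width 0.02, one '#' per 5 samples.
# The bin arithmetic works in percent, which is valid because SAMPLE_SIZE is 100.
counts = {}
for p_hat in p_hats:
    k = round(p_hat * SAMPLE_SIZE)   # number who agreed in this sample
    bin_start = (k // 2) * 2         # lower edge of the bin, in percent
    counts[bin_start] = counts.get(bin_start, 0) + 1
for bin_start in sorted(counts):
    print(f"{bin_start / 100:.2f} {'#' * (counts[bin_start] // 5)}")
```

The printed histogram should look roughly bell shaped and centered near 0.6, which is the regular pattern the slide refers to.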

Sampling Distribution

The sampling distribution of a statistic is the
distribution of values taken by the statistic in all
possible samples of the same size from the
same population.
Example: Mean income of American
households




What is the mean income of households in the
United States?
The Bureau of Labor Statistics contacted a random
sample of 55,000 households in March 2001 for
the Current Population Survey (CPS).
The mean income of the 55,000 households for the
year 2000 was x̄ = $57,045.
$57,045 is a statistic that describes the CPS
sample households.
Example: Mean income of American
households




We use it to estimate an unknown parameter, the
mean income of all 106 million American
households.
We know that x̄ would take several different
values if the Bureau of Labor Statistics had taken
several samples in March 2001.
We also know that this sampling variability
follows a regular pattern that can tell us how
accurate the sample result is likely to be.
That pattern obeys the laws of probability.
Normal Density Curve

These density curves, called normal curves, are

Symmetric
Single-peaked
Bell-shaped

Normal curves describe normal distributions.
Normal Density Curve



The exact density curve for a particular
normal distribution is described by giving
its mean μ and its standard deviation σ.
The mean is located at the center of the
symmetric curve and it is the same as the
median.
The standard deviation σ controls the spread
of a normal curve.
Normal Density Curve
The 68-95-99.7 Rule


Although there are many normal curves, they all
have common properties. In particular, all Normal
distributions obey the following rule.
In a normal distribution with mean μ and standard
deviation σ, approximately:

68% of the observations fall within σ of the mean μ.
95% of the observations fall within 2σ of μ.
99.7% of the observations fall within 3σ of μ.
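
The three percentages in the rule can be checked numerically. The sketch below uses NormalDist from Python's standard library; the helper name area_within is made up for this illustration.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")

# The same areas hold for any mean and standard deviation, for example N(12, 16.5).
print(f"N(12, 16.5), within 2 sd: {area_within(2, mu=12, sigma=16.5):.4f}")
```

The exact areas are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated as an approximation.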

Standardizing and z-score

If x is an observation from a distribution that has mean μ and
standard deviation σ, the standardized value of
x is
$$z = \frac{x - \mu}{\sigma}$$
A standardized value is often called a z-score.
Standard Normal Distribution

The standard Normal
distribution is the
Normal distribution
N(0, 1) with mean
μ = 0 and standard
deviation σ = 1.
Standard Normal Distribution

If a variable x has any normal distribution
N(μ, σ) with mean μ and standard deviation
σ, then the standardized variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard Normal distribution.
The Standard Normal Table

Table A is a table of areas
under the standard
Normal curve. The
table entry for each
value z is the area
under the curve to the
left of z.
The Standard Normal Table



What is the area under
the standard normal
curve to the right of
z = -2.15?
Compact notation:
P(z > -2.15) = 1 - 0.0158 = 0.9842
The Standard Normal Table



What is the area under
the standard normal
curve between z = 0
and z = 2.3?
Compact notation:
P(0 < z < 2.3) = 0.9893 - 0.5 = 0.4893
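
The two table lookups above can be reproduced without a printed table. This is a small sketch using the standard library's NormalDist, whose cdf method plays the role of Table A (area to the left of z); the variable names are chosen for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution N(0, 1)

# Area to the right of z = -2.15 is 1 minus the area to its left.
area_left = Z.cdf(-2.15)
area_right = 1 - area_left

# Area between z = 0 and z = 2.3 is the difference of two left-hand areas.
area_between = Z.cdf(2.3) - Z.cdf(0)

print(f"area left of -2.15:  {area_left:.4f}")
print(f"area right of -2.15: {area_right:.4f}")
print(f"area between 0 and 2.3: {area_between:.4f}")
```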
Example: Annual rate of return on stock
indexes
The annual rate of return on stock indexes (which
combine many individual stocks) is approximately
Normal. Since 1954, the S&P 500 stock index has
had a mean yearly return of about 12%, with
standard deviation of 16.5%. Take this Normal
distribution to be the distribution of yearly returns
over a long period. The market is down for the
year if the return on the index is less than zero. In
what proportion of years is the market down?
Example: Annual rate of return on stock
indexes

State the problem

Call the annual rate of return for the S&P 500 stock index x. The
variable x has the N(12, 16.5) distribution. We want the proportion
of years with
x < 0.

Standardize

Subtract the mean, then divide by the standard deviation, to turn x
into a standard Normal z:
$$x < 0$$
$$\frac{x - 12}{16.5} < \frac{0 - 12}{16.5}$$
$$z < -0.73$$
Example: Annual rate of return on stock
indexes

Draw a picture to show
the standard normal curve
with the area of interest
shaded.

Use the table

The proportion of
observations less than
-0.73 is 0.2327.
The market is down on an
annual basis about 23.27%
of the time.
Example: Annual rate of return on stock
indexes

What percent of years have annual returns
between 12% and 50%?

State the problem
$$12 \le x \le 50$$

Standardize
$$\frac{12 - 12}{16.5} \le \frac{x - 12}{16.5} \le \frac{50 - 12}{16.5}$$
$$0 \le z \le 2.30$$
Example: Annual rate of return on stock
indexes

Draw a picture.
Use the table.

The area between 0
and 2.30 is the area
below 2.30 minus the
area below 0.
0.9893 - 0.50 = 0.4893
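
Both stock-index questions can also be answered directly from the N(12, 16.5) distribution, without standardizing by hand. The sketch below uses the standard library's NormalDist; the variable names are chosen for this illustration. Because the software does not round z to two decimals, its answers differ slightly from the table-based answers.

```python
from statistics import NormalDist

returns = NormalDist(mu=12, sigma=16.5)  # yearly return in percent, as in the example

# z-scores, shown only to connect with the hand calculation
z_down = (0 - 12) / 16.5
z_high = (50 - 12) / 16.5

p_down = returns.cdf(0)                        # proportion of years with x < 0
p_between = returns.cdf(50) - returns.cdf(12)  # proportion with 12 <= x <= 50

print(f"z for x = 0:  {z_down:.2f}")
print(f"z for x = 50: {z_high:.2f}")
print(f"P(x < 0)         = {p_down:.4f}")
print(f"P(12 <= x <= 50) = {p_between:.4f}")
```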
Estimating with Confidence

Community banks are banks with less than a billion dollars
of assets. There are approximately 7500 such banks in the
United States. In many studies of the industry these banks
are considered separately from banks that have more than a
billion dollars of assets. The latter banks are called “large
institutions.” The Community Bankers Council of the
American Bankers Association (ABA) conducts an annual
survey of community banks. For the 110 banks that make
up the sample in a recent survey, the mean assets are x̄ =
220 (in millions of dollars). What can we say about μ, the
mean assets of all community banks?
Estimating with Confidence

The sample mean x̄ is the natural estimator of the
unknown population mean μ.
We know that

x̄ is an unbiased estimator of μ.
The law of large numbers says that the sample mean
must approach the population mean as the size of the
sample grows.

Therefore, the value x̄ = 220 appears to be a
reasonable estimate of the mean assets μ for all
community banks.
But, how reliable is this estimate?
Estimating with Confidence



An estimate without an indication of its variability
is of limited value.
Questions about the variation of an estimator are
answered by looking at the spread of its sampling
distribution.
According to the central limit theorem:

If the entire population of community bank assets has
mean μ and standard deviation σ, then in repeated
samples of size 110 the sample mean x̄ approximately
follows the N(μ, σ/√110) distribution.
Estimating with Confidence

Suppose that the true standard deviation σ is equal
to the sample standard deviation s = 161.

This is not realistic, although it will give reasonably
accurate results for samples as large as 100. Later on
we will learn how to proceed when σ is not known.
Therefore, by the central limit theorem, in repeated
sampling the sample mean x̄ is approximately
normal, centered at the unknown population mean
μ, with standard deviation
$$\sigma_{\bar{x}} = \frac{161}{\sqrt{110}} \approx 15 \text{ million dollars}$$
Confidence Interval

A level C confidence interval for a parameter has
two parts:


An interval calculated from the data, usually of the
form
estimate ± margin of error
A confidence level C, which gives the probability that
the interval will capture the true parameter value in
repeated samples.
Confidence Interval




We use the sampling distribution of the sample
mean x̄ to construct a level C confidence interval
for the mean μ of a population.
We assume that the data are an SRS of size n.
The sampling distribution of x̄ is exactly N(μ, σ/√n)
when the population has the N(μ, σ) distribution.
The central limit theorem says that this same
sampling distribution is approximately correct for
large samples whenever the population mean and
standard deviation are μ and σ.
Confidence Interval for a Population Mean

Choose an SRS of size n from a population having unknown
mean μ and known standard deviation σ. A level C
confidence interval for μ is
$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$
Here z* is the critical value with area C between -z* and
z* under the standard Normal curve. The quantity
$$z^* \frac{\sigma}{\sqrt{n}}$$
is the margin of error. The interval is exact when the
population distribution is normal and is approximately
correct when n is large in other cases.
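
One way to see what the confidence level means is to simulate repeated sampling and count how often the interval captures the true mean. The sketch below is an illustration under stated assumptions: it uses a hypothetical Normal population with μ = 220 and σ = 161 (numbers borrowed from the community bank example only for convenience; real bank assets are not Normal), an arbitrary random seed, and 2000 repetitions.

```python
import math
import random
from statistics import NormalDist

random.seed(11)  # arbitrary seed, only so the run is repeatable

MU, SIGMA = 220.0, 161.0  # hypothetical "true" population values
N = 110                   # sample size
C = 0.95                  # confidence level
REPS = 2000               # number of repeated samples

# Critical value z* with area C between -z* and z*.
z_star = NormalDist().inv_cdf((1 + C) / 2)
margin = z_star * SIGMA / math.sqrt(N)

hits = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    if x_bar - margin <= MU <= x_bar + margin:
        hits += 1

coverage = hits / REPS
print(f"z* = {z_star:.3f}, margin of error = {margin:.1f}")
print(f"fraction of intervals that captured mu: {coverage:.3f}")
```

The fraction should come out close to 0.95. Any single interval either contains μ or does not; the 95% describes the procedure over many samples.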
Example: Banks’ loan-to-deposit ratio

The ABA survey of community banks also asked
about the loan-to-deposit ratio (LTDR), a bank’s
total loans as a percent of its total deposits. The
mean LTDR for the 110 banks in the sample is
x̄ = 76.7 and the standard deviation is s = 12.3. This
sample is sufficiently large for us to use s as the
population σ here. Find a 95% confidence interval
for the mean LTDR for community banks.
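
The slide leaves the calculation as an exercise. One possible way to carry it out is sketched below, applying the level C interval for a population mean with s used in place of σ, as the example permits. The variable names are chosen for this illustration.

```python
import math
from statistics import NormalDist

x_bar = 76.7   # sample mean LTDR
s = 12.3       # sample standard deviation, used here as if it were sigma
n = 110        # number of banks in the sample
C = 0.95       # confidence level

z_star = NormalDist().inv_cdf((1 + C) / 2)  # critical value, about 1.96
std_error = s / math.sqrt(n)                # standard deviation of x-bar
margin = z_star * std_error
lower, upper = x_bar - margin, x_bar + margin

print(f"z* = {z_star:.3f}")
print(f"margin of error = {margin:.2f}")
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```

Under these assumptions the interval runs from roughly 74.4 to 79.0, that is, 76.7 plus or minus about 2.3.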
Tests of Significance





Confidence intervals are appropriate when our goal is to
estimate a population parameter.
The second type of inference is directed at assessing the
evidence provided by the data in favor of some claim about
the population.
A significance test is a formal procedure for comparing
observed data with a hypothesis whose truth we want to
assess.
The hypothesis is a statement about the parameters in a
population or model.
The results of a test are expressed in terms of a probability
that measures how well the data and the hypothesis agree.
Example: Bank’s net income

The community bank survey described
previously also asked about net income and
reported the percent change in net income between
the first half of last year and the first half of this
year. The mean change for the 110 banks in the
sample is x̄ = 8.1%. Because the sample size is
large, we are willing to use the sample standard
deviation s = 26.4% as if it were the population
standard deviation σ. The large sample size also
makes it reasonable to assume that x̄ is
approximately normal.
Example: Bank’s net income



Is the 8.1% mean increase in a sample good evidence that
the net income for all banks has changed?
The sample result might happen just by chance even if the
true mean change for all banks is μ = 0%.
To answer this question, we ask another:

Suppose that the truth about the population is that μ = 0% (this is
our hypothesis).
What is the probability of observing a sample mean at least as far
from zero as 8.1%?
Example: Bank’s net income

The answer is:

$$P(\bar{x} \ge 8.1) = P\left(Z \ge \frac{8.1 - 0}{26.4/\sqrt{110}}\right) = P(Z \ge 3.22) = 1 - 0.9994 = 0.0006$$

Because this probability is so small, we see that the
sample mean x̄ = 8.1 is incompatible with a population
mean of μ = 0.
We conclude that the income of community banks has
changed since last year.
Example: Bank’s net income

The fact that the calculated probability is very
small leads us to conclude that the average percent
change in income is not in fact zero. Here is why.


If the true mean is μ = 0, we would see a sample mean
as far away as 8.1% only six times per 10,000 samples.
So there are only two possibilities:

μ = 0 and we have observed something very unusual, or
μ is not zero but has some other value that makes the
observed data more probable.
Example: Bank’s net income


We calculated a probability taking the first
of these choices as true (μ = 0). That
probability guides our final choice.
If the probability is very small, the data
don’t fit the first possibility and we
conclude that the mean is not in fact zero.
Tests of Significance: Formal details


The first step in a test of significance is to state a
claim that we will try to find evidence against.
Null Hypothesis H0



The statement being tested in a test of significance is
called the null hypothesis.
The test of significance is designed to assess the
strength of the evidence against the null hypothesis.
Usually the null hypothesis is a statement of “no effect”
or “no difference.” We abbreviate “null hypothesis” as
H0.
Tests of Significance: Formal details

A null hypothesis is a statement about a population,
expressed in terms of some parameter or parameters.




The null hypothesis in our bank survey example is
H0: μ = 0
It is convenient also to give a name to the statement we
hope or suspect is true instead of H0.
This is called the alternative hypothesis and is abbreviated
as Ha.
In our bank survey example the alternative hypothesis
states that the percent change in net income is not zero. We
write this as
Ha: μ ≠ 0
Tests of Significance: Formal details



Since Ha expresses the effect that we hope to find evidence
for, we often begin with Ha and then set up H0 as the
statement that the hoped-for effect is not present.
Stating Ha is not always straightforward.
It is not always clear whether Ha should be one-sided or
two-sided.
The alternative Ha: μ ≠ 0 in the bank net income
example is two-sided.
In any given year, income may increase or decrease, so
we include both possibilities in the alternative
hypothesis.
Tests of Significance: Formal details

Test statistics

We will learn the form of significance tests in a
number of common situations. Here are some
principles that apply to most tests and that help
in understanding the form of tests:


The test is based on a statistic that estimates the
parameter appearing in the hypotheses.
Values of the estimate far from the parameter value
specified by H0 give evidence against H0.
Example: bank’s income

The test statistic

In our banking example the null hypothesis is
H0: μ = 0, and a sample gave x̄ = 8.1. The test
statistic for this problem is the standardized version of
x̄:
$$z = \frac{\bar{x} - 0}{\sigma/\sqrt{n}}$$
This statistic is the distance between the sample mean
and the hypothesized population mean in the standard
scale of z-scores.
$$z = \frac{8.1 - 0}{26.4/\sqrt{110}} = 3.22$$
Tests of Significance: Formal details


The test of significance assesses the evidence against the
null hypothesis and provides a numerical summary of this
evidence in terms of probability.
P-value


The probability, computed assuming that H0 is true, that the test
statistic would take a value as extreme or more extreme than that
actually observed is called the P-value of the test. The smaller the
P-value, the stronger the evidence against H0 provided by the data.
To calculate the P-value, we must use the sampling distribution of
the test statistic.
Example: bank’s income

The P-value

In our banking example we found that the test statistic
for testing H0: μ = 0 versus Ha: μ ≠ 0 is
$$z = \frac{8.1 - 0}{26.4/\sqrt{110}} = 3.22$$
If the null hypothesis is true, we expect z to take a value
not far from 0.
Because the alternative is two-sided, values of z far
from 0 in either direction count as evidence against H0.
So the P-value is:
$$P(z \ge 3.22) + P(z \le -3.22) = (1 - 0.9994) + 0.0006 = 0.0012$$
Example: bank’s income


The p-value for the banks’
income example.
The two-sided p-value is
the probability (when H0
is true) that x̄ takes a
value at least as far from 0
as the actually observed
value.
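
The test statistic and P-value for the bank income example can be reproduced as follows. This is a sketch using the standard library's NormalDist, with s = 26.4 treated as σ as the example assumes; the variable names are chosen for this illustration. The software keeps more decimals in z than the table does, so the results agree with the slides only after rounding.

```python
import math
from statistics import NormalDist

x_bar = 8.1    # sample mean percent change
mu_0 = 0.0     # value of mu under H0
sigma = 26.4   # sample standard deviation, used as if it were sigma
n = 110        # number of banks

z = (x_bar - mu_0) / (sigma / math.sqrt(n))

Z = NormalDist()
p_upper_tail = 1 - Z.cdf(z)             # P(Z >= z)
p_two_sided = 2 * (1 - Z.cdf(abs(z)))   # both tails, for Ha: mu != 0

print(f"z = {z:.2f}")
print(f"upper-tail probability = {p_upper_tail:.4f}")
print(f"two-sided P-value      = {p_two_sided:.4f}")
```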
Tests of Significance: Formal details






We know that smaller P-values indicate stronger
evidence against the null hypothesis.
But how strong is strong evidence?
One approach is to announce in advance how
much evidence against H0 we will require to reject
H0.
We compare the P-value with a level that says
“this evidence is strong enough.”
The decisive level is called the significance level.
It is denoted by the Greek letter α.
Tests of Significance: Formal details


If we choose α = 0.05, we are requiring that
the data give evidence against H0 so strong
that it would happen no more than 5%
of the time (1 in 20) when H0 is true.
Statistical significance

If the p-value is as small or smaller than α, we
say that the data are statistically significant at
level α.
Tests of Significance: Formal details


You need not actually find
the p-value to assess
significance at a fixed
level α.
You need only to compare
the observed statistic z
with a critical value that
marks off area α in one or
both tails of the standard
Normal curve.
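
The critical-value comparison can be sketched in a few lines. The helper names critical_value and significant are made up for this illustration, and the one-sided branch assumes an upper-tail alternative (Ha: μ > μ0); a lower-tail alternative would compare z with the negative of the critical value instead.

```python
from statistics import NormalDist

def critical_value(alpha, two_sided=True):
    """z* that marks off area alpha in both tails (two-sided) or the upper tail."""
    tail_area = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail_area)

def significant(z, alpha, two_sided=True):
    """True if the observed z is significant at level alpha."""
    z_star = critical_value(alpha, two_sided)
    return abs(z) >= z_star if two_sided else z >= z_star

print(f"two-sided z* at alpha = 0.05: {critical_value(0.05):.3f}")
print(f"one-sided z* at alpha = 0.05: {critical_value(0.05, two_sided=False):.3f}")
print("z = 3.22, two-sided, alpha = 0.05:", significant(3.22, 0.05))
```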
Test for a Population Mean

There are four steps in carrying out a
significance test:

1. State the hypotheses.
2. Calculate the test statistic.
3. Find the p-value.
4. State your conclusion in the context of your
specific setting.
Test for a Population Mean

Once you have stated your hypotheses and
identified the proper test, you can do steps 2 and 3
by following a recipe. Here is the recipe:

We have an SRS of size n drawn from a normal
population with unknown mean μ. We want to test the
hypothesis that μ has a specified value. Call the
specified value μ0. The null hypothesis is
H0: μ = μ0
Test for a Population Mean

The test is based on the sample mean x̄. Because
Normal calculations require standardized variables, we
will use as our test statistic the standardized sample
mean
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
This one-sample z statistic has the standard Normal
distribution when H0 is true.
The P-value of the test is the probability that z takes a
value at least as extreme as the value for our sample.
What counts as extreme is determined by the alternative
hypothesis Ha.
Example: Blood pressures of executives

The medical director of a large company is concerned
about the effects of stress on the company’s younger
executives. According to the National Center for Health
Statistics, the mean systolic blood pressure for males 35 to
44 years of age is 128 and the standard deviation in this
population is 15. The medical director examines the
records of 72 executives in this age group and finds that
their mean systolic blood pressure is x̄ = 129.93. Is this
evidence that the mean blood pressure for all the
company’s young male executives is higher than the
national average?
Example: Blood pressures of executives

Hypotheses:
H0: μ = 128
Ha: μ > 128

Test statistic:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{129.93 - 128}{15/\sqrt{72}} = 1.09$$

P-value:
$$P = P(z \ge 1.09) = 1 - 0.8621 = 0.1379$$
Example: Blood pressures of executives

Conclusion:

About 14% of the time, an
SRS of size 72 from the
general male population
would have a mean blood
pressure at least as high as that of
the executive sample. The
observed x̄ = 129.93 is not
significantly higher than the
national average.
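
The recipe for the one-sample z test can be wrapped in a small function and applied to the blood pressure example. This is a sketch, not a library routine: the function name one_sample_z_test and the labels "greater", "less", and "two-sided" for the alternative are choices made for this illustration.

```python
import math
from statistics import NormalDist

def one_sample_z_test(x_bar, mu_0, sigma, n, alternative="two-sided"):
    """Return (z, P-value) for H0: mu = mu_0 with known sigma."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))
    Z = NormalDist()
    if alternative == "greater":      # Ha: mu > mu_0
        p_value = 1 - Z.cdf(z)
    elif alternative == "less":       # Ha: mu < mu_0
        p_value = Z.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu_0
        p_value = 2 * (1 - Z.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value

# Blood pressure example: H0: mu = 128 versus Ha: mu > 128
z, p_value = one_sample_z_test(129.93, 128, 15, 72, alternative="greater")
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

The P-value is well above 0.05, which matches the conclusion that the executives' mean is not significantly higher than the national average.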
The t-distribution


Suppose we have a simple random sample of size
n from a Normally distributed population with
mean μ and standard deviation σ.
The standardized sample mean, or one-sample z
statistic
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
has the standard Normal distribution N(0, 1).
When we substitute the estimated standard deviation of the
mean (the standard error) s/√n for σ/√n, the
statistic does not have a Normal distribution.
It has a distribution called the t-distribution.
The t-distribution

Suppose that an SRS of size n is drawn from a N(μ, σ)
population. Then the one-sample t statistic
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
has the t-distribution with n - 1 degrees of freedom.
There is a different t distribution for each sample size.
A particular t distribution is specified by giving the
degrees of freedom.
The t-distribution




We use t(k) to stand for the t
distribution with k degrees of
freedom.
The density curves of the t-distributions are symmetric
about 0 and are bell shaped.
The spread of the t distribution is a
bit greater than that of the standard
Normal distribution.
As the degrees of freedom k
increase, the t(k) density curve
approaches the N(0, 1) curve.
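
These properties can be checked numerically by evaluating the t(k) density for several values of k. The sketch below codes the standard closed-form density of the t distribution using only the standard library; the function name t_pdf is chosen for this illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x, k):
    """Density of the t(k) distribution at x, using the closed-form formula."""
    log_const = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                 - 0.5 * math.log(k * math.pi))
    return math.exp(log_const - (k + 1) / 2 * math.log1p(x * x / k))

normal = NormalDist()

print("  k    density at 0   density at 3")
for k in (2, 10, 30, 1000):
    print(f"{k:5d}   {t_pdf(0, k):.4f}         {t_pdf(3, k):.4f}")
print(f"N(0,1)  {normal.pdf(0):.4f}         {normal.pdf(3):.4f}")
```

For small k the curve is lower at the center and higher in the tails than the N(0, 1) curve, which is the greater spread described above; as k grows the two sets of values become nearly identical.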
The One-Sample t Confidence Interval

Suppose that an SRS of size n is drawn from a
population having unknown mean μ. A level C
confidence interval for μ is
$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$
where t* is the value for the t(n - 1) density curve with
area C between -t* and t*. The margin of error is
$$t^* \frac{s}{\sqrt{n}}$$
This interval is exact when the population distribution
is Normal and is approximately correct for large n in
other cases.
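
A sketch of the one-sample t interval follows. The standard library has no t table, so this illustration finds t* numerically: it integrates the t density with Simpson's rule and searches for the critical value by bisection. In practice a statistics package or a printed t table would normally supply t*. The function names are made up for this example, and the numbers fed in at the end reuse the LTDR summary figures (x̄ = 76.7, s = 12.3, n = 110) purely as an illustration; the slides themselves do not work this case.

```python
import math

def t_pdf(x, k):
    """Density of the t(k) distribution at x."""
    log_const = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                 - 0.5 * math.log(k * math.pi))
    return math.exp(log_const - (k + 1) / 2 * math.log1p(x * x / k))

def t_cdf(x, k, steps=2000):
    """P(T <= x) for T ~ t(k): Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, k) + t_pdf(a, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, k)
    half_area = total * h / 3
    return 0.5 + half_area if x >= 0 else 0.5 - half_area

def t_critical(C, k):
    """t* with area C between -t* and t* under the t(k) curve, by bisection."""
    target = (1 + C) / 2
    hi = 1.0
    while t_cdf(hi, k) < target:  # widen the bracket until it contains t*
        hi *= 2
    lo = 0.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, k) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_sample_t_interval(x_bar, s, n, C=0.95):
    """Level C confidence interval x_bar +/- t* s / sqrt(n), with n - 1 df."""
    margin = t_critical(C, n - 1) * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

t_star = t_critical(0.95, 109)
lower, upper = one_sample_t_interval(76.7, 12.3, 110)
print(f"t* for 109 degrees of freedom: {t_star:.3f}")
print(f"95% t interval: ({lower:.2f}, {upper:.2f})")
```

With 109 degrees of freedom t* is only slightly larger than the Normal critical value 1.96, so the t interval is just a little wider than the corresponding z interval.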