Slides for Session #14


Statistics for Social
and Behavioral Sciences
Session #14:
Estimation, Confidence Interval
(Agresti and Finlay, Chapter 5)
Prof. Amine Ouazad
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN
Week 1
PART II. DESCRIBING DATA
Weeks 2-4
PART III. DRAWING CONCLUSIONS FROM DATA:
INFERENTIAL STATISTICS
Weeks 5-9
(Firenze or Lebanese Express now)
PART IV. CORRELATION AND CAUSATION:
REGRESSION ANALYSIS
Weeks 10-14
(This is where we talk about Zmapp and Ebola!)
Last 2 Sessions
• A statistic is a random variable.
• The distribution of a statistic is called its sampling
distribution.
• In particular the mean of a variable in a sample is a
statistic.
• The expected value of the sample mean is equal to
the true mean.
• The standard deviation of the sample mean is called
the standard error.
• Central Limit theorem: with a large sample size, the
sampling distribution of the mean of X is normal, and
the empirical rule applies.
The standard error is σX / √N.
Last 2 Sessions
• For a proportion (X is 0,1): σX = √( p (1-p) ). As we
typically do not observe the true proportion p, but
only the sample proportion, we use the sample
proportion in place of p.
• For other variables (X is not 0,1): As we do not
observe the true standard deviation σX but rather
the sample standard deviation sX, we approximate
σX by sX and thus approximate the standard error
by sX / √N.
• We are interested in estimating parameters, but
we only observe statistics. Can we use statistics as
estimators?
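A minimal Python sketch of the two standard-error formulas recapped above. The function names and the example numbers (a sample proportion of 0.4 with N = 100, a sample standard deviation of 0.8 with N = 64) are illustrative assumptions, not course data.

```python
import math

def standard_error_proportion(p_hat: float, n: int) -> float:
    """Standard error of a sample proportion, using p_hat in place of the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)

def standard_error_mean(sample_sd: float, n: int) -> float:
    """Standard error of a sample mean, using the sample standard deviation
    in place of the unknown population standard deviation."""
    return sample_sd / math.sqrt(n)

# Hypothetical numbers, for illustration only.
se_prop = standard_error_proportion(0.4, 100)
se_mean = standard_error_mean(0.8, 64)
print(se_prop, se_mean)
```

In both cases the only change from the "true" formula is that a sample quantity is plugged in where the population quantity is unknown.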
Outline
1. Back to Zomato
Just applying the formulas we know
2. Estimators:
Point Estimator
Biased vs Unbiased Estimators
Efficient vs Inefficient Estimators
Interval Estimator
Next time:
Estimation, Confidence Intervals (continued)
Chapter 5 of A&F
Back to Zomato
1. What statistical issue would preclude us from using the
Central Limit Theorem?
2. Assuming we can use the CLT, what is the Margin of Error
on Cafe Firenze and Lebanese Express’s ratings? Think !!
• Questions:
1. When rating a restaurant, what are the possible choices
for the user?
2. What is 3.4 on this rating?
3. What are we trying to estimate?
4. What is the formula for the standard error of ratings?
• Is a rating X a 0,1 variable?
5. What is the standard deviation σX of ratings?
6. Finally what is the standard error of the rating 3.4?
7. And what is the margin of error for the rating 3.4?
(MoE = twice the standard error)
Recap: Central Limit Theorem
• Central Limit Theorem: with large sample size,
the distribution of the sample mean is normal,
with mean the true mean and with standard
deviation (= standard error) equal to σX / √N.
• X is not 0,1 (Café Firenze’s case): Approximate the true
standard deviation σX using the sample standard deviation
sX.
• X is 0,1: Approximate σX = √( p (1-p) ), where p is
the true proportion, using the sample proportion
for p.
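A small check of the "mean" and "standard deviation" parts of this recap, sketched in Python. The population (ratings 1 to 5, each equally likely) and the sample size of 4 are hypothetical, chosen small enough that every possible sample can be listed. The sketch does not test the normal shape, only that the sample mean is centered on the true mean and that its standard deviation equals the population standard deviation divided by √N.

```python
import math
from itertools import product
from statistics import mean, pstdev

population = [1, 2, 3, 4, 5]  # hypothetical: each rating equally likely
n = 4                         # hypothetical sample size

# Population standard deviation (the population is fully known here).
sigma = pstdev(population)

# The sample mean of every equally likely sample of size n (with replacement).
sample_means = [sum(s) / n for s in product(population, repeat=n)]

center_of_sample_means = mean(sample_means)
sd_of_sample_means = pstdev(sample_means)   # the standard error
formula_value = sigma / math.sqrt(n)

print(center_of_sample_means, sd_of_sample_means, formula_value)
```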
Back to Zomato
• If we had all the ratings of individual users:
– John: 3, “Hated it, service is poor”
– Abdullah: 4, “Great venue”
– Anthony: 5, “Perfect, loved the al dente pasta”
– Claire: 3, “Ok for a downtown lunch”
– Al Bloom: 3, “The italian restaurant of the world”
– John Sexton: 3, “Can achieve more”
– Ayesha: 3, “There are alternatives”
• The average is 3.4, and we would find sX=…………….
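If the individual ratings were available, the computation could be sketched as below in Python. The variable names are illustrative. With only seven ratings the Central Limit Theorem would not really apply, so this only shows the mechanics; both denominators (N and N-1) are computed.

```python
import math
import statistics

ratings = [3, 4, 5, 3, 3, 3, 3]  # the seven ratings listed on the slide

mean_rating = statistics.mean(ratings)

sd_div_n = statistics.pstdev(ratings)          # denominator N
sd_div_n_minus_1 = statistics.stdev(ratings)   # denominator N - 1

# Standard error and margin of error, using the N - 1 version here.
standard_error = sd_div_n_minus_1 / math.sqrt(len(ratings))
margin_of_error = 2 * standard_error           # MoE = twice the standard error

print(round(mean_rating, 1), sd_div_n, sd_div_n_minus_1, margin_of_error)
```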
Zomato Problemo
• The website only reports the sample mean of
ratings…
• We thus have to figure out a conservative value of σX (the
largest possible).
• What is the highest possible σX?
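One possible way to get such a conservative value, sketched under assumptions that are not on the slide: if every rating lies between a lowest and a highest value (assumed here to be 1 and 5), then for a given mean the standard deviation is largest when all ratings sit at the two extremes, which gives at most √((mean − lowest)(highest − mean)). The function name and the 1-to-5 scale are hypothetical.

```python
import math

def max_sd_given_mean(mean: float, lowest: float, highest: float) -> float:
    """Largest possible standard deviation of a variable bounded in
    [lowest, highest] with the given mean (all values at the two extremes)."""
    if not lowest <= mean <= highest:
        raise ValueError("mean must lie between lowest and highest")
    return math.sqrt((mean - lowest) * (highest - mean))

# Assumed 1-to-5 rating scale; 3.4 is the average reported on the slide.
conservative_sd = max_sd_given_mean(3.4, 1, 5)
print(conservative_sd)
```

Turning this into a standard error still requires the number of ratings N, which has to be read from the website.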
Outline
1. Back to Zomato
Just applying the formulas we know
2. Estimators:
Point Estimate
Biased vs Unbiased Estimators
Efficient vs Inefficient Estimators
Interval Estimate
Next time:
Estimation, Confidence Intervals (continued)
Chapter 5 of A&F
Parameters and their point estimates
Parameters (« True » values) → Point Estimate
• Population mean μ → Sample mean m
– Example: Population mean rating of Cafe Firenze → Sample mean rating of Cafe Firenze
• Population median → Sample median
• Population standard deviation σX → Sample standard deviation sX
– Example: Population standard deviation of ratings of Cafe Firenze → Sample standard deviation of ratings of Cafe Firenze
• Population variance σX² → Sample variance sX²
• Population p-th percentile → Sample p-th percentile
• This is called a “point estimate” because we give a single number (a “point” on the axis).
Biased vs Unbiased Estimator
• We have seen that to get the standard error of
the sample mean, we need to have an estimate
of σX.
• So far we have used:
\[ \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
• And the textbook has given:
\[ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
• These are two different estimators of the same
quantity, the variance σX² (whose square root is σX).
• The textbook’s estimator of σX² is unbiased.
These two formulas are
“point estimates”.
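A way to see the bias without any algebra, sketched in Python. The population (ratings 1 to 5, each equally likely) and the sample size of 3 are hypothetical, chosen so that every possible sample can be enumerated. The sketch averages each formula (average squared deviation with denominator N, and with denominator N-1) over all samples and compares the result with the true population variance.

```python
from itertools import product
from statistics import mean

population = [1, 2, 3, 4, 5]  # hypothetical: each rating equally likely
n = 3                         # hypothetical sample size, tiny on purpose

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

# True population variance (the population is fully known here).
mu = mean(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

expected_div_n = mean(sum_sq_dev(s) / n for s in samples)
expected_div_n_minus_1 = mean(sum_sq_dev(s) / (n - 1) for s in samples)

print(true_variance, expected_div_n, expected_div_n_minus_1)
```

On average the N-1 version hits the true variance, while the N version falls short by a factor (N-1)/N.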
Efficient vs Inefficient Estimator
• Among all possible estimators, an estimator is
efficient if it has the smallest standard error.
• The standard error of
\[ \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
is smaller than the standard error of
\[ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
• The slides’ version is efficient, while the
textbook’s version is unbiased. There is a
conundrum.
These two formulas are
“point estimates”.
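The comparison of standard errors can be checked on a toy case, sketched in Python. The population (ratings 1 to 5, each equally likely) and the sample size of 3 are hypothetical. The sketch computes each formula on every possible sample, then takes the standard deviation of the results, which is the standard error of each estimator.

```python
from itertools import product
from statistics import pstdev

population = [1, 2, 3, 4, 5]  # hypothetical: each rating equally likely
n = 3                         # hypothetical sample size, tiny on purpose

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

# Value of each estimator on every equally likely sample (with replacement).
samples = list(product(population, repeat=n))
values_div_n = [sum_sq_dev(s) / n for s in samples]
values_div_n_minus_1 = [sum_sq_dev(s) / (n - 1) for s in samples]

# Spread of each estimator across samples = its standard error.
se_div_n = pstdev(values_div_n)
se_div_n_minus_1 = pstdev(values_div_n_minus_1)

print(se_div_n, se_div_n_minus_1)
```

The N version is just the N-1 version multiplied by (N-1)/N, so its spread is smaller by the same factor.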
What do you actually need to remember?
• “Good” estimators are unbiased and efficient.
– The sample mean is an unbiased and efficient
estimator of the population mean.
• “Less good” estimators may be either unbiased
or efficient.
– The sample variance with denominator N-1 (the square of the
sample standard deviation) is unbiased but inefficient.
– The sample variance with denominator N is
biased but efficient.
– We keep using the formula we learnt…
Parameters and Interval Estimate
• An interval estimate is an interval of numbers
around the point estimate, which includes the
parameter with probability either 90%, 95%,
or 99%.
• Example:
“the interval estimate
[156.2 cm – 0.49cm ; 156.2 cm + 0.49cm]
includes the population average height with
probability 95%.”
Parameters and Interval Estimate
• An interval estimate that includes the parameter
with probability 95% is called a 95% confidence
interval.
• The expression “95% confidence interval” is widely
used.
• Example:
“[156.2 cm – 0.49cm ; 156.2 cm + 0.49cm]
is a 95% confidence interval for the population
average height.”
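The height example can be written out as a short Python sketch. The function name is illustrative; the numbers 156.2 and 0.49 are the ones in the example, and the implied standard error assumes the margin of error was computed as 1.96 times the standard error.

```python
def interval_estimate(point_estimate: float, margin_of_error: float):
    """Return (lower, upper) = point estimate minus/plus the margin of error."""
    return point_estimate - margin_of_error, point_estimate + margin_of_error

lower, upper = interval_estimate(156.2, 0.49)
print(f"[{lower:.2f} cm ; {upper:.2f} cm]")  # prints "[155.71 cm ; 156.69 cm]"

# Standard error implied by this margin of error, assuming MoE = 1.96 x SE.
implied_standard_error = 0.49 / 1.96
```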
How do we build a
95% confidence interval?
• Goal: estimate the population average μ.
• From previous session:
[μ – MoE ; μ + MoE] includes the sample mean m with
probability 95%.
• The sample mean m is within MoE of μ exactly when
μ is within MoE of m.
• We conclude: the interval
[m – MoE ; m + MoE] includes the population mean μ with
probability 95%.
[m – MoE ; m + MoE] is a 95% confidence interval for μ.
MoE = 1.96 × Standard Error
Standard Error = σX / √N
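What "with probability 95%" means can be illustrated by repeating the recipe many times, sketched here in Python. Everything in the simulation is hypothetical: the population (ratings 1 to 5, each equally likely, so the population mean is 3), the sample size of 100, the 2000 repetitions and the seed. The sample standard deviation is used in place of the unknown population standard deviation.

```python
import math
import random
import statistics

def confidence_interval_95(sample):
    """Sample mean minus/plus MoE, with MoE = 1.96 x (sample sd / sqrt(N))."""
    n = len(sample)
    m = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    moe = 1.96 * standard_error
    return m - moe, m + moe

rng = random.Random(14)              # fixed seed so the run is repeatable
population_values = [1, 2, 3, 4, 5]  # hypothetical ratings, equally likely
population_mean = 3.0
n, repetitions = 100, 2000

covered = 0
for _ in range(repetitions):
    sample = [rng.choice(population_values) for _ in range(n)]
    lower, upper = confidence_interval_95(sample)
    if lower <= population_mean <= upper:
        covered += 1

coverage = covered / repetitions
print(coverage)  # share of intervals that contain the population mean
```

The share of intervals containing the population mean should land near 0.95: the interval changes from sample to sample, the population mean does not.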
Wrap up
• Central Limit theorem: with a large sample size, the
sampling distribution of the sample mean of X is normal,
and the empirical rule applies.
The standard error, σX / √N, is the standard deviation of the
sampling distribution.
• For a proportion: σX = √( p (1-p) ). As we typically do not
observe the true proportion p but only the sample proportion,
we use the sample proportion in place of p.
• For other variables: As we do not observe the true standard
deviation σX but rather the sample standard deviation sX,
we approximate the standard error by sX / √N.
• We are interested in estimating parameters, but we only
observe statistics. Can we use statistics as estimators?
Estimators can be unbiased, and efficient.
Coming up:
Readings:
• This week and next week:
– Chapter 5 entirely – estimation, confidence intervals.
– Understand the confidence interval, the point estimate.
• Online quiz on Thursday.
• Deadlines are sharp and attendance is recorded.
• Tonight is the midterm election!!
• Watch: http://www.msnbc.com/jose-diaz-balart/watch/is-2014-the-margin-of-error-midterms-349919811638
For help:
• Amine Ouazad
Office 1135, Social Science building
[email protected]
Office hour: Tuesday from 5 to 6.30pm.
• GAF: Irene Paneda
[email protected]
Sunday recitations.
At the Academic Resource Center, Monday from 2 to 4pm.