Transcript Lab_2x
Advanced Quantitative Techniques
Lab 2:
Normality, Graphing Distributions,
Confidence Intervals
Normal distribution
What are the Characteristics of a
Normal Distribution?
• Unimodal
• Bell shaped
• Symmetric
• Mean = Mode = Median
• Skewness = 0
• Kurtosis = 3
• 68 – 95 – 99.7 rule
If a population has a Normal distribution:
68.3% of the data lie within 1 standard deviation of the mean
95.4% of the data lie within 2 standard deviations of the mean
99.7% of the data lie within 3 standard deviations of the mean
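The percentages in the 68 – 95 – 99.7 rule can be checked numerically. The sketch below is in Python rather than Stata and uses only the standard library; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

def share_within(k: float) -> float:
    """Share of a Normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard Normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
# within 1 sd: 68.3%
# within 2 sd: 95.4%
# within 3 sd: 99.7%
```

The result does not depend on the mean or standard deviation chosen, because the rule is stated in units of standard deviations.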
More about Normal distribution
• The probability of an event is the area under the density curve over the corresponding range of values.
• Total area under the curve = 1 (collectively exhaustive)
• Normal distributions are an idealized description of data
• The curve never touches the x-axis: the tails approach it but extend
indefinitely, so any finite range captures slightly less than the full area of 1.
Is the population normally distributed?
use calls_311.dta
histogram POP2010, width(600) frequency normal
[Figure: histogram of POP2010 with Normal curve overlay; x-axis: POP2010 (0 to 30,000), y-axis: Frequency (0 to 300)]
Is the population normally distributed?
sum POP2010, detail
Variance vs. Standard Deviation
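Variance ($\sigma^2$): average of squared differences from the mean
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
Standard Deviation ($\sigma$): square root of the variance
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$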
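A small worked example may make the two definitions concrete. This is a Python sketch with made-up data, using the "average of squared differences" (divide by n) form of the variance. Statistical packages usually report the sample version, which divides by n − 1, so their output will differ slightly.

```python
from math import sqrt

def population_variance(xs):
    """Average of squared differences from the mean (divides by n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def population_sd(xs):
    """Square root of the variance."""
    return sqrt(population_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, mean = 5
print(population_variance(data), population_sd(data))  # 4.0 2.0
```

The standard deviation is in the same units as the data, which is why it is usually the one reported.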
Skewness
Skewness is a measure of symmetry
Where is the tail?
• Tail to the right: Mean > Median; STATA: Skewness > 0
• Symmetric: Mean = Median; STATA: Skewness = 0
• Tail to the left: Mean < Median; STATA: Skewness < 0
Kurtosis
• Kurtosis is a measure of whether the data are
peaked or flat relative to a normal
distribution.
– Peaked: Kurtosis > 3
– Normal: Kurtosis = 3
– Flat: Kurtosis < 3
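Skewness and kurtosis can be computed from the central moments of the data. The Python sketch below uses the moment-ratio definitions, under which a Normal distribution has skewness 0 and kurtosis 3. It is assumed here, not confirmed by the slides, that this matches what `sum, detail` reports. The data are made up.

```python
import random
from statistics import mean, median

def central_moment(xs, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

right_tailed = [1, 1, 2, 2, 3, 10]          # made-up data with a long right tail
rng = random.Random(0)                       # fixed seed for reproducibility
normal_draws = [rng.gauss(0, 1) for _ in range(20_000)]

print(mean(right_tailed) > median(right_tailed), skewness(right_tailed) > 0)
print(round(skewness(normal_draws), 2), round(kurtosis(normal_draws), 2))
```

For the simulated Normal draws the two values should land near 0 and 3, with some sampling noise.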
Example of Normal distribution
• use Lab_2_Data.16.dta
• histogram bwt, width(400) frequency normal
Example of Normal distribution
• sum bwt, detail
Sampling
• Population – a group that includes all the
cases (individuals, objects, or groups) in which
the researcher is interested.
• Sample – a relatively small subset from a
population.
Sampling
• Random sample
• Stratified sample: divide the population into
groups and draw a random sample from each
group
• Cluster sample: group the population into
small clusters, draw a simple random sample
of clusters, and sample everything in the
chosen clusters
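The three sampling schemes can be sketched in a few lines of Python. The population, the `stratum` and `cluster` fields, and the function names are all hypothetical and exist only for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the draws are reproducible

# Hypothetical population: 1,000 units, 4 strata, 20 clusters of 50 units each.
population = [{"id": i, "stratum": i % 4, "cluster": i // 50} for i in range(1000)]

def simple_random_sample(units, n):
    """Every unit has the same chance of being chosen."""
    return rng.sample(units, n)

def stratified_sample(units, n_per_stratum):
    """Divide into groups (strata), then draw a random sample from each group."""
    strata = {}
    for u in units:
        strata.setdefault(u["stratum"], []).append(u)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, n_clusters):
    """Randomly choose whole clusters, then keep every unit in them."""
    cluster_ids = sorted({u["cluster"] for u in units})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [u for u in units if u["cluster"] in chosen]

print(len(simple_random_sample(population, 40)),
      len(stratified_sample(population, 10)),
      len(cluster_sample(population, 3)))
```

Stratified sampling guarantees that every group is represented. Cluster sampling is cheaper to carry out but includes only the clusters that happened to be drawn.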
Sampling
• Parameter – A measure used to describe a
population distribution.
• Statistic – A measure used to describe a
sample distribution.
• Estimation – A process whereby we select a
random sample from a population and use a
sample statistic to estimate a population
parameter.
Inference
Inferential Statistics
• We generally don’t know anything about the
population distribution
• We have a sample of data from the population
• We assume that the average/mean is the most
appropriate description of the population (we no longer
use the median because we assume a normal distribution)
• The sample must be random and representative
(“large enough”)
Inferential Statistics
What can we infer about the
population based on a sample?
• From now on, we’re estimating the population mean
(μ) with the sample mean (x̄).
• We are no longer talking about individual behavior;
we’re talking about average behavior
Distribution of Means
• Take a random sample over, and over, and over
again (random means each data point has an
equal chance of being chosen).
• You get many sample means
x̄₁, x̄₂, x̄₃, x̄₄, x̄₅, ...
• Plot the sampling distribution of these means: you
get a distribution of averages (not raw data points!)
Distribution of Means
• Sampling Distribution of Means: Frequency distribution
(histogram) of the sample means, not of the data themselves.
[Figure: distribution of all possible sample means; x-axis: x̄, y-axis: Freq]
**This is not the distribution of x**
• If we repeatedly draw random samples that are large enough, the
distribution of the sample averages (not the population
data!) is approximately a bell curve (normal distribution). This is the
Central Limit Theorem.
• This is the case regardless of what the population distribution
looks like.
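A short simulation can illustrate the sampling distribution of means. In this Python sketch the population is deliberately skewed (exponential), and the population size, sample size, and number of samples are arbitrary choices for the illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical, strongly right-skewed population (not bell shaped).
population = [rng.expovariate(1.0) for _ in range(50_000)]

def sample_means(pop, sample_size, n_samples):
    """Draw many random samples and return the mean of each one."""
    return [mean(rng.sample(pop, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=100, n_samples=2_000)

# The sample means cluster around the population mean ...
print(round(mean(population), 3), round(mean(means), 3))
# ... and their spread is close to (population sd) / sqrt(sample size).
print(round(stdev(population) / 100 ** 0.5, 3), round(stdev(means), 3))
```

A histogram of `means` would look roughly bell shaped even though a histogram of `population` would not.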
Confidence Intervals
• The goal of calculating confidence intervals is
to determine how sure we are that the true
population mean, μ, is approximated by the
sample mean x̄.
Confidence Intervals
• Confidence Level – The likelihood, expressed as a
percentage or a probability, that a specified
interval will contain the population parameter.
– 95% confidence level – there is a .95 probability that
a specified interval DOES contain the population mean.
– 99% confidence level – there is 1 chance out of 100
that the interval DOES NOT contain the population
mean.
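One way to read a confidence level is as a long-run success rate: if the sampling is repeated many times, about 95% of the 95% intervals should contain the population mean. The Python sketch below simulates this with a made-up Normal population whose true mean is known. It uses a Normal critical value, a large-sample approximation that is assumed here.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(1)              # fixed seed for reproducibility
TRUE_MEAN, TRUE_SD = 10.0, 2.0      # hypothetical population parameters
Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96

def interval(sample, z_crit):
    """Sample mean plus or minus z_crit standard errors."""
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5
    return m - z_crit * se, m + z_crit * se

def coverage(n_samples=2_000, sample_size=50):
    """Share of simulated intervals that contain the true mean."""
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
        lo, hi = interval(sample, Z_95)
        hits += lo <= TRUE_MEAN <= hi
    return hits / n_samples

print(coverage())
```

The printed share should be close to 0.95; a 99% level would raise it at the cost of wider intervals.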
STATA: ci Command
• Open Stata and calls_311.dta
ci means calls_per_thousand, level(90)

. ci calls_per_thousand, level(90)

    Variable |    Obs        Mean    Std. Err.    [90% Conf. Interval]
calls_per_~d |   2168    1.534331    .0335816     1.479071    1.589592

• level(90): confidence level (significance level = 10%)
• Obs: sample size
• Mean: sample mean
• Std. Err.: standard error = $s / \sqrt{n}$
• [90% Conf. Interval]: lower bound and upper bound of the CI
Build a 95% CI for 311 calls per thousand people.
The default confidence level for the ci command in Stata is 95%.

. ci calls_per_thousand, level(90)

    Variable |    Obs        Mean    Std. Err.    [90% Conf. Interval]
calls_per_~d |   2168    1.534331    .0335816     1.479071    1.589592

. ci calls_per_thousand

    Variable |    Obs        Mean    Std. Err.    [95% Conf. Interval]
calls_per_~d |   2168    1.534331    .0335816     1.468476    1.600187

. ci calls_per_thousand, level(99)

    Variable |    Obs        Mean    Std. Err.    [99% Conf. Interval]
calls_per_~d |   2168    1.534331    .0335816     1.447755    1.620908

Precise: the 90% interval is the narrowest.
Confident: the 99% interval is the widest.
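The intervals on this slide can be approximately reproduced from the reported mean and standard error alone. The Python sketch below uses a Normal critical value rather than the t distribution. With 2,168 observations the two are nearly identical, so the bounds should agree with the Stata output to about four decimal places. The function name `normal_ci` is made up for this illustration.

```python
from statistics import NormalDist

def normal_ci(mean, std_err, level):
    """Two-sided CI: mean plus or minus z * standard error (Normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. level 0.95 -> z about 1.96
    return mean - z * std_err, mean + z * std_err

MEAN, STD_ERR = 1.534331, 0.0335816  # values reported by Stata for calls_per_thousand

for level in (0.90, 0.95, 0.99):
    lo, hi = normal_ci(MEAN, STD_ERR, level)
    print(f"{level:.0%} CI: [{lo:.4f}, {hi:.4f}]  width = {hi - lo:.4f}")
```

The width grows with the confidence level, which is the trade-off between being precise and being confident.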
Build a CI for Bronx calls per 1,000 people that leaves a 10%
chance of overestimation error.
(A 10% chance of error in one tail means 20% across both tails, hence an 80% CI.)
ci means calls_per_thousand if county=="005", level(80)

. ci calls_per_thousand if county=="005", level(80)

    Variable |    Obs        Mean    Std. Err.    [80% Conf. Interval]
calls_per_~d |    339     1.66264    .0937592     1.542247    1.783032
Build a CI for Manhattan calls per 1,000 people that leaves a 20%
chance that the population mean is not captured by the
interval.
ci means calls_per_thousand if county=="061", level(80)

. ci calls_per_thousand if county=="061", level(80)

    Variable |    Obs        Mean    Std. Err.    [80% Conf. Interval]
calls_per_~d |    288    2.018764    .1377721     1.841794    2.195733
Are they significantly different?
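A first check is whether the two 80% intervals overlap. A second, more direct check is a z statistic for the difference in means. The Python sketch below does both, using the Bronx and Manhattan figures from the Stata output. The z statistic is not the method shown on the slides; it is one common approach, and it assumes independent samples and a Normal approximation.

```python
from math import sqrt
from statistics import NormalDist

# Means, standard errors, and 80% intervals as reported by Stata.
bronx = {"mean": 1.66264, "se": 0.0937592, "ci80": (1.542247, 1.783032)}
manhattan = {"mean": 2.018764, "se": 0.1377721, "ci80": (1.841794, 2.195733)}

def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

def z_for_difference(g1, g2):
    """Difference in means divided by the standard error of that difference."""
    se_diff = sqrt(g1["se"] ** 2 + g2["se"] ** 2)
    return (g2["mean"] - g1["mean"]) / se_diff

overlap = intervals_overlap(bronx["ci80"], manhattan["ci80"])
z = z_for_difference(bronx, manhattan)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(overlap, round(z, 2), round(p_two_sided, 3))
```

Intervals that do not overlap point to a difference at that confidence level. The reverse does not hold: overlapping intervals do not by themselves show that two means are equal.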
Confidence intervals in a Normal
distribution