Lecture 5 Outline - Wharton Statistics Department
Lecture 5 Outline – Tues., Jan. 27
• Miscellanea from Lecture 4
• Case Study 2.1.2
• Chapter 2.2
– Probability model for random sampling (see
also chapter 1.4.1)
– Sampling distribution of sample mean
– t-test
– Confidence intervals
Miscellanea from Lecture 4
• Definition of medians/quartiles
– The median is the midpoint of a distribution, the
number such that half the observations are smaller and
half are larger.
• Computing the median: Order all observations from
smallest to largest. If n is odd, the median is the center of
the ordered observations; if n is even, the median is
the mean of the two center observations.
– pth percentile of a distribution: value such that p
percent of the observations fall at or below it. Exact
computation in JMP: order the observations from top to
bottom and count up the required percent of observations
from the bottom of the list. If the pth percentile falls between
two observations, JMP takes a weighted average,
(1-w)*lower observation + w*higher observation, where w is
the fractional position between the two observations.
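The median and percentile rules above can be sketched in a few lines of Python. This is only an illustration: the position rule rank = (p/100)*(n+1) is an assumption about how the "count up from the bottom" step is done, and JMP's exact rule may differ. The data values are hypothetical.

```python
import math

def median(values):
    """Middle ordered value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def percentile(values, p):
    """pth percentile by linear interpolation between neighbouring observations.

    Assumption: position is rank = (p / 100) * (n + 1), counted from the
    smallest observation. This is an illustrative rule, not confirmed JMP behaviour.
    """
    ordered = sorted(values)
    n = len(ordered)
    rank = (p / 100) * (n + 1)
    lower_index = math.floor(rank)   # 1-based position of the lower observation
    w = rank - lower_index           # fractional distance toward the higher one
    if lower_index < 1:
        return ordered[0]            # below the smallest observation
    if lower_index >= n:
        return ordered[-1]           # above the largest observation
    return (1 - w) * ordered[lower_index - 1] + w * ordered[lower_index]

data = [7, 1, 3, 9]                  # hypothetical observations
print(median(data))                  # mean of the two center values, 3 and 7
print(percentile(data, 25))          # first quartile
print(percentile(data, 75))          # third quartile
```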
Miscellanea from Lecture 4
• Definition of median/quartiles continued:
– First quartile is 25th percentile. Third quartile is
75th percentile.
• Long-tailed vs. short-tailed distributions:
Loosely, a long-tailed distribution has a tail
that dies out slower than the normal
distribution. A short-tailed distribution has
a tail that dies out faster than the normal
distribution. See figure at end of notes.
Case Study 2.1.2
• Broad Question: Are any physiological indicators
associated with schizophrenia? Early studies
suggested certain areas of brain may be different
in persons with schizophrenia than in others but
confounding factors clouded the issue.
• Specific Question: Is the left hippocampus region
of brain smaller in people with schizophrenia?
• Research design: Sample pairs of monozygotic
twins, where one of twins was schizophrenic and
other was not. Comparing monozygotic twins controls
for genetic and socioeconomic differences.
Case Study 2.1.2 Cont.
• The mean difference (unaffected-affected) in volume of
left hippocampus region between 15 pairs is 0.199 cm³. Is this
larger than could be explained by “chance”?
• Probability (chance) model: Random sampling (fictitious)
from a single population.
• Scope of inference
– Goal is to make inference about population mean but
inference is questionable because we did not take a
random sample.
– No causal inference can be made. In fact researchers
had no theories about whether abnormalities preceded
the disease or resulted from it.
Probability Model
• Goal is to compare two groups (affecteds and
unaffecteds) but we have taken a paired sample.
We can think of having one population (pairs of
twins) and looking at the mean of one variable, the
difference in hippocampus volumes in each pair.
• Probability model: Simple random sample with
replacement from population. When the
population size is more than 50 times the sample
size, simple random sampling with replacement
and simple random sampling without replacement
are essentially equivalent.
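One way to see why the two sampling schemes are nearly equivalent is the standard finite population correction: sampling without replacement multiplies the standard deviation of the sample mean by sqrt((N-n)/(N-1)). The sketch below evaluates that factor when the population is exactly 50 times the sample size; the function name is made up for illustration.

```python
import math

def finite_population_correction(population_size, sample_size):
    """Ratio SD(mean, without replacement) / SD(mean, with replacement)."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

n = 15          # sample size in the case study
N = 50 * n      # population exactly 50 times the sample size
fpc = finite_population_correction(N, n)
print(fpc)      # close to 1, so the two schemes give nearly the same spread
```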
Review of Terminology
• Population: Collection of all items of interest to researcher,
e.g., heights of members of this class, U.S. adults’
incomes, lifetimes of a new brand of tires.
• Statistic (random variable): Any quantity that can be
calculated from the data.
• Probability (sampling) distribution of statistic: the
proportion of times that a statistic will take on each
possible value in repeated trials of the data collection
process (randomized experiment or random sample).
• Population distribution: The probability distribution of a
randomly chosen observation from the population.
• Parameter: Describes feature of population distribution
(e.g., mean or standard deviation)
Parameters and Statistics
• Population parameters ($\mu$, $\sigma$)
– $\mu$ = population mean
– $\sigma^2$ = population variance = average size of
$(Y - \mu)^2$ in population
• Hypotheses: $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Sample statistics ($\bar{Y}$, $s$)
– Sample: $Y_1, \ldots, Y_n$
– $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ = sample mean
– $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = sample variance
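The two sample statistics can be computed directly from their definitions. The sketch below uses a small hypothetical sample and cross-checks the hand-written formulas against Python's standard `statistics` module, whose `variance` also divides by n-1.

```python
import statistics

def sample_mean(sample):
    """Y-bar: sum of the observations divided by n."""
    return sum(sample) / len(sample)

def sample_variance(sample):
    """s^2: sum of squared deviations from Y-bar divided by n - 1."""
    ybar = sample_mean(sample)
    return sum((y - ybar) ** 2 for y in sample) / (len(sample) - 1)

sample = [1.0, 2.0, 4.0, 7.0]        # hypothetical observations
ybar = sample_mean(sample)
s2 = sample_variance(sample)
print(ybar, s2)
print(statistics.mean(sample), statistics.variance(sample))  # same values
```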
Continuous Distributions
• A continuous random variable can take values
with any number of decimals like 1.2361248912.
• The probability of a continuous r.v. taking on an
exact value is 0. But there is a nonzero chance
that continuous r.v. will take on a value in an
interval.
• Density function defines probability for
continuous r.v. The probability that a r.v. takes on
values between 3.9 and 6.2 is the area under the
density function between 3.9 and 6.2. Total area
under density function is 1.
• Example of continuous r.v.: height.
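To make "probability = area under the density" concrete, the sketch below adds up thin rectangles under a density between 3.9 and 6.2. The density is an assumption chosen for illustration (a normal curve with mean 5 and standard deviation 1.5); the slides do not specify one.

```python
from statistics import NormalDist

# Hypothetical density, assumed only for this illustration.
density = NormalDist(mu=5.0, sigma=1.5)

def area_under_density(pdf, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under pdf between a and b."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

prob = area_under_density(density.pdf, 3.9, 6.2)     # P(3.9 < Y < 6.2)
total = area_under_density(density.pdf, -10.0, 20.0) # essentially the whole curve
print(prob)
print(total)                                         # close to 1
```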
Normal probability distribution
• The normal probability distributions are a family
of density functions for a continuous r.v. that are
“bell-shaped.”
• The normal probability distribution has two
parameters, mean and standard dev.
• The probability that a normal r.v. will be within
one s.d. of its mean is about 68%. The probability
that a normal r.v. will be within two s.d.’s of its
mean is about 95%.
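The 68% and 95% figures can be checked with the standard normal distribution from Python's `statistics` module. Because the rule is stated in units of standard deviations, the same probabilities hold for any mean and standard deviation.

```python
from statistics import NormalDist

z = NormalDist()                        # standard normal: mean 0, s.d. 1
within_one = z.cdf(1) - z.cdf(-1)       # P(within one s.d. of the mean)
within_two = z.cdf(2) - z.cdf(-2)       # P(within two s.d.'s of the mean)
print(round(within_one, 3), round(within_two, 3))
```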
Sampling distribution of sample mean
• Consider random sample of size n from a population with
mean $\mu$ and variance $\sigma^2$. Key facts about sampling
distribution of $\bar{Y}$:
– Center: The mean of the sampling distribution of $\bar{Y}$ is $\mu$.
– Spread: Sample means are closer to the population
mean than single values. The sampling distribution has
$SD(\bar{Y}) = \sigma / \sqrt{n}$.
– Shape: If the population distribution is normal, the
sampling distribution of the sample mean will be
normal. If the population distribution is not normal, the
sampling distribution of the sample mean will be nearly
normal when n is large (Central Limit Theorem). A common
rule of thumb is n>30, though strongly skewed populations
may need a larger n.
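The center and spread facts can be checked by simulation. The sketch below draws many samples of size n = 30 from a skewed (exponential) population with mean 1 and standard deviation 1, both assumed purely for illustration, and compares the mean and standard deviation of the resulting sample means with $\mu$ and $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

rng = random.Random(0)       # fixed seed so the run is reproducible
n = 30                       # sample size
reps = 5000                  # number of repeated samples

# Exponential population with rate 1: mean 1, s.d. 1, clearly non-normal.
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

center = statistics.fmean(means)     # should be near mu = 1
spread = statistics.stdev(means)     # should be near sigma / sqrt(n)
print(center, spread, 1 / math.sqrt(n))
```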
Standard errors
• The standard error of a statistic is an estimate of
the standard deviation in its sampling distribution.
It is the best guess of likely size of difference
between a statistic used to estimate parameter and
parameter itself.
• Associated with every standard error is a measure
of the amount of information used to estimate
variability, called its degrees of freedom. Degrees
of freedom are measured in units of “equivalent
numbers of independent observations.”
• Standard error of sample mean: $SE(\bar{Y}) = \dfrac{s}{\sqrt{n}}$,
d.f. = n-1
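As a sketch, the standard error and its degrees of freedom can be computed as follows. The 15 differences (in cm³) are not given in these slides; they are assumed here for illustration and were chosen because they reproduce the mean (0.1987) and t ratio (3.229) reported in the JMP output for the schizophrenia case study.

```python
import math
import statistics

# Assumed unaffected-minus-affected differences (cm^3); not taken from the slides.
differences = [0.67, -0.19, 0.09, 0.19, 0.13, 0.40, 0.04, 0.10,
               0.50, 0.07, 0.23, 0.59, 0.02, 0.03, 0.11]

n = len(differences)
ybar = statistics.fmean(differences)   # sample mean
s = statistics.stdev(differences)      # sample s.d. (divides by n - 1)
se = s / math.sqrt(n)                  # standard error of the sample mean
df = n - 1                             # degrees of freedom

print(round(ybar, 4), round(s, 4), round(se, 4), df)
```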
Testing a hypothesis about $\mu$
• $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Could the difference of $\bar{Y}$ from $\mu^*$ (the
hypothesized value for $\mu$, $\mu^* = 0$ here) be due to
chance (in random sampling)?
• Test statistic: $|t| = \dfrac{|\bar{Y} - \mu^*|}{SE(\bar{Y})}$
• The test statistic will tend to be near 0 when H0 is
true and far from 0 when H0 is false.
• Assume the population distribution is normal. If
H0 is true, then t has the Student’s t-distribution
with n-1 degrees of freedom.
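A small simulation illustrates how the test statistic behaves. The sketch below draws normal samples of size 15 and compares the average size of |t| when the true mean is 0 (H0 true) with the average when the true mean is 0.2 (H0 false). The population standard deviation 0.24 and the alternative mean 0.2 are assumptions picked for illustration.

```python
import math
import random
import statistics

def t_statistic(sample, mu_star=0.0):
    """(Y-bar - mu*) / SE(Y-bar), with SE = s / sqrt(n)."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - mu_star) / se

def average_abs_t(true_mean, n=15, sd=0.24, reps=2000, seed=1):
    """Average |t| over repeated normal samples with the given true mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        total += abs(t_statistic(sample))   # always tests H0: mu = 0
    return total / reps

under_h0 = average_abs_t(0.0)   # H0 true: |t| stays near 0
under_h1 = average_abs_t(0.2)   # H0 false: |t| is typically much larger
print(under_h0, under_h1)
```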
P-value
• The (2-sided) p-value is the proportion of
random samples, drawn when H0 is true, whose t ratios
have absolute value >= the observed test statistic (|t|)
• Schizophrenia example: t = 3.23
JMP output:
Estim Mean 0.1986666667
Hypoth Mean 0
T Ratio 3.2289280811
P Value 0.0060615436
[Figure: JMP plot for the test of the mean; X axis from -0.4 to 0.4, Y axis from 0 to 8; Sample Size = 15]
Schizophrenia Example
• p-value (2-sided, paired t-test) = .006
• So either,
– (i) the null hypothesis is incorrect OR
– (ii) the null hypothesis is correct and we happened to
get a particularly unusual sample (only about 6 out of
1000 random samples would be as unusual)
• Strong evidence against $H_0: \mu = 0$
• One-sided test: $H_0: \mu = 0$, $H_1: \mu > 0$
– Test statistic: $t = \dfrac{\bar{Y} - 0}{s/\sqrt{n}}$
– For schizophrenia example, t=3.23, p-value (1-sided)
=.003
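The p-values quoted here can be reproduced without a statistics package by integrating the Student's t density numerically. The sketch below uses Simpson's rule with the T Ratio and sample size from the JMP output. The function names are made up for illustration, and halving the 2-sided value for the 1-sided test assumes the observed t lies in the direction of the alternative, as it does here.

```python
import math

def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * total * h / 3      # P(-|t| < T < |t|), by symmetry
    return 1 - central_area

t_ratio = 3.2289280811                    # T Ratio from the JMP output
p_two_sided = two_sided_p_value(t_ratio, df=15 - 1)
p_one_sided = p_two_sided / 2             # valid because t > 0 and H1 is mu > 0
print(round(p_two_sided, 4), round(p_one_sided, 4))
```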
Matched pairs t-test in JMP
• Click Analyze, Matched Pairs, put two
columns (e.g., affected and unaffected) into
Y, Paired Response.
• Can also use one-sample t-test. Click
Analyze, Distribution, put difference into Y,
columns. Then click red triangle under
difference and click test mean.
Confidence Intervals
• Point estimate: a single number used as the best estimate of
a population parameter, e.g., $\bar{Y}$ for $\mu$.
• Interval estimate (confidence interval): range of values
used as an estimate of a population parameter.
• Uses of a confidence interval:
– Provides a range of values that is “likely” to contain the
true parameter. Confidence interval can be thought of
as the range of values for the parameter that are
“plausible” given the data.
– Conveys precision of point estimate as an estimate of
population parameter.
Confidence interval construction
• A confidence interval typically takes the form:
point estimate ± margin of error
• The margin of error depends on two factors:
– Standard error of the estimate
– Degree of “confidence” we want.
– Margin of error = Multiplier for degree of confidence *
SE of estimate
– For a 95% confidence interval, the multiplier for degree
of confidence is about 2 in most cases.
CI for population mean
• If the population distribution of Y is normal
(* we will study the if part later) 95% CI for
mean of single population:
$\bar{Y} \pm t_{n-1}(.975) \times SE(\bar{Y}) = \bar{Y} \pm t_{n-1}(.975) \times \dfrac{s}{\sqrt{n}}$
• For schizophrenia data:
$0.199 \pm 2.145 \times 0.0615$ cm³,
i.e., 0.067 cm³ to 0.331 cm³
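The interval arithmetic can be retraced from the numbers in the JMP output. In the sketch below the standard error is not read from the output but back-calculated as estimate / T Ratio, which is valid because the hypothesized mean is 0. The multiplier 2.145 is the $t_{14}(.975)$ value used in the slide.

```python
estimate = 0.1986666667    # Estim Mean from the JMP output (cm^3)
t_ratio = 3.2289280811     # T Ratio from the JMP output
multiplier = 2.145         # t_{14}(.975), as used in the slide

se = estimate / t_ratio    # back-calculated: t = (estimate - 0) / SE
margin = multiplier * se   # margin of error
lower = estimate - margin
upper = estimate + margin

print(round(se, 4), round(lower, 3), round(upper, 3))
```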
Interpretation of CIs
• A 95% confidence interval will contain the true
parameter (e.g., the population mean) 95% of the
time if repeated random samples are taken.
• It is impossible to say whether it is successful or
not in any particular case, i.e., we know that the CI
will usually contain the true mean under random
sampling but we do not know for the
schizophrenia data if the CI (0.067 cm³, 0.331 cm³)
contains the true mean difference.
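The "95% of the time" statement can be illustrated by simulation: draw many random samples from a population whose mean is known, build the interval each time, and count how often it contains the true mean. The population used below (normal, mean 0.2, standard deviation 0.24) is an assumption made for illustration; the multiplier 2.145 matches n = 15.

```python
import math
import random

def coverage(true_mean=0.2, sd=0.24, n=15, multiplier=2.145,
             reps=20000, seed=2):
    """Fraction of simulated 95% CIs that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        ybar = sum(sample) / n
        s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
        margin = multiplier * s / math.sqrt(n)
        if ybar - margin <= true_mean <= ybar + margin:
            hits += 1
    return hits / reps

rate = coverage()
print(rate)    # close to 0.95; any single interval either covers or does not
```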
Confidence Intervals in JMP
• For both methods of doing paired t-test
(Analyze, Matched Pairs or Analyze,
Distribution), the 95% confidence intervals
for the mean are shown on the output.