Transcript: Chapter 3

Chapter 3
Some basic concepts of statistics
Population versus Sample
Population
• Numbers that describe the population are called parameters
• The population mean is represented by μ
• The population variance is represented by σ²
Sample
• Numbers that describe the sample are called statistics
• The sample mean is represented by ȳ
• The sample variance is represented by s²
Sample mean and variance
Use the following data set: 5, 9, 8, 7, 6, 5, 8, 4, 1
• Calculate the sample mean:
• Calculate the sample variance:
• Calculate the sample standard deviation:
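As a check on the exercise above, all three quantities can be computed directly in R (the data vector is the one given on the slide; `var` and `sd` use the n − 1 divisor):

```r
# Data set from the slide
y <- c(5, 9, 8, 7, 6, 5, 8, 4, 1)

y.bar <- mean(y)   # sample mean: 53/9, about 5.889
s2    <- var(y)    # sample variance (divisor n - 1), about 6.111
s     <- sd(y)     # sample standard deviation, about 2.472
```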
Population Mean and Standard Deviation
• Population mean: μ = E(Y) = Σ yᵢ p(yᵢ)
• Population variance: σ² = Σ (yᵢ − μ)² p(yᵢ)
• Population standard deviation: σ = √σ²
Use the following information to calculate the population mean, variance, and standard deviation:

Y    | 1   | 2   | 3   | 4
P(Y) | 0.1 | 0.6 | 0.2 | 0.1
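Applying the formulas above to this probability table, a quick R check of the exercise:

```r
# Probability distribution from the slide
y <- 1:4
p <- c(0.1, 0.6, 0.2, 0.1)

mu     <- sum(y * p)            # population mean: 2.3
sigma2 <- sum((y - mu)^2 * p)   # population variance: 0.61
sigma  <- sqrt(sigma2)          # population sd: about 0.781
```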
Sampling distribution
• The distribution of all possible values of ȳ with n = 50.
• E(ȳ) = μ
• Var(ȳ) = σ²/n
Section 3.3 Summarizing Information in Populations and Samples: The Finite Population Case
• If the population is infinitely large, sampling without replacement still leaves the draws independent (the probability of selecting any given element does not change from draw to draw).
• However, if the population is finite, then the probabilities of selecting elements change as more elements are selected.
(Example: rolling a die versus selecting cards from a standard 52-card deck)
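A quick numeric illustration of the card example (these particular probabilities are standard facts, not from the slide): each die roll is unaffected by earlier rolls, but drawing cards without replacement changes the conditional probabilities.

```r
# Die: the probability of a six is 1/6 on every roll, regardless of history
p.six <- 1 / 6

# Deck: the probability of drawing an ace drops from 4/52 to 3/51
# once an ace has already been removed, because the deck is finite
p.first.ace  <- 4 / 52
p.second.ace <- 3 / 51   # conditional on the first card being an ace
```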
Estimating the population total
• We will represent the total of a population as τ and the corresponding statistic (estimator) as τ-hat.
• More to come on this in the next few chapters.
Sampling without replacement
• The same idea can be used with sampling without replacement, but the probabilities become more difficult to find (STT 315 helps in understanding how to calculate these).
3.4 Sampling distribution
• In your introductory statistics class, you discovered that the sampling distribution of ȳ is approximately normal (if n is large enough) with mean μ and standard deviation σ/√n.
Tchebysheff’s theorem
• If n is NOT large enough to assume the CLT and the population distribution is NOT normal, then we can still use Tchebysheff’s theorem to get a lower bound:
For any k > 1, at least 1 − 1/k² of the observations will fall within k standard deviations of the mean (this is a LOWER BOUND!!). Therefore, within 1 standard deviation, at least 0% (not very useful); within 2 standard deviations, at least 75%; within 3, at least 88.9%.
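The bound can be checked empirically in R. A short sketch (the gamma distribution is used here only as an example of a skewed, non-normal population; the seed and sample size are arbitrary): the observed proportions within k standard deviations should meet or beat Tchebysheff’s lower bound.

```r
set.seed(1)

# A skewed, non-normal "population" of draws
x <- rgamma(10000, shape = 0.5, scale = 9)
m <- mean(x)
s <- sd(x)

# Observed proportion within k standard deviations vs. the bound 1 - 1/k^2
for (k in 2:3) {
  prop  <- mean(abs(x - m) < k * s)
  bound <- 1 - 1 / k^2
  cat("k =", k, " observed:", round(prop, 3), " bound:", round(bound, 3), "\n")
}
```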
Finite population size
All the theory in your introductory statistics class (and so far in this class) assumes INDEPENDENT observations (an infinite population, or one so large that we can treat it as infinite).
What happens when this is not true?
R code:
# Draw a finite "population" of 80 observations from a skewed gamma distribution
x <- rgamma(80, shape = 0.5, scale = 9)
hist(x)

# Simulate the sampling distribution of y-bar: repeatedly draw samples of
# size n WITHOUT replacement from the finite population x
x.bar.dist <- function(x, n) {
  xbar <- vector(length = 1000)
  for (i in 1:1000) {
    temp    <- sample(x, n, replace = FALSE)
    xbar[i] <- mean(temp)
  }
  return(xbar)
}
R code:
# Same simulation, but each sample of size n is drawn fresh from the gamma
# distribution itself (an effectively infinite population)
x.bar.dist1 <- function(n) {
  xbar <- vector(length = 1000)
  for (i in 1:1000) {
    temp    <- rgamma(n, shape = 0.5, scale = 9)
    xbar[i] <- mean(temp)
  }
  return(xbar)
}
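To see why the finite-population case matters, the two simulators above can be run side by side (this comparison is my illustration, not from the slides; the functions are redefined so the block is self-contained). Sampling without replacement from a finite population typically shrinks Var(ȳ), roughly by the finite population correction (N − n)/(N − 1).

```r
set.seed(42)

# y-bar simulated by sampling WITHOUT replacement from a finite population x
x.bar.dist <- function(x, n) {
  xbar <- vector(length = 1000)
  for (i in 1:1000) xbar[i] <- mean(sample(x, n, replace = FALSE))
  xbar
}

# y-bar simulated with fresh draws each time (infinite population)
x.bar.dist1 <- function(n) {
  xbar <- vector(length = 1000)
  for (i in 1:1000) xbar[i] <- mean(rgamma(n, shape = 0.5, scale = 9))
  xbar
}

x <- rgamma(80, shape = 0.5, scale = 9)   # finite population, N = 80

finite   <- x.bar.dist(x, 50)   # n = 50 out of N = 80, without replacement
infinite <- x.bar.dist1(50)     # n = 50 fresh draws each repetition

var(finite)                     # typically much smaller...
var(infinite)
(80 - 50) / (80 - 1)            # ...on the order of the FPC factor
```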
3.5 Covariance and Correlation
• Covariance describes the relationship between two random variables
• The covariance indicates how two variables “covary”
• Positive covariance indicates a positive association
• Negative covariance indicates a negative association
• Zero covariance indicates no linear association (NOT necessarily independence!!!)
More on Covariance
• We calculate covariance as Cov(y₁, y₂) = E[(y₁ − μ₁)(y₂ − μ₂)].
• Look at graphs to discuss covariance (a measure of LINEAR dependency).
• However, covariance depends on the scale of the two variables.
• Correlation “standardizes” the covariance:
ρ = Cov(y₁, y₂)/(σ₁σ₂)
• Note that −1 ≤ ρ ≤ 1.
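The scale point above can be seen directly in R (the data below are made up for illustration): rescaling one variable rescales the covariance, but leaves the correlation unchanged.

```r
set.seed(7)
y1 <- rnorm(100)
y2 <- y1 + rnorm(100)   # positively associated with y1

cov(y1, y2)             # depends on the units of y1 and y2
cov(y1, y2 * 10)        # 10 times larger after rescaling y2

cor(y1, y2)             # always between -1 and 1
cor(y1, y2 * 10)        # unchanged by the rescaling
```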
3.6 Estimation
• Since we do not know the parameters, we estimate them with statistics!! If θ is the parameter of interest, then θ-hat is the estimator of θ. We want the following properties to hold:
1. E(θ-hat) = θ (unbiasedness)
2. V(θ-hat) = σ²(θ-hat) is small
Error of Estimation and Bounds
• The error of estimation is defined as |θ-hat − θ|.
• Set a bound B on this error of estimation such that
P(|θ-hat − θ| < B) = 1 − α
The value of B (the bound) can be thought of as the margin of error. In fact, this is how confidence intervals are constructed (when the sampling distribution of the statistic is normally distributed).
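For the sample mean with a normal sampling distribution, the bound works out to B = z(α/2) · σ/√n, and the common "2 standard errors" rule takes z ≈ 2. A sketch, reusing the data set from the sample-mean exercise earlier in the chapter and substituting the sample sd for the unknown σ:

```r
y <- c(5, 9, 8, 7, 6, 5, 8, 4, 1)
n <- length(y)

# Approximate 95% bound on the error of estimation for y-bar,
# with the sample sd standing in for the unknown population sigma
B <- 2 * sd(y) / sqrt(n)

mean(y) - B   # lower end of the interval y-bar +/- B
mean(y) + B   # upper end
```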