Transcript lecture 3
Introduction to Data Analysis
Probability Distributions
Today’s lecture
Probability distributions (A&F 4)
Normal distribution.
Sampling distributions = normal distributions.
Standard errors (part 1).
Probability – an idiot’s guide
We’re interested in how likely, how probable,
it is that our sample is similar to the
population.
In order to make this judgement, we need to
think about probability a little bit.
In particular we need to think about probability
distributions.
Probability
The proportion of times that an outcome would occur
in a long run of repeated observations.
Imagine tossing a coin: on any one flip the coin can
land heads or tails.
If we flip the coin lots of times then the proportion of
heads is likely to be close to the proportion of tails
(law of large numbers).
Thus the probability of a coin landing heads on any
one flip is ½, or 0.5, or in bookmakers’ terms ‘evens’.
If the coin was double headed then the probability of
heads would be 1 – a certainty.
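The long-run idea can be checked with a quick simulation. This is a minimal Python sketch, not part of the lecture: the function name proportion_heads and the fixed seed are illustrative assumptions.

```python
import random


def proportion_heads(n_flips, seed=0):
    """Simulate n_flips tosses of a fair coin; return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so runs repeat
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


# The proportion of heads should settle near 0.5 as the number of flips grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```

With only 10 flips the proportion can be well away from 0.5; with many flips it should sit very close to it.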
Probability Distribution (1)
The mean of a probability distribution of a variable
is:
µ = ∑ y P(y) if y is discrete.
µ = ∫ y P(y)dy if y is continuous.
Also called the expected value, E(y): each outcome weighted
by its probability (probability times payoff).
The standard deviation (σ) of a probability distribution
measures its variability.
A larger σ means a more spread-out distribution.
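For a discrete variable, the mean and standard deviation can be worked out directly from the list of outcomes and probabilities. The sketch below uses a fair six-sided die as a made-up example (it is not from the lecture) and assumes the usual definition σ = √(∑ (y − µ)² P(y)); the name mean_and_sd is illustrative.

```python
import math


def mean_and_sd(dist):
    """dist maps each outcome y to its probability P(y)."""
    mu = sum(y * p for y, p in dist.items())                    # sum of y * P(y)
    variance = sum((y - mu) ** 2 * p for y, p in dist.items())  # spread around mu
    return mu, math.sqrt(variance)


# Illustrative distribution: a fair die, each face with probability 1/6.
die = {y: 1 / 6 for y in range(1, 7)}
mu, sigma = mean_and_sd(die)
print(mu, sigma)
```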
Probability distribution (2)
Lists the possible outcomes together with their
probabilities.
Now, let’s take a continuous-level variable, like hours
spent working by students per week.
The mean = 20, and standard deviation = 5 .
But what about the distribution…?
Assign probabilities to intervals of numbers, for example the
probability of students working between 0 and 10 hours is (let’s
say) 2½ per cent.
We can graph this, with the area under the curve over a given interval
representing the probability of the variable taking a value in that interval.
Probability distribution (3)
[Figure: probability distribution of time spent working (hours), axis running from 0 to 40. The area between 0 and 10 is 2.5 per cent of the total area beneath the curve.]
Probability distribution (4)
Given this distribution, there is a 0.025
probability (2.5%, or 39-1 for the gamblers)
that if I picked a student at random they would have
done fewer than 10 hours of work in a week.
A lot of continuous variables have a certain
distribution – this is known as the normal
distribution.
The student work distribution is ‘normal’.
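Areas under a normal curve can be computed rather than read off a graph. The sketch below assumes the student hours follow a normal distribution with the lecture's mean of 20 and standard deviation of 5; the helper name prob_between is an illustrative choice. The exact area below 10 hours comes out at about 0.023, which the slides round to 2.5 per cent.

```python
from statistics import NormalDist

# Hours worked per week, assumed normal with the lecture's mean and s.d.
hours = NormalDist(mu=20, sigma=5)


def prob_between(dist, low, high):
    """Area under the curve between low and high, i.e. P(low < Y < high)."""
    return dist.cdf(high) - dist.cdf(low)


p_under_10 = hours.cdf(10)              # area to the left of 10 hours
p_0_to_10 = prob_between(hours, 0, 10)  # area between 0 and 10 hours
print(round(p_under_10, 4), round(p_0_to_10, 4))
```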
What is a ‘normal distribution’?
NDs are symmetrical.
The distribution above the mean is a mirror image of the distribution
below the mean.
Unlike income, which has a skewed distribution.
For any normal distribution, the probability of falling
within z standard deviations of the mean is the same,
regardless of the distribution’s standard deviation.
The Empirical Rule tells us:
For 1 s.d. (or a z-value of 1) the probability is .68
For 2 s.d. (actually 1.96) the probability is .95
For 3 s.d. the probability is almost 1.
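The Empirical Rule figures can be reproduced for any mean and standard deviation. In this sketch the first distribution uses the student hours example; the second (mean 500, s.d. 150) is an arbitrary choice made only to show that the answers do not depend on the spread. The name prob_within is illustrative.

```python
from statistics import NormalDist


def prob_within(z, mu, sigma):
    """P(mu - z*sigma < Y < mu + z*sigma) for a normal distribution."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + z * sigma) - dist.cdf(mu - z * sigma)


# Two normal distributions with different spreads give the same probabilities.
for mu, sigma in ((20, 5), (500, 150)):
    for z in (1, 1.96, 3):
        print(mu, sigma, z, round(prob_within(z, mu, sigma), 4))
```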
Brief aside—what is Z?
The Z-score for a value Y on a variable is the number
of standard deviations that Y falls from µ.
z = (y − µ) / σ
We can use Z-scores to determine the probability in
the tail of a normal distribution that is beyond a
number Y.
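As a small worked case, take a student who works 30 hours (a value picked for illustration) when µ = 20 and σ = 5. The sketch below computes the Z-score and the probability in the tail beyond it; the function names are assumptions, and the standard normal distribution stands in for a printed z table.

```python
from statistics import NormalDist


def z_score(y, mu, sigma):
    """Number of standard deviations that y falls from mu."""
    return (y - mu) / sigma


def upper_tail(y, mu, sigma):
    """P(Y > y): the probability in the tail beyond y."""
    z = z_score(y, mu, sigma)
    return 1 - NormalDist().cdf(z)  # standard normal: mean 0, s.d. 1


z = z_score(30, mu=20, sigma=5)
print(z, round(upper_tail(30, mu=20, sigma=5), 4))
```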
Normal distribution (1)
The area under the curve between 15 hours (1 s.d. less than the mean)
and 25 hours (1 s.d. more than the mean) is 0.68 of the total area
under the curve.
Hence the probability of working between 15 and 25 hours is 0.68.
[Figure: normal curve of time spent working (hours), axis running from 0 to 40.]
Normal distribution (2)
For any value of z (i.e. not just whole numbers but
say 2.34 s.d.), there is a corresponding probability.
Most stats books have z tables inside their front or back covers.
Thus if we were to pick a student out of our
population of known distribution we could work out
how likely it would be that she was a hard worker.
Even non-normal distributions can often be transformed to
produce approximately normal distributions.
For example, incomes are not normally distributed, but we can ‘log’
them to get something closer to a normal distribution (more on this later).
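The effect of logging a skewed variable can be seen on simulated data. Everything in this sketch is an assumption for illustration: the incomes are random log-normal draws rather than real data, and skewness is measured as the average cubed z-score (about 0 for a symmetric distribution, positive for a long right tail).

```python
import math
import random
import statistics


def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(1)  # arbitrary seed so the run repeats
# Made-up incomes: log-normal draws, NOT real income data.
incomes = [rng.lognormvariate(10, 0.5) for _ in range(10_000)]
logged = [math.log(v) for v in incomes]

# Raw incomes are right-skewed; the logged values are roughly symmetric.
print(round(skewness(incomes), 2), round(skewness(logged), 2))
```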
What’s the point?
But, surely we don’t know the distribution or
mean of the population (that’s probably why
we’re sampling it after all), so what use is all
this…?
Back to sampling
Normal distributions are relevant to us because
the distribution of sample means is
(approximately) normal.
In order to understand what this means let’s
take an example of sampling.
I want to take a driving trip around the world, visit
every country and pay no attention to speed limits.
I don’t particularly want to go to prison however, so
what to do…
Sampling example (1)
The plan is to bribe all
policemen when caught
speeding. Thus I want to
measure how much it costs on
average to bribe a policeman
to avoid a speeding ticket.
It’s costly to collect this
information, so I don’t want
to investigate every country
before I set off.
Therefore I sample the
countries to try and estimate
the average bribe I will need
to pay.
[Images: James’ car; Ryan’s car]
Sampling example (2)
I randomly sample 5 countries and measure the cost
of the bribe.
Imagine for the minute I know what the population distribution
looks like (it happens to be normal with a mean of $500).
[Figure: population distribution, marking the mean of the population ($500), the sample mean ($450) and one observation ($700).]
Sampling distributions (1)
If we took lots of samples we would get a distribution
of sample means, or the sampling distribution (the same
idea applies to any sample statistic). It so happens that
the sampling distribution of the sample mean is
approximately normal.
Due to averaging the sample mean does not vary as
widely as the individual observations.
Moreover, if we took lots of samples then the
distribution of the sample means would be centred
around the population mean.
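Taking lots of samples is easy to imitate on a computer. In this sketch the population mean of $500 and the sample size of 5 come from the bribery example, while the population standard deviation of 150, the number of repeated samples and the seed are assumptions made purely for illustration.

```python
import random
import statistics

POP_MEAN = 500  # from the lecture
POP_SD = 150    # assumption: the lecture does not give the population s.d.
N = 5           # countries per sample

rng = random.Random(42)  # arbitrary seed so the run repeats


def sample_mean(n):
    """Mean bribe in one random sample of n countries."""
    return statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))


sample_means = [sample_mean(N) for _ in range(10_000)]

print(round(statistics.fmean(sample_means), 1))  # centre of the sample means
print(round(statistics.stdev(sample_means), 1))  # spread of the sample means
```

The sample means should centre on the population mean and vary much less than the individual observations do.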
Sampling distributions (2)
Imagine I took lots of samples. There would be a
normal distribution of their means, centred around
the population mean.
[Figure: the population distribution and the sampling distribution, with the mean of the population and the mean of all sample means marked.]
3 Very Important Things
If we have lots of sample means then their average will
be the same as the population mean.
In technical language the sample mean is an unbiased estimator of
the population mean.
If the sample size is large(ish), the distribution of
sample means (what is called the sampling
distribution) is approximately normal.
This is true regardless of the shape of the population distribution.
As n (the sample size) increases the sampling
distribution looks more and more like a normal
distribution.
This is called the central limit theorem.
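The central limit theorem can be seen by sampling from a population that is clearly not normal. The sketch below assumes a right-skewed exponential population with mean 500 (an arbitrary choice, not from the lecture) and compares the skewness of the sample means for n = 2 and n = 30; skewness is measured as the average cubed z-score.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed so the run repeats


def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def sample_means(n, reps=5_000):
    """Means of reps samples of size n from a skewed (exponential) population."""
    return [statistics.fmean(rng.expovariate(1 / 500) for _ in range(n))
            for _ in range(reps)]


means_small = sample_means(2)
means_large = sample_means(30)

# Skewness of the sample means shrinks towards 0 as n grows.
print(round(skewness(means_small), 2), round(skewness(means_large), 2))
print(round(statistics.fmean(means_large), 1))  # still centred near 500
```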
Sampling distributions (4)
[Figure: the sampling distribution and the population distribution, with the mean of the population and the mean of all sample means marked.]
Sampling distributions (5)
‘Accurate’/‘inaccurate’ samples (1)
Some sampling distributions are bigger than others…
The top sampling distribution is better for estimating the
population mean as more of the sample means lie near the
population mean.
[Figure: two sampling distributions, each with 68% of the distribution marked around the mean of the population.]
‘Accurate’/‘inaccurate’ samples (2)
Sampling distributions that are tightly
clustered will give us a more accurate estimate
on average than those that are more dispersed.
Remember, high standard deviations give us a ‘short
and flabby’ distribution and low standard deviations
give us a ‘tall and tight’ distribution.
We need to estimate what our sampling
distribution’s standard deviation is.
But how do we do this…?
A (little) bit of math now…
Before we work out what the sampling distribution
looks like, some important terms.
Population mean: µ
Population standard deviation: σ
Sample observation: X
Sample mean: X̄
Sample standard deviation: s
Sample size: n
Standard error (1)
For my bribery sample, we know the following:
Sample mean: X̄ = 450
Sample standard deviation: s = 150
Sample size: n = 5
But, we want to know the standard deviation of the
sampling distribution, so we can see what the typical
deviation from the population mean will be.
Standard error (2)
Fortunately for us:
Standard deviation of X̄ = σ / √n
We don't know what σ is, but we do know s.
Standard error of X̄ = s / √n
The standard error is an estimate of how far any sample mean
‘typically’ deviates from the population mean.
Standard error (3)
For my bribery sample:
Standard error of X̄ = s / √n = 150 / √5 = 150 / 2.236 = 67.08
Thus, the ‘typical’ deviation of a sample mean from the
population mean (of $500) would be about $67, if we
repeatedly sampled the population.
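The standard error can also be computed straight from the sample values. The five bribes below are hypothetical, invented only so that their mean is 450 and their sample standard deviation is 150 as in the example; the function name standard_error is likewise an assumption. Since √5 is about 2.236, s / √n comes out at roughly 67.

```python
import math
import statistics

# Hypothetical bribes (in $), invented so that the mean is 450 and s is 150.
bribes = [700, 300, 400, 400, 450]


def standard_error(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)       # sample s.d. (divides by n - 1)
    return s / math.sqrt(len(sample))


print(statistics.mean(bribes), statistics.stdev(bribes))
print(round(standard_error(bribes), 2))
```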
2 More Very Important Things
The formula for standard error means that as:
…the n of the sample increases the sampling
distribution is tighter.
This makes sense, the bigger the sample the better it is
at estimating the population mean.
…the distribution of the population becomes
tighter, the sampling distribution is also tighter.
This also makes sense. If a population is dispersed we
are less likely to get observations near the mean.
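Both effects follow directly from s / √n and can be tabulated. In this sketch s = 150 and n = 5 come from the bribery example; the other values of n and s are arbitrary choices for comparison.

```python
import math


def standard_error(s, n):
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


# Bigger samples: quadrupling n halves the standard error.
for n in (5, 20, 80):
    print("s=150, n=%d: %.2f" % (n, standard_error(150, n)))

# Tighter populations: a smaller s gives a smaller standard error.
for s in (50, 150, 300):
    print("s=%d, n=5: %.2f" % (s, standard_error(s, 5)))
```

Note the square root: to halve the standard error you need four times the sample, not twice.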
Binary variables
This works for binary variables too, where the mean is
just the proportion…
Population mean = population proportion π
Population standard deviation σ = √(π(1 − π))
Sample proportion = P
Standard deviation of P = √(π(1 − π) / n)
Standard error of P = √(P(1 − P) / n)
Do we trust Blair?
Take the example of Blair’s trustworthiness.
1000 people in the sample, 30% trust him (i.e. the mean is 0.30).
Given this, we can work out the standard error.
Sample proportion P = 0.30.
Standard error of P = √(P(1 − P) / n) = √(0.3 × 0.7 / 1000) = 0.014
The typical deviation from the proportion would be 1.4 percentage
points if we took lots of samples.
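The same calculation in code, as a minimal sketch; the function name proportion_standard_error is an illustrative assumption.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


se = proportion_standard_error(0.30, 1000)  # 30% of 1000 people trust Blair
print(round(se, 4))
```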
And finally, standard error (4)
Don’t forget that we know the shape of the
distribution of the sample means (it’s normal).
We know the sample mean, the shape of the distribution
of all the sample means, and how dispersed the
distribution of sample means is.
So (at last) we can calculate the probability of
the sample mean being ‘near’ to the population
mean (i.e. calculate a z-score and look up the
corresponding probability). But wait, there’s
more…
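Putting the pieces together for the Blair example: the sketch below estimates the probability that a sample proportion lands within 3 percentage points of the population proportion. The 3-point distance is chosen only for illustration, and the calculation assumes the sampling distribution is normal and treats the estimated standard error as if it were exact.

```python
import math
from statistics import NormalDist


def prob_within(distance, standard_error):
    """P(sample statistic lies within distance of the population value)."""
    z = distance / standard_error  # the distance expressed as a z-score
    return NormalDist().cdf(z) - NormalDist().cdf(-z)


se = math.sqrt(0.30 * 0.70 / 1000)  # standard error from the Blair example
print(round(prob_within(0.03, se), 3))
```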
Next week
Finish off standard error.
Think about how we can measure a range around our
sample mean that we can be confident contains the
population mean.
These ranges are called confidence intervals.
Hypothesis testing.
What’s a hypothesis?
How samples can help us to work out the probability of
hypotheses being correct.