Section 4 - Probability Distributions

William Christensen, Ph.D.
In Section 4 we will combine elements of
Section 2 (Distributions) with Section 3
(Probabilities)
A Probability Distribution helps us
understand the chance (probability) of
some event occurring.
But remember, being a statistician means
you never say you’re certain
Probability Distributions
Here’s what a probability distribution
looks like, in table form.
In this probability distribution x
represents the number of baby girls
among 14 randomly selected newborns.
Each probability P(x) represents the
probability or chance of EXACTLY x
number of girls among 14 randomly
selected newborns.
For example, the probability of finding
exactly 6 girls among 14 newborn babies
is 0.183 or 18.3%. The probability of
finding exactly 0 girls among 14
newborns is 0.000 (this is rounded to 3
decimals – there is actually a small chance
which we would see if we carried the decimals
out further).
x     P(x)
0     0.000
1     0.001
2     0.006
3     0.022
4     0.061
5     0.122
6     0.183
7     0.209
8     0.183
9     0.122
10    0.061
11    0.022
12    0.006
13    0.001
14    0.000
Probability Distributions
Requirements
• There are a couple of things that define a probability distribution: rules
that any probability distribution must obey
1. The sum of the probabilities must be 1
∑P(x) = 1
Really, what this means is that every possible outcome must be included in the
probability distribution. For example, in the previous slide we showed the
probability distribution for the number of girls among 14 newborns. Did we
include every possible outcome? Yes we did, we included from 0 – 14 of those
14 newborns being girls, that’s every possible outcome isn’t it? And, do the
probabilities all add up to 1? Add them up and see (the rounded values in the
table total 0.999; the missing 0.001 is only rounding). If they do, then we have
satisfied this first rule of probability distributions.
Probability Distributions
Requirements
2. Every individual probability must be between 0 and 1. This is something we
learned earlier and applies to all probabilities.
0 ≤ P(x) ≤ 1
• Based on these two rules, does our sample of baby girls among 14 newborns meet
the requirements for a probability distribution? YES
• These are important rules and concepts that you must remember throughout the
course
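For readers who want to check the two rules outside Excel, here is a minimal Python sketch. Python is not part of the course materials, and the function name `is_valid_distribution` and the rounding tolerance are assumptions made for illustration. The tolerance is there because probabilities rounded to 3 decimals may not total exactly 1.

```python
def is_valid_distribution(probabilities, tolerance=0.005):
    """Check rule 1 (total is 1, within rounding) and rule 2 (each P(x) in 0..1)."""
    each_in_range = all(0 <= p <= 1 for p in probabilities)
    total_is_one = abs(sum(probabilities) - 1) <= tolerance
    return each_in_range and total_is_one

# Rounded P(x) values for 0-14 girls among 14 newborns, from the table.
girls_p = [0.000, 0.001, 0.006, 0.022, 0.061, 0.122, 0.183, 0.209,
           0.183, 0.122, 0.061, 0.022, 0.006, 0.001, 0.000]

print(is_valid_distribution(girls_p))      # True
print(is_valid_distribution([0.5, 0.6]))   # False: total is 1.1
```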
Probability Distributions
Here’s what our probability distribution of girls among 14 newborn babies looks
like in a histogram.
Probability Distributions
• Looking at the previous slide (the histogram of the probability distribution
of girls among 14 newborns), could you guess what the “average” number of girls
among 14 newborns is?
• Well, we can also calculate the “average” or what is sometimes also called the
“expected value” (the term “expected value” is usually used when we are talking
about the probability of some kind of money-related outcome)
• The “average” or “expected value” of a probability distribution = ∑[x * P(x)]
Probability Distributions
• Remember that it is the average x value (number of girls among 14 newborns in
our example) that we are calculating, NOT the average probability
• For our 14 newborns, we can use the formula as shown here in Excel. When we
summed all the x*P(x) values, we got 7.000. Is that also the value you guessed
looking at the histogram?
x     P(x)     x*P(x)
0     0.000    0.000
1     0.001    0.001
2     0.006    0.011
3     0.022    0.067
4     0.061    0.244
5     0.122    0.611
6     0.183    1.100
7     0.209    1.466
8     0.183    1.466
9     0.122    1.100
10    0.061    0.611
11    0.022    0.244
12    0.006    0.067
13    0.001    0.011
14    0.000    0.001
Sum            7.000
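The same ∑[x * P(x)] sum can be sketched in Python (an assumption for illustration; the course itself uses Excel). Note that this sketch starts from the P(x) values already rounded to 3 decimals, so it lands slightly under 7, whereas Excel carries the unrounded probabilities and shows 7.000.

```python
# x values 0..14 and the rounded P(x) values from the table.
x_values = list(range(15))
p_values = [0.000, 0.001, 0.006, 0.022, 0.061, 0.122, 0.183, 0.209,
            0.183, 0.122, 0.061, 0.022, 0.006, 0.001, 0.000]

# Multiply each x by its probability, then add the products up.
mean_girls = sum(x * p for x, p in zip(x_values, p_values))

print(round(mean_girls, 2))
```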
Probability Distributions
• Here’s another example tailored to the “expected value” concept of average.
• Let’s say you go to Mesquite and make $1 bets with a potential payoff of $500.
• Let’s also say that the chances of winning are 1/1000 or 0.001 (that means the
chances of losing are 1 - 0.001 = 0.999)
• Again, using the formula and Excel, we get the following results.
• The results show that, on average, you should “expect” to lose 50 cents each
time you make your $1 bet.
Event    x          P(x)     x*P(x)
Win      $499.00    0.001     0.499
Lose     -$1.00     0.999    -0.999
Sum                          -$0.50
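As a hedged illustration of where the $499.00 and -$1.00 come from, the sketch below derives the net outcomes from the stake and the payoff before applying ∑[x * P(x)]. The variable names and the helper `expected_value` are hypothetical, not part of the course's Excel workbook.

```python
def expected_value(outcomes):
    """Sum of x * P(x) over (x, P(x)) pairs."""
    return sum(x * p for x, p in outcomes)

stake = 1.00      # cost of one bet
payoff = 500.00   # amount paid out on a win
p_win = 0.001     # chance of winning, 1/1000

bet = [
    (payoff - stake, p_win),   # win: net gain of $499.00
    (-stake, 1 - p_win),       # lose: net loss of $1.00
]

bet_ev = expected_value(bet)
print(round(bet_ev, 2))   # -0.5, an expected loss of 50 cents per bet
```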
Binomial Probability Distributions
Binomial Probability Distribution
As you read through the following requirements of a “binomial”
probability distribution you should recognize that our previous
example of the number of girls among 14 newborn babies is one
example of a binomial probability distribution
1. A binomial probability distribution always has a fixed
number of trials. (in our example, this was 14 – we checked
the probability of 0-14 girls among 14 newborns)
2. The trials must be independent. The outcome of any
individual trial doesn’t affect the probabilities in the other
trials. (in our example, the gender of one baby had
absolutely no effect on the gender of any other baby)
3. Each trial must have all outcomes classified into two
categories. (in our example, babies had to be either boy or
girl – thus all outcomes were classified into two categories)
4. The probabilities must remain constant for each trial. (in
our example, the probability of any baby being a girl was 0.50
or 50% and this was the same for every baby)
Notation for Binomial Probability
Distributions
Although the following formula can be used to manually calculate
probability in a binomial probability distribution, I DO NOT expect you
to know or remember it. However, I DO EXPECT you to be able to use
Excel to calculate binomial probability distributions
$$P(x) = \frac{n!}{(n - x)!\,x!} \cdot p^{x} \cdot q^{n-x}$$
Where,
n = fixed number of trials
x = specific number of successes in n trials
p = probability of success in one of n trials
q = probability of failure in one of n trials (q = 1 - p)
P(x) = probability of getting exactly x successes among n trials
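To see the formula at work term by term, here is a small Python sketch (Python is an assumption here; the course expects Excel). The function name `binomial_probability` is made up for illustration, and its arguments follow the n, x, p, q notation above.

```python
from math import factorial

def binomial_probability(x, n, p):
    """P(x): probability of exactly x successes among n trials."""
    q = 1 - p                                                 # probability of failure
    ways = factorial(n) // (factorial(n - x) * factorial(x))  # n! / ((n - x)! x!)
    return ways * p**x * q**(n - x)

# Exactly 6 girls among 14 newborns, with p = 0.50.
print(round(binomial_probability(6, 14, 0.50), 3))   # 0.183
```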
Binomial Probabilities using Excel
Example: Let’s use Excel to re-create the binomial
probability distribution showing the probability of
finding 0-14 girls among a group of 14 randomly
selected newborn babies. If you want to follow along
and see a movie of me doing this in Excel, just click
anywhere on the next slide and it will play (or open the
windows media file called binomdist1.wmv)
1. Set up an Excel spreadsheet with one column labeled x (to represent the
number of girls among the 14 newborns), and another column labeled P(x) (to
represent the probability of finding exactly x number of girls among the 14
newborns).
x     P(x)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Binomial Probabilities using Excel
2. Click on fx (function), then select the category called “Statistical”, and
the function called BINOMDIST. A shortcut alternative is to simply type the
function =BINOMDIST in the first empty cell under the P(x) column
3. The function BINOMDIST requires the following fields
=BINOMDIST(number_s, trials, Probability_s, Cumulative)
Number_s is the x value for which we are calculating the probability. E.g., to
find the probability of 0 girls in 14 newborns, this value would be 0 (or cell
A2, the cell address that contains 0).
Trials is the total number of trials included in our distribution. E.g., for 14
newborns, Trials = 14
Probability_s is the chance that any single event might occur. E.g., for our
newborns, we know the probability of a girl in any single case is 0.50. In any
problem of this type you are either given the probability or the means to
calculate it; you should never have to guess.
Cumulative allows us to specify whether we want the result to be the probability
of exactly x successes (this is normally the case and we enter “0” or “false”)
or whether we want the results to “cumulate” the probabilities by adding all
previous P(x)’s to the current x to give us the probability of x OR fewer
successes. To cumulate, enter “1” or “True”.
For 0 girls in 14 newborns, the Excel formula looks like this =BINOMDIST(0,14,0.5,FALSE)
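The effect of the Cumulative field can be sketched with a rough Python stand-in for BINOMDIST. This is an illustration only: the function `binomdist` below is hypothetical, written to mirror the four fields described above, and is not Excel's own implementation.

```python
from math import comb

def binomdist(number_s, trials, probability_s, cumulative):
    """Rough stand-in for Excel's BINOMDIST, using the same field order."""
    def exact(x):
        # Probability of exactly x successes among `trials` trials.
        return comb(trials, x) * probability_s**x * (1 - probability_s)**(trials - x)

    if cumulative:
        # Probability of number_s OR fewer successes.
        return sum(exact(x) for x in range(number_s + 1))
    return exact(number_s)

# Exactly 0 girls among 14 newborns, then 3 or fewer girls.
print(round(binomdist(0, 14, 0.5, False), 3))
print(round(binomdist(3, 14, 0.5, True), 3))
```

With Cumulative set to True and Number_s equal to the number of trials, the result is the total of the whole distribution, which is 1.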
Binomial Probabilities using Excel
4. Next, we can copy the BINOMDIST formula down the P(x)
column and we now have the entire binomial probability
distribution for how many girls we might find among 14
randomly selected newborns.
• Notice how all the probabilities are between 0 and 1 and they all add up to
1.000. Remember, these rules must be met in order to have a valid probability
distribution.
x     P(x)
0     0.000
1     0.001
2     0.006
3     0.022
4     0.061
5     0.122
6     0.183
7     0.209
8     0.183
9     0.122
10    0.061
11    0.022
12    0.006
13    0.001
14    0.000
Sum   1.000
Binomial Probabilities using Excel
• Here’s another problem for you to try.
• AT&T Directory Assistance has been found to be accurate and correct 90% of the
time. Assuming we make 5 calls to AT&T Directory Assistance, construct the
binomial probability distribution for correctly answered directory assistance
calls. Also, find the probability that AT&T gets exactly 3 of the 5 calls right.
• I’ll provide the answers on the next slide, but you must know how to do and
interpret these distributions, so I strongly encourage you to practice this
several times at least.
Binomial Probabilities using Excel
• AT&T Directory Assistance has been found to be accurate and correct 90% of the
time. Assuming we make 5 calls to AT&T Directory Assistance, construct the
binomial probability distribution for correctly answered directory assistance
calls. (click anywhere on the next slide or open file binomdist2.wmv to see me
do this) Solution:
x (correctly answered calls)    P(x)
0                               0.000
1                               0.000
2                               0.008
3                               0.073
4                               0.328
5                               0.590
Sum                             1.000
P(0) is calculated with =BINOMDIST(0,5,0.9,FALSE)
Probability that AT&T gets exactly 3 calls right is 0.073 or 7.3%
Mean, Standard Deviation and Variance for Binomial Probability Distributions
Calculating the Mean of Binomial
Probability Distributions
• Remember: we learned to calculate the mean of a probability distribution by
using the formula ∑[x*P(x)]
• We can still use that formula, but for Binomial Probability Distributions
there is an easier formula.
• For Binomial Probability Distribution µ (mean) = n * p
• Where n is the number of trials (14 newborns in our previous example, or 5
calls to AT&T), and p is the probability of any single success (e.g., p=0.50
that any newborn was a girl, or p=0.90 that any call to AT&T directory
assistance was handled correctly).
• Thus, for the number of girls out of 14 newborns, the mean or average would be
µ = n * p = 14 * 0.50 = 7.00 girls (just like we calculated using the longer
method)
• And, for AT&T Directory Assistance calls µ = n * p = 5 * 0.90 = 4.50 calls
• Remember: The mean, standard deviation, and variance that we calculate refer
to the x’s (e.g., girls and calls) NOT to probability
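A quick way to convince yourself that the shortcut µ = n * p agrees with the longer ∑[x*P(x)] method is to compute both. The Python sketch below is an illustration under that assumption; the function names are invented for this example.

```python
from math import comb

def binomial_mean_long(n, p):
    """Mean by the longer method: sum of x * P(x) for x = 0..n."""
    return sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

def binomial_mean_short(n, p):
    """Mean by the shortcut: n * p."""
    return n * p

# 14 newborns with p = 0.50, and 5 calls with p = 0.90.
for n, p in [(14, 0.50), (5, 0.90)]:
    print(n, p, round(binomial_mean_long(n, p), 2), round(binomial_mean_short(n, p), 2))
```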
Calculating the Standard Deviation
and Variance of Binomial Probability
Distributions
• The quick formula for calculating the variance of a binomial probability
distribution is σ² = n * p * q
• Where σ² represents variance (remember σ represents standard deviation and
standard deviation squared σ² is variance), n represents the number of trials, p
represents the probability of success in any single event and q = 1 - p (e.g.,
if p=0.50 then q=1-0.50=0.50, or if p=0.90 then q=1-0.90=0.10)
• Thus, standard deviation or σ = sqrt(n * p * q)
• I expect you to remember these formulas including your p’s and q’s for the
entire course. Again p = probability and q = 1 - p
Calculating the Standard Deviation
and Variance of Binomial Probability
Distributions
Examples:
• For our problem involving the number of girls out of 14 newborns, we can now
calculate standard deviation, using Excel as follows:
q = 1 - p = 1 - 0.50 = 0.50, so standard deviation = sqrt(n * p * q) =
sqrt(14 * 0.50 * 0.50) = 1.87 girls
• And for our problem involving the number of correctly handled AT&T directory
assistance calls out of 5 calls, we can calculate standard deviation as follows:
q = 1 - p = 1 - 0.90 = 0.10, so standard deviation = sqrt(n * p * q) =
sqrt(5 * 0.90 * 0.10) = 0.67 calls
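The two calculations above follow the same three steps (q, then variance, then square root), which can be sketched in Python as follows. Python and the function name `binomial_std_dev` are assumptions for illustration.

```python
from math import sqrt

def binomial_std_dev(n, p):
    """Standard deviation of a binomial distribution: sqrt(n * p * q)."""
    q = 1 - p                 # probability of failure
    variance = n * p * q      # variance comes first
    return sqrt(variance)     # standard deviation is its square root

print(round(binomial_std_dev(14, 0.50), 2))   # 1.87 girls
print(round(binomial_std_dev(5, 0.90), 2))    # 0.67 calls
```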
Using Standard Deviation to
determine what’s “Unusual”
• Remember: In Section 2 we learned that it is unusual for a value (x) to vary
by more than two standard deviations from the mean
• We can apply this to Binomial Probability Distributions to determine whether
or not a particular x value is unusual (unusual meaning less than about a 5%
chance of occurring)
• Going back to our examples, we found that for finding girls among 14 newborns,
the mean = 7 girls, and the standard deviation = 1.87 girls.
• Therefore, among 14 newborns it would be unusual to find fewer than 3.26
girls, that is, 3 or fewer (7 - (2*1.87) = 3.26 which is the mean minus 2
standard deviations)
• It would also be unusual to find more than 10.74 girls, that is, 11 or more
(7 + (2*1.87) = 10.74 which is the mean plus 2 standard deviations)
Using Standard Deviation to
determine what’s “Unusual”
• Regarding the 5 calls to AT&T Directory Assistance, we calculated a mean = 4.5
calls, and a standard deviation = 0.67 calls.
• Therefore, among 5 calls to AT&T it would be unusual to have them correctly
handle fewer than 3.16 calls, that is, 3 or fewer (4.5 - (2*0.67) = 3.16 which
is the mean minus 2 standard deviations)
• It would also be unusual to have them correctly handle more than 5.84 calls
(4.5 + (2*0.67) = 5.84 which is the mean plus 2 standard deviations). This is
more than our total of 5 calls, so we round it down to 5, which means no
possible result is unusually high.
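The "mean plus or minus 2 standard deviations" check can be sketched in Python. This is an illustration only; the function names `usual_range` and `is_unusual` are invented here, and the inputs are the rounded means and standard deviations used in the examples above.

```python
def usual_range(mean, std_dev):
    """Return (low, high): the mean minus and plus 2 standard deviations."""
    return mean - 2 * std_dev, mean + 2 * std_dev

def is_unusual(x, mean, std_dev):
    """An x value is unusual if it falls outside the usual range."""
    low, high = usual_range(mean, std_dev)
    return x < low or x > high

# Girls among 14 newborns: mean 7, standard deviation 1.87.
print(is_unusual(3, 7, 1.87))      # True: 3 is below 3.26
print(is_unusual(10, 7, 1.87))     # False: 10 is inside 3.26 to 10.74

# Correct calls out of 5: mean 4.5, standard deviation 0.67.
print(is_unusual(5, 4.5, 0.67))    # False: 5 is below 5.84
```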
Poisson Probability Distributions
Poisson Probability Distributions
These distributions are typically used to describe
arrivals of people, things, or occurrences over time.
For example, a Poisson probability distribution could
be used to describe people arriving at Golden Corral
Restaurant, or airplanes arriving at Salt Lake
International Airport, or earthquakes arriving in Japan.
It is amazingly accurate at modeling such arrivals.
Notation for Poisson Probability
Distributions
Although the following formula can be used to manually calculate
probability in a Poisson probability distribution, I DO NOT expect you
to know or remember it. However, I DO EXPECT you to be able to use
Excel to calculate Poisson probability distributions
$$P(x) = \frac{\mu^{x} \cdot e^{-\mu}}{x!} \quad \text{where } e \approx 2.71828$$
Where,
µ = mean or average arrival rate during a given time period
x = specific number of arrivals during a given time period
x! = factorial x
P(x) = probability of exactly x arrivals during a given time period
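For anyone curious how the formula behaves with real numbers, here is a minimal Python sketch (an assumption for illustration; the course expects Excel). The function name `poisson_probability` is made up, and its arguments follow the x and µ notation above.

```python
from math import exp, factorial

def poisson_probability(x, mu):
    """P(x): probability of exactly x arrivals when the average rate is mu."""
    return mu**x * exp(-mu) / factorial(x)

# Exactly 2 arrivals when the average is 0.93 arrivals per time period.
print(round(poisson_probability(2, 0.93), 3))   # 0.171
```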
Requirements for a Poisson Probability Distribution
• The random variable x represents the number of occurrences or arrivals of an
event over some specific time period
• The occurrences or arrivals must be random
• The occurrences or arrivals must be independent of each other
• The occurrences or arrivals must be uniformly distributed over the time period
being used. For example, you couldn’t use it to model arrivals at a restaurant
for a whole day when the restaurant was extra-busy during lunch time and slow
during the afternoon.
Poisson Probability Distribution
Parameters (mean, standard deviation, variance)
• With a Poisson probability distribution, the mean (µ) is either given or
determined from observation or experiment. However, IF n ≥ 100 AND n*p ≤ 10,
then it is also possible to calculate the mean of a Poisson distribution by
using the same formula we used with a binomial distribution, that is µ = n * p
• The standard deviation of a Poisson distribution is calculated using the
following formula:
σ = sqrt(µ)
Poisson Probabilities using Excel
Example: A classic example of the Poisson distribution involves the number
of deaths caused by horse kicks in the Prussian Army between 1875 and
1894. During that 20-year period there were 196 deaths by horse kick.
That’s an average of 196/20 = 9.8 horse-kick deaths per year in the
Prussian Army. Remember that we have to have an average per time
period in order to use the Poisson distribution (this is key). Now, let’s use
Excel to determine the probability for various numbers of horse-kick
deaths per year (there is no set number to investigate, we usually just
keep going until the probability drops to near-zero – you’ll see what I mean
when you try it). If you want to follow along and see a movie of me doing
this in Excel, just click anywhere on the next slide (or open the windows
media file called poisson1.wmv)
1. Set up an Excel spreadsheet with one column labeled x to represent the number
of horse-kick deaths per year in the Prussian Army, and another column labeled
P(x) to represent the probability of finding exactly x number of horse-kick
deaths in the Prussian Army during any given year.
x (horse-kick deaths/year)    P(x)
0
1
2
3
4
5
6
Poisson Probabilities using Excel
2. Click on fx (function), then select the category called “Statistical”, and
the function called POISSON. A shortcut alternative is to simply type the
function =POISSON in the first empty cell under the P(x) column
3. The function POISSON requires the following fields
=POISSON(x, Mean, Cumulative)
x represents the number of occurrences or arrivals. E.g., the number of
horse-kick deaths per year in the Prussian Army
Mean is the average rate of occurrences or arrivals during a specific time
period. E.g., for the Prussian Army this was the average horse-kick deaths per
year, which we calculated as 196 deaths divided by 20 years equals an average of
9.8 deaths per year.
Cumulative allows us to specify whether we want the result to be the probability
of exactly x occurrences (this is normally the case and we enter “0” or “false”)
or whether we want the results to “cumulate” the probabilities by adding all
previous P(x)’s to the current x to give us the probability of x OR fewer
occurrences. To cumulate, enter “1” or “True”.
For 0 horse-kick deaths per year the Excel formula looks like this =POISSON(0,9.8,FALSE)
Poisson Probabilities using Excel
4. Next, we can copy the POISSON formula
down the P(x) column AND continue to add
x’s until the probability drops to 0 (or near
zero). We now have the Poisson probability
distribution for how many horse-kick deaths
per year we expect to find in the Prussian
Army.
• It is interesting to note that the Poisson
distribution is very accurate in predicting the
probability of horse-kick deaths in the
Prussian Army.
• Notice how all the probabilities are between 0 and 1 and they all add up to
1.000. Remember, these rules must be met in order to have a valid probability
distribution.
x (horse-kick deaths/year)    P(x)
0                             0.000
1                             0.001
2                             0.003
3                             0.009
4                             0.021
5                             0.042
6                             0.068
7                             0.096
8                             0.117
9                             0.127
10                            0.125
11                            0.111
12                            0.091
13                            0.068
14                            0.048
15                            0.031
16                            0.019
17                            0.011
18                            0.006
19                            0.003
20                            0.002
21                            0.001
22                            0.000
Sum                           1.000
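The "keep adding x's until the probability drops to near zero" step can be sketched in Python. This is an illustration only: the function `poisson_table` and its stopping rule (stop once x is past the mean and P(x) rounds to 0.000) are assumptions, chosen so the loop does not stop at the small probabilities near x = 0.

```python
from math import exp, factorial

def poisson_table(mean, places=3):
    """Build (x, P(x)) rows, adding x's until P(x) rounds to zero past the mean."""
    rows = []
    x = 0
    while True:
        p = round(mean**x * exp(-mean) / factorial(x), places)
        rows.append((x, p))
        # Stop only on the right-hand tail, once the rounded probability is 0.
        if x > mean and p == 0:
            break
        x += 1
    return rows

horse_kicks = poisson_table(9.8)
for x, p in horse_kicks:
    print(x, f"{p:.3f}")
```

With a mean of 9.8 this assumed rule stops at x = 22, the same last row as the table above.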
Poisson Probabilities using Excel
• Here’s another problem for you to try.
• For a recent period of 100 years, there were 93 major earthquakes (at least
6.0 on the Richter scale) in the world (based on data from the World Almanac and
Book of Facts). Use the Poisson distribution to find the probabilities of 0-8
major earthquakes during any given year.
• I’ll provide the answers on the next slide, but you must know how to do and
interpret these distributions, so I strongly encourage you to practice this
several times at least.
Poisson Probabilities using Excel
• With 93 major earthquakes in the last 100 years, we can calculate an average
of 93 / 100 = 0.93 major earthquakes per year. Using Excel we can now use the
=POISSON function to calculate the probabilities of exactly x number of
earthquakes in any given year. (click anywhere on the next slide or open file
poisson2.wmv to see me do this) Solution:
x (earthquakes)    P(x)
0                  0.395
1                  0.367
2                  0.171
3                  0.053
4                  0.012
5                  0.002
6                  0.000
7                  0.000
8                  0.000
Sum                1.000
P(0) is calculated with =POISSON(0,0.93,FALSE)
Probability of exactly 5 earthquakes in any year is 0.002 or 0.2%
• Just for fun, let’s compare our Poisson distribution, which gives us a
calculated probability of 0-8 earthquakes during any given year, to the actual
historical record of earthquakes to see if it did a good job predicting the
actual probability. Wow! Almost spooky isn’t it? The Poisson distribution was
pretty accurate.
From our Poisson probabilities, we can predict the number of years in which
there are x number of major earthquakes by taking the Poisson P(x) * 100 years.
The Actual number of years in the last 100 years that experienced x number of
major earthquakes is based on historical records.
Predicted    Actual
39           47        years in which there were 0 major earthquakes
37           31        years in which there were 1 major earthquakes
17           13        years in which there were 2 major earthquakes
5            5         years in which there were 3 major earthquakes
1            2         years in which there were 4 major earthquakes
0            0         years in which there were 5 major earthquakes
0            1         years in which there were 6 major earthquakes
0            1         years in which there were 7 major earthquakes
0            0         years in which there were 8 major earthquakes
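The Predicted column can be reproduced with a short Python sketch (an assumption for illustration; the course uses Excel). It multiplies each Poisson P(x) by 100 years and rounds to whole years; the Actual counts are copied from the comparison above, and the variable names are invented for this example.

```python
from math import exp, factorial

years = 100
mean_per_year = 93 / years   # 0.93 major earthquakes per year

# Actual number of years with x = 0..8 major earthquakes, from the comparison above.
actual_years = [47, 31, 13, 5, 2, 0, 1, 1, 0]

# Predicted years = Poisson P(x) * 100 years, for x = 0..8.
predicted_years = [
    years * mean_per_year**x * exp(-mean_per_year) / factorial(x)
    for x in range(9)
]

for x, (predicted, actual) in enumerate(zip(predicted_years, actual_years)):
    print(x, round(predicted), actual)
```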