The Meaning of Independence
in Probability and Statistics
Henry Mesa
Use your keyboard’s arrow keys to move the slides
forward (▬►) or backward (◄▬)
If you want to stop the slide show use the Esc key on your
keyboard.
As you view the slides have paper and pencil handy. Take down
notes, and when asked to guess at a result do so before going on. If
something does not make sense, go back through the slides using
the backward (◄▬) key on your keyboard. If the slides do not
make sense to you then write down your question and ask your
instructor.
The concept of independence is elusive for students.
Students often view independence as a cause-and-effect
issue. Ask a student whether two events are independent,
and the initial response is often “one event has
nothing to do with the other event, so they are independent.”
Other times students confuse disjoint events with
independent events; “The two events are disjoint, so they
must be independent,” is a common response.
The reality is that if two events are disjoint (and each has a
nonzero probability), then the events cannot be independent.
So what is independence then?
What follows is an attempt to make the meaning
absolutely clear, but, in plain words, independence
has to do with a change in probability. More to be
said as we continue. We need to make sure we
understand the basics of how we can measure a
probability.
If you were asked what is the probability of throwing a fair
die and having a three appear you would not hesitate to say
one-sixth.
P(throw a 3) = 1/6
Suppose that you are in a classroom with 20 people,
and you are told that four were born on your
birthdate (month and day). What is the probability of
choosing a person at random and having that person
share your birthday?
P(same date) = 4/20 = 1/5
I am sure you would not have any trouble saying
that it is one-fifth.
In both cases you made some major assumptions. For the die
you assumed that any side is equally likely, after all, it was a
fair die. You also realized that there are six sides of which
only one contains a three. Thus,
P(throw a 3) = 1/6
For the classroom situation, you again used the same
logic as in the die problem. There are twenty people
and four share your birth date.
P(same date) = 4/20 = 1/5
The key is the sample space. The sample space contains all
outcomes of some random phenomenon. For the die there are
six items in the sample space, all equally likely; thus,
P(throw a 3) = 1/6
For the classroom situation, there are twenty items
in the sample space, of which four meet your criteria;
thus,
P(same date) = 4/20 = 1/5
The sample space represents the whole, everything that can
occur when viewing a random phenomenon. Think about what
a fraction can represent.
part / whole = (outcomes that meet the criteria) / (all that can occur)
P(throw a 3) = 1/6
P(same date) = 4/20 = 1/5
Is this important in order to understand independence?
YES!
Why? Because, the concept of independence
depends on the sample space.
Here is how independence is going to be explained.
What is the chance of rolling a three when you roll two
dice and add up the faces?
Now ask the same question, “what is the chance of
rolling a three,” if you roll three dice and add up the faces?
While both are asking the same question the sample space has
changed; in one scenario you are tossing two dice (all possible
outcomes of two dice), and in the other three dice. And this is at
the heart of the concept of independence.
You ask a question based on a particular sample space. Now change
the sample space, and ask the same question. If the probability
stays the same then we have independence!
If the answer to both questions had been, for example, 0.3, then it
would not matter that I added another die. While the sample space
changed (all possible combinations of three dice), the probability
would not have changed.
By the way, the probability does change: 2/36 ≈ 0.0556 versus 1/216 ≈ 0.00463.
This means that we don’t have independence.
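The two probabilities quoted above can be checked by listing every outcome in each sample space. What follows is a minimal sketch, assuming fair six-sided dice; the helper name prob_sum is hypothetical and is introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_sum(num_dice, target):
    """Probability that num_dice fair six-sided dice sum to target."""
    # The sample space: every equally likely ordered outcome of the dice.
    sample_space = list(product(range(1, 7), repeat=num_dice))
    # The part of the sample space that meets the criterion.
    favorable = [outcome for outcome in sample_space if sum(outcome) == target]
    return Fraction(len(favorable), len(sample_space))


two_dice = prob_sum(2, 3)    # (1, 2) and (2, 1) out of 36 outcomes
three_dice = prob_sum(3, 3)  # only (1, 1, 1) out of 216 outcomes

print(two_dice, float(two_dice))
print(three_dice, float(three_dice))
```

The question is the same in both calls; only the sample space (the number of dice) differs, and the resulting fractions differ with it.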
The concept of independence depends on a change in
probability when we change the sample space.
P(event A) = a
P(event A in a different sample space) = ?
If
P(event A in a different sample space) = P(event A)
We have independence.
Organization in statistics is vital to properly communicate
your meaning with others as well as to communicate with
yourself. Yes, yourself. Have you ever written something
down that was very clear as you were writing it, then
hours later, when visiting those same notes you are
confused as to the meaning of your writing?
Thus, we need notation to clearly denote when we have switched
sample space. In everyday writing this occurs all the time, and it
is up to the reader to comprehend when a change in sample space
has occurred.
“Ten percent of the adult women in Texas are infected with the
human papilloma virus (HPV). Of the 18-24 group, 25% of the
women are infected with HPV.”
Notice that the first probability (proportion), the 10%,
concerns adult women in Texas.
However, the second probability does not concern all adult
women in Texas, but a subgroup of the original group,
which consists of women aged 18 to 24 years.
First sample space is all adult women in Texas;
P(HPV) = 0.1
The second sample space concerns all adult women
in Texas in the age group 18-24; P(HPV) = 0.25.
To denote that there has been a change in sample
space with respect to the original probability, I will
use this notation called conditional probability
notation.
P(HPV) = 0.1
P(HPV | 18 – 24) = 0.25   (conditional probability)
The vertical line can be read “given that.” What it does is alert the
reader that the group (sample space) that was the focus recently
has changed. The question has not changed but the group has.
The vertical line signals the required condition (group
change/sample space change) for the question.
The notation P(A | B) is conditional probability notation. It states the probability of
event A given that we are now only considering the sample space that is defined by
event B.
“I think the New York Yankees have a 70% chance this year of making
the World Series,” comments Bob. “Haven’t you heard?” exclaims
Tanya. “Derek Jeter and Alex Rodriguez are both out of the lineup for
the entire season! I give them a 30% chance.”
Both people are giving the chance of the Yankees making the World Series,
but they are speaking from two different perspectives (sample spaces). I
could encode the first speaker’s probability statement as
P(make W.S.) = 0.70
But to denote a change in sample space for the second speaker I can use
conditional notation.
P(make W.S. | No DJ and No AR) = 0.3
The first speaker assumed that both of the mentioned players are in the
lineup, but the second probability makes it clear they are not.
So how does the conditional probability notation help in
understanding what independence is?
If P(A | B) = P(A) then we have independence between
events A and B. Also, if the above is true, so is
P(B | A) = P(B).
What!?!!!!
What the above notation says is that if P(A) = 0.7, for example, and P(A |
B), “the probability of A but from the perspective of the sample space
named B,” is also 0.7, then our probability of event A has not
changed even though we changed sample space.
P(A | B) = P(A)
And thus the events are independent.
Note that a sample space can also be an event. A sample space is
defined by the user, just like an event. If I toss a die, but I decide to
ignore whenever a 1 shows up, then my sample space is {2, 3, 4, 5, 6}.
It may seem crazy or arbitrary to eliminate an actual possibility, but
people do this all the time; “If the ball lands beyond this line, it
does not count.”
Please, trust me. We are on a journey of discovery, and
discovery takes time. What we need is another simple
example to start putting these ideas together.
Consider a standard deck of 52 cards.
There are four suits: Diamonds, Clubs,
Hearts, and Spades. Each suit
consists of 13 cards.
Now, here is the first question.
I choose a card at random out of a shuffled deck. I don’t let you see it, but
I ask you, “What is the probability that the card I hold is an ace?”
Since you do not know any better, you would say P(ace) = 4/52, since
there are four aces in a deck of 52 cards. You are assuming that
any of the cards in the deck is equally likely to be chosen.
I peek at my card, and I tell you I am going to give you a hint. The card
I hold is a diamond. By saying this, we have established that we
are no longer considering all 52 cards; the sample space has changed
from a deck of 52 cards to a deck of 13 cards, all diamonds.
Using notation, we have P(ace | diamond) = ?
But wait, has the hint helped at all? P(ace | diamond) = 1/13, which is
the same as 4/52. I have changed the sample space but my probability
has not changed!
What does this mean? WE HAVE INDEPENDENCE!
The event “a card is an ace” and the event “a card is a diamond”
are independent.
Using notation, we have P(ace | diamond) = P(ace).
Also note that P(diamond) = 13/52 = ¼, and P(diamond | ace) = ¼ as well.
If P(A | B) = P(A) then we have independence. Also, if the above is
true, so is P(B | A) = P(B).
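The card example can be checked in both directions by building the deck and counting. This is only a sketch: the helper prob and the predicates is_ace and is_diamond are hypothetical names chosen for illustration, and every card is assumed equally likely.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["diamonds", "clubs", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) cards


def prob(event, sample_space):
    """Fraction of the given sample space that satisfies the event."""
    hits = [card for card in sample_space if event(card)]
    return Fraction(len(hits), len(sample_space))


def is_ace(card):
    return card[0] == "A"


def is_diamond(card):
    return card[1] == "diamonds"


# Changing the sample space means keeping only the cards named by the hint.
diamonds_only = [card for card in deck if is_diamond(card)]
aces_only = [card for card in deck if is_ace(card)]

p_ace = prob(is_ace, deck)
p_ace_given_diamond = prob(is_ace, diamonds_only)
p_diamond = prob(is_diamond, deck)
p_diamond_given_ace = prob(is_diamond, aces_only)

print(p_ace, p_ace_given_diamond)
print(p_diamond, p_diamond_given_ace)
```

Each pair printed on one line is the same fraction, which is the statement P(A | B) = P(A) together with P(B | A) = P(B) for these two events.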
What does this mean at a practical level? It means the new information did
not change my odds, and this is very important!
Consider the following exchange.
A glum-looking man and a woman walk into a doctor’s office. The woman
suddenly blurts out, “My husband has HIV!”
The doctor is taken by surprise. He asks, “What makes you think so?”
“He took an over-the-counter HIV test and it came out positive.”
Now the doctor is familiar with this test, and he knows that it produces
a positive result 75% of the time if you have HIV and it also produces
a positive result 4% of the time if you do not have HIV.
P(positive result | HIV) = 0.75    P(positive result | not HIV) = 0.04
Notice that the two probabilities of a positive result involve two different populations (sample
spaces): people with HIV and people without HIV.
Now, the HIV rate among all adults in the county that the man is from is
0.001, a very low rate.
So the doctor will attempt to find out if the man is from a special group within the
county. That is, the doctor knows that if someone walks in from the street at
random they have a 0.001 chance of being infected. But are all the groups the
same?
The wife adds, trying to be helpful, “My husband bowls regularly.” Now the
doctor looks puzzled at that revelation since he cannot think why that is
relevant. The doctor wants to see if the man in front of him is
from a high-risk group, which would raise the probability of 0.001 to a higher level
and thus make a positive test result more meaningful. But bowling is
not what the doctor has in mind.
As a matter of fact, the doctor is thinking that if the man is a regular bowler, his
chance of having HIV is 0.001, the same as for the population: P(HIV | regular
bowler) = 0.001. That is, having “HIV” and being a “regular bowler” are
independent events. The proportion of HIV cases among those who bowl
regularly is the same as in the population those bowlers come from. Notice that
being told that the person bowls regularly did not add any information
(no change in probability); no change in the probability while changing the
sample space means independence.
The doctor was thinking along the lines of some potentially very embarrassing
questions, which would put the man in a high-risk group and thus give more
credibility to the positive test result.
• Does the husband or wife engage in extramarital affairs?
• Does the husband use drugs that involve injection and the potential for using
contaminated needles?
In other words, the doctor is thinking along the lines of putting the man in a
high-risk group (changing the man’s grouping is the same as changing the
sample space).
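The slides do not carry out the calculation behind “more credibility,” so here is one hedged way to see it. The sketch applies Bayes' rule, a standard result that the slides do not name, to the two test rates given above. The function name prob_hiv_given_positive is hypothetical, and the 0.05 rate for a high-risk group is an invented number used only for illustration; it does not come from the slides.

```python
def prob_hiv_given_positive(prior, sensitivity=0.75, false_positive_rate=0.04):
    """Bayes' rule: P(HIV | positive result), given P(HIV) for the group."""
    true_positives = sensitivity * prior                   # P(positive and HIV)
    false_positives = false_positive_rate * (1 - prior)    # P(positive and not HIV)
    return true_positives / (true_positives + false_positives)


# County-wide rate from the slides: P(HIV) = 0.001.
general = prob_hiv_given_positive(prior=0.001)

# ASSUMPTION: 0.05 is a made-up rate for some high-risk group, not a real figure.
high_risk = prob_hiv_given_positive(prior=0.05)

print(round(general, 4))
print(round(high_risk, 4))
```

Under these numbers a positive result for a randomly chosen county resident corresponds to a chance of infection below 2%, while the same result in the hypothetical high-risk group corresponds to roughly one half. Learning that the man bowls leaves the starting rate at 0.001, so it leaves the first figure unchanged.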
We need another example to help us better understand independence.
Suppose a virus is affecting a community. Out of 200,000 people, 40,000 are
infected. What is the probability that someone is infected?

Age group (in years)    under 18   18 - 30   30 - 50   50 - 65   over 65     Total
Infected                                                                     40000
Not Infected                                                                160000
Total                      34000     50000     78000     26000     12000    200000

P(infected) = 40000/200000 = 0.2
Thus, 20% of the population is infected with the virus.
Suppose we further broke down those that are infected according to their
age.
Age group (in years)    under 18   18 - 30   30 - 50   50 - 65   over 65     Total
Infected                    6800     10000     15600      5200      2400     40000
Not Infected               27200     40000     62400     20800      9600    160000
Total                      34000     50000     78000     26000     12000    200000

P(infected) = 40000/200000 = 0.2
Suppose that a person is 18 – 30 years of age. What is the probability that
this person is infected?
Notice that the question suggests that the sample space has changed!
Thus, I will use the correct function notation to describe the question.
P(infected | 18 – 30)
Notice that the sample space is not the entire 200,000. We are
told the person is 18 – 30 years of age. Thus, the population
has been reduced to the 50,000 people in that age group.
P(infected | 18 – 30) = 10000/50000 = 0.2
The probability continues to be 0.2, no change from P(infected).
P(infected) = 40000/200000 = 0.2
P(infected | 18 – 30) = 10000/50000 = 0.2
This means that the event, being an 18-30 year old, is
independent of being infected. That is 18-30 year olds
get infected at the same rate as the entire population.
We changed the sample space from the entire population of
200,000 to the 50,000 within the 200,000.
The questions that follow are all of that same type.
A person is chosen at random from this population. What is the probability
that the person is under 18? Try to find the answer first before continuing.
P(under 18) = 34000/200000 = 0.17
An infected person is chosen at random. What is the probability that this
person is under 18? Try to find the answer first before continuing.
P(under 18 | infected) = 6800/40000 = 0.17
Since P(under 18 | infected) = P(under 18), we have independence
between the two events. This implies that the proportion of under-18-year-olds
among the infected is the same as the proportion of under-18-year-olds in
the whole population.
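The same comparison can be repeated for every age group at once. The sketch below copies the counts from the table in these slides; the variable names are hypothetical and chosen only for readability.

```python
from fractions import Fraction

# Counts copied from the age-group table in the slides.
infected = {"under 18": 6800, "18-30": 10000, "30-50": 15600,
            "50-65": 5200, "over 65": 2400}
not_infected = {"under 18": 27200, "18-30": 40000, "30-50": 62400,
                "50-65": 20800, "over 65": 9600}

total_infected = sum(infected.values())
total_people = total_infected + sum(not_infected.values())
p_infected = Fraction(total_infected, total_people)

for group in infected:
    group_size = infected[group] + not_infected[group]
    # Sample space reduced to one age group.
    p_infected_given_group = Fraction(infected[group], group_size)
    # Sample space reduced to the infected people.
    p_group_given_infected = Fraction(infected[group], total_infected)
    p_group = Fraction(group_size, total_people)
    print(group,
          float(p_infected_given_group), p_infected_given_group == p_infected,
          float(p_group), p_group_given_infected == p_group)
```

With these counts both comparisons come out equal for every age group, so in this table being infected is independent of each age group, whichever of the two events is used as the new sample space.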
Let’s compare the following question with the two previous questions.
What is the probability that a person chosen at random is a 30 to 50 year
old who is not infected? Attempt to write the question using function notation
with the correct conjunction.
P(30 – 50 AND Not Infected) = 62400/200000 = 0.312
Notice that in this question we are not assuming that either event has occurred.
The sample space continues to be the original population.
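To make the contrast between an “AND” question and a “given that” question concrete, the sketch below computes both from the same cell of the table. It also multiplies the two separate probabilities; the multiplication rule for independent events is a standard fact that these slides do not state, so treat that last line as an added illustration. All variable names are hypothetical.

```python
from fractions import Fraction

# Counts copied from the age-group table in the slides.
total_people = 200000
age_30_50_total = 78000
not_infected_total = 160000
age_30_50_not_infected = 62400

# "AND" question: the sample space is still the whole population.
p_joint = Fraction(age_30_50_not_infected, total_people)

# "Given that" question: the sample space shrinks to the 30-50 age group.
p_not_infected_given_30_50 = Fraction(age_30_50_not_infected, age_30_50_total)

# Product of the two separate probabilities, each over the whole population.
p_product = (Fraction(age_30_50_total, total_people)
             * Fraction(not_infected_total, total_people))

print(float(p_joint), float(p_not_infected_given_30_50), float(p_product))
```

The joint probability and the conditional probability differ because their denominators differ. The product matches the joint probability here because, in this table, age group and infection status are independent.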
The table below shows the class of passenger aboard the Titanic and who survived
the accident.

                        Class
Survival    First   Second    Third     Crew    Total
Alive         202      118      178      212      710
Dead          123      167      528      673     1491
Total         325      285      706      885     2201
Is the survival rate on the Titanic independent of passenger class?
To answer the question, let’s restate it more specifically. Is the survival rate independent
of being a first class passenger, for example? One way to answer this is to check whether
P(Alive) = P(Alive | First) or P(First | Alive) = P(First). If one is true, so is the other. Try
to answer this on your own first.
P(Alive) = 710/2201 ≈ 0.3226
This says that about 32.26% of the people aboard the
Titanic survived; roughly one-third.
P(Alive | First) = 202/325 ≈ 0.6215
The second result says that 62.15% of the first
class passengers survived. Clearly, we do not
have independence. The chance of surviving
on the Titanic was better if you were a first class
passenger. Notice that both questions concern
surviving the ship accident, but in the second
question we have changed the sample space.
Let’s ask a similar question again.
Are the events “a person is Alive” and “a person is a Third class
passenger” independent? Try to answer the question on your own
first.
P(Alive) = 710/2201 ≈ 0.3226
This says that about 32.26% of the people aboard the
Titanic survived; roughly one-third.
P(Alive | Third) = 178/706 ≈ 0.2521
It turns out that of the third class passengers (new
sample space) only 25.21% survived. Clearly,
the events are not independent. While about 1/3 of those aboard
the Titanic survived, only about 25% of the
third class survived.
Does it matter what I make the new sample space?
No, let’s answer the
same question again.
Are the events “a person is Alive” and “a person is a Third class passenger”
independent?
P(Third) = 706/2201 ≈ 0.3208
This says that about 32.08% of the people aboard the
Titanic were Third class; roughly one-third.
P(Third | Alive) = 178/710 ≈ 0.2507
So, if you interviewed a survivor of the
Titanic chosen at random, there would be a roughly ¼
chance that this person was from third class.
The result is the same. We do not have
independence.
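The Titanic comparisons can be run for every class at once. The sketch below copies the counts from the table in these slides; the dictionary layout and variable names are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Counts copied from the Titanic table in the slides: (alive, dead) per class.
titanic = {"First": (202, 123), "Second": (118, 167),
           "Third": (178, 528), "Crew": (212, 673)}

total_alive = sum(alive for alive, _ in titanic.values())
total_aboard = sum(alive + dead for alive, dead in titanic.values())
p_alive = Fraction(total_alive, total_aboard)
print("P(Alive) =", round(float(p_alive), 4))

for cls, (alive, dead) in titanic.items():
    # Sample space reduced to one class.
    p_alive_given_cls = Fraction(alive, alive + dead)
    # Sample space reduced to the survivors.
    p_cls_given_alive = Fraction(alive, total_alive)
    p_cls = Fraction(alive + dead, total_aboard)
    print(cls,
          round(float(p_alive_given_cls), 4), p_alive_given_cls == p_alive,
          round(float(p_cls), 4), round(float(p_cls_given_alive), 4))
```

For every class the conditional survival rate differs from the overall rate, so with these counts survival is not independent of any of the four groups.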
From the examples, you can see that not having independence has to
do with a change in the probability of some given event due to a change in
the sample space.
We ask a probability question, and get a response. Change the sample
space and ask the same question. If the probability does not change,
then we have independence.
If P(A | B) = P(A) then we have independence. Also, if the above is
true, so is P(B | A) = P(B).
If the probability does change P(A | B) ≠ P(A) then we do not have
independence.
Why is independence important?
For probability theory, it enables us to calculate probabilities in a different
manner.
For statistics, having independence indicates that no new information is
provided by changing the sample space.
«The Effect of Country Music on Suicide»
(S. Stack and J. Gundlach; Wayne State University and Auburn University; 1992)
"The greater the airtime devoted to country music, the greater the white suicide rate"
According to the authors, Steven Stack and Jim Gundlach, the paper "assesses the link
between country music and metropolitan suicide rates. Country music is hypothesized to
nurture a suicidal mood through its concerns with problems common in the suicidal
population, such as marital discord, alcohol abuse, and alienation from work. The results of a
multiple regression analysis of 49 metropolitan areas show that the greater the airtime
devoted to country music, the greater the white suicide rate. The effect is independent of
divorce, southernness, poverty, and gun availability. The existence of a country music
subculture is thought to reinforce the link between country music and suicide."
Notice that the second-to-last sentence attempts to change the sample space
to see if the suicide rate changes. But the authors suggest that the new
sample spaces (divorced people, or living in a Southern state, or classified as
living in poverty, or owning a gun) did not alter the rate (probability).
The End
To make the most of these slides, read your text, attempt some
problems and regardless of how you do in those homework problems
view these slides again. Obviously if you did well in the homework
assignments you will feel comfortable with the ideas presented, but
make sure that you actively view these slides; paper and pencil in
hand. You should be able to anticipate the answers to the questions
posed. If you did not do well in the homework, then see if a missing
part of your understanding can be found in the slides; you should write
down as specifically as possible what you believe is missing in your
understanding. What is it that you do not understand?