lecture5 - University of Michigan
School of Information
University of Michigan
Discrete and continuous
distributions
Where does the binomial coefficient come
from?
Suppose I have 7 blue and pink balls, each of them uniquely marked so that I can
distinguish them
[Figure: seven balls labeled A B C D E F G]
How many different samples can I draw containing the same balls but
in a different order?
7!
[Figure: the same seven balls in a shuffled order]
I have 7 choices for the first spot, 6 choices for the second (since I’ve
picked 1 and now have only 6 to choose from),
5 choices for the third, etc.
7! = 7 * 6 * 5 * 4 * 3 * 2 * 1
Now if I am just counting the number of blue and pink balls, I don’t care
about the order.
So all possible arrangements (3!) of the pink balls look the same to me
[Figure: the 3! = 6 arrangements that differ only in the order of the pink balls E, F, G]
A B F D E C G
A B G D E C F
A B E D F C G
A B G D F C E
A B E D G C F
A B F D G C E
So instead of having 7! arrangements, we have 7!/3! arrangements,
because the 3! = 6 orderings of the pink balls that we counted as different
before are now equivalent
The same goes for the blue balls, if we can’t tell them apart, we lose a
factor of 4!
$$\text{Binomial coefficient} = C(n,k) = \frac{\text{number of ways of arranging } n \text{ different things}}{(\text{\# of ways to arrange } k \text{ things}) \cdot (\text{\# of ways to arrange } n-k \text{ things})} = \frac{n!}{k!\,(n-k)!}$$
Note that the binomial coefficient is symmetric – there are the same
number of ways of choosing k or n-k things out of n
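The counting argument above can be checked numerically. This is a minimal sketch using base R's `factorial()` and `choose()`; the variable names are illustrative and not from the lecture.

```r
# Count arrangements of 7 uniquely marked balls, 3 pink and 4 blue
n <- 7   # total number of balls
k <- 3   # number of pink balls

orderings <- factorial(n)   # 7! orderings when every ball is distinct

# Divide out the orderings of the pink balls and of the blue balls,
# since those become indistinguishable
coef_by_hand <- factorial(n) / (factorial(k) * factorial(n - k))
coef_builtin <- choose(n, k)   # built-in binomial coefficient

print(c(orderings = orderings, by_hand = coef_by_hand, builtin = coef_builtin))

# Symmetry: choosing k out of n is the same as choosing n - k out of n
symmetric <- choose(n, k) == choose(n, n - k)
print(symmetric)
```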
We’ve got the coefficient, what is the
distribution about?
Suppose your sample of 7 is actually drawn from a very
large population
(so large that it is basically unaffected by the removal of a
measly 7 balls)
p = probability that ball is pink
(1-p) = probability that ball is not pink (blue)
The probability that you draw a sample with 3 pink balls
and 4 blue balls in a particular order e.g. (two pink
followed by 3 blues, followed by a pink followed by a
blue) is
prob(pink)*prob(pink)*prob(blue)*prob(blue)*prob(blue)*prob(pink)*prob(blue)
= p^3 * (1-p)^4
But the binomial distribution just tells us what the
probability is of drawing e.g. 3 pink balls, not 3
pink balls at a particular point in the draw
The probability that you draw a sample with 3 pink
balls and 4 blue balls in no particular order is
= C(7,3) * p^3 * (1-p)^4
(each of the C(7,3) possible orders has probability p^3 * (1-p)^4, and we add them up)
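As a rough check of the step from "a particular order" to "no particular order", the sketch below enumerates the possible positions of the 3 pink balls with `combn()` and compares the result with `dbinom()`. The value p = 0.3 is an arbitrary assumption chosen only for illustration.

```r
p <- 0.3   # assumed probability of pink (illustrative value only)

# Probability of one particular order with 3 pink and 4 blue balls
one_order <- p^3 * (1 - p)^4

# Enumerate which 3 of the 7 positions hold the pink balls:
# combn() returns one column per possible arrangement
positions <- combn(7, 3)
n_orders <- ncol(positions)

# Every order has the same probability, so summing is just multiplying
any_order <- n_orders * one_order

print(c(n_orders = n_orders, any_order = any_order, dbinom = dbinom(3, 7, p)))
```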
Probability distribution
A probability distribution lists all the possible
outcomes and their probabilities
Outcomes are mutually exclusive
e.g. drawing 0, 1, 2, 3… pink balls
Outcome probabilities sum to one
e.g. when drawing 7 balls, the number of pink balls has to be
one of {0,1,2,3,4,5,6,7}
Denote p(x) to mean P(X=x), that is the
probability that the outcome is x
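The two properties above (mutually exclusive outcomes, probabilities summing to one) can be illustrated with a small table of p(x). This sketch assumes 7 draws with equally likely pink and blue balls and uses base R's `dbinom()` to fill in the probabilities.

```r
# All possible outcomes when drawing 7 balls: 0, 1, ..., 7 pink balls
x <- 0:7

# p(x) = P(X = x), assuming pink and blue are equally likely
px <- dbinom(x, size = 7, prob = 0.5)
names(px) <- x
print(round(px, 4))

# The outcome probabilities sum to one
total <- sum(px)
print(total)
```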
Binomial distribution
The binomial distribution tells us the probability
of drawing k pink balls out of n
It depends on
n = the number of trials (draws)
k = the number of pink balls (successes)
p = the probability of drawing a pink ball (success)
$$p(n,k) = \binom{n}{k}\, p^{k} (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}$$
the binomial distribution in R
dbinom(x, size, prob)
> barplot(dbinom(0:7,7,0.5),names.arg=0:7)
> dbinom(3,7,0.5)
[1] 0.2734375
if blue and pink balls are equally likely
[Figure: bar plot of the binomial probabilities for 0 to 7 pink balls, n = 7, p = 0.5]
what if p ≠ 0.5?
> barplot(dbinom(0:7,7,0.1),names.arg=0:7)
[Figure: bar plot of the binomial probabilities for 0 to 7 pink balls, n = 7, p = 0.1]
What is the mean?
mean of a binomial distribution is just n*p
in general $$\mu = E(X) = \sum_x x\, p(x)$$
where the p(x) are probabilities that sum to 1
e.g. for n = 7, p = 0.5:
0*p(0) + 1*p(1) + 2*p(2) + 3*p(3) + 4*p(4) + 5*p(5) + 6*p(6) + 7*p(7) = 3.5
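The weighted sum for the mean can be computed directly. A minimal sketch, assuming the same n = 7 and p = 0.5 as in the bar plot; the variable names are illustrative.

```r
n <- 7
p <- 0.5
x <- 0:n

# Probability of each outcome
px <- dbinom(x, n, p)

# Mean from the general definition: sum of x * p(x)
mu <- sum(x * px)

# Compare with the binomial shortcut n * p
print(c(by_definition = mu, shortcut = n * p))
```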
What is the variance?
variance of a binomial distribution is just
n*p*(1-p)
in general $$\sigma^2 = E[(X-\mu)^2] = \sum_x (x-\mu)^2\, p(x)$$
where the p(x) are probabilities that sum to 1
e.g. for n = 7, p = 0.5 (so that the mean is 3.5):
(-3.5)^2*p(0) + (-2.5)^2*p(1) + (-1.5)^2*p(2) + (-0.5)^2*p(3) + (0.5)^2*p(4) + (1.5)^2*p(5) + (2.5)^2*p(6) + (3.5)^2*p(7)
Which distribution has greater variance?
[Figure: two bar plots of binomial distributions with n = 7, one for p = 0.5 and one for p = 0.1]
p = 0.5
var = n*p*(1-p) = 7*0.5*0.5 = 7*0.25 = 1.75
p = 0.1
var = n*p*(1-p) = 7*0.1*0.9 = 7*0.09 = 0.63
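The variance definition and the two cases compared above can be evaluated directly. In this sketch, `binom_var_by_definition` is a hypothetical helper written for illustration; it is not part of R or of the lecture.

```r
# Hypothetical helper: variance of a binomial distribution computed from
# the general definition sum((x - mu)^2 * p(x)), not the n*p*(1-p) shortcut
binom_var_by_definition <- function(n, p) {
  x <- 0:n
  px <- dbinom(x, n, p)
  mu <- sum(x * px)
  sum((x - mu)^2 * px)
}

var_half  <- binom_var_by_definition(7, 0.5)
var_tenth <- binom_var_by_definition(7, 0.1)

# Compare with the shortcut n * p * (1 - p)
print(c(p_0.5 = var_half,  shortcut = 7 * 0.5 * 0.5))
print(c(p_0.1 = var_tenth, shortcut = 7 * 0.1 * 0.9))
```

The distribution with p = 0.5 is the more spread out of the two, which matches the wider bar plot.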
briefly comparing an experiment to a distribution
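[Figure: "Histogram of y", the result of 1000 trials (x axis: y, y axis: Frequency), with the theoretical distribution drawn as a line]
experiments = 1000
tosses = 7
y = numeric(experiments)  # number of heads in each experiment
for (i in 1:experiments) {
  # toss a fair coin 7 times
  x = sample(c("H","T"), tosses, replace = TRUE)
  y[i] = sum(x=="H")
}
hist(y,breaks=-0.5:7.5)
# overlay the expected counts from the binomial distribution
lines(0:7,dbinom(0:7,7,0.5)*experiments)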
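The same experiment can also be sketched without an explicit loop by drawing the head counts directly with `rbinom()`. This is an alternative to the loop above, not the lecture's own code; the seed is arbitrary and only makes the run reproducible.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

experiments <- 1000
tosses <- 7

# Each value is the number of heads in 7 tosses of a fair coin
y <- rbinom(experiments, size = tosses, prob = 0.5)

# Observed counts for 0..7 heads (factor() keeps outcomes that never occur)
observed <- as.vector(table(factor(y, levels = 0:tosses)))

# Expected counts from the theoretical distribution
expected <- dbinom(0:tosses, tosses, 0.5) * experiments

print(data.frame(heads = 0:tosses, observed = observed,
                 expected = round(expected, 1)))
```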
cumulative distribution
aka CDF = cumulative distribution function
the probability that X is less than or equal to
some value a
$$F_X(a) = \Pr(X \le a) = \sum_{x \le a} \Pr(X = x) = \sum_{x \le a} p(x)$$
probability distribution
P(X=x)
> barplot(dbinom(0:7,7,0.5),names.arg=0:7)
cumulative distribution
P(X≤x)
> barplot(pbinom(0:7,7,0.5),names.arg=0:7)
[Figure: bar plots of the probability distribution P(X=x) and the cumulative distribution P(X≤x) for x = 0 to 7]
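The relationship between the two plots can be checked by accumulating the probabilities by hand. A minimal sketch, assuming n = 7 and p = 0.5 as in the bar plots; the variable names are illustrative.

```r
x <- 0:7

# Probability distribution P(X = x)
pmf <- dbinom(x, 7, 0.5)

# Cumulative distribution P(X <= x), built as a running sum of p(x)
cdf_by_sum <- cumsum(pmf)

# The same thing from the built-in cumulative function
cdf_builtin <- pbinom(x, 7, 0.5)

print(round(rbind(pmf, cdf_by_sum, cdf_builtin), 4))
```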
example: surfers on a website
Your site has a lot of visitors, 45% of whom are
female
You’ve created a new section on gardening
Out of the first 100 visitors, 55 are female.
What is the probability that this many or more of
the visitors are female?
P(X≥55) = 1 – P(X≤54) = 1-pbinom(54,100,0.45)
another way to calculate cumulative
probabilities
?pbinom
P(X≤x) = pbinom(x, size, prob, lower.tail = T)
P(X>x) = pbinom(x, size, prob, lower.tail = F)
> 1-pbinom(54,100,0.45)
[1] 0.02839342
> pbinom(54,100,0.45,lower.tail=F)
[1] 0.02839342
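A third way to get the same tail probability is to add up the individual probabilities for 55, 56, ..., 100 female visitors. The sketch below compares all three routes; the variable names are illustrative.

```r
n <- 100    # first 100 visitors
p <- 0.45   # share of female visitors on the site as a whole

# P(X >= 55) as an explicit sum of p(x) over x = 55..100
tail_by_sum <- sum(dbinom(55:n, n, p))

# P(X >= 55) = 1 - P(X <= 54)
tail_complement <- 1 - pbinom(54, n, p)

# P(X > 54) via the upper tail
tail_upper <- pbinom(54, n, p, lower.tail = FALSE)

print(c(tail_by_sum, tail_complement, tail_upper))
```

For probabilities very close to 0, the `lower.tail = FALSE` form is generally the safer choice, because subtracting from 1 can lose precision.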
female surfers visiting a section of a website
what is the area under the curve?
> 1-pbinom(54,100,0.45)
[1] 0.02839342
< 3%
[Figure: probability distribution and cumulative distribution of the number of female visitors out of 100]
Another discrete distribution: hypergeometric
randomly draw n elements without replacement
from a set of N elements, r of which are S’s
(successes) and (N-r) of which are F’s (failures)
hypergeometric random variable x is the number
of S’s in the draw of n elements
$$p(x) = \frac{\binom{r}{x} \binom{N-r}{n-x}}{\binom{N}{n}}$$
hypergeometric example
fortune cookies
there are N = 20 fortune cookies
r = 18 have a fortune, N-r = 2 are empty
What is the probability that out of n = 5 cookies, x = 5
have a fortune (that is, we don’t notice that some cookies
are empty)?
> dhyper(5, 18, 2, 5)
[1] 0.5526316
So there is a greater than 50% chance that we won’t
notice.
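The fortune cookie result can be reproduced from the hypergeometric formula with `choose()`. Note how the lecture's symbols map onto the arguments of `dhyper(x, m, n, k)`: m is the number of successes in the population (r), n is the number of failures (N - r), and k is the number drawn. The variable names below are illustrative.

```r
N <- 20   # fortune cookies in total
r <- 18   # cookies with a fortune (successes)
n <- 5    # cookies drawn
x <- 5    # drawn cookies that have a fortune

# Hypergeometric probability from the formula
by_formula <- choose(r, x) * choose(N - r, n - x) / choose(N, n)

# Built-in version: dhyper(x, successes, failures, number drawn)
builtin <- dhyper(x, r, N - r, n)

print(c(by_formula = by_formula, builtin = builtin))
```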
hypergeometric and binomial
When the population N is (very) big, whether one
samples with or without replacement is pretty much the
same
100 cookies, 10 of which are empty
[Figure: side-by-side bars of binomial and hypergeometric probabilities; x axis: number of full cookies out of 5]
code aside
> x = 1:5
> # hypergeometric probability
> y1 = dhyper(1:5,90,10,5)
> # binomial probability
> y2 = dbinom(1:5,5,0.9)
> tmp = as.matrix(t(cbind(y1,y2)))
> barplot(tmp,beside=T,names.arg=x)
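To see how quickly the two distributions agree as the population grows, one can repeat the comparison for several population sizes. This is a sketch under assumed values: the population sizes are arbitrary, and each keeps 10% of the cookies empty as in the example.

```r
n_draw <- 5          # cookies drawn
x <- 0:n_draw        # possible numbers of full cookies

# Assumed population sizes, each with 10% empty cookies
pop_sizes <- c(20, 100, 1000, 10000)

# Largest difference between hypergeometric and binomial probabilities
max_gap <- sapply(pop_sizes, function(N) {
  empty <- N / 10
  full <- N - empty
  max(abs(dhyper(x, full, empty, n_draw) - dbinom(x, n_draw, 0.9)))
})

print(data.frame(N = pop_sizes, max_gap = signif(max_gap, 3)))
```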
Poisson distribution
# of events in a given interval
e.g. number of light bulbs burning out in a building in a year
# of people arriving in a queue per minute
$$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$
λ = mean # of events in a given interval
Example: Poisson distribution
You got a box of 1,000 widgets.
The manufacturer says that the failure rate is 5
per box on average.
Your box contains 10 defective widgets. What
are the odds?
> ppois(9,5,lower.tail=F)
[1] 0.03182806
About 3%, so maybe the manufacturer is not
quite honest.
Or the distribution is not Poisson?
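The widget calculation can be unpacked into the Poisson formula and an explicit sum. A minimal sketch; `lambda` and the other variable names are illustrative.

```r
lambda <- 5   # mean number of defective widgets per box

# Poisson probabilities for 0..9 defects, straight from the formula
x <- 0:9
pmf_formula <- lambda^x * exp(-lambda) / factorial(x)

# P(X >= 10) = 1 - P(X <= 9)
tail_by_sum <- 1 - sum(pmf_formula)

# Built-in upper tail: P(X > 9)
tail_builtin <- ppois(9, lambda, lower.tail = FALSE)

print(c(tail_by_sum = tail_by_sum, tail_builtin = tail_builtin))
```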
Poisson approximation to binomial
If n is large (e.g. > 100) and p is small, so that n*p is
moderate (e.g. < 10), the Poisson is a
good approximation to the binomial with λ = n*p
[Figure: bar plot comparing binomial and Poisson probabilities for x = 0 to 15]
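A quick numerical comparison shows how close the two distributions are. The values n = 1000 and p = 0.005 are assumptions chosen to echo the widget example (1,000 widgets, 5 failures per box on average); the parameters behind the original figure are not stated.

```r
n <- 1000        # assumed number of trials
p <- 0.005       # assumed (small) probability of success
lambda <- n * p  # Poisson mean matching the binomial mean

x <- 0:15
binom_probs <- dbinom(x, n, p)
pois_probs  <- dpois(x, lambda)

# Largest pointwise difference between the two distributions
max_gap <- max(abs(binom_probs - pois_probs))
print(max_gap)

# Side-by-side bars, in the same style as the hypergeometric comparison
barplot(rbind(binom_probs, pois_probs), beside = TRUE, names.arg = x,
        legend.text = c("binomial", "Poisson"))
```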
Continuous distributions
Normal distribution (aka “bell curve”)
fits many biological data well
e.g. height, weight
serves as an approximation to binomial,
hypergeometric, Poisson
is important to inference problems because of the
Central Limit Theorem (more on this later)
sampling from a normal distribution
[Figure: "Histogram of x" (x axis: x, y axis: Density) with the normal density curve overlaid]
x <- rnorm(1000)
h <- hist(x, plot=F)
ylim <- range(0,h$density,dnorm(0))
hist(x,freq=F,ylim=ylim)
curve(dnorm(x),add=T)
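Besides overlaying the density curve, a sample can be compared with the normal distribution through a simple summary, such as the share of values within one standard deviation of the mean. This sketch uses a larger sample than the slide (100,000 instead of 1,000, an assumption made to keep the sampling noise small) and an arbitrary seed.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

# Sample from the standard normal distribution (mean 0, sd 1)
x <- rnorm(100000)

# Share of sampled values within one standard deviation of the mean
within_1sd_sample <- mean(abs(x) <= 1)

# Theoretical share from the normal cumulative distribution
within_1sd_theory <- pnorm(1) - pnorm(-1)

print(c(sample = within_1sd_sample, theory = within_1sd_theory))
```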
plotting on log axes
First of all, this is what a log function looks like
y = log(x) is equivalent to
x = exp(y) = e^y
> x = 1:1000
> y = log(x)
> plot(x,y)
[Figure: plot of y = log(x) for x from 1 to 1000]
plotting the function y = e^(-x)
> x = 1:20
> y = exp(-x)
> plot(x,y)
hard to tell what’s going on here, all the values are so close to 0
[Figure: plot of y = e^(-x) for x from 1 to 20 on linear axes]
changing the axes
just y on a log scale
> plot(x,y,log="y")
both x and y on a log scale
> plot(x,y,log="xy")
[Figure: the same data plotted with a log y axis, and with both axes on a log scale]
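Why the log y axis helps here can be seen by taking the log of y by hand: since log(e^(-x)) = -x, the points fall on a straight line with slope -1. A minimal sketch, with illustrative variable names:

```r
x <- 1:20
y <- exp(-x)

# Taking the natural log undoes the exponential: log(exp(-x)) = -x
log_y <- log(y)

# Slope between consecutive points; constant for a straight line
slopes <- diff(log_y) / diff(x)
print(slopes)

# Plotting log(y) against x on linear axes gives the same straight line
# that plot(x, y, log = "y") shows, only with different tick labels
plot(x, log_y)
```

With both axes on a log scale the same data no longer form a straight line, because an exponential is linear only when x is left on a linear scale.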
from PS: CO2 levels over last ~ 50 years
CO2 levels over last ~ 400,000 years