biol.582.f2011.lec.3


BIOL 582
Lecture Set 3
Probability Distributions
Review
• We learned that we could empirically generate distributions for things like test statistics
• Recall from R examples:
[Figure: histogram of mean 1 - mean 2 from 1000 random permutations. Two samples (n1 = 20, n2 = 30), mean 1 = mean 2, sd 1 = sd 2. x-axis: mean 1 - mean 2, from -3 to 3; y-axis: Frequency, from 0 to 200.]
• Recall from R examples, but with a slight change: the y-axis shows density instead of frequency.
• By changing frequency to density, the height of the distribution at any point measures the probability density of the value found on the x axis. The area under the distribution between two x values is the probability of observing a value in that range.
[Figure: density histogram of mean 1 - mean 2 from 1000 random permutations. Two samples (n1 = 20, n2 = 30), mean 1 = mean 2, sd 1 = sd 2. x-axis: mean 1 - mean 2, from -3 to 3; y-axis: Density, from 0.0 to 0.4.]
• The red line is the probability density function for the standard normal distribution. It is “theoretical”. If the data fit this line well, it can be used as a proxy for estimating the probability of any event described on the x axis.
[Figure: density histogram of mean 1 - mean 2 from 1000 random permutations, with the standard normal probability density function overlaid as a red line. Two samples (n1 = 20, n2 = 30), mean 1 = mean 2, sd 1 = sd 2. x-axis: mean 1 - mean 2, from -3 to 3; y-axis: Density, from 0.0 to 0.4.]
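The review above can be sketched in code. This is a minimal illustration in Python (the class examples used R), not the course's own script. The function names, the seeds, the bin limits, and the choice to draw both samples from a standard normal distribution are all assumptions made for the example. It builds an empirical permutation distribution, converts counts to densities, and compares the bin heights with a normal density that uses the mean and standard deviation of the permuted differences.

```python
import random
from statistics import NormalDist, fmean, stdev


def permutation_differences(sample1, sample2, n_perm=1000, seed=1):
    """Shuffle the pooled data n_perm times and record mean 1 - mean 2 each time."""
    rng = random.Random(seed)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diffs.append(fmean(pooled[:n1]) - fmean(pooled[n1:]))
    return diffs


def density_histogram(values, lo, hi, n_bins):
    """Return bin heights scaled so that the total bar area is at most 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return [c / (len(values) * width) for c in counts]


# Assumed data: two samples with equal means and equal standard deviations.
data_rng = random.Random(42)
sample1 = [data_rng.gauss(0.0, 1.0) for _ in range(20)]
sample2 = [data_rng.gauss(0.0, 1.0) for _ in range(30)]

diffs = permutation_differences(sample1, sample2)
lo, hi, n_bins = -1.5, 1.5, 12
dens = density_histogram(diffs, lo, hi, n_bins)

# Theoretical curve: a normal density matched to the permuted differences.
curve = NormalDist(fmean(diffs), stdev(diffs))
width = (hi - lo) / n_bins
for i, height in enumerate(dens):
    midpoint = lo + (i + 0.5) * width
    print(f"{midpoint:6.3f}  empirical={height:.3f}  normal={curve.pdf(midpoint):.3f}")
```

If the two columns track each other closely, the theoretical curve is a reasonable proxy for the empirical distribution.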
Probability distributions
• May be empirical (generated; becoming more common) or
theoretical (based on probability theory; historically
pervasive).
• There are several different theoretical probability
distributions, each having utility under certain conditions
• Theoretical probability distributions are often called
parametric distributions, as the attributes of such
distributions are influenced by the “behavior” of various
parameters. (E.g., the shape of the normal distribution is influenced by the behavior of the
mean and variance)
• Inferential statistical methods that rely on parametric
probability distributions are called parametric tests.
Probability distributions
Term: Definition
PMF: Probability mass function. The function used to form a discrete probability distribution (E.g., binomial, Poisson)
PDF: Probability density function. The function used to form a continuous probability distribution (E.g., normal, log-normal, t, Chi-square, F)
CMF, CDF: Cumulative mass function or cumulative distribution function. Determines the cumulative probability of a range of events, but is otherwise related to the PMF and PDF. We will not look at any, but they do exist and can be found easily in provided sources.
Integration: A concept of calculus, used to measure the area under the curve (AUC) of a probability function (for a PMF, the integral becomes a sum over the discrete values). Usually written as
$$\int_a^b f(x)\,dx = F(b) - F(a) = \Pr(a \le X \le b)$$
where f is the PDF and F is the CDF. The AUC of a probability function is the cumulative probability associated with limits a and b.
mode: Most frequently occurring value, or the greatest height of a PMF or PDF
tail: Region of low probability for any PMF or PDF
E(X): Expected value of a PMF or PDF; X is the variable of interest
var(X): Variance of the distribution
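The integration entry can be checked numerically. The sketch below is an illustration in Python, not course material. It uses `statistics.NormalDist` from the standard library for the standard normal PDF and CDF, and a hand-written trapezoid rule (the helper name `trapezoid_area` and the limits a = -1, b = 1 are assumptions for the example).

```python
from statistics import NormalDist


def trapezoid_area(f, a, b, n=10_000):
    """Approximate the integral of f from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


std_normal = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 1.0

area = trapezoid_area(std_normal.pdf, a, b)            # area under the PDF
cdf_diff = std_normal.cdf(b) - std_normal.cdf(a)       # F(b) - F(a)

print(round(area, 4), round(cdf_diff, 4))
```

Both numbers estimate Pr(-1 ≤ X ≤ 1) for a standard normal variable, so they should agree to several decimal places.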
Probability distributions
Term: Definition
symmetric: Tails are similar, mode in center of distribution
skewed: Tails are dissimilar, mode and mean not the same
kurtosis: “Peakedness” of distribution: platykurtic (flatter) vs. leptokurtic (more peaked)
Common Distributions
Binomial Distribution
Type: Discrete
PMF:
$$\Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$
Parameters: n trials; p event probability (k events)
E(X) = np
var(X) = np(1 - p)
Use: Categorical data; logistic regression; any case where Bernoulli trials (success or failure outcome) are appropriate. E.g., disease research, nesting success in birds, environmental sex determination in turtles
[Figure: Binomial distribution for n = 20; p = 0.1 (blue), p = 0.5 (green) and p = 0.8 (red). x-axis is k (number of events = “success”). Taken from Wikipedia.]
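As a quick check of the binomial PMF and its moments, here is a minimal Python sketch (an illustration, not course code). The function name `binomial_pmf` and the values n = 20, p = 0.5 are assumptions chosen to match one of the curves described for the figure.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Pr(X = k) for n Bernoulli trials with event probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Moments computed directly from the PMF.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("sum of PMF:", sum(pmf))
print("mean:", mean, "expected np =", n * p)
print("variance:", variance, "expected np(1 - p) =", n * p * (1 - p))
```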
Common Distributions
Poisson Distribution
Type: Discrete
PMF:
$$\Pr(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
Parameters: λ expected value (k number of events)
E(X) = var(X) = λ
Use: Count or ordinal data; logistic regression; when one is interested in knowing the likelihood of countable random events. E.g., modeling disease outbreaks, behavior studies, genetic mutation research
[Figure: Comparison of Poisson distributions for lambda = 20, 12, 8 and 4. x-axis: X = k events, from 0 to 30; y-axis: Density, from 0.0 to 0.3.]
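The property E(X) = var(X) = λ can be verified directly from the PMF. The following Python sketch is illustrative only. The function name `poisson_pmf`, the value λ = 4, and the truncation of the sum at k = 100 are assumptions for the example; the PMF is computed on the log scale to avoid large factorials.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Pr(X = k) = exp(-lam) * lam**k / k!, computed on the log scale."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))


lam = 4.0
ks = range(101)  # the probability beyond k = 100 is negligible for lam = 4
pmf = [poisson_pmf(k, lam) for k in ks]

mean = sum(k * pk for k, pk in zip(ks, pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in zip(ks, pmf))

print("mean:", mean)
print("variance:", variance)
```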
Common Distributions
Normal Distribution
Type: Continuous
PDF:
$$\Pr(X = k) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(k - \mu)^{2}}{2\sigma^{2}}}$$
The standard normal distribution has mean = 0 and standard deviation = 1.
Parameters: μ expected value; σ2 variance (k event value)
E(X) = μ
var(X) = σ2
Use: Most commonly used distribution in statistical analyses. Many parametric tests assume normally distributed errors or model parameters. The CLT indicates that test statistics should have normal distributions, even when derived from non-normal samples. Binomial and Poisson distributions that have large numbers of Bernoulli trials can be approximated by the Normal distribution. This list can get really long!
[Figure: Comparison of Normal distributions for sd = 1, 2, 4 and 8. x-axis: X = k, from -10 to 10; y-axis: Density, from 0.0 to 0.4.]
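The sketch below, in Python, writes the normal PDF out by hand and compares it with the standard library's `statistics.NormalDist`. It then illustrates the remark about approximating a binomial distribution with many trials. The function name `normal_pdf` and the values n = 100, p = 0.5 are assumptions for the example.

```python
from math import comb, exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(k, mu, sigma2):
    """Normal density at k for mean mu and variance sigma2."""
    return exp(-((k - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


# The hand-written formula against the library, for the sd values in the figure.
for sd in (1, 2, 4, 8):
    ours = normal_pdf(0.0, 0.0, sd**2)
    library = NormalDist(0.0, sd).pdf(0.0)
    print(f"sd={sd}: formula={ours:.5f}  NormalDist={library:.5f}")

# Normal approximation to a binomial with many trials (assumed n and p).
n, p = 100, 0.5
binom_peak = comb(n, 50) * p**50 * (1 - p) ** 50
normal_peak = normal_pdf(50, n * p, n * p * (1 - p))
print(f"binomial Pr(X = 50) = {binom_peak:.5f}, normal density at 50 = {normal_peak:.5f}")
```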
Common Distributions
Lognormal Distribution
Type: Continuous
PDF:
$$\Pr(X = k) = \frac{1}{k\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(\ln k - \mu)^{2}}{2\sigma^{2}}}, \quad k > 0$$
Parameters: μ expected value of ln X; σ2 variance of ln X (k event value)
$$E(X) = e^{\mu + \sigma^{2}/2}$$
$$\mathrm{var}(X) = \left(e^{\sigma^{2}} - 1\right) e^{2\mu + \sigma^{2}}$$
Use: Often used as a data transformation to produce a normally distributed variable (because of the link between the two). Often the distribution of a variable that is a factor of another positive random variable (E.g., weight and length). Other E.g., survival analysis, morphology, abundance studies
[Figure: Comparison of Lognormal distributions (mu = 1) for log sd = 1, 0.7, 0.4 and 0.2. x-axis: X = k, from 0 to 10; y-axis: Density, from 0.0 to 0.6.]
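The lognormal moments can be checked by integrating the PDF numerically. This Python sketch is an illustration under stated assumptions: the helper names, the trapezoid rule, the integration limits, and the parameter values (mu = 1, log sd = 0.4, one of the curves in the figure) are choices made for the example.

```python
from math import exp, log, pi, sqrt


def lognormal_pdf(k, mu, sigma2):
    """Lognormal density at k; mu and sigma2 describe ln X."""
    if k <= 0:
        return 0.0
    return exp(-((log(k) - mu) ** 2) / (2 * sigma2)) / (k * sqrt(2 * pi * sigma2))


def trapezoid_area(f, a, b, n):
    """Approximate the integral of f from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


mu, sigma2 = 1.0, 0.4**2
lower, upper, steps = 1e-9, 60.0, 60_000  # the density is negligible beyond 60

total_prob = trapezoid_area(lambda k: lognormal_pdf(k, mu, sigma2), lower, upper, steps)
numeric_mean = trapezoid_area(lambda k: k * lognormal_pdf(k, mu, sigma2), lower, upper, steps)
numeric_var = trapezoid_area(
    lambda k: (k - numeric_mean) ** 2 * lognormal_pdf(k, mu, sigma2), lower, upper, steps
)

theory_mean = exp(mu + sigma2 / 2)
theory_var = (exp(sigma2) - 1) * exp(2 * mu + sigma2)

print("total probability:", total_prob)
print("mean:", numeric_mean, "theory:", theory_mean)
print("variance:", numeric_var, "theory:", theory_var)
```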
Common Distributions
Student t Distribution (William Sealy Gosset)
Type: Continuous
PDF:
$$\Pr(X = k) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu + 1}{2}}, \quad t = \frac{k - \mu}{s}$$
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\,dt$$
Parameters: ν degrees of freedom; n subjects; Γ Gamma function (k event value)
E(X) = 0
var(X) = ν / (ν - 2), for ν > 2
Use: Often used for t-test statistics of two-sample tests, paired tests, or comparisons of regression parameter estimates to a theoretical value (usually 0). Can be used for many different parameters. Also has a link to Normal and F distributions. E.g., paired designs for before/after experimental treatments (dose/response), linear regression, correlation analysis.
Notice that as n increases (meaning the df increases), the t-distribution converges on the normal distribution. One way to think of the t-distribution is that it is a standard normal distribution, corrected for small sample sizes.
[Figure: Comparison of t distributions for df = 1, 3, 8 and 30, with the normal distribution for reference. x-axis: t value, from -6 to 6; y-axis: Density, from 0.0 to 0.4.]
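The convergence of the t-distribution on the normal distribution can be seen numerically. The Python sketch below is illustrative. The function name `t_pdf`, the grid of t values, and the use of `math.lgamma` (the log of the Gamma function, used here so that large df do not overflow) are choices made for the example.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(t, df):
    """Student t density at t for df degrees of freedom."""
    log_constant = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_constant - (df + 1) / 2 * log(1 + t * t / df))


std_normal = NormalDist(0.0, 1.0)
grid = [i / 10 for i in range(-60, 61)]  # t values from -6 to 6

# Largest gap between each t density and the standard normal density.
gaps = []
for df in (1, 3, 8, 30):
    gap = max(abs(t_pdf(t, df) - std_normal.pdf(t)) for t in grid)
    gaps.append(gap)
    print(f"df={df:2d}: largest gap from the normal density = {gap:.4f}")
```

The gap shrinks as df grows, which matches the pattern described for the figure.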
Common Distributions
F Distribution
Type: Continuous
PDF:
$$\Pr(X = k) = \frac{\sqrt{\dfrac{(\nu_1 k)^{\nu_1}\, \nu_2^{\nu_2}}{(\nu_1 k + \nu_2)^{\nu_1 + \nu_2}}}}{k\, B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)}, \quad B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x + y)}$$
Parameters: ν degrees of freedom (two parts, ν1 and ν2); Γ Gamma function; B Beta function (k event value)
$$E(X) = \frac{\nu_2}{\nu_2 - 2}, \quad \nu_2 > 2$$
$$\mathrm{var}(X) = \frac{2\nu_2^{2}\,(\nu_2 + \nu_1 - 2)}{\nu_1 (\nu_2 - 2)^{2} (\nu_2 - 4)}, \quad \nu_2 > 4$$
Use: The primary distribution for F statistics used in analysis of variance (ANOVA). Also used in population genetics. Rather universal for any research that involves evaluating components of linear models.
[Figure: Comparison of F distributions for df = (4, 10), (4, 100), (8, 10), (8, 100), (12, 10), (12, 100), (20, 10) and (20, 100). x-axis: F value, from 0 to 5; y-axis: Density, from 0.0 to 1.2.]
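A small Python sketch can confirm that the F density integrates to 1 and has the stated expected value. It is an illustration only. The helper names, the trapezoid rule, the upper integration limit of 100, and the choice df = (4, 10) are assumptions for the example; the density is computed on the log scale with `math.lgamma`.

```python
from math import exp, lgamma, log


def f_pdf(k, df1, df2):
    """F density at k for numerator df1 and denominator df2."""
    if k <= 0:
        return 0.0
    log_beta = lgamma(df1 / 2) + lgamma(df2 / 2) - lgamma((df1 + df2) / 2)
    log_root = 0.5 * (
        df1 * log(df1 * k) + df2 * log(df2) - (df1 + df2) * log(df1 * k + df2)
    )
    return exp(log_root - log(k) - log_beta)


def trapezoid_area(f, a, b, n):
    """Approximate the integral of f from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


df1, df2 = 4, 10
lower, upper, steps = 1e-9, 100.0, 100_000  # truncates a thin right tail

total_prob = trapezoid_area(lambda k: f_pdf(k, df1, df2), lower, upper, steps)
numeric_mean = trapezoid_area(lambda k: k * f_pdf(k, df1, df2), lower, upper, steps)
theory_mean = df2 / (df2 - 2)

print("total probability:", total_prob)
print("mean:", numeric_mean, "theory:", theory_mean)
```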
Common Distributions
Χ2 Distribution
Type: Continuous
PDF:
$$\Pr(X = k) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, k^{\nu/2 - 1} e^{-k/2}, \quad k > 0$$
Parameters: ν degrees of freedom; Γ Gamma function (k event value)
E(X) = ν
var(X) = 2ν
Use: The primary distribution for Χ2 statistics used in likelihood ratio tests, contingency tables, and categorical analysis. Also used to “fit” other distributions. E.g., allele frequencies, stepwise regression, model comparisons
[Figure: Comparison of Chi-square distributions for df = 2, 4, 8 and 16. x-axis: Chi-square value, from 0 to 30; y-axis: Density, from 0.0 to 0.5.]
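The Chi-square moments E(X) = ν and var(X) = 2ν can be checked by numerical integration. This Python sketch is illustrative. The helper names, the trapezoid rule, the integration limits, and the choice df = 4 are assumptions for the example.

```python
from math import exp, lgamma, log


def chi2_pdf(k, df):
    """Chi-square density at k for df degrees of freedom."""
    if k <= 0:
        return 0.0
    return exp((df / 2 - 1) * log(k) - k / 2 - (df / 2) * log(2) - lgamma(df / 2))


def trapezoid_area(f, a, b, n):
    """Approximate the integral of f from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


df = 4
lower, upper, steps = 1e-9, 100.0, 100_000  # the density is negligible beyond 100

numeric_mean = trapezoid_area(lambda k: k * chi2_pdf(k, df), lower, upper, steps)
numeric_var = trapezoid_area(
    lambda k: (k - numeric_mean) ** 2 * chi2_pdf(k, df), lower, upper, steps
)

print("mean:", numeric_mean, "theory:", df)
print("variance:", numeric_var, "theory:", 2 * df)
```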
Final thoughts
• There are MANY more distributions. This is just a sample.
• These distributions are “simulations” for the distributions of variables, parameters, or test statistics. There are other ways to simulate distributions.
• These are all parametric distributions.
• Often, one asks if the data “fit” a distribution.
  • Using a PMF or PDF, one can estimate the expected values of a theoretical distribution.
  • One can then compare those to observed densities (frequencies):
$$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$$
  • This statistic has degrees of freedom tied to the number of “bins” used for comparison (the number of bins minus 1, less any parameters estimated from the data).
  • Thus, a theoretical distribution can be used to see if data fit some other distribution.
• This lecture should be referenced anytime we use a parametric test with a specific distributional form for estimating the probability of a type I error.
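The goodness-of-fit idea above can be sketched as follows. This Python example is an illustration, not course code. The counts are simulated (they are not real data), and the seed, λ = 4, the sample size of 200, the binning scheme and all function names are assumptions. Because λ is treated as known rather than estimated from the data, the degrees of freedom are taken as the number of bins minus 1.

```python
import random
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with expected value lam."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))


def draw_poisson(lam, rng):
    """Draw one Poisson value by multiplying uniforms until the product drops below exp(-lam)."""
    limit = exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def bin_index(k):
    """Bins: {0, 1}, 2, 3, 4, 5, 6, 7, {8 or more}."""
    if k <= 1:
        return 0
    if k >= 8:
        return 7
    return k - 1


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


rng = random.Random(582)  # arbitrary seed for the simulated counts
lam, n_obs, n_bins = 4.0, 200, 8

observed = [0] * n_bins
for _ in range(n_obs):
    observed[bin_index(draw_poisson(lam, rng))] += 1

# Expected counts from the PMF; the last bin takes the remaining probability.
bin_probs = [poisson_pmf(0, lam) + poisson_pmf(1, lam)]
bin_probs += [poisson_pmf(k, lam) for k in range(2, 8)]
bin_probs.append(1.0 - sum(bin_probs))
expected = [n_obs * p for p in bin_probs]

chi2 = chi_square_statistic(observed, expected)
df = n_bins - 1
print("observed:", observed)
print("expected:", [round(e, 1) for e in expected])
print("chi-square statistic:", round(chi2, 3), "on", df, "degrees of freedom")
```

The statistic would then be compared with a Chi-square distribution on the stated degrees of freedom to judge the fit.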