Transcript: Lecture 2

Probability Distributions and Frequentist Statistics
“A single death is a tragedy, a million deaths is a statistic”
Joseph Stalin
Can we answer that?

An urn contains N balls in total: M red and N−M blue.

First draw: what is the probability of red?

$$P(R_1 \mid I) = \frac{M}{N}$$

Second draw: what is $P(R_2 \mid I)$, before the outcome of the first draw is known?
The Red and the Blue

The outcome of the first draw is a "nuisance" parameter for the second: R2 can occur either as (R1, R2) or as (B1, R2), so to marginalize we sum over all options for the first draw.

$$P(R_2 \mid I) = P(R_1, R_2 \mid I) + P(B_1, R_2 \mid I)$$

Using the product rule:

$$P(R_2 \mid I) = P(R_1 \mid I)\, P(R_2 \mid R_1, I) + P(B_1 \mid I)\, P(R_2 \mid B_1, I)$$

$$= \frac{M}{N}\cdot\frac{M-1}{N-1} + \frac{N-M}{N}\cdot\frac{M}{N-1} = \frac{M(N-1)}{N(N-1)} = \frac{M}{N} = P(R_1 \mid I)$$

The same argument gives $P(R_3 \mid I) = M/N$, and so on.
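As a quick check of this result, a minimal Monte Carlo sketch in Python (the values of N, M, and the trial count are arbitrary illustrative choices):

```python
import random

# Minimal sketch: estimate P(R2|I) by drawing without replacement.
N, M, trials = 10, 3, 100_000
red_on_second = 0
for _ in range(trials):
    urn = ["R"] * M + ["B"] * (N - M)
    random.shuffle(urn)          # a random draw order without replacement
    if urn[1] == "R":            # colour of the second ball drawn
        red_on_second += 1

print(red_on_second / trials)    # close to M/N = 0.3, matching P(R1|I)
```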
Marginalization

A joint probability table for rain and clouds:

                    RAIN    NO RAIN   Chance of Cloud
  CLOUDS             1/6      1/3          1/2
  NO CLOUDS           0       1/2          1/2
  Chance of Rain     1/6      5/6

Summing across each row gives the marginal probability of cloud; summing down each column gives the marginal probability of rain.
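A minimal Python sketch of this table and its marginals (the array layout is my own choice):

```python
import numpy as np

# Joint probability table: rows = (clouds, no clouds), columns = (rain, no rain).
joint = np.array([[1/6, 1/3],
                  [0.0, 1/2]])

p_cloud = joint.sum(axis=1)   # marginalize over rain  -> [1/2, 1/2]
p_rain  = joint.sum(axis=0)   # marginalize over cloud -> [1/6, 5/6]
print(p_cloud, p_rain)
```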
Marginalization

Where $\{A_i\}$ represents a set of mutually exclusive and exhaustive possibilities, marginalization, or "integrating out" of nuisance parameters, takes the form:

$$P(X \mid D, I) = \sum_i P(X, A_i \mid D, I)$$

Or, in the limit of a continuously variable parameter $A$ (rather than the discrete case above), $P$ becomes a probability density function:

$$P(X \mid D, I) = \int dA\, P(X, A \mid D, I)$$

This technique is often required in inference: for example, we may be interested in the frequency of a sinusoidal signal in noisy data, but not in its amplitude (a nuisance parameter).
Probability Distributions

We denote the probability distribution over all possible values of a variable x by p(x).

For a discrete variable, p(x) assigns a probability to each allowed value; for a continuous variable, p(x) is a probability density:

$$p(x) = \lim_{\delta x \to 0} \frac{P(x < X < x + \delta x)}{\delta x}$$

The cumulative distribution, $F(x) = P(X \le x)$, gives the probability that $X$ does not exceed $x$.
Properties of Probability Distributions

The expectation value of a function g(X) is the weighted average:

$$\langle g(X) \rangle = \sum_{\text{all } x} g(x)\, p(x) \quad \text{(discrete)} \qquad\qquad \langle g(X) \rangle = \int g(x)\, f(x)\, dx \quad \text{(continuous)}$$

For g(X) = X, if it exists, this is the first moment, or mean, of the distribution.

The rth moment of a random variable X about the origin (x = 0) is:

$$\mu'_r = \langle X^r \rangle = \sum_{\text{all } x} x^r\, p(x) \quad \text{(discrete)} \qquad\qquad \mu'_r = \langle X^r \rangle = \int x^r\, f(x)\, dx \quad \text{(continuous)}$$

The mean $\mu = \mu'_1 = \langle X \rangle$ is the 1st moment about the origin.
Properties of Probability Distributions

The rth central moment of a random variable X about the mean (origin = μ) is:

$$\mu_r = \langle (X - \mu)^r \rangle = \sum_{\text{all } x} (x - \mu)^r\, p(x) \quad \text{(discrete)} \qquad\qquad \mu_r = \int (x - \mu)^r\, f(x)\, dx \quad \text{(continuous)}$$

First central moment: $\mu_1 = \langle X - \mu \rangle = 0$

Second central moment: $\mathrm{Var}(X) = \sigma_x^2 = \langle (X - \mu)^2 \rangle$

$$\sigma_x^2 = \langle (X - \mu)^2 \rangle = \langle X^2 - 2\mu X + \mu^2 \rangle = \langle X^2 \rangle - 2\mu \langle X \rangle + \mu^2$$
$$= \langle X^2 \rangle - 2\mu^2 + \mu^2 = \langle X^2 \rangle - \mu^2 = \langle X^2 \rangle - \langle X \rangle^2$$

Therefore the variance is $\sigma_x^2 = \langle X^2 \rangle - \langle X \rangle^2$.
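A minimal numerical sketch of this identity (the exponential sample is an arbitrary choice):

```python
import numpy as np

# Check Var(X) = <X^2> - <X>^2 on a large sample from any distribution.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)    # arbitrary test distribution

second_central = np.mean((x - x.mean()) ** 2)   # <(X - mu)^2>
moment_form    = np.mean(x**2) - np.mean(x)**2  # <X^2> - <X>^2
print(second_central, moment_form)              # agree to floating-point error
```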
Properties of Probability Distributions

Third central moment: $\mu_3 = \langle (X - \mu)^3 \rangle$ (skewness)

Fourth central moment: $\mu_4 = \langle (X - \mu)^4 \rangle$ (kurtosis)

The median and the mode both provide estimates of the central tendency of a distribution, and are in many cases more robust against outliers than the mean.
Example: Mean and Median Filtering

[Figure: an image degraded by salt noise, shown after a mean filter and after a median filter.]
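A minimal sketch of the comparison using SciPy (image size, noise fraction, and filter size are arbitrary choices):

```python
import numpy as np
from scipy import ndimage

# Degrade a flat grey image with salt noise (random pixels forced to white).
rng = np.random.default_rng(0)
img = np.full((64, 64), 0.5)
img[rng.random(img.shape) < 0.05] = 1.0

mean_filtered   = ndimage.uniform_filter(img, size=3)  # smears the spikes out
median_filtered = ndimage.median_filter(img, size=3)   # rejects them as outliers
```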
The Uniform Distribution

A flat distribution whose constant value is normalized so that the area under the curve is 1.

• Commonly used as an ignorance prior to express impartiality (a lack of bias) about the value of a quantity over the given interval.
• Round-off error and quantization error are uniformly distributed.

[Figures: the uniform PDF and the cumulative uniform PDF.]
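For reference, the standard uniform PDF and CDF on an interval [a, b] are:

$$
f(x \mid a, b) =
\begin{cases}
\dfrac{1}{b-a} & a \le x \le b \\[4pt]
0 & \text{otherwise}
\end{cases}
\qquad
F(x) = \frac{x-a}{b-a} \quad \text{for } a \le x \le b
$$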
The Binomial Distribution

Binomial statistics apply when there are exactly two mutually exclusive outcomes of a trial (labelled "success" and "failure"). The binomial distribution gives the probability of observing k successes in n trials, with the probability of success on a single trial denoted by p (p is assumed fixed for all trials).

• Among the most useful discrete distribution functions in statistics.
• The multinomial distribution is a generalization to the case where there are more than two outcomes.

[Figures: binomial PMFs for fixed p with varying n, and for fixed n with varying p.]
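For reference, the standard binomial probability of k successes in n trials is:

$$
P(k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{\,n-k}, \qquad k = 0, 1, \dots, n
$$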
The Negative Binomial Distribution

Closely related to the binomial distribution, the negative binomial distribution applies under the same circumstances, but the variable of interest is the number of trials n needed to obtain k successes and n−k failures (rather than the number of successes in a fixed number of trials). For n Bernoulli trials, each with success fraction p, the negative binomial distribution gives the probability of observing k successes and n−k failures, with a success on the last trial.
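In this form (the k-th success arriving on trial n), the standard expression is:

$$
P(n \mid k, p) = \binom{n-1}{k-1}\, p^{k} (1-p)^{\,n-k}, \qquad n = k, k+1, \dots
$$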
The Poisson Distribution

Another crucial discrete distribution function, the Poisson expresses the probability of a number of events k (e.g. failures, arrivals, occurrences ...) occurring in a fixed period of time (or fixed region of space), provided these events occur with a known mean rate λ (events per unit time) and independently of the previous event.
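For reference, the standard Poisson probability is:

$$
P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
$$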
• The Poisson distribution is the limiting case of a binomial distribution where the probability of success p goes to zero while the number of trials n grows such that λ = np stays finite.
• Examples: photons received from a star in an interval; meteorite impacts over an area; pedestrians crossing at an intersection, etc.
The Normal (Gaussian) Distribution

The Normal or Gaussian distribution is probably the best-known statistical distribution. A Gaussian with mean zero and standard deviation one is known as the standard normal distribution. Given mean μ and standard deviation σ, it has the PDF shown below.
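For reference, the standard form of the Gaussian PDF is:

$$
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
$$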
• A continuous distribution; the limiting case of a binomial as the number of trials (and successes) becomes very large.
• Its pivotal role in statistics is partly due to the Central Limit Theorem (see later).
Examples: Gaussian Distributions

[Figure: the distribution of human IQ as an example of a Gaussian.]
The Power Law Distribution

Power law distributions are ubiquitous in science, occurring in diverse phenomena including city sizes, incomes, word frequencies, and earthquake magnitudes. A power law implies that small occurrences are extremely common, whereas large instances are extremely rare. This "law" takes a number of forms (it is sometimes referred to as Zipf and sometimes as Pareto). A simple illustrative power law is shown below.
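The slide's exact expression is not preserved here; a minimal form consistent with the k = 0.5, 1.0, 2.0 curves in the plots that followed is:

$$
p(x) \propto x^{-k}, \qquad x \ge x_{\min} > 0
$$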
[Figures: the power law PDF for k = 0.5, 1.0, 2.0, on a linear scale and on a log-log scale, where a power law appears as a straight line.]

Example: Power Laws from Nature

Physics Example: Cosmic Ray Spectrum
The Exponential Distribution

The exponential distribution is a continuous probability distribution with an exponential falloff controlled by the rate parameter λ: larger values of λ entail a more rapid falloff in the distribution.
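For reference, the standard exponential PDF is:

$$
f(x \mid \lambda) = \lambda\, e^{-\lambda x}, \qquad x \ge 0
$$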
• The exponential distribution is used to model times between independent events which happen at a constant average rate (e.g. lifetimes, waiting times).
The Gamma Distribution

The gamma distribution is a two-parameter continuous PDF, characterized by a shape parameter k and a scale parameter θ. When k = 1 it coincides with the exponential distribution, and it is also closely related to the Poisson and chi-squared distributions.
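For reference, the standard gamma PDF, and the Gamma function that normalizes it, are:

$$
f(x \mid k, \theta) = \frac{x^{\,k-1} e^{-x/\theta}}{\Gamma(k)\, \theta^{k}}, \qquad x > 0,
\qquad \text{where} \quad
\Gamma(k) = \int_0^{\infty} t^{\,k-1} e^{-t}\, dt
$$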
• The gamma distribution gives a flexible class of PDFs for nonnegative phenomena, often used in modeling waiting times.
• It is the conjugate prior for the Poisson PDF.
The Beta Distribution

The family of beta probability distributions is defined on the fixed interval [0, 1] and parameterized by two positive shape parameters, α and β. In Bayesian statistics it is frequently encountered as a prior for the binomial distribution.
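For reference, the standard beta PDF, and the Beta function that normalizes it, are:

$$
f(x \mid \alpha, \beta) = \frac{x^{\,\alpha-1} (1-x)^{\,\beta-1}}{B(\alpha, \beta)}, \qquad 0 \le x \le 1,
\qquad \text{where} \quad
B(\alpha, \beta) = \int_0^{1} t^{\,\alpha-1} (1-t)^{\,\beta-1}\, dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}
$$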
• The family of beta distributions allows for a wide variety of shapes over a fixed interval.
• If the likelihood function is a binomial, then a beta prior will lead to another beta function for the posterior.
• The role of the Beta function can be thought of as a simple normalization ensuring that the total PDF integrates to 1.0.
Central Limit Theorem: Experimental Demonstration

[Figure: experimental demonstration of the Central Limit Theorem.]
Central Limit Theorem: A Bayesian Demonstration

Let X1 lie in the range x1 to x1 + dx1 with $P(x_1 \mid I) = f_1(x_1)$; let X2 lie in x2 to x2 + dx2 with $P(x_2 \mid I) = f_2(x_2)$; and let Y lie in y to y + dy. The information I is that Y is the sum of X1 and X2.

$$P(Y \mid I) = \iint dX_1\, dX_2\, P(Y, X_1, X_2 \mid I)$$

Using the product rule, and the independence of $X_1$ and $X_2$:

$$P(Y \mid I) = \iint dX_1\, dX_2\, P(X_1 \mid I)\, P(X_2 \mid I)\, P(Y \mid X_1, X_2, I)$$

Because $y = x_1 + x_2$:

$$P(Y \mid X_1, X_2, I) = \delta(y - x_1 - x_2)$$

Therefore:

$$P(Y \mid I) = \int dX_1\, f_1(x_1) \int dX_2\, f_2(x_2)\, \delta(y - x_1 - x_2) = \int dX_1\, f_1(x_1)\, f_2(y - x_1)$$

This is a convolution integral.
Central Limit Theorem: Convolution Demonstration
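A minimal numerical sketch of the convolution route to the CLT (the grid spacing and the number of convolutions are arbitrary choices):

```python
import numpy as np

# Start from a uniform PDF on [0, 1) sampled on a regular grid.
dx = 0.01
f = np.ones(100)                      # f(x) = 1 on [0, 1)

# Repeatedly convolve the PDF with itself: each pass gives the PDF of
# the sum of one more independent uniform variable.
pdf = f.copy()
for _ in range(4):
    pdf = np.convolve(pdf, f) * dx    # p(y) = integral of p(x) f(y - x) dx
    pdf /= pdf.sum() * dx             # renormalize against grid error

# 'pdf' now approximates the density of the sum of 5 uniform variables,
# already close to a Gaussian of mean 2.5, as the CLT predicts.
print(pdf.max(), np.argmax(pdf) * dx) # peak near y = 2.5
```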