Transcript Probability

Statistics for Health Research
Introduction to
Distributions and
Probability
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Overview
•
•
•
•
•
•
Distributions
History of probability
Definitions of probability
Random variable
Probability density function
Normal, Binomial and Poisson
distributions
Introduction to Probability
Density Functions
•Normal Distribution /
•Gaussian / Bell curve
•Poisson named after French
Mathematician
•Binomial related to binary factors
(Bernoulli Trials)
Early use of Normal
Distribution
• Gauss was a German mathematician
who solved mystery of where Ceres
would appear after it disappeared
behind the Sun.
• He assumed the errors formed a
Normal distribution and managed to
accurately predict the orbit of Ceres
What is the
relationship between
the Normal or
Gaussian distribution
and probability?
Probability
“I cannot believe that God plays
dice with the cosmos”
Albert Einstein
“The probable is what usually
happens”
Aristotle
Origins of Probability
• Early interest in permutations Vedic
•
•
•
•
•
•
literature 400 BC
Distinguished origins in betting and
gambling!
Pascal and Fermat studied division of
stakes in gambling (1654)
Enlightenment – seen as helping
public policy, social equity
Astronomy – Gauss (1801)
Social and genetic – Galton (1885)
Experimental design – Fisher (1936)
Types of Probability
Two basic definitions:
1) Frequentist
2) Subjectivist
Classical
Bayesian
Proportion of
times an event
occurs in a long
series of ‘trials’
Strength of belief
in event happening
Frequentists
• Consider tossing a fair coin
• In any trial, event may be a
‘head’ or ‘tail’ i.e. binary
• Repeated tossing gives series
of ‘events’
• In long run prob of heads=0.5
THTTHHHHTHHHTHHHTTHTTTHHTTHTTHHHTTTHHTHHHTTTTTHHH
0.6
0.56
0.52
Frequentist
Probability
• Note the difference between ‘long run’
•
•
•
probability and an individual trial
In an individual trial a head either occurs
(X=1) or does not occur (X=0)
Patient either survives or dies following an
MI
Prob of dying after MI ≈ 30% based on a
previous long series from a population of
individuals who experienced MI
Random Variable
• Consider rolling 2 dice and we want
to summarise the probabilities of all
possible outcomes
• We call the outcome a random
variable X which can have any value
in this case from 2 to 12
• Enumerate all probabilities in sample
space S
• P (2) = 1/6x1/6 = 1/36, P (3)=2/36,
P (4) = 3/36, etc…..
Probability Density Function for
rolling two dice
6/36
5/36
4/36
3/36
2/36
1/36
2
1/36
2
3
4
5
6
7
8
9
10
11
12
2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
3
4
5
6
7
8
9
10
11
12
Probability Density Function for
rolling two dice
What is probability of getting 12? Answer 1/36
What is probability of getting more than 8? Ans. 10/36
6/36
5/36
4/36
3/36
2/36
1/36
2
3
4
5
6
7
8
9
10
11
12
Probability Density Function for
continuous variable
6/36
5/36
4/36
3/36
2/36
1/36
2
1/36
2
3
4
5
6
7
8
9
10
11
12
2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
3
4
5
6
7
8
9
10
11
12
Probability
Consider distribution of weight in
kg; all values possible not just
discrete
20…….30……40…… 50 ……60…….70…….80…..90….100….110…… 120
Weight in kilograms
2
3
4
5
6
7
8
9
10
11
12
Probability Density Function
in SPSS
Use Analyze / Descriptive Statistics / Frequencies
and select no table and charts box as below
Probability Density Function
in SPSS
Data from ‘LDL Data.sav’ of baseline LDL cholesterol
Normal Distribution
Note that a Normal or Gaussian
curve is defined by two parameters:
Mean µ and Standard Deviation σ
And often written as N ( µ, σ )
Hence any Normal distribution has mathematical form
Impossible to be integrated so area under the curve
obtained by numerical integration and tabulated!
Normal Distribution
As noted earlier the curve is symmetrical about the
mean and so p ( x ) > mean = 0.5 or 50%
And
p ( x ) < mean = 0.5 or 50%
And p (a < x < b) = p(b) – p(a)
50%
50%
Normal Distribution and
Probabilities
So we now have a way of working out the probability
of any value or range of values of a variables IF a
Normal distribution is a reasonable fit to the data
p (a < x < b) = p(b) – p(a) which is the area under
the curve between a and b
50%
50%
Normal Distribution
Most of area lies between +1 and -1 SD (64%)
The large majority lie between +2 and -2 SDs (95%)
Normal Distribution
Probability Density
Function (PDF) =
How well does my data fit
a Normal Distribution?
Statistics
Baseline LDL
N
Mean
Median
St d. Dev iation
Skewness
St d. Error of Skewness
Minimum
Maximum
Valid
Missing
1383
0
3.454363
3.506214
.9889157
.039
.066
.3345
7.5650
Note median and mean virtually the same
Skewness = 0.039, close to zero
Skewness is measure of symmetry (0=perfect symmetry)
Eyeball test - fitted normal curve looks good!
Try Q-Q plot in Analyze /
Descriptive Statistics/ Q-Q plot
Plot compares Expected
Normal distribution
with real data and if
data lies on line y = x
then the Normal
Distribution is a good
fit
Note still an eyeball
test!
Is this a good fit?
I used to be Normal until I
discovered Kilmogorov-Smirnoff!
One-Sample Kolmogorov-Smirnov Test
N
Normal Parameters a,b
Most Extreme
Diff erences
Mean
Std. Deviation
Absolute
Positive
Negative
Kolmogorov -Smirnov Z
Asy mp. Sig. (2-tailed)
Baseline LDL
1383
3.454363
.9889157
.043
.043
-.043
1.617
.011
a. Test distribution is Normal.
b. Calculated f rom data.
Eyeball Test indicates distribution is approximately
Normal but K-S test is significant indicating
discrepancy compared to Normal
WARNING: DO NOT RELY ON THIS TEST
Consider the distribution of survival
times following surgery for colorectal
cancer
Statistics
Time f rom Surgery
N
Mean
Median
Std. Dev iation
Skewness
Std. Error of Skewness
Minimum
Maximum
Valid
Missing
476
0
848.3908
835.5000
582.39657
2.081
.112
14.00
5763.00
Note median=835 days and mean=848
Skewness = 2.081, very skewed (> 1.0)
Strong tail to right! Approximately Normal?
Try a log transformation for right
positive skewed data?
Statistics
logtime
N
Mean
Median
Std. Deviation
Skewness
Std. Error of Skewness
Minimum
Maximum
Valid
Missing
476
0
6.4346
6.7286
.95059
-1.504
.112
2.67
8.66
Better but now slightly skewed to left!
Examples of skewed distributions
in Health Research
Discrete random variables – hospital admissions,
cigarettes smoked, alcohol consumption, costs
Continuous RV – BMI, cholesterol, BP
30%
The Binomial
Distribution
• ‘Binomial’ means ‘two numbers’.
• Outcomes of health research are often
•
•
measured by whether they have occurred
or not i.e binary.
For example, recovered from disease,
admitted to hospital, died, etc
May be modelled by assuming that the
number of events n has a binomial
distribution with a fixed probability of
event p
The Binomial
Distribution
• Based on work of Jakob Bernoulli, a Swiss
•
•
•
•
mathematician
Refused a church appointment and
instead studied mathematics
Early use was for games of chance but
now used in every human endeavour
When n = 1 this is called a Bernoulli trial
Binomial distribution is distribution for a
series of Bernoulli trials
The Binomial
Distribution
• Binomial distribution written as B ( n , p)
0.20
Probability of R Successes
•
where n is the total number of events and p
= prob of an event
This is a Binomial
Distribution with
p=0.25 and n=20
0.15
0.10
0.05
0.00
0
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
Successes
The Binomial
Distribution
Binomial distributions used for binary factors
and so used to assess percentages or
proportions
Utilised in Cross-tabulation and logistic
regression
Note as N gets larger or
P ~0.5 then Binomial is
Equal to Normal Distr.
B(n,p) ~ N (np, np(1-p))
The Poisson
Distribution
Poisson distribution (1838), named after its
inventor Simeon Poisson who was a French
mathematician. He found that if we have a
rare event (i.e. p is small) and we know the
expected or mean ( or µ) number of
occurrences, the probabilities of 0, 1, 2 ...
events are given by:
P( R ) 
e


R!
R
The Poisson
Distribution
Note similarity to Binomial
In fact when p is small and n is large
B(n, p) ~ P (µ = np)
Also for large values of µ:
P (µ) ~ N ( µ, µ )
Hence if n and p not known
could use Poisson instead
The Poisson
Distribution
In health research often used to model the
number of events assumed to be random:
Number of hip replacement failures,
Number of cases of C. diff infection,
Diagnoses of leukaemia around nuclear
power stations,
Number of H1N1 cases in Scotland,
Etc.
Summary
•Many of variables measured in Health Research form
distributions which approximate to common distributions
with known mathematical properties
40
•Normal, Poisson, Binomial, etc…
•Note a relationship for all centred
30
20
10
Std. Dev = .96
around the exponential distribution
Mean = -.04
N = 501.00
0
-2.95
-1.95
-.95
.05
1.05
2.05
RANNORM
Where e = 2.718
• All belong to the Exponential Family of distributions
• These probability distributions are critical to applying
statistical methods
SPSS Practical
• Read in data file ‘LDL Data.sav’
• Consider adherence to statins, baseline
•
•
•
LDL, min Chol achieved, BMI, duration of
statin use
Assess distributions for normality
If non-normal consider a transformation
Try to carry out Q-Q plots