Probability and Statistics
Basic concepts I
(from a physicist's point of view)
Benoit CLEMENT – Université J. Fourier / LPSC
[email protected]
Bibliography
Kendall's Advanced Theory of Statistics, Hodder Arnold Pub.
  volume 1: Distribution Theory, A. Stuart and K. Ord
  volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
  volume 2b: Bayesian Inference, A. O'Hagan, J. Forster
The Review of Particle Physics, K. Nakamura et al., J. Phys. G 37, 075021 (2010) (+ Booklet)
Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publications
Statistical Data Analysis, Glen Cowan, Oxford Science Publications
Analyse statistique des données expérimentales, K. Protassov, EDP Sciences
Probabilités, analyse des données et statistiques, G. Saporta, Technip
Analyse de données en sciences expérimentales, B.C., Dunod
Sample and population
SAMPLE: finite size, selected through a random process (e.g. the results of a measurement).
POPULATION: potentially infinite size (e.g. all possible results).
Characterizing the sample, the population and the drawing procedure
→ Probability theory (today's lecture)
Using the sample to estimate the characteristics of the population
→ Statistical inference (tomorrow's lecture)
Random process
A random process ("measurement" or "experiment") is a process whose outcome cannot be predicted with certainty.
It will be described by:
Universe: Ω = set of all possible outcomes.
Event: a logical condition on an outcome. It can either be true or false; an event splits the universe into 2 subsets.
An event A will be identified with the subset of Ω for which A is true.
Probability
A probability function P is defined by (Kolmogorov, 1933):
P : {Events} → [0,1]
    A ↦ P(A)
satisfying:
P(Ω) = 1
P(A or B) = P(A) + P(B) if "A and B" = Ø
Interpretation of this number:
- Frequentist approach: if we repeat the random process a great number of times n, and count the number of times nA the outcome satisfies event A, then the ratio
  lim (n→∞) nA/n = P(A)
defines a probability.
- Bayesian interpretation: a probability is a measure of the credibility associated with the event.
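As a small illustration of the frequentist limit (not part of the lecture; a minimal sketch using only the Python standard library, with illustrative names), one can estimate P(A) by the ratio nA/n for the event A = "a fair die shows 6":

```python
# Illustrative sketch: estimate P(A) as n_A/n for A = "a fair die shows 6",
# with an increasing number of throws n.
import random

random.seed(1)
p_true = 1 / 6
for n in (100, 10_000, 1_000_000):
    n_A = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(f"n = {n:>9}: n_A/n = {n_A / n:.4f}   (P(A) = {p_true:.4f})")
```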
Simple logic
Event "not A" is associated with the complement Ā:
P(Ā) = 1 – P(A)
P(Ø) = 1 – P(Ω) = 0
Event "A and B", event "A or B":
P(A or B) = P(A) + P(B) – P(A and B)
Conditional probability
If an event B is known to be true, one can restrict the universe to Ω' = B and define a new probability function on this universe, the conditional probability:
P(A|B) = "probability of A given B"
P(A|B) = P(A and B) / P(B)
Incompatibility and independence
Two incompatible events cannot be true simultaneously, so:
P(A and B) = 0
P(A or B) = P(A) + P(B)
Two events are independent if the realization of one is not linked in any way to the realization of the other:
P(A|B) = P(A) and P(B|A) = P(B)
⟹ P(A and B) = P(A)·P(B)
Bayes theorem
The definition of conditional probability leads to:
P(A and B) = P(A|B)·P(B) = P(B|A)·P(A)
hence relating P(A|B) to P(B|A) by Bayes' theorem:
P(B|A) = P(A|B)·P(B) / P(A)
Or, using a partition {Bi}:
P(Bi|A) = P(A|Bi)·P(Bi) / P(A) = P(A|Bi)·P(Bi) / Σj P(A and Bj) = P(A|Bi)·P(Bi) / Σj P(A|Bj)·P(Bj)
This theorem will play a major role in Bayesian inference: given data and a set of models, it translates into:
P(modeli | data) = P(data | modeli)·P(modeli) / Σj P(data | modelj)·P(modelj)
Application of Bayes
100 dice in a box: 70 are fair (A), 30 have a probability 1/3 to get a 6 (B).
You pick one die and throw it until you reach a 6, counting the number of tries. Repeating the process three times, you get 2, 4 and 1. What is the probability that the die is fair?
For one throw: P(n|A) = (1 – p6)^(n–1) · p6 = 5^(n–1)/6^n and P(n|B) = 2^(n–1)/3^n
Combining several throws (for one die, throws are independent):
P(n1 and n2 and n3 | A) = P(n1|A)·P(n2|A)·P(n3|A) = 5^(n1+n2+n3–3) / 6^(n1+n2+n3)
P(n1 and n2 and n3 | B) = 2^(n1+n2+n3–3) / 3^(n1+n2+n3)
P(A | n1,n2,n3) = P(n1,n2,n3|A)·P(A) / [ P(n1,n2,n3|A)·P(A) + P(n1,n2,n3|B)·P(B) ]
With n1+n2+n3 = 7:
P(A | n1,n2,n3) = (5^4/6^7)·0.7 / [ (5^4/6^7)·0.7 + (2^4/3^7)·0.3 ] ≈ 0.42
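A short numerical check of this slide (a sketch, not part of the lecture; function and variable names are illustrative) reproduces the posterior probability of about 0.42:

```python
# Sketch reproducing the dice example: 70% fair dice (A), 30% biased dice (B)
# with P(6) = 1/3.  The number of throws needed to reach the first 6 is geometric.
def p_n_given(n, p6):
    """Probability of needing exactly n throws to get the first 6."""
    return (1 - p6) ** (n - 1) * p6

throws = [2, 4, 1]                      # observed numbers of tries
like_A = like_B = 1.0
for n in throws:                        # for one die, throws are independent
    like_A *= p_n_given(n, 1 / 6)
    like_B *= p_n_given(n, 1 / 3)

prior_A, prior_B = 0.7, 0.3
post_A = like_A * prior_A / (like_A * prior_A + like_B * prior_B)
print(f"P(A | 2, 4, 1) = {post_A:.2f}")   # ~0.42
```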
Random variable
When the outcome of the random process is a number (real or integer), we associate to the random process a random variable X. Each realization of the process leads to a particular result: X = x; x is a realization of X.
For a discrete variable, probability law: p(x) = P(X=x)
For a real variable, P(X=x) = 0, so one uses:
Cumulative density function: F(x) = P(X<x)
dF = F(x+dx) – F(x) = P(X < x+dx) – P(X < x)
   = P(X < x or x < X < x+dx) – P(X < x)
   = P(X < x) + P(x < X < x+dx) – P(X < x)
   = P(x < X < x+dx) = f(x)dx
Probability density function (pdf): f(x) = dF/dx
Density function
Probability density function f(x):
∫ f(x)dx = P(Ω) = 1  (integral over the whole real axis)
Cumulative density function F(x):
F(a) = ∫ from –∞ to a of f(x)dx
By construction: F(–∞) = P(Ø) = 0 and F(+∞) = P(Ω) = 1
P(a ≤ X < b) = F(b) – F(a) = ∫ from a to b of f(x)dx
Note: discrete variables can also be described by a probability density function using Dirac distributions:
f(x) = Σi p(i)·δ(x – i), with Σi p(i) = 1
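As a quick numerical illustration (not part of the lecture; it assumes scipy is available, and the chosen distribution is just an example), one can check P(a ≤ X < b) = F(b) – F(a) against the integral of the pdf for a standard normal variable:

```python
# Minimal check of P(a <= X < b) = F(b) - F(a) for a standard normal variable.
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 2.0
by_cdf = norm.cdf(b) - norm.cdf(a)
by_pdf, _ = quad(norm.pdf, a, b)        # integral of f(x) between a and b
print(by_cdf, by_pdf)                   # the two values agree
```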
Change of variable
Probability density function of Y = φ(X).
For φ bijective:
• φ increasing: X<x ⟺ Y<y, so P(X<x) = FX(x) = P(Y<y) = FY(y) = FY(φ(x)) ⟹ fY(y) = dFX(x)/dy = f(x)/φ'(x)
• φ decreasing: X<x ⟺ Y>y, so FX(x) = 1 – FY(φ(x)) ⟹ fY(y) = –f(x)/φ'(x)
In both cases:
fY(y) = f(x)/|φ'(x)| = f(φ⁻¹(y)) / |φ'(φ⁻¹(y))|
If φ is not bijective: split it into several bijective parts φi:
fY(y) = Σi f(φi⁻¹(y)) / |φi'(φi⁻¹(y))|
Very useful for Monte Carlo: if X is uniformly distributed between 0 and 1, then Y = F⁻¹(X) has F as cumulative density function.
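A minimal sketch of this inverse-transform Monte Carlo trick (assuming numpy is available; the exponential target and the checked quantities are illustrative choices, not from the lecture):

```python
# Inverse transform: if X is uniform on [0,1], Y = F^(-1)(X) has cdf F.
# Example with the unit exponential, F(y) = 1 - exp(-y), so F^(-1)(x) = -log(1 - x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)
y = -np.log(1.0 - x)                    # exponential sample
print(y.mean(), y.var())                # both ~1 for the unit exponential
```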
Multidimensional PDF (1)
Random variables can be generalized to random vectors: X⃗ = (X1, X2, ..., Xn).
The probability density function becomes:
f(x⃗)dx⃗ = f(x1, x2, ..., xn) dx1 dx2 ... dxn
        = P(x1 ≤ X1 < x1+dx1 and x2 ≤ X2 < x2+dx2 ... and xn ≤ Xn < xn+dxn)
and
P(a ≤ X < b and c ≤ Y < d) = ∫ from a to b dx ∫ from c to d dy f(x,y)
Marginal density: probability of only one of the components:
fX(x)dx = P(x ≤ X < x+dx and –∞ < Y < +∞) = [∫ f(x,y)dy] dx  ⟹  fX(x) = ∫ f(x,y)dy
Multidimensional PDF (2)
For a fixed value of Y = y0:
f(x|y0)dx = "probability of x ≤ X < x+dx knowing that Y = y0"
is a conditional density for X. It is proportional to f(x,y), and ∫ f(x|y)dx = 1, so:
f(x|y) = f(x,y) / ∫ f(x,y)dx = f(x,y) / fY(y)
The two random variables X and Y are independent if all events of the form x<X<x+dx are independent from y<Y<y+dy:
f(x|y) = fX(x) and f(y|x) = fY(y), hence f(x,y) = fX(x)·fY(y)
Translated in terms of pdfs, Bayes' theorem becomes:
f(y|x) = f(x|y)·fY(y) / fX(x) = f(x|y)·fY(y) / ∫ f(x|y)·fY(y)dy
See A. Caldwell's lecture for detailed use of this formula for statistical inference.
Sample PDF
A sample is obtained from a random drawing within a population, described by a probability density function.
We're going to discuss how to characterize, independently from one another:
- a population
- a sample
To this end, it is useful to consider a sample as a finite set from which one can randomly draw elements, with equiprobability.
We can then associate to this process a probability density, the empirical density or sample density:
fsample(x) = (1/n) Σi δ(x – xi)
This density will be useful to translate properties of distributions to a finite sample.
Characterizing a distribution
How to reduce a distribution/sample to a finite number of values?
• Measure of location: reducing the distribution to one central value → result
• Measure of dispersion: spread of the distribution around the central value → uncertainty/error
• Higher order measures of shape
• Frequency table/histogram (for a finite sample)
Measure of location
Mean value: sum (integral) of all possible values weighted by the probability of occurrence:
population: μ = ⟨x⟩ = ∫ x f(x)dx
sample (size n): μ = x̄ = (1/n) Σ xi
Median: value that splits the distribution into 2 equiprobable parts:
∫ from –∞ to med(x) of f(x)dx = ∫ from med(x) to +∞ of f(x)dx = 1/2
For an ordered sample x1 ≤ x2 ≤ ... ≤ xn:
med(x) = x(n+1)/2 for odd n, ½(xn/2 + x1+n/2) for even n
Mode: the most probable value = maximum of the pdf:
df/dx = 0 and d²f/dx² < 0 at x = mod(x)
Measure of dispersion
Standard deviation (σ) and variance (v = σ²): mean value of the squared deviation from the mean:
population: v = σ² = ∫ (x – μ)² f(x)dx
sample (size n): v = σ² = (1/n) Σ (xi – μ)²
Koenig's theorem:
σ² = ∫ x² f(x)dx – 2μ ∫ x f(x)dx + μ² ∫ f(x)dx
σ² = ⟨x²⟩ – μ² = ⟨x²⟩ – ⟨x⟩²
Interquartile difference: generalizes the median by splitting the distribution into 4:
∫ from –∞ to q1 = ∫ from q1 to q2 = ∫ from q2 to q3 = ∫ from q3 to +∞ of f(x)dx = 1/4
med(x) = q2, δ = q3 – q1
Others...
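A small numerical sketch of these estimators (assuming numpy is available; the generated sample and names are illustrative), including a check of Koenig's theorem on the sample density:

```python
# Sample location/dispersion sketch; Koenig's theorem: <x^2> - <x>^2 = variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=10_000)
mean, median = x.mean(), np.median(x)
var_direct = ((x - mean) ** 2).mean()
var_koenig = (x ** 2).mean() - mean ** 2
print(mean, median, var_direct, var_koenig)   # the two variances coincide
```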
Bienaymé-Chebyshev
Consider the interval Δ = ]–∞, μ–a[ ∪ ]μ+a, +∞[.
Then for x ∈ Δ: ((x–μ)/a)² ≥ 1, so ((x–μ)/a)² f(x) ≥ f(x), hence:
∫ over Δ of ((x–μ)/a)² f(x)dx ≥ ∫ over Δ of f(x)dx = P(|X–μ| > a)
and
∫ over Δ of ((x–μ)/a)² f(x)dx ≤ ∫ over the whole axis of ((x–μ)/a)² f(x)dx = σ²/a²
so σ²/a² ≥ P(|X–μ| > a).
Replacing a by aσ gives Bienaymé-Chebyshev's inequality: P(|X–μ| < aσ) ≥ 1 – 1/a²
It gives a bound on the confidence level of the interval μ ± aσ.

a                     1      2      3      4        5
Chebyshev's bound     0      0.75   0.889  0.938    0.96
Normal distribution   0.683  0.954  0.997  0.99996  0.9999994
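The table above can be reproduced numerically (a sketch, assuming scipy is available) by comparing the Chebyshev bound 1 – 1/a² with the actual coverage of a normal distribution:

```python
# Compare the Chebyshev bound with the coverage of mu +- a*sigma for a Gaussian.
from scipy.stats import norm

for a in (1, 2, 3, 4, 5):
    chebyshev = max(0.0, 1 - 1 / a**2)
    gaussian = norm.cdf(a) - norm.cdf(-a)
    print(f"a = {a}: Chebyshev >= {chebyshev:.3f}, normal coverage = {gaussian:.7f}")
```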
Multidimensional case
A random vector (X,Y) can be treated as 2 separate variables: mean and variance for each variable: μX, μY, σX, σY. This doesn't take into account correlations between the variables.
[Figure: scatter plots for ρ = –0.5, ρ = 0, ρ = 0.9]
Generalized measure of dispersion: covariance of X and Y:
population: Cov(X,Y) = ∫ (x – μX)(y – μY) f(x,y)dxdy = ρσXσY = μXY – μXμY
sample: Cov(X,Y) = (1/n) Σ (xi – μX)(yi – μY)
Correlation: ρ = Cov(X,Y) / (σXσY). Uncorrelated: ρ = 0.
Independent ⟹ uncorrelated, but not the converse: ρ only quantifies linear correlation.
Regression
Measure of location:
• a point: (μX, μY)
• a curve: the line closest to the points → linear regression
Minimizing the dispersion between the line y = ax + b and the distribution:
w(a,b) = ∫ (y – ax – b)² f(x,y)dxdy  [ = (1/n) Σi (yi – axi – b)² for a sample ]
∂w/∂a = 0 ⟹ ∫ x(y – ax – b) f(x,y)dxdy = 0
∂w/∂b = 0 ⟹ ∫ (y – ax – b) f(x,y)dxdy = 0
⟹ a(σX² + μX²) + bμX = ρσXσY + μXμY and aμX + b = μY
⟹ a = ρ σY/σX and b = μY – ρ (σY/σX) μX
Then Y = aX + b.
Fully correlated: ρ = 1; fully anti-correlated: ρ = –1.
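A minimal sketch of this regression-from-moments result (assuming numpy is available; the toy sample and its generating values 1.5 and 2.0 are illustrative):

```python
# Linear regression from moments: a = rho*sigma_Y/sigma_X, b = mu_Y - a*mu_X.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=5000)

mu_x, mu_y = x.mean(), y.mean()
sig_x, sig_y = x.std(), y.std()
rho = ((x - mu_x) * (y - mu_y)).mean() / (sig_x * sig_y)

a = rho * sig_y / sig_x
b = mu_y - a * mu_x
print(a, b)        # close to the generating values 1.5 and 2.0
```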
Decorrelation
Covariance matrix for n variables Xi: Σij = Cov(Xi, Xj):

Σ = ( σ1²        ρ12σ1σ2    ...   ρ1nσ1σn )
    ( ρ12σ1σ2    σ2²        ...   ρ2nσ2σn )
    ( ...        ...        ...   ...     )
    ( ρ1nσ1σn    ρ2nσ2σn    ...   σn²     )

For uncorrelated variables Σ is diagonal.
Σ is a real and symmetric matrix: it can be diagonalized → one can define n new uncorrelated variables Yi:

Σ' = B⁻¹ΣB = ( σ'1²   0      ...   0    )        Y = BX
             ( 0      σ'2²   ...   0    )
             ( ...    ...    ...   ...  )
             ( 0      0      ...   σ'n² )

The σ'i² are the eigenvalues of Σ, and B contains the orthonormal eigenvectors.
The Yi are the principal components. Sorted from the largest to the smallest σ', they allow dimensional reduction.
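A minimal decorrelation sketch (assuming numpy is available; the 2-variable toy covariance is illustrative): diagonalize the sample covariance matrix and check that the rotated variables are uncorrelated.

```python
# Decorrelation: the rotated variables Y = B^T X are uncorrelated,
# with variances given by the eigenvalues of the covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
x = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 2.0]], size=20_000)

cov = np.cov(x, rowvar=False)
eigval, B = np.linalg.eigh(cov)         # columns of B = orthonormal eigenvectors
y = x @ B                               # principal components
print(eigval)                           # sigma'^2 of the new variables
print(np.cov(y, rowvar=False).round(3)) # ~diagonal covariance
```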
Moments
For any function g(x), the expectation of g is:
E[g(X)] = ∫ g(x) f(x)dx — it is the mean value of g.
Moments μk are the expectations of X^k:
0th moment: μ0 = 1 (pdf normalization)
1st moment: μ1 = μ (mean)
X' = X – μ1 is a central variable
2nd central moment: μ'2 = σ² (variance)
Characteristic function: φ(t) = E[e^(itX)] = ∫ f(x) e^(itx) dx = FT⁻¹[f]
From the Taylor expansion:
φ(t) = Σk ∫ ((itx)^k / k!) f(x)dx = Σk ((it)^k / k!) μk  ⟹  μk = i^(–k) d^kφ/dt^k at t = 0
The pdf is entirely defined by its moments.
CF: a useful tool for demonstrations.
Skewness and kurtosis
Reduced variable: X'' = (X – μ)/σ = X'/σ
Measure of asymmetry:
3rd reduced moment: μ''3 = √β1 = γ1: skewness
γ1 = 0 for a symmetric distribution; then mean = median.
Measure of shape:
4th reduced moment: μ''4 = β2 = γ2 + 3: kurtosis
For the normal distribution β2 = 3 and γ2 = 0.
Generalized Koenig's theorem:
μ'n = (–1)^n (1 – n) μ1^n + Σ from k=2 to n of [n! / (k!(n–k)!)] (–μ1)^(n–k) μk
μ''n = μ'n / (μ'2)^(n/2)
Skewness and kurtosis (2)
[Figure slide]
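A quick numerical sketch of skewness and kurtosis (assuming numpy and scipy are available; the two sample distributions are illustrative choices):

```python
# Skewness/kurtosis: a normal sample has both ~0; an exponential sample is
# asymmetric (skewness ~2, excess kurtosis gamma2 ~6).
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(4)
gauss = rng.normal(size=100_000)
expo = rng.exponential(size=100_000)
print(skew(gauss), kurtosis(gauss))   # kurtosis() returns the excess kurtosis gamma2
print(skew(expo), kurtosis(expo))
```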
Discrete distributions
Binomial distribution: randomly choosing K objects within a finite set of n, with a fixed drawing probability p.
Variable: K
Parameters: n, p
Law: P(k; n, p) = [n! / (k!(n–k)!)] p^k (1–p)^(n–k)
Mean: np
Variance: np(1–p)
[Figure: example with n = 10, p = 0.65]
Poisson distribution: limit of the binomial when n→+∞, p→0, np = λ. Counting events with a fixed probability per time/space unit.
Variable: K
Parameter: λ
Law: P(k; λ) = e^(–λ) λ^k / k!
Mean: λ
Variance: λ
[Figure: example with λ = 6.5]
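A short check of these properties and of the Poisson limit of the binomial (a sketch, assuming scipy is available; the chosen n, p and λ are illustrative):

```python
# Binomial/Poisson: mean and variance, and the Poisson limit for large n, small p.
from scipy.stats import binom, poisson

n, p = 10, 0.65
print(binom.mean(n, p), binom.var(n, p))        # np, np(1-p)

lam = 6.5
print(poisson.mean(lam), poisson.var(lam))      # both lambda

# Poisson limit: a binomial with n = 10000, p = lambda/n is close to Poisson(lambda).
print(binom.pmf(5, 10_000, lam / 10_000), poisson.pmf(5, lam))
```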
Real distributions
Uniform distribution: equiprobability over a finite range [a,b].
Parameters: a, b
Law: f(x; a, b) = 1/(b–a) if a ≤ x ≤ b
Mean: μ = (a+b)/2
Variance: v = σ² = (b–a)²/12
Normal distribution (Gaussian): limit of many processes.
Parameters: μ, σ
Law: f(x; μ, σ) = [1/(σ√(2π))] e^(–(x–μ)²/(2σ²))
Chi-square distribution: sum of the squares of n reduced normal variables.
Variable: C = Σ from k=1 to n of ((Xk – μXk)/σXk)²
Parameter: n
Law: f(c; n) = (1/2)^(n/2) c^(n/2–1) e^(–c/2) / Γ(n/2)
Mean: n
Variance: 2n
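A small sketch of the chi-square construction (assuming numpy is available; n = 5 and the sample size are illustrative), checking the mean n and variance 2n:

```python
# Chi-square: sum of the squares of n reduced normal variables.
import numpy as np

n = 5
rng = np.random.default_rng(5)
c = (rng.normal(size=(100_000, n)) ** 2).sum(axis=1)
print(c.mean(), c.var())        # ~n = 5 and ~2n = 10
```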
Convergence
[Figure slide]
Multidimensional Pdfs
Multinomial distribution: randomly choosing K1, K2, ..., Ks objects within a finite set of n, with a fixed drawing probability for each category p1, p2, ..., ps, with ΣKi = n and Σpi = 1.
Parameters: n, p1, p2, ..., ps
Law: P(k⃗; n, p⃗) = [n! / (k1! k2! ... ks!)] p1^k1 p2^k2 ... ps^ks
Mean: μi = n·pi
Variance: σi² = n·pi(1–pi), Cov(Ki, Kj) = –n·pi·pj
Rem: the variables are not independent. The binomial corresponds to s = 2, but has only one independent variable.
Multinormal distribution:
Parameters: μ⃗, Σ
Law: f(x⃗; μ⃗, Σ) = [1/√((2π)^n |Σ|)] e^(–½ (x⃗–μ⃗)ᵀ Σ⁻¹ (x⃗–μ⃗))
If uncorrelated: f(x⃗; μ⃗, Σ) = Πi [1/(σi√(2π))] e^(–(xi–μi)²/(2σi²))
so for the multinormal distribution, uncorrelated ⟺ independent.
Sum of random variables
The sum of several random variables is a new random variable S = Σ from i=1 to n of Xi.
Assuming the mean and variance of each variable exist:
Mean value of S:
μS = ∫ (Σi xi) f(x1,...,xn) dx1...dxn = Σi ∫ xi fXi(xi) dxi = Σi μi
The mean is an additive quantity.
Variance of S:
σS² = ∫ (Σi (xi – μXi))² f(x1,...,xn) dx1...dxn = Σi σXi² + 2 Σ over pairs i<j of Cov(Xi, Xj)
For uncorrelated variables, the variance is additive: σS² = Σi σXi²
→ used for error combinations.
Sum of random variables
Probability density function of S: fS(s).
Using the characteristic function:
φS(t) = ∫ fS(s) e^(ist) ds = ∫ f(x⃗) e^(it Σ xi) dx⃗
For independent variables:
φS(t) = Πk ∫ fXk(xk) e^(itxk) dxk = Πk φXk(t)
The characteristic function factorizes.
Finally the pdf is the Fourier transform of the cf, so:
fS = fX1 ⊗ fX2 ⊗ ... ⊗ fXn
The pdf of the sum is a convolution.
Sum of normal variables → normal
Sum of Poisson variables (λ1 and λ2) → Poisson, λ = λ1 + λ2
Sum of chi-square variables (n1 and n2) → chi-square, n = n1 + n2
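The convolution statement can be checked numerically for the Poisson case (a sketch, assuming numpy and scipy are available; λ1, λ2 and the truncation at k = 60 are illustrative):

```python
# The pmf of the sum of two independent Poisson variables is the convolution of
# the two pmfs, and equals a Poisson pmf with lambda1 + lambda2.
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 4.5
k = np.arange(60)
convolved = np.convolve(poisson.pmf(k, lam1), poisson.pmf(k, lam2))[:60]
direct = poisson.pmf(k, lam1 + lam2)
print(np.abs(convolved - direct).max())    # ~0: the two distributions agree
```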
Sum of independent variables
Weak law of large numbers:
A sample of size n is a realization of n independent variables with the same distribution (mean μ, variance σ²).
The sample mean is a realization of M = S/n = (1/n) Σ Xi.
Mean value of M: μM = μ
Variance of M: σM² = σ²/n
Central-limit theorem:
n independent random variables of mean μi and variance σi².
Sum of the reduced variables: C = (1/√n) Σ (Xi – μi)/σi
The pdf of C converges to a reduced normal distribution:
fC(c) → (1/√(2π)) e^(–c²/2)  as n → ∞
The sum of many random fluctuations is normally distributed.
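A minimal check of the law of large numbers (assuming numpy is available; the Gaussian population, σ = 2 and the sample sizes are illustrative): the spread of the sample mean shrinks as σ/√n.

```python
# Weak law of large numbers: std of the sample mean = sigma/sqrt(n).
import numpy as np

rng = np.random.default_rng(6)
sigma = 2.0
for n in (10, 100, 1000):
    means = rng.normal(scale=sigma, size=(10_000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # the two numbers agree
```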
Central limit theorem
Naive demonstration:
For each Xi, the reduced variable X''i has mean 0 and variance 1, so its characteristic function is:
φX''i(t) = 1 – t²/2 + o(t²)
Hence the characteristic function of C:
φC(t) = Πi φX''i(t/√n) = (1 – t²/(2n) + o(t²/n))^n
For n large:
lim (n→∞) φC(t) = lim (n→∞) (1 – t²/(2n))^n = e^(–t²/2) = FT⁻¹[fC]
This is a naive demonstration, because we assumed that the moments were defined. For the CLT, only the mean and variance are required (the proof is then much more complex).
Central limit theorem
[Figure slide: histograms of X1, (X1+X2)·√2, (X1+X2+X3)·√3 and (X1+X2+X3+X4+X5)·√5, each compared to a Gaussian]
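In the same spirit as these figures, a minimal sketch (assuming numpy is available; the uniform starting distribution and sample sizes are illustrative) shows that the normalized sum of a few variables already has the moments of a reduced Gaussian:

```python
# CLT sketch: the normalized sum of n uniform variables approaches a reduced normal.
import numpy as np

rng = np.random.default_rng(7)
for n in (1, 2, 3, 5):
    x = rng.uniform(-0.5, 0.5, size=(200_000, n))      # mean 0, variance 1/12
    c = x.sum(axis=1) / np.sqrt(n / 12)                 # reduced sum
    print(n, c.mean().round(3), c.var().round(3))       # mean -> 0, variance -> 1
```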
Dispersion and uncertainty
Any measurement (or combination of measurements) is a realization of a random variable.
- Measured value: θ
- True value: θ0
Uncertainty = quantifying the difference between θ and θ0 → a measure of dispersion.
We will postulate: Δθ = ασθ (absolute error, always positive).
Usually one distinguishes:
- Statistical error: due to the measurement pdf.
- Systematic errors or biases → fixed but unknown deviations (equipment, assumptions, ...).
Systematic errors can be seen as statistical errors in a set of similar experiments.
Error sources
Observation error: ΔO; position error: ΔP; scaling error: ΔS
θ = θ0 + δO + δS + δP
Each δi is a realization of a random variable with mean 0 (negligible) and variance σi². For uncorrelated error sources:
ΔO = ασO, ΔS = ασS, ΔP = ασP ⟹ Δtot² = (ασtot)² = α²(σO² + σS² + σP²) = ΔO² + ΔS² + ΔP²
Choice of α? If there are many sources, from the central-limit theorem → normal distribution:
α = 1 gives (approximately) a 68% confidence interval
α = 2 gives 95% CL (and at least 75% from Bienaymé-Chebyshev)
Error propagation
Measure: x ± Δx. Compute f(x) → Δf?
[Figure: curve f(x), its tangent of slope df/dx, and the intervals Δx and Δf]
Assuming small errors, using a Taylor expansion:
f(x + Δx) = f(x) + (df/dx)Δx + ½(d²f/dx²)Δx² + (1/6)(d³f/dx³)Δx³ + (1/24)(d⁴f/dx⁴)Δx⁴ + ...
f(x – Δx) = f(x) – (df/dx)Δx + ½(d²f/dx²)Δx² – (1/6)(d³f/dx³)Δx³ + (1/24)(d⁴f/dx⁴)Δx⁴ – ...
⟹ Δf = ½[f(x + Δx) – f(x – Δx)] = (df/dx)Δx + (1/6)(d³f/dx³)Δx³ + ... ≈ |df/dx| Δx
Error propagation
Measure: x ± Δx, y ± Δy, ...  Compute f(x, y, ...) → Δf?
Idea: treat the effect of each variable as a separate error source.
zm = f(xm, ym)
[Figure: surface z = f(x,y); for fixed ym, the curve z = f(x, ym) has slope ∂f/∂x, turning the interval 2Δx into 2Δxf]
Δxf = |∂f/∂x| Δx,  Δyf = |∂f/∂y| Δy
Then:
Δf² = Δxf² + Δyf² + 2ρxy Δxf Δyf = (∂f/∂x)²Δx² + (∂f/∂y)²Δy² + 2ρxy (∂f/∂x)(∂f/∂y)ΔxΔy
In general:
Δf² = Σi (∂f/∂xi · Δxi)² + Σ over pairs i≠j of ρxixj (∂f/∂xi)(∂f/∂xj) Δxi Δxj
Uncorrelated: Δf² = Σi (∂f/∂xi · Δxi)²
Fully correlated (ρ = 1): Δf = |∂f/∂x| Δx + |∂f/∂y| Δy
Fully anti-correlated (ρ = –1): Δf = | |∂f/∂x| Δx – |∂f/∂y| Δy |
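A minimal error-propagation sketch (assuming numpy is available; f(x,y) = x·y, the measured values and the finite-difference step are illustrative), comparing the uncorrelated, fully correlated and fully anti-correlated combinations:

```python
# Error propagation for f(x, y) = x*y with numerical partial derivatives.
import numpy as np

def f(x, y):
    return x * y

x, dx = 10.0, 0.2
y, dy = 3.0, 0.1
h = 1e-6
dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)

uncorrelated = np.sqrt((dfdx * dx) ** 2 + (dfdy * dy) ** 2)
correlated = abs(dfdx) * dx + abs(dfdy) * dy
anticorrelated = abs(abs(dfdx) * dx - abs(dfdy) * dy)
print(uncorrelated, correlated, anticorrelated)
```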