Review of Probability


Probability Review
CS479/679 Pattern Recognition
Dr. George Bebis
Why Bother About Probabilities?
• Accounting for uncertainty is a crucial component
in decision making (e.g., classification) because of
ambiguity in our measurements.
• Probability theory is the proper mechanism for
accounting for uncertainty.
• Take into account a-priori knowledge, for example:
"If the fish was caught in the Atlantic ocean, then
it is more likely to be salmon than sea bass."
Definitions
• Random experiment
– An experiment whose result is not certain in advance
(e.g., throwing a die)
• Outcome
– The result of a random experiment
• Sample space
– The set of all possible outcomes (e.g., {1,2,3,4,5,6})
• Event
– A subset of the sample space (e.g., obtain an odd
number in the experiment of throwing a die = {1,3,5})
Intuitive Formulation of Probability
• Intuitively, the probability of an event α could be defined
as:
P(α) = lim_{n→∞} N(α)/n
where N(α) is the number of times that event α happens
in n trials
• Assumes that all outcomes are equally likely (Laplacian
definition)
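A minimal sketch (not from the original slides) that estimates a probability by this relative-frequency definition, using simulated die throws and the event "odd number":

```python
import random

# Estimate P(odd) for a fair die by relative frequency: N(alpha)/n.
n = 100_000
N_alpha = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 1)

print(N_alpha / n)  # approaches 3/6 = 0.5 as n grows
```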
Axioms of Probability
• (1) P(A) ≥ 0 for any event A
• (2) P(S) = 1
• (3) If A and B are mutually exclusive (A ∩ B = ∅), then
P(A ∪ B) = P(A) + P(B)
[Figure: Venn diagram of a sample space S partitioned into events A1, A2, A3, A4]
Unconditional (Prior) Probability
• This is the probability of an event in the absence
of any evidence.
P(Cavity)=0.1
means that
“in the absence of any other information, there is a
10% chance that the patient has a cavity”.
Conditional (Posterior) Probability
• This is the probability of an event given some
evidence.
P(Cavity/Toothache)=0.8 means that
“there is an 80% chance that the patient has a
cavity given that he has a toothache”.
Conditional (Posterior) Probability
(cont’d)
• Conditional probabilities are defined as follows:
P(A/B) = P(A,B)/P(B),   P(B/A) = P(A,B)/P(A)
• The above definitions lead to the chain rule:
P(A,B)=P(A/B)P(B)=P(B/A)P(A)
Law of Total Probability
• If A1, A2, …, An is a partition of mutually exclusive events
and B is any event, then:
P(B) = P(B,A1) + P(B,A2) + ... + P(B,An)
= P(B/A1)P(A1) + P(B/A2)P(A2) + ... + P(B/An)P(An)
= Σ_{j=1}^{n} P(B/Aj)P(Aj)
[Figure: Venn diagram of a sample space S partitioned into events A1, A2, A3, A4]
• Special case, using the partition {A, ¬A}:
P(B) = P(B,A) + P(B,¬A) = P(B/A)P(A) + P(B/¬A)P(¬A)
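A quick numeric check of the special case, with illustrative probabilities chosen only for this example:

```python
# Illustrative numbers (not from the slides): a two-event partition {A, not-A}.
P_A = 0.3
P_not_A = 1 - P_A
P_B_given_A = 0.9      # P(B/A), assumed for the example
P_B_given_not_A = 0.2  # P(B/not-A), assumed for the example

# Law of total probability: P(B) = P(B/A)P(A) + P(B/not-A)P(not-A)
P_B = P_B_given_A * P_A + P_B_given_not_A * P_not_A
print(P_B)  # 0.9*0.3 + 0.2*0.7 = 0.41
```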
Bayes’ Rule
• Using the definition of conditional probabilities leads
to Bayes’ rule:
P(A/B) = P(A,B)/P(B),   P(B/A) = P(A,B)/P(A)
which combine to give:
P(A/B) = P(B/A)P(A) / P(B)
where:
P(B) = P(B,A) + P(B,¬A) = P(B/A)P(A) + P(B/¬A)P(¬A)
Example
• Consider the probability of Disease given Symptom:
P(Disease/Symptom) = P(Symptom/Disease) P(Disease) / P(Symptom)
where:
P(Symptom) = P(Symptom/Disease) P(Disease) + P(Symptom/¬Disease) P(¬Disease)
Example (cont’d)
• Meningitis causes a stiff neck 50% of the time.
• A patient comes in with a stiff neck – what is the probability
that he has meningitis?
P(M/S) = P(S/M) P(M) / P(S)
where P(S/M)=0.5
• Need to know the following:
– The prior probability of a patient having meningitis (P(M)=1/50,000)
– The prior probability of a patient having a stiff neck (P(S)=1/20)
P(M/S)=0.0002
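The same computation, using only the numbers given on this slide:

```python
# The slide's numbers: P(S/M) = 0.5, P(M) = 1/50,000, P(S) = 1/20.
P_S_given_M = 0.5
P_M = 1 / 50_000
P_S = 1 / 20

# Bayes' rule: P(M/S) = P(S/M) P(M) / P(S)
P_M_given_S = P_S_given_M * P_M / P_S
print(P_M_given_S)  # ≈ 0.0002
```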
General Form of Bayes’ Rule
• If A1, A2, …, An is a partition of mutually exclusive events
and B is any event, then the Bayes’ rule is given by:
P(Ai/B) = P(B/Ai) P(Ai) / P(B)
where P(B) = Σ_{j=1}^{n} P(B/Aj) P(Aj)
Independence
• Two events A and B are independent iff:
P(A,B)=P(A)P(B)
• Using independence, we can show that:
P(A/B)=P(A) and P(B/A)=P(B)
• A and B are conditionally independent given C iff:
P(A/B,C)=P(A/C)
e.g., P(WetGrass/Season,Rain)=P(WetGrass/Rain)
Random Variables
• In many experiments, it is easier to deal with a summary
variable than with the original probability structure.
Example
• In an opinion poll, we ask 50 people whether they agree or
disagree with a certain issue.
– Suppose we record a "1" for agree and a "0" for disagree.
– The sample space for this experiment has 2^50 elements.
• Suppose we are only interested in the number of people
who agree.
– Define the variable X = number of "1"s recorded out of 50.
– The new sample space has only 51 elements.
Random Variables (cont’d)
• A random variable (r.v.) is a function that assigns a value
to the outcome of a random experiment.
[Figure: a random variable X maps each outcome sj of a random experiment to a value X(sj)]
Random Variables (cont’d)
• How do we compute probabilities using random variables?
– Suppose the sample space is S = {s1, s2, …, sn}
– Suppose the range of the random variable X is {x1, x2, …, xm}
– We observe X = xj iff the outcome of the random experiment
is an sj ∈ S such that X(sj) = xj, that is:
P(X = xj) = Σ_{sj : X(sj) = xj} P(sj)
Example
• Consider the experiment of throwing a pair of dice:
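A sketch of this example in Python, assuming the usual choice X = sum of the two dice (the slide's figure is not reproduced here):

```python
from collections import Counter
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of a pair of dice.
outcomes = list(product(range(1, 7), repeat=2))

# Define the r.v. X = sum of the two dice, and derive its pmf by
# summing P(s) = 1/36 over all outcomes s with X(s) = x.
counts = Counter(a + b for a, b in outcomes)
pmf = {x: c / 36 for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x:2d}) = {p:.4f}")
```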
Discrete/Continuous Random Variables
• A discrete r.v. can assume only discrete values.
• A continuous r.v. can assume a continuous range of values
(e.g., sensor readings).
• Probability mass/density function (pmf/pdf)
• Probability distribution function (PDF)
Probability mass function (pmf)
• The pmf of a discrete r.v. X assigns a probability to
each possible value x of X.
• Notation: given two r.v.'s, X and Y, their pmfs are denoted as
pX(x) and pY(y); for convenience, we will drop the subscripts and
denote them as p(x) and p(y). However, keep in mind that these
functions are different!
Probability density function (pdf)
• The pdf of a continuous r.v. X shows the
probability of being “close” to some number x.
• Notation: given two r.v.'s, X and Y, their pdfs are denoted as
pX(x) and pY(y); for convenience, we will drop the subscripts and
denote them as p(x) and p(y). However, keep in mind that these
functions are different!
Probability Distribution Function (PDF)
• Definition:
F(x) = P(X ≤ x)
• Some properties of PDF:
– (1) 0 ≤ F(x) ≤ 1
– (2) F(x) is a non-decreasing function of x
• If X is discrete, its PDF can be computed as follows:
F(x) = P(X ≤ x) = Σ_{k=0}^{x} P(X = k) = Σ_{k=0}^{x} p(k)
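For a discrete r.v., this is just a cumulative sum; a small sketch with an illustrative pmf (values chosen only for the example):

```python
import numpy as np

# A discrete pmf on k = 0..5 (illustrative values, not from the slides).
p = np.array([0.1, 0.2, 0.3, 0.2, 0.15, 0.05])

# PDF (cumulative distribution): F(x) = sum of p(k) for k <= x.
F = np.cumsum(p)

print(F)      # non-decreasing, values in [0, 1]
print(F[-1])  # ~1.0: the pmf sums to one
```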
Probability Distribution Function (PDF)
(cont’d)
[Figure: a pmf and the corresponding staircase PDF F(x)]
Probability Distribution Function (PDF)
(cont’d)
• If X is continuous, its PDF can be computed as follows:
F(x) = ∫_{−∞}^{x} p(t) dt   for all x
• The pdf can be obtained from the PDF using:
p(x) = dF(x)/dx
Example
• Gaussian pdf and PDF
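A sketch evaluating both functions for the Gaussian case, using the standard closed form of the Gaussian PDF in terms of the error function:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """pdf of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """PDF F(x) = P(X <= x), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(gaussian_pdf(0.0))  # ~0.3989 = 1/sqrt(2*pi)
print(gaussian_cdf(0.0))  # 0.5: half the mass lies below the mean
```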
Joint pmf (discrete case)
• For n random variables, the joint pmf assigns a
probability for each possible combination of
values:
p(x1,x2,…,xn)=P(X1=x1, X2=x2, …, Xn=xn)
Σ_{x1, x2, …, xn} p(x1, x2, …, xn) = 1
Notation: the joint pmfs/pdfs of the r.v.'s X1, X2, ..., Xn and Y1,
Y2, ..., Yn are denoted as pX1X2...Xn(x1,x2,...,xn) and
pY1Y2...Yn(y1,y2,...,yn); for convenience, we will drop the subscripts
and denote them p(x1,x2,...,xn) and p(y1,y2,...,yn); keep in mind,
however, that these are two different functions.
Joint pmf (discrete case) (cont’d)
• Specifying the joint pmf requires a large number of values
– k^n assuming n random variables where each one can
assume one of k discrete values.
– Can be simplified if we assume independence or
conditional independence.
P(Cavity, Toothache) is a 2 x 2 matrix:

Joint Probability   Toothache   Not Toothache
Cavity              0.04        0.06
Not Cavity          0.01        0.89

Sum of probabilities = 1.0
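The table above can be manipulated directly; the sketch below marginalizes and conditions on it (note that P(Cavity/Toothache) = 0.04/0.05 = 0.8, matching the earlier slide):

```python
import numpy as np

# The slide's 2 x 2 joint pmf: rows = Cavity / Not Cavity,
# columns = Toothache / Not Toothache.
joint = np.array([[0.04, 0.06],
                  [0.01, 0.89]])

print(joint.sum())  # 1.0: the probabilities sum to one

# Marginalization: sum out one variable.
P_cavity = joint[0].sum()        # P(Cavity) = 0.04 + 0.06 = 0.10
P_toothache = joint[:, 0].sum()  # P(Toothache) = 0.04 + 0.01 = 0.05
print(P_cavity, P_toothache)

# Conditioning: P(Cavity/Toothache) = P(Cavity,Toothache) / P(Toothache)
print(joint[0, 0] / P_toothache)  # 0.8, matching the earlier slide
```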
Joint pdf (continuous r.v.)
For n random variables, the joint pdf gives the probability of
being “close” to a given combination of values:
p( x1 , x2 ,..., xn )  0
 ...  p( x , x ,..., x )dx ...dx
1
x1
2
n
1
n
1
xn
Conditional pdf
• The conditional pdf is defined as follows:
p(y/x) = p(x,y) / p(x)
• Using the definition of conditional pdfs leads to
the chain rule:
p(x,y) = p(y/x) p(x)
• General case:
p(x1, x2, …, xn) = p(x1/x2, …, xn) p(x2/x3, …, xn) … p(xn−1/xn) p(xn)
Independence
• Knowledge about independence between r.v.'s is
very useful because it simplifies things greatly!
e.g., if X and Y are independent, then:
p ( x, y )  p ( x ) p ( y )
2D
1D
Law of Total Probability
• The law of total probability:
p( y )   p( y / x) p( x)
x
Marginalization
• Given the joint pmf/pdf, we can compute the pmf/pdf of
any subset of variables using marginalization:
– Examples:
p(x) = ∫_{−∞}^{∞} p(x, y) dy
p(x1, …, x_{i−1}, x_{i+1}, …, xn) = ∫_{−∞}^{∞} p(x1, x2, …, xn) dxi
p(x1, x2) = ∫_{x3} … ∫_{xn} p(x1, x2, …, xn) dx3 … dxn
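A numerical sketch of the first example, using an illustrative joint pdf (two independent standard normals) and a Riemann-sum approximation of the integral:

```python
import numpy as np

# Illustrative joint pdf (not from the slides): two independent
# standard normals, so p(x, y) = p(x) p(y).
xs = np.linspace(-6, 6, 601)
ys = np.linspace(-6, 6, 601)
X, Y = np.meshgrid(xs, ys, indexing="ij")
joint = np.exp(-0.5 * (X**2 + Y**2)) / (2 * np.pi)

# Marginalization: p(x) = integral over y of p(x, y) dy,
# approximated here by a Riemann sum on the grid.
dy = ys[1] - ys[0]
marginal = joint.sum(axis=1) * dy

# Compare against the analytic 1D standard normal pdf.
analytic = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)
print(np.abs(marginal - analytic).max())  # small (grid/truncation error)
```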
Normal (Gaussian) Distribution
• 1-dimensional:
p(x) = (1/(√(2π) σ)) exp[−(x − μ)²/(2σ²)]
• d-dimensional:
p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp[−(1/2)(x − μ)ᵗ Σ⁻¹ (x − μ)]
where μ is the d x 1 mean vector and Σ is the d x d symmetric covariance matrix
Normal (Gaussian) Distribution (cont’d)
• Parameters and shape of Gaussian distribution
d ( d 1)
– Number of parameters is d  2
– Location determined by μ; shape determined by Σ
Normal (Gaussian) Distribution (cont’d)
• Mahalanobis distance:
r  (x   )  (x   )
t
1
• If the variables are independent, the multivariate
normal distribution becomes:
( xi  i ) 2
p ( x)  
exp[
]
2
2
i
2 i
2 i
1
Σ is diagonal
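A sketch that evaluates the d-dimensional Gaussian pdf via the Mahalanobis distance, with illustrative parameters:

```python
import numpy as np

# Illustrative 2D parameters (not from the slides).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance r^2 = (x-mu)^t Sigma^-1 (x-mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

def gaussian_pdf(x, mu, Sigma):
    """d-dimensional normal density."""
    d = len(mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm

x = np.array([1.0, 1.0])
print(mahalanobis_sq(x, mu, Sigma))
print(gaussian_pdf(x, mu, Sigma))
```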
Expected Value
• The expected value (mean) of a discrete r.v. X is defined as:
E[X] = μ = Σ_x x p(x)
Expected Value (cont’d)
• The sample mean and expected value are related
by:
(1/n) Σ_{k=1}^{n} xk → E[X] as n → ∞
• The expected value for a continuous r.v. is given by:
E[X] = ∫_{−∞}^{∞} x p(x) dx
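A quick demonstration of the first relation, estimating E[X] = 3.5 for a fair die from sample means of growing size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample mean of n die throws vs. the expected value E[X] = 3.5.
for n in (10, 1_000, 100_000):
    sample = rng.integers(1, 7, size=n)  # uniform on {1,...,6}
    print(n, sample.mean())              # approaches 3.5 as n grows
```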
Variance and Standard Deviation
• The variance of a r.v. X is defined as:
Var(X) = σ² = E[(X − μ)²] = E[X²] − μ²
• The standard deviation σ is the (positive) square root of the variance
Covariance
• The covariance of two r.v.'s X and Y is defined as:
Cov(X,Y) = E[(X − μX)(Y − μY)]
Covariance Matrix – 2 variables
Σ = | Var(X)    Cov(X,Y) |
    | Cov(Y,X)  Var(Y)   |
and Cov(X,Y) = Cov(Y,X)
(i.e., symmetric matrix)
Cov(X,Y) can be computed using:
Cov(X,Y) = E[XY] − μX μY
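A sketch estimating Cov(X,Y) = E[XY] − μX μY from samples (illustrative covariance, not from the slides; np.cov gives the full matrix estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated samples (illustrative parameters).
true_Sigma = np.array([[2.0, 0.8],
                       [0.8, 1.0]])
samples = rng.multivariate_normal(mean=[0, 0], cov=true_Sigma, size=50_000)
X, Y = samples[:, 0], samples[:, 1]

# Cov(X,Y) = E[XY] - mu_X mu_Y, estimated from the samples.
cov_xy = (X * Y).mean() - X.mean() * Y.mean()
print(cov_xy)             # ~0.8
print(np.cov(samples.T))  # full 2x2 estimate; symmetric, ~true_Sigma
```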
Covariance Matrix – n variables
Σ is the n x n matrix with entries σij = Cov(Xi, Xj), where σii = Var(Xi)
and Cov(Xi, Xj) = Cov(Xj, Xi)
(i.e., symmetric matrix)
Uncorrelated r.v.’s
• Two r.v.'s X and Y are uncorrelated iff Cov(X,Y) = 0
• If all pairs of r.v.'s are uncorrelated, the off-diagonal entries of Σ
are zero, i.e., Σ reduces to a diagonal matrix D
Properties of covariance matrix
• Σ is symmetric and positive semi-definite: its eigenvalues are real and
non-negative, and its eigenvectors can be chosen to be orthonormal
(Note: we will review these concepts later in case you do not remember them)
Covariance Matrix Decomposition
Σ = Φ Λ Φᵗ
where the columns of Φ are the eigenvectors of Σ, Λ is the diagonal matrix
of the corresponding eigenvalues, and Φ⁻¹ = Φᵗ
(i.e., Φ is an orthogonal matrix)
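A sketch verifying the decomposition numerically with NumPy's symmetric eigendecomposition (illustrative matrix):

```python
import numpy as np

# A symmetric covariance matrix (illustrative values).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Eigendecomposition: columns of Phi are eigenvectors, Lambda holds
# the eigenvalues. For symmetric Sigma, Phi is orthogonal.
eigvals, Phi = np.linalg.eigh(Sigma)
Lambda = np.diag(eigvals)

print(np.allclose(Sigma, Phi @ Lambda @ Phi.T))  # Sigma = Phi Lambda Phi^t
print(np.allclose(np.linalg.inv(Phi), Phi.T))    # Phi^-1 = Phi^t
```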
Linear Transformations
• If a random vector x with mean μ and covariance matrix Σ undergoes a
linear transformation y = Aᵗx, then:
μy = Aᵗμ   and   Σy = Aᵗ Σ A
Linear Transformations (cont’d)
• Whitening transform: choosing Aw = Φ Λ^{−1/2} makes the covariance of
the transformed vector y = Awᵗ x the identity matrix:
Awᵗ Σ Aw = Λ^{−1/2} Φᵗ (Φ Λ Φᵗ) Φ Λ^{−1/2} = I
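A sketch of the whitening transform on sampled data (illustrative covariance): after applying Aw built from the sample covariance, the data's covariance is approximately the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated data (illustrative covariance, not from the slides).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = rng.multivariate_normal([0, 0], Sigma, size=100_000)

# Whitening transform A_w = Phi Lambda^(-1/2), built from the
# eigendecomposition of the sample covariance.
eigvals, Phi = np.linalg.eigh(np.cov(x.T))
A_w = Phi @ np.diag(eigvals ** -0.5)

y = x @ A_w  # y = A_w^t x, applied row-wise to the samples

print(np.cov(y.T).round(3))  # ~identity: the data are whitened
```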