Short review of probabilistic concepts
Probability theory plays a very important role in statistics. This
lecture gives a short review of the basic concepts of
probability theory.
• Basic principles and definitions
• Conditional probabilities and independence
• Bayes’s theorem and postulate
• Random variables and probability distributions
• Bayes’s theorem and likelihood
• Expectations and moments
• Entropy
Random experiment
A random experiment satisfies the following conditions:
1. All possible distinct outcomes are known in advance
2. In any particular experiment the outcome is not known in advance
3. The experiment can be repeated under identical conditions
The outcome space is the set of possible outcomes.
Example 1. Tossing a coin is a random experiment. The outcome space is {H,T}: head
and tail.
Example 2. Rolling a die. The outcome space is {1,2,3,4,5,6}.
Example 3. Drawing from an urn with N balls, M of which are red and N-M are white.
The outcome space is {R,W}: red and white.
Example 4. Measuring temperature (in °C or in K). What is the outcome space?
Something that might or might not happen, depending on the outcome of the experiment, is
called an event. An event is a subset of the outcome space.
Example: Rolling a die. {1,2,3} and {2,4,6} are events.
Example: Measuring temperature in Celsius. Give an example of an event.
Classical definition of probability
If all outcomes are equally likely then the probability of an event A is the number of
outcomes in A, M(A), divided by the number of all outcomes, M:
$$P(A) = \frac{M(A)}{M}$$
Example: If a coin is fair then the probability of H is ½ and the probability of T is ½.
Example: If a die is fair then the probability of {1} is 1/6.
If the outcome space $\Omega$ is a region of the real line or of a higher-dimensional space then probability is measured as
the ratio of the area of an event to the area of the outcome space:
$$P(A) = \frac{M(A)}{M(\Omega)}$$
where M is the area (the length, on the real line).
Example: The outcome space is the interval [0,2]. What is the probability of [0,1]?
Frequency definition of probability
Since random experiments can be repeated as many times as we wish under identical
conditions (in theory), we can measure the relative frequency of the occurrence of
an event. If the number of trials is m and the number of occurrences of A is m(A) then
according to the frequency definition the probability of A is the limit:
$$P(A) = \lim_{m \to \infty} \frac{m(A)}{m}$$
According to the law of large numbers this limit exists. When the number of trials is small
there may be strong fluctuations. As the number of trials increases the fluctuations
tend to decrease.
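The shrinking fluctuations can be seen in a small simulation. The sketch below is only an illustration, assuming a fair six-sided die simulated with Python's pseudo-random generator; the function name `relative_frequency` and the fixed seed are choices made here, not part of the lecture.

```python
import random

def relative_frequency(event, num_trials, seed=0):
    """Estimate P(event) for a simulated fair die as the ratio m(A)/m."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs reproducible
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) in event)
    return hits / num_trials

even = {2, 4, 6}  # classical probability of this event is 3/6 = 0.5

# The estimate fluctuates for small m and settles near 0.5 as m grows.
for m in (10, 100, 1_000, 100_000):
    print(m, relative_frequency(even, m))
```

A simulation cannot prove that the limit exists; it only shows the relative frequency settling down as the number of trials grows.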
Other definitions (subjective) of probability
There are also other definitions of probability:
• Degree of belief. How strongly a person believes in an event. In that sense one
person’s probability may differ from another person’s. For example:
the existence of “extra-terrestrial life”.
• Degree of knowledge. In many cases the exact value of a quantity exists but we do not
know it. By carrying out experiments we want to find this value. Since experiments
are prone to errors it is in general impossible to find the exact value, and we assign
a probability to it. That is the purpose of most statistical procedures and
techniques. According to Jaynes, if proper rules are designed then exactly the same
information would produce exactly the same probabilities (see Jaynes, Probability
Theory: The Logic of Science). This definition reflects our state of
knowledge about the event and can change as we update our knowledge.
Probability axioms
Probability is defined as a function from subsets of the outcome space $\Omega$ to the real line
R that satisfies the following conditions:
1. Non-negativity: $P(A) \ge 0$
2. Additivity: if $A \cap B = \emptyset$ then $P(A \cup B) = P(A) + P(B)$
3. Probability of the whole space is 1: $P(\Omega) = 1$
All of the above definitions obey these rules, so any property that can be derived from
these axioms is valid for all definitions.
Show that: $P(\emptyset) = 0$ (Hint: $\Omega = \Omega \cup \emptyset$).
Show that: $0 \le P(A) \le 1$ (Hint: A and $\tilde{A} = \Omega - A$ do not intersect).
Conditional probability and independence
Suppose an event B has occurred or will occur and we want to know the
probability of A. Knowing B may influence our knowledge about A, or the
occurrence of B may influence the occurrence of A. The probability of A given B is
called the conditional probability of A given B and is defined (for P(B)>0) as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
In effect, event B has become the new outcome space. Events A and B are
called independent if the occurrence of B does not influence the probability of A:
$$P(A \mid B) = P(A) \quad \text{and} \quad P(B \mid A) = P(B)$$
It can also be written as:
$$P(A \cap B) = P(A)\,P(B)$$
Note that these equations are not independent of one another: when P(A)>0 and P(B)>0, each of them implies the others.
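The definitions of conditional probability and independence can be tried out on a fair die, using the classical definition to compute probabilities. This is a minimal sketch; the helper names `prob`, `cond_prob` and `independent` and the three example events are illustrative choices.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # fair die: all six outcomes equally likely

def prob(event):
    """Classical probability: outcomes in the event over all outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

def independent(a, b):
    """A and B are independent when P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
low = {1, 2, 3}
first_four = {1, 2, 3, 4}

print(cond_prob(even, low))           # 1/3
print(independent(even, low))         # False
print(independent(even, first_four))  # True
```

Knowing that the roll is in {1,2,3} lowers the probability of an even roll from 1/2 to 1/3, so these events are dependent. Knowing that it is in {1,2,3,4} leaves it at 1/2.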
The Law of total probability
In many cases when the probability of an event cannot easily be calculated directly it is easier to
divide the event into smaller parts, calculate their probabilities and then take
a weighted average of them. This can be done using the law of total probability.
Let $B_1, B_2, \dots, B_n$ be a partition of the outcome space $\Omega$, i.e. they are mutually exclusive ($B_i \cap B_j = \emptyset$ for $i \ne j$) and their
union is $\Omega$ ($\bigcup_{i=1}^{n} B_i = \Omega$). Then from the axioms of probability:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
Consider a box with N balls, M of which are red and N-M are white. We make two
draws. We do not know the colour of the first ball. What is the probability that the second ball
is red? (Hint: use the partition {R1}, {W1}, then apply the law of total probability
to {R2}. Here the subscript shows whether it is the first or the second draw.)
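One way to check an answer to this exercise is to evaluate the law of total probability with exact fractions. The sketch below assumes the first ball is not put back before the second draw, which the exercise does not state explicitly; the function name `prob_second_red` is hypothetical.

```python
from fractions import Fraction

def prob_second_red(n_balls, n_red):
    """P(R2) via the partition {R1, W1}, assuming draws WITHOUT replacement."""
    if n_balls < 2 or not 0 <= n_red <= n_balls:
        raise ValueError("need N >= 2 and 0 <= M <= N")
    p_r1 = Fraction(n_red, n_balls)  # P(R1)
    p_w1 = 1 - p_r1                  # P(W1)
    # Conditional probabilities once one ball has been removed. The clamps
    # only matter for branches whose weight P(R1) or P(W1) is zero.
    p_r2_given_r1 = Fraction(max(n_red - 1, 0), n_balls - 1)
    p_r2_given_w1 = Fraction(min(n_red, n_balls - 1), n_balls - 1)
    # Law of total probability: P(R2) = P(R2|R1) P(R1) + P(R2|W1) P(W1)
    return p_r2_given_r1 * p_r1 + p_r2_given_w1 * p_w1

print(prob_second_red(10, 3))  # 3/10
```

Under this assumption the result equals M/N, the same as for the first draw.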
Bayes’s theorem
Bayes’s theorem is a tool that updates the probability of an event in the light of evidence. It
is written in various forms, all of which are equivalent. Let us again consider a partition of
the outcome space $\Omega$ into $B_1, B_2, \dots, B_n$, so that the events are mutually exclusive and their union is
equal to $\Omega$. Then for one of these events (say the j-th event) we can write:
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$$
Usually $P(B_j \mid A)$ is called the posterior
probability, $P(B_j)$ the prior probability and $P(A \mid B_j)$ the
likelihood. The theorem is widely used in statistical inference. We will come back to it
again.
Example: A box contains four balls. There are two possibilities: a) all balls are white
(B1); b) two are white and two are red (B2). A ball is drawn and it is white (event A). What is
the probability that all balls are white? B1 (all white) and B2 (two white and two red)
are the two possibilities, each with prior probability ½. If B1 is true then the probability of
A is 1 and if B2 is true then the probability of A is ½. Calculate P(B1|A). What is
the probability P(B2|A)?
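The box example can be checked by applying Bayes's theorem to the priors and likelihoods given above. This is a sketch only; the function name `posterior` and the list layout are illustrative choices.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(B_j | A) for every j from P(B_j) and P(A | B_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(A), by the law of total probability
    if evidence == 0:
        raise ValueError("P(A) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]

priors = [Fraction(1, 2), Fraction(1, 2)]    # P(B1), P(B2)
likelihoods = [Fraction(1), Fraction(1, 2)]  # P(A|B1), P(A|B2)

post = posterior(priors, likelihoods)
print(post)  # [Fraction(2, 3), Fraction(1, 3)]
```

Drawing a white ball raises the probability of "all white" from ½ to 2/3, and the two posterior probabilities still sum to 1.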
Bayes’s postulate: If there is no prior information available then prior probabilities
should be assumed to be equal.
Random variables
A random variable is a function from the outcome space $\Omega$ to the real line:
$$X: \Omega \to R$$
Example: Consider the random experiment of tossing a coin twice. The outcome space is:
$\Omega = \{(H,H),(H,T),(T,H),(T,T)\}$
Define the random variable as
X((T,T)) = 0, X((H,T)) = X((T,H)) = 1, X((H,H)) = 2
Example 2: Rolling a die. The outcome space is {1,2,3,4,5,6}. Define the random variable
X(j) = j.
Probability distribution function
Discrete case (the number of elements in the outcome space is finite or countably infinite):
The probability function p assigns to each possible realisation x of a random variable X
the probability P(X=x). Obviously $\sum_x p(x) = 1$.
Example: The number of heads turning up in two tosses of a fair coin is a random variable with
probability function p(0) = ¼, p(1) = ½, p(2) = ¼.
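The probability function of the number of heads can be obtained by enumerating the outcome space. The sketch below assumes a fair coin, so that all four outcomes are equally likely; the names `num_heads` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Outcome space of two tosses: (H,H), (H,T), (T,H), (T,T)
outcomes = list(product("HT", repeat=2))

def num_heads(outcome):
    """Random variable X: maps an outcome to the number of heads in it."""
    return outcome.count("H")

# Count how many equally likely outcomes map to each value of X.
counts = Counter(num_heads(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)  # 0 1/4, then 1 1/2, then 2 1/4
```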
For a continuous random variable it is not possible to define a probability for each
realisation, since the probability of any single value is 0. For such variables it is easier to define the distribution
function:
$$F(x) = P(X \le x)$$
i.e. the probability that X is less than or equal to x. F(x) has the following properties:
1) $F(-\infty) = 0$, 2) F(x) is a monotonic non-decreasing function, 3) $F(+\infty) = 1$.
This function is defined for discrete as well as continuous random variables. If
the derivative of F(x) exists (as it does for continuous random variables) then it
is called the probability density function, f(x) = dF(x)/dx. Another relation
between them is:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
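The relation between density and distribution function can be illustrated numerically. The sketch below assumes a random variable distributed uniformly on the interval [0,2], so its density is ½ there and 0 elsewhere. The midpoint-rule integrator and the finite lower limit standing in for $-\infty$ are simplifications chosen for this illustration.

```python
def density(x):
    """Density f(x) of a uniform distribution on [0, 2] (an assumed example)."""
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def cdf(x, lower=-1.0, steps=20_000):
    """Approximate F(x) as the integral of f from `lower` to x (midpoint rule).

    `lower` stands in for minus infinity; that is valid here only because
    the density is zero below 0.
    """
    if x <= lower:
        return 0.0
    width = (x - lower) / steps
    return sum(density(lower + (k + 0.5) * width) for k in range(steps)) * width

for x in (-0.5, 0.5, 1.0, 2.0, 3.0):
    print(x, round(cdf(x), 3))
```

The values rise from 0 to 1 and never decrease, matching the three properties of F(x), and F(1) comes out close to ½.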
Joint probability distributions
If there is more than one random variable then their joint probability distribution is
defined similarly. For the discrete case:
p(x,y) = P((X,Y)=(x,y)) = P(X=x,Y=y)
Then $\sum_x \sum_y p(x,y) = 1$ and $p(x,y) \ge 0$.
The marginal probability function pX(x) is derived by summing over all possible values of
y:
$$p_X(x) = \sum_y p(x,y)$$
The conditional probability function of X given Y=y is:
p(x|y) = p(x,y)/pY(y)
The definition of the joint probability distribution for continuous random variables is similar:
$F(x,y) = P(X \le x, Y \le y)$. The probability density f(x,y) is the mixed derivative of the distribution function
with respect to its arguments, $f(x,y) = \partial^2 F(x,y) / \partial x\,\partial y$. It has the properties:
$$f(x,y) \ge 0, \qquad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1$$
Marginal and conditional probability densities are defined as for discrete random
variables, with summation replaced by integration.
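Marginal and conditional probability functions can be computed directly from a joint table. The sketch below uses an assumed example: two tosses of a fair coin, with X the number of heads on the first toss and Y the total number of heads. The helper names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

quarter = Fraction(1, 4)
# Joint probability function p(x, y); pairs that cannot occur are omitted.
joint = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (1, 2): quarter}

def marginal(joint, axis):
    """Sum the joint function over the other variable (axis 0 = X, 1 = Y)."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_x_given_y(joint, y):
    """p(x | y) = p(x, y) / pY(y), defined only when pY(y) > 0."""
    p_y = marginal(joint, 1).get(y, Fraction(0))
    if p_y == 0:
        raise ValueError("pY(y) must be positive")
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}

print(marginal(joint, 0))               # pX
print(marginal(joint, 1))               # pY
print(conditional_x_given_y(joint, 1))  # p(x | Y=1)
```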
Joint probability distributions and independence
Random events {X=x} and {Y=y} are independent if
P(X=x, Y=y) = P(X=x)P(Y=y)
Random variables are independent if for all pairs (x,y) this relation holds. It can also
be written as
p(x,y) = pX(x)pY(y)
And then p(x|y) = pX(x) and p(y|x) = pY(y)
For continuous random variables the definition is analogous: replace
p with f everywhere.
f(x,y) = fX(x)fY(y), f(x|y) = fX(x), f(y|x) = fY(y)
Bayes’s theorem then becomes:
f(x|y) = fX(x) f(y|x)/fY(y)
where f(x|y) is the posterior probability density, fX(x) is the prior probability density, f(y|x) is the
likelihood of y given x, and fY(y) can be considered a normalisation
coefficient. Usually we will drop the subscripts X and Y.
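The factorisation criterion for independence of discrete random variables, p(x,y) = pX(x)pY(y) for all pairs, can be tested mechanically. The sketch below uses two assumed joint tables built from two tosses of a fair coin; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal probability functions (pX, pY) of a joint table."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, Fraction(0)) + p
        p_y[y] = p_y.get(y, Fraction(0)) + p
    return p_x, p_y

def are_independent(joint):
    """True if p(x, y) = pX(x) pY(y) holds for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x, y in product(p_x, p_y)
    )

quarter = Fraction(1, 4)
# X = heads on first toss, Y = heads on second toss
two_tosses = {(a, b): quarter for a, b in product((0, 1), repeat=2)}
# X = heads on first toss, Y = total number of heads
first_and_total = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (1, 2): quarter}

print(are_independent(two_tosses))       # True
print(are_independent(first_and_total))  # False
```

The check has to cover every pair, including pairs with p(x,y) = 0. In the second table p(0,2) = 0, although pX(0)pY(2) = 1/8.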
Expectation values. Moments
If X is a random variable and h(X) is a function of it then its expectation value (discrete
case) is defined as:
$$E(h(X)) = \sum_x h(x)\,p(x)$$
If h(x) = x then it is called the first moment. If $h(x) = x^n$ then it is called the n-th moment. If
$h(x) = (x - E(X))^n$ then it is called the n-th central moment. The second central moment is
called the variance of the random variable. The first moment and the second central moment
play an important role in statistics and have special symbols:
$$\mu = \sum_x x\,p(x), \qquad \sigma^2 = \sum_x (x - \mu)^2\,p(x)$$
$\sigma$ is also called the standard deviation.
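As an illustration of these definitions, the sketch below computes the mean, variance and standard deviation of the number of heads in two tosses of a fair coin (an assumed example). The function names `expectation` and `central_moment` are illustrative.

```python
import math
from fractions import Fraction

def expectation(pmf, h=lambda x: x):
    """E(h(X)) = sum over x of h(x) p(x) for a discrete random variable."""
    return sum(h(x) * p for x, p in pmf.items())

def central_moment(pmf, n):
    """n-th central moment: E((X - E(X))^n)."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** n)

# Number of heads in two tosses of a fair coin.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

mean = expectation(pmf)            # first moment
variance = central_moment(pmf, 2)  # second central moment
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```

Here the mean is 1 and the variance is ½. The third central moment is 0 because this distribution is symmetric about its mean.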
When there is more than one random variable with a joint probability function,
their mixed moments are also defined. The most important of them are the covariance
and the correlation:
$$\mathrm{cov}(x,y) = \sum_{x,y} (x - \mu_x)(y - \mu_y)\,p(x,y), \qquad \rho(x,y) = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$$
For continuous random variables expectation values, moments, covariance and
correlation are defined similarly by replacing summation with integration. If
random variables are independent then their covariance is 0. The reverse is not true in
general.
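A small counterexample shows why zero covariance does not imply independence. The sketch below assumes X takes the values -1, 0, 1 with equal probability and Y = X². A second assumed table (first coin toss versus total heads in two tosses) gives a non-zero correlation. All names are illustrative.

```python
import math
from fractions import Fraction

def mean(joint, axis):
    """Expectation of X (axis 0) or Y (axis 1) under the joint function."""
    return sum(pair[axis] * p for pair, p in joint.items())

def variance(joint, axis):
    mu = mean(joint, axis)
    return sum((pair[axis] - mu) ** 2 * p for pair, p in joint.items())

def covariance(joint):
    mu_x, mu_y = mean(joint, 0), mean(joint, 1)
    return sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

def correlation(joint):
    """cov / (sigma_x sigma_y); undefined if either variance is zero."""
    return float(covariance(joint)) / math.sqrt(variance(joint, 0) * variance(joint, 1))

third, quarter = Fraction(1, 3), Fraction(1, 4)
square_pair = {(-1, 1): third, (0, 0): third, (1, 1): third}  # Y = X^2
first_and_total = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (1, 2): quarter}

# Dependence check for Y = X^2: p(0,0) = 1/3 but pX(0) pY(0) = 1/9.
p_x0 = sum(p for (x, _), p in square_pair.items() if x == 0)
p_y0 = sum(p for (_, y), p in square_pair.items() if y == 0)
dependent = square_pair[(0, 0)] != p_x0 * p_y0

print(covariance(square_pair), dependent)  # 0 True
print(correlation(first_and_total))        # about 0.707
```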
Entropy and maximum entropy
The entropy of a random variable is the average amount of information generated by
observing its value. The information generated by observing the realisation X=x is
defined as:
H(X=x) = ln(1/p(x)) = -ln(p(x))
Example: We have a coin. The probability of head is 0.9 and the probability of tail is 0.1.
Define the random variable X(H) = 1, X(T) = 0. What is the information generated by
observing head? What is the information generated by observing tail?
Entropy is defined (discrete case) as
$$H(X) = -\sum_x p(x) \ln(p(x))$$
This is the Shannon entropy, first derived in information theory by Claude Shannon. Since
-ln(p(X)) can be considered a random variable, the entropy can be written as
$$H(X) = E(-\ln(p(X)))$$
i.e. the expected value of -ln(p(X)).
Maximising the entropy subject to the given information is often used to generate a prior probability distribution.
The continuous case is defined similarly by replacing summation with integration and p
with f.
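The coin example can be worked through numerically. The sketch below uses natural logarithms, as in the definitions above, and compares the biased coin with a fair one. Treating terms with p(x) = 0 as contributing 0 is a common convention assumed here; the function names are illustrative.

```python
import math

def information(p):
    """Information generated by observing a value of probability p: -ln(p)."""
    return -math.log(p)

def entropy(pmf):
    """Shannon entropy H(X) = -sum p(x) ln p(x), skipping p(x) = 0 terms."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

biased = {1: 0.9, 0: 0.1}  # X(H) = 1 with probability 0.9, X(T) = 0 with 0.1
fair = {1: 0.5, 0: 0.5}

print(information(0.9))  # head: likely outcome, little information
print(information(0.1))  # tail: unlikely outcome, more information
print(entropy(biased), entropy(fair))
```

The rare outcome carries more information. The fair coin has the larger entropy, ln 2, which is the maximum for two outcomes.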