Essential Probability & Statistics
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
Jan. 23, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Purpose of Prob. & Statistics
• Deductive vs. Plausible reasoning
• Incomplete knowledge -> uncertainty
• How do we quantify inference under
uncertainty?
– Probability: models of random
process/experiments (how data are generated)
– Statistics: draw conclusions about the whole
population based on samples (inference from data)
Basic Concepts in Probability
• Sample space: all possible outcomes, e.g.,
– Tossing 2 coins, S ={HH, HT, TH, TT}
• Event: E ⊆ S; E happens iff the outcome is in E, e.g.,
– E={HH} (all heads)
– E={HH,TT} (same face)
• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
– P(S)=1 (outcome always in S)
– P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅
Basic Concepts of Prob. (cont.)
• Conditional probability: P(B|A) = P(A∩B)/P(A)
– P(A∩B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B)
– For independent events, P(A∩B) = P(A)P(B), so
P(A|B) = P(A)
• Total probability: If A1, …, An form a partition
of S, then
– P(B) = P(B∩S) = P(B∩A1) + … + P(B∩An)
– So, P(Ai|B)=P(B|Ai)P(Ai)/P(B) (Bayes’ Rule)
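A minimal sketch (not from the original slides) of how total probability and Bayes' rule combine; the two hypotheses and their priors/likelihoods are made-up numbers for illustration only:

```python
# Hedged sketch: a made-up two-hypothesis example of total probability + Bayes' rule.
# The priors P(Ai) and likelihoods P(B|Ai) below are illustrative values.
priors = {"A1": 0.3, "A2": 0.7}          # P(A1), P(A2): a partition of S
likelihoods = {"A1": 0.9, "A2": 0.2}     # P(B|A1), P(B|A2)

# Total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes' rule: P(Ai|B) = P(B|Ai)P(Ai) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}
print(p_b)         # 0.41
print(posteriors)  # {'A1': ~0.659, 'A2': ~0.341}
```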
Interpretation of Bayes’ Rule
Hypothesis space: H = {H1, …, Hn}    Evidence: E

P(Hi | E) = P(E | Hi) P(Hi) / P(E)

– Posterior probability of Hi: P(Hi | E)
– Prior probability of Hi: P(Hi)
– Likelihood of the data/evidence if Hi is true: P(E | Hi)

If we want to pick the most likely hypothesis H*, we can drop P(E):
P(Hi | E) ∝ P(E | Hi) P(Hi)
Random Variable
• X: S → R (a “measure” of the outcome)
• Events can be defined according to X
– E(X=a) = {si | X(si) = a}
– E(X≥a) = {si | X(si) ≥ a}
• So, probabilities can be defined on X
– P(X=a) = P(E(X=a))
– P(X≤a) = P(E(X≤a))   (F(a) = P(X≤a): cumulative distribution function)
• Discrete vs. continuous random variable (think of
“partitioning the sample space”)
An Example
• Think of a DNA sequence as the result of tossing a 4-faced
die many times independently
• P(AATGC) = p(A)p(A)p(T)p(G)p(C)
• A model specifies {p(A), p(C), p(G), p(T)}, e.g., all 0.25
(random model M0)
• P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25
• Comparing 2 models (see the sketch below)
– M1: coding regions
– M2: non-coding regions
– Decide if AATGC is more likely a coding region
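A minimal sketch of this model comparison, assuming independent positions as in the die-tossing model; the nucleotide probabilities for M1 and M2 are invented for illustration, since the lecture does not give concrete values:

```python
import math

# Hedged sketch: comparing two hypothetical models for the sequence AATGC.
M0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random model
M1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # "coding" model (assumed values)
M2 = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}   # "non-coding" model (assumed values)

def log_likelihood(seq, model):
    """log P(seq | model), assuming independent positions (die-tossing model)."""
    return sum(math.log(model[base]) for base in seq)

seq = "AATGC"
print(log_likelihood(seq, M0))   # = 5 * log(0.25)

# Decide which model explains the sequence better via the log-likelihood ratio
llr = log_likelihood(seq, M1) - log_likelihood(seq, M2)
print("more likely coding" if llr > 0 else "more likely non-coding")
```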
Probability Distributions
• Binomial: number of successes out of N trials
p(k|N) = C(N,k) p^k (1-p)^(N-k)
• Gaussian: sum of N independent R.V.'s
f(x) = 1/(√(2π) σ) · exp( -(x-μ)² / (2σ²) )
• Multinomial: getting ni occurrences of outcome i
p(n1, …, nk | N) = N!/(n1! … nk!) · p1^n1 · … · pk^nk,  with Σ_i ni = N
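A minimal sketch (not part of the slides) writing out the three formulas above with the standard library only; the example arguments are arbitrary:

```python
import math

def binomial_pmf(k, N, p):
    """p(k|N) = C(N,k) p^k (1-p)^(N-k)"""
    return math.comb(N, k) * p**k * (1 - p)**(N - k)

def gaussian_pdf(x, mu, sigma):
    """f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def multinomial_pmf(counts, probs):
    """p(n1..nk|N) = N!/(n1!...nk!) * prod(pi^ni), with N = sum(counts)"""
    N = sum(counts)
    coeff = math.factorial(N)
    for n in counts:
        coeff //= math.factorial(n)   # exact integer division at each step
    return coeff * math.prod(p**n for p, n in zip(probs, counts))

print(binomial_pmf(3, 10, 0.5))                  # ~0.117
print(gaussian_pdf(0.0, 0.0, 1.0))               # ~0.399
print(multinomial_pmf([2, 1, 1, 1], [0.25]*4))   # counts of A,T,G,C in AATGC under M0
```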
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that
governs the random experiment
– The model gives a probability of any data, p(D|θ),
that depends on the parameter θ
– Now, given actual sample data X={x1,…,xn}, what
can we say about the value of θ?
• Intuitively, take your best guess of θ -- “best”
means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood Estimator
Data: a sequence d with counts c(w1), …, c(wN), and length |d|
Model: multinomial M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)

p(d|M) = ( |d| choose c(w1)…c(wN) ) · Π_{i=1..N} θi^c(wi) ∝ Π_{i=1..N} θi^c(wi),
where θi = p(wi)

l(d|M) = log p(d|M) = Σ_{i=1..N} c(wi) log θi  (+ a constant that does not depend on θ)

We'll tune p(wi) to maximize l(d|M), subject to Σ_{i=1..N} θi = 1.
Use the Lagrange multiplier approach:

l'(d|M) = Σ_{i=1..N} c(wi) log θi + λ ( Σ_{i=1..N} θi - 1 )

Set partial derivatives to zero:
∂l'/∂θi = c(wi)/θi + λ = 0   =>   θi = -c(wi)/λ

Since Σ_{i=1..N} θi = 1 and Σ_{i=1..N} c(wi) = |d|, we get λ = -|d|

So, θi = p(wi) = c(wi)/|d|
ML estimate
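A minimal sketch (not from the slides) of the ML estimate θi = c(wi)/|d| applied to a toy DNA sequence; the sequence itself is just an example:

```python
from collections import Counter

# Hedged sketch: ML estimate of multinomial parameters, theta_i = c(w_i) / |d|.
d = "AATGCAATTGCA"                      # toy sequence (made up)
counts = Counter(d)                     # c(w_i) for each symbol
theta_ml = {w: c / len(d) for w, c in counts.items()}
print(theta_ml)                         # e.g. p(A)=5/12, p(T)=3/12, p(G)=2/12, p(C)=2/12
```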
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
θ̂ = argmax_θ P(X|θ)
– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior”
knowledge and explaining data well
θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)
– Problem: how to define prior?
Bayesian Estimator
• ML estimator: M=argmax M p(d|M)
• Bayesian estimator:
– First consider posterior: p(M|d) =p(d|M)p(M)/p(d)
– Then, consider the mean or mode of the posterior dist.
• p(d|M) : Sampling distribution (of data)
• p(M) = p(θ1, …, θN): our prior on the model parameters
• conjugate = prior can be interpreted as “extra”/“pseudo” data
• Dirichlet distribution is a conjugate prior for multinomial
sampling distribution
Dir(θ | α1, …, αN) = [ Γ(α1 + … + αN) / (Γ(α1) … Γ(αN)) ] · Π_{i=1..N} θi^(αi - 1)

αi: “extra”/“pseudo” counts, e.g., αi = μ·p(wi|REF)
Dirichlet Prior Smoothing (cont.)
Posterior distribution of parameters:
p(θ|d) ∝ Dir(θ | c(w1)+α1, …, c(wN)+αN)

Property: If θ ~ Dir(θ|α), then E(θi) = αi / Σ_{j=1..N} αj

The predictive distribution is the same as the mean:
p(wi | θ̂) = ∫ p(wi|θ) Dir(θ | c(w1)+α1, …, c(wN)+αN) dθ
= ( c(wi) + αi ) / ( |d| + Σ_{j=1..N} αj )
= ( c(wi) + μ·p(wi|REF) ) / ( |d| + μ )

Bayesian estimate (what happens as |d| → ∞?)
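A minimal sketch of this Dirichlet-smoothed estimate; the reference distribution p(w|REF) and the pseudo-count mass μ below are illustrative choices, not values from the lecture:

```python
from collections import Counter

# Hedged sketch of the Bayesian (Dirichlet-smoothed) estimate:
#   p(w_i | theta_hat) = (c(w_i) + mu * p(w_i|REF)) / (|d| + mu)
d = "AATGCAATTGCA"                                      # toy sequence (made up)
p_ref = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}    # assumed reference model
mu = 4.0                                                # total pseudo-count (assumed)

counts = Counter(d)
theta_bayes = {w: (counts[w] + mu * p_ref[w]) / (len(d) + mu) for w in p_ref}
print(theta_bayes)
# As |d| grows, the smoothed estimate approaches the ML estimate c(w)/|d|.
```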
Illustration of Bayesian Estimation
[Figure: prior p(θ), likelihood p(D|θ) for data D=(c1,…,cN), and posterior p(θ|D) ∝ p(D|θ)p(θ), marking the prior mode, the posterior mode, and the ML estimate θml]
Basic Concepts in Information Theory
• Entropy: Measuring uncertainty of a random
variable
• Kullback-Leibler divergence: comparing two
distributions
• Mutual Information: measuring the correlation
of two random variables
Entropy
Entropy H(X) measures the average uncertainty of random variable X
H(X) = H(p) = - Σ_{all possible values x} p(x) log p(x)

Define 0 log 0 = 0;  log = log₂

Example:
fair coin, p(H) = 0.5          →  H(X) = 1
biased coin, p(H) = 0.8        →  H(X) between 0 and 1
completely biased, p(H) = 1    →  H(X) = 0

Properties: H(X) ≥ 0; Min = 0; Max = log M, where M is the total number of possible values
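A minimal sketch (not part of the slides) computing the entropies of the three coins above:

```python
import math

# Hedged sketch: entropy (in bits) of a coin with p(H) = p.
def entropy(probs):
    """H(p) = -sum p(x) log2 p(x), with 0*log(0) defined as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for p_h in (0.5, 0.8, 1.0):
    print(p_h, entropy([p_h, 1 - p_h]))
# 0.5 -> 1.0 bit, 0.8 -> ~0.722 bits, 1.0 -> 0.0 bits
```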
Interpretations of H(X)
• Measures the “amount of information” in X
– Think of each value of X as a “message”
– Think of X as a random experiment (20 questions)
• Minimum average number of bits to compress values of X
– The more random X is, the harder it is to compress
  A fair coin has the maximum information, and is hardest to compress
  A biased coin has some information, and can be compressed to < 1 bit on average
  A completely biased coin has no information, and needs 0 bits

“Information of x” = “# bits to code x” = -log p(x);   H(X) = E_p[-log p(x)]
Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q?
Expected # of bits = ?   H(p,q) = E_p[-log q(x)] = - Σ_x p(x) log q(x)

Intuitively, H(p,q) ≥ H(p), and mathematically:

H(p,q) - H(p) = - Σ_x p(x) [ log ( q(x)/p(x) ) ]
              ≥ - log [ Σ_x p(x) · q(x)/p(x) ] = - log 1 = 0

By Jensen's inequality: Σ_i pi f(xi) ≥ f( Σ_i pi xi ),
where f is a convex function and Σ_i pi = 1
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q?
How many bits would we waste?
D(p||q) = H(p,q) - H(p) = Σ_x p(x) log [ p(x)/q(x) ]    (relative entropy)

Properties:
- D(p||q) ≥ 0
- D(p||q) ≠ D(q||p)
- D(p||q) = 0 iff p = q

KL-divergence is often used to measure the
distance between two distributions

Interpretation:
- For fixed p, D(p||q) and H(p,q) vary in the same way
- If p is an empirical distribution, minimizing D(p||q) or H(p,q) is
equivalent to maximizing the likelihood
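A minimal sketch (not from the slides) of cross entropy and KL-divergence for two small example distributions; the numbers are arbitrary and q is assumed nonzero wherever p is:

```python
import math

def cross_entropy(p, q):
    """H(p,q) = -sum p(x) log2 q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    # D(p||q) = H(p,q) - H(p) = sum p(x) log2(p(x)/q(x))
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # >= 0, and 0 only if p == q
print(kl_divergence(q, p))   # generally different: D(p||q) != D(q||p)
```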
Cross Entropy, KL-Div, and Likelihood
Data / sample for X:  Y = (y1, …, yN)

Empirical distribution:  p̃(x) = (1/N) Σ_{i=1..N} δ(yi, x),
where δ(y, x) = 1 if x = y, and 0 otherwise

Likelihood:  L(Y) = Π_{i=1..N} p(X = yi)

log-Likelihood:  log L(Y) = Σ_{i=1..N} log p(X = yi) = Σ_x c(x) log p(X = x)
= N Σ_x p̃(x) log p(x)

(1/N) log L(Y) = -H(p̃, p) = -D(p̃ || p) - H(p̃)

Fixing the data:  argmax_p log L(Y) = argmin_p H(p̃, p) = argmin_p D(p̃ || p)
Criterion for estimating a good model
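A minimal sketch (not from the slides) checking the identity (1/N) log L(Y) = -H(p̃, p) on an invented sample and model:

```python
import math
from collections import Counter

# Hedged sketch: verify (1/N) log L(Y) = -H(p_tilde, p) numerically.
Y = ["a", "a", "b", "a", "c", "b"]            # sample for X (made up)
p = {"a": 0.5, "b": 0.3, "c": 0.2}            # model distribution (made up)

N = len(Y)
p_tilde = {x: c / N for x, c in Counter(Y).items()}   # empirical distribution

log_L = sum(math.log2(p[y]) for y in Y)                            # log-likelihood (base 2)
cross_H = -sum(p_tilde[x] * math.log2(p[x]) for x in p_tilde)      # H(p_tilde, p)

print(log_L / N, -cross_H)   # the two numbers coincide
```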
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs p(x)p(y)
I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ] = H(X) - H(X|Y) = H(Y) - H(Y|X)

Conditional entropy H(Y|X):
H(Y|X) = E_{p(x,y)}[ -log p(y|x) ] = - Σ_{x,y} p(x,y) log p(y|x)
       = - Σ_x p(x) Σ_y p(y|x) log p(y|x)

Properties: I(X;Y) ≥ 0;  I(X;Y) = I(Y;X);  I(X;Y) = 0 iff X & Y are independent
Interpretations:
- Measures how much reduction in uncertainty of X given info. about Y
- Measures correlation between X and Y
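A minimal sketch (not from the slides) computing I(X;Y) from a small made-up joint distribution:

```python
import math

# Hedged sketch: I(X;Y) = sum p(x,y) log2 [ p(x,y) / (p(x)p(y)) ].
joint = {                      # p(x, y), illustrative values summing to 1
    ("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.1, ("x1", "y1"): 0.4,
}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p     # marginal p(x)
    p_y[y] = p_y.get(y, 0) + p     # marginal p(y)

I = sum(p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)   # 0 iff X and Y are independent; here X and Y are correlated, so I > 0
```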
What You Should Know
• Probability concepts:
– sample space, event, random variable, conditional prob.,
multinomial distribution, etc.
• Bayes formula and its interpretation
• Statistics: know how to compute the maximum likelihood
estimate
• Information theory concepts:
– entropy, cross entropy, relative entropy, conditional entropy,
KL-div., mutual information, and their relationships