Essential CS & Statistics
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Aug. 30, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Essential CS Concepts
• Programming languages: languages that we use to
communicate with a computer
– Machine language (010101110111…)
– Assembly language (move a, b; add c, b; …)
– High-level language (x = a + 2*b …), e.g., C++, Perl, Java
– Different languages are designed for different applications
• System software: software “assistants” that help a computer
– Understand high-level programming languages (compilers)
– Manage all kinds of devices (operating systems)
– Communicate with users (GUI or command line)
• Application software: software for various kinds of applications
– Standalone (running on a local computer, e.g., Excel, Word)
– Client-server applications (running on a network, e.g., web browser)
Intelligence/Capacity of a Computer
• The intelligence of a computer is determined
by the intelligence of the software it can run
• Capacities of a computer for running software
are mainly determined by its
– Speed
– Memory
– Disk space
• Given a particular computer, we would like to
write software that is highly intelligent, that
can run fast, and that doesn’t need much
memory (contradictory goals)
Algorithms vs. Software
• An algorithm is a procedure for solving a problem
– Input: description of a problem
– Output: solution(s)
– Step 1: We first do this
– Step 2: ….
– ….
– Step n: here’s the solution!
• Software implements an algorithm (in a
particular programming language)
Example: Change Problem
• Input:
– M (total amount of money)
– c1 > c2 > … > cd (denominations)
• Output:
– i1, i2, …, id (number of coins of each kind), such
that i1·c1 + i2·c2 + … + id·cd = M and i1 + i2 + … + id
is as small as possible
Algorithm Example: BetterChange
Input: total money M, denominations c = (c1, …, cd)

BetterChange(M, c, d)
1  r ← M
2  for k ← 1 to d
3      ik ← ⌊r / ck⌋        (take only the integer part, i.e., the floor)
4      r ← r − ik·ck
5  return (i1, i2, …, id)   (the output variables)
Properties of an algorithm:
- Correct vs. Incorrect algorithms (Is BetterChange correct?)
- Fast vs. Slow algorithms (How do we quantify it?)
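Below is a minimal Python sketch of the BetterChange procedure above (function and variable names are my own); as the correctness question above hints, this greedy strategy is not correct for every set of denominations.

def better_change(M, c):
    # Greedy change-making: c must be sorted in decreasing order (c1 > c2 > ... > cd)
    coins = []
    r = M
    for ck in c:
        ik = r // ck          # take only the integer part (floor)
        r -= ik * ck
        coins.append(ik)
    return coins

# Example: better_change(40, [25, 20, 10, 5, 1]) returns [1, 0, 1, 1, 0] (3 coins),
# but 40 = 20 + 20 needs only 2 coins, so the greedy answer is not always optimal.
print(better_change(40, [25, 20, 10, 5, 1]))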
Big-O Notation
• How can we compare the running time of two
algorithms in a computer-independent way?
• Observations:
– In general, as the problem size grows, the running
time increases (sorting 500 numbers takes more
time than sorting 5)
– Running time is more critical for large problem size
(think about sorting 5 numbers vs. sorting 50000
numbers)
• How about measuring the growth rate of
running time?
Big-O Notation (cont.)
• Define the problem size (e.g., the length of a sequence, n)
• Define “basic steps” (e.g., addition, division, …)
• Express the running time as a function of the problem size (e.g., 3·n·log(n) + n)
• As the problem size approaches positive infinity, only the highest-order term “counts”
• Big-O indicates the highest-order term
• E.g., the algorithm has O(n·log(n)) time complexity
• Polynomial (e.g., O(n^2)) vs. exponential (e.g., O(2^n)) running time
• NP-complete problems (a class of hard problems for which no polynomial-time algorithm is known)
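As a small illustrative sketch (not from the slides), tabulating a few growth functions shows why only the highest-order term matters once n gets large:

import math

# Compare the growth of common complexity classes as n increases.
for n in [10, 100, 1000, 10000]:
    n_log_n = n * math.log2(n)
    mixed = 3 * n * math.log2(n) + n      # the example 3*n*log(n) + n from the slide
    quadratic = n ** 2
    print(f"n={n:>6}   3n log n + n ≈ {mixed:12,.0f}   n^2 = {quadratic:12,}   n^2 / (n log n) ≈ {quadratic / n_log_n:8,.1f}")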
Basic Probability & Statistics
Purpose of Prob. & Statistics
• Deductive vs. Plausible reasoning
• Incomplete knowledge -> uncertainty
• How do we quantify inference under
uncertainty?
– Probability: models of random
process/experiments (how data are generated)
– Statistics: draw conclusions about the whole
population based on samples (inference from data)
Basic Concepts in Probability
• Sample space: all possible outcomes, e.g.,
– Tossing 2 coins, S ={HH, HT, TH, TT}
• Event: E ⊆ S; E happens iff the outcome is in E, e.g.,
– E={HH} (all heads)
– E={HH,TT} (same face)
• Probability of an event: 0 ≤ P(E) ≤ 1, such that
– P(S)=1 (the outcome is always in S)
– P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅
Basic Concepts of Prob. (cont.)
• Conditional probability: P(B|A) = P(A∩B)/P(A)
– P(A∩B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B)
– For independent events, P(A∩B) = P(A)P(B), so
P(A|B) = P(A)
• Total probability: if A1, …, An form a partition
of S, then
– P(B) = P(B∩S) = P(B∩A1) + … + P(B∩An)
– So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) (Bayes’ Rule)
Interpretation of Bayes’ Rule
Hypothesis space: H = {H1, …, Hn}        Evidence: E

Posterior probability of Hi:   P(Hi | E) = P(E | Hi) P(Hi) / P(E)

where P(Hi) is the prior probability of Hi, and P(E | Hi) is the likelihood of the
data/evidence if Hi is true.

If we want to pick the most likely hypothesis H*, we can drop P(E):

P(Hi | E) ∝ P(E | Hi) P(Hi)
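A small numeric sketch (the priors and likelihoods are made up for illustration) of picking the most likely hypothesis:

# Hypothetical prior P(Hi) and likelihood P(E | Hi) for two hypotheses
prior = {"H1": 0.7, "H2": 0.3}
likelihood = {"H1": 0.1, "H2": 0.4}

p_e = sum(prior[h] * likelihood[h] for h in prior)                 # P(E), by total probability
posterior = {h: likelihood[h] * prior[h] / p_e for h in prior}     # Bayes' rule
print(posterior)                                                   # {'H1': ~0.37, 'H2': ~0.63}

# To pick the most likely hypothesis H*, P(E) can be dropped:
h_star = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_star)                                                      # H2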
Random Variable
• X: S → ℝ (a “measure” of the outcome)
• Events can be defined according to X
– E(X=a) = {si | X(si) = a}
– E(X≤a) = {si | X(si) ≤ a}
• So, probabilities can be defined on X
– P(X=a) = P(E(X=a))
– P(X≤a) = P(E(X≤a))   (F(a) = P(X≤a): the cumulative distribution function)
• Discrete vs. continuous random variables (think of
“partitioning the sample space”)
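A tiny worked example (my own, continuing the two-coin sample space above): let X be the number of heads in two fair coin tosses.

from itertools import product
from fractions import Fraction

S = list(product("HT", repeat=2))          # sample space {HH, HT, TH, TT}
X = {s: s.count("H") for s in S}           # random variable: number of heads

def P(event):                              # uniform probability over S
    return Fraction(len(event), len(S))

E_X_eq_1 = [s for s in S if X[s] == 1]     # E(X=1) = {HT, TH}
E_X_le_1 = [s for s in S if X[s] <= 1]     # E(X<=1) = {HT, TH, TT}
print(P(E_X_eq_1), P(E_X_le_1))            # 1/2 and 3/4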
An Example
• Think of a DNA sequence as the result of tossing a 4-sided
die many times independently
• P(AATGC) = p(A)p(A)p(T)p(G)p(C)
• A model specifies {p(A), p(C), p(G), p(T)}, e.g., all 0.25
(random model M0)
• P(AATGC|M0) = 0.25·0.25·0.25·0.25·0.25
• Comparing 2 models
– M1: coding regions
– M2: non-coding regions
– Decide if AATGC is more likely a coding region
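A minimal sketch of this comparison (the probabilities in M1 and M2 are invented for illustration):

import math

def seq_prob(seq, model):
    # P(sequence | model), assuming independent "die tosses"
    return math.prod(model[base] for base in seq)

M0 = {b: 0.25 for b in "ACGT"}                        # random model
M1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}     # hypothetical coding-region model
M2 = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}     # hypothetical non-coding model

seq = "AATGC"
print(seq_prob(seq, M0))                              # 0.25**5 ≈ 0.000977
llr = math.log(seq_prob(seq, M1) / seq_prob(seq, M2)) # log-likelihood ratio
print("more likely coding" if llr > 0 else "more likely non-coding")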
Probability Distributions
• Binomial: number of successes out of N trials

  p(k | N) = C(N, k) · p^k · (1 − p)^(N−k)

• Gaussian: sum of N independent R.V.’s

  f(x) = 1 / (√(2π)·σ) · exp( −(x − μ)² / (2σ²) )

• Multinomial: getting ni occurrences of outcome i

  p(n1, …, nk | N) = [ N! / (n1! ··· nk!) ] · ∏_{i=1}^{k} pi^{ni}
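A short sketch (not from the slides) that evaluates the binomial and multinomial formulas directly:

import math

def binomial_pmf(k, N, p):
    # P(k successes out of N trials, each succeeding with probability p)
    return math.comb(N, k) * p**k * (1 - p)**(N - k)

def multinomial_pmf(counts, probs):
    # P(n1, ..., nk occurrences of each outcome), where N = n1 + ... + nk
    N = sum(counts)
    coeff = math.factorial(N)
    for n in counts:
        coeff //= math.factorial(n)
    return coeff * math.prod(p**n for p, n in zip(probs, counts))

print(binomial_pmf(3, 10, 0.5))                   # ≈ 0.117
print(multinomial_pmf([2, 1, 1, 1], [0.25] * 4))  # counts of A, C, G, T in 5 fair "die" tosses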
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that
governs the random experiment
– The model gives the probability of any data, p(D|θ),
which depends on the parameter θ
– Now, given actual sample data X={x1,…,xn}, what
can we say about the value of θ?
• Intuitively, take your best guess of θ, where “best”
means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood Estimator
Data: a sequence d with counts c(w1), …, c(wN), and length |d|
Model: multinomial M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M* = argmax_M p(d|M)

p(d|M) = [ |d|! / ( c(w1)! ··· c(wN)! ) ] · ∏_{i=1}^{N} θi^c(wi)  ∝  ∏_{i=1}^{N} θi^c(wi),   where θi = p(wi)

l(d|M) = log p(d|M) = ∑_{i=1}^{N} c(wi) log θi          (we’ll tune the θi = p(wi) to maximize l(d|M))

Use the Lagrange multiplier approach:

l'(d|M) = ∑_{i=1}^{N} c(wi) log θi + λ ( ∑_{i=1}^{N} θi − 1 )

Set the partial derivatives to zero:   ∂l'/∂θi = c(wi)/θi + λ = 0   ⇒   θi = −c(wi)/λ

Since ∑_{i=1}^{N} θi = 1, we get λ = −∑_{i=1}^{N} c(wi) = −|d|

So, θi = p(wi) = c(wi)/|d|        (the ML estimate)
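A minimal sketch of this ML estimator in code (the example sequence is illustrative):

from collections import Counter

def ml_estimate(d):
    # ML estimate for a multinomial: p(w) = c(w) / |d|
    counts = Counter(d)
    return {w: c / len(d) for w, c in counts.items()}

print(ml_estimate("AATGCAATTG"))   # p(A)=0.4, p(T)=0.3, p(G)=0.2, p(C)=0.1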
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
θ̂ = argmax_θ P(X | θ)
– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior”
knowledge and explaining data well
θ̂ = argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ)
– Problem: how to define prior?
Bayesian Estimator
• ML estimator: M* = argmax_M p(d|M)
• Bayesian estimator:
– First consider posterior: p(M|d) =p(d|M)p(M)/p(d)
– Then, consider the mean or mode of the posterior dist.
• p(d|M) : Sampling distribution (of data)
• P(M) = p(θ1, …, θN): our prior on the model parameters
• conjugate = prior can be interpreted as “extra”/“pseudo” data
• Dirichlet distribution is a conjugate prior for multinomial
sampling distribution
( 1     N ) N  i 1
Dir( |  1 ,  ,  N ) 
i
( 1 )  ( N ) i 1
“extra”/“pseudo” counts e.g., i= p(wi|REF)
Dirichlet Prior Smoothing (cont.)
Posterior distribution of parameters:

p(θ | d) = Dir(θ | c(w1) + α1, …, c(wN) + αN)

Property: if θ ~ Dir(θ | α), then E(θi) = αi / ∑_i αi

The predictive distribution is the same as the mean:

p(wi | θ̂) = ∫ p(wi | θ) Dir(θ | α) dθ = ( c(wi) + αi ) / ( |d| + ∑_{i=1}^{N} αi ) = ( c(wi) + μ·p(wi | REF) ) / ( |d| + μ )
This is the Bayesian estimate (what happens as |d| → ∞?)
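A minimal sketch of the Dirichlet-smoothed (Bayesian) estimate (the reference distribution and μ are illustrative):

from collections import Counter

def dirichlet_smoothed(d, p_ref, mu):
    # Bayesian estimate with a Dirichlet prior: (c(w) + mu*p(w|REF)) / (|d| + mu)
    counts = Counter(d)
    return {w: (counts[w] + mu * p_ref[w]) / (len(d) + mu) for w in p_ref}

p_ref = {b: 0.25 for b in "ACGT"}                  # hypothetical reference distribution
print(dirichlet_smoothed("AATGCAATTG", p_ref, mu=4))
# Unlike the ML estimate c(w)/|d|, outcomes with zero counts keep a nonzero probability.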
Illustration of Bayesian Estimation
[Figure: the posterior p(θ|D) ∝ p(D|θ)·p(θ) combines the prior p(θ) with the likelihood p(D|θ) of the data D = (c1, …, cN); along the θ axis, the prior mode, the posterior mode, and the ML estimate θml are marked, with the posterior mode lying between the prior mode and θml.]
Basic Concepts in Information Theory
• Entropy: Measuring uncertainty of a random
variable
• Kullback-Leibler divergence: comparing two
distributions
• Mutual Information: measuring the correlation
of two random variables
Entropy
Entropy H(X) measures the average uncertainty of random variable X
H(X) = H(p) = − ∑_x p(x) log p(x)     (sum over all possible values x; define 0·log 0 = 0, log = log2)

Example:
  fair coin, p(H) = 0.5            →  H(X) = 1
  biased coin, p(H) = 0.8          →  H(X) is between 0 and 1
  completely biased, p(H) = 1      →  H(X) = 0

Properties: H(X) ≥ 0; min = 0; max = log M, where M is the total number of possible values
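A quick sketch verifying the coin examples above:

import math

def entropy(p):
    # H(p) = -sum p(x) log2 p(x), with 0*log(0) defined as 0
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.8, 0.2]))   # biased coin: ≈ 0.72 bits
print(entropy([1.0, 0.0]))   # completely biased coin: 0.0 bits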
Interpretations of H(X)
• Measures the “amount of information” in X
– Think of each value of X as a “message”
– Think of X as a random experiment (20 questions)
• Minimum average number of bits needed to compress
values of X
– The more random X is, the harder it is to compress
A fair coin has the maximum information, and is hardest to compress
A biased coin has some information, and can be compressed to <1 bit on average
A completely biased coin has no information, and needs only 0 bits

“Information of x” = “# bits to code x” = −log p(x)        H(X) = E_p[ −log p(x) ]
Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q?
Expected # of bits:   H(p, q) = E_p[ −log q(x) ] = − ∑_x p(x) log q(x)

Intuitively, H(p, q) ≥ H(p), and mathematically:

H(p, q) − H(p) = ∑_x p(x) [ −log ( q(x) / p(x) ) ]
              ≥ −log ∑_x [ p(x) · q(x) / p(x) ] = −log ∑_x q(x) = 0

By Jensen’s inequality:   ∑_i pi·f(xi) ≥ f( ∑_i pi·xi ),
where f is a convex function (here f = −log) and ∑_i pi = 1
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q?
How many bits would we waste?

D(p || q) = H(p, q) − H(p) = ∑_x p(x) log ( p(x) / q(x) )        (relative entropy)

Properties:
- D(p||q) ≥ 0
- D(p||q) ≠ D(q||p)
- D(p||q) = 0 iff p = q
KL-divergence is often used to measure the
distance between two distributions
Interpretation:
- For fixed p, D(p||q) and H(p,q) vary in the same way
- If p is an empirical distribution, minimizing D(p||q) or H(p,q) is
equivalent to maximizing the likelihood
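A small numeric sketch (distributions chosen for illustration) checking that D(p||q) = H(p,q) − H(p) ≥ 0:

import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
print(abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12)   # True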
Cross Entropy, KL-Div, and Likelihood
Data / sample for X:   Y = (y1, …, yN)

Empirical distribution:   p̃(x) = (1/N) ∑_{i=1}^{N} δ(yi, x),   where δ(y, x) = 1 if x = y, and 0 otherwise

Likelihood:       L(Y) = ∏_{i=1}^{N} p(X = yi)

log-likelihood:   log L(Y) = ∑_{i=1}^{N} log p(X = yi) = ∑_x c(x) log p(X = x) = N ∑_x p̃(x) log p(x)

(1/N) log L(Y) = −H(p̃, p) = −D(p̃ || p) − H(p̃)

Fixing the data:   argmax_p log L(Y) = argmin_p H(p̃, p) = argmin_p D(p̃ || p)

Criterion for estimating a good model
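A small sketch of this equivalence (the candidate models are invented for illustration): the model with the highest log-likelihood for a sample is also the one with the smallest KL divergence from the empirical distribution.

import math
from collections import Counter

sample = "AATGCAATTG"
counts = Counter(sample)
p_emp = {b: counts[b] / len(sample) for b in "ACGT"}    # empirical distribution

candidates = {
    "uniform": {b: 0.25 for b in "ACGT"},
    "A-rich":  {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
}

def log_likelihood(model):
    return sum(counts[b] * math.log(model[b]) for b in counts)

def kl_divergence(p, q):
    return sum(p[b] * math.log(p[b] / q[b]) for b in p if p[b] > 0)

for name, m in candidates.items():
    print(name, log_likelihood(m), kl_divergence(p_emp, m))
# Ranking by log-likelihood (higher is better) matches ranking by D(p_emp || model) (lower is better).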
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs p(x)p(y)
I(X; Y) = ∑_{x,y} p(x, y) log [ p(x, y) / ( p(x)·p(y) ) ] = H(X) − H(X|Y) = H(Y) − H(Y|X)

Conditional entropy H(Y|X):

H(Y|X) = E_{p(x,y)}[ −log p(y|x) ] = − ∑_{x,y} p(x, y) log p(y|x) = − ∑_x p(x) ∑_y p(y|x) log p(y|x)

Properties: I(X;Y) ≥ 0;  I(X;Y) = I(Y;X);  I(X;Y) = 0 iff X and Y are independent
Interpretations:
- Measures how much the uncertainty of X is reduced given information about Y
- Measures correlation between X and Y
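A minimal sketch (the joint distribution is made up for illustration) computing I(X;Y) directly from the definition:

import math

# Hypothetical joint distribution p(x, y) over X in {0,1} and Y in {0,1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}  # marginal p(x)
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}  # marginal p(y)

I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(I)   # > 0 here; it would be 0 iff p(x,y) = p(x)p(y) for all x, y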
What You Should Know
• Computational complexity, big-O notation
• Probability concepts:
– sample space, event, random variable, conditional probability,
multinomial distribution, etc.
• Bayes’ formula and its interpretation
• Statistics: know how to compute the maximum likelihood estimate
• Information theory concepts:
– entropy, cross entropy, relative entropy, conditional entropy,
KL-divergence, mutual information, and their relationships