Introduction


Course Introduction
• What these courses are about
• What I expect
• What you can expect
What these courses are about
• overview of ways in which computers are
used to solve problems in biology
• supervised learning of illustrative or
frequently-used algorithms and programs
• supervised learning of programming
techniques and algorithms selected from
these uses
What I Expect
• students will have basic knowledge of
biology and chemistry (at the level of
Modern Biology/Chemistry)
• students will have basic familiarity with
statistics
• students have some programming
experience and willingness to work to
improve
You can expect
• Homework assignments
– 80% of grade
• Final (20% of grade)
• Grades totally determined by points
system
Textbook
• Required textbook: Biological
Sequence Analysis: Probabilistic
models of proteins and nucleic acids
by Durbin et al.
• Recommended additional textbook:
Introduction to Computational Biology
by Waterman
Chapter 1
Introduction
Purpose
• A great acceleration in the accumulation of
biological knowledge started in our era
• Part of the challenge is to organize,
classify and parse the immense richness
of sequence data
• This is not just a task of string parsing, for behind the string of bases or amino acids lies the whole complexity of molecular biology
Information Flow
• A major task in computational molecular
biology is to “decipher” information
contained in biological sequences
• Since the nucleotide sequence of a
genome contains all information
necessary to produce a functional
organism, we should in theory be able
to duplicate this decoding using
computers
Review of basic biochemistry
• Central Dogma: DNA makes RNA makes
protein
• Sequence determines structure
determines function
Structure
• DNA composed of four nucleotides or
"bases": A,C,G,T
• RNA composed of four also: A,C,G,U (T
transcribed as U)
• proteins are composed of amino acids
Purpose
This class is about methods which are in
principle capable of capturing some of the
complexity of biology, by integrating diverse
sources of biological information into clean,
general, and tractable probabilistic models
for sequence analysis.
However,
The most reliable way to determine a
biological molecule’s structure or function
is by direct experimentation.
It is far easier to obtain the DNA sequence
of the gene corresponding to an RNA or
protein than it is to experimentally
determine its function or its structure.
The Human Genome Project
Gives us the raw sequence of an estimated
20,000-25,000 human genes, only a small
fraction of which have been studied
experimentally.
The development of computational methods has therefore become increasingly important, drawing on computer science, statistics, and related fields.
Basic Information
• New sequences are adapted from pre-existing
sequences
• We compare a new sequence with an old
sequence with known structure or function
• Two related sequences are called homologous
and we are transferring information by homology
• This is somewhat like determining the similarity between two text strings
• In fact, we will be trying to find a plausible
alignment between sequences
Definition
• A sequence is a linear set of characters
(sequence elements) representing
nucleotides or amino acids
– DNA composed of four nucleotides or
"bases": A,C,G,T
– RNA composed of four also: A,C,G,U (T
transcribed as U)
– proteins are composed of amino acids (20)
Character representation of
sequences
• DNA or RNA
– use 1-letter codes (e.g., A,C,G,T)
• protein
– use 1-letter codes
• can convert to/from 3-letter codes
(e.g., A = Ala = Alanine
C = Cys = Cysteine)
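To make the code conventions concrete, here is a small Python sketch (illustrative, not part of the original slides) that converts between the 1-letter and 3-letter codes for the 20 standard amino acids:

```python
# 1-letter to 3-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Reverse mapping: 3-letter to 1-letter codes.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}

print(ONE_TO_THREE["C"])    # Cys
print(THREE_TO_ONE["Ala"])  # A
```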
Alignment
• Find the best alignment between two strings under some
scoring system
• “+1” for a match; “-1” for a mismatch
• Most important, we want a scoring system to give the
biologically most likely alignment the highest score
• Note that biological molecules have evolutionary
histories, 3D folded structures, and other features
• This is more the realm of statistics than computer
science
• A probabilistic modeling approach can be used and extended
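As a small illustration of the match/mismatch scheme above, the Python sketch below (assuming an ungapped alignment of two equal-length strings) sums +1 for each matching position and -1 for each mismatch:

```python
def alignment_score(x, y, match=1, mismatch=-1):
    """Score an ungapped alignment of two equal-length sequences."""
    assert len(x) == len(y), "sequences must already be aligned to equal length"
    return sum(match if a == b else mismatch for a, b in zip(x, y))

print(alignment_score("GATTACA", "GACTATA"))  # 5 matches, 2 mismatches -> 3
```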
Probabilities & Probabilistic
Models
• A model means a system that simulates the
object under consideration
• A probabilistic model produces different outcomes with different probabilities
• That is, it simulates a whole class of objects and assigns each object an associated probability
• The objects will be sequences, and a model
might describe a family of related sequences
Example: Rolling a six-sided die
• A probability model of rolling a 6-sided die
involves 6 parameters p1, p2, p3, p4, p5,
and p6
• The probability of rolling i is pi
• pi≧0 and Σpi=1
• Rolling the die 3 times independently,
P([1,6,3])= p1 p6 p3
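A minimal Python sketch of this die model (the parameter values below are just the fair-die setting, for illustration):

```python
# Parameters p1..p6 of the die model; must be non-negative and sum to 1.
p = [1/6] * 6  # fair die, for illustration

def prob_of_rolls(rolls, p):
    """Probability of an independent sequence of rolls, e.g. [1, 6, 3]."""
    prob = 1.0
    for r in rolls:
        prob *= p[r - 1]  # p[r-1] is the probability of rolling face r
    return prob

print(prob_of_rolls([1, 6, 3], p))  # = p1 * p6 * p3 = (1/6)**3 for a fair die
```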
Example: Biological Sequence
• Biological sequences are strings from
finite alphabet of residues (4 nucleotides
or 20 amino acids)
• A residue a occurs at random with
probability qa, independent of all other
residues in the sequence
• If the sequence is denoted by x1… xn, the
probability of the whole sequence is
qx1 qx2… qxn
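In the same way, the probability of a whole sequence under this independence model is the product of its residue probabilities. The sketch below (with illustrative, made-up nucleotide frequencies) works in log space, the usual way to avoid numerical underflow on long sequences:

```python
import math

# Illustrative (not estimated) background frequencies q_a for DNA residues.
q = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def log_prob_sequence(seq, q):
    """log P(x1..xn) = sum_i log q_{x_i} under the independent-residue model."""
    return sum(math.log(q[residue]) for residue in seq)

seq = "ACGTGCA"
print(math.exp(log_prob_sequence(seq, q)))  # equals q_A * q_C * q_G * ... * q_A
```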
Maximum Likelihood Estimation
• The parameters of a probability model are estimated from a training set (sample)
• The probability qa for amino acid a can be
estimated as the observed frequency of
residues in a database of known protein
sequences (SWISS-PROT)
• This assumes the training sequences are not systematically biased towards a peculiar residue composition
http://au.expasy.org/sprot/
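A minimal sketch of this maximum likelihood estimate on a toy training set (a real estimate would of course use a large database such as SWISS-PROT):

```python
from collections import Counter

def mle_frequencies(sequences):
    """MLE of residue probabilities: observed counts divided by total residues."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Toy training set of protein fragments, for illustration only.
training = ["ACDEFGH", "ACCGH", "MKVH"]
print(mle_frequencies(training))  # e.g. frequency of 'C' = 3/16
```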
MLE (continued)
• This way of estimating models is called
maximum likelihood estimation (MLE)
• The MLE maximizes the total probability of
all sequences given the model (the
likelihood)
• Given a model with parameters θ and a
set of data D, the maximum likelihood
estimate for θ is that value which
maximizes P(D|θ)
Estimation
• If estimating parameters from a limited amount
of data, there is a danger of overfitting
• Overfitting: The model becomes very well
adapted to the training data, but it will not
generalize well to new data
• For example, observing three coin flips [tail, tail, tail] would lead to the maximum likelihood estimate that the probability of heads is 0 and that of tails is 1
Conditional, Joint, and Marginal
• We have two dice, D1 and D2
• The conditional probability of rolling i given die
D1 is called P(i|D1)
• We pick a die with probability P(Dj), j=1, 2
• The probability for picking die Dj and rolling an i
is the product of the two probabilities, P(i,
Dj)=P(Dj)P(i|Dj), the joint probability
• P(X, Y)=P(X|Y)P(Y)
• P(X)=ΣYP(X, Y)=ΣY P(X|Y)P(Y), the marginal
probability
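A small Python sketch of these relationships, with illustrative numbers for two hypothetical dice (D1 fair, D2 loaded toward six):

```python
# P(Dj): probability of picking each die (illustrative values).
P_die = {"D1": 0.5, "D2": 0.5}
# P(i | Dj): D1 is fair; D2 shows a six half the time.
P_roll = {
    "D1": {i: 1/6 for i in range(1, 7)},
    "D2": {**{i: 0.1 for i in range(1, 6)}, 6: 0.5},
}

def joint(i, die):
    """Joint probability P(i, Dj) = P(Dj) * P(i | Dj)."""
    return P_die[die] * P_roll[die][i]

def marginal(i):
    """Marginal probability P(i) = sum over dice of P(i | Dj) P(Dj)."""
    return sum(joint(i, die) for die in P_die)

print(joint(6, "D2"))  # 0.25
print(marginal(6))     # 0.5*(1/6) + 0.5*0.5 ≈ 0.333
```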
Bayes Theorem
• Bayes’ theorem
P(X | Y) = P(Y | X) P(X) / P(Y)
• The denominator is the marginal
• The numerator is the joint
Example 1
Consider an occasionally dishonest casino
that uses two kinds of dice. Of the dice 99%
are fair but 1% are loaded so that a six
comes up 50% of the time. Suppose we pick
a die at random and roll it three times,
getting three consecutive sixes. What is
P(Dloaded|3 sixes)?
Example 1 (Continued)
P(Dloaded | 3 sixes) = P(3 sixes | Dloaded) P(Dloaded) / P(3 sixes)
P(Dloaded) = 0.01
P(3 sixes | Dloaded) = 0.5^3 = 0.125
P(3 sixes) = P(3 sixes | Dloaded) P(Dloaded) + P(3 sixes | Dfair) P(Dfair)
Example 1 (Continued)
P(Dloaded | 3 sixes) = (0.5^3)(0.01) / [ (0.5^3)(0.01) + (1/6)^3 (0.99) ] ≈ 0.21
It is still more likely that we picked a fair die, despite seeing three successive sixes.
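A minimal Python sketch of this Bayes' theorem calculation:

```python
# Occasionally dishonest casino: 1% of dice are loaded (six comes up 50% of the time).
p_loaded, p_fair = 0.01, 0.99
p_sixes_given_loaded = 0.5 ** 3     # P(3 sixes | loaded)
p_sixes_given_fair = (1 / 6) ** 3   # P(3 sixes | fair)

# Marginal probability of the observation, then the posterior by Bayes' theorem.
p_sixes = p_sixes_given_loaded * p_loaded + p_sixes_given_fair * p_fair
posterior_loaded = p_sixes_given_loaded * p_loaded / p_sixes
print(round(posterior_loaded, 2))   # ≈ 0.21
```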
Example 2
• Assume that, on average, extracellular proteins have a slightly different amino acid composition than intracellular proteins
• For example, cysteine is more common in
extracellular than intracellular proteins
• Question: whether a new protein
sequence x=x1…xn is intracellular or
extracellular?
Example 2 (continued)
• We first split our training examples from SWISS-PROT into extracellular and intracellular proteins
• Estimate a set of frequencies q^int_a for intracellular proteins, and a corresponding set of extracellular frequencies q^ext_a
• The probability that any new sequence is
extracellular is pext, and the corresponding
probability of being intracellular is pint. Note that
pint=1- pext
Example 2 (continued)
P(x | ext) = Π_i q^ext_xi  and  P(x | int) = Π_i q^int_xi
P(x) = p^ext P(x | ext) + p^int P(x | int)
By Bayes' theorem,
P(ext | x) = p^ext Π_i q^ext_xi / ( p^ext Π_i q^ext_xi + p^int Π_i q^int_xi )
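A minimal Python sketch of this classification rule. The residue frequencies and the prior p^ext below are illustrative assumptions, not real SWISS-PROT estimates, and the alphabet is truncated to keep the example short:

```python
import math

# Illustrative residue frequencies (not real estimates) over a truncated alphabet.
q_ext = {"C": 0.05, "A": 0.25, "G": 0.30, "L": 0.40}
q_int = {"C": 0.01, "A": 0.29, "G": 0.30, "L": 0.40}
p_ext = 0.5            # assumed prior probability that a sequence is extracellular
p_int = 1 - p_ext

def posterior_extracellular(x):
    """P(ext | x) by Bayes' theorem under the independent-residue model."""
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in x)
    log_int = math.log(p_int) + sum(math.log(q_int[a]) for a in x)
    m = max(log_ext, log_int)                     # for numerical stability
    e, i = math.exp(log_ext - m), math.exp(log_int - m)
    return e / (e + i)

print(posterior_extracellular("CCALG"))  # cysteine-rich, so it leans extracellular
```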
Bayesian Model
• θ is the parameter of interest
• Before collecting data, the information regarding
θ is called the prior information, P(θ)
• After collecting the data, the information regarding θ is called the posterior information, P(θ|D)
• If we do not have enough data to reliably
estimate the parameters, we can use prior
knowledge to constrain the estimates
Bayesian and Frequentist
D ~ N(θ,1)
• To frequentists, θ is fixed (unknown)
• To Bayesians, θ is random
• If θ is random, what should its distribution
be?
• Frequentists argue that the determination
of the prior distribution of θ is very
subjective
Prior and Posterior
• Suppose that θ has a probability distribution
P(θ) (prior)
• Assume that, given θ, the data D follow the sampling distribution P(D | θ)
• P(D, θ) is the joint distribution of D and θ
• P(D | θ) is the conditional distribution of D given
θ
• P(θ | D) is the conditional distribution of θ
given D (the posterior)
Prior and Posterior
• P(D| θ)P(θ)=P(D, θ)=P(θ | D)P(D)
• Bayes’ theorem:
PD |  P( )
P | D  
P D 
Posterior Distribution
Given D’s density p(D|θ) and a prior
probability density P(θ), the posterior
density for θ is given as
p(θ|D)=cp(θ) p(D|θ) ,
where
c-1=∫ p(θ) p(D|θ) dθ
(the marginal of D).
Example
D ~ N(θ, s²), where s is known.
P(θ) = N(μ0, s0²)
Then the posterior density is normal with
mean = ( μ0/s0² + d/s² ) / ( 1/s0² + 1/s² )
(where d is the observed value of D), and
variance = ( 1/s0² + 1/s² )^(-1)
Conjugate Prior
• D ~ N(θ, s²) is a normal distribution
• The prior distribution, P(θ) = N(μ0, s0²), is also a normal distribution
• The posterior distribution, P(θ | D), is also a normal distribution
• The normal distribution is conjugate to
the normal
Specification of the Prior
• Conjugate priors:
–The beta distribution is conjugate to the
binomial
–The normal distribution is conjugate to the
normal
–The gamma distribution is conjugate to the
Poisson
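For instance, the beta-binomial pair gives a closed-form update: a Beta(α, β) prior on a success probability combined with k successes in n trials yields a Beta(α + k, β + n - k) posterior. A small sketch:

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Beta prior + binomial data -> Beta posterior (conjugacy)."""
    return alpha + successes, beta + (trials - successes)

# Beta(1, 1) is a flat prior; observing 7 successes in 10 trials gives Beta(8, 4).
a, b = beta_binomial_update(1, 1, successes=7, trials=10)
print(a, b, "posterior mean:", a / (a + b))  # 8 4 posterior mean: 0.666...
```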
Specification of the Prior
• Noninformative (uninformative) priors:
P(θ) ∝ constant
• Used when we do not have a strong prior belief, or in public-policy situations where opinions strongly differ
Specification of the Prior
Sometimes, we will wish to use an
informative P(θ). We know a priori that the
amino acids phenylalanine (Phe, F), tyrosine (Tyr,
Y), and tryptophan (Trp, W) are structurally similar
and often evolutionarily interchangeable. We
would want to use a P(θ) that tends to favor
parameter sets that assign them very similar
probabilities.
Parameter Estimation
• Choose the value of θ that maximizes P(θ | D)
• This is called maximum a posteriori or MAP
estimation
• MAP estimation maximizes the likelihood times
the prior
• If the prior is flat (uninformative), then MAP is
equal to the MLE
• Another approach to parameter estimation is to use the mean of the posterior
Maximum A Posteriori (MAP) Estimation
• Ex: estimating probabilities for a die
We roll 1, 3, 4, 2, 4, 6, 2, 1, 2, 2
MLE: p5 = 0 → overfitting
Add 1 to each observed count (a pseudocount):
MAP: p5 = 1/16
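A minimal Python sketch of both estimates for this roll sequence:

```python
from collections import Counter

rolls = [1, 3, 4, 2, 4, 6, 2, 1, 2, 2]
counts = Counter(rolls)

# MLE: observed frequency; face 5 was never rolled, so its estimate is 0 (overfitting).
mle = {i: counts[i] / len(rolls) for i in range(1, 7)}

# MAP with a pseudocount of 1 added to each face's count.
map_est = {i: (counts[i] + 1) / (len(rolls) + 6) for i in range(1, 7)}

print(mle[5])      # 0.0
print(map_est[5])  # 0.0625 = 1/16
```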
When estimating large parameter sets from
small amounts of data, we believe that
Bayesian methods provide a consistent
formalism for bringing in additional
information from previous experience with
the same type of data.