#### Transcript StatMod - Alan Moses

Review of statistical modeling
and probability theory
Alan Moses
ML4bio
What is modeling?
• Describe some observations in a simple,
more compact way
X = (X1,X2)
Model: $a = -\frac{Gm}{r^2}$
Instead of all the observations, we only need to
remember a constant ‘G’ and measure some
parameters ‘m’ and ‘r’.
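As a rough illustration of how compact such a model is, the sketch below evaluates a = -Gm/r² once. The numerical values for G, the Earth's mass and the Earth's radius are standard reference values added here for illustration; they are not from the slides, and the function name is hypothetical.

```python
# Reference values added for illustration (not taken from the slides).
G = 6.674e-11           # gravitational constant, m^3 kg^-1 s^-2
EARTH_MASS = 5.972e24   # kg
EARTH_RADIUS = 6.371e6  # m


def acceleration(m, r, g_const=G):
    """Model from the slide: a = -G m / r^2 (negative sign: toward the mass)."""
    return -g_const * m / r**2


# One constant and two measured parameters replace a whole table of observations.
a_surface = acceleration(EARTH_MASS, EARTH_RADIUS)
print(f"a at the Earth's surface: {a_surface:.2f} m/s^2")
```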
What is statistical modeling?
• Deals also with the ‘uncertainty’ in
observations
[Figure: a distribution labelled with its Expectation and its Deviation or Variance]
• Mathematics is more complicated
• Also use the term ‘probabilistic’ modeling
What kind of questions will we answer
in this course?
• What’s the best linear model to explain some data?
• Are there multiple groups? What are they?
• Given new data, which group do we assign it to?
3 major areas of machine learning
(that have proven useful in biology)
• Regression: What’s the best linear model to explain some data?
• Clustering: Are there multiple groups? What are they?
• Classification: Given new data, which group do we assign it to?
Molecular Biology example
X = (L,D)
[Figure: Expression Level distribution labelled with Expectation, Variance and disease]
• “clustering”: two groups of Expression Level with expectations E1, E2 and variances V1, V2; Class 2 is “enriched” for disease
• “regression”: Expression Level against Genotype (AA, Aa, aa)
• “classification”: Expression Level against Genotype (AA, Aa, aa), with a new Aa point labelled “disease?”
Probability theory
• Probability theory quantifies uncertainty
using ‘distributions’
• Distributions are the ‘models’ and they
depend on constants and parameters
E.g., in one dimension, the Gaussian or Normal distribution
depends on two constants e and π and two parameters that
have to be measured, μ and σ
$P(X|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X-\mu)^2}{2\sigma^2}}$
‘X’ are the possible datapoints that could come
from the distribution. In statistics jargon ‘X’ is
called a random variable
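The one-dimensional Gaussian density can be written directly from the formula. This is a minimal sketch; the function name `gaussian_pdf` and the parameter values in the loop are hypothetical choices for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density P(x | mu, sigma) of the one-dimensional Gaussian."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma**2)
    return coeff * math.exp(exponent)


# Hypothetical parameters: the density is largest at mu and symmetric around it.
mu, sigma = 0.0, 1.0
for x in (-2.0, 0.0, 2.0):
    print(x, gaussian_pdf(x, mu, sigma))
```

Only μ and σ have to be measured; e and π are fixed constants, as on the slide.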
Probability theory
• Probability theory quantifies uncertainty
using ‘distributions’
• Choosing the distribution or ‘model’ is the
first step in a statistical model
• E.g., data: mRNA expression levels,
counts of sequencing reads, presence or
absence of protein domains or ‘A’, ‘C’, ‘G’
and ‘T’s
• We will use different distributions to
describe these different types of data.
Typical data and distributions
• Data is categorical (yes or no, A,C,G,T)
• Data is a fraction (e.g., 13 out of 5212)
• Data is a continuous number (e.g., -6.73)
• Data is a ‘natural’ number (0,1,2,3,4…)
• It’s also possible to do regression,
clustering and classification without
specifying a distribution
Molecular Biology example
• In this example, we might try to combine a Bernoulli for the
disease data, Poisson for the genotype and Gaussian for the
expression level
• We also might try to classify without specifying distributions
[Figure: “classification”: Expression Level against Genotype (AA, Aa, aa), with a new Aa point labelled “disease?”]
Molecular Biology example
• genomics era means we will almost never have the expression level
for just one gene or the genotype at just one locus
• Each gene’s expression level can be considered another ‘dimension’
• for two genes, if each point is data for one person, we
can make a graph of this type of data
[Figure: scatter plot of Gene 2 Expression Level against Gene 1 Expression Level]
• for 1000s of genes…. (Gene 3, Gene 4, Gene 5, …)
Molecular Biology example
• genomics era means we will almost never have the expression level
for just one gene or the genotype at just one locus
• We’ll usually make 2-D plots, but anything we say about 2-D can
usually be generalized to n-dimensions
[Figure: scatter plot of Gene 2 Expression Level against Gene 1 Expression Level]
Each “observation”, X, contains the expression level for Gene 1 and Gene 2.
Represent this as a vector, e.g., X = (1.3, 4.6)
Or generally X = (X1, X2)
This gives a geometric interpretation to
multivariate statistics
Probability theory
• Probability theory quantifies uncertainty
using ‘distributions’
• Distributions are the ‘models’ and they
depend on constants and parameters
E.g., in two dimensions, the Gaussian or Normal distribution
depends on two constants e and π and 5 parameters that have
to be measured, μ and Σ
$P(X|\mu,\Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} e^{-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)}$
‘X’ are the possible datapoints that could come
from the distribution. In statistics jargon ‘X’ is
called a random variable
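A sketch of the two-dimensional density, with the 2×2 determinant and inverse written out by hand so that no linear-algebra library is needed. The function name and the example values are hypothetical. The five parameters are μ1, μ2, σ11, σ22 and σ12 = σ21.

```python
import math


def bivariate_gaussian_pdf(x, mu, cov):
    """Density of a 2-D Gaussian; cov is ((s11, s12), (s21, s22))."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21  # |Sigma|
    if s11 <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Explicit inverse of a 2x2 matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (X - mu)^T Sigma^-1 (X - mu).
    quad = d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


# Hypothetical example: the observation X = (1.3, 4.6) under an assumed mean and covariance.
print(bivariate_gaussian_pdf((1.3, 4.6), (1.0, 4.0), ((1.0, 0.5), (0.5, 1.0))))
```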
What does the mean mean in
2 dimensions?
What does the standard
deviation mean?
Bivariate Gaussian
Molecular Biology example
• genomics era means we will almost never have the expression level
for just one gene or the genotype at just one locus
• We’ll usually make 2-D plots, but anything we say about 2-D can
usually be generalized to n-dimensions
[Figure: scatter plot of Gene 2 Expression Level against Gene 1 Expression Level, with µ labelled]
Each “observation”, X, contains the expression level for Gene 1 and Gene 2.
Represent this as a vector:
X = (X1, X2)
The mean is also a vector:
µ = (µ1, µ2)
The variance is a matrix:
$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$
[Figure: four scatter plots of 300 points each, column [,1] against column [,2] of the rmvnorm output, both axes from -4 to 4, with µ labelled]
• “spherical covariance”, Σ = σ²I:
$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0, 0, 1), ncol = 2))
• “axis-aligned, diagonal covariance”:
$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}$
rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0, 0, 4), ncol = 2))
• “correlated data”:
$\Sigma = \begin{pmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{pmatrix}$
rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2))
• “full covariance”:
$\Sigma = \begin{pmatrix} 1.0 & -1.9 \\ -1.9 & 4.0 \end{pmatrix}$
rmvnorm(n = 300, mean = c(1, 1), sigma = matrix(c(1, -1.9, -1.9, 4), ncol = 2))
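The plots were drawn in R with rmvnorm. As a rough Python counterpart, the sketch below draws points from a bivariate Gaussian by transforming independent standard normal draws with a hand-written 2×2 Cholesky factor, then checks that the sample covariance is close to Σ. The function names, the seed and the sample size are assumptions made for illustration; this is not how rmvnorm itself is implemented.

```python
import math
import random


def sample_bivariate_gaussian(n, mean, cov, seed=0):
    """Draw n points from a 2-D Gaussian with the given mean and covariance."""
    rng = random.Random(seed)
    (s11, s12), (_, s22) = cov
    # Cholesky factor of cov (cov = L L^T), written out for the 2x2 case.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21**2)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return points


def sample_covariance(points):
    """Covariance matrix of the points (dividing by n)."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / n
    s22 = sum((p[1] - m2) ** 2 for p in points) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / n
    return ((s11, s12), (s12, s22))


# The "full covariance" example from the slide, with more points than the 300 plotted.
full_cov = ((1.0, -1.9), (-1.9, 4.0))
points = sample_bivariate_gaussian(20000, (1.0, 1.0), full_cov, seed=1)
print(sample_covariance(points))
```

Swapping in the other three matrices from the slide gives the spherical, axis-aligned and correlated cases.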
Probability theory
• Probability theory quantifies uncertainty
using ‘distributions’
• Distributions are the ‘models’ and they
depend on constants and parameters
• Once we choose a distribution, the next
step is to choose the parameters
• This is called “estimation” or “inference”
$P(X|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X-\mu)^2}{2\sigma^2}}$
Estimation
We want to make a statistical model.
1. Choose a model (or probability distribution)
2. Estimate its parameters
• Choose the parameters so the model ‘fits the data’
[Figure: distribution of Expression Level labelled with Expectation and Variance]
$P(X|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X-\mu)^2}{2\sigma^2}}$
How do we know which parameters fit the data?
• There are many ways to measure how well a model fits
the data
• Different “objective functions” will produce different
“estimators” (e.g., MSE, ML, MAP)
Laws of probability
(True for all distributions)
• If X1 … XN are a series of random variables (think
datapoints)
P(X1, X2) is the “joint probability” and is equal to
P(X1) P(X2) if X1 and X2 are independent.
If all of X1 … XN are independent:
$P(X_1 \dots X_N) = \prod_{i=1}^{N} P(X_i)$
P(X1 | X2) is the “conditional probability” of event X1 given X2
Conditional probabilities are related by Bayes’ theorem:
$P(X_1|X_2) = \frac{P(X_2|X_1) P(X_1)}{P(X_2)}$
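A small numerical check of these laws, using a made-up joint probability table for disease status (playing the role of X1) and genotype (X2). The probabilities and function names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical joint probabilities P(status, genotype); the numbers are made up.
joint = {
    ("disease", "aa"): 0.08,
    ("disease", "AA"): 0.02,
    ("healthy", "aa"): 0.12,
    ("healthy", "AA"): 0.78,
}


def p_status(status):
    """Marginal P(X1 = status), summing the joint over genotypes."""
    return sum(p for (s, _), p in joint.items() if s == status)


def p_genotype(genotype):
    """Marginal P(X2 = genotype), summing the joint over statuses."""
    return sum(p for (_, g), p in joint.items() if g == genotype)


# Conditional probability computed directly from the joint table.
direct = joint[("disease", "aa")] / p_genotype("aa")

# The same quantity via Bayes' theorem: P(X1|X2) = P(X2|X1) P(X1) / P(X2).
p_aa_given_disease = joint[("disease", "aa")] / p_status("disease")
via_bayes = p_aa_given_disease * p_status("disease") / p_genotype("aa")

# Independence would require P(X1, X2) = P(X1) P(X2); it fails for this table.
independent = math.isclose(joint[("disease", "aa")], p_status("disease") * p_genotype("aa"))

print(direct, via_bayes, independent)
```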
Likelihood and MLEs
• Likelihood is the probability of the data (say X) given
certain parameters (say θ)
L = P(X|θ)
• Maximum likelihood estimation says: choose θ, so that
the data is most probable.
$\frac{\partial L}{\partial \theta} = 0$
• In practice there are many ways to maximize the
likelihood.
Example of ML estimation
Data:
Xi     P(Xi|μ=6.5, σ=1.5)
5.2    0.182737304
9.1    0.059227322
8.2    0.13996368
7.3    0.230761096
7.8    0.182737304
$L = P(X|\theta) = P(X_1 \dots X_N | \mu, \sigma) = \prod_{i=1}^{5} P(X_i|\mu=6.5, \sigma=1.5) = 6.39 \times 10^{-5}$
[Figure: L as a function of the mean, μ]
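The likelihood in this example can be recomputed in a few lines, assuming the five datapoints are independent. The function names are hypothetical. The results agree with the slide to about three significant figures; the small differences in the table appear consistent with π having been rounded to 3.14 there.

```python
import math

data = [5.2, 9.1, 8.2, 7.3, 7.8]  # the Xi from the slide


def gaussian_pdf(x, mu, sigma):
    """Density P(x | mu, sigma) of the one-dimensional Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / math.sqrt(2.0 * math.pi * sigma**2)


def likelihood(xs, mu, sigma):
    """L = product over datapoints of P(Xi | mu, sigma), assuming independence."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in xs)


for x in data:
    print(x, gaussian_pdf(x, 6.5, 1.5))

L = likelihood(data, mu=6.5, sigma=1.5)
print("L =", L)
```

Repeating the last call for other values of μ traces out the curve of L against the mean.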
Example of ML estimation
In practice, we almost always use the log likelihood,
which becomes a very large negative number when there
is a lot of data
[Figure: Log(L) as a function of the mean, μ]
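A sketch of the same example on the log scale: the log likelihood is a sum of log densities, and scanning a grid of candidate means shows where it peaks. The grid range and step, and keeping σ fixed at 1.5, are assumptions made for illustration.

```python
import math

data = [5.2, 9.1, 8.2, 7.3, 7.8]  # the Xi from the ML estimation example


def log_likelihood(xs, mu, sigma):
    """log L for independent Gaussian datapoints: a sum instead of a product."""
    n = len(xs)
    squared_errors = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma**2) - squared_errors / (2.0 * sigma**2)


# Candidate means from 5.00 to 10.00 in steps of 0.01 (an arbitrary grid).
grid = [5.0 + 0.01 * k for k in range(501)]
best_mu = max(grid, key=lambda m: log_likelihood(data, m, 1.5))

print("log L at mu = 6.5:", log_likelihood(data, 6.5, 1.5))
print("grid maximum at mu =", round(best_mu, 2))
```

Because the log is an increasing function, the mean that maximizes Log(L) also maximizes L.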
ML Estimation
• In general, the likelihood is a function of multiple
variables, so the derivatives with respect to all of these
should be zero at a maximum
• In the example of the Gaussian, we have two
parameters, so that
$\frac{\partial L}{\partial \mu} = 0$ and $\frac{\partial L}{\partial \sigma} = 0$
• In general, finding MLEs means solving a set of coupled
equations, which usually have to be solved numerically
for complex models.
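For the Gaussian the two equations can be solved by hand. The sketch below assumes N independent datapoints and works with log L rather than L, which has its maximum at the same parameter values.

```latex
\begin{align*}
\log L &= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(X_i-\mu)^2 \\
\frac{\partial \log L}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{N}(X_i-\mu) = 0 \\
\frac{\partial \log L}{\partial \sigma} &= -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(X_i-\mu)^2 = 0
\end{align*}
```

The first equation does not involve σ and is solved by the average of the datapoints. Substituting that μ into the second gives σ² as the average squared deviation from it.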
MLEs for the Gaussian
$\mu_{ML} = \frac{1}{N}\sum_{X} X$
$V_{ML} = \frac{1}{N}\sum_{X} (X - \mu_{ML})^2$
• The Gaussian is the symmetric continuous distribution
that has as its “centre” a parameter given by what we
consider the “average” (the expectation).
• The MLE for the variance of the Gaussian is like the
average squared error from the mean, but is actually a biased
(but still consistent!?) estimator
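A sketch of the two MLEs applied to the five datapoints from the ML estimation example, next to the usual unbiased variance estimate that divides by N - 1 instead of N. The function name is hypothetical.

```python
data = [5.2, 9.1, 8.2, 7.3, 7.8]  # the Xi from the ML estimation example


def gaussian_mle(xs):
    """Return (mu_ML, V_ML): the mean and the average squared deviation from it."""
    n = len(xs)
    mu_ml = sum(xs) / n
    v_ml = sum((x - mu_ml) ** 2 for x in xs) / n
    return mu_ml, v_ml


mu_ml, v_ml = gaussian_mle(data)
n = len(data)
# Dividing by N - 1 instead of N removes the bias of V_ML.
v_unbiased = v_ml * n / (n - 1)

print("mu_ML =", mu_ml)
print("V_ML =", v_ml, "unbiased estimate =", v_unbiased)
```

The two variance estimates differ by the factor N / (N - 1), which approaches 1 as N grows; this is why the biased estimator is still consistent.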
Other estimators
• Instead of likelihood, L = P(X|θ), we can choose
parameters to maximize the posterior probability:
P(θ|X)
• Or to minimize the sum of squared errors:
$\sum_{X} (X - \mu_{MSE})^2$
• Or to maximize a penalized likelihood: $L^* = P(X|\theta) \times e^{-\theta^2}$
• In each case, estimation involves a
mathematical optimization problem that usually
has to be solved on a computer
• How do we choose?
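To see that different objective functions can give different estimators, the sketch below estimates the mean of the five datapoints from the ML estimation example in three ways, each by a simple grid search. Treating θ as the mean μ, holding σ fixed at 1.5, and the grid itself are all assumptions made for illustration.

```python
import math

data = [5.2, 9.1, 8.2, 7.3, 7.8]  # the Xi from the ML estimation example
SIGMA = 1.5                       # held fixed (an assumption)


def sum_squared_errors(xs, mu):
    return sum((x - mu) ** 2 for x in xs)


def log_likelihood(xs, mu, sigma=SIGMA):
    n = len(xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma**2) - sum_squared_errors(xs, mu) / (2.0 * sigma**2)


def penalized_log_likelihood(xs, mu, sigma=SIGMA):
    """log L* = log L - theta^2, with theta taken to be mu."""
    return log_likelihood(xs, mu, sigma) - mu**2


# Candidate means from 0.000 to 10.000 in steps of 0.001.
grid = [0.001 * k for k in range(10001)]
mu_lsq = min(grid, key=lambda m: sum_squared_errors(data, m))
mu_ml = max(grid, key=lambda m: log_likelihood(data, m))
mu_pen = max(grid, key=lambda m: penalized_log_likelihood(data, m))

print("least squares:", mu_lsq)
print("maximum likelihood:", mu_ml)
print("penalized likelihood:", mu_pen)
```

Here least squares and maximum likelihood pick the same mean, while the penalty pulls the estimate toward zero.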