Transcript A 1
CASE STUDY: Genetic Linkage
Analysis via Bayesian Networks
D
H
D D
A2 A2
2
1
A2/A2
A1/A1
Phase inferred
H D
A1 A2
Recombinant
H
H
4
3
A2/A2
A1/A2
D D
A1 A2
D
H |D
A2 | A2
5
A1/A2
We speculate a locus with alleles H (Healthy) / D (affected)
If the expected number of recombinants is low (close to
zero), then the speculated locus and the marker are
tentatively physically closed.
1
The Variables Involved
Lijm = Maternal allele at locus i of person j. The values of
this variables are the possible alleles li at locus i.
Lijf = Paternal allele at locus i of person j. The values of this
variables are the possible alleles li at locus i (Same as for Lijm) .
Xij = Unordered allele pair at locus i of person j. The values
are pairs of ith-locus alleles (li,l’i). “The genotype”
Yj = person I is affected/not affected. “The phenotype”.
Sijm = a binary variable {0,1} that determines which maternal
allele is received from the mother. Similarly,
Sijf = a binary variable {0,1} that determines which paternal
allele is received from the father.
It remains to specify the joint distribution that governs
these variables. Bayesian networks turn to be a perfect
choice.
2
The Bayesian network for Linkage
Locus 1
Locus 2 (Disease)
Locus 3
Locus 4
This network depicts the qualitative relations between the variables.
We have already specified the local conditional probability tables.
3
Details regarding recombination
L11m
L12m
L11f
X11
S13m
L12f
X12
S13f
L13f
L13m
X13
L21m
X21
S23m
Y1
P( s23t
L22m
L21f
1
| s13t , )
where t {m,f}
1
L22f
X22
Y2
L23f
L23m
S23f
X23
Y3
is the recombination fraction between loci 2 & 1.
4
Details regarding the Loci
P(L11m=a) is the frequency of
allele a.
Li1m
X11 is an unordered allele pair
at locus 1 of person 1 = “the
data”.
P(x11 | l11m, l11f) = 0 or 1
depending on consistency
Li1f
Xi1
Si3m
Y1
Li3m
The phenotype variables Yj are 0 or 1 (e.g, affected or not affected) are
connected to the Xij variables (only in the disease locus). For example,
model of perfect recessive disease yields the penetrance probabilities:
P(y11 = sick | X11= (a,a)) = 1
P(y11 = sick | X11= (A,a)) = 0
P(y11 = sick | X11= (A,A)) = 0
5
SUPERLINK
Stage 1: each pedigree is translated into a Bayesian
network.
Stage 2: value elimination is performed on each
pedigree (i.e., some of the impossible values of the
variables of the network are eliminated).
Stage 3: an elimination order for the variables is
determined, according to some heuristic.
Stage 4: the likelihood of the pedigrees given the
values is calculated using variable elimination
according to the elimination order determined in stage 3.
Allele recoding and special matrix multiplication is used.
6
Comparing to the HMM model
X
S1
X2
S2
X3
S3
Xi-1
Si-1
Xi
Si
Xi+1
Si+1
X1
X2
X2
X3
X3
Xi-1
Yi-1
Xi
Xi
Xi+1
Xi+1
The compounded variable Si = (Si,1,m,…,Si,2n,f) is called the
inheritance vector. It has 22n states where n is the number of
persons that have parents in the pedigree (non-founders).
The compounded variable Xi = (Xi,1,m,…,Xi,2n,f) is the data
regarding locus i. Similarly for the disease locus we use Yi.
REMARK: The HMM approach is equivalent to the Bayesian
network approach provided we sum variables locus-after-locus
say from left to right.
7
Experiment A (V1.0)
Elimination Order: General
Person-by-Person
Locus-by-Locus (HMM)
over 100 hours
Out-of-memory
Pedigree size
Too big for
Genehunter.
•
•
•
Same topology (57 people, no loops)
Increasing number of loci (each one with 4-5 alleles)
Run time is in seconds.
8
Experiment C (V1.0)
• Same topology (5 people, no loops)
• Increasing number of loci (each one with 3-6 alleles)
• Run time is in seconds.
Software
Order type
Bus error
Out-of-memory
9
Some options for improving efficiency
n
P(data | ) P ( xi | pai )
xk
x3
x1
i 1
1.
Multiplying special probability tables
efficiently.
2.
Grouping alleles together and
removing inconsistent alleles.
3.
Optimizing the elimination order of
variables in a Bayesian network.
4.
Performing approximate calculations
of the likelihood.
10
Standard usage of linkage
There are usually 5-15 markers. 20-30% of the persons in
large pedigrees are genotyped (namely, their xij is
measured). For each genotyped person about 90% of the
loci are measured correctly. Recombination fraction
between every two loci is known from previous studies
(available genetic maps).
The user adds a locus called the “disease locus” and places
it between two markers i and i+1. The recombination
fraction ’ between the disease locus and marker i and ”
between the disease locus and marker i+1 are the unknown
parameters being estimated using the likelihood function.
This computation is done for every gap between the given
markers on the map. The MLE hints on the whereabouts of
a single gene causing the disease (if a single one exists).
11
Relation to Treewidth
The unconstrained Elimination Problem reduces to finding
treewidth if:
• the weight of each vertex is constant,
• the cost function is C ( X ) max CGi ( X (i ) )
i
• Finding the treewidth of a graph is known to be NPcomplete (Arnborg et al., 1987).
• When no edges are added, the elimination sequence is
perfect and the graph is chordal.
12
Parameter Estimation
Lecture #10
.
Acknowledgement: Some slides of this lecture are due to Nir Friedman.
Likelihood function for a die:
Multinomial sampling
Let X be a random variable with 6 values x1,…,x6 denoting the
six outcomes of a die. Suppose we observe a sequence of
independent outcomes:
Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)
What is the probability of this data ?
If we knew the long-run frequencies i for falling on side xi,
then,
P( Data | ) 4 5 1 i
i 1
5
2
1
3
2
2
2
3
Where ={1,2,3,4,5} are called the parameters of the
likelihood function. We wish to estimate these parameters
from the data we have seen.
14
Sufficient Statistics
To compute the probability of data in the die example
we only require to record the number of times Ni
falling on side i (namely,N1, N2,…,N6).
We do not need to recall the entire sequence of
outcomes
P( Data | )
N1
1
N2
2
N3
3
N4
4
N5
5
1 i 1 i
5
N6
{Ni | i=1…6} is called the sufficient statistics for the
multinomial sampling.
15
Sufficient Statistics
A
sufficient statistics is a function of the data that
summarizes the relevant information for the
likelihood
Formally, s(Data) is a sufficient statistics if for any
two datasets D and D’
s(Data) = s(Data’ ) P(Data|) = P(Data’|)
Datasets
Statistics
16
Maximum Likelihood Estimate
Maximum likelihood estimate is an assignment to the
parameters that maximizes the probability of data
(i.e., the likelihood function ).
Usually one maximizes the log-likelihood function
which is easier to do and gives an identical answer:
5
N3
N5
N1
N2
N4
log P( Data | ) log 1 2 3 4 5 1 i 1 i
i 1 Ni log i N6 log 1 i 1 i
5
A sufficient
condition for
maximum is:
5
N6
log P( Data | ) N i
0
5
i
i 1 i
i 1
N6
17
Finding the Maximum
We have just found that:
Divide the
ith
and
jth
Ni
i
1 i 1 i
5
equations: j
1
6
Sum from j=1 to 6:
N6
Hence the MLE is given by:
j 1
Ni
Nj
Nj
Ni
i
i
Ni
i
N
i 1,..,6
18
Adding Pseudo Counts
Ni
The MLE given by i
N
i 1,..,6,
can be misleading for small data sets because it could happen that a
small data set is not typical. For example, it might be that we know
that the dice is manufactured to be loaded but the small dataset we
examined does not show this property.
The MAP estimate is given by
i
N i N 'i
N N'
i 1,..,6
The six pseudo counts N’i sum to N’. They express one’s assessment
regarding the frequencies for each side prior to seeing the data.
Large N’ indicates high confidence. Smaller than 1 values are possible.
The MAP estimate can be justified as maximizing one’s posterior
(namely, after seeing the data) best estimate of the frequencies for
each side. The theory formally justifying this formula is called
Bayesian Statistics (not covered in this course due to time
constraints).
19
Example: The ABO locus
Recall that a locus is a particular place on the chromosome. Each locus’
state (called genotype) consists of two alleles – one parental and one
maternal. Some loci (plural of locus) determine distinguished features.
The ABO locus, for example, determines blood type.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}.
The first two genotypes determine blood type A, the next two determine
blood type B, then blood type AB, and finally blood type O.
We wish to estimate the proportion in a population of the 6 genotypes.
Suppose we randomly sampled N individuals and found that Na/a have
genotype a/a, Na/b have genotype a/b, etc. Then, the MLE is given by:
a/a
Na / a
N
N
N
N
N
, a / o a / o , b / b b / b , b / o b / o , a / b a / b , o / o o / o
N
N
N
N
N
N
20
The ABO locus (Cont.)
However, testing individuals for their genotype is a very
expensive test. Can we estimate the proportions of genotype
using the common cheap blood test with outcome being one
of the four blood types (A, B, AB, O) ?
The problem is that among individuals measured to have
blood type A, we don’t know how many have genotype a/a and
how many have genotype a/o. So what can we do ?
We use the Hardy-Weinberg equilibrium rule that tells us
that in equilibrium the frequencies of the three alleles
a,b,o in the population determine the frequencies of the
genotypes as follows: a/b= 2a b, a/o= 2a o, b/o= 2b o,
a/a= [a]2, b/b= [b]2, o/o= [o]2. So now we have three
parameters that we need to estimate.
21
The Likelihood Function
Let X be a random variable with 6 values xa/a, xa/o ,xb/b, xb/o,
xa/b , xo/o denoting the six genotypes. The parameters are
= {a ,b, o}.
The probability P(X= xa/b | ) = 2a b.
The probability P(X= xo/o | ) = o o.
And so on for the other four genotypes.
What is the probability of Data={B,A,B,B,O,A,B,A,O,B, AB} ?
P( Data | ) 2 a o 2 b o
2
a
3
2
b
2
5
1
a b
2
o o
Obtaining the maximum of this function yields the MLE.
This can be done by multidimensional Newton’s algorithm.
22
Computing MLE
Finding MLE parameters: nonlinear optimization problem
P(Data| )
Gradient Ascent (Newton like methods):
Follow gradient of likelihood w.r.t. to parameters (As taught in your favorite
Numerical Analysis course). Improve, by adding line search methods to
determine step size and get faster convergence. Start at several random
locations.
23
Gene Counting
Had we known the counts na/a and na/o (blood type A
individuals), we could have estimated a from n individuals as
follows (and similarly estimate b and o):
2na / a na / o na / b
a
2n
Can we compute what na/a and na/o are expected to be ?
Using the current estimates of a and o we can as follows:
na / a na
2
a
2 a o
2
a
na / o
2 a o
na 2
a 2 a o
We repeat these two steps until the parameters converge.
24
Gene Counting (example of EM)
Input: Counts of each blood type nA, nB, nO, nAB of n people.
Desired Output: ML estimate of allele frequencies a ,b , o.
Initialization: Set a ,b ,and o to arbitrary values (say, 1/3).
Repeat
a2
2 a o
n
n
n
n
E-step (Expectation): a / a
A
a/o
A
2
2
a 2 a o
nb / b
b2
nB 2
b 2 b o
a 2 a o
nb / o nB
2 b o
b2 2 b o
M-step (Maximization):
a
2na / a na / o n AB
2n
b
Until a ,b ,and o converge
2nb / b nb / o n AB
2n
o
2nO na / o nb / o
2n
25