Transcript GWAS II

Genome-wide association
studies
Usman Roshan
Recap
• Single nucleotide polymorphism
• Genome-wide association studies
– Relative risk, and the odds ratio as an approximation to relative risk
– Chi-square test to determine significant SNPs
– Logistic regression model for determining the odds ratio
SNP genotype representation
The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
is represented as
CG CC GG …
SNP genotype encoding
• If a SNP is A/B (alphabetically ordered) then count the number of times we see B.
• The previous example becomes:

      A/T  C/T  G/T  …
H0:   AA   TT   GG   …  =>  0  2  0
H1:   AT   CC   GT   …  =>  1  0  1
H2:   AA   CT   GT   …  =>  0  1  1
Now we have the data in numerical format.
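A minimal Python sketch of this 0/1/2 encoding (not from the lecture; the function and variable names are illustrative):

```python
# For a SNP with alphabetically ordered alleles A/B, each genotype is encoded
# as the number of copies of the second allele B.

def encode_genotype(genotype, alleles):
    """genotype: e.g. 'AT'; alleles: e.g. ('A', 'T') in alphabetical order."""
    return sum(1 for base in genotype if base == alleles[1])

snps = [("A", "T"), ("C", "T"), ("G", "T")]
samples = {"H0": ["AA", "TT", "GG"],
           "H1": ["AT", "CC", "GT"],
           "H2": ["AA", "CT", "GT"]}

for name, genotypes in samples.items():
    encoded = [encode_genotype(g, a) for g, a in zip(genotypes, snps)]
    print(name, encoded)   # H0 [0, 2, 0], H1 [1, 0, 1], H2 [0, 1, 1]
```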
Example GWAS

            A/T   C/G   A/G  …
Case 1      AA    CC    AA
Case 2      AT    CG    AA
Case 3      AA    CG    AA
Control 1   TT    GG    GG
Control 2   TT    CC    GG
Control 3   TA    CG    GG
Encoded data

            A/T   C/G   A/G  …
Case 1      AA    CC    AA
Case 2      AT    CG    AA
Case 3      AA    CG    AA
Con 1       TT    GG    GG
Con 2       TT    CC    GG
Con 3       TA    CG    GG

=>

            A/T   C/G   A/G  …
Case 1       0     0     0
Case 2       1     1     0
Case 3       0     1     0
Con 1        2     2     2
Con 2        2     0     2
Con 3        1     1     2
Example

            AA   AC   CC
Case        80   15    5
Control     70   15   15

Odds of AA in cases = (80/100)/(20/100) = 4
Odds of AA in controls = (70/100)/(30/100) = 7/3
Odds ratio of AA = 4/(7/3) = 12/7
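A small Python sketch of the same calculation (not from the lecture), comparing genotype AA against not-AA in cases and controls:

```python
case = {"AA": 80, "AC": 15, "CC": 5}
control = {"AA": 70, "AC": 15, "CC": 15}

def odds_of_AA(counts):
    total = sum(counts.values())
    p = counts["AA"] / total          # probability of AA in this group
    return p / (1 - p)                # odds of AA

odds_case = odds_of_AA(case)          # 4.0
odds_control = odds_of_AA(control)    # 7/3
print(odds_case / odds_control)       # 12/7 ≈ 1.714
```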
Chi-square statistic
Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

where
c_i = observed frequency for the i-th outcome
e_i = expected frequency for the i-th outcome
n = total number of outcomes
The probability distribution of this statistic is given by the
chi-square distribution with n-1 degrees of freedom.
Proof can be found at
http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf
Great. But how do we use this to get a SNP p-value?
Null hypothesis for case control contingency table
• We have two random variables:
– D: disease status
– G: allele type
• Null hypothesis: the two variables are independent of each other (unrelated)
• Under independence
– P(D,G) = P(D)P(G)
– P(D=case) = (c1+c2+c3)/n
– P(G=AA) = (c1+c4)/n
• Expected values
– E(X1) = P(D=case) P(G=AA) n, the expected count for the (case, AA) cell; the other cells are analogous.
• We can calculate the chi-square statistic for a given SNP and the probability that it is independent of disease status (using the p-value).
• SNPs with very small probabilities deviate significantly from the independence assumption and are therefore considered important.

            AA   AC   CC
Case        c1   c2   c3
Control     c4   c5   c6

n = c1+c2+c3+c4+c5+c6
Chi-square statistic exercise
• Compute the expected values and the chi-square statistic
• Compute the p-value by referring to the chi-square distribution

            AA   AC   CC
Case        80   15    5
Control     60   15   25
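A minimal sketch of this exercise in Python, assuming SciPy is available (scipy.stats.chi2_contingency computes the expected counts, the chi-square statistic, and the p-value for a contingency table):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are case/control, columns are genotype counts for AA, AC, CC.
table = np.array([[80, 15, 5],
                  [60, 15, 25]])

chi2, pvalue, dof, expected = chi2_contingency(table)
print("expected counts:\n", expected)
print("chi-square:", chi2, "dof:", dof, "p-value:", pvalue)
```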
Logistic regression
• The odds ratio estimated from the contingency table directly has a skewed sampling distribution.
• A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function. In other words:

$$\log\frac{\Pr(G \mid D=\mathrm{case})}{\Pr(G \mid D=\mathrm{control})} = w^T G + w_0$$

• Why:
– The log likelihood ratio is a powerful statistic
– Modeling it as a linear function yields a simple algorithm to estimate the parameters
• G is the number of copies of the risk allele
• With some manipulation this becomes

$$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1+e^{-(w^T G + w_0)}}$$
How do we get the odds ratio from logistic regression? (I)
• Using Bayes rule we have

$$\log\frac{\Pr(G \mid D=\mathrm{case})}{\Pr(G \mid D=\mathrm{control})} = w^T G + w_0$$

By exponentiating both sides we get

$$\frac{\Pr(G \mid D=\mathrm{case})}{\Pr(G \mid D=\mathrm{control})} = e^{w^T G + w_0}$$

And by taking the ratio with G=1 and G=0 we get

$$\frac{\Pr(G=1 \mid D=\mathrm{case}) \,/\, \Pr(G=1 \mid D=\mathrm{control})}{\Pr(G=0 \mid D=\mathrm{case}) \,/\, \Pr(G=0 \mid D=\mathrm{control})} = \frac{e^{w \cdot 1 + w_0}}{e^{w \cdot 0 + w_0}} = e^{w}$$
How do we get the odds ratio from logistic regression? (II)
Continued from the previous slide: by rearranging the terms in the numerator and denominator we get

$$\frac{\Pr(G=1 \mid D=\mathrm{case}) \,/\, \Pr(G=1 \mid D=\mathrm{control})}{\Pr(G=0 \mid D=\mathrm{case}) \,/\, \Pr(G=0 \mid D=\mathrm{control})} = \frac{\Pr(G=1 \mid D=\mathrm{case}) \,/\, \Pr(G=0 \mid D=\mathrm{case})}{\Pr(G=1 \mid D=\mathrm{control}) \,/\, \Pr(G=0 \mid D=\mathrm{control})}$$

By the symmetry of the odds ratio this is

$$\frac{\Pr(G=1 \mid D=\mathrm{case}) \,/\, \Pr(G=0 \mid D=\mathrm{case})}{\Pr(G=1 \mid D=\mathrm{control}) \,/\, \Pr(G=0 \mid D=\mathrm{control})} = \frac{\Pr(D=\mathrm{case} \mid G=1) \,/\, \Pr(D=\mathrm{control} \mid G=1)}{\Pr(D=\mathrm{case} \mid G=0) \,/\, \Pr(D=\mathrm{control} \mid G=0)} = OR \ \text{(odds ratio)}$$

Since the original ratio (see previous slide) is equal to $e^w$ and is equal to the odds ratio, we conclude that the odds ratio is given by $e^w$.
How to find w and w0?
• And so $e^w$ is our odds ratio. But how do we find w and w0?
– We assume that one’s disease status D
given their genotype G is a Bernoulli
random variable.
– Using this we form the sample likelihood
– Differentiate the likelihood by w and w0
– Use gradient descent
Today
• Basic classification problem
• Maximum likelihood
• Logistic regression likelihood
• Algorithm for logistic regression likelihood
• Support vector machine
Supervised learning for two
classes
• We are given n training samples (xi, yi) for i=1..n drawn i.i.d. from a probability distribution P(x,y).
• Each xi is a d-dimensional vector (xi ∈ R^d) and yi is +1 or -1.
• Our problem is to learn a function f(x) for predicting the labels of test samples xi' ∈ R^d for i=1..n', also drawn i.i.d. from P(x,y).
Loss function
• Loss function: c(x, y, f(x))
• Maps to [0, ∞)
• Examples:

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise} \end{cases}$$

$$c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ c'(x) & \text{otherwise} \end{cases}$$

$$c(x, y, f(x)) = (y - f(x))^2$$
Test error
• We quantify the test error as the expected error on the test set (in other words the average test error). In the case of two classes:

$$R_{test}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{j=1}^{2} c(x_i', y_j, f(x_i'))\, P(y_j \mid x_i')$$

• We'd like to find f that minimizes this but we need P(y|x), which we don't have access to.
Expected risk
• Suppose we didn’t have test data (x’). Then
we average the test error over all possible
data points x
2
R[ f ]    c(x, y j , f (x))P(y j , x)
xX j 1
• We want to find f that minimizes this but we
don’t have all data points. We only have
training data.
Empirical risk
• Since we only have training data we can't calculate the expected risk (we don't even know P(x,y)).
• Solution: we approximate P(x,y) with the empirical distribution

$$p_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$$

• The delta function δ_x(y) = 1 if x = y and 0 otherwise.
Empirical risk
• We can now define the empirical risk as

$$R_{emp}[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x))\, p_{emp}(y_j, x) = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))$$

• Once the loss function is defined and training data is given we can then find f that minimizes this.
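A minimal Python sketch (not from the lecture): the empirical risk is just the average loss over the training samples.

```python
def empirical_risk(samples, f, loss):
    """samples: list of (x, y) pairs; f: predictor; loss: loss function."""
    return sum(loss(y, f(x)) for x, y in samples) / len(samples)

# Example with the 0/1 loss and a trivial classifier that always predicts +1.
samples = [(0.2, +1), (1.5, -1), (0.7, +1)]
risk = empirical_risk(samples, lambda x: +1,
                      lambda y, fx: 0 if y == fx else 1)
print(risk)  # 0.333... (one of the three samples is misclassified)
```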
Example of minimizing
empirical risk (least squares)
• Suppose we are given n data points (xi, yi) where each xi ∈ R^d and yi ∈ R. We want to determine a linear function f(x) = ax + b for predicting test points.
• Loss function: c(xi, yi, f(xi)) = (yi - f(xi))^2
• What is the empirical risk?
Empirical risk for least
squares
$$R_{emp}[f] = \frac{1}{n}\sum_{i=1}^{n} c(x_i, y_i, f(x_i)) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 = \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2$$

Now finding f has reduced to finding a and b.
Since this function is convex in a and b we know there is a global optimum which is easy to find by setting the first derivatives to 0.
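A minimal Python sketch (not from the lecture, with made-up data) of the one-dimensional case, using the closed-form solution obtained by setting the derivatives with respect to a and b to zero:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Closed-form least-squares fit for f(x) = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

risk = np.mean((y - (a * x + b)) ** 2)  # empirical risk at the optimum
print(a, b, risk)
```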
Empirical risk for logistic
regression
• Recall the logistic regression model:

$$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1+e^{-(w^T G + w_0)}}$$

• where G is the number of copies of the risk allele.
• In order to use this to predict one's risk of disease or to determine the odds ratio we need to know w and w0.
• We use maximum likelihood to find w and w0. But it doesn't yield a simple closed form solution like least squares.
Maximum likelihood
• We can classify by simply selecting the model M that has the highest P(M|D) where D = data and M = model. Thus classification can also be framed as the problem of finding M that maximizes P(M|D).
• By Bayes rule:

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)} = \frac{P(D \mid M)\, P(M)}{\sum_M P(D \mid M)\, P(M)}$$
Maximum likelihood
• Suppose we have k models to consider and each has the same probability. In other words we have a uniform prior distribution P(M) = 1/k. Then

$$P(M \mid D) = \frac{P(D \mid M)\, \frac{1}{k}}{\sum_M P(D \mid M)\, P(M)} \propto P(D \mid M)$$

• In this case we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.
Maximum likelihood
• Suppose we have n i.i.d. samples (xi, yi) drawn from M. The likelihood P(D|M) is

$$P(D \mid M) = P\bigl((x_1, y_1), \ldots, (x_n, y_n) \mid M\bigr) = \prod_{i=1}^{n} P(x_i, y_i \mid M) = \prod_{i=1}^{n} P(y_i \mid x_i, M)\, P(x_i)$$

• Consequently the log likelihood is

$$\log P(D \mid M) = \sum_{i=1}^{n} \log P(y_i \mid x_i, M) + \sum_{i=1}^{n} \log P(x_i)$$
Maximum likelihood and
empirical risk
• Maximizing the likelihood P(D|M) is the
same as maximizing log(P(D|M)) which
is the same as minimizing -log(P(D|M))
• Set the loss function to $c(x_i, y_i, f(x_i)) = -\log P(y_i \mid x_i, f)$
• Now minimizing the empirical risk is the
same as maximizing the likelihood
Maximum likelihood example
• Consider a set of coin tosses produced by a
coin with P(H)=p (P(T)=1-p)
• We want to determine the probability P(H) of
the coin that produced HTHHHTHHHTHTH.
• Solution:
– Form the log likelihood
– Differentiate w.r.t. p
– Set the derivative to 0 and solve for p
• How about the probability P(H) of the coin
that produces k heads and n-k tails?
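A brief worked sketch of the general case with k heads in n tosses, filling in the steps listed above:

$$\log L(p) = \log\bigl(p^{k}(1-p)^{n-k}\bigr) = k\log p + (n-k)\log(1-p)$$

$$\frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n}$$

For HTHHHTHHHTHTH this gives k = 9 heads out of n = 13 tosses, so $\hat{p} = 9/13$.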
Classification by likelihood
• Suppose we have two classes C1 and C2.
• Compute the likelihoods P(D|C1) and P(D|C2).
• To classify test data D', assign it to class C1 if P(D'|C1) is greater than P(D'|C2), and to C2 otherwise.
Logistic regression (in light of
disease risk prediction)
• We assume that the probability of disease given the genotype is

$$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1+e^{-(w^T G + w_0)}}$$

• where G is the numeric format genotype.
• Problem: given training data we want to estimate w and w0 by maximum likelihood.
Maximum likelihood for logistic
regression
• Assume that one’s disease status given their
genotype is a Bernoulli random variable with
probability P(D|Gi).In other words training sample i is
case with probability P(D=case|Gi) and control with
probability 1-P(D=case|Gi).
• Assume we have m cases and n-m controls.
• The likelihood is given by
m
n
i 1
i  m 1
P(data | w, w0 )   P(D  case | Gi )  (1  P(D  case | G))
Maximum likelihood for logistic
regression
• Likelihood:

$$P(\mathrm{data} \mid w, w_0) = \prod_{i=1}^{m} P(D=\mathrm{case} \mid G_i) \prod_{i=m+1}^{n} \bigl(1 - P(D=\mathrm{case} \mid G_i)\bigr)$$

• Negative log likelihood:

$$-\log P(\mathrm{data} \mid w, w_0) = -\sum_{i=1}^{m} \log P(D=\mathrm{case} \mid G_i) - \sum_{i=m+1}^{n} \log\bigl(1 - P(D=\mathrm{case} \mid G_i)\bigr)$$

$$= -\sum_{i=1}^{m} \log\frac{1}{1+e^{-(w^T G_i + w_0)}} - \sum_{i=m+1}^{n} \log\Bigl(1 - \frac{1}{1+e^{-(w^T G_i + w_0)}}\Bigr)$$

• Set the first derivatives to 0 and solve for w and w0.
• There is no closed form, therefore we have to use gradient descent.
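A minimal Python sketch (not the lecture's code; the genotype and label data are illustrative) of fitting w and w0 for a single SNP by gradient descent on this negative log likelihood:

```python
import numpy as np

G = np.array([2.0, 2.0, 1.0, 0.0, 1.0, 0.0])   # 0/1/2 encoded genotypes (illustrative)
y = np.array([1, 1, 1, 0, 0, 0])               # 1 = case, 0 = control

w, w0, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * G + w0)))    # Pr(D = case | G_i)
    # Gradient of the negative log likelihood with respect to w and w0
    grad_w = np.sum((p - y) * G)
    grad_w0 = np.sum(p - y)
    w -= lr * grad_w                           # step against the gradient
    w0 -= lr * grad_w0

print("w =", w, "odds ratio e^w =", np.exp(w))
```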
Gradient descent
• Given a convex function f(x,y) how can
we find x and y that minimizes this?
• Solution is gradient descent
– The gradient vector points in the direction
of greatest increase in the function.
– Therefore solution is to move in small
increments in the negative direction of the
gradient until we reach the optimum.
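A tiny Python sketch (not from the lecture) of gradient descent on a simple convex function, here f(x, y) = (x - 3)^2 + (y + 1)^2 with minimum at (3, -1):

```python
x, y, lr = 0.0, 0.0, 0.1
for _ in range(200):
    grad_x = 2 * (x - 3)      # partial derivative with respect to x
    grad_y = 2 * (y + 1)      # partial derivative with respect to y
    x -= lr * grad_x          # move in the negative gradient direction
    y -= lr * grad_y
print(x, y)                   # approximately (3, -1)
```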
Disease risk prediction
• How exactly do we predict risk?
– Personal genomics companies: Composite
odds ratio score
– Academia: Composite odds ratio score and
recently other classifiers as well
– Still an open problem: depends upon
classifier and the set of SNPs
Composite odds ratio score
• Recall that we can obtain the odds ratio from the logistic regression model:

$$OR = \frac{\Pr(D=\mathrm{case} \mid G=1) \,/\, \bigl(1 - \Pr(D=\mathrm{case} \mid G=1)\bigr)}{\Pr(D=\mathrm{case} \mid G=0) \,/\, \bigl(1 - \Pr(D=\mathrm{case} \mid G=0)\bigr)} = \frac{e^{w \cdot 1 + w_0}}{e^{w \cdot 0 + w_0}} = e^{w}$$

• Define λ = e^w
• Now we can predict the risk with n alleles:

$$R(G_1, G_2, \ldots, G_n) = \lambda_1^{G_1} \lambda_2^{G_2} \cdots \lambda_n^{G_n}$$
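A minimal Python sketch (not from the lecture; the per-SNP odds ratios and the genotype are illustrative) of the composite odds ratio score:

```python
lambdas = [1.3, 0.9, 2.1]      # per-SNP odds ratios e^w (illustrative values)
genotype = [2, 0, 1]           # 0/1/2 encoded risk-allele counts for one individual

risk_score = 1.0
for lam, g in zip(lambdas, genotype):
    risk_score *= lam ** g     # R = lambda_1^G1 * lambda_2^G2 * ...

print(risk_score)              # 1.3^2 * 0.9^0 * 2.1^1 = 3.549
```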
Example of risk prediction
study (type 1 diabetes)
Example of risk prediction
study (arthritis)