DATA ANALYSIS - DCU School of Computing

Download Report

Transcript DATA ANALYSIS - DCU School of Computing

DATA ANALYSIS
Module Code: CA660
Lecture Block 7
Examples in Genomics and Trait
Models
• Genetic traits may be controlled by No.genes-usually unknown
Taking “genetic effect” as one genotypic term, a simple model
yij    Gi   ij
2
2
2
y
~
NID
(

,

),
G
~
NID
(
0
,

),

~
NID
(
0
,

for
p
g
e)
where yij is the trait value for genotype i in replication j,  is
the mean, Gi the genetic effect for genotype i and ij the errors.
• If assume Normality (and want Random effects) + assume
zero covariance between genetic effects and error
 p2   g2   e2
Note: If same genotype replicated b times in an experiment, with
phenotypic means used, error variance averaged over b.
2
Example - Trait Models contd.
• What about Environment and GE interactions? Extension to
Genotypic effects
Simple Model.
measured within
yijk    Gi  E j  (GE)ij  ijk
blocks
ANOVA Table: Randomized Blocks within environment and within
sets/blocks in environment = b = replications. Focus - on genotype effect
Source
Environment
Blocks
Genotypes
GE
Error
dof
e-1
(b-1)e
g-1
(g-1)(e-1)
(b-1)(g-1)e
Expected MSQ
know there are differences
again – know there are differences
 e2  b ge2  be g2
 e2  b ge2
 e2
Note: individuals blocked within each of multiple environments, so environmental
effect intrinsic to error. Model form is standard, but only meaningful
comparisons are within environment, hence form of random error = population
2
variance =  e ; so random effects of interest from additional variances & ratios
3
Example contd.
• HERITABILITY = Ratio genotypic to phenotypic variance
 g2
 g2
H 2  2
 p  g   e2
• Depending on relationship among genotypes,
interpretation of genotypic variance differs. May contain
additive, dominance, other interactions, variances
 a2 ,  d2 ,  l2 (Above = heritability in broad terms).
• For some experimental or mating schemes, an additive
genetic variance may be calculated. Narrow/specific
sense heritability then
 a2
 a2
H
 p2

 a2   d2   l2   e2
• Again, if phenotypic means used, obtain a mean-based
heritability for b replications.
 g2
 g2
H 2  2
 p  g   e2 / b
4
Extended Example- Two related traits
Have
yij  1  G1i  1 j
y2 j   2  G2i   2 j
where 1 and 2 denote traits, i the gene and j an individual in
population. Then ‘y’ is the trait value,  overall mean, G genetic
effect,  = random error.
To quantify relationship between the two traits, the variancecovariance matrices for phenotypic,  p genetic  g and
environmental effects e
 g21  g12   e21  e12 
  p21  p12 
p  

   g  e  

2
2
2
 g 21  g 2   e 21  e 2 
 p 21  p 2 
So correlations between traits in terms of phenotypic, genetic and
 p12
 g12
environmental effects:
 e12
p 
 
2
p1
2
p2
;
g 
 
2
g1
2
g2
;
e 
 e21 e22
5
MAXIMUM LIKELIHOOD ESTIMATION
• Recall general points: Estimation, definition of Likelihood
function for a vector of parameters  and set of values x.
Find most likely value of  = maximise the Likelihood fn.
Also defined Log-likelihood (Support fn. S() ) and its
derivative, the Score, together with Information content
per observation, which for single parameter likelihood is
given by


2

2




 
 


I ( )  E   log L( x)    E 
log L( x)
 
  2

  
 


• Why MLE? (Need to know underlying distribution).
Properties: Consistency; sufficiency; asymptotic efficiency
(linked to variance); unique maximum; invariance and,
hence most convenient parameterisation; usually MVUE;
amenable to conventional optimisation methods.
6
VARIANCE, BIAS & CONFIDENCE

k
1
2
ˆ
i  
 k

k

ˆ
i 

2
• Variance of an Estimator - usual form or ˆ 
i 1
i 1
for k independent estimates
• For a large sample, variance of MLE can be approximated by
ˆ2 
1
nI ( )
can also estimate empirically, using re-sampling* techniques.
• Variance of a linear function (of several estimates) –
(common need in genomics analysis), e.g. heritability.
• Recall Bias of the Estimator E (ˆ)  
then the Mean Square Error is defined to be: MSE  E(ˆ  )2
expands to E{[ˆ  E (ˆ)]  [ E (ˆ)   ]}2   2ˆ  [ E (ˆ)   ]2

so we have the basis for C.I. and tests of hypothesis.
7
2
COMMONLY-USED METHODS of
obtaining MLE
• Analytical - solving dL  0 or dS d  0 when simple
d
solutions exist
• Grid search or likelihood profile approach
• Newton-Raphson iteration methods
• EM (expectation and maximisation) algorithm
N.B. Log.-likelihood, because max. same  value as Likelihood
Easier to compute
Close relationship between statistical properties of MLE
and Log-likelihood
8
METHODS in brief
Analytical : - recall Binomial example earlier
dS ( ) x n  x
Score 
 
0
d


x
ˆ 
n
• Example : For Normal, MLE’s of mean and variance,
(taking derivatives w.r.t mean and variance separately),
and equivalent to sample mean and actual variance (i.e.
/N), -unbiased if mean known, biased if not.
• Invariance : One-to-one relationships preserved
• Used: when MLE has a simple solution
9
Methods for MLE’s contd.
Grid Search – Computational
Plot likelihood or log-likelihood vs parameter. Various features
• Relative Likelihood =Likelihood/Max. Likelihood (ML set =1).
Peak of R.L. can be visually identified /sought algorithmically.
e.g.
S ( )  Log[ 20 (1  )80   80 (1  )20 ]
Plot likelihood and parameter space range 0    1 gives 2 peaks, symmetrical around ˆ  0.2  likelihood
profile for the well-known mixed linkage phase problem in
linkage analysis.
If e.g. constrain 0    0.5
MLE ˆ  0.5 = R.F. between
genes (possible mixed linkage phase).
10
contd.
• Graphic/numerical Implementation - initial estimate of ,
direction of search determined by evaluating likelihood at both
sides of .
Search takes direction giving increase. Initial search
increments large, e.g. 0.1, then when likelihood change starts
to decrease or become negative, stop and refine increment.
• Multiple peaks – can miss global maximum, computationally
intensive
• Multiple Parameters - grid search. Interpretation of Likelihood
profiles can be difficult.
11
Example
• Recall Exs 2, Q. 8.
Data used to show a linkage relationship between marker and a “rustresistant”gene.
Escapes = individuals who are susceptible, but show no disease
(rust) phenotype under experimental conditions. So define  ,  as
proportion escapes and R.F. respectively.
1   is penetrance for disease trait, i.e.
P{ that individual with susceptible genotype has disease phenotype}.
Purpose of expt.-typically to estimate R.F. between marker and gene.
• Use: Support function = Log-Likelihood
S ( ,  )  168log(1   )  3 log( )  52log(   )  163log(1     )
12
Example contd.
,  Expected value of Score
• Setting 1st derivatives (Scores) w.r.t =0.
(w.r.t.  is zero, (see analogies in classical sampling/hypothesis
testing). Similarly for . Here, however, No simple analytical solution,
so can not solve directly for either.
• Using grid search, likelihood reaches maximum at ˆ  0.02, ˆ  0.22
• In general, this type of experiment tests H0: Independence between
marker and gene (  0.5) and H0: no escapes (   0)
Uses Likelihood Ratio Test statistics. (MLE 2 equivalent)
• N.B: Moment estimates solve slightly different problem, because no
info. on expected frequencies, - (not same as MLE)
13
MLE Estimation Methods contd.
Newton-Raphson Iteration
Have Score () = 0 from previously. N-R consists of replacing Score by
linear terms of its Taylor expansion, so if ´´ a solution,  ´=1st guess
dS( ) dS( )
d 2 [ S ( )]

 (    )
0
2
d
d
d
d [ S ( )] d
      2
d S ( ) d 2
Repeat with  ´´ replacing ´
Each iteration - fits a parabola to
Likelihood Fn.
L.F.
2nd
 1st
• Problems - Multiple peaks, zero Information, extreme estimates
• Multiple parameters – need matrix notation, where S matrix e.g. has
elements = derivatives of S(, ) w.r.t.  and  respectively. Similarly,
Information matrix has terms of form
 2

 2

 E  2 S ( ,  )  E 
S ( ,  )etc.
 

 

      1
 Estimates are       I 1 ( )S ( )
Variance of
14
      N
Log-L i.e.S()
Methods contd.
Expectation-Maximisation Algorithm - Iterative. Incomplete data
(Much genomic data fits this situation e.g. linkage analysis with marker
genotypes of F2 progeny. Usually 9 categories observed for 2-locus,
2-allele model, but 16 = complete info., while 14 give info. on linkage.
Some hidden, but if linkage parameter known, expected frequencies
can be predicted – as you know - and the complete data restored
using expectation).
• Steps: (1) Expectation estimates statistics of complete data, given
observed incomplete data.
• -(2) Maximisation uses estimated complete data to give MLE.
• Iterate till converges (no further change)
15
E-M contd.
Implementation
• Initial guess, ´, chosen (e.g. =0.25 say = R.F.).
• Taking this as “true”, complete data is estimated, by distributional
statements e.g. P(individual is recombinant, given observed
genotype) for R.F. estimation.
• MLE estimate ´´ computed.
• This, for R.F.  sum of recombinants/N.
• Thus MLE, for fi observed count,
  
1
N
 f P ( R G)
i i
Convergence ´´ = ´ or      tolerance(0.00001)
16
LIKELIHOOD : C.I. and H.T.
• Likelihood Ratio Test – c.f. with 2.
• Principal Advantage of G is Power, as unknown parameters
involved in hypothesis test.
Have : Likelihood of  taking a value A which maximises
it, i.e. its MLE and likelihood  under H0 : N , (e.g. N = 0.5)
• Form of L.R. Test Statistic
 L( N x) 
 L( A x) 
G  2 Log 
G  2 Log 


L
(

x
)
or,
conventionally
L
(

x
)


A



N
- choose; easier to interpret.
• Distribution of G ~ approx. 2 (d.o.f. = difference in dimension of
parameter spaces for L(A), L(N) )
n
O
Oi Log i
• Goodness of Fit : notation as for 2 , G ~ 2n-1 : G  2
Ei
i 1


r
c
• Independence: G  2
i 1
j 1
Oij Log
Oij
Eij
notation again as for 2
17
Power-Example extended
• Under H0 : E{G}  2n{0.5Log0.5  0.5Log0.5  Log0.5}  0
• At level of significance =0.05, suppose true  =  1 = 0.2, so if
n=25
E{G}  50{0.2Log0.2  0.8Log0.8  Log0.5}  9.6
(e.g. in genomics might apply where R.F. =0.2 between two genes (as
opposed to 0.5). Natural logs. used, though either possible in practice.
Hence, generic form “Log” rather than Ln here. Assume Ln throughout
for genetic/genomic examples unless otherwise indicated)
2
• Rejection region at 0.05 level is 1  3.84
• If sketch curves, P{LRTS falls in the acceptance region} = 0.13,
= Prob.of a false negative when actual value of  = 0.2
• If sample size increased, e.g. n=50, E{G} = 19 and easy to show
that P{False negative} = 0.01
• Generally: Power for these tests given by P{ df2 ,nE{G }  2 }
u n it
18
Likelihood C. I.’s - method
• Example: Consider the following Likelihood function L( )  (1   ) 
 is the unknown parameter ; a, b observed counts
• For 4 data sets observed,
A: (a,b) = (8,2), B: (a,b)=(16,4) C: (a,b)=(80, 20) D: (a,b) = (400, 100)
a
• Likelihood estimates can be plotted vs possible parameter values,
with MLE = peak value.
e.g. MLE = 0.2, Lmax=0.0067 for A, and Lmax=0.0045 for B etc.
Set A: Log Lmax- Log L=Log(0.0067) - Log(0.00091)= 2 gives  95% C.I.
so  =(0.035,0.496) corresponding to L=0.00091,  95% C.I. for A.
Similarly, manipulating this expression, Likelihood value corresponding
to  95% confidence interval given as L = 7.389Lmax
Note: Usually plot Log-likelihood vs parameter, rather than Likelihood.
As sample size increases, C.I. narrower and  symmetric
19
b
Multiple Populations: Extensions to G Example
• Recall Mendel’s data - earlier and Extensions to 2 for same
In brief
Round Wrinkled
Plant
1
2
3
O
45
4
5
6
7
8
9
10
Total
336
Pooled 336
Heterogeneity
E
42.75
327.75
O
12
E
14.25
101
101 109.25
G
dof
0.49
1
0.09
1
0.10
1
1.30
0.01
0.71
0.79
0.63
1.06
0.17
5.34
0.85
4.50
1
1
1
1
1
1
1
10
1
9
p-value
0.49
0.77
0.75
0.26
0.93
0.40
0.38
0.43
0.30
0.68
0.36
0.88
20
Multiple Populations - summary
• Parallels

2

 Oij 
 2 Oi log 
 E 
i 1 j 1 
 ij 

p
GTotal
• Partitions therefore
n







 
n
GPooled  2
j 1
p
i 1



Oij log






p
i 1
p
i 1
 
Oij  
 
 
 
Eij  
 
 
and Gheterogeneity = Gtotal - GPooled (n=no. classes, p = no.populations)
Example: Recall Backcross (AaBb x aabb)- Goodness of fit (2- locus model).
For each of 4 crosses, a Total GoF statistic can be calculated according to
expected segregation ratio 1:1:1:1 – (assumes no segregation distortion
for both loci and no linkage between loci).
For each locus GoF calculated using marginal counts, assuming each
genotype segregates 1:1.
Difference between Total and 2 individual locus GoF statistics is L-LRTS (or
chi-squared statistic) contributed by association/linkage between 2 loci. 21
Example: Marker Screening
Screening for Polymorphism - (different detectable alleles) – look
at stages involved.
Genomic map –based on genome variation at locations (from
molecular assay or traditional trait observations).
(1) Screening polymorphic genetic markers is Exptal step 1
- usually assay a large number of possible genetic markers in
small progeny set = random sample of mapping population.
If a marker does not show polymorphism for set of progeny, then
marker non-informative ; will not be used for data analysis).
22
Example contd.
(2) Progeny size for screening – based on power, convenience etc.,
e.g. False positive = monomorphic marker determined to be
polymorphic. Rare since m-m cannot produce segregating
genotypes if these determined accurately.
False negatives high particularly for small sample. e.g. for markers
segregating 1:1 – (i)Backcross, recombinant inbred lines, doubled
haploid lines, or (ii)F2 with codominant markers,
So, e.g.
(i) P{sampling all individuals with same genotype) = 2(0.5)n
(ii) P{false negative for single marker, n=5} = 2(0.25)5+0.55=0.0332
Hence Power curves as before.
23
Example contd. S.R 1:1 vs 3:1- use LRTS
• Detection of departure from S.R. of 1:1

 O 
 O 
G  2O1 log 1   O2 log 2  
 0.5n 
 0.5n  

 2[O1LogO1  O2 LogO2  nLog(0.5n)]
n = sample size, O1, O2 observed counts of 2 genotypic classes.
• For true S.R. 3:1, O1 genotypic frequency of dominant genotype,
T.S. parametric value is approx.

 E (O1 ) 
 E (O2 )  
GE  2E (O1 ) Log

E
(
O
)
Log

2



0
.
5
n
0
.
5
n






 0.75n 
 0.25n  
 20.75nLog

0
.
25
nLog




0
.
5
n
0
.
5
n





 0.2616 n
24
Example contd.
• To reject a S.R. of 1:1 at 0.05 significance level, a LogLRTS of
at least 3.84 (critical value for rejection) is required.
2
• Statistical Power P{GE ,1  3.84}
• For n=15 then, power is
P{32.924,1  3.84}
• For a power of 90%, n  40 needed
• If problem expressed other way. i.e. calculating Expected LRTS
(for rejecting a 3:1 S.R. when true value is S.R. 1:1), this is
0.2877n and n  35 needed.
25
Maximum Likelihood Benefits
• Good Confidence Intervals
Coverage probability realised and interval biologically
meaningful
• MLE Good estimator of a CI
2
MSE consistent Limn E(ˆ  )  0
Absence of Bias E (ˆ)  
- does not “stand-alone” – minimum variance important
Asymptotically Normal
Precise – large sample
Biological inference valid
Biological range realistic
ˆ  
~ N (0, 1) as n  
 ˆ
26