TreeDT: Gene Mapping by Tree Disequilibrium Test

Authors:
Petteri Sevon
Dept. of Computer Science & Finnish Genome Center, Univ. of Helsinki
Hannu T.T. Toivonen
Nokia Research Center & Univ. of Helsinki
Vesa Ollikainen
Finnish Genome Center, Univ. of Helsinki
Advisor: Dr. Hsu
Graduate student: Cheng-Wen Hong
Outline
• 1. Motivation
• 2. Objective
• 3. Introduction
• 4. Problem Background
• 5. Method
• 6. Algorithms
• 7. Related Work
• 8. Experiment
• 9. Conclusions
• 10. Personal Opinion
Motivation
• The USA and England will finish mapping the human
genome in 2003. In the long term, geneticists will study human gene
sequence variation, the inheritance of complex traits, and the
discovery of new disease susceptibility genes. This work is of
immense importance for human health.
Objective
• We present a novel gene mapping method, TreeDT, that is
effective at locating a disease susceptibility gene for a given disease.
• The gene and its proteins can then be
analyzed to understand the disease-causing
mechanisms and to design new
medicines.
Introduction
• (1). Gene mapping aims at discovering a statistical
connection between a given disease and a narrow region in
the genome (chromosomes).
• (2). Genetic markers along chromosomes provide data
that can be used to discover associations between
patient phenotypes (diseased vs. healthy) and
chromosomal regions (i.e., potential disease gene loci).
• (3). We introduce TreeDT, a novel method for gene
mapping. It analyzes the observed strings of markers using
tree patterns that reflect the possible genetic history of a
disease susceptibility (DS) gene, and it locates DS gene
loci effectively.
• (4). The contributions of TreeDT are:
• (a). A novel approach to gene mapping using tree
patterns.
• (b). An efficient algorithm for generating and testing
tree patterns.
• (c). A method for estimating the statistical significance
of findings.
Problem Background
• (1). Marker Data: A genetic marker is a short polymorphic region in
the DNA, denoted here by M1, M2, … The different variants of DNA
that different people have at a marker are alleles, denoted in our
examples by 1, 2, 3, … The collection of markers is a marker map,
and the alleles of a chromosome at those markers constitute its haplotype (Figure 1).
• The input data consists of haplotypes of diseased and control
persons.
Problem Background
• (2). Linkage disequilibrium
All current carriers of a DS
gene have inherited it from a
founder who introduced the gene
mutation into the population (Figure 2).
Markers close to the mutation locus tend to be inherited
together with it, so carriers share parts of the founder's
haplotype around the locus. This is
linkage disequilibrium (LD), a non-random association between
nearby markers.
• (3). Gene Mapping
Using linkage analysis to
determine the relative position between two genes on a chromosome.
Problem Background
• (4). Summary of Background and Problem
• Closely located markers can be very informative: given an
ancestor with a mutated gene, the descendants that
inherit the gene are also likely to inherit alleles of nearby
markers.
• The LD-based gene mapping problem can now be stated as follows.
• The input consists of a marker map, a set of disease-associated
haplotypes, and a set of control haplotypes
on the given map. The task is to predict the location of a
disease susceptibility gene on the map.
Method
• Based on the observed haplotypes, TreeDT evaluates
the most likely coalescence tree at a number of locations
along the analyzed chromosome, and then assesses the
subtree clustering of disease-associated haplotypes in
these trees (using the tree disequilibrium test, intended for
predicting the DS gene location).
Method
• (1). Haplotype Prefix Trees: Given a location (potential gene locus) in the
chromosome, the haplotypes to the right (or to the left) of the location can be
organized into a prefix tree (Figures 3 and 4).
• TreeDT builds two prefix trees, one to the left and one to the right,
between each pair of consecutive markers and tests their disequilibrium.
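The prefix-tree construction can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the `Node` class, the `build_prefix_tree` function, and the toy haplotypes are hypothetical, and each haplotype is assumed to be a list of allele numbers with a separate disease-association flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a haplotype prefix tree (hypothetical structure)."""
    children: dict = field(default_factory=dict)  # allele -> child Node
    n: int = 0  # haplotypes passing through this node
    a: int = 0  # disease-associated haplotypes among them


def build_prefix_tree(haplotypes, statuses, location, side="right"):
    """Build the prefix tree for one side of a candidate gene location.

    `location` is the index of the gap between markers location-1 and
    location. The right tree reads markers location, location+1, ...;
    the left tree reads markers location-1, location-2, ...
    """
    root = Node()
    for hap, diseased in zip(haplotypes, statuses):
        if side == "right":
            alleles = hap[location:]
        else:
            alleles = hap[:location][::-1]
        node = root
        node.n += 1
        node.a += int(diseased)
        for allele in alleles:
            node = node.children.setdefault(allele, Node())
            node.n += 1
            node.a += int(diseased)
    return root


# Toy data (invented for illustration): four haplotypes over markers M1..M4.
haplotypes = [
    [1, 2, 1, 3],
    [1, 2, 1, 1],
    [2, 2, 2, 3],
    [1, 1, 2, 3],
]
statuses = [True, True, False, False]  # True = disease-associated

# Trees on both sides of the gap between M1 and M2.
right = build_prefix_tree(haplotypes, statuses, location=1, side="right")
left = build_prefix_tree(haplotypes, statuses, location=1, side="left")
```

Each node stores the counts `n` and `a` of the haplotypes below it, which are the quantities a subtree-level test needs.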
Method
• (2). Tree Disequilibrium Test (for a haplotype prefix tree T)
• H0: The disease-association statuses are randomly distributed in the leaves
of T.
• H1: The distribution of the disease-association statuses deviates in some
subtrees of T from the overall distribution of statuses.
• For measuring the disequilibrium: the test statistic Zk for a tree with a set S of k
deviant subtrees T1, …, Tk, where ai is the number of disease-associated
haplotypes and ni the total number of haplotypes in subtree Ti ∈ S, and p is
the proportion of disease-associated haplotypes in the sample:
$$Z_k = \sum_{i=1}^{k} \frac{a_i - n_i p}{\sqrt{n_i p (1 - p)}}$$
Method
• (3). Significance Test
• (a) Zk is a measure of the disequilibrium of a given tree, at a certain
location in the chromosome, with k given deviant subtrees.
• (b) TreeDT finds for each k the set S of subtrees that maximizes Zk.
(Zk can be efficiently maximized simultaneously for all k using a
recursive algorithm.)
• (c) Since Zk values for different degrees of freedom k are not comparable
and the distribution of the maximized Zk is very complex, TreeDT
estimates a p value for each maximized Zk (under H0). The p values
are estimated by a permutation test.
• (d) To get a single p value for the disequilibrium at a given
location, a combined measure is used: the product of the lowest p values
over all k from each side.
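The permutation test in (c) and the combined measure in (d) can be sketched as follows. This is a generic illustration under stated assumptions, not TreeDT's actual procedure: `permutation_p_value` and `combined_measure` are hypothetical names, the statistic is passed in as an arbitrary function of the status labels, and the toy statistic merely stands in for a maximized Zk.

```python
import random


def permutation_p_value(statuses, statistic, n_permutations=1000, seed=0):
    """Estimate P(statistic >= observed) under random relabeling (H0)."""
    rng = random.Random(seed)
    observed = statistic(statuses)
    shuffled = list(statuses)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # randomly reassign statuses to the leaves
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / n_permutations


def combined_measure(left_p_values, right_p_values):
    """Product of the lowest p value over all k from each side."""
    return min(left_p_values) * min(right_p_values)


# Toy statistic (assumption): the number of disease-associated haplotypes
# among the first four leaves.
def toy_statistic(labels):
    return sum(labels[:4])


statuses = [1, 1, 1, 1, 0, 0, 0, 0]
p_estimate = permutation_p_value(statuses, toy_statistic)

# Invented p values for k = 1, 2 on each side of one location.
combined = combined_measure([0.2, 0.05], [0.5, 0.1])
```

Because the combined measure is a product of minima, it is not itself a p value; in the scheme described above it is assessed by a further level of permutation testing.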
Method
• (e) The output of TreeDT is essentially a list of locations ranked by
p value. A point prediction for the DS gene
location is obtained by taking the best location; a
(potentially fragmented) region of length L is obtained by
taking the best locations until a length of L is covered.
• (f) All three nested p value tests (for each tree and
k, for each location, for the best location) can be carried out
efficiently.
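The ranking step in (e) can be illustrated with a short sketch. It assumes that every candidate location is an inter-marker interval of equal length; the function `predict_region` and the p values are invented for illustration.

```python
def predict_region(p_values, interval_length, target_length):
    """Pick the best-ranked locations until `target_length` is covered.

    `p_values[i]` is the p value of the i-th inter-marker interval, and
    every interval is assumed to have the same length `interval_length`.
    Returns the chosen interval indices in rank order.
    """
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    chosen = []
    covered = 0.0
    for i in ranked:
        if covered >= target_length:
            break
        chosen.append(i)
        covered += interval_length
    return chosen


p_values = [0.40, 0.03, 0.01, 0.20, 0.02]  # invented values
point_prediction = predict_region(p_values, 1.0, 1.0)
region = predict_region(p_values, 1.0, 3.0)
```

Here the point prediction is interval 2, and the region of length 3 consists of intervals 2, 4, and 1. Interval 3 is skipped, which shows how a predicted region can be fragmented.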
Algorithms
• (1). Constructing Haplotype Prefix Trees
• The haplotype prefix trees to the left and right of each analyzed
location can be efficiently identified using a string-sorting algorithm.
• (2). An Algorithm for Maximizing the Tree Disequilibrium Statistic Zk
• It is essential that the time complexity of the algorithm for
maximizing Zk is as low as possible, because it must be executed
for each tree, location, and permutation in turn.
• (3). INPUT: A haplotype prefix tree T.
• OUTPUT: Maximum values of Zk in the tree T for each k.
• The time complexity of the algorithm is O(n*n), where n is the
number of leaves (haplotypes) in the tree.
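One way to maximize Zk for all k at once is a bottom-up recursion over the tree. The sketch below is an assumption about how such a recursion could look, not the algorithm from the paper: it treats deviant subtrees as pairwise disjoint, allows any node (including the root) to be chosen, and caps k at a hypothetical `max_k`. The tuple layout `(a, n, children)` and the toy tree are invented for illustration.

```python
import math

NEG_INF = float("-inf")


def score(a, n, p):
    """Standardized deviation of one subtree with a of n haplotypes diseased."""
    return (a - n * p) / math.sqrt(n * p * (1 - p))


def max_z(node, p, max_k):
    """Return best[k] for k = 0..max_k: the largest sum of scores over
    k pairwise disjoint subtrees inside `node`, or -inf if impossible.

    `node` is a tuple (a, n, children), with children a list of such tuples.
    """
    a, n, children = node
    # Combine the children with a max-plus convolution over k.
    best = [0.0] + [NEG_INF] * max_k
    for child in children:
        child_best = max_z(child, p, max_k)
        merged = [NEG_INF] * (max_k + 1)
        for i, left in enumerate(best):
            if left == NEG_INF:
                continue
            for j, right in enumerate(child_best[: max_k + 1 - i]):
                if right != NEG_INF:
                    merged[i + j] = max(merged[i + j], left + right)
        best = merged
    # Alternatively, take the whole subtree rooted here as the single choice.
    if max_k >= 1:
        best[1] = max(best[1], score(a, n, p))
    return best


# Toy tree: 8 haplotypes, 4 disease-associated, all in the first branch.
diseased_branch = (4, 4, [(2, 2, []), (2, 2, [])])
control_branch = (0, 4, [(0, 2, []), (0, 2, [])])
tree = (4, 8, [diseased_branch, control_branch])

result = max_z(tree, p=0.5, max_k=2)
```

For the toy tree, the best single subtree is the whole diseased branch (score 2), while the best pair is its two children (score 2·√2 ≈ 2.83).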
Algorithms
• (4). Multiple Nested Permutation Tests
• The straightforward algorithm for a three-level nested permutation
test using nested loops would have time complexity proportional to
n*n*n, where n is the number of permutations at each level.
Related Work
• (1). Several statistical methods exist to detect LD around a DS gene, but
these methods are computationally heavy.
• (2). Haplotype Pattern Mining (HPM) is based on analyzing the LD of
sets of haplotype patterns.
• (3). Transmission/Disequilibrium Tests (TDT) are an established way
of testing association and linkage in a sample where linkage
disequilibrium exists between the mutation locus and nearby marker
loci.
• (4). m-TDT is a multipoint variant that detects LD using haplotypes of several
alleles.
Experiments
• We compare TreeDT empirically to TDT, m-TDT, and HPM.
• We evaluate the methods on simulated data (simulated to
resemble a realistic population isolate).
• We use 100 data sets. Each data set consists of 200 disease-associated
and 200 control chromosomes. The length of the analyzed region
was 100 cM, with a map of 101 equidistantly spaced markers, each
having 5 alleles.
Analysis of TreeDT
• (1). First we assess the prediction accuracy (power) of TreeDT with
different values of A, the proportion of disease-associated chromosomes that
actually carry the mutation. For A=20% or 15% the accuracy is very
good. With lower values of A the accuracy decreases, until with
A=5% (challenging) the gene can be localized within a reasonable
accuracy of 10-20 cM in only 20-30% of data sets.
Analysis of TreeDT
• (2). We evaluate the effect of the only parameter of TreeDT, the
number of deviant subtrees (founders) that are searched for in each
tree (Figure 5B).
• As we increase the number of founders (deviant subtrees), evidence about
the gene location becomes more fragmented, but the upper limit of 6
subtrees gives consistently competitive results.
Analysis of TreeDT
• Figure 5C shows the experimental relationship between power (ratio
true positives / all positives) and overall p (ratio false positives / all
negatives). For higher values of A the classification accuracy is
extremely good, but with A=5% (challenging) the classification is no better
than random guessing.
Comparison to other methods
• (1). TreeDT, HPM, and m-TDT have practically identical performance
in localizing the DS gene in the baseline setting (Figure 6A). TDT
is clearly inferior to the other methods.
Comparison to other methods
• (2). In a test setting with three founders who introduced the mutation
to the population (Figure 6B), TreeDT has an edge over HPM, which
in turn has an edge over m-TDT. TDT barely beats random guessing.
Comparison to other methods
• (3). We compare the methods with a large amount of missing data
(Figure 6C). HPM is the most robust with respect to missing data, but
TreeDT is not much weaker than HPM. The performance of m-TDT
degrades much more clearly.
• Comparisons (1), (2), and (3) show that TreeDT is very
competitive.
Conclusions
• (1). TreeDT is a novel method for gene mapping, and our experiments
show that TreeDT is effective in extreme conditions for gene
mapping problems: with lots of noise (only 10%-20% of affected
chromosomes carry the mutation, lots of missing data) and with
small sample sizes (200 affected and 200 control chromosomes).
• (2). TreeDT is competitive with other recent data mining methods.
Personal Opinion
• We could look for a better statistic for the Tree Disequilibrium Test, such that:
• (1). The distribution of the maximized statistic is
simple, so that p values can be computed with low time complexity.
• (2). The maximized statistics are comparable across different
degrees of freedom.
• (3). Other methods that do not rely on trees could also be explored.