Transcript PPT
Parallel Genehunter: Implementation of a
linkage analysis package for distributed
memory architectures
Michael Moran
CMSC 838T Presentation
May 9, 2003
Introduction
Goals
Link Genes to specific loci in the genome
Decrease time and memory requirements through
parallelization
Motivation
Locate genes for specific phenotypes
Test for inherited diseases and risk factors
Gene therapy
CMSC 838T – Presentation
Talk Overview
Introduction
Talk Overview
Genetic Linkage Problem
Previous Work
Parallel Genehunter
Evaluation
Observations
CMSC 838T – Presentation
Genetic Linkage Problem
Sexual Reproduction
Offspring created by two haploid gametes
Gametes are produced from diploid/polyploid cells during
meiosis
www.blc.arizona.edu/courses/181gh/rick/genetics1/
CMSC 838T – Presentation
Genetic Linkage Problem
Recombination occurs in two ways
1.
Random segregation of chromatids
2 x 23 human chromosomes
=>
223 possible haploid combinations
Genes on different chromosomes
recombine with probability
.5
www.gen.umn.edu/faculty_staff/hatch/1131/
CMSC 838T – Presentation
Genetic Linkage Problem
Recombination occurs in two ways
1.
Random segregation of chromatids
2.
Crossover between homologous
pairs of chromosomes
Genes on the same chromosome
recombine with probability
depending on their distance and
location on the chromosome
CMSC 838T – Presentation
Genetic Linkage Problem
Given
This model of recombination
Data for a particular pedigree (family)
Phenotype information for each individual
Genetic markers for each individual
Recombination frequencies for each pair of markers
Can we apply probabilistic methods to
Reconstruct the inheritance patterns
Link phenotypes to the markers
CMSC 838T – Presentation
Previous Work
Fisher, Haldane, Smith, Morton (1935 - 1955)
Methods to infer genetic maps using maximum likelihood
estimators
Elston, Stewart (1971)
Genetic Linkage Algorithm
Linear in pedigree size
Exponential in number of markers
Lander, Green (1987)
Genetic Linkage Algorithm
Linear in number of markers
Exponential in pedigree size
CMSC 838T – Presentation
Previous Work
Genehunter (2001)
Implementation of Lander & Green
Analyzes a pedigree containing n non-founders
The inheritance of a gene by one
non-founder can be summarized
by two bits
The entire pedigree’s inheritance
pattern can be summarized by a
2n bits
CMSC 838T – Presentation
Previous Work
3 steps of Genehunter:
Step 1 : For each marker, calculate the probability of each
of the possible inheritance pattern.
0: grandfather’s chromatid
1: grandmother’s chromatid
Pr([0,0]) = .5
Pr([0,1]) = .5
Pr([1,0]) = 0
Pr([1,1]) = 0
Store probabilities in a vector of size 22n
CMSC 838T – Presentation
Previous Work
3 steps of Genehunter:
Step 2 : For each marker, calculate the conditional probably of
each inheritance pattern conditional on all of the markers to
the left, and to the right
•
For two markers’ inheritance vectors, each disagreeing
bit requires a crossover event
•
The probability of transitioning between inheritance
vectors i, j differing in d bits is
M i, j d (1 ) 2nd
CMSC 838T – Presentation
Previous Work
3 steps of Genehunter:
Step 2 : For each marker, calculate the conditional probably of
each inheritance pattern conditional on all of the markers to
the left, and to the right
•
Mi,j = cost of transitioning between inheritance vectors i&j
•
P1 , P2 = probability vectors for every inheritance pattern
given markers 1 and 2 respectively
•
P2|1 = P2 • (M P1)
•
Calculate the probabilities of each marker’s inheritance
conditional on all others by Markov Chain or FFT
convolution
CMSC 838T – Presentation
Previous Work
3 steps of Genehunter:
Step 3 : For each marker, calculate the probability of unknown
gene being located at specific locations
•
•
•
•
Hypothesizes phenotype has a gene located at a particular
location.
By default tries 5 evenly-spaced locations between consecutive
pairs of markers
Calculates PD, the probabilities of each inheritance pattern for
based on this phenotype (as in step 1)
For a location between markers i&i+1, p= PD • Px|1...i • Px|i+1...m
Space Requirement:
O(22n)
O(22n-f) exploiting symmetry of f founders
Time Requirement:
O(m22n)
O(m22n-f) with f founders
CMSC 838T – Presentation
Parallel Genehunter
Approach
Parallelize the 3 Genehunter steps separately
Divides each 22n-sized marker vector evenly among the P
processors
allows greater distribution of memory than assigning
O(m/P) entire vectors to each processor
CMSC 838T – Presentation
Parallel Genehunter
Parallelization of step 1
For each marker, calculate the probability of each of the
possible inheritance pattern
Each processor calculates the probabilities for a particular
22n / P inheritance patterns for ever marker
CMSC 838T – Presentation
Parallel Genehunter
Parallelization of step 2
For each marker, calculate the conditional probably of each
inheritance pattern conditional on all of the markers to the left, and to
the right
FFT convolution
As in serial genehunter, 22n x 22n matrix-vector multiplication
is replaced FFT-based convolution:
1.
2.
3.
2 forward 1D FFTs on 22n-length vectors
element-by-element multiplication
inverse FFT
Each 1D FFT is equivalent to a 2D FFT on a
P x 22n / P matrix
There are well-known distributed algorithms for this FFT using
all-to-all communication.
Dot Product in P2|1 = P2 • (M P1)
trivially parallelized: each processor has the same
portion of each vector.
CMSC 838T – Presentation
Parallel Genehunter
Parallelization of step 3
For each marker, calculate the probability of unknown
gene being located at specific locations
computing Px|1...i and Px|i+1...m
FFTs parallelized as in step 2
Final dot product p = (PD • Px|1...i • Px|i+1...m)
parallelized as in step 2
each processor holds all the same portion of each vector
CMSC 838T – Presentation
Evaluation
Experimental Environment
Input data sets
51 family member pedigree
{19,21,24}-bit data sets (# bits = 2n-f )
Computing Facilities
Cplant Cluster (Sandia National Laboratories)
DEC Alpha EV6 processors
Myrinet connection
CMSC 838T – Presentation
Evaluation
Runtimes For 19,21 and 24 bit problems
CMSC 838T – Presentation
Evaluation
Runtimes For 19,21 and 24 bit problems
CMSC 838T – Presentation
Observations
Pro: Performs Genehunter computation exactly
Pro: Effective for “multipoint linkage” of phenotypes
Con: Old-fashioned compared to protein-based methods (?)
Pro: Distributes memory requirements
Pro: More computers allows larger feasible inputs
Con: Experiments based on 1 pedigree
Pro: Efficient parallelization up to 32 or 64 processors
Con: Only allows pedigrees to grow by only 3 or 4 individuals
in equal time
CMSC 838T – Presentation
References
Genetic Recombination
Dr. Craig Woodworth, Genetic Recombination in Eukaryotes, Lecture Notes,
(www.clarkson.edu/class/by214/powerpoint)
Genehunter
K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis
Through Reduction of Inheritance Space. American Journal of Human Genetics
68, 2001.
Parallel Genehunter
G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel
Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory
Architectures, Proceedings of the First IEEE Workshop on High Performance
Computational Biology, International Parallel and Distributed Computing
Symposium, 2002.
CMSC 838T – Presentation
Questions?
CMSC 838T – Presentation