Bayesian association of haplotypes and
non-genetic factors to regulatory and
phenotypic variation in human populations
Jim Huang*
Probabilistic and Statistical Inference Group,
Edward S. Rogers Department of Electrical and Computer Engineering
University of Toronto
Toronto, ON, Canada
Anitha Kannan and John Winn
Machine Learning and Perception Group
Microsoft Research Cambridge
Cambridge, UK
ISMB/ECCB 2007
24/07/2007
Outline
• Main contributions:
• Joint Bayesian modelling of genetic variation data and
quantitative trait measurements
• Rich probabilistic model for genotype data
• State-of-the-art results on predicting missing
genotypes
Outline
Genotype: unordered pair of SNPs along both chromosomes
Haplotype: ordered set of SNPs along a chromosome
Presence of recombination hotspots partitions haplotypes into blocks [Daly, 2001]
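The difference between a genotype and a haplotype can be made concrete with a small sketch. It assumes biallelic SNPs coded 0/1; the function name and the toy haplotypes are illustrative and not taken from the slides.

```python
from typing import List, Tuple

Haplotype = List[int]               # ordered alleles (0/1) along one chromosome
Genotype = List[Tuple[int, ...]]    # unordered allele pair at each SNP

def genotype_from_haplotypes(h1: Haplotype, h2: Haplotype) -> Genotype:
    """Collapse two haplotypes into a genotype by sorting each allele pair."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return [tuple(sorted(pair)) for pair in zip(h1, h2)]

# Two different haplotype pairs (hypothetical toy data)...
pair_a = ([0, 0, 1], [1, 1, 1])
pair_b = ([0, 1, 1], [1, 0, 1])

# ...collapse to the same genotype, so phase cannot be read off the genotype alone.
g_a = genotype_from_haplotypes(*pair_a)
g_b = genotype_from_haplotypes(*pair_b)
print(g_a == g_b)
```

This is why a model for genotype data has to treat phase as uncertain unless it is given, for example through parent-child information.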
Part I: Learning haplotype block structure
• Our model for genotype data should:
– Account for phase & parent-child information
– Account for uncertainty in ancestral haplotypes
– Account for uncertainty in block structure
– Account for population-specific haplotype block statistics
– Allow for prior knowledge of haplotype block structure
Previous models for genotype data
• Previous methods learn a low-dimensional representation of the genotype data:
• HAPLOBLOCK (Greenspan, G. and Geiger, D. RECOMB 2003)
– Hard partitioning of data into set of haplotype blocks using low-dimensional “ancestral” haplotypes
• fastPHASE (Scheet P. and Stephens, M. Am J Hum Genet 2006)
– Learn ancestral haplotypes from high-dimensional genotype data while accounting for uncertainty in haplotype blocks
• Jojic, N., Jojic, V. and Heckerman, D. UAI 2004.
Probabilistic generative model for genotype data
[Figure: generative model relating a low-dimensional latent representation to the high-dimensional data; unsupervised learning via maximum likelihood]
Predicting missing genotype data
• Have we learned a good density model for genotype data?
• Gains from
– Accounting for uncertainty in haplotype block structure
– Accounting for uncertainty in ancestral haplotypes
– Accounting for parental relationships
• Assess model using cross-validation/test prediction error
Predicting missing genotype data
• Crohn’s/5q31 data set (Daly et al., 2001)
– Crohn’s disease data from Chromosome 5q31 containing genotypes for
129 children + 258 parents across 103 loci (phases given for children)
• For each test set, hide a fraction ρ of the data
• Keep the model parameters learned from the training data fixed, then draw 1000 samples of the missing data
• Compute the fill-in error rate over the 1000 samples, across all missing entries
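The evaluation protocol above can be sketched as follows. This is a minimal sketch under stated assumptions: genotypes are coded as integers 0/1/2, and `sample_from_locus_frequencies` is a hypothetical stand-in sampler that uses per-locus empirical frequencies, not the haplotype block model from the talk. Any sampler with the same signature could be plugged in.

```python
import numpy as np

def fill_in_error(genotypes, rho, sample_missing, n_samples=1000, seed=0):
    """Hide a fraction rho of entries, sample them back, return the mean error rate."""
    rng = np.random.default_rng(seed)
    genotypes = np.asarray(genotypes)
    mask = rng.random(genotypes.shape) < rho      # True where an entry is hidden
    if not mask.any():
        return 0.0
    errors = 0
    for _ in range(n_samples):
        filled = sample_missing(genotypes, mask, rng)
        errors += np.sum(filled[mask] != genotypes[mask])
    return errors / (n_samples * mask.sum())

def sample_from_locus_frequencies(genotypes, mask, rng):
    """Stand-in sampler: draw each hidden entry from the visible values at its locus."""
    filled = genotypes.copy()
    for k in range(genotypes.shape[1]):
        hidden = np.flatnonzero(mask[:, k])
        observed = genotypes[~mask[:, k], k]
        # Fall back to all visible entries if a whole locus is hidden;
        # rng.choice raises if nothing at all is visible.
        pool = observed if observed.size else genotypes[~mask]
        if hidden.size:
            filled[hidden, k] = rng.choice(pool, size=hidden.size)
    return filled

# Hypothetical toy data: 40 individuals, 10 loci.
toy = np.random.default_rng(1).integers(0, 3, size=(40, 10))
err = fill_in_error(toy, rho=0.2, sample_missing=sample_from_locus_frequencies, n_samples=50)
print(f"fill-in error rate: {err:.3f}")
```

The sampler only ever sees the visible entries, so the error is measured on data the sampler could not copy.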
Prediction error for Crohn’s/5q31 data
Comparative performance for Crohn’s/5q31
data
Establishing haplotype block boundaries
• Define the recombination prior γ on transition probabilities
– Different γ correspond to different “blockiness” of data
• For each locus k, can compute the probability of transition p_k
– Can set a threshold t on p_k to establish block boundaries
• Once blocks are defined, can assign block labels l_b = (m, n)
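The thresholding step can be sketched as below. It assumes the per-locus transition probabilities have already been computed by the model and that p[k] refers to a transition between locus k-1 and locus k; the function name and the toy values are illustrative. Assigning block labels is not shown.

```python
from typing import List, Sequence, Tuple

def block_boundaries(p: Sequence[float], t: float) -> List[Tuple[int, int]]:
    """Split loci 0..K-1 into blocks, opening a new block at locus k when p[k] > t.

    p[k] is taken to be the probability of a transition between locus k-1 and k;
    p[0] is ignored because the first locus always opens a block.
    Returns inclusive (first_locus, last_locus) pairs.
    """
    if len(p) == 0:
        return []
    blocks, start = [], 0
    for k in range(1, len(p)):
        if p[k] > t:
            blocks.append((start, k - 1))
            start = k
    blocks.append((start, len(p) - 1))
    return blocks

# Hypothetical transition probabilities for 8 loci.
p_k = [0.0, 0.02, 0.01, 0.85, 0.03, 0.04, 0.60, 0.05]
blocks = block_boundaries(p_k, t=0.5)
print(blocks)
```

Raising t merges blocks and lowering it splits them, so t plays a role alongside the recombination prior γ in how "blocky" the result looks.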
Haplotype block structure in the ENm006
region
• 573 SNP markers for 270 individuals from 3 subpopulations:
– 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
– 90 individuals (30 trios) of European descent from Utah (CEU)
– 45 Han Chinese individuals from Beijing (CHB) and 45 Japanese individuals from Tokyo (JPT), pooled as CHB+JPT
Part II: Linking haplotype block structure and gene
expression data
A model for linking haplotype structure to quantitative trait measurements
[Figure: for haplotype blocks 1 and 2, the block labels (Label 1 to Label 4) of Individuals 1 to 5 give a latent block profile, which is multiplied by a relevance variable (1.0 for block 1, 0.0 for block 2); the weighted profiles are summed to give the observed quantitative trait profile]
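One way to read this diagram is as a relevance-weighted sum of per-block profiles. The sketch below makes that reading concrete; it is an interpretation, not the model as specified in the talk. The `effects` table (one value per label per block) is a hypothetical parameterisation, and noise and priors are left out.

```python
import numpy as np

def predict_trait(labels, effects, relevance):
    """Sum relevance-weighted block profiles over blocks.

    labels:    (B, J) integer block label of individual j in block b
    effects:   (B, L) hypothetical trait effect for each label of each block
    relevance: (B,)   weight per block (1.0 = relevant, 0.0 = ignored)
    """
    labels = np.asarray(labels)
    effects = np.asarray(effects, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    # Latent block profile: look up the effect of each individual's label, per block.
    profiles = np.take_along_axis(effects, labels, axis=1)   # shape (B, J)
    return relevance @ profiles                               # shape (J,)

# Toy setting: 2 blocks, 5 individuals, 4 possible labels per block.
labels = [[0, 1, 1, 3, 2],
          [2, 2, 0, 1, 3]]
effects = [[-1.0, 0.5, 1.0, 2.0],
           [3.0, -3.0, 0.0, 1.0]]
relevance = [1.0, 0.0]          # block 1 relevant, block 2 switched off

trait = predict_trait(labels, effects, relevance)
print(trait)
```

With relevance 0.0, block 2 contributes nothing to the trait regardless of its labels, which is the behaviour the relevance variable is meant to capture.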
A Bayesian model for linking haplotype structure to quantitative measurements
[Figure: graphical model with plates over blocks b = 1,…,B, quantitative traits g = 1,…,G and individuals j = 1,…,J. Node labels: block label, relevance variable, latent block profile, observed trait, noise precision. Symbols shown: π0, w_bg, T_bj, S_bj, z_gj, μ_bg, ρ_g, and the pairs (τ0, μ0) and (α0, β0)]
Linking haplotype blocks to phenotype
• 387 individuals with Crohn’s (+1) or non-Crohn’s (-1) phenotype
• Link 10 haplotype blocks from 5q31 to phenotype
• Average cross-validation error: 23.1% ± 3.45%
[Figure: axes “Test cases (sorted)” and “Test data splits”]
Haplotype blocks 2 and 10 most relevant to Crohn’s phenotype (p < 4.76 × 10^-5)
Linking haplotype blocks to gene expression
• ENm006 data set:
– 19 haplotype blocks (573 SNPs)
– 28 gene expression profiles in ENm006 region (Stranger et al., 2007)
Addressing population stratification
The population variable affects phenotype/gene expression…
…whereas variation between individuals is the effect we’re interested in
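A simple way to see the distinction is to remove each population's mean before looking for associations, so that only variation between individuals within a population remains. The sketch below is a common preprocessing stand-in under that assumption, not the Bayesian treatment of the population variable used in the talk; the function name and toy values are illustrative.

```python
import numpy as np

def center_within_population(expression, population):
    """Subtract each population's mean so only within-population variation remains.

    expression: (J,) or (J, G) array of measurements for J individuals
    population: (J,) array of population labels
    """
    expression = np.asarray(expression, dtype=float)
    population = np.asarray(population)
    centered = expression.copy()
    for pop in np.unique(population):
        members = population == pop
        centered[members] -= expression[members].mean(axis=0)
    return centered

# Hypothetical toy data: the two populations differ in mean expression.
expression = [5.0, 7.0, 1.0, 3.0]
population = ["YRI", "YRI", "CEU", "CEU"]
centered = center_within_population(expression, population)
print(centered)
```

After centering, the between-population offset is gone while the difference between individuals in the same population is unchanged.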
Associations between haplotype blocks and
gene expression
GDI1 - HapBlock2 (YRI): p < 2.5 × 10^-4
GDI1 - HapBlock5 (CHB+JPT): p < 3.33 × 10^-4
Summary
• Enhanced version of the Jojic et al. (UAI 2004) model for haplotype inference and discovering block structure
• Novel Bayesian model for associating haplotype blocks to gene expression
• We re-discover population-specific block structures in the HapMap data
• Predictions for Crohn’s disease from Chromosome 5q31 data
• Cis-associations between blocks and gene expression in ENm006 in the presence of non-genetic factors
• Cis-association between HapBlocks 2 and 5 and GDI1
The road ahead…
• Applying to larger portions of the HapMap data
• Finding trans-associations
• Non-linear models for associating block structure to
quantitative traits
• Joint learning of haplotype block structure and associations
• Accounting for patterns of gene co-expression/similar
phenotypes
Acknowledgements
• Manolis Dermitzakis and Richard Durbin,
Wellcome Trust Sanger Institute
• Nebojsa Jojic,
Microsoft Research Redmond
• Paul Scheet,
University of Michigan - Ann Arbor
• US National Science Foundation (NSF)