슬라이드 1 - California Institute for

Download Report

Transcript 슬라이드 1 - California Institute for

The HAP webserver:
Tools for the Discovery of Genetic Basis of Human Disease
HYUN MIN KANG
NOAH ZAITLEN
TAURIN TAN-ATICHAT
EDWARD SHYU
GRACE SHAW
Computer Science and Engineering
University of California, San Diego
Bioinformatics Program
University of California, San Diego
Electrical and Computer Engineering
University of California, San Diego
Computer Science and Engineering
University of California, San Diego
Computer Science and Engineering
University of California, San Diego
DAFNA BITTON
ELAD HAZAN
ERAN HALPERIN
ELEAZAR ESKIN
Computer Science and Engineering
University of Calfornia, San Diego
Department of Computer Science
Princeton University
International ComputerScience Institute, Berkeley
Computer Science and Engineering
University of California, San Diego
4. Identifying Association via Statistical Tests
1. Introduction
Understanding the structure of human variation is important for understanding the genetic basis of human
diseases. Recent advances in high-throughput genotyping technology generating a tremendous amount of
high density single nucleotide polymorphism(SNP) data holds great promise for discovering genetic risk
factors associated with disease. In order to identify association between disease and variations in an
individual’s chromosome, the genotype data must be phased into haplotypes. Based on HAP, which is a
very efficient tool for haplotype resolution based on imprefect phylogeny, HAP webserver provides an
integrated method to reconstruct haplotype structure and to identify genetic variants associated with
complex phenotypes which can give insight into the genetic factors of complex diseases. Our methods
leverage interplay between genotype phasing, haplotype phylogeny, association analysis, and functional
SNPs prediction. Our methods leverage new insights into the structure of human variation which allows us
to observe phenotype associations directly from genotype and phenotype data. We demonstrate our
methods via an analysis of two genes implicated in hypertension. Our methods are easily accessible via the
webserver, providing complete results of association analysis including graphical visualizations. We
expect that our methods will facilitate current association studies.
2. HAP – haplotype resolution
HAP is a haplotype analysis system which is aimed in helping geneticists perform disease association
studies. The main feature of HAP is a phasing method which is based on the assumption of imperfect
phylogeny. The phasing method is very efficient, which allows HAP to work with very large data sets, and
to perform other operations such as finding a partition of the region into blocks of limited diversity or
performing association tests on each of these block with in vitro experiments already published.
HAP takes as input a set of genotypes over a region, taken form a population, and returns the haplotype
phase of each of the individual’s genotypes. From our studies, we observed that HAP is very accurate
when the number of individual taken is at least a couple of dozens. In addition to phasing, HAP also
produces a partition of the region into blocks of correlated SNPs. The block partition of the haplotypes is
such that it minimizes the number of tag SNPs. HAP leverages a new insight into the underlying structure
of haplotypes which shows that SNPs are organized in highly correlated “blocks”(Daly et al 01, Patil et al
01).
HAP has shown to have competitive accuracy compared to the state of the art sofrwares(such as PHASE,
HAPLOTYPER). On the other hand, HAP is extremely fast and can be used on very large datasets.
Recently, HAP is successfully used in revealing whole genome haplotype structure. (Hinds et al. 05)
Leveraging haplotype structure
Quantitative phenotypes & Dose-effects
Nonparametric Tests
Covariates
CHGA
HAPLOTYPE
ID
NUCLEOTIDE AT POSITION
-1106 -1018 -1014 -988
-462
-415
-89
-57
Linear
Regression
STATISTICAL TESTS
Unpaired
Mannt-test
Whitney
JonckheereTerpstra
A
G
A
T
T
G
T
C
C
.948(+)
.963 (+)
.969 (+)
.963 (+)
B
A
A
T
T
G
T
C
C
.977()
.999 ()
.999 ()
.996 ()
C
G
A
C
G
A
T
A
C
.175 ()
.209 ()
.505 ()
.485 ()
D
G
A
T
T
G
C
C
C
.999 ()
.990 (+)
.983 (+)
.997 (+)
E
G
T
T
T
G
C
C
T
.004 (+)**
.004 (+)**
.011 (+)*
.011 (+)*
F
G
A
C
G
A
T
C
C
.836 ()
.836 ()
.978 ()
.986 ()
Table 1  Haplotype analysis between CHGA promoter region and CHGA284-301 plasma levels :
Statistical p-values for the association between the haplotypes in CHGA promoter region and CHGA284-301 plasma
levels in 221 African Americans over various statistical tests. Each haplotype ID and its sequence is identical to
that of Figure 2. The p-values are evaluated by permutation tests with 105 times of random shuffling of phenotypes.
The p-values are also adjusted to multiple comparisons, thus no further conservative adjustments are required. The
plus or minus sign next to each p-value denotes whether the haplotype variant shows positive or negative effect on
the phenotype for each statistical test. Single and double asterisks by the p-value denotes that the p-value is less
than 0.05 and 0.01, respectively. This table is automatically generated by our webserver.
Figure 4  CHGA association visualization
A histogram of CHGA284-301 levels grouped by the
number of copies of the haplotypes E in Table 1. The xaxis represents plasma levels, and y-axis represents the
fraction of individuals with given plasma level. It can be
observed that there are significant association for the
haplotype to increase plasma level. This figure is
automatically generated by our webserver.
Figure 1  HAP webserver (a) HAP is used in revealing whole genome haplotype structure. The article
Figure 5  CHGA functional SNPs prediction
Results of predicting how each SNP contributes to the
association identified in Table 1. The y-axis is a score
that represents the degree of functional contribution. The
SNP at the position -89 makes the highest functional
contribution, and those at positions -1014,-988,-462
share the second highest score. This results is consistent
to the in vitro experiments previously published. This
figure is automatically generated by our webserver.
“Whole-Genome Patterns of Common DNA Variation in Three Human Populations” is published on the cover
of Science. (b) The screenshot of HAP webserver main page, available at http://research.calit2.net/hap
5. Functional SNPs Prediction
3. Inferring Phylogenetic Relationships between Haplotypes
Recent studies have shown that within short regions, there is limited genetic variability, and only a small
number of haplotypes account for the entire population. In a typical region of 20kb, three or four common
haplotypes account for 80% of the population. Futhermore, most rare variants appear to be minor variants
of common haplotypes. Using these results, phylogeny is inferred by identifying most likely ancestors for
the each of the rare haplotypes given the frequent ones. Then, ancestral haplotypes are found by searching
for similar common variants.
Once associated haplotypes are identified using rigorous statistical tests, our methods provide a method for
estimating the likelihood of each SNP contributing the association. To make this prediction, we iterate
over several groupings of the haplotypes to attempt to isolate the functional SNPs. The outcome of the
second step is a score distribution over the SNPs estimating how likely each SNP is to be functional.
6. Whole Genome Association Studies with HDL Mouse Phenome Database
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Figure 2  Predicted CHGA phylogeny
Figure 3  Linkage disequilibrium plot
Each symbol denotes a haplotype variants of CHGA
promoter. Each haplotype variant is classified into one of
three groups: ancestral, common, or recent haplotype. A
solid line denotes mutant, and dashed lines denotes
recombination. This figure is automatically generated by
our webserver.
Results of of running HAP webserver with linkage
disequilibrium data. The example data is available via
webserver. The axis represent SNP positions. The red
regions indicate high disequilibrium while the blue
indicates low disequilibrium.
Figure 6  HDL Phenotype
The association test results for the level of HDL
cholesterol in the different mouse strains.
Figure 7  Random
The association test results for randomly permuted HDL
phenotype in figure 6.