Lect18_InterpretGWAS


Understanding GWAS SNPs
Xiaole Shirley Liu
Stat 115/215
Pace of GWAS Studies
2
GWAS SNPs
• Association ≠ causation
• What’s the most likely causal SNP / Gene
in LD with the genotyped SNP?
• Use functional genomics to identify the
disease tissue of origin
• What’s the SNP doing in non-coding
regions? RSNPs
3
Use Literature & Pathway Information to
Identify Putative Causal SNPs / Genes
4
Each Gene has an NCBI Page
5
Especially Bibliography
6
And Pathways
7
Literature Mining Terms
• Corpus: Collection of documents. E.g. all papers in
PubMed
• Term frequency: Number of times a word appears in a
document. E.g. “polymerase” appeared 41 times in a paper
• Document frequency: Number of documents a word
appears in. E.g. 1234x papers have the word “transcription”
• Collection frequency: Total number of times a word
appears in a corpus. E.g. “transcription” appeared 6789X
times in all of PubMed indexed papers
• Stop words: Words in the corpus that contribute little to
meaning. E.g. to, is, an
• Stemming: Group together different variations of the
same word. E.g. activate vs. activated vs. activating
8
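The terms above can be made concrete with a small sketch. The three-sentence corpus, the stop-word list, and the crude `stem` helper are all invented for illustration; a real pipeline would use an established stemmer and a much larger stop-word list.

```python
from collections import Counter

# Toy corpus: three hypothetical "abstracts" (invented for illustration).
corpus = [
    "the polymerase is activated and the polymerase binds",
    "transcription is activating the gene",
    "an enhancer can activate transcription of the gene",
]

# Assumed stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "an", "and", "of", "can", "to"}

def stem(word):
    """Very crude suffix stripping that maps activated/activating to activate."""
    for suffix in ("ating", "ated", "ate"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + "ate"
    return word

def tokenize(doc):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    return [stem(w) for w in doc.lower().split() if w not in STOP_WORDS]

docs = [tokenize(d) for d in corpus]

# Term frequency: counts of each word within one document.
term_freq = [Counter(d) for d in docs]
# Document frequency: number of documents that contain the word.
doc_freq = Counter(term for d in docs for term in set(d))
# Collection frequency: total occurrences across the whole corpus.
coll_freq = Counter(term for d in docs for term in d)

print(term_freq[0]["polymerase"])   # term frequency in the first document
print(doc_freq["transcription"])    # documents containing "transcription"
print(coll_freq["activate"])        # all stemmed forms of "activate"
```

Stemming is what lets "activated", "activating" and "activate" be counted as one term in the collection frequency.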
Documents Represented as Vectors
“Our analysis includes
comparison of amino acid
environments with random
control environments as
well as with each of the
other amino acid
environments.”
acid          2
amino         2
analysis      1
comparison    1
control       1
environments  3
[…]
our           1
• A document is
summarized as a
vector of word counts.
• Each dimension
contains the number of
times a word appears.
• Can calculate
similarity between two
documents by
comparing their
vectors
9
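As a minimal sketch, the quoted sentence can be turned into a word-count vector with the Python standard library. The tokenization rule (lowercase, letters only) and the alphabetical ordering of dimensions are assumptions made for this example.

```python
import re
from collections import Counter

text = ("Our analysis includes comparison of amino acid environments with "
        "random control environments as well as with each of the other "
        "amino acid environments.")

# Lowercase the text and keep runs of letters as words.
counts = Counter(re.findall(r"[a-z]+", text.lower()))

# Fix an ordering of the vocabulary so each word owns one dimension.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

for word, n in zip(vocabulary, vector):
    print(f"{word:<14}{n}")
```

Two documents become comparable once both are counted against the same vocabulary, with zeros for words a document does not contain.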
Comparing Two Documents
• Intuitive comparison between two papers:
correlation coefficient of their word
occurrence vectors
• Correlation measures the strength of linear
relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)   # 0.985615, correlated
cor(b, c)   # -0.110328, not correlated
10
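For readers who want to see what the correlation coefficient computes, here is a sketch of Pearson correlation written out by hand in Python, using the same three vectors as the slide. The function name `pearson` is chosen for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Word-occurrence vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 4))  # about 0.9856
print(round(pearson(b, c), 4))  # about -0.1103
```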
Term Weighting Considerations
• Give different terms different weight
• Global weight
– Document frequency: Fewer documents, more
weight: log(N / df). E.g. progesterone vs gene
• Local weight
– Term frequency: More frequent, more weight:
1 + log(tf), or the variant log(1 + tf). E.g.
progesterone: 10 times in paper1 vs 3 in paper2
– Document length: Less weight for longer
document. E.g. paper1 200 pages vs paper2 3
pages
14
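The weights can be sketched as follows. The global weight log(N / df) and the local weight 1 + log(tf) come from the slide; the slide gives no formula for the document-length adjustment, so dividing by the square root of the word count is purely an assumption here, as are the corpus size and word counts in the example.

```python
from math import log, sqrt

def global_weight(n_docs, doc_freq):
    """Rarer terms (small df) get more weight: log(N / df)."""
    return log(n_docs / doc_freq)

def local_weight(tf, doc_len):
    """More occurrences raise the weight (1 + log(tf)); longer documents
    lower it. The 1 / sqrt(doc_len) penalty is an assumed choice."""
    if tf == 0:
        return 0.0
    return (1 + log(tf)) / sqrt(doc_len)

# Hypothetical corpus of 1000 papers.
print(global_weight(1000, 10))    # a rare term such as "progesterone"
print(global_weight(1000, 800))   # a common term such as "gene"

# Same term, 10 vs 3 occurrences, in papers of equal (assumed) length.
print(local_weight(10, 900), local_weight(3, 900))
# Same count in a much longer paper gets less weight.
print(local_weight(10, 60000))
```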
Evaluate Relatedness of Papers
• Related Articles
– Similarity between two documents: sum over
all terms of (local wt1 × local wt2 × global wt)
– Pre-computed related articles for each citation
– Rank ordered by relevance
15
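A minimal sketch of this scoring, assuming the simple weights 1 + log(tf) and log(N / df) and ignoring document length. The four toy "papers" are invented, and this is not the production PubMed implementation.

```python
from collections import Counter
from math import log

# Hypothetical mini-corpus; names and words are invented for illustration.
papers = {
    "paper1": "progesterone receptor gene expression progesterone",
    "paper2": "progesterone signaling gene",
    "paper3": "yeast gene expression",
    "paper4": "maize gene mapping",
}

counts = {name: Counter(text.split()) for name, text in papers.items()}
n_docs = len(counts)
# Iterating a Counter yields each distinct term once per document.
doc_freq = Counter(term for c in counts.values() for term in c)

def global_weight(term):
    return log(n_docs / doc_freq[term])

def local_weight(tf):
    return 1 + log(tf) if tf > 0 else 0.0

def similarity(name1, name2):
    """Sum over shared terms of local wt1 x local wt2 x global wt."""
    c1, c2 = counts[name1], counts[name2]
    shared = set(c1) & set(c2)
    return sum(local_weight(c1[t]) * local_weight(c2[t]) * global_weight(t)
               for t in shared)

def related_articles(name):
    """Other papers, rank ordered by decreasing similarity."""
    others = [other for other in counts if other != name]
    return sorted(others, key=lambda other: similarity(name, other),
                  reverse=True)

print(related_articles("paper1"))
```

Because "gene" occurs in every toy paper, its global weight is log(1) = 0 and it contributes nothing to any score, which is the intended effect of document-frequency weighting.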
GRAIL: Gene Relationships
Across Implicated Loci
16
Raychaudhuri et al PLOS Genetics 2009
GRAIL on Height SNPs
20
GRAIL on Crohn’s Disease
• Use literature /
pathways to
identify potential
causal gene
• Find likely
reproducible SNP
hits, and increase
statistical power
21
GWAS SNPs
• Association ≠ causation
• What’s the most likely causal SNP / Gene
in LD with the genotyped SNP?
• Use functional genomics to identify the
disease tissue of origin
• What’s the SNP doing in non-coding
regions? RSNPs
22
Identifying Causal Cell-type for
Complex Disease
• E.g. Rheumatoid Arthritis (RA)
• Many cell types implicated over the years,
including neutrophils, synoviocytes, and
all classes of lymphocytes!
• It is difficult to establish causality for complex
phenotypes in humans
• Use expression data: Comprehensive and
unbiased, publicly available
23
Immunological Genome Project
• Start with a list of disease
SNPs
• Find genes near the SNP
that are specifically
expressed in a cell type
• Identify cell types that have
many such genes ... more
than expected by chance
24
Identifying Causal Cell-type for
Complex Disease From Expression
• Negative control: simulations from random sets of
SNPs
• P-value: proportion of simulations exceeding the
observed enrichment
25
Hu et al, American Journal of Human Genetics, 2011
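The simulation-based P-value can be sketched as below. Everything concrete here is hypothetical: the gene names, the set of genes "specifically expressed" in the cell type, and the shortcut of sampling random gene sets directly rather than sampling random SNPs and mapping them to nearby genes. The sketch counts simulations that are at least as large as the observed value, which is one common reading of "exceeding".

```python
import random

def count_specific(genes, specific_genes):
    """Number of genes in the list that are specific to the cell type."""
    return sum(1 for g in genes if g in specific_genes)

def empirical_p_value(observed, null_values):
    """Proportion of simulated values at least as large as the observed one."""
    return sum(1 for v in null_values if v >= observed) / len(null_values)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical universe of 1000 genes, 10% specific to one cell type.
all_genes = [f"gene{i}" for i in range(1000)]
specific = set(all_genes[:100])

# Hypothetical genes near disease SNPs: 15 of the 35 are specific.
disease_genes = all_genes[:15] + all_genes[500:520]
observed = count_specific(disease_genes, specific)

# Negative control: random gene sets of the same size.
null = [count_specific(rng.sample(all_genes, len(disease_genes)), specific)
        for _ in range(2000)]

print(observed, empirical_p_value(observed, null))
```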
26
27
GWAS SNPs
• Association ≠ causation
• What’s the most likely causal SNP / Gene
in LD with the genotyped SNP?
• Use functional genomics to identify the
disease tissue of origin
• What’s the SNP doing in non-coding
regions? eQTL and RSNPs
28
GWAS SNP Distribution
• RSNP
29
eQTL
• eQTL: use expression as phenotype
– Are there SNPs that are associated with expression
changes?
– Heritable genetic variation for transcription levels
30
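Treating expression as the phenotype amounts to testing each SNP for association with a gene's expression level. A minimal sketch, assuming an additive model in which genotype is coded as the number of minor alleles (0, 1, 2) and expression is regressed on that dosage; the ten samples below are made up, and a real analysis would also compute a P-value and adjust for covariates.

```python
from math import sqrt

def linear_fit(x, y):
    """Least-squares slope and intercept, plus the Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)

# Hypothetical data: allele dosage at one SNP and expression of one gene.
genotype = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.9, 5.3, 6.0, 6.2, 5.8, 6.1, 7.0, 7.2, 6.9]

slope, intercept, r = linear_fit(genotype, expression)
print(f"expression change per allele: {slope:.3f}, r = {r:.3f}")
```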
RSNPs
• A SNP influences TF
binding, affecting
downstream (disease-related) gene
expression
31
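One way to illustrate how a SNP could influence TF binding is to score the reference and alternate alleles against a motif. The 4-position probability matrix, the uniform background, and the example sequence are all hypothetical, and only the forward strand is scanned to keep the sketch short.

```python
from math import log2

# Hypothetical motif: probability of each base at each of 4 positions.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition

def motif_score(site):
    """Log-odds score of one site the same length as the motif."""
    return sum(log2(PWM[i][base] / BACKGROUND) for i, base in enumerate(site))

def best_score(seq):
    """Best score over all forward-strand windows of the sequence."""
    width = len(PWM)
    return max(motif_score(seq[i:i + width])
               for i in range(len(seq) - width + 1))

def allele_effect(ref_seq, pos, alt):
    """Change in best motif score when the base at pos is replaced by alt."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return best_score(alt_seq) - best_score(ref_seq)

# Hypothetical sequence with the consensus AGCT at positions 2-5;
# the SNP changes the G at position 3 to T.
effect = allele_effect("TTAGCTGG", 3, "T")
print(round(effect, 3))  # negative: the alternate allele weakens the site
```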
eQTL and RSNPs
• eQTL: use expression as phenotype
– Are there SNPs that are associated with expression
changes?
– Heritable genetic variation for transcription levels
• RSNP: regulatory SNP
– Much of the influential variation is located cis- to the
coding locus
– In humans, mouse, and maize, 35%-50% of the genetic
basis for intraspecific differences in transcription level
is cis- to the coding locus (e.g. Morley et al. 2004; Schadt et
al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.).
32
Huang et al, Nat Genet 2014
33
RSNPs from GWAS
• Enriched in
regulatory
sequences
(promoters and
enhancers) that
are identified
through histone
mark ChIP-seq
or DNase-seq
34
Maurano et al, Science 2012
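Enrichment in regulatory sequences can be illustrated by comparing the fraction of GWAS SNPs that fall inside annotated regions with the fraction for a background set. The coordinates below are invented, all on one hypothetical chromosome, and the sketch reports only a fold enrichment; a real analysis would add a statistical test and match the background SNPs carefully.

```python
from bisect import bisect_right

# Hypothetical regulatory regions (e.g. DNase-seq peaks) on one chromosome,
# as sorted, non-overlapping half-open intervals [start, end).
regions = [(100, 200), (500, 650), (900, 1000)]
starts = [start for start, _ in regions]

def in_region(pos):
    """True if pos falls inside any region (binary search on the starts)."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and regions[i][0] <= pos < regions[i][1]

def fraction_in_regions(positions):
    return sum(1 for p in positions if in_region(p)) / len(positions)

# Hypothetical SNP positions.
gwas_snps = [120, 150, 510, 640, 700, 950]
background_snps = list(range(0, 1200, 50))

fold = fraction_in_regions(gwas_snps) / fraction_in_regions(background_snps)
print(round(fold, 2))
```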
Highest Correlated Genes of Distal
DHSs Harboring GWAS Variants
35
Trans-Effect of Cis-SNPs
• Three risk loci for
ESR1, MYC, and
KLF4
• Effect on TF
expression is small,
but much stronger
when looking at the
expression of their
downstream target
genes
36
Li et al, Cell 2013
Useful Tools to Understand RSNPs
• Identify putative TFs
whose binding might
be influenced by SNPs
based on ENCODE
ChIP-seq / DNase-seq
data
37
Understanding GWAS SNPs
• Association ≠ causation
• Use literature and pathways to identify the
putative causal SNP / Gene in LD with the
genotyped SNP
• Use (cell-type specific) expression and
epigenomics to:
– Identify the disease tissue of origin
– Identify regulatory SNPs that affect TF binding
and influence the expression of important
downstream disease genes
38
Acknowledgement
• Soumya Raychaudhuri
• Manolis Dermitzakis
39