Harvard Medical School
Mapping Transcription Mechanisms from Multimodal Genomic Data
Hsun-Hsien Chang, Michael McGeachie, and Marco F. Ramoni
Children’s Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 10, 2010
1
Information Flow in Multimodal Genomic Data
• Genetic Variants
– 100k – 1000k SNPs
– 250k copy number variations (CNVs)
– 250k methylation measurements
[Diagram: arrows labeled "Information" between genetic variants and transcripts]
• Transcripts
– 50k mRNA expression levels
– 50k microRNA expression levels
– 1.5M exon expression / splicing
2
Expression Quantitative Trait Loci (eQTLs)
• The connection from variant to expression is an information channel
– A DNA locus that modulates the expression level of a gene is an eQTL
• Cis (trans) eQTLs are genetic variants located close to (far away from) the genes they modulate.
• Identifying cis-eQTLs is easier
– Focusing on cis-eQTLs reduces the search space
– What about trans-eQTLs?
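The cis/trans distinction above can be sketched as a distance rule. This is only an illustration: the function name, the coordinates, and the 1 Mb window are assumptions (the slides give no cutoff), and a SNP on the same chromosome but far from the gene is treated as trans.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Label a SNP-gene pair as 'cis' or 'trans'.

    `window` (in base pairs) is a hypothetical cutoff, not a value from the talk.
    """
    if snp_chrom != gene_chrom:
        return "trans"
    # Cis if the SNP lies within `window` bp of the gene body.
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


# Made-up coordinates, for illustration only.
print(classify_eqtl("21", 41_050_000, "21", 41_000_000, 41_100_000))
print(classify_eqtl("21", 15_000_000, "21", 41_000_000, 41_100_000))
```

Restricting the search to pairs that this rule labels "cis" is what shrinks the search space; trans candidates need every SNP to be compared against every gene.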
3
Clinical Study on Pediatric Leukemia
• Cancer is based on genetic modification (variants) and cellular malfunction (gene expression).
• Identifying eQTLs helps us understand molecular mechanisms in cancer and provides biological insight.
• Clinical study of acute lymphoblastic leukemia (ALL)
– The most common malignancy in children, accounting for nearly one third of all pediatric cancers.
– A few cases are associated with inherited genetic syndromes (e.g., Down syndrome, Bloom syndrome, Fanconi anemia), but the cause generally remains unknown.
• Data
– 29 patients.
– Genotyped 100,000 SNPs (Affymetrix Human Mapping 100K).
– Profiled 50,000 gene expression levels (Affymetrix HG-U133 Plus 2.0).
4
Challenges in Finding eQTLs
• Compare the distribution of each variant to the levels of each expression measurement
– Computational
• Testing all pairs of variants vs. expressions is costly
• Expression levels are usually discretized (Pensa et al., BioKDD, 2004)
– Multiple testing considerations
• Understanding
– Too many associations to test via laboratory science
• Computational methods of biological discovery
• Want to summarize the main informational (biological) pathways
• Answer: Use transcriptional information
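To give a sense of scale for the all-pairs comparison, the sketch below counts the tests implied by the study's 100,000 SNPs and 50,000 expression measurements. The Bonferroni correction and the 0.05 significance level are assumptions chosen for illustration; the talk does not say which correction, if any, was used.

```python
def pairwise_test_budget(n_variants, n_transcripts, alpha=0.05):
    """Return (number of variant-transcript tests, Bonferroni per-test threshold).

    Bonferroni is used here only as a familiar example of a correction.
    """
    n_tests = n_variants * n_transcripts
    return n_tests, alpha / n_tests


n_tests, threshold = pairwise_test_budget(100_000, 50_000)
print(f"{n_tests:,} tests, per-test threshold {threshold:.1e}")
```

With only 29 patients, a per-test threshold this small is very hard to reach, which is one motivation for summarizing associations rather than testing each one in isolation.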
5
Transcriptional Information Channel
[Diagram: X -> Transcription Channel -> Y]
• SNPs (X) are modeled as binomial variables.
• Expressions (Y) are modeled as log-normal variables.
• Mutual information (MI) quantifies information flow through the channel.
• Information theory measures uncertainty as entropy, H(X).
• Higher MI is achieved by larger σ² and smaller σ_k², i.e., when expression level Y is more likely modulated by SNP X.
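The formula on this slide was not recovered, so the following is a hedged reconstruction of why variance enters the way the last bullet describes, not necessarily the authors' exact expression. It assumes that Y denotes log-expression, that Y is normal with variance σ_k² within genotype class k (which occurs with probability p_k), and that σ² is the overall variance of Y.

```latex
\begin{align}
I(X;Y) &= h(Y) - h(Y \mid X), \\
h(Y \mid X) &= \sum_k p_k \, \tfrac{1}{2} \log\!\left(2\pi e\, \sigma_k^2\right), \\
h(Y) &\le \tfrac{1}{2} \log\!\left(2\pi e\, \sigma^2\right), \\
I(X;Y) &\le \tfrac{1}{2} \log \sigma^2 - \tfrac{1}{2} \sum_k p_k \log \sigma_k^2 .
\end{align}
```

The third line holds because a Gaussian has the largest entropy among distributions with a given variance. The bound becomes an approximation when the marginal of Y is close to Gaussian, and it grows with σ² and shrinks with each σ_k², matching the statement above.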
6
• Transcript Y is modulated by SNP X:
• Transcript Y is independent of SNP X:
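The two cases above can be illustrated with simulated data. The sketch below scores each case with a Gaussian approximation of mutual information, 0.5·ln(var) − 0.5·Σ_k p_k·ln(var_k), which is an assumption about the form of the measure rather than the authors' confirmed formula. The sample size, allele frequency, effect size, and function names are all hypothetical.

```python
import math
import random
from collections import defaultdict


def variance(values):
    """Maximum-likelihood (population) variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gaussian_mi(genotypes, log_expr):
    """Approximate I(X;Y) in nats: 0.5*ln(var) - 0.5*sum_k p_k*ln(var_k)."""
    groups = defaultdict(list)
    for g, y in zip(genotypes, log_expr):
        groups[g].append(y)
    n = len(log_expr)
    within = sum(len(ys) / n * math.log(variance(ys)) for ys in groups.values())
    return 0.5 * (math.log(variance(log_expr)) - within)


def simulate(n=300, seed=0):
    """Return (MI when Y is modulated by X, MI when Y is independent of X)."""
    rng = random.Random(seed)
    # Genotype = number of minor alleles (0, 1, or 2); frequency 0.4 is made up.
    genotypes = [(rng.random() < 0.4) + (rng.random() < 0.4) for _ in range(n)]
    # Modulated: the mean of log-expression shifts with the genotype.
    modulated = [rng.gauss(1.5 * g, 0.5) for g in genotypes]
    # Independent: the same distribution regardless of genotype.
    independent = [rng.gauss(0.0, 0.5) for _ in genotypes]
    return gaussian_mi(genotypes, modulated), gaussian_mi(genotypes, independent)


mi_modulated, mi_independent = simulate()
print(f"modulated:   {mi_modulated:.3f} nats")
print(f"independent: {mi_independent:.3f} nats")
```

In the modulated case the overall variance is much larger than the within-genotype variances, so the score is high; in the independent case the two are nearly equal and the score is close to zero.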
7
Transcriptional Information Map
[Figure: map of SNPs X1–X9 and transcripts Y1–Y9]
8
ALL Transcriptional Information Map of Chr21
9
Cluster Genes and SNPs into Networks
[Figure: map of SNPs X1–X9 and transcripts Y1–Y9]
10
Cluster Genes and SNPs into Networks
[Figure: example cluster containing X1, X3, X4, X8, Y1, Y2, Y9]
• We can further infer the optimal modulation patterns using Bayesian networks.
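One plausible reading of "clustering linked SNPs and transcripts" is to take the connected components of the SNP-transcript map. The sketch below does that with a simple graph traversal. The edge list is hypothetical (the links in the figure were not recovered), and the talk does not confirm that connected components is the grouping rule the authors used.

```python
from collections import defaultdict


def cluster_linked(edges):
    """Group SNPs and transcripts into connected components of the link graph."""
    adjacency = defaultdict(set)
    for snp, transcript in edges:
        adjacency[snp].add(transcript)
        adjacency[transcript].add(snp)

    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical SNP -> transcript links, for illustration only.
edges = [("X1", "Y1"), ("X3", "Y1"), ("X3", "Y2"), ("X4", "Y2"),
         ("X4", "Y9"), ("X8", "Y9"), ("X2", "Y3"), ("X5", "Y3")]
for cluster in cluster_linked(edges):
    print(cluster)
```

Each resulting cluster can then be analyzed on its own, which keeps the later network inference step small.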
11
Bayesian Networks
• Bayesian networks are directed acyclic graphs:
– Nodes correspond to random variables.
– Directed arcs encode conditional probabilities of the target nodes on the source nodes.
[Figure: example network over nodes A, B, X, Y, Z]
– The distribution of X depends on (A, B).
– Given X and Y, Z is independent of (A, B): p(Z | X, Y, A, B) = p(Z | X, Y).
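The independence statement above can be checked numerically on a toy network. The sketch assumes the arcs A -> X, B -> X, X -> Z, and Y -> Z, which is an inference from the two bullets since the figure itself was not recovered. All variables are binary and every probability value is made up.

```python
from itertools import product

# Hypothetical probabilities that each variable equals 1 (illustration only).
P_A, P_B, P_Y = 0.3, 0.6, 0.5
P_X_GIVEN_AB = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}
P_Z_GIVEN_XY = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}


def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, x, y, z):
    """Joint probability from the factorization p(a) p(b) p(y) p(x|a,b) p(z|x,y)."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b) * bernoulli(P_Y, y)
            * bernoulli(P_X_GIVEN_AB[(a, b)], x)
            * bernoulli(P_Z_GIVEN_XY[(x, y)], z))


def prob(target, **evidence):
    """P(target = 1 | evidence), computed by brute-force enumeration."""
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=5):
        world = dict(zip("abxyz", values))
        if any(world[name] != v for name, v in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target] == 1:
            numerator += p
    return numerator / denominator


# Z given its parents does not change when A and B are also observed.
print(prob("z", x=1, y=0), prob("z", x=1, y=0, a=0, b=1))
# X does change with A and B.
print(prob("x", a=0, b=0), prob("x", a=1, b=1))
```

Enumeration is only practical for a handful of variables, but it makes the factorization behind the arcs explicit.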
12
Infer Bayesian Networks in Individual Clusters
[Figure: example cluster containing X1, X3, X4, X8, Y1, Y2, Y9]
• Step 1: Use the TIM as the initial network.
• Step 2: Use Bayesian network inference to find SNP-SNP connections.
13
A Bayesian Network Inferred from Chr21 TIM
14
Information Theoretic Network Analysis
• Find hubs, motifs, guilds, etc.
– Abstract edges
– Global patterns -> local patterns
– Reveal emergent properties
– Information theoretic approach using data compression
Alterovitz G, and Ramoni MF, “Discovering biological guilds through topological abstraction,” AMIA Annu Symp Proc, pp. 1-5, 2006.
15
Identified Fundamental Components
16
Reference: Alterovitz and Ramoni, AMIA Annu Symp Proc, pp. 1-5, 2006.
Identification of Cis- and Trans-eQTLs
• RIPK4, 21q22.3
– Related to Down syndrome
– RIPK4 has 5 (trans) SNPs in q11.2 (shown in blue in the figure) affecting its expression.
[Figure: network around RIPK4]
17
Identification of Cis- and Trans-eQTLs
• CYYR1, 21q21.1
– Recently discovered.
– Encodes a cysteine and tyrosine-rich protein.
– A recent study found a correlation with neuroendocrine tumors.
– The TIM shows CYYR1 modulated by SNPs across the q arm of chromosome 21.
– DSCAM is related to Down syndrome
– Does a DSCAM-CYYR1 interaction lead to ALL?
[Figure: network with DSCAM labeled]
18
Complete TIM Algorithm
[Flowchart, stages in order:]
• Genetic variants and transcripts
• Compute transcriptional information
• Group linked SNPs and transcripts (Cluster 1, ..., Cluster N)
• Infer network in individual clusters (Cluster 1, ..., Cluster N)
• Network topology analysis and summary
19
Transcriptional Information Maps
• Make large multimodal genomic datasets amenable to transcriptional analysis
• Identify
– Modulation patterns between genetic variants and transcripts.
– Cis- and trans-eQTLs.
• Analysis of pediatric ALL helps identify biological hypotheses regarding a connection to Down syndrome
20
Questions?
Thanks to Prof. Marco F. Ramoni, Dr. Hsun-Hsien Chang, Dr. Gil Alterovitz, Children’s Hospital Informatics Program, and Brigham and Women’s Hospital
21