Transcriptome Profiling in Human Congenital Heart Disease
Transcriptome Profiling of Human Cardiac Tissues in Hypoplastic Left Heart Syndrome
Karl D. Stamm, MS
Donna K. Mahnke, MS; Mary A. Goetsch, MS;
D. Woodrow Benson, MD, PhD; Xing Li, PhD;
Aoy Tomita-Mitchell, PhD; Timothy J. Nelson, MD, PhD;
James S. Tweddell, MD; Michael E. Mitchell, MD
September 2013 Research Update
Overview
• Medical Research
• Trouble with humans
• Rare diseases are common in a large enough population
• Next-Generation Sequencing Tech
• Illumina HiSeq methodology
• Differential expression
• Further Mining
• Principal components analyses
• Gene profiles and the self-organizing-map
Trouble with Humans
• Small sample sizes
• Low statistical power
• High interpersonal variability
• Ethnic backgrounds imply metabolic differences
• Phenocopy
• Multiple distinct diseases showing identical presentation
• Confounds clustering or association studies
• Ruins Case/Control study power
• PHI – Private/Protected Health Information
• Data security is paramount
• Cross-disciplinary collaborations are limited
• DNA is theoretically but not practically identifiable
Congenital Heart Defect
Incidence
• Down Syndrome 1:700 live births
• 50-60% have some structural heart defect
• 22qD Syndrome 1:4000 live births
• 75-90% have some structural heart defect
• ‘Healthy’ 99:100 live births
• 0.8% have some structural heart defect
Proportion Explained: C.H.D. in particular
Hypoplastic Left Heart Syndrome
1 in 40 CHD cases are HLHS
2.5 : 10000 of all births
• Complex developmental disorder
• 100% fatal before the invention of the Norwood Procedure (1981)
• No multigenerational pedigrees
• Spontaneous mutation: immune to detection by genetic linkage
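As a quick arithmetic check of the two figures on this slide: 1 in 40 CHD cases works out to 2.5 per 10,000 births when overall CHD incidence is taken as about 1%. That 1% is an assumption, not a number from these slides; the 0.8% quoted earlier for the ‘healthy’ group alone would give 2.0 per 10,000. The function names below are illustrative only.

```python
def per_10000(rate):
    """Convert a proportion of births into cases per 10,000 births."""
    return rate * 10_000


def hlhs_rate(chd_rate, hlhs_fraction=1 / 40):
    """HLHS birth rate from the overall CHD rate and the HLHS share of CHD cases."""
    return chd_rate * hlhs_fraction


# Compare the slide's 0.8% (non-syndromic births) with an assumed overall 1%.
for chd_rate in (0.008, 0.010):
    print(f"CHD rate {chd_rate:.1%}: {per_10000(hlhs_rate(chd_rate)):.1f} HLHS per 10,000 births")
```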
All sequencing costs for this study provided by
Generate Reads – Illumina Tech
10 to 500 million short reads are generated in pairs, 2x50 to 2x100 bp each.
http://seqanswers.com/forums/showthread.php?t=21
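To put the read counts in perspective, here is a rough yield calculation. It assumes the "10 to 500 million" figure counts read pairs (the slide does not say) and uses a genome size of about 3.1 billion bases. For RNA-Seq the reads pile up on expressed transcripts rather than spreading evenly, so the uniform-coverage number is only a yardstick.

```python
def total_bases(read_pairs, read_length):
    """Bases produced by paired-end sequencing: two reads per pair."""
    return read_pairs * 2 * read_length


def fold_coverage(read_pairs, read_length, genome_size=3.1e9):
    """Average depth if every base aligned uniformly across the genome."""
    return total_bases(read_pairs, read_length) / genome_size


low = fold_coverage(10_000_000, 50)      # smallest run: 10M pairs of 2x50 bp
high = fold_coverage(500_000_000, 100)   # largest run: 500M pairs of 2x100 bp
print(f"{low:.2f}x to {high:.1f}x")
```

Under these assumptions the range spans roughly 0.3x to 32x of a whole genome.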
Align Reads to Reference
• Which one?
• NCBI #37.3 has 3.1 billion bases across 190 contiguous scaffolds
• UCSC hg19 has 3.2 billion bases across 163 contiguous scaffolds
• Haploid reference contains disease alleles and chimeric sequence, like an A+B+O blood type.
Image of patches modifying the CHR17 reference from 2011 according to Ensembl
http://www.ensembl.info/blog/2011/05/20/accessing-non-reference-sequences-in-human/
Millions of Variants
• The 1000 Genomes project found 38 million SNPs, 1.4 million short insertions or deletions, and more than 14 thousand larger deletions
• The NHLBI Exome Sequencing Project targeted 22 Mbases across 2,440 individuals and found 563,700 variants, 82% of which were novel. They averaged 200 novel coding mutations per person.
• We find about 150-300 thousand SNVs in an exome, 10% of which are nonsynonymous
• SAMTOOLS is the software of choice for variant calling relative to your reference genome.
• CCG/Proline -> CTG/Leucine
• HOPX is a gene known to regulate heart development!
• Very common mutation
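The CCG to CTG change above is nonsynonymous because the two codons encode different amino acids. A minimal sketch of that classification follows; the table holds only the handful of standard-genetic-code codons needed here, and the function name is hypothetical rather than part of any variant-calling tool.

```python
# Partial codon table (standard genetic code); just enough for this example.
CODON_TO_AA = {
    "CCG": "Pro", "CCA": "Pro", "CCC": "Pro", "CCT": "Pro",
    "CTG": "Leu",
}


def classify_substitution(ref_codon, alt_codon):
    """Return (ref amino acid, alt amino acid, 'synonymous' or 'nonsynonymous')."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind


print(classify_substitution("CCG", "CTG"))  # second-base change, Pro to Leu
print(classify_substitution("CCG", "CCA"))  # third-base change, still Pro
```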
RNA-Seq vs. Whole Genome
1. Extract and purify mRNA by selecting for its poly(A) tail
2. Convert spliced mRNA to DNA fragments
3. Run standard genome sequencing on the product
4. Result: Expression-level-dependent sequence coverage
Image found at
http://www.pacificu.edu/optometry/ce/courses/20591/armdpg3.cfm
RNA-Seq Reconstructs Transcripts
From the Cufflinks paper: Trapnell et al., Nature Biotechnology 28, 511–515 (2010).
http://www.nature.com/nbt/journal/v28/n5/abs/nbt.1621.html
IGV – aligned reads viewer
CoverageBED
Simple arbitrary feature read depth counting.
-Count by gene, exon, whatever
BEDTOOLS : a flexible suite of utilities for comparing genomic features.
http://code.google.com/p/bedtools/
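The idea behind feature read-depth counting can be sketched in a few lines. This is a brute-force illustration of interval overlap counting, not how BEDTOOLS is implemented; coordinates are assumed to be 0-based and half-open as in BED files, and the feature and read tuples are made-up examples.

```python
def count_reads_per_feature(features, reads):
    """Count reads overlapping each feature.

    features: list of (name, chrom, start, end)
    reads:    list of (chrom, start, end)
    Intervals are 0-based, half-open: [start, end).
    """
    counts = {name: 0 for name, _, _, _ in features}
    for chrom, r_start, r_end in reads:
        for name, f_chrom, f_start, f_end in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == f_chrom and r_start < f_end and f_start < r_end:
                counts[name] += 1
    return counts


features = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 300, 400)]
reads = [("chr1", 90, 140), ("chr1", 150, 200), ("chr1", 200, 250),
         ("chr1", 390, 440), ("chr2", 100, 150)]
print(count_reads_per_feature(features, reads))
```

The same function counts by gene, exon, or any other feature set, since it only sees named intervals.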
Example of bad alignment
Variance and mean linked by local regression - for robust parameter estimation.
• Negative Binomial
• Models count as ‘binomial successes until a set number of failures’ which better fits the RNA-Seq fragment generation (limited reagent)
• Allows/captures the ‘overdispersion’ seen in RNA-Seq experiments.
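Overdispersion can be made concrete with the usual mean-dispersion form of the negative binomial, where variance = mean + dispersion × mean². With dispersion 0 this reduces to the Poisson case (variance = mean). The dispersion value 0.1 below is an arbitrary illustration, not an estimate from this dataset.

```python
import math


def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mean + dispersion * mean ** 2


def coefficient_of_variation(mean, dispersion):
    """Standard deviation divided by the mean."""
    return math.sqrt(negative_binomial_variance(mean, dispersion)) / mean


# Poisson noise shrinks relative to the mean as counts grow;
# negative binomial noise levels off near sqrt(dispersion).
for mean in (10, 100, 1000, 10000):
    poisson_cv = coefficient_of_variation(mean, 0.0)
    nb_cv = coefficient_of_variation(mean, 0.1)
    print(f"mean={mean:>6}  Poisson CV={poisson_cv:.3f}  NB CV={nb_cv:.3f}")
```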
Scale the totals for compatible means
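One common way to scale samples so their means are comparable is the median-of-ratios rule used by DESeq. Whether this exact rule was applied for these slides is an assumption; the sketch below only illustrates the idea on toy counts, and the function name is hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of per-gene counts (same gene order).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Genes with a zero in any sample have a geometric mean of 0 and are skipped.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples (the reference sample).
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples) for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g] for g in usable))
        for s in samples
    }


toy = {"a": [10, 20, 30, 0], "b": [20, 40, 60, 5]}  # sample b sequenced twice as deeply
factors = size_factors(toy)
normalized = {s: [c / factors[s] for c in toy[s]] for s in toy}
print(factors)
```

Dividing each sample's counts by its factor puts the samples on a common scale before means and variances are compared.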
Mean-Variance Connection
Detection in Low Values
Per-gene mean by difference ratio
DESeq
• Starting from 18,000 Rsids minus 1200 NA
• 1000 entries p<0.05
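For context on the 1000 hits: with 18,000 − 1,200 = 16,800 tests, about 840 would fall below p = 0.05 by chance alone if no gene were truly different. The slide does not say whether these p-values are adjusted, so the sketch below is only an illustration of that arithmetic plus the standard Benjamini-Hochberg adjustment on made-up p-values.

```python
def expected_null_hits(n_tests, alpha):
    """Expected number of tests below alpha when every null hypothesis is true."""
    return n_tests * alpha


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(expected_null_hits(18_000 - 1_200, 0.05))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```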
Theme
• Big lists
• Noisy data
• Complex correlation
• Heterogeneous background
Precious Tissue Samples
• Collecting tissue during surgery is an extra burden placed on overloaded surgical teams.
• Samples must be processed carefully to avoid degradation of sensitive molecules.
• Many steps and costs prior to gene sequencing.
• Collaborators have provided 35 patients’ atrial septal tissues.
• Still no ethical source of healthy control.
• Hope to see separation between red/not-red or solid/not-solid points
• Lack of discrimination in the major dimensions of variation
• Implying that uncontrolled heterogeneity dominates
Therefore, there is more difference from person to person than between subtypes
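For readers unfamiliar with how the major variation dimensions are found, here is a minimal sketch of extracting the first principal component by power iteration on the covariance matrix. It uses a tiny made-up dataset and plain Python; a real analysis of expression data would use a numerical library, and nothing here reproduces the plots from the slides.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (unit direction, fraction of total variance) for the first PC.

    rows: list of samples, each a list of numeric features.
    The sign of the returned direction is arbitrary.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


samples = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
direction, explained = first_principal_component(samples)
print(direction, explained)
```

Projecting each sample onto the leading directions gives the scatter coordinates; if known subgroups do not separate there, other sources of variation dominate.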
Top25 Consistent Genes
• Does anyone know what it means when Adducin2 and HomeoboxA4 are overexpressed? Is it significant that a dehydrogenase is under-expressed?
Group Profiles at Selected Dimensions
Self-Organizing Map
• Kohonen 1990
• Halfway between neural networks and k-means (horrible oversimplification)
• Enforced grid layout and local neighborhood similarity
• Data points (here 25-dimensional vectors) lay out in natural organization
Stochastic - Iteration
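A minimal self-organizing map, to show the stochastic iteration: pick a random sample, find the closest grid node (the best matching unit), and pull that node and its grid neighbors toward the sample. The grid size, learning-rate schedule, and neighborhood width are arbitrary choices for this sketch, and the toy data is 2-dimensional rather than the 25-dimensional gene profiles.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest to x (squared Euclidean distance)."""
    return min(weights,
               key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)))


def train_som(data, rows=3, cols=3, iterations=1000, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[k] for x in data) for k in range(dim)]
    hi = [max(x[k] for x in data) for k in range(dim)]
    # Random initial weights inside the bounding box of the data.
    weights = {(r, c): [rng.uniform(lo[k], hi[k]) for k in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        lr = lr0 * (1.0 - frac)                     # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # neighborhood shrinks to 0.5
        x = rng.choice(data)                        # stochastic step: one random sample
        br, bc = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return weights


data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
som = train_som(data, seed=1)
print({tuple(x): best_matching_unit(som, x) for x in data})
```

Because both the initialization and the sample order are random, different seeds give different layouts.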
Pairwise Similarity
• Co-clustering frequency determines sample similarity
• Sub-clusters are identified organically
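Co-clustering frequency can be computed as the fraction of repeated clustering runs in which two samples land in the same cluster. The sketch below assumes each run is a mapping from sample name to cluster label (for a self-organizing map, the label could be the grid node); the sample names and labels are made up.

```python
from itertools import combinations


def coclustering_frequency(runs):
    """Fraction of runs in which each pair of samples shares a cluster.

    runs: list of dicts mapping sample name -> cluster label.
    Labels only need to be comparable within a single run.
    """
    samples = sorted(runs[0])
    counts = {pair: 0 for pair in combinations(samples, 2)}
    for assignment in runs:
        for a, b in counts:
            if assignment[a] == assignment[b]:
                counts[(a, b)] += 1
    return {pair: n / len(runs) for pair, n in counts.items()}


runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 2, "s2": 2, "s3": 2},
    {"s1": 1, "s2": 0, "s3": 0},
]
print(coclustering_frequency(runs))
```

Pairs with high frequency across many runs form the sub-clusters, with no need to fix the number of groups in advance.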
Results
• Lists of genes differentially expressed across conditions
• Many conditions, uncertain homogeneity
• List cutoff subjective
• No healthy control group
• We can mine these lists for pathways or biological processes
• Resulting in more lists of more complex results
Transcriptome Project
Future Work
• A few more samples are coming… Can we build a classifier?
• Predict non-measured variables? Signatures of immune response point towards treatment targets.
• Predict compensatory effects? Samples are taken just days after birth, but 8 months after the heart started beating.
• How else could we look at this rich, unique dataset?
Thanks for listening