ppt - Purdue College of Engineering

Big data versus the big C
Saurabh Bagchi
Paper: "Bioinformatics: Big Data Versus the Big
C", Neil Savage, Nature, May 28, 2014
Slide 1/19
The Cell
cell, nucleus, cytoplasm, mitochondrion
© 1997-2005 Coriell Institute for Medical Research
How many?

Cells in the human body:
~10^14 (100 trillion)
~10^15 bacterial cells!
Chromosomes
histone, nucleosome, chromatin, chromosome, centromere, telomere
[Figure: chromosome structure, labeled with telomere, centromere, chromatin, nucleosome (~146 bp of DNA around histones H2A, H2B, H3, H4), and histone H1]
How many?

Chromosomes in a human cell:
46 (2x22 + X/Y)
Nucleotide
deoxyribose, nucleotide, base, A, C, G, T, purine, pyrimidine, 3’, 5’
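[Figure: structure of a nucleotide. The phosphate group at the 5’ carbon links to the previous nucleotide, the 3’ carbon links to the next nucleotide, and the deoxyribose sugar links to the base.]
purines: Adenine (A), Guanine (G)
pyrimidines: Cytosine (C), Thymine (T)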
Let’s write “AGACC”!
“AGACC” (backbone)
“AGACC” (DNA)
deoxyribonucleic acid (DNA)
[Figure: AGACC drawn with its 5’ and 3’ ends labeled]
DNA is double stranded
strand, reverse complement
[Figure: two paired strands running in opposite directions, one 5’ to 3’ and its partner 3’ to 5’]
DNA is always written 5’ to 3’
AGACC or GGTCT
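The two spellings on this slide name the same double-stranded molecule: GGTCT is the reverse complement of AGACC. A minimal sketch of that conversion follows; the function name `reverse_complement` is chosen here for illustration and is not from the slides.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction."""
    # Reversing accounts for the partner strand running the opposite way;
    # the lookup swaps each base for the one it pairs with.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original sequence, which is why either strand can be used to describe the molecule.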
RNA
ribose, ribonucleotide, U
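[Figure: structure of a ribonucleotide. The phosphate group at the 5’ carbon links to the previous ribonucleotide, the 3’ carbon links to the next ribonucleotide, and the ribose sugar (which carries an extra OH group) links to the base.]
purines: Adenine (A), Guanine (G)
pyrimidines: Cytosine (C), Uracil (U)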
How many?

Nucleotides in the human genome:
~ 3 billion
Genes & Proteins
gene, transcription, translation, protein
Double-stranded DNA
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein
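The two steps on this slide can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a model of real gene expression: transcription is reduced to copying the 5’-to-3’ strand with T replaced by U, and translation reads codons from the first AUG until a stop codon, using the standard genetic code. The function names and the short input fragment (the part of the slide's sequence before the "...") are chosen here for illustration.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order UUU, UUC, UUA, UUG,
# UCU, ... GGG. '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: the RNA matches the 5'-to-3' strand, with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the RNA."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGGATTACAAAGCATTTAGGGA")
print(rna)             # AUGGGAUUACAAAGCAUUUAGGGA
print(translate(rna))  # MGLQSI
```

The fragment happens to contain the stop codon UAG in frame, so this toy translation ends after six amino acids.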
How many?

Genes in the human genome:
~ 20,000 – 25,000
Genes and cancer
• Tumor suppressor genes
• Oncogenes: an oncogene is a gene that has the potential
to cause cancer. In tumor cells, oncogenes are often mutated or
expressed at high levels.
• Observation (2013, Stephen Elledge of Harvard)
– Aneuploidy: a condition in which the number of chromosomes
in the nucleus of a cell is not an exact multiple of the monoploid
number of a particular species. An extra or missing chromosome
is a common cause of genetic disorders including human birth
defects.
– Aneuploidy was correlated with a high rate of cancer
– Finding: Aneuploidy resulted in missing tumor-suppressor
genes or extra copies of oncogenes
Slide 14/19
Data items
• Lots of data to mine for patterns of cancer
– Genome of tumor cell, genome of normal cell
– Medical history
– Lifestyle history
– CT scan, MRI scan
• Find out correlations with cancer
• Possible treatments (experimental today)
– Gene therapy
– Drug targets
• Existing tool: Tumor Suppressor and Oncogene Explorer
– Mine large data sets – roughly 8,000 tissue samples for 29 different kinds of
tumors
– Apply statistical classification to identify tumor suppressor genes and
oncogenes: the counts grew from 70 to 320 and from 50 to 200, respectively
– Distinguishing features include: mutation rate, ratio of benign mutations to
those that cause a gene to stop functioning
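As a rough illustration of the two distinguishing features named above, the sketch below computes a per-gene mutation rate and the ratio of benign to function-stopping mutations. It is not the tool's actual method: the record type, the normalization of the mutation rate by gene length and sample count, and the gene name and counts are all hypothetical, chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class GeneMutations:
    """Hypothetical per-gene mutation tally across a set of tumor samples."""
    name: str
    length_bp: int          # gene length in base pairs
    benign: int             # mutations that leave the gene functional
    loss_of_function: int   # mutations that stop the gene from functioning


def mutation_rate(gene: GeneMutations, n_samples: int) -> float:
    """Mutations per base pair per sample (an assumed normalization)."""
    return (gene.benign + gene.loss_of_function) / (gene.length_bp * n_samples)


def benign_to_lof_ratio(gene: GeneMutations) -> float:
    """Ratio of benign mutations to function-stopping ones."""
    if gene.loss_of_function == 0:
        return float("inf")
    return gene.benign / gene.loss_of_function


# Made-up numbers for a made-up gene; 8,000 is the sample count from the slide.
example = GeneMutations(name="GENE_A", length_bp=2000, benign=40, loss_of_function=60)
print(mutation_rate(example, n_samples=8000))
print(benign_to_lof_ratio(example))
```

A classifier would take features like these as input; which thresholds or statistical model to use is outside what the slide states.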
Slide 15/19
Databases
• We can mine these large databases, for various kinds of
tumors
1. Cancer Genome Atlas, from NCI (US)
2. Catalogue of Somatic Mutations in Cancer (COSMIC), from
Wellcome Trust Sanger Institute (UK)
3. Galaxy
4. ENCODE
5. Roadmap Epigenomics Project
Slide 16/19
Tools and Discoveries
• Large project: Bionimbus
– Cloud-based, open-source platform for sharing and analyzing genomic data
from the Cancer Genome Atlas
• An example finding
– By Megan McNerney (U of Chicago, Spring 2013)
– Identified a gene that contributes to the development of acute myeloid
leukemia (AML)
– Data mining indicated that the CUX1 gene was the most significantly
differentially expressed gene in cells that had lost chromosome 7; this gene
encodes a tumor-suppressor protein
– The researchers also identified a CUX1 fusion transcript, in other words,
part of CUX1 fused to another gene.
– They hypothesized, and verified, that this disruption in CUX1 may contribute
to the growth of abnormal blood cells, a hallmark of AML.
– Citation shown on the slide: A Fernald, RJ Bergerson, J Wang, ME McNerney,
T Karrison, J Anastasi, ..., "Retroviral Insertional Mutagenesis In Egr1+/- mice,
Haploinsufficient For a Human Del(5q) Myeloid Leukemia Gene, Develop Myeloid
Neoplasms With Proviral Insertions In Genes Syntenic To Human 5q", Blood 122 (21)
Slide 17/19
Slide 17/19
How much storage do I need?
• Cancer and normal genome of a human: 1 terabyte (10^12 bytes)
• 1 M genomes = 1 exabyte (10^18 bytes) → Cost of US
$100M/year
• Further sources of data: Electronic health records
– Includes diagnoses and notes on treatment
• Data mining also points to a relation between drug dosage and
patient factors such as age
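The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the per-terabyte cost is derived from them and is not a quoted price.

```python
BYTES_PER_PATIENT = 10**12        # 1 terabyte: cancer + normal genome of one person
PATIENTS = 10**6                  # 1 M genomes
ANNUAL_COST_USD = 100 * 10**6     # US $100M/year for the whole collection

total_bytes = BYTES_PER_PATIENT * PATIENTS
total_terabytes = total_bytes // 10**12
cost_per_tb_year = ANNUAL_COST_USD / total_terabytes

print(total_bytes == 10**18)      # True: 1 M terabytes is 1 exabyte
print(cost_per_tb_year)           # 100.0 (US dollars per terabyte per year)
```

In other words, the slide's estimate implies roughly US $100 per patient per year for storage alone.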
Slide 18/19
Example Influential Paper
• In Usenix Security 2014: “Privacy in Pharmacogenetics: An End-to-End Case
Study of Personalized Warfarin Dosing,” Matthew Fredrikson, Eric Lantz,
Somesh Jha, Simon Lin, David Page, Thomas Ristenpart
• A case study of warfarin dosing, a popular target for pharmacogenetic modeling
• Warfarin is an anticoagulant widely used to help prevent strokes in patients
suffering from atrial fibrillation (a type of irregular heartbeat)
• However, it is known to exhibit a complex dose-response relationship affected
by multiple genetic markers, with improper dosing leading to increased risk of
stroke or uncontrolled bleeding.
• A long line of work has sought pharmacogenetic models that can accurately
predict proper dosage based on patient clinical history, demographics, and
genotype
• Their study used a dataset collected by the International Warfarin
Pharmacogenetics Consortium (IWPC), to date the most expansive such
database, containing demographic information, genetic markers, and clinical
histories for thousands of patients from around the world
Slide 19/19