PCGP IT Impact


The St. Jude Children’s Research Hospital/Washington University Pediatric Cancer Genome Project: A CIO’s Perspective
Clayton W. Naeve, Ph.D.
Endowed Chair in Bioinformatics
SVP & CIO
St. Jude Children’s Research Hospital
St. Jude Data: The First 50 Years
[Chart: data volume in terabytes (axis 0 to 2000) for Admin/Clinical and Research data. Annotations: 48 Years (800 TB); 2 1/2 Years (1000 TB); PCGP Data: 917 TB, 148 million files.]
The Data Deluge
St. Jude/WashU Pediatric Cancer Genome Project
• Launched Feb. 2010
• St. Jude/WashU collaboration
• WGS on 600 patients (leukemia, brain tumors, solid tumors)
• Matched germline and tumor samples
• 1200 genomes (~90 billion bp/genome) in 36 months
• ~2 Petabytes of data
The PCGP Project
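The project figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming two genomes per patient (one germline, one tumor) and decimal units (1 PB = 1000 TB); the constant and function names are illustrative and do not come from the slides.

```python
# Back-of-the-envelope check of the PCGP sizing figures quoted on the slide.
# Assumptions (not stated on the slide): 2 genomes per patient, 1 PB = 1000 TB.

PATIENTS = 600
GENOMES_PER_PATIENT = 2           # matched germline + tumor (assumed 1:1)
BP_PER_GENOME = 90_000_000_000    # ~90 billion bp sequenced per genome
TOTAL_DATA_PB = 2.0               # ~2 petabytes of data
PROJECT_MONTHS = 36


def total_genomes(patients: int = PATIENTS,
                  per_patient: int = GENOMES_PER_PATIENT) -> int:
    """Number of genomes sequenced across all patients."""
    return patients * per_patient


def total_base_pairs(genomes: int, bp_per_genome: int = BP_PER_GENOME) -> int:
    """Total base pairs sequenced for the given number of genomes."""
    return genomes * bp_per_genome


def tb_per_genome(total_pb: float, genomes: int) -> float:
    """Average data volume per genome in terabytes (decimal units)."""
    return total_pb * 1000 / genomes


def genomes_per_month(genomes: int, months: int = PROJECT_MONTHS) -> float:
    """Average sequencing rate needed to finish within the project window."""
    return genomes / months


if __name__ == "__main__":
    n = total_genomes()
    print("genomes:", n)
    print("total bp:", total_base_pairs(n))
    print("TB per genome:", round(tb_per_genome(TOTAL_DATA_PB, n), 2))
    print("genomes per month:", round(genomes_per_month(n), 1))
```

Under these assumptions the totals work out to 1200 genomes, roughly 1.7 TB of data per genome, and about 33 genomes per month.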
Challenges to Information Sciences
• Moving data
• Data workflow
• Data analysis
• Computational horsepower
• Data storage
• Data sharing
PCGP Challenges
• Multi-terabyte data transit across networks is not trivial
• DNA sequence raw data reads, contig assembly, alignment to reference, variants, etc. shipped to SJCRH as binary BAM files: ~100 GB
• 24 hrs to infinity to send via commodity internet
• Internet2 connectivity (10 Gbps via MRC) to transfer files from WashU to SJCRH
• Evaluated 5 different fast data transfer algorithms; selected FDT (developed at Caltech to transfer LHC data at CERN)
• Developed a pipeline to facilitate transfer
• Today: ~5 hour transit time/file
Moving Data
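The gap between link speed and observed transit time can be made concrete with a short calculation. This is a rough sketch, assuming decimal units (1 GB = 8 gigabits) and assuming the ~5 hour figure applies to a file of about 100 GB, which the slide does not state explicitly. It ignores protocol overhead, disk I/O, and checksumming; the function names are illustrative.

```python
# Idealized transfer-time arithmetic for a single large file.
# Assumption: sizes are decimal gigabytes, link rates are gigabits per second.


def transfer_time_seconds(size_gb: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Time to move size_gb over a link, given the usable fraction of it."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    size_gigabits = size_gb * 8
    return size_gigabits / (link_gbps * efficiency)


def effective_gbps(size_gb: float, elapsed_hours: float) -> float:
    """Average throughput implied by an observed end-to-end transfer time."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be > 0")
    return size_gb * 8 / (elapsed_hours * 3600)


if __name__ == "__main__":
    # Theoretical floor for ~100 GB on a fully used 10 Gbps link.
    print(transfer_time_seconds(100, 10), "seconds at line rate")
    # Throughput implied by ~5 hours for ~100 GB (assumed file size).
    print(round(effective_gbps(100, 5) * 1000, 1), "Mbps effective")
    # Throughput implied by the 24 hour commodity-internet figure.
    print(round(effective_gbps(100, 24) * 1000, 1), "Mbps effective")
```

At line rate a 100 GB file would cross a 10 Gbps link in 80 seconds, so the observed transit time suggests that end-to-end steps other than raw link bandwidth dominate.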
[Diagram: HPCF network architecture. Components shown: IBM iDataplex (1008 cores/4 TB RAM); IBM BladeCenter Cluster (810 cores/3 TB RAM); SGI Altix UV1000 (640 cores/5 TB RAM); IBM SoNAS (734 TB usable); SGI IS5500 (60 TB usable); PVFS Servers; ESX Cluster (29 servers); Data Transfer Node (dtn01); Internal Data Transfer Node (datamover); Mellanox Grid Director 4036; Mellanox IS5200 Chassis Switch; Mellanox BridgeX BX5020; 10 GE Campus Network.]
Moving Data
• Began work on PCGP 9 months prior to launch
• Developed a LIMS for the Validation Lab
• Developed a PCGP SharePoint site to facilitate collaboration internally
• Developed a bioinformatics workflow engine: PALLAS
  • Security management
  • Data provenance management
  • Intermediate and final result tracking
  • Flexible workflow design
  • Rapid configuration of new analytical algorithms/tools
  • Web-based LSF job submission and monitoring
  • Supports a range of protocols to connect to other web application systems, databases, file systems, etc.
  • Integrated with applications such as SRM, Genome Browser, etc.
  • Data integration with tissue sample, clinical, and research data
  • Vision: parse each algorithm to the appropriate computing environment
Data Workflow
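The vision stated above, sending each algorithm to the computing environment that suits it, could be sketched as a simple rule-based dispatcher. This is not PALLAS code: the `Step` class, the memory threshold, and the resource figures below are hypothetical and chosen only for illustration. The environment names (cluster, large memory, GPU) are labels that appear elsewhere in the deck.

```python
# Hypothetical sketch of routing analysis steps to computing environments.
# Nothing here reflects the actual PALLAS implementation.
from dataclasses import dataclass

# Assumed cutoff above which a step is sent to a large shared-memory machine.
LARGE_MEMORY_THRESHOLD_GB = 256


@dataclass(frozen=True)
class Step:
    name: str
    memory_gb: int          # peak memory the step is expected to need
    uses_gpu: bool = False  # whether the step has a GPU implementation


def choose_environment(step: Step) -> str:
    """Pick an environment label using two simple rules, GPU first."""
    if step.uses_gpu:
        return "gpu"
    if step.memory_gb > LARGE_MEMORY_THRESHOLD_GB:
        return "large memory"
    return "cluster"


if __name__ == "__main__":
    # Resource numbers are invented for the example.
    steps = [
        Step("contig assembly", memory_gb=1024),
        Step("SNV calling", memory_gb=8),
        Step("example GPU step", memory_gb=4, uses_gpu=True),
    ]
    for s in steps:
        print(f"{s.name} -> {choose_environment(s)}")
```

A real engine would also need queue selection, data staging, and provenance capture; the point of the sketch is only that routing can be driven by declared resource needs rather than hard-coded per tool.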
Jinghui Zhang and CompBio Team
• BAM Quality Assurance:
  • Tumor Purity Algorithm (SJCRH)
  • Not Disease/Genomic Swap (SNP checks)
  • Xenograft Filter (Remove Contaminating Mouse Reads)
  • Gene Exon and Genome Coverage algorithms (Gang Wu)
• BAM file work:
  • BAM file extraction and visualization
  • SAMtools and C++/BioPerl APIs
  • Bambino
  • IGV
• Single Nucleotide Variation:
  • FreeBayes
  • In-house PCGP
• Copy Number Variation:
  • Stan’s Copy Number Algorithm
  • Regression Tree Algorithm
• Structural Variation:
  • One End Anchored Inference:
  • CREST
  • ViralTopology
Data Analyses
• Fusion Detection:
  • In-house (Michael Rusch)
• RNA-seq:
  • RNA-seq MySQL/Cufflinks
• ChIP-seq:
  • ChIP-seq MySQL/in-house (John Obenauer)
• viralScan
  • In-house (McGoldrick)
• Integration:
  • GFF intersect
  • Gff2fasta
  • gffBuilders
  • Cancer warehouse
• Visualization:
  • Circos maker
  • BED GFF Tracks maker
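The slide lists "GFF intersect" without describing it. As a generic illustration of what an interval intersection step does, and not a description of the in-house tool, the sketch below intersects two feature lists on a single sequence. It assumes GFF-style 1-based inclusive coordinates and that each input list is sorted and non-overlapping; the function name is illustrative.

```python
# Generic interval intersection on one sequence, using GFF-style coordinates
# (1-based, inclusive). Inputs are assumed sorted and internally non-overlapping.
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:            # inclusive ends: a shared base counts
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


if __name__ == "__main__":
    exons = [(1, 10), (20, 30)]
    variants_region = [(5, 25)]
    print(intersect(exons, variants_region))
```

Real GFF files carry a sequence ID and strand per feature, so a practical tool would group features by sequence before running a sweep like this.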
• IBM BladeCenter (810 cores/3 TB RAM)
• IBM iDataplex (1,008 cores/4 TB RAM) – April 2010
• SGI Altix UV1000 (640 cores/5 TB RAM/60 TB storage using Lustre v2.2) – December 2011
• IBM SoNAS (780 TB) – March 2011
• Data Transfer Node (10 Gbps I2 connection) – April 2011
• Internal Data Transfer Node (10 Gbps x2) – June 2011
• QDR InfiniBand (40 Gbps for all HPC equipment) – January 2012
• Software (Platform LSF, Intel Parallel Studio)
• Total: 2,366 cores, 13 TB RAM (estimated 11.6 TFLOPS)
• 2010: 365,000 CPU hours
• 2011: 712,000 CPU hours
Computational Horsepower (HPCF)
• IBM SoNAS (780 TB) – March 2011
  • 240 SAS (15k) drives
  • 480 SAS-NL (7.2k) drives
  • 734 TB usable under one file system
  • High speed/low latency backend interconnect (QDR InfiniBand 20 Gb per port and 100 ns latency)
  • Scales to 21 PB; 1 billion files/filesystem; 7,200 drives
• Current total on campus: 3.8 Petabytes (3,800,000 GB)
• PCGP uses 917 TB (+500 TB on tape), 148 million data files
• IBM TSM systems for backup/archive (tiered)
  • Current 7,900 tape capacity, up to 1.6 TB/tape; 12.6+ PB total
Data Storage
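Several of the storage figures above can be checked against each other. The sketch below is a simple consistency calculation, assuming decimal units (1 PB = 1000 TB, 1 TB = 1,000,000 MB); the function names are illustrative.

```python
# Consistency checks on the storage figures quoted on the slide.
# Assumption: decimal units throughout.


def tape_capacity_pb(tapes: int, tb_per_tape: float) -> float:
    """Maximum tape library capacity in petabytes."""
    return tapes * tb_per_tape / 1000


def average_file_size_mb(total_tb: float, file_count: int) -> float:
    """Mean file size in megabytes for a dataset of total_tb terabytes."""
    return total_tb * 1_000_000 / file_count


def usable_fraction(usable_tb: float, raw_tb: float) -> float:
    """Share of the quoted capacity that is usable by the file system."""
    return usable_tb / raw_tb


if __name__ == "__main__":
    print("tape capacity (PB):", round(tape_capacity_pb(7900, 1.6), 2))
    print("avg PCGP file (MB):", round(average_file_size_mb(917, 148_000_000), 1))
    print("SoNAS usable fraction:", round(usable_fraction(734, 780), 3))
```

The tape figure reproduces the "12.6+ PB" on the slide, and 917 TB spread over 148 million files implies an average file size of only about 6 MB, so the file count is a scaling concern alongside raw capacity.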
>356 Patients/712 Complete Genomes
Gene sequencing project identifies potential drug targets in common childhood brain tumor
Nature
June 20, 2012
Researchers studying the genetic roots of the most common malignant childhood brain tumor have discovered missteps in three of the four subtypes of the cancer that involve genes already targeted for
drug development. The most significant gene alterations are linked to subtypes of medulloblastoma that currently have the best and worst prognosis. They were among 41 genes associated for the first
time with medulloblastoma by the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project.
World's largest release of comprehensive human cancer genome data helps researchers everywhere speed discoveries
Nature Genetics
May 29, 2012
To speed progress against cancer and other diseases, the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project today announced the largest-ever release of
comprehensive human cancer genome data for free access by the global scientific community. The amount of information released more than doubles the volume of high-coverage, whole genome data
currently available from all human genome sources combined. This information is valuable not just to cancer researchers, but also to scientists studying almost any disease.
Genome sequencing initiative links altered gene to age-related neuroblastoma risk
Journal of the American Medical Association
March 13, 2012
St. Jude Children’s Research Hospital – Washington University Pediatric Cancer Genome Project and Memorial Sloan-Kettering Cancer Center discover the first gene alteration associated with patient
age and neuroblastoma outcome. Researchers have identified the first gene mutation associated with a chronic and often fatal form of neuroblastoma that typically strikes adolescents and young adults.
The finding provides the first clue about the genetic basis of the long-recognized but poorly understood link between treatment outcome and age at diagnosis.
Cancer sequencing initiative discovers mutations tied to aggressive childhood brain tumors
Nature Genetics
January 29, 2012
Findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) offer important insight into a poorly understood tumor that kills more than
90 percent of patients within two years. The tumor, diffuse intrinsic pontine glioma (DIPG), is found almost exclusively in children and accounts for 10 to 15 percent of pediatric tumors of the brain and
central nervous system.
Cancer sequencing project identifies potential approaches to combat aggressive leukemia
Nature
January 11, 2012
Researchers with the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have discovered that a subtype of leukemia characterized by a poor
prognosis is fueled by mutations in pathways distinctly different from a seemingly similar leukemia associated with a much better outcome. The work provides the first details of the genetic alterations
fueling a subtype of acute lymphoblastic leukemia (ALL) known as early T-cell precursor ALL (ETP-ALL). The results suggest ETP-ALL has more in common with acute myeloid leukemia (AML)
than with other subtypes of ALL.
Gene identified as a new target for treatment of aggressive childhood eye tumor
Nature
January 11, 2012
New findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have helped identify the mechanism that makes the childhood eye
tumor retinoblastoma so aggressive. The discovery explains why the tumor develops so rapidly while other cancers can take years or even decades to form. The finding also led investigators to a new
treatment target and possible therapy for the rare childhood tumor of the retina, the light-sensing tissue at the back of the eye.
Progress
http://www.pediatriccancergenomeproject.org
http://explore.pediatriccancergenomeproject.org
Data Sharing
• Data integration is critical: platform data (expression, WGS, methylation, etc.) and processed “genomics” data must be integrated with phenotype data (clinical care, clinical research)
Data Sharing
[Organization chart. Units shown: 19 Academic Departments; Computational Biology; Information Sciences; Shared Resources; Enterprise Informatics; Bioinformatics; PCGP; Research Informatics; Clinical Informatics; HPC; Offshore Developers. Staffing labels shown, in slide order: 2 PhD; 2 Support; 8-10 Faculty; 50-60 Support Staff; 10 PhD; 2 developers; 5 PhD; 127 FTEs; 56 FTEs; 1 Dev.; 81 FTEs; 15 FTEs.]
Total: >150 FTEs with “research informatics” skills
Key: Staff
• Project total cost: $65M (11 Illuminas @ WashU and 4 @ SJCRH, sequencing costs, staffing, IT, etc.)
• New “IT” staff @ SJCRH: 10 FTEs in CompBio, 0 FTEs in IS
• Capital IT investment: ~$7.2M at SJCRH, $9M at WashU
• IT is ~25% of overall project costs (doesn’t include costs of other participating SJ FTEs)
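The ~25% figure follows directly from the capital numbers on the slide. A minimal check, assuming the share is computed as combined capital IT investment at both sites divided by total project cost (the function name is illustrative):

```python
# Share of project cost attributable to capital IT investment.
# Assumption: share = (SJCRH IT + WashU IT) / total project cost.


def it_share_percent(sjcrh_it_musd: float, washu_it_musd: float,
                     total_musd: float) -> float:
    """Capital IT investment as a percentage of total project cost."""
    if total_musd <= 0:
        raise ValueError("total_musd must be > 0")
    return (sjcrh_it_musd + washu_it_musd) / total_musd * 100


if __name__ == "__main__":
    # Figures in millions of US dollars, as quoted on the slide.
    print(round(it_share_percent(7.2, 9.0, 65.0), 1), "percent")
```

With $7.2M and $9M against a $65M total, the share comes to just under 25%.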
Information Sciences PCGP Team
• Ashish Pagare
• David Zhao
• Dan Alford
• Stephen Espy
• Kiran Chand Bobba
• Scott Malone
• Dr. Antonio Ferreira
• Bill Pappas
• James McMurry
• Dr. Jianmin Wang
• Dr. John Obenauer
• Jared Becksfort
• Pankaj Gupta
• Dr. Suraj Mukatira
Key: Staff
• Simon Hagstrom
• Sundeep Shakya
• Asmita Vaidya
• Swetha Mandava
• Bhagavathy Krishna
• Manohar Gorthi
• Sandhya Rani Kolli
• Sivaram Chintalapudi
• Roshan Shrestha
• Irina McGuire
• PJ Stevens
• Thanh Le
• John Penrod
• Pat Eddy
• Dr. Dan McGoldrick
Questions?
[Diagram: Data Workflow. PALLAS shown with analysis types (contig assembly, SV, CNV, INDELS, SNV, CIRCOS) and computing environments (cluster, large memory, GPU).]