Enabling Genomic Big Data
Enabling Genomic BIG DATA with Content Centric Networking
J.J. Garcia-Luna-Aceves
UC Santa Cruz
[email protected]
Example Today
Cancer Genomics Hub (CGHub)
CGHub's purpose was to store the genomes sequenced as part of The Cancer Genome Atlas (TCGA) project.
At about 300 GB per genome, this translated to about 17,000 genomes over the 44-month lifetime of the project.
Transmission requirements of the archiving effort reached a sustained rate of 17 Gbps by the end of the 44-month project.
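A rough back-of-envelope check of the scale behind these numbers (a sketch in Python; the genome size, genome count, and sustained rate are the figures quoted above, the rest is unit conversion):

```python
# Back-of-envelope scale of the CGHub archive and its egress
# (a rough sketch based only on the figures quoted above).

genomes = 17_000         # genomes archived over the project lifetime
gb_per_genome = 300      # approximate size of one sequenced genome, in GB

archive_pb = genomes * gb_per_genome / 1e6
print(f"Archive size: ~{archive_pb:.1f} PB")                # ~5.1 PB

sustained_gbps = 17      # sustained download rate by the end of the project
tb_per_day = sustained_gbps / 8 * 86_400 / 1_000
print(f"Egress at 17 Gbps: ~{tb_per_day:.0f} TB per day")   # ~184 TB/day
```

The archive itself is only a few petabytes; it is the repeated downloading of those genomes (roughly 30 PB cumulative, per the next slide) that drives the multi-gigabit sustained rates.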
Cancer Genomics Hub (UCSC) is Housed in SDSC CoLo: Large Data Flows to End Users
[Figure: cumulative TBs of CGHub files downloaded, reaching roughly 30 PB, with rate markers at 1G, 8G, and 15G. Data source: David Haussler, Brad Smith, UCSC]
Example Today
CGHub had to use current technology:
Data organization and search: XML schema definitions
Security: existing techniques, HTTPS
Big data transfer: modified BitTorrent (GeneTorrent, or GT)
• HTTPS and BitTorrent are problematic
• No caching with HTTPS
• TCP limitations percolate to multiple connections under BT or GT
• A potential playground for DDoS?
The Future of Genomic BIG DATA
Is the Internet ready to support personalized medicine?
Is the future of genomic data really different?
If not, what technology would be limiting progress?
First:
Genomic data are really BIG DATA.
Personalized medicine will make genomic data volumes explode, and many other applications of genomic data will develop.
Even if one site or a few mirrors are used for a personal genome, it has to be uploaded.
Is Technology Ready in 5-10 Years?
Communication, storage, and computing technologies are not the problem:
– Production optical transport @ 1 Tbps (http://www.lightreading.com/document.asp?doc_id=188442&)
– Individual hosts able to transmit at 100 Gbps
– I/O throughput can keep up with network speeds (i.e., disks will be able to handle 100 Gbps = 12.5 GBps)
– Memory and processing costs will continue to decline.
Networking is The BIG PROBLEM for Genomic BIG DATA
The speed of light will not increase, but the number of genomic data repositories and the distance between them will.
The Internet protocol stack was not designed for BIG DATA transfer over paths with large bandwidth-delay products:
– TCP throughput (see the sketch below)
– DDoS vulnerabilities (e.g., SYN flooding)
– Caching vs. privacy (e.g., HTTPS)
– Static directory services (e.g., DNS vs. content directories).
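To make the TCP throughput bullet concrete, here is a minimal sketch (not from the slides) using the well-known Mathis et al. approximation for loss-limited TCP throughput; the 40 Gbps link rate and 25 ms RTT are assumptions chosen to match the simulation parameters on the next slide, and the loss rate is an illustrative guess:

```python
# Rough illustration of why TCP throughput is the bottleneck on
# large bandwidth-delay-product paths (a sketch, not from the slides).
from math import sqrt

link_gbps = 40        # assumed link rate
rtt_s = 0.025         # assumed 25 ms round-trip time
mss_bytes = 1460      # typical TCP maximum segment size
loss_rate = 1e-6      # assumed packet-loss probability

# Window needed to keep the pipe full: bandwidth * delay.
bdp_bytes = link_gbps * 1e9 / 8 * rtt_s
print(f"Required window: ~{bdp_bytes / 1e6:.0f} MB")          # ~125 MB

# Mathis approximation for loss-limited TCP throughput:
# rate ~= (MSS / RTT) * (1.22 / sqrt(p))
rate_bps = (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_rate)
print(f"Loss-limited throughput: ~{rate_bps / 1e9:.2f} Gbps") # well under 40 Gbps
```

Even a one-in-a-million loss rate caps a single connection at well under 1 Gbps on this path, and the 125 MB window needed to fill the pipe far exceeds typical TCP buffer sizes; opening many parallel connections, as BT or GT does, spreads these limits rather than removing them.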
Sobering Results for Today’s Internet
TCP and variations (e.g., BT) cannot be the baseline to support big data genomics.
Storage must be used to reduce bandwidth-delay products (sketched below).
Simulation results:
- 4-day simulation
- 20 locations
- 40 Gbps links with 5 to 25 ms latency
- average degree of 5
[Figure: simulation results comparing the TCP (client/server) approach against the content-centric approach]
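As a companion sketch (again with assumed numbers, not results from the simulation above), the calculation below illustrates why storage must be used to reduce bandwidth-delay products: serving a replica from nearby storage shrinks the RTT, which raises the loss-limited TCP rate and cuts the time to move a 300 GB genome.

```python
# Sketch: a shorter path means a smaller RTT, hence a smaller
# bandwidth-delay product and a higher loss-limited TCP rate,
# so a ~300 GB genome transfers far faster from nearby storage.
from math import sqrt

mss_bytes = 1460
loss_rate = 1e-6                 # assumed packet-loss probability
genome_bits = 300e9 * 8          # one ~300 GB genome

for label, rtt_s in [("distant origin (25 ms RTT)", 0.025),
                     ("nearby cache    (5 ms RTT)", 0.005)]:
    rate_bps = (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_rate)
    hours = genome_bits / rate_bps / 3600
    print(f"{label}: ~{rate_bps / 1e9:.1f} Gbps, one genome in ~{hours:.1f} h")
```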
Internetworking BeND
The TCP/IP architecture must change for BIG DATA, but how?
Content Centric Networking (CCN) architectures such as NDN and CCNx have been proposed.
The main advantage of CCN solutions is caching.
But… NDN and CCNx are still at early stages of development.
Big Data Networking is all about the bandwidth-delay product, not replacing IP addresses with names.