Enabling Genomic Big Data
Enabling Genomic BIG DATA with Content Centric Networking
J.J. Garcia-Luna-Aceves
UC Santa Cruz
[email protected]
Example Today
Cancer Genomics Hub (CGHub)
CGHub's purpose was to store the genomes sequenced as part of The Cancer Genome Atlas (TCGA) project.
At about 300 GB per genome, this translated to about 17,000 genomes over the 44-month lifetime of the project.
Transmission requirements of the archiving effort reached a sustained rate of 17 Gbps by the end of the 44-month project.
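A rough back-of-envelope check of the scale behind these numbers (a sketch in Python; the genome size, genome count, and sustained rate are the figures quoted above, the rest is unit conversion):

```python
# Back-of-envelope scale of the CGHub archive and its egress
# (a rough sketch based only on the figures quoted above).

genomes = 17_000         # genomes archived over the project lifetime
gb_per_genome = 300      # approximate size of one sequenced genome, in GB

archive_pb = genomes * gb_per_genome / 1e6
print(f"Archive size: ~{archive_pb:.1f} PB")                # ~5.1 PB

sustained_gbps = 17      # sustained download rate by the end of the project
tb_per_day = sustained_gbps / 8 * 86_400 / 1_000
print(f"Egress at 17 Gbps: ~{tb_per_day:.0f} TB per day")   # ~184 TB/day
```

The archive itself is only a few petabytes; it is the repeated downloading of those genomes (roughly 30 PB cumulative, per the next slide) that drives the multi-gigabit sustained rates.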
Cancer Genomics Hub (UCSC) is Housed in SDSC CoLo: Large Data Flows to End Users
[Figure: cumulative TBs of CGHub files downloaded, reaching roughly 30 PB, with rate markers at 1G, 8G, and 15G. Data source: David Haussler, Brad Smith, UCSC]
Example Today
CGHub had to use current technology:
Data organization and search: XML schema definitions
Security: existing techniques, HTTPS
Big data transfer: modified BitTorrent (GeneTorrent, or GT)
• HTTPS and BitTorrent are problematic
• No caching with HTTPS
• TCP limitations percolate to multiple connections under BT or GT
• A potential playground for DDoS?
The Future of Genomic BIG DATA
Is the Internet ready to support personalized medicine?
Is the future of genomic data really different?
If not, what technology would be limiting progress?
First:
Genomic data are really BIG DATA.
Personalized medicine will make genomic data volumes explode, and many other applications of genomic data will develop.
Even if one site or a few mirrors are used for a personal genome, it has to be uploaded.
Is Technology Ready in 5-10 Years?
Communication, storage, and computing technologies are not the problem:
– Production optical transport @ 1 Tbps (http://www.lightreading.com/document.asp?doc_id=188442&)
– Individual hosts able to transmit at 100 Gbps
– I/O throughput can keep up with network speeds (i.e., disks will be able to handle 100 Gbps = 12.5 GBps)
– Memory and processing costs will continue to decline.
Networking is The BIG PROBLEM for Genomic BIG DATA
The speed of light will not increase, but the number of genomic data repositories and the distance between them will.
The Internet protocol stack was not designed for BIG DATA transfer over paths with large bandwidth-delay products:
– TCP throughput (see the sketch below)
– DDoS vulnerabilities (e.g., SYN flooding)
– Caching vs. privacy (e.g., HTTPS)
– Static directory services (e.g., DNS vs. content directories).
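To make the TCP throughput bullet concrete, here is a minimal sketch (not from the slides) using the well-known Mathis et al. approximation for loss-limited TCP throughput; the 40 Gbps link rate and 25 ms RTT are assumptions chosen to match the simulation parameters on the next slide, and the loss rate is an illustrative guess:

```python
# Rough illustration of why TCP throughput is the bottleneck on
# large bandwidth-delay-product paths (a sketch, not from the slides).
from math import sqrt

link_gbps = 40        # assumed link rate
rtt_s = 0.025         # assumed 25 ms round-trip time
mss_bytes = 1460      # typical TCP maximum segment size
loss_rate = 1e-6      # assumed packet-loss probability

# Window needed to keep the pipe full: bandwidth * delay.
bdp_bytes = link_gbps * 1e9 / 8 * rtt_s
print(f"Required window: ~{bdp_bytes / 1e6:.0f} MB")          # ~125 MB

# Mathis approximation for loss-limited TCP throughput:
# rate ~= (MSS / RTT) * (1.22 / sqrt(p))
rate_bps = (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_rate)
print(f"Loss-limited throughput: ~{rate_bps / 1e9:.2f} Gbps") # well under 40 Gbps
```

Even a one-in-a-million loss rate caps a single connection at well under 1 Gbps on this path, and the 125 MB window needed to fill the pipe far exceeds typical TCP buffer sizes; opening many parallel connections, as BT or GT does, spreads these limits rather than removing them.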
Sobering Results for Today’s Internet
TCP and variations (e.g., BT) cannot be the baseline to support big data genomics.
Storage must be used to reduce bandwidth-delay products (sketched below).
Simulation results:
- 4-day simulation
- 20 locations
- 40 Gbps links with 5 to 25 ms latency
- average degree of 5
[Figure: simulation results comparing the TCP (client/server) approach against the content-centric approach]
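As a companion sketch (again with assumed numbers, not results from the simulation above), the calculation below illustrates why storage must be used to reduce bandwidth-delay products: serving a replica from nearby storage shrinks the RTT, which raises the loss-limited TCP rate and cuts the time to move a 300 GB genome.

```python
# Sketch: a shorter path means a smaller RTT, hence a smaller
# bandwidth-delay product and a higher loss-limited TCP rate,
# so a ~300 GB genome transfers far faster from nearby storage.
from math import sqrt

mss_bytes = 1460
loss_rate = 1e-6                 # assumed packet-loss probability
genome_bits = 300e9 * 8          # one ~300 GB genome

for label, rtt_s in [("distant origin (25 ms RTT)", 0.025),
                     ("nearby cache    (5 ms RTT)", 0.005)]:
    rate_bps = (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_rate)
    hours = genome_bits / rate_bps / 3600
    print(f"{label}: ~{rate_bps / 1e9:.1f} Gbps, one genome in ~{hours:.1f} h")
```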
Internetworking BeND
The TCP/IP architecture must change for BIG DATA, but how?
Content Centric Networking (CCN) architectures such as NDN and CCNx have been proposed.
The main advantage of CCN solutions is caching.
But… NDN and CCNx are still at early stages of development.
Big Data Networking is all about the bandwidth-delay product, not replacing IP addresses with names.