NCBioGrid Project Background


North Carolina
Bioinformatics Grid
Thom H. Dunning, Jr.
HPCC Division, MCNC
Chemistry, University of North Carolina
Genomics
A Compute- & Data-Intensive Science
* from TimeLogic
Data Explosion
Rapid Growth of GenBank
[Figure: Growth of GenBank, 1982 to 2002. Vertical axis: No. Gbases, 0 to 20.]
Number of base pairs increasing dramatically (exponentially)
Growth in 2002 due to additions in just 21 days!
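
To make "exponentially" concrete: if a database grows from N0 to N1 bases over a span of some years at a constant exponential rate, its doubling time is that span divided by log2(N1/N0). The sketch below computes this. The function name `doubling_time` and the sample figures are illustrative placeholders, not values read from the GenBank chart.

```python
import math


def doubling_time(n_start: float, n_end: float, years: float) -> float:
    """Doubling time in years for growth from n_start to n_end over `years`,
    assuming a constant exponential growth rate."""
    if n_start <= 0 or n_end <= n_start or years <= 0:
        raise ValueError("need 0 < n_start < n_end and years > 0")
    return years / math.log2(n_end / n_start)


# Hypothetical example (NOT GenBank data): an 8-fold increase over 6 years
# is three doublings, so one doubling every 2 years.
print(doubling_time(1.0, 8.0, 6.0))
```

Plugging in two points read off a growth chart would give a rough doubling time for that interval, under the constant-rate assumption.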
Data Explosion
Number and Diversity of Databases
Nucleic Acids Research, 2002, Vol. 30, No. 1
Table 1. Molecular Biology Database Collection
Major Public Sequence Repositories
DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp
All known nucleotide and protein sequences
…
333 Databases
Varied Biomedical Content
…
VirOligo
http://viroligo.okstate.edu
Virus-specific oligonucleotides for PCR and
…
Computing Explosion
Assembly and Analysis of Genomic Data
Celera Genomics: Assembling the Genome
• Compaq Alpha Clusters
• Number of processors: ~750
• Peak performance: 1 teraops
NuTech Sciences: Mining the Genome
• IBM p640 System
• Number of processors: ~5,000
• Peak performance: 7½ teraops
• Total memory: 2½ terabytes
• Total disk storage: 50 terabytes
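
The two systems are easier to compare per processor. The sketch below does that arithmetic from the figures on the slide. It assumes decimal units (1 tera = 1000 giga) and treats the approximate processor counts as exact. The names `SYSTEMS`, `per_processor`, and `summarize` are illustrative.

```python
# Figures as quoted on the slide; processor counts are approximate ("~").
SYSTEMS = {
    "Celera Genomics (Compaq Alpha clusters)": {
        "processors": 750,
        "peak_teraops": 1.0,
    },
    "NuTech Sciences (IBM p640 system)": {
        "processors": 5000,
        "peak_teraops": 7.5,
        "memory_tb": 2.5,
        "disk_tb": 50.0,
    },
}


def per_processor(total: float, processors: int) -> float:
    """Divide a system-wide total evenly across its processors."""
    return total / processors


def summarize(spec: dict) -> dict:
    """Per-processor figures in giga-units (assumes 1 tera = 1000 giga)."""
    n = spec["processors"]
    out = {"gigaops_per_processor": per_processor(spec["peak_teraops"] * 1000, n)}
    if "memory_tb" in spec:
        out["memory_gb_per_processor"] = per_processor(spec["memory_tb"] * 1000, n)
    if "disk_tb" in spec:
        out["disk_gb_per_processor"] = per_processor(spec["disk_tb"] * 1000, n)
    return out


for name, spec in SYSTEMS.items():
    print(name, summarize(spec))
```

On these figures the Celera clusters come to roughly 1.3 gigaops of peak performance per processor. The NuTech system comes to 1.5 gigaops, 0.5 GB of memory, and 10 GB of disk per processor.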
Genomics
Meeting the Information Challenge
• Data Storage
• Network
• Grid Middleware
• Computers
North Carolina Supercomputing Center
North Carolina
Research and Education Network
[Map: NCREN sites and nodes: Winston-Salem, Boone, Greensboro, Elizabeth City, Rocky Mount, RTP, Asheville, Greenville, Fayetteville, Cullowhee, Charlotte, Pembroke, RTP RPoP, Duke, NCCU, Wilmington, NCSU, Qwest, MCNC, UNC-CH, Morehead City, NCSU Centennial Campus]
NCREN3
• Increased bandwidth
• Increased reliability
• Increased resiliency
Grid Technologies
Major New Computing Technology
• Under development since mid-1990s
Distinguishing Characteristics
• “Middleware” to support efficient resource sharing in a distributed, heterogeneous computing and data storage environment
• Focus on use of large-scale computing and data storage
Some Major Grid Efforts
• NASA IPG: Testbed linking selected NASA centers
• DataGrid: International Grid being developed for high-energy physics (CERN)
Grid Technologies (cont’d)
Some Major Grid Efforts (cont’d)
• GriPhyN: Research in Grid technologies for physics applications (Argonne, Florida)
• e-Science Grid: Major effort in UK to develop a Grid infrastructure for science and engineering research
• BIRN: Data Grid focused on neuroimaging data (UCSD, SDSC)
North Carolina
Genomics and Bioinformatics Consortium
Goal
• Provide a venue for Consortium members to share information and resources, plan strategic initiatives, and form alliances
Distributed Across North Carolina
• Concentration in Research Triangle, but extends across all of North Carolina
Diverse Goals and Expertise
• Human health, including animal models; agriculture and forestry; evolutionary biology basic research; tool development
Overall NC BioGrid Architecture
[Diagram: layers from top to bottom]
• BioApp #1, BioApp #2, BioApp #3, …: Grid-aware, -enabled bioinformatics applications
• Grid Middleware: Globus, Legion, …
• Network: NCREN3
• Computing and Data Resources: NCSC plus Member’s Computing Centers
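
The layered architecture on this slide can be sketched as a small data structure, with each layer resting on the one below it. This is only an illustration of the stacking order. The `Layer` class, `NC_BIOGRID_STACK`, and `layer_below` are hypothetical names, and the "…" entries are elided in the source and left as-is.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Layer:
    name: str
    examples: Tuple[str, ...]


# Top to bottom, in the order the slide lists the layers.
NC_BIOGRID_STACK = (
    Layer("Grid-aware, -enabled bioinformatics applications",
          ("BioApp #1", "BioApp #2", "BioApp #3", "…")),
    Layer("Grid middleware", ("Globus", "Legion", "…")),
    Layer("Network", ("NCREN3",)),
    Layer("Computing and data resources",
          ("NCSC", "Member's computing centers")),
)


def layer_below(name: str) -> Optional[Layer]:
    """Return the layer the named layer rests on, or None for the bottom layer."""
    names = [layer.name for layer in NC_BIOGRID_STACK]
    i = names.index(name)  # raises ValueError for an unknown layer name
    return NC_BIOGRID_STACK[i + 1] if i + 1 < len(NC_BIOGRID_STACK) else None


for layer in NC_BIOGRID_STACK:
    print(f"{layer.name}: {', '.join(layer.examples)}")
```

In this sketch the applications see only the middleware, and the middleware in turn hides the network and the individual computing centers.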
NC BioGrid Project
Two Phases
• Testbed Phase: test existing middleware, resolve issues, prepare detailed plan (12-18 months)
• Production Phase: create and operate NC BioGrid
Funding for Testbed from MCNC
Project Manager
• Phil Emer, MCNC, Chief Architect/NC BioGrid
Project Oversight
• MCNC Board of Directors
• HPCC Advisory Board
• NC BioGrid Technical Advisory Group