Protein Explorer: A Petaflops
Special Purpose Computer
System for Molecular Dynamics
Simulations
David Gobaud
Computational Drug Discovery
Stanford University
7 March 2006
Outline
Overview
Background
Delft Molecular Dynamics Processor
GRAPE
Protein Explorer Summary
MDGRAPE-3 Chip
Force Calculation Pipeline
J-Particle Memory and Control Units
System Architecture
Software
Cost
Questions
Overview
Protein Explorer
Petaflop special-purpose computer system for
molecular dynamics simulations
High-precision screening for drug design
Large-scale simulations of huge proteins/complexes
PC cluster with special-purpose engines to perform
the most time-consuming calculations
Dedicated LSI MDGRAPE-3 chip performs force
calculations at 165 Gflops or higher
ETA 2006
Background
PCs are universal machines
Various applications
Hardware can be designed independent of
applications
Obstacles to high performance
Memory bandwidth bottleneck
Heat dissipation problem
Can be overcome by developing specialized
architectures
Delft Molecular Dynamics
Processor (DMDP)
Pioneered high-performance special-purpose
systems
Not able to achieve effective cost-performance
Demanded too much time and money in the development stage
Speed of development is a crucial factor affecting cost-performance because electronic device technology
continues to develop rapidly
Almost all calculations were performed by the DMDP, making the
hardware very complex
GRAPE (GRAvity PipE)
One of the most successful attempts to
develop high-performance special-purpose
systems
Specialized for simulations of classical
particles
Most time spent on calculation of long-range
forces (gravitational, Coulomb, and van der
Waals)
Thus special hardware only performs these
calculations
Hardware very simple and cost-effective
GRAPE (GRAvity PipE)
In 1995 first machine to break teraflops
barrier in nominal peak performance
Since 2001, the performance leader has been the
Molecular Dynamics Machine at RIKEN, at 78 Tflops
In 2002, a 64-Tflops GRAPE-6 was completed at the
University of Tokyo
Protein Explorer was launched, building on the 2002
University of Tokyo success
Protein Explorer Summary
Host PC cluster with special purpose boards attached
Boards calculate only non-bonded forces
Communication time between host and boards is
proportional to number of particles
Requires only 0.25 byte of communication per 1000 operations
Calculation time proportional to
N^2 for direct summation of long-range forces
N*Nc for short-range forces, where Nc is the average number
of particles within the cutoff radius
Very simple hardware and software
No detailed knowledge of hardware needed to write
programs
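The two scaling laws on this slide can be checked by counting pair interactions. The following is a minimal sketch, not the Protein Explorer software: the lattice test system, the cutoff value, and the function name `count_interactions` are all assumptions made for illustration.

```python
import itertools
import math


def count_interactions(positions, cutoff=None):
    """Count ordered (i, j) pairs, i != j, whose force would be evaluated.

    With cutoff=None every pair counts (direct summation).
    With a cutoff, only pairs within that distance count.
    """
    count = 0
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i == j:
                continue
            if cutoff is None or math.dist(pi, pj) <= cutoff:
                count += 1
    return count


# Hypothetical test system: a 6 x 6 x 6 cubic lattice with unit spacing.
positions = [
    (float(x), float(y), float(z))
    for x, y, z in itertools.product(range(6), repeat=3)
]
n = len(positions)

all_pairs = count_interactions(positions)                # grows like N^2
near_pairs = count_interactions(positions, cutoff=1.5)   # grows like N * Nc
average_nc = near_pairs / n                              # average neighbors per particle

print(n, all_pairs, near_pairs, average_nc)
```

The naive double loop still scans every pair to find neighbors. It only illustrates how many force evaluations each scheme needs, not how a neighbor search would be organized in practice.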
MDGRAPE-3 Chip - Force
Calculation Pipeline
3 subtractor units
6 adder units
8 multiplier units
1 function-evaluation unit
Can perform ~33 equivalent
operations per cycle when it calculates the
Coulomb force
MDGRAPE-3 Chip - Force
Calculation Pipeline
Most operations done in 32-bit single
precision floating point format
Force accumulation is 80-bit fixed point
format
Can be converted to 64-bit double precision
floating point
Coordinates stored in 40-bit fixed-point
format
Makes implementation of periodic boundary
condition easy
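One way to see why fixed-point coordinates make periodic boundaries easy: if the box edge is mapped onto the full integer range, subtraction modulo 2^40 wraps around the box automatically. The sketch below assumes that mapping. The slides do not describe the chip's actual encoding, and the helper names are hypothetical.

```python
BITS = 40                 # coordinate width given on the slide
MODULUS = 1 << BITS       # assumption: one box edge spans the full 40-bit range
HALF = 1 << (BITS - 1)


def to_fixed(x, box_length):
    """Map a coordinate in [0, box_length) to an unsigned 40-bit integer."""
    return int(round((x / box_length) * MODULUS)) % MODULUS


def minimum_image_delta(a_fixed, b_fixed):
    """Signed displacement a - b to the nearest periodic image.

    The subtraction wraps modulo 2**40; reading the result as a signed
    40-bit number yields the shortest displacement with no branching
    on the box size.
    """
    d = (a_fixed - b_fixed) % MODULUS
    return d - MODULUS if d >= HALF else d


def to_float(d_fixed, box_length):
    """Convert a fixed-point displacement back to physical units."""
    return d_fixed * box_length / MODULUS


box = 10.0
a = to_fixed(9.5, box)
b = to_fixed(0.5, box)
# Particle a is 9.0 to the right of b, but only 1.0 to the left
# through the periodic boundary, so the result is close to -1.0.
print(to_float(minimum_image_delta(a, b), box))
```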
MDGRAPE-3 Chip - Force
Calculation Pipeline
Function Evaluator
Most important part of pipeline
Allows calculation of arbitrary smooth function
Has memory unit which contains a table for
polynomial coefficients and exponents and a
hardwired pipeline for fourth-order polynomial
evaluation
Interpolates an arbitrary smooth function g(x)
using segmented fourth-order polynomials by
Horner’s method
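The function evaluator can be sketched in software as a table lookup followed by Horner's method. This is an illustration under stated assumptions, not the chip's design: the segments here are uniform, the coefficients come from a Taylor expansion of g(x) = x^(-3/2) (the Coulomb force kernel as a function of squared distance), and all function names are hypothetical.

```python
def horner(coeffs, t):
    """Evaluate coeffs[0] + coeffs[1]*t + ... + coeffs[n]*t**n by Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def taylor_coeffs(center, power=-1.5, order=4):
    """Taylor coefficients of x**power about `center`, up to the given order."""
    coeffs = []
    binom = 1.0  # generalized binomial coefficient C(power, k)
    for k in range(order + 1):
        coeffs.append(binom * center ** (power - k))
        binom *= (power - k) / (k + 1)
    return coeffs


def build_table(x_min, x_max, segments):
    """One (center, coefficients) entry per uniform segment of [x_min, x_max]."""
    width = (x_max - x_min) / segments
    table = []
    for s in range(segments):
        center = x_min + (s + 0.5) * width
        table.append((center, taylor_coeffs(center)))
    return table, width


def evaluate(table, x_min, width, x):
    """Select the segment containing x, then run the fourth-order Horner step."""
    index = min(int((x - x_min) / width), len(table) - 1)
    center, coeffs = table[index]
    return horner(coeffs, x - center)


table, width = build_table(1.0, 4.0, 64)
for x in (1.0, 1.7, 2.5, 3.99):
    print(x, evaluate(table, 1.0, width, x), x ** -1.5)
```

Swapping the coefficient table changes the function being evaluated while the Horner step stays the same, which is the sense in which a single pipeline can handle an arbitrary smooth function.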
MDGRAPE-3 Chip - J-Particle
Memory and Control Units
20 Force Calculation Pipelines
j-Particle Memory Unit
Holds 32,768 bodies
“Main Memory”
6.6 Mbits constructed by static RAM
Cell-Index Controller
Controls j-Particle memory: generates addresses
Force Summation Unit
Master Controller
Manages timings and inputs/outputs of the chip
MDGRAPE-3 Chip
2 virtual pipelines/physical pipeline
Physical bandwidth of j-particle unit 2.5
Gbytes/sec but virtual bandwidth will
reach 100 Gbytes/sec
340 arithmetic units
20 function-evaluator units which work
simultaneously
165 Gflops at 250 MHz
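The chip-level figures can be cross-checked with simple arithmetic. The sketch below reads the ~33 equivalent operations as a per-pipeline, per-clock-cycle figure. It also assumes the virtual bandwidth arises because each word read from j-particle memory is shared by all 20 pipelines and both virtual pipelines. Both readings are interpretations, not statements from the slides.

```python
PIPELINES = 20
OPS_PER_CYCLE_PER_PIPELINE = 33   # assumed reading of "~33 equivalent operations"
CLOCK_HZ = 250e6                  # 250 MHz

# Peak performance: pipelines x operations per cycle x clock rate.
peak_gflops = PIPELINES * OPS_PER_CYCLE_PER_PIPELINE * CLOCK_HZ / 1e9

# Arithmetic units: 3 subtractors + 6 adders + 8 multipliers per pipeline.
UNITS_PER_PIPELINE = 3 + 6 + 8
arithmetic_units = PIPELINES * UNITS_PER_PIPELINE

# Virtual bandwidth, assuming one j-particle word serves every pipeline
# and both of its virtual pipelines.
PHYSICAL_BANDWIDTH_GBYTES = 2.5
VIRTUAL_PER_PHYSICAL = 2
virtual_bandwidth_gbytes = (
    PHYSICAL_BANDWIDTH_GBYTES * PIPELINES * VIRTUAL_PER_PHYSICAL
)

print(peak_gflops, arithmetic_units, virtual_bandwidth_gbytes)
```

Under these assumptions the numbers reproduce the 165 Gflops, 340 arithmetic units, and 100 Gbytes/sec quoted on the slide.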
MDGRAPE-3 Chip
Chip made by Hitachi
6M gates
10M bits of memory
Chip size is ~220 mm^2
Dissipates 20 watts at a core voltage of
+1.2 V
0.12 W/Gflops, much better than a 3 GHz Pentium 4,
which is 14 W/Gflops
System Architecture
Host PC cluster will use Itanium or Opteron CPUs
256 nodes with 512 CPUs in total
Requires a 10 Gbit/sec network
InfiniBand, 10G Ethernet, or future Myrinet
Network topology will be a 2D hyper-crossbar
Each node has 24 MDGRAPE-3 chips
Performance of a node is 3.96 Tflops
MDGRAPE-3 chips connected via 2 PCI-X buses at 133 MHz
Total reaches a petaflop
19” rack can house 6 nodes
43 racks total
Power dissipation ~150 kW
Occupies 100 m^2
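The system totals follow from the per-chip figure. A short sketch of the arithmetic, using only numbers quoted on the slides (the variable names are illustrative):

```python
import math

CHIP_GFLOPS = 165       # per MDGRAPE-3 chip
CHIPS_PER_NODE = 24
NODES = 256
NODES_PER_RACK = 6

# Node performance: 24 chips at 165 Gflops each, in Tflops.
node_tflops = CHIP_GFLOPS * CHIPS_PER_NODE / 1000

# System performance across all 256 nodes, in Pflops.
system_pflops = node_tflops * NODES / 1000

total_chips = CHIPS_PER_NODE * NODES

# Racks needed when each 19-inch rack holds 6 nodes (round up).
racks = math.ceil(NODES / NODES_PER_RACK)

print(node_tflops, system_pflops, total_chips, racks)
```

This gives 3.96 Tflops per node, slightly over one petaflop in total, and 43 racks, matching the slide.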
System Architecture
Protein Explorer Board
Software
Very easy to create programs for
All computational abilities provided in a
library
No special knowledge of device needed
Cost
$20 million including labor
Less than $10/Gflop
At least ten times better than general-purpose computers, even when compared
with the relatively cheap BlueGene/L
($140/Gflop)
Questions
What is Myrinet?
What is a two-dimensional hyper-crossbar network topology?
How does this compare to massive
distributed computing such as
Folding@Home?
Advantages?
Disadvantages?