Protein Explorer: A Petaflops
Special Purpose Computer
System for Molecular Dynamics
Simulations
David Gobaud
Computational Drug Discovery
Stanford University
7 March 2006
Outline






Overview
Background
Delft Molecular Dynamics Processor
GRAPE
Protein Explorer Summary
MDGRAPE-3 Chip






Force Calculation Pipeline
J-Particle Memory and Control Units
System Architecture
Software
Cost
Questions
Overview

Protein Explorer

Petaflop special-purpose computer system for
molecular dynamics simulations





High-precision screening for drug design
Large-scale simulations of huge proteins/complexes
PC cluster with special-purpose engines to perform
the most time-consuming calculations
Dedicated LSI MDGRAPE-3 chip performs force
calculations at 165 Gflops or higher
Expected completion: 2006
Background

PCs are universal machines



Various applications
Hardware can be designed independent of
applications
Obstacles to high-performance



Memory bandwidth bottleneck
Heat dissipation problem
Can be overcome by developing specialized
architectures
Delft Molecular Dynamics
Processor (DMDP)

Pioneered high-performance special-purpose
systems

Not able to achieve effective cost-performance



Demanded too much time and money in the development stage
Speed of development is a crucial factor affecting cost-performance because electronic device technology
continues to develop rapidly
Almost all calculations performed by DMDP making
hardware very complex
GRAPE (GRAvity PipE)



One of the most successful attempts to
develop high-performance special-purpose
systems
Specialized for simulations of classical
particles
Most time spent on calculation of long-range
forces (gravitational, Coulomb, and van der
Waals)


Thus special hardware only performs these
calculations
Hardware very simple and cost-effective
GRAPE (GRAvity PipE)




In 1995, first machine to break the teraflops
barrier in nominal peak performance
Since 2001, the performance leader has been the
Molecular Dynamics Machine at RIKEN, at 78 Tflops
In 2002, a 64-Tflops GRAPE-6 was completed
at the University of Tokyo
Protein Explorer was launched building on the 2002
University of Tokyo success
Protein Explorer Summary


Host PC cluster with special-purpose boards attached
Boards calculate only non-bonded forces
Very simple hardware and software
No detailed knowledge of hardware needed to write
programs
Communication time between host and boards is
proportional to the number of particles N
Calculation time is proportional to:
N^2 for direct summation of long-range forces
N*Nc for short-range forces, where Nc is the average number
of particles within the cutoff radius
Communication requirement: 0.25 byte per 1000 operations
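The scaling on this slide can be illustrated with a minimal Python sketch. It is not the Protein Explorer library: the function name `coulomb_forces`, the unit-free force law, and the random test system are assumptions made for illustration. It counts pair evaluations for direct summation (N*(N-1) ordered pairs, i.e. order N^2) and for a cutoff (about N*Nc).

```python
import math
import random

def coulomb_forces(positions, charges, cutoff=None):
    """Return (forces, pair_evaluations) for a unit-free Coulomb interaction.

    cutoff=None evaluates every ordered pair (direct summation).
    With a cutoff, only pairs closer than `cutoff` are evaluated.
    This sketch still visits all pairs to test the distance; a real code
    would use a cell list so that only nearby particles are visited.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evaluations = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if cutoff is not None and r2 > cutoff * cutoff:
                continue
            evaluations += 1
            # F_i += q_i * q_j * r_ij / |r_ij|^3
            scale = charges[i] * charges[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                forces[i][k] += scale * d[k]
    return forces, evaluations

random.seed(1)
demo_positions = [tuple(random.random() for _ in range(3)) for _ in range(100)]
demo_charges = [1.0 if k % 2 == 0 else -1.0 for k in range(100)]
_, direct_pairs = coulomb_forces(demo_positions, demo_charges)
_, cutoff_pairs = coulomb_forces(demo_positions, demo_charges, cutoff=0.2)
print("direct summation pairs:", direct_pairs)
print("cutoff pairs (N*Nc):  ", cutoff_pairs)
```

The host only has to send N particle records to the boards, while the boards do work that grows with N^2 or N*Nc, which is why the communication requirement per operation can stay small.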
MDGRAPE-3 Chip - Force
Calculation Pipeline





3 subtractor units
6 adder units
8 multiplier units
1 function-evaluation unit
Can perform ~33 equivalent
operations per cycle when it calculates the
Coulomb force
MDGRAPE-3 Chip - Force
Calculation Pipeline


Most operations done in 32-bit single
precision floating point format
Force accumulation is 80-bit fixed point
format


Can be converted to 64-bit double precision
floating point
Coordinates stored in 40-bit fixed-point
format

Makes implementation of periodic boundary
conditions easy
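One way to see why fixed-point coordinates help with periodic boundaries is sketched below in Python. It assumes the simulation box is mapped onto the full 40-bit integer range, which the slides do not state; the helper names are hypothetical. Under that assumption, integer wrap-around in a subtraction yields the nearest periodic image without any branching on box size.

```python
BITS = 40                 # coordinate width quoted on the slide
MODULUS = 1 << BITS       # assumed: one box length spans the full integer range

def to_fixed(x, box_length):
    """Map a coordinate in [0, box_length) to an unsigned 40-bit integer."""
    return int(round((x / box_length) * MODULUS)) % MODULUS

def min_image_delta(a_fixed, b_fixed):
    """Signed (a - b) after wrap-around, i.e. the displacement to the nearest image."""
    d = (a_fixed - b_fixed) % MODULUS
    if d >= MODULUS // 2:  # reinterpret as a signed two's-complement value
        d -= MODULUS
    return d

def to_length(d_fixed, box_length):
    """Convert a fixed-point displacement back to length units."""
    return d_fixed * box_length / MODULUS

box = 10.0
a = to_fixed(9.5, box)
b = to_fixed(0.5, box)
# The particles are 1.0 apart across the boundary, not 9.0 apart inside the box.
print(to_length(min_image_delta(a, b), box))
```

In hardware the modulo and sign reinterpretation come for free from fixed-width integer arithmetic, whereas a floating-point implementation needs explicit comparisons against the box length.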
MDGRAPE-3 Chip - Force
Calculation Pipeline

Function Evaluator




Most important part of pipeline
Allows calculation of arbitrary smooth function
Has memory unit which contains a table for
polynomial coefficients and exponents and a
hardwired pipeline for fourth-order polynomial
evaluation
Interpolates an arbitrary smooth function g(x)
using segmented fourth-order polynomials evaluated by
Horner's method
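The function evaluator can be illustrated with a small Python sketch of a segmented fourth-order polynomial table evaluated by Horner's method. Several details are assumptions, not taken from the slides: the segments are uniform (the chip's table also stores exponents, which suggests a different segmentation), the coefficients come from a Taylor expansion about each segment center, and the example function g(x) = x^(-3/2) is chosen because it is the Coulomb force kernel when x = r^2.

```python
def horner(coeffs, t):
    """Evaluate c0 + c1*t + ... + cn*t**n with one multiply-add per coefficient."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result

def taylor_coeffs(x0, order=4, power=-1.5):
    """Taylor coefficients of g(x) = x**power about x0, lowest order first."""
    coeffs = [x0 ** power]
    for k in range(1, order + 1):
        # c_k = c_(k-1) * (power - (k - 1)) / (k * x0)
        coeffs.append(coeffs[-1] * (power - (k - 1)) / (k * x0))
    return coeffs

def build_table(x_min, x_max, segments):
    """One coefficient set per uniform segment, expanded about the segment center."""
    width = (x_max - x_min) / segments
    table = [taylor_coeffs(x_min + (s + 0.5) * width) for s in range(segments)]
    return table, width

def evaluate(table, width, x_min, x):
    """Look up the segment containing x, then run Horner's method on its polynomial."""
    s = min(int((x - x_min) / width), len(table) - 1)
    center = x_min + (s + 0.5) * width
    return horner(table[s], x - center)

table, width = build_table(1.0, 4.0, 64)
for x in (1.0, 2.5, 3.999):
    print(x, evaluate(table, width, 1.0, x), x ** -1.5)
```

Horner's method needs only four multiply-add steps for a fourth-order polynomial, which is why it maps well onto a hardwired pipeline; swapping the table contents changes the function without changing the hardware.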
MDGRAPE-3 Chip - J-Particle
Memory and Control Units


20 Force Calculation Pipelines
j-Particle Memory Unit
“Main Memory” of the chip
Holds 32,768 bodies
6.6 Mbits of static RAM
Cell-Index Controller
Controls j-Particle memory – generates addresses
Force Summation Unit
Master Controller
Manages timings and inputs/outputs of the chip
MDGRAPE-3 Chip





2 virtual pipelines/physical pipeline
Physical bandwidth of j-particle unit 2.5
Gbytes/sec but virtual bandwidth will
reach 100 Gbytes/sec
340 arithmetic units
20 function-evaluator units which work
simultaneously
165 Gflops at 250MHz
MDGRAPE-3 Chip






Chip made by Hitachi
6M gates
10M bits of memory
Chip size is ~220 mm^2
Dissipates 20 watts at a core voltage of
+1.2 V
0.12 W/Gflops, much better than a 3 GHz P4,
which is 14 W/Gflops
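The chip-level figures on the preceding slides are mutually consistent, as the short Python check below suggests. It reads the ~33 equivalent operations as a per-cycle figure for each pipeline, and it assumes the virtual bandwidth comes from every j-particle record being reused by all 20 pipelines and both virtual pipelines; the variable names are illustrative.

```python
PIPELINES = 20
OPS_PER_PIPELINE_PER_CYCLE = 33    # equivalent operations for the Coulomb force
CLOCK_HZ = 250e6
UNITS_PER_PIPELINE = 3 + 6 + 8     # subtractors + adders + multipliers
VIRTUAL_PER_PHYSICAL = 2
J_PARTICLE_BANDWIDTH_GBYTES = 2.5  # physical bandwidth, Gbytes/sec
CHIP_POWER_WATTS = 20

peak_gflops = PIPELINES * OPS_PER_PIPELINE_PER_CYCLE * CLOCK_HZ / 1e9
arithmetic_units = PIPELINES * UNITS_PER_PIPELINE
virtual_bandwidth = J_PARTICLE_BANDWIDTH_GBYTES * PIPELINES * VIRTUAL_PER_PHYSICAL
watts_per_gflops = CHIP_POWER_WATTS / peak_gflops

print(peak_gflops)        # 165.0 Gflops
print(arithmetic_units)   # 340
print(virtual_bandwidth)  # 100.0 Gbytes/sec
print(round(watts_per_gflops, 2))
```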
System Architecture



Host PC cluster will use Itanium or Opteron CPUs
256 nodes with 512 CPUs in total
Requires 10 Gbit/sec network
InfiniBand, 10G Ethernet, or future Myrinet
Network topology will be a 2D hyper-crossbar
Each node has 24 MDGRAPE-3 chips
Performance of a node is 3.96 Tflops
MDGRAPE-3 chips connected via 2 PCI-X buses at 133 MHz
19” rack can house 6 nodes
Total reaches a petaflop
43 racks total
Power dissipation ~150 kW
Occupies 100 m^2
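The system totals follow from the per-chip figure of 165 Gflops. A quick Python check, using only numbers from the slides (the variable names are illustrative):

```python
import math

NODES = 256
CHIPS_PER_NODE = 24
CHIP_GFLOPS = 165
NODES_PER_RACK = 6

node_tflops = CHIPS_PER_NODE * CHIP_GFLOPS / 1000   # per-node peak
system_pflops = NODES * node_tflops / 1000          # whole-system peak
racks = math.ceil(NODES / NODES_PER_RACK)           # partially filled rack counts as one

print(node_tflops)    # 3.96 Tflops per node
print(system_pflops)  # just over 1 Pflops
print(racks)          # 43
```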
System Architecture
Protein Explorer Board
Software


Very easy to create programs for the system
All computational abilities provided in a
library

No special knowledge of device needed
Cost


$20 million including labor
Less than $10/Gflop

At least ten times better than general-purpose
computers, even when compared
with the relatively cheap BlueGene/L
($140/Gflop)
Questions



What is Myrinet?
What is a two-dimensional hyper-crossbar network topology?
How does this compare to massive
distributed computing such as
Folding@Home?


Advantages?
Disadvantages?