PACT 98
http://www.research.microsoft.com/barc/gbell/pact.ppt
PACT: What Architectures? Compilers? Run-time environments? Programming models? … Any Apps?

Parallel Architectures and Compilation Techniques (PACT)
Paris, 14 October 1998
Gordon Bell, Microsoft
Talk plan
• Where are we today?
• History… predicting the future
  – Ancient
  – Strategic Computing Initiative and ASCI
  – Bell Prize since 1987
  – Apps & architecture taxonomy
• Petaflops: when, … how, how much
• New ideas: Grid, Globus, Legion
• Bonus: input to Thursday panel
1998: ISVs, buyers, & users?
• Technical: supers dying; DSM (and SMPs) trying
  – Mainline: user & ISV apps ported to PCs & workstations
  – Supers (legacy code) market lives on...
  – Vector apps (e.g. ISVs) ported to DSM (& SMP)
  – MPI for custom and a few leading-edge ISVs
  – Leading-edge, one-of-a-kind apps: clusters of 16, 256, ... 1000s built from uni, SMP, or DSM nodes
• Commercial: mainframes, SMPs (& DSMs), and clusters are interchangeable (control is the issue)
  – Dbase & TP: SMPs compete with mainframes if central control is an issue, else clusters
  – Data warehousing: may emerge… just a Dbase
  – High growth, web and stream servers: clusters have the advantage
c2000 Architecture Taxonomy
• SMP (mainline “multi”)
  – Xpt-connected SMPs
  – Xpt-SMP vector
  – Xpt-multithread (Tera)
  – Xpt-“multi” hybrid
  – DSM-SCI (commodity)
  – DSM (high bandwidth)
• Multicomputers aka Clusters … MPP (16-(64)-10K processors)
  – Commodity “multis” & mainline switches
  – Proprietary “multis” & switches
  – Proprietary DSMs
TOP500 Technical Systems by Vendor (sans PC and mainframe clusters)
[Chart: number of TOP500 systems (0-500) by vendor, Jun-93 through Jun-98; vendors shown: CRI, SGI, Convex, IBM, HP, Sun, TMC, Intel, DEC, Japanese vendors, and Other.]
Parallelism of Jobs on the NCSA Origin Cluster
20 weeks of data, March 16 - Aug 2, 1998: 15,028 jobs / 883,777 CPU-hrs
[Pie charts: share of jobs and share of CPU hours delivered, bucketed by job size in # CPUs: 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128.]
How are users using the Origin Array?
[Chart: CPU hours delivered (0-120,000) as a function of memory per CPU (MB) and # CPUs.]
National Academic Community Large Project Requests, September 1998
Over 5 million NUs requested
[Chart: requests split across Vector, DSM, and MPP systems.]
One NU = one XMP processor-hour
Source: National Resource Allocation Committee
GB's Estimate of Parallelism in Engineering & Scientific Applications (Gordon’s WAG)
[Chart: log(# apps) vs. degree of parallelism, spanning PCs, WSs, supers, and scalable multiprocessors (clusters aka MPPs aka multicomputers); “dusty decks” for supers vs. new or scaled-up apps.]
• scalar: 60%
• vector: 15%
• vector & //: 5%
• one-of, >>//: 5%
• embarrassingly & perfectly parallel: 15%
Application Taxonomy (axis: granularity & degree of coupling, comp./comm.)
• Technical
  – General purpose, non-parallelizable codes (PCs have it!)
  – Vectorizable
  – Vectorizable & //able (supers & small DSMs)
  – Hand-tuned, one-of
  – MPP coarse grain
  – MPP embarrassingly // (clusters of PCs...)
• Commercial (if central control & rich, then IBM or large SMPs, else PC clusters)
  – Database
  – Database/TP
  – Web host
  – Stream audio/video
One-processor performance as % of Linpack
[Chart: one-processor Linpack vs. the average over applications (CFD, Biomolec., Chemistry, Materials, QCD) for the T90, C90, SPP2000, SP2/160, Origin 195, and PCA; the application average runs roughly 14-33% of Linpack.]
10-Processor Linpack (Gflops); 10-P apps x10; apps as % of 1-P Linpack; apps as % of 10-P Linpack (Gordon’s WAG)
[Chart: values 0-35 for the T90, C90, SPP, SP2/160, Origin 195, and PCA.]
Ancient history
Growth in Computational Resources Used for UK Weather Forecasting
[Chart: computing power (log scale, 10 to 10T) vs. year, 1950-2000; machines include Leo, Mercury, KDF9, 195, 205, and YMP.]
10^10 in 50 yrs = 1.58^50
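A quick consistency check of the growth-rate label on this chart (my arithmetic, not from the original slide): a 10^10 increase over 50 years corresponds to an annual factor

\[
g^{50} = 10^{10} \;\Rightarrow\; g = 10^{10/50} = 10^{0.2} \approx 1.585,
\]

i.e. roughly 58% more computation per forecast each year, sustained for five decades.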
Harvard Mark I
aka IBM ASCC
“I think there is a world market for maybe five computers.”
-- Thomas Watson Senior, Chairman of IBM, 1943
The scientific market is still about that size… 3 computers
• When scientific processing was 100% of the industry, it was a good predictor
• $3 billion: 6 vendors, 7 architectures
• DOE buys 3 very big ($100-$200 M) machines every 3-4 years
NCSA cluster of 6 x 128-processor SGI Origins
Our Tax Dollars At Work: ASCI for Stockpile Stewardship
• Intel/Sandia: 9000 x 1-node Pentium Pro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center
  – 512 x 1 SP2
“LARC doesn’t need 30,000 words!”
-- von Neumann, 1955

“During the review, someone said: ‘von Neumann was right. 30,000 words was too much IF all the users were as skilled as von Neumann ... for ordinary people, 30,000 was barely enough!’”
-- Edward Teller, 1995

The memory was approved. Memory solves many problems!
“Parallel processing computer architectures will be in use by 1975.”
-- Navy Delphi Panel, 1969
“In Dec. 1995 computers with 1,000 processors will do most of the scientific processing.”
-- Danny Hillis, 1990 (1 paper or 1 company)
The Bell-Hillis Bet: Massive Parallelism in 1995
[Table: TMC vs. world-wide supers, compared on three measures -- applications, petaflops/mo., and revenue.]
Bell-Hillis Bet: wasn’t paid off!
• My goal was not necessarily to just win the bet!
• Hennessy and Patterson were to evaluate what was really happening…
• Wanted to understand the degree of MPP progress and programmability
“A 50X LISP machine” -- Tom Knight, Symbolics
“A Teraflops by 1995”
“A 1,000 node multiprocessor” -- Gordon Bell, Encore
DARPA, 1985: Strategic Computing Initiative (SCI)
• All of ~20 HPCC projects failed!
SCI (c1980s): the Strategic Computing Initiative funded
ATT/Columbia (Non Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like Connection Machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), Thinking Machines (Connection Machine), University of Texas.

Those who gave up their lives in SCI’s search for parallelism:
Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC (independent of ETA), Cogent, Culler, Cydrome, Denelcor, Elxsi, ETA, Evans & Sutherland Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, Multiflow, Myrias, Pixar, Prisma, SAXPY, SCS, Supertek (part of Cray), Suprenum (German national effort), Stardent (Ardent + Stellar), Supercomputer Systems Inc., Synapse, Vitec, Vitesse, Wavetracer.
Worlton: the "Bandwagon Effect" explains massive parallelism
Bandwagon: a propaganda device by which the purported acceptance of an idea ... is claimed in order to win further public acceptance.
• Pullers: vendors, CS community
• Pushers: funding bureaucrats & deficit
• Riders: innovators and early adopters
• 4 flat tires: training, system software, applications, and "guideposts"
• Spectators: most users, 3rd-party ISVs
Parallel processing is a constant distance away.
“Our vision ... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.”
-- Grimshaw, Wulf, et al., “Legion,” CACM, Jan. 1997
Progress
"Parallelism is a journey.*"
*Paul Borrill
Let us not forget:
“The purpose of computing is insight, not numbers.”
-- R. W. Hamming
Progress 1987-1998
Bell Prize Peak Gflops vs. time
[Chart: peak Gflops (0.1 to 1000, log scale) vs. year, 1986-2000.]
Bell Prize: 1000x, 1987-1998
• 1987 Ncube, 1,000 computers: showed that with more memory, apps scaled
• 1987 Cray XMP, 4 proc. @ 200 Mflops/proc
• 1996 Intel, 9,000 proc. @ 200 Mflops/proc
• 1998: 600 RAP Gflops Bell prize
• Parallelism gains
  – 10x in parallelism over Ncube
  – 2000x in parallelism over XMP
• Spend 2-4x more
• Cost effectiveness: 5x; ECL --> CMOS; SRAM --> DRAM
• Moore’s Law = 100x
• Clock: 2-10x; CMOS-ECL speed cross-over
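A rough consistency check using only the numbers on this slide (my arithmetic, not part of the talk):

\[
\frac{\text{1998 RAP}}{\text{1987 RAP}} \approx \frac{600\ \text{Gflops}}{4 \times 0.2\ \text{Gflops}} = 750 \;(\text{roughly the 1000x of the title})
\quad\text{over 11 years} \;\Rightarrow\; 750^{1/11} \approx 1.8\times \text{ per year}.
\]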
No more 1000X/decade.
We are now (hopefully) only limited by Moore’s Law and not by memory access.
• 1 GF to 10 GF took 2 years
• 10 GF to 100 GF took 3 years
• 100 GF to 1 TF took >5 years
• 2n+1 or 2^(n-1)+1?
Commercial Perf/$: $/tpmC vs. time
[Chart: $/tpmC (log scale, ~$1,000 down to ~$10), Mar-94 through Jun-97 -- 250%/year improvement!]
Commercial Performance: tpmC vs. time
[Chart: tpmC (log scale, ~100 up to ~100,000), Mar-94 through Jun-97 -- 250%/year improvement!]
1998 Observations vs. 1989 Predictions for technical computing
• Got a Tflops PAP 12/1996 vs. predicted 1995. Really impressive progress! (RAP < 1 TF)
• More diversity… results in NO software!
  – Predicted: SIMD, mC; hoped for scalable SMP
  – Got: supers, mCv, mC, SMP, SMP/DSM; SIMD disappeared
• $3B (un-profitable?) industry; 10 platforms
• PCs and workstations diverted users
• MPP apps DID NOT materialize
Observation: CMOS supers replaced ECL in Japan
• 2.2 Gflops vector units have dual use
  – in traditional mPv supers
  – as the basis for computers in mC
• Software apps are present
• A vector processor out-performs n micros for many scientific apps
• It’s memory bandwidth, cache prediction, and inter-communication
Observation: price & performance
• Breaking the $30M barrier increases PAP
• Eliminating “state computers” increased prices, but got fewer, more committed suppliers, less variation, and more focus
• Commodity micros aka Intel are critical to improvement. DEC, IBM, and Sun are ??
• Conjecture: supers and MPPs may be equally cost-effective despite PAP
  – Memory bandwidth determines performance & price
  – “You get what you pay for” aka “there’s no free lunch”
Observation: MPPs 1, Users <1
• MPPs, with relatively low-speed micros and lower memory bandwidth, ran over supers but didn’t kill ’em.
• Did the U.S. industry enter an abyss?
  - Is crying “unfair trade” hypocritical?
  - Are users denied tools?
  - Are users not “getting with the program”?
• Challenge: we must learn to program clusters (see the MPI sketch below)...
  - Cache idiosyncrasies
  - Limited memory bandwidth
  - Long inter-communication delays
  - Very large numbers of computers
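The cluster-programming model in question is message passing; a minimal MPI sketch in C (my illustration, not code from the talk) shows its character: work is divided explicitly across processes, and every exchange of data is an explicit, latency-exposed communication step.

/* Minimal MPI example: each rank sums part of a series, then the partial
 * sums are combined with one collective operation.
 * Build with an MPI C compiler, e.g.:  mpicc pi.c -o pi
 * Run on a cluster, e.g.:              mpirun -np 16 ./pi                 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const long n = 10000000;            /* number of integration intervals */
    double h, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Midpoint rule for pi = integral of 4/(1+x^2) over [0,1]; each rank
     * takes every nprocs-th interval, so the work is spread evenly.       */
    h = 1.0 / (double)n;
    for (long i = rank; i < n; i += nprocs) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* One explicit communication step: combine partial sums on rank 0.    */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f with %d processes\n", pi, nprocs);

    MPI_Finalize();
    return 0;
}

Every hazard in the list above -- cache behavior, memory bandwidth, communication latency, and sheer process count -- is visible to, and managed by, the programmer in this model.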
Strong recommendation: utilize in situ workstations!
• NOW (Berkeley) set sort and decrypting records
• Grid, Globus, Condor and other projects
• Need a “standard” interface and programming model for clusters using “commodity” platforms & fast switches
• Giga- and tera-bit links and switches allow geo-distributed systems
• Each PC in a computational environment should have an additional 1 GB / 9 GB!
“Petaflops by 2010”
-- DOE Accelerated Strategic Computing Initiative (ASCI)
DOE’s 1997 “PathForward” Accelerated Strategic Computing Initiative (ASCI)
• 1997: 1-2 Tflops, $100M
• 1999-2001: 10-30 Tflops, $200M??
• 2004: 100 Tflops
• 2010: Petaflops
“When is a Petaflops possible? What price?” -- Gordon Bell, ACM 1997
• Moore’s Law: 100x (but how fast can the clock tick?)
• Increase parallelism 10K --> 100K: 10x
• Spend more ($100M --> $500M): 5x
• Centralize center or fast network: 3x
• Commoditization (competition): 3x
Micro gains if 20, 40, & 60% / year
[Chart: ops vs. year, 1995-2045 (log scale, 1E+6 to 1E+21): 60%/year reaches exaops, 40%/year reaches petaops, 20%/year reaches teraops.]
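A quick check of the three trajectories in the chart (my arithmetic, assuming the roughly 1-Gflops 1995 starting point read off the axis):

\[
10^{9} \times 1.6^{50} \approx 10^{19}\ (\text{exaops}),\qquad
10^{9} \times 1.4^{50} \approx 2\times 10^{16}\ (\text{petaops}),\qquad
10^{9} \times 1.2^{50} \approx 10^{13}\ (\text{teraops})
\]

over the 50 years from 1995 to 2045.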
Processor Limit: the DRAM Gap
[Chart: relative performance (1 to 1000, log scale) vs. year, 1980-2000; µProc (“Moore’s Law”) improves 60%/yr. while DRAM improves 7%/yr., so the processor-memory performance gap grows ~50%/year.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks, x 4-way issue, or 432 instructions
• Caches in Pentium Pro: 64% of area, 88% of transistors
*Taken from the Patterson-Keeton talk to SIGMOD
Five Scalabilities
• Size scalable -- designed from a few components, with no bottlenecks
• Generation scaling -- no rewrite/recompile is required across generations of computers
• Reliability scaling
• Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
• Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer
Problem x machine space => run time: problem scale, machine scale (#p), and run time together imply speedup and efficiency.
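For reference, the standard definitions behind the speedup and efficiency the last item alludes to (textbook formulas, not from the original slide), where T(p) is the run time of a given problem on p processors:

\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p};
\]

problem x machine scaling then asks how S and E behave as both the problem size and p grow together.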
The Law of Massive Parallelism (mine) is based on application scaling (Gordon’s WAG)
There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be related to no other.
Put another way: any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time… but it may be completely impractical.
• Challenge to theoreticians and tool builders: how well will an algorithm run -- or will it run at all?
• Challenge for software and programmers: can a package be scalable & portable? Are there models?
• Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
• Challenge to funders: is the cost justified?
Manyflops for Manybucks: what are the goals of spending?
• Getting the most flops, independent of how much taxpayers give to spend on computers?
• Building or owning large machines?
• Doing a job (stockpile stewardship)?
• Understanding and publishing about parallelism?
• Making parallelism accessible?
• Forcing other labs to follow?
Petaflops Alternatives c2007-14, from the 1994 DOE Workshop

              SMP            Cluster           Active Mem Grid
Processors    400            4-40K             400K
Speed/proc.   1 Tflops       10-100 Gflops     1 Gflops
Memory        400 TB SRAM    400 TB DRAM       0.8 TB embedded
Chips         250K           60K-100K          4K

(Other notes on the slide: 1 ps/result… 10-100 ps/result; multi-threading; cache hierarchy; 100 10-Gflops threads is likely.)
No definition of storage, network, or programming model.
Or more parallelism… and use installed machines
• 10,000 nodes in 1998, or a 10x increase
• Assume 100K nodes
• 10 Gflops / 10 GBy / 100 GB nodes, or low-end c2010 PCs
• Communication is the first problem… use the network
• Programming is still the major barrier
• Will any problems fit it?
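The arithmetic implied by these bullets (mine, using the slide’s own numbers):

\[
10^{5}\ \text{nodes} \times 10\ \text{Gflops/node} = 10^{15}\ \text{flops} = 1\ \text{Pflops}.
\]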
Next, short steps
The Alliance LES NT Supercluster
“Supercomputer performance at mail-order prices” -- Jim Gray, Microsoft
• Andrew Chien, CS UIUC --> UCSD
• Rob Pennington, NCSA
• Myrinet network, HPVM, Fast Messages
• Microsoft NT OS, MPI API
• 192 HP 300 MHz + 64 Compaq 333 MHz
2D Navier-Stokes Kernel - Performance
Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson pre-conditioner
[Chart: Gflops vs. number of processors for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM; sustaining 7 GF on the 128-processor NT cluster.]
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
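For readers unfamiliar with the kernel’s inner loop, here is a minimal serial preconditioned conjugate gradient sketch in C (my illustration with a simple Jacobi/diagonal preconditioner on a dense SPD system -- not the Alliance code, which uses a multi-level additive Schwarz Richardson preconditioner and runs in parallel over MPI or DSM):

/* Minimal preconditioned conjugate gradient (Jacobi preconditioner) for a
 * dense symmetric positive-definite system A x = b.  Illustrative only.   */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double dot(int n, const double *u, const double *v) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

static void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
    }
}

/* Solve A x = b starting from x = 0; returns the iteration count. */
int pcg(int n, const double *A, const double *b, double *x, double tol, int maxit) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; }    /* r = b - A*0   */
    for (int i = 0; i < n; i++) z[i] = r[i] / A[i * n + i];     /* z = D^-1 r    */
    for (int i = 0; i < n; i++) p[i] = z[i];
    double rz = dot(n, r, z);
    int k;
    for (k = 0; k < maxit && sqrt(dot(n, r, r)) > tol; k++) {
        matvec(n, A, p, q);                                     /* q = A p       */
        double alpha = rz / dot(n, p, q);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        for (int i = 0; i < n; i++) z[i] = r[i] / A[i * n + i]; /* re-precondition */
        double rz_new = dot(n, r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(q);
    return k;
}

int main(void) {
    double A[4] = { 4.0, 1.0, 1.0, 3.0 }, b[2] = { 1.0, 2.0 }, x[2];
    int it = pcg(2, A, b, x, 1e-10, 100);
    printf("x = (%g, %g) after %d iterations\n", x[0], x[1], it);
    return 0;
}

In the parallel versions charted above, the matrix-vector product and the dot products become the communication points; a domain-decomposition preconditioner such as additive Schwarz helps keep the iteration count under control as the processor (subdomain) count grows.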
The Grid: Blueprint for a New Computing Infrastructure
Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999
• Published July 1998; ISBN 1-55860-475-8
• 22 chapters by expert authors including: Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others
• “A source book for the history of the future” -- Vint Cerf
• http://www.mkp.com/grids
The Grid
“Dependable, consistent, pervasive access to [high-end] resources”
• Dependable: can provide performance and functionality guarantees
• Consistent: uniform interfaces to a wide variety of resources
• Pervasive: ability to “plug in” from anywhere
Alliance Grid Technology Roadmap: it’s not just flops or records/sec
[Diagram: technologies arranged across User Interface, Middleware, Compute, and Data layers, including Cave5D, Webflow, Virtual Director, VRML, NetMeeting, H.320/323, Java3D, ActiveX, Java, CAVERNsoft, Workbenches, Tango, RealNetworks, Visualization, SCIRun, Habanero, Globus, LDAP, QoS, OpenMP, MPI, HPF, DSM, Clusters, HPVM/FM, Condor, JavaGrande, Symera (DCOM), Abilene, vBNS, MREN, svPablo, XML, SRB, HDF-5, Emerge (Z39.50), SANs, DMF, and ODBC.]
Globus Approach
• Focus on architecture issues
  – Propose a set of core services as basic infrastructure
  – Use these to construct high-level, domain-specific solutions
• Design principles
  – Keep participation cost low
  – Enable local control
  – Support for adaptation
[Diagram: layered stack -- applications, diverse global services, core Globus services, local OS.]
Globus Toolkit: Core Services
• Scheduling (Globus Resource Allocation Manager)
  – Low-level scheduler API
• Information (Metacomputing Directory Service)
  – Uniform access to structure/state information
• Communications (Nexus)
  – Multimethod communication + QoS management
• Security (Globus Security Infrastructure)
  – Single sign-on, key management
• Health and status (Heartbeat monitor)
• Remote file access (Global Access to Secondary Storage)
Summary of some beliefs
• The 1000x increase in PAP has not been accompanied by RAP, insight, infrastructure, and use.
• What was the PACT/$?
• “The PC World Challenge” is to provide commodity, clustered parallelism to commercial and technical communities
• It only comes true if ISVs believe and act
• Grid etc., using world-wide resources including in situ PCs, is the new idea
PACT 98
http://www.research.microsoft.com/barc/gbell/pact.ppt
The end