Architecture Challenges
All the chips outside and around the PC:
what new platforms? What apps?
Challenges, what’s interesting, and what needs doing?
Gordon Bell
Bay Area Research Center
Microsoft Corporation
Copyright Gordon Bell & Jim Gray, ISCA 2000
Architecture changes when everyone and everything is mobile!
Power, security, RF, WWW, display, data-types (e.g. video & voice)…
it’s the application of architecture!
The architecture problem
The apps
– Data-types: video, voice, RF, etc.
– Environment: power, speed, cost
The material: clock, transistors…
Performance… it’s about parallelism:
– Program & programming environment
– Network, e.g. WWW and Grid
– Clusters
– Multiprocessors
– Storage, cluster, and network interconnect
– Processor and special processing
– Multi-threading and multiple processors per chip
– Instruction Level Parallelism vs. vector processors
IP On Everything
poochi
Sony PlayStation export limits
PC At An Inflection Point?
It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance.
They drive microprocessor competition!
[Chart: PCs vs. non-PC devices and the Internet]
The Dawn Of The PC-Plus Era, Not The Post-PC Era…
devices aggregate via PCs!!!
[Diagram: consumer PCs at the center of TV/AV, mobile companions, communications, automation & security, and household management. C = Commercial; C’ = Consumer]
PC will prevail for the next decade as a dominant platform… 2nd to smart, mobile devices
Moore’s Law increases performance; and alternatively reduces prices
PC server clusters with low-cost OS beat proprietary switches, smPs, and DSMs
Home entertainment & control…
– Very large disks (1 TB by 2005) to “store everything”
– Screens to enhance use
Mobile devices, etc. dominate WWW >2003!
Voice and video become important apps!
Where’s the action? Problems?
Constraints: speech, video, mobility, RF, GPS, security…
Moore’s Law, including network speed
Scalability and high-performance processing
– Building them: clusters vs. DSM
– Structure: where’s the processing, memory, and switches (disk and IP/TCP processing)?
– Micros: getting the most from the nodes
Not ISAs: change can delay the Moore’s Law effect… and wipe out software investment!
Please, please, just interpret my object code!
System-on-a-chip alternatives… apps drive
– Data-types (e.g. video, voice, RF), performance, portability/power, and cost
High Performance Computing
A 60+ year view
High performance architecture/program timeline
[Timeline, 1950-2000: vacuum tubes, transistors, MSI (minis), micros, RISC micros, new micros.
Sequential programming (a single execution stream) continues throughout.
SIMD/vector parallelization branches off, followed by parallel programs, a.k.a. cluster computing:
multicomputers; ultracomputers (10X in size & price); the MPP era (10x MPP); NOW; VLSCC;
“in situ” resources (100x in parallelism); geographically dispersed Grid.]
Computer types
[Taxonomy by connectivity, from WAN/LAN down to SAN, DSM, and SM:
– Networked supers over WAN/LAN: GRID, Legion, Condor
– SAN clusters: T3E, SP2 (mP), Beowulf, NOW, NT clusters, and clusters of SGI DSMs, WSs, and PCs
– DSM: SGI DSM
– SM: VPP uni, NEC super, NEC mP, Cray X…T (all mPv), mainframes, multis]
Technical computer types
[The same taxonomy split into two worlds:
– Old world (one program stream): NEC mP, NEC super, Cray X…T, T series, VPP uni (all mPv); SGI DSM; mainframes; multis
– New world, clustered computing (multiple program streams): GRID, Legion, Condor over WAN/LAN; SP2 (mP), Beowulf, NOW, and clusters of SGI DSMs, WSs, and PCs over SANs]
Dead Supercomputer Society
ACRI
Alliant
American Supercomputer
Ametek
Applied Dynamics
Astronautics
BBN
CDC
Convex
Cray Computer
Cray Research
Culler-Harris
Culler Scientific
Cydrome
Dana/Ardent/Stellar/Stardent
Denelcor
Elexsi
ETA Systems
Evans and Sutherland Computer
Floating Point Systems
Galaxy YH-1
Goodyear Aerospace MPP
Gould NPL
Guiltech
Intel Scientific Computers
International Parallel Machines
Kendall Square Research
Key Computer Laboratories
MasPar
Meiko
Multiflow
Myrias
Numerix
Prisma
Tera
Thinking Machines
Saxpy
Scientific Computer Systems (SCS)
Soviet Supercomputers
Supertek
Supercomputer Systems
Suprenum
Vitesse Electronics
SCI Research c1985-1995
35 university and corporate R&D projects
2 or 3 successes…
All the rest failed to work or to succeed.
How to build scalables?
To cluster or not to cluster… don’t we need a single, shared memory?
Application Taxonomy
Technical:
– General purpose, nonparallelizable codes (PCs have it!)
– Vectorizable
– Vectorizable & //able (supers & small DSMs)
– Hand tuned, one-of
– MPP coarse grain
– MPP embarrassingly // (clusters of PCs…)
Commercial (if central control & rich, then IBM or large SMPs, else PC clusters):
– Database
– Database/TP
– Web host
– Stream audio/video
SNAP… c1995
Scalable Network And Platforms
A View of Computing in 2000+
We all missed the impact of WWW!
Network: Jim Gray; Platform: Gordon Bell
[SNAP diagram: computing built entirely from PCs.
– Portables and mobile nets
– Wide-area global network; wide & local area networks for terminals, PCs, workstations, & servers
– Person servers (PCs)
– Legacy mainframe & minicomputer servers & terminals
– Scalable computers built from PCs
– Centralized & departmental uni & mP servers (UNIX & NT) built from PCs
– TC = TV + PC at home (via CATV, ATM, or satellite)
A space, time (bandwidth), & generation scalable environment.]
Bell Prize and Future Peak Tflops (t)
[Chart, log scale 0.0001 to 1000 Tflops vs. year, 1985-2010: from the XMP and NCube, through the CM2, NEC, and IBM systems, rising toward the Petaflops study target.]
Top 10 TPC-C
The top two Compaq systems are 1.1X & 1.5X faster than the IBM SPs, at 1/3 the price of IBM and 1/5 the price of Sun.
Courtesy of Dr. Thomas Sterling, Caltech
Five Scalabilities
Size scalable -- designed from a few components, with no bottlenecks
Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling… choose any level
Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer
Problem x machine space => run time: problem scale, machine scale (#p), and run time imply speedup and efficiency
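A minimal formalization of the problem x machine space above, using standard notation of my own (not from the slide):

```latex
% Run time T(n,p) for a problem of scale n on p processors gives
S(n,p) = \frac{T(n,1)}{T(n,p)}, \qquad  % speedup
E(n,p) = \frac{S(n,p)}{p}               % efficiency
% "Problem x machine scalability": E(n,p) stays near 1 as n and p grow together.
```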
Why I gave up on large smPs & DSMs
Economics: Perf/Cost is lower… unless a commodity.
Economics: Longer design time & life. Complex. => Poorer tech tracking & end-of-life performance.
Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system.
DSMs… NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.
They aren’t scalable. Reliability requires clusters. Start there.
They aren’t needed for most apps… hence, a small market unless one can find a way to lock in a user base, as in the case of IBM Token Ring vs. Ethernet.
FVCORE Performance
Finite Volume Community Climate Model; joint code development by NASA, LLNL, and NCAR
[Chart: GFlops (0-50) vs. number of SGI processors (0-600) for MPI on SGI and MLP on SGI, against reference lines for the SX-5, SX-4, max C90-16, and max T3E.]
Architectural Contrasts – Vector vs. Microprocessor
Vector system (500 MHz): CPU with 8 KBytes of vector registers over memory; two results per clock; vector lengths fixed; vectors fed at high speed.
Microprocessor system (600 MHz): CPU with 8 MBytes of 1st & 2nd level caches over memory; two results per clock (will be 4 in the next-gen SGI); vector lengths arbitrary; vectors fed at low speed.
Cache based systems are nothing more than “vector” processors with a highly programmable “vector” register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note that 512-CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
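A minimal C sketch of the idea, under my own illustrative assumptions (the block size and the two-pass kernel are mine, not from the slide): block a long vector computation so each block is reused while it is still cache-resident, which is what treating the cache as a large, programmable vector register set amounts to.

```c
#include <stddef.h>

/* Illustrative block size: 4096 doubles = 32 KB, far below an 8 MB cache. */
#define BLOCK 4096

/* Pass 1 updates a block of y; pass 2 immediately reuses that block while it
   is still in cache, so the second pass runs at cache speed, not DRAM speed. */
double blocked_triad(double *y, const double *x, double a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t end = (i + BLOCK < n) ? i + BLOCK : n;
        for (size_t j = i; j < end; j++)    /* pass 1: load block of x and y */
            y[j] += a * x[j];
        for (size_t j = i; j < end; j++)    /* pass 2: y block hits in cache */
            sum += y[j] * y[j];
    }
    return sum;
}
```

A fixed-length vector register file forces the same strip-mined structure in 64-element chunks; the cache simply makes the “register” big enough to hold whole working sets.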
[Figure: evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers.
Note that there are only two structures: (1) shared memory mP, with uniform and non-uniform memory access; and (2) networked workstations, shared nothing.
– Limited scalability, uniform memory access (mP): mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC); bus-based and ring-based multis, i.e. minis & workstations (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.). mPs continue to be the main line.
– Scalable mP (smP) with non-uniform memory access: 1st smPs with no cache (Cm* ’75, Butterfly ’85, Cedar ’88); smPs with some cache for locality, i.e. DSM (DASH, Convex, Cray T3D, SCI); the all-cache smP architecture (KSR Allcache); next-generation smP research (e.g. DDM, DASH+). Convergence to one architecture by 1995?
– Experimental, scalable, medium/coarse-grain multicomputers (smC) with non-uniform memory access: 1st smC hypercubes (Cosmic Cube, iPSC 1, NCUBE, Transputer-based); workstation micros plus fast switches (Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994); fine-grain smC (Mosaic-C, J-machine); DSM => smP.
– Networked workstations (very coarse-grain smC): Apollo, SUN, HP, etc.; high-bandwidth switches and comm. protocols (e.g. ATM); coarse-grain WS clusters via special switches (1994) and ATM (1995).
Natural evolution, driven by competition & micros and by high-density, multi-threaded processors & switches.]
“Jim, what are the architectural challenges… for clusters?”
WANs (and even LANs) are faster than backplanes, at 40 Gbps
End of buses (FC = 100 MBps)… except on a chip
What are the building blocks or combinations of processing, memory, & storage?
InfiniBand (http://www.infinibandta.org) starts at OC48, but it may not go far or fast enough, if it ever exists.
OC192 is being deployed.
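For scale (unit conversion only, plus the nominal SONET rates):

```latex
40\ \mathrm{Gbps} = 5\ \mathrm{GBps} \approx 50\times(\mathrm{FC} = 100\ \mathrm{MBps});
\qquad
\mathrm{OC48} \approx 2.5\ \mathrm{Gbps},\quad \mathrm{OC192} \approx 10\ \mathrm{Gbps}
```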
What is the basic structure of these scalable systems?
Overall
Disk connection, especially with respect to Fibre Channel
SAN, especially with fast WANs & LANs
Modern scalable switches… also hide a supercomputer
Scale from <1 to 120 Tbps of switch capacity
1 Gbps Ethernet switches scale to 10s of Gbps
SP2 scales from 1.2 Gbps
GB plumbing from the baroque: evolving from the 2 dance-hall model
Mp — S — Pc
         |— S.fc — Ms
         |— S.Cluster
         |— S.WAN —
to:
MpPcMs — S.Lan/Cluster/Wan —
SNAP Architecture
ISTORE Hardware Vision
System-on-a-chip enables computer & memory without significantly increasing the size of the disk.
5-7 year target:
MicroDrive (1.7” x 1.4” x 0.2”):
– 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
– 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor (2x height):
– Connected via crossbar switch growing like Moore’s Law
– 16 MBytes; 1.6 Gflops; 6.4 Gops
10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops
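Checking the 2006 MicroDrive target against the stated growth rates (seven years out from the 1999 baseline):

```latex
340\ \mathrm{MB}\times 1.6^{7} \approx 340\ \mathrm{MB}\times 26.8 \approx 9.1\ \mathrm{GB};
\qquad
5\ \mathrm{MB/s}\times 1.4^{7} \approx 5\ \mathrm{MB/s}\times 10.5 \approx 53\ \mathrm{MB/s}
```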
The Disk Farm? Or a System On a Card?
The 500 GB disc card (14"): an array of discs that can be used as
– 100 discs
– 1 striped disc
– 50 FT (fault-tolerant) discs
– …etc.
LOTS of accesses/second and lots of bandwidth.
A few disks are replaced by 10s of GBytes of RAM and a processor to run apps!!
Map of Gray Bell Prize results
Single-thread, single-stream TCP/IP, desktop-to-desktop… Win 2K out-of-the-box performance*
[Map: Redmond/Seattle, WA; New York via 7 hops; Arlington, VA; San Francisco, CA; 5626 km, 10 hops.]
Ubiquitous 10 GBps SANs in 5 years
1 Gbps Ethernet is a reality now.
– Also FibreChannel, Myrinet, GigaNet, ServerNet, ATM,…
10 Gbps x4 WDM (OC192) is deployed now.
– 3 Tbps WDM is working in the lab.
In 5 years, expect 10x. Wow!!
[Chart: link speeds stepping up from 5, 20, 40, 80, to 120 MBps (1 Gbps), headed toward 1 GBps.]
The Promise of SAN/VIA: 10x in 2 years
http://www.ViArch.org/
Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency ~250 µs
Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
– Fast user-level communication: tcp/ip ~100 MBps at 10% cpu; round-trip latency is 15 µs
– 1.6 Gbps demoed on a WAN
[Chart: time (µs, 0-250) to send 1 KB, broken into sender cpu, receiver cpu, and transmit time, for 100 Mbps Ethernet, Gbps Ethernet, and SAN.]
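A rough model of the chart, in my own notation: the time to move a small message is fixed per-message overhead plus wire time,

```latex
t_{\mathrm{1KB}} \;\approx\; t_{\mathrm{overhead}} + \frac{1\ \mathrm{KB}}{BW};
\qquad
\frac{1\ \mathrm{KB}}{10\ \mathrm{MBps}} \approx 100\ \mu\mathrm{s},
\quad
\frac{1\ \mathrm{KB}}{100\ \mathrm{MBps}} \approx 10\ \mu\mathrm{s}
```

so once the wire runs at ~100 MBps, the per-message CPU overhead dominates, which is exactly what user-level communication (VIA) attacks.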
Processor improvements…
90% of ISCA’s focus
We get more of everything
Mainframes, minis, micros, and RISC: Performance vs. Time for Several Computers
[Chart: performance (in VAX 780s, 0.1 to 100 on a log scale) vs. year, 1980-1990,
for the VAX 780 (5 MHz, TTL), 8600 (ECL), 9000, uVAX 6K (CMOS), MV10K, and 68K,
and for MIPS micros at 8 MHz, 25 MHz, and 65 MHz (the MIPS 4K).
Trend lines: ECL ~15%/yr, CMOS CISC ~38%/yr, RISC ~60%/yr.
Will RISC continue at 60%/yr (x4 / 3 years)? Moore’s speed law?]
Computer ops/sec x word length / $
[Chart, 1880-2000, log scale from 1.E-06 to 1.E+09: the doubling time shortens over the century,
from every 7.5 years, to every 2.3 years, to every 1.0 year.
Fitted curves shown: 1.565^(t-1959.4) and y = 1E-248·e^(0.2918x).]
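A doubling time can be read off each fitted growth rate; for the exponential fit shown, for instance:

```latex
y = 10^{-248}\,e^{0.2918\,t}
\;\Rightarrow\;
t_{\text{double}} = \frac{\ln 2}{0.2918} \approx 2.4\ \text{years}
```

which is roughly the chart’s “doubles every 2.3 years” regime.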
Growth of microprocessor performance
[Chart: performance in Mflop/s (0.01 to 10000, log scale) vs. year.
Supers: Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90.
Micros: 8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha, closing the gap on the supers.]
Albert Yu predictions ’96

When   Clock (MHz)   MTransistors   Mops     Die (sq. in.)
2000   900           40             2,400    1.1
2006   4,000         350            20,000   1.4
Ratio  4.4x          8.75x          8.3x     1.3x
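The ratio row is just the 2006 value over the 2000 value in each column:

```latex
\frac{4000}{900} \approx 4.4\times,\qquad
\frac{350}{40} = 8.75\times,\qquad
\frac{20{,}000}{2400} \approx 8.3\times,\qquad
\frac{1.4}{1.1} \approx 1.3\times
```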
Processor Limit: DRAM Gap
[Chart: relative performance (1-1000, log scale) vs. year, 1980-2000. µProc performance (“Moore’s Law”) grows at 60%/yr, DRAM at 7%/yr; the processor-memory performance gap grows 50%/year.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
• Caches in the Pentium Pro: 64% of area, 88% of transistors
*Taken from the Patterson-Keeton talk to SIGMOD
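Spelling out the first bullet’s arithmetic (the slide rounds the clock count up to 108):

```latex
\frac{180\ \mathrm{ns}}{1.7\ \mathrm{ns/clk}} \approx 106\ \mathrm{clks};
\qquad
108\ \mathrm{clks}\times 4\ \mathrm{instructions/clk} = 432\ \text{instruction issue slots lost per miss}
```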
The “memory gap”
Multiple (e.g. 4) processors per chip, in order to increase the ops/chip while waiting out the inevitable access delays
Or, alternatively, multi-threading (MTA)
Vector processors with a supporting memory system
System-on-a-chip… to reduce chip boundary crossings
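A back-of-envelope version of the “multiple processors or threads per chip” argument; the 60 ns of useful work between misses is an illustrative assumption, not a number from the slide:

```latex
N \;\approx\; 1 + \frac{t_{\mathrm{miss}}}{t_{\mathrm{work}}}
\;=\; 1 + \frac{180\ \mathrm{ns}}{60\ \mathrm{ns}}
\;=\; 4\ \text{contexts to keep the chip busy through each access delay}
```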
If system-on-a-chip is the answer, what is the problem?
Small, high-volume products:
– Phones, PDAs
– Toys & games (to sell batteries)
– Cars
– Home appliances
– TV & video
Communication infrastructure
Plain old computers… and portables
SOC Alternatives… not including C/C++ CAD tools
The blank sheet of paper: FPGA
Auto design of a basic system: Tensilica
Standardized, committee-designed components*, cells, and custom IP
Standard components, including more application-specific processors*, IP add-ons, and custom
One chip does it all: SMOP
*Processors, memory, communication & memory links
Xilinx: 10M gates, 500M transistors, 0.12 micron
Free 32 bit processor core
System-on-a-chip alternatives

Approach           What it is                                               Who
FPGA               Sea of un-committed gate arrays                          Xilinx, Altera
Compile a system   A unique processor for every app                         Tensilica
Systolic array     Many pipelined or parallel processors + custom
DSP | VLIW         Special-purpose processor cores + custom                 TI
ASICs              Pc & Mp general-purpose cores, specialized by I/O, etc.  IBM, Intel, Lucent
Micro              Universal multiprocessor array, programmable I/O         Cradle
Cradle: Universal Microsystem (UMS)
Trading Verilog & hardware for C/C++
UMS : VLSI = microprocessor : special systems = software : hardware
Single part for all apps; app spec’d at run time using FPGA & ROM
5 quad mPs at 3 Gflops/quad = 15 Gflops; 1 GB/s; 2.5 Gips
Single shared memory space, caches
Programmable periphery including PCI, 100baseT, FireWire
$4 per Gflops; 150 mW/Gflops
UMS Architecture
[Block diagram: four quads, each a 4-wide row of M, S, and P elements with local memory, flanked by DRAM at either end, plus control, clocks & debug, NVMEM, and a ring of programmable I/O (PROG I/O) blocks.]
Memory bandwidth scales with processing
Scalable processing, software, I/O
Each app runs on its own pool of processors
Enables durable, portable intellectual property
Recapping the challenges
Scalable systems
– Latency in a distributed memory
– Structure of the system and nodes
– Network performance for OC192 (10 Gbps)
– Processing nodes and legacy software
Mobile systems… power, RF, voice, I/O
– Design time!
The End