Transcript Document

Plenary Presentation to the Workshop on
Frontiers of Extreme Computing:
Continuum Computer Architecture
Thomas Sterling
California Institute of Technology
and
Louisiana State University
October 25, 2005
Challenges to Extreme Computing
 Power consumption
 Exposure and exploitation of application parallelism
 Memory “speed”
 Latency and clock propagation distance
 Chip I/O bandwidth
 Amdahl’s limits
 Reliability and error rates
 Programmability
 Cost of development (and applied research)
 Cost of manufacturing
Challenges to Computer Architecture
 Expose and exploit extreme fine-grain parallelism
   Possibly multi-billion-way
   Data structure-driven
 State storage takes up much more space than logic
   1:1 flops/byte ratio infeasible
   Memory access bandwidth critical
 Latency
   Can approach a million cycles
   All actions are local
 Overhead for fine-grain parallelism must be very small
   Or the system cannot scale
   One consequence is that global barrier synchronization is untenable
 Reliability
   Very high replication of elements
   Uncertain fault distribution
   Fault tolerance essential for good yield
Insidious Underlying Assumptions
 Near-term incremental solutions will continue indefinitely
 COTS
   Off-the-shelf conventional micros
   High-density dumb DRAMs
 Processor-centric
   Cache-based
   Program-counter driven
   > 100M gates
 Separate memory components
   Can’t possibly afford to do anything else
 Multi-core
   More of the same
 Message-passing model of computation
   Some vectors and threads thrown in as well
 ALUs are the precious resource
   Processor and memory architectures designed around them
 Manual locality management
   Explicitly specified by the programmer for latency avoidance
My own warped perspective
 Data access bandwidth is the precious resource
   ALUs relegated to high-availability devices (as opposed to high-utilization)
 Conventional delineation of functionality is an artifact of ancient tradeoffs
   e.g., processor, memory, interconnect
   Current strategy: global parallel execution from local sequential devices
 Computation is about continuations and state
   A “continuation” specifies an environment and next action(s)
   Processors are one possible physical manifestation of a continuation
     One continuation glued to each processor
     Multithreading glues down a few continuations
 Barriers are bad
   They overconstrain parallel execution and destroy fine-grain parallelism
 Meta-data determines control flow in some data-intensive applications
 Control-flow synchronization needs to be part of the meta-data
 It’s sometimes better to move the continuation to the data than the data to the continuation
 Complexity of operation need not be derived from complexity of design
   Complex emergent global behavior may be a product of simple local rules of operation
Architecture Innovation
 Extreme memory bandwidth
 Active latency hiding
 Extreme parallelism
 Message-driven split-transaction computations (parcels)
 PIM
   e.g., Kogge, Draper, Sterling, …
   Very high memory bandwidth
   Lower memory latency (on chip)
   Higher execution parallelism (banks and row-wide)
 Streaming
   Dally, Keckler, …
   Very high functional parallelism
   Low latency (between functional units)
   Higher execution parallelism (high ALU density)
Concepts of the MIND Architecture
 Virtual-to-physical address translation in memory
   Global distributed shared memory through a distributed directory table
   Dynamic page migration
   Wide registers serve as a context-sensitive TLB
 Multithreaded control
   Unified dynamic mechanism for resource management
   Latency hiding
   Real-time response
 Parcel active message-driven computing
   Decoupled split-transaction execution
   System-wide latency hiding
   Move work to data instead of data to work
 Parallel atomic struct processing
   Exploits direct access to wide rows of memory banks for fine-grain parallelism and guarded compound operations (sketched after this slide)
   Exploits parallelism for better performance
   Enables very efficient mechanisms for synchronization
 Fault tolerance through graceful degradation
 Active power management
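The “guarded compound operations” item above can be made concrete with a short sketch. This is my illustration, not the MIND instruction set: ROW_WORDS, row_t, and row_guarded_add are invented names, and the loop stands in for row-wide hardware parallelism.

/* Hedged sketch of a guarded compound row operation: one atomic step
 * reads a full row from a memory bank, applies a simple operation to
 * every word whose guard bit allows it, and writes the row back. */
#include <stdio.h>
#include <stdint.h>

#define ROW_WORDS 8   /* words delivered by one row access (assumed) */

typedef struct {
    uint8_t guard[ROW_WORDS];  /* per-word guard (e.g., full/empty bit) */
    int32_t word[ROW_WORDS];
} row_t;

/* Increment every guarded word of the row in one "access": the
 * row-wide ALU sees all words at once, so this loop models spatial
 * parallelism, not time. */
static void row_guarded_add(row_t *r, int32_t delta) {
    for (int i = 0; i < ROW_WORDS; i++)
        if (r->guard[i]) r->word[i] += delta;
}

int main(void) {
    row_t r = { {1,0,1,1,0,1,0,1}, {10,20,30,40,50,60,70,80} };
    row_guarded_add(&r, 5);
    for (int i = 0; i < ROW_WORDS; i++) printf("%d ", r.word[i]);
    printf("\n");
    return 0;
}

The point of the design choice is that one row access amortizes control over many words: the guard bits let a single compound operation act or skip per word without issuing separate instructions.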
MIND Node
[Figure: MIND node block diagram — a memory stack (memory address buffer, sense amps & row buffer) connected through a permutation network to a wide multi-word ALU and wide register bank; a memory controller with on-chip interface; multithreading execution control; and a parcel handler with its parcel interface.]
Parcels for remote threads
[Figure: a remote-thread-create parcel carries destination, action, and payload from a source locale to a destination locale, where it selects a target operand in the data and a target action code in the methods, instantiating a remote thread with its thread frames; a return parcel carries the action and payload back to the source.]
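To make parcels concrete, here is a minimal single-process sketch in C. It is my illustration, not MIND code: parcel_t, the action codes, and handle() are invented, and a real system would deliver parcels asynchronously over the network rather than by nested function calls.

/* A parcel is a split-transaction, message-driven unit of work that
 * moves the action to the data. All names here are illustrative. */
#include <stdio.h>

enum action { ACT_FETCH_ADD, ACT_RETURN };

typedef struct {
    enum action act;   /* target action code            */
    int dest;          /* destination locale (node id)  */
    int addr;          /* target operand within locale  */
    int payload;       /* argument or returned value    */
    int src;           /* locale to send the reply to   */
} parcel_t;

#define NODES 4
#define WORDS 8
static int mem[NODES][WORDS];   /* each locale's local memory */

/* Executes a parcel at its destination; a fetch-and-add produces a
 * reply parcel, so the requester never blocks on the round trip. */
static parcel_t handle(parcel_t p) {
    switch (p.act) {
    case ACT_FETCH_ADD: {
        int old = mem[p.dest][p.addr];
        mem[p.dest][p.addr] = old + p.payload;
        return (parcel_t){ ACT_RETURN, p.src, p.addr, old, p.dest };
    }
    default: /* ACT_RETURN: deliver value to the waiting continuation */
        printf("node %d got reply: old value %d\n", p.dest, p.payload);
        return p;
    }
}

int main(void) {
    /* node 0 sends work to the data on node 3 instead of pulling
     * the data back to node 0 */
    parcel_t req = { ACT_FETCH_ADD, 3, 5, 10, 0 };
    parcel_t rep = handle(req);   /* executed at the destination */
    handle(rep);                  /* reply executed back at node 0 */
    return 0;
}

The split-transaction property is visible in the structure: the requester's continuation is encoded in the reply parcel (destination locale plus action), so no processor sits blocked across the round trip.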
Parcel Simulation Latency Hiding Experiment
[Figure: test setup — experiment and control nodes exchange remote memory requests over a flat network; a process-driven node (ALU plus local memory, issuing remote memory requests) is compared with a parcel-driven node (local memory with input and output parcel streams and remote memory requests).]
Latency Hiding with Parcels
with respect to System Diameter in cycles
[Figure: sensitivity to remote latency and remote access fraction, 16 nodes. Y-axis: total transactional work done / total process work done (log scale, 0.1 to 1000); x-axis: remote memory latency in cycles (64 to 16384). Curves are shown for remote access fractions of 1/4%, 1/2%, 1%, 2%, and 4%, with the degree of parallelism (pending parcels per node at t=0: 1, 4, 16, 64, 256) marked in red.]
Latency Hiding with Parcels
Idle Time with respect to Degree of Parallelism
[Figure: idle time per node in cycles (0 to 8×10^5) versus parallelism level (parcels per node at t=0, from 1 to 256), for node counts 1 through 256 shown in black, comparing process-driven and transaction-driven execution.]
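The shape of these results can be approximated with a back-of-envelope model (my own, not the simulator that produced the plots above): if each parcel performs g cycles of local work and then waits L cycles on a remote request, a node with p parcels in flight is busy for at most min(1, p·g/(g+L)) of the time. A small C calculation, with g and L assumed purely for illustration:

/* Back-of-envelope latency-hiding model (not the simulator behind
 * the plots above). */
#include <stdio.h>

static double busy_fraction(double p, double g, double L) {
    double u = p * g / (g + L);   /* Little's-law style occupancy */
    return u < 1.0 ? u : 1.0;
}

int main(void) {
    const double g = 100.0;    /* cycles of work per parcel (assumed) */
    const double L = 10000.0;  /* remote memory latency in cycles     */
    for (int p = 1; p <= 256; p *= 4)
        printf("p=%3d  idle fraction = %.2f\n",
               p, 1.0 - busy_fraction(p, g, L));
    return 0;
}

With g = 100 and L = 10,000 the idle fraction falls from about 0.99 at p = 1 to zero once p·g exceeds g+L, which is the latency-hiding effect the plots quantify.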
Metric of Physical Locality, t
 Locality of operation depends on the amount of logic and state that can be accessed round-trip within a single clock cycle
 Define t as the ratio of the number of elements (e.g., gates, transistors) per chip to the number of elements accessible within a single clock cycle (written out below)
 Not just a speed-of-light issue
 Also involves propagation through a sequence of elements
 When I was an undergrad, t = 1
 Today, t < 10
 For SFQ at 100 GHz, 100 < t < 1000
 At nano-scale, t > 100,000
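Written as a formula (my notation, not the speaker's):

\[
  t \;=\; \frac{N_{\text{chip}}}{N_{\text{cycle}}}
\]

where N_chip is the number of elements on the chip and N_cycle is the number reachable round-trip within one clock cycle. As an illustrative (assumed) example: a nano-scale part with N_chip ≈ 10^11 elements, of which only N_cycle ≈ 10^6 are reachable in a cycle, gives t = 10^11 / 10^6 = 10^5, matching the “t > 100,000” regime above.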
Continuum Computer Architecture
Fundamental Concepts
 General-purpose cellular architecture
 Global fine-grain cellular structure (Simultac)
   2.5-D or 3-D mesh
   Blocks interact nearest-neighbor
 Merge functionality into a single simple block (Fonton)
   Synergism among fontons yields the emergent global behavior of a general parallel computing model
   Communications nearest-neighbor
   Memory: all registers with associative tags for names and type specification
   Data/instruction structures distributed across fontons – virtual addressing
   Logic performs basic operations on local data
 Dynamic adaptive and distributed resource management
CCA Structure: Fonton
 Small block of fully associative tagged memory/registers
 Basic logical and arithmetic unit
 Instruction register directs control to set data paths
 Nearest-neighbor communications with switching
 PRECISE binary instruction set, compressed encoding
[Figure: fonton block diagram — associative memory, instruction register, ALU, and control.]
CCA Structure: Distributed Associative Memory
[Figure: associative memory distributed across the fonton array.]
Data Organization and Management
 Three classes of data ensembles
   Scalar: values stored in a single fonton
   Complex: records of some small number of values, which if small can fit in a single fonton, or in adjacent fontons
   Compound: distributed across fontons and coordinated by links
 Data migration
   Objects are copied to adjacent fontons
   Copying exploits fine-grain data parallelism, even for irregular data structures
   Objects may transit by means of wormhole routing
 Data objects are virtually named (see the sketch after this list)
   With tags for associative search and typing
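The tagged, virtually named storage these slides assume can be sketched as follows. This is my construction, not a CCA specification: word_t, the tag fields, and assoc_lookup are invented names, and the lookup loop models what would be a parallel compare across all words.

/* Illustrative sketch of a fonton's tagged associative memory: every
 * word carries a virtual name tag and a type tag, and lookup matches
 * on the name rather than on a physical address. */
#include <stdio.h>

typedef enum { T_EMPTY, T_SCALAR, T_LINK, T_INSTR } tag_type;

typedef struct {
    unsigned long name;  /* virtual name tag      */
    tag_type      type;  /* type tag              */
    long          value; /* datum or link target  */
} word_t;

#define FONTON_WORDS 16  /* a fonton holds only a small block of state */

typedef struct { word_t w[FONTON_WORDS]; } fonton;

/* Associative search: in hardware all words compare in parallel;
 * here a loop stands in for the match lines. */
static word_t *assoc_lookup(fonton *f, unsigned long name) {
    for (int i = 0; i < FONTON_WORDS; i++)
        if (f->w[i].type != T_EMPTY && f->w[i].name == name)
            return &f->w[i];
    return NULL; /* miss: a real fonton would forward the request */
}

int main(void) {
    fonton f = {0};
    f.w[3] = (word_t){ 0x2A, T_SCALAR, 17 };
    word_t *hit = assoc_lookup(&f, 0x2A);
    if (hit) printf("name 0x2A -> value %ld\n", hit->value);
    return 0;
}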
Address Translation
 Distributed associative mapping
 Data carries virtual tags
 Requests are directed in 2-D/3-D
 Requests search a sequence of fontons for the reference
 A reference match either locates the operand or a new directive (“crumbs”)
 Reference tree of links using virtual links and directors
 Objects can be nailed down
 Directory table provides global naming
 “Shock-wave” searches for unknown positions (one possible reading is sketched below)
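One possible reading of the “shock-wave” search, sketched as a breadth-first wavefront over a 2-D mesh (my interpretation; the mesh size, tag array, and shockwave() are invented for illustration):

/* When an object's position is unknown, the request expands outward
 * through nearest neighbors, ring by ring, until some fonton's
 * associative memory matches the name. */
#include <stdio.h>

#define N 8                       /* N x N mesh of fontons */
static unsigned long tag[N][N];   /* name held at each fonton (0 = none) */

/* Breadth-first wavefront from (sx, sy); returns 1 and the location
 * on a match. Each step crosses one nearest-neighbor link. */
static int shockwave(int sx, int sy, unsigned long name, int *fx, int *fy) {
    static const int dx[] = {1,-1,0,0}, dy[] = {0,0,1,-1};
    int seen[N][N] = {0};
    int qx[N*N], qy[N*N], head = 0, tail = 0;
    qx[tail] = sx; qy[tail] = sy; tail++; seen[sx][sy] = 1;
    while (head < tail) {
        int x = qx[head], y = qy[head]; head++;
        if (tag[x][y] == name) { *fx = x; *fy = y; return 1; }
        for (int d = 0; d < 4; d++) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx >= 0 && nx < N && ny >= 0 && ny < N && !seen[nx][ny]) {
                seen[nx][ny] = 1;
                qx[tail] = nx; qy[tail] = ny; tail++;
            }
        }
    }
    return 0;
}

int main(void) {
    tag[6][2] = 0xBEEF;           /* object at a position unknown to node (0,0) */
    int x, y;
    if (shockwave(0, 0, 0xBEEF, &x, &y))
        printf("found at fonton (%d,%d)\n", x, y);
    return 0;
}

Each ring of the wavefront corresponds to one more nearest-neighbor hop, so the search reaches an object at Manhattan distance d after d steps without any node knowing the position in advance.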
CCA Structure: Mesh Network
[Figure: nearest-neighbor mesh network interconnecting fontons.]
CCA Structure: Gate Array
[Figure: regular array of replicated ALU cells.]
Instruction Streams and Execution
 Instruction streams are just another data structure
 Can be static in place or can migrate through successive fontons
 They are tagged as instructions and carry their target environment id for unique instantiation
 Data can move across an instruction sequence, synthesizing the equivalent of a programmable pipeline
 An instruction sequence can move across a stored data sequence, synthesizing the equivalent of vector execution (both views are sketched below)
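Both views of that duality can be rendered in a few lines of C (illustrative only; prog, the op functions, and the arrays are invented):

/* The same computation expressed by streaming data past a fixed
 * instruction sequence (pipeline) and by sweeping the instruction
 * sequence across stored data (vector execution). */
#include <stdio.h>

typedef int (*op_t)(int);
static int add1(int x) { return x + 1; }
static int dbl(int x)  { return x * 2; }

#define NOPS  2
#define NDATA 4

int main(void) {
    op_t prog[NOPS] = { add1, dbl };   /* the instruction stream */
    int data[NDATA] = { 1, 2, 3, 4 };  /* the data structure     */

    /* Pipeline view: each datum moves through the fixed stages. */
    for (int i = 0; i < NDATA; i++) {
        int v = data[i];
        for (int s = 0; s < NOPS; s++) v = prog[s](v);
        printf("pipeline: %d -> %d\n", data[i], v);
    }

    /* Vector view: each instruction sweeps across the stored data. */
    int vec[NDATA] = { 1, 2, 3, 4 };
    for (int s = 0; s < NOPS; s++)
        for (int i = 0; i < NDATA; i++) vec[i] = prog[s](vec[i]);
    for (int i = 0; i < NDATA; i++) printf("vector: -> %d\n", vec[i]);
    return 0;
}

Both loops compute the same results; only which operand stays put and which one moves differs, which is exactly the symmetry the slide describes.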
Principal Mechanisms
 Virtual address translation and object locating
 Cellular client-server relationship
 Resource allocation (load balancing) by diffusion (adjacent copying); see the sketch after this list
 Data objects & structures distributed across fields of fontons
 Vectors are mobile data structures (wormhole routing, type I)
   Can move across software pipelines with separate instructions in each fonton pipeline stage
 Instruction threads are mobile data structures
   Can move across an ephemeral vector register with a separate datum in each fonton of the register bank (“Parcels”)
 “Futures” coordinate time synchronization and space utilization
 Irregular data structure pointers direct n-furcating
 Fault isolation (reconfigurable?, at least on-off)
 Asynchronous interfaces
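The diffusion item above (load balancing by adjacent copying) can be sketched as follows, under my own assumptions about the exchange rule: each cell repeatedly trades a fixed fraction of its load difference with its neighbors, so imbalance decays with no global coordination.

/* Minimal sketch of load balancing by diffusion on a 1-D line of
 * cells; the symmetric exchanges conserve total load. */
#include <stdio.h>

#define CELLS 8
#define STEPS 12

int main(void) {
    /* all work initially piled on one cell */
    double load[CELLS] = { 80, 0, 0, 0, 0, 0, 0, 0 };
    double next[CELLS];

    for (int t = 0; t < STEPS; t++) {
        for (int i = 0; i < CELLS; i++) {
            double l = (i > 0)         ? load[i-1] : load[i];
            double r = (i < CELLS - 1) ? load[i+1] : load[i];
            /* exchange a quarter of the difference with each neighbor */
            next[i] = load[i] + 0.25 * (l - load[i]) + 0.25 * (r - load[i]);
        }
        for (int i = 0; i < CELLS; i++) load[i] = next[i];
    }
    for (int i = 0; i < CELLS; i++) printf("cell %d: %.1f\n", i, load[i]);
    return 0;
}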
Summary
 Extremes in technology scale and clock rate will demand innovations in architecture
 In the limit, locality of action will dominate operation, and bandwidth of storage access will determine sustained performance
 Fine-grain cellular structures provide the highest storage access bandwidth and highest peak performance
 Continuum Computer Architecture (CCA) exploits fine-grain cellular structure for general-purpose parallel processing
 CCA may provide a convergent architecture for nano-scale and ultra-high clock rate technologies at the end of Moore’s Law