M-Machine and Grids
Parallel Computer Architectures
Navendu Jain
Readings
The M-Machine multicomputer
Fillo et al., MICRO 1995
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Keckler et al., ISCA 1998
A design space evaluation of grid processor architectures
Nagarajan et al., MICRO 2001
Outline
The M-Machine Multicomputer
Thread Level Parallelism on M-Machine
Grid Processor Architectures
Review and Discussion
The M-Machine Multicomputer
Design Motivation
Achieve higher throughput from memory resources
Increase chip area devoted to processors
Arithmetic-to-bandwidth ratio of 12 operations/word
Minimize global communication (local synchronization)
Faster execution of fixed-size problems
Easier programmability of parallel computers
Incremental approach
Architecture
A bi-directional 3-D mesh network of multithreaded processing nodes
Each chip comprises a multi-ALU processor (MAP) and 128KB of on-chip synchronous DRAM
A user-accessible message-passing system (SEND)
Single global virtual address space
Target clock 100 MHz (control logic 40 MHz)
Multi-ALU processor (MAP)
A MAP chip comprises:
Three 64-bit, 3-issue clusters
A 2-way interleaved on-chip cache
A memory switch
A cluster switch
An external memory interface
On-chip network interfaces and routers
A MAP Cluster
A 64-bit, three-issue pipelined processor:
2 integer ALUs
1 floating-point ALU
Register files
4KB instruction cache
A MAP instruction has 1, 2, or 3 operations
MAP Chip Die (18 mm on a side, 5M transistors)
Exploiting Parallelism on M-Machine
Threads
Exploit ILP both within and across the clusters
Horizontal Threads (H-Threads)
Instruction-level parallelism
Each executes on a single MAP cluster
3-wide instruction stream
Communication/synchronization through messages/registers/memory
Up to 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis
Threads (contd.)
Vertical Threads (V-Threads)
Thread-level parallelism (a standard process)
Each contains up to 4 H-Threads (one per cluster)
Flexibility of scheduling (compiler/run-time)
Communication/synchronization through registers
At most 6 resident V-Threads:
4 user slots, 1 event slot, 1 exception slot
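A minimal sketch of this model in C (an illustration, not M-Machine code): a V-Thread is modelled as a process whose H-Threads run one per MAP cluster. In hardware an H-Thread is created by the hfork instruction in about 11 cycles; here POSIX threads and the NUM_CLUSTERS constant stand in for the hardware thread slots.

/* Sketch: one V-Thread (the process) spawning its H-Threads. */
#include <pthread.h>
#include <stdio.h>

#define NUM_CLUSTERS 4          /* up to 4 H-Threads per V-Thread */

/* One H-Thread: in hardware, a 3-wide instruction stream issued
 * on a single MAP cluster. */
static void *h_thread(void *arg) {
    long cluster = (long)arg;
    printf("H-Thread on cluster %ld\n", cluster);
    return NULL;
}

int main(void) {
    pthread_t h[NUM_CLUSTERS];
    for (long c = 0; c < NUM_CLUSTERS; c++)     /* cf. hfork */
        pthread_create(&h[c], NULL, h_thread, (void *)c);
    for (int c = 0; c < NUM_CLUSTERS; c++)
        pthread_join(h[c], NULL);               /* cf. the cbar barrier */
    return 0;
}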
Concurrency Model
Three levels of parallelism:
Instruction-Level Parallelism (~1 instruction)
VLIW, superscalar processors
Issues: control flow, data dependency, scalability
Thread-Level Parallelism (~1000 instructions)
Chip multiprocessors
Issues: limited coarse TLP, inner cores non-optimal
Fine-Grain Parallelism (~50-1000 instructions)
Mapping

Granularity    Program                      Architecture
ILP            Sub-expressions              Cluster
Fine-grain     Inner loops, sub-routines    MAP chip
TLP            Outer loops, co-routines     Nodes
Fine-grain overheads
Thread creation (11 cycles – hfork)
Communication
Register-to-register reads/writes
Message passing/on-chip cache
Synchronization
Blocking on a register (full/empty bit)
Barrier (cbar instruction)
Memory (sync bit)
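To make the register full/empty-bit mechanism concrete, here is a small C model (an illustration, not MAP code): a register carries a presence bit, a consumer that reads while the bit is empty blocks, and a producer's write fills it. The mutex/condition-variable pair stands in for what the MAP does in hardware.

/* Software model of a register guarded by a full/empty bit. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t        value;
    int             full;    /* the full/empty presence bit */
    pthread_mutex_t lock;
    pthread_cond_t  filled;
} sync_reg;

static sync_reg r = { 0, 0, PTHREAD_MUTEX_INITIALIZER,
                      PTHREAD_COND_INITIALIZER };

static void reg_write(uint64_t v) {      /* producer: write and mark full */
    pthread_mutex_lock(&r.lock);
    r.value = v;
    r.full  = 1;
    pthread_cond_signal(&r.filled);
    pthread_mutex_unlock(&r.lock);
}

static uint64_t reg_read(void) {         /* consumer: block while empty */
    pthread_mutex_lock(&r.lock);
    while (!r.full)
        pthread_cond_wait(&r.filled, &r.lock);
    r.full = 0;                          /* a consuming read empties it */
    uint64_t v = r.value;
    pthread_mutex_unlock(&r.lock);
    return v;
}

static void *producer(void *arg) { (void)arg; reg_write(42); return NULL; }

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    printf("consumer read %llu\n", (unsigned long long)reg_read());
    pthread_join(p, NULL);
    return 0;
}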
Grid Processor Architecture
Design Motivation
Continued scaling of the clock rate
Scalability of the processing core
Higher ILP: instruction throughput (IPC)
Mitigate global wire delay overheads
Closer coupling of Architecture and compiler
Architecture
An interconnected 2-D network of ALU arrays
Each node has an instruction buffer (IB) and an execution unit
A single control thread maps instructions to nodes
Block-Atomic Execution Model
Maps blocks of statically scheduled instructions
Dynamic execution in data-flow order
Temporary values forwarded to the consumer ALUs
Critical path scheduled along the shortest physical path
GPA Architecture
Example: Block-Atomic Mapping
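As an illustration of block-atomic mapping, the C sketch below uses a simplified, hypothetical encoding (not the actual GPA ISA): each instruction in a mapped block names up to three consumer slots instead of a destination register, and fires as soon as all of its operands have arrived, so execution proceeds in data-flow order. The insn struct, fire, and deliver names are invented for the example.

#include <stdio.h>

#define MAX_DEST 3               /* max 3 destinations per instruction */
#define NOWHERE  -1

typedef struct {
    char op;                     /* '+' or '*' */
    long operand[2];
    int  arrived;                /* operands received so far */
    int  needed;                 /* operands required to fire */
    int  dest[MAX_DEST];         /* consumer instruction indices */
    int  dest_slot[MAX_DEST];    /* operand slot at each consumer */
} insn;

static void deliver(insn *block, int target, int slot, long v);

/* Execute a ready instruction and forward its result directly to its
 * consumers -- no central register file is involved. */
static void fire(insn *block, int i) {
    insn *in = &block[i];
    long r = (in->op == '+') ? in->operand[0] + in->operand[1]
                             : in->operand[0] * in->operand[1];
    printf("node %d fired: %ld\n", i, r);
    for (int d = 0; d < MAX_DEST; d++)
        if (in->dest[d] != NOWHERE)
            deliver(block, in->dest[d], in->dest_slot[d], r);
}

/* A consumer fires the moment its last operand arrives. */
static void deliver(insn *block, int target, int slot, long v) {
    block[target].operand[slot] = v;
    if (++block[target].arrived == block[target].needed)
        fire(block, target);
}

int main(void) {
    /* (a+b)*(a+c): nodes 0 and 1 are adds feeding node 2, the multiply. */
    insn block[3] = {
        { '+', {2, 3}, 2, 2, {2, NOWHERE, NOWHERE}, {0} },
        { '+', {2, 5}, 2, 2, {2, NOWHERE, NOWHERE}, {1} },
        { '*', {0, 0}, 0, 2, {NOWHERE, NOWHERE, NOWHERE}, {0} },
    };
    fire(block, 0);              /* block inputs injected: adds are ready */
    fire(block, 1);
    return 0;
}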
Implementation
Instruction fetch and map
predicated hyper-block, move instructions
Execution - control logic
Operand routing – max. 3 destinations, split instructions
Hyper-block control
Predication (execute-all approach), cmove instructions
Block commit
Block stitching
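A small C illustration of the execute-all predication approach: both arms of a branch in a hyper-block execute unconditionally, and a conditional move commits one result. The plain-C ?: operator stands in for the grid's cmove instruction.

#include <stdio.h>

static long hyperblock(long a, long b) {
    long then_val = a + b;       /* arm 1: always executed */
    long else_val = a - b;       /* arm 2: always executed */
    int  p = (a > b);            /* the predicate */
    return p ? then_val : else_val;   /* "cmove" commits one result */
}

int main(void) {
    printf("%ld %ld\n", hyperblock(5, 3), hyperblock(3, 5));
    return 0;
}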
Review and Discussion
Key Ideas: Convergence
Microprocessor as a number of small superscalar processors
Comm./sync. via registers – low overheads
Exploiting granularities from ILP to TLP
Dependencies mapped to a grid of ALUs
Replication reduces design/verification effort
Point-to-point communication
Exposing architecture partitioning and the flow of operations to the compiler
Avoids wire/routing delays and memory-wall problems
Ideas: Divergence
M-Machine
On-chip cache
Register-based mechanisms [delays]
Broadcasting and Point-to-point communication
GPA
Register Set
Grid: Chaining [Scalability]
Point-to-point communication
TERA
Fine-grain threads – memory comm./sync. (full/empty bits)
No support for single-threaded code
Drawbacks (Unresolved Issues)
Grid Processor Arch.
Data caches far from ALUs
Delays between dependent operations due to network routers and wires
Complex frame management and block stitching
Explicit compiler dependence
M-Machine
Scalability
Clock speeds
Memory synchronization (use hfork)
Challenges/Future Directions
Architectural support to extract TLP
Parallelizing compiler technology
How many cores/threads?
No. of threads vs. memory latency, wire delays [Flynn]
Inter-thread communication
Grid height == 8 (IPC 5-6) [GPA, Peter]
Optimization = f(comm., delays, memory costs)
Challenges (contd.)
On-the-fly data-dependence detection (RAW/WAR)
TLP/ILP balance – M-Machine Multicomputer
Thanks