M-Machine and Grids
Parallel Computer Architectures
Navendu Jain
Readings

The M-Machine multicomputer
 Fillo et al., MICRO 1995
Exploiting fine-grain thread level parallelism on the MIT Multi-ALU Processor
 Keckler et al., MICRO 1998
A design space evaluation of grid processor architectures
 Nagarajan et al., MICRO 2001
Outline

The M-Machine Multicomputer

Thread Level Parallelism on M-Machine

Grid Processor Architectures

Review and Discussion
The M-Machine Multicomputer
Design Motivation

Achieve higher throughput of memory resources
Increase chip area devoted to processors
 Arithmetic-to-bandwidth ratio of 12 operations/word
Minimize global communication (local synchronization)
Faster execution of fixed-size problems
Easier programmability of parallel computers
Incremental approach
Architecture

A bidirectional 3-D mesh network of multithreaded processing nodes
Each chip comprises a multi-ALU processor (MAP) and 128 KB of on-chip synchronous DRAM
A user-accessible message-passing system (SEND)
Single global virtual address space
Target clock 100 MHz (control logic 40 MHz)
Multi-ALU Processor (MAP)

A MAP chip comprises:
 Three 64-bit 3-issue clusters
 2-way interleaved on-chip cache
 A memory switch
 A cluster switch
 External memory interface
 On-chip network interfaces and routers
A MAP Cluster

64-bit three-issue pipelined processor
2 integer ALUs
1 floating-point ALU
Register files
4 KB instruction cache
A MAP instruction has 1, 2, or 3 operations (sketched below)
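
A MAP instruction word can be pictured as a small struct with up to three operation slots. Below is a minimal illustrative sketch in C; the field names and layout are assumptions for exposition, not the actual MAP instruction encoding.

    /* Illustrative 3-wide MAP instruction word. Field names and
     * layout are assumptions, not the real MAP ISA encoding. */
    #include <stdint.h>

    typedef struct {
        uint32_t op[3];   /* up to three operations issued together */
        uint8_t  used;    /* how many slots (1, 2, or 3) are filled */
    } map_instruction;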
MAP Chip Die (18 mm side, 5M transistors)
Exploiting Parallelism on M-Machine
Threads

Exploit ILP both within and across the clusters
Horizontal Threads (H-Threads)
 Instruction-level parallelism
 Execute on a single MAP cluster
 3-wide instruction stream
 Communication/synchronization through messages/registers/memory
 Up to 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis

Threads (contd.)

Vertical Threads (V-Threads)
 Thread-level parallelism (a standard process)
 Contains up to 4 H-Threads, one per cluster (software analogy below)
 Flexibility of scheduling (compiler/run-time)
 Communication/synchronization through registers
 At most 6 resident V-Threads
  4 user slots, 1 event slot, 1 exception slot
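
A V-Thread behaves like an ordinary process whose work is split across up to four cluster-resident H-Threads. The following is a rough software analogy in C, using POSIX threads in place of the hardware thread slots; the fork/join pattern mirrors the slides, while everything else is an assumption for illustration.

    /* Rough software analogy: one V-Thread spawning up to four
     * H-Threads, one per MAP cluster. POSIX threads stand in for
     * hardware thread slots; this is not the M-Machine interface. */
    #include <pthread.h>
    #include <stdio.h>

    #define CLUSTERS 4

    static void *h_thread(void *arg) {
        int cluster = *(int *)arg;
        printf("H-Thread running on cluster %d\n", cluster);
        return NULL;
    }

    int main(void) {
        pthread_t t[CLUSTERS];
        int id[CLUSTERS];
        for (int i = 0; i < CLUSTERS; i++) {   /* "hfork" analogue */
            id[i] = i;
            pthread_create(&t[i], NULL, h_thread, &id[i]);
        }
        for (int i = 0; i < CLUSTERS; i++)     /* join all H-Threads */
            pthread_join(t[i], NULL);
        return 0;
    }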
Concurrency Model
Three Levels of Parallelism

Instruction-Level Parallelism (~1 instruction)
 VLIW, superscalar processors
 Issues: control flow, data dependency, scalability
Thread-Level Parallelism (~1000 instructions)
 Chip multiprocessors
 Issues: limited coarse TLP, inner cores non-optimal
Fine-Grain Parallelism (~50–1000 instructions)
Mapping

Granularity   Program                     Architecture
ILP           Sub-expressions             Cluster
Fine-Grain    Inner loops, sub-routines   MAP chip
TLP           Outer loops, co-routines    Nodes
Fine-grain Overheads

Thread creation (11 cycles – hfork)
Communication
 Register-to-register reads/writes
 Message passing/on-chip cache
Synchronization
 Blocking on a register (full/empty bit; sketched below)
 Barrier (cbar instruction)
 Memory (sync bit)
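
Blocking on a register's full/empty bit can be mimicked in software with an atomic flag guarding a value: the consumer waits until the producer marks the word full. A minimal C11 sketch follows; on the M-Machine this is a per-register hardware bit that stalls the thread, so the flag-and-spin pattern here is purely an illustrative assumption.

    /* Software mimic of full/empty-bit synchronization: the
     * consumer blocks until the producer sets the "full" flag.
     * The atomic flag and spin loop are illustrative assumptions. */
    #include <stdatomic.h>

    static _Atomic int full = 0;   /* the "full/empty" bit */
    static long value;             /* the guarded word     */

    void produce(long v) {
        value = v;
        atomic_store_explicit(&full, 1, memory_order_release);
    }

    long consume(void) {
        while (!atomic_load_explicit(&full, memory_order_acquire))
            ;                      /* hardware would stall, not spin */
        return value;
    }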

Grid Processor Architecture
Design Motivation

Continued scaling of the clock rate
Scalability of the processing core
Higher ILP – instruction throughput (IPC)
Mitigate global wire delay overheads
Closer coupling of architecture and compiler
Architecture

An interconnected 2-D network of ALU arrays
Each node has an instruction buffer (IB) and an execution unit
A single control thread maps instructions to nodes
Block-Atomic Execution Model
 Mapping blocks of statically scheduled instructions
 Dynamic execution in dataflow order (see the sketch below)
 Forwarding temporary values to the consumer ALUs
 Critical path scheduled along the shortest physical path
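
The block-atomic model can be caricatured as a tiny dataflow interpreter: each node in the grid fires once all of its operands have arrived, then forwards its result point-to-point to its consumer. The C sketch below illustrates only that firing rule; the node layout and scan loop are assumptions for exposition, not the GPA microarchitecture.

    /* Caricature of block-atomic, dataflow-order execution:
     * a node fires when all operands have arrived and forwards
     * its result directly to its consumer node. */
    #include <stdio.h>

    typedef struct {
        int needed, arrived;   /* operands required vs. received */
        int operand[2];
        int consumer;          /* consumer node index, -1 = block output */
        int fired;
    } node;

    static void deliver(node *g, int to, int v) {
        g[to].operand[g[to].arrived++] = v;
    }

    int main(void) {
        /* block: n0 = 2 + 3; n1 = n0 * 4 (statically mapped, fired dynamically) */
        node g[2] = { { 2, 0, {0, 0}, 1, 0 }, { 2, 0, {0, 0}, -1, 0 } };
        deliver(g, 0, 2); deliver(g, 0, 3);   /* block inputs      */
        deliver(g, 1, 4);                     /* immediate operand */
        int progress = 1;
        while (progress) {
            progress = 0;
            for (int i = 0; i < 2; i++)
                if (!g[i].fired && g[i].arrived == g[i].needed) {
                    int r = (i == 0) ? g[i].operand[0] + g[i].operand[1]
                                     : g[i].operand[0] * g[i].operand[1];
                    g[i].fired = 1;
                    if (g[i].consumer >= 0) deliver(g, g[i].consumer, r);
                    else printf("block result: %d\n", r);  /* prints 20 */
                    progress = 1;
                }
        }
        return 0;
    }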

GPA Architecture
Example: Block-Atomic Mapping
Implementation

Instruction fetch and map
 Predicated hyper-blocks, move instructions
Execution – control logic
 Operand routing – max 3 destinations, split instructions
Hyper-block control
 Predication (execute-all approach), cmove instructions (sketched below)
 Block-commit
 Block-stitching
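
Under the execute-all approach, both arms of a branch are computed inside the hyper-block and a conditional move commits only the result whose predicate is true, so control never leaves the block. A minimal C sketch of the idea follows; the source-level select is what a compiler would typically lower to a cmove, and the function itself is illustrative, not GPA compiler output.

    /* Execute-all predication: both arms execute, and a
     * conditional move selects the live result, keeping the
     * hyper-block free of branches. Illustrative only. */
    int select_max(int a, int b) {
        int take_a = a > b;        /* the predicate                    */
        int va = a;                /* arm 1: computed unconditionally  */
        int vb = b;                /* arm 2: computed unconditionally  */
        return take_a ? va : vb;   /* lowered to a cmove, not a branch */
    }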

Review and Discussion
Key Ideas: Convergence

Microprocessor as a number of superscalar processors
 Communication/synchronization via registers – low overheads
Exploiting ILP to TLP granularities
Dependences mapped to a grid of ALUs
Replication reduces design/verification effort
Point-to-point communication
Exposing architecture partitioning and the flow of operations to the compiler
Avoid wire/routing delays and the memory wall problem
Ideas: Divergence

M-Machine
 On-chip cache
 Register-based mechanisms [delays]
 Broadcast and point-to-point communication
GPA
 Register set
 Grid: chaining [scalability]
 Point-to-point communication
TERA
 Fine-grain threads – memory communication/synchronization (full/empty bits)
 No support for single-threaded code
Drawbacks (Unresolved Issues)

Grid Processor Architecture
 Data caches far from ALUs
 Delays between dependent operations due to network routers and wires
 Complex frame management and block-stitching
 Explicit compiler dependence
M-Machine
 Scalability
 Clock speeds
 Memory synchronization (use of hfork)
Challenges/Future Directions

Architectural support to extract TLP
Parallelizing compiler technology
How many cores/threads?
 Number of threads – memory latency, wire delays [Flynn]
 Inter-thread communication
 Grid height of 8 (IPC 5–6) [GPA, Peter]
 Optimization as f(communication, delays, memory costs)
Challenges (contd.)

On-the-fly data-dependence detection (RAW/WAR)
TLP/ILP balance – M-Machine multicomputer
Thanks