DIS meeting (Oct, 2000) - The Parallel Systems and Computer

HiDISC: A Decoupled Architecture for
Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California
http://www-pdpc.usc.edu
October 12th, 2000
HiDISC: Hierarchical Decoupled
Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware

[Figure: the HiDISC system. Sensor inputs feed an application (FLIR, SAR, video, ATR/SLD, scientific codes); a decoupling compiler targets the HiDISC processor, whose three processors manage the registers, cache, and memory levels and maintain a dynamic database for situational awareness.]

Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Zero-overhead prefetches for maximal computation throughput
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes

Schedule (April 98 start, milestones April 99 and April 00)
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Define HiDISC architecture; benchmark results
• Continue simulations of more benchmarks (SAR)
• Develop and test a full decoupling compiler; update simulator
• Generate performance statistics and evaluate design
HiDISC: Hierarchical Decoupled
Instruction Set Computer
[Figure: the HiDISC system diagram, repeated from the quad chart above.]
Technological trend: Memory latency is getting longer relative to microprocessor speed (40% per year).
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: Benchmarks with large data sets: symbolic, signal processing, and scientific programs.
Present solutions: larger caches, hardware prefetching, software prefetching, and multithreading (surveyed on the next slide).
Present Solutions
Larger Caches
— Slow
— Work well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior is based on past and present execution-time behavior

Software Prefetching
— Ensuring that prefetch overhead does not outweigh the benefit forces conservative prefetching
— Adaptive software prefetching is required to change the prefetch distance at run-time
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
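
To make the prefetch-distance trade-off concrete, here is a minimal sketch of compiler-inserted software prefetching in C, using GCC's __builtin_prefetch; the daxpy kernel and the fixed distance D are illustrative choices, not from the slides.

/* Software prefetching with a fixed prefetch distance D. If D is too
 * small, prefetches arrive late; if too large, they evict useful data,
 * hence the adaptive schemes discussed above. */
void daxpy(double *y, const double *x, double a, int n)
{
    const int D = 16;                             /* prefetch distance, in elements */
    for (int i = 0; i < n; i++) {
        if (i + D < n) {
            __builtin_prefetch(&x[i + D], 0, 1);  /* 0 = read access */
            __builtin_prefetch(&y[i + D], 1, 1);  /* 1 = write access */
        }
        y[i] = y[i] + a * x[i];                   /* the guard and prefetches are pure overhead */
    }
}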
The HiDISC Approach
Observations:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's runtime behavior
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into three instruction streams. Computation instructions drive the Computation Processor (CP), which owns the registers; access instructions drive the Access Processor (AP), which owns the cache; cache management instructions drive the Cache Management Processor (CMP), which manages the 2nd-level cache and main memory.]
Decoupled Architectures
[Figure: four pipeline organizations, each with a total issue width of 8. MIPS (conventional): a single 8-issue Computation Processor (CP) over cache, 2nd-level cache, and main memory. DEAP (decoupled): a 3-issue CP plus a 5-issue Access Processor (AP). CAPP (decoupled): a 5-issue CP plus a 3-issue Cache Management Processor (CMP). HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP.]

DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:

if (prefetch_buffer_full())
    Don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    Increase size of SCQ;
else
    Decrease size of SCQ;

• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued
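
A minimal C sketch of this policy follows; the struct, counter names, and size bounds are ours, not from the deck.

/* SCQ sizing policy: grow the slip distance when too many prefetches
 * arrive late, shrink it when most arrive in time. */
typedef struct {
    int size;              /* current slip (prefetch) distance */
    int late_prefetches;   /* data arrived after the load issued */
    int useful_prefetches; /* data arrived before the load issued */
} scq_t;

void scq_adapt(scq_t *q, int prefetch_buffer_full, int max_size)
{
    if (prefetch_buffer_full)
        return;                          /* don't change size of SCQ */
    if (2 * q->late_prefetches > q->useful_prefetches) {
        if (q->size < max_size)
            q->size++;                   /* increase size of SCQ */
    } else if (q->size > 1) {
        q->size--;                       /* decrease size of SCQ */
    }
}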
Discrete Convolution - Inner Loop

SAQ: Store Address Queue
SDQ: Store Data Queue
SCQ: Slip Control Queue
EOD: End of Data

Inner loop convolution:

for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (not EOD)
    y = y + (x * h);
send y to SDQ

Access Processor code:

for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management Processor code:

for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}
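
To see how the CP consumes what the AP produces, here is a single-threaded C model of the load data queue between the two; the queue size and helper names are ours, and in the real machine the processors run concurrently, with the queue stalling whichever side gets too far ahead.

/* Toy model of the AP -> CP load data queue. This sequential version
 * fills the queue completely before draining it, so it is only valid
 * while 2*i <= QSIZE. */
#define QSIZE 128
static double ldq[QSIZE];
static int ldq_head, ldq_tail;

static void   ldq_put(double v) { ldq[ldq_tail++ % QSIZE] = v; }
static double ldq_get(void)     { return ldq[ldq_head++ % QSIZE]; }

/* Access Processor: performs all loads for iteration i. */
static void access_processor(const double *x, const double *h, int i)
{
    for (int j = 0; j < i; j++) {
        ldq_put(x[j]);
        ldq_put(h[i - j - 1]);
    }
}

/* Computation Processor: pure arithmetic on queued operands,
 * never computes an address. */
static double computation_processor(int i)
{
    double y = 0.0;
    for (int j = 0; j < i; j++)
        y += ldq_get() * ldq_get();
    return y;
}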
Benchmarks
Benchmark   Source of Benchmark                Lines of Source Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]               20                     1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops                    24                     1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops                    18                     1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops                    25                     1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops                    17                     1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]                      190                    33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]                    113                    Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels                        156                    Cholsky matrix decomposition                 724 KB
VPENTA      NAS kernels                        199                    Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort sorting algorithm [14]   58                     Quicksort                                    128 KB
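
For reference, LLL1 is built on the first Livermore loop (Kernel 1, the "hydro fragment"); the C signature below is ours.

/* Livermore Kernel 1 ("hydro fragment"). With the data set above,
 * n = 1024; z must hold at least n + 11 elements. */
void lll1(double *x, const double *y, const double *z,
          double q, double r, double t, int n)
{
    for (int k = 0; k < n; k++)
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
}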
Simulation
Parameter                 Value                     Parameter                  Value
L1 cache size             4 KB                      L2 cache size              16 KB
L1 cache associativity    2                         L2 cache associativity     2
L1 cache block size       32 B                      L2 cache block size        32 B
Memory latency            Variable (0-200 cycles)   Memory contention time     Variable
Victim cache size         32 entries                Prefetch buffer size       8 entries
Load queue size           128                       Store address queue size   128
Store data queue size     128                       Total issue width          8
Simulation Results
[Figure: speedup versus main memory latency (0, 40, 80, 120, 160, and 200 cycles) for LLL3, Tomcatv, Vpenta, and Cholsky, comparing MIPS, DEAP, CAPP, and HiDISC.]
Accomplishments

• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
• 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Work in Progress
• Compiler design
• Data Intensive Systems (DIS) benchmarks analysis
• Simulator update
• Parameterization of silicon space for VLSI implementation
Compiler Requirements

• Source language flexibility
• Sequential assembly code for streaming
• Ease of implementation
• Optimality of sequential code
• Portability and upgradability
Gcc-2.95 Features

• Localized register spilling, and global common subexpression elimination using lazy code motion algorithms
• Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front ends for C, C++, and Fortran available
+ Support for different environments and platforms
+ Cross compilation
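
As a small illustration of global common subexpression elimination (the example is ours):

/* Before GCSE: x * y is computed on both paths. */
int before(int x, int y, int c) { return c ? x * y + 1 : x * y - 1; }

/* After GCSE: the common subexpression is evaluated once. */
int after(int x, int y, int c) { int t = x * y; return c ? t + 1 : t - 1; }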
Compiler Organization

[Figure: HiDISC compilation overview. GCC compiles the source program to sequential assembly code; the Stream Separator splits it into computation, access, and cache management assembly code; each stream is then assembled separately into computation, access, and cache management object code.]
HiDISC Stream Separator
[Figure: stream separator pipeline. From the sequential source, build the program flow graph, classify address registers, and allocate instructions to the computation and access streams (current work). Then fix conditional statements, move queue accesses into instructions, move loop invariants out of loops, add Slip Control Queue instructions, substitute prefetches for loads, remove global stores, reverse the SCQ direction, add global data communication and synchronization, and produce the computation, access, and cache management assembly code (future work).]
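
A hypothetical sketch of the "allocate instructions to streams" step; the types and the rule below are our simplification, not the separator's actual algorithm.

/* An instruction joins the access stream if it touches memory or defines
 * a register classified as an address register; otherwise it stays in
 * the computation stream. */
typedef enum { COMPUTATION_STREAM, ACCESS_STREAM } stream_t;

typedef struct {
    int is_mem_op;   /* load or store */
    int dest_reg;    /* destination register number, or -1 if none */
} insn_t;

stream_t allocate_stream(const insn_t *insn, const int *is_addr_reg)
{
    if (insn->is_mem_op)
        return ACCESS_STREAM;
    if (insn->dest_reg >= 0 && is_addr_reg[insn->dest_reg])
        return ACCESS_STREAM;    /* feeds later address computation */
    return COMPUTATION_STREAM;
}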
Compiler Front End Optimizations

• Jump Optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
• Jump Threading: detect a conditional jump that branches to an identical or inverse test
• Delayed Branch Execution: find instructions that can go into the delay slots of other instructions
• Constant Propagation: propagate constants into a conditional loop
Compiler Front End Optimizations (contd.)

• Instruction Combination: combine groups of two or three instructions that are related by data flow into a single instruction
• Instruction Scheduling: look for instructions whose outputs will not be available by the time they are used in subsequent instructions
• Loop Optimizations: move constant expressions out of loops and perform strength reduction
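
A small before/after illustrating the loop optimizations named above (the example is ours):

/* Before: x * y is loop-invariant, and i * 8 costs a multiply
 * on every iteration. */
void fill_before(int *a, int n, int x, int y)
{
    for (int i = 0; i < n; i++)
        a[i] = i * 8 + x * y;
}

/* After: the invariant is hoisted out of the loop, and the multiply
 * is strength-reduced to a running addition. */
void fill_after(int *a, int n, int x, int y)
{
    int inv = x * y;
    for (int i = 0, i8 = 0; i < n; i++, i8 += 8)
        a[i] = i8 + inv;
}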
Example of Stressmarks

• Pointer Stressmark
  – Basic idea: repeatedly follow pointers to randomized locations in memory
  – The memory access pattern is unpredictable: the randomized pattern provides insufficient temporal and spatial locality for conventional cache architectures
  – The HiDISC architecture provides lower memory access latency
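
The core access pattern looks roughly like this; a hypothetical sketch, not the DIS reference code.

/* Pointer chasing: each load's address depends on the previous load's
 * value, so neither a cache nor a simple prefetcher can anticipate it.
 * start must be < n. */
#include <stddef.h>

size_t chase(const size_t *field, size_t n, size_t start, size_t hops)
{
    size_t idx = start;
    for (size_t h = 0; h < hops; h++)
        idx = field[idx] % n;   /* next location known only after the load */
    return idx;
}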
Decoupling of Pointer Stressmarks
Inner loop (sequential code):

for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance+high == w/2) break;
else if (balance+high > w/2) {
    min = partition;
} else {
    max = partition;
    high++;
}

Computation Processor code:

while (not EOD)
    if (field > partition) balance++;
if (balance+high == w/2) break;
else if (balance+high > w/2) {
    min = partition;
} else {
    max = partition;
    high++;
}

Access Processor code:

for (i = j+1; i < w; i++) {
    load (field[index+i]);
    GET_SCQ;
}
send (EOD token)

Cache Management Processor code (inner loop for the next indexing):

for (i = j+1; i < w; i++) {
    prefetch (field[index+i]);
    PUT_SCQ;
}
Stressmarks

• Hand-compile the 7 individual benchmarks
  – Use gcc as the front end
  – Manually partition each of the three instruction streams and insert synchronizing instructions
• Evaluate architectural trade-offs
  – Updated simulator characteristics such as out-of-order issue
  – Large L2 caches and enhanced main memory systems such as Rambus and DDR
Simulator Update

• Survey current processor architectures
  – Focus on commercial leading-edge technology for implementation
• Analyze the current simulator and previous benchmark results
  – Enhance memory hierarchy configurations
  – Add out-of-order issue
Memory Hierarchy

• Modern processors have increasingly large on-chip L2 caches
  – e.g., 256 KB L2 cache on Pentium and Athlon processors
  – Reduces the L1 cache miss penalty
• New mechanisms in main memory architecture (e.g., RAMBUS) likewise reduce the L2 cache miss penalty
Out-of-Order Multiple Issue

• Most current advanced processors are based on the superscalar, multiple-issue paradigm
  – MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
• Compare the HiDISC architecture with modern superscalar processors
  – Out-of-order instruction issue
  – In-order completion for precise exception handling
• New access decoupling paradigm for out-of-order issue
HiDISC with Modern DRAM Architecture

• RAMBUS and DDR DRAM improve memory bandwidth
  – Latency does not improve significantly
• A decoupled access processor can fully utilize the enhanced memory bandwidth
  – The access processor generates more outstanding requests
  – The prefetching mechanism hides memory access latency
HiDISC / SMT

• The reduced memory latency of HiDISC can
  – decrease the number of threads needed by an SMT architecture
  – relieve the memory burden of an SMT architecture
  – lessen the complex issue logic of multithreading
• Functional unit utilization can increase with multithreading features on HiDISC
  – More instruction-level parallelism is possible
The McDISC System: Memory-Centered
Distributed Instruction Set Computer
[Figure: the McDISC system. The compiler splits a program into computation, access, cache management, and network management instructions, driving the Computation Processor (CP, with register links to CP neighbors), the Access Processor (AP), the Cache Management Processor (CMP), and the Network Interface Processor (NIP); a Disc Processor (DP) manages main memory, a disc cache, and a RAID disc farm. An Adaptive Signal PIM (ASP) takes sensor inputs (FLIR, SAR, video, ESS) into the dynamic database, and an Adaptive Graphics PIM (AGP) drives the displays and network. Nodes are connected in a 3-D torus (X, Y, Z) of pipelined rings. The system supports understanding, inference analysis, decision process, targeting, and situation awareness.]
Summary

• Designing a compiler
  – Porting gcc to HiDISC
• Benchmark simulation with new parameters and the updated simulator
• Analysis of architectural trade-offs for equal silicon area
• Hand-compilation of the Stressmark suite and simulation
• DIS benchmarks simulation