Parallel On-chip Simultaneous Multithreading


HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California
http://www-pdpc.usc.edu
28 Sep. 2000
Personnel
PIs: Alvin M. Despain and Jean-Luc Gaudiot
Graduate students:
• Manil Makhija
• Wonwoo Ro
• Seongwon Lee
• Steve Jenks
USC Parallel and Distributed Processing Center 2
HiDISC: Hierarchical Decoupled
Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware

[Figure: HiDISC processor concept. Sensor inputs (FLIR, SAR, video, ATR/SLD, scientific applications) feed a decoupling compiler, which drives a hierarchy of processors (registers, cache, memory) and a dynamic database for situational awareness.]

Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Zero-overhead prefetches for maximal computation throughput
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes

Schedule
April 98 (start) to April 99:
• Define HiDISC architecture
• Define benchmarks
• Complete simulator
• Perform instruction-level simulations on hand-compiled benchmarks
• Benchmark results
April 99 to April 00:
• Continue simulations of more benchmarks (SAR)
• Develop and test a full decoupling compiler
• Update simulator
• Generate performance statistics and evaluate design
USC Parallel and Distributed Processing Center 3
Outline
• HiDISC Project Description
• Experiments and Accomplishments
• Schedule and Work in Progress
• Future Avenues
• Summary
USC Parallel and Distributed Processing Center 4
HiDISC: Hierarchical Decoupled
Instruction Set Computer
[Figure: HiDISC processor concept, as on the previous slide: sensor inputs (FLIR, SAR, video, ATR/SLD, scientific applications), a decoupling compiler, a processor hierarchy (registers, cache, memory), and a dynamic database for situational awareness.]

Technological trend: Memory latency is getting longer relative to microprocessor speed (by roughly 40% per year).
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: Benchmarks with large data sets: symbolic, signal processing, and scientific programs.
Present solutions: Multithreading (homogeneous), larger caches, hardware prefetching, software prefetching.
USC Parallel and Distributed Processing Center 5
Present Solutions
Solution | Limitations
Larger caches | Slow; works well only if the working set fits in the cache and there is temporal locality
Hardware prefetching | Cannot be tailored for each application
Software prefetching | Overheads of prefetching must not outweigh the benefits, which forces conservative prefetching; behavior is based on past and present execution-time behavior; adaptive software prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns
Multithreading | Solves the throughput problem, not the memory latency problem
USC Parallel and Distributed Processing Center 6
The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, which is useful for speculative prefetching
Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's runtime behavior
USC Parallel and Distributed Processing Center 7
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput

[Figure: The compiler splits the program into three instruction streams: computation instructions for the Computation Processor (CP, which owns the registers), access instructions for the Access Processor (AP, which owns the cache), and cache management instructions for the Cache Management Processor (CMP, which manages the 2nd-level cache and main memory).]
USC Parallel and Distributed Processing Center 8
Decoupled Architectures
[Figure: Four configurations, each with registers, a cache, and a 2nd-level cache and main memory; total issue width is 8 in every case.]
• MIPS (conventional): a single 8-issue processor
• DEAP (decoupled): 3-issue Computation Processor (CP) + 5-issue Access Processor (AP)
• CAPP (decoupled): 5-issue CP + 3-issue Cache Management Processor (CMP)
• HiDISC (new decoupled): 2-issue CP + 3-issue AP + 3-issue CMP

DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM
USC Parallel and Distributed Processing Center 9
Slip Control Queue
The Slip Control Queue (SCQ) adapts its size dynamically (a C sketch of the policy follows below):

if (prefetch_buffer_full())
    don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;

• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued
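
A minimal C sketch of the adaptation policy above; the identifiers (adapt_scq, scq_size, prefetch_buffer_full, late_prefetches, useful_prefetches) are illustrative placeholders, not names from the HiDISC simulator.

/* Adjust the SCQ size once per sampling interval, following the policy above. */
void adapt_scq(int *scq_size, int prefetch_buffer_full,
               unsigned late_prefetches, unsigned useful_prefetches)
{
    if (prefetch_buffer_full) {
        /* Prefetch buffer is full: leave the slip distance unchanged. */
    } else if (2 * late_prefetches > useful_prefetches) {
        (*scq_size)++;      /* too many late prefetches: slip further ahead */
    } else if (*scq_size > 1) {
        (*scq_size)--;      /* prefetches mostly timely: shrink the slip */
    }
}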
USC Parallel and Distributed Processing Center 10
Discrete Convolution - Inner Loop
The sequential loop and its three decoupled streams (a plain C sketch follows after the listings):

Inner loop convolution (sequential code):

for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (not EOD)
    y = y + (x * h);
send y to SDQ

Access Processor code:

for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management code:

for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue
SDQ: Store Data Queue
SCQ: Slip Control Queue
EOD: End of Data
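
The sketch below (ours, not HiDISC code) mimics this decoupling in plain, sequential C: the Access Processor loop fills a load data queue and the Computation Processor loop drains it. In hardware the streams run concurrently and the queues provide flow control; running the AP loop to completion first is only a simplification for the sketch.

#include <stdio.h>

#define N 16

static double x[N], h[N], y[N];

static double ldq[2 * N];             /* load data queue (AP -> CP) */
static int ldq_head, ldq_tail;

static void ap_push(double v) { ldq[ldq_tail++] = v; }    /* models a "load" result */
static double cp_pop(void)    { return ldq[ldq_head++]; } /* CP reads its operand   */

int main(void)
{
    for (int k = 0; k < N; k++) { x[k] = k; h[k] = 1.0; }

    int i = N - 1;   /* one outer iteration of the convolution */

    /* Access Processor stream: issue the loads and enqueue their results. */
    for (int j = 0; j < i; ++j) {
        ap_push(x[j]);               /* load (x[j])      */
        ap_push(h[i - j - 1]);       /* load (h[i-j-1])  */
    }

    /* Computation Processor stream: consume operands from the queue. */
    double acc = 0.0;
    for (int j = 0; j < i; ++j) {
        double xv = cp_pop();
        double hv = cp_pop();
        acc += xv * hv;              /* y = y + (x * h) */
    }
    y[i] = acc;                      /* models "send y to SDQ" and the store via SAQ */

    printf("y[%d] = %f\n", i, y[i]);
    return 0;
}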
USC Parallel and Distributed Processing Center 11
Benchmarks
Benchmark | Source of Benchmark | Lines of Source Code | Description | Data Set Size
LLL1 | Livermore Loops [45] | 20 | 1024-element arrays, 100 iterations | 24 KB
LLL2 | Livermore Loops | 24 | 1024-element arrays, 100 iterations | 16 KB
LLL3 | Livermore Loops | 18 | 1024-element arrays, 100 iterations | 16 KB
LLL4 | Livermore Loops | 25 | 1024-element arrays, 100 iterations | 16 KB
LLL5 | Livermore Loops | 17 | 1024-element arrays, 100 iterations | 24 KB
Tomcatv | SPECfp95 [68] | 190 | 33x33-element matrices, 5 iterations | <64 KB
MXM | NAS kernels [5] | 113 | Unrolled matrix multiply, 2 iterations | 448 KB
CHOLSKY | NAS kernels | 156 | Cholsky matrix decomposition | 724 KB
VPENTA | NAS kernels | 199 | Invert three pentadiagonals simultaneously | 128 KB
Qsort | Quicksort sorting algorithm [14] | 58 | Quicksort | 128 KB
USC Parallel and Distributed Processing Center 12
Simulation
Parameter | Value
L1 cache size | 4 KB
L1 cache associativity | 2
L1 cache block size | 32 B
L2 cache size | 16 KB
L2 cache associativity | 2
L2 cache block size | 32 B
Memory latency | Variable (0-200 cycles)
Memory contention time | Variable
Victim cache size | 32 entries
Prefetch buffer size | 8 entries
Load queue size | 128
Store address queue size | 128
Store data queue size | 128
Total issue width | 8
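
For reference, the configuration above could be captured in C as follows; this is an illustrative sketch, and the struct and field names are ours, not the HiDISC simulator's.

/* Simulated memory hierarchy and queue configuration from the table above. */
struct MemHierarchyConfig {
    int l1_size_kb, l1_assoc, l1_block_b;
    int l2_size_kb, l2_assoc, l2_block_b;
    int mem_latency_min, mem_latency_max;     /* cycles; contention time also varies */
    int victim_cache_entries, prefetch_buffer_entries;
    int load_queue, store_addr_queue, store_data_queue;
    int total_issue_width;
};

static const struct MemHierarchyConfig hidisc_sim = {
    .l1_size_kb = 4,  .l1_assoc = 2, .l1_block_b = 32,
    .l2_size_kb = 16, .l2_assoc = 2, .l2_block_b = 32,
    .mem_latency_min = 0, .mem_latency_max = 200,
    .victim_cache_entries = 32, .prefetch_buffer_entries = 8,
    .load_queue = 128, .store_addr_queue = 128, .store_data_queue = 128,
    .total_issue_width = 8,
};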
USC Parallel and Distributed Processing Center 13
Simulation Results
[Figure: Results for LLL3, Tomcatv, Vpenta, and Cholsky as a function of main memory latency (0 to 200 cycles), comparing the MIPS, DEAP, CAPP, and HiDISC configurations.]
USC Parallel and Distributed Processing Center 14
Accomplishments
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
• 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
USC Parallel and Distributed Processing Center 15
Schedule
April 98 (start) to April 99:
• Define HiDISC architecture
• Define benchmarks
• Complete simulator
• Perform instruction-level simulations on hand-compiled benchmarks
• Benchmark results
April 99 to April 00:
• Continue simulations of more benchmarks (ATR/SLD)
• Develop and test a full decoupling compiler
• Update simulator
• Generate performance statistics and evaluate design
USC Parallel and Distributed Processing Center 16
Tasks
Compiler design:
• Selecting a front end
• Program flow analysis
• Control flow analysis
• Separating instructions into streams
• Compiler optimizations

Simulator update

Simulation:
• Simulate DIS benchmarks
• Simulate Stressmarks

Benchmark analysis:
• Analyze and hand-compile the Stressmarks
• Analyze the DIS benchmarks
USC Parallel and Distributed Processing Center 17
Work in Progress
• Compiler design
• Data Intensive Systems (DIS) benchmark analysis
• Simulator update
• Parameterization of silicon space for VLSI implementation
USC Parallel and Distributed Processing Center 18
Compiler Requirements
• Source language flexibility
• Sequential assembly code for streaming
• Ease of implementation
• Optimality of the sequential code
• Portability and upgradability
USC Parallel and Distributed Processing Center 19
Compiler Front-ends
• Trimaran is a compiler infrastructure for supporting research in compiling for ILP architectures. The system provides explicit support for EPIC architectures.
  – Designed for the HPL-PD architecture
• SUIF (Stanford University Intermediate Format) provides a platform for research on compiler techniques for high-performance machines.
  – Suitable for high-level optimization
• GCC is part of the GNU Project, aiming at improving the compiler used in the GNU system. The GCC development effort uses an open development environment and supports many platforms.
USC Parallel and Distributed Processing Center 20
Trimaran Architecture
• Machine description facility (MDES)
  – Describes ILP architectures
• Compiler front end (IMPACT) for C
• Compiler back end (ELCOR), parameterized by MDES, which performs machine-dependent optimizations
+ Support for predication
+ Explicit support for EPIC architectures
+ Software pipelining
- Low portability
- Currently only a C front end is available
USC Parallel and Distributed Processing Center 21
SUIF Architecture
• Part of the national compiler infrastructure project
• Designed to support collaborative research in optimizing and parallelizing compilers
• Originally designed to support high-level program analysis of C and Fortran programs
+ Highly modular
+ Portable
- More suited to high-level optimizations
- Only has a front end for C
USC Parallel and Distributed Processing Center 22
Gcc-2.95 Features
• Localized register spilling and global common subexpression elimination using lazy code motion algorithms
• Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front ends for C, C++, and Fortran available
+ Support for different environments and platforms
+ Cross compilation
- Theoretical power and simplicity are secondary
USC Parallel and Distributed Processing Center 23
Compiler Organization
Source Program
-> GCC: sequential assembly code
-> Stream Separator: computation assembly code, access assembly code, and cache management assembly code
-> Assembler (one per stream): computation object code, access object code, and cache management object code

HiDISC Compilation Overview
USC Parallel and Distributed Processing Center 24
HiDISC Stream Separator
[Figure: Stream separator flow; the slide marks each step as either current or future work.]
Sequential source
-> Program flow graph
-> Classify address registers
-> Allocate instructions to streams: computation stream and access stream (a simplified sketch follows below)
-> Fix conditional statements
-> Move queue accesses into instructions
-> Move loop invariants out of the loop
-> Add Slip Control Queue instructions
-> Substitute prefetches for loads, remove global stores, and reverse SCQ direction
-> Add global data communication and synchronization
-> Produce assembly code: computation assembly code, access assembly code, and cache management assembly code
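
Below is a highly simplified sketch of the stream-allocation step referenced above. It is ours, not the HiDISC stream separator: it only routes a toy instruction list into an access or computation stream based on whether the instruction touches memory, whereas the real separator works on the program flow graph and also rewrites loads into queue reads and prefetches.

#include <stdio.h>

/* Toy instruction record; is_memory_op marks loads and stores. */
struct Instr { const char *text; int is_memory_op; };

int main(void)
{
    struct Instr prog[] = {
        { "load  r1, x[j]",     1 },
        { "load  r2, h[i-j-1]", 1 },
        { "mul   r3, r1, r2",   0 },
        { "add   r4, r4, r3",   0 },
        { "store y[i], r4",     1 },
    };

    /* Route each instruction to the access or computation stream. */
    for (size_t k = 0; k < sizeof prog / sizeof prog[0]; k++)
        printf("%-9s %s\n",
               prog[k].is_memory_op ? "ACCESS:" : "COMPUTE:",
               prog[k].text);
    return 0;
}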
USC Parallel and Distributed Processing Center 25
Compiler Front End Optimizations
• Jump optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
• Jump threading: detect a conditional jump that branches to an identical or inverse test
• Delayed branch execution: find instructions that can go into the delay slots of other instructions
• Constant propagation: propagate constants into a conditional loop (example below)
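
A small illustrative example of constant propagation into a loop bound (ours, not GCC output; the function names are hypothetical):

/* Before: 'limit' is a known constant but is still read as a variable. */
int sum_before(const int *a)
{
    int limit = 8;
    int s = 0;
    for (int i = 0; i < limit; i++)
        s += a[i];
    return s;
}

/* After propagation the bound is the literal 8, which enables further
   optimizations such as complete unrolling. */
int sum_after(const int *a)
{
    int s = 0;
    for (int i = 0; i < 8; i++)
        s += a[i];
    return s;
}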
USC Parallel and Distributed Processing Center 26
Compiler Front End Optimizations
(contd.)
• Instruction combination: combine groups of two or three instructions that are related by data flow into a single instruction
• Instruction scheduling: look for instructions whose output will not be available by the time it is used in subsequent instructions
• Loop optimizations: move constant expressions out of loops and perform strength reduction (example below)
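
A small illustrative example of the loop optimizations above (ours, not GCC output; the function names are hypothetical), showing loop-invariant code motion and strength reduction of the index computation:

/* Before: row*n and k+1.0 are recomputed on every iteration. */
void scale_row(double *a, int n, int row, double k)
{
    for (int j = 0; j < n; j++)
        a[row * n + j] = a[row * n + j] * (k + 1.0);
}

/* After loop-invariant code motion and strength reduction. */
void scale_row_opt(double *a, int n, int row, double k)
{
    double factor = k + 1.0;        /* invariant hoisted out of the loop */
    double *p = a + (long)row * n;  /* row*n computed once; array indexing
                                       replaces the per-iteration multiply */
    for (int j = 0; j < n; j++)
        p[j] *= factor;
}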
USC Parallel and Distributed Processing Center 27
SAR-ATR/SLD Benchmarks
• Second-level detection (SLD) algorithm of the template-based automatic target recognition (ATR) benchmark
  – Uses SAR (synthetic aperture radar) images
• Compares the in-order issue superscalar MIPS (8-way) and the Cache and Prefetch Processor (CAPP)
• The ATR/SLD benchmark exhibits good temporal and spatial locality
  – Cache sizes larger than 2 KB reduce the miss rate to less than 0.3%
USC Parallel and Distributed Processing Center 28
Benchmarks (Image Understanding)

[Figure: First-level cache miss rate versus first-level cache size (1 KB to 64 KB) for the SLD/ATR benchmark and the DIS Image Understanding benchmark, comparing MIPS and CAPP.]

• DIS Image Understanding benchmark run with the smallest data input set (iu1.in)
• The miss rate is approximately 3 percent on an 8-issue MIPS processor
• Caches are not very effective at capturing the memory references
• The HiDISC processor will be a good architecture for improving the performance of the DIS Image Understanding benchmark
USC Parallel and Distributed Processing Center 29
DIS Benchmarks
• Atlantic Aerospace DIS benchmark suite:
  – Too large for hand-compiling
• Atlantic Aerospace Stressmarks suite:
  – Small data-intensive benchmark suite (< 200 lines)
  – Preliminary description released in July 2000
  – Final description (version 1.0) released in September 2000
  – Stressmarks suite to be available soon
USC Parallel and Distributed Processing Center 30
DIS Benchmarks Suite
• Application-oriented benchmarks
  – Many defense applications employ large data sets, with non-contiguous memory access and no temporal locality
• Five benchmarks in three categories
  – Model-based image generation: Method of Moments, Simulated SAR Ray Tracing
  – Target detection: Image Understanding, Multidimensional Fourier Transform
  – Database management: Data Management
USC Parallel and Distributed Processing Center 31
Stressmark Suite
Stressmark | Problem | Memory Access
Pointer | Pointer following | Small blocks at unpredictable locations; can be parallelized
Update | Pointer following with memory update | Small blocks at unpredictable locations
Matrix | Conjugate gradient simultaneous equation solver | Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse
Neighborhood | Calculate image texture measures by finding sum and difference histograms | Regular access to pairs of words at arbitrary distances
Field | Collect statistics on a large field of words | Regular, with little reuse
Corner-Turn | Matrix transposition | Block movement between processing nodes with practically nil computation
Transitive Closure | Find the all-pairs shortest-path solution for a directed graph | Dependent on matrix representation, but requires reads and writes to different matrices concurrently

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
USC Parallel and Distributed Processing Center 32
Example of Stressmarks
• Pointer Stressmark
  – Basic idea: repeatedly follow pointers to randomized locations in memory (see the sketch below)
  – The memory access pattern is unpredictable
  – A randomized access pattern leaves insufficient temporal and spatial locality for conventional cache architectures
  – The HiDISC architecture provides lower memory access latency
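
The sketch below illustrates the pointer-chasing idea (it is ours, not the DIS reference code): each element of field[] holds the index of the next element to visit, so every load depends on the previous one and the addresses are effectively random.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 16)

int main(void)
{
    unsigned *field = malloc(N * sizeof *field);
    if (!field) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so the chase
       visits every element exactly once before returning to 0. */
    for (unsigned i = 0; i < N; i++) field[i] = i;
    for (unsigned i = N - 1; i > 0; i--) {
        unsigned j = (unsigned)rand() % i;
        unsigned t = field[i]; field[i] = field[j]; field[j] = t;
    }

    /* The chase itself: each iteration's address comes from the previous
       load, defeating conventional caches and hardware prefetchers. */
    unsigned idx = 0, hops = 0;
    do {
        idx = field[idx];
        hops++;
    } while (idx != 0);

    printf("followed %u hops\n", hops);
    free(field);
    return 0;
}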
USC Parallel and Distributed Processing Center 33
Decoupling of Pointer Stressmarks
Sequential inner loop:

for (i = j + 1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) {
    min = partition;
}
else {
    max = partition;
    high++;
}

Computation Processor code:

while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) {
    min = partition;
}
else {
    max = partition;
    high++;
}

Access Processor code:

for (i = j + 1; i < w; i++) {
    load (field[index+i]);
    GET_SCQ;
}
send (EOD token)

Cache Management code (inner loop for the next indexing):

for (i = j + 1; i < w; i++) {
    prefetch (field[index+i]);
    PUT_SCQ;
}
USC Parallel and Distributed Processing Center 34
Stressmarks
• Hand-compile the seven individual Stressmarks
  – Use gcc as the front end
  – Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
• Evaluate architectural trade-offs
  – Updated simulator characteristics, such as out-of-order issue
  – Larger L2 cache and enhanced main memory systems such as RAMBUS and DDR
USC Parallel and Distributed Processing Center 35
Simulator Update
• Survey current processor architectures
  – Focus on commercial leading-edge technology for implementation
• Analyze the current simulator and previous benchmark results
  – Enhance memory hierarchy configurations
  – Add out-of-order issue
USC Parallel and Distributed Processing Center 36
Memory Hierarchy
• Current processors have increasingly large on-chip L2 caches
  – e.g., 256 KB L2 cache on Pentium and Athlon processors
  – Reduces the L1 cache miss penalty
• New main memory architectures (e.g., RAMBUS) reduce the L2 cache miss penalty
USC Parallel and Distributed Processing Center 37
Out-of-Order multiple issue
• Most current advanced processors are based on the superscalar, multiple-issue paradigm
  – MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
• Compare the HiDISC architecture with modern superscalar processors
  – Out-of-order instruction issue
  – In-order completion for precise exception handling
• New access-decoupling paradigm for out-of-order issue
USC Parallel and Distributed Processing Center 38
VLSI Layout Overhead (I)
• Goal: increase the layout effectiveness of the HiDISC architecture
• Cache has become a major portion of the chip area
• Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
• The space overhead is 11.3% over a comparable MIPS processor (143.9 mm² vs. 129.2 mm² including the on-chip L2 cache; see the table on the next slide)
USC Parallel and Distributed Processing Center 39
VLSI Layout Overhead (II)
Component | Original MIPS R10K (0.35 μm) | Extrapolation (0.15 μm) | HiDISC (0.15 μm)
D-Cache (32 KB) | 26 mm² | 6.5 mm² | 6.5 mm²
I-Cache (32 KB) | 28 mm² | 7 mm² | 14 mm²
TLB Part | 10 mm² | 2.5 mm² | 2.5 mm²
External Interface Unit | 27 mm² | 6.8 mm² | 6.8 mm²
Instruction Fetch Unit and BTB | 18 mm² | 4.5 mm² | 13.5 mm²
Instruction Decode Section | 21 mm² | 5.3 mm² | 5.3 mm²
Instruction Queue | 28 mm² | 7 mm² | 0 mm²
Reorder Buffer | 17 mm² | 4.3 mm² | 0 mm²
Integer Functional Unit | 20 mm² | 5 mm² | 15 mm²
FP Functional Units | 24 mm² | 6 mm² | 6 mm²
Clocking & Overhead | 73 mm² | 18.3 mm² | 18.3 mm²
Total Size without L2 Cache | 292 mm² | 73.2 mm² | 87.9 mm²
Total Size with on-chip L2 Cache | | 129.2 mm² | 143.9 mm²
USC Parallel and Distributed Processing Center 40
Architecture Schemes for a
Flexible HiDISC
• Efficient VLSI layout
• HiDISC with out-of-order issue
  – Trade off issue width and resources across the processor hierarchy
  – Synchronization for each of the streams
• Utilize high-bandwidth memory systems
• Multithreading HiDISC architecture (SMT & CMP)
• Multiple HiDISC
USC Parallel and Distributed Processing Center 41
Decoupled Conventional and
DRAM Processor Threads
[Figure: The compiler splits the program into conventional instructions for a conventional processor (CP) and memory management instructions for a DRAM-side access processor, with a synchronizer between the two.]

• Compile a single program into two cooperating instruction streams
  – One stream runs on a conventional processor
  – The other stream runs on a DRAM processor (such as a PIM)
USC Parallel and Distributed Processing Center 42
HiDISC with Modern DRAM
Architecture
• RAMBUS and DDR DRAM improve memory bandwidth
  – Latency does not improve significantly
• A decoupled access processor can fully utilize the enhanced memory bandwidth
  – The access processor issues more outstanding requests
  – The prefetching mechanism hides memory access latency
USC Parallel and Distributed Processing Center 43
HiDISC / SMT
• The reduced memory latency of HiDISC can
  – decrease the number of threads needed by an SMT architecture
  – relieve the memory burden of an SMT architecture
  – lessen the complexity of multithreaded issue logic
• Functional unit utilization can increase with multithreading features on HiDISC
  – More instruction-level parallelism is possible
USC Parallel and Distributed Processing Center 44
Multiple HiDISC on a Chip
[Figure: Chip organizations built from the same resources: a wide-issue SMT (single fetch/decode/rename front end, reorder buffer, instruction queues, out-of-order logic, integer and FP units, shared L2 cache); a two-processor POSM and a four-processor POSM (each processor with its own L1 instruction and data caches and TLB, sharing the L2 cache through an L2 crossbar); and an 8-processor CMP (P1 to P8, each with private L1 caches, sharing the L2 crossbar and L2 cache).]

Flexible adaptation from multiple processors to a single processor.
USC Parallel and Distributed Processing Center 45
Multiple HiDISC: McDISC
• Problem: All extant large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
• Reason: Extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs.
• The McDISC solution: Provide the network interface processor (NIP) with a programmable processor that executes not only OS code (e.g., Stanford FLASH) but also user code.
• Advantage: The NIP, executing user code, fetches data before it is needed.
• Result: Fast execution of tightly-coupled parallel programs.
• Execution models: lockstep, multithreaded, Nomadic Threads, etc.
USC Parallel and Distributed Processing Center 46
The McDISC System: Memory-Centered
Distributed Instruction Set Computer
[Figure: One McDISC node. The compiler splits the program into computation instructions for the Computation Processor (CP, with register links to neighboring CPs), access instructions for the Access Processor (AP), cache management instructions for the Cache Management Processor (CMP), and network management instructions for the Network Interface Processor (NIP), which connects to a 3-D torus of pipelined rings (X, Y, Z). The node also includes main memory, a disc cache, and a Disc Processor (DP) with a RAID disc farm; an Adaptive Signal PIM (ASP) and an Adaptive Graphics PIM (AGP) handle sensor inputs (SAR, video) and output to displays and the network; a dynamic database and SES support understanding (FLIR, SAR, video, ESS), inference analysis, and the decision process (targeting, situation awareness).]
USC Parallel and Distributed Processing Center 47
Summary
• Designing a compiler
  – Porting gcc to HiDISC
• Benchmark simulation with new parameters and an updated simulator
• Analysis of architectural trade-offs for equal silicon area
• Hand-compilation of the Stressmarks suite and simulation
• DIS benchmark simulation
USC Parallel and Distributed Processing Center 48
Benchmarks
• Tomcatv (SPEC '95)
  – Vectorized mesh-generation program
  – Six different inner loops
  – One of the loops references six different arrays
  – Cache conflicts are a problem
• MXM (NAS kernels)
  – Matrix multiply program
• Cholsky (NAS kernels)
  – Matrix decomposition/substitution program
  – High memory bandwidth requirement
USC Parallel and Distributed Processing Center 49
Benchmarks
• Vpenta (NAS kernels)
  – Inverts three pentadiagonals simultaneously
• Qsort
  – Quicksort algorithm
  – Chosen as a symbolic benchmark
  – Reference patterns are not easily predictable
USC Parallel and Distributed Processing Center 50