Adaptive Memory Reconfiguration Management: The AMRM Project

Rajesh Gupta, Alex Nicolau
University of California, Irvine
Andrew Chien
University of California, San Diego
DARPA DIS PI Meeting, Santa Fe, October 1998
Outline
• Project Drivers
– application needs for diverse (cache) memory configurations
– technology trends favoring reconfigurability in high-performance designs
• Project Goals and Deliverables
• Project Implementation Plan
• Project Team
• Summary of New Ideas Proposed by AMRM
Introduction
• Many defense applications are data-starved
– large data-sets, irregular locality characteristics
» FMM Radar Cross-section Modeling, OODB, CG
• Memory access times falling behind CPU speeds
– increased memory penalty and data starvation.
• No single architecture works well:
– Data-intensive applications need a variety of strategies to deliver high-performance according to application memory reference needs:
» multilevel caches/policies
» intelligent prefetching schemes
» dynamic “cache-like” structures: prediction tables, stream caches, victim caches
» even simple optimizations like block size selection improve performance significantly.
[Figure: memory hierarchy access costs. CPU; L1 and TLB: 3 cycles, 2 GB/s; L2: 33 cycles; memory: 57-72 cycles, bandwidth in MB/s; disk: 10^6 cycles]
Technology Evolution
[Figure: feature size (nm) versus year of shipment, 1997-2012, comparing the NTRS-94 and NTRS-97 roadmap projections]
[Figure: wire delay (ns/cm) versus year, 1989-2007]
Industry continues to outpace NTRS projections on
technology scaling and IC density.
Evolutionary growth but its effects are subtle and powerful!
Consider Interconnect
[Figure: average interconnect length and critical length (um) versus feature size (nm), showing static and dynamic interconnect across regions I-III, with a cross-over region]
Average interconnect delay is greater than the gate delays!
• Reduced marginal cost of logic coupled with signal regeneration
makes it possible to include logic in inter-block interconnect.
The Opportunity of Application-Adaptive
Architectures
• Use interconnect and data-path reconfiguration to
– adapt architectures for increased performance, combat
performance fragility and improve fault tolerance
• The technological basis of AMRM is reconfigurable hardware:
– configurable hardware is used to improve utilization of
performance critical resources (instead of using
configurable hardware to build additional resources)
– design goal is to achieve peak performance across
applications
– configurable hardware leveraged in efficient utilization of
performance critical resources
First quantitative answers to utility of architectural
adaptation provided by the MORPH Point Design Study
(PDS)
MORPH Point Design Study:
Custom Mechanisms Explored
• Combat latency deterioration
– optimal prefetching:
» “memory side pointer chasing”
– blocking mechanisms
– fast barrier, broadcast support
– synchronization support
• Bandwidth management
– memory (re)organization to suit application
characteristics
– translate and gather hardware
» “prefetching with compaction”
• Memory controller design
Adaptation for Latency Tolerance
• Operation
1. Application sets prefetch parameters (compiler controlled)
» set lower/upper bounds on memory regions (for memory protection etc.)
» download pointer extraction function
» element size
2. Prefetching event generation (runtime controlled)
» when a new cache block is filled
    if (start <= vAddr && vAddr <= end) {
        if (pAddr & 0x20)
            addr = pAddr - 0x20
        else
            addr = pAddr + 0x20
        <initiate fetch of cache line at addr to L1>
    }
[Figure: block diagram of CPU/L1, Prefetcher, and L2 Cache; signals: virtual addr./data, physical addr., data, additional addr.]
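The fill-triggered rule on this slide can be sketched in software. This is only an illustration under stated assumptions, not the AMRM hardware: the names `PrefetchConfig`, `buddy_line`, and `on_cache_fill` are hypothetical, and a 32-byte cache line is assumed because the slide adds or subtracts 0x20.

```python
from dataclasses import dataclass
from typing import Optional

LINE_SIZE = 0x20  # assumption: 32-byte cache line, matching the 0x20 offset on the slide


@dataclass
class PrefetchConfig:
    """Hypothetical holder for the region bounds set by the application."""
    start: int  # lower bound of the prefetch region (virtual address)
    end: int    # upper bound of the prefetch region (virtual address)


def buddy_line(p_addr: int) -> int:
    """Return the address in the neighbouring line of the same aligned line pair."""
    if p_addr & LINE_SIZE:
        return p_addr - LINE_SIZE
    return p_addr + LINE_SIZE


def on_cache_fill(cfg: PrefetchConfig, v_addr: int, p_addr: int) -> Optional[int]:
    """Called when a new cache block is filled.

    Returns the physical address to prefetch into L1, or None when the
    virtual address falls outside the configured region.
    """
    if cfg.start <= v_addr <= cfg.end:
        return buddy_line(p_addr)
    return None
```

The two branches are equivalent to toggling one address bit (`p_addr ^ 0x20`), so each fill inside the region requests the other half of a 64-byte aligned pair.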
Adaptation for Bandwidth Reduction
• Prefetching Entire Row/Column
• Pack Cache with Used Data Only
[Figure: program view versus physical layout. The processor accesses elements with fields val, row, col, rowPtr, colPtr; address translation maps them to a packed layout of (val, col) pairs returned to the L1 cache (val1, val2, val3). Gather logic next to the cache synthesizes pointers (+ 64) and reads (val, RowPtr, ColPtr) records from memory.]
• No Change in Program Logical Data Structures
• Partition Cache
• Translate Data
• Synthesize Pointer
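A rough software analogue of the translate-and-gather idea follows: the program keeps its linked element records, while the cache is filled with only the fields that are used. The `Node` class and `gather_row` function are hypothetical, and it is an assumption here that `rowPtr` links to the next element in the same row.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Program view of one sparse-matrix element (field names from the slide)."""
    val: float
    row: int
    col: int
    row_ptr: Optional["Node"] = None  # assumed: next element in the same row
    col_ptr: Optional["Node"] = None  # assumed: next element in the same column


def gather_row(head: Optional[Node]) -> List[Tuple[float, int]]:
    """Chase row pointers and pack only (val, col) pairs contiguously.

    The pointer and row fields are dropped, so the packed buffer holds
    two fields per element instead of five.
    """
    packed = []
    node = head
    while node is not None:
        packed.append((node.val, node.col))
        node = node.row_ptr
    return packed
```

In the hardware version described on the slide, the pointer chasing happens on the memory side, so only the packed pairs would travel to the L1 cache.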
Adaptation Results
[Figure: miss rate (%) for the Naive, SW-Blocking, HW Gather, and HW Bypass configurations: 10x reduction in miss rate.]
[Figure: data traffic (MB), read and write, for the same configurations: 100x reduction in BW.]

Hardware Block   LSI 10K Cells   Xilinx CLBs   Delay (cycles)
Prefetcher       4083            1558          3
Gather           627             1408          3
Translate        557             1378          2
Going Beyond PDS
• Memory hierarchy utilization
– estimate working set size
– memory grain size
– miss types: conflict, capacity, coherence, cold-start
– memory access patterns: sequential, stride prediction
– assess marginal miss rates and “what-if” scenarios
• Dynamic cache structures
– victim caches, stream caches, stride prediction, buffers.
• Memory bank conflicts
– detect array references that cause bank conflicts
• PE load profiling
• Continuous validation hardware
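As an illustration of the miss-type profiling listed above, the sketch below classifies misses in a block-address trace using the common convention of comparing a direct-mapped cache against a fully associative LRU cache of the same capacity. This is a minimal uniprocessor sketch, so coherence misses are not modelled; the function name `classify_misses` and the direct-mapped organization are assumptions, not part of the AMRM design.

```python
from collections import Counter, OrderedDict


def classify_misses(block_trace, num_lines):
    """Count hits and cold-start, conflict, and capacity misses.

    block_trace: iterable of block addresses (already divided by line size).
    num_lines:   capacity of both the direct-mapped and the LRU cache.
    """
    seen = set()          # blocks referenced at least once
    direct = {}           # direct-mapped cache: index -> resident block
    lru = OrderedDict()   # fully associative LRU cache of the same size
    counts = Counter()

    for block in block_trace:
        index = block % num_lines
        dm_hit = direct.get(index) == block
        fa_hit = block in lru

        # Update the fully associative reference cache.
        if fa_hit:
            lru.move_to_end(block)
        else:
            lru[block] = True
            if len(lru) > num_lines:
                lru.popitem(last=False)  # evict least recently used
        direct[index] = block

        if dm_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["cold-start"] += 1
        elif fa_hit:
            counts["conflict"] += 1   # full associativity would have hit
        else:
            counts["capacity"] += 1   # misses even with full associativity
        seen.add(block)
    return counts
```

For example, the trace `[0, 4, 0, 4]` with four lines yields two cold-start and two conflict misses, since blocks 0 and 4 compete for the same index. A high conflict count of this kind is the sort of signal that could motivate adding a victim cache.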
Challenges in Building AA Architectures
• Without automatic application analysis, application
adaptation still has to be largely handcrafted
– Compiler support for identification and use of appropriate
architectural assists is crucial
• Significant semantic loss occurs when going from
application to compiler-level optimizations.
• The runtime system must actively support
architectural customization safely.
Project Goals
• Design an Adaptive Memory Reconfiguration
Management (AMRM) system that provides
– 100X improvement in hierarchical memory system
performance over conventional static memory hierarchy in
terms of latency and available bandwidth.
• Develop compiler algorithms that statically select
adaptation of memory hierarchy on a per application
basis
• Develop operating system and architecture features
which ensure process isolation, error detection and
containment for a robust multi-process computing
environment.
Project Deliverables
• An architecture for adaptive memory hierarchy
• Architectural mechanisms and policies for efficient
memory system adaptation
• Compiler support (identification and selection) of the
machine adaptation
• OS and HW architecture features which enable
process isolation, error detection, and containment
in dynamic adaptive systems.
Impact
• Optimized data placement and movement through
the memory hierarchy
– per-application sustained performance close to peak
machine performance
– particularly for applications with non-contiguous large datasets such as
» sparse-matrix and conjugate gradient computations,
circuit simulation
» data-base (relational and object-oriented) systems
» imaging data
» security-sensitive applications
Impact (continued)
• Integration with core system mechanisms enables multi-process, robust and safe computing
– enables basic software modularity through processes on adaptive
hardware
– ensures static and dynamic adaptation will not compromise
system robustness -- errors generally confined to a single process
– provides mechanisms for online validation of dynamic adaptation
(catch compiler and hardware synthesis errors) enabling fallback
to earlier versions for correctness
• High system performance using standard CPU components
– adaptive cache management achieved using reconfigurable logic,
compiler and OS smarts
– 15-20X improvement in sparse matrix/conjugate gradient
computations
– 20X improvement in radar cross section modeling code
– high system performance without changing computation
resources preserves the DOD investment in existing software
The AMRM Project:
Enabling Co-ordinated Adaptation
[Figure: coordinated adaptation across three thrusts]
1. Flexible Memory System Architecture
– base machine: CPU, L1, TLB, L2, memory
– adaptive cache structures and adaptation logic: victim cache, stride predictor, prefetcher, stream cache, miss stride buffer, stream buffer, write buffer
2. Compiler Control of Cache Adaptation
– application analysis, compilation for adaptive memory, synthesis and mapping, adaptive machine definition, application instrumentation for runtime adaptation
3. Safe and Protected Execution
– operating system strategies, fault detection and containment, continuous validation
Project Organization
• Three coordinated thrusts
T1 design of a flexible memory system architecture
T2 compiler control of the adaptation process
T3 safe and protected execution environment
• System architecture enables machine adaptation
– by implementing architectural assists, mechanisms and
policies
• Compiler enables application-specific machine
adaptation
– by providing powerful memory behavior analysis techniques
• Protection and validation enable a robust multi-process software environment
– by ensuring process isolation and online validation
Project Personnel
• Project Co-PIs
– Professor Rajesh Gupta, UC Irvine
– Professor Alex Nicolau, UC Irvine
– Professor Andrew Chien, UC San Diego
• Collaborators
– Dr. Phil Kuekes, HP Laboratories, Palo Alto
• Research Specialist
– Dr. Alexander Veidenbaum, UC Irvine
• Graduate Research Assistants
– Prashant Arora
– Chang Chun
– Xiaomei Ji
– Weiyu Tang
– Dan Nicolaescu
– Yibo Jiang
– Rajesh Satapathy
– Louis Giannini
– Jay Byun
• Contract Technical Monitor
– Dr. Larry Carter, AIC, Fort Huachuca, AZ
Summary of New Ideas in AMRM
1. Application-adaptive architectural mechanisms and policies for
memory latency and bandwidth management:
– combat latency deterioration using hardware-assisted blocking,
prefetching
– manage bandwidth through adaptive translation, movement and
placement of application-data for the most efficient access
– cache organization, coherence, dynamic cache structures are
modified as needed by an application
2. Cache memory adaptation is driven by compiler techniques
– semantic retention applied at language and architectural levels
– control memory adaptation and maintain machine usability through
application software
3. OS and Architecture features enable process isolation and online
validation of adaptations
– OS and architecture features enable error detection, isolation and
containment; online validation extends to dynamic adaptations
– modular, robust static and dynamic reconfiguration with precise
characterization of isolation properties