Memory-architecture aware compilation

Download Report

Transcript Memory-architecture aware compilation

Technische Universität Dortmund
Automatic mapping to tightly coupled
memories and cache locking
Peter Marwedel1,2, Heiko Falk1,
Robert Pyka1, Lars Wehmeyer2
1TU
Dortmund
2Informatik Centrum Dortmund (ICD)
http://ls12-www.cs.uni-dortmund.de, http://www.icd.de
TU Dortmund
Problems with memory speeds
Speed gap between processor
and main DRAM increases
8
Speed
4
 2x
every 2
years
2
1
0
1
2
3
4
Similar problems also for
embedded systems &
MPSoCs
 In the future:
Memory access times >>
processor cycle times
 “Memory wall”
problem
5 years
[P. Machanik: Approaches to Addressing the
Memory Wall, TR Nov. 2002, U. Brisbane]
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 2-
TU Dortmund
Problems with memory energy
Memories a major
consumer of energy:
Example: mobile phone
Caches consume much
of the available energy
Clocks
4%
SysCtl
3%
CP 15
2%
PATag
RAM
1%
[O. Vargas (Infineon): Minimum power consumption in mobilephone memory subsystems; Pennwell Portable Design September 2005;] Thanks to Thorsten Koch (Nokia/ TU
Dortmund) for providing this source.
Other
4%
D Cache
19%
BIU
8%
I Cache
25%
arm9
25%
I MMU
4%
D MMU
5%
[Segars 01 according to Vahid@ISSS01]
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 3-
TU Dortmund
Tightly coupled memories/Scratch pad memories
(SPM): Fast, energy-efficient, timing-predictable
TCM/SPMs are
small, physically
separate
memories
mapped into the
address space;
Selection is by
an appropriate
address decoder
(simple!)
select
Address space
0
scratch pad memory
Small; no
tag memory
FFF..
Example
ARM7TDMI
cores, wellknown for low
power
consumption
SPM
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 4-
TU Dortmund
Migration of data and instructions to TCM/SPM
For i .{ }
Example:
for j ..{ }
while ...
Repeat
main
memory
Scratch pad
memory,
capacity SSP
Processor
Which objects (array, loop, etc.) to
be stored in SPM?
Non-overlaying (“static”)
memory allocation:
function ...
Objects always in TCM while
application is running
Array ...
Overlaying (“dynamic”)
allocation:
Array
Moving objects back and forth
between hierarchy levels
?
Int ...
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 5-
TU Dortmund
A survey of algorithms for scratchpad allocation
 Non-overlaying (“static”) approaches
Gain gk & size sk for each object k.
Maximise gain G = gk, respecting SSP   sk.  Knapsack
• Code, static data, stack, heap,
• Partitioning, handling large arrays
 Overlaying (“dynamic”) approaches
• single/multiple hierarchy levels
• for single process
• for static number of multiple processes
• for dynamic number of multiple processes
• not using/using MMU
Survey of algorithms: http://ls12-www.cs.uni-dortmund.de/publications/
papers/2007-marwedel-acaces.zip
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 6-
TU Dortmund
Partitioning
# of
partitions
number of partitions of size:
4k
2k
1k
512
256
128
64
7
0
1
1
1
1
1
2
6
0
1
1
1
1
2
0
5
0
1
1
1
2
0
0
4
0
1
1
2
0
0
0
3
0
1
2
0
0
0
0
2
0
2
0
0
0
0
0
1
1
0
0
0
0
0
0
Example of all considered memory partitions for a
total capacity of 4096 bytes
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 7-
TU Dortmund
Results for a non-overlaying approach:
parts of GSM coder/decoder
Energy model: based on
ARM evaluation board
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 8-
TU Dortmund
Using these ideas with an gcc-based tool flow
Source is split into 2 different files by specially developed
memory optimizer tool *.
main mem. src
application
source
.c
.txt
.c
Memory
optimizer
(Incl.
ICD-C*)
ARM-GCC
Compiler
spm src.
.c
ARM-GCC
Compiler
profile Info.
.ld
linker
linker script
*Built with new tool design suite ICD-C
available from ICD (see www.icd.de/es)
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
.exe
- 9-
TU Dortmund
Hybrid Context Switch
Saving/Restoring at
context switch
Process P1
Saving/Restoring
P3
at context switch
P2
Process P2
P1
Process P3
Process
ProcessP2
P1
P3
P1,P2, P3
Hybrid Context Switch (Hybrid)
 Disjoint + Shared SPM regions
 Good for all scratchpads
Scratchpad
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 10 -
TU Dortmund
Multi-process Scratchpad Allocation: Results
SPA: Single Process Approach
Energy Consumption (mJ)
160
150
140
Energy (SPA)
Energy (Non-Saving)
Energy (Saving)
CopyEnergy (Saving)
Energy (Hybrid)
CopyEnergy (Hybrid)
130
120
27%
110
100
90
80
64
128
256
512
1024
2048
4096
Scratchpad Size (bytes)




For small SPMs (64B-512B) Saving is better
For large SPMs (1kB- 4kB) Non-Saving is better
Hybrid is the best for all SPM sizes.
Energy reduction @ 4kB SPM is 27% for Hybrid approach
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 11 -
TU Dortmund
Approach overview
 2 steps: compile-time analysis & runtime decisions
 No need to know all applications at compile-time
 Capable of managing runtime allocated memory objects
 Integrated into an embedded operating system
App. 1
App. 2
Compile-time
Transformations
Standard
Compiler
(GCC)
App. n
Profit values / Allocation hints
Allocation
Manager
Operating
System RTEMS
Using MPArm simulator from U. Bologna
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 12 -
TU Dortmund
Comparison of SPMM to Caches for SORT
 Baseline: Main memory only
 SPMM peak energy reduction by
83% at 4k Bytes scratchpad
 Cache peak: 75% at 2k 2-way
cache
SPM Size
Δ 4-way
1024
74,81%
2048
65,35%
4096
64,39%
8192
65,64%
16384
63,73%
 SPMM capable of
outperforming caches
 OS and libraries are not
considered yet
Chunk allocation results:
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 13 -
TU Dortmund
Worst case timing analysis using aiT
C program
SPM size
memory-aware
compiler
ARMulator
Actual
performance
aiT
Worst case
execution time
executable
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 14 -
TU Dortmund
Results for G.721
Using Scratchpad:
Using Unified Cache:
aiT timing analyzer
 L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop
on worst-case execution time analysis, (WCET), 2004
 L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained
Embedded Software, Design Automation and Test in Europe (DATE), 2005
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 15 -
TU Dortmund
Locking of I-Caches
Many caches allow “locking” or “freezing” (no replacements).
Can be used to improve timing predictability:
 Load promising Functions into Cache
Optimizations:
 Worst case paths can change during optimization
 Requires coupling of timing analysis tool and compiler
ANSI-C
source
Compiler
Linker
startup
code
WCEToptimized exe
Linker
exe
ait WCET
Analysis
I-Cache
Optimizer
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
[Heiko Falk, Sascha
Plazar, Henrik Theiling:
Compile-Time Decided
Instruction Cache
Locking Using WorstCase Execution Paths,
CODES/ISSS, 2007]
- 16 -
TU Dortmund
Relative WCETs after I-Cache Locking
110%
ADPCM
G723
Statemate
Compress
MPEG2
4096
8192
100%
90%
Rel. WCET [%]
80%
70%
60%
50%
40%
30%
20%
10%
0%
(ARM920T)
64
128
256
512
1024
2048
Cache Size [bytes]
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 17 -
16384
TU Dortmund
More information
2006
2007
• http://ls12-www.cs.uni-dortmund.de/publications/papers/2007-marwedelacaces.zip
• http://ls12-www.cs.uni-dortmund.de/publications/global_index.html
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 18 -
TU Dortmund
Conclusion
 Major impact of the memory system on system speed,
energy consumption and timing predictability.
 Memory hierarchies comprising TCMs/SPMs are fast,
energy-efficient and timing predictable.
 Algorithms have been designed for
• Code, static data, stack, heap
• Single and multiple memory hierarchy levels
• non-overlaying and overlaying allocation
• saving, non-saving and hybrid context switches
• mono- and multiprocessor systems.
 Very large improvements in terms of the considered
figures of merit.
 Compatible with existing tool flows
 P. Marwedel, TU Dortmund/Informatik 12 + ICD/ES, 2007
- 19 -