An Analysis of Compression in Improving Memory System
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
THE PERFORMANCE ADVANTAGE
OF APPLYING COMPRESSION TO
THE MEMORY SYSTEM
NIHAR MAHAPATRA, JIANGJIANG LIU,
KRISHNAN SUNDARESAN
{mahapatr, jliu3, ks48}@cse.buffalo.edu
AND
SRINIVAS DANGETI,
BALAKRISHNA VENKATRAO
{srinivas.dangeti, balakrishna.venkatrao}@eng.sun.com
ACM SIGPLAN WORKSHOP ON MEMORY SYSTEM PERFORMANCE
(MSP 2002), BERLIN, GERMANY, JUNE 16, 2002
Department of Computer Science & Engineering
University at Buffalo, State University of New York, NY, U.S.A.
© 2002
Computer System Components
Information types
Addresses
Instructions
Data
Computation system: Processor core
Memory system
Storage components: Store information
Registers
TLB, Caches
Main memory
Communication components: Communicate information
I/O buffers and signaling/driving circuitry
I/O pads
Pins
On- and off-chip buses
I/O system: Secondary and tertiary storage I/O, network I/O
May 25, 2002
WMPI 2002 © N. R. Mahapatra, J. Liu, K. Sundaresan, S. Dangeti, and B. Venkatrao
2
Performance Motivation: Computation System Performance Growth
Raw computation system performance growth
Increasing device integration
Larger die area
Rising clock frequency
Architectural advancements: Increasing levels of parallelism
Bit level
Instruction level
Thread level
Processor level
Memory system requirements to avoid performance bottlenecks
Store increasing amounts of information
Supply this information at high enough bandwidth and low enough
latency
Performance Motivation: Memory System Solutions & Limitations
Memory system solutions
Increasing # of I/O buffers, pads, pins
Wider buses
Increasing the number of registers, the sizes and number of levels of caches, and the main memory size
Improved design
Limitations
While # of transistors on a chip is limited by volume of die,
communication components are constrained by surface area of die
More stringent constraints on clock speeds at which external pins
can be driven
Since interconnect size does not scale as well as on-chip logic size,
on- and off-chip buses have higher capacitance and delays
DRAM bandwidth and latency improving at a much slower rate
compared to computation system performance
Memory system is increasingly becoming a performance
bottleneck
Power, Cost, and Other Motivations
Increasing fraction of processor chip devoted to storage and
communication components
Number and size of off-chip storage and communication
components also increasing
Interconnect size does not scale as well as on-chip logic size
and therefore interconnect consumes an increasing fraction of
die area and power
In the deep-submicron regime, not only individual wire capacitance but also inter-wire
capacitance contributes increasingly to power consumption, making interconnect power
consumption even more important [Sotiriadis & Chandrakasan, CICC, 2000]
Exploiting the fact that compression allows more efficient use
of storage and communication resources, one can derive other
benefits, such as higher fault tolerance
Outline
Problem and Motivation
Compressed Memory System (CMS) Architectures
Performance, power, and cost benefits
Opportunities for address, instruction, and data compression
Feasibility of compression
Classification of CMS architectures
Related Work
Target System, Simulation Methodology, Analysis Measures
Simulation Results
Conclusions and Future Work
Performance Benefits of Compression
Storage capacity
More information can be stored using the same # of bits
Bandwidth
More information per unit time can be transmitted using the same
bus width
Latency
Latency can potentially improve too: compression can reduce the size of
components while keeping their effective capacity the same, thereby
improving their access latencies, because smaller is faster
Latency Benefit: Storage Components –
Cache Example
Original cache: size S, hit rate H, miss rate m, miss penalty P
Latency: L = H + m • P
Cache with compression, partitioned so that S = Su + Sc
Uncompressed partition: size Su, hit rate Hu, miss rate mu
Compressed partition: size Sc, hit rate Hc, miss rate mc, decompression penalty Pd
Latency: Lu+c = Hu/c + mu • (1 - mc) • Pd + mu • mc • P, where Hu/c = max(Hu, Hc)
Relations: Hu/c < H; mu = m + e; mu • mc < m
If Hu/c ~ H, mu ~ m, mc ~ 0.5, Pd ~ 0.5 • P, then Lu+c ~ L – 0.25 • m • P
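As a quick sanity check, the latency expressions on this slide can be evaluated numerically. The parameter values below are illustrative assumptions (hit latency normalized to 1, a 5% miss rate, a 100-cycle miss penalty), not measurements from the paper:

```python
# Toy evaluation of the two-level cache latency model from this slide.
# All latencies are in units of the hit latency; parameter values are
# illustrative assumptions, not measured data.

def latency_original(H, m, P):
    """L = H + m*P for the original cache."""
    return H + m * P

def latency_compressed(Hu, Hc, mu, mc, Pd, P):
    """L_{u+c} = H_{u/c} + mu*(1 - mc)*Pd + mu*mc*P,
    where H_{u/c} = max(Hu, Hc)."""
    Huc = max(Hu, Hc)
    return Huc + mu * (1.0 - mc) * Pd + mu * mc * P

# Numbers matching the slide's approximation:
# H_{u/c} ~ H, mu ~ m, mc ~ 0.5, Pd ~ 0.5*P.
H, m, P = 1.0, 0.05, 100.0
L = latency_original(H, m, P)                       # 1 + 0.05*100 = 6.0
Luc = latency_compressed(H, H, m, 0.5, 0.5 * P, P)  # 1 + 1.25 + 2.5 = 4.75
print(L, Luc, L - 0.25 * m * P)                     # 6.0 4.75 4.75
```

With these numbers the partitioned design saves exactly 0.25 • m • P, matching the approximation on the slide.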
Latency Benefit: Communication
Components - Bus Example
If a compression scheme provides 50% compression, then one can
obtain the same effective bandwidth using half as many bus lines
If these remaining bus lines are spaced further apart and fattened
appropriately, one can reduce the inter-wire capacitance (due to the
extra spacing) and the resistance (due to the fattening), and thereby
reduce the latency of the bus, which depends upon the RC product
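A back-of-the-envelope sketch of this argument, under the crude assumptions that wire resistance scales as 1/width and inter-wire capacitance as 1/spacing; the constants and scaling laws here are illustrative only, not process data:

```python
# First-order toy model of the bus-latency argument: wire delay ~ R*C,
# with R inversely proportional to wire width and coupling C inversely
# proportional to wire spacing. Constants are arbitrary; this shows the
# direction of the effect, not real process numbers.

def wire_delay(width, spacing, r0=1.0, c0=1.0):
    R = r0 / width      # fatter wire -> lower resistance
    C = c0 / spacing    # wider spacing -> lower inter-wire capacitance
    return R * C

# Baseline: bus lines of unit width and unit spacing.
base = wire_delay(width=1.0, spacing=1.0)

# With 50% compression, half as many lines carry the same traffic, so
# each remaining line can be made twice as wide and spaced twice as far
# apart within the same total bus area.
compressed = wire_delay(width=2.0, spacing=2.0)

print(base, compressed)  # 1.0 0.25 -> 4x lower RC delay in this toy model
```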
Power Benefits of Compression
Same effective storage capacity can be provided while storing
fewer bits => storage components reduce in size and number
Keeping the bus width the same (constant cost), fewer bits need to be
transmitted
Keeping the effective bandwidth the same, fewer bits need to be
transmitted over narrower buses => communication
components reduce in number
On-chip buses: keeping the effective bandwidth and area the same by
increasing the spacing between bus lines, fewer bits need to be
transmitted over fewer bus lines that have less inter-wire capacitance
Cost Benefits of Compression
Same effective storage capacity can be provided while storing
fewer bits => reduced storage component resources and
less die area
Same effective bandwidth can be provided with narrower buses
=> communication components reduce in number and die area
reduces
Lower power consumption means lower packaging costs
Address Compression Opportunities
Where?
Storage components
Some registers (PC, memory address reg., branch related, etc.)
TLB
Tag field of caches
Page tables in main memory
Communication components
On- and off-chip instruction and data address buses at various
levels of the memory hierarchy
Why?
Instruction and data addresses exhibit spatial and temporal
locality, the former more so
Even when branches or jumps occur, the target address is not
very far away
Instruction Compression Opportunities
Where?
Storage components
Some registers (instruction registers, etc.)
Data fields of instruction caches
Executable code text segment in main memory
Communication components
On- and off-chip instruction buses at various levels of the memory
hierarchy
Why?
Instruction accesses exhibit spatial and temporal locality
Different instructions, instruction sequences, opcodes, register
operands, and immediate constants occur with different frequencies
There is correlation between opcodes and register operands and
between opcodes and immediate constants
Programs share certain basic characteristics
They have procedures and procedure calls
Branches every few instructions
They use loops and if-then-else clauses
Data Compression Opportunities
Where?
Storage components
Integer (INT) and floating-point (FP) registers
Data fields of data caches
Data in main memory
Communication components
On- and off-chip data buses at various levels of the memory
hierarchy
Why?
Data accesses exhibit spatial and temporal locality
For any given data type, not all values are equally likely
For example, many programs tend to use numbers that are of
much smaller magnitude than the maximum possible
Feasibility of Compression
Challenges: Extra logic for compression & decompression can result in:
Performance overhead: Extra latency
Power overhead because of extra computation
Cost overhead because of area for extra logic
However, …
Since logic scales much better in speed, power, and size/cost compared to
interconnect, performance, power, and cost overheads of
compression/decompression are going to decrease as technology improves
The area, latency, and power overheads that can be tolerated vary from one
part of the memory system to another and from application to application.
Examples:
More latency overhead can be tolerated at higher levels of cache and main memory
than at lower levels close to the processor
More latency overhead can be tolerated in non-performance-critical systems than in
performance-critical ones
Depending upon state of the technology, memory system location where
compression is to be applied, and application system requirements, the
compression scheme can be made suitably aggressive (with corresponding
overheads)
CMS Architecture Classifications
Degree of specialization: Symbol statistics used for
compression can be specialized to different extents
Block-specific architecture
Memory-component-specific architecture
Application-program-specific architecture
Application-class-specific architecture
General architecture
Static or dynamic (adaptive) compression
Compression schemes that exploit different types of
patterns: zero order, first order, second order, etc.
Hybrid architectures and schemes
Related Work
Analytical studies on address compression [Becker et al.,
MICRO-24, 1991] and instruction compression feasibility
[Wang & Quong, MASCOTS, 1994]
Cache-based address compression schemes [Park & Farrens,
ISCA-18, 1991], and their extension to instruction and data
compression [Citron & Rudolph, HPCA-1, 1995]
Cache [Yang, et al., MICRO, 2000] and main memory
compression [Tremaine, et al., IBM JRD, 2001]
Code compression for embedded processors: Dictionary
compression [Lefurgy & Mudge, MICRO-30, 1997], semi-adaptive
Markov compression (SAMC) [Lekatsas & Wolf, DAC-35, 1998],
pattern-based compression [Araujo et al., MICRO-31, 1998]
Bus encoding for low power: Gray code [Su et al., TR, USC,
1994], bus-invert code [Stan & Burleson, IEEE-TVLSI, 1995], T0
code [Benini, et al., GLS-VLSI, 1997], working-zone code
[Musoll, et al., IEEE-TVLSI, 1998] – mostly for address buses
Relationship of Our Work to Previous Research
Comprehensive analysis of all three primary types of
information: addresses, instructions, data
In the context of many real-world benchmark
programs
At all levels of a memory hierarchy with multi-level
caches
Analysis of compression with respect to various
factors: degree of specialization, multiplexing,
encoding, etc.
Analysis of potential power savings
Target System and Compression Targets
Target Memory System
32 INT and FP registers
2 cache levels
L1: Separate I- and D-cache, write-through, 16KB each, 4-way
set associative, 32 byte block size
L2: Unified (I + D)-cache, write-back, 256KB, 4-way set
associative, 64 byte block size
Main memory
Demultiplexed buses (default) at all levels
Compression Targets
32 INT and 32 FP registers
L1 and L2 I-cache data field and L1 I-cache tag field
Executable code text segment in main memory
Demultiplexed (default) address, instruction, and data buses
across all 3 levels: P-L1, L1-L2, L2-M
Simulation Methodology
Used cachesim5 cache analyzer in Sun Microsystems’ SHADE
5.34 SPARC simulator running on uniprocessor SPARC v9
platform to collect traces
9 integer and 6 floating-point benchmarks from SPEC CPU2000
benchmark suite
Used –O2 optimization to compile benchmarks
Statically-linked benchmark executables
2-million sample (trace) size for all simulations and a warm-up
window of 500M for most benchmarks [Skadron et al., IEEE-TC,
1999]
Default symbol size for compression analysis is an aligned
word (32 bits for address and instruction and 64 bits for data)
Default compression analysis is for component-specific case
Analysis Measures
Zero information entropy: H = log2(M), where M = number of unique symbols in the trace
Zeroth-order Markov entropy: H0 = -Σi [p(si) • log2 p(si)]
First-order Markov entropy: H1 = -Σi p(si) • Σj [p(sj|si) • log2 p(sj|si)]
Compression ratio: R = (Σall traces compressed trace size) / (Σall traces uncompressed trace size)
RSAMC = compression ratio when SAMC is used; RGZIP = compression ratio when Gzip is used
Transition ratio: T = (Σall traces number of transitions in modified trace) / (Σall traces number of transitions in original trace)
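These measures can be sketched in a few lines of code. The probabilities are estimated from symbol counts in a trace, and the transition count treats each trace element as a bus word whose bit flips between consecutive words are counted; the function names and the tiny synthetic trace are ours, not from the paper:

```python
# Sketch of the entropy and transition measures defined above, applied
# to a small synthetic symbol trace (real traces would come from the
# cachesim5 runs described earlier).
import math
from collections import Counter

def zero_information_entropy(trace):
    """H = log2(M), M = number of unique symbols in the trace."""
    return math.log2(len(set(trace)))

def h0(trace):
    """Zeroth-order Markov entropy: H0 = -sum_i p(s_i) log2 p(s_i)."""
    n = len(trace)
    return -sum((c / n) * math.log2(c / n) for c in Counter(trace).values())

def h1(trace):
    """First-order Markov entropy:
    H1 = -sum_i p(s_i) sum_j p(s_j|s_i) log2 p(s_j|s_i),
    with probabilities estimated from adjacent-symbol counts."""
    n = len(trace)
    pairs = Counter(zip(trace, trace[1:]))
    firsts = Counter(trace[:-1])
    total = 0.0
    for (a, _b), c in pairs.items():
        p_cond = c / firsts[a]                       # p(s_j | s_i)
        total += (firsts[a] / (n - 1)) * p_cond * math.log2(p_cond)
    return -total

def transition_ratio(modified, original):
    """T = bit transitions in modified trace / transitions in original,
    counting bit flips between consecutive bus words."""
    def transitions(t):
        return sum(bin(x ^ y).count("1") for x, y in zip(t, t[1:]))
    return transitions(modified) / transitions(original)

trace = [0, 1, 0, 1, 0, 1, 0, 1]
print(zero_information_entropy(trace))  # 1.0 (two unique symbols)
print(h0(trace))                        # 1.0 (symbols equally likely)
print(h1(trace))                        # 0.0 (next symbol fully determined)
```

The alternating trace illustrates why first-order entropy can be far below zeroth-order entropy: each symbol is perfectly predicted by its predecessor.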
Simulation Results Outline
Overall Memory System Analysis
Register Compression
Cache Compression
Compressibility across cache levels
Effect of cache parameters
Compression and Power Savings in Buses
Compression and transition ratios in buses across all levels
Effect of bus encoding and multiplexing
Compression ratio across different bit fields and bit-field groupings
Compression ratio and power savings in different application classes
Compression ratio across different degrees of specialization
Compression ratio and compression tool used
Compression ratio and sample size
Overall Memory System Analysis
Communication components are more compressible than storage components
Most to least compressible storage components: tag, registers, L1 I-cache data, main memory
Most to least compressible communication components: data bus, instruction bus, address bus
Most to least power savings for communication components: data, address, instruction
Register Compression
First-order analysis does not make sense for registers
Integer registers are more compressible than floating-point registers
Good amount of variation in compression ratio across registers, but bounded above by approx. 0.25 in all cases
Cache Compression Across Levels
L1 and L2 caches are almost equally compressible with respect to the instructions stored, although L1 is slightly more compressible
Effect of Cache Parameters
Increasing total cache size worsens compressibility somewhat, but
compressibility still very good for larger cache sizes
Increasing cache block size improves compressibility
Cache associativity has negligible impact on compressibility
Compression and Power Savings Across Buses
Average H0 and H1 compression ratios are similar across all levels
Instruction address is most compressible and data address least, except for the L2-M case, where data address is most compressible
Transition ratio > 1 for the instruction address bus and for the L2-M data address bus, because SAMC is not particularly suitable for address compression
Effect of Compression, Encoding, and
Multiplexing
Compression alone provides considerably better power savings than
encoding alone, and compression followed by encoding provides the
most power savings
Demultiplexed buses result in better compression and power savings
Compression Ratio Across Different Bit Fields
and Bit-Field Groupings
Compressibility improves from low-order to high-order bit fields,
except somewhat in the case of instruction bus traffic
Most to least compressible: tag, address, data, instruction
Generally, longer bit-field groupings lead to better compression ratios
Compression Ratio and Power Savings Across
Application Classes
Integer data in registers is more compressible than FP data in registers
FP program instructions (in I-cache data field, main memory, instruction bus) and addresses (in I-cache tag field and address bus) are more compressible than the corresponding INT program information
FP data on the data bus is more compressible than INT data on the data bus
Compression Ratio Across Different Degrees of
Specialization
The more specialized the symbol statistics used for compression, the better the compression ratio
Compression Ratio Across Compression
Tools
From best to worst: first order, zeroth order, zero information, SAMC, Gzip (suited for text)
Almost an order of magnitude gap between first-/zeroth-order Markov entropies and the SAMC compression ratio
The difference between first-order and zeroth-order entropies is negligible in the case of data and reasonably big in the case of addresses => implications for appropriate compression algorithms
Compression Ratio and Sample Size
Trends observed for a larger sample size (50M) are similar to those for our 2M sample size for a variety of components
The difference in compression ratios between small and large samples indicates the benefits to be obtained from dynamic (adaptive) compression algorithms
Conclusions and Future Work
Comprehensive analysis of compressibility of addresses,
instructions, and data at all levels of the memory hierarchy in
the context of many real-world programs and implications for
performance, power, and cost benefits
Found that compression has excellent potential for
performance, power, and cost benefits
Best-case average (among all memory components) zeroth and first
order Markov entropies of up to 0.084 and 0.042, respectively,
observed
Worst-case average zeroth and first order Markov entropies of
0.279 and 0.098, respectively, observed
Best- and worst-case average SAMC compression ratios of 0.363
and 0.489, respectively, observed
Best- and worst-case average transition ratios of 0.591 and 0.73,
respectively, observed
Conclusions and Future Work
Also studied the effects of cache parameters, encoding,
multiplexing, bit fields and bit-field groupings, application
class, degree of specialization, compression tool, and sample
size
Results obtained have implications for potential of
compression, CMS architecture design, and compression
algorithms
Will develop scalable compression schemes suited to the various
types of information, the location in the memory system where
they are applied, the application system under consideration,
and the constraints of the technology