
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
THE PERFORMANCE ADVANTAGE
OF APPLYING COMPRESSION TO
THE MEMORY SYSTEM
NIHAR MAHAPATRA, JIANGJIANG LIU,
KRISHNAN SUNDARESAN
{mahapatr, jliu3, ks48}@cse.buffalo.edu
AND
SRINIVAS DANGETI,
BALAKRISHNA VENKATRAO
{srinivas.dangeti, balakrishna.venkatrao}@eng.sun.com
ACM SIGPLAN WORKSHOP ON MEMORY SYSTEM PERFORMANCE
(MSP 2002), BERLIN, GERMANY, JUNE 16, 2002
Department of Computer Science & Engineering
University at Buffalo, State University of New York, NY, U.S.A.
© 2002
Computer System Components
• Information types
  - Addresses
  - Instructions
  - Data
• Computation system: Processor core
• Memory system
  - Storage components: store information
    · Registers
    · TLB, caches
    · Main memory
  - Communication components: communicate information
    · I/O buffers and signaling/driving circuitry
    · I/O pads
    · Pins
    · On- and off-chip buses
• I/O system: Secondary and tertiary storage I/O, network I/O
May 25, 2002
WMPI 2002 © N. R. Mahapatra, J. Liu, K. Sundaresan, S. Dangeti, and B. Venkatrao
Perf. Motivation: Comp. Sys. Perf. Growth
• Raw computation system performance growth
  - Increasing device integration
  - Larger die area
  - Rising clock frequency
• Architectural advancements: increasing levels of parallelism
  - Bit level
  - Instruction level
  - Thread level
  - Processor level
• Memory system requirements to avoid performance bottlenecks
  - Store increasing amounts of information
  - Supply this information at high enough bandwidth and low enough latency
Perf. Motivation: Mem. Sys. Solns. & Limitations
• Memory system solutions
  - Increasing # of I/O buffers, pads, pins
  - Wider buses
  - Increasing # of registers; sizes and # of levels of caches; main memory size
  - Improved design
• Limitations
  - While the # of transistors on a chip is limited by the volume of the die, communication components are constrained by its surface area
  - More stringent constraints on the clock speeds at which external pins can be driven
  - Since interconnect size does not scale as well as on-chip logic size, on- and off-chip buses have higher capacitance and delays
  - DRAM bandwidth and latency are improving at a much slower rate than computation system performance
• Memory system is increasingly becoming a performance bottleneck
Power, Cost, and Other Motivations
• An increasing fraction of the processor chip is devoted to storage and communication components
• The number and size of off-chip storage and communication components are also increasing
• Interconnect size does not scale as well as on-chip logic size, so interconnect consumes an increasing fraction of die area and power
• In the deep-submicron regime, not only individual wire capacitance but also inter-wire capacitance contributes increasingly to power consumption, making interconnect power consumption even more important [Sotiriadis & Chandrakasan, CICC, 2000]
• Because compression allows more efficient use of storage and communication resources, one can derive other benefits, such as higher fault tolerance
Outline
• Problem and Motivation
• Compressed Memory System (CMS) Architectures
  - Performance, power, and cost benefits
  - Opportunities for address, instruction, and data compression
  - Feasibility of compression
  - Classification of CMS architectures
• Related Work
• Target System, Simulation Methodology, Analysis Measures
• Simulation Results
• Conclusions and Future Work
Performance Benefits of Compression
• Storage capacity
  - More information can be stored using the same # of bits
• Bandwidth
  - More information per unit time can be transmitted using the same bus width
• Latency
  - Compression can potentially improve latency too: it can reduce the size of components while keeping their effective capacity the same, improving their access latencies because smaller is faster
Latency Benefit: Storage Components – Cache Example
• Original cache: size S, hit rate H, miss rate m, miss penalty P
  - Latency: L = H + m·P
• Split S = Su + Sc between:
  - Uncompressed cache: size Su, hit rate Hu, miss rate mu
  - Compressed cache: size Sc, hit rate Hc, miss rate mc, decompression penalty Pd
• Lu+c = Hu/c + mu·(1 − mc)·Pd + mu·mc·P, where Hu/c = max(Hu, Hc)
• Hu/c < H; mu = m + e; mu·mc < m
• If Hu/c ≈ H, mu ≈ m, mc ≈ 0.5, Pd ≈ 0.5·P, then Lu+c ≈ L − 0.25·m·P
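A minimal numeric sketch of the latency model above; the parameter values are illustrative assumptions, not numbers from the talk:

```python
def cache_latency(hit_rate, miss_rate, miss_penalty):
    # Original cache: L = H + m * P
    return hit_rate + miss_rate * miss_penalty

def split_cache_latency(hit_uc, miss_u, miss_c, decomp_penalty, miss_penalty):
    # Split cache: Lu+c = Hu/c + mu*(1 - mc)*Pd + mu*mc*P
    return (hit_uc
            + miss_u * (1 - miss_c) * decomp_penalty
            + miss_u * miss_c * miss_penalty)

# Illustrative values: Hu/c ~ H = 0.95, mu ~ m = 0.05, P = 100 cycles,
# mc = 0.5, Pd = 0.5 * P -- the slide's approximation case.
H, m, P = 0.95, 0.05, 100.0
L = cache_latency(H, m, P)
L_uc = split_cache_latency(H, m, 0.5, 0.5 * P, P)
# The slide's result Lu+c ~ L - 0.25*m*P holds exactly for these values:
assert abs(L_uc - (L - 0.25 * m * P)) < 1e-9
```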
May 25, 2002
WMPI 2002 © N. R. Mahapatra, J. Liu, K. Sundaresan, S. Dangeti, and B. Venkatrao
8
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Latency Benefit: Communication Components – Bus Example
• If a compression scheme provides 50% compression, one can obtain the same effective bandwidth using half as many bus lines
• If these bus lines are spaced farther apart and fattened appropriately, one can reduce the inter-wire capacitance (due to more spacing) and resistance (due to fattening), and thereby reduce the latency (which depends on the RC product) of the bus
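The RC argument above can be sanity-checked with a toy model; the first-order scalings used here (resistance inversely proportional to wire width, inter-wire capacitance inversely proportional to spacing) are simplifying assumptions, not claims from the talk:

```python
def rc_delay(width, spacing, r0=1.0, c0=1.0):
    # Fatter wire -> lower resistance; wider spacing -> lower inter-wire
    # capacitance. Delay tracks the RC product.
    return (r0 / width) * (c0 / spacing)

baseline = rc_delay(width=1.0, spacing=1.0)
# 50% compression: half the lines, so each remaining line can be made
# twice as wide and spaced twice as far apart in the same bus area.
compressed = rc_delay(width=2.0, spacing=2.0)
assert compressed < baseline  # RC product drops to a quarter in this model
```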
Power Benefits of Compression
• The same effective storage capacity can be provided while storing fewer bits => storage components shrink in size and number
• Keeping the bus width the same (constant cost), fewer bits need to be transmitted
• Keeping the effective bandwidth the same, fewer bits need to be transmitted over narrower buses => communication components shrink in number
• On-chip buses: keeping the effective bandwidth and area the same by increasing the spacing between bus lines, fewer bits need to be transmitted over fewer bus lines that have less inter-wire capacitance
Cost Benefits of Compression
• The same effective storage capacity can be provided while storing fewer bits => reduced storage component resources and less die area
• The same effective bandwidth can be provided with narrower buses => communication components shrink in number and die area is reduced
• Lower power consumption means lower packaging costs
Address Compression Opportunities
• Where?
  - Storage components
    · Some registers (PC, memory address reg., branch-related, etc.)
    · TLB
    · Tag field of caches
    · Page tables in main memory
  - Communication components
    · On- and off-chip instruction and data address buses at various levels of the memory hierarchy
• Why?
  - Instruction and data addresses exhibit spatial and temporal locality, the former more so
  - Even when branches or jumps occur, the target address is not very far away
Instruction Compression Opportunities
• Where?
  - Storage components
    · Some registers (instruction registers, etc.)
    · Data fields of instruction caches
    · Executable code text segment in main memory
  - Communication components
    · On- and off-chip instruction buses at various levels of the memory hierarchy
• Why?
  - Instruction accesses exhibit spatial and temporal locality
  - Different instructions, instruction sequences, opcodes, register operands, and immediate constants occur with different frequencies
  - There is correlation between opcodes and register operands and between opcodes and immediate constants
  - Programs share certain basic characteristics
    · They have procedures and procedure calls
    · Branches every few instructions
    · They use loops and if-then-else clauses
Data Compression Opportunities
• Where?
  - Storage components
    · Integer (INT) and floating-point (FP) registers
    · Data fields of data caches
    · Data in main memory
  - Communication components
    · On- and off-chip data buses at various levels of the memory hierarchy
• Why?
  - Data accesses exhibit spatial and temporal locality
  - For any given data type, not all values are equally likely
    · For example, many programs tend to use numbers of much smaller magnitude than the maximum possible
Feasibility of Compression
• Challenges: extra logic for compression & decompression can result in:
  - Performance overhead: extra latency
  - Power overhead because of extra computation
  - Cost overhead because of area for extra logic
• However, …
  - Since logic scales much better in speed, power, and size/cost than interconnect, the performance, power, and cost overheads of compression/decompression will decrease as technology improves
  - The area, latency, and power overheads that can be tolerated vary from one part of the memory system to another and from application to application. Examples:
    · More latency overhead can be tolerated at higher levels of cache and main memory than at lower levels close to the processor
    · More latency overhead can be tolerated in non-performance-critical systems than in performance-critical ones
  - Depending on the state of the technology, the memory system location where compression is to be applied, and application system requirements, the compression scheme can be made suitably aggressive (with corresponding overheads)
CMS Architecture Classifications
• Degree of specialization: symbol statistics used for compression can be specialized to different extents
  - Block-specific architecture
  - Memory-component-specific architecture
  - Application-program-specific architecture
  - Application-class-specific architecture
  - General architecture
• Static or dynamic (adaptive) compression
• Compression schemes that exploit different types of patterns: zero order, first order, second order, etc.
• Hybrid architectures and schemes
Related Work
• Analytical studies on address compression [Becker et al., MICRO-24, 1991] and instruction compression feasibility [Wang & Quong, MASCOTS, 1994]
• Cache-based address compression schemes [Park & Farrens, ISCA-18, 1991], and their extension to instruction and data compression [Citron & Rudolph, HPCA-1, 1995]
• Cache [Yang et al., MICRO, 2000] and main memory compression [Tremaine et al., IBM JRD, 2001]
• Code compression for embedded processors: dictionary compression [Lefurgy & Mudge, MICRO-30, 1997], semi-adaptive Markov compression (SAMC) [Lekatsas & Wolf, DAC-35, 1998], pattern-based compression [Araujo et al., MICRO-31, 1998]
• Bus encoding for low power: Gray code [Su et al., TR, USC, 1994], bus-invert code [Stan & Burleson, IEEE-TVLSI, 1995], T0 code [Benini et al., GLS-VLSI, 1997], working-zone code [Musoll et al., IEEE-TVLSI, 1998] – mostly for address buses
Relationship of Our Work to Previous Research
• Comprehensive analysis of all three primary types of information: addresses, instructions, data
• In the context of many real-world benchmark programs
• At all levels of a memory hierarchy with multi-level caches
• Analysis of compression with respect to various factors: degree of specialization, multiplexing, encoding, etc.
• Analysis of potential power savings
Target System and Compression Targets
• Target memory system
  - 32 INT and 32 FP registers
  - 2 cache levels
    · L1: separate I- and D-cache, write-through, 16KB each, 4-way set associative, 32-byte block size
    · L2: unified (I + D)-cache, write-back, 256KB, 4-way set associative, 64-byte block size
  - Main memory
  - Demultiplexed buses (default) at all levels
• Compression targets
  - 32 INT and 32 FP registers
  - L1 and L2 I-cache data field and L1 I-cache tag field
  - Executable code text segment in main memory
  - Demultiplexed (default) address, instruction, and data buses across all 3 levels: P-L1, L1-L2, L2-M
Simulation Methodology
• Used the cachesim5 cache analyzer in Sun Microsystems' SHADE 5.34 SPARC simulator, running on a uniprocessor SPARC v9 platform, to collect traces
• 9 integer and 6 floating-point benchmarks from the SPEC CPU2000 benchmark suite
• Benchmarks compiled with -O2 optimization
• Statically-linked benchmark executables
• 2 million sample (trace) size for all simulations and a warm-up window of 500M for most benchmarks [Skadron et al., IEEE-TC, 1999]
• Default symbol size for compression analysis is an aligned word (32 bits for addresses and instructions, 64 bits for data)
• Default compression analysis is for the component-specific case
Analysis Measures
• Zero information entropy: H = log2 M, where M = number of unique symbols in the trace
• Zeroth-order Markov entropy: H0 = −Σi p(si)·log2 p(si)
• First-order Markov entropy: H1 = −Σi p(si)·Σj p(sj|si)·log2 p(sj|si)
• Compression ratio:
  R = (Σ over all traces of compressed trace size) / (Σ over all traces of uncompressed trace size)
  - RSAMC = compression ratio when SAMC is used
  - RGZIP = compression ratio when Gzip is used
• Transition ratio:
  T = (Σ over all traces of # of transitions in modified trace) / (Σ over all traces of # of transitions in original trace)
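The entropy measures above can be computed directly from a symbol trace; a minimal sketch over a toy trace (the trace contents are illustrative, not one of the paper's benchmark traces):

```python
from collections import Counter
from math import log2

def zero_information_entropy(trace):
    # H = log2 M, M = number of unique symbols in the trace
    return log2(len(set(trace)))

def zeroth_order_entropy(trace):
    # H0 = -sum_i p(si) * log2 p(si)
    n = len(trace)
    return -sum((c / n) * log2(c / n) for c in Counter(trace).values())

def first_order_entropy(trace):
    # H1 = -sum_i p(si) * sum_j p(sj|si) * log2 p(sj|si)
    pair_counts = Counter(zip(trace, trace[1:]))
    prev_counts = Counter(trace[:-1])
    n = len(trace) - 1
    h1 = 0.0
    for (si, sj), c in pair_counts.items():
        p_si = prev_counts[si] / n
        p_cond = c / prev_counts[si]
        h1 -= p_si * p_cond * log2(p_cond)
    return h1

trace = ["a", "b", "a", "b", "a", "b", "a", "a"]
# A mostly alternating trace: exploiting first-order (pairwise) patterns
# yields lower entropy than zeroth-order statistics alone.
assert first_order_entropy(trace) < zeroth_order_entropy(trace)
```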
Simulation Results Outline
• Overall Memory System Analysis
• Register Compression
• Cache Compression
  - Compressibility across cache levels
  - Effect of cache parameters
• Compression and Power Savings in Buses
  - Compression and transition ratios in buses across all levels
  - Effect of bus encoding and multiplexing
• Compression ratio across different bit fields and bit-field groupings
• Compression ratio and power savings in different application classes
• Compression ratio across different degrees of specialization
• Compression ratio and compression tool used
• Compression ratio and sample size
Overall Memory System Analysis
• Communication components more compressible than storage components
• Most to least compressible
  - Storage components: tag, registers, L1 I-cache data, main memory
  - Communication components: data bus, instruction bus, address bus
• Most to least power savings for communication components: data, address, instruction
Register Compression
• First-order analysis does not make sense for registers
• Integer registers are more compressible than floating-point registers
• Good amount of variation in compression ratio across registers, but bounded above by approx. 0.25 in all cases
Cache Compression Across Levels
• L1 and L2 caches almost equally compressible with respect to instructions stored, although L1 is slightly more compressible
Effect of Cache Parameters
• Increasing total cache size worsens compressibility somewhat, but compressibility remains very good for larger cache sizes
• Increasing cache block size improves compressibility
• Cache associativity has negligible impact on compressibility
Compression and Power Savings Across Buses
• Average H0 and H1 compression ratios are similar across all levels
• Instruction address most compressible, and data address least, except for the L2-M case, where data address is most compressible
• Transition ratio > 1 for the instruction address bus and for the L2-M data address bus because SAMC is not particularly suitable for address compression
Effect of Compression, Encoding, and Multiplexing
• Compression alone provides considerably better power savings than encoding alone, and compression followed by encoding provides the most power savings
• Demultiplexed buses result in better compression and power savings
Compression Ratio Across Different Bit Fields and Bit-Field Groupings
• Compressibility improves from low-order to higher-order bit fields, except somewhat in the case of instruction bus traffic
• Most to least compressible: tag, address, data, instruction
• Generally, longer bit-field groupings lead to better compression ratios
Compression Ratio and Power Savings Across Application Classes
• Integer data in registers more compressible than FP data in registers
• FP program instructions (in I-cache data field, main memory, instruction bus) and addresses (in I-cache tag field and address bus) more compressible than corresponding INT program info.
• FP data on the data bus more compressible than INT data on the data bus
Compression Ratio Across Different Degrees of Specialization
• The more specialized the symbol statistics used for compression, the better the compression ratio
Compression Ratio Across Compression Tools
• From best to worst: first order, zeroth order, zero information, SAMC, Gzip (suited for text)
• Almost an order of magnitude gap between first-/zeroth-order Markov entropies and the SAMC compression ratio
• The difference between first-order and zeroth-order entropies is negligible in the case of data, and reasonably big in the case of addresses => implications for appropriate compression algorithms
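The RGZIP-style compression ratio from the Analysis Measures slide can be measured directly; a minimal sketch using Python's zlib (the same DEFLATE algorithm gzip uses), with illustrative inputs rather than real traces:

```python
import os
import zlib

def compression_ratio(trace_bytes):
    # R = compressed trace size / uncompressed trace size
    return len(zlib.compress(trace_bytes, level=9)) / len(trace_bytes)

repetitive = b"\x00\x01\x02\x03" * 4096   # trace-like, highly redundant
random_like = os.urandom(16384)           # incompressible by construction
assert compression_ratio(repetitive) < 0.1
assert compression_ratio(random_like) > 0.9
```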
Compression Ratio and Sample Size
• Trends observed for a larger sample size (50M) are similar to those for our 2M sample size for a variety of components
• The difference in compression ratios between small and large samples indicates the benefits to be obtained from dynamic (adaptive) compression algorithms
Conclusions and Future Work
• Comprehensive analysis of the compressibility of addresses, instructions, and data at all levels of the memory hierarchy, in the context of many real-world programs, and implications for performance, power, and cost benefits
• Found that compression has excellent potential for performance, power, and cost benefits
  - Best-case average (among all memory components) zeroth- and first-order Markov entropies of up to 0.084 and 0.042, respectively, observed
  - Worst-case average zeroth- and first-order Markov entropies of 0.279 and 0.098, respectively, observed
  - Best- and worst-case average SAMC compression ratios of 0.363 and 0.489, respectively, observed
  - Best- and worst-case average transition ratios of 0.591 and 0.73, respectively, observed
Conclusions and Future Work
• Also studied the effect of cache parameters, encoding, multiplexing, bit fields and bit-field groupings, application class, degree of specialization, compression tool, and sample size
• Results obtained have implications for the potential of compression, CMS architecture design, and compression algorithms
• Will develop scalable compression schemes suitable for various types of information, the location in the memory system where compression is applied, the application system under consideration, and the constraints of the technology