
Computer Architecture
Probabilistic L1 Cache
Filtering
By Dan Tsafrir 7/5/2012
Presentation based on slides by Yoav Etsion
Computer Architecture 2012 – probabilistic L1 filtering
Lecture is based on…
 Paper titled
  “L1 cache filtering through random selection of memory references”
 Authors
  Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)
 Published in
  PACT 2007: the international conference on parallel architectures and compilation techniques
 Can be downloaded from
  http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf
MOTIVATION
A Case for more efficient caches
 CPUs get more and more cache dependent
  Growing gap between CPU and memory
  Growing popularity of multi-core chips
 Common solution: larger caches, but…
  Larger caches consume more power
  Longer wire delays yield longer latencies
Efficiency through insertion policies
 Need smarter caches
 We focus on insertion policies
  Currently everything goes into the cache
  Need to predict which blocks will be evicted quickly…
  … and prevent them from polluting the caches
 Reducing pollution may enable use of low-latency, low-power, direct-mapped caches
  Less pollution yields fewer conflicts
 The problem:
  “It is difficult to make predictions, especially about the future” (attributed to many)
background
CDF,
RESIDENCY LENGTHS,
MASS-COUNT DISPARITY
PDF
(probability distribution function)
 In statistics, a PDF is, roughly, a function f describing
  The likelihood to get some value in some domain
 For example, f can specify how many students have a first name comprised of exactly k Hebrew letters:
 f(1) = 0%
f(2) = 22% (... ,‫ לי‬,‫ טל‬,‫ גד‬,‫ בן‬,‫ חן‬,‫ חן‬,‫ גל‬,‫ שי‬,‫ שי‬,‫ שי‬,‫ שי‬,‫ רם‬,‫)דן‬
f(3) = 24% (... ,‫ חנה‬,‫ רחל‬,‫ חיה‬,‫ משה‬,‫ נגה‬,‫ שיר‬,‫ שיר‬,‫ שיר‬,‫ שיר‬,‫ רון‬,‫ גיל‬,‫)גיל‬
f(4) = 25% ( ...,‫ נועה‬,‫ נועה‬,‫ נועה‬,‫ נועה‬,‫ נועה‬,‫ מיכל‬,‫ מיכל‬,‫ אביב‬,‫ אחמד‬,‫)יואב‬
f(5) = 13% ( ... ,‫ אביעד‬,‫ אביתר‬,‫ אביבה‬,‫ חנניה‬,‫ יהודה‬,‫ עירית‬,‫ ירוחם‬,‫)אביטל‬
f(6) = 9% ( ... ,‫ אדמונד‬,‫ אבשלום‬,‫ שלומית‬,‫ אביגיל‬,‫ אבינעם‬,‫ אביבית‬,‫)יחזקאל‬
f(7) = 6% ( ...‫ אנסטסיה‬,‫ עמנואלה‬,,‫ מתיתיהו‬,‫ אבינועם‬,‫)אביגדור‬
f(8) = 0.6% (... ,‫)אלכסנדרה‬
f(9) = 0.4% (... ,‫)קונסטנטין‬
f(10) = 0
 Note that Σk f(k) = 100%
CDF
(cumulative distribution function)
 In statistics, a CDF is, roughly, a function F describing
  The likelihood to get some value in some domain, or less
 For example, F can specify how many students have a first name comprised of k Hebrew letters, or less
  F(1) = 0%    = f(1)
  F(2) = 22%   = f(1)+f(2)  = 0%+22%
  F(3) = 46%   = F(2)+f(3)  = 22%+24%
  F(4) = 71%   = F(3)+f(4)  = 46%+25%
  F(5) = 84%   = F(4)+f(5)  = 71%+13%
  F(6) = 93%   = F(5)+f(6)  = 84%+9%
  F(7) = 99%   = F(6)+f(7)  = 93%+6%
  F(8) = 99.6% = F(7)+f(8)  = 99%+0.6%
  F(9) = 100%  = F(8)+f(9)  = 99.6%+0.4%
  F(10) = 100%
 Generally, F(x) = Σ_{k ≤ x} f(k), which is monotonically non-decreasing
Cache residency
 A “residency”
  Is what we call a block of memory
  • From the time it was inserted into the cache
  • Until the time it was evicted
  Each memory block can be associated with many residencies during a single execution
 “Residency length”
  Number of memory references (= load/store operations) served by the residency
 “The mass of residency length = k”
  Percent of memory references (throughout the entire program execution) that were served by residencies of length k
Computing residency length on-the-fly
 At runtime, residency length is generated like so (assume C++):

#include <iostream>

class Cache_Line_Residency {
private:
    int counter;                  // the residency length
public:
    Cache_Line_Residency() {      // constructor: a new object is allocated
        counter = 0;              // when a cache line is allocated for a
    }                             // newly inserted memory block

    ~Cache_Line_Residency() {     // destructor: called when the block is
        std::cout << counter      // evicted from the cache (or when the
                  << std::endl;   // program ends)
    }

    void do_reference() {         // invoked whenever the cache line is
        counter++;                // referenced (read from or written to)
    }
};
Example
Assume:
 Size of cache: 4 bytes
 Size of cache line: 2 bytes (namely, there are two lines)
 Cache is direct-mapped => address x maps into x % 4
 A program references memory (order: top to bottom)
[Figure: the reference trace, annotated with five residencies (#1–#5); as blocks are evicted — and, at program end, for the length-90 residency — their counters print: 3, 2, 3, 90, 2]
Example – CDF of residency lengths
[Figure: CDF of the residency lengths, stepping through 40% at length 2, 80% at length 3, and 100% at length 90]
So printed residency lengths are: 3, 2, 3, 90, 2
Thus, CDF of residency length is:
• 40% of residencies have length <= 2   = |[2,2]| / |[3,2,3,90,2]|
• 80% of residencies have length <= 3   = |[2,2,3,3]| / |[3,2,3,90,2]|
• 100% of residencies have length <= 90 = |[2,2,3,3,90]| / |[3,2,3,90,2]|
Example – CDF of mass of references
[Figure: CDF of the mass of references for the same trace, stepping through 4% at length 2, 10% at length 3, and 100% at length 90]
So printed residency lengths are: 3, 2, 3, 90, 2
Thus, CDF of mass of references (“refs”) is:
• 4% of refs are to residencies with length <= 2   = (2+2) / (3+2+3+90+2)
• 10% of refs are to residencies with len <= 3     = (2+2+3+3) / (3+2+3+90+2)
• 100% of refs are to residencies w len <= 90      = (2+2+3+3+90) / (3+2+3+90+2)
Superimposing graphs
[Figure: the two CDFs superimposed — the “counters” curve (CDF of residency lengths) and the “mass” curve (CDF of references)]
“Mass-count disparity” is the term describing the phenomenon shown in the graph, whereby:
• most of the mass resides in very few counters, and
• most of the counters count very little mass
RESULTS FROM
REAL BENCHMARKS
Methodology
 Using all benchmarks from the SPEC-CPU 2000 benchmark suite
  In this presentation we show only four
  But we include all the rest in the averages
 The benchmarks were compiled for
  The Alpha AXP architecture
 All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions
 Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache
CDF of residency length
(of 4 SPEC benchmark apps)
[Figure: CDF of residency length for Crafty, Vortex, Facerec, and Spsl — data and instruction caches]
 Vast majority of residencies are relatively short
  Which likely means they are transient
 Small fraction of residencies are extremely long
CDF of mass of residencies
(of 4 SPEC benchmark apps)
[Figure: CDF of the mass of references, by residency length, for the same four benchmarks — data and instruction caches]
 Fraction of memory references serviced by each length
 Most references target residencies longer than, say, 10
Superimposing graphs reveals
mass-count disparity
[Figure: the count and mass CDFs superimposed, per benchmark]
 Every x value along the curves reveals how many of the residencies account for how much of the mass
 For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass
 Which means the other 45% (longer) residencies account for 95% of the mass
The joint-ratio
mass-disparity metric
[Figure: the superimposed CDFs, with the joint ratio of each benchmark marked]
 The divergence between the distributions (= the mass-count disparity) can be quantified by the “joint ratio”
 It’s a generalization of the proverbial 20/80 principle
 Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1
 Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass
The W1/2 mass-disparity metric
[Figure: the mass CDFs, with W1/2 marked]
 Definition: overall mass (in %) of the shorter half of the residencies
 Example: in Vortex and Facerec, W1/2 is less than 5% of the references
 Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)
The N1/2 mass-disparity metric
[Figure: the count CDFs, with N1/2 marked]
 Definition: % of longer residencies accounting for half of the mass
 Example: in Vortex and Facerec, N1/2 is less than 1% of the residencies
 Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
Let us leverage these insights for…
DESIGNING A NEW
INSERTION POLICY
Probabilistic insertion?
 The mass-disparity we’ve identified means
  A small number of long residencies account for most memory references; but still, most residencies are short
 So when randomly selecting a residency
  It would likely be a short residency
 Which means we have a way to approximate the future:
  Given a block about to be inserted into the cache, probabilistically speaking, we know with a high degree of certainty that it’d be disadvantageous to actually insert it…
  So we won’t! Instead, we’ll flip a coin…
  • Heads = insert block into the cache (small probability)
  • Tails = insert block into a small filter (high probability)
 Rationale
  Long residencies will enjoy many coin-flips, so chances are they’ll eventually get into the cache
  Conversely, short residencies have little chance to get in
L1 with random filter
 Design
  Direct-mapped L1 + small fully-associative filter w/ CAM
  Insertion policy for lines not in L1: for each mem ref, flip a biased coin to decide if the line goes into the filter or into L1
 SRAM is cache memory
  Not to be confused with DRAM
  Holds blocks that, by the coin flip, shouldn’t be inserted into L1
 Usage
  First, search for the data in L1
  If not found, search in the filter
  If not found, go to L2, and then use the above insertion policy
L1 with random filter
 Result
  Long residencies end up in L1
  Short residencies tend to end up in the filter
 Benefit of randomness
  Filtering is purely statistical, eliminating the need to save any state or reuse information!
 Explored filter sizes
  1KB, 2KB, and 4KB
  Consisting of 16, 32, and 64 lines, respectively
  Results presented in these slides were achieved using a 2KB filter
Exploring coin bias
[Figure: reduction in miss rate vs. selection probability, for data and instruction caches]
 Find the probability minimizing the miss-rate
  High probability swamps the cache
  Low probability swamps the filter
 Constant selection probabilities seem sufficient
  Data miss-rate reduced by ~25% for P = 5/100
  Inst. miss-rate reduced by >60% for P = 1/1000
Exploring coin bias
 Random sampling with probability P turned out equivalent to periodic sampling at a rate of ~1/P
  Do not need real randomness
 Majority of memory refs serviced by the L1 cache, whereas the majority of blocks remain in the filter; specifically:
  L1 services 80%-90% of refs
  With only ~35% of the blocks
Problem – CAM is wasteful & slow
 Fully-associative filter uses CAM (content-addressable memory)
  Input = address; output (on a hit) = “pointer” into SRAM saying where the associated data is
  CAM lookup is done in parallel
 Parallel lookup drawbacks
  Wastes energy
  Is slower (relative to direct-mapped)
 Possible solution
  Introducing the “WLB”…
Wordline Look-aside Buffer (WLB)
 WLB is a small direct-mapped lookup table caching the most recent CAM lookups
  (Recall: given an address, the CAM returns a pointer into SRAM; it’s a search like any other search and therefore can be cached)
  Fast, low-power lookups
 Filter usage when adding the WLB to it
  First, search for the data in L1
  In parallel, search for its address in the WLB
  If the data is not in L1 but the WLB hits
  • Access the SRAM without the CAM
  If the data is not in L1 and the WLB misses
  • Only then use the slower / wasteful CAM
  If not found, go to L2 as usual
Effectiveness of WLB?
[Figure: fraction of CAM lookups eliminated vs. size of WLB (number of entries), for data and instruction caches]
 WLB is quite effective with only 8 entries (for both I and D)
  Eliminates 77% of CAM data lookups
  Eliminates 98% of CAM instruction lookups
 Since the WLB is so small and simple (direct-mapped)
  It’s fast and consumes extremely low power
  Therefore, it can be looked up in parallel with the main L1 cache
PERFORMANCE
EVALUATION
Methodology
 4-wide, out-of-order micro-architecture (SimpleScalar)
  (You’ll understand this when we learn out-of-order execution)
 Simulated L1
  16K, 32K, 64K, with several set-associative configurations; latency:
  • Direct-mapped: 1 cycle
  • Set-associative: 2 cycles
 Simulated filter
  2K, fully-associative, with an 8-entry WLB; latency: 5 cycles
  • 1 cycle = for WLB (in parallel to accessing the cache)
  • 3 cycles = for CAM lookup
  • 1 cycle = for SRAM access
 Simulated L2
  512K; latency: 16 cycles
 Simulated main-memory
  Latency: 350 cycles
Results – runtime
[Figure: average relative improvement [%] in runtime of 16K DM + filter and 32K DM + filter over common cache designs]
 Comparing the random-sampling filter cache to other common cache designs
 Outperforms a 4-way cache double its size!
  Interesting: DM’s low latency compensates for conflict misses
Results – power consumption
 Expectedly, DM-filtered loses to plain DM, because it’s more complex
 Direct-mapped cache reduces dynamic power, but the filter adds ~15% more leakage over 4-way
 Same size: 60%-80% reduction in dynamic power
 Double size: ~40% reduction in leakage
Conclusions
 The mass-count disparity phenomenon can be leveraged for caching policies
 Random sampling effectively identifies frequently used blocks
  Adding just a 2KB filter is better than doubling the cache size, both in terms of IPC and power
 The WLB is effective at eliminating costly CAM lookups
  Offering fast, low-power access while maintaining full-associativity benefits