Computer Architecture
Probabilistic L1 Cache Filtering
By Dan Tsafrir 7/5/2012
Presentation based on slides by Yoav Etsion
Lecture is based on…
Paper titled
“L1 cache filtering through random selection of memory references”
Authors
Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)
Published in
PACT 2007: the international conference on parallel architectures and
compilation techniques
Can be downloaded from
http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf
MOTIVATION
A Case for more efficient caches
CPUs get more and more cache dependent
Growing gap between CPU and memory
Growing popularity of multi-core chips
Common solution: larger caches, but…
Larger caches consume more power
Longer wire delays yield longer latencies
Efficiency through insertion policies
Need smarter caches
We focus on insertion policies
Currently everything goes into the cache
Need to predict which blocks will be evicted quickly…
… and prevent them from polluting the caches
Reducing pollution may enable use of low-latency, low-power,
direct-mapped caches
Less pollution yields fewer conflicts
The Problem:
“It is difficult to make predictions, especially about the future”
(variously attributed)
BACKGROUND:
CDF, RESIDENCY LENGTHS, MASS-COUNT DISPARITY
PDF
(probability distribution function)
In statistics, a PDF is, roughly, a function f describing
The likelihood to get some value in some domain
For example, f can specify how many students have a first
name comprised of exactly k Hebrew letters
f(1) = 0%
f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, ...)
f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, ...)
f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, ...)
f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, ...)
f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, ...)
f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, ...)
f(8) = 0.6% (אלכסנדרה, ...)
f(9) = 0.4% (קונסטנטין, ...)
f(10) = 0%
Note that Σ_k f(k) = 100%
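As an illustrative sketch (my own, not from the slides; the sample names are hypothetical and counted in Latin letters), such a PDF could be computed from a list of names as follows:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// f(k) = percentage of names comprised of exactly k letters
std::map<int, double> name_length_pdf(const std::vector<std::string>& names)
{
    std::map<int, int> count;                       // k -> how many names have length k
    for (const std::string& n : names)
        count[(int)n.size()]++;
    std::map<int, double> f;
    for (const auto& kv : count)
        f[kv.first] = 100.0 * kv.second / names.size();  // in percent
    return f;
}

int main()
{
    // hypothetical sample; a real survey would use the students' actual names
    std::vector<std::string> names = {"Dan", "Gil", "Yoav", "Noa", "Michal"};
    for (const auto& kv : name_length_pdf(names))
        std::cout << "f(" << kv.first << ") = " << kv.second << "%\n";
}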
CDF
(cumulative distribution function)
In statistics, a CDF is, roughly, a function F describing
The likelihood to get some value in some domain, or less
For example, F can specify how many students have a first
name comprised of k Hebrew letters or less
F(1) = 0% = f(1) = 0%
F(2) = 22% = f(1)+f(2) = 0%+22%
F(3) = 46% = f(1)+f(2)+f(3) = 0%+22%+24%
F(4) = 71% = f(1)+f(2)+f(3)+f(4) = 0%+22%+24%+25%
F(5) = 84% = F(4)+f(5) = 71%+13%
F(6) = 93% = F(5)+f(6) = 84%+9%
F(7) = 99% = F(6)+f(7) = 93%+6%
F(8) = 99.6% = F(7)+f(8) = 99%+0.6%
F(9) = 100% = F(8)+f(9) = 99.6%+0.4%
F(10) = 100%
Generally, F(x) = Σ_{k <= x} f(k), which is monotonically non-decreasing
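A minimal sketch (my own) of that running sum, turning a discrete PDF into its CDF:

#include <map>

// F(x) = sum of f(k) over all k <= x (both in percent); a running sum of the PDF
std::map<int, double> pdf_to_cdf(const std::map<int, double>& f)
{
    std::map<int, double> F;
    double running = 0.0;
    for (const auto& kv : f) {    // std::map iterates keys in increasing order
        running += kv.second;
        F[kv.first] = running;    // monotonically non-decreasing by construction
    }
    return F;
}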
Cache residency
A “residency”
Is what we call a block of memory
• From the time it was inserted into the cache
• Until the time it was evicted
Each memory block can be associated with many residencies during a
single execution
“Residency length”
Number of memory references (= load/store operations) served by the
residency
“The mass of residency length=k”
Percent of memory references (throughout the entire program
execution) that were served by residencies of length k
Computing residency length on-the-fly
At runtime, residency length is generated like so (assume C++):
#include <iostream>

class Cache_Line_Residency {
private:
    int counter;  // the residency length

public:
    // constructor: a new object is allocated when a cache line
    // is allocated for a newly inserted memory block
    Cache_Line_Residency() {
        counter = 0;
    }

    // destructor: called when the block is evicted from the
    // cache (or when the program ends)
    ~Cache_Line_Residency() {
        std::cout << counter << std::endl;
    }

    // invoked whenever the cache line is referenced
    // (read from or written to), i.e., on every memory access
    void do_reference() {
        counter++;
    }
};
Example
Assume:
Size of cache:
4 bytes
Size of cache line:
2 bytes (namely, there are two lines)
Cache is direct-mapped => address x maps into cache byte x % 4
A program references memory (order: top to bottom):
[Figure: a trace of memory references (addresses 0, 1, 3, 4, 6, 7, …, referenced top to bottom), annotated with the five residencies (#1 to #5) it produces and each residency's running counter; as each residency ends, its length is printed: 3, 2, 3, 90, 2 (the length-90 residency is printed only at program end).]
Example – CDF of residency lengths
So the printed residency lengths are: 3, 2, 3, 90, 2
Thus, the CDF of residency length is:
• 40% of residencies have length <= 2
  = |[2,2]| / |[3,2,3,90,2]|
• 80% of residencies have length <= 3
  = |[2,2,3,3]| / |[3,2,3,90,2]|
• 100% of residencies have length <= 90
  = |[2,2,3,3,90]| / |[3,2,3,90,2]|
[Figure: the resulting step-function CDF; y axis: 20%-100%, x axis: lengths 2, 3, …, 90]
Example – CDF of mass of references
So the printed residency lengths are: 3, 2, 3, 90, 2
Thus, the CDF of the mass of references (“refs”) is:
• 4% of refs are to residencies with length <= 2
  = (2+2) / (3+2+3+90+2)
• 10% of refs are to residencies with length <= 3
  = (2+2+3+3) / (3+2+3+90+2)
• 100% of refs are to residencies with length <= 90
  = (2+2+3+3+90) / (3+2+3+90+2)
[Figure: the resulting step-function CDF of the reference mass; y axis: 20%-100%, x axis: lengths 2, 3, …, 90]
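As a small sketch (my own, not from the slides), both of the example's CDFs can be computed directly from the printed residency lengths:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> lengths = {3, 2, 3, 90, 2};      // the printed residency lengths
    std::sort(lengths.begin(), lengths.end());        // -> 2, 2, 3, 3, 90

    double n    = lengths.size();                                        // number of residencies
    double mass = std::accumulate(lengths.begin(), lengths.end(), 0.0);  // total number of references

    double count_so_far = 0, mass_so_far = 0;
    for (size_t i = 0; i < lengths.size(); ++i) {
        count_so_far += 1;
        mass_so_far  += lengths[i];
        // report only at the last occurrence of each distinct length
        if (i + 1 == lengths.size() || lengths[i + 1] != lengths[i])
            std::cout << "length <= " << lengths[i]
                      << ": residency CDF = " << 100 * count_so_far / n << "%"
                      << ", mass CDF = "      << 100 * mass_so_far / mass << "%\n";
    }
}

Running it reproduces the points above: 40%/4% at length 2, 80%/10% at length 3, and 100%/100% at length 90.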
Superimposing graphs
[Figure: the “counters” CDF (residency lengths) and the “mass” CDF (references) superimposed; y axis: 20%-100%, x axis: lengths 2, 3, …, 90]
“Mass-count disparity” is the term describing the phenomenon shown in the graph, whereby:
• most of the mass resides in very few counters, and
• most of the counters count very little mass
RESULTS FROM
REAL BENCHMARKS
Methodology
Using all benchmarks from the SPEC CPU 2000 benchmark
suite
In this presentation we show only four
But we include all the rest in the averages
The benchmarks were compiled for
The Alpha AXP architecture
All benchmarks were fast-forwarded 15 billion instructions (to
skip any initialization code) and were then executed for another
2 billion instructions
Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache
CDF of residency length
(of 4 SPEC benchmark apps)
[Figure: CDF of residency lengths for Crafty, Vortex, Facerec, and Spsl, for both the data and the instruction cache; x axis: length of residency]
Vast majority of residencies are relatively short
Which likely means they are transient
Small fraction of residencies are extremely long
CDF of mass of residencies
(of 4 SPEC benchmark apps)
[Figure: CDF of the mass of references by residency length for Crafty, Vortex, Facerec, and Spsl, for both the data and the instruction cache; x axis: length of residency]
Fraction of memory references serviced by each length
Most references target residencies longer than, say, 10
Superimposing graphs reveals
mass-count disparity
[Figure: the residency-length CDF and the reference-mass CDF superimposed, for Crafty, Vortex, Facerec, and Spsl (data and instructions); x axis: length of residency]
Every x value along the curves reveals how many of the residencies
account for how much of the mass
For example, in Crafty, 55% of the (shortest) residencies account for only
5% of the mass
Which means the other 45% (longer) residencies account for 95% of the
mass
The joint-ratio
mass-disparity metric
[Figure: the superimposed CDFs for Crafty, Vortex, Facerec, and Spsl (data and instructions), with the joint-ratio point marked on each pair of curves; x axis: length of residency]
The divergence between the distributions (= the mass-count disparity)
can be quantified by the “joint ratio”
It’s a generalization of the proverbial 20/80 principle
Definition: the joint ratio is the unique point in the graphs where the
sum of the two CDFs is 1
Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in
middle of plot), meaning 13% of the (longest) residencies hold 87% of
the mass of the memory references, while the remaining 87% of the
residencies hold only 13% of the mass
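A minimal sketch (my own; the histogram is hypothetical) of locating the joint-ratio point by scanning both CDFs until their sum first reaches 1:

#include <iostream>
#include <map>

int main()
{
    // hypothetical histogram: residency length -> number of residencies of that length
    std::map<long, long> hist = { {1, 70}, {10, 20}, {100, 10} };

    double total_count = 0, total_mass = 0;
    for (const auto& kv : hist) {
        total_count += kv.second;                     // number of residencies
        total_mass  += (double)kv.first * kv.second;  // number of references they serve
    }

    double count_cdf = 0, mass_cdf = 0;
    for (const auto& kv : hist) {
        count_cdf += kv.second / total_count;
        mass_cdf  += (double)kv.first * kv.second / total_mass;
        // on discrete data, take the first length where the two CDFs sum to at least 1
        if (count_cdf + mass_cdf >= 1.0) {
            std::cout << "joint ratio ~ " << 100 * (1 - count_cdf) << "/" << 100 * (1 - mass_cdf)
                      << ": the longest " << 100 * (1 - count_cdf) << "% of residencies hold ~"
                      << 100 * (1 - mass_cdf) << "% of the mass\n";
            break;
        }
    }
}

For the hypothetical histogram above it reports roughly 10/79: the longest 10% of residencies hold about 79% of the mass.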
The W1/2 mass-disparity metric
[Figure: the superimposed CDFs for Crafty, Vortex, Facerec, and Spsl (data and instructions), with W1/2 marked; x axis: length of residency]
Definition: overall mass (in %) of the shorter half of the residencies
Example: in Vortex and Facerec W1/2 is less than 5% of the references
Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)
The N1/2 mass-disparity metric
[Figure: the superimposed CDFs for Crafty, Vortex, Facerec, and Spsl (data and instructions), with N1/2 marked; x axis: length of residency]
Definition: % of longer residencies accounting for half of the mass
Example: in Vortex and Facerec N1/2 is less than 1% of the residencies
Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
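A small sketch (my own illustration) of computing W1/2 and N1/2 from a list of residency lengths; with the toy lengths of the earlier example it prints W1/2 = 4% and N1/2 = 20%:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<long> len = {2, 2, 3, 3, 90};          // residency lengths of the toy example
    std::sort(len.begin(), len.end());
    double mass = std::accumulate(len.begin(), len.end(), 0.0);

    // W1/2: fraction of all references served by the shorter half of the residencies
    double w_half = 0;
    for (size_t i = 0; i < len.size() / 2; ++i)
        w_half += len[i];
    std::cout << "W1/2 = " << 100 * w_half / mass << "% of the references\n";

    // N1/2: fraction of (longest) residencies needed to cover half of the references
    double covered = 0;
    size_t n_half = 0;
    for (auto it = len.rbegin(); it != len.rend() && covered < mass / 2; ++it) {
        covered += *it;
        ++n_half;
    }
    std::cout << "N1/2 = " << 100.0 * n_half / len.size() << "% of the residencies\n";
}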
Let us utilize this understanding for…
DESIGNING A NEW
INSERTION POLICY
Probabilistic insertion?
The mass-disparity we’ve identified means
A small number of long residencies account for most memory
references; but still most residencies are short
So when randomly selecting a residency
It would likely be a short residency
Which means we have a way to approximate the future:
Given a block about to be inserted into the cache, probabilistically
speaking, we know with a high degree of certainty that it'd be
disadvantageous to actually insert it…
So we won't! Instead, we'll flip a coin…
• Heads = insert the block into the cache (small probability)
• Tails = insert the block into a small filter (high probability)
Rationale
Long residencies will enjoy many coin-flips, so chances are they’ll
eventually get into the cache
Conversely, short residencies have little chance to get in
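To make the rationale concrete (my own back-of-the-envelope numbers, not from the slides): if every reference to a line that is not yet in L1 flips a coin with insertion probability p, then a residency referenced n times enters L1 with probability 1 - (1-p)^n. With p = 0.05, a 2-reference residency enters with probability of roughly 10%, whereas a 100-reference residency enters with probability of roughly 99.4%.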
L1 with random filter
Design
Direct-mapped L1 + small
fully-associative filter w/ CAM
Insertion policy for lines not in
L1: for each mem ref, flip
biased coin to decide if line
goes into filter or into L1
The filter's SRAM is cache memory (not to be confused with DRAM)
It holds blocks that, by the coin flip, shouldn't be inserted into L1
Usage
First, search data in L1
If not found, search in filter
If not found, go to L2, and
then use above insertion policy
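A rough software sketch of this flow (my own toy model and parameter values, not the paper's hardware; my reading is that every reference to a line not in L1 flips the coin, so filter-resident blocks can eventually be promoted):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

// Toy model: a direct-mapped L1 plus a tiny fully-associative "filter"; on every
// reference to a line that is not in L1, a biased coin decides whether the line
// goes (or is promoted) into L1.
struct FilteredL1 {
    std::vector<long> l1;            // l1[set] = block currently held by that set (-1 = empty)
    std::vector<long> filter;        // small fully-associative buffer; linear scan models the CAM
    size_t filter_size, next_victim = 0;
    double p;                        // probability of inserting into L1 on each such reference

    FilteredL1(size_t l1_sets, size_t fsize, double prob)
        : l1(l1_sets, -1), filter_size(fsize), p(prob) {}

    bool access(long block) {
        size_t set = (size_t)block % l1.size();
        if (l1[set] == block)
            return true;                                    // L1 hit
        bool in_filter = false;
        for (long b : filter)                               // filter ("CAM") lookup
            if (b == block) { in_filter = true; break; }
        // line is not in L1: flip the biased coin (a long residency gets many flips)
        if ((double)std::rand() / RAND_MAX < p) {
            l1[set] = block;                                // heads (rare): insert/promote into L1
            // (invalidating a now-duplicated filter entry is omitted for brevity)
        } else if (!in_filter) {
            if (filter.size() < filter_size)                // tails: a new block goes to the filter
                filter.push_back(block);
            else {
                filter[next_victim] = block;                // simple FIFO replacement in the filter
                next_victim = (next_victim + 1) % filter_size;
            }
        }
        return in_filter;
    }
};

int main() {
    std::srand((unsigned)std::time(nullptr));
    FilteredL1 cache(256, 16, 0.05);                        // 256-set L1, 16-line filter, P = 5/100
    long hits = 0, refs = 100000;
    for (long i = 0; i < refs; ++i) {
        // synthetic trace: a few "hot" blocks (long residencies) mixed with random noise
        long block = (i % 2 == 0) ? (std::rand() % 8) : (std::rand() % 100000);
        hits += cache.access(block) ? 1 : 0;
    }
    std::cout << "hit rate: " << 100.0 * hits / refs << "%\n";
}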
L1 with random filter
Result
Long residencies end up in L1
Short residencies tend to end
up in filter
Benefit of randomness
Filtering is purely statistical,
eliminating the need to save
any state or reuse information!
Explored filter sizes
1KB, 2KB, and 4KB
Consisting of 16, 32, and 64
lines, respectively
Results presented in these slides were achieved using a 2K filter
Exploring coin bias
Find the probability minimizing the miss-rate
High probability swamps cache
Low probability swamps filter
Constant selection probabilities seem sufficient
Data miss-rate reduced by ~25% for P = 5/100
Inst. miss-rate reduced by >60% for P = 1/1000
[Figure: reduction in miss rate as a function of the selection probability P, shown separately for the data and the instruction cache]
Exploring coin bias
Random sampling with probability P turned out to be equivalent to
periodic sampling with a period of ~1/P
So we do not need real randomness
Majority of memory refs serviced by L1 cache, whereas
majority of blocks remain in the filter; specifically:
L1 services 80% - 90% of refs
With only ~35% of the blocks
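A tiny sketch of that deterministic alternative (my own illustration): a counter that "selects" once every ~1/P insertion opportunities can stand in for the biased coin.

// Deterministic stand-in for the biased coin: select once every ~1/P opportunities.
struct PeriodicSelector {
    unsigned period, count = 0;
    explicit PeriodicSelector(unsigned n) : period(n) {}
    bool select() { return ++count % period == 0; }   // true once every 'period' calls
};
// e.g., PeriodicSelector pick(20);   // P = 5/100  =>  period ~ 1/P = 20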
Problem – CAM is wasteful & slow
Fully-associative filter uses CAM (content addressable memory)
Input = address; output (on a hit) = “pointer” into the SRAM saying
where the associated data is
CAM lookup done in parallel
Parallel lookup drawbacks
Wastes energy
Is slower (relative to direct-mapped)
Possible solution
Introducing the “WLB”…
Wordline look-aside Buffer (WLB)
WLB is a small direct-mapped lookup table caching the most
recent CAM lookups
(Recall: given an address, CAM returns a pointer into SRAM; it’s a
search like any search and therefore can be cached)
Fast, low-power lookups
Filter usage once the WLB is added
First, search data in L1
In parallel search its address in WLB
If data not in L1 but WLB hits
• Access the SRAM without CAM
If data not in L1 and WLB misses
• Only then use the slower /
wasteful CAM
If not found, go to L2 as usual
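A sketch of how the WLB short-circuits the CAM (hypothetical software model mirroring the flow above; cam_lookup stands in for the real CAM):

#include <vector>

// Hypothetical model: the WLB is a tiny direct-mapped table caching the result of
// recent CAM lookups (address -> SRAM way within the filter).
struct WLB {
    struct Entry { long addr; int way; };
    std::vector<Entry> table;
    explicit WLB(size_t entries) : table(entries, Entry{-1, -1}) {}

    int lookup(long addr) {                     // cached SRAM way, or -1 on a WLB miss
        const Entry& e = table[(size_t)addr % table.size()];
        return (e.addr == addr) ? e.way : -1;
    }
    void fill(long addr, int way) {             // remember the latest CAM lookup result
        table[(size_t)addr % table.size()] = Entry{addr, way};
    }
};

// Filter lookup with the WLB in front of the CAM; cam_lookup is assumed to return
// the matching SRAM way, or -1 if the address is not in the filter.
int filter_lookup(WLB& wlb, long addr, int (*cam_lookup)(long))
{
    int way = wlb.lookup(addr);                 // fast, low-power; done in parallel with L1
    if (way >= 0)
        return way;                             // WLB hit: the CAM is skipped entirely
    way = cam_lookup(addr);                     // WLB miss: fall back to the slow CAM
    if (way >= 0)
        wlb.fill(addr, way);                    // cache the translation for next time
    return way;
}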
Effectiveness of WLB?
[Figure: WLB effectiveness as a function of WLB size (number of entries), for data and instructions]
WLB is quite effective with only 8 entries (for both I and D)
Eliminates 77% of CAM data lookups
Eliminates 98% of CAM instruction lookups
Since the WLB is so small and simple (direct-mapped)
It’s fast and consumes extremely low power
Therefore, it can be looked up in parallel with main L1 cache
PERFORMANCE
EVALUATION
Methodology
4-wide, out-of-order micro-architecture (SimpleScalar)
(You’ll understand this when we learn out-of-order execution)
Simulated L1
16K, 32K, 64K, with several set-associativity configurations; latency:
• Direct-mapped: 1 cycle
• Set-associative: 2 cycles
Simulated filter
2K, fully-associative, with 8-entry WLB; latency: 5 cycles
• 1 cycle = for WLB (in parallel to accessing the cache)
• 3 cycles = for CAM lookup
• 1 cycle = for SRAM access
Simulated L2
512K; latency: 16 cycles
Simulated main-memory
Latency: 350 cycles
Results – runtime
[Figure: average relative improvement [%] in runtime of the 16K DM + filter and 32K DM + filter configurations over other common cache designs]
Comparing the random-sampling filter cache to other common cache designs
Outperforms a 4-way cache double its size!
Interesting: the DM cache's low latency compensates for its conflict misses
Results – power consumption
Expectedly, DM-filtered loses to DM, because it’s more complex
The direct-mapped cache reduces dynamic power, but the filter adds ~15%
more leakage over 4-way
Same size: 60%-80% reduction in dynamic power
Double size: ~40% reduction in leakage
Conclusions
The Mass-Count disparity phenomenon can be leveraged for
caching policies
Random Sampling effectively identifies frequently used blocks
Adding just a 2K filter is better than doubling the cache size, both in
terms of IPC and power
The WLB is effective at eliminating costly CAM lookups
Offering fast, low-power access while maintaining full-associativity
benefits