Design and Performance Analysis of a DRAM

Download Report

Transcript Design and Performance Analysis of a DRAM

Design and Performance Analysis of a
DRAM-based Statistics Counter Array
Architecture
Authors: Haiquan (Chuck) Zhao, HaoWang, Bill Lin, Jun (Jim) Xu
Conf. : The 5th ACM/IEEE Symposium on Architectures for
Networking and Communications Systems (ANCS '09)
Presenter : JHAO-YAN JIAN
Date : 2010/12/15
1
Outline
 Introduction
 Scheme
 Analysis
 Evaluation
2
Network Measurement
 Traffic statistics are widely used in
 router management
 traffic engineering
 network anomaly detection etc.
�
 Fine-grained network measurement
 Possibly tens of millions of flows (and counters)
 Wirespeed statistics counting
 8 ns update time at 40 Gb/s(OC-768)
3
Naive Implementations
 SRAM is fast, but too expensive
 e.g., 10 million counters × 64-bits = 80 MB, prohibitively
expensive (infeasible for on-chip)
 DRAM is cheap, but naïve implementation too slow
 e.g., 50 ns DRAM random access times typically quoted, 2 × 50
ns = 100 ns > 8 ns required for wirespeed updates (read,
increment, then write)
4
Hybrid SRAM/DRAM architectures(1/2)
 Based on premise that DRAM is too slow, hybrid
SRAM/DRAM architectures have been proposed
 e.g., Shah’02[25], Ramabhadran’03[21], Roeder’04[23], Zhao’06[31]
 All based on following idea:
 Store full counters in DRAM (64-bits)
 Keep say a 5-bit SRAM counter, one per flow
Wirespeed increments on 5-bit SRAM counters
 “Flush" SRAM counters to DRAM before they “overflow"
 Once “flushed", SRAM counter won’t overflow again for at least
say another 2^5 = 32 (or 2^b in general) cycles
5
Hybrid SRAM/DRAM architectures(2/2)
 10 to 57 MB needed far exceed available on-chip SRAM
 On-chip SRAM needed for other network processing
 SRAM amount depends on “how often" SRAM counters have
to be flushed - if arbitrary increments are allowed (e.g. byte
counting), more SRAM needed
 Integer specific, no decrements
6
Basic Architecture: Randomized Scheme
 Proposed by Lin & Xu in Hotmetrics08.
 Counters randomly distributed across B memory banks B > 1= 1/µ, where µ is
the SRAM-to-DRAM access latency ratio.
7
Adversarial Access Patterns
 Even though random address permutation is applied, the
memory loads to the DRAM banks may not be balanced.
 Counter index permutation in our scheme makes it difficult for
an adversary to purposely trigger a large number of consecutive
counter updates to the same memory bank with updates to
distinct counters since the pseudorandom permutation function
(or the key it uses) is not known to the outside world.
 An adversary can only try to trigger consecutive counter updates to
the same counter, which would result in consecutive accesses to the
same memory bank.
8
Extended Architecture to Handle Adversaries
 Fixed pipelined delay module absorbs repeated updates to the same memory
location.
 Implemented as a fully associative cache with FIFO replacement policy
1 1 21
9
Worst Case Request Pattern
 q + r requests for distinct counters a1, ..., aq+r
 q requests repeat T times each
 r requests repeat T − 1 times each
10
A Few Definitions(1/2)
 Eg. (1, 1, 1) ≤M (2, 1, 0) ≤M (3, 0, 0).
11
A Few Definitions(2/2)
12
A Useful Theorem
 The following theorem relates majorization, exchangeable
random variables and convex order together.
13
Chernoff Bound(1/2)
 Want to bound the probability that a request queue will overflow in n cycles
 Xs,t is the number of updates to the bank during cycles [s, t],
 τ = t − s, K is length of request queue.
 For total overflow probability bound multiply by B.
14
Chernoff Bound(2/2)
 mi : 1 ≤ i ≤ n be the count of the number of appearances of the ith address
 m∗ : worst case counter update sequences
 Xi, 1 ≤ i ≤ n be the indicator random variable for whether the ith
address is mapped to the DRAM bank
15
Overflow Probability
 Overflow probability for 16 million counters, µ = 1/16, B = 32.
16
Memory Usage Comparison
17