Network Processor Algorithms: Design and Analysis

Balaji Prabhakar
Stanford University

Stochastic Networks Conference
Montreal, July 22, 2004
Overview
• Network Processors
– What are they?
– Why are they interesting to industry and to researchers?
• SIFT: a simple algorithm for identifying large flows
– The algorithm and its uses
• Traffic statistics counters
– The basic problem and algorithms
• Sharing processors and buffers
– A cost/benefit analysis
IP Routers
[Figure: photos of two rack-mounted (19”) routers. Cisco GSR 12416: capacity 160 Gb/s, power 4.2 kW, about 6 ft tall. Juniper M160: capacity 80 Gb/s, power 2.6 kW, about 3 ft tall. Footprints roughly 2 to 2.5 ft deep. A router's capacity is the sum of the rates of its line cards.]
A Detailed Sketch
[Diagram: a router in detail. Several line cards, each containing a lookup engine, a network processor, and packet buffers, connect through an interconnection fabric (the switch) to the outputs; an output scheduler drains the packet buffers at each output.]
Network Processors
• Network processors are an increasingly important
component of IP routers
• They perform a number of tasks (essentially everything except switching and route lookup):
  – Buffer management
  – Congestion control
  – Output scheduling
  – Traffic statistics counters
  – Security …
• They are programmable, hence add great flexibility to a
router’s functionality
Network Processors
• But, because they operate under severe constraints
– very high line rates
– heat constraints
the algorithms that they can support should be lightweight
• They have become very attractive to industry
• They give rise to some interesting algorithmic and performance-analytic questions
Rest Of The Talk
• SIFT: a simple algorithm for identifying large flows
– The algorithm and its uses
with Arpita Ghosh and Costas Psounis
• Traffic statistics counters
– The basic problem and algorithms
with Sundar Iyer, Nick McKeown and Devavrat Shah
• Sharing processors and buffers
– A cost/benefit analysis
with Vivek Farias and Ciamac Moallemi
SIFT: Motivation
• Current egress buffers on router line cards serve packets in a FIFO manner
• But, giving the packets of short flows a higher priority, e.g. using the SRPT (Shortest Remaining Processing Time) policy
  – reduces average flow delay
  – given the heavy-tailed nature of the Internet flow size distribution, the reduction in delay can be huge
[Diagram: an egress buffer, showing the packet orderings produced by FIFO and by SRPT.]
But …
• SRPT is unimplementable
  – the router needs to know residual flow sizes for all enqueued flows: virtually impossible to implement
• Other pre-emptive schemes like SFF (shortest flow first) or LAS (least attained service) are likewise too complicated to implement
• This has led researchers to consider tagging flows at the edge, where the number of distinct flows is much smaller
  – but, this requires a different design of edge and core routers
  – more importantly, it needs extra space in IP packet headers to signal flow size
• Is something simpler possible?
SIFT: A randomized algorithm
• Flip a coin with bias p (= 0.01, say) for heads on each
arriving packet, independently from packet to packet
• A flow is “sampled” if one of its packets has a head on it
[Diagram: a stream of packets, most stamped T (tails) with an occasional H (heads); flows whose packets receive an H are sampled.]
SIFT: A Randomized Algorithm
• A flow of size X has roughly a 0.01·X chance of being sampled
  – flows with fewer than 15 packets are sampled with prob at most ≈ 0.15
  – flows with more than 100 packets are sampled with prob at least ≈ 0.63, approaching 1 for very long flows
  – the precise probability is: 1 – (1 – 0.01)^X
• Most short flows will not be sampled, most long flows will be
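These numbers are easy to check numerically. A minimal sketch (Python; the formula is the one on this slide):

```python
# Probability that a flow of X packets is sampled when every packet
# independently receives a head with bias p: 1 - (1 - p)^X.
def prob_sampled(x_packets: int, p: float = 0.01) -> float:
    return 1.0 - (1.0 - p) ** x_packets

for x in (5, 15, 100, 500):
    print(f"{x:4d}-packet flow: sampled with prob {prob_sampled(x):.3f}")
# Output:
#    5-packet flow: sampled with prob 0.049
#   15-packet flow: sampled with prob 0.140
#  100-packet flow: sampled with prob 0.634
#  500-packet flow: sampled with prob 0.993
```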
The Accuracy of Classification
• Ideally, we would like to sample like the blue curve
• Sampling with prob p gives the red curve
  – there are false positives and false negatives
• Can we get the green curve?
[Plot: Prob(sampled) on the y-axis vs flow size on the x-axis, showing the three curves.]
SIFT+
• Sample with a coin of bias q = 0.1
  – say that a flow is “sampled” if it gets two heads!
  – this reduces the chance of making errors
  – but, you have to count the number of heads per flow
• So, how can we use SIFT at a router?
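A quick numerical comparison of the two rules (a sketch; the p = 0.05 single-head reference point is my choice, picked so that both rules trigger after about 20 packets on average, since 2/q = 1/p = 20):

```python
# SIFT: flow sampled if it gets >= 1 head, each packet a head w.p. p.
# SIFT+: flow sampled if it gets >= 2 heads, each packet a head w.p. q.
def one_head(x: int, p: float) -> float:
    return 1 - (1 - p) ** x

def two_heads(x: int, q: float) -> float:
    # 1 minus Prob(zero heads) minus Prob(exactly one head)
    return 1 - (1 - q) ** x - x * q * (1 - q) ** (x - 1)

for x in (5, 15, 100):
    print(f"X={x:3d}: one head (p=0.05): {one_head(x, 0.05):.3f}   "
          f"two heads (q=0.1): {two_heads(x, 0.1):.3f}")
# X=  5: one head (p=0.05): 0.226   two heads (q=0.1): 0.081
# X= 15: one head (p=0.05): 0.537   two heads (q=0.1): 0.451
# X=100: one head (p=0.05): 0.994   two heads (q=0.1): 1.000
```

The two-heads curve is lower for small flows and at least as high for large ones, i.e. closer to the green curve of the previous slide.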
SIFT at a router
• Sample incoming packets
• Place any packet with a head (or, under SIFT+, the second such packet) in the low priority buffer
• Place all further packets from this flow in the low priority buffer (to avoid mis-sequencing)
[Diagram: a single buffer of size B holding all flows becomes, with sampling, two buffers of size B/2 each: one for short flows, one for long flows.]
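The per-packet work is just a flow-table lookup plus a coin flip. A minimal sketch of the rule above (Python, single-head variant; all names are mine, and a real line card would also age out old flow ids):

```python
import random

SAMPLE_BIAS = 0.01      # per-packet head probability p
sampled_flows = set()   # table of flow ids that have had a head

def buffer_for(flow_id) -> str:
    """Return which buffer a packet joins: 'low' (long flows) or 'high'."""
    if flow_id in sampled_flows:
        return "low"            # later packets follow their flow: no mis-sequencing
    if random.random() < SAMPLE_BIAS:
        sampled_flows.add(flow_id)
        return "low"            # first head: the flow is now sampled
    return "high"               # unsampled (presumed short) flows get priority
```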
Simulation results
• Topology:
[Diagram: traffic sources connected through a router to traffic sinks.]
Overall Average Delays
[Plot not reproduced in the transcript.]
Average Delay for Short Flows
[Plot not reproduced in the transcript.]
Average Delay for Long Flows
[Plot not reproduced in the transcript.]
Implementation Requirements
• SIFT needs
  – two logical queues in one physical buffer
  – to sample arriving packets
  – a table for maintaining the ids of sampled flows
  – to check whether an incoming packet belongs to a sampled flow or not
• All quite simple to implement
A Big Bonus
• The buffer of the short flows has very low occupancy
– so, can we simply reduce it drastically without sacrificing
performance?
• More precisely, suppose
– we reduce the buffer size for the small flows, increase it
for the large flows, keep the total the same as FIFO
SIFT Incurs Fewer Drops
Buffer_Size(Short flows) = 10; Buffer_Size(Long flows) = 290; Buffer_Size(Single FIFO Queue) = 300
[Plot: packet drops for SIFT vs FIFO under these buffer sizes.]
Reducing Total Buffer Size
• Suppose we reduce the buffer size of the long
flows as well
• Questions:
– will packet drops still be fewer?
– will the delays still be as good?
Drops With Less Total Buffer
Buffer_Size(PRQ0) = 10; Buffer_Size(PRQ1) = 190; Buffer_Size(One Queue) = 300
[Plot: packet drops for SIFT (two priority queues, 200 total) vs a single FIFO queue with 300.]
Delay Histogram for Short Flows
[Histogram: percentage of flows vs transfer time (0 to about 0.91 sec) for SIFT and FIFO.]
Delay Histogram for Long Flows
[Histogram: percentage of flows vs transfer time (0 to about 0.91 sec) for SIFT and FIFO.]
Why SIFT Reduces Buffers
• The amount of buffering needed to keep links fully utilized
  – old formula: B = C × RTT = 10 Gb/s × 0.25 s = 2.5 Gb
  – corrected to: B = C × RTT / √N ≈ 250 Mb
• But, this formula is for large (elephant) flows, not for short (mice) flows
  – elephant arrival rate: 0.65 or 0.7 of C; hence smaller buffers suffice for them
  – mice buffers are almost empty due to high priority; mice don’t cause elephant packet drops
  – elephants use TCP to regulate their sending rate
[Diagram: SIFT splits incoming traffic into mice and elephants, each with its own buffer.]
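Plugging numbers into the corrected formula (a sketch; N = 100 concurrent elephants is an illustrative value, chosen because it reproduces the 250 Mb figure above):

```python
from math import sqrt

C = 10e9     # link rate: 10 Gb/s
RTT = 0.25   # round-trip time: 250 ms
N = 100      # long-lived (elephant) flows -- illustrative

old = C * RTT             # 2.5e9 bits = 2.5 Gb of buffer (old rule)
new = C * RTT / sqrt(N)   # 2.5e8 bits = 250 Mb (sqrt(N) correction)
print(f"old rule: {old / 1e9:.2f} Gb   corrected: {new / 1e6:.0f} Mb")
```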
Conclusions for SIFT
• A randomized scheme; preliminary results show that
  – it has a low implementation complexity
  – it reduces delays drastically (users are happy)
  – with 30-35% smaller buffers at egress line cards (router manufacturers are happy)
• Leads to a “15 packets or fewer” express lane on the Internet; could be useful
• Further work needed
  – at the moment we have a good understanding of how to sample, and extensive (and encouraging) simulation tests
  – need to understand the effect of reduced buffers on end-to-end congestion control algorithms
Traffic Statistics Counters: Motivation
• Switches maintain statistics, typically using counters that are
incremented when packets arrive
• At high line rates, memory technology is a limiting factor for the implementation of counters; for example, in a 40 Gb/s switch, each packet must be processed in 8 ns (a minimum-size 40-byte packet is 320 bits, and 320 bits / 40 Gb/s = 8 ns)
• To maintain a counter per flow at these line rates, we would
like an architecture with the speed of SRAM, and the density
(size) of DRAM
Hybrid Architecture
• Shah, Iyer, Prabhakar, and McKeown (2001) proposed a
hybrid SRAM/DRAM architecture
[Diagram: N counters in SRAM receive arrivals (at most one per time slot). A counter management algorithm, once every b time slots, updates a counter in DRAM and empties the corresponding counter in SRAM.]
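In software terms the architecture looks roughly like this (a minimal sketch, with dicts standing in for the two memories and a pluggable CMA; all names are mine, not the paper's):

```python
from collections import defaultdict

b = 8                        # one SRAM -> DRAM update every b time slots
sram = defaultdict(int)      # small, fast counters (a few bits each)
dram = defaultdict(int)      # wide, slow counters (full width)

def arrival(i: int) -> None:
    """At most one counter increment per time slot."""
    sram[i] += 1

def service(choose) -> None:
    """Once every b slots: flush the CMA's chosen counter to DRAM."""
    i = choose(sram)
    if i is not None:
        dram[i] += sram[i]   # DRAM absorbs the accumulated count
        sram[i] = 0          # the SRAM counter is emptied

def lcf(sram):
    """Largest Counter First: serve the fullest SRAM counter."""
    return max(sram, key=sram.get) if sram else None
```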
Counter Management Algorithm
• Shah et al. place a requirement on the counter
management algorithm (CMA) that it must maintain
all counter values accurately
• That is, given N and b, what should the size of each
SRAM counter be so that no counts are missed?
Some CMAs
• Round robin
  – maximum counter value is bN
• Largest Counter First (LCF)
  – optimal in terms of SRAM memory usage
  – no counter can ever have a value larger than ln(bN) / ln(b/(b−1))
Analysis of LCF
• This upper bound is proved by establishing a bound on the following potential (Lyapunov) function
  – let Qi(t) be the size of counter i at time t, and consider F(t) = Σi (b/(b−1))^Qi(t)
• E.g. for b = 2, F(t) = Σi 2^Qi(t)
• Hence, the size of the largest counter is at most ln(bN) / ln(b/(b−1)); for b = 2 this is log2(2N)
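A small simulation is a useful sanity check on this bound (a sketch; uniform random arrivals are far from worst-case, so the observed maximum should sit below the bound):

```python
import random
from math import log

N, b, T = 1000, 2, 100_000
counts = [0] * N
max_seen = 0

for t in range(T):
    counts[random.randrange(N)] += 1                # one arrival per slot
    if t % b == b - 1:                              # one service every b slots
        j = max(range(N), key=counts.__getitem__)   # LCF choice
        max_seen = max(max_seen, counts[j])
        counts[j] = 0                               # counter flushed to DRAM

print("largest SRAM counter observed:", max_seen)
print("LCF bound ln(bN)/ln(b/(b-1)): %.1f" % (log(b * N) / log(b / (b - 1))))
```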
An Implementable Algorithm
• LCF is difficult to implement
– with one counter per flow, we would like to support at least 1
million counters
– maintaining a sorted list of counters to determine the largest counter takes too much SRAM memory
• Ramabhadran and Varghese (2003) proposed a simpler
algorithm with the same memory usage as LCF
LCF with Threshold
• The algorithm keeps track of the counters whose value is at least as large as b
• At any service time, let j be the counter with the largest value among those incremented since the previous service, and let c be its value
  – if c ≥ b, serve counter j
  – if c < b, serve any counter with value at least b; if no such counter exists, serve counter j
• Maintaining the set of counters with values at least b is a non-trivial problem; it is solved using a bitmap and an additional data structure
• Is something even simpler possible?
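The decision step itself is only a few lines (a sketch of the rule described above; a plain set stands in here for the bitmap-plus-auxiliary structure of the real design):

```python
def counter_to_serve(j: int, c: int, at_least_b: set, b: int) -> int:
    """Pick the counter to flush at a service instant.

    j: counter with the largest value among those incremented since
       the previous service; c: that value.
    at_least_b: indices of counters whose value is >= b.
    """
    if c >= b:
        return j                       # j itself is already large enough
    if at_least_b:
        return next(iter(at_least_b))  # any counter with value >= b
    return j                           # fall back to the recent maximum
```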
Some Simpler Algorithms …
• Possible approaches for a CMA that is simpler to implement:
– arrival information (serve largest counter among those
incremented)
– random sampling
– round-robin pointer
• Trade-off between simplicity and performance: more SRAM is
needed in the worst case for these schemes
An Alternative Architecture
[Diagram: N counters in SRAM; the counter management algorithm moves counter values into a FIFO buffer that drains into DRAM.]
• Decision problem: given a counter with a particular value and the occupancy of the buffer, when should the counter value be moved to the FIFO buffer? What size counters does this lead to?
  – an interesting question, and tractable with Poisson arrivals and exponential services
The Cost of Sharing
• We have seen that there is a very limited amount of buffering and
processing capability in each line card
• In order to fully utilize these resources, it will become necessary to
share them amongst the packets arriving at each line card
• But, sharing imposes a cost
– we may need to traverse the switch fabric more often than needed
– each of the two processors involved in a migration will need to do some processing; e.g. ε local plus 1 remote, instead of just 1
– or, the host processor may simply be worse at the processing;
e.g. 1 local versus K (> 1) remote
• Need to understand the tradeoff between costs and benefits
– will focus on a specific queueing model
– interested in simple rules
– benefit measured in reduction of backlogs
The Setup
[Diagram: two queues, each fed by Poisson(λ) arrivals and served by an exp(1) server; a job processed at its own queue costs 1 unit of work, while a migrated job costs K (> 1).]
• Does sharing reduce backlogs?
Additive Threshold Policy
• A job arrives at queue 1
• Send the job to queue 2 if Q1 − Q2 > T (an additive threshold T)
• Otherwise, keep the job in queue 1
• Analogous policy for jobs arriving at queue 2
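A minimal continuous-time simulation of this setup (a sketch: all parameter values are illustrative, the threshold condition above is my reconstruction of the lost formula, and migrated jobs complete at rate 1/K):

```python
import random

def avg_backlog(lam=0.8, K=1.5, T=3, n_events=200_000):
    """Two symmetric queues, each with Poisson(lam) arrivals and an
    exponential rate-1 server; a migrated job needs mean-K service."""
    q = [[], []]                 # per-queue FIFO of job types: 'L' or 'R'
    t = area = 0.0
    for _ in range(n_events):
        # service rate of each queue depends on its head-of-line job type
        mu = [(1.0 if q[i][0] == 'L' else 1.0 / K) if q[i] else 0.0
              for i in range(2)]
        rates = [lam, lam, mu[0], mu[1]]      # arrivals, then departures
        dt = random.expovariate(sum(rates))
        area += dt * (len(q[0]) + len(q[1]))  # integrate total backlog
        t += dt
        e = random.choices(range(4), weights=rates)[0]
        if e < 2:                             # arrival at queue e
            other = 1 - e
            if len(q[e]) - len(q[other]) > T:
                q[other].append('R')          # migrate: remote cost K
            else:
                q[e].append('L')              # stay: local cost 1
        else:
            q[e - 2].pop(0)                   # service completion
    return area / t

print(avg_backlog())
```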
Additive Thresholds - Queue Tails
[Plot: queue-tail probabilities on a log scale (1 down to 0.00001) vs backlog (0 to 14), comparing the additive threshold policy with No Sharing.]
Additive Thresholds - Stability
• Theorem: the additive policy is stable when λ is below a critical value (which depends on the migration cost K and the threshold) and unstable above it
• In particular, the critical value can be strictly less than 1: sharing under the additive rule can sacrifice throughput
Inference
• The pros/cons of sharing
– Reduction in backlogs
– Loss of throughput
Multiplicative Threshold Policy
• A job arrives at queue 1
• Send the job to queue 2 if Q1 > r·Q2 (a multiplicative threshold r)
• Otherwise, keep the job in queue 1
• Theorem: the multiplicative policy is stable for all λ < 1
• Interestingly, this policy improves delays while preserving throughput!
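In the simulation sketch above, only the migration test changes (r is the multiplicative factor; its value on the original slide was lost in extraction). One intuition for the theorem, not the proof:

```python
# Multiplicative variant of the rule in avg_backlog() above:
#     if len(q[e]) > r * len(q[other]):    # instead of the additive test
#         q[other].append('R')
# Intuition: when both queues are long, one rarely exceeds r times the
# other, so migrations (and their K-cost) become rare and no service
# capacity is wasted; hence throughput is preserved for all lam < 1.
```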
Multiplicative Thresholds - Queue Tails
[Plot: queue-tail probabilities on a log scale (1 down to 0.00001) vs backlog (0 to 14), comparing the multiplicative threshold policy with No Sharing.]
Multiplicative Thresholds - Delay
[Plot: average delay (about 2 to 6.5) under the multiplicative threshold policy as the policy parameter varies from 1 to 5.]
Conclusions
• Network processors add useful features to a router’s functionality
• There are many algorithmic questions that come up
– simple, high performance algorithms are needed
• For the theorist, there are many new and interesting
questions; we have seen three examples briefly
– SIFT: a sampling algorithm
– Designing traffic statistics counters
– Sharing: a cost-benefit analysis