Input Queue Switch Technologies


Input Queue Switch Technologies
Speaker: Kuo-Cheng Lu
N300/CCL/ITRI
Outline
• Overview of switching fabric technologies
• Input queue scheduling algorithms
• Scheduling for quality of service
• Multicast scheduling
• Tiny Tera project
• PRIZMA project
• Conclusion
Evolution of Router/Switch
Fabric: two main categories
Why is the input queue switch less efficient?
• Head of Line Blocking (limited throughput)
• Input contention (difficult to control cell delay)
[Figure: cell-flow diagrams comparing an input queue switch and an output queue switch, illustrating input contention and HOL blocking]
Solving HOL blocking by VOQ
• Virtual Output Queuing achieves 100% throughput with a
suitable scheduling algorithm (a data-structure sketch follows the figure below)
[Figure: cell-flow diagrams comparing an input queue switch with VOQ against an output queue switch]
*Even with VOQ, the desired cell output sequence still cannot be obtained, due to input contention!
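To make the VOQ idea concrete, here is a minimal Python sketch of one input port's queue structure; the class and method names are illustrative, not from the slides.

```python
from collections import deque

class VOQInput:
    """One input port with Virtual Output Queuing: a separate FIFO per
    output, so a cell headed for a busy output never blocks cells headed
    for idle outputs (no HOL blocking)."""

    def __init__(self, num_outputs):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output):
        self.voq[output].append(cell)

    def requests(self):
        # The scheduler sees one request per nonempty VOQ.
        return [j for j, q in enumerate(self.voq) if q]

    def dequeue(self, output):
        # Called when the scheduler matches this input to `output`.
        return self.voq[output].popleft()
```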
Control cell delay by speedup
• Moderate speedup with a suitable scheduling algorithm can
control cell delay
• Output buffers are needed
– CIOQ (combined input/output queued) switch
Remark
• VOQ avoids HOL blocking and provides 100%
throughput, but needs a complex scheduling algorithm
• m-times speedup (m < N) can reduce HOL blocking and
input/output contention by using a suitable scheduling
algorithm, approaching the performance of an output queuing
switch
• n-times speedup (n = 2) with VOQ can emulate output
queuing (Nick McKeown)
Input Queue Scheduling
Input Queue Scheduling Algorithms
• First goal: 100% throughput under admissible input
traffic
• Second goal: control cell transfer delay
• Methods: find a matching
– maximum matching
– maximal matching
– maximum/maximal weight matching
– stable matching
using VOQ and/or moderate speedup!
• *Admissible: Σ_j λ(i,j) < 1 for every input i, and
Σ_i λ(i,j) < 1 for every output j
• *Stable: Q(i,j) < ∞ for all i, j; by definition, this is 100% throughput
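The admissibility condition is easy to state in code. Below is a minimal sketch of the row/column check on a rate matrix; the function name and representation are illustrative, not from the slides.

```python
def is_admissible(lam):
    """Check the slide's admissibility condition for an N x N matrix of
    arrival rates lam[i][j] (cells per time slot): no input row and no
    output column may sum to 1 or more."""
    n = len(lam)
    rows_ok = all(sum(lam[i]) < 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok

# Uniform traffic at 90% load on a 4x4 switch is admissible.
print(is_admissible([[0.9 / 4] * 4 for _ in range(4)]))  # True
```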
Maximum or Maximal Matching
• Maximum matching
– Maximizes instantaneous throughput
– Can cause starvation
– Time complexity is very high
• Maximal matching
– No connection can be added to the current match without
altering existing connections
– More practical (e.g. WFA, PIM, iSLIP, DRRM, RRM); a greedy
sketch follows the figure
[Figure: a request graph with its maximum matching and a maximal matching]
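The following sketch shows the maximal (not maximum) property with a one-pass greedy algorithm; it is an illustration of the definition above, not any particular hardware scheduler.

```python
def maximal_match(requests):
    """Greedy maximal matching: scan requests once and accept any edge
    whose input and output are both still free. The result cannot be
    extended without disturbing existing connections, but it may be
    smaller than a maximum matching. `requests` is a list of (input,
    output) pairs."""
    used_in, used_out, match = set(), set(), []
    for i, j in requests:
        if i not in used_in and j not in used_out:
            match.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return match

# Taking (0,0) blocks (1,0); the maximum matching {(0,1),(1,0)} is larger.
print(maximal_match([(0, 0), (0, 1), (1, 0)]))  # [(0, 0)]
```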
Parallel Iterative Matching
[Figure: two PIM iterations (#1, #2) on a 4x4 switch, showing Requests, random Grant selection at the outputs, and random Accept/Match at the inputs]
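Here is a minimal software sketch of PIM's request/grant/accept rounds with uniform random selection, matching the figure above; data representations are illustrative.

```python
import random

def pim(requests, n, iterations=4):
    """PIM sketch: repeat grant/accept rounds with uniform random
    selection, freezing matched ports between iterations.
    requests[i] is the set of outputs input i has cells for."""
    free_in, free_out = set(range(n)), set(range(n))
    match = []
    for _ in range(iterations):
        # Grant: each free output randomly grants one requesting free input.
        grants = {}
        for j in free_out:
            candidates = [i for i in free_in if j in requests[i]]
            if candidates:
                grants.setdefault(random.choice(candidates), []).append(j)
        if not grants:
            break
        # Accept: each input with grants randomly accepts one of them.
        for i, js in grants.items():
            j = random.choice(js)
            match.append((i, j))
            free_in.discard(i)
            free_out.discard(j)
    return match

print(pim([{0, 1}, {0}, {2}], n=3))  # e.g. [(0, 1), (1, 0), (2, 2)]
```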
PIM Performance

E[U_i] ≤ N² / 4^i
E[C] ≈ log N

C = number of iterations required to resolve connections
N = number of ports
U_i = number of unresolved connections after iteration i

Since E[U_i] drops below one connection once 4^i > N², i.e. after
about log₂N iterations, PIM converges in O(log N) iterations on average.
iSLIP
iSLIP with multiple iterations
iSLIP Properties
• De-synchronization is the key to achieving high throughput
(see the sketch after this list)
• Random under low load
• TDM under high load
• Lowest priority to MRU (most recently used)
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average ≤ log₂N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic in one iteration
(c.f. PIM can only achieve 63% throughput in one iteration)
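A software sketch of the iSLIP grant/accept rounds with rotating pointers follows; variable names are illustrative, and the pointer-update rule is the published one (advance only on accepted grants, only in the first iteration).

```python
def islip(requests, g, a, iterations=1):
    """iSLIP sketch. requests[i] = set of outputs input i wants;
    g[j] / a[i] are the rotating grant / accept pointers, kept across
    cell times. Pointers advance only on accepted grants and only in
    the first iteration; this de-synchronizes the arbiters so heavily
    loaded ports fall into a TDM-like pattern."""
    n = len(g)
    free_in, free_out = set(range(n)), set(range(n))
    match = []
    for it in range(iterations):
        # Grant: each free output grants the requesting free input found
        # first when scanning round-robin from its pointer g[j].
        grants = {}
        for j in free_out:
            for k in range(n):
                i = (g[j] + k) % n
                if i in free_in and j in requests[i]:
                    grants.setdefault(i, []).append(j)
                    break
        if not grants:
            break
        # Accept: each input takes the granting output found first when
        # scanning round-robin from its pointer a[i].
        for i, js in grants.items():
            for k in range(n):
                j = (a[i] + k) % n
                if j in js:
                    match.append((i, j))
                    free_in.discard(i)
                    free_out.discard(j)
                    if it == 0:  # first-iteration-only pointer update
                        g[j] = (i + 1) % n
                        a[i] = (j + 1) % n
                    break
    return match
```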
iSLIP Performance
iSLIP Implementation
[Figure: iSLIP implementation with N programmable priority encoders, one per port, exchanging N-bit Grant/Accept signals; each arbiter keeps log₂N bits of pointer state feeding its decision logic]
DRRM (Dual Round-Robin Matching)
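The DRRM slides that follow are figure-only, so here is a minimal sketch of one DRRM round from the algorithm's published description: each input sends a single round-robin request and each output issues a single round-robin grant. Names and representations are illustrative.

```python
def drrm_round(requests, r_ptr, g_ptr):
    """One DRRM round. requests[i] = set of outputs with nonempty VOQs
    at input i; r_ptr[i] / g_ptr[j] are the input-request and
    output-grant round-robin pointers. Both pointers advance one past
    the selection only when it produces a match."""
    n = len(r_ptr)
    # Step 1: each input picks ONE nonempty VOQ, scanning from r_ptr[i].
    sent = {}  # output -> list of requesting inputs
    for i in range(n):
        for k in range(n):
            j = (r_ptr[i] + k) % n
            if j in requests[i]:
                sent.setdefault(j, []).append(i)
                break
    # Step 2: each output grants ONE request, scanning from g_ptr[j].
    match = []
    for j, inputs in sent.items():
        for k in range(n):
            i = (g_ptr[j] + k) % n
            if i in inputs:
                match.append((i, j))
                r_ptr[i] = (j + 1) % n
                g_ptr[j] = (i + 1) % n
                break
    return match
```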
DRRM Implementation
DRRM Performance
Wrapped Wave Front Arbiter
• N steps instead of 2N-1
[Figure: 4x4 request matrix and the resulting match for the wrapped wave front arbiter]
Wave Front Arbiter
[Figure: 4x4 request matrix and the resulting match for the plain wave front arbiter]
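For reference, a software sketch of the wave front sweep: cells on the same anti-diagonal share no row or column, so each wave can be resolved in parallel in hardware. The sequential loop below is an illustrative stand-in for that parallel logic.

```python
def wavefront_arbiter(req):
    """Wave front arbiter sketch: sweep anti-diagonals of the N x N
    boolean request matrix; cell (i, j) wins if it requests and no
    earlier wave matched row i or column j. The plain WFA needs 2N-1
    waves; the wrapped variant folds them into N by taking diagonals
    modulo N."""
    n = len(req)
    row_free = [True] * n
    col_free = [True] * n
    match = []
    for wave in range(2 * n - 1):  # plain (unwrapped) schedule
        for i in range(n):
            j = wave - i
            if 0 <= j < n and req[i][j] and row_free[i] and col_free[j]:
                match.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return match

print(wavefront_arbiter([[1, 1], [1, 0]]))  # [(0, 0), (1, ...)] -> [(0, 0)] plus none for row 1
```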
WFA-Implementation
2x2 WFA
Maximum/Maximal Weight Matching
• 100% throughput for admissible traffic (uniform
or non-uniform)
• Maximum weight matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = queue length of the source port +
sum of queue lengths from the source port to the destination port
• Maximal weight matching (practical algorithms; a greedy sketch
follows this list)
– iOCF
– iLQF
– iLPF (the comparators in the critical path of iLQF are removed)
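The sketch below illustrates the maximal-weight idea in the spirit of iLQF: greedily take the heaviest compatible edges. It is a software illustration of the weighting, not the pipelined hardware the next slides show.

```python
def greedy_weight_match(weights):
    """Greedy maximal-weight matching: take edges in decreasing weight
    while their input and output are free. weights[i][j] = 0 means no
    request; with LQF-style weights, weights[i][j] is the occupancy of
    VOQ (i, j)."""
    n = len(weights)
    edges = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(n)
         if weights[i][j] > 0),
        reverse=True,
    )
    used_in, used_out, match = set(), set(), []
    for w, i, j in edges:
        if i not in used_in and j not in used_out:
            match.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return match

# LQF-style weights: VOQ occupancies.
print(greedy_weight_match([[5, 0], [3, 4]]))  # [(0, 0), (1, 1)]
```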
iLQF Implementation
iLPF Implementation
*10 ns arbitration time for a 32x32 switch using a 0.25 μm CMOS process
(Nick McKeown, IEEE Infocom 1998)
Scheduling for QoS
• WPIM (Weighted PIM)
• Prioritized iSLIP
• FARR (Fair Access Round Robin)
– a kind of maximal weight matching
– US Patent #5517495
• Output Queue Emulation (Nick McKeown, IEEE JSAC 1999)
– a speedup of 2 - 1/N is necessary and sufficient for OQ emulation
– uses the CCF (Critical Cell First) algorithm
• Rate Controller Approach (Anna Charny, IWQoS'98)
– placing rate controllers at the input and output channels and using
the OCF arbitration policy provides a deterministic delay guarantee
for speedup > 2 (the delay is a function of the leaky bucket
parameters of a flow, the speedup, and N)
– incorporating the rate controllers into the arbiter with speedup >=
6 can approach OQ delay performance
WPIM
• Each iteration consists of four stages (a masking-stage sketch
follows this list):
– Request: every unmatched input port sends a request to the destination
of each active VOQ
– Mask: each output creates a per-input mask indicating whether the
input has already transmitted as many cells as its credit allows to
that output in the current frame
– Grant: from the requests that survive the masking stage, the
output port selects one randomly and sends a grant signal to its
originating input port
– Accept: every unmatched input port that receives one or more
grants selects one with equal probability and notifies the
corresponding output port
• Modification:
– each output port clears all its mask bits when all of its
incoming requests are masked and the output port remains
unmatched
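A minimal sketch of the masking stage, including the modification above; the data layout (dicts keyed by port numbers) is an assumption for illustration.

```python
def wpim_mask(requests, credits, sent):
    """WPIM masking-stage sketch: each output masks requests from inputs
    that have used up their per-frame credit. If masking removes every
    request at a still-unmatched output, the output clears its mask bits
    (the modification on the slide) so leftover bandwidth is not wasted.
    requests[j] = set of inputs requesting output j;
    credits[j][i] = cells input i may send to output j per frame;
    sent[j][i] = cells already sent in the current frame."""
    masked = {}
    for j, inputs in requests.items():
        ok = {i for i in inputs if sent[j][i] < credits[j][i]}
        if not ok and inputs:   # all requests masked, output unmatched
            ok = set(inputs)    # clear mask bits, fall back to plain PIM
        masked[j] = ok
    return masked

# Output 0: input 0 is out of credit, input 1 still has credit.
print(wpim_mask({0: {0, 1}}, {0: {0: 2, 1: 2}}, {0: {0: 2, 1: 1}}))
# {0: {1}}
```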
Prioritized iSLIP
• Request:
– input i selects the highest-priority
nonempty queue for output j and sends its level, L(i,j)
• Grant:
– output j finds the highest-level request, L(j) = max over i of L(i,j);
the output maintains a separate pointer for each level, and among
inputs at that level the arbiter uses the pointer G_j^L(j) and the
normal iSLIP scheme to choose the input
• Accept:
– same as Grant, applied at the input side
FARR
• Each input selects the HOL cell of the highest-priority queue
in each VOQ and sends the requests with extended
timestamps (a timestamp prepended with its priority)
• Repeat the following steps R times (see the sketch after the figure):
– (1) each unmatched output that has requests from
unmatched inputs grants the request with the smallest extended
timestamp
– (2) each unmatched input that receives grants accepts the one
with the smallest extended timestamp
– (3) accepted grants are added to the match
[Figure: FARR example showing the extended timestamps (Pri, time) of the cells in VOQ#0 and VOQ#1 of input #0, e.g. (Pri=0, time=23) and (Pri=1, time=18), and the resulting requests to outputs #0 and #1]
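The following sketch encodes extended timestamps as (priority, time) tuples, so Python's tuple ordering compares priority first and age second, and runs one FARR grant/accept round. It assumes, as in the figure, that a lower priority value is more urgent.

```python
def farr_round(requests, free_in, free_out):
    """One FARR round: each unmatched output grants the request with the
    smallest extended timestamp; each unmatched input accepts its
    smallest grant. requests[(i, j)] = (priority, time) of input i's HOL
    cell for output j. The full algorithm repeats this R times, moving
    matched ports out of free_in/free_out."""
    grants = {}  # input -> list of (extended timestamp, output)
    for j in free_out:
        cand = [(ts, i) for (i, jj), ts in requests.items()
                if jj == j and i in free_in]
        if cand:
            ts, i = min(cand)
            grants.setdefault(i, []).append((ts, j))
    return [(i, min(g)[1]) for i, g in grants.items()]

# The priority-0 cell wins output 0 even though the priority-1 cell is older.
reqs = {(0, 0): (0, 23), (1, 0): (1, 18), (1, 1): (1, 18)}
print(farr_round(reqs, {0, 1}, {0, 1}))  # [(0, 0), (1, 1)]
```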
Output Queue Emulation (1/2)
[Figure: a PIAO (Push In Arbitrary Out) queue sorted according to TL(C); the example cell has TL(C)=3, OC(C)=2, IT(C)=1, L(C)=1]
• Definitions
– TL(C): time to leave of cell C
– OC(C): output cushion of cell C
– IT(C): input thread of cell C
– L(C): slackness of cell C, = OC(C) - IT(C)
Output Queue Emulation (2/2)
• Using PIAO as the input queue, CCF (Critical Cell First) as the
input queue insertion policy, and a stable matching can
mimic output queuing with speedup = 2
– put the arriving cell at position OC(C)+1 of the input PIAO queue
(see the sketch below)
– slackness is then always >= 0
– when a cell reaches its time to leave (i.e. OC(C) = 0), either
(1) the cell is already at its output and may depart on time, or
(2) the cell is simultaneously at the head of its input priority
list (because its input thread is zero) and at the head of its output
priority list (because it has reached its time to leave)
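A one-function sketch of the CCF insertion rule stated above, using a plain Python list as the PIAO queue (an illustrative representation):

```python
def ccf_insert(piao_queue, cell, output_cushion):
    """CCF insertion: place the arriving cell at position OC(C)+1 of its
    input's PIAO queue (0-indexed list position OC(C)), so its input
    thread never exceeds its output cushion and slackness stays >= 0.
    piao_queue is an ordinary list, head first."""
    piao_queue.insert(min(output_cushion, len(piao_queue)), cell)

q = ["a", "b", "c"]
ccf_insert(q, "x", 2)   # arriving cell has output cushion OC = 2
print(q)                # ['a', 'b', 'x', 'c']  (position OC+1 = 3)
```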
Remarks
• iSLIP achieves 100% throughput under uniform Bernoulli
traffic (Nick McKeown, IEEE/ACM Transactions on Networking,
April 1999)
• Any maximum weight matching algorithm (e.g. OCF, LQF)
delivers 100% throughput under admissible traffic (Balaji
Prabhakar, IEEE Infocom 2000)
• Any maximal matching (e.g. PIM, iSLIP) with 2-times
speedup delivers 100% throughput under admissible traffic
• A speedup of 2 is sufficient for OQ emulation (Nick
McKeown, IEEE JSAC 1999)
• For bounded cell delay guarantees, exact OQ emulation
may be too costly! Probabilistic or soft emulation is more
practical (Mounir Hamdi, IEEE Comm. Mag. 2000)
Summary of Fabric Architectures

==Centralized Shared Memory Switch==
Throughput = 100%
Delay = Guaranteed
Mem. BW = 2NR

==Input Queue Switch==
Throughput = 58.6%
Delay = ?
Speedup = 1
Mem. BW = 2R
PIM => 100% throughput (uniform);
needs random-access FIFOs (look-ahead)

==Output Queue Switch==
Throughput = 100%
Delay = Guaranteed
Speedup = N
Mem. BW = (N+1)R

==Virtual Output Queue==
Throughput = 100%
Delay = ? (due to input blocking)
Speedup = 1
Mem. BW = 2R
iSLIP, PIM, WFA => 100% throughput (uniform)
(maximal size matching)
LQF, LPF => 100% throughput (non-uniform)
(maximal weight matching)

==Combined Input and Output Queue Switch==
Throughput = 100%
Delay = Guaranteed
Speedup = 2
Mem. BW = 3R
Theoretical result for any arrivals
(can emulate the behavior of an OQ switch)

==Parallel Packet Switch==
Throughput = 100%
Delay = Guaranteed
Speeddown = 3(R/k)
Mem. BW = 3 x 3(R/k) (CIOQ)
Theoretical result for any arrivals
(can emulate the behavior of an OQ switch)
Only meaningful for R > memory BW
[Figure: parallel packet switch with a cell-level demultiplexer spreading the line rate R over k OQ or CIOQ switch planes (#1..#k) running at R/k, and a cell-level multiplexer recombining the outputs]
Multicast method #1
• Copy network + unicast switching
• Drawbacks: increased hardware, increased input contention
[Figure: copy networks]
Multicast method #2
• Use the copying properties of the crossbar fabric
• No fanout-splitting: easy, but low throughput
• Fanout-splitting: higher throughput, but not as simple;
leaves a "residue" (see the sketch below)
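A two-line sketch of fanout-splitting and the residue it leaves; sets of output-port numbers are an illustrative representation.

```python
def fanout_split(hol_fanout, free_outputs):
    """Fanout-splitting: deliver the HOL multicast cell to every
    currently free output in its fanout; whatever remains is the
    'residue' that stays at the head of the queue for later slots."""
    served = hol_fanout & free_outputs
    residue = hol_fanout - served
    return served, residue

# Outputs 0 and 2 are served this slot; output 3 is the residue.
print(fanout_split({0, 2, 3}, {0, 1, 2}))  # ({0, 2}, {3})
```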
The effect of fanout-splitting
[Figure: performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic]
Placement of residue
• Key question: how should outputs grant requests
(and hence decide the placement of residue)?
Residue and throughput
• Result: concentrating residue brings more new work
forward, and hence leads to higher throughput
• But there are fairness problems to deal with
• This and other problems can be examined in a unified
way by mapping the multicasting problem onto a
variation of Tetris
ESLIP - Cisco 12000
Stanford University Tiny Tera Project
IBM’s PRIZMA Project
• 16 input ports
• 16 output ports
• 1.6 - 1.8 Gbps per port
• QoS: up to four priorities
• Built-in support for modular growth in number of ports
• Built-in support for modular growth in port speed
• Built-in support for modular growth in aggregate throughput
• Built-in support for automatic load-sharing
• Self-routing switch element
• Dynamically shared output-buffered element
• Built-in multicast and broadcast
• Aggregate data rate of 28 Gbit/s per module
• 3.8 million transistors on chip
• 624 I/O pins
Architecture Description
• A kind of CIOQ (VOQ + output queuing)
• Schedulers are distributed, with complexity O(N)
– The arbiters at the input side perform input contention resolution
– The output-buffered switch element performs classical output
contention resolution
• By means of the flow-control/VOQ interaction between the switch
element and the input queues, the less expensive input-queue
memory is used to cope with burstiness
Core of PRIZMA
• Conventional shared memory vs. PRIZMA
Performance of PRIZMA
• 16x16 switch element (N = 16)
• Shared memory size M = 256 cells
• *Delay-throughput performance improves notably as the
degree of memory sharing is reduced
– when VOQ is used, there is no HOL blocking, and
performance is determined only by the output-queue space
available to each output for resolving contention!
Scalability of PRIZMA