CSIT560 - Department of Computer Science and Engineering


Introduction to High-Performance Internet Switches and Routers (CSIT560)
Network Architecture
[Figure: the network hierarchy, from campus/residential access switches and access routers (GbE), through metropolitan edge switches and edge routers (10GbE), to core routers on a DWDM long-haul network (10GbE). Source: http://www.ust.hk/itsc/network/]
How the Internet Really Is: Current Trend
[Figure: access links (modems, DSL) feeding a SONET/SDH and DWDM transport core.]
What is Routing?
[Figure: example network with end hosts A-F attached to routers R1-R5.]
Points of Presence (POPs)
[Figure: end hosts A-F reach each other through POP1-POP8.]
Where High-Performance Routers are Used
[Figure: a backbone of routers R1-R16 interconnected by 10 Gb/s links.]
Hierarchical Arrangement
[Figure: end hosts (1000s per mux) feed access multiplexers, then edge routers, then core routers inside a POP; POPs are interconnected by 10 Gb/s "OC192" long-haul links.]
POP: Point of Presence, richly interconnected by a mesh of long-haul links. Typically 40 POPs per national network operator, and 10-40 core routers per POP.
Typical POP Configuration
[Figure: DWDM/SONET terminals on the transport network connect to backbone routers over 10G WAN transport links; backbone routers connect to aggregation switches/routers (edge switches) over 10G router-to-router intra-office links.]
More than 50% of high-speed interfaces are router-to-router (core router) interfaces.
Today's Network Equipment
Layer 3 (Routers): Internet Protocol
Layer 2 (Switches): FR & ATM
Layer 1 (SONET): SONET
Layer 0 (DWDM): DWDM
Functions in a Packet Switch
[Figure: ingress linecard (framing, route lookup, TTL processing, buffering) -> interconnect (with interconnect scheduling) -> egress linecard (buffering, QoS scheduling, framing), coordinated by a control plane; the control path, data path, and scheduling path are shown separately.]
Functions in a Circuit Switch
[Figure: ingress linecard (framing) -> interconnect (with interconnect scheduling) -> egress linecard (framing), plus a control plane; only a control path and a data path, with no per-packet scheduling path.]
Our emphasis for now is on packet switches (IP, ATM, Ethernet, Frame Relay, etc.).
What a Router Looks Like
Cisco CRS-1 (16-slot single-shelf system): full rack (214 cm); capacity 640 Gb/s; power 13.2 kW.
Juniper T1600 (16-slot system): half a rack; capacity 1.6 Tb/s; power 9.1 kW.
[Figure: chassis photos with dimensions (60 cm, 44 cm, 79 cm, 95 cm, 101 cm).]
What a Router Looks Like
Cisco GSR 12416: capacity 160 Gb/s; power 4.2 kW; 6 ft tall, 19 in wide, 2 ft deep.
Juniper M160: capacity 80 Gb/s; power 2.6 kW; 3 ft tall, 19 in wide, 2.5 ft deep.
A Router Chassis
[Figure: chassis photo showing the fans/power supplies and the linecards.]
Backplane
• A circuit board with connectors for line cards
• High-speed electrical traces connecting line cards to the fabric
• Usually passive
• Typically 30-layer boards
Line Card Picture
What do these two have in common?
Cisco Catalyst 3750G
Cisco CRS-1
What do these two have in common?
CRS-1 linecard: 20" x (18"+11") x 1RU; 40 Gb/s, 80 Mpps.
Cat 3750G switch: 19" x 16" x 1RU; 52 Gb/s, 78 Mpps.
Both: state-of-the-art 0.13µm silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support.
What is different between them?
Cisco Catalyst 3750G
Cisco CRS-1
A lot…
CRS-1 linecard: up to 1024 linecards; fully programmable forwarding; 2M prefix entries and 512K ACLs; 46 Tb/s 3-stage switching fabric; MPLS support; H-A non-stop routing protocols.
Cat 3750G switch: up to 9 stack members; hardwired ASIC forwarding; 11K prefix entries and 1.5K ACLs; 32 Gb/s shared stack ring; L2 switching support; re-startable routing applications.
Other packet switches
Cisco 7500 “edge” routers
Lucent GX550 Core ATM switch
DSL router
What is Routing?
[Figure: hosts A-F attached to routers R1-R5; the forwarding table maps Destination -> Next Hop: D -> R3, E -> R3, F -> R5.]
What is Routing?
[Figure: the 20-byte IPv4 header carried by every packet: Version, HLen, Type of Service, Total Packet Length; Fragment ID, Flags, Fragment Offset; TTL, Protocol, Header Checksum; Source Address; Destination Address; Options (if any); followed by the Data. The router forwards on the Destination Address using the table D -> R3, E -> R3, F -> R5.]
What is Routing?
[Figure: the same example network, showing a packet being forwarded hop by hop through the routers.]
Basic Architectural Elements of a Router
Control plane ("typically in software"): routing table updates (OSPF, RIP, IS-IS), admission control, congestion control, reservation.
Per-packet processing ("typically in hardware"): routing lookup, packet classification, switching, arbitration, scheduling.
Basic Architectural Components
Datapath: per-packet processing
1. Forwarding decision (lookup in the forwarding table, on every linecard)
2. Interconnect
3. Output scheduling
Per-packet processing in a Switch/Router
1. Accept packet arriving on an ingress line.
2. Lookup packet destination address in the
forwarding table, to identify outgoing
interface(s).
3. Manipulate packet header: e.g., decrement TTL,
update header checksum.
4. Send packet to outgoing interface(s).
5. Queue until line is free.
6. Transmit packet onto outgoing line.
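As a rough sketch of steps 2-4, the fragment below does a naive longest-prefix match, decrements the TTL, and picks an egress interface. The prefix table, packet format, and interface names are all hypothetical, and real routers do this in hardware with far more efficient lookup structures:

    import ipaddress

    # Hypothetical forwarding table: prefix -> outgoing interface
    FIB = {
        ipaddress.ip_network("10.0.0.0/8"): "if1",
        ipaddress.ip_network("10.1.0.0/16"): "if2",
        ipaddress.ip_network("0.0.0.0/0"): "if0",   # default route
    }

    def lookup(dst):
        """Longest-prefix match (naive linear scan, for illustration only)."""
        addr = ipaddress.ip_address(dst)
        best = max((p for p in FIB if addr in p), key=lambda p: p.prefixlen)
        return FIB[best]

    def forward(packet):
        """Steps 2-4 of the per-packet processing listed above."""
        out_if = lookup(packet["dst"])      # 2. lookup destination address
        packet["ttl"] -= 1                  # 3. manipulate header (decrement TTL)
        if packet["ttl"] <= 0:
            return None                     # packet would be dropped here
        return out_if                       # 4. hand to the outgoing interface

    print(forward({"dst": "10.1.2.3", "ttl": 64}))   # -> if2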
ATM Switch
• Lookup cell VCI/VPI in the VC table.
• Replace the old VCI/VPI with the new one.
• Forward cell to outgoing interface.
• Transmit cell onto link.
Ethernet Switch
• Lookup frame DA in forwarding table.
– If known, forward to correct port.
– If unknown, broadcast to all ports.
• Learn SA of incoming frame.
• Forward frame to outgoing interface.
• Transmit frame onto link.
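A minimal sketch of this learn-and-forward behaviour (hypothetical frame format; flooding is modelled as returning all other ports):

    class EthernetSwitch:
        def __init__(self, ports):
            self.ports = list(ports)
            self.mac_table = {}                 # MAC address -> port

        def handle(self, frame, in_port):
            # Learn the source address of the incoming frame.
            self.mac_table[frame["src"]] = in_port
            # Lookup the destination address in the forwarding table.
            out = self.mac_table.get(frame["dst"])
            if out is not None:
                return [out]                    # known: forward to correct port
            return [p for p in self.ports if p != in_port]   # unknown: flood

    sw = EthernetSwitch(ports=[1, 2, 3, 4])
    print(sw.handle({"src": "aa:aa", "dst": "bb:bb"}, in_port=1))  # flood: [2, 3, 4]
    print(sw.handle({"src": "bb:bb", "dst": "aa:aa"}, in_port=2))  # learned: [1]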
IP Router
• Lookup packet DA in forwarding table.
– If known, forward to correct port.
– If unknown, drop packet.
• Decrement TTL, update header Cksum.
• Forward packet to outgoing interface.
• Transmit packet onto link.
Special per packet/flow processing
• The router can be equipped with additional capabilities to provide
special services on a per-packet or per-class basis.
• The router can perform some additional processing on the
incoming packets:
– Classifying the packet
• IPv4, IPv6, MPLS, ...
– Delivering packets according to a pre-agreed service: Absolute
service or relative service (e.g., send a packet within a given
deadline, give a packet a better service than another packet
(IntServ – DiffServ))
– Filtering packets for security reasons
– Treating multicast packets differently from unicast packets
Per-packet Processing Must be Fast!

Year | Aggregate line rate | Arriving rate of 40B POS packets (Mpps)
1997 | 622 Mb/s            | 1.56
1999 | 2.5 Gb/s            | 6.25
2001 | 10 Gb/s             | 25
2003 | 40 Gb/s             | 100
2006 | 80 Gb/s             | 200
2008 | ...                 | ...

1. Packet processing must be simple and easy to implement
2. Memory access time is the bottleneck
200 Mpps × 2 lookups/pkt = 400 Mlookups/sec → 2.5 ns per lookup
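A quick sketch of the arithmetic behind these numbers, assuming 40-byte packets and two lookups per packet as on this slide, and ignoring POS framing overhead (so the raw figures are slightly higher than the table's):

    PKT_BITS = 40 * 8          # minimum-size 40-byte packet
    LOOKUPS_PER_PKT = 2        # as assumed on this slide

    for rate_gbps in [0.622, 2.5, 10, 40, 80]:
        pps = rate_gbps * 1e9 / PKT_BITS              # packets per second
        lookup_ns = 1e9 / (pps * LOOKUPS_PER_PKT)     # time budget per lookup
        print(f"{rate_gbps:>5} Gb/s: {pps / 1e6:6.1f} Mpps, "
              f"{lookup_ns:5.2f} ns per lookup")
    # 80 Gb/s gives 250 Mpps of raw 40B packets; with POS overhead the slide
    # quotes 200 Mpps, i.e. 400 Mlookups/s and a 2.5 ns budget per lookup.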
First Generation Routers
[Figure: a shared backplane connects a single CPU (with route table and buffer memory) to several line interfaces (MACs).]
Typically <0.5 Gb/s aggregate capacity.
Bus-based Router Architectures with a Single Processor
• The first generation of IP routers
• Based on software implementations on a single general-purpose CPU
• Limitations:
  – Serious processing bottleneck in the central processor
  – Memory-intensive operations (e.g. table lookups and data movement) limit the effectiveness of the processor power
  – The shared input/output (I/O) bus is a severe limiting factor on overall router throughput
Second Generation Routers
[Figure: a shared bus connects a central CPU (route table, buffer memory) to line cards, each with its own buffer memory and forwarding cache, plus MACs.]
Typically <5 Gb/s aggregate capacity.
Bus-based Router Architectures with Multiple Processors
• Architectures with route caching
  – Packet forwarding operations are distributed to the network interface cards, each with its own processor and route cache
  – Packets are transmitted once over the shared bus
  – Limitations:
    » The central routing table is a bottleneck at high speeds
    » Traffic-dependent throughput (caching)
    » The shared bus is still a bottleneck
Third Generation Routers
[Figure: a switched backplane connects line cards (each with local buffer memory, forwarding table and MAC) and a CPU card holding the routing table.]
Typically <50 Gb/s aggregate capacity.
Switch-based Router Architectures with Fully
Distributed Processors
• To avoid bottlenecks:
– Processing power
– Memory bandwidth
– Internal bus bandwidth
• Each network interface is equipped
with appropriate processing power
and buffer space.
Fourth Generation Routers/Switches
Optics inside a router for the first time
[Figure: linecards connected to a separate switch core by optical links hundreds of metres long.]
0.3 - 10 Tb/s routers in development.
Juniper TX8/T640
Alcatel 7670 RSP
Avici TSR
Chiaro
Next Gen. Backbone Network Architecture – One Backbone, Multiple Access Networks
[Figure: a (G)MPLS-based multiservice intelligent packet backbone with PE routers at service POPs; CE routers connect the access networks: dual-stack IPv4-IPv6 cable networks, dual-stack enterprise networks, residential DSL/FTTH/dial access, mobile networks (SGSN/GGSN), telecommuters, and ISPs offering native IPv6 services via an IPv6 IX.]
• One backbone network
• Maximizes speed, flexibility and manageability
Current Generation: Generic Router Architecture
[Figure: per-packet datapath: header processing (lookup of the IP address in an address table of ~1M prefixes held in off-chip DRAM, then header update), followed by queueing the packet in buffer memory (~1M packets, off-chip DRAM).]
Current Generation: Generic Router Architecture (IQ)
[Figure: N linecards, each with header processing (lookup, header update, address table) and its own buffer memory queue at the input; a central scheduler arbitrates transfers across the interconnect.]
Current Generation: Generic Router Architecture (OQ)
[Figure: N linecards with header processing at the inputs; packets are queued in buffer memory at the outputs.]
Basic Architectural Elements of a Current Router
Typical IP router linecard
[Figure: physical layer -> framing & maintenance -> packet processing (with lookup tables) -> buffer management & scheduling (with buffer & state memory), on both the ingress and egress sides, connected over the backplane to a buffered or bufferless fabric (e.g. crossbar, bus) and its scheduler.]
OC192c linecard: ~10-30M gates, ~2 Gbits of memory, ~2 square feet, >$10k cost (price ~$100k).
Performance Metrics
1. Capacity
   – "maximize C, s.t. volume < 2 m³ and power < 5 kW"
2. Throughput
   – Operators like to maximize usage of expensive long-haul links.
3. Controllable delay
   – Some users would like predictable delay.
   – This is feasible with output queueing plus weighted fair queueing (WFQ).
[Figure: rate-controlled flows entering a WFQ scheduler.]
Why do we Need Faster Routers?
1. To prevent routers from becoming the bottleneck
in the Internet.
2. To increase POP capacity, and to reduce cost,
size and power.
Why we Need Faster Routers
1: To prevent routers from being the bottleneck
[Figure: normalized growth since 1980 (log scale up to 1,000,000, through 2005): line capacity 2x / 7 months; user traffic 2x / 12 months; router capacity 2.2x / 18 months; Moore's Law 2x / 18 months; DRAM random access time 1.1x / 18 months.]
Why we Need Faster Routers
1: To prevent routers from being the bottleneck
[Figure: normalized growth, 2003-2012: traffic grows much faster than router capacity, opening a roughly 5-fold disparity.]
Why we Need Faster Routers
2: To reduce cost, power & complexity of POPs
• Big POPs need big routers
[Figure: a POP built from a few large routers vs. the same POP built from many smaller routers.]
• Interfaces: price >$200k, power >400W
• About 50-60% of interfaces are used for interconnection within the POP.
• The industry trend is towards a large, single router per POP.
A Case Study: UUNET Internet Backbone Build-up
1999 view (4Q):
• 8 OC-48 links between POPs (not parallel)
2002 view (4Q):
• 52 OC-48 links between POPs: many parallel links
• 3 OC-192 super-POP links: multiple parallel interfaces between POPs (D.C. – Chicago; NYC – D.C.)
To meet the traffic growth, higher-performance routers with higher port speeds are required.
Why we Need Faster Routers
2: To reduce cost, power & complexity of POPs
[Figure: DSLAMs and CMTSs connect directly ("direct connects") to a single large router instead of through a layer of L3/4 switches.]
• Further reduces CapEx and operational cost
• Further increases network stability
Ideal POP
[Figure: gigabit routers connect the existing carrier equipment (VoIP gateways, SONET, digital subscriber line aggregation, ATM, Gigabit Ethernet, cable modem aggregation) directly to the carrier optical transport: DWDM and optical switches.]
Why are Fast Routers Difficult to Make?
1. Big disparity between line rates and memory access speed
[Figure: normalized growth rate, 1980-2005 (log scale): line rates grow far faster than memory access speed.]
Problem: Fast Packet Buffers
Example: 40 Gb/s packet buffer
Size = RTT × BW = 10 Gb; 64-byte packets
[Figure: the buffer manager writes and reads the buffer memory at rate R: one packet every 12.8 ns in each direction.]
Use SRAM? + fast enough random access time, but – too low density to store 10 Gb of data.
Use DRAM? + high density means we can store the data, but – too slow (50 ns random access time).
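A quick check of these numbers, using the rule-of-thumb buffer size of RTT × line rate and assuming a 0.25 s round-trip time:

    RATE = 40e9              # 40 Gb/s line
    RTT = 0.25               # assumed round-trip time, seconds
    PKT_BITS = 64 * 8        # 64-byte packets

    buffer_bits = RATE * RTT                 # = 1e10 bits = 10 Gb
    pkt_time_ns = PKT_BITS / RATE * 1e9      # = 12.8 ns between packets

    print(f"buffer = {buffer_bits / 1e9:.0f} Gb, "
          f"one packet every {pkt_time_ns:.1f} ns")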
Memory Technology (2007)

Technology      | Max single-chip density | $/chip ($/MByte)      | Access speed | Watts/chip
Networking DRAM | 64 MB                   | $30-$50 ($0.50-$0.75) | 40-80 ns     | 0.5-2 W
SRAM            | 8 MB                    | $50-$60 ($5-$8)       | 3-4 ns       | 2-3 W
TCAM            | 2 MB                    | $200-$250 ($100-$125) | 4-8 ns       | 15-30 W
How Fast can a Buffer be Made?
[Figure: the external line feeds the buffer memory over a 64-byte-wide bus; access time is ~5 ns for SRAM, ~50 ns for DRAM.]
Rough estimate:
– 5/50 ns per memory operation (SRAM/DRAM)
– Two memory operations per packet
– Therefore, maximum ~50/5 Gb/s
Aside: buffers need to be large for TCP to work well, so DRAM is usually required.
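The rough estimate above can be reproduced directly: with a 64-byte-wide memory and two memory operations (one write, one read) per packet, the sustainable line rate is bounded by the random access time. A back-of-the-envelope sketch:

    BUS_BITS = 64 * 8        # 64-byte wide memory access

    def max_line_rate(access_ns):
        """One write + one read per packet, each taking access_ns."""
        return BUS_BITS / (2 * access_ns * 1e-9)   # bits per second

    print(f"SRAM (5 ns):  {max_line_rate(5) / 1e9:5.1f} Gb/s")   # ~51 Gb/s
    print(f"DRAM (50 ns): {max_line_rate(50) / 1e9:5.1f} Gb/s")  # ~5 Gb/s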
Packet Caches
[Figure: the buffer manager keeps a small ingress SRAM cache of FIFO tails and a small egress SRAM cache of FIFO heads for each of Q queues; arriving packets enter the tail cache, departing packets leave from the head cache, and packets move between the SRAM caches and the DRAM buffer memory b >> 1 packets at a time.]
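A minimal sketch of this hybrid idea for a single FIFO (batch size, thresholds and data structures are illustrative only, not the published cache-management algorithm): the tail and head caches live in fast SRAM, and the slow DRAM is only touched b packets at a time.

    from collections import deque

    B = 8    # packets moved to/from DRAM per batch (b >> 1)

    class HybridQueue:
        """One FIFO split across an SRAM tail cache, DRAM, and an SRAM head cache."""
        def __init__(self):
            self.tail_sram = deque()   # recently arrived packets
            self.dram = deque()        # bulk storage, accessed B packets at a time
            self.head_sram = deque()   # packets about to depart

        def write(self, pkt):
            self.tail_sram.append(pkt)
            if len(self.tail_sram) >= B:              # spill a full batch to DRAM
                self.dram.extend(self.tail_sram.popleft() for _ in range(B))

        def read(self):
            if not self.head_sram:                    # refill head cache in a batch
                src = self.dram if self.dram else self.tail_sram
                for _ in range(min(B, len(src))):
                    self.head_sram.append(src.popleft())
            return self.head_sram.popleft() if self.head_sram else None

    q = HybridQueue()
    for i in range(20):
        q.write(i)
    print([q.read() for _ in range(20)])   # packets depart in FIFO order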
Why are Fast Routers Difficult to Make?
Packet processing gets harder
[Figure: instructions per arriving byte over time: what we'd like (more features: QoS, multicast, security, ...) versus what will actually happen.]
Why are Fast Routers Difficult to Make?
[Figure: clock cycles available per minimum-length packet, 1996-2001, shrinking steadily as line rates outpace clock rates.]
Options for packet processing
• General purpose processor
– MIPS
– PowerPC
– Intel
• Network processor
– Intel IXA and IXP processors
– IBM Rainier
– Control plane processors: SiByte (Broadcom), QED
(PMC-Sierra).
• FPGA
• ASIC
General Observations
• Up until about 2000,
– Low-end packet switches used general purpose
processors,
– Mid-range packet switches used FPGAs for datapath,
general purpose processors for control plane.
– High-end packet switches used ASICs for datapath,
general purpose processors for control plane.
• More recently,
– 3rd party network processors now used in many low- and mid-range datapaths.
– Home-grown network processors used in high-end.
Why are Fast Routers Difficult to Make?
Demand for router performance exceeds Moore's Law
Growth in capacity of commercial routers (per rack):
– 1992: ~2 Gb/s
– 1995: ~10 Gb/s
– 1998: ~40 Gb/s
– 2001: ~160 Gb/s
– 2003: ~640 Gb/s
– 2007: ~11.5 Tb/s
Average growth rate: 2.2x / 18 months.
Maximizing the throughput of a router
Engine of the whole router
• Operators increasingly demand throughput
guarantees:
– To maximize use of expensive long-haul links
– For predictability and planning
– Serve as many customers as possible
– Increase the lifetime of the equipment
– Despite lots of effort and theory, no commercial router
today has a throughput guarantee.
Maximizing the Throughput of a Router
Engine of the whole router
[Figure: the packet-switch datapath shown earlier (ingress linecard, interconnect, egress linecard), with the interconnect and its scheduler highlighted as the engine of the router.]
Maximizing the throughput of a router
Engine of the whole router
• This depends on the architecture of the switching:
– Input Queued
– Output Queued
– Shared memory
• It depends on the arbitration/scheduling
algorithms within the specific architecture
• This is key to the overall performance of the
router.
Why are Fast Routers Difficult to Make?
Power: it is exceeding the limit
[Figure: approximate router power consumption, 1990-2002, climbing towards 6 kW.]
Switching Architectures
Generic Router Architecture
[Figure: N linecards with header processing (lookup, header update, address table); arriving packets are queued in buffer memories that must run at N times the line rate.]
Generic Router Architecture
[Figure: N linecards, each with header processing and its own buffer memory; a central scheduler coordinates transfers across the interconnect.]
Interconnects
Two basic techniques:
• Input queueing – usually a non-blocking switch fabric (e.g. crossbar)
• Output queueing – usually a fast bus
Simple Model of an Output Queued Switch
[Figure: four ingress links and four egress links, all at link rate R; arriving packets go straight into a queue at their egress link.]
How an OQ Switch Works
Output Queued (OQ) Switch
Characteristics of an output queued
(OQ) switch
• Arriving packets are immediately written into the output
queue, without intermediate buffering.
• The flow of packets to one output does not affect the flow
to another output.
• An OQ switch has the highest throughput, and lowest
delay.
• The rate of individual flows, and the delay of packets can
be controlled (QoS).
The Shared Memory Switch
[Figure: a single physical memory device shared by all N ingress and egress links, each at rate R.]
Characteristics of a Shared Memory Switch
Assume a memory of size M bytes, and let Q_i(t) be the length of the queue for output i at time t.
Static queues: if Q_i(t) <= M/N for all i, then the switch operates the same as the basic output queued switch.
Dynamic queues: if queues can have any length, so long as sum_{i=1}^{N} Q_i(t) <= M, then the loss rate is lower.
Memory bandwidth
Basic OQ switch:
• Consider an OQ switch with N different physical
memories, and all links operating at rate R bits/s.
• In the worst case, packets may arrive continuously from all
inputs, destined to just one output.
• Maximum memory bandwidth requirement for each
memory is (N+1)R bits/s.
Shared Memory Switch:
• Maximum memory bandwidth requirement for the memory
is 2NR bits/s.
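Plugging illustrative numbers into the two expressions above shows how quickly these requirements grow with port count:

    def oq_bw_per_memory(n, r):      # worst case: N simultaneous writes + 1 read
        return (n + 1) * r

    def shared_mem_bw(n, r):         # N writes + N reads into one memory
        return 2 * n * r

    N, R = 32, 10e9                  # 32 ports at 10 Gb/s
    print(f"OQ, per output memory: {oq_bw_per_memory(N, R) / 1e9:.0f} Gb/s")  # 330
    print(f"Shared memory, total:  {shared_mem_bw(N, R) / 1e9:.0f} Gb/s")     # 640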
How Fast can we Make a Centralized Shared Memory Switch?
[Figure: N ports share a 5 ns SRAM over a 200-byte-wide bus.]
• 5 ns per memory operation
• Two memory operations per packet
• Therefore, up to 160 Gb/s (200 bytes × 8 bits / 10 ns)
• In practice, closer to 80 Gb/s
Output Queueing
The "ideal"
[Figure: cells arriving in the same cell time for the same output are all written into that output's queue immediately and depart one per cell time.]
How to Solve the Memory Bandwidth
Problem?
Use Input Queued Switches
• In the worst case, one packet is written and one
packet is read from an input buffer
• Maximum memory bandwidth requirement for each
memory is 2R bits/s.
• However, using FIFO input queues can result in what
is called “Head-of-Line (HoL)” blocking
Input Queueing
Head-of-line blocking
[Figure: delay vs. load for FIFO input queueing: delay blows up at a load of 58.6%, well short of 100%.]
Head-of-Line Blocking
[Figure (animation): the cell at the head of a FIFO input queue waits for a busy output and blocks cells behind it that are destined to idle outputs.]
Virtual Output Queues (VoQ)
• Virtual Output Queues:
– At each input port, there are N queues – each
associated with an output port
– Only one packet can go from an input port at a time
– Only one packet can be received by an output port at a
time
• It retains the scalability of FIFO input-queued
switches
• It eliminates the HoL problem with FIFO input
Queues
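A sketch of the VOQ structure and of the constraint the scheduler enforces each cell time. The matching policy shown is a simple greedy pick for illustration, not any particular published algorithm (e.g. iSLIP):

    from collections import deque

    N = 4
    # voq[i][j] holds cells at input i destined to output j
    voq = [[deque() for _ in range(N)] for _ in range(N)]

    def schedule():
        """Pick at most one cell per input and per output (a greedy matching)."""
        used_inputs, used_outputs, match = set(), set(), []
        for i in range(N):
            for j in range(N):
                if voq[i][j] and i not in used_inputs and j not in used_outputs:
                    match.append((i, j))
                    used_inputs.add(i)
                    used_outputs.add(j)
                    break
        return match

    voq[0][2].append("cell-a")
    voq[1][2].append("cell-b")   # same output as cell-a: only one can go
    voq[1][3].append("cell-c")
    print(schedule())            # -> [(0, 2), (1, 3)]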
Input Queueing
Virtual output queues
[Figure: each input maintains a separate queue for every output.]
Input Queueing with Virtual Output Queues
[Figure: delay vs. load: with VOQs (and a good scheduler) the delay no longer blows up at 58.6%; load can approach 100%.]
Input Queueing (VOQ)
[Figure: a VOQ switch needs memory bandwidth of only 2R per port, but the centralized scheduler can be quite complex!]
Combined IQ/SQ Architecture
Can be a good compromise
[Figure: N input queues feed a routing fabric; the N output queues live in one shared memory; flow control runs back from the shared memory to the inputs.]
A Comparison
Memory speeds for a 32x32 switch, cell size = 64 bytes

Line Rate | Shared-Memory BW | Access Time per cell | Input-Queued BW | Access Time per cell
100 Mb/s  | 6.4 Gb/s         | 80 ns                | 200 Mb/s        | 2.56 µs
1 Gb/s    | 64 Gb/s          | 8 ns                 | 2 Gb/s          | 256 ns
2.5 Gb/s  | 160 Gb/s         | 3.2 ns               | 5 Gb/s          | 102.4 ns
10 Gb/s   | 640 Gb/s         | 0.8 ns               | 20 Gb/s         | 25.6 ns
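The table follows from the formulas on the previous slides: 2NR for the shared memory, 2R for an input-queued port, and an access time of one 64-byte cell per memory operation. A small sketch that regenerates the entries:

    N, CELL_BITS = 32, 64 * 8

    for line_rate in [100e6, 1e9, 2.5e9, 10e9]:
        sm_bw = 2 * N * line_rate            # shared-memory bandwidth
        iq_bw = 2 * line_rate                # per-input memory bandwidth
        print(f"{line_rate / 1e9:5.2f} Gb/s | "
              f"SM: {sm_bw / 1e9:6.1f} Gb/s, {CELL_BITS / sm_bw * 1e9:6.2f} ns/cell | "
              f"IQ: {iq_bw / 1e9:5.1f} Gb/s, {CELL_BITS / iq_bw * 1e9:8.2f} ns/cell")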
Scalability of Switching Fabrics
Shared Bus
• It is the simplest interconnect possible
• Protocols are very well established
• Multicasting and broadcasting are natural
• It has a scalability problem: there cannot be multiple concurrent transmissions
• Its maximum bandwidth is around 100 Gb/s, which limits the maximum number of I/O ports and/or the line rates
• It is typically used for "small" shared-memory or output-queued switches – a very good choice for Ethernet switches
Crossbars
• Becoming the preferred interconnect for high-speed switches
• Very high throughput; supports QoS and multicast
• N² crosspoints – but that is not the real limitation nowadays
[Figure: N×N crossbar with data in on the rows, data out on the columns, and a configuration input that sets the crosspoints.]
Limiting Factors
Crossbar switch:
– N² crosspoints per chip
– It's not obvious how to build a crossbar from multiple chips
– Capacity of "I/O"s per chip:
  • State of the art: about 200 pins, each operating at 3.125 Gb/s, i.e. ~600 Gb/s per chip
  • About 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup
  • Crossbar chips today are limited by the "I/O" capacity
Limitations to Building Large Crossbar Switches: I/O Pins
• Maximum practical bit rate per pin ~ 3.125 Gb/s
  – At this speed you need between 2-4 pins per single bit
  – To achieve a 10 Gb/s (OC-192) line rate, you need around 4 parallel data lines (4-bit parallel transmission)
  – For example, consider a 4-bit-parallel 64-input crossbar designed to support OC-192 line rates per port: each port interface would require 4 x 3 = 12 pins in each direction, so a 64-port crossbar would need 12 x 64 x 2 = 1536 pins just for the I/O data lines
  – Hence, the real problem is I/O pin limitations
• How to solve the problem?
Scaling: Trying to Build a Crossbar from Multiple Chips
[Figure: tiling a 16x16 crossbar out of building blocks with 4 inputs and 4 outputs; to pass signals through, each block ends up needing eight inputs and eight outputs!]
How to Build a Scalable Crossbar
1. Use bit slicing – parallel crossbars
• For example, the 4-bit-parallel design in the previous example can be implemented with 4 parallel 1-bit crossbars.
• Each port interface then requires only 1 x 3 = 3 pins in each direction, so a 64-port crossbar needs 3 x 64 x 2 = 384 pins for the I/O data lines – which is reasonable (but we need 4 chips here).
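The pin counts on these two slides come from the same simple product; a sketch of the arithmetic, with 3 pins per data line and 64 ports as assumed above:

    PORTS = 64
    PINS_PER_LINE = 3          # as assumed on the previous slide

    def io_pins(data_lines_per_port):
        # data lines per port x pins per line x ports x 2 directions
        return data_lines_per_port * PINS_PER_LINE * PORTS * 2

    print(io_pins(4))   # 4-bit-wide datapath on one chip: 1536 pins
    print(io_pins(1))   # bit-sliced across 4 chips: 384 pins per chip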
Scaling: Bit-slicing
[Figure: each cell from a linecard is striped across k identical crossbar planes.]
• Cell is "striped" across multiple identical planes.
• Crossbar-switched "bus".
• The scheduler makes the same decision for all slices.
Scaling: Time-slicing
[Figure: each cell goes over a single crossbar plane; the planes take cells in turn.]
• A cell goes over one plane and takes N cell times.
• The scheduler is unchanged.
• The scheduler makes a decision for each slice in turn.
HKUST 10 Gb/s 256x256 Crossbar Switch Fabric Design
• Our overall switch fabric is an OC-192 256x256 crossbar switch.
• The system is composed of 8 256x256 crossbar chips, each running at 2 Gb/s (to compensate for the overhead and to provide a switch speedup).
• The deserializer (DES) converts the OC-192 10 Gb/s data on the fiber link into 8 low-speed signals, while the serializer (SER) serializes the low-speed signals back onto the fiber link.
[Figure: input at 10 Gb/s -> DES -> 8 bit-sliced 256x256 crossbar planes, driven by a scheduler -> SER -> output at 10 Gb/s.]
Architecture of the Crossbar Chip
• Crossbar switch core (1 GHz, 256x256) – fulfils the switch functions
• Controller – configures the crossbar core
• High-speed data links – communicate between this chip and the SER/DES
• PLL – provides a precise on-chip clock
Technical Specification of our Core-Crossbar Chip
Full crossbar core: 256x256 (embedded with 2 bit-slices)
Technology: TSMC 0.25 µm SCN5M Deep (lambda = 0.12 µm)
Layout size: 14 mm x 8 mm
Transistor count: 2000k
Supply voltage: 2.5 V
Clock frequency: 1 GHz
Power: 40 W
Layout of a 256*256 crossbar switch core
HKUST Crossbar Chip in the News
Researchers offer alternative to typical crossbar design
http://www.eetimes.com/story/OEG20020820S0054
By Ron Wilson - EE Times
August 21, 2002 (10:56 a.m. ET)
PALO ALTO, Calif. — In a technical paper presented at the Hot Chips conference here Monday (Aug. 19), researchers Ting Wu, Chi-Ying Tsui and Mounir Hamdi from Hong Kong University of Science and Technology (China) offered an alternative pipeline approach to crossbar design.
Their approach has yielded a 256-by-256 signal switch with a 2-GHz input bandwidth, simulated in a 0.25-micron, 5-metal process.
The growing importance of crossbar switch matrices, now used for on-chip interconnect as well as for switching fabric in routers, has led to increased study of the best ways to build these parts.
Scaling a crossbar
• Conclusion: scaling the capacity is relatively
straightforward (although the chip count and
power may become a problem).
• In each scheme so far, the number of ports stays
the same, but the speed of each port is increased.
• What if we want to increase the number of ports?
• Can we build a crossbar-equivalent from multiple
stages of smaller crossbars?
• If so, what properties should it have?
Multi-Stage Switches
Basic Switch Element
This is equivalent to a crosspoint in the crossbar (no longer a good argument).
[Figure: a 2x2 switch element with two states – cross and through – and optional buffering.]
Example of a Multistage Switch
• It needs N·logN internal switches (crosspoints) – less than the crossbar.
[Figure: an 8-input, 3-stage network; between stages the wires are permuted by a perfect shuffle (interleaving one half of the "deck" with the other half), and the outputs are labelled 000-111.]
Packet Routing
The bits of the destination address provide the required routing tags: the digits of the destination port address set the state of the switch elements, one digit per stage.
[Figure: two packets with destination port addresses 011 and 101 self-route through the three stages; at each stage the highlighted bit of the address selects the upper (0) or lower (1) output of the 2x2 element.]
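A small sketch of this self-routing idea for an omega (perfect-shuffle) network: stage k simply looks at bit k of the destination port address, 0 meaning the upper output of the 2x2 element and 1 the lower. The sketch routes one cell at a time and ignores contention, so it is illustrative only:

    def route(n_bits, src, dst):
        """Trace a cell through an omega network of N = 2**n_bits ports."""
        N = 1 << n_bits
        pos = src
        for stage in range(n_bits):
            pos = ((pos << 1) | (pos >> (n_bits - 1))) & (N - 1)   # perfect shuffle
            bit = (dst >> (n_bits - 1 - stage)) & 1                # routing-tag bit
            pos = (pos & ~1) | bit        # 2x2 element: 0 = upper out, 1 = lower out
            print(f"stage {stage + 1}: tag bit {bit} -> wire {pos:0{n_bits}b}")
        return pos

    assert route(3, src=2, dst=0b011) == 0b011   # reaches output 011
    assert route(3, src=7, dst=0b101) == 0b101   # reaches output 101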
Internal Blocking
Internal link blocking as well as output blocking can happen in a multistage switch. The following example illustrates internal blocking for the connections from input 0 to output 3 (011) and from input 4 to output 2 (010).
[Figure: the two connections contend for the same internal link between stages, so one of them is blocked.]
Output Blocking
The following example illustrates output blocking for the connections from input 1 to output 6 (110) and from input 3 to output 6 (110).
[Figure: both connections converge on output 110; only one cell can be delivered per cell time, so the other is blocked.]
3-stage Clos Network
[Figure: m ingress switches of size n×k, k middle switches of size m×m, and m egress switches of size k×n; every ingress and egress switch has a link to every middle switch. N = n×m total ports, and k >= n.]
Clos-network Properties
Expansion factors (here m denotes the number of middle-stage switches):
• Strictly nonblocking iff m >= 2n - 1
  – Complexity O(N^(3/2)): the first nonblocking switch discovered with complexity less than O(N^2)
• Rearrangeably nonblocking iff m >= n
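A sketch of these conditions and of the crosspoint count that gives the O(N^(3/2)) result, for a symmetric three-stage Clos with N ports, n ports per edge switch, and m middle switches (the particular N and n below are illustrative):

    def clos_crosspoints(N, n, m):
        """Crosspoints: r edge switches of n x m, m middle switches of r x r,
        and r output switches of m x n, where r = N / n."""
        r = N // n
        return r * n * m + m * r * r + r * m * n

    def strictly_nonblocking(n, m):        # Clos' condition
        return m >= 2 * n - 1

    def rearrangeably_nonblocking(n, m):
        return m >= n

    N, n = 1024, 32
    m = 2 * n - 1                          # smallest strictly nonblocking choice
    print(strictly_nonblocking(n, m), rearrangeably_nonblocking(n, m))   # True True
    print(f"Clos crosspoints: {clos_crosspoints(N, n, m):,}")            # 193,536
    print(f"Full crossbar:    {N * N:,}")                                # 1,048,576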
3-stage Fabrics (Basic Building Block – a Crossbar)
Clos Network
[Figure: a three-stage Clos network built from crossbar chips.]
3-Stage Fabrics
Clos Network
Expansion factor required = 2-1/N (but still blocking for multicast)
4-Port Clos Network
Strictly Non-blocking
[Figure: a strictly nonblocking 4-port Clos network built from 2x3 ingress switches, three 2x2 middle switches, and 3x2 egress switches.]
Construction Example
• Switch size: 1024x1024
• Construction modules:
  – Input stage: thirty-two 32x48 switches
  – Central stage: forty-eight 48x48 switches
  – Output stage: thirty-two 48x32 switches
  – Expansion: 48/32 = 1.5
[Figure: ports 1-1024 enter the 32 input switches, fan out across the 48 central switches, and are collected by the 32 output switches.]
Lucent Architecture
Buffers
MSM Architecture
Cisco's 46 Tb/s Switch System
[Figure: up to 72 line card chassis (LCC) and 8 fabric card chassis (FCC); each of the 1152 line cards (40G) connects over 12.5G links to S1/S3 stages (18x18) in the LCCs and S2 stages (72x72) in the FCCs.]
• 80 chassis in total
• 8 switch planes
• Speedup of 2.5
• 1152 LICs
• 1296x1296 switch fabric
• 3-stage Benes switch
• Multicast inside the switch
• 1:N fabric redundancy
• 40 Gb/s packet processor (188 RISCs)
Massively Parallel Switches
• Instead of using tightly coupled fabrics like a crossbar or a bus,
they use massively parallel interconnects such as hypercube, 2D
torus, and 3D torus.
• Few companies use this design architecture for their core routers
• These fabrics are generally scalable
• However:
– It is very difficult to guarantee QoS and to include value-added
functionalities (e.g., multicast, fair bandwidth allocation)
– They consume a lot of power
– They are relatively costly
Massively Parallel Switches
3D Switching Fabric: Avici
• Three components:
  – Topology: 3D torus
  – Routing: source routing with randomization
  – Flow control: virtual channels and virtual networks
• Maximum configuration: 14 x 8 x 5 = 560
• Channel speed is 10 Gb/s
Packaging
• Uniformly short wires between
adjacent nodes
– Can be built in passive backplanes
– Run at high speed
Figures are from Scalable Switching Fabrics for Internet Routers, by W. J. Dally (can be found at www.avici.com)
Avici: Velociti™ Switch Fabric
• Toroidal direct connect fabric (3D Torus)
• Scales to 560 active modules
• Each element adds switching & forwarding
capacity
• Each module connects to
6 other modules
Switch Fabric Chips Comparison
http://www.lightreading.com/document.asp?doc_id=47959