Department of Computer and IT Engineering
University of Kurdistan
Computer Networks II
Router Architecture
By: Dr. Alireza Abdollahpouri
What is Routing and Forwarding?
[Figure: example topology with routers R1–R5 interconnecting hosts A–F]
Introduction
History …
And future trends!
What a Router Looks Like
Cisco GSR 12416: capacity 160 Gb/s, power 4.2 kW, about 6 ft x 19" x 2 ft
Juniper M160: capacity 80 Gb/s, power 2.6 kW, about 3 ft x 19" x 2.5 ft
Packet Processing Functions
Basic network system functionality
Address lookup
Packet forwarding and routing
Fragmentation and re-assembly
Security
Queuing
Scheduling
Packet classification
Traffic measurement
…
Per-packet Processing in a Router
1. Accept packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement TTL, update the header checksum (a sketch follows below).
4. Send the packet to the outgoing interface(s).
5. Queue until the line is free.
6. Transmit the packet onto the outgoing line.
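Step 3 is a good example of the per-packet hot path. Below is a minimal C sketch of it: decrementing the TTL and patching the IPv4 header checksum incrementally (in the style of RFC 1141/1624) rather than recomputing it. The struct layout and function name are illustrative, not taken from the slides.

```c
/* Hedged sketch of per-packet header manipulation (step 3 above). */
#include <arpa/inet.h>
#include <stdint.h>

struct ipv4_hdr {               /* illustrative layout, not a full parser */
    uint8_t  version_ihl;
    uint8_t  tos;
    uint16_t total_len;
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t checksum;          /* network byte order on the wire */
    uint32_t saddr;
    uint32_t daddr;
};

/* Returns 0 on success, -1 if the packet must be dropped (TTL expired,
 * where a real router would also send an ICMP Time Exceeded). */
int decrement_ttl(struct ipv4_hdr *ip)
{
    if (ip->ttl <= 1)
        return -1;

    ip->ttl--;

    /* Incremental checksum update: TTL is the high byte of its 16-bit
     * word, so decrementing it by 1 adds 0x0100 to the stored checksum,
     * with a one's-complement end-around carry fold. */
    uint32_t sum = (uint32_t)ntohs(ip->checksum) + 0x0100u;
    sum = (sum & 0xFFFFu) + (sum >> 16);
    ip->checksum = htons((uint16_t)sum);
    return 0;
}
```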
Basic Architecture of a Router
Control plane: how routing protocols establish routes, etc. May be slow; "typically in software".
Routing:
- Routing table update (OSPF, RIP, IS-IS)
- Admission control
- Congestion control
- Reservation
Data plane (per-packet processing): how packets get forwarded. Must be fast; "typically in hardware".
Switching:
• Routing lookup
• Packet classifier
• Arbitration
• Scheduling
Generic Router Architecture
[Figure: parallel line-card paths; on each card the packet header goes through header processing (IP address lookup in the address table, then header update), and the packet is handed to a buffer manager backed by buffer memory before leaving the card]
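As a concrete illustration of the "lookup IP address" box above, here is a minimal longest-prefix-match sketch in C. The table contents, names, and the linear scan are illustrative only; production line cards use tries, compressed tries, or TCAMs.

```c
/* Hedged sketch of the address-lookup stage: linear longest-prefix match. */
#include <stdint.h>
#include <stdio.h>

struct fib_entry {
    uint32_t prefix;      /* network address, host byte order */
    uint8_t  prefix_len;  /* 0..32 */
    int      out_port;    /* egress interface */
};

int lpm_lookup(const struct fib_entry *fib, int n, uint32_t dst)
{
    int best_port = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = fib[i].prefix_len ? ~0u << (32 - fib[i].prefix_len) : 0;
        if ((dst & mask) == fib[i].prefix && fib[i].prefix_len > best_len) {
            best_len  = fib[i].prefix_len;
            best_port = fib[i].out_port;
        }
    }
    return best_port;     /* -1 means no route (drop or default) */
}

int main(void)
{
    /* Hypothetical table: 10.0.0.0/8 -> port 1, 10.1.0.0/16 -> port 2, default -> port 0 */
    struct fib_entry fib[] = {
        { 0x0A000000u,  8, 1 },
        { 0x0A010000u, 16, 2 },
        { 0x00000000u,  0, 0 },
    };
    uint32_t dst = 0x0A010203u;   /* 10.1.2.3 */
    printf("10.1.2.3 -> port %d\n", lpm_lookup(fib, 3, dst));  /* longest match is /16: port 2 */
    return 0;
}
```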
Functions in a Packet Switch
[Figure: data path, control path, and scheduling path through a packet switch]
Ingress line card: framing, route lookup, TTL processing, buffering
Interconnect: fabric (interconnect) scheduling
Egress line card: buffering, QoS scheduling, framing
Control plane
Memory is usually used in several ways (DRAM for packet buffers, SRAM for queues and tables)
Line Card Picture
[Figure: photograph of a router line card]
Major Components of Routers: Interconnect
The interconnect joins the input ports to the output ports; there are three basic designs: bus, shared memory, and crossbar.
Bus: all input ports transfer data over a shared bus.
Problem: the shared bus often becomes a point of congestion.
Shared memory: input ports write packets into a shared memory; after the destination lookup, the output port reads them back out.
Problem: requires very fast memory read/write and memory-management technology.
Crossbar: each of the N input ports has a dedicated data path to each of the N output ports, resulting in an N x N switching matrix.
Problem: blocking (input, output, and head-of-line (HOL) blocking). The maximum switch load for random traffic is only about 59% (a small simulation of this limit appears a few slides below).
Interconnects: Two basic techniques
Input Queueing
Output Queueing
Usually a non-blocking switch fabric (e.g., crossbar)
How an OQ Switch Works
Output Queued (OQ) Switch
Input Queueing: Head of Line Blocking
[Figure: delay vs. load for FIFO input queueing; delay blows up at about 58.6% load, well short of 100%]
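The ~58.6% saturation throughput in the figure can be reproduced with a toy simulation. This is a hedged sketch under saturated, uniform random traffic (the switch size and slot count are arbitrary choices), not the analytical derivation; for large N the measured value approaches 2 - sqrt(2) ≈ 0.586.

```c
/* Toy simulation of head-of-line (HOL) blocking in a FIFO input-queued
 * switch. Each time slot, every output serves one randomly chosen
 * contending HOL packet; inputs whose HOL packet loses stay blocked. */
#include <stdio.h>
#include <stdlib.h>

#define N     32          /* switch size: N inputs, N outputs */
#define SLOTS 200000      /* simulated time slots */

int main(void)
{
    int hol[N], perm[N];
    long served = 0;
    srand(1);

    for (int i = 0; i < N; i++) {
        hol[i]  = rand() % N;   /* saturated inputs: always a packet at the head */
        perm[i] = i;
    }

    for (int t = 0; t < SLOTS; t++) {
        int taken[N];
        for (int o = 0; o < N; o++) taken[o] = 0;

        /* Visit inputs in a random order so each output grants a
         * uniformly chosen contender (Fisher-Yates shuffle). */
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (int k = 0; k < N; k++) {
            int in = perm[k];
            if (!taken[hol[in]]) {            /* this output is still free */
                taken[hol[in]] = 1;
                served++;
                hol[in] = rand() % N;         /* served: reveal the next packet */
            }                                 /* else: HOL blocked, must wait */
        }
    }
    printf("throughput per port: %.3f\n", (double)served / ((double)N * SLOTS));
    return 0;
}
```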
Head of Line Blocking
[Figure: head-of-line blocking example, built up over three slides]
Virtual Output Queues (VoQ)
At each input port there are N queues, one associated with each output port
Only one packet can leave an input port at a time
Only one packet can be received by an output port at a time
VoQs retain the scalability of FIFO input-queued switches
VoQs eliminate the HOL blocking problem of FIFO input queues (a data-structure sketch follows below)
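A minimal sketch of the VoQ data structure at one input port. The struct and function names are illustrative, and the fabric arbiter that decides which input/output pairs are matched in each time slot is left out.

```c
/* Hedged sketch of virtual output queues: each input port keeps one
 * FIFO per output port, so a packet waiting for a busy output no longer
 * blocks packets behind it that are headed to idle outputs. */
#include <stdlib.h>

#define NPORTS 4                  /* N x N switch, for illustration */

struct pkt {
    struct pkt *next;
    int out_port;
    /* ... payload/descriptor fields ... */
};

struct fifo {
    struct pkt *head, *tail;
};

struct voq_port {
    struct fifo voq[NPORTS];      /* one queue per output port */
};

void voq_enqueue(struct voq_port *in, struct pkt *p)
{
    struct fifo *q = &in->voq[p->out_port];   /* classify by destination output */
    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
}

/* Called once the arbiter has matched this input to output port `out`;
 * returns the packet to transfer across the fabric, or NULL. */
struct pkt *voq_dequeue(struct voq_port *in, int out)
{
    struct fifo *q = &in->voq[out];
    struct pkt *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head) q->tail = NULL;
    }
    return p;
}
```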
Input Queueing: Virtual Output Queues
[Figure: an input-queued switch with one virtual output queue per output at each input]
[Figure: delay vs. load with virtual output queues; throughput can now approach 100% of line rate]
The Evolution of Router Architecture
[Figure: timeline from first-generation routers to modern routers]
First Generation Routers
Bus-based router architecture with a single processor
[Figure: a CPU with the route table and buffer memory sits on a shared backplane; the line interfaces (MACs) attach to the same shared bus]
First Generation Routers
Based on a software implementation on a single CPU.
Limitations:
Serious processing bottleneck at the central processor
Memory-intensive operations (e.g., table lookup and data movement) limit the effectiveness of the processor's power
Second Generation Routers
Bus-based router architecture with multiple processors
[Figure: the central CPU, route table, and buffer memory still sit on the shared bus, but each line card now has its own forwarding cache, buffer memory, and MAC]
Second Generation Routers
Architectures with route caching
Packet forwarding operations are distributed to:
Network interface cards
Processors
Route caches
Packets are transmitted only once over the shared bus
Limitations:
The central routing table is a bottleneck at high speeds
Throughput is traffic dependent (cache hit rate; see the sketch below)
The shared bus is still a bottleneck
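A minimal sketch of the route-cache idea, and of why throughput becomes traffic dependent: a tiny direct-mapped cache of recently seen destinations sits in front of a slow-path lookup. The cache size, hash, and function names are illustrative, and the slow path is stubbed out.

```c
/* Hedged sketch of route caching on a line card: a small direct-mapped
 * cache of recent destination -> port mappings, falling back to the
 * central routing table (the slow path over the shared bus) on a miss. */
#include <stdint.h>

#define CACHE_SLOTS 256                 /* small, per-line-card cache */

struct cache_entry {
    uint32_t dst;                       /* destination address (exact match) */
    int      out_port;
    int      valid;
};

static struct cache_entry route_cache[CACHE_SLOTS];

/* Placeholder: a real second-generation router does a full longest-prefix
 * match in the central route table, crossing the shared bus. */
static int slow_path_lookup(uint32_t dst)
{
    (void)dst;
    return 0;
}

int cached_lookup(uint32_t dst)
{
    unsigned slot = (dst * 2654435761u) >> 24;      /* simple hash into 256 slots */
    struct cache_entry *e = &route_cache[slot % CACHE_SLOTS];

    if (e->valid && e->dst == dst)
        return e->out_port;                         /* fast path: cache hit */

    int port = slow_path_lookup(dst);               /* slow path: ask the central CPU */
    e->dst = dst; e->out_port = port; e->valid = 1;
    return port;
}
```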
Third Generation Routers
Switch-based architecture with fully distributed processors
[Figure: line cards and a CPU card attach to a switched backplane; each line card holds a local forwarding table, local buffer memory, and MAC, while the CPU card keeps the full routing table]
Third Generation Routers
To avoid bottlenecks in:
Processing power
Memory bandwidth
Internal bus bandwidth
Each network interface is equipped with appropriate processing power and buffer space.
Data vs. control plane:
• Data plane: line cards
• Control plane: central processor
Fourth Generation Routers/Switches
Optics inside a router for the first time
[Figure: line cards connected to a separate switch core over optical links hundreds of metres long]
0.3–10 Tb/s routers in development
Demand for More Powerful Routers
Do we still need higher processing power in networking devices?
Of course, YES
But why? And how?
Demands for Faster Routers (why?)
Beyond Moore's law:
[Figure: growth, 1975–2005, on a 1x to 10^7x scale; link bandwidth doubles every year, CPU performance doubles every two years, memory latency improves only about 10% per year]
Packet inter-arrival time at 40 Gb/s: big packet about 300 ns, small packet about 12 ns (a worked cycle-budget example follows below)
[Figure: processing complexity; Layer 2 switching and IPv4 routing take hundreds of instructions per packet, while flow classification, encryption, and intrusion detection take thousands of instructions per packet]
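To make the gap concrete, here is a small worked calculation; the 3 GHz clock rate is an assumed figure, not from the slides. At 40 Gb/s a minimum-size packet arrives roughly every 12.8 ns, which leaves only a few tens of CPU cycles per packet.

```c
/* Worked example: per-packet time and cycle budget at 40 Gb/s. */
#include <stdio.h>

int main(void)
{
    double link_bps = 40e9;             /* 40 Gb/s link */
    double cpu_hz   = 3e9;              /* assumed 3 GHz core */
    int    sizes[]  = { 64, 1500 };     /* small and big packets, in bytes */

    for (int i = 0; i < 2; i++) {
        double bits         = sizes[i] * 8.0;
        double interarrival = bits / link_bps;        /* seconds between packets */
        double cycles       = interarrival * cpu_hz;  /* cycle budget per packet */
        printf("%4d-byte packet: %6.1f ns between packets, ~%.0f cycles at 3 GHz\n",
               sizes[i], interarrival * 1e9, cycles);
    }
    /* Prints roughly: 64 bytes -> 12.8 ns, ~38 cycles; 1500 bytes -> 300 ns, ~900 cycles.
     * Hundreds to thousands of instructions per packet clearly do not fit. */
    return 0;
}
```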
Demands for Faster Routers (why?)
Future applications will demand TIPS (tera-instructions per second)
Power? Heat?
Demands for Faster Routers (summary)
Technology push:
- Link bandwidth is scaling much faster than CPU and memory technology
- Transistor scaling and VLSI technology help, but not enough
Application pull:
- More complex applications are required
- Processing complexity is defined as the number of instructions and the number of memory accesses needed to process one packet
Demands for Faster Routers (how?)
"Future applications will demand TIPS"
"Think platform beyond a single processor"
"Exploit concurrency at multiple levels"
"Power will be the limiter due to complexity and leakage"
Distribute the workload over multiple cores
Multi-Core Processors
Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and power consumption than single-core processors
Asymmetric multi-processors consume power and provide extra computational power only on demand
Performance Bottlenecks
Memory
Bandwidth is available, but access time is too slow
Increasing delay for off-chip memory
I/O
High-speed interfaces are available
Cost problem with optical interfaces
Internal bus
Can be solved with an effective switch, allowing simultaneous transfers between network interfaces
Processing power
Individual cores are getting more complex
Problems with access to shared resources
The control processor can become a bottleneck
Different Solutions
• ASIC
• FPGA
• NP (network processor)
• GPP (general-purpose processor)
[Figure: flexibility vs. performance; GPPs are the most flexible but slowest, NPs and FPGAs sit in between, ASICs give the highest performance with the least flexibility]
Different Solutions
[Figure: comparison of packet-processing platforms, by Niraj Shah]
“It is always something. (corollary) Good, Fast, Cheap: Pick any two (you can’t have all three).”
RFC1925, “The Twelve Networking Truths”
Why not ASIC?
High cost to develop
- Network processing is a moderate-quantity market
Long time to market
- Network processing services change quickly
Difficult to simulate
- Complex protocols
Expensive and time-consuming to change
Little reuse across products
Limited reuse across versions
No consensus on frameworks or supporting chips
Requires expertise
Network Processors
• Introduced several years ago (1999+)
• A way to introduce flexibility and programmability into network processing
• Many players entered the market (Intel, Motorola, IBM)
• Only a few are still there
Intel IXP 2800
Initial release August 2003
What Was Correct With NPs?
CPU-level flexibility
– A giant step forward compared to ASICs
How?
– Hardware coprocessors
– Memory hierarchies
– Multiple hardware threads (zero context-switching overhead)
– Narrow (and multiple) memory buses
– Other ad-hoc solutions for network processing, e.g., a fast switching fabric, memory accesses, etc.
What Was Wrong With NPs?
Programmability issues
– Completely new programming paradigm
– Developers were not familiar with the unprecedented parallelism of the NPU and did not know how best to exploit it
– New (proprietary) languages
– Lack of portability across different network processor families
What Happened in the NP Market?
Intel left the market in 2007
Many other small players disappeared
Selecting an NP vendor that may disappear is a high risk
Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.
RFC1925, “The Twelve Networking Truths”
Software Routers
Processing in general-purpose CPUs
CPUs are optimized for a few threads with high performance per thread:
– High CPU frequencies
– Maximized instruction-level parallelism
• Pipelining
• Superscalar execution
• Out-of-order execution
• Branch prediction
• Speculative loads
Software Routers
Aim: low cost, flexibility, and extensibility
Linux on a PC with a bunch of NICs (a minimal forwarding-loop sketch follows below)
Changing functionality is as simple as a software upgrade
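For flavour, here is a minimal sketch of such a software data path on Linux using AF_PACKET raw sockets. The interface names eth0/eth1 are assumptions, header processing is omitted, and real software routers (Click, RouteBricks, PacketShader) use batched or kernel-bypass I/O instead of this per-packet system-call loop.

```c
/* Hedged sketch: read a frame from one NIC and write it out another. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
    /* Raw L2 sockets: receive on the assumed ingress NIC, send on the egress NIC. */
    int rx = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    int tx = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (rx < 0 || tx < 0) { perror("socket (needs root)"); return 1; }

    struct sockaddr_ll in  = { .sll_family = AF_PACKET,
                               .sll_protocol = htons(ETH_P_ALL),
                               .sll_ifindex = if_nametoindex("eth0") };
    struct sockaddr_ll out = { .sll_family = AF_PACKET,
                               .sll_protocol = htons(ETH_P_ALL),
                               .sll_ifindex = if_nametoindex("eth1") };
    if (bind(rx, (struct sockaddr *)&in, sizeof(in)) < 0) { perror("bind"); return 1; }

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = recv(rx, frame, sizeof(frame), 0);
        if (n <= 0) continue;
        /* A real router would parse the IP header here, do the FIB lookup,
         * decrement the TTL, and rewrite the Ethernet addresses. */
        if (sendto(tx, frame, (size_t)n, 0, (struct sockaddr *)&out, sizeof(out)) < 0)
            perror("sendto");
    }
}
```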
Software Routers (examples)
• RouteBricks [SOSP’09]
Uses the Intel Nehalem architecture
• PacketShader [SIGCOMM’10]
GPU-accelerated
Developed at KAIST, Korea
Intel Nehalem Architecture
[Figure: quad-core die with cores C0–C3 sharing a common L3 cache]
Intel Nehalem Architecture
NUMA architecture: the latency to access local memory is approximately 65 nanoseconds; the latency to access remote memory is approximately 105 nanoseconds
Three DDR3 channels to local DRAM support a bandwidth of 31.992 GB/s
The bandwidth of the QPI link is 12.8 GB/s
Intel Nehalem Architecture
[Figure: Nehalem quad-core block diagram; four cores, each with private L1-I/L1-D and L2 caches, a shared L3 cache, a power and clock unit, two QPI links, and an integrated memory controller (IMC) with three DRAM channels. A QPI link connects to the I/O controller hub, which attaches the PCI bus and slots, the network card, and the disk; the application, file system, and communication system are mapped onto the cores and DRAM regions]
Other Possible Platforms
Intel Westmere-EP
Intel Jasper Forest
Workload Partitioning (parallelization)
Parallel
Pipeline
Hybrid
Questions!