Department of Computer and IT Engineering
University of Kurdistan
Computer Networks II
Router Architecture
By: Dr. Alireza Abdollahpouri
What is Routing and Forwarding?
[Figure: example topology with routers R1–R5 interconnecting hosts A–F]
Introduction
History …
And future trends!
What a Router Looks Like
Cisco GSR 12416: capacity 160 Gb/s, power 4.2 kW, about 6 ft x 19" x 2 ft
Juniper M160: capacity 80 Gb/s, power 2.6 kW, about 3 ft x 19" x 2.5 ft
Packet Processing Functions
Basic network system functionality
Address lookup
Packet forwarding and routing
Fragmentation and re-assembly
Security
Queuing
Scheduling
Packet classification
Traffic measurement
…
Per-packet Processing in a Router
1. Accept packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement TTL, update the header checksum (a sketch follows below).
4. Send the packet to the outgoing interface(s).
5. Queue until the line is free.
6. Transmit the packet onto the outgoing line.
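Step 3 is a good example of the per-packet hot path. Below is a minimal C sketch of it: decrementing the TTL and patching the IPv4 header checksum incrementally (in the style of RFC 1141/1624) rather than recomputing it. The struct layout and function name are illustrative, not taken from the slides.

```c
/* Hedged sketch of per-packet header manipulation (step 3 above). */
#include <arpa/inet.h>
#include <stdint.h>

struct ipv4_hdr {               /* illustrative layout, not a full parser */
    uint8_t  version_ihl;
    uint8_t  tos;
    uint16_t total_len;
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t checksum;          /* network byte order on the wire */
    uint32_t saddr;
    uint32_t daddr;
};

/* Returns 0 on success, -1 if the packet must be dropped (TTL expired,
 * where a real router would also send an ICMP Time Exceeded). */
int decrement_ttl(struct ipv4_hdr *ip)
{
    if (ip->ttl <= 1)
        return -1;

    ip->ttl--;

    /* Incremental checksum update: TTL is the high byte of its 16-bit
     * word, so decrementing it by 1 adds 0x0100 to the stored checksum,
     * with a one's-complement end-around carry fold. */
    uint32_t sum = (uint32_t)ntohs(ip->checksum) + 0x0100u;
    sum = (sum & 0xFFFFu) + (sum >> 16);
    ip->checksum = htons((uint16_t)sum);
    return 0;
}
```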
Basic Architecture of a Router
Control plane: how routing protocols establish routes, etc. May be slow; "typically in software".
Routing:
- Routing table update (OSPF, RIP, IS-IS)
- Admission control
- Congestion control
- Reservation
Data plane (per-packet processing): how packets get forwarded. Must be fast; "typically in hardware".
Switching:
• Routing lookup
• Packet classifier
• Arbitration
• Scheduling
Generic Router Architecture
[Figure: parallel line-card paths; on each card the packet header goes through header processing (IP address lookup in the address table, then header update), and the packet is handed to a buffer manager backed by buffer memory before leaving the card]
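As a concrete illustration of the "lookup IP address" box above, here is a minimal longest-prefix-match sketch in C. The table contents, names, and the linear scan are illustrative only; production line cards use tries, compressed tries, or TCAMs.

```c
/* Hedged sketch of the address-lookup stage: linear longest-prefix match. */
#include <stdint.h>
#include <stdio.h>

struct fib_entry {
    uint32_t prefix;      /* network address, host byte order */
    uint8_t  prefix_len;  /* 0..32 */
    int      out_port;    /* egress interface */
};

int lpm_lookup(const struct fib_entry *fib, int n, uint32_t dst)
{
    int best_port = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = fib[i].prefix_len ? ~0u << (32 - fib[i].prefix_len) : 0;
        if ((dst & mask) == fib[i].prefix && fib[i].prefix_len > best_len) {
            best_len  = fib[i].prefix_len;
            best_port = fib[i].out_port;
        }
    }
    return best_port;     /* -1 means no route (drop or default) */
}

int main(void)
{
    /* Hypothetical table: 10.0.0.0/8 -> port 1, 10.1.0.0/16 -> port 2, default -> port 0 */
    struct fib_entry fib[] = {
        { 0x0A000000u,  8, 1 },
        { 0x0A010000u, 16, 2 },
        { 0x00000000u,  0, 0 },
    };
    uint32_t dst = 0x0A010203u;   /* 10.1.2.3 */
    printf("10.1.2.3 -> port %d\n", lpm_lookup(fib, 3, dst));  /* longest match is /16: port 2 */
    return 0;
}
```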
Functions in a Packet Switch
[Figure: data path, control path, and scheduling path through a packet switch]
Ingress line card: framing, route lookup, TTL processing, buffering
Interconnect: fabric (interconnect) scheduling
Egress line card: buffering, QoS scheduling, framing
Control plane
Memory is usually used in several ways (DRAM for packet buffers, SRAM for queues and tables)
Line Card Picture
[Figure: photograph of a router line card]
Major Components of Routers: Interconnect
The interconnect joins the input ports to the output ports; there are three basic designs: bus, shared memory, and crossbar.
Bus: all input ports transfer data over a shared bus.
Problem: the shared bus often becomes a point of congestion.
Shared memory: input ports write packets into a shared memory; after the destination lookup, the output port reads them back out.
Problem: requires very fast memory read/write and memory-management technology.
Crossbar: each of the N input ports has a dedicated data path to each of the N output ports, resulting in an N x N switching matrix.
Problem: blocking (input, output, and head-of-line (HOL) blocking). The maximum switch load for random traffic is only about 59% (a small simulation of this limit appears a few slides below).
Interconnects: Two basic techniques
Input Queueing
Output Queueing
Usually a non-blocking switch fabric (e.g., crossbar)
How an OQ Switch Works
Output Queued (OQ) Switch
Input Queueing: Head of Line Blocking
[Figure: delay vs. load for FIFO input queueing; delay blows up at about 58.6% load, well short of 100%]
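The ~58.6% saturation throughput in the figure can be reproduced with a toy simulation. This is a hedged sketch under saturated, uniform random traffic (the switch size and slot count are arbitrary choices), not the analytical derivation; for large N the measured value approaches 2 - sqrt(2) ≈ 0.586.

```c
/* Toy simulation of head-of-line (HOL) blocking in a FIFO input-queued
 * switch. Each time slot, every output serves one randomly chosen
 * contending HOL packet; inputs whose HOL packet loses stay blocked. */
#include <stdio.h>
#include <stdlib.h>

#define N     32          /* switch size: N inputs, N outputs */
#define SLOTS 200000      /* simulated time slots */

int main(void)
{
    int hol[N], perm[N];
    long served = 0;
    srand(1);

    for (int i = 0; i < N; i++) {
        hol[i]  = rand() % N;   /* saturated inputs: always a packet at the head */
        perm[i] = i;
    }

    for (int t = 0; t < SLOTS; t++) {
        int taken[N];
        for (int o = 0; o < N; o++) taken[o] = 0;

        /* Visit inputs in a random order so each output grants a
         * uniformly chosen contender (Fisher-Yates shuffle). */
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (int k = 0; k < N; k++) {
            int in = perm[k];
            if (!taken[hol[in]]) {            /* this output is still free */
                taken[hol[in]] = 1;
                served++;
                hol[in] = rand() % N;         /* served: reveal the next packet */
            }                                 /* else: HOL blocked, must wait */
        }
    }
    printf("throughput per port: %.3f\n", (double)served / ((double)N * SLOTS));
    return 0;
}
```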
Head of Line Blocking
[Figure: head-of-line blocking example, built up over three slides]
Virtual Output Queues (VoQ)
At each input port there are N queues, one associated with each output port
Only one packet can leave an input port at a time
Only one packet can be received by an output port at a time
VoQs retain the scalability of FIFO input-queued switches
VoQs eliminate the HOL blocking problem of FIFO input queues (a data-structure sketch follows below)
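A minimal sketch of the VoQ data structure at one input port. The struct and function names are illustrative, and the fabric arbiter that decides which input/output pairs are matched in each time slot is left out.

```c
/* Hedged sketch of virtual output queues: each input port keeps one
 * FIFO per output port, so a packet waiting for a busy output no longer
 * blocks packets behind it that are headed to idle outputs. */
#include <stdlib.h>

#define NPORTS 4                  /* N x N switch, for illustration */

struct pkt {
    struct pkt *next;
    int out_port;
    /* ... payload/descriptor fields ... */
};

struct fifo {
    struct pkt *head, *tail;
};

struct voq_port {
    struct fifo voq[NPORTS];      /* one queue per output port */
};

void voq_enqueue(struct voq_port *in, struct pkt *p)
{
    struct fifo *q = &in->voq[p->out_port];   /* classify by destination output */
    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
}

/* Called once the arbiter has matched this input to output port `out`;
 * returns the packet to transfer across the fabric, or NULL. */
struct pkt *voq_dequeue(struct voq_port *in, int out)
{
    struct fifo *q = &in->voq[out];
    struct pkt *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head) q->tail = NULL;
    }
    return p;
}
```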
Input Queueing: Virtual Output Queues
[Figure: an input-queued switch with one virtual output queue per output at each input]
[Figure: delay vs. load with virtual output queues; throughput can now approach 100% of line rate]
The Evolution of Router Architecture
[Figure: timeline from first-generation routers to modern routers]
First Generation Routers
Bus-based router architecture with a single processor
[Figure: a CPU with the route table and buffer memory sits on a shared backplane; the line interfaces (MACs) attach to the same shared bus]
First Generation Routers
Based on a software implementation on a single CPU.
Limitations:
Serious processing bottleneck at the central processor
Memory-intensive operations (e.g., table lookup and data movement) limit the effectiveness of the processor's power
Second Generation Routers
Bus-based router architecture with multiple processors
[Figure: the central CPU, route table, and buffer memory still sit on the shared bus, but each line card now has its own forwarding cache, buffer memory, and MAC]
Second Generation Routers
Architectures with route caching
Packet forwarding operations are distributed to:
Network interface cards
Processors
Route caches
Packets are transmitted only once over the shared bus
Limitations:
The central routing table is a bottleneck at high speeds
Throughput is traffic dependent (cache hit rate; see the sketch below)
The shared bus is still a bottleneck
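A minimal sketch of the route-cache idea, and of why throughput becomes traffic dependent: a tiny direct-mapped cache of recently seen destinations sits in front of a slow-path lookup. The cache size, hash, and function names are illustrative, and the slow path is stubbed out.

```c
/* Hedged sketch of route caching on a line card: a small direct-mapped
 * cache of recent destination -> port mappings, falling back to the
 * central routing table (the slow path over the shared bus) on a miss. */
#include <stdint.h>

#define CACHE_SLOTS 256                 /* small, per-line-card cache */

struct cache_entry {
    uint32_t dst;                       /* destination address (exact match) */
    int      out_port;
    int      valid;
};

static struct cache_entry route_cache[CACHE_SLOTS];

/* Placeholder: a real second-generation router does a full longest-prefix
 * match in the central route table, crossing the shared bus. */
static int slow_path_lookup(uint32_t dst)
{
    (void)dst;
    return 0;
}

int cached_lookup(uint32_t dst)
{
    unsigned slot = (dst * 2654435761u) >> 24;      /* simple hash into 256 slots */
    struct cache_entry *e = &route_cache[slot % CACHE_SLOTS];

    if (e->valid && e->dst == dst)
        return e->out_port;                         /* fast path: cache hit */

    int port = slow_path_lookup(dst);               /* slow path: ask the central CPU */
    e->dst = dst; e->out_port = port; e->valid = 1;
    return port;
}
```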
Third Generation Routers
Switch-based architecture with fully distributed processors
[Figure: line cards and a CPU card attach to a switched backplane; each line card holds a local forwarding table, local buffer memory, and MAC, while the CPU card keeps the full routing table]
Third Generation Routers
To avoid bottlenecks in:
Processing power
Memory bandwidth
Internal bus bandwidth
Each network interface is equipped with appropriate processing power and buffer space.
Data vs. control plane:
• Data plane: line cards
• Control plane: central processor
Fourth Generation Routers/Switches
Optics inside a router for the first time
[Figure: line cards connected to a separate switch core over optical links hundreds of metres long]
0.3–10 Tb/s routers in development
Demand for More Powerful Routers
Do we still need higher processing power in networking devices?
Of course, YES
But why? And how?
Demands for Faster Routers (why?)
Beyond Moore's law:
[Figure: growth, 1975–2005, on a 1x to 10^7x scale; link bandwidth doubles every year, CPU performance doubles every two years, memory latency improves only about 10% per year]
Packet inter-arrival time at 40 Gb/s: big packet about 300 ns, small packet about 12 ns (a worked cycle-budget example follows below)
[Figure: processing complexity; Layer 2 switching and IPv4 routing take hundreds of instructions per packet, while flow classification, encryption, and intrusion detection take thousands of instructions per packet]
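To make the gap concrete, here is a small worked calculation; the 3 GHz clock rate is an assumed figure, not from the slides. At 40 Gb/s a minimum-size packet arrives roughly every 12.8 ns, which leaves only a few tens of CPU cycles per packet.

```c
/* Worked example: per-packet time and cycle budget at 40 Gb/s. */
#include <stdio.h>

int main(void)
{
    double link_bps = 40e9;             /* 40 Gb/s link */
    double cpu_hz   = 3e9;              /* assumed 3 GHz core */
    int    sizes[]  = { 64, 1500 };     /* small and big packets, in bytes */

    for (int i = 0; i < 2; i++) {
        double bits         = sizes[i] * 8.0;
        double interarrival = bits / link_bps;        /* seconds between packets */
        double cycles       = interarrival * cpu_hz;  /* cycle budget per packet */
        printf("%4d-byte packet: %6.1f ns between packets, ~%.0f cycles at 3 GHz\n",
               sizes[i], interarrival * 1e9, cycles);
    }
    /* Prints roughly: 64 bytes -> 12.8 ns, ~38 cycles; 1500 bytes -> 300 ns, ~900 cycles.
     * Hundreds to thousands of instructions per packet clearly do not fit. */
    return 0;
}
```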
Demands for Faster Routers (why?)
Future applications will demand TIPS (tera-instructions per second)
Power? Heat?
Demands for Faster Routers (summary)
Technology push:
- Link bandwidth is scaling much faster than CPU and memory technology
- Transistor scaling and VLSI technology help, but not enough
Application pull:
- More complex applications are required
- Processing complexity is defined as the number of instructions and the number of memory accesses needed to process one packet
Demands for Faster Routers (how?)
"Future applications will demand TIPS"
"Think platform beyond a single processor"
"Exploit concurrency at multiple levels"
"Power will be the limiter due to complexity and leakage"
Distribute the workload over multiple cores
Multi-Core Processors
Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and power consumption than single-core processors
Asymmetric multi-processors consume power and provide extra computational power only on demand
Performance Bottlenecks
Memory
Bandwidth is available, but access time is too slow
Increasing delay for off-chip memory
I/O
High-speed interfaces are available
Cost problem with optical interfaces
Internal bus
Can be solved with an effective switch, allowing simultaneous transfers between network interfaces
Processing power
Individual cores are getting more complex
Problems with access to shared resources
The control processor can become a bottleneck
Different Solutions
• ASIC
• FPGA
• NP (network processor)
• GPP (general-purpose processor)
[Figure: flexibility vs. performance; GPPs are the most flexible but slowest, NPs and FPGAs sit in between, ASICs give the highest performance with the least flexibility]
Different Solutions
[Figure: comparison of packet-processing platforms, by Niraj Shah]
“It is always something. (corollary) Good, Fast, Cheap: Pick any two (you can’t have all three).”
RFC1925, “The Twelve Networking Truths”
Why not ASIC?
High cost to develop
- Network processing is a moderate-quantity market
Long time to market
- Network processing services change quickly
Difficult to simulate
- Complex protocols
Expensive and time-consuming to change
Little reuse across products
Limited reuse across versions
No consensus on frameworks or supporting chips
Requires expertise
Network Processors
• Introduced several years ago (1999+)
• A way to introduce flexibility and programmability into network processing
• Many players entered the market (Intel, Motorola, IBM)
• Only a few are still there
Intel IXP 2800
Initial release August 2003
What Was Correct With NPs?
CPU-level flexibility
– A giant step forward compared to ASICs
How?
– Hardware coprocessors
– Memory hierarchies
– Multiple hardware threads (zero context-switching overhead)
– Narrow (and multiple) memory buses
– Other ad-hoc solutions for network processing, e.g., a fast switching fabric, memory accesses, etc.
What Was Wrong With NPs?
Programmability issues
– Completely new programming paradigm
– Developers were not familiar with the unprecedented parallelism of the NPU and did not know how best to exploit it
– New (proprietary) languages
– Lack of portability across different network processor families
What Happened in the NP Market?
Intel left the market in 2007
Many other small players disappeared
Selecting an NP vendor that may disappear is a high risk
Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.
RFC1925, “The Twelve Networking Truths”
Software Routers
Processing in general-purpose CPUs
CPUs are optimized for a few threads with high performance per thread:
– High CPU frequencies
– Maximized instruction-level parallelism
• Pipelining
• Superscalar execution
• Out-of-order execution
• Branch prediction
• Speculative loads
Software Routers
Aim: low cost, flexibility, and extensibility
Linux on a PC with a bunch of NICs (a minimal forwarding-loop sketch follows below)
Changing functionality is as simple as a software upgrade
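For flavour, here is a minimal sketch of such a software data path on Linux using AF_PACKET raw sockets. The interface names eth0/eth1 are assumptions, header processing is omitted, and real software routers (Click, RouteBricks, PacketShader) use batched or kernel-bypass I/O instead of this per-packet system-call loop.

```c
/* Hedged sketch: read a frame from one NIC and write it out another. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
    /* Raw L2 sockets: receive on the assumed ingress NIC, send on the egress NIC. */
    int rx = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    int tx = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (rx < 0 || tx < 0) { perror("socket (needs root)"); return 1; }

    struct sockaddr_ll in  = { .sll_family = AF_PACKET,
                               .sll_protocol = htons(ETH_P_ALL),
                               .sll_ifindex = if_nametoindex("eth0") };
    struct sockaddr_ll out = { .sll_family = AF_PACKET,
                               .sll_protocol = htons(ETH_P_ALL),
                               .sll_ifindex = if_nametoindex("eth1") };
    if (bind(rx, (struct sockaddr *)&in, sizeof(in)) < 0) { perror("bind"); return 1; }

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = recv(rx, frame, sizeof(frame), 0);
        if (n <= 0) continue;
        /* A real router would parse the IP header here, do the FIB lookup,
         * decrement the TTL, and rewrite the Ethernet addresses. */
        if (sendto(tx, frame, (size_t)n, 0, (struct sockaddr *)&out, sizeof(out)) < 0)
            perror("sendto");
    }
}
```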
Software Routers (examples)
• RouteBricks [SOSP’09]
Uses the Intel Nehalem architecture
• PacketShader [SIGCOMM’10]
GPU-accelerated
Developed at KAIST, Korea
Intel Nehalem Architecture
[Figure: quad-core die with cores C0–C3 sharing a common L3 cache]
Intel Nehalem Architecture
NUMA architecture: the latency to access local memory is approximately 65 nanoseconds; the latency to access remote memory is approximately 105 nanoseconds
Three DDR3 channels to local DRAM support a bandwidth of 31.992 GB/s
The bandwidth of the QPI link is 12.8 GB/s
Intel Nehalem Architecture
[Figure: Nehalem quad-core block diagram; four cores, each with private L1-I/L1-D and L2 caches, a shared L3 cache, a power and clock unit, two QPI links, and an integrated memory controller (IMC) with three DRAM channels. A QPI link connects to the I/O controller hub, which attaches the PCI bus and slots, the network card, and the disk; the application, file system, and communication system are mapped onto the cores and DRAM regions]
Other Possible Platforms
Intel Westmere-EP
Intel Jasper Forest
Workload Partitioning (parallelization)
Parallel
Pipeline
Hybrid
Questions!