VROOM: Virtual ROuters On the Move
Aditya Akella
Based on slides from Yi Wang
Virtual ROuters On the Move (VROOM)
• Key idea
– Routers should be free to roam around
• Useful for many different applications
– Simplify network maintenance
– Simplify service deployment and evolution
– Reduce power consumption
–…
• Feasible in practice
– No performance impact on data traffic
– No visible impact on routing protocols
VROOM: The Basic Idea
• Virtual routers (VRs) form a logical topology
[Figure: five virtual routers (1-5) hosted on physical routers and connected by logical links.]
VROOM: The Basic Idea
• VR migration does not affect the logical topology
[Figure: the same logical topology after migration; the virtual routers and logical links are unchanged, though one VR now runs on a different physical router.]
Outline
• Why is VROOM a good idea?
• What are the challenges?
– Or is it just technically trivial?
• How does VROOM work?
– The migration process
• Is VROOM practical?
– Prototype system
– Performance evaluation
• Where to migrate?
– The scheduling problem
• Still have questions? Feel free to ask!
The Coupling of Logical and Physical
• Today, the physical and logical configurations of a router are tightly coupled
• Physical changes break protocol adjacencies and disrupt traffic
• Logical configuration is used as a tool to reduce the disruption
– E.g., the “cost-out/cost-in” of IGP link weights (sketched below)
– Cannot eliminate the disruption
– Accounts for over 73% of network maintenance events
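A minimal sketch (not from the talk) of what cost-out does: raising the maintenance router's IGP link weights makes shortest paths reconverge around it. The topology, weights, and COSTED_OUT metric are invented for illustration; the recomputation this triggers is exactly the reconvergence VROOM avoids.

```python
import heapq

def shortest_path(graph, src, dst):
    """Plain Dijkstra; returns (cost, path) for the cheapest route."""
    pq, seen = [(0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, weight in graph[node].items():
            if nbr not in seen:
                heapq.heappush(pq, (cost + weight, nbr, path + [nbr]))
    return float("inf"), []

# Hypothetical 4-router IGP topology; B is scheduled for maintenance.
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}
print(shortest_path(graph, "A", "D"))   # (2, ['A', 'B', 'D']) -- via B

# "Cost-out": raise B's link weights so the IGP routes around it.
COSTED_OUT = 10**6
for nbr in list(graph["B"]):
    graph["B"][nbr] = graph[nbr]["B"] = COSTED_OUT
print(shortest_path(graph, "A", "D"))   # (6, ['A', 'C', 'D']) -- avoids B
```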
VROOM Separates the Logical and Physical
• Make a logical router instance migratable
among physical nodes
• All logical configurations/states remain the
same before/after the migration
– IP addresses remain the same
– Routing protocol configurations remain the same
– Routing-protocol adjacencies stay up
• No protocol (BGP/IGP) reconvergence
– Network topology stays intact
• No disruption to data traffic
Case 1: Planned Maintenance
• Today’s best practice: “cost-out/cost-in”
– Router reconfiguration & protocol reconvergence
• VROOM
– NO reconfiguration of VRs, NO reconvergence
[Figure, animated over three slides: virtual router VR-1 moves between physical routers PR-A and PR-B.]
Case 2: Service Deployment & Evolution
• Deploy a new service in a controlled “test network” first
[Figure: CE routers attached to a test network alongside the production network.]
Case 2: Service Deployment & Evolution
• Roll out the service to the production network after it
matures
• VROOM guarantees seamless service to existing customers
during the roll-out and later evolution
[Figure: test network and production network during the roll-out.]
Case 3: Power Savings
• Routers consume a lot of power
– Millions of routers in the U.S.
– Electricity bill: hundreds of millions of dollars per year
[Chart: estimated electricity consumption of U.S. routers in TWh/year: 1.1 in 2000, 2.4 in 2005, 3.9 in 2010. Source: National Technical Information Service, Department of Commerce, 2000; figures for 2005 & 2010 are projections.]
Case 3: Power Savings
• Observation: the diurnal traffic pattern
• Idea: contract and expand the physical
network according to the traffic demand
Case 3: Power Savings
Dynamically contract & expand the physical network over the course of a day (snapshots at 3PM, 9PM, and 4AM)
Virtual Router Migration: the Challenges
• Migrate an entire virtual router instance
– All control plane & data plane processes / states
• Minimize disruption
– Data plane: up to millions of packets per second
– Control plane: less stringent (with routing-message retransmission)
• Migrate links
Outline
• Why is VROOM a good idea?
• What are the challenges?
• How does VROOM work?
– The migration enablers
– The migration process
• What is to be migrated?
• How? (in order to minimize disruption)
• Is VROOM practical?
• Where to migrate?
VROOM Architecture
• Three enablers that make VR migration possible
– Router virtualization
– Control and data plane separation
– Dynamic interface binding
A Naive Migration Process
1. Freeze the virtual router
2. Copy states
3. Restart
4. Migrate links
Practically unacceptable: packet forwarding should not stop during migration
VROOM’s Migration Process
Key idea: separate the migration of the control and data planes
– No data-plane interruption
– Low control-plane interruption
1. Control-plane migration
2. Data-plane cloning
3. Link migration
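As a rough sketch, the three phases can be read as an orchestration skeleton; every function name and print statement below is illustrative, not the prototype's actual interface.

```python
def migrate_control_plane(vr, old_node, new_node):
    """Phase 1: copy the router image and memory (pre-copy, then
    stall-and-copy) so the control plane resumes on the new node."""
    print(f"[1] control plane of {vr}: {old_node} -> {new_node}")

def clone_data_plane(vr, new_node):
    """Phase 2: the migrated control plane repopulates a fresh FIB on
    the new node; the old data plane keeps forwarding meanwhile."""
    print(f"[2] data plane of {vr}: repopulated on {new_node}")

def migrate_links(vr, links):
    """Phase 3: with both data planes live, move links one at a time."""
    for link in links:
        print(f"[3] {link} of {vr} switched over")

def vroom_migrate(vr, old_node, new_node, links):
    migrate_control_plane(vr, old_node, new_node)
    clone_data_plane(vr, new_node)
    migrate_links(vr, links)

vroom_migrate("VR-1", "PR-A", "PR-B", ["link-1", "link-2"])
```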
Control-Plane Migration
Two things to be copied:
– Router image: binaries, configuration files, etc.
– Memory, in two stages:
• 1st stage: pre-copy
• 2nd stage: stall-and-copy (when the control plane is “frozen”)
[Timeline t1 to t4: (1) router-image copy, then (2) memory copy, shown as a pre-copy phase followed by a stall-and-copy phase.]
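A minimal sketch of the two-stage memory copy, assuming a hypothetical copy_page() primitive and a simulated set of pages dirtied between rounds; the round limit and stall threshold are invented.

```python
import random

def migrate_memory(pages, copy_page, dirtied_during, max_rounds=5,
                   stall_threshold=32):
    """Pre-copy rounds shrink the dirty set; stall-and-copy finishes it."""
    dirty = set(pages)                       # round 0: copy everything
    for _ in range(max_rounds):
        if len(dirty) <= stall_threshold:
            break
        for page in dirty:
            copy_page(page)
        dirty = dirtied_during()             # pages written meanwhile
    # Stall-and-copy: the control plane is frozen, so no new dirtying.
    print(f"freezing; copying final {len(dirty)} dirty pages")
    for page in dirty:
        copy_page(page)

migrate_memory(range(1024),
               copy_page=lambda p: None,
               dirtied_during=lambda: set(random.sample(range(1024), 20)))
```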
Data-Plane Cloning
Clone the data plane by repopulation:
– Copying the data-plane states is wasteful, and could be hard
– Instead, repopulate the new data plane using the migrated control plane (see the sketch below)
– The old data plane continues working during migration
[Timeline t1 to t5: (1) router-image copy, (2) memory copy, then (3) data-plane cloning.]
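A toy illustration of cloning by repopulation: instead of copying FIB state, the new FIB is rebuilt from the routes the migrated control plane already holds. The RIB layout and best-route rule are invented.

```python
rib = {  # prefix -> candidate routes as (outgoing interface, IGP metric)
    "10.0.0.0/8":     [("ge-0/0/1", 20), ("ge-0/0/2", 10)],
    "192.168.0.0/16": [("ge-0/0/3", 5)],
}

def repopulate_fib(rib):
    """Select the best route per prefix and install it into a fresh FIB."""
    fib = {}
    for prefix, routes in rib.items():
        best_iface, _ = min(routes, key=lambda r: r[1])  # lowest metric wins
        fib[prefix] = best_iface   # "install" into the new data plane
    return fib

print(repopulate_fib(rib))
# {'10.0.0.0/8': 'ge-0/0/2', '192.168.0.0/16': 'ge-0/0/3'}
```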
Remote Control Plane
The migrated control plane plays two roles:
– Acts as a “remote control plane” for the old data plane
– Populates the new data plane
[Timeline t1 to t5: after (1) router-image copy and (2) memory copy, the control plane runs on the new node, acting as a remote control plane for the old node's data plane during (3) data-plane cloning.]
Keep the Control Plane “Online”
Data-plane cloning takes time:
– Around 110 µs per FIB entry update (for a high-end router)*
– Installing 250k routes could take over 20 seconds (250,000 × 110 µs ≈ 27.5 s)
– The control plane needs connectivity during this period
– Redirect the routing messages through tunnels
*: P. Francois, et al., “Achieving sub-second IGP convergence in large IP networks,” ACM SIGCOMM CCR, 35(3), 2005.
Double Data Planes
At the end of data-plane cloning, two data planes are ready to forward traffic (i.e., “double data planes”)
[Timeline t0 to t6: (0) tunnel setup, (1) router-image copy, (2) memory copy, (3) data-plane cloning, (4) asynchronous link migration; the old and new data planes run side by side as a “double data plane” until link migration completes.]
Asynchronous Link Migration
With the double data planes, each link can be migrated independently
– Eliminates the need for a synchronization system
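A sketch of why no synchronization is needed: with both data planes forwarding, each link flips over at its own pace. The Link class and switch_to() call are hypothetical.

```python
import random
import time

class Link:
    def __init__(self, name):
        self.name, self.attached_to = name, "old-node"

    def switch_to(self, node):
        time.sleep(random.uniform(0.0, 0.01))  # each link takes its own time
        self.attached_to = node
        print(f"{self.name} now attached to {node}")

links = [Link(f"link-{i}") for i in range(3)]
for link in links:               # no barrier between iterations: traffic on
    link.switch_to("new-node")   # unmigrated links still hits the old plane
assert all(l.attached_to == "new-node" for l in links)
```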
Outline
• Why is VROOM a good idea?
• What are the challenges?
• How does VROOM work?
• Is VROOM practical?
– Prototype system
– Performance evaluation
• Where to migrate?
Prototype Implementation
PC + OpenVZ
– OpenVZ: OS-level virtualization
– Lighter-weight than full virtualization
– Supports live migration
Two prototypes
– Software-based data plane (SD): Linux kernel
– Hardware-based data plane (HD): NetFPGA
• NetFPGA: 4-port gigabit Ethernet PCI card with an FPGA
Why two prototypes?
– To validate the data-plane hypervisor design (e.g., migration between SD and HD)
The Out-of-the-Box OpenVZ Approach
Packets are forwarded inside each VE
When a VE is being migrated, packets are dropped
Control and Data Plane Separation
Move the FIBs out of the VEs
– shadowd in each VE, “pushing down” route updates
– virtd in VE0, as the “data-plane hypervisor” (see the sketch below)
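A minimal sketch of this split, with invented class and method names: shadowd runs inside each VE and relays its VR's route updates to virtd in VE0, which owns one FIB per virtual router.

```python
class Virtd:
    """The "data-plane hypervisor" in VE0: one FIB per virtual router."""
    def __init__(self):
        self.fibs = {}                        # vr_id -> {prefix: interface}

    def update(self, vr_id, prefix, interface):
        self.fibs.setdefault(vr_id, {})[prefix] = interface

class Shadowd:
    """Runs inside a VE; pushes the VR's route updates down to virtd."""
    def __init__(self, vr_id, virtd):
        self.vr_id, self.virtd = vr_id, virtd

    def on_route_update(self, prefix, interface):   # e.g. called by zebra
        self.virtd.update(self.vr_id, prefix, interface)

virtd = Virtd()
Shadowd("VR-1", virtd).on_route_update("10.0.0.0/8", "ge-0/0/1")
print(virtd.fibs)   # {'VR-1': {'10.0.0.0/8': 'ge-0/0/1'}}
```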
Dynamic Interface Binding
bindd provides two types of bindings:
– Map substrate interfaces to the right FIB
– Map substrate interfaces to the right virtual interfaces (see the sketch below)
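A sketch of the two mappings as a lookup table (all entries invented); under this view, link migration amounts to rebinding a substrate interface.

```python
bindings = {
    # substrate interface -> (FIB to forward with, virtual interface the VR sees)
    "eth0": ("VR-1", "VR-1:if0"),
    "eth1": ("VR-1", "VR-1:if1"),
    "eth2": ("VR-2", "VR-2:if0"),
}

def classify(substrate_if):
    """Return which FIB and which virtual interface a packet belongs to."""
    return bindings[substrate_if]

print(classify("eth1"))   # ('VR-1', 'VR-1:if1')

# Rebinding: VR-1's second interface moves to a different substrate port.
bindings["eth3"] = bindings.pop("eth1")
```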
Putting It All Together: Realizing Migration
1. The migration program notifies shadowd about the
completion of the control plane migration
Putting It All Together: Realizing Migration
2. shadowd requests zebra to resend all the routes, and pushes
them down to virtd
Putting It All Together: Realizing Migration
3. virtd installs routes into the new FIB, while continuing to update
the old FIB
Putting It All Together: Realizing Migration
4. virtd notifies the migration program to start link migration
after finishing populating the new FIB
5. After link migration is completed, the migration program
notifies virtd to stop updating the old FIB
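Steps 2 to 5 as a minimal sketch: during cloning, virtd installs every resent route into the new FIB while still applying updates to the old one, until link migration completes. All names are illustrative, not the prototype's actual interfaces.

```python
class DualFibVirtd:
    def __init__(self):
        self.old_fib, self.new_fib = {}, {}
        self.updating_old = True

    def install(self, prefix, interface):
        self.new_fib[prefix] = interface     # populate the cloned FIB
        if self.updating_old:                # keep the old plane fresh too
            self.old_fib[prefix] = interface

    def stop_updating_old(self):             # step 5
        self.updating_old = False

virtd = DualFibVirtd()
resent = [("10.0.0.0/8", "ge-0/0/1"), ("192.168.0.0/16", "ge-0/0/3")]
for prefix, iface in resent:   # step 2: zebra resends routes via shadowd
    virtd.install(prefix, iface)            # step 3: both FIBs updated
# step 4: asynchronous link migration happens here, and then:
virtd.stop_updating_old()                   # step 5
```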
Evaluation
Answer three questions:
– Performance of individual migration steps?
– Impact on data traffic?
– Impact on routing protocols?
Experiments on Emulab
Performance of Migration Steps
Memory copy time, with different numbers of routes (dump file sizes)
[Chart: memory copy time in seconds (0 to 6) vs. number of routes (0 to 500k), broken down into suspend + dump, copy dump file, undump + resume, and bridging setup.]
Performance of Migration Steps
FIB population time
– Grows linearly with the number of route entries
– Installing a FIB entry into NetFPGA: 7.4 microseconds
– Installing a FIB entry into the Linux kernel: 1.94 milliseconds
• FIB update time: time for virtd to install entries into the FIB
• Total time: FIB update time + time for shadowd to send routes to virtd
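A back-of-the-envelope extrapolation from these per-entry costs, assuming the reported linear growth holds up to a full table; these are naive products, not measured totals from the evaluation.

```python
ROUTES = 250_000                                     # full-table ballpark
per_entry_seconds = {"NetFPGA": 7.4e-6, "Linux kernel": 1.94e-3}
for target, cost in per_entry_seconds.items():
    print(f"{target}: ~{ROUTES * cost:.1f} s for {ROUTES:,} routes")
# NetFPGA: ~1.9 s; Linux kernel: ~485.0 s
```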
Data Plane Impact
The diamond testbed
64-byte UDP packets, round-trip traffic
Data Plane Impact
HD router with separate migration bandwidth
– No delay increase or packet loss
SD router with separate migration bandwidth
– Up to 3.7% delay increase at 5k packets/s
– Less than 0.4% delay increase at 25k packets/s
[Plot: delay of the SD router at 5k packets/s.]
The Importance of Separate Migration Bandwidth
The dumbbell testbed
250k routes in the RIB
Separate Migration Bandwidth is Important
[Plots: throughput of the migration traffic; delay increase and loss rate of the data traffic.]
Control Plane Impact
The Abilene testbed
– Assumes a backbone running MPLS
– VR5 configured as:
• Core router (running OSPF only)
• Edge router (running OSPF + BGP)
Core Router Migration
No events during migration
– Average control-plane downtime: 0.972 seconds (0.924 to 1.008 seconds in 10 runs)
– Supports a 1-second OSPF hello-interval (with a 4-second dead-interval)
• At most one hello message is missed
Core Router Migration
Events happen during migration
– Events (LSAs) introduced by flapping link VR2-VR3
– At most one LSA is missed
– Retransmission arrives 5 seconds later (the default LSA retransmission-interval)
– A smaller LSA retransmission-interval (e.g., 1 second) can be used
Edge Router Migration
255k BGP routes + OSPF
– Dump file size grows from 3.2MB to 76.0MB
– Average control-plane downtime: 3.560 seconds (3.484 to 3.594 seconds in 10 runs)
– Supports a 2-second OSPF hello-interval (with an 8-second dead-interval)
– BGP sessions stay up
– In practice, ISPs often use the default values
• 10-second hello-interval
• 40-second dead-interval
Outline
• Why is VROOM a good idea?
• What are the challenges?
• How does VROOM work?
• Is VROOM practical?
• Where to migrate?
Deciding Where To Migrate
• Physical constraints
– Latency
• E.g., NYC to Washington D.C.: 2 msec
– Link capacity
• Enough remaining capacity for extra traffic
– Platform compatibility
• Routers from different vendors
– Router capability
• E.g., number of access control lists (ACLs) supported
• Good news: these constraints limit the search
space
Two Optimization Problems
• For planned maintenance/service deployment
– Minimize path stretch
– With constraints on link capacity, platform
compatibility, router capability, etc.
• For power savings
– Maximize power savings
• With different regional electricity prices
– With constraints on path stretch, link capacity, etc.
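A sketch of the planned-maintenance variant: filter candidate physical nodes by the hard constraints, then pick the feasible one with minimum path stretch. All fields, thresholds, and numbers are invented.

```python
candidates = [
    # (node, latency_ms, spare_capacity_gbps, platform_ok, path_stretch)
    ("PR-B", 2.0, 10, True,  1.05),
    ("PR-C", 9.5, 40, True,  1.01),   # violates the latency constraint
    ("PR-D", 1.5,  1, True,  1.00),   # not enough spare capacity
    ("PR-E", 2.5, 20, False, 1.00),   # incompatible platform
]

MAX_LATENCY_MS = 3.0       # e.g. keep the VR near its old location
VR_LOAD_GBPS = 5           # extra traffic the target must absorb

feasible = [c for c in candidates
            if c[1] <= MAX_LATENCY_MS      # latency constraint
            and c[2] >= VR_LOAD_GBPS       # link-capacity constraint
            and c[3]]                      # platform compatibility
best = min(feasible, key=lambda c: c[4])   # minimize path stretch
print(best[0])   # PR-B
```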
Conclusions
• VROOM offers a useful network-management primitive
– Breaks the tight coupling between the physical and the logical
– Simplifies network management, enables new applications
• Live router migration with minimal disruption
– The data-plane hypervisor enables:
• Data-plane cloning
• Remote control plane
• Double data plane and asynchronous link migration
– No data-plane disruption
– No visible control-plane disruption