Verifying & Testing SDN Data & Control Planes: Header Space Analysis
Troubleshooting Networks is Hard!
• Forwarding state is hard to analyze!
[Figure: forwarding rules scattered across many tables and boxes throughout the network.]
Why? Look at Networks Today
[Figure: a forwarding table doing load balancing (e.g., 10* → P1, P2; 1* → P2) alongside Access Control Lists (ACLs) such as “Drop SQL” — forwarding state is spread across many mechanisms.]
• Multiple Protocols: 6000 RFCs (MPLS, GRE . . .)
• Multiple Vendors: Broadcom, Arista, Cisco, . . .
• Manual Configurations: Additional arcane programs
kept working by “masters of complexity” (Shenker)
• Crude tools: SNMP, NetFlow, TraceRoute, . . .
Simple Questions Hard to Answer Today
– Which packets from A can reach B?
– Is Group X provably isolated from Group Y?
– Is the network or the server causing poor performance? Are QoS settings to blame?
– Why is my backbone utilization poor?
– Is my load balancer distributing evenly?
– Where are there mysterious packet losses?
Network Debugging Challenges
• Forwarding state is hard to analyze!
1. Distributed across multiple tables and boxes
2. Written to the network by multiple independent writers (different protocols, network admins)
3. Presented in different formats by vendors
4. Not directly observable or controllable
• Not constructed in a way that lends itself well to checking and verification
Motivation to Do Better
• Internal: > 1 hr of customer-visible outage per quarter (P. Patel)
– Azure: 30,000 cores down for 3 hrs, L2/L3 configuration bug
– Bing: entire data center down for 8 hours, L2/L3 configuration bug
• External (2012 NANOG Network Operator Survey):
– 35% report > 25 tickets per month taking > 1 hour to resolve
– Amazon, BigDaddy, United failures (NetCore paper)
– Welsh: the vast majority of Google “production failures” are due to “bugs in configuration settings”
As we migrate to services (a ~$100B public cloud market), network failures become a debilitating cost.
Leveraging SDN for Systematically Troubleshooting Networks
• Network troubleshooting today is still a
largely ad hoc process.
• Network admins need an “automated
troubleshooting” tool which could:
• detect errant behavior
• precisely locate its cause
Can we leverage SDN layering for systematically troubleshooting networks?
(position paper by Heller et al.)
SDN Stack
• State layers hold a representation of the network’s configuration.
• Code layers implement the logic that maintains the mapping between two state layers.
[Figure: the SDN stack — state layers (Policy, Logical View, Physical View, Device State, Hardware) alternating with code layers (Apps, Network Hypervisor, Network OS, Firmware).]
Troubleshooting Workflow
Tools for Finding the Code Layer
• A: Actual Behavior ~ Policy?
• B: Device State ~ Policy? — Anteater, Header Space Analysis, VeriFlow
• C: Physical View ~ Device State? — OFRewind
• D: Device State ~ Hardware? — Automatic Test Packet Generation, Network Debugger, SOFT
• E: Logical View ~ Physical View? — Corresponding Checking
Tools for Localizing within a Code Layer
• Within the controller (control plane): OFRewind, Retrospective Causal Inference, Synoptic, NICE, VeriCon, …
• Within the switch firmware (data plane): Network Debugger, Automatic Test Packet Generation, SOFT, …
• Systematic troubleshooting: from design (policy specification & verification) to configuration analysis, runtime dynamic analysis, troubleshooting, …
• Verification vs. testing vs. debugging vs. troubleshooting, …
• Summarize existing tools and reveal missing tools
Some Existing Approaches to SDN Verification & Testing
• Specification (policy language, semantics): NetKAT, NetCore; CTL, NoD, Klee Assert
• Testing: NICE, ATPG
• Verification (e.g., reachability):
– Data plane: Anteater, VeriFlow, HSA, Atomic Predicates, First-order + Transitive Closure, local checks
– Control plane: VeriCon, Batfish
• Synthesis (e.g., forwarding): one big switch, VLAN, NetCore, NetKAT, FlowLog
SDN: Verification & Testing
• Data Plane Verification:
– Given the data plane FIBs f, verify that every packet p satisfies the designed network specs/properties
– If not, generate one counterexample, or all counterexamples
– Or provide optimizations, or help add functionality
• Data Plane Testing:
– A detour via symbolic testing
• Control Plane Verification:
– Given a configuration c, verify that for every packet & network environment the control plane generates correct data plane states meeting the desired network specs/properties
• Control Plane Testing:
• Synthesis:
– Design a control plane CP that, by construction, satisfies Φ (policy specs, e.g., reachability, no loops, waypoint traversal) for all packets p and network states s
Verification: Values and Obstacles
• Hardware (chips) — Bugs are: burned into silicon. Dealing with bugs: costly recalls. Obstacles to eradication: design complexity. Value proposition: cut time to market.
• Software (devices: PCs, phones) — Bugs are: exploitable, with workarounds. Dealing with bugs: online updates. Obstacles to eradication: code churn, legacy, false positives. Value proposition: safety/OS-critical systems, quality of the code base.
• Networks (services) — Bugs are: latent until exposed. Dealing with bugs: live-site incidents. Obstacles to eradication: topology and configuration churn. Value proposition: meet SLAs, utilize bandwidth, enable richer policies.
Header Space Analysis
Main idea: the network as a program (or rather, a function)
• Model a header as a point in a high-dimensional space, and all networking boxes as transformers of header space.
[Figure: a packet 1xx011..x1 enters a box on port 1; the box’s match/action table sends 0xx1..x1 packets to port 2, and rewrites 11xx..0x packets with 1x01xx..x1 before sending them to port 3.]
• A router is thus abstracted as a set of guarded commands; the network becomes a program, and we can use PL methods for verification, testing, and synthesis.
Header Space Analysis: Static Checking for Networks
Three goals:
• Help system administrators statically analyze production networks
• Make it easier for system administrators to guarantee isolation between sets of hosts
• Enable static analysis of networks sliced in more general ways
– e.g., not only VLANs, but slices defined by any combination of header fields (as used by FlowVisor)
Sample questions that we want to answer using the analysis:
• Can host A communicate with host B?
• What are all the packet headers from A that can reach B?
• Are there any loops in the network?
• Is slice A totally isolated from slice B?
• What happens if an entry is removed from a router?
HSA: Key Abstractions
• Header Space: a packet is a point, and a flow a region, in header space
• Network Space: the set of switches (“boxes”) with ports 1, …, P
• Network Transfer Function: a network box as a “transformer” of header space
• Topology Transfer Function: models the links that carry packets between ports
• Multihop Packet Traversal: composition of transfer functions along a path
• Slice: a slice of network space, with permissions and a slice transfer function
HSA Framework: Header Space
❑ Step 1 – Model a packet header as a point in {0,1}^L space: the Header Space.
[Figure: the first L bits of a packet (e.g., 01110011…1) viewed as a point such as 0xxxx0101xxx in {0,1}^L; the payload (“Data”) is ignored.]
HSA Framework: Network Transfer Function
❑ Step 2 – Model every networking box as a transformer of header space.
[Figure: a packet 1101..00 enters port 1; the box’s transfer function (match 0xx1..x1 → send to port 2; match 11xx..0x → rewrite with 1x01xx..x1, send to port 3) transforms it and emits 1110..00.]
Network Transfer Function: An Example
❑ E.g., the transfer function of an IPv4 router with three ports:
● 172.24.74.0  255.255.255.0 → Port 1
● 172.24.128.0 255.255.255.0 → Port 2
● 171.67.0.0   255.255.0.0   → Port 3

T(h, p) = (h, 1)  if dst_ip(h) = 172.24.74.x
          (h, 2)  if dst_ip(h) = 172.24.128.x
          (h, 3)  if dst_ip(h) = 171.67.x.x
Network Transfer Function: An Example (with TTL decrement)
❑ The same IPv4 router, now also decrementing the TTL:

T(h, p) = (dec_ttl(h), 1)  if dst_ip(h) = 172.24.74.x
          (dec_ttl(h), 2)  if dst_ip(h) = 172.24.128.x
          (dec_ttl(h), 3)  if dst_ip(h) = 171.67.x.x
Network Transfer Function: An Example (with MAC rewrite)
❑ The same IPv4 router, now also rewriting the destination MAC to that of the next hop:

T(h, p) = (rw_mac(dec_ttl(h), next_mac), 1)  if dst_ip(h) = 172.24.74.x
          (rw_mac(dec_ttl(h), next_mac), 2)  if dst_ip(h) = 172.24.128.x
          (rw_mac(dec_ttl(h), next_mac), 3)  if dst_ip(h) = 171.67.x.x
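To make the abstraction concrete, here is a minimal Python sketch of such a transfer function. Headers are plain dictionaries; the prefixes and ports mirror the example above, while the fixed next-hop MAC and the helper names are purely illustrative.

```python
from ipaddress import ip_address, ip_network

# Illustrative routing table mirroring the three-port router above.
ROUTES = [
    (ip_network("172.24.74.0/24"), 1),
    (ip_network("172.24.128.0/24"), 2),
    (ip_network("171.67.0.0/16"), 3),
]

def dec_ttl(h):
    """Return a copy of header h with the TTL decremented."""
    h2 = dict(h)
    h2["ttl"] -= 1
    return h2

def rw_mac(h, next_mac):
    """Return a copy of header h with the destination MAC rewritten to the next hop."""
    h2 = dict(h)
    h2["dst_mac"] = next_mac
    return h2

def T(h, p):
    """Transfer function: map a (header, port) pair to the (header, port) pairs it produces."""
    out = []
    for prefix, port in ROUTES:
        if ip_address(h["dst_ip"]) in prefix:
            # next_mac would come from the neighbor table; a fixed value here for illustration.
            out.append((rw_mac(dec_ttl(h), next_mac="aa:bb:cc:dd:ee:01"), port))
    return out

# A packet to 172.24.74.10 leaves on port 1 with TTL-1 and a rewritten destination MAC.
print(T({"dst_ip": "172.24.74.10", "dst_mac": "00:11:22:33:44:55", "ttl": 64}, p=2))
```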
Example Rules
❑ FWD & RW: rewrite bits 0–2 with the value 101
● (h & 000111…) | 101000…
❑ Encapsulate the packet in a 1010 header
● (h >> 4) | 1010…
❑ Decapsulate 1010xxx… packets
● (h << 4) | 000…xxxx
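A rough sketch of how these rewrite, encapsulate, and decapsulate rules act on a concrete bit string, with headers represented as L-bit Python integers. The header length and example values are illustrative; in header space the bits freed by shifting would become wildcards, shown here as zeros.

```python
L = 12  # illustrative header length in bits

def rewrite_high_bits(h, value, nbits):
    """FWD & RW: rewrite the nbits most-significant bits of h, i.e. (h & 000111...) | 101000..."""
    keep_mask = (1 << (L - nbits)) - 1              # 000111...1
    return (h & keep_mask) | (value << (L - nbits))

def encapsulate(h, tag, nbits):
    """Encapsulate: shift the header right and prepend an nbits tag, i.e. (h >> 4) | 1010..."""
    return (h >> nbits) | (tag << (L - nbits))

def decapsulate(h, nbits):
    """Decapsulate: strip the leading tag, i.e. (h << 4) | 000...xxxx (freed bits shown as 0s)."""
    return (h << nbits) & ((1 << L) - 1)

h = 0b011100110101
print(format(rewrite_high_bits(h, 0b101, 3), "012b"))   # 101100110101
print(format(encapsulate(h, 0b1010, 4), "012b"))        # 101001110011
print(format(decapsulate(0b101001110011, 4), "012b"))   # 011100110000
```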
Header Space Algebra: Properties of Transfer Functions
❑ Properties of transfer functions:
• Composable: the output of one box feeds the next, e.g. T3(T2(T1(h, p))) along a path R1 → R2 → R3.
• Invertible: T⁻¹ maps a region of the range (output) back to the headers in the domain (input) that could have produced it.
Header Space Algebra: Operations
❑ Step 3 – Develop algebraic operations to work on these spaces.
❑ Every object in header space can be described as a union of wildcard expressions.
❑ We want to perform the following set operations on wildcard expressions:
• Intersection
• Complementation
• Difference
Header Space Algebra: Operations
❑ Intersection: intersect bit by bit using the intersection table (x ∩ b = b, b ∩ b = b, 0 ∩ 1 = z).
• Example: 10xx ∩ 1x01 = 1001
• If the result contains any ‘z’, the intersection is empty.
• Example: 10xx ∩ 0xxx is empty (the first bit gives ‘z’).
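A minimal sketch of this bit-by-bit intersection, with wildcard expressions written as strings of '0', '1', and 'x' (the helper name and the examples are illustrative):

```python
# Per-bit intersection table: 'x' is a wildcard, 'z' marks an empty (contradictory) bit.
BIT_INTERSECT = {
    ("0", "0"): "0", ("1", "1"): "1",
    ("x", "0"): "0", ("0", "x"): "0",
    ("x", "1"): "1", ("1", "x"): "1",
    ("x", "x"): "x",
    ("0", "1"): "z", ("1", "0"): "z",
}

def intersect(a, b):
    """Intersect two wildcard expressions; return None if the result is empty."""
    assert len(a) == len(b)
    out = "".join(BIT_INTERSECT[(p, q)] for p, q in zip(a, b))
    return None if "z" in out else out

print(intersect("10xx", "1x01"))  # '1001'
print(intersect("10xx", "0xxx"))  # None: the first bit gives 'z', so the intersection is empty
```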
Use Case 1: Can Host A Talk to Host B?
Inject the all-wildcard header space X at A and propagate it through every box’s transfer function along all paths; the union of what arrives at B is the set of all packet headers that A can use to communicate with B.
[Figure: A → Box 1, which forwards T1(X,A) to Box 2 and Box 4; Box 3 receives T2(T1(X,A)) and T4(T1(X,A)) and delivers T3(T2(T1(X,A))) ∪ T3(T4(T1(X,A))) to B.]
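A rough sketch of this reachability computation, assuming each box is exposed as a transfer function over wildcard expressions and the topology maps output ports to next-hop (box, port) pairs; the function names and the toy one-box network are illustrative:

```python
from collections import deque

def reachable(transfer, topology, src_port, dst_port, all_x, max_hops=16):
    """Propagate the all-wildcard header space from src_port and collect every
    header-space expression that can arrive at dst_port."""
    arrived = set()
    queue = deque([(all_x, src_port, 0)])
    while queue:
        hs, port, hops = queue.popleft()
        if port == dst_port:
            arrived.add(hs)                      # headers A can use to reach B
            continue
        if hops >= max_hops or port not in topology:
            continue
        box, in_port = topology[port]            # cross the link into the next box
        for hs_out, out_port in transfer[box](hs, in_port):
            queue.append((hs_out, out_port, hops + 1))
    return arrived

# Toy network: one box that forwards headers matching 1xxx from port "A1" out of port "A2".
transfer = {"A": lambda hs, p: [("1" + hs[1:], "A2")] if p == "A1" and hs[0] in "1x" else []}
topology = {"src": ("A", "A1")}                   # the source's link lands on box A, port A1
print(reachable(transfer, topology, "src", "A2", "xxxx"))   # {'1xxx'}
```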
Use Case 2: Is There a Loop?
• Inject an all-x test packet from every switch port.
• Follow the packet until it comes back to the injection port.
[Figure: the test header T1(X,P) propagates Box 1 → Box 2 → Box 3 → Box 4; the returned header space T4(T3(T2(T1(X,P)))) is compared with the original header space at the injection port.]
Use Case 3: Is the Loop Infinite?
Compare the header space returned to the injection port with the originally injected header space: if the returned space no longer intersects the original, the loop is finite; if it maps back into itself, the loop is infinite; otherwise, repeat the test with the returned space.
Use Case 4: Are Two Slices Isolated?
❑ What do we mean by a slice?
• Fixed slices, e.g. VLAN slices
❑ Two slices are isolated if:
• their slice definitions don’t intersect, and
• packets do not leak from one slice into the other as they are transformed by the boxes.
[Figure: two slices laid over Boxes 1–4.]
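A minimal sketch of the first check (slice definitions don’t overlap), assuming each slice’s region of network space is given as a set of (port, wildcard-expression) pairs; the representation and the VLAN-tag example are illustrative, and the second check (no leakage under rewrites) is not shown.

```python
def bit_and(a, b):
    """Per-bit wildcard intersection; None if the result is empty."""
    t = {("0", "1"): "z", ("1", "0"): "z"}
    out = "".join(t.get((p, q), q if p == "x" else p) for p, q in zip(a, b))
    return None if "z" in out else out

def slices_isolated(slice_a, slice_b):
    """Slices are sets of (port, wildcard-expression) pairs; they must not share any region."""
    return not any(pa == pb and bit_and(ha, hb)
                   for pa, ha in slice_a for pb, hb in slice_b)

vlan_red  = {("sw1-p1", "0001xxxx"), ("sw2-p1", "0001xxxx")}
vlan_blue = {("sw1-p1", "0010xxxx"), ("sw2-p1", "0010xxxx")}
print(slices_isolated(vlan_red, vlan_blue))  # True: the VLAN tags never intersect
```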
HSA: Key Advantages
❑ A powerful, general foundation that gives:
● A common model for all packets – the Header Space.
● A unified view of almost all types of boxes – the Transfer Function.
● A powerful interface for answering different questions about the network:
❖ T(h,p) and T⁻¹(h,p)
❖ Set operations on Header Space
Evaluation: Stanford Campus Network
• ~750K IP forwarding rules
• ~1.5K ACL rules
• ~100 VLANs
• VLAN forwarding
Evaluation: Stanford Campus Network
❑ Loop detection test – run time < 10 minutes on a single laptop.
[Figure: the spanning trees of VLAN RED and VLAN BLUE used in the loop detection test.]
Evaluation: Stanford Campus Network
Performance on a single machine: 4 cores, 4 GB RAM.
Evaluation: Complexity Analysis
❑ Run time of the reachability and loop detection tests depends on:
• R: maximum number of rules per box
• d: diameter of the network
• P: number of ports to be tested
❑ Slice isolation test depends on:
• W: number of wildcard expressions in the definition of a slice
• N: number of slices in the network
Evaluation: Complexity Analysis
[Figure: the linear fragmentation argument — a set of wildcard expressions W1, …, WR passing through a table of rules E1: Match M1, …, ER: Match MR fragments into on the order of cR expressions (c a small constant), and c²R after two hops, rather than growing quadratically.]
Summary of HSA (Hassel): Snapshot-Based Checking
[Figure: a snapshot of the network as transfer functions TA, TB, TC, TD between hosts a and b.]
• Can host a talk to host b?
• Is there any forwarding loop in the network?
Real-Time Incremental Checking
[Figure: rules are added (+) and deleted (–) over time; the checker must keep up with this stream of changes.]
Real-Time Incremental Checking
• Check each state change (+/– rule) against a set of policies/invariants as it happens, answering yes/no.
• Prevent errors before they hit the network.
• Report a violation as soon as it happens.
NetPlumber: Runtime Policy Checking Based on HSA
• The system for real-time policy checking is called NetPlumber.
[Figure: apps and the controller sit above NetPlumber, which occupies a logically centralized location from which it observes the state changes.]
NetPlumber
• The system we built for real-time policy checking is called NetPlumber.
– It creates a dependency graph of all forwarding rules in the network and uses it to verify policies.
– Nodes: forwarding rules in the network.
– Directed edges: next-hop dependencies between rules.
[Figure: a rule R1 in Switch 1 with a directed edge to a rule R2 in Switch 2.]
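A minimal sketch of how such a rule dependency graph might be built. The Rule fields, helper names, and the two-rule example are illustrative; NetPlumber’s actual plumbing graph also tracks header rewrites and rule priorities.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    table: str      # switch/table the rule lives in
    match: str      # wildcard expression over header bits, e.g. "10xx"
    out_port: str   # port matching packets are forwarded to

def intersect(a, b):
    """Per-bit wildcard intersection; None if the result is empty."""
    empty = {("0", "1"), ("1", "0")}
    if any((p, q) in empty for p, q in zip(a, b)):
        return None
    return "".join(q if p == "x" else p for p, q in zip(a, b))

def dependency_edges(rules, links):
    """Directed edge (r1 -> r2) when packets matching r1 can cross a link into r2's
    table and also match r2 (header rewrites are ignored in this sketch)."""
    edges = []
    for r1 in rules:
        next_table = links.get(r1.out_port)
        for r2 in rules:
            if r2.table == next_table and intersect(r1.match, r2.match):
                edges.append((r1, r2))
    return edges

# Tiny example mirroring the two-switch figure: switch 1 forwards 10xx toward switch 2.
r1 = Rule("sw1", "10xx", "sw1-p2")
r2 = Rule("sw2", "1xxx", "sw2-p1")
print(dependency_edges([r1, r2], {"sw1-p2": "sw2"}))   # one edge: (r1, r2)
```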
NetPlumber: Nodes and Edges
[Figure: rules with wildcard matches (e.g. 10XX, 1001) in tables between a source S and a sink; edges connect a rule to the next-hop rules whose matches overlap its output.]
NetPlumber: Intra-Table Dependency
[Figure: within a single table, a higher-priority rule shadows part of a lower-priority rule’s match; NetPlumber tracks these intra-table dependencies so each rule carries only the flows it actually handles.]
NetPlumber: Computing Reachability
• Source node: inserts the flows to be tracked into the dependency graph.
• Probe node: checks which flows arrive and whether they satisfy a policy.
[Figure: a source node at A, flows routed along the rule dependency graph, and a probe node at B asking whether flow from A can reach B.]
NetPlumber: Computing Reachability with Updates
When a new rule is added:
1) Create directed edges to and from the new rule.
2) Route flows along the new edges.
3) Update intra-table dependencies.
[Figure: the new rule is spliced into the dependency graph between the source node at A and the probe node at B.]
NetPlumber: Checking Policy
1) Back-trace from the probe node to check whether 0010 packets go through the RED box.
2) Update the policy check when a rule is deleted.
[Figure: source node at A, probe node at B; policy: packets go through the RED box.]
Checking Policy with NetPlumber
Policy: guests (G1, G2, …) cannot access server S.
[Figure: guest source nodes G1, G2 and a probe node at server S over the rule dependency graph.]
Checking Policy with NetPlumber
Policy: HTTP traffic from client C to server S doesn’t go through more than 4 hops.
[Figure: an HTTP source node at C and a probe node at S over the rule dependency graph.]
Checking Policy with NetPlumber
Policy: traffic from client C to server S should go through middle box M.
[Figure: source node at C, middle box M, and a probe node at S over the rule dependency graph.]
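In NetPlumber, policies like the three above are attached to probe nodes as constraints on the flows (and their paths) that arrive there. As a loose, hypothetical sketch of that idea in Python — the flow representation and predicate names are made up for illustration and are not NetPlumber’s actual policy language:

```python
# Each flow reaching a probe node carries the path (sequence of boxes) it took
# from its source node. A policy is simply a predicate over that history.

def no_guest_to_server(flow):
    """Policy: guests cannot reach server S -> no flow from a guest source may arrive."""
    return flow["source"] not in {"G1", "G2"}

def at_most_four_hops(flow):
    """Policy: HTTP traffic from C to S takes at most 4 hops."""
    return flow["source"] != "C" or len(flow["path"]) <= 4

def waypoint_through_m(flow):
    """Policy: traffic from C to S must traverse middle box M."""
    return flow["source"] != "C" or "M" in flow["path"]

def check(probe_flows, policies):
    """Report the first (flow, policy) violation, or None if every flow satisfies every policy."""
    for flow in probe_flows:
        for policy in policies:
            if not policy(flow):
                return flow, policy.__name__
    return None

# Example: a flow from C that skipped M violates the waypoint policy.
flows_at_S = [{"source": "C", "path": ["sw1", "sw2", "S"]}]
print(check(flows_at_S, [no_guest_to_server, at_most_four_hops, waypoint_through_m]))
```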
Why the Dependency Graph Helps
• Incremental update
– Only the dependency subgraph affected by an update has to be re-traced.
• Flexible policy expression
– Probe and source nodes can be placed and configured flexibly.
• Parallelization
– The dependency graph can be partitioned into clusters that minimize inter-cluster dependencies.
Distributed NetPlumber
[Figure: the rule dependency graph partitioned into clusters, each handled by a separate NetPlumber instance.]
Dependency Graph Clustering
[Figure: examples of clustering the dependency graph.]
Experiment on Google WAN
• Google inter-datacenter WAN
– Largest deployed SDN, running OpenFlow
– ~143,000 OpenFlow rules
Experiment on Google WAN
• Policy check: all 52 edge switches can talk to each other
• More than 2,500 pairwise reachability checks
• Used two snapshots taken 6 weeks apart
• Used the first snapshot to create the initial NetPlumber state, then replayed the diff as a sequential update
Experiment on Google WAN
[Figure: CDF of NetPlumber run time per update (ms) for 1 to 5 NetPlumber instances. The slow tail corresponds to default/aggregate rules, for which the run time with Hassel is > 100. Beyond a few instances there is not much more benefit.]
Benchmarking Experiment
• For a single pairwise reachability check.
NetPlumber Conclusions
• Designed a protocol-independent system for real-time network policy checking
• Key component: a dependency graph of forwarding rules, capturing all flow paths
– Incremental updates
– Flexible policy expressions
– Parallelization by clustering