Reverse Hashing for High-speed Network Monitoring: Algorithms



EtherRake: Diagnosis and Monitoring
in Data Center & Enterprise Networks
Lab for Internet and Security Technology (LIST)
Northwestern Univ.
General Idea of EtherRake
• Problem statement:
Emerging data center (DC) and enterprise networks
consist mainly of large numbers of switches
that need monitoring and diagnosis.
General Idea of EtherRake
• A centralized structure.
– Collector at each switch
• Collect Neighbors
• Collect port information
• Collect forwarding tables
– Monitor Plane
• Transmit collected information
– Processing Center
• Link the frames
• Construct Logical Topology
• Find the problems
Collector at Each Switch
• Cisco switches as an example
• Port information
– show port status (display interface Ethernet0/1
on Huawei)
• Neighbor information
– show cdp neighbors
• Forwarding tables (aka switch tables)
– Show the MAC-to-interface mapping
Collector at Each Switch
• Port information
– Port number: 2 bytes
– Status: 4 bits
– Total: 3 bytes (rounded up) × 100 ports = 300 bytes < 0.4 KB per switch
• Neighbor information
– MAC address: 48 bits (6 bytes)
– Total: 6 bytes × 100 neighbors = 600 bytes ≈ 0.6 KB per switch
• Forwarding tables
– To be decided; not used in our current approach.
Only updates need to be transferred, so in the normal
case nothing needs to be sent.
• Total: < 1 KB per switch × 1024 switches ≈ 1 MB per round.
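The size budget above can be checked with a small sketch that packs one port record (a 2-byte port number plus a byte carrying the 4-bit status, i.e. 3 bytes) and one neighbor record (a 6-byte MAC address). The byte layout, the 100-ports/100-neighbors counts and the helper names `pack_port`/`pack_neighbor` are illustrative assumptions, not EtherRake's wire format.

```python
import struct

PORTS_PER_SWITCH = 100      # assumed, as in the slide's estimate
NEIGHBORS_PER_SWITCH = 100  # assumed
NUM_SWITCHES = 1024

def pack_port(port_number: int, status: int) -> bytes:
    """Hypothetical layout: 2-byte port number, then one byte whose low 4 bits hold the status."""
    if not 0 <= status < 16:
        raise ValueError("status must fit in 4 bits")
    return struct.pack(">HB", port_number, status)  # big-endian, no padding: 3 bytes

def pack_neighbor(mac: str) -> bytes:
    """Pack a MAC address such as '00:1a:2b:3c:4d:5e' into 6 bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError("expected 6 colon-separated octets")
    return bytes(int(p, 16) for p in parts)

port_bytes = PORTS_PER_SWITCH * len(pack_port(1, 0b0001))                        # 300
neighbor_bytes = NEIGHBORS_PER_SWITCH * len(pack_neighbor("00:1a:2b:3c:4d:5e"))  # 600
per_switch = port_bytes + neighbor_bytes                                          # under 1 KB
per_round = per_switch * NUM_SWITCHES                                             # under 1 MB

print(port_bytes, neighbor_bytes, per_switch, per_round)
```

Rounding the 4-bit status up to a full byte wastes half a byte per port but keeps records byte-aligned and easy to parse.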
Collector at Each Switch
• Synchronization
– Cristian's algorithm (P is the processing center and
S is a collector):
• P requests the time from S.
• On receiving the request, S prepares a
response and appends the time T from its own clock.
• P then sets its time to T + RTT/2.
– Multiple measurements can reduce the error.
– Accuracy: S's actual time when the response reaches P lies
between T + min and T + RTT - min, where
min is the minimum one-way delay.
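A minimal sketch of the estimate and error bounds above, assuming P records its own send and receive times for each request; the `Sample` type and function names are hypothetical. Taking the sample with the smallest RTT is one common way to use multiple measurements, since its error bound ±(RTT/2 - min) is the tightest.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class Sample:
    t_send: float       # P's clock when the request was sent
    t_recv: float       # P's clock when the response arrived
    server_time: float  # T: S's clock value appended to the response

def estimate(sample: Sample, min_delay: float = 0.0) -> Tuple[float, float, float]:
    """Return (estimate, lower bound, upper bound) of S's time when the response arrives."""
    rtt = sample.t_recv - sample.t_send
    est = sample.server_time + rtt / 2
    low = sample.server_time + min_delay
    high = sample.server_time + rtt - min_delay
    return est, low, high

def best_estimate(samples: Iterable[Sample], min_delay: float = 0.0) -> Tuple[float, float, float]:
    """Use the sample with the smallest RTT, whose error interval is narrowest."""
    best = min(samples, key=lambda s: s.t_recv - s.t_send)
    return estimate(best, min_delay)
```

For example, with samples of RTT 30 ms and 10 ms and min = 2 ms, the 10 ms sample is used and the error interval shrinks to ±3 ms.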
Monitor Plane
• The monitor plane co-exists with the data plane
and control plane on the same channel and is
used to transfer monitoring data.
[Figure: relationship among the monitor, control and data planes; surviving labels: Assist, Adjust, Monitor, Control]
Monitor Plane
• The monitor plane collects data for monitoring
the data plane.
• Switching in the monitor plane uses two
methods:
– Normally, the control plane assists monitor-plane
forwarding.
– Under errors, the monitor plane floods.
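The two switching methods can be sketched as a per-switch port-selection rule. This assumes a hypothetical `control_next_hop` value that is `None` whenever the control plane cannot assist (for example, under errors); it is an illustration, not EtherRake's implementation.

```python
from typing import List, Optional, Sequence

def monitor_plane_ports(in_port: int, all_ports: Sequence[int],
                        control_next_hop: Optional[int]) -> List[int]:
    """Choose output ports for a monitoring frame at one switch."""
    if control_next_hop is not None:
        # Normal case: follow the port the control plane suggests.
        return [control_next_hop]
    # Error case: flood on every port except the one the frame arrived on.
    return [p for p in all_ports if p != in_port]
```

Real flooding would also need duplicate suppression (for example, per-frame sequence numbers) to keep monitoring traffic from looping; that is omitted here.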
Processing Center
• Collect port information, forwarding tables and
neighbor information from all the switches.
• Construct the logical topology of switches
based on the port & neighbor info
– Detect loops in the logical topology for STP loop
problems
– Check for any missing/dead switches
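A minimal sketch of the processing-center checks, assuming each switch reports the neighbor switches it reaches through forwarding (non-blocked) ports. Union-find flags any link that closes a cycle in that logical topology, and missing switches are those expected but absent from the reports. The report format and function names are assumptions; parallel links between the same two switches are collapsed here.

```python
from typing import Dict, Iterable, List, Set, Tuple

def find_loop_edges(reports: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return the links that close a cycle in the forwarding topology."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each link is reported by both ends; store it once as an unordered pair.
    links = {frozenset((s, n)) for s, nbrs in reports.items() for n in nbrs if s != n}
    loop_edges = []
    for link in sorted(links, key=sorted):
        a, b = sorted(link)
        ra, rb = find(a), find(b)
        if ra == rb:
            loop_edges.append((a, b))  # a and b were already connected
        else:
            parent[ra] = rb
    return loop_edges

def missing_switches(expected: Iterable[str], reports: Dict[str, Set[str]]) -> Set[str]:
    """Switches that should have reported but did not (possibly dead)."""
    return set(expected) - set(reports)
```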
Problems to Solve
• STP Error Detection
• End-to-end Error Detection
• Other Hardware/Software Errors of
Switches and Their Detection
• TRILL Potential Problems
End-to-end Connectivity
Monitoring
• Based on the neighbor and port information,
check whether all switches and end hosts are on a
connected spanning tree (ST).
– End hosts are also neighbors of leaf
switches.
• The forwarding table also records past
connectivity information.
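The connectivity check can be sketched as a breadth-first search over the links reported as neighbor information, treating end hosts as neighbors of leaf switches. The link-list input and `unreachable_nodes` name are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

def unreachable_nodes(links: Iterable[Tuple[str, str]], nodes: Set[str], start: str) -> Set[str]:
    """Return the switches or end hosts not reachable from `start`."""
    adj: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj.get(cur, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return set(nodes) - seen
```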
Other Software Errors of Switches
and Their Detection
• One-way link problem: no frames in the backward direction.
– From EtherRake's view, the interface in the other
direction is dead.
• Deferred frames: the buffer is full, so frames have to be
dropped.
– Encode the buffer status (e.g., full) into the status bits.
• Links between switches and routers
disabled/inactive.
– Detected by the port status bits or a missing heartbeat.
• Switches down, e.g., IOS problems that leave a switch unbootable.
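One way to spot the one-way link problem from collected neighbor information is to compare the two ends of each link. This sketch assumes each switch reports the set of neighbors it currently hears from; if A hears B but B does not hear A, the A-to-B direction is treated as dead. The input format and function name are hypothetical.

```python
from typing import Dict, Set, Tuple

def dead_directions(reports: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Return (sender, receiver) pairs whose frames appear not to get through."""
    dead = set()
    for a, heard in reports.items():
        for b in heard:
            # b -> a works (a hears b); if b does not hear a, a -> b is dead.
            # Switches with no report are skipped: they count as missing, not one-way.
            if b in reports and a not in reports[b]:
                dead.add((a, b))
    return dead
```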
Limitations on Other Switch
Software Errors Detection
• Some errors have to be detected at the data
plane or application plane.
– VLAN Problems. Hosts in the same VLAN cannot
communicate with each other.
Hardware Errors of Switches and Their
Detection
• Switch port errors.
• Switch module errors.
• Both can be detected from the port status
reports.
STP Errors (1)
• Count to Infinity when removing the root
[Figure: count-to-infinity example after the root bridge is removed; bridge IDs and (root, cost) labels not recoverable]
STP Errors (2)
• Forwarding Loops
– BPDU Loss Induced Forwarding Loops. If the
blocked port fails to receive BPDUs from its
peer bridge for an extended period of time, it
may start forwarding data.
STP Errors (3)
• Forwarding Loops
– MaxAge Induced Forwarding Loops (MaxAge
= 6)
STP Errors (4)
• Forwarding Loops
– Count to Infinity Induced Forwarding Loops
– Pollution of Forwarding Tables
Previous STP Errors Detection
• EtherFuse (SIGCOMM '07)
– Plug a fuse into Ethernet
• Remaining problems
– Where to plug it in?
– How many are needed?
Previous STP Errors Detection
• Cisco Prevention Methods
– Loop Guard: prevents loops induced by BPDU
loss.
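The difference Loop Guard makes for the BPDU-loss case can be sketched as a simplified state rule for a blocked port that stops receiving BPDUs. This is a toy model, not Cisco's implementation: the state names, the 20-second MaxAge default and the collapsing of the listening/learning states into a direct move to forwarding are simplifying assumptions.

```python
def next_port_state(state: str, seconds_since_bpdu: float,
                    max_age: float = 20.0, loop_guard: bool = False) -> str:
    """Simplified state of a blocked port after BPDUs stop arriving."""
    if state != "blocking" or seconds_since_bpdu < max_age:
        return state
    if loop_guard:
        # Loop Guard keeps the port out of forwarding until BPDUs return.
        return "loop-inconsistent"
    # Plain STP: the stale information ages out and the port eventually forwards
    # (the intermediate listening/learning states are omitted in this sketch).
    return "forwarding"
```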
Some Existing Solutions
• Cisco Discovery Protocol (CDP)
– Discovers neighboring Cisco devices
– Monitors the liveness of neighboring nodes
– Limitations
• No detailed status reports for diagnosis
• Limited to one hop
• Cisco Unidirectional Link Detection (UDLD)
– Detects one-way link problems
General Monitoring Metrics for
Detection
• Connectivity. Based on the tree of linked frames,
EtherRake can determine whether a path is connected.
• Delay. EtherRake can link frames and
calculate the time spent at each switch.
• Throughput. EtherRake can calculate
throughput from the collected frames.
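A minimal sketch of the delay and throughput metrics, assuming each collected observation carries ingress and egress timestamps on synchronized clocks and the frame size. The `Observation` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class Observation:
    switch: str
    t_in: float   # time the frame entered the switch (synchronized clock)
    t_out: float  # time the frame left the switch
    size: int     # frame size in bytes

def per_switch_delay(trace: Iterable[Observation]) -> Dict[str, float]:
    """Seconds one linked frame spent at each switch on its path."""
    return {o.switch: o.t_out - o.t_in for o in trace}

def throughput_bps(observations: Iterable[Observation],
                   window_start: float, window_end: float) -> float:
    """Bits per second of frames that left a switch within [window_start, window_end)."""
    total = sum(o.size for o in observations if window_start <= o.t_out < window_end)
    return 8 * total / (window_end - window_start)
```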
TRILL Potential Problems
• Routing loops
– Caused by inconsistent views of the network topology.
– Mitigated using a hop count.
• Scalability
– It is unclear how far TRILL can scale.
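How a hop count bounds a routing loop can be sketched with a simplified transit rule: drop at zero, otherwise decrement. The 6-bit field width follows the TRILL header (RFC 6325); the exact forwarding rules are in the RFC, and the function names here are hypothetical.

```python
from typing import Optional

HOP_COUNT_BITS = 6  # the TRILL header's hop-count field is 6 bits wide

def forward_hop_count(hop_count: int) -> Optional[int]:
    """Simplified transit rule: None means drop, otherwise the decremented count."""
    if not 0 <= hop_count < 2 ** HOP_COUNT_BITS:
        raise ValueError("hop count does not fit in 6 bits")
    if hop_count == 0:
        return None  # hop budget used up: drop the frame
    return hop_count - 1

def hops_before_drop(initial: int) -> int:
    """How many times a frame caught in a loop is forwarded before it is dropped."""
    hops, count = 0, initial
    while (count := forward_hop_count(count)) is not None:
        hops += 1
    return hops
```

Even in a persistent loop, a frame therefore circulates at most 63 times instead of forever.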
Backup
Detection of STP Errors by
EtherRake
• Find STP errors with EtherRake:
– Link collected frames into traces
– Detect frame forwarding loops
– Leverage the switch and ARP table information
– Challenges
• Scalability: optimize the collection of traces
• Ambiguity and accuracy: frame linking
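A minimal sketch of linking frames into traces and flagging forwarding loops. It assumes a frame's captured bytes are identical at every hop, so a hash of those bytes can serve as the linking key, and treats a frame seen twice at the same switch as looping. Genuinely repeated identical frames (for example, retransmitted ARP requests) would produce false positives; that is the ambiguity challenge noted above. Function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def frame_key(frame: bytes) -> str:
    """Identify a frame by a hash of its bytes."""
    return hashlib.sha256(frame).hexdigest()

def link_traces(observations: Iterable[Tuple[str, float, bytes]]) -> Dict[str, List[Tuple[float, str]]]:
    """Group (switch, timestamp, frame bytes) observations into time-ordered traces."""
    traces: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for switch, t, frame in observations:
        traces[frame_key(frame)].append((t, switch))
    for trace in traces.values():
        trace.sort()
    return traces

def looping_frames(traces: Dict[str, List[Tuple[float, str]]]) -> Dict[str, List[str]]:
    """Return the switch sequence of every frame seen at some switch more than once."""
    flagged = {}
    for key, trace in traces.items():
        switches = [s for _, s in trace]
        if len(switches) != len(set(switches)):
            flagged[key] = switches
    return flagged
```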
End-to-end Connectivity
Monitoring
• Diagnose a connectivity problem from A to
B with EtherRake:
– Find the frames on the way from A to
B.
– Link the frames and find a path.
– Locate the problem.
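The "locate the problem" step can be sketched by walking the expected A-to-B path and stopping at the first switch where the linked frame was not observed. The expected path (for example, derived from the logical topology) and the `locate_break` name are assumptions.

```python
from typing import Iterable, Optional, Set, Tuple

def locate_break(path: Iterable[str], seen_at: Set[str]) -> Optional[Tuple[Optional[str], str]]:
    """Return (last switch that saw the frame, first switch that did not),
    or None if the frame was seen all the way to B."""
    last_seen = None
    for switch in path:
        if switch not in seen_at:
            return last_seen, switch
        last_seen = switch
    return None
```

The fault then lies at the last switch that saw the frame, at the first one that did not, or on the link between them.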
IP Router Errors – OSPF (1)
• Network convergence time: the time taken by
all OSPF routers in the network to return to
steady-state operation after a change in
the network state.
IP Router Errors – OSPF (2)
• Routing Load on Processors
IP Router Errors – OSPF (3)
• Route flaps: routing-table changes in a
router, usually in response to a network
failure or recovery.
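Route flaps can be spotted by counting routing-table changes per prefix within a sliding time window. This is a generic sketch; the 60-second window, the threshold of 3 and the `FlapDetector` name are illustrative assumptions, not values from OSPF or any vendor.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

class FlapDetector:
    """Counts routing-table changes per prefix within a sliding window."""

    def __init__(self, window: float = 60.0, threshold: int = 3):
        self.window = window
        self.threshold = threshold
        self.changes: Dict[str, Deque[float]] = defaultdict(deque)

    def record_change(self, prefix: str, t: float) -> bool:
        """Record a change at time t; return True if the prefix is flapping."""
        q = self.changes[prefix]
        q.append(t)
        while q and q[0] <= t - self.window:
            q.popleft()  # forget changes that fell out of the window
        return len(q) >= self.threshold
```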
Cisco Solution
• Bi-directional Forwarding Detection (BFD)
– Aims to speed up network convergence, which has three parts:
• Failure detection: how quickly a device on the
network can detect and react to the failure of one of its
own components, or of a component in a
routing protocol peer.
• Information dissemination: how quickly the
failure from the previous stage can be communicated to
other devices in the network.
• Repair: how quickly all devices on the
network, once notified of the failure, can
calculate an alternate path through which data can
flow.
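The failure-detection part can be quantified. In BFD's asynchronous mode (RFC 5880, section 6.8.4), the detection time is the remote system's Detect Mult multiplied by the agreed remote transmit interval, i.e. the larger of the local Required Min RX Interval and the remote Desired Min TX Interval. The function name and the example values are hypothetical.

```python
def bfd_detection_time_us(remote_detect_mult: int,
                          local_required_min_rx_us: int,
                          remote_desired_min_tx_us: int) -> int:
    """Asynchronous-mode BFD detection time in microseconds."""
    agreed_remote_tx_interval = max(local_required_min_rx_us, remote_desired_min_tx_us)
    return remote_detect_mult * agreed_remote_tx_interval
```

For instance, a multiplier of 3 with a 100 ms agreed interval gives 300 ms to declare the session down.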
IP Router Errors – DHCP
• DHCP problem
– Configuration problems.
– Inability to acquire or renew a lease.
– How to keep the same IP address on multiboot machines?
EtherFuse (1)
• An Ethernet fuse that is plugged into the
network to monitor the network's status.
EtherFuse (2)
• Detection of count to infinity
– Tracks the cost to the same root R carried in BPDUs
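A simplified heuristic in the spirit of this bullet: watch the root path cost advertised for the same root R and raise an alarm when it keeps rising over consecutive BPDUs. This is not EtherFuse's published algorithm. It assumes the BPDUs come from a single observation point (one link), and the threshold of 3 consecutive increases and the class name are hypothetical.

```python
from collections import defaultdict
from typing import Dict

class CountToInfinityDetector:
    """Flags a root whose advertised path cost keeps increasing."""

    def __init__(self, rises_needed: int = 3):
        self.rises_needed = rises_needed
        self.last_cost: Dict[str, int] = {}
        self.rises: Dict[str, int] = defaultdict(int)

    def observe(self, root_id: str, cost: int) -> bool:
        """Feed one BPDU's (root, cost); return True if count to infinity is suspected."""
        prev = self.last_cost.get(root_id)
        if prev is not None and cost > prev:
            self.rises[root_id] += 1
        else:
            self.rises[root_id] = 0  # cost stable or falling: reset the streak
        self.last_cost[root_id] = cost
        return self.rises[root_id] >= self.rises_needed
```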
• Detection of Forwarding Loops.
– Combination of Passive Sniffing and Active
Probing.
Packet-View Switching
• Forwarding packets from the packets' point of
view.
• Each packet carries a memory of the path it has
already traversed and decides where to go based
on that memory.
• The steps are as follows (generally speaking, this
is a depth-first search from the packet's point of
view).
Packet-View Switching
(1) Normally, when a packet arrives at a switch, it
chooses the default port, i.e., the port that the control
plane provides.
(2) If the packet has already tried the default port, it
randomly chooses a port it has never tried.
(3) If the packet has tried every port at this switch, it goes
back through the port it came from.
(4) The packet is discarded when it returns to its origin
and finds no other way to go; otherwise it ends at the
destination, which is the monitor center.
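The four steps can be simulated as a depth-first search driven by the packet's memory. This sketch interprets "a port it has never tried" as a port leading to a switch the packet has not yet visited, and models ports as neighbor switch IDs; the adjacency format, the `default_next` map (the control plane's default port) and the function name are assumptions.

```python
import random
from typing import Dict, List, Optional

def packet_view_route(adj: Dict[str, List[str]], default_next: Dict[str, str],
                      origin: str, dest: str, seed: int = 0) -> Optional[List[str]]:
    """Return the switches the packet traverses, or None if it is discarded."""
    rng = random.Random(seed)
    visited = {origin}
    came_from: Dict[str, Optional[str]] = {origin: None}  # the packet's path memory
    hops = [origin]
    cur = origin
    while cur != dest:
        fresh = [n for n in adj.get(cur, []) if n not in visited]
        default = default_next.get(cur)
        if default in fresh:
            nxt = default                # (1) control plane's default port first
        elif fresh:
            nxt = rng.choice(fresh)      # (2) a random port not yet explored
        elif came_from[cur] is not None:
            cur = came_from[cur]         # (3) every port tried: go back
            hops.append(cur)
            continue
        else:
            return None                  # (4) back at the origin, nowhere left to go
        visited.add(nxt)
        came_from[nxt] = cur
        cur = nxt
        hops.append(cur)
    return hops                          # (4) reached the monitor center
```

Because every switch is entered forward at most once and backtracking follows the recorded path, the walk always terminates.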