X-Trace: A Pervasive Network Tracing Framework
Presenter: Chi-Hung Lu
1
Problems
Distributed applications are hard to validate
Distribution of application state across many distinct
execution environments
Protocols involve complex interactions among a
collection of networked machines
Need to handle failures ranging from network problems
to crashing nodes
Intricate sequences of events can trigger complex errors
as a result of mishandled corner cases
2
Approaches
Logging-based Debugging
X-Trace
Bi-directional Distributed BackTracker (BDB)
Pip
Deterministic Replay
WiDS
Friday
Jockey
Model Checking
MaceMC
3
R. Fonseca et al., NSDI '07
4
Problem Description
It is difficult to diagnose the source of a problem in an
Internet application
Current network diagnostic tools each focus on one
particular protocol
They do not share information about the application among
the user, the service, and the network operators
5
Examples
traceroute
Can locate IP connectivity problems
Cannot reveal proxy or DNS failures
HTTP monitoring suite
Can locate application problems
Cannot diagnose routing problems
6
Examples
DNS Server
User
Web Server
Proxy
7
X-Trace
An integrated tracing framework
Records the network paths that were taken
Invoke X-Trace when initiating an application task
Insert X-Trace metadata with a task identifier in the
request
Propagate the metadata down to lower layers through
protocol interfaces
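The propagation steps above can be sketched as follows. This is a minimal illustration, not the X-Trace implementation: the `Metadata` class, the sequential operation IDs, and the string edge types are assumptions made for readability, and `push_next` / `push_down` are modeled loosely on the pushNext and pushDown primitives described in the X-Trace paper.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

# Assumption: sequential operation IDs, chosen only to keep the example readable.
_op_ids = itertools.count(1)


@dataclass(frozen=True)
class Metadata:
    task_id: int                      # shared by every operation in the task
    op_id: int                        # identifies this operation
    parent_id: Optional[int] = None   # operation that caused this one
    edge_type: Optional[str] = None   # "next" (same layer) or "down" (lower layer)


def start_task(task_id: int) -> Metadata:
    """Create the metadata inserted into the initial request."""
    return Metadata(task_id, next(_op_ids))


def push_next(md: Metadata) -> Metadata:
    """Propagate to the next operation in the same layer (e.g. the next hop)."""
    return Metadata(md.task_id, next(_op_ids), md.op_id, "next")


def push_down(md: Metadata) -> Metadata:
    """Propagate to the layer below (e.g. from HTTP to TCP)."""
    return Metadata(md.task_id, next(_op_ids), md.op_id, "down")


http_request = start_task(task_id=0xBEEF)
tcp_segment = push_down(http_request)   # HTTP hands the metadata to TCP
ip_packet = push_down(tcp_segment)      # TCP hands it to IP
proxy_request = push_next(http_request) # the proxy forwards at the HTTP layer

for md in (http_request, tcp_segment, ip_packet, proxy_request):
    print(md)
```

Every derived operation keeps the original task identifier, while the parent ID and edge type record how it was reached.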
11
Task Tree
X-Trace tags all network operations resulting from a
particular task with the same task identifier
Task tree is the set of network operations connected
with an initial task
The task tree can be reconstructed after collecting trace
data from reports
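As a rough illustration of reconstruction, the sketch below groups reports by task identifier and links each operation to its parent. The report tuple layout `(task_id, op_id, parent_id, edge_type, label)` and the sample values are hypothetical; real X-Trace reports carry more information.

```python
from collections import defaultdict

# Hypothetical reports, arriving out of order and mixed with another task.
REPORTS = [
    (7, 4, 3, "next", "server HTTP"),
    (7, 2, 1, "down", "client TCP"),
    (7, 1, None, None, "client HTTP"),
    (9, 1, None, None, "unrelated task"),
    (7, 3, 1, "next", "proxy HTTP"),
]


def render_task_tree(reports, task_id):
    """Return the task tree for task_id as a list of indented lines."""
    labels = {}
    children = defaultdict(list)  # parent op ID -> [(child op ID, edge type)]
    for tid, op_id, parent_id, edge, label in reports:
        if tid != task_id:
            continue  # only operations tagged with this task belong to the tree
        labels[op_id] = label
        children[parent_id].append((op_id, edge))

    lines = []

    def walk(op_id, edge, depth):
        tag = f"[{edge}] " if edge else ""
        lines.append("  " * depth + tag + labels[op_id])
        for child_id, child_edge in sorted(children[op_id]):
            walk(child_id, child_edge, depth + 1)

    # Roots are the operations that have no parent.
    for root_id, edge in sorted(children[None]):
        walk(root_id, edge, 0)
    return lines


tree = render_task_tree(REPORTS, task_id=7)
print("\n".join(tree))
```

This sketch assumes every report arrives; an operation whose parent report was lost would need to be attached as an orphan subtree.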
12
An example of the task tree
A simple HTTP request through a proxy
13
X-Trace Components
Data
X-Trace metadata
Network path
Task tree
Report
Reconstruct task tree
14
Propagation of X-Trace Metadata
The propagation of X-Trace metadata through the task
tree
15
The X-Trace Metadata
Field         Usage
Flags         Bits that specify which of the three optional components are present
TaskID        A unique integer ID
TreeInfo      ParentID, OpID, EdgeType
Destination   Specifies the address that X-Trace reports should be sent to
Options       Accommodates future extension mechanisms
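To make the role of the Flags field concrete, here is a small sketch of encoding and decoding metadata in which flag bits mark the optional components that follow the TaskID. It assumes the three optional components are TreeInfo, Destination, and Options, as the field list suggests. The byte layout, field widths, bit values, and the example report address are all invented for illustration and are not the X-Trace wire format.

```python
import struct
from enum import IntFlag


class Flags(IntFlag):
    # Hypothetical bit assignments.
    TREE_INFO = 0x1
    DESTINATION = 0x2
    OPTIONS = 0x4


def encode(task_id, tree_info=None, destination=None, options=None):
    """Pack metadata; only components that are present are written."""
    flags = Flags(0)
    body = b""
    if tree_info is not None:
        parent_id, op_id, edge_type = tree_info
        flags |= Flags.TREE_INFO
        body += struct.pack("!IIB", parent_id, op_id, edge_type)
    if destination is not None:
        raw = destination.encode()
        flags |= Flags.DESTINATION
        body += struct.pack("!B", len(raw)) + raw
    if options is not None:
        flags |= Flags.OPTIONS
        body += struct.pack("!B", len(options)) + options
    return struct.pack("!BI", int(flags), task_id) + body


def decode(data):
    """Unpack metadata, using the flag bits to decide what to read."""
    flags_raw, task_id = struct.unpack_from("!BI", data, 0)
    flags, offset = Flags(flags_raw), 5
    out = {"task_id": task_id}
    if Flags.TREE_INFO in flags:
        out["tree_info"] = struct.unpack_from("!IIB", data, offset)
        offset += 9
    if Flags.DESTINATION in flags:
        n = data[offset]
        out["destination"] = data[offset + 1:offset + 1 + n].decode()
        offset += 1 + n
    if Flags.OPTIONS in flags:
        n = data[offset]
        out["options"] = data[offset + 1:offset + 1 + n]
    return out


wire = encode(42, tree_info=(1, 2, 0), destination="reports.example.org:7831")
print(decode(wire))
```

With no optional components, the encoded metadata in this layout is just the flags byte and the TaskID.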
17
Operation of X-Trace Metadata
18
X-Trace Report Architecture
20
Usage Scenario (1)
Web request and recursive DNS queries
23
Usage Scenario (2)
A request fault annotated with user input
24
Usage Scenario (3)
A client and a server communicate over I3 overlay
network
25
Usage Scenario (3)
Internet Indirection Infrastructure (i3)
26
Usage Scenario (3)
Tree for normal operation
29
Usage Scenario (3)
The receiver host fails
30
Usage Scenario (3)
Middlebox process crash
31
Usage Scenario (3)
The middlebox host fails
32
Discussion
Report loss
Non-tree request structures
Partial deployment
Managing report traffic
Security Considerations
33
X. Liu et al, NSDI 07
34
Problem Description
Log mining is both labor-intensive and fragile
Latent bugs often are distributed across multiple
nodes
Logs reflect incomplete information of an execution
Non-determinism of distributed applications
35
Goals
Efficiently verify application properties
Provide fairly complete information about an
execution
Reproduce the buggy runs deterministically and
faithfully
36
Approach
Log the actual execution of a distributed system
Apply predicate checking in a centralized simulator
over a run driven by testing scripts or replayed by logs
Output violation report along with message traces
An execution is interpreted as a sequence of events, which are
dispatched to corresponding handling routines
37
Components
A versatile scripting language
Allow a developer to refine system properties into
straightforward assertions
A checker
Inspect for violations
38
Architecture
Components of WiDS Checker
39
Architecture
Reproduce real runs
Log all non-deterministic events using Lamport’s logical
clock
Check user-defined predicates
A versatile scripting language to specify the system states
being observed and the predicates for invariants and
correctness
Screen out false alarms with auxiliary information
For liveness properties
Trace root causes using a visualization tool
40
Programming with WiDS
WiDS APIs are mostly member functions of the
WiDSObject class
WiDS runtime maintains an event queue to buffer
pending events and dispatches them to corresponding
handling routines
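The event-queue model can be sketched as below. This is a generic Python illustration of the idea, not the WiDS API: the `Runtime` class, its method names, and the event kinds are hypothetical.

```python
import heapq


class Runtime:
    """Buffers pending events and dispatches each one to its handler."""

    def __init__(self):
        self._queue = []     # heap of (time, sequence, kind, payload)
        self._seq = 0        # tie-breaker so equal-time events keep post order
        self._handlers = {}

    def on(self, kind, handler):
        self._handlers[kind] = handler

    def post(self, time, kind, payload):
        heapq.heappush(self._queue, (time, self._seq, kind, payload))
        self._seq += 1

    def run(self):
        while self._queue:
            time, _, kind, payload = heapq.heappop(self._queue)
            self._handlers[kind](self, time, payload)


log = []


def on_message(rt, now, payload):
    log.append((now, "message", payload))
    rt.post(now + 5, "timer", "retry " + payload)  # handlers may post new events


def on_timer(rt, now, payload):
    log.append((now, "timer", payload))


rt = Runtime()
rt.on("message", on_message)
rt.on("timer", on_timer)
rt.post(1, "message", "ping")
rt.post(3, "message", "pong")
rt.run()
print(log)
```

Because all activity flows through one queue, the same handlers can in principle be driven by a live run, a simulator, or a replayed log.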
41
Enabling Replay
Logging
Log all WiDS nondeterminism
Redirect OS calls and log the results
Embed a Lamport Clock in each out-going message
Checkpoint
Support partial replay
Save the WiDS process context
Replay
Start from the beginning or a checkpoint
Replay events in serialized Lamport order
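A minimal sketch of the logging and replay-ordering idea, assuming a simplified `Node` class (hypothetical, not part of WiDS): each node stamps logged events with a Lamport clock, embeds the clock in outgoing messages, and replay merges the per-node logs in clock order with the node name as a tie-breaker.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.log = []  # entries of (lamport_clock, node_name, description)

    def _record(self, what):
        self.log.append((self.clock, self.name, what))

    def local(self, what):
        """A local non-deterministic event, such as a timer firing."""
        self.clock += 1
        self._record(what)

    def send(self, body):
        """Stamp the outgoing message with the sender's clock."""
        self.clock += 1
        self._record(f"send {body}")
        return (self.clock, body)

    def receive(self, message):
        """Advance past both the local clock and the sender's clock."""
        sent_at, body = message
        self.clock = max(self.clock, sent_at) + 1
        self._record(f"recv {body}")


def replay_order(*nodes):
    """Serialize all logged events in Lamport order for replay."""
    return sorted(entry for node in nodes for entry in node.log)


a, b = Node("A"), Node("B")
b.local("timer fired")
request = a.send("join")
b.receive(request)
reply = b.send("ack")
a.receive(reply)

for entry in replay_order(a, b):
    print(entry)
```

In the serialized order every send appears before its matching receive, which is the property replay relies on.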
42
Checker
Observe memory state
Define states and evaluate predicates
Refresh database for each event
Maintain history
Re-evaluate modified predicates
Auxiliary information for violations
Liveness properties are only guaranteed to be true eventually
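The checking loop can be illustrated with a small sketch: the observed state is refreshed after each event, and only predicates that depend on the modified state are re-evaluated. The event format, the state keys, and the `mutual_exclusion` predicate are invented for this example and are not written in the WiDS Checker scripting language; liveness properties and their auxiliary information are not covered here.

```python
def check(events, predicates):
    """Replay events and return (step, predicate_name) for each violation."""
    state = {}
    violations = []
    for step, (key, value) in enumerate(events):
        state[key] = value  # refresh the observed state for this event
        for name, (depends_on, predicate) in predicates.items():
            # Re-evaluate only predicates whose inputs were just modified.
            if key in depends_on and not predicate(state):
                violations.append((step, name))
    return violations


# Hypothetical safety property: at most one node holds the lock at a time.
predicates = {
    "mutual_exclusion": (
        {"A.holds_lock", "B.holds_lock"},
        lambda s: not (s.get("A.holds_lock", False) and s.get("B.holds_lock", False)),
    ),
}

events = [
    ("A.holds_lock", True),
    ("B.holds_lock", False),
    ("B.holds_lock", True),   # both nodes now hold the lock
    ("A.holds_lock", False),
]

violations = check(events, predicates)
print(violations)
```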
43
Visualization Tools
Message flow graph
47
Evaluation
Benchmark and result summary
48
Performance
Running time for evaluating predicates
49
Logging Overhead
Percentage of logging time
50
Discussion
System is debugged by those who developed it
Bugs are hunted by those who are intimately familiar
with the system
51