X-Trace: A Pervasive Network Tracing Framework

Presenter: Chi-Hung Lu
1
Problems
 Distributed applications are hard to validate
 Distribution of application state across many distinct
execution environments
 Protocols involve complex interactions among a
collection of networked machines
 Need to handle failures ranging from network problems
to crashing nodes
 Intricate sequences of events can trigger complex errors
as a result of mishandled corner cases
2
Approaches
 Logging-based Debugging
     X-Trace
     Bi-directional Distributed BackTracker (BDB)
     Pip
 Deterministic Replay
     WiDS
     Friday
     Jockey
 Model Checking
     MaceMC
3
X-Trace (R. Fonseca et al., NSDI '07)
4
Problem Description
 It is difficult to diagnose the source of a problem in
an Internet application
 Current network diagnostic tools each focus on only one
particular protocol
 They do not share information about the application among
the user, the service, and the network operators
5
Examples
 traceroute
     Can locate IP connectivity problems
     Cannot reveal proxy or DNS failures
 HTTP monitoring suite
     Can locate application problems
     Cannot diagnose routing problems
6
Examples
 Diagram, repeated over four slides: a User, a DNS Server,
a Proxy, and a Web Server
7-10
X-Trace
 An integrated tracing framework
 Record the network paths that were taken
 Invoke X-Trace when initiating an application task
 Insert X-Trace metadata with a task identifier in the
request
 Propagate the metadata down to lower layers through
protocol interfaces
11
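The propagation steps above can be sketched roughly as follows. This is an illustration only, not the X-Trace implementation: the `Metadata` class, the random 8-hex-digit IDs, and the layer names are assumptions. The names `push_next` and `push_down` are borrowed from the propagation primitives described in the X-Trace paper; their bodies here are hypothetical.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metadata:
    task_id: str              # shared by every operation in the task
    op_id: str                # identifies this operation
    parent_id: Optional[str]  # operation that caused this one
    edge: Optional[str]       # "NEXT" (same layer) or "DOWN" (lower layer)


def new_id() -> str:
    # Assumed ID format: 8 random hex digits.
    return secrets.token_hex(4)


def start_task() -> Metadata:
    """Create metadata when an application task is initiated."""
    return Metadata(task_id=new_id(), op_id=new_id(), parent_id=None, edge=None)


def push_next(md: Metadata) -> Metadata:
    """Propagate to the next operation in the same layer."""
    return Metadata(md.task_id, new_id(), md.op_id, "NEXT")


def push_down(md: Metadata) -> Metadata:
    """Propagate to the layer below through a protocol interface."""
    return Metadata(md.task_id, new_id(), md.op_id, "DOWN")


http_request = start_task()
tcp_segment = push_down(http_request)     # HTTP hands metadata to TCP
ip_packet = push_down(tcp_segment)        # TCP hands metadata to IP
proxy_request = push_next(http_request)   # next HTTP hop, e.g. a proxy
```

Every derived operation keeps the task identifier and records the operation that caused it, which is the information needed later to rebuild the task tree.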
Task Tree
 X-Trace tags all network operations resulting from a
particular task with the same task identifier
 The task tree is the set of network operations connected
with an initial task
 The task tree can be reconstructed after collecting trace
data from reports
12
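As a rough sketch of how a task tree might be rebuilt from collected reports: the report tuple layout `(task_id, op_id, parent_id, label)` and the sample data below are hypothetical, chosen only to show the grouping by task identifier and the parent/child linking.

```python
from collections import defaultdict

# Hypothetical reports: (task_id, op_id, parent_id, label)
reports = [
    ("t1", "a", None, "HTTP client"),
    ("t1", "b", "a", "HTTP proxy"),
    ("t1", "c", "b", "HTTP server"),
    ("t1", "d", "a", "TCP 1"),
    ("t2", "x", None, "unrelated task"),
]


def build_task_tree(reports, task_id):
    """Return (roots, children) for one task; children maps op_id to child op_ids."""
    children = defaultdict(list)
    roots = []
    for tid, op_id, parent_id, _label in reports:
        if tid != task_id:
            continue  # operation belongs to a different task
        if parent_id is None:
            roots.append(op_id)
        else:
            children[parent_id].append(op_id)
    return roots, dict(children)


def render(reports, task_id):
    """Return the task tree as indented lines, one per operation."""
    labels = {op: label for tid, op, _p, label in reports if tid == task_id}
    roots, children = build_task_tree(reports, task_id)
    lines = []

    def walk(op, depth):
        lines.append("  " * depth + labels[op])
        for child in children.get(op, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


tree_lines = render(reports, "t1")
```

Operations whose parent report never arrived would not be reachable from a root in this sketch, which is one way report loss shows up in a reconstructed tree.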
An example of the task tree
 A simple HTTP request through a proxy
13
X-Trace Components
 Data
 X-Trace metadata
 Network path
 Task tree
 Report
 Reconstruct task tree
14
Propagation of X-Trace Metadata
 The propagation of X-Trace metadata through the task
tree
15
The X-Trace Metadata
Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  Specifies the address to which X-Trace reports should be sent
Options      Mechanism to accommodate future extensions
17
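To make the role of the Flags field concrete, here is a minimal encoding sketch. The bit positions, field widths, and the sample destination string are all assumptions for illustration; this is not the actual X-Trace wire format.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed bit positions for the three optional components.
HAS_TREEINFO = 0x01
HAS_DESTINATION = 0x02
HAS_OPTIONS = 0x04


@dataclass
class XTraceMetadata:
    task_id: int
    tree_info: Optional[Tuple[int, int, int]] = None  # (parent_id, op_id, edge_type)
    destination: Optional[bytes] = None
    options: Optional[bytes] = None

    def encode(self) -> bytes:
        flags = 0
        body = b""
        if self.tree_info is not None:
            flags |= HAS_TREEINFO
            body += struct.pack("!IIB", *self.tree_info)
        if self.destination is not None:
            flags |= HAS_DESTINATION
            body += struct.pack("!B", len(self.destination)) + self.destination
        if self.options is not None:
            flags |= HAS_OPTIONS
            body += struct.pack("!B", len(self.options)) + self.options
        # Header: 1-byte flags followed by an assumed 4-byte task ID.
        return struct.pack("!BI", flags, self.task_id) + body

    @classmethod
    def decode(cls, data: bytes) -> "XTraceMetadata":
        flags, task_id = struct.unpack_from("!BI", data, 0)
        offset = 5
        tree_info = destination = options = None
        if flags & HAS_TREEINFO:
            tree_info = struct.unpack_from("!IIB", data, offset)
            offset += 9
        if flags & HAS_DESTINATION:
            n = data[offset]
            destination = data[offset + 1 : offset + 1 + n]
            offset += 1 + n
        if flags & HAS_OPTIONS:
            n = data[offset]
            options = data[offset + 1 : offset + 1 + n]
        return cls(task_id, tree_info, destination, options)


md = XTraceMetadata(task_id=7, tree_info=(1, 2, 0), destination=b"udp:10.0.0.1:7831")
roundtrip = XTraceMetadata.decode(md.encode())
```

Because absent components cost no bytes, metadata that carries only a task identifier stays small, which matters when it is embedded in every message.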
Operation of X-Trace Metadata
18
X-Trace Report Architecture
20
Usage Scenario (1)
 Web request and recursive DNS queries
23
Usage Scenario (2)
 A request fault annotated with user input
24
Usage Scenario (3)
 A client and a server communicate over the I3 overlay
network
25
Usage Scenario (3)
 Internet Indirection Infrastructure (I3)
26
Usage Scenario (3)
 Tree for normal operation
29
Usage Scenario (3)
 The receiver host fails
30
Usage Scenario (3)
 The middlebox process crashes
31
Usage Scenario (3)
 The middlebox host fails
32
Discussion
 Report loss
 Non-tree request structures
 Partial deployment
 Managing report traffic
 Security Considerations
33
WiDS Checker (X. Liu et al., NSDI '07)
34
Problem Description
 Log mining is both labor-intensive and fragile
 Latent bugs are often distributed across multiple
nodes
 Logs reflect incomplete information about an execution
 Non-determinism of distributed applications
35
Goals
 Efficiently verify application properties
 Provide fairly complete information about an
execution
 Reproduce the buggy runs deterministically and
faithfully
36
Approach
 Log the actual execution of a distributed system
 Apply predicate checking in a centralized simulator
over a run driven by testing scripts or replayed by logs
 Output violation reports along with message traces
 An execution is interpreted as a sequence of events, which are
dispatched to corresponding handling routines
37
Components
 A versatile scripting language
     Allows a developer to refine system properties into
    straightforward assertions
 A checker
     Inspects for violations
38
Architecture
 Components of WiDS Checker
39
Architecture
 Reproduce real runs
     Log all non-deterministic events using Lamport's logical
    clock
 Check user-defined predicates
     A versatile scripting language to specify the system states
    being observed and the predicates for invariants and
    correctness
 Screen out false alarms with auxiliary information
     For liveness properties
 Trace root causes using a visualization tool
40
Programming with WiDS
 WiDS APIs are mostly member functions of the
WiDSObject class
 WiDS runtime maintains an event queue to buffer
pending events and dispatches them to corresponding
handling routines
41
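The event-queue model described above can be sketched as follows. This does not reproduce the WiDS API: `EventLoop`, `PingNode`, and the event kinds are hypothetical names used only to show buffering of pending events and dispatch to handling routines.

```python
from collections import deque


class EventLoop:
    """Toy runtime: buffers pending events and dispatches them to handlers."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, kind, handler):
        self.handlers[kind] = handler

    def post(self, kind, payload=None):
        self.queue.append((kind, payload))

    def run(self):
        # Dispatch events in FIFO order until none are pending.
        while self.queue:
            kind, payload = self.queue.popleft()
            self.handlers[kind](payload)


class PingNode:
    """Hypothetical application object whose methods act as event handlers."""

    def __init__(self, loop):
        self.loop = loop
        self.log = []
        loop.register("timer", self.on_timer)
        loop.register("message", self.on_message)

    def on_timer(self, _payload):
        self.log.append("timer fired")
        self.loop.post("message", "ping")  # handlers may enqueue new events

    def on_message(self, payload):
        self.log.append(f"got {payload}")


loop = EventLoop()
node = PingNode(loop)
loop.post("timer")
loop.run()
```

Funnelling all activity through a single queue is what makes an execution describable as a sequence of events, which is the property logging and replay rely on.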
Enabling Replay
 Logging
     Log all WiDS nondeterminism
     Redirect OS calls and log the results
     Embed a Lamport clock in each outgoing message
 Checkpoint
     Support partial replay
     Save the WiDS process context
 Replay
     Start from the beginning or a checkpoint
     Replay events in serialized Lamport order
42
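A minimal sketch of the Lamport-clock idea behind logging and replay, under stated assumptions: the `Node` class, the shared in-memory log, and the tie-break by node name are illustrative choices, not the WiDS Checker implementation.

```python
class Node:
    """Hypothetical node that stamps every logged event with a Lamport clock."""

    def __init__(self, name, log):
        self.name = name
        self.clock = 0
        self.log = log  # shared list standing in for per-node log files

    def local_event(self, what):
        self.clock += 1
        self.log.append((self.clock, self.name, what))

    def send(self, payload):
        self.clock += 1
        self.log.append((self.clock, self.name, f"send {payload}"))
        # The sender's clock travels inside the outgoing message.
        return {"clock": self.clock, "payload": payload}

    def receive(self, msg):
        # Advance past both the local clock and the sender's clock.
        self.clock = max(self.clock, msg["clock"]) + 1
        self.log.append((self.clock, self.name, f"recv {msg['payload']}"))


def replay_order(log):
    """Serialize logged events by (Lamport clock, node name)."""
    return sorted(log)


log = []
a, b = Node("A", log), Node("B", log)
b.local_event("init")
msg = a.send("hello")
b.receive(msg)
a.local_event("done")
order = replay_order(log)
```

Sorting by clock value guarantees that each send is replayed before the matching receive, so a single simulator can re-execute the events of all nodes in a causally consistent order.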
Checker
 Observe memory state
 Define states and evaluate predicates
 Refresh database for each event
 Maintain history
 Re-evaluate modified predicates
 Auxiliary information for violations
 Liveness properties are only guaranteed to be true eventually
43
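The checking loop described above might look roughly like this. The function names, the state dictionary, and the lock example are hypothetical; the sketch covers only a safety predicate evaluated after each replayed event, not the scripting language or the liveness handling.

```python
def check(events, apply_event, predicates):
    """Replay events in order; after each one, refresh state and evaluate predicates.

    Returns a list of (event_index, predicate_name) violations.
    """
    state = {}
    violations = []
    for i, event in enumerate(events):
        apply_event(state, event)  # refresh the observed state
        for name, predicate in predicates.items():
            if not predicate(state):
                violations.append((i, name))
    return violations


# Hypothetical example: a lock that at most one node may hold.
def apply_lock_event(state, event):
    node, action = event
    holders = state.setdefault("holders", set())
    if action == "acquire":
        holders.add(node)
    elif action == "release":
        holders.discard(node)


predicates = {"at_most_one_holder": lambda s: len(s["holders"]) <= 1}

events = [("A", "acquire"), ("B", "acquire"), ("A", "release")]
violations = check(events, apply_lock_event, predicates)
```

Reporting the index of the offending event alongside the predicate name gives a starting point for tracing back through the message flow to the root cause.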
Visualization Tools
 Message flow graph
47
Evaluation
 Benchmark and result summary
48
Performance
 Running time for evaluating predicates
49
Logging Overhead
 Percentage of logging time
50
Discussion
 System is debugged by those who developed it
 Bugs are hunted by those who are intimately familiar
with the system
51