
Connecting the Dots:
Using Runtime Paths for Macro Analysis
Mike Chen
[email protected]
http://pinpoint.stanford.edu
1/14/2003
ROC Retreat
Motivation
 Divide and conquer, layering, and replication are
fundamental design principles
– e.g. Internet systems, P2P systems, and sensor networks
 Execution context is dispersed throughout the system
=> difficult to monitor and debug
 Lots of existing low-level tools that help with debugging
individual components, but not a collection of them
– Much of the system is in how the components are put together
 Observation: a widening gap between the systems we
are building and the tools we have
Current Approach
[Figure: example systems — a tiered Internet service (Apache, JavaBeans, Database) alongside peer and sensor nodes]
Current Approach
 Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
 Scenario
– A user reports request A1 failed
– You try the same request, A2, but it works fine
– What to do next?
[Figure: requests A1 and A2 each traverse Apache, JavaBeans, and Database; gdb inspects one JavaBean's variables (X, Y) in isolation]
Macro Analysis
 Macro analysis exploits non-local context to
improve reliability and performance
– Performance examples: Scout, ILP, Magpie
 Statistical view is essential for large, complex
systems
 Analogy: micro analysis lets you understand an individual honeybee in detail; macro analysis is needed to understand how the bees interact to keep the beehive functioning
Observation
 Systems have a single system-wide execution path associated with each request
– E.g. request/response, one-way messages
 Scout, SEDA, Ninja use paths to specify how to
service requests
 Our philosophy
– Use only dynamic, observed behavior
– Application-independent techniques
Our Approach
 Use runtime paths to connect the dots!
– dynamically capture the interactions and dependencies between components
– look across many requests to get the overall system behavior
• more robust to noise
 Components are only partially known (“gray boxes”)
[Figure: a path threads one request through Apache, JavaBean, and Database components]
Our Approach
 Applicable to a wide range of systems
[Figure: paths A–G through tiered Internet components (Apache, JavaBeans, Database) and peer/sensor nodes]
Open Challenges in Systems Today
1. Deducing system structure
– manual approach is error-prone
– static analysis doesn’t consider resources
2. Detecting application-level failures
– often don’t exhibit lower-level symptoms
3. Diagnosing failures
– failures may manifest far from the actual faults
– multi-component faults

Goal: reduce time to detection, recovery,
diagnosis, and repair
Talk Outline
 Motivation
 Model and architecture
 Applying macro analysis
 Future directions
Runtime Paths
 Instrument code to dynamically trace requests
through a system at the component level
– record call path + the runtime properties
– e.g. components, latency, success/failure, and
resources used to service each request
 Use statistical analysis to detect and diagnose problems
– e.g. data mining, machine learning, etc.
 Runtime analysis tells you how the system is
actually being used, not how it may be used
 Complements existing micro analysis tools
Architecture
 Tracer
– Tags each request with a unique ID, and carries it with the request throughout the system
– Reports observations (component name + resource + performance properties) for each component
 Aggregator + Repository
– Reconstructs paths and stores them
 Declarative Query Engine
– Supports statistical queries on paths
– Data mining and machine learning routines
 Visualization
[Figure: reusable analysis framework — a request flows through traced components A–F; tracers report to an aggregator, which feeds a path repository, query engine, and visualization used by developers/operators]
Request Tracing
 Challenge: maintaining an ID with each request
throughout the system
 Tracing is platform-specific but can be made application-generic and reused across applications
 2 classes of techniques
– Intra-thread tracing
• Use per-thread context to store request ID (e.g. ThreadLocal in
Java)
• ID is preserved if the same thread is used to service the request
– Inter-thread tracing
• For extensible protocols like HTTP, inject new headers that will be
preserved (e.g. REQ_ID: xx)
• Modify RPC to pass the request ID under the covers
• Piggyback onto messages
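The slide names Java's ThreadLocal for intra-thread tracing; as a rough illustration, here is the analogous sketch in Python using `threading.local`. The function names and the `REQ_ID` header spelling are assumptions for the sketch, not the Pinpoint implementation:

```python
import threading
import uuid

# Per-thread storage for the current request ID
# (analogous to ThreadLocal in Java).
_context = threading.local()

def start_request():
    """Tag an incoming request with a unique ID; the ID is preserved
    as long as the same thread services the request."""
    _context.request_id = uuid.uuid4().hex
    return _context.request_id

def current_request_id():
    return getattr(_context, "request_id", None)

def outgoing_headers():
    """Inter-thread tracing: piggyback the ID on an extensible protocol
    (e.g. an injected HTTP header) so the next tier can pick it up."""
    return {"REQ_ID": current_request_id()}

def accept_request(headers):
    """Next tier: adopt the caller's ID instead of minting a new one."""
    _context.request_id = headers["REQ_ID"]
```

A downstream component calls `accept_request` on arrival, so every observation it reports carries the same request ID as the originating tier.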
Talk Outline
 Motivation
 Model and architecture
 Applying macro analysis
– Inferring system structure
– Detecting application-level failures
– Diagnosing failures
 Future directions
Inferring System Structure
 Key idea: paths directly capture application structure
[Figure: 2 requests overlaid to reveal the application structure their paths share]
Indirect Coupling of Requests
 Key idea: paths associate requests with internal state
 Trace requests from web server to database
– Parse client-side SQL queries to get sharing of db tables
– Straightforward to extend to more fine-grained state (e.g.
rows)
[Figure: bipartite graph linking request types to the database tables they touch]
Failure Detection and Diagnosis
 Detecting application-level failures
– Key idea: paths change under failures => detect
failures via path changes.
 Diagnosing failures
– Key idea: bad paths touch root cause(s). Find
common features.
Future Directions
 Key idea: violations of macro invariants are signs of a buggy implementation or an intrusion
 Message paths in P2P and sensor networks
– a general mechanism to provide visibility into the
collective behavior of multiple nodes
– micro or static approaches by themselves don’t work
well in dynamic, distributed settings
– e.g. algorithms have upper bounds on the # of hops
• Although hop count violation can be detected locally, paths
help identify nodes that route messages incorrectly
– e.g. detecting nodes that are slow or corrupt msgs
Conclusion
 Macro analysis fills the gap in monitoring and debugging systems where local context alone is insufficient
 The runtime path-based approach dynamically traces request paths and statistically infers macro properties
 A shared analysis framework that is reusable across
many systems
– Simplifies the construction of effective tools for other systems
and the integration with recovery techniques like RR
 http://pinpoint.stanford.edu
– Paper includes a commercial example from Tellme!
(thanks to Anthony Accardi and Mark Verber)
Backup Slides
Current Approach
 Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
[Figure: two request paths through Apache, JavaBeans, and Database; each component is annotated with its local variable values (X, Y), and gdb attaches to a single JavaBean]
Related Work
 Commercial request tracing systems
– Announced in 2002, a few months after Pinpoint was developed
– PerformaSure and AppAssure focus on performance problems
– IntegriTea captures and plays back failure conditions
– Focus on individual requests rather than overall behavior, and on recreating the failure condition
 Extensive work in event/alarm correlation, mostly in the context of network management (i.e. IP)
– Don’t directly capture relationships between events
– Rely on human knowledge or use machine learning to suppress alarms
 Distributed debuggers
– PDT, P2D2, TotalView, PRISM, pdbx
– Aggregate views from multiple components, but do not capture relationships and interactions between components
– Comparative debuggers: Wizard, GUARD
 Dependency models
– Most are statically generated and are likely to be inconsistent
– Brown et al. take an active, black-box approach, but it is invasive; Candea et al. dynamically trace failure propagation
1. Detecting Failures using Anomaly Detection
 Key idea: paths change under failures
=> detect failures via path changes
 Anomalies
– Unusual paths
– Changes in distribution
– Changes in latency/response time
 Examples:
– Error paths are shorter.
– User behavior changes under failures
• Retries a few times, then gives up
 Implement as long running queries (i.e. diff)
 Challenges:
– detecting application-level failures
– comparing sets of paths
2. Root-cause Analysis
 Key idea: all bad paths touch root cause, find common
features
 Challenge: a small set of known bad paths and a large
set of maybes
 Ideally want to correlate and rank all combinations of
feature sets
– E.g. association rules mining
– May get false alarms because the root cause may not be one of
the features
 Automatic generation of dynamic functional and state
dependency graphs
– Helps developers and operators understand inter-component
dependency and inter-request dependency
– Input to recovery algorithms that use dependency graphs
3. Verifying Macro Invariants
 Key idea: violations of high-level invariants are signs of
intrusion or bugs
 Example: Peer auditing
– Problem: A small number of faulty or malicious nodes can bring
down the system
– Corruption should be statistically visible in a peer’s behavior
• look for nodes that delay or corrupt messages or route messages
incorrectly
– Apply root-cause analysis to locate the misbehaving peers
• Some distributed auditing is necessary
 Example: P2P implementation verification
– Problem: are messages delivered as specified by the algorithms?
– Detect extra hops, loops, and verify that the paths are correct
– Can implement as a query:
• select length from paths where (length > log2(N))
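The query above could be evaluated over collected paths roughly as follows. This is a minimal sketch that assumes each path is recorded as the list of node IDs a message visited; the function name is made up for illustration:

```python
import math

def hop_count_violations(paths, n_nodes):
    """Return the paths that violate the algorithm's log2(N) upper
    bound on hops — i.e. the rows the declarative query would select."""
    bound = math.log2(n_nodes)
    # hops = number of edges = nodes visited minus one
    return [p for p in paths if len(p) - 1 > bound]
```

Each flagged path can then be handed to root-cause analysis to find the nodes that route messages incorrectly.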
4. Detecting Single Point of Failure
 Key idea: paths converge on a single point of failure
 Useful for finding out what to replicate to improve availability
 P2P example:
– Many P2P systems rely on overlay networks, which are typically built on top of the IP infrastructure
– It’s common for several overlay links to fail together if they depend on a shared physical IP link that failed
 Implement as a query:
– intersect edge.IP_links from paths
[Figure: overlay of peers A–G whose overlay links share underlying physical IP links]
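The intersect query on this slide could be sketched as a set intersection over recorded paths. The data layout (each path as a list of overlay edges, each edge annotated with the physical IP links it crosses) and the function name are assumptions for illustration:

```python
def shared_ip_links(paths):
    """Intersect the physical IP links underlying every path's overlay
    edges; a non-empty result is a candidate single point of failure."""
    link_sets = [
        {link for edge in path for link in edge["ip_links"]}
        for path in paths
    ]
    return set.intersection(*link_sets) if link_sets else set()
```

Anything the intersection returns is infrastructure that every observed path depends on, and thus a candidate for replication.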
5. Monitoring of Sensor Networks
 An emerging area with primitive tools
 Key idea: use paths to reconstruct topology and
membership
 Example:
– Membership
• select unique node from paths
– Network topology
• for directed information dissemination
 Challenge: limited bandwidth
– Can record a (random) subset of the nodes for each
path, then statistically reconstruct the paths
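The membership and topology queries above are simple aggregations over paths. A minimal sketch, assuming each path is recorded as the ordered list of sensor node IDs a message traversed (function names are illustrative):

```python
def membership(paths):
    """'select unique node from paths' — every node seen on any path."""
    return {node for path in paths for node in path}

def topology_edges(paths):
    """Reconstruct the routing topology from consecutive hops."""
    return {(a, b) for path in paths for a, b in zip(path, path[1:])}
```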
Macro Analysis
 Look across many requests to get the overall
system behavior
– more robust to noise
[Figure: macro analysis correlates failure markers (X) across components A–C over requests 1–4]
Properties of Network Systems
 Web services, P2P systems, and sensor
networks can have tens of thousands of nodes
each running many application components
 Continuous adaptation provides high availability,
but also makes it difficult to reproduce and
debug errors
 Constant evolution of software and hardware
Motivation
 Difficult to understand and debug network systems
– e.g. Clustered Internet systems, P2P systems and sensor
networks
– Composed of many components
– Systems are becoming larger, more dynamic, and more
distributed
 Workload is unpredictable and impractical to simulate
– Unit testing is necessary but insufficient. Components break
when used together under real workload
 Don’t have tools that capture the interactions between
components and the overall behavior
– Existing debugging tools and application-level logs only do
micro analysis
Macro vs Micro Analysis
 Resolution
– Macro: component-level; complements micro analysis tools
– Micro: line or variable
 Overhead
– Macro: low; can be used in actual deployment
– Micro: high; typically not used in deployment other than application logs
What’s a dynamic path?
 A dynamic path is the (control flow + runtime properties) of a request
– Think of it as a stack trace across process/machine boundaries with runtime properties
– Dynamically constructed by tracing requests through a system
 Runtime properties
– Resources (e.g. host, version)
– Performance properties (e.g. latency)
– Arguments (e.g. URL, args, SQL statement)
– Success/failure
[Figure: a request’s path through components A→B/C and A→D→E→F; each observation records RequestID: 1, Seq Num: 1, Name: A, Host: xx, Latency: 10ms, Success: true, …]
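The observation record on this slide, and the aggregator step that reconstructs a path from it, might be modeled as follows. The field and function names are assumptions mirroring the slide, not the actual Pinpoint schema:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One tracer report, matching the fields shown on the slide."""
    request_id: int
    seq_num: int
    name: str        # component name, e.g. "A"
    host: str
    latency_ms: float
    success: bool

def reconstruct_path(observations):
    """Aggregator: order a request's observations by sequence number
    to recover the component-level path."""
    return [o.name for o in sorted(observations, key=lambda o: o.seq_num)]
```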
Related Work
 Micro debugging tools
– RootCause provides extensible logging of method calls and
arguments.
– Diduce looks for inconsistencies in variable usage.
– Complements macro analysis tools.
 Languages for monitoring
– InfoSpect looks for inconsistencies in system state using a logic
language
 Network flow-based monitoring
– RTFM and Cisco NetFlow classify and record network flows
 Statistical and data mining languages
– S, DMQL, WebML
Visualization Techniques
 Tainted paths: mark all flows that
have a certain property (e.g. failed
or slow) with a distinct color and
overlay it on the graph
 Detecting performance bottlenecks:
look for replicated nodes that have
different colors
 Detecting anomalies: look for missing edges and unknown paths
Pinpoint Framework
[Figure: requests #1–#3 flow through components A–C over a communications layer (tracing & internal F/D); logs (“1, success / 2, fail / 3, …”) and trace records (“1,A / 1,C / 2,B …”), plus external F/D, feed statistical analysis that outputs detected faults]
Experimental Setup
 Demo app: J2EE Pet Store
– e-commerce site w/~30 components
 Load generator
– replay trace of browsing
– Approx. TPCW WIPSo load (~50% ordering)
 Fault injection parameters
– Trigger faults based on combinations of used components
– Inject exceptions, infinite loops, null calls
 55 tests with single-component faults and interaction faults
– 5-min runs of a single client (J2EE server limitation)
Application Observations
 # of components used in a dynamic web page request:
– median 14, min 6, max 23
 large number of tightly coupled components that are always used together
[Figure: per-component chart of # of tightly coupled components (0–12), for Pet Store components such as productItemList, cart, InventoryEJB, SignOnEJB, and ShoppingCartEJB, against cumulative % of total requests (0–100%)]
100%
Metrics
[Figure: Venn diagram — Predicted Faults (P) and Actual Faults (A), with their overlap the Correctly Identified Faults (C)]
 Precision: C/P
 Recall: C/A
 Accuracy: whether all actual faults are
correctly identified (recall == 100%)
– boolean measure
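The three metrics above are straightforward to compute from the predicted and actual fault sets; a minimal sketch (function name is illustrative):

```python
def score(predicted, actual):
    """Precision C/P, recall C/A, and the boolean accuracy measure
    (all actual faults correctly identified, i.e. recall == 100%)."""
    correct = predicted & actual  # C: predicted faults that are real
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    accuracy = actual <= predicted  # every actual fault was predicted
    return precision, recall, accuracy
```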
4 Analysis Techniques
 Pinpoint: clusters of components that
statistically correlate with failures
 Detection: components where Java exceptions
were detected
– union across all failed requests
– similar to what an event monitoring system outputs
 Intersection: intersection of components used
in failed requests
 Union: union of all components used in failed
requests
Results
[Figure: bar chart of average accuracy and average precision for Pinpoint, Detection, Intersection, and Union (values shown include 100%, 83%, 77%, 76%, 50%, 33%, 15%, and 11%); line chart of average accuracy vs. # of interacting components (1–4)]
 Pinpoint has high accuracy with relatively high precision
Pinpoint Prototype Limitations
 Assumptions
– client requests provide good coverage over
components and combinations
– requests are autonomous (don’t corrupt state and
cause later requests to fail)
 Currently can’t detect the following:
– faults that only degrade performance
– faults due to pathological inputs
 Single-node only
Current Status
 Simple graph visualization
Proposed Research
 3 classes of large network systems
– Clustered Internet systems
• Tiered architecture, high bandwidth, many replicas
– Peer-to-peer (P2P) systems, including sensor
networks
• Widely distributed nodes, dynamic membership
– Sensor networks
• Limited storage, processing, and bandwidth.
P2P Systems: Tracing
 Trace messages by piggybacking the current
node name to the messages
 Tracing overhead
– Assume 32 bits per node name and a very conservative log2(N) hops for each msg
– Data overhead is 40% for a 1500-byte message in a 10^6-node system
P2P Systems: Implementation Verification
 Current debugging techniques: lots of printf()’s on each
node and manually correlate the paths taken by
messages
 How do you know the messages are delivered as
specified by the algorithms?
 Use message paths to check for routing invariants
– detect extra hops, loops, and verify that the paths are correct
 Can implement as a query:
– select length from paths where (length > log2(N))