application - Networked Systems Laboratory

Real-time Application Monitoring and Diagnosis for Service Hosting Platforms of Black Boxes
Huadong Liu (U. of Tennessee)
Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America)
Presented by: Hui Zhang
©NEC Laboratories America
Outline
Motivation
SRAMD architecture
Application component dependency discovery
Evaluation
Conclusions
Motivation
[Figure: applications App. 1, App. 2, App. 3, and App. 4]
• Service hosting systems
  - Web farms, service-oriented utility computing networks, peer-to-peer service-composition-based computing grids, …
• Service management
  - Fault diagnosis, capacity planning, performance analysis, impact analysis, etc.
• Challenges
  - Application components are usually delivered as black boxes without sufficient instrumentation
  - The huge amount of logging information in large-scale systems makes real-time monitoring and debugging unrealistic with a centralized approach
An intuition of the SRAMD Art
Source: www.pictureMOSAICs.com
Scalable Real-time Application Monitoring and Diagnosis
• SRAMD: an extensible tool that is
  - easy to deploy,
  - scalable, and
  - able to effectively profile the intricate dependency relationships among interacting application components seen as black boxes.
• Our approach
  - uses low-level packet traces instead of high-level event traces to gain insight into application components
  - has end-system instrumentation for close observation of the correlation between application performance and local resource utilization, and for enabling a rich set of queries for diagnosis
  - understands the overall system/application behavior and performance by aggregating and correlating summarizations from distributed components
SRAMD in Operation
[Figure: applications X, Y, and Z running on a hosting server. Annotations: "An extensible framework for application topology discovery, capacity planning and performance debugging" and "An application-level passive resource monitor with active summarization".]
The SRAMD Controller
• Collector
  - passively collects summarization data from distributed monitors through UDP.
• Aggregator
  - retrieves and validates information blocks available in the repository, and organizes them into per-application groups.
• Visualizer
  - constructs in-memory DOT files [DOT] using outputs from the aggregator and calls Grappa [Grappa] to visualize application topologies enriched with component traffic statistics and causal probabilities.
• Diagnosis
  - generates probing requests to related monitors with operator interaction to get detailed information about application components and to isolate possible bottlenecks for performance debugging.
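As a rough illustration of the Visualizer step, the sketch below builds DOT text in memory from aggregated edges. The function name `build_dot`, the tuple layout, and the label format are assumptions made for this example; the slides do not describe the controller's actual data model or how it invokes Grappa.

```python
def build_dot(edges):
    """Return DOT text for a directed application topology.

    edges: iterable of (src, dst, requests, causal_probability) tuples.
    The field layout is illustrative, not SRAMD's actual data model.
    """
    lines = ["digraph topology {"]
    for src, dst, requests, prob in edges:
        # Annotate each edge with traffic statistics and causal probability.
        label = f"{requests} req, p={prob:.2f}"
        lines.append(f'    "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical aggregator output for two component pairs.
dot_text = build_dot([("W1", "A1", 120, 0.95), ("A1", "D1", 80, 0.62)])
print(dot_text)
```

The resulting string could be handed to any DOT-compatible viewer; keeping it in memory avoids writing intermediate files between aggregation and display.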
The SRAMD Controller snapshot
The SRAMD Monitor
• Periodically probe the CPU, memory and disk usage of every registered application component.
• Passively capture network traffic and associate captured packets with registered application components.
• Actively calculate useful local application statistics and dependencies from packet traces.
• Temporarily perform diagnosis tasks on demand to assist performance diagnosis and debugging.
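The packet-to-component association can be pictured with the minimal sketch below. It assumes that components are registered by (protocol, local port) and that captured packets are already available as plain tuples; the registry contents, the `associate` function, and the tuple layout are all hypothetical, and real capture (for example through a packet-capture library) is outside the scope of the sketch.

```python
from collections import defaultdict

# Hypothetical registry: (protocol, local port) -> application component ID.
REGISTRY = {("tcp", 8080): "web-1", ("tcp", 1527): "db-1"}


def associate(packets, registry):
    """Sum packet and byte counts per registered component.

    packets: iterable of (protocol, src_port, dst_port, size_bytes).
    A packet matches a component if either port is registered locally.
    """
    stats = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for proto, src_port, dst_port, size in packets:
        comp = registry.get((proto, dst_port)) or registry.get((proto, src_port))
        if comp is None:
            continue  # traffic of unregistered applications is ignored
        stats[comp]["packets"] += 1
        stats[comp]["bytes"] += size
    return dict(stats)


# Made-up trace: two packets for web-1, one unregistered, one for db-1.
trace = [
    ("tcp", 40001, 8080, 1500),
    ("tcp", 8080, 40001, 600),
    ("udp", 5353, 5353, 90),
    ("tcp", 40002, 1527, 200),
]
summary = associate(trace, REGISTRY)
print(summary)
```

Only the per-component summary would need to leave the host, which is in the spirit of the monitor's active summarization.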
Application component dependency discovery
[Figure: time line of requests r1, r2, and r3 passing through application components A, B, C, D, and E]
• Given two application components A and B in the system, we want to discover the following real-time dependency relationship between A and B during a time interval:
  - Are the input requests of one component caused by the other (directly or indirectly)? If so, in what percentage?
Dealing with transient connections
• Local Dependency Discovery (LDD)
  - Find the IDs of peer application components that local ones talked to in the last report interval. Every SRAMD monitor sends a list of (LocalPort, AppCompID) pairs to the monitor at every hosting server that the communicating application components are running on.
  - Count the number of requests (including nested requests) between application components and calculate the probability of their causal dependency.
  - Although requests may appear nested by accident, if the same nesting relationship appears with high probability, the nesting very likely represents a causal dependency between application components.
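One way to read the counting step is sketched below. It assumes each request is observed as a time interval at the local component (for transient connections, the connection lifetime could serve as that interval) and treats an outgoing request as nested when it falls entirely inside an incoming one. The estimator, the function name `causal_probabilities`, and the sample data are assumptions for illustration; the slides do not give the exact formula.

```python
from collections import Counter


def causal_probabilities(incoming, outgoing):
    """Estimate how often a request from src contains a nested call to dst.

    incoming: list of (src, start, end) requests received locally.
    outgoing: list of (dst, start, end) requests issued locally.
    Returns {(src, dst): fraction of src's requests containing a call to dst}.
    """
    totals = Counter(src for src, _, _ in incoming)
    nested = Counter()
    for src, in_start, in_end in incoming:
        # Destinations contacted within the lifetime of this incoming request.
        dsts = {
            dst
            for dst, out_start, out_end in outgoing
            if in_start <= out_start and out_end <= in_end
        }
        for dst in dsts:
            nested[(src, dst)] += 1
    return {pair: count / totals[pair[0]] for pair, count in nested.items()}


# Made-up intervals: W1 sends three requests, two of which contain a call to D1.
incoming = [("W1", 0.0, 1.0), ("W1", 2.0, 3.0), ("W1", 4.0, 5.0), ("W2", 6.0, 7.0)]
outgoing = [("D1", 0.2, 0.8), ("D1", 2.1, 2.9), ("D2", 6.2, 6.5)]
probs = causal_probabilities(incoming, outgoing)
print(probs)
```

A pair whose fraction stays high across report intervals would be reported as a likely causal dependency, while accidental nesting should yield low fractions.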
Dealing with persistent connections and connectionless communications
• Traffic Regulation based Component Dependency Discovery (TRCDD)
  - Divert-socket-based traffic regulation. Under investigation.
[Figure: traffic A->B and B->C]
Evaluation: SRAMD overhead (1)
• Experiment setup
[Figure: experiment setup. Labels: SRAMD Controller; Sender (thrulay); UDP packets over Gigabit Ethernet; SRAMD Monitor; Receiver (thrulayd); Intel 2.8GHz SMP.]
Evaluation: SRAMD overhead (2)
• CPU overhead of the SRAMD monitor with bulk UDP traffic using different packet sending rates and packet sizes
Evaluation: SRAMD overhead (3)
• CPU overhead of packet-application matching and sniffing
  - Data rate 100 Mb/s and packet size 1500 bytes.

Association Probability   1/250k   0.01   0.02   0.03   0.04
CPU Overhead (%)          4.22     5.53   6.15   7.12   7.86
Evaluation: LDD algorithm (1)
• Experiment setup
[Figure: logic view with clients C (httperf), web servers W1 and W2, application servers A1 and A2, and database servers D1 and D2; physical view with web servers (tinyproxy), application server instances I1 and I2, and database servers (derby).]
Evaluation: LDD algorithm (2)
• Causal probability as observed on application server A1 with different numbers of concurrent clients
Conclusions and Future Work
• An unobtrusive application-level monitoring and diagnosis tool that does not make any assumptions about the traced applications.
• Two schemes to infer dependency relationships of application components in different scenarios.
• An initial assessment of the quality and overhead of application-level packet tracing and an evaluation of the statistical dependency discovery scheme.
• Possible extensions
  - A kernel module to obtain per-application disk read/write statistics
  - Application of data mining techniques to packet traces
Thanks!
• Questions?
Backup slides
Calculate Response Time from Traces
[Figure: time line of requests a, b, and c among WS, AS, and DS, with timestamps t1 to t5]