G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks
He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates
Abstract
● Best effort networks --> QoS
● Manage end-to-end service quality as a whole
● Generic Root Cause Analysis (G-RCA)
o Service Quality Management (SQM)
● FCAPS
Introduction
• Finding the root cause of errors
– transient errors
• Gathers information for network operators
• Helps Service Quality Management (SQM) for ISPs.
G-RCA Architecture
• Consists of five main components.
• G-RCA determines where and when to look for diagnostic events.
• Used for:
– Troubleshooting ongoing network problems
– Investigating past behavior.
Data Collection and Management
• Proactively collects data from the network, such as alarms, logs and performance measurements.
• Uses a data collector and database to store data
• “Events”
– event-name, location type, retrieval process and information
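A minimal sketch of how such an event definition might be represented, assuming a Python dataclass. The four fields mirror the attributes listed above; the class name `EventDefinition`, the retrieval function, and all example values are hypothetical and are not taken from G-RCA itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EventDefinition:
    """Hypothetical container for one event type."""
    name: str                              # event-name
    location_type: str                     # e.g. "interface" or "router"
    retrieve: Callable[[], List[Dict]]     # retrieval process
    info: str                              # free-form description


def fetch_link_down() -> List[Dict]:
    # Placeholder retrieval: a real collector would query the database.
    return [{"location": "r1:ge-0/0/1", "start": 100.0, "end": 160.0}]


link_down = EventDefinition(
    name="link-down",
    location_type="interface",
    retrieve=fetch_link_down,
    info="Interface went down (illustrative example only)",
)

instances = link_down.retrieve()
```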
Service Dependency Model
● Figure 2 is used to include the network elements associated with a problem
● The theory is hard to realize in practice
o Traffic sampling data
o Snapshots of router configs
Spatial-Temporal Correlation (1)
● How do we relate what has happened to a service problem?
● G-RCA defines a temporal and spatial joining rule
● Temporal Joining Rule
○ Defines a time window that allows a symptom and a diagnostic event to be joined.
○ 6 parameters: three each for the symptom and the diagnostic event
■ Left expansion margin
■ Right expansion margin
■ Expanding option (Start/End, Start/Start or End/End)
Spatial-Temporal Correlation (2)
○ A symptom and a diagnostic event are joined when their windows overlap.
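The temporal joining rule can be sketched as follows. This is an illustration under stated assumptions, not G-RCA's implementation: the class `TemporalRule` is hypothetical, and the reading of the expanding option (which event timestamps the left and right margins are applied to) is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

Window = Tuple[float, float]


@dataclass
class TemporalRule:
    left_margin: float    # seconds added before the window
    right_margin: float   # seconds added after the window
    option: str           # "start/end", "start/start" or "end/end"

    def expand(self, start: float, end: float) -> Window:
        # Assumed meaning: the option picks which timestamps anchor
        # the left and right edges of the expanded window.
        if self.option == "start/end":
            return (start - self.left_margin, end + self.right_margin)
        if self.option == "start/start":
            return (start - self.left_margin, start + self.right_margin)
        if self.option == "end/end":
            return (end - self.left_margin, end + self.right_margin)
        raise ValueError(f"unknown expanding option: {self.option}")


def temporally_joined(symptom: Window, sym_rule: TemporalRule,
                      diag: Window, diag_rule: TemporalRule) -> bool:
    """True when the two expanded windows overlap."""
    s_lo, s_hi = sym_rule.expand(*symptom)
    d_lo, d_hi = diag_rule.expand(*diag)
    return s_lo <= d_hi and d_lo <= s_hi


# Illustrative values only.
sym_rule = TemporalRule(left_margin=180, right_margin=5, option="start/start")
diag_rule = TemporalRule(left_margin=5, right_margin=5, option="start/end")
near = temporally_joined((1000, 1010), sym_rule, (900, 905), diag_rule)
far = temporally_joined((1000, 1010), sym_rule, (2000, 2010), diag_rule)
```

Here the symptom window expands to (820, 1005), so a diagnostic event around t=900 joins while one around t=2000 does not.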
Spatial-Temporal Correlation (3)
● Spatial Joining Rule
○ Symptom event location type
○ Diagnostic event location type
○ Joining level
● Joining level
○ Link symptom locations and diagnostic event locations together
● Model diagnostic signatures using diagnosis graph
● A symptom and diagnostic event pair is called a diagnosis rule
● G-RCA evaluates the time and location conditions against the collected data
● It then determines whether the diagnostic signature is present
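A minimal sketch of the spatial joining rule, assuming locations can be projected onto a common joining level through a lookup table. The table `TO_LEVEL`, the location names, and the function names are hypothetical; in practice the projection would come from the service dependency model rather than a static dictionary.

```python
from typing import Dict, Set, Tuple

# Hypothetical projection table:
# (location type, location, joining level) -> locations at that level.
TO_LEVEL: Dict[Tuple[str, str, str], Set[str]] = {
    ("bgp-session", "r1:10.0.0.2", "router"): {"r1"},
    ("interface", "r1:ge-0/0/1", "router"): {"r1"},
    ("interface", "r2:ge-0/0/3", "router"): {"r2"},
}


def project(loc_type: str, loc: str, level: str) -> Set[str]:
    """Map a location to the joining level (empty set if unknown)."""
    return TO_LEVEL.get((loc_type, loc, level), set())


def spatially_joined(sym_type: str, sym_loc: str,
                     diag_type: str, diag_loc: str, level: str) -> bool:
    """Join when both locations share an element at the joining level."""
    shared = project(sym_type, sym_loc, level) & project(diag_type, diag_loc, level)
    return bool(shared)


same_router = spatially_joined("bgp-session", "r1:10.0.0.2",
                               "interface", "r1:ge-0/0/1", "router")
other_router = spatially_joined("bgp-session", "r1:10.0.0.2",
                                "interface", "r2:ge-0/0/3", "router")
```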
Reasoning Logic
Rule-Based Reasoning Module
• Priority value in the diagnosis graph
– Assigned by operator
– A higher value means more confidence that the diagnostic event is the real root cause
– Can be examined by G-RCA’s Result Browser
• How does rule-based reasoning work?
Diagnosis graph for BGP flaps root cause analysis
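One way to picture priority-based reasoning is the sketch below: among the diagnostic events that joined with a symptom, pick the one whose rule carries the highest priority. The event names and priority values are hypothetical and are not read from the diagnosis graph figure; tie handling is left out.

```python
from typing import Dict, Iterable

# Hypothetical operator-assigned priorities (higher = more confidence).
PRIORITIES: Dict[str, int] = {
    "router-reboot": 300,
    "interface-flap": 200,
    "cpu-high": 100,
}


def rule_based_root_cause(joined_events: Iterable[str],
                          priorities: Dict[str, int],
                          default: str = "unknown") -> str:
    """Return the joined diagnostic event with the highest priority."""
    candidates = [e for e in joined_events if e in priorities]
    if not candidates:
        return default
    return max(candidates, key=priorities.__getitem__)


cause = rule_based_root_cause(["cpu-high", "interface-flap"], PRIORITIES)
no_evidence = rule_based_root_cause([], PRIORITIES)
```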
Bayesian Inference
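• Determining the root cause means identifying the root cause r that produces the following maximum likelihood ratio:
$$\arg\max_{r} \frac{p(r \mid e_1, \dots, e_n)}{p(\bar{r} \mid e_1, \dots, e_n)} = \arg\max_{r} \underbrace{\frac{p(r)}{p(\bar{r})}}_{\text{first term}} \cdot \underbrace{\frac{p(e_1, \dots, e_n \mid r)}{p(e_1, \dots, e_n \mid \bar{r})}}_{\text{second term}}$$
– Potential root causes (a set of r): classes
– Presence or absence of the diagnostic evidence and the symptom events themselves ($e_1, \dots, e_n$): features
• When the features are conditionally independent
– The second term can be decoupled to $\prod_{i=1}^{n} \frac{p(e_i \mid r)}{p(e_i \mid \bar{r})}$
• Parameter configuration (ratios of $p(r)/p(\bar{r})$ and $p(e_i \mid r)/p(e_i \mid \bar{r})$)
– bootstrap using the rule-based reasoning
– define a fuzzy type of discrete values
• Low, Medium, and High, which correspond to values 2, 100, and 20 000.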
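The likelihood-ratio scoring described above can be sketched as follows, assuming conditionally independent features so that per-feature ratios multiply (summed here in log space). The fuzzy values 2, 100 and 20 000 come from the slide; the root-cause names, the evidence names, and the simplification that only observed evidence contributes to the score are assumptions made for illustration.

```python
import math
from typing import Dict, Set, Tuple

# Fuzzy ratio levels from the slide.
FUZZY = {"low": 2.0, "medium": 100.0, "high": 20000.0}

# Hypothetical model: root cause -> (prior ratio level,
#                                    {evidence: likelihood ratio level})
Model = Dict[str, Tuple[str, Dict[str, str]]]

MODELS: Model = {
    "link-failure": ("low", {"interface-flap": "high"}),
    "cpu-overload": ("low", {"cpu-high": "medium", "interface-flap": "low"}),
}


def log_score(prior_level: str, evidence_levels: Dict[str, str],
              observed: Set[str]) -> float:
    """log(prior ratio) + sum of log(likelihood ratios) of observed evidence."""
    score = math.log(FUZZY[prior_level])
    for ev in observed:
        if ev in evidence_levels:
            score += math.log(FUZZY[evidence_levels[ev]])
    return score


def most_likely(models: Model, observed: Set[str]) -> str:
    """Root cause with the largest likelihood ratio."""
    return max(models, key=lambda r: log_score(*models[r], observed))


flap_cause = most_likely(MODELS, {"interface-flap"})
cpu_cause = most_likely(MODELS, {"cpu-high"})
```

Working in log space keeps products of values as large as 20 000 numerically well behaved when many features are combined.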
Comparison
• In operational practice, rule-based reasoning logic is often preferred over Bayesian inference
– Easier to configure
– Gives a simple and direct association between the diagnosed root cause and the evidence
– Effective in most applications
• However, there are a few cases where Bayesian inference is preferred
– The root cause condition is unobservable
Domain Knowledge Building
● Issue: The specification of a diagnosis graph for an SQM application offered by an operator, especially the initial version, can be inaccurate and incomplete.
● G-RCA addresses the concern of an incomplete diagnosis graph through iterative use of the Correlation Tester and Result Browser.
○ First, the operator filters out the symptom events with known root causes, using the root cause classification capability provided in the result browser.
○ Second, the operator can focus on the remaining symptom events by comparing them with other suspected diagnostic events that occur at the same time and are spatially related to the service problem.
Domain Knowledge Building
● On one hand, the second step can be done via the manual drill-down and data exploration capability in the result browser;
● On the other hand, operators can also run the correlation tester blindly between the symptom events without known root causes and each type of suspected diagnostic event.
● As G-RCA emphasizes usability, the newly uncovered diagnosis rules need to be verified by operators before being incorporated into the diagnosis graph.
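To illustrate the idea of blindly testing unexplained symptoms against each suspected diagnostic event type, the sketch below bins event timestamps and ranks candidates by plain Pearson correlation. This is a stand-in technique chosen for brevity; it is not claimed to be the statistical test the Correlation Tester uses, and the event names and timestamps are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def binned(times: List[float], bin_size: float, n_bins: int) -> List[int]:
    """0/1 indicator series: 1 if any event falls in the bin."""
    series = [0] * n_bins
    for t in times:
        idx = int(t // bin_size)
        if 0 <= idx < n_bins:
            series[idx] = 1
    return series


def pearson(x: List[int], y: List[int]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(vx * vy)


def rank_candidates(symptom_times: List[float],
                    candidates: Dict[str, List[float]],
                    bin_size: float, n_bins: int) -> List[Tuple[str, float]]:
    """Candidate event types sorted by correlation with the symptom series."""
    s = binned(symptom_times, bin_size, n_bins)
    scores = {name: pearson(s, binned(times, bin_size, n_bins))
              for name, times in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_candidates(
    [10, 130, 250],
    {"cpu-high": [12, 135, 255], "config-change": [70, 190]},
    bin_size=60, n_bins=5,
)
```

Highly ranked candidates would still be reviewed by an operator before any new rule is added to the diagnosis graph.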
Introduction of G-RCA Applications
• The key advantage of G-RCA in SQM is its capability to be rapidly
customized into different RCA applications in the ISP’s network.
• In this section, the following three case studies are included to demonstrate the effectiveness of G-RCA
– 1) customer BGP flaps
– 2) end-to-end throughput management in a CDN service
– 3) network PIM flaps in multicast VPN
BGP Flaps Root Cause Analysis
Purpose: Understanding the root cause of BGP flaps.
● G-RCA achieves this by constructing application-specific events and rules.
○ Start by constructing the BGP flap-specific events.
○ Add a few application-specific diagnosis rules.
○ Specify priorities for the different diagnosis rules for BGP flaps RCA.
(Please refer to the figure “Diagnosis graph for BGP flaps root cause analysis” shown in the previous slides)
Application-specific events for BGP flaps root cause analysis
Conclusion
1. It captures the layered network model in its knowledge library, by implementing
- temporal/spatial correlation,
- rule-based reasoning, and
- Bayesian inference.
2. Domain knowledge in existing RCA applications can be refined through the interaction between the RCA engine and the Correlation Tester.
3. In order to analyse a large number of service quality issues and classify and trend their root causes, it proactively collects all types of data from different sources and normalizes them in real time.