Automatic Misconfiguration Troubleshooting with PeerPressure


Automatic Misconfiguration
Troubleshooting with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun
Zhang, Yi-Min Wang
Microsoft Research
Presenter: Sara Salahi
Northwestern University
Agenda
• Importance of this work
• Key ideas
• PeerPressure: Architecture &
Algorithm
• Prototype
• Performance
• Future Work
Importance
• Tech support = 17% total cost of
ownership of today’s desktop PCs
• A large share of tech support effort is spent on
troubleshooting
• Many troubleshooting cases are due to
misconfiguration
• Misconfiguration is often caused by data
that is in shared persistent stores (e.g.
Windows registry)
• The authors focus on this last class of misconfiguration
Key Ideas: Misconfigurations
• Can have many different “root causes”
– Seemingly innocuous changes to shared
system configurations
– System bugs
– Security patches may introduce
incompatible registry settings
– Failed uninstallation of applications
– Manual intervention using Registry editor
Key Ideas: The Golden State
• “Golden State” – a perfect configuration
• Assume that the golden state is in the
mass: most machines hold the healthy configuration
• Combine statistical golden state with
Bayesian statistics to identify anomalous
misconfigurations on “sick” machines
Key Ideas: Goals of Troubleshooting
• Effectiveness
– System should identify a small set of
sick configuration candidates in a short
amount of time
• Automation
– Minimize number of manual steps and
number of users involved
PeerPressure: Architecture
(Figure: PeerPressure architecture, with numbered callouts)
1) Sick computer
2) I found you
3) Turns user- or machine-specific entries into canonicalized form
4) Database containing a number of machine configuration snapshots
5) Bayesian estimation used to calculate probability of a suspect being sick
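The canonicalization step in callout 3 can be illustrated with a minimal Python sketch. The slides do not list the actual rewrite rules, so the SID pattern, the placeholder names, and the function `canonicalize` are all assumptions made for illustration.

```python
import re

# Assumed pattern for a Windows user SID; the real rule set is not shown in the slides.
SID_PATTERN = r"S-1-5-21(?:-\d+)+"


def canonicalize(entry_name: str, user_name: str, machine_name: str) -> str:
    """Replace user- or machine-specific substrings with fixed placeholders.

    All three placeholder names are hypothetical.
    """
    parts = [("USER_SID", SID_PATTERN)]
    if user_name:
        parts.append(("USER_NAME", re.escape(user_name)))
    if machine_name:
        parts.append(("MACHINE_NAME", re.escape(machine_name)))
    # One combined pattern, so a placeholder is never rewritten by a later rule.
    combined = re.compile(
        "|".join(f"(?P<{name}>{pattern})" for name, pattern in parts),
        re.IGNORECASE,
    )
    return combined.sub(lambda match: f"<{match.lastgroup}>", entry_name)
```

After this step, the same entry captured on two machines with different user names maps to the same key, which is what lets it be compared against the snapshot database.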
PeerPressure: Architecture
• Manual Steps
– User runs faulty application to record suspects
– User determines if sickness is cured
• Manual steps involve only the
troubleshooting user and no second party
PeerPressure: Algorithm
• Intuition and Objectives
• e1: Probably healthy
• e2: Most probably sick
• e3: “Natural biological diversity”
• Type I: application configuration states
– e1 and e2
• Type II: operational states (timestamps, caches,
etc.)
– e3
– Want to weed these out; they are most likely false positives
PeerPressure: Algorithm
Formulation:
• Combining (3) with (1): when m = 0, P(S|V) = 1
• Bayesian estimation is used to overcome this.
• Vector pj: probability of the event happening with
outcome Vj; pj follows a Dirichlet
distribution.
• mj: number of sample values that match the suspect
value
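The slide equations did not survive extraction, so the following Python sketch only illustrates the idea under stated assumptions: a uniform prior P(S) = 1/t over t suspects, P(V|S) = 1/c for an entry of cardinality c, and a Dirichlet-smoothed healthy likelihood with one pseudo-count per value. The function names and parameter names are hypothetical.

```python
def ml_sick_probability(N: int, c: int, t: int, m: int) -> float:
    """P(S|V) with a maximum-likelihood estimate P(V|H) = m / N (assumed form)."""
    prior_sick = 1.0 / t           # assumed: exactly one of t suspects is sick
    numerator = (1.0 / c) * prior_sick
    likelihood_healthy = m / N     # collapses to 0 when no sample matches
    return numerator / (numerator + likelihood_healthy * (1.0 - prior_sick))


def sick_probability(N: int, c: int, t: int, m: int) -> float:
    """P(S|V) with a Dirichlet-smoothed estimate P(V|H) = (m + 1) / (N + c).

    N: number of sample machines, c: cardinality of the entry,
    t: number of suspects, m: samples matching the suspect value.
    """
    prior_sick = 1.0 / t
    numerator = (1.0 / c) * prior_sick
    likelihood_healthy = (m + 1) / (N + c)  # one pseudo-count per possible value
    return numerator / (numerator + likelihood_healthy * (1.0 - prior_sick))
```

Under these assumptions the smoothed form simplifies to (N + c) / (N + ct + cm(t - 1)). With N = 87, c = 2, t = 100, the maximum-likelihood version returns exactly 1 for m = 0, while the smoothed version returns 89/287, about 0.31, so an unmatched value is suspicious but not certain to be sick.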
PeerPressure: Algorithm
Asymptotic Analysis:
Prototype
• GeneBank Database: Microsoft SQL Server
2000 containing snapshots from 87 Windows
XP PCs
• PeerPressure troubleshooter implemented in
C#
• “Data Sanitization”
– Unification of different representations of the
same value
• Dual Intel Xeon 2.4 GHz CPU workstation with
1 GB of RAM hosts SQL Server
Performance
Response Time vs. Number of Suspects
• 20 real-world troubleshooting cases used
• Database queries dominate troubleshooting response
time (one query per suspect entry)
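The "one query per suspect entry" pattern can be sketched as follows. The prototype used Microsoft SQL Server 2000; this sketch swaps in Python's built-in `sqlite3` so it runs anywhere, and the `snapshots` table, its columns, and both function names are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the snapshot database (schema assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots (machine_id INTEGER, entry_name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO snapshots VALUES (?, ?, ?)",
        [
            (1, "App\\Mode", "on"),
            (2, "App\\Mode", "on"),
            (3, "App\\Mode", "off"),
            (1, "App\\Cache", "a1"),
        ],
    )
    return conn


def fetch_value_counts(conn: sqlite3.Connection, entry_name: str) -> dict:
    """Run one query for one suspect entry: value -> number of machines."""
    rows = conn.execute(
        "SELECT value, COUNT(*) FROM snapshots WHERE entry_name = ? GROUP BY value",
        (entry_name,),
    ).fetchall()
    return dict(rows)
```

Because each suspect costs one round trip in this design, response time grows with the size of the suspect set, which matches the slide's observation that queries dominate.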
Prototype: GeneBank
• Registry characteristics in GeneBank
• Unseen: a value that is unknown to
GeneBank; it increments the observed cardinality by 1
– Any entry from GeneBank therefore has a cardinality of at least 2
• Entries that do not exist on some sample
machines are given the value "no entry"
• When cardinality is low, conformity
among samples is strong
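The cardinality rules above can be sketched in Python. This is a minimal illustration, not the prototype's code: the `NO_ENTRY` marker and the function `entry_statistics` are hypothetical names, and a missing entry is represented here as `None`.

```python
from collections import Counter

NO_ENTRY = "<no entry>"  # hypothetical marker for machines that lack the entry


def entry_statistics(sample_values, suspect_value):
    """Return (N, cardinality, matches) for one entry.

    sample_values: one item per sample machine; None means the entry is absent.
    suspect_value: the value on the sick machine; None means the entry is absent.
    """
    normalized = [NO_ENTRY if value is None else value for value in sample_values]
    counts = Counter(normalized)
    # Observed distinct values plus one for the "unseen" value,
    # so the cardinality is at least 2 whenever any sample exists.
    cardinality = len(counts) + 1
    suspect = NO_ENTRY if suspect_value is None else suspect_value
    matches = counts[suspect]  # Counter returns 0 for values it has not seen
    return len(normalized), cardinality, matches
```

For example, an entry set to "on" on three sample machines and absent on a fourth has three values under this rule: "on", "no entry", and "unseen".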
Performance
Root-Cause Ranking Results
• 87% of entries have a cardinality of 2, 94% no more
than 3, and 97% no more than 4
Performance
False Positives
• Large cardinality of root-cause entry
• Relation between root-cause entry
and other entries in the suspect set
• GeneBank is not pristine
Performance
Impact of Sample Set Size
Performance
Sick Machine Sensitivity
Format: RootCauseRanking (NumberOfTies) / NumberOfSuspects
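The result format above can be produced with a short Python sketch. How the original evaluation counted ties is not stated in the slides; this sketch assumes a tie is any other suspect with exactly the same sick probability as the root cause, and `format_ranking` is a hypothetical helper.

```python
def format_ranking(probabilities: dict, root_cause: str) -> str:
    """Format a result as 'RootCauseRanking (NumberOfTies) / NumberOfSuspects'.

    probabilities: suspect entry name -> estimated sick probability.
    """
    target = probabilities[root_cause]
    # Rank 1 is the most suspicious entry; count the entries strictly above the root cause.
    rank = 1 + sum(1 for p in probabilities.values() if p > target)
    # Assumed definition: other suspects sharing the root cause's probability.
    ties = sum(
        1
        for name, p in probabilities.items()
        if name != root_cause and p == target
    )
    return f"{rank} ({ties}) / {len(probabilities)}"
```

A result of "1 (0) / N" means the root cause was ranked first with no ties among N suspects.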
Future Work
• Multi-gene troubleshooting
– Multiple sick entries among suspects
• Cross-application misconfiguration
• Heavy customization of apps can
break assumption of strong
conformance in most configuration
entries
• GeneBank maintenance – privacy issue