Automatic Misconfiguration Troubleshooting with PeerPressure

Automatic Misconfiguration Diagnosis with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang
Microsoft Research
OSDI 2004, San Francisco, CA

Misconfiguration Diagnosis
• Technical support contributes 17% of TCO [Tolly2000]
• Much application malfunctioning comes from misconfiguration
• Why?
  – Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications
• How about maintaining a golden configuration state?
  – Very hard [Larsson2001]
    • Complex software components and compositions
    • Third-party applications
    • …

Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Goals
• Effectiveness
  – A small set of sick-configuration candidates that contains the root-cause entries
• Automation
  – No second-party involvement
  – No need to remember or identify what is healthy

Intuition behind PeerPressure
• Assumption
  – Applications function correctly on most machines; malfunctioning is an anomaly
• Succumb to the peer pressure

An Example

    Suspects   Mine   P1's   P2's   P3's   P4's
    e1         0      1      1      1      1
    e2         on     on     on     on     off
    e3         57     4      0      100    34

• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state (e.g., a timestamp or usage count)
• We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric

System Overview
Registry Entry Suspects
App
Tracer
Entry
Data
HKLM\Software\Msft\...
On
HKLM\System\Setup\...
0
HKCU\%\Software\...
null
Run the
faulty app
Canonicalizer
Troubleshooting Result
Entry
Prob.
HKLM\Software\Msft\...
0.6
HKLM\System\Setup\...
0.2
HKCU\%\Software\...
0.003
Peer-to-Peer
Troubleshooting
Community
Search
& Fetch
Database
Statistical
Analyzer
PeerPressure
7
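To make the flow concrete, here is a minimal Python sketch of the pipeline (the prototype itself is in C#; every name, the canned suspect set, and the naive disagreement ratio below are illustrative assumptions, not the prototype's actual components -- the real ranking metric is the Bayesian estimate on the next slide):

    # Illustrative sketch of the PeerPressure pipeline; not the prototype's code.
    def trace_faulty_app():
        # App Tracer: run the faulty app and record the registry entries it
        # touches, with the values found on the sick machine (canned here).
        return {r"HKLM\Software\Msft\...": "On",
                r"HKLM\System\Setup\...": "0"}

    def canonicalize(value):
        # Canonicalizer: unify value representations (crude stand-in; see the
        # prototype slide for the motivating examples).
        return str(value).strip().strip('"').lstrip("#").lower()

    def troubleshoot(peer_snapshots):
        # peer_snapshots: a list of {entry: value} dicts fetched from the database.
        suspects = {e: canonicalize(v) for e, v in trace_faulty_app().items()}
        result = {}
        for entry, mine in suspects.items():
            # Search & fetch: the same entry from each peer snapshot.
            peers = [canonicalize(s.get(entry)) for s in peer_snapshots]
            # Statistical Analyzer: here a naive disagreement ratio stands in
            # for the Bayesian sick probability defined on the next slide.
            result[entry] = sum(p != mine for p in peers) / len(peers)
        return sorted(result.items(), key=lambda kv: kv[1], reverse=True)
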
The Sick Probability
• P(Sick) = (N + c) / (N + ct + cm(t - 1))
  – N: the number of samples
  – c: the cardinality (number of distinct values) of the entry
  – t: the number of suspects
  – m: the number of samples that match the suspect entry's value
• Properties:
  – As m increases, P decreases
  – As c increases, P decreases; when m = 0, smaller c implies smaller P

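As a worked check, a sketch applying this formula to the numbers from the example slide (my assumptions: N = 4 peer samples, t = 3 suspects, and c counts the distinct values observed for each entry):

    def sick_probability(N, c, t, m):
        # P(Sick) = (N + c) / (N + c*t + c*m*(t - 1)), per the slide.
        return (N + c) / (N + c * t + c * m * (t - 1))

    # e1: mine = 0  vs. peers (1, 1, 1, 1)       -> m = 0, assumed c = 2
    # e2: mine = on vs. peers (on, on, on, off)  -> m = 3, assumed c = 2
    # e3: mine = 57 vs. peers (4, 0, 100, 34)    -> m = 0, assumed c = 5
    for name, c, m in [("e1", 2, 0), ("e2", 2, 3), ("e3", 5, 0)]:
        print(name, round(sick_probability(N=4, c=c, t=3, m=m), 2))
    # -> e1 0.6, e2 0.27, e3 0.47

The resulting ranking e1 > e3 > e2 reproduces the intuition from the example: a value mismatch on a low-cardinality entry is the strongest sickness signal, while a high-cardinality "operational" entry like e3 is penalized.
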
The PeerPressure Prototype
• Database of 87 live Windows XP registry snapshots as our sample pool
  – The Registry is hierarchical persistent storage for named, typed entries
• PeerPressure troubleshooter implemented in C#
• Needed to “sanitize” the entry values
  – e.g., 1, "1", and "#1" can represent the same value
  – Heuristics: unifying values of entries with different types (see the sketch below)

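A minimal sketch of the kind of unification heuristic meant here (the exact rules are not given on the slide; this stand-in only covers the 1 / "1" / "#1" case):

    def sanitize_value(raw):
        # Map different representations of the same logical value to one
        # canonical string, e.g. the integer 1, the string "1", and "#1".
        s = str(raw).strip().strip('"')
        if s.startswith("#"):
            s = s[1:]
        try:
            return str(int(s))       # 1, "1", "#1" all become "1"
        except ValueError:
            return s.lower()         # fall back to case-insensitive strings

    assert sanitize_value(1) == sanitize_value('"1"') == sanitize_value("#1") == "1"
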
Outline
• Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks

Windows Registry Characteristics
• Max size: 333,193 entries
• Min size: 77,517 entries
• Average size: 198,376 entries
• Median size: 198,608 entries
• Cardinality: 87% of entries have cardinality 1; 94% have cardinality <= 2
• Distinct canonicalized entries in GeneBank: 1,476,665
• Common canonicalized entries: 43,913
• Distinct entries after data sanitization: 1,820,706

Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: the Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes

Response Time

[Chart: response time in seconds (y-axis, 0 to 250) for each troubleshooting case, ordered by the number of suspects (x-axis labels from 8 to 5,483).]

• # of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average, with the SQL server hosted on a workstation with a 2.4 GHz CPU and 1 GB of RAM
• Sequential database queries dominate the response time

Troubleshooting Effectiveness
• Metric: the rank of the root-cause entry
• Results:
  – Rank 1 for 12 cases
  – Rank 2 for 3 cases
  – Ranks 3, 9, 12, and 16 for one case each
  – One case could not be solved

Source of False Positives
• Nature of the root-cause entry
  – The root-cause entry has a large cardinality
• How unique the other suspects are
  – A highly customized machine likely produces more noise
• The database is not pristine

Impact of the Sample Set Size
• A larger sample set doesn't necessarily yield better accuracy
  – Strong conformity doesn't depend on the number of samples
  – Operational state doesn't depend on the number of samples
  – More samples only help with a non-pristine sample set
• 10 samples are large enough for most cases

Related Work
• Blackbox-based techniques
  – Strider: needs to identify the healthy state [Wang ’03]
  – Hardware and software component dependencies [Brown ’01]
• Much prior work leverages statistics to pinpoint anomalies
  – Bugs as deviant behavior [Engler et al., SOSP ’01]
  – Host-based intrusion detection based on system calls [Forrest ’96] and on registry behavior [Apap et al. ’99]

Future Work
• We have only scratched the surface!
• Multiple root-cause entries
• Cross-application troubleshooting
• Database maintenance
• Privacy
  – Friends Troubleshooting Network

Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate the manual identification of the healthy state
  – Initial results are promising