A Case for Open Source (Failure) Data Collection


A Case for an Open Source Data Repository
Archana Ganapathi
Department of EECS, UC Berkeley

Why do we study failure data?
Understand cause->effect relationship between configurations and system behavior
Still don’t have a complete understanding of failures in systems
  Can’t worry about fixing problems if we don’t understand them in the first place
Gauge behavioral changes over time
Need realistic workload/faultload data to test/evaluate systems
Success stories…people have benefited from failure data analysis

Crash data collection success stories
Berkeley EECS
BOINC
2 Unnamed Companies

…So Why Does Windows Crash?

Definitions
Crash
  Event caused by a problem in the operating system (OS) or application (app).
  Requires OS or app restart.
Application Crash
  A crash occurring at user-level, caused by one or more components (.exe/.dll files).
  Requires an application restart.
Application Hang
  An application crash caused as a result of the user terminating a process that is potentially deadlocked or running an infinite loop.
  Component (.exe/.dll file routine) causing the loop/deadlock cannot be identified (yet).
OS Crash
  A crash occurring at kernel-level, caused by memory corruption, bad drivers or faulty system-level routines.
  Blue-screen-generating crashes require a machine reboot.
  Windows explorer crashes require restarting the explorer process.
Bluescreen
  An OS crash that produces a user-visible blue screen followed by a non-optional machine reboot.

Procedure
Collect crash dumps from two different sources
  UC Berkeley EECS department
  BOINC volunteers
Filter data/form crash clusters to avoid double-counting
  Account for shared resources, dependent processes, system instability, user retry
Parse/Interpret crash dumps using Debugging Tools for Windows
Study both application crash behavior and operating system crashes
Supplement crash data with usage data

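The filtering step in the procedure above can be pictured as a time-window grouping: crashes on the same machine that follow each other closely (for example a user retrying a crashed application) are counted as one cluster. The sketch below is only an illustration under stated assumptions. The 10-minute window, the record fields, and the function name `cluster_crashes` are hypothetical and are not the rule used in the study.

```python
from datetime import datetime, timedelta

def cluster_crashes(crashes, window=timedelta(minutes=10)):
    """Group crash records per machine. A crash joins the machine's current
    cluster if it occurs within `window` of the previous crash on that machine."""
    clusters = []
    last_seen = {}  # machine -> (time of latest crash, cluster it belongs to)
    for crash in sorted(crashes, key=lambda c: c["time"]):
        machine = crash["machine"]
        previous = last_seen.get(machine)
        if previous is not None and crash["time"] - previous[0] <= window:
            cluster = previous[1]
            cluster.append(crash)
        else:
            cluster = [crash]
            clusters.append(cluster)
        last_seen[machine] = (crash["time"], cluster)
    return clusters

# Hypothetical sample records (not taken from the study's data).
sample = [
    {"machine": "m1", "app": "iexplore.exe", "time": datetime(2005, 1, 10, 9, 0)},
    {"machine": "m1", "app": "iexplore.exe", "time": datetime(2005, 1, 10, 9, 3)},
    {"machine": "m1", "app": "winword.exe", "time": datetime(2005, 1, 10, 14, 0)},
    {"machine": "m2", "app": "matlab.exe", "time": datetime(2005, 1, 10, 9, 1)},
]
clusters = cluster_crashes(sample)
print(len(sample), "raw crashes ->", len(clusters), "clusters")
```

A real filter would likely also use process dependencies and shared components, which a pure time window cannot see.
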
EECS Dataset
Type of User       Number of Users   Number of Crashes
graduate student   30                621
staff              28                414
unknown            16                197
faculty            14                191
undergraduate      4                 19
visitor            3                 51
guest              1                 9
postdoc            1                 19
TOTAL              97                1521

Crashes reported per month
[Figure: Number of Crashes per Month. x-axis: Month (2004 to 2005); y-axis: # crashes]

Usage/Crashes per day of week
[Figure: Percentage of Computer Users per Day of Week. x-axis: Day of Week; y-axis: % users]
EECS department users use their EECS computers Monday through Friday.
Few users use computers on weekends.
[Figure: Number of Crashes per Day of Week. x-axis: Day of Week; y-axis: # crashes]
Crashes do not occur uniformly across the five days of the working week.

Usage/Crashes per hour of day
[Figure: Percentage of Computer Users per Hour of Day. x-axis: Hour of Day; y-axis: % users]
Most people work during the typical hours of 9am to 5pm.
[Figure: Number of Crashes per Hour of Day. x-axis: Hour of Day; y-axis: # crashes]
Our data set involves users of various affiliations to the department, hence the wider spectrum of work schedules.

Reboot Frequency
[Figure: Percentage of Users Rebooting their Computer at Specified Frequency. x-axis: interval (days), with ticks at 1, 2.5, 5, 7, 10, 14, 30, 60, 365; y-axis: Percentage of users]

Automatic Clustering Experiment for Categorizing Apps
Augment the crash data with information about usage patterns and program dependencies
Feed data into the k-means and agglomerative clustering algorithms to determine which applications are behaviorally related.
We determined that we did not have enough data to derive a method to categorize applications in our data set
  Need several instances of every (application, component, error code) combo
As a last resort, we chose to categorize apps based on application functionality

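To make the clustering idea above concrete, here is a minimal sketch of agglomerative clustering over per-application feature vectors. Everything specific in it is an assumption for illustration: the application names, the two features (a crash rate and a hang fraction), and the single-linkage merge rule. The slides do not say which features or linkage were tried, and this is not the study's code.

```python
import math

def agglomerative(points, num_clusters):
    """Bottom-up clustering: start with one cluster per item and repeatedly
    merge the two clusters whose closest members are nearest (single linkage)."""
    clusters = [[name] for name in points]

    def distance(a, b):
        return min(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > num_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical features: (crashes per usage hour, fraction of crashes that are hangs).
features = {
    "browser_a": (0.90, 0.30),
    "browser_b": (0.80, 0.35),
    "editor_a": (0.20, 0.60),
    "editor_b": (0.25, 0.55),
}
groups = agglomerative(features, num_clusters=2)
print(groups)
```

With only a handful of observations per application, the feature vectors themselves are noisy, which is consistent with the slide's conclusion that more instances per (application, component, error code) combination are needed.
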
Crash Cause by Application Category
Application Category    # Crashes   Crash %   Usage %
web browsing            598         41%       18%
unknown                 185         13%       n/a
document preparation    152         11%       22%
email                   130         9%        24%
scientific computing    95          7%        7%
document viewer         84          6%        8%
multimedia              57          4%        6%
code development        26          2%        10%
document archiving      23          2%        n/a
remote connection       23          2%        n/a
instant messaging       17          1%        n/a
i/o                     15          1%        n/a

Application Hang vs Crashes due to Faulty Component
[Figure: Crash Cause. faulty component: 52%; application hang: 48%]

Which applications hang?
Application       # hangs   % hangs   % Running Total
iexplore.exe      185       25%       25%
matlab.exe        68        9%        34%
winword.exe       67        9%        43%
outlook.exe       60        8%        51%
firefox.exe       47        6%        57%
netscape.exe      41        6%        63%
unknown           25        3%        66%
powerarc.exe      19        3%        69%
powerpnt.exe      13        2%        71%
thunderbird.exe   13        2%        73%
excel.exe         12        2%        75%
acrobat.exe       11        1%        76%

Which components cause crashes?
Component        Description                                      Author     Apps invoking component           % crash
ntdll.dll        NT system functions                              MS         Internet Explorer, Matlab         11% (86)
msvcrt.dll       Microsoft C runtime library                      MS         Acrobat, Netscape                 5% (37)
acrord32.exe     Acrobat Reader                                   3rd party  Acrobat Reader                    4% (29)
pdm.dll          Scripting component functions                    MS         Visual Studio, Internet Explorer  3% (23)
firefox.exe      Web browser                                      3rd party  Firefox                           2% (19)
user32.dll       Communication, message handler, timer functions  MS         Firefox, Internet Explorer        2% (17)
ray_tracing.exe  User application                                 3rd party  --                                2% (16)
winword.exe      Windows document editor                          MS         Word, Outlook                     2% (15)
mshtml.dll       HTML related functions                           MS         Internet Explorer, Netscape       2% (15)
tempest.exe      Unknown                                          3rd party  --                                2% (15)

BOINC
http://winerror.cs.berkeley.edu/crashcollection/
Berkeley Open Infrastructure for Network Computing
Users download boinc client app
Crash dumps are scraped/sent to boinc servers
Currently 791 accounts created for crash collection + resource management
  492 users for crash collection

OS Crashes
Driver faults
  asynchronous events
  code must follow kernel programming etiquette
  exceedingly difficult to debug
Memory corruption
  Hardware problems (e.g. non-ECC mem)
  Software-related
  47 of these in our dataset so far…don’t have tools to analyze these in detail

OS crash causing images (based on 150 boinc users, 562 crashes)
Image Name    Image Description                                      Num crashes  % crashes  % Running Total
ntoskrnl.exe  NT kernel and system                                   150          27%        27%
GDFSHK.SYS    McAfee Privacy Service File Guardian                   42           8%         35%
ALCXWDM.SYS   Windows (R) WDM driver for Realtek AC'97               40           7%         42%
kmixer.sys    kernel audio mixer of Microsoft Windows                28           5%         47%
win32k.sys    multi user win32 driver                                19           3%         50%
ati3d2ag.dll  ATI Technologies Inc. Radeon DirectX Universal Driver  18           3%         53%
Brwgate.sys   NAT/Proxy/Firewall system                              16           3%         56%
HSF_CNXT.sys  Conexant Systems SoftK or SoftK56 Modem Driver         10           2%         58%
Ialmdev5.DLL  Intel graphics driver                                  10           2%         60%
ati2dvag.dll  ATI Radeon WinNT display driver                        8            1%         61%
              NVIDIA Compatible Windows 2000

Crash generating driver fault type
Driver Fault Type                                      Num Crashes
PAGE FAULT IN NONPAGED AREA                            118
IRQL NOT LESS OR EQUAL                                 105
KERNEL MODE EXCEPTION NOT HANDLED                      67
UNEXPECTED KERNEL MODE TRAP                            63
BAD POOL CALLER                                        46
THREAD STUCK IN DEVICE DRIVER                          36
SYSTEM THREAD EXCEPTION NOT HANDLED                    29
Unknown bugcheck code                                  16
Other (each caused 1 crash)                            14
PFN LIST CORRUPT                                       13
DRIVER CORRUPTED EXPOOL                                12
DRIVER UNLOADED WITHOUT CANCELLING PENDING OPERATIONS  8
MANUALLY INITIATED CRASH                               5
File Corruption - Unreadable File                      4

Summary of crash analysis
Application crashes are caused both by faulty, non-robust dll files and by impatient users
OS crashes are predominantly caused by poorly-written device driver code
Commonly used core components are blamed for most crashes
  need to improve reliability of these components

Practical techniques to reduce crashes
Software-Based Fault Isolation
Nooks
Separate protection level for drivers
Move driver code to user libraries
Virtual Machine for each unsafe/distrusted app

Lessons from crash data study
Clearly people want to know what’s wrong and how to fix it
The more feedback we give, the more data sets we receive
...but it’s not as easy as it sounds

What kinds of data should we collect?
Failure data
Configuration information
Logs of normal behavior
Usage data
Performance logs
Annotations of data
Collect data for Individual Machines + Services

Why are people afraid of sharing data?
Fear of public humiliation (reverse engineering what user was doing)
Revealing problems within their organization
Fear of competitors using data against them
Revealing loopholes through which malware can easily propagate.
Revealing dependability problems in third party products (MS)

Non-technical challenges to getting data
Collecting (useful) data is tedious
  No central location that can be queried for data
  Finding the person with access to data
Privacy concerns
  Especially with usage data
  What information is “necessary and sufficient” to understand data trends?
Legal agreements take a long time to draft
  Researchers are more willing to share data than lawyers
Publicity

Technical solution
Amortize the cost of data collection by building an open source repository
Provide a set of tools to cleanse and mine the data

What tools should we implement?
Collect
  BOINC
  Instrumentation (MS, Pinpoint)
  Pre-aggregated data from companies
Anonymize/Preprocess
  Pre-written anonymization tools
  Company-specific privacy requirements
    Hash values of certain fields
    Drop irrelevant fields
    Mask part of data

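The three anonymization operations listed above (hash, drop, mask) could look roughly like the following. This is a sketch under assumptions, not a proposed tool: the record layout, field names, secret key, and the function `anonymize` are all hypothetical. It uses a keyed hash (HMAC-SHA256 from the Python standard library) so that the same input always maps to the same pseudonym without exposing the original value.

```python
import hashlib
import hmac

def anonymize(record, secret, hash_fields, drop_fields, mask_fields, keep=3):
    """Return a copy of `record` with fields hashed, dropped, or partly masked."""
    out = {}
    for field, value in record.items():
        if field in drop_fields:
            continue  # drop irrelevant or sensitive fields entirely
        if field in hash_fields:
            # keyed hash: stable pseudonym, hard to reverse without the secret
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]
        elif field in mask_fields:
            # keep the first `keep` characters and hide the rest
            text = str(value)
            out[field] = text[:keep] + "*" * max(len(text) - keep, 0)
        else:
            out[field] = value
    return out

# Hypothetical crash record and policy (field names are assumptions).
record = {
    "user": "alice",
    "machine": "eecs-pc-17",
    "app": "winword.exe",
    "window_title": "draft.doc",
    "ip": "10.0.12.34",
}
cleaned = anonymize(
    record,
    secret=b"site-secret",
    hash_fields={"user", "machine"},
    drop_fields={"window_title"},
    mask_fields={"ip"},
    keep=5,
)
print(cleaned)
```

Passing the field sets in as parameters is one way to express company-specific privacy requirements: each contributor supplies its own policy while the tool stays the same.
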
Tools cont’d
Store
  Open-source repository schema
  Common log format/data descriptor headers
  Tools to convert log metadata to common format to cross-link data tables
  Sample queries: data mining ~ asking questions about data as it is
Analyze/Experiment
  SLT algorithms
  Visualization
  Stream processing
  Other tools (e.g. WinDbg)

Thoughts on Collection/Anonymization
Defining necessary and sufficient
  Bad example: Cannot correlate crashes if we get rid of all user/machine names
  Good example: Hash user/machine names
  Default: hide if not necessary?
What would it take for you not to invoke the legal dept?

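The bad/good contrast above can be shown in a few lines: once machine names are stripped, crashes can only be counted per application, while hashed names still allow per-machine correlation. The records, the salt, and the helper `pseudonym` are hypothetical; a real deployment would need to keep the salt secret, since short, guessable names can otherwise be recovered by hashing candidates.

```python
import hashlib
from collections import Counter

def pseudonym(name, salt="example-salt"):
    """Map a user/machine name to a stable, opaque identifier."""
    return hashlib.sha256((salt + name).encode("utf-8")).hexdigest()[:12]

# Hypothetical (machine, application) crash records.
crashes = [
    ("pc-01", "iexplore.exe"),
    ("pc-01", "iexplore.exe"),
    ("pc-02", "matlab.exe"),
    ("pc-01", "winword.exe"),
]

# Bad example: names removed, so only per-application counts remain.
stripped = Counter(app for _, app in crashes)

# Good example: names hashed, so crashes on the same machine still line up.
hashed = [(pseudonym(machine), app) for machine, app in crashes]
per_machine = Counter(machine for machine, _ in hashed)

print(stripped)
print(sorted(per_machine.values()))
```
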
Thoughts on Storage/Analysis
Use time/data source as primary key?
How domain-specific should the common format be?
Management logistics…
Access control…

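One way to explore the primary-key question above is to try it in a small schema. The sketch below uses SQLite from the Python standard library. The table name, columns, and sample rows are hypothetical, and the extra `seq` column is an assumption added because time and data source alone can collide when one source reports two events with the same timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE failure_event (
        event_time TEXT NOT NULL,     -- ISO 8601 timestamp
        source     TEXT NOT NULL,     -- contributing data source
        seq        INTEGER NOT NULL,  -- tie-breaker for same-time events
        kind       TEXT NOT NULL,     -- e.g. app_crash, app_hang, os_crash
        component  TEXT,              -- blamed component, if known
        PRIMARY KEY (event_time, source, seq)
    )
    """
)

# Hypothetical rows, not taken from the study's data.
rows = [
    ("2005-01-10T09:00:00", "eecs", 0, "app_crash", "ntdll.dll"),
    ("2005-01-10T09:00:00", "eecs", 1, "app_hang", None),
    ("2005-01-11T13:30:00", "boinc", 0, "os_crash", "ntoskrnl.exe"),
    ("2005-01-12T08:15:00", "boinc", 0, "os_crash", "kmixer.sys"),
]
conn.executemany("INSERT INTO failure_event VALUES (?, ?, ?, ?, ?)", rows)

# Sample query: OS crashes per data source.
query = (
    "SELECT source, COUNT(*) FROM failure_event "
    "WHERE kind = 'os_crash' GROUP BY source ORDER BY source"
)
counts = conn.execute(query).fetchall()
print(counts)
```

A composite key keeps sources independent of each other, at the cost of requiring every contributor to supply reliable timestamps.
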
Acronym Suggestions???
Open Source (Failure) Data Repository