
Automatic Clustering of Win32 Applications Based on Failure Behavior

Archana Ganapathi
Andrew Zimdars
University of California, Berkeley

CS 294-4: RADS
Final Presentation
Project Overview

• Goal: partition/cluster applications based on their failure behavior
• Why?
  – Provide users with a "likely frustration factor" at the point of installing a new app
  – What underlying design principles determine good/bad applications?
  – What application features guide application crash behavior?

Raw Data

• Crash event timestamp
• Machine ID, user ID
• Failed application name and version
• Failed DLL name and version
• Application error code
• Calculated values (sketched in code below):
  – Time interval between consecutive crashes for each (applicationName, machineID) pair
  – "Crash clusters": sequences of consecutive crashes of the same app within a short period of time (10 minutes)
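As an illustration of how these calculated values could be derived, here is a minimal Python sketch. The event field names, and wrapping the 10-minute window in a CLUSTER_WINDOW constant, are assumptions for illustration, not the project's actual pipeline.

```python
from datetime import timedelta

CLUSTER_WINDOW = timedelta(minutes=10)  # "short period of time" from the slide

def inter_crash_intervals(events):
    """Time intervals between consecutive crashes for one
    (applicationName, machineID) pair; each event is a dict with a 'timestamp'."""
    times = sorted(e["timestamp"] for e in events)
    return [b - a for a, b in zip(times, times[1:])]

def crash_clusters(events):
    """Group crashes of the same app on the same machine into "crash clusters":
    runs of consecutive crashes separated by less than CLUSTER_WINDOW."""
    clusters, current = [], []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if current and event["timestamp"] - current[-1]["timestamp"] > CLUSTER_WINDOW:
            clusters.append(current)  # gap too large: close the current cluster
            current = []
        current.append(event)
    if current:
        clusters.append(current)
    return clusters
```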
Supplemental Features

• Usage data
  – Based on a survey of users whose machines crashed
• DLL support for each application
  – Used "Dependency Walker" (http://www.dependencywalker.com/); a programmatic alternative is sketched below
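Dependency Walker is the tool the project actually used. Purely as a hedged illustration of the same idea in code, the third-party `pefile` library (an assumption on our part, not something used in the project) can list the DLLs an executable imports directly; unlike Dependency Walker, it does not follow transitive dependencies.

```python
import pefile  # third-party: pip install pefile (illustrative alternative only)

def imported_dlls(exe_path):
    """Return the set of DLL names an executable statically imports."""
    pe = pefile.PE(exe_path)
    imports = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return {entry.dll.decode("ascii", "ignore").lower() for entry in imports}

# e.g. imported_dlls(r"C:\Program Files\Internet Explorer\iexplore.exe")
```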
Derived Features

• Time since previous crash, normalized by average time between crashes on the machine
• Time since previous cluster, normalized by average time between clusters on the machine
• Size of crash cluster
• Time between events in a crash cluster
• Frequency of crashes, normalized by frequency of usage
• Complexity of application (number of DLLs)
• Specific DLLs used
• Crashing DLLs for the app
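A sketch of how these derived features might be assembled into one record per crash; all argument structures and field names are illustrative assumptions rather than the project's actual schema.

```python
def derived_features(crash, machine_stats, app_info, usage):
    """Combine per-crash, per-machine, and per-application statistics into the
    derived features listed above (field names are assumed for illustration)."""
    return {
        # normalizing by per-machine averages makes machines with very different
        # baseline crash rates comparable
        "time_since_prev_crash":
            crash["time_since_prev_crash"] / machine_stats["avg_time_between_crashes"],
        "time_since_prev_cluster":
            crash["time_since_prev_cluster"] / machine_stats["avg_time_between_clusters"],
        "cluster_size": crash["cluster_size"],
        "time_between_cluster_events": crash["mean_gap_within_cluster"],
        # crash frequency relative to how heavily the application is used
        "crash_freq_per_usage": crash["crash_count"] / usage["hours_used"],
        "complexity": len(app_info["dlls"]),         # number of DLLs the app depends on
        "dlls": set(app_info["dlls"]),               # set-valued feature
        "crashing_dlls": set(crash["crashing_dlls"]),
    }
```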
Crash Behavior on a Machine

[Figure: timeline of crash events for App1, App2, and App3 on a single machine]
Clustering Algorithms (1)

Purpose: reveal structure in the data set by approximately finding groups of data points that are closer to each other than to other groups.

K-Means Clustering
• Initialization: choose k data points uniformly at random from the full data set
• Iteration: 1) assign each data point to the nearest cluster center; 2) recompute cluster centers as the "average" of all member data points
• Runtime: O(n²)

Agglomerative Clustering
• Initialization: each data point is the center of a singleton cluster
• Iteration: 1) merge the two clusters whose centers are closest to each other; 2) recompute cluster centers as the "average" of all member data points
• Runtime: O(n) (one new cluster per iteration)
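A minimal NumPy sketch of the two iteration schemes above, for purely numeric feature vectors (the crash data additionally needs the non-numeric distance functions discussed on the next slides). This is a textbook illustration, not the code used in the project.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array of numeric feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init: k random points
    for _ in range(n_iter):
        # 1) assign each point to the nearest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2) recompute each center as the mean of its member points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def agglomerative(X, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters whose
    centers are closest, recomputing the merged center as the member mean."""
    clusters = [[i] for i in range(len(X))]              # start with singleton clusters
    centers = [X[i].astype(float) for i in range(len(X))]
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(centers)) for b in range(a + 1, len(centers))),
                   key=lambda ab: float(np.sum((centers[ab[0]] - centers[ab[1]]) ** 2)))
        clusters[i].extend(clusters.pop(j))
        centers.pop(j)
        centers[i] = X[clusters[i]].mean(axis=0)
    return clusters
```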
Clustering Algorithms (2)

• k-means clustering:
  – Final clusters sensitive to initial choice of cluster centers
  – Doesn't provide much information about structure within clusters
• Agglomerative clustering:
  – Deterministic operation (but deterministic-annealing variants possible)
  – Algorithm results in a binary cluster "tree" (pairs of clusters successively merged)
  – Greedy clustering can lead to odd groupings
Challenges of non-numeric data

• Clusters sensitive to choice of scale
• Impossible/meaningless to instantiate an "average" data point for many data types:
  – Strings: discrete values, and order irrelevant to domain structure
  – Sets (e.g. of DLL names): no natural ordering
  – Version numbers: order irrelevant to domain structure
  – Hexadecimal error codes: non-ordinal when used as crash labels
• Numerical vectors have a "natural" difference (vector subtraction) and metrics (Lp), but non-numeric components require an "unnatural" difference operation:
  – Strings: d(s1, s2) = 0 iff s1 = s2; d(s1, s2) = 1 otherwise
  – Sets: d(s1, s2) = |(s1 \ s2) ∪ (s2 \ s1)| (size of the symmetric difference)
  – Version numbers: component-wise difference, scaled so the difference between minor version numbers doesn't dominate the difference between major version numbers
• Vectors with mixed numeric and non-numeric components should still satisfy the triangle inequality (the artificial distance should be a metric)
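A sketch of the "unnatural" per-component distances above combined into one weighted distance over mixed crash records; the record field names and default weights are assumptions for illustration. Since each component distance is itself a metric, a non-negative weighted sum of them still satisfies the triangle inequality.

```python
def string_dist(s1, s2):
    """Discrete metric on strings: 0 if the names agree, 1 otherwise."""
    return 0.0 if s1 == s2 else 1.0

def set_dist(a, b):
    """Size of the symmetric difference, e.g. between two sets of DLL names."""
    return float(len(a ^ b))

def version_dist(v1, v2, scale=(1.0, 0.1, 0.01)):
    """Component-wise difference of version tuples, scaled so that minor/build
    components cannot dominate the major component (scale values are assumptions)."""
    return sum(w * abs(a - b) for w, a, b in zip(scale, v1, v2))

def crash_dist(x, y, weights=None):
    """Weighted distance between two crash records with mixed components.
    Field names and default weights are assumptions for illustration."""
    w = weights or {}
    return (w.get("app", 1.0)     * string_dist(x["app"], y["app"])
          + w.get("dlls", 1.0)    * set_dist(x["dlls"], y["dlls"])
          + w.get("version", 1.0) * version_dist(x["version"], y["version"])
          + w.get("error", 1.0)   * string_dist(x["error_code"], y["error_code"])
          + w.get("time", 1.0)    * abs(x["norm_time_since_prev"] - y["norm_time_since_prev"]))
```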
Clustering Results: Raw Clusters

• Initial clusters correspond to single-application crash clusters
  – Same user/machine/application/DLL
• Larger clusters aggregate per-application crashes across machines
  – Common: error codes specific to crash clusters
  – Uncommon: the same DLL causes an application to crash on two different machines
  – Rare: the same DLL causes an application to crash on two different machines with the same error code!
• Analytical challenges
  – Most users stick to one computer: doubles the effective weight of the machine in assigning clusters
  – Feature expansion is mostly application-specific: multiplies the effective weight of the application in assigning clusters
  – Sparse data
  – Pathological environment (cf. crashes of Lisp and Java IDEs)
Clustering Results: Sensitivity Analysis

• Make one dimension at a time twice as important as the others, and observe the effect on cluster membership (see the sketch after this list)
• Machine-specific properties
  – Affect the order of cluster formation
  – Consistent basis for forming initial (small) clusters
  – Strong effect, but not overwhelming
• Application-specific properties
  – DLL set discriminates between statically- and dynamically-linked apps, but not between "simple" and "complex"
  – Most apps use the same set of (system, VC++) runtime DLLs
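The re-weighting procedure described above can be sketched as a loop over features, doubling one weight at a time and re-clustering. Here `cluster_fn` is a placeholder for whichever clustering routine is in use (for example, a weighted-distance variant of the k-means or agglomerative sketches earlier); it is an assumption, not the project's actual harness.

```python
def sensitivity_analysis(records, cluster_fn, feature_names):
    """Re-cluster with one feature's weight doubled at a time, so the resulting
    assignments can be compared against the equal-weight baseline."""
    baseline = cluster_fn(records, weights={f: 1.0 for f in feature_names})
    runs = {"baseline": baseline}
    for feature in feature_names:
        weights = {f: (2.0 if f == feature else 1.0) for f in feature_names}
        runs[feature] = cluster_fn(records, weights=weights)
    return runs
```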
Clustering Results: Application Features Only

• Initial results show clusters corresponding to crash series
  – Lower-dimensional space fights sparsity
  – Do our raw data overweight machine/user identifiers?
• Standouts:
  – MS Office crashes cluster together (same DLLs)
  – Research apps cluster together (similar crash patterns, static linking)
  – Alisp and Firefox (!): some crashes in both caused by the same DLL (but with different error codes)
• Other clusters:
  – "Complex" applications that hang (but not Firefox/Thunderbird)
  – Matlab and Internet Explorer (similar DLL sets: Matlab = IE + LAPACK?)
  – Network apps (IE, Netscape, Y! Messenger, Outlook, Acrobat Reader)
DLLs: Model citizens and bad apples

• 227 DLLs used by the 33 apps analyzed
• 37 of these DLLs caused crashes in our data set
  – Worst offenders (widely used and frequent causes of crashes): ntdll.dll, kernel32.dll, msvcrt.dll
• 190 model citizens
  – Random sample of (almost) universally used DLLs that have not produced crashes:
    CFGMGR32.DLL, NETAPI32.DLL, RPCRT4.DLL, DBGHELP.DLL, PRINTUI.DLL, USERENV.DLL,
    MSIMG32.DLL, RASMAN.DLL, VERSION.DLL, MSSIGN32.DLL, REGAPI.DLL, WMI.DLL
Crash Data and Application Structure

• Clusters should reflect relevant structure in the data set
• Distance between data points:
  – Irrelevant: edit distance between two app names (e.g. "mozilla.exe" and "iexplore.exe")
  – Irrelevant: order of hexadecimal error codes
  – Relevant: agreement/disagreement of two app names
• Introduction of derived features that express application complexity and usage patterns:
  – Static processing identifies the DLL support of an application
  – Time between crashes and "crash clusters"
• Scaling dimensions of the vector space to "normalize" the contribution of different factors (see the sketch after this list)
  – Normalization is an easy way to perform "sensitivity analysis" of clusters: "What if the DLL that caused the crash were twice as important as all other features?"
  – Consider the relative effect of t in milliseconds vs. t in days vs. t/max(t)
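To make the last point concrete: expressing the same interval t in milliseconds rather than days multiplies its raw magnitude by about 8.6 × 10^7, silently re-weighting that dimension; rescaling by the column maximum (one simple choice among several normalizations) removes the dependence on units. A minimal sketch:

```python
import numpy as np

def normalize_by_max(t):
    """Rescale a raw time column to [0, 1] via t / max(t); after this, the choice
    of unit (milliseconds vs. days) no longer affects the column's weight."""
    t = np.asarray(t, dtype=float)
    return t / t.max() if t.max() > 0 else t
```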
Conclusion: The Blame Game

• Most crash histories are machine/user-specific
  – Configuration management more important than blaming (insert victim here)?
• From patterns to testable hypotheses: statistical significance of the analysis?
  – Trustworthy patterns require huge amounts of structured data
  – Tradeoff between introducing domain knowledge and biasing results to confirm hypotheses
  – Crash patterns are not usage patterns