Debugging Distributed Systems via Data Mining


Troubleshooting Distributed Systems via Data Mining
David Cieslak, Douglas Thain, and Nitesh Chawla
University of Notre Dame
It’s Ugly in the Real World

Machine related failures:
Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...

Job related failures:
Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...

Incompatibilities between jobs and machines:
Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...

Load related failures:
Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...

Non-deterministic failures:
Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
Reports of Bad News

Grid2003: Thirty percent of ATLAS/CMS jobs failed!
“Jobs often failed due to site configuration problems, or in groups from site service failures.”
- R. Gardner et al., “The Grid2003 Production Grid: Principles and Practice”, HPDC 2004.

“Users ... need tools to help debug why failures happen.” Need “user oriented diagnostic tools”.
- J. Schopf, “State of Grid Users: 25 Conversations with UK eScience Groups”, Argonne Tech Report ANL/MCS-TM-278.
A “Grand Challenge” Problem:

A user submits one million jobs to the grid.
Half of them fail.
Now what?
Examine the output of every failed job?
Login to every site to examine the logs?
Resubmit and hope for the best?

We need some way of getting the big picture.
Need to identify problems not seen before.
The Wisdom of Secretary Rumsfeld
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
- Donald Rumsfeld, 12 February 2002
An Idea:

We have lots of structured information about the components of a grid.
Can we perform some form of data mining to discover the big picture of what is going on?

User: Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.
Admin: User “joe” is running 1000s of jobs that transfer 10 TB of data that fail immediately; perhaps he needs help.

Can we act on this information to improve the system?

User: Avoid resources that are not working for you.
Admin: Assist the user in understanding and fixing the problem.
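
As a minimal sketch of the kind of “big picture” summary suggested above (an illustration, not the authors' tool), one can group job outcomes by a machine attribute and report failure rates; the field names OpSysVer and ExitStatus are assumptions for the example:

    # Illustration only: group job records by a machine attribute and
    # report the failure rate for each attribute value.
    from collections import defaultdict

    def failure_rate_by(jobs, attribute):
        """jobs: iterable of dicts carrying the attribute and an 'ExitStatus'."""
        totals = defaultdict(lambda: [0, 0])          # value -> [failed, total]
        for job in jobs:
            key = job.get(attribute, "unknown")
            totals[key][1] += 1
            if job.get("ExitStatus", 0) != 0:
                totals[key][0] += 1
        return {k: failed / total for k, (failed, total) in totals.items()}

    jobs = [
        {"OpSysVer": "RH 12.1", "ExitStatus": 0},
        {"OpSysVer": "RH 12.2", "ExitStatus": 1},
        {"OpSysVer": "RH 12.2", "ExitStatus": 1},
        {"OpSysVer": "RH 12.3", "ExitStatus": 0},
    ]
    print(failure_rate_by(jobs, "OpSysVer"))
    # -> {'RH 12.1': 0.0, 'RH 12.2': 1.0, 'RH 12.3': 0.0}

A summary like this is what lets a user or admin spot that version 12.2 is the odd one out without reading a million log files.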
Job ClassAd:
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ExecutableSize = 20000
ImageSize = 40000
DiskUsage = 110000
NiceUser = FALSE
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
RootDir = "/"
...

Machine ClassAd:
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
JobUniverse = 1
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
KeyboardIdle = 817093
ConsoleIdle = 817093
StartdIpAddr = "<129.74.153.164:9453>"
...

User Job Log:
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
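
A minimal sketch of flattening ClassAd-style “Attribute = value” text, like the excerpts above, into a feature dictionary that a rule learner could consume. This is an illustration only; it does not use the real ClassAd library, and complex expressions are kept as verbatim strings:

    # Illustration only: parse simple 'Name = value' lines into a dict.
    def parse_classad(text):
        ad = {}
        for line in text.splitlines():
            if "=" not in line:
                continue                              # skip blank lines
            name, _, raw = line.partition("=")
            key, raw = name.strip(), raw.strip()
            if raw.startswith('"') and raw.endswith('"'):
                ad[key] = raw.strip('"')              # string attribute
            elif raw in ("TRUE", "FALSE"):
                ad[key] = (raw == "TRUE")             # boolean attribute
            else:
                try:
                    ad[key] = float(raw)              # numeric attribute
                except ValueError:
                    ad[key] = raw                     # expressions kept verbatim
        return ad

    machine_ad = parse_classad('MyType = "Machine"\nMemory = 498\nDisk = 19072712')
    print(machine_ad)   # {'MyType': 'Machine', 'Memory': 498.0, 'Disk': 19072712.0}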
[Figure: Job ads and machine ads drawn from the user job log are labeled into a Success Class or a Failure Class and fed to RIPPER. Failure criteria: exit != 0, core dump, evicted, suspended, bad output.]
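
A minimal sketch of applying the failure criteria above to one job record so it can be assigned to the success or failure class; the field names (CoreDumped, Evicted, Suspended, BadOutput) are assumptions for the example:

    # Illustration only: label one job record using the criteria above.
    def label_job(record):
        """Return 'failure' if any failure criterion matches, else 'success'."""
        failed = (
            record.get("ExitStatus", 0) != 0       # exit != 0
            or record.get("CoreDumped", False)     # core dump
            or record.get("Evicted", False)        # evicted
            or record.get("Suspended", False)      # suspended
            or record.get("BadOutput", False)      # output failed a sanity check
        )
        return "failure" if failed else "success"

    print(label_job({"ExitStatus": 1}))                   # failure
    print(label_job({"ExitStatus": 0, "Evicted": True}))  # failure
    print(label_job({"ExitStatus": 0}))                   # success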
Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.

------------------------- run 1 -------------------------
Hypothesis:
exit1 :- Memory>=1930, JobStart>=1.14626e+09, MonitorSelfTime>=1.14626e+09 (491/377).
exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
default exit0 (11904/4503).
Error rate on holdout data is 30.9852%
Running average of error rate is 30.9852%

------------------------- run 2 -------------------------
Hypothesis:
exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
default exit0 (12090/4606).
Error rate on holdout data is 31.8791%
Running average of error rate is 31.4322%

------------------------- run 3 -------------------------
Hypothesis:
exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e+09 (1270/1050).
exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e+09, VirtualMemory>=2.09646e+06, LoadAvg>=30000, LastBenchmark<=1.14623e+09, MonitorSelfImageSize<=7.836e+09 (94/84).
exit1 :- Memory>=1927, TotalLoadAvg<=1.43e+06, UpdatesTotal<=8069, LastBenchmark<=1.14619e+09, UpdatesLost<=1 (77/61).
default exit0 (11940/4452).
Error rate on holdout data is 31.8111%
Running average of error rate is 31.5585%
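
The rules above come from RIPPER. As a rough stand-in (not the authors' setup), any threshold-based learner recovers rules of the same shape; a minimal sketch using a scikit-learn decision tree on invented toy data:

    # Illustration only: a decision tree over machine-ad features yields
    # threshold rules comparable to "Memory>=1930, Disk<=555320".
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy feature matrix: [Memory (MB), Disk (KB)] per (job, machine) pair,
    # labeled with the job's outcome class as in the RIPPER output above.
    X = [[498, 19072712], [2048, 400000], [1024, 9000000], [3008, 250000]]
    y = ["exit0", "exit1", "exit0", "exit1"]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["Memory", "Disk"]))
    # Prints threshold splits of the same shape as the RIPPER hypotheses.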
Unexpected Discoveries

Purdue Teragrid (91343 jobs on 2523 CPUs):
Jobs fail on machines with (Memory > 1920 MB).
Diagnosis: Linux machines with more than 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.

UND & UW (4005 jobs on 1460 CPUs):
Jobs fail on machines with less than 4 MB of disk.
Diagnosis: Condor failed in an unusual way when a job transfers input files that don’t fit.

Many Open Problems

Strengths and Weaknesses of Approach:
Correlation != Causation -> could be enough?
Limits of reported data -> increase resolution?
Not enough data points -> direct job placement?

Acting on Information:
Steering by the end user.
Applying learned rules back to the system.
Evaluating (and sometimes abandoning) changes.
Creating tools that assist with “digging deeper.”

Data Mining Research:
Continuous intake + incremental construction.
Creating results that non-specialists can understand.
Just Getting Started

Douglas Thain
University of Notre Dame
[email protected]

We like to collect things:
Obscure failure modes.
War stories about how the bugs were found.
Log files from big batch runs.
