Transcript
Grid Computing in Data Mining
and
Data Mining on Grid Computing
David Cieslak ([email protected])
Advisor: Nitesh Chawla ([email protected])
University of Notre Dame
Grid Computing in Data Mining
How you help me
Data Mining Primer
Data Mining:"The non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data".
-Fayyad, Piatetsky-Shapiro & Smyth, 1996.
Classifier: Learning algorithm which trains a
predictive model from data
Ensemble: A set of classifiers working
together to improve prediction
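To make the classifier/ensemble distinction concrete, here is a minimal Python sketch (a toy illustration only; the experiments in this talk used RIPPER, not this model) of three trivial classifiers combined by majority vote:

    # A toy "ensemble": several trained classifiers whose individual
    # predictions are combined, here by simple majority vote.
    # All thresholds below are made up for illustration.

    def make_threshold_classifier(feature_index, threshold):
        # A trivial "trained model": predict 1 if the feature clears the threshold.
        def classify(example):
            return 1 if example[feature_index] >= threshold else 0
        return classify

    ensemble = [
        make_threshold_classifier(0, 5.0),
        make_threshold_classifier(1, 2.5),
        make_threshold_classifier(2, 0.7),
    ]

    def ensemble_predict(example):
        # Majority vote across the member classifiers.
        votes = sum(clf(example) for clf in ensemble)
        return 1 if votes > len(ensemble) / 2 else 0

    print(ensemble_predict([6.1, 1.9, 0.9]))  # two of three members vote 1 -> 1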
Applications of Data Mining
Network Intrusion Detection
Categorizing Adult Income
Finding Calcifications in Mammography
Looking for Oil Spills
Identifying Handwritten Digits
Predicting Job Failure on a Computing Grid
Anticipating Successful Companies
Condor Makes DM Tractable
I use a small set of algorithms in high volume
  Ex: run the same classifier on many datasets
A single data mining operation may have easily parallelized segments
  Ex: learn an ensemble of 10 classifiers on one dataset
Introducing simple parallelism into data mining saves significant time
Common DM Task: 10 Fold CV
[Figure: the original ~30 MB network traffic dataset is split into 10 training folds (~27 MB each) and 10 corresponding testing folds (~3 MB each).]
For each fold i: the learning algorithm (RIPPER) trains a classifier on training fold i (~27 MB, ~2 hours), then the classifier is evaluated on testing fold i (~3 MB, < 1 min).
Finally, various statistics and measures are averaged and aggregated across the folds.
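The fold construction itself is mechanical; here is a minimal Python sketch of the split (assuming the dataset fits in memory as a list of examples; the actual experiments worked with files on disk):

    # Split a dataset into 10 (train, test) fold pairs for cross-validation.
    # Each example lands in exactly one test fold and in the other nine
    # training folds, giving the ~27 MB train / ~3 MB test split above.
    import random

    def ten_fold_splits(dataset, seed=42):
        examples = list(dataset)
        random.Random(seed).shuffle(examples)          # shuffle once, reproducibly
        folds = [examples[i::10] for i in range(10)]   # 10 roughly equal parts
        for i in range(10):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

    # Sanity check on 1000 dummy records: 900 train / 100 test per fold.
    for train, test in ten_fold_splits(range(1000)):
        assert len(train) == 900 and len(test) == 100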
Using Condor on 10 Folds
[Figure: a local host submitting fold tasks to a Condor pool.]
Local host: splits the data, uploads data and tasks to the pool (~5 mins)
Condor pool: learns a classifier and evaluates it for each fold, returns results (~2 hours, with folds running in parallel)
Local host: receives results, aggregates/averages (~5 mins)
With a ~2-hour learn/evaluate time per fold (as above), Condor saves up to 18 hours of real time versus running the 10 folds serially.
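In practice, each fold becomes one Condor job. Here is a minimal sketch of the submission step, where the wrapper script run_fold.sh, the .arff file names, and folds.submit are hypothetical stand-ins for whatever the real experiment uses:

    # Write a Condor submit description that queues one job per fold,
    # then hand it to condor_submit. $(Process) ranges over 0..9.
    import subprocess
    import textwrap

    SUBMIT = textwrap.dedent("""\
        universe                = vanilla
        executable              = run_fold.sh
        arguments               = $(Process)
        transfer_input_files    = train_$(Process).arff, test_$(Process).arff
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        output                  = fold_$(Process).out
        error                   = fold_$(Process).err
        log                     = folds.log
        queue 10
    """)

    with open("folds.submit", "w") as f:
        f.write(SUBMIT)

    subprocess.run(["condor_submit", "folds.submit"], check=True)  # queue 10 jobs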
A More Complex DM Task
Over/Under Sampling Wrapper
1. Split data into 50 folds (single)
2. Generate 10 undersamplings and 20 oversamplings per fold (pool); see the sampling sketch after this list
3. Learn a classifier on each undersampling (pool)
4. Evaluate and select the best undersampling (single)
5. Learn a classifier combining the best undersampling with each oversampling (pool)
6. Evaluate the best combination (single)
7. Obtain results on the test folds (pool)
8. Aggregate/average the results (single)
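As referenced in step 2, the sampling step can be as simple as the following Python sketch (plain random under- and oversampling for illustration; the wrapper's actual sampling methods may differ):

    # Generate random undersamplings and oversamplings of an imbalanced fold.
    # 'fold' is a list of (features, label) pairs; label 1 is the minority class.
    import random

    def undersample(data, keep_fraction, rng):
        # Keep all minority examples; keep only a fraction of the majority.
        majority = [d for d in data if d[1] == 0]
        minority = [d for d in data if d[1] == 1]
        kept = rng.sample(majority, int(len(majority) * keep_fraction))
        return kept + minority

    def oversample(data, multiplier, rng):
        # Duplicate minority examples (sampling with replacement).
        minority = [d for d in data if d[1] == 1]
        extra = [rng.choice(minority) for _ in range(len(minority) * multiplier)]
        return data + extra

    rng = random.Random(0)
    fold = [([0.1], 0)] * 90 + [([0.9], 1)] * 10   # toy 90/10 imbalanced fold
    undersamplings = [undersample(fold, k / 10.0, rng) for k in range(1, 11)]
    oversamplings = [oversample(fold, m, rng) for m in range(1, 21)]
    print(len(undersamplings), len(oversamplings))  # -> 10 20, as in step 2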
Condor Speed-Ups & Usage
10 Fold CV Evaluation: single machine, roughly one day; using Condor, under one hour
Over/Under Sampling Wrapper: single machine, days to weeks; using Condor, under a day
In 2006, I used 471,126 CPU hours via Condor
I am "slacking" in 2007: 13,235 CPU hours
A Data Miner's Wishlist
User specifies the task to the system
Outlines the task's serial phases
System "smartly" divides the labor
What is the logical task granule, based on: (see the sizing sketch after this list)
  Condor pool performance
  Upload/download latency
  Data size
  Algorithm complexity
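As a rough illustration of the granule question, here is a minimal Python sketch (all constants are hypothetical) that picks how many work units to pack into one task so that the fixed per-task overhead stays below a tolerance:

    # Pick a task granule: pack enough work units into one Condor task that
    # the fixed per-task overhead (matchmaking, upload/download latency)
    # stays below a tolerated fraction of total task time.
    # All constants below are hypothetical.
    import math

    OVERHEAD_SEC = 300.0         # ~5 min of scheduling + data transfer per task
    UNIT_COMPUTE_SEC = 60.0      # compute time for one work unit (e.g., one sampling)
    MAX_OVERHEAD_FRACTION = 0.1  # tolerate at most 10% overhead per task

    def units_per_task(overhead, unit_cost, max_fraction):
        # overhead / (overhead + n * unit_cost) <= f
        #   =>  n >= overhead * (1 - f) / (f * unit_cost)
        n = overhead * (1 - max_fraction) / (max_fraction * unit_cost)
        return max(1, math.ceil(n))

    print(units_per_task(OVERHEAD_SEC, UNIT_COMPUTE_SEC, MAX_OVERHEAD_FRACTION))  # -> 45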
Data Mining on Grid Computing
How I help you
It's Ugly in the Real World
Machine related failures: power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...
Job related failures: crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...
Incompatibilities between jobs and machines: missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...
Load related failures: slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...
Non-deterministic failures: multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
A “Grand Challenge” Problem:
A user submits one million jobs to the grid.
Half of them fail.
Now what?
Examine the output of every failed job?
Log in to every site to examine the logs?
Resubmit and hope for the best?
We need some way of getting the big picture.
Need to identify problems not seen before.
An Idea:
We have lots of structured information about the
components of a grid.
Can we perform some form of data mining to
discover the big picture of what is going on?
User: Your jobs work fine on RH Linux 12.1 and 12.3 but
they always seem to crash on version 12.2.
Admin: User “joe” is running 1000s of jobs that transfer 10
TB of data that fail immediately; perhaps he needs help.
Can we act on this information to improve the
system?
User: Avoid resources that are not working for you.
Admin: Assist the user in understanding and fixing the problem.
Job ClassAd:
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ExecutableSize = 20000
ImageSize = 40000
DiskUsage = 110000
NiceUser = FALSE
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
...

Machine ClassAd:
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
JobUniverse = 1
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
...

User Job Log:
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
...
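Turning these ads into data-mining features is mostly mechanical; here is a minimal Python sketch that parses "Name = Value" ClassAd lines into a flat feature record (simplified: real ClassAds also contain expressions like CpuBusy above, which this toy parser keeps as strings):

    # Parse simple 'Name = Value' ClassAd lines into a feature dict.
    # Numeric values become floats; TRUE/FALSE become booleans; everything
    # else (strings, unevaluated expressions) is kept verbatim as a string.

    def parse_classad(text):
        ad = {}
        for line in text.splitlines():
            if "=" not in line:
                continue                      # skip non-attribute lines
            name, _, value = line.partition("=")
            name, value = name.strip(), value.strip().strip('"')
            if value in ("TRUE", "FALSE"):
                ad[name] = (value == "TRUE")
            else:
                try:
                    ad[name] = float(value)
                except ValueError:
                    ad[name] = value          # string or expression
        return ad

    ad = parse_classad('MyType = "Machine"\nMemory = 498\nNiceUser = FALSE')
    print(ad)  # {'MyType': 'Machine', 'Memory': 498.0, 'NiceUser': False}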
[Figure: user job logs, job ads, and machine ads are joined; each job's records are labeled into a Success Class or a Failure Class, then fed to data mining.]
Failure criteria: exit != 0, core dump, evicted, suspended, bad output.
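A minimal Python sketch of the labeling step under those criteria (the field names marked hypothetical are illustrative; eviction and suspension would come from the user job log rather than the ad itself):

    # Label one job record "success" or "failure" using the slide's criteria:
    # nonzero exit, core dump, eviction, suspension, or bad output => failure.

    def label_job(record):
        failed = (
            record.get("ExitStatus", 0) != 0
            or record.get("ExitBySignal", False)   # crashed / dumped core
            or record.get("WasEvicted", False)     # from user job log (hypothetical field)
            or record.get("WasSuspended", False)   # from user job log (hypothetical field)
            or record.get("BadOutput", False)      # from output validation (hypothetical field)
        )
        return "failure" if failed else "success"

    print(label_job({"ExitStatus": 1}))                      # failure
    print(label_job({"ExitStatus": 0, "WasEvicted": True}))  # failure
    print(label_job({"ExitStatus": 0}))                      # success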
Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.

------------------------- run 1 -------------------------
Hypothesis:
exit1 :- Memory>=1930, JobStart>=1.14626e+09, MonitorSelfTime>=1.14626e+09 (491/377)
exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
default exit0 (11904/4503).
Error rate on holdout data is 30.9852%
Running average of error rate is 30.9852%
------------------------- run 2 -------------------------
Hypothesis:
exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
default exit0 (12090/4606).
Error rate on holdout data is 31.8791%
Running average of error rate is 31.4322%
------------------------- run 3 -------------------------
Hypothesis:
exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e+09 (1270/1050).
exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e+09, VirtualMemory>=2.09646e+06, LoadAvg>=30000, LastBenchmark<=1.14623e+09, MonitorSelfImageSize<=7.836e+09 (94/84).
exit1 :- Memory>=1927, TotalLoadAvg<=1.43e+06, UpdatesTotal<=8069, LastBenchmark<=1.14619e+09, UpdatesLost<=1 (77/61).
default exit0 (11940/4452).
Error rate on holdout data is 31.8111%
Running average of error rate is 31.5585%
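Each learned rule is just a conjunction of threshold tests over machine-ad attributes; here is a minimal Python sketch applying one rule from run 1 above to a parsed ad:

    # Apply the learned RIPPER rule "exit1 :- Memory>=1930, Disk<=555320":
    # if every condition holds, predict class exit1 (job exits nonzero);
    # otherwise fall through to the default class exit0.

    def rule_memory_disk(ad):
        return ad.get("Memory", 0) >= 1930 and ad.get("Disk", float("inf")) <= 555320

    def predict(ad):
        return "exit1" if rule_memory_disk(ad) else "exit0"

    print(predict({"Memory": 2048, "Disk": 500000}))    # exit1: big-memory, small-disk machine
    print(predict({"Memory": 498, "Disk": 19072712}))   # exit0 (default)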
Unexpected Discoveries
Purdue Teragrid (91343 jobs on 2523 CPUs): jobs fail on machines with Memory > 1920 MB.
Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
UND & UW (4005 jobs on 1460 CPUs): jobs fail on machines with less than 4 MB of disk.
Diagnosis: Condor failed in an unusual way when a job transferred input files that did not fit.
Many Open Problems
Strengths and weaknesses of the approach:
  Correlation != causation -> could be enough?
  Limits of reported data -> increase resolution?
  Not enough data points -> direct job placement?
Acting on information:
  Steering by the end user.
  Applying learned rules back to the system.
  Evaluating (and sometimes abandoning) changes.
  Creating tools that assist with "digging deeper."
Data mining research:
  Continuous intake + incremental construction.
  Creating results that non-specialists can understand.
Acknowledgements
Dr. Thain (University of Notre Dame): local Condor expert; provided some of the slides used in this presentation
Cooperative Computing Lab: maintains and improves the local Condor pool; provides computing resources
Condor Related Publications
D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," HPDC-15, June 2006.
N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," AAAI Workshop on Evaluation Methods in Machine Learning, July 2006.
N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under revision.
Questions?