Attacking Data Intensive Science with Distributed Computing


Attacking Data Intensive Science with Distributed Computing
Prof. Douglas Thain
University of Notre Dame
http://www.cse.nd.edu/~dthain
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Plentiful Computing Power
As of 04 Sep 2006:
Teragrid
– 21,972 CPUs / 220 TB / 6 sites
Open Science Grid
– 21,156 CPUs / 83 TB / 61 sites
Condor Worldwide:
– 96,352 CPUs / 1608 sites
At Notre Dame:
– CRC: 500 CPUs
– BOB: 212 CPUs
– Lots of little clusters!
Who is using all of this?
Anyone with unlimited computing needs!
High Energy Physics:
– Simulating the detector of a particle accelerator before
turning it on allows one to understand its output.
Biochemistry:
– Simulate complex molecules under different forces to
understand how they fold/mate/react.
Biometrics:
– Given a large database of human images, evaluate
matching algorithms by comparing all to all.
Climatology:
– Given a starting global climate, simulate how climate
develops under varying assumptions or events.
Buzzwords
Distributed Computing
Cluster Computing
Beowulf
Grid Computing
Utility Computing
Something@Home
= A bunch of computers.
Some Outstanding Successes
TeraGrid:
– AMANDA project uses 1000s of CPUs over months to
calibrate and process data from a neutrino telescope.
PlanetLab:
– Hundreds of nodes used to test and validate a wide
variety of dist. and P2P systems: Chord, Pastry, etc...
Condor:
– MetaNEOS project solves a 30-year-old optimization
problem using brute force on 1000 heterogeneous
CPUs across multiple sites over several weeks.
Seti@Home:
– Millions of CPUs used to analyze celestial signals.
And now the bad news...
Large distributed systems
fall to pieces
when you have lots of data!
Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors)
The Grid3 Production Grid
Principles and Practice
IEEE HPDC 2004
The Grid2003 Project has deployed a multi-virtual
organization, application-driven grid laboratory
that has sustained for several months the
production-level services required by…
ATLAS, CMS, SDSS, LIGO…
Problem: Data Management
The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
The bad news:
– 40-70 percent utilization
– 30 percent of jobs would fail
– 90 percent of failures were site problems
– Most site failures were due to disk space!
Problem: Debugging
“Most groups reported problems in which a
job had been submitted... and something
had not performed correctly, but they were
unable to determine where, why, or how to
fix that problem...”
Jennifer Schopf and Steven Newhouse,
“State of Grid Users: 25 Conversations with UK eScience Users”
Argonne National Lab Tech Report ANL/MCS-TM-278, 2004.
Both Problems: Debugging I/O
A user submits 1000 jobs to a grid.
Each requires 1 GB of input.
100 start at once. (Quite normal.)
The interleaved transfers all fail.
The “robust” system retries...
(Happened last week in this department!)
A Common Thread:
Each of these problems:
– “I can’t make storage do what I want!”
– “I have no idea why this system is failing!”
Arises from the following:
– Both service providers and users lack the tools and
models they need to harness and analyze complex
environments.
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Cooperative Computing Lab
at the University of Notre Dame
Basic Computer Science Research
– Overlapping categories: Operating systems, distributed
systems, grid computing, filesystems, databases...
Modest Local Operation
– 300 CPUs, 20 TB of storage, 6 stakeholders
– Keeps us honest + we eat our own dog food.
Software Development and Publication
– http://www.cctools.org
– Students learn engineering as well as science.
Collaboration with External Users
– High energy physics, bioinformatics, molecular dynamics...
http://www.cse.nd.edu/~ccl
Computing Environment
[Diagram: a Condor Match Maker connects queued jobs to CPUs and disks across the Miscellaneous CSE Workstations, the Fitzpatrick Workstation Cluster, the CVRL Research Cluster, and the CCL Research Cluster. Example machine policies: "I will only run jobs between midnight and 8 AM", "I will only run jobs when there is no-one working at the keyboard", "I prefer to run a job submitted by a CCL student."]
CPU History
Storage History
Flocking Between Universities
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Notre Dame: 300 CPUs
Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/
Problems and Solutions
“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
Why is Storage Hard?
Easy within one cluster:
– Shared filesystem on all nodes.
– But, limited to a few disks provided by admin.
– Even a “macho” file server has limited BW.
Terrible across two or more clusters:
– No shared filesystem on all nodes.
– Too hard to move data back and forth.
– Limited to using storage on head nodes.
– Unable to become root to configure.
Conventional Clusters
[Diagram: a conventional cluster, where many CPU nodes all depend on a small number of shared disks.]
Tactical Storage Systems (TSS)
A TSS allows any node to serve as a file server
or as a file system client.
All components can be deployed without special
privileges – but with standard grid security (GSI)
Users can build up complex structures.
– Filesystems, databases, caches, ...
– Admins need not know/care about larger structures.
Takes advantage of two resources:
– Total Storage (200 disks yield 20 TB)
– Total Bandwidth (200 disks at 10 MB/s = 2 GB/s)
Tactical Storage System
[Diagram: two clusters of CPU+Disk nodes, each aggregated into a Logical Volume.
1 – Uniform access between any nodes in either cluster.
2 – Ability to group together multiple disks for a common purpose.]
Tactical Storage Structures
[Diagram: applications reach file servers and their disks through Adapters, secured by Grid GSI credentials. The servers compose into three structures: a WAN File System, a Replicated File System (scalable bandwidth for small data), and an Expandable File System built on a Logical Volume (scalable capacity/BW for large data).]
Applications and Examples
Bioinformatics:
– A WAN Filesystem for BLAST on EGEE grid.
Atmospheric Physics
– A cluster filesystem for scalable data analysis.
Biometrics:
– Distributed I/O for high-throughput image comparison.
Molecular Dynamics:
– GEMS: Scalable distributed database.
High Energy Physics:
– Global access to software distributions.
Simple Wide Area File System
Bioinformatics on the European Grid
– Users want to run BLAST on standard DBs.
– Cannot copy every DB to every node of the grid!
Many databases of biological data in different
formats around the world:
– Archives: Swiss-Prot, TreMBL, NCBI, etc...
– Replicas: Public, Shared, Private, ???
Goal: Refer to data objects by logical name.
– Access the nearest copy of the non-redundant protein
database; don’t care where it is.
Credit: Christophe Blanchet, Bioinformatics Center of Lyon, CNRS IBCP, France
http://gbio.ibcp.fr/cblanchet, [email protected]
Wide Area File System
[Diagram: BLAST is asked to run on LFN://ncbi.gov/nr.data. The open(LFN://ncbi.gov/nr.data) call goes through the Adapter to the EGEE File Location Service ("Where is LFN://ncbi.gov/nr.data?"), which replies "Find it at: FTP://ibcp.fr/nr.data". The Adapter then issues open(FTP://ibcp.fr/nr.data) and retrieves nr.data from the remote FTP server (RETR nr.data); it speaks RFIO, gLite, HTTP, and FTP to the corresponding servers.]
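To make the flow in the diagram concrete, here is a minimal Python sketch of the lookup-then-open pattern. The location-service endpoint and the resolve_lfn helper are illustrative assumptions, not the actual EGEE or Parrot interfaces.

# A minimal sketch of the lookup-then-open flow shown above.
# The location-service URL and its query format are hypothetical.
import urllib.parse
import urllib.request

LOCATION_SERVICE = "http://catalog.example.org/resolve"  # hypothetical endpoint

def resolve_lfn(lfn):
    """Ask a file location service where the nearest copy of a logical name lives."""
    query = LOCATION_SERVICE + "?lfn=" + urllib.parse.quote(lfn)
    with urllib.request.urlopen(query) as response:
        return response.read().decode().strip()     # e.g. "ftp://ibcp.fr/nr.data"

def open_logical(lfn):
    """Open a logical file name by first resolving it to a physical URL."""
    physical_url = resolve_lfn(lfn)
    return urllib.request.urlopen(physical_url)     # urllib handles http:// and ftp://

if __name__ == "__main__":
    with open_logical("lfn://ncbi.gov/nr.data") as f:
        header = f.read(1024)   # the application reads nr.data as if it were local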
Performance of Bio Apps on EGEE
[Chart: runtime (sec) versus protein database size (up to 1,200,000 sequences) for BLAST, SSearch, and FastA, each run either with Parrot-based remote access or with a full copy of the database.]
Expandable Filesystem for Experimental Data
Project GRAND – http://www.nd.edu/~grand
Credit: John Poirier @ Notre Dame Astrophysics Dept.
[Diagram: with only a single buffer disk, the analysis code can only analyze the most recent data. The detector produces 2 GB/day today (could be lots more!), which passes through the buffer disk onto daily tapes that make up a 30-year archive.]
Expandable Filesystem for Experimental Data
Project GRAND – http://www.nd.edu/~grand
Credit: John Poirier @ Notre Dame Astrophysics Dept.
[Diagram: with an Adapter presenting a Logical Volume built from several file servers alongside the buffer disk, the daily tapes, and the 30-year archive, the analysis code can analyze all data over large time scales. The input rate is still 2 GB/day today and could be lots more.]
Scalable I/O for Biometrics
Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for
identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up
clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE
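As a concrete illustration of the test described above, this sketch evaluates a matching function F over every pair of images in a set S and fills in the similarity matrix. The byte-level loader, the toy scoring function, and the file names are stand-ins for illustration, not the lab's actual code.

# Minimal all-pairs evaluation of a matching function F over an image set S.
# load_image() and the toy F below are illustrative stand-ins.
from pathlib import Path

def load_image(path):
    """Stand-in loader: here an 'image' is just the file's raw bytes."""
    return Path(path).read_bytes()

def F(a, b):
    """Toy matching function: fraction of positions where the bytes agree."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def all_pairs(paths):
    """Compute the full |S| x |S| similarity matrix M[i][j] = F(S_i, S_j)."""
    images = [load_image(p) for p in paths]
    n = len(images)
    return [[F(images[i], images[j]) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    S = ["img0.dat", "img1.dat", "img2.dat"]   # hypothetical image files
    M = all_pairs(S)
    for row in M:
        print(" ".join(f"{v:.2f}" for v in row))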
Computing Similarities
[Figure: applying F to every pair of images in the set yields a similarity matrix, with 1 on the diagonal and scores such as .8, .7, .1, and 0 off the diagonal.]
A Big Data Problem
Data Size: 10k images of 1MB = 10 GB
Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we
must be careful to partition both the I/O
and the CPU needs, taking advantage of
distributed capacity.
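Spelling out the I/O estimate above: each of the 10,000 x 10,000 ordered pairs reads two 1 MB images, and the factor of one half removes the symmetric duplicates.

\[
\text{Total I/O} = \underbrace{10^4 \times 10^4}_{\text{pairs}} \times \underbrace{2\ \text{MB}}_{\text{two images}} \times \frac{1}{2} = 10^8\ \text{MB} = 100\ \text{TB}
\]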
Conventional Solution
[Diagram: in the conventional solution, every job pulls its input across the network from a central set of disks, moving 200 TB at runtime!]
Using Tactical Storage
[Diagram:
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.]
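A minimal sketch of those three steps, with the chunk size, host names, and "nearest copy" rule chosen purely for illustration:

# Sketch of the three steps above: chunk the data array, replicate chunks
# to several disks, and have each job read from a nearby copy.
import random

CHUNK_SIZE = 1 << 20                                # 1 MB chunks (step 1)
HOSTS = ["disk00", "disk01", "disk02", "disk03"]    # hypothetical storage nodes

def chunk(data):
    """Step 1: break the array into MB-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def replicate(chunks, copies=2):
    """Step 2: record which hosts hold a copy of each chunk."""
    return {idx: random.sample(HOSTS, copies) for idx in range(len(chunks))}

def pick_nearby(placement, idx, local_host):
    """Step 3: a job prefers a copy on its own host, else any replica."""
    replicas = placement[idx]
    return local_host if local_host in replicas else replicas[0]

if __name__ == "__main__":
    data = bytes(5 * CHUNK_SIZE)        # a stand-in for the image array
    placement = replicate(chunk(data))
    print(pick_nearby(placement, 0, "disk02"))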
Problems and Solutions
“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
It’s Ugly in the Real World
Machine related failures:
– Power outages, network outages, faulty memory, corrupted file
system, bad config files, expired certs, packet filters...
Job related failures:
– Crash on some args, bad executable, missing input files,
mistake in args, missing components, failure to understand
dependencies...
Incompatibilities between jobs and machines:
– Missing libraries, not enough disk/cpu/mem, wrong software
installed, wrong version installed, wrong memory layout...
Load related failures:
– Slow actions induce timeouts; kernel tables: files, sockets,
procs; router tables: addresses, routes, connections;
competition with other users...
Non-deterministic failures:
– Multi-thread/CPU synchronization, event interleaving across
systems, random number generators, interactive effects,
cosmic rays...
A “Grand Challenge” Problem:
A user submits one million jobs to the grid.
Half of them fail.
Now what?
– Examine the output of every failed job?
– Login to every site to examine the logs?
– Resubmit and hope for the best?
We need some way of getting the big picture.
Need to identify problems not seen before.
An Idea:
We have lots of structured information about the
components of a grid.
Can we perform some form of data mining to
discover the big picture of what is going on?
– User: Your jobs work fine on RH Linux 12.1 and 12.3
but they always seem to crash on version 12.2.
– Admin: Joe is running 1000s of jobs with 10 TB of
data that fail immediately; perhaps he needs help.
Can we act on this information?
– User: Avoid resources that aren’t working for you.
– Admin: Assist the user in understanding and fixing the
problem.
Job ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ExecutableSize = 20000
ImageSize = 40000
DiskUsage = 110000
NiceUser = FALSE
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
RootDir = "/"
...

Machine ClassAd
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
JobUniverse = 1
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
KeyboardIdle = 817093
ConsoleIdle = 817093
StartdIpAddr = "<129.74.153.164:9453>"

User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
[Diagram: Job Ads, Machine Ads, and the User Job Log are joined and labeled into a Success Class and a Failure Class (failure criteria: exit != 0, core dump, evicted, suspended, bad output), then fed to the RIPPER rule learner, which reports findings such as: "Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2."]
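The pipeline in the diagram can be imitated in a few lines: label each joined (job ad, machine ad) record by the failure criteria, then hand the labeled table to a rule learner. The sketch below uses scikit-learn's decision tree as a stand-in for RIPPER, and the records and their fields are invented for illustration.

# Sketch of the classify-then-learn pipeline above.
# A decision tree stands in for RIPPER; the records and fields are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each record joins a job ad with the machine ad it ran on, plus its outcome.
records = [
    {"os_version": 12.1, "memory_mb": 1024, "exit_code": 0, "evicted": 0},
    {"os_version": 12.2, "memory_mb": 2048, "exit_code": 1, "evicted": 0},
    {"os_version": 12.2, "memory_mb": 1024, "exit_code": 1, "evicted": 0},
    {"os_version": 12.3, "memory_mb": 2048, "exit_code": 0, "evicted": 1},
]

def is_failure(r):
    """Failure criteria from the diagram: nonzero exit, eviction, etc."""
    return r["exit_code"] != 0 or r["evicted"] == 1

X = [[r["os_version"], r["memory_mb"]] for r in records]
y = ["failure" if is_failure(r) else "success" for r in records]

learner = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(learner, feature_names=["os_version", "memory_mb"]))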
Unexpected Discoveries
Purdue Teragrid (91343 jobs on 2523 CPUs)
– Jobs fail on machines with (Memory>1920MB)
– Diagnosis: Linux machines with > 3GB have a
different memory layout that breaks some programs
that do inappropriate pointer arithmetic.
UND & UW (4005 jobs on 1460 CPUs)
– Jobs fail on machines with less than 4MB disk.
– Diagnosis: Condor failed in an unusual way when a job
transferred input files that did not fit on the local disk.
Many Open Problems
Strengths and Weaknesses of Approach
– Correlation != Causation -> could be enough?
– Limits of reported data -> increase resolution?
– Not enough data points -> direct job placement?
Acting on Information
– Steering by the end user.
– Applying learned rules back to the system.
– Evaluating (and sometimes abandoning) changes.
Data Mining Research
– Continuous intake + incremental construction.
– Creating results that non-specialists can understand.
Next Step: Monitor 21,000 CPUs on the OSG!
Problems and Solutions
“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Some Ruminations
These tools attack technical problems.
But, users still have to be clever:
– Where should my jobs run?
– How should I partition data?
– How long should I run before a checkpoint?
Can we provide an interface such that:
– Scientific users state what to compute.
– The system decides where, when, and how.
Previous attempts didn’t incorporate data.
The All-Pairs Abstraction
All-Pairs:
– For a set S and a function F:
– Compute F(Si,Sj) for all Si and Sj in S.
The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.
The computing system determines:
– Optimal decomposition in time and space.
– What resources to employ. (F easy to distr.)
– What to do when failures occur.
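To make that division of labor concrete, here is a minimal sketch of what an All-Pairs call might look like from the user's side: the user supplies the set S (a list of files) and the function F (a self-contained program), and the system decides how to decompose the pairs and when to retry. The batching and retry policy below are simplified assumptions, not the actual facility.

# Sketch of the All-Pairs abstraction from the user's point of view.
import itertools
import subprocess

def all_pairs(S, F_program, batch_size=1000, max_retries=3):
    """Return {(a, b): score} for every pair in S x S."""
    pairs = list(itertools.product(S, repeat=2))
    results = {}
    # The system, not the user, decides how to split the work...
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        for a, b in batch:                      # ...where each comparison runs...
            for attempt in range(max_retries):  # ...and what to do on failure.
                proc = subprocess.run([F_program, a, b],
                                      capture_output=True, text=True)
                if proc.returncode == 0:
                    results[(a, b)] = float(proc.stdout.strip())
                    break
    return results

# Usage: all_pairs(["img0.jpg", "img1.jpg"], "./compare_faces")
# where ./compare_faces prints a similarity score for its two arguments.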
An All-Pairs Facility at Notre Dame
[Diagram: an All Pairs Web Portal in front of 100s-1000s of machines, each running F on its CPU with a local Disk.
1 – User uploads S and F into the system.
2 – Backend decides where to run, how to partition, when to retry failures...
3 – Return result matrix to user.]
Our Mode of Research
Find researchers with systems problems.
Solve them by developing new tools.
Generalize the solutions to new domains.
Publish papers and software!
Acknowledgments
Science Collaborators:
– Christophe Blanchet
– Sander Klous
– Peter Kunszt
– Erwin Laure
– John Poirier
– Igor Sfiligoi
– Francesco Delli Paoli
CSE Students:
– Paul Brenner
– Tim Faltemier
– James Fitzgerald
– Jeff Hemmes
– Chris Moretti
– Gerhard Niederwieser
– Phil Snowberger
– Justin Wozniak
CSE Faculty:
– Jesus Izaguirre
– Aaron Striegel
– Patrick Flynn
– Nitesh Chawla
For more information...
Cooperative Computing Lab
http://www.cse.nd.edu/~ccl
Cooperative Computing Tools
http://www.cctools.org
Douglas Thain
– [email protected]
– http://www.cse.nd.edu/~dthain