Transcript PowerPoint
High Level Abstractions
for
Data-Intensive Computing
Christopher Moretti, Hoang Bui,
Brandon Rich, and Douglas Thain
University of Notre Dame
4/30/2008
Christopher Moretti – University of Notre Dame
1
Computing’s central challenge,
“How not to make a mess of it,”
has not yet been met.
-Edsger Dijkstra
4/30/2008
Christopher Moretti – University of Notre Dame
2
Overview
Many systems today give end users access to hundreds
or thousands of CPUs.
But, it is far too easy for the naive user to create a big
mess in the process.
Our Solution:
Deploy high-level abstractions that describe both
data and computation needs.
Some examples of current work:
All-Pairs: An abstraction for biometric workloads.
Distributed Ensemble Classification
DataLab: A system and language for data-parallel
computation.
4/30/2008
Christopher Moretti – University of Notre Dame
3
Distributed Computing is Hard!
Which
resources
?
How do I fit
my
workload
into jobs?
What about
job input
data?
How can I
measure job
stats?
4/30/2008
What is
Condor?
How
Many?
What happens
when things
fail?
How long will
it take?
What do I do
with the
results?
Christopher Moretti – University of Notre Dame
5
Distributed Computing is Hard!
Which
resources
?
How do I fit
my
workload
into jobs?
What about
job input
data?
How can I
measure job
stats?
What is
Condor?
How
Many?
What happens
when things
fail?
ARGH!
4/30/2008
How long will
it take?
What do I do
with the
results?
Christopher Moretti – University of Notre Dame
6
The All-Pairs Problem
S2
1
All-Pairs(
Set S1,
Set S2,
Function F
)
yields a matrix M:
Mij = F(S1i,S2j)
S2
2
S2
3
S2
4
S2
5
S2
6
S2
7
S1
1
S1
2
S1
3
S1
4
S1
5
S1
6
60K 20KB images >1GB
3.6B comparisons
@ 50/s = 2.3 CPUYrs
x 8B output = 29GB
4/30/2008
S1
7
Christopher Moretti – University of Notre Dame
11
Biometric All-Pairs Comparison
1
.8
.1
0
0
.1
1
0
.1
.1
0
1
0
.1
.7
1
F
1
0
.1
1
4/30/2008
Christopher Moretti – University of Notre Dame
12
Naïve Mistakes
Computing Problem: Even expert users don’t
know how to tune jobs optimally, and can make 100
CPUs even slower than one by overloading the file
server, network, or resource manager.
For all $X
:
For all $Y
:
cmp $X to $Y
Batch System
CPU
4/30/2008
CPU
CPU
CPU
Each CPU
reads 10TB!
file
server
Christopher Moretti – University of Notre Dame
13
Consequences of Naïve Mistakes
4/30/2008
Christopher Moretti – University of Notre Dame
14
All Pairs Abstraction
set S of files
binary function F
F
invocation
M = AllPairs(F,S)
4/30/2008
Christopher Moretti – University of Notre Dame
15
All-Pairs Production System at Notre Dame
Web Portal
F
1 - Upload F and S
into web portal.
S
G
H
T
300 active storage units
500 CPUs, 40TB disk
4 – Choose optimal partitioning
and submit batch jobs.
F
F
F
F
F
F
2 - AllPairs(F,S)
All-Pairs
Engine
6 - Return result
matrix to user.
4/30/2008
5 - Collect and
assemble results.
3 - O(log n) distribution
by spanning tree.
Christopher Moretti – University of Notre Dame
17
4/30/2008
Christopher Moretti – University of Notre Dame
18
4/30/2008
Christopher Moretti – University of Notre Dame
19
4/30/2008
Christopher Moretti – University of Notre Dame
20
Returning the Result Matrix
4.37
Too many files.
6.01
Hard to do prefetching.
2.22
4.37
7.13
8.94
6.72
1.34
…
…
…
0.98
Too large files.
Must scan entire file.
Row/Column ordered.
How can we build it?
4/30/2008
Christopher Moretti – University of Notre Dame
21
Result Storage by Abstraction
Chirp_array allows users to create, manage, modify
large arrays without having to realize underlying form.
Operations on chirp_array:
create a chirp_array
open a chirp_array
set value A[i,j]
get value A[i,j]
X
X
get row A[i]
get column A[j]
set row A[i]
set column A[j]
4/30/2008
Christopher Moretti – University of Notre Dame
CPU
CPU
CPU
Disk
Disk
Disk
22
Result Storage with chirp_array
chirp_array_get(i,j)
4/30/2008
Christopher Moretti – University of Notre Dame
CPU
CPU
CPU
Disk
Disk
Disk
23
Result Storage with chirp_array
chirp_array_get(i,j)
4/30/2008
Christopher Moretti – University of Notre Dame
CPU
CPU
CPU
Disk
Disk
Disk
24
Result Storage with chirp_array
chirp_array_get(i,j)
4/30/2008
Christopher Moretti – University of Notre Dame
CPU
CPU
CPU
Disk
Disk
Disk
25
Data Mining on Large Data Sets
Problem: Supercomputers are expensive, not all
scientists have access to them for completing very large
memory problems. Classification on large data sets
without sufficient memory can degrade throughput,
degrade accuracy, or fail outright.
4/30/2008
Christopher Moretti – University of Notre Dame
26
Data Mining Using Ensembles
training
data
partitioning/sampling
(optional)
algorithm 1
algorithm n
classifier 1
classifier n
test
instance
voting
classification
(From Steinhaeuser and Chawla, 2007)
4/30/2008
Christopher Moretti – University of Notre Dame
27
Data Mining Using Ensembles
training
data
partitioning/sampling
(optional)
algorithm 1
algorithm n
classifier 1
classifier n
test
instance
voting
classification
(From Steinhaeuser and Chawla, 2007)
4/30/2008
Christopher Moretti – University of Notre Dame
28
Abstraction for Ensembles Using Natural Parallelism
Here are my algorithms.
Here is my data set.
Here is my test set.
Abstraction Engine
Choose optimal partitioning
and submit batch jobs.
CPU
CPU
CPU
CPU
Local Votes
Return local votes for tabulation
and final prediction.
4/30/2008
Christopher Moretti – University of Notre Dame
29
DataLab Abstractions
file system
tcsh
emacs
perl
distributed data structures
set S
A
B
file F
function evaluation
Y = F(X)
C
job_start
job_commit
job_wait
job_remove
parrot
chirp
server
chirp
server
chirp
server
unix
filesys
unix
filesys
unix
filesys
4/30/2008
chirp
server
Christopher Moretti – University of Notre Dame
chirp
server
F
X
Y
30
DataLab Language Syntax
apply F on S into T
set S
A
chirp
server
4/30/2008
B
set T
F
C
A
chirp
server
chirp
server
chirp
server
F
F
F
Christopher Moretti – University of Notre Dame
B
C
chirp
server
31
For More Information
Christopher Moretti
[email protected]
Douglas Thain
[email protected]
Cooperative Computing Lab
http://cse.nd.edu/~ccl
4/30/2008
Christopher Moretti – University of Notre Dame
32