6.896 Project Presentations
6.895/SMA5509 Project
Presentations
October 15-17, 2003
Improving Cilk
Adaptively Parallel Processor
Allocation for Cilk Jobs
Kunal Agrawal
Siddhartha Sen
Goal
Design and implement a dynamic processor-allocation
system for adaptively parallel jobs (jobs for which the
number of processors that can be used without waste
varies during execution)
The problem of allocating processors to adaptively
parallel jobs is called the adaptively parallel
processor-allocation problem [2].
At any given time, each job j has a desire dj and an
allotment mj
Goal (cont.)
We want to design a processor-allocation
system that achieves a “fair” and “efficient”
allocation among all jobs.
– "fair" means that whenever a job receives fewer
processors than it desires, no other job receives
more than one more processor than this job
received [2]
– "efficient" means that no job receives more
processors than it desires [2]
Illustration of Problem
[Figure: P processors divided among jobs; for example, job j has desire dj = 6 and allotment mj = 5, while job k has desire dk = 10 and allotment mk = 6.]
Main Algorithmic Questions
Dynamically determine the desires of each job in the
system
– Heuristics on steal rate
– Heuristics on number of visible stack frames
– Heuristics on number of threads in ready deques
Dynamically determine the allotment for each job
such that the resulting allocation is fair and efficient
– SRLBA algorithm [2]
– Cilk macroscheduler algorithm [3]
– Lottery scheduling techniques [5]
Assumptions
All jobs on the system are Cilk jobs
Jobs can enter and leave the system and
change their parallelism during execution
All jobs are mutually trusting (they will stay
within the bound of their allotments and
communicate their desires honestly)
Each job has at least one processor to start
with
References
1. Supercomputing Technologies Group. Cilk 5.3.2 Reference Manual. MIT Lab for
Computer Science, November 2001.
2. B. Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts
Institute of Technology, January 1998.
3. R. D. Blumofe, C. E. Leiserson, and B. Song. Automatic processor allocation for
work-stealing jobs.
4. C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share
Resource Management. PhD thesis, Massachusetts Institute of Technology,
September 1995.
5. C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share
resource management. In Proceedings of the First Symposium on
Operating System Design and Implementation, pages 1-11. USENIX,
November 1994.
6. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and
Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the
Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.
Fast Serial-Append for Cilk
6.895 – Theory of Parallel Systems
Project Presentation (Proposal)
Alexandru Caracas
Serial-Append (cont'd)
● A Cilk dag
● Numbers represent the serial execution order of the threads
● For example:
  – Consider that all threads perform file I/O operations
Serial-Append (cont'd)
● The same Cilk dag
● Numbers represent the parallel execution order of the threads
● Goal:
  – We would like to execute the threads in parallel while still allowing the serial order to be reconstructed.
Serial-Append (cont'd)
● Partitioning of the Cilk dag
● Colors represent the I/O operations done by a processor between two steal operations.
Serial-Append
● The serial-append of the computation is obtained by reading the output in the order:
  – orange (I)
  – green (II)
  – dark-red (III)
  – purple (IV)
  – red (V)
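The reading order above can be mimicked with a toy sketch: each processor appends to a private segment between steals, each segment carries its position in the serial order, and the serial view is recovered by concatenating segments in tag order. The class and names below are invented for illustration; the proposal's real design (e.g., with B+-trees) would assign these tags during work stealing.

```python
class SerialAppendFile:
    """Sketch: each processor appends to its own segment; a segment is
    tagged with its position in the serial execution order."""
    def __init__(self):
        self.segments = []            # list of (serial_tag, buffer)

    def new_segment(self, serial_tag):
        buf = []
        self.segments.append((serial_tag, buf))
        return buf

    def serial_view(self):
        # concatenating segments in tag order recovers the serial order
        return [rec for _, buf in sorted(self.segments) for rec in buf]

f = SerialAppendFile()
# three "processors" write concurrently, each into its own segment
s2 = f.new_segment(2)
s0 = f.new_segment(0)
s1 = f.new_segment(1)
s2.append("c"); s0.append("a"); s1.append("b"); s0.append("a2")
```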
Project Goals
● Efficient serial-append algorithm
  – possibly using B+-trees
● Analyze possible implementations:
  – kernel level
  – formatted file
  – Cilk file system
● Implementation (one of the above)
● Port Cheerio to Cilk 5.4; comparison with own implementation
● Port serial applications to use serial-append (analyze their performance)
References
● Robert D. Blumofe and Charles E. Leiserson. Scheduling
multithreaded computations by work stealing. In Proceedings of
the 35th Annual Symposium on Foundations of Computer
Science, pages 356-368, Santa Fe, New Mexico, November 1994.
● Matthew S. DeBergalis. A parallel file I/O API for Cilk. Master's
thesis, Department of Electrical Engineering and Computer
Science, Massachusetts Institute of Technology, June 2000.
● Supercomputing Technology Group, MIT Laboratory for
Computer Science. Cilk 5.3.2 Reference Manual, November
2001. Available at http://supertech.lcs.mit.edu/cilk/manual5.3.2.pdf.
A Space-Efficient Global
Scheduler for Cilk
Jason Hickey
Tyeler Quentmeyer
Idea
Implement a new scheduler for Cilk based on
Narlikar and Blelloch’s Space-Efficient
Scheduling of Nested Parallelism paper
AsyncDF Scheduler
Idea: minimize peak memory usage by
preempting threads that try to allocate large
blocks of memory
Algorithm: Roughly, AsyncDF runs the P
left-most threads in the computation DAG
Example
In parallel for i = 1 to n
    Temporary B[n]
    In parallel for j = 1 to n
        F(B, i, j)
    Free B
Comparison to Cilk
Cilk (distributed):
  Run threads like a serial program until a processor runs out of work and has to steal.
  Space: p*S1
AsyncDF (global):
  Preempt a thread when it tries to allocate a large block of memory and replace it with a computationally heavy thread.
  Space: S1 + O(K*D*p)
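The two AsyncDF rules (run the left-most ready threads; preempt on a large allocation) can be illustrated with a toy model. This is emphatically not Narlikar and Blelloch's actual algorithm; the function name, the threshold, and the thread encoding are all invented here.

```python
import heapq

BIG = 100   # assumed preemption threshold on allocation size

def asyncdf_order(threads, P):
    """Toy model: threads[i] is the list of allocation sizes thread i
    performs, in serial (left-to-right) order. At each step the P
    left-most ready threads run; a thread whose next allocation exceeds
    BIG is preempted so other ready threads can run (and free memory)
    before it retries. Returns the executed (thread, alloc) sequence."""
    ready = list(range(len(threads)))
    heapq.heapify(ready)                      # left-most = smallest index
    progress = [0] * len(threads)
    deferred, trace = [], []
    while ready or deferred:
        if not ready:                         # only preempted threads remain
            for t in deferred:
                heapq.heappush(ready, t)
            deferred = []
        running = [heapq.heappop(ready) for _ in range(min(P, len(ready)))]
        for t in running:
            size = threads[t][progress[t]]
            if size > BIG and ready:          # preempt on a big allocation
                deferred.append(t)
                continue
            trace.append((t, size))
            progress[t] += 1
            if progress[t] < len(threads[t]):
                heapq.heappush(ready, t)
    return trace
```

With one processor, a thread about to allocate a 200-unit block steps aside until the threads to its right have run.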
AsyncDF Optimizations
Two versions of the scheduler: serial and parallel.
Running the P left-most threads does not exploit the locality of a serial program.
Solution: look at the cP left-most threads and "group" threads together for locality.
Project Plan
Implement serial version of AsyncDF
  Experiment with algorithms to group threads for locality
  Performance comparison to Cilk
    Different values of AsyncDF's tunable parameter
    Case studies: recursive matrix multiplication, Strassen multiplication, N-body problem
Implement parallel version of AsyncDF
  Performance comparison to Cilk
Additional experiments/modifications as research suggests
Questions?
Automatic conversion of
NSP DAGs to SP DAGs
Sajindra Jayasena
Sharad Ganesh
Objectives:
Design of an efficient algorithm to convert
an arbitrary NSP DAG to an SP DAG
- Correctness preserving
- Identifying different graph topologies
Analysis of space and time complexities
- Bound on the increase in critical path length
Perform empirical analysis using Cilk
Theory of Parallel Systems 6.895
Transactional Cilk
Language-Level Complex
Transactions
C. Scott Ananian
An Evaluation of Nested Transactions
Sean Lie
[email protected]
6.895 Theory of Parallel Systems
Wednesday October 15th, 2003
Transactional Memory
The programmer is given the ability to define atomic
regions of arbitrary size.
Start_Transaction;
  Flow[i] = Flow[i] - X;
  Flow[j] = Flow[j] + X;
End_Transaction;
With hardware support, different transactions can be
executed concurrently when there are no memory
conflicts.
Nested Transactions
Nested transactions can be handled by simply
merging all inner transactions into the outermost
transaction.
Start_Transaction;
  ..
  Start_Transaction;
    ..
  End_Transaction;
  ..
  Start_Transaction;
    ..
  End_Transaction;
  ..
End_Transaction;

is treated as:

Start_Transaction;
  ..
  ..
  ..
End_Transaction;
Concurrency
Merging transactions is simple but may decrease the
program concurrency.
Transaction A (merged):
Start_Transaction;
  ..
  Start_Transaction;
    Access S;
  End_Transaction;
  ..
End_Transaction;

Transaction B:
Start_Transaction;
  Access S;        <-- Conflict!
End_Transaction;

After merging, the inner transaction's access to S is held for the whole of transaction A, so transaction B's access to S conflicts.
Nested Concurrent Transactions
Nested Concurrent Transactions (NCT): allow the inner transaction to run and complete independently from the outer transaction.
Questions:
  When is NCT actually useful?
  How much overhead is incurred by NCT?
  How much do we gain from using NCT?
The tool needed to get the answers: a transactional memory hardware simulator, UVSIM. UVSIM currently supports merging nested transactions.
How to get the answers:
  Identify applications where NCT is useful. Look at malloc?
  Implement the necessary infrastructure to run NCTs in UVSIM.
  Evaluate the performance of NCT vs. merging transactions.
Compiler Support for Atomic
Transactions in Cilk
Jim Sukha
Tushara Karunaratna
Determinacy Race Example
If we could guarantee that this line of code executed atomically, then the program would have no determinacy race.
New Keywords in Cilk
It would be convenient if programmers could
make a section of code execute atomically
by using xbegin and xend keywords.
Simulation of Transactional Memory
For a transaction, keep track of the memory locations accessed.
Transform the load/store operations in the original code. As long
as there are no conflicts, the transaction executes as usual.
If there is a conflict, the transaction is aborted: restore the
modified locations to their original values, wait some time, and
then try again.
If the transaction completes successfully, commit the changes
and clean up.
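The runtime described above can be sketched as a tiny software transaction layer. This is an assumed illustration, not the project's implementation; the lock-per-location scheme stands in for the proposed hash table of per-block locks and backup values, and all names are invented.

```python
import threading

class Conflict(Exception):
    pass

memory = {}                 # the shared "memory": location -> value
_locks = {}                 # assumed: one lock per location
_guard = threading.Lock()

def _lock_for(addr):
    with _guard:
        return _locks.setdefault(addr, threading.Lock())

def run_transaction(body, max_tries=10):
    """Run body(read, write) atomically: loads and stores go through the
    read/write hooks, which lock each touched location and back up old
    values; on a conflict the writes are rolled back and the body retried
    (a real system would also back off before retrying)."""
    for _ in range(max_tries):
        held, backup = [], {}
        def read(addr):
            lk = _lock_for(addr)
            if lk not in held:
                if not lk.acquire(blocking=False):
                    raise Conflict          # someone else holds this location
                held.append(lk)
            return memory.get(addr, 0)
        def write(addr, val):
            read(addr)                      # ensure the location is locked
            backup.setdefault(addr, memory.get(addr, 0))
            memory[addr] = val
        try:
            body(read, write)
            for lk in held:
                lk.release()
            return True                     # committed
        except Conflict:
            for addr, old in backup.items():
                memory[addr] = old          # abort: restore original values
            for lk in held:
                lk.release()
    return False

# the atomic transfer from the transactional-memory example above
memory.update({"Flow_i": 10, "Flow_j": 0})
def move(read, write):
    write("Flow_i", read("Flow_i") - 3)
    write("Flow_j", read("Flow_j") + 3)
ok = run_transaction(move)
```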
Work Involved
• Modify the Cilk parser to accept xbegin and xend as
keywords.
• Identify all load and store operations in user code, and
replace them with calls to functions from the runtime
system.
• Implement the runtime system. An initial implementation
is to divide memory into blocks and to use a hash table
to store a lock and backup values for each block.
• Experiment with different runtime implementations.
Data Race in Transactional
Cilk
Xie Yong
Background
• Current Cilk: achieves atomicity using locks
  – problems such as priority inversion, deadlock, etc.
  – nontrivial to code
• Transactional memory
  – Software TM: overhead -> slow
  – Hardware TM
Cilk's transactions everywhere
• Every instruction becomes part of a transaction
• Based on HTM
• Cilk transactions:
  – cut the Cilk program into atomic pieces
  – based on some language construct, or generated automatically by the compiler
Data Race
• Very different from the traditional definition in the Nondeterminator
• Assumption (correct parallelization)
• Definition (1st half of Kai's master's thesis)
  – Efficient detection (vs. the NP-completeness proved by Kai)
  – Make use of current algorithms/tools
Non-Determinacy Detection
Linear-time determinacy race
detection
Jeremy Fineman
Motivation
• Nondeterminator has two parts:
– Check whether threads are logically parallel
– Use shadow spaces to detect determinacy race
• SP-Bags algorithm uses LCA lookup to
determine whether two threads are parallel
• LCA lookup with disjoint-sets data structure
takes O(α(v,v)) time.
• We do not care about the LCA. We just want to
know if two threads are logically parallel.
New algorithm for determinacy-race detection
• Similar to SP-Bags algorithm with two
parts:
– Check whether threads are logically parallel
– Use shadow spaces to detect determinacy
race
• Use order maintenance data-structure to
determine whether threads are parallel
• Gain: Order maintenance operations are
O(1) amortized time.
The algorithm
• Maintain two orders:
  – Regular serial, depth-first, left-to-right execution.
    • At each spawn, follow the spawn thread before the continuation thread.
  – Right-to-left execution.
    • At each spawn, follow the continuation thread before the spawn thread.
• Claim: Two threads e1 and e2 are parallel if and
only if e1 precedes e2 in one order, and e2
precedes e1 in the other.
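The claim can be checked on a tiny model. Here a procedure is a list of thread names and ("spawn", subprocedure) tuples, ignoring syncs for brevity (so the continuation of a spawn runs entirely in parallel with the spawned child); all names are invented for illustration.

```python
def order(proc, spawn_first):
    """Linearize the computation. spawn_first=True gives the regular
    depth-first, left-to-right order (spawned child before continuation);
    spawn_first=False gives the right-to-left order (continuation first,
    deferred children afterwards, last spawn first)."""
    out = []
    def walk(items):
        deferred = []
        for it in items:
            if isinstance(it, tuple):         # ("spawn", subprocedure)
                if spawn_first:
                    walk(it[1])
                else:
                    deferred.append(it[1])
            else:
                out.append(it)
        for sub in reversed(deferred):
            walk(sub)
    walk(proc)
    return out

def logically_parallel(e1, e2, left, right):
    """Two threads are parallel iff the two orders disagree on them."""
    return (left.index(e1) < left.index(e2)) != (right.index(e1) < right.index(e2))

proc = ["e1", ("spawn", ["e3"]), "e2"]        # e2 is the continuation
left = order(proc, True)
right = order(proc, False)
```

The spawned thread e3 and the continuation e2 swap places between the two orders, so they are reported parallel; e1 precedes everything in both orders, so it is serial with the rest.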
Depth-first, left-to-right execution
[Figure: a Cilk dag with procedures F, F1, F2, F3 and threads e0-e7, numbered in depth-first, left-to-right execution order.]
Right-to-left execution
[Figure: the same dag with threads numbered in right-to-left execution order.]
How to maintain both orders in
serial execution
• Keep two pieces of state for each procedure F:
– CF is current thread in F
– SF is next sync statement in F.
• In both orders, insert new threads after current
thread
• On spawn, insert continuation thread before
spawn thread in one order. Do the opposite in
the other.
• On sync, advance CF to SF
• In any other thread, update shadow spaces
based on current thread
Example
[Figure: worked example of maintaining both orders: as the serial execution proceeds through threads e0-e5 and sync s1, CF and SF are updated and new threads are inserted into the two orders next to the current thread.]
Project Proposal:
Parallel Nondeterminator
He Yuxiong
Objective
I propose the Parallel Nondeterminator to:
• Check for determinacy races in the parallel execution of programs written in a language like Cilk
• Develop an efficient algorithm to decide the concurrency between threads
• Develop an efficient algorithm to reduce the number of entries in the access history
Preliminary Idea for the Concurrency Test
(1) Labeling scheme: thread representation (fid, tid)
(2) Set operations: two sets per function f:
  Parallel set PS(f) = {(pf, ptid) | all threads with tid >= ptid in function pf are parallel with the running thread of f}
  Children set CS(f) = {fid | fid is a descendant of the current function}
Operations:
  Spawn: thread Tx of function Fi spawns function Fj
    Operations on child Fj
    Operations on parent Fi
  Sync: function Fi executes sync
    PS(Fi) = PS(Fi) - CS(Fi)
  Return: function Fj returns to function Fi
    CS(Fi) = CS(Fi) + CS(Fj)
    Release PS(Fj) and CS(Fj)
Concurrency test: check whether (fx, tx) is parallel with the currently running thread (fc, tc).
Preliminary Idea for the Access History
Serial: read(l), write(l)
Simplest parallel program without nested parallelism: two parallel read records, one write record
Language structure like Cilk: reads at the maximum level of parallelism, one write record
Q: Is it possible to keep only two read records for each shared location in the Cilk Parallel Nondeterminator?
Keep two parallel read records with the highest-level LCA in the parent-child spawn tree.
Thank you very much!
Parallel Nondeterminator Using Cilk
Wang Junqing
Accelerating Multiprocessor Simulation
6.895 Project Proposal
Kenneth C. Barr
Introduction
• Inspiration
  – UVSIM is slow
  – FlexSIM is cool
    * Interleave short periods of detailed simulation with long periods
      of functional warmup (e.g., cache and predictor updates, but not
      out-of-order logic)
    * Achieve high accuracy in a fraction of the time
  – Multi-configuration simulation is cool
    * Recognize structural properties, e.g., "contents of a FA cache
      are a subset of all larger FA caches with the same line size," so
      search small->large. Once we hit in a small cache, we can stop
      searching.
    * Simulate many configurations with a single run
Kenneth C. Barr – MIT 6.895
October 15, 2003
The Meta Directory
• Combine previous ideas to speed multiprocessor
simulation
• Sequential consistency is what matters; perhaps detailed
consistency mechanisms can be interleaved with a fast,
functionally equivalent method
Meta-directory example
• Detailed simulation:
  P1 ld x   ; null -> Sh, data sent to P1
  P2 ld x   ; Sh -> Sh, data sent to P2
  P3 ld x   ; Sh -> Sh, data sent to P3
  P2 st x   ; Sh -> Ex, inv sent to P1 and P3, data sent to P2
  P1 st x   ; Ex -> Ex, inv sent to P2, data sent to P1
  P2 ld x   ; Ex -> Sh, data sent to P2
• Shortcut
– Record access in a small “meta-directory:”
x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w}, {P1, 4, w}, {P2, 5, r}
– All reads and writes occur in a memory; no messages sent or received,
no directory modeled, no cache model in processor (?)
– When it comes time for detailed simulation, we can reconstruct
directory by scanning backwards: x is shared by P1 and P2.
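The backward scan can be sketched directly from the log format above. The function name and the MSI-style states are assumptions for illustration; the slide leaves the exact reconstruction rule open.

```python
def reconstruct(log):
    """Scan a meta-directory log (oldest first) backwards to recover the
    directory state for one address: reads since the most recent write
    are the current sharers (the last writer keeps a shared copy); a
    write with nothing after it means the writer holds the line
    exclusively."""
    sharers = set()
    for proc, op in reversed(log):
        if op == 'r':
            sharers.add(proc)
        else:                                  # a write ends the scan
            if not sharers:
                return ('Ex', {proc})
            return ('Sh', sharers | {proc})
    return ('Sh', sharers) if sharers else ('Inv', set())

# the slide's log for x: reads by P1, P2, P3, writes by P2 and P1, read by P2
log_x = [('P1', 'r'), ('P2', 'r'), ('P3', 'r'),
         ('P2', 'w'), ('P1', 'w'), ('P2', 'r')]
```

On the slide's log this yields the stated answer: x is shared by P1 and P2.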
Challenges
• Accesses stored in circular queue. How many records
needed for each address?
• What happens when a processor writes back data? The current
scheme introduces false hits.
• Does this scheme always work? Some proofs would be
nice.
Methodology
• Platform
  – I got Simics to boot SMP Linux
  – But Bochs is open-source
  – x86 is key if this is to be a long-term framework
  – Cilk benchmarks?
• Tasks
  – Create a detailed directory model
  – Create the meta directory
  – Create a way to switch back and forth while maintaining program correctness
  – Measure the dramatic improvement!
Conclusion
• Reality
– Sounds daunting for December deadline, but if I can prove feasibility or
fatal flaws, I’d be excited
• Suggestions?
Project Proposal: Parallelizing
METIS
Zardosht Kasheff
What is METIS?
• Graph Partitioning algorithm
• Developed at University of Minnesota by
George Karypis and Vipin Kumar
• Information on METIS:
http://www-users.cs.umn.edu/~karypis/metis/
Stages of Algorithm: Coarsening
• Task: Create a sequence of smaller graphs that are good
representations of the original graph by collapsing connected
nodes.
• Issues:
  – Minor concurrency issues.
  – Maintaining data locality.
  – Writing a large amount of data to memory in a scalable fashion.
Stages of Algorithm: Initial
partitioning.
• Task: partition the small graph
• Issues: none. The runtime of this portion of the algorithm is very small.
Stages of Algorithm: Uncoarsening
and Refinement
• Uncoarsening and refinement: project the partition of the
coarsest graph back to the original graph and refine the
partition.
• Issues:
– Major concurrency issues.
– Remaining issues under research.
Parallel Sorting
Paul Youn
I propose to parallelize a sorting algorithm.
Initially I am looking at quicksort, but I may
investigate other sorting algorithms.
Sorting is a problem that can potentially be
sped up significantly (e.g., mergesort).
As an alternative, I am considering other
algorithms.
Goals
• Similar to Lab 1 approach.
• Speedy serial algorithm.
• Basic parallel algorithm.
• Runtime analysis.
• Empirical verification of runtime analysis.
• Parallel speedups.
HELP!
• Not sure about the scope of the problem.
• Anyone out there interested in this stuff?
• Other appropriate algorithms? Max-flow?
PROJECT PROPOSAL
SMA5509
Implement FIR filter in
parallel computing
using Cilk
Name: Pham Duc Minh
Matric: HT030502H
Email: [email protected]
[email protected]
Discrete Signal
the discrete input signal is x(n)
the output signal is y(n)
the impulse response is h(n)

y(n) = Σ x(m) h(n − m), summed over m; equivalently, Σ x(i) h(j) over all i, j with i + j = n
FIR filter
Discrete-time LTI (linear time-invariant) systems
can be classified into FIR and IIR.
An FIR filter has an impulse response h(n) that extends only
over a finite time interval, say 0 ≤ n ≤ M, and is
identically zero beyond that:
{h0, h1, h2, . . . , hM, 0, 0, 0, . . .}
M is referred to as the filter order.
The output y(n) in FIR is simplified to the finite-sum form:

y(n) = Σ_{m=0}^{M} h(m) x(n − m)

or, explicitly,

y(n) = h0 x(n) + h1 x(n−1) + h2 x(n−2) + ... + hM x(n−M)
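The finite-sum form translates directly into code. A plain serial sketch for reference (the project would spawn the per-sample work in Cilk); the function and variable names are illustrative.

```python
def fir(h, x):
    """Direct-form FIR filter: y(n) = sum_{m=0}^{M} h(m) * x(n - m),
    treating x as zero outside 0 <= n < L. Output has length L + M."""
    M, L = len(h) - 1, len(x)
    y = []
    for n in range(L + M):
        acc = 0
        for m in range(max(0, n - L + 1), min(n, M) + 1):
            acc += h[m] * x[n - m]
        y.append(acc)
    return y
```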
Methods to process
1. Block processing
a. Convolution
y(n) = Σ x(i) h(j), summed over all i, j with i + j = n
(convolution table form)
b. Direct Form
The length of the input signal x(n) is L
The length of the output y(n) will be Ly = Lx + Lh − 1 = L + M

y(n) = Σ_{m = max(0, n−L+1)}^{min(n, M)} h(m) x(n − m)
c. Matrix form
y = H x:

[ y0      ]   [ h0  0   0  ...  0  ] [ x0     ]
[ y1      ] = [ h1  h0  0  ...  0  ] [ x1     ]
[ ...     ]   [ ................... ] [ ...    ]
[ yL+M−1  ]   [ 0  ...  0   0   hM ] [ xL−1   ]

with the dimension of H being Ly × Lx = (L + M) × L.
d. Overlap-Add Block Convolution Method
The above methods are not feasible for infinite or
extremely long inputs.
Divide the long input into contiguous non-overlapping
blocks of manageable length, say L samples:
y0 = h*x0
y1 = h*x1
y2 = h*x2
Thus the resulting block y0 starts at absolute time n = 0,
block y1 starts at n = L, and so on.
The last M samples of each output block will overlap with
the first M outputs of the next block.
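Since each block's convolution is independent, the blocks are natural units to run in parallel (e.g., one Cilk spawn per block); only the M-sample overlap additions need care. A serial sketch with invented names:

```python
def conv(h, x):
    """Full linear convolution of h and x, output length len(x)+len(h)-1."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xv in enumerate(x):
        for m, hv in enumerate(h):
            y[n + m] += hv * xv
    return y

def overlap_add(h, x, L):
    """Convolve each L-sample block of x with h independently, then add
    the last M outputs of each block into the next block's first M."""
    M = len(h) - 1
    y = [0] * (len(x) + M)
    for start in range(0, len(x), L):
        for k, v in enumerate(conv(h, x[start:start + L])):
            y[start + k] += v
    return y
```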
2. Sample Processing
The direct-form I/O convolutional equation is
y(n) = h0 x(n) + h1 x(n−1) + ... + hM x(n−M)
We define the internal states w0(n), w1(n), w2(n), ...:
w0(n) = x(n)
w1(n) = x(n−1) = w0(n−1)
w2(n) = x(n−2) = w1(n−1)
.....
With this definition, we can write y(n) in the form
y(n) = h0 w0(n) + h1 w1(n) + ... + hM wM(n)
Goal of project
Use Cilk to implement an FIR filter in parallel
Compare with a DSP kit implementation
(pipelined) and an FPGA (Field
Programmable Gate Array) implementation
(also parallel)
How to implement
Multithreaded programming
Cilk compiler
Other things
If I finish early, I hope to get suggestions for
additional work from the staff.
Cache-Oblivious Algorithms
SMA5509: Theory of Parallel Systems
Project Proposal
Cache-oblivious sorting for the
Burrows-Wheeler Transform
Sriram Saroop
Advait D. Karande
bzip2 compression
A lossless text compression scheme:
Input data -> Burrows-Wheeler Transform -> Move-to-Front coding -> Huffman encoding -> Compressed data
Burrows-Wheeler Transform (1994)
A pre-processing step for data compression
Involves sorting of all rotations of the block of
data to be compressed
Rationale: Transformed strings compress
better
Worst-case complexity of N² log(N), where N
= size of the block to be sorted
Can be improved to N (log N)² (Sadakane '98)
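For reference, the transform itself is only a few lines when done naively; the project's point is that this rotation sort dominates the cost and is the step to make cache-oblivious. The `$` sentinel is an assumption of this sketch.

```python
def bwt(s):
    """Naive Burrows-Wheeler Transform: append a sentinel, sort all
    rotations lexicographically, and read off the last column. The sort
    over N rotations of length N is the expensive, cache-hostile step."""
    s = s + "$"                   # sentinel, assumed smaller than any char
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```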
Why make it cache-oblivious?
Performance of BWT is heavily dependent
on cache behavior (Seward '00)
Avoid slowdown for large files with high
degree of repetitiveness
Especially useful in applications like
bzip2 that are required to perform well
on different memory structures
Project Plan
Design and implementation of a cache
oblivious algorithm for BWT sorting
Performance comparisons with standard
bzip2 for available benchmarks
Back-up: Optimizations for BWT
including parallelization, and
incorporating Cache-Obliviousness in
other parts of bzip2 compression
New Cache Oblivious
Algorithms
Neel Kamal
Zhang JiaHui
Our Goals
Designing cache-oblivious algorithms for
problems in several areas of computer
science.
Finding theoretical bounds on the number of
cache misses for these algorithms.
Trying to parallelize them ...
Large Integer Arithmetic Operations
• Basic Operators
– Addition, Subtraction
– Multiplication
– Factorial
– Matrix Multiplication
• Application: RSA encryption algorithm
Large Integer Multiplication
Basic idea: split each operand into a high and a low half, x = A·10^k + B and y = C·10^k + D; the product is then assembled from the sub-products A×C, A×D, B×C, and B×D.
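A recursive sketch of this splitting. Note that the recombination below uses Karatsuba's three-product trick (computing AD + BC from one extra product), which goes beyond the four sub-products on the slide; the base-10 splitting and all names are illustrative, and inputs are assumed to be non-negative integers.

```python
def karatsuba(x, y):
    """Divide-and-conquer multiplication: split x = A*10^k + B and
    y = C*10^k + D, then combine three recursive sub-products."""
    if x < 10 or y < 10:
        return x * y                      # base case: single digits
    k = max(len(str(x)), len(str(y))) // 2
    p = 10 ** k
    A, B = divmod(x, p)
    C, D = divmod(y, p)
    ac = karatsuba(A, C)
    bd = karatsuba(B, D)
    mid = karatsuba(A + B, C + D) - ac - bd   # equals A*D + B*C
    return ac * p * p + mid * p + bd
```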
Dynamic Programming Algorithms
• Longest Common Subsequence: basic idea

      A  B  C  B  D  A  B
  B   0  1  1  1  1  1  1
  D   0  1  1  1  2  2  2
  C   0  1  2  2  2  2  2
  A   1  1  2  2  2  3  3
  B   1  2  2  3  3  3  4
  A   1  2  2  3  3  4  4
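The table above is the standard LCS dynamic program for X = ABCBDAB (columns) and Y = BDCABA (rows); a direct sketch:

```python
def lcs_table(X, Y):
    """c[i][j] = length of the longest common subsequence of Y[:i] and
    X[:j]; the classic O(|X||Y|) dynamic program. A cache-oblivious
    version would fill this table with recursive blocking instead of
    row-by-row sweeps."""
    c = [[0] * (len(X) + 1) for _ in range(len(Y) + 1)]
    for i in range(1, len(Y) + 1):
        for j in range(1, len(X) + 1):
            if Y[i - 1] == X[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c

table = lcs_table("ABCBDAB", "BDCABA")
```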
Graph Algorithm
• All-pair shortest path
algorithm – Floyd
• Connectivity Testing of
a Graph
• And more …
Computational Geometry
• Finding the closest pair of points
• And More …
Concurrent
Cache-Oblivious Algorithms
Seth Gilbert
Cache-Oblivious Algorithms
Optimal memory access
Matrix multiply, Fast Fourier Transform
B-trees, Priority Queues
Concurrent?
wait-free, lock-free, obstruction-free
Synchronization?
DCAS, CAS, LL/SC
Packed Memory Structure
Operations:
  Traverse(k) -- O(k/B)
  Insert -- O((log² n)/B)
  Delete -- O((log² n)/B)
Goal:
  Design a lock-free version of the cache-oblivious algorithm using CAS
More
Basis for many data structures
  B-trees
  distributed data structures
  Maybe the parallel nondeterminator
The End
Packed Memory Structure
[Figure: elements V1, V2, V3, V4, V5 stored well-spaced in an array, with size(array) < 4 × the number of elements.]
V5