Object Based Parallel Programming
Download
Report
Transcript Object Based Parallel Programming
Blue Gene Simulator
Gengbin Zheng
[email protected]
Gunavardhan Kakulapati
[email protected]
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
1
Overview
Blue
Gene Emulator
Blue
Gene Simulator
Timing
correction schemes
Performance
and results
2
Emulation on a Parallel Machine
BG/C Nodes
Simulating (Host) Processor
Hardware thread
3
Blue Gene Emulator: functional view
Communication threads
Worker threads
inBuffer
CorrectionQ
Non-affinity message queues
Affinity message queues
One Blue Gene/C node
4
Blue Gene Emulator: functional view
Communication
threads
Communication
threads
Worker
threads
Worker
threads
inBuff
inBuff
Correction
Q
Non-affinity message
queues
Affinity message
queues
Correction
Q
Non-affinity message
queues
Affinity message
queues
Converse scheduler
Converse Q
5
What is capable …
Blue Gene API support
Blue Gene Charm++
– Structured Dagger
Trace Projections
6
Emulator to Simulator
Emulator:
– Study programming model and application development
Simulator:
– performance prediction capability
– models communication latency based on network model;
– Doesn’t model memory access on chip, or network
contention
7
Simulator
Parallel performance is hard to model
– Communication subsystem
Out of order messages
Communication/computation overlap
– Event dependencies
Parallel Discrete Event Simulation
– Emulation program executes in parallel with
event time stamp correction.
– Exploit inherent determinacy of application
8
How to simulate?
Time stamping events
– Per thread timer (sharing one physical timer)
– Time stamp messages
Calculate communication latency based on network model
Parallel event simulation
– When a message is sent out, calculate the predicted
arrival time for the destination bluegene-processor
– When a message is received, update current time.
currTime = max(currTime,recvTime)
– Time stamp correction
9
Time Stamping messages and threads
Message sent:
RecvT(msg) = curT+Latency
Thread Timer: curT
Message scheduled:
curT = max(curT, RecvT(msg))
10
Need for timestamp correction
Time stamp correction needed for out-oforder messages
Out-of-order delivery can occur:
– A message arrives late while some other
message updates the thread time to future
– So late message executes in the context of
future, although its predicted time is earlier
11
Parallel correction algorithm
Sort message execution by receive time;
Adjust time stamps when needed
Use correction message to inform the change
in event startTime.
Send out correction messages following the
path message was sent
The events already in the timeline may have
to move.
12
Timestamps Correction
RecvTime
M1
M2
M3
M4
M5
M6
M7
Execution
TimeLine
M8
13
Timestamps Correction
RecvTime
M1
M2
M3
M8
M4
M5
M6
M7
Execution
TimeLine
14
Timestamps Correction
RecvTime
M1
M2
M3
M4
M5
M6
M7
Execution
TimeLine
M8
RecvTime
M1
M2
M3
M8
M4
M5
M6
M7
Execution
TimeLine
Correction Message
15
Timestamps Correction
RecvTime
M1
M2
M3
M4
M5
M6
M4
M7
Execution
TimeLine
M4
Correction Message (M4)
RecvTime
M1
M2
M4
M3
M5
M6
M7
Execution
TimeLine
Correction Message
Correction Message (M4)
RecvTime
M1
M2
M3
M5
M6
M4
Correction Message
M7
Execution
TimeLine
16
Linear-order correction
Works only when
– Programs have no alternate orders of
execution possible
– Messages are processed in the same order for
multiple executions
– Eg: MPI programs with no-wildcard recvs,
structured-dagger code with no “overlap” or
“forall”.
17
Reasons:
Correction algorithm breaks dependency
logic
– Only based on receive time;
– Cases:
When an event depends on several messages
– Last message triggers the computation
Message buffered until some condition holds
Example for invalid correction scheme:
Jacobi-1D
18
19
Solution
Use structured dagger to retrieve
dependence information
As the program runs, form a chain of
bluegene logs preserving the dependency
information .
Bluegene logs for entry functions and
structured dagger functions
20
Timestamp correction scheme
Every event has a list of backward and forward
dependents.
An event cannot start till its backward
dependents have finished.
Define
effRecvTime =
max(recvTime, endOfBackDeps)
An event can start only after its effRecvTime.
startTime =
max(effRecvTime,timeline.last.endTime)
21
Timestamp correction scheme
Timeline is not sorted on the recvTime of the event
like the previous case.
Timeline is sorted based on the effRecvTime.
Steps to process a correction message
–
–
–
–
Find the earliest updated event due to the message
Cut the timeline from that event
Calculate new effRecvTimes from then.
Reinsert into the timeline in the order of effRecvTime
22
Non-linear order correction
scheme
The new scheme :
– Takes into account the event dependencies
– Works even when messages can be received in
different orders in different runs.
– Requires all the dependencies to be captured
using structured dagger.
But the timing correction is very slow.
Several optimizations possible.
23
Optimizations to online
correction scheme
Overwrite old corrections:
– An event can get multiple correction
messages.
– Reduce the number of corrections
– Same scheme if correction message arrives
earlier than the message itself
Use multisend
– Messages destined to same real processor but
different events can be sent collectively.
24
More optimizations
Prioritize messages based on their predicted
recvTime.
Lazy processing
– Process correction messages periodically.
– Allows corrections to be overwritten.
Batch processing
– Process many correction messages at a time
– Many events will be affected
– Choose the earliest and reinsert in the order of
effRecvTime.
Ability to start corrections in the middle
– Can ignore the startup events for timing correction
25
Timing correction still very slow.
Observations:
– Don’t let the execution go far ahead of the
correction wave.
– A large difference means many wrong events to
be corrected.
– Closely following the execution wave also may
not help.
A new scheme
– Similar to the one used for gvt (Global virtual
time)
26
GVT-like scheme
Use heartbeat
– Periodically broadcast asking for gvt
Gvt
– Is the time after which the events are invalid
due to pending corrections
– Compute the gvt as the minimum of predict
recvTimes of all correction messages and
startTimes of all affected events.
Use a parameter “leash”. Execution of the
program cannot go beyond “gvt + leash”
27
Projections before correction
28
Projections after correction
29
Correctness of the scheme (using
Jacobi1D)
30
Predicted time vs latency factor
31
Predicted speedup
32
More work
Ongoing work
– Make sure gvt scheme is correct
Future work
– The presented scheme is on-line correction
– Explore the off-line (post-mortem) correction
scheme using generated traces.
33