
Experience with a Cluster JVM
Philip J. Hatcher
University of New Hampshire
[email protected]
Acknowledgements
• UNH students
– Mark MacBeth and Keith McGuigan
• PM2 team
– very effective and enjoyable collaboration
2
Traditional Parallel Programming
• Parallel programming supported by
using a serial language plus a “bag on
the side”.
– e.g. Fortran plus MPI
• Parallel programming supported by
extending a serial language.
– e.g. High Performance Fortran
3
My History
• I spent years studying data-parallel
extensions to C, such as C*.
– Users never really accepted extensions.
– They found them too complex.
– They wanted standard, well-integrated
solutions.
4
Java is a good thing!
• Java is explicitly parallel!
– Language includes a threaded programming
model.
• Java employs a relaxed memory model.
– Consistency model aids an implementation
on distributed-memory parallel computers.
5
Java Threads
• Threads are objects.
• The class java.lang.Thread contains all
of the methods for initializing, running,
suspending, querying and destroying
threads.
6
java.lang.Thread methods
• Thread() - constructor for thread
object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(),
yield().
• setPriority().
7
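The methods listed above combine into a minimal example (the class name and printed message are illustrative, not from the slides):

```java
// Minimal sketch: a thread created by subclassing java.lang.Thread.
public class HelloThread extends Thread {
    @Override
    public void run() {                    // body invoked via start()
        System.out.println("running in " + getName());
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new HelloThread();
        t.setPriority(Thread.NORM_PRIORITY);
        t.start();                         // begin concurrent execution of run()
        t.join();                          // wait for the thread to finish
    }
}
```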
Java Synchronization
• Java uses monitors, which protect a
region of code by allowing only one
thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each
object.
8
synchronized keyword
• Statement form: synchronized ( Exp ) Block
• Method form:
  public class Q {
      synchronized void put(…) {
          …
      }
  }
9
java.lang.Object methods
• wait() - the calling thread, which must
hold the lock for the object, is placed in
a wait set associated with the object.
The lock is then released.
• notify() - an arbitrary thread in the wait
set of this object is awakened and then
competes again for the object's lock.
• notifyAll() - all waiting threads are
awakened.
10
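As an illustration of how wait/notifyAll combine with synchronized, here is a minimal one-slot buffer sketch (class and method names are my own, not from the slides):

```java
// Sketch: a one-slot buffer built from the monitor methods above.
public class Slot {
    private Object value;                   // null means empty

    public synchronized void put(Object v) throws InterruptedException {
        while (value != null) wait();       // wait until the slot is empty
        value = v;
        notifyAll();                        // wake any waiting consumers
    }

    public synchronized Object get() throws InterruptedException {
        while (value == null) wait();       // wait until the slot is full
        Object v = value;
        value = null;
        notifyAll();                        // wake any waiting producers
        return v;
    }
}
```

Note the `while` loops around `wait()`: a woken thread must re-check its condition, since notify only signals that the condition *may* now hold.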
Shared-Memory Model
• Java threads execute in a virtual
shared memory.
• All threads are able to access all
objects.
• But threads may not access each
other’s stacks.
11
Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies
of objects.
• Consistency is provided by requiring
that:
– a thread's object cache be flushed upon
entry to a monitor.
– local modifications made to cached objects
be transmitted to the central memory when
a thread exits a monitor.
12
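The flush-on-entry/transmit-on-exit rule is what makes a monitor-guarded hand-off safe: if both threads touch a field only inside the same monitor, their cached copies are reconciled at the monitor boundaries. A tiny sketch (names are illustrative):

```java
// Sketch: safe publication under the consistency rules above.
public class FlagBox {
    private boolean flag = false;

    public synchronized void raise() {
        flag = true;        // transmitted to central memory on monitor exit
    }

    public synchronized boolean poll() {
        return flag;        // cache flushed on monitor entry, so the
                            // latest committed value is observed
    }
}
```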
Problems with Java Threads
• Java support for threads is very low
level.
• Java memory model is not very well
understood.
13
Threads API
• No condition variables.
• No semaphores.
• No barriers.
• No collective operations on thread
groups (e.g. sum reduction).
• No parallel collections.
14
So…
• Using low-level operations can be
difficult and error-prone.
• Everyone is “re-inventing the wheel” as
they struggle to construct higher level
abstractions.
15
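As a concrete instance of the wheel everyone re-invents, here is the kind of barrier a programmer must hand-build from the low-level primitives (this is my illustrative sketch, not code from the talk):

```java
// Sketch: a reusable barrier hand-built from synchronized/wait/notifyAll.
public class Barrier {
    private final int parties;
    private int waiting = 0;
    private int generation = 0;   // distinguishes successive barrier rounds

    public Barrier(int parties) { this.parties = parties; }

    public synchronized void await() throws InterruptedException {
        int gen = generation;
        if (++waiting == parties) {       // last thread to arrive
            waiting = 0;
            generation++;                 // release the current round
            notifyAll();
        } else {
            while (gen == generation) wait();
        }
    }
}
```

Getting the generation counter right is exactly the sort of subtle detail that makes these hand-rolled abstractions error-prone.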
Java Specification Request 166
• Expert Group formed 01/23/02.
• Goal is to provide java.util.concurrent:
– atomic variables
– special-purpose locks, barriers,
semaphores and condition variables
– queues and related collections for
multithreaded use
– thread pools
16
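JSR 166 ultimately shipped as java.util.concurrent in J2SE 5.0. As an illustration of the abstraction level it targets, a hand-built barrier collapses to a few lines:

```java
import java.util.concurrent.CyclicBarrier;

// Sketch: the barrier abstraction provided directly by java.util.concurrent.
public class BarrierDemo {
    public static void main(String[] args) throws Exception {
        // Optional barrier action runs once per round, after all parties arrive.
        final CyclicBarrier barrier = new CyclicBarrier(2,
                () -> System.out.println("both threads arrived"));

        Thread worker = new Thread(() -> {
            try { barrier.await(); } catch (Exception e) { }
        });
        worker.start();
        barrier.await();      // released once both parties have arrived
        worker.join();
    }
}
```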
Java Memory Model
• Most programmers did not read Chapter
17 of the Java Language Specification.
• Those who did read it did not fully
understand it.
• Lots of code has been written that is
not portable.
17
For example,
• The Java Grande Forum distributes
multithreaded Java benchmarks.
• These benchmarks utilize a barrier
method implemented with volatile
variables and “busy waiting”.
• However, the benchmarks assume that
when the volatile variable is set, all of
memory is also made consistent. Not true!
18
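The flawed pattern looks roughly like this (an illustrative reconstruction, not the actual Java Grande source):

```java
// Illustrative reconstruction of the flawed pattern: busy-waiting on a
// volatile flag and assuming every other write becomes visible with it.
public class VolatileBarrier {
    static volatile boolean done = false;
    static int data = 0;                 // NOT volatile, NOT lock-protected

    static void writer() {
        data = 42;
        done = true;                     // signal via the volatile flag
    }

    static void reader() {
        while (!done) { Thread.yield(); }   // busy-wait on the flag
        // Under the original JLS Chapter 17 memory model, nothing forces
        // the write to 'data' to be visible here: this read could
        // legally return 0. (JSR 133 later strengthened volatile to
        // provide this guarantee.)
        int x = data;
    }
}
```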
Implementors also struggled…
• In June 2000, IBM researchers
suggested my cluster JVM violated the
JMM, but could not cite an example.
• In July 2000, I produced a “proof” of
correctness.
• In June 2001, a counter-example was
found.
• Problem concerns properly handling
“improperly synchronized” programs.
19
Java Specification Request 133
• Expert Group formed 06/12/01.
• Goal is to re-specify the Java memory
model:
– Maintain relaxed consistency.
– Loosen implementation requirements for
handling “improperly synchronized”
programs.
– Fix ambiguities and holes.
• Current draft is still “rough sledding”!
20
Cluster Implementation of Java
• Single JVM running on a cluster of
machines.
• Nodes of the cluster are transparent.
• Multithreaded applications exploit
multiple processors of cluster.
21
Hyperion
• Cluster implementation of Java developed
at the University of New Hampshire.
• Currently built on top of the PM2
distributed, multithreaded runtime
environment from ENS-Lyon.
22
General Hyperion Overview
[compilation pipeline]
• prog.java --(javac, Sun's Java compiler)--> prog.class (bytecode)
• prog.class --(java2c, instruction-wise translation)--> prog.[ch]
• prog.[ch] --(gcc -O6, linked with the runtime libraries)--> prog
23
The Hyperion Run-Time System
• Collection of modules to allow
“plug-and-play” implementations:
– inter-node communication
– threads
– memory and synchronization
– etc
24
Thread and Object Allocation
• Currently, threads are allocated to
processors in round-robin fashion.
• Currently, an object is allocated to the
processor that holds the thread that is
creating the object.
• Currently, DSM-PM2 is used to implement
the Java memory model.
25
Hyperion Internal Structure
[layer diagram]
• Hyperion layer: load balancer, thread subsystem, memory subsystem,
communication subsystem, native Java API.
• These sit on the PM2 API (pm2_rpc, pm2_thread_create, etc.).
• PM2 layer: DSM subsystem, thread subsystem, communication subsystem.
26
PM2: A Distributed, Multithreaded
Runtime Environment
• Thread library: Marcel
– User-level
– Supports SMP
– POSIX-like
– Preemptive thread migration
• Communication library: Madeleine
– Portable: BIP, SISCI/SCI, MPI, TCP, PVM
– Efficient
• Marcel performance:
                  SMP       Non-SMP
  Context switch  0.250 µs  0.120 µs
  Create          2 µs      0.55 µs
• Thread migration: SCI/SISCI 24 µs; BIP/Myrinet 75 µs.
• Madeleine performance:
             SCI/SISCI  BIP/Myrinet
  Latency    6 µs       8 µs
  Bandwidth  70 MB/s    125 MB/s
27
DSM-PM2: Architecture
• DSM comm:
– send page request
– send page
– send invalidate request
– …
• DSM page manager:
– set/get page owner
– set/get page access
– add/remove to/from copyset
– ...
• Layering: DSM protocol policy and DSM protocol library sit on the DSM
page manager and DSM comm, which are built on PM2 (Marcel threads,
Madeleine communications).
28
DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Use page faults to detect remote objects.
• Log modifications made to remote objects.
• Each node allocates objects from a different
range of the virtual address space.
29
Benchmarking
• Two Linux 2.2 clusters:
– twelve 200 MHz Pentium Pro processors
connected by Myrinet switch and using
BIP.
– six 450 MHz Pentium II processors
connected by an SCI network and using
SISCI.
• gcc 2.7.2.3 with -O6
30
Pi (50M intervals)
[chart: execution time in seconds (0–12) vs. number of nodes (1, 2, 4, 6,
8, 10, 12) for the 200MHz/BIP and 450MHz/SCI clusters]
31
Jacobi (1024x1024)
[chart: execution time in seconds (0–100) vs. number of nodes (1–12) for
the 200MHz/BIP and 450MHz/SCI clusters]
32
Traveling Salesperson (17 cities)
[chart: execution time in seconds (0–1400) vs. number of nodes (1–12) for
the 200MHz/BIP and 450MHz/SCI clusters]
33
All-pairs Shortest Path (2K nodes)
[chart: execution time in seconds (0–1000) vs. number of nodes (1–12) for
the 200MHz/BIP and 450MHz/SCI clusters]
34
Barnes-Hut (16K bodies)
[chart: execution time in seconds (0–140) vs. number of nodes (1–12) for
the 200MHz/BIP and 450MHz/SCI clusters]
35
Current Work
• Comparing Hyperion to mpiJava.
• mpiJava is a set of JNI wrappers to MPI.
• Using Java Grande Forum benchmarks.
• mpiJava implemented on top of a
single-node version of Hyperion.
• This controls for quality of bytecode
implementation.
36
Problems with JGF Benchmarks
• Written with SMP hardware in mind.
– bogus synchronization.
– all data allocated by one thread.
• SMP is not the right model!
37
An Alternative Model
• Programmer should be aware of the
memory hierarchy.
• Do not require “magic” implementation.
• The thread is the correct level of
abstraction:
– If an object was created by a thread, then
the object is “near” the thread.
– Otherwise the object might be “far” from
the thread.
38
Efficiency and Portability
• Will not hurt on SMP hardware and may
even help.
• Implementation can be straightforward.
• “Magic” implementations also possible.
• Encourages portability across different
implementations and hardware.
39
Other Lessons Learned
• in-line checks vs. page faults
• network reactivity
• System.arraycopy
40
In-line Checks vs. Page Faults
• Earlier version of Hyperion used in-line
checks to detect remote objects.
• For our benchmarks, using page faults
was always better.
• Local accesses are free.
• Remote accesses are more expensive.
• But most accesses are local!
41
Network Reactivity
• Fetch of remote object implemented by
asynchronous message to home node.
• Message handled by service thread on
home node.
• When message arrives, service thread
needs to be scheduled.
• Need integration of network layer and
thread scheduler.
42
Short-term Solution
• Over-synchronize:
– use BSP programming style
– distinct phases for communication and
computation
– phases separated by barrier synchronization
– so only service thread ready to run during
communication phase
43
System.arraycopy
• Native implementation can transmit data
in units that are bigger than a page.
• Requires in-line check but usually
amortized over large amount of data.
44
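A small usage sketch of the point above: a single native call moves the whole region, so one remoteness check can be amortized over all of it (array sizes here are illustrative):

```java
// Sketch: System.arraycopy transfers a whole block in one native call,
// so on the cluster a single in-line check covers the entire region.
public class CopyDemo {
    public static void main(String[] args) {
        int[] src = new int[4096];
        for (int i = 0; i < src.length; i++) src[i] = i;

        int[] dst = new int[4096];
        System.arraycopy(src, 0, dst, 0, src.length);  // one bulk transfer
        System.out.println(dst[4095]);                 // prints 4095
    }
}
```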
Conclusions
• Java's thread model is an attractive
vehicle for parallel programming.
• Is Java serial execution fast enough?
– Need true multi-dimensional arrays?
• Need clarified memory model.
• Need extended thread API.
• Programmers need to be aware of
memory hierarchy.
45