On the Conformance of a
Cluster Implementation of
the Java Memory Model
Philip J. Hatcher
University of New Hampshire
[email protected]
Collaborators
• ENS-Lyon
– Gabriel Antoniu, Luc Bougé, Raymond Namyst
Cluster Environment
• collection of machines connected by network
– commercial distributed-memory parallel computer
– “build-your-own” parallel computer
Cluster Implementation of Java
• Single JVM running on cluster.
• Nodes of cluster are transparent.
• Multithreaded applications exploit
multiple processors of cluster.
Examples
• Java/DSM (Rice Univ. - Houston)
– transparent heterogeneous computing
• cJVM (IBM Research - Haifa)
– scalable Java server
• Jackal (Vrije Univ. – Amsterdam)
– high-performance computing
Hyperion
• Cluster implementation of Java developed at
the Univ. of New Hampshire.
• Currently built on top of the PM2 distributed,
multithreaded runtime environment from
ENS-Lyon.
Motivation
• Use Java “as is” for high-performance
computing
– support computationally intensive
applications
– utilize parallel computing hardware
Why Java?
• Explicitly parallel!
– includes a threaded programming model
• Relaxed memory model
– consistency model aids an implementation
on distributed-memory parallel computers
Java Threads
• Threads are objects.
• The class java/lang/Thread contains all of
the methods for initializing, running,
suspending, querying and destroying
threads.
java/lang/Thread methods
• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(),
yield().
• setPriority().
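For reference, a minimal sketch of this API in standard Java (not Hyperion-specific; the Worker class is invented for illustration):

// Minimal use of java.lang.Thread: subclass, override run(), start(), join().
public class HelloThreads {
    static class Worker extends Thread {
        private final int id;
        Worker(int id) { this.id = id; }
        public void run() {
            System.out.println("worker " + id + " running");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Worker(i);
            workers[i].start();            // begins concurrent execution of run()
        }
        for (int i = 0; i < workers.length; i++) {
            workers[i].join();             // wait for each worker to finish
        }
    }
}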
Java Synchronization
• Java uses monitors, which protect a
region of code by allowing only one
thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each
object.
synchronized keyword
• synchronized ( Exp ) Block
• public class Q {
synchronized void put(…) {
…
}
}
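Both forms together in a small illustrative class (the Counter example is not from the original slides):

// A synchronized method and a synchronized ( Exp ) Block; both acquire the
// monitor lock of the same object ('this') before executing their bodies.
public class Counter {
    private int count = 0;

    synchronized void increment() {    // locks 'this' for the whole method
        count++;
    }

    int read() {
        synchronized (this) {          // locks 'this' for just this block
            return count;
        }
    }
}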
java/lang/Object methods
• wait() - the calling thread, which must
hold the lock for the object, is placed in a
wait set associated with the object. The
lock is then released.
• notify() - an arbitrary thread in the wait
set of this object is awakened and then
competes again to get lock for object.
• notifyAll() - all threads in the wait set are awakened.
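A standard producer/consumer sketch using these methods; the one-slot Slot class is invented for illustration:

// One-slot buffer coordinated with wait()/notifyAll(); callers must hold the
// object's lock (be inside a synchronized method) before calling either.
public class Slot {
    private Object value;    // null means the slot is empty

    public synchronized void put(Object v) throws InterruptedException {
        while (value != null) wait();    // wait until the slot is empty
        value = v;
        notifyAll();                     // wake threads waiting in take()
    }

    public synchronized Object take() throws InterruptedException {
        while (value == null) wait();    // wait until the slot is full
        Object v = value;
        value = null;
        notifyAll();                     // wake threads waiting in put()
        return v;
    }
}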
Shared-Memory Model
• Java threads execute in a virtual shared
memory.
• All threads are able to access all objects.
• But threads may not access each other’s
stacks.
Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of
objects.
• Consistency is provided by requiring that:
– a thread's object cache be flushed upon entry
to a monitor.
– local modifications made to cached objects
be transmitted to the central memory when a
thread exits a monitor.
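A small illustrative example (standard Java; the class and field are invented) of why these flush rules matter: without the synchronized methods, a reader thread could keep using a stale cached copy of the flag.

// Entering the monitor in isSet() forces the reader's cache to be flushed;
// exiting the monitor in set() transmits the writer's update to main memory.
public class Flag {
    private boolean ready = false;

    public synchronized void set() {
        ready = true;
    }

    public synchronized boolean isSet() {
        return ready;
    }
}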
General Hyperion Overview
[Compilation pipeline diagram]
prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6 + runtime libraries (libs) → prog
The Hyperion Run-Time System
• Collection of modules to allow “plug-and-play” implementations:
– inter-node communication
– threads
– memory and synchronization
– etc
Thread and Object Allocation
• Currently, threads are allocated to processors
in round-robin fashion.
• Currently, an object is allocated to the
processor that holds the thread that is
creating the object.
• Currently, DSM-PM2 is used to implement the
Java memory model.
Hyperion Internal Structure
[Architecture diagram: Hyperion's load balancer, thread subsystem, memory subsystem, communication subsystem, and native Java API sit on the PM2 API (pm2_rpc, pm2_thread_create, etc.); PM2 itself comprises DSM, thread, and communication subsystems.]
PM2: A Distributed, Multithreaded
Runtime Environment
• Thread library: Marcel
– user-level
– supports SMP
– POSIX-like
– preemptive thread migration
– efficient:
   context switch: 0.250 µs (SMP), 0.120 µs (non-SMP)
   thread create: 2 µs (SMP), 0.55 µs (non-SMP)
   thread migration: 24 µs (SCI/SISCI), 75 µs (BIP/Myrinet)
• Communication library: Madeleine
– portable: BIP, SISCI/SCI, MPI, TCP, PVM
– efficient:
   latency: 6 µs (SCI/SISCI), 8 µs (BIP/Myrinet)
   bandwidth: 70 MB/s (SCI/SISCI), 125 MB/s (BIP/Myrinet)
DSM-PM2: Architecture
• DSM comm:
– send page request
– send page
– send invalidate request
– …
• DSM page manager:
– set/get page owner
– set/get page access
– add/remove to/from copyset
– ...
[Architecture diagram: the DSM protocol policy and DSM protocol library layers of DSM-PM2 are built on the DSM page manager and DSM comm modules, which in turn rest on PM2's Marcel threads and Madeleine communications.]
Hyperion’s DSM API
• loadIntoCache
• invalidateCache
• updateMainMemory
• get
• put
DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Use page faults to detect remote objects.
• Log modifications made to remote objects.
• Each node allocates objects from a different
range of the virtual address space.
Using Page Faults: details
• An object reference is the address of the
base of the object.
• loadIntoCache does nothing.
• DSM-PM2 is used to handle page faults
generated by the get/put primitives.
More details
• When an object is allocated, its address is
appended to a list attached to the page
that contains its header.
• When a page is loaded on a remote node,
the list is used to turn on the header bit for
all object headers on the page.
• The put primitive checks the header bit to
see if a modification should be logged.
• updateMainMemory sends the logged
changes to the home node.
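As a rough illustration of the write-detection scheme just described, the sketch below models the put primitive and updateMainMemory. In reality these are C routines inside the Hyperion runtime; the Java names here (ObjHeader, modLog, writeWord, sendDiffsToHome) are invented for illustration only.

import java.util.ArrayList;
import java.util.List;

final class DsmWriteLoggingSketch {

    // Stand-in for an object header whose bit is turned on when the page
    // holding the object is loaded on a node other than its home.
    static class ObjHeader {
        boolean remoteCopyBit;
        long base;                     // base address of the object
    }

    // One entry per logged modification of a cached remote object.
    static class LogEntry {
        final long address;
        final int newValue;
        LogEntry(long address, int newValue) {
            this.address = address;
            this.newValue = newValue;
        }
    }

    private static final List<LogEntry> modLog = new ArrayList<LogEntry>();

    // put: perform the store, and log it only if the header bit marks the
    // object as a cached copy whose home is on another node.
    static void put(ObjHeader obj, int offset, int value) {
        writeWord(obj, offset, value);
        if (obj.remoteCopyBit) {
            modLog.add(new LogEntry(obj.base + offset, value));
        }
    }

    // updateMainMemory: invoked on monitor exit; ship the logged
    // modifications back to the home node and clear the log.
    static void updateMainMemory() {
        sendDiffsToHome(modLog);
        modLog.clear();
    }

    // --- stubs standing in for runtime machinery ---
    static void writeWord(ObjHeader obj, int offset, int value) { /* raw store */ }
    static void sendDiffsToHome(List<LogEntry> log) { /* send to home node */ }
}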
Benchmarking
• Two Linux 2.2 clusters:
– twelve 200 MHz Pentium Pro processors
connected by Myrinet switch and using BIP.
– six 450 MHz Pentium II processors
connected by a SCI network and using
SISCI.
• gcc 2.7.2.3 with -O6
Pi (50M intervals)
[Plot: execution time in seconds versus number of nodes for the 200MHz/BIP and 450MHz/SCI clusters.]
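For orientation, the following is a sketch of the kind of interval-based multithreaded Pi computation benchmarked above. It is illustrative standard Java, not the actual Hyperion benchmark source, and the cyclic partitioning shown is only one reasonable choice.

// Each thread integrates 4/(1+x^2) over its own subset of the intervals;
// partial results are combined after join().
public class PiSketch {

    static class Worker extends Thread {
        private final int id, nThreads, intervals;
        private final double[] partial;

        Worker(int id, int nThreads, int intervals, double[] partial) {
            this.id = id;
            this.nThreads = nThreads;
            this.intervals = intervals;
            this.partial = partial;
        }

        public void run() {
            double width = 1.0 / intervals;
            double sum = 0.0;
            for (int i = id; i < intervals; i += nThreads) {
                double x = (i + 0.5) * width;
                sum += 4.0 / (1.0 + x * x);
            }
            partial[id] = sum * width;    // each worker writes only its own slot
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int intervals = 50000000;
        int nThreads = (args.length > 0) ? Integer.parseInt(args[0]) : 4;
        double[] partial = new double[nThreads];

        Worker[] workers = new Worker[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Worker(t, nThreads, intervals, partial);
            workers[t].start();
        }
        double pi = 0.0;
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();
            pi += partial[t];
        }
        System.out.println("pi ~= " + pi);
    }
}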
Jacobi (1024x1024)
[Plot: execution time in seconds versus number of nodes for the 200MHz/BIP and 450MHz/SCI clusters.]
Traveling Salesperson (17 cities)
[Plot: execution time in seconds versus number of nodes for the 200MHz/BIP and 450MHz/SCI clusters.]
All-pairs Shortest Path (2K nodes)
[Plot: execution time in seconds versus number of nodes for the 200MHz/BIP and 450MHz/SCI clusters.]
Barnes-Hut (16K bodies)
[Plot: execution time in seconds versus number of nodes for the 200MHz/BIP and 450MHz/SCI clusters.]
Correctness
• Does the Hyperion approach fully support
the Java Memory Model?
• Hyperion follows the operational
specification of the JMM with two
exceptions:
– page granularity
– node-level caches
Java Memory Model: the actors
[Diagram: threads and main memory. Thread actions: lock, unlock, use, assign, load, store; main memory actions: lock, unlock, read, write. A read/load pair moves a value from main memory into a thread's working memory; a store/write pair moves it back.]
Entry to Monitor
[Diagram: on entry to a monitor, the thread performs a lock action; before each use of a variable, a paired read/load brings its value from main memory into the thread's working memory.]
Exit from Monitor
[Diagram: before exit from a monitor, each assign is flushed to main memory by a paired store/write; the exit itself is an unlock action.]
Serializability
• Main memory actions are serializable for
a single variable or lock.
Page Granularity
• Hyperion fetches remote memory locations
with page granularity.
– always OK to perform load/read actions “early”.
Node-level Caches
• All threads on a node share a cache.
• Cache contains values being accessed
whose homes are on other nodes.
• Values whose homes are on this node are
directly accessed.
– as if every use is immediately preceded by a
load/read and every assign is immediately
followed by a store/write.
Node-level Caches
• If one thread invalidates the cache, the
cache is invalidated for all threads.
– always OK to do “extra” load/read actions.
• If one thread updates main memory, all
threads do.
– always OK to do “early” store/write actions.
Node-level Caches
• If one thread performs a load/read, then
all threads see the result.
– as if all threads perform the load/read.
• If one thread performs an assign, then
other threads see the result before the
subsequent store/write.
– hmmm...
Implementation Traces
• load/read across cluster implemented by
request-send message pair.
• store/write across cluster implemented
by transmit-receive message pair.
An Example
Node 0:
– request x
– T0: use x (7)
– T1: assign x = 17
– T2: use x (17)
– transmit x (17)
Node 1:
– request x
– T3: use x (7)
– T4: assign x = 19
– T5: use x (19)
– transmit x (19)
Home(x):
– send x (value 7) to Node 0
– send x (7) to Node 1
– receive x (17) from Node 0
– receive x (19) from Node 1
Serialization
Main Memory
read x (value 7) for T0
read x (7) for T3
write x (17) for T1
read x (17) for T2
write x (19) for T4
read x (19) for T5
Algorithm
• The model-satisfying serialization can always be constructed from the node-level traces of the Hyperion actions.
• Merge node traces, controlled by the pairing of request-send and transmit-receive actions.
• Correct, if memory actions at the home
node are serializable.
Therefore:
• Node-level caches conform to the JMM.
• Simplify implementation.
• Reduce memory consumption.
• Facilitate pre-fetching.
• However, “invalidate one, invalidate all”.
Implementation Difficulties
• Node-level concurrency makes the
implementation tricky. (duh!)
– concurrent page fetches
– cache invalidate while page fetch pending?
– thread reading page as it is being installed
Java API Additions?
• These would be desirable:
– barrier synchronizations
– data reductions
– query cluster configuration
• i.e. number of nodes
• Is this cheating?
– no longer “as is” Java?
API implementation
• Careful implementation of API extensions
can lessen potential cost of “invalidate
one, invalidate all”.
• Implementation of barrier should only
invalidate local cache once all threads
have reached barrier.
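A sketch of such a barrier in plain Java monitor terms; the Barrier class is hypothetical, and the point at which a Hyperion-style implementation could perform the single node-level cache invalidation is marked only by a comment.

// Cyclic barrier built on wait()/notifyAll(). The last thread to arrive
// releases the others; that is the natural point for a single, node-wide
// cache invalidation instead of one invalidation per thread.
public class Barrier {
    private final int parties;
    private int waiting = 0;
    private int generation = 0;

    public Barrier(int parties) {
        this.parties = parties;
    }

    public synchronized void await() throws InterruptedException {
        int myGeneration = generation;
        waiting++;
        if (waiting == parties) {
            // last arrival: a Hyperion-style runtime could invalidate the
            // node-level cache (and flush logged updates) once, right here
            waiting = 0;
            generation++;
            notifyAll();
        } else {
            while (myGeneration == generation) {
                wait();
            }
        }
    }
}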
Level of Transparency
• Consider the current Hyperion
thread/object allocation strategies:
– not mandated by Java Language Spec
– might be superseded by smarter run-time lib
– but, still good guidelines for programmer?
• i.e. if I didn’t create the object, it might be
expensive to access.
• not unreasonable to expect user to be aware
that there might be an extended memory
hierarchy.
Final Thoughts
• Java is extremely attractive vehicle for programming
parallel computers.
• Can acceptable performance be obtained?
– Scalar execution
– DSM implementation
– Application locality
– Scalability
More Final Thoughts
• Extended API for threads required.
• Transparent implementations are
necessary.
• JMM specification under review but it will
remain based upon “release consistency”,
which is important.
Support for Reflection
• Object headers point to virtual method
table, which also contains a pointer to a
class database structure.
• The classDB contains all information
pertinent to class introspection.
• The classDBs are initialized at program
start-up using the tree of static class
dependences rooted at the class that
contains the main method to be called.
Static Initialization
• Static initializers are executed in a
postorder traversal of the static class
dependence tree.
– violates the JLS as static initializers should
not execute until the class is first accessed.
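A tiny illustrative example (class names invented) of where the difference is observable:

// Under the JLS, Config's static initializer must not run until Config is
// first used, so "start" prints before "initializing Config". Running all
// static initializers at program start-up, as Hyperion's scheme does, would
// reverse that order.
public class InitOrder {
    static class Config {
        static { System.out.println("initializing Config"); }
        static final int SIZE = compute();   // non-constant, so access triggers init
        static int compute() { return 42; }
    }

    public static void main(String[] args) {
        System.out.println("start");
        System.out.println("Config.SIZE = " + Config.SIZE);
    }
}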
Class Loading
• Not currently supported.
• But conceptually possible.
– dynamically compile and then dynamically
link in the new class.
– many details to be worked out, however.
java2c Details
• Maps JVM stack locations to C locals.
– relies on gcc to map these to registers
• Interface method invocation is implemented
via a table of pairs:
– (interface VMT, implementation VMT)
• Exceptions are implemented via setjmp and
longjmp.
• Null references are detected by segfault
handler.
Future Work
• Continue to investigate DSM protocols
– eliminate check in put primitive
– compare object-based and page-based protocols
• Implement message aggregation
– send mods for multiple pages in one message
– prefetch pages in groups
Future Work
• Utilize multiprocessor nodes
– true node-level concurrency puts more
demands on the implementation
• java2IC
– java2c has been modified to generate a
(three-address) intermediate code
– want to connect to SUIF, or some other
compiler toolset
Future Work
• Further benchmarking
– converting SPLASH-2 applications to Java
• Java API
– minimal support now (and still JDK 1.1)
– is there a “clean” API implementation?
– evaluate extensions: barriers, etc.
– distributed garbage collection