PPT - The University of Hong Kong

Download Report

Transcript PPT - The University of Hong Kong

JESSICA2:
A Distributed Java Virtual Machine
with Transparent Thread Migration Support
Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
HKU JESSICA Project
JESSICA: “Java-Enabled Single-SystemImage Computing Architecture” :

Project started in 1996. First version (JESSICA1) in 1999.

A middleware that runs on top of the standard UNIX/Linux
operating system to support parallel execution of multithreaded Java applications in a cluster of computers.

JESSICA hides the physical boundaries between machines
and makes the cluster appear as a single computer to
applications -- a single-system image (SSI).

Special feature : preemptive thread migration which allows a
thread to freely move between machines.

Part of the RGC’s Area of Excellence project in 1999-2002.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
2
JESSICA Team Members
Supervisors:


Dr. Francis C.M. Lau
Dr. Cho-Li Wang
Research Students:





Ph.D: Wenzhang Zhu (Thread
Migration)
Ph.D: WeiJian Fang (Global Heap)
M.Phil: Zoe Ching Han Yu
(Distributed Garbage Collection)
Ph.D: Benny W. L. Cheung
(Software Distributed Shared
Memory)
Graduated: Matchy Ma (JESSICA1)
JESSICA Team Members
The Systems Research Group
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
3
Outline
Introduction of Cluster Computing
Motivations
Related works
JESSICA2 features
Performance Analysis
Conclusion & Future works
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
4
What’s a cluster ?


A cluster is a type of parallel or
distributed processing system,
which consists of a collection of
interconnected standalone/complete computers
cooperatively working together as
a single, integrated computing
resource – IEEE TFCC.
My definition : a HPC system that
integrates mainstream
commodity components to
process large-scale problems 
low cost, self-made, yet powerful.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
5
Cluster Computer
Architecture
Programming
Environment
(Java, C, MPI, HPF, DSM)
Management &
Monitoring &
Job Scheduling
Cluster Applications
(Web, Storage, Computing,
Rendering, Financing..)
Single System Image Infrastructure
Availability Infrastructure
OS
OS
OS
OS
Node
Node
Node
Node
High-Speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet)
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
6
Single System Image (SSI) ?



JESSICA Project: Java-Enabled Single-SystemImage Computing Architecture
A single system image is the illusion, created by
software or hardware, that presents a collection of
resources as one, more powerful resource.
Ultimate Goals of SSI : makes the cluster appear
like a single machine to the user, to applications,
and to the network.
Single Entry Point, Single File System, Single Virtual
Networking, Single I/O and Memory Space, Single
Process Space, Single Management / Programming
View …
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
7
Top 500 computers by “classification” (June 2002)
(Source: http://www.top500.org/ )
Count
Share
Rmax [GF/s]
Rpeak [GF/s]
Processors
MPP
224
44.8 %
104899.21
168829.00
111104
Constellations
187
37.4 %
40246.50
59038.00
33828
Cluster
80
16 %
37596.16
69774.00
50181
SMP
9
1.8 %
39208.10
44875.00
6056
Total
500
100 %
221949.97
342516.00
201169
MPP
Constellation
Cluster
SMP
Massively Parallel Processor
E.g., cluster of HPCs
Cluster of PCs
Symmetric Multiprocessor
About the TOP500 List:
1.
2.
3.
the 500 most powerful computer
systems installed in the world.
Compiled twice a year since June
1993
Ranked by their performance on
the LINPACK Benchmark
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
8
#1 Supercomputer:
NEC’s Earth Simulator
Linpack : 35.86 Tflop/s
(Tera FLOPS = 1012
floating point
operations per second
= 450 x Pentium 4 PCs)
Interconnect: Single
stage crossbar (1800
miles of cable) 83,000
copper cables, 16 GB/s
cross section
bandwidth
Built by NEC, 640 processor nodes, each consisting
of 8 vector processors, total of 5120 processors, 40
TFlop/s peak, and 10 TB memory.
Area of computer = 4
tennis courts, 3 floors
(Source: NEC)
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
9
Other Supercomputers in TOP500
#2 #3 Supercomputer: ASCI Q



7.7 TF/s Linpack performance.
Los Alamos National Laboratory, U.S.
HP Alphserver SC (375 x 32-way multiprocessors, total 11,968
processors), 12 terabytes of memory and 600 terabytes of disk storage
#4: IBM ASCI White (U.S.)


8,192 copper microprocessors (IBM SP POWER3), and contains 160
trillion bytes (TB) of memory with more than 160 TB of IBM disk
storage capacity; Linpack: 7.22 Tflops. Located at Lawrence Livermore
National Laboratory.
512-node, 16-way symmetric multiprocessor. Covers an area the size of
two basketball courts, weighs 106 tons. 2,000 miles of copper wiring.
Cost: US$110 million.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
10
TOP500 Nov 2002 List
2 new PC clusters made the TOP 10:


#5 is a Linux NetworX/Quadrics cluster at
Lawrence Livermore National Laboratory.
#8 is a HPTi/Myrinet cluster at the Forecast
Systems Laboratory at NOAA.
A total of 55 Intel based and 8 AMD based
PC clusters are in the TOP500.
The number of clusters in the TOP500
grew again to a total of 93 systems.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
11
Poor Man’s Cluster

HKU Ostrich Cluster
 32 x 733 MHz
Pentium III PCs,
384MB Memory
 Hierarchical
Ethernet-based
network : four 24port Fast Ethernet
switches + one 8-port
Gigabit Ethernet
backbone switch)
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
12
Rich Man’s Cluster



Computational Plant (C-Plant
cluster)
1536 Compaq DS10L 1U servers
(466 MHz Alpha 21264 (EV6)
microprocessor, 256 MB ECC
SDRAM)
Each node contains a 64-bit, 33
MHz Myrinet network interface
card (1.28 Gbps/s) connected to a
64-port Mesh64 switch.
48 cabinets, each of which
contains 32 nodes
(48x32=1536)
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
13
The HKU Gideon 300 Cluster
(Operating in mid Oct. 2002)
Linpack performance: 355 Gflops
#175 in TOP500 (Nov. 2002 List)
300 PCs (2.0GHz Pentium 4, 512MB DDR mem, 40GB disk, Linux OS)
connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
14
Building Gideon 300
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
15
JESSICA2 : Introduction
Research Goal

High Performance Java Computing using Clusters
Why Java?





The dominant language for server-side programming.
More than 2 million Java developers [CNETAsia: 06/2002]
Platform independent: “Compile once, run anywhere”
Code mobility (i.e., dynamic class loading) and data mobility
(i.e., object serialization).
Built-in multithreading support at language level (parallel
programming using MPI, PVM, RMI, RPC, HPF, DSM is difficult)
Why cluster?


Large scale server-side applications need high-performance
multithreaded programming supports
A cluster provides a scalable hardware platform for true
parallel execution.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
16
Java Virtual Machine
Application Class File
Java API Class File
Class Loader

Loads class files
Class loader
Interpreter

Executes bytecode
Runtime Compiler

Bytecode
0a0b0c0d0c6262431
c1d688662a0b0c0d0
c1334514726522723
Interpreter
01010101000101110
10101011000111010
10110011010111011
Runtime
compiler
Converts bytecode to
native code
Native code
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
17
Threads in JVM
Stack Frame
Stack Frame
JESSICA2, CSIS, HKU
}
}
Java Method
Area (Code)
PC
object
public class ProducerConsumerTest {
public static void main(String[] args) {
CubbyHole c = new CubbyHole();
Producer p1 = new Producer(c, 1);
Consumer c1 = new Consumer(c, 1);
p1.start();
c1.start();
Thread 3
Thread 2
Thread 1
A Multithreaded Java Program
Heap (Data)
Class loader
Execution
Engine
object
HKJU, Dec. 18, 2002
Class
files
18
Java Memory Model
(How to maintain memory consistency between threads)
T1
Load variable from
main memory
to
Variable
is modified
working
memory
Upon
in
T1’sTworking
1 performs
before
Upon
Tuse.
performsis
unlock,
memory.
2 variable
lock,
variable
writtenflush
to
When
Tback
2 uses
in
working
main
memory
variable,
it memory
will be
loaded from main
memory
T2
Per-Thread
working memory
Main memory
Garbage
Bin
Object
Variable
Threads: T1, T2
JESSICA2, CSIS, HKU
Heap Area
master copy
Threads in a JVM
HKJU, Dec. 18, 2002
20
Distributed Java Virtual Machine (DJVM)
JESSICA2: A distributed Java Virtual Machine (DJVM) spanning multiple
cluster nodes can provide a true parallel execution environment for
multithreaded Java applications with a Single System Image illusion to Java
threads.
Java Threads created in a program
Global
Object Space
OS
PC
OS
PC
OS
PC
OS
PC
High Speed Network
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
21
Problems in Existing DJVMs
Mostly based on interpreters

Simple but slow
Layered design using distributed shared memory
system (DSM)  can’t be tightly coupled with JVM



JVM runtime information can’t be channeled to DSM
False sharing if page-based DSM is employed
Page faults block the whole JVM
Programmer specifies thread distribution  lacks
of transparency


Need to rewrite multithreaded Java applications
No dynamic thread distribution (preemptive thread
migration) for load balancing.
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
22
Related Work
Method shipping: IBM cJVM



Like remote method invocation (RMI) : when accessing object
fields, the proxy redirects the flow of execution to the node
where the object's master copy is located.
Executed in Interpreter mode.
Load balancing problem : affected by the object distribution.
Page shipping : Rice U. Java/DSM, HKU JESSICA



Simple. GOS was supported by some page-based Distributed
Shared Memory (e.g., TreadMarks, JUMP, JiaJia)
JVM runtime information can’t be channeled to DSM.
Executed in Interpreter mode.
Object shipping: Hyperion, Jackal


Leverage some object-based DSM
Executed in native mode: Hyperion: translate Java bytecode to
C. Jackal: compile Java source code directly to native code
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
23
JESSICA2 Main Features
Transparent Java thread migration



Runtime capturing and restoring of
thread execution context.
No source code modification. No
bytecode instrumenting (preprocessing)
No new API introduced.
Enable dynamic load balancing on
clusters
JESSICA2
Transparent
migration
JIT
GOS
Operated in Just-In-Time (JIT)
compilation Mode
Global Object Space



A shared global heap spanning all cluster
nodes
Adaptive object home migration protocol
I/O redirection
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
25
JESSICA2 Architecture
Java Bytecode or Source Code
public class ProducerConsumerTest {
public static void main(String[] args) {
CubbyHole c = new CubbyHole();
Producer p1 = new Producer(c, 1);
Consumer c1 = new Consumer(c, 1);
Migration
}
threads
Master JVM
}
er
atio
r
g
Mi
rd
no
JITEE
worker
JVM
threads
p1.start();
c1.start();
Migration
worker
JVM
JITEE
threads
Migration
Portable Java
Frames
JITEE
Global Object Space
Load monitor
Host
Manager
Host
Manager
OS
OS
OS
Hardware
Hardware
Hardware
Host
Manager
...
OS
Hardware
Communication Network
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
26
Transparent Thread
Migration in JIT Mode
Simple for interpreters (e.g. JESSICA)



Interpreter sits in the bytecode decoding loop which can be stopped
upon a migration flag checking
The full state of a thread is available in the data structure of
interpreter
No register allocation
JIT mode execution makes things complex (JESSICA2)





Native code has no clear bytecode boundary
How to deal with machine registers?
How to organize the stack frames (all are in native form now) ?
How to make extracted thread states portable and recognizable by
the remote JVM ?
How to restore the extracted states (rebuild the stack frames) and
restart the execution in native form ?
Need to modify JIT compiler to instrument native codes
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
27
An overview of JESSICA2 Java
thread migration
Thread
Frame parsing
GOS
(heap)
Frames
Frames
Frames
Frame
PC
JVM
(4a) Object Access
Frame
Method Area
GOS
(heap)
PC
(4b) Load method
from NFS
Source node
Destination node
Load
Monitor
JESSICA2, CSIS, HKU
Migration
Manager
Method Area
(2) Stack analysis
Stack capturing
Thread Scheduler
(1) Alert
(3) Restore execution
HKJU, Dec. 18, 2002
29
What are those functions?
Migration points selection

Delayed to the head of loop basic block or method
Register context handler


Spill dirty registers at migration point without invalidation so
that native codes can continue the use of registers
Use register recovering stub at restoring phase
Variable type deduction

Spill type in stacks using compression
Java frames linking

Discover consecutive Java frames
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
30
Dynamic Thread State Capturing and
Restoring in JESSICA2
migration point
Bytecode verifier
migration point
Selection
(Restore)
register
allocation
bytecode translation
1. Add migration checking
2. Add object checking
3. Add type & register spilling
Intermediate Code
Register recovering
reg
slots
invoke
cmp mflag,0
jz ...
cmp obj[offset],0
jz ...
mov 0x110182, slot
...
code generation
Native Code
Linking &
Constant Resolution
(Capturing)
Global Object Access
Native stack scanning
Java frame
mov slot1->reg1
mov slot2->reg2
...
C frame
Frame
JESSICA2, CSIS, HKU
Native
thread
stack
HKJU, Dec.
18, 2002
31
How to Maintain Memory Consistency
in a Distributed Environment ?
T1
T2
T3
T4
Heap
OS
PC
T5
T6
T7
T8
Heap
OS
PC
OS
PC
OS
PC
High Speed Network
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
32
Embedded Global Object
Space (GOS)
Main Features:





Take advantage of JVM runtime information for
optimization (e.g. object types, accessing threads, etc.)
Use threaded I/O interface inside JVM for
communication to hide the latency  Non-blocking GOS
access
OO based to reduce false sharing
Home-based, compliant with JVM Memory Model (“Lazy
Release Consistency”)
Master Heap (home objects) and Cache Heap (local and
cached objects) : reduce object access latency
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
33
Object Cache
JVM
Cache Heap Area
Hash
table
Java thread
Hash table
Java thread
Master Heap Area
Java thread
Java thread
Hash
table
JVM
Master Heap Area
Hash
table
Cache Heap Area
Global Heap
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
34
Adaptive object home
migration
Definition

“home” of an object = the JVM that holds the
master copy of an object
Problems

cache objects need to be flushed and re-fetched
from the home whenever synchronization happens
Adaptive object home migration

if # of accesses from a thread dominates the total
# of accesses to an object, the object home will be
migrated to the node where the thread is running
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
35
I/O redirection
Timer


Use the time in Master node as the standard time
Calibrate the time in worker node when they register to master
node
File I/O



Use half word of fd as node number
Open file
 For read, check local first, then master node
 For write, go to master node
Read/Write
 Go to the node specified by the node number in fd
Network I/O


Connectionless send: do locally
Others, go to master
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
36
Experimental Setting
Modified Kaffe Open
JVM version 1.0.6
Linux PC Clusters:
1.
2.
Pentium II PCs at 540MHz
(Linux 2.2.1 kernel)
Connected by Fast Ethernet
HKU Gideon 300 Cluster
(RayTracing)
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
37
Parallel Ray Tracing on JESSICA2
(Running at 64-node Gideon 300 cluster)
Linux 2.4.18-3 kernel (Redhat 7.3)
64 nodes: 108 seconds
1 node: 3430 seconds (~ 1 hour)
Speedup = 4402/108=40.75
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
38
Conclusions
Transparent Java thread migration in JIT
compiler enable the high-performance
execution of multithreaded Java application
on clusters while keeping the merits of Java


JVM approach => dynamic class loading
Just-in-Time compilation for speed
An embedded GOS layer can take advantage
of the JVM runtime information to reduce
communication overhead
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
45
Thanks
HKU SRG:

http://www.srg.csis.hku.hk/
JESSICA2 Webpage:

http://www.csis.hku.hk/~clwang/proj
ects/JESSICA2.html
JESSICA2, CSIS, HKU
HKJU, Dec. 18, 2002
46