Fast JIT Compilation


HotSpot™:
A Huge Step Beyond JITs
Zhanyong Wan
May 1st, 2000
Sources of Information
■ From Sun’s web site
– HotSpot white paper
http://java.sun.com/products/hotspot/whitepaper.html
– Various articles on Sun’s web site
http://java.sun.com/products/hotspot/
■ From other web sites
– Java on Steroids: Sun's High-Performance Java Implementation,
U. Hölzle et al. (slides from HotChips IX, August 1997)
http://www.cs.ucsb.edu/oocsb/papers/HotChips.pdf
– The HotSpot Virtual Machine, Bill Venners
http://www.artima.com/designtechniques/hotspot.html
– HotSpot: A new breed of virtual machine, Eric Armstrong
http://www.javaworld.com/jw-03-1998/f_jw-03-hotspot.html
5/1/2000
Zhanyong Wan
2
Overview
■ Why Java is different
■ Why JIT is not good enough
■ What HotSpot does
■ The HotSpot architecture
– Memory model
– Thread model
– Adaptive optimization
■ Conclusions
History
■ 1st generation JVM
– Purely interpreting
– 30-50 times slower than C++
■ 2nd generation JVM
– JIT compilers
– 3-10 times slower than C++
■ Static compilers
– Better performance than JITs
The Future?
■ HotSpot
– Dynamic, fully optimizing compiler
– Close-to-C++ performance
– May even exceed the speed of C++ in the future
Questions of Interest
■ How is it possible that HotSpot runs programs faster than the native code generated by a static optimizing Java compiler?
■ How does HotSpot achieve this? (The collection of technologies used by HotSpot.)
■ Where did the ideas come from?
■ Which of these technologies also apply to other systems (e.g., JITs, static source-code/bytecode compilers, C++)?
■ Can Java be made to surpass the performance of C++, or is this hype?
Why Java Is Different (from C++)
■ Granularity of factoring
– Smaller classes
– Smaller methods
– More frequent calls
– Standard compiler analysis fails
■ Dynamic dispatch
– Slower calls for virtual functions
– Much more frequent than in C++
■ Sophisticated run-time system
– Allocation, garbage collection
– Threads, synchronization
■ Dynamically changing program
– Classes loaded/discarded on the fly
Why Java Is Different (cont’d)
■ Distributed in a portable form
– A compiler running on the target machine can generate optimal machine code for that particular processor version
• e.g. Pentium vs. Pentium II
– Welcomes dynamic compilation (developed in the last decade)!
Find the Java Bottleneck
■ Time used in a typical Java program executed w/ the JDK interpreter:
– Allocation/GC: 1/6
– Synchronization: 1/6
– Bytecode execution: 2/3
– Native methods: negligible
(Pie chart: bytecodes, alloc/GC, synch, native)
■ Performance-critical code: the “hot spots”
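The fractions above already show why a faster bytecode engine alone cannot be enough. As a rough worked example (an application of Amdahl's law to the slide's numbers, not a figure taken from the sources), let p = 2/3 be the fraction of time spent executing bytecode and s the speedup a compiler achieves on that fraction, with the remaining 1/3 (allocation/GC and synchronization) left unchanged:

```latex
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad p = \tfrac{2}{3}
\quad\Longrightarrow\quad
\lim_{s \to \infty} S(s) = \frac{1}{1 - \tfrac{2}{3}} = 3
```

Even an infinitely fast compiler would make such a program at most 3 times faster than the interpreter, which is why allocation/GC and synchronization have to be attacked as well.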
Why JIT Is Not Good Enough
■ Compiles on a method-by-method basis when a method is first invoked
■ Compilation consumes “user time”
– Startup latency
– Dilemma: either good code or a fast compiler
• Gains of better optimization may not justify the extra compile time
• More concerned w/ generating code quickly than w/ generating the quickest code
■ Root of the problem: compilation is too eager
The Baaad Way to Optimize
■ People try to help: the optimization lore
– Make methods final or static
– Use large classes/methods
– Avoid interfaces (interface method invocation is much slower than regular dynamic method dispatch)
– Avoid creating lots of short-lived objects
– Avoid synchronization (very expensive)
– All of this goes against good OO design!
■ “Premature optimization is the root of all evil.” (Donald Knuth)
The HotSpot Way to Optimize
■ Optimize only when you know you have a problem
1. A program starts off being interpreted
2. A profiler collects run-time info in the background
3. After a while, a set of hot spots is identified
4. A thread is launched to compile the methods in the hot spots
• Execution of the program is *not* blocked
• “Take your time!” – fully optimizing
• Takes advantage of the late compilation: run-time info is used
5. Once a method is compiled, it doesn’t need to be interpreted
6. Native code can be discarded when the hot spots change
• Keeping the footprint small
• Bytecode is always kept around
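The six steps above can be modeled in a few lines of Java. This is a hypothetical sketch, not HotSpot's code: the class `AdaptiveMethod`, the `HOT_THRESHOLD` value, and the two `IntUnaryOperator`s standing in for "bytecode" and "compiled code" are all assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntUnaryOperator;

// Hypothetical model of adaptive compilation; HotSpot's real profiler and compiler are far more involved.
final class AdaptiveMethod {
    static final int HOT_THRESHOLD = 1_000;        // assumed invocation threshold (illustrative only)

    private final IntUnaryOperator bytecode;       // "interpreted" version: always kept around
    private final IntUnaryOperator optimized;      // stands in for what the optimizing compiler would emit
    private final ExecutorService compilerThread;  // background "compiler" thread
    private final AtomicInteger invocations = new AtomicInteger(); // step 2: profiling counter
    private volatile IntUnaryOperator compiled;    // null until the background compile finishes

    AdaptiveMethod(IntUnaryOperator bytecode, IntUnaryOperator optimized, ExecutorService compilerThread) {
        this.bytecode = bytecode;
        this.optimized = optimized;
        this.compilerThread = compilerThread;
    }

    int invoke(int arg) {
        IntUnaryOperator code = compiled;
        if (code != null) {
            return code.applyAsInt(arg);           // step 5: compiled code, no interpretation
        }
        if (invocations.incrementAndGet() == HOT_THRESHOLD) {
            // Steps 3-4: the method has become a hot spot; "compile" it on the background thread.
            compilerThread.execute(() -> compiled = optimized);
        }
        return bytecode.applyAsInt(arg);           // step 1: keep interpreting; the caller is never blocked
    }

    // Step 6: drop the compiled code when the hot spots change; the bytecode is still there.
    void discardCompiledCode() {
        compiled = null;
        invocations.set(0);
    }

    boolean isCompiled() {
        return compiled != null;
    }
}
```

Real VMs typically also count loop back-edges, not just invocations, so a long-running loop inside a method that is called only once can still get compiled; the single counter here is a simplification.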
The HotSpot Way (cont’d)
■ Tackles each of the bottlenecks
– Adaptive optimization
– Fast, accurate garbage collection
– Fast thread synchronization
■ Performance
– 2-3 times faster than JITs
– Comparable to C++
■ Most importantly, eliminates the “performance excuse” for poor designs/code
The HotSpot Architecture
■ Memory model
■ Thread model
■ Adaptive compiler
The HotSpot Memory Model
■ Object references
– Java 2 SDK: as indirect handles
• Relocating objects made easy
• A significant performance bottleneck
– HotSpot: as direct pointers
• A performance boost
• GC must adjust all references to an object when it is relocated
■ Object headers
– Java 2 SDK: 3 words
– HotSpot: 2 words
• 2 bits for GC mark (reference count removed?)
• An 8% saving in heap size
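A toy model may make the handle-versus-pointer trade-off concrete. It is a hypothetical sketch: the heap is an `int` array, an "address" is an index into it, and `ToyHeap`, its handle table, and its method names are invented for illustration.

```java
// Hypothetical toy heap contrasting handle-based and direct-pointer object references.
final class ToyHeap {
    final int[] words = new int[64];       // the heap: an object is a run of consecutive words
    final int[] handleTable = new int[8];  // handle -> current address of the object

    // Handle-based access: two dependent loads (handle table, then heap).
    int readViaHandle(int handle, int field) {
        return words[handleTable[handle] + field];
    }

    // Direct-pointer access: a single load.
    int readDirect(int address, int field) {
        return words[address + field];
    }

    // Relocation with handles: copy the object, then patch exactly one table entry.
    void moveWithHandle(int handle, int size, int newAddress) {
        System.arraycopy(words, handleTable[handle], words, newAddress, size);
        handleTable[handle] = newAddress;
    }

    // Relocation with direct pointers: copy the object, then patch every slot that refers to it.
    // Only an accurate GC knows which slots hold references; here the caller supplies them.
    void moveDirect(int oldAddress, int size, int newAddress, int[] referenceSlots) {
        System.arraycopy(words, oldAddress, words, newAddress, size);
        for (int slot : referenceSlots) {
            if (words[slot] == oldAddress) {
                words[slot] = newAddress;
            }
        }
    }
}
```

Every read pays for the handle's extra indirection, whereas the cost of direct pointers is paid only when the GC relocates an object, which is far rarer than reading a field.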
Garbage Collection Background
■ GC has traditionally been considered inefficient
– Takes 1/6 of the time in an interpreting JVM
– Even worse in a JIT VM
■ Modern GC technology
– Performs substantially better than explicit freeing
– How can this be true?
• Unnecessary copies avoided
• Memory segmentation, spatial locality
The HotSpot Garbage Collector
■ A high-level GC framework
– New collection algorithms can be “plugged in”
– Currently has 3 cooperating GC algorithms
■ Major features
– Fast allocation and reclamation
– Fully accurate: guarantees full memory reclamation
– Completely eliminates memory fragmentation
– Incremental, no perceivable pauses (usually < 10 ms)
– Small memory overhead
• 2-bit GC mark per object
• 2-word object header (instead of 3 words in the Java 2 SDK)
The HotSpot GC: Accuracy
■ A partially accurate (conservative) collector must
– Either avoid relocating objects
– Or use handles to refer indirectly to objects (slow)
■ The HotSpot collector
– Fully accurate
– All inaccessible objects can be reclaimed
– All objects can be relocated
• Eliminates memory fragmentation
• Increases memory locality
The HotSpot GC: the Structure
■ Three cooperating collectors
– A generational copying collector
• For short-lived objects
– A mark-compact “old object” collector
• For longer-lived objects when the live object set is small
– An incremental “pauseless” collector
• For longer-lived objects when the live object set is big
Generational Copying Collector
■ Observation: the vast majority (often > 95%) of the objects are very short-lived
■ The way it works
– A memory area is reserved as an object “nursery”
– Allocation is just updating a pointer and checking for overflow: extremely fast
– By the time the nursery overflows, most objects in it are dead; the collector just moves the few survivors to the “old object” memory area
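The allocation fast path described above ("updating a pointer and checking for overflow") can be sketched as follows. This is a hypothetical illustration assuming a fixed-size nursery measured in words; `Nursery` and its methods are invented names, and the copying of survivors is left out.

```java
// Hypothetical bump-pointer nursery; the class and method names are illustrative only.
final class Nursery {
    private final int capacity; // nursery size in words
    private int top;            // address of the next free word

    Nursery(int capacity) {
        this.capacity = capacity;
    }

    /** Returns the new object's address, or -1 if the nursery would overflow. */
    int allocate(int sizeInWords) {
        if (sizeInWords <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        if (sizeInWords > capacity - top) {
            return -1;               // overflow: time for a minor collection
        }
        int address = top;
        top += sizeInWords;          // allocation is just a pointer bump
        return address;
    }

    /** Once the few survivors have been copied to the old-object area, the whole nursery is free again. */
    void resetAfterCollection() {
        top = 0;
    }

    int usedWords() {
        return top;
    }
}
```

Because dead objects are never visited, the cost of a minor collection depends on the number of survivors, not on how many objects were allocated.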
Mark-Compact Collector
■ Runs rarely
– Triggered by low-memory conditions or programmatic requests
■ Time proportional to the size of the set of live objects
– Calls for an incremental collector when that set is large
Incremental Pauseless Collector
■ An alternative to the mark-compact collector
■ Relatively constant pause times, even w/ extremely large data sets
■ Suitable for server applications and soft real-time applications (games, animations)
■ The way it works
– The “train” algorithm
– Breaks up GC pauses into tiny pauses
– Not a hard real-time algorithm: no guaranteed upper limit on pause times
■ Side benefit: better memory locality
– Tends to relocate tightly coupled objects together
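The train algorithm itself is too involved for a slide-sized example. The sketch below swaps in a simpler, generic technique to show the core idea of breaking one long GC pause into many short ones: incremental marking, where each pause scans at most a fixed budget of objects. It is hypothetical and deliberately incomplete: it has no write barrier, so the object graph must not change between steps, which a real incremental collector cannot assume.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Generic incremental marking with a per-pause work budget (NOT the train algorithm).
// No write barrier: the graph must not be mutated between steps in this toy.
final class IncrementalMarker {
    static final class Node {
        final List<Node> refs = new ArrayList<>();
        boolean marked;
    }

    private final Deque<Node> gray = new ArrayDeque<>(); // marked but not yet scanned

    void start(List<Node> roots) {
        for (Node root : roots) {
            shade(root);
        }
    }

    private void shade(Node n) {
        if (!n.marked) {
            n.marked = true;
            gray.push(n);
        }
    }

    /** One short pause: scan at most {@code budget} objects. Returns true when marking is done. */
    boolean step(int budget) {
        for (int done = 0; done < budget && !gray.isEmpty(); done++) {
            Node n = gray.pop();
            for (Node child : n.refs) {
                shade(child);
            }
        }
        return gray.isEmpty();
    }
}
```

Each call to `step` is bounded by `budget`, so the pause length no longer grows with the size of the live data set; only the number of pauses does.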
The HotSpot Thread Model
■ Native thread support
– Currently supports Solaris & 32-bit Windows
– Preemption
– Multiprocessing
■ Per-thread activation stack is shared w/ native methods
– Fast calls between C and Java
Thread Synchronization
■ Takes 1/6 of the time in an interpreting JVM
– (I think) the proportion can be even higher for a JIT
■ HotSpot’s thread synchronization
– Ultra-fast (“a breakthrough”)
– Constant time for all uncontended (no rival) synchronization
– Fully scalable to multiprocessors
– Makes fine-grained synchronization practical, encouraging good OO design
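The sources do not describe how HotSpot's locking is implemented. As a hedged illustration of why uncontended synchronization can take constant time, the hypothetical `ThinLock` below acquires with a single compare-and-set when no other thread holds it; only contention falls into a slower path (here a simple spin, where a real VM would typically switch to an OS-level monitor and block).

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical "thin" lock: one CAS on the uncontended path. Not HotSpot's implementation.
final class ThinLock {
    private final AtomicReference<Thread> owner = new AtomicReference<>();
    private int recursions; // only read/written by the thread that owns the lock

    void lock() {
        Thread me = Thread.currentThread();
        if (owner.compareAndSet(null, me)) {
            return;                          // fast path: uncontended, constant time
        }
        if (owner.get() == me) {
            recursions++;                    // re-entrant acquire, as with Java monitors
            return;
        }
        while (!owner.compareAndSet(null, me)) {
            Thread.onSpinWait();             // slow path: contended; a real VM would block instead of spinning
        }
    }

    void unlock() {
        if (owner.get() != Thread.currentThread()) {
            throw new IllegalMonitorStateException("current thread does not hold the lock");
        }
        if (recursions > 0) {
            recursions--;
        } else {
            owner.set(null);                 // release: the volatile write publishes the critical section
        }
    }
}
```

The fast path touches one word and never calls into the operating system, which is what makes fine-grained locking affordable when most locks are never contended.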
Adaptive Inlining
■ Method invocations reduce the effectiveness of optimizers
– Standard optimizers don’t perform well across method boundaries (they need bigger blocks of code)
– Inlining is the solution
■ Inlining has problems
– Increased memory footprint
– Inlining is harder w/ OO languages because of dynamic dispatching (worse in Java than in C++)
■ HotSpot uses run-time information to
– Inline only the critical methods
– Limit the set of methods that might be invoked at a certain point
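Limiting "the set of methods that might be invoked at a certain point" can be illustrated at the source level. The sketch is hypothetical and hand-written: a dynamic compiler does this on its internal representation, not on Java source, and `InliningDemo`, `Shape`, `Circle`, and `Square` are invented names. It shows one common form, a type-guarded inline; a compiler that can deoptimize may even drop the guard while only one implementation is loaded.

```java
// Hand-written illustration of guarded inlining at a virtual call site (hypothetical names).
final class InliningDemo {
    interface Shape {
        double area();
    }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    // As written: one interface (dynamically dispatched) call per element.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    // Roughly what a compiler could produce if profiling shows this call site almost always sees Circle:
    // a cheap type guard, the inlined body, and the original call as a fallback.
    static double totalAreaInlined(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            if (s.getClass() == Circle.class) {
                Circle c = (Circle) s;
                sum += Math.PI * c.r * c.r;  // inlined Circle.area()
            } else {
                sum += s.area();             // rare case: ordinary dynamic dispatch
            }
        }
        return sum;
    }
}
```

Once the body is inlined, the loop becomes one block of code that the classic optimizers can work on, which is the point of the first bullet above.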
Dynamic Deoptimization
■ Simple inlining may violate Java semantics
– A program can change its patterns of method invocation
– A Java program can change on the fly via dynamic class loading/discarding
– Optimizations may become invalid
■ Must be able to deoptimize dynamically!
– HotSpot can deoptimize (revert to bytecode?) a hot spot even while the code for it is executing
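A small program can set up the situation this slide describes. In the hypothetical example below, the call inside `legsOf` sees only `Bird` for a long time, so a dynamic compiler may inline `Bird.legs()` there (possibly inside the still-running loop of `run`); then a `Dog` arrives and that assumption no longer holds. Whether and when deoptimization actually happens is up to the VM; on a current HotSpot JVM, running with `-XX:+PrintCompilation` may show compiled methods reported as "made not entrant". The program's result never depends on any of this.

```java
// Hypothetical demo of a call site whose observed receiver types change at run time.
final class DeoptDemo {
    abstract static class Animal {
        abstract int legs();
    }

    static final class Bird extends Animal {
        int legs() { return 2; }
    }

    static final class Dog extends Animal {
        int legs() { return 4; }
    }

    // The call site a compiler may optimize by inlining the only receiver type it has seen so far.
    static int legsOf(Animal a) {
        return a.legs();
    }

    static long run(int iterations) {
        long total = 0;
        Animal bird = new Bird();
        for (int i = 0; i < iterations; i++) {
            total += legsOf(bird);   // phase 1: only Bird ever reaches the call site
        }
        Animal dog = new Dog();      // phase 2: a new receiver type appears
        for (int i = 0; i < iterations; i++) {
            total += legsOf(dog);    // code with Bird.legs() inlined would now be wrong: the VM must deoptimize
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000)); // prints 6000000
    }
}
```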
Fully Optimizing Compiler
■ Performs all the classic optimizations
– Dead code elimination
– Loop invariant hoisting
– Common subexpression elimination
– Constant propagation
– And more …
■ Java-specific optimizations
– Null-check elimination
– Range-check elimination
■ Global graph-coloring register allocator
■ Highly portable
– Relies on a small machine description file
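Several of the optimizations listed above can be shown side by side in Java. This is a hypothetical, hand-written before/after at the source level (`LoopOptDemo` is an invented name): a real compiler performs these transformations on its intermediate representation, and range checks cannot be removed from Java source at all, so the second method only marks where the compiler could drop them.

```java
// Source-level illustration (hypothetical) of loop-invariant hoisting plus null/range-check elimination.
final class LoopOptDemo {
    // As written: x * y is recomputed on every iteration, and each a[i] implies
    // a null check and a range check.
    static int sumScaled(int[] a, int x, int y) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * (x * y);
        }
        return sum;
    }

    // Hand-optimized equivalent of what the compiler can derive.
    static int sumScaledOptimized(int[] a, int x, int y) {
        int n = a.length;          // the only null check: throws NullPointerException if a == null
        int k = x * y;             // loop-invariant hoisting: computed once
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i] * k;       // 0 <= i < a.length always holds, so a range check here is provably redundant
        }
        return sum;
    }
}
```

Both methods compute the same result for every input, including integer overflow, because `int` multiplication wraps the same way regardless of grouping.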
Transparent Debugging & Profiling Semantics
■ Native code generation & optimization fully transparent to the programmer
– Uses two stacks
• One real, one simulating
– Overhead of two stacks?
■ Pure bytecode semantics: easy debugging & profiling
■ Question: what’s the point of transparent profiling semantics?
Performance Evaluation
■ Micro-benchmarks: not the way
– No or few method calls/synchronizations
– Small live data set
– No correlation w/ real programs
– Give unrealistic results for HotSpot
■ SPECjvm98 benchmark
– The only industry-standard benchmark for Java
– Predictive of performance across a number of real applications
Where are the ideas from?
■ Mostly from the last decade’s academic work
– Dynamic compilation
– Modern GC
– HotSpot puts them together
■ Academic research is relevant!
(My) Conclusions
■ HotSpot is great
– Many new technologies previously only seen in academia
■ Java performance may come close to or exceed that of current C++ implementations
■ However, Sun’s argument that Java can be faster than C++ is not convincing yet:
– C++ has better control over machine resources
– Many technologies used in HotSpot can be exploited for C++ as well, especially:
• Fast synchronization
• Dynamic compilation
• Maybe GC (for some dialects of C++)
– Whether Java can exceed C++ remains to be tested