Enhancing X10 Performance by Auto

Download Report

Transcript Enhancing X10 Performance by Auto

Enhancing X10 Performance
by Auto-tuning
the Managed Java Back-end
Vimuth Fernando, Milinda Fernando, Tharindu Rusira, Sanath Jayasena
Department of Computer Science and Engineering
University of Moratuwa
X10 Language
• X10[1] is an object-oriented, Asynchronous Partitioned Global
Address Space (APGAS) language
• Designed to obtain productivity for high performance applications
• X10 code is compiled into either
• Native backend - C++ code
• Managed backend - Java code
• In general, the Java back end's performance is low
2
Performance Tuning the Java Backend
• Our work focused on improving the performance of X10 programs
that uses the Java back-end
• The native backend is commonly used for performance critical
applications
• Tuning, if successful, can provide best of both worlds for the Java
backend in terms of
• Performance
• Java’s portability and availability of enterprise level libraries
• Employs auto-tuning techniques that have been widely used in similar
High Performance Computing workloads[4],[5],[6]
3
OpenTuner Framework[2]
• A framework for developing
auto-tuning applications
• Can be used to
• Define a set of configurations and
the possible configuration space
• Search through the configuration
space to identify the best
configuration
Structure of the OpenTuner Framework[2]
4
Our Auto-tuning Approach
• X10 Java backend code runs on a set
of Java Virtual Machines (JVMs)
• JVMs expose hundreds of tunable
parameters that can be used to
change their behavior
• By tuning these parameters we can
generate increased performance
Some Tunable parameters exposed by Java Virtual Machines
5
Our Auto-tuning Approach ctd.
• X10 programs typically run on large
distributed systems
• So a large number of JVMs are created
each with 100s of parameters that
needs to be tuned
• To overcome this issue we use
• JATT[2] – Introduces a hierarchical structure
for the JVM tunable parameters making
tuning faster
• Parameter duplication – All JVMs share a
single set of parameter values. Works
because the workloads are similar
The process by our auto-tuner
6
Our Auto-tuning Approach ctd.
• OpenTuner is used to generate test configurations consisting of a
specific combination of parameter values
• These configurations are then distributed to the nodes running the
X10 program compiled to the Java backend
• The resulting performance is measured and used to direct the search
algorithm to get to a optimal configuration faster
• This cycle continues until a good performance gain is achieved
7
Benchmark Applications
• Benchmark applications provided with X10 source distribution were used
in our experiments
• The following benchmarks were available
• Kmeans – A program that implements K-means clustering problem. Total execution
time in seconds is measured as the performance metric.
• SSCA1 – Implements the sequence alignment problem using Smith-Waterman
algorithm.
• SSCA2 – A graph theory benchmark that stresses integer operations, large memory
footprints, and irregular memory access patterns.
• Stream - Measures the memory bandwidth and related computation rate for vector
kernels. Data rate in GB/s is measured as the performance metric.
• UTS - implements an Unbalanced Tree Search algorithm that measures X10’s ability
to balance irregular work across many places.
• LULESH 2.0 - Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics [26],
a proxy application for shock hydrodynamics.
8
Experimental Setup
• Our auto-tuning experiments were run on the following platforms
 Architecture-1
Intel Xeon E7-4820v2 CPU @2.00 GHz (32 cores)
64GB RAM
Ubuntu 14.04 LTS OpenJDK 7 (HotSpot JVM) update 55
 Architecture-2
Intel Core i7 CPU @3.40Ghz (4 cores)
16 GB RAM
Ubuntu 12.04 LTS OpenJDK 7 (HotSpot JVM) update 55
9
Results
• Significant performance gains achieved
through auto-tuning in all benchmarks
• Performance metrics differ across the
benchmarks. So following formula used to
measure performance gains
Fig. 2: X10 benchmarks performance improvements
10
• Both Architectures show similar performance gains
11
Performance tuning with time
12
Profiling
• Profiling of underlying JVM
characteristics was used to try and
identify the cause of performance
gains
• We profiled
•
•
•
•
Heap Usage
Compilation Rate
Class Loading Rate
Garbage Collection Time
• The profiling data of the default
configuration and the tuned one
was used to compare the two
runtimes
13
Discussion
• No single source of performance gains
• Profiling data shows that the performance gains are achieved through
different means for each benchmark. This shows that there is no silver bullet
configuration that fits all circumstances.
• We believe that this is further cause for auto-tuners as opposed to manual
tuning methods.
• Java backend still performs poorly compared to the native backend
• But with the performance improvements they become more acceptable
• This with the additional advantages offered by the Java backend makes it
more feasible for HPC applications
14
Input Sensitivity
• We wanted to analyze the identified
configurations’ sensitivity to input sizes
• We found that identified good configurations
tend to provide better performance that scale
with bigger input sizes
• The figure shows how the configuration
identified with the input size at it’s smallest
perform against the default configuration for
the LULESH benchmark
• Other Benchmarks displayed similar behavior
• This allows for us to tune benchmarks with
smaller inputs leading to reduced tuning
times.
15
Conclusions
• Auto-tuning methods can be used to generate performance gains in
distributed X10 programs compiled to the Java Backend
• This performance gains scale well with larger input sizes
• Using profiling data we studied the causes for performance gains and
identified that they differ based on the characteristics of the
benchmark
• This work provides evidence that auto-tuning methods needs to be
further explored in the search for better performance in X10
applications
16
Acknowledgements
• This project was supported by a Senate Research Grant awarded by
the University of Moratuwa
• We would also like to acknowledge the support provided by the LK
Domain Registry through the V. K. Samaranayake Grant
17
Thank you
18
Main References
[1] Mikio Takeuchi, Yuki Makino, Kiyokuni Kawachiya, Hiroshi Horii, Toyotaro Suzumura, Toshio Suganuma, and Tamiya
Onodera. 2011. “Compiling X10 to Java”. In Proceedings of the 2011 ACM SIGPLAN X10 Workshop (X10 ’11). ACM, New York,
NY, USA, Article 3, 10 pages.
[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan- Kelley, Jeffrey Bosboom, Una-May O’Reilly, and
Saman Amarasinghe. 2014. “OpenTuner: an extensible framework for program autotuning”. In Proceedings of the 23rd
international conference on Parallel architectures and compilation (PACT ’14). ACM, New York, NY, USA, 303-316.
[3] Sanath Jayasena, Milinda Fernando, Tharindu Rusira, Chalitha Perera and Chamara Philips, “Auto-tuning the Java Virtual
Machine”, In proceedings of 10th IEEE International Workshop in Automatic Performance Tuning(iWAPT’15), 2015.
[4] Hamouda, Sara S., Josh Milthorpe, Peter E. Strazdins, and Vijay Saraswat. “A Resilient Framework for Iterative Linear
Algebra Applications in X10.” In 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering
Computing (PDSEC 2015). 2015.
[5] Cheng, Long, Spyros Kotoulas, Tomas E. Ward, and Georgios Theodoropoulos. “High Throughput Indexing for Large-scale
Semantic Web Data.” (2015).
[6] Bilmes J, Asanovic K, Chin CW, Demmel J. “Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSIC
coding methodology”. Proceedings of the 1997 ACM International Conference on Supercomputing,1997.
19