A Comparative Evaluation of Parallel Programming Models for

Parallel Programming
Models
Monica Borra
Outline
 Shared Memory Models – Revision
 Comparison of shared memory models
 Distributed Memory Models
 Parallelism in Big Data Technologies
 Conclusion
Shared Memory Models
 Multi-threaded – POSIX Threads (Pthreads), TBB, OpenMP
 Multi-processors – Cilk, ArBB, CUDA, Microsoft Parallel Patterns
 Compiler Directives and Library functions(PRAM-like)
Comparative Study I
Most commercially available general purpose computers include hardware
features to increase the parallelism.
hyperthreading, multi-core, ccNUMA architecture
General-purpose threading
CPU vector instructions and GPUs
SIMD
Compared models
 Four Parallel programming models have been selected.
 Each of these models exploits different hardware parallel features
mentioned earlier.
 Also, they require different levels of programming skills
 OpenMP, Intel TBB – parallel threads on multicore systems
 Intel ArBB – threads + multicore SIMD features
 CUDA – SIMD GPU features.
CUDA
• CUDA (Compute Unified Device Architecture) is a C/C++ programming
model and API (Application Programming Interface) introduced by NVIDIA
to enable software developers to code general purpose apps that run on
the massively parallel hardware on GPUs.
• GPUs are optimal for data parallel apps aka SIMD (Single Instruction
Multiple Data).
• Threads running in parallel use extremely fast shared memory for
communication.
Evaluations:
4 benchmarks:
Matrix multiplication, Simple 2D convolution, Histogram computation and
Mandelbrot.
Different underlying computer architectures.
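As an illustration of the data-parallel pattern behind the histogram benchmark, here is a minimal Python sketch (not the code used in the study). It assumes integer input values in the range 0..num_bins-1; the names partial_histogram and parallel_histogram are hypothetical. Each worker fills a private histogram and the partial results are merged at the end. Note that CPython threads mainly show the decomposition here, not a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_histogram(chunk, num_bins):
    """Count the values of one chunk into a private histogram."""
    counts = [0] * num_bins
    for value in chunk:
        counts[value] += 1  # assumes 0 <= value < num_bins
    return counts


def parallel_histogram(data, num_bins, num_workers=4):
    """Split the data into chunks, histogram each chunk, then merge."""
    chunk_size = max(1, (len(data) + num_workers - 1) // num_workers)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda c: partial_histogram(c, num_bins), chunks))
    # Reduction step: sum the private histograms bin by bin.
    merged = [0] * num_bins
    for counts in partials:
        for bin_index, count in enumerate(counts):
            merged[bin_index] += count
    return merged
```

Keeping one private histogram per worker avoids concurrent updates to shared bins, which is the usual reason this benchmark is harder to parallelize than matrix multiplication.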
Comparison between OpenMP – TBB and ArBB –
CUDA for simple 2D Convolution
Comparison Summary:
 OpenMP and TBB show much lower performance than ArBB and CUDA.
 TBB seems to perform worse than OpenMP on single-socket architectures,
but the situation seems to reverse on ccNUMA architectures, where TBB
shows a significant improvement.
 ArBB performance tends to be comparable with CUDA performance in most
cases, although it is normally lower.
 Hence, there is evidence that carefully designed applications that take
advantage of the TLP and SIMD features of a top-range multicore,
multisocket architecture (such as ArBB applications) may approach the
performance of a top-range CUDA GPGPU.
Comparative Study II
 OpenMP, Pthread, Microsoft Parallel Patterns APIs
 Computation of matrix multiplication
 Performed on an Intel i5 processor
 Execution time and speed up
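A rough Python sketch of what such a measurement involves, assuming a row-block decomposition of C = A * B and the usual definition speed-up = serial time / parallel time. The function names are hypothetical and this is not the code from the study, which used OpenMP, Pthreads and Microsoft Parallel Patterns.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    columns = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a_rows]


def parallel_matmul(a, b, num_workers=2):
    """Row-block decomposition: each worker computes some rows of C."""
    block = max(1, (len(a) + num_workers - 1) // num_workers)
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(lambda rows: multiply_rows(rows, b), blocks)
        # pool.map preserves block order, so rows come back in order.
        return [row for part in results for row in part]


def speedup(serial_time, parallel_time):
    """Speed-up as the ratio of serial to parallel execution time."""
    return serial_time / parallel_time
```

For example, a run that takes 8.0 s serially and 2.0 s in parallel has a speed-up of 4.0.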
Experimental Results:
Distributed Parallel Computing
 Cluster based
 Message Passing Interface(MPI) – de-facto standard
 More advantageous when communication between the nodes is high
 Originally designed for HPC
 Apache Hadoop
 Parallel processing for Big Data
 Implementation of a programming Model, “Map Reduce”
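The MapReduce model can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle and reduce phases using word count; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different nodes, and the shuffle step is where most of the inter-node communication happens.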
Why is parallelism in
Big Data important?
 Innumerable sources – RFID, Sensors,
Social Networking
 Volume, Velocity and Variety
Apache Hadoop
 Framework that allows for the distributed parallel processing of large data
sets.
 Batch processes raw unstructured data
 Highly reliable and scalable
 Consists of 4 modules: common utilities, storage, resource management
and processing
Case Study: Can we take advantage
of MPI to overcome communication
overhead in Big Data Technologies?
 Challenges:
1. Is it worth speeding up communication?
a. Percentage of time taken for communications alone
b. Comparisons of achievable latency and peak bandwidth for
point to point communications through MPI against Hadoop.
2. How difficult is it to adapt MPI to Hadoop, and what are the minimal
extensions to the MPI standard?
A pair of new MPI calls supporting Hadoop data communication specified
via key-value pairs.
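To make the idea of key-value based communication concrete, here is an in-process Python sketch. It is not the DataMPI API: the class KeyValueChannel and the methods kv_send and kv_recv are hypothetical stand-ins, and the hash partitioning is an assumption borrowed from how Hadoop-style shuffles usually route keys.

```python
import zlib
from collections import defaultdict


class KeyValueChannel:
    """Hypothetical stand-in for a key-value communication layer."""

    def __init__(self, num_receivers):
        self.num_receivers = num_receivers
        self.inboxes = [[] for _ in range(num_receivers)]

    def partition(self, key):
        """Stable hash so the same key always reaches the same receiver."""
        return zlib.crc32(key.encode("utf-8")) % self.num_receivers

    def kv_send(self, key, value):
        """Sender side: route the pair by key, not by destination rank."""
        self.inboxes[self.partition(key)].append((key, value))

    def kv_recv(self, receiver_rank):
        """Receiver side: return all received values grouped by key."""
        grouped = defaultdict(list)
        for key, value in self.inboxes[receiver_rank]:
            grouped[key].append(value)
        return dict(grouped)
```

The point of the sketch is that the sender names only a key and a value; the library, not the application, decides which task receives the pair.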
Contributions of the case study:
 Abstracting the requirements of the communication model
 Dichotomic, dynamic, data-centric bipartite model.
 Key-Value pair based
 Novel design of DataMPI – High Performance Communication
Library
 Various benchmarks to prove efficiency and ease of use.
Contributions:
Comparison: DataMPI vs Hadoop
 Several big data representative benchmarks
WordCount, Terasort, K-means, Top K, PageRank
 Compared for various parameters
Efficiency, fault tolerance, ease of use
Comparisons for
Terasort
Both Hadoop and DataMPI exhibit
similar trends
DataMPI shows better results in all
cases.
Results:
 Efficiency: DataMPI speeds up varied Big Data workloads and improves job
execution time by 31%-41%.
 Fault Tolerance: DataMPI supports fault tolerance. Evaluations show that
DataMPI-FT can attain 21% improvement over Hadoop.
 Scalability: DataMPI scales as well as Hadoop while achieving a 40%
performance improvement.
 Flexibility: the coding complexity of using DataMPI is on par with that of
using traditional Hadoop.
Conclusion:
 The efficiency of a model in shared memory parallel computing depends
on the type of the program and best use of underlying hardware parallel
processing features.
 Extending MPI to computation-heavy problems such as big data mining can
be considerably more efficient than traditional frameworks such as Hadoop.
 Shared memory models are easy to implement, but MPI gives the best
results for more complex problems.
References
 L. Sanchez, J. Fernandez, R. Sotomayor, J. D. Garcia, “A Comparative
Evaluation of Parallel Programming Models for Shared-Memory
Architectures”, IEEE 10th International Symposium on Parallel and
Distributed Processing with Applications, 2012, pp 363 - 374
 M. Sharma, P. Soni, “Comparative Study of Parallel Programming Models to
Compute Complex Algorithm”, IEEE International Journal of Computer
Applications, 2014, pp 174 - 180
 Apache Hadoop, hadoop.apache.org
 Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, Zhiwei Xu, “DataMPI: Extending MPI
to Hadoop-like Big Data Computing”, IEEE 28th International Parallel and
Distributed Processing Symposium, 2014, pp 829 - 838
 Lorin Hochstein, Victor R. Basili, Uzi Vishkin, John Gilbert, “A pilot study to
compare programming effort for two parallel programming models”, The
Journal of Systems and Software, 2008
Questions?
Thank you!!