Parallel Programming Models
Monica Borra
Outline
Shared Memory Models – Revision
Comparison of shared memory models
Distributed Memory Models
Parallelism in Big Data Technologies
Conclusion
Shared Memory Models
Multi-threaded – POSIX Threads (Pthreads), TBB, OpenMP
Multi-processors – Cilk, ArBB, CUDA, Microsoft Parallel Patterns
Compiler directives and library functions (PRAM-like)
Comparative Study I
Most commercially available general-purpose computers include hardware features to increase parallelism:
Hyper-threading, multi-core and ccNUMA architectures – general-purpose threading
CPU vector instructions and GPUs – SIMD
Compared models
Four parallel programming models were selected.
Each of these models exploits different hardware parallel features mentioned earlier.
They also require different levels of programming skill.
OpenMP, Intel TBB – parallel threads on multicore systems (see the OpenMP sketch below)
Intel ArBB – threads + multicore SIMD features
CUDA – SIMD GPU features
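The thread-level parallelism that OpenMP and TBB target can be pictured with a short loop. This is a minimal sketch, not code from the study; the array size and the saxpy-style computation are assumptions:

```cpp
// Minimal sketch: thread-level parallelism (TLP) with OpenMP on a multicore
// CPU. The loop body and problem size are illustrative assumptions, not the
// benchmark code used in the cited study.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;

    // One pragma turns the serial loop into a team of threads, each
    // handling a chunk of iterations; no SIMD features are required.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```

TBB expresses the same idea through a template library (e.g. tbb::parallel_for over a blocked range) instead of compiler directives.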
CUDA
CUDA (Compute Unified Device Architecture) is a C/C++ programming model and API (Application Programming Interface) introduced by NVIDIA to enable software developers to write general-purpose applications that run on the massively parallel hardware of GPUs.
GPUs are optimal for data-parallel applications, i.e. SIMD (Single Instruction, Multiple Data).
Threads running in parallel use extremely fast shared memory for communication (see the kernel sketch below).
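As a minimal illustration of threads communicating through shared memory, the following CUDA kernel sums the elements handled by one thread block; the kernel name and the block size of 256 are assumptions for the sketch:

```cuda
// Minimal sketch: a block-wide sum reduction in CUDA. Threads of one block
// cooperate through fast on-chip __shared__ memory; names and the block
// size of 256 are illustrative assumptions.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[256];          // one slot per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;  // stage global data in shared memory
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}
// Launched as, e.g.: blockSum<<<numBlocks, 256>>>(d_in, d_out, n);
```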
Evaluations:
Four benchmarks: matrix multiplication, simple 2D convolution, histogram computation and Mandelbrot (the 2D convolution loop is sketched below).
Different underlying computer architectures.
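For reference, the simple 2D convolution benchmark is essentially the following loop nest; the image and kernel dimensions here are assumptions, and the OpenMP pragma marks one natural way to parallelize it:

```cpp
// Minimal sketch of a simple 2D convolution: each output pixel is a
// weighted sum of a k x k neighborhood. Sizes and names are illustrative
// assumptions, not the study's exact benchmark parameters.
#include <vector>

void convolve2d(const std::vector<float>& img, std::vector<float>& out,
                const std::vector<float>& ker, int w, int h, int k) {
    int r = k / 2;                       // kernel radius (k is odd)
    #pragma omp parallel for             // OpenMP variant: rows in parallel
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            float acc = 0.0f;
            for (int ky = -r; ky <= r; ++ky)
                for (int kx = -r; kx <= r; ++kx)
                    acc += img[(y + ky) * w + (x + kx)]
                         * ker[(ky + r) * k + (kx + r)];
            out[y * w + x] = acc;
        }
}
```

ArBB and CUDA versions express the same computation as data-parallel operations over the whole image, which is what lets them exploit SIMD hardware.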
Comparison between OpenMP – TBB and ArBB – CUDA for simple 2D convolution
Comparison Summary:
OpenMP and TBB show very low performance compared to ArBB and CUDA.
TBB seems to have lower performance than OpenMP on single-socket architectures; the situation reverses on ccNUMA architectures, where TBB shows a significant improvement.
ArBB performance tends to be comparable with CUDA performance in most cases (although it is normally lower).
Hence, there is evidence that ArBB applications on a carefully designed top-range multicore, multi-socket architecture (taking advantage of both TLP and SIMD features) may approach the performance of top-range CUDA GPGPUs.
Comparative Study II
OpenMP, Pthreads and Microsoft Parallel Patterns APIs
Computation of matrix multiplication (a Pthreads sketch follows)
Performed on an Intel i5 processor
Metrics: execution time and speed-up
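To make the comparison concrete, here is a minimal Pthreads version of the matrix multiplication benchmark; the matrix size N and thread count T are assumptions, not the study's parameters:

```cpp
// Minimal sketch: matrix multiplication with POSIX threads. Each thread
// computes a contiguous band of rows of C = A * B. N and T are
// illustrative assumptions, not the cited study's parameters.
#include <pthread.h>
#include <vector>

const int N = 512, T = 4;                       // matrix size, thread count
std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

void* worker(void* arg) {
    long t = (long)arg;
    for (int i = t * N / T; i < (t + 1) * N / T; ++i)   // this thread's rows
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    return nullptr;
}

int main() {
    pthread_t tid[T];
    for (long t = 0; t < T; ++t)
        pthread_create(&tid[t], nullptr, worker, (void*)t);
    for (int t = 0; t < T; ++t)
        pthread_join(tid[t], nullptr);
    return 0;
}
```

The OpenMP version collapses all of the thread management above into a single pragma on the outer loop.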
Experimental Results:
Distributed Parallel Computing
Cluster-based
Message Passing Interface (MPI) – the de-facto standard (a minimal example follows)
More advantageous when communication between the nodes is high
Originally designed for HPC
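A minimal MPI point-to-point exchange, of the kind whose latency and bandwidth are compared against Hadoop later in the talk; the payload is arbitrary:

```cpp
// Minimal sketch: MPI point-to-point communication between two ranks.
// Build with mpicxx and run with mpirun -np 2; the payload is arbitrary.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42;
    if (rank == 0) {
        // Rank 0 sends one integer to rank 1.
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }
    MPI_Finalize();
    return 0;
}
```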
Apache Hadoop
Parallel processing for Big Data
Implementation of the "MapReduce" programming model (sketched below)
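The MapReduce model itself fits in a few lines. The single-process C++ toy below only illustrates the map → shuffle → reduce flow for word count; Hadoop exposes the same phases through its Java API and distributes them across a cluster:

```cpp
// Minimal single-process sketch of the MapReduce model (word count).
// Hadoop runs the same three phases distributed across a cluster; this
// toy only illustrates map -> shuffle -> reduce in one address space.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> docs = {"big data big", "data is big"};

    // Map: emit (word, 1) for every word. Shuffle: group values by key
    // (the std::map plays the role of the framework's shuffle phase).
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& doc : docs) {
        std::istringstream in(doc);
        std::string word;
        while (in >> word) grouped[word].push_back(1);   // emit(word, 1)
    }

    // Reduce: sum the grouped values for each key.
    for (const auto& [word, ones] : grouped) {
        int count = 0;
        for (int v : ones) count += v;
        std::cout << word << " " << count << "\n";
    }
    return 0;
}
```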
Why is parallelism in Big Data important?
Innumerable sources – RFID, sensors, social networking
Volume, Velocity and Variety
Apache Hadoop
Framework that allows for the distributed parallel processing of large data
sets.
Batch processes raw unstructured data
Highly reliable and scalable
Consists of 4 modules: common utilities, storage, resource management
and processing
Case Study: Can we take advantage of MPI to overcome communication overhead in Big Data technologies?
Challenges:
1. Is it worthwhile to speed up communication?
a. Percentage of time taken by communication alone
b. Comparison of achievable latency and peak bandwidth for point-to-point communication through MPI against Hadoop
2. How difficult is it to adapt MPI to Hadoop, and what are the minimal extensions to the MPI standard?
A pair of new MPI calls supporting Hadoop data communication specified via key-value pairs (see the sketch below).
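Roughly, the extension can be pictured as follows. The MPI_D_Send / MPI_D_Recv names echo the paper's proposed calls, but the single-process stubs below are simplified illustrations of the key-value pattern, not DataMPI's actual API:

```cpp
// Hypothetical sketch of DataMPI-style key-value communication. The
// MPI_D_Send / MPI_D_Recv names follow the paper's proposal, but these
// single-process stand-ins are illustrations, not the library's real API.
#include <cstdio>
#include <queue>
#include <string>
#include <utility>

using KV = std::pair<std::string, std::string>;
static std::queue<KV> channel;        // stand-in for the MPI transport

// In the bipartite model, O (origin) processes emit key-value pairs and
// the library routes each pair to the A (acceptor) process that owns its
// key (hash partitioning, as in Hadoop's shuffle).
void MPI_D_Send(const KV& kv) { channel.push(kv); }
bool MPI_D_Recv(KV* kv) {
    if (channel.empty()) return false;
    *kv = channel.front();
    channel.pop();
    return true;
}

int main() {
    // Origin side: emit pairs; destinations are chosen by key, not rank.
    MPI_D_Send({"word", "1"});
    MPI_D_Send({"word", "1"});

    // Acceptor side: drain the grouped pairs and reduce.
    KV kv;
    while (MPI_D_Recv(&kv))
        std::printf("%s -> %s\n", kv.first.c_str(), kv.second.c_str());
    return 0;
}
```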
Contributions of the case study:
Abstracting the requirements of the communication model
A dichotomic, dynamic, data-centric bipartite model
Key-value pair based
Novel design of DataMPI – a high-performance communication library
Various benchmarks to prove efficiency and ease of use.
Comparison: DataMPI vs Hadoop
Several representative Big Data benchmarks
WordCount, TeraSort, K-means, Top K, PageRank
Compared for various parameters
Efficiency, fault tolerance, ease of use
Comparisons for TeraSort
Both Hadoop and DataMPI exhibit similar trends
DataMPI shows better results in all cases
Results:
Efficiency: DataMPI speeds up varied Big Data workloads and improves job execution time by 31%-41%.
Fault tolerance: DataMPI supports fault tolerance; evaluations show that DataMPI-FT can attain a 21% improvement over Hadoop.
Scalability: DataMPI achieves scalability as high as Hadoop's, with a 40% performance improvement.
Flexibility: DataMPI is flexible, and the coding complexity of using it is on par with that of traditional Hadoop.
Conclusion:
The efficiency of a model in shared-memory parallel computing depends on the type of program and on making the best use of the underlying hardware's parallel processing features.
Extending MPI to highly computational problems like Big Data mining is much more efficient than the traditional frameworks.
Shared-memory models are easy to implement, but MPI gives the best results for more complex problems.
References
L. Sanchez, J. Fernandez, R. Sotomayor, J. D. Garcia, "A Comparative Evaluation of Parallel Programming Models for Shared-Memory Architectures", IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, 2012, pp. 363-374
M. Sharma, P. Soni, "Comparative Study of Parallel Programming Models to Compute Complex Algorithm", International Journal of Computer Applications, 2014, pp. 174-180
Apache Hadoop, hadoop.apache.org
Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, Zhiwei Xu, "DataMPI: Extending MPI to Hadoop-like Big Data Computing", IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 829-838
Lorin Hochstein, Victor R. Basili, Uzi Vishkin, John Gilbert, “A pilot study to
compare programming effort for two parallel programming models”, The
Journal of Systems and Software, 2008
Questions?
Thank you!!