Parallel Computing - Nanjing University

Parallel Programming
Yang Xianchun
Department of Computer Science and Technology
Nanjing University
Introduction
Agenda
– About the Course
– Evolution of Computing Technology
– Grand Challenge Problem Examples
– Motivations for Parallelism
– Parallel Computation
– A Road Map of Topics
– Parallel Computing Terminology
– An Illustrative Example
• Control-Parallel Approach
• Data-Parallel Approach
• Performance Analysis
About the course
• Description
– An introductory course on parallel programming concepts and techniques
– No prior parallel computation background is
necessary
• Prerequisites
– Working knowledge of computer systems
– Adequate programming skill in C or C++
• Textbook
– Harry Jordan and Gita Alaghband, “Fundamentals
of Parallel Processing,” Prentice Hall, 2003.
About the course (cont.)
• Topics
– Introduction
– Parallel architectures
– Data parallel computing
• Fortran 90 / 95
– Shared memory computing
• Pthreads
• OpenMP
– Message passing programming
• PVM
• MPI
– Applications
• Sorting
• Numerical algorithms
• Graph algorithms
– Interconnection network
– Performance
– Models
About the course (cont.)
• Organization
– Lectures
– Programming projects
• using parallel programming tools PVM and MPI on
Linux / Unix machines
– Homeworks
– Class Tests and Final Exam (Open book)
Evolution of Computing Technology
• Hardware
– Vacuum tubes, relay memory
– Discrete transistors, core memory
– Integrated circuits, pipelined CPU
– VLSI microprocessors, solid state memory
• Languages and Software
– Machine / Assembly languages
– Algol / Fortran with compilers, batch processing OS
– C, multiprocessing, timesharing OS
– C++ / Java, parallelizing compilers, distributed OS
Evolution of Computing Technology
(cont.)
• The Driving Force Behind the Technology
Advances
– The ever-increasing demands on computing power
• Scientific computing (e.g. Large-scale simulations)
• Commercial computing (e.g. Databases)
• 3D graphics and realistic animation
• Multimedia internet applications
Grand Challenge Problem Examples
• Simulations of the earth’s climate
• Resolution: 10 kilometers
• Period: 1 year
• Ocean and biosphere models: simple
– Total requirement: 10^16 floating-point operations
– With a supercomputer capable of 10 Giga FLOPS, it would take 10^6 seconds (over 10 days) to execute
• Real-time processing of 3D graphics
• Number of data elements: 10^9 (1024 in each dimension)
• Number of operations per element: 200
• Update rate: 30 times per second
– Total requirement: 6.4 × 10^12 operations per second
– With processors capable of 10 Giga IOPS, we need 640 of them
Motivations for Parallelism
• Conventional computers are sequential
• a single CPU
• a single stream of instructions
• executing one instruction at a time (not completely true)
– Single-CPU processor has a performance limit
• Moore’s Law can’t go on forever
• How to increase computing power?
• Better processor design
– More transistors, larger caches, advanced architectures
• Better system design
– Faster / larger memory, faster buses, better OS
• Scale up the computer (parallelism)
– Replicate hardware at component or whole computer levels
– A parallel processor's power is virtually unlimited
• 10 processors @ 500 Mega FLOPS each = 5 Giga FLOPS
• 100 processors @ 500 Mega FLOPS each = 50 Giga FLOPS
• 1,000 processors @ 500 Mega FLOPS each = 500 Giga FLOPS
• ...
Motivations for Parallelism (cont.)
• Additional Motivations
– Solving bigger problems
– Lowering cost
Parallel Computation
• Parallel computation means
– multiple CPUs
– single or multiple streams of instructions
– executing multiple instructions at a time
• Typical process
– Breaking a problem into pieces and arranging for all
pieces to be solved simultaneously on a multi-CPU
computer system
• Requirements
– Parallel algorithms
• only parallelizable applications can benefit from parallel
implementation
– Parallel languages and/or constructs
• expressing parallelism
– Parallel architectures
• provide hardware support
A Road Map of Topics
• Parallel Architectures / Machine Models
– Vector / SIMD / MIMD
– design issues
– PRAM
– LogP
• Programming Models
– data parallel
– shared memory
– message passing
• Parallel Algorithms
– master-slave
– divide-and-conquer
– control vs. data parallelism
– complexity analysis
• Programming Languages
– Fortran 90 / 95
– Pthreads, OpenMP
– PVM, MPI
• Applications
– scientific computations
– data-intensive problems
– performance measurement
Parallel Computing Terminology (1)
• Hardware
– Multicomputers
• tightly networked, multiple uniform computers
– Multiprocessors
• tightly networked, multiple uniform processors with
additional memory units
– Supercomputers
• general purpose and high-performance, nowadays
almost always parallel
– Clusters
• Loosely networked commodity computers
Parallel Computing Terminology (2)
• Programming
– Pipelining
• divide computation into stages (segments)
• assign separate functional units to each stage
– Data Parallelism
• multiple (uniform) functional units
• apply same operation simultaneously to different
elements of data set
– Control Parallelism
• multiple (specialized) functional units
• apply distinct operations to data elements concurrently
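As a minimal illustration of the last two notions (a sketch, not from the lecture), the C / OpenMP program below first applies the same operation to every element of an array (data parallelism), then runs two distinct operations concurrently in separate sections (control parallelism). The array contents and the two operations are invented for the example:

    #include <omp.h>
    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], sum = 0.0, max;
        for (int i = 0; i < N; i++) a[i] = i;
        max = a[0];

        /* Data parallelism: the same operation applied simultaneously
           to different elements of the data set. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        /* Control parallelism: distinct operations applied to the data
           concurrently, one per section. */
        #pragma omp parallel sections
        {
            #pragma omp section
            { for (int i = 0; i < N; i++) sum += a[i]; }                /* one unit sums      */
            #pragma omp section
            { for (int i = 1; i < N; i++) if (a[i] > max) max = a[i]; } /* another finds max  */
        }
        printf("b[1]=%g sum=%g max=%g\n", b[1], sum, max);
        return 0;
    }

Compiled with, e.g., gcc -fopenmp, the two sections may run on different threads; without OpenMP the pragmas are ignored and the program simply runs sequentially.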
Parallel Computing Terminology (3)
• Performance
– Throughput
• number of results per unit time
– Speedup
• S = (time needed by the most efficient sequential algorithm) / (time needed on a pipelined / parallel machine)
– Scalability
• An algorithm is scalable if the available parallelism
increases at least linearly with problem size
• An architecture is scalable if it continues to deliver the same performance per processor as the number of processors and the problem size are both increased
• Data-parallel algorithms tend to be more scalable than
control-parallel algorithms
An Illustrative Example
• Problem
– Find all primes less than or equal to some positive
integer n
• Method (the sieve algorithm)
– Write down all the integers from 1 to n
– Cross out from the list all multiples of 2, 3, 5, 7, … up to sqrt(n)
An Illustrative Example (cont.)
• Sequential implementation (see the C sketch below)
• Boolean array representing the integers from 1 to n
• Buffer for holding current prime
• Index for loop iterating through the array
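One possible C realization of this sequential sieve (a sketch, not the lecture's code; striking begins at each prime's square, the usual optimization):

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        long n = (argc > 1) ? atol(argv[1]) : 1000;
        char *marked = calloc(n + 1, 1);       /* boolean array for 0..n          */
        long prime = 2;                        /* buffer holding the current prime */
        while (prime * prime <= n) {
            for (long i = prime * prime; i <= n; i += prime)
                marked[i] = 1;                 /* index iterating through the array */
            do prime++; while (marked[prime]); /* next unmarked cell is the next prime */
        }
        long count = 0;
        for (long i = 2; i <= n; i++)
            if (!marked[i]) count++;
        printf("%ld primes <= %ld\n", count, n);  /* 168 for n = 1000 */
        free(marked);
        return 0;
    }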
An Illustrative Example (cont.)
• Control-Parallel Approach
• Different processors strike out multiples of different
primes
• The boolean array and the current prime are shared; each processor has its own private copy of the loop index (see the Pthreads sketch below)
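A Pthreads sketch of this control-parallel scheme (one way to realize it, not the lecture's code). The mutex around the claim of the next prime is an added choice that rules out Race 1 described on the next slide; Race 2 can still occur, and the unprotected writes to the shared array are deliberate, mirroring the slide's design:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000
    #define NTHREADS 4

    static char marked[N + 1];   /* shared boolean array                 */
    static long current = 1;     /* shared current-prime buffer          */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread repeatedly claims the next unmarked cell and strikes
       out its multiples with a private loop index. A thread may claim a
       composite (e.g. 4) before another thread has marked it (Race 2);
       this only wastes work, the final array is still correct. */
    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            long p = current + 1;
            while (p * p <= N && marked[p]) p++;  /* find next unmarked cell */
            current = p;
            pthread_mutex_unlock(&lock);
            if (p * p > N) break;                 /* no sieving candidates left */
            for (long i = p * p; i <= N; i += p)  /* private loop index         */
                marked[i] = 1;                    /* unsynchronized, benign:    */
        }                                         /* cells are only ever set to 1 */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        int count = 0;
        for (long i = 2; i <= N; i++)
            if (!marked[i]) count++;
        printf("%d primes <= %d\n", count, N);   /* 168 for N = 1000 */
        return 0;
    }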
An Illustrative Example (cont.)
• Control-Parallel Approach (cont.)
– Potential Problem — Race Conditions
• Race 1: More than one processor may sieve multiples of
the same prime
– a processor reads the current prime p and goes off to sieve its multiples; later it finds a new prime and updates the current prime buffer
– before the buffer is updated, another processor reads p and goes off to sieve the same multiples
• Race 2: A processor may sieve multiples of a composite
number
– processor A starts marking multiples of 2
– before it can mark any cells, processor B finds an unmarked cell, 3, and starts marking multiples of 3
– then processor C comes in, finds the next unmarked cell, 4, and starts marking multiples of 4
• These two race conditions do not cause incorrect results, but they do cause inefficiency
An Illustrative Example (cont.)
• Data-Parallel Approach
• Each processor is responsible for a unique range of the integers and does all the striking in that range
• Processor 1 is responsible for broadcasting its findings to the other processors (see the MPI sketch below)
• Potential Problem
– If ⌈n/p⌉ < sqrt(n), more than one processor needs to broadcast its findings
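A compact MPI sketch of the data-parallel scheme (an illustration, not the lecture's code). It assumes a block decomposition of 2..n and a first block comfortably larger than sqrt(n), so process 0 holds every sieving prime and does all the broadcasting:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        long n = (argc > 1) ? atol(argv[1]) : 1000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Block decomposition: each process owns a contiguous range of 2..n. */
        long low  = 2 + rank * (n - 1) / p;
        long high = 1 + (rank + 1) * (n - 1) / p;
        long size = high - low + 1;
        char *marked = calloc(size, 1);

        long prime = 2;
        while (prime * prime <= n) {
            long first;                    /* index of prime's first multiple here */
            if (prime * prime > low)       first = prime * prime - low;
            else if (low % prime == 0)     first = 0;
            else                           first = prime - low % prime;
            for (long i = first; i < size; i += prime)
                marked[i] = 1;             /* strike only within my own range      */
            if (rank == 0) {               /* find the next prime locally...       */
                long i = prime - low + 1;  /* (assumes it lies in block 0)         */
                while (marked[i]) i++;
                prime = i + low;
            }
            MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD); /* ...and tell all */
        }

        long local = 0, total;
        for (long i = 0; i < size; i++)
            if (!marked[i]) local++;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("%ld primes <= %ld\n", total, n);
        free(marked);
        MPI_Finalize();
        return 0;
    }

Built with mpicc and run with, e.g., mpirun -np 4, each process strikes roughly n/p cells per prime, which is where the (n/p)/k terms in the analysis below come from.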
An Illustrative Example (cont.)
• Performance Analysis
– Sequential Algorithm
• Cost of sieving multiples of 2: ⌈(n−3)/2⌉
• Cost of sieving multiples of 3: ⌈(n−8)/3⌉
• Cost of sieving multiples of 5: ⌈(n−24)/5⌉
• ...
• For n=1,000, T=1,411
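As a sanity check (not part of the lecture), a few lines of C that charge one time unit per strike reproduce this total:

    #include <stdio.h>

    int main(void) {
        enum { N = 1000 };
        static char marked[N + 1];
        long total = 0;
        for (int p = 2; p * p <= N; p++) {
            if (marked[p]) continue;              /* p is the next sieving prime */
            for (int i = p * p; i <= N; i += p) { /* strike its multiples        */
                marked[i] = 1;
                total++;                          /* one time unit per strike    */
            }
        }
        printf("T = %ld\n", total);               /* prints T = 1411             */
        return 0;
    }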
– Control-Parallel Algorithm
• For p=2, n=1,000, T=706
• For p=3, n=1,000, T=499
• For p=4, n=1,000, T=499
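Taken with the sequential T=1,411, these times give speedups of about 1411/706 ≈ 2.0 for p=2 and 1411/499 ≈ 2.8 for p=3. The fourth processor buys nothing: striking out the multiples of 2 is a single task costing ⌈(1000−3)/2⌉ = 499 time units, so no assignment of whole primes to processors can finish in fewer than 499 steps.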
An Illustrative Example (cont.)
• Performance Analysis (cont.)
– Data-Parallel Algorithm
• Cost of broadcasting: k(p−1)λ, where k is the number of primes up to sqrt(n) and λ is the time to communicate one number to another processor
• Cost of striking: ⌈(n/p)/2⌉ + ⌈(n/p)/3⌉ + … + ⌈(n/p)/p_k⌉, where p_k is the largest prime up to sqrt(n)
• For p=2, n=1,000, T ≈ 781
• For p=3, n=1,000, T ≈ 471
• For p=4, n=1,000, T ≈ 337