PARALLEL COMPUTERS - 3
3/12/2013
Computer Engg, IIT(BHU)
Computing Technology
Hardware
➢ Vacuum tubes, relay memory
➢ Discrete transistors, core memory
➢ Integrated circuits, pipelined CPU
➢ VLSI microprocessors, solid state memory
Languages and Software
➢ Machine / Assembly languages
➢ Algol / Fortran with compilers, batch processing OS
➢ C, multiprocessing, timesharing OS
➢ C++ / Java, parallelizing compilers, distributed OS
Computing Technology
The Driving Force Behind the Technology Advances
➢ The ever-increasing demands on computing power
✔ Scientific computing (e.g. large-scale simulations)
✔ Commercial computing (e.g. databases)
✔ 3D graphics and realistic animation
✔ Multimedia internet applications
Challenge Problem
Simulations of the earth's climate
• Resolution: 10 kilometers
• Period: 1 year
• Ocean and biosphere models: simple
➢ Total requirement: 10¹⁶ floating-point operations
➢ With a supercomputer capable of 10 GigaFLOPS, it will take about 10 days to execute
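(As a rough check, reading the 10¹⁶ figure as a total operation count: 10¹⁶ operations / 10¹⁰ operations per second = 10⁶ seconds ≈ 11.6 days, i.e. on the order of 10 days.)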
Real-time processing of 3D graphics
• Number of data elements: 10⁹ (1024 in each dimension)
• Number of operations per element: 200
• Update rate: 30 times per second
➢ Total requirement: 6.4 × 10¹² operations per second
➢ With processors capable of 10 Giga IOPS, we need 640 of them
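(Rough check: 1024³ ≈ 1.07 × 10⁹ elements × 200 operations per element × 30 updates per second ≈ 6.4 × 10¹² operations per second; at 10¹⁰ operations per second per processor, that requires 640 processors.)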
Motivations for Parallelism
Conventional computers are sequential
• a single CPU
• a single stream of instructions
• executing one instruction at a time (not completely true)
➢ A single-CPU processor has a performance limit
➢ Moore's Law can't go on forever
Motivation for Parallelism
How to increase computing power?
➢ Better processor design
• More transistors, larger caches, advanced architectures
➢ Better system design
• Faster / larger memory, faster buses, better OS
➢ Scale up the computer (parallelism)
• Replicate hardware at the component or whole-computer level
A parallel processor's power is virtually unlimited
➢ 10 processors @ 500 MegaFLOPS each = 5 GigaFLOPS
➢ 100 processors @ 500 MegaFLOPS each = 50 GigaFLOPS
➢ 1,000 processors @ 500 MegaFLOPS each = 500 GigaFLOPS
Motivation for Parallelism
Additional Motivations
➢ Solving bigger problems
➢ Lowering cost
Terminology
Hardware
➢ Multicomputers
• tightly networked, multiple uniform computers
➢ Multiprocessors
• tightly networked, multiple uniform processors with additional memory units
➢ Supercomputers
• general purpose and high-performance, nowadays almost always parallel
➢ Clusters
• loosely networked commodity computers
Terminology
Programming
➢ Pipelining
• divide computation into stages (segments)
• assign separate functional units to each stage
➢ Data Parallelism
• multiple (uniform) functional units
• apply the same operation simultaneously to different elements of a data set
➢ Control Parallelism
• multiple (specialized) functional units
• apply distinct operations to data elements concurrently
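To make the last two terms concrete, here is a minimal C sketch assuming a compiler with OpenMP support (the slides do not prescribe any language or API, and the array and variable names are invented for illustration). The first loop applies the same operation to every element of a data set (data parallelism); the sections construct runs two different operations concurrently (control parallelism).

/* compile with: gcc -fopenmp parallelism_demo.c */
#include <stdio.h>

#define N 1000000
static double a[N];

int main(void) {
    double sum = 0.0, maxval = 0.0;
    for (int i = 0; i < N; i++) a[i] = i * 0.001;   /* sample data */

    /* Data parallelism: every thread applies the SAME operation
       (doubling) to a different part of the data set. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2.0;

    /* Control parallelism: DIFFERENT operations (a sum and a maximum)
       run concurrently, one per section. */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < N; i++) sum += a[i];

        #pragma omp section
        for (int i = 0; i < N; i++) if (a[i] > maxval) maxval = a[i];
    }

    printf("sum = %f  max = %f\n", sum, maxval);
    return 0;
}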
Terminology
Performance
➢ Throughput
• number of results per unit time
➢ Speedup
        time needed for the most efficient sequential algorithm
  S = —————————————————————————————————————————————————————————
        time needed on a pipelined / parallel machine
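For example, using the sieve analysis later in these slides: the sequential cost for n = 1,000 is T = 1,411 and the control-parallel cost with p = 2 processors is T = 706, giving a speedup of S = 1411 / 706 ≈ 2.0.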
Terminology
➢ Scalability
• An algorithm is scalable if the available parallelism increases at least linearly with problem size
• An architecture is scalable if it gives the same performance per processor as the number of processors and the size of the problem are both increased
• Data-parallel algorithms tend to be more scalable than control-parallel algorithms
Example
Problem
➢ Find all primes less than or equal to some positive integer n
Method (the sieve algorithm)
➢ Write down all the integers from 1 to n
➢ Cross out from the list all multiples of 2, 3, 5, 7, … up to sqrt(n)
Example
Sequential Implementation
➢ Boolean array representing the integers from 1 to n
➢ Buffer for holding the current prime
➢ Index for the loop iterating through the array
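A minimal sequential C sketch along these lines (the slide specifies only the three data structures above; the variable names and the choice of n are mine):

/* compile: gcc sieve_seq.c */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;                    /* find all primes <= n                */
    char *marked = calloc(n + 1, 1); /* boolean array for the integers 0..n */
    int prime = 2;                   /* buffer holding the current prime    */
    int i;                           /* index for the loop over the array   */

    while (prime * prime <= n) {
        /* strike out the multiples of the current prime, from prime^2 up */
        for (i = prime * prime; i <= n; i += prime)
            marked[i] = 1;
        /* the next unmarked integer is the next prime */
        do { prime++; } while (marked[prime]);
    }

    for (i = 2; i <= n; i++)
        if (!marked[i]) printf("%d ", i);
    printf("\n");
    free(marked);
    return 0;
}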
Example
Control-Parallel Approach
➢ Different processors strike out multiples of different primes
➢ The boolean array and the current prime are shared; each processor has its own private copy of the loop index
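One possible shared-memory realization of this idea, sketched in C with OpenMP (the slides do not name an API, so treat this as an illustration rather than the prescribed implementation): the boolean array and the current-prime variable are shared, each thread keeps private working variables, and different threads strike out multiples of different primes.

/* compile with: gcc -fopenmp sieve_control_parallel.c */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    char *marked = calloc(n + 1, 1);   /* shared boolean array for 0..n   */
    int prime = 2;                     /* shared buffer: current prime    */

    #pragma omp parallel default(shared)
    {
        int my_prime, i;               /* private work item and loop index */
        for (;;) {
            /* pick up the next unstruck candidate and advance the shared
               current prime; only one thread at a time does this         */
            #pragma omp critical
            {
                while (prime * prime <= n && marked[prime]) prime++;
                my_prime = prime;
                if (prime * prime <= n) prime++;
            }
            if (my_prime * my_prime > n) break;
            /* strike multiples of my_prime; if a composite candidate slips
               through (its own strike was still pending on another thread),
               the extra pass is redundant but harmless, since every write
               stores the same value 1                                     */
            for (i = my_prime * my_prime; i <= n; i += my_prime)
                marked[i] = 1;
        }
    }

    int count = 0;
    for (int i = 2; i <= n; i++) if (!marked[i]) count++;
    printf("%d primes found up to %d\n", count, n);
    free(marked);
    return 0;
}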
Example
Data-Parallel Approach
➢ Each processor is responsible for a unique range of the integers and does all the striking in that range
➢ Processor 1 is responsible for broadcasting its findings to the other processors
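A corresponding data-parallel sketch, written here in C with MPI so that the broadcast step is explicit (again an illustration, not the slides' prescribed code; the block decomposition and names are mine, and it assumes n/p ≥ sqrt(n) so that process 0's block contains every prime up to sqrt(n)):

/* compile: mpicc sieve_data_parallel.c    run: mpirun -np 4 ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int n = 1000, rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* this process owns the unique range low..high of the integers 2..n */
    long low  = 2 + (long)rank * (n - 1) / p;
    long high = 1 + (long)(rank + 1) * (n - 1) / p;
    long size = high - low + 1;
    char *marked = calloc(size, 1);

    long prime = 2;
    while (prime * prime <= n) {
        /* strike multiples of the current prime inside my own range */
        long first;
        if (prime * prime > low) first = prime * prime;
        else first = low + ((prime - low % prime) % prime);
        for (long i = first; i <= high; i += prime)
            marked[i - low] = 1;

        /* process 0 owns the low range, finds the next prime there,
           and broadcasts it to the other processes                   */
        if (rank == 0) {
            do { prime++; } while (marked[prime - low]);
        }
        MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    }

    /* count the unmarked (prime) integers in each range and combine */
    long local = 0, total = 0;
    for (long i = 0; i < size; i++) if (!marked[i]) local++;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("%ld primes found up to %d\n", total, n);

    free(marked);
    MPI_Finalize();
    return 0;
}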
Example
Performance Analysis
➢ Sequential Algorithm
• Cost of sieving multiples of 2: [(n-3)/2]
• Cost of sieving multiples of 3: [(n-8)/3]
• Cost of sieving multiples of 5: [(n-24)/5]
• ... (in general, [(n - k² + 1)/k] strikes for each prime k up to sqrt(n))
• For n = 1,000, T = 1,411
Example
Control-Parallel Algorithm
• For p = 2, n = 1,000, T = 706
• For p = 3, n = 1,000, T = 499
• For p = 4, n = 1,000, T = 499
• No further gain beyond p = 3: the runtime is bounded by the largest single task, striking the multiples of 2, which by itself costs about [(n-3)/2] = 498
Example
Data-Parallel Algorithm
• Cost of broadcasting: k(p-1)
• Cost of striking: [(n/p)/2] + [(n/p)/3] + … + [(n/p)/k]
• For p = 2, n = 1,000, T ≈ 781
• For p = 3, n = 1,000, T ≈ 471
• For p = 4, n = 1,000, T ≈ 337