Lecture 1 - Suraj @ LUMS


Tuesday, September 04, 2006
I hear and I forget,
I see and I remember,
I do and I understand.
-Chinese Proverb
Today
 Course Overview.
 Why Parallel Computing?
 Evolution of Parallel Systems.
CS 524 : High Performance Computing
 Course URL
http://suraj.lums.edu.pk/~cs524a06
 Folder on indus
\\indus\Common\cs524a06
 Website – Check Regularly: Course
announcements, office hours, slides,
resources, policies …
 Course Outline
 Several programming exercises will be given throughout the course. Assignments will use popular programming models for shared memory and message passing, such as OpenMP and MPI (a minimal sketch of each appears below).
 The development environment will be C/C++ on UNIX.
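Since the assignments will use OpenMP and MPI in C on UNIX, a minimal sketch of each model may help fix the ideas. These are illustrative fragments, not course-provided code; the file names are made up, and the usual compiler invocations (gcc -fopenmp, mpicc, mpirun) may differ on the local systems.

/* hello_omp.c -- a minimal OpenMP sketch (hypothetical file name).
   Compile on UNIX with: gcc -fopenmp hello_omp.c -o hello_omp */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread in the team executes the body of the parallel region. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

/* hello_mpi.c -- a minimal MPI sketch (hypothetical file name).
   Compile with: mpicc hello_mpi.c -o hello_mpi
   Run with:     mpirun -np 4 ./hello_mpi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id            */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes    */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut the runtime down        */
    return 0;
}

The contrast is the point: OpenMP spawns threads that share one address space, while MPI starts separate processes that can interact only by exchanging messages.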
Pre-requisites
 Computer Organization & Assembly
Language (CS 223)
 Data Structures & Algorithms (CS 213)
 Senior level standing.
 Operating Systems?
 Five minute rule.
Hunger For More Power!
 Endless quest for more and
more computing power.
 However much computing
power there is, it is never
enough.
Why this need for greater computational power?
 Science, engineering, business, entertainment, etc. are all providing the impetus.
 Scientists – observe, theorize, test through experimentation.
 Engineers – design, test prototypes, build.
HPC offers a new way to do science:
Computation is used to approximate physical systems. Advantages include:
 Playing with simulation parameters to study emergent trends.
 Replaying a particular simulation event.
 Studying systems where no exact theories exist.
Why Turn to Simulation?
When the problem is too…
 Complex
 Large
 Expensive
 Dangerous
Why this need for greater computational power?
 Computer simulations are less expensive to carry out.
 They can simulate phenomena that could not be studied by experimentation, e.g. the evolution of the universe.
Why this need for greater computational power?
 Problems such as:
 Weather prediction.
 Aeronautics (airflow analysis, structural mechanics, engine efficiency, etc.).
 Simulating the world economy.
 Pharmaceuticals (molecular modeling).
 Understanding drug-receptor interactions in the brain.
 Automotive crash simulation.
are all computationally intensive.
 The more knowledge we acquire, the more complex our questions become.
Why this need for greater computational power?
 In 1995, the first full-length computer-animated motion picture, Toy Story, was produced on a parallel system composed of hundreds of Sun workstations.
 Decreased cost.
 Decreased time (several months on several hundred processors).
Why this need for greater computational power?
 Commercial computing has also come to rely on parallel architectures.
 Computer system speed and capacity determine the scale of business that can be handled.
 OLTP (online transaction processing) benchmarks represent the relation between performance and scale of business.
 They rate the performance of a system in terms of its throughput in transactions per minute.
Why this need for greater computational power?
 Vendors supplying database hardware or software offer multiprocessor systems that provide performance substantially greater than their uniprocessor products.
 One solution in the past: make the clock run faster.
 The advance of VLSI technology allowed clock rates to increase and a larger number of components to fit on a chip.
 However, there are limits…
 An electrical signal cannot propagate faster than the speed of light: 30 cm/ns in vacuum and about 20 cm/ns in copper wire or optical fiber.
 10-GHz clock: signal path length 2 cm in total.
 100-GHz clock: 2 mm.
 A 1-THz (1000 GHz) computer will have to be smaller than 100 microns if the signal has to travel from one end to the other and back within a single clock cycle (worked out below).
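The figures above follow from one line of arithmetic: at a signal speed of roughly 20 cm/ns in wire, the total distance a signal can cover in one clock period at frequency f is d = v/f.

\[
d = \frac{v}{f}, \qquad v \approx 20\,\mathrm{cm/ns}
\]
\[
f = 10\,\mathrm{GHz} \Rightarrow d = 2\,\mathrm{cm}, \qquad
f = 100\,\mathrm{GHz} \Rightarrow d = 2\,\mathrm{mm}, \qquad
f = 1\,\mathrm{THz} \Rightarrow d = 0.2\,\mathrm{mm}
\]

At 1 THz the round trip of 0.2 mm leaves only about 100 microns each way.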
Another fundamental problem:
 Heat dissipation
 The faster a computer runs, the more heat it generates.
 High-end Pentium systems: the CPU cooling system is bigger than the CPU itself.
Evolution of Parallel Architecture
 New dimension added to design space:
Number of processors.
 Driven by demand for performance at
acceptable cost.
Evolution of Parallel Architecture
 Advances in hardware capability enable
new application functionality, which
places a greater demand on the
architecture.
 This cycle drives the ongoing design,
engineering and manufacturing effort.
Evolution of Parallel Architecture
 Microprocessor performance has been improving at a rate of about 50% per year.
 A parallel machine of a hundred processors can be viewed as giving applications the computing power that a single processor will have in roughly 10 years' time (see the worked estimate below).
 1000 processors → a 20-year horizon.
 The advantages of using small, inexpensive, mass-produced processors as building blocks for computer systems are clear.
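As a rough check of those horizons (assuming the quoted 50% annual improvement compounds and the parallel machine is used with ideal efficiency), a factor p in processor count corresponds to n years of single-processor growth:

\[
1.5^{\,n} = p \;\Rightarrow\; n = \frac{\ln p}{\ln 1.5}, \qquad
p = 100 \Rightarrow n \approx 11\ \text{years}, \qquad
p = 1000 \Rightarrow n \approx 17\ \text{years}
\]

which is in line with the roughly 10- and 20-year figures on the slide.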
Technology trends
 With technological advances, transistors, gates, etc. have been getting smaller and faster.
 More can fit in the same area.
 Processors are getting faster by making more effective use of an ever larger volume of computing resources.
 Possibilities:
 Place more of the computer system on a chip, including memory and I/O (the building block for parallel architectures: system-on-a-chip).
 Or place multiple processors on a chip (parallel architecture in the single-chip regime).
Microprocessor Design Trends
 Technology determines what is possible.
 Architecture translates the potential of
technology into performance.
 Parallelism is fundamental to
conventional computer architecture.

Current architectural trends are leading to
multiprocessor designs.
Bit level Parallelism
 From 1970 to 1986, advancements were in bit-level parallelism.
 4-bit, 8-bit, 16-bit and so on.
 Doubling the data-path width reduces the number of cycles required to perform an operation (see the sketch below).
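To see why width saves cycles, here is a small illustrative C sketch (the function name is made up) that adds two 32-bit numbers the way an 8-bit data path would have to: four partial additions with carries, where a 32-bit data path needs only one.

#include <stdint.h>
#include <stdio.h>

/* Add two 32-bit values using only 8-bit additions, as a narrow
   (8-bit) data path would: one partial add per byte, carrying between
   steps.  A 32-bit data path does the same work in one addition. */
static uint32_t add32_via_8bit(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {                 /* four 8-bit "cycles"  */
        unsigned sa = (a >> (8 * i)) & 0xFF;      /* i-th byte of a       */
        unsigned sb = (b >> (8 * i)) & 0xFF;      /* i-th byte of b       */
        unsigned s  = sa + sb + carry;            /* 8-bit add with carry */
        result |= (uint32_t)(s & 0xFF) << (8 * i);
        carry   = s >> 8;                         /* carry into next byte */
    }
    return result;
}

int main(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    printf("%08x (expected %08x)\n", add32_via_8bit(a, b), a + b);
    return 0;
}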
Instruction level Parallelism
Mid 1980s to mid 1990s.
 Performing portions of several machine instructions concurrently.
 Pipelining (also a kind of parallelism).
 Fetching multiple instructions at a time and issuing them in parallel to distinct functional units (superscalar); a small illustration follows.
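A rough C illustration of the idea, assuming a pipelined or superscalar core: in the first loop the four accumulations per iteration are independent of one another, so the hardware can keep several in flight at once; in the second, every addition depends on the previous result, so little can overlap. Variable names are illustrative only.

#include <stdio.h>

#define N 1024

int main(void)
{
    double x[N];
    for (int i = 0; i < N; i++) x[i] = i * 0.5;

    /* Four independent accumulators: a superscalar/pipelined core can
       keep several of these additions in flight at once. */
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double independent_sums = s0 + s1 + s2 + s3;

    /* One accumulator: each addition needs the result of the previous
       one, so the chain of dependences limits instruction overlap. */
    double chain = 0;
    for (int i = 0; i < N; i++)
        chain += x[i];

    printf("%f %f\n", independent_sums, chain);
    return 0;
}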
Instruction level Parallelism
However…
 Instruction-level parallelism is worthwhile only if the processor can be supplied with instructions and data fast enough.
 The gap between processor cycle time and memory cycle time has grown wider.
 To satisfy increasing bandwidth requirements, larger and larger caches are placed on chip with the processor.
 Even so, cache misses and control transfers limit how far instruction-level parallelism can be pushed.
 In the mid 1970s, the introduction of vector processors marked the beginning of modern supercomputing.
 They perform operations on sequences of data elements rather than on individual scalar data (a small example follows).
 They offered an advantage of at least one order of magnitude over conventional systems of that time.
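For concreteness, a sketch of the kind of loop vector processors were built for: the same operation applied across a whole sequence of elements (the classic axpy kernel). On a vector machine the body becomes a handful of vector instructions per chunk of elements rather than one scalar instruction per element; it is written here as plain C that a vectorizing compiler may treat the same way.

#include <stddef.h>

/* y[i] = a*x[i] + y[i] for a whole sequence of elements.  A vector
   processor executes this as vector loads, a vector multiply-add, and
   vector stores, covering many elements per instruction instead of one. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}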
 In the late 1980s a new generation of systems came on the market. These were microprocessor-based supercomputers that initially provided about 100 processors, increasing roughly to 1000 by 1990.
 These aggregations of processors are known as massively parallel processors (MPPs).
 Factors behind the emergence of MPPs:
 Increase in performance of standard microprocessors.
 Cost advantage.
 Usage of "off-the-shelf" microprocessors instead of custom processors.
 Fostered by government programs for scalable parallel computing using distributed memory.
 MPPs claimed to equal or surpass the performance of vector multiprocessors.
 Top500
 Lists the sites that have the 500 most powerful installed computer systems.
 LINPACK benchmark
• The most widely used metric of performance on numerical applications.
• A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
 Top500 (updated twice a year since June 1993)
 In the first Top500 list there were already 156 MPP and SIMD systems present (around one third).
Some memory related issues
 The time to access memory has not kept pace with CPU clock speeds.
 SRAM
 Each bit is stored in a latch made up of transistors.
 Faster than DRAM, but less dense and requires more power.
 DRAM
 Each bit of memory is stored as a charge on a capacitor.
 A 1-GHz CPU will execute 60 instructions before a typical 60-ns DRAM can return a single byte (see the arithmetic below).
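The 60-instruction figure is simply the ratio of the two times, assuming the CPU retires about one instruction per cycle:

\[
t_{\text{cycle}} = \frac{1}{1\,\mathrm{GHz}} = 1\,\mathrm{ns}, \qquad
\frac{t_{\text{DRAM}}}{t_{\text{cycle}}} = \frac{60\,\mathrm{ns}}{1\,\mathrm{ns}} = 60\ \text{cycles}
\]

so roughly 60 instruction slots are lost on every access that must go all the way to DRAM.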
Some memory related issues
 Hierarchy
 Cache memories
 Temporal locality
 Cache lines (64, 128, 256 bytes); see the locality sketch below.
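A small C sketch of what the cache hierarchy rewards (array size and names are illustrative): both loops touch every element of the same matrix, but the first walks memory in storage order, so each 64-256 byte cache line it pulls in is fully used, while the second strides a whole row between accesses and wastes most of every line.

#include <stdio.h>

#define N 1024

static double a[N][N];   /* stored row by row (row-major) in C */

int main(void)
{
    double sum = 0.0;

    /* Cache-friendly: consecutive accesses fall in the same cache line,
       so every byte of a fetched line is used (spatial locality). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-unfriendly: successive accesses are N*sizeof(double) bytes
       apart, so each one touches a different line and most of it is
       wasted. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}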
Parallel Architectures: Memory Parallelism
 One way to increase performance is to
replicate computers.
 The major choice is between shared memory and distributed memory.
Memory Parallelism
 In the mid 1980s, when 32-bit microprocessors were first introduced, computers containing multiple microprocessors sharing a common memory became prevalent.
 In most of these designs all processors plug into a common bus.
 However, only a small number of processors can be supported by a bus.
UMA bus based SMP architecture
 If the bus is busy when a CPU wants to read or write memory, the CPU waits for the bus to become idle.
 Bus contention is manageable for a small number of processors only.
 Otherwise the system is limited by the bandwidth of the bus, and most of the CPUs will be idle most of the time.
UMA bus based SMP architecture
 One way to alleviate this problem is to add a cache to each CPU.
 If most reads can be satisfied from the cache, there is less bus traffic and the system can support more CPUs.
 A single bus limits a UMA multiprocessor to about 16-32 CPUs.
SMP
 SMP (symmetric multiprocessor)
 A shared-memory multiprocessor where the cost of accessing a memory location is the same for all processors.