Lecture 1 - Suraj @ LUMS
Tuesday, September 04, 2006
I hear and I forget,
I see and I remember,
I do and I understand.
-Chinese Proverb
Today
Course Overview.
Why Parallel Computing?
Evolution of Parallel Systems.
CS 524 : High Performance Computing
Course URL
http://suraj.lums.edu.pk/~cs524a06
Folder on indus
\\indus\Common\cs524a06
Website – Check Regularly: Course
announcements, office hours, slides,
resources, policies …
Course Outline
Several programming exercises will be
given throughout the course. Assignments
will use popular programming models for
shared memory and message passing,
such as OpenMP and MPI.
The development environment will be
C/C++ on UNIX.
Pre-requisites
Computer Organization & Assembly
Language (CS 223)
Data Structures & Algorithms (CS 213)
Senior level standing.
Operating Systems?
Five minute rule.
Hunger For More Power!
Endless quest for more and
more computing power.
However much computing
power there is, it is never
enough.
Why this need for greater
computational power?
Science, engineering, business,
entertainment: all are providing the
impetus.
Scientists – observe, theorize, test
through experimentation.
Engineers – design, test prototypes, build.
HPC offers a new way to do
science:
Computation is used to approximate physical
systems. Advantages include:
Varying simulation parameters to study
emergent trends
Replaying a particular simulation event
Studying systems for which no exact theories exist
Why Turn to Simulation?
When the problem is too ...
Complex
Large
Expensive
Dangerous
Why this need for greater
computational power?
Less expensive to carry out computer
simulations.
Able to simulate phenomena that cannot
be studied by experimentation, e.g.
the evolution of the universe.
Why this need for greater
computational power?
Problems such as:
Weather prediction.
Aeronautics (airflow analysis, structural
mechanics, engine efficiency, etc.).
Simulating the world economy.
Pharmaceuticals (molecular modeling).
Understanding drug-receptor interactions in
the brain.
Automotive crash simulation.
are all computationally intensive.
The more knowledge we acquire, the more complex
our questions become.
Why this need for greater
computational power?
In 1995, the first full-length computer-animated
motion picture, Toy Story, was
produced on a parallel system composed
of hundreds of Sun workstations.
Decreased cost.
Decreased time (several months on several
hundred processors).
Why this need for greater
computational power?
Commercial computing has also come to
rely on parallel architectures.
Computer system speed and capacity
translate directly into the scale of business
that can be supported.
OLTP (online transaction processing)
benchmarks represent the relation between
performance and scale of business.
They rate the performance of a system in
terms of its throughput in transactions per
minute.
Why this need for greater
computational power?
Vendors supplying database hardware or
software offer multiprocessor systems
that provide performance substantially
greater than uniprocessor products.
One solution in the past: Make the clock
run faster.
Advances in VLSI technology allowed
clock rates to increase and a larger number
of components to fit on a chip.
However, there are limits…
Electrical signals cannot propagate faster
than the speed of light: 30 cm/ns in
vacuum and about 20 cm/ns in copper wire or
optical fiber.
10-GHz clock - signal path length of 2 cm in total
100-GHz clock - 2 mm
A 1-THz (1000-GHz) computer would have to be
smaller than 100 microns if the signal has to
travel from one end to the other and back
within a single clock cycle.
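As a rough check of these figures (a sketch of my own, not part of the original slides), the small C program below computes how far a signal travelling at the quoted 20 cm/ns can get within one clock cycle, treating the 10-GHz and 100-GHz figures as one-way distances and the 1-THz figure as a round trip:

#include <stdio.h>

int main(void)
{
    /* Signal speed in copper/fiber quoted on the slide: ~20 cm per ns. */
    const double v_cm_per_ns = 20.0;
    const double freqs_ghz[] = { 10.0, 100.0, 1000.0 };

    for (int i = 0; i < 3; i++) {
        double period_ns = 1.0 / freqs_ghz[i];          /* one clock cycle */
        double one_way_cm = v_cm_per_ns * period_ns;
        printf("%6.0f GHz: one-way %.4f cm, round trip %.4f cm\n",
               freqs_ghz[i], one_way_cm, one_way_cm / 2.0);
    }
    /* Prints 2 cm and 2 mm as one-way distances, and 0.01 cm (100 microns)
       as the 1-THz round-trip distance, matching the figures above. */
    return 0;
}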
Another fundamental problem:
Heat dissipation
The faster a computer runs, the more heat it
generates.
High-end Pentium systems: the CPU cooling
system is bigger than the CPU itself.
Evolution of Parallel Architecture
New dimension added to design space:
Number of processors.
Driven by demand for performance at
acceptable cost.
Evolution of Parallel Architecture
Advances in hardware capability enable
new application functionality, which
places a greater demand on the
architecture.
This cycle drives the ongoing design,
engineering and manufacturing effort.
Evolution of Parallel Architecture
Microprocessor performance has been
improving at a rate of about 50% per year.
A parallel machine of a hundred processors can
therefore be viewed as giving applications the
computing power that a single processor will
offer in about 10 years' time;
1000 processors push that horizon to roughly 20 years.
The advantages of using small, inexpensive,
mass-produced processors as building blocks
for computer systems are clear.
Technology trends
With technological advances, transistors, gates,
etc. have been getting smaller and faster,
so more of them fit in the same area.
Processors are getting faster by making more
effective use of an ever larger volume of
computing resources.
Possibilities:
Place more of the computer system on the chip,
including memory and I/O (system-on-a-chip,
a building block for parallel architectures).
Or place multiple processors on the chip (parallel
architecture in the single-chip regime).
Microprocessor Design Trends
Technology determines what is possible.
Architecture translates the potential of
technology into performance.
Parallelism is fundamental to
conventional computer architecture.
Current architectural trends are leading to
multiprocessor designs.
Bit level Parallelism
From 1970 to 1986, advances came from bit-level parallelism:
4-bit, 8-bit, 16-bit word widths and so on.
Doubling the data path width reduces the number
of cycles required to perform an operation.
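To illustrate the point (an illustrative sketch of my own, not from the lecture): adding two 32-bit numbers on an 8-bit data path must be broken into four byte-wide steps with carry propagation, whereas a 32-bit data path does it in a single operation.

#include <stdint.h>
#include <stdio.h>

/* 32-bit addition emulated as four 8-bit "limb" additions. */
static uint32_t add32_on_8bit_path(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {                 /* one step per byte */
        unsigned pa = (a >> (8 * i)) & 0xFF;
        unsigned pb = (b >> (8 * i)) & 0xFF;
        unsigned sum = pa + pb + carry;
        carry = sum >> 8;
        result |= (uint32_t)(sum & 0xFF) << (8 * i);
    }
    return result;
}

int main(void)
{
    uint32_t a = 123456789u, b = 987654321u;
    printf("8-bit path : %u\n", add32_on_8bit_path(a, b));
    printf("32-bit path: %u\n", a + b);           /* one machine add */
    return 0;
}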
Instruction level Parallelism
Mid-1980s to mid-1990s.
Performing portions of several machine
instructions concurrently.
Pipelining (also a kind of parallelism).
Fetching multiple instructions at a time and
issuing them in parallel to distinct functional
units (superscalar).
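A minimal sketch (my own, not from the lecture) of how instruction-level parallelism shows up in ordinary code: the first loop forms one long dependency chain, so a pipelined or superscalar processor cannot overlap its additions, while the second splits the work across four independent accumulators that can execute concurrently.

#include <stdio.h>

#define N 1000000
static double x[N];

double sum_chained(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += x[i];                     /* each add waits for the previous one */
    return s;
}

double sum_unrolled(void)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {   /* four independent add chains */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = 1.0;
    printf("%f %f\n", sum_chained(), sum_unrolled());
    return 0;
}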
Instruction level Parallelism
However…
Instruction-level parallelism is worthwhile only
if the processor can be supplied with instructions
and data fast enough.
The gap between processor cycle time and memory
cycle time has grown wider.
To satisfy increasing bandwidth requirements,
larger and larger caches are placed on chip
with the processor.
Cache misses and control transfers (branches)
limit how much parallelism can actually be exploited.
Limits
In the mid-1970s, the introduction of vector
processors marked the beginning of modern
supercomputing.
Vector processors perform operations on sequences
of data elements rather than on individual scalar data.
They offered an advantage of at least one order of
magnitude over conventional systems of that
time.
In the late 1980s a new generation of systems
came on the market. These were
microprocessor-based supercomputers that
initially provided about 100 processors and
grew to roughly 1000 by 1990.
These aggregations of processors are known
as massively parallel processors (MPPs).
Factors behind emergence of MPPs
Increase in performance of standard
microprocessors
Cost advantage
Use of “off-the-shelf” microprocessors
instead of custom processors
Fostered by government programs for
scalable parallel computing using distributed
memory.
MPPs were claimed to equal or surpass the performance
of vector multiprocessors.
Top500
Lists the sites that have the 500 most powerful installed
computer systems.
LINPACK benchmark
• Most widely used metric of performance on numerical
applications
• Collection of Fortran subroutines that analyze and solve linear
equations and linear least squares problems
The Top500 list has been updated twice a year
since June 1993.
The first Top500 list already contained 156
MPP and SIMD systems (around one third).
Some memory related issues
Time to access memory has not kept pace with
CPU clock speeds.
SRAM
Each bit is stored in a latch made up of transistors
Faster than DRAM, but is less dense and requires
greater power
DRAM
Each bit of memory is stored as a charge on a capacitor
A 1-GHz CPU (one instruction per 1-ns cycle) can execute
about 60 instructions in the time a typical 60-ns DRAM
takes to return a single byte.
Some memory related issues
Hierarchy
Cache memories
Temporal locality
Cache lines (64, 128, 256 bytes)
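As a concrete illustration of cache lines and locality (a sketch of my own, not part of the slides), the two loops below touch the same matrix, but the row-wise walk follows consecutive addresses and reuses every cache line it fetches, while the column-wise walk jumps a full row ahead on each access and typically runs several times slower:

#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];            /* 32 MB matrix, stored row by row in C */

static double walk_rows(void)     /* consecutive addresses: cache friendly */
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

static double walk_cols(void)     /* stride of N doubles: poor cache-line reuse */
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    clock_t t0 = clock();
    double s1 = walk_rows();
    clock_t t1 = clock();
    double s2 = walk_cols();
    clock_t t2 = clock();
    printf("row-wise %.3f s, column-wise %.3f s (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}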
Parallel Architectures: Memory Parallelism
One way to increase performance is to
replicate computers.
The major choice is between shared memory
and distributed memory.
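As a preview of the distributed-memory side (a minimal sketch assuming the standard MPI interface mentioned in the course outline; not part of these slides), each process runs with its own private memory and learns only its rank and the total number of processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

With an MPI installation this would typically be compiled with mpicc and launched with something like mpirun -np 4, producing one line of output per process.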
Memory Parallelism
In the mid-1980s, when 32-bit
microprocessors were first introduced,
computers containing multiple
microprocessors sharing a common
memory became prevalent.
In most of these designs all processors
plug into a common bus.
However, only a small number of processors
can be supported by a bus.
UMA bus based SMP architecture
If the bus is busy when a CPU wants to
read or write memory, the CPU waits for
the bus to become idle.
Bus contention is manageable only for a
small number of processors.
Beyond that, the system is limited by the
bandwidth of the bus and most of the
CPUs are idle most of the time.
UMA bus based SMP architecture
One way to alleviate this problem is to
add a cache to each CPU.
There is less bus traffic if most reads can be
satisfied from the cache, so the system can
support more CPUs.
Even so, a single bus limits a UMA multiprocessor to
about 16-32 CPUs.
SMP
SMP (Symmetric multiprocessor)
A shared-memory multiprocessor where the
cost of accessing a memory location is the same
for all processors.
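On such a machine the natural programming model is the shared-memory one named in the course outline. A minimal OpenMP sketch (my own, assuming a compiler with OpenMP support such as gcc with -fopenmp; not from the slides): all threads see the same array and split the loop iterations among themselves.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;

    /* Each thread processes a chunk of the shared array; the partial
       sums are combined by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}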