CSE 574 Parallel Processing
Lecture 1: Introduction to High Performance Computing
Grand challenge problem
A grand challenge problem is
one that cannot be solved in a
reasonable amount of time with
today’s computers.
Weather Forecasting
Cells of size 1 mile x 1 mile x 1 mile
=> Whole global atmosphere: about 5 x 10^8 cells
If each calculation requires 200 Flops
=> 10^11 Flops in one time step
To forecast the weather over 10 days using 10-minute intervals, with a computer operating at 100 Mflop/s (10^8 Flops/s)
=> it would take 10^7 seconds, or over 100 days
To perform the calculation in 10 minutes would require a computer operating at 1.7 Tflop/s (1.7 x 10^12 Flops/s).
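A minimal C sketch of this back-of-the-envelope sizing, using the slide's figures (10^11 Flops per time step, a 100 Mflop/s machine, the quoted 10^7-second total); the variable names are illustrative only:

```c
/* Back-of-the-envelope sizing for the weather example (figures from the slide). */
#include <stdio.h>

int main(void) {
    double flops_per_step = 1e11;  /* 5e8 cells x 200 Flops per cell */
    double machine_rate   = 1e8;   /* 100 Mflop/s machine */
    double total_seconds  = 1e7;   /* quoted runtime for the whole forecast */

    double total_flops = machine_rate * total_seconds;  /* ~1e15 Flops of work */
    double time_steps  = total_flops / flops_per_step;  /* ~1e4 time steps */

    double deadline      = 10.0 * 60.0;                 /* want the forecast in 10 minutes */
    double required_rate = total_flops / deadline;      /* ~1.7e12 Flop/s = 1.7 Tflop/s */

    printf("total work: %.1e Flops (%.0f time steps)\n", total_flops, time_steps);
    printf("runtime at 100 Mflop/s: %.1f days\n", total_seconds / 86400.0);
    printf("rate needed for a 10-minute forecast: %.2e Flop/s\n", required_rate);
    return 0;
}
```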
Some Grand Challenge Applications
Science
• Global climate modeling
• Astrophysical modeling
• Biology: genomics; protein folding; drug design
• Computational Chemistry
• Computational Material Sciences and Nanosciences
Engineering
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
Business
• Financial and economic modeling
• Transaction processing, web services and search engines
Defense
• Nuclear weapons -- test by simulations
• Cryptography
Units of High Performance Computing

Speed:
1 Mflop/s = 1 Megaflop/s = 10^6 Flop/second
1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/second
1 Tflop/s = 1 Teraflop/s = 10^12 Flop/second
1 Pflop/s = 1 Petaflop/s = 10^15 Flop/second

Capacity:
1 MB = 1 Megabyte = 10^6 Bytes
1 GB = 1 Gigabyte = 10^9 Bytes
1 TB = 1 Terabyte = 10^12 Bytes
1 PB = 1 Petabyte = 10^15 Bytes
Moore’s Law
Gordon Moore (co-founder of Intel) predicted in 1965
that the transistor density of semiconductor chips
would double roughly every 18 months.
Moore’s Law holds also for
performance and capacity
                                       ENIAC (1945)     Laptop (2002)
Number of vacuum tubes / transistors   18,000           6,000,000,000
Weight (kg)                            27,200           0.9
Size (m^3)                             68               0.0028
Power (watts)                          20,000           60
Cost ($)                               4,630,000        1,000
Memory (bytes)                         200              1,073,741,824
Performance (Flops/s)                  800              5,000,000,000
Peak Performance
A contemporary RISC processor typically delivers only about 10% of its peak performance.
Two primary reasons behind this low efficiency:
• IPC inefficiency
• Memory inefficiency
Instructions per cycle (IPC)
inefficiency
Today the theoretical IPC is 4-6
Detailed analysis for a spectrum of
applications indicates that the average
IPC is 1.2–1.4
~75% of the performance is not used
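A quick check of that figure, taking the middle of each quoted range (the specific values 5 and 1.3 are assumptions made for illustration):

```c
/* Rough utilization check: achieved IPC vs. theoretical IPC (mid-range values). */
#include <stdio.h>

int main(void) {
    double peak_ipc     = 5.0;   /* theoretical IPC of 4-6 */
    double achieved_ipc = 1.3;   /* measured average IPC of 1.2-1.4 */
    double used = achieved_ipc / peak_ipc;   /* fraction of issue slots actually used */
    printf("used: %.0f%%, unused: %.0f%%\n", used * 100.0, (1.0 - used) * 100.0);
    return 0;
}
```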
Reasons for IPC inefficiency
Latency
• Waiting for access to memory or other parts of the system
Overhead
• Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
Starvation
• Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
Contention
• Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.
Memory Hierarchy
Processor-Memory Problem
Processors issue instructions roughly every nanosecond.
DRAM can be accessed roughly every 100 nanoseconds.
The gap is growing (see the sketch below):
• processors getting faster by 60% per year
• DRAM getting faster by 7% per year
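A minimal sketch of how that gap compounds, assuming the quoted annual improvement rates hold steadily; the 10-year horizon is chosen only for illustration:

```c
/* Compound the processor vs. DRAM improvement rates quoted above. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double cpu_growth  = 1.60;  /* processors ~60% faster per year */
    double dram_growth = 1.07;  /* DRAM ~7% faster per year */
    for (int year = 0; year <= 10; year += 5) {
        double gap = pow(cpu_growth, year) / pow(dram_growth, year);
        printf("after %2d years the processor-DRAM speed gap has grown %6.1fx\n", year, gap);
    }
    return 0;
}
```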
How fast can a serial computer be?
Consider the 1 Tflop/s sequential machine (numbers worked out in the sketch below):
• data must travel a distance, r, to get from memory to CPU
• to get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3 x 10^8 m/s
• so r < c / 10^12 = 0.3 mm
For 1 TB of storage in a 0.3 mm^2 area:
• each word occupies about 3 Angstroms^2, the size of a small atom
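The speed-of-light argument above, worked out in a short C sketch; the second half assumes, purely for illustration, that the 1 TB is packed bit by bit into an r x r square:

```c
/* Speed-of-light limit on a 1 Tflop/s serial machine (argument from the slide). */
#include <stdio.h>

int main(void) {
    double c    = 3.0e8;     /* speed of light, m/s */
    double rate = 1.0e12;    /* 1 Tflop/s: need one data element per cycle */
    double r    = c / rate;  /* farthest memory can be from the CPU */
    printf("r < %.1e m (about %.1f mm)\n", r, r * 1e3);

    /* Assumption: pack 1 TB (8e12 bits) into an r x r square and ask how much
       area each bit gets; the answer is atomic scale, as the slide concludes. */
    double bits = 8.0e12;
    double area_per_bit_m2 = (r * r) / bits;
    printf("area per bit: %.1e m^2 (about %.1f square Angstroms)\n",
           area_per_bit_m2, area_per_bit_m2 / 1e-20);
    return 0;
}
```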
So, we need Parallel Computing!
High Performance Computers
In the 1980s:
• 1 x 10^6 Floating Point Ops/sec (Mflop/s)
• Scalar based
In the 1990s:
• 1 x 10^9 Floating Point Ops/sec (Gflop/s)
• Vector & Shared memory computing
Today:
• 1 x 10^12 Floating Point Ops/sec (Tflop/s)
• Highly parallel, distributed processing, message passing
What is a Supercomputer?
A supercomputer is a hardware
and software system that
provides close to the maximum
performance that can currently
be achieved
Top500 Computers
Over the last 10 years, the range of the Top500 has increased faster than Moore's law predicts (see the check below):
1993:
• #1 = 59.7 GFlop/s
• #500 = 422 MFlop/s
2004:
• #1 = 70 TFlop/s
• #500 = 850 GFlop/s
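A rough check of that claim, using only the #1 figures from the slide and taking the 18-month doubling period quoted earlier as the Moore's-law reference:

```c
/* Doubling time of the Top500 #1 system, 1993-2004, vs. Moore's-law doubling. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double perf_1993 = 59.7e9;   /* #1 in 1993: 59.7 GFlop/s */
    double perf_2004 = 70.0e12;  /* #1 in 2004: 70 TFlop/s   */
    double years     = 11.0;

    double growth = perf_2004 / perf_1993;                  /* ~1170x overall */
    double doubling_years = years * log(2.0) / log(growth); /* ~1.1 years per doubling */

    printf("growth 1993-2004: %.0fx, doubling every %.1f years\n", growth, doubling_years);
    printf("Moore's-law reference: doubling every 1.5 years\n");
    return 0;
}
```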
Top500 List (June 2005)

Rank  Manufacturer  Computer    Installation Site          Country  Year  Rmax (Tflop/s)  #proc
1     IBM           BlueGene/L  LLNL                       USA      2005  136.8           65536
2     IBM           BlueGene/L  IBM Watson Res. Center     USA      2005  91.3            40960
3     SGI           Altix       NASA                       USA      2004  51.9            10160
4     NEC           Vector      Earth Simulator Center     Japan    2002  35.9            5120
5     IBM           Cluster     Barcelona Supercomp. C.    Spain    2005  27.9            4800
Performance Development
Increasing CPU Performance
Manycore Chip
Composed of hybrid cores
• Some general purpose
• Some graphics
• Some floating point
What is Next?
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks
Millions of cores
Exascale systems (10^18 Flop/s)
Moore’s Law Reinterpreted
Number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
• Need to deal with systems with millions of concurrent threads
Number of threads of execution doubles every 2 years
Performance Projection
Directions
Move toward shared memory
• SMPs and Distributed Shared Memory
• Shared address space with deep memory hierarchy
Clustering of shared memory machines for scalability
Efficiency of message passing and data parallel programming
• MPI and HPF (a minimal MPI example follows below)
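As a taste of the message-passing style mentioned above, here is a minimal MPI sketch in C (standard MPI calls; the payload value and process count are arbitrary):

```c
/* Minimal MPI sketch: rank 0 sends one integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (size >= 2) {
        if (rank == 0) {
            int value = 42;                 /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI wrapper such as mpicc and launched with mpirun -np 2, rank 1 prints the integer it received from rank 0.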
Future of HPC
Yesterday's HPC is
today's mainframe is
tomorrow's workstation