parallel computer


• Problem is to compute:
f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
– Discretize the domain, e.g., a measurement point every 10 km
– Devise an algorithm to predict weather at time t+1 given t
• Uses:
– Predict major events, e.g., El Nino
– Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
Weather Forecasting
An accurate long-range forecast requires a huge number of cells
and hence a huge number of computations.
Case Study: Global Climate Modelling
• The earth’s surface is approximately 5 x 10^8 km^2
• Consider one cell per square km with 15 levels (ground
level up to 14 km high)
• 6 data values per cell (updated once every minute): humidity,
temperature, wind, latitude, longitude and height
Throughput: 3 Gigabytes of data per second
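The throughput figure can be reproduced with a short calculation. The sketch below uses the surface area, level count, value count and update period from the slide; the size of each data value is not stated there, so `BYTES_PER_VALUE = 4` (a single-precision float) is an assumption, chosen because it recovers the quoted 3 Gigabytes per second.

```python
# Back-of-the-envelope data rate for the global climate grid.
SURFACE_KM2 = 5e8        # earth's surface area, from the slide
LEVELS = 15              # vertical levels per surface cell, from the slide
VALUES_PER_CELL = 6      # data values per cell, from the slide
UPDATE_PERIOD_S = 60     # every value is updated once per minute
BYTES_PER_VALUE = 4      # ASSUMPTION: single-precision float, not stated in the slide

cells = SURFACE_KM2 * LEVELS
bytes_per_update = cells * VALUES_PER_CELL * BYTES_PER_VALUE
throughput = bytes_per_update / UPDATE_PERIOD_S   # bytes per second

print(f"{cells:.2e} cells, {throughput / 1e9:.1f} GB/s")
```

With 8-byte values the same grid would need twice the bandwidth, so the quoted figure is best read as an order-of-magnitude estimate.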
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
– Solve Navier-Stokes problem
• Roughly 100 Flops per grid point with 1 minute timestep
• Computational requirements:
– To match real-time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
– Weather prediction (7 days in 24 hours) → 56 Gflop/s
– Climate prediction (50 years in 30 days) → 4.8 Tflop/s
– To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation is at least 8x
• State of the art models require integration of atmosphere,
ocean, sea-ice, land models, plus possibly carbon cycle,
geochemistry and more
• Current models are coarser than this
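The computational requirements above all follow from one scaling rule: the required rate is the real-time rate multiplied by how much faster than real time the simulation must run. The following is a minimal sketch of that arithmetic; the function name `required_rate` and the 365-day year are assumptions made for illustration. Note that the slide rounds the real-time rate down to 8 Gflop/s before scaling, so the unrounded values computed here come out a few percent higher than the quoted ones.

```python
# Figures taken from the slide.
FLOPS_PER_TIMESTEP = 5e11   # flops needed to advance the whole model one step
TIMESTEP_S = 60             # one simulated minute per step

DAY_S = 24 * 3600
YEAR_S = 365 * DAY_S        # ASSUMPTION: 365-day year


def required_rate(simulated_s: float, wall_clock_s: float) -> float:
    """Sustained flop/s needed to simulate `simulated_s` seconds
    of model time within `wall_clock_s` seconds of real time."""
    steps = simulated_s / TIMESTEP_S
    return steps * FLOPS_PER_TIMESTEP / wall_clock_s


# (simulated time, wall-clock time) for each scenario on the slide
scenarios = {
    "real time": (TIMESTEP_S, TIMESTEP_S),
    "weather (7 days in 24 hours)": (7 * DAY_S, DAY_S),
    "climate (50 years in 30 days)": (50 * YEAR_S, 30 * DAY_S),
    "policy (50 years in 12 hours)": (50 * YEAR_S, DAY_S / 2),
}
rates = {name: required_rate(sim, wall) for name, (sim, wall) in scenarios.items()}

for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:,.1f} Gflop/s")

# One reading of the "at least 8x" claim: doubling the resolution in each
# of three spatial dimensions multiplies the number of grid points by 2**3.
resolution_factor = 2 ** 3
```

If the timestep also has to shrink when the grid is refined, the cost grows by more than this factor, which is consistent with the slide's "at least".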
Weather Forecasting
[Figure: computer visualisation of a hurricane]
[Figure: High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL]
Protein Folding
• One of the major challenges in molecular biology.
• Proteins perform over a thousand different jobs.
(As enzymes they accelerate reactions. They also carry
oxygen and, as antibodies, fight disease.)
• Before proteins can go to work they must fold into the
correct shape.
(The string of amino acids in the protein twists and folds to
form the final protein.)
Scientists are using supercomputers to discover the rules that
describe why a string of amino acids folds into a particular
protein.
Protein Folding
Researchers at the Pittsburgh Supercomputing Center tracked the
folding of a small protein (300 amino acids) in water (~32,000
atoms).
Folding time: 1 millisecond
Floating-point operations required: 3 x 10^22
With a PetaFLOP/s computer the simulation would take about a
year.
IBM are funding a $100M project called The Blue Gene Project to build
a 1 PetaFLOP/s computer (PetaScale Computing – 10^15 Flops/s)
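The "about a year" estimate follows directly from dividing the operation count by the machine rate. A minimal sketch of that arithmetic, assuming the machine sustains its full peak rate for the whole run (real codes rarely do):

```python
TOTAL_FLOP = 3e22        # floating-point operations required, from the slide
PETAFLOP_RATE = 1e15     # 1 PetaFLOP/s machine, from the slide

SECONDS_PER_DAY = 24 * 3600

# ASSUMPTION: the machine runs at peak rate for the entire simulation.
run_seconds = TOTAL_FLOP / PETAFLOP_RATE
run_days = run_seconds / SECONDS_PER_DAY

print(f"{run_seconds:.1e} s, about {run_days:.0f} days")
```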
The Production of Toy Story
• 140,000 frames rendered for full-length feature film.
• 10,000 seconds required to render each frame.
• ~ 10^17 operations
• Operations were distributed over dozens of Sun workstations,
~ 10 MIPS (millions of instructions per second) per Sun.
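The frame count and per-frame render time show why the work had to be spread over many machines. The sketch below totals the render time and divides it across a pool of workstations; the slide only says "dozens", so the pool size of 48 is a hypothetical value, and the even split with no scheduling overhead is an idealisation.

```python
FRAMES = 140_000             # frames rendered, from the slide
SECONDS_PER_FRAME = 10_000   # render time per frame, from the slide

SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

total_cpu_seconds = FRAMES * SECONDS_PER_FRAME
single_machine_years = total_cpu_seconds / SECONDS_PER_YEAR


def wall_clock_days(workstations: int) -> float:
    """Idealised render time when frames are split evenly across
    `workstations` machines with no overhead (an assumption)."""
    return total_cpu_seconds / workstations / SECONDS_PER_DAY


HYPOTHETICAL_POOL = 48       # ASSUMPTION: the slide only says "dozens"

print(f"one workstation: about {single_machine_years:.0f} years")
print(f"{HYPOTHETICAL_POOL} workstations: about {wall_clock_days(HYPOTHETICAL_POOL):.0f} days")
```

Because each frame can be rendered independently, the job divides almost perfectly across machines, which is what makes this even-split estimate reasonable as a first approximation.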
What is Parallel Architecture?
A parallel computer is a collection of processing elements
that cooperate to solve large problems fast
• Some broad issues:
– Resource Allocation:
• how large a collection?
• how powerful are the elements?
• how much memory?
– Data access, Communication and Synchronization
• how do the elements cooperate and communicate?
• how are data transmitted between processors?
• what are the abstractions and primitives for cooperation?
– Performance and Scalability
• how does it all translate into performance?
• how does it scale?
Why Study Parallel Architecture?
Parallelism:
• Provides alternative to faster clock for
performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view
architecture
• Is increasingly central in information
processing
Architectural Trends
Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
• slows after 32 bit
• adoption of 64-bit now under way, 128-bit far (not
performance issue)
• great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
• pipelining and simple instruction sets, + compiler advances
(RISC)
• on-chip caches and functional units => superscalar execution
• greater sophistication: out of order execution, speculation,
prediction
– to deal with control transfer and latency problems
– Next step: thread level parallelism
How far will ILP go?
[Figure: fraction of total cycles (%) versus number of instructions issued (0 to 6+)]
[Figure: speedup versus instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth,
perfect branch prediction and renaming
– real caches and non-zero miss latencies