Lec01 Overview
CSCE 713 Computer Architecture
Topics
  • Speedup
  • Amdahl's law
  • Execution time
  • Readings
January 10, 2012
Overview
Readings for today
  • The Landscape of Parallel Computing Research: A View from Berkeley (EECS-2006-183)
  • Parallel Benchmarks Inspired by Berkeley Dwarfs
  • Ka10_7dwarfsOfSymbolicComputation (new)
Topics overview
  • Syllabus and other course pragmatics
      Website (not shown)
      Dates
  • Power wall, ILP wall → multicore
  • Seven Dwarfs
  • Amdahl's Law, Gustafson's Law
Introduction: Single Processor Performance
[Figure: single-processor performance growth in the RISC era and the move to multi-processor. Copyright © 2012, Elsevier Inc. All rights reserved.]
Power Wall
Note that both dynamic power and dynamic energy have Voltage² as the dominant term:

\[ \text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{CapacitiveLoad} \times \text{Voltage}^2 \times \text{FrequencySwitched} \]

\[ \text{Energy}_{\text{dynamic}} = \text{CapacitiveLoad} \times \text{Voltage}^2 \]

So lowering the voltage improves both; supply voltages dropped from 5 V → 1 V over a period of time, but the drop can't continue without errors.
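Since voltage enters squared, a quick back-of-the-envelope check (my numbers, not the slide's): holding capacitive load and switching frequency fixed, dropping the supply from 5 V to 1 V cuts both dynamic power and dynamic energy by

\[ \left(\frac{5\ \text{V}}{1\ \text{V}}\right)^2 = 25\times \]

which is why the 5 V → 1 V scaling mentioned above bought so much headroom while it lasted.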
Static Power
CMOS chips have power loss due to current leakage even when the transistor is off. In 2006 the goal for leakage was 25%.

\[ \text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage} \]
Single CPU, Single Thread Programming Model
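The slide's figure is not reproduced in this transcript. As a stand-in, a minimal sketch of the model it names (my example): one flow of control, one memory, all work done in program order. The later topics (pthreads, OpenMP, MPI) are about splitting loops like these across cores.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* One thread does everything, strictly in program order. */
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    for (int i = 0; i < N; i++)
        sum += a[i];            /* each iteration depends on the previous sum */

    printf("sum = %f\n", sum);
    return 0;
}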
CSAPP – Bryant and O'Hallaron (Computer Systems: A Programmer's Perspective)
Topics Covered
  • The need for gains in performance
  • The need for parallelism
  • Amdahl's and Gustafson's laws
  • Various problems: the 7 Dwarfs and …
  • Various approaches:
      • Multithreaded / multicore
      • Posix pthreads
      • Intel's TBB
      • Distributed – MPI
      • Shared memory – OpenMP
      • GPUs
      • Grid computing
      • Cloud computing
  • Bridges between …
Top 10 challenges in parallel computing
By Michael Wrinn (Intel). In priority order:
1. Finding concurrency in a program - how to help programmers
“think parallel”?
2. Scheduling tasks at the right granularity onto the processors
of a parallel machine.
3. The data locality problem: associating data with tasks and
doing it in a way that our target audience will be able to use
correctly.
4. Scalability support in hardware: bandwidth and latencies to
memory plus interconnects between processing elements.
5. Scalability support in software: libraries, scalable algorithms,
and adaptive runtimes to map high level software onto platform
details.
6. Synchronization constructs (and protocols) that
enable programmers to write programs free from
deadlock and race conditions (see the pthreads sketch after this list).
7. Tools, API’s and methodologies to support the
debugging process.
8. Error recovery and support for fault tolerance.
9. Support for good software engineering practices:
composability, incremental parallelism, and code
reuse.
10. Support for portable performance. What are the right
models (or abstractions) so programmers can write
code once and expect it to execute well on the
important parallel platforms?
http://www.multicoreinfo.com/2009/01/wrinn-top-10-challenges/
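Challenge 6 asks for constructs that keep programs free of deadlock and races. A minimal POSIX-threads sketch of the idea (my example, not from the list's source): a mutex serializing updates to a shared counter so concurrent increments are not lost.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter under the mutex; without the
   lock, concurrent counter++ operations could overwrite each other. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}

Compile with cc -pthread; the final count equals NTHREADS × NITER, which an unlocked version would not guarantee.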
Berkeley Conventional Wisdom
1. Old CW: Power is free, but transistors are expensive.
· New CW is the “Power wall”: Power is expensive, but
transistors are “free”. That is, we can put more
transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is
dynamic power.
· New CW: For desktops and servers, static power due to
leakage can be 40% of total power.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
3. Old CW: Monolithic uniprocessors in silicon are reliable
internally, with errors occurring only at the pins.
· New CW: As chips drop below 65 nm feature sizes, they
will have high soft and hard error rates. [Borkar 2005]
[Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can
continue to raise the level of abstraction and hence the
size of hardware designs.
· New CW: Wire delay, noise, cross coupling (capacitive
and inductive), manufacturing variability, reliability (see
above), clock jitter, design validation, and so on conspire
to stretch the development time and cost of large designs
at 65 nm or smaller feature sizes.
5. Old CW: Researchers demonstrate new architecture
ideas by building chips.
· New CW: The cost of masks at 65 nm feature size, the
cost of Electronic Computer Aided Design software to
design such chips, and the cost of design for GHz clock
rates means researchers can no longer build believable
prototypes.
Thus, an alternative approach to evaluating architectures
must be developed.
6. Old CW: Performance improvements yield both lower
latency and higher bandwidth.
· New CW: Across many technologies, bandwidth improves
by at least the square of the improvement in latency.
[Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast.
· New CW is the “Memory wall” [Wulf and McKee 1995]:
Load and store is slow, but multiply is fast. Modern
microprocessors can take 200 clocks to access
Dynamic Random Access Memory (DRAM), but even
floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level
parallelism (ILP) via compilers and architecture
innovation. Examples from the past include branch
prediction, out-of-order execution, speculation, and Very
Long Instruction Word systems.
· New CW is the “ILP wall”: There are diminishing returns on
finding more ILP.
9. Old CW: Uniprocessor performance doubles every 18
months.
· New CW is Power Wall + Memory Wall + ILP Wall = Brick
Wall. Figure 2 plots processor performance for almost 30
years. In 2006, performance is a factor of three below the
traditional doubling every 18 months that we enjoyed
between 1986 and 2002. The doubling of uniprocessor
performance may now take 5 years.
10. Old CW: Don't bother parallelizing your application, as you
can just wait a little while and run it on a much faster
sequential computer.
· New CW: It will be a very long wait for a faster sequential
computer.
11. Old CW: Increasing clock frequency is the primary method
of improving processor performance.
· New CW: Increasing parallelism is the primary method of
improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor
application is failure.
· New CW: Given the switch to parallel computing, any
speedup via parallelism is a success.
Amdahl's Law
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the fraction of the time the enhancement can be used:

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Frac}_{\text{enhanced}}) + \dfrac{\text{Frac}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]
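A quick worked example (numbers mine, for illustration): if the enhancement applies to 90% of the execution time and speeds that portion up by 10×,

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10}} = \frac{1}{0.19} \approx 5.3 \]

so even a 10× improvement on 90% of the work yields only about 5.3× overall, and the bound as the enhancement grows without limit is 1/0.1 = 10×.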
Exec Time of Parallel Computation
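The slide consists of a plot that is not reproduced here. A common way to write the quantity such plots show, consistent with Amdahl's law above (an assumption about the omitted figure, not its caption): with parallel fraction f and p processors,

\[ T(p) = (1 - f)\,T(1) + \frac{f\,T(1)}{p} \]

so as p grows, the execution time flattens out at the serial part (1 − f) T(1).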
Gustafson’s Law: Scale the problem
http://en.wikipedia.org/wiki/Gustafson%27s_law
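In the form given on the Wikipedia page cited above, with p processors and serial fraction s of the time measured on the parallel machine, the scaled speedup is S(p) = p − s(p − 1). A quick example with my own numbers:

\[ S(64) = 64 - 0.1 \times (64 - 1) = 57.7 \]

i.e. by growing the problem with the machine, 64 processors run roughly 58× faster than one, even though a fixed-size problem with a 10% serial fraction is capped at 10× by Amdahl's law.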
Matrix Multiplication – scaling the problem
Note: we would really scale a model of a "real problem," but matrix multiplication might be one required step.
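A minimal OpenMP sketch of the step named above (my example, not from the slides): the row loop of C = A × B is parallelized, and "scaling the problem" means growing n with the processor count, so the O(n³) parallel work grows much faster than the serial setup.

#include <stdio.h>
#include <stdlib.h>

/* C = A * B for n x n matrices in row-major order.
   Compile with: cc -fopenmp matmul.c */
void matmul(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main(void) {
    int n = 512;                              /* grow n to scale the problem */
    double *A = malloc(sizeof *A * n * n);
    double *B = malloc(sizeof *B * n * n);
    double *C = malloc(sizeof *C * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul(n, A, B, C);
    printf("C[0] = %f\n", C[0]);              /* 1024.0 for n = 512 */
    free(A); free(B); free(C);
    return 0;
}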
Phillip Colella's "Seven dwarfs"
High-end simulation in the physical sciences = 7 numerical methods:
  1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
  2. Unstructured Grids
  3. Fast Fourier Transform
  4. Dense Linear Algebra
  5. Sparse Linear Algebra
  6. Particles
  7. Monte Carlo
If we add 4 for embedded, this covers all 41 EEMBC benchmarks:
  8. Search/Sort
  9. Filter
  10. Combinational logic
  11. Finite State Machine
Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same.
Well-defined targets from the algorithmic, software, and architecture standpoint.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004
www.eecs.berkeley.edu/bears/presentations/06/Patterson.ppt
Seven Dwarfs – Dense Linear Algebra
Data are dense matrices or vectors.
  • Generally, such applications use unit-stride memory accesses to read data from rows, and
  • strided accesses to read data from columns.
  • Communication pattern (figure): black is no communication.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
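A small C illustration of the access patterns described (my example, not the report's): with row-major storage, reading along a row is unit-stride, while reading down a column strides by n elements and touches a new cache line on almost every access for large n.

#include <stdio.h>
#include <stdlib.h>

/* Row-major storage: element (i, j) of an n x n matrix lives at a[i*n + j]. */
double sum_row(const double *a, int n, int i) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        s += a[i * n + j];        /* unit stride: consecutive addresses */
    return s;
}

double sum_col(const double *a, int n, int j) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i * n + j];        /* stride n: jumps n*8 bytes per access */
    return s;
}

int main(void) {
    int n = 1024;
    double *a = calloc((size_t)n * n, sizeof *a);
    printf("row 0: %g, col 0: %g\n", sum_row(a, n, 0), sum_col(a, n, 0));
    free(a);
    return 0;
}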
Seven Dwarfs – Sparse Linear Algebra
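The slide's body is only a figure, not reproduced here. The kernel usually used to characterize this dwarf is sparse matrix–vector multiply over a compressed-sparse-row (CSR) matrix; a minimal sketch (my example, not the report's):

#include <stdio.h>

/* y = A*x where A is stored in CSR form:
   row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];   /* indirect (gather) access to x */
        y[i] = s;
    }
}

int main(void) {
    /* 2x2 example: [[4, 1], [0, 3]] times [1, 2] = [6, 6] */
    int row_ptr[] = {0, 2, 3};
    int col_idx[] = {0, 1, 1};
    double val[]  = {4.0, 1.0, 3.0};
    double x[] = {1.0, 2.0}, y[2];
    spmv_csr(2, row_ptr, col_idx, val, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]);
    return 0;
}

The indirect access x[col_idx[k]] is what distinguishes this dwarf from dense linear algebra: the memory pattern depends on the sparsity structure, not on a fixed stride.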
Seven Dwarfs – Spectral Methods (e.g., FFT)
Seven Dwarfs - N-Body Methods
Depends on interactions between many discrete points.
Variations include particle-particle methods, where every
point depends on all others, leading to an O(N²)
calculation, and hierarchical particle methods, which
combine forces or potentials from multiple points to
reduce the computational complexity to O(N log N) or
O(N).
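A minimal sketch of the O(N²) particle–particle variant described above (my example): every particle accumulates an inverse-square contribution from every other particle, here in one dimension for brevity.

#include <stdio.h>
#include <math.h>

#define N 4

int main(void) {
    double pos[N]   = {0.0, 1.0, 2.5, 4.0};
    double force[N] = {0.0};

    /* All-pairs interaction: N*(N-1) force evaluations, hence O(N^2). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (i == j) continue;
            double d = pos[j] - pos[i];
            force[i] += d / (fabs(d) * d * d);   /* sign(d) / d^2 */
        }

    for (int i = 0; i < N; i++)
        printf("force[%d] = %g\n", i, force[i]);
    return 0;
}

Hierarchical methods such as Barnes–Hut replace the inner all-pairs loop with a tree traversal, which is how the O(N log N) complexity mentioned above is reached.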
Seven Dwarfs – Structured Grids
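The slide's body is a figure only. The kernel most often used to illustrate this dwarf is a stencil sweep over a regular grid; a minimal 5-point Jacobi sketch (my example, not the report's):

#include <stdio.h>

#define NX 8
#define NY 8

int main(void) {
    static double u[NX][NY], unew[NX][NY];   /* statics start zeroed */

    for (int i = 0; i < NX; i++)
        u[i][0] = 1.0;                       /* boundary value on one edge */

    /* One sweep: each interior point becomes the average of its four
       neighbors -- the regular, nearest-neighbor access pattern of this dwarf. */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    printf("unew[1][1] = %g\n", unew[1][1]); /* 0.25 for this setup */
    return 0;
}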
Seven Dwarfs – Unstructured Grids
An irregular grid where data locations are selected, usually
by underlying characteristics of the application.
Seven Dwarfs - Monte Carlo
Calculations depend on statistical results of repeated
random trials. Considered embarrassingly parallel.
Communication is typically not dominant in Monte Carlo
methods.
Embarrassingly Parallel / NSF TeraGrid
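A minimal sketch of the pattern (my example, not the report's): estimate π from repeated random trials. Each trial is independent, which is what makes the dwarf embarrassingly parallel; distributing it needs only a final sum of the per-worker hit counts.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long trials = 10000000;
    long hits = 0;

    srand(12345);
    for (long i = 0; i < trials; i++) {
        double x = rand() / (double)RAND_MAX;   /* random point in the unit square */
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter circle? */
            hits++;
    }
    printf("pi ~= %f\n", 4.0 * hits / trials);
    return 0;
}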
Principle of Locality
Rule of thumb: a program spends 90% of its execution time in only 10% of the code.
So what do you try to optimize?
Locality of memory references
Temporal locality
Spatial locality
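A small C illustration of the two kinds of locality (my example): the data array is scanned in address order (spatial locality), while the small histogram is hit over and over and so stays resident in the cache (temporal locality).

#include <stdio.h>

#define N 1000000

int main(void) {
    static int data[N];
    static int hist[16];                /* small, heavily reused table */

    for (int i = 0; i < N; i++)
        data[i] = (i * 31) & 15;        /* values 0..15 */

    /* Spatial locality: data[] is read at consecutive addresses.
       Temporal locality: the 16 hist[] entries are reused on every iteration. */
    for (int i = 0; i < N; i++)
        hist[data[i]]++;

    printf("hist[0] = %d\n", hist[0]);
    return 0;
}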
Taking Advantage of Parallelism
Logic parallelism – carry lookahead adder
Word parallelism – SIMD
Instruction pipelining – overlap fetch and execute
Multithreading – executing independent instructions at the same time
Speculative execution
Linux – System Info

saluda> lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:               2393.830
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
saluda>
Control Panel → System and Sec… → System
Task Manager