
ME964
High Performance Computing for Engineering Applications

C Programming Intro, Wrap-up
&
Brief Overview of Parallel Computing

Sept. 9, 2008

Before we get started…

Last Time

- Memory layout and access
- C pointers and pointer arithmetic
- Misc C elements: struct, typedef, arrays, sizeof()
- HW 1 assigned, due on Th
- Use the Bulletin Board to post questions; they'll be answered by me, the grader, or your colleagues

Today

- Wrap up the C Programming Intro
- Overview of Parallel Computing
  - A look at trends in the computing paradigm
  - The hardware perspective on parallel computing
  - The software perspective on parallel computing
Function Types

The other confusing construct is the "pointer to function" type. For example, consider qsort (a sort function in the standard library):

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));

/* function matching this type: */
int cmp_function(const void *x, const void *y);

/* typedef defining this type: */
typedef int (*cmp_type)(const void *, const void *);

/* rewrite qsort prototype using our typedef */
void qsort(void *base, size_t nmemb, size_t size, cmp_type compar);

Notes:
- The last argument is a comparison function.
- const means the function is not allowed to modify memory via this pointer.
- size_t is an unsigned integer type.
- void * is a pointer to memory of unknown type.
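A minimal usage sketch (an addition, not on the slide): sorting an array of ints with qsort() through a comparison function of the required type.

#include <stdio.h>
#include <stdlib.h>

/* matches int (*)(const void *, const void *) */
int cmp_function(const void *x, const void *y)
{
    int a = *(const int *)x;  /* recover the actual element type */
    int b = *(const int *)y;
    return (a > b) - (a < b); /* negative, zero, or positive */
}

int main(void)
{
    int v[] = { 42, 7, 19, 3 };
    size_t n = sizeof(v) / sizeof(v[0]);
    qsort(v, n, sizeof(v[0]), cmp_function);
    for (size_t i = 0; i < n; ++i)
        printf("%d ", v[i]); /* prints: 3 7 19 42 */
    printf("\n");
    return 0;
}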
Revisiting Dynamic Memory Allocation

Recall that variables declared with a given size are allocated statically; this places them on the stack. Allocating memory at run time requires dynamic allocation, which places it on the heap.

sizeof() reports the size of a type in bytes.

int *alloc_ints(size_t requested_count)
{
    int *big_array;
    big_array = (int *)calloc(requested_count, sizeof(int));
    if (big_array == NULL) {
        printf("can't allocate %zu ints: %m\n", requested_count);
        return NULL;
    }
    /* now big_array[0] .. big_array[requested_count-1] are
     * valid and zeroed. */
    return big_array;
}

Notes:
- calloc() allocates (and zeroes) memory for N elements of size k; it returns NULL if it can't allocate.
- %zu is the correct conversion for a size_t (the original slide used %d); %m is a glibc extension that prints the errno message.
- It's OK to return this pointer; it will remain valid until it is freed with free(). It is bad practice, though: if you need it somewhere else, declare and define it there…
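A minimal caller sketch (an addition, not on the slide), showing the allocate/use/free lifecycle for alloc_ints() above:

int *data = alloc_ints(1000); /* heap allocation, zeroed */
if (data != NULL) {
    data[0] = 42;             /* use it ...                       */
    free(data);               /* ... then release it exactly once */
    data = NULL;              /* avoid keeping a dangling pointer */
}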
Caveats with Dynamic Memory

Dynamic memory is useful, but it comes with several caveats:

- Whereas the stack is automatically reclaimed, dynamic allocations must be tracked and free()'d when they are no longer needed. With every allocation, be sure to plan how that memory will get freed. Losing track of memory causes a "memory leak".
- Whereas the compiler enforces that reclaimed stack space can no longer be reached, it is easy to accidentally keep a pointer to dynamic memory that has been freed. Whenever you free memory, you must be certain that you will not try to use it again.
- Because dynamic memory always uses pointers, there is generally no way for the compiler to statically verify its usage. This means that errors that are detectable with static allocation are not detectable with dynamic allocation.
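To make the second caveat concrete, a minimal sketch (an addition, not from the slides) of a use-after-free that compiles cleanly:

#include <stdlib.h>

void dangling_example(void)
{
    int *p = (int *)malloc(10 * sizeof(int));
    if (p == NULL)
        return;
    free(p);  /* memory returned to the allocator ...       */
    p[0] = 1; /* BUG: p is now dangling; undefined behavior */
}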
Some Common Errors and Hints

sizeof() can take a variable reference in place of a type name. This guarantees the right allocation size, but don't accidentally take the sizeof() of the pointer instead of the object! Also remember that malloc() allocates n bytes and, unlike calloc(), does not zero the memory, so you should memset() it to 0.

/* allocating a struct with malloc() */
struct my_struct *s = NULL;
s = (struct my_struct *)malloc(sizeof(*s)); /* NOT sizeof(s)!! */
if (s == NULL) {
    fprintf(stderr, "no memory!\n"); /* always check for NULL, even if you just exit(1) */
    exit(1);
}
memset(s, 0, sizeof(*s)); /* malloc() does not zero the memory */

/* another way to initialize an alloc'd structure: */
struct my_struct init = {
    .counter = 1,  /* C99 designated initializers; the slide's
                    * "counter: 1" form is an old GCC extension */
    .average = 2.5,
    .in_use  = 1
};
/* memmove(dst, src, size) (note: arg order reads like assignment) */
memmove(s, &init, sizeof(init));

/* when you are done with it, free it! */
free(s);
s = NULL; /* use pointers as implied in-use flags! */

Notes:
- The slide's printf(stderr, …) is itself a bug: writing to a specific stream requires fprintf().
- memmove() is preferred over memcpy() because it is safe even when source and destination overlap (e.g., when shifting the contents of a buffer).
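For the snippets above to compile, struct my_struct needs a definition; a hypothetical one (not given on the slide) matching the fields used:

struct my_struct {
    int    counter;
    double average;
    int    in_use;
};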
High Level Question: Why is Software Hard?

- Complexity: Every conditional ("if") doubles the number of paths through your code; every bit of state doubles the number of possible states.
  - Recommendation: reuse code with functions; avoid duplicate state variables.
- Mutability: Software is easy to change. Great for rapid fixes… and rapid breakage. You are always one character away from a bug.
  - Recommendation: tidy, readable code that is easy to understand by inspection.
- Flexibility: Problems can be solved in many different ways. There are few hard constraints; don't let your horses run loose.
  - Recommendation: discipline and use of design patterns.
Design Patterns

- A really good book if you are serious about this…
End: Quick Review of C
Beginning: Discussion of Computational Models and Trends
High Performance Computing (HPC): How, Why, and Why Now

- Objectives of this course segment:
  - Discuss some barriers facing the traditional sequential computation model
  - Discuss some proposed solutions and corresponding trends in the hardware and software industries
  - Overview hardware and software solutions in relation to parallel computing
Acknowledgements

- Material presented on this topic draws on presentations made by:
  - Darío Suárez, Universidad de Zaragoza
  - John Cavazos, University of Delaware
  - Others, as indicated on various slides
Three Walls to Serial Performance

- Memory Wall
- Instruction Level Parallelism (ILP) Wall
- Power Wall

Source: the excellent article "The Many-Core Inflection Point for Mass Market Computer Systems", by John L. Manferdelli, Microsoft Corporation:
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
Memory Wall

- Memory Wall: what is it?
  - The growing disparity in speed between the CPU and the memory outside the CPU chip.
- The growing memory latency is a barrier to computer performance improvements.
  - Current architectures have ever-growing caches to improve the "average memory reference" time to fetch or write instructions or data.
  - This is all due to latency and limited communication bandwidth beyond chip boundaries.
- From 1986 to 2000, CPU speed improved at an annual rate of 55%, while memory access speed improved at only 10% (see the back-of-the-envelope sketch below).
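To get a feel for what those two rates imply, a minimal back-of-the-envelope sketch (an addition, not from the slides) that compounds them over the quoted period:

#include <stdio.h>

int main(void)
{
    double cpu = 1.0, mem = 1.0;
    for (int year = 1986; year < 2000; ++year) {
        cpu *= 1.55; /* CPU speed: +55% per year    */
        mem *= 1.10; /* memory speed: +10% per year */
    }
    /* prints a relative gap of roughly 120x */
    printf("relative CPU/memory speed gap by 2000: %.0fx\n", cpu / mem);
    return 0;
}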
The ILP Wall

- Duplicate hardware speculatively executes future instructions before the results of current instructions are known, while providing hardware safeguards to prevent the errors that might be caused by out-of-order execution.

  1. e = a + b
  2. f = c + d
  3. g = e * f

  Here, instructions 1 and 2 are independent and can execute in parallel, but instruction 3 must wait for both of their results.
- Branches must be "guessed" to decide what instructions to execute simultaneously.
  - If you guess wrong, you throw away that part of the result.
- Data dependencies may prevent successive instructions from executing in parallel, even if there are no branches.
The ILP Wall

- ILP, the good: existing programs enjoy performance benefits without any modification.
- ILP, the bad: improvements are difficult to forecast, since the success of "speculation" is difficult to predict. Moreover, ILP causes a super-linear increase in execution unit complexity (and associated power consumption) without linear speedup.
- ILP, the ugly: serial performance acceleration using ILP has stalled because of these effects.
The Power Wall

- Power, and not manufacturing, limits traditional general-purpose microarchitecture improvements (F. Pollack, Intel Fellow).
- Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease.

[Figure: power density (W/cm²) versus process technology (μm), older to newer: i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, Core Duo; the trend heads toward nuclear-reactor power densities. Adapted from F. Pollack (MICRO'99).]
The Power Wall

- Power dissipation in clocked digital devices is proportional to the clock frequency, imposing a natural limit on clock rates (a first-order power model is sketched after this list).
- A significant increase in clock speed without heroic (and expensive) cooling is not possible; chips would simply melt.
- Clock speed increased by a factor of 4,000 in less than two decades.
  - The ability of manufacturers to dissipate that heat is limited, though…
  - Look back at the last five years: clock rates are pretty much flat.
- You could bank on Materials Science (MS) breakthroughs.
  - The MS engineers have usually delivered; can they keep doing it?
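For reference, a common first-order model of dynamic power (an addition, not stated on the slide):

    P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} \, f

where \alpha is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Because lowering f typically allows lowering V as well, modest frequency reductions buy large power savings; this is the lever exploited in the Ox vs. Chickens example a few slides ahead.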
Intel Perspective

- Intel's "Platform 2015" documentation; see
  http://download.intel.com/technology/computing/archinnov/platform2015/download/RMS.pdf

  "First of all, as chip geometries shrink and clock frequencies rise, the transistor leakage current increases, leading to excess power consumption and heat. […] Secondly, the advantages of higher clock speeds are in part negated by memory latency, since memory access times have not been able to keep pace with increasing clock frequencies. […] Third, for certain applications, traditional serial architectures are becoming less efficient as processors get faster, further undercutting any gains that frequency increases might otherwise buy."
What can be done?
Moore's Law

- 1965 paper: the number of transistors on integrated circuits doubles every two years.
  - Moore himself wrote only about the density of components (or transistors) at minimum cost.
  - The increase in transistor count is also a rough measure of computer processing performance.
- Moore quote: "Moore's law has been the name given to everything that changes exponentially. I say, if Gore invented the Internet, I invented the exponential."
  http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html
The Ox vs. Chickens Analogy

Seymour Cray: "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"

- Chicken is trendy these days:
  - For certain classes of applications, you can run many cores at lower frequency and come out ahead in the speed game.
- Example (John Cavazos); the arithmetic behind the percentages is sketched after this list:
  - Scenario One: a one-core processor with power budget W.
    - Increase the frequency by 20%.
    - This substantially increases the power, by more than 50%.
    - But it increases performance by only 13%.
  - Scenario Two: decrease the frequency by 20% and use a simpler core.
    - This decreases the power by 50%.
    - You can now add another dumb core (one more chicken…).
    - The power budget stays at W, with increased performance!
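Where might those percentages come from? One plausible gloss (an addition, not given on the slide): if the supply voltage scales roughly with frequency, the power model above gives P \propto V^{2} f \propto f^{3}. Then 1.2^{3} \approx 1.73 (well over a 50% power increase) and 0.8^{3} \approx 0.51 (about half the power), while performance tracks frequency much more weakly, hence the modest 13% gain.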
CPU Performance over Time

- Lifted from Tom's Hardware's CPU benchmark section:
  - Blue: Intel
  - Green: AMD
- It is interesting to see that it is multi-core technology, rather than clock rate, that has made the big difference recently…
Micro2015: Evolving Processor Architecture, Intel® Developer Forum, March 2005

Intel's Vision: Evolutionary Configurable Architecture

[Diagram: an evolutionary path of chip designs.]
- Dual core: symmetric multithreading.
- Multi-core array: CMP with ~10 cores; large, scalar cores for high single-thread performance.
- Many-core array: CMP with 10s-100s of low-power scalar cores, capable of TFLOPS+, a full System-on-Chip for servers, workstations, embedded…; scalar plus many-core for highly threaded workloads.

CMP = "chip multi-processor"

Presentation by Paul Petersen, Sr. Principal Engineer, Intel
Vision of the Future

[Figure: performance versus time; a growing gap opens between the "GHz Era" trend and the "Multi-core Era" trend. ISV = Independent Software Vendor.]

- "Parallelism for Everyone"
- Parallelism changes the game:
  - A large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors.
  - Competitive pressures = demand for parallel applications.

Presentation by Paul Petersen, Sr. Principal Engineer, Intel
Intel Larrabee

- Paul Otellini, President and CEO, Intel:
  - "We are dedicating all of our future product development to multicore designs."
  - "We believe this is a key inflection point for the industry."
Putting Things in Perspective…

The way business has been run in the past, and how it will probably change:

- Old: Increasing clock frequency is the primary method of performance improvement.
  New: Processor parallelism is the primary method of performance improvement.
- Old: Don't bother parallelizing an application; just wait and run it on a much faster sequential computer.
  New: Nobody is building one-processor-per-chip machines anymore. This marks the end of the La-Z-Boy programming era.
- Old: Less-than-linear scaling for a multiprocessor is failure.
  New: Given the switch to parallel hardware, even sub-linear speedups are beneficial as long as you beat the sequential version.

Slide Source: Berkeley View of the Landscape
End: Discussion of Computational Models and Trends
Beginning: Overview of Hardware and Software for Parallel Computing
Amdahl's Law

Basically measures what parallelism can buy you…

- Let "a" be the amount of time spent in the part of the code that will run sequentially.
- Let "b" be the amount of time spent in the part of the code that you are about to parallelize using P threads.
- Originally, the program runs in T_long = a + b.
- After parallelization, the code runs in T_short = a + b/P.
- You improve the performance of the program by a factor of

  T_long / T_short = (a + b) / (a + b/P)

- Whatever you do, even if b/P is tiny (a huge number of threads and no overhead), the speedup is no larger than (a + b) / a.
- NOTE:
  - The art is in finding algorithms with small "a" and in limiting the overhead associated with using the P threads.
  - Programs with a = 0 are called "embarrassingly parallel". (A sketch that evaluates the speedup formula follows below.)
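A minimal sketch (an addition, not from the slides) that evaluates the speedup formula for a hypothetical 10% serial / 90% parallelizable split:

#include <stdio.h>

/* Amdahl's law: speedup = (a + b) / (a + b / P) */
double amdahl_speedup(double a, double b, int P)
{
    return (a + b) / (a + b / P);
}

int main(void)
{
    double a = 0.1, b = 0.9; /* hypothetical serial/parallel split */
    for (int P = 1; P <= 1024; P *= 4)
        printf("P = %4d  speedup = %.2f\n", P, amdahl_speedup(a, b, P));
    /* as P grows, the speedup approaches (a + b) / a = 10x */
    return 0;
}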
Parallel Computing: There's a Zoo Out There…

- Shared memory
  - Nehalem micro-architecture, released in October 2008
  - AMD "Barcelona" (quad-core)
  - Sun Niagara
- Distributed memory
  - IBM BlueGene/L
  - Cell (see http://users.ece.utexas.edu/~adnan/vlsi-07/hofstee-cell.ppt)
- Mini-cores
  - Intel Teraflops
  - GPUs

How can one bring some order to this zoo?
Flynn's Taxonomy of Architectures

- SISD - Single Instruction/Single Data
- SIMD - Single Instruction/Multiple Data
- MISD - Multiple Instruction/Single Data
- MIMD - Multiple Instruction/Multiple Data
Single Instruction/Single Data

- PU = Processing Unit
- Your desktop, before the spread of dual-core CPUs

Slide Source: Wikipedia, Flynn's Taxonomy
Flavors of SISD

[Figure: instruction-stream diagram.]
More on pipelining…
Single Instruction/Multiple Data

- Processors that execute the same instruction on multiple pieces of data: NVIDIA GPUs

Slide Source: Wikipedia, Flynn's Taxonomy
Single Instruction/Multiple Data

- Each core runs the same set of instructions on different data.
- Example:
  - NVIDIA: processes the pixels of an image in parallel (a minimal sketch of the pattern follows below).

Slide Source: Klimovitski & Macri, Intel
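To make the pattern concrete, a minimal sketch (an addition, not from the slides) of the kind of loop SIMD hardware accelerates: one instruction stream applied element-wise to many data items.

/* Same instruction, multiple data: each iteration performs the
 * identical operation on a different element, so a SIMD machine
 * can execute many iterations in lock-step. */
void saxpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}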
SISD versus SIMD

Writing a compiler for SIMD architectures is VERY difficult (inter-thread communication complicates the picture…).

Slide Source: Ars Technica, PeakStream article
Multiple Instruction/Single Data

- Not useful…

Slide Source: Wikipedia, Flynn's Taxonomy
Multiple Instruction/Multiple Data

- As of 2006, all of the top 10 and most of the TOP500 supercomputers were based on a MIMD architecture.

Slide Source: Wikipedia, Flynn's Taxonomy
Multiple Instruction/Multiple Data

- The sky is the limit: each PU is free to do as it pleases.
- MIMD machines fall in either the shared-memory or the distributed-memory category.
Multicore Programming Models

- Message Passing Interface (MPI)
  - Originally aimed at distributed-memory architectures, now very effective on shared memory as well
- OpenMP
- Threads (a minimal Pthreads sketch follows below)
  - Pthreads (the "P" comes from POSIX)
  - Cell threads
- Parallel libraries
  - Intel's Threading Building Blocks (TBB) - mature
  - Microsoft's Task Parallel Library - mature
  - SWARM (GTech) - small scope
  - Charm++ (UIUC) - growing, under active development
  - STAPL (Standard Template Adaptive Parallel Library, B. Stroustrup, Texas A&M) - under active development

Slide Source: John Cavazos
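Since threading comes up repeatedly in this course, a minimal Pthreads sketch (an addition, not from the slides): each thread runs the same function with its own argument; compile with -lpthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *work(void *arg)
{
    int id = *(int *)arg; /* each thread gets its own id */
    printf("hello from thread %d\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, work, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(tid[i], NULL); /* wait for every thread to finish */
    return 0;
}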