Transcript: Parallelism
Parallelism
Can we make it faster?
12-Apr-16
The RAM model
The RAM (Random Access Machine) model of computation
assumes:
There is a single processing unit
There is an arbitrarily large amount of memory
Accessing any arbitrarily chosen (i.e. random) memory location takes unit
time
This simple model is a very useful guide for algorithm design
For maximum efficiency, “tuning” to the particular hardware is required
The RAM model breaks down when the assumptions are violated
If an array is so large that only a portion of it fits in memory (the rest is on
disk), very different sorting algorithms should be used
Approaches to parallelism
The basic question is, do the processing units share
memory, or do they send messages to one another?
A thread consists of a single flow of control, a program
counter, a call stack, and a small amount of thread-specific data
Threads share memory, and communicate by reading and
writing to that memory
This is thread-based or shared-memory parallel processing
Java “out of the box” is thread-based
A process is a thread that has its own private memory
Processes (sometimes called actors) send messages to one
another
This is message-passing parallel processing
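As a rough illustration of the shared-memory style (a sketch of my own, not from the lecture; the class and variable names are made up, and the lambda syntax assumes Java 8), the two threads below communicate simply by writing into an array that both can see:

  // Sketch: thread-based (shared-memory) parallelism in Java.
  // The two threads communicate through the shared "results" array.
  public class SharedMemoryDemo {
      public static void main(String[] args) throws InterruptedException {
          int[] results = new int[2];                    // shared memory
          Thread t1 = new Thread(() -> results[0] = 1 + 2);
          Thread t2 = new Thread(() -> results[1] = 3 + 4);
          t1.start();
          t2.start();
          t1.join();                                     // wait for both to finish
          t2.join();
          System.out.println(results[0] + results[1]);   // prints 10
      }
  }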
The PRAM model
An obvious extension to the RAM model is the Parallel Random
Access Machine (PRAM) model, which assumes:
There are multiple processing units
There is an arbitrarily large amount of memory
Accessing any memory location takes unit time
The third assumption is “good enough” for many in-memory
sequential programs, but not good enough for parallel programs
If the processing units share memory, then complicated and expensive
synchronization mechanisms must be used
If the processing units do not share memory, then each has its own (fast)
local memory, and communicates with other processes by sending
messages to them (much slower--especially if over a network!)
Bottom line: Because there seems to be no way to meet the unit
time assumption, the PRAM model is seriously broken!
The CTA model
The Candidate Type Architecture model makes these
assumptions:
There are P standard sequential processors, each with its own
local memory
Processors can access non-local memory over a
communication network
One of the processors may be acting as “controller,” doing things like
initialization and synchronization
Non-local memory is between 100 times and 10000 times slower to
access than local memory (based on common architectures)
A processor can make only a very small number (maybe 1 or
2) of simultaneous non-local memory accesses
Consequences of CTA
The CTA model does not specify how many processors
are available
The programmer does not need to plan for some specific
number of processors
More processors may cause the code to execute somewhat
more quickly
The CTA model does specify a huge discrepancy
between local and non-local memory access
The programmer should minimize the number of non-local
memory accesses
Costs of parallelism
It would be great if having N processors meant our programs
would run N times as fast, but...
There is overhead involved in setting up the parallelism, which
we don’t need to pay for a sequential program
There are parts of any program that cannot be parallelized
Some processors will be idle because there is nothing for them to
do
Processors have to contend for the same resources, such as
memory, and may have to wait for one another
Overhead
Overhead is any cost incurred by the parallel algorithm but not by
the corresponding sequential algorithm
Communication among threads and processes (a single thread has no other
threads with which to communicate)
Synchronization is when one thread or process has to wait for results or
events from another thread or process
Contention for a shared resource, such as memory
Java’s synchronized is used to wait for a lock to become free
Extra computation to combine the results of the various threads or
processes
Extra memory may be needed to give each thread or process the memory
required to do its job
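A minimal sketch of the synchronization and contention overhead described above (my own example, not from the lecture): both threads must acquire the same lock before touching the shared total, so each one sometimes waits while the other holds it.

  // Sketch: two threads contend for one lock; the waiting is pure overhead
  // that a single-threaded version of this loop would never pay.
  public class OverheadDemo {
      private static long total = 0;
      private static final Object lock = new Object();

      public static void main(String[] args) throws InterruptedException {
          Runnable work = () -> {
              for (int i = 0; i < 1_000_000; i++) {
                  synchronized (lock) {       // Java's synchronized: wait for the lock
                      total++;
                  }
              }
          };
          Thread t1 = new Thread(work), t2 = new Thread(work);
          t1.start(); t2.start();
          t1.join();  t2.join();
          System.out.println(total);          // prints 2000000
      }
  }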
Amdahl’s law
Some proportion P of a program can be made to run in
parallel, while the remaining (1 - P) must remain
sequential
If there are N processors, then the computation can be
done in (1 - P) + P/N of the sequential time
The maximum speedup is then 1 / ((1 - P) + P/N)
As N goes to infinity, the maximum speedup is 1/(1 - P)
For example, if P = 0.75, the maximum speedup is
(1/0.25), or four times
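A small Java sketch that simply evaluates the formula (the class and method names are mine, not from the lecture):

  // Sketch: Amdahl's law, where p is the parallelizable proportion
  // and n is the number of processors.
  public class Amdahl {
      static double speedup(double p, int n) {
          return 1.0 / ((1.0 - p) + p / n);
      }
      public static void main(String[] args) {
          System.out.println(speedup(0.75, 4));     // about 2.286
          System.out.println(speedup(0.75, 40));    // about 3.721
          System.out.println(1.0 / (1.0 - 0.75));   // limit as n grows: 4.0
      }
  }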
Consequences of Amdahl’s law
If 75% of a process can be parallelized, and there are four
processors, then the possible speedup is
1 / ((1 - 0.75) + 0.75/4) = 2.286
But with 40 processors--ten times as many--the speedup is only
1 / ((1 - 0.75) + 0.75/40) = 3.721
This has led many people (including Amdahl) to conclude that
having lots of processors won’t help very much
However....
For many problems, as the data set gets larger,
The inherently sequential part of the program remains (fairly) constant
Thus, the sequential proportion (1 - P) becomes smaller
So: The greater the volume of data, the more speedup we can get
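A hypothetical illustration (the numbers are invented for the example): if the sequential part stays fixed at 1 second while the parallelizable work grows from 3 seconds to 99 seconds, P rises from 0.75 to 0.99, and the best possible speedup 1/(1 - P) rises from 4 times to 100 times.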
Idle time
Idle time results when
There is a load imbalance--one process may have much less
work to do than another
A process must wait for access to memory or some other
shared resource
Data in registers is most quickly accessed
Data in a cache is next most quickly accessed
A level 1 cache is the fastest, but also the smallest
A level 2 cache is larger, but slower
Memory--RAM--is much slower
Disk access is very much slower
Dependencies
A dependency is when one thread or process requires
the result of another thread or process
Example: (a + b) * (c + d)
The additions can be done in parallel
The multiplication must wait for the results of the additions
Of course, at this level, the hardware itself handles the parallelism
Threads or processes that depend on results from other
threads or processes must wait for those results
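A sketch of that same expression in Java, assuming Java 8's CompletableFuture is available; the class name and variables are made up:

  // Sketch: the two additions run in parallel; the multiplication is the
  // dependency and runs only after both results are available.
  import java.util.concurrent.CompletableFuture;

  public class DependencyDemo {
      public static void main(String[] args) {
          int a = 1, b = 2, c = 3, d = 4;
          CompletableFuture<Integer> left  = CompletableFuture.supplyAsync(() -> a + b);
          CompletableFuture<Integer> right = CompletableFuture.supplyAsync(() -> c + d);
          int product = left.thenCombine(right, (x, y) -> x * y).join();
          System.out.println(product);   // prints 21
      }
  }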
Parallelism in Java
Java uses the shared memory model
There are various competing Java packages (such as Akka
and Kilim) to support message passing, but nothing yet in
the official Java release
The programming language Erlang has developed the
message passing approach
Scala is a Java competitor that supports both
approaches
Scala’s message passing is based on Erlang
Concurrency in Java, I
Java Concurrency in Practice, by Brian Goetz, is the
book to have if you need to do much concurrent
programming in Java
The following 11 points are from his summary of basic
principles:
1. It’s the mutable state, stupid!
2. Make fields final unless they need to be mutable.
3. Immutable objects are automatically thread-safe.
4. Encapsulation makes it practical to manage the complexity.
5. Guard each mutable variable with a lock.
Concurrency in Java, II
6. Guard all variables in an invariant with the same lock.
7. Hold locks for the duration of compound actions.
8. A program that accesses a mutable variable from multiple
threads without synchronization is a broken program.
9. Don’t rely on clever reasoning about why you don’t need to
synchronize.
10. Include thread safety in the design process--or explicitly
document that your class is not thread-safe.
11. Document your synchronization policy.
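A brief sketch of points 2, 3, and 5 (my own example; the class names are made up): an immutable class with final fields needs no locking at all, while mutable state is guarded by a lock.

  // Sketch: Point is immutable (final fields), so it is automatically thread-safe;
  // HitCounter has mutable state, so every access to it is guarded by one lock.
  public final class Point {
      private final int x, y;
      public Point(int x, int y) { this.x = x; this.y = y; }
      public int x() { return x; }
      public int y() { return y; }
  }

  class HitCounter {
      private long hits = 0;
      public synchronized void record() { hits++; }
      public synchronized long total() { return hits; }
  }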
Functional programming
In functional programming (FP):
A function is a value
It can be assigned to variables
It can be passed as an argument to another function
It can be returned as the result of a function call
There are much briefer ways of writing a literal function
Scala example: a => 101 * a
A function acts like a function in mathematics
If you call it with the same arguments, you will get the same result.
Every time. Guaranteed.
Functions have no side effects
Immutable values are strongly emphasized over mutable values
Some languages, such as Haskell, don’t allow mutable values at all
Computation proceeds by the application of functions, not by changing
the state of mutable variables
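The same idea written with Java 8's java.util.function.Function (an illustration of my own, assuming Java 8 is available; a later slide notes that Java 8 adds functional features):

  // Sketch: a function stored in a variable, passed to another method, and applied.
  import java.util.function.Function;

  public class FunctionValues {
      static int applyTwice(Function<Integer, Integer> f, int x) {
          return f.apply(f.apply(x));
      }
      public static void main(String[] args) {
          Function<Integer, Integer> g = a -> 101 * a;   // like the Scala example above
          System.out.println(g.apply(2));                // 202 -- same result every time
          System.out.println(applyTwice(g, 1));          // 10201
      }
  }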
Why functional programming?
Here are the three most important reasons that functional
programming is better for concurrency than imperative
programming:
Immutable values are automatically thread safe
Immutable values are automatically thread safe
Immutable values are automatically thread safe
Why not functional programming?
Functional languages—Lisp, Haskell, ML, OCaml—have
long been regarded as only for ivory-tower academics
Functional languages are “weird” (meaning: unfamiliar)
What’s happening now?
Moore’s law has ended
Instead of getting faster processors, we’re now getting more of
them
Consequently, parallelism, and concurrency, have become much
more important
After about ten years, CIS 120 is once again starting with
OCaml
Python has gotten more functional
Other languages are getting more functional
Microsoft is starting to promote F# (based on ML?)
Java 8 will have some functional features
Scala is a hybrid object/functional language based on Java,
and is freely available now
The End
…for now