Multicores Multiprocessors Clusters
Kristopher Windsor
CS 147, Fall 2008
Parallel processing on one core
Multicore usage, difficulties, and next steps
Alternatives to multicore CPUs
Multicore benchmarks
Multiple instructions and/or multiple data items can be
processed each cycle, which makes batch processing
more efficient
For example, an MMX instruction operates on several
packed data elements simultaneously
Vector architecture is similar to SIMD, but its
speed comes from parallel data movement, not
parallel data processing
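The packed-data idea behind MMX can be imitated in software. The sketch below is only an illustration, not how the hardware is built: it packs four 16-bit values into one 64-bit integer and adds all four lanes in a single pass, with each lane wrapping on overflow. The names pack, unpack and packed_add are hypothetical helpers made up for this example.

```python
LANES = 4                    # four 16-bit lanes, as in one 64-bit register
LANE_BITS = 16
HIGH = 0x8000800080008000    # top bit of every lane
LOW = 0x7FFF7FFF7FFF7FFF     # lower 15 bits of every lane


def pack(values):
    """Pack four unsigned 16-bit values into one integer (lane 0 is lowest)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a packed 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFFFF for i in range(LANES)]


def packed_add(a, b):
    """Add all four lanes at once; carries never cross a lane boundary."""
    # Adding only the low 15 bits cannot overflow out of a lane.
    low_sum = (a & LOW) + (b & LOW)
    # Fix up the top bit of each lane separately, discarding its carry.
    return low_sum ^ ((a ^ b) & HIGH)


x = pack([1, 2, 3, 4])
y = pack([10, 20, 30, 40])
print(unpack(packed_add(x, y)))
```

One call to packed_add replaces four separate additions, which is the sense in which a single instruction processes multiple data.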
                               Single data stream    Multiple data streams
Single instruction stream      SISD (Pentium 4)      SIMD (x86 MMX)
Multiple instruction streams   MISD (not used)       MIMD (Xeon / Clovertown)
Thread switching is required whenever there are more
threads than cores
There are multiple ways for a core to switch to
a different thread
Fine-grained multithreading: switch every cycle
Coarse-grained multithreading: switch when the
current thread is stalled (i.e. it is waiting for some
data to come back from RAM)
Simultaneous multithreading (SMT): multiple
threads are processed each cycle
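The difference between the first two switching policies can be seen in a toy issue-order simulation. This is a simplified sketch under stated assumptions: each thread is just a list of instruction names, a stall is modelled only as "the instruction after which the core switches", and stall duration and SMT (which needs several issue slots per cycle) are left out. The function names and the stalls parameter are hypothetical.

```python
from collections import deque


def fine_grained(threads):
    """Issue one instruction per cycle, rotating to the next thread each cycle."""
    queues = deque((name, deque(instrs)) for name, instrs in threads.items())
    issued = []
    while queues:
        name, instrs = queues.popleft()
        issued.append((name, instrs.popleft()))
        if instrs:                      # thread still has work: back of the line
            queues.append((name, instrs))
    return issued


def coarse_grained(threads, stalls):
    """Stay on one thread until it issues a stalling instruction or finishes."""
    queues = deque((name, deque(instrs)) for name, instrs in threads.items())
    issued = []
    while queues:
        name, instrs = queues.popleft()
        while instrs:
            instr = instrs.popleft()
            issued.append((name, instr))
            if instr in stalls:         # e.g. a load that misses the cache
                break
        if instrs:
            queues.append((name, instrs))
    return issued


threads = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
print([i for _, i in fine_grained(threads)])
print([i for _, i in coarse_grained(threads, stalls={"a2"})])
```

With fine-grained switching the two threads interleave every cycle; with coarse-grained switching thread A keeps the core until its stalling instruction a2, and only then does B run.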
Clock speed limits for each core due to heat
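Heat produced rises much faster than linearly with clock
speed (dynamic power scales with voltage squared times
frequency, and higher clock speeds need higher voltage),
and cooling methods are limited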
This limit has already been reached, and one core is
not enough
Power efficiency
Smaller CPU designs can be optimized better
Individual cores or processors can be turned off
when not needed
Job-level parallelism
Each process can only
use one core
Easier to code
Most programs are
written like this
Inefficient when you
have multiple cores but
only one main program
Parallel processing program
Each process can have
multiple threads, which
run on different cores
Harder to code
Used in OS, which has
many independent tasks,
and in web servers, where
each request can be
handled separately
Best use of multiple cores
Software-rendered display represents most of the
game’s CPU usage (i.e. more than the physics
calculations), and the graphics output cannot naturally
be split into multiple threads
3D hardware-accelerated graphics output is typically
the performance bottleneck, and since rendering is 50x or
more faster on a video card's GPU than on a CPU,
multicore CPUs will not help
In games where every object can collide with every
other object, physics cannot be parallelized easily
because any two collisions may need to access the same
memory
Every event has to happen in order, but parallel
processing does not naturally do this
Sequential
Dim Shared As Integer total

Sub program ()
    'this part can be done several times at once
    'because it does not depend on
    'other parts of the program
    Dim As Integer addme = 0
    For i As Integer = 1 To 10000
        addme += 1
    Next i
    'accesses a global variable
    total += addme
End Sub

For i As Integer = 1 To 100
    program()
Next i
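Concurrent
Dim Shared As Integer total
Dim Shared As Any Ptr mutex

'ThreadCreate expects a Sub that takes one Any Ptr parameter
Sub program (ByVal userdata As Any Ptr)
    'independent work: each thread has its own addme
    Dim As Integer addme = 0
    For i As Integer = 1 To 10000
        addme += 1
    Next i
    'only the update of the shared total needs the mutex
    MutexLock(mutex)
    total += addme
    MutexUnlock(mutex)
End Sub

mutex = MutexCreate()
Dim As Any Ptr threads(1 To 100)
For i As Integer = 1 To 100
    'pass the address of the Sub, without calling it
    threads(i) = ThreadCreate(@program)
Next i
'wait for every thread to finish before destroying the mutex
For i As Integer = 1 To 100
    ThreadWait(threads(i))
Next i
MutexDestroy(mutex)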
Each processor has its own
cache
If one processor changes the
memory, the other processors
may have the wrong data
cached
Snooping protocol: when one
processor changes the data,
every other processor must
remove (invalidate) its copy
AMD’s MOESI protocol:
every cache block has data in
one of these five states:
modified, owned, exclusive,
shared, or invalid
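The invalidate-on-write behaviour of a snooping protocol can be sketched with a toy model. This is a minimal illustration, not MOESI and not any real processor's design: it assumes a write-through cache (every write also updates memory) so that no per-block state is needed, and the Bus and Cache classes are hypothetical names made up for the example.

```python
class Bus:
    """Shared bus plus main memory; every attached cache snoops writes."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, writer, address):
        # Every cache except the writer drops its copy of the block.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(address, None)


class Cache:
    """One processor's private cache (write-through, for simplicity)."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:            # miss: fetch from memory
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_invalidate(self, address)
        self.lines[address] = value
        self.bus.memory[address] = value         # assumption: write-through


bus = Bus()
cpu0, cpu1 = Cache(bus), Cache(bus)
bus.memory[0x10] = 5
cpu0.read(0x10)
cpu1.read(0x10)          # both caches now hold the value 5
cpu0.write(0x10, 9)      # cpu1's copy is invalidated
print(cpu1.read(0x10))   # cpu1 misses and re-fetches the new value
```

Without the broadcast, cpu1 would keep returning its stale cached 5 after cpu0's write, which is exactly the wrong-data problem described above.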
Adding several cores to
a machine will provide
limited speed
improvements, because
the other components
have not been
upgraded
In this example, adding
cores allows more
FLOPs, but not more
data transfer
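One common way to express this limit is a roofline-style estimate: attainable throughput is the smaller of the peak compute rate and the memory bandwidth multiplied by the FLOPs a program performs per byte it moves. The sketch below uses made-up numbers (4 GFLOP/s per core, 16 GB/s of shared memory bandwidth, 0.5 FLOPs per byte); they are assumptions for illustration, not measurements of any real machine.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Throughput is capped by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_PER_CORE = 4.0      # hypothetical GFLOP/s per core
BANDWIDTH = 16.0         # hypothetical GB/s, shared by all cores
INTENSITY = 0.5          # hypothetical FLOPs performed per byte moved

for cores in (1, 2, 4, 8):
    result = attainable_gflops(PEAK_PER_CORE * cores, BANDWIDTH, INTENSITY)
    print(cores, "cores:", result, "GFLOP/s")
```

Under these assumptions the memory system caps the program at 8 GFLOP/s, so going from 2 cores to 4 or 8 raises the peak FLOPs but not the delivered throughput.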
Intel is developing 6
and 8 core processors
(Westmere and
Nehalem)
Tilera produces 64-core
chips (TILE64) with an
architecture made for
many cores
Removes the bus data-transfer bottleneck
Saves power by
powering-off individual
cores
Comes with developer
tools for making parallel
processing programs
CPU
Slowly adopting
multiple cores
Caches exploit locality
Needs low-latency
RAM
GPU
Naturally better suited to
parallelism, and uses heavy
multithreading to achieve
performance
The GeForce 8800 GTX has 16
multiprocessors and 16 * 8
multithreaded floating-point
processors
No locality; uses coarse-grained hardware
multithreading to minimize
time loss
Needs high-bandwidth RAM
Costs
Maintenance and
storage costs for each
machine
Operating systems will
take RAM from each
machine
Resources such as RAM
cannot be shared well
among machines
Benefits
Can be built with mass-produced computers and
standard LAN hardware
Can reach sizes beyond the
limits of current multicore chips
Can be spread over multiple
physical locations
Gives your company more
bandwidth than any one ISP offers
Provides redundancy in case of fire or
power outage
Can be upgraded without
replacing the current hardware
The Sparse Matrix-Vector multiplication test and the Lattice-Boltzmann
Magneto-Hydrodynamics test give different results
Fewer FLOPs per core when there are many cores
Upgrading from 2 cores to 4 may have little effect
Certain processors are better for certain applications (e.g. Xeon)
Multicores demand new methods of software optimization
Computer Organization and Design: The Hardware/Software Interface, 4th ed., by David A. Patterson and John L. Hennessy
AMD.com
PCLaunches.com (New Intel Processors)
Tilera.com