Supercomputing in Plain English: Distributed Parallel
Lecture 27: Computer Architecture
Adapted From:
Supercomputing in Plain English
Part VII: Multicore Madness
Henry Neeman, Director
OU Supercomputing Center for Education & Research
University of Oklahoma
Wednesday, October 17, 2007
Outline
The March of Progress
Multicore/Many-core Basics
Software Strategies for Multicore/Many-core
A Concrete Example: Weather Forecasting
The March of Progress
OU’s TeraFLOP Cluster, 2002
10 racks @ 1000 lbs per rack
270 Pentium4 Xeon CPUs,
2.0 GHz, 512 KB L2 cache
270 GB RAM, 400 MHz FSB
8 TB disk
Myrinet2000 Interconnect
100 Mbps Ethernet Interconnect
OS: Red Hat Linux
Peak speed: 1.08 TFLOP/s
(1.08 trillion calculations per second)
One of the first Pentium4 clusters!
boomer.oscer.ou.edu
TeraFLOP, Prototype 2006, Sale 2011
9 years from room to chip!
http://news.com.com/2300-1006_3-6119652.html
Moore’s Law
In 1965, Gordon Moore was an engineer at Fairchild
Semiconductor.
He noticed that the number of transistors that could be
squeezed onto a chip was doubling about every 18 months.
It turns out that computer speed is roughly proportional to the
number of transistors per unit area.
Moore wrote a paper about this concept, which became known
as “Moore’s Law.”
Moore's Law in Practice
[Figure, built up over several consecutive slides: log(Speed) plotted against Year, showing the Moore's Law trend.]
Fastest Supercomputer vs. Moore
[Figure: "Fastest Supercomputer in the World" (data from www.top500.org): speed in GFLOP/s on a log scale from 1 to 1,000,000, plotted against Year from 1992 to 2008, with one curve for the fastest machine and one for the Moore's Law trend.]
The Tyranny of the Storage Hierarchy
The Storage Hierarchy [5]
From fast, expensive, and small to slow, cheap, and large:
Registers
Cache memory
Main memory (RAM)
Hard disk
Removable media (e.g., DVD)
Internet
Supercomputing in Plain English: Multicore Madness
Wednesday October 17 2007
14
RAM is Slow
The speed of data transfer
between Main Memory and the
CPU is much slower than the
speed of calculating, so the CPU
spends most of its time waiting
for data to come in or go out.
[Diagram: the CPU can consume data at 351 GB/sec [7], but main memory delivers only 10.66 GB/sec [9] (about 3%), which is the bottleneck.]
Why Have Cache?
Cache is nearly the same speed
as the CPU, so the CPU doesn’t
have to wait nearly as long for
stuff that’s already in cache:
it can do more
operations per second!
[Diagram: the CPU consumes 351 GB/sec [7]; cache delivers 253 GB/sec [8] (72% of that); RAM delivers 10.66 GB/sec [9] (3%).]
Storage Use Strategies
Register reuse: Do a lot of work on the same data before
working on new data.
Cache reuse: The program is much more efficient if all of
the data and instructions fit in cache; if not, try to use
what’s in cache a lot before using anything that isn’t in
cache.
Data locality: Try to access data that are near each other
in memory before data that are far (see the loop-ordering sketch below).
I/O efficiency: Do a bunch of I/O all at once rather than a
little bit at a time; don’t mix calculations and I/O.
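As an illustration of the data locality idea, here is a minimal sketch of my own (not from the original slides): two loop orderings over the same Fortran array. Fortran stores arrays column-major, so the version whose inner loop runs down a column touches consecutive memory locations and reuses cache lines; the other version strides across rows and gets little reuse.

PROGRAM locality_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 4000
  REAL, ALLOCATABLE, DIMENSION(:,:) :: a
  REAL :: total
  INTEGER :: r, c

  ALLOCATE(a(n,n))
  a = 1.0

  ! Cache-friendly: inner loop walks down a column, i.e. through
  ! consecutive memory locations (Fortran is column-major).
  total = 0.0
  DO c = 1, n
    DO r = 1, n
      total = total + a(r,c)
    END DO
  END DO
  PRINT *, 'column-order sum = ', total

  ! Cache-hostile: inner loop strides across a row, jumping n REALs
  ! between consecutive accesses, so cache lines are barely reused.
  total = 0.0
  DO r = 1, n
    DO c = 1, n
      total = total + a(r,c)
    END DO
  END DO
  PRINT *, 'row-order sum    = ', total
END PROGRAM locality_demo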
A Concrete Example
OSCER’s big cluster, topdawg, has Irwindale CPUs: single
core, 3.2 GHz, 800 MHz Front Side Bus.
The theoretical peak CPU speed is 6.4 GFLOP/s (double
precision) per CPU, and in practice we've gotten as high as
94% of that.
So, in theory each CPU could consume 143 GB/sec.
The theoretical peak RAM bandwidth is 6.4 GB/sec, but in
practice we get about half that.
So, any code that does fewer than about 45 calculations per byte
transferred between RAM and cache (roughly 143 divided by 3.2)
has its speed limited by RAM bandwidth.
Good Cache Reuse Example
A Sample Application: Matrix-Matrix Multiply
Let A, B and C be matrices of sizes
nr × nc, nr × nk and nk × nc, respectively:

$$A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{bmatrix}
\qquad
B = \begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,nk} \\
b_{2,1} & b_{2,2} & b_{2,3} & \cdots & b_{2,nk} \\
b_{3,1} & b_{3,2} & b_{3,3} & \cdots & b_{3,nk} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
b_{nr,1} & b_{nr,2} & b_{nr,3} & \cdots & b_{nr,nk}
\end{bmatrix}
\qquad
C = \begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,nc} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,nc} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
c_{nk,1} & c_{nk,2} & c_{nk,3} & \cdots & c_{nk,nc}
\end{bmatrix}$$

The definition of A = B • C is

$$a_{r,c} = \sum_{k=1}^{nk} b_{r,k}\, c_{k,c}
         = b_{r,1} c_{1,c} + b_{r,2} c_{2,c} + b_{r,3} c_{3,c} + \cdots + b_{r,nk} c_{nk,c}$$

for r ∈ {1, …, nr}, c ∈ {1, …, nc}.
Matrix Multiply: Naïve Version
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q

  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
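For context, here is a minimal driver of my own (not part of the original slides) that allocates the three matrices, fills them with random numbers, calls matrix_matrix_mult_naive, and times it with the standard SYSTEM_CLOCK intrinsic; the sizes are arbitrary.

PROGRAM test_naive
  IMPLICIT NONE
  INTEGER, PARAMETER :: nr = 1000, nc = 1000, nq = 1000   ! arbitrary sizes
  REAL, ALLOCATABLE, DIMENSION(:,:) :: dst, src1, src2
  INTEGER :: count_start, count_end, count_rate

  ALLOCATE(dst(nr,nc), src1(nr,nq), src2(nq,nc))
  CALL RANDOM_NUMBER(src1)
  CALL RANDOM_NUMBER(src2)

  CALL SYSTEM_CLOCK(count_start, count_rate)
  CALL matrix_matrix_mult_naive(dst, src1, src2, nr, nc, nq)
  CALL SYSTEM_CLOCK(count_end)

  PRINT *, 'Naive multiply took ', &
           REAL(count_end - count_start) / REAL(count_rate), ' seconds'
END PROGRAM test_naive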
Performance of Matrix Multiply
[Figure: "Matrix-Matrix Multiply", CPU sec (0 to 800, lower is better) plotted against total problem size in bytes (nr*nc + nr*nq + nq*nc), from 0 to about 60,000,000 bytes; an "Init" series is shown for comparison.]
Tiling
Tile: A small rectangular subdomain of a problem domain.
Sometimes called a block or a chunk.
Tiling: Breaking the domain into tiles.
Tiling strategy: Operate on each tile to completion, then
move to the next tile.
Tile size can be set at runtime, according to what’s best for
the machine that you’re running on.
Tiling Code
SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
 &                                       rtilesize, ctilesize, qtilesize)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
  INTEGER :: rstart, rend, cstart, cend, qstart, qend

  DO cstart = 1, nc, ctilesize
    cend = cstart + ctilesize - 1
    IF (cend > nc) cend = nc
    DO rstart = 1, nr, rtilesize
      rend = rstart + rtilesize - 1
      IF (rend > nr) rend = nr
      DO qstart = 1, nq, qtilesize
        qend = qstart + qtilesize - 1
        IF (qend > nq) qend = nq
        CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
 &                                   rstart, rend, cstart, cend, qstart, qend)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_by_tiling
Multiplying Within a Tile
SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &                                  rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
  INTEGER :: r, c, q

  DO c = cstart, cend
    DO r = rstart, rend
      ! Zero dst(r,c) only on the first pass through the q dimension,
      ! so later q-tiles accumulate into the partial sum.
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
Reminder: Naïve Version, Again
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q

  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
Performance with Tiling
[Figures: "Matrix-Matrix Multiply Via Tiling" (linear scale, CPU sec 0 to 250) and "Matrix-Matrix Multiply Via Tiling (log-log)" (CPU sec 0.1 to 1000), both plotted against tile size in bytes (roughly 10 to 100,000,000), for matrix sizes 512x256, 512x512, 1024x512, 1024x1024 and 2048x1024; lower is better.]
The Advantages of Tiling
It allows your code to exploit data locality better, to get
much more cache reuse: your code runs faster!
It’s a relatively modest amount of extra coding (typically a
few wrapper functions and some changes to loop bounds).
If you don’t need tiling – because of the hardware, the
compiler or the problem size – then you can turn it off by
simply setting the tile size equal to the problem size.
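As a usage sketch of my own (the tile sizes are illustrative, not from the original slides): calling the tiled routine once with a cache-sized tile, and once with the tile size set equal to the problem size, which turns tiling off as described above.

PROGRAM test_tiling
  IMPLICIT NONE
  INTEGER, PARAMETER :: nr = 1000, nc = 1000, nq = 1000   ! arbitrary sizes
  REAL, ALLOCATABLE, DIMENSION(:,:) :: dst, src1, src2

  ALLOCATE(dst(nr,nc), src1(nr,nq), src2(nq,nc))
  CALL RANDOM_NUMBER(src1)
  CALL RANDOM_NUMBER(src2)

  ! Tiled: 128x128 tiles of REAL(4) use roughly 3 * 128*128*4 = 192 KB,
  ! small enough to stay in a typical L2 cache.
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, 128, 128, 128)

  ! Tiling "turned off": tile size equals problem size, so there is exactly
  ! one tile and the loop structure degenerates to the naive version.
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, nr, nc, nq)
END PROGRAM test_tiling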
Why Does Tiling Work Here?
Cache optimization works best when the number of
calculations per byte is large.
For example, with matrix-matrix multiply on an n × n matrix,
there are O(n3) calculations (on the order of n3), but only
O(n2) bytes of data.
So, for large n, there are a huge number of calculations per
byte transferred between RAM and cache.
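To make the ratio concrete, here is my own back-of-the-envelope version, assuming square n × n matrices of 4-byte REALs and counting one multiply plus one add per inner-loop iteration:

$$\frac{\text{calculations}}{\text{bytes}} \approx \frac{2n^3}{3 \cdot 4\,n^2} = \frac{n}{6},$$

so for n = 6000 the code can, in principle, perform about 1000 calculations for every byte it moves between RAM and cache.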
Multicore/Many-core Basics
What is Multicore?
In the olden days (i.e., the first half of 2005), each CPU chip
had one “brain” in it.
More recently, each CPU chip has 2 cores (brains), and,
starting in late 2006, 4 cores.
Jargon: Each CPU chip plugs into a socket, so these days,
to avoid confusion, people refer to sockets and cores, rather
than CPUs or processors.
Each core is just like a full-blown CPU, except that it shares
its socket with one or more other cores – and therefore
shares its bandwidth to RAM.
Dual Core: [diagram of one chip containing two cores]
Quad Core: [diagram of one chip containing four cores]
Oct Core: [diagram of one chip containing eight cores]
The Challenge of Multicore: RAM
Each socket has access to a certain amount of RAM, at a
fixed RAM bandwidth per SOCKET.
As the number of cores per socket increases, the contention
for RAM bandwidth increases too.
At 2 cores in a socket, this problem isn’t too bad. But at 16
or 32 or 80 cores, it’s a huge problem.
So, applications that are cache optimized will get big
speedups.
But, applications whose performance is limited by RAM
bandwidth are going to speed up only as fast as RAM
bandwidth speeds up.
RAM bandwidth speeds up much more slowly than CPU speed
does.
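To see the contention for yourself, here is a minimal sketch of my own (assuming a Fortran compiler with OpenMP support, e.g. gfortran -fopenmp): a bandwidth-bound triad loop whose aggregate GB/s you can compare while varying OMP_NUM_THREADS. On a multicore socket, the aggregate figure typically stops growing well before the core count does.

PROGRAM bandwidth_contention
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 50000000        ! ~600 MB across the three REAL arrays
  REAL, ALLOCATABLE, DIMENSION(:) :: a, b, c
  DOUBLE PRECISION :: t_start, t_end
  INTEGER :: i

  ALLOCATE(a(n), b(n), c(n))
  b = 1.0
  c = 2.0

  t_start = omp_get_wtime()
!$OMP PARALLEL DO
  DO i = 1, n
    a(i) = b(i) + 2.5 * c(i)   ! almost no arithmetic per byte: RAM-bandwidth bound
  END DO
!$OMP END PARALLEL DO
  t_end = omp_get_wtime()

  PRINT *, 'threads: ', omp_get_max_threads()
  PRINT *, 'aggregate bandwidth (GB/s): ', &
           3.0d0 * 4.0d0 * DBLE(n) / ((t_end - t_start) * 1.0d9)
END PROGRAM bandwidth_contention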
The Challenge of Multicore: Network
Each node has access to a certain number of network ports,
at a fixed number of network ports per NODE.
As the number of cores per node increases, the contention
for network ports increases too.
At 2 cores in a socket, this problem isn’t too bad. But at 16
or 32 or 80 cores, it’s a huge problem.
So, applications that do minimal communication will get
big speedups.
But, applications whose performance is limited by the
number of MPI messages are going to speed up very very
little – and may even crash the node.
Multicore/Many-core Problem
Most multicore chip families have relatively small cache per
core (e.g., 2 MB) – and this problem seems likely to remain.
Small TLBs make the problem worse: 512 KB per core
rather than 2 MB.
So, to get good cache reuse, you need to partition your algorithm
so that each subproblem needs no more than 512 KB.
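Tying this back to the earlier tiling example, here is a small hypothetical helper of my own (the name pick_tile_size and the 512 KB budget are illustrative) that picks a square tile edge so that the three REAL tiles touched by matrix_matrix_mult_tile fit in the working-set budget.

PROGRAM pick_tile_size
  IMPLICIT NONE
  INTEGER, PARAMETER :: budget_bytes = 512 * 1024   ! e.g., what the D-TLB can cover
  INTEGER, PARAMETER :: bytes_per_real = 4
  INTEGER :: tile_edge

  ! 3 tiles * tile_edge**2 * 4 bytes <= budget  =>  tile_edge <= sqrt(budget / 12)
  tile_edge = INT(SQRT(REAL(budget_bytes / (3 * bytes_per_real))))
  PRINT *, 'Use a tile edge of about ', tile_edge, ' elements (', &
           3 * bytes_per_real * tile_edge**2, ' bytes of working set)'
END PROGRAM pick_tile_size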
The T.L.B. on a Current Chip
On Intel Core Duo (“Yonah”):
Cache size is 2 MB per core.
Page size is 4 KB.
A core’s data TLB size is 128 page table entries.
Therefore, D-TLB only covers 512 KB of cache.
The cost of a TLB miss is 49 cycles, equivalent to as many
as 196 calculations! (4 FLOPs per cycle)
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
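Spelling out the two numbers on this slide:

$$128~\text{TLB entries} \times 4~\text{KB/page} = 512~\text{KB covered}, \qquad 49~\text{cycles} \times 4~\text{FLOPs/cycle} = 196~\text{FLOPs per TLB miss}.$$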
What Do We Need?
We need much bigger caches!
TLB must be big enough to cover the entire cache.
It’d be nice to have RAM speed increase as fast as core
counts increase, but let’s not kid ourselves.
To Learn More Supercomputing
http://www.oscer.ou.edu/education.php
Computer Architecture
Spring 2012
Lecture 27. CMPs & SMTs
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Multithreading on a Chip
Find a way to “hide” true data dependency stalls, cache
miss stalls, and branch stalls by finding instructions (from
other process threads) that are independent of those
stalling instructions
Multithreading – increase the utilization of resources on a
chip by allowing multiple processes (threads) to share the
functional units of a single processor
Processor must duplicate the state hardware for each thread – a
separate register file, PC, instruction buffer, and store buffer for
each thread
The caches, TLBs, BHT, BTB can be shared (although the miss
rates may increase if they are not sized accordingly)
The memory can be shared through virtual memory mechanisms
Hardware must support efficient thread context switching
Types of Multithreading on a Chip
Fine-grain – switch threads on every instruction issue
Round-robin thread interleaving (skipping stalled threads)
Processor must be able to switch threads on every clock cycle
Advantage – can hide throughput losses that come from both
short and long stalls
Disadvantage – slows down the execution of an individual thread
since a thread that is ready to execute without stalls is delayed
by instructions from other threads
Coarse-grain – switches threads only on costly stalls
(e.g., L2 cache misses)
Advantages – thread switching doesn’t have to be essentially
free and much less likely to slow down the execution of an
individual thread
Disadvantage – limited, due to pipeline start-up costs, in its
ability to overcome throughput loss
- Pipeline must be flushed and refilled on thread switches
Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a
multiple-issue, dynamically scheduled (superscalar) processor
to exploit both program ILP and thread-level parallelism (TLP)
Most SS processors have more machine level parallelism than
most programs can effectively use (i.e., than have ILP)
With register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without
regard to dependencies among them
- Need separate rename tables (ROBs) for each thread
- Need the capability to commit from multiple threads (i.e., from
multiple ROBs) in one cycle
Intel's Pentium 4 SMT is called hyperthreading
Supports just two threads (doubles the architecture state)
Typically, each core of newer multicore processors is
hyperthreaded
Threading on a 4-way SS Processor Example
[Figure: issue slots (horizontal) versus time (vertical) for threads A, B, C and D under coarse MT, fine MT, and SMT.]
William Stallings
Computer Organization and Architecture
8th Edition
Chapter 18
Multicore Computers
Hardware Performance Issues
Microprocessors have seen an exponential increase in
performance
Improved organization
Increased clock frequency
Increase in parallelism
Pipelining
Superscalar
Simultaneous multithreading (SMT)
Diminishing returns
More complexity requires more logic
Increasing chip area for coordinating and signal transfer logic
- Harder to design, make and debug
Alternative Chip
Organizations
Intel Hardware
Trends
Increased Complexity
Power requirements grow exponentially with chip density
and clock frequency
Can use more chip area for cache
- Smaller
- Order of magnitude lower power requirements
By 2015
100 billion transistors on a 300 mm² die
- Cache of 100 MB
- 1 billion transistors for logic
Pollack's rule:
Performance is roughly proportional to the square root of the increase in
complexity (see the worked example below)
- Double complexity gives 40% more performance
Multicore has potential for near-linear improvement
Unlikely that one core can use all cache effectively
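Spelling out Pollack's rule as stated above (my arithmetic):

$$\text{Performance} \propto \sqrt{\text{complexity}} \quad\Longrightarrow\quad \frac{\text{Perf}(2C)}{\text{Perf}(C)} = \sqrt{2} \approx 1.4,$$

that is, doubling the logic in one large core buys roughly 40% more performance, whereas spending the same transistors on a second core can, in the best case, roughly double throughput; hence the near-linear potential of multicore.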
Power and Memory Considerations
Chip Utilization of Transistors
Software Performance Issues
Performance benefits dependent on effective exploitation
of parallel resources
Even small amounts of serial code impact performance
10% inherently serial on 8 processor system gives only 4.7 times
performance
Communication, distribution of work and cache
coherence overheads
Some applications effectively exploit multicore
processors
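The 4.7 times figure is just Amdahl's Law; as a quick check using the numbers on this slide (p = 0.9 parallelizable, N = 8 processors):

$$S = \frac{1}{(1-p) + p/N} = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7.$$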
Effective Applications for Multicore Processors
Database: servers handling independent transactions
Multi-threaded native applications (e.g., Lotus Domino, Siebel CRM)
Multi-process applications (e.g., Oracle, SAP, PeopleSoft)
Java applications: the Java VM is multi-threaded, with scheduling and memory
management (Sun's Java Application Server, BEA's WebLogic, IBM
WebSphere, Tomcat)
Multi-instance applications: one application running multiple times
Example: Valve game software
Multicore Organization
Number of core processors on chip
Number of levels of cache on chip
Amount of shared cache
Next slide examples of each organization:
(a) ARM11 MPCore (deleted)
(b) AMD Opteron (deleted)
(c) Intel Core Duo
(d) Intel Core i7
Multicore Organization Alternatives
Advantages of shared L2 Cache
Constructive interference reduces overall miss rate
Data shared by multiple cores not replicated at cache
level
With proper frame replacement algorithms, the amount
of shared cache dedicated to each core is dynamic
Threads with less locality can have more cache
Easy inter-process communication through shared
memory
Cache coherency confined to L1
Dedicated L2 cache gives each core more rapid access
Good for threads with strong locality
Shared L3 cache may also improve performance
Individual Core Architecture
Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multi-threading (SMT)
Scales up the number of threads supported
- 4 SMT cores, each supporting 4 threads, appear as 16 cores
Intel x86 Multicore Organization - Core Duo (1)
2006
Two x86 superscalar cores, shared L2 cache
Dedicated L1 cache per core
- 32 KB instruction and 32 KB data
Thermal control unit per core
- Manages chip heat dissipation
- Maximizes performance within thermal constraints
- Improved ergonomics
Advanced Programmable Interrupt Controller (APIC)
- Inter-processor interrupts between cores
- Routes interrupts to the appropriate core
- Includes a timer so the OS can interrupt a core
Intel x86 Multicore Organization - Core Duo (2)
Power Management Logic
Monitors thermal conditions and CPU activity
Adjusts voltage and power consumption
Can switch individual logic subsystems
2MB shared L2 cache
Dynamic allocation
MESI support for L1 caches
Extended to support multiple Core Duo in SMP
- L2 data shared between local cores or external
Bus interface
Intel Core Duo Block Diagram
Intel x86 Multicore Organization - Core i7
November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative pre-fetch for caches
On-chip DDR3 memory controller
Three 8-byte channels (192 bits) giving 32 GB/s
No front side bus
QuickPath Interconnect
Cache-coherent point-to-point link
High-speed communications between processor chips
6.4 G transfers per second, 16 bits per transfer
Dedicated bi-directional pairs
Total bandwidth 25.6 GB/s
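As a quick check of that figure, using the numbers above:

$$6.4~\text{GT/s} \times 2~\text{bytes/transfer} = 12.8~\text{GB/s per direction}, \qquad 12.8 \times 2~\text{directions} = 25.6~\text{GB/s}.$$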
Intel Core i7 Block Diagram
Performance Effect of Multiple Cores
Computer Architecture
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Multicore Xbox360 – “Xenon” processor
To provide game developers with a balanced and
powerful platform
Three SMT processors, 32 KB L1 D$ & I$, 1 MB UL2 cache
165M transistors total
3.2 GHz, near-POWER ISA
2-issue, 21-stage pipeline, with 128 128-bit registers
Weak branch prediction – supported by software hinting
In-order instruction execution
Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
An ATI-designed 500 MHz GPU w/ 512 MB of DDR3 DRAM
337M transistors, 10 MB framebuffer
48 pixel shader cores, each with 4 ALUs
Xenon Diagram
[Block diagram labels: Core 0, Core 1, Core 2, each with L1D and L1I caches; 1 MB UL2; XMA Dec; MC0 and MC1 to 512 MB DRAM; BIU/IO Intf; SMC; GPU with 3D Core and 10 MB EDRAM; Video Out via Analog Chip; peripherals: DVD, HDD Port, Front USBs (2), Wireless, MU ports (2 USBs), Rear USB (1), Ethernet, IR, Audio Out, Flash, Systems Control.]
The PS3 “Cell” Processor Architecture
Composed of a Non-SMP Architecture
234M transistors @ 4 GHz
1 Power Processing Element, 8 "Synergistic" (SIMD) PEs
512 KB L2 $ – a massively high-bandwidth (200 GB/s) bus connects
it to everything else
The PPE is strangely similar to one of the Xenon cores
- Almost identical, really. Slight ISA differences, and fine-grained MT
instead of real SMT
The real differences lie in the SPEs (21M transistors each)
- An attempt to 'fix' the memory latency problem by giving each
processor complete control over its own 256 KB "scratchpad" – 14M
transistors
– Direct mapped for low latency
- 4 vector units per SPE, 1 of everything else – 7M trans.
The PS3 “Cell” Processor Architecture
How to make use of the SPEs
What about the Software?
Makes use of special IBM “Hypervisor”
Like an OS for OSs
Runs both a real time OS (for sound) and non-real time (for
things like AI)
Software must be specially coded to run well
The single PPE will be quickly bogged down
Must make use of SPEs wherever possible
This isn’t easy, by any standard
What about Microsoft?
Development suite identifies which 6 threads you’re expected to
run
Four of them are DirectX based, and handled by the OS
Only need to write two threads, functionally
http://ps3forums.com/showthread.php?t=22858
Next Lecture and Reminders
Reminders
Final is Wednesday, May 2 from 1-2:50 PM in ITT 328