COE 590 Special Topics: Parallel Architectures


Programming Multi-Core Processor based Embedded Systems
A Hands-On Experience on Cavium Octeon based Platforms
Lecture 1 (Overview)
Course Objectives

A hands-on opportunity to learn:
  Multi-core architectures; and
  Programming multi-core systems
Emphasis on programming:
  Using the multi-threading paradigm
  Understand the complexities
  Apply to generic computing/networking problems
  Implement on a popular embedded multi-core platform
Grading Policy and Reference Books

Grading Policy
  Lectures (40%)
  Labs (50%)
  Quizzes (daily) (10%)
Reference material:
  Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
  David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
  Class notes
Course Outline

Introduction
  Parallel architectures and terminology
  Context for current interest: multi-core processors
Programming paradigms for multi-core
Octeon processor architecture
Multi-threading on multi-core processors
Applications for multi-core processors
Application layer computing on multi-core
Performance measurement and tuning
An Introduction to Parallel
Computing in the context of
Multi-Core Architectures
Developing Software for Multi-Core:
A Paradigm Shift

Application developers are typically oblivious of the underlying hardware architecture
  Sequential program
  Automatic/guaranteed performance benefit with a processor upgrade
  No work required of the programmer
No "free lunch" with multi-core systems
  Multiple cores in modern processors
  Parallel programs needed to exploit parallelism
  Parallel computing is now part of the main-stream
Parallel Computing for Main-Stream:
Old vs. New Programming Paradigms

Known tools and techniques:
  High performance computing and communication (HPCC)
  Wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications
Multi-threading for multi-core
  Common in desktop and enterprise applications
  Exploits parallelism of multi-core with its challenges
New realizations of old paradigms:
  Parallel computing on Playstation 3
  Parallel computing on GPUs
  Cluster computing for large volume data
Dealing with the Challenge of Multi-Core
Programming with Hands-On Experience

Our objective is two-fold



Overview the known paradigms for background
Learn using the state-of-the-art implementations
Choice of platform for hands-on experience





Cavium Networks’ Octeon processor based system
Multiple cores (1 to 16)
Suitable for embedded products
Commonly used in networking products
Standard Linux based development environment
Agenda for Today

Parallel architectures and terminology
  Processor technology trends
  Architecture trends
  Taxonomy
Why multi-core architectures?
  Traditional parallel computing
  Transition to multi-core architectures
Programming paradigms
  Traditional
  Recent additions
Introduction to Octeon processor based systems
Architecture and Terminology
Background of parallel architectures
and commonly used terminology
Architectures and Terminology

Objectives of this section:



Understand the processor technology trends
Realize that parallel architectures evolve based on
technology and architecture trends
Terminology used in parallel computing




Von Neumann
Flynn’s taxonomy
Bell’s taxonomy
Other commonly used terminology
Processor Technology Trends
Processor technology evolution,
Moore’s law, ILP, and current trends
Processor Technology Evolution

Increasing number of transistors on a chip
  Moore's law: the number of transistors on a chip is expected to double every 18 months
  Chip densities are reaching their physical limits
  Technological breakthroughs have kept Moore's law alive
  Faster and smaller transistors, gates, and circuits on a chip
Increasing clock rates during the 90's
  Clock rates of microprocessors increase by ~30% per year
Benchmark (e.g., SPEC suite) results indicate performance improvement with technology
Moore’s Law

Gordon Moore, founder of Intel
  1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year; this trend would continue for the foreseeable future
  1975: revised to state that circuit complexity doubles every 18 months
This was simply a prediction
  Based on little data
  However, it has defined the processor industry
Moore’s
Original
Law (2)
ftp://download.intel.com/research/
Copyright © 2010
silicon/moorespaper.pdf
Cavium University
Program
1-8
KICS, UET
Moore’s Original Issues



Design cost: still valid
Power dissipation: still valid
What to do with all the functionality possible
Source: ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Law and Intel Processors
From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif
Good News:
Moore’s Law isn’t done yet
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Bad News:
Single Thread Performance is Falling Off
Worse News:
Power (normalized to i486) Trend
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Addressing Power Issues
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Architecture Optimized for Power:
a big step in the right direction
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Long term solution: Multi-Core
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Summary of Technology Trends


Moore’s law is still relevant
Need to deal with related issues




Design complexity
Power consumption
Uniprocessor performance is slowing down
Multiple processor cores resolve these issues



Parallelism at hardware level
End user is exposed to it
Added complexities related to programming such
systems
Taxonomy for Parallel
Architectures
Von Neumann, Flynn, and Bell’s
taxonomies and other common
terminology
Von Neumann Architecture Evolution
(Diagram) Evolution from the von Neumann architecture (scalar, sequential) through lookahead and I/E overlap to functional parallelism, realized either with multiple functional units or with pipelining. Pipelining leads to implicit and explicit vector machines (memory-to-memory and register-to-register), and onward to the SIMD branch (associative processors, processor arrays) and the MIMD branch (multicomputers, multiprocessors, massively parallel processors).
Pipelining and Parallelism


Instruction prefetch to overlap execution
Functional parallelism supported by:



Multiple functional units
Pipelining
Pipelining




Pipelined instruction execution
Pipelined arithmetic computations
Pipelined memory access operations
Pipelining is attractive for performing identical
operations repeatedly over vector data strings
Flynn’s Classification


Michael Flynn classified architectures in 1972
based on instruction and data streams
Single Instruction stream over a Single Data
stream (SISD)


Conventional sequential machines
Single Instruction stream over Multiple Data
streams (SIMD)

Vector computers are equipped with scalar and
vector hardware
Flynn’s Classification (2)

Multiple Instruction streams over Single Data
stream (MISD)



Same data flowing through a linear array of
processors
Aka systolic arrays for pipelined execution of
algorithms
Multiple Instruction streams over Multiple
Data streams (MIMD)

Suitable model for general purpose parallel
architectures
Bell’s Taxonomy for MIMD

Multicomputers
  Multiple address spaces
  System consists of multiple computers, called nodes
  Nodes are interconnected by a message-passing network
  Each node has its own processor, memory, NIC, and I/O devices
Multiprocessors
  Shared address space
  Further classified based on how memory is accessed:
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA)
    Cache-Only Memory Access (COMA)
    Cache-Coherent Non-Uniform Memory Access (cc-NUMA)
Multicomputer Generations

First generation (1983-87)
  Processor boards connected in a hypercube architecture
  Software-controlled message switching
  Examples: Caltech Cosmic, Intel iPSC/1
Second generation (1988-1992)
  Mesh connected architecture
  Hardware message routing
  Software environment for medium grain distributed computing
  Example: Intel Paragon
Third generation (1993-1997)
  Fine grain multicomputers
  Examples: MIT J-Machine and Caltech Mosaic
Multiprocessor Examples

Distributed memory (scalable)




Dynamic binding of address to processors (KSR)
Static binding, caching (Alliant, DASH)
Static program binding (BBN, Cedar)
Central memory (not scalable)


Cross-point or multi-stage (Cray, Fujitsu, Hitachi,
IBM, NEC, Tera)
Simple multi bus (DEC, Encore, NCR, Sequent,
SGI, Sun)
Supercomputers


Supercomputers use vector processing and data parallelism
Classified into two categories:
  Vector supercomputers
  SIMD supercomputers
    SIMD machines with massive data parallelism
    Instruction is broadcast to a large number of PEs
    Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)
Vector supercomputers



Machines with powerful vector processors
If a decoded instruction is a vector operation, it is sent to the vector unit
Register-register architecture:
  Fujitsu VP2000 series
Memory-to-memory architecture:
  Cyber 205
Pipelined vector supercomputers:
  Cray Y-MP
Dataflow Architectures

Represent computation as a graph of
essential dependences



Logical processor at each node, activated by
availability of operands
Message (tokens) carrying tag of next instruction
sent to next processor
Tag compared with others in matching store;
match fires execution
Dataflow Architectures (2)
Example expressed as a dataflow graph:
  a = (b + 1) × (b - c)
  d = c × e
  f = a × d
(Figure: dataflow graph of the example, and a dataflow processing element pipeline: token queue, waiting-matching token store, instruction fetch from the program store, execute, form token, and the interconnection network)
Systolic Architectures
(Figure: memories feeding a linear array of processing elements)
Replace a single processor with an array of regular processing elements
Orchestrate data flow for high throughput with less memory access
Systolic Architectures (2)

Different from pipelining




Nonlinear array structure, multidirection data flow,
each PE may have (small) local instruction and
data memory
Different from SIMD: each PE may do
something different
Initial motivation: VLSI enables inexpensive
special-purpose chips
Represent algorithms directly by chips
connected in regular pattern
Systolic Arrays (Cont’d)
• Example: systolic array for 1-D convolution
  y(i) = w1 × x(i) + w2 × x(i+1) + w3 × x(i+2) + w4 × x(i+3)
  Each cell holds a weight w and repeatedly computes:
    xout = x
    x = xin
    yout = yin + w × xin
  (Figure: x values x1..x8 streaming through cells holding w1..w4, with partial sums y1, y2, y3 flowing in the opposite direction; a sequential C reference follows below)
Practical realizations (e.g. iWARP) use quite general processors
  Enable a variety of algorithms on the same hardware
  But dedicated interconnect channels
    Data transfer directly from register to register across a channel
  Specialized, and same problems as SIMD
    General purpose systems work well for the same algorithms (locality etc.)
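For reference, here is a short sequential C sketch (not from the slides; the weights and input values are illustrative) of the same 1-D convolution that the systolic array computes cell by cell:

#include <stdio.h>

#define K 4              /* number of weights w1..w4          */
#define NX 8             /* input samples x(1)..x(8)          */
#define NY (NX - K + 1)  /* number of outputs y(1)..y(5)      */

int main(void)
{
    double w[K]  = {1.0, 2.0, 3.0, 4.0};
    double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[NY];

    /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
    for (int i = 0; i < NY; i++) {
        y[i] = 0.0;
        for (int k = 0; k < K; k++)
            y[i] += w[k] * x[i + k];
        printf("y(%d) = %g\n", i + 1, y[i]);
    }
    return 0;
}

The systolic array evaluates exactly this recurrence, but streams the x values through the cells so that each input is fetched from memory only once.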
Cluster of Computers

Started as a poor man's parallel system
  Inexpensive PCs
  Inexpensive switched Ethernet
  Run-time system to support message-passing
Low performance for HPCC applications
  High network I/O latency
  Low bandwidth
Suitable for high throughput applications
  Data center applications
  Virtualized resources
  Independent threads or processes
Summary of Taxonomy

Multiple taxonomies
  Based on functional parallelism
    Von Neumann and Flynn's taxonomies
  Based on programming paradigm
    Bell's taxonomy
Parallel architecture types
  Multi-computers (distributed address space)
  Multi-processors (shared address space)
  Multi-core
  Multi-threaded
  Others: vector, dataflow, systolic, and cluster
Why Multi-Core Architectures
Based on technology and architecture
trends
Multi-Core Architectures

Traditional architectures
  Sequential: Moore's law = increasing clock frequency
  Parallel: diminishing returns from ILP
Transition to multi-core
  Architecture: similar to SMPs
  Programming: typically SAS
Challenges to transition
  Performance = efficient parallelization
  Selecting a suitable programming paradigm
  Performance tuning
Traditional Parallel
Architectures
Definition and development tracks
Defining a Parallel Architecture

A sequential architecture is characterized by:
  Single processor
  Single control flow path
Parallel architecture:
  Multiple processors with an interconnection network
  Multiple control flow paths
  Communication and synchronization
A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast
Broad Issues in Parallel
Architectures

Resource allocation:
  how large a collection?
  how powerful are the elements?
  how much memory?
Data access, communication and synchronization
  how do the elements cooperate and communicate?
  how are data transmitted between processors?
  what are the abstractions and primitives for cooperation?
Performance and scalability
  how does it all translate into performance?
  how does it scale?
General Context: Multiprocessors


Multiprocessor is any
computer with several
processors
SIMD



Single instruction, multiple data
Modern graphics cards
MIMD

Multiple instructions, multiple data
(Figure: Lemieux cluster, Pittsburgh Supercomputing Center)
Architecture Development Tracks

Multiple-Processor Tracks





Shared-memory track
Message-passing track
Multivector and SIMD Tracks
Multithreaded and Dataflow Tracks
Multi-core track
Shared-Memory Track

Starts with the C.mmp system developed at CMU in 1972
  UMA multiprocessor with 16 PDP 11/40 processors
  Connected to 16 shared memory modules via a crossbar switch
  Pioneering multiprocessor OS (Hydra) development effort
Illinois Cedar (1987)
IBM RP3 (1985)
BBN Butterfly (1989)
NYU/Ultracomputer (1983)
Stanford/DASH (1992)
Fujitsu VPP500 (1992)
KSR1 (1990)
Message-Passing Track


The Cosmic Cube (1981) pioneered message-passing computers
Intel iPSCs (1983)
Intel Paragon (1992)
  Medium-grain multicomputers
nCUBE-2 (1990)
Mosaic (1992)
MIT/J Machine (1992)
  Fine-grain multicomputers
Multivector Track


CDC 7600 (1970)
Cray 1 (1978)



Cray Y-MP (1989)
Fujitsu, NEC, Hitachi models
CDC Cyber205 (1982)

ETA 10 (1989)
SIMD Track


Illiac IV (1968)
Goodyear MPP (1980)



DAP 610 (AMT, Inc. 1987)
CM2 (TMC, 1990)
BSP (1982)


IBM GF/11 (1985)
MasPar MP1 (1990)
Multi-Threaded Track



Each processor can execute multiple threads of
control at the same time
Multi-threading helps hide long latencies in building
large-scale multiprocessors
Multithreaded architecture was pioneered by Burton
Smith in HEP system (1978)



MIT/Alewife (1989)
Tera (1990)
Simultaneous Multi-Threading (SMT)


Two hardware threads
Available in Intel processors
Dataflow Track



Instead of control-flow in von Neumann
architecture, dataflow architectures are based
on dataflow mechanism
Dataflow concept was pioneered by Jack
Dennis (1974) with a static architecture
This concept later inspired dynamic dataflow


MIT tagged token (1980)
Manchester (1982)
Multi-Core Track






Intel Dual core processors
AMD Opteron
IBM Cell processor
SUN processors
Cavium processor
Freescale processors
Multi-Core processor is a special kind
of a Multiprocessor


All processors are on the same chip
Multi-core processors are MIMD:



Different cores execute different threads (Multiple
Instructions),
Operate on different parts of memory (Multiple
Data)
Multi-core is a shared memory
multiprocessor:

All cores share the same address space
Summary: Why Parallel
Architecture?

Increasingly attractive
  Economics, technology, architecture, application demand
  Readily available: multi-core processor based systems
Increasingly central and mainstream
Parallelism exploited at many levels
  Instruction-level parallelism
  Multiple threads of software
  Multiple cores
  Multiple processors
Focus of this class:
  Multiple cores and/or processors
  Programming paradigms
Example: Intel Pentium Pro Quad
(Figure: four Pentium Pro processor modules, each with a CPU, interrupt controller, 256-KB L2 cache, and bus interface, sharing the P-Pro bus with a memory controller, interleaved DRAM, and PCI bridges to PCI I/O cards)
  P-Pro bus: 64-bit data, 36-bit address, 66 MHz
  1-, 2-, or 4-way interleaved DRAM
All coherence and multiprocessing glue in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth
Example: SUN Enterprise
(Figure: up to 16 cards on the Gigaplane bus; CPU/memory cards carry two processors with L2 caches and a memory controller, I/O cards carry SBUS, FiberChannel, 100bT, and SCSI interfaces behind a bus interface/switch)
  Gigaplane bus: 256-bit data, 41-bit address, 83 MHz
16 cards of either type: processors + memory, or I/O
All memory accessed over the bus, so symmetric
Higher bandwidth, higher latency bus
Example: Cray T3E
(Figure: node with processor, cache, memory, and a combined memory controller and network interface, connected through X/Y/Z switches; external I/O)
Scales up to 1024 processors, 480 MB/s links
Memory controller generates communication requests for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)
Example: IBM SP-2
(Figure: SP-2 node with a Power 2 CPU, L2 cache, memory bus, 4-way interleaved DRAM and memory controller, plus a NIC with an i860 and DMA on the MicroChannel I/O bus; general interconnection network formed from 8-port switches)
Made out of essentially complete RS6000 workstations
Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)
Example Intel Paragon
(Figure: Paragon node with two i860 processors and L1 caches on a 64-bit, 50 MHz memory bus, a memory controller, DMA driver, 4-way interleaved DRAM, and a network interface; 2D grid network with a processing node attached to every switch; links are 8 bits wide, 175 MHz, bidirectional. Sandia's Intel Paragon XP/S-based supercomputer)
Transition to Multi-Core
Selected multi-core processors
Single to Multiple Core Transition
Multi-Core Processor Architectures
Source: Michael Perrone of IBM
IBM POWER 4




Shipped in Dec. 2001
Technology: 180 nm lithography
Dual processor cores
8-way superscalar
  Out of order execution
  2 load/store units
  2 fixed-point and 2 floating-point units
  >200 instructions in flight
Hardware instruction and data pre-fetch
IBM POWER 5





Shipped in Aug. 2004
Technology: 130 nm lithography
Dual processor cores
8-way superscalar
SMT core
  Up to 2 virtual cores per physical core
Natural extension to the POWER4 design
IBM Cell Processor


IBM, Sony, and Toshiba alliance in 2000
Based on the IBM Power architecture
  64-bit architecture
9 cores, 10 threads
  1 dual-thread Power Processor Element (PPE) for control
  8 Synergistic Processor Elements (SPEs) for processing
Up to 25 GB/s memory bandwidth
Up to 75 GB/s I/O bandwidth
>300 GB/s Element Interconnect Bus bandwidth
Source: Dr. Michael Perrone, IBM
Future Multi-Core Platforms
A Many Core Example:
Intel’s 80 Core Test Chip
Programming the Multi-Core
Programming, OS interaction,
applications, synchronization, and
scheduling
Parallel Computing is Ubiquitous

Over the next few years, all computing devices will
be parallel computers




Servers
Laptops
Cell phones
What about software?
  Herb Sutter of Microsoft wrote in Dr. Dobb's Journal: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software"
  Software performance will no longer increase from one generation to the next as hardware improves ... unless it is parallel software
Interaction with OS

OS perceives each core as a separate processor
  Same for an SMP processor
  Linux SMP kernel works on multi-core processors
OS scheduler maps threads/processes to different cores
  Migration is possible
  Also possible to pin processes to processors
Most major OSes support multi-core today: Windows, Linux, Mac OS X, ...
What Applications Benefit from
Multi-Core?






Database servers
Web servers (Web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM
  Each can run on its own core
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
More examples




Editing a photo while recording a TV show
through a digital video recorder
Downloading software while running an antivirus program
“Anything that can be threaded today will
map efficiently to multi-core”
BUT: some applications are difficult to parallelize
Programming for Multi-Core

Programmers must use threads or processes
  Threads are relevant for business/desktop apps
  Multiple processes for complex systems or SPMD based parallel applications
Spread the workload across multiple cores (a worked sketch follows after this slide)
  OS maps threads/processes to cores
  Transparent to the programmer
Write parallel algorithms
  True for scientific/engineering applications
  Programmer needs to define the mapping to cores
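As an illustration of spreading a workload across cores with threads, here is a minimal POSIX threads sketch (an assumed example, not from the lecture) that divides an array sum among a fixed number of worker threads and leaves it to the OS scheduler to map the threads onto cores:

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

/* Each thread sums its own contiguous slice of the array. */
static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}

Compile with gcc -pthread. The same structure applies whether the threads end up on an SMP or on the cores of a single multi-core processor.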
Thread Safety is very important



Pre-emptive context switching:
context switch can happen AT ANY TIME
True concurrency, not just uniprocessor timeslicing
Concurrency bugs exposed much faster with
multi-core
However: Need to use synchronization even if
only time-slicing on a uniprocessor
int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
Need to use synchronization even if only time-slicing
on a uniprocessor
Interleaving 1:
    temp1 = counter;
    counter = temp1 + 1;
    temp2 = counter;
    counter = temp2 + 1;
gives counter = 2

Interleaving 2:
    temp1 = counter;
    temp2 = counter;
    counter = temp1 + 1;
    counter = temp2 + 1;
gives counter = 1

(A mutex-based fix is sketched below.)
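A minimal sketch of the fix using a Pthreads mutex (assuming POSIX threads; thread creation code is illustrative): both increments happen inside the same critical section, so the lost-update interleaving can no longer occur and the final value is always 2.

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* The read-modify-write is protected by the lock, so the two
   updates can no longer interleave. */
void *increment(void *arg)
{
    pthread_mutex_lock(&counter_lock);
    int temp = counter;
    counter = temp + 1;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* always 2 */
    return 0;
}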
Assigning Threads to the Cores


Each thread/process has an affinity mask
Affinity mask specifies what cores the thread
is allowed to run on

Different threads can have different masks

Affinities are inherited across fork()
Affinity Masks are Bit Vectors

Example: 4-way multi-core, without SMT
  Mask 1101 (core 3 = 1, core 2 = 1, core 1 = 0, core 0 = 1)
  Process/thread is allowed to run on cores 0, 2, 3, but not on core 1
  (a small C sketch follows)
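An illustrative C fragment (not from the slides) writing the same mask as a bit vector, one bit per core:

#include <stdio.h>

int main(void)
{
    /* One bit per core: bit i == 1 means "may run on core i". */
    unsigned long mask = (1UL << 3) | (1UL << 2) | (1UL << 0);  /* binary 1101 */

    for (int core = 3; core >= 0; core--)
        printf("core %d: %s\n", core,
               ((mask >> core) & 1UL) ? "allowed" : "not allowed");
    return 0;
}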
Affinity Masks when Multi-Core and
SMT are Combined


Separate bits for each simultaneous thread
Example: 4-way multi-core, 2 threads per core
  Mask 11 00 10 11 (one pair of bits per core, thread 1 and thread 0, from core 3 down to core 0)
  Core 2 can't run the process
  Core 1 can only use one simultaneous thread
Default Affinities



Default affinity mask is all 1s:
all threads can run on all processors
Then, the OS scheduler decides what threads
run on what core
OS scheduler detects skewed workloads,
migrating threads to less busy processors
Process Migration is Costly




Need to restart the execution pipeline
Cached data is invalidated
OS scheduler tries to avoid migration as much as possible:
it tends to keep a thread on the same core
This is called soft affinity
Hard Affinities


The programmer can prescribe their own
affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to
When to Set your own Affinities

Two (or more) threads share data structures in memory
  Map them to the same core so that they can share the cache
Real-time threads
  Example: a thread running a robot controller
    must not be context switched, or else the robot can go unstable
    dedicate an entire core just to this thread
Source: Sensable.com
Kernel Scheduler API
#include <sched.h>

int sched_getaffinity(pid_t pid,
                      unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process 'pid' and stores it into the space pointed to by 'mask'.
'len' is the system word size: sizeof(unsigned long)
Kernel Scheduler API
#include <sched.h>

int sched_setaffinity(pid_t pid,
                      unsigned int len, unsigned long *mask);

Sets the current affinity mask of process 'pid' to *mask.
'len' is the system word size: sizeof(unsigned long)
To query the affinity of a running process:

$ taskset -p 3935
pid 3935's current affinity mask: f

(A usage sketch with the cpu_set_t interface follows.)
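As a hedged usage sketch: current glibc exposes these calls through cpu_set_t and the CPU_ZERO/CPU_SET/CPU_ISSET macros rather than the raw unsigned long form shown above. The fragment below pins the calling process to core 0 and then reads the mask back:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);            /* clear the affinity mask */
    CPU_SET(0, &mask);          /* allow core 0 only       */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        printf("allowed on core 0: %d\n", CPU_ISSET(0, &mask) ? 1 : 0);
    return 0;
}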
Windows Task Manager
(Figure: Windows Task Manager CPU usage graphs, one per core)
Summary

Reasons for multi-core based systems:
  Processor technology trends
  Architecture trends and diminishing returns
  Application trends
Background of using multi-core systems:
  Traditional parallel architectures
  Experience with parallelization
Transition to multi-core is happening
Challenge: programming the multi-core
Programming Paradigms
Traditional as well as new additions
Programming Paradigms

In the context of multi-core systems
  Architecture is similar to SMPs
  SAS programming works
  However, message-passing is also possible
Traditional paradigms and their use for parallelization
  Explicit message passing (MP)
  Shared address space (SAS)
  Multi-threading as a parallel computing model
New realizations of traditional paradigms
  Multi-threading on Sony Playstation 3
  MapReduce
  CUDA
Traditional Programming Models
David E. Culler and Jaswinder Pal
Singh, Parallel Computer Architecture:
A Hardware/Software Approach,
Morgan Kaufmann, 1998
Parallel Programming Models

Message passing




Each process uses its private address space
Access to non-local address space through explicit
message passing
Full control of low level data movement
Data parallel programming


Extension of message passing
Parallelism based on data decomposition and
assignment
Parallel Programming Models (2)

Shared memory programming




Multi-threading



Single address space (SAS)
Hardware manages low level data movement
Primary issue is coherence
An extension of SAS
Commonly used on SMP as well as UP systems
Libraries and building blocks


Often specific to architectures
Help reduce the development effort
Message Passing

Message passing operations


Point-to-point unicast
Collective communication


Efficient collective communication can greatly enhance
performance
Characteristics of message-passing:



Programmer’s control over interactions
Customizable to leverage hardware features
Tunable to enhance performance
Distributed Memory Issues

Load balancing
  Programmer has explicit control over load balancing
  High performance depends on keeping processors busy
Data locality
  Required data should be kept locally (decomposition and assignment)
Data distribution
  Data distribution affects the computation-to-communication ratio
Low level data movement and process control
  Data movements result from the algorithmic structure
Data Movement and Process Control

Types of low level data movements
  Replication
  Reduction
  Scatter/gather
  Parallel prefix, segmented scan
  Permutation
Data movement results in communication
  High performance depends on efficient data movement
  Overhead results from processes waiting for data
Process control involves
  Barrier synchronization
  Global conditionals
Message Passing Libraries

Motivation





Portability: application need not be reprogrammed
on a new platform
Heterogeneity
Modularity
Performance
Implementations


PVM, Express, P4, PICL, MPICH, etc.
MPI has become a de facto API standard
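A minimal MPI sketch in C (illustrative, using only standard MPI calls) showing the point-to-point unicast described above; run with mpirun -np 2 or more:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (size >= 2) {
        if (rank == 0) {
            value = 42;
            /* point-to-point unicast: rank 0 sends to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}

Each rank runs in its own private address space; the only way data moves between ranks is through explicit calls such as MPI_Send and MPI_Recv.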
Data Parallel Programming


This is an extension of message passing
based programming
Instead of explicit message passing,
parallelism is implemented implicitly in
languages


Parallelism comes from distributing data structures
and control
Implemented through hints for compilers
Data Parallel Programming (2)

Motivations




Explicit message passing programs are difficult to
write and debug
Data distribution results in computation on local
data
Useful for massively parallel distributed-memory
as well as cluster of workstations
Developed as extensions to existing
languages

Fortran, C/C++, Ada, etc.
High Performance Fortran (HPF)

HPF defines language extensions to Fortran



To support data parallel programming
Compiler generates a message-passing program
Features include:

Data distribution and alignment


Purpose is to reduce inter-processor communication
DISTRIBUTE, ALIGN, and PROCESSORS directives
HPF (2)

Features (cont'd):
  Parallel statements
    Provide the ability to express parallelism
    FORALL, INDEPENDENT, and PURE directives
  Extended intrinsic functions and standard library
  EXTRINSIC procedures
    Mechanism to use efficient machine-specific primitives
Limitations
  Useful only for Fortran code
  Not useful outside of the HPCC domain
Shared Memory Programming

Single address space view of the entire memory



Logical for the programmer
Easier to use
Parallelism comes from loops



Require loop level dependence analysis
Various iterations of a loop that are independent wrt index
can be scheduled on different processors
Fine grained parallelism

Requires hardware support for memory consistency to avoid
overhead
Shared Memory Programming (2)

Performance tuning is non-trivial
  Data locality is critical
  Increase the grain size (and reduce overhead) by parallelizing the outer-most loop
    Requires inter-procedural dependence analysis
    Very few parallelizing compilers can analyze and parallelize even reasonably sized application codes
Simplifies parallel programming
  Sequential code is directly usable with compiler hints for parallel loops
OpenMP standardizes the compiler directives for shared memory parallel programming (see the sketch after this slide)
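A minimal OpenMP sketch in C (illustrative; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp) showing loop-level shared-address-space parallelism expressed through a compiler directive over independent iterations:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Iterations are independent with respect to the loop index,
       so the directive simply splits them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (up to %d threads)\n",
           c[N - 1], omp_get_max_threads());
    return 0;
}

The sequential loop remains usable as-is; removing the pragma gives back the original single-threaded program, which is exactly the "compiler hints" style of shared memory programming described above.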
Libraries and Building Blocks

Pre-implemented parallel code that may work
as kernels of larger well-known parallel
applications


Several numerical methods are reusable in various
applications;
Example:



linear equation solver,
partial differential equations,
matrix based operations etc.
Libraries and Building Blocks (2)

Low level operations, such as data
movement, can be used for efficient matrix
operations




Matrix transpose
Matrix multiply
LU decomposition
These building blocks can be implemented in
distributed-memory numerical libraries

Examples:


ScaLAPACK: dense and banded matrices
Massively parallel LINPACK
Summary: Traditional Paradigms

Message-passing based programming




Suitable for distributed memory multi-computers
Multiple message-passing libraries exist
MPI is a de facto standard
Shared address space programming




Suitable for SMPs
Based on low-granularity loop-level parallelism
Implemented through compiler directives
OpenMP is a de facto standard
Threading as a parallel
programming model
Types of applications and threading
environments
Parallel Programming with Threads

Two types of applications:
  Compute-intensive applications: HPCC systems
  Throughput-intensive applications: distributed systems
Threads help in different ways
  By dividing data: reduces work per thread
  By distributing work: enables concurrency
Programming languages that support threads
  C/C++ or C#
  Java
  OpenMP directives for C/C++ code
Virtual Environments

VMs and platforms

Runtime virtualization

System virtualization
Systems with Virtual Machine Monitors
Summary: Threading for Parallelization

Threads can exploit parallelism on multi-core
  Scientific and engineering applications
  High throughput applications
Shared memory programming realized through:
  Pthreads
  OpenMP
Challenges of multi-threading:
  Synchronization
  Memory consistency is the dark side of shared memory
Octeon Processor Architecture
Target system processor and system architecture and programming environment
Cavium Octeon Processor





MIPS64 architecture
Up to 16 cores
Shared L2 cache
Security accelerator
Used in network security gateways
Source: http://www.caviumnetworks.com/OCTEON_CN38XX_CN36XX.html
Key Takeaways for Today’s Session

Multi-core systems are here to stay
  Processor technology trends
  Application performance requirements
Programming is a challenge
  No "free lunch"
  Programmers need to parallelize
  Extra effort to tune the performance
Programming paradigms
  Message passing
  Shared address space based programming
  Multi-threading: the focus of this course