
Parallel Programming
Techniques to use multiple computers that coordinate to solve problems
• Categories
– Hardware parallelism
– Fine Grain parallelism (instruction level)
– Coarse Grain parallelism (block level)
• Parallel Systems
– Shared Memory Systems
• Uniform Memory Access (UMA)
• Cache Only Memory Architecture (COMA)
• Non-uniform Memory Access (NUMA)
– Distributed Memory Systems
• Beowulf Clusters
• Networks or Clusters of Workstations
– Computational Grids
• Heterogeneous and Geographically Separated
History
• Early Systems (1950s and 1960s)
– Slow, High Failure Rates, Limited Operating Systems
– Programmer Controlled
– Poor resource utilization
• Pre-PC (1960s and 1970s)
– Medium numbers of systems
– Operating System Development
• Multi-thread and Multi-programming, Time Sharing
– Good resource utilization
• Today
– Billions of systems
– Single-user workstations
– Poor resource utilization
• Future (Parallel Resource Utilization)
Hardware Configurations
• Flynn Categories
– SISD (Single Instruction, Single Data)
– MIMD (Multiple Instruction, Multiple Data)
– SIMD (Single Instruction, Multiple Data)
– MISD (Multiple Instruction, Single Data)
• Within MIMD
– SPMD
– MPMD
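As a minimal sketch of the SPMD style in Python (the data, worker count, and function names are illustrative, not from the slides): every worker executes the same program, but on its own partition of the data.

    # SPMD: one program, many data partitions.
    from multiprocessing import Pool

    def worker(args):
        rank, chunk = args
        # Identical instructions on every worker; only the data differs.
        return rank, sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(16))
        p = 4                                    # number of workers
        chunks = [data[i::p] for i in range(p)]  # strided partition
        with Pool(p) as pool:
            for rank, partial in pool.map(worker, enumerate(chunks)):
                print(f"worker {rank}: partial sum of squares = {partial}")

An MPMD version would instead launch different programs on different processors.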
[Diagram: Shared Memory Multiprocessor: processors P0-P3 connected to a single shared memory]
• Multiprocessor Systems
– Threads, Critical Sections
• Multi-computer Systems
– Message Passing
– Sockets, RMI
[Diagram: Distributed Memory Multi-computer: processors P0-P3, each with a local memory M0-M3, connected by a network]
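A minimal sketch of the threads-and-critical-sections mechanism above, using Python's standard threading module (counts are arbitrary): the lock makes the shared-counter update a critical section, so only one thread mutates it at a time.

    import threading

    counter = 0              # shared state in one address space
    lock = threading.Lock()  # guards the critical section

    def increment(n):
        global counter
        for _ in range(n):
            with lock:       # critical section: one thread at a time
                counter += 1

    threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)           # 400000; without the lock, often less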
Parallel Techniques
• Peer-to-peer
– Multiple computers coordinate to run a single application
– Mechanisms
• Threads, Message Passing
• Client-server
– Server responds to many clients running many applications
– Mechanisms
• Remote Method Invocation
• Sockets
The focus of this class is peer-to-peer applications.
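A minimal message-passing sketch over sockets in Python (the loopback address, port, and payload are made up for illustration): one thread acts as the receiving peer, the main thread as the sending peer.

    import socket
    import threading

    HOST, PORT = "127.0.0.1", 50007           # illustrative endpoint
    ready = threading.Event()

    def receiver():
        with socket.socket() as s:            # TCP listener
            s.bind((HOST, PORT))
            s.listen(1)
            ready.set()                       # listener is up
            conn, _ = s.accept()
            with conn:
                msg = conn.recv(1024)         # receive one message
                conn.sendall(b"ack: " + msg)  # reply

    t = threading.Thread(target=receiver)
    t.start()
    ready.wait()

    with socket.socket() as c:                # sending peer
        c.connect((HOST, PORT))
        c.sendall(b"hello")
        print(c.recv(1024))                   # b'ack: hello'
    t.join()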
Parallel Performance Metrics
• Complexity (Big Oh Notation)
– f(n) = O(g(n)) if there exist constants c > 0 and z such that f(n) ≤ c·g(n) for all n > z
• Speed-up: s(p) = t1/tp
• Cost: C(p) = tp·p
• Efficiency: E(p) = t1/C(p) (see the worked example after this list)
• Computation to Communication Ratio: tcomp/tcomm
• Latency and Bandwidth Analysis
– tcomm = a function of the data transmitted and the number of messages
– tp = function of computation and communication
• Scalability
– An imprecise term for the ability to add more processors
– We might say an application scales well to 256 processors
– Scalability can refer to hardware, software, or both
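A worked example of these metrics with made-up timings: suppose a job takes t1 = 64 s sequentially and tp = 10 s on p = 8 processors.

    # Illustrative numbers, not measurements.
    t1, tp, p = 64.0, 10.0, 8

    s = t1 / tp    # speed-up   s(p) = t1/tp    -> 6.4
    C = tp * p     # cost       C(p) = tp*p     -> 80.0
    E = t1 / C     # efficiency E(p) = t1/C(p)  -> 0.8

    print(f"speed-up = {s}, cost = {C}, efficiency = {E:.0%}")

Note that E(p) = s(p)/p, so efficiency below 1 means some processor time is lost to communication or idling.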
Superlinear speed-up (s(p)>p)
Reasons for:
1. Non-optimal sequential algorithm
a. Solution: Compare to an optimal sequential algorithm
b. Parallel versions are often different from sequential versions
2. Specialized hardware on certain processors
a. Processor has fast graphics but computes slowly
b. NASA superlinear distributed grid application
3. Average case doesn’t match single run
a. Consider a search application
b. What are the speed-up possibilities?
Speed-up Potential
• Amdahl’s “pessimistic” law
– The fraction of sequential processing (f) is fixed
– S(p) = ts / (f·ts + (1−f)·ts/p) → 1/f as p → ∞
• Gustafson’s “optimistic” law
– Greater data implies the parallel portion grows
– Assumes more capability leads to more data
– S(p) = f + (1−f)·p
• For each law
– What is the best speed-up if f=.25?
– What is the speed-up for 16 processors?
• Which assumption is valid? (see the sketch below)
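A short sketch answering the two questions for f = 0.25 (a minimal calculation following the formulas above): Amdahl's best case is 1/f = 4, while Gustafson's law grows without bound.

    def amdahl(f, p):
        # S(p) = ts / (f*ts + (1-f)*ts/p) = 1 / (f + (1-f)/p)
        return 1.0 / (f + (1.0 - f) / p)

    def gustafson(f, p):
        # S(p) = f + (1-f)*p
        return f + (1.0 - f) * p

    f = 0.25
    print(amdahl(f, 16))     # ~3.37, approaching the 1/f = 4 ceiling
    print(gustafson(f, 16))  # 12.25, no ceiling as p grows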
Parallel Programming Examples
• Embarrassingly Parallel Applications
– Google searches employ more than 100,000 processors
• Applications with unacceptable sequential run times
• Grand Challenge Problems
– 1991 High Performance Computing Act (Public Law 102-194): “Fundamental Grand Challenge science and engineering problems with broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources.”
• Promotes terascale-level computation over high-bandwidth wide-area computational grids
Grand Challenge Problem Examples
• Associations
– Computing Research Association (CRA)
– National Science Foundation (NSF)
Partial Grand Challenge Problem List
1. Predict contaminant seepage
2. Predict airborne contaminant effects
3. Gene sequence discovery
4. Short term weather forecasts
5. Predict long term global warming
6. Predict earthquakes and volcanoes
7. Predict hurricanes and tornadoes
8. Automate natural language understanding
9. Computer vision
10. Nanotechnology
11. Computerized reasoning
12. Protein mechanisms
13. Predict asteroid collisions
14. Sub-atomic particle interactions
15. Model biomechanical processes
16. Manufacture new materials
17. Fundamental nature of matter
18. Transportation patterns
19. Computational fluid dynamics
Hardware Configurations
• Shared Memory Systems
– Do not scale to high numbers of processors
• Distributed Memory Systems
– Goal: Minimize connections but maintain speed
– A network’s topology identifies its configuration
– Network topology terminology
• Latency and bandwidth
• Total edges and degree
• Connectivity and bisection width
• Diameter
• Dilation
Popular Network Topologies
• Fully Connected and Star
• Line and Ring
• Tree and Fat Tree
• Mesh and Torus
• Hypercube
• Hybrids
– Pyramid
• Multi-stage
– Myrinet
Fully Connected and Star
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
Line and Ring
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
Tree and Fat Tree
In a fat tree, links get fatter toward the root: the edges connecting a node at level k to level k−1 are twice the number of edges connecting a node at level k−1 to level k−2 (illustrated in the sketch after the questions below).
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
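A small sketch of the doubling rule, assuming a binary fat tree with the root at level 0 and leaves at level d (this level numbering is an assumption for illustration): per-node uplink multiplicity doubles toward the root, so every level carries the same total number of links.

    # Binary fat tree, root at level 0, leaves at level d (assumed layout).
    d = 3
    for level in range(1, d + 1):
        nodes = 2 ** level          # nodes at this level
        uplinks = 2 ** (d - level)  # parallel edges up to the parent
        print(level, nodes, uplinks, nodes * uplinks)  # total is 2**d at every level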
Mesh and Torus
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
Hypercube
• A hypercube of degree zero is a single node
• A hypercube of degree d is two hypercubes of degree d−1, with edges connecting the corresponding nodes
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
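Following the recursive definition, a hypercube of degree d has 2^d nodes, and the questions above have closed-form answers; a pure-Python sketch that tabulates them:

    # Standard properties of a degree-d hypercube (2**d nodes).
    def hypercube(d):
        nodes = 2 ** d
        return {
            "nodes": nodes,
            "degree": d,                    # one neighbor per dimension
            "connectivity": d,              # d disjoint paths between any two nodes
            "edges": d * nodes // 2,        # d * 2**(d-1)
            "bisection_width": nodes // 2,  # 2**(d-1) edges cross the cut
            "diameter": d,                  # flip one address bit per hop
        }

    for d in range(1, 5):
        print(d, hypercube(d))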
Pyramid
A hybrid network combining a mesh and a tree
• Degree?
• Connectivity?
• Total edges?
• Bisection width?
• Diameter?
Routing Techniques
• Packet Switching
– A message is divided into pieces (“packets”) and routed separately; the packets are reassembled at the sink
• Deadlock Free
– Guarantees that multiple messages are never partially routed in such a way that resources will never become available to complete the journey
• Store and Forward
– Messages are stored at each node before transmission continues to the next node
• Cut Through
– The routing algorithm establishes the entire path of transmission before communication begins
• Wormhole Routing
– Each flit (a couple of bits) is held at a node until the next node in the route becomes available; a flit, and those stored in predecessor nodes, move to the next node as soon as it is available
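A rough latency model (simplified, with made-up numbers) contrasting store-and-forward with cut-through routing: store-and-forward pays the full message transmission time at every hop, while cut-through pays a small per-hop header delay plus the message time once.

    # Simplified model; time units are arbitrary but consistent.
    def store_and_forward(hops, msg_time):
        # The whole message is received and retransmitted at each link.
        return hops * msg_time

    def cut_through(hops, msg_time, header_time):
        # The header reserves the path hop by hop; the body is pipelined behind it.
        return hops * header_time + msg_time

    hops, msg_time, header_time = 8, 100.0, 1.0
    print(store_and_forward(hops, msg_time))         # 800.0
    print(cut_through(hops, msg_time, header_time))  # 108.0

Wormhole routing follows the same pipelined pattern but reserves buffers flit by flit rather than the whole path at once.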
Question: What topology is Ethernet or Token Ring?
Myrinet
• A proprietary interconnection technology (http://www.myri.com).
• Supports 2.6 Gb/second bandwidth with 5 microsecond latency
• Myrinet employs a dynamic switch based technology
– point-to-point, full duplex links to processors or to switches.
– Arbitrary parallel topologies using routing chip settings
• Myrinet uses a lightweight transmission protocol with cut-through routing
– Links thousands of processors without TCP/IP limitations
– Routing transparently handled by hardware
– Automatic alternative hardware routing if necessary
• Transmitters can embed TCP/IP messages to maximize flexibility
• Gateway devices for wide area heterogeneous networks
[Diagram: Myrinet network; rectangles are processors, circles are switches]
Web-based Networks
• Generally uses the TCP/IP protocol
• The number of hops between nodes is not constant
• Communication incurs high latencies
• Nodes are scattered over large geographical distances
• High bandwidths are possible after a connection is established
• The slowest link along the path limits speed
• Resources are highly heterogeneous
• Security becomes a major concern; proxies are often used
• Encryption algorithms can require significant overhead
• Subject to local policies at each node
• Example: www.globus.org