Distributed Shared Memory over IP Networks - Revisited
“Which would you rather use to plow a field: two strong oxen, or 1024 chickens?”
– Seymour Cray
Dan Gibson
Chuck Tsen
CS/ECE 757 Spring 2005
Instructor: Mark Hill
DSM over IP
Abstract a single, large, powerful machine from a group of many commodity machines
– Workstations can be idle for long periods
– Networking infrastructure already exists
– Increase performance through parallel execution
– Increase utilization of existing hardware
– Fault tolerance, fault isolation
Overview - 1
Source-level distributed shared memory system for x86-Linux
– Code written with the DSM/IP API
– Memory is identically named at the high level (C/C++)
Only explicitly shared variables are shared
User-level library checks coherence on each access
Similar to programming in an SMP environment
Overview - 2
Evaluated on a small (Ethernet) network of heterogeneous machines
– 4 x86-based machines of varying quality
– Small, simulated workloads
Application-dependent performance
– Communication latency must be tolerated
– Relaxed consistency models improve performance for some workloads
Heterogeneity vastly affects performance
– All machines operate as slowly as the slowest machine in barrier-synchronized workloads
– A Pentium 4 at 4.0 GHz is faster than a Pentium 3 at 600 MHz!
Motivation
DSMoIP is similar to building an SMP on an unordered interconnect
– Shared memory over message passing
– Race conditions possible
– Coherence concerns
DSMoIP offers a tangible result
– Runs on real HW
Challenges
IP on Ethernet is slow!
– Point-to-point peak bandwidth < 12 Mb/s
– Latency ~100 µs, RTT ~200 µs
Common case must be fast
– Software check of coherence on every access
– Many coherence changes require at least one RTT
IP semantics make correctness difficult
– Packets can be: dropped, delivered more than once, delivered very late, etc.
Related Projects
TreadMarks
– DSM over UDP/IP, page-level granularity, source-level compatibility, release consistency
Brazos (Rice)
– DSM using multicast IP, MPI compatibility, scope consistency, source-level compatibility
Plurix (Universität Ulm)
– Distributed OS for shared memory, Java-source compatible
More Related Projects
Quarks (Utah)
– DSM over IP, user-level package interface, source-level compatibility, many coherence options
SHRIMP (Princeton), StarDust (IRISA)
And more…
– DSM over IP is not a new idea
– Most have demonstrated speedup for some workloads
Implementation Overview
All participating machines have space reserved for each shared object at all times
– Each shared object has accompanying coherence data
Depending on coherence state, the object may or may not be accessible
– If permissions are needed on a shared object, network communication is initiated
– Identical semantics to a long-latency memory operation (sketched below)
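A minimal sketch of this access-check pattern. The names below (SharedInt, request_permission) are hypothetical; the slides do not show the library's internals:

    enum class Perm { None, Read, Write };

    struct SharedInt {
        int  value = 0;          // space reserved locally on every machine
        Perm perm  = Perm::None; // accompanying coherence data

        int Read() {
            if (perm == Perm::None) {
                // Miss: request read permission over the network.
                // This is where the long-latency round trip occurs.
                request_permission(Perm::Read);
                perm = Perm::Read;
            }
            return value;        // commit: semantics of a slow load
        }

        void Write(int v) {
            if (perm != Perm::Write) {
                // Upgrade: remote copies must be downgraded first.
                request_permission(Perm::Write);
                perm = Perm::Write;
            }
            value = v;           // commit: semantics of a slow store
        }

        // Stub: a real engine would block here until permission is
        // granted via the communication backbone.
        void request_permission(Perm) {}
    };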
Implementation Overview
Shared memory objects use altered syntax for declaration and access
User-level DSMoIP library, consisting of:
– Communication backbone
– Coherence engine
Programmer uses a small API for setup and synchronization
API – Accessing Shared Objects
SMP:                        DSMoIP:
  a = x + b;                  a = x.Read() + b;
  x = y - 1;                  x.Write(y.Read() - 1);
  z = array[i] - 7;           z = array.Read(i) - 7;
  array[i]++;                 array.Write(array.Read(i) + 1, i);

Variables x, y, z, and array are shared. All others are unshared.
API – Accessing Shared Objects
Naming of shared objects is syntactically different, but semantically unchanged
Use of .Read() and .Write() functions allows the DSM library to interpose if necessary
Use of C/C++ operator overloading could remove some of the changes in syntax (not implemented; see the sketch below)
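Since the slides note this was not implemented, the following is only an illustrative sketch of how a wrapper (the Shared<T> name is hypothetical) could hide the .Read()/.Write() calls:

    template <typename T>
    class Shared {
    public:
        // Reading the variable implicitly interposes the DSM library.
        operator T() const {
            // Real library: check coherence, fetch read permission if needed.
            return value_;
        }
        // Assigning to the variable implicitly interposes the DSM library.
        Shared& operator=(const T& v) {
            // Real library: acquire write permission before the store.
            value_ = v;
            return *this;
        }
    private:
        T value_{};  // local storage reserved for the shared object
    };

    // With the wrapper, DSMoIP code reads like the SMP version:
    //   Shared<int> x, y;  int a, b;
    //   a = x + b;    // implicitly x.Read() + b
    //   x = y - 1;    // implicitly x.Write(y.Read() - 1)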
Communication Backbone (CB)
A substrate on which to build coherence mechanisms
Provides primitives for communication (see the sketch below):
– Guaranteed delivery over an unreliable network
– At-most-once delivery
– Arbitrary message passing to a logical thread, or broadcast to all threads
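The slides do not show the backbone's internals; a minimal sketch of how such primitives are commonly layered on UDP (sequence numbers for duplicate suppression, retransmission until acknowledged), with all names assumed:

    #include <cstdint>
    #include <set>
    #include <utility>

    struct Message {
        uint32_t sender;  // logical thread ID of the sender
        uint32_t seq;     // per-sender sequence number
        // ... payload ...
    };

    class Backbone {
    public:
        // Receiver side: deliver each (sender, seq) pair at most once;
        // retransmitted duplicates are silently dropped.
        bool should_deliver(const Message& m) {
            return seen_.insert({m.sender, m.seq}).second;
        }
        // Sender side (outline): retransmit the datagram on a timer
        // until an ACK for (sender, seq) arrives -- this turns the
        // unreliable network into guaranteed delivery.
    private:
        std::set<std::pair<uint32_t, uint32_t>> seen_;
    };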
Coherence Engine (CE)
Uses primitives in the communication backbone to abstract shared memory from a message-passing system
Swappable, based on the needs of the application
– CE determines the consistency model
– CE plays a major role in performance
CE/CB Interaction Example
(Five diagram slides showing message exchanges between MACHINE 1 and MACHINE 2 for a shared object a, labeled with each machine's permission state:)
– Read miss: Machine 1 (a/None) sends a read-permission request to Machine 2 (a/Read); the Read() commits once permission arrives, leaving Machine 1 in Read.
– Write upgrade: both machines hold a/Read; Machine 1 requests write permission, Machine 2 downgrades to None, and the Write() commits, leaving Machine 1 in Write.
– Local hit: Machine 1 already holds a/Write, so the Read() commits immediately with no communication.
– Remote request: Machine 2 (a/None) requests write permission; Machine 1 (a/Write) is interrupted (INT/IRET), downgrades to None, and grants permission, leaving Machine 2 in Write.
– Read of remotely written data: Machine 1 (a/None) requests read permission; Machine 2 (a/Write) downgrades to Read and grants it, leaving both machines in Read.
Coherence Engines Implemented
ENGINE_NC
– Fast, simple engine that pays no attention to consistency
– Reads can occur at any time
– Writes are broadcast, can be observed by any processor in any order, and are not guaranteed to ever become visible at all processors
– Communicates only on writes; non-blocking
Coherence Engines Implemented
ENGINE_OU
– Owned/Unowned
– Sequentially-consistent engine
– Ensures SC by actually enforcing a total order of accesses
– Each shared object is valid at only one processor for the duration of execution
– Slow, cumbersome (nearly every access generates traffic)
– Blocking, high-latency reads and writes
Coherence Engines Implemented
ENGINE_MSI
– Based on the MSI cache coherence protocol (toy sketch below)
– Shared objects can exist at multiple locations in S state
– Writing is exclusive to a single, M-state machine
– Communicates as needed by a typical MSI protocol
– Sequentially consistent
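The slides do not include the engine's code; a toy sketch of MSI-style downgrades on remote requests (the function and state names are assumptions, not the authors' implementation):

    enum class State { M, S, I };  // Modified, Shared, Invalid

    // How a local copy reacts when another machine requests access
    // (cf. the INT/IRET diagram: remote requests interrupt the holder).
    State on_remote_request(State current, bool remote_wants_write) {
        if (remote_wants_write)
            return State::I;       // invalidate; M also supplies its data
        if (current == State::M)
            return State::S;       // downgrade and supply data; the reader
                                   // joins in S, so many S copies coexist
        return current;            // S stays S; I has nothing to do
    }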
Evaluation
DSM/IP evaluated using four heterogeneous x86/Linux machines
– Microbenchmarks test correctness
– Simulated workloads explore performance issues
Early results indicate that DSM/IP overwhelms a 100 Mb/s LAN
– 1–10k broadcast packets/second per machine
– TCP/IP streams in the subnet are suppressed by UDP packets in the absence of Quality of Service measures
– The process is self-throttling, but makes DSM/IP unattractive to network administrators
Measurement
Event counting
– Regions of the DSM library augmented with event counters
Timers
– gettimeofday(), setitimer() for low-granularity timing
– Built-in x86 cycle timers for high-granularity measurement (sketched below)
– Validated with gettimeofday()
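The timer code itself is not shown; reading the x86 time-stamp counter typically looks like the following sketch (GCC inline assembly; the validation mirrors the gettimeofday() cross-check described on the next slide):

    #include <cstdint>
    #include <cstdio>
    #include <sys/time.h>

    // Read the x86 time-stamp counter (cycles since reset).
    static inline uint64_t rdtsc() {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return (static_cast<uint64_t>(hi) << 32) | lo;
    }

    int main() {
        timeval t0, t1;
        gettimeofday(&t0, nullptr);
        uint64_t c0 = rdtsc();

        // ... region being measured ...

        uint64_t c1 = rdtsc();
        gettimeofday(&t1, nullptr);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_usec - t0.tv_usec);
        // Over a long run, (c1 - c0) / us should match the CPU clock
        // rate in MHz, validating the cycle timer.
        printf("%llu cycles in %.0f us\n",
               static_cast<unsigned long long>(c1 - c0), us);
        return 0;
    }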
Timer Validation
System-call-based timers assumed accurate
x86 cycle timers validated against gettimeofday() for long executions
Microbenchmarks
message_test
– Each thread sends a specified number of messages to the next thread
– Attempts to overload the network/backbone to produce an incorrect execution
barrier_test
– Each thread iterates over a specified number of barriers
– Attempts to cause failures in the DSM_Barrier() mechanism
lock_test
– All threads attempt to acquire a lock and increment a global variable (sketched below)
– Puts maximal pressure on the DSM_Lock primitive
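A sketch of what the lock_test loop might look like; the slides name the DSM_Lock primitive, but the exact signatures and the DSM_Unlock counterpart are assumptions:

    // Stubs standing in for the real primitives, which communicate
    // over the network to serialize lock holders across machines.
    static void DSM_Lock(int /*lock_id*/)   {}
    static void DSM_Unlock(int /*lock_id*/) {}

    static int counter = 0;  // stands in for a DSM shared object

    void lock_test(int nIncs, int lock_id) {
        for (int i = 0; i < nIncs; ++i) {
            DSM_Lock(lock_id);   // contended by every thread on every machine
            ++counter;           // really counter.Write(counter.Read() + 1)
            DSM_Unlock(lock_id);
        }
        // With N threads each doing nIncs increments, a correct run ends
        // with counter == N * nIncs; anything less is a lost update.
    }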
Microbenchmarks – Validation
message_test
– nMessages = 1M, N = [1,4], no lost messages (runtime ~2 minutes)
barrier_test
– nBarriers = 1M, N = [1,4], consistent synchronization (runtime ~4 minutes)
lock_test
– nIncs = 1M, ENGINE_OU (SC), N = [1,4], final value of the variable = N million (correct)
Simulated Workloads
Three simulated workloads (arranged on the slide along a “More Communication” axis)
– sea, similar to OCEAN from SPLASH-2
– genetic, a typical iterative genetic algorithm
– xstate, an exhaustive solver for complex expressions
Implementations are deliberately simplified, for quick development
Simulated Workloads - sea
Integer version of SPLASH-2’s OCEAN
– Simple iterative simulation, taking averages of surrounding points (sketched below)
– Uses a large integer array
– Coherence granularity depends on the number of threads
– Uses DSM_Barrier() primitives for synchronization
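A sketch of sea's presumed inner loop per the bullets above: a neighbor-averaging stencil over interior grid points, with a barrier between iterations (the grid layout, row partitioning, and DSM_Barrier stub are assumptions):

    #include <utility>

    static void DSM_Barrier() { /* real primitive: global synchronization */ }

    // Each thread owns rows [rowBegin, rowEnd) of a rows x cols grid
    // (interior rows only; boundary points are left untouched).
    void sea(int iters, int rowBegin, int rowEnd, int cols,
             int* grid, int* next) {
        for (int it = 0; it < iters; ++it) {
            for (int r = rowBegin; r < rowEnd; ++r) {
                for (int c = 1; c < cols - 1; ++c) {
                    int i = r * cols + c;
                    // Average of the four surrounding points.
                    next[i] = (grid[i - cols] + grid[i + cols]
                             + grid[i - 1]    + grid[i + 1]) / 4;
                }
            }
            DSM_Barrier();          // all threads finish this iteration
            std::swap(grid, next);  // before anyone reads the new grid
        }
    }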
Simulated Workloads - genetic
Multithreaded genetic algorithm
– Uses a distributed genetic process to “breed” a specific integer from a random population
– Iterates in two phases:
– Breeding: generate new potential solutions from the most fit members of the current population
– Survival of the fittest: remove the least fit solutions from the population
– Uses DSM_Barrier() for synchronization
Simulated Workloads - xstate
Integer solver for arbitrary expressions
– Uses space exploration to find fixed points of a hard-coded expression
– Employs a global list of solutions, protected with a DSM_Lock primitive
Methodology
Each simulated workload run with each coherence engine
Normalized parallel-phase runtime (versus the uniprocessor case) measured with high-resolution counters
– Does not include startup overhead for the DSM system (~1.5 seconds)
– Does not include other initialization overheads
Results - SEA
Engine   ABT      %Wait
OU       0.6 s    82%
NC       0.6 s    10%
MSI      0.05 s   15%
Results - GENETIC
Engine   ABT      %Wait
OU       0.02 s   20%
NC       0.02 s   9%
MSI      0.01 s   1%
Results - XSTATE
Engine   ABT      %Wait
OU       3.1 s    8%
NC       2.7 s    2%
MSI      2.6 s    9%
Observations
ENGINE_NC
– Speedup for all workloads (but no consistency)
– Scalability concerns: broadcasts on every write!
ENGINE_OU
– Speedup for the workload with the lightest communication load
– Scalability concerns: unicast on every read & write!
ENGINE_MSI
– Reduces network traffic & has speedup for some workloads
– Impressive performance on SEA
Effect of heterogeneity is significant!
– Est. fastest machine spends 25–35% of its time waiting for other machines to arrive at barriers
Conclusions
Performance of the DSM/IP implementation is marginal for small networks (N <= 4)
Coherence engine choice significantly affects performance
A naïve implementation of SC has far too much overhead
Conclusions
Common case might not be fast enough
– Best case: a single load or store becomes a function call + ~10 instructions
– Worst case: a single load or store becomes an RTT on the network (~1M-100M instruction times)
– Average case (depends on engine): a LD/ST becomes ~100-1K+ instruction times?
More Questions
How might other engine implementations behave?
– Is the communication substrate to blame for all performance problems?
How would DSM/IP perform for N = 8, 16, 64, 128?
Questions
SEA Demo
Extra Slides
Initial Synchronization
(Three diagram slides; no transcript text.)