Memory Models: A Case for Rethinking
Parallel Languages and Hardware
COE 502/CSE 661 Fall 2011
Parallel Processing Architectures
Professor: Muhamed Mudawar, KFUPM
Presented by: Ahmad Hammad
Outline
• Abstract
• Introduction
• Sequential Consistency
• Data-Race-Free
• Industry Practice and Evolution
• Lessons Learned
• Implications for Languages and Hardware
• Conclusion
Abstract
• Writing correct parallel programs remains far
more difficult than writing sequential programs.
• Most parallel programs are written using a
shared-memory approach.
• The memory model specifies the meaning of
shared variables.
– It involves a tradeoff between programmability and
performance.
– It is one of the most challenging areas in both hardware
architecture and programming language specification.
Abstract
• Recent broad community-scale efforts have
finally led to a convergence in this debate:
– popular languages such as Java and C++, and
– most hardware vendors, have published compatible
memory model specifications.
• Although this convergence is a dramatic
improvement, it has exposed fundamental
shortcomings that prevent achieving the vision of
structured and safe parallel programming.
Abstract
• This paper discusses the path to the above
convergence, the hard lessons learned, and their
implications.
• A cornerstone of this convergence has been the
view that the memory model should be a
contract between the programmer and the system:
– if the programmer writes disciplined (data-race-free)
programs, the system will provide high
programmability (sequential consistency) and
performance.
Abstract
• We discuss why this view is the best we can do
with current popular languages, and why it is
inadequate moving forward.
• In particular, we argue that parallel languages
should not only promote high-level disciplined
models, but they should also enforce the
discipline.
• For scalable and efficient performance, hardware
should be co-designed to take advantage of and
support such disciplined models.
Introduction
• Most parallel programs today are written
using threads and shared variables
• There are a number of reasons why threads
remain popular.
– Even before the dominance of multicore, threads
were already widely supported by most operating
systems because they are also useful for other purposes.
Introduction
• Direct HW support for shared-memory
provides a performance advantage
• The ability to pass memory references among
threads makes it easier to share complex data
structures.
• Shared memory makes it easier to selectively
parallelize application hot spots without a
complete redesign of data structures.
Memory Model
• The memory model is at the heart of the concurrency
semantics of a shared-memory program.
• It defines the set of values a read in a
program can return
• It answers questions such as:
– Is there enough synchronization to ensure a
thread’s write will occur before another’s read?
– Can two threads write to adjacent fields in a
memory location at the same time?
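The second question above can be made concrete with a hedged Java sketch (the class and field names here are mine, not from the paper): two threads write to adjacent fields of the same object. The Java memory model treats distinct fields as distinct locations, so these writes do not race and neither write may clobber the other, even if the fields share a machine word or cache line.

```java
public class AdjacentFields {
    static class Pair { int f1; int f2; }

    static int[] run() {
        Pair p = new Pair();
        // Two threads write to adjacent fields of the same object.
        Thread t1 = new Thread(() -> p.f1 = 1);
        Thread t2 = new Thread(() -> p.f2 = 2);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        // join() establishes happens-before, so both writes are visible here,
        // and the model guarantees neither write overwrote the other field.
        return new int[] { p.f1, p.f2 };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]); // prints "1 2"
    }
}
```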
Memory Model
• We need an unambiguous memory model.
• It defines an interface between a program and
any hardware or software that may transform
that program:
– the compiler, the virtual machine, or any dynamic
optimizer.
• A complex memory model makes
– parallel programs difficult to write, and
– parallel programming difficult to teach.
Memory Model
• Since it is an interface property, the memory model
– has a long-lasting impact, affecting the portability and
maintainability of programs;
– an overly constraining one may limit hardware and
compiler optimization, reducing performance.
Memory Model
• The central role of the memory model has
often been downplayed
– Specifying a model that balances all desirable
properties has proven surprisingly complex
• programmability
• performance
• portability.
Memory Model
• In the 1980s and 1990s, the area received
attention primarily in the hardware
community
– which explored many approaches, with little
consensus
– the challenge for HW architects was the lack of a clear
MM at the programming language level
– Unclear what programmers expected hardware to
do.
Memory Model
• Hardware researchers proposed approaches to
bridge this gap, but there was no wide consensus
from the software community.
• Before 2000
– few programming environments addressed the issue
clearly
– widely used environments had unclear and
questionable specifications
– Even when specifications were relatively clear, they
were often violated to obtain sufficient performance,
– tended to be misunderstood even by experts, and
were difficult to teach.
Memory Model
• Since 2000, there have been efforts to cleanly specify
programming-language-level memory models,
– first for Java and C++, with efforts now underway to
adopt similar models for C and other languages.
– In the process, hardware that had evolved in the
interim made it difficult to reconcile the
need for a simple and usable programming model
with that for adequate performance on existing
hardware.
Memory Model
• Today, the above languages and most
hardware vendors have published (or plan to
publish) compatible memory model
specifications.
• Although this convergence is a dramatic
improvement over the past, it has exposed
fundamental shortcomings in our parallel
languages and their interplay with hardware.
Memory Model
• After decades of research, it is still unacceptably
difficult to describe what value a load can return
without compromising modern safety guarantees
or implementation methodologies.
• To us, this experience has made it clear that
solving the memory model problem will require a
significantly new and cross-disciplinary research
direction for parallel computing languages,
hardware, and environments as a whole.
Sequential Consistency
• The execution of a multithreaded program
with shared variables is as follows.
– Each step in the execution consists of choosing
one of the threads to execute,
– then performing the next step in that thread’s
execution according to program order.
– This process is repeated until the program as a
whole terminates.
Sequential Consistency
• Execution viewed as taking all steps executed
by each thread, and interleaving them.
• Possible results are determined by trying all
possible interleavings of instructions.
• Instructions in a single thread must occur in the
same order in the interleaving as in the thread.
• Lamport called such an execution sequentially
consistent.
Sequential Consistency
Initially X = Y = 0

Red Thread:  X = 1; r1 = Y;
Blue Thread: Y = 1; r2 = X;
Fig 1. Core of Dekker’s algorithm. Can r1 = r2 = 0?

Exec. 1: X = 1; r1 = Y; Y = 1; r2 = X;  →  r1 = 0, r2 = 1
Exec. 2: Y = 1; r2 = X; X = 1; r1 = Y;  →  r1 = 1, r2 = 0
Exec. 3: X = 1; Y = 1; r1 = Y; r2 = X;  →  r1 = 1, r2 = 1
Fig. 2 Some executions for Fig. 1

Fig. 2 gives three possible interleavings that together illustrate
all possible final values of the non-shared variables r1 and r2.
Sequential Consistency
• Sequential consistency gives the simplest possible
meaning for shared variables, but suffers from
several related flaws.
– Sequential consistency can be expensive to
implement.
– Almost all hardware and compilers violate SC today:
• they reorder the two independent assignments, since scheduling
loads early tends to hide the load latency;
• modern processors use a store buffer to avoid waiting for
stores to complete.
• Both the compiler and hardware optimizations make the
outcome r1 = r2 = 0 possible: a non-sequentially
consistent execution.
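The Dekker pattern of Fig. 1 can be sketched in Java (class and field names are mine). Declaring X and Y volatile makes the program data-race-free, so the Java memory model guarantees sequentially consistent executions and r1 = r2 = 0 becomes impossible; dropping `volatile` reintroduces the races, and the outcome (0, 0) can then appear on real hardware and compilers.

```java
public class Dekker {
    static volatile int X = 0, Y = 0;
    static int r1, r2;   // each written by exactly one thread: no race

    static void run() {
        Thread red  = new Thread(() -> { X = 1; r1 = Y; });
        Thread blue = new Thread(() -> { Y = 1; r2 = X; });
        red.start(); blue.start();
        try { red.join(); blue.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        run();
        // Possible SC results: (0,1), (1,0), (1,1) -- never (0,0).
        System.out.println("r1=" + r1 + ", r2=" + r2);
    }
}
```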
Data-Race-Free
• Reordering accesses to unrelated variables in a thread
– preserves the meaning of single-threaded programs, but
– can affect multithreaded programs.
• Two memory accesses form a race if
– they are from different threads, access the same location,
and at least one is a write, and
– they occur one after the other.
• A data-race-free program has no data race in any SC
execution.
– Requires races to be explicitly identified as synchronization:
– locks, volatile variables in Java, atomics in C++.
Data-Race-Free
• A program that does not allow a data race is
said to be data-race-free.
• The data-race-free model guarantees
sequential consistency only for data-race-free
programs.
• For programs that allow data races, the model
does not provide any guarantees.
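A hedged Java sketch of removing a race with one of the mechanisms listed above (names are mine): a flag-based handoff made data-race-free with a `volatile` flag. With a plain boolean, the accesses to the flag race and the model gives no guarantee the reader ever sees the data; with `volatile`, the program is data-race-free and the model guarantees SC behavior.

```java
public class Handoff {
    static int data = 0;
    static volatile boolean ready = false;   // synchronization variable
    static int result = -1;

    static void run() {
        Thread writer = new Thread(() -> { data = 42; ready = true; });
        Thread reader = new Thread(() -> {
            while (!ready) { }   // spin until the volatile write is visible
            result = data;       // guaranteed 42: the volatile read
                                 // synchronizes-with the volatile write
        });
        writer.start(); reader.start();
        try { writer.join(); reader.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        run();
        System.out.println(result); // prints 42
    }
}
```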
Data-Race-Free Approach
• Programmability
– Simply identify the shared variables as synchronization variables
• Performance
– Specifies minimal constraints
• Portability
– Language must provide a way to identify races
– Hardware must provide a way to preserve ordering on races
– Compiler must translate correctly
– Synchronization accesses are performed indivisibly
– Ensure sequential consistency, in spite of those simultaneous
accesses
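The requirement that synchronization accesses be performed indivisibly can be illustrated with a hedged Java sketch (class names are mine): a counter whose every access goes through a lock. The lock's acquire and release are the explicitly identified synchronization accesses; the system performs them indivisibly and keeps them ordered, which makes the program data-race-free.

```java
public class SafeCounter {
    private int count = 0;
    private final Object lock = new Object();

    public void increment() {
        synchronized (lock) { count++; }  // read-modify-write is now indivisible
    }

    public int get() {
        synchronized (lock) { return count; }
    }

    public static void main(String[] args) throws InterruptedException {
        SafeCounter c = new SafeCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 10_000; j++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // Every increment is visible and none are lost.
        System.out.println(c.get()); // prints 40000
    }
}
```

Without the `synchronized` blocks, concurrent `count++` operations race and increments can be lost.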
Data-Race-Free Approach
• The most important hardware and compiler
optimizations remain allowed for ordinary accesses.
• Care must be taken at the explicitly identified
synchronization accesses.
• Data-race-free was formally proposed in 1990,
– but it did not see widespread adoption as a formal
model in industry until recently.
Industry Practice and Evolution
• Hardware Memory Models
• High-Level Language Memory Models
– The Java Memory Model
– The C++ Memory Model
Hardware Memory Models
• Most hardware supports relaxed models that are
weaker than sequential consistency.
– These models take an implementation- or performance-centric view:
– desirable hardware optimizations drive the model
specification.
• Different vendors had different models
– Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray.
– Various ordering guarantees + fences to impose other
orders
– Many ambiguities, due to complexity by design and
unclear documentation
High-Level Language Memory Models
• Ada was the first widely used high-level
programming language to provide first-class support
for shared-memory parallel programming.
– It was advanced in its treatment of memory semantics.
– It used a style similar to data-race-free, requiring
legal programs to be well-synchronized;
– however, it did not fully formalize the notion of
well-synchronized.
Pthreads and OpenMP
• Most shared-memory programming enabled
through libraries and APIs such as Posix
threads and OpenMP.
– Incomplete, ambiguous model specs
– Memory model property of language, not library
– unclear what compiler transformations are legal,
– what the programmer is allowed to assume.
Pthreads and OpenMP
• The Posix threads
– specification indicates a model similar to data-race-free
– Several inconsistent aspects,
– Varying interpretations even among experts
• The OpenMP model
– Unclear and largely based on a flush instruction
that is similar to fence instructions in h/w models
Java
• Java provided first class support for threads
– with a chapter specifically devoted to its memory model.
• Bill Pugh publicized fatal flaws in Java model
– This model was hard to interpret
– Badly broken
• Common compiler optimizations were prohibited
• In many cases the model gave ambiguous or unexpected
behavior
Java Memory Model Highlights
• In 2000, Sun appointed an expert group to revise the model
through the Java community process.
– Coordination happened through an open mailing list.
– It attracted a variety of participants, representing
• hardware, software, researchers and practitioners.
– It took 5 years of debates.
– There were many competing models.
– The final consensus model was approved in 2005 for Java 5.0.
– The group quickly decided that the JMM must provide SC for
data-race-free programs.
– Unfortunately, this model is very complex,
– has some surprising behaviors,
– and has recently been shown to have a bug.
Java Memory Model Highlights
• Missing piece: Semantics for programs with data
races
– Java is meant to be a safe and secure language
– It cannot allow arbitrary behavior for data races.
– Java cannot have undefined semantics for ANY
program
– It must ensure safety/security guarantees to limit
damage from data races in untrusted code.
– Problem: “safety/security, limited damage” in a
multithreaded context were not clearly defined
Java Memory Model Highlights
• Final model based on consensus, but complex
– Programmers (must) use SC for data-race-free
– But system designers must deal with complexity
– Recent discovery of bugs in 2008
The C++ Memory Model
• The situation in C++ was significantly different
from Java.
• The language itself provided no support for
threads.
– most concurrency with Pthreads.
– Corresponding Microsoft Windows facilities.
• The combination of the C++ standard with the
Pthreads standard
– made it unclear to compiler writers precisely what
they needed to implement
The C++ Memory Model
• In 2005, an effort to define a concurrency model for C++ was initiated.
• The resulting effort expanded to include the definition
of atomic operations, and the threads API itself.
• Microsoft concurrently started its own internal effort.
• C++ was easier than Java because it is unsafe:
– data-race-free is a possible model.
• BUT multicore brought new hardware optimizations and more research:
– the mismatched hardware and programming views became painful;
– it was debated whether SC for data-race-free is inefficient with new
hardware models.
Lessons Learned
• Data-race-free provides a simple and consistent
model for threads and shared variables.
• Current hardware implies three significant
weaknesses:
– Debugging: accidental introduction of a data race
results in "undefined behavior."
– Synchronization variable performance: ensuring SC in Java
or C++ on current hardware complicates the model for experts.
– Untrusted code: there is no way to ensure data-race-freedom
in untrusted code (Java).
Lessons Learned
• There is a need for:
– Higher-level disciplined models that enforce
discipline
• Deterministic Parallel Java
– Hardware co-designed with high-level models
• DeNovo hardware
Implications for Languages
• Modern software needs to be both parallel
and secure
• A choice between the two should not be
acceptable.
• Disciplined shared-memory models
– Simple
– Enforceable
– Expressive
– Performance
Implications for Languages
• Simple: easily teachable to undergraduates.
– Minimally, provide sequential consistency to programs that
obey the required discipline.
• Enforceable: enable the enforcement of the discipline.
– Violations of the discipline should not have undefined or
complex semantics.
– They should be caught and returned back to the programmer as
illegal.
• Expressive: general-purpose enough to express important parallel
algorithms and patterns.
• Performance: enable high and scalable performance.
Data-Race-Free
• A near-term discipline: continue with data-race-free.
• Focus research on its enforcement:
– ideally, the language prohibits and eliminates data races
by design;
– else, the runtime catches them as exceptions.
Deterministic-by-Default Parallel
Programming
• Many algorithms are deterministic.
– Fixed input gives fixed output.
– This is the standard model for sequential programs.
– It also holds for many transformative parallel
programs.
• Parallelism is not part of the problem specification, only for
performance.
• Language-based approaches with such goals include
Deterministic Parallel Java (DPJ).
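The deterministic-by-default idea can be sketched in plain Java using parallel streams (this is an illustration, not DPJ; the class and method names are mine): because the reduction below is pure and associative, the parallel version returns the same value as the sequential one on every run, so parallelism affects only performance, not the result.

```java
import java.util.stream.IntStream;

public class DeterministicSum {
    // Sum of squares 1..n, optionally in parallel. The computation has no
    // shared mutable state, so the parallel result is deterministic.
    static long sumSquares(int n, boolean parallel) {
        IntStream s = IntStream.rangeClosed(1, n);
        if (parallel) s = s.parallel();
        return s.mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        long seq = sumSquares(1_000_000, false);
        long par = sumSquares(1_000_000, true);
        System.out.println(seq == par); // prints true on every run
    }
}
```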
Implications for Hardware
• Current hardware memory models are an imperfect match
for even current software memory models.
• Instruction set architecture changes
– to identify individual loads and stores as synchronization
can alleviate some short-term problems.
• An established ISA is hard to change
– when existing code works mostly adequately, and
– there is not enough experience to document the
benefits of the change.
Implications for Hardware
• From a longer-term perspective, a more fundamental
solution to the problem will emerge with a co-designed approach,
– where future multicore hardware research evolves in
concert with the software models research.
• The DeNovo project at Illinois, in concert with DPJ:
– Design hardware to exploit disciplined parallelism
– Simpler hardware
– Scalable performance
– Power/energy efficiency
– Working with DPJ as an example disciplined model
Conclusion
• This paper gives a perspective based on work
collectively spanning about thirty years.
• The current way to specify concurrency semantics is
fundamentally broken.
– The best we can do is SC for data-race-free.
– Hardware and software are mismatched:
• simple optimizations give unintended consequences.
• Need
– High-level disciplined models that enforce discipline
– Hardware co-designed with high-level model
Authors
• Sarita V. Adve ([email protected]) is a
professor in the Department of Computer
Science at the University of Illinois at
Urbana-Champaign.
• Hans-J. Boehm ([email protected]) is a
member of Hewlett-Packard's Exascale
Computing Lab, Palo Alto, CA.