Transcript Chapter 11

CSC 480 - Multiprocessor Programming, Spring 2012
Chapter 11 – Performance and Scalability
Dr. Dale E. Parson, week 12
Multithreading Overhead
• Multithreading always introduces overhead.
• Costly synchronization code even without contention.
• Synchronization delays when contention occurs.
• Hardware synchronization (flushes at barriers).
• Increased context switching and thread scheduling.
• Thread creation and destruction.
• How do we keep the hardware busy?
• What are the bottlenecks?
Scalability
• Ability to improve throughput or capacity
when adding computing resources -- CPUs,
memory, long-term storage, or I/O bandwidth.
• Adding resources just moves the bottleneck somewhere else.
• Throughput measures total data processed.
• Latency measures delays in event responses.
• Sometimes throughput and latency trade off against each other.
• Use a better algorithm.
Avoid premature tradeoffs
• Make it correct, then fast.
• An initial correct solution, even if slower, helps
you think through the problem, gives you
output reference data for when you perform
code improvements, and gives you some
reusable code.
• It is like learning how to drive!
• Know your code libraries.
Amdahl’s Law
• Speedup <= 1.0 / (F + ((1.0 - F) / N)) (see the sketch after this list).
• F is the fraction of the computation that is serial.
• N is the number of processors.
• Sources of serialization –
• Intrinsic to the algorithm (e.g., partitioning in quicksort).
• Synchronization (e.g., CyclicBarrier in penny-dime).
• Access to shared data structures such as work queues.
• Task submission, result processing, join.
• Hardware serialization (e.g., memory and I/O bandwidth).
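• To make the bound concrete, here is a small Java sketch (illustrative
class name, and an assumed serial fraction F = 0.05) that evaluates the
formula for several processor counts:

public class Amdahl {
    // Amdahl's Law: upper bound on speedup for serial fraction f on n CPUs.
    static double speedup(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }
    public static void main(String[] args) {
        double f = 0.05;  // assumed example value: 5% of the work is serial
        for (int n : new int[] {1, 2, 4, 8, 16, 64}) {
            System.out.printf("N = %-3d  speedup <= %.2f%n", n, speedup(f, n));
        }
    }
}

• Even with only 5% serial work, speedup can never exceed 1/F = 20,
no matter how many processors are added.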
Context switching
• Operating system calls and complex JVM code.
• Cache and page faults to load and flush a
thread’s working set (of data) on a switch.
• I/O-bound threads block more often than CPU-bound
threads, resulting in more context switches.
• High kernel usage (> 10%) indicates heavy
scheduling activity, which may be caused by frequent
blocking due to I/O or lock contention.
Memory synchronization
• Memory barriers flush and invalidate registers,
data caches, and write buffers, and stall
execution pipelines.
• Uncontended synchronization overhead is low.
• JVM run-time compiler can use escape
analysis to identify when a “locked” object
does not escape its thread, removing the lock.
• Compiler lock coarsening can merge adjacent
requests for the same lock.
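• A minimal sketch of the code shapes these two optimizations target
(illustrative class and method names; whether a particular JVM actually
applies them depends on its runtime compiler):

public class ElisionDemo {
    // Escape analysis: sb is local and never escapes this thread, so the
    // JIT may elide the monitor operations inside StringBuffer entirely.
    static String buildLabel(int id) {
        StringBuffer sb = new StringBuffer();  // internally synchronized
        sb.append("task-");
        sb.append(id);
        return sb.toString();
    }

    private final Object lock = new Object();
    private int a, b;

    // Lock coarsening: the compiler may merge these adjacent blocks on
    // the same lock into a single acquire/release pair.
    void update(int x, int y) {
        synchronized (lock) { a = x; }
        synchronized (lock) { b = y; }
    }
}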
Memory sync and blocking
• Lock elision for locks on thread-confined
objects is in the IBM JVM and slated for
HotSpot 7.
• Synchronization creates traffic on memory
buses.
• Clear, explicit Java Memory Model creates
opportunities for compiler optimizations.
• Blocking overhead –
• Spin waiting.
• O.S. system calls and context switching overhead.
Reducing lock contention
• Reduce duration during which locks are held.
• Reduce the frequency of lock requests.
• Replace exclusive locks with coordination
mechanisms that permit greater concurrency.
• Prefer a dataflow architecture when possible.
• Lock splitting: guard independent sets of shared
variables with separate locks instead of a single
lock (see the sketch after this list).
• Lock striping: partition one data structure among
a fixed set of locks, avoiding the overhead of
excess locks.
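• A minimal lock-splitting sketch (illustrative names): the two sets of
fields are independent, so guarding each with its own lock roughly
halves contention on this object.

import java.util.HashSet;
import java.util.Set;

public class ServerStats {
    private final Object userLock = new Object();
    private final Object queryLock = new Object();
    private final Set<String> users = new HashSet<String>();
    private final Set<String> queries = new HashSet<String>();

    // Each method takes only the lock that guards the state it touches.
    public void addUser(String u)  { synchronized (userLock)  { users.add(u); } }
    public void addQuery(String q) { synchronized (queryLock) { queries.add(q); } }
}

• Lock striping takes the same idea further: ConcurrentHashMap, for
example, guards partitions of its table with a fixed set of internal locks.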
Hot fields, exclusive locks
• A hot field is a single field such as a counter
that is accessed by many threads.
• Parallelize data structures when possible.
• Merge them in a subset of threads as needed.
• Cache merged results if invalidating the cache is fast.
• Use alternatives to exclusive locks.
• ReadWriteLock
• Atomic variables and volatiles
• Queues and library classes with reduced or split locks.
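• A minimal sketch of the first two alternatives (illustrative names):
an atomic variable replaces an exclusively locked hot counter, and a
ReadWriteLock lets read-mostly state be read in parallel.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HotFieldDemo {
    // Hot counter: lock-free increment instead of an exclusive lock.
    private final AtomicLong hits = new AtomicLong();
    public void recordHit() { hits.incrementAndGet(); }
    public long getHits()   { return hits.get(); }

    // Read-mostly field: many readers may hold the read lock at once;
    // only writers exclude everyone else.
    private final ReadWriteLock rw = new ReentrantReadWriteLock();
    private String config = "default";

    public String readConfig() {
        rw.readLock().lock();
        try { return config; } finally { rw.readLock().unlock(); }
    }
    public void writeConfig(String c) {
        rw.writeLock().lock();
        try { config = c; } finally { rw.writeLock().unlock(); }
    }
}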
Monitoring CPU utilization
• top, mpstat, vmstat, iostat, sar, cpustat
• Process and resource profilers and low-level activity
register data analyzers.
• See Summer of 2010 report, final page.
• Also commercial code profilers for multithreading.
• Causes of CPU underutilization –
• Insufficient load
• I/O-bound processing
• External bottlenecks
• Lock contention
Miscellaneous
• Thread pools? Yes. Object pooling? No.
• Dynamic allocation and garbage collection are cheaper
and easier than application-driven object pools and
application-level synchronization of object pools.
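• A minimal sketch of the distinction (illustrative names): the threads
are pooled and reused, while per-task objects are simply allocated and
left to the garbage collector.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolDemo {
    public static void main(String[] args) {
        // Thread pool: pay thread creation cost once, reuse across tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    // No object pool: a fresh, short-lived object per task
                    // is cheap to allocate and cheap to collect.
                    StringBuilder sb = new StringBuilder("task ").append(id);
                    System.out.println(sb);
                }
            });
        }
        pool.shutdown();
    }
}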
• Reducing context switching overhead.
• Make some portion of the application architecture CPU-
intensive, and partition it so it can be multithreaded.
• Let some partitioned set of threads handle the buffered
I/O. Only they are candidates for I/O blocking.
• This may introduce a serial bottleneck in the
I/O threads, so use appropriate queuing.
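• A minimal sketch of such a partition (illustrative names, assumed
queue capacity): CPU-bound workers hand output to a single I/O thread
through a bounded queue, so only that thread can block on I/O, and the
bound applies back-pressure instead of letting a serial backlog grow
without limit.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IoPartition {
    private final BlockingQueue<String> out =
        new ArrayBlockingQueue<String>(1024);  // assumed capacity

    // Called by many CPU-bound worker threads; blocks only when the
    // queue is full, applying back-pressure to the producers.
    public void emit(String line) throws InterruptedException {
        out.put(line);
    }

    // The single I/O thread drains the queue and does all blocking writes.
    public void startWriter() {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        System.out.println(out.take());  // the only blocking I/O
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // exit on interrupt
                }
            }
        });
        t.setDaemon(true);
        t.start();
    }
}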