
Operating System Support for
improving data locality on
CC-NUMA machines
CSE597A Presentation
By
V.N.Murali
WHY CC-NUMA?
• Scalable as the number of nodes increases
• Attractive properties: transparent access to
local and remote memory, at the cost of
increased access latency to remote memory.
• 2 variations: CC-NUMA (Stanford DASH,
MIT Alewife, Sequent) and CC-NOW (Sun S3.mp).
OS support
• Most important issue: data locality.
• OS-supported page migration and replication
improve performance by as much as 30%.
Issues in Migration/Replication
• When should pages be migrated?
• When should pages be replicated?
• Both are needed to boost performance.
• When not to migrate/replicate is also
important.
• Which system parameter can be used to
decide? Ideas?
Differences with S/W shared
memory
• Migration and replication (M&R) in S/W DSM
are needed for correctness. On CC-NUMA, M&R
is purely an optimization.
• M&R in S/W DSM is triggered by page
faults. On CC-NUMA, M&R is triggered by
cache misses.
• If the workload exhibits good cache
locality, M&R brings less benefit. Hence
selective criteria are needed for moving pages.
• Study based on SimOS environment.
Solution
• How do we improve data locality?
• 3 access patterns: a) primarily accessed by a
single process, b) mostly read access by
many processes, c) both read and write
access by many processes
• Which method should be applied for
a), b), c)?
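One plausible answer to the question above, consistent with the algorithm summarized later in these slides, is sketched here in Python: migrate pages used mainly by one process, replicate pages that are mostly read by many, and leave write-shared pages alone. The names `Action` and `choose_action` and the `mostly_read_cutoff` value are hypothetical and serve only as illustration.

```python
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"
    REPLICATE = "replicate"
    LEAVE = "leave in place"


def choose_action(num_accessors: int, write_fraction: float,
                  mostly_read_cutoff: float = 0.05) -> Action:
    """Map an access pattern to a candidate action.

    num_accessors: number of processes that access the page.
    write_fraction: share of accesses that are writes (0.0 to 1.0).
    mostly_read_cutoff: hypothetical cutoff for "mostly read".
    """
    if num_accessors <= 1:
        return Action.MIGRATE      # pattern a): single process
    if write_fraction <= mostly_read_cutoff:
        return Action.REPLICATE    # pattern b): read-shared
    return Action.LEAVE            # pattern c): write-shared


print(choose_action(1, 0.5))
print(choose_action(8, 0.0))
print(choose_action(8, 0.4))
```

Pattern c) is left in place in this sketch because moving or copying a page that many processes write would mostly add overhead.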
Costs to be considered
• 1) Cost of determining candidate pages for
M&R (cost of cache misses/TLB misses)
• 2) Overhead of M&R (new
mappings, allocating a page, flushing TLB)
• 3) Actual data transfer
• 4) Memory pressure!
Key Parameters
Parameter            Semantics
Reset interval       Number of cycles after which all counters are reset
Trigger threshold    Number of misses after which a page is “hot” for M/R
Sharing threshold    Number of misses from another processor for R
Write threshold      Number of writes after which no R
Migrate threshold    Number of migrates after which no M
Summary of the algorithm
• “Hot page”: a page whose miss counter for a processor
reaches the trigger threshold
• If the miss counter for this page on any other
processor reaches the sharing threshold, it is
considered for replication; otherwise it is considered for
migration.
• Replicated only if the write counter has not exceeded
the write threshold. Migrated only if the migrate
counter has not exceeded the migrate threshold.
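The decision rule above can be sketched in Python as follows. This is a minimal illustration, not the IRIX implementation: the class and function names are hypothetical, and the threshold values are arbitrary placeholders rather than the values used in the study.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Params:
    # Placeholder values, chosen only for illustration.
    trigger_threshold: int = 128
    sharing_threshold: int = 32
    write_threshold: int = 1
    migrate_threshold: int = 1


@dataclass
class PageCounters:
    # Per-processor miss counts for one page.
    misses: dict = field(default_factory=lambda: defaultdict(int))
    writes: int = 0
    migrates: int = 0


def decide(page: PageCounters, cpu: int, p: Params) -> str:
    """Return 'replicate', 'migrate', or 'none' for a miss counted on `cpu`."""
    if page.misses[cpu] < p.trigger_threshold:
        return "none"  # page is not hot for this processor
    shared = any(n >= p.sharing_threshold
                 for other, n in page.misses.items() if other != cpu)
    if shared:
        # Candidate for replication, unless it has been written too often.
        return "replicate" if page.writes <= p.write_threshold else "none"
    # Candidate for migration, unless it has already moved too often.
    return "migrate" if page.migrates <= p.migrate_threshold else "none"


def reset(page: PageCounters) -> None:
    """Clear all counters, as would happen once per reset interval."""
    page.misses.clear()
    page.writes = 0
    page.migrates = 0
```

For example, with `Params(trigger_threshold=4, sharing_threshold=2)`, a page with 4 misses from processor 0 and none from others is a migration candidate; once processor 1 also reaches 2 misses, the same page becomes a replication candidate instead.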
Implementation details
• Directory controller maintains the miss
counters and generates a low-priority
interrupt.
• Batches a few hot pages before raising the
interrupt.
• A write to a replicated page collapses the
replicas to a single page
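The collapse of replicas on a write can be pictured with a small Python model. This is only a sketch: the class `ReplicatedPage` is hypothetical, and the choice of which copy survives (the writer's own copy if it has one, otherwise the lowest-numbered node) is an assumption made for illustration, not something the slides specify.

```python
class ReplicatedPage:
    """Tracks which nodes hold a copy of one logical page (illustrative model)."""

    def __init__(self, home_node: int):
        self.copies = {home_node}

    def replicate_to(self, node: int) -> None:
        """Add a read-only replica on `node`."""
        self.copies.add(node)

    def write(self, node: int) -> set:
        """Collapse all replicas to one copy before the write proceeds.

        Returns the set of nodes whose copies were dropped.
        Assumption: keep the writer's copy if present, else the lowest node.
        """
        keep = node if node in self.copies else min(self.copies)
        dropped = self.copies - {keep}
        self.copies = {keep}
        return dropped


page = ReplicatedPage(home_node=0)
page.replicate_to(1)
page.replicate_to(2)
dropped = page.write(1)
print(sorted(dropped), sorted(page.copies))
```

In a real kernel each dropped copy would also need its mappings removed and TLB entries flushed, which is part of the M&R overhead listed earlier.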
IRIX changes
• Replication support
• Finer grain locking
• Page table back mappings
Workloads
• Engineering workload: large, sequential, and
memory-intensive; used a Verilog
simulator and FlashLite.
• Parallel application: Raytrace, a
parallel graphics algorithm
• Scientific workload: Splash
• Decision support database
• Multiprogrammed software: Pmake
Performance analysis
• 3 factors: a) user stall time, b) fraction of
misses satisfied in local memory, c) kernel
overhead.
• Engineering: large user stall time => best
performance gain. Both M and R were used
successfully.
• Raytrace: mostly read-only accesses. Mainly
benefits from replication.
• Splash: 3 parallel
applications: Raytrace, Ocean, Volume
rendering. For Ocean, migration is
helpful. Raytrace and Volume rendering can
benefit from replication.
• Database: mostly read access, and hence
benefits from replication.
Alternative policies
• Static policies, dynamic policies.
• Static: round robin, first touch, post
facto (similar to the optimal page replacement
algorithm)
• Dynamic: migration only, replication
only, migration-replication.
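The three static policies can be compared with a short Python sketch. The function names and the trace format, a time-ordered list of `(page, node)` accesses, are hypothetical. The post facto version assumes the policy places each page on the node that accesses it most over the whole trace, which requires knowing the future in the same way the optimal page replacement algorithm does.

```python
from collections import Counter, defaultdict


def round_robin_placement(pages, num_nodes):
    """Assign pages to nodes in rotation, ignoring who uses them."""
    return {page: i % num_nodes for i, page in enumerate(pages)}


def first_touch_placement(accesses):
    """Place each page on the node that accesses it first."""
    placement = {}
    for page, node in accesses:
        placement.setdefault(page, node)
    return placement


def post_facto_placement(accesses):
    """Place each page on its most frequent accessor (needs the full trace)."""
    per_page = defaultdict(Counter)
    for page, node in accesses:
        per_page[page][node] += 1
    return {page: counts.most_common(1)[0][0]
            for page, counts in per_page.items()}


trace = [("A", 0), ("A", 1), ("A", 1), ("B", 2)]
print(round_robin_placement(["A", "B", "C"], 2))
print(first_touch_placement(trace))
print(post_facto_placement(trace))
```

On this trace, first touch leaves page A on node 0 even though node 1 uses it more often, while post facto places it on node 1. A dynamic policy would instead correct such a placement at run time through migration or replication.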