Ei dian otsikkoa - Petrozavodsk State University

Caching in multiprocessor systems
Tiina Niklander
In AMICT 2009, Petrozavodsk
19.5.2009
Background
 More transistors on one chip
 Multiple cores
 Larger cache
 Multiple on chip caches
 More functionality (more functional units, dedicated
multimedia / deciphering cell, integrated GPU)
 Multiple cores introduce
 Cache organization
 Private vs shared caches
 Cache coherence
Cache organization
 Common organization:
 L1 is private
 Last-level cache is shared
 With three levels:
 L1 private
 L2: private or shared?
 L3 Shared
Private vs Shared cache
 Fully private, fully shared, partially shared
Private L2 (pair of processors share)
Shared L2 (all can access all L2)
F. Sibai: On the performance benefits of sharing and privatizing second
and third-level cache memories in homogeneous multi-core
architectures.
Microprocessors and Microsystems 32 (2008), pp. 405-412
Shared cache
 Simple coherence issue (just one copy)
 Different latencies (CPU - cache location)
 Cache access competition (wait for other core)
M. Kandemir, F. Li, M.J. Irwin, S.W. Son: A Novel Migration-Based NUCA Design for Chip
Multiprocessors. In SC2008. IEEE, 2008, pp.
Private cache
 No access competition, smaller latencies,
 But coherence becomes an issue!
 Same data in multiple caches -> invalidate on write
 Cache partitioning
 Design time: Fixed partitioning
 Run time:
 Fixed partitioning (configuration issue)
 Dynamic (based on current need)
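The run-time, need-based option can be illustrated with a small sketch. It is not a scheme from the talk: the function name `partition_ways`, the use of recent miss counts as the measure of need, and the minimum of one way per core are all assumptions made for the example.

```python
def partition_ways(total_ways, misses, min_ways=1):
    """Split `total_ways` cache ways among cores according to recent misses.

    Hypothetical policy: every core keeps at least `min_ways`; the remaining
    ways are handed out one at a time to the core with the highest
    misses-per-allocated-way ratio.
    """
    cores = len(misses)
    if total_ways < cores * min_ways:
        raise ValueError("not enough ways for the minimum allocation")
    alloc = [min_ways] * cores
    for _ in range(total_ways - cores * min_ways):
        # Give the next way to the core that currently looks most starved.
        neediest = max(range(cores), key=lambda c: misses[c] / alloc[c])
        alloc[neediest] += 1
    return alloc


# Two cores sharing an 8-way cache; core 0 misses three times as often.
print(partition_ways(8, [300, 100]))
```

A fixed partitioning corresponds to calling such a function once at configuration time; a dynamic one would re-run it periodically with fresh miss counters.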
Cache coherence
 Protocols: MESI, MSI, MOSI, MOESI
 Invalidation message: RFO (Read for ownership)
 Each cache snoops the bus to monitor memory ops
     M  E  S  I
M    N  N  N  Y
E    N  N  N  Y
S    N  N  Y  Y
I    Y  Y  Y  Y
(Source: Wikipedia)
M – modified
(O- Owned)
E – Exclusive
S – Shared
I – Invalid
N – not allowed state
Y – allowed state
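The allowed and not-allowed state pairs listed above, together with invalidate-on-write, can be sketched as a toy snooping model. This is a simplified illustration only: it tracks a single cache line, ignores write-back traffic and timing, and the names `COMPATIBLE` and `Bus` are made up for the example.

```python
# Pairwise compatibility of MESI states: may one cache hold the line in
# state `a` while another cache holds the same line in a state from the set?
COMPATIBLE = {
    "M": {"I"},
    "E": {"I"},
    "S": {"S", "I"},
    "I": {"M", "E", "S", "I"},
}


class Bus:
    """Toy snooping bus tracking the MESI state of ONE line in each cache."""

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches

    def read(self, cache):
        if self.state[cache] != "I":
            return  # read hit, no bus traffic
        holders = [c for c, s in enumerate(self.state)
                   if c != cache and s != "I"]
        for c in holders:
            # A snooping holder in M or E drops to S (M would also write back).
            self.state[c] = "S"
        self.state[cache] = "S" if holders else "E"

    def write(self, cache):
        if self.state[cache] not in ("M", "E"):
            # Read for ownership: every other copy is invalidated.
            for c in range(len(self.state)):
                if c != cache:
                    self.state[c] = "I"
        self.state[cache] = "M"

    def consistent(self):
        """True if every pair of caches is in an allowed state combination."""
        s = self.state
        return all(s[j] in COMPATIBLE[s[i]]
                   for i in range(len(s)) for j in range(len(s)) if i != j)


bus = Bus(3)
bus.read(0)    # first reader gets the line Exclusive
bus.read(1)    # second reader: both copies become Shared
bus.write(2)   # writer invalidates the other copies and holds Modified
print(bus.state, bus.consistent())
```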
(Distributed) cooperative caches
 Add a directory structure
 Knows the data locations in local caches
 Cache-to-cache copying
 When in another cache (directory locates)
 On eviction (store temporarily on another cache)
E. Herrero, J. González, R. Canal: Distributed Cooperative Caching. In PACT’08. ACM 2008, pp. 134-142
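A directory that locates blocks in other private caches, and an eviction that parks a block in a neighbouring cache, might look roughly like the sketch below. It is not the design from the cited paper: the class `CooperativeCaches`, the FIFO victim choice, and the "spill only into a cache with free space" rule are assumptions chosen to keep the example short.

```python
class CooperativeCaches:
    """Toy model: private caches plus a directory of which cache holds what."""

    def __init__(self, n_caches, capacity):
        self.caches = [dict() for _ in range(n_caches)]  # block -> data
        self.capacity = capacity
        self.directory = {}                              # block -> cache ids

    def _spill(self, source, block, data):
        """Park an evicted block in another cache that still has free space."""
        for other, lines in enumerate(self.caches):
            if other != source and len(lines) < self.capacity:
                lines[block] = data
                self.directory[block].add(other)
                return True
        return False

    def _insert(self, cache, block, data):
        lines = self.caches[cache]
        if block not in lines and len(lines) >= self.capacity:
            victim = next(iter(lines))       # oldest line; FIFO is an assumption
            victim_data = lines.pop(victim)
            self.directory[victim].discard(cache)
            if not self.directory[victim]:
                # Last on-chip copy: try to keep it in a neighbouring cache.
                if not self._spill(cache, victim, victim_data):
                    del self.directory[victim]
        lines[block] = data
        self.directory.setdefault(block, set()).add(cache)

    def read(self, cache, block, memory):
        """Return (data, source); source is 'local', 'remote' or 'memory'."""
        if block in self.caches[cache]:
            return self.caches[cache][block], "local"
        holders = self.directory.get(block)
        if holders:
            owner = next(iter(holders))      # the directory locates a copy
            data, source = self.caches[owner][block], "remote"
        else:
            data, source = memory[block], "memory"
        self._insert(cache, block, data)
        return data, source


memory = {"a": 1, "b": 2, "c": 3}
cc = CooperativeCaches(n_caches=2, capacity=2)
for blk in ("a", "b", "c"):        # the third read evicts "a" from cache 0
    cc.read(0, blk, memory)
print(cc.read(0, "a", memory))     # served from cache 1, not from memory
```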
New improvement ideas for cache performance (1/2)
 Split the cache for different tasks
 Dynamically allocate cache areas
 Software controlled eviction
 GOAL: a thread moves data it no longer needs, but which is
strongly shared, to the shared cache to improve the performance of
other threads
 New instruction evict tells the processor to move
some data from private L1 or L2 to shared L3
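The effect of such an evict instruction can be illustrated with a toy model of two cores sharing an L3. Everything here is a simplification for illustration: capacities and coherence are not modelled, one dictionary stands in for the private L1/L2 of each core, and the class `Core` and its methods are hypothetical names, not a real instruction set interface.

```python
class Core:
    """Toy core with one private cache level in front of a shared L3."""

    def __init__(self, shared_l3):
        self.private = {}      # stands in for the private L1/L2
        self.l3 = shared_l3    # dictionary shared by all cores

    def load(self, addr, memory):
        """Return (value, where); where is 'private', 'L3' or 'memory'."""
        if addr in self.private:
            return self.private[addr], "private"
        if addr in self.l3:
            value, where = self.l3[addr], "L3"
        else:
            value, where = memory[addr], "memory"
        self.private[addr] = value
        return value, where

    def evict(self, addr):
        """Model of the evict hint: push a line from private cache to L3."""
        if addr in self.private:
            self.l3[addr] = self.private.pop(addr)


memory = {0x40: "payload"}
l3 = {}
producer, consumer = Core(l3), Core(l3)

producer.load(0x40, memory)          # line now sits in the producer's cache
producer.evict(0x40)                 # producer is done with it: move to L3
print(consumer.load(0x40, memory))   # consumer finds it in the shared L3
```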
New improvement ideas for cache performance (2/2)
 Helper threads
 GOAL: additional thread executes parts of the code
ahead of the actual thread to ‘prefetch’ data to cache
 Generate memory traces for the programmer
 Tuning the software performance
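The helper-thread idea can be explored with a small simulation on a memory trace. This is only a sketch under stated assumptions: the cache is a fully associative LRU cache, the helper is modelled as touching the address the main thread will need `lookahead` accesses later, and the function name `main_thread_misses` is invented for the example.

```python
from collections import OrderedDict


def main_thread_misses(trace, cache_lines, lookahead=0):
    """Count misses seen by the main thread on an LRU cache.

    With lookahead > 0, a simulated helper thread touches the address the
    main thread will need `lookahead` accesses later, pulling it into cache.
    """
    cache = OrderedDict()

    def touch(addr):
        hit = addr in cache
        if hit:
            cache.move_to_end(addr)          # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict least recently used
        return hit

    misses = 0
    for i, addr in enumerate(trace):
        if lookahead and i + lookahead < len(trace):
            touch(trace[i + lookahead])      # helper thread runs ahead
        if not touch(addr):
            misses += 1
    return misses


trace = list(range(16))                      # streaming access, no reuse
print(main_thread_misses(trace, cache_lines=8))
print(main_thread_misses(trace, cache_lines=8, lookahead=2))
```

In this model the helper does not remove memory traffic; it only moves the misses off the main thread. If the cache is too small for the chosen lookahead, prefetched lines are evicted before they are used, which is the kind of effect a memory trace helps the programmer spot.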
Conclusion
 Focus on fine-tuning the cache performance
 Cache coherence itself was solved earlier
 Not always used (when non-coherent usage is allowed)
 L2 and L3 caches
 Shared or private
 Cache partitioning
 Support for software-based improvements
 Eviction hints
 Traces
 Prefetching (like helper thread)
References
 S. Fide, S. Jenks: Proactive use of shared L3 caches to enhance cache communications in multi-core processors. IEEE Comp. Arch. L. vol 7 (2008), pp. 57-60
 E. Herrero, J. González, R. Canal: Distributed Cooperative Caching. In Conf. on
Parallel architectures and compilation techniques, PACT’08. ACM 2008, pp. 134-142
 M. Kandemir, F. Li, M.J. Irwin, S.W. Son: A Novel Migration-Based NUCA Design for
Chip Multiprocessors. In Proc. of the 2008 ACM/IEEE Conf. on Supercomputing. IEEE,
2008, pp. 1-12
 L. Peng, et al.: Memory hierarchy performance measurement of commercial dual-core
desktop processors. Journal of Systems Architecture 54 (2008), pp. 816-828
 F. Sibai: On the performance benefits of sharing and privatizing second and third-level
cache memories in homogeneous multi-core architectures. Microprocessors and
Microsystems 32 (2008), pp. 405-412
 J. Zhang, X. Fan, S.H. Liu: A Pollution Alleviate L2 Cache Replacement Policy for Chip
Multiprocessor Architecture. In Int. Conf. on Networking, Architecture and Storage,
IEEE, 2008, pp. 310-316