Adaptive Single-Chip Multiprocessing
Dan Gibson
[email protected]
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
Introduction
• Moore’s Law continues to provide more transistors
– Devices are getting smaller
– Devices are getting faster
» Leads to increases in clock frequency
– Memories are getting bigger
» Large memories often require more time to access
• RC circuits continue to charge exponentially
– Long-wire signal propagation time is not improving as rapidly as switching speed
– On-chip communication time is slower relative to processor clock speeds
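
For reference, the first-order RC step response behind this point (standard circuit theory, not from the slides):

```latex
V(t) = V_{DD}\left(1 - e^{-t/RC}\right)
\quad\Rightarrow\quad
t_{50\%} = RC \ln 2 \approx 0.69\,RC
```

A long wire's delay is set by its RC product; since scaling shrinks wire cross-sections (raising resistance per unit length) while capacitance per unit length stays roughly flat, long-wire delay fails to keep pace with transistor switching speed.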
The Memory Wall
• Processor speed improves much faster than memory speed
– Off-chip cache misses can stall even aggressive out-of-order processors
– On-chip cache accesses are becoming long-latency events
• Latency can sometimes be tolerated
– Caching
– Prefetching (see the sketch below)
– Speculation
– Out-of-order execution
– Multithreading
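
As a concrete illustration of the prefetching item above, a minimal sketch assuming a GCC- or Clang-compatible compiler (`__builtin_prefetch` is a real compiler builtin; the lookahead distance of 16 elements is an arbitrary illustrative choice, not from the slides):

```c
#include <stddef.h>

/* Sum an array while prefetching ahead, so that the memory latency of
 * future elements overlaps with computation on the current ones. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]); /* hint: fetch ~16 elements ahead */
        sum += a[i];
    }
    return sum;
}
```

The right prefetch distance depends on miss latency relative to loop iteration time: too short hides nothing, too far risks polluting the cache.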
The “Power” Wall
• More devices, faster clocks => more power
– Power supply accounts for a large share of chip packaging pins (3,057 of 5,370 pins on the POWER5)
– Heat dissipation increases total cost of ownership (~34 W of cooling power required to remove 100 W of heat)
• Dynamic power in CMOS (worked example below):
P_DYN = α · C_L · V_DD² · f
• Devices get smaller, faster, and more numerous
– More Capacitance
– Higher Frequency
• Architects can constrain α, C_L, and f
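
A small worked example of the equation above (a sketch; the plugged-in values are illustrative assumptions, not figures from the slides):

```c
#include <stdio.h>

/* Dynamic CMOS power: P = alpha * C_L * V_DD^2 * f */
static double p_dyn(double alpha, double c_load, double v_dd, double freq)
{
    return alpha * c_load * v_dd * v_dd * freq;
}

int main(void)
{
    /* Illustrative values: 10% activity factor, 50 nF effective
     * switched capacitance, 1.2 V supply, 2 GHz clock. */
    double base = p_dyn(0.10, 50e-9, 1.2, 2e9);      /* ~14.4 W */
    /* Scaling V_DD and f down ~15% each cuts power ~39%, since
     * power is quadratic in V_DD but only linear in f. */
    double scaled = p_dyn(0.10, 50e-9, 1.02, 1.7e9); /* ~8.8 W */
    printf("base = %.1f W, scaled = %.1f W\n", base, scaled);
    return 0;
}
```

Frequency enters linearly, but because V_DD enters quadratically, lowering supply voltage along with frequency (as circuit-level DVFS does) compounds the savings.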
Enter Chip Multiprocessors (CMPs)
• One chip, many processors
– Multiple cores per chip
– Often multiple threads per core
[Figure: Dual-Core AMD Opteron die photo, from Microprocessor Report: Best Servers of 2004]
CMPs
• CMPs can have good performance
– Explicit thread-level parallelism
– Related threads experience constructive prefetching
• CMPs can tolerate long-latency events well
– Many concurrent threads => long-latency memory accesses can be overlapped
• CMPs can be power-efficient
– Enables use of simpler cores
– Distributes “hot spots”
CMPs
• CMPs are very specialized
– Assume a (highly) threaded workload
• Parallel machines are difficult to use
– Parallel programming is not (yet) commonplace
• Many problems similar to traditional multiprocessors
– Cache coherence
– Memory consistency
• Many new opportunities
– Cache sharing
– More integration
Adaptive CMPs
• To combat specialization, adapt a CMP dynamically to its current workload and system:
– Adapt caching policy (Beckmann et al., Chang et al., and more)
– Adapt cache structure (Alameldeen et al., and more)
– Adapt thread scheduling (Kihm et al., in the SMT space)
• Current idea:
– Adaptive thread scheduling from the space of un-stalled and stalled threads
– A union of single-core multithreading and runahead execution in the context of CMPs
Single-Core Multithreading
• Allow multiple (HW) threads within the same execution pipeline
– Shares processor resources: FUs, Decode, ROB, etc.
– Shares local memory resources: L1 caches, LSQ, etc.
– Can increase processor and memory utilization
[Figure: Sun’s Niagara pipeline block diagram (Kongetira et al.)]
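
A rough sketch of how a fine-grained multithreaded pipeline might choose a thread each cycle (an illustrative simplification: Niagara's actual thread-select logic is more involved, and the structure and function names here are assumptions):

```c
#include <stdbool.h>

#define NUM_THREADS 4 /* hardware thread contexts per core (assumed) */

struct hw_thread {
    bool ready; /* not stalled on a cache miss, TLB fill, etc. */
};

/* Round-robin select: starting after the last-issued thread, pick the
 * next ready context; return -1 if every thread is stalled. */
int select_thread(const struct hw_thread t[NUM_THREADS], int last)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last + i) % NUM_THREADS;
        if (t[cand].ready)
            return cand;
    }
    return -1; /* all contexts stalled: the pipeline bubbles this cycle */
}
```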
Runahead Execution
• Continue execution in the face of a cache miss
– “Checkpoint” architectural state
– Continue execution speculatively
– Convert memory accesses to prefetches
• “Runahead” prefetches can be highly accurate, and can greatly improve cache performance (Mutlu et al.)
– It is possible to issue useless prefetches
– Can be power-inefficient (Mutlu et al.)
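
A self-contained toy of the runahead sequence described above (an illustrative sketch, not the mechanism of Mutlu et al.: the trace, window size, and printed "prefetches" stand in for real microarchitectural state):

```c
#include <stdio.h>

enum op { OP_ALU, OP_LOAD };
struct inst { enum op op; int addr; };

/* Toy runahead loop over an instruction trace: when a load "misses",
 * checkpoint the PC, speculatively scan ahead turning later memory
 * accesses into prefetch hints, then restore and resume normally. */
int main(void)
{
    struct inst trace[] = {
        {OP_ALU, 0}, {OP_LOAD, 100}, {OP_ALU, 0},
        {OP_LOAD, 200}, {OP_LOAD, 300}, {OP_ALU, 0},
    };
    int n = (int)(sizeof trace / sizeof trace[0]);
    const int WINDOW = 4; /* assumed: how far past a miss runahead goes */

    for (int pc = 0; pc < n; pc++) {
        if (trace[pc].op == OP_LOAD) {       /* long-latency miss */
            int checkpoint = pc;             /* save architectural state */
            for (int ra = pc + 1; ra < n && ra <= pc + WINDOW; ra++)
                if (trace[ra].op == OP_LOAD) /* convert access to prefetch */
                    printf("prefetch addr %d\n", trace[ra].addr);
            pc = checkpoint;                 /* restore; runahead results discarded */
        }
        /* normal (committed) execution of trace[pc] happens here */
    }
    return 0;
}
```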
Runahead/Multithreaded Core Interaction
• Similar Hardware Requirements:
– Additional register files
– Additional LSQ entries
• Competition for Similar Resources:
– Execution time (processor pipeline, functional units, etc.)
– Memory bandwidth
– TLB Entries, cache space, etc.
Runahead/Multithreaded Core Interaction
• A multithreaded core in a CMP, with runahead, must make difficult scheduling decisions:
– Thread scheduling considerations:
» Which thread should run?
» Should the thread use runahead?
» How long should the thread run/runahead?
– Scheduling implications:
» Is an idle thread making forward progress at the expense of a useful thread?
» Is a thread spinning on a lock held by another thread?
» Is runahead effective for a given thread?
» Is a given thread causing performance problems elsewhere in the CMP?
Proposed Mechanism
• Track per-thread state on:
– Runahead prefetching accuracy
» High accuracy favors allowing thread to runahead
– HW-assigned thread priority
» Highly “useful” threads are preferred
• Selection criteria:
– Heuristic-guided
» Select the best priority/accuracy pair
– Probabilistically-guided
» Select a thread with likelihood proportional to its priority/accuracy
– Useful-first
» Select non-runahead threads first, then select runahead threads
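
A sketch of the probabilistically-guided option above (illustrative assumptions: weighting each thread by priority × accuracy and the struct layout are mine, not from the slides):

```c
#include <stdlib.h>

#define NUM_THREADS 4

struct thread_state {
    double priority; /* HW-assigned usefulness, e.g. in [0,1] */
    double accuracy; /* measured runahead prefetch accuracy   */
};

/* Pick a thread with probability proportional to priority * accuracy. */
int pick_thread(const struct thread_state t[NUM_THREADS])
{
    double weight[NUM_THREADS], total = 0.0;
    for (int i = 0; i < NUM_THREADS; i++) {
        weight[i] = t[i].priority * t[i].accuracy;
        total += weight[i];
    }
    if (total == 0.0)
        return 0; /* fall back to thread 0 if nothing is weighted */

    double r = total * ((double)rand() / RAND_MAX);
    for (int i = 0; i < NUM_THREADS; i++) {
        r -= weight[i];
        if (r <= 0.0)
            return i;
    }
    return NUM_THREADS - 1; /* guard against FP rounding */
}
```

Randomized selection keeps low-weight threads from starving outright while still favoring threads whose runahead episodes have paid off.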
Future Directions
• Dynamically adaptable CMPs offer several future areas of research:
– Adapt for power savings / heat dissipation
» Computation relocation, load balancing, automatic low-power modes, etc.
– Adapt to error conditions
» Dynamically allocate backup threads
– Automatically relocate threads to improve resource sharing
» Combined HW/SW/VM approach
Summary
• Latency now dominates off-chip communication
– On-chip communication isn’t far behind
– Many techniques to tolerate latency, including multithreading
• CMPs provide new challenges and opportunities to computer architects
– Latency tolerance
– Potential for power savings
• Can adapt a CMP’s behavior to its workload
– Dynamic management of shared resources