Presentations 2 & 3: MICRO 2003 Review, by Theo Theocharides

Highlights of the 36th Annual International
Symposium on Microarchitecture
December 2003
Theo Theocharides
Embedded and Mobile Computing Center
Department of Computer Science and Engineering
The Pennsylvania State University
Acknowledgements:
K. Bernstein, T. Austin, D. Blaauw, L. Peh, D.
Jimenez
Introduction
 The International Symposium on Microarchitecture is the
premier forum for discussing new microarchitecture and
software techniques
 Brings together processor architecture, compilers, and systems for technical interaction on traditional MICRO topics
    • Special emphasis on optimizations to take advantage of application-specific opportunities
    • Closer interaction between the microarchitecture and embedded architecture communities
 http://www.microarch.org
 http://www.microarch.org/micro36/
Symposium Outline
 Session 1: Voltage Scaling & Transient Faults
 Session 2: Cache
 Session 3: Power and Energy Efficient Architectures
 Session 4: Application-Specific Optimization and Analysis
 Session 5: Dynamic Optimization Systems
 Session 6: Dynamic Program Analysis and Optimization
 Session 7: Branch, Value, and Scheduling Optimization
 Session 8: Dataflow, Data Parallel, and Clustered Architectures
 Session 9: Secure and Network Processors
 Session 10: Scaling Design
Highlights
Keynote Speech
 Caution Flag Out: Microarchitecture's Race for Power Performance
    • Kerry Bernstein, IBM T. J. Watson Research Center
Interesting Papers
 Razor: A Low-Power Pipeline Based on Circuit-Level
Timing Speculation, D. Ernst, et al.
 Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik
 Fast Path-Based Neural Branch Prediction, D. Jimenez
Workshops and Tutorials
 5th Workshop on Media and Streaming Processors (MSP)
 3rd Workshop on Power-Aware Computer Systems (PACS)
 2nd Workshop on Application Specific Processors (WASP)
 Tutorial: Challenges in Embedded Computing
 Tutorial: Open Research Compiler (ORC): Proliferation of
Technologies and Tools
 Tutorial: Microarchitecture-Level Power-Performance
Simulators: Modeling, Validation, and Impact on Design
 Tutorial: Network Processors
 Tutorial: Architectural Exploration with Liberty
Keynote Speech
 Given by Kerry Bernstein, IBM T.J. Watson Research Center
 Microarchitecture and technology relationship
 We cannot continue to scale down to achieve higher frequencies
without any catch
 Increasing pipeline depth does not necessarily help
 Power consumption, process variation, soft errors, die area erosion
becoming more and more important
 Keynote explored how past technologies have influenced high speed
microarchitectures
 Keynote showed how characteristics of proposed new devices and
interconnects for lithographies beyond 90nm may shape future
machine design.
 Given the present issues and emerging trends, the role of
microarchitecture in extending CMOS performance will be more
important than ever
Where Scaling fails…
Cost of Performance in terms of power
Issues in summary:
 Feature size
 Device count (transistors per chip)
 Pipeline depth
 Power consumption increases non-linearly with scaling
 Power grows when we reduce the FO4 delay
 Delay and power affected by process variation
 Cooling creates more problems
 Cost of power diverges from performance gain
How does Microarchitecture help?
Repairs
 Monitor-based Full Chip Voltage, Clock Throttling
 Voltage Islands
    • Technology aid required here
    • Latency required
    • Low-activity FET count increase
 Clock Gating
    • So far has been a nice solution…
 Pipeline depth optimization
 Performance accelerators for ASICs (DSPs, GPUs, etc.)
    • As in, they need power anyway, so at least make them efficient
    • Software solutions should be developed here
 Compute-Informed Power Management
    • Instruction Stream
    • Dynamic Resource Assertion
    • Power Aware OS
    • Thermal Modeling
New Ideas
 “Evolutionary”
    • Strained Silicon
    • High-K Gate Dielectrics
    • Hybrid Crystal Silicon
    • Increase current drive per micron of device
    • Allow transistor density improvement
    • Introduce features which enable active static power management
 “Revolutionary”
    • Double Gated MOSFETs
    • 3D Integration
    • Molecular Computing
    • Reduce power density without architectural management
    • Eliminate power dependence on frequency
    • Return the industry to threshold and supply voltage scaling
Molecular Computing
Keynote Conclusions
 New technologies will likely help, but not
necessarily
 Power is by far the predominant factor
in scaling – we need to see what new
technologies can give us
 Staying ahead requires power-aware
systems
Razor Project (T. Austin, D. Blaauw, T. Mudge)
 We (designers/architects) have been scaling the voltage down, but
only to the point where it is proven that no errors occur under any
possible worst case
 Very conservative voltage scaling
 IDEA!
 Instead of trying to avoid ALL errors, ALLOW some errors to happen
and correct them!
 Major argument: scaling the supply voltage down by almost 0.25 V
gives an average error rate of less than 5%
 Instead of spending energy, logic, effort, time and other resources
on avoiding errors, allow a very small error percentage and gain
large power savings
 Cost of fixing errors is minimal when the error percentage is kept
under control
Razor Project
Razor Pipeline Flip-Flop
Error-Rate vs. Power Savings
IPC vs. Error Rate
DVS
Razor Advantages
 Eliminate safety margins
    • Process variation, IR-drop, temperature fluctuation, data-dependent latencies, model uncertainty
 Operate at sub-critical voltage for optimal trade-off between:
    • Energy gain from voltage scaling
    • Energy overhead from dynamic error correction
 Tune voltage for average instruction data
    • Exploit delay dependence in data
 Tolerate delay degradation due to infrequent noise events
    • SER, capacitive, inductive noise, charge sharing, floating body effect…
    • Most severe noise also least frequent
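The trade-off between energy gained from voltage scaling and energy spent on error correction can be illustrated with a toy model. The sketch below is not taken from the Razor paper: the error-rate curve, the nominal voltage `V_NOM`, the recovery penalty and the 1.0 V lower bound of the search are all hypothetical values chosen only to show that an interior optimum exists. The one physical assumption is that dynamic energy scales with the square of the supply voltage.

```python
import math

V_NOM = 1.8            # hypothetical worst-case-safe supply voltage, in volts
RECOVERY_PENALTY = 10  # hypothetical cost of one recovery, in error-free instruction energies


def error_rate(v):
    """Hypothetical timing-error rate: negligible at V_NOM, exponential below it."""
    return min(1.0, 1e-7 * math.exp(30.0 * (V_NOM - v)))


def relative_energy(v):
    """Energy per instruction relative to error-free operation at V_NOM.

    Dynamic energy scales with V^2; every error adds the recovery penalty.
    """
    return (v / V_NOM) ** 2 * (1.0 + RECOVERY_PENALTY * error_rate(v))


def best_voltage(v_min_mv=1000, v_max_mv=1800, step_mv=10):
    """Scan candidate voltages (in millivolts) and return the cheapest one, in volts."""
    candidates = [mv / 1000.0 for mv in range(v_min_mv, v_max_mv + 1, step_mv)]
    return min(candidates, key=relative_energy)


v_opt = best_voltage()
print(f"optimum near {v_opt:.2f} V, error rate {error_rate(v_opt):.4f}, "
      f"relative energy {relative_energy(v_opt):.3f}")
```

Under these assumed numbers the cheapest operating point lies below the worst-case-safe voltage and tolerates a small, nonzero error rate, which is the qualitative argument made on the slide.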
Power-driven Design of Router Microarchitectures in
On-chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik)
 Investigates on-chip network microarchitectures from a power-driven perspective
 Power-efficient network microarchitectures:
    • Segmented crossbar, cut-through crossbar and write-through buffer
 Studies and uncovers the power-saving potential of an existing
network architecture: express cube
 Reduction in network power of up to 44.9%
 NO degradation in network performance
 Improved latency-throughput in some cases
Power in NoC
    • E_wrt is the average energy dissipated when writing a flit into the input buffer
    • E_rd is the average energy dissipated when reading a flit from the input buffer
    • E_buf = E_wrt + E_rd is the average buffer energy
    • E_arb is the average arbitration energy
    • E_xb is the average crossbar traversal energy
    • E_lnk is the average link traversal energy
    • H is the number of hops traversed by this flit
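The slide's energy equation did not survive in this transcript. The definitions suggest that the per-flit energy is the per-hop sum of buffer, arbitration, crossbar and link energies multiplied by the hop count; the sketch below assumes exactly that form. The function name and the numeric values (arbitrary units) are hypothetical.

```python
def flit_energy(e_wrt, e_rd, e_arb, e_xb, e_lnk, hops):
    """Assumed per-flit energy: (E_buf + E_arb + E_xb + E_lnk) * H."""
    e_buf = e_wrt + e_rd  # buffer energy is one write plus one read
    return (e_buf + e_arb + e_xb + e_lnk) * hops


# Hypothetical per-hop energies for a flit that traverses 4 hops.
energy = flit_energy(e_wrt=1.0, e_rd=1.0, e_arb=0.5, e_xb=2.0, e_lnk=1.5, hops=4)
print(energy)
```

Read this way, each architectural method on the following slides targets one term: the crossbar variants reduce the crossbar term, the write-through buffer reduces the buffer term, and the express cube reduces the hop count.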
Architectural Methods
Segmented crossbar
Cut-through crossbar
Write-through input buffer
Express cube
Segmented Crossbar
Schematic of a matrix crossbar and a segmented crossbar.
F is flit size in bits, dw is track width, E, W, N, S are ports.
Cut-through crossbar
Schematic of cut-through crossbars
F is flit size, dw is track width, E, W, N, S are ports
Write-through buffer
(a) Bypassing without overlapping
(b) Bypassing with overlapping
(c) Schematic of a write-through
input buffer.
Express cube topology and microarchitecture
Power savings and conclusions
Importance of a power-driven approach to on-chip
network design
Need to investigate the interactions between traffic
patterns and On Chip Network architectures
Need to reach a systematic design methodology for
on-chip networks
Fast Path-Based Neural Branch Prediction
(D. Jimenez)
 Paper presented a new neural branch predictor
    • Both more accurate and much faster than previous neural predictors
 Accuracy far superior to conventional predictors
 Latency comparable to predictors from industrial designs
 Improves the instructions-per-cycle (IPC) rate of an
aggressively clocked microarchitecture by 16%
Latency - Accuracy Gain
Figure: rather than being done all at once (above), the computation
is staggered (below)
• Train a neural network with path history, and update it dynamically
• Choose the weight vectors according to the path leading up to the
branch rather than the branch address alone
• Directly reduces latency (the computation can begin before the
prediction is needed; see figure on the left)
• Improves accuracy as the predictor incorporates path information
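The idea of choosing weights by path can be sketched as a small functional model. This is a simplified illustration, not the paper's design: it computes the whole sum at prediction time instead of staggering it across earlier branches, and the class name, table size, history length, threshold `theta` and weight range are all hypothetical.

```python
class PathBasedPerceptron:
    """Functional (non-pipelined) model of a path-based perceptron predictor."""

    def __init__(self, rows=256, history_len=8, theta=30, weight_max=127):
        self.rows = rows
        self.h = history_len
        self.theta = theta      # training threshold (hypothetical value)
        self.wmax = weight_max  # weights saturate at +/- weight_max
        # weights[r][0] is a bias weight; weights[r][i] pairs with history bit i
        self.weights = [[0] * (history_len + 1) for _ in range(rows)]
        self.outcomes = [-1] * history_len  # +1 taken, -1 not taken; newest first
        self.path = [0] * history_len       # addresses of recent branches; newest first

    def _output(self, pc):
        y = self.weights[pc % self.rows][0]
        for i in range(self.h):
            row = self.path[i] % self.rows  # row chosen by the path, not by pc
            y += self.weights[row][i + 1] * self.outcomes[i]
        return y

    def predict(self, pc):
        return self._output(pc) >= 0

    def _bump(self, row, col, delta):
        w = self.weights[row][col] + delta
        self.weights[row][col] = max(-self.wmax, min(self.wmax, w))

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.theta:
            self._bump(pc % self.rows, 0, t)
            for i in range(self.h):
                self._bump(self.path[i] % self.rows, i + 1, t * self.outcomes[i])
        self.outcomes = [t] + self.outcomes[:-1]
        self.path = [pc] + self.path[:-1]


def accuracy_on_alternating_branch(iterations=200):
    """Accuracy over the second half of a taken/not-taken alternating branch."""
    predictor = PathBasedPerceptron()
    correct = 0
    for k in range(iterations):
        taken = (k % 2 == 0)
        if k >= iterations // 2 and predictor.predict(0x40) == taken:
            correct += 1
        predictor.update(0x40, taken)
    return correct / (iterations - iterations // 2)


print(accuracy_on_alternating_branch())
```

Because each term of the sum depends only on an earlier branch's address and outcome, a hardware version can accumulate the partial sums as those branches are fetched, which is where the latency reduction comes from.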
Comparative Results – Misprediction rate
IPC per hardware cost
Faster and more accurate than existing neural branch predictors
Conclusion
 Overview of MICRO36
 Conference lasted 5 days – impossible to review in half an hour!
 If you are interested, you should read the proceedings on-line at
http://www.microarch.org/micro36
The Call for Papers for MICRO37 is available at
http://www.microarch.org/micro37
DEADLINE FOR PAPER SUBMISSION: May 28th, 2004
Links to the papers reviewed
 Razor

http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf
 NoC Router Power-Driven Design

http://www.microarch.org/micro36/html/pdf/wangPowerDrivenDesign.pdf
 Fast-Path Neural Branch Predictor

http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf
Questions?
THANK YOU !
36th Annual International
Symposium on Micro-Architecture
- A Review
Rajaraman Ramanarayanan
Talk Overview
 Session covered in this presentation
 Review papers
    • Architectural vulnerability factors
        • Introduction
        • Proposed technique
        • Soft error terminology
        • Computing AVFs
        • Results
        • Conclusion
    • L2-Miss-Driven Variable Supply-Voltage Scaling
        • Introduction
        • Proposed Solution
        • Transitions
        • Results
        • Achievements
Session Covered
 Voltage Scaling & Transient Faults
    • Methodology to compute architectural vulnerability factors
    • VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power
Architectural Vulnerability Factors
(S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin)
 Single-event upsets from particle strikes have become a key
challenge in microprocessor design.
 Soft errors due to cosmic rays are making an impact in industry.
    • In 2000, Sun Microsystems acknowledged cosmic ray strikes on
unprotected cache memories as the cause of random crashes at
major customer sites in its flagship Enterprise server line
    • The fear of cosmic ray strikes prompted Fujitsu to protect 80% of
its 200,000 latches in its recent SPARC processor with some
form of error detection
 Architects require accurate estimates of processor error rates to make
appropriate cost/reliability trade-offs.
Introduction
 All existing approaches introduce a significant penalty in
performance, power, die size, and design time
 Tools and techniques to estimate processor transient error
rates are not readily available or fully understood.
 Estimates are needed early in the design cycle.
 In this paper:
    • Define architectural vulnerability factor (AVF)
    • Identify numerous cases, such as prefetches, dynamically dead
code, and wrong-path instructions, in which a fault will not affect
correct execution
Proposed technique
 Not all faults in a microarchitectural structure affect the final
outcome of a program.
 Architectural vulnerability factor (AVF)
    • Probability that a fault in that particular structure will result in an error in
the final output of the program
 Overall error rate = raw fault rate × AVF
 Can examine the relative contributions of various structures
    • Identify cost-effective areas to employ fault protection techniques
 Tracks the subset of processor state bits required for architecturally
correct execution (ACE)
    • A fault in a storage cell containing one of these bits affects output
 For example, a branch predictor’s AVF is 0%
    • Predictor bits are always un-ACE bits
 Bits in the committed PC are always ACE bits, so the committed PC has an AVF of 100%
Soft error terminology
 Error budget expressed in terms of:
    • Mean Time Between Failures (MTBF)
    • Failures In Time (FIT), inversely related to MTBF
 Errors are often classified as:
    • Undetected: silent data corruption (SDC)
    • Detected: detected unrecoverable errors (DUE)
 Effective FIT rate for a structure is the product of its raw circuit
FIT rate and the structure’s vulnerability factor
 Effective FIT rate per bit is influenced by several vulnerability
factors
    • Also known as de-rating factors or soft error sensitivity factors
 Examples include the timing vulnerability factor for latches and
the AVF
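The terminology above can be made concrete with a small calculation. The sketch assumes the usual convention that one FIT is one failure per billion device-hours; the raw FIT rates and structure names are hypothetical, and only the AVF values of 0%, 28% and 100% come from these slides.

```python
HOURS_PER_YEAR = 24 * 365
HOURS_PER_FIT_WINDOW = 1e9  # one FIT = one failure per 10^9 hours


def effective_fit(raw_fit, avf):
    """Effective FIT rate = raw circuit FIT rate x vulnerability factor."""
    return raw_fit * avf


def fit_to_mtbf_years(fit):
    """MTBF is inversely related to the FIT rate."""
    return HOURS_PER_FIT_WINDOW / (fit * HOURS_PER_YEAR)


# (raw FIT, AVF) per structure; the raw FIT values are made up for illustration.
structures = {
    "instruction_queue": (100.0, 0.28),
    "branch_predictor": (200.0, 0.0),
    "committed_pc": (5.0, 1.0),
}

per_structure = {name: effective_fit(raw, avf) for name, (raw, avf) in structures.items()}
total_fit = sum(per_structure.values())

for name, fit in sorted(per_structure.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {fit:.1f} FIT")
print(f"total: {total_fit:.1f} FIT, MTBF about {fit_to_mtbf_years(total_fit):.0f} years")
```

Ranking structures by effective FIT rather than raw FIT shows why a large but un-ACE structure such as a branch predictor does not need protection, while a small structure with a high AVF may.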
Silent data corruption in the future
Identifying Un-ACE Bits
 Bits that do not affect final program output
 Analyzed a uniprocessor system
 Microarchitectural un-ACE bits
    • Idle or invalid state
    • Mis-speculated state
    • Predictor structures
    • Ex-ACE state
 Architectural un-ACE bits
    • NOP instructions
    • Performance-enhancing instructions
    • Predicated-false instructions
    • Dynamically dead instructions
    • Logical masking
Computing AVF
 AVFs for storage cells: fraction of time an upset in that cell
will cause a visible error in the final output of a program
 AVF equation for a hardware structure: the average AVF of all bits in that structure

$$\mathrm{AVF} = \frac{\sum \text{residency (in cycles) of all ACE bits in the structure}}{\text{total number of bits in the hardware structure} \times \text{total execution cycles}}$$

 Little’s Law: $N = B \times L$, where
    • $N$ = average number of bits in a box,
    • $B$ = average bandwidth per cycle into the box, and
    • $L$ = average latency of an individual bit through the box.

$$\mathrm{AVF} = \frac{B_{ace} \times L_{ace}}{\text{total number of bits in the hardware structure}}$$
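The two AVF expressions are equivalent, which a short numeric check makes visible. The sketch below uses made-up numbers (three ACE objects of 41 bits each in a 410-bit structure observed for 100 cycles); the function names are hypothetical.

```python
def avf_from_residency(ace_objects, total_bits, total_cycles):
    """AVF as summed ACE-bit residency over (bits in structure x execution cycles).

    ace_objects is a list of (bits, resident_cycles) pairs.
    """
    residency = sum(bits * cycles for bits, cycles in ace_objects)
    return residency / (total_bits * total_cycles)


def avf_from_littles_law(ace_objects, total_bits, total_cycles):
    """AVF as B_ace x L_ace over the number of bits in the structure."""
    ace_bits = sum(bits for bits, _ in ace_objects)
    b_ace = ace_bits / total_cycles  # average ACE bits entering per cycle
    l_ace = sum(bits * cycles for bits, cycles in ace_objects) / ace_bits  # average ACE latency
    return b_ace * l_ace / total_bits


objects = [(41, 10), (41, 30), (41, 20)]
direct = avf_from_residency(objects, total_bits=410, total_cycles=100)
little = avf_from_littles_law(objects, total_bits=410, total_cycles=100)
print(direct, little)
```

Both routes give the same value because B_ace × L_ace is just the summed ACE residency divided by the number of execution cycles.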
Computing AVFs using a Performance Model
 AVFs computed for two structures, the instruction queue and the execution
units, using the Asim performance model framework
 The following information is needed:
    • Sum of all residence cycles of all ACE bits of the objects resident
in the structure during the execution of the program
    • Total execution cycles for which we observe the ACE bits’
residence time
    • Total number of bits in a hardware structure
 AVF algorithm
    • Record the residence time of the instruction in the structure as
an instruction flows through different structures in the pipeline
    • Update the structures the instruction flowed through
    • Put the instruction in a post-commit analysis window to
        • Determine if the instruction is dynamically dead, or
        • Determine if there are any bits that are logically masked
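The algorithm can be sketched on a toy trace. This is a heavily simplified illustration rather than the Asim implementation: it models a single structure with equal-width entries (so bits cancel and entry-cycles suffice), treats every bit of an ACE instruction as ACE, detects only first-level dynamically dead register writes, and ignores logical masking. The `Instr` record, the function names and the trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    op: str
    dest: Optional[str]     # destination register, if any
    srcs: Tuple[str, ...]   # source registers
    enter: int              # cycle the instruction entered the structure
    leave: int              # cycle it left the structure


def is_first_level_dead(window, i):
    """True if the result of window[i] is overwritten before anyone reads it."""
    dest = window[i].dest
    if dest is None:
        return False
    for later in window[i + 1:]:
        if dest in later.srcs:   # read first: the value is live
            return False
        if later.dest == dest:   # overwritten first: the value is dead
            return True
    return False                 # unknown within the window: conservatively ACE


def structure_avf(window, entries, total_cycles):
    """Sum ACE residency over the post-commit window and normalize."""
    ace_entry_cycles = 0
    for i, ins in enumerate(window):
        if ins.op == "nop" or is_first_level_dead(window, i):
            continue             # un-ACE: contributes no vulnerable residency
        ace_entry_cycles += ins.leave - ins.enter
    return ace_entry_cycles / (entries * total_cycles)


TRACE = [
    Instr("add", "r1", ("r2", "r3"), 0, 4),    # dead: r1 is overwritten below
    Instr("nop", None, (), 1, 3),
    Instr("add", "r1", ("r4", "r5"), 2, 6),    # live: r1 is read by the store
    Instr("store", None, ("r6", "r1"), 3, 8),
]

avf = structure_avf(TRACE, entries=4, total_cycles=8)
print(avf)
```

Classification has to wait until after commit because whether a result is dead depends on instructions that come later, which is why the slide's algorithm uses a post-commit analysis window.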
Methodology for evaluation
 Use an Itanium2®-like IA64 processor [14] scaled to current
technology
 Modeled in detail in Asim performance model framework.
Results – Program level Decomposition
Results
 Program-level Decomposition
    • About 45% of the instructions are ACE. The remaining 55% are
un-ACE instructions
    • Some of these un-ACE instructions still contain ACE bits, such
as the op-code bits of prefetch instructions
    • UNKNOWN and NOT_PROCESSED instructions account for
about 1% of the total instructions
    • NOPs, predicated-false instructions, and prefetch instructions
account for 26%, 6.7%, and 1.5%, respectively
    • FDD_reg and FDD_mem denote first-level dynamically dead instructions
whose results are written back to registers and memory, respectively
        • Account for about 9.4% and 2% of the dynamic instructions
        • IA64 has a large number of registers
    • TDD_reg and TDD_mem (transitively dynamically dead) account for
6.6% and 1.6% of the dynamic instructions
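As a consistency check, the individual categories quoted above can be summed to confirm that they account for the roughly 55% of instructions that are not ACE. The dictionary keys are informal labels; the percentages are the ones on the slide.

```python
# Percentages of dynamic instructions, as quoted on the slide.
non_ace_breakdown = {
    "UNKNOWN / NOT_PROCESSED": 1.0,
    "NOP": 26.0,
    "predicated false": 6.7,
    "prefetch": 1.5,
    "FDD_reg": 9.4,
    "FDD_mem": 2.0,
    "TDD_reg": 6.6,
    "TDD_mem": 1.6,
}

total_non_ace = sum(non_ace_breakdown.values())
dynamically_dead = sum(v for k, v in non_ace_breakdown.items() if k[:3] in ("FDD", "TDD"))
print(f"non-ACE total: {total_non_ace:.1f}%, dynamically dead: {dynamically_dead:.1f}%")
```

The categories add up to 54.8%, matching the slide's rounded 55%, with NOPs and dynamically dead instructions as the two largest contributors.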
AVF for instruction queue
 Shows what percentage of cycles a storage cell in the
instruction queue contains ACE and un-ACE bits.
 Instruction queue contains an ACE bit about 28% of the time.
    • Thus the AVF of the instruction queue is 28%.
 Floating point programs, in general, have higher AVFs
compared to integer programs (31% vs. 25%, respectively)
    • Long-latency instructions and few branch mispredictions
    • Use the instruction queue more effectively than integer
programs, leading to a higher AVF
 Apply Little’s law:
    • Number of ACE instructions in the queue = ACE IPC (the bandwidth) ×
ACE latency (the average number of cycles an instruction can be
considered to be in ACE state)
 The ACE IPC and ACE latency are obtained from the performance model
AVFs for the Execution Units
 Four integer pipes and two floating point pipes
    • 50% control latches and 50% datapath latches
 11% of the cycles are spent processing ACE instructions
    • Significantly lower
        • Instructions must wait in the instruction queue
        • Speculatively issued instructions succeeding cache-miss loads
must replay through the instruction queue
        • The floating point pipes are mostly idle while executing integer
code
        • Implemented logical masking functions for a small but important
subset
Conclusion
 Estimated AVFs using a novel approach that tracks bits
required for architecturally correct execution (ACE) and un-ACE bits
 Computed the AVF for the instruction queue and execution
units of an Itanium2®-like IA64 processor.
 Further refinement could lower the AVF estimates, but its
contribution is expected to be small
 Can estimate the FIT rate of an entire processor early in the
design cycle
 Can help designers choose the appropriate error detection or
correction schemes
 Can lower the FIT rate of the chip iteratively by adding more
and more error protection, using AVF estimates as a guide.
L2-Miss Driven VSV for low power
(H. Li, C. Cher, T. N. Vijaykumar, K. Roy)
 Idea: Upon an L2 miss, the pipeline performs independent
computations, but almost always ends up stalled, waiting for
data despite out-of-order issue and other latency-hiding
techniques
 During an L2 miss, scale down the supply voltage and carry out
independent computations at lower speed instead
 Performance degrades, however, if there are enough independent
computations to overlap with the delay of the cache miss
 Returning to full speed, however, will likely reduce power
savings if there are multiple misses and insufficient
independent computations to overlap with the misses
Proposed solution
 Two state machines tracking parallelism on the fly
 Scale down voltage depending on parallelism of the two
events
 Factors considered
    • Circuit-level complexities, reducing VSV to two voltages
    • Stability
    • Signal propagation speed issues
    • Energy overhead issues in RAMs and register files
 Average reduction of processor power is 7% while
performance degradation is 0.9%
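A two-voltage controller driven by L2 misses and observed parallelism can be sketched as follows. This is only an illustration of the general idea, not the paper's state machines: the class name, the use of per-cycle issue count as the parallelism signal, the thresholds and the window lengths are all assumptions.

```python
class VsvController:
    """Toy two-level supply-voltage controller (all parameters hypothetical)."""

    def __init__(self, low_issue_threshold=1, down_window=3, up_window=2):
        self.low_issue_threshold = low_issue_threshold
        self.down_window = down_window  # consecutive stalled cycles before scaling down
        self.up_window = up_window      # consecutive busy cycles before scaling up
        self.mode = "HIGH"
        self._streak = 0

    def step(self, outstanding_l2_misses, issued):
        """Advance one cycle and return the supply mode for that cycle."""
        low_parallelism = issued <= self.low_issue_threshold
        if self.mode == "HIGH":
            # Down path: a miss is pending and little independent work is issuing.
            wants_switch = outstanding_l2_misses > 0 and low_parallelism
            window = self.down_window
        else:
            # Up path: the misses have returned or enough work is available again.
            wants_switch = outstanding_l2_misses == 0 or not low_parallelism
            window = self.up_window
        self._streak = self._streak + 1 if wants_switch else 0
        if self._streak >= window:
            self.mode = "LOW" if self.mode == "HIGH" else "HIGH"
            self._streak = 0
        return self.mode


# (outstanding L2 misses, instructions issued) for eight consecutive cycles.
cycles = [(0, 4), (1, 4), (1, 1), (1, 0), (1, 0), (1, 0), (0, 0), (0, 3)]
controller = VsvController()
modes = [controller.step(misses, issued) for misses, issued in cycles]
print(modes)
```

Requiring several consecutive cycles before each transition is one simple way to avoid switching voltages on short-lived stalls, where the transition overhead would outweigh the savings.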
VSV Structure
Transitions
High-To-Low transition
Low-To-High transition
VSV - Results
VSV - Achievements
 Power savings with minimal performance degradation
 Complexity of circuits taken into consideration
 FSMs monitor the level of parallelism between independent
operations and the delay caused by an L2 cache miss
 VSV achieves a 4% reduction in power across all SPEC2K
benchmarks
 VSV achieves a 12% reduction for the benchmarks with high L2 miss
rates
Questions
 Any questions or feedback?