Presentations 2 & 3: MICRO 2003 Review, by Theo Theocharides
Highlights of the 36th Annual International
Symposium on Microarchitecture
December 2003
Theo Theocharides
Embedded and Mobile Computing Center
Department of Computer Science and Engineering
The Pennsylvania State University
Acknowledgements:
K. Bernstein, T. Austin, D. Blaauw, L. Peh, D.
Jimenez
Introduction
The International Symposium on Microarchitecture is the
premier forum for discussing new microarchitecture and
software techniques
Brings together processor architecture, compilers, and systems researchers for technical interaction on traditional MICRO topics
Special emphasis on optimizations that take advantage of application-specific opportunities
Bridges the microarchitecture and embedded-architecture communities
http://www.microarch.org
http://www.microarch.org/micro36/
Symposium Outline
Session 1: Voltage Scaling & Transient Faults
Session 2: Cache
Session 3: Power and Energy Efficient Architectures
Session 4: Application-Specific Optimization and Analysis
Session 5: Dynamic Optimization Systems
Session 6: Dynamic Program Analysis and Optimization
Session 7: Branch, Value, and Scheduling Optimization
Session 8: Dataflow, Data Parallel, and Clustered Architectures
Session 9: Secure and Network Processors
Session 10: Scaling Design
Highlights
Keynote Speech
Caution Flag Out: Microarchitecture's Race for Power
Performance
Kerry Bernstein, IBM T. J. Watson Research Center
Interesting Papers
Razor: A Low-Power Pipeline Based on Circuit-Level
Timing Speculation, D. Ernst et al.
Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik
Fast Path-Based Neural Branch Prediction, D. Jimenez
Workshops and Tutorials
5th Workshop on Media and Streaming Processors (MSP)
3rd Workshop on Power-Aware Computer Systems (PACS)
2nd Workshop on Application Specific Processors (WASP)
Tutorial: Challenges in Embedded Computing
Tutorial: Open Research Compiler (ORC): Proliferation of
Technologies and Tools
Tutorial: Microarchitecture-Level Power-Performance
Simulators: Modeling, Validation, and Impact on Design
Tutorial: Network Processors
Tutorial: Architectural Exploration with Liberty
Keynote Speech
Given by Kerry Bernstein, IBM T.J. Watson Research Center
Microarchitecture and technology relationship
We cannot continue to scale down to achieve higher frequencies without a catch
Increasing pipeline depth does not necessarily help
Power consumption, process variation, soft errors, and die-area erosion are becoming more and more important
The keynote explored how past technologies have influenced high-speed microarchitectures
It showed how characteristics of proposed new devices and interconnects for lithographies beyond 90nm may shape future machine design
Given the present issues and incoming trends, the role of microarchitecture in extending CMOS performance will be more important than ever
Where Scaling fails…
Cost of Performance in terms of power
Issues in summary:
Feature size
Device count (transistors per chip)
Pipeline depth
Power consumption increases non-linearly with scaling
Power grows when we reduce the FO4 delay per stage (deeper pipelines)
Delay and power are affected by process variation
Cooling creates more problems
The cost of power diverges from the performance gain
How does Microarchitecture help?
Repairs
Monitor-based Full Chip Voltage, Clock Throttling
Voltage Islands
Requires technology support
Incurs latency
Increases the low-activity FET count
Clock Gating
So far has been a nice solution…
Pipeline depth optimization
Performance accelerators for ASICs (DSPs, GPUs, etc.)
They need power anyway, so at least make them efficient
Software solutions should be developed here
Compute-Informed Power Management
Instruction Stream
Dynamic Resource Assertion
Power Aware OS
Thermal Modeling
New Ideas
“Evolutionary”
Strained Silicon
High-K Gate Dielectrics
Hybrid Crystal Silicon
Increase current drive/micron of device
Allow transistor density improvement
Introduce features that enable active static-power management
“Revolutionary”
Double Gated MOSFETs
3D Integration
Molecular Computing
Reduce Power Density without architectural management
Eliminate power dependence on frequency
Return the industry to threshold and supply voltage scaling
Keynote Conclusions
New technologies will likely help, but not necessarily enough
Power is by far the predominant factor in scaling – we need to see what new technologies can give us
Staying ahead requires power-aware systems
Razor Project (T. Austin, D. Blaauw, T. Mudge)
We (designers/architects) have been scaling the voltage down, but only to the point where, under all possible worst cases, there are provably no errors
Very conservative voltage scaling
IDEA!
Instead of trying to avoid ALL errors, ALLOW some errors to happen
and correct them!
Major argument: scaling the supply voltage down by almost 0.25 V gives an average error rate of less than 5%
Instead of spending energy, logic, effort, time, and other useful resources on avoiding errors entirely, allow a very small error percentage to happen and gain huge power savings
The cost of fixing errors is minimal when the error percentage is kept under control (a toy sketch of the trade-off follows)
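To make the timing-speculation idea concrete, here is a minimal Python sketch of Razor-style error detection, assuming a main flip-flop that samples at the clock edge and a shadow latch that samples the same signal after a delayed clock, when the value is guaranteed correct. All names, the 3% error rate, and the one-cycle recovery penalty are illustrative, not taken from the paper.

```python
# Minimal sketch of Razor-style timing speculation (illustrative only).
# The main flip-flop samples at the clock edge and may capture a
# late-arriving (wrong) value; the shadow latch samples after a delayed
# clock and always holds the correct value. A mismatch flags an error.
import random

def razor_stage(correct_value, value_at_clock_edge):
    """Return (committed_value, error_detected, extra_cycles)."""
    main_ff = value_at_clock_edge
    shadow_latch = correct_value
    if main_ff != shadow_latch:
        # Timing error: restore from the shadow latch and pay a small
        # recovery penalty (modeled here as one pipeline bubble).
        return shadow_latch, True, 1
    return main_ff, False, 0

# Toy experiment: at a reduced supply voltage a small fraction of
# operations arrive late; if errors stay rare, the cycle overhead of
# correction stays far below the energy saved by the lower voltage.
random.seed(0)
errors, cycles, N = 0, 0, 10_000
for _ in range(N):
    correct = random.randrange(256)
    late = random.random() < 0.03          # assumed 3% timing-error rate
    sampled = (correct ^ 0xFF) if late else correct
    _, err, extra = razor_stage(correct, sampled)
    errors += err
    cycles += 1 + extra

print(f"error rate: {errors / N:.2%}, cycle overhead: {cycles / N - 1:.2%}")
```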
Razor Project
[Figures: Razor pipeline flip-flop; error rate vs. power savings; IPC vs. error rate; dynamic voltage scaling (DVS)]
Razor Advantages
Eliminate safety margins
Process variation, IR-drop, temperature fluctuation, data-dependent latencies, model uncertainty
Operate at sub-critical voltage for optimal trade-off between:
Energy gain from voltage scaling
Energy overhead from dynamic error correction
Tune voltage for average instruction data
Exploit delay dependence in data
Tolerate delay degradation due to infrequent noise events
SER, capacitive, inductive noise, charge sharing, floating body
effect…
Most severe noise also least frequent
Power-driven Design of Router Microarchitectures in
On-chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik)
Investigates on-chip network microarchitectures from a power-driven perspective
Power-efficient network microarchitectures:
segmented crossbar, cut-through crossbar and write-through buffer
Studies and uncovers the power saving potential of an existing
network architecture: Express cube
Reduction in network power of up to 44.9%
NO degradation in network performance
Improved latency and throughput in some cases
Power in NoC
E_wrt is the average energy dissipated when writing a flit into the input buffer
E_rd is the average energy dissipated when reading a flit from the input buffer
E_buf = E_wrt + E_rd is the average buffer energy
E_arb is the average arbitration energy
E_xb is the average crossbar traversal energy
E_lnk is the average link traversal energy
H is the number of hops traversed by the flit
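Combining these terms gives the energy a flit dissipates over its route; this is a reconstruction from the definitions above rather than the paper's exact equation (in particular, the number of link traversals relative to H depends on the counting convention):

$$E_{\mathrm{flit}} = H \times (E_{\mathrm{buf}} + E_{\mathrm{arb}} + E_{\mathrm{xb}} + E_{\mathrm{lnk}}), \qquad E_{\mathrm{buf}} = E_{\mathrm{wrt}} + E_{\mathrm{rd}}$$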
Architectural Methods
Segmented crossbar
Cut-through crossbar
Write-through input buffer
Express cube
Segmented Crossbar
Schematic of a matrix crossbar and a segmented crossbar.
F is flit size in bits, dw is track width, E, W, N, S are ports.
Cut-through crossbar
Schematic of cut-through crossbars
F is flit size, dw is track width, E, W, N, S are ports
Write-through buffer
(a) Bypassing without overlapping
(b) Bypassing with overlapping
(c) Schematic of a write-through
input buffer.
Express cube topology and microarchitecture
Power savings and conclusions
Importance of a power-driven approach to on-chip
network design
Need to investigate the interactions between traffic patterns and on-chip network architectures
Need to reach a systematic design methodology for
on-chip networks
Fast Path-Based Neural Branch Prediction
(D. Jimenez)
Paper presented a new neural branch predictor
both more accurate and much faster than previous
neural predictors
Accuracy far superior to conventional predictors
Latency comparable to predictors from industrial
designs
Improves the instructions-per-cycle (IPC) rate of an
aggressively clocked microarchitecture by 16%
Latency - Accuracy Gain
[Figure: rather than being computed all at once, the prediction is staggered along the path]
Train a neural network with path history, and update it dynamically
Choose the weight vectors according to the path leading up to the branch rather than the branch address alone
Directly reduces latency (the computation can begin before the prediction is needed)
Improves accuracy as the predictor incorporates path information (see the sketch below)
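A minimal Python sketch of the path-based perceptron idea; the table size, history length, and threshold are illustrative choices, not the paper's exact design. Each of the last H branches on the path selects one weight by its own address, so the running sum can be accumulated as those branches are encountered, hiding the adder latency:

```python
# Illustrative path-based neural (perceptron) branch predictor.
# Weight i is selected by the address of the i-th most recent branch on
# the path rather than by the current branch address alone, so partial
# sums can be computed ahead of the prediction.

H = 8             # assumed path/history length
TABLE = 256       # assumed number of weight-table rows
weights = [[0] * (H + 1) for _ in range(TABLE)]
path = [0] * H    # addresses of the H most recent branches
history = [1] * H # their outcomes encoded as +1 / -1

def predict(pc):
    # Bias weight for this branch plus one weight per path element.
    s = weights[pc % TABLE][0]
    for i in range(H):
        s += weights[path[i] % TABLE][i + 1] * history[i]
    return s  # predict taken iff s >= 0

def train(pc, s, taken, theta=20):
    outcome = 1 if taken else -1
    # Perceptron rule: update on a misprediction or a weak (small-|s|) sum.
    if (s >= 0) != taken or abs(s) <= theta:
        weights[pc % TABLE][0] += outcome
        for i in range(H):
            weights[path[i] % TABLE][i + 1] += outcome * history[i]
    # Shift this branch into the path and history registers.
    path.insert(0, pc); path.pop()
    history.insert(0, outcome); history.pop()

# Usage: a loop branch taken 9 times out of 10 is learned quickly.
for n in range(1000):
    taken = (n % 10 != 9)
    s = predict(0x4004)
    train(0x4004, s, taken)
print("predicts taken:", predict(0x4004) >= 0)
```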
Comparative Results – Misprediction rate
IPC per hardware cost
Faster and more accurate than existing neural branch predictors
Conclusion
Overview of MICRO36
The conference lasted 5 days – impossible to review in half an hour!
If you are interested, you should read the proceedings on-line at
http://www.microarch.org/micro36
The Call For Papers for MICRO37 is available at
http://www.microarch.org/micro37
DEADLINE FOR PAPER SUBMISSION: May 28th, 2004
Links to the papers reviewed
Razor
http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf
NoC Router Power-Driven Design
http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf
Fast-Path Neural Branch Predictor
http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf
Questions?
THANK YOU !
36th Annual International
Symposium on Microarchitecture
- A Review
Rajaraman Ramanarayanan
Talk Overview
Session covered in this presentation
Review papers
Architectural vulnerability factors
Introduction
Proposed technique
Soft error terminology
Computing AVFs
Results
Conclusion
L2-Miss-Driven Variable Supply-Voltage Scaling
Introduction
Proposed Solution
Transitions
Results
Achievements
Session Covered
Voltage Scaling & Transient Faults
Methodology to compute architectural vulnerability factors (AVFs)
VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for
Low Power
Architectural Vulnerability Factors
(S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin)
Single-event upsets from particle strikes have become a key
challenge in microprocessor design.
Soft errors due to cosmic rays are making an impact in industry.
In 2000, Sun Microsystems acknowledged cosmic ray strikes on
unprotected cache memories as the cause of random crashes at
major customer sites in its flagship Enterprise server line
The fear of cosmic ray strikes prompted Fujitsu to protect 80% of
its 200,000 latches in its recent SPARC processor with some
form of error detection
Designers require accurate estimates of processor error rates to make appropriate cost/reliability trade-offs.
Introduction
All existing approaches introduce a significant penalty in
performance, power, die size, and design time
Tools and techniques to estimate processor transient error
rates are not readily available or fully understood.
Estimates are needed early in the design cycle.
In this paper:
Define the architectural vulnerability factor (AVF)
Identify numerous cases, such as pre-fetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution
Proposed technique
Not all faults in a micro-architectural structure affect the final
outcome of a program.
Architectural Vulnerability factor (AVF)
probability that a fault in that particular structure will result in an error in
the final output of the program
The overall error rate = product of raw fault rate and AVF.
Can examine the relative contributions of various structures
identify cost-effective areas to employ fault protection techniques
Tracks the subset of processor state bits required for architecturally
correct execution (ACE)
A fault in a storage cell containing one of these bits affects the program output
For example, a branch predictor's AVF is 0%
Predictor bits are always un-ACE bits
Bits in the committed PC are always ACE bits, so the committed PC has an AVF of 100%
Soft error terminology
Error budget expressed in terms of:
Mean Time Between Failures (MTBF).
Failures In Time (FIT) - failures per billion device-hours; inversely related to MTBF.
Errors are often classified as:
Undetected - silent data corruption (SDC)
Detected - detected unrecoverable errors (DUE)
Effective FIT rate for a structure is the product of its raw circuit
FIT rate and the structure’s vulnerability factor
The effective FIT rate per bit is influenced by several vulnerability factors
Also known as de-rating factors or soft-error sensitivity factors
Examples include the timing vulnerability factor (TVF) for latches and the AVF (unit conversion below)
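For reference, one FIT is one failure per billion (10^9) device-hours, so the two budget units and the effective-rate product above can be written as:

$$\mathrm{MTBF\ (hours)} = \frac{10^{9}}{\mathrm{FIT}}, \qquad \mathrm{FIT}_{\mathrm{effective}} = \mathrm{FIT}_{\mathrm{raw}} \times \mathrm{vulnerability\ factor}$$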
[Figure: silent data corruption projected to grow in future technology generations]
Identifying Un-ACE Bits
Bits that do not affect final program output
Analyzed a uniprocessor system
Micro-architectural Un-ACE bits
Idle or Invalid State.
Mis-speculated State.
Predictor Structures.
Ex-ACE State.
Architectural Un-ACE Bits
NOP instructions.
Performance-enhancing instructions.
Predicated-false instructions.
Dynamically dead instructions.
Logical masking.
Computing AVF
AVF for a storage cell: the fraction of time an upset in that cell will cause a visible error in the final output of a program
AVF equation for a hardware structure: the average AVF over all its bits,

$$\mathrm{AVF} = \frac{\sum \text{residency (in cycles) of all ACE bits in the structure}}{\text{total number of bits in the structure} \times \text{total execution cycles}}$$

Little's Law: N = B × L, where
N = average number of bits in a box,
B = average bandwidth per cycle into the box, and
L = average latency of an individual bit through the box.
Applying Little's Law to ACE bits gives

$$\mathrm{AVF} = \frac{B_{\mathrm{ACE}} \times L_{\mathrm{ACE}}}{\text{total number of bits in the hardware structure}}$$
Computing AVFs using a Performance Model
Computed AVFs for two structures, the instruction queue and the execution units, using the Asim performance model framework
The following information is needed:
Sum of all residence cycles of all ACE bits of the objects resident
in the structure during the execution of the program,
Total execution cycles for which we observe the ACE bits’
residence time, and
Total number of bits in a hardware structure.
AVF algorithm
Record the residence time of the instruction in each structure as it flows through the pipeline
Update the structures the instruction flowed through
Place the instruction in a post-commit analysis window to:
Determine whether the instruction is dynamically dead, or
Determine whether any of its bits are logically masked
(a minimal accounting sketch follows)
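As a concrete illustration of this accounting, here is a sketch that accumulates ACE-bit residency per structure and divides it out at the end; the structure size, field names, and retire() interface are hypothetical, not the Asim framework's actual API.

```python
# Sketch of AVF accounting: sum the residency of ACE bits in a structure,
# then divide by (bits in the structure x total execution cycles).
from dataclasses import dataclass

@dataclass
class StructureStats:
    total_bits: int          # bits in the hardware structure
    ace_bit_cycles: int = 0  # accumulated residency of ACE bits

    def avf(self, total_cycles):
        return self.ace_bit_cycles / (self.total_bits * total_cycles)

iq = StructureStats(total_bits=64 * 128)  # assumed 64 entries x 128 bits

def retire(instr):
    # Post-commit analysis window: count the instruction's bits as ACE
    # only if it is not dynamically dead, minus any logically masked bits.
    if instr["dynamically_dead"]:
        return
    ace_bits = instr["bits"] - instr["masked_bits"]
    iq.ace_bit_cycles += ace_bits * instr["iq_residency_cycles"]

# Usage with two toy instructions over a 1000-cycle run:
retire({"bits": 128, "masked_bits": 16, "dynamically_dead": False,
        "iq_residency_cycles": 9})
retire({"bits": 128, "masked_bits": 0, "dynamically_dead": True,
        "iq_residency_cycles": 5})
print(f"instruction-queue AVF: {iq.avf(1000):.4%}")
```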
Methodology for evaluation
Use an Itanium2®-like IA64 processor [14] scaled to current
technology
Modeled in detail in Asim performance model framework.
Results – Program-level Decomposition
We get about 45% ACE instructions. The rest—55% of the
instructions—are un-ACE instructions
Some of these un-ACE instructions still contain ACE bits, such as the op-code bits of pre-fetch instructions
UNKNOWN and NOT_PROCESSED instructions account for
about 1% of the total instructions
NOPs, predicated false instructions, and prefetch instructions
account for 26%, 6.7%, and 1.5%, respectively.
FDD_reg and FDD_mem denote first-level dynamically dead (FDD) results that are written back to registers and memory, respectively
They account for about 9.4% and 2% of the dynamic instructions
IA64 has a large number of registers
TDD_reg and TDD_mem (transitively dynamically dead) account for 6.6% and 1.6% of the dynamic instructions
AVF for the Instruction Queue
Shows what percentage of cycles a storage cell in the
instruction queue contains ACE and un-ACE bits.
Instruction queue contains an ACE bit about 28% of the time.
Thus AVF of the instruction queue is 28%.
Floating point programs, in general, have higher AVFs
compared to integer programs (31% vs. 25%, respectively)
Long-latency instructions and few branch mispredictions
let floating-point programs use the instruction queue more effectively than integer programs, leading to a higher AVF
Apply Little's Law:
Number of ACE instructions in the queue = ACE bandwidth (ACE IPC) × the average number of cycles an instruction is in the ACE state (ACE latency)
The ACE IPC and ACE latency come from the performance model (a hypothetical worked example follows)
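A hypothetical worked example (numbers invented for illustration, treating each queue entry's bits uniformly): with an ACE IPC of 2 and an ACE latency of 9 cycles in a 64-entry instruction queue,

$$N_{\mathrm{ACE}} = B_{\mathrm{ACE}} \times L_{\mathrm{ACE}} = 2 \times 9 = 18, \qquad \mathrm{AVF} = \frac{18}{64} \approx 28\%$$

consistent with the 28% figure reported above.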
AVFs for the Execution Units
Four integer pipes and two floating point pipes
50% control latches and 50% datapath latches
Only 11% of the cycles are spent processing ACE instructions
Significantly lower than the instruction queue because:
Instructions must wait in the instruction queue
Speculatively issued instructions following cache-miss loads must replay through the instruction queue
The floating-point pipes are mostly idle while executing integer code
Implemented logical masking functions for a small but important
subset
Conclusion
Estimated AVFs using a novel approach that tracks bits required for architecturally correct execution (ACE bits) and un-ACE bits
Computed the AVF for the instruction queue and execution
units of an Itanium2®-like IA64 processor.
Further refinement could lower the AVF estimates further, but the contribution from such refinement is expected to be small
Can estimate the FIT rate of an entire processor early in the
design cycle
Can help designers choose the appropriate error detection or
correction schemes
Can lower the FIT rate of the chip iteratively by adding more
and more error protection, using AVF estimates as a guide.
L2-Miss-Driven VSV for Low Power
(H. Li, C. Cher, T. N. Vijaykumar, K. Roy)
Idea: upon an L2 miss, the pipeline performs independent computations but almost always ends up stalled waiting for data, despite out-of-order issue and other latency-hiding techniques
During an L2 miss, scale down the supply voltage and carry out the independent computations at lower speed instead
Performance degrades, however, if there are sufficient independent computations that would have overlapped with the cache-miss delay
Conversely, returning to full speed too eagerly reduces power savings when there are multiple misses and insufficient independent computations to overlap with them
Proposed solution
Two state machines track parallelism on the fly
Scale the voltage down depending on the parallelism between the two events (see the sketch after this list)
Factors considered
Circuit-level complexities restrict VSV to two voltage levels
Stability
Signal propagation speed issues
Energy overhead issues in RAMs and Register files
Average reduction of processor power is 7% while
performance degradation is 0.9%
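A minimal Python sketch of the miss-driven voltage control; the paper's two cooperating FSMs are collapsed into one controller here, and the threshold, state names, and tick() interface are illustrative.

```python
# Simplified sketch of L2-miss-driven supply-voltage scaling: drop to the
# low voltage when an L2 miss is outstanding and little independent work
# is issuing; return to the high voltage when the miss resolves or enough
# parallelism appears that running slowly would cost performance.
HIGH, LOW = "Vdd_high", "Vdd_low"

class VsvController:
    def __init__(self, ipc_threshold=0.5):
        self.voltage = HIGH
        self.ipc_threshold = ipc_threshold  # assumed parallelism cutoff

    def tick(self, outstanding_l2_misses, ipc_this_cycle):
        if self.voltage == HIGH:
            # High-to-low transition: miss outstanding, pipeline nearly idle.
            if outstanding_l2_misses > 0 and ipc_this_cycle < self.ipc_threshold:
                self.voltage = LOW
        else:
            # Low-to-high transition: data returned, or enough independent work.
            if outstanding_l2_misses == 0 or ipc_this_cycle >= self.ipc_threshold:
                self.voltage = HIGH
        return self.voltage

# Usage: voltage drops during a low-ILP miss and recovers afterwards.
ctrl = VsvController()
trace = [(0, 2.0), (1, 0.2), (1, 0.1), (1, 0.3), (0, 1.5)]
print([ctrl.tick(misses, ipc) for misses, ipc in trace])
```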
VSV Structure
Transitions
[Figures: high-to-low and low-to-high voltage transitions]
VSV – Results
VSV - Achievements
Power savings with minimal performance degradation
Complexity of circuits taken into consideration
FSMs track the level of parallelism between independent operations and the delay caused by an L2 cache miss
VSV achieves a 4% reduction in power across all SPEC2K benchmarks
VSV achieves a 12% reduction for the benchmarks with high L2 miss rates
Questions
Any questions or feedback?