Vertical Benchmarks for CAD - Ann Gordon-Ross


A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC
International Symposium on Low Power Electronics and Design, 2001
Introduction
• Mass-produced microprocessor IC’s prevail in embedded systems
– Cheap
• From amortization and high yields
– Small and low power
• From optimization and use of new technologies
– Available immediately
• Typically run one program forever
• QUESTION:
– Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: microprocessor IC with Processor, Dmem., Periph., and Pmem. blocks; annual production: 10 million units; cost per unit: $2]
Frank Vahid, www.cs.ucr.edu/~vahid
Introduction
• Answer:
– Yes, by using configurable (tunable) components and adding a tuner circuit
• Non-obvious use of extra transistors
– Previously unheard of – silicon too scarce
– Becoming more common, e.g., self-test circuitry
– “Transistor budgets have gone ballistic” [Microprocessor Report, 1998]
– Analogous situation in software
• Yesterday, program memory extremely scarce
• Today, we find a flight simulator hidden in Excel’97
[Figure: microprocessor IC with Processor, Dmem., Periph., Pmem., and an added Tuner block]
[Figure: Moore’s Law (2x / 18 months), 1981 to 2002: leading edge chip in 1981 had 10,000 transistors; leading edge chip in 2002 has 150,000,000 transistors]
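The two endpoints quoted for Moore's Law (10,000 transistors in 1981, 150,000,000 in 2002) can be checked against the stated doubling rate. The sketch below assumes an idealized, fixed 18-month doubling period; the function name is illustrative and not from the talk.

```python
def projected_transistors(start_count, start_year, end_year, months_per_doubling=18):
    """Project a transistor count under an idealized fixed doubling period."""
    doublings = (end_year - start_year) * 12 / months_per_doubling
    return start_count * 2 ** doublings

# 1981 -> 2002 is 21 years, i.e. 14 doublings at 18 months each.
projection = projected_transistors(10_000, 1981, 2002)
print(f"{projection:,.0f}")  # 163,840,000
```

The projection lands in the same range as the 150,000,000 transistors quoted for 2002, so the two endpoints are consistent with the 2x / 18 months rate.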
Introduction
• We introduce:
– A basic architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
• Involves self-profiling circuitry
• Uses designer-activated self-optimization mode
• To illustrate, we also introduce:
– A tunable component: Loop Table
• Small memory to store frequent loops
• Similar to previous loop caches
– Differs in how and when contents are updated
Problem Description and Related Work
• Goal:
– Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
• Constraints
1. Exact instruction set compatibility
2. Avoid changing tool chain
3. Preserve cycle-by-cycle behavior
– These constraints are more stringent than in most previous work
Problem Description and Related Work
• Application-specific instruction-set processors
– Introduce new instructions for common actions
• Pre-fabrication: [Fischer99], [Tensilica00]
• Post-fabrication: [Kucukcakar99] – for mass-produced IC’s
– Obviously modifies instruction-set and tool chain
• Dynamic binary translation and code morphing
– Transmeta’s Crusoe: Profile executing code, cache
translation results of frequently executed code
– Changes cycle-by-cycle behavior, and only helps if
performing dynamic binary translation in the first place
• Program compression
– Profile code, compress frequently-executed code
[Ishihara00]
– Modifies the tool chain
Problem Description and Related Work
• Loop caches
– Cache frequently-executed small loops to reduce power for memory
– Filter cache [Kin97]
• Small, low-power L0 cache
• Causes extra cycles due to many misses
– Compiler-assisted loop cache [Bellas99]
• Use profiler/compiler to mark only frequent loops for placement in filter cache
• Modifies tool chain
– Transparent loop cache [Lee99]
• Fill loop cache only on detecting a short backwards branch, indicating a small loop
• No tag comparisons – greater efficiency
• We extend to only consider frequent loops, reducing runtime overhead
[Figure: PID controller example: most execution time spent in two small loops; bar chart with vertical axis from 0% to 60%]
[Figure: Proc. fetching directly from Pmem, compared with Proc. fetching from Pmem through a Loop table]
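The short-backwards-branch test used by transparent loop caches can be sketched in a few lines. The slide describes this happening in hardware during execution; the sketch below scans a list offline purely to show the test itself. The `Instr` record, the function name, and the `max_loop_size` threshold are all hypothetical.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    addr: int                      # address of this instruction
    target: Optional[int] = None   # branch target, or None for non-branches

def find_small_loops(program, max_loop_size=32):
    """Return (start, end) address pairs for short backward branches.

    A branch at `end` whose target `start` lies at most `max_loop_size`
    addresses behind it is treated as the closing branch of a small loop.
    """
    loops = []
    for ins in program:
        if ins.target is not None and 0 <= ins.addr - ins.target <= max_loop_size:
            loops.append((ins.target, ins.addr))
    return loops

program = [
    Instr(0x100),
    Instr(0x101),
    Instr(0x102, target=0x100),   # short backward branch: small loop 0x100..0x102
    Instr(0x103, target=0x200),   # forward branch: ignored
    Instr(0x200, target=0x100),   # backward, but too far to count as a small loop
]
print(find_small_loops(program))  # [(256, 258)]
```

Only the first branch qualifies, which mirrors the idea that a short backward jump is a cheap indicator of a small loop worth caching.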
Architecture Overview
• Started with standard microcontroller
– ROM access consumes much power
– Added Loop Table to store common loops
– Added Bypass Controller to switch to/from Loop Table
– Added Self-Profiling Controller and Loop Count Table to detect most frequent loops
[Figure: microprocessor block diagram with Datapath, Controller, Data Memory (RAM), Program Memory (ROM) (~10,000’s of bytes), Configuration Memory (~10’s of bytes), Loop Table (~100’s of bytes) holding instructions and jump bits, Bypass Controller with LAR’s, Self-Profiling Controller, Loop Count Table (~100’s of bytes), and address and instruction muxes]
Methodology Overview
• Self-optimizing microcontroller
– Post-fabrication (hence mass-produced)
– In-system
– Tuning under designer control
• Not by end user, hence stable and consistent end-use platform
[Figure: timeline with stages (Designer: prefabrication), Designer: post-fabrication, and User, marking self-optimization mode activation]
Methodology Overview
• Download application to microcontroller program memory
• Reset microcontroller, causing (optimized) application execution in normal mode
• Activate self-optimizing mode, causing update of configuration memory
• Upload configuration memory for downloading to other microcontrollers
[Figure: microprocessor block diagram with Datapath, Controller, Data Memory (RAM), Program Memory (ROM) (~10,000’s of bytes), Configuration Memory (~10’s of bytes), Loop Table (~100’s of bytes) holding instructions and jump bits, Bypass Controller with LAR’s, Self-Profiling Controller, Loop Count Table (~100’s of bytes), and address and instruction muxes]
Self-optimizing mode
• Initializing
– Activated by extra pin
– Traverse memory, detect loops, add addresses to loop count table
• Profiling
– Execute, update loop counts
• Requires fast increments
• We use fully-assoc. mem
• Hardware hash table possible
• Configuring
– Store most frequent loop addresses at bottom of program memory, set flag
[Figure: mode flow: Download program, Normal mode, Self-optimizing mode, Upload configuration]
[Figure: block diagram showing Microprocessor (Controller, Datapath), Data Memory (RAM), Program Memory (ROM) (~10,000’s of bytes), Configuration Memory (~10’s of bytes), Self-Profiler, and Loop Count Table]
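The profiling and configuring phases can be modeled in software as a rough sketch. It assumes the loop start addresses were already found during initialization, that "update loop counts" means counting fetches of a loop's start address, and that the number of address registers is two; none of these details are stated in the slides, and the function names are hypothetical.

```python
from collections import Counter

def profile_loops(loop_starts, fetch_trace):
    """Profiling phase: count fetches that hit a known loop start address.

    `loop_starts` stands in for the addresses placed in the loop count
    table during initialization; the dictionary lookup plays the role of
    the fully associative memory.
    """
    counts = Counter({addr: 0 for addr in loop_starts})
    for addr in fetch_trace:
        if addr in counts:
            counts[addr] += 1
    return counts

def configure(counts, num_lars=2):
    """Configuring phase: keep the most frequent loop addresses and set the flag."""
    most_frequent = [addr for addr, _ in counts.most_common(num_lars)]
    return {"optimized_flag": True, "loop_addresses": most_frequent}

# Hypothetical trace: three loops executed 50, 5, and 20 times.
trace = [0x100, 0x101, 0x102] * 50 + [0x300, 0x301] * 5 + [0x400, 0x401] * 20
counts = profile_loops([0x100, 0x300, 0x400], trace)
config = configure(counts)
print(config["loop_addresses"])  # [256, 1024]
```

The returned dictionary stands in for the configuration data that the designer would later upload and copy to other parts.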
Normal mode
• Reset
– If self-optimization flag set
• Read loop addresses into address registers (LAR’s)
• Set flag in bypass controller
• If flag unset or no address match, fetch from ROM
• If flag set and address match
• Begin fetching from loop table
• Extra bits in loop table for fast determination if jump leaves table
– 00: instruction can’t exit loop
– 10: exits loop if jump not taken
– 01: exits loop if jump taken
[Figure: mode flow: Download program, Normal mode, Self-optimizing mode, Upload configuration]
[Figure: block diagram showing Microprocessor (Controller, Datapath), Data Memory (RAM), Program Memory (ROM), Configuration Memory, Loop Table (instructions and jump bits), Bypass Controller, and LAR’s]
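The fetch decision and the two jump bits listed above can be sketched as a small software model. This is an illustration under assumptions, not the actual hardware: the class and method names are hypothetical, and the model assumes the controller is told after each instruction whether a jump was taken.

```python
CANT_EXIT, EXIT_IF_NOT_TAKEN, EXIT_IF_TAKEN = 0b00, 0b10, 0b01

def leaves_loop(jump_bits, jump_taken):
    """Decide from the two extra loop-table bits whether fetching leaves the table."""
    if jump_bits == EXIT_IF_NOT_TAKEN:
        return not jump_taken
    if jump_bits == EXIT_IF_TAKEN:
        return jump_taken
    return False  # 00: instruction can't exit the loop

class BypassController:
    def __init__(self, lars, flag_set):
        self.lars = set(lars)        # loop address registers loaded at reset
        self.flag_set = flag_set     # self-optimization flag
        self.in_loop_table = False

    def fetch_source(self, pc):
        """Return which memory supplies the instruction at `pc`."""
        if not self.in_loop_table and self.flag_set and pc in self.lars:
            self.in_loop_table = True
        return "loop_table" if self.in_loop_table else "ROM"

    def after_execute(self, jump_bits, jump_taken):
        """Switch back to ROM when the jump bits say the loop was exited."""
        if self.in_loop_table and leaves_loop(jump_bits, jump_taken):
            self.in_loop_table = False

bc = BypassController(lars=[0x100], flag_set=True)
print(bc.fetch_source(0x0FF))   # ROM
print(bc.fetch_source(0x100))   # loop_table
bc.after_execute(CANT_EXIT, jump_taken=False)
print(bc.fetch_source(0x101))   # loop_table
bc.after_execute(EXIT_IF_NOT_TAKEN, jump_taken=False)  # closing branch falls through
print(bc.fetch_source(0x103))   # ROM
```

Because the exit decision needs only the two stored bits and the taken/not-taken outcome, no address comparison is required while fetching from the loop table.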
Results -- power
• Savings
– 34% total power savings after self-optimization
– Depends on technology
• Power overhead
– Negligible when self-optimization idle
– Slight increase (5%) during self-optimization
• Setup
– Synopsys synthesis, simulation, and power analysis
– 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
[Figure: stacked bar chart of power in milliwatts (0 to 25) for Ex1, Ex2, and Ex3 before (bef) and after (aft) self-optimization, broken down into ROM, Control, ALU, RAM, and Loop table and control]
Results – size (in cells)
• Significant increase, but:
– 8051 version was small
• Other systems have bigger ROM (e.g., 2M) and RAM, and other processors are even bigger
• Smaller percentage overhead
– Transistors becoming cheaper
– Can build product-oriented IC’s with only loop table and controller (no Self-Profiler or Loop Count Table)
– Upload new binaries from prototype-oriented part, download back to new product-oriented parts
– Supported by existing standard tools
– We are investigating ways to shrink the Loop Count Table

Subsystem               Original  Extended
Controller                 3,391     3,767
ALU                        2,100     2,100
Decoder                      586       586
RAM (256 bytes)           17,312    17,312
ROM (8 kbytes)            11,000    11,000
Select logic                   -       132
Loop Count Table               -    33,595
Loop Table                     -    16,740
Self-Profiler/Bypass           -     7,188
Total:                    34,389    92,420
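The size figures can be cross-checked with a short calculation. The cell counts below are taken from the table; pairing each value with its subsystem is a reconstruction that the totals confirm, and the "without Loop Count Table" figure is only an upper bound for a product-oriented part, since the table reports the Self-Profiler and Bypass logic as one combined number.

```python
original = {"Controller": 3391, "ALU": 2100, "Decoder": 586,
            "RAM (256 bytes)": 17312, "ROM (8 kbytes)": 11000}
extended = {**original, "Controller": 3767, "Select logic": 132,
            "Loop Count Table": 33595, "Loop Table": 16740,
            "Self-Profiler/Bypass": 7188}

orig_total = sum(original.values())
ext_total = sum(extended.values())
print(orig_total, ext_total)                # 34389 92420
print(f"{ext_total / orig_total:.2f}x")     # 2.69x

# Upper bound for a product-oriented part: drop only the Loop Count Table,
# because Self-Profiler and Bypass are not reported separately.
without_lct = ext_total - extended["Loop Count Table"]
print(without_lct, f"{without_lct / orig_total:.2f}x")  # 58825 1.71x
```

The Loop Count Table alone accounts for more than half of the added cells, which is consistent with the stated plan to shrink it.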
Conclusions
• Mass-produced IC’s give big advantages
• Abundance of transistors provides new opportunity for self-optimization by tuning
• We introduced:
– A self-optimization methodology and architecture
– A loop table as a tunable component
• These items yielded:
– Significant power savings by reducing ROM access
• 34% total savings for our particular microcontroller and target technology
– No change in instruction set, tools, or performance
• Future work includes:
– Reducing size overhead
– Investigating other tunable components (e.g., N-way cache)