Dynamically Parameterized Architectures for Power Aware Video

Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
Boston Area Architecture Conference
30 Jan 2003
{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.

Motivation
• Problem:
  • Need low power architectures for wireless DSP
  • How to support dynamic clock and voltage scaling in heterogeneous systems with data-dependent workloads (granularity, overhead, control)
• A Solution:
  • Use modularity of SoC; apply at IP core level
  • Apply discrete frequency and voltage scaling
  • Use interconnect utilization measures and data rate requirements to dynamically control scaling

Overview
• Adaptive System-on-a-Chip
• Implementation Approach
• Preliminary Results
• Conclusions and Challenges

Adaptive System-on-a-Chip
• Tiled architecture with mesh interconnect
• Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
• Low-overhead core interface for
  • Point to point communication pipeline
  • On-chip bus substitute for streaming applications
• Based on static scheduling
  • Fast and predictable
[Figure: array of tiles holding heterogeneous cores (mProc, Multiplier, FPGA); each tile pairs a Core with a Communication Interface (ctrl) that has North, South, East and West links]

aSoC Implementation
• 0.18 μm technology
• Full custom
[Figure: chip layout, 2500 λ by 3000 λ]

Some Results
• 9 and 16 core systems tested for IIR, MPEG encoding and image processing applications
• ~2x the performance compared to CoreConnect bus (burst and hierarchical)
• ~1.5x the performance of an oblivious routing network [1] (dynamic routing)
• Max speedup is 5x
[1] W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993

Dynamic Properties of Statically Routed System??
• Dynamically Parameterizable Cores proposed to save power
  • Motion Estimation core (by P. Jain, UMASS) changes from 256 cycles/pixel to 16 cycles/pixel based on input data
• Streams within the scheduled communication pipeline can be blocked and back up or go unused
• Inefficient to simply run at fastest rate
[Figure: ME and DCT cores connected by scheduled communications; latency changes with data]

Key Features for Dynamic Power Reduction
• SoC Modularity
  • Sets a manageable granularity for voltage scaling
• Heterogeneous cores
  • Multiple on-chip clocks and voltages already supported
  • Core interface already handles synchronization and level conversion
• Statically scheduled
  • Interconnect traffic indicates system bottlenecks

Approach
• Stream based cores
  • Limited buffering
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/Writing success rates indicate core utilization
    • Input blocked: Core too slow
    • Output blocked: Core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage
[Figure: Core processing pipeline between an Input Core-port and an Output Core-port, each with a Buffer and a Blocked signal; the Clock and Supply Controller drives Local Vdd and Local Clock; the core-ports connect to the Interconnect]

Power-Aware System: Core Utilization Measurement
• Accumulate failures at each core-port to control clock change
  • Blocked: add 1
  • Success: subtract 1
• Threshold and compare input and output failure counts
  • Many input, few output: increase frequency
  • Many output, few input: decrease frequency
  • Many or few of both: do nothing
[Figure: Interconnect Interface exchanges data with Core-port In and Core-port Out; each port feeds a Blocked count; a Compare and Threshold block outputs Increase or Decrease Local Clock to the Core]
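
The counting and threshold rule on this slide can be sketched in Python as follows. This is a minimal illustration rather than the authors' hardware: the names `PortCounter` and `decide`, the saturation limit of 15, and the threshold of 8 are assumptions made for the example. The slides do not give a count value and list it as an open issue.

```python
from dataclasses import dataclass


@dataclass
class PortCounter:
    """Failure accumulator for one core-port (hypothetical model)."""
    count: int = 0
    limit: int = 15  # assumed saturation value; not given in the slides

    def record(self, blocked: bool) -> None:
        # Blocked transfer: add 1. Successful transfer: subtract 1.
        step = 1 if blocked else -1
        self.count = max(0, min(self.limit, self.count + step))


def decide(inp: PortCounter, out: PortCounter, threshold: int = 8) -> int:
    """Return +1 to raise the local clock, -1 to lower it, 0 to hold."""
    many_in = inp.count >= threshold
    many_out = out.count >= threshold
    if many_in and not many_out:
        return +1  # input backs up: core too slow
    if many_out and not many_in:
        return -1  # output backs up: core too fast
    return 0       # many or few of both: do nothing


inp, out = PortCounter(), PortCounter()
for _ in range(10):
    inp.record(blocked=True)    # input port keeps failing
    out.record(blocked=False)   # output port keeps succeeding
print(decide(inp, out))
```

Saturating the counters keeps a long run of successes or failures from masking a later change in behavior; whether the actual design saturates is not stated in the slides.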

Power-Aware System: Local Clock Selection
• Derived from high frequency global clock
• 8 possible values (Global Clock / $2^n$)
• Move one up or down each transition
[Figure: the Global Clock feeds dividers /1, /2, /4, /8, /16, /32, /64 and /128; a count driven by the rate measurement selects which divided clock becomes the Local Clock of the Core]
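
The one-step-per-transition clock selection can be sketched as below. The function names, the clamping at the fastest and slowest settings, and the 400 MHz global clock are assumptions for illustration; the slides only state that there are eight power-of-two divisions and that the selection moves one step at a time.

```python
# Eight divider settings: Global Clock / 2**n for n = 0..7.
DIVIDERS = [2 ** n for n in range(8)]  # [1, 2, 4, ..., 128]


def step_divider(index: int, direction: int) -> int:
    """Move at most one divider setting per transition.

    direction: +1 = increase frequency, -1 = decrease frequency, 0 = hold.
    index 0 is the fastest setting (/1), index 7 the slowest (/128).
    Clamping at the ends is an assumption of this sketch.
    """
    new_index = index - direction  # a faster clock means a smaller divider
    return max(0, min(len(DIVIDERS) - 1, new_index))


def local_clock_mhz(global_mhz: float, index: int) -> float:
    """Local clock frequency for a given divider setting."""
    return global_mhz / DIVIDERS[index]


idx = 3                      # start at /8
idx = step_divider(idx, +1)  # speed up by one step, to /4
print(DIVIDERS[idx], local_clock_mhz(400.0, idx))  # 400 MHz is hypothetical
```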

Power-Aware System: Voltage Selection System
• Choose one of 4 supply voltages
• Look-up-table (LUT) used to match voltage to frequency setting for specific core
• Using cascading buffers, core Vdd can change within 30 ns (250 nm technology)
[Figure: the setting from the Clock Selector indexes a LUT that selects one of V1, V2, V3 or V4 as the Local Supply of the Core]
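
A per-core look-up table of this kind might look like the sketch below. Everything numeric here is a placeholder: the four supply values and the mapping from clock setting to supply are assumptions chosen for illustration, not values from the slides. The mapping assumes one supply each for full, 1/2, 1/4 and 1/8 speed, with slower settings reusing the lowest supply.

```python
# Placeholder supplies in volts, V1 (highest) to V4 (lowest). Hypothetical values.
SUPPLIES_V = [1.8, 1.2, 0.9, 0.7]

# Hypothetical LUT for one core: clock setting n (Global Clock / 2**n) -> supply index.
# Settings slower than 1/8 speed reuse the lowest supply in this sketch.
VDD_LUT = [0, 1, 2, 3, 3, 3, 3, 3]


def select_vdd(clock_setting: int) -> float:
    """Return the supply voltage matched to a clock setting for this core."""
    return SUPPLIES_V[VDD_LUT[clock_setting]]


for n in range(8):
    print(f"/{2 ** n}: {select_vdd(n)} V")
```

Because the table is per core, a core with a longer critical path could hold a higher supply at the same clock setting simply by loading different LUT contents.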

Vdd Selection Criteria
• As Vdd decreases delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage drop reduces power by 70%, 84%, and 89%
  • $P = \alpha C V_{dd}^2 f$
[Figure: Normalized Core Critical Path Delay vs. Vdd; x-axis Voltage (0.4 to 2), y-axis Normalized Delay (0 to 12); operating points marked Max Speed, 1/2 Speed, 1/4 Speed and 1/8 Speed; voltages 0.72 and 1.05 marked on the x-axis]
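
As a worked check on the quadratic voltage term in the dynamic power relation on this slide, the sketch below computes the supply ratio Vdd/Vdd_max that would yield each quoted reduction. It assumes the percentages refer to the voltage term alone, with activity, capacitance and frequency held fixed; the slides do not state this explicitly, and the function name is made up for the example.

```python
import math


def vdd_ratio_for_reduction(reduction: float) -> float:
    """Vdd / Vdd_max giving the stated power reduction when only Vdd changes.

    With P proportional to Vdd**2, P/P_max = (Vdd/Vdd_max)**2 = 1 - reduction.
    """
    return math.sqrt(1.0 - reduction)


for r in (0.70, 0.84, 0.89):
    print(f"{r:.0%} reduction -> Vdd ratio {vdd_ratio_for_reduction(r):.2f}")
```

Under this reading, even the largest quoted saving needs the supply to fall only to about a third of its maximum, which is why pairing voltage scaling with frequency scaling pays off more than frequency scaling alone.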

Power Savings
• Two core system
  • ME chooses 3 different algorithms based on input data
  • DCT constant rate
[Figure: ME core feeding DCT core]

Optimal Frequencies
Core: Mode               Frequency (MHz)   Power (mW)   Power with Frequency and Voltage Scaling (mW)
ME: Full Search          105               973          973
ME: Spiral               9.9               76           7.6
ME: Three Step Search    2.75              25           2.5
DCT                      9.6               54           5.4

Core power from Synopsys RTL simulation
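
The relative saving from adding voltage scaling can be read off the table with a few lines of Python. The pairing of values to rows follows the order in which the numbers appear on the slide (in particular, the DCT row is taken as 9.6 MHz, 54 mW and 5.4 mW), which is an interpretation of the flattened table rather than something the slide states outright.

```python
# (mode, power in mW, power in mW with frequency and voltage scaling)
ROWS = [
    ("ME: Full Search", 973, 973),
    ("ME: Spiral", 76, 7.6),
    ("ME: Three Step Search", 25, 2.5),
    ("DCT", 54, 5.4),
]


def saving(before_mw: float, after_mw: float) -> float:
    """Fraction of power saved by adding voltage scaling."""
    return 1.0 - after_mw / before_mw


for mode, before, after in ROWS:
    print(f"{mode}: {saving(before, after):.0%} saved")
```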

Test System Results
• Simple test case
  • Core 1 starts 16x too fast
  • Core 2 starts 8x too slow
[Figure: Relative Clock Frequency (0 to 18) vs. Number of Clock Cycles (0 to 2500) for Core1 and Core2]

Key Issues
• Count value required to control frequency shifting?
  • May be application and core dependent
• Core characterization
  • Not easy, data dependent
  • Some tools exist for StrongARM (JouleTrack, A. Sinha, MIT)
• Benchmark development
  • A bit tedious

Conclusions
• SoC: a good candidate platform for voltage scaling implementation
  • Convenient granularity
  • Low overhead
  • Easily measurable control mechanism
• Hardware
  • Preliminary results
  • Now test real benchmarks and data