Lazy Logic - PHARM - University of Wisconsin–Madison
Lazy Logic
Mikko H. Lipasti
Associate Professor
Department of Electrical and
Computer Engineering
University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm
CMOS History
CMOS has been a faithful servant
40+ years since invention
Tremendous advances
Device size, integration level
Voltage scaling
Yield, manufacturability, reliability
Nearly 20 years now as high-performance workhorse
Result: life has been easy for architects
Ease leads to complacency & laziness
July 6, 2007
Mikko Lipasti, University of Wisconsin
Seminar--University of Toronto
CMOS Futures
“The reports of my demise are greatly
exaggerated.” – Mark Twain
CMOS has some life left in it
Device scaling will continue
What comes after CMOS…
Many new challenges
Process variability
Device reliability
Leakage power
Dynamic power
Focus of this talk
Dynamic Power
P_dyn = k * Σ_i A_i C_i V_i^2 f_i   (summed over units i)
Static CMOS: current flows when transistors switch
Combinational logic evaluates new inputs
Flip-flop, latch captures new value (clock edge)
Terms
C: capacitance of circuit (wire length, number and size of transistors)
V: supply voltage
A: activity factor
f: frequency
Architects can/should focus on Ci x Ai
Reduce capacitance of each unit
Reduce activity of each unit
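The per-unit sum above can be evaluated directly. A minimal sketch, with made-up unit values (the constant `K`, the unit list, and all numbers are illustrative, not from the talk):

```python
# Illustrative evaluation of the slide's formula P_dyn = k * Σ_i A_i C_i V_i^2 f_i.
# K and the unit tuples below are made-up numbers, not from the talk.
K = 1.3  # technology-dependent constant

def dynamic_power(units, k=K):
    """units: iterable of (activity, capacitance_F, supply_V, frequency_Hz)."""
    return k * sum(a * c * v * v * f for a, c, v, f in units)

# The lazy-logic lever: halving a unit's activity factor A_i halves its term,
# even if extra (mostly idle) units add a little capacitance.
busy = [(0.50, 1e-12, 1.0, 2e9), (0.10, 4.0e-12, 1.0, 2e9)]
lazy = [(0.25, 1e-12, 1.0, 2e9), (0.10, 4.2e-12, 1.0, 2e9)]
print(dynamic_power(busy) > dynamic_power(lazy))  # True
```

This is the arithmetic behind "OK to increase number of units/wires/devices as long as reduced A_i compensates."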
Design Objective Inversion
Historically, hardware was expensive
Every gate, wire, cable, unit mattered
Squeeze maximum utilization from each
Now, power is expensive
On-chip devices & wires, not so much
Should minimize Ci x Ai
Logic should be simple, infrequently used
Both sequential and combinational
Lazy Logic
Talk Outline
Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview
What is Lazy Logic?
Design philosophy
Some overall principles
Minimize unit utilization
Minimize unit complexity
OK to increase number of units/wires/devices
As long as reduced Ai (activity) compensates
Don’t forget leakage
Result
Reject conventional “good ideas”
Reduce power without loss of performance
Sometimes improve performance
Lazy Logic Applications
CMP interconnection networks
Old: Packet-switched, store-and-forward
New: Circuit-switched, reconfigurable
Stall cycle redistribution
Transparent pipelines want fine-grained stalls
Redistribute coarse stalls into fine stalls
High-performance dynamic scheduling
Cycle time goal achieved by replicating ALUs
CMP Interconnection Networks
Options
Buses don’t scale
Crossbars are too expensive
Rings are too slow
Packet-switched mesh
Attractive for all the DSM reasons
Scalable
Low latency
High link utilization
CMP Interconnection Networks
But…
Cables/traces are now on-chip wires
Fast, cheap, plentiful
Short: 1 cycle per hop
Router latency adds up
3-4 CPU cycles per hop
Store-and-forward
Lots of activity/power
Is this the right answer?
Circuit-switched Interconnects
Communication patterns
Spatial locality to memory
Pairwise communication
Circuit-switched links
Avoid switching/routing
Reduce latency
Save power?
Router Design
[Diagram: 5-port router with Processor, N, S, E, and W ports]
Switches can be logically configured to appear as wires (no routing overhead)
Can also act as packet-switched network
Can switch back and forth very easily
Detailed router design not presented here
Dirty Miss coverage
[Chart: % of dirty misses covered (40%-100%) vs. number of circuit-switched connections per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Directory Protocol
Initial 3-hop miss establishes CS path
Subsequent miss requests
Sent directly on CS path to predicted owner
Also in parallel to home node
Predicted owner sources data early
Directory acks update to sharing list
Benefits
Reduced 3-hop latency
Less activity, less power
Circuit-switched Performance
[Chart: normalized cycle count (0-1.2) for Base; Fully connected, Oracle; Limit 1, Oracle; and Limit 1, Region Prediction, across TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity]
Link Activity
[Chart: normalized link activity (0%-100%) for Limit 1, Oracle and Limit 1, Region Prediction across TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity]
Buffer Activity
[Chart: normalized input buffer activity (0%-100%) for Limit 1, Oracle and Limit 1, Region Prediction across the same benchmarks]
Circuit-switched Coherence
Summary
Reconfigurable interconnect
Circuit-switched links
Some performance benefit
Substantial reduction in activity
Current status (slides are out of date)
Router design and physical/area models
Protocol tuning and tweaks, etc.
Initial results in CA Letters paper
Talk Outline
Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview
Pipeline Clocking Revisited
[Diagram: two work units, A and B, flowing through a five-stage pipeline]
Conventional pipeline clock gating
Each valid work unit gets clocked into each latch
Two units of work, 10 clock pulses
This is needlessly conservative
Latches clocked to propagate data
April 10, 2016
Eric L. Hill – Preliminary Exam
Transparent Pipeline Gating
[Diagram: the same two work units, A and B, under transparent clocking]
Transparent pipelining: novel approach to clocking [Jacobsen 2004, 2005]
Both master and slave latch can remain transparent
Gating logic ensures no races
Two units of work, 5 clock pulses
Pipeline registers are clocked lazily, only when a race occurs
Quite effective for low-utilization pipelines
Gaps between valid work units enable transparent mode
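The pulse-count argument above can be sketched as a toy model. It assumes, as a simplification, that under transparent clocking a stage latch must be pulsed only when the item in flight is immediately trailed by another valid item (a potential race); this is not the full Jacobsen gating logic, just the counting intuition:

```python
# Toy latch-pulse counter: conventional gating clocks every valid item through
# every stage; transparent gating (simplified) pulses only to separate an item
# from an immediately trailing one. arrival_cycles must be increasing.
def pulses(arrival_cycles, stages=5, transparent=False):
    total = 0
    for i, t in enumerate(arrival_cycles):
        trailed = i + 1 < len(arrival_cycles) and arrival_cycles[i + 1] == t + 1
        if transparent:
            total += stages if trailed else 0  # no race -> latches stay open
        else:
            total += stages  # conventional: every item clocks every stage
    return total

print(pulses([0, 1]))                    # 10: conventional, two units
print(pulses([0, 1], transparent=True))  # 5: only the leading unit needs latching
print(pulses([0, 2], transparent=True))  # 0: a 1-cycle gap removes the race
```

The last line is the slide's point: gaps between valid work units are what enable transparent mode.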
Applications
Best suited for low-utilization pipelines
E.g. FP, media processing functional units
High-utilization pipelines see least benefit
E.g. instruction fetch pipelines
To benefit from transparent approach:
Valid data items need fine-grained gaps (stalls)
1-cycle gap provides lion’s share (50%)
Application: Front-end Pipelines
Provide back-end with sufficient supply of instructions to find ILP
High branch prediction accuracy
Low instruction cache miss rates
Little opportunity for clock gating
Designed to feed peak demand
Poor match for transparent pipeline gating
In-Order Execution Model
In-order cores
Power efficient
Low design complexity
Throughput oriented
CMP systems trending towards simple cores (e.g. Sun Niagara)
Data dependences cause fine-grained stalls at dispatch
Can we project these back to fetch?
Exploit fetch slack
Pipeline Diagram
[Diagram: PC, Bpred, and a per-group clock vector drive the instruction fetch pipeline, which feeds the execution core through the issue buffer; bpred update flows back from the core]
Available Fetch Slack
[Chart: cumulative fraction of instruction groups (0.0-1.0) with fetch slack of 0 through 7+ cycles, for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, specjbb, tpch, twolf, vortex, and vpr]
Implementation
Stall cycle bits embedded in BTB
EPIC ISAs (IA64) could use stop bits
Verify prediction by observing unperturbed groups
Let high-confidence groups periodically execute unperturbed
Observe overall increase in execution time
Modeled Cell PPU-like PowerPC core with aggressive clock gating
Latch Activity Reduction
[Chart: normalized latch activity factor (0-1.2) for scr and scr+tcg across bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, specjbb, tpch, twolf, vortex, and vpr]
FE Energy Delay Product
[Chart: normalized front-end energy-delay product (J·s, 0-1.2) for base, scr, and scr+tpg, broken down into fe_latch, bpred, and icache components, across the same benchmarks]
Stall Cycle Redistribution
Summary [ISLPED 2006]
Transparent pipelines reduce latch activity
Not effective in pipelines with coarse-grained stalls (e.g. fetch)
Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
Benefits
Equivalent performance, lower power
Transparent fetch pipeline now attractive
Talk Outline
Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview
A Brief Scheduler Overview
Data capture / non-data capture scheduler
[Pipeline diagrams: Fetch – Decode – Schedule – Dispatch – RF/Exe – Writeback – Commit, with atomic wakeup/select in the Sched/Exe stage]
Data capture scheduler desirable for many reasons
Cycle time is not competitive because of data path delay
Speculative scheduling
[Pipeline diagram: Fetch – Decode – Schedule – Dispatch – RF – Exe/Recover – Writeback – Commit; speculatively issued instructions must re-schedule when an invalid input value arrives, i.e. when latency is mispredicted]
Current machines use speculative scheduling
Misscheduled/replayed instructions burn power
Depending on recovery policy, up to 17% of issued insts need to replay
Slicing the Core
[Diagram: core sliced into Front-End, OoO Core, and Back-End]
Bitslice the core: narrow (16b) and wide (64b)
Narrow core can be full data capture
Still makes aggressive cycle time (with lazy logic)
Completely nonspeculative, virtually no replays
Further power benefits (not in this talk)
Dynamic Scheduling with
Partial Operand Values
[Pipeline diagram: Fetch – Decode – Sched & Dispatch – Nrw Exe – RF – Exe – Writeback – Commit, with wakeup/select and recovery around the narrow execute stage; the wide path supplies the rest of the data]
Narrow core
Computes partial operand
Determines load latency
Avoids misscheduling
Wide core
Computes the rest of the operand (if needed)
Scheduler w/ Narrow Data-Path
[Diagram (a): scheduler entries (ROB ID, Dest, Tag1/Data1, Tag2/Data2) with tag comparators and select logic; adder, Int ALU, LSQ, and cache feed the wide data path]
Non-data capture scheduler
(1) Select – mux – tag bcast & compare – ready wr
Naïve narrow data capture scheduler
(1) Select – mux – tag bcast & compare – ready wr
(2) Select – mux – narrow ALU – data bcast – data wr
Increased cycle time
Scheduler w/ Embedded ALUs
[Diagram (b): scheduler entries (ROB ID, Dest, M/R bits, Tag1/Data1, Tag2/Data2) with per-entry embedded narrow Int ALUs and an output latch; LSQ and cache feed the wide data path]
With embedded ALUs
(1) Select – mux – tag bcast & compare – ready wr
(2) Max(select, data bcast – mux – narrow ALU) – mux – latch setup
Lazy Logic
Replicated ALUs
Low utilization
Off critical delay path
Cycle Time, Area, Energy
32 entries, implemented using Verilog
Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11um library

Scheduler                      Cycle Time (ns)   Area (mm2)   Energy (nJ)
Full-Data Capture              2.04              1.98         1.40
Narrow-Data Capture            1.71              1.49         1.46
Narrow-Data Capture w/ ALUs    1.28              1.53         1.48
Non-Data Capture               1.28              1.43         1.54
Dynamic Scheduling Summary
Benefits: [JILP 2007]
Save 25-30% of total OoO window energy
=> 12-18% total dynamic chip power
Reduce misspeculated loads by 75%-80%
Slightly improved IPC
Comparable cycle time
Enabled by:
Lazy narrow ALUs
ALUs are cheap, so compute in parallel with
scheduling select logic
Talk Outline
Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview
Conclusions
Lazy Logic
Promising new design philosophy
Some overall principles
Minimize unit utilization
Minimize unit complexity
OK to increase number of units/wires/devices
Initial Results
Circuit-switched CMP interconnects
Stall cycle redistribution
Dynamic scheduling
Who Are We?
Faculty: Mikko Lipasti
Current Ph.D. students:
Profligate execution: Gordie Bell (joining IBM in 2006)
Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
Lazy Logic
Circuit-switched coherence: Natalie Enright
Stall cycle redistribution: Eric Hill
Dynamic scheduling: Erika Gunadi
Dynamic code optimization: Lixin Su
SMT/CMP scheduling/resource allocation: Dana Vantrease
Pharmed out:
IBM: Trey Cain, Brian Mestan
AMD: Kevin Lepak
Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Research Group Overview
Faculty: Mikko Lipasti, since 1999
Current MS/PhD students
Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif
Hashmi, Eric Hill, Lixin Su, Dana Vantrease
Graduates, current employment:
AMD: Kevin Lepak
IBM: Trey Cain, Jason Cantin, Brian Mestan
Intel: Ilhyun Kim, Morris Marden, Craig Saldanha,
Madhu Seshadri
Sun Microsystems: Matt Ramsay, Razvan Cheveresan,
Pranay Koka
Current Focus Areas
Multiprocessors
Coherence protocol optimization
Interconnection network design
Fairness issues in hierarchical systems
Microprocessor design
Complexity-effective microarchitecture
Scalable dynamic scheduling hardware
Speculation reduction for power savings
Transparent clock gating
Domain-specific ISA extensions
Software
Java Virtual Machine run-time optimization
Workload development and characterization
Funding
IBM
Faculty Partnership Awards
Shared University Research equipment
Intel
Research council support
Equipment donations
National Science Foundation
CSA, ITR, NGS, CPA
CAREER Award
Schneider ECE Faculty Fellowship
UW Graduate School
Questions?
http://www.ece.wisc.edu/~pharm
Backup slides
Technology Parameters
65 nm technology generation
16 tiled processors
Approximately 4 mm x 4 mm
Signal can travel approximately 4 mm/cycle
Circuit-switched interconnect consists of 5 mm unidirectional links
Broadcast Protocol
Broadcast to all nodes
Establish circuit-switched path with owner of data
Future broadcasts will use circuit-switched path to reduce power
Predict when CS path will suffice
Use LRU information to tear down old paths when resources need to be claimed by a new path
Switch Design from paper
[Diagram: switch with Processor, N, S, E, and W ports, an input buffer, and per-port configuration memory (CM)]
Race example from paper (1 of 2)
[Diagram: P0, P1, P2, and Dir3 exchange messages 1a. CS Req, 1b. CS Notify, 3./5./7. Invalidate and Downgrade, 4. CS Resp (S), and 6. Inval Resp]
Race example (2 of 2)
[Diagram: P0, P1, P2, and Dir3 exchange messages 1a. CS Req, 1b. CS Notify, 2. Upgrade, 4a. CS Resp (S), 4b. Nack, and 6. Inval Resp]
LRU pairs for Dirty Misses
[Chart: cumulative % of dirty misses captured (0%-100%) vs. number of LRU pairs (1-235), for SPECjbb, SPECweb, TPC-H, and TPC-W]
23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
Local LRU pairs
[Chart: miss rate under local LRU (0%-70%) vs. number of circuit-switched paths per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
Concurrent Links
2 Circuit-Switched Paths per Processor
[Chart: coverage (0%-110%) vs. number of concurrent links (1-9), for SPECjbb, SPECweb, TPC-H, and TPC-W]
5 concurrent links cover 90% of necessary pairs
Captures 50%-77% of overall opportunity
Experimental Setup
PHARMsim
Activity-based power model based on Wattch added
In-order issue
4/2/2 fetch/issue/commit (based on Cell PPU)
10-stage transparent front-end pipeline (conventional latches at endpoints)
Gshare (8k entry) branch predictor; 1024-set, 4-way BTB
32KB I/D cache (1/4), 512KB L2 cache (12)
4 confidence bits / >4 high-conf threshold / predictions checked randomly 10% of the time
Benchmarks simulated for 250M instructions
Branch Predictor Activity
[Chart: normalized bpred activity (0-1.2), split into normal and scr_extra components, across bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, specjbb, tpch, twolf, vortex, and vpr]
Related Work
Removing wrong-path instructions [Manne 1998]
Flow-based throttling techniques [Baniasadi 2001, Karkhanis 2002]
Future Work
Explore performance of other fetch gating schemes with transparent pipelining
Explore dependence-driven gating on an Itanium machine model
Explore latch soft-error vulnerability (TVF) when lazy clocking is used
Explore change in AVF when fetch gating is used
Less ACE state in flight
Scheduling Replay Example
Squashing/non-selective replay – Alpha 21264
Replays all dependent and independent instructions issued under the load shadow
Analogous to squashing recovery in branch misprediction
Simple but high performance penalty
Independent instructions are unnecessarily replayed
[Pipeline diagram: a LD cache miss is detected late; the ADD, OR, AND, and BR issued in the load shadow are all invalidated and replayed once the miss resolves]
Narrow Core
Narrow Scheduler
Captures partial operands
Determines load latency (hit/miss)
Narrow Data-Path
Narrow ALU – provides partial data to consumers
Narrow LSQ and partial tag cache
Finds only possible load data source
Uses least significant 16 bits
Large enough to help predict load latency
Small enough to achieve fast cycle time
L/S Disambiguation &
Partial Tag Matching
Exploits operand significance
[Brooks et al. 1999, Canal et al. 2000]
Load/store disambiguation
10 bits finds 99% of matching stores
Partial tag match
16 bits for 97%(mcf) - 99%(bzip2) accuracy
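The mechanism is a comparison on the low-order bits only. A minimal sketch, using the widths from the slide (10 bits for store matching, 16 for tags); the addresses are made up:

```python
# Partial address matching on the low-order bits, the idea behind narrow
# load/store disambiguation and partial tag match. Addresses are illustrative.
def partial_match(addr_a, addr_b, bits):
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)

load, store = 0x7FFF1234, 0x3FFF1234
# Low 10 bits agree -> a possible match that the wide path must confirm.
print(partial_match(load, store, 10))         # True
# A partial match can be a false positive: the full addresses still differ.
print(load == store)                          # False
# A mismatch in the low bits safely rules the pair out (no false negatives).
print(partial_match(load, 0x7FFF1238, 10))    # False
```

The key property is one-sided: a low-bit mismatch guarantees different addresses, so only partial *matches* need confirmation from the wide data path.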
Outline
Motivation
Dynamic Scheduling with Narrow Values
Scheduler with Narrow Data-Path
Pipelined Data Cache
Pipeline Integration
Implementation and Experiments
Conclusions and Future Work
Dynamic Scheduling with
Partial Operands
[Diagram: core sliced into Front-End, OoO Core, and Back-End]
Stores a subset of operands in scheduler
Exploits partial operand knowledge
Load-store disambiguation
Partial tag match
Pipelined Cache w/ Early Bits
[Diagram: a narrow bank (partial bits) and a wide bank (full bits), each with row/subarray decoders, tag and data arrays, comparators, and muxes; the narrow bank feeds the narrow data path after Disp1/Disp2, the wide bank feeds the wide data path after Disp1/Disp2/Agen]
Narrow bank for partial access, wide bank for the rest
Uses partial tag match in narrow bank
Saves power in wide bank
Hide wide cache bank latency by starting early
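A toy timing comparison illustrates the early-start point, using the CACTI latencies quoted later in this deck (narrow bank 0.80 ns, wide bank 0.60 ns); the serial-vs-overlapped framing is a simplification, not the exact circuit timing:

```python
# Latency hiding by starting the wide bank early instead of serially after
# the narrow bank's partial tag match. Latencies from the modified CACTI runs.
NARROW_NS, WIDE_NS = 0.80, 0.60

serial = NARROW_NS + WIDE_NS          # wide access waits for the narrow outcome
overlapped = max(NARROW_NS, WIDE_NS)  # wide access started early, in parallel
print(round(serial, 2), overlapped)   # 1.4 0.8
```

Overlapped, the wide bank finishes inside the narrow bank's shadow, so its latency is hidden.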
Narrow LSQ
Stores partial addresses of stores
Used for partial load-store disambiguation
Accessed in parallel with narrow bank
Saves power in the wide LSQ
Cheaper direct-mapped access rather than fully associative search
Pipeline Integration
[Pipeline diagram: Fetch – Decode – Rename – Queue – Sched – Disp stages feed IntALU/Agen and a partial-load cache access on the narrow path, then Disp – Exe (Mult/Div, cache) – WB – Commit on the wide path]
Simple ALU insts link dependences in back-to-back cycles
Load insts need another cycle to schedule dependences
Complex ALU insts link dependences non-speculatively
Pipelined Data Cache & LSQ
Modeled using modified CACTI 3.0
Configuration: 16KB, 4-way, 64B blocks
                               Pipelined Data Cache   Conventional Data Cache
Access Latency – Narrow Bank   0.80 ns                N/A
Access Latency – Wide Bank     0.60 ns                1.24 ns
Total Energy (Cache + LSQ)     (0.37 + 0.08) nJ       (0.62 + 0.11) nJ
Total Area (Cache + LSQ)       (1.50 + 0.40) mm2      (1.21 + 0.40) mm2
Experiments
Simplescalar / Alpha 3.0 tool set
Machine Model
64-entry ROB
4-wide fetch/issue/commit
16-entry SQ, 16-entry LQ
32-entry scheduler
13-stage pipeline
64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
2-cycle store to load forwarding
Energy Dissipation
[Chart: normalized total energy (0-1) for narrow_refetch, narrow_squash, squash, and parallel_selective across bzip2, mcf, parser, vpr, and avg]
On average, narrow data-capture scheduling consumes 25% less energy than non-data-capture scheduling
Mispredicted Load Instructions
[Chart: number of misscheduled load instructions (millions, 0-14), broken down into miss-forward, store no-data, misaligned store, cache alias, and cache miss, for bzip2, mcf, parser, and vpr]
Reduce misspeculated loads by 75%-80%
Optimized model
Use refetch replay scheme to reduce replay complexity
Clear the scheduler entries once instructions are issued
Decreases scheduler occupancy
Instructions enter the OoO window sooner
Reduce L1 cache latency from 2 cycles to 1 cycle
Optimized Model Performance
[Chart: speedup (0.5-2) of improved narrow_refetch, narrow_refetch, narrow_squash, squash, and selective across bzip2, mcf, parser, vpr, and avg]
Small variations
Always performs as well as or better than the alternatives
Future Work
Implement a more accurate dynamic power model
Study custom design vs. synthesized model
Study opportunities for leakage power reduction
Delay Model
Baseline Store-and-Forward Mesh (cycles from Processor 0, 4x4 tile grid):
--  3  6  9
 3  6  9 12
 6  9 12 15
 9 12 15 18

Circuit-Switched Interconnect:
--  2  3  4
 2  3  4  6
 3  4  6  7
 4  6  7  9

Processor 0 can reach Processor 15 in 9 fewer cycles (9 vs. 18)
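The baseline numbers above follow a simple rule. A minimal sketch, assuming (as the table implies) a fixed 3-cycle router-plus-link cost per mesh hop and X-Y routing on the 4x4 grid; the circuit-switched side is left as the measured table values since its routing is not a fixed per-hop cost:

```python
# Baseline delay model: Manhattan-distance hop count times a 3-cycle per-hop
# cost (store-and-forward router + link), tiles addressed as (x, y).
CYCLES_PER_HOP = 3

def mesh_delay(src, dst):
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return CYCLES_PER_HOP * hops

# Corner to corner: processor 0 at (0, 0) to processor 15 at (3, 3).
baseline = mesh_delay((0, 0), (3, 3))
print(baseline)      # 18 cycles, matching the table
print(baseline - 9)  # the circuit-switched path arrives in 9, saving 9
```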
Pipeline Unrolling