FPGA-based Fast, Cycle-Accurate Full System Simulators
Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil
University of Texas at Austin

Wouldn’t it be nice to have a simulator that is
Fast
 - 10M cycles per second, fast enough to run real datasets to completion
Accurate
 - Produce cycle-accurate numbers
Complete
 - Run real operating systems, applications
Transparent
 - Can see everything in processor, no performance hit
Inexpensive
 - Need thousands
Usable
 - Quick changes, easy to see performance

Software?
Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
 - A 128 entry, fully-associative TLB at the limit requires 128 load, compare operations
 - Arbitration requires first looking across multiple bidders
 - There are lots of these structures in a complex processor!
   - Thousands to tens of thousands of events
   - Even with perfect parallelism, need a lot of CPUs
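
The TLB point can be made concrete with a small sketch. Everything here is illustrative: the class name `SoftwareTLB`, the linear scan and the page numbers are assumptions, not code from any simulator named in the talk. It models a 128-entry fully-associative TLB the way a software model must in the worst case, one load and compare per entry, where hardware performs all the compares in parallel.

```python
class SoftwareTLB:
    """Hypothetical software model of a fully-associative TLB."""

    def __init__(self, entries=128):
        # Each slot holds (virtual page number, physical page number) or None.
        self.slots = [None] * entries
        self.compares = 0  # running count of load-and-compare operations

    def insert(self, index, vpn, ppn):
        self.slots[index] = (vpn, ppn)

    def lookup(self, vpn):
        # Software checks entries one at a time; hardware checks all at once.
        for slot in self.slots:
            self.compares += 1
            if slot is not None and slot[0] == vpn:
                return slot[1]
        return None  # TLB miss


tlb = SoftwareTLB()
tlb.insert(127, vpn=0x42, ppn=0x1000)
hit = tlb.lookup(0x42)   # the match sits in the last slot: 128 compares
miss = tlb.lookup(0x99)  # a miss also scans all 128 slots
```

If such a structure were probed once per simulated cycle at the 10M cycles per second target, the worst case alone would be 128 x 10M = 1.28 billion compares per second, before any other structure is modeled.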

Hardware
Clearly, hardware is necessary
Reconfigurability (read FPGAs) is required for flexibility
But how?

Full Implementation?
Take RTL code, compile for FPGA
 - Emulate Pentium M in a single FPGA?
   - 140M transistors
   - Implementing full system in FPGA is prohibitively large
   - Shih-Lin Lu’s group has single original Pentium (586, 3.1M transistors) in largest Xilinx FPGA
Instead, what about
 - Accurately (to cycle resolution) simulating its behavior
 - Running real, unmodified applications, OS
 - With full visibility at full speed?
If execution speeds are reasonable, do I care?
Derek Chiou, UTexas, Austin

Can I Partition the Problem?
64b adder way too big to be implemented as a single monolithic entity
But, I can implement 64 1b adders very easily with very little state and complexity
Partitioning is good if possible
But, how to partition?
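
The adder analogy can be written out directly. This is a minimal sketch of a ripple-carry adder built from one-bit full adders; the function names `full_adder` and `ripple_add` are chosen here for illustration and do not come from the talk.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=64):
    """Add two unsigned integers using `width` one-bit full adders."""
    carry = 0
    result = 0
    for i in range(width):
        # Each slice sees only its own two input bits and the incoming carry.
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry
```

Each slice carries almost no state, which is the property the slide points to; the cost of this particular partitioning is that the carry has to ripple through all 64 slices.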

Classic Partitioning
On module boundary
 - Caches, memories, ALUs, processors, memory controllers
 - Partitioning doesn’t save state or complexity, but enables design to be partitioned over multiple FPGAs and software
Problems?
[Figure: pipelined processor datapath with PC, instruction cache/memory, GPR file, ALU, data cache/memory, immediate extend and bypass paths]

Functional/Timing Partition
Functional model simulates ISA
Timing model simulates micro-architecture
Asim and SimpleScalar are written like this
 - Software
 - One processor
 - Lots of interaction between functional and timing
   - Intended to avoid rollback of any component
Put timing model in FPGA???
 - Parallel component executed in hardware!

UT FAST Partitioning
On ISA/micro-architecture boundary (ISA + FPGA)
 - Instruction trace generated by ISA simulator (e.g., Bochs, Simics)
   - Fast, full system but no timing information (could be hardware!!!)
What do we need to simulate in the timing model?
[Figure: pipelined processor datapath fed by the instruction trace, with PC, instruction memory, GPR file, ALU, data memory, immediate extend and bypass paths]
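
The split on the ISA/micro-architecture boundary can be sketched in a few lines. This is a toy under stated assumptions, not the FAST implementation: the two-instruction ISA (`li`, `add`), the `TraceEntry` record and the latency table are all hypothetical. The functional model computes real values and emits a trace; the timing model consumes the trace and counts cycles without ever touching a data value.

```python
from collections import namedtuple

# Hypothetical trace record: what the functional model tells the timing model.
TraceEntry = namedtuple("TraceEntry", "pc opcode dest srcs")


def functional_model(program):
    """Toy ISA simulator: executes instructions, knows nothing about cycles."""
    regs = {}
    trace = []
    for pc, (opcode, dest, srcs, imm) in enumerate(program):
        if opcode == "li":
            regs[dest] = imm
        elif opcode == "add":
            regs[dest] = regs[srcs[0]] + regs[srcs[1]]
        trace.append(TraceEntry(pc, opcode, dest, srcs))
    return regs, trace


def timing_model(trace, latency):
    """Toy in-order timing model: counts cycles, holds no data values."""
    ready_at = {}  # register -> cycle at which its value becomes available
    cycle = 0
    for entry in trace:
        # Stall until every source register is ready.
        start = max([cycle] + [ready_at.get(r, 0) for r in entry.srcs])
        ready_at[entry.dest] = start + latency[entry.opcode]
        cycle = start + 1  # at most one instruction issued per cycle
    return max(ready_at.values(), default=cycle)


program = [
    ("li", "r1", (), 5),
    ("li", "r2", (), 7),
    ("add", "r3", ("r1", "r2"), None),
]
regs, trace = functional_model(program)
total_cycles = timing_model(trace, latency={"li": 1, "add": 2})
```

Because the timing model needs only register names and opcodes, it is the part that maps naturally onto parallel hardware, while the functional model can stay in software.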

UT FAST Complex Processors
Straight pipelines are easy, what about
 - Caches/TLBs?
   - Keep tags, pass address (virtual and physical if necessary)
   - Hits, misses determined but don’t need data
 - Superscalar (multiple issue)?
   - “Fetch and issue” multiple instructions assuming they meet boundary constraints
   - Multiple “functional units”
   - Reservation stations
   - Reorder buffer
Pipeline control along with instructions
NO DATAPATH!!!
Multi-cycle memories to create more ports
Timing Model speed almost unimportant!
[Figure: superscalar timing model fed by the instruction stream, with I-Fetch, I-Cache, I-Decode, GPR/FPR Rename, GPR/FPR Read, delay stages, ALU, Br, Ldst and FPU units, Reorder Buffer, D-Cache, BIU, Memory Controller, Memory, Disk and Network]
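
The "keep tags, don't need data" idea for caches can be illustrated with a tag-only model. This is a sketch under assumptions: a direct-mapped organization, 256 lines of 32 bytes, and the class name `TagOnlyCache` are all picked here for illustration and are not taken from the talk.

```python
class TagOnlyCache:
    """Hypothetical direct-mapped cache model that stores tags but no data."""

    def __init__(self, num_lines=256, line_bytes=32):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # one tag per line, no data array
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, install the tag and return False."""
        line_number = address // self.line_bytes
        index = line_number % self.num_lines
        tag = line_number // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False
```

The addresses come from the instruction trace, so the hit or miss outcome, which is all the timing model needs, is determined without the model ever holding the cached data.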

Example of Complication: Branch Prediction
Must process mis-speculated instructions in timing model
 - Implement BP in timing model
 - Timing model forces ISA simulator to mis-speculate
   - Rollback, restore
   - Requires support from ISA simulator
   - Branch predictor predictor in ISA simulator?
BP only works in processor if it’s fairly accurate
 - FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path
Most complexity (BP, parallelism) can be handled this way
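
The mis-speculation and rollback interaction can be sketched as follows. This is an illustrative toy, not the modified Bochs interface: the `FunctionalModel` class, its `checkpoint` and `rollback` methods, and the `run_branch` helper are hypothetical names. On a correct prediction nothing special happens; on a misprediction the timing-model side forces the functional model down the wrong path, then restores the saved state and redirects it.

```python
import copy


class FunctionalModel:
    """Hypothetical ISA simulator with checkpoint and rollback support."""

    def __init__(self):
        self.pc = 0
        self.regs = {"r1": 0}
        self._checkpoints = {}

    def step(self):
        """Execute one toy instruction: increment r1 and advance the PC."""
        self.regs["r1"] += 1
        self.pc += 1

    def checkpoint(self, tag):
        self._checkpoints[tag] = copy.deepcopy(self.regs)

    def rollback(self, tag, correct_pc):
        """Restore the state saved at `tag`, then resume at the correct target."""
        self.regs = self._checkpoints.pop(tag)
        self.pc = correct_pc


def run_branch(fm, predicted_pc, actual_pc, wrong_path_len):
    """Timing-model side: handle one branch, return wrong-path instruction count."""
    if predicted_pc == actual_pc:
        fm.pc = actual_pc   # common case: on the right path, no rollback
        return 0
    fm.checkpoint("branch")
    fm.pc = predicted_pc    # force the functional model to mis-speculate
    for _ in range(wrong_path_len):
        fm.step()           # wrong-path instructions still feed the timing model
    fm.rollback("branch", actual_pc)
    return wrong_path_len
```

The expensive checkpoint and restore work happens only on mispredictions, which is why the approach depends on the predictor being right most of the time.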

Status & Conclusions
1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
 - Well, not quite that fast right now, but we are using embedded 300MHz PowerPC 405 to simplify
 - X86, boots Linux, Windows, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo, Nikhil Patil)
 - Bochs functional model (looking at much faster models)
   - Heavily modified instruction trace and rollback
 - Have straight pipeline 486 model with TLBs and caches
 - Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
 - Statistics gathered in hardware
   - Very little if any probe effect
Tools to semi-automate micro-architectural and ISA level exploration
 - Orthogonality of models makes both simpler