Overview of Simulation Engine Technology

Download Report

Transcript Overview of Simulation Engine Technology

Logic Simulation :
Languages, Algorithms, Simulators
Wolfgang Roesner
Verification Tools Development
IBM Corp
Austin, TX
Outline

Hardware Design Languages

Modeling Levels - A Taxonomy of HDL Constructs

General Purpose HDL Simulators - Event-Driven Sim

Improving Simulator Performance

Synchronous Design Methodology and Cycle-Based Sim

Let's Design a Cycle-Based Simulator
Simulation of Hardware Design Languages


Logic or "functional" simulation today is done mostly with HDLs (Hardware
Design Languages)
Most popular languages today (both are IEEE standards)



Verilog:






Verilog
VHDL
logic modeling and simulation language
started in EDA industry (start-up) in the 80's
was acquired by Cadence
donated to IEEE as a general industry standard
approx. 60% market share in U.S. EDA market
VHDL:



committee-designed language contracted by U.S. (DoD) (ADA-derived)
functional/logic modeling and simulation language
approx. 40% market share in U.S. EDA market
Modeling Levels - Highest Level : Interface

Let's look at the common constructs we use to specify the
functionality of a piece of hardware:
Model
inputs
outputs
input behavior over time
output behavior over time
t
Modeling Levels - Major Dimensions (I)

Temporal Dimension:






continous (analog)
gate delay (psec?)
clock cycle
instruction cycle
events
discrete time
Data Abstraction:





continuous (analog)
bit : multiple values
bit : binary
abstract value
composite value ("struct")
discrete value
Modeling Levels - Major Dimensions (II)

Functional Dimension:






Structural Dimension:




continuous functions (e.g. differential equations)
Switch-level (transistors as switches)
Boolean Logic
Algorithmic (eg. sort procedure)
Abstract mathematical formula (e.g. matrix multiplication)
Single black box
Functional blocks
Detailed hierarchy with primitive library elements
A good VHDL-centric taxonomy can be found at: http://rassp.scra.org
Modeling Levels - Major Dimensions (III)
Temporal
Continuous
Gate Delay
Clock Cycle
Instruction Cycle
Events
Multivalue Bit
Bit
abstract value
"struct"
Switch Level
Boolean Logic
Algorithmic
Abstract
Mathematical
Functional Blocks
Detailed
Component
Hierarchy
Data
Continuous
Functional
Continous
Structural
Single Black Box
Coverage of Modeling Levels - Verilog
Temporal
Continuous
Gate Delay
Clock Cycle
Instruction Cycle
Events
Multivalue Bit
Bit
abstract value
"struct"
Switch Level
Boolean Logic
Algorithmic
Abstract
Mathematical
Functional Blocks
Detailed
Component
Hierarchy
Data
Continuous
Functional
Continous
Structural
Single Black Box
Coverage of Modeling Levels - VHDL
Temporal
Continuous
Gate Delay
Clock Cycle
Instruction Cycle
Events
Multivalue Bit
Bit
abstract value
"struct"
Switch Level
Boolean Logic
Algorithmic
Abstract
Mathematical
Functional Blocks
Detailed
Component
Hierarchy
Data
Continuous
*
Functional
Continous
Structural
Single Black Box
* extremely inefficient compared to Verilog
In HDL Terms


VHDL, Verilog were both defined as simulation languages
Big emphasis on the structural refinement



VHDL: entity/architecture/component/port/signal
Verilog: module/instance/port/signal/reg
Specification of function

General: programming language constructs



Parallelism:



VHDL : user-defined data type, package, procedure, function, sequential code
Verilog: function, task, sequential code
VHDL : process, signal update, wait
Verilog: "always" block, fork/join, wait, event construct
Special purpose H/W constructs


VHDL : concurrent assignment, delayed assignment, signal, Boolean logic
Verilog : continuous assignment, 4-value Boolean logic, switch-level support
General HDL Simulators

Before we can look at architectures of simulators we need
to understand the execution model that the HDLs imply:
Event-Driven Execution


VHDL's execution model is defined in detail in the IEEE
LRM (Language Reference Manual)
Verilog's execution model is defined by Cadence's
Verilog-XL simulator ("reference implementation")
An event-driven VHDL example
process (trigger)
begin
if (count<=15) then
count <= count + 1 after 1ns;
else
count <= 0 after 1ns;
end if;
process (count)
begin
my_count <= count;
trigger <= not trigger;
end process;
end process;
Each process:
- loops forever
- waits for change in signal
from other process
Block 1
Block 2
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
xor
a
b
s(0)
sum_out(1)
s(1)
=>
xor
=>
xor
sum_out(0)
c(0)
and
and
carry_out
or
and
or
and
=>
A more hardware-oriented example
Let's simulate:
a=11
b=01
Time = 0ns
(step1)
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
Red Boxes : evaluate in
current step
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
1
a
b
xor
s(0)
sum_out(1)
s(1)
=>
xor
=>
xor
1
sum_out(0)
c(0)
and
and
carry_out
or
1
and
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 0ns
(step2)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
1
a
b
xor
s(0)
1
sum_out(1)
s(1)
=>
xor
=>
xor
1
sum_out(0)
c(0)
and
and
carry_out
or
1
and
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 0ns
(step3)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
1
a
b
xor
s(0)
1
s(1)
sum_out(1)
1
=>
xor
=>
xor
1
sum_out(0)
c(0)
and
and
carry_out
or
1
and
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 0ns
(step4)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
1
a
b
xor
s(0)
1
s(1)
sum_out(1)
1
1
=>
xor
=>
xor
1
sum_out(0)
c(0)
and
and
carry_out
or
1
and
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 1ns
(step1)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
xor
1
a
1
s(0)
s(1)
sum_out(1)
1
1
=>
xor
=>
xor
1
c(0)
and
b
1
1
1
sum_out(0)
and
carry_out
or
1
and
1
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 1ns
(step2)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
xor
1
a
1
s(0)
s(1)
sum_out(1)
1
0
=>
xor
=>
xor
1
c(0)
and
b
1
1
1
and
sum_out(0)
1
carry_out
or
1
and
1
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 1ns
(step3)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
xor
1
a
1
s(0)
s(1)
sum_out(1)
0
0
=>
xor
=>
xor
1
c(0)
and
b
1
1
1
and
sum_out(0)
1
or
1
and
1
1
carry_out
or
and
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 1ns
(step4)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
xor
1
a
1
s(0)
s(1)
0
=>
xor
=>
xor
1
c(0)
and
b
1
sum_out(1)
0
1
1
and
sum_out(0)
1
or
1
and
1
1
1
or
and
carry_out
=>
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
Let's simulate:
a=11
b=01
Time = 1ns
(step5)
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
0
xor
1
a
1
s(0)
s(1)
sum_out(1)
0
0
=>
xor
=>
xor
1
c(0)
and
b
1
1
1
and
sum_out(0)
1
or
1
and
1
1
1
or
and
carry_out
1
=>
A more hardware-oriented example
Let's simulate:
a=11
b=01
Time = 1ns
(step5)
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
That's enough - you get the point
sum_out(1 to 0) <= s(1 to 0);
carry_out <= c(1);
The statement-(process-) dependencies define a network.
carry_out
Changes
are dynamically propagated
the network0
0
1
0
c(1) through
xor
1
a
=>
xor
1
c(0)
and
b
1
s(0)
=>
xor
1
1
and
sum_out(0)
1
or
1
and
1
1
1
or
and
carry_out
1
=>
Event-Driven Simulation

The simulator maintains




a list of all atomic executable blocks
a data structure that represents the interconnect of the block
via signals
a value table that holds all current signal values
At start time the simulator schedules all executable
blocks of the models
Core-Architecture of an Event-Driven Simulator
Block Code
Select next
scheduled block
Schedule Signal Updates
More
Blocks?
Signal - Updates
Scheduler Data
- Signal Sink Lists
- Event Queue
More
Blocks?
Incr
Time
Done
Improving Simulation Speed


The most obvious bottle-neck for functional verification is
simulation throughput
There are two basic ways to improve throughput



Simulator performance
Running many simulations in parallel
Parallelization

Hard : parallel simulation algorithms




much parallel event-driven simulation research
has not yielded a breakthrough
hard to compete against "trivial parallelization"
Simple: run independent testcases on separate machines



Workstation "SimFarms"
100s - 1000s of engineer's workstations run simulation in the background
ideal parallelization factor
Improving Simulator Performance (I)

Full-HDL support

If full cover of VHDL/Verilog is important

Optimizing compiler techniques


treat sequential code constructs like general programming language
all optimizations for language compilers apply:
–
–
–
–
–
–



data/control-flow analysis
global optimizations
local optimizations (loop unrolling, constant propagation)
register allocation
pipeline optimizations
etc. etc.
Global optimizations are limited because of model-build turn-around
time requirements
Example: modern microprocessor is designed w/ ~1Million lines of HDL
Imagine the compile time for a C-program w/ 1M lines!
Improving Simulator Performance (II)

Full-HDL support

Better scheduling algorithms


scheduling is clearly the bottle-neck in all event-simulators
Use hybrid techniques:



some of the simplifications discussed in the following can be applied to
localized "islands" of HDL.
requires an HDL compile process that automatically analyzes the
structure of the model and uses "speed-up" modes for sub-partitions
Problem:
–
–
assume 50% of the model has such islands
even if we could speed up simulation of those parts to take 0 time, we would only
gain a speedup factor of 2x
Improving Simulator Performance (III)

Simplifications

Use higher-level HDL specification
s(0 to 2) <= ('0' & a (0 to 1)) + ('0' & b(0 to 1) );
sum_out(0 to 1) <= s(1 to 2);
carry_out <= s(0);
vs.
s(0) <= a(0) xor b(0) after 2ns;
c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;
c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
Improving Simulator Performance (IV)

Principles of using higher-level HDL specification

Common theme: cut down of number of scheduled events


create larger sections of un-interrupted sequential code
use less fine-grain granularity for model structure
–


use higher-level operators
use zero-delay wherever possible
–

methodology implications: timing verification is not done together with
functional simulation
Data abstractions

use binary over multi-value bit values
–


-> smaller number of schedulable blocks
multi-value : use only for bus contention situations to resolve several
drivers with different strengths (strong, resistive, high-impedance)
use word-level operations over bit-level operations
NEXT : Most Powerful - methodology-based subset of
HDLs
Synchronous Design Methodology


Clock the design only so fast the longest possible combinational de
path settles before cycle is over
Cycle time depends on the longest topological path


Hazards/Races do not disturb function
Longest topological path can be analytically calculated w/o using
simulation -> stronger result w/o sim patterns
clock : dependent on critical path
xor
xor
and
or
and
or
and
Logic Design Groundrules











Synchronous design
LSSD : design for testability
Critical delay path defines the clock frequency
Behavioral Function and Timing Correctness
can be verified independently
Design can be verified independently of its implementation
-> Logic Synthesis
-> Custom Design
-> Synthesizable, high-level HDL as main vehicle for functional verification
IF: Boolean Equivalence Checking proofs closure between functional an
implementation view
Functional Verification can use zero-delay functional simulation
--> Cycle-Based Simulation, FSM-based Formal Verification
Functional View of Synchronous Designs





Logic mapped to a non-cyclic network of Boolean functions
Network also constains state-holding primitives : latch/reg/array
HDL does not contain any timing information
Function can be evaluated by zero-delay signal evaluation
--> Speed of Simulation + Simplicity of Tools
Latches
Arrays
Boolean
Logic
Network
Cycle Simulation Algorithm
Latches
Logic is ordered into levels so that
order of evaluation is correct.
E.g., A and B are computed before C.
Levelized
Combinational
Logic
A
C
B
Why Is This Faster?
As an example, let's assume we compile our circuit
into a stream of pseudo instructions....
xor
a
b
s(0)
sum_out(1)
s(1)
=>
xor
=>
xor
sum_out(0)
c(0)
and
and
carry_out
or
and
or
=>
and
Load temp1, a(0)
Load temp2, b(0)
Xor temp1, temp2, temp3
Store temp3, s(0)
And temp1, temp2, temp3
Store temp3, c(0)
Load temp1, a(1)
Load temp2, b(1)
Xor temp1, temp2, temp3
Load temp4, c(0)
Xor temp3, temp4, temp5
Store temp5, s(1)
We can cover every Boolean
function into a minimal set
(~4 or better)
instructions
Methodology Flow
RTL
(VHDL, Verilog)
Physical VLSI
Design Tools /
Custom Design
Language Compile
Model Build
Cycle-Based
Simulation Model
Formal
Verification :
Boolean
Equivalence
Check
VERITY
Cycle-Based
Simulator
Network of Boolean Equations/Operators
E
D
O
G
F
A
B
A
=
X
U
M
C
Mux/Selection: If A=B then C <= D or E else C <= F and G;
C(0 to 31)
B(0 to 31)
+
A(0 to 31)
Equations: A(0 to 31) <= B(0 to 31) + C(0 to 31);
Storage Elements

Latch Inference for sequential HDL


if there is a possible set of signal evaluations that leave the
value of a signal undefined -> that signal is assumed to be a
storage element
Verification & Logic Synthesis must interpret HDL the
same way (check with Boolean Equivalence Checking)
CLK
DATA(0 to 7)
X(0 to 7)
Latches: If CLK then X(0 to 7) <= DATA(0 to 7);
Implementation Options For Cycle Sim

HDL Compilation

Evaluation Sequence

Function Evaluation
HDL Compilation Options For Cycle Sim

Preserve the HDL structures






Compile HDL like a programming language
Preserve design hierarchy, processes, modules, functions, procedures
Implementation process very similar to programming language compile
Incremental processing is trivial
Model optimization is hard (cross-functional boundary) and limited
Map HDL to library of primitive functions (e.g. IBM
"Texsim")




Crush design hierarchy to increase optimization potential
Synthesis-like process, but simpler because of missing physical
constraints
Incremental model build is very hard
Designer view of hierarchy must be preserved for model debug proces
Evaluation Sequence Options For Cycle Sim

Oblivious Evaluation




Event-Driven Evaluation




For every cycle, evaluate all Boolean logic
Minimal amount of book-keeping and runtime control data structures
Large amount of redundant evaluation for large models with low chang
activity
Evaluate changes only
Minimize redundant work for low-activity models
Book-keeping overhead (But: use synchronicity constraint)
Hybrid Techniques (example: Texsim)


Use design partitions as base granularity for event-driven evaluation
Use “key controling signals” as guards for event-driven evaluation
Function Evaluation Options For Cycle Sim

Interpreted




Map design network into an efficient data structure which a simulator
“walks” at run-time
“Model” is data, highly portable
Few successfull examples where interpretation ofn a datastructure
didn’t add significant performance loss due to indirection
Compiled (example: Texsim)




Map design network into a sequence of instructions
Generating “C-source” is not an industrial-strength option
“Model” consists of machine-instructions, platform dependent
Many programming language compiler optimizations apply, but the
problem is simpler (more constrained)
IBM's Texsim simulator


Flattening of the hierarchy
Network optimizations - using levelized network








Constant propagation
DeMorgan’s law and other Boolean optimizations
Equivalent function removal
Merging of functions with only one fanout
Final levelization
Structural analysis and logic partitioning (e.g. latch-partitioning,
below
Create model symbol table and allocate value table
Compile logic by partition and level




Generate reference history for use by register allocation
Generate machine independent code & perform peephole optimization
Perform instruction scheduling
Generate target machine object code
Orders of Magnitude in Speed


These numbers are supposed to give you intuitive feel:
Event Simulator
1
Cycle Simulator
20
Event driven cycle
Simulator
50
Acceleration
1000
The actual numbers depend on many other factors:
Emulation



100000
Model activity rate
Test bench activity & implementation language
Multiple clock domains
We could write a simulator now...

But we have not talked about many areas:



Model-Build software principles (how to make gigantic model in minutes)
Simulator user interface (how to talk to user and to programs)
How a cycle simulator deals with multi-value signals


do we need those ? Yes, mainly for bus logic and power-on-reset sim
How do we take the cycle-sim algorithm and implement in special purpose
hardware : simulation acceleration, emulation

Where is there still a need for pure event-simulation?

Good research: only few papers in the field during the last 10 years

Good place to start is Peter Maurer at Florida State. Extremely
interesting for folks who want to explore different algorithms
And...

This field is just one segment of verification tools
development which is just one segment of Electronic
Design Automation