ICDesignMethodology(China)

VLSI/SOC Design
Methodologies and Challenges
Dr. Chia-Jiu Wang
University of Colorado at Colorado Springs
Department of Electrical and Computer Engineering
Outline:
Microelectronics Industry Evolution
Cell-based Design
Semi-custom Design
Design Challenge Examples
Wire and crosstalk
Low power design
Verilog Examples: a processor
Evolution (revolution) of IC design
• The microelectronics industry stays healthy (continuous growth) only because of this rapid progress (performance doubles every ~2 years).
– This rate of progress MUST be maintained to keep the IC industry in good shape.
– The lifetime of a technology generation is ~5 years.
• Production is cheap in large quantities because of lithographic processing (“like printing stamps”).
• Design is complicated and very expensive.
– Design mistakes cost a lot of time and money.
If cars had improved at the same rate as integrated circuits, a car today could:
– Drive at the speed of light
– Drive for years on a single tank of gasoline
– Transport a whole city in one car
The Design Productivity Challenge
[Figure: complexity (logic transistors per chip) and productivity (transistors per staff-month) versus year, 1981 to 2009, on log scales. Complexity growth rate: 58%/yr compounded. Productivity growth rate: 21%/yr compounded. Source: Sematech, 1997]
A growing gap between design complexity and design productivity
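The widening gap can be quantified from the two compound rates quoted in the figure. The following is a worked calculation, not part of the original slides; the symbols G(t), c, and p are introduced here only for illustration.

```latex
% c = complexity growth rate, p = productivity growth rate (both from the figure)
\[ c = 0.58, \qquad p = 0.21 \]
% G(t): ratio of complexity to productivity after t years, normalized so G(0) = 1
\[ G(t) = \left(\frac{1 + c}{1 + p}\right)^{t}
        = \left(\frac{1.58}{1.21}\right)^{t} \approx 1.306^{\,t} \]
% Time for the gap to double
\[ t_{2\times} = \frac{\ln 2}{\ln 1.306} \approx 2.6 \ \text{years} \]
% Growth of the gap over one decade
\[ G(10) \approx 1.306^{10} \approx 14 \]
```

If both rates held steady, the gap would double roughly every 2.6 years and grow about 14-fold per decade. Productivity tools and design reuse are what must absorb that difference.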
How to put together millions of transistors and make them work?
• Well chosen design/implementation methodologies
• Well chosen architectures
• Extensive use of powerful CAE tools
• Strict design management
• Well chosen testing methodologies
• Design re-use
• The same design methodologies and architectures cannot be used when complexity increases by orders of magnitude
The Custom Approach
Intel 4004 Micro-Processor
1971
About 2,300 transistors
Under 1 MHz operation (740 kHz maximum clock)
Transition to Automation and Regular Structures
[Figure: Intel 4004 (’71), Intel 8080, Intel 8085, Intel 80286, Intel 80486]
Intel Pentium (IV) microprocessor
Transistor Counts
[Figure: transistors per chip (in thousands, log scale from 1 to 1,000,000) versus year, 1975 to 2010, for 8086, 80286, i386, i486, Pentium®, Pentium® Pro, Pentium® II, and Pentium® III, with a projection to 1 billion transistors. Source: Intel]
Moore’s law in Microprocessors
[Figure: transistors (millions, log scale from 0.001 to 1000) versus year, 1970 to 2010, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6. Annotation: 2X growth in 1.96 years!]
Transistors on lead microprocessors double every 2 years
Die Size Growth
[Figure: die size (mm, log scale from 1 to 100) versus year, 1970 to 2010, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6. Annotations: ~7% growth per year, ~2X growth in 10 years]
Die size grows by 14% to satisfy Moore’s Law
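The doubling times and annual rates quoted on the last two slides are two views of the same compound growth. The conversion below is a worked check, not part of the original slides; g (annual growth rate) and T (doubling time in years) are symbols introduced here.

```latex
% Compound growth: a quantity growing at rate g per year doubles in T years
\[ (1 + g)^{T} = 2
   \quad\Longleftrightarrow\quad
   T = \frac{\ln 2}{\ln(1 + g)}, \qquad g = 2^{1/T} - 1 \]
% Transistor count: doubling in 1.96 years
\[ T = 1.96 \;\Rightarrow\; g = 2^{1/1.96} - 1 \approx 0.42 \quad (\approx 42\%\ \text{per year}) \]
% Die size: about 7% per year
\[ g = 0.07 \;\Rightarrow\; T = \frac{\ln 2}{\ln 1.07} \approx 10.2\ \text{years},
   \qquad 1.07^{10} \approx 1.97 \]
% 7% per year compounded over two years
\[ 1.07^{2} \approx 1.145 \]
```

The last line suggests one reading of the 14% figure: growth per two-year generation. That is an interpretation, since the slide does not state the period.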
Frequency
[Figure: frequency (MHz, log scale from 0.1 to 10,000) versus year, 1970 to 2010, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6. Annotation: doubles every 2 years]
Lead microprocessors frequency doubles every 2 years
Power Dissipation
[Figure: power (watts, log scale from 0.1 to 100) versus year, 1971 to 2000, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6]
Lead microprocessors power continues to increase
Courtesy, Intel
Power will be a major problem
[Figure: power (watts, log scale from 0.1 to 100,000) versus year, 1971 to 2008, for 4004 through Pentium® proc, with projected points at 500 W, 1.5 kW, 5 kW, and 18 kW]
Power delivery and dissipation will be prohibitive
Courtesy, Intel
Power density
[Figure: power density (W/cm², log scale from 1 to 10,000) versus year, 1970 to 2010, for 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle]
Power density too high to keep junctions at low temp
Courtesy, Intel
Challenges in IC Design
(Column labels on the slide: ∝ DSM and ∝ 1/DSM)
“Macroscopic Issues” (…and there’s a lot of them!)
• Time-to-Market
• Millions of Gates
• High-Level Abstractions
• Reuse & IP: Portability
• Predictability
• etc.
“Microscopic Problems” (everything looks a little different)
• Ultra-high speed design
• Interconnect
• Noise, Crosstalk
• Reliability, Manufacturability
• Power Dissipation
• Clock distribution
Design Methodology
• Design process traverses iteratively between three abstractions: behavior, structure, and geometry
• More and more automation for each of these steps
Abstraction levels and synthesis
[Figure: abstraction levels and the synthesis steps that connect them]
– Behavioral level (architectural level), e.g. For I=0 to I=15: Sum = Sum + array[I]
– Structural level (register level): memory, control, adder (+), Clk
– Logic level: state bits and gates
– Circuit level (library)
– Layout level
Synthesis steps: architecture synthesis, logic synthesis, circuit synthesis, layout synthesis
Silicon compilation (not a big success)
Implementation Choices
Digital Circuit Implementation Approaches
– Custom
– Semicustom
  – Cell-based: Standard Cells, Compiled Cells, Macro Cells
  – Array-based: Pre-diffused (Gate Arrays), Pre-wired (FPGA's)
Impact of Implementation Choices
Energy efficiency (in MOPS/mW) versus flexibility (or application scope):
Hardwired custom                        100-1000
Configurable/Parameterizable            10-100
Domain-specific processor (e.g. DSP)    1-10
Embedded microprocessor                 0.1-1
Flexibility runs from none (hardwired custom) through somewhat flexible to fully flexible (embedded microprocessor).
Full custom
• Hand drawn geometry
• All layers customized
• Digital and analog
• Simulation at transistor level (analog)
• High density
• High performance
• Long design time
[Figure: full-custom cell layout with terminals Vdd, IN, Out, Gnd]
Cell-based Design (or standard cells)
Routing channel requirements are reduced by the presence of more interconnect layers
Standard cells
• Standard cells organized in rows (and, or, flip-flops, etc.)
• Cells made as full custom by vendor (not user)
• All layers customized
• Digital with possibility of special analog cells
• Simulation at gate level (digital)
• Medium-high density
• Medium-high performance
• Reasonable design time
[Figure: chip floorplan showing routing, cell, and IO cell regions]
Standard Cell - Example
3-input NAND cell (from ST Microelectronics):
C = Load capacitance
T = input rise/fall time
MacroCells
Macrocells, also called megacells, contain more complex structures such as multipliers, data paths, memories, embedded microprocessors, and DSPs.
Hard Macros:
256×32 (or 8192 bit) SRAM hard macro (predetermined physical design; layout, wiring, and timing are fixed)
“Soft” MacroModules
A soft macro represents a module with a given functionality, but without a specific physical implementation. Placement, wiring, and timing are undetermined.
[Figure: Implementation I and Implementation II of the same soft macro]
“IP: Intellectual Property”
A Protocol Processor for Wireless
Macro cell
• Predefined macro blocks (processors, RAM, etc.)
• Macro blocks made as full custom by vendor (Intellectual Property blocks = IP blocks)
• All layers customized
• Digital and some analog (ADC)
• Simulation at behavioral or gate level (digital)
• High density
• High performance
• Short design time
• Use standard on-chip busses
• “System on a chip” (SOC)
[Figure: SOC containing DSP processor, LCD cont., ADC, RAM, ROM]
Comparison
               FPGA         Standard cell   Full custom    Macro cell
Density        Low          Medium          High           High
Flexibility    Low (high)   Medium          High           Medium
Analog         No           No              Yes            Yes
Performance    Low          High            Very high      Very high
Design time    Low          Medium          High           Medium
Design costs   Low          Medium          High           High
Tools          Simple       Complex         Very complex   Complex
Volume         Low          High            High           High
Semicustom Design Flow
[Figure: flow from Design Capture (HDL) through Logic Synthesis, Floorplanning, Placement, and Routing to Tape-out, moving from behavioral to structural to physical descriptions. Pre-Layout Simulation, Circuit Extraction, and Post-Layout Simulation feed the Design Iteration loop.]
Tape-out: a binary file is generated containing all the information needed to generate masks for the silicon/ASIC foundry.
Integrating Synthesis with Physical Design
[Figure: RTL in Verilog, (timing) constraints, and macromodules with fixed netlists feed Physical Synthesis, which produces a netlist with place-and-route info; Place-and-Route Optimization then produces the artwork]
Design Challenge Example: Reduce Interconnect Delay and Noise
• Continued technology scaling makes interconnect delay the dominant component of logic path delay.
– Scaling rules decrease the width of metal lines, thereby increasing their resistance.
– Chips become larger, which increases the amount of long interconnect.
[Figure: total interconnect length vs. time, 1995 to 2015]
[Figure: impact on delay, gate delay vs. interconnect delay, 1990 to 2000]
Source: International Technology Roadmap for Semiconductors, Interconnect, 2000.
Interconnect-Driven Timing Optimization Techniques
• Wire sizing
• Gate sizing
• Buffer insertion
– Break a long wire into segments
– Make the wire delay almost linear in terms of length, plus the buffer delays
[Figure: source-to-sink wire with delay 100 unbuffered, versus 35+35+20=90 with a buffer inserted]
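The claim that buffering makes wire delay almost linear in length can be seen from a simple distributed-RC model. This derivation is a sketch under stated assumptions, not taken from the slides: r and c are the wire resistance and capacitance per unit length, L is the wire length, k is the number of equal segments, and t_buf is a fixed delay per buffer. Buffer input capacitance and output resistance are ignored.

```latex
% Unbuffered distributed RC wire: 50% delay grows with the square of the length
\[ t_{\text{wire}}(L) \approx 0.38\, r c L^{2} \]
% k equal segments separated by k-1 buffers, each adding t_buf
\[ t(k) \approx k \cdot 0.38\, r c \left(\frac{L}{k}\right)^{2} + (k - 1)\, t_{\text{buf}}
        = \frac{0.38\, r c L^{2}}{k} + (k - 1)\, t_{\text{buf}} \]
% Setting dt/dk = 0 gives the best segment count
\[ k_{\text{opt}} = L \sqrt{\frac{0.38\, r c}{t_{\text{buf}}}} \]
% Resulting delay is linear in L
\[ t(k_{\text{opt}}) = 2 L \sqrt{0.38\, r c\, t_{\text{buf}}} - t_{\text{buf}} \]
```

With the optimal number of segments the quadratic term disappears and the total delay scales with L rather than L². A real buffer-insertion tool must also account for buffer sizing, driver resistance, and sink loads.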
The Magic of Buffer Insertion
[Figure: two panels, each showing an aggressor net coupling onto a victim net, with the input signal, the noise margin, and the resulting noise on the victim net]
Experimental Results
• 500 nets from a PowerPC μP were examined.
• BuffOpt for trading off delay, noise and number of buffers.
• DOpt for optimizing only delay.
• 3dnoise for analyzing noise.
• Ran in estimation mode.
[Figure: distribution of number of sinks per net]
BuffOpt Successfully Avoids Noise
[Figure annotations: identified by 3dnoise; 423-386=37 nets; slightly conservative]
Optimizing Delay Alone is Insufficient
TBI: Total buffers inserted.
#NVs: Number of noise violations.
※ In this testbench, BuffOpt never inserted more than 4 buffers on any net.
Design Challenge: Summary
• Due to shrinking dimensions, coupling noise is becoming a greater concern in VLSI.
• Through optimization by buffer insertion, coupling noise may be suppressed and circuit delay may be reduced.
• The algorithms have been implemented and may be used for minimizing delay, noise, or the number of buffers.
• The algorithms employ simple noise analysis and are not very compute-intensive.
Low power design
• Low power design is becoming increasingly important:
– Gate count is increasing, which increases power.
– Clock frequency is increasing, which increases power.
– High-power devices cause packaging problems.
– Portable equipment runs on batteries.
• Where does the power go?
1: Charging and discharging of capacitance: switching nodes
2: Short-circuit current: both N and P MOS transistors conduct during a transition
3: Leakage currents: MOS transistors (switches) do not turn off completely
• The power density of modern ICs is at the same level as the hot plate on your stove and is approaching the power density seen in a nuclear reactor!
[Figure: CMOS gate between Vdd and Gnd driving load capacitance C; annotation: K*Vdd²]
\[ P = N_{switch} \cdot f \cdot C \cdot V_{dd}^{2} + N_{switch} \cdot f \cdot E_{short} + N \cdot I_{leak} \cdot V_{dd} \]
Decrease power
• Lower Vdd:
– 5 V → 2.5 V gives a factor of 4!
– New technologies use lower Vdd because of the risk of gate-oxide breakdown and hot-electron effects.
• Lower Vdd and duplicate hardware:
– One functional unit: frequency = 1, Vdd = 1, P = 1 * 1² = 1
– Two functional units: frequency = 1/2, Vdd = 1/2 (optimistic), P = 2 * 1/2 * (1/2)² = 1/4
• Lower the number of switching nodes:
– The clock signal often consumes 50% of total power.
– Clock gating: a shared Clock is gated by a separate enable (Ena) for Unit 1, Unit 2, and Unit 3.
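The duplicate-hardware result follows directly from the switching term of the power equation. The derivation below is a sketch, not part of the original slides; n (number of parallel units) and the normalized quantities are introduced here, and the overhead of the extra multiplexing and routing is ignored.

```latex
% Switching power of one unit at frequency f and supply V (normalized: f = V = C = 1)
\[ P_{1} \propto f\, C\, V^{2} \]
% n parallel units, each clocked at f/n for the same throughput, at supply V_n
\[ P_{n} \propto n \cdot \frac{f}{n} \cdot C \cdot V_{n}^{2} = f\, C\, V_{n}^{2}
   \quad\Rightarrow\quad \frac{P_{n}}{P_{1}} = \left(\frac{V_{n}}{V}\right)^{2} \]
% The slide's optimistic case: two units, supply halved
\[ n = 2,\ V_{2} = \tfrac{1}{2} V \;\Rightarrow\; \frac{P_{2}}{P_{1}} = \tfrac{1}{4} \]
% A less optimistic, hypothetical case: supply can only drop to 0.7 V
\[ n = 2,\ V_{2} = 0.7\, V \;\Rightarrow\; \frac{P_{2}}{P_{1}} = 0.49 \]
% Single unit, 5 V to 2.5 V
\[ \left(\frac{2.5}{5}\right)^{2} = \tfrac{1}{4} \]
```

The saving comes entirely from the lower supply voltage. Duplication only buys the timing slack that makes the lower voltage usable, at the cost of roughly twice the area.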
Clock Gating to reduce Power
• Most popular method for power reduction of clock signals and functional units
• Gate off clock to idle functional units
– e.g., floating point units
– need logic to generate disable signal
  • increases complexity of control logic
  • consumes power
  • timing critical to avoid clock glitches at OR gate output
– additional gate delay on clock signal
  • gating OR gate can replace a buffer in the clock distribution tree
[Figure: clock and disable drive an OR gate whose output clocks the register (Reg) in front of the functional unit]
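The OR-gate scheme above can be sketched in Verilog as follows. This is a minimal illustration, not a vendor clock-gating cell: the module names clock_gate_or and gated_reg32 and the signal name gate_off are hypothetical (gate_off stands in for the slide's disable signal, since disable is a reserved word in Verilog). The latch is one common way to meet the glitch-free timing requirement and is an assumption of this sketch.

```verilog
// Hypothetical OR-based clock gate for rising-edge registers.
module clock_gate_or (
    input  wire clk,       // free-running clock
    input  wire gate_off,  // 1 = stop the clock to the idle unit
    output wire gclk       // gated clock, held high while gated off
);
    reg gate_off_q;

    // Level-sensitive latch, transparent only while clk is high.
    // During that phase the OR output is already 1, so a change on
    // gate_off cannot produce a glitch on gclk.
    always @(clk or gate_off)
        if (clk)
            gate_off_q <= gate_off;

    // While gate_off_q is 1 the output stays high: no rising edges.
    assign gclk = clk | gate_off_q;
endmodule

// Hypothetical 32-bit register clocked through the gate.
module gated_reg32 (
    input  wire        clk,
    input  wire        gate_off,
    input  wire [31:0] d,
    output reg  [31:0] q
);
    wire gclk;

    clock_gate_or cg (.clk(clk), .gate_off(gate_off), .gclk(gclk));

    // Captures d only when the gated clock produces a rising edge.
    always @(posedge gclk)
        q <= d;
endmodule
```

In a real flow the gate would normally be a library cell placed by the clock-tree tools, where it can take the place of a clock buffer as the slide notes.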
Clock Gating in a Pipelined Datapath
• For idle units (e.g., floating point units in Exec stage, WB stage for instructions with no write back operation)
[Figure: pipelined datapath with stages Fetch, Decode, Execute, Memory, and WriteBack; labels PC, I$, Instruction, MAR, D$, MDR; clk gated by the No FP and No WB signals]
Dynamic Power as a Function of VDD
• Decreasing the VDD decreases dynamic energy consumption (quadratically)
• But, increases gate delay (decreases performance)
[Figure: curve plotted against VDD (V) from 0.8 to 2.4, with a vertical axis from 1 to 5.5]
• Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other logic to reduce dynamic energy consumption.
Dynamic Frequency and Voltage Scaling
• Intel’s SpeedStep
– Hardware that steps down the clock frequency (dynamic frequency scaling, DFS) when the user unplugs from AC power
  • PLL from 650MHz → 500MHz
– CPU stalls during SpeedStep adjustment
• Transmeta LongRun
– Hardware that applies both DFS and DVS (dynamic supply voltage scaling)
  • 32 levels of VDD from 1.1V to 1.6V
  • PLL from 200MHz → 700MHz in increments of 33MHz
– Triggered when CPU load change is detected by software
  • heavier load → ramp up VDD, when stable speed up clock
  • lighter load → slow down clock, when PLL locks onto new rate, ramp down VDD
– CPU stalls only during PLL relock (< 20 microsec)
Speculated Power of a 15 mm µP
[Figure: four panels of power (watts), split into active and leakage components, versus temperature (30 to 110 °C), each with the leakage share labeled at every temperature point]
– 0.25 µm, 15 mm die, 2 V: leakage share 0% to 3%
– 0.18 µm, 15 mm die, 1.4 V: leakage share 0% to 9%
– 0.13 µm, 15 mm die, 1 V: leakage share 1% to 26%
– 0.1 µm, 15 mm die, 0.7 V: leakage share 6% to 56%
Leakage as a Function of VT
• Reducing the VT increases the sub-threshold leakage current (exponentially)
• But, reducing VT decreases gate delay (increases performance)
[Figure: ID (A) versus VGS (V) from 0 to 1, for VT=0.4V and VT=0.1V]
• Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.
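The exponential dependence can be made concrete with the standard subthreshold model. This is a worked illustration, not part of the original slides: I_0 is a technology-dependent prefactor and S is the subthreshold swing, and the value S = 100 mV/decade is an assumed, typical figure rather than one read from the plot.

```latex
% Off-state (V_GS = 0) subthreshold current
\[ I_{\text{off}} \approx I_{0} \cdot 10^{-V_{T}/S}, \qquad
   S = n\, \frac{kT}{q} \ln 10 \ \gtrsim\ 60\ \text{mV/decade at room temperature} \]
% Ratio of leakage for the two thresholds shown in the figure
\[ \frac{I_{\text{off}}(V_{T} = 0.1\ \text{V})}{I_{\text{off}}(V_{T} = 0.4\ \text{V})}
   = 10^{(0.4 - 0.1)/S} \]
% With the assumed S = 100 mV/decade
\[ 10^{0.3/0.1} = 10^{3} \]
```

Under that assumption, lowering VT from 0.4 V to 0.1 V raises leakage by about three orders of magnitude, which is why low-VT devices are reserved for critical paths.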
Dynamic Thermal Management (DTM)
Trigger Mechanism:
When do we enable DTM techniques?
Initiation Mechanism:
How do we enable technique?
Response Mechanism:
What technique do we enable?
DTM Trigger Mechanisms
• Mechanism: How to deduce temperature?
• Direct approach: on-chip temperature sensors
– Based on differential voltage change across 2 diodes of different sizes
– May require >1 sensor
– Hysteresis and delay are problems
• Policy: When to begin responding?
– Trigger level set too high means higher packaging costs
– Trigger level set too low means frequent triggering and loss in performance
• Choose trigger level to exploit difference between average and worst case power
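One simple way to turn a trigger level into a control signal is a comparator with separate engage and release levels, so that a reading hovering near the trigger does not toggle the response every cycle. The sketch below is an assumption for illustration, not a mechanism described in the slides: the module name dtm_trigger, the 8-bit digitized temp input, and both threshold values are hypothetical.

```verilog
// Hypothetical DTM trigger with two thresholds.
module dtm_trigger #(
    parameter [7:0] TRIP_ON  = 8'd85,  // assumed: engage at or above this reading
    parameter [7:0] TRIP_OFF = 8'd80   // assumed: release at or below this reading
) (
    input  wire       clk,
    input  wire       reset,       // synchronous, active high
    input  wire [7:0] temp,        // digitized on-chip sensor reading (assumed)
    output reg        dtm_active   // 1 = DTM response engaged
);
    always @(posedge clk) begin
        if (reset)
            dtm_active <= 1'b0;
        else if (!dtm_active && temp >= TRIP_ON)
            dtm_active <= 1'b1;    // crossed the trigger level
        else if (dtm_active && temp <= TRIP_OFF)
            dtm_active <= 1'b0;    // cooled below the release level
        // otherwise hold the current state
    end
endmodule
```

Choosing TRIP_ON reflects the policy trade-off on the slide: a higher value demands more cooling capacity from the package, while a lower value engages the response more often and costs performance.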
DTM Initiation and Response Mechanisms
• Operating system or microarchitectural control?
– Hardware support can reduce performance penalty by 20-30%
• Initiation of policy incurs some delay
– When using DVS and/or DFS, much of the performance penalty can be attributed to enabling/disabling overhead
– Increasing policy delay reduces overhead; smarter initiation techniques would help as well
• Thermal window (100Kcycles+)
– Larger thermal windows “smooth” short thermal spikes
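DTM Savings Benefits
[Figure: temperature versus time, showing the DTM trigger level, the cooling capacity the system is designed for without DTM, the lower cooling capacity it is designed for with DTM, and the system cost savings between the two. Regions along the time axis: DTM disabled, DTM/response engaged]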
Verilog Modeling a Processor
module processor (start, reset, clk);
  parameter FALSE = 0;

  input start, reset, clk;

  // Control signals
  wire start, reset, clk, PC_write_enable, mem_read, mem_write,
       IR_write, mem_to_reg, reg_write, reg_dst, zero, PC_write,
       PC_write_cond, ALU_src_A, IorD;
  // 32-bit datapath nets
  wire [31:0] net1, net2, net3, net4, net5, net6, net7, net8, net9,
              net10, net11, net12, net13, net14, net15, net16, net17;
  wire [4:0]  net101;   // write register number
  wire [27:0] net201;   // jump target after shift
  wire [1:0]  ALU_op, ALU_src_B, PC_source;
  wire [2:0]  ALU_ctrl;
  reg  [31:0] four, zero_reg;   // constants, set in the initial block
  reg         TRUE;             // constant 1, set in the initial block

  // Datapath
  register_32      PC (net1, net17, PC_write_enable, clk);
  mux_2_32         memory_addr_mux (net2, net1, net6, IorD);
  memory_32_4096   mem (net3, net13, net2, mem_read, mem_write);
  register_32      mem_data_reg (net5, net3, TRUE, clk);
  register_32      instr_reg (net4, net3, IR_write, clk);
  mux_2_5          write_reg_mux (net101, net4[20:16], net4[15:11],
                                  reg_dst);
  mux_2_32         write_data_mux (net7, net6, net5, mem_to_reg);
  reg_file_32_32   reg_file (net8, net9, net4[25:21], net4[20:16],
                             net101, net7, reg_write, clk);
  extender_16_32   ext_16 (net10, net4[15:0]);
  shift_left_32    sl_32 (net10, net11);
  register_32      reg_A (net12, net8, TRUE, clk);
  register_32      reg_B (net13, net9, TRUE, clk);
  mux_2_32         ALU_A_mux (net14, net1, net12, ALU_src_A);
  mux_4_32         ALU_B_mux (net15, net13, four, net10, net11,
                              ALU_src_B);
  sh_left_26_to_28 sl_26_to_28 (net4[25:0], net201);
  ALU              ALU_1 (net16, net14, net15, zero, ALU_ctrl);
  register_32      ALU_out_reg (net6, net16, TRUE, clk);
  mux_4_32         mux_PC_src (net17, net16, net6, {net1[31:28], net201},
                               zero_reg, PC_source);

  // Control
  ALU_control      ALU_cont (ALU_ctrl, ALU_op, net4[5:0], TRUE);
  PC_write_ctrl    PC_ctrl (PC_write_enable, PC_write,
                            PC_write_cond, zero);
  control_unit     ctrl (start, reset, clk, mem_read, mem_write,
                         IR_write, mem_to_reg,
                         reg_write, reg_dst, IorD, PC_write,
                         PC_write_cond, ALU_src_A,
                         ALU_src_B, ALU_op, PC_source,
                         net4[31:26]);

  // Constants used by the datapath (simulation model)
  initial
    begin
      four = 4;
      zero_reg = 0;
      TRUE = 1;
    end
endmodule
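The processor module instantiates leaf modules whose definitions are not shown in the transcript. The sketches below suggest what three of them might look like. They are assumptions: the port order is inferred from the instantiations above, the selection polarity of the multiplexer (sel = 0 picks the first data input) and the rising clock edge are guesses, and none of this is the author's own code.

```verilog
// Hypothetical leaf modules for the `processor` model.
// Port order is inferred from the instantiations, not taken from the slides.

// 32-bit register with write enable: (q, d, enable, clk)
module register_32 (q, d, enable, clk);
  output [31:0] q;
  input  [31:0] d;
  input         enable, clk;
  reg    [31:0] q;

  // Load d on the rising clock edge only when enable is 1;
  // otherwise q keeps its previous value.
  always @(posedge clk)
    if (enable)
      q <= d;
endmodule

// 2-to-1 multiplexer, 32 bits wide: (out, in0, in1, sel)
module mux_2_32 (out, in0, in1, sel);
  output [31:0] out;
  input  [31:0] in0, in1;
  input         sel;

  // Assumed polarity: sel = 0 selects in0, sel = 1 selects in1.
  assign out = sel ? in1 : in0;
endmodule

// PC write enable: (PC_write_enable, PC_write, PC_write_cond, zero)
module PC_write_ctrl (PC_write_enable, PC_write, PC_write_cond, zero);
  output PC_write_enable;
  input  PC_write, PC_write_cond, zero;

  // Write the PC unconditionally, or conditionally when the ALU
  // zero flag is set (as for a taken branch-on-equal).
  assign PC_write_enable = PC_write | (PC_write_cond & zero);
endmodule
```

Note that register_32 as sketched has no reset, so the PC would start undefined in simulation unless the control unit forces a known value on the first cycle.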